Quantification (machine learning)

In machine learning and data mining, quantification (variously called learning to quantify, supervised prevalence estimation, or class prior estimation) is the task of using supervised learning to train models (quantifiers) that estimate the relative frequencies (also known as prevalence values) of the classes of interest in a sample of unlabelled data items. [1] [2] For instance, in a sample of 100,000 unlabelled tweets known to express opinions about a certain political candidate, a quantifier may be used to estimate the percentage of these tweets which belong to the class 'Positive' (i.e., which manifest a positive stance towards this candidate), and to do the same for the classes 'Neutral' and 'Negative'. [3]

Quantification may also be viewed as the task of training predictors that estimate a (discrete) probability distribution, i.e., that generate a predicted distribution approximating the unknown true distribution of the items across the classes of interest. Quantification is different from classification, since the goal of classification is to predict the class labels of individual data items, while the goal of quantification is to predict the class prevalence values of sets of data items. Quantification is also different from regression, since in regression the training data items have real-valued labels, while in quantification the training data items have class labels.

It has been shown in multiple research works [4] [5] [6] [7] [8] that performing quantification by classifying all unlabelled instances and then counting the instances that have been attributed to each class (the 'classify and count' method) usually leads to suboptimal quantification accuracy. This suboptimality may be seen as a direct consequence of 'Vapnik's principle', which states:

If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem. [9]

In our case, the problem to be solved directly is quantification, while the more general intermediate problem is classification. As a result of the suboptimality of the 'classify and count' method, quantification has evolved as a task in its own right, different (in goals, methods, techniques, and evaluation measures) from classification.
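The bias of 'classify and count' can be seen in a small simulation. The sketch below (written with NumPy; the true-positive rate 0.9, false-positive rate 0.2, and prevalence 0.3 are arbitrary illustrative values, not taken from the literature cited above) applies an imperfect simulated classifier to a sample and simply counts its predictions:

```python
import numpy as np

def classify_and_count(predicted_labels, classes):
    """Estimate class prevalence values by counting hard classifier predictions."""
    counts = np.array([(predicted_labels == c).sum() for c in classes])
    return counts / counts.sum()

# Simulate an imperfect binary classifier with a true positive rate of 0.9
# and a false positive rate of 0.2, applied to a sample of 100,000 items
# whose true prevalence of the positive class is 0.3.
rng = np.random.default_rng(0)
true_labels = rng.random(100_000) < 0.3          # True = positive item
noise = rng.random(100_000)
predicted = np.where(true_labels, noise < 0.9, noise < 0.2)

estimate = classify_and_count(predicted, [True, False])
# In expectation, the estimated positive prevalence is
# tpr * 0.3 + fpr * (1 - 0.3) = 0.41, well above the true value 0.3.
```

Unless the classifier is perfect, the expected estimate tpr·p + fpr·(1−p) differs from the true prevalence p, which is precisely the suboptimality discussed above.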

Quantification tasks

The main variants of quantification, according to the characteristics of the set of classes used, are:

- binary quantification, in which each data item belongs to exactly one of two classes;
- single-label multiclass quantification, in which each data item belongs to exactly one of n > 2 classes;
- ordinal quantification, in which each data item belongs to exactly one of n > 2 classes on which a total order is defined;
- regression quantification, in which each data item is labelled with a real value, and the goal is to estimate the distribution of these values.

Most known quantification methods address the binary case or the single-label multiclass case; only a few of them address the ordinal case or the regression case.

Binary-only methods include the Mixture Model (MM) method, [4] the HDy method, [11] SVM(KLD), [7] and SVM(Q). [6]

Methods that can deal with both the binary case and the single-label multiclass case include probabilistic classify and count (PCC), [5] adjusted classify and count (ACC), [4] probabilistic adjusted classify and count (PACC), [5] and the Saerens-Latinne-Decaestecker EM-based method (SLD). [12]
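As an illustration of how the adjusted methods work, the binary ACC correction inverts the expected relation between the raw 'classify and count' estimate and the true prevalence. A minimal sketch (in practice the true-positive and false-positive rates would be estimated on held-out labelled data, e.g. via cross-validation; the numbers below are hypothetical):

```python
import numpy as np

def adjusted_classify_and_count(predicted_positive_rate, tpr, fpr):
    """Binary ACC correction: invert E[p_cc] = tpr * p + fpr * (1 - p),
    where tpr and fpr are estimated from held-out labelled data."""
    if tpr == fpr:          # correction undefined; fall back to the raw rate
        return predicted_positive_rate
    p = (predicted_positive_rate - fpr) / (tpr - fpr)
    return float(np.clip(p, 0.0, 1.0))   # project back onto [0, 1]

# A classifier with tpr = 0.9 and fpr = 0.2 applied to a sample with true
# positive prevalence 0.3 yields, in expectation, a raw 'classify and
# count' rate of 0.9 * 0.3 + 0.2 * 0.7 = 0.41; ACC recovers 0.3.
print(adjusted_classify_and_count(0.41, tpr=0.9, fpr=0.2))  # ≈ 0.3
```

The clipping step is needed because sampling error in the estimates of tpr and fpr can push the corrected value outside the unit interval.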

Methods for the ordinal case include Ordinal Quantification Tree (OQT), [13] and ordinal versions of the above-mentioned ACC, PACC, and SLD methods. [14]

A number of methods that address regression quantification have also been proposed. [15]

Evaluation measures for quantification

Several evaluation measures can be used for evaluating the error of a quantification method. Since quantification consists of generating a predicted probability distribution that estimates a true probability distribution, these evaluation measures are ones that compare two probability distributions, and most of them belong to the class of divergences. Measures commonly used for binary quantification and single-label multiclass quantification include absolute error, relative absolute error, and the Kullback–Leibler divergence. [16]

Evaluation measures for ordinal quantification additionally take the total order on the classes into account; a typical choice is a variant of the earth mover's distance.
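As an illustration, divergence-style measures compare the predicted prevalence vector with the true one. Below is a minimal sketch of absolute error and the Kullback-Leibler divergence (the `eps` smoothing constant is an assumption added here to avoid taking the logarithm of zero, not a prescription from the cited work):

```python
import numpy as np

def absolute_error(p_true, p_pred):
    """Mean absolute difference between true and predicted prevalence values."""
    return float(np.mean(np.abs(np.asarray(p_true) - np.asarray(p_pred))))

def kld(p_true, p_pred, eps=1e-12):
    """Kullback-Leibler divergence KLD(p_true || p_pred) between two
    class-prevalence distributions; it is 0 iff the prediction is perfect."""
    p = np.asarray(p_true, dtype=float) + eps
    q = np.asarray(p_pred, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()     # renormalize after smoothing
    return float(np.sum(p * np.log(p / q)))

true_prev = [0.50, 0.30, 0.20]          # e.g. Positive / Neutral / Negative
pred_prev = [0.45, 0.35, 0.20]
print(absolute_error(true_prev, pred_prev))   # ≈ 0.033
print(kld(true_prev, pred_prev))
```

Both measures are zero exactly when the predicted distribution matches the true one, and grow as the two distributions diverge.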

Applications

Quantification is of special interest in fields such as the social sciences, [17] epidemiology, [18] market research, and ecological modelling, [19] since these fields are inherently concerned with aggregate data. However, quantification is also useful as a building block for solving other downstream tasks, such as measuring classifier bias, [20] performing word sense disambiguation, [21] allocating resources, [4] and improving the accuracy of classifiers. [12]

Resources

Resources related to quantification include:

- the International Workshop on Learning to Quantify (LQ) workshop series [22] [23] [24] [25]
- the LeQua data challenges on learning to quantify [26] [27]
- QuaPy, a Python-based framework for quantification [28]
- QuantificationLib, a Python library for quantification and prevalence estimation [29]


References

  1. Pablo González; Alberto Castaño; Nitesh Chawla; Juan José del Coz (2017). "A review on quantification learning". ACM Computing Surveys . 50 (5): 74:1–74:40. doi:10.1145/3117807. hdl: 10651/45313 . S2CID   38185871.
  2. Andrea Esuli; Alessandro Fabris; Alejandro Moreo; Fabrizio Sebastiani (2023). Learning to Quantify. The Information Retrieval Series. Vol. 47. Cham, CH: Springer Nature. doi:10.1007/978-3-031-20467-8. ISBN   978-3-031-20466-1. S2CID   257560090.
  3. Hopkins, Daniel J.; King, Gary (2010). "A Method of Automated Nonparametric Content Analysis for Social Science". American Journal of Political Science. 54 (1): 229–247. doi:10.1111/j.1540-5907.2009.00428.x. ISSN   0092-5853. JSTOR   20647981. S2CID   1177676.
  4. George Forman (2008). "Quantifying counts and costs via classification". Data Mining and Knowledge Discovery. 17 (2): 164–206. doi:10.1007/s10618-008-0097-y. S2CID 1435935.
  5. Antonio Bella; Cèsar Ferri; José Hernández-Orallo; María José Ramírez-Quintana (2010). "Quantification via Probability Estimators". 2010 IEEE International Conference on Data Mining. pp. 737–742. doi:10.1109/icdm.2010.75. ISBN 978-1-4244-9131-5. S2CID 9670485.
  6. José Barranquero; Jorge Díez; Juan José del Coz (2015). "Quantification-oriented learning based on reliable classifiers". Pattern Recognition. 48 (2): 591–604. Bibcode:2015PatRe..48..591B. doi:10.1016/j.patcog.2014.07.032. hdl:10651/30611.
  7. Andrea Esuli; Fabrizio Sebastiani (2015). "Optimizing text quantifiers for multivariate loss functions". ACM Transactions on Knowledge Discovery from Data. 9 (4): Article 27. arXiv:1502.05491. doi:10.1145/2700406. S2CID 16824608.
  8. Wei Gao; Fabrizio Sebastiani (2016). "From classification to quantification in tweet sentiment analysis". Social Network Analysis and Mining . 6 (19): 1–22. doi:10.1007/s13278-016-0327-z. S2CID   15631612.
  9. Vladimir Vapnik (1998). Statistical learning theory. New York, US: Wiley.
  10. Jerzak, Connor T.; King, Gary; Strezhnev, Anton (2022). "An Improved Method of Automated Nonparametric Content Analysis for Social Science". Political Analysis. 31 (1): 42–58. doi:10.1017/pan.2021.36. ISSN   1047-1987. S2CID   3796379.
  11. Víctor González-Castro; Rocío Alaiz-Rodríguez; Enrique Alegre (2013). "Class distribution estimation based on the Hellinger distance". Information Sciences. 218: 146–164. doi:10.1016/j.ins.2012.05.028.
  12. Marco Saerens; Patrice Latinne; Christine Decaestecker (2002). "Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure" (PDF). Neural Computation. 14 (1): 21–41. doi:10.1162/089976602753284446. PMID 11747533. S2CID 18254013.
  13. Giovanni Da San Martino; Wei Gao; Fabrizio Sebastiani (2016). "Ordinal Text Quantification". Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 937–940. doi:10.1145/2911451.2914749. ISBN 9781450340694. S2CID 8102324.
  14. Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2022). "Ordinal quantification through regularization". Proceedings of the 33rd European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML / PKDD 2022), Grenoble, FR.
  15. Antonio Bella; Cèsar Ferri; José Hernández-Orallo; María José Ramírez-Quintana (2014). "Aggregative quantification for regression". Data Mining and Knowledge Discovery. 28 (2): 475–518. doi:10.1007/s10618-013-0308-z. hdl: 10251/49300 .
  16. Fabrizio Sebastiani (2020). "Evaluation measures for quantification: An axiomatic approach". Information Retrieval Journal . 23 (3): 255–288. arXiv: 1809.01991 . doi:10.1007/s10791-019-09363-y. S2CID   52170301.
  17. Daniel J. Hopkins; Gary King (2010). "A method of automated nonparametric content analysis for social science". American Journal of Political Science . 54 (1): 229–247. doi:10.1111/j.1540-5907.2009.00428.x. S2CID   1177676.
  18. Gary King; Ying Lu (2008). "Verbal autopsy methods with multiple causes of death". Statistical Science . 23 (1): 78–91. arXiv: 0808.0645 . doi:10.1214/07-sts247. S2CID   4084198.
  19. Pablo González; Eva Álvarez; Jorge Díez; Ángel López-Urrutia; Juan J. del Coz (2017). "Validation methods for plankton image classification systems" (PDF). Limnology and Oceanography: Methods . 15 (3): 221–237. Bibcode:2017LimOM..15..221G. doi:10.1002/lom3.10151. S2CID   59438870.
  20. Alessandro Fabris; Andrea Esuli; Alejandro Moreo; Fabrizio Sebastiani (2023). "Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach". Journal of Artificial Intelligence Research. 76: 1117–1180. arXiv: 2109.08549 . doi:10.1613/jair.1.14033. S2CID   247315416.
  21. Yee Seng Chan; Hwee Tou Ng (2005). "Word sense disambiguation with distribution estimation". Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005). Edinburgh, UK: 1010–1015.
  22. "LQ 2021: the 1st International Workshop on Learning to Quantify".
  23. "LQ 2022: the 2nd International Workshop on Learning to Quantify".
  24. "LQ 2023: the 3rd International Workshop on Learning to Quantify".
  25. "LQ 2024: the 4th International Workshop on Learning to Quantify".
  26. "LeQua 2022: A Data Challenge on Learning to Quantify".
  27. "LeQua 2024: A Data Challenge on Learning to Quantify".
  28. "QuaPy: A Python-Based Framework for Quantification". GitHub . 23 November 2021.
  29. "QuantificationLib: A Python library for quantification and prevalence estimation". GitHub . 8 April 2024.