Quantification (machine learning)

In machine learning and data mining, quantification (variously called learning to quantify, or supervised prevalence estimation, or class prior estimation) is the task of using supervised learning in order to train models (quantifiers) that estimate the relative frequencies (also known as prevalence values) of the classes of interest in a sample of unlabelled data items [1] [2]. For instance, in a sample of 100,000 unlabelled tweets known to express opinions about a certain political candidate, a quantifier may be used to estimate the percentage of these tweets which belong to class 'Positive' (i.e., which manifest a positive stance towards this candidate), and to do the same for classes 'Neutral' and 'Negative'.

Quantification may also be viewed as the task of training predictors that estimate a (discrete) probability distribution, i.e., that generate a predicted distribution that approximates the unknown true distribution of the items across the classes of interest. Quantification is different from classification, since the goal of classification is to predict the class labels of individual data items, while the goal of quantification is to predict the class prevalence values of sets of data items. Quantification is also different from regression, since in regression the training data items have real-valued labels, while in quantification the training data items have class labels.

It has been shown in multiple research works [3] [4] [5] [6] [7] that performing quantification by classifying all unlabelled instances and then counting the instances that have been attributed to each class (the 'classify and count' method) usually leads to suboptimal quantification accuracy. This suboptimality may be seen as a direct consequence of 'Vapnik's principle', which states:

If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem [8] .

In our case, the problem to be solved directly is quantification, while the more general intermediate problem is classification. As a result of the suboptimality of the 'classify and count' method, quantification has evolved as a task in its own right, different (in goals, methods, techniques, and evaluation measures) from classification.
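
The difference between 'classify and count' and its corrected variants can be illustrated with a small simulation. The following sketch (illustrative code, not taken from the cited works) compares plain classify and count (CC) with adjusted classify and count (ACC) [3] on synthetic binary data whose test prevalence differs from the training prevalence; the dataset, classifier, and variable names are all assumptions made for the example.

```python
# Minimal sketch: 'classify and count' (CC) vs. 'adjusted classify and count' (ACC)
# on a synthetic binary problem whose test prevalence differs from the training one.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n_pos, n_neg):
    # two Gaussian blobs: class 1 centred at (+1, +1), class 0 at (-1, -1)
    X = np.vstack([rng.normal(+1.0, 1.0, (n_pos, 2)),
                   rng.normal(-1.0, 1.0, (n_neg, 2))])
    y = np.concatenate([np.ones(n_pos, int), np.zeros(n_neg, int)])
    return X, y

X_tr, y_tr = sample(5000, 5000)   # training prevalence of class 1: 0.5
X_te, y_te = sample(1000, 9000)   # test prevalence of class 1: 0.1 (prior shift)

clf = LogisticRegression().fit(X_tr, y_tr)

# CC: classify every unlabelled item and report the fraction assigned to class 1
cc = clf.predict(X_te).mean()

# ACC: correct the CC estimate using the classifier's true/false positive rates
# (here estimated on the training set; in practice via held-out data or cross-validation)
pred_tr = clf.predict(X_tr)
tpr = pred_tr[y_tr == 1].mean()
fpr = pred_tr[y_tr == 0].mean()
acc = np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0)

print(f"true prevalence: {y_te.mean():.3f}   CC: {cc:.3f}   ACC: {acc:.3f}")
```

Under this kind of prior probability shift, the CC estimate is pulled towards the training prevalence, while the ACC correction, which divides out the classifier's estimated true and false positive rates, recovers an approximately unbiased estimate.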

Quantification tasks

Quantification tasks according to the set of classes

The main variants of quantification, according to the characteristics of the set of classes used, are:

  - Binary quantification, in which each data item belongs to exactly one of two classes;
  - Single-label multiclass quantification, in which each data item belongs to exactly one of n > 2 classes;
  - Multi-label quantification, in which each data item may belong to zero, one, or several of the n classes;
  - Ordinal quantification, in which a total order is defined on the set of n > 2 classes;
  - Regression 'quantification', an analogous task in which the data items have real-valued (rather than class) labels, and the goal is to estimate their distribution.

Most known quantification methods address the binary case or the single-label multiclass case, and only a few of them address the multi-label, ordinal, and regression cases. Binary-only methods include the Mixture Model (MM) method [3], the HDy method [9], SVM(KLD) [6], and SVM(Q) [5]. Methods that can deal with both the binary case and the single-label multiclass case include probabilistic classify and count (PCC) [4], adjusted classify and count (ACC) [3], probabilistic adjusted classify and count (PACC) [4], the Saerens-Latinne-Decaestecker EM-based method (SLD) [10], and KDEy [11].
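
To give a flavour of how such methods work, the following is a compact, illustrative re-implementation (not code from the cited works) of the expectation-maximization procedure at the heart of the SLD method [10]: the classifier's posterior probabilities are iteratively rescaled by the ratio between the current prevalence estimates and the training prevalences, and the prevalence estimates are then recomputed, until they stabilize. Function and argument names are assumptions made for the example.

```python
import numpy as np

def sld(posteriors, train_prev, n_iter=1000, tol=1e-6):
    """EM-based prior re-estimation in the style of Saerens-Latinne-Decaestecker.

    posteriors: (n_items, n_classes) array of posterior probabilities P(y|x)
                produced by a probabilistic classifier trained on data whose
                class prevalences are given by train_prev (assumed nonzero).
    Returns an estimate of the class prevalences of the unlabelled sample.
    """
    prev = np.asarray(train_prev, dtype=float)   # start from the training prevalences
    for _ in range(n_iter):
        # E-step: rescale each posterior by the ratio of current to training priors
        rescaled = posteriors * (prev / train_prev)
        rescaled /= rescaled.sum(axis=1, keepdims=True)
        # M-step: the new prevalence estimate is the mean of the rescaled posteriors
        new_prev = rescaled.mean(axis=0)
        if np.abs(new_prev - prev).max() < tol:
            break
        prev = new_prev
    return prev
```

For comparison, the PCC method [4] performs no iterative rescaling at all and simply averages the classifier's posteriors, i.e., posteriors.mean(axis=0).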

Methods for multi-label quantification include regression-based quantification (RQ) and label powerset-based quantification (LPQ) [12]. Methods for the ordinal case include ordinal versions of the above-mentioned ACC, PACC, and SLD methods [13], and ordinal versions of the above-mentioned HDy method [14]. Methods for the regression case include 'Regress and splice' and 'Adjusted regress and sum' [15].

Quantification tasks according to the type of data

Several subtasks of quantification may be identified according to the type of data involved. Examples of such tasks are:

  - Quantification on networked data, in which the unlabelled items are the nodes of a (possibly large-scale) graph and the link structure can be exploited [16] [17];
  - Quantification over time, in which class prevalence values must be tracked as they evolve in a stream of data [18].

Evaluation measures for quantification

Several evaluation measures can be used for evaluating the error of a quantification method. Since quantification consists of generating a predicted probability distribution that estimates a true probability distribution, these evaluation measures are measures that compare two probability distributions, and most of them belong to the class of divergences. Evaluation measures used for binary quantification, single-label multiclass quantification, and multi-label quantification include the absolute error (AE), the relative absolute error (RAE), and the Kullback-Leibler divergence (KLD) [19] [12].

Evaluation measures for ordinal quantification must additionally take into account the total order defined on the classes; a widely used such measure is the normalized match distance, a normalized version of the earth mover's distance [13] [14].
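
As an illustration, the sketch below implements some of these measures on prevalence vectors represented as NumPy arrays: absolute error, relative absolute error, the Kullback-Leibler divergence, and, for the ordinal case, the normalized match distance, i.e., an earth mover's distance between distributions defined over totally ordered classes. The additive constant eps is a crude smoothing device used here only to avoid divisions by zero and undefined logarithms; the cited works define more principled smoothing schemes.

```python
import numpy as np

def absolute_error(p_true, p_pred):
    # mean absolute difference between true and predicted prevalence values
    return np.abs(p_true - p_pred).mean()

def relative_absolute_error(p_true, p_pred, eps=1e-8):
    # as above, but each per-class error is normalized by the true prevalence
    return (np.abs(p_true - p_pred) / (p_true + eps)).mean()

def kl_divergence(p_true, p_pred, eps=1e-8):
    # Kullback-Leibler divergence of the predicted from the true distribution
    p, q = p_true + eps, p_pred + eps
    return np.sum(p * np.log(p / q))

def normalized_match_distance(p_true, p_pred):
    # earth mover's distance between the two distributions, computed via their
    # cumulative sums and normalized by the number of class-to-class "steps";
    # meaningful only when the classes are totally ordered (ordinal quantification)
    n = len(p_true)
    return np.abs(np.cumsum(p_true) - np.cumsum(p_pred)).sum() / (n - 1)

p_true = np.array([0.10, 0.30, 0.60])   # e.g. Negative, Neutral, Positive
p_pred = np.array([0.20, 0.35, 0.45])
print(absolute_error(p_true, p_pred), kl_divergence(p_true, p_pred),
      normalized_match_distance(p_true, p_pred))
```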

Applications

Quantification is of special interest in fields such as the social sciences [20] , epidemiology [21] , market research, and ecological modelling [22] , since these fields are inherently concerned with aggregate data. However, quantification is also useful as a building block for solving other downstream tasks, such as improving the accuracy of classifiers on out-of-distribution data [10] [23] , allocating resources [3] , measuring classifier bias [24] , and estimating the accuracy of classifiers on out-of-distribution data [25] [26] .

Resources

Resources on learning to quantify include the LQ series of International Workshops on Learning to Quantify [27] [28] [29] [30] [31], the LeQua data challenges on learning to quantify [32] [33], and open-source Python packages for quantification such as QuaPy [34] and QuantificationLib [35].

References

  1. Pablo González; Alberto Castaño; Nitesh Chawla; Juan José del Coz (2017). "A review on quantification learning". ACM Computing Surveys . 50 (5): 74:1–74:40. doi:10.1145/3117807. hdl: 10651/45313 . S2CID   38185871.
  2. Andrea Esuli; Alessandro Fabris; Alejandro Moreo; Fabrizio Sebastiani (2023). Learning to Quantify (PDF). The Information Retrieval Series. Vol. 47. Cham, CH: Springer Nature. doi:10.1007/978-3-031-20467-8. ISBN   978-3-031-20466-1. S2CID   257560090.
  3. George Forman (2008). "Quantifying counts and costs via classification". Data Mining and Knowledge Discovery. 17 (2): 164–206. doi:10.1007/s10618-008-0097-y. S2CID 1435935.
  4. Antonio Bella; Cèsar Ferri; José Hernández-Orallo; María José Ramírez-Quintana (2010). "Quantification via Probability Estimators". Proceedings of the 2010 IEEE International Conference on Data Mining. pp. 737–742. doi:10.1109/icdm.2010.75. ISBN 978-1-4244-9131-5. S2CID 9670485.
  5. José Barranquero; Jorge Díez; Juan José del Coz (2015). "Quantification-oriented learning based on reliable classifiers". Pattern Recognition. 48 (2): 591–604. Bibcode:2015PatRe..48..591B. doi:10.1016/j.patcog.2014.07.032. hdl:10651/30611.
  6. Andrea Esuli; Fabrizio Sebastiani (2015). "Optimizing text quantifiers for multivariate loss functions". ACM Transactions on Knowledge Discovery from Data. 9 (4), Article 27. arXiv:1502.05491. doi:10.1145/2700406. S2CID 16824608.
  7. Wei Gao; Fabrizio Sebastiani (2016). "From classification to quantification in tweet sentiment analysis". Social Network Analysis and Mining . 6 (19): 1–22. doi:10.1007/s13278-016-0327-z. S2CID   15631612.
  8. Vladimir Vapnik (1998). Statistical learning theory. New York, US: Wiley.
  9. Víctor González-Castro; Rocío Alaiz-Rodríguez; Enrique Alegre (2013). "Class distribution estimation based on the Hellinger distance". Information Sciences. 218: 146–164. doi:10.1016/j.ins.2012.05.028.
  10. Marco Saerens; Patrice Latinne; Christine Decaestecker (2002). "Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure" (PDF). Neural Computation. 14 (1): 21–41. doi:10.1162/089976602753284446. PMID 11747533. S2CID 18254013.
  11. Alejandro Moreo; Pablo González; Juan José del Coz (2025). "Kernel density estimation for multiclass quantification". Machine Learning. 114 (4). doi:10.1007/s10994-024-06726-5.
  12. Alejandro Moreo; Manuel Francisco; Fabrizio Sebastiani (2024). "Multi-label quantification". ACM Transactions on Knowledge Discovery from Data. 18 (1): 1–36. arXiv:2211.08063. doi:10.1145/3606264.
  13. Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2024). "Ordinal quantification through regularization". Data Mining and Knowledge Discovery . 38 (6): 4076–4121.
  14. Alberto Castaño; Pablo González; Jaime Alonso González; Juan José del Coz (2024). "Matching Distributions Algorithms Based on the Earth Mover's Distance for Ordinal Quantification". IEEE Transactions on Neural Networks and Learning Systems . 35 (1): 1050–1061. doi:10.1109/TNNLS.2022.3179355.
  15. Antonio Bella; Cèsar Ferri; José Hernández-Orallo; María José Ramírez-Quintana (2014). "Aggregative quantification for regression". Data Mining and Knowledge Discovery . 28 (2): 475–518. doi:10.1007/s10618-013-0308-z. hdl: 10251/49300 .
  16. Alessio Micheli; Alejandro Moreo; Marco Podda; Fabrizio Sebastiani; William Simoni; Domenico Tortorella (2025). "Efficient quantification on large-scale networks". Machine Learning . 114 (12). doi:10.1007/s10994-025-06915-w.
  17. Clemens Damke; Eyke Hüllermeier (2025). "Distribution matching for graph quantification under structural covariate shift". Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD 2025), Porto, PT. pp. 403–419. doi:10.1007/978-3-032-05981-9_24.
  18. Feiyu Li; Hassan Habibi Gharakheili; Gustavo Batista (2024). "Quantification over time". Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD 2024), Vilnius, LT. pp. 282–299. doi:10.1007/978-3-031-70362-1_17.
  19. Fabrizio Sebastiani (2020). "Evaluation measures for quantification: An axiomatic approach". Information Retrieval Journal . 23 (3): 255–288. arXiv: 1809.01991 . doi:10.1007/s10791-019-09363-y. S2CID   52170301.
  20. Daniel J. Hopkins; Gary King (2010). "A method of automated nonparametric content analysis for social science". American Journal of Political Science . 54 (1): 229–247. doi:10.1111/j.1540-5907.2009.00428.x. S2CID   1177676.
  21. Gary King; Ying Lu (2008). "Verbal autopsy methods with multiple causes of death". Statistical Science . 23 (1): 78–91. arXiv: 0808.0645 . doi:10.1214/07-sts247. S2CID   4084198.
  22. Pablo González; Eva Álvarez; Jorge Díez; Ángel López-Urrutia; Juan J. del Coz (2017). "Validation methods for plankton image classification systems" (PDF). Limnology and Oceanography: Methods . 15 (3): 221–237. Bibcode:2017LimOM..15..221G. doi:10.1002/lom3.10151. S2CID   59438870.
  23. Yee Seng Chan; Hwee Tou Ng (2005). "Word sense disambiguation with distribution estimation". Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005), Edinburgh, UK. pp. 1010–1015.
  24. Alessandro Fabris; Andrea Esuli; Alejandro Moreo; Fabrizio Sebastiani (2023). "Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach" (PDF). Journal of Artificial Intelligence Research . 76: 1117–1180. arXiv: 2109.08549 . doi:10.1613/jair.1.14033. S2CID   247315416.
  25. Lorenzo Volpi; Alejandro Moreo; Fabrizio Sebastiani (2025). "LEAP: Linear equations for classifier accuracy prediction under prior probability shift". Machine Learning . 114 (12). doi:10.1007/s10994-025-06878-y.
  26. Lorenzo Volpi; Alejandro Moreo; Fabrizio Sebastiani (2025). "QuAcc: Using quantification to predict classifier accuracy under prior probability shift". Intelligenza Artificiale. 19 (2): 141–157. doi:10.1177/17248035251338347.
  27. "LQ 2021: the 1st International Workshop on Learning to Quantify".
  28. "LQ 2022: the 2nd International Workshop on Learning to Quantify".
  29. "LQ 2023: the 3rd International Workshop on Learning to Quantify".
  30. "LQ 2024: the 4th International Workshop on Learning to Quantify".
  31. "LQ 2025: the 5th International Workshop on Learning to Quantify".
  32. "LeQua 2022: A Data Challenge on Learning to Quantify".
  33. "LeQua 2024: A Data Challenge on Learning to Quantify".
  34. "QuaPy: A Python-Based Framework for Quantification". GitHub . 23 November 2021.
  35. "QuantificationLib: A Python library for quantification and prevalence estimation". GitHub . 8 April 2024.