Calibration (statistics)

Last updated

There are two main uses of the term calibration in statistics that denote special types of statistical inference problems. Calibration can mean

Contents

  • a reverse process to regression, where instead of a future dependent variable being predicted from known explanatory variables, a known observation of the dependent variables is used to predict a corresponding explanatory variable; [1]
  • procedures in statistical classification to determine class membership probabilities which assess the uncertainty of a given new observation belonging to each of the already established classes.

In addition, calibration is used in statistics with the usual general meaning of calibration. For example, model calibration can be also used to refer to Bayesian inference about the value of a model's parameters, given some data set, or more generally to any type of fitting of a statistical model. As Philip Dawid puts it, "a forecaster is well calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent." [2]

In classification

Calibration in classification means transforming classifier scores into class membership probabilities. An overview of calibration methods for two-class and multi-class classification tasks is given by Gebel (2009). [3] A classifier might separate the classes well, but be poorly calibrated, meaning that the estimated class probabilities are far from the true class probabilities. In this case, a calibration step may help improve the estimated probabilities. A variety of metrics exist that are aimed to measure the extent to which a classifier produces well-calibrated probabilities. Foundational work includes the Expected Calibration Error (ECE). [4] Into the 2020s, variants include the Adaptive Calibration Error (ACE) and the Test-based Calibration Error (TCE), which address limitations of the ECE metric that may arise when classifier scores concentrate on narrow subset of the [0,1] range. [5] [6]

A 2020s advancement in calibration assessment is the introduction of the Estimated Calibration Index (ECI). [7] The ECI extends the concepts of the Expected Calibration Error (ECE) to provide a more nuanced measure of a model's calibration, particularly addressing overconfidence and underconfidence tendencies. Originally formulated for binary settings, the ECI has been adapted for multiclass settings, offering both local and global insights into model calibration. This framework aims to overcome some of the theoretical and interpretative limitations of existing calibration metrics. Through a series of experiments, Famiglini et al. demonstrate the framework's effectiveness in delivering a more accurate understanding of model calibration levels and discuss strategies for mitigating biases in calibration assessment. An online tool has been proposed to compute both ECE and ECI. [8] The following univariate calibration methods exist for transforming classifier scores into class membership probabilities in the two-class case:

In probability prediction and forecasting

In prediction and forecasting, a Brier score is sometimes used to assess prediction accuracy of a set of predictions, specifically that the magnitude of the assigned probabilities track the relative frequency of the observed outcomes. Philip E. Tetlock employs the term "calibration" in this sense in his 2015 book Superforecasting . [16] This differs from accuracy and precision. For example, as expressed by Daniel Kahneman, "if you give all events that happen a probability of .6 and all the events that don't happen a probability of .4, your calibration is perfect but your discrimination is miserable". [16] In meteorology, in particular, as concerns weather forecasting, a related mode of assessment is known as forecast skill.

In regression

The calibration problem in regression is the use of known data on the observed relationship between a dependent variable and an independent variable to make estimates of other values of the independent variable from new observations of the dependent variable. [17] [18] [19] This can be known as "inverse regression"; [20] there is also sliced inverse regression. The following multivariate calibration methods exist for transforming classifier scores into class membership probabilities in the case with classes count greater than two:

Example

One example is that of dating objects, using observable evidence such as tree rings for dendrochronology or carbon-14 for radiometric dating. The observation is caused by the age of the object being dated, rather than the reverse, and the aim is to use the method for estimating dates based on new observations. The problem is whether the model used for relating known ages with observations should aim to minimise the error in the observation, or minimise the error in the date. The two approaches will produce different results, and the difference will increase if the model is then used for extrapolation at some distance from the known results.

See also

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> Paradigm in machine learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

Observational error is the difference between a measured value of a quantity and its unknown true value. Such errors are inherent in the measurement process; for example lengths measured with a ruler calibrated in whole centimeters will have a measurement error of several millimeters. The error or uncertainty of a measurement can be estimated, and is specified with the measurement as, for example, 32.3 ± 0.5 cm.

The following outline is provided as an overview of and topical guide to statistics:

<span class="mw-page-title-main">Naive Bayes classifier</span> Probabilistic classification algorithm

In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" which assumes that the features are conditionally independent, given the target class. The strength (naivety) of this assumption is what gives the classifier its name. These classifiers are among the simplest Bayesian network models.

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

When classification is performed by a computer, statistical methods are normally used to develop the algorithm.

In robust statistics, robust regression seeks to overcome some limitations of traditional regression analysis. A regression analysis models the relationship between one or more independent variables and a dependent variable. Standard types of regression, such as ordinary least squares, have favourable properties if their underlying assumptions are true, but can give misleading results otherwise. Robust regression methods are designed to limit the effect that violations of assumptions by the underlying data-generating process have on regression estimates.

Uncertainty quantification (UQ) is the science of quantitative characterization and estimation of uncertainties in both computational and real world applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known. An example would be to predict the acceleration of a human body in a head-on crash with another car: even if the speed was exactly known, small differences in the manufacturing of individual cars, how tightly every bolt has been tightened, etc., will lead to different results that can only be predicted in a statistical sense.

A computerized classification test (CCT) refers to, as its name would suggest, a Performance Appraisal System that is administered by computer for the purpose of classifying examinees. The most common CCT is a mastery test where the test classifies examinees as "Pass" or "Fail," but the term also includes tests that classify examinees into more than two categories. While the term may generally be considered to refer to all computer-administered tests for classification, it is usually used to refer to tests that are interactively administered or of variable-length, similar to computerized adaptive testing (CAT). Like CAT, variable-length CCTs can accomplish the goal of the test with a fraction of the number of items used in a conventional fixed-form test.

Discriminative models, also referred to as conditional models, are a class of models frequently used for classification. They are typically used to solve binary classification problems, i.e. assign labels, such as pass/fail, win/lose, alive/dead or healthy/sick, to existing datapoints.

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

Fraud represents a significant problem for governments and businesses and specialized analysis techniques for discovering fraud using them are required. Some of these methods include knowledge discovery in databases (KDD), data mining, machine learning and statistics. They offer applicable and successful solutions in different areas of electronic fraud crimes.

In machine learning, Platt scaling or Platt calibration is a way of transforming the outputs of a classification model into a probability distribution over classes. The method was invented by John Platt in the context of support vector machines, replacing an earlier method by Vapnik, but can be applied to other classification models. Platt scaling works by fitting a logistic regression model to a classifier's scores.

In machine learning, a probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right or when combining classifiers into ensembles.

In statistics, econometrics, epidemiology, genetics and related disciplines, causal graphs are probabilistic graphical models used to encode assumptions about the data-generating process.

The following outline is provided as an overview of and topical guide to machine learning:

Non-homogeneous Gaussian regression (NGR) is a type of statistical regression analysis used in the atmospheric sciences as a way to convert ensemble forecasts into probabilistic forecasts. Relative to simple linear regression, NGR uses the ensemble spread as an additional predictor, which is used to improve the prediction of uncertainty and allows the predicted uncertainty to vary from case to case. The prediction of uncertainty in NGR is derived from both past forecast errors statistics and the ensemble spread. NGR was originally developed for site-specific medium range temperature forecasting, but has since also been applied to site-specific medium-range wind forecasting and to seasonal forecasts, and has been adapted for precipitation forecasting. The introduction of NGR was the first demonstration that probabilistic forecasts that take account of the varying ensemble spread could achieve better skill scores than forecasts based on standard model output statistics approaches applied to the ensemble mean.

In statistics, the negative log predictive density (NLPD) is a measure of error between a model's predictions and associated true values. A smaller value is better. Importantly the NLPD assesses the quality of the model's uncertainty quantification. It is used for both regression and classification.

References

  1. Cook, Ian; Upton, Graham (2006). Oxford Dictionary of Statistics. Oxford: Oxford University Press. ISBN   978-0-19-954145-4.
  2. Dawid, A. P (1982). "The Well-Calibrated Bayesian". Journal of the American Statistical Association. 77 (379): 605–610. doi:10.1080/01621459.1982.10477856.
  3. 1 2 Gebel, Martin (2009). Multivariate calibration of classifier scores into the probability space (PDF) (PhD thesis). University of Dortmund.
  4. M.P. Naeini, G. Cooper, and M. Hauskrecht, Obtaining well calibrated probabilities using bayesian binning. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
  5. J. Nixon, M.W. Dusenberry, L. Zhang, G. Jerfel, & D. Tran. Measuring Calibration in Deep Learning. In: CVPR workshops (Vol. 2, No. 7), 2019.
  6. T. Matsubara, N. Tax, R. Mudd, & I. Guy. TCE: A Test-Based Approach to Measuring Calibration Error. In: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), PMLR, 2023.
  7. Famiglini, Lorenzo, Andrea Campagner, and Federico Cabitza. "Towards a Rigorous Calibration Assessment Framework: Advancements in Metrics, Methods, and Use." ECAI 2023. IOS Press, 2023. 645-652. Doi 10.3233/FAIA230327
  8. Famiglini, Lorenzo; Campagner, Andrea; Cabitza, Federico (2023), "Towards a Rigorous Calibration Assessment Framework: Advancements in Metrics, Methods, and Use", ECAI 2023, IOS Press, pp. 645–652, doi:10.3233/faia230327, hdl: 10281/456604 , retrieved 25 March 2024
  9. U. M. Garczarek " Archived 2004-11-23 at the Wayback Machine ," Classification Rules in Standardized Partition Spaces, Dissertation, Universität Dortmund, 2002
  10. P. N. Bennett, Using asymmetric distributions to improve text classifier probability estimates: A comparison of new and standard parametric methods, Technical Report CMU-CS-02-126, Carnegie Mellon, School of Computer Science, 2002.
  11. B. Zadrozny and C. Elkan, Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, 694699, Edmonton, ACM Press, 2002.
  12. D. D. Lewis and W. A. Gale, A Sequential Algorithm for Training Text classifiers. In: W. B. Croft and C. J. van Rijsbergen (eds.), Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '94), 312. New York, Springer-Verlag, 1994.
  13. J. C. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: A. J. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans (eds.), Advances in Large Margin Classiers, 6174. Cambridge, MIT Press, 1999.
  14. Naeini MP, Cooper GF, Hauskrecht M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. Proceedings of the . AAAI Conference on Artificial Intelligence AAAI Conference on Artificial Intelligence. 2015;2015:2901-2907.
  15. Meelis Kull, Telmo Silva Filho, Peter Flach; Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:623-631, 2017.
  16. 1 2 "Edge Master Class 2015: A Short Course in Superforecasting, Class II". edge.org. Edge Foundation. 24 August 2015. Retrieved 13 April 2018. Calibration is when I say there's a 70 percent likelihood of something happening, things happen 70 percent of time.
  17. Brown, P.J. (1994) Measurement, Regression and Calibration, OUP. ISBN   0-19-852245-2
  18. Ng, K. H., Pooi, A. H. (2008) "Calibration Intervals in Linear Regression Models", Communications in Statistics - Theory and Methods, 37 (11), 1688–1696.
  19. Hardin, J. W., Schmiediche, H., Carroll, R. J. (2003) "The regression-calibration method for fitting generalized linear models with additive measurement error", Stata Journal, 3 (4), 361372. link, pdf
  20. Draper, N.L., Smith, H. (1998) Applied Regression analysis, 3rd Edition, Wiley. ISBN   0-471-17082-8
  21. T. Hastie and R. Tibshirani, "," Classification by pairwise coupling. In: M. I. Jordan, M. J. Kearns and S. A. Solla (eds.), Advances in Neural Information Processing Systems, volume 10, Cambridge, MIT Press, 1998.