Probabilistic classification

Last updated January 18, 2024

In machine learning, a probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right^[1] or when combining classifiers into ensembles.

Types of classification

Formally, an "ordinary" classifier is some rule, or function, that assigns to a sample $x$ a class label $\hat{y}$:

$\hat{y} = f(x)$

The samples come from some set $X$ (e.g., the set of all documents, or the set of all images), while the class labels form a finite set $Y$ defined prior to training.

Probabilistic classifiers generalize this notion of classifiers: instead of functions, they are conditional distributions $\Pr(Y\vert X)$, meaning that for a given $x\in X$, they assign probabilities to all $y\in Y$ (and these probabilities sum to one). "Hard" classification can then be done using the optimal decision rule^[2]^: 39–40

$\hat{y} = \operatorname{arg\,max}_{y} \Pr(Y=y\vert X)$

or, in English, the predicted class is that which has the highest probability.
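As a minimal sketch of this decision rule, the following Python function picks the most probable label from a predicted distribution. The function name `hard_classify` and the example labels are illustrative assumptions, not part of any particular library.

```python
def hard_classify(probs):
    """Return the label with the highest probability.

    `probs` is a dict mapping each class label to its predicted probability.
    """
    # A probabilistic classifier's outputs should sum to one (up to rounding).
    if abs(sum(probs.values()) - 1.0) > 1e-9:
        raise ValueError("class probabilities should sum to one")
    # arg max over the class labels
    return max(probs, key=probs.get)


# Hypothetical predicted distribution for a single observation.
posterior = {"spam": 0.7, "ham": 0.2, "promo": 0.1}
print(hard_classify(posterior))  # spam
```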

Binary probabilistic classifiers are also called binary regression models in statistics. In econometrics, probabilistic classification in general is called discrete choice.

Some classification models, such as naive Bayes, logistic regression and multilayer perceptrons (when trained under an appropriate loss function) are naturally probabilistic. Other models such as support vector machines are not, but methods exist to turn them into probabilistic classifiers.

Generative and conditional training

Some models, such as logistic regression, are conditionally trained: they optimize the conditional probability $\Pr(Y\vert X)$ directly on a training set (see empirical risk minimization). Other classifiers, such as naive Bayes, are trained generatively: at training time, the class-conditional distribution $\Pr(X\vert Y)$ and the class prior $\Pr(Y)$ are found, and the conditional distribution $\Pr(Y\vert X)$ is derived using Bayes' rule.^[2]^: 43
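The generative route can be illustrated with a short sketch: given a class prior and the class-conditional probability of one observed $x$, Bayes' rule yields the conditional distribution over classes. The function name and the numbers are made up for illustration.

```python
def posterior_from_generative(prior, likelihood):
    """Compute Pr(Y=y | x) from Pr(Y=y) and Pr(x | Y=y) using Bayes' rule.

    Both arguments are dicts keyed by class label; `likelihood[y]` is the
    class-conditional probability (or density) of the one observed x.
    """
    joint = {y: prior[y] * likelihood[y] for y in prior}  # Pr(x, Y=y)
    evidence = sum(joint.values())                         # Pr(x)
    return {y: joint[y] / evidence for y in joint}


# Hypothetical values: 40% of mail is spam, and the observed message is
# five times more likely under the spam model than under the ham model.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {"spam": 0.05, "ham": 0.01}
post = posterior_from_generative(prior, likelihood)
print(post)
```

A conditionally trained model would instead output the values in `post` directly, without modelling `likelihood` or `prior` separately.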

Probability calibration

Not all classification models are naturally probabilistic, and some that are, notably naive Bayes classifiers, decision trees and boosting methods, produce distorted class probability distributions.^[3] In the case of decision trees, where $\Pr(y\vert x)$ is the proportion of training samples with label $y$ in the leaf where $x$ ends up, these distortions come about because learning algorithms such as C4.5 or CART explicitly aim to produce homogeneous leaves (giving probabilities close to zero or one, and thus high bias) while using few samples to estimate the relevant proportion (high variance).^[4]

Calibration can be assessed using a calibration plot (also called a reliability diagram).^[3]^[5] A calibration plot shows the proportion of items in each class for bands of predicted probability or score (such as a distorted probability distribution or the "signed distance to the hyperplane" in a support vector machine). Deviations from the identity function indicate a poorly calibrated classifier whose predicted probabilities or scores cannot be used as probabilities. In this case one can use a calibration method to turn these scores into properly calibrated class membership probabilities.
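The points of a calibration plot can be computed along the following lines. This sketch assumes a binary task, predicted probabilities in [0, 1], and equal-width bands; the function name `reliability_bins` and the sample data are illustrative.

```python
def reliability_bins(probs, labels, n_bins=10):
    """Summarise binary predictions in equal-width probability bands.

    Returns one (mean predicted probability, observed positive fraction,
    count) tuple per non-empty band, ordered by band.
    """
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last band
        sums[b] += p
        hits[b] += y
        counts[b] += 1
    return [
        (sums[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b]
    ]


# Hypothetical predictions and 0/1 outcomes.
probs = [0.1, 0.15, 0.8, 0.85, 0.9]
labels = [0, 0, 1, 0, 1]
for mean_p, frac_pos, count in reliability_bins(probs, labels, n_bins=5):
    print(f"mean predicted={mean_p:.3f} observed={frac_pos:.3f} n={count}")
```

Plotting the observed fraction against the mean predicted probability gives the diagram; for a well-calibrated classifier the points lie close to the diagonal.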

For the binary case, a common approach is to apply Platt scaling, which learns a logistic regression model on the scores.^[6] An alternative method using isotonic regression^[7] is generally superior to Platt's method when sufficient training data is available.^[3]
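Both calibration methods can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions rather than a reference implementation: the Platt-style fit uses plain gradient descent on log loss without the target smoothing of Platt's original procedure, the isotonic fit is a basic pool-adjacent-violators pass that does not treat tied scores specially, and all function names and data are hypothetical.

```python
import math


def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_predict(score, a, b):
    """Map a raw score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing step function.

    Returns (scores sorted ascending, fitted probability for each of them).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block holds [sum of labels, number of samples]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [scores[i] for i in order], fitted


# Hypothetical uncalibrated scores (e.g. signed distances) and 0/1 labels.
scores = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [0, 1, 0, 0, 1, 1]
a, b = fit_platt(scores, labels)
sorted_scores, iso = fit_isotonic(scores, labels)
print(platt_predict(0.0, a, b), iso)
```

The Platt-style fit is constrained to a sigmoid shape, whereas the isotonic fit can follow any monotone shape, which is one reason it tends to need more data.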

In the multiclass case, one can use a reduction to binary tasks, followed by univariate calibration with an algorithm as described above and further application of the pairwise coupling algorithm by Hastie and Tibshirani.^[8]

Evaluating probabilistic classification

Commonly used evaluation metrics that compare the predicted probability to observed outcomes include log loss, Brier score, and a variety of calibration errors. The first of these, log loss, is also used as a loss function in the training of logistic models.
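For the binary case, log loss and the Brier score can be written as below. This is a minimal sketch: the function names are illustrative, labels are assumed to be 0 or 1, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def log_loss(probs, labels, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() is always defined
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Hypothetical predictions for two observations with outcomes 1 and 0.
confident = [0.9, 0.1]
hedged = [0.6, 0.4]
outcomes = [1, 0]
print(log_loss(confident, outcomes), log_loss(hedged, outcomes))
print(brier_score(confident, outcomes), brier_score(hedged, outcomes))
```

Lower is better for both metrics; here the confident predictions score better because they are also correct.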

Calibration error metrics aim to quantify the extent to which a probabilistic classifier's outputs are well-calibrated. As Philip Dawid put it, "a forecaster is well-calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent".^[9] Foundational work in the domain of measuring calibration error is the Expected Calibration Error (ECE) metric.^[10] More recent works propose variants of ECE that address limitations that may arise when classifier scores concentrate on a narrow subset of the interval [0,1], including the Adaptive Calibration Error (ACE)^[11] and Test-based Calibration Error (TCE).^[12]
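One common way to compute ECE in the binary case is sketched below: predictions are grouped into equal-width bands, and the gap between observed positive fraction and mean predicted probability is averaged with weights proportional to band size. Exact definitions vary between papers, so the binning scheme and the function name here are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between observed frequency and mean prediction per band."""
    n = len(probs)
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last band
        sums[b] += p
        hits[b] += y
        counts[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if counts[b]:
            gap = abs(hits[b] / counts[b] - sums[b] / counts[b])
            ece += counts[b] / n * gap
    return ece


# Hypothetical predictions and 0/1 outcomes.
probs = [0.1, 0.15, 0.8, 0.85, 0.9]
labels = [0, 0, 1, 0, 1]
print(expected_calibration_error(probs, labels, n_bins=5))
```

When the scores cluster in a narrow range, most equal-width bands are empty and a single band dominates the estimate, which is the limitation that variants such as ACE aim to address.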

A method used to assign scores to pairs of predicted probabilities and actual discrete outcomes, so that different predictive methods can be compared, is called a scoring rule.

Software implementations

MoRPE^[13] is a trainable probabilistic classifier that uses isotonic regression for probability calibration. It solves the multiclass case by reduction to binary tasks. It is a type of kernel machine that uses an inhomogeneous polynomial kernel.

References

[1] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). The Elements of Statistical Learning. p. 348. Archived from the original on 2015-01-26. "[I]n data mining applications the interest is often more in the class probabilities $p_{\ell}(x), \ell = 1, \dots, K$ themselves, rather than in performing a class assignment."
[2] Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.
[3] Niculescu-Mizil, Alexandru; Caruana, Rich (2005). Predicting good probabilities with supervised learning (PDF). ICML. doi:10.1145/1102351.1102430. Archived from the original (PDF) on 2014-03-11.
[4] Zadrozny, Bianca; Elkan, Charles (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers (PDF). ICML. pp. 609–616.
[5] "Probability calibration". jmetzen.github.io. Retrieved 2019-06-18.
[6] Platt, John (1999). "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods". Advances in Large Margin Classifiers. 10 (3): 61–74.
[7] Zadrozny, Bianca; Elkan, Charles (2002). "Transforming classifier scores into accurate multiclass probability estimates" (PDF). Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02. pp. 694–699. CiteSeerX 10.1.1.164.8140, 10.1.1.13.7457. doi:10.1145/775047.775151. ISBN 978-1-58113-567-1. S2CID 3349576.
[8] Hastie, Trevor; Tibshirani, Robert (1998). "Classification by pairwise coupling". The Annals of Statistics. 26 (2): 451–471. CiteSeerX 10.1.1.309.4720, 10.1.1.46.6032. doi:10.1214/aos/1028144844. Zbl 0932.62071.
[9] Dawid, A. P. (1982). "The Well-Calibrated Bayesian". Journal of the American Statistical Association. 77 (379): 605–610. doi:10.1080/01621459.1982.10477856.
[10] Naeini, M.P.; Cooper, G.; Hauskrecht, M. (2015). "Obtaining well calibrated probabilities using bayesian binning" (PDF). Proceedings of the AAAI Conference on Artificial Intelligence.
[11] Nixon, J.; Dusenberry, M.W.; Zhang, L.; Jerfel, G.; Tran, D. (2019). "Measuring Calibration in Deep Learning" (PDF). CVPR workshops.
[12] Matsubara, T.; Tax, N.; Mudd, R.; Guy, I. (2023). "TCE: A Test-Based Approach to Measuring Calibration Error". Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI). arXiv:2306.14343.
[13] "MoRPE". GitHub. Retrieved 17 February 2023.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
