Bayes error rate


In statistical classification, the Bayes error rate is the lowest possible error rate for any classifier of a random outcome (into, for example, one of two categories), and is analogous to the irreducible error. [1] [2]


A number of approaches to the estimation of the Bayes error rate exist. One method seeks to obtain analytical bounds which are inherently dependent on distribution parameters, and hence difficult to estimate. Another approach focuses on class densities, while yet another method combines and compares various classifiers. [2]

The Bayes error rate plays an important role in the study of pattern recognition and machine learning techniques. [3]

Error determination

In terms of machine learning and pattern classification, the labels of a set of random observations can be divided into two or more classes. Each observation is called an instance and the class to which it belongs is its label. The Bayes error rate of the data distribution is the probability that an instance is misclassified by a classifier that knows the true class probabilities given the predictors.

For a multiclass classifier, the expected prediction error may be calculated as follows: [3]

$$ \mathrm{EPE} = \mathbb{E}_x\!\left[\sum_{k=1}^{K} L\bigl(C_k, \hat{C}(x)\bigr)\, P(C_k \mid x)\right], $$

where $x$ is an instance, $\mathbb{E}[\cdot]$ the expectation value, $C_k$ a class into which an instance is classified, $\hat{C}(x)$ the class predicted by the classifier for instance $x$, $P(C_k \mid x)$ the conditional probability of label $k$ for instance $x$, and $L(\cdot,\cdot)$ the 0–1 loss function:

$$ L\bigl(C_k, C_j\bigr) = 1 - \delta_{k,j} = \begin{cases} 0 & \text{if } k = j, \\ 1 & \text{if } k \neq j, \end{cases} $$

where $\delta_{k,j}$ is the Kronecker delta.

When the learner knows the conditional probabilities, one solution is

$$ \hat{C}_{\mathrm{B}}(x) = \arg\max_{k} P(C_k \mid x). $$

This solution is known as the Bayes classifier.

The corresponding expected prediction error is called the Bayes error rate:

$$ \mathrm{BE} = \mathbb{E}_x\!\left[\sum_{k:\, C_k \neq \hat{C}_{\mathrm{B}}(x)} P(C_k \mid x)\right] = \mathbb{E}_x\!\left[1 - P\bigl(\hat{C}_{\mathrm{B}}(x) \mid x\bigr)\right], $$

where the sum can be omitted in the last step because the complementary event is considered instead. By definition, the Bayes classifier maximises $P\bigl(\hat{C}_{\mathrm{B}}(x) \mid x\bigr)$ and therefore minimises the Bayes error $\mathrm{BE}$.
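
As a concrete illustration of the last formula (an example added here, not taken from the cited references; the two-Gaussian setting and all numerical values are assumptions), the following Python sketch estimates $\mathrm{BE} = \mathbb{E}_x\bigl[1 - \max_k P(C_k \mid x)\bigr]$ by Monte Carlo for two equiprobable Gaussian classes with known densities, and compares the result with the closed-form value $\Phi\!\bigl(-\lvert\mu_1 - \mu_0\rvert/(2\sigma)\bigr)$ that holds in the equal-prior, equal-variance case.

```python
# A minimal sketch (illustrative assumptions: two equiprobable 1-D Gaussian
# classes N(mu0, sigma^2) and N(mu1, sigma^2) with known densities).
# It estimates BE = E_x[1 - max_k P(C_k | x)] by Monte Carlo and compares
# it with the closed form Phi(-|mu1 - mu0| / (2 sigma)).
import numpy as np
from scipy.stats import norm

mu0, mu1, sigma, prior1 = 0.0, 2.0, 1.0, 0.5  # prior1 = P(C_1)

rng = np.random.default_rng(0)
n = 1_000_000
labels = rng.random(n) < prior1                                  # True -> class C_1
x = np.where(labels, rng.normal(mu1, sigma, n), rng.normal(mu0, sigma, n))

# True posteriors P(C_k | x) via Bayes' theorem.
joint0 = (1.0 - prior1) * norm.pdf(x, mu0, sigma)
joint1 = prior1 * norm.pdf(x, mu1, sigma)
posterior1 = joint1 / (joint0 + joint1)

# Monte Carlo estimate of E_x[1 - max_k P(C_k | x)].
bayes_error_mc = np.mean(1.0 - np.maximum(posterior1, 1.0 - posterior1))

# Closed form for the equal-prior, equal-variance case.
bayes_error_exact = norm.cdf(-abs(mu1 - mu0) / (2.0 * sigma))

print(f"Monte Carlo: {bayes_error_mc:.4f}   exact: {bayes_error_exact:.4f}")
# Both are approximately 0.1587 for this configuration.
```

The two values agree closely; the Bayes error here is a property of the assumed data distribution alone and does not depend on any fitted model.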

The Bayes error is non-zero if the classification labels are not deterministic, i.e., there is a non-zero probability of a given instance belonging to more than one class. [4] In a regression context with squared error, the Bayes error is equal to the noise variance. [3]
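
For the regression statement, a brief derivation under the common additive-noise model $Y = f(X) + \varepsilon$ with $\mathbb{E}[\varepsilon \mid X] = 0$ and $\operatorname{Var}(\varepsilon \mid X) = \sigma^2$ (a modelling assumption introduced here for illustration, not stated in the cited source): for any predictor $g$,

$$ \mathbb{E}\bigl[(Y - g(X))^2 \mid X = x\bigr] = \sigma^2 + \bigl(f(x) - g(x)\bigr)^2 \;\ge\; \sigma^2, $$

since the cross term $2\bigl(f(x) - g(x)\bigr)\,\mathbb{E}[Y - f(x) \mid X = x]$ vanishes. The bound is attained by $g = f = \mathbb{E}[Y \mid X]$, so the lowest achievable expected squared error, the analogue of the Bayes error, is the noise variance $\sigma^2$.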

Proof of minimality

A proof that the Bayes error rate is indeed the minimum possible, and that the Bayes classifier is therefore optimal, can be found in the article Bayes classifier.


Plug-in rules for binary classifiers

A plug-in rule uses an estimate $\tilde{\eta}$ of the posterior probability $\eta(x) = P(C_1 \mid x)$ to form a classification rule, predicting class $C_1$ exactly when $\tilde{\eta}(x) > \tfrac{1}{2}$. Given such an estimate, the excess Bayes error rate of the associated classifier is bounded above by

$$ 2\, \mathbb{E}\bigl[\, \lvert \eta(X) - \tilde{\eta}(X) \rvert \,\bigr]. $$

To see this, note that the excess Bayes error is equal to $0$ where the plug-in classifier and the Bayes classifier agree, and equal to $2\,\lvert \eta(X) - \tfrac{1}{2} \rvert$ where they disagree. To form the bound, notice that $\tilde{\eta}(X)$ is at least as far from $\eta(X)$ as $\tfrac{1}{2}$ is when the two classifiers disagree, so that $\lvert \eta(X) - \tfrac{1}{2} \rvert \le \lvert \eta(X) - \tilde{\eta}(X) \rvert$ on the disagreement event.
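
To make the bound concrete, the following sketch (an illustration added here, not part of the cited sources; the uniform design and the particular estimate $\tilde{\eta}$ are assumptions) compares the excess risk of a plug-in classifier with the bound $2\,\mathbb{E}[\lvert\eta(X) - \tilde{\eta}(X)\rvert]$ by Monte Carlo.

```python
# Minimal sketch (illustrative; the distribution and the estimate are assumptions):
# check the plug-in bound  R(g_tilde) - R*  <=  2 E|eta(X) - eta_tilde(X)|
# for X ~ Uniform(0, 1) with true posterior eta(x) = P(Y=1 | X=x) = x
# and a deliberately biased estimate eta_tilde(x) = clip(x + 0.15, 0, 1).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 1_000_000)

eta = x                                   # true posterior P(Y=1 | X=x)
eta_tilde = np.clip(x + 0.15, 0.0, 1.0)   # plug-in estimate of the posterior

g_star = eta > 0.5          # Bayes classifier
g_tilde = eta_tilde > 0.5   # plug-in classifier

# Conditional risk of a classifier at x: eta if it predicts 0, 1 - eta if it predicts 1.
risk_star = np.mean(np.where(g_star, 1.0 - eta, eta))    # Bayes error rate, here 0.25
risk_tilde = np.mean(np.where(g_tilde, 1.0 - eta, eta))  # risk of the plug-in rule

excess = risk_tilde - risk_star
bound = 2.0 * np.mean(np.abs(eta - eta_tilde))

print(f"excess risk {excess:.4f} <= bound {bound:.4f}")
```

For this configuration the excess risk is about 0.023 while the bound evaluates to roughly 0.28, so the inequality holds comfortably; the bound is generally loose.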


References

  1. Fukunaga, Keinosuke (1990). Introduction to Statistical Pattern Recognition. Academic Press. pp. 3, 97. ISBN 0122698517.
  2. Tumer, K.; Ghosh, J. (1996). "Estimating the Bayes error rate through classifier combining". Proceedings of the 13th International Conference on Pattern Recognition. Vol. 2. pp. 695–699.
  3. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). The Elements of Statistical Learning (2nd ed.). Springer. p. 21. ISBN 978-0387848570.
  4. Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2018). Foundations of Machine Learning (2nd ed.). MIT Press. p. 22.