Bayes classifier

Last updated January 24, 2024

In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.^[1]

Definition

Suppose a pair $(X,Y)$ takes values in $\mathbb {R} ^{d}\times \{1,2,\dots ,K\}$ , where $Y$ is the class label of an element whose features are given by $X$ . Assume that the conditional distribution of X, given that the label Y takes the value r is given by

(X\mid Y=r)\sim P_{r}

for

r=1,2,\dots ,K

where " $\sim$ " means "is distributed as", and where $P_{r}$ denotes a probability distribution.

A classifier is a rule that assigns to an observation X=x a guess or estimate of what the unobserved label Y=r actually was. In theoretical terms, a classifier is a measurable function $C:\mathbb {R} ^{d}\to \{1,2,\dots ,K\}$ , with the interpretation that C classifies the point x to the class C(x). The probability of misclassification, or risk, of a classifier C is defined as

{\mathcal {R}}(C)=\operatorname {P} \{C(X)\neq Y\}.

The Bayes classifier is

C^{\text{Bayes}}(x)={\underset {r\in \{1,2,\dots ,K\}}{\operatorname {argmax} }}\operatorname {P} (Y=r\mid X=x).

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively—in this case, $\operatorname {P} (Y=r\mid X=x)$ . The Bayes classifier is a useful benchmark in statistical classification.

The excess risk of a general classifier $C$ (possibly depending on some training data) is defined as ${\mathcal {R}}(C)-{\mathcal {R}}(C^{\text{Bayes}}).$ Thus this non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.^[2]

Considering the components $x_{i}$ of $x$ to be mutually independent, we get the naive bayes classifier, where $C^{\text{Bayes}}(x)={\underset {r\in \{1,2,\dots ,K\}}{\operatorname {argmax} }}\operatorname {P} (Y=r)\prod _{i=1}^{d}P_{r}(x_{i}).$

Properties

Proof that the Bayes classifier is optimal and Bayes error rate is minimal proceeds as follows.

Define the variables: Risk $R(h)$ , Bayes risk $R^{*}$ , all possible classes to which the points can be classified $Y=\{0,1\}$ . Let the posterior probability of a point belonging to class 1 be $\eta (x)=Pr(Y=1|X=x)$ . Define the classifier ${\mathcal {h}}^{*}$ as

${\mathcal {h}}^{*}(x)={\begin{cases}1&{\text{if }}\eta (x)\geqslant 0.5,\\0&{\text{otherwise.}}\end{cases}}$

Then we have the following results:

(a) $R(h^{*})=R^{*}$ , i.e. $h^{*}$ is a Bayes classifier,

(b) For any classifier $h$ , the excess risk satisfies $R(h)-R^{*}=2\mathbb {E} _{X}\left[|\eta (x)-0.5|\cdot \mathbb {I} _{\left\{h(X)\neq h^{*}(X)\right\}}\right]$

(c) $R^{*}=\mathbb {E} _{X}\left[\min(\eta (X),1-\eta (X))\right]$

(d) $R^{*}={\frac {1}{2}}-{\frac {1}{2}}\mathbb {E} [|2\eta (X)-1|]$

Proof of (a): For any classifier $h$ , we have

$R(h)=\mathbb {E} _{XY}\left[\mathbb {I} _{\left\{h(X)\neq Y\right\}}\right]$

$=\mathbb {E} _{X}\mathbb {E} _{Y|X}[\mathbb {I} _{\left\{h(X)\neq Y\right\}}]$ (due to Fubini's theorem)

$=\mathbb {E} _{X}[\eta (X)\mathbb {I} _{\left\{h(X)=0\right\}}+(1-\eta (X))\mathbb {I} _{\left\{h(X)=1\right\}}]$

Notice that $R(h)$ is minimised by taking $\forall x\in X$ ,

$h(x)={\begin{cases}1&{\text{if }}\eta (x)\geqslant 1-\eta (x),\\0&{\text{otherwise.}}\end{cases}}$

Therefore the minimum possible risk is the Bayes risk, $R^{*}=R(h^{*})$ .

Proof of (b):

${\begin{aligned}R(h)-R^{*}&=R(h)-R(h^{*})\\&=\mathbb {E} _{X}[\eta (X)\mathbb {I} _{\left\{h(X)=0\right\}}+(1-\eta (X))\mathbb {I} _{\left\{h(X)=1\right\}}-\eta (X)\mathbb {I} _{\left\{h^{*}(X)=0\right\}}-(1-\eta (X))\mathbb {I} _{\left\{h^{*}(X)=1\right\}}]\\&=\mathbb {E} _{X}[|2\eta (X)-1|\mathbb {I} _{\left\{h(X)\neq h^{*}(X)\right\}}]\\&=2\mathbb {E} _{X}[|\eta (X)-0.5|\mathbb {I} _{\left\{h(X)\neq h^{*}(X)\right\}}]\end{aligned}}$

Proof of (c):

${\begin{aligned}R(h^{*})&=\mathbb {E} _{X}[\eta (X)\mathbb {I} _{\left\{h^{*}(X)=0\right\}}+(1-\eta (X))\mathbb {I} _{\left\{h*(X)=1\right\}}]\\&=\mathbb {E} _{X}[\min(\eta (X),1-\eta (X))]\end{aligned}}$

Proof of (d):

${\begin{aligned}R(h^{*})&=\mathbb {E} _{X}[\min(\eta (X),1-\eta (X))]\\&={\frac {1}{2}}-\mathbb {E} _{X}[\max(\eta (X)-1/2,1/2-\eta (X))]\\&={\frac {1}{2}}-{\frac {1}{2}}\mathbb {E} [|2\eta (X)-1|]\end{aligned}}$

General case

The general case that the Bayes classifier minimises classification error when each element can belong to either of n categories proceeds by towering expectations as follows.

${\begin{aligned}\mathbb {E} _{Y}(\mathbb {I} _{\{y\neq {\hat {y}}\}})&=\mathbb {E} _{X}\mathbb {E} _{Y|X}\left(\mathbb {I} _{\{y\neq {\hat {y}}\}}|X=x\right)\\&=\mathbb {E} \left[Pr(Y=1|X=x)\mathbb {I} _{\{{\hat {y}}=2,3,\dots ,n\}}+Pr(Y=2|X=x)\mathbb {I} _{\{{\hat {y}}=1,3,\dots ,n\}}+\dots +Pr(Y=n|X=x)\mathbb {I} _{\{{\hat {y}}=1,2,3,\dots ,n-1\}}\right]\end{aligned}}$

This is minimised by simultaneously minimizing all the terms of the expectation using the classifier $h(x)=k,\quad \arg \max _{k}Pr(Y=k|X=x)$

for each observation x.

Related Research Articles

In mathematics, the associative property is a property of some binary operations, which means that rearranging the parentheses in an expression will not change the result. In propositional logic, associativity is a valid rule of replacement for expressions in logical proofs.

<span class="mw-page-title-main">Inner product space</span> Generalization of the dot product; used to define Hilbert spaces

In mathematics, an inner product space is a real vector space or a complex vector space with an operation called an inner product. The inner product of two vectors in the space is a scalar, often denoted with angle brackets such as in $. Inner products allow formal definitions of intuitive geometric notions, such as lengths, angles, and orthogonality of vectors. Inner product spaces generalize Euclidean vector spaces, in which the inner product is the dot product or scalar product of Cartesian coordinates. Inner product spaces of infinite dimension are widely used in functional analysis. Inner product spaces over the field of complex numbers are sometimes referred to as unitary spaces . The first usage of the concept of a vector space with an inner product is due to Giuseppe Peano, in 1898.$

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In mathematics, a linear form is a linear map from a vector space to its field of scalars.

In physics and mathematics, the Lorentz group is the group of all Lorentz transformations of Minkowski spacetime, the classical and quantum setting for all (non-gravitational) physical phenomena. The Lorentz group is named for the Dutch physicist Hendrik Lorentz.

In mathematics, in particular in algebraic topology, differential geometry and algebraic geometry, the Chern classes are characteristic classes associated with complex vector bundles. They have since become fundamental concepts in many branches of mathematics and physics, such as string theory, Chern–Simons theory, knot theory, Gromov–Witten invariants. Chern classes were introduced by Shiing-Shen Chern (1946).

In the calculus of variations and classical mechanics, the Euler–Lagrange equations are a system of second-order ordinary differential equations whose solutions are stationary points of the given action functional. The equations were discovered in the 1750s by Swiss mathematician Leonhard Euler and Italian mathematician Joseph-Louis Lagrange.

In mathematics, the Hodge star operator or Hodge star is a linear map defined on the exterior algebra of a finite-dimensional oriented vector space endowed with a nondegenerate symmetric bilinear form. Applying the operator to an element of the algebra produces the Hodge dual of the element. This map was introduced by W. V. D. Hodge.

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinct because they all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

In mathematics, the inverse trigonometric functions are the inverse functions of the trigonometric functions. Specifically, they are the inverses of the sine, cosine, tangent, cotangent, secant, and cosecant functions, and are used to obtain an angle from any of the angle's trigonometric ratios. Inverse trigonometric functions are widely used in engineering, navigation, physics, and geometry.

In mathematics, in the area of analytic number theory, the Dirichlet eta function is defined by the following Dirichlet series, which converges for any complex number having real part > 0:

In functional analysis and related areas of mathematics, locally convex topological vector spaces (LCTVS) or locally convex spaces are examples of topological vector spaces (TVS) that generalize normed spaces. They can be defined as topological vector spaces whose topology is generated by translations of balanced, absorbent, convex sets. Alternatively they can be defined as a vector space with a family of seminorms, and a topology can be defined in terms of that family. Although in general such spaces are not necessarily normable, the existence of a convex local base for the zero vector is strong enough for the Hahn–Banach theorem to hold, yielding a sufficiently rich theory of continuous linear functionals.

In control theory, a state observer or state estimator is a system that provides an estimate of the internal state of a given real system, from measurements of the input and output of the real system. It is typically computer-implemented, and provides the basis of many practical applications.

In mathematics, the Weierstrass functions are special functions of a complex variable that are auxiliary to the Weierstrass elliptic function. They are named for Karl Weierstrass. The relation between the sigma, zeta, and $functions is analogous to that between the sine, cotangent, and squared cosecant functions: the logarithmic derivative of the sine is the cotangent, whose derivative is negative the squared cosecant.$

In physics, the Majorana equation is a relativistic wave equation. It is named after the Italian physicist Ettore Majorana, who proposed it in 1937 as a means of describing fermions that are their own antiparticle. Particles corresponding to this equation are termed Majorana particles, although that term now has a more expansive meaning, referring to any fermionic particle that is its own anti-particle.

In mathematics, a Minkowski plane is one of the Benz planes.

<span class="mw-page-title-main">Classical group</span>

In mathematics, the classical groups are defined as the special linear groups over the reals $R$ , the complex numbers $C$ and the quaternions $H$ together with special automorphism groups of symmetric or skew-symmetric bilinear forms and Hermitian or skew-Hermitian sesquilinear forms defined on real, complex and quaternionic finite-dimensional vector spaces. Of these, the complex classical Lie groups are four infinite families of Lie groups that together with the exceptional groups exhaust the classification of simple Lie groups. The compact classical groups are compact real forms of the complex classical groups. The finite analogues of the classical groups are the classical groups of Lie type. The term "classical group" was coined by Hermann Weyl, it being the title of his 1939 monograph The Classical Groups.

In mathematics and mathematical physics, raising and lowering indices are operations on tensors which change their type. Raising and lowering indices are a form of index manipulation in tensor expressions.

<span class="mw-page-title-main">Loss functions for classification</span> Concept in machine learning

In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems. Given $as the space of all possible inputs, and as the set of labels, a typical goal of classification algorithms is to find a function which best predicts a label for a given input . However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same to generate different . As a result, the goal of the learning problem is to minimize expected loss, defined as$

The hyperbolastic functions, also known as hyperbolastic growth models, are mathematical functions that are used in medical statistical modeling. These models were originally developed to capture the growth dynamics of multicellular tumor spheres, and were introduced in 2005 by Mohammad Tabatabai, David Williams, and Zoran Bursac. The precision of hyperbolastic functions in modeling real world problems is somewhat due to their flexibility in their point of inflection. These functions can be used in a wide variety of modeling problems such as tumor growth, stem cell proliferation, pharma kinetics, cancer growth, sigmoid activation function in neural networks, and epidemiological disease progression or regression.

References

↑ Devroye, L.; Gyorfi, L. & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer. ISBN 0-3879-4618-7.
↑ Farago, A.; Lugosi, G. (1993). "Strong universal consistency of neural network classifiers". IEEE Transactions on Information Theory. 39 (4): 1146–1151. doi:10.1109/18.243433.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[PTPR-1] Devroye, L.; Gyorfi, L. & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer. ISBN 0-3879-4618-7.

[2] Farago, A.; Lugosi, G. (1993). "Strong universal consistency of neural network classifiers". IEEE Transactions on Information Theory. 39 (4): 1146–1151. doi:10.1109/18.243433.

[1]

[2]