In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.[1]
Suppose a pair $(X, Y)$ takes values in $\mathbb{R}^d \times \{1, 2, \dots, K\}$, where $Y$ is the class label of an element whose features are given by $X$. Assume that the conditional distribution of $X$, given that the label $Y$ takes the value $r$, is given by

$$(X \mid Y = r) \sim P_r \quad \text{for } r = 1, 2, \dots, K,$$

where "$\sim$" means "is distributed as", and where $P_r$ denotes a probability distribution.
A classifier is a rule that assigns to an observation $X = x$ a guess or estimate of what the unobserved label $Y = r$ actually was. In theoretical terms, a classifier is a measurable function $C : \mathbb{R}^d \to \{1, 2, \dots, K\}$, with the interpretation that $C$ classifies the point $x$ to the class $C(x)$. The probability of misclassification, or risk, of a classifier $C$ is defined as

$$\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.$$
The Bayes classifier is

$$C^{\text{Bayes}}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r \mid X = x).$$
In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively—in this case, the posterior probabilities $\operatorname{P}(Y = r \mid X = x)$. The Bayes classifier is a useful benchmark in statistical classification.
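To make the definitions concrete, here is a minimal numerical sketch (my construction, not from the article) of a Bayes classifier for two one-dimensional Gaussian classes; the priors, means, and variances are illustrative assumptions, and the risk is estimated by Monte Carlo.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative model (assumed for this sketch): two classes with known
# class-conditional densities P_r and known priors P(Y = r).
priors = np.array([0.6, 0.4])                  # P(Y = 0), P(Y = 1)
means, sds = np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def bayes_classifier(x):
    """argmax_r P(Y = r | X = x), computed via Bayes' rule:
    posterior is proportional to prior times class-conditional density."""
    joint = priors * norm.pdf(np.asarray(x)[:, None], means, sds)
    return np.argmax(joint, axis=1)

# Monte Carlo estimate of the risk R(C) = P(C(X) != Y).
n = 200_000
y = rng.choice(2, size=n, p=priors)
x = rng.normal(means[y], sds[y])
print("estimated Bayes risk:", np.mean(bayes_classifier(x) != y))
```

Because the posteriors here are exact rather than estimated, no classifier built on the same feature can achieve a lower misclassification probability.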
The excess risk of a general classifier $C_n$ (possibly depending on some training data) is defined as

$$\mathcal{R}(C_n) - \mathcal{R}(C^{\text{Bayes}}).$$

This non-negative quantity is important for assessing the performance of different classification techniques. A classifier $C_n$ is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.[2]
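To illustrate consistency, here is a hedged sketch (again my construction, not from the article): a plug-in classifier estimates the class means from $n$ training points and substitutes them into the same Gaussian model, and its excess risk over the exact Bayes rule shrinks as $n$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Assumed model for this sketch: equal priors, unit-variance Gaussian classes.
priors, means, sds = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def sample(n):
    y = rng.choice(2, size=n, p=priors)
    return rng.normal(means[y], sds[y]), y

def gaussian_rule(x, m):
    # argmax_r of prior times Gaussian density with means m (true or estimated).
    return np.argmax(priors * norm.pdf(x[:, None], m, sds), axis=1)

# One large shared test set; risk(yhat) estimates P(C(X) != Y).
x_test, y_test = sample(500_000)
risk = lambda yhat: np.mean(yhat != y_test)
bayes_risk = risk(gaussian_rule(x_test, means))

for n_train in (50, 500, 50_000):
    x, y = sample(n_train)
    est = np.array([x[y == r].mean() for r in (0, 1)])  # plug-in mean estimates
    # Excess risk R(C_n) - R(C_Bayes); Monte Carlo, so tiny values are noisy.
    print(n_train, "excess risk ~", risk(gaussian_rule(x_test, est)) - bayes_risk)
```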
Considering the components $x_1, \dots, x_d$ of $x$ to be mutually independent, we get the naive Bayes classifier, where

$$C^{\text{Bayes}}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r) \prod_{i=1}^{d} P_r(x_i).$$
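A minimal sketch of this factorized rule, assuming (as an illustration, not something the article specifies) Gaussian per-feature class conditionals whose parameters are estimated from labelled data; the helper names `fit_gaussian_nb` and `predict_nb` are mine.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_nb(X, y, n_classes):
    """Estimate priors P(Y = r) and per-feature Gaussian parameters."""
    priors = np.bincount(y, minlength=n_classes) / len(y)
    mu = np.array([X[y == r].mean(axis=0) for r in range(n_classes)])
    sd = np.array([X[y == r].std(axis=0) for r in range(n_classes)])
    return priors, mu, sd

def predict_nb(X, priors, mu, sd):
    # log P(Y = r) + sum_i log P_r(x_i), maximised over r (logs avoid underflow).
    logpost = np.log(priors) + norm.logpdf(X[:, None, :], mu, sd).sum(axis=2)
    return np.argmax(logpost, axis=1)

# Toy usage on synthetic two-class, two-feature data.
rng = np.random.default_rng(0)
y = rng.choice(2, size=1000)
X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(1000, 2))
params = fit_gaussian_nb(X, y, 2)
print("training accuracy:", np.mean(predict_nb(X, *params) == y))
```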
A proof that the Bayes classifier is optimal and that the Bayes error rate is minimal proceeds as follows.
Define the variables: risk $R(h)$, Bayes risk $R^*$, and the set of all possible classes to which a point can be classified, $\mathcal{Y} = \{0, 1\}$. Let the posterior probability of a point belonging to class 1 be $\eta(x) = \operatorname{P}(Y = 1 \mid X = x)$. Define the classifier $h^*$ as

$$h^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq \tfrac{1}{2}, \\ 0 & \text{otherwise.} \end{cases}$$

Then we have the following results:

(a) $R(h^*) = R^*$, i.e. $h^*$ is a Bayes classifier;
(b) for any classifier $h$, the excess risk satisfies $R(h) - R^* = 2\,\mathbb{E}_X\!\left[\,|\eta(X) - \tfrac{1}{2}|\cdot \mathbb{1}\{h(X) \neq h^*(X)\}\right]$;
(c) $R^* = \mathbb{E}_X\!\left[\min(\eta(X),\, 1 - \eta(X))\right]$;
(d) $R^* = \tfrac{1}{2} - \tfrac{1}{2}\,\mathbb{E}\!\left[\,|2\eta(X) - 1|\,\right]$.
Proof of (a): For any classifier $h$, we have

$$R(h) = \mathbb{E}_{XY}\!\left[\mathbb{1}\{h(X) \neq Y\}\right] = \mathbb{E}_X \mathbb{E}_{Y \mid X}\!\left[\mathbb{1}\{h(X) \neq Y\} \mid X\right] = \mathbb{E}_X\!\left[\eta(X)\,\mathbb{1}\{h(X) = 0\} + (1 - \eta(X))\,\mathbb{1}\{h(X) = 1\}\right],$$

where the second equality was derived through Fubini's theorem, conditioning on $X$ and integrating out $Y$.

Notice that $R(h)$ is minimised by taking, for every $x$,

$$h(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1 - \eta(x), \\ 0 & \text{otherwise,} \end{cases}$$

which is exactly $h^*$. Therefore the minimum possible risk is the Bayes risk, $R^* = R(h^*)$.
Proof of (b): For any classifier $h$,

$$\begin{aligned} R(h) - R^* &= R(h) - R(h^*) \\ &= \mathbb{E}_X\!\left[\eta(X)\,\mathbb{1}\{h(X) = 0\} + (1 - \eta(X))\,\mathbb{1}\{h(X) = 1\} - \eta(X)\,\mathbb{1}\{h^*(X) = 0\} - (1 - \eta(X))\,\mathbb{1}\{h^*(X) = 1\}\right] \\ &= \mathbb{E}_X\!\left[|2\eta(X) - 1| \cdot \mathbb{1}\{h(X) \neq h^*(X)\}\right] \\ &= 2\,\mathbb{E}_X\!\left[|\eta(X) - \tfrac{1}{2}| \cdot \mathbb{1}\{h(X) \neq h^*(X)\}\right], \end{aligned}$$

where the third equality holds because, on the event $\{h(X) \neq h^*(X)\}$, the two bracketed expressions differ by exactly $|\eta(X) - (1 - \eta(X))|$.
Proof of (c): Using (a),

$$R^* = R(h^*) = \mathbb{E}_X\!\left[\eta(X)\,\mathbb{1}\{h^*(X) = 0\} + (1 - \eta(X))\,\mathbb{1}\{h^*(X) = 1\}\right] = \mathbb{E}_X\!\left[\min(\eta(X),\, 1 - \eta(X))\right],$$

since $h^*(X) = 1$ exactly when $\eta(X) \geq 1 - \eta(X)$, so the surviving term is always the smaller of the two.
Proof of (d): Applying the identity $\min(a, b) = \tfrac{a + b}{2} - \tfrac{|a - b|}{2}$ with $a = \eta(X)$ and $b = 1 - \eta(X)$ (so that $a + b = 1$) to (c) gives

$$R^* = \mathbb{E}_X\!\left[\min(\eta(X),\, 1 - \eta(X))\right] = \tfrac{1}{2} - \tfrac{1}{2}\,\mathbb{E}\!\left[\,|2\eta(X) - 1|\,\right].$$
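As a sanity check (my example, not from the article), take $X \sim \mathrm{Uniform}(0, 1)$ with posterior $\eta(x) = x$; then the Bayes risk is $1/4$, the optimal threshold in (a) is $1/2$, and identities (c) and (d) can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)   # X ~ Uniform(0, 1)
eta = x                           # assumed posterior: P(Y = 1 | X = x) = x
y = (rng.uniform(size=x.size) < eta).astype(int)

def risk(t):
    # Risk of the threshold classifier h_t(x) = 1{x >= t}.
    return np.mean((x >= t).astype(int) != y)

ts = np.linspace(0.0, 1.0, 101)
print("best threshold  :", ts[np.argmin([risk(t) for t in ts])])  # near 0.5, per (a), up to Monte Carlo noise
print("R(h*)           :", risk(0.5))                              # near 0.25
print("E[min(eta,1-eta)]:", np.mean(np.minimum(eta, 1 - eta)))     # (c): near 0.25
print("1/2 - E|2eta-1|/2:", 0.5 - 0.5 * np.mean(np.abs(2 * eta - 1)))  # (d): near 0.25
```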
The general case, that the Bayes classifier minimises classification error when each element can belong to any of $n$ categories, proceeds by towering expectations as follows:

$$\begin{aligned} \mathbb{E}_Y\!\left[\mathbb{1}\{y \neq \hat{y}\}\right] &= \mathbb{E}_X \mathbb{E}_{Y \mid X}\!\left[\mathbb{1}\{y \neq \hat{y}\} \mid X = x\right] \\ &= \mathbb{E}_X\!\left[\sum_{r=1}^{n} \operatorname{P}(Y = r \mid X = x)\,\mathbb{1}\{\hat{y} \neq r\}\right] \\ &= \mathbb{E}_X\!\left[1 - \operatorname{P}(Y = \hat{y} \mid X = x)\right]. \end{aligned}$$

This is minimised by simultaneously minimising all of the terms of the expectation, which is achieved by using the classifier

$$h(x) = \underset{r}{\operatorname{argmax}}\ \operatorname{P}(Y = r \mid X = x)$$

for each observation $x$.
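The following sketch (illustrative, with a made-up posterior table) checks the last step numerically on a discrete feature space: the argmax rule attains the error $\mathbb{E}_X\!\left[1 - \operatorname{P}(Y = \hat{y} \mid X)\right]$, and an exhaustive search over all deterministic rules finds nothing better.

```python
import itertools
import numpy as np

# Assumed toy posterior table: rows are values of X (uniform), columns classes.
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.5, 0.4],
                 [0.3, 0.3, 0.4]])

def error(rule):
    # E_X[1 - P(Y = rule(x) | X = x)] with X uniform over the three rows.
    return np.mean([1 - post[i, rule[i]] for i in range(len(post))])

argmax_rule = tuple(np.argmax(post, axis=1))
best = min(itertools.product(range(3), repeat=3), key=error)
print("argmax rule    :", argmax_rule, "error:", error(argmax_rule))
print("exhaustive best:", best, "error:", error(best))  # matches the argmax rule
```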