Quadratic classifier

In statistics, a quadratic classifier is a statistical classifier that uses a quadratic decision surface to separate measurements of two or more classes of objects or events. It is a more general version of the linear classifier.

The classification problem

Statistical classification considers a set of vectors of observations x of an object or event, each of which has a known type y. This set is referred to as the training set. The problem is then to determine, for a given new observation vector, what the best class should be. For a quadratic classifier, the correct solution is assumed to be quadratic in the measurements, so the class y will be decided based on

$\mathbf{x}^\mathsf{T} A \mathbf{x} + \mathbf{b}^\mathsf{T} \mathbf{x} + c$

where $A$ is a matrix of quadratic coefficients, $\mathbf{b}$ a vector of linear coefficients, and $c$ a constant.

In the special case where each observation consists of two measurements, this means that the surfaces separating the classes will be conic sections (i.e., either a line, a circle or ellipse, a parabola or a hyperbola). In this sense, we can state that a quadratic model is a generalization of the linear model, and its use is justified by the desire to extend the classifier's ability to represent more complex separating surfaces.
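As an illustration, here is a minimal sketch in Python with NumPy (the parameter values A, b, and c are arbitrary assumptions for the example, not learned from data) of evaluating the quadratic decision function and thresholding it at zero:

    import numpy as np

    # Arbitrary illustrative parameters for two-dimensional measurements;
    # in practice these would be estimated from the training set.
    A = np.array([[1.0, 0.5],
                  [0.5, -2.0]])   # symmetric matrix of quadratic coefficients
    b = np.array([0.3, -1.0])     # vector of linear coefficients
    c = -0.25                     # constant offset

    def quadratic_score(x):
        """Evaluate x^T A x + b^T x + c for a measurement vector x."""
        return x @ A @ x + b @ x + c

    # Classify by the sign of the score; the decision surface is the
    # conic section on which the score equals zero.
    x_new = np.array([0.8, 0.1])
    label = 1 if quadratic_score(x_new) > 0 else 0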

Quadratic discriminant analysis

Quadratic discriminant analysis (QDA) is closely related to linear discriminant analysis (LDA), where it is assumed that the measurements from each class are normally distributed. [1] Unlike LDA, however, in QDA there is no assumption that the covariance of each of the classes is identical. [2] When the normality assumption is true, the best possible test for the hypothesis that a given measurement is from a given class is the likelihood ratio test. Suppose there are only two groups, with means $\mu_0, \mu_1$ and covariance matrices $\Sigma_0, \Sigma_1$ corresponding to classes $y = 0$ and $y = 1$ respectively. Then the likelihood ratio is given by

$\text{LR} = \dfrac{|2\pi\Sigma_1|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu_1)^\mathsf{T} \Sigma_1^{-1} (x - \mu_1)\right)}{|2\pi\Sigma_0|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu_0)^\mathsf{T} \Sigma_0^{-1} (x - \mu_0)\right)},$

and the measurement is assigned to class $y = 1$ when the ratio exceeds some threshold $t$. After some rearrangement, it can be shown that the resulting separating surface between the classes is a quadratic. In practice, the sample estimates of the mean vectors and variance-covariance matrices are substituted for the population quantities in this formula.
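Under these assumptions, a minimal QDA sketch in Python with NumPy (the function names and two-class setup are illustrative; libraries such as scikit-learn provide a ready-made QuadraticDiscriminantAnalysis):

    import numpy as np

    def fit_qda(X, y):
        """Estimate per-class means and covariance matrices from training data."""
        params = {}
        for cls in (0, 1):
            Xc = X[y == cls]                     # rows of X belonging to this class
            params[cls] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
        return params

    def log_gaussian(x, mean, cov):
        """Log-density of a multivariate normal distribution at x."""
        d = x - mean
        k = len(mean)
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (k * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

    def classify(x, params, log_threshold=0.0):
        """Assign class 1 when the log-likelihood ratio exceeds the threshold."""
        llr = log_gaussian(x, *params[1]) - log_gaussian(x, *params[0])
        return 1 if llr > log_threshold else 0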

Other

While QDA is the most commonly used method for obtaining a classifier, other methods are also possible. One such method is to create a longer measurement vector from the old one by adding all pairwise products of individual measurements. For instance, the vector

$[x_1, x_2, x_3]$

would become

$[x_1, x_2, x_3, x_1^2, x_1 x_2, x_1 x_3, x_2^2, x_2 x_3, x_3^2].$

Finding a quadratic classifier for the original measurements would then become the same as finding a linear classifier based on the expanded measurement vector. This observation has been used in extending neural network models; [3] the "circular" case, which corresponds to introducing only the sum of pure quadratic terms $x_1^2 + x_2^2 + \cdots$ with no mixed products, has been proven to be the optimal compromise between extending the classifier's representation power and controlling the risk of overfitting (the Vapnik–Chervonenkis dimension). [4]
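A minimal sketch of this expansion (the helper below is a hypothetical illustration using NumPy; scikit-learn's PolynomialFeatures offers an equivalent transformation):

    import numpy as np
    from itertools import combinations_with_replacement

    def quadratic_expand(x):
        """Append all pairwise products x_i * x_j (with i <= j) to the vector x."""
        pairs = combinations_with_replacement(range(len(x)), 2)
        return np.concatenate([x, [x[i] * x[j] for i, j in pairs]])

    x = np.array([2.0, 3.0, 5.0])
    # [x1, x2, x3] -> [x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2]
    print(quadratic_expand(x))   # [ 2.  3.  5.  4.  6. 10.  9. 15. 25.]

A linear classifier (for example, logistic regression or a perceptron) trained on the expanded vectors then realizes a quadratic decision surface in the original measurement space.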

For linear classifiers based only on dot products, these expanded measurements do not have to be actually computed, since the dot product in the higher-dimensional space is simply related to that in the original space. This is an example of the so-called kernel trick, which can be applied to linear discriminant analysis as well as the support vector machine.
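As a small numerical check of this relationship (a sketch, not tied to any particular library): with the homogeneous polynomial kernel k(u, v) = (u·v)^2, the kernel value computed in the original space equals the dot product of the explicitly expanded vectors, so the expansion never has to be formed:

    import numpy as np

    u = np.array([1.0, 2.0, 3.0])
    v = np.array([0.5, -1.0, 2.0])

    def phi(x):
        """Explicit quadratic feature map: all ordered products x_i * x_j."""
        return np.outer(x, x).ravel()

    print(np.dot(u, v) ** 2)        # kernel evaluated in the original space: 20.25
    print(np.dot(phi(u), phi(v)))   # same value via the explicit expansion: 20.25

The inhomogeneous kernel (1 + u·v)^2 plays the same role when the linear terms of the expansion are wanted as well.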

Related Research Articles

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.

Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM), which may possess PR capabilities but whose primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.

<span class="mw-page-title-main">Covariance matrix</span> Measure of covariance of components of a random vector

In probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.

<span class="mw-page-title-main">Expectation–maximization algorithm</span> Iterative method for finding maximum likelihood estimates in statistical models

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. It can be used, for example, to estimate a mixture of Gaussians, or to solve the multiple linear regression problem.
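As a small illustration of this E-step/M-step alternation, a minimal sketch for a two-component one-dimensional Gaussian mixture (the initialization scheme and iteration count are arbitrary assumptions, not a reference implementation):

    import numpy as np

    def em_two_gaussians(x, n_iter=50):
        """Fit a two-component 1-D Gaussian mixture to the sample x by EM."""
        mu = np.array([x.min(), x.max()])     # crude initial means
        var = np.array([x.var(), x.var()])    # initial variances
        pi = np.array([0.5, 0.5])             # initial mixing weights
        for _ in range(n_iter):
            # E step: responsibility of each component for each point.
            dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
                   / np.sqrt(2 * np.pi * var)
            resp = dens / dens.sum(axis=1, keepdims=True)
            # M step: re-estimate weights, means, and variances.
            nk = resp.sum(axis=0)
            pi = nk / len(x)
            mu = (resp * x[:, None]).sum(axis=0) / nk
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        return pi, mu, var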

In statistics, propagation of uncertainty is the effect of variables' uncertainties on the uncertainty of a function based on them. When the variables are the values of experimental measurements, they have uncertainties due to measurement limitations, which propagate due to the combination of variables in the function.

The Mahalanobis distance is a measure of the distance between a point P and a distribution D, introduced by P. C. Mahalanobis in 1936. Mahalanobis's definition was prompted by the problem of identifying the similarities of skulls based on measurements in 1927.

Hotelling's T-squared distribution: Type of probability distribution

In statistics, particularly in hypothesis testing, the Hotelling's T-squared distribution (T²), proposed by Harold Hotelling, is a multivariate probability distribution that is tightly related to the F-distribution and is most notable for arising as the distribution of a set of sample statistics that are natural generalizations of the statistics underlying the Student's t-distribution. The Hotelling's t-squared statistic (t²) is a generalization of Student's t-statistic that is used in multivariate hypothesis testing.

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

In statistics, the theory of minimum norm quadratic unbiased estimation (MINQUE) was developed by C. R. Rao. MINQUE is a theory alongside other estimation methods in estimation theory, such as the method of moments or maximum likelihood estimation. Similar to the theory of best linear unbiased estimation, MINQUE is specifically concerned with linear regression models. The method was originally conceived to estimate heteroscedastic error variance in multiple linear regression. MINQUE estimators also provide an alternative to maximum likelihood estimators or restricted maximum likelihood estimators for variance components in mixed effects models. MINQUE estimators are quadratic forms of the response variable and are used to estimate a linear function of the variances.

The James–Stein estimator is a biased estimator of the mean of (possibly) correlated Gaussian distributed random variables with unknown means.

In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model. It is used when there is a non-zero amount of correlation between the residuals in the regression model. GLS is employed to improve statistical efficiency and reduce the risk of drawing erroneous inferences, as compared to conventional least squares and weighted least squares methods. It was first described by Alexander Aitken in 1935.

In directional statistics, the von Mises–Fisher distribution is a probability distribution on the $(p-1)$-sphere in $\mathbb{R}^p$. If $p = 2$, the distribution reduces to the von Mises distribution on the circle.

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients and ultimately allowing the out-of-sample prediction of the regressand conditional on observed values of the regressors. The simplest and most widely used version of this model is the normal linear model, in which the regressand given the regressors is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters, so-called conjugate priors, the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.

In statistics, the multivariate t-distribution is a multivariate probability distribution. It is a generalization to random vectors of the Student's t-distribution, which is a distribution applicable to univariate random variables. While the case of a random matrix could be treated within this structure, the matrix t-distribution is distinct and makes particular use of the matrix structure.

In probability and statistics, a natural exponential family (NEF) is a class of probability distributions that is a special case of an exponential family (EF).

The topic of heteroskedasticity-consistent (HC) standard errors arises in statistics and econometrics in the context of linear regression and time series analysis. These are also known as heteroskedasticity-robust standard errors or Eicker–Huber–White standard errors, to recognize the contributions of Friedhelm Eicker, Peter J. Huber, and Halbert White.

<span class="mw-page-title-main">Normal-inverse-gamma distribution</span>

In probability theory and statistics, the normal-inverse-gamma distribution is a four-parameter family of multivariate continuous probability distributions. It is the conjugate prior of a normal distribution with unknown mean and variance.

<span class="mw-page-title-main">Generalized chi-squared distribution</span>

In probability theory and statistics, the generalized chi-squared distribution is the distribution of a quadratic form of a multinormal variable, or a linear combination of different normal variables and squares of normal variables. Equivalently, it is also a linear sum of independent noncentral chi-square variables and a normal variable. There are several other such generalizations for which the same term is sometimes used; some of them are special cases of the family discussed here, for example the gamma distribution.

In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.

In the mathematical theory of probability, multivariate Laplace distributions are extensions of the Laplace distribution and the asymmetric Laplace distribution to multiple variables. The marginal distributions of symmetric multivariate Laplace distribution variables are Laplace distributions. The marginal distributions of asymmetric multivariate Laplace distribution variables are asymmetric Laplace distributions.

References

Citations

  1. Tharwat, Alaa (2016). "Linear vs. quadratic discriminant analysis classifier: a tutorial". International Journal of Applied Pattern Recognition. 3 (2): 145. doi:10.1504/IJAPR.2016.079050. ISSN 2049-887X.
  2. "Linear & Quadratic Discriminant Analysis · UC Business Analytics R Programming Guide". uc-r.github.io. Retrieved 2020-03-29.
  3. Cover TM (1965). "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition". IEEE Transactions on Electronic Computers. EC-14 (3): 326–334. doi:10.1109/pgec.1965.264137.
  4. Ridella S, Rovetta S, Zunino R (1997). "Circular backpropagation networks for classification". IEEE Transactions on Neural Networks. 8 (1): 84–97. doi:10.1109/72.554194. PMID 18255613.
