Kernel Fisher discriminant analysis


In statistics, kernel Fisher discriminant analysis (KFD), [1] also known as generalized discriminant analysis [2] and kernel discriminant analysis, [3] is a kernelized version of linear discriminant analysis (LDA). It is named after Ronald Fisher.


Linear discriminant analysis

Intuitively, the idea of LDA is to find a projection where class separation is maximized. Given two sets of labeled data, $\mathbf{C}_1$ and $\mathbf{C}_2$, we can calculate the mean value of each class, $\mathbf{m}_1$ and $\mathbf{m}_2$, as

$$\mathbf{m}_i = \frac{1}{l_i}\sum_{n=1}^{l_i}\mathbf{x}_n^i,$$

where $l_i$ is the number of examples of class $\mathbf{C}_i$. The goal of linear discriminant analysis is to give a large separation of the class means while also keeping the in-class variance small. [4] This is formulated as maximizing, with respect to $\mathbf{w}$, the following ratio:

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w}},$$

where $\mathbf{S}_B$ is the between-class covariance matrix and $\mathbf{S}_W$ is the total within-class covariance matrix:

$$\begin{aligned}
\mathbf{S}_B &= (\mathbf{m}_2-\mathbf{m}_1)(\mathbf{m}_2-\mathbf{m}_1)^{\text{T}} \\
\mathbf{S}_W &= \sum_{i=1,2}\sum_{n=1}^{l_i}\left(\mathbf{x}_n^i-\mathbf{m}_i\right)\left(\mathbf{x}_n^i-\mathbf{m}_i\right)^{\text{T}}.
\end{aligned}$$

The maximum of the above ratio is attained at

$$\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1),$$

as can be shown by the Lagrange multiplier method (sketch of proof):

Maximizing $J(\mathbf{w})$ is equivalent to maximizing

$$\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}$$

subject to

$$\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} = 1.$$

This, in turn, is equivalent to maximizing $I(\mathbf{w},\lambda) = \mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w} - \lambda\left(\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} - 1\right)$, where $\lambda$ is the Lagrange multiplier.

At the maximum, the derivatives of $I$ with respect to $\mathbf{w}$ and $\lambda$ must be zero. Taking $\frac{dI}{d\mathbf{w}} = 0$ yields

$$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w},$$

which is trivially satisfied by $\mathbf{w} = c\,\mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1)$ and $\lambda = (\mathbf{m}_2-\mathbf{m}_1)^{\text{T}}\mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1)$, since $\mathbf{S}_B\mathbf{w} = (\mathbf{m}_2-\mathbf{m}_1)(\mathbf{m}_2-\mathbf{m}_1)^{\text{T}}\mathbf{w}$ always points in the direction of $\mathbf{m}_2-\mathbf{m}_1$.
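
The closed-form solution above is straightforward to compute. The following is a minimal NumPy sketch of two-class LDA on synthetic data; the toy data and all variable names are illustrative, not part of the cited presentation.

```python
import numpy as np

# Minimal two-class LDA sketch on synthetic, illustrative data.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # class 1 examples
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(60, 2))   # class 2 examples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                  # class means

# Within-class scatter S_W: sum of centred outer products over both classes
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Optimal direction (up to scale): w ∝ S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# One-dimensional projections; their means are well separated along w
print("projected class means:", (X1 @ w).mean(), (X2 @ w).mean())
```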

Extending LDA

To extend LDA to non-linear mappings, the data, given as the points $\mathbf{x}_i$, can be mapped to a new feature space, $F$, via some function $\phi$. In this new feature space, the function that needs to be maximized is [1]

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}},$$

where

$$\mathbf{S}_B^{\phi} = \left(\mathbf{m}_2^{\phi}-\mathbf{m}_1^{\phi}\right)\left(\mathbf{m}_2^{\phi}-\mathbf{m}_1^{\phi}\right)^{\text{T}}$$

and

$$\mathbf{S}_W^{\phi} = \sum_{i=1,2}\sum_{n=1}^{l_i}\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi}\right)\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi}\right)^{\text{T}}.$$

Further, note that $\mathbf{m}_i^{\phi} = \frac{1}{l_i}\sum_{j=1}^{l_i}\phi\left(\mathbf{x}_j^i\right)$. Explicitly computing the mappings $\phi(\mathbf{x}_i)$ and then performing LDA can be computationally expensive, and in many cases intractable. For example, $F$ may be infinite-dimensional. Thus, rather than explicitly mapping the data to $F$, the data can be implicitly embedded by rewriting the algorithm in terms of dot products and using kernel functions, in which the dot product in the new feature space is replaced by a kernel function, $k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})\cdot\phi(\mathbf{y})$.
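
As a concrete, hypothetical example of such a kernel, the sketch below builds a Gaussian RBF Gram matrix whose entries stand in for the dot products $\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}_j)$; the kernel choice and the width parameter `gamma` are illustrative assumptions, not prescribed by the sources.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2), evaluated pairwise."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * sq_dists)

# The Gram matrix K holds every pairwise kernel evaluation; these values play
# the role of the dot products phi(x_i) . phi(x_j) in the induced feature space.
X = np.random.default_rng(1).normal(size=(5, 3))
K = rbf_kernel(X, X)
print(K.shape)  # (5, 5)
```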

LDA can be reformulated in terms of dot products by first noting that $\mathbf{w}$ will have an expansion of the form [5]

$$\mathbf{w} = \sum_{i=1}^{l}\alpha_i\phi(\mathbf{x}_i),$$

where $l$ is the total number of labeled examples. Then note that

$$\mathbf{w}^{\text{T}}\mathbf{m}_i^{\phi} = \frac{1}{l_i}\sum_{j=1}^{l}\sum_{k=1}^{l_i}\alpha_j k\left(\mathbf{x}_j,\mathbf{x}_k^i\right) = \boldsymbol{\alpha}^{\text{T}}\mathbf{M}_i,$$

where

$$(\mathbf{M}_i)_j = \frac{1}{l_i}\sum_{k=1}^{l_i}k\left(\mathbf{x}_j,\mathbf{x}_k^i\right).$$

The numerator of $J(\mathbf{w})$ can then be written as

$$\mathbf{w}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{w} = \boldsymbol{\alpha}^{\text{T}}\mathbf{M}\boldsymbol{\alpha}, \qquad \mathbf{M} = (\mathbf{M}_2-\mathbf{M}_1)(\mathbf{M}_2-\mathbf{M}_1)^{\text{T}}.$$

Similarly, the denominator can be written as

$$\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w} = \boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha}, \qquad \mathbf{N} = \sum_{j=1,2}\mathbf{K}_j\left(\mathbf{I}-\mathbf{1}_{l_j}\right)\mathbf{K}_j^{\text{T}},$$

with the $(n,m)^{\text{th}}$ component of $\mathbf{K}_j$ defined as $k\left(\mathbf{x}_n,\mathbf{x}_m^j\right)$, $\mathbf{I}$ the identity matrix, and $\mathbf{1}_{l_j}$ the matrix with all entries equal to $1/l_j$. This identity can be derived by starting out with the expression for $\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}$ and using the expansion of $\mathbf{w}$ and the definitions of $\mathbf{S}_W^{\phi}$ and $\mathbf{m}_i^{\phi}$.

With these equations for the numerator and denominator of $J(\mathbf{w})$, the equation for $J$ can be rewritten as

$$J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^{\text{T}}\mathbf{M}\boldsymbol{\alpha}}{\boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha}}.$$

Then, differentiating and setting equal to zero gives

$$\left(\boldsymbol{\alpha}^{\text{T}}\mathbf{M}\boldsymbol{\alpha}\right)\mathbf{N}\boldsymbol{\alpha} = \left(\boldsymbol{\alpha}^{\text{T}}\mathbf{N}\boldsymbol{\alpha}\right)\mathbf{M}\boldsymbol{\alpha}.$$

Since only the direction of $\mathbf{w}$, and hence the direction of $\boldsymbol{\alpha}$, matters, the above can be solved for $\boldsymbol{\alpha}$ as

$$\boldsymbol{\alpha} = \mathbf{N}^{-1}(\mathbf{M}_2-\mathbf{M}_1).$$

Note that in practice, $\mathbf{N}$ is usually singular and so a multiple of the identity is added to it: [1]

$$\mathbf{N}_{\epsilon} = \mathbf{N} + \epsilon\mathbf{I}.$$

Given the solution for $\boldsymbol{\alpha}$, the projection of a new data point is given by [1]

$$y(\mathbf{x}) = \left(\mathbf{w}\cdot\phi(\mathbf{x})\right) = \sum_{i=1}^{l}\alpha_i k(\mathbf{x}_i,\mathbf{x}).$$
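
The following is a hedged sketch of the two-class procedure just described, assuming the `rbf_kernel` helper from the earlier snippet; `eps` plays the role of the regularization constant $\epsilon$ and its value is illustrative, not one prescribed by the sources.

```python
import numpy as np

def kfd_two_class(X1, X2, kernel, eps=1e-3):
    """Two-class kernel Fisher discriminant: returns training data and alpha."""
    X = np.vstack([X1, X2])                        # all l training points
    K1, K2 = kernel(X, X1), kernel(X, X2)          # (K_j)_{nm} = k(x_n, x_m^j)
    M1, M2 = K1.mean(axis=1), K2.mean(axis=1)      # (M_j)_n = (1/l_j) sum_m k(x_n, x_m^j)

    # N = sum_j K_j (I - 1_{l_j}) K_j^T, with 1_{l_j} the matrix of entries 1/l_j
    N = np.zeros((len(X), len(X)))
    for K_j, l_j in ((K1, len(X1)), (K2, len(X2))):
        N += K_j @ (np.eye(l_j) - np.full((l_j, l_j), 1.0 / l_j)) @ K_j.T
    N += eps * np.eye(len(X))                      # regularize the (usually singular) N

    alpha = np.linalg.solve(N, M2 - M1)            # alpha ∝ N^{-1} (M_2 - M_1)
    return X, alpha

def project(X_new, X, alpha, kernel):
    """Projection y(x) = sum_i alpha_i k(x_i, x), for each row of X_new."""
    return kernel(np.atleast_2d(X_new), X) @ alpha
```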

Multi-class KFD

The extension to cases where there are more than two classes is relatively straightforward. [2] [6] [7] Let $c$ be the number of classes. Then multi-class KFD involves projecting the data into a $(c-1)$-dimensional space using $(c-1)$ discriminant functions

$$y_i = \mathbf{w}_i^{\text{T}}\phi(\mathbf{x}), \qquad i = 1,\ldots,c-1.$$

This can be written in matrix notation

$$\mathbf{y} = \mathbf{W}^{\text{T}}\phi(\mathbf{x}),$$

where the $\mathbf{w}_i$ are the columns of $\mathbf{W}$. [6] Further, the between-class covariance matrix is now

$$\mathbf{S}_B^{\phi} = \sum_{i=1}^{c}l_i\left(\mathbf{m}_i^{\phi}-\mathbf{m}^{\phi}\right)\left(\mathbf{m}_i^{\phi}-\mathbf{m}^{\phi}\right)^{\text{T}},$$

where $\mathbf{m}^{\phi}$ is the mean of all the data in the new feature space. The within-class covariance matrix is

$$\mathbf{S}_W^{\phi} = \sum_{i=1}^{c}\sum_{n=1}^{l_i}\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi}\right)\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi}\right)^{\text{T}}.$$

The solution is now obtained by maximizing

$$J(\mathbf{W}) = \frac{\left|\mathbf{W}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{W}\right|}{\left|\mathbf{W}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{W}\right|}.$$

The kernel trick can again be used and the goal of multi-class KFD becomes [7]

$$\mathbf{A}^{*} = \underset{\mathbf{A}}{\operatorname{argmax}}\; \frac{\left|\mathbf{A}^{\text{T}}\mathbf{M}\mathbf{A}\right|}{\left|\mathbf{A}^{\text{T}}\mathbf{N}\mathbf{A}\right|},$$

where $\mathbf{A} = [\boldsymbol{\alpha}_1,\ldots,\boldsymbol{\alpha}_{c-1}]$ and

$$\begin{aligned}
\mathbf{M} &= \sum_{j=1}^{c}l_j\left(\mathbf{M}_j-\mathbf{M}_{*}\right)\left(\mathbf{M}_j-\mathbf{M}_{*}\right)^{\text{T}} \\
\mathbf{N} &= \sum_{j=1}^{c}\mathbf{K}_j\left(\mathbf{I}-\mathbf{1}_{l_j}\right)\mathbf{K}_j^{\text{T}}.
\end{aligned}$$

The $\mathbf{M}_j$ are defined as in the above section and $\mathbf{M}_{*}$ is defined as

$$\left(\mathbf{M}_{*}\right)_j = \frac{1}{l}\sum_{k=1}^{l}k(\mathbf{x}_j,\mathbf{x}_k).$$

$\mathbf{A}^{*}$ can then be computed by finding the $(c-1)$ leading eigenvectors of $\mathbf{N}^{-1}\mathbf{M}$. [7] Furthermore, the projection of a new input, $\mathbf{x}_t$, is given by [7]

$$\mathbf{y}(\mathbf{x}_t) = \left(\mathbf{A}^{*}\right)^{\text{T}}\mathbf{K}_t,$$

where the $i^{\text{th}}$ component of $\mathbf{K}_t$ is given by $k(\mathbf{x}_i,\mathbf{x}_t)$.
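
A minimal sketch of the multi-class case under the same assumptions (the `rbf_kernel` helper and an illustrative regularization constant `eps`); it computes the leading eigenvectors of $\mathbf{N}^{-1}\mathbf{M}$ and keeps the top $c-1$ of them as the columns of $\mathbf{A}^{*}$.

```python
import numpy as np

def kfd_multiclass(Xs, kernel, eps=1e-3):
    """Multi-class KFD sketch: Xs is a list of per-class arrays; returns X and A*."""
    X = np.vstack(Xs)
    l = len(X)
    M_star = kernel(X, X).mean(axis=1)             # (M_*)_n = (1/l) sum_k k(x_n, x_k)

    M = np.zeros((l, l))
    N = eps * np.eye(l)                            # regularized as in the two-class case
    for X_j in Xs:
        K_j = kernel(X, X_j)
        M_j = K_j.mean(axis=1)
        l_j = len(X_j)
        M += l_j * np.outer(M_j - M_star, M_j - M_star)
        N += K_j @ (np.eye(l_j) - np.full((l_j, l_j), 1.0 / l_j)) @ K_j.T

    # Columns of A* are the leading eigenvectors of N^{-1} M; keep the top c-1.
    # (Eigenvalues are real up to round-off, hence the .real.)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(N, M))
    order = np.argsort(eigvals.real)[::-1]
    A = eigvecs[:, order[: len(Xs) - 1]].real
    return X, A
```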

Classification using KFD

In both two-class and multi-class KFD, the class label of a new input can be assigned as [7]

$$f(\mathbf{x}) = \underset{j}{\operatorname{arg\,min}}\; D\left(\mathbf{y}(\mathbf{x}), \bar{\mathbf{y}}_j\right),$$

where $\bar{\mathbf{y}}_j$ is the projected mean for class $j$ and $D(\cdot,\cdot)$ is a distance function.
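
A short sketch of this decision rule, assuming the `project` helper from the two-class snippet above and using Euclidean distance as the distance function $D$; both choices are illustrative.

```python
import numpy as np

def classify(x_new, X, alpha, kernel, projected_class_means):
    """Assign the class whose projected mean is nearest (Euclidean distance as D)."""
    y = project(x_new, X, alpha, kernel)
    return int(np.argmin([np.linalg.norm(y - m) for m in projected_class_means]))
```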

Applications

Kernel discriminant analysis has been used in a variety of applications. These include:

  - face recognition [8] [9] [12] and detection [10] [11]
  - palmprint recognition [13]
  - classification of malignant and benign clustered microcalcifications [14]

See also

  - Linear discriminant analysis
  - Kernel principal component analysis
  - Kernel trick

References

  1. Mika, S.; Rätsch, G.; Weston, J.; Schölkopf, B.; Müller, K.R. (1999). Fisher discriminant analysis with kernels. Neural Networks for Signal Processing. Vol. IX. pp. 41–48. CiteSeerX 10.1.1.35.9904. doi:10.1109/NNSP.1999.788121. ISBN 978-0-7803-5673-3. S2CID 8473401.
  2. Baudat, G.; Anouar, F. (2000). "Generalized discriminant analysis using a kernel approach". Neural Computation. 12 (10): 2385–2404. CiteSeerX 10.1.1.412.760. doi:10.1162/089976600300014980. PMID 11032039. S2CID 7036341.
  3. Li, Y.; Gong, S.; Liddell, H. (2003). "Recognising trajectories of facial identities using kernel discriminant analysis". Image and Vision Computing. 21 (13–14): 1077–1086. CiteSeerX 10.1.1.2.6315. doi:10.1016/j.imavis.2003.08.010.
  4. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. New York, NY: Springer.
  5. Schölkopf, B.; Herbrich, R.; Smola, A. (2001). A generalized representer theorem. Computational Learning Theory. Lecture Notes in Computer Science. Vol. 2111. pp. 416–426. CiteSeerX 10.1.1.42.8617. doi:10.1007/3-540-44581-1_27. ISBN 978-3-540-42343-0.
  6. Duda, R.; Hart, P.; Stork, D. (2001). Pattern Classification. New York, NY: Wiley.
  7. Zhang, J.; Ma, K.K. (2004). "Kernel fisher discriminant for texture classification".
  8. Liu, Q.; Lu, H.; Ma, S. (2004). "Improving kernel Fisher discriminant analysis for face recognition". IEEE Transactions on Circuits and Systems for Video Technology. 14 (1): 42–49. doi:10.1109/tcsvt.2003.818352. S2CID 39657721.
  9. Liu, Q.; Huang, R.; Lu, H.; Ma, S. (2002). "Face recognition using kernel-based Fisher discriminant analysis". IEEE International Conference on Automatic Face and Gesture Recognition.
  10. Kurita, T.; Taguchi, T. (2002). A modification of kernel-based Fisher discriminant analysis for face detection. IEEE International Conference on Automatic Face and Gesture Recognition. pp. 300–305. CiteSeerX 10.1.1.100.3568. doi:10.1109/AFGR.2002.1004170. ISBN 978-0-7695-1602-8. S2CID 7581426.
  11. Feng, Y.; Shi, P. (2004). "Face detection based on kernel fisher discriminant analysis". IEEE International Conference on Automatic Face and Gesture Recognition.
  12. Yang, J.; Frangi, A.F.; Yang, J.Y.; Zhang, D.; Jin, Z. (2005). "KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 27 (2): 230–244. CiteSeerX 10.1.1.330.1179. doi:10.1109/tpami.2005.33. PMID 15688560. S2CID 9771368.
  13. Wang, Y.; Ruan, Q. (2006). "Kernel fisher discriminant analysis for palmprint recognition". International Conference on Pattern Recognition.
  14. Wei, L.; Yang, Y.; Nishikawa, R.M.; Jiang, Y. (2005). "A study on several machine-learning methods for classification of malignant and benign clustered microcalcifications". IEEE Transactions on Medical Imaging. 24 (3): 371–380. doi:10.1109/tmi.2004.842457. PMID 15754987. S2CID 36691320.
  15. Malmgren, T. (1997). "An iterative nonlinear discriminant analysis program: IDA 1.0". Computer Physics Communications. 106 (3): 230–236. Bibcode:1997CoPhC.106..230M. doi:10.1016/S0010-4655(97)00100-8.