Least-squares support-vector machine

Least-squares support-vector machines (LS-SVM) are least-squares versions of support-vector machines (SVM), used in statistics and statistical modeling. SVMs are a set of related supervised learning methods that analyze data and recognize patterns, and are used for classification and regression analysis. In the least-squares version one finds the solution by solving a set of linear equations instead of the convex quadratic programming (QP) problem required for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. [1] LS-SVMs are a class of kernel-based learning methods.

From support-vector machine to least-squares support-vector machine

Given a training set $\{x_i, y_i\}_{i=1}^{N}$ with input data $x_i \in \mathbb{R}^n$ and corresponding binary class labels $y_i \in \{-1, +1\}$, the SVM [2] classifier, according to Vapnik's original formulation, satisfies the following conditions:

$$
\begin{cases} w^T \phi(x_i) + b \ge 1, & \text{if } y_i = +1, \\ w^T \phi(x_i) + b \le -1, & \text{if } y_i = -1. \end{cases}
$$

The spiral data: $y_i = 1$ for blue data points, $y_i = -1$ for red data points.

which is equivalent to

$$y_i \left[ w^T \phi(x_i) + b \right] \ge 1, \quad i = 1, \ldots, N,$$

where $\phi(x)$ is the nonlinear map from the original space to the high- or infinite-dimensional space.

Inseparable data

In case such a separating hyperplane does not exist, we introduce so-called slack variables $\xi_i \ge 0$ such that

$$y_i \left[ w^T \phi(x_i) + b \right] \ge 1 - \xi_i, \quad i = 1, \ldots, N.$$

According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem:

$$\min_{w, b, \xi} \; J_1(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^{N} \xi_i,$$

$$\text{subject to } \; y_i \left[ w^T \phi(x_i) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N.$$

The result of the SVM classifier.

To solve this problem, we could construct the Lagrangian function:

$$L_1(w, b, \xi; \alpha, \beta) = \frac{1}{2} w^T w + c \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left\{ y_i \left[ w^T \phi(x_i) + b \right] - 1 + \xi_i \right\} - \sum_{i=1}^{N} \beta_i \xi_i,$$

where $\alpha_i \ge 0$ and $\beta_i \ge 0$ $(i = 1, \ldots, N)$ are the Lagrangian multipliers. The optimal point is a saddle point of the Lagrangian function, and then we obtain

$$\frac{\partial L_1}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i),$$
$$\frac{\partial L_1}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0,$$
$$\frac{\partial L_1}{\partial \xi_i} = 0 \;\Rightarrow\; c - \alpha_i - \beta_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le c, \quad i = 1, \ldots, N.$$

By substituting $w$ by its expression in the Lagrangian formed from the appropriate objective and constraints, we will get the following quadratic programming problem:

$$\max_{\alpha} \; Q_1(\alpha) = -\frac{1}{2} \sum_{i, j = 1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{N} \alpha_i,$$

where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is called the kernel function. Solving this QP problem subject to the constraints $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \le \alpha_i \le c$, we will get the hyperplane in the high-dimensional space and hence the classifier $y(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \right)$ in the original space.
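
In practice the dual QP above is handed to an off-the-shelf solver rather than derived by hand. The following sketch is only an illustration under assumptions not in the original text: it generates synthetic two-spiral data and fits scikit-learn's SVC, a standard QP-based SVM with an RBF kernel.

```python
# Minimal sketch (assumption): a QP-based SVM with an RBF kernel on synthetic spiral data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
t = rng.uniform(0.5, 3.0, 200)
angle = 2 * np.pi * t
X_pos = np.c_[t * np.cos(angle), t * np.sin(angle)]   # class +1 spiral arm
X_neg = -X_pos                                        # class -1 spiral arm (rotated by pi)
X = np.vstack([X_pos, X_neg]) + 0.05 * rng.standard_normal((400, 2))
y = np.r_[np.ones(200), -np.ones(200)]

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)             # C plays the role of c above
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```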

Least-squares SVM formulation

The least-squares version of the SVM classifier is obtained by reformulating the minimization problem as

$$\min_{w, b, e} \; J_2(w, b, e) = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^{N} e_i^2,$$

subject to the equality constraints

$$y_i \left[ w^T \phi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, N.$$

The least-squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets $y_i = \pm 1$.

Using $y_i^2 = 1$, we have

$$\sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i e_i)^2 = \sum_{i=1}^{N} \left( y_i - \left( w^T \phi(x_i) + b \right) \right)^2,$$

with $y_i e_i = y_i - \left( w^T \phi(x_i) + b \right)$. Notice that this error would also make sense for least-squares data fitting, so that the same end result holds for the regression case.

Hence the LS-SVM classifier formulation is equivalent to

$$J_2(w, b) = \mu E_W + \zeta E_D,$$

with $E_W = \frac{1}{2} w^T w$ and $E_D = \frac{1}{2} \sum_{i=1}^{N} e_i^2 = \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \left( w^T \phi(x_i) + b \right) \right)^2$.
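
With a linear kernel, i.e. $\phi(x) = x$, this cost is ordinary ridge regression on the $\pm 1$ targets with ridge parameter $\mu/\zeta$. The following minimal numpy sketch (an illustrative assumption; the function name is hypothetical) solves that special case in closed form, handling the bias term by centering:

```python
# Minimal sketch (assumption): with phi(x) = x, minimizing
#   mu/2 * w'w + zeta/2 * sum_i (y_i - w'x_i - b)^2
# jointly over w and b is ridge regression on centered data with ridge parameter mu/zeta.
import numpy as np

def lssvm_linear_primal(X, y, mu=1.0, zeta=10.0):
    """Closed-form w, b for the linear-kernel LS-SVM primal."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean                      # centering absorbs the bias term b
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + (mu / zeta) * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b
```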

The result of the LS-SVM classifier.

Both $\mu$ and $\zeta$ should be considered as hyperparameters that tune the amount of regularization versus the sum-squared error. The solution depends only on the ratio $\gamma = \zeta / \mu$, therefore the original formulation uses only $\gamma$ as a tuning parameter. We use both $\mu$ and $\zeta$ as parameters in order to provide a Bayesian interpretation of LS-SVM.

The solution of the LS-SVM regressor will be obtained after we construct the Lagrangian function (since the solution depends only on the ratio $\gamma = \zeta/\mu$, we may take $\mu = 1$ and $\zeta = \gamma$ here without loss of generality):

$$L_2(w, b, e; \alpha) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 - \sum_{i=1}^{N} \alpha_i \left\{ \left[ w^T \phi(x_i) + b \right] + e_i - y_i \right\},$$

where $\alpha_i \in \mathbb{R}$ are the Lagrange multipliers. The conditions for optimality are

$$\frac{\partial L_2}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i \phi(x_i),$$
$$\frac{\partial L_2}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i = 0,$$
$$\frac{\partial L_2}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = \gamma e_i, \quad i = 1, \ldots, N,$$
$$\frac{\partial L_2}{\partial \alpha_i} = 0 \;\Rightarrow\; y_i = w^T \phi(x_i) + b + e_i, \quad i = 1, \ldots, N.$$

Elimination of $w$ and $e$ will yield a linear system instead of a quadratic programming problem:

$$\begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix},$$

with $Y = [y_1, \ldots, y_N]^T$, $1_N = [1, \ldots, 1]^T$ and $\alpha = [\alpha_1, \ldots, \alpha_N]^T$. Here, $I_N$ is an $N \times N$ identity matrix, and $\Omega \in \mathbb{R}^{N \times N}$ is the kernel matrix defined by $\Omega_{ij} = \phi(x_i)^T \phi(x_j) = K(x_i, x_j)$.
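
The linear system above can be assembled and solved directly. The following minimal numpy sketch is an assumed illustration with an RBF kernel (function names and parameter defaults are hypothetical, not part of the original text):

```python
# Minimal sketch (assumption): solve the LS-SVM system
#   [ 0     1_N^T           ] [ b     ]   [ 0 ]
#   [ 1_N   Omega + I/gamma ] [ alpha ] = [ Y ]
# with the RBF kernel Omega_ij = exp(-||x_i - x_j||^2 / sigma^2).
import numpy as np

def fit_lssvm(X, y, gamma=10.0, sigma=1.0):
    """Return (alpha, b) from one dense linear solve."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Omega = np.exp(-sq_dists / sigma**2)          # kernel matrix Omega_ij = K(x_i, x_j)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                                # first row:    [0, 1_N^T]
    A[1:, 0] = 1.0                                # first column: [0; 1_N]
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]                        # alpha, b

def predict_lssvm(X_train, alpha, b, X_new, sigma=1.0):
    """Classify new points via y(x) = sign(sum_i alpha_i K(x, x_i) + b)."""
    sq = np.sum((X_new[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    return np.sign(np.exp(-sq / sigma**2) @ alpha + b)
```

For large $N$, iterative solvers such as conjugate gradients are typically preferred over the dense solve shown here.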

Kernel function K

For the kernel function K(•, •) one typically has the following choices:

- Linear kernel: $K(x, x_i) = x_i^T x$,
- Polynomial kernel of degree $d$: $K(x, x_i) = \left( 1 + x_i^T x / c \right)^d$,
- Radial basis function (RBF) kernel: $K(x, x_i) = \exp\left( -\|x - x_i\|^2 / \sigma^2 \right)$,
- MLP kernel: $K(x, x_i) = \tanh\left( k \, x_i^T x + \theta \right)$,

where $d$, $c$, $\sigma$, $k$ and $\theta$ are constants. Notice that the Mercer condition holds for all $c$ and $\sigma$ values in the polynomial and RBF case, but not for all possible choices of $k$ and $\theta$ in the MLP case. The scale parameters $c$, $\sigma$ and $k$ determine the scaling of the inputs in the polynomial, RBF and MLP kernel function. This scaling is related to the bandwidth of the kernel in statistics, where it is shown that the bandwidth is an important parameter of the generalization behavior of a kernel method.
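
As a hedged illustration (the parameter defaults are arbitrary assumptions), these kernel choices can be written as short functions:

```python
# Minimal sketch (assumption): the kernel choices listed above, for vectors x and x_i.
import numpy as np

def k_linear(x, xi):
    return xi @ x

def k_polynomial(x, xi, c=1.0, d=3):
    return (1.0 + xi @ x / c) ** d

def k_rbf(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma**2)

def k_mlp(x, xi, k=1.0, theta=0.0):
    # Mercer's condition holds only for certain (k, theta) choices here.
    return np.tanh(k * (xi @ x) + theta)
```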

Bayesian interpretation for LS-SVM

A Bayesian interpretation of the SVM has been proposed by Smola et al. They showed that the use of different kernels in SVM can be regarded as defining different prior probability distributions on the functional space, as $P[f] \propto \exp\left( -\beta \|\hat{P} f\|^2 \right)$. Here $\beta > 0$ is a constant and $\hat{P}$ is the regularization operator corresponding to the selected kernel.

A general Bayesian evidence framework was developed by MacKay, [3] [4] [5] who applied it to the problems of regression, feed-forward neural networks and classification networks. Given a data set $D$, a model $\mathbb{M}$ with parameter vector $w$ and a so-called hyperparameter or regularization parameter $\lambda$, Bayesian inference is constructed with 3 levels of inference:

- In the first level, for a given value of $\lambda$, the posterior distribution of $w$ is inferred by Bayes' rule, $p(w \mid D, \lambda, \mathbb{M}) \propto p(D \mid w, \mathbb{M}) \, p(w \mid \lambda, \mathbb{M})$.
- The second level of inference determines the value of $\lambda$ by maximizing $p(\lambda \mid D, \mathbb{M})$.
- The third level of inference ranks different models by their posterior probabilities $p(\mathbb{M} \mid D) \propto p(D \mid \mathbb{M}) \, p(\mathbb{M})$.

We can see that the Bayesian evidence framework is a unified theory for learning the model and performing model selection. Kwok used the Bayesian evidence framework to interpret the formulation of SVM and model selection, and he also applied the Bayesian evidence framework to support vector regression.

Now, given the data points $D = \{x_i, y_i\}_{i=1}^{N}$ and the hyperparameters $\mu$ and $\zeta$ of the model $\mathbb{M}$, the model parameters $w$ and $b$ are estimated by maximizing the posterior $p(w, b \mid D, \log\mu, \log\zeta, \mathbb{M})$. Applying Bayes' rule, we obtain

$$p(w, b \mid D, \log\mu, \log\zeta, \mathbb{M}) = \frac{p(D \mid w, b, \log\mu, \log\zeta, \mathbb{M}) \, p(w, b \mid \log\mu, \log\zeta, \mathbb{M})}{p(D \mid \log\mu, \log\zeta, \mathbb{M})},$$

where $p(D \mid \log\mu, \log\zeta, \mathbb{M})$ is a normalizing constant such that the integral over all possible $w$ and $b$ is equal to 1. We assume $w$ and $b$ are independent of the hyperparameter $\zeta$, and are conditionally independent, i.e., we assume

$$p(w, b \mid \log\mu, \log\zeta, \mathbb{M}) = p(w \mid \log\mu, \mathbb{M}) \, p(b \mid \log\sigma_b, \mathbb{M}).$$

When $\sigma_b \to \infty$, the distribution of $b$ will approximate a uniform distribution. Furthermore, we assume $w$ and $b$ have Gaussian distributions, so we obtain the a priori distribution of $w$ and $b$ with $\sigma_b \to \infty$ to be

$$p(w, b \mid \log\mu) = \left( \frac{\mu}{2\pi} \right)^{n_f / 2} \exp\left( -\frac{\mu}{2} w^T w \right) \cdot \frac{1}{\sqrt{2\pi \sigma_b^2}} \exp\left( -\frac{b^2}{2\sigma_b^2} \right) \;\propto\; \left( \frac{\mu}{2\pi} \right)^{n_f / 2} \exp\left( -\frac{\mu}{2} w^T w \right).$$

Here $n_f$ is the dimensionality of the feature space, the same as the dimensionality of $w$.

The probability $p(D \mid w, b, \log\zeta, \mathbb{M})$ is assumed to depend only on $w$, $b$, $\zeta$ and $\mathbb{M}$. We assume that the data points are independently identically distributed (i.i.d.), so that:

$$p(D \mid w, b, \log\zeta, \mathbb{M}) = \prod_{i=1}^{N} p(x_i, y_i \mid w, b, \log\zeta, \mathbb{M}).$$

In order to obtain the least-squares cost function, it is assumed that the probability of a data point is proportional to:

$$p(x_i, y_i \mid w, b, \log\zeta, \mathbb{M}) \propto p(e_i \mid w, b, \log\zeta, \mathbb{M}).$$

A Gaussian distribution is taken for the errors $e_i = y_i - \left( w^T \phi(x_i) + b \right)$ as:

$$p(e_i \mid w, b, \log\zeta, \mathbb{M}) = \sqrt{\frac{\zeta}{2\pi}} \exp\left( -\frac{\zeta e_i^2}{2} \right).$$

It is assumed that $w$ and $b$ are determined in such a way that the class centers $\hat{m}_-$ and $\hat{m}_+$ are mapped onto the targets $-1$ and $+1$, respectively. The projections $w^T \phi(x) + b$ of the class elements $\phi(x)$ follow a multivariate Gaussian distribution with variance $1/\zeta$.

Combining the preceding expressions, and neglecting all constants, Bayes' rule becomes

$$p(w, b \mid D, \log\mu, \log\zeta, \mathbb{M}) \;\propto\; \exp\left( -\frac{\mu}{2} w^T w - \frac{\zeta}{2} \sum_{i=1}^{N} e_i^2 \right) = \exp\left( -J_2(w, b) \right).$$

The maximum posterior density estimates $w_{MP}$ and $b_{MP}$ are then obtained by minimizing the negative logarithm of this expression, so we arrive at the LS-SVM cost function $J_2$ above.

References

  1. Suykens, J. A. K.; Vandewalle, J. (1999) "Least squares support vector machine classifiers", Neural Processing Letters, 9 (3), 293–300.
  2. Vapnik, V. The nature of statistical learning theory. Springer-Verlag, New York, 1995.
  3. MacKay, D. J. C. Bayesian Interpolation. Neural Computation, 4(3): 415–447, May 1992.
  4. MacKay, D. J. C. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3): 448–472, May 1992.
  5. MacKay, D. J. C. The evidence framework applied to classification networks. Neural Computation, 4(5): 720–736, Sep. 1992.
