Least-squares support-vector machine

Least-squares support-vector machines (LS-SVM) are least-squares versions of support-vector machines (SVM), used in statistics and statistical modeling. SVMs are a set of related supervised learning methods that analyze data and recognize patterns, and are used for classification and regression analysis. In the least-squares version one finds the solution by solving a set of linear equations instead of the convex quadratic programming (QP) problem required for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. [1] LS-SVMs are a class of kernel-based learning methods.

From support-vector machine to least-squares support-vector machine

Given a training set $\{x_i, y_i\}_{i=1}^{N}$ with input data $x_i \in \mathbb{R}^n$ and corresponding binary class labels $y_i \in \{-1, +1\}$, the SVM [2] classifier, according to Vapnik's original formulation, satisfies the following conditions:

$$
\begin{cases} w^T \phi(x_i) + b \ge 1, & \text{if } y_i = +1, \\ w^T \phi(x_i) + b \le -1, & \text{if } y_i = -1. \end{cases}
$$

The spiral data: $y_i = 1$ for blue data points, $y_i = -1$ for red data points.

which is equivalent to

$$y_i \left[ w^T \phi(x_i) + b \right] \ge 1, \quad i = 1, \ldots, N,$$

where $\phi(x)$ is the nonlinear map from the original space to the high- or infinite-dimensional space.

Inseparable data

In case such a separating hyperplane does not exist, we introduce so-called slack variables $\xi_i \ge 0$ such that

$$y_i \left[ w^T \phi(x_i) + b \right] \ge 1 - \xi_i, \quad i = 1, \ldots, N.$$

According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem:

$$\min_{w, b, \xi} \; J_1(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^{N} \xi_i,$$

$$\text{subject to } \; y_i \left[ w^T \phi(x_i) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N.$$

The result of the SVM classifier.

To solve this problem, we could construct the Lagrangian function:

$$L_1(w, b, \xi; \alpha, \beta) = \frac{1}{2} w^T w + c \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left\{ y_i \left[ w^T \phi(x_i) + b \right] - 1 + \xi_i \right\} - \sum_{i=1}^{N} \beta_i \xi_i,$$

where $\alpha_i \ge 0$ and $\beta_i \ge 0$ $(i = 1, \ldots, N)$ are the Lagrangian multipliers. The optimal point is a saddle point of the Lagrangian function, and then we obtain

$$\frac{\partial L_1}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i),$$
$$\frac{\partial L_1}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0,$$
$$\frac{\partial L_1}{\partial \xi_i} = 0 \;\Rightarrow\; c - \alpha_i - \beta_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le c, \quad i = 1, \ldots, N.$$

By substituting $w$ by its expression in the Lagrangian formed from the appropriate objective and constraints, we will get the following quadratic programming problem:

$$\max_{\alpha} \; Q_1(\alpha) = -\frac{1}{2} \sum_{i, j = 1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{N} \alpha_i,$$

where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is called the kernel function. Solving this QP problem subject to the constraints $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \le \alpha_i \le c$, we will get the hyperplane in the high-dimensional space and hence the classifier $y(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \right)$ in the original space.
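
In practice the dual QP above is handed to an off-the-shelf solver rather than derived by hand. The following sketch is only an illustration under assumptions not in the original text: it generates synthetic two-spiral data and fits scikit-learn's SVC, a standard QP-based SVM with an RBF kernel.

```python
# Minimal sketch (assumption): a QP-based SVM with an RBF kernel on synthetic spiral data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
t = rng.uniform(0.5, 3.0, 200)
angle = 2 * np.pi * t
X_pos = np.c_[t * np.cos(angle), t * np.sin(angle)]   # class +1 spiral arm
X_neg = -X_pos                                        # class -1 spiral arm (rotated by pi)
X = np.vstack([X_pos, X_neg]) + 0.05 * rng.standard_normal((400, 2))
y = np.r_[np.ones(200), -np.ones(200)]

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)             # C plays the role of c above
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```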

Least-squares SVM formulation

The least-squares version of the SVM classifier is obtained by reformulating the minimization problem as

$$\min_{w, b, e} \; J_2(w, b, e) = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^{N} e_i^2,$$

subject to the equality constraints

$$y_i \left[ w^T \phi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, N.$$

The least-squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets $y_i = \pm 1$.

Using $y_i^2 = 1$, we have

$$\sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i e_i)^2 = \sum_{i=1}^{N} \left( y_i - \left( w^T \phi(x_i) + b \right) \right)^2,$$

with $y_i e_i = y_i - \left( w^T \phi(x_i) + b \right)$. Notice that this error would also make sense for least-squares data fitting, so that the same end result holds for the regression case.

Hence the LS-SVM classifier formulation is equivalent to

$$J_2(w, b) = \mu E_W + \zeta E_D,$$

with $E_W = \frac{1}{2} w^T w$ and $E_D = \frac{1}{2} \sum_{i=1}^{N} e_i^2 = \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \left( w^T \phi(x_i) + b \right) \right)^2$.
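
With a linear kernel, i.e. $\phi(x) = x$, this cost is ordinary ridge regression on the $\pm 1$ targets with ridge parameter $\mu/\zeta$. The following minimal numpy sketch (an illustrative assumption; the function name is hypothetical) solves that special case in closed form, handling the bias term by centering:

```python
# Minimal sketch (assumption): with phi(x) = x, minimizing
#   mu/2 * w'w + zeta/2 * sum_i (y_i - w'x_i - b)^2
# jointly over w and b is ridge regression on centered data with ridge parameter mu/zeta.
import numpy as np

def lssvm_linear_primal(X, y, mu=1.0, zeta=10.0):
    """Closed-form w, b for the linear-kernel LS-SVM primal."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean                      # centering absorbs the bias term b
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + (mu / zeta) * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b
```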

The result of the LS-SVM classifier.

Both $\mu$ and $\zeta$ should be considered as hyperparameters that tune the amount of regularization versus the sum-squared error. The solution depends only on the ratio $\gamma = \zeta / \mu$, therefore the original formulation uses only $\gamma$ as a tuning parameter. We use both $\mu$ and $\zeta$ as parameters in order to provide a Bayesian interpretation of LS-SVM.

The solution of the LS-SVM regressor will be obtained after we construct the Lagrangian function (since the solution depends only on the ratio $\gamma = \zeta/\mu$, we may take $\mu = 1$ and $\zeta = \gamma$ here without loss of generality):

$$L_2(w, b, e; \alpha) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 - \sum_{i=1}^{N} \alpha_i \left\{ \left[ w^T \phi(x_i) + b \right] + e_i - y_i \right\},$$

where $\alpha_i \in \mathbb{R}$ are the Lagrange multipliers. The conditions for optimality are

$$\frac{\partial L_2}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i \phi(x_i),$$
$$\frac{\partial L_2}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i = 0,$$
$$\frac{\partial L_2}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = \gamma e_i, \quad i = 1, \ldots, N,$$
$$\frac{\partial L_2}{\partial \alpha_i} = 0 \;\Rightarrow\; y_i = w^T \phi(x_i) + b + e_i, \quad i = 1, \ldots, N.$$

Elimination of $w$ and $e$ will yield a linear system instead of a quadratic programming problem:

$$\begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix},$$

with $Y = [y_1, \ldots, y_N]^T$, $1_N = [1, \ldots, 1]^T$ and $\alpha = [\alpha_1, \ldots, \alpha_N]^T$. Here, $I_N$ is an $N \times N$ identity matrix, and $\Omega \in \mathbb{R}^{N \times N}$ is the kernel matrix defined by $\Omega_{ij} = \phi(x_i)^T \phi(x_j) = K(x_i, x_j)$.
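
The linear system above can be assembled and solved directly. The following minimal numpy sketch is an assumed illustration with an RBF kernel (function names and parameter defaults are hypothetical, not part of the original text):

```python
# Minimal sketch (assumption): solve the LS-SVM system
#   [ 0     1_N^T           ] [ b     ]   [ 0 ]
#   [ 1_N   Omega + I/gamma ] [ alpha ] = [ Y ]
# with the RBF kernel Omega_ij = exp(-||x_i - x_j||^2 / sigma^2).
import numpy as np

def fit_lssvm(X, y, gamma=10.0, sigma=1.0):
    """Return (alpha, b) from one dense linear solve."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Omega = np.exp(-sq_dists / sigma**2)          # kernel matrix Omega_ij = K(x_i, x_j)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                                # first row:    [0, 1_N^T]
    A[1:, 0] = 1.0                                # first column: [0; 1_N]
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]                        # alpha, b

def predict_lssvm(X_train, alpha, b, X_new, sigma=1.0):
    """Classify new points via y(x) = sign(sum_i alpha_i K(x, x_i) + b)."""
    sq = np.sum((X_new[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    return np.sign(np.exp(-sq / sigma**2) @ alpha + b)
```

For large $N$, iterative solvers such as conjugate gradients are typically preferred over the dense solve shown here.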

Kernel function K

For the kernel function K(•, •) one typically has the following choices:

- Linear kernel: $K(x, x_i) = x_i^T x$,
- Polynomial kernel of degree $d$: $K(x, x_i) = \left( 1 + x_i^T x / c \right)^d$,
- Radial basis function (RBF) kernel: $K(x, x_i) = \exp\left( -\|x - x_i\|^2 / \sigma^2 \right)$,
- MLP kernel: $K(x, x_i) = \tanh\left( k \, x_i^T x + \theta \right)$,

where $d$, $c$, $\sigma$, $k$ and $\theta$ are constants. Notice that the Mercer condition holds for all $c$ and $\sigma$ values in the polynomial and RBF case, but not for all possible choices of $k$ and $\theta$ in the MLP case. The scale parameters $c$, $\sigma$ and $k$ determine the scaling of the inputs in the polynomial, RBF and MLP kernel function. This scaling is related to the bandwidth of the kernel in statistics, where it is shown that the bandwidth is an important parameter of the generalization behavior of a kernel method.
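
As a hedged illustration (the parameter defaults are arbitrary assumptions), these kernel choices can be written as short functions:

```python
# Minimal sketch (assumption): the kernel choices listed above, for vectors x and x_i.
import numpy as np

def k_linear(x, xi):
    return xi @ x

def k_polynomial(x, xi, c=1.0, d=3):
    return (1.0 + xi @ x / c) ** d

def k_rbf(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma**2)

def k_mlp(x, xi, k=1.0, theta=0.0):
    # Mercer's condition holds only for certain (k, theta) choices here.
    return np.tanh(k * (xi @ x) + theta)
```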

Bayesian interpretation for LS-SVM

A Bayesian interpretation of the SVM has been proposed by Smola et al. They showed that the use of different kernels in SVM can be regarded as defining different prior probability distributions on the functional space, as $P[f] \propto \exp\left( -\beta \|\hat{P} f\|^2 \right)$. Here $\beta > 0$ is a constant and $\hat{P}$ is the regularization operator corresponding to the selected kernel.

A general Bayesian evidence framework was developed by MacKay, [3] [4] [5] who applied it to the problems of regression, feed-forward neural networks and classification networks. Given a data set $D$, a model $\mathbb{M}$ with parameter vector $w$ and a so-called hyperparameter or regularization parameter $\lambda$, Bayesian inference is constructed with 3 levels of inference:

- In the first level, for a given value of $\lambda$, the posterior distribution of $w$ is inferred by Bayes' rule, $p(w \mid D, \lambda, \mathbb{M}) \propto p(D \mid w, \mathbb{M}) \, p(w \mid \lambda, \mathbb{M})$.
- The second level of inference determines the value of $\lambda$ by maximizing $p(\lambda \mid D, \mathbb{M})$.
- The third level of inference ranks different models by their posterior probabilities $p(\mathbb{M} \mid D) \propto p(D \mid \mathbb{M}) \, p(\mathbb{M})$.

We can see that the Bayesian evidence framework is a unified theory for learning the model and performing model selection. Kwok used the Bayesian evidence framework to interpret the formulation of SVM and model selection, and he also applied the Bayesian evidence framework to support vector regression.

Now, given the data points $D = \{x_i, y_i\}_{i=1}^{N}$ and the hyperparameters $\mu$ and $\zeta$ of the model $\mathbb{M}$, the model parameters $w$ and $b$ are estimated by maximizing the posterior $p(w, b \mid D, \log\mu, \log\zeta, \mathbb{M})$. Applying Bayes' rule, we obtain

$$p(w, b \mid D, \log\mu, \log\zeta, \mathbb{M}) = \frac{p(D \mid w, b, \log\mu, \log\zeta, \mathbb{M}) \, p(w, b \mid \log\mu, \log\zeta, \mathbb{M})}{p(D \mid \log\mu, \log\zeta, \mathbb{M})},$$

where $p(D \mid \log\mu, \log\zeta, \mathbb{M})$ is a normalizing constant such that the integral over all possible $w$ and $b$ is equal to 1. We assume $w$ and $b$ are independent of the hyperparameter $\zeta$, and are conditionally independent, i.e., we assume

$$p(w, b \mid \log\mu, \log\zeta, \mathbb{M}) = p(w \mid \log\mu, \mathbb{M}) \, p(b \mid \log\sigma_b, \mathbb{M}).$$

When $\sigma_b \to \infty$, the distribution of $b$ will approximate a uniform distribution. Furthermore, we assume $w$ and $b$ have Gaussian distributions, so we obtain the a priori distribution of $w$ and $b$ with $\sigma_b \to \infty$ to be

$$p(w, b \mid \log\mu) = \left( \frac{\mu}{2\pi} \right)^{n_f / 2} \exp\left( -\frac{\mu}{2} w^T w \right) \cdot \frac{1}{\sqrt{2\pi \sigma_b^2}} \exp\left( -\frac{b^2}{2\sigma_b^2} \right) \;\propto\; \left( \frac{\mu}{2\pi} \right)^{n_f / 2} \exp\left( -\frac{\mu}{2} w^T w \right).$$

Here $n_f$ is the dimensionality of the feature space, the same as the dimensionality of $w$.

The probability $p(D \mid w, b, \log\zeta, \mathbb{M})$ is assumed to depend only on $w$, $b$, $\zeta$ and $\mathbb{M}$. We assume that the data points are independently identically distributed (i.i.d.), so that:

$$p(D \mid w, b, \log\zeta, \mathbb{M}) = \prod_{i=1}^{N} p(x_i, y_i \mid w, b, \log\zeta, \mathbb{M}).$$

In order to obtain the least-squares cost function, it is assumed that the probability of a data point is proportional to:

$$p(x_i, y_i \mid w, b, \log\zeta, \mathbb{M}) \propto p(e_i \mid w, b, \log\zeta, \mathbb{M}).$$

A Gaussian distribution is taken for the errors $e_i = y_i - \left( w^T \phi(x_i) + b \right)$ as:

$$p(e_i \mid w, b, \log\zeta, \mathbb{M}) = \sqrt{\frac{\zeta}{2\pi}} \exp\left( -\frac{\zeta e_i^2}{2} \right).$$

It is assumed that $w$ and $b$ are determined in such a way that the class centers $\hat{m}_-$ and $\hat{m}_+$ are mapped onto the targets $-1$ and $+1$, respectively. The projections $w^T \phi(x) + b$ of the class elements $\phi(x)$ follow a multivariate Gaussian distribution with variance $1/\zeta$.

Combining the preceding expressions, and neglecting all constants, Bayes' rule becomes

$$p(w, b \mid D, \log\mu, \log\zeta, \mathbb{M}) \;\propto\; \exp\left( -\frac{\mu}{2} w^T w - \frac{\zeta}{2} \sum_{i=1}^{N} e_i^2 \right) = \exp\left( -J_2(w, b) \right).$$

The maximum posterior density estimates $w_{MP}$ and $b_{MP}$ are then obtained by minimizing the negative logarithm of this expression, so we arrive at the LS-SVM cost function $J_2$ above.

References

  1. Suykens, J. A. K.; Vandewalle, J. (1999) "Least squares support vector machine classifiers", Neural Processing Letters, 9 (3), 293–300.
  2. Vapnik, V. The nature of statistical learning theory. Springer-Verlag, New York, 1995.
  3. MacKay, D. J. C. Bayesian Interpolation. Neural Computation, 4(3): 415–447, May 1992.
  4. MacKay, D. J. C. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3): 448–472, May 1992.
  5. MacKay, D. J. C. The evidence framework applied to classification networks. Neural Computation, 4(5): 720–736, Sep. 1992.
