Bayesian interpretation of kernel regularization

Within Bayesian statistics for machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature, but it is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be an inner product space, but instead a more general reproducing kernel Hilbert space. In Bayesian probability, kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors while the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs, such as in multi-task learning.[1]

A mathematical equivalence between the regularization and the Bayesian point of view is easily proved in cases where the reproducing kernel Hilbert space is finite-dimensional. The infinite-dimensional case raises subtle mathematical issues; we will consider here the finite-dimensional case. We start with a brief review of the main ideas underlying kernel methods for scalar learning, and briefly introduce the concepts of regularization and Gaussian processes. We then show how both points of view arrive at essentially equivalent estimators, and show the connection that ties them together.

The supervised learning problem

The classical supervised learning problem requires estimating the output for some new input point $x'$ by learning a scalar-valued estimator $\hat{f}(x')$ on the basis of a training set $S$ consisting of $n$ input-output pairs, $S = (\mathbf{X}, \mathbf{Y}) = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$.[2] Given a symmetric and positive bivariate function $k(\cdot, \cdot)$ called a kernel, one of the most popular estimators in machine learning is given by

$$\hat{f}(x') = \mathbf{k}^\top (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}, \qquad (1)$$

where $\mathbf{K} \equiv k(\mathbf{X}, \mathbf{X})$ is the kernel matrix with entries $\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, $\mathbf{k} = [k(\mathbf{x}_1, x'), \ldots, k(\mathbf{x}_n, x')]^\top$, and $\mathbf{Y} = [y_1, \ldots, y_n]^\top$. We will see how this estimator can be derived both from a regularization and a Bayesian perspective.
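The closed-form expression in equation (1) translates directly into a few lines of linear algebra. The following is a minimal sketch only: the Gaussian (RBF) kernel, its bandwidth, the value of the regularizer, the toy data, and the helper names `rbf_kernel` and `kernel_estimator` are illustrative assumptions, not part of the article.

```python
# Minimal sketch of estimator (1), assuming an illustrative Gaussian (RBF) kernel and toy data.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 * length_scale^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

def kernel_estimator(X, y, x_new, lam=0.1, length_scale=1.0):
    """Evaluate f_hat(x') = k^T (K + lambda*n*I)^(-1) Y at the new inputs x_new."""
    n = X.shape[0]
    K = rbf_kernel(X, X, length_scale)                    # kernel matrix, K_ij = k(x_i, x_j)
    k = rbf_kernel(X, x_new, length_scale)                # k = [k(x_1, x'), ..., k(x_n, x')]^T
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)   # (K + lambda*n*I)^(-1) Y
    return k.T @ alpha

# toy usage: noisy samples of a sine function
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
print(kernel_estimator(X, y, np.array([[0.0], [1.5]])))
```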

A regularization perspective

The main assumption in the regularization perspective is that the set of functions $\mathcal{F}$ belongs to a reproducing kernel Hilbert space $\mathcal{H}_k$.[2][3][4][5]

Reproducing kernel Hilbert space

A reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ is a Hilbert space of functions defined by a symmetric, positive-definite function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ called the reproducing kernel, such that the function $k(\mathbf{x}, \cdot)$ belongs to $\mathcal{H}_k$ for all $\mathbf{x} \in \mathcal{X}$.[6][7][8] Three main properties make an RKHS appealing:

1. The reproducing property, which gives the space its name,

$$\langle f, k(\mathbf{x}, \cdot) \rangle_k = f(\mathbf{x}), \quad \forall\, \mathbf{x} \in \mathcal{X},$$

where $\langle \cdot, \cdot \rangle_k$ is the inner product in $\mathcal{H}_k$.

2. Functions in an RKHS are in the closure of the linear combinations of the kernel at given points,

$$f(\mathbf{x}) = \sum_i k(\mathbf{x}_i, \mathbf{x})\, c_i .$$

This allows the construction in a unified framework of both linear and generalized linear models.

3. The squared norm in an RKHS can be written as

$$\|f\|_k^2 = \sum_{i,j} c_i c_j\, k(\mathbf{x}_i, \mathbf{x}_j)$$

and can be viewed as measuring the complexity of the function (a small numerical illustration of these properties follows the list).
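The sketch below makes the three properties concrete for a function of the form $f = \sum_i c_i k(\mathbf{x}_i, \cdot)$. The Gaussian kernel, the helper name `rbf_kernel`, and the random points and coefficients are illustrative assumptions.

```python
# Numerical illustration of properties 1-3 for f = sum_i c_i k(x_i, .), assuming a Gaussian kernel.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))   # points x_1, ..., x_5
c = rng.standard_normal(5)        # coefficients c_1, ..., c_5
K = rbf_kernel(X, X)

# Property 2: f is a linear combination of kernel sections; evaluated at the points themselves, f(X) = K c.
f_at_X = K @ c
# Property 1 (reproducing property): <f, k(x_j, .)>_k = sum_i c_i k(x_i, x_j), which is exactly f(x_j) above.
# Property 3: squared RKHS norm ||f||_k^2 = sum_{i,j} c_i c_j k(x_i, x_j) = c^T K c, a measure of complexity.
print("f at the points:", f_at_X)
print("squared RKHS norm:", c @ K @ c)
```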

The regularized functional

The estimator is derived as the minimizer of the regularized functional

$$\frac{1}{n} \sum_{i=1}^n (f(\mathbf{x}_i) - y_i)^2 + \lambda \|f\|_k^2, \qquad (2)$$

where $\lambda > 0$ and $\|\cdot\|_k$ is the norm in $\mathcal{H}_k$. The first term in this functional, which measures the average of the squares of the errors between the $f(\mathbf{x}_i)$ and the $y_i$, is called the empirical risk and represents the cost we pay by predicting $f(\mathbf{x}_i)$ for the true value $y_i$. The second term in the functional is the squared norm in an RKHS multiplied by a weight $\lambda$ and serves the purpose of stabilizing the problem[3][5] as well as of adding a trade-off between fitting and complexity of the estimator.[2] The weight $\lambda$, called the regularizer, determines the degree to which instability and complexity of the estimator should be penalized (higher penalty for increasing values of $\lambda$).
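As a minimal sketch, the functional (2) can be evaluated for any candidate predictions $f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)$ once the squared RKHS norm of the candidate is known. The function name `regularized_functional` and its interface are assumptions made for illustration, not part of the original formulation.

```python
import numpy as np

def regularized_functional(f_at_X, squared_rkhs_norm, y, lam):
    """Evaluate (1/n) * sum_i (f(x_i) - y_i)^2 + lam * ||f||_k^2 for a candidate function."""
    residual = np.asarray(f_at_X) - np.asarray(y)    # errors f(x_i) - y_i
    empirical_risk = residual @ residual / len(y)    # average squared error
    return empirical_risk + lam * squared_rkhs_norm  # empirical risk plus complexity penalty
```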

Derivation of the estimator

The explicit form of the estimator in equation (1) is derived in two steps. First, the representer theorem[9][10][11] states that the minimizer of the functional (2) can always be written as a linear combination of the kernels centered at the training-set points,

$$\hat{f}(x') = \sum_{i=1}^n c_i\, k(\mathbf{x}_i, x'), \qquad (3)$$

for some $\mathbf{c} = [c_1, \ldots, c_n]^\top \in \mathbb{R}^n$. The explicit form of the coefficients $\mathbf{c}$ can be found by substituting for $f(\cdot)$ in the functional (2). For a function of the form in equation (3), we have that

$$\|f\|_k^2 = \left\langle \sum_{i=1}^n c_i\, k(\mathbf{x}_i, \cdot), \sum_{j=1}^n c_j\, k(\mathbf{x}_j, \cdot) \right\rangle_k = \sum_{i=1}^n \sum_{j=1}^n c_i c_j\, k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{c}^\top \mathbf{K} \mathbf{c}.$$

We can rewrite the functional (2) as

$$\frac{1}{n} \|\mathbf{y} - \mathbf{K} \mathbf{c}\|^2 + \lambda\, \mathbf{c}^\top \mathbf{K} \mathbf{c}.$$

This functional is convex in $\mathbf{c}$ and therefore we can find its minimum by setting the gradient with respect to $\mathbf{c}$ to zero,

$$-\frac{1}{n} \mathbf{K} (\mathbf{y} - \mathbf{K} \mathbf{c}) + \lambda \mathbf{K} \mathbf{c} = 0 \;\;\Longrightarrow\;\; \mathbf{c} = (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{y}.$$

Substituting this expression for the coefficients in equation (3), we obtain the estimator stated previously in equation (1),

$$\hat{f}(x') = \mathbf{k}^\top \mathbf{c} = \mathbf{k}^\top (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{y}.$$
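As a sanity check, one can verify numerically that the closed-form coefficients $\mathbf{c} = (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{y}$ minimize the (convex) objective. The sketch below assumes an illustrative Gaussian kernel, toy data, and a chosen $\lambda$; it is not part of the original derivation.

```python
# Check that the closed-form c makes the objective no larger than nearby perturbed coefficients.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

def objective(c, K, y, lam):
    """(1/n) * ||y - K c||^2 + lam * c^T K c, the functional (2) for f of the form (3)."""
    r = K @ c - y
    return r @ r / len(y) + lam * c @ K @ c

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (25, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(25)
lam, n = 0.1, len(y)

K = rbf_kernel(X, X)
c_star = np.linalg.solve(K + lam * n * np.eye(n), y)   # closed-form coefficients

base = objective(c_star, K, y, lam)
perturbed = min(objective(c_star + 1e-3 * rng.standard_normal(n), K, y, lam) for _ in range(100))
print(base <= perturbed + 1e-12)   # True: no random perturbation improves the objective
```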

A Bayesian perspective

The notion of a kernel plays a crucial role in Bayesian probability as the covariance function of a stochastic process called the Gaussian process.

A review of Bayesian probability

As part of the Bayesian framework, the Gaussian process specifies the prior distribution that describes the prior beliefs about the properties of the function being modeled. These beliefs are updated after taking into account observational data by means of a likelihood function that relates the prior beliefs to the observations. Taken together, the prior and likelihood lead to an updated distribution called the posterior distribution that is customarily used for predicting test cases.

The Gaussian process

A Gaussian process (GP) is a stochastic process in which any finite number of random variables that are sampled follow a joint Normal distribution.[12] The mean vector and covariance matrix of the Gaussian distribution completely specify the GP. GPs are usually used as a prior distribution for functions, and as such the mean vector and covariance matrix can be viewed as functions, where the covariance function is also called the kernel of the GP. Let a function $f$ follow a Gaussian process with mean function $m$ and kernel function $k$,

$$f \sim \mathcal{GP}(m, k).$$

In terms of the underlying Gaussian distribution, we have that for any finite set $\mathbf{X} = \{\mathbf{x}_i\}_{i=1}^n$, if we let $f(\mathbf{X}) = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top$, then

$$f(\mathbf{X}) \sim \mathcal{N}(\mathbf{m}, \mathbf{K}),$$

where $\mathbf{m} = m(\mathbf{X}) = [m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n)]^\top$ is the mean vector and $\mathbf{K} = k(\mathbf{X}, \mathbf{X})$ is the covariance matrix of the multivariate Gaussian distribution.
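Because the finite-dimensional distributions of a GP are multivariate Gaussians, prior samples of the function at a finite set of inputs can be drawn directly from $\mathcal{N}(\mathbf{m}, \mathbf{K})$. The sketch below assumes a zero mean function and an illustrative Gaussian kernel, plus a small jitter term added for numerical stability.

```python
# Draw realizations of f(X) ~ N(m, K) from a GP prior, assuming m = 0 and a Gaussian kernel.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

rng = np.random.default_rng(3)
X = np.linspace(-3, 3, 50)[:, None]        # finite set of inputs x_1, ..., x_50
m = np.zeros(50)                           # mean vector m = m(X), here taken as zero
K = rbf_kernel(X, X) + 1e-10 * np.eye(50)  # covariance matrix K = k(X, X), with jitter

samples = rng.multivariate_normal(m, K, size=3)   # each row is one realization of the prior at X
print(samples.shape)   # (3, 50)
```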

Derivation of the estimator

In a regression context, the likelihood function is usually assumed to be a Gaussian distribution and the observations to be independent and identically distributed (iid),

$$p(y_i \mid f, \mathbf{x}_i, \sigma^2) = \mathcal{N}(f(\mathbf{x}_i), \sigma^2).$$

This assumption corresponds to the observations being corrupted with zero-mean Gaussian noise with variance $\sigma^2$. The iid assumption makes it possible to factorize the likelihood function over the data points given the set of inputs $\mathbf{X}$ and the variance of the noise $\sigma^2$, and thus the posterior distribution can be computed analytically. For a test input vector $x^*$, given the training data $S = \{\mathbf{X}, \mathbf{Y}\}$, the posterior distribution is given by

$$p(f(x^*) \mid S, x^*, \boldsymbol{\phi}) = \mathcal{N}(m(x^*), \sigma^2(x^*)),$$

where $\boldsymbol{\phi}$ denotes the set of parameters, which include the variance of the noise $\sigma^2$ and any parameters from the covariance function $k$, and where

$$m(x^*) = \mathbf{k}^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{Y},$$
$$\sigma^2(x^*) = k(x^*, x^*) - \mathbf{k}^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}.$$
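The posterior mean and variance above can be computed with a few matrix operations. The following sketch again assumes an illustrative Gaussian kernel and toy data; the helper name `gp_posterior` and the hyperparameter values are assumptions for the example.

```python
# Posterior mean m(x*) and variance sigma^2(x*) of a GP regression model with Gaussian noise.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

def gp_posterior(X, y, x_star, sigma2_noise=0.01, length_scale=1.0):
    """Return the posterior mean and variance at each test input in x_star."""
    n = X.shape[0]
    K = rbf_kernel(X, X, length_scale)
    k = rbf_kernel(X, x_star, length_scale)                  # shape (n, n_test)
    A_inv = np.linalg.inv(K + sigma2_noise * np.eye(n))      # (K + sigma^2 I)^(-1)
    mean = k.T @ A_inv @ y                                   # k^T (K + sigma^2 I)^(-1) Y
    var = rbf_kernel(x_star, x_star, length_scale).diagonal() - np.sum(k * (A_inv @ k), axis=0)
    return mean, var                                         # k(x*,x*) - k^T (K + sigma^2 I)^(-1) k

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
print(gp_posterior(X, y, np.array([[0.0], [2.0]])))
```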

The connection between regularization and Bayes

A connection between regularization theory and Bayesian theory can only be achieved in the case of a finite-dimensional RKHS. Under this assumption, regularization theory and Bayesian theory are connected through Gaussian process prediction.[3][12]

In the finite-dimensional case, every RKHS can be described in terms of a feature map $\Phi : \mathcal{X} \to \mathbb{R}^p$ such that[2]

$$k(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^p \Phi^i(\mathbf{x})\, \Phi^i(\mathbf{x}').$$

Functions in the RKHS with kernel $k$ can then be written as

$$f_{\mathbf{w}}(\mathbf{x}) = \sum_{i=1}^p w^i\, \Phi^i(\mathbf{x}) = \langle \mathbf{w}, \Phi(\mathbf{x}) \rangle,$$

and we also have that

$$\|f_{\mathbf{w}}\|_k^2 = \|\mathbf{w}\|^2.$$

We can now build a Gaussian process by assuming $\mathbf{w} = [w^1, \ldots, w^p]^\top$ to be distributed according to a multivariate Gaussian distribution with zero mean and identity covariance matrix,

$$\mathbf{w} \sim \mathcal{N}(0, \mathbf{I}) \propto \exp\left(-\tfrac{1}{2} \|\mathbf{w}\|^2\right).$$

If we assume a Gaussian likelihood we have

$$P(\mathbf{Y} \mid \mathbf{X}, f) = \mathcal{N}(f(\mathbf{X}), \sigma^2 \mathbf{I}) \propto \exp\left(-\frac{1}{2\sigma^2} \|f_{\mathbf{w}}(\mathbf{X}) - \mathbf{Y}\|^2\right),$$

where $f_{\mathbf{w}}(\mathbf{X}) = [f_{\mathbf{w}}(\mathbf{x}_1), \ldots, f_{\mathbf{w}}(\mathbf{x}_n)]^\top$. The resulting posterior distribution is then given by

$$P(f_{\mathbf{w}} \mid \mathbf{X}, \mathbf{Y}) \propto \exp\left(-\frac{1}{2\sigma^2} \|f_{\mathbf{w}}(\mathbf{X}) - \mathbf{Y}\|^2 - \frac{1}{2} \|\mathbf{w}\|^2\right).$$

We can see that a maximum a posteriori (MAP) estimate, obtained by maximizing the logarithm of this posterior, is equivalent to the minimization problem defining Tikhonov regularization (2), where in the Bayesian case the regularization parameter is related to the noise variance by $\lambda = \sigma^2 / n$.
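The equivalence can also be observed numerically: the GP posterior mean with noise variance $\sigma^2$ coincides with the regularized estimator (1) when $\lambda = \sigma^2 / n$. The kernel, data, and noise level in the sketch below are illustrative assumptions, as in the earlier examples.

```python
# The GP posterior mean equals the regularized estimator (1) when lambda = sigma^2 / n.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
x_star = np.array([[0.5], [-1.0]])
n, sigma2 = len(y), 0.25
lam = sigma2 / n                                   # the correspondence lambda = sigma^2 / n

K = rbf_kernel(X, X)
k = rbf_kernel(X, x_star)

f_regularized = k.T @ np.linalg.solve(K + lam * n * np.eye(n), y)   # estimator (1)
f_gp_mean = k.T @ np.linalg.solve(K + sigma2 * np.eye(n), y)        # GP posterior mean

print(np.allclose(f_regularized, f_gp_mean))   # True
```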

From a philosophical perspective, the loss function in a regularization setting plays a different role than the likelihood function in the Bayesian setting. Whereas the loss function measures the error that is incurred when predicting $f(\mathbf{x})$ in place of $y$, the likelihood function measures how likely the observations are under the model that was assumed to be true in the generative process. From a mathematical perspective, however, the formulations of the regularization and Bayesian frameworks make the loss function and the likelihood function play the same mathematical role of promoting the inference of functions $f$ that approximate the labels $y$ as closely as possible.


References

  1. Álvarez, Mauricio A.; Rosasco, Lorenzo; Lawrence, Neil D. (June 2011). "Kernels for Vector-Valued Functions: A Review". arXiv:1106.6251 [stat.ML].
  2. Vapnik, Vladimir (1998). Statistical Learning Theory. Wiley. ISBN 9780471030034.
  3. Wahba, Grace (1990). Spline Models for Observational Data. SIAM.
  4. Schölkopf, Bernhard; Smola, Alexander J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press. ISBN 9780262194754.
  5. Girosi, F.; Poggio, T. (1990). "Networks and the best approximation property". Biological Cybernetics. 63 (3): 169–176. doi:10.1007/bf00195855. hdl:1721.1/6017. S2CID 18824241.
  6. Aronszajn, N. (May 1950). "Theory of Reproducing Kernels". Transactions of the American Mathematical Society. 68 (3): 337–404. doi:10.2307/1990404. JSTOR 1990404.
  7. Schwartz, Laurent (1964). "Sous-espaces hilbertiens d'espaces vectoriels topologiques et noyaux associés (noyaux reproduisants)" [Hilbert subspaces of topological vector spaces and associated kernels (reproducing kernels)]. Journal d'Analyse Mathématique. 13 (1): 115–256. doi:10.1007/bf02786620. S2CID 117202393.
  8. Cucker, Felipe; Smale, Steve (October 5, 2001). "On the mathematical foundations of learning". Bulletin of the American Mathematical Society. 39 (1): 1–49. doi:10.1090/s0273-0979-01-00923-5.
  9. Kimeldorf, George S.; Wahba, Grace (1970). "A correspondence between Bayesian estimation on stochastic processes and smoothing by splines". The Annals of Mathematical Statistics. 41 (2): 495–502. doi:10.1214/aoms/1177697089.
  10. Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alex J. (2001). "A Generalized Representer Theorem". Computational Learning Theory. Lecture Notes in Computer Science. Vol. 2111. pp. 416–426. doi:10.1007/3-540-44581-1_27. ISBN 978-3-540-42343-0.
  11. De Vito, Ernesto; Rosasco, Lorenzo; Caponnetto, Andrea; Piana, Michele; Verri, Alessandro (October 2004). "Some Properties of Regularized Kernel Methods". Journal of Machine Learning Research. 5: 1363–1390.
  12. Rasmussen, Carl Edward; Williams, Christopher K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. ISBN 0-262-18253-X.