![]() | This article should be summarized in Least squares#Regularization and a link provided from there to here using the {{ Main }} template.(November 2020) |
Part of a series on |
Regression analysis |
---|
Models |
Estimation |
Background |
Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.
RLS is used for two main reasons. The first comes up when the number of variables in the linear system exceeds the number of observations. In such settings, the ordinary least-squares problem is ill-posed and is therefore impossible to fit because the associated optimization problem has infinitely many solutions. RLS allows the introduction of further constraints that uniquely determine the solution.
The second reason for using RLS arises when the learned model suffers from poor generalization. RLS can be used in such cases to improve the generalizability of the model by constraining it at training time. This constraint can either force the solution to be "sparse" in some way or to reflect other prior knowledge about the problem such as information about correlations between features. A Bayesian understanding of this can be reached by showing that RLS methods are often equivalent to priors on the solution to the least-squares problem.
Consider a learning setting given by a probabilistic space , . Let denote a training set of pairs i.i.d. with respect to the joint distribution . Let be a loss function. Define as the space of the functions such that expected risk: is well defined. The main goal is to minimize the expected risk: Since the problem cannot be solved exactly there is a need to specify how to measure the quality of a solution. A good learning algorithm should provide an estimator with a small risk.
As the joint distribution is typically unknown, the empirical risk is taken. For regularized least squares the square loss function is introduced:
However, if the functions are from a relatively unconstrained space, such as the set of square-integrable functions on , this approach may overfit the training data, and lead to poor generalization. Thus, it should somehow constrain or penalize the complexity of the function . In RLS, this is accomplished by choosing functions from a reproducing kernel Hilbert space (RKHS) , and adding a regularization term to the objective function, proportional to the norm of the function in :
A RKHS can be defined by a symmetric positive-definite kernel function with the reproducing property: where . The RKHS for a kernel consists of the completion of the space of functions spanned by : , where all are real numbers. Some commonly used kernels include the linear kernel, inducing the space of linear functions: the polynomial kernel, inducing the space of polynomial functions of order : and the Gaussian kernel:
Note that for an arbitrary loss function , this approach defines a general class of algorithms named Tikhonov regularization. For instance, using the hinge loss leads to the support vector machine algorithm, and using the epsilon-insensitive loss leads to support vector regression.
The representer theorem guarantees that the solution can be written as: for some .
The minimization problem can be expressed as: where, with some abuse of notation, the entry of kernel matrix (as opposed to kernel function ) is .
For such a function,
The following minimization problem can be obtained:
As the sum of convex functions is convex, the solution is unique and its minimum can be found by setting the gradient with respect to to : where
The complexity of training is basically the cost of computing the kernel matrix plus the cost of solving the linear system which is roughly . The computation of the kernel matrix for the linear or Gaussian kernel is . The complexity of testing is .
The prediction at a new test point is:
For convenience a vector notation is introduced. Let be an matrix, where the rows are input vectors, and a vector where the entries are corresponding outputs. In terms of vectors, the kernel matrix can be written as . The learning function can be written as:
Here we define . The objective function can be rewritten as:
The first term is the objective function from ordinary least squares (OLS) regression, corresponding to the residual sum of squares. The second term is a regularization term, not present in OLS, which penalizes large values. As a smooth finite dimensional problem is considered and it is possible to apply standard calculus tools. In order to minimize the objective function, the gradient is calculated with respect to and set it to zero:
This solution closely resembles that of standard linear regression, with an extra term . If the assumptions of OLS regression hold, the solution , with , is an unbiased estimator, and is the minimum-variance linear unbiased estimator, according to the Gauss–Markov theorem. The term therefore leads to a biased solution; however, it also tends to reduce variance. This is easy to see, as the covariance matrix of the -values is proportional to , and therefore large values of will lead to lower variance. Therefore, manipulating corresponds to trading-off bias and variance. For problems with high-variance estimates, such as cases with relatively small or with correlated regressors, the optimal prediction accuracy may be obtained by using a nonzero , and thus introducing some bias to reduce variance. Furthermore, it is not uncommon in machine learning to have cases where , in which case is rank-deficient, and a nonzero is necessary to compute .
The parameter controls the invertibility of the matrix . Several methods can be used to solve the above linear system, Cholesky decomposition being probably the method of choice, since the matrix is symmetric and positive definite. The complexity of this method is for training and for testing. The cost is essentially that of computing , whereas the inverse computation (or rather the solution of the linear system) is roughly .
In this section it will be shown how to extend RLS to any kind of reproducing kernel K. Instead of linear kernel a feature map is considered for some Hilbert space , called the feature space. In this case the kernel is defined as: The matrix is now replaced by the new data matrix , where , or the -th component of the . It means that for a given training set . Thus, the objective function can be written as
This approach is known as the kernel trick. This technique can significantly simplify the computational operations. If is high dimensional, computing may be rather intensive. If the explicit form of the kernel function is known, we just need to compute and store the kernel matrix .
In fact, the Hilbert space need not be isomorphic to , and can be infinite dimensional. This follows from Mercer's theorem, which states that a continuous, symmetric, positive definite kernel function can be expressed as where form an orthonormal basis for , and . If feature maps is defined with components , it follows that . This demonstrates that any kernel can be associated with a feature map, and that RLS generally consists of linear RLS performed in some possibly higher-dimensional feature space. While Mercer's theorem shows how one feature map that can be associated with a kernel, in fact multiple feature maps can be associated with a given reproducing kernel. For instance, the map satisfies the property for an arbitrary reproducing kernel.
Least squares can be viewed as a likelihood maximization under an assumption of normally distributed residuals. This is because the exponent of the Gaussian distribution is quadratic in the data, and so is the least-squares objective function. In this framework, the regularization terms of RLS can be understood to be encoding priors on . [1] For instance, Tikhonov regularization corresponds to a normally distributed prior on that is centered at 0. To see this, first note that the OLS objective is proportional to the log-likelihood function when each sampled is normally distributed around . Then observe that a normal prior on centered at 0 has a log-probability of the formwhere and are constants that depend on the variance of the prior and are independent of . Thus, minimizing the logarithm of the likelihood times the prior is equivalent to minimizing the sum of the OLS loss function and the ridge regression regularization term.
This gives a more intuitive interpretation for why Tikhonov regularization leads to a unique solution to the least-squares problem: there are infinitely many vectors satisfying the constraints obtained from the data, but since we come to the problem with a prior belief that is normally distributed around the origin, we will end up choosing a solution with this constraint in mind.
Other regularization methods correspond to different priors. See the list below for more details.
One particularly common choice for the penalty function is the squared norm, i.e., and the solution is found as The most common names for this are called Tikhonov regularization and ridge regression. It admits a closed-form solution for : The name ridge regression alludes to the fact that the term adds positive entries along the diagonal "ridge" of the sample covariance matrix .
When , i.e., in the case of ordinary least squares, the condition that causes the sample covariance matrix to not have full rank and so it cannot be inverted to yield a unique solution. This is why there can be an infinitude of solutions to the ordinary least squares problem when . However, when , i.e., when ridge regression is used, the addition of to the sample covariance matrix ensures that all of its eigenvalues will be strictly greater than 0. In other words, it becomes invertible, and the solution is then unique.
Compared to ordinary least squares, ridge regression is not unbiased. It accepts bias to reduce variance and the mean square error.
If we want to find for different values of the regularization coefficient (which we denote ) we may use the eigenvalue decomposition of the covariance matrix where columns of are the eigenvectors of and - its eigenvalues.
The solution in then given by where
Using the above results, the algorithm for finding a maximum likelihood estimate of may be defined as follows: [2]
This algorithm, for automatic (as opposed to heuristic) regularization, is obtained as a fixed point solution in the maximum likelihood estimation of the parameters. [2] Although the guarantees of convergence are not provided, the examples indicate that a satisfactory solution may be obtained after a couple of iterations.
The eigenvalue decomposition simplifies derivation of the algorithm and also simplifies the calculations:
An alternative fixed-point algorithm known as Gull-McKay algorithm [2] usually has a faster convergence, but may be used only if . Thus, while it can be used without problems for caution is recommended for .
The least absolute selection and shrinkage (LASSO) method is another popular choice. In lasso regression, the lasso penalty function is the norm, i.e.
Note that the lasso penalty function is convex but not strictly convex. Unlike Tikhonov regularization, this scheme does not have a convenient closed-form solution: instead, the solution is typically found using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least-angle regression algorithm.
An important difference between lasso regression and Tikhonov regularization is that lasso regression forces more entries of to actually equal 0 than would otherwise. In contrast, while Tikhonov regularization forces entries of to be small, it does not force more of them to be 0 than would be otherwise. Thus, LASSO regularization is more appropriate than Tikhonov regularization in cases in which we expect the number of non-zero entries of to be small, and Tikhonov regularization is more appropriate when we expect that entries of will generally be small but not necessarily zero. Which of these regimes is more relevant depends on the specific data set at hand.
Besides feature selection described above, LASSO has some limitations. Ridge regression provides better accuracy in the case for highly correlated variables. [3] In another case, , LASSO selects at most variables. Moreover, LASSO tends to select some arbitrary variables from group of highly correlated samples, so there is no grouping effect.
The most extreme way to enforce sparsity is to say that the actual magnitude of the coefficients of does not matter; rather, the only thing that determines the complexity of is the number of non-zero entries. This corresponds to setting to be the norm of . This regularization function, while attractive for the sparsity that it guarantees, is very difficult to solve because doing so requires optimization of a function that is not even weakly convex. Lasso regression is the minimal possible relaxation of penalization that yields a weakly convex optimization problem.
For any non-negative and the objective has the following form:
Let , then the solution of the minimization problem is described as: for some .
Consider as an Elastic Net penalty function.
When , elastic net becomes ridge regression, whereas it becomes Lasso. Elastic Net penalty function doesn't have the first derivative at 0 and it is strictly convex taking the properties both lasso regression and ridge regression.
One of the main properties of the Elastic Net is that it can select groups of correlated variables. The difference between weight vectors of samples and is given by: where . [4]
If and are highly correlated (), the weight vectors are very close. In the case of negatively correlated samples () the samples can be taken. To summarize, for highly correlated variables the weight vectors tend to be equal up to a sign in the case of negative correlated variables.
The following is a list of possible choices of the regularization function , along with the name for each one, the corresponding prior if there is a simple one, and ways for computing the solution to the resulting optimization problem.
Name | Regularization function | Corresponding prior | Methods for solving |
---|---|---|---|
Tikhonov regularization | Normal | Closed form | |
Lasso regression | Laplace | Proximal gradient descent, least angle regression | |
penalization | – | Forward selection, Backward elimination, use of priors such as spike and slab | |
Elastic nets | Normal and Laplace mixture | Proximal gradient descent | |
Total variation regularization | – | Split–Bregman method, among others |
In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories, SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).
In mathematics and physical science, spherical harmonics are special functions defined on the surface of a sphere. They are often employed in solving partial differential equations in many scientific fields. The table of spherical harmonics contains a list of common spherical harmonics.
In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the base form and with parametric extension for arbitrary real constants a, b and non-zero c. It is named after the mathematician Carl Friedrich Gauss. The graph of a Gaussian is a characteristic symmetric "bell curve" shape. The parameter a is the height of the curve's peak, b is the position of the center of the peak, and c controls the width of the "bell".
In geodesy, conversion among different geographic coordinate systems is made necessary by the different geographic coordinate systems in use across the world and over time. Coordinate conversion is composed of a number of different types of conversion: format change of geographic coordinates, conversion of coordinate systems, or transformation to different geodetic datums. Geographic coordinate conversion has applications in cartography, surveying, navigation and geographic information systems.
In mathematics, the Schwarzian derivative is an operator similar to the derivative which is invariant under Möbius transformations. Thus, it occurs in the theory of the complex projective line, and in particular, in the theory of modular forms and hypergeometric functions. It plays an important role in the theory of univalent functions, conformal mapping and Teichmüller spaces. It is named after the German mathematician Hermann Schwarz.
Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also known as Tikhonov regularization, named for Andrey Tikhonov, it is a method of regularization of ill-posed problems. It is particularly useful to mitigate the problem of multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters. In general, the method provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias.
In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer of a problem to a simpler one. It is often used in solving ill-posed problems or to prevent overfitting.
In probability theory, the inverse Gaussian distribution is a two-parameter family of continuous probability distributions with support on (0,∞).
A ratio distribution is a probability distribution constructed as the distribution of the ratio of random variables having two other known distributions. Given two random variables X and Y, the distribution of the random variable Z that is formed as the ratio Z = X/Y is a ratio distribution.
In operator theory, a branch of mathematics, a positive-definite kernel is a generalization of a positive-definite function or a positive-definite matrix. It was first introduced by James Mercer in the early 20th century, in the context of solving integral operator equations. Since then, positive-definite functions and their various analogues and generalizations have arisen in diverse parts of mathematics. They occur naturally in Fourier analysis, probability theory, operator theory, complex function-theory, moment problems, integral equations, boundary-value problems for partial differential equations, machine learning, embedding problem, information theory, and other areas.
A self-concordant function is a function satisfying a certain differential inequality, which makes it particularly easy for optimization using Newton's method A self-concordant barrier is a particular self-concordant function, that is also a barrier function for a particular convex set. Self-concordant barriers are important ingredients in interior point methods for optimization.
In complex analysis, given initial data consisting of points in the complex unit disc and target data consisting of points in , the Nevanlinna–Pick interpolation problem is to find a holomorphic function that interpolates the data, that is for all ,
In mathematics, the spectral theory of ordinary differential equations is the part of spectral theory concerned with the determination of the spectrum and eigenfunction expansion associated with a linear ordinary differential equation. In his dissertation, Hermann Weyl generalized the classical Sturm–Liouville theory on a finite closed interval to second order differential operators with singularities at the endpoints of the interval, possibly semi-infinite or infinite. Unlike the classical case, the spectrum may no longer consist of just a countable set of eigenvalues, but may also contain a continuous part. In this case the eigenfunction expansion involves an integral over the continuous part with respect to a spectral measure, given by the Titchmarsh–Kodaira formula. The theory was put in its final simplified form for singular differential equations of even degree by Kodaira and others, using von Neumann's spectral theorem. It has had important applications in quantum mechanics, operator theory and harmonic analysis on semisimple Lie groups.
The derivatives of scalars, vectors, and second-order tensors with respect to second-order tensors are of considerable use in continuum mechanics. These derivatives are used in the theories of nonlinear elasticity and plasticity, particularly in the design of algorithms for numerical simulations.
Least-squares support-vector machines (LS-SVM) for statistics and in statistical modeling, are least-squares versions of support-vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods.
In complex analysis and geometric function theory, the Grunsky matrices, or Grunsky operators, are infinite matrices introduced in 1939 by Helmut Grunsky. The matrices correspond to either a single holomorphic function on the unit disk or a pair of holomorphic functions on the unit disk and its complement. The Grunsky inequalities express boundedness properties of these matrices, which in general are contraction operators or in important special cases unitary operators. As Grunsky showed, these inequalities hold if and only if the holomorphic function is univalent. The inequalities are equivalent to the inequalities of Goluzin, discovered in 1947. Roughly speaking, the Grunsky inequalities give information on the coefficients of the logarithm of a univalent function; later generalizations by Milin, starting from the Lebedev–Milin inequality, succeeded in exponentiating the inequalities to obtain inequalities for the coefficients of the univalent function itself. The Grunsky matrix and its associated inequalities were originally formulated in a more general setting of univalent functions between a region bounded by finitely many sufficiently smooth Jordan curves and its complement: the results of Grunsky, Goluzin and Milin generalize to that case.
Within bayesian statistics for machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature. It is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be inner product spaces, but instead more general reproducing kernel Hilbert spaces. In Bayesian probability kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors while the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs such as in multi-task learning.
For computer science, in statistical learning theory, a representer theorem is any of several related results stating that a minimizer of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training set data.
Proximal gradientmethods for learning is an area of research in optimization and statistical learning theory which studies algorithms for a general class of convex regularization problems where the regularization penalty may not be differentiable. One such example is regularization of the form
In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.