Regularization by spectral filtering

Spectral regularization is any of a class of regularization techniques used in machine learning to control the impact of noise and prevent overfitting. Spectral regularization can be used in a broad range of applications, from deblurring images to classifying emails into a spam folder and a non-spam folder. For instance, in the email classification example, spectral regularization can be used to reduce the impact of noise and prevent overfitting when a machine learning system is being trained on a labeled set of emails to learn how to tell a spam and a non-spam email apart.

Spectral regularization algorithms rely on methods that were originally defined and studied in the theory of ill-posed inverse problems (for instance, see [1]) focusing on the inversion of a linear operator (or a matrix) that possibly has a bad condition number or an unbounded inverse. In this context, regularization amounts to substituting the original operator by a bounded operator called the "regularization operator" that has a condition number controlled by a regularization parameter, [2] a classical example being Tikhonov regularization. To ensure stability, this regularization parameter is tuned based on the level of noise. [2] The main idea behind spectral regularization is that each regularization operator can be described using spectral calculus as an appropriate filter on the eigenvalues of the operator that defines the problem, and the role of the filter is to "suppress the oscillatory behavior corresponding to small eigenvalues". [2] Therefore, each algorithm in the class of spectral regularization algorithms is defined by a suitable filter function (which needs to be derived for that particular algorithm). Three of the most commonly used regularization algorithms for which spectral filtering is well-studied are Tikhonov regularization, Landweber iteration, and truncated singular value decomposition (TSVD). As for choosing the regularization parameter, examples of candidate methods to compute this parameter include the discrepancy principle, generalized cross validation, and the L-curve criterion. [3]

It is of note that the notion of spectral filtering studied in the context of machine learning is closely connected to the literature on function approximation (in signal processing).

Notation

The training set is defined as $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $X$ is the $n \times d$ input matrix and $Y = (y_1, \dots, y_n)$ is the output vector. Where applicable, the kernel function is denoted by $k$, and the $n \times n$ kernel matrix is denoted by $K$, which has entries $K_{ij} = k(x_i, x_j)$; $\mathcal{H}$ denotes the Reproducing Kernel Hilbert Space (RKHS) with kernel $k$. The regularization parameter is denoted by $\lambda$.

(Note: For $g \in G$ and $f \in F$, with $G$ and $F$ being Hilbert spaces, given a linear, continuous operator $L$, assume that $Lf = g$ holds. In this setting, the direct problem would be to solve for $g$ given $f$, and the inverse problem would be to solve for $f$ given $g$. If the solution exists, is unique, and is stable, the inverse problem (i.e. the problem of solving for $f$) is well-posed; otherwise, it is ill-posed.)

Relation to the theory of ill-posed inverse problems

The connection between the regularized least squares (RLS) estimation problem (Tikhonov regularization setting) and the theory of ill-posed inverse problems is an example of how spectral regularization algorithms are related to ill-posed inverse problems.

The RLS estimator solves

$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 + \lambda \| f \|_{\mathcal{H}}^2,$$

and the RKHS allows for expressing this RLS estimator as $f_S^{\lambda}(x) = \sum_{i=1}^{n} c_i k(x, x_i)$, where the coefficients $c = (c_1, \dots, c_n)$ satisfy $(K + n\lambda I)\, c = Y$. [4] The penalization term is used for controlling smoothness and preventing overfitting. Since the solution of empirical risk minimization can be written as $f_S(x) = \sum_{i=1}^{n} c_i k(x, x_i)$ such that $K c = Y$, adding the penalty function amounts to the following change in the system that needs to be solved: [5]

$$K c = Y \quad \text{is replaced by} \quad (K + n\lambda I)\, c = Y.$$
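The change from $Kc = Y$ to $(K + n\lambda I)c = Y$ can be made concrete with a short numerical sketch. The following is only an illustration, assuming synthetic one-dimensional data, a Gaussian kernel, and variable names that are not taken from the cited sources:

```python
# Minimal sketch (not from the cited sources): contrast the unregularized system
# K c = Y with the regularized system (K + n*lambda*I) c = Y on synthetic data.
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """Kernel matrix with entries k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2*width^2))."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2.0 * width ** 2))

rng = np.random.default_rng(0)
n, lam = 20, 1e-2
x = np.sort(rng.uniform(-3, 3, n))             # inputs
Y = np.sin(x) + 0.1 * rng.standard_normal(n)   # noisy outputs
K = gaussian_kernel(x, x)

c_erm = np.linalg.solve(K, Y)                  # empirical risk minimization: K c = Y
c_rls = np.linalg.solve(K + n * lam * np.eye(n), Y)   # RLS / Tikhonov: (K + n*lam*I) c = Y

# Regularization drastically improves the conditioning of the system being solved.
print(np.linalg.cond(K), np.linalg.cond(K + n * lam * np.eye(n)))

# The estimator at a new point x0 is f(x0) = sum_i c_i k(x0, x_i).
x0 = np.array([0.5])
print(gaussian_kernel(x0, x) @ c_rls)
```

The condition numbers printed by the sketch indicate why the unregularized system is fragile; the eigen-decomposition below makes this precise.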

In this learning setting, the kernel matrix can be decomposed as $K = Q \Sigma Q^{T}$, with

$$\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_n), \qquad \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0,$$

and $q_1, \dots, q_n$ are the corresponding eigenvectors (the columns of $Q$). Therefore, in the initial learning setting, the following holds:

$$c = K^{-1} Y = Q \Sigma^{-1} Q^{T} Y = \sum_{i=1}^{n} \frac{1}{\sigma_i} \langle q_i, Y \rangle \, q_i.$$

Thus, for small eigenvalues, even small perturbations in the data can lead to considerable changes in the solution. Hence, the problem is ill-conditioned, and solving this RLS problem amounts to stabilizing a possibly ill-conditioned matrix inversion problem, which is studied in the theory of ill-posed inverse problems; in both problems, a main concern is to deal with the issue of numerical stability.
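The amplification of noise by small eigenvalues can be seen in a minimal sketch. The construction below is synthetic (random orthonormal eigenvectors and rapidly decaying eigenvalues chosen for illustration) and is not taken from the cited sources:

```python
# Sketch: how small eigenvalues amplify perturbations in c = K^{-1} Y.
import numpy as np

rng = np.random.default_rng(0)
n = 20

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))    # random orthonormal eigenvectors
sigma = 10.0 ** -np.arange(n, dtype=float)          # eigenvalues 1, 0.1, 0.01, ...
K = Q @ np.diag(sigma) @ Q.T                        # the ill-conditioned "kernel" matrix

Y = rng.standard_normal(n)
Y_perturbed = Y + 1e-6 * rng.standard_normal(n)     # tiny perturbation of the data

# c = sum_i (1/sigma_i) <q_i, Y> q_i  =  Q diag(1/sigma) Q^T Y
c = Q @ ((Q.T @ Y) / sigma)
c_perturbed = Q @ ((Q.T @ Y_perturbed) / sigma)

print(np.linalg.norm(Y_perturbed - Y))   # ~1e-6: tiny change in the data
print(np.linalg.norm(c_perturbed - c))   # enormous change in the solution
```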

Implementation of algorithms

Each algorithm in the class of spectral regularization algorithms is defined by a suitable filter function, denoted here by $G_\lambda(\cdot)$. If the kernel matrix is denoted by $K$, then $\lambda$ should control the magnitude of the smaller eigenvalues of $G_\lambda(K)$. In a filtering setup, the goal is to find estimators $f_S^{\lambda}(x) := \sum_{i=1}^{n} c_i k(x, x_i)$, where $c = G_\lambda(K)\, Y$. To do so, a scalar filter function $G_\lambda(\sigma)$ is defined using the eigen-decomposition of the kernel matrix:

$$G_\lambda(K) = Q \, G_\lambda(\Sigma) \, Q^{T},$$

which yields

$$G_\lambda(K)\, Y = \sum_{i=1}^{n} G_\lambda(\sigma_i) \langle q_i, Y \rangle \, q_i.$$

Typically, an appropriate filter function should have the following properties: [5]

1. As $\lambda$ goes to zero, $G_\lambda(\sigma) \to \frac{1}{\sigma}$.

2. The magnitude of the (smaller) eigenvalues of $G_\lambda(K)$ is controlled by $\lambda$.

While the above items give a rough characterization of the general properties of filter functions for all spectral regularization algorithms, the derivation of the filter function (and hence its exact form) varies depending on the specific regularization method that spectral filtering is applied to.
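As a rough illustration of this shared structure, the following sketch computes $c = G_\lambda(K)\,Y$ through the eigen-decomposition of $K$ for an arbitrary scalar filter function. The helper name `spectral_filter_coefficients` and the demonstration data are hypothetical and not taken from the cited sources:

```python
# Sketch of the template shared by spectral regularization algorithms:
# c = G_lambda(K) Y, computed through the eigen-decomposition K = Q Sigma Q^T.
import numpy as np

def spectral_filter_coefficients(K, Y, scalar_filter):
    """Return c = G_lambda(K) Y, applying the scalar filter to each eigenvalue of K."""
    sigma, Q = np.linalg.eigh(K)        # eigenvalues and orthonormal eigenvectors of K
    g = scalar_filter(sigma)            # G_lambda(sigma_i), evaluated entrywise
    return Q @ (g * (Q.T @ Y))          # sum_i G_lambda(sigma_i) <q_i, Y> q_i

# Tiny demonstration on a random PSD matrix with a Tikhonov-type filter.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
K = A @ A.T / 10.0                      # symmetric positive semi-definite "kernel" matrix
Y = rng.standard_normal(10)
c = spectral_filter_coefficients(K, Y, lambda s: 1.0 / (s + 0.1))
print(c)
```

Each of the filter functions derived in the subsections below (Tikhonov, Landweber, TSVD) can be obtained by passing the corresponding scalar filter to such a template.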

Filter function for Tikhonov regularization

In the Tikhonov regularization setting, the filter function for RLS is described below. As shown in [4], in this setting, $c = (K + n\lambda I)^{-1} Y$. Thus,

$$c = (K + n\lambda I)^{-1} Y = Q (\Sigma + n\lambda I)^{-1} Q^{T} Y = \sum_{i=1}^{n} \frac{1}{\sigma_i + n\lambda} \langle q_i, Y \rangle \, q_i.$$

The undesired components are filtered out using regularization: if $\sigma \gg \lambda n$, then $\frac{1}{\sigma + \lambda n} \approx \frac{1}{\sigma}$, while if $\sigma \ll \lambda n$, then $\frac{1}{\sigma + \lambda n} \approx \frac{1}{\lambda n}$.

The filter function for Tikhonov regularization is therefore defined as: [5]

$$G_\lambda(\sigma) = \frac{1}{\sigma + \lambda n}.$$
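A brief sketch (synthetic data, Gaussian kernel, illustrative names, not from the cited sources) confirms that applying this filter to the eigenvalues reproduces the coefficients obtained by solving the RLS linear system directly:

```python
# Sketch: the Tikhonov filter 1/(sigma + n*lambda) reproduces the RLS solution
# (K + n*lambda*I)^{-1} Y. Synthetic data for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, lam = 30, 1e-2
x = rng.uniform(-3, 3, n)
Y = np.sin(x) + 0.1 * rng.standard_normal(n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)   # Gaussian kernel matrix

sigma, Q = np.linalg.eigh(K)
c_filter = Q @ ((Q.T @ Y) / (sigma + n * lam))      # G_lambda(sigma) = 1/(sigma + n*lam)
c_direct = np.linalg.solve(K + n * lam * np.eye(n), Y)

print(np.allclose(c_filter, c_direct))              # True: the two forms agree
```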

Filter function for Landweber iteration

The idea behind the Landweber iteration is gradient descent: [5]

$$c^{0} = 0, \qquad c^{i} = c^{i-1} + \eta \left( Y - K c^{i-1} \right) \quad \text{for } i = 1, \dots, t.$$

In this setting, if $n$ is larger than $K$'s largest eigenvalue, the above iteration converges by choosing $\eta = \frac{2}{n}$ as the step-size. [5] The above iteration is equivalent to minimizing $\frac{1}{n} \| Y - K c \|_2^2$ (i.e. the empirical risk) via gradient descent; using induction, it can be proved that at the $t$-th iteration, the solution is given by [5]

$$c = \eta \sum_{i=0}^{t-1} (I - \eta K)^{i} \, Y.$$

Thus, the appropriate filter function is defined by:

$$G_\lambda(\sigma) = \eta \sum_{i=0}^{t-1} (1 - \eta \sigma)^{i}.$$

It can be shown that this filter function corresponds to a truncated power expansion of $K^{-1}$; [5] to see this, note that the relation $\sum_{i \geq 0} x^{i} = \frac{1}{1 - x}$ would still hold if $x$ is replaced by a matrix; thus, if $K$ (the kernel matrix), or rather $I - \eta K$, is considered, the following holds:

$$K^{-1} = \eta \sum_{i=0}^{\infty} (I - \eta K)^{i} \approx \eta \sum_{i=0}^{t-1} (I - \eta K)^{i}.$$

In this setting, the number of iterations gives the regularization parameter; roughly speaking, $t \approx \frac{1}{\lambda}$. [5] If $t$ is large, overfitting may be a concern. If $t$ is small, oversmoothing may be a concern. Thus, choosing an appropriate time for early stopping of the iterations provides a regularization effect.
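A short sketch (again with synthetic data and a Gaussian kernel; names are illustrative and not from the cited sources) runs the gradient-descent form of the iteration and checks that it matches the truncated power-series filter applied to the eigenvalues:

```python
# Sketch: Landweber iteration as gradient descent on the empirical risk, and its
# equivalence with the truncated power-series filter. Synthetic data, illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, t = 30, 200                                       # t plays the role of 1/lambda
x = rng.uniform(-3, 3, n)
Y = np.sin(x) + 0.1 * rng.standard_normal(n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)    # Gaussian kernel matrix

eta = 2.0 / n            # valid step-size when n exceeds K's largest eigenvalue (true here)

# Gradient-descent form: c^i = c^{i-1} + eta * (Y - K c^{i-1}), starting from c^0 = 0.
c = np.zeros(n)
for _ in range(t):
    c = c + eta * (Y - K @ c)

# Filter form: G_lambda(sigma) = eta * sum_{i=0}^{t-1} (1 - eta*sigma)^i, applied eigenvalue-wise.
sigma, Q = np.linalg.eigh(K)
g = eta * np.array([np.sum((1.0 - eta * s) ** np.arange(t)) for s in sigma])
c_filter = Q @ (g * (Q.T @ Y))

print(np.allclose(c, c_filter))                      # True: both forms coincide
```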

Filter function for TSVD

In the TSVD setting, given the eigen-decomposition $K = Q \Sigma Q^{T}$ and using a prescribed threshold $\lambda n$, a regularized inverse can be formed for the kernel matrix by discarding all the eigenvalues that are smaller than this threshold. [5] Thus, the filter function for TSVD can be defined as

$$G_\lambda(\sigma) = \begin{cases} \dfrac{1}{\sigma}, & \text{if } \sigma \geq \lambda n, \\ 0, & \text{otherwise.} \end{cases}$$

It can be shown that TSVD is equivalent to the (unsupervised) projection of the data using (kernel) Principal Component Analysis (PCA), and that it is also equivalent to minimizing the empirical risk on the projected data (without regularization). [5] Note that the number of components kept for the projection is the only free parameter here.
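A minimal sketch of the TSVD filter follows (synthetic data, Gaussian kernel, illustrative names; the threshold $\lambda n$ is as defined above, not a value from the cited sources):

```python
# Sketch: the TSVD filter keeps 1/sigma for eigenvalues at or above lambda*n
# and discards the rest. Synthetic data, illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, lam = 30, 1e-3
x = rng.uniform(-3, 3, n)
Y = np.sin(x) + 0.1 * rng.standard_normal(n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)    # Gaussian kernel matrix

sigma, Q = np.linalg.eigh(K)
mask = sigma >= lam * n                              # keep only eigenvalues above the threshold
g = np.zeros(n)
g[mask] = 1.0 / sigma[mask]                          # G_lambda(sigma): 1/sigma on kept components, else 0
c_tsvd = Q @ (g * (Q.T @ Y))

print(int(mask.sum()), "of", n, "spectral components kept")
```

In this sketch, the number of components that survive the threshold is the quantity that plays the role of the single free parameter mentioned above.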

References

  1. H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer, 1996.
  2. L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral Algorithms for Supervised Learning, Neural Computation, 20(7), 2008.
  3. P. C. Hansen, J. G. Nagy, and D. P. O'Leary. Deblurring Images: Matrices, Spectra, and Filtering, Fundamentals of Algorithms 3, SIAM, Philadelphia, 2006.
  4. L. Rosasco. Lecture 6 of the Lecture Notes for 9.520: Statistical Learning Theory and Applications. Massachusetts Institute of Technology, Fall 2013. Available at https://www.mit.edu/~9.520/fall13/slides/class06/class06_RLSSVM.pdf
  5. L. Rosasco. Lecture 7 of the Lecture Notes for 9.520: Statistical Learning Theory and Applications. Massachusetts Institute of Technology, Fall 2013. Available at https://www.mit.edu/~9.520/fall13/slides/class07/class07_spectral.pdf