Low-rank matrix approximations


Low-rank matrix approximations are essential tools in the application of kernel methods to large-scale learning problems.[1]


Kernel methods (for instance, support vector machines or Gaussian processes[2]) project data points into a high-dimensional or infinite-dimensional feature space and find, for example, the optimal splitting hyperplane there. In kernel methods the data is represented in a kernel matrix (or Gram matrix). Many algorithms can solve machine learning problems using the kernel matrix. The main problem of kernel methods is the high computational cost associated with the kernel matrix. The cost is at least quadratic in the number of training data points, and most kernel methods additionally require a matrix inversion or an eigenvalue decomposition, which makes the cost cubic in the number of training data points. Large training sets therefore cause large storage and computational costs. While low-rank decomposition methods (such as the Cholesky decomposition) reduce this cost, they still require computing the full kernel matrix. One approach to dealing with this problem is low-rank matrix approximation. The most popular examples are the Nyström method and random features. Both have been successfully applied to efficient kernel learning.
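To make the quadratic storage and cubic solve costs concrete, the following sketch (an illustration only; the RBF kernel, the helper name rbf_kernel and the problem sizes are arbitrary assumptions) builds the full $n \times n$ Gram matrix and solves a single regularized kernel system with dense linear algebra:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gram matrix with entries K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

n, d = 2000, 10
rng = np.random.default_rng(0)
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

K = rbf_kernel(X, X)                               # O(n^2) storage
alpha = np.linalg.solve(K + 1e-3 * np.eye(n), y)   # O(n^3) time for the dense solve
```

Doubling the number of training points quadruples the memory for K and roughly multiplies the solve time by eight, which is what the low-rank approximations below are designed to avoid.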

Nyström approximation

Kernel methods become infeasible when the number of points $n$ is so large that the kernel matrix $K$ cannot be stored in memory.

If $n$ is the number of training examples, the storage and computational cost required to find the solution of the problem using a general kernel method are $O(n^2)$ and $O(n^3)$ respectively. The Nyström approximation can allow a significant speed-up of the computations.[2][3] This speed-up is achieved by using, instead of the kernel matrix $K$, its approximation $\tilde{K}$ of rank $q$. An advantage of the method is that it is not necessary to compute or store the whole kernel matrix, but only a submatrix of size $n \times q$.

It reduces the storage and complexity requirements to $O(nq)$ and $O(nq^2)$ respectively.

The method is named "Nyström approximation" because it can be interpreted as a case of the Nyström method from integral equation theory.[3]

Theorem for kernel approximation

$K$ is a kernel matrix for some kernel method, with entries $K_{ij} = k(x_i, x_j)$ for a kernel function $k$ and training points $x_1, \dots, x_n$. Consider the first $q < n$ points in the training set. Then there exists a matrix $\tilde{K}$ of rank $q$:

$$\tilde{K} = K_{n,q} K_q^{-1} K_{n,q}^{\mathrm{T}},$$ where

$$(K_q)_{ij} = k(x_i, x_j), \quad i, j = 1, \dots, q,$$

$K_q$ is an invertible matrix,

and

$$(K_{n,q})_{ij} = k(x_i, x_j), \quad i = 1, \dots, n, \; j = 1, \dots, q.$$

Proof

Singular-value decomposition application

Applying singular-value decomposition (SVD) to a matrix $A$ with dimensions $p \times m$ produces a singular system consisting of singular values $\sigma_1 \geq \dots \geq \sigma_k > 0$ (with $k = \operatorname{rank}(A)$) and vectors $v_j \in \mathbb{R}^m$ and $u_j \in \mathbb{R}^p$ such that they form orthonormal bases of $\mathbb{R}^m$ and $\mathbb{R}^p$ respectively:

$$A v_j = \sigma_j u_j \quad \text{and} \quad A^{\mathrm{T}} u_j = \sigma_j v_j \quad \text{for } j = 1, \dots, k, \qquad A v_j = 0 \text{ and } A^{\mathrm{T}} u_j = 0 \text{ for } j > k.$$

If $U$ and $V$ are the matrices with the $u_j$'s and $v_j$'s in their columns and $\Sigma$ is the diagonal $p \times m$ matrix having the singular values on its first $k$ diagonal entries (all the other elements of the matrix are zeros), then the matrix $A$ can be rewritten as:[4]

$$A = U \Sigma V^{\mathrm{T}}.$$
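A short NumPy check of this factorization and of the rank-$k$ truncation it yields (the matrix and the sizes are arbitrary; this is only an illustration of the decomposition used in the proof below):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                    # a p x m matrix

U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros_like(A)                       # p x m, singular values on the first k diagonal entries
Sigma[:len(s), :len(s)] = np.diag(s)

assert np.allclose(A, U @ Sigma @ Vt)          # A = U Sigma V^T

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k truncation (Eckart-Young)
print(np.linalg.norm(A - A_k, 2), s[k])        # spectral-norm error equals the (k+1)-th singular value
```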

Further proof
  • $X = [x_1, \dots, x_n]^{\mathrm{T}}$ is the $n \times d$ data matrix, and the kernel is taken to be linear, so that $K = X X^{\mathrm{T}}$ (the case of a general kernel is treated below).

Applying singular-value decomposition to this matrix:

$$X = U \Sigma V^{\mathrm{T}}$$

  • $\hat{X}$ is the $q \times d$ matrix consisting of the first $q$ rows of the matrix $X$, so that $K_q = \hat{X} \hat{X}^{\mathrm{T}}$.

Applying singular-value decomposition to this matrix:

$$\hat{X} = \hat{U} \hat{\Sigma} \hat{V}^{\mathrm{T}},$$

where, since $K_q$ is invertible, $\hat{U}$ is a $q \times q$ orthogonal matrix, $\hat{\Sigma}$ is a $q \times q$ invertible diagonal matrix and $\hat{V}$ is a $d \times q$ matrix with orthonormal columns.

Since $V$ and $\hat{V}$ have orthonormal columns,

$$K = X X^{\mathrm{T}} = U \Sigma \Sigma^{\mathrm{T}} U^{\mathrm{T}} \qquad \text{and} \qquad K_q = \hat{X} \hat{X}^{\mathrm{T}} = \hat{U} \hat{\Sigma}^{2} \hat{U}^{\mathrm{T}}.$$

Replacing $V$ and $\Sigma$ by $\hat{V}$ and $\hat{\Sigma}$ in the relation $U = X V \Sigma^{-1}$ (valid for the compact form of the SVD), an approximation for $U$ can be obtained:

$$\tilde{U} = X \hat{V} \hat{\Sigma}^{-1}$$

($\tilde{U}$ is not necessarily an orthogonal matrix).

However, defining $\tilde{K} = \tilde{U} \hat{\Sigma}^{2} \tilde{U}^{\mathrm{T}}$, it can be computed the next way:

$$\tilde{K} = \tilde{U} \hat{\Sigma}^{2} \tilde{U}^{\mathrm{T}} = X \hat{V} \hat{\Sigma}^{-1} \hat{\Sigma}^{2} \hat{\Sigma}^{-1} \hat{V}^{\mathrm{T}} X^{\mathrm{T}} = X \hat{V} \hat{V}^{\mathrm{T}} X^{\mathrm{T}} = X \hat{X}^{\mathrm{T}} \left\{ \hat{U} \hat{\Sigma}^{-1} \hat{\Sigma}^{-1} \hat{U}^{\mathrm{T}} \right\} \hat{X} X^{\mathrm{T}},$$

where $\hat{V} = \hat{X}^{\mathrm{T}} \hat{U} \hat{\Sigma}^{-1}$ (obtained from $\hat{X} = \hat{U} \hat{\Sigma} \hat{V}^{\mathrm{T}}$) has been substituted in the last step.

By the characterization of the orthogonal matrix $\hat{U}$, the equality $\hat{U}^{-1} = \hat{U}^{\mathrm{T}}$ holds. Then, using the formula for the inverse of a product of invertible matrices, $(A B)^{-1} = B^{-1} A^{-1}$, the expression in braces can be rewritten as:

$$\hat{U} \hat{\Sigma}^{-1} \hat{\Sigma}^{-1} \hat{U}^{\mathrm{T}} = \left( \hat{U} \hat{\Sigma}^{2} \hat{U}^{\mathrm{T}} \right)^{-1} = K_q^{-1}.$$

Then the expression for $\tilde{K}$:

$$\tilde{K} = X \hat{X}^{\mathrm{T}} K_q^{-1} \hat{X} X^{\mathrm{T}}.$$

Defining $K_{n,q} = X \hat{X}^{\mathrm{T}}$, which for the linear kernel is exactly the $n \times q$ matrix of kernel values from the theorem, the proof is finished:

$$\tilde{K} = K_{n,q} K_q^{-1} K_{n,q}^{\mathrm{T}}.$$

General theorem for kernel approximation for a feature map

For a feature map $\Phi \colon \mathcal{X} \to F$ with associated kernel $k(x, x') = \langle \Phi(x), \Phi(x') \rangle_F$, the equality $\tilde{K} = K_{n,q} K_q^{-1} K_{n,q}^{\mathrm{T}}$ also follows by replacing $X$ by the operator $\Phi_X \colon F \to \mathbb{R}^n$ such that $(\Phi_X w)_i = \langle \Phi(x_i), w \rangle_F$, $i = 1, \dots, n$, $w \in F$, and $\hat{X}$ by the operator $\Phi_{\hat{X}} \colon F \to \mathbb{R}^q$ such that $(\Phi_{\hat{X}} w)_i = \langle \Phi(x_i), w \rangle_F$, $i = 1, \dots, q$, $w \in F$. Once again, a simple inspection shows that the feature map is only needed in the proof while the end result only depends on computing the kernel function.
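The construction above can be sketched in a few lines of NumPy (an illustration only, assuming an RBF kernel and illustrative helper names; the full kernel matrix is formed here solely to measure the approximation error and would never be built in practice):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Kernel block with entries k(x_i, y_j) = exp(-gamma * ||x_i - y_j||^2)."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
n, d, q = 1000, 3, 50
X = rng.normal(size=(n, d))

K_nq = rbf_kernel(X, X[:q])                        # n x q block: the only part that must be formed
K_q = K_nq[:q, :]                                  # q x q block K_q
K_tilde = K_nq @ np.linalg.solve(K_q, K_nq.T)      # rank-q approximation K_nq K_q^{-1} K_nq^T

K = rbf_kernel(X, X)                               # full kernel matrix, only for the error check
print(np.linalg.norm(K - K_tilde) / np.linalg.norm(K))
```

Increasing $q$ drives the relative error toward zero while the stored blocks stay of size $n \times q$ and $q \times q$.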

Application for regularized least squares

In vector and kernel notation, the problem of regularized least squares can be rewritten as (here $Y \in \mathbb{R}^n$ is the vector of training labels and $c \in \mathbb{R}^n$ is the vector of coefficients of the kernel expansion):

$$\min_{c \in \mathbb{R}^n} \frac{1}{n} \left\| Y - K c \right\|_{\mathbb{R}^n}^2 + \lambda \langle c, K c \rangle_{\mathbb{R}^n}.$$

Computing the gradient with respect to $c$ and setting it to 0, the minimum can be obtained:

$$-\frac{1}{n} K (Y - K c) + \lambda K c = 0$$
$$\Rightarrow \; K (K + \lambda n I) c = K Y$$
$$\Rightarrow \; c = (K + \lambda n I)^{-1} Y,$$

where the last step takes the particular solution of the linear system obtained by dropping the common factor $K$.

The inverse matrix $(K + \lambda n I)^{-1}$ can be computed, with $K$ replaced by its Nyström approximation $\tilde{K}$, using the Woodbury matrix identity:

$$\left( \tilde{K} + \lambda n I \right)^{-1} = \left( K_{n,q} K_q^{-1} K_{n,q}^{\mathrm{T}} + \lambda n I \right)^{-1} = \frac{1}{\lambda n} \left( I - K_{n,q} \left( \lambda n K_q + K_{n,q}^{\mathrm{T}} K_{n,q} \right)^{-1} K_{n,q}^{\mathrm{T}} \right).$$

It has the desired storage and complexity requirements of $O(nq)$ and $O(nq^2)$: only the blocks $K_{n,q}$ and $K_q$ need to be stored, and the only matrix to be inverted is of size $q \times q$.
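A minimal NumPy sketch of this Woodbury-based computation (the RBF kernel, the helper names and the synthetic data are illustrative assumptions, not part of the derivation above):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystroem_rls(X, Y, q, lam, gamma=1.0):
    """c = (K_tilde + lam*n*I)^{-1} Y via the Woodbury identity,
    touching only the n x q and q x q blocks of the kernel matrix."""
    n = X.shape[0]
    K_nq = rbf_kernel(X, X[:q], gamma)                 # n x q block
    K_q = K_nq[:q, :]                                  # q x q block
    inner = lam * n * K_q + K_nq.T @ K_nq              # q x q matrix to invert
    return (Y - K_nq @ np.linalg.solve(inner, K_nq.T @ Y)) / (lam * n)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)
c = nystroem_rls(X, Y, q=100, lam=1e-3)               # O(nq) memory, O(nq^2) time
```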

Randomized feature maps approximation

Let $x_1, \dots, x_n \in \mathbb{R}^d$ be samples of data and $z \colon \mathbb{R}^d \to \mathbb{R}^D$ a randomized feature map (it maps a single vector to a vector of higher dimensionality) so that the inner product between a pair of transformed points approximates their kernel evaluation:

$$\langle z(x), z(x') \rangle \approx \langle \Phi(x), \Phi(x') \rangle,$$

where $\Phi$ is the mapping embedded in the RBF kernel.

Since the range of $z$ is low-dimensional (compared with the implicit feature space of the kernel), the input can easily be transformed with it, after which different linear learning methods can be applied to approximate the answer of the corresponding nonlinear kernel. There are different randomized feature maps for computing approximations to RBF kernels, for instance random Fourier features and random binning features.

Random Fourier features

The random Fourier features map produces a Monte Carlo approximation to the feature map. The Monte Carlo method is considered to be randomized. These random features consist of sinusoids $\cos(w^{\mathrm{T}} x + b)$ randomly drawn from the Fourier transform of the kernel to be approximated, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are random variables. The line defined by $w$ is randomly chosen, then the data points are projected onto it by the mapping $x \mapsto w^{\mathrm{T}} x$. The resulting scalar is passed through a sinusoid. The product of the transformed points will approximate a shift-invariant kernel. Since the map is smooth, random Fourier features work well on interpolation tasks.
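A minimal sketch of random Fourier features for the RBF kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ (the helper names and parameter choices are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def random_fourier_features(X, D=2000, sigma=1.0, seed=0):
    """z(x) = sqrt(2/D) * cos(W^T x + b), with W drawn from the Fourier transform
    (spectral density) of the RBF kernel and b uniform on [0, 2*pi], so that
    z(x)^T z(y) approximates exp(-||x - y||^2 / (2 * sigma**2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Z = random_fourier_features(X, D=4000, sigma=1.0)

K = rbf_kernel(X, X, gamma=0.5)          # exact RBF kernel with sigma = 1
print(np.abs(Z @ Z.T - K).max())         # Monte Carlo error, shrinks as D grows
```

The transformed matrix Z can then be fed to any linear learner (ridge regression, linear SVM, and so on) in place of the kernel matrix.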

Random binning features

A random binning features map partitions the input space using randomly shifted grids at randomly chosen resolutions and assigns to an input point a binary bit string that corresponds to the bins in which it falls. The grids are constructed so that the probability that two points $x$ and $x'$ are assigned to the same bin is proportional to $k(x, x')$. The inner product between a pair of transformed points is proportional to the number of times the two points are binned together, and is therefore an unbiased estimate of $k(x, x')$. Since this mapping is not smooth and uses the proximity between input points, random binning features work well for approximating kernels that depend only on the distance between data points.
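A sketch of this idea for the Laplacian kernel $k(x, y) = \exp(-\|x - y\|_1 / \sigma)$, assuming the random binning construction of Rahimi and Recht in which the grid pitch of each dimension is drawn from a Gamma(2, $\sigma$) distribution (this choice is an assumption, not stated in the text above); for brevity the code estimates the kernel by counting shared bins instead of materializing the binary feature vectors:

```python
import numpy as np

def random_binning_estimate(x, y, sigma=1.0, P=2000, seed=0):
    """Estimate exp(-||x - y||_1 / sigma) as the fraction of P random grids
    in which x and y fall into the same (multi-dimensional) bin."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    hits = 0
    for _ in range(P):
        delta = rng.gamma(shape=2.0, scale=sigma, size=d)   # per-dimension grid pitch
        u = rng.uniform(0.0, delta)                         # per-dimension random shift
        same = np.floor((x - u) / delta) == np.floor((y - u) / delta)
        hits += same.all()                                  # 1 if the points share every bin
    return hits / P

x = np.array([0.3, -0.2, 1.1])
y = np.array([0.5, -0.1, 0.9])
print(random_binning_estimate(x, y, sigma=1.0),
      np.exp(-np.abs(x - y).sum()))                         # Monte Carlo estimate vs exact kernel
```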

Comparison of approximation methods

The approaches for large-scale kernel learning (the Nyström method and random features) differ in that the Nyström method uses data-dependent basis functions, while in the random features approach the basis functions are sampled from a distribution independent of the training data. This difference leads to an improved analysis for kernel learning approaches based on the Nyström method. When there is a large gap in the eigenspectrum of the kernel matrix, approaches based on the Nyström method can achieve better results than the random-features-based approach.[5]


References

  1. Francis R. Bach and Michael I. Jordan (2005). "Predictive low-rank decomposition for kernel methods". ICML.
  2. Williams, C. K. I.; Seeger, M. (2001). "Using the Nyström method to speed up kernel machines". Advances in Neural Information Processing Systems. 13.
  3. Petros Drineas and Michael W. Mahoney (2005). "On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning". Journal of Machine Learning Research. 6, pp. 2153–2175.
  4. C. Eckart and G. Young (1936). "The approximation of one matrix by another of lower rank". Psychometrika. 1, pp. 211–218. doi:10.1007/BF02288367.
  5. Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin and Zhi-Hua Zhou (2012). "Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison". Advances in Neural Information Processing Systems 25 (NIPS).