Random feature

Random features (RF) are a technique used in machine learning to approximate kernel methods, introduced by Ali Rahimi and Benjamin Recht in their 2007 paper "Random Features for Large-Scale Kernel Machines" [1] and extended in subsequent work. [2] [3] RF constructs a Monte Carlo approximation to a kernel function from randomly sampled feature maps. It is used for datasets that are too large for traditional kernel methods such as support vector machines, kernel ridge regression, and Gaussian processes.

Mathematics

Kernel method

Given a feature map $\phi: \mathcal{X} \to V$, where $V$ is a Hilbert space (more specifically, a reproducing kernel Hilbert space), the kernel trick replaces inner products in feature space by a kernel function $$k(x, y) = \langle \phi(x), \phi(y) \rangle_V.$$ Kernel methods replace linear operations in the high-dimensional space by operations on the kernel matrix $$K_X := [k(x_i, x_j)]_{i,j \in \{1,\dots,N\}},$$ where $N$ is the number of data points.
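As a concrete instance of the kernel trick, the degree-2 polynomial kernel on $\mathbb{R}^2$ admits an explicit three-dimensional feature map, so the identity $\langle \phi(x), \phi(y) \rangle = \langle x, y \rangle^2$ can be checked directly (a minimal illustration; the feature map and test points are chosen for this example, not taken from the source):

```python
import numpy as np

# Explicit feature map for the degree-2 polynomial kernel on R^2:
# phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), so <phi(x), phi(y)> = <x, y>^2.
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])

# Both sides evaluate the same kernel; the right side never forms phi.
print(phi(x) @ phi(y), (x @ y) ** 2)  # both equal 1.0
```

The point of the trick is the right-hand side: the kernel value is computed without ever materializing the feature vectors.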

Random kernel method

The problem with kernel methods is that the kernel matrix $K_X$ has size $N \times N$. This becomes computationally infeasible when $N$ reaches the order of a million. The random kernel method replaces the kernel function $k$ by an inner product in a low-dimensional feature space $\mathbb{R}^D$: $$k(x, y) \approx \langle z(x), z(y) \rangle,$$ where $z: \mathcal{X} \to \mathbb{R}^D$ is a randomly sampled feature map.

This converts kernel linear regression into linear regression in feature space, kernel SVM into SVM in feature space, etc. Since we have $K_X \approx Z Z^T$, where $Z := [z(x_1), \dots, z(x_N)]^T$, these methods no longer involve matrices of size $N \times N$, but only random feature matrices of size $N \times D$.
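The conversion can be sketched as ridge regression on featurized data: an $N \times 2D$ design matrix stands in for the $N \times N$ kernel matrix (a minimal sketch using a random Fourier map for the RBF kernel; the sizes, bandwidth, and regularization below are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative sizes).
N, d, D = 200, 2, 500
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

# Random Fourier feature map for the RBF kernel with sigma = 1:
# cos/sin pairs of random projections, scaled by 1/sqrt(D).
W = rng.normal(size=(d, D))

def z(X):
    return np.hstack([np.cos(X @ W), np.sin(X @ W)]) / np.sqrt(D)

# Featurized ridge regression: solve a (2D x 2D) linear system
# instead of working with the (N x N) kernel matrix.
lam = 1e-3
Z = z(X)
w = np.linalg.solve(Z.T @ Z + lam * np.eye(2 * D), Z.T @ y)
pred = Z @ w

print(np.mean((pred - y) ** 2))  # small training error
```

Prediction on a new point only requires featurizing it and taking one inner product, so test-time cost is independent of $N$.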

Random Fourier feature

Radial basis function kernel

The radial basis function (RBF) kernel on two samples $x, y \in \mathbb{R}^d$ is defined as [4] $$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right),$$ where $\|x - y\|^2$ is the squared Euclidean distance and $\sigma$ is a free parameter defining the shape of the kernel. It can be approximated by a random Fourier feature map $z: \mathbb{R}^d \to \mathbb{R}^{2D}$: $$z(x) := \frac{1}{\sqrt{D}} \left[\cos\langle \omega_1, x\rangle, \sin\langle \omega_1, x\rangle, \dots, \cos\langle \omega_D, x\rangle, \sin\langle \omega_D, x\rangle\right]^T,$$ where $\omega_1, \dots, \omega_D$ are IID samples from the multidimensional normal distribution $N(0, \sigma^{-2} I)$.
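The approximation can be checked directly by comparing the inner product of two featurized points against the exact RBF kernel (the dimension, feature count $D$, and bandwidth $\sigma$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, sigma = 3, 5000, 1.0

# omega_i ~ N(0, sigma^{-2} I); the feature map stacks cos/sin pairs.
omega = rng.normal(scale=1.0 / sigma, size=(D, d))

def z(x):
    proj = omega @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
approx = z(x) @ z(y)
print(exact, approx)  # close for large D
```

By the product-to-sum identity, $\langle z(x), z(y)\rangle = \frac{1}{D}\sum_i \cos\langle\omega_i, x - y\rangle$, which is exactly the Monte Carlo estimate whose mean is the kernel.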

Theorem

  1. (Unbiased estimation) $\mathbb{E}_\omega[\langle z(x), z(y) \rangle] = k(x, y)$.
  2. (Variance bound) $\operatorname{Var}_\omega[\langle z(x), z(y) \rangle] = O(D^{-1})$.
  3. (Convergence) As $D \to \infty$, the approximation converges in probability to the true kernel.
Proof

(Unbiased estimation) By independence of $\omega_1, \dots, \omega_D$, it suffices to prove the case $D = 1$. By the trigonometric identity $\cos(a - b) = \cos a \cos b + \sin a \sin b$, $$\langle z(x), z(y) \rangle = \cos\langle \omega_1, x - y \rangle.$$ Apply the spherical symmetry of the normal distribution, then evaluate the integral: $$\mathbb{E}[\cos\langle \omega_1, x - y \rangle] = e^{-\|x - y\|^2 / (2\sigma^2)}.$$

(Variance bound) Since the terms $\cos\langle \omega_i, x - y \rangle$ are IID, it suffices to prove that the variance of a single term is finite, which is true since each term is bounded within $[-1, 1]$.

(Convergence) This follows from the variance bound by Chebyshev's inequality.

Since the summands $\cos\langle \omega_i, x - y \rangle$ are bounded, there is a stronger convergence guarantee by Hoeffding's inequality. [1] (Claim 1)
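The convergence in $D$ can be observed numerically by comparing the Monte Carlo estimate $\frac{1}{D}\sum_i \cos\langle\omega_i, x - y\rangle$ against the exact RBF kernel as $D$ grows (an illustrative sketch; the specific sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma = 2, 1.0
x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def rff_estimate(D):
    # <z(x), z(y)> reduces to a mean of cosines of random projections.
    omega = rng.normal(scale=1.0 / sigma, size=(D, d))
    return np.mean(np.cos(omega @ (x - y)))

errs = {D: abs(rff_estimate(D) - exact) for D in (10, 100, 10000)}
print(errs)  # errors shrink roughly like 1/sqrt(D)
```

Individual draws fluctuate, but the $O(D^{-1})$ variance bound means the typical error decays like $D^{-1/2}$, consistent with the Chebyshev and Hoeffding guarantees above.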

Random Fourier features

By Bochner's theorem, the above construction can be generalized to an arbitrary positive definite shift-invariant kernel $k(x, y) = k(x - y)$.

Define its Fourier transform $$p(\omega) = \frac{1}{(2\pi)^d} \int e^{-i\langle \omega, \Delta \rangle} k(\Delta) \, d\Delta;$$ then $\omega_1, \dots, \omega_D$ are sampled IID from the probability distribution with probability density $p(\omega)$. This applies to other kernels such as the Laplace kernel and the Cauchy kernel.
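For instance, for the Laplace kernel $k(x, y) = e^{-\gamma\|x - y\|_1}$ the spectral density factorizes into one-dimensional Cauchy densities, so each coordinate of $\omega$ can be drawn as a scaled standard Cauchy sample (a sketch; $\gamma$ and the sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, D, gamma = 2, 20000, 0.7

# For k(x, y) = exp(-gamma * ||x - y||_1), Bochner's theorem gives a
# product-of-Cauchy spectral density: each coordinate of omega is
# drawn from Cauchy(0, gamma).
omega = gamma * rng.standard_cauchy(size=(D, d))

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-gamma * np.sum(np.abs(x - y)))
approx = np.mean(np.cos(omega @ (x - y)))
print(exact, approx)  # close for large D
```

The estimator stays well behaved despite the heavy-tailed sampling distribution, because the cosine summands remain bounded in $[-1, 1]$.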

Neural network interpretation

Given a random Fourier feature map $z$, training the features on a dataset by featurized linear regression is equivalent to fitting complex parameters $c_1, \dots, c_D \in \mathbb{C}$ such that $$f(x) = \sum_{j=1}^D c_j e^{i\langle \omega_j, x \rangle},$$ which is a neural network with a single hidden layer, with activation function $t \mapsto e^{it}$, zero bias, and the parameters in the first layer frozen.

In the overparameterized case, when $D \geq N$, the network can linearly interpolate the dataset $\{(x_i, y_i)\}_{i=1}^N$, and the network parameters are the least-norm solution: $$\hat{c} = \arg\min \{ \|c\|_2 : f(x_i) = y_i \text{ for all } i \}.$$ In the limit $D \to \infty$, the L2 norm $\|\hat{c}\|_2$ (suitably rescaled) converges to $\|f^*\|_{\mathcal{H}}$, where $f^*$ is the interpolating function obtained by kernel regression with the original kernel, and $\|\cdot\|_{\mathcal{H}}$ is the norm in the reproducing kernel Hilbert space for the kernel. [5]

Other examples

Random binning features

A random binning feature map partitions the input space using randomly shifted grids at randomly chosen resolutions and assigns to an input point a binary bit string that corresponds to the bins in which it falls. The grids are constructed so that the probability that two points $x, y$ are assigned to the same bin is proportional to $k(x, y)$. The inner product between a pair of transformed points is proportional to the number of times the two points are binned together, and is therefore an unbiased estimate of $k(x, y)$. Since this mapping is not smooth and uses the proximity between input points, random binning features work well for approximating kernels that depend only on the distance between datapoints.
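The key collision-probability property can be checked for a single fixed resolution: under a uniformly random shift, two points fall in the same bin of a pitch-$\delta$ grid with probability $\max(0, 1 - |x - y|/\delta)$. The full scheme additionally samples the pitch from a distribution matched to the target kernel; the sketch below fixes the pitch for simplicity (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
P, delta = 50000, 2.0  # P independent grids, fixed pitch delta
x, y = 0.3, 1.1

# Each grid is shifted by u ~ Uniform[0, delta); two points share a
# bin with probability max(0, 1 - |x - y| / delta), the "hat" kernel.
u = rng.uniform(0, delta, size=P)
same_bin = np.floor((x - u) / delta) == np.floor((y - u) / delta)

est = same_bin.mean()
exact = max(0.0, 1 - abs(x - y) / delta)
print(est, exact)
```

Averaging over many grids turns the binary collision indicator into an unbiased estimate of the hat kernel; mixing over random pitches extends this to other distance-dependent kernels.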

Orthogonal random features

Orthogonal random features [6] use a random orthogonal matrix instead of an unstructured random Gaussian matrix for the projections, which reduces the approximation error of the kernel estimate.
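One orthogonal block can be generated from the QR decomposition of a Gaussian matrix, with rows rescaled by chi-distributed norms so that each row's length matches that of a Gaussian sample (a sketch of the construction as described in [6]; the dimension is illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 8

# QR of a Gaussian matrix yields a random orthogonal Q; rescaling its
# rows by chi-distributed norms restores the row-length distribution
# of an unstructured Gaussian projection matrix.
G = rng.normal(size=(d, d))
Q, _ = np.linalg.qr(G)
S = np.sqrt(rng.chisquare(d, size=d))
W = S[:, None] * Q  # the orthogonal random feature projection block

print(np.allclose(Q @ Q.T, np.eye(d)))  # True: rows are orthonormal
```

The rows of $W$ remain exactly orthogonal to one another, which is the source of the variance reduction relative to independent Gaussian rows.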

Historical context

In NIPS 2006, deep learning had just become competitive with linear models like PCA and linear SVMs for large datasets, and people speculated about whether it could compete with kernel SVMs. However, there was no way to train kernel SVMs on large datasets. The two authors developed the random feature method to make such training feasible.

It was then found that the variance bound did not match practice: the variance bound predicts that approximating the kernel to within $\epsilon$ requires on the order of $\epsilon^{-2}$ features, but in practice far fewer were needed. Attempting to discover the cause of this led to the two subsequent papers. [2] [3] [7]


References

  1. Rahimi, Ali; Recht, Benjamin (2007). "Random Features for Large-Scale Kernel Machines". Advances in Neural Information Processing Systems. 20.
  2. Rahimi, Ali; Recht, Benjamin (September 2008). "Uniform approximation of functions with random bases". 2008 46th Annual Allerton Conference on Communication, Control, and Computing. IEEE. pp. 555–561. doi:10.1109/allerton.2008.4797607. ISBN 978-1-4244-2925-7.
  3. Rahimi, Ali; Recht, Benjamin (2008). "Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning". Advances in Neural Information Processing Systems. 21. Curran Associates, Inc.
  4. Vert, Jean-Philippe; Tsuda, Koji; Schölkopf, Bernhard (2004). "A primer on kernel methods". Kernel Methods in Computational Biology.
  5. Belkin, Mikhail (May 2021). "Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation". Acta Numerica. 30: 203–248. arXiv:2105.14368. doi:10.1017/S0962492921000039. ISSN 0962-4929.
  6. Yu, Felix Xinnan X; Suresh, Ananda Theertha; Choromanski, Krzysztof M; Holtmann-Rice, Daniel N; Kumar, Sanjiv (2016). "Orthogonal Random Features". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc.
  7. Recht, Benjamin. "Reflections on Random Kitchen Sinks". arg min blog. Retrieved 2024-09-29.