Radial basis function kernel

In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification. [1]

Definition

The RBF kernel on two samples $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^k$, represented as feature vectors in some input space, is defined as [2]

$$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$

$\|\mathbf{x} - \mathbf{x}'\|^2$ may be recognized as the squared Euclidean distance between the two feature vectors, and $\sigma$ is a free parameter. An equivalent definition involves a parameter $\gamma = \tfrac{1}{2\sigma^2}$:

$$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2\right)$$

Since the value of the RBF kernel decreases with distance and ranges between zero (in the infinite-distance limit) and one (when $\mathbf{x} = \mathbf{x}'$), it has a ready interpretation as a similarity measure. [2] The feature space of the kernel has an infinite number of dimensions; for $\sigma = 1$, its expansion using the multinomial theorem is: [3]

$$\exp\left(-\tfrac{1}{2}\|\mathbf{x} - \mathbf{x}'\|^2\right) = \exp\left(\mathbf{x}^\top \mathbf{x}'\right) \exp\left(-\tfrac{1}{2}\|\mathbf{x}\|^2\right) \exp\left(-\tfrac{1}{2}\|\mathbf{x}'\|^2\right) = \sum_{j=0}^{\infty} \frac{(\mathbf{x}^\top \mathbf{x}')^j}{j!} \exp\left(-\tfrac{1}{2}\|\mathbf{x}\|^2\right) \exp\left(-\tfrac{1}{2}\|\mathbf{x}'\|^2\right) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{x}') \rangle$$

where

$$\varphi(\mathbf{x}) = \exp\left(-\tfrac{1}{2}\|\mathbf{x}\|^2\right) \left(a^{(0)}_{\ell_0}, a^{(1)}_1, \dots, a^{(1)}_{\ell_1}, \dots, a^{(j)}_1, \dots, a^{(j)}_{\ell_j}, \dots\right),$$

$$\ell_j = \binom{k+j-1}{j}, \qquad a^{(j)}_\ell = \frac{x_1^{n_1} \cdots x_k^{n_k}}{\sqrt{n_1! \cdots n_k!}}, \quad n_1 + \cdots + n_k = j, \; 1 \le \ell \le \ell_j$$
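
For concreteness, here is a minimal NumPy sketch of both parameterizations of the kernel; the function names are illustrative, not taken from the cited sources:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = np.sum((x - y) ** 2)  # squared Euclidean distance
    return np.exp(-sq_dist / (2.0 * sigma**2))

def rbf_kernel_gamma(x, y, gamma=0.5):
    """Equivalent form exp(-gamma * ||x - y||^2) with gamma = 1/(2 sigma^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
# The two parameterizations agree when gamma = 1/(2 sigma^2).
assert np.isclose(rbf_kernel(x, y, sigma=1.0), rbf_kernel_gamma(x, y, gamma=0.5))
```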

Approximations

Because support vector machines and other models employing the kernel trick do not scale well to large numbers of training samples or large numbers of features in the input space, several approximations to the RBF kernel (and similar kernels) have been introduced. [4] Typically, these take the form of a function $z$ that maps a single vector to a vector of higher dimensionality, approximating the kernel:

$$\langle z(\mathbf{x}), z(\mathbf{x}') \rangle \approx \langle \varphi(\mathbf{x}), \varphi(\mathbf{x}') \rangle = K(\mathbf{x}, \mathbf{x}')$$

where $\varphi$ is the implicit mapping embedded in the RBF kernel.

Fourier random features


One way to construct such a $z$ is to randomly sample from the Fourier transformation of the kernel: [5]

$$z(\mathbf{x}) = \frac{1}{\sqrt{D}} \left[\cos\langle w_1, \mathbf{x}\rangle, \sin\langle w_1, \mathbf{x}\rangle, \dots, \cos\langle w_D, \mathbf{x}\rangle, \sin\langle w_D, \mathbf{x}\rangle\right]^\top$$

where $w_1, \dots, w_D$ are independent samples from the normal distribution $N(0, \sigma^{-2} I)$.

Theorem: $\mathbb{E}\left[\langle z(\mathbf{x}), z(\mathbf{y}) \rangle\right] = e^{-\|\mathbf{x} - \mathbf{y}\|^2 / (2\sigma^2)}$.

Proof: It suffices to prove the case of $D = 1$. Use the trigonometric identity $\cos(a - b) = \cos a \cos b + \sin a \sin b$ and the spherical symmetry of the Gaussian distribution, then evaluate the integral

$$\int_{-\infty}^{\infty} \cos(kx)\, \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx = e^{-k^2/2}.$$

Theorem: $\operatorname{Var}\left[\langle z(\mathbf{x}), z(\mathbf{y}) \rangle\right] = O(D^{-1})$. (Appendix A.2 of [6].)
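
The following is a minimal NumPy sketch of this construction (the function name and the choice of $D$ are illustrative): it draws $w_1, \dots, w_D$ from $N(0, \sigma^{-2} I)$ and checks that $\langle z(\mathbf{x}), z(\mathbf{y}) \rangle$ is close to the exact kernel value for large $D$.

```python
import numpy as np

def random_fourier_features(X, D=2000, sigma=1.0, rng=np.random.default_rng(0)):
    """Map each row of X to a 2D-dimensional vector z(x) whose inner
    products approximate the RBF kernel with bandwidth sigma."""
    W = rng.normal(scale=1.0 / sigma, size=(D, X.shape[1]))  # w_i ~ N(0, sigma^-2 I)
    proj = X @ W.T                                           # <w_i, x> for every sample
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
Z = random_fourier_features(np.vstack([x, y]))  # same w_i for both inputs
approx = Z[0] @ Z[1]                            # <z(x), z(y)>
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)     # K(x, y) with sigma = 1
print(approx, exact)                            # agree up to O(1/sqrt(D)) noise
```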

Nyström method

Another approach uses the Nyström method to approximate the eigendecomposition of the Gram matrix K, using only a random sample of the training set. [7]
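
A minimal sketch of the idea, assuming an RBF kernel and $m$ uniformly sampled landmark points (the function and variable names are ours): the returned feature matrix $\Phi$ satisfies $\Phi \Phi^\top \approx K$.

```python
import numpy as np

def nystrom_features(X, m=100, sigma=1.0, rng=np.random.default_rng(0)):
    """Nystrom approximation: n x m features Phi with Phi @ Phi.T ~= K."""
    idx = rng.choice(X.shape[0], size=m, replace=False)  # random sample of the training set
    landmarks = X[idx]

    def rbf(A, B):  # pairwise RBF kernel matrix between rows of A and rows of B
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq / (2.0 * sigma**2))

    K_mm = rbf(landmarks, landmarks)    # m x m Gram matrix of the sample
    K_nm = rbf(X, landmarks)            # n x m cross-kernel matrix
    vals, vecs = np.linalg.eigh(K_mm)   # eigendecomposition of the sampled Gram matrix
    vals = np.maximum(vals, 1e-12)      # guard against round-off negatives
    K_mm_inv_sqrt = (vecs * vals**-0.5) @ vecs.T
    return K_nm @ K_mm_inv_sqrt         # Phi, so that Phi @ Phi.T = K_nm K_mm^{-1} K_nm^T
```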


References

  1. Chang, Yin-Wen; Hsieh, Cho-Jui; Chang, Kai-Wei; Ringgaard, Michael; Lin, Chih-Jen (2010). "Training and testing low-degree polynomial data mappings via linear SVM". Journal of Machine Learning Research. 11: 1471–1490.
  2. Vert, Jean-Philippe; Tsuda, Koji; Schölkopf, Bernhard (2004). "A primer on kernel methods". Kernel Methods in Computational Biology.
  3. Shashua, Amnon (2009). "Introduction to Machine Learning: Class Notes 67577". arXiv: 0904.3664v1 [cs.LG].
  4. Müller, Andreas (2012). Kernel Approximations for Efficient SVMs (and other feature extraction methods).
  5. Rahimi, Ali; Recht, Benjamin (2007). "Random Features for Large-Scale Kernel Machines". Advances in Neural Information Processing Systems. 20. Curran Associates, Inc.
  6. Peng, Hao; Pappas, Nikolaos; Yogatama, Dani; Schwartz, Roy; Smith, Noah A.; Kong, Lingpeng (2021-03-19). "Random Feature Attention". arXiv: 2103.02143 [cs.CL].
  7. Williams, C. K. I.; Seeger, M. (2001). "Using the Nyström method to speed up kernel machines". Advances in Neural Information Processing Systems. 13.