Polynomial kernel

[Figure: Svm 8 polinomial.JPG] Illustration of the mapping φ. On the left, a set of samples in the input space; on the right, the same samples in the feature space, where the polynomial kernel K(x, y) (for some values of the parameters c and d) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.

In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models. It represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing the learning of non-linear models.


Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also at combinations of these. In the context of regression analysis, such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that of polynomial regression, but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), the features correspond to logical conjunctions of the input features. [1]
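
As a concrete illustration of the interaction-feature/conjunction view, the following is a minimal NumPy sketch; the example vectors and the choice of a homogeneous degree-2 kernel are illustrative assumptions, not taken from the sources cited above.

```python
import numpy as np

# Two binary (boolean) feature vectors.
x = np.array([1, 0, 1, 1])
y = np.array([1, 1, 1, 0])

# Homogeneous quadratic kernel: K(x, y) = (x.T y)**2.
K = (x @ y) ** 2

# Explicit degree-2 products x_i * x_j; for 0/1 inputs these are
# exactly the logical conjunctions "feature i AND feature j".
conj_x = np.outer(x, x).ravel()
conj_y = np.outer(y, y).ravel()

# The inner product of the conjunction vectors equals the kernel value.
assert K == conj_x @ conj_y
print(K)  # 4: the samples share 2 active features, i.e. 2**2 ordered conjunction pairs
```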

Definition

For degree-d polynomials, the polynomial kernel is defined as [2]

K(x, y) = (x^{\mathsf{T}} y + c)^d

where x and y are vectors of size n in the input space, i.e. vectors of features computed from training or test samples, and c ≥ 0 is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When c = 0, the kernel is called homogeneous. [3] (A further generalized polykernel divides xᵀy by a user-specified scalar parameter a. [4])
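
A minimal NumPy sketch of this definition follows; the helper name poly_kernel, the example vectors, and the cross-check against scikit-learn's polynomial_kernel (with gamma set to 1 so the inner product is unscaled) are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel  # optional cross-check

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x.T y + c) ** d."""
    return (np.dot(x, y) + c) ** d

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

print(poly_kernel(x, y, c=1.0, d=3))  # 166.375
# Cross-check against scikit-learn's implementation of the same formula.
print(polynomial_kernel(x.reshape(1, -1), y.reshape(1, -1), degree=3, gamma=1, coef0=1))
```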

As a kernel, K corresponds to an inner product in a feature space based on some mapping φ:

K(x, y) = \langle \varphi(x), \varphi(y) \rangle

The nature of φ can be seen from an example. Let d = 2, so we get the special case of the quadratic kernel. After using the multinomial theorem (twice; the outermost application is the binomial theorem) and regrouping,

K(x, y) = \left( \sum_{i=1}^n x_i y_i + c \right)^2 = \sum_{i=1}^n \left( x_i^2 \right)\left( y_i^2 \right) + \sum_{i=2}^n \sum_{j=1}^{i-1} \left( \sqrt{2}\, x_i x_j \right)\left( \sqrt{2}\, y_i y_j \right) + \sum_{i=1}^n \left( \sqrt{2c}\, x_i \right)\left( \sqrt{2c}\, y_i \right) + c^2

From this it follows that the feature map is given by:

\varphi(x) = \left\langle x_1^2, \ldots, x_n^2, \sqrt{2}\, x_1 x_2, \ldots, \sqrt{2}\, x_{n-1} x_n, \sqrt{2c}\, x_1, \ldots, \sqrt{2c}\, x_n, c \right\rangle

Generalizing for d ∈ ℕ, where x, y ∈ ℝⁿ, and applying the multinomial theorem:

(x^{\mathsf{T}} y + c)^d = \sum_{j_1 + j_2 + \cdots + j_{n+1} = d} \frac{d!}{j_1! \cdots j_{n+1}!}\, x_1^{j_1} \cdots x_n^{j_n}\, c^{j_{n+1}}\, y_1^{j_1} \cdots y_n^{j_n}

The last summation has \binom{n+d}{d} elements, so that:

K(x, y) = (x^{\mathsf{T}} y + c)^d = \langle \varphi(x), \varphi(y) \rangle

where

\varphi(x) = \left( a_{j_1, \ldots, j_{n+1}}\, x_1^{j_1} \cdots x_n^{j_n} \right)_{j_1 + j_2 + \cdots + j_{n+1} = d} \quad \text{and} \quad a_{j_1, \ldots, j_{n+1}} = \sqrt{\frac{d!}{j_1! \cdots j_{n+1}!}\, c^{j_{n+1}}}
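
The quadratic case above can be checked numerically. The following sketch builds the explicit feature map φ for d = 2 and confirms that its inner product reproduces the kernel; quadratic_feature_map is a hypothetical helper, and the random vectors and c = 1.5 are arbitrary illustrative choices.

```python
import numpy as np

def quadratic_feature_map(x, c):
    """Explicit phi(x) for the quadratic kernel (x.T y + c)**2."""
    n = len(x)
    squares = x ** 2
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    linear = np.sqrt(2 * c) * x
    return np.concatenate([squares, cross, linear, [c]])

rng = np.random.default_rng(0)
x, y, c = rng.normal(size=4), rng.normal(size=4), 1.5

kernel = (x @ y + c) ** 2
explicit = quadratic_feature_map(x, c) @ quadratic_feature_map(y, c)
print(np.isclose(kernel, explicit))  # True: <phi(x), phi(y)> equals K(x, y)
```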

Practical use

Although the RBF kernel is more popular in SVM classification than the polynomial kernel, the latter is quite popular in natural language processing (NLP). [1] [5] The most common degree is d = 2 (quadratic), since larger degrees tend to overfit on NLP problems.
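
For instance, a quadratic-kernel SVM in scikit-learn might look as follows; this is a hedged sketch on a toy dataset, and the make_moons data and hyperparameter values are illustrative choices rather than anything prescribed by the sources cited above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel='poly' with degree=2 and coef0=1 gives K(x, y) = (gamma * x.T y + 1)**2.
clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma="scale", C=1.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```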

Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including full expansion of the kernel prior to training/testing with a linear SVM, [5] i.e. explicit computation of the mapping φ as in polynomial regression, as well as approximate expansions and inverted-index techniques for fast kernel evaluation over sparse (e.g. text) data; [6] the sketch below illustrates the explicit-expansion route.
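
The following is an illustrative sketch of the full-expansion idea using scikit-learn's PolynomialFeatures followed by a linear SVM. Note that PolynomialFeatures produces unscaled monomials, so the expanded linear model spans the same hypothesis space as the degree-2 kernel rather than reproducing φ (with its √2 and √(2c) scaling) exactly; the dataset and parameters are arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Explicitly materialize all monomials up to degree 2, then train a linear SVM.
model = make_pipeline(PolynomialFeatures(degree=2), LinearSVC(C=1.0))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```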

One problem with the polynomial kernel is that it may suffer from numerical instability: when xᵀy + c < 1, K(x, y) = (xᵀy + c)^d tends to zero with increasing d, whereas when xᵀy + c > 1, K(x, y) tends to infinity. [4]
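
A small numerical sketch of this effect follows; the base values (standing in for xᵀy + c) and degrees are arbitrary, and the arithmetic is ordinary double precision.

```python
import numpy as np

# base plays the role of x.T y + c; d is the polynomial degree.
for base in (0.9, 1.1):
    for d in (2, 10, 100, 1000):
        print(f"({base})^{d} = {np.float64(base) ** d:.3e}")
# Bases below 1 shrink toward 0 and bases above 1 blow up as d grows,
# eventually under- or overflowing double precision for large d.
```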

Related Research Articles

<span class="mw-page-title-main">Divergence</span> Vector operator in vector calculus

In vector calculus, divergence is a vector operator that operates on a vector field, producing a scalar field giving the quantity of the vector field's source at each point. More technically, the divergence represents the volume density of the outward flux of a vector field from an infinitesimal volume around a given point.

The Cauchy–Schwarz inequality is an upper bound on the inner product between two vectors in an inner product space in terms of the product of the vector norms. It is considered one of the most important and widely used inequalities in mathematics.

<i>n</i>-sphere Generalized sphere of dimension n (mathematics)

In mathematics, an n-sphere or hypersphere is an -dimensional generalization of the -dimensional circle and -dimensional sphere to any non-negative integer . The -sphere is the setting for -dimensional spherical geometry.

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).

In the mathematical field of differential geometry, a metric tensor is an additional structure on a manifold M that allows defining distances and angles, just as the inner product on a Euclidean space allows defining distances and angles there. More precisely, a metric tensor at a point p of M is a bilinear form defined on the tangent space at p, and a metric field on M consists of a metric tensor at each point p of M that varies smoothly with p.

<span class="mw-page-title-main">Spherical harmonics</span> Special mathematical functions defined on the surface of a sphere

In mathematics and physical science, spherical harmonics are special functions defined on the surface of a sphere. They are often employed in solving partial differential equations in many scientific fields. The table of spherical harmonics contains a list of common spherical harmonics.

In mathematics, a linear form is a linear map from a vector space to its field of scalars.

In vector calculus, Green's theorem relates a line integral around a simple closed curve C to a double integral over the plane region D bounded by C. It is the two-dimensional special case of Stokes' theorem. In one dimension, it is equivalent to the fundamental theorem of calculus. In three dimensions, it is equivalent to the divergence theorem.

<span class="mw-page-title-main">Green's function</span> Impulse response of an inhomogeneous linear differential operator

In mathematics, a Green's function is the impulse response of an inhomogeneous linear differential operator defined on a domain with specified initial conditions or boundary conditions.

In mathematics, the covariant derivative is a way of specifying a derivative along tangent vectors of a manifold. Alternatively, the covariant derivative is a way of introducing and working with a connection on a manifold by means of a differential operator, to be contrasted with the approach given by a principal connection on the frame bundle – see affine connection. In the special case of a manifold isometrically embedded into a higher-dimensional Euclidean space, the covariant derivative can be viewed as the orthogonal projection of the Euclidean directional derivative onto the manifold's tangent space. In this case the Euclidean derivative is broken into two parts, the extrinsic normal component and the intrinsic covariant derivative component.

<span class="mw-page-title-main">LSZ reduction formula</span> Connection between correlation functions and the S-matrix

In quantum field theory, the Lehmann–Symanzik–Zimmermann (LSZ) reduction formula is a method to calculate S-matrix elements from the time-ordered correlation functions of a quantum field theory. It is a step of the path that starts from the Lagrangian of some quantum field theory and leads to prediction of measurable quantities. It is named after the three German physicists Harry Lehmann, Kurt Symanzik and Wolfhart Zimmermann.

In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points computed using inner products. The feature map in kernel machines is infinite dimensional but only requires a finite dimensional matrix from user-input according to the Representer theorem. Kernel machines are slow to compute for datasets larger than a couple of thousand examples without parallel processing.

In applied mathematics, polyharmonic splines are used for function approximation and data interpolation. They are very useful for interpolating and fitting scattered data in many dimensions. Special cases include thin plate splines and natural cubic splines in one dimension.

In physics and mathematics, the solid harmonics are solutions of the Laplace equation in spherical polar coordinates, assumed to be (smooth) functions . There are two kinds: the regular solid harmonics, which are well-defined at the origin and the irregular solid harmonics, which are singular at the origin. Both sets of functions play an important role in potential theory, and are obtained by rescaling spherical harmonics appropriately:

In mathematics, vector spherical harmonics (VSH) are an extension of the scalar spherical harmonics for use with vector fields. The components of the VSH are complex-valued functions expressed in the spherical coordinate basis vectors.

In the finite element method for the numerical solution of elliptic partial differential equations, the stiffness matrix is a matrix that represents the system of linear equations that must be solved in order to ascertain an approximate solution to the differential equation.

Least-squares support-vector machines (LS-SVM) for statistics and in statistical modeling, are least-squares versions of support-vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods.

Within bayesian statistics for machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature. It is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be inner product spaces, but instead more general reproducing kernel Hilbert spaces. In Bayesian probability kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors while the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs such as in multi-task learning.

In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.

References

  1. Goldberg, Yoav; Elhadad, Michael (2008). splitSVM: Fast, Space-Efficient, non-Heuristic, Polynomial Kernel Computation for NLP Applications. Proc. ACL-08: HLT.
  2. "Archived copy" (PDF). Archived from the original (PDF) on 2013-04-15. Retrieved 2012-11-12.
  3. Shashua, Amnon (2009). "Introduction to Machine Learning: Class Notes 67577". arXiv:0904.3664v1 [cs.LG].
  4. Lin, Chih-Jen (2012). Machine learning software: design and practical use (PDF). Machine Learning Summer School. Kyoto.
  5. Chang, Yin-Wen; Hsieh, Cho-Jui; Chang, Kai-Wei; Ringgaard, Michael; Lin, Chih-Jen (2010). "Training and testing low-degree polynomial data mappings via linear SVM". Journal of Machine Learning Research. 11: 1471–1490.
  6. Kudo, T.; Matsumoto, Y. (2003). Fast methods for kernel-based text analysis. Proc. ACL.