Positive-definite kernel

In operator theory, a branch of mathematics, a positive-definite kernel is a generalization of a positive-definite function or a positive-definite matrix. It was first introduced by James Mercer in the early 20th century, in the context of solving integral operator equations. Since then, positive-definite functions and their various analogues and generalizations have arisen in diverse parts of mathematics. They occur naturally in Fourier analysis, probability theory, operator theory, complex function theory, moment problems, integral equations, boundary-value problems for partial differential equations, machine learning, the embedding problem, information theory, and other areas.

Definition

Let $\mathcal{X}$ be a nonempty set, sometimes referred to as the index set. A symmetric function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a positive-definite (p.d.) kernel on $\mathcal{X}$ if

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) \ge 0 \qquad (1.1)$$

holds for any $n \in \mathbb{N}$, $x_1, \dots, x_n \in \mathcal{X}$ and $c_1, \dots, c_n \in \mathbb{R}$.

In probability theory, a distinction is sometimes made between positive-definite kernels, for which equality in (1.1) implies $c_i = 0$ for all $i$, and positive semi-definite (p.s.d.) kernels, which do not impose this condition. Note that this is equivalent to requiring that any finite matrix constructed by pairwise evaluation, $\mathbf{K}_{ij} = K(x_i, x_j)$, has either entirely positive (p.d.) or nonnegative (p.s.d.) eigenvalues.
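
A minimal numerical sketch of this eigenvalue characterization, assuming a Gaussian kernel $K(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ and a handful of randomly chosen points (the kernel, sample size, and dimension are illustrative choices):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a standard p.d. kernel
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))  # five points in R^3

# Gram matrix K_ij = K(x_i, x_j) built by pairwise evaluation
K = np.array([[gaussian_kernel(xi, xj) for xj in points] for xi in points])

eigenvalues = np.linalg.eigvalsh(K)  # K is symmetric, so eigvalsh applies
print(eigenvalues)                   # all entries are nonnegative (up to round-off)
```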

In mathematical literature, kernels are usually complex-valued functions, but in this article we assume real-valued functions, which is the common practice in applications of p.d. kernels.

Some general properties

Examples of p.d. kernels
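
Standard examples include the linear kernel $K(x, y) = x^\top y$, the polynomial kernel $K(x, y) = (x^\top y + c)^m$ with $c \ge 0$ and $m \in \mathbb{N}$, and the Gaussian (RBF) kernel $K(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, all defined on $\mathbb{R}^d \times \mathbb{R}^d$. A brief sketch collecting these as functions (names and default parameters are illustrative choices):

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = <x, y>
    return float(np.dot(x, y))

def polynomial_kernel(x, y, c=1.0, degree=3):
    # K(x, y) = (<x, y> + c)^degree, positive-definite for c >= 0
    return float((np.dot(x, y) + c) ** degree)

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    diff = np.asarray(x) - np.asarray(y)
    return float(np.exp(-np.dot(diff, diff) / (2 * sigma ** 2)))
```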

History

Positive-definite kernels, as defined in (1.1), appeared first in 1909 in a paper on integral equations by James Mercer. [2] Several other authors made use of this concept in the following two decades, but none of them explicitly used kernels $K(x, y) = f(x - y)$, i.e. p.d. functions (indeed M. Mathias and S. Bochner seem not to have been aware of the study of p.d. kernels). Mercer’s work arose from Hilbert’s paper of 1904 [3] on Fredholm integral equations of the second kind:

$$f(s) = \varphi(s) - \lambda \int_a^b K(s,t)\,\varphi(t)\,dt. \qquad (1.2)$$

In particular, Hilbert had shown that

$$\int_a^b \int_a^b K(s,t)\,x(s)\,x(t)\,ds\,dt = \sum_n \frac{1}{\lambda_n} \left[ \int_a^b x(s)\,\psi_n(s)\,ds \right]^2,$$

where $K$ is a continuous real symmetric kernel, $x$ is continuous, $\{\psi_n\}$ is a complete system of orthonormal eigenfunctions, and the $\lambda_n$’s are the corresponding eigenvalues of (1.2). Hilbert defined a “definite” kernel as one for which the double integral

$$J(x) = \int_a^b \int_a^b K(s,t)\,x(s)\,x(t)\,ds\,dt$$

satisfies $J(x) > 0$ except for $x(s) = 0$. The original object of Mercer’s paper was to characterize the kernels which are definite in the sense of Hilbert, but Mercer soon found that the class of such functions was too restrictive to characterize in terms of determinants. He therefore defined a continuous real symmetric kernel $K(s,t)$ to be of positive type (i.e. positive-definite) if $J(x) \ge 0$ for all real continuous functions $x$ on $[a,b]$, and he proved that (1.1) is a necessary and sufficient condition for a kernel to be of positive type. Mercer then proved that for any continuous p.d. kernel the expansion

$$K(s,t) = \sum_n \frac{\psi_n(s)\,\psi_n(t)}{\lambda_n}$$

holds absolutely and uniformly.

At about the same time W. H. Young, [4] motivated by a different question in the theory of integral equations, showed that for continuous kernels condition (1.1) is equivalent to $J(x) \ge 0$ for all $x \in L^1[a,b]$.

E.H. Moore [5] [6] initiated the study of a very general kind of p.d. kernel. If $E$ is an abstract set, he calls functions $K(x, y)$ defined on $E \times E$ “positive Hermitian matrices” if they satisfy (1.1) for all $x_i \in E$. Moore was interested in a generalization of integral equations and showed that to each such $K$ there is a Hilbert space $H$ of functions such that, for each $f \in H$, $f(y) = \langle f, K(\cdot, y) \rangle_H$. This property is called the reproducing property of the kernel and turns out to have importance in the solution of boundary-value problems for elliptic partial differential equations.

Another line of development in which p.d. kernels played a large role was the theory of harmonics on homogeneous spaces as begun by E. Cartan in 1929, and continued by H. Weyl and S. Ito. The most comprehensive theory of p.d. kernels in homogeneous spaces is that of M. Krein [7] which includes as special cases the work on p.d. functions and irreducible unitary representations of locally compact groups.

In probability theory, p.d. kernels arise as covariance kernels of stochastic processes. [8]

Connection with reproducing kernel Hilbert spaces and feature maps

Positive-definite kernels provide a framework that encompasses some basic Hilbert space constructions. In the following we present a tight relationship between positive-definite kernels and two mathematical objects, namely reproducing kernel Hilbert spaces and feature maps.

Let $X$ be a set, $H$ a Hilbert space of functions $f: X \to \mathbb{R}$, and $\langle \cdot, \cdot \rangle_H$ the corresponding inner product on $H$. For any $x \in X$ the evaluation functional $e_x: H \to \mathbb{R}$ is defined by $e_x(f) = f(x)$. We first define a reproducing kernel Hilbert space (RKHS):

Definition: The space $H$ is called a reproducing kernel Hilbert space if the evaluation functionals $e_x$ are continuous.

Every RKHS has a special function associated to it, namely the reproducing kernel:

Definition: A reproducing kernel of $H$ is a function $K: X \times X \to \mathbb{R}$ such that

  1. $K(\cdot, x) \in H$ for all $x \in X$, and
  2. $\langle f, K(\cdot, x) \rangle_H = f(x)$ for all $f \in H$ and all $x \in X$.

The latter property is called the reproducing property.

The following result shows equivalence between RKHS and reproducing kernels:

Theorem   Every reproducing kernel induces a unique RKHS, and every RKHS has a unique reproducing kernel.

Now the connection between positive-definite kernels and RKHS is given by the following theorem.

Theorem   Every reproducing kernel is positive-definite, and every positive-definite kernel defines a unique RKHS, of which it is the unique reproducing kernel.

Thus, given a positive-definite kernel $K$, it is possible to build an associated RKHS with $K$ as its reproducing kernel.

As stated earlier, positive-definite kernels can be constructed from inner products. This fact can be used to connect p.d. kernels with another interesting object that arises in machine learning applications, namely the feature map. Let $F$ be a Hilbert space and $\langle \cdot, \cdot \rangle_F$ the corresponding inner product. Any map $\Phi: X \to F$ is called a feature map. In this case we call $F$ the feature space. It is easy to see [9] that every feature map defines a unique p.d. kernel by

$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle_F.$$

Indeed, positive-definiteness of $K$ follows from the p.d. property of the inner product. On the other hand, every p.d. kernel, and its corresponding RKHS, has many associated feature maps. For example, let $F = H$ and $\Phi(x) = K(\cdot, x)$ for all $x \in X$. Then $\langle \Phi(x), \Phi(y) \rangle_F = \langle K(\cdot, x), K(\cdot, y) \rangle_H = K(x, y)$ by the reproducing property. This suggests a new look at p.d. kernels as inner products in appropriate Hilbert spaces, or in other words, p.d. kernels can be viewed as similarity maps which quantify how similar two points $x$ and $y$ are through the value $K(x, y)$. Moreover, through the equivalence of p.d. kernels and their corresponding RKHS, every feature map can be used to construct an RKHS.
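
A brief sketch of the feature-map construction, assuming the explicit feature map of the homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$ (the map and the test points are illustrative choices):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous degree-2 polynomial kernel on R^2:
    # <phi(x), phi(y)> = (x . y)^2
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def poly2_kernel(x, y):
    # The same kernel evaluated directly: K(x, y) = (x . y)^2
    return float(np.dot(x, y) ** 2)

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The two computations agree, illustrating K(x, y) = <Phi(x), Phi(y)>_F
print(np.dot(phi(x), phi(y)), poly2_kernel(x, y))
```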

Kernels and distances

Kernel methods are often compared to distance-based methods such as nearest neighbors. In this section we discuss parallels between their two respective ingredients, namely kernels $K$ and distances $d$.

Here, by a distance function between each pair of elements of some set $X$, we mean a metric defined on that set, i.e. any nonnegative-valued function $d$ on $X \times X$ which satisfies

  1. $d(x, y) = 0$ if and only if $x = y$,
  2. $d(x, y) = d(y, x)$ for all $x, y \in X$,
  3. $d(x, z) \le d(x, y) + d(y, z)$ for all $x, y, z \in X$.

One link between distances and p.d. kernels is given by a particular kind of kernel, called a negative definite kernel, and defined as follows

Definition: A symmetric function $\psi: X \times X \to \mathbb{R}$ is called a negative definite (n.d.) kernel on $X$ if

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \psi(x_i, x_j) \le 0$$

holds for any $n \in \mathbb{N}$, $x_1, \dots, x_n \in X$ and $c_1, \dots, c_n \in \mathbb{R}$ such that $\sum_{i=1}^{n} c_i = 0$.

The parallel between n.d. kernels and distances is in the following: whenever an n.d. kernel vanishes on the set $\{(x, x) : x \in X\}$ and is zero only on this set, then its square root is a distance for $X$. [10] At the same time, not every distance corresponds necessarily to an n.d. kernel. This is only true for Hilbertian distances, where a distance $d$ is called Hilbertian if one can embed the metric space $(X, d)$ isometrically into some Hilbert space.

On the other hand, n.d. kernels can be identified with a subfamily of p.d. kernels known as infinitely divisible kernels. A nonnegative-valued kernel $K$ is said to be infinitely divisible if for every $n \in \mathbb{N}$ there exists a positive-definite kernel $K_n$ such that $K = (K_n)^n$.
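
For instance, the Gaussian kernel $K(x, y) = e^{-\|x - y\|^2}$ is infinitely divisible: for every $n \in \mathbb{N}$ the kernel $K_n(x, y) = e^{-\|x - y\|^2 / n}$ is itself positive-definite and satisfies $(K_n)^n = K$.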

Another link is that a p.d. kernel induces a pseudometric, where the first constraint on the distance function is loosened to allow $d(x, y) = 0$ for $x \ne y$. Given a positive-definite kernel $K$, we can define a distance function as

$$d(x, y) = \sqrt{K(x, x) - 2K(x, y) + K(y, y)}.$$
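
A short sketch of this induced distance, assuming a Gaussian kernel and two arbitrary points in $\mathbb{R}^2$ (all choices are illustrative):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def kernel_distance(x, y, kernel=gaussian_kernel):
    # d(x, y) = sqrt(K(x, x) - 2 K(x, y) + K(y, y)); the max() guards against
    # tiny negative values caused by floating-point round-off
    return float(np.sqrt(max(kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y), 0.0)))

x = np.array([0.0, 0.0])
y = np.array([1.0, 1.0])
print(kernel_distance(x, y))  # the kernel-induced distance, not the Euclidean one
```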

Some applications

Kernels in machine learning

Positive-definite kernels, through their equivalence with reproducing kernel Hilbert spaces, are particularly important in the field of statistical learning theory because of the celebrated representer theorem, which states that every minimizer function in an RKHS can be written as a linear combination of the kernel function evaluated at the training points. This is a practically useful result as it effectively simplifies the empirical risk minimization problem from an infinite-dimensional to a finite-dimensional optimization problem.
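
A minimal sketch of how the representer theorem is used in practice, assuming kernel ridge regression with a Gaussian kernel (the data, kernel choice, and regularization value are illustrative):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def fit_kernel_ridge(X, y, lam=0.1):
    # By the representer theorem the minimizer has the form f(x) = sum_i alpha_i K(x, x_i),
    # so only the n coefficients alpha must be found: (K + lam * n * I) alpha = y.
    n = len(X)
    K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict(X_train, alpha, x_new):
    # Evaluate the representer expansion at a new point
    return float(sum(a * gaussian_kernel(x_new, xi) for a, xi in zip(alpha, X_train)))

X_train = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
y_train = np.array([0.0, 1.0, 4.0])

alpha = fit_kernel_ridge(X_train, y_train)
print(predict(X_train, alpha, np.array([1.5])))
```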

Kernels in probabilistic models

There are several different ways in which kernels arise in probability theory.

Assume now that a noise variable $\epsilon(x)$, with zero mean and variance $\sigma^2$, is added to the observations of $f$, such that the noise is independent for different $x$ and independent of $f$. Then the problem of finding a good estimate for $f$ is identical to the one above, but with a modified kernel given by $K'(x, x') = K(x, x') + \sigma^2 \delta_{x x'}$.
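
A brief sketch of this modification, in which the noise variance is simply added to the diagonal of the Gram matrix before computing a kernel-based estimate (the data, kernel, and noise level are illustrative):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Covariance kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def noisy_kernel_estimate(X_train, y_train, x_new, noise_var=0.01):
    # Noisy observations replace the Gram matrix K by K + noise_var * I,
    # i.e. the modified kernel K'(x, x') = K(x, x') + noise_var * delta_{x x'}.
    K = np.array([[gaussian_kernel(xi, xj) for xj in X_train] for xi in X_train])
    k_new = np.array([gaussian_kernel(x_new, xi) for xi in X_train])
    weights = np.linalg.solve(K + noise_var * np.eye(len(X_train)), y_train)
    return float(np.dot(k_new, weights))

X_train = [np.array([0.0]), np.array([0.5]), np.array([1.0])]
y_train = np.array([0.1, 0.4, 1.1])
print(noisy_kernel_estimate(X_train, y_train, np.array([0.75])))
```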

Numerical solution of partial differential equations

One of the greatest application areas of so-called meshfree methods is in the numerical solution of PDEs. Some of the popular meshfree methods are closely related to positive-definite kernels, such as the meshless local Petrov–Galerkin (MLPG) method, the reproducing kernel particle method (RKPM) and smoothed-particle hydrodynamics (SPH). These methods use radial basis kernels for collocation. [11]
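
A minimal sketch of the building block behind such collocation schemes, namely radial-basis-function interpolation with a Gaussian kernel on scattered nodes (the kernel, nodes, and target function are illustrative; a full PDE solver would also collocate the differential operator):

```python
import numpy as np

def rbf(r, sigma=0.5):
    # Gaussian radial basis function phi(r) = exp(-r^2 / (2 sigma^2))
    return np.exp(-r ** 2 / (2 * sigma ** 2))

# Interpolation nodes and sampled values of a target function
nodes = np.linspace(0.0, 1.0, 9)
values = np.sin(2 * np.pi * nodes)

# Collocation matrix A_ij = phi(|x_i - x_j|); positive-definiteness of the kernel
# guarantees that A is invertible for distinct nodes, so the coefficients exist.
A = rbf(np.abs(nodes[:, None] - nodes[None, :]))
coeffs = np.linalg.solve(A, values)

def interpolant(x):
    # s(x) = sum_j c_j phi(|x - x_j|)
    return float(np.dot(coeffs, rbf(np.abs(x - nodes))))

print(interpolant(0.3), np.sin(2 * np.pi * 0.3))  # interpolant vs. exact value
```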

Stinespring dilation theorem

Other applications

In the literature on computer experiments [12] and other engineering experiments, one increasingly encounters models based on p.d. kernels, RBFs or kriging. One such topic is response surface methodology. Other types of applications that boil down to data fitting are rapid prototyping and computer graphics. Here one often uses implicit surface models to approximate or interpolate point cloud data.

Applications of p.d. kernels in various other branches of mathematics are in multivariate integration, multivariate optimization, and in numerical analysis and scientific computing, where one studies fast, accurate and adaptive algorithms ideally implemented in high-performance computing environments. [13]


References

  1. Hein, M. and Bousquet, O. (2005). "Hilbertian metrics and positive definite kernels on probability measures". In Ghahramani, Z. and Cowell, R., editors, Proceedings of AISTATS 2005.
  2. Mercer, J. (1909). "Functions of positive and negative type and their connection with the theory of integral equations". Philosophical Transactions of the Royal Society of London, Series A 209, pp. 415-446.
  3. Hilbert, D. (1904). "Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen I", Gött. Nachrichten, math.-phys. Kl. (1904), pp. 49-91.
  4. Young, W. H. (1909). "A note on a class of symmetric functions and on a theorem required in the theory of integral equations", Philos. Trans. Roy. Soc. London, Ser. A, 209, pp. 415-446.
  5. Moore, E.H. (1916). "On properly positive Hermitian matrices", Bull. Amer. Math. Soc. 23, 59, pp. 66-67.
  6. Moore, E.H. (1935). "General Analysis, Part I", Memoirs Amer. Philos. Soc. 1, Philadelphia.
  7. Krein, M. (1949/1950). "Hermitian-positive kernels on homogeneous spaces I and II" (in Russian), Ukrain. Mat. Z. 1 (1949), pp. 64-98, and 2 (1950), pp. 10-59. English translation: Amer. Math. Soc. Translations Ser. 2, 34 (1963), pp. 69-164.
  8. Loève, M. (1960). "Probability theory", 2nd ed., Van Nostrand, Princeton, N.J.
  9. Rosasco, L. and Poggio, T. (2015). "A Regularization Tour of Machine Learning - MIT 9.520 Lecture Notes" Manuscript.
  10. Berg, C., Christensen, J. P. R., and Ressel, P. (1984). "Harmonic Analysis on Semigroups". Number 100 in Graduate Texts in Mathematics, Springer Verlag.
  11. Schaback, R. and Wendland, H. (2006). "Kernel Techniques: From Machine Learning to Meshless Methods", Cambridge University Press, Acta Numerica (2006), pp. 1-97.
  12. Haaland, B. and Qian, P. Z. G. (2010). "Accurate emulators for large-scale computer experiments", Ann. Stat.
  13. Gumerov, N. A. and Duraiswami, R. (2007). "Fast radial basis function interpolation via preconditioned Krylov iteration". SIAM J. Scient. Computing 29/5, pp. 1876-1899.