Reproducing kernel Hilbert space

Last updated
Figure illustrates related but varying approaches to viewing RKHS Different Views on RKHS.png
Figure illustrates related but varying approaches to viewing RKHS

In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions and in the RKHS are close in norm, i.e., is small, then and are also pointwise close, i.e., is small for all . The converse does not need to be true. Informally, this can be shown by looking at the supremum norm: the sequence of functions converges pointwise, but does not converge uniformly i.e. does not converge with respect to the supremum norm. (This is not a counterexample because the supremum norm does not arise from any inner product due to not satisfying the parallelogram law.)

Contents

It is not entirely straightforward to construct a Hilbert space of functions which is not an RKHS. [1] Some examples, however, have been found. [2] [3]

L2 spaces are not Hilbert spaces of functions (and hence not RKHSs), but rather Hilbert spaces of equivalence classes of functions (for example, the functions and defined by and are equivalent in L2). However, there are RKHSs in which the norm is an L2-norm, such as the space of band-limited functions (see the example below).

An RKHS is associated with a kernel that reproduces every function in the space in the sense that for every in the set on which the functions are defined, "evaluation at " can be performed by taking an inner product with a function determined by the kernel. Such a reproducing kernel exists if and only if every evaluation functional is continuous.

The reproducing kernel was first introduced in the 1907 work of Stanisław Zaremba concerning boundary value problems for harmonic and biharmonic functions. James Mercer simultaneously examined functions which satisfy the reproducing property in the theory of integral equations. The idea of the reproducing kernel remained untouched for nearly twenty years until it appeared in the dissertations of Gábor Szegő, Stefan Bergman, and Salomon Bochner. The subject was eventually systematically developed in the early 1950s by Nachman Aronszajn and Stefan Bergman. [4]

These spaces have wide applications, including complex analysis, harmonic analysis, and quantum mechanics. Reproducing kernel Hilbert spaces are particularly important in the field of statistical learning theory because of the celebrated representer theorem which states that every function in an RKHS that minimises an empirical risk functional can be written as a linear combination of the kernel function evaluated at the training points. This is a practically useful result as it effectively simplifies the empirical risk minimization problem from an infinite dimensional to a finite dimensional optimization problem.

For ease of understanding, we provide the framework for real-valued Hilbert spaces. The theory can be easily extended to spaces of complex-valued functions and hence include the many important examples of reproducing kernel Hilbert spaces that are spaces of analytic functions. [5]

Definition

Let be an arbitrary set and a Hilbert space of real-valued functions on , equipped with pointwise addition and pointwise scalar multiplication. The evaluation functional over the Hilbert space of functions is a linear functional that evaluates each function at a point ,

We say that H is a reproducing kernel Hilbert space if, for all in , is continuous at every in or, equivalently, if is a bounded operator on , i.e. there exists some such that

 

 

 

 

(1)

Although is assumed for all , it might still be the case that .

While property ( 1 ) is the weakest condition that ensures both the existence of an inner product and the evaluation of every function in at every point in the domain, it does not lend itself to easy application in practice. A more intuitive definition of the RKHS can be obtained by observing that this property guarantees that the evaluation functional can be represented by taking the inner product of with a function in . This function is the so-called reproducing kernel[ citation needed ] for the Hilbert space from which the RKHS takes its name. More formally, the Riesz representation theorem implies that for all in there exists a unique element of with the reproducing property,

 

 

 

 

(2)

Since is itself a function defined on with values in the field (or in the case of complex Hilbert spaces) and as is in we have that

where is the element in associated to .

This allows us to define the reproducing kernel of as a function (or in the complex case) by

From this definition it is easy to see that (or in the complex case) is both symmetric (resp. conjugate symmetric) and positive definite, i.e.

for every [6] The Moore–Aronszajn theorem (see below) is a sort of converse to this: if a function satisfies these conditions then there is a Hilbert space of functions on for which it is a reproducing kernel.

Example

The space of bandlimited continuous functions is a RKHS, as we now show. Formally, fix some cutoff frequency and define the Hilbert space

where is the set of continuous square integrable functions, and is the Fourier transform of . As the inner product of this Hilbert space, we use

From the Fourier inversion theorem, we have

It then follows by the Cauchy–Schwarz inequality and Plancherel's theorem that, for all ,

This inequality shows that the evaluation functional is bounded, proving that is indeed a RKHS.

The kernel function in this case is given by

The Fourier transform of defined above is given by

which is a consequence of the time-shifting property of the Fourier transform. Consequently, using Plancherel's theorem, we have

Thus we obtain the reproducing property of the kernel.

in this case is the "bandlimited version" of the Dirac delta function, and that converges to in the weak sense as the cutoff frequency tends to infinity.

Moore–Aronszajn theorem

We have seen how a reproducing kernel Hilbert space defines a reproducing kernel function that is both symmetric and positive definite. The Moore–Aronszajn theorem goes in the other direction; it states that every symmetric, positive definite kernel defines a unique reproducing kernel Hilbert space. The theorem first appeared in Aronszajn's Theory of Reproducing Kernels, although he attributes it to E. H. Moore.

Theorem. Suppose K is a symmetric, positive definite kernel on a set X. Then there is a unique Hilbert space of functions on X for which K is a reproducing kernel.

Proof. For all x in X, define Kx = K(x, ⋅ ). Let H0 be the linear span of {Kx : xX}. Define an inner product on H0 by

which implies . The symmetry of this inner product follows from the symmetry of K and the non-degeneracy follows from the fact that K is positive definite.

Let H be the completion of H0 with respect to this inner product. Then H consists of functions of the form

Now we can check the reproducing property ( 2 ):

To prove uniqueness, let G be another Hilbert space of functions for which K is a reproducing kernel. For every x and y in X, ( 2 ) implies that

By linearity, on the span of . Then because G is complete and contains H0 and hence contains its completion.

Now we need to prove that every element of G is in H. Let be an element of G. Since H is a closed subspace of G, we can write where and . Now if then, since K is a reproducing kernel of G and H:

where we have used the fact that belongs to H so that its inner product with in G is zero. This shows that in G and concludes the proof.

Integral operators and Mercer's theorem

We may characterize a symmetric positive definite kernel via the integral operator using Mercer's theorem and obtain an additional view of the RKHS. Let be a compact space equipped with a strictly positive finite Borel measure and a continuous, symmetric, and positive definite function. Define the integral operator as

where is the space of square integrable functions with respect to .

Mercer's theorem states that the spectral decomposition of the integral operator of yields a series representation of in terms of the eigenvalues and eigenfunctions of . This then implies that is a reproducing kernel so that the corresponding RKHS can be defined in terms of these eigenvalues and eigenfunctions. We provide the details below.

Under these assumptions is a compact, continuous, self-adjoint, and positive operator. The spectral theorem for self-adjoint operators implies that there is an at most countable decreasing sequence such that and , where the form an orthonormal basis of . By the positivity of for all One can also show that maps continuously into the space of continuous functions and therefore we may choose continuous functions as the eigenvectors, that is, for all Then by Mercer's theorem may be written in terms of the eigenvalues and continuous eigenfunctions as

for all such that

This above series representation is referred to as a Mercer kernel or Mercer representation of .

Furthermore, it can be shown that the RKHS of is given by

where the inner product of given by

This representation of the RKHS has application in probability and statistics, for example to the Karhunen-Loève representation for stochastic processes and kernel PCA.

Feature maps

A feature map is a map , where is a Hilbert space which we will call the feature space. The first sections presented the connection between bounded/continuous evaluation functions, positive definite functions, and integral operators and in this section we provide another representation of the RKHS in terms of feature maps.

Every feature map defines a kernel via

 

 

 

 

(3)

Clearly is symmetric and positive definiteness follows from the properties of inner product in . Conversely, every positive definite function and corresponding reproducing kernel Hilbert space has infinitely many associated feature maps such that ( 3 ) holds.

For example, we can trivially take and for all . Then ( 3 ) is satisfied by the reproducing property. Another classical example of a feature map relates to the previous section regarding integral operators by taking and .

This connection between kernels and feature maps provides us with a new way to understand positive definite functions and hence reproducing kernels as inner products in . Moreover, every feature map can naturally define a RKHS by means of the definition of a positive definite function.

Lastly, feature maps allow us to construct function spaces that reveal another perspective on the RKHS. Consider the linear space

We can define a norm on by

It can be shown that is a RKHS with kernel defined by . This representation implies that the elements of the RKHS are inner products of elements in the feature space and can accordingly be seen as hyperplanes. This view of the RKHS is related to the kernel trick in machine learning. [7]

Properties

Useful properties of RKHSs:

Common examples

Bilinear kernels

The RKHS corresponding to this kernel is the dual space, consisting of functions satisfying .

Polynomial kernels

Radial basis function kernels

These are another common class of kernels which satisfy . Some examples include:

Bergman kernels

We also provide examples of Bergman kernels. Let X be finite and let H consist of all complex-valued functions on X. Then an element of H can be represented as an array of complex numbers. If the usual inner product is used, then Kx is the function whose value is 1 at x and 0 everywhere else, and can be thought of as an identity matrix since

In this case, H is isomorphic to .

The case of (where denotes the unit disc) is more sophisticated. Here the Bergman space is the space of square-integrable holomorphic functions on . It can be shown that the reproducing kernel for is

Lastly, the space of band limited functions in with bandwidth is a RKHS with reproducing kernel

Extension to vector-valued functions

In this section we extend the definition of the RKHS to spaces of vector-valued functions as this extension is particularly important in multi-task learning and manifold regularization. The main difference is that the reproducing kernel is a symmetric function that is now a positive semi-definite matrix for every in . More formally, we define a vector-valued RKHS (vvRKHS) as a Hilbert space of functions such that for all and

and

This second property parallels the reproducing property for the scalar-valued case. This definition can also be connected to integral operators, bounded evaluation functions, and feature maps as we saw for the scalar-valued RKHS. We can equivalently define the vvRKHS as a vector-valued Hilbert space with a bounded evaluation functional and show that this implies the existence of a unique reproducing kernel by the Riesz Representation theorem. Mercer's theorem can also be extended to address the vector-valued setting and we can therefore obtain a feature map view of the vvRKHS. Lastly, it can also be shown that the closure of the span of coincides with , another property similar to the scalar-valued case.

We can gain intuition for the vvRKHS by taking a component-wise perspective on these spaces. In particular, we find that every vvRKHS is isometrically isomorphic to a scalar-valued RKHS on a particular input space. Let . Consider the space and the corresponding reproducing kernel

 

 

 

 

(4)

As noted above, the RKHS associated to this reproducing kernel is given by the closure of the span of where for every set of pairs .

The connection to the scalar-valued RKHS can then be made by the fact that every matrix-valued kernel can be identified with a kernel of the form of ( 4 ) via

Moreover, every kernel with the form of ( 4 ) defines a matrix-valued kernel with the above expression. Now letting the map be defined as

where is the component of the canonical basis for , one can show that is bijective and an isometry between and .

While this view of the vvRKHS can be useful in multi-task learning, this isometry does not reduce the study of the vector-valued case to that of the scalar-valued case. In fact, this isometry procedure can make both the scalar-valued kernel and the input space too difficult to work with in practice as properties of the original kernels are often lost. [10] [11] [12]

An important class of matrix-valued reproducing kernels are separable kernels which can factorized as the product of a scalar valued kernel and a -dimensional symmetric positive semi-definite matrix. In light of our previous discussion these kernels are of the form

for all in and in . As the scalar-valued kernel encodes dependencies between the inputs, we can observe that the matrix-valued kernel encodes dependencies among both the inputs and the outputs.

We lastly remark that the above theory can be further extended to spaces of functions with values in function spaces but obtaining kernels for these spaces is a more difficult task. [13]

Connection between RKHSs and the ReLU function

The ReLU function is commonly defined as and is a mainstay in the architecture of neural networks where it is used as an activation function. One can construct a ReLU-like nonlinear function using the theory of reproducing kernel Hilbert spaces. Below, we derive this construction and show how it implies the representation power of neural networks with ReLU activations.

We will work with the Hilbert space of absolutely continuous functions with and square integrable (i.e. ) derivative. It has the inner product

To construct the reproducing kernel it suffices to consider a dense subspace, so let and . The Fundamental Theorem of Calculus then gives

where

and i.e.

This implies reproduces .

Moreover the minimum function on has the following representations with the ReLu function:

Using this formulation, we can apply the representer theorem to the RKHS, letting one prove the optimality of using ReLU activations in neural network settings.[ citation needed ]

See also

Notes

  1. Alpay, D., and T. M. Mills. "A family of Hilbert spaces which are not reproducing kernel Hilbert spaces." J. Anal. Appl. 1.2 (2003): 107–111.
  2. Z. Pasternak-Winiarski, "On weights which admit reproducing kernel of Bergman type", International Journal of Mathematics and Mathematical Sciences, vol. 15, Issue 1, 1992.
  3. T. Ł. Żynda, "On weights which admit reproducing kernel of Szegő type", Journal of Contemporary Mathematical Analysis (Armenian Academy of Sciences), 55, 2020.
  4. Okutmustur
  5. Paulson
  6. Durrett
  7. Rosasco
  8. Rosasco
  9. Berlinet, Alain and Thomas, Christine. Reproducing kernel Hilbert spaces in Probability and Statistics , Kluwer Academic Publishers, 2004
  10. De Vito
  11. Zhang
  12. Alvarez
  13. Rosasco

Related Research Articles

The Riesz representation theorem, sometimes called the Riesz–Fréchet representation theorem after Frigyes Riesz and Maurice René Fréchet, establishes an important connection between a Hilbert space and its continuous dual space. If the underlying field is the real numbers, the two are isometrically isomorphic; if the underlying field is the complex numbers, the two are isometrically anti-isomorphic. The (anti-) isomorphism is a particular natural isomorphism.

In mathematics, weak topology is an alternative term for certain initial topologies, often on topological vector spaces or spaces of linear operators, for instance on a Hilbert space. The term is most commonly used for the initial topology of a topological vector space with respect to its continuous dual. The remainder of this article will deal with this case, which is one of the concepts of functional analysis.

<span class="mw-page-title-main">Dirac delta function</span> Generalized function whose value is zero everywhere except at zero

In mathematical analysis, the Dirac delta function, also known as the unit impulse, is a generalized function on the real numbers, whose value is zero everywhere except at zero, and whose integral over the entire real line is equal to one. Since there is no function having this property, to model the delta "function" rigorously involves the use of limits or, as is common in mathematics, measure theory and the theory of distributions.

In mathematics, particularly linear algebra and functional analysis, a spectral theorem is a result about when a linear operator or matrix can be diagonalized. This is extremely useful because computations involving a diagonalizable matrix can often be reduced to much simpler computations involving the corresponding diagonal matrix. The concept of diagonalization is relatively straightforward for operators on finite-dimensional vector spaces but requires some modification for operators on infinite-dimensional spaces. In general, the spectral theorem identifies a class of linear operators that can be modeled by multiplication operators, which are as simple as one can hope to find. In more abstract language, the spectral theorem is a statement about commutative C*-algebras. See also spectral theory for a historical perspective.

In mathematics, specifically functional analysis, a trace-class operator is a linear operator for which a trace may be defined, such that the trace is a finite number independent of the choice of basis used to compute the trace. This trace of trace-class operators generalizes the trace of matrices studied in linear algebra. All trace-class operators are compact operators.

In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in, is one of the most notable results of the work of James Mercer (1883–1932). It is an important theoretical tool in the theory of integral equations; it is used in the Hilbert space theory of stochastic processes, for example the Karhunen–Loève theorem; and it is also used in the reproducing kernel Hilbert space theory where it characterizes a symmetric positive-definite kernel as a reproducing kernel.

In mathematics and signal processing, the Hilbert transform is a specific singular integral that takes a function, u(t) of a real variable and produces another function of a real variable H(u)(t). The Hilbert transform is given by the Cauchy principal value of the convolution with the function (see § Definition). The Hilbert transform has a particularly simple representation in the frequency domain: It imparts a phase shift of ±90° (π/2 radians) to every frequency component of a function, the sign of the shift depending on the sign of the frequency (see § Relationship with the Fourier transform). The Hilbert transform is important in signal processing, where it is a component of the analytic representation of a real-valued signal u(t). The Hilbert transform was first introduced by David Hilbert in this setting, to solve a special case of the Riemann–Hilbert problem for analytic functions.

In mathematics, specifically in operator theory, each linear operator on an inner product space defines a Hermitian adjoint operator on that space according to the rule

In the theory of stochastic processes, the Karhunen–Loève theorem, also known as the Kosambi–Karhunen–Loève theorem states that a stochastic process can be represented as an infinite linear combination of orthogonal functions, analogous to a Fourier series representation of a function on a bounded interval. The transformation is also known as Hotelling transform and eigenvector transform, and is closely related to principal component analysis (PCA) technique widely used in image processing and in data analysis in many fields.

In probability theory and related fields, Malliavin calculus is a set of mathematical techniques and ideas that extend the mathematical field of calculus of variations from deterministic functions to stochastic processes. In particular, it allows the computation of derivatives of random variables. Malliavin calculus is also called the stochastic calculus of variations. P. Malliavin first initiated the calculus on infinite dimensional space. Then, the significant contributors such as S. Kusuoka, D. Stroock, J-M. Bismut, Shinzo Watanabe, I. Shigekawa, and so on finally completed the foundations.

In mathematics, particularly in functional analysis, a projection-valued measure is a function defined on certain subsets of a fixed set and whose values are self-adjoint projections on a fixed Hilbert space. A projection-valued measure (PVM) is formally similar to a real-valued measure, except that its values are self-adjoint projections rather than real numbers. As in the case of ordinary measures, it is possible to integrate complex-valued functions with respect to a PVM; the result of such an integration is a linear operator on the given Hilbert space.

In probability theory and statistical mechanics, the Gaussian free field (GFF) is a Gaussian random field, a central model of random surfaces.

In functional analysis, the dual norm is a measure of size for a continuous linear function defined on a normed vector space.

<span class="mw-page-title-main">Hilbert space</span> Type of topological vector space

In mathematics, Hilbert spaces allow the methods of linear algebra and calculus to be generalized from (finite-dimensional) Euclidean vector spaces to spaces that may be infinite-dimensional. Hilbert spaces arise naturally and frequently in mathematics and physics, typically as function spaces. Formally, a Hilbert space is a vector space equipped with an inner product that induces a distance function for which the space is a complete metric space.

Coherent states have been introduced in a physical context, first as quasi-classical states in quantum mechanics, then as the backbone of quantum optics and they are described in that spirit in the article Coherent states. However, they have generated a huge variety of generalizations, which have led to a tremendous amount of literature in mathematical physics. In this article, we sketch the main directions of research on this line. For further details, we refer to several existing surveys.

Within bayesian statistics for machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature. It is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be inner product spaces, but instead more general reproducing kernel Hilbert spaces. In Bayesian probability kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors while the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs such as in multi-task learning.

For computer science, in statistical learning theory, a representer theorem is any of several related results stating that a minimizer of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training set data.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.

In probability theory, a branch of mathematics, white noise analysis, otherwise known as Hida calculus, is a framework for infinite-dimensional and stochastic calculus, based on the Gaussian white noise probability space, to be compared with Malliavin calculus based on the Wiener process. It was initiated by Takeyuki Hida in his 1975 Carleton Mathematical Lecture Notes.

References