Kernel methods for vector output

Last updated

Kernel methods are a well-established tool to analyze the relationship between input data and the corresponding output of a function. Kernels encapsulate the properties of functions in a computationally efficient way and allow algorithms to easily swap functions of varying complexity.

Contents

In typical machine learning algorithms, these functions produce a scalar output. Recent development of kernel methods for functions with vector-valued output is due, at least in part, to interest in simultaneously solving related problems. Kernels which capture the relationship between the problems allow them to borrow strength from each other. Algorithms of this type include multi-task learning (also called multi-output learning or vector-valued learning), transfer learning, and co-kriging. Multi-label classification can be interpreted as mapping inputs to (binary) coding vectors with length equal to the number of classes.

In Gaussian processes, kernels are called covariance functions. Multiple-output functions correspond to considering multiple processes. See Bayesian interpretation of regularization for the connection between the two perspectives.

History

The history of learning vector-valued functions is closely linked to transfer learning- storing knowledge gained while solving one problem and applying it to a different but related problem. The fundamental motivation for transfer learning in the field of machine learning was discussed in a NIPS-95 workshop on “Learning to Learn,” which focused on the need for lifelong machine learning methods that retain and reuse previously learned knowledge. Research on transfer learning has attracted much attention since 1995 in different names: learning to learn, lifelong learning, knowledge transfer, inductive transfer, multitask learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, metalearning, and incremental/cumulative learning. [1] Interest in learning vector-valued functions was particularly sparked by multitask learning, a framework which tries to learn multiple, possibly different tasks simultaneously.

Much of the initial research in multitask learning in the machine learning community was algorithmic in nature, and applied to methods such as neural networks, decision trees and k-nearest neighbors in the 1990s. [2] The use of probabilistic models and Gaussian processes was pioneered and largely developed in the context of geostatistics, where prediction over vector-valued output data is known as cokriging. [3] [4] [5] Geostatistical approaches to multivariate modeling are mostly formulated around the linear model of coregionalization (LMC), a generative approach for developing valid covariance functions that has been used for multivariate regression and in statistics for computer emulation of expensive multivariate computer codes. The regularization and kernel theory literature for vector-valued functions followed in the 2000s. [6] [7] While the Bayesian and regularization perspectives were developed independently, they are in fact closely related. [8]

Notation

In this context, the supervised learning problem is to learn the function which best predicts vector-valued outputs given inputs (data) .

for
, an input space (e.g. )

In general, each component of (), could have different input data () with different cardinality () and even different input spaces (). [8] Geostatistics literature calls this case heterotopic, and uses isotopic to indicate that the each component of the output vector has the same set of inputs. [9]

Here, for simplicity in the notation, we assume the number and sample space of the data for each output are the same.

Regularization perspective [8] [10] [11]

From the regularization perspective, the problem is to learn belonging to a reproducing kernel Hilbert space of vector-valued functions (). This is similar to the scalar case of Tikhonov regularization, with some extra care in the notation.

Vector-valued caseScalar case
Reproducing kernel
Learning problem
Solution

(derived via the representer theorem )

with ,
where are the coefficients and output vectors concatenated to form vectors and matrix of blocks:

Solve for by taking the derivative of the learning problem, setting it equal to zero, and substituting in the above expression for :

where

It is possible, though non-trivial, to show that a representer theorem also holds for Tikhonov regularization in the vector-valued setting. [8]

Note, the matrix-valued kernel can also be defined by a scalar kernel on the space . An isometry exists between the Hilbert spaces associated with these two kernels:

Gaussian process perspective

The estimator of the vector-valued regularization framework can also be derived from a Bayesian viewpoint using Gaussian process methods in the case of a finite dimensional Reproducing kernel Hilbert space. The derivation is similar to the scalar-valued case Bayesian interpretation of regularization. The vector-valued function , consisting of outputs , is assumed to follow a Gaussian process:

where is now a vector of the mean functions for the outputs and is a positive definite matrix-valued function with entry corresponding to the covariance between the outputs and .

For a set of inputs , the prior distribution over the vector is given by , where is a vector that concatenates the mean vectors associated to the outputs and is a block-partitioned matrix. The distribution of the outputs is taken to be Gaussian:

where is a diagonal matrix with elements specifying the noise for each output. Using this form for the likelihood, the predictive distribution for a new vector is:

where is the training data, and is a set of hyperparameters for and .

Equations for and can then be obtained:

where has entries for and . Note that the predictor is identical to the predictor derived in the regularization framework. For non-Gaussian likelihoods different methods such as Laplace approximation and variational methods are needed to approximate the estimators.

Example kernels

Separable

A simple, but broadly applicable, class of multi-output kernels can be separated into the product of a kernel on the input-space and a kernel representing the correlations among the outputs: [8]

: scalar kernel on
: scalar kernel on

In matrix form:    where is a symmetric and positive semi-definite matrix. Note, setting to the identity matrix treats the outputs as unrelated and is equivalent to solving the scalar-output problems separately.

For a slightly more general form, adding several of these kernels yields sum of separable kernels (SoS kernels).

From regularization literature [8] [10] [12] [13] [14]

Derived from regularizer

One way of obtaining is to specify a regularizer which limits the complexity of in a desirable way, and then derive the corresponding kernel. For certain regularizers, this kernel will turn out to be separable.

Mixed-effect regularizer

where:

where matrix with all entries equal to 1.

This regularizer is a combination of limiting the complexity of each component of the estimator () and forcing each component of the estimator to be close to the mean of all the components. Setting treats all the components as independent and is the same as solving the scalar problems separately. Setting assumes all the components are explained by the same function.

Cluster-based regularizer

where:

  • is the index set of components that belong to cluster
  • is the cardinality of cluster
  • if and both belong to cluster  ( otherwise

where

This regularizer divides the components into clusters and forces the components in each cluster to be similar.

Graph regularizer

where matrix of weights encoding the similarities between the components

where ,  

Note, is the graph laplacian. See also: graph kernel.

Learned from data

Several approaches to learning from data have been proposed. [8] These include: performing a preliminary inference step to estimate from the training data, [9] a proposal to learn and together based on the cluster regularizer, [15] and sparsity-based approaches which assume only a few of the features are needed. [16] [17]

From Bayesian literature

Linear model of coregionalization (LMC)

In LMC, outputs are expressed as linear combinations of independent random functions such that the resulting covariance function (over all inputs and outputs) is a valid positive semidefinite function. Assuming outputs with , each is expressed as:

where are scalar coefficients and the independent functions have zero mean and covariance cov if and 0 otherwise. The cross covariance between any two functions and can then be written as:

where the functions , with and have zero mean and covariance cov if and . But is given by . Thus the kernel can now be expressed as

where each is known as a coregionalization matrix. Therefore, the kernel derived from LMC is a sum of the products of two covariance functions, one that models the dependence between the outputs, independently of the input vector (the coregionalization matrix ), and one that models the input dependence, independently of (the covariance function ).

Intrinsic coregionalization model (ICM)

The ICM is a simplified version of the LMC, with . ICM assumes that the elements of the coregionalization matrix can be written as , for some suitable coefficients . With this form for :

where

In this case, the coefficients

and the kernel matrix for multiple outputs becomes . ICM is much more restrictive than the LMC since it assumes that each basic covariance contributes equally to the construction of the autocovariances and cross covariances for the outputs. However, the computations required for the inference are greatly simplified.

Semiparametric latent factor model (SLFM)

Another simplified version of the LMC is the semiparametric latent factor model (SLFM), which corresponds to setting (instead of as in ICM). Thus each latent function has its own covariance.

Non-separable

While simple, the structure of separable kernels can be too limiting for some problems.

Notable examples of non-separable kernels in the regularization literature include:

In the Bayesian perspective, LMC produces a separable kernel because the output functions evaluated at a point only depend on the values of the latent functions at . A non-trivial way to mix the latent functions is by convolving a base process with a smoothing kernel. If the base process is a Gaussian process, the convolved process is Gaussian as well. We can therefore exploit convolutions to construct covariance functions. [20] This method of producing non-separable kernels is known as process convolution. Process convolutions were introduced for multiple outputs in the machine learning community as "dependent Gaussian processes". [21]

Implementation

When implementing an algorithm using any of the kernels above, practical considerations of tuning the parameters and ensuring reasonable computation time must be considered.

Regularization perspective

Approached from the regularization perspective, parameter tuning is similar to the scalar-valued case and can generally be accomplished with cross validation. Solving the required linear system is typically expensive in memory and time. If the kernel is separable, a coordinate transform can convert to a block-diagonal matrix, greatly reducing the computational burden by solving D independent subproblems (plus the eigendecomposition of ). In particular, for a least squares loss function (Tikhonov regularization), there exists a closed form solution for : [8] [14]

Bayesian perspective

There are many works related to parameter estimation for Gaussian processes. Some methods such as maximization of the marginal likelihood (also known as evidence approximation, type II maximum likelihood, empirical Bayes), and least squares give point estimates of the parameter vector . There are also works employing a full Bayesian inference by assigning priors to and computing the posterior distribution through a sampling procedure. For non-Gaussian likelihoods, there is no closed form solution for the posterior distribution or for the marginal likelihood. However, the marginal likelihood can be approximated under a Laplace, variational Bayes or expectation propagation (EP) approximation frameworks for multiple output classification and used to find estimates for the hyperparameters.

The main computational problem in the Bayesian viewpoint is the same as the one appearing in regularization theory of inverting the matrix

This step is necessary for computing the marginal likelihood and the predictive distribution. For most proposed approximation methods to reduce computation, the computational efficiency gained is independent of the particular method employed (e.g. LMC, process convolution) used to compute the multi-output covariance matrix. A summary of different methods for reducing computational complexity in multi-output Gaussian processes is presented in. [8]

Related Research Articles

In probability, and statistics, a multivariate random variable or random vector is a list of mathematical variables each of whose value is unknown, either because the value has not yet occurred or because there is imperfect knowledge of its value. The individual variables in a random vector are grouped together because they are all part of a single mathematical system — often they represent different properties of an individual statistical unit. For example, while a given person has a specific age, height and weight, the representation of these features of an unspecified person from within a group would be a random vector. Normally each element of a random vector is a real number.

Multivariate normal distribution Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.

Support-vector machine Set of methods for supervised statistical learning

In machine learning, support-vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

Controllability is an important property of a control system, and the controllability property plays a crucial role in many control problems, such as stabilization of unstable systems by feedback, or optimal control.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately. Early versions of MTL were called "hints".

Tikhonov regularization, named for Andrey Tikhonov, is a method of regularization of ill-posed problems. Also known as ridge regression, it is particularly useful to mitigate the problem of multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters. In general, the method provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias.

In statistics, sometimes the covariance matrix of a multivariate random variable is not known but has to be estimated. Estimation of covariance matrices then deals with the question of how to approximate the actual covariance matrix on the basis of a sample from the multivariate distribution. Simple cases, where observations are complete, can be dealt with by using the sample covariance matrix. The sample covariance matrix (SCM) is an unbiased and efficient estimator of the covariance matrix if the space of covariance matrices is viewed as an extrinsic convex cone in Rp×p; however, measured using the intrinsic geometry of positive-definite matrices, the SCM is a biased and inefficient estimator. In addition, if the random variable has a normal distribution, the sample covariance matrix has a Wishart distribution and a slightly differently scaled version of it is the maximum likelihood estimate. Cases involving missing data require deeper considerations. Another issue is the robustness to outliers, to which sample covariance matrices are highly sensitive.

Screw theory Mathematical formulation of vector pairs used in physics (rigid body dynamics)

Screw theory is the algebraic calculation of pairs of vectors, such as forces and moments or angular and linear velocity, that arise in the kinematics and dynamics of rigid bodies. The mathematical framework was developed by Sir Robert Stawell Ball in 1876 for application in kinematics and statics of mechanisms.

Kernel method Class of algorithms for pattern analysis

In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over pairs of data points in raw representation.

In operator theory, a branch of mathematics, a positive-definite kernel is a generalization of a positive-definite function or a positive-definite matrix. It was first introduced by James Mercer in the early 20th century, in the context of solving integral operator equations. Since then, positive-definite functions and their various analogues and generalizations have arisen in diverse parts of mathematics. They occur naturally in Fourier analysis, probability theory, operator theory, complex function-theory, moment problems, integral equations, boundary-value problems for partial differential equations, machine learning, embedding problem, information theory, and other areas.

In discrete mathematics, ideal lattices are a special class of lattices and a generalization of cyclic lattices. Ideal lattices naturally occur in many parts of number theory, but also in other areas. In particular, they have a significant place in cryptography. Micciancio defined a generalization of cyclic lattices as ideal lattices. They can be used in cryptosystems to decrease by a square root the number of parameters necessary to describe a lattice, making them more efficient. Ideal lattices are a new concept, but similar lattice classes have been used for a long time. For example, cyclic lattices, a special case of ideal lattices, are used in NTRUEncrypt and NTRUSign.

Within mathematical analysis, Regularization perspectives on support-vector machines provide a way of interpreting support-vector machines (SVMs) in the context of other machine-learning algorithms. SVM algorithms categorize multidimensional data, with the goal of fitting the training set data well, but also avoiding overfitting, so that the solution generalizes to new data points. Regularization algorithms also aim to fit training set data and avoid overfitting. They do this by choosing a fitting function that has low error on the training set, but also is not too complicated, where complicated functions are functions with high norms in some function space. Specifically, Tikhonov regularization algorithms choose a function that minimizes the sum of training-set error plus the function's norm. The training-set error can be calculated with different loss functions. For example, regularized least squares is a special case of Tikhonov regularization using the squared error loss as the loss function.

Within bayesian statistics for machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature. It is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be inner product spaces, but instead more general reproducing kernel Hilbert spaces. In Bayesian probability kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors while the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs such as in multi-task learning.

Spectral regularization is any of a class of regularization techniques used in machine learning to control the impact of noise and prevent overfitting. Spectral regularization can be used in a broad range of applications, from deblurring images to classifying emails into a spam folder and a non-spam folder. For instance, in the email classification example, spectral regularization can be used to reduce the impact of noise and prevent overfitting when a machine learning system is being trained on a labeled set of emails to learn how to tell a spam and a non-spam email apart.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.

In the field of statistical learning theory, matrix regularization generalizes notions of vector regularization to cases where the object to be learned is a matrix. The purpose of regularization is to enforce conditions, for example sparsity or smoothness, that can produce stable predictive functions. For example, in the more common vector framework, Tikhonov regularization optimizes over

Manifold regularization

In machine learning, Manifold regularization is a technique for using the shape of a dataset to constrain the functions that should be learned on that dataset. In many machine learning problems, the data to be learned do not cover the entire input space. For example, a facial recognition system may not need to classify any possible image, but only the subset of images that contain faces. The technique of manifold learning assumes that the relevant subset of data comes from a manifold, a mathematical structure with useful properties. The technique also assumes that the function to be learned is smooth: data with different labels are not likely to be close together, and so the labeling function should not change quickly in areas where there are likely to be many data points. Because of this assumption, a manifold regularization algorithm can use unlabeled data to inform where the learned function is allowed to change quickly and where it is not, using an extension of the technique of Tikhonov regularization. Manifold regularization algorithms can extend supervised learning algorithms in semi-supervised learning and transductive learning settings, where unlabeled data are available. The technique has been used for applications including medical imaging, geographical imaging, and object recognition.

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.

Low-rank matrix approximations are essential tools in the application of kernel methods to large-scale learning problems.

References

  1. S.J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, 22, 2010
  2. Rich Caruana, "Multitask Learning," Machine Learning, 41–76, 1997
  3. J. Ver Hoef and R. Barry, "Constructing and fitting models for cokriging and multivariable spatial prediction [ dead link ]," Journal of Statistical Planning and Inference, 69:275–294, 1998
  4. P. Goovaerts, "Geostatistics for Natural Resources Evaluation," Oxford University Press, USA, 1997
  5. N. Cressie "Statistics for Spatial Data," John Wiley & Sons Inc. (Revised Edition), USA, 1993
  6. C.A. Micchelli and M. Pontil, "On learning vector-valued functions," Neural Computation, 17:177–204, 2005
  7. C. Carmeli et al., "Vector valued reproducing kernel hilbert spaces of integrable functions and mercer theorem," Anal. Appl. (Singap.), 4
  8. 1 2 3 4 5 6 7 8 9 10 11 Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence, "Kernels for Vector-Valued Functions: A Review," Foundations and Trends in Machine Learning 4, no. 3 (2012): 195–266. doi: 10.1561/2200000036 arXiv:1106.6251
  9. 1 2 Hans Wackernagel. Multivariate Geostatistics. Springer-Verlag Heidelberg New york, 2003.
  10. 1 2 C.A. Micchelli and M. Pontil. On learning vector–valued functions. Neural Computation, 17:177–204, 2005.
  11. C.Carmeli, E.DeVito, and A.Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Anal. Appl. (Singap.), 4(4):377–408, 2006.
  12. C. A. Micchelli and M. Pontil. Kernels for multi-task learning. In Advances in Neural Information Processing Systems (NIPS). MIT Press, 2004.
  13. T.Evgeniou, C.A.Micchelli, and M.Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
  14. 1 2 L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. Multi-output learning via spectral filtering. Technical report, Massachusetts Institute of Technology, 2011. MIT-CSAIL-TR-2011-004, CBCL-296.
  15. Laurent Jacob, Francis Bach, and Jean-Philippe Vert. Clustered multi-task learning: A convex formulation. In NIPS 21, pages 745–752, 2008.
  16. Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
  17. Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In ECML/PKDD (1), pages 71–85, 2008.
  18. I. Maceˆdo and R. Castro. Learning divergence-free and curl-free vector fields with matrix-valued kernels. Technical report, Instituto Nacional de Matematica Pura e Aplicada, 2008.
  19. A. Caponnetto, C.A. Micchelli, M. Pontil, and Y. Ying. Universal kernels for multi-task learning. Journal of Machine Learning Research, 9:1615–1646, 2008.
  20. D. Higdon, "Space and space-time modeling using process convolutions, Quantitative methods for current environmental issues, 37–56, 2002
  21. P. Boyle and M. Frean, "Dependent gaussian processes, Advances in Neural Information Processing Systems, 17:217–224, MIT Press, 2005