Gaussian process approximations

Last updated November 27, 2024

In statistics and machine learning, Gaussian process approximation is a computational method that accelerates inference tasks in the context of a Gaussian process model, most commonly likelihood evaluation and prediction. Like approximations of other models, they can often be expressed as additional assumptions imposed on the model, which do not correspond to any actual feature, but which retain its key properties while simplifying calculations. Many of these approximation methods can be expressed in purely linear algebraic or functional analytic terms as matrix or function approximations. Others are purely algorithmic and cannot easily be rephrased as a modification of a statistical model.

Basic ideas

In statistical modeling, it is often convenient to assume that $y\in {\mathcal {Y}}$ , the phenomenon under investigation is a Gaussian process indexed by $X\in {\mathcal {X}}={\mathcal {X}}_{1}\times {\mathcal {X}}_{2}\dots {\mathcal {X}}_{d}$ which has mean function ${\displaystyle \mu$ and covariance function $K:{\mathcal {X}}\times {\mathcal {X}}\rightarrow \mathbb {R}$ . One can also assume that data $\mathbf {y} =(y_{1},\dots ,y_{n})$ are values of a particular realization of this process for indices $\mathbf {X} =X_{1},\dots ,X_{n}$ .

Consequently, the joint distribution of the data can be expressed as

\mathbf {y} \sim {\mathcal {N}}(\mathbf {\mu } ,\mathbf {\Sigma } )

,

where $\mathbf {\Sigma } =\left[K(X_{i},X_{j})\right]_{i,j=1}^{n}$ and $\mathbf {\mu } =\left(\mu (X_{1}),\mu (X_{2}),\dots ,\mu (X_{d})\right)^{\top }$ , i.e. respectively a matrix with the covariance function values and a vector with the mean function values at corresponding (pairs of) indices. The negative log-likelihood of the data then takes the form

-\log \ell (\mathbf {y} )={\frac {d}{2\pi }}+{\frac {1}{2}}\log \det(\mathbf {\Sigma } )+\left(\mathbf {y} -\mathbf {\mu } \right)^{\top }\mathbf {\Sigma } ^{-1}\left(\mathbf {y} -\mathbf {\mu } \right)

Similarly, the best predictor of $\mathbf {y} ^{*}$ , the values of $y$ for indices $\mathbf {X} ^{*}=\left(X_{1}^{*},X_{2}^{*},\dots ,X_{d}^{*}\right)$ , given data $\mathbf {y}$ has the form

\mathbf {\mu } _{\mathbf {y} }^{*}=\mathbb {E} \left[\mathbf {y} ^{*}|\mathbf {y} \right]=\mathbf {\mu } ^{*}-\mathbf {\Sigma } _{\mathbf {y} ^{*}\mathbf {y} }\mathbf {\Sigma } ^{-1}\left(\mathbf {y} -\mathbf {\mu } \right)

In the context of Gaussian models, especially in geostatistics, prediction using the best predictor, i.e. mean conditional on the data, is also known as kriging.

The most computationally expensive component of the best predictor formula is inverting the covariance matrix $\mathbf {\Sigma }$ , which has cubic complexity ${\mathcal {O}}(n^{3})$ . Similarly, evaluating likelihood involves both calculating $\mathbf {\Sigma } ^{-1}$ and the determinant $\det(\mathbf {\Sigma } )$ which has the same cubic complexity.

Gaussian process approximations can often be expressed in terms of assumptions on $y$ under which $\log \ell (\mathbf {y} )$ and $\mathbf {\mu } _{\mathbf {y} }^{*}$ can be calculated with much lower complexity. Since these assumptions are generally not believed to reflect reality, the likelihood and the best predictor obtained in this way are not exact, but they are meant to be close to their original values.

Model-based methods

This class of approximations is expressed through a set of assumptions which are imposed on the original process and which, typically, imply some special structure of the covariance matrix. Although most of these methods were developed independently, most of them can be expressed as special cases of the sparse general Vecchia approximation.

Sparse covariance methods

These methods approximate the true model in a way the covariance matrix is sparse. Typically, each method proposes its own algorithm that takes the full advantage of the sparsity pattern in the covariance matrix. Two prominent members of this class of approaches are covariance tapering and domain partitioning. The first method generally requires a metric $d$ over ${\mathcal {X}}$ and assumes that for $X,{\tilde {X}}\in {\mathcal {X}}$ we have $Cov(y(X),y({\tilde {X}}))\neq 0$ only if $d(X,{\tilde {X}})<r$ for some radius $r$ . The second method assumes that there exist ${\mathcal {X}}^{(1)},\dots ,{\mathcal {X}}^{(K)}$ such that $\bigcup _{k=1}^{K}{\mathcal {X}}^{(k)}$ . Then with appropriate distribution of indices among partition elements and ordering of elements of $X$ the covariance matrix is block diagonal.

Sparse precision methods

This family of methods assumes that the precision matrix $\mathbf {\Lambda } =\mathbf {\Sigma } ^{-1}$ is sparse and generally specifies which of its elements are non-zero. This leads to fast inversion because only those elements need to be calculated. Some of the prominent approximations in this category include the approach based on the equivalence between Gaussian processes with Matern covariance function and stochastic PDEs, periodic embedding, and Nearest Neighbour Gaussian processes. The first method applies to the case of $d=2$ and when ${\mathcal {X}}$ has a defined metric and takes advantage of the fact, that the Markov property holds which makes $\mathbf {\Lambda }$ very sparse. The second extends the domain and uses Discrete Fourier Transform to decorrelate the data, which results in a diagonal precision matrix. The third one requires a metric on ${\mathcal {X}}$ and takes advantage of the so-called screening effect assuming that $\mathbf {\Lambda } _{i,j}\neq 0$ only if $d(x_{i},x_{j})<r$ , for some $r>0$ .

Sparse Cholesky factor methods

In many practical applications, calculating $\mathbf {\Lambda }$ is replaced with computing first $\mathbf {L}$ , the Cholesky factor of $\mathbf {\Sigma }$ , and second its inverse $\mathbf {L} ^{-1}$ . This is known to be more stable than a plain inversion. For this reason, some authors focus on constructing a sparse approximation of the Cholesky factor of the precision or covariance matrices. One of the most established methods in this class is the Vecchia approximation and its generalization. These approaches determine the optimal ordering of indices and, consequently, the elements of $\mathbf {x}$ and then assume a dependency structure which minimizes in-fill in the Cholesky factor. Several other methods can be expressed in this framework, the Multi-resolution Approximation (MRA), Nearest Neighbour Gaussian Process, Modified Predictive Process and Full-scale approximation.

Low-rank methods

While this approach encompasses many methods, the common assumption underlying them all is the assumption, that $y$ , the Gaussian process of interest, is effectively low-rank. More precisely, it is assumed, that there exists a set of indices ${\bar {X}}=\{{\bar {x}}_{1},\dots ,{\bar {x}}_{p}\}$ such that every other set of indices $X=\{x_{1},\dots ,x_{n}\}$

$y(X)\sim {\mathcal {N}}\left(\mathbf {A} _{X}{\bar {\mathbf {\mu } }},\mathbf {A} _{X}^{\top }{\bar {\mathbf {\Sigma } }}\mathbf {A} _{X}+\mathbf {D} \right)$

where $\mathbf {A} _{X}$ is an $p\times k$ matrix, ${\bar {\mathbf {\mu } }}=\mu \left(y\left({\bar {X}}\right)\right)$ and ${\bar {\mathbf {\Sigma } }}=K\left({\bar {X}},{\bar {X}}\right)$ and $\mathbf {D}$ is a diagonal matrix. Depending on the method and application various ways of selecting ${\bar {X}}$ have been proposed. Typically, $p$ is selected to be much smaller than $n$ which means that the computational cost of inverting ${\bar {\mathbf {\Sigma } }}$ is manageable ( ${\mathcal {O}}(p^{3})$ instead of ${\mathcal {O}}(n^{3})$ ).

More generally, on top of selecting ${\bar {X}}$ , one may also find an $n\times p$ matrix $\mathbf {A}$ and assume that $X=\mathbf {A} \mathbf {\eta }$ , where $\mathbf {\eta }$ are $p$ values of a Gaussian process possibly independent of $x$ . Many machine learning methods fall into this category, such as subset-of-regressors (SoR), relevance vector machine, sparse spectrum Gaussian Process and others and they generally differ in the way they derive $\mathbf {A}$ and $\mathbf {\eta }$ .

Hierarchical methods

The general principle of hierarchical approximations consists of a repeated application of some other method, such that each consecutive application refines the quality of the approximation. Even though they can be expressed as a set of statistical assumptions, they are often described in terms of a hierarchical matrix approximation (HODLR) or basis function expansion (LatticeKrig, MRA, wavelets). The hierarchical matrix approach can often be represented as a repeated application of a low-rank approximation to successively smaller subsets of the index set $X$ . Basis function expansion relies on using functions with compact support. These features can then be exploited by an algorithm who steps through consecutive layers of the approximation. In the most favourable settings some of these methods can achieve quasi-linear ( ${\mathcal {O}}(n\log n)$ ) complexity.

Unified framework

Probabilistic graphical models provide a convenient framework for comparing model-based approximations. In this context, value of the process at index $x_{k}\in X$ can then be represented by a vertex in a directed graph and edges correspond to the terms in the factorization of the joint density of $y(X)$ . In general, when no independent relations are assumed, the joint probability distribution can be represented by an arbitrary directed acyclic graph. Using a particular approximation can then be expressed as a certain way of ordering the vertices and adding or removing specific edges.

Methods without a statistical model

This class of methods does not specify a statistical model or impose assumptions on an existing one. Three major members of this group are the meta-kriging algorithm, the gapfill algorithm and Local Approximate Gaussian Process approach. The first one partitions the set of indices into $K$ components ${\mathcal {X}}^{(1)},\dots ,{\mathcal {X}}^{(k)}$ , calculates the conditional distribution for each those components separately and then uses geometric median of the conditional PDFs to combine them. The second is based on quantile regression using values of the process which are close to the value one is trying to predict, where distance is measured in terms of a metric on the set of indices. Local Approximate Gaussian Process uses a similar logic but constructs a valid stochastic process based on these neighboring values.

Related Research Articles

In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is $The parameter is the mean or expectation of the distribution, while the parameter is the variance. The standard deviation of the distribution is (sigma). A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate .$

<span class="mw-page-title-main">Central limit theorem</span> Fundamental theorem in probability theory and statistics

In probability theory, the central limit theorem (CLT) states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution. This holds even if the original variables themselves are not normally distributed. There are several versions of the CLT, each applying in the context of different conditions.

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

<span class="mw-page-title-main">Covariance matrix</span> Measure of covariance of components of a random vector

In probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

Hotellings <i>T</i>-squared distribution Type of probability distribution

In statistics, particularly in hypothesis testing, the Hotelling's T-squared distribution (T²), proposed by Harold Hotelling, is a multivariate probability distribution that is tightly related to the F-distribution and is most notable for arising as the distribution of a set of sample statistics that are natural generalizations of the statistics underlying the Student's t-distribution. The Hotelling's t-squared statistic (t²) is a generalization of Student's t-statistic that is used in multivariate hypothesis testing.

In statistics, sometimes the covariance matrix of a multivariate random variable is not known but has to be estimated. Estimation of covariance matrices then deals with the question of how to approximate the actual covariance matrix on the basis of a sample from the multivariate distribution. Simple cases, where observations are complete, can be dealt with by using the sample covariance matrix. The sample covariance matrix (SCM) is an unbiased and efficient estimator of the covariance matrix if the space of covariance matrices is viewed as an extrinsic convex cone in R^p×p; however, measured using the intrinsic geometry of positive-definite matrices, the SCM is a biased and inefficient estimator. In addition, if the random variable has a normal distribution, the sample covariance matrix has a Wishart distribution and a slightly differently scaled version of it is the maximum likelihood estimate. Cases involving missing data, heteroscedasticity, or autocorrelated residuals require deeper considerations. Another issue is the robustness to outliers, to which sample covariance matrices are highly sensitive.

Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:

To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
To derive a lower bound for the marginal likelihood of the observed data. This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data.

The sensitivity index or discriminability index or detectability index is a dimensionless statistic used in signal detection theory. A higher index indicates that the signal can be more readily detected.

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation of the current parental individuals, usually in a stochastic way. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value $. Like this, individuals with better and better -values are generated over the generation sequence.$

The ensemble Kalman filter (EnKF) is a recursive filter suitable for problems with a large number of variables, such as discretizations of partial differential equations in geophysical models. The EnKF originated as a version of the Kalman filter for large problems, and it is now an important data assimilation component of ensemble forecasting. EnKF is related to the particle filter but the EnKF makes the assumption that all probability distributions involved are Gaussian; when it is applicable, it is much more efficient than the particle filter.

In the fields of computer vision and image analysis, the Harris affine region detector belongs to the category of feature detection. Feature detection is a preprocessing step of several algorithms that rely on identifying characteristic points or interest points so to make correspondences between images, recognize textures, categorize objects or build panoramas.

<span class="mw-page-title-main">Logit-normal distribution</span> Probability distribution

In probability theory, a logit-normal distribution is a probability distribution of a random variable whose logit has a normal distribution. If Y is a random variable with a normal distribution, and t is the standard logistic function, then X = t(Y) has a logit-normal distribution; likewise, if X is logit-normally distributed, then Y = logit(X)= log (X/(1-X)) is normally distributed. It is also known as the logistic normal distribution, which often refers to a multinomial logit version (e.g.).

In probability theory and statistics, the normal-inverse-Wishart distribution is a multivariate four-parameter family of continuous probability distributions. It is the conjugate prior of a multivariate normal distribution with unknown mean and covariance matrix.

Within bayesian statistics for machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature. It is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be inner product spaces, but instead more general reproducing kernel Hilbert spaces. In Bayesian probability kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors while the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs such as in multi-task learning.

Functional principal component analysis (FPCA) is a statistical method for investigating the dominant modes of variation of functional data. Using this method, a random function is represented in the eigenbasis, which is an orthonormal basis of the Hilbert space L² that consists of the eigenfunctions of the autocovariance operator. FPCA represents functional data in the most parsimonious way, in the sense that when using a fixed number of basis functions, the eigenfunction basis explains more variation than any other basis expansion. FPCA can be applied for representing random functions, or in functional regression and classification.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space $on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in, discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song, Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.$

Kernel methods are a well-established tool to analyze the relationship between input data and the corresponding output of a function. Kernels encapsulate the properties of functions in a computationally efficient way and allow algorithms to easily swap functions of varying complexity.

Vecchia approximation is a Gaussian processes approximation technique originally developed by Aldo Vecchia, a statistician at United States Geological Survey. It is one of the earliest attempts to use Gaussian processes in high-dimensional settings. It has since been extensively generalized giving rise to many contemporary approximations.

References

Liu, Haitao; Ong, Yew-Soon; Shen, Xiaobo; Cai, Jianfei (2020). "When Gaussian Process Meets Big Data: A Review of Scalable GPS". IEEE Transactions on Neural Networks and Learning Systems. PP (11): 1–19. arXiv: 1807.01065 . doi:10.1109/TNNLS.2019.2957109. PMID 31944966.
Heaton, Matthew J.; Datta, Abhirup; Finley, Andrew O.; Furrer, Reinhard; Guinness, Joseph; Guhaniyogi, Rajarshi; Gerber, Florian; Gramacy, Robert B.; Hammerling, Dorit; Katzfuss, Matthias; Lindgren, Finn; Nychka, Douglas W.; Sun, Furong; Zammit-Mangion, Andrew (2018). "A Case Study Competition Among Methods for Analyzing Large Spatial Data". Journal of Agricultural, Biological and Environmental Statistics. 24 (3): 398–425. doi: 10.1007/s13253-018-00348-w . ISSN 1085-7117. PMC 6709111 . PMID 31496633.
Banerjee, Sudipto (2017). "High-Dimensional Bayesian Geostatistics". Bayesian Analysis. 12 (2): 583–614. doi: 10.1214/17-BA1056R . PMC 5790125 . PMID 29391920.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.