Vecchia approximation

Vecchia approximation is a Gaussian process approximation technique originally developed by Aldo Vecchia, a statistician at the United States Geological Survey. [1] It is one of the earliest attempts to use Gaussian processes in high-dimensional settings. It has since been extensively generalized, giving rise to many contemporary approximations.

Intuition

A joint probability distribution for events $x_1$, $x_2$, and $x_3$, denoted $p(x_1, x_2, x_3)$, can be expressed as
$p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2).$

Vecchia's approximation takes the form, for example,
$p(x_1, x_2, x_3) \approx p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)$

and is accurate when events $x_2$ and $x_3$ are close to conditionally independent given knowledge of $x_1$. Of course one could have alternatively chosen the approximation
$p(x_1, x_2, x_3) \approx p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)$

and so use of the approximation requires some knowledge of which events are close to conditionally independent given others. Moreover, we could have chosen a different ordering, for example
$p(x_1, x_2, x_3) \approx p(x_3)\, p(x_2 \mid x_3)\, p(x_1 \mid x_2).$

Fortunately, in many cases there are good heuristics for making decisions about how to construct the approximation.
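
To make the intuition concrete, the following sketch (an illustrative example, not taken from the sources; the covariance matrix and all values are made up) compares the exact log-density of a three-dimensional Gaussian with the Vecchia-style approximation that drops $x_2$ from the conditioning set of $x_3$.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Hypothetical zero-mean covariance; with these numbers x_2 and x_3 are exactly
# conditionally independent given x_1, so the approximation is error-free here.
K = np.array([[1.0, 0.8, 0.7],
              [0.8, 1.0, 0.56],
              [0.7, 0.56, 1.0]])

def cond_gauss(K, i, cond):
    """Coefficients and variance of the conditional of x_i given x[cond]."""
    Kic = K[i, cond]
    coef = np.linalg.solve(K[np.ix_(cond, cond)], Kic)
    return coef, K[i, i] - coef @ Kic

x = np.array([0.3, -0.5, 0.1])                       # a test point

# Exact chain rule: log p(x1) + log p(x2 | x1) + log p(x3 | x1, x2)
exact = multivariate_normal(cov=K).logpdf(x)

# Vecchia-style approximation: log p(x1) + log p(x2 | x1) + log p(x3 | x1)
approx = norm(0, np.sqrt(K[0, 0])).logpdf(x[0])
for i, cond in [(1, [0]), (2, [0])]:                 # x_3 conditions on x_1 only
    coef, var = cond_gauss(K, i, cond)
    approx += norm(coef @ x[cond], np.sqrt(var)).logpdf(x[i])

print(exact, approx)   # equal here; perturb K[1, 2] to see a small approximation error
```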

More technically, general versions of the approximation lead to a sparse Cholesky factor of the precision matrix. Using the standard Cholesky factorization produces entries which can be interpreted [2] as conditional correlations, with zeros indicating conditional independence (since the model is Gaussian). These independence relations can be alternatively expressed using graphical models, and there exist theorems linking graph structure and vertex ordering with zeros in the Cholesky factor. In particular, it is known [3] that independencies that are encoded in a moral graph lead to Cholesky factors of the precision matrix that have no fill-in.
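
As a minimal illustration of this link (an added example, not from the cited works, assuming a Gaussian Markov chain), the precision matrix of an AR(1)-type process is tridiagonal, and under the natural ordering its Cholesky factor is bidiagonal, i.e. it has no fill-in, while a different ordering introduces extra non-zeros:

```python
import numpy as np

n = 6
rho = 0.5
# Covariance of a zero-mean Gaussian Markov chain: K[i, j] = rho ** |i - j|
K = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

Q = np.linalg.inv(K)                  # precision matrix: tridiagonal (Markov structure)
L = np.linalg.cholesky(Q)             # lower-triangular Cholesky factor

np.set_printoptions(precision=2, suppress=True)
print(L)                              # bidiagonal: no fill-in under the natural ordering

# A permutation that breaks the chain ordering introduces fill-in.
perm = [0, 2, 4, 1, 3, 5]
Qp = Q[np.ix_(perm, perm)]
print(np.linalg.cholesky(Qp))         # extra non-zeros appear below the first sub-diagonal
```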

Formal description

The problem

Let $X$ be a Gaussian process indexed by $\mathcal{S}$ with mean function $\mu$ and covariance function $K$. Assume that $S = \{s_1, \dots, s_n\}$ is a finite subset of $\mathcal{S}$ and $x = (x_1, \dots, x_n)$ is a vector of values of $X$ evaluated at $S$, i.e. $x_i = X(s_i)$ for $i = 1, \dots, n$. Assume further that one observes $y = (y_1, \dots, y_n)$, where $y_i = x_i + \varepsilon_i$ with independent noise $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. In this context the two most common inference tasks include evaluating the likelihood
$\mathcal{L}(y) = p(y_1, \dots, y_n),$

or making predictions of values of $X$ at a new location $s^* \in \mathcal{S} \setminus S$, i.e. calculating
$p\bigl(X(s^*) \mid y_1, \dots, y_n\bigr).$
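
Both tasks can be carried out exactly at $\mathcal{O}(n^3)$ cost, which is what the approximation aims to avoid. A minimal sketch of the exact baseline, assuming a squared-exponential kernel and synthetic one-dimensional data (all names and values here are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def kernel(A, B, length=1.0):
    """Squared-exponential covariance between two sets of 1-D locations."""
    return np.exp(-0.5 * (np.subtract.outer(A, B) / length) ** 2)

rng = np.random.default_rng(0)
s = np.sort(rng.uniform(0, 10, size=200))          # observation locations
sigma2 = 0.1
K = kernel(s, s)
y = rng.multivariate_normal(np.zeros(len(s)), K + sigma2 * np.eye(len(s)))

# Likelihood: y ~ N(0, K + sigma^2 I); the dense solve costs O(n^3).
loglik = multivariate_normal(cov=K + sigma2 * np.eye(len(s))).logpdf(y)

# Prediction at a new location s*: standard GP regression formulas.
s_star = 4.2
k_star = kernel(np.array([s_star]), s).ravel()      # cross-covariances
A = np.linalg.solve(K + sigma2 * np.eye(len(s)), k_star)
pred_mean = A @ y
pred_var = kernel(np.array([s_star]), np.array([s_star]))[0, 0] - A @ k_star
print(loglik, pred_mean, pred_var)
```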

Original formulation

The original Vecchia method starts with the observation that the joint density of the observations $y = (y_1, \dots, y_n)$ can be written as a product of conditional distributions
$p(y_1, \dots, y_n) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{i-1}, \dots, y_1).$

Vecchia's approximation assumes instead that, for some $k \ll n$,
$p(y_1, \dots, y_n) \approx p(y_1) \prod_{i=2}^{n} p\bigl(y_i \mid y_{i-1}, \dots, y_{\max(i-k,\,1)}\bigr).$

Vecchia also suggested that the above approximation be applied to observations that are reordered lexicographically using their spatial coordinates. While his simple method has many weaknesses, it reduced the computational complexity of the likelihood evaluation to $\mathcal{O}(nk^3)$. Many of its deficiencies were addressed by the subsequent generalizations.
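
A minimal sketch of this basic version (illustrative only, not the author's original code; the kernel and all names are assumptions): sort the points lexicographically and condition each observation on at most $k$ immediately preceding ones, so that each factor requires only a small $k \times k$ solve.

```python
import numpy as np
from scipy.stats import norm

def kernel(A, B, length=1.0):
    """Squared-exponential covariance between two sets of d-dimensional locations."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def vecchia_loglik(s, y, sigma2, k):
    """Basic Vecchia log-likelihood: condition each y_i on the k preceding observations."""
    order = np.lexsort((s[:, 1], s[:, 0]))       # lexicographic ordering of coordinates
    s, y = s[order], y[order]
    n = len(y)
    C = kernel(s, s) + sigma2 * np.eye(n)        # covariance matrix of the observations
    loglik = norm(0, np.sqrt(C[0, 0])).logpdf(y[0])
    for i in range(1, n):
        nbr = np.arange(max(i - k, 0), i)        # indices of the conditioning set
        coef = np.linalg.solve(C[np.ix_(nbr, nbr)], C[nbr, i])
        mean = coef @ y[nbr]                     # conditional mean
        var = C[i, i] - coef @ C[nbr, i]         # conditional variance
        loglik += norm(mean, np.sqrt(var)).logpdf(y[i])
    return loglik                                # each step is an O(k^3) solve

rng = np.random.default_rng(1)
s = rng.uniform(0, 10, size=(300, 2))            # synthetic spatial locations
y = rng.multivariate_normal(np.zeros(300), kernel(s, s) + 0.1 * np.eye(300))
print(vecchia_loglik(s, y, sigma2=0.1, k=10))    # approximate log-likelihood
```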

General formulation

While conceptually simple, the assumption underlying the Vecchia approximation often proves to be fairly restrictive and inaccurate. [4] This inspired important generalizations and improvements of the basic version introduced over the years: the inclusion of latent variables, more sophisticated conditioning, and better orderings. Different special cases of the general Vecchia approximation can be described in terms of how these three elements are selected. [5]

Latent variables

To describe extensions of the Vecchia method in its most general form, define $z_i = (x_i, y_i)$ and notice that for $z = (z_1, \dots, z_n)$ it holds, as in the previous section, that
$p(z_1, \dots, z_n) = p(z_1) \prod_{i=2}^{n} p(z_i \mid z_{i-1}, \dots, z_1) = p(x_1)\, p(y_1 \mid x_1) \prod_{i=2}^{n} p(x_i \mid z_{i-1}, \dots, z_1)\, p(y_i \mid x_i)$

because, given $x_i$, all other variables are independent of $y_i$.
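
Written out, each factor in the chain rule splits because of the noise model from the problem statement (a small added derivation using the notation above):
$p(z_i \mid z_{i-1}, \dots, z_1) = p(x_i \mid z_{i-1}, \dots, z_1)\, p(y_i \mid x_i, z_{i-1}, \dots, z_1) = p(x_i \mid z_{i-1}, \dots, z_1)\, p(y_i \mid x_i),$
where the last equality holds because $y_i = x_i + \varepsilon_i$ with noise that is independent of everything else.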

Ordering

It has been widely noted that the original lexicographic ordering based on coordinates produces poor results when $\mathcal{S}$ is two-dimensional. [6] More recently, other orderings have been proposed, some of which ensure that points are ordered in a quasi-random fashion. Besides being highly scalable, these orderings have been shown to drastically improve accuracy. [4]
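
One popular ordering of this kind is a max-min ordering, in which each new point is chosen as far as possible from the points already ordered. The greedy sketch below is illustrative (a simple quadratic-cost construction with assumed names), not the scalable implementations used in practice.

```python
import numpy as np

def maxmin_ordering(s):
    """Greedy max-min ordering: start near the centre, then repeatedly pick the
    point whose distance to the already-ordered points is largest."""
    n = len(s)
    dist2 = ((s[:, None, :] - s[None, :, :]) ** 2).sum(-1)
    first = np.argmin(((s - s.mean(0)) ** 2).sum(-1))    # roughly central point
    order = [first]
    mind = dist2[first].copy()                           # distance to the ordered set
    for _ in range(n - 1):
        nxt = int(np.argmax(mind))
        order.append(nxt)
        mind = np.minimum(mind, dist2[nxt])
    return np.array(order)

s = np.random.default_rng(2).uniform(0, 1, size=(500, 2))
print(maxmin_ordering(s)[:10])    # early points spread quasi-randomly over the domain
```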

Conditioning

Similar to the basic version described above, for a given ordering a general Vecchia approximation can be defined as
$\hat{p}(z_1, \dots, z_n) = p(z_1) \prod_{i=2}^{n} p\bigl(z_i \mid z_{q(i)}\bigr),$

where $q(i) \subset \{1, \dots, i-1\}$. Since $y_i$ is independent of all other variables given $x_i$, it follows that $p(z_i \mid z_{q(i)}) = p(x_i \mid z_{q(i)})\, p(y_i \mid x_i)$, and since each $y_j$ is only a noisy version of $x_j$, this suggests that the terms $p(x_i \mid z_{q(i)})$ be replaced with $p(x_i \mid x_{q(i)})$. It turns out, however, that sometimes conditioning on some of the observations increases the sparsity of the Cholesky factor of the precision matrix of $(x, y)$. Therefore, one might instead consider sets $q_x(i)$ and $q_y(i)$ such that $q(i) = q_x(i) \cup q_y(i)$ and express $\hat{p}(z)$ as
$\hat{p}(z_1, \dots, z_n) = \prod_{i=1}^{n} p\bigl(x_i \mid x_{q_x(i)}, y_{q_y(i)}\bigr)\, p(y_i \mid x_i),$
with the convention that the conditioning set is empty for $i = 1$.

Multiple methods of choosing $q_x(i)$ and $q_y(i)$ have been proposed, most notably the nearest-neighbour Gaussian process (NNGP), [7] meshed Gaussian process [8] and multi-resolution approximation (MRA) approaches using $q(i) = q_x(i)$, standard Vecchia using $q(i) = q_y(i)$, and sparse general Vecchia, where both $q_x(i)$ and $q_y(i)$ are non-empty. [5]
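
For illustration, a common way to build the conditioning sets is to take, for each point, its $k$ nearest neighbours among the previously ordered points; a minimal sketch (names assumed, independent of any particular package):

```python
import numpy as np

def nearest_neighbour_sets(s, order, k):
    """For each position i in the ordering, return the (up to) k previously
    ordered points that are closest to point order[i]."""
    s_ord = s[order]
    sets = [np.array([], dtype=int)]                  # first point conditions on nothing
    for i in range(1, len(order)):
        d2 = ((s_ord[:i] - s_ord[i]) ** 2).sum(-1)    # distances to earlier points
        sets.append(np.argsort(d2)[:k])               # indices into the ordered sequence
    return sets

s = np.random.default_rng(3).uniform(0, 1, size=(200, 2))
order = np.argsort(s[:, 0])                           # any fixed ordering works here
q = nearest_neighbour_sets(s, order, k=10)
print(len(q[0]), len(q[1]), len(q[50]))               # 0, 1, 10
```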

Software

Several packages have been developed which implement some variants of the Vecchia approximation.

Notes

  1. Vecchia, A. V. (1988). "Estimation and Model Identification for Continuous Spatial Processes". Journal of the Royal Statistical Society, Series B (Methodological). 50 (2): 297–312. doi:10.1111/j.2517-6161.1988.tb01729.x.
  2. Pourahmadi, M. (2007). "Cholesky Decompositions and Estimation of A Covariance Matrix: Orthogonality of Variance Correlation Parameters". Biometrika. 94 (4): 1006–1013. doi:10.1093/biomet/asm073. ISSN 0006-3444.
  3. Khare, Kshitij; Rajaratnam, Bala (2011). "Wishart distributions for decomposable covariance graph models". The Annals of Statistics. 39 (1): 514–555. doi:10.1214/10-AOS841. ISSN 0090-5364.
  4. Guinness, Joseph (2018). "Permutation and Grouping Methods for Sharpening Gaussian Process Approximations". Technometrics. 60 (4): 415–429. doi:10.1080/00401706.2018.1437476. ISSN 0040-1706. PMC 6707751. PMID 31447491.
  5. Katzfuss, Matthias; Guinness, Joseph (2021). "A General Framework for Vecchia Approximations of Gaussian Processes". Statistical Science. 36. arXiv:1708.06302. doi:10.1214/19-STS755. S2CID 88522976.
  6. Sudipto Banerjee; Bradley P. Carlin; Alan E. Gelfand (12 September 2014). Hierarchical Modeling and Analysis for Spatial Data, Second Edition. CRC Press. ISBN 978-1-4398-1917-3.
  7. Datta, Abhirup; Banerjee, Sudipto; Finley, Andrew; Gelfand, Alan (2016). "Hierarchical Nearest-Neighbor Gaussian Process Models for Large Spatial Data". Journal of the American Statistical Association. 111 (514): 800–812. doi:10.1080/01621459.2015.1044091. PMC 5927603. PMID 29720777.
  8. Peruzzi, Michele; Banerjee, Sudipto; Finley, Andrew (2020). "Highly Scalable Bayesian Geostatistical Modeling Via Meshed Gaussian Processes on Partitioned Domains". Journal of the American Statistical Association. 117 (538): 969–982. arXiv:2003.11208. doi:10.1080/01621459.2020.1833889. PMC 9354857. PMID 35935897.
