Bayesian quadrature

Bayesian quadrature [1] [2] [3] [4] [5] is a method for approximating intractable integration problems. It falls within the class of probabilistic numerical methods. Bayesian quadrature views numerical integration as a Bayesian inference task, where function evaluations are used to estimate the integral of that function. For this reason, it is sometimes also referred to as "Bayesian probabilistic numerical integration" or "Bayesian numerical integration". The name "Bayesian cubature" is also sometimes used when the integrand is multi-dimensional. A potential advantage of this approach is that it provides probabilistic uncertainty quantification for the value of the integral.

Bayesian quadrature

Numerical integration

Let $f : \mathcal{X} \to \mathbb{R}$ be a function defined on a domain $\mathcal{X}$ (where typically $\mathcal{X} \subseteq \mathbb{R}^d$). In numerical integration, evaluations $f(x_1), \ldots, f(x_n)$ of $f$ at $n$ distinct locations $x_1, \ldots, x_n$ in $\mathcal{X}$ are used to estimate the integral of $f$ against a measure $\nu$: i.e. $\nu[f] := \int_{\mathcal{X}} f(x) \, \nu(\mathrm{d}x)$. Given weights $w_1, \ldots, w_n \in \mathbb{R}$, a quadrature rule is an estimator of $\nu[f]$ of the form $\hat{\nu}[f] := \sum_{i=1}^{n} w_i f(x_i)$. For example, standard Monte Carlo integration corresponds to sampling $x_1, \ldots, x_n$ independently from $\nu$ and taking equal weights $w_i = 1/n$.

Bayesian quadrature consists of specifying a prior distribution over $f$, conditioning this prior on the evaluations $f(x_1), \ldots, f(x_n)$ to obtain a posterior distribution over $f$, then computing the implied posterior distribution on $\nu[f]$. The name "quadrature" comes from the fact that the posterior mean of $\nu[f]$ sometimes takes the form of a quadrature rule, whose weights are determined by the choice of prior.

Bayesian quadrature with Gaussian processes

The most common choice of prior distribution for $f$ is a Gaussian process, as this permits conjugate inference to obtain a closed-form posterior distribution on $\nu[f]$. Suppose we have a Gaussian process prior $f \sim \mathcal{GP}(m, k)$ with prior mean function $m$ and covariance function (or kernel function) $k$. Then, the posterior distribution on $f$ given the evaluations $f(X) = (f(x_1), \ldots, f(x_n))^\top$ is a Gaussian process with mean $m_n$ and kernel $k_n$ given by:

$$m_n(x) = m(x) + k(x, X)\, k(X, X)^{-1} \left(f(X) - m(X)\right), \qquad k_n(x, y) = k(x, y) - k(x, X)\, k(X, X)^{-1} k(X, y),$$

where $(k(X, X))_{ij} = k(x_i, x_j)$, $(k(x, X))_i = k(x, x_i) = (k(X, x))_i$, $(f(X))_i = f(x_i)$, and $(m(X))_i = m(x_i)$.

Furthermore, the posterior distribution on $\nu[f]$ is a univariate Gaussian distribution with mean $\mathbb{E}[\nu[f]]$ and variance $\mathbb{V}[\nu[f]]$ given by

$$\mathbb{E}[\nu[f]] = \nu[m] + \nu[k(\cdot, X)]\, k(X, X)^{-1} \left(f(X) - m(X)\right), \qquad \mathbb{V}[\nu[f]] = \nu\nu[k] - \nu[k(\cdot, X)]\, k(X, X)^{-1}\, \nu[k(X, \cdot)].$$

The function $x \mapsto \nu[k(\cdot, x)] = \int_{\mathcal{X}} k(y, x)\, \nu(\mathrm{d}y)$ is the kernel mean embedding of $\nu$, and $\nu\nu[k] = \int_{\mathcal{X}} \int_{\mathcal{X}} k(x, y)\, \nu(\mathrm{d}x)\, \nu(\mathrm{d}y)$ denotes the integral of $k$ with respect to both inputs. In particular, note that the posterior mean is a quadrature rule with weights $w_i = \left(\nu[k(\cdot, X)]\, k(X, X)^{-1}\right)_i$, and the posterior variance provides a quantification of the user's uncertainty over the value of $\nu[f]$.
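As a concrete illustration of these formulae, the following is a minimal sketch in Python of how the posterior mean and variance of $\nu[f]$ might be computed, assuming a zero-mean prior ($m \equiv 0$) and user-supplied functions `kernel`, `kernel_mean` and `initial_error` (these names are hypothetical and not from any particular library).

```python
import numpy as np

def bq_posterior(x, fx, kernel, kernel_mean, initial_error, jitter=1e-10):
    """Posterior mean and variance of nu[f] under a zero-mean Gaussian process prior.

    x             : (n,) array of evaluation points x_1, ..., x_n
    fx            : (n,) array of function values f(x_1), ..., f(x_n)
    kernel        : k(x, y), broadcasting over array inputs
    kernel_mean   : the kernel mean embedding nu[k(., x)], broadcasting over arrays
    initial_error : the constant nu nu[k] (the kernel integrated over both arguments)
    """
    K = kernel(x[:, None], x[None, :]) + jitter * np.eye(len(x))  # Gram matrix k(X, X)
    z = kernel_mean(x)                                            # vector nu[k(., X)]
    w = np.linalg.solve(K, z)                                     # quadrature weights w_i
    mean = w @ fx                                                 # posterior mean of nu[f]
    var = initial_error - z @ w                                   # posterior variance of nu[f]
    return mean, var
```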

In more challenging integration problems, where the prior distribution cannot be relied upon as a meaningful representation of epistemic uncertainty, it is necessary to use the data to set the kernel hyperparameters using, for example, maximum likelihood estimation. The estimation of kernel hyperparameters introduces adaptivity into Bayesian quadrature. [6] [7]
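For example, with a squared-exponential kernel the correlation lengthscale could be set by maximising the Gaussian process log marginal likelihood; the sketch below is one simple way this might be done (zero-mean prior and a single lengthscale parameter assumed, function names hypothetical), rather than the specific procedures used in the cited works.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_marginal_likelihood(x, fx, lengthscale, jitter=1e-10):
    """Log marginal likelihood of the data under a zero-mean GP with a squared-exponential kernel."""
    diff = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (diff / lengthscale) ** 2) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, fx))          # K^{-1} f(X)
    return -0.5 * fx @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(x) * np.log(2 * np.pi)

def fit_lengthscale(x, fx):
    """Maximum likelihood estimate of the lengthscale by one-dimensional search."""
    res = minimize_scalar(lambda ell: -log_marginal_likelihood(x, fx, ell),
                          bounds=(1e-3, 10.0), method="bounded")
    return res.x
```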

Example

[Figure (animation): Illustration of Bayesian quadrature for estimating $\nu[f] = \int_0^1 f(x)\,\mathrm{d}x$, where $f(x) = (1 + x^2)\sin(5\pi x) + 8/5$. The posterior distribution (blue) concentrates on the true integral as more evaluations of the integrand $f$ (the red points) are obtained.]

Consider estimation of the integral

$$\nu[f] = \int_0^1 f(x)\,\mathrm{d}x, \qquad f(x) = (1 + x^2)\sin(5\pi x) + \frac{8}{5},$$

using a Bayesian quadrature rule based on a zero-mean Gaussian process prior with the Matérn covariance function of smoothness $3/2$ and correlation length $\rho$. This covariance function is

$$k(x, x') = \left(1 + \frac{\sqrt{3}\,|x - x'|}{\rho}\right) \exp\!\left(-\frac{\sqrt{3}\,|x - x'|}{\rho}\right).$$

It is straightforward (though tedious) to compute that the corresponding kernel mean and initial error on $[0, 1]$ are

$$\nu[k(\cdot, x_i)] = \frac{\rho}{\sqrt{3}}\left[4 - e^{-\sqrt{3}x_i/\rho}\left(2 + \frac{\sqrt{3}\,x_i}{\rho}\right) - e^{-\sqrt{3}(1 - x_i)/\rho}\left(2 + \frac{\sqrt{3}\,(1 - x_i)}{\rho}\right)\right], \qquad \nu\nu[k] = \frac{4\rho}{\sqrt{3}} - 2\rho^2 + e^{-\sqrt{3}/\rho}\left(2\rho^2 + \frac{2\rho}{\sqrt{3}}\right).$$

Convergence of the Bayesian quadrature point estimate $\mathbb{E}[\nu[f]]$ and concentration of the posterior mass, as quantified by $\mathbb{V}[\nu[f]]$, around the true integral as $f$ is evaluated at more and more points is displayed in the accompanying animation.
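A minimal numerical sketch of this example is given below; it is not taken from the original presentation, the equally spaced design and the correlation length value are illustrative choices, and the kernel mean and initial error are the closed-form expressions stated above.

```python
import numpy as np

rho = 0.2                                   # illustrative correlation length
lam = np.sqrt(3.0) / rho

def f(x):                                   # integrand from the example
    return (1.0 + x ** 2) * np.sin(5.0 * np.pi * x) + 8.0 / 5.0

def k(x, y):                                # Matern-3/2 covariance function
    r = np.abs(x - y)
    return (1.0 + lam * r) * np.exp(-lam * r)

def kernel_mean(x):                         # nu[k(., x)] = integral of k(y, x) over [0, 1]
    return (4.0 - np.exp(-lam * x) * (2.0 + lam * x)
                - np.exp(-lam * (1.0 - x)) * (2.0 + lam * (1.0 - x))) / lam

initial_error = 4.0 / lam - 6.0 / lam ** 2 + np.exp(-lam) * (6.0 / lam ** 2 + 2.0 / lam)

n = 8
x = np.linspace(0.0, 1.0, n)                # equally spaced evaluation points
K = k(x[:, None], x[None, :]) + 1e-10 * np.eye(n)
w = np.linalg.solve(K, kernel_mean(x))      # quadrature weights
mean = w @ f(x)                             # posterior mean of nu[f]
var = initial_error - kernel_mean(x) @ w    # posterior variance of nu[f]
print(mean, var)                            # the true integral is approximately 1.79
```

Increasing the number of evaluation points shrinks the posterior variance and moves the point estimate towards the true value, which is the behaviour shown in the animation.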

Advantages and disadvantages

Since Bayesian quadrature is an example of probabilistic numerics, it inherits certain advantages compared with traditional numerical integration methods:

  - It returns a full posterior distribution over the value of the integral, rather than only a point estimate, and therefore provides a principled quantification of the uncertainty remaining after a finite number of function evaluations.
  - Prior knowledge about the integrand, such as smoothness or periodicity, can be encoded through the choice of prior, which can make the method more accurate for a given budget of function evaluations; this is particularly useful when evaluations of the integrand are expensive.

Despite these merits, Bayesian quadrature methods possess the following limitations:

  - Conjugate inference with a Gaussian process prior requires solving a linear system in the n × n kernel matrix, whose cost grows cubically in the number of function evaluations, limiting the method to moderate n.
  - The kernel mean and initial error needed to implement the method are available in closed form only for a relatively small number of kernel and measure pairs.
  - The quality of the uncertainty quantification depends on the prior, and on how its hyperparameters are set, being a faithful representation of beliefs about the integrand.

Algorithmic design

Prior distributions

The most commonly used prior for $f$ is a Gaussian process prior. This is mainly due to the advantage provided by Gaussian conjugacy and the fact that Gaussian processes can encode a wide range of prior knowledge, including smoothness, periodicity and sparsity, through a careful choice of prior covariance. However, a number of other prior distributions have also been proposed. These include multi-output Gaussian processes, [9] which are particularly useful when tackling multiple related numerical integration tasks simultaneously or sequentially, and tree-based priors such as Bayesian additive regression trees, [10] which are well suited to discontinuous $f$. Additionally, Dirichlet process priors have also been proposed for the integration measure $\nu$. [11]

Point selection

The points $x_1, \ldots, x_n$ are either considered to be given, or can be selected so as to ensure that the posterior on $\nu[f]$ concentrates at a faster rate. One approach consists of using point sets from other quadrature rules. For example, taking independent and identically distributed realisations from $\nu$ recovers a Bayesian approach to Monte Carlo, [3] whereas using certain deterministic point sets, such as low-discrepancy sequences or lattices, recovers a Bayesian alternative to quasi-Monte Carlo. [4] [12] It is of course also possible to use point sets specifically designed for Bayesian quadrature; see, for example, the work in [13], which exploited symmetries in point sets to obtain scalable Bayesian quadrature estimators. Alternatively, points can also be selected adaptively, following principles from active learning and Bayesian experimental design, so as to directly minimise posterior uncertainty, [14] [15] including for multi-output Gaussian processes. [16]
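As one simple instance of such adaptive designs, the sketch below greedily adds, from a finite candidate set, the point that most reduces the posterior variance of $\nu[f]$ (which, for a fixed kernel, does not depend on the observed function values). The assumptions and helper names are the same hypothetical ones as in the earlier snippets; this is not the specific algorithm of any of the cited papers.

```python
import numpy as np

def posterior_variance(x, kernel, kernel_mean, initial_error, jitter=1e-10):
    """Posterior variance of nu[f] for design points x; independent of the values of f."""
    K = kernel(x[:, None], x[None, :]) + jitter * np.eye(len(x))
    z = kernel_mean(x)
    return initial_error - z @ np.linalg.solve(K, z)

def greedy_design(candidates, n, kernel, kernel_mean, initial_error):
    """Pick n points from `candidates`, each time minimising the resulting posterior variance."""
    design = []
    for _ in range(n):
        best = min((c for c in candidates if c not in design),
                   key=lambda c: posterior_variance(np.array(design + [c]),
                                                    kernel, kernel_mean, initial_error))
        design.append(best)
    return np.array(design)
```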

Kernel mean and initial error

One of the challenges when implementing Bayesian quadrature is the need to evaluate the function $\nu[k(\cdot, x)]$ and the constant $\nu\nu[k]$. The former is commonly called the kernel mean, and is a quantity which is key to the computation of kernel-based distances such as the maximum mean discrepancy. The latter is commonly called the initial error, since it provides an upper bound on the integration error before any function values are observed. Unfortunately, the kernel mean and initial error can be computed in closed form only for a small number of kernel and measure pairs $(k, \nu)$; see, for example, Table 1 in [4].

Theory

There have been a number of theoretical guarantees derived for Bayesian quadrature. These usually require Sobolev smoothness properties of the integrand, [4] [17] [18] although recent work also extends to integrands in the reproducing kernel Hilbert space of the Gaussian kernel. [19] Most of the results apply to the case of Monte Carlo or deterministic grid point sets, but some results also extend to adaptive designs. [20]

Software


References

  1. Diaconis, P. (1988). "Bayesian numerical analysis". Statistical Decision Theory and Related Topics IV: 163–175. doi:10.1007/978-1-4613-8768-8_20. ISBN 978-1-4613-8770-1.
  2. O'Hagan, A. (1991). "Bayes–Hermite quadrature". Journal of Statistical Planning and Inference. 29: 245–260.
  3. Rasmussen, C.; Ghahramani, Z. (2002). "Bayesian Monte Carlo". Neural Information Processing Systems: 489–496.
  4. Briol, F.-X.; Oates, C. J.; Girolami, M.; Osborne, M. A.; Sejdinovic, D. (2019). "Probabilistic integration: A role in statistical computation? (with discussion and rejoinder)". Statistical Science. 34 (1): 1–22.
  5. Hennig, P.; Osborne, M. A.; Kersting, H. P. (2022). Probabilistic Numerics. Cambridge University Press. pp. 63–122. ISBN 978-1107163447.
  6. Jagadeeswaran, R.; Hickernell, F. J. (2019). "Fast automatic Bayesian cubature using lattice sampling". Statistics and Computing. 29 (6): 1215–1229. arXiv:1809.09803. doi:10.1007/s11222-019-09895-9.
  7. Fisher, M.; Oates, C.; Powell, C.; Teckentrup, A. (2020). "A locally adaptive Bayesian cubature method". International Conference on Artificial Intelligence and Statistics. PMLR: 1265–1275. arXiv:1910.02995.
  8. Cockayne, J.; Oates, C.; Sullivan, T.; Girolami, M. (2017). "Probabilistic numerical methods for PDE-constrained Bayesian inverse problems". AIP Conference Proceedings. 1853 (1): 060001. arXiv:1701.04006. doi:10.1063/1.4985359.
  9. Xi, X.; Briol, F.-X.; Girolami, M. (2018). "Bayesian quadrature for multiple related integrals". International Conference on Machine Learning: 8533–8564. arXiv:1801.04153.
  10. Zhu, H.; Liu, X.; Kang, R.; Shen, Z.; Flaxman, S.; Briol, F.-X. (2020). "Bayesian probabilistic numerical integration with tree-based models". Neural Information Processing Systems: 5837–5849. arXiv:2006.05371.
  11. Oates, C. J.; Niederer, S.; Lee, A.; Briol, F.-X.; Girolami, M. (2017). "Probabilistic models for integration error in the assessment of functional cardiac models". Neural Information Processing Systems: 110–118. arXiv:1606.06841.
  12. Jagadeeswaran, R.; Hickernell, F. J. (2019). "Fast automatic Bayesian cubature using lattice sampling". Statistics and Computing. 29 (6): 1215–1229. arXiv:1809.09803. doi:10.1007/s11222-019-09895-9.
  13. Karvonen, T.; Särkkä, S. (2018). "Fully symmetric kernel quadrature". SIAM Journal on Scientific Computing. 40 (2): 697–720. arXiv:1703.06359. doi:10.1137/17M1121779.
  14. Gunter, T.; Garnett, R.; Osborne, M. A.; Hennig, P.; Roberts, S. (2014). "Sampling for inference in probabilistic models with fast Bayesian quadrature". Neural Information Processing Systems: 2789–2797. arXiv:1411.0439.
  15. Briol, F.-X.; Oates, C. J.; Girolami, M.; Osborne, M. A. (2015). "Frank–Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees". Neural Information Processing Systems: 1162–1170. arXiv:1506.02681.
  16. Gessner, A.; Gonzalez, J.; Mahsereci, M. (2019). "Active multi-information source Bayesian quadrature". Uncertainty in Artificial Intelligence. arXiv:1903.11331.
  17. Kanagawa, M.; Sriperumbudur, B. K.; Fukumizu, K. (2020). "Convergence analysis of deterministic kernel-based quadrature rules in misspecified settings". Foundations of Computational Mathematics. 20: 155–194. arXiv:1709.00147. doi:10.1007/s10208-018-09407-7.
  18. Wynne, G.; Briol, F.-X.; Girolami, M. (2020). "Convergence guarantees for Gaussian process means with misspecified likelihoods and smoothness". Journal of Machine Learning Research. 22 (123): 1–40. arXiv:2001.10818.
  19. Karvonen, T.; Oates, C. J.; Girolami, M. (2021). "Integration in reproducing kernel Hilbert spaces of Gaussian kernels". Mathematics of Computation. 90 (331): 2209–2233. arXiv:2004.12654. doi:10.1090/mcom/3659.
  20. Kanagawa, M.; Hennig, P. (2019). "Convergence guarantees for adaptive Bayesian quadrature methods". Neural Information Processing Systems: 6237–6248. arXiv:1905.10271.