Stein discrepancy

A Stein discrepancy is a statistical divergence between two probability measures that is rooted in Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers, [1] but has since been used in diverse settings in statistics, machine learning and computer science. [2]

Definition

Let $\mathcal{X}$ be a measurable space and let $\mathcal{M}$ be a set of measurable functions of the form $m : \mathcal{X} \to \mathbb{R}$. A natural notion of distance between two probability distributions $P$, $Q$ defined on $\mathcal{X}$ is provided by an integral probability metric [3]

$$d_{\mathcal{M}}(P, Q) := \sup_{m \in \mathcal{M}} \left| \mathbb{E}_{X \sim P}[m(X)] - \mathbb{E}_{Y \sim Q}[m(Y)] \right|, \qquad (1.1)$$

where, for the purposes of exposition, we assume that the expectations exist and that the set $\mathcal{M}$ is sufficiently rich that (1.1) is indeed a metric on the set of probability distributions on $\mathcal{X}$, i.e. $d_{\mathcal{M}}(P, Q) = 0$ if and only if $P = Q$. The choice of the set $\mathcal{M}$ determines the topological properties of (1.1); for example, taking $\mathcal{M}$ to be the set of functions with Lipschitz constant at most one yields the Wasserstein-1 metric. However, for practical purposes the evaluation of (1.1) requires access to both $P$ and $Q$, often rendering direct computation of (1.1) impractical.

Stein's method is a theoretical tool that can be used to bound (1.1). Specifically, we suppose that we can identify an operator $\mathcal{A}_P$ and a set $\mathcal{F}_P$ of real-valued functions in the domain of $\mathcal{A}_P$, both of which may be $P$-dependent, such that for each $m \in \mathcal{M}$ there exists a solution $f_m \in \mathcal{F}_P$ to the Stein equation

$$m(x) - \mathbb{E}_{X \sim P}[m(X)] = \mathcal{A}_P f_m(x). \qquad (1.2)$$

The operator $\mathcal{A}_P$ is termed a Stein operator and the set $\mathcal{F}_P$ is called a Stein set. It follows immediately, upon substituting (1.2) into (1.1), that we obtain the upper bound

$$d_{\mathcal{M}}(P, Q) = \sup_{m \in \mathcal{M}} \left| \mathbb{E}_{Y \sim Q}[\mathcal{A}_P f_m(Y)] \right| \le \sup_{f \in \mathcal{F}_P} \left| \mathbb{E}_{Y \sim Q}[\mathcal{A}_P f(Y)] \right|.$$

This resulting bound

$$D_P(Q) := \sup_{f \in \mathcal{F}_P} \left| \mathbb{E}_{Y \sim Q}[\mathcal{A}_P f(Y)] \right|$$

is called a Stein discrepancy. [1] In contrast to the original integral probability metric $d_{\mathcal{M}}(P, Q)$, it may be possible to analyse or compute $D_P(Q)$ using expectations only with respect to the distribution $Q$.

Examples

Several different Stein discrepancies have been studied, with some of the most widely used presented next.

Classical Stein discrepancy

For a probability distribution $P$ with positive and differentiable density function $p$ on a convex set $\mathcal{X} \subseteq \mathbb{R}^d$, whose boundary is denoted $\partial \mathcal{X}$, the combination of the Langevin–Stein operator $\mathcal{A}_P f = \nabla \cdot f + f \cdot \nabla \log p$ and the classical Stein set

$$\mathcal{F}_P = \left\{ f : \mathcal{X} \to \mathbb{R}^d \;\middle|\; \sup_{x \neq y} \max\!\left( \|f(x)\|,\ \|\nabla f(x)\|,\ \frac{\|\nabla f(x) - \nabla f(y)\|}{\|x - y\|} \right) \le 1,\ \ \langle f(x), n(x) \rangle = 0 \ \text{for all}\ x \in \partial \mathcal{X} \right\}$$

yields the classical Stein discrepancy. [1] Here $\|\cdot\|$ denotes the Euclidean norm and $\langle \cdot, \cdot \rangle$ the Euclidean inner product; for matrices $M \in \mathbb{R}^{d \times d}$, $\|M\|$ denotes the associated operator norm. The vector $n(x)$ denotes the outward unit normal to $\partial \mathcal{X}$ at location $x$. If $\mathcal{X} = \mathbb{R}^d$ then we interpret $\partial \mathcal{X} = \emptyset$.

In the univariate case $d = 1$, the classical Stein discrepancy can be computed exactly by solving a quadratically constrained quadratic program. [1]

Graph Stein discrepancy

The first known computable Stein discrepancies were the graph Stein discrepancies (GSDs). Given a discrete distribution $Q = \sum_{i=1}^n w_i \delta(x_i)$, one can define a graph $G = (V, E)$ with vertex set $V = \{x_1, \dots, x_n\}$ and edge set $E \subseteq V \times V$. From this graph, one can define a graph Stein set, which imposes the classical smoothness constraints only at the vertices and along the edges of $G$.

The GSD is actually the solution of a finite-dimensional linear program, with the size of the edge set $E$ as low as linear in the number of states $n$, meaning that the GSD can be efficiently computed. [1]

Kernel Stein discrepancy

The supremum arising in the definition of the Stein discrepancy can be evaluated in closed form using a particular choice of Stein set. Indeed, let $\mathcal{F}_P = \{ f \in H(K) : \|f\|_{H(K)} \le 1 \}$ be the unit ball in a (possibly vector-valued) reproducing kernel Hilbert space $H(K)$ with reproducing kernel $K$, whose elements are in the domain of the Stein operator $\mathcal{A}_P$. Suppose that

$$k_P(x, x') := \mathcal{A}_P \mathcal{A}_P' K(x, x')$$

is well-defined, where the Stein operator $\mathcal{A}_P$ acts on the first argument of $K$ and $\mathcal{A}_P'$ acts on the second argument. Then it can be shown [4] that

$$D_P(Q) = \sqrt{ \mathbb{E}_{X, X' \sim Q}\left[ k_P(X, X') \right] },$$

where the random variables $X$ and $X'$ in the expectation are independent. In particular, if $Q = \sum_{i=1}^n w_i \delta(x_i)$ is a discrete distribution on $\mathcal{X}$, then the Stein discrepancy takes the closed form

$$D_P(Q) = \sqrt{ \sum_{i=1}^n \sum_{j=1}^n w_i w_j \, k_P(x_i, x_j) }.$$

A Stein discrepancy constructed in this manner is called a kernel Stein discrepancy [5] [6] [7] [8] and the construction is closely connected to the theory of kernel embedding of probability distributions.
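To see why the supremum admits this closed form, a brief sketch of the standard argument (assuming that $\mathcal{A}_P K(\cdot, x) \in H(K)$ for each $x \in \mathcal{X}$ and that the expectation may be exchanged with the inner product) is as follows. By the reproducing property,

$$\mathbb{E}_{X \sim Q}\left[ \mathcal{A}_P f(X) \right] = \left\langle f,\ \mathbb{E}_{X \sim Q}\left[ \mathcal{A}_P K(\cdot, X) \right] \right\rangle_{H(K)},$$

so, by the Cauchy–Schwarz inequality (with equality attained within the unit ball),

$$\sup_{\|f\|_{H(K)} \le 1} \left| \mathbb{E}_{X \sim Q}\left[ \mathcal{A}_P f(X) \right] \right| = \left\| \mathbb{E}_{X \sim Q}\left[ \mathcal{A}_P K(\cdot, X) \right] \right\|_{H(K)} = \sqrt{ \mathbb{E}_{X, X' \sim Q}\left[ k_P(X, X') \right] },$$

where the final equality expands the squared norm using the reproducing property once more.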

Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a reproducing kernel. For a probability distribution $P$ with positive and differentiable density function $p$ on $\mathcal{X} = \mathbb{R}^d$, the combination of the Langevin–Stein operator $\mathcal{A}_P f = \nabla \cdot f + f \cdot \nabla \log p$ and the Stein set

$$\mathcal{F}_P = \{ f \in H(K) : \|f\|_{H(K)} \le 1 \},$$

associated to the matrix-valued reproducing kernel $K(x, x') = k(x, x') I_{d \times d}$, yields a kernel Stein discrepancy with [5]

$$k_P(x, x') = \nabla_x \cdot \nabla_{x'} k(x, x') + \nabla_x k(x, x') \cdot \nabla_{x'} \log p(x') + \nabla_{x'} k(x, x') \cdot \nabla_x \log p(x) + k(x, x') \, \nabla_x \log p(x) \cdot \nabla_{x'} \log p(x'), \qquad (2.1)$$

where $\nabla_x$ (resp. $\nabla_{x'}$) indicates the gradient with respect to the argument indexed by $x$ (resp. $x'$).

Concretely, if we take the inverse multi-quadric kernel $k(x, x') = \left(1 + (x - x')^\top \Sigma^{-1} (x - x')\right)^{-\beta}$ with parameters $\beta > 0$ and $\Sigma \in \mathbb{R}^{d \times d}$ a symmetric positive definite matrix, and if we denote $u(x) = \nabla \log p(x)$, then we have

$$k_P(x, x') = -\,\frac{4 \beta (\beta + 1)\, (x - x')^\top \Sigma^{-2} (x - x')}{\left(1 + (x - x')^\top \Sigma^{-1} (x - x')\right)^{\beta + 2}} + \frac{2 \beta \left[ \operatorname{tr}(\Sigma^{-1}) + \left( u(x) - u(x') \right)^\top \Sigma^{-1} (x - x') \right]}{\left(1 + (x - x')^\top \Sigma^{-1} (x - x')\right)^{\beta + 1}} + \frac{u(x)^\top u(x')}{\left(1 + (x - x')^\top \Sigma^{-1} (x - x')\right)^{\beta}}.$$
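To illustrate, the following is a minimal sketch (not code from the cited works) of how a kernel Stein discrepancy of this form could be computed for a sample, taking $\Sigma = I$ and $\beta = 1/2$, and assuming a hypothetical unnormalised Gaussian target whose score function is available in closed form; all function names are illustrative.

```python
# A minimal sketch of the closed-form kernel Stein discrepancy (KSD) under the
# Langevin-Stein operator with the inverse multi-quadric base kernel
# k(x, x') = (1 + ||x - x'||^2)^(-beta), i.e. Sigma = I.
# Only the score (gradient of the log-density) of the target is needed,
# never its normalising constant.
import numpy as np

def score_unnormalised_gaussian(x, mean, cov_inv):
    """Score grad log p(x) of a Gaussian target; independent of the normaliser."""
    return -(x - mean) @ cov_inv

def imq_stein_kernel(x, y, sx, sy, beta=0.5):
    """Langevin-Stein kernel k_p(x, y) built from the IMQ base kernel."""
    r = x - y
    q = 1.0 + r @ r                      # 1 + ||x - y||^2
    d = x.shape[0]
    term1 = -4.0 * beta * (beta + 1.0) * q ** (-beta - 2.0) * (r @ r)   # div_x div_y k
    term2 = 2.0 * beta * q ** (-beta - 1.0) * (d + r @ (sx - sy))       # cross terms
    term3 = q ** (-beta) * (sx @ sy)                                    # k * <s(x), s(y)>
    return term1 + term2 + term3

def ksd(samples, score_fn, beta=0.5):
    """KSD between the empirical distribution of `samples` and the target p."""
    n = samples.shape[0]
    scores = np.array([score_fn(x) for x in samples])
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += imq_stein_kernel(samples[i], samples[j], scores[i], scores[j], beta)
    return np.sqrt(total) / n            # sqrt( (1/n^2) sum_ij k_p(x_i, x_j) )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mean, cov_inv = np.zeros(2), np.eye(2)
    good = rng.standard_normal((200, 2))          # roughly distributed as the target
    bad = rng.standard_normal((200, 2)) + 2.0     # shifted away from the target
    s = lambda x: score_unnormalised_gaussian(x, mean, cov_inv)
    print("KSD (matched sample):   ", ksd(good, s))
    print("KSD (mismatched sample):", ksd(bad, s))
```

As expected, the discrepancy reported for the sample drawn from the target is small relative to that of the shifted sample.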

Diffusion Stein discrepancy

Diffusion Stein discrepancies [9] generalise the Langevin–Stein operator to a class of diffusion Stein operators, each representing an Itô diffusion that has $P$ as its stationary distribution and each characterised by a matrix-valued function determined by the infinitesimal generator of the diffusion.

Other Stein discrepancies

Additional Stein discrepancies have been developed for constrained domains, [10] non-Euclidean domains, [11] [12] [10] discrete domains, [13] [14] and improved scalability. [15] [16]

Properties

The flexibility in the choice of Stein operator and Stein set in the construction of Stein discrepancy precludes general statements of a theoretical nature. However, much is known about the particular Stein discrepancies described above.

Computable without the normalisation constant

Stein discrepancy can sometimes be computed in challenging settings where the probability distribution $P$ admits a probability density function $p$ (with respect to an appropriate reference measure on $\mathcal{X}$) of the form $p(x) = \tilde{p}(x) / Z$, where $\tilde{p}$ and its derivative can be numerically evaluated but whose normalisation constant $Z$ is not easily computed or approximated. Considering (2.1), we observe that the dependence of $k_P$ on $P$ occurs only through the score function

$$\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x) = \frac{\nabla_x \tilde{p}(x)}{\tilde{p}(x)},$$

which does not depend on the normalisation constant $Z$.
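For instance, for a hypothetical unnormalised Gaussian density $\tilde{p}(x) = \exp\!\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$, the score is

$$\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x) = -\Sigma^{-1} (x - \mu),$$

which can be evaluated without knowledge of the normalisation constant $(2\pi)^{d/2} \det(\Sigma)^{1/2}$.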

Stein discrepancy as a statistical divergence

A basic requirement of a Stein discrepancy is that it is a statistical divergence, meaning that $D_P(Q) \ge 0$ and that $D_P(Q) = 0$ if and only if $P = Q$. This property can be shown to hold for the classical Stein discrepancy [1] and the kernel Stein discrepancy, [6] [7] [8] provided that appropriate regularity conditions hold.

Convergence control

A stronger property, compared to being a statistical divergence, is convergence control, meaning that $D_P(Q_n) \to 0$ implies that $Q_n$ converges to $P$ in a sense to be specified. For example, under appropriate regularity conditions, both the classical Stein discrepancy and the graph Stein discrepancy enjoy Wasserstein convergence control, meaning that $D_P(Q_n) \to 0$ implies that the Wasserstein metric between $Q_n$ and $P$ converges to zero. [1] [17] [9] For the kernel Stein discrepancy, weak convergence control has been established [8] [18] under regularity conditions on the distribution $P$ and the reproducing kernel $k$, which are applicable in particular to (2.1). Other well-known choices of $k$, such as those based on the Gaussian kernel, provably do not enjoy weak convergence control. [8]

Convergence detection

The converse property to convergence control is convergence detection, meaning that $D_P(Q_n) \to 0$ whenever $Q_n$ converges to $P$ in a sense to be specified. For example, under appropriate regularity conditions, the classical Stein discrepancy enjoys a particular form of mean-square convergence detection, [1] [9] meaning that $D_P(Q_n) \to 0$ whenever $X_n \sim Q_n$ converges in mean-square to $X \sim P$ and $\nabla \log p(X_n)$ converges in mean-square to $\nabla \log p(X)$. For the kernel Stein discrepancy, Wasserstein convergence detection has been established [8] under appropriate regularity conditions on the distribution $P$ and the reproducing kernel $k$.

Applications of Stein discrepancy

Several applications of Stein discrepancy have been proposed, some of which are now described.

Optimal quantisation

Animation: optimal quantisation using Stein discrepancy. The contours represent level sets of a continuous probability distribution $P$, and the task is to summarise this distribution with a discrete set of states selected from its domain $\mathcal{X}$. In particular, the density function $p$ is assumed to be known only up to proportionality, a setting where Markov chain Monte Carlo (MCMC) methods are widely used. In the first half of the animation a Markov chain produces samples that are approximately distributed according to $P$, with the sample path shown in black. In the second half, an algorithm called Stein thinning [19] is applied to select a subset of states from the sample path, with selected states shown in red. These states are selected by greedy minimisation of a Stein discrepancy between the discrete distribution and $P$. Together, the selected states provide an approximation of $P$ that, in this instance, is more accurate than that provided by the original MCMC output.

Given a probability distribution $P$ defined on a measurable space $\mathcal{X}$, the quantisation task is to select a small number of states $x_1, \dots, x_n \in \mathcal{X}$ such that the associated discrete distribution $Q_n = \frac{1}{n} \sum_{i=1}^n \delta(x_i)$ is an accurate approximation of $P$, in a sense to be specified.

Stein points [18] are the result of performing optimal quantisation via minimisation of the Stein discrepancy:

$$\underset{x_1, \dots, x_n \in \mathcal{X}}{\operatorname{arg\,min}} \; D_P\!\left( \frac{1}{n} \sum_{i=1}^n \delta(x_i) \right). \qquad (3.1)$$

Under appropriate regularity conditions, it can be shown [18] that $D_P(Q_n) \to 0$ as $n \to \infty$. Thus, if the Stein discrepancy enjoys convergence control, it follows that $Q_n$ converges to $P$. Extensions of this result, to allow for imperfect numerical optimisation, have also been derived. [18] [20] [19]

Sophisticated optimisation algorithms have been designed to perform efficient quantisation based on Stein discrepancy, including gradient flow algorithms that aim to minimise kernel Stein discrepancy over an appropriate space of probability measures. [21]

Optimal weighted approximation

If one is allowed to consider weighted combinations of point masses, then more accurate approximation is possible compared to (3.1). For simplicity of exposition, suppose we are given a set of states $\{x_i\}_{i=1}^n \subset \mathcal{X}$. Then the optimal weighted combination of the point masses $\delta(x_i)$, i.e.

$$\sum_{i=1}^n w_i^* \delta(x_i), \qquad w^* = \underset{w_1 + \dots + w_n = 1}{\operatorname{arg\,min}} \; D_P\!\left( \sum_{i=1}^n w_i \delta(x_i) \right),$$

which minimises the Stein discrepancy, can be obtained in closed form when a kernel Stein discrepancy is used. [5] Some authors [22] [23] consider imposing, in addition, a non-negativity constraint on the weights, i.e. $w_i \ge 0$. However, in both cases the computation of the optimal weights can involve solving linear systems of equations that are numerically ill-conditioned. Interestingly, it has been shown [19] that greedy approximation of $P$ using an un-weighted combination of states can reduce this computational requirement. In particular, the greedy Stein thinning algorithm

$$x_{n+1} \in \underset{x}{\operatorname{arg\,min}} \; D_P\!\left( \frac{1}{n+1} \left[ \delta(x) + \sum_{i=1}^n \delta(x_i) \right] \right)$$

has been shown to satisfy an explicit error bound, guaranteeing that the kernel Stein discrepancy of the selected states converges to zero under appropriate regularity conditions. [19]
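For the kernel Stein discrepancy, the closed form of the optimal weights mentioned above can be sketched by a standard constrained quadratic minimisation: writing $(K_P)_{ij} = k_P(x_i, x_j)$ and assuming $K_P$ is invertible,

$$w^* = \frac{K_P^{-1} \mathbf{1}}{\mathbf{1}^\top K_P^{-1} \mathbf{1}},$$

which makes explicit the linear system whose potential ill-conditioning is mentioned above. The greedy alternative can likewise be sketched in a few lines, assuming a precomputed Stein-kernel Gram matrix over candidate states (for example, MCMC output); the function name and interface below are illustrative rather than those of any published implementation.

```python
# A minimal sketch of greedy Stein thinning, given a precomputed Stein-kernel
# Gram matrix K with K[i, j] = k_p(y_i, y_j) over m candidate states
# (k_p could be, for example, the IMQ-based kernel sketched earlier).
import numpy as np

def stein_thinning(K, num_points):
    """Greedily select `num_points` indices to minimise the kernel Stein discrepancy."""
    m = K.shape[0]
    selected = []
    running = np.zeros(m)          # running[j] = sum over already-selected i of K[j, i]
    for _ in range(num_points):
        # Adding candidate j increases the (unnormalised) squared KSD by
        # K[j, j] + 2 * running[j]; pick the candidate with the smallest increase.
        objective = 0.5 * np.diag(K) + running
        j = int(np.argmin(objective))
        selected.append(j)
        running += K[:, j]
    return selected
```

Note that, as in Stein thinning itself, a candidate state may be selected more than once.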

Non-myopic and mini-batch generalisations of the greedy algorithm have been demonstrated [24] to yield further improvement in approximation quality relative to computational cost.

Variational inference

Stein discrepancy has been exploited as a variational objective in variational Bayesian methods. [25] [26] Given a collection $\{P_\theta : \theta \in \Theta\}$ of probability distributions on $\mathcal{X}$, parametrised by $\theta \in \Theta$, one can seek the distribution in this collection that best approximates a distribution $P$ of interest:

$$\underset{\theta \in \Theta}{\operatorname{arg\,min}} \; D_P(P_\theta).$$

A possible advantage of Stein discrepancy in this context, [26] compared to the traditional Kullback–Leibler variational objective, is that $P_\theta$ need not be absolutely continuous with respect to $P$ in order for $D_P(P_\theta)$ to be well-defined. This property can be used to circumvent the use of flow-based generative models, for example, which impose diffeomorphism constraints in order to enforce absolute continuity of $P_\theta$ and $P$.

Statistical estimation

Stein discrepancy has been proposed as a tool to fit parametric statistical models to data. Given a dataset $\{x_i\}_{i=1}^n \subset \mathcal{X}$, consider the associated discrete distribution $Q_n = \frac{1}{n} \sum_{i=1}^n \delta(x_i)$. For a given parametric collection $\{P_\theta : \theta \in \Theta\}$ of probability distributions on $\mathcal{X}$, one can estimate a value of the parameter $\theta$ which is compatible with the dataset using a minimum Stein discrepancy estimator [27]

$$\hat{\theta} \in \underset{\theta \in \Theta}{\operatorname{arg\,min}} \; D_{P_\theta}(Q_n).$$

The approach is closely related to the framework of minimum distance estimation, with the role of the "distance" being played by the Stein discrepancy. Alternatively, a generalised Bayesian approach to estimation of the parameter can be considered [4] where, given a prior probability distribution with density function $\pi(\theta)$, $\theta \in \Theta$ (with respect to an appropriate reference measure on $\Theta$), one constructs a generalised posterior with probability density function

$$\pi_n(\theta) \propto \pi(\theta) \exp\!\left( -\gamma \, D_{P_\theta}(Q_n)^2 \right), \qquad \theta \in \Theta,$$

for some $\gamma > 0$ to be specified or determined.
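As an illustration, the following sketch fits the location parameter of a hypothetical univariate Gaussian model by minimising the squared kernel Stein discrepancy based on (2.1) with the inverse multi-quadric kernel (taking $\Sigma = 1$); the setup and function names are assumptions made for the example, not the estimators used in the cited works.

```python
# A minimal sketch of a minimum kernel Stein discrepancy estimator: fit the
# location theta of a N(theta, 1) model to data by minimising the squared KSD
# of the empirical distribution from P_theta.
import numpy as np
from scipy.optimize import minimize_scalar

def ksd_squared(theta, x, beta=0.5):
    """Squared KSD between the empirical distribution of x and N(theta, 1)."""
    s = theta - x                        # score of N(theta, 1): d/dx log p = theta - x
    r = x[:, None] - x[None, :]          # pairwise differences x_i - x_j
    q = 1.0 + r ** 2                     # IMQ base kernel: k = q ** (-beta)
    kp = (-4.0 * beta * (beta + 1.0) * q ** (-beta - 2.0) * r ** 2
          + 2.0 * beta * q ** (-beta - 1.0) * (1.0 + r * (s[:, None] - s[None, :]))
          + q ** (-beta) * (s[:, None] * s[None, :]))
    return kp.mean()                     # (1/n^2) sum_ij k_p(x_i, x_j)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(loc=3.0, scale=1.0, size=300)
    fit = minimize_scalar(lambda t: ksd_squared(t, data), bounds=(-10, 10), method="bounded")
    print("minimum-KSD estimate of theta:", fit.x)   # should lie close to the true value 3.0
```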

Hypothesis testing

The Stein discrepancy has also been used as a test statistic for performing goodness-of-fit testing [6] [7] and comparing latent variable models. [28] Since the aforementioned tests have a computational cost quadratic in the sample size, alternatives have been developed with (near-)linear runtimes. [29] [15]
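A simplified sketch of such a test is given below, using a Rademacher multiplier bootstrap to approximate the null distribution of the squared-KSD statistic given a precomputed Stein-kernel Gram matrix; this is a stand-in for, not a reproduction of, the wild-bootstrap procedures used in the cited tests.

```python
# A minimal sketch of a goodness-of-fit test based on the kernel Stein discrepancy.
# The observed V-statistic S = (1/n^2) sum_ij k_p(x_i, x_j) is compared against a
# null distribution approximated with Rademacher multipliers, exploiting the fact
# that the Stein kernel has zero mean under the null hypothesis.
import numpy as np

def ksd_test(K, num_bootstrap=1000, seed=0):
    """p-value for H0: the sample was drawn from p, given the Stein Gram matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    observed = K.sum() / n ** 2
    null_stats = np.empty(num_bootstrap)
    for b in range(num_bootstrap):
        eps = rng.choice([-1.0, 1.0], size=n)        # Rademacher multipliers
        null_stats[b] = eps @ K @ eps / n ** 2
    return float(np.mean(null_stats >= observed))
```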


References

  1. Gorham, J. & Mackey, L. (2015). Measuring Sample Quality with Stein's Method. Advances in Neural Information Processing Systems.
  2. Anastasiou, A., Barp, A., Briol, F-X., Ebner, B., Gaunt, R. E., Ghaderinezhad, F., Gorham, J., Gretton, A., Ley, C., Liu, Q., Mackey, L., Oates, C. J., Reinert, G. & Swan, Y. (2021). Stein's method meets statistics: A review of some recent developments. arXiv:2105.03481.
  3. Müller, Alfred (1997). "Integral Probability Metrics and Their Generating Classes of Functions". Advances in Applied Probability. 29 (2): 429–443. doi:10.2307/1428011. ISSN 0001-8678.
  4. Matsubara, T., Knoblauch, J., Briol, F-X. & Oates, C. J. (2021). Robust Generalised Bayesian Inference for Intractable Likelihoods. arXiv:2104.07359.
  5. Oates, C. J., Girolami, M. & Chopin, N. (2017). Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society B: Statistical Methodology, 79(3), 695–718.
  6. Liu, Q., Lee, J. D. & Jordan, M. I. (2016). A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. International Conference on Machine Learning, 276–284.
  7. Chwialkowski, K., Strathmann, H. & Gretton, A. (2016). A kernel test of goodness of fit. International Conference on Machine Learning, 2606–2615.
  8. Gorham, J. & Mackey, L. (2017). Measuring sample quality with kernels. International Conference on Machine Learning, 1292–1301. PMLR.
  9. Gorham, J., Duncan, A. B., Vollmer, S. J. & Mackey, L. (2019). Measuring sample quality with diffusions. The Annals of Applied Probability, 29(5), 2884–2928.
  10. Shi, J., Liu, C. & Mackey, L. (2021). Sampling with Mirrored Stein Operators. arXiv:2106.12506.
  11. Barp, A., Oates, C. J., Porcu, E. & Girolami, M. (2018). A Riemann–Stein kernel method. arXiv:1810.04946.
  12. Xu, W. & Matsuda, T. (2021). Interpretable Stein Goodness-of-fit Tests on Riemannian Manifolds. International Conference on Machine Learning.
  13. Yang, J., Liu, Q., Rao, V. & Neville, J. (2018). Goodness-of-fit testing for discrete distributions via Stein discrepancy. International Conference on Machine Learning, 5561–5570. PMLR.
  14. Shi, J., Zhou, Y., Hwang, J., Titsias, M. & Mackey, L. (2022). Gradient Estimation with Discrete Stein Operators. arXiv:2202.09497.
  15. Huggins, J. H. & Mackey, L. (2018). Random Feature Stein Discrepancies. Advances in Neural Information Processing Systems.
  16. Gorham, J., Raj, A. & Mackey, L. (2020). Stochastic Stein Discrepancies. Advances in Neural Information Processing Systems.
  17. Mackey, L. & Gorham, J. (2016). Multivariate Stein factors for a class of strongly log-concave distributions. Electronic Communications in Probability, 21, 1–14.
  18. Chen, W. Y., Mackey, L., Gorham, J., Briol, F-X. & Oates, C. J. (2018). Stein points. International Conference on Machine Learning, 844–853. PMLR.
  19. Riabiz, M., Chen, W., Cockayne, J., Swietach, P., Niederer, S. A., Mackey, L. & Oates, C. J. (2021). Optimal thinning of MCMC output. Journal of the Royal Statistical Society B: Statistical Methodology, to appear. arXiv:2005.03952.
  20. Chen, W. Y., Barp, A., Briol, F-X., Gorham, J., Girolami, M., Mackey, L. & Oates, C. J. (2019). Stein Point Markov Chain Monte Carlo. International Conference on Machine Learning. arXiv:1905.03673.
  21. Korba, A., Aubin-Frankowski, P-C., Majewski, S. & Ablin, P. (2021). Kernel Stein Discrepancy Descent. arXiv:2105.09994.
  22. Liu, Q. & Lee, J. (2017). Black-box importance sampling. Artificial Intelligence and Statistics, 952–961. PMLR.
  23. Hodgkinson, L., Salomone, R. & Roosta, F. (2020). The reproducing Stein kernel approach for post-hoc corrected sampling. arXiv:2001.09266.
  24. Teymur, O., Gorham, J., Riabiz, M. & Oates, C. J. (2021). Optimal quantisation of probability measures using maximum mean discrepancy. International Conference on Artificial Intelligence and Statistics, 1027–1035. PMLR.
  25. Ranganath, R., Tran, D., Altosaar, J. & Blei, D. (2016). Operator variational inference. Advances in Neural Information Processing Systems, 29, 496–504.
  26. Fisher, M., Nolan, T., Graham, M., Prangle, D. & Oates, C. J. (2021). Measure transport with kernel Stein discrepancy. International Conference on Artificial Intelligence and Statistics, 1054–1062. PMLR.
  27. Barp, A., Briol, F-X., Duncan, A. B., Girolami, M. & Mackey, L. (2019). Minimum Stein discrepancy estimators. Advances in Neural Information Processing Systems, 12964–12976.
  28. Kanagawa, H., Jitkrittum, W., Mackey, L., Fukumizu, K. & Gretton, A. (2019). A kernel Stein test for comparing latent variable models. arXiv:1907.00586.
  29. Jitkrittum, W., Xu, W., Szabó, Z., Fukumizu, K. & Gretton, A. (2017). A Linear-Time Kernel Goodness-of-Fit Test. Advances in Neural Information Processing Systems.