The nested sampling algorithm is a computational approach to the Bayesian statistics problems of comparing models and generating samples from posterior distributions. It was developed in 2004 by physicist John Skilling. [1]
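Bayes' theorem can be applied to a pair of competing models $M_1$ and $M_2$ for data $D$, one of which may be true (though which one is unknown) but not both. The posterior probability for $M_1$ may be calculated as:

$$P(M_1 \mid D) = \frac{P(D \mid M_1)\,P(M_1)}{P(D)} = \frac{P(D \mid M_1)\,P(M_1)}{P(D \mid M_1)\,P(M_1) + P(D \mid M_2)\,P(M_2)} = \frac{1}{1 + \dfrac{P(D \mid M_2)}{P(D \mid M_1)}\,\dfrac{P(M_2)}{P(M_1)}}$$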
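As a small numerical illustration, the posterior probability of the first model can be computed from the two marginal likelihoods and the two prior probabilities. The function name `posterior_model_probability` and the evidence values below are made up for illustration, and the sketch assumes that exactly one of the two models is true.

```python
def posterior_model_probability(evidence_1, evidence_2, prior_1=0.5, prior_2=0.5):
    """Return P(M1 | D) for two mutually exclusive models.

    evidence_1 and evidence_2 stand for P(D | M1) and P(D | M2);
    prior_1 and prior_2 stand for P(M1) and P(M2).
    """
    bayes_factor_21 = evidence_2 / evidence_1  # P(D | M2) / P(D | M1)
    prior_odds_21 = prior_2 / prior_1          # P(M2) / P(M1)
    return 1.0 / (1.0 + bayes_factor_21 * prior_odds_21)


# Hypothetical evidences: M1 explains the data three times better than M2.
p_m1 = posterior_model_probability(evidence_1=0.03, evidence_2=0.01)
print(f"P(M1 | D) = {p_m1:.3f}")
```

With equal priors, the result depends only on the ratio of the two evidences, which is why the difficult part of the calculation is the Bayes factor.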
The prior probabilities $P(M_1)$ and $P(M_2)$ are already known, as they are chosen by the researcher ahead of time. However, the remaining Bayes factor $P(D \mid M_2)/P(D \mid M_1)$ is not so easy to evaluate, since in general it requires marginalizing nuisance parameters. Generally, $M_1$ has a set of parameters that can be grouped together and called $\theta$, and $M_2$ has its own vector of parameters that may be of different dimensionality, but is still termed $\theta$. The marginalization for $M_1$ is

$$P(D \mid M_1) = \int P(D \mid \theta, M_1)\, P(\theta \mid M_1)\, d\theta$$
and likewise for $M_2$. This integral is often analytically intractable, and in these cases it is necessary to employ a numerical algorithm to find an approximation. The nested sampling algorithm was developed by John Skilling specifically to approximate these marginalization integrals, and it has the added benefit of generating samples from the posterior distribution $P(\theta \mid D, M_1)$. [2] It is an alternative to methods from the Bayesian literature [3] such as bridge sampling and defensive importance sampling.
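For a model with a single parameter, the marginalization integral can be checked by brute-force quadrature. The sketch below assumes a hypothetical toy model that is not part of the original text: one observation y = 0.5 drawn from a normal distribution with unknown mean θ and unit variance, and a uniform prior on θ over [-5, 5]. All function names are illustrative.

```python
import math


def likelihood(theta, y=0.5, sigma=1.0):
    """P(D | theta, M): normal density of the observation y given mean theta."""
    return math.exp(-0.5 * ((y - theta) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def prior_density(theta, low=-5.0, high=5.0):
    """P(theta | M): uniform density on [low, high]."""
    return 1.0 / (high - low) if low <= theta <= high else 0.0


def evidence_by_quadrature(n_cells=10_000, low=-5.0, high=5.0):
    """Midpoint-rule approximation of the integral of likelihood times prior."""
    width = (high - low) / n_cells
    total = 0.0
    for k in range(n_cells):
        theta = low + (k + 0.5) * width
        total += likelihood(theta) * prior_density(theta) * width
    return total


def evidence_exact(y=0.5, sigma=1.0, low=-5.0, high=5.0):
    """Closed form for this toy model, using the normal CDF."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return (cdf((high - y) / sigma) - cdf((low - y) / sigma)) / (high - low)


z_grid = evidence_by_quadrature()
z_exact = evidence_exact()
print(z_grid, z_exact)
```

A grid of this kind needs a number of cells that grows exponentially with the number of parameters, so it is only practical in very low dimension; this is the situation in which sampling-based estimates of the integral become attractive.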
Here is a simple version of the nested sampling algorithm, followed by a description of how it computes the marginal probability density $Z = P(D \mid M)$ where $M$ is $M_1$ or $M_2$:
    Start with N points θ_1, ..., θ_N sampled from the prior; set Z := 0 and X_0 := 1.
    for i = 1 to j do        % The number of iterations j is chosen by guesswork.
        L_i := min(current likelihood values of the points);
        X_i := exp(-i/N);
        w_i := X_{i-1} - X_i;
        Z := Z + L_i * w_i;
        Save the point with least likelihood as a sample point with weight w_i.
        Update the point with least likelihood with some Markov chain Monte Carlo steps according to the prior, accepting only steps that keep the likelihood above L_i.
    end
    return Z;
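At each iteration, $X_i$ is an estimate of the amount of prior mass covered by the hypervolume in parameter space of all points with likelihood greater than $L_i$. The weight factor $w_i$ is an estimate of the amount of prior mass that lies between two nested hypersurfaces $\{\theta : P(D \mid \theta, M) = L_{i-1}\}$ and $\{\theta : P(D \mid \theta, M) = L_i\}$. The update step $Z := Z + L_i w_i$ computes the sum over $i$ of $L_i w_i$ to numerically approximate the integral

$$P(D \mid M) = \int P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta = \int P(D \mid \theta, M)\, dP(\theta \mid M)$$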
In the limit $j \to \infty$, this estimator has a positive bias of order $1/N$, [4] which can be removed by using $(1 - 1/N)$ instead of the $\exp(-1/N)$ in the above algorithm.
The idea is to subdivide the range of $f(\theta) = P(D \mid \theta, M)$ and estimate, for each interval $[L_{i-1}, L_i]$ between successive likelihood thresholds, how likely it is a priori that a randomly chosen $\theta$ would map to this interval. This can be thought of as a Bayesian's way to numerically implement Lebesgue integration. [5]
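The simple algorithm described above can be sketched in Python as follows. This is a minimal illustration, not Skilling's reference code. It assumes a toy problem that is not part of the original text (a standard two-dimensional Gaussian likelihood with a uniform prior on the square [-5, 5]²) and a deliberately basic update step: the discarded point is restarted from a randomly chosen surviving point and moved by a short random walk that stays inside the prior and above the current likelihood threshold. The names `log_likelihood`, `nested_sampling`, `n_live`, `n_iter` and `n_mcmc` are illustrative.

```python
import numpy as np


def log_likelihood(theta):
    """Toy log-likelihood (an assumption): standard 2D Gaussian density."""
    return -0.5 * np.sum(theta ** 2) - np.log(2.0 * np.pi)


def nested_sampling(n_live=200, n_iter=2000, n_mcmc=20, low=-5.0, high=5.0, seed=0):
    rng = np.random.default_rng(seed)
    ndim = 2
    # Start with N points sampled from the (uniform) prior.
    live = rng.uniform(low, high, size=(n_live, ndim))
    live_logl = np.array([log_likelihood(p) for p in live])

    z = 0.0        # running estimate of the evidence Z
    x_prev = 1.0   # X_0: the whole prior mass
    samples, weights = [], []
    for i in range(1, n_iter + 1):
        worst = int(np.argmin(live_logl))
        logl_min = live_logl[worst]      # log L_i
        x_i = np.exp(-i / n_live)        # X_i := exp(-i/N)
        w_i = x_prev - x_i               # w_i := X_{i-1} - X_i
        z += np.exp(logl_min) * w_i      # Z := Z + L_i * w_i
        samples.append(live[worst].copy())
        weights.append(w_i)

        # Update step: restart from another live point, then random-walk
        # under the prior, accepting only moves with likelihood above L_i.
        start = int(rng.integers(n_live - 1))
        if start >= worst:
            start += 1
        point, point_logl = live[start].copy(), live_logl[start]
        scale = live.std(axis=0)         # proposal width from the live points
        for _ in range(n_mcmc):
            proposal = point + rng.uniform(-scale, scale)
            if np.all(proposal >= low) and np.all(proposal <= high):
                proposal_logl = log_likelihood(proposal)
                if proposal_logl > logl_min:
                    point, point_logl = proposal, proposal_logl
        live[worst], live_logl[worst] = point, point_logl
        x_prev = x_i
    return z, np.array(samples), np.array(weights)


z, samples, weights = nested_sampling()
print("log Z estimate:", np.log(z), "expected about", np.log(0.01))
```

For this toy problem the likelihood integrates to almost exactly 1 over the square and the prior density is 1/100, so the evidence should be close to 0.01. The estimate is itself random: its scatter shrinks as the number of live points grows, and the random-walk update is only approximately a fresh draw from the constrained prior.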
The original procedure outlined by Skilling (given above in pseudocode) does not specify which Markov chain Monte Carlo algorithm should be used to choose new points with higher likelihood.
Skilling's own code examples (such as one in Sivia and Skilling (2006), [6] available on Skilling's website) choose a random existing point and select a nearby point at a random distance from it; if the likelihood is better, the point is accepted, otherwise it is rejected and the process is repeated. Mukherjee et al. (2006) [7] found higher acceptance rates by selecting points randomly within an ellipsoid drawn around the existing points; this idea was refined into the MultiNest algorithm, [8] which handles multimodal posteriors better by grouping points into likelihood contours and drawing an ellipsoid for each contour.
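The ellipsoidal idea can be illustrated with a single bounding ellipsoid. The sketch below is a simplified assumption-laden illustration, not the procedure of Mukherjee et al. or MultiNest: it assumes a uniform prior on a box, builds one ellipsoid from the sample covariance of the live points, enlarges it by a hypothetical factor `enlarge`, and uses plain rejection to find a point above the likelihood threshold. All function and parameter names are illustrative.

```python
import numpy as np


def fit_ellipsoid(live_points, enlarge=1.2):
    """Return (centre, chol, radius) of an ellipsoid enclosing all live points."""
    centre = live_points.mean(axis=0)
    cov = np.cov(live_points, rowvar=False)
    diff = live_points - centre
    # Squared Mahalanobis distance of every live point from the centre.
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    radius = enlarge * np.sqrt(d2.max())
    return centre, np.linalg.cholesky(cov), radius


def draw_from_ellipsoid(centre, chol, radius, rng):
    """Draw one point uniformly from inside the ellipsoid."""
    ndim = centre.size
    direction = rng.normal(size=ndim)
    direction /= np.linalg.norm(direction)   # uniform direction on the sphere
    r = rng.uniform() ** (1.0 / ndim)        # radius for a uniform draw in the unit ball
    return centre + radius * r * (chol @ direction)


def new_live_point(live_points, log_likelihood, logl_min, rng, low, high, max_tries=10_000):
    """Rejection-sample a replacement point with log-likelihood above logl_min."""
    centre, chol, radius = fit_ellipsoid(live_points)
    for _ in range(max_tries):
        candidate = draw_from_ellipsoid(centre, chol, radius, rng)
        inside_prior = np.all(candidate >= low) and np.all(candidate <= high)
        if inside_prior and log_likelihood(candidate) > logl_min:
            return candidate
    raise RuntimeError("no acceptable point found within max_tries")


# Usage with made-up live points and a toy Gaussian log-likelihood.
rng = np.random.default_rng(1)
live_points = rng.normal(size=(100, 2))
toy_logl = lambda theta: -0.5 * np.sum(theta ** 2)
logl_min = min(toy_logl(p) for p in live_points)
point = new_live_point(live_points, toy_logl, logl_min, rng, low=-5.0, high=5.0)
print(point, toy_logl(point) > logl_min)
```

A single ellipsoid fits a unimodal, roughly elliptical likelihood contour well. For well-separated modes it would enclose a large region of low likelihood, which is the situation the per-contour ellipsoids mentioned above are meant to address.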
Example implementations demonstrating the nested sampling algorithm are publicly available for download, written in several programming languages.
Since nested sampling was proposed in 2004, it has been used in many areas of astronomy. One paper suggested using nested sampling for cosmological model selection and object detection, as it "uniquely combines accuracy, general applicability and computational feasibility." [7] A refinement of the algorithm to handle multimodal posteriors has been suggested as a means to detect astronomical objects in extant datasets. [10] Nested sampling has also been applied to finite element model updating, where the algorithm is used to choose an optimal finite element model, an approach that has been applied to structural dynamics. [12] The method has also been used in materials modeling, where it can be used to learn the partition function from statistical mechanics and derive thermodynamic properties. [13]
Dynamic nested sampling is a generalisation of the nested sampling algorithm in which the number of samples taken in different regions of the parameter space is dynamically adjusted to maximise calculation accuracy. [14] This can lead to large improvements in accuracy and computational efficiency when compared to the original nested sampling algorithm, in which the allocation of samples cannot be changed and often many samples are taken in regions which have little effect on calculation accuracy.
Several dynamic nested sampling software packages are publicly available.
Dynamic nested sampling has been applied to a variety of scientific problems, including analysis of gravitational waves, [17] mapping distances in space [18] and exoplanet detection. [19]