In numerical analysis and computational statistics, rejection sampling is a basic technique used to generate observations from a distribution. It is also commonly called the acceptance-rejection method or "accept-reject algorithm" and is a type of exact simulation method. The method works for any distribution in with a density.
Rejection sampling is based on the observation that to sample a random variable in one dimension, one can perform a uniformly random sampling of the two-dimensional Cartesian graph, and keep the samples in the region under the graph of its density function. [1] [2] [3] Note that this property can be extended to N-dimension functions.
To visualize the motivation behind rejection sampling, imagine graphing the probability density function (PDF) of a random variable onto a large rectangular board and throwing darts at it. Assume that the darts are uniformly distributed around the board. Now remove all of the darts that are outside the area under the curve. The remaining darts will be distributed uniformly within the area under the curve, and the ‑positions of these darts will be distributed according to the random variable's density. This is because there is the most room for the darts to land where the curve is highest and thus the probability density is greatest.
The visualization just described is equivalent to a particular form of rejection sampling where the "proposal distribution" is uniform. Hence its graph is a rectangle. The general form of rejection sampling assumes that the board is not necessarily rectangular but is shaped according to the density of some proposal distribution (not necessarily normalized to ) that we know how to sample from (for example, using inversion sampling). Its shape must be at least as high at every point as the distribution we want to sample from, so that the former completely encloses the latter. Otherwise, there would be parts of the curved area we want to sample from that could never be reached.
Rejection sampling works as follows:
This algorithm can be used to sample from the area under any curve, regardless of whether the function integrates to 1. In fact, scaling a function by a constant has no effect on the sampled ‑positions. Thus, the algorithm can be used to sample from a distribution whose normalizing constant is unknown, which is common in computational statistics.
The rejection sampling method generates sampling values from a target distribution with an arbitrary probability density function by using a proposal distribution with probability density . The idea is that one can generate a sample value from by instead sampling from and accepting the sample from with probability , repeating the draws from until a value is accepted. here is a constant, finite bound on the likelihood ratio , satisfying over the support of ; in other words, M must satisfy for all values of . Note that this requires that the support of must include the support of —in other words, whenever .
The validation of this method is the envelope principle: when simulating the pair , one produces a uniform simulation over the subgraph of . Accepting only pairs such that then produces pairs uniformly distributed over the subgraph of and thus, marginally, a simulation from
This means that, with enough replicates, the algorithm generates a sample from the desired distribution . There are a number of extensions to this algorithm, such as the Metropolis algorithm.
This method relates to the general field of Monte Carlo techniques, including Markov chain Monte Carlo algorithms that also use a proxy distribution to achieve simulation from the target distribution . It forms the basis for algorithms such as the Metropolis algorithm.
The unconditional acceptance probability is the proportion of proposed samples which are accepted, which is where , and the value of each time is generated under the density function of the proposal distribution .
The number of samples required from to obtain an accepted value thus follows a geometric distribution with probability , which has mean . Intuitively, is the expected number of the iterations that are needed, as a measure of the computational complexity of the algorithm.
Rewrite the above equation, Note that , due to the above formula, where is a probability which can only take values in the interval . When is chosen closer to one, the unconditional acceptance probability is higher the less that ratio varies, since is the upper bound for the likelihood ratio . In practice, a value of closer to 1 is preferred as it implies fewer rejected samples, on average, and thus fewer iterations of the algorithm. In this sense, one prefers to have as small as possible (while still satisfying , which suggests that should generally resemble in some way. Note, however, that cannot be equal to 1: such would imply that , i.e. that the target and proposal distributions are actually the same distribution.
Rejection sampling is most often used in cases where the form of makes sampling difficult. A single iteration of the rejection algorithm requires sampling from the proposal distribution, drawing from a uniform distribution, and evaluating the expression. Rejection sampling is thus more efficient than some other method whenever M times the cost of these operations—which is the expected cost of obtaining a sample with rejection sampling—is lower than the cost of obtaining a sample using the other method.
The algorithm, which was used by John von Neumann [4] and dates back to Buffon and his needle, [5] obtains a sample from distribution with density using samples from distribution with density as follows:
The algorithm will take an average of iterations to obtain a sample. [6]
Rejection sampling can be far more efficient compared with the naive methods in some situations. For example, given a problem as sampling conditionally on given the set , i.e., , sometimes can be easily simulated, using the naive methods (e.g. by inverse transform sampling):
The problem is this sampling can be difficult and inefficient, if . The expected number of iterations would be , which could be close to infinity. Moreover, even when you apply the Rejection sampling method, it is always hard to optimize the bound for the likelihood ratio. More often than not, is large and the rejection rate is high, the algorithm can be very inefficient. The Natural Exponential Family (if it exists), also known as exponential tilting, provides a class of proposal distributions that can lower the computation complexity, the value of and speed up the computations (see examples: working with Natural Exponential Families).
Given a random variable , is the target distribution. Assume for simplicity, the density function can be explicitly written as . Choose the proposal as
where and . Clearly, , is from a natural exponential family. Moreover, the likelihood ratio is
Note that implies that it is indeed a cumulant-generation function, that is,
It is easy to derive the cumulant-generation function of the proposal and therefore the proposal's cumulants.
As a simple example, suppose under , , with . The goal is to sample , where . The analysis goes as follows:
holds, accept the value of ; if not, continue sampling new and new until acceptance.
For the above example, as the measurement of the efficiency, the expected number of the iterations the natural exponential family based rejection sampling method is of order , that is , while under the naive method, the expected number of the iterations is , which is far more inefficient.
In general, exponential tilting a parametric class of proposal distribution, solves the optimization problems conveniently, with its useful properties that directly characterize the distribution of the proposal. For this type of problem, to simulate conditionally on , among the class of simple distributions, the trick is to use natural exponential family, which helps to gain some control over the complexity and considerably speed up the computation. Indeed, there are deep mathematical reasons for using natural exponential family.
Rejection sampling requires knowing the target distribution (specifically, ability to evaluate target PDF at any point).
Rejection sampling can lead to a lot of unwanted samples being taken if the function being sampled is highly concentrated in a certain region, for example a function that has a spike at some location. For many distributions, this problem can be solved using an adaptive extension (see adaptive rejection sampling), or with an appropriate change of variables with the method of the ratio of uniforms. In addition, as the dimensions of the problem get larger, the ratio of the embedded volume to the "corners" of the embedding volume tends towards zero, thus a lot of rejections can take place before a useful sample is generated, thus making the algorithm inefficient and impractical. See curse of dimensionality. In high dimensions, it is necessary to use a different approach, typically a Markov chain Monte Carlo method such as Metropolis sampling or Gibbs sampling. (However, Gibbs sampling, which breaks down a multi-dimensional sampling problem into a series of low-dimensional samples, may use rejection sampling as one of its steps.)
For many distributions, finding a proposal distribution that includes the given distribution without a lot of wasted space is difficult. An extension of rejection sampling that can be used to overcome this difficulty and efficiently sample from a wide variety of distributions (provided that they have log-concave density functions, which is in fact the case for most of the common distributions—even those whose density functions are not concave themselves) is known as adaptive rejection sampling (ARS).
There are three basic ideas to this technique as ultimately introduced by Gilks in 1992: [7]
The method essentially involves successively determining an envelope of straight-line segments that approximates the logarithm better and better while still remaining above the curve, starting with a fixed number of segments (possibly just a single tangent line). Sampling from a truncated exponential random variable is straightforward. Just take the log of a uniform random variable (with appropriate interval and corresponding truncation).
Unfortunately, ARS can only be applied for sampling from log-concave target densities. For this reason, several extensions of ARS have been proposed in literature for tackling non-log-concave target distributions. [9] [10] [11] Furthermore, different combinations of ARS and the Metropolis-Hastings method have been designed in order to obtain a universal sampler that builds a self-tuning proposal densities (i.e., a proposal automatically constructed and adapted to the target). This class of methods are often called as Adaptive Rejection Metropolis Sampling (ARMS) algorithms. [12] [13] The resulting adaptive techniques can be always applied but the generated samples are correlated in this case (although the correlation vanishes quickly to zero as the number of iterations grows).
In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is The parameter is the mean or expectation of the distribution, while the parameter is the variance. The standard deviation of the distribution is . A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate.
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.
In statistics, sufficiency is a property of a statistic computed on a sample dataset in relation to a parametric model of the dataset. A sufficient statistic contains all of the information that the dataset provides about the model parameters. It is closely related to the concepts of an ancillary statistic which contains no information about the model parameters, and of a complete statistic which only contains information about the parameters and no ancillary information.
In mechanics and geometry, the 3D rotation group, often denoted SO(3), is the group of all rotations about the origin of three-dimensional Euclidean space under the operation of composition.
In probability theory and statistics, the gamma distribution is a versatile two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are two equivalent parameterizations in common use:
In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the base form
In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.
In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinct because they all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.
In information geometry, the Fisher information metric is a particular Riemannian metric which can be defined on a smooth statistical manifold, i.e., a smooth manifold whose points are probability measures defined on a common probability space. It can be used to calculate the informational difference between measurements.
In estimation theory and statistics, the Cramér–Rao bound (CRB) relates to estimation of a deterministic parameter. The result is named in honor of Harald Cramér and C. R. Rao, but has also been derived independently by Maurice Fréchet, Georges Darmois, and by Alexander Aitken and Harold Silverstone. It is also known as Fréchet-Cramér–Rao or Fréchet-Darmois-Cramér-Rao lower bound. It states that the precision of any unbiased estimator is at most the Fisher information; or (equivalently) the reciprocal of the Fisher information is a lower bound on its variance.
In probability theory, the Rice distribution or Rician distribution is the probability distribution of the magnitude of a circularly-symmetric bivariate normal random variable, possibly with non-zero mean (noncentral). It was named after Stephen O. Rice (1907–1986).
In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. However, M-estimators are not inherently robust, as is clear from the fact that they include maximum likelihood estimators, which are in general not robust. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation.
In directional statistics, the von Mises–Fisher distribution, is a probability distribution on the -sphere in . If the distribution reduces to the von Mises distribution on the circle.
In statistics, the multivariate t-distribution is a multivariate probability distribution. It is a generalization to random vectors of the Student's t-distribution, which is a distribution applicable to univariate random variables. While the case of a random matrix could be treated within this structure, the matrix t-distribution is distinct and makes particular use of the matrix structure.
Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation of the current parental individuals, usually in a stochastic way. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value . Like this, individuals with better and better -values are generated over the generation sequence.
A ratio distribution is a probability distribution constructed as the distribution of the ratio of random variables having two other known distributions. Given two random variables X and Y, the distribution of the random variable Z that is formed as the ratio Z = X/Y is a ratio distribution.
In probability theory and statistics, the half-normal distribution is a special case of the folded normal distribution.
In probability theory and statistics, the normal-exponential-gamma distribution is a three-parameter family of continuous probability distributions. It has a location parameter , scale parameter and a shape parameter .
Exponential Tilting (ET), Exponential Twisting, or Exponential Change of Measure (ECM) is a distribution shifting technique used in many parts of mathematics. The different exponential tiltings of a random variable is known as the natural exponential family of .
A Stein discrepancy is a statistical divergence between two probability measures that is rooted in Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers, but has since been used in diverse settings in statistics, machine learning and computer science.