Slice sampling

Last updated May 03, 2023

Slice sampling is a type of Markov chain Monte Carlo algorithm for pseudo-random number sampling, i.e. for drawing random samples from a statistical distribution. The method is based on the observation that to sample a random variable one can sample uniformly from the region under the graph of its density function.^[1]^[2]^[3]

Motivation

Suppose you want to sample a random variable X with density f(x), and picture the graph of f(x). The height of f(x) corresponds to the likelihood at that point.

If you were to uniformly sample X, each value would have the same likelihood of being sampled, and your distribution would be of the form f(x) = y for some y value instead of some non-uniform function f(x). Instead of the original black line, your new distribution would look more like the blue line.

In order to sample X in a manner which will retain the distribution f(x), some sampling technique must be used which takes into account the varied likelihoods for each range of f(x).

Method

Slice sampling, in its simplest form, samples uniformly from underneath the curve f(x) without the need to reject any points, as follows:

1. Choose a starting value x₀ for which f(x₀) > 0.
2. Sample a y value uniformly between 0 and f(x₀).
3. Draw a horizontal line across the curve at this y position.
4. Sample a point (x, y) from the line segments within the curve.
5. Repeat from step 2 using the new x value.

The motivation here is that one way to sample a point uniformly from within an arbitrary curve is first to draw thin, uniform-height horizontal slices across the whole curve. We can then sample a point within the curve by randomly selecting a slice that falls at or below the curve at the x-position from the previous iteration, and then randomly picking an x-position somewhere along that slice. By using the x-position from the previous iteration of the algorithm, in the long run we select slices with probabilities proportional to the lengths of their segments within the curve.

The most difficult part of this algorithm is finding the bounds of the horizontal slice, which involves inverting the function describing the distribution being sampled from. This is especially problematic for multimodal distributions, where the slice may consist of several disconnected parts. It is often possible to use a form of rejection sampling to overcome this: we sample from a larger slice that is known to include the desired slice, and then discard points that fall outside the desired slice.

This algorithm can be used to sample from the area under any curve, regardless of whether the function integrates to 1. In fact, scaling a function by a positive constant has no effect on the sampled x-positions. This means that the algorithm can be used to sample from a distribution whose probability density function is only known up to a constant (i.e. whose normalizing constant is unknown), which is common in computational statistics.
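The five steps above can be sketched in Python for a case where the slice is known in closed form. This is a minimal illustration, not a general implementation: the target f(x) = scale · exp(−x) on x ≥ 0, the function name `slice_sample_exponential` and the `scale` parameter are assumptions chosen for this example. For this target the slice {x : f(x) ≥ y} is the interval [0, −ln(y / scale)], so no rejection is needed.

```python
import math
import random

def slice_sample_exponential(n, x0=1.0, scale=1.0, seed=0):
    """Slice-sample from f(x) = scale * exp(-x) on x >= 0 (illustrative target)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n):
        fx = scale * math.exp(-x)
        # Step 2: height drawn uniformly from (0, f(x)]; 1 - random() avoids y == 0.
        y = (1.0 - rng.random()) * fx
        # Steps 3-4: the slice is the interval [0, upper]; draw x uniformly from it.
        upper = -math.log(y / scale)
        x = rng.uniform(0.0, upper)
        samples.append(x)
    return samples

samples = slice_sample_exponential(50_000)
mean = sum(samples) / len(samples)          # should be close to 1 for this target

# Rescaling f leaves the sampled x-positions unchanged (up to rounding).
rescaled = slice_sample_exponential(50_000, scale=7.0)
max_gap = max(abs(a - b) for a, b in zip(samples, rescaled))
```

Running the sampler with the same seed and a different `scale` gives the same x-values up to floating-point rounding, which illustrates why the normalizing constant is not needed.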

Implementation

Slice sampling gets its name from the first step: defining a slice by sampling an auxiliary variable $Y$. This variable is sampled uniformly from $[0,f(x)]$, where $f(x)$ is either the probability density function (PDF) of X or is at least proportional to its PDF. This defines a slice of X where $f(x)\geq Y$. In other words, we are now looking at a region of X where the probability density is at least $Y$. Then the next value of X is sampled uniformly from this slice. A new value of $Y$ is sampled, then X, and so on. This can be visualized as alternately sampling the y-position and then the x-position of points under the PDF, so that the Xs are from the desired distribution. The $Y$ values have no particular consequences or interpretations outside of their usefulness for the procedure.

If both the PDF and its inverse are available, and the distribution is unimodal, then finding the slice and sampling from it are simple. If not, a stepping-out procedure can be used to find a region whose endpoints fall outside the slice. Then, a sample can be drawn from the slice using rejection sampling. Various procedures for this are described in detail by Radford M. Neal.^[2]

Compared to other methods

Slice sampling is a Markov chain method and as such serves the same purpose as Gibbs sampling and Metropolis. Unlike Metropolis, there is no need to manually tune the candidate function or candidate standard deviation.

Recall that Metropolis is sensitive to step size. If the step size is too small, random-walk behavior causes slow decorrelation. If the step size is too large, there is great inefficiency due to a high rejection rate.
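The sensitivity of Metropolis to step size can be illustrated with a small experiment. The sketch below is an assumption-laden illustration, not part of the slice sampling method: it runs a random-walk Metropolis sampler with a Gaussian proposal on a standard normal target, and the function name `metropolis_acceptance_rate` and the three step sizes are hypothetical choices for this example.

```python
import math
import random

def log_target(x):
    """Log of a standard normal density, up to an additive constant."""
    return -0.5 * x * x

def metropolis_acceptance_rate(step, n=20_000, seed=0):
    """Run random-walk Metropolis and return the fraction of accepted proposals."""
    rng = random.Random(seed)
    x = 0.0
    accepted = 0
    for _ in range(n):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, f(proposal) / f(x)), compared in log space.
        if math.log(1.0 - rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
            accepted += 1
    return accepted / n

rates = {step: metropolis_acceptance_rate(step) for step in (0.05, 2.5, 50.0)}
for step, rate in rates.items():
    print(f"step={step}: acceptance rate {rate:.3f}")
```

A very small step is accepted almost always but moves the chain only slightly, while a very large step is rejected most of the time. Neither extreme explores the target efficiently.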

In contrast to Metropolis, slice sampling automatically adjusts the step size to match the local shape of the density function. Implementation is arguably easier and more efficient than Gibbs sampling or simple Metropolis updates.

Note that, in contrast to many available methods for generating random numbers from non-uniform distributions, random variates generated directly by this approach exhibit serial statistical dependence; successive samples are not drawn independently of one another. This is because, to draw the next sample, we define the slice based on the value of f(x) for the current sample. However, the generated samples form a Markov chain whose stationary distribution is the target, so they are expected to converge to the correct distribution in the long run.

Slice sampling requires that the distribution to be sampled be evaluable. One way to relax this requirement is to substitute an evaluable function that is proportional to the true, unevaluable density.

Univariate case

To sample a random variable X with density f(x) we introduce an auxiliary variable Y and iterate as follows:

Given a sample x, we choose y uniformly at random from the interval [0, f(x)].
Given y, we choose x uniformly at random from the set $f^{-1}[y,+\infty )$, that is, from the points where f(x) ≥ y.
The samples of x are obtained by ignoring the y values.

Our auxiliary variable Y represents a horizontal "slice" of the distribution. The rest of each iteration is dedicated to sampling an x value from the slice which is representative of the density of the region being considered.
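The reason that discarding the y values leaves correctly distributed x values can be written out briefly. The following is a sketch of the standard argument, with Z denoting the (possibly unknown) normalizing constant of f; the symbol Z is introduced here for illustration.

```latex
% Joint density: uniform over the region under the curve
p(x, y) =
\begin{cases}
  1/Z, & 0 \le y \le f(x), \\
  0,   & \text{otherwise},
\end{cases}
\qquad Z = \int f(x)\,dx .

% Marginal of x: integrate out the auxiliary variable
p(x) = \int_{0}^{f(x)} \frac{1}{Z}\,dy = \frac{f(x)}{Z} .

% The two conditionals used by the iteration
y \mid x \sim \mathrm{Uniform}\,[0, f(x)],
\qquad
x \mid y \sim \mathrm{Uniform}\,\{x : f(x) \ge y\} .
```

The two steps of the iteration are therefore the two conditional draws of a Gibbs sampler on the pair (x, y), and the marginal of x is the normalized target density.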

In practice, sampling from a horizontal slice of a multimodal distribution is difficult. There is a tension between obtaining a large sampling region and thereby making possible large moves in the distribution space, and obtaining a simpler sampling region to increase efficiency. One option for simplifying this process is regional expansion and contraction.

First, a width parameter w is used to define the area containing the given x value. Each endpoint of this area is tested to see if it lies outside the given slice. If not, the region is extended in the appropriate direction(s) by w until both endpoints lie outside the slice.
A candidate sample is selected uniformly from within this region. If the candidate sample lies inside the slice, it is accepted as the new sample. If it lies outside the slice, the candidate point becomes the new boundary for the region on that side, and a new candidate sample is drawn uniformly. The process repeats until the candidate sample is within the slice. (See diagram for a visual example.)
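The expansion and contraction procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `slice_step`, the unnormalized normal target with standard deviation 3, and the seed are choices made for this example. The initial interval is placed at a random offset around x, as in the stepping-out procedure described by Neal, rather than centered on x.

```python
import math
import random

def slice_step(f, x, w, rng):
    """One slice-sampling update of x using stepping out and shrinkage."""
    # Auxiliary height, uniform on [0, f(x)).
    y = rng.random() * f(x)
    # Place an interval of width w at a random position containing x.
    left = x - w * rng.random()
    right = left + w
    # Stepping out: extend each end by w until it lies outside the slice.
    while f(left) > y:
        left -= w
    while f(right) > y:
        right += w
    # Shrinkage: draw candidates, pulling the bounds in after each rejection.
    while True:
        candidate = rng.uniform(left, right)
        if f(candidate) > y:
            return candidate
        if candidate < x:
            left = candidate
        else:
            right = candidate

def unnormalized_normal(x, sigma=3.0):
    """Density of N(0, sigma^2) without its normalizing constant."""
    return math.exp(-x * x / (2.0 * sigma * sigma))

rng = random.Random(1)
x = 2.0
samples = []
for _ in range(20_000):
    x = slice_step(unnormalized_normal, x, 2.0, rng)
    samples.append(x)

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Because a rejected candidate always replaces the bound on its own side of the current x, the current point stays inside the interval and the shrinkage loop terminates.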

Slice-within-Gibbs sampling

In a Gibbs sampler, one needs to draw efficiently from all the full-conditional distributions. When sampling from a full-conditional density is not easy, a single iteration of slice sampling or the Metropolis-Hastings algorithm can be used within-Gibbs to sample from the variable in question. If the full-conditional density is log-concave, a more efficient alternative is the application of adaptive rejection sampling (ARS) methods.^[4]^[5] When the ARS techniques cannot be applied (since the full-conditional is non-log-concave), the adaptive rejection Metropolis sampling algorithms are often employed.^[6]^[7]

Multivariate methods

Treating each variable independently

Single-variable slice sampling can be used in the multivariate case by sampling each variable in turn repeatedly, as in Gibbs sampling. Doing so requires that we can compute, for each component $x_{i}$, a function that is proportional to $p(x_{i}|x_{0},\ldots ,x_{i-1},x_{i+1},\ldots ,x_{n})$, the conditional density of $x_{i}$ given all the other components.
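A coordinate-wise scheme of this kind might look like the sketch below. Everything named here is an assumption for illustration: the helper `slice_step_1d` (a univariate update with stepping out and shrinkage), the function `coordinatewise_slice`, and the bivariate normal target with correlation `RHO`. Note that the joint density, viewed as a function of one coordinate with the others held fixed, is proportional to the required conditional, so no separate conditional needs to be derived.

```python
import math
import random

def slice_step_1d(g, x, w, rng):
    """Univariate slice update with stepping out and shrinkage."""
    y = rng.random() * g(x)
    left = x - w * rng.random()
    right = left + w
    while g(left) > y:
        left -= w
    while g(right) > y:
        right += w
    while True:
        candidate = rng.uniform(left, right)
        if g(candidate) > y:
            return candidate
        if candidate < x:
            left = candidate
        else:
            right = candidate

def coordinatewise_slice(f, x0, w, n, seed=0):
    """Update each coordinate in turn, holding the others fixed."""
    rng = random.Random(seed)
    x = list(x0)
    draws = []
    for _ in range(n):
        for i in range(len(x)):
            # f as a function of coordinate i only; proportional to the conditional.
            def conditional(v, i=i):
                return f(x[:i] + [v] + x[i + 1:])
            x[i] = slice_step_1d(conditional, x[i], w, rng)
        draws.append(tuple(x))
    return draws

RHO = 0.8  # illustrative correlation of the bivariate normal target

def target(p):
    """Unnormalized bivariate normal with unit variances and correlation RHO."""
    a, b = p
    return math.exp(-(a * a - 2.0 * RHO * a * b + b * b) / (2.0 * (1.0 - RHO ** 2)))

draws = coordinatewise_slice(target, [0.0, 0.0], w=1.0, n=20_000)
```

With a strongly correlated target such as this one, axis-aligned updates move slowly along the diagonal, which is the random-walk behavior that the methods discussed next try to reduce.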

To prevent random walk behavior, overrelaxation methods can be used to update each variable in turn. Overrelaxation chooses a new value on the opposite side of the mode from the current value, as opposed to choosing a new independent value from the distribution as is done in Gibbs sampling.

Hyperrectangle slice sampling

This method adapts the univariate algorithm to the multivariate case by substituting a hyperrectangle for the one-dimensional w region used in the original. The hyperrectangle H is initialized to a random position over the slice. H is then shrunk as points from it are rejected.
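A minimal sketch of the hyperrectangle procedure is given below, assuming a fixed side length per dimension and no stepping out. The function name `hyperrectangle_slice_step`, the side lengths in `w`, and the independent bivariate normal target are assumptions made for this example.

```python
import math
import random

def hyperrectangle_slice_step(f, x, w, rng):
    """One multivariate slice update using a randomly placed, shrinking box."""
    # Auxiliary height, uniform on [0, f(x)).
    y = rng.random() * f(x)
    # Randomly position a box with side lengths w so that it contains x.
    lower = [xi - wi * rng.random() for xi, wi in zip(x, w)]
    upper = [lo + wi for lo, wi in zip(lower, w)]
    while True:
        candidate = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        if f(candidate) > y:
            return candidate
        # Rejected: shrink the box toward x along every coordinate.
        for i, (ci, xi) in enumerate(zip(candidate, x)):
            if ci < xi:
                lower[i] = ci
            else:
                upper[i] = ci

def target(p):
    """Unnormalized density of two independent standard normals."""
    return math.exp(-0.5 * sum(v * v for v in p))

rng = random.Random(2)
x = [0.0, 0.0]
draws = []
for _ in range(20_000):
    x = hyperrectangle_slice_step(target, x, [3.0, 3.0], rng)
    draws.append(tuple(x))
```

Shrinking along every coordinate after each rejection keeps the current point inside the box, so the loop always terminates, at the cost of possibly shrinking dimensions that were not responsible for the rejection.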

Reflective slice sampling

Reflective slice sampling is a technique to suppress random walk behavior in which the successive candidate samples of distribution f(x) are kept within the bounds of the slice by "reflecting" the direction of sampling inward toward the slice once the boundary has been hit.

In this graphical representation of reflective sampling, the shape indicates the bounds of a sampling slice. The dots indicate start and stopping points of a sampling walk. When the samples hit the bounds of the slice, the direction of sampling is "reflected" back into the slice.

Example

Consider a single-variable example. Suppose our true distribution is a normal distribution with mean 0 and standard deviation 3, $g(x)\sim N(0,3^{2})$. So: $f(x)={\frac {1}{\sqrt {2\pi \cdot 3^{2}}}}\ e^{-{\frac {(x-0)^{2}}{2\cdot 3^{2}}}}$. The peak of the distribution is at $x=0$, where $f(0)\approx 0.1330$.

We first draw a uniform random value y from the range of f(x) in order to define our slice(s). f(x) ranges from 0 to ~0.1330, so any value between these two extremes suffices. Suppose we take y = 0.1. The problem becomes how to sample points x for which f(x) > 0.1.
Next, we set our width parameter w, which we will use to expand our region of consideration. This value is arbitrary. Suppose w = 2.
Next, we need an initial value for x. We draw x from the uniform distribution over the part of the domain of f(x) that satisfies f(x) > 0.1 (our y parameter). Suppose x = 2. This works because f(2) = ~0.1065 > 0.1.^[8]
Because x = 2 and w = 2, our current region of interest, of width w centered on x, is bounded by (1, 3).
Now, each endpoint of this area is tested to see if it lies outside the given slice. Our right bound lies outside our slice (f(3) = ~0.0807 < 0.1), but the left bound does not (f(1) = ~0.1258 > 0.1). We expand the left bound by subtracting w from it until it extends past the limit of the slice. After this process, the new bounds of our region of interest are (−3, 3).
Next, we take a uniform sample within (−3, 3). Suppose this sample yields x = −2.9. Though this sample is within our region of interest, it does not lie within our slice (f(−2.9) = ~0.08334 < 0.1), so we move the left bound of our region of interest to this point. Now we take a uniform sample from (−2.9, 3). Suppose this time our sample yields x = 1, which is within our slice, and thus is the accepted sample output by slice sampling. Had our new x not been within our slice, we would continue the shrinking and resampling process until a valid x within the slice is found.

If we're interested in the peak of the distribution, we can keep repeating this process since the new point corresponds to a higher f(x) than the original point.
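The arithmetic of this example can be checked with a short script. This is only a sketch that replays the specific values chosen in the text (y = 0.1, w = 2, x = 2, and the candidates −2.9 and 1) rather than drawing them at random; the variable names are illustrative.

```python
import math

def f(x, sigma=3.0):
    """Density of the normal distribution with mean 0 and standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

y, w, x = 0.1, 2.0, 2.0

# Initial region of width w centered on x, as in the example: (1, 3).
left, right = x - w / 2.0, x + w / 2.0

# Stepping out: move each bound outward by w until it lies outside the slice.
while f(left) > y:
    left -= w
while f(right) > y:
    right += w
stepped_out = (left, right)

# Shrinkage, replaying the two candidates used in the example.
accepted = None
for candidate in (-2.9, 1.0):
    if f(candidate) > y:
        accepted = candidate
        break
    if candidate < x:
        left = candidate
    else:
        right = candidate

print("region after stepping out:", stepped_out)
print("region after shrinkage:", (left, right))
print("accepted sample:", accepted)
```

The script steps the left bound from 1 to −1 to −3, rejects the candidate −2.9, shrinks the region to (−2.9, 3), and then accepts x = 1.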

Another example

To sample from the normal distribution $N(0,1)$ we first choose an initial x, say 0. After each sample of x we choose y uniformly at random from $(0,e^{-x^{2}/2}/{\sqrt {2\pi }}]$, which is bounded by the pdf of $N(0,1)$. After each y sample we choose x uniformly at random from $[-\alpha ,\alpha ]$, where $\alpha ={\sqrt {-2\ln(y{\sqrt {2\pi }})}}$. This is the slice where $f(x)>y$.

An implementation in the Macsyma language is:

slice(x) := block([y, alpha],
  /* height in [0, f(x)) */
  y: random(exp(-x^2/2.0) / sqrt(2.0*dfloat(%pi))),
  /* half-width of the slice */
  alpha: sqrt(-2.0*ln(y*sqrt(2.0*dfloat(%pi)))),
  /* uniform on [-alpha, alpha] */
  x: signum(random()) * random(alpha)
);
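For readers without access to Macsyma, the same update can be sketched in Python. This is an illustrative port rather than a line-by-line translation: the name `slice_update` is hypothetical, the height is drawn from (0, f(x)] to avoid taking the logarithm of zero, and the signed draw is done with a single uniform sample on [−α, α].

```python
import math
import random

SQRT_2PI = math.sqrt(2.0 * math.pi)

def slice_update(x, rng):
    """One slice-sampling update for the standard normal distribution."""
    # Height drawn uniformly from (0, f(x)], where f is the N(0, 1) pdf.
    y = (1.0 - rng.random()) * math.exp(-x * x / 2.0) / SQRT_2PI
    # Half-width of the slice {x : f(x) > y}; max() guards against rounding
    # pushing the argument of sqrt slightly below zero.
    alpha = math.sqrt(max(0.0, -2.0 * math.log(y * SQRT_2PI)))
    # New x drawn uniformly from the slice.
    return rng.uniform(-alpha, alpha)

rng = random.Random(3)
x = 0.0
samples = []
for _ in range(50_000):
    x = slice_update(x, rng)
    samples.append(x)

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance should be close to 0 and 1 respectively, although successive samples remain serially dependent through the slice width.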

References

[1] Damien, P.; Wakefield, J.; Walker, S. (1999). "Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables". Journal of the Royal Statistical Society, Series B (Statistical Methodology). 61 (2): 331–344.
[2] Neal, Radford M. (2003). "Slice Sampling". Annals of Statistics. 31 (3): 705–767. doi:10.1214/aos/1056562461. MR 1994729. Zbl 1051.65007.
[3] Bishop, Christopher (2006). "11.4: Slice sampling". Pattern Recognition and Machine Learning. Springer. ISBN 978-0387310732.
[4] Gilks, W. R.; Wild, P. (1992-01-01). "Adaptive Rejection Sampling for Gibbs Sampling". Journal of the Royal Statistical Society. Series C (Applied Statistics). 41 (2): 337–348. doi:10.2307/2347565. JSTOR 2347565.
[5] Hörmann, Wolfgang (1995-06-01). "A Rejection Technique for Sampling from T-concave Distributions". ACM Trans. Math. Softw. 21 (2): 182–193. CiteSeerX 10.1.1.56.6055. doi:10.1145/203082.203089. ISSN 0098-3500. S2CID 592740.
[6] Gilks, W. R.; Best, N. G.; Tan, K. K. C. (1995-01-01). "Adaptive Rejection Metropolis Sampling within Gibbs Sampling". Journal of the Royal Statistical Society. Series C (Applied Statistics). 44 (4): 455–472. doi:10.2307/2986138. JSTOR 2986138.
[7] Meyer, Renate; Cai, Bo; Perron, François (2008-03-15). "Adaptive rejection Metropolis sampling using Lagrange interpolation polynomials of degree 2". Computational Statistics & Data Analysis. 52 (7): 3408–3423. doi:10.1016/j.csda.2008.01.005.
[8] Note that if we did not know how to select x such that f(x) > y, we could still pick any random value for x, evaluate f(x), and use that as our value of y. y only initializes the algorithm; as the algorithm progresses it will find higher and higher values of y.

External links

http://www.probability.ca/jeff/java/slice.html

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
