Markov chain central limit theorem

Last updated June 19, 2024

In the mathematical theory of random processes, the Markov chain central limit theorem has a conclusion somewhat similar in form to that of the classic central limit theorem (CLT) of probability theory, but the quantity in the role taken by the variance in the classic CLT has a more complicated definition. See also the general form of Bienaymé's identity.

Statement

Suppose that:

the sequence ${\textstyle X_{1},X_{2},X_{3},\ldots }$ of random elements of some set is a Markov chain that has a stationary probability distribution; and
the initial distribution of the process, i.e. the distribution of ${\textstyle X_{1}}$ , is the stationary distribution, so that ${\textstyle X_{1},X_{2},X_{3},\ldots }$ are identically distributed. In the classic central limit theorem these random variables would be assumed to be independent, but here we have only the weaker assumption that the process has the Markov property; and
${\textstyle g}$ is some (measurable) real-valued function for which ${\textstyle \operatorname {var} (g(X_{1}))<+\infty .}$

Now let^[1]^[2]^[3]

$${\begin{aligned}\mu &=\operatorname {E} (g(X_{1})),\\{\widehat {\mu }}_{n}&={\frac {1}{n}}\sum _{k=1}^{n}g(X_{k}),\\\sigma ^{2}&:=\lim _{n\to \infty }\operatorname {var} ({\sqrt {n}}{\widehat {\mu }}_{n})=\lim _{n\to \infty }n\operatorname {var} ({\widehat {\mu }}_{n})=\operatorname {var} (g(X_{1}))+2\sum _{k=1}^{\infty }\operatorname {cov} (g(X_{1}),g(X_{1+k})).\end{aligned}}$$

Then as ${\textstyle n\to \infty ,}$ we have^[4]

$${\sqrt {n}}({\widehat {\mu }}_{n}-\mu )\ {\xrightarrow {\mathcal {D}}}\ {\text{Normal}}(0,\sigma ^{2}),$$

where the decorated arrow indicates convergence in distribution.
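The variance formula can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the cited references: it assumes a hypothetical two-state chain on {0, 1} that moves 0 → 1 with probability `a` and 1 → 0 with probability `b`, started from its stationary distribution, with $g(x)=x$. For this chain the lag-$k$ covariance is $\pi_{0}\pi_{1}\lambda^{k}$ with $\lambda = 1-a-b$, so the series for $\sigma^{2}$ sums to $\pi_{0}\pi_{1}(1+\lambda)/(1-\lambda)$. The function names are made up for this example.

```python
import random

def sigma2_closed_form(a, b):
    """Asymptotic variance for g(x) = x on the two-state chain (assumed example)."""
    pi1 = a / (a + b)          # stationary probability of state 1
    pi0 = b / (a + b)          # stationary probability of state 0
    lam = 1.0 - a - b          # second eigenvalue of the transition matrix
    return pi0 * pi1 * (1.0 + lam) / (1.0 - lam)

def sigma2_series(a, b, terms=200):
    """var(g(X_1)) + 2 * sum of lag-k covariances, truncated after `terms` lags."""
    pi1 = a / (a + b)
    pi0 = b / (a + b)
    lam = 1.0 - a - b
    var = pi0 * pi1
    return var + 2.0 * sum(var * lam ** k for k in range(1, terms + 1))

def simulate_mean(a, b, n, rng):
    """Sample mean of X_1..X_n, with X_1 drawn from the stationary distribution."""
    pi1 = a / (a + b)
    x = 1 if rng.random() < pi1 else 0
    total = x
    for _ in range(n - 1):
        if x == 0:
            x = 1 if rng.random() < a else 0
        else:
            x = 0 if rng.random() < b else 1
        total += x
    return total / n

def scaled_variance(a, b, n, reps, seed=0):
    """Monte Carlo estimate of n * var(mu_hat_n) from `reps` independent runs."""
    rng = random.Random(seed)
    mu = a / (a + b)
    sq = (n * (simulate_mean(a, b, n, rng) - mu) ** 2 for _ in range(reps))
    return sum(sq) / reps

if __name__ == "__main__":
    print(sigma2_closed_form(0.2, 0.2))                 # closed form
    print(sigma2_series(0.2, 0.2))                      # truncated series
    print(scaled_variance(0.2, 0.2, n=1000, reps=400))  # simulation estimate
```

With `a = b = 0.2` the stationary variance is 0.25 while the closed form gives an asymptotic variance of about 1, which shows how positive serial correlation inflates the quantity that plays the role of the variance.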

Monte Carlo Setting

Under certain conditions, the Markov chain central limit theorem holds for functionals of Markov chains on general state spaces, and these conditions are of particular interest in Monte Carlo settings. The following is an example of its application in a Markov chain Monte Carlo (MCMC) setting:

Consider a simple hard spheres model on a grid. Suppose $X=\{1,\ldots ,n_{1}\}\times \{1,\ldots ,n_{2}\}\subseteq \mathbb {Z} ^{2}$ . A proper configuration on $X$ consists of coloring each point either black or white in such a way that no two adjacent points are white. Let $\chi$ denote the set of all proper configurations on $X$ , $N_{\chi }(n_{1},n_{2})$ be the total number of proper configurations, and $\pi$ be the uniform distribution on $\chi$ , so that each proper configuration is equally likely. Suppose our goal is to calculate the typical number of white points in a proper configuration; that is, if $W(x)$ is the number of white points in $x\in \chi$ , then we want the value of

$E_{\pi }W=\sum _{x\in \chi }{\frac {W(x)}{N_{\chi }{\bigl (}n_{1},n_{2}{\bigr )}}}$

If $n_{1}$ and $n_{2}$ are even moderately large, then we will have to resort to an approximation to $E_{\pi }W$ . Consider the following Markov chain on $\chi$ . Fix $p\in (0,1)$ and set $X_{1}=x_{1}$ , where $x_{1}\in \chi$ is an arbitrary proper configuration. Randomly choose a point $(x,y)\in X$ and independently draw $U\sim \mathrm {Uniform} (0,1)$ . If $U\leq p$ and all of the adjacent points are black, then color $(x,y)$ white, leaving all other points alone. Otherwise, color $(x,y)$ black and leave all other points alone. Call the resulting configuration $X_{2}$ . Continuing in this fashion yields a Harris ergodic Markov chain $\{X_{1},X_{2},X_{3},\ldots \}$ having $\pi$ as its invariant distribution. It is now a simple matter to estimate $E_{\pi }W$ with ${\overline {w}}_{n}=\sum _{i=1}^{n}W(X_{i})/n$ . Also, since $\chi$ is finite (albeit potentially large), it is well known that the chain will converge exponentially fast to $\pi$ , which implies that a CLT holds for ${\overline {w}}_{n}$ .
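The sampler described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the cited paper: a configuration is stored as a list of lists of booleans (`True` for white), the chain starts from the all-black configuration, and all function names are hypothetical. The sketch fixes $p=1/2$, the value at which this single-site update satisfies detailed balance with respect to the uniform distribution. A brute-force enumeration is included so that the estimate can be compared with the exact $E_{\pi }W$ on a very small grid.

```python
import itertools
import random

def neighbors(i, j, n1, n2):
    """Yield the grid points adjacent to (i, j) (up, down, left, right)."""
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        a, b = i + di, j + dj
        if 0 <= a < n1 and 0 <= b < n2:
            yield a, b

def step(config, p, rng):
    """One update: pick a point, whiten it if U <= p and all neighbors are black."""
    n1, n2 = len(config), len(config[0])
    i, j = rng.randrange(n1), rng.randrange(n2)
    u = rng.random()
    all_black = all(not config[a][b] for a, b in neighbors(i, j, n1, n2))
    config[i][j] = (u <= p) and all_black   # otherwise the point becomes black

def estimate_mean_white(n1, n2, n_steps, p=0.5, seed=0):
    """Average of W(X_i) along the chain, started from the all-black configuration."""
    rng = random.Random(seed)
    config = [[False] * n2 for _ in range(n1)]
    total = 0
    for _ in range(n_steps):
        total += sum(map(sum, config))      # W(x): number of white points
        step(config, p, rng)
    return total / n_steps

def exact_mean_white(n1, n2):
    """Exact E_pi W by enumerating all colorings (feasible only for tiny grids)."""
    sites = [(i, j) for i in range(n1) for j in range(n2)]
    count, white_total = 0, 0
    for colors in itertools.product((False, True), repeat=len(sites)):
        white = {s for s, c in zip(sites, colors) if c}
        proper = all(nb not in white
                     for (i, j) in white
                     for nb in neighbors(i, j, n1, n2))
        if proper:
            count += 1
            white_total += len(white)
    return white_total / count

if __name__ == "__main__":
    print(exact_mean_white(2, 2))               # 7 proper configurations, mean 8/7
    print(estimate_mean_white(2, 2, 20000))     # MCMC estimate of the same value
```

On a 2 × 2 grid there are 7 proper configurations and the exact mean is 8/7; for larger grids only the MCMC estimate remains practical, which is the situation the example is meant to illustrate.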

Implications

Ignoring the additional covariance terms in the variance, which stem from correlations (e.g. serial correlations in Markov chain Monte Carlo simulations), can lead to pseudoreplication, for example when computing confidence intervals for the sample mean.
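The effect can be illustrated by comparing a naive standard error, which treats the draws as independent, with one that accounts for serial correlation. The sketch below uses the method of non-overlapping batch means, which is one common choice and is not prescribed by the text above. The test chain is a hypothetical symmetric two-state chain that keeps its current state with probability `stay`, and all names are illustrative.

```python
import math
import random
import statistics

def naive_se(xs):
    """Standard error of the mean that treats the draws as independent."""
    return math.sqrt(statistics.variance(xs) / len(xs))

def batch_means_se(xs, n_batches):
    """Standard error of the mean from non-overlapping batch means.

    The chain is cut into n_batches equal batches (any remainder is dropped);
    the sample variance of the batch means, divided by n_batches, estimates
    the variance of the overall mean including the covariance terms.
    """
    b = len(xs) // n_batches
    means = [sum(xs[k * b:(k + 1) * b]) / b for k in range(n_batches)]
    return math.sqrt(statistics.variance(means) / n_batches)

def sticky_chain(n, stay=0.95, seed=0):
    """Symmetric two-state chain on {0, 1}; a uniform start is stationary."""
    rng = random.Random(seed)
    x = rng.randrange(2)
    out = []
    for _ in range(n):
        out.append(x)
        if rng.random() > stay:
            x = 1 - x
    return out

if __name__ == "__main__":
    xs = sticky_chain(20000)
    print("naive SE:      ", naive_se(xs))
    print("batch-means SE:", batch_means_se(xs, 100))
```

For a strongly correlated chain such as this one, the naive standard error is several times too small, so a confidence interval built from it would be far narrower than the data justify.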

References

[1] Jones, Galin L. On the Markov Chain Central Limit Theorem. https://arxiv.org/pdf/math/0409112.pdf
[2] Geyer, Charles J. Markov Chain Monte Carlo Lecture Notes. https://www.stat.umn.edu/geyer/f05/8931/n1998.pdf page 9
[3] Note that the equation for $\sigma ^{2}$ starts from Bienaymé's identity and then assumes that $\lim _{n\to \infty }\sum _{k=1}^{n}{\frac {(n-k)}{n}}\operatorname {cov} (g(X_{1}),g(X_{1+k}))\approx \lim _{n\to \infty }\sum _{k=1}^{n}\operatorname {cov} (g(X_{1}),g(X_{1+k}))\to \sum _{k=1}^{\infty }\operatorname {cov} (g(X_{1}),g(X_{1+k}))$ , which is the Cesàro summation; see Geyer, Markov Chain Monte Carlo Lecture Notes, https://www.stat.umn.edu/geyer/f05/8931/n1998.pdf page 9
[4] Geyer, Charles J. (2011). Introduction to Markov Chain Monte Carlo. In Handbook of Markov Chain Monte Carlo. Edited by S. P. Brooks, A. E. Gelman, G. L. Jones, and X. L. Meng. Chapman & Hall/CRC, Boca Raton, FL, Section 1.8. http://www.mcmchandbook.net/HandbookChapter1.pdf

Sources

Gordin, M. I. and Lifšic, B. A. (1978). "Central limit theorem for stationary Markov processes." Soviet Mathematics, Doklady, 19, 392–394. (English translation of Russian original).
Geyer, Charles J. (2011). "Introduction to MCMC." In Handbook of Markov Chain Monte Carlo, edited by S. P. Brooks, A. E. Gelman, G. L. Jones, and X. L. Meng. Chapman & Hall/CRC, Boca Raton, pp. 3–48.

