Sanov's theorem

Last updated July 11, 2025

In mathematics and information theory, Sanov's theorem gives a bound on the probability of observing an atypical sequence of samples from a given probability distribution. In the language of large deviations theory, Sanov's theorem identifies the rate function for large deviations of the empirical measure of a sequence of i.i.d. random variables.

Let A be a set of probability distributions over an alphabet X, and let q be an arbitrary distribution over X (where q may or may not be in A). Suppose we draw n i.i.d. samples from q, represented by the vector $x^{n}=(x_{1},x_{2},\ldots ,x_{n})$ . Then, we have the following bound on the probability that the empirical measure ${\hat {p}}_{x^{n}}$ of the samples falls within the set A:

q^{n}({\hat {p}}_{x^{n}}\in A)\leq (n+1)^{|X|}2^{-nD_{\mathrm {KL} }(p^{*}||q)}

,

where

$q^{n}$ is the joint probability distribution on $X^{n}$ , and
$p^{*}$ is the information projection of q onto A.
$D_{\mathrm {KL} }(P\|Q)$ , the KL divergence, is given by: $D_{\mathrm {KL} }(P\|Q)=\sum _{x\in {\mathcal {X}}}P(x)\log {\frac {P(x)}{Q(x)}}.$

In words, the probability of drawing an atypical distribution is bounded by a function of the KL divergence from the true distribution to the atypical one; in the case that we consider a set of possible atypical distributions, there is a dominant atypical distribution, given by the information projection.

Furthermore, if A is a closed set, then

\lim _{n\to \infty }{\frac {1}{n}}\log q^{n}({\hat {p}}_{x^{n}}\in A)=-D_{\mathrm {KL} }(p^{*}||q).

Technical statement

Define:

${\textstyle \Sigma }$ is a finite set with size ${\textstyle \geq 2}$ . Understood as “alphabet”.
${\textstyle \Delta (\Sigma )}$ is the simplex spanned by the alphabet. It is a subset of ${\textstyle \mathbb {R} ^{\Sigma }}$ .
${\textstyle L_{n}}$ is a random variable taking values in ${\textstyle \Delta (\Sigma )}$ . Take ${\textstyle n}$ samples from the distribution ${\textstyle \mu }$ , then ${\textstyle L_{n}}$ is the frequency probability vector for the sample.
${\textstyle {\mathcal {L}}_{n}}$ is the space of values that ${\textstyle L_{n}}$ can take. In other words, it is

$\{(a_{1}/n,\dots ,a_{|\Sigma |}/n):\sum _{i}a_{i}=n,a_{i}\in \mathbb {N} \}$ Then, Sanov's theorem states:^[1]

For every measurable subset ${\textstyle S\in \Delta (\Sigma )}$ , $-\inf _{\nu \in int(S)}D(\nu \|\mu )\leq \liminf _{n}{\frac {1}{n}}\ln P_{\mu }(L_{n}\in S)\leq \limsup _{n}{\frac {1}{n}}\ln P_{\mu }(L_{n}\in S)\leq -\inf _{\nu \in cl(S)}D(\nu \|\mu )$
For every open subset ${\textstyle U\in \Delta (\Sigma )}$ , $-\lim _{n}\lim _{\nu \in U\cap {\mathcal {L}}_{n}}D(\nu \|\mu )=\lim _{n}{\frac {1}{n}}\ln P_{\mu }(L_{n}\in S)=-\inf _{\nu \in U}D(\nu \|\mu )$

Here, $int(S)$ means the interior, and $cl(S)$ means the closure.

References

↑ Dembo, Amir; Zeitouni, Ofer (2010). "Large Deviations Techniques and Applications" . Stochastic Modelling and Applied Probability. 38: 16–17. doi:10.1007/978-3-642-03311-7. ISBN 978-3-642-03310-0. ISSN 0172-4568.

Sanov's theorem

Contents

Technical statement

References

Further reading