Asymptotic equipartition property

Last updated

In information theory, the asymptotic equipartition property (AEP) is a general property of the output samples of a stochastic source. It is fundamental to the concept of typical set used in theories of data compression.

Contents

Roughly speaking, the theorem states that although there are many series of results that may be produced by a random process, the one actually produced is most probably from a loosely defined set of outcomes that all have approximately the same chance of being the one actually realized. (This is a consequence of the law of large numbers and ergodic theory.) Although there are individual outcomes which have a higher probability than any outcome in this set, the vast number of outcomes in the set almost guarantees that the outcome will come from the set. One way of intuitively understanding the property is through Cramér's large deviation theorem, which states that the probability of a large deviation from mean decays exponentially with the number of samples. Such results are studied in large deviations theory; intuitively, it is the large deviations that would violate equipartition, but these are unlikely.

In the field of pseudorandom number generation, a candidate generator of undetermined quality whose output sequence lies too far outside the typical set by some statistical criteria is rejected as insufficiently random. Thus, although the typical set is loosely defined, practical notions arise concerning sufficient typicality.

Definition

Given a discrete-time stationary ergodic stochastic process on the probability space , the asymptotic equipartition property is an assertion that, almost surely, where or simply denotes the entropy rate of , which must exist for all discrete-time stationary processes including the ergodic ones. The asymptotic equipartition property is proved for finite-valued (i.e. ) stationary ergodic stochastic processes in the Shannon–McMillan–Breiman theorem using the ergodic theory and for any i.i.d. sources directly using the law of large numbers in both the discrete-valued case (where is simply the entropy of a symbol) and the continuous-valued case (where is the differential entropy instead). The definition of the asymptotic equipartition property can also be extended for certain classes of continuous-time stochastic processes for which a typical set exists for long enough observation time. The convergence is proven almost sure in all cases.

Discrete-time i.i.d. sources

Given is an i.i.d. source which may take values in the alphabet , its time series is i.i.d. with entropy . The weak law of large numbers gives the asymptotic equipartition property with convergence in probability, since the entropy is equal to the expectation of [1]

The strong law of large numbers asserts the stronger almost sure convergence, Convergence in the sense of L1 asserts an even stronger

Discrete-time finite-valued stationary ergodic sources

Consider a finite-valued sample space , i.e. , for the discrete-time stationary ergodic process defined on the probability space . The Shannon–McMillan–Breiman theorem, due to Claude Shannon, Brockway McMillan, and Leo Breiman, states that we have convergence in the sense of L1. [2] Chung Kai-lai generalized this to the case where may take value in a set of countable infinity, provided that the entropy rate is still finite. [3]

Proof sketch [3]


Non-stationary discrete-time source producing independent symbols

The assumptions of stationarity/ergodicity/identical distribution of random variables is not essential for the asymptotic equipartition property to hold. Indeed, as is quite clear intuitively, the asymptotic equipartition property requires only some form of the law of large numbers to hold, which is fairly general. However, the expression needs to be suitably generalized, and the conditions need to be formulated precisely.

We assume that the source is producing independent symbols, with possibly different output statistics at each instant. We assume that the statistics of the process are known completely, that is, the marginal distribution of the process seen at each time instant is known. The joint distribution is just the product of marginals. Then, under the condition (which can be relaxed) that for all i, for some M > 0, the following holds (AEP): where

Proof

The proof follows from a simple application of Markov's inequality (applied to second moment of .

It is obvious that the proof holds if any moment is uniformly bounded for r > 1 (again by Markov's inequality applied to r-th moment). Q.E.D.

Even this condition is not necessary, but given a non-stationary random process, it should not be difficult to test whether the asymptotic equipartition property holds using the above method.

Applications

The asymptotic equipartition property for non-stationary discrete-time independent process leads us to (among other results) the source coding theorem for non-stationary source (with independent output symbols) and noisy-channel coding theorem for non-stationary memoryless channels.

Measure-theoretic form

is a measure-preserving map on the probability space .

If is a finite or countable partition of , then its entropy is with the convention that .

We only consider partitions with finite entropy: .

If is a finite or countable partition of , then we construct a sequence of partitions by iterating the map:where is the least upper bound partition, that is, the least refined partition that refines both and :Write to be the set in where falls in. So, for example, is the -letter initial segment of the -name of .

Write to be the information (in units of nats) about we can recover, if we know which element in the partition that falls in:Similarly, the conditional information of partition , conditional on partition , about , is is the Kolmogorov-Sinai entropy In other words, by definition, we have a convergence in expectation. The SMB theorem states that when is ergodic, we have convergence in L1. [4]

Theorem (ergodic case)  If is ergodic, then converges in L1 to the constant function .

In other words,

In particular, since L1 convergence implies almost sure convergence, with probability 1.

Corollary (entropy equipartition property)  , we can partition the partition into two parts, the “good” part and the “bad” part .

The bad part is small:

The good part is almost equipartitioned according to entropy:

If is not necessarily ergodic, then the underlying probability space would be split up into multiple subsets, each invariant under . In this case, we still have L1 convergence to some function, but that function is no longer a constant function. [5]

Theorem (general case)  Let be the sigma-algebra generated by all -invariant measurable subsets of , - converges in L1 to

When is ergodic, is trivial, and so the functionsimplifies into the constant function , which by definition, equals , which equals by a proposition.

Continuous-time stationary ergodic sources

Discrete-time functions can be interpolated to continuous-time functions. If such interpolation f is measurable, we may define the continuous-time stationary process accordingly as . If the asymptotic equipartition property holds for the discrete-time process, as in the i.i.d. or finite-valued stationary ergodic cases shown above, it automatically holds for the continuous-time stationary process derived from it by some measurable interpolation. i.e. where n corresponds to the degree of freedom in time τ. nH(X)/τ and H(X) are the entropy per unit time and per degree of freedom respectively, defined by Shannon.

An important class of such continuous-time stationary process is the bandlimited stationary ergodic process with the sample space being a subset of the continuous functions. The asymptotic equipartition property holds if the process is white, in which case the time samples are i.i.d., or there exists T > 1/2W, where W is the nominal bandwidth, such that the T-spaced time samples take values in a finite set, in which case we have the discrete-time finite-valued stationary ergodic process.

Any time-invariant operations also preserves the asymptotic equipartition property, stationarity and ergodicity and we may easily turn a stationary process to non-stationary without losing the asymptotic equipartition property by nulling out a finite number of time samples in the process.

Category theory

A category theoretic definition for the equipartition property is given by Gromov. [6] Given a sequence of Cartesian powers of a measure space P, this sequence admits an asymptotically equivalent sequence HN of homogeneous measure spaces (i.e. all sets have the same measure; all morphisms are invariant under the group of automorphisms, and thus factor as a morphism to the terminal object).

The above requires a definition of asymptotic equivalence. This is given in terms of a distance function, giving how much an injective correspondence differs from an isomorphism. An injective correspondence is a partially defined map that is a bijection; that is, it is a bijection between a subset and . Then define where |S| denotes the measure of a set S. In what follows, the measure of P and Q are taken to be 1, so that the measure spaces are probability spaces. This distance is commonly known as the earth mover's distance or Wasserstein metric.

Similarly, define with taken to be the counting measure on P. Thus, this definition requires that P be a finite measure space. Finally, let

A sequence of injective correspondences are then asymptotically equivalent when

Given a homogenous space sequence HN that is asymptotically equivalent to PN, the entropy H(P) of P may be taken as

See also

Notes

  1. Cover & Thomas (1991), p. 51.
  2. Hawkins, Jane (2021). Ergodic dynamics: from basic theory to applications. Graduate texts in mathematics. Cham, Switzerland: Springer. p. 204. ISBN   978-3-030-59241-7.
  3. 1 2 Algoet, Paul H.; Cover, Thomas M. (1988). "A Sandwich Proof of the Shannon-McMillan-Breiman Theorem" (PDF). The Annals of Probability. 16 (2): 899–909. doi:10.1214/aop/1176991794.
  4. Petersen, Karl E. (1983). "6.2. The Shannon—McMillan—Breiman Theorem". Ergodic Theory. Cambridge Studies in Advanced Mathematics. Cambridge: Cambridge University Press. ISBN   978-0-521-38997-6.
  5. Pollicott, Mark; Yuri, Michiko (1998). "12.4. The Shannon-McMillan-Brieman theorem". Dynamical Systems and Ergodic Theory. London Mathematical Society Student Texts. Cambridge: Cambridge University Press. ISBN   978-0-521-57294-1.
  6. Misha Gromov, (2012) "In a Search for a Structure, Part 1: On Entropy". (See page 5, where the equipartition property is called the 'Bernoulli approximation theorem'.)

Related Research Articles

<span class="mw-page-title-main">Entropy (information theory)</span> Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable quantifies the average level of uncertainty or information associated with the variable's potential states or possible outcomes. This measures the expected amount of information needed to describe the state of the variable, considering the distribution of probabilities across all potential states. Given a discrete random variable , which takes values in the set and is distributed according to , the entropy is where denotes the sum over the variable's possible values. The choice of base for , the logarithm, varies for different applications. Base 2 gives the unit of bits, while base e gives "natural units" nat, and base 10 gives units of "dits", "bans", or "hartleys". An equivalent definition of entropy is the expected value of the self-information of a variable.

In mathematics, the prime number theorem (PNT) describes the asymptotic distribution of the prime numbers among the positive integers. It formalizes the intuitive idea that primes become less common as they become larger by precisely quantifying the rate at which this occurs. The theorem was proved independently by Jacques Hadamard and Charles Jean de la Vallée Poussin in 1896 using ideas introduced by Bernhard Riemann.

<span class="mw-page-title-main">Central limit theorem</span> Fundamental theorem in probability theory and statistics

In probability theory, the central limit theorem (CLT) states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution. This holds even if the original variables themselves are not normally distributed. There are several versions of the CLT, each applying in the context of different conditions.

A likelihood function measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the joint probability distribution of the random variable that (presumably) generated the observations. When evaluated on the actual data points, it becomes a function solely of the model parameters.

<span class="mw-page-title-main">Bernoulli process</span> Random process of binary (boolean) random variables

In probability and statistics, a Bernoulli process is a finite or infinite sequence of binary random variables, so it is a discrete-time stochastic process that takes only two values, canonically 0 and 1. The component Bernoulli variablesXi are identically distributed and independent. Prosaically, a Bernoulli process is a repeated coin flipping, possibly with an unfair coin. Every variable Xi in the sequence is associated with a Bernoulli trial or experiment. They all have the same Bernoulli distribution. Much of what can be said about the Bernoulli process can also be generalized to more than two outcomes ; this generalization is known as the Bernoulli scheme.

<span class="mw-page-title-main">Law of large numbers</span> Averages of repeated trials converge to the expected value

In probability theory, the law of large numbers (LLN) is a mathematical law that states that the average of the results obtained from a large number of independent random samples converges to the true value, if it exists. More formally, the LLN states that given a sample of independent and identically distributed values, the sample mean converges to the true mean.

In information theory, the typical set is a set of sequences whose probability is close to two raised to the negative power of the entropy of their source distribution. That this set has total probability close to one is a consequence of the asymptotic equipartition property (AEP) which is a kind of law of large numbers. The notion of typicality is only concerned with the probability of a sequence and not the actual sequence itself.

Ergodic theory is a branch of mathematics that studies statistical properties of deterministic dynamical systems; it is the study of ergodicity. In this context, "statistical properties" refers to properties which are expressed through the behavior of time averages of various functions along trajectories of dynamical systems. The notion of deterministic dynamical systems assumes that the equations determining the dynamics do not contain any random perturbations, noise, etc. Thus, the statistics with which we are concerned are properties of the dynamics.

In number theory, Aleksandr Yakovlevich Khinchin proved that for almost all real numbers x, coefficients ai of the continued fraction expansion of x have a finite geometric mean that is independent of the value of x and is known as Khinchin's constant.

<span class="mw-page-title-main">Law of the iterated logarithm</span> Mathematical theorem

In probability theory, the law of the iterated logarithm describes the magnitude of the fluctuations of a random walk. The original statement of the law of the iterated logarithm is due to A. Ya. Khinchin (1924). Another statement was given by A. N. Kolmogorov in 1929.

In numerical analysis, the order of convergence and the rate of convergence of a convergent sequence are quantities that represent how quickly the sequence approaches its limit. In the most common applications, a sequence that converges to is said to have order of convergence and rate of convergence if

In mathematics Lévy's constant occurs in an expression for the asymptotic behaviour of the denominators of the convergents of continued fractions. In 1935, the Soviet mathematician Aleksandr Khinchin showed that the denominators qn of the convergents of the continued fraction expansions of almost all real numbers satisfy

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted , is a type of statistical distance: a measure of how one reference probability distribution P is different from a second probability distribution Q. Mathematically, it is defined as

In mathematics, Borel summation is a summation method for divergent series, introduced by Émile Borel. It is particularly useful for summing divergent asymptotic series, and in some sense gives the best possible sum for such series. There are several variations of this method that are also called Borel summation, and a generalization of it called Mittag-Leffler summation.

In mathematics, the Bernoulli scheme or Bernoulli shift is a generalization of the Bernoulli process to more than two possible outcomes. Bernoulli schemes appear naturally in symbolic dynamics, and are thus important in the study of dynamical systems. Many important dynamical systems exhibit a repellor that is the product of the Cantor set and a smooth manifold, and the dynamics on the Cantor set are isomorphic to that of the Bernoulli shift. This is essentially the Markov partition. The term shift is in reference to the shift operator, which may be used to study Bernoulli schemes. The Ornstein isomorphism theorem shows that Bernoulli shifts are isomorphic when their entropy is equal.

Euclid's theorem is a fundamental statement in number theory that asserts that there are infinitely many prime numbers. It was first proven by Euclid in his work Elements. There are several proofs of the theorem.

In probability theory, the theory of large deviations concerns the asymptotic behaviour of remote tails of sequences of probability distributions. While some basic ideas of the theory can be traced to Laplace, the formalization started with insurance mathematics, namely ruin theory with Cramér and Lundberg. A unified formalization of large deviation theory was developed in 1966, in a paper by Varadhan. Large deviations theory formalizes the heuristic ideas of concentration of measures and widely generalizes the notion of convergence of probability measures.

In mathematics, ergodicity expresses the idea that a point of a moving system, either a dynamical system or a stochastic process, will eventually visit all parts of the space that the system moves in, in a uniform and random sense. This implies that the average behavior of the system can be deduced from the trajectory of a "typical" point. Equivalently, a sufficiently large collection of random samples from a process can represent the average statistical properties of the entire process. Ergodicity is a property of the system; it is a statement that the system cannot be reduced or factored into smaller components. Ergodic theory is the study of systems possessing ergodicity.

In the mathematical theory of probability, the entropy rate or source information rate is a function assigning an entropy to a stochastic process.

Maximal entropy random walk (MERW) is a popular type of biased random walk on a graph, in which transition probabilities are chosen accordingly to the principle of maximum entropy, which says that the probability distribution which best represents the current state of knowledge is the one with largest entropy. While standard random walk chooses for every vertex uniform probability distribution among its outgoing edges, locally maximizing entropy rate, MERW maximizes it globally by assuming uniform probability distribution among all paths in a given graph.

References

Journal articles

Textbooks