Typical set

Last updated May 06, 2024

In information theory, the typical set is a set of sequences whose probability is close to two raised to the negative power of the entropy of their source distribution. That this set has total probability close to one is a consequence of the asymptotic equipartition property (AEP) which is a kind of law of large numbers. The notion of typicality is only concerned with the probability of a sequence and not the actual sequence itself.

(Weakly) typical sequences (weak typicality, entropy typicality)
Properties
Example
Strongly typical sequences (strong typicality, letter typicality)
Jointly typical sequences
Applications of typicality
Typical set encoding
Typical set decoding
Universal null-hypothesis testing
Universal channel code
See also
References

This has great use in compression theory as it provides a theoretical means for compressing data, allowing us to represent any sequence Xⁿ using nH(X) bits on average, and, hence, justifying the use of entropy as a measure of information from a source.

The AEP can also be proven for a large class of stationary ergodic processes, allowing typical set to be defined in more general cases.

(Weakly) typical sequences (weak typicality, entropy typicality)

If a sequence x₁, ..., x_n is drawn from an independent identically-distributed random variable (IID) X defined over a finite alphabet ${\mathcal {X}}$ , then the typical set, A_ε⁽ⁿ⁾ $\in {\mathcal {X}}$ ⁽ⁿ⁾ is defined as those sequences which satisfy:

2^{-n(H(X)+\varepsilon )}\leqslant p(x_{1},x_{2},\dots ,x_{n})\leqslant 2^{-n(H(X)-\varepsilon )}

where

H(X)=-\sum _{x\in {\mathcal {X}}}p(x)\log _{2}p(x)

is the information entropy of X. The probability above need only be within a factor of 2^nε. Taking the logarithm on all sides and dividing by -n, this definition can be equivalently stated as

H(X)-\varepsilon \leq -{\frac {1}{n}}\log _{2}p(x_{1},x_{2},\ldots ,x_{n})\leq H(X)+\varepsilon .

For i.i.d sequence, since

p(x_{1},x_{2},\ldots ,x_{n})=\prod _{i=1}^{n}p(x_{i}),

we further have

H(X)-\varepsilon \leq -{\frac {1}{n}}\sum _{i=1}^{n}\log _{2}p(x_{i})\leq H(X)+\varepsilon .

By the law of large numbers, for sufficiently large n

-{\frac {1}{n}}\sum _{i=1}^{n}\log _{2}p(x_{i})\rightarrow H(X).

Properties

An essential characteristic of the typical set is that, if one draws a large number n of independent random samples from the distribution X, the resulting sequence (x₁, x₂, ..., x_n) is very likely to be a member of the typical set, even though the typical set comprises only a small fraction of all the possible sequences. Formally, given any $\varepsilon >0$ , one can choose n such that:

The probability of a sequence from X⁽ⁿ⁾ being drawn from A_ε⁽ⁿ⁾ is greater than 1 − ε, i.e. $Pr[x^{(n)}\in A_{\epsilon }^{(n)}]\geq 1-\varepsilon$
$\left|{A_{\varepsilon }}^{(n)}\right|\leqslant 2^{n(H(X)+\varepsilon )}$
$\left|{A_{\varepsilon }}^{(n)}\right|\geqslant (1-\varepsilon )2^{n(H(X)-\varepsilon )}$
If the distribution over ${\mathcal {X}}$ is not uniform, then the fraction of sequences that are typical is

{\frac {|A_{\epsilon }^{(n)}|}{|{\mathcal {X}}^{(n)}|}}\equiv {\frac {2^{nH(X)}}{2^{n\log _{2}|{\mathcal {X}}|}}}=2^{-n(\log _{2}|{\mathcal {X}}|-H(X))}\rightarrow 0

as n becomes very large, since

H(X)<\log _{2}|{\mathcal {X}}|,

where

|{\mathcal {X}}|

is the cardinality of

{\mathcal {X}}

.

For a general stochastic process {X(t)} with AEP, the (weakly) typical set can be defined similarly with p(x₁, x₂, ..., x_n) replaced by p(x₀^τ) (i.e. the probability of the sample limited to the time interval [0, τ]), n being the degree of freedom of the process in the time interval and H(X) being the entropy rate. If the process is continuous valued, differential entropy is used instead.

Example

Counter-intuitively, the most likely sequence is often not a member of the typical set. For example, suppose that X is an i.i.d Bernoulli random variable with p(0)=0.1 and p(1)=0.9. In n independent trials, since p(1)>p(0), the most likely sequence of outcome is the sequence of all 1's, (1,1,...,1). Here the entropy of X is H(X)=0.469, while

-{\frac {1}{n}}\log _{2}p\left(x^{(n)}=(1,1,\ldots ,1)\right)=-{\frac {1}{n}}\log _{2}(0.9^{n})=0.152

So this sequence is not in the typical set because its average logarithmic probability cannot come arbitrarily close to the entropy of the random variable X no matter how large we take the value of n.

For Bernoulli random variables, the typical set consists of sequences with average numbers of 0s and 1s in n independent trials. This is easily demonstrated: If p(1) = p and p(0) = 1-p, then for n trials with m 1's, we have

-{\frac {1}{n}}\log _{2}p(x^{(n)})=-{\frac {1}{n}}\log _{2}p^{m}(1-p)^{n-m}=-{\frac {m}{n}}\log _{2}p-\left({\frac {n-m}{n}}\right)\log _{2}(1-p).

The average number of 1's in a sequence of Bernoulli trials is m = np. Thus, we have

-{\frac {1}{n}}\log _{2}p(x^{(n)})=-p\log _{2}p-(1-p)\log _{2}(1-p)=H(X).

For this example, if n=10, then the typical set consist of all sequences that have a single 0 in the entire sequence. In case p(0)=p(1)=0.5, then every possible binary sequences belong to the typical set.

Strongly typical sequences (strong typicality, letter typicality)

If a sequence x₁, ..., x_n is drawn from some specified joint distribution defined over a finite or an infinite alphabet ${\mathcal {X}}$ , then the strongly typical set, A_ε,strong⁽ⁿ⁾ $\in {\mathcal {X}}$ is defined as the set of sequences which satisfy

\left|{\frac {N(x_{i})}{n}}-p(x_{i})\right|<{\frac {\varepsilon }{\|{\mathcal {X}}\|}}.

where ${N(x_{i})}$ is the number of occurrences of a specific symbol in the sequence.

It can be shown that strongly typical sequences are also weakly typical (with a different constant ε), and hence the name. The two forms, however, are not equivalent. Strong typicality is often easier to work with in proving theorems for memoryless channels. However, as is apparent from the definition, this form of typicality is only defined for random variables having finite support.

Jointly typical sequences

Two sequences $x^{n}$ and $y^{n}$ are jointly ε-typical if the pair $(x^{n},y^{n})$ is ε-typical with respect to the joint distribution $p(x^{n},y^{n})=\prod _{i=1}^{n}p(x_{i},y_{i})$ and both $x^{n}$ and $y^{n}$ are ε-typical with respect to their marginal distributions $p(x^{n})$ and $p(y^{n})$ . The set of all such pairs of sequences $(x^{n},y^{n})$ is denoted by $A_{\varepsilon }^{n}(X,Y)$ . Jointly ε-typical n-tuple sequences are defined similarly.

Let ${\tilde {X}}^{n}$ and ${\tilde {Y}}^{n}$ be two independent sequences of random variables with the same marginal distributions $p(x^{n})$ and $p(y^{n})$ . Then for any ε>0, for sufficiently large n, jointly typical sequences satisfy the following properties:

$P\left[(X^{n},Y^{n})\in A_{\varepsilon }^{n}(X,Y)\right]\geqslant 1-\epsilon$
$\left|A_{\varepsilon }^{n}(X,Y)\right|\leqslant 2^{n(H(X,Y)+\epsilon )}$
$\left|A_{\varepsilon }^{n}(X,Y)\right|\geqslant (1-\epsilon )2^{n(H(X,Y)-\epsilon )}$
$P\left[({\tilde {X}}^{n},{\tilde {Y}}^{n})\in A_{\varepsilon }^{n}(X,Y)\right]\leqslant 2^{-n(I(X;Y)-3\epsilon )}$
$P\left[({\tilde {X}}^{n},{\tilde {Y}}^{n})\in A_{\varepsilon }^{n}(X,Y)\right]\geqslant (1-\epsilon )2^{-n(I(X;Y)+3\epsilon )}$

Applications of typicality

Typical set encoding

In information theory, typical set encoding encodes only the sequences in the typical set of a stochastic source with fixed length block codes. Since the size of the typical set is about 2^nH(X), only nH(X) bits are required for the coding, while at the same time ensuring that the chances of encoding error is limited to ε. Asymptotically, it is, by the AEP, lossless and achieves the minimum rate equal to the entropy rate of the source.

Typical set decoding

In information theory, typical set decoding is used in conjunction with random coding to estimate the transmitted message as the one with a codeword that is jointly ε-typical with the observation. i.e.

{\hat {w}}=w\iff (\exists w)((x_{1}^{n}(w),y_{1}^{n})\in A_{\varepsilon }^{n}(X,Y))

where ${\hat {w}},x_{1}^{n}(w),y_{1}^{n}$ are the message estimate, codeword of message $w$ and the observation respectively. $A_{\varepsilon }^{n}(X,Y)$ is defined with respect to the joint distribution $p(x_{1}^{n})p(y_{1}^{n}|x_{1}^{n})$ where $p(y_{1}^{n}|x_{1}^{n})$ is the transition probability that characterizes the channel statistics, and $p(x_{1}^{n})$ is some input distribution used to generate the codewords in the random codebook.

Universal null-hypothesis testing

Universal channel code

Related Research Articles

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable $, which takes values in the alphabet and is distributed according to, the entropy is$

In probability theory, the law of large numbers (LLN) is a mathematical theorem that states that the average of the results obtained from a large number of independent and identical random samples converges to the true value, if it exists. More formally, the LLN states that given a sample of independent and identically distributed values, the sample mean converges to the true mean.

Additive white Gaussian noise (AWGN) is a basic noise model used in information theory to mimic the effect of many random processes that occur in nature. The modifiers denote specific characteristics:

In information theory, the asymptotic equipartition property (AEP) is a general property of the output samples of a stochastic source. It is fundamental to the concept of typical set used in theories of data compression.

Vapnik–Chervonenkis theory was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.

In information theory, Shannon's source coding theorem establishes the statistical limits to possible data compression for data whose source is an independent identically-distributed random variable, and the operational meaning of the Shannon entropy.

In extremal graph theory, Szemerédi’s regularity lemma states that a graph can be partitioned into a bounded number of parts so that the edges between parts are regular. The lemma shows that certain properties of random graphs can be applied to dense graphs like counting the copies of a given subgraph within graphs. Endre Szemerédi proved the lemma over bipartite graphs for his theorem on arithmetic progressions in 1975 and for general graphs in 1978. Variants of the lemma use different notions of regularity and apply to other mathematical objects like hypergraphs.

In information theory, the noisy-channel coding theorem, establishes that for any given degree of noise contamination of a communication channel, it is possible to communicate discrete data nearly error-free up to a computable maximum rate through the channel. This result was presented by Claude Shannon in 1948 and was based in part on earlier work and ideas of Harry Nyquist and Ralph Hartley.

In information theory, Fano's inequality relates the average information lost in a noisy channel to the probability of the categorization error. It was derived by Robert Fano in the early 1950s while teaching a Ph.D. seminar in information theory at MIT, and later recorded in his 1961 textbook.

In mathematical optimization, the ellipsoid method is an iterative method for minimizing convex functions over convex sets. The ellipsoid method generates a sequence of ellipsoids whose volume uniformly decreases at every step, thus enclosing a minimizer of a convex function.

In information theory, information dimension is an information measure for random vectors in Euclidean space, based on the normalized entropy of finely quantized versions of the random vectors. This concept was first introduced by Alfréd Rényi in 1959.

In probability theory and theoretical computer science, McDiarmid's inequality is a concentration inequality which bounds the deviation between the sampled value and the expected value of certain functions when they are evaluated on independent random variables. McDiarmid's inequality applies to functions that satisfy a bounded differences property, meaning that replacing a single argument to the function while leaving all other arguments unchanged cannot cause too large of a change in the value of the function.

A randomness extractor, often simply called an "extractor", is a function, which being applied to output from a weak entropy source, together with a short, uniformly random seed, generates a highly random output that appears independent from the source and uniformly distributed. Examples of weakly random sources include radioactive decay or thermal noise; the only restriction on possible sources is that there is no way they can be fully controlled, calculated or predicted, and that a lower bound on their entropy rate can be established. For a given source, a randomness extractor can even be considered to be a true random number generator (TRNG); but there is no single extractor that has been proven to produce truly random output from any type of weakly random source.

The exponential mechanism is a technique for designing differentially private algorithms. It was developed by Frank McSherry and Kunal Talwar in 2007. Their work was recognized as a co-winner of the 2009 PET Award for Outstanding Research in Privacy Enhancing Technologies.

The Elias Bassalygo bound is a mathematical limit used in coding theory for error correction during data transmission or communications.

In quantum computing, the quantum phase estimation algorithm is a quantum algorithm to estimate the phase corresponding to an eigenvalue of a given unitary operator. Because the eigenvalues of a unitary operator always have unit modulus, they are characterized by their phase, and therefore the algorithm can be equivalently described as retrieving either the phase or the eigenvalue itself. The algorithm was initially introduced by Alexei Kitaev in 1995.

In the theory of quantum communication, the quantum capacity is the highest rate at which quantum information can be communicated over many independent uses of a noisy quantum channel from a sender to a receiver. It is also equal to the highest rate at which entanglement can be generated over the channel, and forward classical communication cannot improve it. The quantum capacity theorem is important for the theory of quantum error correction, and more broadly for the theory of quantum computation. The theorem giving a lower bound on the quantum capacity of any channel is colloquially known as the LSD theorem, after the authors Lloyd, Shor, and Devetak who proved it with increasing standards of rigor.

In coding theory, the Wozencraft ensemble is a set of linear codes in which most of codes satisfy the Gilbert-Varshamov bound. It is named after John Wozencraft, who proved its existence. The ensemble is described by Massey (1963), who attributes it to Wozencraft. Justesen (1972) used the Wozencraft ensemble as the inner codes in his construction of strongly explicit asymptotically good code.

The Gilbert–Varshamov bound for linear codes is related to the general Gilbert–Varshamov bound, which gives a lower bound on the maximal number of elements in an error-correcting code of a given block length and minimum Hamming weight over a field $. This may be translated into a statement about the maximum rate of a code with given length and minimum distance. The Gilbert-Varshamov bound for linear codes asserts the existence of q -ary linear codes for any relative minimum distance less than the given bound that simultaneously have high rate. The existence proof uses the probabilistic method, and thus is not constructive. The Gilbert-Varshamov bound is the best known in terms of relative distance for codes over alphabets of size less than 49. For larger alphabets, algebraic geometry codes sometimes achieve an asymptotically better rate vs. distance tradeoff than is given by the Gilbert-Varshamov bound.$

Fuzzy extractors are a method that allows biometric data to be used as inputs to standard cryptographic techniques, to enhance computer security. "Fuzzy", in this context, refers to the fact that the fixed values required for cryptography will be extracted from values close to but not identical to the original key, without compromising the security required. One application is to encrypt and authenticate users records, using the biometric inputs of the user as a key.

References

C. E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal , vol. 27, pp. 379–423, 623-656, July, October, 1948
Cover, Thomas M. (2006). "Chapter 3: Asymptotic Equipartition Property, Chapter 5: Data Compression, Chapter 8: Channel Capacity". Elements of Information Theory. John Wiley & Sons. ISBN 0-471-24195-4.
David J. C. MacKay. Information Theory, Inference, and Learning Algorithms Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.