Shannon's source coding theorem

In information theory, Shannon's source coding theorem (or noiseless coding theorem) establishes the statistical limits to possible data compression for data whose source is an independent, identically distributed (i.i.d.) random variable, and the operational meaning of the Shannon entropy.

Named after Claude Shannon, the source coding theorem shows that, in the limit as the length of a stream of independent and identically distributed (i.i.d.) data tends to infinity, it is impossible to compress such data so that the code rate (average number of bits per symbol) is less than the Shannon entropy of the source without it being virtually certain that information will be lost. However, it is possible to get the code rate arbitrarily close to the Shannon entropy with negligible probability of loss.

The source coding theorem for symbol codes places an upper and a lower bound on the minimal possible expected length of codewords as a function of the entropy of the input word (which is viewed as a random variable) and of the size of the target alphabet.

Note that, for data that exhibits more dependencies (whose source is not an i.i.d. random variable), the Kolmogorov complexity, which quantifies the minimal description length of an object, is more suitable to describe the limits of data compression. Shannon entropy takes into account only frequency regularities while Kolmogorov complexity takes into account all algorithmic regularities, so in general the latter is smaller. On the other hand, if an object is generated by a random process in such a way that it has only frequency regularities, entropy is close to complexity with high probability (Shen et al. 2017). [1]

Statements

Source coding is a mapping from (a sequence of) symbols from an information source to a sequence of alphabet symbols (usually bits) such that the source symbols can be exactly recovered from the binary bits (lossless source coding) or recovered within some distortion (lossy source coding). This is one approach to data compression.
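As a concrete illustration (a minimal sketch; the four-symbol alphabet and the codeword table are invented for this example), a lossless source code can be given by a prefix-free mapping from source symbols to bit strings, which lets the decoder parse the bit stream back into symbols unambiguously:

```python
# A toy lossless source code: a prefix-free mapping from source
# symbols to bit strings. Because no codeword is a prefix of another,
# the concatenated bits can be parsed back into symbols unambiguously.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
decode_table = {bits: sym for sym, bits in code.items()}

def encode(symbols):
    return "".join(code[s] for s in symbols)

def decode(bits):
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in decode_table:       # a complete codeword has been read
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

msg = "abacad"
assert decode(encode(msg)) == msg     # lossless: exact recovery
```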

Source coding theorem

In information theory, the source coding theorem (Shannon 1948) [2] informally states that (MacKay 2003, p. 81; [3] Cover 2006, Chapter 5 [4]):

N i.i.d. random variables each with entropy H(X) can be compressed into more than NH(X) bits with negligible risk of information loss, as N → ∞; but conversely, if they are compressed into fewer than NH(X) bits it is virtually certain that information will be lost.

The coded sequence represents the compressed message in a one-to-one (biunivocal) way, under the assumption that the decoder knows the source. From a practical point of view, this hypothesis is not always true. Consequently, when entropy encoding is applied, the transmitted message must also carry the information that characterizes the source. Usually, this information is inserted at the beginning of the transmitted message.
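As a numeric sketch of the limit in the theorem (the Bernoulli(0.1) source below is an illustrative assumption, not part of the original statement), one can compute the entropy of a source and the corresponding minimum number of bits needed for reliable compression:

```python
import math

# Shannon entropy of a Bernoulli(p) source, in bits per symbol.
def bernoulli_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.1                      # illustrative source distribution
H = bernoulli_entropy(p)     # about 0.469 bits per symbol
N = 1_000_000
print(f"H(X) = {H:.3f} bits/symbol")
print(f"Reliable compression of N = {N} symbols needs more than {N * H:.0f} bits")
```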

Source coding theorem for symbol codes

Let $\Sigma_1, \Sigma_2$ denote two finite alphabets and let $\Sigma_1^*$ and $\Sigma_2^*$ denote the set of all finite words from those alphabets (respectively).

Suppose that X is a random variable taking values in $\Sigma_1$ and let f be a uniquely decodable code from $\Sigma_1^*$ to $\Sigma_2^*$ where $|\Sigma_2| = a$. Let S denote the random variable given by the length of the codeword f(X).

If f is optimal in the sense that it has the minimal expected word length for X, then (Shannon 1948):

$$\frac{H(X)}{\log_2 a} \leq \mathbb{E}[S] < \frac{H(X)}{\log_2 a} + 1,$$

where $\mathbb{E}$ denotes the expected value operator.
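These bounds can be checked numerically (a minimal sketch; the distribution is an illustrative assumption, and a binary Huffman code is used as a standard optimal symbol code, so a = 2 and log₂ a = 1):

```python
import heapq, math

# Build an optimal binary symbol code (Huffman) and check the bounds
# H(X) <= E[S] < H(X) + 1, the a = 2 case of the theorem above.
def huffman_lengths(probs):
    # Heap items: (subtree probability, tiebreaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1           # merged symbols sink one level deeper
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]     # illustrative distribution
lengths = huffman_lengths(probs)      # -> [1, 2, 3, 3]
H = -sum(p * math.log2(p) for p in probs)
ES = sum(p * s for p, s in zip(probs, lengths))
assert H <= ES < H + 1                # the symbol-code bounds hold
print(f"H = {H:.3f}, E[S] = {ES:.3f}")
```

For this dyadic distribution the lower bound is met with equality: E[S] = H(X) = 1.75 bits.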

Proof: Source coding theorem

Given that X is an i.i.d. source, its time series X1, ..., Xn is i.i.d. with entropy H(X) in the discrete-valued case and differential entropy in the continuous-valued case. The source coding theorem states that for any ε > 0, i.e. for any rate H(X) + ε larger than the entropy of the source, there is a large enough n and an encoder that takes n i.i.d. repetitions of the source, X1:n, and maps them to n(H(X) + ε) binary bits such that the source symbols X1:n are recoverable from the binary bits with probability of at least 1 − ε.

Proof of Achievability. Fix some ε > 0, and let

$$p(x_1, \ldots, x_n) = \Pr\left[X_1 = x_1, \ldots, X_n = x_n\right].$$

The typical set, $A_n^\varepsilon$, is defined as follows:

$$A_n^\varepsilon = \left\{ (x_1, \ldots, x_n) \ : \ \left| -\frac{1}{n} \log_2 p(x_1, \ldots, x_n) - H(X) \right| < \varepsilon \right\}.$$

The asymptotic equipartition property (AEP) shows that for large enough n, the probability that a sequence generated by the source lies in the typical set, $A_n^\varepsilon$, as defined approaches one. In particular, for sufficiently large n, $P(A_n^\varepsilon)$ can be made arbitrarily close to 1, and specifically, greater than $1 - \varepsilon$ (see AEP for a proof).

The definition of typical sets implies that those sequences that lie in the typical set satisfy

$$2^{-n(H(X)+\varepsilon)} \leq p(x_1, \ldots, x_n) \leq 2^{-n(H(X)-\varepsilon)},$$

which in turn implies $|A_n^\varepsilon| \leq 2^{n(H(X)+\varepsilon)}$. Since $|A_n^\varepsilon| \leq 2^{n(H(X)+\varepsilon)}$, $n(H(X)+\varepsilon)$ bits are enough to point to any string in this set.

The encoding algorithm: the encoder checks whether the input sequence lies within the typical set; if so, it outputs the index of the input sequence within the typical set; if not, the encoder outputs an arbitrary n(H(X) + ε)-digit number. As long as the input sequence lies within the typical set (with probability at least 1 − ε), the encoder makes no error. So, the probability of error of the encoder is bounded above by ε.
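The following is a brute-force sketch of this encoder for a small Bernoulli source (all parameters are illustrative assumptions; n is kept tiny so the typical set can be enumerated explicitly, which is infeasible for realistic block lengths):

```python
import itertools, math

# Typical-set encoding for an i.i.d. Bernoulli(p) source.
p, n, eps = 0.2, 12, 0.15
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def log2_prob(x):
    k = sum(x)  # number of ones in the sequence
    return k * math.log2(p) + (n - k) * math.log2(1 - p)

# The typical set: sequences whose per-symbol log-probability is
# within eps of -H(X).
typical = [x for x in itertools.product((0, 1), repeat=n)
           if abs(-log2_prob(x) / n - H) < eps]
index = {x: i for i, x in enumerate(typical)}

bits_needed = math.ceil(n * (H + eps))
assert len(typical) <= 2 ** bits_needed   # n(H(X)+eps) bits index the set

def encode(x):
    if x in index:                         # typical: output its index
        return format(index[x], f"0{bits_needed}b")
    return "0" * bits_needed               # atypical: arbitrary output (loss)

x = (0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0)  # a typical sequence
assert typical[int(encode(x), 2)] == x     # the decoder recovers x exactly
```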

Proof of converse: the converse is proved by showing that any set of size smaller than $A_n^\varepsilon$ (in the sense of exponent) would cover a set of probability bounded away from 1.

Proof: Source coding theorem for symbol codes

For 1 ≤ i ≤ n let $s_i$ denote the word length of each possible $x_i$. Define $q_i = a^{-s_i}/C$, where C is chosen so that $q_1 + \cdots + q_n = 1$. Then

$$\begin{aligned}
H(X) &= -\sum_{i=1}^n p_i \log_2 p_i \\
&\leq -\sum_{i=1}^n p_i \log_2 q_i \\
&= -\sum_{i=1}^n p_i \log_2 a^{-s_i} + \sum_{i=1}^n p_i \log_2 C \\
&= -\sum_{i=1}^n p_i \log_2 a^{-s_i} + \log_2 C \\
&\leq -\sum_{i=1}^n -s_i p_i \log_2 a \\
&= \mathbb{E}[S] \log_2 a,
\end{aligned}$$

where the second line follows from Gibbs' inequality and the fifth line follows from Kraft's inequality:

$$C = \sum_{i=1}^n a^{-s_i} \leq 1,$$

so $\log_2 C \leq 0$.

For the second inequality we may set

$$s_i = \left\lceil -\log_a p_i \right\rceil,$$

so that

$$-\log_a p_i \leq s_i < -\log_a p_i + 1,$$

and so

$$a^{-s_i} \leq p_i,$$

and

$$\sum_i a^{-s_i} \leq \sum_i p_i = 1,$$

and so by Kraft's inequality there exists a prefix-free code having those word lengths. Thus the minimal S satisfies

$$\mathbb{E}[S] = \sum_i s_i p_i < \sum_i \left(-\log_a p_i + 1\right) p_i = \frac{H(X)}{\log_2 a} + 1.$$
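This achievability construction can be checked numerically (a minimal sketch with an illustrative, assumed distribution, taking a = 2):

```python
import math

# Shannon code lengths s_i = ceil(-log2 p_i): Kraft's inequality holds,
# so a prefix-free code with these lengths exists, and its expected
# length satisfies H(X) <= E[S] < H(X) + 1 (the a = 2 case).
probs = [0.4, 0.3, 0.2, 0.1]           # illustrative distribution
lengths = [math.ceil(-math.log2(p)) for p in probs]

kraft_sum = sum(2 ** -s for s in lengths)
H = -sum(p * math.log2(p) for p in probs)
ES = sum(p * s for p, s in zip(probs, lengths))

assert kraft_sum <= 1                  # a prefix-free code exists
assert H <= ES < H + 1                 # the symbol-code bounds hold
print(f"lengths = {lengths}, Kraft sum = {kraft_sum:.4f}, "
      f"H = {H:.3f}, E[S] = {ES:.3f}")
```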

Extension to non-stationary independent sources

Fixed rate lossless source coding for discrete time non-stationary independent sources

Define the typical set $A_n^\varepsilon$ as

$$A_n^\varepsilon = \left\{ x_1^n \ : \ \left| -\frac{1}{n} \log_2 p(X_1, \ldots, X_n) - \overline{H}_n(X) \right| < \varepsilon \right\},$$

where $\overline{H}_n(X) = \frac{1}{n} \sum_{i=1}^n H(X_i)$ is the average per-symbol entropy of the first n symbols. Then, for given δ > 0, for n large enough, $\Pr(A_n^\varepsilon) > 1 - \delta$. Now we just encode the sequences in the typical set, and usual methods in source coding show that the cardinality of this set is smaller than $2^{n(\overline{H}_n(X) + \varepsilon)}$. Thus, on average, $\overline{H}_n(X) + \varepsilon$ bits suffice for encoding with probability greater than 1 − δ, where ε and δ can be made arbitrarily small by making n larger.
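A small numeric sketch of this rate (the per-symbol parameters p_i below are illustrative assumptions) for an independent but non-stationary Bernoulli source:

```python
import math

# Independent but non-stationary source: X_i ~ Bernoulli(p_i), so the
# per-symbol entropies differ and the rate is governed by their average.
def h(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

ps = [0.1, 0.2, 0.3, 0.4, 0.5]        # illustrative p_i, one per symbol
n = len(ps)
H_bar = sum(h(p) for p in ps) / n     # average per-symbol entropy
eps = 0.05
print(f"average entropy = {H_bar:.3f} bits/symbol; "
      f"about n*(H_bar + eps) = {n * (H_bar + eps):.2f} bits suffice")
```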


References

  1. Shen, A.; Uspensky, V. A.; Vereshchagin, N. (2017). "Chapter 7.3: Complexity and entropy". Kolmogorov Complexity and Algorithmic Randomness. American Mathematical Society. p. 226. ISBN 9781470431822.
  2. C. E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July, October 1948.
  3. David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1.
  4. Cover, Thomas M. (2006). "Chapter 5: Data Compression". Elements of Information Theory. John Wiley & Sons. pp. 103–142. ISBN 0-471-24195-4.