Noisy-channel coding theorem

Last updated January 19, 2024

In information theory, the noisy-channel coding theorem (sometimes Shannon's theorem or Shannon's limit) establishes that, for any given degree of noise contamination of a communication channel, it is possible (in theory) to communicate discrete data (digital information) nearly error-free up to a computable maximum rate through the channel. This result was presented by Claude Shannon in 1948 and was based in part on earlier work and ideas of Harry Nyquist and Ralph Hartley.

The Shannon limit or Shannon capacity of a communication channel refers to the maximum rate of error-free data that can theoretically be transferred over the channel if the link is subject to random data transmission errors, for a particular noise level. It was first described by Shannon (1948), and shortly after published in a book by Shannon and Warren Weaver entitled The Mathematical Theory of Communication (1949). This founded the modern discipline of information theory.

Overview

Stated by Claude Shannon in 1948, the theorem describes the maximum possible efficiency of error-correcting methods versus levels of noise interference and data corruption. Shannon's theorem has wide-ranging applications in both communications and data storage. This theorem is of foundational importance to the modern field of information theory. Shannon gave only an outline of the proof. The first rigorous proof for the discrete case was given by Feinstein (1954).

The Shannon theorem states that, given a noisy channel with channel capacity C and information transmitted at a rate R, if $R<C$ there exist codes that allow the probability of error at the receiver to be made arbitrarily small. This means that, theoretically, it is possible to transmit information nearly without error at any rate below a limiting rate, C.

The converse is also important. If $R>C$ , an arbitrarily small probability of error is not achievable. All codes will have a probability of error greater than a certain positive minimal level, and this level increases as the rate increases. So, information cannot be guaranteed to be transmitted reliably across a channel at rates beyond the channel capacity. The theorem does not address the rare situation in which rate and capacity are equal.

The channel capacity $C$ can be calculated from the physical properties of a channel; for a band-limited channel with Gaussian noise, it is given by the Shannon–Hartley theorem.
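As a rough illustration of such a calculation, the Shannon–Hartley formula $C=B\log _{2}(1+S/N)$ can be evaluated directly. The sketch below is not part of the original text: the function names and the example figures (a 3 kHz channel at 30 dB signal-to-noise ratio) are illustrative assumptions.

```python
import math


def db_to_linear(snr_db: float) -> float:
    """Convert a signal-to-noise ratio from decibels to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)


def shannon_hartley_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """Capacity in bits per second of a band-limited AWGN channel.

    Implements C = B * log2(1 + S/N), with S/N given as a linear power ratio.
    """
    if bandwidth_hz <= 0 or snr_linear < 0:
        raise ValueError("bandwidth must be positive and SNR non-negative")
    return bandwidth_hz * math.log2(1.0 + snr_linear)


if __name__ == "__main__":
    # Hypothetical example values, chosen only for illustration.
    capacity = shannon_hartley_capacity(3000.0, db_to_linear(30.0))
    print(f"capacity: {capacity:.1f} bit/s")
```

The capacity grows linearly with bandwidth but only logarithmically with signal-to-noise ratio.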

Simple schemes such as "send the message 3 times and use a best 2 out of 3 voting scheme if the copies differ" are inefficient error-correction methods, unable to asymptotically guarantee that a block of data can be communicated free of error. Advanced techniques such as Reed–Solomon codes and, more recently, low-density parity-check (LDPC) codes and turbo codes, come much closer to reaching the theoretical Shannon limit, but at a cost of high computational complexity. Using these highly efficient codes and with the computing power in today's digital signal processors, it is now possible to reach very close to the Shannon limit. In fact, it was shown that LDPC codes can reach within 0.0045 dB of the Shannon limit (for binary additive white Gaussian noise (AWGN) channels, with very long block lengths).^[1]
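To see why repetition is inefficient, one can compute its residual error exactly. The sketch below assumes a binary symmetric channel with crossover probability p (an assumption made for illustration; the paragraph above does not fix a channel model) and a majority vote over an odd number n of copies. The function name is hypothetical.

```python
from math import comb


def repetition_error(p: float, n: int) -> float:
    """Probability that a majority vote over n noisy copies of one bit is wrong.

    Each copy is flipped independently with probability p. The vote fails
    when more than half of the n copies are flipped.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that the majority vote has no ties")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 11, 21):
        # The rate of an n-fold repetition code is 1/n.
        print(f"n={n:2d}  rate={1 / n:.3f}  bit error={repetition_error(0.1, n):.2e}")
```

The error per bit falls as n grows, but only because the rate 1/n falls toward zero. The theorem, by contrast, promises arbitrarily small error at any fixed rate below C.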

Mathematical statement

The basic mathematical model for a communication system is the following:

{\xrightarrow[{\text{Message}}]{W}}{\begin{array}{|c| }\hline {\text{Encoder}}\\f_{n}\\\hline \end{array}}{\xrightarrow[{\mathrm {Encoded \atop sequence} }]{X^{n}}}{\begin{array}{|c| }\hline {\text{Channel}}\\p(y|x)\\\hline \end{array}}{\xrightarrow[{\mathrm {Received \atop sequence} }]{Y^{n}}}{\begin{array}{|c| }\hline {\text{Decoder}}\\g_{n}\\\hline \end{array}}{\xrightarrow[{\mathrm {Estimated \atop message} }]{\hat {W}}}

A message W is transmitted through a noisy channel by using encoding and decoding functions. An encoder maps W into a pre-defined sequence of channel symbols of length n. In its most basic model, the channel distorts each of these symbols independently of the others. The output of the channel, the received sequence, is fed into a decoder which maps the sequence into an estimate of the message. In this setting, the probability of error is defined as:

P_{e}={\text{Pr}}\left\{{\hat {W}}\neq W\right\}.

Theorem (Shannon, 1948):

1. For every discrete memoryless channel, the channel capacity, defined in terms of the mutual information $I(X;Y)$ as $C=\sup _{p_{X}}I(X;Y)$,^[2] has the following property. For any $\epsilon >0$ and $R<C$, for large enough $N$, there exists a code of length $N$ and rate $\geq R$ and a decoding algorithm, such that the maximal probability of block error is $\leq \epsilon$.

2. If a probability of bit error $p_{b}$ is acceptable, rates up to $R(p_{b})$ are achievable, where

R(p_{b})={\frac {C}{1-H_{2}(p_{b})}}

and $H_{2}(p_{b})$ is the binary entropy function

H_{2}(p_{b})=-\left[p_{b}\log _{2}{p_{b}}+(1-p_{b})\log _{2}({1-p_{b}})\right].

3. For any $p_{b}$, rates greater than $R(p_{b})$ are not achievable.

(MacKay (2003), p. 162; cf Gallager (1968), ch.5; Cover and Thomas (1991), p. 198; Shannon (1948) thm. 11)
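The quantities in the theorem are easy to evaluate numerically. The sketch below assumes a binary symmetric channel with crossover probability f, whose capacity is $1-H_{2}(f)$. The choice of channel and the function names are illustrative assumptions, not part of the theorem statement.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy function H2(p) in bits, with H2(0) = H2(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def bsc_capacity(f: float) -> float:
    """Capacity in bits per channel use of a binary symmetric channel."""
    return 1.0 - binary_entropy(f)


def max_rate_with_bit_error(capacity: float, p_b: float) -> float:
    """R(p_b) = C / (1 - H2(p_b)) from part 2 of the theorem."""
    if not 0.0 <= p_b < 0.5:
        raise ValueError("p_b must lie in [0, 0.5)")
    return capacity / (1.0 - binary_entropy(p_b))


if __name__ == "__main__":
    c = bsc_capacity(0.11)  # close to 0.5 bit per channel use
    for p_b in (0.0, 0.01, 0.1):
        print(f"p_b={p_b:<5} R(p_b)={max_rate_with_bit_error(c, p_b):.4f}")
```

At $p_{b}=0$ the achievable rate reduces to C, and tolerating a nonzero bit error rate raises it above C.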

Outline of proof

As with several other major results in information theory, the proof of the noisy channel coding theorem includes an achievability result and a matching converse result. These two components serve to bound, in this case, the set of possible rates at which one can communicate over a noisy channel, and the fact that they match shows that these bounds are tight.

The following outlines are only one set of many different styles available for study in information theory texts.

Achievability for discrete memoryless channels

This particular proof of achievability follows the style of proofs that make use of the asymptotic equipartition property (AEP). Another style can be found in information theory texts using error exponents.

Both types of proofs make use of a random coding argument where the codebook used across a channel is randomly constructed. This makes the analysis simpler while still proving the existence of a code satisfying a desired low probability of error at any data rate below the channel capacity.

By an AEP-related argument, given a channel, length $n$ strings of source symbols $X_{1}^{n}$ , and length $n$ strings of channel outputs $Y_{1}^{n}$ , we can define a jointly typical set by the following:

{\begin{aligned}A_{\varepsilon }^{(n)}=\{(x^{n},y^{n})\in {\mathcal {X}}^{n}\times {\mathcal {Y}}^{n}:{}&2^{-n(H(X)+\varepsilon )}\leq p(X_{1}^{n})\leq 2^{-n(H(X)-\varepsilon )},\\&2^{-n(H(Y)+\varepsilon )}\leq p(Y_{1}^{n})\leq 2^{-n(H(Y)-\varepsilon )},\\&2^{-n(H(X,Y)+\varepsilon )}\leq p(X_{1}^{n},Y_{1}^{n})\leq 2^{-n(H(X,Y)-\varepsilon )}\}\end{aligned}}

We say that two sequences ${X_{1}^{n}}$ and $Y_{1}^{n}$ are jointly typical if they lie in the jointly typical set defined above.
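Each of the three conditions is equivalent to requiring that the per-symbol log-probability of the sequence lie within $\varepsilon$ of the corresponding entropy. A minimal sketch of such a membership test follows, assuming symbols drawn i.i.d. from a joint distribution supplied as a dictionary. The data layout and function names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def entropy(pmf: dict) -> float:
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def marginals(joint: dict) -> tuple:
    """Marginal distributions of X and Y from a joint pmf {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return dict(px), dict(py)


def empirical_rate(seq, pmf: dict) -> float:
    """-(1/n) log2 p(seq) for an i.i.d. sequence; inf if p(seq) is zero."""
    total = 0.0
    for symbol in seq:
        p = pmf.get(symbol, 0.0)
        if p <= 0.0:
            return math.inf
        total -= math.log2(p)
    return total / len(seq)


def is_jointly_typical(xs, ys, joint: dict, eps: float) -> bool:
    """Check the three typicality conditions for the pair (xs, ys)."""
    px, py = marginals(joint)
    pairs = list(zip(xs, ys))
    checks = [
        (empirical_rate(xs, px), entropy(px)),
        (empirical_rate(ys, py), entropy(py)),
        (empirical_rate(pairs, joint), entropy(joint)),
    ]
    return all(abs(rate - h) <= eps for rate, h in checks)
```

For example, with a noiseless binary channel whose input is 1 with probability 0.1, the pair of all-ones sequences is not typical, while a pair with one 1 in ten symbols is.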

Steps

1. In the style of the random coding argument, we randomly generate $2^{nR}$ codewords of length n from a probability distribution Q.
2. This code is revealed to the sender and receiver. It is also assumed that one knows the transition matrix $p(y|x)$ for the channel being used.
3. A message W is chosen according to the uniform distribution on the set of codewords. That is, $Pr(W=w)=2^{-nR},w=1,2,\dots ,2^{nR}$.
4. The message W is sent across the channel.
5. The receiver receives a sequence according to $P(y^{n}|x^{n}(w))=\prod _{i=1}^{n}p(y_{i}|x_{i}(w))$.
6. Sending these codewords across the channel, we receive $Y_{1}^{n}$, and decode to the corresponding message if there exists exactly one codeword that is jointly typical with $Y_{1}^{n}$. If there are no jointly typical codewords, or if there is more than one, an error is declared. An error also occurs if a decoded codeword does not match the original codeword. This is called typical set decoding.
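The steps above can be sketched as a small simulation. The version below makes several assumptions that the outline does not: the channel is a binary symmetric channel with crossover probability p, the distribution Q is uniform on {0, 1}, and the parameter values are arbitrary. With uniform inputs both marginal typicality conditions hold automatically, so only the joint condition is tested. A fresh random codebook is drawn for every trial, so the result estimates the error averaged over codes.

```python
import math
import random


def binary_entropy(p: float) -> float:
    """Binary entropy H2(p) in bits for 0 < p < 1."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def jointly_typical_bsc(x, y, p: float, eps: float) -> bool:
    """Joint typicality of (x, y) for a BSC with uniform inputs.

    Here p(x^n, y^n) = 2^-n * p^d * (1-p)^(n-d), where d is the number of
    positions in which x and y differ, and H(X,Y) = 1 + H2(p).
    """
    n = len(x)
    d = sum(a != b for a, b in zip(x, y))
    empirical = 1.0 - (d / n) * math.log2(p) - (1 - d / n) * math.log2(1 - p)
    return abs(empirical - (1.0 + binary_entropy(p))) <= eps


def typical_set_decode(codebook, y, p: float, eps: float):
    """Return the unique jointly typical message index, or None on error."""
    matches = [w for w, x in enumerate(codebook) if jointly_typical_bsc(x, y, p, eps)]
    return matches[0] if len(matches) == 1 else None


def simulate(n: int, rate: float, p: float, eps: float, trials: int, seed: int = 0) -> float:
    """Estimate the block error probability averaged over random codebooks."""
    if not 0.0 < p < 0.5:
        raise ValueError("p must lie strictly between 0 and 0.5")
    rng = random.Random(seed)
    num_codewords = 2 ** round(n * rate)
    errors = 0
    for _ in range(trials):
        # Steps 1-2: draw a random codebook known to both ends.
        codebook = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(num_codewords)]
        # Steps 3-5: pick a uniform message and pass its codeword through the BSC.
        w = rng.randrange(num_codewords)
        y = tuple(bit ^ (rng.random() < p) for bit in codebook[w])
        # Step 6: typical set decoding; None counts as a declared error.
        if typical_set_decode(codebook, y, p, eps) != w:
            errors += 1
    return errors / trials
```

Calling, for instance, `simulate(n=60, rate=0.1, p=0.1, eps=0.3, trials=200)` gives an empirical error rate for one such setting. Short block lengths give noisy estimates, since the theorem is asymptotic in n.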

The probability of error of this scheme is divided into two parts:

First, an error can occur if no jointly typical X sequence is found for a received Y sequence.
Second, an error can occur if an incorrect X sequence is jointly typical with a received Y sequence.

By the randomness of the code construction, we can assume that the average probability of error averaged over all codes does not depend on the index sent. Thus, without loss of generality, we can assume W = 1.
From the joint AEP, we know that the probability that no jointly typical X exists goes to 0 as n grows large. We can bound this error probability by $\varepsilon$ .
Also from the joint AEP, we know the probability that a particular $X_{1}^{n}(i)$ with $i\neq 1$ and the $Y_{1}^{n}$ resulting from W = 1 are jointly typical is $\leq 2^{-n(I(X;Y)-3\varepsilon )}$, because such a codeword is generated independently of the received sequence.

Define: $E_{i}=\{(X_{1}^{n}(i),Y_{1}^{n})\in A_{\varepsilon }^{(n)}\},i=1,2,\dots ,2^{nR}$

as the event that message i is jointly typical with the sequence received when message 1 is sent.

{\begin{aligned}P({\text{error}})&{}=P({\text{error}}|W=1)\leq P(E_{1}^{c})+\sum _{i=2}^{2^{nR}}P(E_{i})\\&{}\leq P(E_{1}^{c})+(2^{nR}-1)2^{-n(I(X;Y)-3\varepsilon )}\\&{}\leq \varepsilon +2^{-n(I(X;Y)-R-3\varepsilon )}.\end{aligned}}

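We can observe that as $n$ goes to infinity, if $R<I(X;Y)-3\varepsilon$ the second term vanishes and the bound tends to $\varepsilon$. Since $\varepsilon$ can be chosen arbitrarily small, the probability of error can be driven to 0 for any $R<I(X;Y)$.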
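The behaviour of the final bound, $\varepsilon +2^{-n(I(X;Y)-R-3\varepsilon )}$, can be checked numerically. The sketch below simply evaluates that expression. The function name and the example values of the rate, mutual information and $\varepsilon$ are assumptions chosen for illustration.

```python
def achievability_bound(n: int, rate: float, mutual_info: float, eps: float) -> float:
    """Upper bound eps + 2^(-n (I - R - 3 eps)) on the average error probability."""
    return eps + 2.0 ** (-n * (mutual_info - rate - 3.0 * eps))


if __name__ == "__main__":
    # Hypothetical values: I(X;Y) = 0.5 bit, R = 0.4 bit, eps = 0.01.
    for n in (10, 100, 1000):
        print(n, achievability_bound(n, 0.4, 0.5, 0.01))
```

The second term decays exponentially in n whenever the rate is below $I(X;Y)-3\varepsilon$, leaving only the $\varepsilon$ contribution. If the rate exceeds that threshold the expression grows instead, and the bound becomes vacuous.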

Finally, given that the average codebook is shown to be "good," we know that there exists a codebook whose performance is at least as good as the average, and so satisfies our need for arbitrarily low error probability communicating across the noisy channel.

Weak converse for discrete memoryless channels

Suppose a code of $2^{nR}$ codewords. Let W be drawn uniformly over this set as an index. Let $X^{n}$ and $Y^{n}$ be the transmitted codeword and the received sequence, respectively.

$nR=H(W)=H(W|Y^{n})+I(W;Y^{n})$ using identities involving entropy and mutual information
$\leq H(W|Y^{n})+I(X^{n}(W);Y^{n})$ since X is a function of W
$\leq 1+P_{e}^{(n)}nR+I(X^{n}(W);Y^{n})$ by the use of Fano's Inequality
$\leq 1+P_{e}^{(n)}nR+nC$ by the fact that capacity is maximized mutual information.

The result of these steps is that $P_{e}^{(n)}\geq 1-{\frac {1}{nR}}-{\frac {C}{R}}$. As the block length $n$ goes to infinity, this lower bound tends to $1-{\frac {C}{R}}$, so $P_{e}^{(n)}$ is bounded away from 0 if R is greater than C; we can get arbitrarily low rates of error only if R does not exceed C.
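The lower bound from the weak converse is easy to tabulate. The sketch below evaluates $1-{\frac {1}{nR}}-{\frac {C}{R}}$, clipped at zero because a probability cannot be negative. The function name and the example values are illustrative assumptions.

```python
def weak_converse_lower_bound(n: int, rate: float, capacity: float) -> float:
    """Lower bound 1 - 1/(nR) - C/R on the block error probability, clipped at 0."""
    return max(0.0, 1.0 - 1.0 / (n * rate) - capacity / rate)


if __name__ == "__main__":
    # Hypothetical channel with C = 0.5 bit per use, operated at R = 0.6.
    for n in (10, 100, 10_000):
        print(n, weak_converse_lower_bound(n, 0.6, 0.5))
```

For a rate above capacity the bound approaches $1-C/R>0$ as n grows. For a rate below capacity it is clipped to zero and carries no information.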

Strong converse for discrete memoryless channels

A strong converse theorem, proven by Wolfowitz in 1957,^[3] states that,

P_{e}\geq 1-{\frac {4A}{n(R-C)^{2}}}-e^{-{\frac {n(R-C)}{2}}}

for some finite positive constant $A$ . While the weak converse states that the error probability is bounded away from zero as $n$ goes to infinity, the strong converse states that the error goes to 1. Thus, $C$ is a sharp threshold between perfectly reliable and completely unreliable communication.

Channel coding theorem for non-stationary memoryless channels

We assume that the channel is memoryless, but its transition probabilities change with time, in a fashion known at the transmitter as well as the receiver.

Then the channel capacity is given by

C=\liminf _{n\to \infty }\max _{p^{(X_{1})},p^{(X_{2})},\ldots }{\frac {1}{n}}\sum _{i=1}^{n}I(X_{i};Y_{i}).

The maximum is attained at the capacity achieving distributions for each respective channel. That is, $C=\liminf _{n\to \infty }{\frac {1}{n}}\sum _{i=1}^{n}C_{i}$ where $C_{i}$ is the capacity of the $i$th channel.

Outline of the proof

The proof runs through in almost the same way as that of the channel coding theorem. Achievability follows from random coding with each symbol chosen randomly from the capacity achieving distribution for that particular channel. Typicality arguments use the definition of typical sets for non-stationary sources defined in the asymptotic equipartition property article.

The technicality of lim inf comes into play when ${\frac {1}{n}}\sum _{i=1}^{n}C_{i}$ does not converge.
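A small numerical example shows how the running average can fail to converge. The sketch below builds a hypothetical sequence of per-use capacities made of blocks of doubling length that alternate between capacity 1 and capacity 0. This construction is an assumption made for illustration and does not come from the text above.

```python
def running_average_capacity(capacities):
    """Return the averages (1/n) * sum_{i<=n} C_i for n = 1, 2, ..."""
    averages = []
    total = 0.0
    for n, c in enumerate(capacities, start=1):
        total += c
        averages.append(total / n)
    return averages


def doubling_block_capacities(num_blocks: int):
    """Hypothetical capacities: blocks of length 1, 2, 4, ... alternating 1 and 0."""
    capacities = []
    for b in range(num_blocks):
        capacities.extend([1.0 if b % 2 == 0 else 0.0] * 2**b)
    return capacities


if __name__ == "__main__":
    averages = running_average_capacity(doubling_block_capacities(12))
    # Over a finite horizon, the smallest late average is a proxy for lim inf.
    tail = averages[len(averages) // 2:]
    print(f"min over tail: {min(tail):.4f}, max over tail: {max(tail):.4f}")
```

For this sequence the averages keep oscillating between roughly 1/3 and 2/3, so the limit does not exist and the capacity is governed by the lower value.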

Notes

[1] Sae-Young Chung; Forney, G. D.; Richardson, T.J.; Urbanke, R. (February 2001). "On the Design of Low-Density Parity-Check Codes within 0.0045 dB of the Shannon Limit" (PDF). IEEE Communications Letters. 5 (2): 58–60. doi:10.1109/4234.905935. S2CID 7381972.
[2] For a description of the "sup" function, see Supremum.
[3] Gallager, Robert (1968). Information Theory and Reliable Communication. Wiley. ISBN 0-471-29048-3.

References

Aazhang, B. (2004). "Shannon's Noisy Channel Coding Theorem" (PDF). Connections.
Cover, T.M.; Thomas, J.A. (1991). Elements of Information Theory. Wiley. ISBN 0-471-06259-6.
Fano, R.M. (1961). Transmission of information; a statistical theory of communications. MIT Press. ISBN 0-262-06001-9.
Feinstein, Amiel (September 1954). "A new basic theorem of information theory". Transactions of the IRE Professional Group on Information Theory. 4 (4): 2–22. Bibcode:1955PhDT........12F. doi:10.1109/TIT.1954.1057459. hdl:1721.1/4798.
Lundheim, Lars (2002). "On Shannon and Shannon's Formula" (PDF). Telektronikk. 98 (1): 20–29.
MacKay, David J.C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. ISBN 0-521-64298-1.
Shannon, C. E. (1948). "A Mathematical Theory of Communication". Bell System Technical Journal. 27 (3): 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x.
Shannon, C.E. (1998) [1948]. A Mathematical Theory of Communication. University of Illinois Press.
Wolfowitz, J. (1957). "The coding of messages subject to chance errors". Illinois J. Math. 1 (4): 591–606. doi:10.1215/ijm/1255380682.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
