Error exponent

In information theory, the error exponent of a channel code or source code over the block length of the code is the rate at which the error probability decays exponentially with the block length of the code. Formally, it is defined as the limiting ratio of the negative logarithm of the error probability to the block length of the code for large block lengths. For example, if the probability of error $P_{\mathrm{error}}$ of a decoder drops as $e^{-n\alpha}$, where $n$ is the block length, the error exponent is $\alpha$. In this example, $-\frac{\ln P_{\mathrm{error}}}{n}$ approaches $\alpha$ for large $n$. Many of the information-theoretic theorems are of asymptotic nature; for example, the channel coding theorem states that for any rate less than the channel capacity, the probability of error of the channel code can be made to go to zero as the block length goes to infinity. In practical situations, there are limitations to the delay of the communication and the block length must be finite. Therefore, it is important to study how quickly the probability of error drops as the block length grows.
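As a quick numerical illustration of this definition, the following minimal Python sketch (with made-up numbers, used only for the example) computes the empirical exponent $-\frac{\ln P_{\mathrm{error}}}{n}$ for a hypothetical decoder:

import math

# Hypothetical values chosen only to illustrate the definition above.
n = 1000          # block length
p_error = 1e-9    # block error probability of some decoder

# Empirical error exponent: negative log of the error probability per channel use.
exponent = -math.log(p_error) / n
print(exponent)   # about 0.0207 nats per channel use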

Error exponent in channel coding

For time-invariant DMCs (discrete memoryless channels)

The channel coding theorem states that for any ε > 0 and for any rate less than the channel capacity, there is an encoding and decoding scheme that can be used to ensure that the probability of block error is less than ε for a sufficiently large block length. Also, for any rate greater than the channel capacity, the probability of block error at the receiver goes to one as the block length goes to infinity.

Assume a channel coding setup as follows: the channel can transmit any of $M = 2^{nR}$ messages by transmitting the corresponding codeword (which is of length $n$). Each component of the codebook is drawn i.i.d. according to some probability distribution with probability mass function $Q$. At the decoding end, maximum likelihood decoding is done.
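To make this setup concrete, here is a minimal Python sketch (my own illustration; the binary symmetric channel, the parameters, and the variable names are assumptions, not part of the description above). It draws a random i.i.d. codebook, transmits one codeword, and performs maximum likelihood decoding, which for a binary symmetric channel with crossover probability below 1/2 reduces to minimum Hamming distance decoding:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: block length, rate (bits per use), crossover probability.
n, R, p = 20, 0.25, 0.05
M = 2 ** int(n * R)                      # number of messages / codewords

# Random codebook: every component drawn i.i.d. from Q = Bernoulli(1/2).
codebook = rng.integers(0, 2, size=(M, n))

# Transmit codeword 0 over the binary symmetric channel.
x = codebook[0]
y = x ^ (rng.random(n) < p)

# Maximum likelihood decoding = minimum Hamming distance decoding for this channel.
distances = (codebook != y).sum(axis=1)
decoded = int(np.argmin(distances))
print("decoded message:", decoded, "decoding error:", decoded != 0)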

Let $X_i^n$ be the $i$th random codeword in the codebook, where $i$ goes from $1$ to $M$. Suppose the first message is selected, so codeword $X_1^n$ is transmitted. Given that $y_1^n$ is received, the probability that the codeword is incorrectly detected as $X_2^n$ is:

$$P_{\mathrm{error}\,1\to 2} = \sum_{x_2^n} Q(x_2^n)\,\mathbf{1}\!\left[p(y_1^n \mid x_2^n) > p(y_1^n \mid x_1^n)\right].$$

The indicator function $\mathbf{1}\!\left[p(y_1^n \mid x_2^n) > p(y_1^n \mid x_1^n)\right]$ has the upper bound

$$\left(\frac{p(y_1^n \mid x_2^n)}{p(y_1^n \mid x_1^n)}\right)^{s}$$

for any $s > 0$. Thus,

$$P_{\mathrm{error}\,1\to 2} \le \sum_{x_2^n} Q(x_2^n)\left(\frac{p(y_1^n \mid x_2^n)}{p(y_1^n \mid x_1^n)}\right)^{s}.$$

Since there are a total of $M$ messages, and the entries in the codebook are i.i.d., the probability that $X_1^n$ is confused with some other message is at most $M$ times the above expression. Combining this union bound with the inequality $\min(1,x) \le x^{\rho}$ (valid because any probability is at most one), the probability of confusing $X_1^n$ with any message is bounded by:

$$P_{\mathrm{error}\,1\to\mathrm{any}} \le M^{\rho}\left(\sum_{x_2^n} Q(x_2^n)\left(\frac{p(y_1^n \mid x_2^n)}{p(y_1^n \mid x_1^n)}\right)^{s}\right)^{\rho}$$

for any $\rho \in [0,1]$. Averaging over all combinations of $x_1^n$ and $y_1^n$:

$$P_{\mathrm{error}} \le M^{\rho}\sum_{y_1^n}\sum_{x_1^n} Q(x_1^n)\, p(y_1^n \mid x_1^n)^{1 - s\rho}\left(\sum_{x_2^n} Q(x_2^n)\, p(y_1^n \mid x_2^n)^{s}\right)^{\rho}.$$

Choosing $s = 1 - s\rho$, that is, $s = \frac{1}{1+\rho}$, and combining the two sums over $x_1^n$ and $x_2^n$ (which then have the same form) in the above formula:

$$P_{\mathrm{error}} \le M^{\rho}\sum_{y_1^n}\left(\sum_{x_1^n} Q(x_1^n)\, p(y_1^n \mid x_1^n)^{\frac{1}{1+\rho}}\right)^{1 + \rho}.$$

Using the independence of the elements of the codeword and the discrete memoryless nature of the channel:

$$P_{\mathrm{error}} \le M^{\rho}\prod_{i=1}^{n}\sum_{y_i}\left(\sum_{x_i} Q(x_i)\, p(y_i \mid x_i)^{\frac{1}{1+\rho}}\right)^{1 + \rho}.$$

Using the fact that each element of the codeword is identically distributed and thus stationary:

$$P_{\mathrm{error}} \le M^{\rho}\left(\sum_{y}\left(\sum_{x} Q(x)\, p(y \mid x)^{\frac{1}{1+\rho}}\right)^{1 + \rho}\right)^{n}.$$

Replacing $M$ by $2^{nR}$ and defining

$$E_o(\rho, Q) = -\log_2\left(\sum_{y}\left(\sum_{x} Q(x)\, p(y \mid x)^{\frac{1}{1+\rho}}\right)^{1 + \rho}\right),$$

the probability of error becomes

$$P_{\mathrm{error}} \le 2^{-n\left(E_o(\rho, Q) - \rho R\right)}.$$
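As a concrete instance (a standard worked example, not part of the derivation above), consider a binary symmetric channel with crossover probability $p$ and the uniform input distribution $Q(0) = Q(1) = \frac{1}{2}$. By symmetry the two output values contribute equally, so

$$\sum_{y}\left(\sum_{x} Q(x)\, p(y \mid x)^{\frac{1}{1+\rho}}\right)^{1+\rho} = 2\left(\frac{p^{\frac{1}{1+\rho}} + (1-p)^{\frac{1}{1+\rho}}}{2}\right)^{1+\rho} = 2^{-\rho}\left(p^{\frac{1}{1+\rho}} + (1-p)^{\frac{1}{1+\rho}}\right)^{1+\rho},$$

which gives the closed form

$$E_o(\rho, Q) = \rho - (1+\rho)\log_2\left(p^{\frac{1}{1+\rho}} + (1-p)^{\frac{1}{1+\rho}}\right)$$

bits per channel use.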

$Q$ and $\rho$ should be chosen so that the bound is tightest. Thus, the error exponent can be defined as

$$E_r(R) = \max_{Q}\max_{\rho \in [0,1]}\left(E_o(\rho, Q) - \rho R\right).$$
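The maximization can be carried out numerically. The following Python sketch (my own illustration; the function names and the grid search over $\rho$ are assumptions, and the input distribution is held at the uniform one, which is optimal for the symmetric channel used here) evaluates the exponent for a binary symmetric channel:

import numpy as np

def gallager_E0(rho, Q, W):
    """E_o(rho, Q) in bits for input distribution Q and channel matrix W[x, y]."""
    inner = (Q[:, None] * W ** (1.0 / (1.0 + rho))).sum(axis=0)  # sum over inputs x
    return -np.log2((inner ** (1.0 + rho)).sum())                # sum over outputs y

def random_coding_exponent(R, Q, W, rhos=np.linspace(0.0, 1.0, 1001)):
    """max over rho in [0, 1] of E_o(rho, Q) - rho * R, with Q held fixed."""
    return max(gallager_E0(r, Q, W) - r * R for r in rhos)

# Binary symmetric channel with crossover probability 0.1; capacity is about 0.531 bits.
p = 0.1
W = np.array([[1 - p, p], [p, 1 - p]])
Q = np.array([0.5, 0.5])
print(random_coding_exponent(0.3, Q, W))  # positive, since R = 0.3 is below capacity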

Error exponent in source coding

For time-invariant discrete memoryless sources

The source coding theorem states that for any $\varepsilon > 0$ and any discrete-time i.i.d. source $X$, and for any rate $R$ greater than the entropy of the source, there is a large enough $n$ and an encoder that takes $n$ i.i.d. repetitions of the source, $X_1^n$, and maps them to $nR$ binary bits such that the source symbols $X_1^n$ are recoverable from the binary bits with probability at least $1 - \varepsilon$.

Let $M = e^{nR}$ be the total number of possible messages (the rate $R$ is measured in nats). Next, map each of the possible source output sequences to one of the messages randomly, using a uniform distribution and independently of everything else. When a source sequence $X_1^n$ is generated, the corresponding message $m = f(X_1^n)$ is then transmitted to the destination. The message gets decoded to one of the possible source strings. In order to minimize the probability of error, the decoder decodes to the source sequence $X_1^n$ that maximizes $P(X_1^n \mid A_m)$, where $A_m$ denotes the event that message $m$ was transmitted. This rule is equivalent to finding the source sequence $X_1^n$ among the set of source sequences that map to message $m$ that maximizes $P(X_1^n)$. This reduction follows from the fact that the messages were assigned randomly and independently of everything else.
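A minimal Python sketch of this random-binning scheme (my own illustration; the Bernoulli source, the parameter values, and the variable names are assumptions) maps every possible source sequence to a uniformly random message and decodes to the most probable sequence in the bin of the received message:

import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters: a Bernoulli(0.1) source, block length n,
# and rate R in nats, so that M = exp(n * R) messages (bins).
n, R, q = 12, 0.6, 0.1
M = int(np.exp(n * R))

# Enumerate all 2^n binary source sequences and their probabilities.
seqs = np.array([[(i >> k) & 1 for k in range(n)] for i in range(2 ** n)])
probs = (q ** seqs * (1 - q) ** (1 - seqs)).prod(axis=1)

# Random binning: each sequence is mapped to one of the M messages uniformly at random.
bins = rng.integers(0, M, size=2 ** n)

# Encode one source realization and decode to the most probable sequence in its bin,
# which is the optimal rule described above.
x_index = int(rng.choice(2 ** n, p=probs))
m = bins[x_index]
candidates = np.flatnonzero(bins == m)
decoded = candidates[np.argmax(probs[candidates])]
print("decoding error:", decoded != x_index)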

Thus, as an example of when an error occurs, suppose that the source sequence $X_1^n(1)$ was mapped to message $m$, as was the source sequence $X_1^n(2)$. If $X_1^n(1)$ was generated at the source, but $P(X_1^n(2)) > P(X_1^n(1))$, then an error occurs.

Let $S_i$ denote the event that the source sequence $X_1^n(i)$ was generated at the source, so that $P(S_i) = P(X_1^n(i))$. Then the probability of error can be broken down as $P(E) = \sum_{i} P(E \mid S_i)\, P(S_i)$. Thus, attention can be focused on finding an upper bound to $P(E \mid S_i)$.

Let $A_{i'}$ denote the event that the source sequence $X_1^n(i')$ was mapped to the same message as the source sequence $X_1^n(i)$ and that $P(X_1^n(i')) \ge P(X_1^n(i))$. Thus, letting $X_{i,i'}$ denote the event that the two source sequences $i$ and $i'$ map to the same message, we have that

$$P(A_{i'}) = P\left(X_{i,i'} \cap \left\{P\left(X_1^n(i')\right) \ge P\left(X_1^n(i)\right)\right\}\right),$$

and, using the fact that $P(X_{i,i'}) = \frac{1}{M}$ and that this event is independent of everything else, we have that

$$P(A_{i'}) = \frac{1}{M}\, P\left(P\left(X_1^n(i')\right) \ge P\left(X_1^n(i)\right)\right).$$

A simple upper bound for the term on the left can be established as

$$P\left(P\left(X_1^n(i')\right) \ge P\left(X_1^n(i)\right)\right) \le \left(\frac{P\left(X_1^n(i')\right)}{P\left(X_1^n(i)\right)}\right)^{s}$$

for some arbitrary real number $s > 0$. This upper bound can be verified by noting that $P\left(P\left(X_1^n(i')\right) \ge P\left(X_1^n(i)\right)\right)$ equals either $1$ or $0$, because the probabilities of a given input sequence are completely deterministic. Thus, if $P\left(X_1^n(i')\right) \ge P\left(X_1^n(i)\right)$, then $\left(\frac{P\left(X_1^n(i')\right)}{P\left(X_1^n(i)\right)}\right)^{s} \ge 1$, so that the inequality holds in that case. The inequality holds in the other case as well because

$$\left(\frac{P\left(X_1^n(i')\right)}{P\left(X_1^n(i)\right)}\right)^{s} \ge 0$$

for all possible source strings. Thus, combining everything and introducing some $\rho \in [0,1]$, we have that

$$P(E \mid S_i) \le P\left(\bigcup_{i' \ne i} A_{i'}\right) \le \left(\sum_{i' \ne i} P(A_{i'})\right)^{\rho} \le \left(\frac{1}{M}\sum_{i' \ne i}\left(\frac{P\left(X_1^n(i')\right)}{P\left(X_1^n(i)\right)}\right)^{s}\right)^{\rho},$$

where the middle inequality is a variation on the union bound, namely $P\left(\bigcup_{i'} A_{i'}\right) \le \min\left(1, \sum_{i'} P(A_{i'})\right) \le \left(\sum_{i'} P(A_{i'})\right)^{\rho}$ for $\rho \in [0,1]$. Finally, applying this upper bound to the summation for $P(E)$, we have that:

$$P(E) = \sum_{i} P(E \mid S_i)\, P(S_i) \le \sum_{i} P\left(X_1^n(i)\right)\left(\frac{1}{M}\sum_{i'}\left(\frac{P\left(X_1^n(i')\right)}{P\left(X_1^n(i)\right)}\right)^{s}\right)^{\rho},$$

where the inner sum can now be taken over all $i'$ because that only increases the bound. This ultimately yields

$$P(E) \le \frac{1}{M^{\rho}}\sum_{i} P\left(X_1^n(i)\right)^{1 - s\rho}\left(\sum_{i'} P\left(X_1^n(i')\right)^{s}\right)^{\rho}.$$

Now, for simplicity, let $1 - s\rho = s$, so that $s = \frac{1}{1+\rho}$. Substituting this value of $s$ into the above bound on the probability of error, and using the fact that $i'$ is just a dummy variable in the sum, gives the following as an upper bound on the probability of error:

$$P(E) \le \frac{1}{M^{\rho}}\left(\sum_{i} P\left(X_1^n(i)\right)^{\frac{1}{1+\rho}}\right)^{1 + \rho}.$$

Here $M = e^{nR}$, and each of the components of $X_1^n(i)$ is independent. Thus, simplifying the above equation yields

$$P(E) \le \exp\left(-n\left[\rho R - (1+\rho)\ln\left(\sum_{x} P(x)^{\frac{1}{1+\rho}}\right)\right]\right).$$
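For instance (a worked special case, not in the original text), for a Bernoulli source with $P(1) = q$ and $P(0) = 1 - q$, the inner sum is simply $q^{\frac{1}{1+\rho}} + (1-q)^{\frac{1}{1+\rho}}$, so the bound reads

$$P(E) \le \exp\left(-n\left[\rho R - (1+\rho)\ln\left(q^{\frac{1}{1+\rho}} + (1-q)^{\frac{1}{1+\rho}}\right)\right]\right).$$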

The term in the exponent should be maximized over $\rho$ in order to obtain the tightest upper bound on the probability of error.

Letting $E_0(\rho) = (1+\rho)\ln\left(\sum_{x} P(x)^{\frac{1}{1+\rho}}\right)$, we see that the error exponent for the source coding case is:

$$E_s(R) = \max_{\rho \in [0,1]}\left[\rho R - E_0(\rho)\right].$$
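As in the channel coding case, the maximization over $\rho$ can be done numerically. The following Python sketch (my own illustration; the function names and the grid search are assumptions) evaluates the source coding exponent for a Bernoulli source:

import numpy as np

def source_E0(rho, P):
    """(1 + rho) * ln( sum_x P(x)^(1/(1+rho)) ), in nats."""
    return (1.0 + rho) * np.log((P ** (1.0 / (1.0 + rho))).sum())

def source_coding_exponent(R, P, rhos=np.linspace(0.0, 1.0, 1001)):
    """max over rho in [0, 1] of rho * R - E_0(rho), with R in nats."""
    return max(r * R - source_E0(r, P) for r in rhos)

# Bernoulli(0.1) source: its entropy is about 0.325 nats per symbol.
P = np.array([0.1, 0.9])
print(source_coding_exponent(0.5, P))  # positive, since R = 0.5 nats exceeds the entropy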
