Inequalities in information theory

Inequalities are of fundamental importance in information theory, and they appear in a number of different contexts, several of which are surveyed below.

Entropic inequalities

Consider a tuple $X_1, X_2, \dots, X_n$ of $n$ finitely (or at most countably) supported random variables on the same probability space. There are $2^n$ subsets, for which (joint) entropies can be computed. For example, when $n = 2$, we may consider the entropies $H(X_1)$, $H(X_2)$ and $H(X_1, X_2)$. They satisfy the following inequalities (which together characterize the range of the marginal and joint entropies of two random variables):

$H(X_1) \ge 0$
$H(X_2) \ge 0$
$H(X_1) \le H(X_1, X_2)$
$H(X_2) \le H(X_1, X_2)$
$H(X_1, X_2) \le H(X_1) + H(X_2)$
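As a quick numerical illustration (a minimal sketch, not part of the original article; the joint distribution below is an arbitrary example), these entropies and the five inequalities can be checked directly:

```python
# Compute H(X1), H(X2), H(X1, X2) in bits for a small joint distribution and
# verify the five inequalities above (small tolerances absorb rounding error).
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability table (zero entries contribute 0)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])       # P(X1 = i, X2 = j), an arbitrary example

h1  = entropy(joint.sum(axis=1))     # H(X1)
h2  = entropy(joint.sum(axis=0))     # H(X2)
h12 = entropy(joint)                 # H(X1, X2)

assert h1 >= 0 and h2 >= 0
assert h1 <= h12 + 1e-12 and h2 <= h12 + 1e-12
assert h12 <= h1 + h2 + 1e-12
```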

In fact, these can all be expressed as special cases of a single inequality involving the conditional mutual information, namely

$I(A; B \mid C) \ge 0,$

where $A$, $B$, and $C$ each denote the joint distribution of some arbitrary (possibly empty) subset of our collection of random variables. Inequalities that can be derived as linear combinations of this are known as Shannon-type inequalities.
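For instance (a standard derivation, spelled out here for illustration), taking $A = B = X_1$ and $C$ empty gives $I(X_1; X_1) = H(X_1) \ge 0$; taking $A = B = X_2$ and $C = X_1$ gives $I(X_2; X_2 \mid X_1) = H(X_2 \mid X_1) = H(X_1, X_2) - H(X_1) \ge 0$, i.e. $H(X_1) \le H(X_1, X_2)$; and taking $A = X_1$, $B = X_2$ and $C$ empty gives $I(X_1; X_2) = H(X_1) + H(X_2) - H(X_1, X_2) \ge 0$, i.e. $H(X_1, X_2) \le H(X_1) + H(X_2)$.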

For larger $n$ there are further restrictions on the possible values of entropy. To make this precise, a vector $h$ in $\mathbb{R}^{2^n}$ indexed by subsets of $\{1, \dots, n\}$ is said to be entropic if there is a joint, discrete distribution of $n$ random variables $X_1, \dots, X_n$ such that $h_S = H(X_S)$ is their joint entropy, for each subset $S$. The set of entropic vectors is denoted $\Gamma_n^*$, following the notation of Yeung. [1] It is not closed nor convex for $n \ge 3$, but its topological closure $\overline{\Gamma_n^*}$ is known to be convex and hence it can be characterized by the (infinitely many) linear inequalities satisfied by all entropic vectors, called entropic inequalities.
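For example (an illustrative case): for $n = 2$ and two independent fair bits $X_1, X_2$, the vector with entries $h_{\{1\}} = 1$, $h_{\{2\}} = 1$, $h_{\{1,2\}} = 2$ (in bits) is entropic, whereas the vector with entries $1, 1, 3$ is not, since it violates $H(X_1, X_2) \le H(X_1) + H(X_2)$.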

The set of all vectors that satisfy Shannon-type inequalities (but not necessarily other entropic inequalities) contains $\overline{\Gamma_n^*}$. This containment is strict for $n \ge 4$, and the further inequalities satisfied by entropic vectors are known as non-Shannon-type inequalities. Zhang and Yeung reported the first non-Shannon-type inequality, [2] often referred to as the Zhang–Yeung inequality. Matus [3] proved that no finite set of inequalities can characterize (by linear combinations) all entropic inequalities. In other words, the region $\overline{\Gamma_n^*}$ is not a polytope.

Lower bounds for the Kullback–Leibler divergence

A great many important inequalities in information theory are actually lower bounds for the Kullback–Leibler divergence. Even the Shannon-type inequalities can be considered part of this category, since the mutual information can be expressed as the Kullback–Leibler divergence of the joint distribution with respect to the product of the marginals, and thus these inequalities can be seen as a special case of Gibbs' inequality.
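Concretely (a standard identity, restated here for clarity):

$I(X; Y) = D_{\mathrm{KL}}\big(P_{(X,Y)} \,\|\, P_X \otimes P_Y\big) \ge 0,$

so the non-negativity of mutual information is exactly Gibbs' inequality applied to the joint distribution and the product of its marginals.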

On the other hand, it seems to be much more difficult to derive useful upper bounds for the Kullback–Leibler divergence. This is because the Kullback–Leibler divergence $D_{\mathrm{KL}}(P \| Q)$ depends very sensitively on events that are very rare in the reference distribution $Q$. $D_{\mathrm{KL}}(P \| Q)$ increases without bound as an event of finite non-zero probability in the distribution $P$ becomes exceedingly rare in the reference distribution $Q$, and in fact $D_{\mathrm{KL}}(P \| Q)$ is not even defined if an event of non-zero probability in $P$ has zero probability in $Q$. (Hence the requirement that $P$ be absolutely continuous with respect to $Q$.)
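The following short sketch (illustrative numbers only; the two-outcome distribution is an arbitrary choice) shows this blow-up as the reference probability of one outcome shrinks:

```python
# Illustrative only: D_KL(P || Q) grows without bound as an event with fixed
# probability under P becomes increasingly rare under the reference Q.
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.5])
for eps in (1e-1, 1e-3, 1e-6, 1e-9):
    q = np.array([1.0 - eps, eps])
    print(f"eps = {eps:.0e}   D_KL(P||Q) = {kl_divergence(p, q):.2f} nats")
```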

Gibbs' inequality

This fundamental inequality states that the Kullback–Leibler divergence is non-negative.
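In symbols, for discrete distributions $P = (p_1, \dots, p_m)$ and $Q = (q_1, \dots, q_m)$ (a standard statement of the inequality, added here for completeness):

$D_{\mathrm{KL}}(P \| Q) = \sum_i p_i \log \frac{p_i}{q_i} \ge 0,$

with equality if and only if $P = Q$.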

Kullback's inequality

Another inequality concerning the Kullback–Leibler divergence is known as Kullback's inequality. [4] If $P$ and $Q$ are probability distributions on the real line with $P$ absolutely continuous with respect to $Q$, and whose first moments exist, then

$D_{\mathrm{KL}}(P \| Q) \ge \Psi_Q^*(\mu'_1(P)),$

where $\Psi_Q^*$ is the large deviations rate function, i.e. the convex conjugate of the cumulant-generating function, of $Q$, and $\mu'_1(P)$ is the first moment of $P$.

The Cramér–Rao bound is a corollary of this result.
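For example (a standard worked case, added here for illustration and not taken from the cited reference): if $Q$ is the normal distribution $N(\theta, \sigma^2)$, its cumulant-generating function is $\Psi_Q(t) = \theta t + \tfrac{1}{2}\sigma^2 t^2$, whose convex conjugate is $\Psi_Q^*(x) = (x - \theta)^2 / (2\sigma^2)$, so Kullback's inequality reads

$D_{\mathrm{KL}}(P \| Q) \ge \frac{\big(\mu'_1(P) - \theta\big)^2}{2\sigma^2}$

for every $P$ with mean $\mu'_1(P)$ that is absolutely continuous with respect to $Q$.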

Pinsker's inequality

Pinsker's inequality relates Kullback–Leibler divergence and total variation distance. It states that if $P$, $Q$ are two probability distributions, then

$\delta(P, Q) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P \| Q)},$

where

$D_{\mathrm{KL}}(P \| Q)$

is the Kullback–Leibler divergence in nats and

$\delta(P, Q) = \sup_A \, |P(A) - Q(A)|$

is the total variation distance (the supremum being taken over measurable events $A$).
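As a quick sanity check (an illustrative sketch; the alphabet size and Dirichlet sampling are arbitrary choices), the inequality can be verified numerically on random discrete distributions:

```python
# Numerically check Pinsker's inequality on random pairs of discrete distributions.
import numpy as np

rng = np.random.default_rng(0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats (terms with p_i = 0 contribute 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """Total variation distance sup_A |P(A) - Q(A)| = 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * float(np.sum(np.abs(p - q)))

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert total_variation(p, q) <= np.sqrt(0.5 * kl_divergence(p, q)) + 1e-12
```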

Other inequalities

Hirschman uncertainty

In 1957, [5] Hirschman showed that for a (reasonably well-behaved) function $f : \mathbb{R} \to \mathbb{C}$ such that $\int_{-\infty}^{\infty} |f(x)|^2 \, dx = 1$ and its Fourier transform $g(y) = \int_{-\infty}^{\infty} f(x) e^{-2\pi i x y} \, dx$, the sum of the differential entropies of $|f|^2$ and $|g|^2$ is non-negative, i.e.

$-\int_{-\infty}^{\infty} |f(x)|^2 \log |f(x)|^2 \, dx \;-\; \int_{-\infty}^{\infty} |g(y)|^2 \log |g(y)|^2 \, dy \;\ge\; 0.$

Hirschman conjectured, and it was later proved, [6] that a sharper bound of $\log(e/2)$, which is attained in the case of a Gaussian distribution, could replace the right-hand side of this inequality. This is especially significant since it implies, and is stronger than, Weyl's formulation of Heisenberg's uncertainty principle.
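As a consistency check (a standard computation, included here for illustration): the Gaussian $f(x) = 2^{1/4} e^{-\pi x^2}$ satisfies $\int |f|^2 = 1$ and is its own Fourier transform under the convention above; $|f|^2$ is then the density of a normal distribution with variance $1/(4\pi)$, whose differential entropy is $\tfrac{1}{2}\log\!\big(2\pi e \cdot \tfrac{1}{4\pi}\big) = \tfrac{1}{2}\log(e/2)$. The two differential entropies therefore sum to exactly $\log(e/2)$, matching the sharper bound.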

Tao's inequality

Given discrete random variables $X$, $Y$, and $Y'$, such that $X$ takes values only in the interval $[-1, 1]$ and $Y'$ is determined by $Y$ (such that $H(Y' \mid Y) = 0$), we have [7] [8]

$\mathbb{E}\big(\big|\mathbb{E}(X \mid Y') - \mathbb{E}(X \mid Y)\big|\big) \le \sqrt{2 \, I(X; Y \mid Y') \, \log 2},$

relating the conditional expectation to the conditional mutual information. This is a simple consequence of Pinsker's inequality. (Note: the correction factor $\log 2$ inside the radical arises because we are measuring the conditional mutual information $I(X; Y \mid Y')$ in bits rather than nats.)

Machine-based proof checkers for information-theoretic inequalities

Several machine-based proof checkers for information-theoretic inequalities are now available. A proof checker typically verifies a candidate inequality as either true or false; more advanced checkers can also produce a proof or a counterexample. [9] ITIP is a MATLAB-based proof checker for all Shannon-type inequalities. Xitip is an open-source, faster version of the same algorithm implemented in C with a graphical front end; it also has a built-in language-parsing feature that supports a broader range of random-variable descriptions as input. AITIP and oXitip are cloud-based implementations for validating Shannon-type inequalities: oXitip uses the GLPK optimizer and has a C++ backend based on Xitip with a web-based user interface, while AITIP uses the Gurobi solver for optimization and a mix of Python and C++ in the backend. AITIP can also provide the canonical breakdown of an inequality in terms of basic information measures. [9]
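At their core, these Shannon-type checkers solve a linear program over entropy coordinates: a candidate inequality is Shannon-type exactly when it is implied by the elemental (Shannon) inequalities. The sketch below is a minimal illustration of that idea for only two random variables, using SciPy's linprog; it is not the actual ITIP/Xitip implementation, and the elemental-inequality matrix, the helper is_shannon_type, and the test inequalities are all hand-written for this toy case.

```python
# A minimal sketch (n = 2 only) of the linear-programming idea behind
# ITIP-style checkers. An inequality c . h >= 0 over entropy vectors
# h = (H(X1), H(X2), H(X1,X2)) is Shannon-type iff it holds on the whole
# cone cut out by the elemental inequalities, i.e. iff minimizing c . h
# over that cone is bounded below (the minimum is then 0 at h = 0).
import numpy as np
from scipy.optimize import linprog

# Elemental inequalities for n = 2, written as G @ h >= 0:
G = np.array([
    [ 0.0, -1.0,  1.0],   # H(X1|X2) = H(X1,X2) - H(X2)         >= 0
    [-1.0,  0.0,  1.0],   # H(X2|X1) = H(X1,X2) - H(X1)         >= 0
    [ 1.0,  1.0, -1.0],   # I(X1;X2) = H(X1) + H(X2) - H(X1,X2) >= 0
])

def is_shannon_type(c):
    """Return True if c . h >= 0 for every h in the Shannon cone (n = 2)."""
    # Minimize c . h subject to G @ h >= 0 (i.e. (-G) @ h <= 0), h unrestricted.
    res = linprog(c, A_ub=-G, b_ub=np.zeros(len(G)),
                  bounds=[(None, None)] * 3, method="highs")
    # The feasible set is a cone, so the LP is either bounded with optimum 0
    # (the inequality holds) or unbounded below (it fails somewhere).
    return res.status == 0 and abs(res.fun) < 1e-9

print(is_shannon_type([1.0, 1.0, -1.0]))   # H(X1) + H(X2) >= H(X1,X2): True
print(is_shannon_type([1.0, -1.0, 0.0]))   # H(X1) >= H(X2): False
```

For larger numbers of variables the elemental inequalities are generated systematically rather than written by hand, but the underlying cone structure of the problem is the same.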


References

  1. Yeung, R. W. (1997). "A framework for linear information inequalities". IEEE Transactions on Information Theory. 43 (6): 1924–1934. doi:10.1109/18.641556.
  2. Zhang, Z.; Yeung, R. W. (1998). "On characterization of entropy function via information inequalities". IEEE Transactions on Information Theory. 44 (4): 1440–1452. doi:10.1109/18.681320.
  3. Matus, F. (2007). "Infinitely many information inequalities". 2007 IEEE International Symposium on Information Theory.
  4. Fuchs, Aimé; Letta, Giorgio (1970). "L'Inégalité de Kullback. Application à la théorie de l'estimation". Séminaire de Probabilités IV, Université de Strasbourg. Lecture Notes in Mathematics. Vol. 124. Strasbourg. pp. 108–131. doi:10.1007/bfb0059338. ISBN 978-3-540-04913-5. MR 0267669.
  5. Hirschman, I. I. (1957). "A Note on Entropy". American Journal of Mathematics. 79 (1): 152–156. doi:10.2307/2372390. JSTOR 2372390.
  6. Beckner, W. (1975). "Inequalities in Fourier Analysis". Annals of Mathematics. 102 (6): 159–182. doi:10.2307/1970980. JSTOR 1970980.
  7. Tao, T. (2006). "Szemerédi's regularity lemma revisited". Contrib. Discrete Math. 1: 8–28. arXiv:math/0504472. Bibcode:2005math......4472T.
  8. Ahlswede, Rudolf (2007). "The final form of Tao's inequality relating conditional expectation and conditional mutual information". Advances in Mathematics of Communications. 1 (2): 239–242. doi:10.3934/amc.2007.1.239.
  9. Ho, S. W.; Ling, L.; Tan, C. W.; Yeung, R. W. (2020). "Proving and Disproving Information Inequalities: Theory and Scalable Algorithms". IEEE Transactions on Information Theory. 66 (9): 5525–5536. doi:10.1109/TIT.2020.2982642. S2CID 216530139.