Inequalities are very important in the study of information theory. There are a number of different contexts in which these inequalities appear.
Consider a tuple of finitely (or at most countably) supported random variables on the same probability space. There are 2n subsets, for which (joint) entropies can be computed. For example, when n = 2, we may consider the entropies and . They satisfy the following inequalities (which together characterize the range of the marginal and joint entropies of two random variables):
In fact, these can all be expressed as special cases of a single inequality involving the conditional mutual information , namely
where , , and each denote the joint distribution of some arbitrary (possibly empty) subset of our collection of random variables. Inequalities that can be derived as linear combinations of this are known as Shannon-type inequalities.
For larger there are further restrictions on possible values of entropy. To make this precise, a vector in indexed by subsets of is said to be entropic if there is a joint, discrete distribution of n random variables such that is their joint entropy, for each subset . The set of entropic vectors is denoted , following the notation of Yeung. [1] It is not closed nor convex for , but its topological closure is known to be convex and hence it can be characterized by the (infinitely many) linear inequalities satisfied by all entropic vectors, called entropic inequalities .
The set of all vectors that satisfy Shannon-type inequalities (but not necessarily other entropic inequalities) contains . This containment is strict for and further inequalities are known as non-Shannon type inequalities. Zhang and Yeung reported the first non-Shannon-type inequality, [2] often referred to as the Zhang-Yeung inequality. Matus [3] proved that no finite set of inequalities can characterize (by linear combinations) all entropic inequalities. In other words, the region is not a polytope.
A great many important inequalities in information theory are actually lower bounds for the Kullback–Leibler divergence. Even the Shannon-type inequalities can be considered part of this category, since the interaction information can be expressed as the Kullback–Leibler divergence of the joint distribution with respect to the product of the marginals, and thus these inequalities can be seen as a special case of Gibbs' inequality.
On the other hand, it seems to be much more difficult to derive useful upper bounds for the Kullback–Leibler divergence. This is because the Kullback–Leibler divergence DKL(P||Q) depends very sensitively on events that are very rare in the reference distribution Q. DKL(P||Q) increases without bound as an event of finite non-zero probability in the distribution P becomes exceedingly rare in the reference distribution Q, and in fact DKL(P||Q) is not even defined if an event of non-zero probability in P has zero probability in Q. (Hence the requirement that P be absolutely continuous with respect to Q.)
This fundamental inequality states that the Kullback–Leibler divergence is non-negative.
Another inequality concerning the Kullback–Leibler divergence is known as Kullback's inequality. [4] If P and Q are probability distributions on the real line with P absolutely continuous with respect to Q, and whose first moments exist, then
where is the large deviations rate function , i.e. the convex conjugate of the cumulant-generating function, of Q, and is the first moment of P.
The Cramér–Rao bound is a corollary of this result.
Pinsker's inequality relates Kullback–Leibler divergence and total variation distance. It states that if P, Q are two probability distributions, then
where
is the Kullback–Leibler divergence in nats and
is the total variation distance.
In 1957, [5] Hirschman showed that for a (reasonably well-behaved) function such that and its Fourier transform the sum of the differential entropies of and is non-negative, i.e.
Hirschman conjectured, and it was later proved, [6] that a sharper bound of which is attained in the case of a Gaussian distribution, could replace the right-hand side of this inequality. This is especially significant since it implies, and is stronger than, Weyl's formulation of Heisenberg's uncertainty principle.
Given discrete random variables , , and , such that takes values only in the interval [−1, 1] and is determined by (such that ), we have [7] [8]
relating the conditional expectation to the conditional mutual information. This is a simple consequence of Pinsker's inequality. (Note: the correction factor log 2 inside the radical arises because we are measuring the conditional mutual information in bits rather than nats.)
Several machine based proof checker algorithms are now available. Proof checker algorithms typically verify the inequalities as either true or false. More advanced proof checker algorithms can produce proof or counterexamples. [9] ITIP is a Matlab based proof checker for all Shannon type Inequalities. Xitip is an open source, faster version of the same algorithm implemented in C with a graphical front end. Xitip also has a built in language parsing feature which support a broader range of random variable descriptions as input. AITIP and oXitip are cloud based implementations for validating the Shannon type inequalities. oXitip uses GLPK optimizer and has a C++ backend based on Xitip with a web based user interface. AITIP uses Gurobi solver for optimization and a mix of python and C++ in the backend implementation. It can also provide the canonical break down of the inequalities in terms of basic Information measures. [9] Quantum information-theoretic inequalities can be checked by the contraction map proof method [10] .
Information theory is the mathematical study of the quantification, storage, and communication of information. The field was established and put on a firm footing by Claude Shannon in the 1940s, though early contributions were made in the 1920s through the works of Harry Nyquist and Ralph Hartley. It is at the intersection of electronic engineering, mathematics, statistics, computer science, neurobiology, physics, and electrical engineering.
In information theory, the entropy of a random variable quantifies the average level of uncertainty or information associated with the variable's potential states or possible outcomes. This measures the expected amount of information needed to describe the state of the variable, considering the distribution of probabilities across all potential states. Given a discrete random variable , which takes values in the set and is distributed according to , the entropy is where denotes the sum over the variable's possible values. The choice of base for , the logarithm, varies for different applications. Base 2 gives the unit of bits, while base e gives "natural units" nat, and base 10 gives units of "dits", "bans", or "hartleys". An equivalent definition of entropy is the expected value of the self-information of a variable.
In quantum mechanics, information theory, and Fourier analysis, the entropic uncertainty or Hirschman uncertainty is defined as the sum of the temporal and spectral Shannon entropies. It turns out that Heisenberg's uncertainty principle can be expressed as a lower bound on the sum of these entropies. This is stronger than the usual statement of the uncertainty principle in terms of the product of standard deviations.
In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted , is a type of statistical distance: a measure of how one reference probability distribution P is different from a second probability distribution Q. Mathematically, it is defined as
In information theory, the Rényi entropy is a quantity that generalizes various notions of entropy, including Hartley entropy, Shannon entropy, collision entropy, and min-entropy. The Rényi entropy is named after Alfréd Rényi, who looked for the most general way to quantify information while preserving additivity for independent events. In the context of fractal dimension estimation, the Rényi entropy forms the basis of the concept of generalized dimensions.
In information theory, the cross-entropy between two probability distributions and , over the same underlying set of events, measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated probability distribution , rather than the true distribution .
In information theory, Gibbs' inequality is a statement about the information entropy of a discrete probability distribution. Several other bounds on the entropy of probability distributions are derived from Gibbs' inequality, including Fano's inequality. It was first presented by J. Willard Gibbs in the 19th century.
Differential entropy is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.
In information theory, Fano's inequality relates the average information lost in a noisy channel to the probability of the categorization error. It was derived by Robert Fano in the early 1950s while teaching a Ph.D. seminar in information theory at MIT, and later recorded in his 1961 textbook.
In probability theory and statistics, the Jensen–Shannon divergence, named after Johan Jensen and Claude Shannon, is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad) or total divergence to the average. It is based on the Kullback–Leibler divergence, with some notable differences, including that it is symmetric and it always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen–Shannon distance. The similarity between the distributions is greater when the Jensen-Shannon distance is closer to zero.
This article discusses how information theory is related to measure theory.
In quantum information theory, quantum relative entropy is a measure of distinguishability between two quantum states. It is the quantum mechanical analog of relative entropy.
The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon, based on the binary logarithm. Although "bit" is more frequently used in place of "shannon", its name is not distinguished from the bit as used in data-processing to refer to a binary value or stream regardless of its entropy Other units include the nat, based on the natural logarithm, and the hartley, based on the base 10 or common logarithm.
In information theory, the entropy power inequality (EPI) is a result that relates to so-called "entropy power" of random variables. It shows that the entropy power of suitably well-behaved random variables is a superadditive function. The entropy power inequality was proved in 1948 by Claude Shannon in his seminal paper "A Mathematical Theory of Communication". Shannon also provided a sufficient condition for equality to hold; Stam (1959) showed that the condition is in fact necessary.
The entropic vector or entropic function is a concept arising in information theory. It represents the possible values of Shannon's information entropy that subsets of one set of random variables may take. Understanding which vectors are entropic is a way to represent all possible inequalities between entropies of various subsets. For example, for any two random variables , their joint entropy is at most the sum of the entropies of and of :
In probability theory, particularly information theory, the conditional mutual information is, in its most basic form, the expected value of the mutual information of two random variables given the value of a third.
In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.
In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function. If P and Q are probability distributions on the real line, such that P is absolutely continuous with respect to Q, i.e. P << Q, and whose first moments exist, then where is the rate function, i.e. the convex conjugate of the cumulant-generating function, of , and is the first moment of
The log sum inequality is used for proving theorems in information theory.
In information geometry, a divergence is a kind of statistical distance: a binary function which establishes the separation from one probability distribution to another on a statistical manifold.
{{cite book}}
: CS1 maint: location missing publisher (link)