Gibbs' inequality

In information theory, Gibbs' inequality is a statement about the information entropy of a discrete probability distribution. Several other bounds on the entropy of probability distributions are derived from Gibbs' inequality, including Fano's inequality. It was first presented by J. Willard Gibbs in the 19th century.

Gibbs' inequality

Suppose that $P = \{p_1, \ldots, p_n\}$ is a discrete probability distribution. Then for any other probability distribution $Q = \{q_1, \ldots, q_n\}$, the following inequality between non-negative quantities (since the $p_i$ and $q_i$ are between zero and one) holds:[1]:68

$$-\sum_{i=1}^{n} p_i \log_2 p_i \leq -\sum_{i=1}^{n} p_i \log_2 q_i$$

with equality if and only if $p_i = q_i$ for all $i$. Put in words, the information entropy of a distribution P is less than or equal to its cross entropy with any other distribution Q.

The difference between the two quantities is the Kullback–Leibler divergence or relative entropy, so the inequality can also be written:[2]:34

$$D_{\mathrm{KL}}(P \| Q) \equiv \sum_{i=1}^{n} p_i \log_2 \frac{p_i}{q_i} \geq 0.$$

Note that the use of base-2 logarithms is optional, and allows one to refer to the quantity on each side of the inequality as an "average surprisal" measured in bits.
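As a quick numerical illustration of the statement above, the following Python sketch computes the entropy of P, its cross entropy with Q, and their difference (the Kullback–Leibler divergence) in bits. The function names `entropy`, `cross_entropy` and `kl_divergence`, as well as the example distributions, are assumptions chosen for illustration and do not come from the cited references.

```python
import math

def entropy(p):
    """Shannon entropy of p in bits; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross entropy of p relative to q in bits.

    Returns infinity if q_i = 0 for some i with p_i > 0.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(q_i) is taken to be 0
        if qi == 0:
            return math.inf
        total -= pi * math.log2(qi)
    return total

def kl_divergence(p, q):
    """Difference between cross entropy and entropy; Gibbs' inequality says it is >= 0."""
    return cross_entropy(p, q) - entropy(p)

# Illustrative distributions (assumed for this example only).
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(entropy(p))           # 1.5
print(cross_entropy(p, q))  # 1.75
print(kl_divergence(p, q))  # 0.25
```

Swapping in `q = p` makes the two quantities coincide, matching the equality condition.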

Proof

For simplicity, we prove the statement using the natural logarithm (ln). Because

$$\log_b a = \frac{\ln a}{\ln b},$$

the particular logarithm base $b > 1$ that we choose only scales the relationship by the factor $1 / \ln b$.

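Let $I$ denote the set of all $i$ for which $p_i$ is non-zero. Then, since $\ln x \leq x - 1$ for all $x > 0$, with equality if and only if $x = 1$, we have:

$$-\sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq -\sum_{i \in I} p_i \left( \frac{q_i}{p_i} - 1 \right) = -\sum_{i \in I} q_i + \sum_{i \in I} p_i = -\sum_{i \in I} q_i + 1 \geq 0.$$

The last inequality is a consequence of the $p_i$ and $q_i$ being part of a probability distribution. Specifically, the non-zero $p_i$ sum to 1. Some non-zero $q_i$, however, may have been excluded, since the choice of indices is conditioned upon the $p_i$ being non-zero. Therefore, the sum of the $q_i$ over $I$ may be less than 1.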

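So far, over the index set $I$, we have:

$$-\sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq 0,$$

or equivalently

$$-\sum_{i \in I} p_i \ln q_i \geq -\sum_{i \in I} p_i \ln p_i.$$

Both sums can be extended to all $i = 1, \ldots, n$, i.e. including $p_i = 0$, by recalling that the expression $p \ln p$ tends to 0 as $p$ tends to 0, and $-\ln q$ tends to $\infty$ as $q$ tends to 0. We arrive at

$$-\sum_{i=1}^{n} p_i \ln q_i \geq -\sum_{i=1}^{n} p_i \ln p_i.$$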

For equality to hold, we require

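  1. $\frac{q_i}{p_i} = 1$ for all $i \in I$, so that the equality $\ln \frac{q_i}{p_i} = \frac{q_i}{p_i} - 1$ holds,
  2. and $\sum_{i \in I} q_i = 1$, which means $q_i = 0$ if $i \notin I$, that is, $q_i = 0$ if $p_i = 0$.

This can happen if and only if $p_i = q_i$ for $i = 1, \ldots, n$.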
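The chain of inequalities used in the proof can be traced numerically. The sketch below is only an illustration under assumed values: the distributions `p` and `q` are made up, with `q` deliberately placing mass on an index where `p` is zero, so that the sum of the $q_i$ over the support of `p` is strictly less than 1.

```python
import math

# Assumed example: q puts half of its mass where p is zero.
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]

# Index set of the non-zero p_i.
support = [i for i, pi in enumerate(p) if pi > 0]

# Elementary bound ln x <= x - 1, checked on a small grid of x > 0.
grid = [k / 10 for k in range(1, 31)]
bound_holds = all(math.log(x) <= x - 1 for x in grid)

# Left-hand side of the chain: -sum p_i ln(q_i / p_i) over the support.
lhs = -sum(p[i] * math.log(q[i] / p[i]) for i in support)

# After applying ln x <= x - 1 term by term.
middle = -sum(p[i] * (q[i] / p[i] - 1) for i in support)

# Mass of q on the support of p; the chain ends at 1 - q_mass >= 0.
q_mass = sum(q[i] for i in support)

print(bound_holds)          # True
print(lhs >= middle >= 0)   # True
print(middle, 1 - q_mass)   # 0.5 0.5
```

Here the final step is a strict inequality because `q_mass` is 0.5 rather than 1, which is exactly the case ruled out by the second equality condition.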

Alternative proofs

The result can alternatively be proved using Jensen's inequality, the log sum inequality, or the fact that the Kullback-Leibler divergence is a form of Bregman divergence. Below we give a proof based on Jensen's inequality:

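Because log is a concave function, we have that:

$$\sum_i p_i \log \frac{q_i}{p_i} \leq \log \sum_i p_i \frac{q_i}{p_i} = \log \sum_i q_i \leq 0,$$

where the sums are taken over the indices $i$ for which $p_i$ is non-zero, the first inequality is due to Jensen's inequality, and the last inequality holds for the same reason given in the above proof: the $q_i$ over those indices sum to at most 1.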

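Furthermore, since log is strictly concave, by the equality condition of Jensen's inequality we get equality when

$$\frac{q_1}{p_1} = \frac{q_2}{p_2} = \cdots = \frac{q_n}{p_n}$$

and

$$\sum_i q_i = 1.$$

Suppose that this ratio is $\sigma$; then we have that

$$1 = \sum_i q_i = \sum_i \sigma p_i = \sigma,$$

where we use the fact that $p, q$ are probability distributions. Therefore, equality happens when $p = q$.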
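The Jensen step can also be checked on a concrete case. The sketch below assumes two made-up distributions with every $p_i$ non-zero and compares the weighted mean of the logarithms of the ratios $q_i / p_i$ with the logarithm of their weighted mean; the variable names are illustrative only.

```python
import math

# Assumed example distributions with all p_i > 0.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# Ratios q_i / p_i, weighted by p_i in both expressions below.
ratios = [qi / pi for pi, qi in zip(p, q)]

# Weighted mean of the logs: sum_i p_i * log2(q_i / p_i).
mean_of_logs = sum(pi * math.log2(r) for pi, r in zip(p, ratios))

# Log of the weighted mean: log2(sum_i p_i * (q_i / p_i)) = log2(sum_i q_i).
log_of_mean = math.log2(sum(pi * r for pi, r in zip(p, ratios)))

print(mean_of_logs, log_of_mean)  # -0.25 0.0
```

The gap between the two values is the Kullback–Leibler divergence of P from Q in bits; it closes only when all the ratios are equal.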

Corollary

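The entropy of $P$ is bounded by:[1]:68

$$H(p_1, \ldots, p_n) \leq \log n.$$

The proof is trivial: simply set $q_i = 1/n$ for all $i$, so that the cross entropy on the right-hand side of Gibbs' inequality becomes $-\sum_i p_i \log \frac{1}{n} = \log n$.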
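A minimal sketch of the corollary, assuming base-2 logarithms and a handful of made-up distributions: each entropy is compared against $\log_2 n$, and the uniform distribution attains the bound. The helper name `entropy_bits` is hypothetical.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Assumed example distributions of different shapes.
examples = [
    [0.7, 0.2, 0.1],           # skewed
    [0.25, 0.25, 0.25, 0.25],  # uniform: attains the bound
    [1.0, 0.0, 0.0],           # deterministic: entropy 0
]

results = []
for p in examples:
    bound = math.log2(len(p))  # log n for n outcomes
    h = entropy_bits(p)
    results.append((h, bound))
    print(f"H = {h:.4f} bits, log2(n) = {bound:.4f} bits")
```

For the uniform case the two numbers agree, consistent with the equality condition $p_i = q_i = 1/n$.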

References

  1. Pierre Bremaud (6 December 2012). An Introduction to Probabilistic Modeling. Springer Science & Business Media. ISBN 978-1-4612-1046-7.
  2. David J. C. MacKay (25 September 2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press. ISBN 978-0-521-64298-9.