Log sum inequality

The log sum inequality is used for proving theorems in information theory.

Statement

Let $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ be nonnegative numbers. Denote the sum of all $a_i$ by $a$ and the sum of all $b_i$ by $b$. The log sum inequality states that

$$\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \ge a \log\frac{a}{b},$$

with equality if and only if $\frac{a_i}{b_i}$ are equal for all $i$, in other words $a_i = c\, b_i$ for all $i$. [1]

(Take $a_i \log\frac{a_i}{b_i}$ to be $0$ if $a_i = 0$, and $\infty$ if $a_i > 0$ and $b_i = 0$. These are the limiting values obtained as the relevant number tends to $0$.) [1]
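
The inequality is straightforward to verify numerically. The following Python sketch (an illustrative check, not taken from the cited source; the example values and helper names are arbitrary) evaluates both sides, using the limiting conventions above.

    import math

    def a_log_a_over_b(a_i, b_i):
        # a_i * log(a_i / b_i), with the conventions 0*log(0/b_i) = 0 and
        # a_i*log(a_i/0) = infinity for a_i > 0.
        if a_i == 0:
            return 0.0
        if b_i == 0:
            return math.inf
        return a_i * math.log(a_i / b_i)

    def log_sum_gap(a_list, b_list):
        # Left-hand side minus right-hand side; the inequality says this is >= 0.
        a, b = sum(a_list), sum(b_list)
        lhs = sum(a_log_a_over_b(ai, bi) for ai, bi in zip(a_list, b_list))
        return lhs - a_log_a_over_b(a, b)

    print(log_sum_gap([1.0, 2.0, 3.0], [4.0, 1.0, 2.0]))  # positive
    print(log_sum_gap([2.0, 4.0], [1.0, 2.0]))            # approximately 0, since a_i/b_i is constant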

Proof

Notice that after setting $f(x) = x \log x$ we have

$$\begin{aligned}
\sum_{i=1}^n a_i \log\frac{a_i}{b_i}
  &= \sum_{i=1}^n b_i\, f\!\left(\frac{a_i}{b_i}\right)
   = b \sum_{i=1}^n \frac{b_i}{b}\, f\!\left(\frac{a_i}{b_i}\right) \\
  &\ge b\, f\!\left(\sum_{i=1}^n \frac{b_i}{b}\,\frac{a_i}{b_i}\right)
   = b\, f\!\left(\frac{1}{b}\sum_{i=1}^n a_i\right)
   = b\, f\!\left(\frac{a}{b}\right)
   = a \log\frac{a}{b},
\end{aligned}$$

where the inequality follows from Jensen's inequality since $\frac{b_i}{b} \ge 0$, $\sum_i \frac{b_i}{b} = 1$, and $f$ is convex. [1]
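
The inequality in the chain above is exactly Jensen's inequality for the weighted mean with weights $\frac{b_i}{b}$. A small Python sketch (illustrative only; the data are arbitrary assumptions) compares the two sides of that step.

    import math

    a = [1.0, 2.0, 3.0]
    b = [4.0, 1.0, 2.0]
    B = sum(b)                               # the quantity written b in the proof

    f = lambda x: x * math.log(x)            # convex on (0, infinity)
    weights = [bi / B for bi in b]           # nonnegative and summing to 1
    ratios = [ai / bi for ai, bi in zip(a, b)]

    mean_of_f = sum(w * f(x) for w, x in zip(weights, ratios))  # (1/b) * left-hand side
    f_of_mean = f(sum(w * x for w, x in zip(weights, ratios)))  # (1/b) * right-hand side
    assert mean_of_f >= f_of_mean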

Generalizations

The inequality remains valid for $n = \infty$, provided that $a < \infty$ and $b < \infty$.[citation needed] The proof above holds for any function $g$ such that $f(x) = x\,g(x)$ is convex, such as all continuous non-decreasing functions. Generalizations to non-decreasing functions other than the logarithm are given in Csiszár (2004).
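
For instance, with the hypothetical choice $g(x) = \sqrt{x}$ (so that $x\,g(x) = x^{3/2}$ is convex), the same argument gives $\sum_i a_i\, g\!\left(\frac{a_i}{b_i}\right) \ge a\, g\!\left(\frac{a}{b}\right)$. The sketch below (illustrative, not from Csiszár, 2004; the example values are arbitrary) checks this numerically.

    import math

    def generalized_gap(a_list, b_list, g):
        # sum_i a_i*g(a_i/b_i) minus a*g(a/b); nonnegative when x*g(x) is convex.
        a, b = sum(a_list), sum(b_list)
        lhs = sum(ai * g(ai / bi) for ai, bi in zip(a_list, b_list))
        return lhs - a * g(a / b)

    print(generalized_gap([1.0, 2.0, 3.0], [4.0, 1.0, 2.0], math.sqrt))  # >= 0
    print(generalized_gap([1.0, 2.0, 3.0], [4.0, 1.0, 2.0], math.log))   # the original inequality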

Another generalization is due to Dannan, Neff and Thiel, who showed that if $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$ are positive real numbers with $a_1 + a_2 + \cdots + a_n = a$, $b_1 + b_2 + \cdots + b_n = b$, and $k \ge 0$, then

$$\sum_{i=1}^n a_i \log\!\left(\frac{a_i}{b_i} + k\right) \ge a \log\!\left(\frac{a}{b} + k\right).$$ [2]
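
A quick numerical sanity check of this statement (the example values below are arbitrary assumptions, not taken from the paper):

    import math

    a = [1.0, 2.0, 3.0]
    b = [4.0, 1.0, 2.0]
    A, B = sum(a), sum(b)

    # Verify sum_i a_i*log(a_i/b_i + k) >= a*log(a/b + k) for several k >= 0.
    for k in (0.0, 0.5, 2.0, 10.0):
        lhs = sum(ai * math.log(ai / bi + k) for ai, bi in zip(a, b))
        rhs = A * math.log(A / B + k)
        assert lhs >= rhs, k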

Applications

The log sum inequality can be used to prove inequalities in information theory. Gibbs' inequality states that the Kullback–Leibler divergence is non-negative, and equal to zero precisely when its two arguments are equal. [3] One proof uses the log sum inequality.
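
In that proof, taking $a_i = p_i$ and $b_i = q_i$ for probability distributions $p$ and $q$ gives $D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log\frac{p_i}{q_i} \ge \left(\sum_i p_i\right) \log\frac{\sum_i p_i}{\sum_i q_i} = 1 \cdot \log\frac{1}{1} = 0$, since both distributions sum to 1. The Python sketch below (an illustrative check with made-up distributions) confirms the non-negativity numerically.

    import math

    def kl_divergence(p, q):
        # D(p || q) for discrete distributions, with the convention 0*log(0/q_i) = 0.
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.1, 0.4, 0.5]
    q = [0.3, 0.3, 0.4]
    assert kl_divergence(p, q) >= 0.0
    assert kl_divergence(p, p) == 0.0      # equality exactly when the arguments coincide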

The inequality can also be used to prove the convexity of the Kullback–Leibler divergence. [4]
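
Convexity here is joint in the pair of arguments: for $\lambda \in [0, 1]$, $D(\lambda p_1 + (1-\lambda) p_2 \,\|\, \lambda q_1 + (1-\lambda) q_2) \le \lambda D(p_1 \| q_1) + (1-\lambda) D(p_2 \| q_2)$, which follows by applying the two-term log sum inequality coordinatewise. A short numerical check (with assumed example distributions, not from the cited source):

    import math

    def kl(p, q):
        # D(p || q) for discrete distributions with matching support.
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def mix(x, y, lam):
        # Convex combination lam*x + (1 - lam)*y, componentwise.
        return [lam * xi + (1 - lam) * yi for xi, yi in zip(x, y)]

    p1, q1 = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
    p2, q2 = [0.6, 0.1, 0.3], [0.3, 0.5, 0.2]

    for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
        mixed = kl(mix(p1, p2, lam), mix(q1, q2, lam))
        bound = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
        assert mixed <= bound + 1e-12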

Notes

  1. Cover & Thomas (1991), p. 29.
  2. Dannan, F. M.; Neff, P.; Thiel, C. (2016). "On the sum of squared logarithms inequality and related inequalities" (PDF). Journal of Mathematical Inequalities. 10 (1): 1–17. doi:10.7153/jmi-10-01. Retrieved 12 January 2023.
  3. MacKay (2003), p. 34.
  4. Cover & Thomas (1991), p. 30.

References

  Cover, Thomas M.; Thomas, Joy A. (1991). Elements of Information Theory. New York: Wiley.
  Csiszár, Imre; Shields, Paul C. (2004). "Information Theory and Statistics: A Tutorial". Foundations and Trends in Communications and Information Theory. 1 (4): 417–528.
  MacKay, David J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.