Log sum inequality

The log sum inequality is an elementary inequality for nonnegative numbers that is used for proving theorems in information theory.

Statement

Let $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ be nonnegative numbers. Denote the sum of all $a_i$ by $a$ and the sum of all $b_i$ by $b$. The log sum inequality states that

$$\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \;\geq\; a \log\frac{a}{b},$$

with equality if and only if $\frac{a_i}{b_i}$ are equal for all $i$, in other words $a_i = c\,b_i$ for all $i$. [1]

(Take $a_i \log\frac{a_i}{b_i}$ to be $0$ if $a_i = 0$ and $\infty$ if $a_i > 0, b_i = 0$. These are the limiting values obtained as the relevant number tends to $0$.) [1]
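
As a quick sanity check, the following Python snippet (an illustrative sketch, not part of the original text; the helper name log_sum_lhs is ours) verifies the inequality on random nonnegative inputs and shows the equality case $a_i = c\,b_i$, using the conventions above for zero terms:

```python
# Illustrative numerical check of the log sum inequality (sketch, not from the article).
#   sum_i a_i * log(a_i / b_i)  >=  (sum_i a_i) * log(sum_i a_i / sum_i b_i)
import math
import random

def log_sum_lhs(a, b):
    """Left-hand side with the conventions 0*log(0/b_i) = 0 and a_i*log(a_i/0) = +inf."""
    total = 0.0
    for ai, bi in zip(a, b):
        if ai == 0:
            continue           # 0 * log(0 / b_i) is taken to be 0
        if bi == 0:
            return math.inf    # a_i * log(a_i / 0) is taken to be +infinity for a_i > 0
        total += ai * math.log(ai / bi)
    return total

random.seed(0)
for _ in range(1000):
    a = [random.uniform(0.0, 5.0) for _ in range(4)]
    b = [random.uniform(0.1, 5.0) for _ in range(4)]
    assert log_sum_lhs(a, b) >= sum(a) * math.log(sum(a) / sum(b)) - 1e-12

# Equality case: a_i proportional to b_i (here a_i = 2 * b_i).
b = [1.0, 2.0, 3.0]
a = [2.0 * bi for bi in b]
print(log_sum_lhs(a, b), sum(a) * math.log(sum(a) / sum(b)))  # both equal 12 * log 2
```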

Proof

Notice that after setting $f(x) = x \log x$ we have

$$\begin{aligned}
\sum_{i=1}^n a_i \log\frac{a_i}{b_i}
&= \sum_{i=1}^n b_i\, f\!\left(\frac{a_i}{b_i}\right)
 = b \sum_{i=1}^n \frac{b_i}{b}\, f\!\left(\frac{a_i}{b_i}\right) \\
&\geq b\, f\!\left(\sum_{i=1}^n \frac{b_i}{b}\,\frac{a_i}{b_i}\right)
 = b\, f\!\left(\frac{a}{b}\right)
 = a \log\frac{a}{b},
\end{aligned}$$

where the inequality follows from Jensen's inequality since $\frac{b_i}{b} \geq 0$, $\sum_{i=1}^n \frac{b_i}{b} = 1$, and $f$ is convex. [1]
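
The chain of equalities and the single Jensen step can also be checked numerically. The sketch below (an assumed example, not from the original text) evaluates each intermediate expression for one concrete choice of $a_i$, $b_i$:

```python
# Illustrative sketch: evaluate each step of the proof chain with f(x) = x * log(x).
import math

f = lambda x: x * math.log(x)

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]
A, B = sum(a), sum(b)

lhs    = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))      # sum_i a_i log(a_i/b_i)
step1  = sum(bi * f(ai / bi) for ai, bi in zip(a, b))             # = sum_i b_i f(a_i/b_i)
step2  = B * sum((bi / B) * f(ai / bi) for ai, bi in zip(a, b))   # = b * (weighted avg of f)
jensen = B * f(sum((bi / B) * (ai / bi) for ai, bi in zip(a, b))) # Jensen: step2 >= jensen
rhs    = A * math.log(A / B)                                      # = a log(a/b)

print(lhs, step1, step2)  # these three agree (same quantity rewritten)
print(jensen, rhs)        # these two agree, and are <= the value above
```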

Generalizations

The inequality remains valid for $n = \infty$ provided that $a < \infty$ and $b < \infty$.[citation needed] The proof above holds for any function $g$ such that $f(x) = x g(x)$ is convex, such as all continuous non-decreasing functions. Generalizations to non-decreasing functions other than the logarithm are given in Csiszár, 2004.
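
For instance, $g = \arctan$ is continuous and non-decreasing, and $x \arctan x$ is convex, so the same argument gives $\sum_i a_i\, g(a_i/b_i) \geq a\, g(a/b)$. A small numerical illustration (ours, not from the original text):

```python
# Illustrative check of the generalized inequality
#   sum_i a_i * g(a_i / b_i) >= a * g(a / b)   when f(x) = x * g(x) is convex.
# Here g = arctan: f(x) = x * arctan(x) has f''(x) = 2 / (1 + x**2)**2 > 0, so f is convex.
import math
import random

random.seed(1)
for _ in range(1000):
    a = [random.uniform(0.1, 5.0) for _ in range(5)]
    b = [random.uniform(0.1, 5.0) for _ in range(5)]
    lhs = sum(ai * math.atan(ai / bi) for ai, bi in zip(a, b))
    rhs = sum(a) * math.atan(sum(a) / sum(b))
    assert lhs >= rhs - 1e-12
```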

Another generalization is due to Dannan, Neff and Thiel, who showed that if $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$ are positive real numbers with $a_1 + a_2 + \cdots + a_n = a$ and $b_1 + b_2 + \cdots + b_n = b$, and $k \geq 0$, then $\sum_{i=1}^n \frac{a_i^2}{b_i + k} \geq \frac{a^2}{b + nk}$. [2]
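
The bound is easy to test numerically; a short sketch (ours, not from the cited paper) with random positive inputs:

```python
# Illustrative numerical check of the Dannan-Neff-Thiel bound:
#   sum_i a_i**2 / (b_i + k)  >=  a**2 / (b + n*k)   for positive a_i, b_i and k >= 0.
import random

random.seed(2)
for _ in range(1000):
    n = 5
    a = [random.uniform(0.1, 5.0) for _ in range(n)]
    b = [random.uniform(0.1, 5.0) for _ in range(n)]
    k = random.uniform(0.0, 3.0)
    lhs = sum(ai**2 / (bi + k) for ai, bi in zip(a, b))
    rhs = sum(a)**2 / (sum(b) + n * k)
    assert lhs >= rhs - 1e-12
```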

Applications

The log sum inequality can be used to prove inequalities in information theory. Gibbs' inequality states that the Kullback-Leibler divergence is non-negative, and equal to zero precisely if its arguments are equal. [3] One proof uses the log sum inequality.
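
That proof is short enough to sketch here (a standard argument, spelled out by us rather than quoted from the text): for probability distributions $p$ and $q$, apply the log sum inequality with $a_i = p(i)$ and $b_i = q(i)$, so that

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p(i) \log\frac{p(i)}{q(i)} \;\geq\; \Big(\sum_i p(i)\Big)\log\frac{\sum_i p(i)}{\sum_i q(i)} = 1 \cdot \log\frac{1}{1} = 0,$$

with equality if and only if $p(i)/q(i)$ is constant in $i$, that is, $p = q$.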

The inequality can also be used to prove convexity of the Kullback-Leibler divergence. [4]
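
A sketch of that argument (standard, supplied by us rather than quoted from the reference): for distribution pairs $(p_1, q_1)$, $(p_2, q_2)$ and $\lambda \in [0,1]$, apply the log sum inequality with $n = 2$ to each outcome $x$:

$$\big(\lambda p_1(x) + (1-\lambda) p_2(x)\big)\log\frac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)} \;\leq\; \lambda p_1(x)\log\frac{\lambda p_1(x)}{\lambda q_1(x)} + (1-\lambda) p_2(x)\log\frac{(1-\lambda) p_2(x)}{(1-\lambda) q_2(x)}.$$

Summing over $x$ and cancelling the factors $\lambda$ and $1-\lambda$ inside the logarithms yields

$$D_{\mathrm{KL}}\big(\lambda p_1 + (1-\lambda) p_2 \,\big\|\, \lambda q_1 + (1-\lambda) q_2\big) \;\leq\; \lambda\, D_{\mathrm{KL}}(p_1 \,\|\, q_1) + (1-\lambda)\, D_{\mathrm{KL}}(p_2 \,\|\, q_2),$$

i.e. the Kullback-Leibler divergence is jointly convex in its pair of arguments.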

Notes

  1. Cover & Thomas (1991), p. 29.
  2. Dannan, F. M.; Neff, P.; Thiel, C. (2016). "On the sum of squared logarithms inequality and related inequalities". Journal of Mathematical Inequalities. 10 (1): 1–17. doi:10.7153/jmi-10-01. Retrieved 12 January 2023.
  3. MacKay (2003), p. 34.
  4. Cover & Thomas (1991), p. 30.

