This article discusses how information theory (a branch of mathematics studying the transmission, processing and storage of information) is related to measure theory (a branch of mathematics related to integration and probability).
Many of the concepts in information theory have separate definitions and formulas for continuous and discrete cases. For example, entropy is usually defined for discrete random variables, whereas for continuous random variables the related concept of differential entropy, written $h(X)$, is used (see Cover and Thomas, 2006, chapter 8). Both these concepts are mathematical expectations, but the expectation is defined with an integral for the continuous case, and a sum for the discrete case.
These separate definitions can be more closely related in terms of measure theory. For discrete random variables, probability mass functions can be considered density functions with respect to the counting measure. Thinking of both the integral and the sum as integration on a measure space allows for a unified treatment.
Consider the formula for the differential entropy of a continuous random variable $X$ with range $\mathbb{X}$ and probability density function $f(x)$:

$$h(X) = -\int_{\mathbb{X}} f(x) \log f(x) \, dx .$$
This can usually be interpreted as the following Riemann–Stieltjes integral:

$$h(X) = -\int_{\mathbb{X}} f(x) \log f(x) \, d\mu(x) ,$$
where $\mu$ is the Lebesgue measure.
If instead $X$ is discrete, with range $\Omega$ a finite set, $f$ a probability mass function on $\Omega$, and $\nu$ the counting measure on $\Omega$, we can write:

$$H(X) = -\sum_{x \in \Omega} f(x) \log f(x) = -\int_{\Omega} f(x) \log f(x) \, d\nu(x) .$$
The integral expression and the general concept are identical to those of the continuous case; the only difference is the measure used. In both cases the probability density function $f$ is the Radon–Nikodym derivative of the probability measure with respect to the measure against which the integral is taken.
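To see the shared structure in computation, here is a minimal numerical sketch; the four-point mass function and the standard normal density are arbitrary illustrative choices, not part of the formalism above. The discrete entropy is the integral of $-f \log f$ against the counting measure, i.e. an ordinary sum, while the differential entropy is approximated by a Riemann sum for the integral against the Lebesgue measure.

```python
import numpy as np

# Discrete case: entropy as an integral of -f log f against the counting measure,
# i.e. an ordinary sum over the finite range Omega.
pmf = np.array([0.5, 0.25, 0.125, 0.125])            # example probability mass function f
H_discrete = -np.sum(pmf * np.log2(pmf))              # = 1.75 bits

# Continuous case: differential entropy as an integral of -f log f against the
# Lebesgue measure, approximated on a grid for a standard normal density.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)            # example density f(x)
h_continuous = -np.sum(f * np.log(f)) * dx            # ~ 0.5 * ln(2*pi*e) ≈ 1.4189 nats

print(H_discrete, h_continuous)
```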
If $\mathbb{P}$ is the probability measure induced by $X$, then the integral can also be taken directly with respect to $\mathbb{P}$:

$$H(X) = -\int \log \frac{d\mathbb{P}}{d\mu} \, d\mathbb{P} ,$$

where $\mu$ is the underlying (Lebesgue or counting) measure with respect to which $\mathbb{P}$ has a density.
If instead of the underlying measure $\mu$ we take another probability measure $\nu$, we are led to the Kullback–Leibler divergence: let $\mu$ and $\nu$ be probability measures over the same space. Then if $\mu$ is absolutely continuous with respect to $\nu$, written $\mu \ll \nu$, the Radon–Nikodym derivative $d\mu/d\nu$ exists and the Kullback–Leibler divergence can be expressed in its full generality:

$$D_{\mathrm{KL}}(\mu \parallel \nu) = \int_{\operatorname{supp}\mu} \frac{d\mu}{d\nu} \log \frac{d\mu}{d\nu} \, d\nu = \int_{\operatorname{supp}\mu} \log \frac{d\mu}{d\nu} \, d\mu ,$$
where the integral runs over the support of $\mu$. Note that we have dropped the negative sign: the Kullback–Leibler divergence is always non-negative due to Gibbs' inequality.
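As a small numerical check of the discrete special case (where both measures have densities with respect to the counting measure, so the Radon–Nikodym derivative is just the ratio of probability mass functions), the following sketch computes the Kullback–Leibler divergence between two made-up distributions and confirms it is non-negative, as Gibbs' inequality guarantees.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as arrays of probabilities.

    Requires p to be absolutely continuous with respect to q (q > 0 wherever p > 0);
    terms with p[i] == 0 contribute 0 by the usual convention 0 log 0 = 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                        # restrict the sum to the support of p
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.4, 0.4, 0.2]                     # example distribution mu
q = [1/3, 1/3, 1/3]                     # example distribution nu
print(kl_divergence(p, q))              # > 0, while kl_divergence(p, p) == 0
```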
There is an analogy between Shannon's basic "measures" of the information content of random variables and a measure over sets. Namely the joint entropy, conditional entropy, and mutual information can be considered as the measure of a set union, set difference, and set intersection, respectively (Reza pp. 106–108).
If we associate the existence of abstract sets $\tilde{X}$ and $\tilde{Y}$ to arbitrary discrete random variables X and Y, somehow representing the information borne by X and Y, respectively, such that:

$\mu^*(\tilde{X} \cap \tilde{Y}) = 0$ whenever X and Y are unconditionally independent, and

$\tilde{X} = \tilde{Y}$ whenever X and Y are such that either one is completely determined by the other (i.e. by a bijection);
where $\mu^*$ is a signed measure over these sets, and we set:

$$H(X) = \mu^*(\tilde{X}), \qquad H(Y) = \mu^*(\tilde{Y}), \qquad H(X,Y) = \mu^*(\tilde{X} \cup \tilde{Y}),$$
$$H(X \mid Y) = \mu^*(\tilde{X} \setminus \tilde{Y}), \qquad I(X;Y) = \mu^*(\tilde{X} \cap \tilde{Y});$$
we find that Shannon's "measure" of information content satisfies all the postulates and basic properties of a formal signed measure over sets, as commonly illustrated in an information diagram. This allows the sum of two measures to be written:

$$\mu^*(\tilde{X}) + \mu^*(\tilde{Y}) = \mu^*(\tilde{X} \cup \tilde{Y}) + \mu^*(\tilde{X} \cap \tilde{Y}),$$

i.e. $H(X) + H(Y) = H(X,Y) + I(X;Y)$,
and the analog of Bayes' theorem ($\mu^*(\tilde{X}) + \mu^*(\tilde{Y} \setminus \tilde{X}) = \mu^*(\tilde{Y}) + \mu^*(\tilde{X} \setminus \tilde{Y})$, i.e. $H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$) allows the difference of two measures to be written:

$$\mu^*(\tilde{X}) - \mu^*(\tilde{Y}) = \mu^*(\tilde{X} \setminus \tilde{Y}) - \mu^*(\tilde{Y} \setminus \tilde{X}),$$

i.e. $H(X) - H(Y) = H(X \mid Y) - H(Y \mid X)$.
This can be a handy mnemonic device in some situations, e.g.

$$H(X,Y) = H(X) + H(Y \mid X) \quad \leftrightarrow \quad \mu^*(\tilde{X} \cup \tilde{Y}) = \mu^*(\tilde{X}) + \mu^*(\tilde{Y} \setminus \tilde{X}),$$
$$I(X;Y) = H(X) - H(X \mid Y) \quad \leftrightarrow \quad \mu^*(\tilde{X} \cap \tilde{Y}) = \mu^*(\tilde{X}) - \mu^*(\tilde{X} \setminus \tilde{Y}).$$
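The identities above can be checked on any finite joint distribution. In the sketch below the 2×2 joint probability table is an arbitrary choice made only for the demonstration; the script computes the entropies and the mutual information from their definitions and verifies the union, intersection, and difference relations.

```python
import numpy as np

# Example joint distribution p(x, y) on a 2x2 alphabet (rows: x, columns: y);
# the numbers are an arbitrary illustrative choice.
pxy = np.array([[0.30, 0.20],
                [0.10, 0.40]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

def H(p):
    """Shannon entropy (in bits) of an array of probabilities."""
    p = np.asarray(p).flatten()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Hx, Hy, Hxy = H(px), H(py), H(pxy)
# Conditional entropies from the conditional distributions p(x|y) and p(y|x).
Hx_given_y = sum(py[j] * H(pxy[:, j] / py[j]) for j in range(2))
Hy_given_x = sum(px[i] * H(pxy[i, :] / px[i]) for i in range(2))
# Mutual information from its definition as an expected log-likelihood ratio.
Ixy = np.sum(pxy * np.log2(pxy / np.outer(px, py)))

# Set-measure analogues: union, intersection, and difference of the "information sets".
assert np.isclose(Hx + Hy, Hxy + Ixy)                  # mu*(X)+mu*(Y) = mu*(X∪Y)+mu*(X∩Y)
assert np.isclose(Hxy, Hx + Hy_given_x)                # H(X,Y) = H(X) + H(Y|X)
assert np.isclose(Ixy, Hx - Hx_given_y)                # I(X;Y) = H(X) - H(X|Y)
assert np.isclose(Hx - Hy, Hx_given_y - Hy_given_x)    # difference of measures
```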
Note that measures that are expected values of the negative logarithm of a true probability are called "entropy" and are generally represented by the letter H, while other measures are often referred to as "information" or "correlation" and are generally represented by the letter I. For notational simplicity, the letter I is sometimes used for all measures.
Certain extensions to the definitions of Shannon's basic measures of information are necessary to deal with the σ-algebra generated by the sets that would be associated to three or more arbitrary random variables. (See Reza pp. 106–108 for an informal but rather complete discussion.) Namely $H(X,Y,Z,\ldots)$ needs to be defined in the obvious way as the entropy of a joint distribution, and a multivariate mutual information $I(X;Y;Z;\ldots)$ defined in a suitable manner so that we can set:

$$H(X,Y,Z,\ldots) = \mu^*(\tilde{X} \cup \tilde{Y} \cup \tilde{Z} \cup \cdots),$$
$$I(X;Y;Z;\ldots) = \mu^*(\tilde{X} \cap \tilde{Y} \cap \tilde{Z} \cap \cdots);$$
in order to define the (signed) measure over the whole σ-algebra. There is no single universally accepted definition for the multivariate mutual information, but the one that corresponds here to the measure of a set intersection is due to Fano (1966: pp. 57–59). The definition is recursive. As a base case the mutual information of a single random variable is defined to be its entropy: $I(X) = H(X)$. Then for $n \geq 2$ we set

$$I(X_1; \ldots; X_n) = I(X_1; \ldots; X_{n-1}) - I(X_1; \ldots; X_{n-1} \mid X_n),$$
where the conditional mutual information is defined as

$$I(X_1; \ldots; X_{n-1} \mid X_n) = \mathbb{E}_{X_n}\!\left[ I(X_1; \ldots; X_{n-1}) \mid X_n \right].$$
The first step in the recursion yields Shannon's definition $I(X_1; X_2) = H(X_1) - H(X_1 \mid X_2)$. The multivariate mutual information (the same as interaction information but for a change in sign) of three or more random variables can be negative as well as positive: let X and Y be two independent fair coin flips, and let Z be their exclusive or. Then $I(X;Y;Z) = -1$ bit.
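The sign of this triple quantity can be checked numerically. The sketch below, a minimal illustration rather than part of the original presentation, enumerates the four equally likely outcomes of the XOR construction and evaluates $I(X;Y;Z)$ through the inclusion–exclusion expansion over joint entropies that the signed-measure picture suggests, recovering $-1$ bit.

```python
import numpy as np
from collections import Counter
from itertools import product

# Two independent fair coin flips X, Y and their exclusive or Z = X ^ Y:
# the four outcomes below are equally likely (probability 1/4 each).
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]

def H(*indices):
    """Joint entropy (in bits) of the selected coordinates under the uniform outcome list."""
    counts = Counter(tuple(o[i] for i in indices) for o in outcomes)
    p = np.array(list(counts.values()), dtype=float) / len(outcomes)
    return -np.sum(p * np.log2(p))

# Inclusion-exclusion for the "triple intersection" of the information sets.
I_xyz = (H(0) + H(1) + H(2)
         - H(0, 1) - H(0, 2) - H(1, 2)
         + H(0, 1, 2))
print(I_xyz)   # -1.0 bit: pairwise independent variables, negative triple interaction
```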
Many other variations are possible for three or more random variables: for example, $I(X,Y;Z)$ is the mutual information of the joint distribution of X and Y relative to Z, and can be interpreted as $\mu^*((\tilde{X} \cup \tilde{Y}) \cap \tilde{Z})$. Many more complicated expressions can be built this way, and still have meaning, e.g. $I(X;Y;Z \mid W)$ or $H(X,Y \mid Z,W)$.
In information theory, the entropy of a random variable quantifies the average level of uncertainty or information associated with the variable's potential states or possible outcomes. This measures the expected amount of information needed to describe the state of the variable, considering the distribution of probabilities across all potential states. Given a discrete random variable $X$, which takes values in the set $\mathcal{X}$ and is distributed according to $p : \mathcal{X} \to [0,1]$, the entropy is

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$

where $\Sigma$ denotes the sum over the variable's possible values. The choice of base for $\log$, the logarithm, varies for different applications. Base 2 gives the unit of bits, while base e gives "natural units" nat, and base 10 gives units of "dits", "bans", or "hartleys". An equivalent definition of entropy is the expected value of the self-information of a variable.
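To make the base-dependence concrete, here is a minimal sketch (with an arbitrarily chosen distribution) that evaluates the same entropy in bits, nats, and hartleys; the three values differ only by the constant factors $\ln 2$ and $\log_{10} 2$.

```python
import numpy as np

# Entropy of a hypothetical four-outcome distribution in three common units,
# differing only in the base of the logarithm.
p = np.array([0.5, 0.25, 0.125, 0.125])
H_bits     = -np.sum(p * np.log2(p))     # base 2:  1.75 bits (shannons)
H_nats     = -np.sum(p * np.log(p))      # base e:  1.75 * ln(2)     ≈ 1.213 nats
H_hartleys = -np.sum(p * np.log10(p))    # base 10: 1.75 * log10(2)  ≈ 0.527 hartleys
print(H_bits, H_nats, H_hartleys)
```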
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.
In probability theory and statistics, the Bernoulli distribution, named after Swiss mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable which takes the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$. Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question. Such questions lead to outcomes that are Boolean-valued: a single bit whose value is success/yes/true/one with probability p and failure/no/false/zero with probability q. It can be used to represent a coin toss where 1 and 0 would represent "heads" and "tails", respectively, and p would be the probability of the coin landing on heads. In particular, unfair coins would have $p \neq 1/2$.
In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or (0, 1) in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution.
In thermodynamics, the Helmholtz free energy is a thermodynamic potential that measures the useful work obtainable from a closed thermodynamic system at a constant temperature (isothermal). The change in the Helmholtz energy during a process is equal to the maximum amount of work that the system can perform in a thermodynamic process in which temperature is held constant. At constant temperature, the Helmholtz free energy is minimized at equilibrium.
In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier proof of the same inequality for doubly-differentiable functions by Otto Hölder in 1889. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation.
In probability theory and statistics, the cumulants $\kappa_n$ of a probability distribution are a set of quantities that provide an alternative to the moments of the distribution. Any two probability distributions whose moments are identical will have identical cumulants as well, and vice versa.
In mathematics, the moments of a function are certain quantitative measures related to the shape of the function's graph. If the function represents mass density, then the zeroth moment is the total mass, the first moment is the center of mass, and the second moment is the moment of inertia. If the function is a probability distribution, then the first moment is the expected value, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis.
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.
In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.
In probability theory and statistics, the logistic distribution is a continuous probability distribution. Its cumulative distribution function is the logistic function, which appears in logistic regression and feedforward neural networks. It resembles the normal distribution in shape but has heavier tails. The logistic distribution is a special case of the Tukey lambda distribution.
In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.
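As a short worked example of "variance of the score", the following sketch (using sympy, with the Bernoulli(p) model chosen purely for brevity) differentiates the log-likelihood of a single observation, confirms that the score has mean zero, and recovers the standard Fisher information $1/(p(1-p))$.

```python
import sympy as sp

p = sp.symbols('p', positive=True)
x = sp.symbols('x')

# Log-likelihood of a single Bernoulli(p) observation x in {0, 1}.
log_lik = x * sp.log(p) + (1 - x) * sp.log(1 - p)
score = sp.diff(log_lik, p)                       # score = x/p - (1 - x)/(1 - p)

# Fisher information = variance of the score = E[score^2], since the score has mean 0.
# Take the expectation over x in {0, 1} with P(x=1) = p, P(x=0) = 1 - p.
E_score  = sp.simplify(p * score.subs(x, 1) + (1 - p) * score.subs(x, 0))
E_score2 = sp.simplify(p * score.subs(x, 1)**2 + (1 - p) * score.subs(x, 0)**2)

assert E_score == 0
assert sp.simplify(E_score2 - 1 / (p * (1 - p))) == 0
print(E_score2)    # equal to 1/(p*(1 - p))
```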
Quantum statistical mechanics is statistical mechanics applied to quantum mechanical systems. In quantum mechanics a statistical ensemble is described by a density operator S, which is a non-negative, self-adjoint, trace-class operator of trace 1 on the Hilbert space H describing the quantum system. This can be shown under various mathematical formalisms for quantum mechanics.
In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted $D_{\text{KL}}(P \parallel Q)$, is a type of statistical distance: a measure of how one reference probability distribution P is different from a second probability distribution Q. Mathematically, it is defined as

$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.$$
In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.
Differential entropy is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy of a random variable to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.
In mathematics, a π-system on a set $\Omega$ is a collection $P$ of certain subsets of $\Omega$ such that $P$ is non-empty, and $A \cap B \in P$ whenever $A, B \in P$.
In thermodynamics, the fundamental thermodynamic relation comprises four fundamental equations which demonstrate how four important thermodynamic quantities depend on variables that can be controlled and measured experimentally. Thus, they are essentially equations of state, and using the fundamental equations, experimental data can be used to determine sought-after quantities like G or H (enthalpy). The relation is generally expressed as an infinitesimal change in internal energy in terms of infinitesimal changes in entropy and volume for a closed system in thermal equilibrium in the following way:

$$\mathrm{d}U = T\,\mathrm{d}S - P\,\mathrm{d}V.$$
The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon, based on the binary logarithm. Although "bit" is more frequently used in place of "shannon", its name is not distinguished from the bit as used in data processing to refer to a binary value or stream regardless of its entropy. Other units include the nat, based on the natural logarithm, and the hartley, based on the base 10 or common logarithm.
In probability theory and statistics, a cross-covariance matrix is a matrix whose element in the i, j position is the covariance between the i-th element of a random vector and j-th element of another random vector. A random vector is a random variable with multiple dimensions. Each element of the vector is a scalar random variable. Each element has either a finite number of observed empirical values or a finite or infinite number of potential values. The potential values are specified by a theoretical joint probability distribution. Intuitively, the cross-covariance matrix generalizes the notion of covariance to multiple dimensions.