Conditional mutual information

[Figure VennInfo3Var.svg: Venn diagram of information-theoretic measures for three variables $x$, $y$, and $z$, represented by the lower left, lower right, and upper circles, respectively. The conditional mutual informations $I(x;z|y)$, $I(y;z|x)$ and $I(x;y|z)$ are represented by the yellow, cyan, and magenta regions, respectively.]

In probability theory, particularly information theory, the conditional mutual information [1] [2] is, in its most basic form, the expected value of the mutual information of two random variables given the value of a third.


Definition

For random variables $X$, $Y$, and $Z$ with support sets $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$, we define the conditional mutual information as

$$I(X;Y|Z) = \int_{\mathcal{Z}} D_{\mathrm{KL}}\!\left( P_{(X,Y)|Z} \,\|\, P_{X|Z} \otimes P_{Y|Z} \right) dP_{Z}.$$

This may be written in terms of the expectation operator:

$$I(X;Y|Z) = \mathbb{E}_{Z}\left[ D_{\mathrm{KL}}\!\left( P_{(X,Y)|Z} \,\|\, P_{X|Z} \otimes P_{Y|Z} \right) \right].$$

Thus $I(X;Y|Z)$ is the expected (with respect to $Z$) Kullback–Leibler divergence from the conditional joint distribution $P_{(X,Y)|Z}$ to the product of the conditional marginals $P_{X|Z}$ and $P_{Y|Z}$. Compare with the definition of mutual information.

In terms of PMFs for discrete distributions

For discrete random variables $X$, $Y$, and $Z$ with support sets $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$, the conditional mutual information $I(X;Y|Z)$ is as follows:

$$I(X;Y|Z) = \sum_{z\in\mathcal{Z}} p_Z(z) \sum_{y\in\mathcal{Y}} \sum_{x\in\mathcal{X}} p_{X,Y|Z}(x,y|z) \log \frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)},$$

where the marginal, joint, and/or conditional probability mass functions are denoted by $p$ with the appropriate subscript. This can be simplified as

$$I(X;Y|Z) = \sum_{z\in\mathcal{Z}} \sum_{y\in\mathcal{Y}} \sum_{x\in\mathcal{X}} p_{X,Y,Z}(x,y,z) \log \frac{p_Z(z)\,p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)\,p_{Y,Z}(y,z)}.$$
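
The simplified expression above translates directly into code. The following is a minimal sketch (not part of the article): a hypothetical function `conditional_mutual_information` that computes $I(X;Y|Z)$ in bits from a joint PMF supplied as a NumPy array indexed as `p[x, y, z]`; the example PMF is made up for illustration.

```python
import numpy as np

def conditional_mutual_information(p_xyz):
    """I(X;Y|Z) in bits from a joint PMF array p_xyz indexed as p[x, y, z]."""
    p_xz = p_xyz.sum(axis=1)       # p(x, z)
    p_yz = p_xyz.sum(axis=0)       # p(y, z)
    p_z = p_xyz.sum(axis=(0, 1))   # p(z)
    cmi = 0.0
    for x, y, z in np.ndindex(p_xyz.shape):
        if p_xyz[x, y, z] > 0:
            cmi += p_xyz[x, y, z] * np.log2(
                p_z[z] * p_xyz[x, y, z] / (p_xz[x, z] * p_yz[y, z]))
    return cmi

# Example: X is a fair bit and Y = Z = X (perfect copies).  Given Z, observing Y
# carries no further information about X, so I(X;Y|Z) = 0 even though I(X;Y) = 1 bit.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[1, 1, 1] = 0.5
print(conditional_mutual_information(p))   # 0.0
```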

In terms of PDFs for continuous distributions

For (absolutely) continuous random variables $X$, $Y$, and $Z$ with support sets $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$, the conditional mutual information $I(X;Y|Z)$ is as follows:

$$I(X;Y|Z) = \int_{\mathcal{Z}} \left( \int_{\mathcal{Y}} \int_{\mathcal{X}} \log\!\left( \frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)} \right) p_{X,Y|Z}(x,y|z)\, dx\, dy \right) p_Z(z)\, dz,$$

where the marginal, joint, and/or conditional probability density functions are denoted by $p$ with the appropriate subscript. This can be simplified as

$$I(X;Y|Z) = \int_{\mathcal{Z}} \int_{\mathcal{Y}} \int_{\mathcal{X}} \log\!\left( \frac{p_Z(z)\,p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)\,p_{Y,Z}(y,z)} \right) p_{X,Y,Z}(x,y,z)\, dx\, dy\, dz.$$
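
As an illustration of the continuous case (a sketch, not from the article): for jointly Gaussian variables the integral has a closed form in terms of covariance determinants, $I(X;Y|Z) = \tfrac{1}{2}\ln\frac{\det \Sigma_{XZ}\,\det \Sigma_{YZ}}{\det \Sigma_{Z}\,\det \Sigma_{XYZ}}$ in nats, a standard fact about Gaussian entropies that the code below assumes; the function name `gaussian_cmi` and the covariance matrix are illustrative.

```python
import numpy as np

def gaussian_cmi(cov, ix, iy, iz):
    """I(X;Y|Z) in nats for jointly Gaussian variables with covariance matrix `cov`.
    ix, iy, iz are index lists selecting the X, Y and Z components."""
    def logdet(idx):
        # log-determinant of the covariance sub-block for the selected components
        return np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    return 0.5 * (logdet(ix + iz) + logdet(iy + iz)
                  - logdet(iz) - logdet(ix + iy + iz))

# Scalar example with correlations rho_xy = 0.5, rho_xz = 0.6, rho_yz = 0.7.
cov = np.array([[1.0, 0.5, 0.6],
                [0.5, 1.0, 0.7],
                [0.6, 0.7, 1.0]])
print(gaussian_cmi(cov, [0], [1], [2]))   # about 0.01 nats
```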

Some identities

Alternatively, we may write $I(X;Y|Z)$ in terms of joint and conditional entropies as [3]

$$I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) = H(X|Z) - H(X|Y,Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z).$$

This can be rewritten to show its relationship to mutual information,

$$I(X;Y|Z) = I(X;Y,Z) - I(X;Z),$$

usually rearranged as the chain rule for mutual information,

$$I(X;Y,Z) = I(X;Y|Z) + I(X;Z),$$

or

$$I(X;Y|Z) = I(X;Y) - \bigl( I(X;Z) - I(X;Z|Y) \bigr).$$

Another equivalent form of the above is

$$I(X;Y|Z) = H(Z|X) + H(X) + H(Z|Y) + H(Y) - H(Z|X,Y) - H(X,Y) - H(Z) = I(X;Y) + H(Z|X) + H(Z|Y) - H(Z|X,Y) - H(Z).$$

Another equivalent form of the conditional mutual information is

$$I(X;Y|Z) = I(X,Z;Y,Z) - H(Z).$$

Like mutual information, conditional mutual information can be expressed as a Kullback–Leibler divergence:

$$I(X;Y|Z) = D_{\mathrm{KL}}\bigl[\, p(X,Y,Z) \,\big\|\, p(X|Z)\,p(Y|Z)\,p(Z) \,\bigr].$$

Or as an expected value of simpler Kullback–Leibler divergences:

$$I(X;Y|Z) = \sum_{z\in\mathcal{Z}} p(Z=z)\; D_{\mathrm{KL}}\bigl[\, p(X,Y|z) \,\big\|\, p(X|z)\,p(Y|z) \,\bigr],$$

$$I(X;Y|Z) = \sum_{y\in\mathcal{Y}} p(Y=y)\; D_{\mathrm{KL}}\bigl[\, p(X,Z|y) \,\big\|\, p(X|Z)\,p(Z|y) \,\bigr].$$
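
The entropy identity and the chain rule above are easy to check numerically. The following sketch (not from the article) does so for an arbitrary joint PMF drawn at random; the helper `H` and the array shapes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 4, 2))
p /= p.sum()                      # random joint PMF p(x, y, z)

def H(*axes_to_keep):
    """Joint entropy (in bits) of the marginal of p over the given axes."""
    axes_to_sum = tuple(a for a in range(3) if a not in axes_to_keep)
    m = p.sum(axis=axes_to_sum) if axes_to_sum else p
    m = m[m > 0]
    return -(m * np.log2(m)).sum()

I_xy_given_z = H(0, 2) + H(1, 2) - H(0, 1, 2) - H(2)   # I(X;Y|Z)
I_x_yz = H(0) + H(1, 2) - H(0, 1, 2)                   # I(X;Y,Z)
I_x_z = H(0) + H(2) - H(0, 2)                          # I(X;Z)
print(np.isclose(I_x_yz, I_x_z + I_xy_given_z))        # True: chain rule holds
```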

More general definition

A more general definition of conditional mutual information, applicable to random variables with continuous or other arbitrary distributions, depends on the concept of regular conditional probability (see also [4] [5]).

Let $(\Omega, \mathcal{F}, \mathfrak{P})$ be a probability space, and let the random variables $X$, $Y$, and $Z$ each be defined as a Borel-measurable function from $\Omega$ to some state space endowed with a topological structure.

Consider the Borel measure (on the σ-algebra generated by the open sets) in the state space of each random variable, defined by assigning to each Borel set the $\mathfrak{P}$-measure of its preimage in $\mathcal{F}$. This is called the pushforward measure $X_{*}\mathfrak{P} = \mathfrak{P}\bigl(X^{-1}(\cdot)\bigr)$. The support of a random variable is defined to be the topological support of this measure, i.e. $\mathrm{supp}\,X = \mathrm{supp}\,X_{*}\mathfrak{P}$.

Now we can formally define the conditional probability measure given the value of one (or, via the product topology, more) of the random variables. Let $M$ be a measurable subset of $\Omega$ (i.e. $M \in \mathcal{F}$), and let $x \in \mathrm{supp}\,X$. Then, using the disintegration theorem:

$$\mathfrak{P}(M \mid X = x) = \lim_{U \ni x} \frac{\mathfrak{P}(M \cap \{X \in U\})}{\mathfrak{P}(\{X \in U\})},$$

where the limit is taken over the open neighborhoods $U$ of $x$, as they are allowed to become arbitrarily smaller with respect to set inclusion.

Finally we can define the conditional mutual information via Lebesgue integration:

$$I(X;Y|Z) = \int_{\Omega} \log\!\left( \frac{d\,\mathfrak{P}(\omega \mid X,Z)\; d\,\mathfrak{P}(\omega \mid Y,Z)}{d\,\mathfrak{P}(\omega \mid Z)\; d\,\mathfrak{P}(\omega \mid X,Y,Z)} \right) d\mathfrak{P}(\omega),$$

where the integrand is the logarithm of a Radon–Nikodym derivative involving some of the conditional probability measures we have just defined.

Note on notation

In an expression such as $I(A;B|C)$, the symbols $A$, $B$, and $C$ need not necessarily be restricted to representing individual random variables; they could also represent the joint distribution of any collection of random variables defined on the same probability space. As is common in probability theory, we may use the comma to denote such a joint distribution, e.g. $I(A_0, A_1; B \mid C_0, C_1)$. Hence the use of the semicolon (or occasionally a colon or even a wedge $\wedge$) to separate the principal arguments of the mutual information symbol. (No such distinction is necessary in the symbol for joint entropy, since the joint entropy of any number of random variables is the same as the entropy of their joint distribution.)

Properties

Nonnegativity

It is always true that

$$I(X;Y|Z) \ge 0,$$

for discrete, jointly distributed random variables $X$, $Y$ and $Z$. This result has been used as a basic building block for proving other inequalities in information theory, in particular those known as Shannon-type inequalities. Conditional mutual information is also non-negative for continuous random variables under certain regularity conditions.[6]

Interaction information

Conditioning on a third random variable may either increase or decrease the mutual information: that is, the difference $I(X;Y|Z) - I(X;Y)$, called the interaction information, may be positive, negative, or zero. This is the case even when the random variables are pairwise independent. Such is the case when

$$X \sim \mathrm{Bernoulli}(0.5), \qquad Z \sim \mathrm{Bernoulli}(0.5), \qquad Y = X \oplus Z,$$

in which case $X$, $Y$ and $Z$ are pairwise independent and in particular $I(X;Y) = 0$, but $I(X;Y|Z) = 1$ bit.
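
A minimal numerical check of this example (a sketch, not from the article), computing both quantities from entropies of the joint PMF of $(X, Y, Z)$ with $Y = X \oplus Z$:

```python
import numpy as np

# Joint PMF p[x, y, z]: X and Z independent fair bits, Y = X XOR Z.
p = np.zeros((2, 2, 2))
for x in range(2):
    for z in range(2):
        p[x, x ^ z, z] = 0.25

def entropy(m):
    m = m[m > 0]
    return -(m * np.log2(m)).sum()

H_x = entropy(p.sum(axis=(1, 2)))
H_y = entropy(p.sum(axis=(0, 2)))
H_z = entropy(p.sum(axis=(0, 1)))
H_xy, H_xz, H_yz = entropy(p.sum(axis=2)), entropy(p.sum(axis=1)), entropy(p.sum(axis=0))
H_xyz = entropy(p)

I_xy = H_x + H_y - H_xy                      # 0 bits: X and Y are independent
I_xy_given_z = H_xz + H_yz - H_xyz - H_z     # 1 bit: given Z, Y determines X
print(I_xy, I_xy_given_z)
```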

Chain rule for mutual information

The chain rule (as derived above) provides two ways to decompose $I(X;Y,Z)$:

$$I(X;Y,Z) = I(X;Z) + I(X;Y|Z) = I(X;Y) + I(X;Z|Y).$$

The data processing inequality is closely related to conditional mutual information and can be proven using the chain rule.

Interaction information

The conditional mutual information is used to inductively define the interaction information, a generalization of mutual information, as follows:

$$I(X_1;\ldots;X_{n+1}) = I(X_1;\ldots;X_n) - I(X_1;\ldots;X_n \mid X_{n+1}),$$

where

$$I(X_1;\ldots;X_n \mid X_{n+1}) = \mathbb{E}_{X_{n+1}}\bigl[\, I(X_1;\ldots;X_n) \mid X_{n+1} \,\bigr],$$

that is, the expectation over $X_{n+1}$ of the interaction information of the conditional distributions given $X_{n+1}$. (Sign conventions for the interaction information differ between authors.)

Because the conditional mutual information can be greater than or less than its unconditional counterpart, the interaction information can be positive, negative, or zero, which makes it hard to interpret.
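
As a brief illustration, under the sign convention written above, the pairwise-independent example from the Properties section gives

$$I(X;Y;Z) = I(X;Y) - I(X;Y|Z) = 0 - 1 = -1 \text{ bit}.$$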

References

  1. Wyner, A. D. (1978). "A definition of conditional mutual information for arbitrary ensembles". Information and Control. 38 (1): 51–59. doi:10.1016/s0019-9958(78)90026-8.
  2. Dobrushin, R. L. (1959). "General formulation of Shannon's main theorem in information theory". Uspekhi Mat. Nauk. 14: 3–104.
  3. Cover, Thomas; Thomas, Joy A. (2006). Elements of Information Theory (2nd ed.). New York: Wiley-Interscience. ISBN 0-471-24195-4.
  4. Regular Conditional Probability on PlanetMath.
  5. D. Leao, Jr. et al. (2004). "Regular conditional probability, disintegration of probability and Radon spaces". Proyecciones. 23 (1): 15–29. Universidad Católica del Norte, Antofagasta, Chile.
  6. Polyanskiy, Yury; Wu, Yihong (2017). Lecture notes on information theory (PDF). p. 30.