Conditional entropy

Last updated
Venn diagram showing additive and subtractive relationships various information measures associated with correlated variables
X
{\displaystyle X}
and
Y
{\displaystyle Y}
. The area contained by both circles is the joint entropy
H
(
X
,
Y
)
{\displaystyle \mathrm {H} (X,Y)}
. The circle on the left (red and violet) is the individual entropy
H
(
X
)
{\displaystyle \mathrm {H} (X)}
, with the red being the conditional entropy
H
(
X
|
Y
)
{\displaystyle \mathrm {H} (X|Y)}
. The circle on the right (blue and violet) is
H
(
Y
)
{\displaystyle \mathrm {H} (Y)}
, with the blue being
H
(
Y
|
X
)
{\displaystyle \mathrm {H} (Y|X)}
. The violet is the mutual information
I
[?]
(
X
;
Y
)
{\displaystyle \operatorname {I} (X;Y)}
. Entropy-mutual-information-relative-entropy-relation-diagram.svg
Venn diagram showing additive and subtractive relationships various information measures associated with correlated variables and . The area contained by both circles is the joint entropy . The circle on the left (red and violet) is the individual entropy , with the red being the conditional entropy . The circle on the right (blue and violet) is , with the blue being . The violet is the mutual information .

In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable given that the value of another random variable is known. Here, information is measured in shannons, nats, or hartleys. The entropy of conditioned on is written as .

Contents

Definition

The conditional entropy of given is defined as

 

 

 

 

(Eq.1)

where and denote the support sets of and .

Note: Here, the convention is that the expression should be treated as being equal to zero. This is because . [1]

Intuitively, notice that by definition of expected value and of conditional probability, can be written as , where is defined as . One can think of as associating each pair with a quantity measuring the information content of given . This quantity is directly related to the amount of information needed to describe the event given . Hence by computing the expected value of over all pairs of values , the conditional entropy measures how much information, on average, the variable encodes about .

Motivation

Let be the entropy of the discrete random variable conditioned on the discrete random variable taking a certain value . Denote the support sets of and by and . Let have probability mass function . The unconditional entropy of is calculated as , i.e.

where is the information content of the outcome of taking the value . The entropy of conditioned on taking the value is defined analogously by conditional expectation:

Note that is the result of averaging over all possible values that may take. Also, if the above sum is taken over a sample , the expected value is known in some domains as equivocation. [2]

Given discrete random variables with image and with image , the conditional entropy of given is defined as the weighted sum of for each possible value of , using as the weights: [3] :15

Properties

Conditional entropy equals zero

if and only if the value of is completely determined by the value of .

Conditional entropy of independent random variables

Conversely, if and only if and are independent random variables.

Chain rule

Assume that the combined system determined by two random variables and has joint entropy , that is, we need bits of information on average to describe its exact state. Now if we first learn the value of , we have gained bits of information. Once is known, we only need bits to describe the state of the whole system. This quantity is exactly , which gives the chain rule of conditional entropy:

[3] :17

The chain rule follows from the above definition of conditional entropy:

In general, a chain rule for multiple random variables holds:

[3] :22

It has a similar form to chain rule in probability theory, except that addition instead of multiplication is used.

Bayes' rule

Bayes' rule for conditional entropy states

Proof. and . Symmetry entails . Subtracting the two equations implies Bayes' rule.

If is conditionally independent of given we have:

Other properties

For any and :

where is the mutual information between and .

For independent and :

and

Although the specific-conditional entropy can be either less or greater than for a given random variate of , can never exceed .

Conditional differential entropy

Definition

The above definition is for discrete random variables. The continuous version of discrete conditional entropy is called conditional differential (or continuous) entropy. Let and be a continuous random variables with a joint probability density function . The differential conditional entropy is defined as [3] :249

 

 

 

 

(Eq.2)

Properties

In contrast to the conditional entropy for discrete random variables, the conditional differential entropy may be negative.

As in the discrete case there is a chain rule for differential entropy:

[3] :253

Notice however that this rule may not be true if the involved differential entropies do not exist or are infinite.

Joint differential entropy is also used in the definition of the mutual information between continuous random variables:

with equality if and only if and are independent. [3] :253

Relation to estimator error

The conditional differential entropy yields a lower bound on the expected squared error of an estimator. For any random variable , observation and estimator the following holds: [3] :255

This is related to the uncertainty principle from quantum mechanics.

Generalization to quantum theory

In quantum information theory, the conditional entropy is generalized to the conditional quantum entropy. The latter can take negative values, unlike its classical counterpart.

See also

Related Research Articles

<span class="mw-page-title-main">Entropy (information theory)</span> Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , which takes values in the alphabet and is distributed according to :

<span class="mw-page-title-main">Random variable</span> Variable representing a random phenomenon

A random variable is a mathematical formalization of a quantity or object which depends on random events. The term 'random variable' can be misleading as its mathematical definition is not actually random nor a variable, but rather it is a function from possible outcomes in a sample space to a measurable space, often to the real numbers.

<span class="mw-page-title-main">Independence (probability theory)</span> When the occurrence of one event does not affect the likelihood of another

Independence is a fundamental notion in probability theory, as in statistics and the theory of stochastic processes. Two events are independent, statistically independent, or stochastically independent if, informally speaking, the occurrence of one does not affect the probability of occurrence of the other or, equivalently, does not affect the odds. Similarly, two random variables are independent if the realization of one does not affect the probability distribution of the other.

The likelihood function is the joint probability of observed data viewed as a function of the parameters of a statistical model.

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.

<span class="mw-page-title-main">Beta distribution</span> Probability distribution

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution.

The proposition in probability theory known as the law of total expectation, the law of iterated expectations (LIE), Adam's law, the tower rule, and the smoothing theorem, among other names, states that if is a random variable whose expected value is defined, and is any random variable on the same probability space, then

<span class="mw-page-title-main">Mutual information</span> Measure of dependence between two variables

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

In probability theory, the conditional expectation, conditional expected value, or conditional mean of a random variable is its expected value – the value it would take "on average" over an arbitrarily large number of occurrences – given that a certain set of "conditions" is known to occur. If the random variable can take on only a finite number of values, the "conditions" are that the variable can only take on a subset of those values. More formally, in the case when the random variable is defined over a discrete probability space, the "conditions" are a partition of this probability space.

In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

<span class="mw-page-title-main">Joint entropy</span> Measure of information in probability and information theory

In information theory, joint entropy is a measure of the uncertainty associated with a set of variables.

In mathematical statistics, the Kullback–Leibler divergence, denoted , is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P. While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.

In information theory, the Rényi entropy is a quantity that generalizes various notions of entropy, including Hartley entropy, Shannon entropy, collision entropy, and min-entropy. The Rényi entropy is named after Alfréd Rényi, who looked for the most general way to quantify information while preserving additivity for independent events. In the context of fractal dimension estimation, the Rényi entropy forms the basis of the concept of generalized dimensions.

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

Differential entropy is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy, a measure of average (surprisal) of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.

In information theory, information dimension is an information measure for random vectors in Euclidean space, based on the normalized entropy of finely quantized versions of the random vectors. This concept was first introduced by Alfréd Rényi in 1959.

This article discusses how information theory is related to measure theory.

<span class="mw-page-title-main">Quantities of information</span>

The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon, based on the binary logarithm. Although "bit" is more frequently used in place of "shannon", its name is not distinguished from the bit as used in data-processing to refer to a binary value or stream regardless of its entropy Other units include the nat, based on the natural logarithm, and the hartley, based on the base 10 or common logarithm.

The partition function or configuration integral, as used in probability theory, information theory and dynamical systems, is a generalization of the definition of a partition function in statistical mechanics. It is a special case of a normalizing constant in probability theory, for the Boltzmann distribution. The partition function occurs in many problems of probability theory because, in situations where there is a natural symmetry, its associated probability measure, the Gibbs measure, has the Markov property. This means that the partition function occurs not only in physical systems with translation symmetry, but also in such varied settings as neural networks, and applications such as genomics, corpus linguistics and artificial intelligence, which employ Markov networks, and Markov logic networks. The Gibbs measure is also the unique measure that has the property of maximizing the entropy for a fixed expectation value of the energy; this underlies the appearance of the partition function in maximum entropy methods and the algorithms derived therefrom.

<span class="mw-page-title-main">Conditional mutual information</span> Information theory

In probability theory, particularly information theory, the conditional mutual information is, in its most basic form, the expected value of the mutual information of two random variables given the value of a third.

References

  1. "David MacKay: Information Theory, Pattern Recognition and Neural Networks: The Book". www.inference.org.uk. Retrieved 2019-10-25.
  2. Hellman, M.; Raviv, J. (1970). "Probability of error, equivocation, and the Chernoff bound". IEEE Transactions on Information Theory. 16 (4): 368–372. doi:10.1109/TIT.1970.1054466.
  3. 1 2 3 4 5 6 7 T. Cover; J. Thomas (1991). Elements of Information Theory . Wiley. ISBN   0-471-06259-6.