Quantities of information

A misleading information diagram showing additive and subtractive relationships among Shannon's basic quantities of information for correlated variables $X$ and $Y$. The area contained by both circles is the joint entropy $\mathrm{H}(X,Y)$. The circle on the left (red and violet) is the individual entropy $\mathrm{H}(X)$, with the red being the conditional entropy $\mathrm{H}(X|Y)$. The circle on the right (blue and violet) is $\mathrm{H}(Y)$, with the blue being $\mathrm{H}(Y|X)$. The violet is the mutual information $\operatorname{I}(X;Y)$. (Image: Entropy-mutual-information-relative-entropy-relation-diagram.svg)

The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon, based on the binary logarithm. Although "bit" is used more often than "shannon", the name does not distinguish this unit from the bit as used in data processing, where it refers to a binary value or stream regardless of its entropy (information content). Other units include the nat, based on the natural logarithm, and the hartley, based on the base-10 or common logarithm.
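Because the units differ only in the base of the logarithm, converting between them is a multiplication by a constant. The following is a minimal sketch of that conversion; the names `SHANNONS_PER_UNIT` and `convert_information` are hypothetical and chosen only for this illustration.

```python
import math

# Size of one unit of each kind, expressed in shannons (bits).
# These follow from the change-of-base rule for logarithms.
SHANNONS_PER_UNIT = {
    "shannon": 1.0,
    "nat": 1.0 / math.log(2.0),   # log2(e), about 1.4427
    "hartley": math.log2(10.0),   # log2(10), about 3.3219
}


def convert_information(value, from_unit, to_unit):
    """Convert an information quantity between shannons, nats and hartleys."""
    return value * SHANNONS_PER_UNIT[from_unit] / SHANNONS_PER_UNIT[to_unit]


# One hartley expressed in the other two units.
hartley_in_shannons = convert_information(1.0, "hartley", "shannon")
hartley_in_nats = convert_information(1.0, "hartley", "nat")
```

Under these definitions one hartley comes out as $\log_2 10$ shannons or $\ln 10$ nats, which is the same scaling factor that relates the corresponding logarithms.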


In what follows, an expression of the form $p \log p$ is considered by convention to be equal to zero whenever $p$ is zero. This is justified because $\lim_{p \to 0^{+}} p \log p = 0$ for any logarithmic base.

Self-information

Shannon derived a measure of information content called the self-information or "surprisal" of a message $m$:

$$\operatorname{I}(m) = \log\left(\frac{1}{p(m)}\right) = -\log(p(m))$$

where $p(m) = \Pr(M = m)$ is the probability that message $m$ is chosen from all possible choices in the message space $M$. The base of the logarithm only affects a scaling factor and, consequently, the units in which the measured information content is expressed. If the logarithm is base 2, the measure of information is expressed in units of shannons or more often simply "bits" (a bit in other contexts is rather defined as a "binary digit", whose average information content is at most 1 shannon).
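As an illustration, self-information can be computed directly from a probability. This is a minimal sketch, assuming the definition of surprisal as the logarithm of the reciprocal probability; the function name `self_information` and its `base` parameter are hypothetical.

```python
import math


def self_information(p, base=2.0):
    """Return log_base(1/p), the surprisal of an outcome with probability p.

    base=2 gives shannons (bits), base=math.e gives nats, base=10 gives hartleys.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p == 0.0:
        return math.inf  # an impossible outcome carries unbounded surprisal
    return math.log(1.0 / p) / math.log(base)


fair_coin_flip = self_information(0.5)        # 1 shannon
one_in_eight = self_information(1.0 / 8.0)    # about 3 shannons
certain_event = self_information(1.0)         # 0: nothing is learned
```

Changing `base` rescales every result by the same constant, which is why the choice of base only changes the unit.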

Information from a source is gained by a recipient only if the recipient did not already have that information to begin with. Messages that describe a certain event ($P = 1$), or one that is already known with certainty (for instance, through a back-channel), provide no information, as the above equation indicates. Infrequently occurring messages contain more information than more frequently occurring messages.

It can also be shown that a compound message of two (or more) unrelated messages has a quantity of information that is the sum of the measures of information of each message individually. That can be derived using this definition by considering a compound message $m \& n$ providing information regarding the values of two random variables M and N, using a message which is the concatenation of the elementary messages m and n, whose information contents are given by $\operatorname{I}(m)$ and $\operatorname{I}(n)$ respectively. If the messages m and n each depend only on M and N, and the processes M and N are independent, then since $P(m \& n) = P(m)P(n)$ (the definition of statistical independence) it is clear from the above definition that $\operatorname{I}(m \& n) = \operatorname{I}(m) + \operatorname{I}(n)$.
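The additivity step can be written out explicitly. The following is a worked sketch using the definition of self-information as $-\log P$ and the independence of M and N; it adds no assumptions beyond those already stated.

```latex
\begin{align*}
\operatorname{I}(m \,\&\, n) &= -\log P(m \,\&\, n) \\
  &= -\log\bigl(P(m)\,P(n)\bigr) && \text{(independence of } M \text{ and } N\text{)} \\
  &= -\log P(m) - \log P(n) \\
  &= \operatorname{I}(m) + \operatorname{I}(n).
\end{align*}
```

With illustrative values $P(m) = 1/8$ and $P(n) = 1/4$ and a base-2 logarithm, the compound message has probability $1/32$ and carries 5 shannons, which is the sum of 3 shannons and 2 shannons.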

An example: The weather forecast broadcast is: "Tonight's forecast: Dark. Continued darkness until widely scattered light in the morning." This message contains almost no information. However, a forecast of a snowstorm would certainly contain information, since snowstorms do not happen every evening. There would be an even greater amount of information in an accurate forecast of snow for a warm location, such as Miami. The amount of information in a forecast of snow for a location where it never snows (an impossible event) is the highest possible: it is infinite.

Entropy

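The entropy of a discrete message space $M$ is a measure of the amount of uncertainty one has about which message will be chosen. It is defined as the average self-information of a message $m$ from that message space:

$$\mathrm{H}(M) = \mathbb{E}[\operatorname{I}(M)] = \sum_{m \in M} p(m)\operatorname{I}(m) = -\sum_{m \in M} p(m)\log(p(m))$$

where $\mathbb{E}[\cdot]$ denotes the expected value operation.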

An important property of entropy is that it is maximized when all the messages in the message space are equiprobable (that is, $p(m) = 1/|M|$). In this case $\mathrm{H}(M) = \log |M|$.

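Sometimes the function $\mathrm{H}$ is expressed in terms of the probabilities of the distribution:

$$\mathrm{H}(p_1, p_2, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log p_i,$$

where each $p_i \geq 0$ and $\sum_{i=1}^{k} p_i = 1$.

An important special case of this is the binary entropy function:

$$\mathrm{H}_{\mathrm{b}}(p) = \mathrm{H}(p, 1-p) = -p \log p - (1-p)\log(1-p)$$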
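The entropy of a finite distribution and the binary entropy function can be sketched in a few lines. This is an illustrative implementation, not a reference one; the function names `entropy` and `binary_entropy` are hypothetical, and the example distributions are made up for the demonstration.

```python
import math


def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    if any(p < 0.0 for p in probs) or not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probs must be non-negative and sum to 1")
    # Terms with p == 0 are skipped, following the convention 0 log 0 = 0.
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0) / math.log(base)


def binary_entropy(p, base=2.0):
    """Entropy of a two-outcome distribution with probabilities p and 1 - p."""
    return entropy([p, 1.0 - p], base)


uniform_4 = entropy([0.25] * 4)            # log2(4) = 2 shannons, the maximum for 4 messages
skewed_4 = entropy([0.7, 0.1, 0.1, 0.1])   # strictly less than 2 shannons
fair_coin = binary_entropy(0.5)            # 1 shannon
```

The uniform case reaches the maximum $\log |M|$, while any departure from equiprobability lowers the entropy.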

Joint entropy

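The joint entropy of two discrete random variables $X$ and $Y$ is defined as the entropy of the joint distribution of $X$ and $Y$:

$$\mathrm{H}(X, Y) = \mathbb{E}_{X,Y}[-\log p(x,y)] = -\sum_{x,y} p(x,y)\log p(x,y)$$

If $X$ and $Y$ are independent, then the joint entropy is simply the sum of their individual entropies.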
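The additivity of joint entropy under independence can be checked numerically. The sketch below assumes the joint distribution is stored as a nested list with `joint[i][j]` holding $p(x_i, y_j)$; that layout, the function names, and the example tables are all assumptions made for illustration.

```python
import math


def entropy_bits(probs):
    """Entropy in shannons of a flat list of probabilities (0 log 0 taken as 0)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)


def joint_entropy_bits(joint):
    """H(X,Y) in shannons, where joint[i][j] = p(x_i, y_j)."""
    return entropy_bits([p for row in joint for p in row])


def marginals(joint):
    """Return (p(x), p(y)) by summing the joint table over rows and columns."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return p_x, p_y


# Independent case: the joint table is the outer product of the marginals.
p_x, p_y = [0.5, 0.5], [0.25, 0.75]
independent = [[a * b for b in p_y] for a in p_x]
h_independent = joint_entropy_bits(independent)
h_sum = entropy_bits(p_x) + entropy_bits(p_y)   # equals h_independent up to rounding

# Fully dependent case (Y always equals X): the joint entropy is only 1 shannon,
# less than H(X) + H(Y) = 2 shannons.
dependent = [[0.5, 0.0], [0.0, 0.5]]
h_dependent = joint_entropy_bits(dependent)
```

When the variables are dependent, the joint entropy falls below the sum of the individual entropies, as the second table shows.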

(Note: The joint entropy should not be confused with the cross entropy, despite similar notations.)

Conditional entropy (equivocation)

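Given a particular value $y$ of a random variable $Y$, the conditional entropy of $X$ given $Y = y$ is defined as:

$$\mathrm{H}(X|y) = \mathbb{E}_{X|Y}[-\log p(x|y)] = -\sum_{x \in X} p(x|y)\log p(x|y)$$

where $p(x|y) = \frac{p(x,y)}{p(y)}$ is the conditional probability of $x$ given $y$.

The conditional entropy of $X$ given $Y$, also called the equivocation of $X$ about $Y$, is then given by:

$$\mathrm{H}(X|Y) = \mathbb{E}_{Y}[\mathrm{H}(X|y)] = -\sum_{y \in Y} p(y)\sum_{x \in X} p(x|y)\log p(x|y) = \sum_{x,y} p(x,y)\log\frac{p(y)}{p(x,y)}$$

This uses the conditional expectation from probability theory.

A basic property of the conditional entropy is that:

$$\mathrm{H}(X|Y) = \mathrm{H}(X,Y) - \mathrm{H}(Y)$$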
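The relation between conditional, joint, and marginal entropy can be verified on a small table. This is a sketch under the assumption that `joint[i][j]` holds $p(x_i, y_j)$; the function names and the example table are hypothetical.

```python
import math


def entropy_bits(probs):
    """Entropy in shannons of a flat list of probabilities (0 log 0 taken as 0)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)


def conditional_entropy_bits(joint):
    """H(X|Y) in shannons, computed as sum over x,y of p(x,y) * log2(p(y) / p(x,y))."""
    p_y = [sum(col) for col in zip(*joint)]
    total = 0.0
    for row in joint:
        for j, p_xy in enumerate(row):
            if p_xy > 0.0:  # 0 log 0 is taken as 0
                total += p_xy * math.log2(p_y[j] / p_xy)
    return total


joint = [[0.25, 0.25],
         [0.00, 0.50]]
p_y = [sum(col) for col in zip(*joint)]                     # [0.25, 0.75]
h_xy = entropy_bits([p for row in joint for p in row])      # 1.5 shannons
h_x_given_y = conditional_entropy_bits(joint)               # about 0.689 shannons
chain_rule_gap = h_x_given_y - (h_xy - entropy_bits(p_y))   # zero up to rounding
```

The gap between the directly computed conditional entropy and the difference of joint and marginal entropy vanishes up to floating-point rounding.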

Kullback–Leibler divergence (information gain)

The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions, a "true" probability distribution $p$, and an arbitrary probability distribution $q$. If we compress data in a manner that assumes $q$ is the distribution underlying some data, when, in reality, $p$ is the correct distribution, Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression, or, mathematically,

$$D_{\mathrm{KL}}\bigl(p(X) \,\|\, q(X)\bigr) = \sum_{x \in X} p(x)\log\frac{p(x)}{q(x)}$$

It is in some sense the "distance" from $q$ to $p$, although it is not a true metric because it is not symmetric.
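The asymmetry of the divergence is easy to see numerically. The following is a minimal sketch for finite distributions given as equal-length lists; the function name `kl_divergence_bits` and the two example distributions are assumptions made for illustration.

```python
import math


def kl_divergence_bits(p, q):
    """D_KL(p || q) in shannons for two distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue          # 0 log 0 is taken as 0
        if q_i == 0.0:
            return math.inf   # q rules out an outcome that p allows
        total += p_i * math.log2(p_i / q_i)
    return total


p = [0.5, 0.5]   # "true" distribution
q = [0.9, 0.1]   # assumed distribution
d_pq = kl_divergence_bits(p, q)   # about 0.737 shannons
d_qp = kl_divergence_bits(q, p)   # about 0.531 shannons, a different value
d_pp = kl_divergence_bits(p, p)   # 0: no penalty when the model is exact
```

In the coding interpretation, `d_pq` is the average excess over the entropy of `p` (1 shannon here) paid per symbol when a code optimized for `q` is used on data drawn from `p`.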

Mutual information (transinformation)

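It turns out that one of the most useful and important measures of information is the mutual information, or transinformation. This is a measure of how much information can be obtained about one random variable by observing another. The mutual information of $X$ relative to $Y$ (which represents conceptually the average amount of information about $X$ that can be gained by observing $Y$) is given by:

$$\operatorname{I}(X;Y) = \sum_{y \in Y} p(y)\sum_{x \in X} p(x|y)\log\frac{p(x|y)}{p(x)} = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$$

A basic property of the mutual information is that:

$$\operatorname{I}(X;Y) = \mathrm{H}(X) - \mathrm{H}(X|Y)$$

That is, knowing $Y$, we can save an average of $\operatorname{I}(X;Y)$ bits in encoding $X$ compared to not knowing $Y$. Mutual information is symmetric:

$$\operatorname{I}(X;Y) = \operatorname{I}(Y;X) = \mathrm{H}(X) + \mathrm{H}(Y) - \mathrm{H}(X,Y)$$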

Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) of the posterior probability distribution of $X$ given the value of $Y$ to the prior distribution on $X$:

$$\operatorname{I}(X;Y) = \mathbb{E}_{p(y)}\bigl[D_{\mathrm{KL}}\bigl(p(X|Y=y) \,\|\, p(X)\bigr)\bigr]$$

In other words, this is a measure of how much, on the average, the probability distribution on $X$ will change if we are given the value of $Y$. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:

$$\operatorname{I}(X;Y) = D_{\mathrm{KL}}\bigl(p(X,Y) \,\|\, p(X)\,p(Y)\bigr)$$
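The different expressions for mutual information can be compared on a small joint table. This sketch computes the divergence of the joint distribution from the product of its marginals and checks it against the entropy form $\mathrm{H}(X) + \mathrm{H}(Y) - \mathrm{H}(X,Y)$; the table layout `joint[i][j]` for $p(x_i, y_j)$, the function names, and the example values are assumptions.

```python
import math


def entropy_bits(probs):
    """Entropy in shannons of a flat list of probabilities (0 log 0 taken as 0)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)


def mutual_information_bits(joint):
    """I(X;Y) in shannons as the KL divergence of p(x,y) from p(x) * p(y)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(
        p_xy * math.log2(p_xy / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, p_xy in enumerate(row)
        if p_xy > 0.0
    )


joint = [[0.25, 0.25],
         [0.00, 0.50]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

mi_direct = mutual_information_bits(joint)                   # about 0.311 shannons
mi_from_entropies = (entropy_bits(p_x) + entropy_bits(p_y)
                     - entropy_bits([p for row in joint for p in row]))

# Swapping the roles of X and Y (transposing the table) leaves the value unchanged.
transposed = [list(col) for col in zip(*joint)]
mi_swapped = mutual_information_bits(transposed)
```

For an independent pair the joint table equals the product of its marginals, every logarithm is zero, and the mutual information vanishes.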

Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.

Differential entropy

The basic measures of discrete entropy have been extended by analogy to continuous spaces by replacing sums with integrals and probability mass functions with probability density functions. Although, in both cases, mutual information expresses the number of bits of information common to the two sources in question, the analogy does not imply identical properties; for example, differential entropy may be negative.

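The differential analogues of entropy, joint entropy, conditional entropy, and mutual information are defined as follows:

$$h(X) = -\int_{X} f(x)\log f(x)\,dx$$

$$h(X,Y) = -\int_{Y}\int_{X} f(x,y)\log f(x,y)\,dx\,dy$$

$$h(X|y) = -\int_{X} f(x|y)\log f(x|y)\,dx$$

$$h(X|Y) = \int_{Y}\int_{X} f(x,y)\log\frac{f(y)}{f(x,y)}\,dx\,dy$$

$$\operatorname{I}(X;Y) = \int_{Y}\int_{X} f(x,y)\log\frac{f(x,y)}{f(x)\,f(y)}\,dx\,dy$$

where $f(x,y)$ is the joint density function, $f(x)$ and $f(y)$ are the marginal distributions, and $f(x|y)$ is the conditional distribution.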
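A short worked example shows how differential entropy can be negative. It assumes a random variable $X$ uniformly distributed on the interval $[0, a]$, so that its density is $f(x) = 1/a$ on that interval; this distribution is chosen only for illustration.

```latex
\begin{align*}
h(X) &= -\int_{0}^{a} \frac{1}{a}\,\log\frac{1}{a}\,dx \\
     &= -\log\frac{1}{a} \\
     &= \log a .
\end{align*}
```

For $a = 1/2$ and a base-2 logarithm this gives $h(X) = -1$ shannon, and the value is negative for every $a < 1$. No discrete entropy can behave this way, since each term $-p \log p$ is non-negative for $0 \le p \le 1$.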

