Differential entropy

Differential entropy (also referred to as continuous entropy) is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy, a measure of the average surprisal of a random variable, to continuous probability distributions. Shannon did not derive this formula; he assumed it was the correct continuous analogue of discrete entropy, but it is not. [1] :181–218 The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy (described here) is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.

In terms of measure theory, the differential entropy of a probability measure is the negative relative entropy from that measure to the Lebesgue measure, where the latter is treated as if it were a probability measure, despite being unnormalized.

Definition

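Let $X$ be a random variable with a probability density function $f$ whose support is a set $\mathcal{X}$. The differential entropy $h(X)$ or $h(f)$ is defined as [2] :243

$$h(X) = \operatorname{E}[-\log f(X)] = -\int_{\mathcal{X}} f(x) \log f(x)\,dx.$$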

For probability distributions which do not have an explicit density function expression but do have an explicit quantile function expression $Q(p)$, $h(Q)$ can be defined in terms of the derivative of $Q(p)$, i.e. the quantile density function $Q'(p)$, as [3] :54–59

$$h(Q) = \int_0^1 \log Q'(p)\,dp.$$
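As a rough numerical illustration, the two definitions can be compared on a distribution for which both the density and the quantile function are explicit. The sketch below assumes the standard logistic distribution, with $f(x) = e^{-x}/(1+e^{-x})^2$ and $Q(p) = \log(p/(1-p))$; the helper names, the truncation interval and the step count are illustrative choices, not part of the definitions.

```python
import math

def midpoint_integral(func, lower, upper, steps=200_000):
    """Approximate the integral of func over [lower, upper] with the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))

def logistic_log_pdf(x):
    # log f(x) for f(x) = e^{-x} / (1 + e^{-x})^2, kept in log form
    return -x - 2.0 * math.log1p(math.exp(-x))

def entropy_from_density():
    # h(X) = -∫ f(x) log f(x) dx, truncated to [-40, 40] where the tails are negligible
    integrand = lambda x: -math.exp(logistic_log_pdf(x)) * logistic_log_pdf(x)
    return midpoint_integral(integrand, -40.0, 40.0)

def entropy_from_quantile():
    # h(Q) = ∫_0^1 log Q'(p) dp with Q(p) = log(p / (1 - p)), so Q'(p) = 1 / (p (1 - p))
    integrand = lambda p: -math.log(p) - math.log(1.0 - p)
    return midpoint_integral(integrand, 0.0, 1.0)

print(entropy_from_density(), entropy_from_quantile())  # both close to 2 nats
```

Both integrals use natural logarithms, so the results are in nats.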

As with its discrete analog, the units of differential entropy depend on the base of the logarithm, which is usually 2 (i.e., the units are bits). See logarithmic units for logarithms taken in different bases. Related concepts such as joint, conditional differential entropy, and relative entropy are defined in a similar fashion. Unlike the discrete analog, the differential entropy has an offset that depends on the units used to measure $X$. [4] :183–184 For example, the differential entropy of a quantity measured in millimeters will be log(1000) more than the same quantity measured in meters; a dimensionless quantity will have differential entropy of log(1000) more than the same quantity divided by 1000.
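The unit offset can be checked with a short sketch. It assumes a normally distributed length, whose differential entropy in bits is $\tfrac{1}{2}\log_2(2\pi e\sigma^2)$; the standard deviation used here is a hypothetical value.

```python
import math

def gaussian_entropy_bits(sigma):
    """Differential entropy of a normal distribution, in bits: 0.5 * log2(2*pi*e*sigma^2)."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * sigma ** 2)

sigma_m = 0.25                # hypothetical standard deviation in meters
sigma_mm = sigma_m * 1000.0   # the same spread expressed in millimeters

offset = gaussian_entropy_bits(sigma_mm) - gaussian_entropy_bits(sigma_m)
print(offset, math.log2(1000.0))  # the two values agree (about 9.97 bits)
```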

One must take care in trying to apply properties of discrete entropy to differential entropy, since probability density functions can be greater than 1. For example, the uniform distribution $\mathcal{U}(0, 1/2)$ has negative differential entropy; i.e., it is better ordered than $\mathcal{U}(0, 1)$, as shown by

$$\int_0^{1/2} -2\log(2)\,dx = -\log(2),$$

which is less than that of $\mathcal{U}(0, 1)$, which has zero differential entropy. Thus, differential entropy does not share all properties of discrete entropy.

The continuous mutual information $I(X;Y)$ has the distinction of retaining its fundamental significance as a measure of discrete information, since it is actually the limit of the discrete mutual information of partitions of $X$ and $Y$ as these partitions become finer and finer. Thus it is invariant under non-linear homeomorphisms (continuous and uniquely invertible maps), [5] including linear [6] transformations of $X$ and $Y$, and still represents the amount of discrete information that can be transmitted over a channel that admits a continuous space of values.
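A minimal sketch of this invariance, restricted to the simplest case of linear rescaling and assuming a jointly Gaussian pair (for which every entropy involved has a closed form). The covariance entries and scale factors are hypothetical.

```python
import math

def gaussian_entropy(cov_det, dim):
    """Entropy in nats of a dim-dimensional Gaussian with covariance determinant cov_det."""
    return 0.5 * math.log((2.0 * math.pi * math.e) ** dim * cov_det)

def gaussian_mutual_information(var_x, var_y, cov_xy):
    # I(X;Y) = h(X) + h(Y) - h(X,Y) for a jointly Gaussian pair
    joint_det = var_x * var_y - cov_xy ** 2
    return (gaussian_entropy(var_x, 1) + gaussian_entropy(var_y, 1)
            - gaussian_entropy(joint_det, 2))

var_x, var_y, cov_xy = 1.0, 4.0, 1.2   # hypothetical covariance entries
a, b = 1000.0, 0.01                    # rescale X -> aX and Y -> bY

before = gaussian_mutual_information(var_x, var_y, cov_xy)
after = gaussian_mutual_information(a**2 * var_x, b**2 * var_y, a * b * cov_xy)
print(before, after)  # equal up to rounding, although h(X) and h(Y) both shift
```

The individual entropies change by $\log|a|$ and $\log|b|$, but these offsets cancel in the difference that defines the mutual information.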

For the direct analogue of discrete entropy extended to the continuous space, see limiting density of discrete points.

Properties of differential entropy

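Differential entropy is translation invariant, i.e. for a constant $c$, $h(X + c) = h(X)$.
Differential entropy is in general not invariant under arbitrary invertible maps. In particular, for a constant $a$, $h(aX) = h(X) + \log|a|$.
For a vector valued random variable $\mathbf{X}$ and an invertible (square) matrix $\mathbf{A}$, $h(\mathbf{A}\mathbf{X}) = h(\mathbf{X}) + \log\left(|\det \mathbf{A}|\right)$. [2] :253
In general, for a transformation from a random vector to another random vector of the same dimension, $\mathbf{Y} = m(\mathbf{X})$, the corresponding entropies are related via
$$h(\mathbf{Y}) \le h(\mathbf{X}) + \int f(x) \log\left|\frac{\partial m}{\partial x}\right| dx,$$
where $\left|\frac{\partial m}{\partial x}\right|$ is the absolute value of the Jacobian determinant of the transformation $m$. [7] The above inequality becomes an equality if the transform is a bijection. Furthermore, when $m$ is a rigid rotation, translation, or combination thereof, the Jacobian determinant is always 1, and $h(\mathbf{Y}) = h(\mathbf{X})$.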
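The matrix scaling rule $h(\mathbf{A}\mathbf{X}) = h(\mathbf{X}) + \log|\det \mathbf{A}|$ can be checked on a bivariate normal, for which $h(\mathbf{X}) = \tfrac{1}{2}\log\left((2\pi e)^2 \det K\right)$ and the covariance of $\mathbf{A}\mathbf{X}$ is $\mathbf{A} K \mathbf{A}^T$. The following is a sketch under that Gaussian assumption; the matrices `K` and `A` are hypothetical.

```python
import math

def matmul(p, q):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def gaussian_entropy_2d(cov):
    # h(X) = 0.5 * log((2*pi*e)^2 * det(K)) for a bivariate normal with covariance K
    return 0.5 * math.log((2.0 * math.pi * math.e) ** 2 * det2(cov))

K = [[2.0, 0.5], [0.5, 1.0]]     # hypothetical covariance of X
A = [[3.0, 1.0], [0.0, 0.5]]     # hypothetical invertible matrix
A_T = [[A[j][i] for j in range(2)] for i in range(2)]

cov_AX = matmul(matmul(A, K), A_T)   # covariance of AX is A K A^T
lhs = gaussian_entropy_2d(cov_AX)
rhs = gaussian_entropy_2d(K) + math.log(abs(det2(A)))
print(lhs, rhs)  # the two sides agree
```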

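However, differential entropy lacks other desirable properties: it is not invariant under a change of variables, and is therefore most useful with dimensionless variables, and it can be negative.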

A modification of differential entropy that addresses these drawbacks is the relative information entropy, also known as the Kullback–Leibler divergence, which includes an invariant measure factor (see limiting density of discrete points).

Maximization in the normal distribution

Theorem

With a normal distribution, differential entropy is maximized for a given variance. A Gaussian random variable has the largest entropy amongst all random variables of equal variance, or, alternatively, the maximum entropy distribution under constraints of mean and variance is the Gaussian. [2] :255

Proof

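Let $g(x)$ be a Gaussian PDF with mean $\mu$ and variance $\sigma^2$, and $f(x)$ an arbitrary PDF with the same variance. Since differential entropy is translation invariant, we can assume that $f(x)$ has the same mean $\mu$ as $g(x)$.

Consider the Kullback–Leibler divergence between the two distributions

$$0 \le D_{KL}(f \parallel g) = \int_{-\infty}^{\infty} f(x) \log\left(\frac{f(x)}{g(x)}\right) dx = -h(f) - \int_{-\infty}^{\infty} f(x) \log(g(x))\,dx.$$

Now note that

$$\begin{aligned}
\int_{-\infty}^{\infty} f(x) \log(g(x))\,dx &= \int_{-\infty}^{\infty} f(x) \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right) dx \\
&= \int_{-\infty}^{\infty} f(x) \log\frac{1}{\sqrt{2\pi\sigma^2}}\,dx + \log(e) \int_{-\infty}^{\infty} f(x) \left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx \\
&= -\tfrac{1}{2}\log(2\pi\sigma^2) - \log(e)\frac{\sigma^2}{2\sigma^2} \\
&= -\tfrac{1}{2}\left(\log(2\pi\sigma^2) + \log(e)\right) \\
&= -\tfrac{1}{2}\log(2\pi e\sigma^2) = -h(g)
\end{aligned}$$

because the result does not depend on $f(x)$ other than through the variance. Combining the two results yields

$$h(g) - h(f) \ge 0,$$

with equality when $f(x) = g(x)$, following from the properties of Kullback–Leibler divergence.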
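The theorem can be illustrated by comparing closed-form entropies of a few distributions scaled to share one variance. The sketch below assumes the standard results for the Laplace distribution (variance $2b^2$, entropy $1 + \ln 2b$) and the uniform distribution (variance $w^2/12$ for width $w$, entropy $\ln w$); the value of `sigma` is arbitrary.

```python
import math

def gaussian_entropy(sigma):
    """Entropy in nats of a normal distribution with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def laplace_entropy(sigma):
    # A Laplace distribution with scale b has variance 2*b^2 and entropy 1 + log(2b)
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)

def uniform_entropy(sigma):
    # A uniform distribution of width w has variance w^2 / 12 and entropy log(w)
    width = sigma * math.sqrt(12.0)
    return math.log(width)

sigma = 1.0
entropies = {
    "normal": gaussian_entropy(sigma),
    "laplace": laplace_entropy(sigma),
    "uniform": uniform_entropy(sigma),
}
print(entropies)  # the normal entry is the largest of the three
```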

Alternative proof

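This result may also be demonstrated using the calculus of variations. A Lagrangian function with two Lagrangian multipliers may be defined as:

$$L = \int_{-\infty}^{\infty} g(x) \ln(g(x))\,dx - \lambda_0 \left(1 - \int_{-\infty}^{\infty} g(x)\,dx\right) - \lambda \left(\sigma^2 - \int_{-\infty}^{\infty} g(x)(x-\mu)^2\,dx\right)$$

where $g(x)$ is some function with mean $\mu$. When the entropy of $g(x)$ is at a maximum and the constraint equations, which consist of the normalization condition $\left(1 = \int_{-\infty}^{\infty} g(x)\,dx\right)$ and the requirement of fixed variance $\left(\sigma^2 = \int_{-\infty}^{\infty} g(x)(x-\mu)^2\,dx\right)$, are both satisfied, then a small variation $\delta g(x)$ about $g(x)$ will produce a variation $\delta L$ about $L$ which is equal to zero:

$$0 = \delta L = \int_{-\infty}^{\infty} \delta g(x) \left(\ln(g(x)) + 1 + \lambda_0 + \lambda (x-\mu)^2\right) dx$$

Since this must hold for any small $\delta g(x)$, the term in brackets must be zero, and solving for $g(x)$ yields:

$$g(x) = e^{-\lambda_0 - 1 - \lambda (x-\mu)^2}$$

Using the constraint equations to solve for $\lambda_0$ and $\lambda$ yields the normal distribution:

$$g(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$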
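The last step can be spelled out as a short worked calculation. It assumes the stationary form $g(x) = e^{-\lambda_0 - 1 - \lambda (x-\mu)^2}$ with $\lambda > 0$ and uses the standard Gaussian integrals $\int e^{-\lambda u^2}\,du = \sqrt{\pi/\lambda}$ and $\int u^2 e^{-\lambda u^2}\,du = \frac{1}{2\lambda}\sqrt{\pi/\lambda}$:

```latex
\begin{aligned}
1 &= \int_{-\infty}^{\infty} g(x)\,dx = e^{-\lambda_0 - 1} \sqrt{\frac{\pi}{\lambda}}, \\
\sigma^2 &= \int_{-\infty}^{\infty} g(x)(x-\mu)^2\,dx = e^{-\lambda_0 - 1} \frac{1}{2\lambda} \sqrt{\frac{\pi}{\lambda}} = \frac{1}{2\lambda}, \\
\lambda &= \frac{1}{2\sigma^2}, \qquad e^{-\lambda_0 - 1} = \sqrt{\frac{\lambda}{\pi}} = \frac{1}{\sqrt{2\pi\sigma^2}}.
\end{aligned}
```

Substituting these two values back into the stationary form gives the normal density with mean $\mu$ and variance $\sigma^2$.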

Example: Exponential distribution

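Let $X$ be an exponentially distributed random variable with parameter $\lambda$, that is, with probability density function

$$f(x) = \lambda e^{-\lambda x} \text{ for } x \ge 0.$$

Its differential entropy is then

$$\begin{aligned}
h_e(X) &= -\int_0^{\infty} \lambda e^{-\lambda x} \log\left(\lambda e^{-\lambda x}\right) dx \\
&= -\left(\int_0^{\infty} (\log\lambda)\,\lambda e^{-\lambda x}\,dx + \int_0^{\infty} (-\lambda x)\,\lambda e^{-\lambda x}\,dx\right) \\
&= -\log\lambda \int_0^{\infty} f(x)\,dx + \lambda \operatorname{E}[X] \\
&= -\log\lambda + 1.
\end{aligned}$$

Here, $h_e(X)$ was used rather than $h(X)$ to make it explicit that the logarithm was taken to base $e$, to simplify the calculation.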
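The closed form $1 - \ln\lambda$ can be compared against a direct numerical evaluation of the defining integral. This is a sketch only: the midpoint rule, the truncation point and the sample rates are illustrative assumptions.

```python
import math

def exponential_entropy_numeric(rate, steps=200_000):
    """Midpoint-rule estimate of -∫ f(x) ln f(x) dx for f(x) = rate * exp(-rate * x)."""
    upper = 50.0 / rate              # truncation point; the tail beyond it is negligible
    width = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        log_f = math.log(rate) - rate * x
        total -= math.exp(log_f) * log_f * width
    return total

for rate in (0.5, 1.0, 4.0):
    # numerical estimate next to the closed form 1 - ln(rate)
    print(rate, exponential_entropy_numeric(rate), 1.0 - math.log(rate))
```

For rates above $e$ the entropy is negative, which is consistent with the earlier remark that differential entropy need not be positive.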

Relation to estimator error

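The differential entropy yields a lower bound on the expected squared error of an estimator. For any random variable $X$ and estimator $\widehat{X}$ the following holds: [2]

$$\operatorname{E}\left[(X - \widehat{X})^2\right] \ge \frac{1}{2\pi e} e^{2h(X)}$$

with equality if and only if $X$ is a Gaussian random variable and $\widehat{X}$ is the mean of $X$.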
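A small sketch of the bound, with the entropy in nats. It assumes two textbook cases: a normal variable estimated by its mean (where the bound is attained) and a uniform variable on $[0, 1]$, which has $h(X) = 0$ and variance $1/12$. The function name and the value of `sigma` are illustrative.

```python
import math

def entropy_power_bound(h_nats):
    """Lower bound on E[(X - X_hat)^2]: exp(2h) / (2*pi*e), with h in nats."""
    return math.exp(2.0 * h_nats) / (2.0 * math.pi * math.e)

# Normal with variance sigma^2: the mean attains the bound exactly.
sigma = 1.5
normal_bound = entropy_power_bound(0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2))
normal_mse = sigma ** 2              # error of the estimator X_hat = E[X]

# Uniform on [0, 1]: h = 0, and the best constant estimator (the mean) has error 1/12.
uniform_bound = entropy_power_bound(0.0)
uniform_mse = 1.0 / 12.0

print(normal_bound, normal_mse)      # equal up to rounding
print(uniform_bound, uniform_mse)    # the bound is strictly smaller
```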

Differential entropies for various distributions

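In the table below $\Gamma(x) = \int_0^{\infty} e^{-t} t^{x-1}\,dt$ is the gamma function, $\psi(x) = \frac{d}{dx} \ln\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}$ is the digamma function, $B(p,q) = \frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q)}$ is the beta function, and $\gamma_E$ is Euler's constant. [8] :219–230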

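Table of differential entropies

| Distribution name | Probability density function (pdf) | Differential entropy in nats | Support |
|---|---|---|---|
| Uniform | $f(x) = \frac{1}{b-a}$ | $\ln(b-a)$ | $[a, b]$ |
| Normal | $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | $\ln\left(\sigma\sqrt{2\pi e}\right)$ | $(-\infty, \infty)$ |
| Exponential | $f(x) = \lambda \exp(-\lambda x)$ | $1 - \ln\lambda$ | $[0, \infty)$ |
| Rayleigh | $f(x) = \frac{x}{\sigma^2} \exp\left(-\frac{x^2}{2\sigma^2}\right)$ | $1 + \ln\frac{\sigma}{\sqrt{2}} + \frac{\gamma_E}{2}$ | $[0, \infty)$ |
| Beta | $f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$ for $0 \le x \le 1$ | $\ln B(\alpha,\beta) - (\alpha-1)[\psi(\alpha) - \psi(\alpha+\beta)] - (\beta-1)[\psi(\beta) - \psi(\alpha+\beta)]$ | $[0, 1]$ |
| Cauchy | $f(x) = \frac{\gamma}{\pi} \frac{1}{\gamma^2 + x^2}$ | $\ln(4\pi\gamma)$ | $(-\infty, \infty)$ |
| Chi | $f(x) = \frac{2}{2^{k/2}\Gamma(k/2)} x^{k-1} \exp\left(-\frac{x^2}{2}\right)$ | $\ln\frac{\Gamma(k/2)}{\sqrt{2}} - \frac{k-1}{2}\psi\left(\frac{k}{2}\right) + \frac{k}{2}$ | $[0, \infty)$ |
| Chi-squared | $f(x) = \frac{1}{2^{k/2}\Gamma(k/2)} x^{\frac{k}{2}-1} \exp\left(-\frac{x}{2}\right)$ | $\ln\left(2\Gamma\left(\frac{k}{2}\right)\right) + \left(1 - \frac{k}{2}\right)\psi\left(\frac{k}{2}\right) + \frac{k}{2}$ | $[0, \infty)$ |
| Erlang | $f(x) = \frac{\lambda^k}{(k-1)!} x^{k-1} \exp(-\lambda x)$ | $(1-k)\psi(k) + \ln\frac{\Gamma(k)}{\lambda} + k$ | $[0, \infty)$ |
| F | $f(x) = \frac{n_1^{n_1/2} n_2^{n_2/2}}{B\left(\frac{n_1}{2},\frac{n_2}{2}\right)} \frac{x^{\frac{n_1}{2}-1}}{(n_2 + n_1 x)^{\frac{n_1+n_2}{2}}}$ | $\ln\left(\frac{n_2}{n_1} B\left(\frac{n_1}{2},\frac{n_2}{2}\right)\right) + \left(1 - \frac{n_1}{2}\right)\psi\left(\frac{n_1}{2}\right) - \left(1 + \frac{n_2}{2}\right)\psi\left(\frac{n_2}{2}\right) + \frac{n_1+n_2}{2}\psi\left(\frac{n_1+n_2}{2}\right)$ | $[0, \infty)$ |
| Gamma | $f(x) = \frac{x^{k-1} \exp(-x/\theta)}{\theta^k \Gamma(k)}$ | $\ln(\theta\Gamma(k)) + (1-k)\psi(k) + k$ | $[0, \infty)$ |
| Laplace | $f(x) = \frac{1}{2b} \exp\left(-\frac{\lvert x-\mu\rvert}{b}\right)$ | $1 + \ln(2b)$ | $(-\infty, \infty)$ |
| Logistic | $f(x) = \frac{e^{-x}}{(1+e^{-x})^2}$ | $2$ | $(-\infty, \infty)$ |
| Lognormal | $f(x) = \frac{1}{\sigma x \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$ | $\mu + \frac{1}{2}\ln(2\pi e\sigma^2)$ | $[0, \infty)$ |
| Maxwell–Boltzmann | $f(x) = \frac{1}{a^3}\sqrt{\frac{2}{\pi}}\, x^2 \exp\left(-\frac{x^2}{2a^2}\right)$ | $\ln(a\sqrt{2\pi}) + \gamma_E - \frac{1}{2}$ | $[0, \infty)$ |
| Generalized normal | $f(x) = \frac{2\beta^{\alpha/2}}{\Gamma(\alpha/2)} x^{\alpha-1} \exp(-\beta x^2)$ | $\ln\frac{\Gamma(\alpha/2)}{2\beta^{1/2}} - \frac{\alpha-1}{2}\psi\left(\frac{\alpha}{2}\right) + \frac{\alpha}{2}$ | $[0, \infty)$ |
| Pareto | $f(x) = \frac{a x_m^a}{x^{a+1}}$ | $\ln\frac{x_m}{a} + 1 + \frac{1}{a}$ | $[x_m, \infty)$ |
| Student's t | $f(x) = \frac{(1 + x^2/\nu)^{-\frac{\nu+1}{2}}}{\sqrt{\nu}\, B\left(\frac{1}{2},\frac{\nu}{2}\right)}$ | $\frac{\nu+1}{2}\left[\psi\left(\frac{\nu+1}{2}\right) - \psi\left(\frac{\nu}{2}\right)\right] + \ln\left(\sqrt{\nu}\, B\left(\frac{1}{2},\frac{\nu}{2}\right)\right)$ | $(-\infty, \infty)$ |
| Triangular | $f(x) = \begin{cases} \frac{2(x-a)}{(b-a)(c-a)} & a \le x \le c \\ \frac{2(b-x)}{(b-a)(b-c)} & c < x \le b \end{cases}$ | $\frac{1}{2} + \ln\frac{b-a}{2}$ | $[a, b]$ |
| Weibull | $f(x) = \frac{k}{\lambda^k} x^{k-1} \exp\left(-\frac{x^k}{\lambda^k}\right)$ | $\frac{(k-1)\gamma_E}{k} + \ln\frac{\lambda}{k} + 1$ | $[0, \infty)$ |
| Multivariate normal | $f(\vec{x}) = \frac{\exp\left(-\frac{1}{2}(\vec{x}-\vec{\mu})^{\top}\Sigma^{-1}(\vec{x}-\vec{\mu})\right)}{(2\pi)^{N/2} \det(\Sigma)^{1/2}}$ | $\frac{1}{2}\ln\left\{(2\pi e)^N \det(\Sigma)\right\}$ | $\mathbb{R}^N$ |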

Many of the differential entropies are from Lazo and Rathie. [9] :120–122
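Closed-form entries of this kind can be spot-checked numerically. The sketch below does so for the Rayleigh distribution, assuming the density $f(x) = \frac{x}{\sigma^2} e^{-x^2/(2\sigma^2)}$ and the closed form $1 + \ln\frac{\sigma}{\sqrt{2}} + \frac{\gamma_E}{2}$ in nats; the integration scheme, the truncation point and the value of `sigma` are illustrative assumptions.

```python
import math

EULER_GAMMA = 0.5772156649015329   # Euler's constant

def rayleigh_entropy_closed_form(sigma):
    """Closed-form Rayleigh entropy in nats."""
    return 1.0 + math.log(sigma / math.sqrt(2.0)) + EULER_GAMMA / 2.0

def rayleigh_entropy_numeric(sigma, steps=200_000):
    """Midpoint-rule estimate of -∫ f ln f for f(x) = x/sigma^2 * exp(-x^2 / (2 sigma^2))."""
    upper = 12.0 * sigma             # truncation point; the tail beyond it is negligible
    width = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width        # midpoints avoid evaluating log(0) at x = 0
        log_f = math.log(x) - 2.0 * math.log(sigma) - x * x / (2.0 * sigma ** 2)
        total -= math.exp(log_f) * log_f * width
    return total

sigma = 2.0
print(rayleigh_entropy_numeric(sigma), rayleigh_entropy_closed_form(sigma))
```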

Variants

As described above, differential entropy does not share all properties of discrete entropy. For example, the differential entropy can be negative; also it is not invariant under continuous coordinate transformations. Edwin Thompson Jaynes showed in fact that the expression above is not the correct limit of the expression for a finite set of probabilities. [10] :181–218

A modification of differential entropy adds an invariant measure factor $m(x)$ to correct this (see limiting density of discrete points). If $m(x)$ is further constrained to be a probability density, the resulting notion is called relative entropy in information theory:

$$D(p \parallel m) = \int p(x) \log\frac{p(x)}{m(x)}\,dx.$$

The definition of differential entropy above can be obtained by partitioning the range of $X$ into bins of length $h$ with associated sample points $ih$ within the bins, for $X$ Riemann integrable. This gives a quantized version of $X$, defined by $X_h = ih$ if $ih \le X \le (i+1)h$. Then the entropy of $X_h$ is [2]

$$H_h = -\sum_i h f(ih) \log(f(ih)) - \sum_i h f(ih) \log(h).$$

The first term on the right approximates the differential entropy, while the second term is approximately $-\log(h)$. Note that this procedure suggests that the entropy in the discrete sense of a continuous random variable should be $\infty$.
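The quantization argument can be tried numerically. The sketch below assumes a standard normal variable, computes the exact bin probabilities from the normal CDF (rather than the Riemann approximation $h f(ih)$), and compares the discrete entropy of the quantized variable with $h(X) - \ln(\text{bin width})$; the bin widths and the truncation `span` are illustrative.

```python
import math

def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantized_entropy(bin_width, span=8.0):
    """Discrete entropy (nats) of a standard normal quantized into bins of length bin_width."""
    n_bins = int(round(2.0 * span / bin_width))
    total = 0.0
    for i in range(n_bins):
        left = -span + i * bin_width
        prob = normal_cdf(left + bin_width) - normal_cdf(left)
        if prob > 0.0:               # skip bins whose probability underflows to zero
            total -= prob * math.log(prob)
    return total

differential = 0.5 * math.log(2.0 * math.pi * math.e)   # h(X) for a standard normal
for bin_width in (0.5, 0.1, 0.02):
    discrete = quantized_entropy(bin_width)
    print(bin_width, discrete, differential - math.log(bin_width))
```

As the bins shrink, the two printed columns approach each other while both grow without bound, matching the remark that the discrete entropy of a continuous variable diverges.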

References

  1. Jaynes, E.T. (1963). "Information Theory And Statistical Mechanics" (PDF). Brandeis University Summer Institute Lectures in Theoretical Physics. 3 (sect. 4b).
  2. Cover, Thomas M.; Thomas, Joy A. (1991). Elements of Information Theory. New York: Wiley. ISBN 0-471-06259-6.
  3. Vasicek, Oldrich (1976). "A Test for Normality Based on Sample Entropy". Journal of the Royal Statistical Society, Series B. 38 (1): 54–59. JSTOR 2984828.
  4. Gibbs, Josiah Willard (1902). Elementary Principles in Statistical Mechanics, developed with especial reference to the rational foundation of thermodynamics. New York: Charles Scribner's Sons.
  5. Kraskov, Alexander; Stögbauer, Harald; Grassberger, Peter (2004). "Estimating mutual information". Physical Review E. 69 (6): 066138. arXiv:cond-mat/0305641. Bibcode:2004PhRvE..69f6138K. doi:10.1103/PhysRevE.69.066138. PMID 15244698. S2CID 1269438.
  6. Reza, Fazlollah M. (1994) [1961]. An Introduction to Information Theory. New York: Dover Publications, Inc. ISBN 0-486-68210-2.
  7. "proof of upper bound on differential entropy of f(X)". Stack Exchange. April 16, 2016.
  8. Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (PDF). Journal of Econometrics. 150 (2). Elsevier: 219–230. doi:10.1016/j.jeconom.2008.12.014. Archived from the original (PDF) on 2016-03-07. Retrieved 2011-06-02.
  9. Lazo, A.; Rathie, P. (1978). "On the entropy of continuous probability distributions". IEEE Transactions on Information Theory. 24 (1): 120–122. doi:10.1109/TIT.1978.1055832.
  10. Jaynes, E.T. (1963). "Information Theory And Statistical Mechanics" (PDF). Brandeis University Summer Institute Lectures in Theoretical Physics. 3 (sect. 4b).