Differential entropy

Last updated

Differential entropy (also referred to as continuous entropy) is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy, a measure of average (surprisal) of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. [1] :181–218 The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy (described here) is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.

Contents

In terms of measure theory, the differential entropy of a probability measure is the negative relative entropy from that measure to the Lebesgue measure, where the latter is treated as if it were a probability measure, despite being unnormalized.

Definition

Let be a random variable with a probability density function whose support is a set . The differential entropy or is defined as [2] :243

For probability distributions which do not have an explicit density function expression, but have an explicit quantile function expression, , then can be defined in terms of the derivative of i.e. the quantile density function as [3] :54–59

.

As with its discrete analog, the units of differential entropy depend on the base of the logarithm, which is usually 2 (i.e., the units are bits). See logarithmic units for logarithms taken in different bases. Related concepts such as joint, conditional differential entropy, and relative entropy are defined in a similar fashion. Unlike the discrete analog, the differential entropy has an offset that depends on the units used to measure . [4] :183–184 For example, the differential entropy of a quantity measured in millimeters will be log(1000) more than the same quantity measured in meters; a dimensionless quantity will have differential entropy of log(1000) more than the same quantity divided by 1000.

One must take care in trying to apply properties of discrete entropy to differential entropy, since probability density functions can be greater than 1. For example, the uniform distribution has negative differential entropy; i.e., it is better ordered than as shown now

being less than that of which has zero differential entropy. Thus, differential entropy does not share all properties of discrete entropy.

The continuous mutual information has the distinction of retaining its fundamental significance as a measure of discrete information since it is actually the limit of the discrete mutual information of partitions of and as these partitions become finer and finer. Thus it is invariant under non-linear homeomorphisms (continuous and uniquely invertible maps), [5] including linear [6] transformations of and , and still represents the amount of discrete information that can be transmitted over a channel that admits a continuous space of values.

For the direct analogue of discrete entropy extended to the continuous space, see limiting density of discrete points.

Properties of differential entropy

.
In particular, for a constant
For a vector valued random variable and an invertible (square) matrix
[2] :253
where is the Jacobian of the transformation . [7] The above inequality becomes an equality if the transform is a bijection. Furthermore, when is a rigid rotation, translation, or combination thereof, the Jacobian determinant is always 1, and .

However, differential entropy does not have other desirable properties:

A modification of differential entropy that addresses these drawbacks is the relative information entropy, also known as the Kullback–Leibler divergence, which includes an invariant measure factor (see limiting density of discrete points).

Maximization in the normal distribution

Theorem

With a normal distribution, differential entropy is maximized for a given variance. A Gaussian random variable has the largest entropy amongst all random variables of equal variance, or, alternatively, the maximum entropy distribution under constraints of mean and variance is the Gaussian. [2] :255

Proof

Let be a Gaussian PDF with mean μ and variance and an arbitrary PDF with the same variance. Since differential entropy is translation invariant we can assume that has the same mean of as .

Consider the Kullback–Leibler divergence between the two distributions

Now note that

because the result does not depend on other than through the variance. Combining the two results yields

with equality when following from the properties of Kullback–Leibler divergence.

Alternative proof

This result may also be demonstrated using the calculus of variations. A Lagrangian function with two Lagrangian multipliers may be defined as:

where g(x) is some function with mean μ. When the entropy of g(x) is at a maximum and the constraint equations, which consist of the normalization condition and the requirement of fixed variance , are both satisfied, then a small variation δg(x) about g(x) will produce a variation δL about L which is equal to zero:

Since this must hold for any small δg(x), the term in brackets must be zero, and solving for g(x) yields:

Using the constraint equations to solve for λ0 and λ yields the normal distribution:

Example: Exponential distribution

Let be an exponentially distributed random variable with parameter , that is, with probability density function

Its differential entropy is then

Here, was used rather than to make it explicit that the logarithm was taken to base e, to simplify the calculation.

Relation to estimator error

The differential entropy yields a lower bound on the expected squared error of an estimator. For any random variable and estimator the following holds: [2]

with equality if and only if is a Gaussian random variable and is the mean of .

Differential entropies for various distributions

In the table below is the gamma function, is the digamma function, is the beta function, and γE is Euler's constant. [8] :219–230

Table of differential entropies
Distribution NameProbability density function (pdf)Differential entropy in nats Support
Uniform
Normal
Exponential
Rayleigh
Beta for
Cauchy
Chi
Chi-squared
Erlang
F
Gamma
Laplace
Logistic
Lognormal
Maxwell–Boltzmann
Generalized normal
Pareto
Student's t
Triangular
Weibull
Multivariate normal

Many of the differential entropies are from. [9] :120–122

Variants

As described above, differential entropy does not share all properties of discrete entropy. For example, the differential entropy can be negative; also it is not invariant under continuous coordinate transformations. Edwin Thompson Jaynes showed in fact that the expression above is not the correct limit of the expression for a finite set of probabilities. [10] :181–218

A modification of differential entropy adds an invariant measure factor to correct this, (see limiting density of discrete points). If is further constrained to be a probability density, the resulting notion is called relative entropy in information theory:

The definition of differential entropy above can be obtained by partitioning the range of into bins of length with associated sample points within the bins, for Riemann integrable. This gives a quantized version of , defined by if . Then the entropy of is [2]

The first term on the right approximates the differential entropy, while the second term is approximately . Note that this procedure suggests that the entropy in the discrete sense of a continuous random variable should be .

See also

Related Research Articles

In physics, the cross section is a measure of the probability that a specific process will take place when some kind of radiant excitation intersects a localized phenomenon. For example, the Rutherford cross-section is a measure of probability that an alpha particle will be deflected by a given angle during an interaction with an atomic nucleus. Cross section is typically denoted σ (sigma) and is expressed in units of area, more specifically in barns. In a way, it can be thought of as the size of the object that the excitation must hit in order for the process to occur, but more exactly, it is a parameter of a stochastic process.

<span class="mw-page-title-main">Entropy (information theory)</span> Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , which takes values in the alphabet and is distributed according to :

<span class="mw-page-title-main">Normal distribution</span> Probability distribution

In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

<span class="mw-page-title-main">Probability density function</span> Function whose integral over a region describes the probability of an event occurring in that region

In probability theory, a probability density function (PDF), or density of an absolutely continuous random variable, is a function whose value at any given sample in the sample space can be interpreted as providing a relative likelihood that the value of the random variable would be equal to that sample. Probability density is the probability per unit length, in other words, while the absolute likelihood for a continuous random variable to take on any particular value is 0, the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would be close to one sample compared to the other sample.

<span class="mw-page-title-main">Fokker–Planck equation</span> Partial differential equation

In statistical mechanics and information theory, the Fokker–Planck equation is a partial differential equation that describes the time evolution of the probability density function of the velocity of a particle under the influence of drag forces and random forces, as in Brownian motion. The equation can be generalized to other observables as well. The Fokker-Planck equation has multiple applications in information theory, graph theory, data science, finance, economics etc.

In mathematics, Itô's lemma or Itô's formula is an identity used in Itô calculus to find the differential of a time-dependent function of a stochastic process. It serves as the stochastic calculus counterpart of the chain rule. It can be heuristically derived by forming the Taylor series expansion of the function up to its second derivatives and retaining terms up to first order in the time increment and second order in the Wiener process increment. The lemma is widely employed in mathematical finance, and its best known application is in the derivation of the Black–Scholes equation for option values.

In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the base form

<span class="mw-page-title-main">Jensen's inequality</span> Theorem of convex functions

In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906, building on an earlier proof of the same inequality for doubly-differentiable functions by Otto Hölder in 1889. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation; it is a simple corollary that the opposite is true of concave transformations.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. The terms "distribution" and "family" are often used loosely: specifically, an exponential family is a set of distributions, where the specific distribution varies with the parameter; however, a parametric family of distributions is often referred to as "a distribution", and the set of all exponential families is sometimes loosely referred to as "the" exponential family. They are distinct because they possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

<span class="mw-page-title-main">Separation of variables</span> Technique for solving differential equations

In mathematics, separation of variables is any of several methods for solving ordinary and partial differential equations, in which algebra allows one to rewrite an equation so that each of two variables occurs on a different side of the equation.

In statistics, the Bhattacharyya distance measures the similarity of two probability distributions. It is closely related to the Bhattacharyya coefficient, which is a measure of the amount of overlap between two statistical samples or populations.

In mathematical statistics, the Kullback–Leibler divergence, denoted , is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P. While it is a distance, it is not a metric, the most familiar type of distance: it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

In mathematics, probabilistic metric spaces are a generalization of metric spaces where the distance no longer takes values in the non-negative real numbers R0, but in distribution functions.

A ratio distribution is a probability distribution constructed as the distribution of the ratio of random variables having two other known distributions. Given two random variables X and Y, the distribution of the random variable Z that is formed as the ratio Z = X/Y is a ratio distribution.

In mathematics – specifically, in stochastic analysis – an Itô diffusion is a solution to a specific type of stochastic differential equation. That equation is similar to the Langevin equation used in physics to describe the Brownian motion of a particle subjected to a potential in a viscous fluid. Itô diffusions are named after the Japanese mathematician Kiyosi Itô.

In mathematics, the spectral theory of ordinary differential equations is the part of spectral theory concerned with the determination of the spectrum and eigenfunction expansion associated with a linear ordinary differential equation. In his dissertation, Hermann Weyl generalized the classical Sturm–Liouville theory on a finite closed interval to second order differential operators with singularities at the endpoints of the interval, possibly semi-infinite or infinite. Unlike the classical case, the spectrum may no longer consist of just a countable set of eigenvalues, but may also contain a continuous part. In this case the eigenfunction expansion involves an integral over the continuous part with respect to a spectral measure, given by the Titchmarsh–Kodaira formula. The theory was put in its final simplified form for singular differential equations of even degree by Kodaira and others, using von Neumann's spectral theorem. It has had important applications in quantum mechanics, operator theory and harmonic analysis on semisimple Lie groups.

<span class="mw-page-title-main">Normal-inverse-gamma distribution</span>

In probability theory and statistics, the normal-inverse-gamma distribution is a four-parameter family of multivariate continuous probability distributions. It is the conjugate prior of a normal distribution with unknown mean and variance.

Lagrangian field theory is a formalism in classical field theory. It is the field-theoretic analogue of Lagrangian mechanics. Lagrangian mechanics is used to analyze the motion of a system of discrete particles each with a finite number of degrees of freedom. Lagrangian field theory applies to continua and fields, which have an infinite number of degrees of freedom.

References

  1. Jaynes, E.T. (1963). "Information Theory And Statistical Mechanics" (PDF). Brandeis University Summer Institute Lectures in Theoretical Physics. 3 (sect. 4b).
  2. 1 2 3 4 5 6 7 8 Cover, Thomas M.; Thomas, Joy A. (1991). Elements of Information Theory . New York: Wiley. ISBN   0-471-06259-6.
  3. Vasicek, Oldrich (1976), "A Test for Normality Based on Sample Entropy", Journal of the Royal Statistical Society, Series B , 38 (1): 54–59, JSTOR   2984828.
  4. Gibbs, Josiah Willard (1902). Elementary Principles in Statistical Mechanics, developed with especial reference to the rational foundation of thermodynamics . New York: Charles Scribner's Sons.
  5. Kraskov, Alexander; Stögbauer, Grassberger (2004). "Estimating mutual information". Physical Review E . 60 (6): 066138. arXiv: cond-mat/0305641 . Bibcode:2004PhRvE..69f6138K. doi:10.1103/PhysRevE.69.066138. PMID   15244698. S2CID   1269438.
  6. Fazlollah M. Reza (1994) [1961]. An Introduction to Information Theory. Dover Publications, Inc., New York. ISBN   0-486-68210-2.
  7. "proof of upper bound on differential entropy of f(X)". Stack Exchange . April 16, 2016.
  8. Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (PDF). Journal of Econometrics. Elsevier. 150 (2): 219–230. doi:10.1016/j.jeconom.2008.12.014. Archived from the original (PDF) on 2016-03-07. Retrieved 2011-06-02.
  9. Lazo, A. and P. Rathie (1978). "On the entropy of continuous probability distributions". IEEE Transactions on Information Theory. 24 (1): 120–122. doi:10.1109/TIT.1978.1055832.
  10. Jaynes, E.T. (1963). "Information Theory And Statistical Mechanics" (PDF). Brandeis University Summer Institute Lectures in Theoretical Physics. 3 (sect. 4b).