Maximum entropy probability distribution

Last updated

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures), then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

Contents

Definition of entropy and differential entropy

If is a continuous random variable with probability density , then the differential entropy of is defined as [1] [2] [3]

If is a discrete random variable with distribution given by

then the entropy of is defined as

The seemingly divergent term is replaced by zero, whenever

This is a special case of more general forms described in the articles Entropy (information theory), Principle of maximum entropy, and differential entropy. In connection with maximum entropy distributions, this is the only one needed, because maximizing will also maximize the more general forms.

The base of the logarithm is not important, as long as the same one is used consistently: Change of base merely results in a rescaling of the entropy. Information theorists may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists often prefer the natural logarithm, resulting in a unit of "nat"s for the entropy.

However, the chosen measure is crucial, even though the typical use of the Lebesgue measure is often defended as a "natural" choice: Which measure is chosen determines the entropy and the consequent maximum entropy distribution.

Distributions with measured constants

Many statistical distributions of applicable interest are those for which the moments or other measurable quantities are constrained to be constants. The following theorem by Ludwig Boltzmann gives the form of the probability density under these constraints.

Continuous case

Suppose is a continuous, closed subset of the real numbers and we choose to specify measurable functions and numbers We consider the class of all real-valued random variables which are supported on (i.e. whose density function is zero outside of ) and which satisfy the moment conditions:

If there is a member in whose density function is positive everywhere in and if there exists a maximal entropy distribution for then its probability density has the following form:

where we assume that The constant and the Lagrange multipliers solve the constrained optimization problem with (which ensures that integrates to unity): [4]

Using the Karush–Kuhn–Tucker conditions, it can be shown that the optimization problem has a unique solution because the objective function in the optimization is concave in

Note that when the moment constraints are equalities (instead of inequalities), that is,

then the constraint condition can be dropped, which makes optimization over the Lagrange multipliers unconstrained.

Discrete case

Suppose is a (finite or infinite) discrete subset of the reals, and that we choose to specify functions and numbers We consider the class of all discrete random variables which are supported on and which satisfy the moment conditions

If there exists a member of class which assigns positive probability to all members of and if there exists a maximum entropy distribution for then this distribution has the following shape:

where we assume that and the constants solve the constrained optimization problem with [5]

Again as above, if the moment conditions are equalities (instead of inequalities), then the constraint condition is not present in the optimization.

Proof in the case of equality constraints

In the case of equality constraints, this theorem is proved with the calculus of variations and Lagrange multipliers. The constraints can be written as

We consider the functional

where and are the Lagrange multipliers. The zeroth constraint ensures the second axiom of probability. The other constraints are that the measurements of the function are given constants up to order . The entropy attains an extremum when the functional derivative is equal to zero:

Therefore, the extremal entropy probability distribution in this case must be of the form (),

remembering that . It can be verified that this is the maximal solution by checking that the variation around this solution is always negative.

Uniqueness of the maximum

Suppose are distributions satisfying the expectation-constraints. Letting and considering the distribution it is clear that this distribution satisfies the expectation-constraints and furthermore has as support From basic facts about entropy, it holds that Taking limits and respectively, yields

It follows that a distribution satisfying the expectation-constraints and maximising entropy must necessarily have full support — i. e. the distribution is almost everywhere strictly positive. It follows that the maximising distribution must be an internal point in the space of distributions satisfying the expectation-constraints, that is, it must be a local extreme. Thus it suffices to show that the local extreme is unique, in order to show both that the entropy-maximising distribution is unique (and this also shows that the local extreme is the global maximum).

Suppose are local extremes. Reformulating the above computations these are characterised by parameters via and similarly for where We now note a series of identities: Via 1the satisfaction of the expectation-constraints and utilising gradients / directional derivatives, one has

and similarly for Letting one obtains:

where for some Computing further one has

where is similar to the distribution above, only parameterised by Assuming that no non-trivial linear combination of the observables is almost everywhere (a.e.) constant, (which e.g. holds if the observables are independent and not a.e. constant), it holds that has non-zero variance, unless By the above equation it is thus clear, that the latter must be the case. Hence so the parameters characterising the local extrema are identical, which means that the distributions themselves are identical. Thus, the local extreme is unique and by the above discussion, the maximum is unique – provided a local extreme actually exists.

Caveats

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contain distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on R with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy. [lower-alpha 1] It is also possible that the expected value restrictions for the class C force the probability distribution to be zero in certain subsets of S. In that case our theorem doesn't apply, but one can work around this by shrinking the set S.

Examples

Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution has its own entropy. To see this, rewrite the density as and compare to the expression of the theorem above. By choosing to be the measurable function and

to be the constant, is the maximum entropy probability distribution under the constraint

.

Nontrivial examples are distributions that are subject to multiple constraints that are different from the assignment of the entropy. These are often found by starting with the same procedure and finding that can be separated into parts.

A table of examples of maximum entropy distributions is given in Lisman (1972) [6] and Park & Bera (2009). [7]

Uniform and piecewise uniform distributions

The uniform distribution on the interval [a,b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b], and thus the probability density is 0 outside of the interval. This uniform density can be related to Laplace's principle of indifference, sometimes called the principle of insufficient reason. More generally, if we are given a subdivision a=a0 < a1 < ... < ak = b of the interval [a,b] and probabilities p1,...,pk that add up to one, then we can consider the class of all continuous distributions such that

The density of the maximum entropy distribution for this class is constant on each of the intervals [aj-1,aj). The uniform distribution on the finite set {x1,...,xn} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.

Positive and specified mean: the exponential distribution

The exponential distribution, for which the density function is

is the maximum entropy distribution among all continuous distributions supported in [0,∞) that have a specified mean of 1/λ.

In the case of distributions supported on [0,∞), the maximum entropy distribution depends on relationships between the first and second moments. In specific cases, it may be the exponential distribution, or may be another distribution, or may be undefinable. [8]

Specified mean and variance: the normal distribution

The normal distribution N(μ,σ2), for which the density function is

has maximum entropy among all real-valued distributions supported on (∞,∞) with a specified variance σ2 (a particular moment). The same is true when the mean μ and the variance σ2 is specified (the first two moments), since entropy is translation invariant on (∞,∞). Therefore, the assumption of normality imposes the minimal prior structural constraint beyond these moments. (See the differential entropy article for a derivation.)

Discrete distributions with specified mean

Among all the discrete distributions supported on the set {x1,...,xn} with a specified mean μ, the maximum entropy distribution has the following shape:

where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.

For example, if a large number N of dice are thrown, and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x1,...,x6} = {1,...,6} and μ = S/N.

Finally, among all the discrete distributions supported on the infinite set with mean μ, the maximum entropy distribution has the shape:

where again the constants C and r were determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ. For example, in the case that xk = k, this gives

such that respective maximum entropy distribution is the geometric distribution.

Circular random variables

For a continuous random variable distributed about the unit circle, the Von Mises distribution maximizes the entropy when the real and imaginary parts of the first circular moment are specified [9] or, equivalently, the circular mean and circular variance are specified.

When the mean and variance of the angles modulo are specified, the wrapped normal distribution maximizes the entropy. [9]

Maximizer for specified mean, variance and skew

There exists an upper bound on the entropy of continuous random variables on with a specified mean, variance, and skew. However, there is no distribution which achieves this upper bound, because is unbounded when (see Cover & Thomas (2006: chapter 12)).

However, the maximum entropy is ε-achievable: a distribution's entropy can be arbitrarily close to the upper bound. Start with a normal distribution of the specified mean and variance. To introduce a positive skew, perturb the normal distribution upward by a small amount at a value many σ larger than the mean. The skewness, being proportional to the third moment, will be affected more than the lower order moments.

This is a special case of the general case in which the exponential of any odd-order polynomial in x will be unbounded on . For example, will likewise be unbounded on , but when the support is limited to a bounded or semi-bounded interval the upper entropy bound may be achieved (e.g. if x lies in the interval [0,] and λ< 0, the exponential distribution will result).

Maximizer for specified mean and deviation risk measure

Every distribution with log-concave density is a maximal entropy distribution with specified mean μ and deviation risk measure D . [10]

In particular, the maximal entropy distribution with specified mean and deviation is:

Other examples

In the table below, each listed distribution maximizes the entropy for a particular set of functional constraints listed in the third column, and the constraint that be included in the support of the probability density, which is listed in the fourth column. [6] [7]

Several listed examples (Bernoulli, geometric, exponential, Laplace, Pareto) are trivially true, because their associated constraints are equivalent to the assignment of their entropy. They are included anyway because their constraint is related to a common or easily measured quantity.

For reference, is the gamma function, is the digamma function, is the beta function, and is the Euler-Mascheroni constant.

Table of probability distributions and corresponding maximum entropy constraints
Distribution nameProbability density / mass functionMaximum Entropy constraintSupport
Uniform (discrete) None
Uniform (continuous) None
Bernoulli
Geometric
Exponential
Laplace
Asymmetric Laplace
where
Pareto
Normal
Truncated normal (see article)
von Mises
Rayleigh
Beta for
Cauchy
Chi
Chi-squared
Erlang
Gamma
Lognormal
Maxwell–Boltzmann
Weibull
Multivariate normal

Binomial
n-generalized binomial distribution [11]
Poisson
-generalized binomial distribution} [11]
Logistic

The maximum entropy principle can be used to upper bound the entropy of statistical mixtures. [12]

See also

Notes

  1. For example, the class of all continuous distributions X on R with E(X) = 0 and E(X2) = E(X3) = 1 (see Cover, Ch 12).

Citations

  1. Williams, D. (2001). Weighing the Odds. Cambridge University Press. pp. 197–199. ISBN   0-521-00618-X.
  2. Bernardo, J.M.; Smith, A.F.M. (2000). Bayesian Theory. Wiley. pp. 209, 366. ISBN   0-471-49464-X.
  3. O'Hagan, A. (1994), Bayesian Inference. Kendall's Advanced Theory of Statistics. Vol. 2B. Edward Arnold. section 5.40. ISBN   0-340-52922-9.
  4. Botev, Z.I.; Kroese, D.P. (2011). "The generalized cross entropy method, with applications to probability density estimation" (PDF). Methodology and Computing in Applied Probability. 13 (1): 1–27. doi:10.1007/s11009-009-9133-7. S2CID   18155189.
  5. Botev, Z.I.; Kroese, D.P. (2008). "Non-asymptotic bandwidth selection for density estimation of discrete data". Methodology and Computing in Applied Probability. 10 (3): 435. doi:10.1007/s11009-007-9057-zv. S2CID   122047337.
  6. 1 2 3 Lisman, J. H. C.; van Zuylen, M. C. A. (1972). "Note on the generation of most probable frequency distributions". Statistica Neerlandica. 26 (1): 19–23. doi:10.1111/j.1467-9574.1972.tb00152.x.
  7. 1 2 Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (PDF). Journal of Econometrics. 150 (2): 219–230. CiteSeerX   10.1.1.511.9750 . doi:10.1016/j.jeconom.2008.12.014. Archived from the original (PDF) on 2016-03-07. Retrieved 2011-06-02.
  8. Dowson, D.; Wragg, A. (September 1973). "Maximum-entropy distributions having prescribed first and second moments". IEEE Transactions on Information Theory (correspondance). 19 (5): 689–693. doi:10.1109/tit.1973.1055060. ISSN   0018-9448.
  9. 1 2 Jammalamadaka, S. Rao; SenGupta, A. (2001). Topics in circular statistics. New Jersey: World Scientific. ISBN   978-981-02-3778-3 . Retrieved 2011-05-15.
  10. 1 2 Grechuk, Bogdan; Molyboha, Anton; Zabarankin, Michael (2009). "Maximum entropy principle with general deviation measures". Mathematics of Operations Research . 34 (2): 445–467. doi:10.1287/moor.1090.0377 via researchgate.net.
  11. 1 2 Harremös, Peter (2001). "Binomial and Poisson distributions as maximum entropy distributions". IEEE Transactions on Information Theory . 47 (5): 2039–2041. doi:10.1109/18.930936. S2CID   16171405.
  12. Nielsen, Frank; Nock, Richard (2017). "MaxEnt upper bounds for the differential entropy of univariate continuous distributions". IEEE Signal Processing Letters. IEEE. 24 (4): 402–406. Bibcode:2017ISPL...24..402N. doi:10.1109/LSP.2017.2666792. S2CID   14092514.

Related Research Articles

<span class="mw-page-title-main">Entropy (information theory)</span> Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , which takes values in the alphabet and is distributed according to , the entropy is

<span class="mw-page-title-main">Exponential distribution</span> Probability distribution

In probability theory and statistics, the exponential distribution or negative exponential distribution is the probability distribution of the distance between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate; the distance parameter could be any meaningful mono-dimensional measure of the process, such as time between production errors, or length along a roll of fabric in the weaving manufacturing process. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson point processes it is found in various other contexts.

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equation constraints. It is named after the mathematician Joseph-Louis Lagrange.

The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data.

Quantum statistical mechanics is statistical mechanics applied to quantum mechanical systems. In quantum mechanics a statistical ensemble is described by a density operator S, which is a non-negative, self-adjoint, trace-class operator of trace 1 on the Hilbert space H describing the quantum system. This can be shown under various mathematical formalisms for quantum mechanics.

<span class="mw-page-title-main">Laplace distribution</span> Probability distribution

In probability theory and statistics, the Laplace distribution is a continuous probability distribution named after Pierre-Simon Laplace. It is also sometimes called the double exponential distribution, because it can be thought of as two exponential distributions spliced together along the abscissa, although the term is also sometimes used to refer to the Gumbel distribution. The difference between two independent identically distributed exponential random variables is governed by a Laplace distribution, as is a Brownian motion evaluated at an exponentially distributed random time. Increments of Laplace motion or a variance gamma process evaluated over the time scale also have a Laplace distribution.

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted , is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P. While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.

In statistics and probability theory, a point process or point field is a collection of mathematical points randomly located on a mathematical space such as the real line or Euclidean space. Point processes can be used for spatial data analysis, which is of interest in such diverse disciplines as forestry, plant ecology, epidemiology, geography, seismology, materials science, astronomy, telecommunications, computational neuroscience, economics and others.

Differential entropy is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy, a measure of average (surprisal) of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.

<span class="mw-page-title-main">Inverse Gaussian distribution</span> Family of continuous probability distributions

In probability theory, the inverse Gaussian distribution is a two-parameter family of continuous probability distributions with support on (0,∞).

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value . Like this, over the generation sequence, individuals with better and better -values are generated.

<span class="mw-page-title-main">Conway–Maxwell–Poisson distribution</span> Probability distribution

In probability theory and statistics, the Conway–Maxwell–Poisson distribution is a discrete probability distribution named after Richard W. Conway, William L. Maxwell, and Siméon Denis Poisson that generalizes the Poisson distribution by adding a parameter to model overdispersion and underdispersion. It is a member of the exponential family, has the Poisson distribution and geometric distribution as special cases and the Bernoulli distribution as a limiting case.

<span class="mw-page-title-main">Poisson distribution</span> Discrete probability distribution

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event. It can also be used for the number of events in other types of intervals than time, and in dimension greater than 1.

Least-squares support-vector machines (LS-SVM) for statistics and in statistical modeling, are least-squares versions of support-vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods.

For certain applications in linear algebra, it is useful to know properties of the probability distribution of the largest eigenvalue of a finite sum of random matrices. Suppose is a finite sequence of random matrices. Analogous to the well-known Chernoff bound for sums of scalars, a bound on the following is sought for a given parameter t:

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.

Pure inductive logic (PIL) is the area of mathematical logic concerned with the philosophical and mathematical foundations of probabilistic inductive reasoning. It combines classical predicate logic and probability theory. Probability values are assigned to sentences of a first-order relational language to represent degrees of belief that should be held by a rational agent. Conditional probability values represent degrees of belief based on the assumption of some received evidence.

The Gibbs rotational ensemble represents the possible states of a mechanical system in thermal and rotational equilibrium at temperature and angular velocity . The Jaynes procedure can be used to obtain this ensemble. An ensemble is the set of microstates corresponding to a given macrostate.

References