Principle of maximum entropy

Last updated

The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data (such as a proposition that expresses testable information).

Contents

Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data. According to this principle, the distribution with maximal information entropy is the best choice.

History

The principle was first expounded by E. T. Jaynes in two papers in 1957 [1] [2] where he emphasized a natural correspondence between statistical mechanics and information theory. In particular, Jaynes offered a new and very general rationale why the Gibbsian method of statistical mechanics works. He argued that the entropy of statistical mechanics and the information entropy of information theory are basically the same thing. Consequently, statistical mechanics should be seen just as a particular application of a general tool of logical inference and information theory.

Overview

In most practical cases, the stated prior data or testable information is given by a set of conserved quantities (average values of some moment functions), associated with the probability distribution in question. This is the way the maximum entropy principle is most often used in statistical thermodynamics. Another possibility is to prescribe some symmetries of the probability distribution. The equivalence between conserved quantities and corresponding symmetry groups implies a similar equivalence for these two ways of specifying the testable information in the maximum entropy method.

The maximum entropy principle is also needed to guarantee the uniqueness and consistency of probability assignments obtained by different methods, statistical mechanics and logical inference in particular.

The maximum entropy principle makes explicit our freedom in using different forms of prior data. As a special case, a uniform prior probability density (Laplace's principle of indifference, sometimes called the principle of insufficient reason), may be adopted. Thus, the maximum entropy principle is not merely an alternative way to view the usual methods of inference of classical statistics, but represents a significant conceptual generalization of those methods.

However these statements do not imply that thermodynamical systems need not be shown to be ergodic to justify treatment as a statistical ensemble.

In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.

Testable information

The principle of maximum entropy is useful explicitly only when applied to testable information. Testable information is a statement about a probability distribution whose truth or falsity is well-defined. For example, the statements

the expectation of the variable is 2.87

and

(where and are probabilities of events) are statements of testable information.

Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information. This constrained optimization problem is typically solved using the method of Lagrange multipliers. [3]

Entropy maximization with no testable information respects the universal "constraint" that the sum of the probabilities is one. Under this constraint, the maximum entropy discrete probability distribution is the uniform distribution,

Applications

The principle of maximum entropy is commonly applied in two ways to inferential problems:

Prior probabilities

The principle of maximum entropy is often used to obtain prior probability distributions for Bayesian inference. Jaynes was a strong advocate of this approach, claiming the maximum entropy distribution represented the least informative distribution. [4] A large amount of literature is now dedicated to the elicitation of maximum entropy priors and links with channel coding. [5] [6] [7] [8]

Posterior probabilities

Maximum entropy is a sufficient updating rule for radical probabilism. Richard Jeffrey's probability kinematics is a special case of maximum entropy inference. However, maximum entropy is not a generalisation of all such sufficient updating rules. [9]

Maximum entropy models

Alternatively, the principle is often invoked for model specification: in this case the observed data itself is assumed to be the testable information. Such models are widely used in natural language processing. An example of such a model is logistic regression, which corresponds to the maximum entropy classifier for independent observations.

Probability density estimation

One of the main applications of the maximum entropy principle is in discrete and continuous density estimation. [10] [11] Similar to support vector machine estimators, the maximum entropy principle may require the solution to a quadratic programming problem, and thus provide a sparse mixture model as the optimal density estimator. One important advantage of the method is its ability to incorporate prior information in the density estimation. [12]

General solution for the maximum entropy distribution with linear constraints

Discrete case

We have some testable information I about a quantity x taking values in {x1, x2,..., xn}. We assume this information has the form of m constraints on the expectations of the functions fk; that is, we require our probability distribution to satisfy the moment inequality/equality constraints:

where the are observables. We also require the probability density to sum to one, which may be viewed as a primitive constraint on the identity function and an observable equal to 1 giving the constraint

The probability distribution with maximum information entropy subject to these inequality/equality constraints is of the form: [10]

for some . It is sometimes called the Gibbs distribution. The normalization constant is determined by:

and is conventionally called the partition function. (The Pitman–Koopman theorem states that the necessary and sufficient condition for a sampling distribution to admit sufficient statistics of bounded dimension is that it have the general form of a maximum entropy distribution.)

The λk parameters are Lagrange multipliers. In the case of equality constraints their values are determined from the solution of the nonlinear equations

In the case of inequality constraints, the Lagrange multipliers are determined from the solution of a convex optimization program with linear constraints. [10] In both cases, there is no closed form solution, and the computation of the Lagrange multipliers usually requires numerical methods.

Continuous case

For continuous distributions, the Shannon entropy cannot be used, as it is only defined for discrete probability spaces. Instead Edwin Jaynes (1963, 1968, 2003) gave the following formula, which is closely related to the relative entropy (see also differential entropy).

where q(x), which Jaynes called the "invariant measure", is proportional to the limiting density of discrete points. For now, we shall assume that q is known; we will discuss it further after the solution equations are given.

A closely related quantity, the relative entropy, is usually defined as the Kullback–Leibler divergence of p from q (although it is sometimes, confusingly, defined as the negative of this). The inference principle of minimizing this, due to Kullback, is known as the Principle of Minimum Discrimination Information.

We have some testable information I about a quantity x which takes values in some interval of the real numbers (all integrals below are over this interval). We assume this information has the form of m constraints on the expectations of the functions fk, i.e. we require our probability density function to satisfy the inequality (or purely equality) moment constraints:

where the are observables. We also require the probability density to integrate to one, which may be viewed as a primitive constraint on the identity function and an observable equal to 1 giving the constraint

The probability density function with maximum Hc subject to these constraints is: [11]

with the partition function determined by

As in the discrete case, in the case where all moment constraints are equalities, the values of the parameters are determined by the system of nonlinear equations:

In the case with inequality moment constraints the Lagrange multipliers are determined from the solution of a convex optimization program. [11]

The invariant measure function q(x) can be best understood by supposing that x is known to take values only in the bounded interval (a, b), and that no other information is given. Then the maximum entropy probability density function is

where A is a normalization constant. The invariant measure function is actually the prior density function encoding 'lack of relevant information'. It cannot be determined by the principle of maximum entropy, and must be determined by some other logical method, such as the principle of transformation groups or marginalization theory.

Examples

For several examples of maximum entropy distributions, see the article on maximum entropy probability distributions.

Justifications for the principle of maximum entropy

Proponents of the principle of maximum entropy justify its use in assigning probabilities in several ways, including the following two arguments. These arguments take the use of Bayesian probability as given, and are thus subject to the same postulates.

Information entropy as a measure of 'uninformativeness'

Consider a discrete probability distribution among mutually exclusive propositions. The most informative distribution would occur when one of the propositions was known to be true. In that case, the information entropy would be equal to zero. The least informative distribution would occur when there is no reason to favor any one of the propositions over the others. In that case, the only reasonable probability distribution would be uniform, and then the information entropy would be equal to its maximum possible value, . The information entropy can therefore be seen as a numerical measure which describes how uninformative a particular probability distribution is, ranging from zero (completely informative) to (completely uninformative).

By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess. Thus the maximum entropy distribution is the only reasonable distribution. The dependence of the solution on the dominating measure represented by is however a source of criticisms of the approach since this dominating measure is in fact arbitrary. [13]

The Wallis derivation

The following argument is the result of a suggestion made by Graham Wallis to E. T. Jaynes in 1962. [14] It is essentially the same mathematical argument used for the Maxwell–Boltzmann statistics in statistical mechanics, although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept. The information entropy function is not assumed a priori, but rather is found in the course of the argument; and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way.

Suppose an individual wishes to make a probability assignment among mutually exclusive propositions. He has some testable information, but is not sure how to go about including this information in his probability assessment. He therefore conceives of the following random experiment. He will distribute quanta of probability (each worth ) at random among the possibilities. (One might imagine that he will throw balls into buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, he will check if the probability assignment thus obtained is consistent with his information. (For this step to be successful, the information must be a constraint given by an open set in the space of probability measures). If it is inconsistent, he will reject it and try again. If it is consistent, his assessment will be

where is the probability of the th proposition, while ni is the number of quanta that were assigned to the th proposition (i.e. the number of balls that ended up in bucket ).

Now, in order to reduce the 'graininess' of the probability assignment, it will be necessary to use quite a large number of quanta of probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, the protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the multinomial distribution,

where

is sometimes known as the multiplicity of the outcome.

The most probable result is the one which maximizes the multiplicity . Rather than maximizing directly, the protagonist could equivalently maximize any monotonic increasing function of . He decides to maximize

At this point, in order to simplify the expression, the protagonist takes the limit as , i.e. as the probability levels go from grainy discrete values to smooth continuous values. Using Stirling's approximation, he finds

All that remains for the protagonist to do is to maximize entropy under the constraints of his testable information. He has found that the maximum entropy distribution is the most probable of all "fair" random distributions, in the limit as the probability levels go from discrete to continuous.

Compatibility with Bayes' theorem

Giffin and Caticha (2007) state that Bayes' theorem and the principle of maximum entropy are completely compatible and can be seen as special cases of the "method of maximum relative entropy". They state that this method reproduces every aspect of orthodox Bayesian inference methods. In addition this new method opens the door to tackling problems that could not be addressed by either the maximal entropy principle or orthodox Bayesian methods individually. Moreover, recent contributions (Lazar 2003, and Schennach 2005) show that frequentist relative-entropy-based inference approaches (such as empirical likelihood and exponentially tilted empirical likelihood – see e.g. Owen 2001 and Kitamura 2006) can be combined with prior information to perform Bayesian posterior analysis.

Jaynes stated Bayes' theorem was a way to calculate a probability, while maximum entropy was a way to assign a prior probability distribution. [15]

It is however, possible in concept to solve for a posterior distribution directly from a stated prior distribution using the principle of minimum cross-entropy (or the Principle of Maximum Entropy being a special case of using a uniform distribution as the given prior), independently of any Bayesian considerations by treating the problem formally as a constrained optimisation problem, the Entropy functional being the objective function. For the case of given average values as testable information (averaged over the sought after probability distribution), the sought after distribution is formally the Gibbs (or Boltzmann) distribution the parameters of which must be solved for in order to achieve minimum cross entropy and satisfy the given testable information.

Relevance to physics

The principle of maximum entropy bears a relation to a key assumption of kinetic theory of gases known as molecular chaos or Stosszahlansatz. This asserts that the distribution function characterizing particles entering a collision can be factorized. Though this statement can be understood as a strictly physical hypothesis, it can also be interpreted as a heuristic hypothesis regarding the most probable configuration of particles before colliding. [16]

See also

Notes

  1. Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics" (PDF). Physical Review. Series II. 106 (4): 620–630. Bibcode:1957PhRv..106..620J. doi:10.1103/PhysRev.106.620. MR   0087305.
  2. Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics II" (PDF). Physical Review. Series II. 108 (2): 171–190. Bibcode:1957PhRv..108..171J. doi:10.1103/PhysRev.108.171. MR   0096414.
  3. Sivia, Devinderjit; Skilling, John (2006-06-02). Data Analysis: A Bayesian Tutorial. OUP Oxford. ISBN   978-0-19-154670-9.
  4. Jaynes, E. T. (1968). "Prior Probabilities" (PDF or PostScript). IEEE Transactions on Systems Science and Cybernetics. 4 (3): 227–241. doi:10.1109/TSSC.1968.300117.{{cite journal}}: External link in |format= (help)
  5. Clarke, B. (2006). "Information optimality and Bayesian modelling". Journal of Econometrics . 138 (2): 405–429. doi:10.1016/j.jeconom.2006.05.003.
  6. Soofi, E.S. (2000). "Principal Information Theoretic Approaches". Journal of the American Statistical Association . 95 (452): 1349–1353. doi:10.2307/2669786. JSTOR   2669786. MR   1825292.
  7. Bousquet, N. (2008). "Eliciting vague but proper maximal entropy priors in Bayesian experiments". Statistical Papers. 51 (3): 613–628. doi:10.1007/s00362-008-0149-9. S2CID   119657859.
  8. Palmieri, Francesco A. N.; Ciuonzo, Domenico (2013-04-01). "Objective priors from maximum entropy in data classification". Information Fusion. 14 (2): 186–198. CiteSeerX   10.1.1.387.4515 . doi:10.1016/j.inffus.2012.01.012.
  9. Skyrms, B (1987). "Updating, supposing and MAXENT". Theory and Decision. 22 (3): 225–46. doi:10.1007/BF00134086. S2CID   121847242.
  10. 1 2 3 Botev, Z. I.; Kroese, D. P. (2008). "Non-asymptotic Bandwidth Selection for Density Estimation of Discrete Data". Methodology and Computing in Applied Probability. 10 (3): 435. doi:10.1007/s11009-007-9057-z. S2CID   122047337.
  11. 1 2 3 Botev, Z. I.; Kroese, D. P. (2011). "The Generalized Cross Entropy Method, with Applications to Probability Density Estimation" (PDF). Methodology and Computing in Applied Probability. 13 (1): 1–27. doi:10.1007/s11009-009-9133-7. S2CID   18155189.
  12. Kesavan, H. K.; Kapur, J. N. (1990). "Maximum Entropy and Minimum Cross-Entropy Principles". In Fougère, P. F. (ed.). Maximum Entropy and Bayesian Methods . pp.  419–432. doi:10.1007/978-94-009-0683-9_29. ISBN   978-94-010-6792-8.
  13. Druilhet, Pierre; Marin, Jean-Michel (2007). "Invariant {HPD} credible sets and {MAP} estimators". Bayesian Anal. 2: 681–691. doi: 10.1214/07-BA227 .
  14. Jaynes, E. T. (2003) Probability Theory: The Logic of Science, Cambridge University Press, p. 351-355. ISBN   978-0521592710
  15. Jaynes, E. T. (1988) "The Relation of Bayesian and Maximum Entropy Methods", in Maximum-Entropy and Bayesian Methods in Science and Engineering (Vol. 1), Kluwer Academic Publishers, p. 25-29.
  16. Chliamovitch, G.; Malaspinas, O.; Chopard, B. (2017). "Kinetic theory beyond the Stosszahlansatz". Entropy. 19 (8): 381. Bibcode:2017Entrp..19..381C. doi: 10.3390/e19080381 .

Related Research Articles

<span class="mw-page-title-main">Entropy (information theory)</span> Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , which takes values in the alphabet and is distributed according to :

The likelihood function is the joint probability of observed data viewed as a function of the parameters of a statistical model.

<span class="mw-page-title-main">Exponential distribution</span> Probability distribution

In probability theory and statistics, the exponential distribution or negative exponential distribution is the probability distribution of the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson point processes it is found in various other contexts.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equation constraints. It is named after the mathematician Joseph-Louis Lagrange.

<span class="mw-page-title-main">Weibull distribution</span> Continuous probability distribution

In probability theory and statistics, the Weibull distribution is a continuous probability distribution. It models a broad range of random variables, largely in the nature of a time to failure or time between events. Examples are maximum one-day rainfalls and the time a user spends on a web page.

In statistics, the Wishart distribution is a generalization to multiple dimensions of the gamma distribution. It is named in honor of John Wishart, who first formulated the distribution in 1928. Other names include Wishart ensemble, or Wishart–Laguerre ensemble, or LOE, LUE, LSE.

A prior probability distribution of an uncertain quantity, often simply called the prior, is its assumed probability distribution before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

In mathematical statistics, the Kullback–Leibler divergence, denoted , is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P. While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.

Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:

  1. To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
  2. To derive a lower bound for the marginal likelihood of the observed data. This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data.

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

In Bayesian probability, the Jeffreys prior, named after Sir Harold Jeffreys, is a non-informative prior distribution for a parameter space; its density function is proportional to the square root of the determinant of the Fisher information matrix:

Differential entropy is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy, a measure of average (surprisal) of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients and ultimately allowing the out-of-sample prediction of the regressandconditional on observed values of the regressors. The simplest and most widely used version of this model is the normal linear model, in which given is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.

<span class="mw-page-title-main">Conway–Maxwell–Poisson distribution</span> Probability distribution

In probability theory and statistics, the Conway–Maxwell–Poisson distribution is a discrete probability distribution named after Richard W. Conway, William L. Maxwell, and Siméon Denis Poisson that generalizes the Poisson distribution by adding a parameter to model overdispersion and underdispersion. It is a member of the exponential family, has the Poisson distribution and geometric distribution as special cases and the Bernoulli distribution as a limiting case.

<span class="mw-page-title-main">Poisson distribution</span> Discrete probability distribution

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event. It is named after French mathematician Siméon Denis Poisson. The Poisson distribution can also be used for the number of events in other specified interval types such as distance, area, or volume. It plays an important role for discrete-stable distributions.

<span class="mw-page-title-main">Asymmetric Laplace distribution</span> Continuous probability distribution

In probability theory and statistics, the asymmetric Laplace distribution (ALD) is a continuous probability distribution which is a generalization of the Laplace distribution. Just as the Laplace distribution consists of two exponential distributions of equal scale back-to-back about x = m, the asymmetric Laplace consists of two exponential distributions of unequal scale back to back about x = m, adjusted to assure continuity and normalization. The difference of two variates exponentially distributed with different means and rate parameters will be distributed according to the ALD. When the two rate parameters are equal, the difference will be distributed according to the Laplace distribution.

Maximal entropy random walk (MERW) is a popular type of biased random walk on a graph, in which transition probabilities are chosen accordingly to the principle of maximum entropy, which says that the probability distribution which best represents the current state of knowledge is the one with largest entropy. While standard random walk chooses for every vertex uniform probability distribution among its outgoing edges, locally maximizing entropy rate, MERW maximizes it globally by assuming uniform probability distribution among all paths in a given graph.

Info-metrics is an interdisciplinary approach to scientific modeling, inference and efficient information processing. It is the science of modeling, reasoning, and drawing inferences under conditions of noisy and limited information. From the point of view of the sciences, this framework is at the intersection of information theory, statistical methods of inference, applied mathematics, computer science, econometrics, complexity theory, decision analysis, modeling, and the philosophy of science.

References

Further reading