Info-metrics

Info-metrics is an interdisciplinary approach to scientific modeling, inference and efficient information processing. It is the science of modeling, reasoning, and drawing inferences under conditions of noisy and limited information. From the point of view of the sciences, this framework is at the intersection of information theory, statistical methods of inference, applied mathematics, computer science, econometrics, complexity theory, decision analysis, modeling, and the philosophy of science.

Info-metrics provides a constrained optimization framework to tackle under-determined or ill-posed problems: problems where the available information is insufficient to determine a unique solution. Such problems are common across the sciences, because available information is often incomplete, limited, noisy and uncertain. Info-metrics is useful for modeling, information processing, theory building, and inference problems across the scientific spectrum. The info-metrics framework can also be used to test hypotheses about competing theories or causal mechanisms.

History

Info-metrics evolved from the classical maximum entropy formalism, which is based on the work of Shannon. Early contributions were mostly in the natural and mathematical/statistical sciences. Since the mid-1980s, and especially in the mid-1990s, the maximum entropy approach was generalized and extended to handle a larger class of problems in the social and behavioral sciences, especially for complex problems and data. The word ‘info-metrics’ was coined in 2009 by Amos Golan, shortly before the interdisciplinary Info-Metrics Institute was inaugurated.

Preliminary definitions

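Consider a random variable $X$ that can result in one of $K$ distinct outcomes. The probability $p_k$ of each outcome $x_k$ is $p_k = p(x_k)$ for $k = 1, 2, \ldots, K$. Thus, $P$ is a $K$-dimensional probability distribution defined for $X$ such that $p_k \ge 0$ and $\sum_k p_k = 1$. Define the informational content of a single outcome $x_k$ to be $h(x_k) = h(p_k) = \log_2(1/p_k)$ (e.g., Shannon). Observing an outcome at the tails of the distribution (a rare event) provides much more information than observing another, more probable, outcome. The entropy [1] is the expected information content of an outcome of the random variable $X$ whose probability distribution is $P$:

$$H(P) = \sum_{k=1}^{K} p_k \log_2 \frac{1}{p_k} = -\sum_{k=1}^{K} p_k \log_2 p_k = \operatorname{E}\left[\log_2 \frac{1}{P(X)}\right]$$

Here $p_k \log_2 p_k \equiv 0$ if $p_k = 0$, and $\operatorname{E}$ is the expectation operator.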
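As a minimal sketch of these definitions, the following Python functions compute the information content of a single outcome and the entropy of a discrete distribution in bits (log base 2). The function names `information_content` and `shannon_entropy` are illustrative choices, not part of any established library, and the validation tolerance is an assumption.

```python
import math


def information_content(p):
    """Information content log2(1/p), in bits, of an outcome with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.log2(1.0 / p)


def shannon_entropy(probs):
    """Expected information content, in bits, of a discrete distribution.

    Terms with zero probability are skipped, which implements the
    convention that p * log2(p) is taken to be 0 when p = 0.
    """
    if any(p < 0.0 for p in probs):
        raise ValueError("probabilities must be nonnegative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):  # tolerance is an assumption
        raise ValueError("probabilities must sum to 1")
    return sum(p * information_content(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    # A rare outcome carries more information than a likely one.
    print(information_content(0.01), information_content(0.9))
    # A uniform distribution over 6 outcomes has entropy log2(6) bits.
    print(shannon_entropy([1 / 6] * 6))
```

A uniform distribution attains the largest entropy among distributions on the same number of outcomes, while a distribution that puts all its mass on one outcome has entropy zero.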

The basic info-metrics problem

Consider the problem of modeling and inferring the unobserved probability distribution of some K-dimensional discrete random variable given just the mean (expected value) of that variable. We also know that the probabilities are nonnegative and normalized (i.e., sum up to exactly 1). For all K > 2 the problem is underdetermined. Within the info-metrics framework, the solution is to maximize the entropy of the random variable subject to the two constraints: mean and normalization. This yields the usual maximum entropy solution. The solutions to that problem can be extended and generalized in several ways. First, one can use another entropy instead of Shannon’s entropy. Second, the same approach can be used for continuous random variables, for all types of conditional models (e.g., regression, inequality and nonlinear models), and for many constraints. Third, priors can be incorporated within that framework. Fourth, the same framework can be extended to accommodate greater uncertainty: uncertainty about the observed values and/or uncertainty about the model itself. Last, the same basic framework can be used to develop new models/theories, validate these models using all available information, and test statistical hypotheses about the model.
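The maximum entropy solution to the basic problem can be sketched with the method of Lagrange multipliers. The notation here is an assumption made for illustration: $x_k$ denotes the $k$-th outcome, $p_k$ its probability, $y$ the given mean, $\lambda$ the multiplier on the mean constraint, $\mu$ the multiplier on the normalization constraint, and $\Omega$ the normalization (partition) function. Entropy is measured in base 2.

```latex
\begin{align}
\mathcal{L} &= -\sum_{k=1}^{K} p_k \log_2 p_k
  + \lambda \Big( y - \sum_{k=1}^{K} p_k x_k \Big)
  + \mu \Big( 1 - \sum_{k=1}^{K} p_k \Big), \\
\frac{\partial \mathcal{L}}{\partial p_k}
  &= -\log_2 p_k - \frac{1}{\ln 2} - \lambda x_k - \mu = 0
  \;\Longrightarrow\;
  p_k = 2^{-\mu - 1/\ln 2}\, 2^{-\lambda x_k}, \\
\sum_{k=1}^{K} p_k = 1
  &\;\Longrightarrow\;
  p_k = \frac{2^{-\lambda x_k}}{\Omega(\lambda)},
  \qquad \Omega(\lambda) = \sum_{j=1}^{K} 2^{-\lambda x_j}, \\
\sum_{k=1}^{K} p_k x_k = y
  &\;\Longrightarrow\;
  -\frac{\partial \log_2 \Omega(\lambda)}{\partial \lambda} = y .
\end{align}
```

The last line is a single equation in the single unknown $\lambda$, so the underdetermined problem in $K$ unknown probabilities reduces to a one-dimensional root-finding problem.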

Examples

Six-sided dice

Inference based on information resulting from repeated independent experiments.

The following example is attributed to Boltzmann and was further popularized by Jaynes. Consider a six-sided die, where tossing the die is the event and the distinct outcomes are the numbers 1 through 6 on the upper face of the die. The experiment is the independent repetitions of tossing the same die. Suppose you only observe the empirical mean value, y, of N tosses of a six-sided die. Given that information, you want to infer the probabilities that a specific value of the face will show up in the next toss of the die. You also know that the sum of the probabilities must be 1. Maximizing the entropy (and using log base 2) subject to these two constraints (mean and normalization) yields the most uninformed solution.

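$$\max_{P} \; H(P) = -\sum_{k=1}^{6} p_k \log_2 p_k \quad \text{subject to} \quad \sum_{k=1}^{6} p_k x_k = y \quad \text{and} \quad \sum_{k=1}^{6} p_k = 1$$

for $x_k = k$ and $k = 1, \ldots, 6$. The solution is

$$\hat{p}_k = \frac{2^{-\hat{\lambda} x_k}}{\sum_{j=1}^{6} 2^{-\hat{\lambda} x_j}} \equiv \frac{2^{-\hat{\lambda} x_k}}{\Omega}$$

where $\hat{p}_k$ is the inferred probability of event $k$, $\hat{\lambda}$ is the inferred Lagrange multiplier associated with the mean constraint, and $\Omega$ is the partition (normalization) function. If it is a fair die with a mean of 3.5, you would expect all faces to be equally likely, and this is what the maximum entropy solution gives: $\hat{p}_k = 1/6$ for every $k$. If the die is unfair (or loaded) with a mean of 4, the resulting maximum entropy solution is approximately $\hat{p} = (0.103, 0.123, 0.146, 0.174, 0.207, 0.247)$. For comparison, minimizing the least squares criterion $\sum_k p_k^2$ instead of maximizing the entropy yields approximately $\hat{p}^{LS} = (0.095, 0.124, 0.152, 0.181, 0.210, 0.238)$.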
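The die example can be reproduced numerically. The sketch below assumes the exponential form of the maximum entropy solution, $p_k \propto 2^{-\lambda x_k}$ with $x_k = k$, and finds the multiplier $\lambda$ by bisection on the mean constraint. The function name `maxent_die`, the search bracket, and the tolerance are illustrative assumptions; this is one simple way to solve the problem, not a reference implementation.

```python
def maxent_die(mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Return (lam, probs) for the maximum entropy distribution with a given mean.

    Assumes probs[k] is proportional to 2 ** (-lam * faces[k]).
    """
    if not min(faces) < mean < max(faces):
        raise ValueError("mean must lie strictly between the smallest and largest face")

    def probs(lam):
        exponents = [-lam * x for x in faces]
        shift = max(exponents)  # subtract the largest exponent to avoid overflow
        weights = [2.0 ** (e - shift) for e in exponents]
        omega = sum(weights)  # partition (normalization) function, up to the shift
        return [w / omega for w in weights]

    def implied_mean(lam):
        return sum(p * x for p, x in zip(probs(lam), faces))

    # The implied mean is strictly decreasing in lam, so bisection applies.
    # The bracket [-50, 50] is an assumption; it covers means that are not
    # extremely close to the smallest or largest face.
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if implied_mean(mid) > mean:
            lo = mid  # mean too high: a larger lam is needed
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, probs(lam)


if __name__ == "__main__":
    for target in (3.5, 4.0):
        lam, p = maxent_die(target)
        print(target, round(lam, 4), [round(pk, 3) for pk in p])
```

With this sign convention a mean above 3.5 gives a negative multiplier, so the inferred probabilities increase with the face value, and a mean of exactly 3.5 gives a multiplier of zero and the uniform distribution.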

Some cross-disciplinary examples

Rainfall prediction: Using the expected daily rainfall (arithmetic mean), the maximum entropy framework can be used to infer and forecast the daily rainfall distribution. [2]

Portfolio management: Suppose there is a portfolio manager who needs to allocate some assets or assign portfolio weights to different assets, while taking into account the investor’s constraints and preferences. Using these preferences and constraints, as well as observed information such as the mean return of each asset and the covariances of asset returns over some time period, the entropy maximization framework can be used to find the optimal portfolio weights. In this case, the entropy of the portfolio represents its diversity. This framework can be modified to include other constraints such as minimal variance, maximal diversity, etc. That model involves inequalities and can be further generalized to include short sales. More such examples and related code can be found in [3] [4].

An extensive list of work related to info-metrics can be found here: http://info-metrics.org/bibliography.html

References

  1. Shannon, Claude (1948). "A mathematical theory of communication". Bell System Technical Journal. 27: 379–423.
  2. Golan, Amos (2018). Foundations of Info-metrics: Modeling, Inference, and Imperfect Information. Oxford University Press.
  3. Bera, Anil K.; Park, Sung Y. (2008). "Optimal portfolio diversification using the maximum entropy principle". Econometric Reviews. 27 (4–6): 484–512.
  4. "Portfolio Allocation – Foundations of Info-Metrics". info-metrics.org.

Further reading

Other representative applications

Marco Frittelli. "The minimal entropy martingale measure and the valuation problem in incomplete markets". Mathematical Finance, 10(1):39–52, 2000.

Amos Golan and Volker Dose. "A generalized information theoretical approach to tomographic reconstruction". Journal of Physics A: Mathematical and General, 34(7):1271, 2001.