Minimum description length

Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam's razor. The MDL principle can be extended to other forms of inductive inference and learning, for example to estimation and sequential prediction, without explicitly identifying a single model of the data.

MDL has its origins mostly in information theory and has been further developed within the general fields of statistics, theoretical computer science and machine learning, and more narrowly computational learning theory.

Historically, there are different, yet interrelated, usages of the definite noun phrase "the minimum description length principle" that vary in what is meant by a description: in the idealized, algorithmic version, descriptions are computer programs and their lengths are measured by Kolmogorov complexity, whereas in Rissanen's practical, statistical version, descriptions are codes associated with probabilistic models of the data.

Overview

Selecting the minimum-length description of the available data as the best model follows the principle identified as Occam's razor. Prior to the advent of computer programming, generating such descriptions was the intellectual labor of scientific theorists, and it was far less formal than it has become in the computer age. If two scientists had a theoretical disagreement, they could rarely apply Occam's razor formally to choose between their theories: they would have different data sets and possibly different descriptive languages. Nevertheless, science advanced, as Occam's razor was an informal guide in deciding which model was best.

With the advent of formal languages and computer programming, Occam's razor was mathematically defined. Models of a given set of observations, encoded as bits of data, could be created in the form of computer programs that output that data. Occam's razor could then formally select the shortest program, measured in bits of this algorithmic information, as the best model.

To avoid confusion, note that there is nothing in the MDL principle that implies a machine produced the program embodying the model. It can be entirely the product of humans. The MDL principle applies regardless of whether the description to be run on a computer is the product of humans, machines or any combination thereof. The MDL principle requires only that the shortest description, when executed, produce the original data set without error.

Two-part codes

The distinction in computer programs between programs and literal data applies to all formal descriptions and is sometimes referred to as "two parts" of a description. In statistical MDL learning, such a description is frequently called a two-part code.
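
To make the two-part idea concrete, the following Python sketch (an illustration added here, not drawn from the MDL literature) compares a literal description of a highly regular symbol string with a two-part description: a generating rule (the "program" part) plus the count the rule needs in order to reproduce the data exactly. The particular string and the per-character and per-integer bit costs are arbitrary assumptions for the example.

```python
# Minimal sketch of a two-part description (illustrative only).
# The "model" part is a generating rule; the "data" part is whatever
# the rule still needs in order to reproduce the data exactly.

data = "AB" * 500  # 1000 symbols with an obvious regularity

# Literal description: spell out every symbol (1 bit per symbol here,
# since the alphabet is {A, B}).
literal_bits = len(data) * 1

# Two-part description: the rule "repeat the block 'AB'" plus the
# repetition count. We charge 8 bits per character of the block and a
# fixed 32 bits for the integer count; the exact constants do not
# matter, only that the total is far below the literal 1000 bits.
block, count = "AB", 500
model_bits = 8 * len(block)   # cost of stating the block
data_bits = 32                # cost of stating the count
two_part_bits = model_bits + data_bits

assert block * count == data  # the description reproduces the data exactly
print(f"literal: {literal_bits} bits, two-part: {two_part_bits} bits")
```

The shorter two-part description is preferred precisely because the regularity in the data has been moved into the model part of the code.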

MDL in machine learning

MDL applies in machine learning when algorithms (machines) generate descriptions. Learning occurs when an algorithm generates a shorter description of the same data set.

The theoretic minimum description length of a data set, called its Kolmogorov complexity, cannot, however, be computed. That is to say, even if by random chance an algorithm generates the shortest program of all that outputs the data set, an automated theorem prover cannot prove there is no shorter such program. Nevertheless, given two programs that output the dataset, the MDL principle selects the shorter of the two as embodying the best model.

Recent work on algorithmic MDL learning

Recent machine MDL learning of algorithmic, as opposed to statistical, data models has received increasing attention with the increasing availability of data, computational resources and theoretical advances. [2] [3] Approaches are informed by the burgeoning field of artificial general intelligence. Shortly before his death, Marvin Minsky came out strongly in favor of this line of research, saying: [4]

It seems to me that the most important discovery since Gödel was the discovery by Chaitin, Solomonoff and Kolmogorov of the concept called Algorithmic Probability which is a fundamental new theory of how to make predictions given a collection of experiences and this is a beautiful theory, everybody should learn it, but it’s got one problem, that is, that you cannot actually calculate what this theory predicts because it is too hard, it requires an infinite amount of work. However, it should be possible to make practical approximations to the Chaitin, Kolmogorov, Solomonoff theory that would make better predictions than anything we have today. Everybody should learn all about that and spend the rest of their lives working on it.

Panel discussion on The Limits of Understanding, World Science Festival, NYC, Dec 14, 2014

Statistical MDL learning

Any set of data can be represented by a string of symbols from a finite (say, binary) alphabet.

[The MDL Principle] is based on the following insight: any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally. (Grünwald, 2004) [5]

Based on this, in 1978, Jorma Rissanen published an MDL learning algorithm using the statistical notion of information rather than algorithmic information. Over the past 40 years this has developed into a rich theory of statistical and machine learning procedures with connections to Bayesian model selection and averaging, penalization methods such as Lasso and Ridge, and so on; Grünwald and Roos (2020) [6] give an introduction including all modern developments. Rissanen started out with this idea: all statistical learning is about finding regularities in data, and the best hypothesis to describe the regularities in data is also the one that is able to statistically compress the data most. Like other statistical methods, it can be used for learning the parameters of a model using some data. Usually though, standard statistical methods assume that the general form of a model is fixed. MDL's main strength is that it can also be used for selecting the general form of a model and its parameters. The quantity of interest (sometimes just a model, sometimes just parameters, sometimes both at the same time) is called a hypothesis. The basic idea is then to consider the (lossless) two-stage code that encodes data $D$ with length $L(D)$ by first encoding a hypothesis $H$ in the set of considered hypotheses $\mathcal{H}$ and then coding $D$ "with the help of" $H$; in the simplest context this just means "encoding the deviations of the data from the predictions made by $H$":

$$L(D) = \min_{H \in \mathcal{H}} \bigl( L(H) + L(D \mid H) \bigr).$$

The $H$ achieving this minimum is then viewed as the best explanation of data $D$. As a simple example, take a regression problem: the data $D$ could consist of a sequence of points $(x_1, y_1), \ldots, (x_n, y_n)$, and the set $\mathcal{H}$ could be the set of all polynomials from $X$ to $Y$. To describe a polynomial of degree (say) $k$, one would first have to discretize the parameters to some precision; one would then have to describe this precision (a natural number); next, one would have to describe the degree $k$ (another natural number), and in the final step, one would have to describe the $k+1$ parameters; the total length of this first part would be $L(H)$. One would then describe the points in $D$ using some fixed code for the x-values and then using a code for the deviations $y_i - H(x_i)$.

In practice, one often (but not always) uses a probabilistic model. For example, one associates each polynomial $H$ with the corresponding conditional distribution expressing that given $X$, $Y$ is normally distributed with mean $H(X)$ and some variance $\sigma^2$ which could either be fixed or added as a free parameter. Then the set of hypotheses $\mathcal{H}$ reduces to the assumption of a linear (in the parameters) model, $Y = H(X) + \epsilon$, with $H$ a polynomial.
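
As a rough illustration of how such a two-part code can be used to pick the degree of the polynomial, the Python sketch below charges a fixed number of bits per quantized parameter for the first part and the Gaussian negative log-likelihood of the deviations for the second part. The synthetic data, the 32-bits-per-parameter charge, the fixed noise variance and the helper name two_part_mdl are all assumptions made for this example, not a prescription from the MDL literature.

```python
import numpy as np

def two_part_mdl(x, y, degree, bits_per_param=32.0, sigma=1.0):
    """Two-part codelength L(H) + L(D|H), in bits (illustrative only)."""
    # Part 1: encode the hypothesis -- the degree plus (degree + 1)
    # coefficients quantized to a fixed precision.
    model_bits = bits_per_param * (degree + 2)  # degree itself + coefficients

    # Part 2: encode the deviations y_i - H(x_i) under a Gaussian model,
    # i.e. the negative log2-likelihood of the residuals.
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    nll_nats = 0.5 * np.sum(residuals**2 / sigma**2) \
               + 0.5 * len(y) * np.log(2 * np.pi * sigma**2)
    data_bits = nll_nats / np.log(2)
    return model_bits + data_bits

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=1.0, size=x.size)  # degree 2

codelengths = {d: two_part_mdl(x, y, d) for d in range(0, 8)}
best = min(codelengths, key=codelengths.get)
print(best)  # degree with the shortest total description, typically 2
```

The degree with the smallest total codelength is typically the true degree of the generating polynomial, because extra coefficients cost more bits than they save in describing the deviations.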

Furthermore, one is often not directly interested in specific parameter values, but just, for example, in the degree of the polynomial. In that case, one sets $\mathcal{H}$ to be $\{\mathcal{H}_0, \mathcal{H}_1, \ldots\}$, where each $\mathcal{H}_j$ represents the hypothesis that the data are best described as a polynomial of degree $j$. One then codes data $D$ given hypothesis $\mathcal{H}_j$ using a one-part code designed such that, whenever some hypothesis $H \in \mathcal{H}_j$ fits the data well, the codelength $L(D \mid \mathcal{H}_j)$ is short. The design of such codes is called universal coding. There are various types of universal codes one could use, often giving similar lengths for long data sequences but differing for short ones. The 'best' ones (in the sense that they have a minimax optimality property) are the normalized maximum likelihood (NML) or Shtarkov codes. A quite useful class of codes is formed by the Bayesian marginal likelihood codes. For exponential families of distributions, when Jeffreys prior is used and the parameter space is suitably restricted, these asymptotically coincide with the NML codes; this brings MDL theory into close contact with objective Bayes model selection, in which one also sometimes adopts Jeffreys' prior, albeit for different reasons. The MDL approach to model selection "gives a selection criterion formally identical to the BIC approach" [7] for large numbers of samples.

Example of statistical MDL learning

A coin is flipped 1000 times, and the numbers of heads and tails are recorded. Consider two model classes: the first contains a single code that represents each outcome with one bit (say, 0 for heads and 1 for tails) and corresponds to the hypothesis that the coin is fair; under this code the data always require exactly 1000 bits. The second consists of all codes that are efficient for a coin with some specific bias and corresponds to the hypothesis that the coin is unfair; whenever the observed numbers of heads and tails are not exactly equal, the best code in this second class describes the data in fewer than 1000 bits.

For this reason, a naive statistical method might choose the second model class as a better explanation for the data. However, an MDL approach would construct a single code for the whole model class, instead of just using the code corresponding to its best-fitting element. This code could be the normalized maximum likelihood code or a Bayesian code. If such a code is used, then the total codelength based on the second model class would be larger than 1000 bits. Therefore, the conclusion when following an MDL approach is inevitably that there is not enough evidence to support the hypothesis of the biased coin, even though the best element of the second model class provides a better fit to the data.
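
As a rough check of the comparison described above, the following Python sketch computes the normalized maximum likelihood (NML) codelength for the Bernoulli model class on a hypothetical outcome of 510 heads and 490 tails in 1000 flips; the specific counts and the helper names are assumptions made for this illustration.

```python
from math import lgamma, log2, log

def log2_binom(n, k):
    """log2 of the binomial coefficient C(n, k)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

def nml_codelength(n, k):
    """NML (Shtarkov) codelength in bits for k heads out of n flips."""
    def max_log2_lik(j):
        # log2 of the maximized likelihood (j/n)^j ((n-j)/n)^(n-j)
        if j in (0, n):
            return 0.0
        return j * log2(j / n) + (n - j) * log2((n - j) / n)

    # Parametric complexity: log2 of the sum of maximized likelihoods
    # over all possible outcomes (sequences grouped by head count).
    terms = [log2_binom(n, j) + max_log2_lik(j) for j in range(n + 1)]
    m = max(terms)
    complexity = m + log2(sum(2 ** (t - m) for t in terms))
    return -max_log2_lik(k) + complexity

n, k = 1000, 510             # hypothetical data: 510 heads, 490 tails
print(nml_codelength(n, k))  # roughly 1005 bits, i.e. more than the
                             # 1000 bits needed under the fair-coin model
```

Under these assumed counts the maximum-likelihood element of the biased-coin class needs slightly under 1000 bits, but the parametric complexity of the class (a few extra bits) pushes the NML codelength above 1000 bits, so the fair-coin model is preferred.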

Statistical MDL notation

Central to MDL theory is the one-to-one correspondence between code length functions and probability distributions (this follows from the Kraft–McMillan inequality). For any probability distribution $P$, it is possible to construct a code $C$ such that the length (in bits) of $C(x)$ is equal to $-\log_2 P(x)$; this code minimizes the expected code length. Conversely, given a code $C$, one can construct a probability distribution $P$ such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code is equivalent to searching for a good probability distribution.
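
The correspondence can be illustrated with a small Python sketch; the four-symbol distribution and the particular prefix code below are arbitrary assumptions chosen so that the numbers work out exactly.

```python
from math import log2, ceil

# An arbitrary example distribution over four symbols (an assumption
# for this illustration, not anything prescribed by the theory).
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Distribution -> codelengths: the Shannon lengths ceil(-log2 P(x)).
lengths = {x: ceil(-log2(p)) for x, p in P.items()}
assert sum(2 ** -l for l in lengths.values()) <= 1  # Kraft inequality holds,
# so a prefix code with exactly these lengths exists.

# Codelengths -> distribution: any prefix code C defines Q(x) = 2^(-len(C(x))).
code = {"a": "0", "b": "10", "c": "110", "d": "111"}  # a prefix code
Q = {x: 2 ** -len(w) for x, w in code.items()}
print(lengths)  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(Q)        # recovers the original distribution in this dyadic example
```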

Limitations of statistical MDL learning

The description language of statistical MDL is not computationally universal. Therefore it cannot, even in principle, learn models of recursive natural processes.

Statistical MDL learning is very strongly connected to probability theory and statistics through the correspondence between codes and probability distributions mentioned above. This has led some researchers to view MDL as equivalent to Bayesian inference: the code lengths of the model and of the data given the model in MDL correspond, respectively, to the prior probability and the marginal likelihood in the Bayesian framework. [8]

While Bayesian machinery is often useful in constructing efficient MDL codes, the MDL framework also accommodates other codes that are not Bayesian. An example is the Shtarkov normalized maximum likelihood code, which plays a central role in current MDL theory but has no equivalent in Bayesian inference. Furthermore, Rissanen stresses that we should make no assumptions about the true data-generating process: in practice, a model class is typically a simplification of reality and thus does not contain any code or probability distribution that is true in any objective sense. [9] [10] In the latter reference, Rissanen bases the mathematical underpinning of MDL on the Kolmogorov structure function.

According to the MDL philosophy, Bayesian methods should be dismissed if they are based on unsafe priors that would lead to poor results. The priors that are acceptable from an MDL point of view also tend to be favored in so-called objective Bayesian analysis; there, however, the motivation is usually different. [11]

Other systems

Rissanen's was not the first information-theoretic approach to learning; as early as 1968, Wallace and Boulton pioneered a related concept called minimum message length (MML). The difference between MDL and MML is a source of ongoing confusion. Superficially, the methods appear mostly equivalent, but there are some significant differences, especially in interpretation: MML is a fully subjective Bayesian approach, starting from a prior distribution over hypotheses, whereas MDL avoids assumptions about the data-generating process and instead works with universal codes defined relative to a model class.

See also

Related Research Articles

Kolmogorov complexity

In algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as algorithmic complexity, Solomonoff–Kolmogorov–Chaitin complexity, program-size complexity, descriptive complexity, or algorithmic entropy. It is named after Andrey Kolmogorov, who first published on the subject in 1963, and is a generalization of classical information theory.

Supervised learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data to expected output values. An optimal scenario will allow the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

Statistical inference

Statistical inference is the process of using data analysis to infer properties of an underlying distribution of probability. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.

In philosophy, Occam's razor is the problem-solving principle that recommends searching for explanations constructed with the smallest possible set of elements. It is also known as the principle of parsimony or the law of parsimony. Attributed to William of Ockham, a 14th-century English philosopher and theologian, it is frequently cited as Entia non sunt multiplicanda praeter necessitatem, which translates as "Entities must not be multiplied beyond necessity", although Occam never used these exact words. Popularly, the principle is sometimes inaccurately paraphrased as "The simplest explanation is usually the best one."

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

Pattern recognition

Pattern recognition is the automated recognition of patterns and regularities in data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM), which may possess PR capabilities but whose primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). It is one of several forms of causal notation. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

Minimum message length (MML) is a Bayesian information-theoretic method for statistical model comparison and selection. It provides a formal information theory restatement of Occam's Razor: even when models are equal in their measure of fit-accuracy to the observed data, the one generating the most concise explanation of data is more likely to be correct. MML was invented by Chris Wallace, first appearing in the seminal paper "An information measure for classification". MML is intended not just as a theoretical construct, but as a technique that may be deployed in practice. It differs from the related concept of Kolmogorov complexity in that it does not require use of a Turing-complete language to model data.

Algorithmic probability

In algorithmic information theory, algorithmic probability, also known as Solomonoff probability, is a mathematical method of assigning a prior probability to a given observation. It was invented by Ray Solomonoff in the 1960s. It is used in inductive inference theory and analyses of algorithms. In his general theory of inductive inference, Solomonoff uses the method together with Bayes' rule to obtain probabilities of prediction for an algorithm's future outputs.

Solomonoff's theory of inductive inference is a mathematical theory of induction introduced by Ray Solomonoff, based on probability theory and theoretical computer science. In essence, Solomonoff's induction derives the posterior probability of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes' rule and some universal prior, that is, a prior that assigns a positive probability to any computable theory.

The Bayes factor is a ratio of two competing statistical models represented by their evidence, and is used to quantify the support for one model over the other. The models in question can have a common set of parameters, such as a null hypothesis and an alternative, but this is not necessary; for instance, it could also be a non-linear model compared to its linear approximation. The Bayes factor can be thought of as a Bayesian analog to the likelihood-ratio test, although it uses the (integrated) marginal likelihood rather than the maximized likelihood. As such, both quantities only coincide under simple hypotheses. Also, in contrast with null hypothesis significance testing, Bayes factors support evaluation of evidence in favor of a null hypothesis, rather than only allowing the null to be rejected or not rejected.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.

Model selection is the task of selecting the best model from among various candidates on the basis of a performance criterion. In the context of learning, this may be the selection of a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well-suited to the problem of model selection. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice.

In the mathematical theory of stochastic processes, variable-order Markov (VOM) models are an important class of models that extend the well known Markov chain models. In contrast to the Markov chain models, where each random variable in a sequence with a Markov property depends on a fixed number of random variables, in VOM models this number of conditioning random variables may vary based on the specific observed realization.

Jorma Johannes Rissanen was an information theorist, known for originating the minimum description length (MDL) principle and practical approaches to arithmetic coding for lossless data compression. His work inspired the development of the theory of stochastic chains with memory of variable length.

In 1973, Andrey Kolmogorov proposed a non-probabilistic approach to statistics and model selection. Let each datum be a finite binary string and a model be a finite set of binary strings. Consider model classes consisting of models of given maximal Kolmogorov complexity. The Kolmogorov structure function of an individual data string expresses the relation between the complexity level constraint on a model class and the least log-cardinality of a model in the class containing the data. The structure function determines all stochastic properties of the individual data string: for every constrained model class it determines the individual best-fitting model in the class irrespective of whether the true model is in the model class considered or not. In the classical case we talk about a set of data with a probability distribution, and the properties are those of the expectations. In contrast, here we deal with individual data strings and the properties of the individual string under consideration. In this setting, a property holds with certainty rather than with high probability as in the classical case. The Kolmogorov structure function precisely quantifies the goodness-of-fit of an individual model with respect to individual data.

The free energy principle is a theoretical framework that suggests that the brain reduces surprise or uncertainty by making predictions based on internal models and updating them using sensory input. It highlights the brain's objective of aligning its internal model with the external world to enhance prediction accuracy. This principle integrates Bayesian inference with active inference, where actions are guided by predictions and sensory feedback refines them. It has wide-ranging implications for comprehending brain function, perception, and action.

Inductive probability attempts to give the probability of future events based on past events. It is the basis for inductive reasoning, and gives the mathematical basis for learning and the perception of patterns. It is a source of knowledge about the world.

Occam learning

In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set.

References

  1. Rissanen, J. (September 1978). "Modeling by shortest data description". Automatica. 14 (5): 465–471. doi:10.1016/0005-1098(78)90005-5.
  2. Zenil, Hector; Kiani, Narsis A.; Zea, Allan A.; Tegnér, Jesper (January 2019). "Causal deconvolution by algorithmic generative models". Nature Machine Intelligence. 1 (1): 58–66. doi:10.1038/s42256-018-0005-0. hdl:10754/630919. S2CID 86562557.
  3. "Remodelling machine learning: An AI that thinks like a scientist". Nature Machine Intelligence: 1. 28 January 2019. doi:10.1038/s42256-019-0026-3. S2CID 189929110.
  4. "The Limits of Understanding". YouTube. Archived at Ghostarchive and the Wayback Machine.
  5. Grünwald, Peter (June 2004). "A tutorial introduction to the minimum description length principle". arXiv:math/0406077. Bibcode:2004math......6077G.
  6. Grünwald, Peter; Roos, Teemu (2020). "Minimum Description Length Revisited". International Journal of Mathematics for Industry. 11 (1). doi:10.1142/S2661335219300018. hdl:10138/317252. S2CID 201314867.
  7. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "Model Assessment and Selection". The Elements of Statistical Learning. Springer Series in Statistics. pp. 219–259. doi:10.1007/978-0-387-84858-7_7. ISBN 978-0-387-84857-0.
  8. MacKay, David J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press. ISBN 978-0-521-64298-9.
  9. Rissanen, Jorma. "Homepage of Jorma Rissanen". Archived from the original on 2015-12-10. Retrieved 2010-07-03.
  10. Rissanen, J. (2007). Information and Complexity in Statistical Modeling. Springer. Retrieved 2010-07-03.
  11. Nannen, Volker (May 2010). "A Short Introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length (MDL)". arXiv:1005.2364. Bibcode:2010arXiv1005.2364N.

Further reading