Kolmogorov structure function

Last updated

In 1973, Andrey Kolmogorov proposed a non-probabilistic approach to statistics and model selection. Let each datum be a finite binary string and a model be a finite set of binary strings. Consider model classes consisting of models of given maximal Kolmogorov complexity. The Kolmogorov structure function of an individual data string expresses the relation between the complexity level constraint on a model class and the least log-cardinality of a model in the class containing the data. The structure function determines all stochastic properties of the individual data string: for every constrained model class it determines the individual best-fitting model in the class irrespective of whether the true model is in the model class considered or not. In the classical case we talk about a set of data with a probability distribution, and the properties are those of the expectations. In contrast, here we deal with individual data strings and the properties of the individual string focused on. In this setting, a property holds with certainty rather than with high probability as in the classical case. The Kolmogorov structure function precisely quantifies the goodness-of-fit of an individual model with respect to individual data.

Contents

The Kolmogorov structure function is used in the algorithmic information theory, also known as the theory of Kolmogorov complexity, for describing the structure of a string by use of models of increasing complexity.

Kolmogorov's definition

Kolmogorov (left) talks on the structure function (see drawing on the blackboard) in (Tallinn, 1973). Kolm complexity lect.jpg
Kolmogorov (left) talks on the structure function (see drawing on the blackboard) in (Tallinn, 1973).

The structure function was originally proposed by Kolmogorov in 1973 at a Soviet Information Theory symposium in Tallinn, but these results were not published [1] p. 182. But the results were announced in [2] in 1974, the only written record by Kolmogorov himself. One of his last scientific statements is (translated from the original Russian by L.A. Levin):

To each constructive object corresponds a function of a natural number k—the log of minimal cardinality of x-containing sets that allow definitions of complexity at most k. If the element x itself allows a simple definition, then the function drops to 0 even for small k. Lacking such definition, the element is "random" in a negative sense. But it is positively "probabilistically random" only when function having taken the value at a relatively small , then changes approximately as .

Kolmogorov, announcement cited above

Contemporary definition

It is discussed in Cover and Thomas. [1] It is extensively studied in Vereshchagin and Vitányi [3] where also the main properties are resolved. The Kolmogorov structure function can be written as

where is a binary string of length with where is a contemplated model (set of n-length strings) for , is the Kolmogorov complexity of and is a nonnegative integer value bounding the complexity of the contemplated 's. Clearly, this function is nonincreasing and reaches for where is the required number of bits to change into and is the Kolmogorov complexity of .

The algorithmic sufficient statistic

We define a set containing such that

.

The function never decreases more than a fixed independent constant below the diagonal called sufficiency line L defined by

.

It is approached to within a constant distance by the graph of for certain arguments (for instance, for ). For these 's we have and the associated model (witness for ) is called an optimal set for , and its description of bits is therefore an algorithmic sufficient statistic. We write `algorithmic' for `Kolmogorov complexity' by convention. The main properties of an algorithmic sufficient statistic are the following: If is an algorithmic sufficient statistic for , then

.

That is, the two-part description of using the model and as data-to-model code the index of in the enumeration of in bits, is as concise as the shortest one-part code of in bits. This can be easily seen as follows:

,
Structure functions
h
x
(
a
)
,
b
x
(
a
)
,
l
x
(
a
)
{\displaystyle h_{x}(\alpha ),\beta _{x}(\alpha ),\lambda _{x}(\alpha )}
and minimal sufficient statistic. Estimator.jpg
Structure functions and minimal sufficient statistic.

using straightforward inequalities and the sufficiency property, we find that . (For example, given , we can describe self-delimitingly (you can determine its end) in bits.) Therefore, the randomness deficiency of in is a constant, which means that is a typical (random) element of S. However, there can be models containing that are not sufficient statistics. An algorithmic sufficient statistic for has the additional property, apart from being a model of best fit, that and therefore by the Kolmogorov complexity symmetry of information (the information about in is about the same as the information about in x) we have : the algorithmic sufficient statistic is a model of best fit that is almost completely determined by . ( is a shortest program for .) The algorithmic sufficient statistic associated with the least such is called the algorithmic minimal sufficient statistic.

With respect to the picture: The MDL structure function is explained below. The Goodness-of-fit structure function is the least randomness deficiency (see above) of any model for such that . This structure function gives the goodness-of-fit of a model (containing x) for the string x. When it is low the model fits well, and when it is high the model doesn't fit well. If for some then there is a typical model for such that and is typical (random) for S. That is, is the best-fitting model for x. For more details see [1] and especially [3] and. [4]

Selection of properties

Within the constraints that the graph goes down at an angle of at least 45 degrees, that it starts at n and ends approximately at , every graph (up to a additive term in argument and value) is realized by the structure function of some data x and vice versa. Where the graph hits the diagonal first the argument (complexity) is that of the minimum sufficient statistic. It is incomputable to determine this place. See. [3]

Main property

It is proved that at each level of complexity the structure function allows us to select the best model for the individual string x within a strip of with certainty, not with great probability. [3]

The MDL variant

The Minimum description length (MDL) function: The length of the minimal two-part code for x consisting of the model cost K(S) and the length of the index of x in S, in the model class of sets of given maximal Kolmogorov complexity , the complexity of S upper bounded by , is given by the MDL function or constrained MDL estimator:

where is the total length of two-part code of x with help of model S.

Main property

It is proved that at each level of complexity the structure function allows us to select the best model S for the individual string x within a strip of with certainty, not with great probability. [3]

Application in statistics

The mathematics developed above were taken as the foundation of MDL by its inventor Jorma Rissanen. [5]

Probability models

For every computable probability distribution it can be proved [6] that

.

For example, if is some computable distribution on the set of strings of length , then each has probability . Kolmogorov's structure function becomes

where x is a binary string of length n with where is a contemplated model (computable probability of -length strings) for , is the Kolmogorov complexity of and is an integer value bounding the complexity of the contemplated 's. Clearly, this function is non-increasing and reaches for where c is the required number of bits to change into and is the Kolmogorov complexity of . Then . For every complexity level the function is the Kolmogorov complexity version of the maximum likelihood (ML).

Main property

It is proved that at each level of complexity the structure function allows us to select the best model for the individual string within a strip of with certainty, not with great probability. [3]

The MDL variant and probability models

The MDL function: The length of the minimal two-part code for x consisting of the model cost K(P) and the length of , in the model class of computable probability mass functions of given maximal Kolmogorov complexity , the complexity of P upper bounded by , is given by the MDL function or constrained MDL estimator:

where is the total length of two-part code of x with help of model P.

Main property

It is proved that at each level of complexity the MDL function allows us to select the best model P for the individual string x within a strip of with certainty, not with great probability. [3]

Extension to rate distortion and denoising

It turns out that the approach can be extended to a theory of rate distortion of individual finite sequences and denoising of individual finite sequences [7] using Kolmogorov complexity. Experiments using real compressor programs have been carried out with success. [8] Here the assumption is that for natural data the Kolmogorov complexity is not far from the length of a compressed version using a good compressor.

Related Research Articles

<span class="mw-page-title-main">Kolmogorov complexity</span> Measure of algorithmic complexity

In algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as algorithmic complexity, Solomonoff–Kolmogorov–Chaitin complexity, program-size complexity, descriptive complexity, or algorithmic entropy. It is named after Andrey Kolmogorov, who first published on the subject in 1963 and is a generalization of classical information theory.

<span class="mw-page-title-main">Kolmogorov–Smirnov test</span> Non-parametric statistical test between two distributions

In statistics, the Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution, or to compare two samples. In essence, the test answers the question "How likely is it that we would see a collection of samples like this if they were drawn from that probability distribution?" or, in the second case, "How likely is it that we would see two sets of samples like this if they were drawn from the same probability distribution?". It is named after Andrey Kolmogorov and Nikolai Smirnov.

The bin packing problem is an optimization problem, in which items of different sizes must be packed into a finite number of bins or containers, each of a fixed given capacity, in a way that minimizes the number of bins used. The problem has many applications, such as filling up containers, loading trucks with weight capacity constraints, creating file backups in media, and technology mapping in FPGA semiconductor chip design.

Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam's razor. The MDL principle can be extended to other forms of inductive inference and learning, for example to estimation and sequential prediction, without explicitly identifying a single model of the data.

<span class="mw-page-title-main">Algorithmic probability</span>

In algorithmic information theory, algorithmic probability, also known as Solomonoff probability, is a mathematical method of assigning a prior probability to a given observation. It was invented by Ray Solomonoff in the 1960s. It is used in inductive inference theory and analyses of algorithms. In his general theory of inductive inference, Solomonoff uses the method together with Bayes' rule to obtain probabilities of prediction for an algorithm's future outputs.

In information theory, Shannon's source coding theorem establishes the statistical limits to possible data compression for data whose source is an independent identically-distributed random variable, and the operational meaning of the Shannon entropy.

In computational complexity theory, PostBQP is a complexity class consisting of all of the computational problems solvable in polynomial time on a quantum Turing machine with postselection and bounded error.

The chain rule for Kolmogorov complexity is an analogue of the chain rule for information entropy, which states:

In mathematics, effective dimension is a modification of Hausdorff dimension and other fractal dimensions that places it in a computability theory setting. There are several variations of which the most common is effective Hausdorff dimension. Dimension, in mathematics, is a particular way of describing the size of an object. Hausdorff dimension generalizes the well-known integer dimensions assigned to points, lines, planes, etc. by allowing one to distinguish between objects of intermediate size between these integer-dimensional objects. For example, fractal subsets of the plane may have intermediate dimension between 1 and 2, as they are "larger" than lines or curves, and yet "smaller" than filled circles or rectangles. Effective dimension modifies Hausdorff dimension by requiring that objects with small effective dimension be not only small but also locatable in a computable sense. As such, objects with large Hausdorff dimension also have large effective dimension, and objects with small effective dimension have small Hausdorff dimension, but an object can have small Hausdorff but large effective dimension. An example is an algorithmically random point on a line, which has Hausdorff dimension 0 but effective dimension 1.

In mathematics, the FEE method, or fast E-function evaluation method, is the method of fast summation of series of a special form. It was constructed in 1990 by Ekaterina Karatsuba and is so-named because it makes fast computations of the Siegel E-functions possible, in particular of .

In mathematics the Function Field Sieve is one of the most efficient algorithms to solve the Discrete Logarithm Problem (DLP) in a finite field. It has heuristic subexponential complexity. Leonard Adleman developed it in 1994 and then elaborated it together with M. D. Huang in 1999. Previous work includes the work of D. Coppersmith about the DLP in fields of characteristic two.

The exponential mechanism is a technique for designing differentially private algorithms. It was developed by Frank McSherry and Kunal Talwar in 2007. Their work was recognized as a co-winner of the 2009 PET Award for Outstanding Research in Privacy Enhancing Technologies.

<span class="mw-page-title-main">Poisson distribution</span> Discrete probability distribution

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event. It is named after French mathematician Siméon Denis Poisson. The Poisson distribution can also be used for the number of events in other specified interval types such as distance, area, or volume. It plays an important role for discrete-stable distributions.

In data structures, a range query consists of pre-processing some input data into a data structure to efficiently answer any number of queries on any subset of the input. Particularly, there is a group of problems that have been extensively studied where the input is an array of unsorted numbers and a query consists of computing some function, such as the minimum, on a specific range of the array.

Information distance is the distance between two finite objects expressed as the number of bits in the shortest program which transforms one object into the other one or vice versa on a universal computer. This is an extension of Kolmogorov complexity. The Kolmogorov complexity of a single finite object is the information in that object; the information distance between a pair of finite objects is the minimum information required to go from one object to the other or vice versa. Information distance was first defined and investigated in based on thermodynamic principles, see also. Subsequently, it achieved final form in. It is applied in the normalized compression distance and the normalized Google distance.

<span class="mw-page-title-main">Hyperbolic geometric graph</span>

A hyperbolic geometric graph (HGG) or hyperbolic geometric network (HGN) is a special type of spatial network where (1) latent coordinates of nodes are sprinkled according to a probability density function into a hyperbolic space of constant negative curvature and (2) an edge between two nodes is present if they are close according to a function of the metric (typically either a Heaviside step function resulting in deterministic connections between vertices closer than a certain threshold distance, or a decaying function of hyperbolic distance yielding the connection probability). A HGG generalizes a random geometric graph (RGG) whose embedding space is Euclidean.

<span class="mw-page-title-main">Occam learning</span> Model of algorithmic learning

In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set.

The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It has been proposed from Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 and it was inspired from the PAC-framework introduced by Leslie Valiant.

<span class="mw-page-title-main">Error tolerance (PAC learning)</span>

In PAC learning, error tolerance refers to the ability of an algorithm to learn when the examples received have been corrupted in some way. In fact, this is a very common and important issue since in many applications it is not possible to access noise-free data. Noise can interfere with the learning process at different levels: the algorithm may receive data that have been occasionally mislabeled, or the inputs may have some false information, or the classification of the examples may have been maliciously adulterated.

In mathematics, the incompressibility method is a proof method like the probabilistic method, the counting method or the pigeonhole principle. To prove that an object in a certain class satisfies a certain property, select an object of that class that is incompressible. If it does not satisfy the property, it can be compressed by computable coding. Since it can be generally proven that almost all objects in a given class are incompressible, the argument demonstrates that almost all objects in the class have the property involved. To select an incompressible object is ineffective, and cannot be done by a computer program. However, a simple counting argument usually shows that almost all objects of a given class can be compressed by only a few bits.

References

  1. 1 2 3 Cover, Thomas M.; Thomas, Joy A. (1991). Elements of information theory . New York: Wiley. pp.  175–178. ISBN   978-0471062592.
  2. Abstract of a talk for the Moscow Mathematical Society in Uspekhi Mat. Nauk Volume 29, Issue 4(178) in the Communications of the Moscow Mathematical Society page 155 (in the Russian edition, not translated into English)
  3. 1 2 3 4 5 6 7 Vereshchagin, N.K.; Vitanyi, P.M.B. (1 December 2004). "Kolmogorov's Structure Functions and Model Selection". IEEE Transactions on Information Theory. 50 (12): 3265–3290. arXiv: cs/0204037 . doi:10.1109/TIT.2004.838346.
  4. Gacs, P.; Tromp, J.T.; Vitanyi, P.M.B. (2001). "Algorithmic statistics". IEEE Transactions on Information Theory. 47 (6): 2443–2463. arXiv: math/0006233 . doi:10.1109/18.945257.
  5. Rissanen, Jorma (2007). Information and complexity in statistical modeling (Online-Ausg. ed.). New York: Springer. ISBN   978-0-387-36610-4.
  6. A.Kh. Shen, The concept of (α, β)-stochasticity in the Kolmogorov sense, and its properties, Soviet Math. Dokl., 28:1(1983), 295--299
  7. Vereshchagin, Nikolai K.; Vitanyi, Paul M.B. (1 July 2010). "Rate Distortion and Denoising of Individual Data Using Kolmogorov Complexity". IEEE Transactions on Information Theory. 56 (7): 3438–3454. arXiv: cs/0411014 . doi:10.1109/TIT.2010.2048491.
  8. de Rooij, Steven; Vitanyi, Paul (1 March 2012). "Approximating Rate-Distortion Graphs of Individual Data: Experiments in Lossy Compression and Denoising". IEEE Transactions on Computers. 61 (3): 395–407. arXiv: cs/0609121 . doi:10.1109/TC.2011.25.

Literature