Information distance

Information distance is the distance between two finite objects (represented as computer files), expressed as the number of bits in the shortest program that transforms one object into the other, or vice versa, on a universal computer. It is an extension of Kolmogorov complexity. [1] The Kolmogorov complexity of a single finite object is the information in that object; the information distance between a pair of finite objects is the minimum information required to go from one object to the other or vice versa. Information distance was first defined and investigated in [2] on the basis of thermodynamic principles; see also [3]. It subsequently achieved its final form in [4]. It is applied in the normalized compression distance and the normalized Google distance.

Properties

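Formally, the information distance $ID(x,y)$ between $x$ and $y$ is defined by

$$ID(x,y) = \min\{|p| : p(x) = y \text{ and } p(y) = x\},$$

with $p$ a finite binary program for the fixed universal computer, taking as inputs the finite binary strings $x, y$. In [4] it is proven that $ID(x,y) = E(x,y) + O(\log \max\{K(x\mid y), K(y\mid x)\})$ with

$$E(x,y) = \max\{K(x\mid y), K(y\mid x)\},$$

where $K(\cdot\mid\cdot)$ is the Kolmogorov complexity defined by [1] of the prefix type. [5] This $E(x,y)$ is the important quantity.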
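Kolmogorov complexity is not computable, so the quantity $\max\{K(x\mid y), K(y\mid x)\}$ can only be estimated in practice. The following is a minimal sketch, not a method from the cited papers: it assumes the common heuristic that $K(x\mid y)$ is roughly $C(yx) - C(y)$ for a real compressor $C$. The choice of zlib and all function names are illustrative assumptions.

```python
import zlib


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed data (a crude computable proxy for K)."""
    return 8 * len(zlib.compress(data, 9))


def conditional_bits(target: bytes, given: bytes) -> int:
    """Heuristic stand-in for K(target | given): C(given + target) - C(given)."""
    # Assumption: the extra bits needed to compress `target` after `given`
    # approximate the information in `target` that is not already in `given`.
    return max(0, compressed_bits(given + target) - compressed_bits(given))


def approx_max_distance(x: bytes, y: bytes) -> int:
    """Estimate of max{K(x|y), K(y|x)} in bits under the assumptions above."""
    return max(conditional_bits(x, y), conditional_bits(y, x))


if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox jumps over the lazy cat " * 20
    c = bytes(range(256)) * 4
    # Similar inputs should yield a smaller estimate than dissimilar ones.
    print(approx_max_distance(a, b), approx_max_distance(a, c))
```

The estimate is symmetric by construction, but its absolute values depend entirely on the compressor, so only comparisons between pairs are meaningful.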

Universality

Let $\Delta$ be the class of upper semicomputable distances $D(x,y)$ that satisfy the density condition

$$\sum_{x : x \neq y} 2^{-D(x,y)} \leq 1, \qquad \sum_{y : y \neq x} 2^{-D(x,y)} \leq 1.$$

This excludes irrelevant distances such as $D(x,y) = \tfrac{1}{2}$ for $x \neq y$; it ensures that if the distance grows, then the number of objects within that distance of a given object grows. If $D \in \Delta$, then $E(x,y) \leq D(x,y)$ up to a constant additive term. [4] The probabilistic expression of the distance is the first cohomological class in information symmetric cohomology, [6] which may be conceived as a universality property.
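As a worked check of why a constant distance is excluded (a sketch of the arithmetic only): for a fixed $y$ there are infinitely many finite strings $x \neq y$, so taking $D(x,y) = \tfrac{1}{2}$ for all $x \neq y$ makes the sum diverge:

$$\sum_{x : x \neq y} 2^{-D(x,y)} = \sum_{x : x \neq y} 2^{-1/2} = \infty > 1.$$

Conversely, when the density condition holds, every $x$ with $D(x,y) \leq d$ contributes at least $2^{-d}$ to the sum, so at most $2^{d}$ objects lie within distance $d$ of $y$.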

Metricity

The distance $E(x,y)$ is a metric up to an additive $O(\log \max\{K(x\mid y), K(y\mid x)\})$ term in the metric (in)equalities. [4] The probabilistic version of the metric is indeed unique, as shown by Han in 1981. [7]

Maximum overlap

If $K(x\mid y) \leq K(y\mid x)$, so that $E(x,y) = K(y\mid x)$, then there is a program $p$ of length $K(x\mid y)$ that converts $y$ to $x$, and a program $q$ of length $K(y\mid x) - K(x\mid y)$ such that the program $qp$ converts $x$ to $y$. (The programs are in the self-delimiting format, which means that one can decide where one program ends and the other begins in the concatenation of the programs.) That is, the shortest programs to convert between two objects can be made maximally overlapping: a shortest program that converts $x$ to $y$ can be divided into a program that converts $y$ to $x$ and another program which, concatenated with the first, converts $x$ to $y$, while the concatenation of these two programs is a shortest program to convert between these objects. [4]

Minimum overlap

The programs to convert between objects $x$ and $y$ can also be made minimally overlapping. There exists a program $p$ of length $K(x\mid y)$, up to an additive term of $O(\log \max\{K(x\mid y), K(y\mid x)\})$, that maps $y$ to $x$ and has small complexity when $x$ is known ($K(p\mid x) \approx 0$). Interchanging the two objects gives the other program. [8] In view of the parallelism between Shannon information theory and Kolmogorov complexity theory, this result can be regarded as parallel to the Slepian–Wolf and Körner–Csiszár–Marton theorems.

Applications

Theoretical

The result of An.A. Muchnik on minimum overlap above is an important theoretical application, showing that certain codes exist: to reach a finite target object from any object, there is a program that depends almost only on the target object. This result is fairly precise, and the error term cannot be significantly improved. [9] Information distance is covered in the textbook [10] and appears in the Encyclopedia of Distances. [11]

Practical

To determine the similarity of objects such as genomes, languages, music, internet attacks and worms, software programs, and so on, information distance is normalized and the Kolmogorov complexity terms are approximated by real-world compressors (the Kolmogorov complexity is a lower bound on the length in bits of a compressed version of the object). The result is the normalized compression distance (NCD) between the objects. This pertains to objects given as computer files, like the genome of a mouse or the text of a book. If the objects are given only by name, such as 'Einstein' or 'table' or the name of a book or the name 'mouse', compression does not make sense: outside information about what the name means is needed. A database (such as the internet) together with a means to search it (such as a search engine like Google) provides this information. Every search engine on a database that provides aggregate page counts can be used in the normalized Google distance (NGD). A Python package for computing all information distances and volumes, multivariate mutual information, conditional mutual information, joint entropies, and total correlations in a dataset of n variables is available. [12]
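Both normalized distances can be sketched in a few lines. The code below is an illustration under stated assumptions, not the package cited above. It uses the standard published forms $\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}$ and $\mathrm{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x), \log f(y)\}}$. zlib is an arbitrary stand-in for the compressor $C$, the function names are hypothetical, and the page counts passed to `ngd` must come from whatever search engine is used.

```python
import math
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes under zlib (assumed stand-in for Kolmogorov complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google distance from aggregate page counts.

    fx, fy: number of pages containing each term; fxy: pages containing both;
    n: total number of indexed pages (or another normalizing constant).
    """
    if fx <= 0 or fy <= 0 or n <= min(fx, fy):
        raise ValueError("need fx, fy > 0 and n > min(fx, fy)")
    if fxy <= 0:
        return math.inf  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox jumps over the lazy cat " * 20
    print(ncd(a, b))
    # Hypothetical counts, for illustration only.
    print(ngd(1000, 1000, 10, 10**6))
```

With a real compressor the NCD is only approximately a metric and can slightly exceed 1, so values are best read comparatively rather than as exact distances.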

References

  1. A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems Inform. Transmission, 1:1 (1965), 1–7
  2. M. Li, P.M.B. Vitanyi, Theory of Thermodynamics of Computation, Proc. IEEE Physics of Computation Workshop, Dallas, Texas, USA, 1992, 42–46
  3. M. Li, P.M.B. Vitanyi, Reversibility and Adiabatic Computation: Trading Time and Space for Energy, Proc. R. Soc. Lond. A 9 April 1996 vol. 452 no. 1947 769–789
  4. C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitanyi, W. Zurek, Information distance, IEEE Transactions on Information Theory, 44:4 (1998), 1407–1423
  5. L.A. Levin, Laws of Information Conservation (Nongrowth) and Aspects of the Foundation of Probability Theory, Problems Inform. Transmission, 10:3(1974), 30–35
  6. P. Baudot, The Poincaré-Shannon Machine: Statistical Physics and Machine Learning Aspects of Information Cohomology, Entropy, 21:9 (2019), 881
  7. Te Sun Han, A uniqueness of Shannon information distance and related nonnegativity problems, Journal of combinatorics, 6:4 (1981), 320–331
  8. Muchnik, Andrej A. (2002). "Conditional complexity and codes". Theoretical Computer Science. 271 (1–2): 97–109. doi:10.1016/S0304-3975(01)00033-0.
  9. N.K. Vereshchagin, M.V. Vyugin, Independent minimum length programs to translate between given strings, Proc. 15th Ann. Conf. Computational Complexity, 2000, 138–144
  10. M. Hutter, Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability, Springer, 2005
  11. M.M. Deza, E. Deza, Encyclopedia of Distances, Springer, 2009, doi:10.1007/978-3-642-00234-2
  12. "InfoTopo: Topological Information Data Analysis. Deep statistical unsupervised and supervised learning - File Exchange - Github". github.com/pierrebaudot/infotopopy/. Retrieved 26 September 2020.