In algorithmic information theory, algorithmic probability, also known as Solomonoff probability, is a mathematical method of assigning a prior probability to a given observation. It was invented by Ray Solomonoff in the 1960s. [2] It is used in inductive inference theory and analyses of algorithms. In his general theory of inductive inference, Solomonoff uses the method together with Bayes' rule to obtain probabilities of prediction for an algorithm's future outputs. [3]
In the mathematical formalism used, the observations have the form of finite binary strings viewed as outputs of Turing machines, and the universal prior is a probability distribution over the set of finite binary strings calculated from a probability distribution over programs (that is, inputs to a universal Turing machine). The prior is universal in the Turing-computability sense, i.e. no string has zero probability. It is not computable, but it can be approximated. [4]
Formally, the probability is not a probability and it is not computable. It is only "lower semi-computable" and a "semi-measure". By "semi-measure", it means that . That is, the "probability" does not actually sum up to one, unlike actual probabilities. This is because some inputs to the Turing machine causes it to never halt, which means the probability mass allocated to those inputs is lost. By "lower semi-computable", it means there is a Turing machine that, given an input string , can print out a sequence that converges to from below, but there is no such Turing machine that does the same from above.
Algorithmic probability is the main ingredient of Solomonoff's theory of inductive inference, the theory of prediction based on observations; it was invented with the goal of using it for machine learning; given a sequence of symbols, which one will come next? Solomonoff's theory provides an answer that is optimal in a certain sense, although it is incomputable. Unlike, for example, Karl Popper's informal inductive inference theory,[ clarification needed ] Solomonoff's is mathematically rigorous.
Four principal inspirations for Solomonoff's algorithmic probability were: Occam's razor, Epicurus' principle of multiple explanations, modern computing theory (e.g. use of a universal Turing machine) and Bayes’ rule for prediction. [5]
Occam's razor and Epicurus' principle are essentially two different non-mathematical approximations of the universal prior.
At the heart of the universal prior is an abstract model of a computer, such as a universal Turing machine. [8] Any abstract computer will do, as long as it is Turing-complete, i.e. every computable function has at least one program that will compute its application on the abstract computer.
The abstract computer is used to give precise meaning to the phrase "simple explanation". In the formalism used, explanations, or theories of phenomena, are computer programs that generate observation strings when run on the abstract computer. Each computer program is assigned a weight corresponding to its length. The universal probability distribution is the probability distribution on all possible output strings with random input, assigning for each finite output prefix q the sum of the probabilities of the programs that compute something starting with q. [9] Thus, a simple explanation is a short computer program. A complex explanation is a long computer program. Simple explanations are more likely, so a high-probability observation string is one generated by a short computer program, or perhaps by any of a large number of slightly longer computer programs. A low-probability observation string is one that can only be generated by a long computer program.
Algorithmic probability is closely related to the concept of Kolmogorov complexity. Kolmogorov's introduction of complexity was motivated by information theory and problems in randomness, while Solomonoff introduced algorithmic complexity for a different reason: inductive reasoning. A single universal prior probability that can be substituted for each actual prior probability in Bayes's rule was invented by Solomonoff with Kolmogorov complexity as a side product. [10] It predicts the most likely continuation of that observation, and provides a measure of how likely this continuation will be.[ citation needed ]
Solomonoff's enumerable measure is universal in a certain powerful sense, but the computation time can be infinite. One way of dealing with this issue is a variant of Leonid Levin's Search Algorithm, [11] which limits the time spent computing the success of possible programs, with shorter programs given more time. When run for longer and longer periods of time, it will generate a sequence of approximations which converge to the universal probability distribution. Other methods of dealing with the issue include limiting the search space by including training sequences.
Solomonoff proved this distribution to be machine-invariant within a constant factor (called the invariance theorem). [12]
Kolmogorov's Invariance theorem clarifies that the Kolmogorov Complexity, or Minimal Description Length, of a dataset is invariant to the choice of Turing-Complete language used to simulate a Universal Turing Machine:
where .
The minimal description such that serves as a natural representation of the string relative to the Turing-Complete language . Moreover, as can't be compressed further is an incompressible and hence uncomputable string. This corresponds to a scientists' notion of randomness and clarifies the reason why Kolmogorov Complexity is not computable.
It follows that any piece of data has a necessary and sufficient representation in terms of a random string.
The following is taken from [13]
From the theory of compilers, it is known that for any two Turing-Complete languages and , there exists a compiler expressed in that translates programs expressed in into functionally-equivalent programs expressed in .
It follows that if we let be the shortest program that prints a given string then:
where , and by symmetry we obtain the opposite inequality.
Given that any uniquely-decodable code satisfies the Kraft-McMillan inequality, prefix-free Kolmogorov Complexity allows us to derive the Universal Distribution:
where the fact that may simulate a prefix-free UTM implies that for two distinct descriptions and , isn't a substring of and isn't a substring of .
In a Computable Universe, given a phenomenon with encoding generated by a physical process the probability of that phenomenon is well-defined and equal to the sum over the probabilities of distinct and independent causes. The prefix-free criterion is precisely what guarantees causal independence.
This is an immediate consequence of the Kraft-McMillan inequality.
Kraft's inequality states that given a sequence of strings there exists a prefix code with codewords where if and only if:
where is the size of the alphabet .
Without loss of generality, let's suppose we may order the such that:
Now, there exists a prefix code if and only if at each step there is at least one codeword to choose that does not contain any of the previous codewords as a prefix. Due to the existence of a codeword at a previous step codewords are forbidden as they contain as a prefix. It follows that in general a prefix code exists if and only if:
Dividing both sides by , we find:
QED.
Solomonoff invented the concept of algorithmic probability with its associated invariance theorem around 1960, [14] publishing a report on it: "A Preliminary Report on a General Theory of Inductive Inference." [15] He clarified these ideas more fully in 1964 with "A Formal Theory of Inductive Inference," Part I [16] and Part II. [17]
In algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as algorithmic complexity, Solomonoff–Kolmogorov–Chaitin complexity, program-size complexity, descriptive complexity, or algorithmic entropy. It is named after Andrey Kolmogorov, who first published on the subject in 1963 and is a generalization of classical information theory.
In the computer science subfield of algorithmic information theory, a Chaitin constant or halting probability is a real number that, informally speaking, represents the probability that a randomly constructed program will halt. These numbers are formed from a construction due to Gregory Chaitin.
Minimum message length (MML) is a Bayesian information-theoretic method for statistical model comparison and selection. It provides a formal information theory restatement of Occam's Razor: even when models are equal in their measure of fit-accuracy to the observed data, the one generating the most concise explanation of data is more likely to be correct. MML was invented by Chris Wallace, first appearing in the seminal paper "An information measure for classification". MML is intended not just as a theoretical construct, but as a technique that may be deployed in practice. It differs from the related concept of Kolmogorov complexity in that it does not require use of a Turing-complete language to model data.
Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam's razor. The MDL principle can be extended to other forms of inductive inference and learning, for example to estimation and sequential prediction, without explicitly identifying a single model of the data.
In computer science, computational learning theory is a subfield of artificial intelligence devoted to studying the design and analysis of machine learning algorithms.
Ray Solomonoff was an American mathematician who invented algorithmic probability, his General Theory of Inductive Inference, and was a founder of algorithmic information theory. He was an originator of the branch of artificial intelligence based on machine learning, prediction and probability. He circulated the first report on non-semantic machine learning in 1956.
Solomonoff's theory of inductive inference is a mathematical theory of induction introduced by Ray Solomonoff, based on probability theory and theoretical computer science. In essence, Solomonoff's induction derives the posterior probability of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes' rule and some universal prior, that is, a prior that assigns a positive probability to any computable theory.
In information theory, Shannon's source coding theorem establishes the statistical limits to possible data compression for data whose source is an independent identically-distributed random variable, and the operational meaning of the Shannon entropy.
In computational complexity and optimization the no free lunch theorem is a result that states that for certain types of mathematical problems, the computational cost of finding a solution, averaged over all problems in the class, is the same for any solution method. The name alludes to the saying "no such thing as a free lunch", that is, no method offers a "short cut". This is under the assumption that the search space is a probability density function. It does not apply to the case where the search space has underlying structure that can be exploited more efficiently than random search or even has closed-form solutions that can be determined without search at all. For such probabilistic assumptions, the outputs of all procedures solving a particular type of problem are statistically identical. A colourful way of describing such a circumstance, introduced by David Wolpert and William G. Macready in connection with the problems of search and optimization, is to say that there is no free lunch. Wolpert had previously derived no free lunch theorems for machine learning. Before Wolpert's article was published, Cullen Schaffer independently proved a restricted version of one of Wolpert's theorems and used it to critique the current state of machine learning research on the problem of induction.
Algorithmic information theory (AIT) is a branch of theoretical computer science that concerns itself with the relationship between computation and information of computably generated objects, such as strings or any other data structure. In other words, it is shown within algorithmic information theory that computational incompressibility "mimics" the relations or inequalities found in information theory. According to Gregory Chaitin, it is "the result of putting Shannon's information theory and Turing's computability theory into a cocktail shaker and shaking vigorously."
Intuitively, an algorithmically random sequence is a sequence of binary digits that appears random to any algorithm running on a universal Turing machine. The notion can be applied analogously to sequences on any finite alphabet. Random sequences are key objects of study in algorithmic information theory.
The chain rule for Kolmogorov complexity is an analogue of the chain rule for information entropy, which states:
In 1973, Andrey Kolmogorov proposed a non-probabilistic approach to statistics and model selection. Let each datum be a finite binary string and a model be a finite set of binary strings. Consider model classes consisting of models of given maximal Kolmogorov complexity. The Kolmogorov structure function of an individual data string expresses the relation between the complexity level constraint on a model class and the least log-cardinality of a model in the class containing the data. The structure function determines all stochastic properties of the individual data string: for every constrained model class it determines the individual best-fitting model in the class irrespective of whether the true model is in the model class considered or not. In the classical case we talk about a set of data with a probability distribution, and the properties are those of the expectations. In contrast, here we deal with individual data strings and the properties of the individual string focused on. In this setting, a property holds with certainty rather than with high probability as in the classical case. The Kolmogorov structure function precisely quantifies the goodness-of-fit of an individual model with respect to individual data.
Logical depth is a measure of complexity for individual strings devised by Charles H. Bennett based on the computational complexity of an algorithm that can recreate a given piece of information. It differs from Kolmogorov complexity in that it considers the computation time of the algorithm with nearly minimal length, rather than the length of the minimal algorithm.
Simplicity theory is a cognitive theory that seeks to explain the attractiveness of situations or events to human minds. It is based on work done by scientists like behavioural scientist Nick Chater, computer scientist Paul Vitanyi, psychologist Jacob Feldman, and artificial intelligence researchers Jean-Louis Dessalles and Jürgen Schmidhuber. It claims that interesting situations appear simpler than expected to the observer.
Information distance is the distance between two finite objects expressed as the number of bits in the shortest program which transforms one object into the other one or vice versa on a universal computer. This is an extension of Kolmogorov complexity. The Kolmogorov complexity of a single finite object is the information in that object; the information distance between a pair of finite objects is the minimum information required to go from one object to the other or vice versa. Information distance was first defined and investigated in based on thermodynamic principles, see also. Subsequently, it achieved final form in. It is applied in the normalized compression distance and the normalized Google distance.
In mathematics, a set of natural numbers is called a K-trivial set if its initial segments viewed as binary strings are easy to describe: the prefix-free Kolmogorov complexity is as low as possible, close to that of a computable set. Solovay proved in 1975 that a set can be K-trivial without being computable.
Inductive probability attempts to give the probability of future events based on past events. It is the basis for inductive reasoning, and gives the mathematical basis for learning and the perception of patterns. It is a source of knowledge about the world.
Universality probability is an abstruse probability measure in computational complexity theory that concerns universal Turing machines.
In mathematics, the incompressibility method is a proof method like the probabilistic method, the counting method or the pigeonhole principle. To prove that an object in a certain class satisfies a certain property, select an object of that class that is incompressible. If it does not satisfy the property, it can be compressed by computable coding. Since it can be generally proven that almost all objects in a given class are incompressible, the argument demonstrates that almost all objects in the class have the property involved. To select an incompressible object is ineffective, and cannot be done by a computer program. However, a simple counting argument usually shows that almost all objects of a given class can be compressed by only a few bits.