Solomonoff's theory of inductive inference is a mathematical proof that if a universe is generated by an algorithm, then observations of that universe, encoded as a dataset, are best predicted by the smallest executable archive of that dataset. This formalization of Occam's razor for induction (J.J. McCall, "Induction: From Kolmogorov and Solomonoff to De Finetti and Back to Kolmogorov", Metroeconomica, 2004; D. Stork, "Foundations of Occam's razor and parsimony in learning", NIPS 2001 Workshop, 2001; A.N. Soklakov, "Occam's razor as a formal basis for a physical theory", Foundations of Physics Letters, 2002; M. Hutter, "On the existence and convergence of computable universal priors", Algorithmic Learning Theory, 2003) was introduced by Ray Solomonoff, based on probability theory and theoretical computer science. In essence, Solomonoff's induction derives the posterior probability of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes' rule and some ''universal'' prior, that is, a prior that assigns a positive probability to any computable theory.
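
Stated formally, in the standard notation of algorithmic probability (the symbols U, a universal prefix machine, and |p|, the length of a program p, are assumed here rather than introduced in the text above), the universal prior of a finite bit string x and the prediction rule it induces read

\mathbb{M}(x) = \sum_{p \,:\, U(p) = x*} 2^{-|p|}, \qquad \mathbb{M}(x_{t+1} \mid x_{1:t}) = \frac{\mathbb{M}(x_{1:t} x_{t+1})}{\mathbb{M}(x_{1:t})},

where the sum ranges over all programs p whose output starts with x (written U(p) = x*). Each program contributes weight 2^{-|p|}, so shorter programs dominate the sum; this is the precise sense in which the smallest executable archive of the data governs the prediction.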


Origin


Philosophical

The theory is based in philosophical foundations, and was founded by Ray Solomonoff around 1960. It is a mathematically formalized combination of Occam's razor and the Principle of Multiple Explanations, attributed to Epicurus. (Ming Li and Paul Vitanyi, ''An Introduction to Kolmogorov Complexity and Its Applications'', Springer-Verlag, N.Y., 2008, p. 339 ff.) All computable theories which perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Marcus Hutter's universal artificial intelligence (AIXI) builds upon this to calculate the expected value of an action.


Principle

Solomonoff's induction has been argued to be the computational formalization of pure Bayesianism. To understand this, recall that Bayesianism derives the posterior probability \mathbb{P}[T \mid D] of a theory T given data D by applying Bayes' rule, which yields

\mathbb{P}[T \mid D] = \frac{\mathbb{P}[D \mid T] \, \mathbb{P}[T]}{\mathbb{P}[D \mid T] \, \mathbb{P}[T] + \sum_A \mathbb{P}[D \mid A] \, \mathbb{P}[A]},

where the theories A are alternatives to theory T. For this equation to make sense, the quantities \mathbb{P}[D \mid T] and \mathbb{P}[D \mid A] must be well-defined for all theories T and A. In other words, any theory must define a probability distribution over observable data D. Solomonoff's induction essentially boils down to demanding, in addition, that all such probability distributions be computable.

Interestingly, the set of computable probability distributions is a subset of the set of all programs, which is countable. Similarly, the sets of observable data considered by Solomonoff were finite. Without loss of generality, we can thus consider that any observable data is a finite bit string. As a result, Solomonoff's induction can be defined by only invoking discrete probability distributions. Solomonoff's induction then allows one to make probabilistic predictions of future data F by simply obeying the laws of probability. Namely, we have

\mathbb{P}[F \mid D] = \mathbb{E}_T \big[ \mathbb{P}[F \mid T, D] \big] = \sum_T \mathbb{P}[F \mid T, D] \, \mathbb{P}[T \mid D].

This quantity can be interpreted as the average of the predictions \mathbb{P}[F \mid T, D] of all theories T given past data D, weighted by their posterior credences \mathbb{P}[T \mid D].
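
As a concrete illustration, the following sketch performs exactly this Bayesian update over a tiny hand-picked class of computable "theories" about a binary sequence. It is a toy model, not Solomonoff's actual (uncomputable) procedure: the four predictors, their bit-lengths, and the 2^(-length) prior over this small class are assumptions chosen for illustration.

from fractions import Fraction

# Toy hypothesis class: each "theory" is a (name, length_in_bits, predictor)
# triple, where predictor(history) returns P(next bit = 1 | history).
THEORIES = [
    ("always-0",  2, lambda h: Fraction(0)),             # predicts 0 forever
    ("always-1",  2, lambda h: Fraction(1)),             # predicts 1 forever
    ("alternate", 3, lambda h: Fraction(1 - h[-1]) if h else Fraction(1, 2)),
    ("uniform",   1, lambda h: Fraction(1, 2)),          # fair-coin theory
]

def prior(length_bits):
    # Shorter programs get exponentially more prior weight: P[T] ~ 2^(-|T|).
    return Fraction(1, 2 ** length_bits)

def posterior(history):
    # P[T|D] proportional to P[D|T] * P[T]: multiply each theory's per-bit
    # likelihoods over the observed history, then normalize.
    weights = {}
    for name, length, predict in THEORIES:
        likelihood = Fraction(1)
        for i, bit in enumerate(history):
            p1 = predict(history[:i])
            likelihood *= p1 if bit == 1 else 1 - p1
        weights[name] = prior(length) * likelihood
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def predict_next(history):
    # P[F|D] = sum_T P[F|T,D] * P[T|D]: posterior-weighted average prediction.
    post = posterior(history)
    preds = {name: predict(history) for name, _, predict in THEORIES}
    return sum(post[name] * preds[name] for name in post)

history = [0, 1, 0, 1, 0, 1]
print(posterior(history))     # "alternate" dominates: it explains the data exactly
print(predict_next(history))  # close to 0, since "alternate" expects a 0 after a 1

Running this, the two theories contradicted by the data get posterior zero, and the shorter-than-"alternate" fair-coin theory retains only a small share of the credence; the weighted prediction is essentially that of the simplest theory consistent with all observations.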


Mathematical

The proof of the "razor" is based on the known mathematical properties of a probability distribution over a countable set. These properties are relevant because the infinite set of all programs is a denumerable set. The sum S of the probabilities of all programs must be exactly equal to one (as per the definition of probability), thus the probabilities must roughly decrease as we enumerate the infinite set of all programs; otherwise S will be strictly greater than one. To be more precise, for every \epsilon > 0, there is some length ''l'' such that the probability of all programs longer than ''l'' is at most \epsilon. This does not, however, preclude very long programs from having very high probability.

Fundamental ingredients of the theory are the concepts of algorithmic probability and Kolmogorov complexity. The universal prior probability of any prefix ''p'' of a computable sequence ''x'' is the sum of the probabilities of all programs (for a universal computer) that compute something starting with ''p''. Given some ''p'' and any computable but unknown probability distribution from which ''x'' is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of ''x'' in optimal fashion.
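
The sum-over-programs construction can be sketched with a deliberately trivial "machine". The following is a toy approximation under stated assumptions: programs are bit strings, and the toy machine simply outputs its program repeated forever, which is nothing like a real universal Turing machine but preserves the shape of the definition.

from itertools import product

def toy_run(program, n):
    # Toy "universal" machine (an assumption for illustration, not a UTM):
    # it outputs its program repeated forever; we take the first n bits.
    return (program * (n // len(program) + 1))[:n]

def universal_prior(prefix, max_len=12):
    # Approximates M(prefix) = sum of 2^(-|q|) over all programs q whose
    # output starts with the prefix, enumerating q up to max_len bits.
    # A real construction uses a prefix-free program encoding so the total
    # weight over all programs is at most 1; this toy skips that detail.
    total = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            q = "".join(bits)
            if toy_run(q, len(prefix)) == prefix:
                total += 2.0 ** (-length)
    return total

# Regular prefixes accumulate weight from many short programs; irregular
# ones are only produced by programs nearly as long as themselves.
print(universal_prior("01010101"))  # high: "01", "0101", ... all generate it
print(universal_prior("01101000"))  # low: no short repeating pattern matches

Even in this crude model, the first print is roughly an order of magnitude larger than the second, showing how the prior concentrates on data that admits a short description.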


Mathematical guarantees


Solomonoff's completeness

The remarkable property of Solomonoff's induction is its completeness. In essence, the completeness theorem guarantees that the expected cumulative errors made by the predictions based on Solomonoff's induction are upper-bounded by the Kolmogorov complexity of the (stochastic) data generating process. The errors can be measured using the Kullback–Leibler divergence or the square of the difference between the induction's prediction and the probability assigned by the (stochastic) data generating process.
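
Stated more precisely, with M denoting the universal predictor, \mu the computable (stochastic) process generating the data, and K(\mu) the Kolmogorov complexity of \mu (notation assumed here for the statement), the completeness bound for binary sequences takes the form

\sum_{t=1}^{\infty} \mathbb{E}_{\mu} \Big[ \big( M(x_t = 1 \mid x_{<t}) - \mu(x_t = 1 \mid x_{<t}) \big)^2 \Big] \le \frac{\ln 2}{2} \, K(\mu).

Since the right-hand side is a finite constant independent of t, the per-step expected squared error must vanish as t grows: the predictions of Solomonoff's induction converge to the true conditional probabilities, whatever computable process generated the data.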


Solomonoff's uncomputability

Unfortunately, Solomonoff also proved that Solomonoff's induction is uncomputable. In fact, he showed that computability and completeness are mutually exclusive: any complete theory must be uncomputable. The proof of this is derived from a game between the induction and the environment. Essentially, any computable induction can be tricked by a computable environment, by choosing the computable environment that negates the computable induction's prediction. This fact can be regarded as an instance of the ''no free lunch theorem'' (Wolpert, D.H., Macready, W.G., "No Free Lunch Theorems for Optimization", 1997).
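
The diagonalization at the heart of this argument can be sketched in a few lines. This is a toy rendering under the assumption that the predictor is handed to us as a computable function from histories to probabilities; all names here are illustrative.

def adversarial_environment(predictor):
    # Given any computable predictor (history -> P(next bit = 1)), build
    # the computable environment that always emits the bit the predictor
    # considers less likely. This environment is itself a program, so it
    # lies inside the very class of environments the predictor must handle.
    def next_bit(history):
        return 0 if predictor(history) >= 0.5 else 1
    return next_bit

def error_rate(predictor, steps=100):
    env = adversarial_environment(predictor)
    history, errors = [], 0
    for _ in range(steps):
        bit = env(history)
        guess = 1 if predictor(history) >= 0.5 else 0
        errors += (guess != bit)  # by construction, always an error
        history.append(bit)
    return errors / steps

# Any computable predictor errs on every step of its adversarial environment:
print(error_rate(lambda h: 0.5))                            # fair-coin predictor -> 1.0
print(error_rate(lambda h: sum(h) / len(h) if h else 0.5))  # frequency predictor -> 1.0

Both predictors are wrong on every single bit of their respective adversarial environments, which is the game-theoretic content of the incompatibility between computability and completeness.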