Solomonoff's theory of inductive inference
   HOME

TheInfoList



OR:

Solomonoff's theory of inductive inference is a mathematical proof that if a universe is generated by an algorithm, then observations of that universe, encoded as a dataset, are best predicted by the smallest executable archive of that dataset. This formalization of Occam's razorJJ McCall. Induction: From Kolmogorov and Solomonoff to De Finetti and Back to Kolmogorov – Metroeconomica, 2004 – Wiley Online Library.D Stork. Foundations of Occam's razor and parsimony in learning from ricoh.com – NIPS 2001 Workshop, 2001A.N. Soklakov. Occam's razor as a formal basis for a physical theor
from arxiv.org
– Foundations of Physics Letters, 2002 – Springer
M Hutter. On the existence and convergence of computable universal prior
arxiv.org
– Algorithmic Learning Theory, 2003 – Springer
for induction was introduced by
Ray Solomonoff Ray Solomonoff (July 25, 1926 – December 7, 2009) was the inventor of algorithmic probability, his General Theory of Inductive Inference (also known as Universal Inductive Inference),Samuel Rathmanner and Marcus Hutter. A philosophical treatise ...
, based on
probability theory Probability theory is the branch of mathematics concerned with probability. Although there are several different probability interpretations, probability theory treats the concept in a rigorous mathematical manner by expressing it through a set ...
and
theoretical computer science computer science (TCS) is a subset of general computer science and mathematics that focuses on mathematical aspects of computer science such as the theory of computation, lambda calculus, and type theory. It is difficult to circumscribe the ...
. In essence, Solomonoff's induction derives the
posterior probability The posterior probability is a type of conditional probability that results from updating the prior probability with information summarized by the likelihood via an application of Bayes' rule. From an epistemological perspective, the posterior ...
of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes rule and some ''universal'' prior, that is, a prior that assigns a positive probability to any computable theory.


Origin


Philosophical

The theory is based in philosophical foundations, and was founded by
Ray Solomonoff Ray Solomonoff (July 25, 1926 – December 7, 2009) was the inventor of algorithmic probability, his General Theory of Inductive Inference (also known as Universal Inductive Inference),Samuel Rathmanner and Marcus Hutter. A philosophical treatise ...
around 1960. It is a mathematically formalized combination of Occam's razor and the Principle of Multiple Explanations.Ming Li and Paul Vitanyi, ''An Introduction to Kolmogorov Complexity and Its Applications.'' Springer-Verlag, N.Y., 2008p 339 ff. All computable theories which perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Marcus Hutter's
universal artificial intelligence AIXI is a theoretical mathematical formalism for artificial general intelligence. It combines Solomonoff induction with sequential decision theory. AIXI was first proposed by Marcus Hutter in 2000 and several results regarding AIXI are proved i ...
builds upon this to calculate the expected value of an action.


Principle

Solomonoff's induction has been argued to be the computational formalization of pure Bayesianism. To understand, recall that Bayesianism derives the posterior probability \mathbb P D/math> of a theory T given data D by applying Bayes rule, which yields \mathbb P D= \mathbb P T\mathbb P / (\mathbb P T\mathbb P + \sum_ \mathbb P A\mathbb P , where theories A are alternatives to theory T. For this equation to make sense, the quantities \mathbb P T/math> and \mathbb P A/math> must be well-defined for all theories T and A. In other words, any theory must define a probability distribution over observable data D. Solomonoff's induction essentially boils down to demanding in addition that all such probability distributions be computable. Interestingly, the set of computable probability distributions is a subset of the set of all programs, which is
countable In mathematics, a set is countable if either it is finite or it can be made in one to one correspondence with the set of natural numbers. Equivalently, a set is ''countable'' if there exists an injective function from it into the natural numbers ...
. Similarly, the sets of observable data considered by Solomonoff were finite. Without loss of generality, we can thus consider that any observable data is a finite
bit string A bit array (also known as bitmask, bit map, bit set, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level ...
. As a result, Solomonoff's induction can be defined by only invoking discrete probability distributions. Solomonoff's induction then allows to make probabilistic predictions of future data F, by simply obeying the laws of probability. Namely, we have \mathbb P D= \mathbb E_T T,D.html"_;"title=".html"_;"title="mathbb_P[F">T,D">.html"_;"title="mathbb_P T,D=_\sum_T_\mathbb_P[F.html"_;"title="">T,D.html"_;"title=".html"_;"title="mathbb_P[F">T,D">.html"_;"title="mathbb_P[F">T,D=_\sum_T_\mathbb_P[F">T,D\mathbb_P_D/math>._This_quantity_can_be_interpreted_as_the_average_predictions_\mathbb_P[F.html" ;"title="">T,D=_\sum_T_\mathbb_P[F.html" ;"title="">T,D.html" ;"title=".html" ;"title="mathbb P[F">T,D">.html" ;"title="mathbb P[F">T,D= \sum_T \mathbb P[F">T,D\mathbb P D/math>. This quantity can be interpreted as the average predictions \mathbb P[F">T,D/math> of all theories T given past data D, weighted by their posterior credences \mathbb P D/math>.


Mathematical

The proof of the "razor" is based on the known mathematical properties of a probability distribution over a countable set. These properties are relevant because the infinite set of all programs is a denumerable set. The sum S of the probabilities of all programs must be exactly equal to one (as per the definition of [ robability) thus the probabilities must roughly decrease as we enumerate the infinite set of all programs, otherwise S will be strictly greater than one. To be more precise, for every \epsilon > 0, there is some length ''l'' such that the probability of all programs longer than ''l'' is at most \epsilon. This does not, however, preclude very long programs from having very high probability. Fundamental ingredients of the theory are the concepts of
algorithmic probability In algorithmic information theory, algorithmic probability, also known as Solomonoff probability, is a mathematical method of assigning a prior probability to a given observation. It was invented by Ray Solomonoff in the 1960s. It is used in induc ...
and
Kolmogorov complexity In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produ ...
. The universal
prior probability In Bayesian statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into ...
of any prefix ''p'' of a computable sequence ''x'' is the sum of the probabilities of all programs (for a
universal computer A Turing machine is a mathematical model of computation describing an abstract machine that manipulates symbols on a strip of tape according to a table of rules. Despite the model's simplicity, it is capable of implementing any computer algori ...
) that compute something starting with ''p''. Given some ''p'' and any computable but unknown probability distribution from which ''x'' is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of ''x'' in optimal fashion.


Mathematical guarantees


Solomonoff's completeness

The remarkable property of Solomonoff's induction is its completeness. In essence, the completeness theorem guarantees that the expected cumulative errors made by the predictions based on Solomonoff's induction are upper-bounded by the
Kolmogorov complexity In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produ ...
of the (stochastic) data generating process. The errors can be measured using the
Kullback–Leibler divergence In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how one probability distribution ''P'' is different fr ...
or the square of the difference between the induction's prediction and the probability assigned by the (stochastic) data generating process.


Solomonoff's uncomputability

Unfortunately, Solomonoff also proved that Solomonoff's induction is uncomputable. In fact, he showed that
computability Computability is the ability to solve a problem in an effective manner. It is a key topic of the field of computability theory within mathematical logic and the theory of computation within computer science. The computability of a problem is clo ...
and completeness are mutually exclusive: any complete theory must be uncomputable. The proof of this is derived from a game between the induction and the environment. Essentially, any computable induction can be tricked by a computable environment, by choosing the computable environment that negates the computable induction's prediction. This fact can be regarded as an instance of the ''
no free lunch theorem In mathematical folklore, the "no free lunch" (NFL) theorem (sometimes pluralized) of David Wolpert and William Macready appears in the 1997 "No Free Lunch Theorems for Optimization".Wolpert, D.H., Macready, W.G. (1997),No Free Lunch Theorems f ...
''.


Modern applications


Artificial intelligence

Though Solomonoff's inductive inference is not computable, several AIXI-derived algorithms approximate it in order to make it run on a modern computer. The more computing power they are given, the closer their predictions are to the predictions of inductive inference (their mathematical limit is Solomonoff's inductive inference). Another direction of inductive inference is based on E. Mark Gold's model of learning in the limit from 1967 and has developed since then more and more models of learning. The general scenario is the following: Given a class ''S'' of computable functions, is there a learner (that is, recursive functional) which for any input of the form (''f''(0),''f''(1),...,''f''(''n'')) outputs a hypothesis (an index ''e'' with respect to a previously agreed on acceptable numbering of all computable functions; the indexed function may be required consistent with the given values of ''f''). A learner ''M'' learns a function ''f'' if almost all its hypotheses are the same index ''e'', which generates the function ''f''; ''M'' learns ''S'' if ''M'' learns every ''f'' in ''S''. Basic results are that all recursively enumerable classes of functions are learnable while the class REC of all computable functions is not learnable. Many related models have been considered and also the learning of classes of recursively enumerable sets from positive data is a topic studied from Gold's pioneering paper in 1967 onwards. A far reaching extension of the Gold’s approach is developed by Schmidhuber's theory of generalized Kolmogorov complexities, which are kinds of super-recursive algorithms.


Turing machines

The third mathematically based direction of inductive inference makes use of the theory of automata and computation. In this context, the process of inductive inference is performed by an abstract automaton called an inductive
Turing machine A Turing machine is a mathematical model of computation describing an abstract machine that manipulates symbols on a strip of tape according to a table of rules. Despite the model's simplicity, it is capable of implementing any computer algori ...
(Burgin, 2005). ''Inductive Turing machines'' represent the next step in the development of computer science providing better models for contemporary computers and computer networks (Burgin, 2001) and forming an important class of super-recursive algorithms as they satisfy all conditions in the definition of
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
. Namely, each inductive Turing machine is a type of effective method in which a definite list of well-defined instructions for completing a task, when given an initial state, will proceed through a well-defined series of successive states, eventually terminating in an end-state. The difference between an inductive Turing machine and a
Turing machine A Turing machine is a mathematical model of computation describing an abstract machine that manipulates symbols on a strip of tape according to a table of rules. Despite the model's simplicity, it is capable of implementing any computer algori ...
is that to produce the result a Turing machine has to stop, while in some cases an inductive Turing machine can do this without stopping.
Stephen Kleene Stephen Cole Kleene ( ; January 5, 1909 – January 25, 1994) was an American mathematician. One of the students of Alonzo Church, Kleene, along with Rózsa Péter, Alan Turing, Emil Post, and others, is best known as a founder of the branch of ...
called procedures that could run forever without stopping by the name ''calculation procedure or algorithm'' (Kleene 1952:137). Kleene also demanded that such an algorithm must eventually exhibit "some object" (Kleene 1952:137). This condition is satisfied by inductive Turing machines, as their results are exhibited after a finite number of steps, but inductive Turing machines do not always tell at which step the result has been obtained. Simple inductive Turing machines are equivalent to other models of computation. More advanced inductive Turing machines are much more powerful. It is proved (Burgin, 2005) that limiting partial recursive functions, trial and error predicates, general Turing machines, and simple inductive Turing machines are equivalent models of computation. However, simple inductive Turing machines and general Turing machines give direct constructions of computing automata, which are thoroughly grounded in physical machines. In contrast, trial and error predicates, limiting recursive functions and limiting partial recursive functions present syntactic systems of symbols with formal rules for their manipulation. Simple inductive Turing machines and general Turing machines are related to limiting partial recursive functions and trial and error predicates as Turing machines are related to partial recursive functions and lambda-calculus. Note that only simple inductive Turing machines have the same structure (but different functioning semantics of the output mode) as Turing machines. Other types of inductive Turing machines have an essentially more advanced structure due to the structured memory and more powerful instructions. Their utilization for inference and learning allows achieving higher efficiency and better reflects learning of people (Burgin and Klinger, 2004). Some researchers confuse computations of inductive Turing machines with non-stopping computations or with infinite time computations. First, some of computations of inductive Turing machines halt. As in the case of conventional Turing machines, some halting computations give the result, while others do not give. Second, some non-stopping computations of inductive Turing machines give results, while others do not give. Rules of inductive Turing machines determine when a computation (stopping or non-stopping) gives a result. Namely, an inductive Turing machine produces output from time to time and once this output stops changing, it is considered the result of the computation. It is necessary to know that descriptions of this rule in some papers are incorrect. For instance, Davis (2006: 128) formulates the rule when result is obtained without stopping as "once the correct output has been produced any subsequent output will simply repeat this correct result." Third, in contrast to the widespread misconception, inductive Turing machines give results (when it happens) always after a finite number of steps (in finite time) in contrast to infinite and infinite-time computations. There are two main distinctions between conventional Turing machines and simple inductive Turing machines. The first distinction is that even simple inductive Turing machines can do much more than conventional Turing machines. The second distinction is that a conventional Turing machine always informs (by halting or by coming to a final state) when the result is obtained, while a simple inductive Turing machine in some cases does inform about reaching the result, while in other cases (where the conventional Turing machine is helpless), it does not inform. People have an illusion that a computer always itself informs (by halting or by other means) when the result is obtained. In contrast to this, users themselves have to decide in many cases whether the computed result is what they need or it is necessary to continue computations. Indeed, everyday desktop computer applications like word processors and spreadsheets spend most of their time waiting in
event loop In computer science, the event loop is a programming construct or design pattern that waits for and dispatches events or messages in a program. The event loop works by making a request to some internal or external "event provider" (that generally ...
s, and do not terminate until directed to do so by users.


Evolutionary inductive Turing machines

Evolutionary approach to inductive inference is accomplished by another class of automata called evolutionary inductive Turing machines (Burgin and Eberbach, 2009; 2012). An evolutionary inductive Turing machine is a (possibly infinite) sequence ''E'' = of inductive Turing machines ''A'' 't''each working on generations X which are coded as words in the alphabet of the machines ''A'' 't'' The goal is to build a “population” ''Z'' satisfying the inference condition. The automaton ''A'' 't''called a component, or a level automaton, of E represents (encodes) a one-level evolutionary algorithm that works with input generations ''X'' 'i''of the population by applying the variation operators v and selection operator s. The first generation ''X'' is given as input to ''E'' and is processed by the automaton ''A'' which generates/produces the first generation ''X'' as its transfer output, which goes to the automaton ''A'' For all ''t'' = 1, 2, 3, ..., the automaton ''A'' 't''receives the generation ''X'' 't'' − 1as its input from ''A'' 't'' − 1and then applies the variation operator v and selection operator ''s'', producing the generation ''X'' 'i'' + 1and sending it to ''A'' 't'' + 1to continue evolution.


See also

*
Algorithmic information theory Algorithmic information theory (AIT) is a branch of theoretical computer science that concerns itself with the relationship between computation and information of computably generated objects (as opposed to stochastically generated), such as str ...
* Bayesian inference *
Language identification in the limit Language identification in the limit is a formal model for inductive inference of formal languages, mainly by computers (see machine learning and induction of regular languages). It was introduced by E. Mark Gold in a technical report and a journal ...
*
Inductive inference Inductive reasoning is a method of reasoning in which a general principle is derived from a body of observations. It consists of making broad generalizations based on specific observations. Inductive reasoning is distinct from ''deductive'' rea ...
* Inductive probability *
Mill's methods Mill's Methods are five methods of induction described by philosopher John Stuart Mill in his 1843 book ''A System of Logic''. They are intended to illuminate issues of causation. The methods Direct method of agreement For a property to be a ...
*
Minimum description length Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occa ...
*
Minimum message length Minimum message length (MML) is a Bayesian information-theoretic method for statistical model comparison and selection. It provides a formal information theory restatement of Occam's Razor: even when models are equal in their measure of fit-accurac ...
* For a philosophical viewpoint, see: Problem of induction and
New riddle of induction The new riddle of induction was presented by Nelson Goodman in '' Fact, Fiction, and Forecast'' as a successor to Hume's original problem. It presents the logical predicates grue and bleen which are unusual due to their time-dependence. Many ha ...


References


Sources

* * Burgin, M. (2005), ''Super-recursive Algorithms'', Monographs in computer science, Springer. * Burgin, M., "How We Know What Technology Can Do", ''Communications of the ACM'', v. 44, No. 11, 2001, pp. 82–88. * Burgin, M.; Eberbach, E., "Universality for Turing Machines, Inductive Turing Machines and Evolutionary Algorithms", ''Fundamenta Informaticae'', v. 91, No. 1, 2009, 53–77. * Burgin, M.; Eberbach, E., "On Foundations of Evolutionary Computation: An Evolutionary Automata Approach", in ''Handbook of Research on Artificial Immune Systems and Natural Computing: Applying Complex Adaptive Technologies'' (Hongwei Mo, Ed.), IGI Global, Hershey, Pennsylvania, 2009, 342–360. * Burgin, M.; Eberbach, E., "Evolutionary Automata: Expressiveness and Convergence of Evolutionary Computation", ''Computer Journal'', v. 55, No. 9, 2012, pp. 1023–1029. * Burgin, M.; Klinger, A. Experience, Generations, and Limits in Machine Learning, ''Theoretical Computer Science'', v. 317, No. 1/3, 2004, pp. 71–91 * Davis, Martin (2006) "The Church–Turing Thesis: Consensus and opposition]". Proceedings, Computability in Europe 2006. Lecture Notes in Computer Science, 3988 pp. 125–132. * William Gasarch, Gasarch, W.; Smith, C. H. (1997) "A survey of inductive inference with an emphasis on queries". ''Complexity, logic, and recursion theory'', Lecture Notes in Pure and Appl. Math., 187, Dekker, New York, pp. 225–260. * Hay, Nick.
Universal Semimeasures: An Introduction
" CDMTCS Research Report Series, University of Auckland, Feb. 2007. * Jain, Sanjay ; Osherson, Daniel ; Royer, James ; Sharma, Arun, ''Systems that Learn: An Introduction to Learning Theory'' (second edition),
MIT Press The MIT Press is a university press affiliated with the Massachusetts Institute of Technology (MIT) in Cambridge, Massachusetts (United States). It was established in 1962. History The MIT Press traces its origins back to 1926 when MIT publish ...
, 1999. * . * Li Ming; Vitanyi, Paul, ''An Introduction to Kolmogorov Complexity and Its Applications'', 2nd Edition, Springer Verlag, 1997. * Osherson, Daniel ; Stob, Michael ; Weinstein, Scott, ''Systems That Learn, An Introduction to Learning Theory for Cognitive and Computer Scientists'',
MIT Press The MIT Press is a university press affiliated with the Massachusetts Institute of Technology (MIT) in Cambridge, Massachusetts (United States). It was established in 1962. History The MIT Press traces its origins back to 1926 when MIT publish ...
, 1986. * * * {{cite journal , doi=10.1016/S0019-9958(64)90131-7 , last=Solomonoff , first= Ray , title=A Formal Theory of Inductive Inference Part II , journal = Information and Control , url=http://world.std.com/~rjs/1964pt2.pdf , volume=7 , issue= 2 , pages= 224–254 , date=June 1964, doi-access=free


External links


Algorithmic probability – Scholarpedia
Statistical inference Inductive reasoning Machine learning Bayesian statistics Algorithmic information theory