Inductive probability attempts to give the probability of future
events based on past events. It is the basis for inductive reasoning,
and gives the mathematical basis for learning and the perception of
patterns. It is a source of knowledge about the world.
There are three sources of knowledge: inference, communication, and
deduction. Communication relays information found using other methods.
Deduction establishes new facts based on existing facts. Only
inference establishes new facts from data.
The basis of inference is Bayes' theorem. But this theorem is
sometimes hard to apply and understand. A simpler way to
understand inference is in terms of quantities of information.
Information describing the world is written in a language. For
example, a simple mathematical language of propositions may be chosen.
Sentences may be written down in this language as strings of
characters. But in the computer it is possible to encode these
sentences as strings of bits (1s and 0s). Then the language may be
encoded so that the most commonly used sentences are the shortest.
This internal language implicitly represents probabilities of
statements.
Occam's razor says that the simplest theory, consistent with the data, is the most likely to be correct. The theory with the shortest encoding in this internal language is then the most probable.
History
Bayes' theorem,

P(A ∧ B) = P(B)·P(A|B) = P(A)·P(B|A)

becomes, in terms of message length L,

L(A ∧ B) = L(B) + L(A|B) = L(A) + L(B|A)

What this means is that in describing an event, if all the information
is given describing the event then the length of the information may
be used to give the raw probability of the event. So if the
information describing the occurrence of A is given, along with the
information describing B given A, then all the information describing
A and B has been given.[3] [4]
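The relationship between message lengths and probabilities can be checked numerically. The sketch below uses arbitrary example probabilities to verify that lengths add where probabilities multiply, with L(X) = −log₂ P(X):

```python
import math

def length(p):
    """Message length in bits: L(X) = -log2 P(X)."""
    return -math.log2(p)

# Arbitrary example probabilities for two statements A and B.
p_a = 0.5           # P(A)
p_b_given_a = 0.25  # P(B|A)
p_ab = p_a * p_b_given_a  # P(A ∧ B) = P(A)·P(B|A)

# L(A ∧ B) = L(A) + L(B|A): lengths add where probabilities multiply.
assert math.isclose(length(p_ab), length(p_a) + length(p_b_given_a))
```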
Inference based on program complexity

If a theory is a program of length L(K) bits then its prior probability is,

P(K) = 2^−L(K)

The length of the shortest program that represents the string of bits is called the Kolmogorov complexity.

The prior probability of an arbitrary string of n bits is 2^−n. Each program that produces the sequence of bits, but is shorter than n, is a theory/pattern about the bits with a probability of 2^−k, where k is the length of the program.
The probability of receiving a sequence of bits y after receiving a
series of bits x is then the conditional probability of receiving y
given x, which is the probability of x with y appended, divided by the
probability of x.[6][7][8]
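This prediction rule can be sketched with a tiny, hypothetical set of programs standing in for the (uncomputable) enumeration of all programs; each program is represented only by its length in bits and the output it produces:

```python
import math

# Hypothetical program set: (length_in_bits, output_prefix).
programs = [
    (3, "01010101"),  # short program: alternating bits
    (5, "01011111"),  # longer program: alternates, then ones
]

def prior(x):
    """P(x): sum of 2^-k over programs of length k whose output starts with x."""
    return sum(2 ** -k for k, out in programs if out.startswith(x))

x, y = "0101", "01"
# Conditional probability of receiving y after x: P(y|x) = P(x + y) / P(x)
p_y_given_x = prior(x + y) / prior(x)
assert math.isclose(p_y_given_x, 0.8)  # the shorter program dominates the prediction
```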
Universal priors
The programming language affects the predictions of the next bit in
the string. The language acts as a prior probability. This is
particularly a problem where the programming language codes for
numbers and other data types. Intuitively we think that 0 and 1 are
simple numbers, and that prime numbers are somehow more complex than
numbers that may be composite.
Using the Kolmogorov complexity reduces this bias: by the invariance theorem, the choice of description language changes the complexity of a statement only by a constant.
Universal artificial intelligence

The theory of universal artificial intelligence applies inductive probability to an agent choosing actions in an environment. Problems the agent must handle include,

Exploring the environment; performing actions to get responses that broaden the agent's knowledge.
Competing or co-operating with another agent; games.
Balancing short and long term rewards.

In general no agent will always provide the best actions in all
situations. A particular choice made by an agent may be wrong, and the
environment may provide no way for the agent to recover from an
initial bad choice. However the agent is Pareto optimal, in the sense that no other agent will do better than it in some environment without doing worse in another.
The deductive approach to probability ignores,

Shared knowledge.
Assumed facts, that should be inferred from the data.

For example, in a trial the participants are aware of the outcome of all
the previous history of trials. They also assume that each outcome is
equally probable. Together this allows a single unconditional value of
probability to be defined.
But in reality each individual does not have the same information. And
in general the probability of each outcome is not equal. The dice may
be loaded, and this loading needs to be inferred from the data.
A probability estimate depends on,

The statement.
Prior knowledge.
Prior probabilities.
The estimation procedure used to give the probability.

Combining probability approaches

Inductive probability combines two different approaches to probability: probability as information, and probability as frequency. Each approach gives a slightly different viewpoint. Information theory is used in relating probabilities to quantities of information. This approach is often used in giving estimates of prior probabilities. Frequentist probability defines probabilities in terms of ratios of outcomes; applied to possible worlds, it gives the laws for combining and manipulating probabilities.

The relationship between information and probability is,

P(x) = 2^−L(x)

where P(x) is the probability of the string of bits x and L(x) is its length. The prior probability of any statement is calculated from the number of bits needed to state it. See also information theory.

Combining information

Two statements A and B may be represented by two separate encodings. Then the length of the encoding is,

L(A ∧ B) = L(A) + L(B)

or in terms of probability,

P(A ∧ B) = P(A)·P(B)

But this law is not always true, because there may be a shorter method of encoding B if we assume A. So the above probability law applies only if A and B are "independent".
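The independence caveat can be illustrated numerically (the probabilities are hypothetical): when B becomes cheaper to encode given A, the joint encoding is shorter than the sum of the separate encodings.

```python
import math

def L(p):
    """Message length in bits for a statement of probability p."""
    return -math.log2(p)

p_a, p_b = 0.25, 0.5

# Independent statements: the joint length is the sum of the lengths.
assert math.isclose(L(p_a * p_b), L(p_a) + L(p_b))

# Dependent statements: knowing A makes B cheaper to encode,
# so the joint encoding is shorter than the independent one.
p_b_given_a = 0.9  # hypothetical: B is very likely once A holds
assert L(p_a * p_b_given_a) < L(p_a) + L(p_b)
```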
The internal language of information
The primary use of the information approach to probability is to
provide estimates of the complexity of statements. Recall that Occam's
razor states that "All things being equal, the simplest theory is the
most likely to be correct". In order to apply this rule, first there
needs to be a definition of what "simplest" means. Information theory
defines simplest to mean having the shortest encoding.
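As a toy illustration (with hypothetical bit lengths), the prior assigned by this definition halves with each extra bit of description, so the shorter of two encoded theories is always the more probable:

```python
# Hypothetical encoding lengths, in bits, of two competing theories.
theories = {"simple": 12, "complex": 20}

priors = {name: 2 ** -bits for name, bits in theories.items()}
assert priors["simple"] > priors["complex"]
# Each extra bit of description halves the prior probability.
assert priors["simple"] / priors["complex"] == 2 ** (20 - 12)
```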
Encoding expressions

An expression is constructed from sub-expressions, which may be,

Constants (including function identifiers).
Applications of functions.
Quantifiers.

A Huffman code must distinguish the 3 cases. The length of each code is based on the frequency of each type of sub-expression.

Initially constants are all assigned the same length/probability.
Later constants may be assigned a probability using the Huffman code,
based on the number of uses of each function identifier in all expressions
recorded so far. In using a Huffman code the goal is to estimate
probabilities, not to compress the data.
The length of a function application is the length of the function
identifier constant plus the sum of the sizes of the expressions for
each parameter.
The length of a quantifier is the length of the expression being
quantified over.
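A sketch of the idea, using Huffman's algorithm to turn usage counts of constants into code lengths, and those lengths into probability estimates (the symbol names and counts are hypothetical):

```python
import heapq

def huffman_lengths(counts):
    """Code length (bits) per symbol from usage counts, via Huffman's algorithm."""
    heap = [(c, [s]) for s, c in counts.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in counts}
    while len(heap) > 1:
        c1, syms1 = heapq.heappop(heap)
        c2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # one more bit for every symbol under the merged node
        heapq.heappush(heap, (c1 + c2, syms1 + syms2))
    return lengths

# Hypothetical usage counts of constants seen in recorded expressions.
counts = {"zero": 8, "succ": 4, "plus": 2, "times": 2}
lengths = huffman_lengths(counts)
probs = {s: 2 ** -l for s, l in lengths.items()}  # P(symbol) = 2^-length
```

Note that only the code lengths matter here: as the text says, the Huffman code is used to estimate probabilities (P = 2^−length), not to compress the data.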
Distribution of numbers
No explicit representation of natural numbers is given. However
natural numbers may be constructed by applying the successor function
to 0, and then applying other arithmetic functions. A distribution of
natural numbers is implied by this, based on the complexity of
constructing each number.
Rational numbers are constructed by the division of natural numbers.
The simplest representation has no common factors between the
numerator and the denominator. This allows the probability
distribution of natural numbers to be extended to rational numbers.
Probability may be calculated over a space of possible worlds. The set of possible worlds in which a statement B is true is written,

b = {x : B(x)}

Conditional probability
Each probability is always associated with the state of knowledge at a
particular point in the argument. Probabilities before an inference
are known as prior probabilities, and probabilities after are known as
posterior probabilities.
The probability that B is true given that A is true is written as,

P(B|A)

All probabilities are in some sense conditional. The prior probability of B is,

P(B) = P(B|⊤)

The frequentist approach applied to possible worlds

In the frequentist approach, probabilities are defined as the ratio of the number of outcomes within an event to the total number of outcomes. In the possible world model each possible world is an outcome, and statements about possible worlds define events. The probability of a statement being true is the number of possible worlds in which the statement is true, divided by the total number of worlds. The probability of a statement A being true about possible worlds is then,

P(A) = |{x : A(x)}| / |{x : ⊤}|

For a conditional probability,

P(B|A) = |{x : A(x) ∧ B(x)}| / |{x : A(x)}|

then

P(A ∧ B) = |{x : A(x) ∧ B(x)}| / |{x : ⊤}|
         = (|{x : A(x) ∧ B(x)}| / |{x : A(x)}|) · (|{x : A(x)}| / |{x : ⊤}|)
         = P(A)·P(B|A)

Using symmetry this equation may be written out as Bayes' law,

P(A ∧ B) = P(A)·P(B|A) = P(B)·P(A|B)

This law describes the relationship between prior and posterior
probabilities when new facts are learnt.
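This possible-worlds definition can be checked directly by enumerating a small world set (here, hypothetically, all assignments of two coin flips):

```python
from itertools import product

# Possible worlds: every assignment of two coin flips (hypothetical example).
worlds = list(product([0, 1], repeat=2))

def P(pred):
    """Probability of a statement: worlds where it holds / all worlds."""
    return sum(1 for w in worlds if pred(w)) / len(worlds)

def P_cond(b, a):
    """P(B|A): worlds where both hold / worlds where A holds."""
    return sum(1 for w in worlds if a(w) and b(w)) / sum(1 for w in worlds if a(w))

A = lambda w: w[0] == 1     # first flip is heads
B = lambda w: w[0] == w[1]  # both flips agree

# Bayes' law: P(A ∧ B) = P(A)·P(B|A) = P(B)·P(A|B)
both = P(lambda w: A(w) and B(w))
assert both == P(A) * P_cond(B, A) == P(B) * P_cond(A, B)
```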
Written as quantities of information, Bayes' theorem becomes,

L(A ∧ B) = L(A) + L(B|A) = L(B) + L(A|B)

Two statements A and B are said to be independent if knowing the truth of A does not change the probability of B. Mathematically this is,

P(B) = P(B|A)

then Bayes' theorem reduces to,

P(A ∧ B) = P(A)·P(B)

The law of total probability

For a set of mutually exclusive possibilities A_i, the sum of the posterior probabilities must be 1,

∑_i P(A_i|B) = 1

Substituting using Bayes' theorem,

∑_i P(B|A_i)·P(A_i) = ∑_i P(A_i|B)·P(B)

P(B) = ∑_i P(B|A_i)·P(A_i)

This result is used to give the extended form of Bayes' theorem,

P(A_i|B) = P(B|A_i)·P(A_i) / ∑_j P(B|A_j)·P(A_j)

This is the usual form of Bayes' theorem used in practice, as it guarantees that the sum of the posterior probabilities of the A_i is 1.

Alternate possibilities

For mutually exclusive possibilities, the probabilities add,

P(A ∨ B) = P(A) + P(B),  if P(A ∧ B) = 0

Using

A ∨ B = (A ∧ ¬(A ∧ B)) ∨ (B ∧ ¬(A ∧ B)) ∨ (A ∧ B)

the alternatives A ∧ ¬(A ∧ B), B ∧ ¬(A ∧ B) and A ∧ B are all mutually exclusive. Also,

(A ∧ ¬(A ∧ B)) ∨ (A ∧ B) = A

so

P(A ∧ ¬(A ∧ B)) + P(A ∧ B) = P(A)
P(A ∧ ¬(A ∧ B)) = P(A) − P(A ∧ B)

Putting it all together,

P(A ∨ B) = P((A ∧ ¬(A ∧ B)) ∨ (B ∧ ¬(A ∧ B)) ∨ (A ∧ B))
         = P(A ∧ ¬(A ∧ B)) + P(B ∧ ¬(A ∧ B)) + P(A ∧ B)
         = P(A) − P(A ∧ B) + P(B) − P(A ∧ B) + P(A ∧ B)
         = P(A) + P(B) − P(A ∧ B)

Negation

As A ∨ ¬A = ⊤,

P(A) + P(¬A) = 1

Implication and conditional probability

Implication is related to conditional probability by the following equation,

A → B  ⟺  P(B|A) = 1

Derivation,

A → B ⟺ P(A → B) = 1
      ⟺ P((A ∧ B) ∨ ¬A) = 1
      ⟺ P(A ∧ B) + P(¬A) = 1
      ⟺ P(A ∧ B) = P(A)
      ⟺ P(A)·P(B|A) = P(A)
      ⟺ P(B|A) = 1

Bayesian hypothesis testing
Bayes' theorem may be used to estimate the probability of a hypothesis H given the facts F,

P(H|F) = P(H)·P(F|H) / P(F)

or in terms of information,

P(H|F) = 2^−(L(H) + L(F|H) − L(F))

By assuming the hypothesis is true, a simpler representation of the statement F may be given. The length of the encoding of this simpler representation is L(F|H).

L(H) + L(F|H) represents the amount of information needed to represent the facts F, if H is true. L(F) is the amount of information needed to represent F without the hypothesis H. The difference is how much the representation of the facts has been compressed by assuming that H is true. This is the evidence that the hypothesis H is true.

If L(F) is estimated from encoding length then the probability obtained will not be between 0 and 1. The value obtained is proportional to the probability, without being a good probability estimate. The number obtained is sometimes referred to as a relative probability, being how much more probable the theory is than not holding the theory.

If a full set of mutually exclusive hypotheses that provide evidence is known, a proper estimate may be given for the prior probability P(F).

Set of hypotheses

Probabilities may be calculated from the extended form of Bayes' theorem. Given all mutually exclusive hypotheses H_i which give evidence, such that,

L(H_i) + L(F|H_i) < L(F)

and also the hypothesis R, that none of the hypotheses is true, then,

P(H_i|F) = P(H_i)·P(F|H_i) / (P(F|R) + ∑_j P(H_j)·P(F|H_j))

P(R|F) = P(F|R) / (P(F|R) + ∑_j P(H_j)·P(F|H_j))

In terms of information,

P(H_i|F) = 2^−(L(H_i) + L(F|H_i)) / (2^−L(F|R) + ∑_j 2^−(L(H_j) + L(F|H_j)))

P(R|F) = 2^−L(F|R) / (2^−L(F|R) + ∑_j 2^−(L(H_j) + L(F|H_j)))

In most situations it is a good approximation to assume that F is independent of R, which means P(F|R) = P(F), giving,

P(H_i|F) ≈ 2^−(L(H_i) + L(F|H_i)) / (2^−L(F) + ∑_j 2^−(L(H_j) + L(F|H_j)))

P(R|F) ≈ 2^−L(F) / (2^−L(F) + ∑_j 2^−(L(H_j) + L(F|H_j)))

Boolean inductive inference

Abductive inference[11][12][13][14] starts with a set of facts F, which is a statement (Boolean expression). Abductive reasoning is of the form,

A theory T implies the statement F. As the theory T is simpler than F, abduction says that there is a probability that the theory T is implied by F.

The theory T, also called an explanation of the condition F, is an answer to the ubiquitous factual "why" question. For example, the condition F may be "Why do apples fall?". The answer is a theory T that implies that apples fall, such as the law of gravitation,

F = G·m_1·m_2 / r²
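The hypothesis-testing calculation above can be sketched numerically, using hypothetical message lengths for two hypotheses and assuming P(F|R) = P(F):

```python
# Hypothetical message lengths, in bits.
L_F = 100          # L(F): facts encoded without any hypothesis
hypotheses = {     # name: (L(H), L(F|H))
    "H1": (10, 60),
    "H2": (25, 70),
}
L_F_given_R = L_F  # assume F is independent of R, so P(F|R) = P(F)

weights = {h: 2 ** -(lh + lfh) for h, (lh, lfh) in hypotheses.items()}
denom = 2 ** -L_F_given_R + sum(weights.values())

posterior = {h: w / denom for h, w in weights.items()}  # P(H_i|F)
p_random = 2 ** -L_F_given_R / denom                    # P(R|F)

# Every hypothesis compresses F (L(H)+L(F|H) < L(F)), so each gets
# more posterior weight than the "no hypothesis" case R.
assert all(p > p_random for p in posterior.values())
assert abs(sum(posterior.values()) + p_random - 1.0) < 1e-12
```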
Inductive inference is of the form,

All observed objects in a class C have a property P. Therefore there is a probability that all objects in the class C have the property P.

In terms of abductive inference, "all objects in the class C have the property P" is a theory that implies the observed condition, "all observed objects in the class C have the property P".

So inductive inference is a special case of abductive inference. In
common usage the term "inductive inference" is often used to refer to
both abductive and inductive inference.
Generalization

For example, Newton's law of gravitation may be obtained by generalization from observations. The first observation,

The center of an apple falls towards the center of the earth.

Generalizing by replacing apple by object, and earth by another object, gives, in a two body system,

The center of an object falls towards the center of another object.

The theory explains all objects falling, so there is strong evidence for it. The second observation,

The planets appear to follow an elliptical path.

After some complicated mathematics, it can be seen that if the acceleration follows the inverse square law then objects will follow an ellipse. So induction gives evidence for the inverse square law. Using Galileo's observation that all objects drop with the same acceleration,

F_1 = m_1·a_1 = (m_1·k_1 / r²)·i_1
F_2 = m_2·a_2 = (m_2·k_2 / r²)·i_2

where i_1 and i_2 are unit vectors towards the center of the other object. Then using Newton's third law, F_1 = −F_2,

F = G·m_1·m_2 / r²

Probabilities for inductive inference

Implication determines conditional probability as,

T → F ⟺ P(F|T) = 1

So,

P(F|T) = 1
L(F|T) = 0

This result may be used in the probabilities given for Bayesian hypothesis testing. For a single theory, H = T and,

P(T|F) = P(T) / P(F)

or in terms of information, the relative probability is,

P(T|F) = 2^−(L(T) − L(F))

Note that this estimate for P(T|F) is not a true probability. If L(T) < L(F) then the theory has evidence to support it. Then for a set of theories T_i = H_i, such that L(T_i) < L(F),

P(T_i|F) = P(T_i) / (P(F|R) + ∑_j P(T_j))

P(R|F) = P(F|R) / (P(F|R) + ∑_j P(T_j))

giving,

P(T_i|F) ≈ 2^−L(T_i) / (2^−L(F) + ∑_j 2^−L(T_j))

P(R|F) ≈ 2^−L(F) / (2^−L(F) + ∑_j 2^−L(T_j))

Derivations

Derivation of inductive probability

Make a list of all the shortest programs K_i that each produce a distinct infinite string of bits, and satisfy the relation,

T_n(R(K_i)) = x

where R(K_i) is the result of running the program K_i, and T_n truncates the string after n bits.

The problem is to calculate the probability that the source is produced by program K_i, given that the truncated source after n bits is x. This is represented by the conditional probability,

P(s = R(K_i) | T_n(s) = x)

Using the extended form of Bayes' theorem,

P(s = R(K_i) | T_n(s) = x) = P(T_n(s) = x | s = R(K_i))·P(s = R(K_i)) / ∑_j P(T_n(s) = x | s = R(K_j))·P(s = R(K_j))

The extended form relies on the law of total probability. This means that the s = R(K_i) must be distinct possibilities, which is given by the condition that each K_i produces a different infinite string. Also one of the conditions s = R(K_i) must be true. This holds because, in the limit as n → ∞, there is always at least one program that produces T_n(s).

As the K_i are chosen so that T_n(R(K_i)) = x, then,

P(T_n(s) = x | s = R(K_i)) = 1

The a priori probability of the string being produced from the program, given no information about the string, is based on the size of the program,

P(s = R(K_i)) = 2^−I(K_i)

giving,

P(s = R(K_i) | T_n(s) = x) = 2^−I(K_i) / ∑_j 2^−I(K_j)

Programs that are the same length as, or longer than, the length of x provide no predictive power. Separate them out, giving,

P(s = R(K_i) | T_n(s) = x) = 2^−I(K_i) / (∑_{j : I(K_j) < n} 2^−I(K_j) + ∑_{j : I(K_j) ⩾ n} 2^−I(K_j))

Then identify the two probabilities as,

P(x has pattern) = ∑_{j : I(K_j) < n} 2^−I(K_j)

P(x is random) = ∑_{j : I(K_j) ⩾ n} 2^−I(K_j)

But the prior probability that x is a random set of bits is 2^−n. So,

P(s = R(K_i) | T_n(s) = x) = 2^−I(K_i) / (2^−n + ∑_{j : I(K_j) < n} 2^−I(K_j))

The probability that the source is random, or unpredictable, is,

P(random(s) | T_n(s) = x) = 2^−n / (2^−n + ∑_{j : I(K_j) < n} 2^−I(K_j))

A model for inductive inference

A model of how worlds are constructed is used in determining the probabilities of theories,

A random bit string is selected.
A condition is constructed from the bit string.
A world is constructed that is consistent with the condition.

If w is the bit string then the world is created such that R(w) is true. An intelligent agent has some facts about the world, represented by the bit string c, which gives the condition,

C = R(c)

The set of bit strings identical with any condition x is E(x),

∀x, E(x) = {w : R(w) ≡ x}

A theory is a simpler condition that explains (or implies) C. The set of all such theories is called T,

T(C) = {t : t → C}

Applying Bayes' theorem
The extended form of Bayes' theorem may be applied,

P(A_i|B) = P(B|A_i)·P(A_i) / ∑_j P(B|A_j)·P(A_j)

where,

B = E(C)
A_i = E(t)

To apply Bayes' theorem the following must hold: the A_i are a partition of the event space.

For T(C) to be a partition, no bit string n may belong to two theories. To prove this, assume it can and derive a contradiction,

(N ∈ T) ∧ (M ∈ T) ∧ (N ≠ M) ∧ (n ∈ E(N) ∧ n ∈ E(M))
⟹ (N ≠ M) ∧ R(n) ≡ N ∧ R(n) ≡ M
⟹ ⊥

Secondly, prove that T includes all outcomes consistent with the condition. As all theories consistent with C are included, R(w) must be in this set.

So Bayes' theorem may be applied as specified, giving,

∀t ∈ T(C), P(E(t)|E(C)) = P(E(t))·P(E(C)|E(t)) / ∑_{j ∈ T(C)} P(E(j))·P(E(C)|E(j))

Using the implication and conditional probability law, the definition of T(C) implies,

∀t ∈ T(C), P(E(C)|E(t)) = 1

The probability of each theory in T is given by,

∀t ∈ T(C), P(E(t)) = ∑_{n : R(n) ≡ t} 2^−L(n)

so,

∀t ∈ T(C), P(E(t)|E(C)) = ∑_{n : R(n) ≡ t} 2^−L(n) / ∑_{j ∈ T(C)} ∑_{m : R(m) ≡ j} 2^−L(m)

Finally the probabilities of the events may be identified with the probabilities of the condition that the outcomes in the event satisfy,

∀t ∈ T(C), P(E(t)|E(C)) = P(t|C)

giving,

∀t ∈ T(C), P(t|C) = ∑_{n : R(n) ≡ t} 2^−L(n) / ∑_{j ∈ T(C)} ∑_{m : R(m) ≡ j} 2^−L(m)

This is the probability of the theory t after observing that the condition C holds.

Removing theories without predictive power

Theories that are less probable than the condition C have no predictive power. Separate them out, giving,

∀t ∈ T(C), P(t|C) = P(E(t)) / ((∑_{j : j ∈ T(C) ∧ P(E(j)) > P(E(C))} P(E(j))) + (∑_{j : j ∈ T(C) ∧ P(E(j)) ≤ P(E(C))} P(E(j))))

The probability of the theories without predictive power on C is the same as the probability of C. So,

P(E(C)) = ∑_{j : j ∈ T(C) ∧ P(E(j)) ≤ P(E(C))} P(E(j))

So the probability,

∀t ∈ T(C), P(t|C) = P(E(t)) / (P(E(C)) + ∑_{j : j ∈ T(C) ∧ P(E(j)) > P(E(C))} P(E(j)))

and the probability of no prediction for C, written as random(C),

P(random(C)|C) = P(E(C)) / (P(E(C)) + ∑_{j : j ∈ T(C) ∧ P(E(j)) > P(E(C))} P(E(j)))

The probability of a condition was given as,

∀t, P(E(t)) = ∑_{n : R(n) ≡ t} 2^−L(n)

Bit strings for theories that are more complex than the bit string given to the agent as input have no predictive power. Their probabilities are better included in the random case. To implement this a new definition, F, is given,

∀t, P(F(t, c)) = ∑_{n : R(n) ≡ t ∧ L(n) < L(c)} 2^−L(n)

Using F, an improved version of the abductive probabilities is,

∀t ∈ T(C), P(t|C) = P(F(t, c)) / (P(F(C, c)) + ∑_{j : j ∈ T(C) ∧ P(F(j, c)) > P(F(C, c))} P(F(j, c)))

P(random(C)|C) = P(F(C, c)) / (P(F(C, c)) + ∑_{j : j ∈ T(C) ∧ P(F(j, c)) > P(F(C, c))} P(F(j, c)))

Key people

William of Ockham
Thomas Bayes
Ray Solomonoff
Andrey Kolmogorov
Chris Wallace
D. M. Boulton
Jorma Rissanen
Marcus Hutter

See also

Abductive reasoning
Algorithmic probability
Algorithmic information theory
Bayesian inference
Information theory
Inductive inference
Inductive logic programming
Inductive reasoning
Learning
Minimum message length
Minimum description length
Occam's razor
Solomonoff's theory of inductive inference
Universal artificial intelligence

References

^ Wallace, Chris; Boulton (1968). "An information measure for classification". Computer Journal. 11 (2): 185–194.
^ Rissanen, J. (1978). "Modeling by shortest data description". Automatica. 14 (5): 465–658. doi:10.1016/0005-1098(78)90005-5.
^ Allison, Lloyd. "Minimum Message Length (MML) – LA's MML introduction".
^ Oliver, J. J.; Baxter, Rohan A. "MML and Bayesianism: Similarities and Differences (Introduction to Minimum Encoding Inference – Part II)".
^ Li, M. and Vitanyi, P., An Introduction to Kolmogorov Complexity and Its Applications, 3rd Edition, Springer Science and Business Media, N.Y., 2008, p. 347.
^ Solomonoff, R., "A Preliminary Report on a General Theory of Inductive Inference", Report V-131, Zator Co., Cambridge, Ma., Feb 4, 1960; revision, Nov. 1960.
^ Solomonoff, R., "A Formal Theory of Inductive Inference, Part I". Information and Control, Vol 7, No. 1, pp. 1–22, March 1964.
^ Solomonoff, R., "A Formal Theory of Inductive Inference, Part II". Information and Control, Vol 7, No. 2, pp. 224–254, June 1964.
^ Hutter, Marcus (1998). Sequential Decisions Based on Algorithmic Probability. Springer. ISBN 3-540-22139-5.
^ Carnap, Rudolf. "Statistical and Inductive Probability" (PDF).
^ "Abduction".
^ Pfeifer, Niki; Kleiter, Gernot D. (2006). "Inference in conditional probability logic". Kybernetika. 42 (4): 391–404.
^ "Conditional Probability". Artificial Intelligence – Foundations of Computational Agents.
^ "Introduction to the theory of Inductive Logic Programming (ILP)".
^ Plotkin, Gordon D. (1970). Meltzer, B.; Michie, D., eds. "A Note on Inductive Generalization". Machine Intelligence. Edinburgh University Press. 5: 153–163.
^ Plotkin, Gordon D. (1971). Meltzer, B.; Michie, D., eds. "A Further Note on Inductive Generalization". Machine Intelligence. Edinburgh University Press. 6: 101–124.
^ Isaac Newton: "In [experimental] philosophy particular propositions are inferred from the phenomena and afterwards rendered general by induction": Principia, Book 3, General Scholium, at p. 392 in Volume 2 of Andrew Motte's English translation published 1729.

External links

Rathmanner, S. and Hutter, M., "A Philosophical Treatise of Universal Induction", Entropy 2011, 13, 1076–1136: a philosophical and mathematical analysis of Solomonoff's theory of inductive inference.
C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Springer-Verlag (Information Science and Statistics), ISBN 0-387-23795-X, May 2005 – chapter headings, table of contents.