Inductive probability attempts to give the probability of future events based on past events. It is the basis for inductive reasoning, and gives the mathematical basis for learning and the perception of patterns. It is a source of knowledge about the world.
There are three sources of knowledge: inference, communication, and deduction. Communication relays information found using other methods. Deduction establishes new facts based on existing facts. Inference establishes new facts from data. Its basis is Bayes' theorem.
Information describing the world is written in a language. For example, a simple mathematical language of propositions may be chosen. Sentences may be written down in this language as strings of characters. But in the computer it is possible to encode these sentences as strings of bits (1s and 0s). Then the language may be encoded so that the most commonly used sentences are the shortest. This internal language implicitly represents probabilities of statements.
Occam's razor says that "the simplest theory, consistent with the data, is most likely to be correct". The "simplest theory" is interpreted as the representation of the theory written in this internal language. The theory with the shortest encoding in this internal language is most likely to be correct.
History
Probability and statistics were focused on probability distributions and tests of significance. Probability was formal, well defined, but limited in scope. In particular, its application was limited to situations that could be defined as an experiment or trial, with a well-defined population.
Bayes' theorem is named after the Rev. Thomas Bayes (1701–1761).
Bayesian inference broadened the application of probability to many situations where a population was not well defined. But Bayes' theorem always depended on prior probabilities to generate new probabilities. It was unclear where these prior probabilities should come from.
Ray Solomonoff developed algorithmic probability circa 1964, which gave an explanation for what randomness is and for how patterns in the data may be represented by computer programs that give shorter representations of the data.
Chris Wallace and D. M. Boulton developed minimum message length circa 1968.
Later Jorma Rissanen developed the minimum description length circa 1978. These methods allow information theory to be related to probability, in a way that can be compared to the application of Bayes' theorem, but which gives a source and explanation for the role of prior probabilities.
Marcus Hutter combined decision theory with the work of Ray Solomonoff and Andrey Kolmogorov to give a theory for the Pareto optimal behavior of an intelligent agent, circa 1998.
Minimum description/message length
The program with the shortest length that matches the data is the most likely to predict future data. This is the thesis behind the minimum message length and minimum description length methods.
At first sight Bayes' theorem appears different from the minimum message/description length principle. At closer inspection it turns out to be the same. Bayes' theorem is about conditional probabilities, and states the probability that event ''B'' happens if firstly event ''A'' happens,
:P(A ∧ B) = P(B|A) P(A)
This becomes, in terms of message length ''L'',
:L(A ∧ B) = L(B|A) + L(A)
This means that if all the information is given describing an event then the length of the information may be used to give the raw probability of the event. So if the information describing the occurrence of ''A'' is given, along with the information describing ''B'' given ''A'', then all the information describing ''A'' and ''B'' has been given.
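The equivalence can be checked numerically. The sketch below (plain Python, with probabilities chosen arbitrarily for illustration) measures the message length of an event as the Shannon code length -log2 of its probability, and confirms that the length of the combined description equals the length of ''A'' plus the length of ''B'' given ''A'':

```python
import math

def message_length(p):
    """Shannon code length in bits for an event of probability p."""
    return -math.log2(p)

# Arbitrary illustrative probabilities (not from any real data).
p_a = 0.25          # P(A)
p_b_given_a = 0.5   # P(B|A)
p_a_and_b = p_a * p_b_given_a  # P(A and B), by the product rule

# The product rule for probabilities becomes a sum of message lengths.
lhs = message_length(p_a_and_b)
rhs = message_length(p_a) + message_length(p_b_given_a)
print(lhs, rhs)  # both 3.0 bits
assert abs(lhs - rhs) < 1e-12
```

Because -log2 turns products into sums, multiplying probabilities and adding message lengths are the same operation viewed two ways.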
Overfitting
Overfitting occurs when the model matches the random noise and not the pattern in the data. For example, take the situation where a curve is fitted to a set of points. If a polynomial with many terms is fitted then it can more closely represent the data. Then the fit will be better, and the information needed to describe the deviations from the fitted curve will be smaller. Smaller information length means higher probability.
However, the information needed to describe the curve must also be considered. The total information for a curve with many terms may be greater than for a curve with fewer terms, that has not as good a fit, but needs less information to describe the polynomial.
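This trade-off can be sketched numerically. The toy Python below is not a real MDL coding scheme: the per-parameter cost and the residual cost function are invented constants for illustration. It compares a two-parameter straight-line fit against a polynomial that interpolates every point; the interpolating model has zero residual cost but pays far more to describe its own coefficients:

```python
import math
import random

random.seed(0)

# Noisy samples from a straight line y = 2x + 1 (the "true" pattern).
xs = [i / 2 for i in range(20)]
ys = [2 * x + 1 + random.gauss(0, 0.1) for x in xs]

def residual_bits(residuals, precision=0.01):
    """Illustrative cost, in bits, of describing the deviations."""
    return sum(math.log2(1 + abs(r) / precision) for r in residuals)

BITS_PER_PARAM = 32  # assumed cost of stating one coefficient

# Model 1: least-squares straight line (2 parameters).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
res = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
cost_line = 2 * BITS_PER_PARAM + residual_bits(res)

# Model 2: a degree-19 polynomial through every point (20 parameters).
# It matches the data exactly, so its residual cost is zero...
cost_interp = n * BITS_PER_PARAM + residual_bits([0.0] * n)

# ...but its total description length is larger: the simpler theory wins.
print(cost_line, cost_interp)
assert cost_line < cost_interp
```

The straight line pays a little for its residuals but very little for its parameters; the interpolating polynomial's parameter cost dwarfs the residual savings.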
Inference based on program complexity
The method used here to give probabilities for inductive inference is based on Solomonoff's theory of inductive inference. A bit string ''x'' is observed. Then consider all programs that generate strings starting with ''x''. Cast in the form of inductive inference, the programs are theories that imply the observation of the bit string ''x''.
Detecting patterns in the data
Consider a series of bits obtained by repeatedly flipping a coin. If all the bits are 1, then people infer that there is a bias in the coin and that it is more likely that the next bit is 1 also. This is described as learning from, or detecting a pattern in, the data.
Such a pattern may be represented by a computer program. A short computer program may be written that produces a series of bits which are all 1. If the length of the program is ''K'' bits then its prior probability is,
:P = 2^{-K}
The length of the shortest program that represents the string of bits is called the Kolmogorov complexity.
Kolmogorov complexity is not computable. This is related to the halting problem. When searching for the shortest program some programs may go into an infinite loop.
Considering all theories
The Greek philosopher Epicurus is quoted as saying "If more than one theory is consistent with the observations, keep all theories".
Just as in a crime novel all theories must be considered in determining the likely murderer, so with inductive probability all programs must be considered in determining the likely future bits arising from the stream of bits.
Suppose ''n'' bits of data have been observed. Programs that are already longer than ''n'' have no predictive power. The raw (or prior) probability that the pattern of bits is random (has no pattern) is 2^{-n}.
Each program that produces the sequence of bits, but is shorter than ''n'', is a theory/pattern about the bits with a probability of 2^{-k}, where ''k'' is the length of the program.
The probability of receiving a sequence of bits ''y'' after receiving a series of bits ''x'' is then the conditional probability of receiving ''y'' given ''x'', which is the probability of ''x'' with ''y'' appended, divided by the probability of ''x'',
:P(y|x) = P(xy) / P(x)
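A drastically simplified sketch of this scheme is below. The "program" class and the program lengths (in bits) are invented for illustration; each program assigns a probability to any observed string, and programs are weighted by 2^{-k} for an assumed length ''k''. The prediction is formed exactly as the ratio of the probability of the extended string to the probability of the observed string:

```python
# Toy Solomonoff-style mixture over three hand-picked "programs".
def p_all_ones(s):
    return 1.0 if all(c == "1" for c in s) else 0.0

def p_all_zeros(s):
    return 1.0 if all(c == "0" for c in s) else 0.0

def p_random(s):
    # A program that outputs fair coin flips: each bit costs a factor 1/2.
    return 0.5 ** len(s)

programs = [
    (2, p_all_ones),   # assumed length 2 bits -> prior weight 2^-2
    (2, p_all_zeros),  # assumed length 2 bits
    (1, p_random),     # assumed length 1 bit  -> prior weight 2^-1
]

def mixture(s):
    """P(s): prior-weighted sum over all programs, given the string s."""
    return sum(2.0 ** -k * p(s) for k, p in programs)

def predict(x, y):
    """P(y|x) = P(x + y) / P(x)."""
    return mixture(x + y) / mixture(x)

p1 = predict("1" * 10, "1")
print(p1)  # close to 1: the all-ones pattern dominates after ten 1s
assert p1 > 0.99
```

After a run of ten 1s the short all-ones program carries almost all the weight, so the mixture strongly predicts another 1; on a patternless string like "10" only the coin-flip program survives and the prediction stays at 1/2.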
Universal priors
The programming language affects the predictions of the next bit in the string. The language acts as a prior probability. This is particularly a problem where the programming language codes for numbers and other data types. Intuitively we think that 0 and 1 are simple numbers, and that prime numbers are somehow more complex than numbers that may be composite.
Using the Kolmogorov complexity gives an unbiased estimate (a universal prior) of the prior probability of a number. As a thought experiment an intelligent agent may be fitted with a data input device giving a series of numbers, after applying some transformation function to the raw numbers. Another agent might have the same input device with a different transformation function. The agents do not see or know about these transformation functions. Then there appears no rational basis for preferring one function over another. A universal prior ensures that although two agents may have different initial probability distributions for the data input, the difference will be bounded by a constant.
So universal priors do not eliminate an initial bias, but they reduce and limit it. Whenever we describe an event in a language, whether natural or otherwise, the language has encoded in it our prior expectations. So some reliance on prior probabilities is inevitable.
A problem arises where an intelligent agent's prior expectations interact with the environment to form a self-reinforcing feedback loop. This is the problem of bias or prejudice. Universal priors reduce but do not eliminate this problem.
Universal artificial intelligence
The theory of universal artificial intelligence applies decision theory to inductive probabilities. The theory shows how the best actions to optimize a reward function may be chosen. The result is a theoretical model of intelligence.
It is a fundamental theory of intelligence, which optimizes the agent's behavior in,
* Exploring the environment; performing actions to get responses that broaden the agent's knowledge.
* Competing or co-operating with another agent; games.
* Balancing short and long term rewards.
In general no agent will always provide the best actions in all situations. A particular choice made by an agent may be wrong, and the environment may provide no way for the agent to recover from an initial bad choice. However the agent is Pareto optimal in the sense that no other agent will do better than this agent in this environment, without doing worse in another environment. No other agent may, in this sense, be said to be better.
At present the theory is limited by incomputability (the halting problem). Approximations may be used to avoid this. Processing speed and combinatorial explosion remain the primary limiting factors for artificial intelligence.
Probability
Probability is the representation of uncertain or partial knowledge about the truth of statements. Probabilities are subjective and personal estimates of likely outcomes based on past experience and inferences made from the data.
This description of probability may seem strange at first. In natural language we refer to "the probability" that the sun will rise tomorrow. We do not refer to "your probability" that the sun will rise. But in order for inference to be correctly modeled probability must be personal, and the act of inference generates new posterior probabilities from prior probabilities.
Probabilities are personal because they are conditional on the knowledge of the individual. Probabilities are subjective because they always depend, to some extent, on prior probabilities assigned by the individual. Subjective should not be taken here to mean vague or undefined.
The term intelligent agent is used to refer to the holder of the probabilities. The intelligent agent may be a human or a machine. If the intelligent agent does not interact with the environment then the probability will converge over time to the frequency of the event.
If however the agent uses the probability to interact with the environment there may be a feedback, so that two agents in the identical environment, starting with only slightly different priors, end up with completely different probabilities. In this case optimal decision theory, as in Marcus Hutter's Universal Artificial Intelligence, will give Pareto optimal performance for the agent. This means that no other intelligent agent could do better in one environment without doing worse in another environment.
Comparison to deductive probability
In deductive probability theories, probabilities are absolutes, independent of the individual making the assessment. But deductive probabilities are based on,
* Shared knowledge.
* Assumed facts, which should be inferred from the data.
For example, in a trial the participants are aware of the outcomes of all previous trials. They also assume that each outcome is equally probable. Together this allows a single unconditional value of probability to be defined.
But in reality each individual does not have the same information. And in general the probability of each outcome is not equal. The dice may be loaded, and this loading needs to be inferred from the data.
Probability as estimation
The principle of indifference has played a key role in probability theory. It says that if N statements are symmetric, so that one condition cannot be preferred over another, then all statements are equally probable.
Taken seriously, in evaluating probability this principle leads to contradictions. Suppose there are 3 bags of gold in the distance and one is asked to select one. Then because of the distance one cannot see the bag sizes. Using the principle of indifference one estimates that each bag has an equal amount of gold: each bag has one third of the gold.
Now, while one is not looking, someone takes one of the bags and divides it into 3 bags. Now there are 5 bags of gold. The principle of indifference now says each bag has one fifth of the gold. A bag that was estimated to have one third of the gold is now estimated to have one fifth of the gold.
Taken as a value associated with the bag the values are different therefore contradictory. But taken as an estimate given under a particular scenario, both values are separate estimates given under different circumstances and there is no reason to believe they are equal.
Estimates of prior probabilities are particularly suspect. Estimates will be constructed that do not follow any consistent frequency distribution. For this reason prior probabilities are considered as estimates of probabilities rather than probabilities.
A full theoretical treatment would associate with each probability,
* The statement
* Prior knowledge
* Prior probabilities
* The estimation procedure used to give the probability.
Combining probability approaches
Inductive probability combines two different approaches to probability.
* Probability and information
* Probability and frequency
Each approach gives a slightly different viewpoint. Information theory is used in relating probabilities to quantities of information. This approach is often used in giving estimates of prior probabilities.
Frequentist probability defines probabilities as objective statements about how often an event occurs. This approach may be stretched by defining the trials to be over possible worlds. Statements about possible worlds define events.
Probability and information
Whereas logic represents only two values, true and false, as the values of a statement, probability associates a number in [0,1] to each statement. If the probability of a statement is 0, the statement is false. If the probability of a statement is 1 the statement is true.
In considering some data as a string of bits, the prior probabilities of a 1 and a 0 are equal. Therefore, each extra bit halves the probability of a sequence of bits.
This leads to the conclusion that,
:P(x) = 2^{-L(x)}
where P(x) is the probability of the string of bits ''x'' and L(x) is its length.
The prior probability of any statement is calculated from the number of bits needed to state it. See also information theory.
Combining information
Two statements ''A'' and ''B'' may be represented by two separate encodings. Then the length of the encoding is,
:L(A ∧ B) = L(A) + L(B)
or in terms of probability,
:P(A ∧ B) = P(A) P(B)
But this law is not always true because there may be a shorter method of encoding ''B'' if we assume ''A''. So the above probability law applies only if ''A'' and ''B'' are "independent".
The internal language of information
The primary use of the information approach to probability is to provide estimates of the complexity of statements. Recall that Occam's razor states that "All things being equal, the simplest theory is the most likely to be correct". In order to apply this rule, first there needs to be a definition of what "simplest" means. Information theory defines simplest to mean having the shortest encoding.
Knowledge is represented as statements. Each statement is a Boolean expression. Expressions are encoded by a function that takes a description (as against the value) of the expression and encodes it as a bit string.
The length of the encoding of a statement gives an estimate of the probability of a statement. This probability estimate will often be used as the prior probability of a statement.
Technically this estimate is not a probability because it is not constructed from a frequency distribution. The probability estimates given by it do not always obey the law of total probability. Applying the law of total probability to various scenarios will usually give a more accurate estimate of the prior probability than the estimate from the length of the statement.
Encoding expressions
An expression is constructed from sub-expressions,
* Constants (including function identifiers).
* Applications of functions.
* Quantifiers.
A Huffman code must distinguish the 3 cases. The length of each code is based on the frequency of each type of sub-expression.
Initially constants are all assigned the same length/probability. Later constants may be assigned a probability using the Huffman code based on the number of uses of the function id in all expressions recorded so far. In using a Huffman code the goal is to estimate probabilities, not to compress the data.
The length of a function application is the length of the function identifier constant plus the sum of the sizes of the expressions for each parameter.
The length of a quantifier is the length of the expression being quantified over.
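A standard Huffman construction makes this concrete. In the sketch below the frequency counts for the three kinds of sub-expression are invented for illustration; the point is that the more frequent a kind is, the shorter its code, so the code length serves as an estimate of -log2 of its probability:

```python
import heapq

def huffman_code_lengths(freqs):
    """Build a Huffman code and return {symbol: code length in bits}."""
    # Heap entries: (weight, tie_breaker, {symbol: depth_so_far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols sink one level.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Assumed counts of each kind of sub-expression in the statements so far.
freqs = {"constant": 50, "application": 35, "quantifier": 15}
lengths = huffman_code_lengths(freqs)
print(lengths)
# The most frequent kind gets the shortest code.
assert lengths["constant"] <= lengths["application"] <= lengths["quantifier"]
```

As noted above, the goal of the Huffman code here is to estimate probabilities from code lengths, not to compress data.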
Distribution of numbers
No explicit representation of natural numbers is given. However natural numbers may be constructed by applying the successor function to 0, and then applying other arithmetic functions. A distribution of natural numbers is implied by this, based on the complexity of constructing each number.
Rational numbers are constructed by the division of natural numbers. The simplest representation has no common factors between the numerator and the denominator. This allows the probability distribution of natural numbers to be extended to rational numbers.
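The implied distribution can be sketched with a toy construction language. Assuming, purely for illustration, that each natural number is built from 0 by the operations successor and double, each costing one step, a breadth-first search gives a minimum construction cost, and hence an implied prior proportional to 2^{-cost}, under which round numbers come out simpler than their neighbours:

```python
from collections import deque

def construction_cost(limit):
    """Minimum number of {successor, double} steps to build each n from 0."""
    cost = {0: 0}
    queue = deque([0])
    while queue:
        n = queue.popleft()
        for m in (n + 1, n * 2):
            if m <= limit and m not in cost:
                cost[m] = cost[n] + 1
                queue.append(m)
    return cost

cost = construction_cost(100)

# Implied (unnormalised) prior: P(n) proportional to 2^-cost(n).
prior = {n: 2.0 ** -c for n, c in cost.items()}

# Powers of two are cheap to construct; numbers like 7 need more steps.
print(cost[8], cost[7])
assert cost[8] < cost[7]
assert prior[8] > prior[7]
```

Because every edge costs one step, breadth-first search finds the minimal construction, so the resulting distribution depends only on the assumed operation set.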
Probability and frequency
The probability of an event may be interpreted as the frequency of outcomes where the statement is true divided by the total number of outcomes. If the outcomes form a continuum the frequency may need to be replaced with a measure.
Events are sets of outcomes. Statements may be related to events. A Boolean statement ''B'' about outcomes defines a set of outcomes ''b'',
:b = {x : B(x)}
Conditional probability
Each probability is always associated with the state of knowledge at a particular point in the argument. Probabilities before an inference are known as prior probabilities, and probabilities after are known as posterior probabilities.
Probability depends on the facts known. The truth of a fact limits the domain of outcomes to the outcomes consistent with the fact. Prior probabilities are the probabilities before a fact is known. Posterior probabilities are after a fact is known. The posterior probabilities are said to be conditional on the fact. The probability that ''B'' is true given that ''A'' is true is written as,
:P(B|A)
All probabilities are in some sense conditional. The prior probability of ''A'' is,
:P(A) = P(A|true)
The frequentist approach applied to possible worlds
In the frequentist approach, probabilities are defined as the ratio of the number of outcomes within an event to the total number of outcomes. In the possible world model each possible world is an outcome, and statements about possible worlds define events. The probability of a statement being true is the number of possible worlds where the statement is true divided by the total number of possible worlds. The probability of a statement <math>A</math> being true about possible worlds, drawn from the set of all possible worlds <math>W</math>, is then,
:<math>P(A) = \frac{|\{w \in W : A(w)\}|}{|W|}</math>
For a conditional probability,
:<math>P(A \mid B) = \frac{|\{w \in W : A(w) \land B(w)\}|}{|\{w \in W : B(w)\}|}</math>
then
:<math>P(A \mid B) = \frac{P(A \land B)}{P(B)}</math>
Using symmetry this equation may be written out as Bayes' law.
:<math>P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}</math>
This law describes the relationship between prior and posterior probabilities when new facts are learnt.
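Bayes' law can be stated directly in code. This is a minimal sketch; the probability values are hypothetical, chosen only to show the prior-to-posterior update.

```python
# Bayes' law: P(A|B) = P(B|A) * P(A) / P(B).
# All probability values below are hypothetical illustration numbers.

def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Posterior P(A|B) from likelihood P(B|A), prior P(A) and evidence P(B)."""
    return p_b_given_a * p_a / p_b

# Learning B raises the probability of A from 0.3 to 0.6, because B is
# twice as likely when A holds (0.8) as it is overall (0.4).
posterior = bayes(p_b_given_a=0.8, p_a=0.3, p_b=0.4)
assert abs(posterior - 0.6) < 1e-12
```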
Written as quantities of information, where <math>I(X) = -\log_2 P(X)</math>, Bayes' Theorem becomes,
:<math>I(A \mid B) = I(B \mid A) + I(A) - I(B)</math>
Two statements A and B are said to be independent if knowing the truth of A does not change the probability of B. Mathematically this is,
:<math>P(B \mid A) = P(B)</math>
then Bayes' Theorem reduces to,
:<math>P(A \mid B) = P(A)</math>
The law of total probability
For a set of mutually exclusive possibilities <math>A_i</math>, the sum of the posterior probabilities must be 1.
:<math>\sum_i P(A_i \mid B) = 1</math>
Substituting using Bayes' theorem gives the law of total probability,
:<math>\sum_i \frac{P(B \mid A_i) \, P(A_i)}{P(B)} = 1</math>
:<math>P(B) = \sum_i P(B \mid A_i) \, P(A_i)</math>
This result is used to give the extended form of Bayes' theorem,
:<math>P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_j P(B \mid A_j) \, P(A_j)}</math>
This is the usual form of Bayes' theorem used in practice, because it guarantees the sum of all the posterior probabilities for the <math>A_i</math> is 1.
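The extended form can be sketched as a normalisation over mutually exclusive hypotheses. The priors and likelihoods below are hypothetical illustration values; the denominator is P(B) from the law of total probability.

```python
# Extended form of Bayes' theorem: posteriors over mutually exclusive
# hypotheses A_i, normalised so they sum to 1.

def extended_bayes(priors, likelihoods):
    """Posteriors P(A_i|B) from priors P(A_i) and likelihoods P(B|A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)            # law of total probability: P(B)
    return [j / evidence for j in joint]

posteriors = extended_bayes(priors=[0.5, 0.3, 0.2],
                            likelihoods=[0.1, 0.4, 0.8])
# The posteriors are guaranteed to sum to 1:
assert abs(sum(posteriors) - 1.0) < 1e-12
```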
Alternate possibilities
For mutually exclusive possibilities, the probabilities add.
:<math>P(A \lor B) = P(A) + P(B), \quad \text{if } P(A \land B) = 0</math>
Using
:<math>A \lor B = (A \land B) \lor (A \land \neg B) \lor (\neg A \land B)</math>
Then the alternatives
:<math>A \land B, \; A \land \neg B, \; \neg A \land B</math>
are all mutually exclusive. Also,
:<math>(A \land B) \lor (A \land \neg B) = A</math>
:<math>(A \land B) \lor (\neg A \land B) = B</math>
:<math>P(A) + P(B) = P(A \land B) + P(A \land \neg B) + P(A \land B) + P(\neg A \land B)</math>
so, putting it all together,
:<math>P(A \lor B) = P(A) + P(B) - P(A \land B)</math>
Negation
As,
:<math>A \lor \neg A = \mathrm{true}</math>
then
:<math>P(A) + P(\neg A) = 1</math>
Implication and conditional probability
Implication is related to conditional probability by the following equation,
:<math>A \to B \iff P(B \mid A) = 1</math>
Derivation,
:<math>A \to B \iff P(A \to B) = 1 \iff P(A \land B) = P(A) \iff P(B \mid A) = 1</math>
Bayesian hypothesis testing
Bayes' theorem may be used to estimate the probability of a hypothesis or theory H, given some facts F. The posterior probability of H is then
:<math>P(H \mid F) = \frac{P(F \mid H) \, P(H)}{P(F)}</math>
or in terms of information,
:<math>I(H \mid F) = I(F \mid H) + I(H) - I(F)</math>
By assuming the hypothesis is true, a simpler representation of the statement F may be given. The length of the encoding of this simpler representation is <math>I(F \mid H)</math>, the amount of information needed to represent the facts F if H is true. <math>I(F)</math> is the amount of information needed to represent F without the hypothesis H. The difference is how much the representation of the facts has been compressed by assuming that H is true. This is the evidence that the hypothesis H is true.
If <math>P(F)</math> is estimated from the encoding length of F, then the probability obtained will not be between 0 and 1. The value obtained is proportional to the probability, without being a good probability estimate. The number obtained is sometimes referred to as a relative probability, being how much more probable the theory is than not holding the theory.
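The relative probability computed from encoding lengths can be sketched as follows. All bit counts are hypothetical illustration values, and the result is only proportional to the posterior, as noted above.

```python
# Relative probability of a hypothesis H given facts F, computed from
# encoding lengths: 2**-(I(H) + I(F|H)). Unnormalised, so only useful
# for comparing hypotheses against each other.

def relative_probability(bits_hypothesis: float, bits_facts_given_h: float) -> float:
    """2**-(I(H) + I(F|H)), proportional to the posterior P(H|F)."""
    return 2.0 ** -(bits_hypothesis + bits_facts_given_h)

# A 20-bit theory that compresses the facts to 50 bits (70 bits total)
# beats a 5-bit theory that leaves the facts at 80 bits (85 bits total):
assert relative_probability(20, 50) > relative_probability(5, 80)
```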
If a full set of mutually exclusive hypotheses that provide evidence is known, a proper estimate may be given for the prior probability <math>P(F)</math>.
Set of hypotheses
Probabilities may be calculated from the extended form of Bayes' theorem. Given all mutually exclusive hypotheses <math>H_i</math> which give evidence, such that,
:<math>I(F \mid H_i) + I(H_i) < I(F)</math>
and also the hypothesis R, that none of the hypotheses is true, then,
:<math>P(H_i \mid F) = \frac{P(F \mid H_i) \, P(H_i)}{\sum_j P(F \mid H_j) \, P(H_j) + P(F \mid R) \, P(R)}</math>
In terms of information,
:<math>P(H_i \mid F) = \frac{2^{-(I(F \mid H_i) + I(H_i))}}{\sum_j 2^{-(I(F \mid H_j) + I(H_j))} + 2^{-(I(F \mid R) + I(R))}}</math>
In most situations it is a good approximation to assume that <math>F</math> is independent of <math>R</math>, which means <math>P(F \mid R) = P(F)</math>, giving,
:<math>P(H_i \mid F) = \frac{2^{-(I(F \mid H_i) + I(H_i))}}{\sum_j 2^{-(I(F \mid H_j) + I(H_j))} + 2^{-(I(F) + I(R))}}</math>
Boolean inductive inference
Abductive inference starts with a set of facts ''F'' which is a statement (Boolean expression). Abductive reasoning is of the form,
:''A theory T implies the statement F. As the theory T is simpler than F, abduction says that there is a probability that the theory T is implied by F''.
The theory ''T'', also called an explanation of the condition ''F'', is an answer to the ubiquitous factual "why" question. For example, the condition ''F'' is "apples fall", and the "why" question is "Why do apples fall?". The answer is a theory ''T'' that implies that apples fall;
:<math>T \to F</math>
Inductive inference is of the form,
:''All observed objects in a class C have a property P. Therefore there is a probability that all objects in a class C have a property P''.
In terms of abductive inference, ''all objects in a class C or set have a property P'' is a theory that implies the observed condition, ''All observed objects in a class C have a property P''.
So inductive inference is a special case of abductive inference. In common usage the term inductive inference is often used to refer to both abductive and inductive inference.
Generalization and specialization
Inductive inference is related to generalization. Generalizations may be formed from statements by replacing a specific value with membership of a category, or by replacing membership of a category with membership of a broader category. In deductive logic, generalization is a powerful method of generating new theories that may be true. In inductive inference generalization generates theories that have a probability of being true.
The opposite of generalization is specialization. Specialization is used in applying a general rule to a specific case. Specializations are created from generalizations by replacing membership of a category by a specific value, or by replacing a category with a subcategory.
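Generalization and specialization can be sketched as rewriting operations on a statement. The statement representation and the tiny taxonomy below are hypothetical, chosen only to illustrate the substitution.

```python
# Toy sketch: a statement is a list of words, and a taxonomy maps
# specific values to their categories. Generalizing replaces a value
# with its category; specializing performs the reverse substitution.

taxonomy = {"apple": "object", "earth": "object"}  # hypothetical categories

def generalize(statement, value):
    """Replace each occurrence of a specific value with its category."""
    return [taxonomy.get(word, word) if word == value else word
            for word in statement]

s = ["the", "apple", "falls", "towards", "the", "earth"]
g = generalize(generalize(s, "apple"), "earth")
assert g == ["the", "object", "falls", "towards", "the", "object"]
```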
The Linnaean classification of living things and objects forms the basis for generalization and specialization. The ability to identify, recognize and classify is the basis for generalization. Perceiving the world as a collection of objects appears to be a key aspect of human intelligence. It is the object oriented model, in the non computer science sense.
The object oriented model is constructed from our perception. In particular, vision is based on the ability to compare two images and calculate how much information is needed to morph or map one image into another. Computer vision uses this mapping to construct 3D images from stereo image pairs.
Inductive logic programming is a means of constructing a theory that implies a condition. Plotkin's "''relative least general generalization (rlgg)''" approach constructs the simplest generalization consistent with the condition.
Newton's use of induction
Isaac Newton used inductive arguments in constructing his law of universal gravitation.[Isaac Newton: "In experimental philosophy particular propositions are inferred from the phenomena and afterwards rendered general by induction": "Principia", Book 3, General Scholium, at p.392 in Volume 2 of Andrew Motte's English translation published 1729.] Starting with the statement,
* The center of an apple falls towards the center of the earth.
Generalizing by replacing apple with object, and earth with another object gives, in a two body system,
* The center of an object falls towards the center of another object.
The theory explains all objects falling, so there is strong evidence for it. The second observation is,
* The planets appear to follow an elliptical path.
After some complicated mathematical calculus, it can be seen that if the acceleration follows the inverse square law then objects will follow an ellipse. So induction gives evidence for the inverse square law.
Using Galileo's observation that all objects drop with the same speed,
:<math>F_1 = m_1 \, g_1</math>
:<math>F_2 = m_2 \, g_2</math>
where <math>g_1</math> and <math>g_2</math> are vectors towards the center of the other object. Then using Newton's third law,
:<math>F_1 = -F_2</math>
Probabilities for inductive inference
Implication determines conditional probability as,
:<math>T \to F \iff P(F \mid T) = 1</math>
So,
:<math>P(F \mid T) = 1</math>
:<math>I(F \mid T) = 0</math>
This result may be used in the probabilities given for Bayesian hypothesis testing. For a single theory, H = T and,
:<math>P(T \mid F) = \frac{P(T)}{P(F)}</math>
or in terms of information, the relative probability is,
:<math>P(T \mid F) = 2^{-(I(T) - I(F))}</math>
Note that this estimate for <math>P(T \mid F)</math> is not a true probability. If <math>I(T) < I(F)</math> then the theory has evidence to support it. Then for a set of theories <math>T_i</math>, such that <math>T_i \to F</math>,
:<math>P(F \mid T_i) = 1</math>
:<math>I(F \mid T_i) = 0</math>
giving,
:<math>P(T_i \mid F) = \frac{2^{-I(T_i)}}{\sum_j 2^{-I(T_j)} + 2^{-(I(F) + I(R))}}</math>
:<math>P(R \mid F) = \frac{2^{-(I(F) + I(R))}}{\sum_j 2^{-I(T_j)} + 2^{-(I(F) + I(R))}}</math>
Derivations
Derivation of inductive probability
Make a list of all the shortest programs <math>K_i</math> that each produce a distinct infinite string of bits, and satisfy the relation,
:<math>T_n(R(K_i)) = x</math>
where <math>R(K_i)</math> is the result of running the program <math>K_i</math>, and <math>T_n</math> truncates the string after ''n'' bits.
The problem is to calculate the probability that the source is produced by program <math>K_i</math>, given that the truncated source after ''n'' bits is ''x''. This is represented by the conditional probability,
:<math>P(s = R(K_i) \mid T_n(s) = x)</math>
Using the extended form of Bayes' theorem,
:<math>P(s = R(K_i) \mid T_n(s) = x) = \frac{P(T_n(s) = x \mid s = R(K_i)) \, P(s = R(K_i))}{\sum_j P(T_n(s) = x \mid s = R(K_j)) \, P(s = R(K_j))}</math>
The extended form relies on the law of total probability. This means that the <math>s = R(K_i)</math> must be distinct possibilities, which is given by the condition that each <math>K_i</math> produces a different infinite string. Also one of the conditions <math>s = R(K_i)</math> must be true. This must be true, as in the limit as <math>n \to \infty</math>, there is always at least one program that produces <math>T_n(s)</math>.
As the <math>K_i</math> are chosen so that <math>T_n(R(K_i)) = x</math>, then,
:<math>P(T_n(s) = x \mid s = R(K_i)) = 1</math>
The a priori probability of the string being produced from the program, given no information about the string, is based on the size of the program,
:<math>P(s = R(K_i)) = 2^{-I(K_i)}</math>
giving,
:<math>P(s = R(K_i) \mid T_n(s) = x) = \frac{2^{-I(K_i)}}{\sum_j 2^{-I(K_j)}}</math>
Programs that are the same length as or longer than the length of ''x'' provide no predictive power. Separate them out giving,
:<math>P(s = R(K_i) \mid T_n(s) = x) = \frac{2^{-I(K_i)}}{\sum_{j : I(K_j) < n} 2^{-I(K_j)} + \sum_{j : I(K_j) \ge n} 2^{-I(K_j)}}</math>
Then identify the two probabilities as,
:<math>P(x \text{ has a pattern}) = \sum_{j : I(K_j) < n} 2^{-I(K_j)}</math>
:<math>P(x \text{ is random}) = \sum_{j : I(K_j) \ge n} 2^{-I(K_j)}</math>
But the prior probability that ''x'' is a random set of bits is <math>2^{-n}</math>. So,
:<math>P(s = R(K_i) \mid T_n(s) = x) = \frac{2^{-I(K_i)}}{2^{-n} + \sum_{j : I(K_j) < n} 2^{-I(K_j)}}</math>
The probability that the source is random, or unpredictable, is,
:<math>P(\operatorname{random}(s) \mid T_n(s) = x) = \frac{2^{-n}}{2^{-n} + \sum_{j : I(K_j) < n} 2^{-I(K_j)}}</math>
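These last two formulas can be sketched numerically. The set of "programs" below is hypothetical: each program is described only by its length I(K_i) in bits, and each is assumed to reproduce the observed n-bit prefix. Shorter programs dominate the posterior, while 2^-n is the weight of the "random" alternative.

```python
# Toy sketch of inductive probability over a hypothetical fixed set of
# programs. program_bits lists each I(K_i); programs of length >= n have
# no predictive power and are dropped, while the "random" case keeps
# the prior weight 2**-n.

def inductive_posteriors(program_bits, n):
    """Posterior over short programs (and 'random') given an n-bit prefix."""
    short = [b for b in program_bits if b < n]   # programs with predictive power
    denom = 2.0 ** -n + sum(2.0 ** -b for b in short)
    posterior = {b: (2.0 ** -b) / denom for b in short}
    posterior["random"] = (2.0 ** -n) / denom
    return posterior

# A 3-bit program explaining a 10-bit observation dominates; the chance
# that the observation is random is small but non-zero.
post = inductive_posteriors([3, 5, 12], n=10)
assert post[3] > post[5] > post["random"]
```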
A model for inductive inference
A model of how worlds are constructed is used in determining the probabilities of theories,
* A random bit string is selected.
* A condition is constructed from the bit string.
* A world is constructed that is consistent with the condition.
If ''w'' is the bit string then the world is created such that <math>R(w)</math> is true. An intelligent agent has some facts about the world, represented by the bit string ''c'', which gives the condition,
:<math>C = R(c)</math>
The set of bit strings identical with any condition ''x'' is <math>E(x)</math>.
:<math>E(x) = \{w : R(w) = R(x)\}</math>
A theory is a simpler condition that explains (or implies) ''C''. The set of all such theories is called ''T'',
:<math>T(C) = \{t : R(t) \to C\}</math>
Applying Bayes' theorem
The extended form of Bayes' theorem may be applied,
:<math>P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_j P(B \mid A_j) \, P(A_j)}</math>
where,
:<math>B = E(c)</math>
:<math>A_i = E(t_i)</math>
To apply Bayes' theorem the following must hold: <math>A_i</math> is a partition of the event space.
For <math>A_i</math> to be a partition, no bit string ''n'' may belong to two theories. To prove this assume they can and derive a contradiction,
:<math>n \in E(t_1) \land n \in E(t_2) \land t_1 \ne t_2</math>
:<math>R(t_1) = R(n) = R(t_2)</math>
:<math>E(t_1) = E(t_2)</math>
Secondly prove that ''T'' includes all outcomes consistent with the condition. As all theories consistent with ''C'' are included then <math>c</math> must be in this set.
So Bayes' theorem may be applied as specified giving,
:<math>P(E(t_i) \mid E(c)) = \frac{P(E(c) \mid E(t_i)) \, P(E(t_i))}{\sum_j P(E(c) \mid E(t_j)) \, P(E(t_j))}</math>
Using the implication and conditional probability law, the definition of <math>T(C)</math> implies,
:<math>P(E(c) \mid E(t_i)) = 1</math>
The probability of each theory in ''T'' is given by,
:<math>P(E(t_i)) = 2^{-I(t_i)}</math>
so,
:<math>P(E(t_i) \mid E(c)) = \frac{2^{-I(t_i)}}{\sum_j 2^{-I(t_j)}}</math>
Finally the probabilities of the events may be identified with the probabilities of the condition which the outcomes in the event satisfy,
:<math>P(E(t_i)) = P(t_i)</math>
giving
:<math>P(t_i \mid C) = \frac{2^{-I(t_i)}}{\sum_j 2^{-I(t_j)}}</math>
This is the probability of the theory ''t'' after observing that the condition ''C'' holds.
Removing theories without predictive power
Theories that are less probable than the condition ''C'' have no predictive power. Separate them out giving,
:<math>P(t_i \mid C) = \frac{2^{-I(t_i)}}{\sum_{j : I(t_j) < I(c)} 2^{-I(t_j)} + \sum_{j : I(t_j) \ge I(c)} 2^{-I(t_j)}}</math>
The probability of the theories without predictive power on ''C'' is the same as the probability of ''C''. So,
:<math>\sum_{j : I(t_j) \ge I(c)} 2^{-I(t_j)} = P(C)</math>
So the probability
:<math>P(t_i \mid C) = \frac{2^{-I(t_i)}}{P(C) + \sum_{j : I(t_j) < I(c)} 2^{-I(t_j)}}</math>
and the probability of no prediction for C, written as <math>\operatorname{random}(C)</math>,
:<math>P(\operatorname{random}(C) \mid C) = \frac{P(C)}{P(C) + \sum_{j : I(t_j) < I(c)} 2^{-I(t_j)}}</math>
The probability of a condition was given as,
:<math>P(C) = 2^{-I(c)}</math>
Bit strings for theories that are more complex than the bit string given to the agent as input have no predictive power. Their probabilities are better included in the ''random'' case. To implement this a new definition is given as ''F'' in,
:<math>F = P(C) + \sum_{j : I(t_j) \ge I(c)} 2^{-I(t_j)}</math>
Using ''F'', an improved version of the abductive probabilities is,
:<math>P(t_i \mid C) = \frac{2^{-I(t_i)}}{F + \sum_{j : I(t_j) < I(c)} 2^{-I(t_j)}}</math>
:<math>P(\operatorname{random}(C) \mid C) = \frac{F}{F + \sum_{j : I(t_j) < I(c)} 2^{-I(t_j)}}</math>
Key people
* William of Ockham
* Thomas Bayes
* Ray Solomonoff
* Andrey Kolmogorov
* Chris Wallace
* D. M. Boulton
* Jorma Rissanen
* Marcus Hutter
See also
* Abductive reasoning
* Algorithmic probability
* Algorithmic information theory
* Bayesian inference
* Information theory
* Inductive inference
* Inductive logic programming
* Inductive reasoning
* Learning
* Minimum message length
* Minimum description length
* Occam's razor
* Solomonoff's theory of inductive inference
* Universal artificial intelligence
References
External links
* Rathmanner, S and Hutter, M., "A Philosophical Treatise of Universal Induction" in Entropy 2011, 13, 1076–1136: A very clear philosophical and mathematical analysis of Solomonoff's Theory of Inductive Inference.
* C.S. Wallace, ''Statistical and Inductive Inference by Minimum Message Length'', Springer-Verlag (Information Science and Statistics), {{ISBN|0-387-23795-X}}, May 2005 (chapter headings, table of contents and sample pages available online).
Philosophy of statistics
Inductive reasoning
Inference
Machine learning
Probability theory