Bayesian programming is a formalism and a methodology for specifying probabilistic models and solving problems when less than the necessary information is available.
Edwin T. Jaynes proposed that probability could be considered as an alternative to and an extension of logic for rational reasoning with incomplete and uncertain information. In his founding book ''Probability Theory: The Logic of Science'' he developed this theory and proposed what he called "the robot", which was not a physical device but an inference engine to automate probabilistic reasoning, a kind of Prolog for probability instead of logic. Bayesian programming is a formal and concrete implementation of this "robot".
Bayesian programming may also be seen as an algebraic formalism to specify graphical models such as, for instance, Bayesian networks, dynamic Bayesian networks, Kalman filters or hidden Markov models. Indeed, Bayesian programming is more general than Bayesian networks and has a power of expression equivalent to probabilistic factor graphs.
Formalism
A Bayesian program is a means of specifying a family of probability distributions.
The constituent elements of a Bayesian program are presented below:
# A program is constructed from a description and a question.
# A description is constructed using some specification (π) as given by the programmer and an identification or learning process for the parameters not completely specified by the specification, using a data set (δ).
# A specification is constructed from a set of pertinent variables, a decomposition and a set of forms.
# Forms are either parametric forms or questions to other Bayesian programs.
# A question specifies which probability distribution has to be computed.
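The elements listed above can be mirrored directly in ordinary code. The following minimal Python sketch simply records them in data structures; the class and field names (BayesianProgram, Specification and so on) are illustrative, not part of any standard library:
<syntaxhighlight lang="python">
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Specification:
    """The specification (pi), given by the programmer."""
    variables: List[str]                          # pertinent variables
    decomposition: List[Tuple[str, List[str]]]    # terms P(L_k | R_k) stored as (L_k, R_k)
    forms: Dict[str, Callable]                    # parametric form (or sub-program) for each term

@dataclass
class BayesianProgram:
    """A program = a description (specification + identification) and a question."""
    specification: Specification                  # pi
    data: list = field(default_factory=list)      # delta, used to identify free parameters
    question: Tuple[tuple, tuple] = ((), ())      # (searched variables, known variables)
</syntaxhighlight>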
Description
The purpose of a description is to specify an effective method of computing a joint probability distribution on a set of variables <math>\{X_1, X_2, \ldots, X_N\}</math> given a set of experimental data δ and some specification π. This joint distribution is denoted as:
: <math>P(X_1 \wedge X_2 \wedge \cdots \wedge X_N \mid \delta \wedge \pi)</math>
To specify preliminary knowledge π, the programmer must undertake the following:
# Define the set of relevant variables <math>\{X_1, X_2, \ldots, X_N\}</math> on which the joint distribution is defined.
# Decompose the joint distribution (break it into relevant independent or conditional probabilities).
# Define the forms of each of the distributions (e.g., for each variable, one of the list of probability distributions).
Decomposition
Given a partition of <math>\{X_1, \ldots, X_N\}</math> containing K subsets, K variables <math>L_1, \ldots, L_K</math> are defined, each corresponding to one of these subsets. Each variable <math>L_k</math> is obtained as the conjunction of the variables belonging to the k-th subset. Recursive application of Bayes' theorem leads to:
: <math>P(X_1 \wedge X_2 \wedge \cdots \wedge X_N \mid \delta \wedge \pi) = P(L_1 \mid \delta \wedge \pi) \times P(L_2 \mid L_1 \wedge \delta \wedge \pi) \times \cdots \times P(L_K \mid L_{K-1} \wedge \cdots \wedge L_1 \wedge \delta \wedge \pi)</math>
Conditional independence hypotheses then allow further simplifications. A conditional independence hypothesis for variable <math>L_k</math> is defined by choosing some variables <math>X_n</math> among the variables appearing in the conjunction <math>L_{k-1} \wedge \cdots \wedge L_1</math>, labelling <math>R_k</math> as the conjunction of these chosen variables and setting:
: <math>P(L_k \mid L_{k-1} \wedge \cdots \wedge L_1 \wedge \delta \wedge \pi) = P(L_k \mid R_k \wedge \delta \wedge \pi)</math>
We then obtain:
: <math>P(X_1 \wedge X_2 \wedge \cdots \wedge X_N \mid \delta \wedge \pi) = \prod_{k=1}^{K} P(L_k \mid R_k \wedge \delta \wedge \pi)</math>
Such a simplification of the joint distribution as a product of simpler distributions is called a decomposition, derived using the chain rule.
This ensures that each variable appears at most once on the left of a conditioning bar, which is the necessary and sufficient condition to write mathematically valid decompositions.
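For instance, with three variables <math>X_1</math>, <math>X_2</math> and <math>X_3</math>, choosing the trivial partition <math>L_1 = X_1</math>, <math>L_2 = X_2</math>, <math>L_3 = X_3</math> and assuming that <math>X_3</math> is conditionally independent of <math>X_1</math> given <math>X_2</math> yields the decomposition
: <math>P(X_1 \wedge X_2 \wedge X_3 \mid \delta \wedge \pi) = P(X_1 \mid \delta \wedge \pi) \times P(X_2 \mid X_1 \wedge \delta \wedge \pi) \times P(X_3 \mid X_2 \wedge \delta \wedge \pi)</math>
in which each variable appears exactly once on the left of a conditioning bar.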
Forms
Each distribution <math>P(L_k \mid R_k \wedge \delta \wedge \pi)</math> appearing in the product is then associated with either a parametric form (i.e., a function <math>f_{\mu}(L_k)</math>) or a question to another Bayesian program.
When it is a form <math>f_{\mu}(L_k)</math>, in general, <math>\mu</math> is a vector of parameters that may depend on <math>R_k</math> or δ or both. Learning takes place when some of these parameters are computed using the data set δ.
An important feature of Bayesian programming is this capacity to use questions to other Bayesian programs as components of the definition of a new Bayesian program. <math>P(L_k \mid R_k \wedge \delta \wedge \pi)</math> is obtained by some inferences done by another Bayesian program defined by its own specifications and data. This is similar to calling a subroutine in classical programming and provides an easy way to build hierarchical models.
Question
Given a description (i.e., <math>P(X_1 \wedge \cdots \wedge X_N \mid \delta \wedge \pi)</math>), a question is obtained by partitioning <math>\{X_1, \ldots, X_N\}</math> into three sets: the searched variables, the known variables and the free variables.
The three variables Searched, Known and Free are defined as the conjunction of the variables belonging to these sets.
A question is defined as the set of distributions:
: <math>P(\text{Searched} \mid \text{Known})</math>
made of as many "instantiated questions" as the cardinal of Known, each instantiated question being the distribution:
: <math>P(\text{Searched} \mid \text{Known} = \text{known})</math>
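For instance, in the spam filter described below, the searched variable is Spam, the known variables are the words <math>W_0, \ldots, W_{N-1}</math> observed in an e-mail, and the set of free variables is empty, so the question is <math>P(\text{Spam} \mid W_0 \wedge \cdots \wedge W_{N-1})</math>.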
Inference
Given the joint distribution <math>P(X_1 \wedge \cdots \wedge X_N \mid \delta \wedge \pi)</math>, it is always possible to compute any possible question using the following general inference:
: <math>P(\text{Searched} \mid \text{Known}) = \sum_{\text{Free}} P(\text{Searched} \wedge \text{Free} \mid \text{Known}) = \frac{\sum_{\text{Free}} P(\text{Searched} \wedge \text{Free} \wedge \text{Known})}{P(\text{Known})} = \frac{1}{Z} \sum_{\text{Free}} P(\text{Searched} \wedge \text{Free} \wedge \text{Known})</math>
where the first equality results from the marginalization rule, the second results from Bayes' theorem and the third corresponds to a second application of marginalization. The denominator appears to be a normalization term and can be replaced by a constant Z.
Theoretically, this allows any Bayesian inference problem to be solved. In practice, however, the cost of computing <math>P(\text{Searched} \mid \text{Known})</math> exhaustively and exactly is too great in almost all cases.
Replacing the joint distribution by its decomposition we get:
: <math>P(\text{Searched} \mid \text{Known}) = \frac{1}{Z} \sum_{\text{Free}} \prod_{k=1}^{K} P(L_k \mid R_k \wedge \delta \wedge \pi)</math>
which is usually a much simpler expression to compute, as the dimensionality of the problem is considerably reduced by the decomposition into a product of lower-dimension distributions.
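As a minimal illustration of this computation, the following Python sketch answers a question by brute-force enumeration over the free variables for a toy decomposition <math>P(A \wedge B \wedge C) = P(A)\,P(B \mid A)\,P(C \mid B)</math>; the variable names and probability tables are invented for the example:
<syntaxhighlight lang="python">
# Toy decomposition P(A, B, C) = P(A) P(B|A) P(C|B) over binary variables.
# The tables below are invented for illustration only.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}    # key: (b, a)
p_c_given_b = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.25, (1, 1): 0.75}  # key: (c, b)

def joint(a, b, c):
    return p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]

def infer_a_given_c(c):
    """P(A | C=c): A is searched, C is known, B is free (summed out)."""
    unnormalized = {a: sum(joint(a, b, c) for b in (0, 1)) for a in (0, 1)}
    z = sum(unnormalized.values())      # normalization constant Z
    return {a: v / z for a, v in unnormalized.items()}

print(infer_a_given_c(1))
</syntaxhighlight>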
Example
Bayesian spam detection
The purpose of Bayesian spam filtering is to eliminate junk e-mails.
The problem is very easy to formulate. E-mails should be classified into one of two categories: non-spam or spam. The only available information to classify the e-mails is their content: a set of words. Using these words without taking the order into account is commonly called a bag of words model.
The classifier should furthermore be able to adapt to its user and to learn from experience. Starting from an initial standard setting, the classifier should modify its internal parameters when the user disagrees with its decisions. It will hence adapt to the user's criteria to differentiate between non-spam and spam. It will improve its results as it encounters more and more classified e-mails.
Variables
The variables necessary to write this program are as follows:
# Spam: a binary variable, false if the e-mail is not spam and true otherwise.
# <math>W_0, W_1, \ldots, W_{N-1}</math>: N binary variables. <math>W_n</math> is true if the n-th word of the dictionary is present in the text.
These N binary variables sum up all the information about an e-mail.
Decomposition
Starting from the joint distribution and applying Bayes' theorem recursively, we obtain:
: <math>P(\text{Spam} \wedge W_0 \wedge \cdots \wedge W_{N-1}) = P(\text{Spam}) \times P(W_0 \mid \text{Spam}) \times P(W_1 \mid W_0 \wedge \text{Spam}) \times \cdots \times P(W_{N-1} \mid W_{N-2} \wedge \cdots \wedge W_0 \wedge \text{Spam})</math>
This is an exact mathematical expression.
It can be drastically simplified by assuming that the probability of appearance of a word knowing the nature of the text (spam or not) is independent of the appearance of the other words. This is the naive Bayes assumption and it makes this spam filter a naive Bayes model.
For instance, the programmer can assume that:
: <math>P(W_n \mid W_{n-1} \wedge \cdots \wedge W_0 \wedge \text{Spam}) = P(W_n \mid \text{Spam})</math>
to finally obtain:
: <math>P(\text{Spam} \wedge W_0 \wedge \cdots \wedge W_{N-1}) = P(\text{Spam}) \prod_{n=0}^{N-1} P(W_n \mid \text{Spam})</math>
This kind of assumption is known as the
naive Bayes' assumption. It is "naive" in the sense that the independence between words is clearly not completely true. For instance, it completely neglects that the appearance of pairs of words may be more significant than isolated appearances. However, the programmer may assume this hypothesis and may develop the model and the associated inferences to test how reliable and efficient it is.
Parametric forms
To be able to compute the joint distribution, the programmer must now specify the
distributions appearing in the decomposition:
# <math>P(\text{Spam})</math> is a prior, defined for instance by assigning a fixed prior probability to each of its two values.
# Each of the N forms <math>P(W_n \mid \text{Spam})</math> may be specified using the Laplace rule of succession (this is a pseudocount-based smoothing technique to counter the zero-frequency problem of words never seen before):
## <math>P([W_n = \text{true}] \mid [\text{Spam} = \text{false}]) = \frac{1 + a_f^n}{2 + a_f}</math>
## <math>P([W_n = \text{true}] \mid [\text{Spam} = \text{true}]) = \frac{1 + a_t^n}{2 + a_t}</math>
where <math>a_f^n</math> stands for the number of appearances of the n-th word in non-spam e-mails and <math>a_f</math> stands for the total number of non-spam e-mails. Similarly, <math>a_t^n</math> stands for the number of appearances of the n-th word in spam e-mails and <math>a_t</math> stands for the total number of spam e-mails.
Identification
The N forms <math>P(W_n \mid \text{Spam})</math> are not yet completely specified because the parameters <math>a_f^n</math>, <math>a_t^n</math>, <math>a_f</math> and <math>a_t</math> have no values yet.
The identification of these parameters could be done either by batch processing a series of classified e-mails or by an incremental updating of the parameters using the user's classifications of the e-mails as they arrive.
Both methods could be combined: the system could start with initial standard values of these parameters issued from a generic database, and incremental learning then customizes the classifier to each individual user.
Question
The question asked to the program is: "what is the probability for a given text to be spam knowing which words appear and don't appear in this text?"
It can be formalized by:
: <math>P(\text{Spam} \mid w_0 \wedge \cdots \wedge w_{N-1})</math>
which can be computed as follows:
: <math>P(\text{Spam} \mid w_0 \wedge \cdots \wedge w_{N-1}) = \frac{P(\text{Spam}) \prod_{n=0}^{N-1} P(w_n \mid \text{Spam})}{\sum_{\text{Spam}} \left[ P(\text{Spam}) \prod_{n=0}^{N-1} P(w_n \mid \text{Spam}) \right]}</math>
The denominator appears to be a normalization constant. It is not necessary to compute it to decide if we are dealing with spam. For instance, an easy trick is to compute the ratio:
: <math>\frac{P([\text{Spam}=\text{true}] \mid w_0 \wedge \cdots \wedge w_{N-1})}{P([\text{Spam}=\text{false}] \mid w_0 \wedge \cdots \wedge w_{N-1})} = \frac{P([\text{Spam}=\text{true}])}{P([\text{Spam}=\text{false}])} \times \prod_{n=0}^{N-1} \frac{P(w_n \mid [\text{Spam}=\text{true}])}{P(w_n \mid [\text{Spam}=\text{false}])}</math>
This computation is faster and easier because it requires only 2N products.
Bayesian program
The Bayesian spam filter program is completely defined by the variables Spam and <math>W_0, \ldots, W_{N-1}</math>, the decomposition <math>P(\text{Spam} \wedge W_0 \wedge \cdots \wedge W_{N-1}) = P(\text{Spam}) \prod_{n=0}^{N-1} P(W_n \mid \text{Spam})</math>, the parametric forms given above, the identification of their parameters from classified e-mails, and the question <math>P(\text{Spam} \mid w_0 \wedge \cdots \wedge w_{N-1})</math>.
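A minimal Python sketch of this spam filter is given below. It implements the decomposition, the Laplace rule of succession and the incremental identification described above; the dictionary, the prior and the training e-mails are illustrative only:
<syntaxhighlight lang="python">
# Minimal naive Bayes spam filter following the decomposition above.
# Dictionary, prior and training data are illustrative only.
dictionary = ["money", "meeting", "free", "project"]

n_spam, n_ham = 0, 0                       # a_t and a_f: numbers of spam / non-spam e-mails
spam_counts = {w: 0 for w in dictionary}   # a_t^n: appearances of word n in spam e-mails
ham_counts = {w: 0 for w in dictionary}    # a_f^n: appearances of word n in non-spam e-mails

def learn(words, is_spam):
    """Incremental identification: update the counts with one classified e-mail."""
    global n_spam, n_ham
    if is_spam:
        n_spam += 1
    else:
        n_ham += 1
    counts = spam_counts if is_spam else ham_counts
    for w in set(words) & set(dictionary):
        counts[w] += 1

def p_word(w, present, is_spam):
    """P(W_n | Spam) with the Laplace rule of succession: (1 + count) / (2 + total)."""
    counts, total = (spam_counts, n_spam) if is_spam else (ham_counts, n_ham)
    p_present = (1 + counts[w]) / (2 + total)
    return p_present if present else 1 - p_present

def p_spam(words, prior_spam=0.5):
    """P([Spam = true] | w_0 ... w_{N-1}) for an e-mail given as a set of words."""
    present = set(words)
    score = {True: prior_spam, False: 1 - prior_spam}
    for cls in (True, False):
        for w in dictionary:
            score[cls] *= p_word(w, w in present, cls)
    return score[True] / (score[True] + score[False])

learn(["free", "money"], True)
learn(["meeting", "project"], False)
print(p_spam(["free", "money", "project"]))
</syntaxhighlight>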
Bayesian filter, Kalman filter and hidden Markov model
Bayesian filters (often called recursive Bayesian estimation) are generic probabilistic models for time-evolving processes. Numerous models are particular instances of this generic approach, for instance: the Kalman filter or the hidden Markov model (HMM).
Variables
* Variables <math>X^0, \ldots, X^T</math> are a time series of state variables considered to be on a time horizon ranging from 0 to T.
* Variables <math>O^0, \ldots, O^T</math> are a time series of observation variables on the same horizon.
Decomposition
The decomposition is based:
* on <math>P(X^t \mid X^{t-1})</math>, called the system model, transition model or dynamic model, which formalizes the transition from the state at time t-1 to the state at time t;
* on <math>P(O^t \mid X^t)</math>, called the observation model, which expresses what can be observed at time t when the system is in state <math>X^t</math>;
* on an initial state at time 0: <math>P(X^0 \wedge O^0)</math>.
Parametrical forms
The parametrical forms are not constrained and different choices lead to different well-known models: see Kalman filters and Hidden Markov models just below.
Question
The typical question for such models is <math>P(X^{t+k} \mid O^0 \wedge \cdots \wedge O^t)</math>: what is the probability distribution for the state at time t + k knowing the observations from instant 0 to t?
The most common case is Bayesian filtering where k = 0, which searches for the present state, knowing past observations.
However, it is also possible to do prediction, with k > 0, to extrapolate a future state from past observations, or to do smoothing, with k < 0, to recover a past state from observations made either before or after that instant.
More complicated questions may also be asked as shown below in the HMM section.
Bayesian filters have a very interesting recursive property, which contributes greatly to their attractiveness. <math>P(X^t \mid O^0 \wedge \cdots \wedge O^t)</math> may be computed simply from <math>P(X^{t-1} \mid O^0 \wedge \cdots \wedge O^{t-1})</math> with the following formula:
: <math>P(X^t \mid O^0 \wedge \cdots \wedge O^t) \propto P(O^t \mid X^t) \sum_{X^{t-1}} \left[ P(X^t \mid X^{t-1})\, P(X^{t-1} \mid O^0 \wedge \cdots \wedge O^{t-1}) \right]</math>
Another interesting point of view for this equation is to consider that there are two phases: a prediction phase and an estimation phase:
* During the prediction phase, the state is predicted using the dynamic model and the estimation of the state at the previous moment:
:: <math>P(X^t \mid O^0 \wedge \cdots \wedge O^{t-1}) = \sum_{X^{t-1}} \left[ P(X^t \mid X^{t-1})\, P(X^{t-1} \mid O^0 \wedge \cdots \wedge O^{t-1}) \right]</math>
* During the estimation phase, the prediction is either confirmed or invalidated using the last observation:
:: <math>P(X^t \mid O^0 \wedge \cdots \wedge O^t) \propto P(O^t \mid X^t)\, P(X^t \mid O^0 \wedge \cdots \wedge O^{t-1})</math>
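A minimal Python sketch of these two phases for a discrete state space is given below; the states, the transition table and the observation table are illustrative only:
<syntaxhighlight lang="python">
# Discrete Bayesian filter: one prediction step and one estimation step.
# States and probability tables are illustrative only.
states = ["rain", "dry"]
transition = {("rain", "rain"): 0.7, ("dry", "rain"): 0.3,   # P(X^t | X^{t-1}), key: (x_t, x_prev)
              ("rain", "dry"): 0.2, ("dry", "dry"): 0.8}
observation = {("umbrella", "rain"): 0.9, ("umbrella", "dry"): 0.2}  # P(O^t | X^t), key: (o, x)

def predict(belief):
    """Prediction phase: P(X^t | O^0..O^{t-1}) from the previous belief and the dynamic model."""
    return {x: sum(transition[(x, xp)] * belief[xp] for xp in states) for x in states}

def estimate(predicted, obs_likelihood):
    """Estimation phase: multiply by P(O^t | X^t) and renormalize."""
    unnormalized = {x: obs_likelihood[x] * predicted[x] for x in states}
    z = sum(unnormalized.values())
    return {x: v / z for x, v in unnormalized.items()}

belief = {"rain": 0.5, "dry": 0.5}   # P(X^0)
belief = estimate(predict(belief), {x: observation[("umbrella", x)] for x in states})
print(belief)
</syntaxhighlight>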
Bayesian program
The generic Bayesian filter is thus defined by the joint decomposition
: <math>P(X^0 \wedge \cdots \wedge X^T \wedge O^0 \wedge \cdots \wedge O^T) = P(X^0 \wedge O^0) \prod_{t=1}^{T} \left[ P(X^t \mid X^{t-1})\, P(O^t \mid X^t) \right]</math>
with unconstrained parametric forms and the question <math>P(X^{t+k} \mid O^0 \wedge \cdots \wedge O^t)</math>.
Kalman filter
The very well-known Kalman filters are a special case of Bayesian filters.
They are defined by the following Bayesian program:
* Variables are continuous.
* The transition model <math>P(X^t \mid X^{t-1})</math> and the observation model <math>P(O^t \mid X^t)</math> are both specified using Gaussian laws with means that are linear functions of the conditioning variables.
With these hypotheses and by using the recursive formula, it is possible to solve the inference problem analytically to answer the usual <math>P(X^t \mid O^0 \wedge \cdots \wedge O^t)</math> question. This leads to an extremely efficient algorithm, which explains the popularity of Kalman filters and the number of their everyday applications.
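For instance, in one dimension, with a transition model <math>x_t = a\,x_{t-1} + \text{noise}</math> and an observation model <math>o_t = h\,x_t + \text{noise}</math>, the prediction and estimation phases reduce to a few arithmetic operations, as in this Python sketch (all numerical values are illustrative):
<syntaxhighlight lang="python">
# One-dimensional Kalman filter: Gaussian beliefs, linear transition and observation models.
# Parameters a, h, q, r and the measurements are illustrative only.
a, q = 1.0, 0.1       # transition model: x_t = a * x_{t-1} + noise of variance q
h, r = 1.0, 0.5       # observation model: o_t = h * x_t + noise of variance r

mean, var = 0.0, 1.0  # Gaussian belief P(X^0)

for o in [0.9, 1.1, 1.0]:
    # Prediction phase
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Estimation phase (Kalman gain)
    k = var_pred * h / (h * h * var_pred + r)
    mean = mean_pred + k * (o - h * mean_pred)
    var = (1 - k * h) * var_pred
    print(mean, var)
</syntaxhighlight>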
When there are no obvious linear transition and observation models, it is still often possible, using a first-order Taylor expansion, to treat these models as locally linear. This generalization is commonly called the extended Kalman filter.
Hidden Markov model
Hidden Markov models (HMMs) are another very popular specialization of Bayesian filters.
They are defined by the following Bayesian program:
* Variables are treated as being discrete.
* The transition model <math>P(X^t \mid X^{t-1})</math> and the observation model <math>P(O^t \mid X^t)</math> are both specified using probability matrices.
* The question most frequently asked of HMMs is:
:: What is the most probable series of states that leads to the present state, knowing the past observations?
This particular question may be answered with a specific and very efficient algorithm called the Viterbi algorithm.
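A compact Python sketch of the Viterbi algorithm for such a discrete HMM is shown below; the states, probability matrices and observation sequence are illustrative only:
<syntaxhighlight lang="python">
# Viterbi algorithm: most probable state sequence for a discrete HMM.
# States, matrices and the observation sequence are illustrative only.
states = ["healthy", "fever"]
start = {"healthy": 0.6, "fever": 0.4}                           # P(X^0)
trans = {"healthy": {"healthy": 0.7, "fever": 0.3},              # P(X^t | X^{t-1})
         "fever": {"healthy": 0.4, "fever": 0.6}}
emit = {"healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},   # P(O^t | X^t)
        "fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

def viterbi(observations):
    # best[x] = (probability of the best path ending in state x, that path)
    best = {x: (start[x] * emit[x][observations[0]], [x]) for x in states}
    for o in observations[1:]:
        best = {x: max(((p * trans[xp][x] * emit[x][o], path + [x])
                        for xp, (p, path) in best.items()), key=lambda t: t[0])
                for x in states}
    return max(best.values(), key=lambda t: t[0])

print(viterbi(["normal", "cold", "dizzy"]))
</syntaxhighlight>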
The Baum–Welch algorithm has been developed to identify the unknown parameters of HMMs.
Applications
Academic applications
Since 2000, Bayesian programming has been used to develop both robotics applications and life sciences models.
Robotics
In robotics, Bayesian programming was applied to autonomous robotics, robotic CAD systems, advanced driver-assistance systems, robotic arm control, mobile robotics, human-robot interaction, human-vehicle interaction (Bayesian autonomous driver models), video game avatar programming and training, and real-time strategy games (AI).
Life sciences
In life sciences, Bayesian programming was used in vision to reconstruct shape from motion, to model visuo-vestibular interaction and to study saccadic eye movements; in speech perception and control to study early speech acquisition and the emergence of articulatory-acoustic systems; and to model handwriting perception and control.
Pattern recognition
Bayesian program learning has potential applications in voice recognition and synthesis, image recognition and natural language processing. It employs the principles of ''compositionality'' (building abstract representations from parts), ''causality'' (building complexity from parts) and ''learning to learn'' (using previously recognized concepts to ease the creation of new concepts).
Possibility theories
The comparison between probabilistic approaches (not only Bayesian programming) and possibility theories continues to be debated.
Possibility theories like, for instance, fuzzy sets, fuzzy logic and possibility theory are alternatives to probability for modelling uncertainty. Their proponents argue that probability is insufficient or inconvenient to model certain aspects of incomplete/uncertain knowledge.
The defense of probability is mainly based on Cox's theorem, which starts from four postulates concerning rational reasoning in the presence of uncertainty. It demonstrates that the only mathematical framework that satisfies these postulates is probability theory. The argument is that any approach other than probability necessarily infringes one of these postulates, and the question is then what is gained by this infringement.
Probabilistic programming
The purpose of probabilistic programming is to unify the scope of classical programming languages with probabilistic modeling (especially Bayesian networks) to deal with uncertainty while profiting from the programming languages' expressiveness to encode complexity.
Extended classical programming languages include logical languages as proposed in Probabilistic Horn Abduction, Independent Choice Logic, PRISM, and ProbLog, which proposes an extension of Prolog.
They can also be extensions of functional programming languages (essentially Lisp and Scheme) such as IBAL or CHURCH. The underlying programming languages can be object-oriented as in BLOG and FACTORIE or more standard ones as in CES and FIGARO.
The purpose of Bayesian programming is different. Jaynes' precept of "probability as logic" argues that probability is an extension of and an alternative to logic above which a complete theory of rationality, computation and programming can be rebuilt.
Bayesian programming attempts to replace classical languages with a programming approach based on probability that considers incompleteness and uncertainty.
The precise comparison between the semantics and power of expression of Bayesian and probabilistic programming is an open question.
External links
* A companion site to the ''Bayesian programming'' book, where ProBT, an inference engine dedicated to Bayesian programming, can be downloaded.
* The Bayesian-programming.org site (archived at https://archive.today/20131123162815/http://bayesian-programming.org/) for the promotion of Bayesian programming, with detailed information and numerous publications.
Bayesian statistics
Artificial intelligence