Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As is typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
# To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
# To derive a lower bound for the marginal likelihood (sometimes called the ''evidence'') of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data; the marginal likelihood is written out just after this list. (See also the Bayes factor article.)
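For concreteness, the marginal likelihood of the observed data <math>\mathbf{X}</math> under a model <math>m</math> with unobserved variables <math>\mathbf{Z}</math> (notation introduced here for illustration) is
:<math>P(\mathbf{X} \mid m) = \int P(\mathbf{X} \mid \mathbf{Z}, m) \, P(\mathbf{Z} \mid m) \, d\mathbf{Z},</math>
so that comparing models by their marginal likelihoods averages over every setting of the unobserved variables rather than relying on a single fitted value.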
In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods—particularly, Markov chain Monte Carlo methods such as Gibbs sampling—for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to evaluate directly or sample. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally optimal, exact analytical solution to an approximation of the posterior.
Variational Bayes can be seen as an extension of the expectation–maximization (EM) algorithm from maximum likelihood (ML) or maximum a posteriori (MAP) estimation of the single most probable value of each parameter to fully Bayesian estimation, which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.
For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to update the parameters iteratively often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.
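To make the alternating structure concrete, the following is a minimal sketch (with illustrative priors and synthetic data not taken from the text) of the coordinate-ascent updates for a model of the kind just mentioned: a Gaussian with unknown mean <math>\mu</math> and precision <math>\tau</math>, under the standard conjugate-style priors <math>\mu \mid \tau \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1})</math> and <math>\tau \sim \operatorname{Gamma}(a_0, b_0)</math>, with the mean-field factorization <math>q(\mu,\tau) = q(\mu)\,q(\tau)</math>.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=100)  # synthetic data, illustrative only
N, xbar, xsq = len(x), x.mean(), np.sum(x**2)

# Conjugate-style priors: mu | tau ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0).
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0  # initial guess for E[tau]
for _ in range(20):
    # Update q(mu) = N(mu_n, 1/lam_n); lam_n depends on E[tau] from q(tau).
    mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_n = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_n, mu_n**2 + 1.0 / lam_n  # moments of q(mu)

    # Update q(tau) = Gamma(a_n, b_n), using the moments of q(mu).
    a_n = a0 + (N + 1) / 2.0
    b_n = b0 + 0.5 * (xsq - 2 * E_mu * N * xbar + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_n / b_n  # feeds back into the q(mu) update above

print(f"E[mu] ~ {mu_n:.3f}, E[tau] ~ {E_tau:.3f} (generator: mu=2.0, tau=4.0)")
</syntaxhighlight>
Each update consumes an expectation produced by the other (<math>\operatorname{E}[\tau]</math> enters <math>\lambda_N</math>, while the moments of <math>q(\mu)</math> enter <math>b_N</math>), which is exactly the interlocked structure described above.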
Mathematical derivation
Problem
In variational inference, the posterior distribution over a set of unobserved variables <math>\mathbf{Z} = \{Z_1, \dots, Z_n\}</math> given some data <math>\mathbf{X}</math> is approximated by a so-called variational distribution, <math>Q(\mathbf{Z})</math>:
:<math>Q(\mathbf{Z}) \approx P(\mathbf{Z} \mid \mathbf{X}).</math>
The distribution <math>Q(\mathbf{Z})</math> is restricted to belong to a family of distributions of simpler form than <math>P(\mathbf{Z} \mid \mathbf{X})</math> (e.g. a family of Gaussian distributions), selected with the intention of making <math>Q(\mathbf{Z})</math> similar to the true posterior, <math>P(\mathbf{Z} \mid \mathbf{X})</math>. The similarity (or dissimilarity) is measured in terms of a dissimilarity function <math>d(Q; P)</math>, and hence inference is performed by selecting the distribution <math>Q(\mathbf{Z})</math> that minimizes <math>d(Q; P)</math>.
KL divergence
The most common type of variational Bayes uses the Kullback–Leibler divergence (KL-divergence) of ''Q'' from ''P'' as the choice of dissimilarity function. This choice makes the minimization tractable. The KL-divergence is defined as
:<math>D_\text{KL}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z} \mid \mathbf{X})}.</math>
Note that ''Q'' and ''P'' are reversed from what one might expect. This use of reversed KL-divergence is conceptually similar to the expectation–maximization algorithm. (Using the KL-divergence in the other way produces the expectation propagation algorithm.)
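As a small numerical illustration of that asymmetry (the distributions below are arbitrary choices, not from the text), the following sketch computes the KL-divergence in both directions for two discrete distributions:
<syntaxhighlight lang="python">
import numpy as np

# Two discrete distributions over the same support (arbitrary illustrative values).
p = np.array([0.70, 0.20, 0.10])  # "true" posterior P
q = np.array([0.50, 0.30, 0.20])  # variational approximation Q

def kl(a, b):
    """KL-divergence D_KL(a || b) = sum_i a_i * log(a_i / b_i)."""
    return float(np.sum(a * np.log(a / b)))

print("reverse KL, D(Q || P):", kl(q, p))  # what variational Bayes minimizes
print("forward KL, D(P || Q):", kl(p, q))  # what expectation propagation targets
# The two values differ, showing that the KL-divergence is not symmetric.
</syntaxhighlight>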
Intractability
Variational techniques are typically used to form an approximation for:
:<math>P(\mathbf{Z} \mid \mathbf{X}) = \frac{P(\mathbf{X} \mid \mathbf{Z}) \, P(\mathbf{Z})}{P(\mathbf{X})}.</math>
The marginalization over <math>\mathbf{Z}</math> to calculate <math>P(\mathbf{X})</math> in the denominator is typically intractable, because, for example, the search space of <math>\mathbf{Z}</math> is combinatorially large. Therefore, we seek an approximation, using <math>Q(\mathbf{Z}) \approx P(\mathbf{Z} \mid \mathbf{X})</math>.
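To see why the denominator is the bottleneck, this sketch computes an exact posterior by brute force for a toy model (the joint distribution is an arbitrary illustrative choice); the sum in the denominator has <math>2^n</math> terms, which is exactly what becomes infeasible for realistic <math>n</math>:
<syntaxhighlight lang="python">
import itertools
import math

# Toy model: n binary latent variables Z and one fixed observation X, with a
# made-up joint P(X, Z). Computing P(Z | X) exactly requires summing P(X, Z)
# over all 2**n configurations of Z.
n = 10

def joint(z):
    """P(X, Z=z) for the fixed observation (arbitrary illustrative form)."""
    return math.exp(-0.5 * sum(z))

configs = list(itertools.product([0, 1], repeat=n))
evidence = sum(joint(z) for z in configs)              # P(X): 2**n terms
posterior = {z: joint(z) / evidence for z in configs}  # exact P(Z | X)

print(f"summed {len(configs)} terms for n={n}; n=300 would need 2**300 terms")
print(f"P(Z = all zeros | X) = {posterior[(0,) * n]:.4f}")
</syntaxhighlight>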
Evidence lower bound
Given that <math>P(\mathbf{Z} \mid \mathbf{X}) = \frac{P(\mathbf{Z}, \mathbf{X})}{P(\mathbf{X})}</math>, the KL-divergence above can also be written as
:<math>D_\text{KL}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \left[ \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} + \log P(\mathbf{X}) \right].</math>
Because <math>P(\mathbf{X})</math> is a constant with respect to <math>\mathbf{Z}</math> and because <math>Q(\mathbf{Z})</math> is a distribution, we have
:<math>D_\text{KL}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} + \log P(\mathbf{X})</math>
which, according to the definition of expected value (for a discrete random variable), can be written as follows
:<math>D_\text{KL}(Q \parallel P) = \operatorname{E}_Q \left[ \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} \right] + \log P(\mathbf{X})</math>
which can be rearranged to become
:<math>\log P(\mathbf{X}) = D_\text{KL}(Q \parallel P) - \operatorname{E}_Q \left[ \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} \right] = D_\text{KL}(Q \parallel P) + \mathcal{L}(Q),</math>
where <math>\mathcal{L}(Q) = \operatorname{E}_Q \left[ \log \frac{P(\mathbf{Z}, \mathbf{X})}{Q(\mathbf{Z})} \right]</math>.
As the ''log-evidence'' <math>\log P(\mathbf{X})</math> is fixed with respect to <math>Q</math>, maximizing the final term <math>\mathcal{L}(Q)</math> minimizes the KL-divergence of <math>Q</math> from <math>P</math>. By appropriate choice of <math>Q</math>, <math>\mathcal{L}(Q)</math> becomes tractable to compute and to maximize. Hence we have both an analytical approximation <math>Q(\mathbf{Z})</math> for the posterior <math>P(\mathbf{Z} \mid \mathbf{X})</math>, and a lower bound <math>\mathcal{L}(Q)</math> for the log-evidence <math>\log P(\mathbf{X})</math> (since the KL-divergence is non-negative).
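The decomposition above can be checked numerically. This sketch (a toy discrete joint chosen arbitrarily for illustration) verifies that <math>\log P(\mathbf{X}) = D_\text{KL}(Q \parallel P) + \mathcal{L}(Q)</math> and that <math>\mathcal{L}(Q)</math> never exceeds the log-evidence:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete model: 5 latent states, one fixed observation X.
# joint[i] = P(Z=i, X); arbitrary positive values, illustrative only.
joint = rng.random(5)
evidence = joint.sum()          # P(X) = sum_Z P(Z, X)
posterior = joint / evidence    # exact P(Z | X)

# An arbitrary variational distribution Q over the same 5 states.
q = rng.random(5)
q /= q.sum()

kl = np.sum(q * np.log(q / posterior))  # D_KL(Q || P)
elbo = np.sum(q * np.log(joint / q))    # L(Q) = E_Q[log P(Z,X) - log Q(Z)]

print(np.log(evidence), kl + elbo)          # the two numbers agree
assert np.isclose(np.log(evidence), kl + elbo)
assert elbo <= np.log(evidence)             # the ELBO really is a lower bound
</syntaxhighlight>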
The lower bound <math>\mathcal{L}(Q)</math> is known as the (negative) variational free energy, in analogy with thermodynamic free energy, because it can also be expressed as a negative energy <math>\operatorname{E}_Q[\log P(\mathbf{Z}, \mathbf{X})]</math> plus the entropy of <math>Q</math>:
:<math>\mathcal{L}(Q) = \operatorname{E}_Q[\log P(\mathbf{Z}, \mathbf{X})] + H(Q).</math>