In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.
Terminology and notation
Let X and Z be random variables, jointly distributed with distribution p_θ. For example, p_θ(X) is the marginal distribution of X, and p_θ(Z ∣ X) is the conditional distribution of Z given X. Then, for any sample x ∼ p_θ and any distribution q_φ, we have

    ln p_θ(x) ≥ E_{z∼q_φ}[ln(p_θ(x, z)/q_φ(z))].

The left-hand side is called the ''evidence'' for x, and the right-hand side is called the ''evidence lower bound for x'', or ''ELBO''. We refer to the above inequality as the ''ELBO inequality''.
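As a concrete check of the ELBO inequality, the following sketch estimates the right-hand side by Monte Carlo for a toy conjugate model in which the left-hand side is known exactly. The model (z ∼ N(0, 1), x ∣ z ∼ N(z, 1), so the evidence is p(x) = N(x; 0, 2)) and the choice q = N(0, 1) are illustrative assumptions, not taken from the article:

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    # Log density of the normal distribution N(mean, var).
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(0)

# Toy conjugate model: z ~ N(0, 1), x | z ~ N(z, 1),
# so the evidence p(x) = N(x; 0, 2) is available in closed form.
x = 1.0
log_evidence = log_normal_pdf(x, 0.0, 2.0)

# A (deliberately imperfect) variational distribution q(z) = N(0, 1).
q_mean, q_var = 0.0, 1.0
z = rng.normal(q_mean, np.sqrt(q_var), size=100_000)

# Monte Carlo estimate of the ELBO: E_{z~q}[ln p(x, z) - ln q(z)].
log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
elbo = np.mean(log_joint - log_normal_pdf(z, q_mean, q_var))

print(elbo, log_evidence)  # the ELBO estimate lies below the log-evidence
```

The gap between the two printed numbers is exactly the KL divergence from q to the true posterior N(x/2, 1/2) (about 0.40 here), which is the content of the derivation below.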
In the terminology of variational Bayesian methods, the distribution p_θ(X) is called the ''evidence''. Some authors use the term ''evidence'' to mean p_θ(x), other authors call ln p_θ(x) the ''log-evidence'', and some use the terms ''evidence'' and ''log-evidence'' interchangeably.
There is no generally fixed notation for the ELBO. In this article we use

    L(φ, θ; x) := E_{z∼q_φ}[ln(p_θ(x, z)/q_φ(z))].
Motivation
Variational Bayesian inference
Suppose we have an observable random variable X, and we want to find its true distribution p*. This would allow us to generate data by sampling and estimate probabilities of future events. In general, it is impossible to find p* exactly, forcing us to search for a good approximation.
That is, we define a sufficiently large parametric family {p_θ}_{θ∈Θ} of distributions, then solve for min_θ L(p_θ, p*) for some loss function L. One possible way to solve this is by considering a small variation from p_θ to p_{θ+δθ} and solving for the resulting change in the loss L(p_θ, p*). This is a problem in the calculus of variations, thus it is called the variational method.
Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider ''implicitly parametrized'' probability distributions:
* First, define a simple distribution p over a latent random variable Z. Usually a normal distribution or a uniform distribution suffices.
* Next, define a family of complicated functions f_θ (such as a deep neural network) parametrized by θ.
* Finally, define a way to convert any f_θ(z) into a simple distribution over the observable random variable X. For example, let f_θ(z) = (f_1(z), f_2(z)) have two outputs; then we can define the corresponding distribution over X to be the normal distribution N(f_1(z), e^{f_2(z)}).
This defines a family of joint distributions p_θ over (X, Z). It is very easy to sample (x, z) ∼ p_θ: simply sample z ∼ p, then compute f_θ(z), and finally sample x ∼ p_θ(· ∣ z).
In other words, we have a generative model for both the observable and the latent.
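The sampling procedure above (sample the latent, push it through f_θ, then sample the observable) can be sketched as follows. The tiny tanh network standing in for f_θ is hypothetical, with fixed random weights, chosen only to make the ancestral-sampling order concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for f_theta: a one-hidden-layer tanh network with fixed
# random weights, mapping a latent z to (mean, log-variance) of p_theta(x | z).
W1 = rng.normal(size=(8, 2))
W2 = rng.normal(size=(2, 8))

def f_theta(z):
    h = np.tanh(W1 @ np.array([z, 1.0]))  # hidden layer (input is z plus a bias term)
    mean, log_var = W2 @ h
    return mean, log_var

# Ancestral sampling of (x, z) from the joint p_theta:
z = rng.normal()                              # 1. sample z from the simple prior N(0, 1)
mean, log_var = f_theta(z)                    # 2. compute f_theta(z)
x = rng.normal(mean, np.exp(0.5 * log_var))   # 3. sample x ~ N(mean, e^{log_var})
print(z, x)
```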
Now, we consider a distribution p_θ good if it is a close approximation of p*:

    p_θ(X) ≈ p*(X).

Since the distribution on the right side is over X only, the distribution on the left side must marginalize the latent variable Z away.
In general, it is impossible to perform the integral p_θ(x) = ∫ p_θ(x ∣ z) p(z) dz, forcing us to perform another approximation.
Since p_θ(x) = p_θ(x, z)/p_θ(z ∣ x) for any z, it suffices to find a good approximation of the posterior p_θ(z ∣ x). So define another distribution family q_φ(z ∣ x) and use it to approximate p_θ(z ∣ x). This is a discriminative model for the latent.
The entire situation is summarized in the following table:

| Distribution | Bayesian role | Tractability |
|---|---|---|
| p(z) | prior over the latent Z | easy to sample |
| p_θ(x ∣ z) | likelihood | easy to sample |
| p_θ(x) = ∫ p_θ(x ∣ z) p(z) dz | evidence | intractable integral |
| p_θ(z ∣ x) | posterior over the latent | intractable |
| q_φ(z ∣ x) | approximation to the posterior | easy to sample |
In Bayesian language, X is the observed evidence, and Z is the latent/unobserved. The distribution p over Z is the ''prior distribution'' over Z, p_θ(x ∣ z) is the ''likelihood function'', and p_θ(z ∣ x) is the ''posterior distribution'' over Z.
Given an observation x, we can ''infer'' what z likely gave rise to x by computing p_θ(z ∣ x). The usual Bayesian method is to estimate the integral p_θ(x) = ∫ p_θ(x ∣ z) p(z) dz, then compute p_θ(z ∣ x) = p_θ(x ∣ z) p(z)/p_θ(x) by Bayes' rule. This is expensive to perform in general, but if we can simply find a good approximation q_φ(z ∣ x) ≈ p_θ(z ∣ x) for most x, then we can infer z from x cheaply. Thus, the search for a good q_φ is also called ''amortized inference''.
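For intuition, the "expensive" route (estimate the evidence integral, then apply Bayes' rule) can be carried out numerically when the latent is one-dimensional, by discretizing z on a grid. The toy model below (z ∼ N(0, 1), x ∣ z ∼ N(z, 1)) is an illustrative assumption, chosen because its exact posterior N(x/2, 1/2) is known in closed form:

```python
import numpy as np

def normal_pdf(x, mean, var):
    # Density of the normal distribution N(mean, var).
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1); its exact posterior is N(x/2, 1/2).
x = 1.0
z = np.linspace(-6.0, 6.0, 2001)
dz = z[1] - z[0]

prior = normal_pdf(z, 0.0, 1.0)
likelihood = normal_pdf(x, z, 1.0)

evidence = np.sum(likelihood * prior) * dz   # p(x) = ∫ p(x|z) p(z) dz, on the grid
posterior = likelihood * prior / evidence    # Bayes' rule: p(z|x) = p(x|z) p(z) / p(x)

posterior_mean = np.sum(z * posterior) * dz
print(evidence, posterior_mean)  # posterior mean ≈ x/2 = 0.5
```

This works only because z is one-dimensional; grids grow exponentially with the latent dimension, which is why amortized inference with q_φ is used instead.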
All in all, we have found a problem of variational Bayesian inference.
Deriving the ELBO
A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:

    E_{x∼p*(x)}[ln p_θ(x)] = −H(p*) − D_KL(p*(x) ∥ p_θ(x)),

where H(p*) = −E_{x∼p*}[ln p*(x)] is the entropy of the true distribution, a constant that does not depend on θ.
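This identity is easy to verify numerically for discrete distributions; the two three-outcome distributions below are made-up numbers standing in for p* and p_θ:

```python
import numpy as np

# Two discrete distributions over three outcomes, standing in for p* and p_theta.
p_star = np.array([0.5, 0.3, 0.2])   # "true" distribution p*
p_theta = np.array([0.4, 0.4, 0.2])  # model distribution p_theta

expected_log_lik = np.sum(p_star * np.log(p_theta))  # E_{x~p*}[ln p_theta(x)]
entropy = -np.sum(p_star * np.log(p_star))           # H(p*)
kl = np.sum(p_star * np.log(p_star / p_theta))       # D_KL(p* || p_theta)

# The identity: expected log-likelihood = -H(p*) - D_KL(p* || p_theta).
# Since H(p*) does not depend on theta, maximizing the left side over theta
# is the same as minimizing the KL term.
print(expected_log_lik, -entropy - kl)  # the two values agree
```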