In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.


Terminology and notation

Let X and Z be random variables, jointly distributed with distribution p_\theta. For example, p_\theta(X) is the marginal distribution of X, and p_\theta(Z \mid X) is the conditional distribution of Z given X. Then, for any sample x\sim p_\theta, and any distribution q_\phi, we have

\ln p_\theta(x) \ge \mathbb E_{z\sim q_\phi(\cdot \mid x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]

The left-hand side is called the ''evidence'' for x, and the right-hand side is called the ''evidence lower bound'' for x, or ''ELBO''. We refer to the above inequality as the ''ELBO inequality''. In the terminology of variational Bayesian methods, the distribution p_\theta(X) is called the ''evidence''. Some authors use the term ''evidence'' to mean \ln p_\theta(X), others call \ln p_\theta(X) the ''log-evidence'', and some use the terms ''evidence'' and ''log-evidence'' interchangeably. There is no generally fixed notation for the ELBO. In this article we use

L(\phi, \theta; x) := \mathbb E_{z\sim q_\phi(\cdot \mid x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
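
As a concrete sanity check, the following Python sketch evaluates both sides of the ELBO inequality for a toy model in which the evidence is known exactly: Z \sim \mathcal N(0,1) and X \mid Z \sim \mathcal N(z,1), so that X \sim \mathcal N(0,2). The particular observed value, variational parameters, and sample size are illustrative assumptions, not part of the definition.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Joint model: Z ~ N(0,1), X | Z ~ N(z, 1)  =>  X ~ N(0, 2) exactly.
x = 1.3                                         # an observed sample
log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

# An arbitrary variational distribution q(z | x) = N(m, s^2).
m, s = 0.5, 1.0
z = rng.normal(m, s, size=100_000)              # z ~ q(. | x)

log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)   # ln p(x, z)
log_q = norm.logpdf(z, m, s)                               # ln q(z | x)
elbo = np.mean(log_joint - log_q)

print(log_evidence, elbo)   # the ELBO estimate stays below the log evidence

The gap between the two printed numbers is exactly the KL-divergence from the true posterior \mathcal N(x/2, 1/2) to q; choosing m = x/2 and s^2 = 1/2 makes the bound tight.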


Motivation


Variational Bayesian inference

Suppose we have an observable random variable X, and we want to find its true distribution p^*. This would allow us to generate data by sampling, and estimate probabilities of future events. In general, it is impossible to find p^* exactly, forcing us to search for a good approximation. That is, we define a sufficiently large parametric family \{p_\theta\}_{\theta\in\Theta} of distributions, then solve \min_\theta L(p_\theta, p^*) for some loss function L. One possible way to solve this is by considering a small variation from p_\theta to p_{\theta + \delta\theta}, and solving L(p_\theta, p^*) - L(p_{\theta + \delta\theta}, p^*) = 0. This is a problem in the calculus of variations, which is why it is called the variational method.

Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider ''implicitly parametrized'' probability distributions:

* First, define a simple distribution p(z) over a latent random variable Z. Usually a normal distribution or a uniform distribution suffices.
* Next, define a family of complicated functions f_\theta (such as a deep neural network) parametrized by \theta.
* Finally, define a way to convert any f_\theta(z) into a simple distribution over the observable random variable X. For example, let f_\theta(z) = (f_1(z), f_2(z)) have two outputs, then we can define the corresponding distribution over X to be the normal distribution \mathcal N(f_1(z), e^{f_2(z)}).

This defines a family of joint distributions p_\theta over (X, Z). It is very easy to sample (x, z) \sim p_\theta: simply sample z\sim p, then compute f_\theta(z), and finally sample x \sim p_\theta(\cdot \mid z) using f_\theta(z), as in the sketch below. In other words, we have a generative model for both the observable and the latent. Now, we consider a distribution p_\theta good if it is a close approximation of p^*:

p_\theta(X) \approx p^*(X)

Since the distribution on the right side is over X only, the distribution on the left side must marginalize the latent variable Z away.
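
A minimal Python sketch of this sampling procedure; the affine map standing in for f_\theta, its parameter values, and the random seed are illustrative assumptions rather than part of the construction.

import numpy as np

rng = np.random.default_rng(0)

def f_theta(z, theta):
    # Stand-in for a deep network: an affine map with two outputs (f_1, f_2).
    w1, b1, w2, b2 = theta
    return w1 * z + b1, w2 * z + b2

def sample_joint(theta):
    z = rng.normal()                            # z ~ p(z) = N(0, 1)
    mu, log_var = f_theta(z, theta)             # decode the latent
    # x ~ N(f_1(z), e^{f_2(z)}), where e^{f_2(z)} is the variance
    x = rng.normal(mu, np.exp(0.5 * log_var))
    return x, z

theta = (2.0, 0.0, 0.1, -1.0)                   # hypothetical parameter values
x, z = sample_joint(theta)
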
In general, it is impossible to perform the integral p_\theta(x) = \int p_\theta(x \mid z)p(z)dz, forcing us to perform another approximation. Since p_\theta(x) = \frac{p_\theta(x, z)}{p_\theta(z \mid x)}, it suffices to find a good approximation of p_\theta(z \mid x). So define another distribution family q_\phi(z \mid x) and use it to approximate p_\theta(z \mid x). This is a discriminative model for the latent. The entire situation is summarized in the following table:

  Distribution           Role                                      Tractability
  p(z)                   prior over the latent Z                   easy to sample and evaluate
  p_\theta(x \mid z)     likelihood                                easy to evaluate
  p_\theta(x, z)         joint (the generative model)              easy to sample and evaluate
  p_\theta(x)            evidence (marginal over Z)                intractable integral
  p_\theta(z \mid x)     posterior over the latent                 intractable
  q_\phi(z \mid x)       approximate posterior (discriminative)    easy; trained so that q_\phi(z \mid x) \approx p_\theta(z \mid x)

In Bayesian language, X is the observed evidence, and Z is the latent/unobserved. The distribution p over Z is the ''prior distribution'' over Z, p_\theta(x \mid z) is the likelihood function, and p_\theta(z \mid x) is the ''posterior distribution'' over Z.

Given an observation x, we can ''infer'' what z likely gave rise to x by computing p_\theta(z \mid x). The usual Bayesian method is to estimate the integral p_\theta(x) = \int p_\theta(x \mid z)p(z)dz, then compute p_\theta(z \mid x) = \frac{p_\theta(x \mid z)p(z)}{p_\theta(x)} by Bayes' rule. This is expensive to perform in general, but if we can simply find a good approximation q_\phi(z \mid x) \approx p_\theta(z \mid x) for most x, z, then we can infer z from x cheaply. Thus, the search for a good q_\phi is also called amortized inference; a sketch of such an inference network follows. All in all, we have found a problem of variational Bayesian inference.
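
The following sketch illustrates amortized inference, with a hypothetical affine "encoder" g_\phi standing in for the network that maps an observation directly to the parameters of q_\phi(\cdot \mid x), so no per-observation optimization is needed.

import numpy as np

rng = np.random.default_rng(0)

def g_phi(x, phi):
    # Hypothetical encoder: an affine map from the observation x to the
    # parameters (mean, log-variance) of q_phi(z | x); a real model would
    # use a deep network here.
    v1, c1, v2, c2 = phi
    return v1 * x + c1, v2 * x + c2

def sample_z_given_x(x, phi):
    mean, log_var = g_phi(x, phi)
    return rng.normal(mean, np.exp(0.5 * log_var))   # z ~ q_phi(. | x)

phi = (0.5, 0.0, 0.0, -1.0)                          # hypothetical parameters
z = sample_z_given_x(1.3, phi)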


Deriving the ELBO

A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:

\mathbb E_{x\sim p^*(x)}[\ln p_\theta (x)] = -H(p^*) - D_\text{KL}(p^*(x) \parallel p_\theta(x))

where H(p^*) = -\mathbb E_{x\sim p^*}[\ln p^*(x)] is the entropy of the true distribution. (This follows by expanding the definition D_\text{KL}(p^* \parallel p_\theta) = \mathbb E_{x\sim p^*}[\ln p^*(x) - \ln p_\theta(x)].) So if we can maximize \mathbb E_{x\sim p^*(x)}[\ln p_\theta (x)], we can minimize D_\text{KL}(p^*(x) \parallel p_\theta(x)), and consequently find an accurate approximation p_\theta \approx p^*.

To maximize \mathbb E_{x\sim p^*(x)}[\ln p_\theta (x)], we simply sample many x_i\sim p^*(x), i.e., use a Monte Carlo estimate:

N\max_\theta \mathbb E_{x\sim p^*(x)}[\ln p_\theta (x)] \approx \max_\theta \sum_i \ln p_\theta (x_i)

In order to maximize \sum_i \ln p_\theta (x_i), it is necessary to evaluate \ln p_\theta(x):

\ln p_\theta(x) = \ln \int p_\theta(x \mid z) p(z)dz

This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:

\int p_\theta(x \mid z) p(z)dz = \mathbb E_{z\sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]

where q_\phi(z \mid x) is a sampling distribution over z that we use to perform the Monte Carlo integration. So we see that if we sample z\sim q_\phi(\cdot \mid x), then \frac{p_\theta(x, z)}{q_\phi(z \mid x)} is an unbiased estimator of p_\theta(x). Unfortunately, this does not give us an unbiased estimator of \ln p_\theta(x), because \ln is nonlinear. Indeed, we have by Jensen's inequality,

\ln p_\theta(x) = \ln \mathbb E_{z\sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \geq \mathbb E_{z\sim q_\phi(\cdot \mid x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]

In fact, all the obvious estimators of \ln p_\theta(x) are biased downwards, because no matter how many samples of z_i\sim q_\phi(\cdot \mid x) we take, we have by Jensen's inequality:

\mathbb E_{z_i\sim q_\phi(\cdot \mid x)}\left[\ln \left(\frac 1N \sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right)\right] \leq \ln \mathbb E_{z_i\sim q_\phi(\cdot \mid x)}\left[\frac 1N \sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right] = \ln p_\theta(x)

Subtracting the right side, we see that the problem comes down to a biased estimator of zero:

\mathbb E_{z_i\sim q_\phi(\cdot \mid x)}\left[\ln \left(\frac 1N \sum_i \frac{p_\theta(z_i \mid x)}{q_\phi(z_i \mid x)}\right)\right] \leq 0

By the delta method, we have

\mathbb E_{z_i\sim q_\phi(\cdot \mid x)}\left[\ln \left(\frac 1N \sum_i \frac{p_\theta(z_i \mid x)}{q_\phi(z_i \mid x)}\right)\right] \approx -\frac{1}{2N} \mathbb V_{z\sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(z \mid x)}{q_\phi(z \mid x)}\right] = O(N^{-1})

If we continue with this, we would obtain the importance-weighted autoencoder. But we return to the simplest case with N=1:

\ln p_\theta(x) = \ln \mathbb E_{z\sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \geq \mathbb E_{z\sim q_\phi(\cdot \mid x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]

The tightness of the inequality has a closed form:

\ln p_\theta(x) - \mathbb E_{z\sim q_\phi(\cdot \mid x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = D_\text{KL}(q_\phi(\cdot \mid x)\parallel p_\theta(\cdot \mid x)) \geq 0

We have thus obtained the ELBO function:

L(\phi, \theta; x) := \ln p_\theta(x) - D_\text{KL}(q_\phi(\cdot \mid x)\parallel p_\theta(\cdot \mid x))
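
The downward bias, and the way averaging N importance weights tightens the bound (the importance-weighted-autoencoder idea above), can be observed numerically. The sketch below reuses the toy model Z \sim \mathcal N(0,1), X \mid Z \sim \mathcal N(z,1) from earlier, where \ln p_\theta(x) is known exactly; the observed value, variational parameters, and sample sizes are illustrative assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

x = 1.3
log_px = norm.logpdf(x, 0, np.sqrt(2.0))        # exact ln p(x)
m, s = 0.5, 1.0                                  # q(z | x) = N(m, s^2)

def log_weight(z):
    # ln [ p(x, z) / q(z | x) ]; exponentiating gives an unbiased
    # estimator of p(x).
    return norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1) - norm.logpdf(z, m, s)

for N in (1, 8, 64):
    z = rng.normal(m, s, size=(200_000, N))
    # ln( (1/N) sum_i w_i ): biased low for every N, but the bias
    # shrinks like O(1/N), approaching ln p(x).
    est = np.logaddexp.reduce(log_weight(z), axis=1) - np.log(N)
    print(N, est.mean(), "<=", log_px)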


Maximizing the ELBO

For fixed x, the optimization \max_{\phi, \theta} L(\phi, \theta; x) simultaneously attempts to maximize \ln p_\theta(x) and minimize D_\text{KL}(q_\phi(\cdot \mid x)\parallel p_\theta(\cdot \mid x)). If the parametrizations for p_\theta and q_\phi are flexible enough, we would obtain some \hat\phi, \hat\theta, such that we have simultaneously

\ln p_{\hat\theta}(x) \approx \max_\theta \ln p_\theta(x); \quad q_{\hat\phi}(\cdot \mid x)\approx p_{\hat\theta}(\cdot \mid x)

Since

\mathbb E_{x\sim p^*(x)}[\ln p_\theta (x)] = -H(p^*) - D_\text{KL}(p^*(x) \parallel p_\theta(x))

we have

\mathbb E_{x\sim p^*(x)}[\ln p_{\hat\theta}(x)] \approx \max_\theta \left(-H(p^*) - D_\text{KL}(p^*(x) \parallel p_\theta(x))\right)

and so

\hat\theta \approx \arg\min_\theta D_\text{KL}(p^*(x) \parallel p_\theta(x))

In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model p_{\hat\theta} \approx p^* and an accurate discriminative model q_{\hat\phi}(\cdot \mid x)\approx p_{\hat\theta}(\cdot \mid x).
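
For the Gaussian toy model used above, the ELBO has a closed form in the variational parameters, and plain gradient ascent on it recovers the exact posterior \mathcal N(x/2, 1/2), at which point the bound is tight. The closed-form expression, learning rate, and iteration count below are illustrative assumptions.

import numpy as np

x = 1.3
m, s = 0.0, 1.0      # initial q(z | x) = N(m, s^2)
lr = 0.05

def elbo(m, s):
    # Closed form for Z ~ N(0,1), X | Z ~ N(z,1), q = N(m, s^2):
    # E_q[ln p(z) + ln p(x|z)] + entropy of q.
    return (-np.log(2 * np.pi) - 0.5 * (m**2 + (x - m)**2) - s**2
            + 0.5 * np.log(2 * np.pi) + 0.5 + np.log(s))

for _ in range(2000):
    dm = -m + (x - m)            # d ELBO / d m
    ds = -2 * s + 1 / s          # d ELBO / d s
    m, s = m + lr * dm, s + lr * ds

print(m, s**2)                   # -> x/2 and 1/2: the true posterior
print(elbo(m, s))                # == ln p(x) = ln N(x; 0, 2) at the optimum
print(np.log(1 / np.sqrt(2 * np.pi * 2)) - x**2 / 4)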


Main forms

The ELBO has many possible expressions, each with a different emphasis.

\mathbb E_{z\sim q_\phi(\cdot \mid x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \int q_\phi(z \mid x)\ln\frac{p_\theta(x, z)}{q_\phi(z \mid x)}dz

This form shows that if we sample z\sim q_\phi(\cdot \mid x), then \ln\frac{p_\theta(x, z)}{q_\phi(z \mid x)} is an unbiased estimator of the ELBO.

\ln p_\theta(x) - D_\text{KL}(q_\phi(\cdot \mid x) \parallel p_\theta(\cdot \mid x))

This form shows that the ELBO is a lower bound on the evidence \ln p_\theta(x), and that maximizing the ELBO with respect to \phi is equivalent to minimizing the KL-divergence from p_\theta(\cdot \mid x) to q_\phi(\cdot \mid x).

\mathbb E_{z\sim q_\phi(\cdot \mid x)}[\ln p_\theta(x \mid z)] - D_\text{KL}(q_\phi(\cdot \mid x) \parallel p)

This form shows that maximizing the ELBO simultaneously attempts to keep q_\phi(\cdot \mid x) close to p and concentrate q_\phi(\cdot \mid x) on those z that maximize \ln p_\theta (x \mid z). That is, the approximate posterior q_\phi(\cdot \mid x) balances between staying close to the prior p and moving towards the maximum likelihood \arg\max_z \ln p_\theta (x \mid z); see the sketch at the end of this section.

H(q_\phi(\cdot \mid x)) + \mathbb E_{z\sim q_\phi(\cdot \mid x)}[\ln p_\theta(z \mid x)] + \ln p_\theta(x)

This form shows that maximizing the ELBO simultaneously attempts to keep the entropy of q_\phi(\cdot \mid x) high, and concentrate q_\phi(\cdot \mid x) on those z that maximize \ln p_\theta (z \mid x). That is, the approximate posterior q_\phi(\cdot \mid x) balances between being a uniform distribution and moving towards the maximum a posteriori \arg\max_z \ln p_\theta (z \mid x).
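
The third form is the one most often implemented in practice, since the KL term has a closed form when both q_\phi(\cdot \mid x) and the prior are Gaussian, so only the reconstruction term needs to be sampled. A sketch under that assumption, again using the toy model from earlier (the observed value and parameters are illustrative):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

x = 1.3
m, s = 0.5, 1.0      # q(z | x) = N(m, s^2), prior p(z) = N(0, 1)

# Closed-form KL(N(m, s^2) || N(0, 1)) = 0.5*(m^2 + s^2 - 1) - ln s.
kl = 0.5 * (m**2 + s**2 - 1.0) - np.log(s)

# Reconstruction term E_q[ln p(x | z)] by Monte Carlo, with p(x | z) = N(z, 1).
z = rng.normal(m, s, size=100_000)
recon = np.mean(norm.logpdf(x, z, 1))

print(recon - kl)    # the ELBO in "reconstruction minus KL" form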


Data-processing inequality

Suppose we take N independent samples from p^*, and collect them in the dataset D = \{x_1, ..., x_N\}, then we have the empirical distribution q_D(x) = \frac 1N \sum_i \delta_{x_i}.

Fitting p_\theta(x) to q_D(x) can be done, as usual, by maximizing the loglikelihood \ln p_\theta(D):

D_\text{KL}(q_D(x) \parallel p_\theta(x)) = -\frac 1N \sum_i \ln p_\theta(x_i) - H(q_D) = -\frac 1N \ln p_\theta(D) - H(q_D)

Now, by the ELBO inequality, we can bound \ln p_\theta(D), and thus

D_\text{KL}(q_D(x) \parallel p_\theta(x)) \leq -\frac 1N L(\phi, \theta; D) - H(q_D)

The right-hand side simplifies to a KL-divergence, and so we get

D_\text{KL}(q_D(x) \parallel p_\theta(x)) \leq -\frac 1N \sum_i L(\phi, \theta; x_i) - H(q_D) = D_\text{KL}(q_{D, \phi}(x, z) \parallel p_\theta(x, z))

where q_{D, \phi}(x, z) := q_D(x)\, q_\phi(z \mid x). This result can be interpreted as a special case of the data processing inequality.

In this interpretation, maximizing L(\phi, \theta; D) = \sum_i L(\phi, \theta; x_i) is minimizing D_\text{KL}(q_{D, \phi}(x, z) \parallel p_\theta(x, z)), which upper-bounds the real quantity of interest D_\text{KL}(q_D(x) \parallel p_\theta(x)) via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence.
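
The chain rule behind this bound, D_\text{KL}(q(x,z) \parallel p(x,z)) = D_\text{KL}(q(x) \parallel p(x)) + \mathbb E_{q(x)}[D_\text{KL}(q(\cdot \mid x) \parallel p(\cdot \mid x))] \geq D_\text{KL}(q(x) \parallel p(x)), is easy to check numerically on a small discrete joint distribution; the grid size and random draws below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Two random joint distributions q(x, z) and p(x, z) on a 3x4 grid.
q = rng.random((3, 4)); q /= q.sum()
p = rng.random((3, 4)); p /= p.sum()

kl_joint = np.sum(q * np.log(q / p))         # KL between the joints

qx, px = q.sum(axis=1), p.sum(axis=1)        # marginals over x
kl_marginal = np.sum(qx * np.log(qx / px))   # KL between the marginals

print(kl_joint, ">=", kl_marginal)           # data-processing inequality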

