Variational Bayesian methods

Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:

#To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
#To derive a lower bound for the marginal likelihood (sometimes called the ''evidence'') of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the Bayes factor article.)

In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods, particularly Markov chain Monte Carlo methods such as Gibbs sampling, for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to evaluate directly or sample. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally-optimal, exact analytical solution to an approximation of the posterior.

Variational Bayes can be seen as an extension of the expectation-maximization (EM) algorithm from maximum a posteriori estimation (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.

For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to update the parameters iteratively often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.


Mathematical derivation


Problem

In variational inference, the posterior distribution over a set of unobserved variables \mathbf{Z} = \{Z_1, \dots, Z_n\} given some data \mathbf{X} is approximated by a so-called variational distribution, Q(\mathbf{Z}):

:P(\mathbf{Z}\mid \mathbf{X}) \approx Q(\mathbf{Z}).

The distribution Q(\mathbf{Z}) is restricted to belong to a family of distributions of simpler form than P(\mathbf{Z}\mid \mathbf{X}) (e.g. a family of Gaussian distributions), selected with the intention of making Q(\mathbf{Z}) similar to the true posterior, P(\mathbf{Z}\mid \mathbf{X}). The similarity (or dissimilarity) is measured in terms of a dissimilarity function d(Q; P), and hence inference is performed by selecting the distribution Q(\mathbf{Z}) that minimizes d(Q; P).


KL divergence

The most common type of variational Bayes uses the Kullback–Leibler divergence (KL-divergence) of ''Q'' from ''P'' as the choice of dissimilarity function. This choice makes the minimization tractable. The KL-divergence is defined as

:D_{\mathrm{KL}}(Q \parallel P) \triangleq \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}\mid \mathbf{X})}.

Note that ''Q'' and ''P'' are reversed from what one might expect. This use of reversed KL-divergence is conceptually similar to the expectation-maximization algorithm. (Using the KL-divergence in the other way produces the expectation propagation algorithm.)
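As an illustration, the two directions of the divergence can be compared numerically for distributions on a small discrete support (the particular distributions below are arbitrary stand-ins, with P playing the role of the true posterior):

import numpy as np

def kl_divergence(q, p):
    """D_KL(q || p) = sum_z q(z) * log(q(z) / p(z)) for discrete distributions."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0  # terms with q(z) = 0 contribute 0 by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.1, 0.2, 0.3, 0.4])      # stand-in for the true posterior P(Z | X)
q = np.array([0.25, 0.25, 0.25, 0.25])  # a simple variational approximation Q(Z)

print("reversed KL  D(Q||P):", kl_divergence(q, p))  # direction used in variational Bayes
print("forward  KL  D(P||Q):", kl_divergence(p, q))  # direction used by expectation propagation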


Intractability

Variational techniques are typically used to form an approximation for:

:P(\mathbf{Z} \mid \mathbf{X}) = \frac{P(\mathbf{Z}, \mathbf{X})}{P(\mathbf{X})} = \frac{P(\mathbf{X} \mid \mathbf{Z})\, P(\mathbf{Z})}{P(\mathbf{X})}

The marginalization over \mathbf{Z} to calculate P(\mathbf{X}) in the denominator is typically intractable, because, for example, the search space of \mathbf{Z} is combinatorially large. Therefore, we seek an approximation, using Q(\mathbf{Z}) \approx P(\mathbf{Z} \mid \mathbf{X}).
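As a rough, purely arithmetic illustration of this blow-up, marginalizing over N discrete latent variables with K states each requires summing K^N terms (the values of K and N below are arbitrary):

K, N = 10, 30                        # hypothetical number of states and latent variables
terms = K ** N                       # size of the sum defining P(X) = sum_Z P(Z, X)
print(f"{terms:.3e} terms to sum")   # 1.000e+30: far too many to enumerate exactly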


Evidence lower bound

Given that P(\mathbf{Z}\mid \mathbf{X}) = \frac{P(\mathbf{Z},\mathbf{X})}{P(\mathbf{X})}, the KL-divergence above can also be written as

:D_{\mathrm{KL}}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \left[ \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z},\mathbf{X})} + \log P(\mathbf{X}) \right] = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \left[ \log Q(\mathbf{Z}) - \log P(\mathbf{Z},\mathbf{X}) \right] + \sum_{\mathbf{Z}} Q(\mathbf{Z}) \left[ \log P(\mathbf{X}) \right]

Because P(\mathbf{X}) is a constant with respect to \mathbf{Z} and \sum_{\mathbf{Z}} Q(\mathbf{Z}) = 1 because Q(\mathbf{Z}) is a distribution, we have

:D_{\mathrm{KL}}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \left[ \log Q(\mathbf{Z}) - \log P(\mathbf{Z},\mathbf{X}) \right] + \log P(\mathbf{X})

which, according to the definition of expected value (for a discrete random variable), can be written as follows

:D_{\mathrm{KL}}(Q \parallel P) = \operatorname{E}_{Q} \left[ \log Q(\mathbf{Z}) - \log P(\mathbf{Z},\mathbf{X}) \right] + \log P(\mathbf{X})

which can be rearranged to become

:\log P(\mathbf{X}) = D_{\mathrm{KL}}(Q \parallel P) - \operatorname{E}_{Q} \left[ \log Q(\mathbf{Z}) - \log P(\mathbf{Z},\mathbf{X}) \right] = D_{\mathrm{KL}}(Q \parallel P) + \mathcal{L}(Q)

As the ''log-evidence'' \log P(\mathbf{X}) is fixed with respect to Q, maximizing the final term \mathcal{L}(Q) minimizes the KL divergence of Q from P. By appropriate choice of Q, \mathcal{L}(Q) becomes tractable to compute and to maximize. Hence we have both an analytical approximation Q for the posterior P(\mathbf{Z}\mid \mathbf{X}), and a lower bound \mathcal{L}(Q) for the log-evidence \log P(\mathbf{X}) (since the KL-divergence is non-negative). The lower bound \mathcal{L}(Q) is known as the (negative) variational free energy in analogy with thermodynamic free energy because it can also be expressed as a negative energy \operatorname{E}_{Q}[\log P(\mathbf{Z},\mathbf{X})] plus the entropy of Q. The term \mathcal{L}(Q) is also known as the Evidence Lower Bound, abbreviated as ELBO, to emphasize that it is a lower bound on the log-evidence of the data.
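This decomposition can be checked numerically on a toy discrete model; in the sketch below, the joint table and the choice of Q are arbitrary, and the sum of the ELBO and the KL term reproduces the log-evidence:

import numpy as np

# Hypothetical joint P(Z, X = x_obs) over 4 latent states for one fixed observation.
joint = np.array([0.02, 0.08, 0.15, 0.05])       # P(Z = z, X = x_obs)
evidence = joint.sum()                            # P(X = x_obs), the marginal likelihood
posterior = joint / evidence                      # P(Z | X = x_obs)

q = np.array([0.4, 0.3, 0.2, 0.1])                # an arbitrary variational distribution Q(Z)

elbo = np.sum(q * (np.log(joint) - np.log(q)))    # L(Q) = E_Q[log P(Z,X) - log Q(Z)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))  # D_KL(Q || P(Z|X))

print(np.log(evidence), elbo + kl)                # identical up to rounding error
print(bool(elbo <= np.log(evidence)))             # the ELBO is a lower bound: True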


Proofs

By the generalized Pythagorean theorem of Bregman divergence, of which KL-divergence is a special case, it can be shown that:

:D_{\mathrm{KL}}(Q\parallel P) \geq D_{\mathrm{KL}}(Q\parallel Q^{*}) + D_{\mathrm{KL}}(Q^{*}\parallel P), \quad \forall Q^{*} \in \mathcal{C}

where \mathcal{C} is a convex set and the equality holds if:

:Q = Q^{*} \triangleq \arg\min_{Q \in \mathcal{C}} D_{\mathrm{KL}}(Q\parallel P).

In this case, the global minimizer Q^{*}(\mathbf{Z}) = q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)\,q^{*}(\mathbf{Z}_2) = q^{*}(\mathbf{Z}_2\mid\mathbf{Z}_1)\,q^{*}(\mathbf{Z}_1), with \mathbf{Z} = \{\mathbf{Z}_1, \mathbf{Z}_2\}, can be found as follows:

:q^{*}(\mathbf{Z}_2) = \frac{P(\mathbf{X})}{\zeta(\mathbf{X})}\, \frac{P(\mathbf{Z}_2\mid\mathbf{X})}{\exp D_{\mathrm{KL}}\left(q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)\parallel P(\mathbf{Z}_1\mid\mathbf{Z}_2,\mathbf{X})\right)} = \frac{1}{\zeta(\mathbf{X})}\exp\operatorname{E}_{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\left(\log\frac{P(\mathbf{Z},\mathbf{X})}{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\right),

in which the normalizing constant is:

:\zeta(\mathbf{X}) = P(\mathbf{X})\int_{\mathbf{Z}_2}\frac{P(\mathbf{Z}_2\mid\mathbf{X})}{\exp D_{\mathrm{KL}}\left(q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)\parallel P(\mathbf{Z}_1\mid\mathbf{Z}_2,\mathbf{X})\right)} = \int_{\mathbf{Z}_2}\exp\operatorname{E}_{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\left(\log\frac{P(\mathbf{Z},\mathbf{X})}{q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2)}\right).

The term \zeta(\mathbf{X}) is often called the evidence lower bound (ELBO) in practice, since P(\mathbf{X})\geq\zeta(\mathbf{X})=\exp(\mathcal{L}(Q^{*})), as shown above.

By interchanging the roles of \mathbf{Z}_1 and \mathbf{Z}_2, we can iteratively compute the approximated q^{*}(\mathbf{Z}_1) and q^{*}(\mathbf{Z}_2) of the true model's marginals P(\mathbf{Z}_1\mid\mathbf{X}) and P(\mathbf{Z}_2\mid\mathbf{X}), respectively. Although this iterative scheme is guaranteed to converge monotonically, the converged Q^{*} is only a local minimizer of D_{\mathrm{KL}}(Q\parallel P). If the constrained space \mathcal{C} is confined within independent space, i.e. q^{*}(\mathbf{Z}_1\mid\mathbf{Z}_2) = q^{*}(\mathbf{Z}_1), the above iterative scheme will become the so-called mean field approximation Q^{*}(\mathbf{Z}) = q^{*}(\mathbf{Z}_1)\,q^{*}(\mathbf{Z}_2), as shown below.


Mean field approximation

The variational distribution Q(\mathbf{Z}) is usually assumed to factorize over some partition of the latent variables, i.e. for some partition of the latent variables \mathbf{Z} into \mathbf{Z}_1 \dots \mathbf{Z}_M,

:Q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(\mathbf{Z}_i\mid \mathbf{X})

It can be shown using the calculus of variations (hence the name "variational Bayes") that the "best" distribution q_j^{*} for each of the factors q_j (in terms of the distribution minimizing the KL divergence, as described above) can be expressed as:

:q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \frac{e^{\operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})]}}{\int e^{\operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})]}\, d\mathbf{Z}_j}

where \operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})] is the expectation of the logarithm of the joint probability of the data and latent variables, taken over all variables not in the partition (refer to the cited literature for a derivation of the distribution q_j^{*}(\mathbf{Z}_j\mid \mathbf{X})). In practice, we usually work in terms of logarithms, i.e.:

:\ln q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}

The constant in the above expression is related to the normalizing constant (the denominator in the expression above for q_j^{*}) and is usually reinstated by inspection, as the rest of the expression can usually be recognized as being a known type of distribution (e.g. Gaussian, gamma, etc.).

Using the properties of expectations, the expression \operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})] can usually be simplified into a function of the fixed hyperparameters of the prior distributions over the latent variables and of expectations (and sometimes higher moments such as the variance) of latent variables not in the current partition (i.e. latent variables not included in \mathbf{Z}_j). This creates circular dependencies between the parameters of the distributions over variables in one partition and the expectations of variables in the other partitions. This naturally suggests an iterative algorithm, much like EM (the expectation-maximization algorithm), in which the expectations (and possibly higher moments) of the latent variables are initialized in some fashion (perhaps randomly), and then the parameters of each distribution are computed in turn using the current values of the expectations, after which the expectation of the newly computed distribution is set appropriately according to the computed parameters. An algorithm of this sort is guaranteed to converge.

In other words, for each of the partitions of variables, by simplifying the expression for the distribution over the partition's variables and examining the distribution's functional dependency on the variables in question, the family of the distribution can usually be determined (which in turn determines the value of the constant). The formula for the distribution's parameters will be expressed in terms of the prior distributions' hyperparameters (which are known constants), but also in terms of expectations of functions of variables in other partitions. Usually these expectations can be simplified into functions of expectations of the variables themselves (i.e. the means); sometimes expectations of squared variables (which can be related to the variance of the variables), or expectations of higher powers (i.e. higher moments) also appear. In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up. However, those formulas depend on those distributions' parameters, which depend in turn on the expectations about other variables. The result is that the formulas for the parameters of each variable's distributions can be expressed as a series of equations with mutual, nonlinear dependencies among the variables. Usually, it is not possible to solve this system of equations directly. However, as described above, the dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge. An example will make this process clearer.
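A generic sketch of this alternating update scheme, in code, looks as follows; the two-block partition, the function names, and the placeholder updates are illustrative only, since in a real model each update implements \ln q_j^{*} = \operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant} for its block:

def iterate_factors(init_moments_1, init_moments_2,
                    update_factor_1, update_factor_2,
                    max_iter=1000, tol=1e-10):
    """Alternate the two factor updates until their moments stop changing.

    Each update function takes the other block's current moments (a tuple of
    floats) and returns the updated moments of its own factor.
    """
    m1, m2 = tuple(init_moments_1), tuple(init_moments_2)
    for _ in range(max_iter):
        new_m1 = update_factor_1(m2)       # refresh q_1 from the current moments of q_2
        new_m2 = update_factor_2(new_m1)   # refresh q_2 from the new moments of q_1
        change = max(abs(a - b) for a, b in zip(new_m1 + new_m2, m1 + m2))
        m1, m2 = new_m1, new_m2
        if change < tol:
            break
    return m1, m2

# Placeholder updates (purely illustrative) that settle into a joint fixed point:
m1, m2 = iterate_factors((0.0,), (0.0,),
                         lambda m2: (0.5 * m2[0] + 1.0,),
                         lambda m1: (0.5 * m1[0],))
print(m1, m2)  # approximately (1.333...,) and (0.666...,)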


A duality formula for variational inference

The following theorem is referred to as a duality formula for variational inference. It explains some important properties of the variational distributions used in variational Bayes methods.

Consider two probability spaces (\Theta,\mathcal{F},P) and (\Theta,\mathcal{F},Q) with Q \ll P. Assume that there is a common dominating probability measure \lambda such that P \ll \lambda and Q \ll \lambda. Let h denote any real-valued random variable on (\Theta,\mathcal{F},P) that satisfies h \in L_1(P). Then the following equality holds

:\log E_P[\exp h] = \sup_{Q \ll P} \left\{ E_Q[h] - D_{\mathrm{KL}}(Q \parallel P) \right\}.

Further, the supremum on the right-hand side is attained if and only if it holds

:\frac{q(\theta)}{p(\theta)} = \frac{\exp h(\theta)}{E_P[\exp h]},

almost surely with respect to probability measure Q, where p(\theta) = dP/d\lambda and q(\theta) = dQ/d\lambda denote the Radon–Nikodym derivatives of the probability measures P and Q with respect to \lambda, respectively.
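On a finite space the duality formula can be verified directly; in the sketch below, P and h are arbitrary, and the maximizing Q has density proportional to p(\theta)\exp h(\theta):

import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))          # a probability measure P on 5 points
h = rng.normal(size=5)                 # a real-valued random variable h

lhs = np.log(np.sum(p * np.exp(h)))    # log E_P[exp h]

q_opt = p * np.exp(h)
q_opt /= q_opt.sum()                   # the maximizing Q: dQ/dP = exp(h) / E_P[exp h]
rhs = np.sum(q_opt * h) - np.sum(q_opt * np.log(q_opt / p))

print(lhs, rhs)                        # equal up to floating-point error

# Any other Q gives a value no larger than the left-hand side:
q_other = np.ones(5) / 5
print(bool(np.sum(q_other * h) - np.sum(q_other * np.log(q_other / p)) <= lhs))  # True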


A basic example

Consider a simple non-hierarchical Bayesian model consisting of a set of i.i.d. observations from a Gaussian distribution, with unknown mean and variance. In the following, we work through this model in great detail to illustrate the workings of the variational Bayes method.

For mathematical convenience, in the following example we work in terms of the precision, i.e. the reciprocal of the variance (or in a multivariate Gaussian, the inverse of the covariance matrix), rather than the variance itself. (From a theoretical standpoint, precision and variance are equivalent since there is a one-to-one correspondence between the two.)


The mathematical model

We place conjugate prior distributions on the unknown mean \mu and precision \tau, i.e. the mean also follows a Gaussian distribution while the precision follows a gamma distribution. In other words:

:\begin{align}
\tau & \sim \operatorname{Gamma}(a_0, b_0) \\
\mu\mid\tau & \sim \mathcal{N}(\mu_0, (\lambda_0 \tau)^{-1}) \\
\{x_1,\dots,x_N\} & \sim \mathcal{N}(\mu, \tau^{-1}) \\
N &= \text{number of data points}
\end{align}

The hyperparameters \mu_0, \lambda_0, a_0 and b_0 in the prior distributions are fixed, given values. They can be set to small positive numbers to give broad prior distributions indicating ignorance about the prior distributions of \mu and \tau.

We are given N data points \mathbf{X} = \{x_1, \dots, x_N\} and our goal is to infer the posterior distribution q(\mu, \tau) = p(\mu,\tau\mid x_1, \ldots, x_N) of the parameters \mu and \tau.
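For concreteness, a synthetic data set can be simulated from this generative model; the hyperparameter values and random seed below are arbitrary choices:

import numpy as np

rng = np.random.default_rng(1)
mu_0, lambda_0, a_0, b_0 = 0.0, 1.0, 2.0, 1.0        # fixed, given hyperparameters
N = 50

tau = rng.gamma(shape=a_0, scale=1.0 / b_0)          # tau ~ Gamma(a_0, b_0), b_0 a rate parameter
mu = rng.normal(mu_0, 1.0 / np.sqrt(lambda_0 * tau)) # mu | tau ~ N(mu_0, (lambda_0 tau)^-1)
X = rng.normal(mu, 1.0 / np.sqrt(tau), size=N)       # x_n | mu, tau ~ N(mu, tau^-1)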


The joint probability

The joint probability of all variables can be rewritten as

:p(\mathbf{X},\mu,\tau) = p(\mathbf{X}\mid \mu,\tau)\, p(\mu\mid \tau)\, p(\tau)

where the individual factors are

:\begin{align}
p(\mathbf{X}\mid \mu,\tau) & = \prod_{n=1}^N \mathcal{N}(x_n\mid \mu,\tau^{-1}) \\
p(\mu\mid \tau) & = \mathcal{N}\left(\mu\mid \mu_0, (\lambda_0 \tau)^{-1}\right) \\
p(\tau) & = \operatorname{Gamma}(\tau\mid a_0, b_0)
\end{align}

where

:\begin{align}
\mathcal{N}(x\mid \mu,\sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \\
\operatorname{Gamma}(\tau\mid a,b) & = \frac{1}{\Gamma(a)}\, b^a \tau^{a-1} e^{-b\tau}
\end{align}


Factorized approximation

Assume that q(\mu,\tau) = q(\mu)q(\tau), i.e. that the posterior distribution factorizes into independent factors for \mu and \tau. This type of assumption underlies the variational Bayesian method. The true posterior distribution does not in fact factor this way (in fact, in this simple case, it is known to be a Gaussian-gamma distribution), and hence the result we obtain will be an approximation.


Derivation of q_\mu^*(\mu)

Then

:\begin{align}
\ln q_\mu^*(\mu) &= \operatorname{E}_{\tau}\left[\ln p(\mathbf{X}\mid \mu,\tau) + \ln p(\mu\mid \tau) + \ln p(\tau)\right] + C \\
&= \operatorname{E}_{\tau}\left[\ln p(\mathbf{X}\mid \mu,\tau)\right] + \operatorname{E}_{\tau}\left[\ln p(\mu\mid \tau)\right] + \operatorname{E}_{\tau}\left[\ln p(\tau)\right] + C \\
&= \operatorname{E}_{\tau}\left[\ln \prod_{n=1}^N \mathcal{N}\left(x_n\mid \mu,\tau^{-1}\right)\right] + \operatorname{E}_{\tau}\left[\ln \mathcal{N}\left(\mu\mid \mu_0, (\lambda_0 \tau)^{-1}\right)\right] + C_2 \\
&= \operatorname{E}_{\tau}\left[\sum_{n=1}^N \left(\frac{1}{2}(\ln\tau - \ln 2\pi) - \frac{\tau (x_n-\mu)^2}{2}\right)\right] + \operatorname{E}_{\tau}\left[\frac{1}{2}(\ln \lambda_0 + \ln \tau - \ln 2\pi) - \frac{\lambda_0 \tau (\mu-\mu_0)^2}{2}\right] + C_2 \\
&= \operatorname{E}_{\tau}\left[\sum_{n=1}^N -\frac{\tau (x_n-\mu)^2}{2}\right] + \operatorname{E}_{\tau}\left[-\frac{\lambda_0 \tau (\mu-\mu_0)^2}{2}\right] + C_3 \\
&= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right\} + C_3
\end{align}

In the above derivation, C, C_2 and C_3 refer to values that are constant with respect to \mu. Note that the term \operatorname{E}_{\tau}[\ln p(\tau)] is not a function of \mu and will have the same value regardless of the value of \mu; hence in the third line it can be absorbed into the constant term at the end, and the same is done further down with the remaining terms that do not involve \mu.

The last line is simply a quadratic polynomial in \mu. Since this is the logarithm of q_\mu^*(\mu), we can see that q_\mu^*(\mu) itself is a Gaussian distribution. With a certain amount of tedious math (expanding the squares inside of the braces, separating out and grouping the terms involving \mu and \mu^2, and completing the square over \mu), we can derive the parameters of the Gaussian distribution:

:\ln q_\mu^*(\mu) = -\frac{1}{2} (\lambda_0+N)\operatorname{E}_{\tau}[\tau] \left(\mu-\frac{\lambda_0\mu_0 + \sum_{n=1}^N x_n}{\lambda_0+N}\right)^2 + \text{const.}

Note that all of the above steps can be shortened by using the formula for the sum of two quadratics. In other words:

:\begin{align}
q_\mu^*(\mu) &\sim \mathcal{N}(\mu\mid \mu_N,\lambda_N^{-1}) \\
\mu_N &= \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N} \\
\lambda_N &= (\lambda_0 + N) \operatorname{E}_{\tau}[\tau] \\
\bar{x} &= \frac{1}{N}\sum_{n=1}^N x_n
\end{align}


Derivation of q_\tau^*(\tau)

The derivation of q_\tau^*(\tau) is similar to above, although we omit some of the details for the sake of brevity.

:\begin{align}
\ln q_\tau^*(\tau) &= \operatorname{E}_{\mu}\left[\ln p(\mathbf{X}\mid \mu,\tau) + \ln p(\mu\mid \tau)\right] + \ln p(\tau) + \text{constant} \\
&= (a_0 - 1) \ln \tau - b_0 \tau + \frac{1}{2} \ln \tau + \frac{N}{2} \ln \tau - \frac{\tau}{2} \operatorname{E}_{\mu} \left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu - \mu_0)^2 \right] + \text{constant}
\end{align}

Exponentiating both sides, we can see that q_\tau^*(\tau) is a gamma distribution. Specifically:

:\begin{align}
q_\tau^*(\tau) &\sim \operatorname{Gamma}(\tau\mid a_N, b_N) \\
a_N &= a_0 + \frac{N+1}{2} \\
b_N &= b_0 + \frac{1}{2} \operatorname{E}_{\mu} \left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu - \mu_0)^2\right]
\end{align}


Algorithm for computing the parameters

Let us recap the conclusions from the previous sections:

:\begin{align}
q_\mu^*(\mu) &\sim \mathcal{N}(\mu\mid\mu_N,\lambda_N^{-1}) \\
\mu_N &= \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N} \\
\lambda_N &= (\lambda_0 + N) \operatorname{E}_{\tau}[\tau] \\
\bar{x} &= \frac{1}{N}\sum_{n=1}^N x_n
\end{align}

and

:\begin{align}
q_\tau^*(\tau) &\sim \operatorname{Gamma}(\tau\mid a_N, b_N) \\
a_N &= a_0 + \frac{N+1}{2} \\
b_N &= b_0 + \frac{1}{2} \operatorname{E}_{\mu} \left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu - \mu_0)^2\right]
\end{align}

In each case, the parameters for the distribution over one of the variables depend on expectations taken with respect to the other variable. We can expand the expectations, using the standard formulas for the expectations of moments of the Gaussian and gamma distributions:

:\begin{align}
\operatorname{E}[\tau\mid a_N, b_N] &= \frac{a_N}{b_N} \\
\operatorname{E}[\mu\mid\mu_N,\lambda_N^{-1}] &= \mu_N \\
\operatorname{E}[X^2] &= \operatorname{Var}(X) + (\operatorname{E}[X])^2 \\
\operatorname{E}[\mu^2\mid\mu_N,\lambda_N^{-1}] &= \lambda_N^{-1} + \mu_N^2
\end{align}

Applying these formulas to the above equations is trivial in most cases, but the equation for b_N takes more work:

:\begin{align}
b_N &= b_0 + \frac{1}{2} \operatorname{E}_{\mu} \left[\sum_{n=1}^N (x_n-\mu)^2 + \lambda_0(\mu - \mu_0)^2\right] \\
&= b_0 + \frac{1}{2} \operatorname{E}_{\mu} \left[(\lambda_0+N)\mu^2 - 2\left(\lambda_0\mu_0 + \sum_{n=1}^N x_n\right)\mu + \left(\sum_{n=1}^N x_n^2\right) + \lambda_0\mu_0^2\right] \\
&= b_0 + \frac{1}{2} \left[(\lambda_0+N)\operatorname{E}_{\mu}[\mu^2] - 2\left(\lambda_0\mu_0 + \sum_{n=1}^N x_n\right)\operatorname{E}_{\mu}[\mu] + \left(\sum_{n=1}^N x_n^2\right) + \lambda_0\mu_0^2\right] \\
&= b_0 + \frac{1}{2} \left[(\lambda_0+N)\left(\lambda_N^{-1} + \mu_N^2\right) - 2\left(\lambda_0\mu_0 + \sum_{n=1}^N x_n\right)\mu_N + \left(\sum_{n=1}^N x_n^2\right) + \lambda_0\mu_0^2\right]
\end{align}

We can then write the parameter equations as follows, without any expectations:

:\begin{align}
\mu_N &= \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N} \\
\lambda_N &= (\lambda_0 + N) \frac{a_N}{b_N} \\
\bar{x} &= \frac{1}{N}\sum_{n=1}^N x_n \\
a_N &= a_0 + \frac{N+1}{2} \\
b_N &= b_0 + \frac{1}{2} \left[(\lambda_0+N)\left(\lambda_N^{-1} + \mu_N^2\right) - 2\left(\lambda_0\mu_0 + \sum_{n=1}^N x_n\right)\mu_N + \left(\sum_{n=1}^N x_n^2\right) + \lambda_0\mu_0^2\right]
\end{align}

Note that there are circular dependencies among the formulas for \lambda_N and b_N. This naturally suggests an EM-like algorithm (sketched in code below):

#Compute \sum_{n=1}^N x_n and \sum_{n=1}^N x_n^2. Use these values to compute \mu_N and a_N.
#Initialize \lambda_N to some arbitrary value.
#Use the current value of \lambda_N, along with the known values of the other parameters, to compute b_N.
#Use the current value of b_N, along with the known values of the other parameters, to compute \lambda_N.
#Repeat the last two steps until convergence (i.e. until neither value has changed more than some small amount).

We then have values for the hyperparameters of the approximating distributions of the posterior parameters, which we can use to compute any properties we want of the posterior, e.g. its mean and variance, a 95% highest-density region (the smallest interval that includes 95% of the total probability), etc.

It can be shown that this algorithm is guaranteed to converge to a local maximum. Note also that the posterior distributions have the same form as the corresponding prior distributions. We did ''not'' assume this; the only assumption we made was that the distributions factorize, and the form of the distributions followed naturally. It turns out (see below) that the fact that the posterior distributions have the same form as the prior distributions is not a coincidence, but a general result whenever the prior distributions are members of the exponential family, which is the case for most of the standard distributions.
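The iteration can be sketched in code as follows, directly following the update equations above (the synthetic data and hyperparameter values are arbitrary):

import numpy as np

def variational_gaussian(X, mu_0=0.0, lambda_0=1e-3, a_0=1e-3, b_0=1e-3,
                         max_iter=1000, tol=1e-10):
    """Alternating updates for q_mu* = N(mu_N, 1/lambda_N) and q_tau* = Gamma(a_N, b_N)."""
    N = len(X)
    sum_x = float(np.sum(X))
    sum_x2 = float(np.sum(X ** 2))
    x_bar = sum_x / N

    # These two parameters never change during the iteration.
    mu_N = (lambda_0 * mu_0 + N * x_bar) / (lambda_0 + N)
    a_N = a_0 + (N + 1) / 2.0

    lambda_N = 1.0                                    # arbitrary initialization
    for _ in range(max_iter):
        # b_N depends on lambda_N through E[mu^2] = 1/lambda_N + mu_N^2.
        b_N = b_0 + 0.5 * ((lambda_0 + N) * (1.0 / lambda_N + mu_N ** 2)
                           - 2.0 * (lambda_0 * mu_0 + sum_x) * mu_N
                           + sum_x2 + lambda_0 * mu_0 ** 2)
        # lambda_N depends on b_N through E[tau] = a_N / b_N.
        new_lambda_N = (lambda_0 + N) * a_N / b_N
        converged = abs(new_lambda_N - lambda_N) < tol
        lambda_N = new_lambda_N
        if converged:
            break
    return mu_N, lambda_N, a_N, b_N

rng = np.random.default_rng(0)
X = rng.normal(5.0, 0.5, size=200)                    # synthetic data with mu = 5, tau = 4
mu_N, lambda_N, a_N, b_N = variational_gaussian(X)
print("E[mu]  =", mu_N)                               # close to 5
print("E[tau] =", a_N / b_N)                          # close to 4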


Further discussion


Step-by-step recipe

The above example shows the method by which the variational-Bayesian approximation to a posterior probability density in a given Bayesian network is derived:

#Describe the network with a graphical model, identifying the observed variables (data) \mathbf{X} and unobserved variables (parameters \boldsymbol\Theta and latent variables \mathbf{Z}) and their conditional probability distributions. Variational Bayes will then construct an approximation to the posterior probability p(\mathbf{Z},\boldsymbol\Theta\mid\mathbf{X}). The approximation has the basic property that it is a factorized distribution, i.e. a product of two or more independent distributions over disjoint subsets of the unobserved variables.
#Partition the unobserved variables into two or more subsets, over which the independent factors will be derived. There is no universal procedure for doing this; creating too many subsets yields a poor approximation, while creating too few makes the entire variational Bayes procedure intractable. Typically, the first split is to separate the parameters and latent variables; often, this is enough by itself to produce a tractable result. Assume that the partitions are called \mathbf{Z}_1,\ldots,\mathbf{Z}_M.
#For a given partition \mathbf{Z}_j, write down the formula for the best approximating distribution q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) using the basic equation \ln q_j^{*}(\mathbf{Z}_j\mid \mathbf{X}) = \operatorname{E}_{i \neq j}[\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}.
#Fill in the formula for the joint probability distribution using the graphical model. Any component conditional distributions that don't involve any of the variables in \mathbf{Z}_j can be ignored; they will be folded into the constant term.
#Simplify the formula and apply the expectation operator, following the above example. Ideally, this should simplify into expectations of basic functions of variables not in \mathbf{Z}_j (e.g. first or second raw moments, expectation of a logarithm, etc.). In order for the variational Bayes procedure to work well, these expectations should generally be expressible analytically as functions of the parameters and/or hyperparameters of the distributions of these variables. In all cases, these expectation terms are constants with respect to the variables in the current partition.
#The functional form of the formula with respect to the variables in the current partition indicates the type of distribution. In particular, exponentiating the formula generates the probability density function (PDF) of the distribution (or at least, something proportional to it, with unknown normalization constant). In order for the overall method to be tractable, it should be possible to recognize the functional form as belonging to a known distribution. Significant mathematical manipulation may be required to convert the formula into a form that matches the PDF of a known distribution. When this can be done, the normalization constant can be reinstated by definition, and equations for the parameters of the known distribution can be derived by extracting the appropriate parts of the formula.
#When all expectations can be replaced analytically with functions of variables not in the current partition, and the PDF put into a form that allows identification with a known distribution, the result is a set of equations expressing the values of the optimum parameters as functions of the parameters of variables in other partitions.
#When this procedure can be applied to all partitions, the result is a set of mutually linked equations specifying the optimum values of all parameters.
#An expectation maximization (EM) type procedure is then applied, picking an initial value for each parameter and then iterating through a series of steps, where at each step we cycle through the equations, updating each parameter in turn. This is guaranteed to converge.


Most important points

Due to all of the mathematical manipulations involved, it is easy to lose track of the big picture. The important things are:

#The idea of variational Bayes is to construct an analytical approximation to the posterior probability of the set of unobserved variables (parameters and latent variables), given the data. This means that the form of the solution is similar to other Bayesian inference methods, such as Gibbs sampling, i.e. a distribution that seeks to describe everything that is known about the variables. As in other Bayesian methods, but unlike e.g. in expectation maximization (EM) or other maximum likelihood methods, both types of unobserved variables (i.e. parameters and latent variables) are treated the same, i.e. as random variables. Estimates for the variables can then be derived in the standard Bayesian ways, e.g. calculating the mean of the distribution to get a single point estimate or deriving a credible interval, highest density region, etc.
#"Analytical approximation" means that a formula can be written down for the posterior distribution. The formula generally consists of a product of well-known probability distributions, each of which ''factorizes'' over a set of unobserved variables (i.e. it is conditionally independent of the other variables, given the observed data). This formula is not the true posterior distribution, but an approximation to it; in particular, it will generally agree fairly closely in the lowest moments of the unobserved variables, e.g. the mean and variance.
#The result of all of the mathematical manipulations is (1) the identity of the probability distributions making up the factors, and (2) mutually dependent formulas for the parameters of these distributions. The actual values of these parameters are computed numerically, through an alternating iterative procedure much like EM.


Compared with expectation maximization (EM)

Variational Bayes (VB) is often compared with expectation maximization (EM). The actual numerical procedure is quite similar, in that both are alternating iterative procedures that successively converge on optimum parameter values. The initial steps to derive the respective procedures are also vaguely similar, both starting out with formulas for probability densities and both involving significant amounts of mathematical manipulations.

However, there are a number of differences. Most important is ''what'' is being computed.
*EM computes point estimates of the posterior distribution of those random variables that can be categorized as "parameters", but only estimates of the actual posterior distributions of the latent variables (at least in "soft EM", and often only when the latent variables are discrete). The point estimates computed are the modes of these parameters; no other information is available.
*VB, on the other hand, computes estimates of the actual posterior distribution of all variables, both parameters and latent variables. When point estimates need to be derived, generally the mean is used rather than the mode, as is normal in Bayesian inference. Concomitant with this, the parameters computed in VB do ''not'' have the same significance as those in EM. EM computes optimum values of the parameters of the Bayes network itself. VB computes optimum values of the parameters of the distributions used to approximate the parameters and latent variables of the Bayes network. For example, a typical Gaussian mixture model will have parameters for the mean and variance of each of the mixture components. EM would directly estimate optimum values for these parameters. VB, however, would first fit a distribution to these parameters, typically in the form of a prior distribution, e.g. a normal-scaled inverse gamma distribution, and would then compute values for the parameters of this prior distribution, i.e. essentially hyperparameters. In this case, VB would compute optimum estimates of the four parameters of the normal-scaled inverse gamma distribution that describes the joint distribution of the mean and variance of the component.


A more complex example

Imagine a Bayesian Gaussian mixture model described as follows:

:\begin{align}
\boldsymbol\pi & \sim \operatorname{SymDir}(K, \alpha_0) \\
\boldsymbol\Lambda_{i=1 \dots K} & \sim \mathcal{W}(\mathbf{W}_0, \nu_0) \\
\boldsymbol\mu_{i=1 \dots K} & \sim \mathcal{N}(\boldsymbol\mu_0, (\beta_0 \boldsymbol\Lambda_i)^{-1}) \\
\mathbf{z}[i=1 \dots N] & \sim \operatorname{Mult}(1, \boldsymbol\pi) \\
\mathbf{x}_{i=1 \dots N} & \sim \mathcal{N}(\boldsymbol\mu_{z_i}, \boldsymbol\Lambda_{z_i}^{-1}) \\
K &= \text{number of mixing components} \\
N &= \text{number of data points}
\end{align}

Note:
* SymDir() is the symmetric Dirichlet distribution of dimension K, with the hyperparameter for each component set to \alpha_0. The Dirichlet distribution is the conjugate prior of the categorical distribution or multinomial distribution.
* \mathcal{W}() is the Wishart distribution, which is the conjugate prior of the precision matrix (inverse covariance matrix) for a multivariate Gaussian distribution.
* Mult() is a multinomial distribution over a single observation (equivalent to a categorical distribution). The state space is a "one-of-K" representation, i.e. a K-dimensional vector in which one of the elements is 1 (specifying the identity of the observation) and all other elements are 0.
* \mathcal{N}() is the Gaussian distribution, in this case specifically the multivariate Gaussian distribution.

The interpretation of the above variables is as follows:
* \mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\} is the set of N data points, each of which is a D-dimensional vector distributed according to a multivariate Gaussian distribution.
* \mathbf{Z} = \{\mathbf{z}_1, \dots, \mathbf{z}_N\} is a set of latent variables, one per data point, specifying which mixture component the corresponding data point belongs to, using a "one-of-K" vector representation with components z_{nk} for k = 1 \dots K, as described above.
* \boldsymbol\pi is the vector of mixing proportions for the K mixture components.
* \boldsymbol\mu_{1 \dots K} and \boldsymbol\Lambda_{1 \dots K} specify the parameters (mean and precision) associated with each mixture component.
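To make the generative process concrete, the following is a minimal Python sketch (not part of the original text) that draws a synthetic data set from exactly this model; the default hyperparameter values, the function name, and the use of SciPy's Wishart sampler are illustrative assumptions.

import numpy as np
from scipy.stats import wishart

def sample_bayesian_gmm(N=500, K=3, D=2, alpha0=1.0, beta0=1.0, nu0=None,
                        mu0=None, W0=None, seed=0):
    """Draw (X, z, pi, mu, Lambda) from the Bayesian Gaussian mixture model above."""
    rng = np.random.default_rng(seed)
    nu0 = D if nu0 is None else nu0                  # Wishart degrees of freedom
    mu0 = np.zeros(D) if mu0 is None else mu0
    W0 = np.eye(D) if W0 is None else W0

    pi = rng.dirichlet(np.full(K, alpha0))           # mixing proportions ~ SymDir
    Lambda = wishart.rvs(df=nu0, scale=W0, size=K, random_state=rng)  # (K, D, D) precisions
    mu = np.stack([rng.multivariate_normal(mu0, np.linalg.inv(beta0 * Lambda[k]))
                   for k in range(K)])               # component means
    z = rng.choice(K, size=N, p=pi)                  # component assignments
    X = np.stack([rng.multivariate_normal(mu[z[n]], np.linalg.inv(Lambda[z[n]]))
                  for n in range(N)])                # observations
    return X, z, pi, mu, Lambda

A data set generated this way, e.g. X, z, pi, mu, Lambda = sample_bayesian_gmm(), can serve as input to the variational updates derived below.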
The joint probability of all variables can be rewritten as

:p(\mathbf{X},\mathbf{Z},\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\Lambda) = p(\mathbf{X}\mid \mathbf{Z},\boldsymbol\mu,\boldsymbol\Lambda)\, p(\mathbf{Z}\mid \boldsymbol\pi)\, p(\boldsymbol\pi)\, p(\boldsymbol\mu\mid \boldsymbol\Lambda)\, p(\boldsymbol\Lambda)

where the individual factors are

:\begin{align}
p(\mathbf{X}\mid \mathbf{Z},\boldsymbol\mu,\boldsymbol\Lambda) & = \prod_{n=1}^N \prod_{k=1}^K \mathcal{N}(\mathbf{x}_n\mid \boldsymbol\mu_k,\boldsymbol\Lambda_k^{-1})^{z_{nk}} \\
p(\mathbf{Z}\mid \boldsymbol\pi) & = \prod_{n=1}^N \prod_{k=1}^K \pi_k^{z_{nk}} \\
p(\boldsymbol\pi) & = \frac{\Gamma(K\alpha_0)}{\Gamma(\alpha_0)^K} \prod_{k=1}^K \pi_k^{\alpha_0 - 1} \\
p(\boldsymbol\mu\mid \boldsymbol\Lambda) & = \prod_{k=1}^K \mathcal{N}(\boldsymbol\mu_k\mid \boldsymbol\mu_0,(\beta_0 \boldsymbol\Lambda_k)^{-1}) \\
p(\boldsymbol\Lambda) & = \prod_{k=1}^K \mathcal{W}(\boldsymbol\Lambda_k\mid \mathbf{W}_0, \nu_0)
\end{align}

where

:\begin{align}
\mathcal{N}(\mathbf{x}\mid \boldsymbol\mu,\boldsymbol\Sigma) & = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol\Sigma|^{1/2}} \exp \left\{ -\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^{\rm T} \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu) \right\} \\
\mathcal{W}(\boldsymbol\Lambda\mid \mathbf{W},\nu) & = B(\mathbf{W},\nu)\, |\boldsymbol\Lambda|^{(\nu-D-1)/2} \exp \left(-\frac{1}{2} \operatorname{Tr}(\mathbf{W}^{-1}\boldsymbol\Lambda) \right) \\
B(\mathbf{W},\nu) & = |\mathbf{W}|^{-\nu/2} \left\{ 2^{\nu D/2} \pi^{D(D-1)/4} \prod_{i=1}^D \Gamma\left(\frac{\nu + 1 - i}{2}\right)\right\}^{-1} \\
D & = \text{dimensionality of each data point}
\end{align}

Assume that q(\mathbf{Z},\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\Lambda) = q(\mathbf{Z})\,q(\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\Lambda). Then

:\begin{align}
\ln q^*(\mathbf{Z}) &= \operatorname{E}_{\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\Lambda}[\ln p(\mathbf{X},\mathbf{Z},\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\Lambda)] + \text{constant} \\
&= \operatorname{E}_{\boldsymbol\pi}[\ln p(\mathbf{Z}\mid \boldsymbol\pi)] + \operatorname{E}_{\boldsymbol\mu,\boldsymbol\Lambda}[\ln p(\mathbf{X}\mid \mathbf{Z},\boldsymbol\mu,\boldsymbol\Lambda)] + \text{constant} \\
&= \sum_{n=1}^N \sum_{k=1}^K z_{nk} \ln \rho_{nk} + \text{constant}
\end{align}

where we have defined

:\ln \rho_{nk} = \operatorname{E}[\ln \pi_k] + \frac{1}{2} \operatorname{E}[\ln |\boldsymbol\Lambda_k|] - \frac{D}{2} \ln(2\pi) - \frac{1}{2} \operatorname{E}_{\boldsymbol\mu_k,\boldsymbol\Lambda_k} [(\mathbf{x}_n - \boldsymbol\mu_k)^{\rm T} \boldsymbol\Lambda_k (\mathbf{x}_n - \boldsymbol\mu_k)]

Exponentiating both sides of the formula for \ln q^*(\mathbf{Z}) yields

:q^*(\mathbf{Z}) \propto \prod_{n=1}^N \prod_{k=1}^K \rho_{nk}^{z_{nk}}

Requiring that this be normalized ends up requiring that, for each n, the normalized values sum to 1 over all values of k, yielding

:q^*(\mathbf{Z}) = \prod_{n=1}^N \prod_{k=1}^K r_{nk}^{z_{nk}}

where

:r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^K \rho_{nj}}

In other words, q^*(\mathbf{Z}) is a product of single-observation multinomial distributions, and factors over each individual \mathbf{z}_n, which is distributed as a single-observation multinomial distribution with parameters r_{nk} for k = 1 \dots K. Furthermore, we note that

:\operatorname{E}[z_{nk}] = r_{nk}

which is a standard result for categorical distributions.
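In practice, the normalization of \rho_{nk} into the responsibilities r_{nk} is a row-wise softmax of \ln \rho_{nk}. A minimal Python sketch (not from the original text), using the log-sum-exp trick for numerical stability; the array names are illustrative:

import numpy as np

def responsibilities(log_rho):
    """Convert unnormalized log responsibilities ln rho_{nk} (shape N x K) into
    r_{nk} = rho_{nk} / sum_j rho_{nj}, computed stably in log space."""
    log_rho = np.asarray(log_rho, dtype=float)
    m = log_rho.max(axis=1, keepdims=True)                         # log-sum-exp trick
    log_norm = m + np.log(np.exp(log_rho - m).sum(axis=1, keepdims=True))
    return np.exp(log_rho - log_norm)                              # each row sums to 1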
Now, considering the factor q(\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\Lambda), note that it automatically factors into q(\boldsymbol\pi) \prod_{k=1}^K q(\boldsymbol\mu_k,\boldsymbol\Lambda_k) due to the structure of the graphical model defining our Gaussian mixture model, which is specified above. Then,

:\begin{align}
\ln q^*(\boldsymbol\pi) &= \ln p(\boldsymbol\pi) + \operatorname{E}_{\mathbf{Z}}[\ln p(\mathbf{Z}\mid \boldsymbol\pi)] + \text{constant} \\
&= (\alpha_0 - 1) \sum_{k=1}^K \ln \pi_k + \sum_{n=1}^N \sum_{k=1}^K r_{nk} \ln \pi_k + \text{constant}
\end{align}

Taking the exponential of both sides, we recognize q^*(\boldsymbol\pi) as a Dirichlet distribution

:q^*(\boldsymbol\pi) \sim \operatorname{Dir}(\boldsymbol\alpha)

where

:\alpha_k = \alpha_0 + N_k \qquad \text{and} \qquad N_k = \sum_{n=1}^N r_{nk}
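In code, this Dirichlet update is just the prior concentration plus the soft counts; a one-function sketch (illustrative names, not from the original text):

import numpy as np

def update_dirichlet(alpha0, r):
    """q*(pi) = Dir(alpha) with alpha_k = alpha_0 + N_k, where N_k = sum_n r_nk."""
    return alpha0 + np.asarray(r).sum(axis=0)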
Finally,

:\ln q^*(\boldsymbol\mu_k,\boldsymbol\Lambda_k) = \ln p(\boldsymbol\mu_k,\boldsymbol\Lambda_k) + \sum_{n=1}^N \operatorname{E}[z_{nk}]\, \ln \mathcal{N}(\mathbf{x}_n\mid \boldsymbol\mu_k,\boldsymbol\Lambda_k^{-1}) + \text{constant}

Grouping and reading off terms involving \boldsymbol\mu_k and \boldsymbol\Lambda_k, the result is a Gaussian-Wishart distribution given by

:q^*(\boldsymbol\mu_k,\boldsymbol\Lambda_k) = \mathcal{N}(\boldsymbol\mu_k\mid \mathbf{m}_k,(\beta_k \boldsymbol\Lambda_k)^{-1})\, \mathcal{W}(\boldsymbol\Lambda_k\mid \mathbf{W}_k,\nu_k)

given the definitions

:\begin{align}
\beta_k &= \beta_0 + N_k \\
\mathbf{m}_k &= \frac{1}{\beta_k} (\beta_0 \boldsymbol\mu_0 + N_k \bar{\mathbf{x}}_k) \\
\mathbf{W}_k^{-1} &= \mathbf{W}_0^{-1} + N_k \mathbf{S}_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{\mathbf{x}}_k - \boldsymbol\mu_0)(\bar{\mathbf{x}}_k - \boldsymbol\mu_0)^{\rm T} \\
\nu_k &= \nu_0 + N_k \\
N_k &= \sum_{n=1}^N r_{nk} \\
\bar{\mathbf{x}}_k &= \frac{1}{N_k} \sum_{n=1}^N r_{nk} \mathbf{x}_n \\
\mathbf{S}_k &= \frac{1}{N_k} \sum_{n=1}^N r_{nk} (\mathbf{x}_n - \bar{\mathbf{x}}_k) (\mathbf{x}_n - \bar{\mathbf{x}}_k)^{\rm T}
\end{align}
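A minimal Python sketch (not in the original text) of these Gaussian-Wishart parameter updates, given data X (N x D), responsibilities r (N x K), and the prior hyperparameters; the function and variable names are illustrative assumptions.

import numpy as np

def update_gaussian_wishart(X, r, mu0, beta0, W0, nu0):
    """Compute beta_k, m_k, W_k, nu_k for every component k from the
    soft-count statistics N_k, xbar_k and S_k defined above."""
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0) + 1e-10                        # soft counts, guarded against zero
    xbar = (r.T @ X) / Nk[:, None]                    # weighted component means
    beta = beta0 + Nk
    m = (beta0 * mu0 + Nk[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nk
    W = np.empty((K, D, D))
    W0_inv = np.linalg.inv(W0)
    for k in range(K):
        diff = X - xbar[k]                            # N x D deviations
        Sk = (r[:, k, None] * diff).T @ diff / Nk[k]  # weighted scatter S_k
        dm = (xbar[k] - mu0)[:, None]
        W_inv = W0_inv + Nk[k] * Sk + beta0 * Nk[k] / (beta0 + Nk[k]) * (dm @ dm.T)
        W[k] = np.linalg.inv(W_inv)
    return beta, m, W, nu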
Finally, notice that these functions require the values of r_{nk}, which make use of \rho_{nk}, which is defined in turn based on \operatorname{E}[\ln \pi_k], \operatorname{E}[\ln |\boldsymbol\Lambda_k|], and \operatorname{E}_{\boldsymbol\mu_k,\boldsymbol\Lambda_k} [(\mathbf{x}_n - \boldsymbol\mu_k)^{\rm T} \boldsymbol\Lambda_k (\mathbf{x}_n - \boldsymbol\mu_k)]. Now that we have determined the distributions over which these expectations are taken, we can derive formulas for them:

:\begin{align}
\operatorname{E}_{\boldsymbol\mu_k,\boldsymbol\Lambda_k} [(\mathbf{x}_n - \boldsymbol\mu_k)^{\rm T} \boldsymbol\Lambda_k (\mathbf{x}_n - \boldsymbol\mu_k)] & = D\beta_k^{-1} + \nu_k (\mathbf{x}_n - \mathbf{m}_k)^{\rm T} \mathbf{W}_k (\mathbf{x}_n - \mathbf{m}_k) \\
\ln \widetilde{\Lambda}_k &\equiv \operatorname{E}[\ln |\boldsymbol\Lambda_k|] = \sum_{i=1}^D \psi \left(\frac{\nu_k + 1 - i}{2}\right) + D \ln 2 + \ln |\mathbf{W}_k| \\
\ln \widetilde{\pi}_k &\equiv \operatorname{E}[\ln \pi_k] = \psi(\alpha_k) - \psi\left(\sum_{i=1}^K \alpha_i\right)
\end{align}

These results lead to

:r_{nk} \propto \widetilde{\pi}_k\, \widetilde{\Lambda}_k^{1/2} \exp \left\{ -\frac{D}{2 \beta_k} - \frac{\nu_k}{2} (\mathbf{x}_n - \mathbf{m}_k)^{\rm T} \mathbf{W}_k (\mathbf{x}_n - \mathbf{m}_k) \right\}

These can be converted from proportional to absolute values by normalizing over k so that the corresponding values sum to 1. Note that:
#The update equations for the parameters \beta_k, \mathbf{m}_k, \mathbf{W}_k and \nu_k of the variables \boldsymbol\mu_k and \boldsymbol\Lambda_k depend on the statistics N_k, \bar{\mathbf{x}}_k, and \mathbf{S}_k, and these statistics in turn depend on r_{nk}.
#The update equations for the parameters \alpha_{1 \dots K} of the variable \boldsymbol\pi depend on the statistic N_k, which depends in turn on r_{nk}.
#The update equation for r_{nk} has a direct circular dependence on \beta_k, \mathbf{m}_k, \mathbf{W}_k and \nu_k as well as an indirect circular dependence on \mathbf{W}_k, \nu_k and \alpha_{1 \dots K} through \widetilde{\pi}_k and \widetilde{\Lambda}_k.

This suggests an iterative procedure that alternates between two steps (a complete iteration is sketched in code below):
#An E-step that computes the value of r_{nk} using the current values of all the other parameters.
#An M-step that uses the new value of r_{nk} to compute new values of all the other parameters.

Note that these steps correspond closely with the standard EM algorithm to derive a maximum likelihood or maximum a posteriori (MAP) solution for the parameters of a Gaussian mixture model. The responsibilities r_{nk} in the E step correspond closely to the posterior probabilities of the latent variables given the data, i.e. p(\mathbf{Z}\mid \mathbf{X}); the computation of the statistics N_k, \bar{\mathbf{x}}_k, and \mathbf{S}_k corresponds closely to the computation of corresponding "soft-count" statistics over the data; and the use of those statistics to compute new values of the parameters corresponds closely to the use of soft counts to compute new parameter values in normal EM over a Gaussian mixture model.
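Putting the updates together, the following is a minimal, self-contained Python sketch (not from the original text) of the full variational iteration for this model: the M-step recomputes (\alpha_k, \beta_k, \mathbf{m}_k, \mathbf{W}_k, \nu_k) from the soft counts, and the E-step recomputes r_{nk} from the expectations above. The hyperparameter defaults and function name are illustrative assumptions.

import numpy as np
from scipy.special import digamma

def vb_gmm(X, K, n_iter=100, alpha0=1.0, beta0=1.0, nu0=None, seed=0):
    """Mean-field variational Bayes for a Gaussian mixture (illustrative sketch).
    Returns the responsibilities r and the variational parameters."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    nu0 = D if nu0 is None else nu0
    mu0, W0_inv = X.mean(axis=0), np.eye(D)              # illustrative prior choices

    r = rng.dirichlet(np.ones(K), size=N)                # random initial responsibilities
    for _ in range(n_iter):
        # M-step: update alpha_k, beta_k, m_k, W_k, nu_k from the soft counts
        Nk = r.sum(axis=0) + 1e-10
        xbar = (r.T @ X) / Nk[:, None]
        alpha = alpha0 + Nk
        beta = beta0 + Nk
        nu = nu0 + Nk
        m = (beta0 * mu0 + Nk[:, None] * xbar) / beta[:, None]
        W = np.empty((K, D, D))
        for k in range(K):
            diff = X - xbar[k]
            Sk = (r[:, k, None] * diff).T @ diff / Nk[k]
            dm = (xbar[k] - mu0)[:, None]
            W[k] = np.linalg.inv(W0_inv + Nk[k] * Sk
                                 + beta0 * Nk[k] / (beta0 + Nk[k]) * (dm @ dm.T))

        # E-step: recompute ln rho_{nk} from the expectations, then normalize to r_{nk}
        log_pi_t = digamma(alpha) - digamma(alpha.sum())
        log_rho = np.empty((N, K))
        for k in range(K):
            log_det_t = (digamma((nu[k] + 1 - np.arange(1, D + 1)) / 2).sum()
                         + D * np.log(2) + np.log(np.linalg.det(W[k])))
            diff = X - m[k]
            quad = D / beta[k] + nu[k] * np.einsum('ni,ij,nj->n', diff, W[k], diff)
            log_rho[:, k] = (log_pi_t[k] + 0.5 * log_det_t
                             - 0.5 * D * np.log(2 * np.pi) - 0.5 * quad)
        log_rho -= log_rho.max(axis=1, keepdims=True)    # stabilize before exponentiating
        r = np.exp(log_rho)
        r /= r.sum(axis=1, keepdims=True)
    return r, (alpha, beta, m, W, nu)

In practice one would also monitor the evidence lower bound for convergence and handle nearly empty components; this sketch omits both for brevity.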


Exponential-family distributions

Note that in the previous example, once the distribution over unobserved variables was assumed to factorize into distributions over the "parameters" and distributions over the "latent data", the derived "best" distribution for each variable was in the same family as the corresponding prior distribution over the variable. This is a general result that holds true for all prior distributions derived from the exponential family.
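A compact way to see why, sketched in generic exponential-family notation (the symbols h, g, u, \eta, \chi_0, \nu_0 are an assumed parameterization for illustration and are not defined in the original text): if each complete-data term is exponential-family in the natural parameter \eta and the prior on \eta is conjugate to it, the mean-field update reproduces the prior's functional form with updated hyperparameters.

:\begin{align}
p(\mathbf{x}_n, \mathbf{z}_n \mid \eta) &= h(\mathbf{x}_n, \mathbf{z}_n)\, g(\eta)\, \exp\{\eta^{\rm T} u(\mathbf{x}_n, \mathbf{z}_n)\} \\
p(\eta \mid \chi_0, \nu_0) &\propto g(\eta)^{\nu_0} \exp\{\eta^{\rm T} \chi_0\} \\
\ln q^*(\eta) &= \ln p(\eta) + \sum_{n=1}^N \operatorname{E}_{q(\mathbf{z}_n)}[\ln p(\mathbf{x}_n, \mathbf{z}_n \mid \eta)] + \text{constant} \\
\Rightarrow\quad q^*(\eta) &\propto g(\eta)^{\nu_0 + N} \exp\Big\{\eta^{\rm T} \Big(\chi_0 + \sum_{n=1}^N \operatorname{E}_{q(\mathbf{z}_n)}[u(\mathbf{x}_n, \mathbf{z}_n)]\Big)\Big\}
\end{align}

That is, q^*(\eta) stays in the conjugate family, with the hyperparameters updated by the expected sufficient statistics, exactly as seen in the Gaussian mixture example above.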


See also

* Variational message passing: a modular algorithm for variational Bayesian inference.
* Variational autoencoder: an artificial neural network belonging to the families of probabilistic graphical models and variational Bayesian methods.
* Expectation-maximization algorithm: a related approach which corresponds to a special case of variational Bayesian inference.
* Generalized filtering: a variational filtering scheme for nonlinear state-space models.
* Calculus of variations: the field of mathematical analysis that deals with maximizing or minimizing functionals.
* Maximum entropy discrimination: a variational inference framework that allows for introducing and accounting for additional large-margin constraints. (Sotirios P. Chatzis, "Infinite Markov-Switching Maximum Entropy Discrimination Machines," Proc. 30th International Conference on Machine Learning (ICML), Journal of Machine Learning Research: Workshop and Conference Proceedings, vol. 28, no. 3, pp. 729–737, June 2013.)


External links


* The on-line textbook ''Information Theory, Inference, and Learning Algorithms'' by David J.C. MacKay provides an introduction to variational methods (p. 422).
* ''A Tutorial on Variational Bayes''. Fox, C. and Roberts, S. 2012. Artificial Intelligence Review.
* Variational-Bayes Repository: a repository of research papers, software, and links related to the use of variational methods for approximate Bayesian learning up to 2003.

* The thesis ''Variational Algorithms for Approximate Bayesian Inference'' by M. J. Beal includes comparisons of EM to Variational Bayesian EM and derivations of several models including Variational Bayesian HMMs.

* A high-level explanation of variational inference by Jason Eisner may be worth reading before a more mathematically detailed treatment.
* ''Copula Variational Bayes inference via information geometry'' (pdf), by Tran, V.H. 2018. This paper is primarily written for students. Via Bregman divergence, the paper shows that variational Bayes is simply a generalized Pythagorean projection of the true model onto an arbitrarily correlated (copula) distributional space, of which the independent space is merely a special case.