In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling, belonging to the families of probabilistic graphical models and variational Bayesian methods.
Variational autoencoders are often associated with the autoencoder model because of their architectural affinity, but they differ significantly in goal and mathematical formulation. Variational autoencoders are probabilistic generative models that require neural networks as only a part of their overall structure, as e.g. in VQ-VAE. The neural network components are typically referred to as the encoder and the decoder. The encoder maps the input variable to a latent space that corresponds to the parameters of a variational distribution; in this way, the encoder can produce multiple different samples that all come from the same distribution. The decoder has the opposite function: it maps from the latent space to the input space in order to produce or generate data points. Both networks are typically trained together with the usage of the reparameterization trick, although the variance of the noise model can be learned separately.
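To make this architectural description concrete, the following is a minimal sketch of such an encoder–decoder pair, assuming a PyTorch-style implementation; the class name, layer sizes, and activation choices are illustrative rather than part of the original formulation.

```python
# Minimal VAE sketch (illustrative; layer sizes and names are arbitrary).
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        # Encoder: maps x to the parameters (mean, log-variance) of q(z|x).
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.enc_mu = nn.Linear(hidden_dim, latent_dim)
        self.enc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: maps a latent sample z back to the input space.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I)  (reparameterization trick)
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar
```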
Although this type of model was initially designed for unsupervised learning, its effectiveness has been proven for semi-supervised learning and supervised learning.
Overview of architecture and operation
A variational autoencoder is a generative model with a prior distribution over the latent variables and a noise (observation) distribution over the data. Usually such models are trained using the expectation-maximization meta-algorithm (e.g. probabilistic PCA, (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually intractable, and in doing so requires the discovery of q-distributions, or variational posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. This neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder.
The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to the variance; however, this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent.
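As a sketch of this simplification, a decoder can output only the mean of a Gaussian noise model while a single shared log-variance is treated as a free parameter optimized by gradient descent along with the network weights. The PyTorch-style code below is illustrative; the class name, layer sizes, and the choice of a scalar log-variance are assumptions, not part of the original formulation.

```python
# Decoder outputting only the mean of the Gaussian noise model, with a single
# shared log-variance learned directly by gradient descent (illustrative).
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    def __init__(self, latent_dim=20, output_dim=784):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(latent_dim, 400), nn.ReLU(),
            nn.Linear(400, output_dim),
        )
        # Global observation log-variance, optimized together with the weights.
        self.log_var = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        return self.mean_net(z), self.log_var

# Negative Gaussian log-likelihood used as the reconstruction term
# (the additive constant 0.5 * log(2*pi) per dimension is omitted):
def gaussian_nll(x, mean, log_var):
    return 0.5 * (((x - mean) ** 2) / log_var.exp() + log_var).sum(dim=-1)
```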
To optimize this model, one needs to know two terms: the "reconstruction error" and the Kullback–Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data. The KL-D from the free energy expression maximizes the probability mass of the q distribution that overlaps with the p distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression; a sampling approximation is required to compute its expectation value.
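Under the common choice of a standard normal prior, a diagonal Gaussian variational posterior, and a Bernoulli noise model, the two terms take a particularly simple form: the KL divergence has a closed-form expression, while the reconstruction term is estimated from a sample drawn from the variational posterior. The following PyTorch-style sketch illustrates this; it assumes those distributional choices and is not the only possible instantiation.

```python
# Sketch of the two training terms for a VAE with prior p(z) = N(0, I),
# posterior q(z|x) = N(mu, diag(sigma^2)), and a Bernoulli decoder.
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction term: a one-sample estimate, using the reconstruction
    # x_recon produced from a sample z ~ q(z|x).
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL(q(z|x) || p(z)) in closed form for two Gaussians:
    # 0.5 * sum(mu^2 + sigma^2 - 1 - log(sigma^2))
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar)
    return recon + kl  # negative ELBO (up to additive constants)
```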
Formulation
From the point of view of probabilistic modelling, one wants to maximize the likelihood of the data x by their chosen parameterized probability distribution p_\theta(x) = p(x|\theta). This distribution is usually chosen to be a Gaussian N(x|\mu,\sigma) which is parameterized by \mu and \sigma respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize; however, distributions where a prior is assumed over the latents z result in intractable integrals. Let us find p_\theta(x) via marginalizing over z:
: p_\theta(x) = \int_z p_\theta(x, z) \, dz,
where p_\theta(x, z) represents the joint distribution under p_\theta of the observable data x and its latent representation or encoding z. According to the chain rule, the equation can be rewritten as
: p_\theta(x) = \int_z p_\theta(x|z) p_\theta(z) \, dz.
In the vanilla variational autoencoder, z is usually taken to be a finite-dimensional vector of real numbers, and p_\theta(x|z) to be a Gaussian distribution. Then p_\theta(x) is a mixture of Gaussian distributions.
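To see what this marginal looks like computationally, the integral can in principle be approximated by naive Monte Carlo sampling from the prior, as in the sketch below; the decoder_mean function and the dimensions are hypothetical, and the estimator is shown only to illustrate why directly evaluating p_\theta(x) is impractical (its variance is typically enormous).

```python
# Naive Monte Carlo estimate of p(x) = \int p(x|z) p(z) dz under a standard
# normal prior and a Gaussian likelihood N(x | decoder_mean(z), I).
import numpy as np

def log_likelihood_estimate(x, decoder_mean, latent_dim, n_samples=10_000):
    z = np.random.randn(n_samples, latent_dim)        # z_i ~ p(z) = N(0, I)
    means = np.array([decoder_mean(zi) for zi in z])  # means of p(x|z_i)
    # log N(x | mean_i, I), up to the additive constant -d/2 * log(2*pi)
    log_p_x_given_z = -0.5 * np.sum((x - means) ** 2, axis=1)
    # log p(x) ~= log( (1/N) * sum_i p(x|z_i) ), computed stably (log-sum-exp)
    m = log_p_x_given_z.max()
    return m + np.log(np.mean(np.exp(log_p_x_given_z - m)))
```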
It is now possible to define the set of relationships between the input data and its latent representation as
* Prior: p_\theta(z)
* Likelihood: p_\theta(x|z)
* Posterior: p_\theta(z|x)
Unfortunately, the computation of p_\theta(z|x) is expensive and in most cases intractable. To speed up the calculus and make it feasible, it is necessary to introduce a further function to approximate the posterior distribution as
: q_\phi(z|x) \approx p_\theta(z|x),
with \phi defined as the set of real values that parametrize q. This is sometimes called ''amortized inference'', since by "investing" in finding a good q_\phi, one can later infer z from x quickly without doing any integrals.
In this way, the problem is that of finding a good probabilistic autoencoder, in which the conditional likelihood distribution p_\theta(x|z) is computed by the ''probabilistic decoder'', and the approximated posterior distribution q_\phi(z|x) is computed by the ''probabilistic encoder''.
Evidence lower bound (ELBO)
As in every deep learning problem, it is necessary to define a differentiable loss function in order to update the network weights through backpropagation.
For variational autoencoders, the idea is to jointly optimize the generative model parameters \theta to reduce the reconstruction error between the input and the output, and \phi to make q_\phi(z|x) as close as possible to p_\theta(z|x). As reconstruction loss, mean squared error and cross entropy are often used.
As distance loss between the two distributions, the reverse Kullback–Leibler divergence D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x)) is a good choice to squeeze q_\phi(z|x) under p_\theta(z|x).
The distance loss just defined is expanded as
: D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x)) = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)\, p_\theta(x)}{p_\theta(x, z)}\right] = \ln p_\theta(x) + \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)}{p_\theta(x, z)}\right].
Now define the evidence lower bound (ELBO):
: L_{\theta,\phi}(x) := \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot|x)).
Maximizing the ELBO
: \theta^*, \phi^* = \underset{\theta,\phi}{\operatorname{arg\,max}}\, L_{\theta,\phi}(x)
is equivalent to simultaneously maximizing \ln p_\theta(x) and minimizing D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot|x)). That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior q_\phi(\cdot|x) from the exact posterior p_\theta(\cdot|x).
For a more detailed derivation and more interpretations of the ELBO and its maximization, see the main page on the evidence lower bound.
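In practice, the ELBO is maximized by minimizing its negation with a stochastic gradient optimizer. The sketch below reuses the hypothetical VAE model and vae_loss function from the earlier sketches; the optimizer choice, learning rate, and data_loader are assumptions made for illustration.

```python
# Sketch of jointly optimizing the decoder parameters (theta) and encoder
# parameters (phi) by minimizing the negative ELBO on mini-batches.
import torch

model = VAE()  # hypothetical model defined in the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for x in data_loader:                         # data_loader is assumed to yield batches
    x = x.view(x.size(0), -1)                 # flatten, e.g. images to vectors
    x_recon, mu, logvar = model(x)
    loss = vae_loss(x, x_recon, mu, logvar)   # negative ELBO (up to constants)
    optimizer.zero_grad()
    loss.backward()                           # gradients flow through the reparameterized sample
    optimizer.step()
```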
Reparameterization
To efficiently search for
: \theta^*, \phi^* = \underset{\theta,\phi}{\operatorname{arg\,max}}\, L_{\theta,\phi}(x),
the typical method is gradient descent.
It is straightforward to find
: \nabla_\theta \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\nabla_\theta \ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right].
However,
: \nabla_\phi \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]
does not allow one to put the \nabla_\phi inside the expectation, since \phi appears in the probability distribution itself. The reparameterization trick (also known as stochastic backpropagation) bypasses this difficulty.
The most important example is when z \sim q_\phi(\cdot|x) is normally distributed, as \mathcal{N}(\mu_\phi(x), \Sigma_\phi(x)).
This can be reparametrized by letting \epsilon \sim \mathcal{N}(0, I) be a "standard random number generator", and constructing z as z = \mu_\phi(x) + L_\phi(x)\epsilon. Here, L_\phi(x) is obtained by the Cholesky decomposition:
: \Sigma_\phi(x) = L_\phi(x) L_\phi(x)^T.
Then we have
: \nabla_\phi \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\nabla_\phi \ln \frac{p_\theta(x, \mu_\phi(x) + L_\phi(x)\epsilon)}{q_\phi(\mu_\phi(x) + L_\phi(x)\epsilon \mid x)}\right],
and so we obtain an unbiased estimator of the gradient, allowing stochastic gradient descent.
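As a concrete sketch of the trick (assuming PyTorch autograd; the dimensions, the downstream objective, and the way the Cholesky factor is produced are all illustrative):

```python
# Reparameterization: express the sample z as a deterministic, differentiable
# function of (mu, L) and an independent noise draw eps ~ N(0, I), so that
# gradients with respect to the variational parameters can flow through z.
import torch

mu = torch.randn(4, requires_grad=True)                # mean mu_phi(x), illustrative
L = torch.tril(torch.randn(4, 4)) + 4 * torch.eye(4)   # a lower-triangular Cholesky factor L_phi(x)
L.requires_grad_()

eps = torch.randn(4)      # "standard random number generator", eps ~ N(0, I)
z = mu + L @ eps          # z ~ N(mu, L L^T), written as a differentiable function of (mu, L)

loss = (z ** 2).sum()     # any downstream objective depending on z
loss.backward()           # gradients flow through z back to mu and L
print(mu.grad, L.grad)
```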
Since we reparametrized z, we need to find q_\phi(z|x). Let q_0 be the probability density function for \epsilon; then
: \ln q_\phi(z|x) = \ln q_0(\epsilon) - \ln |\det(\partial_\epsilon z)|,
where \partial_\epsilon z is the Jacobian matrix of z with respect to \epsilon. Since z = \mu_\phi(x) + L_\phi(x)\epsilon, this is
: \ln q_\phi(z|x) = -\frac{1}{2}\|\epsilon\|^2 - \ln |\det L_\phi(x)| - \frac{n}{2}\ln(2\pi).
Variations
Many applications and extensions of variational autoencoders have been used to adapt the architecture to other domains and improve its performance.
β-VAE is an implementation with a weighted Kullback–Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for β values greater than one. This architecture can discover disentangled latent factors without supervision.
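Schematically, and reusing the hypothetical reconstruction and KL terms from the loss sketch above, the only change relative to a standard VAE objective is a weighting factor β on the KL term:

```python
# beta-VAE objective sketch: the KL term is scaled by a hyperparameter beta > 1
# to encourage disentangled latent representations (recon and kl as before).
def beta_vae_loss(recon, kl, beta=4.0):
    return recon + beta * kl
```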
The conditional VAE (CVAE) inserts label information in the latent space to force a deterministic constrained representation of the learned data.
Some structures directly deal with the quality of the generated samples or implement more than one latent space to further improve the representation learning.
Some architectures mix VAE and generative adversarial networks to obtain hybrid models.
See also
* Autoencoder
* Artificial neural network
* Deep learning
* Generative adversarial network
* Representation learning
* Sparse dictionary learning
* Data augmentation
* Backpropagation
References