A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flow, which is a statistical method using the change-of-variable law of probabilities to transform a simple distribution into a complex one.
The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as the loss function. Additionally, novel samples can be generated by sampling from the initial distribution and applying the flow transformation.
In contrast, many alternative generative modeling methods such as the variational autoencoder (VAE) and the generative adversarial network (GAN) do not explicitly represent the likelihood function.
Method
Let $z_0$ be a (possibly multivariate) random variable with distribution $p_0(z_0)$.
For $i = 1, \ldots, K$, let $z_i = f_i(z_{i-1})$ be a sequence of random variables transformed from $z_0$. The functions $f_1, \ldots, f_K$ should be invertible, i.e. the inverse function $f_i^{-1}$ exists. The final output $z_K$ models the target distribution.
The log likelihood of $z_K$ is (see the derivation below):

: $\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{df_i(z_{i-1})}{dz_{i-1}} \right|$
To efficiently compute the log likelihood, the functions $f_1, \ldots, f_K$ should be (1) easy to invert, and (2) have Jacobians whose determinants are easy to compute. In practice, the functions $f_i$ are modeled using deep neural networks, and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE, RealNVP, and Glow.
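To make these design constraints concrete, below is a minimal sketch of an affine coupling layer in the style of RealNVP. This is an illustrative reconstruction, not the reference implementation: PyTorch, the module name AffineCoupling, the small scale_net/shift_net MLPs, and the half-and-half coordinate split are all assumptions made here. Because the Jacobian of such a layer is triangular, its log-determinant is just a sum, and inverting the layer needs only a forward pass of the networks:

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # Generation direction: x1 = z1, x2 = z2 * exp(s(z1)) + t(z1).
    # The Jacobian is triangular, so log|det J| = sum over s(z1).
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.scale_net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.Tanh(), nn.Linear(hidden, dim - self.half))
        self.shift_net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.Tanh(), nn.Linear(hidden, dim - self.half))

    def forward(self, z):
        # z -> x (sampling direction); returns x and log|det dx/dz|
        z1, z2 = z[..., :self.half], z[..., self.half:]
        s, t = self.scale_net(z1), self.shift_net(z1)
        x2 = z2 * torch.exp(s) + t
        return torch.cat([z1, x2], dim=-1), s.sum(dim=-1)

    def inverse(self, x):
        # x -> z (density evaluation); needs only a forward pass of the nets
        x1, x2 = x[..., :self.half], x[..., self.half:]
        s, t = self.scale_net(x1), self.shift_net(x1)
        z2 = (x2 - t) * torch.exp(-s)
        return torch.cat([x1, z2], dim=-1), -s.sum(dim=-1)  # log|det dz/dx|

Because the first half of the coordinates passes through unchanged, practical models stack several such layers with permuted or alternated splits so that every coordinate is eventually transformed.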
Derivation of log likelihood
Consider $z_1$ and $z_0$. Note that $z_0 = f_1^{-1}(z_1)$.
By the change of variable formula, the distribution of $z_1$ is:

: $p_1(z_1) = p_0(z_0) \left| \det \frac{df_1^{-1}(z_1)}{dz_1} \right|$
where $\det \frac{df_1^{-1}(z_1)}{dz_1}$ is the determinant of the Jacobian matrix of $f_1^{-1}$.
By the inverse function theorem:

: $p_1(z_1) = p_0(z_0) \left| \det \left( \frac{df_1(z_0)}{dz_0} \right)^{-1} \right|$
By the identity $\det(A^{-1}) = \det(A)^{-1}$ (where $A$ is an invertible matrix), we have:

: $p_1(z_1) = p_0(z_0) \left| \det \frac{df_1(z_0)}{dz_0} \right|^{-1}$
The log likelihood is thus:

: $\log p_1(z_1) = \log p_0(z_0) - \log \left| \det \frac{df_1(z_0)}{dz_0} \right|$
In general, the above applies to any $z_i$ and $z_{i-1}$. Since $\log p_i(z_i)$ is $\log p_{i-1}(z_{i-1})$ minus a non-recursive term, we can infer by induction that:

: $\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{df_i(z_{i-1})}{dz_{i-1}} \right|$
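As a concrete check (an illustrative example added here, not part of the original derivation), take a single one-dimensional affine transformation $z_1 = f_1(z_0) = a z_0 + b$ with $a \neq 0$. Its Jacobian is the constant $a$, so the formula above reduces to:

: $\log p_1(z_1) = \log p_0\left( \frac{z_1 - b}{a} \right) - \log |a|$

With a standard normal base distribution $p_0 = \mathcal{N}(0, 1)$, this is exactly the log-density of $\mathcal{N}(b, a^2)$, as the change-of-variable formula requires.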
Training method
As is generally done when training a deep learning model, the goal with normalizing flows is to minimize the Kullback–Leibler divergence between the model's likelihood and the target distribution to be estimated. Denoting $p_\theta$ the model's likelihood and $p^*$ the target distribution to learn, the (forward) KL-divergence is:

: $D_{\text{KL}}[p^{*}(x) \,\|\, p_{\theta}(x)] = -\mathbb{E}_{p^{*}(x)}[\log p_{\theta}(x)] + \mathbb{E}_{p^{*}(x)}[\log p^{*}(x)]$

The second term does not depend on the model parameters $\theta$, so minimizing this divergence amounts to maximizing the expected log-likelihood of the model under the target distribution, which in practice is done by minimizing the negative log-likelihood over training samples.
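A minimal training-loop sketch under the same assumptions as the coupling-layer example above (PyTorch; a factorized standard normal base distribution; target_sample is a hypothetical stand-in for drawing batches from the unknown $p^*$). The batch-averaged negative log-likelihood is a Monte Carlo estimate of the forward KL-divergence up to the constant entropy term:

import torch

flow = AffineCoupling(dim=2)                 # from the sketch above
base = torch.distributions.Normal(0.0, 1.0)  # p_0, standard normal per coordinate
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)

def target_sample(n):
    # Illustrative stand-in for samples from the unknown target p*
    return torch.randn(n, 2) * torch.tensor([2.0, 0.5]) + torch.tensor([1.0, -1.0])

for step in range(2000):
    x = target_sample(256)
    z, log_det = flow.inverse(x)                       # x -> z and log|det dz/dx|
    log_px = base.log_prob(z).sum(dim=-1) + log_det    # log p_theta(x) via change of variables
    loss = -log_px.mean()                              # negative log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sampling after training: push base samples through the flow
x_new, _ = flow(base.sample((16, 2)))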