
A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flow, which is a statistical method using the change-of-variable law of probabilities to transform a simple distribution into a complex one.

The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as the loss function. Additionally, novel samples can be generated by sampling from the initial distribution and applying the flow transformation.

In contrast, many alternative generative modeling methods, such as the variational autoencoder (VAE) and the generative adversarial network (GAN), do not explicitly represent the likelihood function.


Method

Let z_0 be a (possibly multivariate) random variable with distribution p_0(z_0).

For i = 1, ..., K, let z_i = f_i(z_{i-1}) be a sequence of random variables transformed from z_0. The functions f_1, ..., f_K should be invertible, i.e. the inverse function f_i^{-1} exists. The final output z_K models the target distribution.

The log likelihood of z_K is (see the derivation below):

: \log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i(z_{i-1})}{\partial z_{i-1}} \right|

To efficiently compute the log likelihood, the functions f_1, ..., f_K should be 1. easy to invert, and 2. have a Jacobian whose determinant is easy to compute. In practice, the functions f_1, ..., f_K are modeled using deep neural networks, and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE, RealNVP, and Glow.
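As a concrete illustration of how the log-likelihood of a composed flow is accumulated, here is a minimal NumPy sketch; the AffineLayer class and its parameter values are hypothetical stand-ins for a learned invertible network, not one of the architectures named above.

    import numpy as np

    class AffineLayer:
        """A toy invertible layer x = a * z + b (elementwise), with a != 0."""
        def __init__(self, a, b):
            self.a, self.b = np.asarray(a, float), np.asarray(b, float)

        def forward(self, z):
            return self.a * z + self.b

        def inverse(self, x):
            return (x - self.b) / self.a

        def log_abs_det_jacobian(self, z):
            # log |det df/dz| of an elementwise affine map is sum_i log|a_i|
            return np.sum(np.log(np.abs(self.a)))

    def flow_log_likelihood(x, layers, base_log_prob):
        """log p_K(x) = log p_0(z_0) - sum_i log |det df_i/dz_{i-1}|,
        evaluated by inverting the flow from x back to z_0."""
        z, log_det_sum = x, 0.0
        for layer in reversed(layers):
            z = layer.inverse(z)
            log_det_sum += layer.log_abs_det_jacobian(z)  # Jacobian at the layer's input
        return base_log_prob(z) - log_det_sum

    # Example: two affine layers on top of a 2-D standard normal base distribution.
    base_log_prob = lambda z: -0.5 * np.sum(z**2) - 0.5 * z.size * np.log(2 * np.pi)
    layers = [AffineLayer([2.0, 0.5], [1.0, -1.0]), AffineLayer([1.5, 3.0], [0.0, 0.2])]
    print(flow_log_likelihood(np.array([0.3, -0.7]), layers, base_log_prob))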


Derivation of log likelihood

Consider z_1 and z_0. Note that z_0 = f_1^{-1}(z_1). By the change of variables formula, the distribution of z_1 is:

: p_1(z_1) = p_0(z_0)\left| \det \frac{\partial f_1^{-1}(z_1)}{\partial z_1} \right|

where \det \frac{\partial f_1^{-1}(z_1)}{\partial z_1} is the determinant of the Jacobian matrix of f_1^{-1}.

By the inverse function theorem:

: p_1(z_1) = p_0(z_0)\left| \det \left(\frac{\partial f_1(z_0)}{\partial z_0}\right)^{-1} \right|

By the identity \det(A^{-1}) = \det(A)^{-1} (where A is an invertible matrix), we have:

: p_1(z_1) = p_0(z_0)\left| \det \frac{\partial f_1(z_0)}{\partial z_0} \right|^{-1}

The log likelihood is thus:

: \log p_1(z_1) = \log p_0(z_0) - \log \left| \det \frac{\partial f_1(z_0)}{\partial z_0} \right|

In general, the above applies to any z_i and z_{i-1}. Since \log p_i(z_i) is equal to \log p_{i-1}(z_{i-1}) minus a non-recursive term, we can infer by induction that:

: \log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i(z_{i-1})}{\partial z_{i-1}} \right|
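The change-of-variables identity can be checked numerically for a single invertible map. The sketch below assumes a hand-picked 1-D affine map f(z) = 3z + 2 applied to a standard normal base distribution; both sides of the identity agree up to floating-point error.

    import numpy as np
    from scipy.stats import norm

    # f(z) = a*z + b, so f^{-1}(x) = (x - b)/a and |det df/dz| = |a|.
    a, b = 3.0, 2.0
    x = 1.7                              # arbitrary test point
    z0 = (x - b) / a                     # z_0 = f^{-1}(x)

    # Right-hand side: log p_0(z_0) - log |det df/dz|
    rhs = norm.logpdf(z0) - np.log(abs(a))

    # Left-hand side: exact log-density of x, since a*Z + b ~ N(b, a^2) for Z ~ N(0, 1)
    lhs = norm.logpdf(x, loc=b, scale=abs(a))

    assert np.isclose(lhs, rhs)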


Training method

As is generally done when training a deep learning model, the goal with normalizing flows is to minimize the Kullback–Leibler divergence between the model's likelihood and the target distribution to be estimated. Denoting p_\theta the model's likelihood and p^* the target distribution to learn, the (forward) KL-divergence is:

: D_{\text{KL}}[p^{*}(x) \| p_{\theta}(x)] = -\mathbb{E}_{p^{*}(x)}[\log p_{\theta}(x)] + \mathbb{E}_{p^{*}(x)}[\log p^{*}(x)]

The second term on the right-hand side of the equation corresponds to the entropy of the target distribution and is independent of the parameter \theta we want the model to learn, which only leaves the expectation of the negative log-likelihood to minimize under the target distribution. This intractable term can be approximated with a Monte Carlo method. Indeed, if we have a dataset \{x_i\}_{i=1}^{N} of samples each independently drawn from the target distribution p^*(x), then this term can be estimated as:

: -\hat{\mathbb{E}}_{p^{*}(x)}[\log p_{\theta}(x)] = -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(x_{i})

Therefore, the learning objective

: \underset{\theta}{\operatorname{arg\,min}}\ D_{\text{KL}}[p^{*}(x) \| p_{\theta}(x)]

is replaced by

: \underset{\theta}{\operatorname{arg\,max}}\ \sum_{i=1}^{N} \log p_{\theta}(x_{i})

In other words, minimizing the Kullback–Leibler divergence between the model's likelihood and the target distribution is equivalent to maximizing the model likelihood under observed samples of the target distribution.

A pseudocode for training normalizing flows is as follows:

* INPUT. dataset x_{1:n}, normalizing flow model f_\theta(\cdot), base distribution p_0.
* SOLVE. \max_\theta \sum_j \ln p_\theta(x_j) by gradient descent
* RETURN. \hat\theta
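The pseudocode above can be sketched in PyTorch roughly as follows; SimpleAffineFlow is a hypothetical one-layer flow used only to make the loop runnable, and the dataset is synthetic.

    import torch
    import torch.nn as nn

    class SimpleAffineFlow(nn.Module):
        """Hypothetical one-layer flow x = exp(s) * z + t with a standard normal base p_0."""
        def __init__(self, dim):
            super().__init__()
            self.s = nn.Parameter(torch.zeros(dim))   # log-scale
            self.t = nn.Parameter(torch.zeros(dim))   # shift
            self.base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))

        def log_prob(self, x):
            # Invert the flow, then apply log p(x) = log p_0(z) - log |det df/dz|.
            z = (x - self.t) * torch.exp(-self.s)
            return self.base.log_prob(z).sum(-1) - self.s.sum()

    # SOLVE max_theta sum_j log p_theta(x_j) by gradient descent on the negative log-likelihood.
    data = torch.randn(1024, 2) * 2.0 + 1.0           # stand-in for the dataset x_{1:n}
    model = SimpleAffineFlow(dim=2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    for step in range(500):
        optimizer.zero_grad()
        loss = -model.log_prob(data).mean()           # Monte Carlo estimate of -E[log p_theta(x)]
        loss.backward()
        optimizer.step()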


Variants


Planar Flow

The earliest example. Fix some activation function h, and let \theta = (u, w, b) with the appropriate dimensions, then

: x = f_\theta(z) = z + u h(\langle w, z \rangle + b)

The inverse f_\theta^{-1} has no closed-form solution in general.

The Jacobian determinant is |\det (I + h'(\langle w, z \rangle + b) u w^T)| = |1 + h'(\langle w, z \rangle + b) \langle u, w \rangle|.

For the flow to be invertible everywhere, the Jacobian determinant must be nonzero everywhere. For example, h = \tanh and \langle u, w \rangle > -1 satisfy the requirement.
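A minimal NumPy sketch of one planar flow layer and its log-Jacobian-determinant, with arbitrary illustrative parameter values:

    import numpy as np

    def planar_forward(z, u, w, b, h=np.tanh, h_prime=lambda a: 1 - np.tanh(a) ** 2):
        """x = z + u * h(<w, z> + b), with log |det J| = log |1 + h'(<w, z> + b) <u, w>|."""
        a = np.dot(w, z) + b
        x = z + u * h(a)
        log_det = np.log(np.abs(1.0 + h_prime(a) * np.dot(u, w)))
        return x, log_det

    # Invertibility condition for h = tanh: <u, w> > -1.
    u, w = np.array([0.5, -0.2]), np.array([1.0, 0.3])
    assert np.dot(u, w) > -1
    x, log_det = planar_forward(np.array([0.1, 0.4]), u, w, b=0.2)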


Nonlinear Independent Components Estimation (NICE)

Let x, z \in \R^{2n} be even-dimensional, and split them in the middle. Then the normalizing flow functions are

: x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}

where m_\theta is any neural network with weights \theta.

f_\theta^{-1} is just z_1 = x_1, z_2 = x_2 - m_\theta(x_1), and the Jacobian determinant is just 1, that is, the flow is volume-preserving.

When n = 1, this is seen as a curvy shearing along the x_2 direction.
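A minimal NumPy sketch of a single NICE-style additive coupling layer; m is a stand-in for the neural network m_\theta:

    import numpy as np

    def nice_forward(z, m):
        """Additive coupling: x1 = z1, x2 = z2 + m(z1). Volume-preserving, log |det J| = 0."""
        z1, z2 = np.split(z, 2)
        return np.concatenate([z1, z2 + m(z1)])

    def nice_inverse(x, m):
        x1, x2 = np.split(x, 2)
        return np.concatenate([x1, x2 - m(x1)])

    m = lambda z1: np.sin(z1)              # stand-in for the network m_theta
    z = np.array([0.3, -1.2, 0.7, 0.1])
    assert np.allclose(nice_inverse(nice_forward(z, m), m), z)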


Real Non-Volume Preserving (Real NVP)

The Real Non-Volume Preserving model generalizes the NICE model by:

: x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ e^{s_\theta(z_1)} \odot z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}

Its inverse is z_1 = x_1, z_2 = e^{-s_\theta(x_1)} \odot (x_2 - m_\theta(x_1)), and its Jacobian determinant is \prod^n_{i=1} e^{s_\theta(z_1)_i}. The NICE model is recovered by setting s_\theta = 0. Since the Real NVP map keeps the first and second halves of the vector x separate, it is usually required to add a permutation (x_1, x_2) \mapsto (x_2, x_1) after every Real NVP layer.
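A corresponding sketch of a Real NVP affine coupling layer; s and m stand in for the networks s_\theta and m_\theta:

    import numpy as np

    def realnvp_forward(z, s, m):
        """x1 = z1, x2 = exp(s(z1)) * z2 + m(z1); log |det J| = sum(s(z1))."""
        z1, z2 = np.split(z, 2)
        return np.concatenate([z1, np.exp(s(z1)) * z2 + m(z1)]), np.sum(s(z1))

    def realnvp_inverse(x, s, m):
        x1, x2 = np.split(x, 2)
        return np.concatenate([x1, np.exp(-s(x1)) * (x2 - m(x1))])

    s = lambda z1: 0.5 * np.cos(z1)        # stand-in for s_theta
    m = lambda z1: np.sin(z1)              # stand-in for m_theta
    z = np.array([0.3, -1.2, 0.7, 0.1])
    x, log_det = realnvp_forward(z, s, m)
    assert np.allclose(realnvp_inverse(x, s, m), z)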


Generative Flow (Glow)

In the generative flow model, each layer has 3 parts:

* a channel-wise affine transform y_{cij} = s_c(x_{cij} + b_c), with Jacobian determinant \prod_c s_c^{HW}, where H and W are the height and width of the input;
* an invertible 1x1 convolution z_{cij} = \sum_{c'} K_{cc'} y_{c'ij}, with Jacobian determinant \det(K)^{HW}, where K is any invertible matrix;
* a Real NVP coupling layer, with Jacobian determinant as described in Real NVP.

The idea of using the invertible 1x1 convolution is to permute all layers in general, instead of merely permuting the first and second half, as in Real NVP.
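The sketch below illustrates the channel-wise affine transform and the invertible 1x1 convolution at a single spatial position (the coupling part is the Real NVP layer sketched above); the values of s, b and K are arbitrary.

    import numpy as np

    def glow_step(x, s, b, K):
        """One Glow step at a single spatial location, where x is the channel vector.
        Per location, the affine part contributes sum_c log|s_c| to the log-determinant
        and the 1x1 convolution contributes log |det K|."""
        y = s * (x + b)                    # channel-wise affine transform
        z = K @ y                          # invertible 1x1 convolution as a matrix product
        log_det = np.sum(np.log(np.abs(s))) + np.log(np.abs(np.linalg.det(K)))
        return z, log_det

    rng = np.random.default_rng(0)
    c = 3                                  # number of channels
    x = rng.normal(size=c)
    s, b = rng.normal(size=c) + 2.0, rng.normal(size=c)
    K = rng.normal(size=(c, c))            # a random square matrix is almost surely invertible
    z, log_det = glow_step(x, s, b, K)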


Masked autoregressive flow (MAF)

An autoregressive model of a distribution on \R^n is defined as the following stochastic process:

: \begin{align} x_1 &\sim N(\mu_1, \sigma_1^2)\\ x_2 &\sim N(\mu_2(x_1), \sigma_2(x_1)^2)\\ &\cdots \\ x_n &\sim N(\mu_n(x_{1:n-1}), \sigma_n(x_{1:n-1})^2) \end{align}

where \mu_i: \R^{i-1} \to \R and \sigma_i: \R^{i-1} \to (0, \infty) are fixed functions that define the autoregressive model. By the reparametrization trick, the autoregressive model is generalized to a normalizing flow:

: \begin{align} x_1 &= \mu_1 + \sigma_1 z_1\\ x_2 &= \mu_2(x_1) + \sigma_2(x_1) z_2\\ &\cdots \\ x_n &= \mu_n(x_{1:n-1}) + \sigma_n(x_{1:n-1}) z_n \end{align}

The autoregressive model is recovered by setting z \sim N(0, I_n).

The forward mapping is slow (because it is sequential), but the backward mapping is fast (because it is parallel). The Jacobian matrix is lower-triangular, so the Jacobian determinant is \sigma_1 \sigma_2(x_1) \cdots \sigma_n(x_{1:n-1}).

Reversing the two maps f_\theta and f_\theta^{-1} of MAF results in Inverse Autoregressive Flow (IAF), which has fast forward mapping and slow backward mapping.
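A minimal NumPy sketch of the two MAF mappings; mu and sigma are stand-ins for the learned conditioner networks, and the loops only illustrate which direction is inherently sequential:

    import numpy as np

    def maf_forward(z, mu, sigma):
        """Slow map z -> x: x_i = mu(x_{<i}) + sigma(x_{<i}) * z_i, computed sequentially."""
        x = np.empty_like(z)
        for i in range(len(z)):
            x[i] = mu(x[:i]) + sigma(x[:i]) * z[i]
        return x

    def maf_inverse(x, mu, sigma):
        """Fast map x -> z: each z_i depends only on x, so all i can be computed in parallel."""
        return np.array([(x[i] - mu(x[:i])) / sigma(x[:i]) for i in range(len(x))])

    mu = lambda prev: 0.1 * np.sum(prev)             # stand-in conditioner
    sigma = lambda prev: np.exp(0.05 * np.sum(prev)) # stand-in conditioner, always positive
    z = np.array([0.3, -1.2, 0.7])
    x = maf_forward(z, mu, sigma)
    assert np.allclose(maf_inverse(x, mu, sigma), z)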


Continuous Normalizing Flow (CNF)

Instead of constructing a flow by function composition, another approach is to formulate the flow as a continuous-time dynamic. Let z_0 be the latent variable with distribution p(z_0). Map this latent variable to data space with the following flow function:

: x = F(z_0) = z_T = z_0 + \int_0^T f(z_t, t)\, dt

where f is an arbitrary function and can be modeled with e.g. neural networks. The inverse function is then naturally:

: z_0 = F^{-1}(x) = z_T + \int_T^0 f(z_t, t)\, dt

And the log-likelihood of x can be found as:

: \log p(x) = \log p(z_0) - \int_0^T \operatorname{tr}\left[ \frac{\partial f}{\partial z_t} \right] dt

Since the trace depends only on the diagonal of the Jacobian \partial_{z_t} f, this allows a "free-form" Jacobian. Here, "free-form" means that there is no restriction on the Jacobian's form. It is contrasted with previous discrete models of normalizing flow, where the Jacobian is carefully designed to be only upper- or lower-triangular, so that the Jacobian determinant can be evaluated efficiently.

The trace can be estimated by "Hutchinson's trick": given any matrix W \in \R^{n \times n}, and any random vector u \in \R^n with E[u u^T] = I, we have E[u^T W u] = \operatorname{tr}(W). (Proof: expand the expectation directly.) Usually, the random vector is sampled from N(0, I) (the normal distribution) or \{\pm 1\}^n (the Rademacher distribution).
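A minimal NumPy sketch of a CNF step under simplifying assumptions: the velocity field is a hand-picked linear map f(z, t) = W z (so its exact Jacobian trace is available for comparison), the ODE is integrated with plain Euler steps, and the trace is estimated with Hutchinson's trick using Rademacher noise.

    import numpy as np

    rng = np.random.default_rng(0)

    def hutchinson_trace(jac_vec_prod, n, num_samples=256):
        """Estimate tr(W) as the average of u^T W u over random u with E[u u^T] = I."""
        total = 0.0
        for _ in range(num_samples):
            u = rng.choice([-1.0, 1.0], size=n)      # Rademacher noise
            total += u @ jac_vec_prod(u)
        return total / num_samples

    def cnf_sample_and_logp(z0, logp0, W, T=1.0, steps=100):
        """Euler-integrate dz/dt = f(z, t) and d(log p)/dt = -tr(df/dz) jointly."""
        dt = T / steps
        z, logp = z0.copy(), logp0
        for k in range(steps):
            # For this linear field the Jacobian is the constant matrix W;
            # in general it depends on z_t and must be re-estimated at each step.
            logp -= hutchinson_trace(lambda u: W @ u, len(z0)) * dt
            z = z + (W @ z) * dt
        return z, logp

    n = 2
    W = np.array([[0.3, -0.1], [0.2, 0.1]])          # exact trace is 0.4, for comparison
    z0 = rng.normal(size=n)
    logp0 = -0.5 * (z0 @ z0) - 0.5 * n * np.log(2 * np.pi)   # standard normal base
    x, logpx = cnf_sample_and_logp(z0, logp0, W)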
When f is implemented as a neural network, neural ODE methods would be needed. Indeed, CNF was first proposed in the same paper that proposed neural ODE.

There are two main deficiencies of CNF. One is that a continuous flow must be a homeomorphism, thus preserve orientation and ambient isotopy (for example, it is impossible to flip a left hand into a right hand by continuously deforming space, and it is impossible to turn a sphere inside out, or undo a knot). The other is that the learned flow f might be ill-behaved, due to degeneracy (that is, there are an infinite number of possible f that all solve the same problem).

By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just as one can pick up a polygon from a desk and flip it around in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE". Any homeomorphism of \R^n can be approximated by a neural ODE operating on \R^{2n+1}, as proved by combining the Whitney embedding theorem for manifolds and the universal approximation theorem for neural networks.

To regularize the flow f, one can impose regularization losses. The paper proposed the following regularization loss based on optimal transport theory:

: \lambda_K \int_0^T \left\| f(z_t, t) \right\|^2 dt + \lambda_J \int_0^T \left\| \nabla_z f(z_t, t) \right\|_F^2 dt

where \lambda_K, \lambda_J > 0 are hyperparameters. The first term punishes the model for varying the flow field over time, and the second term punishes it for varying the flow field over space.


Downsides

Despite normalizing flows' success in estimating high-dimensional densities, some downsides still exist in their designs. First of all, their latent space, onto which input data is projected, is not a lower-dimensional space; therefore, flow-based models do not allow for compression of data by default and require a lot of computation. However, it is still possible to perform image compression with them.

Flow-based models are also notorious for failing to estimate the likelihood of out-of-distribution samples (i.e., samples that were not drawn from the same distribution as the training set). Some hypotheses were formulated to explain this phenomenon, among them the typical set hypothesis, estimation issues when training models, and fundamental issues due to the entropy of the data distributions.

One of the most interesting properties of normalizing flows is the invertibility of their learned bijective map. This property is given by constraints in the design of the models (cf.: RealNVP, Glow) which guarantee theoretical invertibility. The integrity of the inverse is important in order to ensure the applicability of the change-of-variable theorem, the computation of the Jacobian of the map, as well as sampling with the model. However, in practice this invertibility is violated and the inverse map explodes because of numerical imprecision.


Applications

Flow-based generative models have been applied on a variety of modeling tasks, including:

* Audio generation
* Image generation
* Molecular graph generation
* Point-cloud modeling
* Video generation
* Lossy image compression




External links


Flow-based Deep Generative Models
