In machine learning, diffusion models, also known as diffusion-based generative models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of two major components: the forward diffusion process and the reverse sampling process. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality.

There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations. They are typically trained using variational inference. The model responsible for denoising is typically called its "backbone". The backbone may be of any kind, but it is typically a U-net or a transformer.

Diffusion models are mainly used for computer vision tasks, including image denoising, inpainting, super-resolution, image generation, and video generation. These typically involve training a neural network to sequentially denoise images blurred with Gaussian noise. The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image.

Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules, to allow text-conditioned generation. Beyond computer vision, diffusion models have also found applications in natural language processing (such as text generation and summarization), sound generation, and reinforcement learning.

Denoising diffusion model

Non-equilibrium thermodynamics

Diffusion models were introduced in 2015 as a method to train a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion. Consider, for example, how one might model the distribution of all naturally-occurring photos. Each image is a point in the space of all images, and the distribution of naturally-occurring photos is a "cloud" in that space, which, by repeatedly adding noise to the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a Gaussian distribution

\mathcal{N}(0, I)

. A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution. The equilibrium distribution is the Gaussian distribution

\mathcal{N}(0, I)

, with pdf

\rho(x) \propto e^{-\frac 12 \|x\|^2}

. This is just the Maxwell–Boltzmann distribution of particles in a potential well

V(x) = \frac 12 \|x\|^2

at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a Brownian walker) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they would all fall to the origin, collapsing the distribution.

Denoising Diffusion Probabilistic Model (DDPM)

The 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by variational inference.

Forward diffusion

To present the model, we need some notation.

* \beta_1, ..., \beta_T \in (0, 1) are fixed constants.
* \alpha_t := 1-\beta_t
* \bar \alpha_t := \alpha_1 \cdots \alpha_t
* \sigma_t := \sqrt{1-\bar \alpha_t}
* \tilde \sigma_t := \frac{\sigma_{t-1}}{\sigma_t}\sqrt{\beta_t}
* \tilde\mu_t(x_t, x_0) := \frac{\sqrt{\alpha_t}(1-\bar \alpha_{t-1})x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)x_0}{\sigma_t^2}
* \mathcal{N}(\mu, \Sigma) is the normal distribution with mean \mu and variance \Sigma, and \mathcal{N}(x | \mu, \Sigma) is the probability density at x.
* A vertical bar denotes conditioning.

A forward diffusion process starts at some starting point

x_0 \sim q

, where

q

is the probability distribution to be learned, then repeatedly adds noise to it by

x_t = \sqrt{1-\beta_t} x_{t-1} + \sqrt{\beta_t} z_t

where

z_1, ..., z_T

are IID samples from

\mathcal{N}(0, I)

. This is designed so that for any starting distribution of

x_0

, we have

\lim_t x_t | x_0

converging to

\mathcal{N}(0, I)

. The entire diffusion process then satisfies

q(x_{0:T}) = q(x_0)q(x_1 | x_0) \cdots q(x_T | x_{T-1}) = q(x_0) \mathcal{N}(x_1 | \sqrt{\alpha_1} x_0, \beta_1 I) \cdots \mathcal{N}(x_T | \sqrt{\alpha_T} x_{T-1}, \beta_T I)

or, taking logarithms,

\ln q(x_{0:T}) = \ln q(x_0) - \sum_{t=1}^T \frac{1}{2\beta_t} \| x_t - \sqrt{1-\beta_t}x_{t-1}\|^2 + C

where

C

is a normalization constant and often omitted. In particular, we note that

x_{1:T} | x_0

is a Gaussian process, which affords us considerable freedom in reparameterization. For example, by standard manipulation of Gaussian processes,

x_t | x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_0, \sigma_t^2 I \right)

x_{t-1} | x_t, x_0 \sim \mathcal{N}(\tilde\mu_t(x_t, x_0), \tilde\sigma_t^2 I)

In particular, notice that for large

t

, the variable

x_t | x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_0, \sigma_t^2 I \right)

converges to

\mathcal{N}(0, I)

. That is, after a long enough diffusion process, we end up with some

x_T

that is very close to

\mathcal{N}(0, I)

, with all traces of the original

x_0 \sim q

gone. For example, since

x_t | x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_0, \sigma_t^2 I \right)

we can sample

x_t | x_0

directly "in one step", instead of going through all the intermediate steps

x_1, x_2, ..., x_{t-1}
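The one-step shortcut can be checked numerically. The following is a minimal sketch, assuming a linear schedule for the constants \beta_t (an illustrative choice, not prescribed by the text) and a toy dataset in which every x_0 equals 3; the function names are hypothetical. It draws x_t both by iterating the noising rule and directly from the closed form, and the two sets of samples should agree in distribution.

```python
import numpy as np

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule (an assumption) and the derived quantities."""
    beta = np.linspace(beta_min, beta_max, T)   # beta_1, ..., beta_T
    alpha_bar = np.cumprod(1.0 - beta)          # bar alpha_t
    sigma = np.sqrt(1.0 - alpha_bar)            # sigma_t
    return beta, alpha_bar, sigma

def sample_xt_one_step(x0, t, alpha_bar, sigma, rng):
    """Draw x_t | x_0 directly; t is 1-indexed as in the text."""
    z = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t - 1]) * x0 + sigma[t - 1] * z

def sample_xt_iterative(x0, t, beta, rng):
    """Draw x_t by applying the noising rule t times."""
    x = x0
    for s in range(t):
        z = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta[s]) * x + np.sqrt(beta[s]) * z
    return x

rng = np.random.default_rng(0)
beta, alpha_bar, sigma = make_schedule()
x0 = np.full(20000, 3.0)                        # toy data: every x_0 equals 3
xt_direct = sample_xt_one_step(x0, 500, alpha_bar, sigma, rng)
xt_iter = sample_xt_iterative(x0, 500, beta, rng)
```

Both arrays should have mean close to \sqrt{\bar\alpha_{500}} \cdot 3 and standard deviation close to \sigma_{500}, up to Monte Carlo error.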

Backward diffusion

The key idea of DDPM is to use a neural network parametrized by

\theta

. The network takes in two arguments

x_t, t

, and outputs a vector

\mu_\theta(x_t, t)

and a matrix

\Sigma_\theta(x_t, t)

, such that each step in the forward diffusion process can be approximately undone by

x_{t-1} \sim \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

. This then gives us a backward diffusion process

p_\theta

defined by

p_\theta(x_T) = \mathcal{N}(x_T | 0, I)

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1} | \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

The goal now is to learn the parameters such that

p_\theta(x_0)

is as close to

q(x_0)

as possible. To do that, we use maximum likelihood estimation with variational inference.

Variational inference

The ELBO inequality states that

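\ln p_\theta(x_0) \geq E_{x_{1:T}\sim q(\cdot | x_0)}\left[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T} | x_0)\right]

, and taking one more expectation, we get

E_{x_0 \sim q}\left[\ln p_\theta(x_0)\right] \geq E_{x_{0:T}\sim q}\left[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T} | x_0)\right]

We see that maximizing the quantity on the right would give us a lower bound on the likelihood of observed data. This allows us to perform variational inference. Define the loss function

L(\theta) := -E_{x_{0:T}\sim q}\left[ \ln p_\theta(x_{0:T}) - \ln q(x_{1:T} | x_0)\right]

and now the goal is to minimize the loss by stochastic gradient descent. The expression may be simplified to

L(\theta) = \sum_{t=1}^T E_{x_{t-1}, x_t\sim q}\left[-\ln p_\theta(x_{t-1} | x_t)\right] + E_{x_0 \sim q}\left[D_{KL}(q(x_T | x_0) \| p_\theta(x_T))\right] + C

where

C

does not depend on the parameter, and thus can be ignored. Since

p_\theta(x_T) = \mathcal{N}(x_T | 0, I)

also does not depend on the parameter, the term

E_{x_0 \sim q}\left[D_{KL}(q(x_T | x_0) \| p_\theta(x_T))\right]

can also be ignored. This leaves just

L(\theta) = \sum_{t=1}^T L_t

with

L_t = E_{x_{t-1}, x_t\sim q}\left[-\ln p_\theta(x_{t-1} | x_t)\right]

to be minimized.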

Noise prediction network

Since

x_{t-1} | x_t, x_0 \sim \mathcal{N}(\tilde\mu_t(x_t, x_0), \tilde\sigma_t^2 I)

, this suggests that we should use

\mu_\theta(x_t, t) = \tilde \mu_t(x_t, x_0)

; however, the network does not have access to

x_0

, and so it has to estimate it instead. Now, since

x_t | x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_0, \sigma_t^2 I \right)

, we may write

x_t = \sqrt{\bar\alpha_t} x_0 + \sigma_t z

, where

z

is some unknown Gaussian noise. Now we see that estimating

x_0

is equivalent to estimating

z

. Therefore, let the network output a noise vector

\epsilon_\theta(x_t, t)

, and let it predict

\mu_\theta(x_t, t) = \tilde\mu_t\left(x_t, \frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right) = \frac{x_t - \epsilon_\theta(x_t, t) \beta_t/\sigma_t}{\sqrt{\alpha_t}}

It remains to design

\Sigma_\theta(x_t, t)

. The DDPM paper suggested not learning it (since it resulted in "unstable training and poorer sample quality"), but fixing it at some value

\Sigma_\theta(x_t, t) = \zeta_t^2 I

, where either choice

\zeta_t^2 = \beta_t \text{ or } \tilde\sigma_t^2

yielded similar performance. With this, the loss simplifies to

L_t = \frac{\beta_t^2}{2\alpha_t\sigma_t^2\zeta_t^2} E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right] + C

which may be minimized by stochastic gradient descent. The paper noted empirically that an even simpler loss function

L_{simple, t} = E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right]

resulted in better models.
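As an illustration of the simple loss, the sketch below estimates it by Monte Carlo for two hand-written predictors instead of a trained network. It assumes a linear \beta schedule, a step index t drawn uniformly from 1..T, and toy data q = \mathcal{N}(0, I) in two dimensions; all of these are choices made for the example. For this particular q the best possible predictor is known in closed form, E[z | x_t] = \sigma_t x_t, so it should score lower than a predictor that always outputs zero.

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bar, rng):
    """Monte Carlo estimate of L_simple, averaged over uniformly drawn t."""
    sigma = np.sqrt(1.0 - alpha_bar)
    n = x0.shape[0]
    t = rng.integers(1, len(alpha_bar) + 1, size=n)   # t in {1, ..., T}
    z = rng.standard_normal(x0.shape)                 # the noise to be predicted
    x_t = np.sqrt(alpha_bar[t - 1])[:, None] * x0 + sigma[t - 1][:, None] * z
    return np.mean(np.sum((eps_model(x_t, t) - z) ** 2, axis=1))

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
sigma = np.sqrt(1.0 - alpha_bar)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((50000, 2))                  # toy data: q = N(0, I)

def zero_model(x_t, t):
    return np.zeros_like(x_t)

def optimal_model(x_t, t):
    # For q = N(0, I), the posterior mean of the noise is sigma_t * x_t.
    return sigma[t - 1][:, None] * x_t

loss_zero = simple_loss(zero_model, x0, alpha_bar, rng)
loss_opt = simple_loss(optimal_model, x0, alpha_bar, rng)
```

In practice `eps_model` would be a neural network and the same quantity would be minimized by stochastic gradient descent over its parameters.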

Backward diffusion process

After a noise prediction network is trained, it can be used for generating data points in the original distribution in a loop as follows:

# Compute the noise estimate \epsilon \leftarrow \epsilon_\theta(x_t, t)
# Compute the original data estimate \tilde x_0 \leftarrow (x_t - \sigma_t \epsilon) / \sqrt{\bar \alpha_t}
# Sample the previous data x_{t-1} \sim \mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), \tilde\sigma_t^2 I)
# Change time t \leftarrow t-1
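The loop above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a linear \beta schedule, uses \zeta_t = \tilde\sigma_t, sets \bar\alpha_0 = 1 so that the last step returns the data estimate without added noise, and replaces the trained network with the exact noise predictor for one-dimensional Gaussian data \mathcal{N}(m, s^2). The names `ddpm_sample` and `gaussian_eps` are hypothetical.

```python
import numpy as np

def ddpm_sample(eps_model, shape, beta, rng):
    """Run the backward loop from t = T down to t = 1."""
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)
    sigma = np.sqrt(1.0 - alpha_bar)
    x = rng.standard_normal(shape)                    # x_T ~ N(0, I)
    for t in range(len(beta), 0, -1):
        i = t - 1                                     # array index of step t
        ab_prev = alpha_bar[i - 1] if i > 0 else 1.0  # bar alpha_0 = 1
        eps = eps_model(x, t)                         # 1. noise estimate
        x0_hat = (x - sigma[i] * eps) / np.sqrt(alpha_bar[i])  # 2. data estimate
        mu = (np.sqrt(alpha[i]) * (1.0 - ab_prev) * x
              + np.sqrt(ab_prev) * beta[i] * x0_hat) / sigma[i] ** 2
        sigma_tilde = np.sqrt((1.0 - ab_prev) / (1.0 - alpha_bar[i]) * beta[i])
        x = mu + sigma_tilde * rng.standard_normal(shape)      # 3. sample x_{t-1}
    return x

beta = np.linspace(1e-4, 0.02, 1000)                  # assumed schedule
alpha_bar = np.cumprod(1.0 - beta)
m, s = 2.0, 0.5                                       # toy data: q = N(m, s^2)

def gaussian_eps(x, t):
    """Exact E[z | x_t] when x_0 ~ N(m, s^2); stands in for a trained network."""
    ab = alpha_bar[t - 1]
    return np.sqrt(1.0 - ab) * (x - np.sqrt(ab) * m) / (ab * s ** 2 + 1.0 - ab)

samples = ddpm_sample(gaussian_eps, (20000,), beta, np.random.default_rng(0))
```

With the exact predictor, the generated samples should have mean close to m and standard deviation close to s.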

Score-based generative model

A score-based generative model is another formulation of diffusion modelling. Such models are also called noise conditional score networks (NCSN) or score matching with Langevin dynamics (SMLD).

Score matching

The idea of score functions

Consider the problem of image generation. Let

x

represent an image, and let

q(x)

be the probability distribution over all possible images. If we have

q(x)

itself, then we can say for certain how likely a certain image is. However, this is intractable in general. Most often, we are uninterested in knowing the absolute probability of a certain image. Instead, we are usually only interested in knowing how likely a certain image is compared to its immediate neighbors, e.g. how much more likely is an image of a cat compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added? Consequently, we are actually quite uninterested in

q(x)

itself, but rather,

\nabla_x \ln q(x)

. This has two major effects:

* One, we no longer need to normalize q(x), but can use any \tilde q(x) = Cq(x), where C = \int \tilde q(x) dx > 0 is any unknown constant that is of no concern to us.
* Two, we are comparing q(x) with its neighbors q(x + dx), by \frac{q(x)}{q(x+dx)} = e^{-\langle \nabla_x \ln q, dx \rangle}

Let the score function be

s(x) := \nabla_x \ln q(x)

; then consider what we can do with

s(x)

. As it turns out,

s(x)

allows us to sample from

q(x)

using thermodynamics. Specifically, if we have a potential energy function

U(x) = -\ln q(x)

, and a lot of particles in the potential well, then the distribution at thermodynamic equilibrium is the

Boltzmann distribution

q_U(x) \propto e^{-U(x)/k_B T} = q(x)^{1/k_B T}

. At temperature

k_BT=1

, the Boltzmann distribution is exactly

q(x)

. Therefore, to model

q(x)

, we may start with a particle sampled at any convenient distribution (such as the standard Gaussian distribution), then simulate the motion of the particle forwards according to the Langevin equation

dx_t = -\nabla_{x_t} U(x_t) dt + dW_t

and the Boltzmann distribution is, by the Fokker-Planck equation, the unique thermodynamic equilibrium. So no matter what distribution

x_0

has, the distribution of

x_t

converges in distribution to

q

as

t\to \infty
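To make the idea concrete, here is a small sketch of sampling from a known score function by simulating Langevin dynamics. It uses the common discretisation x \leftarrow x + h\, s(x) + \sqrt{2h}\, z with step size h, in which the noise scale is normalised so that the stationary distribution is q up to a discretisation error of order h; this normalisation, the step size, and the one-dimensional Gaussian target are assumptions of the example.

```python
import numpy as np

def langevin_sample(score, x_init, step, n_steps, rng):
    """Unadjusted Langevin iteration driven only by the score function."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * score(x) + np.sqrt(2.0 * step) * noise
    return x

mean, std = 1.5, 0.7                         # toy target: q = N(1.5, 0.7^2)

def score(x):
    return -(x - mean) / std ** 2            # s(x) = d/dx ln q(x)

rng = np.random.default_rng(0)
start = rng.standard_normal(10000)           # particles start from N(0, 1)
particles = langevin_sample(score, start, step=1e-2, n_steps=2000, rng=rng)
```

After enough steps the particle cloud should have mean near 1.5 and standard deviation near 0.7, even though the sampler never evaluates q itself.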

Learning the score function

Given a density

q

, we wish to learn a score function approximation

f_\theta \approx \nabla \ln q

. This is score matching. Typically, score matching is formalized as minimizing the Fisher divergence

E_q\left[\|f_\theta(x) - \nabla \ln q(x)\|^2\right]

. By expanding the integral, and performing an integration by parts,

E_q\left[\|f_\theta(x) - \nabla \ln q(x)\|^2\right] = E_q\left[\|f_\theta\|^2 + 2\nabla\cdot f_\theta\right] + C

giving us a loss function, also known as the Hyvärinen scoring rule, that can be minimized by stochastic gradient descent.
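The point of the integration by parts is that the resulting loss needs only samples from q, not the unknown score. A toy sketch, assuming a one-dimensional linear model f_\theta(x) = wx + c (chosen because its divergence is simply w and the minimiser has a closed form): fitting it to Gaussian samples should recover the Gaussian score -(x - m)/s^2, that is w \approx -1/s^2 and c \approx m/s^2. The function names are hypothetical.

```python
import numpy as np

def hyvarinen_loss(w, c, x):
    """Empirical E[ f(x)^2 + 2 * div f(x) ] for the linear model f(x) = w*x + c."""
    f = w * x + c
    div_f = w                      # derivative of f with respect to x
    return np.mean(f ** 2 + 2.0 * div_f)

def fit_linear_score(x):
    """Closed-form minimiser of the empirical loss for the linear model."""
    var = np.var(x)
    return -1.0 / var, np.mean(x) / var

rng = np.random.default_rng(0)
x = 1.5 + 0.7 * rng.standard_normal(100000)   # samples from N(1.5, 0.7^2)
w, c = fit_linear_score(x)
```

For a neural network the same loss would be minimized by stochastic gradient descent, with the divergence computed by automatic differentiation.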

Annealing the score function

Suppose we need to model the distribution of images, and we want

x_0 \sim \mathcal{N}(0, I)

, a white-noise image. Now, most white-noise images do not look like real images, so

q(x_0) \approx 0

for large swaths of

x_0 \sim \mathcal{N}(0, I)

. This presents a problem for learning the score function, because if there are no samples around a certain point, then we can't learn the score function at that point. If we do not know the score function

\nabla_{x_t}\ln q(x_t)

at that point, then we cannot impose the time-evolution equation on a particle:

dx_t = \nabla_{x_t}\ln q(x_t) dt + dW_t

To deal with this problem, we perform annealing. If

q

is too different from a white-noise distribution, then progressively add noise until it is indistinguishable from one. That is, we perform a forward diffusion, then learn the score function, then use the score function to perform a backward diffusion.

Continuous diffusion processes

Forward diffusion process

Consider again the forward diffusion process, but this time in continuous time:

x_t = \sqrt{1-\beta_t} x_{t-1} + \sqrt{\beta_t} z_t

By taking the

\beta_t \to \beta(t)dt, \sqrt{dt}z_t \to dW_t

limit, we obtain a continuous diffusion process, in the form of a stochastic differential equation

dx_t = -\frac 12 \beta(t) x_t dt + \sqrt{\beta(t)} dW_t

where

W_t

is a Wiener process (multidimensional Brownian motion). Now, the equation is exactly a special case of the overdamped Langevin equation

dx_t = -\frac{D}{k_BT} (\nabla_x U)dt + \sqrt{2D}dW_t

where

D

is the diffusion tensor,

T

is the temperature, and

U

is the potential energy field. If we substitute in

D= \frac 12 \beta(t)I, k_BT = 1, U = \frac 12 \|x\|^2

, we recover the above equation. This explains why the phrase "Langevin dynamics" is sometimes used in diffusion models. Now the above equation is for the stochastic motion of a single particle. Suppose we have a cloud of particles distributed according to

q

at time

t=0

, then after a long time, the cloud of particles would settle into the stable distribution of

\mathcal{N}(0, I)

. Let

\rho_t

be the density of the cloud of particles at time

t

, then we have

\rho_0 = q; \quad \rho_T \approx \mathcal{N}(0, I)

and the goal is to somehow reverse the process, so that we can start at the end and diffuse back to the beginning. By the Fokker-Planck equation, the density of the cloud evolves according to

\partial_t \ln \rho_t = \frac 12 \beta(t) \left(
n + (x+ \nabla\ln\rho_t) \cdot \nabla \ln\rho_t + \Delta\ln\rho_t
\right)

where

n

is the dimension of space, and

\Delta

is the

Laplace operator

. Equivalently,

\partial_t \rho_t = \frac 12 \beta(t) ( \nabla\cdot(x\rho_t) + \Delta \rho_t)

Backward diffusion process

If we have solved

\rho_t

for time

t\in [0, T]

, then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density

\nu_0 = \rho_T

, and let the particles in the cloud evolve according to

dy_t = \frac{1}{2} \beta(T-t) y_t dt + \beta(T-t) \underbrace{\nabla_{y_t} \ln \rho_{T-t}\left(y_t\right)}_{\text{score function}} dt + \sqrt{\beta(T-t)} dW_t

then by plugging into the Fokker-Planck equation, we find that

\partial_t \rho_{T-t} = \partial_t \nu_t

. Thus this cloud of points is the original cloud, evolving backwards.

Noise conditional score network (NCSN)

At the continuous limit,

\bar \alpha_t = (1-\beta_1) \cdots (1-\beta_t) = e^{\sum_i \ln(1-\beta_i)} \to e^{-\int_0^t \beta(s)ds}

and so

x_t | x_0 \sim \mathcal{N}\left(e^{-\frac 12\int_0^t \beta(s)ds} x_0, \left(1- e^{-\int_0^t \beta(s)ds}\right) I \right)

In particular, we see that we can directly sample from any point in the continuous diffusion process without going through the intermediate steps, by first sampling

x_0 \sim q, z \sim \mathcal{N}(0, I)

, then get

x_t = e^{-\frac 12\int_0^t \beta(s)ds} x_0 + \sqrt{1- e^{-\int_0^t \beta(s)ds}} \, z

. That is, we can quickly sample

x_t \sim \rho_t

for any

t \geq 0

. Now, define a certain probability distribution

\gamma

over

[0, \infty)

, then the score-matching loss function is defined as the expected Fisher divergence:

L(\theta) = E_{t\sim \gamma, x_t \sim \rho_t}\left[\|f_\theta(x_t, t)\|^2 + 2\nabla\cdot f_\theta(x_t, t)\right]

After training,

f_\theta(x_t, t) \approx \nabla \ln\rho_t

, so we can perform the backwards diffusion process by first sampling

x_T \sim \mathcal{N}(0, I)

, then integrating the SDE from

t=T

to

t=0

:

x_{t-dt} = x_t + \frac{1}{2} \beta(t) x_t dt + \beta(t) f_\theta(x_t, t) dt + \sqrt{\beta(t)} dW_t

This may be done by any SDE integration method, such as the Euler–Maruyama method

. The name "noise conditional score network" is explained thus:

* "network", because f_\theta is implemented as a neural network.
* "score", because the output of the network is interpreted as approximating the score function \nabla\ln\rho_t.
* "noise conditional", because \rho_t is equal to \rho_0 blurred by an added Gaussian noise that increases with time, and so the score function depends on the amount of noise added.
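The backward integration described in this section can be sketched with a plain Euler–Maruyama loop. The example below is illustrative only: it assumes a constant \beta(t) = 1, a horizon T = 10, and one-dimensional Gaussian data \mathcal{N}(m, s^2), for which \rho_t and hence the score are known exactly and stand in for a trained f_\theta. The function names are hypothetical.

```python
import numpy as np

def reverse_sde_sample(score, beta, T, n_steps, shape, rng):
    """Euler-Maruyama integration of the backward SDE from t = T to t = 0."""
    dt = T / n_steps
    x = rng.standard_normal(shape)               # x_T ~ N(0, I)
    t = T
    for _ in range(n_steps):
        b = beta(t)
        drift = 0.5 * b * x + b * score(x, t)
        x = x + drift * dt + np.sqrt(b * dt) * rng.standard_normal(shape)
        t -= dt
    return x

m, s = 2.0, 0.5                                  # toy data: q = N(m, s^2)

def beta(t):
    return 1.0                                   # assumed constant schedule

def gaussian_score(x, t):
    """Exact score of rho_t for Gaussian data and beta(t) = 1."""
    decay = np.exp(-t)                           # e^{-int_0^t beta}
    mean_t = np.sqrt(decay) * m
    var_t = decay * s ** 2 + 1.0 - decay
    return -(x - mean_t) / var_t

samples = reverse_sde_sample(gaussian_score, beta, T=10.0, n_steps=1000,
                             shape=(20000,), rng=np.random.default_rng(0))
```

The samples should end up with mean close to m and standard deviation close to s, up to the first-order discretisation error of the integrator.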

Their equivalence

DDPM and score-based generative models are equivalent. This means that a network trained using DDPM can be used as an NCSN, and vice versa. We know that

x_t | x_0 \sim \mathcal{N}\left(\sqrt{\bar\alpha_t} x_0, \sigma_t^2 I\right)

, so by Tweedie's formula, we have

\nabla_{x_t}\ln q(x_t) = \frac{1}{\sigma_t^2}\left(-x_t + \sqrt{\bar\alpha_t} E_q[x_0 | x_t]\right)

As described previously, the DDPM loss function is

\sum_t L_{simple, t}

with

L_{simple, t} = E_{x_0\sim q; z \sim \mathcal{N}(0, I)}\left[ \left\| \epsilon_\theta(x_t, t) - z \right\|^2\right]

where

x_t = \sqrt{\bar\alpha_t} x_0 + \sigma_t z

. By a change of variables,

L_{simple, t} = E_{x_0, x_t\sim q}\left[ \left\| \epsilon_\theta(x_t, t) - \frac{x_t -\sqrt{\bar\alpha_t} x_0}{\sigma_t} \right\|^2\right] = E_{x_t\sim q, x_0\sim q(\cdot | x_t)}\left[ \left\| \epsilon_\theta(x_t, t) - \frac{x_t -\sqrt{\bar\alpha_t} x_0}{\sigma_t} \right\|^2\right]

and the term inside becomes a least squares regression, so if the network actually reaches the global minimum of loss, then we have

\epsilon_\theta(x_t, t) = \frac{x_t -\sqrt{\bar\alpha_t} E_q[x_0 | x_t]}{\sigma_t} = -\sigma_t\nabla_{x_t}\ln q(x_t)

Thus, a score-based network predicts noise, and can be used for denoising.

Conversely, the continuous limit

x_{t-1} = x_{t-dt}, \beta_t = \beta(t) dt, z_t\sqrt{dt} = dW_t

of the backward equation

x_{t-1} = \frac{x_t}{\sqrt{\alpha_t}} - \frac{\beta_t}{\sigma_t\sqrt{\alpha_t}} \epsilon_\theta(x_t, t) + \sqrt{\beta_t} z_t; \quad z_t \sim \mathcal{N}(0, I)

gives us precisely the same equation as score-based diffusion:

x_{t-dt} = x_t(1+\beta(t)dt / 2) + \beta(t) \nabla_{x_t}\ln q(x_t) dt + \sqrt{\beta(t)}dW_t

Thus, at infinitesimal steps of DDPM, a denoising network performs score-based diffusion.

Main variants

Noise schedule

In DDPM, the sequence of numbers

0 = \sigma_0 < \sigma_1 < \cdots < \sigma_T < 1

is called a (discrete time) noise schedule. In general, consider a strictly increasing monotonic function

\sigma

of type

\R \to (0, 1)

, such as the

sigmoid function

. In that case, a noise schedule is a sequence of real numbers

\lambda_1 < \lambda_2 < \cdots < \lambda_T

. It then defines a sequence of noises

\sigma_t := \sigma(\lambda_t)

, which then derives the other quantities

\beta_t = 1 - \frac{1 - \sigma_t^2}{1 - \sigma_{t-1}^2}

. In order to use arbitrary noise schedules, instead of training a noise prediction model

\epsilon_\theta(x_t, t)

, one trains

\epsilon_\theta(x_t, \sigma_t)

. Similarly, for the noise conditional score network, instead of training

f_\theta(x_t, t)

, one trains

f_\theta(x_t, \sigma_t)
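As a small worked example of a schedule defined through \sigma, the sketch below takes evenly spaced \lambda_t (an arbitrary choice for illustration), maps them through the logistic sigmoid, and derives \beta_t from consecutive noise levels with \sigma_0 = 0. It then checks the result by rebuilding \bar\alpha_t from the \beta_t. The function names are hypothetical.

```python
import numpy as np

def sigmoid(lam):
    return 1.0 / (1.0 + np.exp(-lam))

def betas_from_sigmas(sigma):
    """beta_t = 1 - (1 - sigma_t^2) / (1 - sigma_{t-1}^2), with sigma_0 = 0."""
    sigma = np.asarray(sigma, dtype=float)
    sigma_prev = np.concatenate(([0.0], sigma[:-1]))
    return 1.0 - (1.0 - sigma ** 2) / (1.0 - sigma_prev ** 2)

lam = np.linspace(-6.0, 6.0, 50)        # lambda_1 < ... < lambda_T (assumed values)
sigma = sigmoid(lam)                    # sigma_t = sigma(lambda_t), increasing in (0, 1)
beta = betas_from_sigmas(sigma)
alpha_bar = np.cumprod(1.0 - beta)      # should equal 1 - sigma_t^2
```

Because the product of the \alpha_t telescopes, the recovered \bar\alpha_t matches 1 - \sigma_t^2 at every step, so any increasing sequence of noise levels in (0, 1) yields valid constants \beta_t \in (0, 1).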

Denoising Diffusion Implicit Model (DDIM)

The original DDPM method for generating images is slow, since the forward diffusion process usually takes

T \sim 1000

steps to make the distribution of

x_T

appear close to Gaussian. However, this means the backward diffusion process also takes 1000 steps. Unlike the forward diffusion process, which can skip steps as

x_t | x_0

is Gaussian for all

t \geq 1

, the backward diffusion process does not allow skipping steps. For example, to sample

x_{t-2} | x_{t-1} \sim \mathcal{N}(\mu_\theta(x_{t-1}, t-1), \Sigma_\theta(x_{t-1}, t-1))

requires the model to first sample

x_{t-1}

. Attempting to directly sample

x_{t-2} | x_t

would require us to marginalize out

x_{t-1}

, which is generally intractable. DDIM is a method to take any model trained on the DDPM loss, and use it to sample with some steps skipped, sacrificing an adjustable amount of quality. If we generalize the Markovian chain in DDPM to the non-Markovian case, DDIM corresponds to the case where the reverse process has variance equal to 0. In other words, the reverse process (and also the forward process) is deterministic. When using fewer sampling steps, DDIM outperforms DDPM. In detail, the DDIM sampling method is as follows. Start with the forward diffusion process

x_t = \sqrt{\bar\alpha_t} x_0 + \sigma_t \epsilon

. Then, during the backward denoising process, given

x_t, \epsilon_\theta(x_t, t)

, the original data is estimated as

x_0' = \frac{x_t - \sigma_t \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}

then the backward diffusion process can jump to any step

0 \leq s < t

, and the next denoised sample is

x_s = \sqrt{\bar\alpha_s} x_0' + \sqrt{\sigma_s^2 - (\sigma_s')^2} \epsilon_\theta(x_t, t) + \sigma_s' \epsilon

where

\sigma_s'

is an arbitrary real number within the range

[0, \sigma_s]

, and

\epsilon \sim \mathcal{N}(0, I)

is a newly sampled Gaussian noise. If all

\sigma_s' = 0

, then the backward process becomes deterministic, and this special case of DDIM is also called "DDIM". The original paper noted that when the process is deterministic, samples generated with only 20 steps are already very similar to ones generated with 1000 steps at a high level.

The original paper recommended defining a single "eta value"

\eta \in [0, 1]

, such that

\sigma_s' = \eta \tilde\sigma_s

. When

\eta = 1

, this is the original DDPM. When

\eta = 0

, this is the fully deterministic DDIM. For intermediate values, the process interpolates between them. By the equivalence, the DDIM algorithm also applies for score-based diffusion models.
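A compact sketch of the DDIM update with step skipping is given below. Several pieces are assumptions of the example rather than statements from the text: the linear \beta schedule, the evenly spaced subsequence of 20 time steps, the toy Gaussian data with its exact noise predictor, and in particular the form used for \tilde\sigma when jumping from t to s, taken here as (\sigma_s/\sigma_t)\sqrt{1 - \bar\alpha_t/\bar\alpha_s}, which reduces to \tilde\sigma_t when s = t-1.

```python
import numpy as np

def ddim_sample(eps_model, x_T, alpha_bar, n_steps, eta, rng):
    """DDIM sampling over an evenly spaced subsequence of time steps."""
    T = len(alpha_bar)
    ab = np.concatenate(([1.0], alpha_bar))              # ab[t] = bar alpha_t, ab[0] = 1
    steps = np.linspace(T, 0, n_steps + 1).astype(int)   # T = t_0 > ... > t_n = 0
    x = x_T
    for t, s in zip(steps[:-1], steps[1:]):
        sig_t, sig_s = np.sqrt(1.0 - ab[t]), np.sqrt(1.0 - ab[s])
        eps = eps_model(x, t)
        x0_est = (x - sig_t * eps) / np.sqrt(ab[t])      # estimate of the original data
        # Assumed generalisation of tilde sigma to a jump from t to s.
        sig_prime = eta * (sig_s / sig_t) * np.sqrt(1.0 - ab[t] / ab[s])
        noise = rng.standard_normal(x.shape)
        x = (np.sqrt(ab[s]) * x0_est
             + np.sqrt(sig_s ** 2 - sig_prime ** 2) * eps
             + sig_prime * noise)
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
m, s_data = 2.0, 0.5                                     # toy data: q = N(m, s_data^2)

def gaussian_eps(x, t):
    """Exact E[z | x_t] for Gaussian data; stands in for a trained network."""
    ab_t = alpha_bar[t - 1]
    return np.sqrt(1.0 - ab_t) * (x - np.sqrt(ab_t) * m) / (ab_t * s_data ** 2 + 1.0 - ab_t)

x_T = np.random.default_rng(0).standard_normal(5000)
det_a = ddim_sample(gaussian_eps, x_T, alpha_bar, 20, 0.0, np.random.default_rng(1))
det_b = ddim_sample(gaussian_eps, x_T, alpha_bar, 20, 0.0, np.random.default_rng(2))
sto_a = ddim_sample(gaussian_eps, x_T, alpha_bar, 20, 1.0, np.random.default_rng(1))
sto_b = ddim_sample(gaussian_eps, x_T, alpha_bar, 20, 1.0, np.random.default_rng(2))
```

With \eta = 0 the output depends only on x_T, so `det_a` and `det_b` coincide, while with \eta = 1 the two runs differ because fresh noise is injected at every step.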

Latent diffusion model (LDM)

Since the diffusion model is a general method for modelling probability distributions, if one wants to model a distribution over images, one can first encode the images into a lower-dimensional space by an encoder, then use a diffusion model to model the distribution over encoded images. Then to generate an image, one can sample from the diffusion model, then use a decoder to decode it into an image. The encoder-decoder pair is most often a

variational autoencoder

(VAE).

Architectural improvements

Later work proposed various architectural improvements. For example, it proposed log-space interpolation during backward sampling. Instead of sampling from

x_{t-1} \sim \mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), \tilde\sigma_t^2 I)

, the authors recommended sampling from

\mathcal{N}(\tilde\mu_t(x_t, \tilde x_0), (\sigma_t^v \tilde\sigma_t^{1-v})^2 I)

for a learned parameter

v

. In the v-prediction formalism, the noising formula

x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon_t

is reparameterised by an angle

\phi_t

such that

\cos \phi_t = \sqrt{\bar\alpha_t}

and a "velocity" defined by

\cos\phi_t \epsilon_t - \sin\phi_t x_0

. The network is trained to predict the velocity

\hat v_\theta

, and denoising is by

x_{\phi_t - \delta} = \cos(\delta)\; x_{\phi_t} - \sin(\delta) \hat{v}_{\theta}\; (x_{\phi_t})

. This parameterization was found to improve performance, as the model can be trained to reach total noise (i.e.

\phi_t = 90^\circ

) and then reverse it, whereas the standard parameterization never reaches total noise since

\sqrt{\bar\alpha_t} > 0

is always true.
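The denoising rule of the v-prediction formalism is a rotation in the plane spanned by x_0 and \epsilon_t, which can be verified numerically. The sketch below uses the true velocity in place of the network output \hat v_\theta and arbitrary values for the angle and step, so it only checks the algebraic identity, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)        # a data point
eps = rng.standard_normal(8)       # the noise paired with it

def noised(phi):
    """x_phi = cos(phi) x_0 + sin(phi) eps, since cos(phi_t) = sqrt(bar alpha_t)."""
    return np.cos(phi) * x0 + np.sin(phi) * eps

def velocity(phi):
    """v = cos(phi) eps - sin(phi) x_0."""
    return np.cos(phi) * eps - np.sin(phi) * x0

phi, delta = 1.2, 0.3              # arbitrary angle and step, in radians
x_prev = np.cos(delta) * noised(phi) - np.sin(delta) * velocity(phi)
```

Here `x_prev` equals `noised(phi - delta)` up to rounding, and `noised(np.pi / 2)` is pure noise, matching the remark that this parameterization can reach total noise.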

Classifier guidance

Classifier guidance was proposed in 2021 to improve class-conditional generation by using a classifier. The original publication used CLIP text encoders to improve text-conditional image generation. Suppose we wish to sample not from the entire distribution of images, but conditional on the image description. We don't want to sample a generic image, but an image that fits the description "black cat with red eyes". Generally, we want to sample from the distribution

p(x | y)

, where

x

ranges over images, and

y

ranges over classes of images (a description "black cat with red eyes" is just a very detailed class, and a class "cat" is just a very vague description). Taking the perspective of the noisy channel model, we can understand the process as follows: To generate an image

x

conditional on description

y

, we imagine that the requester really had in mind an image

x

, but the image is passed through a noisy channel and came out garbled, as

y

. Image generation is then nothing but inferring which

x

the requester had in mind. In other words, conditional image generation is simply "translating from a textual language into a pictorial language". Then, as in noisy-channel model, we use Bayes theorem to get

p(x \vert y) \propto p(y \vert x)p(x)

in other words, if we have a good model of the space of all images, and a good image-to-class translator, we get a class-to-image translator "for free". In the equation for backward diffusion, the score

\nabla \ln p(x)

can be replaced by

\nabla_x \ln p(x \vert y) = \underbrace{\nabla_x \ln p(x)}_{\text{score}} + \underbrace{\nabla_x \ln p(y \vert x)}_{\text{classifier guidance}}

where

\nabla_x \ln p(x)

is the score function, trained as previously described, and

\nabla_x \ln p(y \vert x)

is found by using a differentiable image classifier. During the diffusion process, we need to condition on the time, giving

\nabla_{x_t} \ln p(x_t \vert y, t) = \nabla_{x_t} \ln p(y \vert x_t, t) + \nabla_{x_t} \ln p(x_t \vert t)

Usually, however, the classifier model does not depend on time, in which case

p(y \vert x_t, t) = p(y \vert x_t)

. Classifier guidance is defined for the gradient of score function, thus for score-based diffusion network, but as previously noted, score-based diffusion models are equivalent to denoising models by

\epsilon_\theta(x_t, t) = -\sigma_t\nabla_{x_t}\ln p(x_t \vert t)

, and similarly,

\epsilon_\theta(x_t, y, t) = -\sigma_t\nabla_{x_t}\ln p(x_t \vert y, t)

. Therefore, classifier guidance works for denoising diffusion as well, using the modified noise prediction:

\epsilon_\theta(x_t, y, t) = \epsilon_\theta(x_t, t) - \underbrace{\sigma_t \nabla_{x_t} \ln p(y \vert x_t, t)}_{\text{classifier guidance}}

With temperature

The classifier-guided diffusion model samples from

p(x \vert y)

, which is concentrated around the maximum a posteriori estimate

\arg\max_x p(x \vert y)

. If we want to force the model to move towards the maximum likelihood estimate

\arg\max_x p(y \vert x)

, we can use

p_\gamma(x \vert y) \propto p(y \vert x)^\gamma p(x)

where

\gamma > 0

is interpretable as inverse temperature. In the context of diffusion models, it is usually called the guidance scale. A high

\gamma

would force the model to sample from a distribution concentrated around

\arg\max_x p(y \vert x)

. This sometimes improves quality of generated images. This gives a modification to the previous equation:

\nabla_x \ln p_\gamma(x \vert y) = \nabla_x \ln p(x) + \gamma \nabla_x \ln p(y \vert x)

For denoising models, it corresponds to

\epsilon_\theta(x_t, y, t) = \epsilon_\theta(x_t, t) - \gamma \sigma_t \nabla_{x_t} \ln p(y \vert x_t, t)
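The guided score and the guided noise prediction can be sketched in a few lines. This is an illustrative NumPy example under toy assumptions: the function names are hypothetical, the prior is a one-dimensional standard Gaussian, and a Gaussian likelihood stands in for the differentiable classifier so that the exact posterior score is available for comparison.

```python
import numpy as np

def guided_score(score_prior, grad_log_classifier, gamma=1.0):
    """Score of p(x) plus gamma times the classifier gradient."""
    return score_prior + gamma * grad_log_classifier

def guided_eps(eps_uncond, grad_log_classifier, sigma_t, gamma=1.0):
    """Noise-prediction form of the same guidance."""
    return eps_uncond - gamma * sigma_t * grad_log_classifier

# Toy setting (assumption): p(x) = N(0, 1), p(y | x) = N(y; x, s2).
x, y, s2 = 0.7, 2.0, 0.5
score_prior = -x                  # d/dx ln N(x; 0, 1)
grad_cls = (y - x) / s2           # d/dx ln N(y; x, s2)

# Exact posterior for this conjugate pair, used only as a check.
post_mean = y / (1.0 + s2)
post_var = s2 / (1.0 + s2)
score_post = -(x - post_mean) / post_var

sigma_t = 0.8                     # arbitrary noise level for the eps form
eps_uncond = -sigma_t * score_prior
eps_guided = guided_eps(eps_uncond, grad_cls, sigma_t, gamma=2.0)
```

At gamma = 1 the guided score equals the exact posterior score. Larger gamma weights the classifier term more heavily, which is the sharpening effect described above.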

Classifier-free guidance (CFG)

If we do not have a classifier

p(y \vert x)

, we could still extract one out of the image model itself:

\nabla_x \ln p_\gamma(x \vert y) = (1-\gamma) \nabla_x \ln p(x) + \gamma \nabla_x \ln p(x \vert y)

Such a model is usually trained by presenting it with both

(x, y)

and

(x, {\rm None})

, allowing it to model both

\nabla_x\ln p(x \vert y)

and

\nabla_x\ln p(x)

. Note that for CFG, the diffusion model cannot be merely a generative model of the entire data distribution

\nabla_x \ln p(x)

. It must be a conditional generative model

\nabla_x \ln p(x \vert y)

. For example, in Stable Diffusion, the diffusion backbone takes as input a noisy image

x_t

, a time

t

, and a conditioning vector

y

(such as a vector encoding a text prompt), and produces a noise prediction

\epsilon_\theta(x_t, y, t)

. For denoising models, it corresponds to

\epsilon_\theta(x_t, y, t, \gamma) = \epsilon_\theta(x_t, t) + \gamma (\epsilon_\theta(x_t, y, t) - \epsilon_\theta(x_t, t))

As sampled by DDIM, the algorithm can be written as

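\begin{aligned}
\epsilon_{\text{uncond}} &\leftarrow \epsilon_\theta(x_t, t) \\
\epsilon_{\text{cond}} &\leftarrow \epsilon_\theta(x_t, t, c) \\
\epsilon_{\text{CFG}} &\leftarrow \epsilon_{\text{uncond}} + \gamma(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})\\
x_0 &\leftarrow (x_t - \sigma_t \epsilon_{\text{CFG}}) / \sqrt{1 - \sigma_t^2}\\
x_s &\leftarrow \sqrt{1 - \sigma_s^2} x_0 + \sqrt{\sigma_s^2 - (\sigma_s')^2} \epsilon_{\text{CFG}} + \sigma_s' \epsilon\\
\end{aligned}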

A similar technique applies to language model sampling. Also, if the unconditional generation

\epsilon_{\text{uncond}} \leftarrow \epsilon_\theta(x_t, t)

is replaced by

\epsilon_{\text{neg cond}} \leftarrow \epsilon_\theta(x_t, t, c')

, then it results in negative prompting, which pushes the generation away from

c'

condition.
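The DDIM update with classifier-free guidance can be sketched as a single function. This is a minimal NumPy sketch under the convention used above, where the signal coefficient is the square root of 1 - sigma_t^2. The function and argument names are hypothetical, and the two noise predictions are passed in as arrays instead of being computed by a real network.

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, gamma):
    """Combine unconditional and conditional noise predictions."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

def ddim_cfg_step(x_t, eps_uncond, eps_cond, sigma_t, sigma_s, gamma,
                  sigma_s_prime=0.0, noise=None):
    """One DDIM step from noise level sigma_t to sigma_s with CFG.

    sigma_s_prime is the amount of fresh noise added (0 gives the
    deterministic update). Returns (x_s, predicted x_0).
    """
    eps = cfg_eps(eps_uncond, eps_cond, gamma)
    x0_hat = (x_t - sigma_t * eps) / np.sqrt(1.0 - sigma_t**2)
    x_s = (np.sqrt(1.0 - sigma_s**2) * x0_hat
           + np.sqrt(sigma_s**2 - sigma_s_prime**2) * eps)
    if sigma_s_prime > 0.0:
        if noise is None:
            noise = np.random.default_rng().normal(size=np.shape(x_t))
        x_s = x_s + sigma_s_prime * noise
    return x_s, x0_hat

# Consistency check with an oracle that knows the true noise.
rng = np.random.default_rng(0)
x0_true = rng.normal(size=3)
eps_true = rng.normal(size=3)
sigma_t, sigma_s = 0.9, 0.6
x_t = np.sqrt(1.0 - sigma_t**2) * x0_true + sigma_t * eps_true
x_s, x0_hat = ddim_cfg_step(x_t, eps_true, eps_true, sigma_t, sigma_s, gamma=7.5)
```

Negative prompting fits the same function: pass the prediction conditioned on c' as `eps_uncond`.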

Samplers

Given a diffusion model, one may regard it either as a continuous process, and sample from it by integrating a SDE, or one can regard it as a discrete process, and sample from it by iterating the discrete steps. The choice of the "noise schedule"

\beta_t

can also affect the quality of samples. A noise schedule is a function that sends a natural number to a noise level:

t \mapsto \beta_t, \quad t \in \{1, 2, \dots\}, \quad \beta_t \in (0, 1)

A noise schedule is more often specified by a map

t \mapsto \sigma_t

. The two definitions are equivalent, since

\beta_t = 1 - \frac{1 - \sigma_t^2}{1 - \sigma_{t-1}^2}

. In the DDPM perspective, one can use the DDPM itself (with noise), or DDIM (with adjustable amount of noise). The case where one adds noise is sometimes called ancestral sampling. One can interpolate between noise and no noise. The amount of noise is denoted

\eta

("eta value") in the DDIM paper, with

\eta = 0

denoting no noise (as in ''deterministic'' DDIM), and

\eta = 1

denoting full noise (as in DDPM). In the perspective of SDE, one can use any of the numerical integration methods, such as

Heun's method, linear multistep methods, etc. Just as in the discrete case, one can add an adjustable amount of noise during the integration. A survey and comparison of samplers in the context of image generation is in.
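The equivalence between the two noise-schedule parameterizations can be checked with a short round trip. This is a sketch in NumPy, assuming the convention that sigma_t^2 is one minus the running product of (1 - beta_i), with sigma_0 = 0. The function names are hypothetical and the linear schedule values are illustrative.

```python
import numpy as np

def sigmas_from_betas(betas):
    """sigma_t = sqrt(1 - prod_{i <= t} (1 - beta_i))."""
    alpha_bar = np.cumprod(1.0 - betas)
    return np.sqrt(1.0 - alpha_bar)

def betas_from_sigmas(sigmas):
    """beta_t = 1 - (1 - sigma_t^2) / (1 - sigma_{t-1}^2), with sigma_0 = 0."""
    alpha_bar = 1.0 - sigmas**2
    alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))
    return 1.0 - alpha_bar / alpha_bar_prev

betas = np.linspace(1e-4, 0.02, 1000)   # illustrative linear schedule
sigmas = sigmas_from_betas(betas)
betas_back = betas_from_sigmas(sigmas)
```

The recovered `betas_back` matches `betas` up to floating-point error, and `sigmas` increases monotonically towards 1.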

Other examples

Notable variants include Poisson flow generative model, consistency model, critically-damped Langevin diffusion, GenPhys, cold diffusion, discrete diffusion, etc.

Flow-based diffusion model

Abstractly speaking, the idea of a diffusion model is to take an unknown probability distribution (the distribution of natural-looking images), then progressively convert it to a known probability distribution (the standard Gaussian distribution), by building an absolutely continuous probability path connecting them. The probability path is in fact defined implicitly by the score function

\nabla \ln p_t

. In denoising diffusion models, the forward process adds noise, and the backward process removes noise. Both the forward and backward processes are SDEs, though the forward process is integrable in closed form, so it can be done at no computational cost. The backward process is not integrable in closed form, so it must be integrated step by step by standard SDE solvers, which can be very expensive. The probability path in diffusion models is defined through an Itô process, and one can retrieve the deterministic process by using the probability flow ODE formulation. In flow-based diffusion models, the forward process is a deterministic flow along a time-dependent vector field, and the backward process is also a deterministic flow along the same vector field, but going backwards. Both processes are solutions to ODEs

. If the vector field is well-behaved, the ODE will also be well-behaved. Given two distributions

\pi_0

and

\pi_1

, a flow-based model is a time-dependent velocity field

v_t(x)

on

[0, 1] \times \mathbb R^d

, such that if we start by sampling a point

x  \sim \pi_0

, and let it move according to the velocity field:

\frac{d}{dt} \phi_t(x) = v_t(\phi_t(x)), \quad t \in [0, 1], \quad \text{starting from } \phi_0(x) = x

we end up with a point

x_1 \sim \pi_1

. The solution

\phi_t

of the above ODE defines a probability path

p_t = [\phi_t]_{\#} \pi_0

by the pushforward measure operator. In particular,

[\phi_1]_{\#} \pi_0 = \pi_1

. The probability path and the velocity field also satisfy the continuity equation

, in the sense of probability distribution:

\partial_t p_t + \nabla \cdot (v_t  p_t) = 0

To construct a probability path, we start by constructing a conditional probability path

p_t(x \vert z)

and the corresponding conditional velocity field

v_t(x \vert z)

on some conditional distribution

q(z)

. A natural choice is the Gaussian conditional probability path:

p_t(x \vert z) = \mathcal{N} \left( m_t(z), \zeta_t^2 I \right)

The conditional velocity field which corresponds to the geodesic path between conditional Gaussian path is

v_t(x \vert z) = \frac{\zeta_t'}{\zeta_t} (x - m_t(z)) + m_t'(z)

The probability path and velocity field are then computed by marginalizing

p_t(x) = \int p_t(x \vert z) q(z) dz \qquad \text{ and } \qquad v_t(x) = \mathbb{E}_{q(z)} \left[ \frac{v_t(x \vert z) p_t(x \vert z)}{p_t(x)} \right]
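The conditional velocity field can be checked numerically by integrating it with the Euler method. This is a minimal NumPy sketch under assumed choices that do not come from the text above: the mean path is m_t(z) = t z, the width is zeta_t = 1 - (1 - SIGMA_MIN) t, and `SIGMA_MIN` is a hypothetical small terminal width. A point starting at a standard Gaussian sample e should end near z + SIGMA_MIN * e.

```python
import numpy as np

SIGMA_MIN = 0.05  # assumed terminal width of the conditional Gaussian

def m(t, z):
    return t * z                           # assumed mean path m_t(z)

def m_dot(t, z):
    return z                               # time derivative of m_t(z)

def zeta(t):
    return 1.0 - (1.0 - SIGMA_MIN) * t     # assumed width zeta_t

def zeta_dot(t):
    return -(1.0 - SIGMA_MIN)              # time derivative of zeta_t

def cond_velocity(t, x, z):
    """v_t(x | z) = zeta_t' / zeta_t * (x - m_t(z)) + m_t'(z)."""
    return zeta_dot(t) / zeta(t) * (x - m(t, z)) + m_dot(t, z)

def euler_flow(x_start, z, n_steps=1000):
    """Integrate dx/dt = v_t(x | z) from t = 0 to t = 1."""
    x = np.array(x_start, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * cond_velocity(k * dt, x, z)
    return x

rng = np.random.default_rng(0)
z = np.array([2.0, -1.0])
e = rng.normal(size=2)      # sample from p_0(x | z) = N(0, I)
x1 = euler_flow(e, z)
```

Because both m_t and zeta_t are linear in t under these choices, each trajectory is a straight line and the Euler method follows it closely.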

Optimal transport flow

The idea of optimal transport flow is to construct a probability path minimizing the

Wasserstein metric

. The distribution on which we condition is an approximation of the optimal transport plan between

\pi_0

and

\pi_1

, namely

z = (x_0, x_1)

and

q(z) = \Gamma(\pi_0, \pi_1)

, where

\Gamma

is the optimal transport plan, which can be approximated by mini-batch optimal transport. If the batch size is not large, then the transport it computes can be very far from the true optimal transport.
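Mini-batch optimal transport is easiest to illustrate in one dimension, where the optimal coupling of two equal-size batches under squared cost is obtained by matching sorted samples. The following NumPy sketch relies on that one-dimensional special case. The function names and toy distributions are assumptions, and higher-dimensional data would need a general assignment or optimal-transport solver instead.

```python
import numpy as np

def minibatch_ot_pairs_1d(x0, x1):
    """Pair two equal-size 1-D batches optimally under squared cost.

    In one dimension the optimal pairing matches the samples in
    sorted order.
    """
    return np.sort(x0), np.sort(x1)

def mean_sq_cost(a, b):
    """Average squared distance between paired samples."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=256)   # toy batch from pi_0
x1 = rng.normal(3.0, 0.5, size=256)   # toy batch from pi_1

random_cost = mean_sq_cost(x0, x1)    # independent (random) pairing
a, b = minibatch_ot_pairs_1d(x0, x1)
ot_cost = mean_sq_cost(a, b)          # sorted (optimal) pairing
```

The sorted pairing never costs more than the random one. It is only optimal within the batch, which is why small batches can be far from the true transport plan.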

Rectified flow

The idea of rectified flow is to learn a flow model such that the velocity is nearly constant along each flow path. This is beneficial, because we can integrate along such a vector field with very few steps. For example, if an ODE

\dot{\phi}_t(x) = v_t(\phi_t(x))

follows perfectly straight paths, it simplifies to

\phi_t(x) = x_0 + t \cdot v_0(x_0)

, allowing for exact solutions in one step. In practice, we cannot reach such perfection, but when the flow field is nearly so, we can take a few large steps instead of many little steps. The general idea is to start with two distributions

\pi_0

and

\pi_1

, then construct a flow field

\phi^0 = \{\phi_t : t \in [0, 1]\}

from it, then repeatedly apply a "reflow" operation to obtain successive flow fields

\phi^1, \phi^2, \dots

, each straighter than the previous one. When the flow field is straight enough for the application, we stop. Generally, for any time-differentiable process

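\phi_t

,

v_t

can be estimated by solving:

\min_{\theta} \int_0^1 \mathbb{E}_{x \sim p_t}\left[\lVert v_t(x, \theta) - v_t(x) \rVert^2\right] \,\mathrm{d}t.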

By injecting the strong prior that intermediate trajectories are straight, rectified flow achieves both theoretical relevance for optimal transport and computational efficiency, as ODEs with straight paths can be simulated exactly without time discretization.

Specifically, rectified flow seeks to match an ODE with the marginal distributions of the linear interpolation between points from distributions

\pi_0

and

\pi_1

. Given observations

x_0 \sim \pi_0

and

x_1 \sim \pi_1

, the canonical linear interpolation

x_t = t x_1 + (1-t) x_0, \quad t \in [0, 1]

yields a trivial case

\dot{x}_t = x_1 - x_0

, which cannot be causally simulated without

x_1

. To address this,

x_t

is "projected" into a space of causally simulatable ODEs, by minimizing the least squares loss with respect to the direction

x_1 - x_0

:

\min_{\theta} \int_0^1 \mathbb{E}_{(x_0, x_1)}\left[\lVert (x_1 - x_0) - v_t(x_t, \theta) \rVert^2\right] \,\mathrm{d}t.

The data pair

(x_0, x_1)

can be any coupling of

\pi_0

and

\pi_1

, typically independent (i.e.,

(x_0, x_1) \sim \pi_0 \times \pi_1

), obtained by randomly combining observations from

\pi_0

and

\pi_1

. This process ensures that the trajectories closely mirror the density map of the

x_t

trajectories but ''reroute'' at intersections to ensure causality.
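The least squares projection can be illustrated on a one-dimensional toy problem. This sketch swaps the neural network for a separate linear fit at each time step, an assumption that is exact here only because both toy distributions are Gaussian. The constants and names are hypothetical, and no reflow step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
N, STEPS = 20000, 50
MEAN1, STD1 = 3.0, 0.5     # assumed toy target pi_1 = N(3, 0.5^2)

x0 = rng.normal(0.0, 1.0, size=N)       # samples from pi_0 = N(0, 1)
x1 = rng.normal(MEAN1, STD1, size=N)    # samples from pi_1, independent coupling

# For each time t on a grid, fit v_t(x) = slope * x + intercept by least
# squares, regressing the direction x1 - x0 on the interpolant x_t.
ts = np.arange(STEPS) / STEPS
coeffs = []
for t in ts:
    x_t = t * x1 + (1.0 - t) * x0
    slope, intercept = np.polyfit(x_t, x1 - x0, deg=1)
    coeffs.append((slope, intercept))

def sample(x_start):
    """Euler-integrate the fitted velocity field from t = 0 to t = 1."""
    x = np.array(x_start, dtype=float)
    for slope, intercept in coeffs:
        x = x + (slope * x + intercept) / STEPS
    return x

samples = sample(rng.normal(0.0, 1.0, size=N))
```

The fitted field depends only on the current position and time, so it can be simulated without access to x1, and the generated samples have mean and standard deviation close to those of the target.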

A distinctive aspect of rectified flow is its capability for "reflow", which straightens the trajectory of ODE paths. Denote the rectified flow

\phi^0 = \{\phi_t : t \in [0, 1]\}

induced from

(x_0,x_1)

as

\phi^0 = \mathsf{Rectflow}((x_0,x_1))

. Recursively applying this

\mathsf{Rectflow}(\cdot)

operator generates a series of rectified flows

\phi^{k+1} = \mathsf{Rectflow}((\phi_0^k(x_0), \phi_1^k(x_0)))

. This "reflow" process not only reduces transport costs but also straightens the paths of rectified flows, making

\phi^k

paths straighter with increasing

k

. Rectified flow includes a nonlinear extension where linear interpolation

x_t

is replaced with any time-differentiable curve that connects

x_0

and

x_1

, given by

x_t = \alpha_t x_1 + \beta_t x_0

. This framework encompasses DDIM and probability flow ODEs as special cases, with particular choices of

\alpha_t

and

\beta_t

. However, in the case where the path of

x_t

is not straight, the reflow process no longer ensures a reduction in convex transport costs, and also no longer straightens the paths of

\phi_t

.

Choice of architecture

X-Y plot of algorithmically-generated AI art of European-style castle in Japan demonstrating DDIM diffusion steps

Diffusion model

For generating images by DDPM, we need a neural network that takes a time

t

and a noisy image

x_t

, and predicts a noise

\epsilon_\theta(x_t, t)

from it. Since predicting the noise is the same as predicting the denoised image, then subtracting it from

x_t

, denoising architectures tend to work well. For example, the

U-Net, which was found to be good for denoising images, is often used for denoising diffusion models that generate images. For DDPM, the underlying architecture ("backbone") does not have to be a U-Net. It just has to predict the noise somehow. For example, the diffusion transformer (DiT) uses a Transformer to predict the mean and diagonal covariance of the noise, given the textual conditioning and the partially denoised image. It is the same as a standard U-Net-based denoising diffusion model, with a Transformer replacing the U-Net. A mixture-of-experts Transformer can also be applied. DDPM can be used to model general data distributions, not just natural-looking images. For example, Human Motion Diffusion models human motion trajectories by DDPM. Each human motion trajectory is a sequence of poses, represented by either joint rotations or positions. It uses a Transformer network to generate a less noisy trajectory out of a noisy one.

Conditioning

The base diffusion model can only generate unconditionally from the whole distribution. For example, a diffusion model learned on

ImageNet

would generate images that look like a random image from ImageNet. To generate images from just one category, one would need to impose the condition, and then sample from the conditional distribution. Whatever condition one wants to impose, one needs to first convert the conditioning into a vector of floating point numbers, then feed it into the underlying diffusion model neural network. However, one has freedom in choosing how to convert the conditioning into a vector. Stable Diffusion, for example, imposes conditioning in the form of cross-attention mechanism, where the query is an intermediate representation of the image in the U-Net, and both key and value are the conditioning vectors. The conditioning can be selectively applied to only parts of an image, and new kinds of conditionings can be finetuned upon the base model, as used in ControlNet. As a particularly simple example, consider image inpainting. The conditions are

\tilde x

, the reference image, and

m

, the inpainting

mask. The conditioning is imposed at each step of the backward diffusion process, by first sampling

\tilde x_t \sim N\left(\sqrt{\bar\alpha_t} \tilde x, \sigma_t^2 I \right)

, a noisy version of

\tilde x

, then replacing

x_t

with

(1-m) \odot x_t + m \odot \tilde x_t

, where

\odot

means elementwise multiplication. Another application of the cross-attention mechanism is prompt-to-prompt image editing. Conditioning is not limited to generating images from a specific category or according to a specific caption (as in text-to-image). For example, human motion generation has been demonstrated conditioned on an audio clip of human walking (allowing motion to be synced to a soundtrack), a video of a human running, or a text description of human motion. For how conditional diffusion models are mathematically formulated, see a methodological summary in.
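The inpainting replacement step described above can be sketched as follows. This is a minimal NumPy sketch: the function and argument names are hypothetical, the noise level is taken as the square root of 1 - alpha_bar_t, and the mask equals 1 where the reference image is kept, following the formula in the text.

```python
import numpy as np

def inpaint_replace(x_t, x_ref, mask, alpha_bar_t, rng):
    """Overwrite the masked region of x_t with a freshly noised reference.

    Returns the updated sample and the noised reference that was used.
    """
    sigma_t = np.sqrt(1.0 - alpha_bar_t)
    x_ref_t = (np.sqrt(alpha_bar_t) * x_ref
               + sigma_t * rng.normal(size=x_ref.shape))
    return (1.0 - mask) * x_t + mask * x_ref_t, x_ref_t

rng = np.random.default_rng(0)
x_t = rng.normal(size=(4, 4))     # current sample in the backward process
x_ref = np.ones((4, 4))           # toy reference image
mask = np.zeros((4, 4))
mask[:, :2] = 1.0                 # keep the reference in the left half

x_new, x_ref_t = inpaint_replace(x_t, x_ref, mask, alpha_bar_t=0.5, rng=rng)
```

In a full sampler this replacement would be applied at every backward step, using that step's alpha_bar_t, before the denoising network is called.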

Upscaling

As generating an image takes a long time, one can try to generate a small image by a base diffusion model, then upscale it by other models. Upscaling can be done by

GAN

, or signal processing methods like

Lanczos resampling

. Diffusion models themselves can be used to perform upscaling. A cascading diffusion model stacks multiple diffusion models one after another, in the style of Progressive GAN. The lowest level is a standard diffusion model that generates a 32x32 image; the image is then upscaled by a diffusion model specifically trained for upscaling, and the process repeats. In more detail, the diffusion upscaler is trained as follows:

* Sample

(x_0, z_0, c)

, where

x_0

is the high-resolution image,

z_0

is the same image but scaled down to a low-resolution, and

c

is the conditioning, which can be the caption of the image, the class of the image, etc.

* Sample two white noises

\epsilon_x, \epsilon_z

, two time-steps

t_x, t_z

. Compute the noisy versions of the high-resolution and low-resolution images:

\begin{aligned}
x_{t_x} &= \sqrt{\bar\alpha_{t_x}} x_0 + \sigma_{t_x} \epsilon_x\\
z_{t_z} &= \sqrt{\bar\alpha_{t_z}} z_0 + \sigma_{t_z} \epsilon_z
\end{aligned}

* Train the denoising network to predict

\epsilon_x

given

x_{t_x}, z_{t_z}, t_x, t_z, c

. That is, apply gradient descent on

\theta

on the L2 loss

\| \epsilon_\theta(x_{t_x}, z_{t_z}, t_x, t_z, c) - \epsilon_x \|_2^2
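The training steps above can be sketched as a single loss evaluation. This is an illustrative NumPy sketch, not the published training code: `eps_theta` is a hypothetical callable standing in for the denoising network, the schedule is an arbitrary linear one, and no gradient step is shown.

```python
import numpy as np

def noisy(x, alpha_bar_t, eps):
    """sqrt(alpha_bar_t) * x + sigma_t * eps, with sigma_t^2 = 1 - alpha_bar_t."""
    return np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * eps

def upscaler_loss(eps_theta, x0, z0, c, alpha_bar, rng):
    """One Monte Carlo sample of the L2 loss for the diffusion upscaler."""
    t_x, t_z = rng.integers(0, len(alpha_bar), size=2)   # two time-steps
    eps_x = rng.normal(size=x0.shape)                    # two white noises
    eps_z = rng.normal(size=z0.shape)
    x_tx = noisy(x0, alpha_bar[t_x], eps_x)
    z_tz = noisy(z0, alpha_bar[t_z], eps_z)
    pred = eps_theta(x_tx, z_tz, t_x, t_z, c)
    return float(np.sum((pred - eps_x) ** 2))

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # illustrative
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))       # toy high-resolution image
z0 = x0[::2, ::2]                  # toy low-resolution version
c = None                           # conditioning is unused in this toy

def zero_net(x_tx, z_tz, t_x, t_z, c):
    """Untrained stand-in that always predicts zero noise."""
    return np.zeros_like(x_tx)

def oracle_net(x_tx, z_tz, t_x, t_z, c):
    """Cheating stand-in that recovers the noise using the known x0."""
    return (x_tx - np.sqrt(alpha_bar[t_x]) * x0) / np.sqrt(1.0 - alpha_bar[t_x])

loss_zero = upscaler_loss(zero_net, x0, z0, c, alpha_bar, rng)
loss_oracle = upscaler_loss(oracle_net, x0, z0, c, alpha_bar, rng)
```

The oracle drives the loss to essentially zero, which is the value a trained network approaches only in expectation. A real implementation would backpropagate this loss through the network parameters.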

Examples

This section collects some notable diffusion models, and briefly describes their architecture.

OpenAI

The DALL-E series by OpenAI are text-conditional image generation models. The first version of DALL-E (2021) is not actually a diffusion model. Instead, it uses a Transformer architecture that autoregressively generates a sequence of tokens, which is then converted to an image by the decoder of a discrete VAE. Released with DALL-E was the CLIP classifier, which was used by DALL-E to rank generated images according to how closely the image fits the text. GLIDE (2022-03) is a 3.5-billion-parameter diffusion model, and a small version was released publicly. Soon after, DALL-E 2 was released (2022-04). DALL-E 2 is a 3.5-billion-parameter cascaded diffusion model that generates images from text by "inverting the CLIP image encoder", a technique which they termed "unCLIP". The unCLIP method contains 4 models: a CLIP image encoder, a CLIP text encoder, an image decoder, and a "prior" model (which can be a diffusion model, or an autoregressive model). During training, the prior model is trained to convert CLIP text encodings to CLIP image encodings. The image decoder is trained to convert CLIP image encodings back to images. During inference, a text is converted by the CLIP text encoder to a vector, then it is converted by the prior model to an image encoding, then it is converted by the image decoder to an image. Sora (2024-02) is a diffusion Transformer model (DiT).

Stability AI

Stable Diffusion (2022-08), released by Stability AI, consists of a denoising latent diffusion model (860 million parameters), a VAE, and a text encoder. The denoising network is a U-Net, with cross-attention blocks to allow for conditional image generation. Stable Diffusion 3 (2024-03) changed the latent diffusion model from the U-Net to a Transformer model, and so it is a DiT. It uses rectified flow. Stable Video 4D (2024-07) is a latent diffusion model for videos of 3D objects.

Google

Imagen (2022) uses a T5-XXL language model to encode the input text into an embedding vector. It is a cascaded diffusion model with three sub-models. The first step denoises white noise to a 64×64 image, conditional on the embedding vector of the text. This model has 2B parameters. The second step upscales the image from 64×64 to 256×256, conditional on the embedding. This model has 650M parameters. The third step is similar, upscaling from 256×256 to 1024×1024. This model has 400M parameters. The three denoising networks are all U-Nets. Muse (2023-01) is not a diffusion model, but an encoder-only Transformer that is trained to predict masked image tokens from unmasked image tokens. Imagen 2 (2023-12) is also diffusion-based and can generate images from a prompt that mixes images and text. Imagen 3 (2024-05) is diffusion-based as well. No further information is available for either model. Veo (2024) generates videos by latent diffusion. The diffusion is conditioned on a vector that encodes both a text prompt and an image prompt.
