Deep Learning

Lecture 12: Diffusion models

???

R: it takes 2h30 to cover the first part (up to score-based models, excluded). => Stop there, show a code example instead, but keep the rest of the slides for reference. R: the applications are pretty cool but their presentation is too superficial. Go into more detail and explain where/how the diffusion models are used in each case. Drop a few examples if needed.

Today

VAEs
Variational diffusion models
Score-based generative models

Applications

A few motivating examples.

Content generation

.center[Diffusion models have emerged as powerful generative models, beating previous state-of-the-art models (such as GANs) on a variety of tasks.]

Image super-resolution


Text-to-image generation

.italic[A group of teddy bears in suite in a corporate office celebrating
the birthday of their friend. There is a pizza cake on the desk.]


.center.width-50[]

Artistic tools and image editing

.center.width-100[]

Inverse problems in medical imaging

.center.width-100[]

Data assimilation in ocean models

.center.width-65[]

VAEs

A short recap.

Variational autoencoders

???

Recap on the black board.

Training

$$\begin{aligned} \theta^{*}, \phi^{*} &= \arg \max_{\theta,\phi} \mathbb{E}_{p(\mathbf{x})} \text{ELBO}(\mathbf{x};\theta,\phi) \\\ &= \arg \max_{\theta,\phi} \mathbb{E}_{p(\mathbf{x})} \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \log \frac{p_\theta(\mathbf{x},\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} \right] \\\ &= \arg \max_{\theta,\phi} \mathbb{E}_{p(\mathbf{x})} \left[ \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \log p_\theta(\mathbf{x}|\mathbf{z})\right] - \text{KL}(q_\phi(\mathbf{z}|\mathbf{x}) || p(\mathbf{z})) \right]. \end{aligned}$$

Solution: Make $p(\mathbf{z})$ a learnable distribution.

???

Explain the maths on the black board, taking the expectation wrt $p(\mathbf{x})$ of the ELBO and consider the expected KL terms.

(Markovian) Hierarchical VAEs

The prior $p(\mathbf{z})$ is itself a VAE, and recursively so for its own hyper-prior.

Similarly to VAEs, training is done by maximizing the ELBO, using a variational distribution $q_\phi(\mathbf{z}_{1:T} | \mathbf{x})$ over all levels of latent variables: $$\begin{aligned} \log p_\theta(\mathbf{x}) &\geq \mathbb{E}_{q_\phi(\mathbf{z}_{1:T} | \mathbf{x})}\left[ \log \frac{p_\theta(\mathbf{x},\mathbf{z}_{1:T})}{q_\phi(\mathbf{z}_{1:T}|\mathbf{x})} \right] \end{aligned}$$

???

Rederive the ELBO.

Variational diffusion models

.center.width-100[]

Variational diffusion models are Markovian HVAEs with the following constraints:

The latent dimension is the same as the data dimension.
The encoder is fixed to linear Gaussian transitions $q(\mathbf{x}_t | \mathbf{x}_{t-1})$.
The hyper-parameters are set such that $q(\mathbf{x}_T | \mathbf{x}_0)$ is a standard Gaussian.

.center.width-100[]

Forward diffusion process

.center.width-100[]

With $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, we have $$\begin{aligned} \mathbf{x}_t &= \sqrt{ {\alpha}_t} \mathbf{x}_{t-1} + \sqrt{1-{\alpha}_t} \epsilon \\ q(\mathbf{x}_t | \mathbf{x}_{t-1}) &= \mathcal{N}(\mathbf{x}_t ; \sqrt{\alpha_t} \mathbf{x}_{t-1}, (1-\alpha_t)\mathbf{I}) \\ q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) &= \prod_{t=1}^T q(\mathbf{x}_t | \mathbf{x}_{t-1}) \end{aligned}$$
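The forward transitions can be simulated directly. The sketch below is a minimal NumPy illustration, assuming a linear schedule $\beta_t = 1 - \alpha_t$ from $10^{-4}$ to $0.02$ over $T=1000$ steps and a toy scalar dataset; the schedule values and the helper names `linear_alphas`, `forward_step` and `forward_chain` are illustrative choices, not part of the lecture.

```python
import numpy as np

def linear_alphas(T=1000, beta_min=1e-4, beta_max=0.02):
    """Return alpha_t = 1 - beta_t for t = 1..T (assumed linear schedule)."""
    return 1.0 - np.linspace(beta_min, beta_max, T)

def forward_step(x_prev, alpha_t, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, (1 - alpha_t) I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(alpha_t) * x_prev + np.sqrt(1.0 - alpha_t) * eps

def forward_chain(x0, alphas, rng):
    """Run the Markov chain x_0 -> x_1 -> ... -> x_T and return x_T."""
    x = x0
    for alpha_t in alphas:
        x = forward_step(x, alpha_t, rng)
    return x

rng = np.random.default_rng(0)
x0 = np.full(10_000, 3.0)            # toy data: 10,000 copies of the scalar 3
xT = forward_chain(x0, linear_alphas(), rng)
print(xT.mean(), xT.std())           # should be close to 0 and 1, respectively
```

With this schedule the signal is almost entirely destroyed after $T$ steps, which is the third constraint listed for variational diffusion models.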

???

Start drawing the full probabilistic graphical model as the forward and reverse processes are presented.

.center.width-100[]

Diffusion kernel

.center.width-100[]

With $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$ and $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, we have

$$\begin{aligned} \mathbf{x}_t &= \sqrt{\bar{\alpha}_t} \mathbf{x}_{0} + \sqrt{1-\bar{\alpha}_t} \epsilon \\\ q(\mathbf{x}_t | \mathbf{x}_{0}) &= \mathcal{N}(\mathbf{x}_t ; \sqrt{\bar{\alpha}_t} \mathbf{x}_{0}, (1-\bar{\alpha}_t)\mathbf{I}) \end{aligned}$$
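As a sanity check, the closed-form kernel can be compared with the one-step recursion. The sketch below assumes the same kind of linear schedule as is common in practice (the values are illustrative) and tracks the coefficient of $\mathbf{x}_0$ and the variance of $\mathbf{x}_t | \mathbf{x}_0$ step by step; `sample_kernel` is a hypothetical helper that draws $\mathbf{x}_t$ in one shot.

```python
import numpy as np

alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bars = np.cumprod(alphas)                # alpha_bar_t = prod_{i<=t} alpha_i

def sample_kernel(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) directly; t is 1-indexed. Also returns eps."""
    a_bar = alpha_bars[t - 1]
    eps = rng.standard_normal(np.shape(x0))
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

# Propagate the mean coefficient and the variance through the one-step
# transitions x_t = sqrt(alpha_t) x_{t-1} + sqrt(1 - alpha_t) eps.
coef, var = 1.0, 0.0
coefs, variances = [], []
for alpha_t in alphas:
    coef = np.sqrt(alpha_t) * coef
    var = alpha_t * var + (1.0 - alpha_t)
    coefs.append(coef)
    variances.append(var)

# Largest deviations from the closed form (both should be at rounding level).
print(np.max(np.abs(np.array(coefs) - np.sqrt(alpha_bars))))
print(np.max(np.abs(np.array(variances) - (1.0 - alpha_bars))))
```

Being able to sample $\mathbf{x}_t$ without iterating over the chain is what makes training on random timesteps cheap.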

.center.width-100[]

Diffusion kernel $q(\mathbf{x}_t | \mathbf{x}_{0})$ for different noise levels $t$.


.center.width-100[]

Marginal distribution $q(\mathbf{x}_t)$.


Reverse denoising process

.center.width-100[]

$$\begin{aligned} p_\theta(\mathbf{x}_{0:T}) &= p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\\ p(\mathbf{x}_T) &= \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I}) \\ p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) &= \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \sigma^2_\theta(\mathbf{x}_t, t)\mathbf{I}) \\ \mathbf{x}_{t-1} &= \mu_\theta(\mathbf{x}_t, t) + \sigma_\theta(\mathbf{x}_t, t) \mathbf{z} \end{aligned}$$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.

Training

For learning the parameters $\theta$ of the reverse process, we can form a variational lower bound on the log-likelihood of the data as

$$\mathbb{E}_{q(\mathbf{x}_0)}\left[ \log p_\theta(\mathbf{x}_0) \right] \geq \mathbb{E}_{q(\mathbf{x}_0)q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] := L$$

???

Derive on the board.

This objective can be rewritten as $$\begin{aligned} L &= \mathbb{E}_{q(\mathbf{x}_0)q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \\ &= \mathbb{E}_{q(\mathbf{x}_0)} \left[L_0 - \sum_{t>1} L_{t-1} - L_T\right] \end{aligned}$$ where

$L_0 = \mathbb{E}_{q(\mathbf{x}_1 | \mathbf{x}_0)}[\log p_\theta(\mathbf{x}_0 | \mathbf{x}_1)]$ can be interpreted as a reconstruction term. It can be approximated and optimized using a Monte Carlo estimate.
$L_{t-1} = \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) )$ is a denoising matching term. The transition $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ provides a learning signal for the reverse process, since it defines how to denoise the noisified input $\mathbf{x}_t$ with access to the original input $\mathbf{x}_0$.
$L_T = \text{KL}(q(\mathbf{x}_T | \mathbf{x}_0) || p(\mathbf{x}_T))$ represents how close the distribution of the final noisified input is to the standard Gaussian prior. It has no trainable parameters.

The distribution $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is the tractable posterior distribution $$\begin{aligned} q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) &= \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)} \\ &= \mathcal{N}(\mathbf{x}_{t-1}; \mu_q(\mathbf{x}_t, \mathbf{x}_0, t), \sigma^2_t \mathbf{I}) \end{aligned}$$ where $$\begin{aligned} \mu_q(\mathbf{x}_t, \mathbf{x}_0, t) &= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\mathbf{x}_0 \\ \sigma^2_t &= \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} \end{aligned}$$
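The posterior parameters can be checked numerically in the scalar case. The sketch below, which assumes an illustrative linear schedule, computes the mean and variance once with the formulas above and once by multiplying the two Gaussian factors in $\mathbf{x}_{t-1}$ from Bayes' rule (adding precisions). The function names are hypothetical.

```python
import numpy as np

alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bars = np.cumprod(alphas)

def posterior_params(x_t, x0, t):
    """Closed-form mean and variance of q(x_{t-1} | x_t, x_0), 2 <= t <= T."""
    a_t, ab_t, ab_prev = alphas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]
    mean = (np.sqrt(a_t) * (1 - ab_prev) * x_t
            + np.sqrt(ab_prev) * (1 - a_t) * x0) / (1 - ab_t)
    var = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)
    return mean, var

def posterior_by_bayes(x_t, x0, t):
    """Same quantities from the product of two scalar Gaussians in x_{t-1}."""
    a_t, ab_prev = alphas[t - 1], alpha_bars[t - 2]
    # q(x_t | x_{t-1}) as a function of x_{t-1}:
    # precision alpha_t / (1 - alpha_t), centred at x_t / sqrt(alpha_t).
    prec_lik = a_t / (1 - a_t)
    # q(x_{t-1} | x_0): mean sqrt(alpha_bar_{t-1}) x_0, variance 1 - alpha_bar_{t-1}.
    prec_prior = 1.0 / (1 - ab_prev)
    var = 1.0 / (prec_lik + prec_prior)
    mean = var * (prec_lik * x_t / np.sqrt(a_t)
                  + prec_prior * np.sqrt(ab_prev) * x0)
    return mean, var

for t in (2, 500, 1000):
    print(t, posterior_params(0.3, -1.2, t), posterior_by_bayes(0.3, -1.2, t))
```

The two computations should agree up to rounding for every $t \geq 2$.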

???

Take the time to do the derivation on the board.

Interpretation 1: Denoising

To minimize the expected KL divergence $L_{t-1}$, we need to match the reverse process $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ to the tractable posterior. Since both are Gaussian, we can match their means and variances.

By construction, the variance of the reverse process can be set to the known variance $\sigma^2_t$ of the tractable posterior.

For the mean, we reuse the analytical form of $\mu_q(\mathbf{x}_t, \mathbf{x}_0, t)$ and parameterize the mean of the reverse process using a .bold[denoising network] as $$\mu_\theta(\mathbf{x}_t, t) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\hat{\mathbf{x}}_\theta(\mathbf{x}_t, t).$$

???

Derive on the board.

Under this parameterization, the minimization of expected KL divergence $L_{t-1}$ can be rewritten as $$\begin{aligned} &\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) )\\ =&\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \frac{1}{2\sigma^2_t} || \mu_\theta(\mathbf{x}_t, t) - \mu_q(\mathbf{x}_t, \mathbf{x}_0, t) ||_2^2 \\ =&\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \frac{1}{2\sigma^2_t} \frac{\bar{\alpha}_{t-1}(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2} || \hat{\mathbf{x}}_\theta(\mathbf{x}_t, t) - \mathbf{x}_0 ||_2^2 \end{aligned}$$

.success[Optimizing a VDM amounts to learning a neural network that predicts the original ground truth $\mathbf{x}_0$ from a noisy input $\mathbf{x}_t$.]
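The last equality can be verified numerically. In the sketch below, which assumes an illustrative linear schedule, a perturbed copy of $\mathbf{x}_0$ stands in for the output of the denoising network; the weighted squared error on the means is compared with the weighted squared error on the $\mathbf{x}_0$ prediction. All names are hypothetical.

```python
import numpy as np

alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bars = np.cumprod(alphas)

def schedule(t):
    """Return alpha_t, alpha_bar_t, alpha_bar_{t-1} and sigma_t^2 for 2 <= t <= T."""
    a_t, ab_t, ab_prev = alphas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]
    sigma2 = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)
    return a_t, ab_t, ab_prev, sigma2

def reverse_mean(x_t, x0_like, t):
    """Posterior-mean formula, evaluated at the true x_0 or at a prediction."""
    a_t, ab_t, ab_prev, _ = schedule(t)
    return (np.sqrt(a_t) * (1 - ab_prev) * x_t
            + np.sqrt(ab_prev) * (1 - a_t) * x0_like) / (1 - ab_t)

rng = np.random.default_rng(0)
t = 300
a_t, ab_t, ab_prev, sigma2 = schedule(t)
x0 = rng.standard_normal(8)
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * rng.standard_normal(8)
x0_hat = x0 + 0.1 * rng.standard_normal(8)     # stand-in for a network output

# Loss written on the means, then on the x_0 prediction.
mean_loss = np.sum((reverse_mean(x_t, x0_hat, t)
                    - reverse_mean(x_t, x0, t)) ** 2) / (2 * sigma2)
weight = ab_prev * (1 - a_t) ** 2 / (2 * sigma2 * (1 - ab_t) ** 2)
x0_loss = weight * np.sum((x0_hat - x0) ** 2)
print(mean_loss, x0_loss)
```

The $\mathbf{x}_t$ terms cancel in the difference of the means, which is why only the $\mathbf{x}_0$ prediction error remains.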

Finally, minimizing the summation of the $L_{t-1}$ terms across all noise levels $t$ can be approximated by minimizing the expectation over all timesteps as $$\arg \min_\theta \mathbb{E}_{t \sim U\{2,T\}} \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) ).$$

Interpretation 2: Noise prediction

A second interpretation of VDMs can be obtained using the reparameterization trick. Using $$\mathbf{x}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t} \epsilon}{\sqrt{\bar{\alpha}_t}},$$ we can rewrite the mean of the tractable posterior as $$\begin{aligned} \mu_q(\mathbf{x}_t, \mathbf{x}_0, t) &= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\mathbf{x}_0 \\ &= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t} \epsilon}{\sqrt{\bar{\alpha}_t}} \\ &= ... \\ &= \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{(1-\bar{\alpha}_t)\alpha_t}}\epsilon \end{aligned}$$

???

Derive on the board.

Accordingly, the mean of the reverse process can be parameterized with a .bold[noise-prediction network] as

$$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{(1-\bar{\alpha}_t)\alpha_t}}{\epsilon}_\theta(\mathbf{x}_t, t).$$

Under this parameterization, the minimization of the expected KL divergence $L_{t-1}$ can be rewritten as $$\begin{aligned} &\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) )\\ =&\arg \min_\theta \mathbb{E}_{\mathcal{N}(\epsilon;\mathbf{0}, \mathbf{I})} \frac{1}{2\sigma^2_t} \frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t) \alpha_t} || {\epsilon}_\theta(\underbrace{\sqrt{\bar{\alpha}_t} \mathbf{x}_{0} + \sqrt{1-\bar{\alpha}_t} \epsilon}_{\mathbf{x}_t}, t) - \epsilon ||_2^2 \end{aligned}$$

.success[Optimizing a VDM amounts to learning a neural network that predicts the noise $\epsilon$ that was added to the original ground truth $\mathbf{x}_0$ to obtain the noisy $\mathbf{x}_t$.]
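A Monte Carlo estimate of the noise-prediction objective might look as follows. This is only a sketch under stated assumptions: the schedule is an illustrative linear one, the per-timestep coefficient is replaced by 1, timesteps are drawn from $U\{2,T\}$, and `eps_model` is a hypothetical callable standing in for the network $\epsilon_\theta(\mathbf{x}_t, t)$.

```python
import numpy as np

alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bars = np.cumprod(alphas)
T = len(alphas)

def simple_loss(eps_model, x0_batch, rng):
    """Estimate E_t E_eps || eps_model(x_t, t) - eps ||^2 with unit weights."""
    n = x0_batch.shape[0]
    t = rng.integers(2, T + 1, size=n)          # one timestep in {2..T} per example
    # Reshape alpha_bar_t so that it broadcasts over the data dimensions.
    ab = alpha_bars[t - 1].reshape((n,) + (1,) * (x0_batch.ndim - 1))
    eps = rng.standard_normal(x0_batch.shape)
    x_t = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps   # diffusion kernel
    residual = eps_model(x_t, t) - eps
    return np.mean(np.sum(residual.reshape(n, -1) ** 2, axis=1))

rng = np.random.default_rng(0)
x0_batch = rng.standard_normal((4096, 2))       # toy 2-dimensional data
always_zero = lambda x_t, t: np.zeros_like(x_t) # trivial baseline "network"
print(simple_loss(always_zero, x0_batch, rng))  # roughly the data dimension, 2
```

A model that always predicts zero pays the full expected squared norm of $\epsilon$, so any useful network should score below that baseline.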

In summary, training and sampling thus eventually boil down to:

.center.width-100[]

???

Note that in practice, the coefficient before the norm in the loss function is often omitted. Setting it to 1 is found to increase the sample quality.

Network architectures

Diffusion models often use U-Net architectures (at least for image data) with ResNet blocks and self-attention layers to represent $\hat{\mathbf{x}}_\theta(\mathbf{x}_t, t)$ or $\epsilon_\theta(\mathbf{x}_t, t)$.

.center.width-100[]

Score-based generative models

Score-based models

Maximum likelihood estimation for energy-based probabilistic models $$p_{\theta}(\mathbf{x}) = \frac{1}{Z_{\theta}} \exp(-f_{\theta}(\mathbf{x}))$$ can be intractable when the partition function $Z_{\theta}$ is unknown. We can sidestep this issue with a score-based model $$s_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$$ that approximates the (Stein) .bold[score function] of the data distribution. If we parameterize the score-based model with an energy-based model, then we have $$s_\theta(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x}) = -\nabla_{\mathbf{x}} f_{\theta}(\mathbf{x}) - \nabla_{\mathbf{x}} \log Z_{\theta} = -\nabla_{\mathbf{x}} f_{\theta}(\mathbf{x}),$$ which discards the intractable partition function and expands the family of models that can be used.

The score function points in the direction in which the log-density of the data distribution increases fastest. It can be used to find modes of the data distribution or to generate samples by .bold[Langevin dynamics], iterating the sampling rule $$\mathbf{x}_{i+1} = \mathbf{x}_i + \epsilon \nabla_{\mathbf{x}_i} \log p(\mathbf{x}_i) + \sqrt{2\epsilon} \mathbf{z}_i,$$ where $\epsilon$ is the step size and $\mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. When $\epsilon$ is small and the number of iterations is large, the iterates are approximately distributed according to the data distribution $p(\mathbf{x})$.
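The sampling rule can be tried on a distribution whose score is known in closed form. The sketch below assumes a one-dimensional Gaussian target $\mathcal{N}(2, 0.5^2)$, for which $\nabla_x \log p(x) = -(x-\mu)/\sigma^2$; the target, step size and number of steps are arbitrary illustrative choices.

```python
import numpy as np

def langevin_sample(score, x_init, step_size, n_steps, rng):
    """Iterate x <- x + eps * score(x) + sqrt(2 eps) * z with z ~ N(0, I)."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + step_size * score(x) + np.sqrt(2.0 * step_size) * z
    return x

mu, sigma = 2.0, 0.5                          # assumed toy target N(mu, sigma^2)
score = lambda x: -(x - mu) / sigma**2        # exact score of the target

rng = np.random.default_rng(0)
x_init = rng.uniform(-5.0, 5.0, size=5000)    # 5000 independent chains
samples = langevin_sample(score, x_init, step_size=1e-3, n_steps=2000, rng=rng)
print(samples.mean(), samples.std())          # should be close to 2 and 0.5
```

With a learned score $s_\theta(\mathbf{x})$ in place of the exact one, the same loop would be used unchanged.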

.center.width-30[]

Similarly to likelihood-based models, score-based models can be trained by minimizing the .bold[Fisher divergence] between the data distribution $p(\mathbf{x})$ and the model distribution $p_\theta(\mathbf{x})$ as $$\mathbb{E}_{p(\mathbf{x})} \left[ || \nabla_{\mathbf{x}} \log p(\mathbf{x}) - s_\theta(\mathbf{x}) ||_2^2 \right].$$

Unfortunately, the explicit score matching objective leads to inaccurate estimates in low-density regions, where few data points are available to constrain the score.

Since initial sample points are likely to be in low-density regions in high-dimensional spaces, the inaccurate score-based model will derail the Langevin dynamics and lead to poor sample quality.

.center.width-100[]

To address this issue, .bold[denoising score matching] can be used to train the score-based model to predict the score of increasingly noisified data points.

For each noise level $t$, the score-based model $s_\theta(\mathbf{x}_t, t)$ is trained to predict the score of the noisified data point $\mathbf{x}_t$ as $$s_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log p_{t} (\mathbf{x}_t)$$ where $p_{t} (\mathbf{x}_t)$ is the noise-perturbed data distribution $$p_{t} (\mathbf{x}_t) = \int p(\mathbf{x}_0) \mathcal{N}(\mathbf{x}_t ; \mathbf{x}_0, \sigma^2_t \mathbf{I}) d\mathbf{x}_0$$ and $\sigma^2_t$ is an increasing sequence of noise levels.

The training objective for $s_\theta(\mathbf{x}_t, t)$ is then a weighted sum of Fisher divergences for all noise levels $t$, $$\sum_{t=1}^T \lambda(t) \mathbb{E}_{p_{t}(\mathbf{x}_t)} \left[ || \nabla_{\mathbf{x}_t} \log p_{t}(\mathbf{x}_t) - s_\theta(\mathbf{x}_t, t) ||_2^2 \right]$$ where $\lambda(t)$ is a weighting function.

Finally, annealed Langevin dynamics can be used to sample from the score-based model by running Langevin dynamics with decreasing noise levels $t=T, ..., 1$.

.center.width-100[]

Interpretation 3: Denoising score matching

A third interpretation of VDMs can be obtained by reparameterizing $\mathbf{x}_0$ using Tweedie's formula, as $$\mathbf{x}_0 = \frac{\mathbf{x}_t + (1-\bar{\alpha}_t) \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) }{\sqrt{\bar{\alpha}_t}},$$ which we can plug into the mean of the tractable posterior to obtain $$\begin{aligned} \mu_q(\mathbf{x}_t, \mathbf{x}_0, t) &= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\mathbf{x}_0 \\ &= ... \\ &= \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}} \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t). \end{aligned}$$

???

Derive on the board.

The mean of the reverse process can be parameterized with a .bold[score network] as $$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}} s_\theta(\mathbf{x}_t, t).$$

Under this parameterization, the minimization of the expected KL divergence $L_{t-1}$ can be rewritten as $$\begin{aligned} &\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) )\\ =&\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \frac{1}{2\sigma^2_t} \frac{(1-\alpha_t)^2}{\alpha_t} || s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) ||_2^2 \end{aligned}$$

.success[Optimizing a score-based model amounts to learning a neural network that predicts the score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$.]

Unfortunately, $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$ is not tractable in general. However, since $s_\theta(\mathbf{x}_t, t)$ is learned in expectation over the data distribution $q(\mathbf{x}_0)$, minimizing instead $$\mathbb{E}_{q(\mathbf{x}_0)} \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \frac{1}{2\sigma^2_t} \frac{(1-\alpha_t)^2}{\alpha_t} || s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) ||_2^2$$ ensures that $s_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$.
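The claim that regressing on the conditional score recovers the marginal score can be illustrated on a case where both are known. The sketch below assumes scalar Gaussian data $\mathbf{x}_0 \sim \mathcal{N}(0, s_0^2)$ and a single arbitrary noise level $\bar{\alpha}_t = 0.5$, so that $q(\mathbf{x}_t) = \mathcal{N}(0, \bar{\alpha}_t s_0^2 + 1 - \bar{\alpha}_t)$ and its score is linear in $\mathbf{x}_t$. A linear model $s(\mathbf{x}_t) = w \mathbf{x}_t$ fitted by least squares stands in for the score network.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s0, a_bar = 200_000, 2.0, 0.5        # toy data std and an arbitrary alpha_bar_t

x0 = s0 * rng.standard_normal(n)                        # x_0 ~ q(x_0)
eps = rng.standard_normal(n)
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * eps    # x_t ~ q(x_t | x_0)

# Tractable regression target: the conditional score grad log q(x_t | x_0).
target = -(x_t - np.sqrt(a_bar) * x0) / (1 - a_bar)

# Least-squares fit of the linear score model s(x_t) = w * x_t.
w = np.sum(x_t * target) / np.sum(x_t * x_t)

# Slope of the exact marginal score grad log q(x_t) = -x_t / Var[x_t].
w_true = -1.0 / (a_bar * s0**2 + 1 - a_bar)
print(w, w_true)                        # the two slopes should nearly coincide
```

Although each individual target depends on $\mathbf{x}_0$, averaging over $q(\mathbf{x}_0)$ makes the minimizer depend on $\mathbf{x}_t$ only.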

Ancestral sampling

Sampling from the score-based diffusion model is done by starting from $\mathbf{x}_T \sim p(\mathbf{x}_T)=\mathcal{N}(\mathbf{0}, \mathbf{I})$ and then following the estimated reverse Markov chain, as $$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}} s_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z}_t,$$ where $\mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, for $t=T, ..., 1$.
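The sampling loop can be exercised without training anything by plugging in an exact score. The sketch below assumes scalar Gaussian data $\mathcal{N}(\mu_0, s_0^2)$, for which $q(\mathbf{x}_t)$ is Gaussian with a closed-form score, and an illustrative linear schedule; the function `score` plays the role of $s_\theta(\mathbf{x}_t, t)$, and $\sigma_t$ is the posterior standard deviation, with the convention $\bar{\alpha}_0 = 1$.

```python
import numpy as np

alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bars = np.cumprod(alphas)
T = len(alphas)
mu0, s0 = 2.0, 0.5                             # assumed toy data N(mu0, s0^2)

def score(x, t):
    """Exact score of q(x_t) for Gaussian data; stands in for s_theta(x_t, t)."""
    ab = alpha_bars[t - 1]
    var = ab * s0**2 + 1.0 - ab
    return -(x - np.sqrt(ab) * mu0) / var

def ancestral_sample(n, rng):
    """Follow the reverse Markov chain from x_T ~ N(0, I) down to x_0."""
    x = rng.standard_normal(n)
    for t in range(T, 0, -1):
        a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
        ab_prev = alpha_bars[t - 2] if t > 1 else 1.0   # alpha_bar_0 = 1
        sigma_t = np.sqrt((1 - a_t) * (1 - ab_prev) / (1 - ab_t))
        mean = (x + (1 - a_t) * score(x, t)) / np.sqrt(a_t)
        x = mean + sigma_t * rng.standard_normal(n)     # sigma_1 = 0: no noise at t=1
    return x

samples = ancestral_sample(5000, np.random.default_rng(0))
print(samples.mean(), samples.std())           # should be close to 2 and 0.5
```

With a trained network, only the body of `score` would change.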

Conditional sampling

To turn a diffusion model $p_\theta(\mathbf{x}_{0:T})$ into a conditional model, we can add conditioning information $y$ at each step of the reverse process, as $$p_\theta(\mathbf{x}_{0:T} | y) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t, y).$$

With a score-based model, however, we can use Bayes' rule and notice that $$\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t | y) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(y | \mathbf{x}_t),$$ where we leverage the fact that the gradient of $\log p(y)$ with respect to $\mathbf{x}_t$ is zero.

In other words, controllable generation can be achieved by adding a conditioning signal during sampling, without having to retrain the model. E.g., train an extra classifier $p(y | \mathbf{x}_t)$ and use it to control the sampling process by adding its gradient to the score.
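The score decomposition can be checked on a conjugate toy problem. The sketch below assumes a scalar standard Gaussian prior and a Gaussian observation model $y | x \sim \mathcal{N}(x, \tau^2)$, both illustrative choices made so that the posterior is available in closed form; the likelihood score plays the role of the gradient of a classifier.

```python
import numpy as np

tau = 0.5            # assumed observation noise: y | x ~ N(x, tau^2)
y = 1.2              # an arbitrary observed value

prior_score = lambda x: -x                       # grad_x log N(x; 0, 1)
likelihood_score = lambda x: (y - x) / tau**2    # grad_x log N(y; x, tau^2)
guided_score = lambda x: prior_score(x) + likelihood_score(x)

# Closed-form posterior p(x | y) = N(y / (1 + tau^2), tau^2 / (1 + tau^2)).
post_mean = y / (1 + tau**2)
post_var = tau**2 / (1 + tau**2)
posterior_score = lambda x: -(x - post_mean) / post_var

xs = np.linspace(-3.0, 3.0, 7)
print(np.allclose(guided_score(xs), posterior_score(xs)))   # prints True
```

Feeding `guided_score` to a sampler in place of the unconditional score would then target the posterior rather than the prior.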

Continuous-time diffusion models

.center.width-100[]

With $\beta_t = 1 - \alpha_t$, we can rewrite the forward process as $$\begin{aligned} \mathbf{x}_t &= \sqrt{ {\alpha}_t} \mathbf{x}_{t-1} + \sqrt{1-{\alpha}_t} \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &= \sqrt{1 - {\beta}_t} \mathbf{x}_{t-1} + \sqrt{ {\beta}_t} \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &= \sqrt{1 - {\beta}(t)\Delta_t} \mathbf{x}_{t-1} + \sqrt{ {\beta}(t)\Delta_t} \mathcal{N}(\mathbf{0}, \mathbf{I}) \end{aligned}$$

When $\Delta_t \rightarrow 0$, we can further rewrite the forward process as $$\begin{aligned} \mathbf{x}_t &= \sqrt{1 - {\beta}(t)\Delta_t} \mathbf{x}_{t-1} + \sqrt{ {\beta}(t)\Delta_t} \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &\approx \mathbf{x}_{t-1} - \frac{\beta(t)\Delta_t}{2} \mathbf{x}_{t-1} + \sqrt{ {\beta}(t)\Delta_t} \mathcal{N}(\mathbf{0}, \mathbf{I}) \end{aligned}.$$

This last update rule corresponds to the Euler-Maruyama discretization of the stochastic differential equation (SDE) $$\text{d}\mathbf{x}_t = -\frac{1}{2}\beta(t)\mathbf{x}_t \text{d}t + \sqrt{\beta(t)} \text{d}\mathbf{w}_t$$ describing the diffusion in the infinitesimal limit.
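The discretized update can be simulated as is. The sketch below assumes a linear noise rate $\beta(t)$ on $t \in [0, 1]$ with end values $0.1$ and $20$ (illustrative choices, not given in the lecture) and starts all trajectories from the same point.

```python
import numpy as np

def beta(t, beta_min=0.1, beta_max=20.0):
    """Assumed linear noise rate beta(t) on t in [0, 1]."""
    return beta_min + t * (beta_max - beta_min)

def simulate_forward_sde(x0, n_steps, rng):
    """Euler-Maruyama for dx = -0.5 beta(t) x dt + sqrt(beta(t)) dw on [0, 1]."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        b = beta(i * dt)
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * b * x * dt + np.sqrt(b * dt) * noise
    return x

rng = np.random.default_rng(0)
x1 = simulate_forward_sde(np.full(5000, 3.0), n_steps=1000, rng=rng)
print(x1.mean(), x1.std())     # should be close to 0 and 1, respectively
```

As in the discrete case, the trajectories forget their starting point and end up approximately standard Gaussian.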

.center.width-80[]

The reverse process satisfies a reverse-time SDE that can be derived analytically from the forward-time SDE and the score of the marginal distribution $q(\mathbf{x}_t)$, as $$\text{d}\mathbf{x}_t = \left[ -\frac{1}{2}\beta(t)\mathbf{x}_t - \beta(t)\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) \right] \text{d}t + \sqrt{\beta(t)} \text{d}\mathbf{w}_t.$$

.center.width-80[]

The score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$ of the marginal diffused density $q(\mathbf{x}_t)$ is not tractable, but can be estimated using denoising score matching (DSM) by solving $$\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_0)} \mathbb{E}_{t\sim U[0,T]} \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} || s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) ||_2^2,$$ which will result in $s_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$ because of the outer expectation over $q(\mathbf{x}_0)$.

Probability flow ODE

For any diffusion process $$\text{d}\mathbf{x}_t = \mathbf{f}(t, \mathbf{x}_t) \text{d}t + g(t) \text{d}\mathbf{w}_t,$$ there exists a corresponding deterministic process $$\text{d}\mathbf{x}_t = \left[ \mathbf{f}(t, \mathbf{x}_t) - \frac{1}{2} g^2(t) \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) \right] \text{d}t$$ whose trajectories share the same marginal densities $p(\mathbf{x}_t)$.

Therefore, when $\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$ is replaced by its approximation $s_\theta(\mathbf{x}_t, t)$, the probability flow ODE becomes a special case of a neural ODE. In particular, it is an example of continuous-time normalizing flows!
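The deterministic process can be integrated backward in time to turn noise into samples. The sketch below uses the drift $\mathbf{f}(t, \mathbf{x}_t) = -\frac{1}{2}\beta(t)\mathbf{x}_t$ and $g^2(t) = \beta(t)$ of the diffusion SDE above, and assumes a linear $\beta(t)$ on $[0, 1]$ and scalar Gaussian data so that the exact score is available; the explicit Euler integrator and all names are illustrative choices.

```python
import numpy as np

beta_min, beta_max = 0.1, 20.0     # assumed linear noise rate on t in [0, 1]
mu0, s0 = 2.0, 0.5                 # assumed toy data N(mu0, s0^2)

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def int_beta(t):
    """Integral of beta from 0 to t."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t**2

def score(x, t):
    """Exact score of the marginal at time t; stands in for s_theta(x_t, t)."""
    decay = np.exp(-int_beta(t))               # continuous analogue of alpha_bar_t
    mean = np.sqrt(decay) * mu0
    var = decay * s0**2 + 1.0 - decay
    return -(x - mean) / var

def probability_flow(x1, n_steps):
    """Integrate the ODE from t=1 back to t=0 with explicit Euler steps."""
    x = np.array(x1, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = -0.5 * beta(t) * x - 0.5 * beta(t) * score(x, t)
        x = x - drift * dt                     # step backward in time
    return x

rng = np.random.default_rng(0)
x0_samples = probability_flow(rng.standard_normal(5000), n_steps=1000)
print(x0_samples.mean(), x0_samples.std())     # should be close to 2 and 0.5
```

The only randomness is in the initial draw: each noise sample is mapped to a data sample by a deterministic, invertible flow.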

.center.width-80[]

Latent-space diffusion models

Directly modeling the data distribution can make the denoising process difficult to learn. A more effective approach is to combine VAEs with a diffusion prior.

The distribution of latent embeddings is simpler to model.
Diffusion on non-image data is possible with tailored autoencoders.

.center.width-100[]

The end.


.footnote[Credits: Blattmann et al, 2023. Prompt: "A teddy bear is playing the electric guitar, high definition, 4k."]
