R: it takes 2h30 to cover the first part (up to score-based models, excluded).
=> Stop there, show a code example instead, but keep the rest of the slides for reference.
R: the applications are pretty cool but their presentation is too superficial. Go into more detail and explain where/how the diffusion models are used in each case. Drop a few examples if needed.
Today
VAEs
Variational diffusion models
Score-based generative models
.alert[Caution: See also the side notes derived in class.]
class: middle
Applications
A few motivating examples.
class: middle
Content generation
.center[.width-45[] .width-45[]]
.center[Diffusion models have emerged as powerful generative models, beating previous state-of-the-art models (such as GANs) on a variety of tasks.]
.alert[Issue: The prior matching term limits the expressivity of the model.]
class: middle, black-slide, center
count: false
Solution: Make $p(\mathbf{z})$ a learnable distribution.
.width-80[]
???
Explain the maths on the black board, taking the expectation wrt $p(\mathbf{x})$ of the ELBO and consider the expected KL terms.
class: middle
(Markovian) Hierarchical VAEs
The prior $p(\mathbf{z})$ is itself a VAE, and recursively so for its own hyper-prior.
.center[]
class: middle
Similarly to VAEs, training is done by maximizing the ELBO, using a variational distribution $q_\phi(\mathbf{z}_{1:T} | \mathbf{x})$ over all levels of latent variables:
$$\begin{aligned}
\log p_\theta(\mathbf{x}) &\geq \mathbb{E}_{q_\phi(\mathbf{z}_{1:T} | \mathbf{x})}\left[ \log \frac{p(\mathbf{x},\mathbf{z}_{1:T})}{q_\phi(\mathbf{z}_{1:T}|\mathbf{x})} \right]
\end{aligned}$$
???
Rederive the ELBO.
class: middle
Variational diffusion models
class: middle
.center.width-100[]
class: middle
Variational diffusion models are Markovian HVAEs with the following constraints:
The latent dimension is the same as the data dimension.
The encoder is fixed to linear Gaussian transitions $q(\mathbf{x}_t | \mathbf{x}_{t-1})$.
The hyper-parameters are set such that $q(\mathbf{x}_T | \mathbf{x}_0)$ is a standard Gaussian.
This objective can be rewritten as
$$\begin{aligned}
L &= \mathbb{E}_{q(\mathbf{x}_0)q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_0)} \left[L_0 - \sum_{t>1} L_{t-1} - L_T\right]
\end{aligned}$$
where
$L_0 = \mathbb{E}_{q(\mathbf{x}_1 | \mathbf{x}_0)}[\log p_\theta(\mathbf{x}_0 | \mathbf{x}_1)]$ can be interpreted as a reconstruction term. It can be approximated and optimized using a Monte Carlo estimate.
$L_{t-1} = \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) )$ is a denoising matching term. The transition $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ provides a learning signal for the reverse process, since it defines how to denoise the noisified input $\mathbf{x}_t$ with access to the original input $\mathbf{x}_0$.
$L_T = \text{KL}(q(\mathbf{x}_T | \mathbf{x}_0) || p_\theta(\mathbf{x}_T))$ represents how close the distribution of the final noisified input is to the standard Gaussian. It has no trainable parameters.
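As a rough illustration of the fixed encoder, the sketch below builds a noise schedule and samples $\mathbf{x}_t$ directly from $\mathbf{x}_0$ in the scalar case. It uses the closed form $q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t) I)$ with $\bar{\alpha}_t = \prod_{s \leq t} \alpha_s$. The linear schedule, the value `T = 1000` and the helper names `make_schedule` and `q_sample` are assumptions made for this example only.

```python
import math
import random

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Return lists alpha[0..T] and alpha_bar[0..T] (index 0 is a placeholder, alpha_bar[0] = 1).

    The linear schedule for beta_t = 1 - alpha_t is an assumption of this sketch.
    """
    alpha, alpha_bar = [1.0], [1.0]
    for t in range(1, T + 1):
        beta_t = beta_min + (beta_max - beta_min) * (t - 1) / (T - 1)
        alpha.append(1.0 - beta_t)
        alpha_bar.append(alpha_bar[-1] * alpha[-1])
    return alpha, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) for a scalar x0; returns (x_t, eps)."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

alpha, alpha_bar = make_schedule()
rng = random.Random(0)
x_T, _ = q_sample(x0=3.0, t=1000, alpha_bar=alpha_bar, rng=rng)
# alpha_bar[T] is close to 0, so q(x_T | x_0) is close to a standard Gaussian
# whatever the value of x0, which is why L_T carries no learning signal.
```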
class: middle
.center[]
The distribution $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is the tractable posterior distribution
$$\begin{aligned}
q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) &= \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)} \\
&= \mathcal{N}(\mathbf{x}_{t-1}; \mu_q(\mathbf{x}_t, \mathbf{x}_0, t), \sigma^2_t I)
\end{aligned}$$
where
$$\begin{aligned}
\mu_q(\mathbf{x}_t, \mathbf{x}_0, t) &= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\mathbf{x}_0 \\
\sigma^2_t &= \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}
\end{aligned}$$
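The formulas above can be checked numerically in the scalar case. The sketch below evaluates $\mu_q$ and $\sigma^2_t$ and compares them with the usual product-of-Gaussians rule (precisions add). The numerical values and the function name `posterior_params` are hypothetical and chosen only for illustration.

```python
import math

def posterior_params(x_t, x0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for scalar inputs."""
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    coef_x0 = math.sqrt(alpha_bar_prev) * (1.0 - alpha_t) / (1.0 - alpha_bar_t)
    mean = coef_xt * x_t + coef_x0 * x0
    var = (1.0 - alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return mean, var

# Hypothetical values for a single step t.
alpha_t, alpha_bar_prev = 0.98, 0.60
alpha_bar_t = alpha_t * alpha_bar_prev
x_t, x0 = 0.5, 1.0
mean, var = posterior_params(x_t, x0, alpha_t, alpha_bar_t, alpha_bar_prev)

# Cross-check: multiply the likelihood q(x_t | x_{t-1}) by the prior q(x_{t-1} | x_0).
precision = alpha_t / (1.0 - alpha_t) + 1.0 / (1.0 - alpha_bar_prev)
mean_check = (math.sqrt(alpha_t) * x_t / (1.0 - alpha_t)
              + math.sqrt(alpha_bar_prev) * x0 / (1.0 - alpha_bar_prev)) / precision
```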
???
Take the time to do the derivation on the board.
class: middle
Interpretation 1: Denoising
To minimize the expected KL divergence $L_{t-1}$, we need to match the reverse process $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ to the tractable posterior. Since both are Gaussian, we can match their means and variances.
By construction, the variance of the reverse process can be set to the known variance $\sigma^2_t$ of the tractable posterior.
For the mean, we reuse the analytical form of $\mu_q(\mathbf{x}_t, \mathbf{x}_0, t)$ and parameterize the mean of the reverse process using a .bold[denoising network] as
$$\mu_\theta(\mathbf{x}_t, t) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\hat{\mathbf{x}}_\theta(\mathbf{x}_t, t).$$
???
Derive on the board.
class: middle
Under this parameterization, the minimization of expected KL divergence $L_{t-1}$ can be rewritten as
$$\begin{aligned}
&\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) )\\
=&\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \frac{1}{2\sigma^2_t} || \mu_\theta(\mathbf{x}_t, t) - \mu_q(\mathbf{x}_t, \mathbf{x}_0, t) ||_2^2 \\
=&\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \frac{1}{2\sigma^2_t} \frac{\bar{\alpha}_{t-1}(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2} || \hat{\mathbf{x}}_\theta(\mathbf{x}_t, t) - \mathbf{x}_0 ||_2^2
\end{aligned}$$
.success[Optimizing a VDM amounts to learning a neural network that predicts the original ground truth $\mathbf{x}_0$ from a noisy input $\mathbf{x}_t$.]
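The equivalence between matching the means and predicting $\mathbf{x}_0$ can be verified on a scalar toy case. In the sketch below, `x0_hat` stands in for the output of the denoising network and all numerical values are hypothetical.

```python
import math

def posterior_mean(x_t, x0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Mean of q(x_{t-1} | x_t, x_0); also used for mu_theta with x0 replaced by x0_hat."""
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    coef_x0 = math.sqrt(alpha_bar_prev) * (1.0 - alpha_t) / (1.0 - alpha_bar_t)
    return coef_xt * x_t + coef_x0 * x0

# Hypothetical values for a single step t.
alpha_t, alpha_bar_prev = 0.98, 0.60
alpha_bar_t = alpha_t * alpha_bar_prev
sigma2 = (1.0 - alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
x_t, x0, x0_hat = 0.5, 1.0, 0.7  # x0_hat plays the role of the network prediction

mu_q = posterior_mean(x_t, x0, alpha_t, alpha_bar_t, alpha_bar_prev)
mu_theta = posterior_mean(x_t, x0_hat, alpha_t, alpha_bar_t, alpha_bar_prev)

# Mean-matching form of the KL between two Gaussians with equal variance.
mean_matching = (mu_theta - mu_q) ** 2 / (2.0 * sigma2)

# Weighted x0-prediction form.
weight = alpha_bar_prev * (1.0 - alpha_t) ** 2 / (2.0 * sigma2 * (1.0 - alpha_bar_t) ** 2)
x0_prediction = weight * (x0_hat - x0) ** 2
```

Both quantities agree up to floating-point error, since the $\mathbf{x}_t$ terms of the two means cancel.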
class: middle
Finally, minimizing the summation of the $L_{t-1}$ terms across all noise levels $t$ can be approximated by minimizing the expectation over all timesteps as
$$\arg \min_\theta \mathbb{E}_{t \sim U\{2,T\}} \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) ).$$
class: middle
Interpretation 2: Noise prediction
A second interpretation of VDMs can be obtained using the reparameterization trick.
Using $$\mathbf{x}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t} \epsilon}{\sqrt{\bar{\alpha}_t}},$$
we can rewrite the mean of the tractable posterior as
$$\begin{aligned}
\mu_q(\mathbf{x}_t, \mathbf{x}_0, t) &= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\mathbf{x}_0 \\
&= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t} \epsilon}{\sqrt{\bar{\alpha}_t}} \\
&= ... \\
&= \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{(1-\bar{\alpha}_t)\alpha_t}}\epsilon
\end{aligned}$$
???
Derive on the board.
class: middle
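Accordingly, the mean of the reverse process can be parameterized with a .bold[noise-prediction network] as
$$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{(1-\bar{\alpha}_t)\alpha_t}}\epsilon_\theta(\mathbf{x}_t, t).$$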
Under this parameterization, the minimization of the expected KL divergence $L_{t-1}$ can be rewritten as
$$\begin{aligned}
&\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) )\\
=&\arg \min_\theta \mathbb{E}_{\mathcal{N}(\epsilon;\mathbf{0}, I)} \frac{1}{2\sigma^2_t} \frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t) \alpha_t} || {\epsilon}_\theta(\underbrace{\sqrt{\bar{\alpha}_t} \mathbf{x}_{0} + \sqrt{1-\bar{\alpha}_t} \epsilon}_{\mathbf{x}_t}, t) - \epsilon ||_2^2
\end{aligned}$$
.success[Optimizing a VDM amounts to learning a neural network that predicts the noise $\epsilon$ that was added to the original ground truth $\mathbf{x}_0$ to obtain the noisy $\mathbf{x}_t$.]
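A minimal sketch of a Monte Carlo estimate of this objective, in the scalar case and with the coefficient in front of the norm dropped. The schedule, the helper names and the two toy predictors are assumptions made for this example: the "oracle" is the exact noise predictor when the data distribution is a point mass at `c`, and the "zero" predictor always outputs 0.

```python
import math
import random

def make_alpha_bar(T=1000, beta_min=1e-4, beta_max=0.02):
    """Cumulative products alpha_bar[0..T] for a linear schedule (assumption)."""
    alpha_bar = [1.0]
    for t in range(1, T + 1):
        beta_t = beta_min + (beta_max - beta_min) * (t - 1) / (T - 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta_t))
    return alpha_bar

def noise_prediction_loss(eps_model, x0, alpha_bar, rng, n_samples=2000):
    """Estimate E_t E_eps (eps_model(x_t, t) - eps)^2 with t ~ U{2, T}, unweighted."""
    T = len(alpha_bar) - 1
    total = 0.0
    for _ in range(n_samples):
        t = rng.randint(2, T)  # both bounds are inclusive
        eps = rng.gauss(0.0, 1.0)
        x_t = math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * eps
        total += (eps_model(x_t, t) - eps) ** 2
    return total / n_samples

alpha_bar = make_alpha_bar()
c = 3.0  # toy data distribution: a point mass at c

def eps_oracle(x_t, t):
    """Exact noise for point-mass data: invert x_t = sqrt(ab) c + sqrt(1 - ab) eps."""
    return (x_t - math.sqrt(alpha_bar[t]) * c) / math.sqrt(1.0 - alpha_bar[t])

def eps_zero(x_t, t):
    return 0.0

rng = random.Random(0)
loss_oracle = noise_prediction_loss(eps_oracle, c, alpha_bar, rng)
loss_zero = noise_prediction_loss(eps_zero, c, alpha_bar, rng)
```

The oracle reaches a loss of essentially zero, while the zero predictor stays near $\mathbb{E}[\epsilon^2] = 1$. In practice `eps_model` would be a neural network trained by stochastic gradient descent on this estimate.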
class: middle
In summary, training and sampling thus eventually boil down to:
.center.width-100[]
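The sampling half of this summary can be sketched as follows for scalar data, using the noise-prediction form of the reverse mean and the posterior variance $\sigma^2_t$. The schedule, the function names and the oracle predictor (exact for a data distribution that is a point mass at `c`) are assumptions of this example. A trained network would take the place of `eps_oracle`.

```python
import math
import random

def make_schedule(T=200, beta_min=1e-4, beta_max=0.05):
    """Lists alpha[0..T] and alpha_bar[0..T] for a linear schedule (assumption)."""
    alpha, alpha_bar = [1.0], [1.0]
    for t in range(1, T + 1):
        beta_t = beta_min + (beta_max - beta_min) * (t - 1) / (T - 1)
        alpha.append(1.0 - beta_t)
        alpha_bar.append(alpha_bar[-1] * alpha[-1])
    return alpha, alpha_bar

def ancestral_sample(eps_model, alpha, alpha_bar, rng):
    """Follow the reverse chain x_T -> x_0 with the noise-prediction mean."""
    T = len(alpha) - 1
    x = rng.gauss(0.0, 1.0)  # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        eps_hat = eps_model(x, t)
        mean = (x - (1.0 - alpha[t]) / math.sqrt(1.0 - alpha_bar[t]) * eps_hat) / math.sqrt(alpha[t])
        var = (1.0 - alpha[t]) * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the last step
        x = mean + math.sqrt(var) * z
    return x

alpha, alpha_bar = make_schedule()
c = 3.0  # toy data distribution: a point mass at c

def eps_oracle(x_t, t):
    """Exact noise predictor for point-mass data at c."""
    return (x_t - math.sqrt(alpha_bar[t]) * c) / math.sqrt(1.0 - alpha_bar[t])

rng = random.Random(0)
x0_sample = ancestral_sample(eps_oracle, alpha, alpha_bar, rng)
```

With the oracle predictor, the chain ends on `c` up to floating-point error, because the reverse mean then coincides with $\mu_q(\mathbf{x}_t, \mathbf{x}_0, t)$ at every step.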
???
Note that in practice, the coefficient before the norm in the loss function is often omitted. Setting it to 1 is found to increase the sample quality.
class: middle
Network architectures
Diffusion models often use U-Net architectures (at least for image data) with ResNet blocks and self-attention layers to represent $\hat{\mathbf{x}}_\theta(\mathbf{x}_t, t)$ or $\epsilon_\theta(\mathbf{x}_t, t)$.
Maximum likelihood estimation for energy-based probabilistic models $$p_{\theta}(\mathbf{x}) = \frac{1}{Z_{\theta}} \exp(-f_{\theta}(\mathbf{x}))$$ can be intractable when the partition function $Z_{\theta}$ is unknown.
We can sidestep this issue with a score-based model $$s_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$$ that approximates the (Stein) .bold[score function] of the data distribution. If we parameterize the score-based model with an energy-based model, then we have $$s_\theta(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x}) = -\nabla_{\mathbf{x}} f_{\theta}(\mathbf{x}) - \nabla_{\mathbf{x}} \log Z_{\theta} = -\nabla_{\mathbf{x}} f_{\theta}(\mathbf{x}),$$
which discards the intractable partition function and expands the family of models that can be used.
class: middle
The score function points in the direction in which the log-density of the data distribution increases the fastest.
It can be used to find modes of the data distribution or to generate samples by .bold[Langevin dynamics] by iterating the following sampling rule
$$\mathbf{x}_{i+1} = \mathbf{x}_i + \epsilon \nabla_{\mathbf{x}_i} \log p(\mathbf{x}_i) + \sqrt{2\epsilon} \mathbf{z}_i,$$
where $\epsilon$ is the step size and $\mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. When $\epsilon$ is small and the number of iterations is large, the iterates are approximately distributed according to the data distribution $p(\mathbf{x})$.
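A minimal sketch of this sampling rule on a one-dimensional Gaussian target, for which the score is known in closed form. The target parameters, the step size, the number of steps and the function names are assumptions chosen for illustration. A learned $s_\theta$ would replace `score` in practice.

```python
import math
import random

MU, SIGMA = 2.0, 0.5  # hypothetical target p(x) = N(2, 0.5^2)

def score(x):
    """Exact score of the Gaussian target: d/dx log p(x)."""
    return -(x - MU) / SIGMA ** 2

def langevin_sample(score_fn, x_init, step, n_steps, rng):
    """Iterate x <- x + step * score(x) + sqrt(2 * step) * z."""
    x = x_init
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + step * score_fn(x) + math.sqrt(2.0 * step) * z
    return x

rng = random.Random(0)
samples = [
    langevin_sample(score, rng.uniform(-5.0, 5.0), step=0.01, n_steps=500, rng=rng)
    for _ in range(1000)
]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Starting from arbitrary initial points, the empirical mean and variance of the chains end up close to those of the target, with a small bias due to the finite step size.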
Similarly to likelihood-based models, score-based models can be trained by minimizing the .bold[Fisher divergence] between the data distribution $p(\mathbf{x})$ and the model distribution $p_\theta(\mathbf{x})$ as
$$\mathbb{E}_{p(\mathbf{x})} \left[ || \nabla_{\mathbf{x}} \log p(\mathbf{x}) - s_\theta(\mathbf{x}) ||_2^2 \right].$$
class: middle
Unfortunately, the explicit score matching objective leads to inaccurate estimates in low-density regions, where few data points are available to constrain the score.
Since initial sample points are likely to be in low-density regions in high-dimensional spaces, the inaccurate score-based model will derail the Langevin dynamics and lead to poor sample quality.
To address this issue, .bold[denoising score matching] can be used to train the score-based model to predict the score of increasingly noisified data points.
For each noise level $t$, the score-based model $s_\theta(\mathbf{x}_t, t)$ is trained to predict the score of the noisified data point $\mathbf{x}_t$ as
$$s_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log p_{t} (\mathbf{x}_t)$$
where $p_{t} (\mathbf{x}_t)$ is the noise-perturbed data distribution
$$p_{t} (\mathbf{x}_t) = \int p(\mathbf{x}_0) \mathcal{N}(\mathbf{x}_t ; \mathbf{x}_0, \sigma^2_t \mathbf{I}) d\mathbf{x}_0$$
and $\sigma^2_t$ is an increasing sequence of noise levels.
class: middle
The training objective for $s_\theta(\mathbf{x}_t, t)$ is then a weighted sum of Fisher divergences for all noise levels $t$,
$$\sum_{t=1}^T \lambda(t) \mathbb{E}_{p_{t}(\mathbf{x}_t)} \left[ || \nabla_{\mathbf{x}_t} \log p_{t}(\mathbf{x}_t) - s_\theta(\mathbf{x}_t, t) ||_2^2 \right]$$
where $\lambda(t)$ is a weighting function.
class: middle
Finally, annealed Langevin dynamics can be used to sample from the score-based model by running Langevin dynamics with decreasing noise levels $t=T, ..., 1$.
A third interpretation of VDMs can be obtained by reparameterizing $\mathbf{x}_0$ using Tweedie's formula, as
$$\mathbf{x}_0 = \frac{\mathbf{x}_t + (1-\bar{\alpha}_t) \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) }{\sqrt{\bar{\alpha}_t}},$$
which we can plug into the mean of the tractable posterior to obtain
$$\begin{aligned}
\mu_q(\mathbf{x}_t, \mathbf{x}_0, t) &= \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}\mathbf{x}_0 \\
&= ... \\
&= \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}} \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t).
\end{aligned}$$
???
Derive on the board.
class: middle
The mean of the reverse process can be parameterized with a .bold[score network] as
$$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}} s_\theta(\mathbf{x}_t, t).$$
Under this parameterization, the minimization of the expected KL divergence $L_{t-1}$ can be rewritten as
$$\begin{aligned}
&\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)}\text{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) )\\
=&\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \frac{1}{2\sigma^2_t} \frac{(1-\alpha_t)^2}{\alpha_t} || s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) ||_2^2
\end{aligned}$$
.success[Optimizing a score-based model amounts to learning a neural network that predicts the score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$.]
class: middle
Unfortunately, $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$ is not tractable in general.
However, since $s_\theta(\mathbf{x}_t, t)$ is learned in expectation over the data distribution $q(\mathbf{x}_0)$, minimizing instead
$$\mathbb{E}_{q(\mathbf{x}_0)} \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \frac{1}{2\sigma^2_t} \frac{(1-\alpha_t)^2}{\alpha_t} || s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) ||_2^2$$
ensures that $s_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$.
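The claim can be illustrated on a toy case where the marginal score is known. Assuming scalar data $q(\mathbf{x}_0) = \mathcal{N}(0, 1)$, the marginal $q(\mathbf{x}_t)$ is also $\mathcal{N}(0, 1)$ and its score is $-\mathbf{x}_t$. The sketch below regresses a linear score model $s(x) = w x$ on the tractable conditional targets only. The value of $\bar{\alpha}_t$, the sample size and the names are hypothetical.

```python
import math
import random

def conditional_score(x_t, x0, alpha_bar_t):
    """Score of q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, 1 - alpha_bar_t)."""
    return -(x_t - math.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)

rng = random.Random(0)
alpha_bar_t = 0.5  # hypothetical noise level
xs, targets = [], []
for _ in range(20000):
    x0 = rng.gauss(0.0, 1.0)   # toy data distribution q(x_0) = N(0, 1)
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
    xs.append(x_t)
    targets.append(conditional_score(x_t, x0, alpha_bar_t))

# Closed-form least-squares fit of s(x) = w * x to the conditional targets.
w = sum(x * y for x, y in zip(xs, targets)) / sum(x * x for x in xs)
```

The fitted slope `w` comes out close to $-1$, that is, the regression recovers the marginal score $-\mathbf{x}_t$ although it only ever sees conditional scores.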
exclude: true
class: middle
Ancestral sampling
Sampling from the score-based diffusion model is done by starting from $\mathbf{x}_T \sim p(\mathbf{x}_T)=\mathcal{N}(\mathbf{0}, \mathbf{I})$ and then following the estimated reverse Markov chain, as
$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}} s_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z}_t,$$
where $\mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, for $t=T, ..., 1$.
class: middle
Conditional sampling
To turn a diffusion model $p_\theta(\mathbf{x}_{0:T})$ into a conditional model, we can add conditioning information $y$ at each step of the reverse process, as
$$p_\theta(\mathbf{x}_{0:T} | y) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t, y).$$
class: middle
With a score-based model, however, we can use Bayes' rule and notice that
$$\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t | y) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(y | \mathbf{x}_t),$$
where we leverage the fact that the gradient of $\log p(y)$ with respect to $\mathbf{x}_t$ is zero.
In other words, controllable generation can be achieved by adding a conditioning signal during sampling, without having to retrain the model. E.g., train an extra classifier $p(y | \mathbf{x}_t)$ and use it to control the sampling process by adding its gradient to the score.
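A small sanity check of this decomposition, in a scalar Gaussian setting where the conditional score is known exactly. The unconditional model $p(x) = \mathcal{N}(0, 1)$, the "classifier" $p(y | x) = \mathcal{N}(y; x, v)$, the single noise level and all names are assumptions made for this example.

```python
def prior_score(x):
    """Score of the unconditional toy model p(x) = N(0, 1)."""
    return -x

def likelihood_score(x, y, noise_var):
    """Gradient with respect to x of log p(y | x), with p(y | x) = N(y; x, noise_var)."""
    return (y - x) / noise_var

def guided_score(x, y, noise_var):
    """Conditional score obtained by adding the classifier gradient to the score."""
    return prior_score(x) + likelihood_score(x, y, noise_var)

def posterior_score(x, y, noise_var):
    """Exact score of p(x | y) = N(y / (1 + v), v / (1 + v)) for this Gaussian pair."""
    post_mean = y / (1.0 + noise_var)
    post_var = noise_var / (1.0 + noise_var)
    return -(x - post_mean) / post_var

x, y, noise_var = 0.3, 2.0, 0.5
guided = guided_score(x, y, noise_var)
exact = posterior_score(x, y, noise_var)
```

The two values agree, so running Langevin dynamics or the reverse process with `guided_score` would target $p(x | y)$ without retraining `prior_score`.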
class: middle
Continuous-time diffusion models
.center.width-100[]
With $\beta_t = 1 - \alpha_t$, we can rewrite the forward process as
$$\begin{aligned}
\mathbf{x}_t &= \sqrt{ {\alpha}_t} \mathbf{x}_{t-1} + \sqrt{1-{\alpha}_t} \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
&= \sqrt{1 - {\beta}_t} \mathbf{x}_{t-1} + \sqrt{ {\beta}_t} \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
&= \sqrt{1 - {\beta}(t)\Delta_t} \mathbf{x}_{t-1} + \sqrt{ {\beta}(t)\Delta_t} \mathcal{N}(\mathbf{0}, \mathbf{I})
\end{aligned}$$
When $\Delta_t \rightarrow 0$, we can further rewrite the forward process as
$$\begin{aligned}
\mathbf{x}_t &= \sqrt{1 - {\beta}(t)\Delta_t} \mathbf{x}_{t-1} + \sqrt{ {\beta}(t)\Delta_t} \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
&\approx \mathbf{x}_{t-1} - \frac{\beta(t)\Delta_t}{2} \mathbf{x}_{t-1} + \sqrt{ {\beta}(t)\Delta_t} \mathcal{N}(\mathbf{0}, \mathbf{I})
\end{aligned}.$$
This last update rule corresponds to the Euler-Maruyama discretization of the stochastic differential equation (SDE)
$$\text{d}\mathbf{x}_t = -\frac{1}{2}\beta(t)\mathbf{x}_t \text{d}t + \sqrt{\beta(t)} \text{d}\mathbf{w}_t$$
describing the diffusion in the infinitesimal limit.
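A minimal sketch of the Euler-Maruyama discretization of this SDE for scalar data. The linear rate $\beta(t)$ on $[0, 1]$, its bounds, the number of steps and the function names are assumptions of this example.

```python
import math
import random

def beta(t, beta_min=0.1, beta_max=20.0):
    """Hypothetical linear noise rate beta(t) for t in [0, 1]."""
    return beta_min + t * (beta_max - beta_min)

def simulate_forward(x0, n_steps, rng):
    """Euler-Maruyama: x <- x - 0.5 * beta(t) * x * dt + sqrt(beta(t) * dt) * z."""
    dt = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        t = i * dt
        z = rng.gauss(0.0, 1.0)
        x = x - 0.5 * beta(t) * x * dt + math.sqrt(beta(t) * dt) * z
    return x

rng = random.Random(0)
finals = [simulate_forward(3.0, 300, rng) for _ in range(2000)]
mean = sum(finals) / len(finals)
var = sum((x - mean) ** 2 for x in finals) / len(finals)
```

All trajectories start from $\mathbf{x}_0 = 3$, yet the final states have a mean close to 0 and a variance close to 1, as expected from a diffusion that transports the data towards a standard Gaussian.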
.center.width-80[]
class: middle
The reverse process satisfies a reverse-time SDE that can be derived analytically from the forward-time SDE and the score of the marginal distribution $q(\mathbf{x}_t)$, as
$$\text{d}\mathbf{x}_t = \left[ -\frac{1}{2}\beta(t)\mathbf{x}_t - \beta(t)\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) \right] \text{d}t + \sqrt{\beta(t)} \text{d}\mathbf{w}_t.$$
The score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$ of the marginal diffused density $q(\mathbf{x}_t)$ is not tractable, but can be estimated using denoising score matching (DSM) by solving
$$\arg \min_\theta \mathbb{E}_{q(\mathbf{x}_0)} \mathbb{E}_{t\sim U[0,T]} \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} || s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) ||_2^2,$$
which will result in $s_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$ because of the outer expectation over $q(\mathbf{x}_0)$.
.success[This is just the .bold[same objective] as for VDMs! (See Interpretation 3)]
class: middle
Probability flow ODE
For any diffusion process $\text{d}\mathbf{x}_t = \mathbf{f}(t, \mathbf{x}_t) \text{d}t + g(t) \text{d}\mathbf{w}_t$, there exists a corresponding deterministic process
$$\text{d}\mathbf{x}_t = \left[ \mathbf{f}(t, \mathbf{x}_t) - \frac{1}{2} g^2(t) \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) \right] \text{d}t$$
whose trajectories share the same marginal densities $p(\mathbf{x}_t)$.
Therefore, when $\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$ is replaced by its approximation $s_\theta(\mathbf{x}_t, t)$, the probability flow ODE becomes a special case of a neural ODE. In particular, it is an example of continuous-time normalizing flows!
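As a hedged illustration, the probability flow ODE can be integrated backward in time on a toy case where everything is analytical. The sketch assumes $\mathbf{f}(t, \mathbf{x}_t) = -\frac{1}{2}\beta \mathbf{x}_t$ and $g(t) = \sqrt{\beta}$ with a constant $\beta$, and scalar Gaussian data $p_0 = \mathcal{N}(0, 0.25)$, so that the exact score is available. The constants and names are hypothetical, and a learned $s_\theta$ would replace `score` in practice.

```python
import math

BETA = 1.0    # constant noise rate (assumption)
T = 5.0       # final time (assumption)
S0_SQ = 0.25  # variance of the toy data distribution p_0 = N(0, 0.25)

def marginal_var(t):
    """Variance of p_t under dx = -0.5 * BETA * x dt + sqrt(BETA) dw."""
    a = math.exp(-BETA * t)
    return S0_SQ * a + (1.0 - a)

def score(x, t):
    """Exact score of the Gaussian marginal p_t = N(0, marginal_var(t))."""
    return -x / marginal_var(t)

def ode_drift(x, t):
    """f(t, x) - 0.5 * g(t)^2 * score(x, t)."""
    return -0.5 * BETA * x - 0.5 * BETA * score(x, t)

def integrate_backward(x_T, n_steps=5000):
    """Explicit Euler steps from t = T down to t = 0."""
    dt = T / n_steps
    x, t = x_T, T
    for _ in range(n_steps):
        x = x - ode_drift(x, t) * dt
        t -= dt
    return x

x_T = 1.3
x_0 = integrate_backward(x_T)
# For Gaussian marginals the flow only rescales x by the ratio of standard deviations.
expected = x_T * math.sqrt(marginal_var(0.0) / marginal_var(T))
```

The mapping from $\mathbf{x}_T$ to $\mathbf{x}_0$ is deterministic and invertible, which is what makes the normalizing-flow view possible.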
Directly modeling the data distribution can make the denoising process difficult to learn. A more effective approach is to combine VAEs with a diffusion prior.
The distribution of latent embeddings is simpler to model.
Diffusion on non-image data is possible with tailored autoencoders.