Diffusion Models

Diffusion Models (DMs) include two processes: forward and backward.

Forward process

General idea

Degrading input data using noise iteratively, forward in time (i.e., $t$ increases).

Given an image $x_0 \sim q(x_0)$, where $q(x_0)$ is called the data distribution, the forward process gradually adds Gaussian noise over $T$ time steps and produces the latent $x_T$.
At each time step $t$, we sample $x_t$ from the Gaussian distribution $\mathcal{N}(\sqrt{1 - \beta_t} x_{t-1}, \beta_t \mathcal{I})$, where the hyper-parameters $0 < \beta_{1:T} < 1$ represent the variance of the noise incorporated at each time step. Intuitively, all we want is to destroy the whole structure of the original image through a diffusion process that iteratively decays the previous image $x_{t-1}$ and adds Gaussian noise to it to produce $x_t$.
If $T$ is sufficiently large, the distribution of the latent variable $x_T$ is nearly an isotropic Gaussian. Why? Each step shrinks the previous image by the factor $\sqrt{1 - \beta_t} < 1$ and adds fresh Gaussian noise with variance $\beta_t \in (0,1)$. After many steps the contribution of $x_0$ decays toward zero, while the accumulated noise, being a linear combination of independent Gaussians, is itself Gaussian. Therefore, $x_T$ should be isotropic Gaussian noise, ideally following the standard normal distribution: $q(x_T) \approx \mathcal{N}(x_T; 0, \mathcal{I})$.
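The claim that $x_T$ ends up close to standard normal can be checked with a small simulation. The sketch below is only illustrative: it uses the Python standard library, treats the "image" as a flat list of pixel values, and assumes a linear $\beta$ schedule whose endpoints (`beta_start`, `beta_end`) are hypothetical choices, not values taken from this post.

```python
import math
import random


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule for T >= 2 steps; the endpoints are illustrative."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]


def run_forward_chain(x0, betas, rng):
    """Apply x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps for every step."""
    x = list(x0)
    for beta in betas:
        keep, noise = math.sqrt(1.0 - beta), math.sqrt(beta)
        x = [keep * v + noise * rng.gauss(0.0, 1.0) for v in x]
    return x


rng = random.Random(0)
x0 = [3.0] * 1000  # a toy "image": 1000 pixels, all equal to 3
xT = run_forward_chain(x0, linear_betas(1000), rng)

mean = sum(xT) / len(xT)
var = sum((v - mean) ** 2 for v in xT) / len(xT)
print(f"mean={mean:.3f}, var={var:.3f}")  # expected to be close to 0 and 1
```

Even though every pixel starts at 3, the empirical mean and variance of $x_T$ should land near 0 and 1, which is what "nearly isotropic Gaussian" means in practice.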

Mathematical explanation

In the forward process, we take $x_0$ as the original image and produce $x_T$, a latent variable that follows an isotropic Gaussian distribution. At a given time step $t$, we produce $x_t$ as:

$$ \begin{align} q(x_t|x_{t-1}) &:= \mathcal{N}(x_t; \sqrt{1 - \beta_t}x_{t-1}, \beta_t\mathcal{I}), \\ x_t &= \sqrt{1-\beta_t}x_{t-1} + \sqrt{\beta_t}\epsilon \label{eqf} \tag{1} \end{align} $$

where $q(x_t|x_{t-1})$ is the conditional probability distribution of the image at time step $t$ given the image at step $t-1$. To sample from this distribution, we use a property of the Gaussian distribution: if $x_t \sim \mathcal{N}(\mu, \beta\mathcal{I})$, it can be expressed as $$ x_t = \mu + \sqrt{\beta}\epsilon, $$ where $\epsilon \sim \mathcal{N}(0,\mathcal{I})$ is a standard normal random variable.
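The sampling rule above can be sketched in a few lines. This is a minimal scalar illustration, not a reference implementation: the function names `sample_gaussian` and `forward_step` and the values of `x_prev` and `beta_t` are made up for the example.

```python
import math
import random


def sample_gaussian(mu, beta, rng):
    """Draw x ~ N(mu, beta) as mu + sqrt(beta) * eps with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + math.sqrt(beta) * eps


def forward_step(x_prev, beta_t, rng):
    """One forward step: x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t)."""
    return sample_gaussian(math.sqrt(1.0 - beta_t) * x_prev, beta_t, rng)


rng = random.Random(0)
x_prev, beta_t = 2.0, 0.1  # illustrative values
samples = [forward_step(x_prev, beta_t, rng) for _ in range(20000)]

emp_mean = sum(samples) / len(samples)
emp_var = sum((s - emp_mean) ** 2 for s in samples) / len(samples)
print(emp_mean, math.sqrt(1.0 - beta_t) * x_prev)  # empirical vs. target mean
print(emp_var, beta_t)                              # empirical vs. target variance
```

The empirical mean and variance should match $\sqrt{1-\beta_t}\,x_{t-1}$ and $\beta_t$ up to sampling error, confirming that shifting and scaling a standard normal sample is equivalent to sampling from the target Gaussian.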

You may wonder why the coefficient has to be $\sqrt{1-\beta_t}$. For simplicity, consider a scalar $x_0 \sim \mathcal{N}(\theta, 1)$, and suppose we would like to keep the variance of the image equal to 1 at every step $t$. Write the first step with an unknown coefficient $\alpha$: $$ x_1 = \alpha x_0 + \sqrt{\beta_1}\epsilon_1 $$ This equation comes from applying the reparameterization trick. As a friendly reminder, the reparameterization trick states that: $$ x \sim \mathcal{N}(\mu, \sigma^2) \iff x = \mu + \sigma \cdot \epsilon, \ \epsilon \sim \mathcal{N}(0, 1) $$

Since $x_0$ and $\epsilon_1$ are independent, the variance of $x_1$ is: $$ Var(x_1) = \alpha^2 Var(x_0) + \beta_1 = \alpha^2 + \beta_1 $$ As mentioned above, we want to keep $Var(x_1) = 1$. Enforcing this constraint leads to: $$ 1 = \alpha^2 + \beta_1 \Rightarrow \alpha = \sqrt{1 - \beta_1} $$ The same argument works for every step $t$, so in general we can write: $$ x_t = \sqrt{1-\beta_t}x_{t-1} + \sqrt{\beta_t} \epsilon, \ \epsilon \sim \mathcal{N}(0, \mathcal{I}) $$ which corresponds to the probability distribution we started from: $$ q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t \mathcal{I}) $$ Instead of moving step by step from time step $t-1$ to $t$, we can jump directly from $x_0$ to any $x_t$ in a single step by using $q(x_t | x_0)$:

$$ q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\gamma_t}x_0, (1-\gamma_t)\mathcal{I}), $$ where $\gamma_t = \prod_{i=1}^{t}(1-\beta_i)$. Consequently, $x_t$ can be directly calculated from $x_0$ by: $$ x_t = \sqrt{\gamma_t}\cdot x_0 + \sqrt{1-\gamma_t}\cdot\epsilon,\ \epsilon\sim\mathcal{N}(0, \mathcal{I}) \tag{2} $$ You may wonder why we end up with such a simple form. Let's go step by step by expanding equation (1), writing $\alpha_t := 1 - \beta_t$ for brevity (so the coefficient derived above is $\sqrt{\alpha_t}$).

$$ \begin{align} \sqrt{\alpha_t}x_{t-1} + \sqrt{1-\alpha_{t}}\epsilon_t &= \sqrt{\alpha_{t}}(\sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1-\alpha_{t-1}}\epsilon_{t-1}) + \sqrt{1-\alpha_{t}}\epsilon_t \\ &= \sqrt{\alpha_t \alpha_{t-1}}x_{t-2} + \sqrt{\alpha_t (1-\alpha_{t-1})}\epsilon_{t-1} + \sqrt{1-\alpha_t}\epsilon_t \\ &= \sqrt{\alpha_t \alpha_{t-1}}x_{t-2} + \sqrt{\alpha_t (1-\alpha_{t-1}) + 1-\alpha_t}\,\bar\epsilon \\ &= \sqrt{\alpha_t \alpha_{t-1}}x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar\epsilon \tag{3} \end{align} $$

Here $\epsilon_t$ and $\epsilon_{t-1}$ are independent standard normal samples drawn at steps $t$ and $t-1$, and $\bar\epsilon \sim \mathcal{N}(0, \mathcal{I})$ denotes the single noise term they merge into. The reason we can go from the second line to the third is the additive property of variances: when two independent Gaussian noises are added, the result is again Gaussian and its variance is the sum of their variances. Here, $\sqrt{\alpha_t (1-\alpha_{t-1})}\epsilon_{t-1}$ is Gaussian noise, and so is $\sqrt{1-\alpha_{t}}\epsilon_t$. By the reparameterization trick, $Var(\sqrt{\alpha_t (1-\alpha_{t-1})}\epsilon_{t-1}) = \alpha_t (1-\alpha_{t-1})$ and $Var(\sqrt{1-\alpha_{t}}\epsilon_t) = 1-\alpha_{t}$. Adding the two variances gives $\alpha_t(1-\alpha_{t-1}) + 1 - \alpha_t = 1 - \alpha_t\alpha_{t-1}$, and applying the reparameterization trick once more lets us write the sum as the third line.
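The variance-merging step can be sanity-checked numerically. The snippet below is a small sketch with hypothetical values for $\alpha_t$ and $\alpha_{t-1}$ (0.9 and 0.8 are arbitrary); it draws the two noise terms independently and compares the empirical variance of their sum with $1 - \alpha_t\alpha_{t-1}$.

```python
import math
import random

rng = random.Random(0)
alpha_t, alpha_prev = 0.9, 0.8  # hypothetical values of alpha_t and alpha_{t-1}

a = math.sqrt(alpha_t * (1.0 - alpha_prev))  # coefficient of the older noise
b = math.sqrt(1.0 - alpha_t)                 # coefficient of the newer noise

# Sum of two independently drawn, scaled standard normal samples.
n = 50000
merged = [a * rng.gauss(0.0, 1.0) + b * rng.gauss(0.0, 1.0) for _ in range(n)]

mean = sum(merged) / n
emp_var = sum((m - mean) ** 2 for m in merged) / n
target_var = 1.0 - alpha_t * alpha_prev  # algebraically equal to a**2 + b**2

print(emp_var, target_var)
```

Note that the check only works because the two noise samples are drawn independently; reusing the same sample for both terms would add the coefficients rather than the variances.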

Next, expanding one more level from (3), with $\bar\epsilon$ denoting the merged standard normal noise on each line and $\epsilon_{t-2}$ a fresh independent sample, we have:

$$ \begin{align} &\sqrt{\alpha_t \alpha_{t-1}}x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar\epsilon \\ = &\sqrt{\alpha_t \alpha_{t-1}}(\sqrt{\alpha_{t-2}}x_{t-3} + \sqrt{1-\alpha_{t-2}}\epsilon_{t-2})+ \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar\epsilon \\ = &\sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}}x_{t-3}+ \sqrt{1-\alpha_t \alpha_{t-1} \alpha_{t-2}}\,\bar\epsilon \end{align} $$

We can now recognize the pattern: $$ \begin{align} x_t &= \sqrt{\prod_{i=1}^{t}\alpha_i}\,x_0 + \sqrt{1-\prod_{i=1}^{t}\alpha_i}\,\epsilon, \\ q(x_t|x_0) &= \mathcal{N}\left(x_t; \sqrt{\prod_{i=1}^t\alpha_i}\,x_0, \left(1 -\prod_{i=1}^t\alpha_i\right)\mathcal{I}\right) \end{align} $$ Substituting $\alpha_i = 1 - \beta_i$, so that $\prod_{i=1}^t\alpha_i = \gamma_t$, we finally obtain the closed form (2) discussed above.
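To make the closed form concrete, here is a minimal sketch of one-shot sampling from $q(x_t|x_0)$, together with a deterministic check that unrolling the per-step recursion gives the same coefficients as the cumulative product. The helper names (`cumulative_gammas`, `q_sample`, `recursive_coefficients`) and the 10-step schedule are assumptions made for this example.

```python
import math
import random


def cumulative_gammas(betas):
    """gamma_t = prod_{i<=t} (1 - beta_i), returned for t = 1..T."""
    gammas, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        gammas.append(running)
    return gammas


def q_sample(x0, t, gammas, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot via equation (2); t is 1-indexed."""
    g = gammas[t - 1]
    return [math.sqrt(g) * v + math.sqrt(1.0 - g) * rng.gauss(0.0, 1.0) for v in x0]


def recursive_coefficients(betas):
    """Unroll equation (1), tracking the x_0 coefficient and the noise variance."""
    coef, noise_var, out = 1.0, 0.0, []
    for beta in betas:
        alpha = 1.0 - beta
        coef *= math.sqrt(alpha)                       # signal is scaled by sqrt(alpha_t)
        noise_var = alpha * noise_var + (1.0 - alpha)  # old noise shrinks, new noise adds
        out.append((coef, noise_var))
    return out


betas = [0.01 * (i + 1) for i in range(10)]  # hypothetical schedule, T = 10
gammas = cumulative_gammas(betas)

# The step-by-step recursion and the closed form should agree at every t.
for (coef, noise_var), g in zip(recursive_coefficients(betas), gammas):
    assert math.isclose(coef, math.sqrt(g)) and math.isclose(noise_var, 1.0 - g)

x5 = q_sample([1.0, -1.0, 0.5], t=5, gammas=gammas, rng=random.Random(0))
print(x5)
```

The practical benefit is that a noisy sample at any step $t$ costs a single draw instead of $t$ sequential ones. Also note that the squared signal coefficient and the noise variance always sum to 1, which is the variance-preserving property that motivated the $\sqrt{1-\beta_t}$ factor.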

Backward process

General idea

The backward process is the reverse of the forward process: given a latent variable $z_T = x_T$, where $T$ is the total number of time steps of our diffusion model, drawn from a prior distribution (usually the standard normal distribution, i.e., $z_T \sim \mathcal{N}(0, \mathcal{I})$), we want to produce a sample $z_0 \sim q(x_0)$, where $q(x_0)$ is the real image distribution.
Going from $t$ to $t-1$, we use a conditional distribution to sample the latent $z_{t-1}\sim p(z_{t-1}|z_{t})$. This amounts to removing a certain amount of noise from $z_t$ to get $z_{t-1}$, the opposite of the forward process, where we add noise to the previous image at each time step.

Mathematical explanation

The key to the backward process is understanding how to sample $z_{t-1}$ from $z_t$, which amounts to estimating $p(z_{t-1}|z_{t})$: $$ p_\theta(z_{t-1}|z_t) = \mathcal{N}(z_{t-1}; \mu_{\theta}(z_t, t), \sigma_{\theta}(z_t, t)) $$ where $\mu_\theta$ and $\sigma_\theta$ are learned by two neural networks. As in the forward process, letting $\gamma_t = \prod_{i=1}^t(1-\beta_i)$, the networks can be conditioned on $\gamma_t$ instead of $t$: $$ p_\theta(z_{t-1}|z_t) = \mathcal{N}(z_{t-1}; \mu_{\theta}(z_t, \gamma_t), \sigma_{\theta}(z_t, \gamma_t)) $$
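The reverse sampling loop can be sketched as follows. This is only a structural illustration: `mu_theta` and `sigma_theta` here are hand-written placeholders standing in for the two trained networks, their outputs are arbitrary, and treating $\sigma_\theta$ as a scalar variance is an assumption of this sketch.

```python
import math
import random


def mu_theta(z_t, t):
    """Placeholder for the learned mean network (hypothetical: just shrinks z_t)."""
    return [0.9 * v for v in z_t]


def sigma_theta(z_t, t):
    """Placeholder for the learned variance network (hypothetical constant variance)."""
    return 0.01


def reverse_step(z_t, t, rng):
    """Sample z_{t-1} ~ N(mu_theta(z_t, t), sigma_theta(z_t, t)), sigma taken as a variance."""
    mean = mu_theta(z_t, t)
    std = math.sqrt(sigma_theta(z_t, t))
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]


def sample(T, dim, rng):
    """Start from z_T ~ N(0, I) and apply the reverse step T times."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(T, 0, -1):
        z = reverse_step(z, t, rng)
    return z


z0 = sample(T=50, dim=4, rng=random.Random(0))
print(z0)
```

With real networks in place of the placeholders, the same loop would turn a draw from the prior into a sample resembling the data; with these stand-ins it only demonstrates the control flow from $z_T$ down to $z_0$.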

Optimization

To make the backward process learn to invert the forward process, we minimize the Kullback-Leibler (KL) divergence between the joint distributions of the forward and reverse sequences:

$$ \begin{align} p_\theta(z_0, \dots, z_T) &= p(z_T)\prod_{t=1}^Tp_\theta(z_{t-1}|z_t), \\ q(x_0, \dots, x_T) &= q(x_0)\prod_{t=1}^Tq(x_t|x_{t-1}), \end{align} $$

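which leads to minimizing the following, with $p_\theta$ evaluated on the forward trajectory (i.e., $z_t = x_t$) and $c$ collecting the terms that do not depend on $\theta$: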

$$ \begin{align} &\text{KL}(q (x_0, \dots, x_T) \,\|\, p_\theta (z_0, \dots, z_T)) \\ &= - \mathbb{E}_{q(x_0, \dots, x_T)} [\log p_\theta (z_0, \dots, z_T)] + c \\ &\overset{(i)}{=} \mathbb{E}_{q(x_0, \dots, x_T)} \left[ - \log p (z_T) - \sum_{t=1}^T \log \frac{p_\theta (z_{t-1} | z_t)}{q (x_t | x_{t-1})} \right] + c \\ &\overset{(ii)}{\geq} \mathbb{E} \left[ - \log p_\theta (z_0) \right] + c \tag{4} \end{align} $$
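As a rough illustration of the bracketed term in (4), the sketch below estimates it by Monte Carlo for a scalar toy problem. Several pieces are assumptions made for the example: `reverse_mean` is a hypothetical stand-in for $\mu_\theta$, the reverse variance is fixed to $\beta_t$, the prior $p(z_T)$ is taken to be $\mathcal{N}(0, 1)$, and the constant schedule is arbitrary.

```python
import math
import random


def log_normal(x, mean, var):
    """Log-density of a scalar Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def reverse_mean(x_t, t):
    """Hypothetical stand-in for mu_theta; a trained network would go here."""
    return x_t


def trajectory_loss(x0, betas, rng):
    """One-sample estimate of -log p(x_T) - sum_t log[p_theta(x_{t-1}|x_t) / q(x_t|x_{t-1})]."""
    xs = [x0]
    for beta in betas:  # simulate the forward chain q
        xs.append(math.sqrt(1.0 - beta) * xs[-1] + math.sqrt(beta) * rng.gauss(0.0, 1.0))

    loss = -log_normal(xs[-1], 0.0, 1.0)  # prior term, assuming p(z_T) = N(0, 1)
    for t, beta in enumerate(betas, start=1):
        log_p = log_normal(xs[t - 1], reverse_mean(xs[t], t), beta)  # assumed variance beta_t
        log_q = log_normal(xs[t], math.sqrt(1.0 - beta) * xs[t - 1], beta)
        loss -= log_p - log_q
    return loss


rng = random.Random(0)
betas = [0.05] * 20  # hypothetical constant schedule
estimates = [trajectory_loss(0.5, betas, rng) for _ in range(2000)]
avg_loss = sum(estimates) / len(estimates)
print(avg_loss)  # Monte Carlo estimate of the expectation in (4), up to the constant c
```

In training, this quantity would be averaged over data samples and minimized with respect to the parameters of the reverse model; here the reverse model has no parameters, so the number only shows how the terms of (4) are assembled.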
