Existing generative models can mostly be categorized into Explicit and Implicit models. An important aspect to discuss the differences between these categories is based on theirs objectives.

In generative modeling, tractability and flexibility are two conflicting objectives. In particular, tractable models are usually analytically computable, therefore they are easy to evaluate and fit. However, they are not usually flexible enough to learn the true data structure. On the other hands, flexible models is supposed to fit arbitrary data structure, but they are expensive in evaluating, training, and sampling.

Explicit generative models

This is so-called Likelihood-based generative models. As the name states, these models usually make strong assumptions to ensure tractability of likelihood. These assumptions are usually through strong restrictions on the model architecture (flow-based models), or reliance on surrogate objectives to approximate maximum likelihood training (VAEs), or causal convolution (autoregressive models).

Pros:

  • By explicitly modeling $P(x)$, they can capture the entire data distribution, leading to no mode collapse.
  • Generally more stable to train, and better coverage.

Cons

  • Less efficient due to likelihood computation.
  • Sample quality are generally worse compared to implicit models.

Implicit generative models

Implicit generative models generates samples from a target distribution without the need of approximating the probability distribution of the data, demonstrating theirs flexibility. In contrast, the probability distribution is implicitly represented by a model of its sampling process. They often require adversarial training, which is notoriously unstable and can lead to mode collapse. The most prominent candidate is GAN.

Pros:

  • Efficient in evaluating, training and sampling.
  • Often produce sharper, more realistic samples.

Cons:

  • Prone to mode collapse.
  • Adversarial training can be unstable.
  • Cannot compute likelihood of a sample, they just sample.

Score-based generative models

There exists diffusion and score-based generative models, which are both analytically tractable and flexible. These models circumvent several of the above limitations. In these models, we do not need a tractable normalizing constant. Instead, we can rely on score matching to directly learn the models.

Score-based models have achieved state-of-the-art performance on many downstream tasks and applications. These tasks include generation, audio synthesis, shape generation, … For diffusion models, kindly refer to my previous post

Stein score function

The key idea is to model the gradient of the log probability density function, a quantity often known as the (Stein) score function. The score function is the gradient of the log-probability of a distribution w.r.t. to the input. This represents the direction of steepest ascent in the log density of $p(x)$, helping the model understand how $x$ relates to the data distribution. $$ \nabla_x logp(x) $$ A model $s_\theta(x)$, which models the score function explicitly, is a score-based model: $$ s_\theta(x) \approx \nabla_x logp(x) $$ Normally, the probability density $p(x)$ is intractable due to the involvement of a normalizing constant $Z_\theta$.​

$$ p_\theta(x)= \frac{exp(−f_\theta(x))​}{Z_\theta​}, \text{where} Z_\theta​=\int exp(−f_\theta​(x))dx $$

Here the function $f_\theta(x)$ is often called an unnormalized probabilistic model, or energy-based model. We can train $p_\theta(x)$ by maximizing the log-likelihood of the data: $$ max_\theta \sum_{i=1}^N log \ p\theta(x_i) $$ but this equation requires $p_\theta(x)$ to be a normalized pdf. This is infeasible because computing $p_\theta(x)$ relies on evaluating $Z_\theta$, which is a typically intractable quantity for any general $f_\theta(x)$. This is the reason why explicit models have to have constraints on their architectures, to make $Z_\theta$ tractable.

However, in score-based models, we have: $$ \nabla_x​logp(x)=\nabla_x​\log\exp(−f_\theta​(x))−\nabla_x​logZ_\theta​. $$ where $\nabla_x​logZ_\theta = 0$ due to $Z_\theta$ is independent from $x$. Thus, we do not need a tractable normalizing constant, i.e., we do not need any special architectures to make the normalizing constant tractable. This also gets rid of expensive computation, focusing on learning $\nabla_x​\log\exp(−f_\theta​(x))$, the gradient of the unnormalized log density.

The goal is to train the model to approximate the true score function $\nabla_x \log p(x)$. This can be done by minimizing the Fisher divergence, similar to likelihood-based models, which measures the difference between the true score and the model’s predicted score:

$$ \mathbb{E}_{p(x)} \left[ \lVert \nabla_x \log p(x) - s_\theta(x) \rVert^2_2 \right] $$

where $\nabla x​\log\ p(x)$ is the “ground truth” score of the data distribution (unknown in practice), and $s_\theta(x)$ is the model’s predicted score. But since we do not know the “ground truth ”, how do we train and backdrop? What do we optimize? To address this, we use a proxy that allows the model to learn the score indirectly.

Score matching

Directly computing $\nabla_x \log p(x)$ is infeasible since $p(x)$ is often unknown. However, it can be shown (up to some regularity conditions) that the above loss is equivalent to minimizing:

$$ \mathbb{E}_{p(x)} \left[ \text{tr}(\nabla_x s_\theta(x)) + \frac{1}{2} \lVert s_\theta(x) \rVert^2_2 \right], $$

where $\text{tr}(\nabla_x s_\theta(x))$ is the trace of the Jacobian of $s_\theta(x)$, which represents the divergence of the score function. This removes the need of computing $\nabla_x \log p(x)$, but the trace of the Jacobian is still too expensive for large networks and approximations are needed.

Denoising score matching

This is a widely used approach to train score-based models when the true score function $\nabla_x \log p(x)$ is unknown. Denoising score matching (DSM) works well for small level of noise:

$$ \frac{1}{2} \mathbb{E}_{q_\sigma(\tilde{x} \mid x)p_{data}(x)} \left[ \lVert s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \rVert_2^2 \right], $$

where:

  • $q_\sigma(\tilde{x} \mid x)$: Noise distribution, describing how $x$ is corrupted to $\tilde{x}$.
  • $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)$: The “ground truth” score of the noisy data, which can be computed analytically.
  • $s_\theta(\tilde{x})$: The model’s predicted score for the noisy data. DSM reformulates this objective by corrupting the data $x$ into a noisy version $\tilde{x}$, making the problem tractable.

Steps of DSM:

  1. Corrupt the Data:
    • Sample a clean data point $x \sim p_{data}(x)$.
    • Add noise from a pre-specified distribution $q_\sigma(\tilde{x} \mid x)$ (e.g., Gaussian noise): $\tilde{x} = x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$
  2. Train the Model to Predict the Noisy Score:
    • The true score for the noisy data $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)$ is given by: $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}$
    • This provides a tractable target for training the model.
  3. Loss Function:
    • Minimize the squared error between the model’s predicted score $s_\theta(\tilde{x})$ and the true noisy score $\frac{1}{2} \mathbb{E}_{q_\sigma(\tilde{x} \mid x)p_{data}(x)} \left[ \lVert s_\theta(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^2} \rVert_2^2 \right]$

DSM allows the model to approximate the score of the clean data $\nabla_x \log p(x)$ indirectly by learning the score of the noisy data $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)$ The key insight is that the learned noisy score can be related to the clean score:

$$ \nabla_x \log p(x) = \lim_{\sigma \to 0} \mathbb{E}_{q_\sigma(\tilde{x} \mid x)} \left[ \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \right] $$

Advantages of score matching

  • We can train with score matching directly with SGD like maximizing log-likelihood, without requiring adversarial optimization.
  • We have no constraints on the form of as we do not require to be the score function of a normalized distribution.

Additionally, using the score matching objective gives us a considerable amount of modeling flexibility. The Fisher divergence itself does not require $s_\theta(x)$ to be an actual score function of any normalized distribution—it simply compares the $l_2$ distance between the ground-truth data score and the score-based model, with no additional assumptions on the form of $s_\theta(x)$. In fact, the only requirement on the score-based model is that it should be a vector-valued function with the same input and output dimensionality, which is easy to satisfy in practice.

As a brief summary, we can represent a distribution by modeling its score function, which can be estimated by training a score-based model of free-form architectures with score matching.


Langevin dynamics

After training the score-based model, we can sample with Langevin dynamics. Langevin dynamics is a Markov Chain Monte Carlo (MCMC) method that allows sampling from a target distribution $p(x)$, given the gradient of its log-probability (the score function, $\nabla_x \log p(x)$. It is particularly suited for score-based models because these models learn to approximate $\nabla_x \log p(x)$.

The sampling process involves iteratively updating the sample xtx_txt​ using the following update rule:

$$ x_{t+1} = x_t + \epsilon \cdot \nabla_x \log p(x_t) + \sqrt{2\epsilon} \cdot z_t, \quad t=0, 1, …, K $$

where:

  • $\nabla_x \log p(x_t)$: The score function, approximated by the trained model $s_\theta(x)$.
  • $\epsilon$: Step size controlling the magnitude of the updates. The gradient $\epsilon \cdot \nabla_x \log p(x_t)$ ensures the sample moves toward high-probability regions.
  • $z_t \sim \mathcal{N}(0, I)$: Gaussian noise added to each step to ensure proper exploration of the space and convergence to $p(x)$. The noise $\sqrt{2\epsilon} \cdot z_t$ prevents the chain from getting stuck in local modes, ensuring convergence to the correct distribution.
  • $x_0$​: Initial sample, drawn from an arbitrary prior distribution $\pi(x)$ (commonly a Gaussian).

As $\epsilon \to 0$ (small step size) and $K \to \infty$ (sufficient iterations), the sequence ${x_t}$converges to a sample from the target distribution $p(x)$ under certain conditions. In practice, the error is negligible when $\epsilon$ is sufficiently small and $K$ is sufficiently large.

Advantages:

  • Efficiency: Langevin dynamics allows sampling without explicitly modeling$p(x)$, making it highly efficient for score-based generative models.
  • Flexibility: This method works well for any distribution where the score function $\nabla_x \log p(x)$can be approximated.
  • No Normalization Required: Langevin dynamics works directly with the unnormalized log-density $\log p(x)$, bypassing the need for a normalizing constant.
  • Scalable: It can be applied to high-dimensional data by leveraging efficient computation of gradients.
  • Guaranteed Convergence: Theoretically, Langevin dynamics converges to the target distribution under appropriate conditions.