Recap

In the last post, we derived the ELBO and watched it split into two parts:

\begin{equation} \mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)]}_{\text{Reconstruction}} - \underbrace{D_{KL}(q_\phi(z|x) ,||, p(z))}_{\text{Regularization}} \end{equation}

This looks clean on paper: reconstruct $x$ well, while keeping the encoded distribution close to the prior $p(z)$.

But to actually train a neural network with this, we need to answer three questions:

  1. How do we compute the KL term in practice?
  2. How do we estimate the reconstruction expectation?
  3. How do we differentiate through a random sample $z \sim q_\phi(z|x)$?

Question 3 is the hardest one, and this is where the VAE paper really shines. But let’s start with the easy parts first.


The KL Term: A Closed Form

For the VAE to be practical, we need a variational family $\mathcal{Q}$ that is flexible enough to approximate the posterior but simple enough to compute with. The standard choice is the diagonal Gaussian:

\begin{equation} q_\phi(z|x) = \mathcal{N}\big(z; \mu_\phi(x), \text{diag}(\sigma^2_\phi(x))\big) \end{equation}

and the prior is the standard isotropic Gaussian:

\begin{equation} p(z) = \mathcal{N}(z; 0, I) \end{equation}

Two Gaussians. Nothing too wild.

The encoder $\phi$ takes $x$ and outputs two vectors: a mean $\mu_\phi(x) \in \mathbb{R}^d$ and a log-variance $\log \sigma^2_\phi(x) \in \mathbb{R}^d$. We output log-variance rather than variance so the network can produce any real number without needing a positivity constraint.

Because both distributions are Gaussian, the KL divergence has a closed-form expression:

\begin{equation} D_{KL}\big(\mathcal{N}(\mu, \sigma^2 I) ,||, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right) \end{equation}

This is pretty nice actually. No sampling is required for the KL term. We get a deterministic, differentiable expression straight from the encoder’s output. Gradients flow cleanly back into $\phi$ since everything is just elementary functions of $\mu_j$ and $\sigma_j^2$.

So the KL term is basically “free” in terms of differentiability. All the hard work lives in the reconstruction term.


The Reconstruction Term: Monte Carlo

Now the harder half:

\begin{equation} \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] \end{equation}

This expectation has no closed form. The decoder $p_\theta(x|z)$ is a deep neural network, so there’s no way to integrate over all possible $z$ analytically.

So we estimate the expectation with a Monte Carlo sample:

\begin{equation} \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] \approx \frac{1}{L} \sum_{\ell=1}^{L} \log p_\theta(x | z^{(\ell)}), \quad z^{(\ell)} \sim q_\phi(z|x) \end{equation}

In practice, we often use just one sample per data point ($L=1$). This sounds a bit reckless, but it works surprisingly well because the minibatch average already provides a lot of averaging.

What does $\log p_\theta(x|z)$ actually look like?

That depends on the data type:

  • For binary data (e.g. binarized MNIST), $p_\theta(x|z)$ is a product of Bernoullis, and $\log p_\theta(x|z)$ reduces to the negative binary cross-entropy between $x$ and the decoder output.
  • For continuous data with a fixed-variance Gaussian decoder, $\log p_\theta(x|z) = -\frac{1}{2\sigma^2} |x - \hat{x}|^2 + \text{const}$. So it’s just a scaled MSE.

So the “probabilistic” decoder, in practice, collapses into one of the two loss functions you already know. The loss function itself is nothing special — the interesting part is what happens before it.


The Problem: You Cannot Backprop Thru a Sample

And here is where we have a problem.

Our current recipe looks like this:

  1. The encoder outputs $\mu_\phi(x)$ and $\sigma_\phi(x)$.
  2. We sample $z \sim \mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))$.
  3. The decoder computes $\log p_\theta(x|z)$.
  4. We want $\nabla_\phi$ of the whole thing.

Step 2 is the killer. The sample $z$ depends on $\phi$ (because $\mu$ and $\sigma$ are outputs of $\phi$), but it is also a random draw. So what is the derivative of a random draw with respect to the parameters that shape its distribution?

Formally, we want:

\begin{equation} \nabla_\phi , \mathbb{E}_{z \sim q_\phi(z|x)} [f(z)] \end{equation}

where $f(z) = \log p_\theta(x|z)$. The gradient wants to go inside the expectation, but the expectation itself is over a distribution that depends on $\phi$. You can’t just swap them.

A first attempt: the score function estimator

There is a general-purpose gradient estimator for this, called REINFORCE or the score function estimator:

\begin{equation} \nabla_\phi , \mathbb{E}_{q_\phi} [f(z)] = \mathbb{E}_{q_\phi} \big[ f(z) , \nabla_\phi \log q_\phi(z|x) \big] \end{equation}

It’s unbiased and works for any distribution. But in practice, the variance of this estimator is so high that training a deep network with it is painful at best. We need a better idea.


The Reparameterization Trick

Here is the key insight of the VAE. Instead of sampling $z$ directly from a distribution that depends on $\phi$, we sample a fixed random variable and then transform it with a deterministic function that depends on $\phi$.

For a Gaussian encoder, this is very simple. Any sample $z \sim \mathcal{N}(\mu, \sigma^2)$ can be written as:

\begin{equation} z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \end{equation}

where $\odot$ is elementwise multiplication.

The randomness now lives in $\epsilon$, which has nothing to do with $\phi$. The parameters $\mu_\phi(x)$ and $\sigma_\phi(x)$ only appear in the deterministic transformation.

So we don’t eliminate the randomness — we relocate it. The noise is pushed out to a leaf of the computation graph, where autograd can treat it as a constant.

Why this makes everything work

With reparameterization, the gradient becomes:

\begin{equation} \nabla_\phi , \mathbb{E}_{z \sim q_\phi} [f(z)] = \nabla_\phi , \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \big[ f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon) \big] = \mathbb{E}_{\epsilon} \big[ \nabla_\phi f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon) \big] \end{equation}

The gradient and the expectation commute now, because the distribution we’re averaging over ($\mathcal{N}(0, I)$) does not depend on $\phi$.

In autograd terms:

  • $\epsilon$ is a constant tensor sampled once per forward pass.
  • $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ is a deterministic node with two parents ($\mu$ and $\sigma$) that we want gradients for.
  • Backprop flows through $z$ into $\mu$ and $\sigma$, and then into $\phi$. No custom math required.

This is why people call it a “trick.” It looks like a small algebraic rewrite. It is a small algebraic rewrite. But it’s the rewrite that makes the entire VAE differentiable.

A more general picture

Reparameterization isn’t limited to Gaussians. Whenever you can write a sample from your target distribution as $z = g_\phi(x, \epsilon)$ where $\epsilon$ is drawn from some fixed base distribution, you can apply the same idea. This is exactly the principle behind normalizing flows — an entire family of models built on pushing simple noise through invertible, learnable transformations.

For now, though, a Gaussian is all we need.


The Full VAE

We now have every ingredient. Let’s put the VAE together as one training loop.

Forward pass for a single example $x$:

  1. Encoder: $(\mu, \log \sigma^2) = \text{EncoderNet}_\phi(x)$.
  2. Sample: $\epsilon \sim \mathcal{N}(0, I)$.
  3. Reparameterize: $z = \mu + \sigma \odot \epsilon$, where $\sigma = \exp(\tfrac{1}{2} \log \sigma^2)$.
  4. Decoder: $\hat{x} = \text{DecoderNet}_\theta(z)$ (or the parameters of $p_\theta(x|z)$).
  5. Compute the two loss terms:
    • $\mathcal{L}_{\text{recon}} = -\log p_\theta(x|z)$ (BCE or MSE).
    • $\mathcal{L}_{\text{KL}} = \frac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$.
  6. Total loss: $\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{KL}}$.

Backward pass: standard autograd through every step. The stochastic node $\epsilon$ is a constant tensor, so backprop never has to reason about randomness.

Pseudocode

mu, log_var = encoder(x)                  # (B, d), (B, d)
std = torch.exp(0.5 * log_var)
eps = torch.randn_like(std)               # ~ N(0, I)
z = mu + std * eps                        # reparameterization

x_hat = decoder(z)

recon_loss = F.binary_cross_entropy(x_hat, x, reduction="sum") / B
kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / B

loss = recon_loss + kl_loss
loss.backward()

That is the entire VAE training loop. The rest is plumbing.


What We Actually Built

So looking back, we started from the idea that data is produced by hidden causes. We wrote down a probabilistic model, found it intractable, used variational inference, derived the ELBO, decomposed it into reconstruction plus regularization, and finally made it differentiable with the reparameterization trick.

The result is a model that can do two things:

  1. Encode: Given an input $x$, map it to a distribution over latent codes $q_\phi(z|x)$. This gives us a probabilistic representation — not a point, but a region of meaning.
  2. Generate: Sample $z \sim \mathcal{N}(0, I)$, pass it through the decoder, and get a new $x$ that looks like it came from the training distribution.

The second capability is what makes the VAE generative in a way that a plain autoencoder is not. Because the KL regularizer forces the encoder’s outputs to fill the prior smoothly, sampling from the prior lands us in regions the decoder can reconstruct.


Two Things Worth Noticing

The KL term is a budget. It limits how much information the encoder can stuff into $z$ about any particular $x$. If the encoder tries to memorize, KL goes up. If the encoder stays too close to the prior, reconstruction suffers. The ELBO is a negotiation between the two.

The reparameterization trick is why this is a neural network and not an EM algorithm. Classical variational inference required careful, distribution-specific updates. Reparameterization replaces all of that with a single backward pass — turn the inference problem into a differentiable computation and hand it to the optimizer.


Summary

  1. The KL term has a closed form for diagonal Gaussians — no sampling needed.
  2. The reconstruction term is estimated with a single Monte Carlo sample, reducing to BCE or MSE depending on the decoder.
  3. Backpropagating through a raw sample is intractable and very noisy via REINFORCE.
  4. The reparameterization trick rewrites $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, moving randomness to a constant leaf so gradients flow deterministically through $\mu$ and $\sigma$.
  5. The full VAE is an encoder, a sampler, a decoder, and two loss terms — all tied together by autograd.