The Intractable Wall
In the previous post, we defined the Latent Variable Model (LVM) — the idea that complex, high-dimensional data $x$ is generated by a simpler, lower-dimensional latent variable $z$.
We established that to train such a model, we need to maximize the marginal likelihood (the “evidence”):
\begin{equation} p_\theta(x) = \int p_\theta(x, z) \, dz = \int p_\theta(x|z)p(z) \, dz \end{equation}
We also established that for any interesting, non-linear model (where $p_\theta(x|z)$ is a neural network), this integral is intractable: it has no closed form, and numerically integrating over all possible latent states $z$ is prohibitively expensive. Consequently, we cannot calculate the posterior $p_\theta(z|x)$ either, because the evidence $p_\theta(x)$ appears in the denominator of Bayes’ rule:
\begin{equation} p_\theta(z|x) = \frac{p_\theta(x|z)p(z)}{p_\theta(x)} = \frac{p_\theta(x|z)p(z)}{\int p_\theta(x|z)p(z) \, dz} \end{equation}
We are stuck. We have a generative story that makes perfect sense, but we have no way to compute the probability of our data, which means we cannot train our model using standard Maximum Likelihood Estimation (MLE).
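To make the cost concrete, here is a minimal sketch on a deliberately tractable toy model (an assumption for illustration, not the neural-network case above): $z \sim \mathcal{N}(0, 1)$ and $x|z \sim \mathcal{N}(z, 1)$, for which the evidence happens to have the closed form $\mathcal{N}(x; 0, 2)$. The helper names `normal_pdf` and `evidence_on_grid` are made up for this example. In one dimension a grid works fine; the last lines show how the number of grid points grows with the latent dimension.

```python
import math

def normal_pdf(v, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evidence_on_grid(x, lo=-10.0, hi=10.0, n=2001):
    """Riemann-sum approximation of p(x) = integral of p(x|z) p(z) dz on n grid points."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * step
        # Toy model (assumption): likelihood N(x; z, 1), prior N(z; 0, 1).
        total += normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)
    return total * step

x = 2.0
grid_value = evidence_on_grid(x)
exact_value = normal_pdf(x, 0.0, 2.0)  # closed form, available only for this toy model

# Keeping the same resolution in d latent dimensions needs n**d evaluations.
evaluations = {d: 2001 ** d for d in (1, 2, 10)}
print(grid_value, exact_value)
print(evaluations)
```

With 2001 points per axis, a 10-dimensional latent space would already need roughly $10^{33}$ evaluations of the decoder, which is why brute-force integration is not an option.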
So how do we get around this? That’s where Variational Inference (VI) comes in.
The main idea of VI is actually pretty simple. If we cannot calculate the true posterior $p_\theta(z|x)$ analytically, perhaps we can approximate it with a simpler distribution $q_\phi(z|x)$ and then optimize the parameters $\phi$ to make it as close as possible to the truth.
Turning Integration into Optimization
In classical calculus, integration means finding the area under a curve. In high dimensions, doing this numerically is exponentially expensive, because the number of grid points needed grows exponentially with the dimension of $z$. Optimization, however, is something deep learning is exceptionally good at: finding the peak of a mountain (or the bottom of a valley) is exactly what gradient-based methods do.
Variational Inference converts the problem of inference (calculating an integral) into a problem of optimization (minimizing a divergence between two distributions).
We introduce a variational family of distributions $\mathcal{Q}$. We choose a specific distribution $q_\phi(z|x) \in \mathcal{Q}$, parameterized by $\phi$, to serve as a surrogate for the true, intractable posterior $p_\theta(z|x)$.
Our goal is simple: Find the parameters $\phi$ that make $q_\phi(z|x)$ look as much like $p_\theta(z|x)$ as possible.
If we can do this, we can use $q_\phi(z|x)$ as a proxy for the unknown posterior. This $q_\phi(z|x)$ is often called the inference model (or in VAE terminology, the encoder), while $p_\theta(x|z)$ is the generative model (the decoder).
But how do we measure “closeness” between two probability distributions? And how do we minimize that distance if we don’t know the target $p_\theta(z|x)$ in the first place?
The Kullback-Leibler Divergence
To measure the similarity between our approximation $q_\phi(z|x)$ and the truth $p_\theta(z|x)$, we use the Kullback-Leibler (KL) Divergence.
Formally, for two continuous distributions $q(z)$ and $p(z)$, the KL divergence is defined as:
\begin{equation} D_{KL}(q||p) = \int q(z) \log \frac{q(z)}{p(z)} \, dz = \mathbb{E}_{z \sim q} [\log q(z) - \log p(z)] \end{equation}
Key Properties of KL Divergence:
- Non-negative: $D_{KL}(q||p) \ge 0$.
- Zero at equality: $D_{KL}(q||p) = 0$ if and only if $q(z) = p(z)$ almost everywhere.
- Asymmetric: in general, $D_{KL}(q||p) \neq D_{KL}(p||q)$, so the KL divergence is not a true distance metric.
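These properties are easy to check numerically in the one case where the KL divergence has a simple closed form: two univariate Gaussians. The sketch below uses that standard formula; the function name `kl_gauss` and the particular means and standard deviations are arbitrary choices for illustration.

```python
import math

def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form D_KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) )."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

# q = N(0, 1), p = N(1, 2^2): illustrative values only.
kl_qp = kl_gauss(0.0, 1.0, 1.0, 2.0)   # D_KL(q || p), about 0.443
kl_pq = kl_gauss(1.0, 2.0, 0.0, 1.0)   # D_KL(p || q), about 1.307
kl_qq = kl_gauss(0.0, 1.0, 0.0, 1.0)   # D_KL(q || q), exactly 0

print(kl_qp, kl_pq, kl_qq)
```

Both divergences are positive, they differ from each other, and the divergence of a distribution from itself is zero.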
In Variational Inference, we typically minimize the “reverse” KL divergence: $D_{KL}(q_\phi(z|x) || p_\theta(z|x))$. (Note: for brevity, we will often write $D_{KL}(q_\phi||p_\theta)$).
\begin{equation} \phi^* = \arg \min_\phi D_{KL}(q_\phi(z|x) || p_\theta(z|x)) \end{equation}
So, our objective function seems clear. Let’s expand this definition:
\begin{equation} D_{KL}(q_\phi(z|x) || p_\theta(z|x)) = \mathbb{E}_{q_\phi} [\log q_\phi(z|x) - \log p_\theta(z|x)] \end{equation}
Using the definition of conditional probability, $p_\theta(z|x) = p_\theta(x, z) / p_\theta(x)$, so that $\log p_\theta(z|x) = \log p_\theta(x, z) - \log p_\theta(x)$, we get:
\begin{equation} D_{KL}(q_\phi || p_\theta) = \mathbb{E}_{q_\phi} [\log q_\phi(z|x) - \log p_\theta(x, z) + \log p_\theta(x)] \end{equation}
We hit the wall again. The term $\log p_\theta(x)$ (the log marginal likelihood) is in the equation. We can’t evaluate this KL divergence directly because it requires knowing the evidence, the very quantity we could not compute in the first place.
Interlude: Why “Reverse” KL?
You may wonder why we chose $D_{KL}(q||p)$ (Reverse KL) instead of $D_{KL}(p||q)$ (Forward KL). The choice dictates the behavior of our approximation:
Forward KL ($p || q$): This is “mean-seeking” or “inclusive”. To minimize it, wherever $p(z) > 0$, we must ensure $q(z) > 0$ (otherwise the ratio $p/q$ explodes). $q$ tries to cover all the probability mass of $p$. If $p$ is multimodal, $q$ (if simpler, e.g., Gaussian) effectively averages them out, potentially putting high probability in low-probability regions between modes. Crucially, computing this requires taking an expectation over $p(z|x)$, which is the intractable posterior we can’t sample from! This makes Forward KL computationally infeasible for this setup.
Reverse KL ($q || p$): This is “mode-seeking” or “exclusive”. The expectation is over $q$: we minimize $\int q(z) \log \frac{q(z)}{p(z)} \, dz$. Wherever $p(z) \approx 0$, $q(z)$ must also be close to 0, otherwise the ratio $q/p$ incurs a large penalty. As a result, $q$ tends to latch onto one mode of $p$ and ignore the others.
- Pro: We can sample from $q$ (since we designed it!).
- Con: It tends to underestimate the variance of the true posterior.
We use Reverse KL primarily because we can actually compute expectations over $q$.
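The difference between the two behaviors can be seen on a small example. The sketch below assumes a bimodal target $p(z) = \tfrac{1}{2}\mathcal{N}(-4, 1) + \tfrac{1}{2}\mathcal{N}(4, 1)$ (chosen only for illustration) and fits a single Gaussian $q$ in both ways. Reverse KL is minimized by brute-force search over a coarse grid of means and standard deviations, with the integral approximated by a Riemann sum. For Forward KL it uses the known fact that, within the Gaussian family, the minimizer matches the mean and variance of $p$. All helper names are hypothetical.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def log_normal(z, mu, sigma):
    """Log density of N(mu, sigma^2) at z."""
    return -LOG_SQRT_2PI - math.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2

def log_p(z):
    """Log density of the bimodal target 0.5*N(-4, 1) + 0.5*N(4, 1) (log-sum-exp)."""
    a, b = log_normal(z, -4.0, 1.0), log_normal(z, 4.0, 1.0)
    m = max(a, b)
    return math.log(0.5) + m + math.log(math.exp(a - m) + math.exp(b - m))

STEP = 0.05
GRID = [-15.0 + i * STEP for i in range(601)]   # integration grid on [-15, 15]
LOG_P = [log_p(z) for z in GRID]

def reverse_kl(mu, sigma):
    """Riemann-sum estimate of D_KL(q || p) for q = N(mu, sigma^2)."""
    total = 0.0
    for z, lp in zip(GRID, LOG_P):
        lq = log_normal(z, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return total * STEP

# Reverse KL: search mu in [-6, 6] (step 0.5) and sigma in [0.5, 4] (step 0.25).
candidates = [(m * 0.5, s * 0.25) for m in range(-12, 13) for s in range(2, 17)]
rev_mu, rev_sigma = min(candidates, key=lambda c: reverse_kl(*c))

# Forward KL: within the Gaussian family, match the mean and variance of p.
fwd_mu = sum(math.exp(lp) * z for z, lp in zip(GRID, LOG_P)) * STEP
fwd_var = sum(math.exp(lp) * (z - fwd_mu) ** 2 for z, lp in zip(GRID, LOG_P)) * STEP
fwd_sigma = math.sqrt(fwd_var)

print("reverse KL fit:", rev_mu, rev_sigma)
print("forward KL fit:", fwd_mu, fwd_sigma)
```

The reverse-KL fit sits on one of the two modes with a standard deviation close to 1, while the forward-KL fit is centered between the modes with a standard deviation of about $\sqrt{17} \approx 4.1$, placing most of its mass where $p$ has almost none.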
The Great Decomposition: Deriving the ELBO
We need a different approach. We need to derive a relationship between the marginal likelihood, the KL divergence, and something we can compute.
Let’s start from what we want to maximize: the log-likelihood of the data, $\log p_\theta(x)$.
Since $\log p_\theta(x)$ does not depend on $z$, we can multiply it by $\int q_\phi(z|x) \, dz$ (which equals 1) and bring it inside the expectation:
\begin{equation} \log p_\theta(x) = \int q_\phi(z|x) \log p_\theta(x) \, dz = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x)] \end{equation}
Now, we use the product rule of probability, $p_\theta(x, z) = p_\theta(z|x)\,p_\theta(x)$, rearranged as $p_\theta(x) = \frac{p_\theta(x, z)}{p_\theta(z|x)}$. Substituting this in:
\begin{equation} \log p_\theta(x) = \mathbb{E}_{q_\phi} \left[ \log \frac{p_\theta(x, z)}{p_\theta(z|x)} \right] \end{equation}
Here is the “magic” step. We multiply and divide the fraction by our approximate posterior $q_\phi(z|x)$. This introduces our variational parameter $\phi$ into the equation without changing the value:
\begin{equation} \log p_\theta(x) = \mathbb{E}_{q_\phi} \left[ \log \left( \frac{p_\theta(x, z)}{p_\theta(z|x)} \cdot \frac{q_\phi(z|x)}{q_\phi(z|x)} \right) \right] \end{equation}
Using the property of logarithms $\log(ab) = \log a + \log b$, we split this into two terms:
\begin{equation} \log p_\theta(x) = \mathbb{E}_{q_\phi} \left[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \right] + \mathbb{E}_{q_\phi} \left[ \log \frac{q_\phi(z|x)}{p_\theta(z|x)} \right] \end{equation}
This leads us to the Evidence Identity:
\begin{equation} \log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{KL}(q_\phi(z|x) || p_\theta(z|x)) \end{equation}
Where $\mathcal{L}(\theta, \phi; x)$ is called the Evidence Lower Bound (ELBO).
Why is this useful?
Likelihood Bound: We know that $D_{KL} \ge 0$. Therefore:
\begin{equation} \log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)} \left[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \right] \end{equation}
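The term $\mathcal{L}$ is always a lower bound on the log evidence, and the bound is tight exactly when $q_\phi(z|x)$ equals the true posterior (so that the KL term vanishes).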
Tractability: Unlike the marginal likelihood or the true posterior, the ELBO contains only terms we can evaluate: the joint $p_\theta(x, z) = p_\theta(x|z)p(z)$ and our own $q_\phi(z|x)$. The expectation is over $q_\phi$, which we choose (e.g., a Gaussian), so we can easily sample from it and estimate the expectation by Monte Carlo.
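Both claims can be checked on a toy model where the evidence is known. The sketch below assumes $z \sim \mathcal{N}(0, 1)$ and $x|z \sim \mathcal{N}(z, 1)$ (an illustrative choice, not a neural-network decoder), for which $\log p(x) = \log \mathcal{N}(x; 0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. It estimates the ELBO by plain Monte Carlo for two choices of $q$; `elbo_monte_carlo` is a hypothetical helper name.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def log_normal(v, mean, var):
    """Log density of N(mean, var) at v."""
    return -0.5 * (LOG_2PI + math.log(var) + (v - mean) ** 2 / var)

def elbo_monte_carlo(x, q_mean, q_var, n_samples, rng):
    """Average of log p(x, z) - log q(z|x) over samples z ~ q(z|x)."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        # Toy model (assumption): prior N(0, 1), likelihood N(x; z, 1).
        log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
        total += log_joint - log_normal(z, q_mean, q_var)
    return total / n_samples

rng = random.Random(0)
x = 2.0
log_evidence = log_normal(x, 0.0, 2.0)   # exact log p(x) for this toy model

elbo_prior_q = elbo_monte_carlo(x, 0.0, 1.0, 5000, rng)   # q = prior, a poor guess
elbo_exact_q = elbo_monte_carlo(x, 1.0, 0.5, 5000, rng)   # q = true posterior N(x/2, 1/2)

print(log_evidence, elbo_prior_q, elbo_exact_q)
```

With the poor $q$ the estimate lands well below $\log p(x)$. With $q$ equal to the true posterior, every sample contributes exactly $\log p(x)$, so the bound is tight and the estimator has zero variance.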
Alternative Derivation: Jensen’s Inequality
For those who prefer a quicker path, we can derive the ELBO directly using Jensen’s Inequality, which states that for a concave function $f$ (like log), $f(\mathbb{E}[y]) \ge \mathbb{E}[f(y)]$.
\begin{equation} \begin{aligned} \log p_\theta(x) &= \log \int p_\theta(x, z) \, dz \\ &= \log \int p_\theta(x, z) \frac{q_\phi(z|x)}{q_\phi(z|x)} \, dz \\ &= \log \mathbb{E}_{q_\phi} \left[ \frac{p_\theta(x, z)}{q_\phi(z|x)} \right] \\ &\ge \mathbb{E}_{q_\phi} \left[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \right] = \text{ELBO} \end{aligned} \end{equation}
This elegantly proves that the ELBO is indeed a lower bound.
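A quick numerical sanity check of the inequality itself: the sketch below draws arbitrary positive numbers (log-normal samples, standing in for the ratios $p_\theta(x, z) / q_\phi(z|x)$; this stand-in is an assumption for illustration) and compares the log of their average with the average of their logs.

```python
import math
import random

rng = random.Random(0)

# Positive samples playing the role of the ratios p(x, z) / q(z|x).
ratios = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(1000)]

log_of_mean = math.log(sum(ratios) / len(ratios))             # log E[y]
mean_of_log = sum(math.log(r) for r in ratios) / len(ratios)  # E[log y]
gap = log_of_mean - mean_of_log   # non-negative by Jensen's inequality

print(log_of_mean, mean_of_log, gap)
```

The gap is positive for any set of positive numbers that are not all equal. In the ELBO setting, this gap is the KL divergence from the previous derivation.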
The Optimization Strategy
Looking at the Evidence Identity:
\begin{equation} \log p_\theta(x) = \underbrace{\mathcal{L}(\theta, \phi; x)}_{\text{maximize this}} + \underbrace{D_{KL}(q_\phi || p_\theta)}_{\text{minimize this}} \end{equation}
The left-hand side, $\log p_\theta(x)$, is fixed for a given data point $x$ and model parameter $\theta$. If we maximize the ELBO with respect to $\phi$ (the variational parameters), we are essentially pushing the lower bound up. Since the total sum is fixed, maximizing the ELBO must minimize the KL divergence.
Thus, we have successfully replaced the impossible problem (minimize KL with unknown target) with a possible one (maximize ELBO).
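The argument can be watched in action on a toy model where everything is available in closed form. The sketch below assumes $z \sim \mathcal{N}(0, 1)$, $x|z \sim \mathcal{N}(z, 1)$ and a Gaussian $q_\phi(z|x) = \mathcal{N}(m, s^2)$ with $\phi = (m, s)$; these choices, the hand-derived gradients, and the learning rate are all illustrative assumptions. It runs gradient ascent on the ELBO and records the KL divergence to the true posterior $\mathcal{N}(x/2, 1/2)$ along the way.

```python
import math

LOG_2PI = math.log(2.0 * math.pi)

def elbo(x, m, s):
    """Closed-form ELBO for z ~ N(0,1), x|z ~ N(z,1), q(z|x) = N(m, s^2)."""
    expected_log_lik = -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s ** 2)
    expected_log_prior = -0.5 * LOG_2PI - 0.5 * (m ** 2 + s ** 2)
    entropy = 0.5 * (LOG_2PI + 1.0) + math.log(s)
    return expected_log_lik + expected_log_prior + entropy

def kl_to_posterior(x, m, s):
    """D_KL( N(m, s^2) || N(x/2, 1/2) ), the true posterior of the toy model."""
    post_mean, post_var = x / 2.0, 0.5
    return (0.5 * math.log(post_var) - math.log(s)
            + (s ** 2 + (m - post_mean) ** 2) / (2.0 * post_var) - 0.5)

x = 2.0
log_evidence = -0.5 * (LOG_2PI + math.log(2.0)) - x ** 2 / 4.0   # log N(x; 0, 2)

m, s, lr = 0.0, 1.0, 0.1   # start from q = prior
history = []
for _ in range(200):
    history.append((elbo(x, m, s), kl_to_posterior(x, m, s)))
    m += lr * (x - 2.0 * m)          # d ELBO / d m
    s += lr * (1.0 / s - 2.0 * s)    # d ELBO / d s
history.append((elbo(x, m, s), kl_to_posterior(x, m, s)))

print("first:", history[0], "last:", history[-1], "log p(x):", log_evidence)
```

At every step the ELBO and the KL divergence add up to the same constant $\log p(x)$. As the ELBO climbs, the KL divergence shrinks toward zero and $(m, s)$ converges to the true posterior's $(1, \sqrt{1/2})$, even though the code that updates $\phi$ never touches the posterior.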
Analyzing the Lower Bound: The Trade-off
To understand what we are actually teaching our model to do, let’s rewrite the ELBO in a more intuitive form.
We start with the definition derived above:
\begin{equation} \mathcal{L} = \mathbb{E}_{q_\phi} [\log p_\theta(x, z) - \log q_\phi(z|x)] \end{equation}
Recall that $p_\theta(x, z) = p_\theta(x|z)p(z)$. Expanding the log term:
\begin{equation} \mathcal{L} = \mathbb{E}_{q_\phi} [\log p_\theta(x|z) + \log p(z) - \log q_\phi(z|x)] \end{equation}
(Note: we use $p(z)$ instead of $p_\theta(z)$ for the prior as it typically has no learnable parameters)
Grouping the terms involving $z$:
\begin{equation} \mathcal{L} = \mathbb{E}_{q_\phi} [\log p_\theta(x|z)] - \mathbb{E}_{q_\phi} [\log q_\phi(z|x) - \log p(z)] \end{equation}
The second part is exactly the definition of KL divergence between the approximate posterior $q_\phi(z|x)$ and the prior $p(z)$. This gives us the standard VAE objective form:
\begin{equation} \mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)]}_{\text{Reconstruction}} - \underbrace{D_{KL}(q_\phi(z|x) || p(z))}_{\text{Regularization}} \end{equation}
So the ELBO forces the model to balance two competing objectives.
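In practice the two terms are often handled differently: the reconstruction term is estimated by sampling, while the KL term to a Gaussian prior is available in closed form. The sketch below checks, on an assumed toy model ($z \sim \mathcal{N}(0, 1)$, $x|z \sim \mathcal{N}(z, 1)$, $q_\phi(z|x) = \mathcal{N}(m, s^2)$ with arbitrary $m$ and $s$), that this "reconstruction minus KL" estimate agrees with the joint form $\mathbb{E}_{q_\phi}[\log p_\theta(x, z) - \log q_\phi(z|x)]$ computed exactly. The function names are hypothetical.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def kl_to_prior(m, s):
    """Closed-form D_KL( N(m, s^2) || N(0, 1) )."""
    return 0.5 * (s ** 2 + m ** 2 - 1.0) - math.log(s)

def reconstruction_mc(x, m, s, n_samples, rng):
    """Monte Carlo estimate of E_q[log p(x|z)] with x|z ~ N(z, 1)."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(m, s)
        total += -0.5 * (LOG_2PI + (x - z) ** 2)
    return total / n_samples

def elbo_joint_form(x, m, s):
    """Exact E_q[log p(x, z) - log q(z|x)] for the same toy model."""
    expected_log_lik = -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s ** 2)
    expected_log_prior = -0.5 * LOG_2PI - 0.5 * (m ** 2 + s ** 2)
    entropy = 0.5 * (LOG_2PI + 1.0) + math.log(s)
    return expected_log_lik + expected_log_prior + entropy

rng = random.Random(0)
x, m, s = 2.0, 0.5, 0.8   # arbitrary illustrative values
elbo_vae_form = reconstruction_mc(x, m, s, 20000, rng) - kl_to_prior(m, s)
elbo_reference = elbo_joint_form(x, m, s)

print(elbo_vae_form, elbo_reference)
```

The two numbers agree up to Monte Carlo noise, as the algebra above says they must.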
The Reconstruction Term
$\mathbb{E}_{q}[\log p_\theta(x|z)]$
This term measures how well the decoder can reconstruct the input $x$ given a sample $z$ from the encoder.
- It encourages the encoder $q_\phi$ to pick values of $z$ that are “informative” about $x$.
- It encourages the decoder $p_\theta$ to assign high probability to the true data.
- If we only optimized this, $q_\phi(z|x)$ would collapse to a point mass (a delta function) exactly at the $z$ that best reconstructs $x$. This is essentially a standard autoencoder, which is prone to overfitting and learns a disjoint latent space.
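As a concrete case, suppose (purely as an illustrative assumption, nothing in the derivation requires it) that the decoder is Gaussian with identity covariance, $p_\theta(x|z) = \mathcal{N}(x; f_\theta(z), I)$, where $f_\theta(z)$ is a hypothetical name for the decoder network's output and $D$ is the dimension of $x$. Then the log-likelihood is a squared error up to a constant:
\begin{equation} \log p_\theta(x|z) = -\frac{1}{2} \lVert x - f_\theta(z) \rVert^2 - \frac{D}{2} \log(2\pi) \end{equation}
Under this assumption, maximizing the reconstruction term is the same as minimizing the expected squared reconstruction error, which is why this term looks so much like a standard autoencoder loss.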
The Regularization Term
$-D_{KL}(q_\phi(z|x) || p(z))$
This term measures how far our approximate posterior is from the prior $p(z)$ (usually a unit Gaussian $\mathcal{N}(0, I)$).
- It acts as a regularizer, forcing the distribution of latent codes to look like the prior.
- It prevents the encoder from “cheating” by mapping inputs to disjoint, far-away points in the latent space.
- It ensures the latent space remains smooth and continuous, which is vital for generation.
In other words: describe the data $x$ as well as possible (Reconstruction), but don’t deviate too far from the prior belief about the world (Regularization).
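For the common case of a diagonal-Gaussian approximate posterior and a unit-Gaussian prior, the regularization term has a well-known closed form. The sketch below assumes the encoder outputs a mean vector and a log-variance vector (a widespread convention, but an assumption here); `kl_to_standard_normal` is a made-up name.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m ** 2 - 1.0 - lv for m, lv in zip(mu, log_var))

kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])          # q equals the prior
kl_shifted = kl_to_standard_normal([3.0, -3.0], [0.0, 0.0])          # code far from the origin
kl_narrow = kl_to_standard_normal([0.0, 0.0], [math.log(1e-4)] * 2)  # nearly a point mass

print(kl_at_prior, kl_shifted, kl_narrow)
```

The penalty is zero when $q_\phi$ matches the prior, and it grows both when the encoder pushes a code far from the origin and when it shrinks the variance toward a point mass. Those are exactly the two "cheating" behaviors described above.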
Geometric Intuition
Think of the log-likelihood as a ceiling and the ELBO as a floor. The gap between them is the KL divergence. When we maximize the ELBO, we’re pushing the floor up toward the ceiling.
More concretely: when we optimize $\phi$, the floor (ELBO) rises. Since the ceiling is fixed for a given $\theta$, raising the floor reduces the gap (minimizes KL). Once the approximation is tight, we can then adjust $\theta$ to raise the ceiling even higher.
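This alternating maximization over $q_\phi$ and $p_\theta$ mirrors the E-step and M-step of the classical Expectation-Maximization (EM) algorithm. There are two differences: our E-step is only approximate, because $q_\phi$ is restricted to the variational family $\mathcal{Q}$, and in practice both sets of parameters are usually updated together with stochastic gradients rather than in strictly alternating exact steps.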
Summary
We have crossed the bridge from intractable integrals to tractable optimization.
- The Problem: We cannot calculate the evidence $p_\theta(x)$ or the posterior $p_\theta(z|x)$ because of the integral over $z$.
- The Solution: We introduce a surrogate posterior $q_\phi(z|x)$ and minimize its KL divergence from the true posterior.
- The Method: We minimize Reverse KL because it is computationally tractable (expectation over $q$).
- The Tool: We derived the ELBO (via Bayes’ Decomposition or Jensen’s Inequality), a computable lower bound on the evidence.
- The Interpretation: Maximizing the ELBO balances reconstruction accuracy (fitting the data) against regularization (staying close to the prior).
This final expression — reconstruction minus KL regularization — is exactly the training loss used in Variational Autoencoders.
We now have a valid objective function. However, there is one final hurdle. The ELBO involves an expectation $\mathbb{E}_{q_\phi(z|x)}[\dots]$. To train this with gradient descent, we need to differentiate through the sampling process of $z$. You cannot simply ask PyTorch or TensorFlow to compute the gradient of a random coin flip.