Where the VAE Idea Went

The Variational Autoencoder was proposed by Kingma & Welling in 2014, which is now more than a decade ago. In machine learning time, that’s ancient. And yet the scaffolding we built in this series — encoder, decoder, ELBO, reparameterization — has become one of the most influential ideas in modern generative modeling.

Very few people train vanilla VAEs for image generation today. That job has been taken over by diffusion models and, increasingly, flow matching. But look under the hood of any state-of-the-art system, and you will find VAE ideas threaded through: a tokenizer that encodes images into discrete codes; a continuous latent space that diffusion models actually operate in; a probabilistic framing of “encoder outputs a distribution, not a point.”

This post is a survey of VAE variants (no heavy derivations, just the main ideas) and a look at what the probabilistic perspective gave us that is still useful today.


VQ-VAE: Discrete Latents

Proposed by van den Oord et al. (2017), VQ-VAE replaces the continuous Gaussian posterior with a discrete codebook. Instead of $q_\phi(z|x)$ being a Gaussian, the encoder outputs a vector $z_e(x)$ that gets quantized to its nearest entry in a learned codebook $\{e_k\}_{k=1}^K$:

\begin{equation} z_q(x) = e_{k^*}, \quad k^* = \arg\min_k \| z_e(x) - e_k \|_2 \end{equation}

The decoder then reconstructs $x$ from the discrete code $z_q(x)$.
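
The nearest-neighbour lookup above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming Euclidean distance and a hand-picked toy codebook; the function name `quantize` and the array shapes are my own choices, not taken from the paper’s code.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder vector to its nearest codebook entry.

    z_e:      (N, D) array of encoder outputs z_e(x)
    codebook: (K, D) array of codebook vectors e_k
    Returns (z_q, k_star): quantized vectors (N, D) and chosen indices (N,).
    """
    # Squared Euclidean distance between every z_e and every e_k: shape (N, K)
    diffs = z_e[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    k_star = np.argmin(dists, axis=1)   # k* = argmin_k ||z_e - e_k||
    z_q = codebook[k_star]              # z_q = e_{k*}
    return z_q, k_star

# Toy example: K = 3 codes in D = 2 dimensions (values chosen by hand)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.9, 1.2], [-0.2, 0.1], [-0.8, 1.7]])
z_q, k_star = quantize(z_e, codebook)
print(k_star)  # [1 0 2]
```

The integer array `k_star` is the “token” view of the input: it is what an autoregressive prior would be trained on.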

This looks strange at first — how do we backpropagate through the arg-min? The answer is the straight-through estimator: in the forward pass, quantize; in the backward pass, pretend quantization is the identity. The decoder’s gradients flow directly back to the encoder.
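
In autodiff frameworks the straight-through estimator is commonly written with a stop-gradient. A sketch, assuming $\text{sg}[\cdot]$ acts as the identity in the forward pass and has zero gradient in the backward pass:

\begin{equation} z_q(x) = z_e(x) + \text{sg}\big[e_{k^*} - z_e(x)\big] \end{equation}

The forward value is $e_{k^*}$, while $\partial z_q / \partial z_e = I$, so whatever gradient the decoder sends to $z_q(x)$ reaches the encoder unchanged.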

The loss (to minimize) also looks different from a standard VAE:

\begin{equation} \mathcal{L} = \underbrace{-\log p_\theta(x | z_q(x))}_{\text{reconstruction}} + \underbrace{\| \text{sg}[z_e(x)] - e_{k^*} \|_2^2}_{\text{codebook}} + \beta \underbrace{\| z_e(x) - \text{sg}[e_{k^*}] \|_2^2}_{\text{commitment}} \end{equation}

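where $\text{sg}[\cdot]$ is the stop-gradient operator. The KL term as we knew it is gone: with a uniform prior over the $K$ codes and a deterministic (one-hot) posterior, it equals the constant $\log K$ and drops out of the objective. The codebook and commitment terms take over the job of keeping encoder outputs and codebook entries close to each other.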
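
To make the three terms concrete, here is a hedged NumPy sketch that computes their forward values for a toy batch. Assumptions that are mine and not from the text: a Gaussian decoder with fixed variance (so the reconstruction term reduces to a squared error up to constants), `beta=0.25` as a placeholder value, and the helper name `vq_vae_loss_terms`. NumPy has no autodiff, so the stop-gradients appear only as comments.

```python
import numpy as np

def vq_vae_loss_terms(x, x_recon, z_e, codebook, beta=0.25):
    """Forward values of the VQ-VAE loss terms, averaged over the batch.

    sg[.] changes which parameters receive gradients, not the forward value,
    so the codebook and commitment terms are numerically identical here.
    """
    dists = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    e_k = codebook[np.argmin(dists, axis=1)]   # e_{k*} for each row

    # Assumed Gaussian decoder: -log p(x | z_q) is a squared error up to constants
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=-1))
    # ||sg[z_e] - e_k*||^2 : with autodiff, this would update the codebook only
    codebook_term = np.mean(np.sum((z_e - e_k) ** 2, axis=-1))
    # ||z_e - sg[e_k*]||^2 : with autodiff, this would update the encoder only
    commitment = np.mean(np.sum((z_e - e_k) ** 2, axis=-1))

    total = recon + codebook_term + beta * commitment
    return total, recon, codebook_term, commitment

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z_e = np.array([[0.5, 0.0], [1.0, 0.5]])
x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_recon = np.array([[1.0, 2.0], [3.0, 5.0]])
total, recon, codebook_term, commitment = vq_vae_loss_terms(x, x_recon, z_e, codebook)
print(total)  # 0.8125
```

In a real training loop the two squared-distance terms differ only in where the gradient is allowed to flow, which is exactly what the stop-gradient encodes.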

Why it mattered. Discrete latents turn images (and audio, and video) into sequences of tokens, which can then be modeled by powerful autoregressive priors: exactly the transformer-based machinery we use for language. VQ-VAE and its successor VQ-VAE-2 were the bridge that let images enter the LLM era. Most discrete image tokenizers you’ve heard of, from DALL·E’s discrete VAE to the tokenizers inside modern multi-modal models, trace back to this idea.


Hierarchical VAEs: Depth as Capacity

A flat VAE with a single latent $z$ has limited capacity to model complex images. Hierarchical VAEs stack multiple layers of latents $z_1, z_2, \dots, z_L$ in a chain:

\begin{equation} p(x, z_{1:L}) = p(x | z_1) \, p(z_1 | z_2) \cdots p(z_{L-1} | z_L) \, p(z_L) \end{equation}

Each level captures structure at a different scale — top levels encode coarse composition, bottom levels handle fine detail. The ELBO generalizes naturally to a sum of KL terms, one per level.
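
Written out, the bound looks as follows. This is a sketch under one common assumption, namely that the approximate posterior factorizes top-down as $q_\phi(z_{1:L}|x) = q_\phi(z_L|x) \prod_{l=1}^{L-1} q_\phi(z_l | z_{l+1}, x)$; other factorizations give a similar but not identical expression.

\begin{equation} \text{ELBO}(x) = \mathbb{E}_{q_\phi(z_1|x)}\big[\log p(x | z_1)\big] - \text{KL}\big(q_\phi(z_L|x) \,\|\, p(z_L)\big) - \sum_{l=1}^{L-1} \mathbb{E}_{q_\phi(z_{l+1}|x)}\Big[\text{KL}\big(q_\phi(z_l | z_{l+1}, x) \,\|\, p(z_l | z_{l+1})\big)\Big] \end{equation}

Here $q_\phi(z_1|x)$ and $q_\phi(z_{l+1}|x)$ denote marginals of $q_\phi(z_{1:L}|x)$. Each level contributes its own KL between the posterior and prior conditionals, averaged over the level above.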

The obvious problem is that deeper hierarchies are harder to train. Upper latents tend to collapse (the decoder ignores them), leaving all the work to the bottom. The important follow-ups:

  • Ladder VAE (Sønderby et al., 2016). A specific parameterization of the approximate posterior that shares information between bottom-up and top-down paths.
  • IAF-VAE (Kingma et al., 2016). Replaces the Gaussian posterior with an Inverse Autoregressive Flow, giving $q_\phi(z|x)$ the flexibility to capture correlations a diagonal Gaussian cannot.
  • NVAE (Vahdat & Kautz, 2020). A deep, carefully engineered hierarchical VAE with residual cells, swish activations, spectral regularization, and batch normalization tuning — the first hierarchical VAE to produce genuinely competitive image samples. It showed that hierarchical VAEs can scale, if you’re willing to do the engineering.

The lesson from hierarchical VAEs generalizes beyond VAEs: generative models benefit from multi-scale structure, and the posterior approximation has to be rich enough to express the dependencies you care about.


Adversarial Hybrids

VAEs tend to produce blurry samples: a Gaussian decoder likelihood, combined with a mean-field posterior, averages over modes. GANs produce sharp samples, but offer no likelihood, no encoder, and notoriously unstable training.

Several papers asked the obvious question: what if we combine them?

  • VAE-GAN (Larsen et al., 2016). Replaces the pixel-wise MSE/BCE reconstruction loss with a similarity measured in the feature space of a GAN discriminator. The decoder now has to produce samples that a discriminator finds indistinguishable from real data.
  • IntroVAE (Huang et al., 2018). Turns the encoder itself into the discriminator, creating an adversarial game without a separate GAN head.
  • BiGAN (Donahue et al., 2017) / ALI (Dumoulin et al., 2017). GANs with an encoder baked in, trained adversarially on joint samples $(x, z)$.

Sometimes these hybrids get the best of both, sometimes the worst of both. They are important because they formalized a question that stayed relevant: the ELBO and the adversarial objective are measuring different things — when should we prefer one over the other?


Flow-Based Posteriors and Decoders

One strand of VAE research pushed in a direction that eventually became its own field: normalizing flows.

A normalizing flow is a bijective, differentiable transformation $f$ that maps a simple distribution (a Gaussian) to a complex one. If $z_0 \sim \mathcal{N}(0, I)$ and $z_K = f_K \circ \cdots \circ f_1(z_0)$, then the density of $z_K$ can be computed via the change-of-variables formula — which means we get a tractable, expressive, exact likelihood.
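
A tiny numerical check of the change-of-variables formula, $\log p(z_K) = \log \mathcal{N}(z_0; 0, I) - \sum_i \log |\det \partial f_i / \partial z|$, may help. The sketch below assumes the simplest possible flow, a chain of scalar affine maps $f_i(z) = a_i z + b_i$, so that the result can be compared against a Gaussian in closed form. The function names and the numbers are illustrative only.

```python
import numpy as np

def standard_normal_logpdf(z):
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

def affine_flow_logpdf(z_K, scales, shifts):
    """log-density of z_K = f_K(...f_1(z_0)), with f_i(z) = scales[i] * z + shifts[i].

    Change of variables: log p(z_K) = log N(z_0; 0, 1) - sum_i log|df_i/dz|.
    """
    z = np.asarray(z_K, dtype=float)
    log_det = 0.0
    # Invert the layers in reverse order to recover z_0
    for a, b in zip(reversed(scales), reversed(shifts)):
        z = (z - b) / a
        log_det += np.log(abs(a))
    return standard_normal_logpdf(z) - log_det

scales, shifts = [2.0, 1.5], [1.0, -3.0]
z_K = np.array([-3.0, -1.5, 0.0])
log_p = affine_flow_logpdf(z_K, scales, shifts)

# The two affine maps compose into z_K = 3 * z_0 - 1.5, i.e. z_K ~ N(-1.5, 3^2)
reference = standard_normal_logpdf((z_K + 1.5) / 3.0) - np.log(3.0)
print(np.allclose(log_p, reference))  # True
```

Real flows replace the affine maps with learned invertible layers whose Jacobian determinants are cheap to compute; the bookkeeping stays the same.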

Applied inside a VAE, a flow can play two roles:

  1. As a posterior: replace the mean-field Gaussian $q_\phi(z|x)$ with the distribution of $z = f_K \circ \cdots \circ f_1 (\epsilon; x)$, where $\epsilon \sim \mathcal{N}(0, I)$, making the approximate posterior far more expressive (IAF-VAE, above).
  2. As a prior or decoder: replace $p(z)$ or $p(x|z)$ with flow-based densities. The ELBO still holds, but each piece is much more flexible.

Flows later grew into a standalone family of models (RealNVP, Glow, FFJORD). The VAE was one of the first settings where their expressive power was useful.


The Information Bottleneck Lens

You may wonder why all these variants look so similar to each other. I think a common theme runs through all of them, and it has a name: the Information Bottleneck (IB) principle (Tishby et al., 1999).

The IB principle says: a good representation $z$ of input $x$ for predicting some target $y$ should maximize:

\begin{equation} \mathcal{L}_{\text{IB}} = I(z; y) - \beta \cdot I(z; x) \end{equation}

That is, $z$ should contain as much information about $y$ as possible, and as little about $x$ as possible, balanced by a trade-off parameter β.

Replace $y$ with “the data reconstruction task” and you get β-VAE. Replace $I(z; x)$ with the KL to a prior and you get a tractable upper bound. Every VAE-style model we have discussed — vanilla, conditional, β, VQ, hierarchical — is a different point in an IB design space, parameterized by how $I(z; x)$ is approximated and what “task” the latent is serving.
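
The upper bound can be made explicit. A short derivation, where $p_{\text{data}}(x)$ is the data distribution and $q_\phi(z) = \mathbb{E}_{p_{\text{data}}(x)}[q_\phi(z|x)]$ is the aggregate posterior (both symbols are introduced here for illustration):

\begin{equation} I(z; x) = \mathbb{E}_{p_{\text{data}}(x)}\big[\text{KL}(q_\phi(z|x) \,\|\, q_\phi(z))\big] = \mathbb{E}_{p_{\text{data}}(x)}\big[\text{KL}(q_\phi(z|x) \,\|\, p(z))\big] - \text{KL}(q_\phi(z) \,\|\, p(z)) \le \mathbb{E}_{p_{\text{data}}(x)}\big[\text{KL}(q_\phi(z|x) \,\|\, p(z))\big] \end{equation}

The gap is $\text{KL}(q_\phi(z) \,\|\, p(z)) \ge 0$, so penalizing the per-example KL, as the VAE does, penalizes an upper bound on $I(z; x)$.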

My own reading is that the VAE was the first clean implementation of an information-theoretic idea that had been floating around for decades.


Where VAE Ideas Live Today

A short, incomplete catalog of where you still find the VAE’s fingerprints in 2026:

  • Latent Diffusion Models. Stable Diffusion and its descendants train a diffusion model not on pixels, but in the latent space of a pretrained VAE. The VAE compresses images into a compact latent where diffusion is dramatically cheaper. The generative heavy lifting has moved to diffusion; the representation is still the VAE’s.
  • VQ tokenizers for multi-modal LLMs. Images, audio, and video are tokenized by VQ-VAE-style encoders before being fed to transformers. The codebook lives on.
  • World models. Systems that predict future states in an environment use continuous latent spaces whose structure is directly inherited from the VAE literature. The encoder-is-a-distribution framing is essential for capturing uncertainty in prediction.
  • Probabilistic programming. Amortized variational inference — the central computational idea of the VAE — is now a standard tool in libraries like Pyro and NumPyro, used for Bayesian models far removed from images.
  • Scientific machine learning. In chemistry, genomics, and neuroscience, VAE-style models are used precisely because the latent space gives you a distribution over explanations, not a single point estimate. Uncertainty is the product, not a byproduct.


What the VAE Really Taught Us

If I had to distill the lasting contribution of the VAE, it would be three interlocking ideas.

First, encoders should return distributions, not points. A deterministic embedding is a commitment to a single interpretation of the input. A distributional embedding keeps the model honest about uncertainty, and gives downstream samplers, generators, and decision-makers something meaningful to sample from.

Second, intractable integrals can be traded for tractable optimization. The ELBO is the archetype of this move: take an integral you can’t compute, write it as the sum of something you can maximize and something non-negative, then maximize what you can. This pattern now appears across variational inference, PAC-Bayes bounds, amortized Bayes, and diffusion model training.
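
The decomposition is easy to verify numerically on a model small enough that the “intractable” quantity is just a three-term sum. The sketch below uses a made-up discrete latent with three states and one fixed observation; all probabilities are arbitrary illustrative numbers.

```python
import numpy as np

# Toy model: discrete latent z in {0, 1, 2} and one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])          # p(z)
likelihood = np.array([0.10, 0.60, 0.30])  # p(x | z) at the observed x
q = np.array([0.2, 0.5, 0.3])              # any approximate posterior q(z | x)

joint = prior * likelihood                 # p(x, z)
log_evidence = np.log(joint.sum())         # log p(x), tractable only because z is tiny
posterior = joint / joint.sum()            # exact p(z | x)

elbo = np.sum(q * (np.log(joint) - np.log(q)))    # E_q[log p(x, z) - log q(z | x)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))  # KL(q || p(z | x)), non-negative

print(np.isclose(log_evidence, elbo + kl))  # True
print(elbo <= log_evidence)                 # True
```

Changing `q` moves mass between `elbo` and `kl` but never changes their sum, which is why maximizing the ELBO over $q$ tightens the bound.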

Third, the reparameterization trick changed the default in machine learning. Before the VAE, stochastic models used custom inference algorithms per model family. After the VAE, the default became: write your model, rewrite sampling as a deterministic function of noise, call backward(). Virtually every modern probabilistic deep learning system inherits this pattern.
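
As a reminder of what “rewrite sampling as a deterministic function of noise” buys us, here is a minimal NumPy sketch. It assumes a scalar Gaussian $z \sim \mathcal{N}(\mu, \sigma^2)$ and the toy objective $\mathbb{E}[z^2]$, chosen because its gradients are known in closed form. NumPy has no autodiff, so the chain rule that backward() would apply is written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(200_000)   # noise, independent of mu and sigma

# Reparameterized sample: a deterministic function of (mu, sigma, eps)
z = mu + sigma * eps

# Objective: E[f(z)] with f(z) = z^2, so f'(z) = 2z.
# Chain rule through z = mu + sigma * eps:
grad_mu = np.mean(2.0 * z * 1.0)     # dz/dmu = 1
grad_sigma = np.mean(2.0 * z * eps)  # dz/dsigma = eps

# Closed form for comparison: E[z^2] = mu^2 + sigma^2
print(grad_mu, 2 * mu)        # Monte Carlo estimate vs. exact 3.0
print(grad_sigma, 2 * sigma)  # Monte Carlo estimate vs. exact 1.6
```

Because the randomness lives entirely in `eps`, the gradient passes through the sample itself, and the same recipe works for any differentiable objective.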

The VAE stopped being the state-of-the-art generator years ago. But it never stopped being the template for how we write probabilistic models in deep learning.


Closing the Series

We started this series by asking a simple question: how do we model data that seems to be generated by hidden causes? That led us through Latent Variable Models, Variational Inference, the ELBO, the reparameterization trick, conditional generation, β-regularization, and finally this survey of descendants.

Along the way we did a lot of math. The math was never the point. The point was a particular way of looking at problems: data is a shadow of something structured; inference is optimization in disguise; uncertainty is a first-class citizen; and the right objective function is more valuable than any specific architecture.

I only introduced the basic concepts here — each of these variants could be its own series. But hopefully this gives you enough to follow the literature when you want to go deeper.