The Unconditional Problem

In the previous post, we assembled a working VAE — encoder, decoder, reparameterization, ELBO. If you train it on MNIST, you get a model that can generate handwritten digits by sampling $z \sim \mathcal{N}(0, I)$ and decoding.

That’s cool and all, but also kind of annoying.

We have no way to say “draw me a 7.” The VAE generates something that looks like a digit, but we have no lever to control which digit. The generative process is:

\begin{equation} z \sim p(z), \quad x \sim p_\theta(x|z) \end{equation}

There is simply no place for our intent to enter the equation.

In most real-world applications, we don’t want a model that generates data blindly. We want to generate data conditioned on something — a class label, a text prompt, a source image, a musical key. So the question this post answers is: how do we add a condition $y$ to the VAE without breaking anything we already built?

Turns out the solution is actually very simple.


From $p(x)$ to $p(x|y)$

Alongside our data $x$, assume we also have a condition $y$. This $y$ could be many things:

  • A discrete class label (y = 7).
  • A continuous attribute (y = brightness value).
  • Another modality (y = a text embedding).
  • A structured object (y = a segmentation mask).

We are no longer modeling the unconditional distribution $p(x)$. We are modeling the conditional distribution $p(x|y)$.

Following the same latent variable recipe from before, we introduce a latent $z$ and write:

\begin{equation} p_\theta(x|y) = \int p_\theta(x|z, y)\,p(z|y)\,dz \end{equation}

Three things to notice:

  1. The likelihood $p_\theta(x|z, y)$ now depends on both $z$ and $y$. The decoder sees the condition.
  2. The prior $p(z|y)$ may also depend on $y$. (Often we just set $p(z|y) = p(z) = \mathcal{N}(0, I)$.)
  3. Everything else stays the same.

And just like before, this integral is intractable for nonlinear decoders. So we do what we already know: introduce an approximate posterior and derive an ELBO.


Deriving the Conditional ELBO

We introduce a variational posterior $q_\phi(z|x, y)$ — the encoder, which now also takes $y$ as input — and repeat the familiar derivation.

Starting from the log conditional likelihood:

\begin{equation} \log p_\theta(x|y) = \log \int p_\theta(x|z, y)\,p(z|y)\,dz \end{equation}

Multiply and divide inside the integral by $q_\phi(z|x, y)$, then apply Jensen’s inequality (or repeat the Bayes-decomposition argument from the VI post). The result is the Conditional ELBO:

\begin{equation} \log p_\theta(x|y) \ge \mathbb{E}_{q_\phi(z|x,y)} [\log p_\theta(x|z, y)] - D_{KL}(q_\phi(z|x, y) \,\|\, p(z|y)) \end{equation}
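
For reference, the multiply-and-divide step and the Jensen step can be written out explicitly. This is only a sketch of the standard argument, using the same symbols as above:

\begin{equation}
\begin{aligned}
\log p_\theta(x|y) &= \log \int q_\phi(z|x, y)\,\frac{p_\theta(x|z, y)\,p(z|y)}{q_\phi(z|x, y)}\,dz \\
&= \log \mathbb{E}_{q_\phi(z|x,y)}\left[\frac{p_\theta(x|z, y)\,p(z|y)}{q_\phi(z|x, y)}\right] \\
&\ge \mathbb{E}_{q_\phi(z|x,y)}\left[\log p_\theta(x|z, y) + \log \frac{p(z|y)}{q_\phi(z|x, y)}\right] \\
&= \mathbb{E}_{q_\phi(z|x,y)}[\log p_\theta(x|z, y)] - D_{KL}(q_\phi(z|x, y) \,\|\, p(z|y))
\end{aligned}
\end{equation}

The inequality in the third line is Jensen's inequality applied to the concave $\log$; the last line just splits the expectation and recognizes the KL divergence.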

Compare this to the standard VAE ELBO. The shape is identical:

\begin{equation} \mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z)) \end{equation}

\begin{equation} \mathcal{L}_{\text{CVAE}} = \mathbb{E}_{q_\phi(z|x,y)}[\log p_\theta(x|z, y)] - D_{KL}(q_\phi(z|x, y) \,\|\, p(z|y)) \end{equation}

Every distribution that originally depended on $x$ now depends on both $x$ and $y$. That’s the whole change. The reparameterization trick, the Monte Carlo estimation, the closed-form KL for Gaussians — all of it carries over unchanged.

CVAE is basically just the same VAE but with $y$ plugged in everywhere.


Where Does $y$ Actually Go?

On paper, “the encoder and decoder both see $y$” is easy to say. In code, you have to decide how $y$ enters the networks. A few common patterns:

Concatenation. The simplest option: turn $y$ into a feature vector (one-hot for classes, embedding for text, etc.) and concatenate it with the input.

# Encoder
h = encoder_net(torch.cat([x, y], dim=-1))
mu, log_var = h.chunk(2, dim=-1)

# Decoder
x_hat = decoder_net(torch.cat([z, y], dim=-1))

Embedding + addition. For class labels, learn an embedding $e(y) \in \mathbb{R}^d$, with $d$ equal to the latent dimension, and add it to $z$ before decoding. This shifts the latent code by a learnable offset per class.
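
As a rough illustration of the embedding-plus-addition pattern, here is a minimal sketch. The class name `AddConditionedDecoder`, the layer sizes, and the MLP layout are all assumptions made for the example, not something prescribed by the CVAE itself.

```python
import torch
import torch.nn as nn


class AddConditionedDecoder(nn.Module):
    """Hypothetical decoder that adds a learned class embedding e(y) to z."""

    def __init__(self, num_classes=10, latent_dim=16, hidden_dim=128, out_dim=784):
        super().__init__()
        # e(y) has the same size as z so that the two can be summed.
        self.embed = nn.Embedding(num_classes, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
            nn.Sigmoid(),  # per-pixel Bernoulli means in [0, 1]
        )

    def forward(self, z, y):
        # z: (B, latent_dim); y: (B,) integer class indices
        return self.net(z + self.embed(y))


decoder = AddConditionedDecoder()
z = torch.randn(4, 16)
y = torch.tensor([0, 3, 7, 9])
x_hat = decoder(z, y)  # shape (4, 784)
```

Compared with concatenation, this keeps the decoder input size fixed, at the cost of forcing the condition to act as a pure shift in latent space.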

Conditional normalization. For image data, Conditional BatchNorm or FiLM layers modulate activations using $y$. Each class (or attribute vector) gets its own affine transform applied to feature maps.
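
A FiLM-style layer might look roughly like the sketch below. The class name `FiLM`, the single linear projection, and the `1 + gamma` parameterization are assumptions chosen for brevity; real implementations vary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FiLM(nn.Module):
    """Feature-wise linear modulation: scale and shift each channel using y."""

    def __init__(self, cond_dim, num_channels):
        super().__init__()
        # One linear layer predicts a scale and a shift per channel.
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feats, y):
        # feats: (B, C, H, W) feature maps; y: (B, cond_dim) condition vector
        gamma, beta = self.to_scale_shift(y).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]  # broadcast over H and W
        beta = beta[:, :, None, None]
        # Using (1 + gamma) keeps the layer close to identity when gamma is
        # small (an assumption of this sketch, not a requirement).
        return (1.0 + gamma) * feats + beta


film = FiLM(cond_dim=10, num_channels=32)
feats = torch.randn(4, 32, 7, 7)
y = F.one_hot(torch.tensor([0, 3, 7, 9]), num_classes=10).float()
out = film(feats, y)  # same shape as feats
```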

Cross-attention. For rich conditions (text, images), let $y$ be a sequence of tokens and attend to it at every decoder block. This is the pattern inherited by modern latent-diffusion models.
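
One way to sketch a cross-attention block is with PyTorch's built-in `nn.MultiheadAttention`. The block name, the pre-norm residual layout, and the dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Hypothetical decoder block that attends from features to condition tokens."""

    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, h, y_tokens):
        # h: (B, N, dim) decoder features; y_tokens: (B, T, dim) condition tokens
        attended, _ = self.attn(query=self.norm(h), key=y_tokens, value=y_tokens)
        return h + attended  # residual connection


block = CrossAttentionBlock()
h = torch.randn(2, 49, 64)        # e.g. a 7x7 feature map flattened to 49 positions
y_tokens = torch.randn(2, 5, 64)  # e.g. 5 embedded text tokens
out = block(h, y_tokens)          # shape (2, 49, 64)
```

Queries come from the decoder features and keys and values come from $y$, so each position decides for itself which part of the condition to read.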

None of these are mandated by the math. The ELBO only says $y$ must appear in the relevant conditional distributions. How the network reads $y$ is an architectural choice.


Is the Conditional Prior Worth It?

The math allows $p(z|y)$ to depend on $y$. Do we want it to?

The simple path: $p(z|y) = \mathcal{N}(0, I)$. We share a single unit-Gaussian prior across all conditions. Because the decoder already receives $y$, the KL term pushes $z$ toward encoding only what $y$ does not explain (stroke width, slant, writing style), rather than carving out a separate latent region per condition. This is what most CVAE implementations actually do, and it works fine for most cases.

The learnable path: $p_\psi(z|y) = \mathcal{N}(\mu_\psi(y), \sigma^2_\psi(y))$. A small network produces the prior mean and variance as functions of $y$. This gives each condition its own “region” of latent space, which can help when conditions are very different from one another.

The trade-off: a learnable conditional prior adds flexibility at the cost of extra parameters and some training instability. For MNIST-scale class conditioning, the fixed $\mathcal{N}(0, I)$ is more than enough. For complex conditions, a learnable prior starts to pay off.
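
With a learnable prior, the KL term is no longer measured against $\mathcal{N}(0, I)$ but against another diagonal Gaussian, which still has a closed form. The sketch below is one possible implementation; `ConditionalPrior`, `gaussian_kl`, and the layer sizes are names and values assumed for the example.

```python
import torch
import torch.nn as nn


class ConditionalPrior(nn.Module):
    """Hypothetical small network producing the prior mean and log-variance from y."""

    def __init__(self, cond_dim=10, latent_dim=16, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),
        )

    def forward(self, y):
        mu_p, log_var_p = self.net(y).chunk(2, dim=-1)
        return mu_p, log_var_p


def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal Gaussians, per example."""
    kl = 0.5 * (
        log_var_p - log_var_q
        + (log_var_q.exp() + (mu_q - mu_p).pow(2)) / log_var_p.exp()
        - 1.0
    )
    return kl.sum(dim=-1)


prior = ConditionalPrior()
y = torch.eye(10)[:3]                       # three one-hot conditions
mu_p, log_var_p = prior(y)
mu_q, log_var_q = torch.randn(3, 16), torch.randn(3, 16)  # stand-in encoder outputs
kl = gaussian_kl(mu_q, log_var_q, mu_p, log_var_p)        # shape (3,)
```

Setting `mu_p` and `log_var_p` to zero recovers the usual $\mathcal{N}(0, I)$ formula, so the fixed prior is a special case of this function.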

In my experience, it is best to start with the fixed prior. If it works, you don’t need the learnable one. If the KL term is perpetually large and reconstruction is struggling, then consider letting the prior move.


A Worked Example: Class-Conditional MNIST

To make this concrete, consider the simplest case: conditioning on a digit class $y \in \{0, 1, \dots, 9\}$.

Setup.

  • Represent $y$ as a one-hot vector $\mathbf{y} \in \{0, 1\}^{10}$.
  • Encoder: $q_\phi(z|x, y)$ takes $[x, \mathbf{y}]$ concatenated, returns $(\mu, \log \sigma^2)$.
  • Prior: $p(z|y) = \mathcal{N}(0, I)$ (shared).
  • Decoder: $p_\theta(x|z, y)$ takes $[z, \mathbf{y}]$ concatenated, returns per-pixel Bernoulli means.

Training pseudocode.

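import torch
import torch.nn.functional as F

# x: (B, 784) flattened images in [0, 1]; y: (B,) integer class labels
B = x.size(0)
y_onehot = F.one_hot(y, num_classes=10).float()

# Encoder q(z|x, y): concatenate the image and the label
mu, log_var = encoder(torch.cat([x, y_onehot], dim=-1))

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
std = torch.exp(0.5 * log_var)
z = mu + std * torch.randn_like(std)

# Decoder p(x|z, y): concatenate the latent and the label
x_hat = decoder(torch.cat([z, y_onehot], dim=-1))

# Negative ELBO, averaged over the batch
recon = F.binary_cross_entropy(x_hat, x, reduction="sum") / B
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / B

loss = recon + kl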

Generation. At inference, pick the class you want, sample $z$ from the prior, and decode:

y_wanted = F.one_hot(torch.tensor([7]), num_classes=10).float()
z = torch.randn(1, latent_dim)
generated = decoder(torch.cat([z, y_wanted], dim=-1))

Now the model generates a 7 because we told it to. A handful of lines of change from the unconditional VAE, and we have control.
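
To see the training step and the generation step together, here is a self-contained sketch that runs on random stand-in data instead of real MNIST. The `CVAE` class, the MLP sizes, and the `negative_elbo` helper are assumptions for the example. One deliberate swap: the decoder returns logits and the loss uses `binary_cross_entropy_with_logits`, which is numerically safer than applying a sigmoid and then `binary_cross_entropy`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CVAE(nn.Module):
    """Minimal MLP CVAE with a shared N(0, I) prior (illustrative sizes)."""

    def __init__(self, x_dim=784, num_classes=10, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.num_classes = num_classes
        self.latent_dim = latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + num_classes, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),  # mu and log_var
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + num_classes, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, x_dim),  # pixel logits
        )

    def forward(self, x, y):
        y_onehot = F.one_hot(y, self.num_classes).float()
        mu, log_var = self.encoder(torch.cat([x, y_onehot], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        logits = self.decoder(torch.cat([z, y_onehot], dim=-1))
        return logits, mu, log_var

    @torch.no_grad()
    def sample(self, y):
        # Pick the classes in y, draw z from the prior, decode to pixel means.
        y_onehot = F.one_hot(y, self.num_classes).float()
        z = torch.randn(y.shape[0], self.latent_dim)
        return torch.sigmoid(self.decoder(torch.cat([z, y_onehot], dim=-1)))


def negative_elbo(logits, x, mu, log_var):
    batch_size = x.shape[0]
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum") / batch_size
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / batch_size
    return recon + kl


torch.manual_seed(0)
model = CVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(32, 784)            # stand-in for a batch of flattened images
y = torch.randint(0, 10, (32,))    # stand-in integer labels

logits, mu, log_var = model(x, y)
loss = negative_elbo(logits, x, mu, log_var)
optimizer.zero_grad()
loss.backward()
optimizer.step()

sevens = model.sample(torch.full((5,), 7, dtype=torch.long))  # five "7" samples
```

With real data you would loop this step over a `DataLoader`; the untrained samples here are noise and only demonstrate the shapes and the call pattern.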


What the Condition Can Be

CVAE is more general than just class-conditional generation. Any variable $y$ that explains some structure in $x$ can be a condition. Some examples you’ll see in the wild:

  • Attribute-conditional. Condition on attribute vectors (age, hair color, expression) to generate faces with controlled properties.
  • Text-to-image. Condition on a text embedding. The decoder’s cross-attention reads the text and aligns pixels to it.
  • Image-to-image translation. Condition on a source image (e.g. edges, segmentation, low-res). This is the VAE formulation of pix2pix-style tasks.
  • Time-series forecasting. Condition on past observations, decode future ones. The encoder-decoder shape generalizes naturally.
  • Missing-data imputation. Treat the observed part of $x$ as $y$ and the unobserved part as $x$. The CVAE fills in what’s missing.

If these all look similar, it is because they all answer the same question: given $y$, what distribution over $x$ does the data imply?


Connection to Modern Conditional Generators

Most of the generative models in wide use today are conditional. Stable Diffusion, Imagen, Sora, audio LDMs: each of them learns $p(x|y)$, where $y$ is text, audio, video, or a multi-modal embedding.

The architectural tricks have evolved (cross-attention, classifier-free guidance, adapter modules), but the probabilistic skeleton is the one we just wrote down. The CVAE was one of the first clean expressions of this in the deep learning era: let the condition enter every piece of the model, and let the ELBO handle the rest.

If you understand why the CVAE ELBO looks the way it does, you basically already understand the setup of conditional score matching, conditional flow matching, and guided diffusion. They all inherit this structure.


A Practical Note: Posterior Collapse with Strong Conditions

One failure mode worth a warning. If $y$ contains almost as much information as $x$ (e.g. $y$ is a detailed caption for a simple image), the decoder can learn to ignore $z$ entirely. The latent becomes useless — the model just memorizes how to map $y$ directly to $x$.

Symptoms:

  • $D_{KL}(q_\phi(z|x, y) \,\|\, p(z|y)) \to 0$.
  • With the fixed $\mathcal{N}(0, I)$ prior, $\mu_\phi(x, y) \approx 0$ and $\sigma_\phi(x, y) \approx 1$ for every input.
  • Sampling different $z$ gives nearly identical outputs.
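
These symptoms are easy to check numerically. The sketch below computes the KL per latent dimension against a fixed $\mathcal{N}(0, I)$ prior and counts how many dimensions are still "active". The helper names and the 0.01-nat threshold are assumptions for illustration, not a standard.

```python
import torch


def kl_per_dimension(mu, log_var):
    """KL(q(z|x,y) || N(0, I)) for each latent dimension, averaged over the batch."""
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp())  # (B, D)
    return kl.mean(dim=0)                                   # (D,)


def count_active_dims(mu, log_var, threshold=0.01):
    """Number of latent dimensions whose average KL exceeds the threshold (in nats)."""
    return int((kl_per_dimension(mu, log_var) > threshold).sum())


# A fully collapsed posterior: mu = 0 and sigma = 1 for every input.
collapsed = count_active_dims(torch.zeros(64, 8), torch.zeros(64, 8))

# A posterior that still carries information: varied means, small variances.
torch.manual_seed(0)
healthy = count_active_dims(torch.randn(64, 8), torch.full((64, 8), -2.0))
```

If the active count drops toward zero during training while reconstruction stays good, the decoder is probably relying on $y$ alone.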

Fixes:

  • KL annealing (start with a small KL weight and ramp up).
  • Free bits (allow a minimum KL budget per latent dimension).
  • Weaker conditioning paths (inject $y$ at fewer layers).
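
The first two fixes are only a few lines each. Here is a minimal sketch, assuming a linear warm-up schedule and a per-dimension free-bits floor against a fixed $\mathcal{N}(0, I)$ prior. The function names, the 10,000-step warm-up, and the 0.5-nat floor are illustrative choices.

```python
import torch


def kl_weight(step, warmup_steps=10_000):
    """Linear KL annealing: ramp the KL weight from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)


def free_bits_kl(mu, log_var, free_bits=0.5):
    """KL term where each latent dimension is only penalized above a floor.

    The per-dimension KL is averaged over the batch, then clamped from below,
    so the optimizer gains nothing by pushing a dimension under `free_bits` nats.
    """
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp())  # (B, D)
    kl_per_dim = kl.mean(dim=0)                             # (D,)
    return torch.clamp(kl_per_dim, min=free_bits).sum()


# With a fully collapsed posterior, every dimension sits at the floor:
floor_total = free_bits_kl(torch.zeros(4, 8), torch.zeros(4, 8), free_bits=0.5)
# 8 dimensions * 0.5 nats = 4.0
```

In a training loop the loss would become something like `recon + kl_weight(step) * free_bits_kl(mu, log_var)`. Whether to combine both tricks or use only one is a judgment call.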

We’ll meet this tension again — dialed up intentionally — in the next post on $\beta$-VAE.


Summary

  1. The CVAE models $p(x|y)$ by letting the encoder, decoder, and (optionally) prior all condition on $y$.
  2. The conditional ELBO has the same shape as the VAE ELBO, with every relevant distribution conditioned on $y$.
  3. All the machinery we built for the VAE — reparameterization, closed-form Gaussian KL, Monte Carlo reconstruction — carries over without change.
  4. Where $y$ enters the network is an engineering choice: concat, embed, FiLM, cross-attend.
  5. The CVAE is the probabilistic skeleton underneath most modern conditional generators.