The Hidden World Behind Our Data

Think about it: when you draw something, you start with basic shapes, not every tiny detail. The shape, the pose, the color — these hidden factors determine what you end up with. Latent variables in ML work the same way.

We observe high-dimensional data like images, but there’s a simpler structure underneath. The pixels are just the final result of some hidden generative process.

This is the basic idea behind latent variable models (LVMs): what we observe is only a shadow of something unobserved, simpler, and more fundamental.


Motivation

Most modern generative models rely on latent variables — unobserved factors that influence the data. Instead of directly modeling $p(x)$, we assume there exists some latent representation $z$ that captures the underlying structure of the data.

In this note, we start from a simple linear model — Probabilistic PCA (PPCA) — and generalize to the family of latent variable models (LVMs). This sets the mathematical foundation for the Variational Autoencoder (VAE).


Formal Definition and the Marginal Likelihood

So the essential idea behind LVMs is that what we observe — the data $x$ — is only a shadow or projection of something unobserved, called a latent variable $z$.

Instead of modeling $p(x)$ directly, we model the joint distribution $p(x, z)$, then derive everything else (including $p(x)$) from it.

In plain words, we first sample $z$ (e.g., we imagine the size, shape, and color of a horse) and then create an image with all the necessary details, i.e., we sample $x$ from the conditional distribution $p(x\mid z)$. Formally, an LVM introduces the latent variable $z$ and factorizes the joint distribution as:

\begin{equation} p_\theta(x, z) = p_\theta(x\mid z)\,p(z) \end{equation}

where:

  • $p(z)$: the prior over latent variables (usually simple, e.g. $\mathcal{N}(0, I)$),
  • $p_\theta(x\mid z)$: the likelihood (a conditional distribution parameterized by $\theta$, often a neural decoder).

This naturally expresses the generative process described above:

  1. Draw a latent sample $z \sim p(z)$.
  2. Generate an observable $x \sim p_\theta(x\mid z)$.
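
Here is a minimal sketch of these two steps in NumPy. It assumes, purely for illustration, a linear-Gaussian decoder $p(x\mid z) = \mathcal{N}(x; Wz + \mu, \sigma^2 I)$; the names `W`, `mu`, `sigma` and the dimensions are made up for the example and not part of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

d, D = 2, 5                     # latent and data dimensions (illustrative)
W = rng.normal(size=(D, d))     # hypothetical decoder weights
mu = np.zeros(D)                # hypothetical offset
sigma = 0.1                     # hypothetical observation-noise std

def sample(n):
    """Ancestral sampling: z ~ p(z), then x ~ p(x | z)."""
    z = rng.normal(size=(n, d))                          # step 1: z ~ N(0, I)
    x = z @ W.T + mu + sigma * rng.normal(size=(n, D))   # step 2: x ~ N(Wz + mu, sigma^2 I)
    return z, x

z, x = sample(1000)
```

Swapping the line that computes the mean `z @ W.T + mu` for a neural network would give the nonlinear case; the two-step structure stays the same.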

During training, however, we usually have access only to $x$, not to the hidden $z$. The rules of probability then tell us to sum out (marginalize out) the unknown $z$. The resulting (marginal) likelihood function is:

\begin{equation} p_\theta(x) = \int p_\theta(x, z) \, dz = \int p_\theta(x\mid z)\,p(z) \, dz \end{equation}

This integral is also known as the evidence of the data. It captures the model’s ability to explain the data by integrating over all possible latent causes $z$.

  • A probabilistic interpretation:

The term $p_\theta(x\mid z)p(z)$ can be seen as the joint probability density of seeing both $x$ and $z$ simultaneously. By integrating over $z$, we effectively sum over every possible hidden cause that might have produced $x$:

\begin{equation} p_\theta(x) = \mathbb{E}_{p(z)}[p_\theta(x\mid z)] \end{equation}

  • Bayesian viewpoint:

In Bayesian inference, this step — summing or integrating out the unobserved variable — is called marginalization. It’s the process that turns the joint model $p(x, z)$ into a model that depends only on observable data $x$:

\begin{equation} p(x) = \int p(x, z)\,dz \end{equation}

This marginalization embodies the idea that we are uncertain about $z$, and instead of committing to a specific value, we integrate across all possibilities.
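
The expectation form $p_\theta(x) = \mathbb{E}_{p(z)}[p_\theta(x\mid z)]$ suggests a simple Monte Carlo estimate: draw many $z$ from the prior and average the likelihoods. The sketch below tries this on a one-dimensional toy model where the exact answer is known, $p(x) = \mathcal{N}(x; \mu, w^2 + s^2)$. The scalar parameters `w`, `mu`, `s` and the test point are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
w, mu, s = 2.0, 0.5, 1.0   # hypothetical scalar decoder: x | z ~ N(w*z + mu, s^2)

def normal_pdf(x, mean, var):
    """Density of a univariate normal N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def marginal_mc(x, n=200_000):
    """Estimate p(x) = E_{p(z)}[p(x | z)] by averaging over prior samples."""
    z = rng.normal(size=n)                       # z ~ N(0, 1)
    return normal_pdf(x, w * z + mu, s**2).mean()

x = 1.3
estimate = marginal_mc(x)
exact = normal_pdf(x, mu, w**2 + s**2)           # closed form for this toy model
print(estimate, exact)
```

In one dimension the two numbers agree closely. The estimator is the same in higher dimensions, but far fewer prior samples land where $p_\theta(x\mid z)$ is non-negligible, so its variance grows quickly.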

At first glance, the integral above looks straightforward. But in reality, it’s almost always intractable:

  • Nonlinearity of the decoder. In most deep generative models, $p_\theta(x\mid z)$ is represented by a neural network, e.g. $\mathcal{N}(x; f_\theta(z), \sigma^2 I)$. With a nonlinear $f_\theta$, there is no closed form for $p_\theta(x)$.
  • High-dimensional latent space. $z$ often has 32–256 dimensions, so naive numerical integration is infeasible.
  • Exponential complexity. The cost of grid-based integration grows exponentially with the latent dimension: with just 10 grid points per axis, a 32-dimensional $z$ already needs $10^{32}$ evaluations of the integrand.

Exception (tractable case): linear-Gaussian models

Given that:

\begin{equation} p(z) = \mathcal{N}(0, I), \quad p(x\mid z) = \mathcal{N}(x; Wz + \mu, \sigma^2 I), \end{equation}

we can integrate analytically:

\begin{equation} p(x) = \mathcal{N}(x; \mu, WW^\top + \sigma^2 I). \end{equation}
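
A quick numerical sanity check of this closed form: sample from the generative process and compare the empirical mean and covariance with $\mu$ and $WW^\top + \sigma^2 I$. The parameter values below are random placeholders chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, n = 2, 4, 200_000          # illustrative sizes
W = rng.normal(size=(D, d))      # hypothetical loading matrix
mu = rng.normal(size=D)          # hypothetical mean
sigma = 0.5                      # hypothetical noise std

z = rng.normal(size=(n, d))                          # z ~ N(0, I)
x = z @ W.T + mu + sigma * rng.normal(size=(n, D))   # x | z ~ N(Wz + mu, sigma^2 I)

emp_mean = x.mean(axis=0)
emp_cov = np.cov(x, rowvar=False)
model_cov = W @ W.T + sigma**2 * np.eye(D)           # covariance predicted by the closed form

max_mean_err = np.abs(emp_mean - mu).max()
max_cov_err = np.abs(emp_cov - model_cov).max()
print(max_mean_err, max_cov_err)
```

Both errors shrink as `n` grows, as expected from sampling noise alone.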

But as soon as $p(x\mid z)$ becomes nonlinear (e.g., $Wz$ replaced by a neural net), the integral becomes intractable.

Once we have $p_\theta(x)$, the posterior is

\begin{equation} p_\theta(z\mid x) = \frac{p_\theta(x\mid z)\,p(z)}{p_\theta(x)}. \end{equation}

However, evaluating $p_\theta(z\mid x)$ requires $p_\theta(x)$, which is intractable in the general nonlinear case. Thus both the marginal likelihood and the posterior are intractable — the central bottleneck in learning LVMs.

We have two roads ahead:

  • Exact inference (analytic) works for special cases like Probabilistic Principal Component Analysis (PPCA).
  • Approximate inference is required in general. We introduce a surrogate posterior $q_\phi(z\mid x)$ and will optimize a lower bound on $\log p_\theta(x)$ — the ELBO (in the next note).
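
To make the exact-inference road concrete: for the linear-Gaussian model above, the posterior is itself Gaussian, $p(z\mid x) = \mathcal{N}(z; M^{-1}W^\top(x-\mu), \sigma^2 M^{-1})$ with $M = W^\top W + \sigma^2 I$ (see Bishop, 2006, Chapter 12). The sketch below computes it and checks it against Bayes' rule at an arbitrary point. The function names `log_normal` and `ppca_posterior` and all parameter values are hypothetical, chosen for the example.

```python
import numpy as np

def log_normal(v, mean, cov):
    """Log density of a multivariate normal N(v; mean, cov)."""
    k = len(v)
    diff = v - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def ppca_posterior(x, W, mu, sigma2):
    """Mean and covariance of p(z | x) for the linear-Gaussian model."""
    d = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(d)
    mean = np.linalg.solve(M, W.T @ (x - mu))
    cov = sigma2 * np.linalg.inv(M)
    return mean, cov

rng = np.random.default_rng(0)
D, d, sigma2 = 4, 2, 0.3                               # illustrative sizes and noise variance
W, mu = rng.normal(size=(D, d)), rng.normal(size=D)    # hypothetical parameters
x, z = rng.normal(size=D), rng.normal(size=d)          # arbitrary test point

post_mean, post_cov = ppca_posterior(x, W, mu, sigma2)

# Bayes' rule in log space: log p(z|x) = log p(x|z) + log p(z) - log p(x)
lhs = log_normal(z, post_mean, post_cov)
rhs = (log_normal(x, W @ z + mu, sigma2 * np.eye(D))
       + log_normal(z, np.zeros(d), np.eye(d))
       - log_normal(x, mu, W @ W.T + sigma2 * np.eye(D)))
print(lhs, rhs)
```

The two sides agree up to floating-point error, which is exactly what fails to be computable once the decoder is nonlinear and $p_\theta(x)$ has no closed form.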

From PCA to Probabilistic PCA

Given data $x \in \mathbb{R}^D$, PCA finds a linear projection:

\begin{equation} z = W^\top (x - \mu),\quad W \in \mathbb{R}^{D \times d},\; d < D \end{equation}

that maximizes the variance of the projected data or, equivalently, minimizes the reconstruction error (with orthonormal columns, $W^\top W = I$):

\begin{equation} \min_W \| x - \mu - WW^\top (x - \mu) \|^2 \end{equation}
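
As a concrete sketch, PCA can be computed from the eigendecomposition of the sample covariance. The data here is synthetic and the sizes are arbitrary. The example also checks a standard property: the mean squared reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 500, 5, 2
X = rng.normal(size=(n, D)) @ rng.normal(size=(D, D))   # synthetic correlated data

mu = X.mean(axis=0)
Xc = X - mu
S = Xc.T @ Xc / n                          # sample covariance (1/n normalization)
eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
W = eigvecs[:, :d]                         # top-d principal directions, orthonormal columns

Z = Xc @ W                                 # z = W^T (x - mu) for every row
X_hat = Z @ W.T + mu                       # reconstruction mu + W W^T (x - mu)
recon_error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
discarded = eigvals[d:].sum()
print(recon_error, discarded)
```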

PCA, however, is deterministic and non-probabilistic. We can’t use it to generate data, nor to reason about uncertainty.

Let’s define a latent variable model where $z \in \mathbb{R}^d$ is hidden and Gaussian:

\begin{equation} z \sim \mathcal{N}(0, I) \end{equation}

and data $x$ is generated linearly from $z$ with Gaussian noise:

\begin{equation} x = Wz + \mu + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I) \end{equation}

Thus, the conditional distribution of $x$ given $z$ is:

\begin{equation} p(x|z) = \mathcal{N}(x; Wz + \mu, \sigma^2 I) \end{equation}

Since $z$ is latent, the likelihood of a single observation $x$ is:

\begin{equation} p(x) = \int p(x|z)\, p(z) \, dz \end{equation}

Substituting the Gaussian definitions:

\begin{equation} p(x) = \int \mathcal{N}(x; Wz + \mu, \sigma^2 I)\, \mathcal{N}(z; 0, I) \, dz \end{equation}

This integral can be solved analytically: $x$ is an affine function of the jointly Gaussian variables $z$ and $\epsilon$, so its marginal is again Gaussian:

\begin{equation} p(x) = \mathcal{N}(x; \mu, WW^\top + \sigma^2 I) \end{equation}
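
The mean and covariance can be checked without evaluating the integral, by computing the first two moments of $x = Wz + \mu + \epsilon$ directly. A short sketch, assuming (as in the model above) that $z$ and $\epsilon$ are independent and zero-mean:

\begin{equation} \mathbb{E}[x] = W\,\mathbb{E}[z] + \mu + \mathbb{E}[\epsilon] = \mu \end{equation}

\begin{equation} \mathrm{Cov}[x] = \mathbb{E}\big[(Wz + \epsilon)(Wz + \epsilon)^\top\big] = W\,\mathbb{E}[zz^\top]\,W^\top + \mathbb{E}[\epsilon\epsilon^\top] = WW^\top + \sigma^2 I \end{equation}

The cross terms $W\,\mathbb{E}[z\epsilon^\top]$ and its transpose vanish by independence, and $\mathbb{E}[zz^\top] = I$ because $z \sim \mathcal{N}(0, I)$. Since a Gaussian is fully determined by its mean and covariance, these two moments pin down $p(x)$.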

We’ve just shown that PPCA is a latent variable model whose marginal $p(x)$ is a Gaussian with covariance structured as $WW^\top + \sigma^2 I$. Classical PCA is recovered from its maximum-likelihood solution in the limit $\sigma^2 \to 0$.

Interpretation:

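  • $W$: defines the subspace directions (at the maximum-likelihood solution, its columns span the principal subspace).
  • $\sigma^2$: controls the isotropic noise, i.e. the variance not captured by the subspace.
  • $z$: the latent factors of variation, i.e. the coordinates of $x$ within the subspace before noise is added.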
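
These roles show up directly in the closed-form maximum-likelihood solution from Tipping & Bishop (1999): $\sigma^2_{\mathrm{ML}}$ is the average of the discarded eigenvalues of the sample covariance, and $W_{\mathrm{ML}} = U_d(\Lambda_d - \sigma^2 I)^{1/2}R$ for an arbitrary rotation $R$. The sketch below takes $R = I$ and tests the fit on synthetic data. The function name `fit_ppca` and the data-generating parameters are made up for the example.

```python
import numpy as np

def fit_ppca(X, d):
    """Closed-form maximum-likelihood PPCA fit (rotation R taken as identity)."""
    n, D = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / n                        # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)                 # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order
    sigma2 = eigvals[d:].mean()                          # average discarded variance
    W = eigvecs[:, :d] * np.sqrt(eigvals[:d] - sigma2)   # scale the top-d eigenvectors
    return W, mu, sigma2

# Synthetic data from a known linear-Gaussian model (hypothetical parameters).
rng = np.random.default_rng(0)
D, d, n = 6, 2, 50_000
W_true = rng.normal(size=(D, d))
X = rng.normal(size=(n, d)) @ W_true.T + 0.5 * rng.normal(size=(n, D))

W, mu, sigma2 = fit_ppca(X, d)
print(sigma2)                                       # should be near 0.5**2 = 0.25
print(np.abs(W @ W.T - W_true @ W_true.T).max())    # W is identified only up to rotation
```

The comparison uses $WW^\top$ rather than $W$ itself because any rotation of the latent space gives the same marginal $p(x)$.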

So, intuitively, PCA reconstructs and PPCA explains. PPCA says: “There exists a latent world $z$ where data is generated from a simple distribution, and what we see is just its noisy projection.”

This simple probabilistic leap — from reconstruction to explanation — is the seed idea of all modern VAEs.


Generalizing to Latent Variable Models

PPCA is linear and Gaussian. But what if the data lies on a nonlinear manifold?

We generalize by introducing a nonlinear generative process:

\begin{equation} z \sim p(z), \quad x \sim p_\theta(x|z) \end{equation}

where $p_\theta(x|z)$ is parameterized by a neural network (the decoder).

Then the marginal likelihood becomes:

\begin{equation} p_\theta(x) = \int p_\theta(x|z)\, p(z) \, dz \end{equation}

This defines the Latent Variable Model (LVM).

The posterior $p_\theta(z|x)$ tells us how likely each latent cause $z$ is given an observation $x$:

\begin{equation} p_\theta(z|x) = \frac{p_\theta(x|z)\, p(z)}{p_\theta(x)} \end{equation}

However, computing $p_\theta(x)$ requires integrating over all $z$:

\begin{equation} p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz \end{equation}

which is generally intractable for nonlinear models.
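
With a nonlinear decoder there is no closed form to fall back on, so the only direct option is a sampling estimate of $\log p_\theta(x)$. The sketch below uses a tiny decoder with random, untrained weights (a stand-in for a neural network; every name and size is hypothetical) and a log-sum-exp trick for numerical stability. It illustrates the naive estimator, not a practical training method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, H = 2, 3, 16                        # latent, data, hidden sizes (illustrative)
W1, b1 = rng.normal(size=(H, d)), rng.normal(size=H)   # hypothetical untrained weights
W2, b2 = rng.normal(size=(D, H)), rng.normal(size=D)
sigma2 = 0.5                              # hypothetical observation-noise variance

def decoder_mean(z):
    """f_theta(z): one tanh hidden layer, applied row-wise."""
    return np.tanh(z @ W1.T + b1) @ W2.T + b2

def log_marginal_mc(x, n_samples=10_000):
    """Naive estimate of log p(x) = log E_{p(z)}[N(x; f(z), sigma2 * I)]."""
    z = rng.normal(size=(n_samples, d))                  # z ~ N(0, I)
    diff = x - decoder_mean(z)
    log_lik = -0.5 * (D * np.log(2 * np.pi * sigma2) + np.sum(diff**2, axis=1) / sigma2)
    m = log_lik.max()                                    # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(log_lik - m)))

x = decoder_mean(rng.normal(size=(1, d)))[0]             # a point the decoder can produce
estimate = log_marginal_mc(x)
print(estimate)
```

This works in two latent dimensions, but the number of prior samples needed for a stable estimate grows rapidly with the latent dimension.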

This intractability motivates variational inference — the key idea behind VAEs.

We want to learn parameters $\theta$ that maximize the marginal likelihood:

\begin{equation} \max_\theta \log p_\theta(x) \end{equation}

But since $p_\theta(x)$ is intractable, we’ll derive a lower bound (the ELBO) that we can actually compute (that’s for the next post, if I ever write it).


Geometric Intuition

Imagine each $z$ is a coordinate on a smooth, continuous manifold. The decoder $p_\theta(x|z)$ “decodes” that point into a distribution over data space, from which a sample $x$ is drawn.

  • In PCA, this mapping is linear and the manifold is a flat subspace.
  • In nonlinear LVMs, the manifold can be curved — learned by a neural decoder.
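
The flat-versus-curved distinction can be seen numerically. The sketch below pushes a 1-D latent line through a linear map and through a hand-picked nonlinear map (a toy stand-in for a neural decoder; both maps are hypothetical), then measures the dimension of the smallest flat subspace that contains the decoded points.

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.linspace(-3, 3, 200)[:, None]          # 1-D latent coordinates, shape (200, 1)
w = rng.normal(size=(1, 3))                   # hypothetical linear decoder weights

linear_points = z @ w                                   # noise-free decoder means, linear case
curved_points = np.hstack([z, np.sin(z), np.cos(z)])    # hypothetical nonlinear decoder

def affine_dim(points, tol=1e-8):
    """Dimension of the smallest flat (affine) subspace containing the points."""
    centered = points - points.mean(axis=0)
    return np.linalg.matrix_rank(centered, tol=tol)

linear_dim = affine_dim(linear_points)
curved_dim = affine_dim(curved_points)
print(linear_dim, curved_dim)
```

The linear image stays on a line, while the nonlinear image is still a one-dimensional curve but bends through all three data dimensions.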

Summary

| Concept | Deterministic AE / PCA | Latent Variable Model |
|---|---|---|
| Latent variable | Fixed vector | Random variable $z \sim p(z)$ |
| Encoder | Linear projection | Inference of posterior $p(z\mid x)$ |
| Decoder | Linear reconstruction | Conditional distribution $p(x\mid z)$ |
| Objective | Reconstruction error | Likelihood (or its lower bound) |
| Learning | Deterministic mapping | Probabilistic inference |

  • Latent Variable Models introduce hidden variables to capture unobserved structure.
  • Probabilistic PCA is the simplest example — tractable, linear, Gaussian.
  • The marginal likelihood $p(x) = \int p(x\mid z)p(z)dz$ is key, but often intractable.
  • This motivates variational inference — which we will derive step-by-step in the next post.

References

  • Tipping, M. E., & Bishop, C. M. (1999). Probabilistic Principal Component Analysis.
  • Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes.
  • Doersch, C. (2016). Tutorial on Variational Autoencoders.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 12.