The Standard Story
The standard way to explain RLHF is as a three-step pipeline:
- Supervised Fine-Tuning (SFT): Take a pre-trained LLM, fine-tune it on high-quality demonstration data.
- Reward Model Training: Collect human preferences (response A is better than response B), train a reward model to predict these preferences.
- RL Optimization: Use PPO to optimize the SFT model against the reward model, with a KL penalty to prevent it from straying too far.
This is the InstructGPT recipe (Ouyang et al., 2022). It works. But it doesn’t really explain why it works, or why DPO — which skips the reward model entirely — also works, or what all these methods have in common.
A NeurIPS 2025 paper (“LLM Safety Alignment is Divergence Estimation in Disguise”) offers a much cleaner perspective. Let me walk through it.
The Divergence View
Here’s the core idea. Imagine we have two distributions over model outputs:
- $p_{safe}$: the distribution of “good” outputs (helpful, harmless, honest)
- $p_{unsafe}$: the distribution of “bad” outputs (harmful, unhelpful, dishonest)
Alignment is fundamentally about making the model’s output distribution $\pi_\theta$ closer to $p_{safe}$ and farther from $p_{unsafe}$. We can measure this with divergence functions like KL divergence.
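To make "closer to" and "farther from" concrete, here is a minimal sketch that computes KL divergence on a toy discrete output space. The three response labels and all the probabilities are made up for illustration; real output distributions live over token sequences and are never available in this tabular form.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as dicts over the same keys."""
    return sum(p[y] * math.log(p[y] / q[y]) for y in p if p[y] > 0)

# Hypothetical toy distributions over three candidate responses to one prompt.
p_safe = {"refuse_politely": 0.60, "helpful_answer": 0.35, "harmful_answer": 0.05}
p_unsafe = {"refuse_politely": 0.05, "helpful_answer": 0.15, "harmful_answer": 0.80}
policy = {"refuse_politely": 0.30, "helpful_answer": 0.40, "harmful_answer": 0.30}

kl_to_safe = kl_divergence(policy, p_safe)
kl_to_unsafe = kl_divergence(policy, p_unsafe)
print(f"KL(policy || p_safe)   = {kl_to_safe:.3f}")
print(f"KL(policy || p_unsafe) = {kl_to_unsafe:.3f}")
```

In this picture, alignment means driving the first number down and the second number up.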
The surprising claim: RLHF, DPO, and related methods are all doing divergence estimation. They differ in how they estimate and optimize this divergence, but the underlying objective is the same.
RLHF as Divergence Estimation
In RLHF, the reward model $r_\phi(x, y)$ learns to assign higher scores to preferred outputs. The RL objective is:
$$\max_\theta \; \mathbb{E}_{x \sim D, y \sim \pi_\theta(\cdot|x)}\left[r_\phi(x, y)\right] - \beta \cdot D_{KL}(\pi_\theta \,\|\, \pi_{ref})$$
where $\pi_{ref}$ is the SFT model and $\beta$ controls how far the policy can deviate.
The closed-form solution (assuming unlimited optimization capacity) is:
$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{r_\phi(x, y)}{\beta}\right)$$
where $Z(x) = \sum_y \pi_{ref}(y|x) \exp(r_\phi(x, y) / \beta)$ is the partition function.
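The closed form is easy to evaluate when the output space is tiny. The sketch below does that for a single prompt with three possible outputs; the reference probabilities and rewards are invented, and `optimal_policy` is just a name I picked for the helper.

```python
import math

def optimal_policy(pi_ref, reward, beta):
    """pi*(y) proportional to pi_ref(y) * exp(reward(y) / beta), for one fixed prompt."""
    unnormalized = {y: pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref}
    z = sum(unnormalized.values())  # partition function Z(x)
    return {y: w / z for y, w in unnormalized.items()}

# Hypothetical reference policy and reward scores for three outputs.
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}
reward = {"a": 0.0, "b": 1.0, "c": 2.0}

for beta in (10.0, 1.0, 0.1):
    pi_star = optimal_policy(pi_ref, reward, beta)
    print(beta, {y: round(p, 3) for y, p in pi_star.items()})
```

A large $\beta$ keeps $\pi^*$ close to $\pi_{ref}$; a small $\beta$ lets the reward dominate, so nearly all mass moves to the highest-reward output.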
Now, what is the reward model actually learning? It’s trained on preference pairs $(y_w, y_l)$ where $y_w$ is preferred over $y_l$, using the Bradley-Terry model:
$$P(y_w \succ y_l | x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$
where $\sigma$ is the sigmoid function.
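As a small sketch of what training against this model means, the code below computes the Bradley-Terry preference probability and the mean negative log-likelihood over a few scored pairs. The scores are placeholders standing in for the outputs of $r_\phi$; no actual reward model is involved.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bt_preference_prob(r_w, r_l):
    """P(y_w preferred over y_l) under Bradley-Terry, given the two reward scores."""
    return sigmoid(r_w - r_l)

def bt_loss(pairs):
    """Mean negative log-likelihood over (r_w, r_l) score pairs."""
    return -sum(math.log(bt_preference_prob(r_w, r_l)) for r_w, r_l in pairs) / len(pairs)

# Placeholder (chosen, rejected) reward scores for three preference pairs.
scored_pairs = [(2.0, 0.5), (0.1, 0.3), (1.0, 1.0)]
print("loss:", bt_loss(scored_pairs))
```

Only the difference of rewards enters, so adding any per-prompt constant to $r_\phi$ leaves the loss unchanged. That invariance comes up again when DPO drops $Z(x)$.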
The key insight: $r_\phi(x, y_w) - r_\phi(x, y_l)$ is estimating the log-likelihood ratio between the safe and unsafe distributions:
$$r_\phi(x, y_w) - r_\phi(x, y_l) \approx \log \frac{p_{safe}(y_w|x)}{p_{unsafe}(y_w|x)} - \log \frac{p_{safe}(y_l|x)}{p_{unsafe}(y_l|x)}$$
So the reward model implicitly learns a log-density ratio between the safe and unsafe distributions, which is the quantity that divergences such as KL average over, and PPO then shifts the policy toward outputs where that ratio is high.
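One way to see where the log-ratio comes from, under a simplifying assumption of mine (not necessarily the paper’s exact setup): suppose each preference pair contains one response drawn from $p_{safe}$ and one from $p_{unsafe}$, with either ordering equally likely. Then the Bayes posterior that a given response is the safe one has exactly the Bradley-Terry form with $r = \log(p_{safe}/p_{unsafe})$. The sketch checks this numerically on made-up distributions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy distributions over three responses to one prompt.
p_safe = {"y1": 0.70, "y2": 0.25, "y3": 0.05}
p_unsafe = {"y1": 0.10, "y2": 0.30, "y3": 0.60}

def posterior_first_is_safe(a, b):
    """P(a came from p_safe and b from p_unsafe | the pair {a, b}), equal prior on orderings."""
    a_safe = p_safe[a] * p_unsafe[b]
    b_safe = p_safe[b] * p_unsafe[a]
    return a_safe / (a_safe + b_safe)

def log_ratio(y):
    """Log density ratio, playing the role of the reward."""
    return math.log(p_safe[y] / p_unsafe[y])

for a, b in [("y1", "y2"), ("y1", "y3"), ("y2", "y3")]:
    bayes = posterior_first_is_safe(a, b)
    bradley_terry = sigmoid(log_ratio(a) - log_ratio(b))
    print(a, b, round(bayes, 6), round(bradley_terry, 6))
```

The two columns agree for every pair, which is the sense in which a well-fit Bradley-Terry reward recovers the log-likelihood ratio up to a per-prompt constant.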
DPO: Skipping the Middleman
Direct Preference Optimization (Rafailov et al., 2023) starts from the same KL-constrained objective as RLHF but rearranges the math to eliminate the reward model.
Taking logs of the closed-form optimal policy above, solving for the reward, and parameterizing the optimal policy directly as $\pi_\theta$ gives:
$$r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)$$
Substituting this into the Bradley-Terry preference model:
$$P(y_w \succ y_l | x) = \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)$$
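The partition function $Z(x)$ cancels because it depends only on $x$ and so appears identically in both terms. The DPO loss is the negative log-likelihood of the observed preferences under this model, so minimizing it maximizes that probability: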
$$L_{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$
No reward model. No PPO. No sampling from the current policy during training. Just a loss function over preference pairs that you can optimize with standard gradient descent.
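Here is a minimal sketch of that loss in plain Python, assuming the sequence log-probabilities under the policy and the frozen reference model have already been computed. The dictionary keys and the numbers are my own placeholders; a real implementation would operate on batched tensors in a framework like PyTorch or JAX.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(batch, beta):
    """Mean DPO loss over examples holding summed sequence log-probs."""
    total = 0.0
    for ex in batch:
        # Implicit rewards (up to beta and the cancelled log Z term).
        log_ratio_w = ex["policy_logp_w"] - ex["ref_logp_w"]
        log_ratio_l = ex["policy_logp_l"] - ex["ref_logp_l"]
        total += -log_sigmoid(beta * (log_ratio_w - log_ratio_l))
    return total / len(batch)

# Placeholder log-probs: policy identical to the reference, then a policy
# that has raised the chosen response and lowered the rejected one.
untrained = [{"policy_logp_w": -12.0, "ref_logp_w": -12.0,
              "policy_logp_l": -9.0, "ref_logp_l": -9.0}]
improved = [{"policy_logp_w": -10.0, "ref_logp_w": -12.0,
             "policy_logp_l": -11.0, "ref_logp_l": -9.0}]

print(dpo_loss(untrained, beta=0.1), dpo_loss(improved, beta=0.1))
```

When the policy equals the reference, both log-ratios are zero and the loss is $\log 2$ regardless of $\beta$; it falls as the policy widens the margin between chosen and rejected responses.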
In the divergence framework: DPO directly estimates the divergence between preferred and dispreferred output distributions using the policy’s own log-probabilities as the measurement tool. It’s the same divergence estimation, just with a different estimator.
Constitutional AI / RLAIF
Constitutional AI (Bai et al., 2022, Anthropic) adds another twist: instead of human preferences, use AI-generated preferences. A “critic” model evaluates outputs against a set of principles (the “constitution”) and provides preference labels.
In the divergence framework: Constitutional AI replaces the human oracle $p_{safe}$ with an AI-approximated version $\hat{p}_{safe}$. The divergence estimation is the same — we’re still trying to push the model toward safe outputs — but the reference distribution is defined by the constitution rather than collected human judgments.
The advantage: it scales better (no human labeling bottleneck). The risk: the constitution is an imperfect specification of safety, and errors in $\hat{p}_{safe}$ propagate to the trained model.
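To illustrate the labeling step only, here is a deliberately crude sketch in which keyword checks stand in for the AI critic. Everything here is hypothetical: `PRINCIPLES`, `critic_score`, and `label_pair` are names I made up, and a real pipeline would ask an LLM to judge responses against written principles rather than match strings.

```python
# Hypothetical "constitution": each principle is a check that returns True when satisfied.
PRINCIPLES = {
    "avoid_harmful_instructions": lambda r: "here is how to build" not in r.lower(),
    "give_a_substantive_answer": lambda r: len(r.split()) >= 5,
}

def critic_score(response):
    """Toy critic: fraction of principles the response satisfies."""
    results = [check(response) for check in PRINCIPLES.values()]
    return sum(results) / len(results)

def label_pair(prompt, response_a, response_b):
    """Return (prompt, chosen, rejected), or None if the critic cannot separate the two."""
    score_a, score_b = critic_score(response_a), critic_score(response_b)
    if score_a == score_b:
        return None
    if score_a > score_b:
        return (prompt, response_a, response_b)
    return (prompt, response_b, response_a)

prompt = "How do I get past a locked door?"
harmful = "Here is how to build a bump key and open any door."
benign = "Please contact a licensed locksmith for help with that."
disguised = "You can assemble a bump key from a blank and open any door."

print(label_pair(prompt, harmful, benign))
print(label_pair(prompt, disguised, benign))
```

The second call returns `None`: the toy critic scores the reworded harmful response as highly as the benign one. That blind spot is the kind of error in $\hat{p}_{safe}$ that ends up baked into the preference data.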
KLDO: Making the Divergence Explicit
The NeurIPS paper proposes KLDO (KL Divergence Optimization), which makes the divergence estimation explicit rather than implicit:
$$L_{KLDO}(\theta) = D_{KL}(\pi_\theta \,\|\, p_{safe}) - \alpha \cdot D_{KL}(\pi_\theta \,\|\, p_{unsafe})$$
Minimize divergence from the safe distribution, maximize divergence from the unsafe distribution. Straightforward.
In practice, $p_{safe}$ and $p_{unsafe}$ aren’t known directly — they’re estimated from the preference data. But the explicit framing makes the optimization objective clearer and allows you to weight the two terms differently (maybe you care more about avoiding unsafe outputs than about matching the ideal safe distribution).
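To get a feel for the role of $\alpha$, the sketch below evaluates the objective exactly as written above on toy distributions where $p_{safe}$ and $p_{unsafe}$ are simply given. That is an assumption for illustration only: this is not the paper’s estimator, which has to work from preference data, and all the numbers are invented.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kldo_objective(policy, p_safe, p_unsafe, alpha):
    """KL(policy || p_safe) - alpha * KL(policy || p_unsafe); lower is better."""
    return kl(policy, p_safe) - alpha * kl(policy, p_unsafe)

# Hypothetical distributions over three outputs.
p_safe = [0.70, 0.25, 0.05]
p_unsafe = [0.05, 0.15, 0.80]
policy_near_safe = [0.65, 0.30, 0.05]
policy_near_unsafe = [0.10, 0.20, 0.70]

for alpha in (0.0, 0.5, 1.0):
    good = kldo_objective(policy_near_safe, p_safe, p_unsafe, alpha)
    bad = kldo_objective(policy_near_unsafe, p_safe, p_unsafe, alpha)
    print(f"alpha={alpha}: near-safe {good:.3f}, near-unsafe {bad:.3f}")
```

With $\alpha = 0$ the objective only rewards matching $p_{safe}$; raising $\alpha$ adds increasing credit for distance from $p_{unsafe}$, which is the knob the two-term form exposes.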
The Unifying Table
| Method | How It Estimates Divergence | Reward Model? | Online RL? |
|---|---|---|---|
| RLHF | Reward model learns implicit log-likelihood ratio; PPO optimizes against it | Yes | Yes (PPO) |
| DPO | Policy log-probability ratios directly estimate divergence | No (implicit) | No (offline) |
| Constitutional AI | AI critic approximates safe/unsafe distributions; then RLHF or DPO | Optional | Optional |
| KTO | Uses unpaired binary feedback (good/bad, no pairs) to estimate utility | No | No |
| KLDO | Explicitly optimizes KL divergence between policy and safe/unsafe | No | No |
They’re all doing the same thing: moving the model’s output distribution toward safety and away from harm. The mathematical formulations differ, but the underlying optimization problem is shared.
Why This Matters
This unifying view gives us several things:
Diagnostic power. If your aligned model is misbehaving, you can ask: is the divergence estimate wrong (bad reward model / bad preference data), or is the optimization failing (PPO instability / DPO misspecification)?
Method selection. Need stability and simplicity? DPO. Need to scale without human labels? Constitutional AI. Need fine-grained control over the safety-helpfulness trade-off? KLDO’s separate terms let you tune that.
Understanding failure modes. Reward hacking in RLHF is the policy finding outputs that the estimator scores as far from $p_{unsafe}$ but that don’t actually correspond to safe behavior: the estimator is wrong. Sycophancy is the model overfitting to an estimate $\hat{p}_{safe}$ that rewards agreeable responses. The divergence framework makes these failure modes interpretable.
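Reward hacking is easy to reproduce in a toy setting. The sketch below reuses the closed-form KL-regularized policy on three outputs, with a proxy reward that badly over-scores one rare output. The output names, probabilities, and both reward tables are invented for illustration.

```python
import math

def optimal_policy(pi_ref, reward, beta):
    """pi*(y) proportional to pi_ref(y) * exp(reward(y) / beta)."""
    weights = {y: pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def expected_reward(policy, reward):
    return sum(policy[y] * reward[y] for y in policy)

# Hypothetical setup: the estimator is wrong about the rare "exploit" output.
pi_ref = {"safe_answer": 0.60, "ok_answer": 0.39, "exploit": 0.01}
proxy_reward = {"safe_answer": 1.0, "ok_answer": 0.5, "exploit": 3.0}
true_reward = {"safe_answer": 1.0, "ok_answer": 0.5, "exploit": -2.0}

def hacking_stats(beta):
    """Return (proxy, true) expected reward of the policy optimized against the proxy."""
    policy = optimal_policy(pi_ref, proxy_reward, beta)
    return expected_reward(policy, proxy_reward), expected_reward(policy, true_reward)

for beta in (5.0, 1.0, 0.2):
    proxy, true = hacking_stats(beta)
    print(f"beta={beta}: proxy reward {proxy:.2f}, true reward {true:.2f}")
```

As $\beta$ shrinks, the policy piles mass onto `exploit`: the proxy reward keeps rising while the true reward falls. The optimizer is doing its job; the estimate it optimizes against is what failed.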
It also explains why no single method dominates. RLHF has the richest divergence estimation (learned reward model) but the most complex optimization (PPO). DPO has simpler optimization but a more restrictive estimator (offline, no reward model). Constitutional AI trades off estimation accuracy for scalability. Each makes different trade-offs in the same fundamental optimization problem.
Open Questions
Does the divergence view suggest better alignment methods we haven’t tried yet? If all methods are doing divergence estimation, can we find better divergence measures than KL? Is there a “best” estimator we should converge on? And how does this framework extend to multi-turn conversations, tool use, and agent behaviors where the output distribution is much more complex?
These are questions I’m still thinking about. If the divergence view is right, then the path to better alignment is better divergence estimation — not fundamentally new paradigms.