SafeDPO and Friends: Preference Optimization That Doesn't Sacrifice Safety

DPO: Clean but Flawed

If you’ve read the RLHF post in this series, you know that DPO (Direct Preference Optimization) is elegant — it eliminates the reward model and PPO entirely, replacing them with a single loss function over preference pairs. It’s simpler to implement, more stable to train, and produces competitive results.

So why did NeurIPS 2025 and ICLR 2026 have a combined 15+ papers trying to fix it?

Because DPO has real problems. And in 2025-2026, the community found them, characterized them, and started fixing them. This post surveys the most important DPO variants, focusing on the safety angle.

What’s Wrong with DPO

Problem 1: DPO Is a Misspecified Estimator

An ICLR 2026 oral paper (“Why DPO is a Misspecified Estimator and How to Fix It”) showed that DPO’s loss function makes an implicit assumption that doesn’t hold in practice.

DPO assumes the optimal policy takes the form $\pi^*(y|x) \propto \pi_{ref}(y|x) \exp(r(x,y)/\beta)$. This is the closed-form solution to the KL-constrained reward maximization problem. But this only holds at convergence — during training, the policy is not yet optimal, and the implicit reward extracted from the policy’s log-probabilities is inaccurate.
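To see where the implicit reward comes from, it helps to spell out the standard DPO derivation. The sketch below introduces $Z(x)$ for the normalizing constant, a symbol not used elsewhere in this post, and assumes the usual Bradley-Terry preference model.

$$r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x), \qquad Z(x) = \sum_{y} \pi_{ref}(y|x) \exp(r(x,y)/\beta)$$

$$P(y_w \succ y_l \mid x) = \sigma\big(r(x,y_w) - r(x,y_l)\big) = \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)}\right)$$

The $\beta \log Z(x)$ term cancels because it depends only on the prompt. DPO then substitutes the trainable $\pi_\theta$ for $\pi^*$. The identity is exact only when $\pi_\theta$ equals $\pi^*$, which is the gap the paragraph above describes.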

The concrete consequences:

Preference reversals: DPO can learn to prefer response B over A even when the training data says A > B
Reward degradation: The implicit reward model quality decreases during training, especially on out-of-distribution inputs

Problem 2: Safety-Helpfulness Trade-off

Standard DPO optimizes a single preference: “which response is better?” But “better” conflates helpfulness and safety. A response can be helpful but unsafe, or safe but unhelpful. Optimizing a single preference signal pushes the model toward whichever dimension dominates the training data.

In practice, most preference datasets are skewed toward helpfulness (because that’s what annotators naturally reward). Safety gets underrepresented unless you explicitly add safety-focused preference pairs.

Problem 3: Average vs. Worst-Case

DPO optimizes expected performance across the training distribution. But safety is a worst-case property — you need the model to be safe on every prompt, not just on average. A model that’s safe on 99% of prompts and catastrophically unsafe on 1% has failed at safety.
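The 99%/1% example can be made concrete with a few lines of arithmetic. This is a toy illustration: the cost values and the threshold `epsilon` are made up, not taken from any paper.

```python
# Hypothetical per-prompt safety costs: 0.0 = safe, 1.0 = catastrophic.
costs = [0.0] * 99 + [1.0]
epsilon = 0.05  # assumed safety threshold

average_cost = sum(costs) / len(costs)
worst_cost = max(costs)

# An average-case check passes, a worst-case check does not.
passes_on_average = average_cost <= epsilon
passes_worst_case = worst_cost <= epsilon

print(f"average cost = {average_cost:.2f}, passes: {passes_on_average}")
print(f"worst cost   = {worst_cost:.2f}, passes: {passes_worst_case}")
```

The same model is "safe" or "unsafe" depending only on which aggregate you check, which is why the choice of constraint matters more than it first appears.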

SafeDPO: The Clean Fix

SafeDPO (ICLR 2026, Kim et al.) is my favorite solution because it adds minimal machinery.

The idea: add a single safety constraint to the DPO objective. Instead of optimizing a helpfulness preference alone, SafeDPO is constructed so that minimizing its loss recovers the optimal solution of a safety-constrained optimization problem.

In standard DPO, the loss is:

$$L_{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$
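For a single preference pair, the loss above reduces to a few lines of arithmetic. Here is a minimal sketch in plain Python, assuming you already have summed sequence log-probabilities for the chosen ($y_w$) and rejected ($y_l$) responses under both the policy and the reference model. The function names and the numbers are illustrative.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair, from summed sequence log-probabilities."""
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi_theta/pi_ref for y_w
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi_theta/pi_ref for y_l
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)

# Policy identical to the reference: the margin is 0, so the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has moved probability toward the chosen response: the loss drops.
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(loss_at_init, loss_after)
```

In a real training loop the expectation is a batch mean and the log-probabilities come from the model, but the per-pair computation is the same.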

SafeDPO modifies this to incorporate a safety margin $\alpha$:

$$L_{SafeDPO} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} - \alpha \cdot c(x, y_w, y_l)\right)\right]$$

where $c(x, y_w, y_l)$ is a safety cost term derived from dual preference annotations (helpfulness + safety).
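A minimal sketch of the modified loss as written in the equation above. It treats $c$ as a number supplied per pair. How $c$ is computed from the dual annotations is not specified here, so the cost values below are made up, and this should not be read as the paper's reference implementation.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def safe_dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
                  cost, beta=0.1, alpha=1.0):
    """Per-pair loss following the post's SafeDPO equation.

    `cost` stands in for c(x, y_w, y_l); its values here are hypothetical.
    """
    logratio_gap = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -log_sigmoid(beta * logratio_gap - alpha * cost)

logps = dict(policy_logp_w=-10.0, policy_logp_l=-18.0,
             ref_logp_w=-12.0, ref_logp_l=-15.0)

loss_no_cost = safe_dpo_loss(**logps, cost=0.0)                # same as plain DPO
loss_with_cost = safe_dpo_loss(**logps, cost=1.0)              # alpha = 1.0
loss_small_alpha = safe_dpo_loss(**logps, cost=1.0, alpha=0.1)

print(loss_no_cost, loss_small_alpha, loss_with_cost)
```

With `cost=0.0` the expression is exactly the DPO loss. A positive cost shifts the argument of the sigmoid down, so the same log-ratio gap yields a larger loss and the optimizer has to separate the pair further. `alpha` scales how strong that push is.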

One extra hyperparameter ($\alpha$). No auxiliary networks. No cost models. No major architectural changes. Just a principled adjustment to the loss that balances helpfulness and safety.

The paper reports that SafeDPO matches or exceeds standard DPO on helpfulness benchmarks while significantly improving safety metrics. The safety-helpfulness trade-off becomes tunable via $\alpha$.

RePO: Per-Prompt Safety Guarantees

Rectified Policy Optimization (NeurIPS 2025) attacks Problem 3 — the average vs. worst-case issue.

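Standard safety-constrained RLHF uses an expected safety constraint: $\mathbb{E}[c(x, y)] \leq \epsilon$, where $c(x, y)$ is the safety cost of a single response $y$ to prompt $x$ and the expectation runs over both prompts and responses. Because the constraint is an average, the model can be very unsafe on some prompts as long as it’s safe on average.

RePO replaces this with a per-prompt constraint: for every prompt $x$, the expected safety cost over responses to that prompt, $\mathbb{E}_{y \sim \pi_\theta(\cdot|x)}[c(x, y)]$, must stay within the threshold $\epsilon$. This is much stricter but also much more aligned with what we actually want from a safe model.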

The challenge: per-prompt constraints are hard to enforce across the entire input distribution. RePO uses a rectification approach that identifies and upweights the prompts where safety violations are most likely, effectively focusing optimization effort where it matters most.
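One way to picture "identify and upweight the violating prompts" is a rectified (ReLU-style) weight on each prompt's constraint violation. The sketch below is a loose illustration of that idea and not RePO's actual algorithm. The function name, the prompt names, the cost estimates, and the threshold are all hypothetical.

```python
def rectified_weights(per_prompt_cost, epsilon):
    """Weight each prompt by how far its estimated cost exceeds epsilon.

    Prompts that satisfy the constraint get weight 0; the rest share a
    total weight of 1 in proportion to their violation.
    """
    violations = {p: max(0.0, c - epsilon) for p, c in per_prompt_cost.items()}
    total = sum(violations.values())
    if total == 0.0:
        return {p: 0.0 for p in per_prompt_cost}
    return {p: v / total for p, v in violations.items()}

# Hypothetical estimates of expected safety cost per prompt.
estimated_cost = {
    "prompt_a": 0.01,
    "prompt_b": 0.02,
    "prompt_c": 0.40,
    "prompt_d": 0.10,
}
weights = rectified_weights(estimated_cost, epsilon=0.05)

for prompt, weight in weights.items():
    print(f"{prompt}: {weight:.3f}")
```

Prompts already within the threshold contribute nothing, so the safety pressure goes entirely to the prompts that still violate it, with the worst offender getting the largest share.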

Token-Importance Guided DPO

Standard DPO treats all tokens in a response equally when computing log-probabilities. But not all tokens matter equally for preference. The token that makes a response harmful might be one specific word in an otherwise fine response.

Token-Importance DPO (ICLR 2026) assigns learned weights to each token position, so the preference signal is concentrated on the tokens that actually differ between preferred and dispreferred responses. This reduces noise in the gradient and improves both helpfulness and safety.
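To make the effect of token weights concrete, here is a small sketch of a weighted sequence log-ratio. In the method described above the weights are learned. Here they are set by hand, and every name and number is illustrative.

```python
def weighted_logratio(policy_token_logps, ref_token_logps, weights):
    """Sum of per-token log-ratios, each scaled by an importance weight."""
    if not (len(policy_token_logps) == len(ref_token_logps) == len(weights)):
        raise ValueError("all inputs must have one entry per token")
    return sum(
        w * (p - r)
        for p, r, w in zip(policy_token_logps, ref_token_logps, weights)
    )

# Four tokens; only the third carries a large policy/reference difference.
policy = [-1.1, -0.4, -4.0, -0.8]
ref = [-1.0, -0.5, -2.0, -0.7]

uniform = [1.0, 1.0, 1.0, 1.0]   # standard DPO: every token counts equally
focused = [0.1, 0.1, 3.7, 0.1]   # hand-set weights with the same total (4.0)

print(weighted_logratio(policy, ref, uniform))
print(weighted_logratio(policy, ref, focused))
```

With uniform weights the small differences on the other three tokens are mixed into the signal at full strength. With focused weights almost all of the log-ratio comes from the one token that matters, which is the noise reduction the method is after.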

Nash DPO: Multi-Agent Preferences

Multiplayer Nash Preference Optimization (ICLR 2026) extends DPO to a multi-agent setting. Instead of binary preferences (A > B), it considers preferences from multiple annotators who may disagree.

The solution: find the Nash equilibrium of a preference game, where the policy balances competing preferences. This connects to the pluralistic alignment theme — different users have different values, and a single DPO loss can’t represent all of them.
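To give a feel for what "Nash equilibrium of a preference game" means, here is a generic two-player toy version, not the paper's multiplayer algorithm. It assumes a made-up preference matrix over three candidate responses with cyclic preferences (no response beats both others) and approximates the equilibrium with multiplicative-weights self-play. All names are hypothetical.

```python
import math

def nash_by_self_play(pref, steps=5000, lr=0.1):
    """Approximate the symmetric Nash equilibrium of a preference game.

    pref[i][j] is the probability that response i is preferred to response j.
    Runs multiplicative-weights self-play and returns the time-averaged policy.
    """
    n = len(pref)
    payoff = [[pref[i][j] - 0.5 for j in range(n)] for i in range(n)]
    cumulative = [0.0] * n   # summed payoff of each response so far
    average = [0.0] * n      # running average of the policies played
    for _ in range(steps):
        top = max(cumulative)  # subtract the max for numerical stability
        unnorm = [math.exp(lr * (c - top)) for c in cumulative]
        total = sum(unnorm)
        policy = [u / total for u in unnorm]
        gains = [sum(payoff[i][j] * policy[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            average[i] += policy[i] / steps
            cumulative[i] += gains[i]
    return average

def exploitability(pref, policy):
    """Largest edge any single response has against `policy` (0 at equilibrium)."""
    n = len(pref)
    return max(
        sum((pref[i][j] - 0.5) * policy[j] for j in range(n)) for i in range(n)
    )

# Cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0.
pref = [
    [0.5, 0.9, 0.3],
    [0.1, 0.5, 0.8],
    [0.7, 0.2, 0.5],
]
equilibrium = nash_by_self_play(pref)
print([round(p, 3) for p in equilibrium], exploitability(pref, equilibrium))
```

For this matrix the exact equilibrium is the mixed policy (1/3, 2/9, 4/9), and the averaged self-play policy should land close to it. No single response is a stable "winner" here, which is the situation a plain pairwise loss has no good answer for.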

Other Notable Variants

Variant	Key Idea	Source
Semi-Supervised DPO	Uses limited labeled + abundant unlabeled preference data	ICLR 2026
Learning from Reference Answers	Replaces binary preferences with reference completions	ICLR 2026
Learning from Noisy Preferences	Robust to mislabeled preference pairs	ICLR 2026
Alignment-Weighted DPO	Weights DPO loss by alignment quality per example	ICLR 2026
Less is More (Data Selection)	Strategic selection of preference data beats using all of it	NeurIPS 2025

The Comparison

Method	Fixes	Complexity vs. DPO	Safety-Aware?
DPO (baseline)	—	Baseline	No
SafeDPO	Safety-helpfulness trade-off	+1 hyperparameter	Yes (dual annotations)
RePO	Average vs. worst-case	Moderate	Yes (per-prompt)
Token-Importance DPO	Token-level noise	Moderate	Indirectly
Nash DPO	Multi-annotator disagreement	Higher	Indirectly (pluralistic)
Data Selection	Data quality	Minimal	Depends on data

What I Take Away

The DPO story in 2025-2026 follows a familiar pattern in ML:

A clean, elegant method appears (DPO, 2023)
People adopt it widely because it’s simpler than the alternative (PPO-based RLHF with a separate reward model)
Edge cases and failure modes emerge at scale
The community produces a dozen fixes, each addressing a different failure mode
A few fixes become standard (SafeDPO is my bet for safety-aware applications)

The broader lesson: alignment techniques that optimize a single objective (helpfulness preference) will always struggle with safety. Safety needs to be in the objective explicitly, not as a side effect of “being helpful.” SafeDPO’s approach — adding a safety term with minimal machinery — seems like the right level of intervention.

The even broader question: are these DPO fixes enough, or do we need fundamentally different approaches for safety? The divergence estimation view from the RLHF post suggests that better estimators could help. But better estimators of what? That’s where the pluralistic alignment work comes in — and we’ll cover that soon.
