SafeDPO and Friends: Preference Optimization That Doesn't Sacrifice Safety
DPO has problems — preference reversals, reward degradation, and a safety-helpfulness trade-off. Here’s how SafeDPO, RePO, and other recent variants are fixing them.
A unifying view of RLHF, DPO, and Constitutional AI — they’re all estimating the divergence between safe and unsafe output distributions. Plus a clean derivation of why DPO works.
How trust regions stabilize policy optimization, why PPO became the default for RLHF, and how GRPO eliminates the critic entirely.
MDPs, value functions, the REINFORCE algorithm, actor-critic methods, and generalized advantage estimation — the RL foundations you need before understanding RLHF.
A survey of where the VAE idea went after 2014 — VQ-VAE, hierarchical VAEs, adversarial hybrids, flow-based posteriors — and what the VAE really gave us beyond a specific architecture.
Sparse attention patterns, head specialization, rotary embeddings, gated attention, and the modern efficiency tricks that make large transformers actually trainable.
A single Greek letter in front of the KL term changes what the VAE learns. We look at β-VAE as a rate-distortion trade-off, an information bottleneck, and a simple probe into disentangled representations.
A first-principles walkthrough of the Transformer — self-attention, positional encoding, multi-head attention — with the math that makes it work.
We extend the VAE into a controllable generative model by adding a condition y into every term of the ELBO.
Mechanistic interpretability meets alignment — how researchers found that a tiny fraction of neurons are responsible for almost all safety behavior in LLMs, and what that means.