SafeDPO and Friends: Preference Optimization That Doesn't Sacrifice Safety

DPO has problems — preference reversals, reward degradation, and a safety-helpfulness trade-off. Here’s how SafeDPO, RePO, and other recent variants are fixing them.

March 30, 2026 | Estimated reading time: 6 min | 1067 words | Author: khanhnn

RLHF Is Just Divergence Estimation in Disguise

A unifying view of RLHF, DPO, and Constitutional AI — they’re all estimating the divergence between safe and unsafe output distributions. Plus a clean derivation of why DPO works.

March 22, 2026 | Estimated reading time: 6 min | 1137 words | Author: khanhnn

From Policy Gradient to PPO — Part 2: Trust Regions, PPO, and GRPO

How trust regions stabilize policy optimization, why PPO became the default for RLHF, and how GRPO eliminates the critic entirely.

March 18, 2026 | Estimated reading time: 7 min | 1280 words | Author: khanhnn

From Policy Gradient to PPO — Part 1: Foundations

MDPs, value functions, the REINFORCE algorithm, actor-critic methods, and generalized advantage estimation — the RL foundations you need before understanding RLHF.

March 5, 2026 | Estimated reading time: 6 min | 1153 words | Author: khanhnn

VAE Variants and Modern Interpretations

A survey of where the VAE idea went after 2014 — VQ-VAE, hierarchical VAEs, adversarial hybrids, flow-based posteriors — and what the VAE really gave us beyond a specific architecture.

February 25, 2026 | Estimated reading time: 8 min | 1660 words | Author: khanhnn

Transformers from First Principles — Part 2: What Scale Reveals

Sparse attention patterns, head specialization, rotary embeddings, gated attention, and the modern efficiency tricks that make large transformers actually trainable.

February 20, 2026 | Estimated reading time: 7 min | 1487 words | Author: khanhnn

β-VAE and the Emergence of Disentanglement

A single Greek letter in front of the KL term changes what the VAE learns. We look at β-VAE as a rate-distortion trade-off, an information bottleneck, and a simple probe into disentangled representations.

February 10, 2026 | Estimated reading time: 8 min | 1623 words | Author: khanhnn

Transformers from First Principles — Part 1: Attention Is All You Need (Really)

A first-principles walkthrough of the Transformer — self-attention, positional encoding, multi-head attention — with the math that makes it work.

February 8, 2026 | Estimated reading time: 8 min | 1533 words | Author: khanhnn

Conditional VAE (CVAE): Learning to Generate with Conditions

We extend the VAE into a controllable generative model by adding a condition y to every term of the ELBO.

January 25, 2026 | Estimated reading time: 8 min | 1514 words | Author: khanhnn

Safety Neurons: 5% of Your Model Controls 90% of Safety

Mechanistic interpretability meets alignment: how researchers found that a tiny fraction of neurons is responsible for almost all safety behavior in LLMs, and what that means.

January 18, 2026 | Estimated reading time: 5 min | 1033 words | Author: khanhnn