Paper Roundup: LLM Safety & RLHF at NeurIPS 2025 and ICLR 2026

A curated list of papers on alignment, preference optimization, mechanistic interpretability, and reasoning from the two biggest ML conferences this cycle — with personal takes on the ones that matter most.

April 29, 2026 | Estimated reading time: 9 min | 1784 words | Author: khanhnn

SafeDPO and Friends: Preference Optimization That Doesn't Sacrifice Safety

DPO has problems — preference reversals, reward degradation, and a safety-helpfulness trade-off. Here’s how SafeDPO, RePO, and other recent variants are fixing them.

March 30, 2026 | Estimated reading time: 6 min | 1067 words | Author: khanhnn

Does RL Actually Make LLMs Reason Better?

The evidence is more complicated than the hype suggests. RL improves sampling efficiency but may not expand reasoning capacity — and longer chains of thought don’t always help.

March 28, 2026 | Estimated reading time: 6 min | 1098 words | Author: khanhnn

RLHF Is Just Divergence Estimation in Disguise

A unifying view of RLHF, DPO, and Constitutional AI — they’re all estimating the divergence between safe and unsafe output distributions. Plus a clean derivation of why DPO works.

March 22, 2026 | Estimated reading time: 6 min | 1137 words | Author: khanhnn

From Policy Gradient to PPO — Part 2: Trust Regions, PPO, and GRPO

How trust regions stabilize policy optimization, why PPO became the default for RLHF, and how GRPO eliminates the critic entirely.

March 18, 2026 | Estimated reading time: 7 min | 1280 words | Author: khanhnn

From Policy Gradient to PPO — Part 1: Foundations

MDPs, value functions, the REINFORCE algorithm, actor-critic methods, and generalized advantage estimation — the RL foundations you need before understanding RLHF.

March 5, 2026 | Estimated reading time: 6 min | 1153 words | Author: khanhnn

A Curated Guide to LLMs, Reinforcement Learning, and AI Safety

Books, papers, conferences, and researchers — a personal resource list for anyone going deep into LLMs, RL, and AI safety.

December 28, 2025 | Estimated reading time: 8 min | 1495 words | Author: khanhnn