Paper Roundup: LLM Safety & RLHF at NeurIPS 2025 and ICLR 2026

A curated list of papers on alignment, preference optimization, mechanistic interpretability, and reasoning from the two biggest ML conferences this cycle — with personal takes on the ones that matter most.

April 29, 2026 | Estimated reading time: 9 min | 1784 words | Author: khanhnn

Pluralistic Alignment: One Model, Many Values

RLHF optimizes for an average human preference — but humans disagree. The Artificial Hivemind problem, counterfactual alignment, and why one-size-fits-all safety is a design choice we should question.

April 15, 2026 | Estimated reading time: 6 min | 1114 words | Author: khanhnn

SafeDPO and Friends: Preference Optimization That Doesn't Sacrifice Safety

DPO has problems — preference reversals, reward degradation, and a safety-helpfulness trade-off. Here’s how SafeDPO, RePO, and other recent variants are fixing them.

March 30, 2026 | Estimated reading time: 6 min | 1067 words | Author: khanhnn

RLHF Is Just Divergence Estimation in Disguise

A unifying view of RLHF, DPO, and Constitutional AI — they’re all estimating the divergence between safe and unsafe output distributions. Plus a clean derivation of why DPO works.

March 22, 2026 | Estimated reading time: 6 min | 1137 words | Author: khanhnn

Safety Neurons: 5% of Your Model Controls 90% of Safety

Mechanistic interpretability meets alignment — how researchers found that a tiny fraction of neurons are responsible for almost all safety behavior in LLMs, and what that means.

January 18, 2026 | Estimated reading time: 5 min | 1033 words | Author: khanhnn

A Curated Guide to LLMs, Reinforcement Learning, and AI Safety

Books, papers, conferences, and researchers — a personal resource list for anyone going deep into LLMs, RL, and AI safety.

December 28, 2025 | Estimated reading time: 8 min | 1495 words | Author: khanhnn