Paper Roundup: LLM Safety & RLHF at NeurIPS 2025 and ICLR 2026

A curated list of papers on alignment, preference optimization, mechanistic interpretability, and reasoning from the two biggest ML conferences this cycle — with personal takes on the ones that matter most.

April 29, 2026 | Estimated reading time: 9 min | 1784 words | Author: khanhnn

Pluralistic Alignment: One Model, Many Values

RLHF optimizes for an average human preference — but humans disagree. The Artificial Hivemind problem, counterfactual alignment, and why one-size-fits-all safety is a design choice we should question.

April 15, 2026 | Estimated reading time: 6 min | 1114 words | Author: khanhnn

Sparse Autoencoders: The Swiss Army Knife of Interpretability

SAEs went from niche interpretability tool to dominant research theme in one year. Where they’re being applied, what they reveal, and the fundamental limitations nobody has solved yet.

April 8, 2026 | Estimated reading time: 6 min | 1233 words | Author: khanhnn

SafeDPO and Friends: Preference Optimization That Doesn’t Sacrifice Safety

DPO has problems — preference reversals, reward degradation, and a safety-helpfulness trade-off. Here’s how SafeDPO, RePO, and other recent variants are fixing them.

March 30, 2026 | Estimated reading time: 6 min | 1067 words | Author: khanhnn

Does RL Actually Make LLMs Reason Better?

The evidence is more complicated than the hype suggests. RL improves sampling efficiency but may not expand reasoning capacity — and longer chains of thought don’t always help.

March 28, 2026 | Estimated reading time: 6 min | 1098 words | Author: khanhnn

RLHF Is Just Divergence Estimation in Disguise

A unifying view of RLHF, DPO, and Constitutional AI — they’re all estimating the divergence between safe and unsafe output distributions. Plus a clean derivation of why DPO works.

March 22, 2026 | Estimated reading time: 6 min | 1137 words | Author: khanhnn

Transformers from First Principles — Part 2: What Scale Reveals

Sparse attention patterns, head specialization, rotary embeddings, gated attention, and the modern efficiency tricks that make large transformers actually trainable.

February 20, 2026 | Estimated reading time: 7 min | 1487 words | Author: khanhnn

Transformers from First Principles — Part 1: Attention Is All You Need (Really)

A first-principles walkthrough of the Transformer — self-attention, positional encoding, multi-head attention — with the math that makes it work.

February 8, 2026 | Estimated reading time: 8 min | 1533 words | Author: khanhnn

A Curated Guide to LLMs, Reinforcement Learning, and AI Safety

Books, papers, conferences, and researchers — a personal resource list for anyone going deep into LLMs, RL, and AI safety.

December 28, 2025 | Estimated reading time: 8 min | 1495 words | Author: khanhnn