Paper Roundup: LLM Safety & RLHF at NeurIPS 2025 and ICLR 2026
A curated list of papers on alignment, preference optimization, mechanistic interpretability, and reasoning from the two biggest ML conferences this cycle — with personal takes on the ones that matter most.
Ara: What If Research Papers Were Executable?
A deep look at Agent-Native Research Artifacts (Ara) — a proposed replacement for academic PDFs that packages research as machine-executable knowledge bundles. What it gets right, what it gets wrong, and why it matters for AI-assisted research.
Pluralistic Alignment: One Model, Many Values
RLHF optimizes for an average human preference — but humans disagree. The Artificial Hivemind problem, counterfactual alignment, and why one-size-fits-all safety is a design choice we should question.
Sparse Autoencoders: The Swiss Army Knife of Interpretability
SAEs went from niche interpretability tool to dominant research theme in one year. Where they’re being applied, what they reveal, and the fundamental limitations nobody has solved yet.
SafeDPO and Friends: Preference Optimization That Doesn't Sacrifice Safety
DPO has problems: preference reversals, reward degradation, and a safety-helpfulness trade-off. Here’s how SafeDPO, RePO, and other recent variants are fixing them.
Does RL Actually Make LLMs Reason Better?
The evidence is more complicated than the hype suggests. RL improves sampling efficiency but may not expand reasoning capacity — and longer chains of thought don’t always help.
RLHF Is Just Divergence Estimation in Disguise
A unifying view of RLHF, DPO, and Constitutional AI: all three estimate the divergence between safe and unsafe output distributions. Plus a clean derivation of why DPO works.
From Policy Gradient to PPO — Part 2: Trust Regions, PPO, and GRPO
How trust regions stabilize policy optimization, why PPO became the default for RLHF, and how GRPO eliminates the critic entirely.
From Policy Gradient to PPO — Part 1: Foundations
MDPs, value functions, the REINFORCE algorithm, actor-critic methods, and generalized advantage estimation — the RL foundations you need before understanding RLHF.
VAE Variants and Modern Interpretations
A survey of where the VAE idea went after 2014 — VQ-VAE, hierarchical VAEs, adversarial hybrids, flow-based posteriors — and what the VAE really gave us beyond a specific architecture.