What This Is
I’ve been going through the proceedings of NeurIPS 2025 and ICLR 2026 to map out what the research community is working on in LLM safety, alignment, RLHF, and related areas. This post is basically my reading list, organized by conference and topic, with personal notes on the papers I think are most important.
ICML 2026 (July, Seoul) hasn’t published its accepted papers yet, so I’ll update this when that drops.
If you just want the top picks, skip to the curated reading list at the bottom.
NeurIPS 2025
Best Papers & Award Winners
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (Best Paper)
This one is interesting because it shows that RLHF reduces output diversity. They built a benchmark of 26K queries with 31K dense human annotations and found that reward models are miscalibrated against diverse human preferences. So when we align models to be “helpful”, we’re actually making them converge to a narrow band of responses. You may wonder if that’s necessarily bad — and the answer is, it depends on what you’re optimizing for. But it’s something we should at least be aware of.
Gated Attention for Large Language Models (Best Paper)
Adds a head-specific sigmoid gate after scaled dot-product attention. Tested across 30+ model variants. More of an architecture contribution than safety, but it touches on how attention mechanisms can be made more controllable.
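To make the gating idea concrete, here is a minimal single-head sketch in plain Python. It is my own illustration, not the paper's code: I'm assuming the gate is an elementwise sigmoid computed from the token's hidden state `x` through a per-head matrix `W_gate`, and all names (`sdpa`, `gated_head`, `W_gate`) are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    # W is a list of rows; returns W @ x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sdpa(q, keys, values):
    """Scaled dot-product attention for one query vector in one head."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def gated_head(x, q, keys, values, W_gate):
    """Head output multiplied elementwise by a sigmoid gate (assumed form)."""
    attn_out = sdpa(q, keys, values)
    gate = [sigmoid(g) for g in matvec(W_gate, x)]
    return [g * o for g, o in zip(gate, attn_out)]

x = [1.0, -1.0]                      # hidden state of the current token
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

plain = sdpa(q, keys, values)
half_open = gated_head(x, q, keys, values, [[0.0, 0.0], [0.0, 0.0]])
closed = gated_head(x, q, keys, values, [[-100.0, 0.0], [-100.0, 0.0]])
print(plain, half_open, closed)
```

With a zero gate matrix every gate value is 0.5, so the head output is simply halved. With a strongly negative gate the head is effectively switched off for this token, which is the "controllable" part.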
Does Reinforcement Learning Really Incentivize Reasoning Capabilities?
This is the contrarian paper I keep coming back to. They show that RLVR (reinforcement learning with verifiable rewards) improves sampling efficiency without actually expanding reasoning capacity. So RL makes models better at finding good answers they could already produce, but doesn’t teach them to solve problems the base model couldn’t. If this holds up, it means current RL methods haven’t even scratched the surface of what’s possible.
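One common way to separate "sampling efficiency" from "capacity" is to compare pass@k at small and large k. Here is a small sketch using the standard unbiased pass@k estimator. The per-problem counts are made up by me purely to illustrate the pattern, and `mean_pass_at_k` is a hypothetical helper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples drawn, c of them correct, budget k <= n."""
    if k > n:
        raise ValueError("k must not exceed n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical numbers: correct samples out of 256 on two problems.
n = 256
base_counts = [64, 2]   # base model: rarely right on problem 2, but not never
rl_counts = [200, 0]    # RL model: much sharper on problem 1, never solves problem 2

for k in (1, 256):
    print(k, mean_pass_at_k(base_counts, n, k), mean_pass_at_k(rl_counts, n, k))
```

In this toy setup the RL model wins clearly at k=1, while the base model wins at k=256 because it still covers the second problem. That crossover is the shape of the argument as I read it: better at sampling, not broader in what it can solve.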
Safety & Alignment
LLM Safety Alignment is Divergence Estimation in Disguise
Probably my favorite theory paper from this cycle. Shows that RLHF, DPO, and related methods are all doing the same thing — estimating divergence between safe and unsafe output distributions. They propose a KLDO variant based on KL divergence. The unifying perspective is really clean.
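For intuition on why a preference loss can look like divergence estimation, it helps to see how much the standard DPO loss resembles a binary classifier between two distributions. The sketch below is plain DPO as usually written, not the paper's KLDO variant, and the log-probability values in the example are placeholders I picked.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (policy-vs-reference log-ratio gap)) for one pair."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return softplus(-margin)

# Policy identical to the reference: the "classifier" is at chance.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))   # log(2)

# Policy shifts mass toward the chosen response: loss drops.
print(dpo_loss(-1.0, -9.0, -5.0, -5.0))
```

The loss is a logistic loss on a log-ratio, so it only goes down when the policy separates chosen from rejected responses relative to the reference. That separation is what the divergence view is measuring.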
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
They identified that roughly 5% of neurons in a model are responsible for safety behavior. Patching these “safety neurons” restores >90% of safety performance. This is huge for interpretability — it means safety isn’t spread diffusely across the whole network, it’s concentrated. And that concentration is both a strength (we can study it) and a vulnerability (we can attack it).
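Here is a toy sketch of what "patching" means in this context: copy a small set of activations from the aligned model into the unaligned one and leave everything else alone. The selection rule (largest activation change) is my simplification and not necessarily how the paper picks neurons. Both function names are hypothetical.

```python
def top_fraction_by_change(base_acts, aligned_acts, fraction=0.05):
    """Indices of the neurons whose activation differs most between models."""
    n_keep = max(1, int(len(base_acts) * fraction))
    ranked = sorted(
        range(len(base_acts)),
        key=lambda i: abs(aligned_acts[i] - base_acts[i]),
        reverse=True,
    )
    return sorted(ranked[:n_keep])

def patch_activations(base_acts, aligned_acts, neuron_ids):
    """Base activations with the selected neurons overwritten by aligned values."""
    patched = list(base_acts)
    for i in neuron_ids:
        patched[i] = aligned_acts[i]
    return patched

# Toy layer with 20 neurons: the aligned model differs sharply at one of them.
base = [0.0] * 20
aligned = [0.1] * 20
aligned[7] = 3.0

ids = top_fraction_by_change(base, aligned, fraction=0.05)
patched = patch_activations(base, aligned, ids)
print(ids, patched[7], patched[0])
```

In a real experiment you would run the patched activations forward and measure how much of the refusal behavior comes back. The same few lines also show the attack surface: zeroing those indices instead of copying them is just as easy.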
Lifelong Safety Alignment for Language Models
Frames alignment as a competitive game: a Meta-Attacker that discovers jailbreaks vs. a Defender that resists them. Continuous adaptation to evolving strategies. Interesting framing but I’m not sure how practical it is — the arms race dynamic seems expensive.
Safe RLHF-V (PKU-Alignment)
First multimodal safety alignment framework. Introduces BeaverTails-V dataset with dual preference annotations (helpfulness + safety) and a Beaver-Guard-V multi-level guardrail system. This matters because as models become multimodal, single-modality safety approaches won’t be enough.
Rectified Policy Optimization (RePO)
Replaces expected safety constraints with critical per-prompt constraints. The key insight: average-case safety isn’t good enough — you need worst-case guarantees per prompt. Makes sense intuitively.
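A quick numeric sketch of why the average-case constraint can hide unsafe prompts. This is only an illustration of the expected versus per-prompt distinction, not RePO's actual objective. The cost values and the `limit` threshold are invented.

```python
def expected_violation(costs, limit=0.0):
    """Violation of the constraint 'mean cost <= limit'."""
    mean_cost = sum(costs) / len(costs)
    return max(0.0, mean_cost - limit)

def rectified_violation(costs, limit=0.0):
    """Mean of per-prompt violations: each prompt is clipped before averaging."""
    return sum(max(0.0, c - limit) for c in costs) / len(costs)

# Hypothetical safety costs for three prompts (positive = unsafe).
costs = [-1.0, -1.0, 2.0]

print(expected_violation(costs))   # 0.0: the two safe prompts cancel the unsafe one
print(rectified_violation(costs))  # about 0.667: the unsafe prompt still counts
```

Clipping before averaging means very safe responses on some prompts can no longer pay for an unsafe response on another.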
Preference Optimization & RLHF
Less is More: Improving LLM Alignment via Preference Data Selection
Strategic data selection for DPO training improves performance while reducing computation. Quality over quantity for preference data — not surprising but good to have the empirical validation.
Greedy Sampling Is Provably Efficient For RLHF
Theory paper. If you’re into the math of RLHF sampling strategies, this is your paper.
Provably Efficient Online RLHF with One-Pass Reward Modeling
Addresses the computational cost of continuously integrating new data + re-optimizing reward models. One-pass approach. Practical if you’re deploying RLHF at scale.
Mechanistic Interpretability & SAEs
SAEs (Sparse Autoencoders) dominated the interpretability track this year:
- Revising and Falsifying SAE Feature Explanations — Improving the quality of what SAE features actually mean
- Measuring SAE Feature Sensitivity — How reliably do SAE features activate on similar inputs?
- SAE Neural Operators — Generalizes SAEs to infinite-dimensional function spaces. Ambitious.
- One-Step SAEs for Diffusion Models — Extends SAE interpretability beyond language to image generation (SDXL Turbo)
- SAEs for Pathology Foundation Models — SAE features map to interpretable biological concepts. Strong correlations with cell type counts.
- Feature Absorption in SAEs — Hierarchical features cause absorption problems during optimization. Varying SAE size/sparsity doesn’t fix it.
- Matching Pursuit SAE — New architecture using Matching Pursuit for hierarchical features. Reveals geometric assumptions in existing SAE designs.
My take: SAEs are becoming the Swiss Army knife of interpretability. But the absorption and sensitivity issues suggest we’re still in the “figuring out the tool” phase, not the “using the tool reliably” phase.
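If you haven't worked with SAEs, this is roughly what the tool looks like: a minimal forward pass and loss in plain Python. The weights are hand-picked toy values so the result is easy to check. Real SAEs learn them, and the exact architecture and sparsity penalty differ across the papers above.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode to a wider, sparse feature vector, then linearly decode."""
    features = relu([a + b for a, b in zip(matvec(W_enc, x), b_enc)])
    x_hat = [a + b for a, b in zip(matvec(W_dec, features), b_dec)]
    return features, x_hat

def sae_loss(x, x_hat, features, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features to zero."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_coeff * sum(abs(f) for f in features)

# Toy dictionary: 2-d activations, 4 features (+x, -x, +y, -y directions).
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [2.0, -3.0]
features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(features, x_hat, sae_loss(x, x_hat, features))
```

Only two of the four features fire and the reconstruction is exact. The open questions above are about what happens when the learned features don't split this cleanly.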
Reasoning, CoT & Agents
- Decision Pivots for CoT Verification — Identifies minimal checkpoints that any correct reasoning path must visit. Interesting for verification.
- Latent Reasoning Models / CODI — Reasoning in continuous hidden space via self-distillation. Efficient but less interpretable.
- Self-Correction in Long CoT — Models do redundant reasoning; first step dominates. So longer chains don’t always help.
- RL with Model-rewarded Thinking (RLMT) — Online RL with preference-based reward models. Outperforms standard RLHF across DPO, PPO, GRPO.
- Multi-Agent Reasoning (Game-Theoretic) — Non-zero-sum game between base agents + critical agent. Uncertainty-aware collaboration.
- R1-Zero for GUI Grounding — Online RL + CoT reasoning for computer-use agents. Finds that longer chains lead to worse performance. Confirms the self-correction finding above.
Attention & Feature Interpretation
- Sparse Attention Emergence — Timing follows power laws based on task/architecture/optimizer. Oral paper.
- Attention Head Specialization — Individual heads specialize in semantic/visual attributes. Gives interpretable, controllable structure.
ICLR 2026
Outstanding Papers
SafeDPO: A Simple Approach to DPO with Enhanced Safety (Outstanding)
Balances helpfulness + safety without auxiliary networks or cost models. Single additional hyperparameter. Minimal modifications to standard DPO. This is the kind of paper I love — takes a real problem and solves it with minimal machinery. If you read one DPO paper this year, read this.
Q-RAG: Multi-Step Retrieval via RL-Trained Embedders (Outstanding)
Value-based RL for training retrieval embedders on long contexts. The evolution from single-hop to multi-step RAG. Important for practical systems.
AgentFlow: Agentic Framework with Flow-GRPO (Outstanding)
This is wild. A trainable agentic system (planner, executor, verifier, generator) using Flow-based Group Refined Policy Optimization for sparse-reward credit assignment. A 7B parameter backbone beats GPT-4o on search, math, and science reasoning. If this replicates, it’s a strong signal that small models + good RL + agent architecture can compete with giant models.
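The "group" part of GRPO-style methods is easy to show in a few lines: sample several rollouts for the same task, then score each one relative to its group instead of training a value network. This sketch covers only that generic normalization step. It does not reproduce the Flow-specific parts of the paper, and the rewards are made up.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts of the same task with a sparse 0/1 outcome reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every rollout gets the same reward, there is no learning signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

With sparse rewards this is the whole credit signal: successful rollouts get a positive advantage, failed ones get a negative advantage, and a group with identical outcomes contributes nothing.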
Common Corpus: Ethical LLM Pre-training Data (Outstanding)
Addresses data bias and ethics in pre-training. Important groundwork even if less flashy.
Transformers are Inherently Succinct (Outstanding)
A theoretical account of why transformers work so well. Worth reading if you like theory papers.
LLMs Get Lost In Multi-Turn Conversation (Outstanding)
Demonstrates that performance declines in multi-turn conversations when instructions are underspecified. Very relevant for real-world deployment where conversations are messy and long.
DPO & Preference Optimization
This was a major theme — lots of people finding problems with DPO and proposing fixes:
| Paper | Key Finding |
|---|---|
| Why DPO is a Misspecified Estimator (Oral) | Exposes statistical flaw: preference reversals + reward degradation |
| Token-Importance Guided DPO | Refines DPO with token-level weighting |
| Multiplayer Nash Preference Optimization | Extends preference optimization to multi-agent setting |
| Semi-Supervised Preference with Limited Feedback | Data efficiency for preference-based training |
| Learning from Reference Answers | Alternative to binary preferences |
| Learning Correlated Reward Models | Multi-objective reward modeling with correlations |
The trend is clear: original DPO had real problems, and the community is fixing them from multiple angles — statistical, multi-agent, data-efficient, multi-objective.
Safety & Alignment
| Paper | Focus |
|---|---|
| Rethinking Deep Safety Alignment | Balances harmlessness + helpfulness |
| Alignment-Weighted DPO | Weighted approach for safety |
| Invisible Safety Threat: Malicious Finetuning via Steganography | Security vulnerabilities in fine-tuning. Scary. |
| Causal Intervention for Vulnerability Analysis | Shows shallow alignment enables jailbreaks. Fine-tuning on CoT datasets encourages principled refusals. |
The steganography paper is worth highlighting — it shows that fine-tuning can be weaponized in ways that are hard to detect. Not great news for open-weight model safety.
Reasoning & Chain-of-Thought
| Paper | Key Finding |
|---|---|
| Your Base Model is Smarter Than You Think | Sampling-based reasoning strategies unlock latent capability |
| Verifying CoT via Computational Graphs | Graph-based reasoning validation |
| Detecting Implicit Reward Hacking | Measures reasoning effort to identify deceptive reasoning |
| LoongRL | RL reasoning over long contexts |
| The Art of Scaling RL Compute for LLMs | Optimal compute allocation for RL training |
| TROLL | Trust region methods for stable RL in language models |
Mechanistic Interpretability
- SAEs for Code Correctness — Identifies directions corresponding to code correctness in LLMs
- Mech Interp of In-Context Learning — Finds “common structures” in transformer QK circuits
- Tracking Equivalent Mech Interp Across Networks — Framework for discovering succinct algorithms
- Small Transformers Don’t Need LayerNorm at Inference — LN-free analogs enable more precise mechanistic analysis
- Is Mechanistic Interpretability Identifiable? — Fundamental question about whether we can uniquely identify mechanisms. Important.
Research Trends Across Both Conferences
Looking at these together, some clear patterns:
DPO is getting fixed from all directions. Multiple papers identify statistical, practical, and safety issues. SafeDPO is the cleanest solution so far.
SAEs are everywhere. 10+ papers across both conferences. The tool is gaining adoption but still has fundamental issues (absorption, sensitivity, identifiability).
RL for reasoning is complicated. Evidence that RL improves sampling efficiency but may not expand actual reasoning capacity. Longer chains don’t always help.
Multimodal safety is just starting. Safe RLHF-V is the first real multimodal safety framework. Expect this to explode next year.
Pluralistic alignment is emerging. Moving beyond single-objective “helpful and harmless” toward diverse values and personalization.
Agents + RL intersection. AgentFlow showing 7B beats GPT-4o with the right architecture. Small + smart > big + dumb.
Curated Reading List
Read These First
| Paper | Venue | Why |
|---|---|---|
| SafeDPO | ICLR 2026 | Fixes DPO for safety with one extra hyperparameter. Clean and practical. |
| Why DPO is Misspecified | ICLR 2026 | Understand the problem before the solution. |
| Safety Alignment is Divergence Estimation | NeurIPS 2025 | Unifying theory for RLHF/DPO/etc. |
| Safety Neurons | NeurIPS 2025 | 5% of neurons → 90% of safety. Huge for interpretability. |
| Does RL Incentivize Reasoning? | NeurIPS 2025 | Contrarian finding. Changes how you think about RL for LLMs. |
| AgentFlow | ICLR 2026 | 7B beats GPT-4o. RL + agents > scale. |
Second Priority
| Paper | Venue | Why |
|---|---|---|
| TROLL | ICLR 2026 | Stable RL training for LLMs. Practical. |
| LoongRL | ICLR 2026 | RL + long context reasoning. |
| Artificial Hivemind | NeurIPS 2025 | RLHF reduces diversity. Best Paper for a reason. |
| SAE Feature Absorption | NeurIPS 2025 | Fundamental limitation of current SAEs. |
| Is Mech Interp Identifiable? | ICLR 2026 | Existential question for the field. |
| Safe RLHF-V | NeurIPS 2025 | First multimodal safety framework. |