Paper Roundup: LLM Safety & RLHF at NeurIPS 2025 and ICLR 2026

What This Is

I’ve been going through the proceedings of NeurIPS 2025 and ICLR 2026 to map out what the research community is working on in LLM safety, alignment, RLHF, and related areas. This post is basically my reading list, organized by conference and topic, with personal notes on the papers I think are most important.

ICML 2026 (July, Seoul) hasn’t published its accepted papers yet, so I’ll update this when that drops.

If you just want the top picks, skip to the curated reading list at the bottom.

NeurIPS 2025

Best Papers & Award Winners

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (Best Paper)

This one is interesting because it shows that RLHF reduces output diversity. They built a benchmark of 26K queries with 31K dense human annotations and found that reward models are miscalibrated against diverse human preferences. So when we align models to be “helpful”, we’re actually making them converge to a narrow band of responses. You may wonder if that’s necessarily bad — and the answer is, it depends on what you’re optimizing for. But it’s something we should at least be aware of.

Gated Attention for Large Language Models (Best Paper)

Adds a head-specific sigmoid gate after scaled dot-product attention. Tested across 30+ variants. More of an architecture contribution than safety, but it touches on how attention mechanisms can be made more controllable.
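To make the gating idea concrete, here is a minimal single-head sketch in plain Python. It assumes the gate is computed from each token's hidden state through a learned matrix (called gate_weights here, a name I made up) and multiplied elementwise into the attention output. There is no causal mask or multi-head plumbing, and this is my reading of the idea, not the paper's implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    # matrix has shape (d_out, d_in); vec has length d_in.
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def gated_attention_head(queries, keys, values, inputs, gate_weights):
    """One attention head whose output is scaled by a sigmoid gate.

    queries, keys, values: one d-dimensional vector per token.
    inputs: per-token hidden states used to compute the gate (assumption).
    gate_weights: hypothetical (d, d_model) matrix owned by this head.
    """
    d = len(queries[0])
    outputs = []
    for q, x in zip(queries, inputs):
        # Scaled dot-product attention for this query token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
        # Elementwise sigmoid gate in (0, 1), applied after attention.
        gate = [sigmoid(g) for g in matvec(gate_weights, x)]
        outputs.append([g * a for g, a in zip(gate, attended)])
    return outputs
```

With all-zero gate weights every gate value is 0.5, so the head simply returns half of the ungated attention output. A trained gate can push individual dimensions toward 0 or 1 per token.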

Does Reinforcement Learning Really Incentivize Reasoning Capabilities?

This is the contrarian paper I keep coming back to. They show that RLVR (RL with verifiable rewards) improves sampling efficiency without actually expanding reasoning capacity. So RL makes models better at finding good answers from what they already know, but doesn’t teach them to reason about new things. If this holds up, it means current RL methods haven’t even scratched the surface of what’s possible.
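One way to make "sampling efficiency versus capacity" concrete is pass@k: the chance that at least one of k sampled answers is correct. The sketch below uses the standard unbiased pass@k estimator. The two problems and their success counts are invented for illustration and are not numbers from the paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled attempts, c of them correct."""
    if n - c < k:
        # Fewer than k wrong attempts exist, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts of correct answers out of 100 samples per problem.
base_correct = {"problem_a": 5, "problem_b": 2}
rl_correct = {"problem_a": 60, "problem_b": 0}

def mean_pass_at_k(correct_counts, k, n=100):
    scores = [pass_at_k(n, c, k) for c in correct_counts.values()]
    return sum(scores) / len(scores)

for k in (1, 64):
    print(k, mean_pass_at_k(base_correct, k), mean_pass_at_k(rl_correct, k))
```

In this toy setup the RL-tuned model wins clearly at k=1, but the base model comes out ahead at k=64 because it occasionally solves the problem the RL model never gets. That is the shape of the argument: better at picking answers, no wider in what it can reach.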

Safety & Alignment

LLM Safety Alignment is Divergence Estimation in Disguise

Probably my favorite theory paper from this cycle. Shows that RLHF, DPO, and related methods are all doing the same thing — estimating divergence between safe and unsafe output distributions. They propose a KLDO variant based on KL divergence. The unifying perspective is really clean.

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

They identified that roughly 5% of neurons in a model are responsible for safety behavior. Patching these “safety neurons” restores >90% of safety performance. This is huge for interpretability — it means safety isn’t spread diffusely across the whole network, it’s concentrated. And that concentration is both a strength (we can study it) and a vulnerability (we can attack it).
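For intuition about what "patching" means here, a toy sketch: copy the activations of a few selected neurons from an aligned model into a base model's forward pass and leave everything else alone. The selection rule below (largest activation change between the two models) is my simplifying assumption, and the function names are hypothetical, not the paper's code.

```python
def top_changed_neurons(base_acts, aligned_acts, k):
    """Indices of the k neurons whose activation differs most between models."""
    diffs = [abs(a - b) for a, b in zip(aligned_acts, base_acts)]
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:k])

def patch_activations(base_acts, aligned_acts, neuron_ids):
    """Overwrite only the chosen neurons with the aligned model's values."""
    patched = list(base_acts)
    for i in neuron_ids:
        patched[i] = aligned_acts[i]
    return patched

# Toy activations for one layer on one prompt (made-up numbers).
base = [0.1, 0.9, 0.2, 0.4]
aligned = [0.1, -0.7, 0.2, 1.5]

safety_ids = top_changed_neurons(base, aligned, k=1)
patched = patch_activations(base, aligned, safety_ids)
```

In a real experiment you would run the rest of the forward pass on the patched activations and measure how much safety behavior comes back. The same mechanism is what makes the attack direction plausible: overwrite those few neurons with unsafe values instead.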

Lifelong Safety Alignment for Language Models

Frames alignment as a competitive game: a Meta-Attacker that discovers jailbreaks vs. a Defender that resists them. Continuous adaptation to evolving strategies. Interesting framing but I’m not sure how practical it is — the arms race dynamic seems expensive.

Safe RLHF-V (PKU-Alignment)

First multimodal safety alignment framework. Introduces BeaverTails-V dataset with dual preference annotations (helpfulness + safety) and a Beaver-Guard-V multi-level guardrail system. This matters because as models become multimodal, single-modality safety approaches won’t be enough.

Rectified Policy Optimization (RePO)

Replaces expected safety constraints with critical per-prompt constraints. The key insight: average-case safety isn’t good enough — you need worst-case guarantees per prompt. Makes sense intuitively.
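A tiny numeric sketch of why the average-case constraint can hide problems. It assumes each prompt has a scalar safety cost where anything above a threshold counts as a violation. The function names and the numbers are mine, and this is not RePO's actual training objective, just the contrast between the two kinds of constraint.

```python
def expected_violation(costs, threshold=0.0):
    """Violation of an average-case constraint: mean cost must stay below threshold."""
    mean_cost = sum(costs) / len(costs)
    return max(0.0, mean_cost - threshold)

def rectified_violation(costs, threshold=0.0):
    """Per-prompt version: clip each prompt's excess at zero before averaging,
    so safe prompts cannot cancel out unsafe ones."""
    return sum(max(0.0, c - threshold) for c in costs) / len(costs)

# Three comfortably safe prompts and one clearly unsafe one (made-up costs).
costs = [-1.0, -1.0, -1.0, 2.0]

avg_case = expected_violation(costs)
per_prompt = rectified_violation(costs)
```

Here avg_case is 0.0 because the mean cost is negative, while per_prompt is 0.5 because the one unsafe prompt still counts. Rectifying before averaging is what stops safe prompts from compensating for unsafe ones.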

Preference Optimization & RLHF

Less is More: Improving LLM Alignment via Preference Data Selection

Strategic data selection for DPO training improves performance while reducing computation. Quality over quantity for preference data — not surprising but good to have the empirical validation.

Greedy Sampling Is Provably Efficient For RLHF

Theory paper. If you’re into the math of RLHF sampling strategies, this is your paper.

Provably Efficient Online RLHF with One-Pass Reward Modeling

Addresses the computational cost of continuously integrating new data + re-optimizing reward models. One-pass approach. Practical if you’re deploying RLHF at scale.

Mechanistic Interpretability & SAEs

SAEs (Sparse Autoencoders) dominated the interpretability track this year:

Revising and Falsifying SAE Feature Explanations — Improving the quality of what SAE features actually mean
Measuring SAE Feature Sensitivity — How reliably do SAE features activate on similar inputs?
SAE Neural Operators — Generalizes SAEs to infinite-dimensional function spaces. Ambitious.
One-Step SAEs for Diffusion Models — Extends SAE interpretability beyond language to image generation (SDXL Turbo)
SAEs for Pathology Foundation Models — SAE features map to interpretable biological concepts. Strong correlations with cell type counts.
Feature Absorption in SAEs — Hierarchical features cause absorption problems during optimization. Varying SAE size/sparsity doesn’t fix it.
Matching Pursuit SAE — New architecture using Matching Pursuit for hierarchical features. Reveals geometric assumptions in existing SAE designs.

My take: SAEs are becoming the Swiss Army knife of interpretability. But the absorption and sensitivity issues suggest we’re still in the “figuring out the tool” phase, not the “using the tool reliably” phase.
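If you haven't worked with SAEs, here is a minimal sketch of the textbook version: a ReLU encoder, a linear decoder, and a loss that trades reconstruction error against an L1 sparsity penalty. This is the generic formulation, not the architecture of any specific paper above, and the weights and l1_coeff value are placeholders.

```python
def sae_forward(activation, enc_weights, enc_bias, dec_weights, dec_bias):
    """Encode a model activation into sparse features, then reconstruct it.

    enc_weights: (n_features, d) rows; dec_weights: (n_features, d) rows.
    """
    # ReLU encoder: most features should end up exactly zero.
    features = [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]
    # Linear decoder: reconstruction is a weighted sum of feature directions.
    recon = [
        sum(f * dec_weights[j][i] for j, f in enumerate(features)) + dec_bias[i]
        for i in range(len(activation))
    ]
    return features, recon

def sae_loss(activation, features, recon, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Toy example: 2-dimensional activation, 3 dictionary features.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
features, recon = sae_forward([1.0, -1.0], weights, [0.0, 0.0, 0.0], weights, [0.0, 0.0])
loss = sae_loss([1.0, -1.0], features, recon)
```

Both the absorption and sensitivity papers are, in different ways, about what this loss does and doesn't guarantee about the features you get out.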

Reasoning, CoT & Agents

Decision Pivots for CoT Verification — Identifies minimal checkpoints that any correct reasoning path must visit. Interesting for verification.
Latent Reasoning Models / CODI — Reasoning in continuous hidden space via self-distillation. Efficient but less interpretable.
Self-Correction in Long CoT — Models do redundant reasoning; first step dominates. So longer chains don’t always help.
RL with Model-rewarded Thinking (RLMT) — Online RL with preference-based reward models. Outperforms standard RLHF across DPO, PPO, GRPO.
Multi-Agent Reasoning (Game-Theoretic) — Non-zero-sum game between base agents + critical agent. Uncertainty-aware collaboration.
R1-Zero for GUI Grounding — Online RL + CoT reasoning for computer-use agents. Finds that longer chains lead to worse performance. Confirms the self-correction finding above.

Attention & Feature Interpretation

Sparse Attention Emergence — Timing follows power laws based on task/architecture/optimizer. Oral paper.
Attention Head Specialization — Individual heads specialize in semantic/visual attributes. Gives interpretable, controllable structure.

ICLR 2026

Outstanding Papers

SafeDPO: A Simple Approach to DPO with Enhanced Safety (Outstanding)

Balances helpfulness + safety without auxiliary networks or cost models. Single additional hyperparameter. Minimal modifications to standard DPO. This is the kind of paper I love — takes a real problem and solves it with minimal machinery. If you read one DPO paper this year, read this.
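For reference, this is the standard DPO loss for a single preference pair, which is the thing SafeDPO makes its small change to. I'm deliberately not sketching SafeDPO's own modification or its extra hyperparameter here, since I'd be guessing at details. The argument names are mine; the inputs are summed log-probabilities of each response under the policy and the frozen reference model.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    beta controls how strongly the policy is pushed away from the reference.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as softplus(-margin) for stability.
    return softplus(-margin)
```

When the policy equals the reference the margin is zero and the loss is log 2. The loss drops as the policy puts relatively more probability on the chosen response than the reference does.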

Q-RAG: Multi-Step Retrieval via RL-Trained Embedders (Outstanding)

Value-based RL for training retrieval embedders on long contexts. The evolution from single-hop to multi-step RAG. Important for practical systems.

AgentFlow: Agentic Framework with Flow-GRPO (Outstanding)

This is wild. A trainable agentic system (planner, executor, verifier, generator) using Flow-based Group Refined Policy Optimization for sparse-reward credit assignment. A 7B parameter backbone beats GPT-4o on search, math, and science reasoning. If this replicates, it’s a strong signal that small models + good RL + agent architecture can compete with giant models.
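A rough sketch of the credit-assignment idea as I understand it: score each rollout with one final outcome reward, normalize rewards within the group of rollouts for the same task, and give every turn in a rollout that rollout's advantage. The function names are mine and the details (normalization, epsilon, how turns are weighted) are assumptions, not the paper's exact recipe.

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to the other rollouts in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_turns(advantages, turns_per_rollout):
    """Copy each rollout's single advantage onto every turn it contains."""
    return [[a] * n for a, n in zip(advantages, turns_per_rollout)]

# Four rollouts for the same task with sparse 0/1 outcome rewards (made up).
rewards = [1.0, 0.0, 0.0, 1.0]
turns = [3, 2, 4, 1]

advantages = group_normalized_advantages(rewards)
per_turn = broadcast_to_turns(advantages, turns)
```

The appeal is that you never need a per-turn reward model: a single verifiable outcome at the end is enough to produce a training signal for every planner step along the way.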

Common Corpus: Ethical LLM Pre-training Data (Outstanding)

Addresses data bias and ethics in pre-training. Important groundwork even if less flashy.

Transformers are Inherently Succinct (Outstanding)

Theoretical explanation of why transformers work so well. Worth a read if you like theory papers.

LLMs Get Lost In Multi-Turn Conversation (Outstanding)

Demonstrates that performance declines in multi-turn conversations when instructions are underspecified. Very relevant for real-world deployment where conversations are messy and long.

DPO & Preference Optimization

This was a major theme — lots of people finding problems with DPO and proposing fixes:

Paper	Key Finding
Why DPO is a Misspecified Estimator (Oral)	Exposes statistical flaw: preference reversals + reward degradation
Token-Importance Guided DPO	Refines DPO with token-level weighting
Multiplayer Nash Preference Optimization	Extends preference optimization to multi-agent setting
Semi-Supervised Preference with Limited Feedback	Data efficiency for preference-based training
Learning from Reference Answers	Alternative to binary preferences
Learning Correlated Reward Models	Multi-objective reward modeling with correlations

The trend is clear: original DPO had real problems, and the community is fixing them from multiple angles — statistical, multi-agent, data-efficient, multi-objective.

Safety & Alignment

Paper	Focus
Rethinking Deep Safety Alignment	Balances harmlessness + helpfulness
Alignment-Weighted DPO	Weighted approach for safety
Invisible Safety Threat: Malicious Finetuning via Steganography	Security vulnerabilities in fine-tuning. Scary.
Causal Intervention for Vulnerability Analysis	Shows shallow alignment enables jailbreaks. Fine-tuning on CoT datasets encourages principled refusals.

The steganography paper is worth highlighting — it shows that fine-tuning can be weaponized in ways that are hard to detect. Not great news for open-weight model safety.

Reasoning & Chain-of-Thought

Paper	Key Finding
Your Base Model is Smarter Than You Think	Sampling-based reasoning strategies unlock latent capability
Verifying CoT via Computational Graphs	Graph-based reasoning validation
Detecting Implicit Reward Hacking	Measures reasoning effort to identify deceptive reasoning
LoongRL	RL reasoning over long contexts
The Art of Scaling RL Compute for LLMs	Optimal compute allocation for RL training
TROLL	Trust region methods for stable RL in language models

Mechanistic Interpretability

SAEs for Code Correctness — Identifies directions corresponding to code correctness in LLMs
Mech Interp of In-Context Learning — Finds “common structures” in transformer QK circuits
Tracking Equivalent Mech Interp Across Networks — Framework for discovering succinct algorithms
Small Transformers Don’t Need LayerNorm at Inference — LN-free analogs enable more precise mechanistic analysis
Is Mechanistic Interpretability Identifiable? — Fundamental question about whether we can uniquely identify mechanisms. Important.

Research Trends Across Both Conferences

Looking at these together, some clear patterns:

DPO is getting fixed from all directions. Multiple papers identify statistical, practical, and safety issues. SafeDPO is the cleanest solution so far.
SAEs are everywhere. 10+ papers across both conferences. The tool is gaining adoption but still has fundamental issues (absorption, sensitivity, identifiability).
RL for reasoning is complicated. Evidence that RL improves sampling efficiency but may not expand actual reasoning capacity. Longer chains don’t always help.
Multimodal safety is just starting. Safe RLHF-V is the first real multimodal safety framework. Expect this to explode next year.
Pluralistic alignment is emerging. Moving beyond single-objective “helpful and harmless” toward diverse values and personalization.
Agents + RL intersection. AgentFlow showing 7B beats GPT-4o with the right architecture. Small + smart > big + dumb.

Curated Reading List

Read These First

Paper	Venue	Why
SafeDPO	ICLR 2026	Fixes DPO for safety with one extra hyperparameter. Clean and practical.
Why DPO is Misspecified	ICLR 2026	Understand the problem before the solution.
Safety Alignment is Divergence Estimation	NeurIPS 2025	Unifying theory for RLHF/DPO/etc.
Safety Neurons	NeurIPS 2025	5% of neurons → 90% of safety. Huge for interpretability.
Does RL Incentivize Reasoning?	NeurIPS 2025	Contrarian finding. Changes how you think about RL for LLMs.
AgentFlow	ICLR 2026	7B beats GPT-4o. RL + agents > scale.

Second Priority

Paper	Venue	Why
TROLL	ICLR 2026	Stable RL training for LLMs. Practical.
LoongRL	ICLR 2026	RL + long context reasoning.
Artificial Hivemind	NeurIPS 2025	RLHF reduces diversity. Best Paper for a reason.
SAE Feature Absorption	NeurIPS 2025	Fundamental limitation of current SAEs.
Is Mech Interp Identifiable?	ICLR 2026	Existential question for the field.
Safe RLHF-V	NeurIPS 2025	First multimodal safety framework.
