What This Is

I’ve been going through the proceedings of NeurIPS 2025 and ICLR 2026 to map out what the research community is working on in LLM safety, alignment, RLHF, and related areas. This post is basically my reading list, organized by conference and topic, with personal notes on the papers I think are most important.

ICML 2026 (July, Seoul) hasn’t published its accepted papers yet, so I’ll update this when that drops.

If you just want the top picks, skip to the curated reading list at the bottom.


NeurIPS 2025

Best Papers & Award Winners

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (Best Paper)

This one is interesting because it shows that RLHF reduces output diversity. They built a benchmark of 26K queries with 31K dense human annotations and found that reward models are miscalibrated against diverse human preferences. So when we align models to be “helpful”, we’re actually making them converge to a narrow band of responses. You may wonder if that’s necessarily bad — and the answer is, it depends on what you’re optimizing for. But it’s something we should at least be aware of.

Gated Attention for Large Language Models (Best Paper)

Adds head-specific sigmoid gating after scaled dot product attention. Tested across 30+ variants. More of an architecture contribution than safety, but it touches on how attention mechanisms can be made more controllable.
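To make the gating idea concrete, here is a minimal NumPy sketch. Several details are my assumptions rather than anything confirmed above: the gate is computed from the layer input through a per-head projection (`w_gate`, a hypothetical name), it is applied elementwise to each head's output, and the causal mask and output projection are left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention(x, w_q, w_k, w_v, w_gate):
    """x: (seq, d_model); every weight tensor: (heads, d_model, d_head)."""
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)
    d_head = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    attn_out = softmax(scores) @ v                        # (heads, seq, d_head)
    # Assumed gate form: one sigmoid gate per head, position, and channel.
    gate = sigmoid(np.einsum("sd,hde->hse", x, w_gate))
    return gate * attn_out

rng = np.random.default_rng(0)
seq, d_model, heads, d_head = 4, 8, 2, 4
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v, w_gate = (rng.normal(size=(heads, d_model, d_head)) for _ in range(4))
out = gated_attention(x, w_q, w_k, w_v, w_gate)
print(out.shape)
```

Because the gate lives in (0, 1), each head can only shrink its own output, which is one way to read the "more controllable" angle.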

Does Reinforcement Learning Really Incentivize Reasoning Capabilities?

This is the contrarian paper I keep coming back to. They show that RLVR (RL from verifiable rewards) enhances sampling efficiency without actually expanding reasoning capacity. So RL makes models better at finding good answers from what they already know, but doesn’t teach them to reason about new things. If this holds up, it means current RL methods haven’t even scratched the surface of what’s possible.
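One way to see the difference between sampling efficiency and capacity is the pass@k metric: the chance that at least one of k samples is correct. The sketch below uses the standard unbiased pass@k estimator; the per-problem counts are made-up numbers for illustration, and treating pass@k as the right lens here is my assumption, not something stated above.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations with c correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts of correct generations out of n = 100 samples per problem.
n = 100
base_correct = [2, 1, 3, 1]   # base model: rarely right, but solves every problem sometimes
rl_correct = [60, 70, 0, 0]   # RL model: reliable on two problems, never solves the others

def mean_pass_at_k(counts, k):
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)

for k in (1, 100):
    print(k, mean_pass_at_k(base_correct, k), mean_pass_at_k(rl_correct, k))
```

In this toy setup the RL model wins at k = 1 and the base model wins at k = 100, which is the shape of result the paper's claim would predict.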

Safety & Alignment

LLM Safety Alignment is Divergence Estimation in Disguise

Probably my favorite theory paper from this cycle. Shows that RLHF, DPO, and related methods are all doing the same thing — estimating divergence between safe and unsafe output distributions. They propose a KLDO variant based on KL divergence. The unifying perspective is really clean.
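For readers who want the quantity itself in front of them, here is KL divergence between two discrete distributions. This is only the textbook definition, not the paper's KLDO objective, and the two distributions are hypothetical.

```python
from math import log

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same outcomes (q must be > 0 where p > 0)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three response types: [refuse, hedge, comply].
safe_dist = [0.7, 0.2, 0.1]
unsafe_dist = [0.1, 0.2, 0.7]

print(kl_divergence(safe_dist, unsafe_dist))  # large when the two behaviors are well separated
print(kl_divergence(safe_dist, safe_dist))    # zero when they are identical
```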

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

They identified that roughly 5% of neurons in a model are responsible for safety behavior. Patching these “safety neurons” restores >90% of safety performance. This is huge for interpretability — it means safety isn’t spread diffusely across the whole network, it’s concentrated. And that concentration is both a strength (we can study it) and a vulnerability (we can attack it).
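Patching is easier to picture with a toy example. The sketch below overwrites a chosen subset of one layer's activations with values from an aligned model. The direction of the patch (aligned into base), the layer size, and the neuron index are all my assumptions for illustration; identifying which neurons matter is the hard part and is not shown.

```python
def patch_neurons(base_acts, aligned_acts, neuron_ids):
    """Return the base activations with the chosen neurons overwritten by the aligned model's values."""
    patched = list(base_acts)
    for i in neuron_ids:
        patched[i] = aligned_acts[i]
    return patched

# Hypothetical activations of one 20-neuron layer on the same prompt.
base_acts = [0.1 * i for i in range(20)]
aligned_acts = [a + (5.0 if i == 7 else 0.0) for i, a in enumerate(base_acts)]
safety_neurons = [7]  # 1 of 20 neurons = 5%, assumed to be already identified

patched = patch_neurons(base_acts, aligned_acts, safety_neurons)
changed = [i for i in range(20) if patched[i] != base_acts[i]]
print(changed)
```

The same operation run in reverse (overwriting safety neurons with base-model values) is what makes the concentration a vulnerability.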

Lifelong Safety Alignment for Language Models

Frames alignment as a competitive game: a Meta-Attacker that discovers jailbreaks vs. a Defender that resists them. Continuous adaptation to evolving strategies. Interesting framing but I’m not sure how practical it is — the arms race dynamic seems expensive.

Safe RLHF-V (PKU-Alignment)

First multimodal safety alignment framework. Introduces BeaverTails-V dataset with dual preference annotations (helpfulness + safety) and a Beaver-Guard-V multi-level guardrail system. This matters because as models become multimodal, single-modality safety approaches won’t be enough.

Rectified Policy Optimization (RePO)

Replaces expected safety constraints with critical per-prompt constraints. The key insight: average-case safety isn’t good enough — you need worst-case guarantees per prompt. Makes sense intuitively.
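A small numeric sketch of why the two constraint types differ. The clipped ("rectified") form below is my reading of the method's name, not the paper's exact formulation, and the costs are hypothetical.

```python
def expected_violation(costs, threshold):
    """Average-case constraint: only the mean cost has to stay under the threshold."""
    return max(0.0, sum(costs) / len(costs) - threshold)

def rectified_violation(costs, threshold):
    """Per-prompt constraint: each prompt's excess cost is clipped at zero before
    averaging, so safe prompts cannot offset unsafe ones."""
    return sum(max(0.0, c - threshold) for c in costs) / len(costs)

# Hypothetical safety costs for four prompts (higher = less safe); cost <= 0 counts as safe.
costs = [-2.0, -2.0, -2.0, 3.0]

print(expected_violation(costs, 0.0))   # the three safe prompts hide the unsafe one
print(rectified_violation(costs, 0.0))  # the unsafe prompt still registers
```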

Preference Optimization & RLHF

Less is More: Improving LLM Alignment via Preference Data Selection

Strategic data selection for DPO training improves performance while reducing computation. Quality over quantity for preference data — not surprising but good to have the empirical validation.

Greedy Sampling Is Provably Efficient For RLHF

Theory paper. If you’re into the math of RLHF sampling strategies, this is your paper.

Provably Efficient Online RLHF with One-Pass Reward Modeling

Addresses the computational cost of continuously integrating new data + re-optimizing reward models. One-pass approach. Practical if you’re deploying RLHF at scale.

Mechanistic Interpretability & SAEs

SAEs (Sparse Autoencoders) dominated the interpretability track this year:

  • Revising and Falsifying SAE Feature Explanations — Improving the quality of what SAE features actually mean
  • Measuring SAE Feature Sensitivity — How reliably do SAE features activate on similar inputs?
  • SAE Neural Operators — Generalizes SAEs to infinite-dimensional function spaces. Ambitious.
  • One-Step SAEs for Diffusion Models — Extends SAE interpretability beyond language to image generation (SDXL Turbo)
  • SAEs for Pathology Foundation Models — SAE features map to interpretable biological concepts. Strong correlations with cell type counts.
  • Feature Absorption in SAEs — Hierarchical features cause absorption problems during optimization. Varying SAE size/sparsity doesn’t fix it.
  • Matching Pursuit SAE — New architecture using Matching Pursuit for hierarchical features. Reveals geometric assumptions in existing SAE designs.

My take: SAEs are becoming the swiss army knife of interpretability. But the absorption and sensitivity issues suggest we’re still in the “figuring out the tool” phase, not the “using the tool reliably” phase.
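If you haven't worked with SAEs, this is roughly the object all of these papers are poking at: a minimal sketch of the common ReLU encoder, linear decoder, and L1 sparsity penalty. The dimensions, the `l1_coeff` value, and the random weights are placeholders of mine, and none of the papers above necessarily uses exactly this variant.

```python
import numpy as np

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    features = np.maximum(0.0, x @ w_enc + b_enc)  # non-negative feature activations
    recon = features @ w_dec + b_dec               # reconstruction of the input activations
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    mse = np.mean((x - recon) ** 2)                    # reconstruction error
    sparsity = np.mean(np.abs(features).sum(axis=-1))  # L1 penalty pushes features toward zero
    return mse + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, d_sae, batch = 16, 64, 8  # the dictionary is wider than the model dimension
x = rng.normal(size=(batch, d_model))
w_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
w_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
print(features.shape, recon.shape, sae_loss(x, features, recon))
```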

Reasoning, CoT & Agents

  • Decision Pivots for CoT Verification — Identifies minimal checkpoints that any correct reasoning path must visit. Interesting for verification.
  • Latent Reasoning Models / CODI — Reasoning in continuous hidden space via self-distillation. Efficient but less interpretable.
  • Self-Correction in Long CoT — Models do redundant reasoning; first step dominates. So longer chains don’t always help.
  • RL with Model-rewarded Thinking (RLMT) — Online RL with preference-based reward models. Outperforms standard RLHF across DPO, PPO, GRPO.
  • Multi-Agent Reasoning (Game-Theoretic) — Non-zero-sum game between base agents + critical agent. Uncertainty-aware collaboration.
  • R1-Zero for GUI Grounding — Online RL + CoT reasoning for computer-use agents. Finds that longer chains lead to worse performance. Confirms the self-correction finding above.

Attention & Feature Interpretation

  • Sparse Attention Emergence — Timing follows power laws based on task/architecture/optimizer. Oral paper.
  • Attention Head Specialization — Individual heads specialize in semantic/visual attributes. Gives interpretable, controllable structure.

ICLR 2026

Outstanding Papers

SafeDPO: A Simple Approach to DPO with Enhanced Safety (Outstanding)

Balances helpfulness + safety without auxiliary networks or cost models. Single additional hyperparameter. Minimal modifications to standard DPO. This is the kind of paper I love — takes a real problem and solves it with minimal machinery. If you read one DPO paper this year, read this.

Q-RAG: Multi-Step Retrieval via RL-Trained Embedders (Outstanding)

Value-based RL for training retrieval embedders on long contexts. The evolution from single-hop to multi-step RAG. Important for practical systems.

AgentFlow: Agentic Framework with Flow-GRPO (Outstanding)

This is wild. A trainable agentic system (planner, executor, verifier, generator) using Flow-based Group Refined Policy Optimization for sparse-reward credit assignment. A 7B parameter backbone beats GPT-4o on search, math, and science reasoning. If this replicates, it’s a strong signal that small models + good RL + agent architecture can compete with giant models.
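For context on the GRPO part of the name, here is the group-relative advantage that standard GRPO computes: each rollout's reward is normalized against the other rollouts for the same prompt. This is a sketch of plain GRPO only; whatever Flow-GRPO changes on top of it is not shown, and using the population standard deviation is my choice.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical sparse outcome rewards for four rollouts of the same task.
rewards = [1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
print(adv)
```

With sparse rewards, the one successful rollout gets a positive advantage and the failures get a negative one, with no learned value network involved.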

Common Corpus: Ethical LLM Pre-training Data (Outstanding)

Addresses data bias and ethics in pre-training. Important groundwork even if less flashy.

Transformers are Inherently Succinct (Outstanding)

Theoretical explanation of why transformers work so well. If you like theory papers.

LLMs Get Lost In Multi-Turn Conversation (Outstanding)

Demonstrates performance decline in multi-turn with underspecified instructions. Very relevant for real-world deployment where conversations are messy and long.

DPO & Preference Optimization

This was a major theme — lots of people finding problems with DPO and proposing fixes:

| Paper | Key Finding |
| --- | --- |
| Why DPO is a Misspecified Estimator (Oral) | Exposes statistical flaw: preference reversals + reward degradation |
| Token-Importance Guided DPO | Refines DPO with token-level weighting |
| Multiplayer Nash Preference Optimization | Extends preference optimization to multi-agent setting |
| Semi-Supervised Preference with Limited Feedback | Data efficiency for preference-based training |
| Learning from Reference Answers | Alternative to binary preferences |
| Learning Correlated Reward Models | Multi-objective reward modeling with correlations |

The trend is clear: original DPO had real problems, and the community is fixing them from multiple angles — statistical, multi-agent, data-efficient, multi-objective.
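As a reference point for what all of these papers are modifying, here is the original DPO loss for a single preference pair. The log-probabilities in the example are hypothetical, and none of the fixes listed above are implemented here.

```python
from math import exp, log, log1p

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favors the chosen response
    # over the rejected one, relative to the frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return log1p(exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference: the loss sits at log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0), log(2))
# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Note that the loss only sees the difference between the two responses, which leaves room for the kinds of failure modes these papers go after.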

Safety & Alignment

| Paper | Focus |
| --- | --- |
| Rethinking Deep Safety Alignment | Balances harmlessness + helpfulness |
| Alignment-Weighted DPO | Weighted approach for safety |
| Invisible Safety Threat: Malicious Finetuning via Steganography | Security vulnerabilities in fine-tuning. Scary. |
| Causal Intervention for Vulnerability Analysis | Shows shallow alignment enables jailbreaks. Fine-tuning on CoT datasets encourages principled refusals. |

The steganography paper is worth highlighting — it shows that fine-tuning can be weaponized in ways that are hard to detect. Not great news for open-weight model safety.

Reasoning & Chain-of-Thought

| Paper | Key Finding |
| --- | --- |
| Your Base Model is Smarter Than You Think | Sampling-based reasoning strategies unlock latent capability |
| Verifying CoT via Computational Graphs | Graph-based reasoning validation |
| Detecting Implicit Reward Hacking | Measures reasoning effort to identify deceptive reasoning |
| LoongRL | RL reasoning over long contexts |
| The Art of Scaling RL Compute for LLMs | Optimal compute allocation for RL training |
| TROLL | Trust region methods for stable RL in language models |

Mechanistic Interpretability

  • SAEs for Code Correctness — Identifies directions corresponding to code correctness in LLMs
  • Mech Interp of In-Context Learning — Finds “common structures” in transformer QK circuits
  • Tracking Equivalent Mech Interp Across Networks — Framework for discovering succinct algorithms
  • Small Transformers Don’t Need LayerNorm at Inference — LN-free analogs enable more precise mechanistic analysis
  • Is Mechanistic Interpretability Identifiable? — Fundamental question about whether we can uniquely identify mechanisms. Important.

Looking at both conferences together, some clear patterns emerge:

  1. DPO is getting fixed from all directions. Multiple papers identify statistical, practical, and safety issues. SafeDPO is the cleanest solution so far.

  2. SAEs are everywhere. 10+ papers across both conferences. The tool is gaining adoption but still has fundamental issues (absorption, sensitivity, identifiability).

  3. RL for reasoning is complicated. Evidence that RL improves sampling efficiency but may not expand actual reasoning capacity. Longer chains don’t always help.

  4. Multimodal safety is just starting. Safe RLHF-V is the first real multimodal safety framework. Expect this to explode next year.

  5. Pluralistic alignment is emerging. Moving beyond single-objective “helpful and harmless” toward diverse values and personalization.

  6. Agents + RL intersection. AgentFlow shows a 7B model beating GPT-4o with the right architecture. Small + smart > big + dumb.


Curated Reading List

Read These First

| Paper | Venue | Why |
| --- | --- | --- |
| SafeDPO | ICLR 2026 | Fixes DPO for safety with one extra hyperparameter. Clean and practical. |
| Why DPO is Misspecified | ICLR 2026 | Understand the problem before the solution. |
| Safety Alignment is Divergence Estimation | NeurIPS 2025 | Unifying theory for RLHF/DPO/etc. |
| Safety Neurons | NeurIPS 2025 | 5% of neurons → 90% of safety. Huge for interpretability. |
| Does RL Incentivize Reasoning? | NeurIPS 2025 | Contrarian finding. Changes how you think about RL for LLMs. |
| AgentFlow | ICLR 2026 | 7B beats GPT-4o. RL + agents > scale. |

Second Priority

| Paper | Venue | Why |
| --- | --- | --- |
| TROLL | ICLR 2026 | Stable RL training for LLMs. Practical. |
| LoongRL | ICLR 2026 | RL + long context reasoning. |
| Artificial Hivemind | NeurIPS 2025 | RLHF reduces diversity. Best Paper for a reason. |
| SAE Feature Absorption | NeurIPS 2025 | Fundamental limitation of current SAEs. |
| Is Mech Interp Identifiable? | ICLR 2026 | Existential question for the field. |
| Safe RLHF-V | NeurIPS 2025 | First multimodal safety framework. |

Sources