The Hype

Since OpenAI’s o1 and DeepSeek’s R1, the narrative has been clear: train LLMs with RL on reasoning tasks, and they learn to reason better. Give them chain-of-thought, reward correct answers, and the model develops genuine reasoning capabilities.

This narrative is compelling. It’s also probably wrong — or at least significantly oversimplified.

Several papers from NeurIPS 2025 and ICLR 2026 challenge the assumption that RL for reasoning does what we think it does. The evidence points to something more nuanced: RL makes models better at finding answers they could already produce, without necessarily expanding what they can reason about.


The Key Finding: Sampling Efficiency ≠ Reasoning Capacity

A NeurIPS 2025 paper (“Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?”) ran a clean experiment. The authors took a base model, applied reinforcement learning with verifiable rewards (RLVR, where the reward comes from checking whether the final answer is correct rather than from a learned reward model), and measured two things:

  1. Pass@1: Does the model get the right answer on the first try more often?
  2. Pass@k: If you sample k responses and count a problem as solved when at least one of them is correct, does RL expand the set of problems the model can solve?

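The result: RL significantly improves Pass@1 but barely moves Pass@k at large k, where the paper reports that the base model catches up with the RL-trained model and can even overtake it. In other words: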

  • The model doesn’t learn to solve new problems it couldn’t solve before
  • It gets better at producing correct solutions on the first attempt
  • The “reasoning” was already latent in the pre-trained model — RL just makes it more accessible

This is a fundamental distinction. Sampling efficiency means you need fewer attempts to find a correct answer. Reasoning capacity means you can solve problems you previously couldn’t solve at all. RLVR improves the former, not the latter.
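To make the distinction concrete, here is a minimal sketch of how Pass@k is typically estimated, using the unbiased estimator popularized by the Codex paper (Chen et al., 2021): draw n samples per problem, count the c correct ones, and compute 1 - C(n-c, k)/C(n, k). The per-problem counts below are made up for illustration and are not taken from the NeurIPS paper; `pass_at_k` and `mean_pass_at_k` are names I chose for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average Pass@k over a list of per-problem correct-sample counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: correct samples out of N = 100 for four problems.
N = 100
base_counts = [10, 5, 2, 0]   # base model: rarely right, but solves 3 of 4
rl_counts = [90, 60, 30, 0]   # RL-tuned: often right, still solves 3 of 4

for k in (1, 100):
    base = mean_pass_at_k(base_counts, N, k)
    tuned = mean_pass_at_k(rl_counts, N, k)
    print(f"Pass@{k}: base={base:.4f}  rl={tuned:.4f}")
```

With these toy numbers the RL-tuned model’s Pass@1 is roughly ten times higher, yet both models end up with the same Pass@100, because they solve the same three of the four problems. That is the pattern of higher sampling efficiency with unchanged capacity.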


Longer Chains Don’t Always Help

If RL for reasoning worked the way we hoped, longer chains of thought would lead to better answers, since more “thinking” should mean better reasoning. Two papers suggest otherwise.

Self-Correction in Long CoT (NeurIPS 2025 FoRLM Workshop) found that in extended chain-of-thought sequences, models engage in redundant reasoning. The first reasoning step dominates the outcome, and subsequent steps mostly repeat or marginally refine it. Self-correction does happen, but it’s rarer than you’d expect.

R1-Zero for GUI Grounding (NeurIPS 2025) trained agents with online RL + CoT reasoning for computer-use tasks. They found that longer reasoning chains actually led to worse performance. The model would overthink, second-guess itself, and end up choosing worse actions than it did with a shorter reasoning process.

This suggests there’s an optimal “reasoning length” for each task, and that length is shorter than you might think. More tokens ≠ more thought. Sometimes more tokens = more confusion.


What RL Actually Does to LLMs

Combining these findings, here’s my current model of what RL for reasoning does:

RL reshapes the output distribution, not the knowledge. The pre-trained model has a broad distribution over possible outputs for any given input. Some of those outputs contain correct reasoning, some don’t. RL concentrates probability mass on the outputs that lead to correct answers. It’s sharpening, not expanding.
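One way to picture sharpening: under a KL-regularized RL objective, the idealized optimum reweights the base distribution as p*(y) ∝ p_base(y) · exp(r(y)/β). The sketch below applies that reweighting to a toy distribution over four candidate outputs. It assumes this closed form rather than simulating any actual training run, and the probabilities, rewards, and β value are all hypothetical.

```python
import math


def sharpen(base_probs, rewards, beta):
    """Reweight a base distribution as p*(y) proportional to p(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy distribution over four candidate outputs for one prompt.
base_probs = [0.70, 0.25, 0.05, 0.00]
rewards = [0.0, 0.0, 1.0, 1.0]  # outputs 2 and 3 are the correct ones

tuned = sharpen(base_probs, rewards, beta=0.2)

for i, (p, q) in enumerate(zip(base_probs, tuned)):
    print(f"output {i}: base={p:.2f}  after RL={q:.2f}")
```

Output 2 goes from 5% of the probability mass to roughly 89%, while output 3, which the base model never produces, stays at zero. Reweighting alone cannot create support that the base distribution lacks.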

RL teaches the model when to think and when to just answer. On easy problems, RL-trained models often produce shorter, more direct responses. On hard problems, they produce longer reasoning chains. The model learns to allocate compute where it’s needed — which is a real and valuable capability, just not the same as “learning to reason.”

RL can improve formatting and structure. RL-trained models produce more organized, step-by-step reasoning chains. This improves readability and makes it easier for verifiers to check the work. But structured presentation of reasoning is different from the reasoning itself.


The Counter-Evidence

Not all evidence points this way. There are cases where RL does seem to unlock new capabilities:

RLMT, short for RL with Model-Rewarded Thinking (NeurIPS 2025 FoRLM Workshop), uses preference-based reward models and online RL, and outperforms standard RLHF whether training uses DPO, PPO, or GRPO. The gains come from teaching the model to think in a more structured way before answering, which sometimes enables solutions the base model couldn’t find.

AgentFlow + Flow-GRPO (ICLR 2026) showed that a 7B model with RL training and an agent architecture beat GPT-4o on complex multi-step tasks. This isn’t just sampling efficiency: the 7B model genuinely couldn’t solve these tasks before RL training. But the capability might come from the agent architecture (planner + executor + verifier) rather than from RL alone.

LoongRL (ICLR 2026) focuses on RL for long-context reasoning and shows genuine improvement on tasks requiring multi-hop reasoning over long documents. The key ingredient seems to be training on tasks that specifically require long-range information integration.

So the picture isn’t all negative. RL can expand capabilities in some settings, especially when combined with architectural innovations or specifically designed training tasks. But the blanket claim that “RL teaches reasoning” is too strong.


What This Means for Alignment

If RL mostly improves sampling efficiency rather than reasoning capacity, the implications for alignment are interesting:

Good news: Safety training (RLHF/DPO) might be more robust than we thought. If RL doesn’t fundamentally change what the model can do, only how likely it is to do it, then safety alignment is working on the same model capabilities that pre-training established.

Bad news: We can’t rely on RL alone to teach models to reason safely about novel situations. If the model doesn’t have latent safety reasoning from pre-training, RLHF won’t create it — it’ll just make whatever safety behavior exists more or less likely to appear.

Practical implication: Pre-training data and SFT quality might matter more than we thought for safety outcomes. RLHF fine-tunes the edges, but the foundation has to be solid.


My Take

I think the field over-indexed on RL for reasoning in 2024-2025, partly because the results were impressive (o1 genuinely performs better on math and coding benchmarks) and partly because the narrative was clean (train with RL → learn to reason).

The reality is messier. RL is a powerful tool for optimization, but optimization and learning are different things. RL optimizes the model to produce outputs that score well on a reward signal. If “scoring well” correlates with “genuine reasoning,” great. If it correlates with “pattern-matching that looks like reasoning,” we’re fooling ourselves.

The research direction I’m most excited about is understanding when RL expands capabilities vs. when it merely sharpens existing ones. The answer probably depends on the task, the training data, and how far the pre-trained model is from being able to solve it. Getting this boundary right will determine whether scaling RL for reasoning is a path to genuine AI progress or a path to increasingly convincing pattern matching.