Where We Left Off
In Part 1, we derived the policy gradient theorem, built REINFORCE, reduced its variance with baselines and actor-critic methods, and arrived at GAE for practical advantage estimation. The missing piece: stability. Vanilla policy gradient methods take steps that can be too large, destroying a good policy in one update.
This post covers how trust regions solve that, leading to TRPO and then PPO — and why PPO specifically became the algorithm behind RLHF. We’ll also look at GRPO, a recent alternative from DeepSeek that drops the critic entirely.
The Step Size Problem
Policy gradient gives us a direction to update $\theta$. But how big should the step be?
Too small: slow convergence, wasted compute. Too big: the policy changes dramatically, the advantage estimates (computed under the old policy) become inaccurate, and performance can collapse catastrophically.
This is worse than in supervised learning. There, the training data is fixed, so a bad step raises the loss but later steps can recover. In RL, the policy generates its own training data: a bad policy step means the agent starts collecting worse trajectories, which gives worse gradient estimates, which leads to worse updates. The result is a death spiral.
We need some way to constrain how much the policy changes per update.
Trust Region Policy Optimization (TRPO)
TRPO (Schulman et al., 2015) formalizes this constraint. Instead of the standard policy gradient update, TRPO solves a constrained optimization problem:
$$\max_\theta \; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} A_t\right]$$
subject to:
$$\hat{\mathbb{E}}_t\left[D_{KL}(\pi_{\theta_{old}}(\cdot | s_t) \,\|\, \pi_\theta(\cdot | s_t))\right] \leq \delta$$
The ratio $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$ is the importance sampling ratio. It lets us evaluate the new policy using data collected under the old policy.
The KL constraint says: the new policy can’t be “too different” from the old one, measured by KL divergence. The trust region is the set of policies within $\delta$ KL divergence of the current one.
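To make the two quantities concrete, here is a minimal sketch that evaluates the surrogate objective and the mean KL divergence for a tiny discrete policy. The function names, the dictionary-based policy representation, and the toy numbers are all assumptions made for illustration; a real implementation would work on batched log-probabilities from a neural network.

```python
import math

def importance_ratio(logp_new, logp_old):
    """r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed from log-probs."""
    return math.exp(logp_new - logp_old)

def kl_divergence(p_old, p_new):
    """D_KL(p_old || p_new) for two discrete distributions over the same actions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)

def surrogate_and_mean_kl(batch, pi_old, pi_new):
    """batch: list of (state, action, advantage).
    pi_old / pi_new: dict mapping state -> list of action probabilities."""
    surrogate = 0.0
    kl = 0.0
    for s, a, adv in batch:
        r = importance_ratio(math.log(pi_new[s][a]), math.log(pi_old[s][a]))
        surrogate += r * adv
        kl += kl_divergence(pi_old[s], pi_new[s])
    n = len(batch)
    return surrogate / n, kl / n

# Toy example (hypothetical numbers): one state, two actions.
pi_old = {"s0": [0.5, 0.5]}
pi_new = {"s0": [0.6, 0.4]}
batch = [("s0", 0, 1.0), ("s0", 1, -1.0)]  # action 0 was good, action 1 was bad

surrogate, mean_kl = surrogate_and_mean_kl(batch, pi_old, pi_new)
delta = 0.05  # assumed trust-region size for this toy example
within_trust_region = mean_kl <= delta
```

Here the new policy shifts probability toward the good action, so the surrogate is positive, and the shift is small enough that the mean KL stays under the assumed $\delta$.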
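TRPO works well in practice, but it’s complicated to implement. Solving the constrained problem takes a second-order method: the KL constraint is approximated using the Fisher information matrix, the step direction is found with conjugate gradient (using Fisher-vector products rather than the full matrix), and a backtracking line search enforces the constraint. This is expensive and fiddly.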
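For a feel of what that second-order machinery involves, the sketch below solves $Fx = g$ with conjugate gradient using only matrix-vector products, then scales the direction so the quadratic approximation of the KL, $\tfrac{1}{2} s^\top F s$, equals $\delta$. The 2x2 matrix `F` and gradient `g` are made-up toy values, the helper names are hypothetical, and the backtracking line search is left out for brevity.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve F x = b using only products F v.
    Assumes F is symmetric positive definite."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - F x, with x = 0 initially
    p = list(b)  # current search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:
            break
        Fp = matvec(p)
        alpha = rs_old / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

def trpo_step(matvec, grad, delta):
    """Natural-gradient direction scaled so that 0.5 * s^T F s = delta."""
    direction = conjugate_gradient(matvec, grad)
    quad = dot(direction, matvec(direction))
    scale = math.sqrt(2.0 * delta / quad)
    return [scale * d for d in direction]

# Toy Fisher matrix and policy gradient (hypothetical values).
F = [[2.0, 0.0], [0.0, 4.0]]

def fisher_vector_product(v):
    return [dot(row, v) for row in F]

g = [2.0, 4.0]
direction = conjugate_gradient(fisher_vector_product, g)
step = trpo_step(fisher_vector_product, g, delta=0.01)
```

In a real policy network `F` is never materialized; the Fisher-vector product is computed with automatic differentiation, which is a large part of why the method is awkward to implement.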
PPO: Making Trust Regions Practical
PPO (Schulman et al., 2017) approximates TRPO’s constraint with a much simpler mechanism: clipping.
PPO-Clip
Instead of constraining the KL divergence explicitly, PPO clips the importance sampling ratio:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$$
where $\epsilon$ is typically 0.1 or 0.2.
Let’s unpack what the clipping does:
When advantage is positive ($A_t > 0$, the action was good):
- $r_t > 1 + \epsilon$: the new policy makes this action much more likely. Clip it — don’t over-exploit.
- $r_t < 1$: the new policy makes this action less likely. The min takes the unclipped value — allow the gradient to push the policy back toward this good action.
When advantage is negative ($A_t < 0$, the action was bad):
- $r_t < 1 - \epsilon$: the new policy already reduced this action’s probability a lot. Clip it — don’t over-correct.
- $r_t > 1$: the new policy makes this bad action more likely. The min takes the unclipped value — allow the gradient to fix this.
The effect: PPO lets the policy improve but removes the incentive to push the ratio outside $[1-\epsilon, 1+\epsilon]$ in the direction the advantage favors. No Fisher matrix, no conjugate gradient, no line search. Just a clipped objective you can optimize with a standard first-order optimizer such as SGD or Adam.
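The four cases above can be checked numerically. This is a minimal scalar sketch of the clipped objective, with hypothetical function names and $\epsilon = 0.2$ assumed; a real implementation would apply the same arithmetic to tensors and negate the result to get a loss.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Empirical mean of the per-sample clipped terms (to be maximized)."""
    terms = [ppo_clip_term(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# The four cases from the text, with eps = 0.2:
good_overshoot = ppo_clip_term(1.5, +1.0)   # clipped at 1.2: no further reward for r > 1 + eps
good_undershoot = ppo_clip_term(0.5, +1.0)  # unclipped 0.5: gradient can push r back up
bad_overshoot = ppo_clip_term(0.5, -1.0)    # clipped at -0.8: no further reward for r < 1 - eps
bad_undershoot = ppo_clip_term(1.5, -1.0)   # unclipped -1.5: gradient can push r back down

objective = ppo_clip_objective([1.5, 0.5, 0.5, 1.5], [1.0, 1.0, -1.0, -1.0])
```

When a term is clipped it no longer depends on $\theta$, so its gradient is zero; that is how the objective stops rewarding further movement in the same direction.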
PPO-Penalty (Alternative)
There’s also a KL-penalty variant:
$$L^{KL}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta) A_t - \beta \cdot D_{KL}(\pi_{\theta_{old}} \,\|\, \pi_\theta)\right]$$
where $\beta$ is an adaptive coefficient that increases when KL divergence exceeds a target and decreases when it’s below. This is closer to the TRPO constraint in spirit but uses a penalty instead of a hard constraint.
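One simple way to adapt $\beta$ is the heuristic from the PPO paper: double it when the measured KL overshoots the target by more than a factor of 1.5, and halve it when the KL undershoots by the same factor. The sketch below assumes that rule; the function name is hypothetical, and other update schedules are possible.

```python
def update_kl_coefficient(beta, observed_kl, target_kl):
    """Adaptive KL penalty coefficient, applied after each policy update.

    The 1.5 and 2.0 factors are the heuristic values from the PPO paper."""
    if observed_kl > 1.5 * target_kl:
        return beta * 2.0  # policy moved too far: penalize KL more strongly
    if observed_kl < target_kl / 1.5:
        return beta / 2.0  # policy barely moved: relax the penalty
    return beta            # KL is close to the target: leave beta unchanged

beta = 1.0
beta_after_large_kl = update_kl_coefficient(beta, observed_kl=0.05, target_kl=0.01)
beta_after_small_kl = update_kl_coefficient(beta, observed_kl=0.001, target_kl=0.01)
beta_on_target = update_kl_coefficient(beta, observed_kl=0.01, target_kl=0.01)
```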
In practice, PPO-Clip is simpler and more widely used.
Why PPO for RLHF
When OpenAI developed InstructGPT (the precursor to ChatGPT), they chose PPO for the RL step. Why?
Stability. Language model outputs are high-dimensional (vocabulary size ~50K-100K), and the reward signal is sparse (one reward per complete response). This makes the optimization landscape treacherous. PPO’s clipping prevents the catastrophic policy collapse that vanilla policy gradient methods suffer from.
Compatibility with KL constraints. In RLHF, you typically add a KL penalty between the current policy and the initial SFT model: $R_{total} = R_{reward} - \beta \cdot D_{KL}(\pi_\theta \,\|\, \pi_{SFT})$. This prevents the model from drifting too far from the SFT model’s behavior (otherwise it can learn to generate gibberish that the reward model happens to score highly). PPO’s own stability mechanism plus this external KL constraint gives you two layers of protection.
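As a rough sketch of the formula above, the KL term can be estimated from the sampled response itself by summing the per-token log-probability differences between the policy and the SFT model. The function names, the value of `beta`, and the token log-probabilities are all assumptions for illustration; many implementations apply the penalty per token rather than once per sequence.

```python
def sequence_kl_estimate(logp_policy, logp_sft):
    """Single-sample estimate of D_KL(pi_theta || pi_SFT) for one response:
    the sum over sampled tokens of log pi_theta(token) - log pi_SFT(token)."""
    return sum(lp - ls for lp, ls in zip(logp_policy, logp_sft))

def kl_shaped_reward(reward, logp_policy, logp_sft, beta=0.1):
    """R_total = R_reward - beta * (estimated KL from the SFT model)."""
    return reward - beta * sequence_kl_estimate(logp_policy, logp_sft)

# Hypothetical per-token log-probs for a three-token response.
logp_policy = [-1.0, -0.5, -2.0]  # under the current policy
logp_sft = [-1.5, -0.5, -2.5]     # under the frozen SFT model

kl_estimate = sequence_kl_estimate(logp_policy, logp_sft)
total_reward = kl_shaped_reward(1.0, logp_policy, logp_sft, beta=0.1)
```

The policy assigns higher probability to this response than the SFT model does, so the estimate is positive and the shaped reward is lower than the raw reward-model score.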
Sample efficiency. PPO can do multiple gradient steps on the same batch of collected data (within the trust region), while REINFORCE uses each batch once. This matters when generating responses from a large language model is expensive.
Implementation simplicity. Compared to TRPO, PPO is straightforward to implement and doesn’t require second-order optimization.
That said, PPO for LLMs is still notoriously hard to tune. The reward model, the KL coefficient, the clipping parameter, the learning rate, the batch size — all interact in non-obvious ways. This difficulty is one of the motivations behind DPO, which we’ll cover in a later post.
GRPO: Dropping the Critic
Group Relative Policy Optimization (GRPO), introduced by DeepSeek in the DeepSeekMath paper and later used to train their reasoning models, takes a different approach: eliminate the critic entirely.
The key insight: instead of training a separate value network to estimate advantages, just compare outputs within a group.
For each prompt, GRPO:
- Samples a group of $G$ responses from the current policy
- Scores each response with the reward model
- Computes advantages as normalized rewards within the group:
$$A_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)}$$
- Uses the PPO-clip objective with these group-relative advantages
No critic network. No GAE. No TD learning. Just “which responses in this group were better than average?”
This works because in the LLM setting, the reward model already provides a scalar reward per response. You don’t need a separate value function to estimate returns — you have the actual rewards. The group normalization provides a baseline automatically (the group mean serves the same variance-reduction role as the critic).
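The advantage computation is short enough to write out. This sketch uses the population standard deviation and adds a small `eps` to avoid dividing by zero when every response in the group gets the same reward; both choices, along with the function name and the toy rewards, are assumptions rather than details taken from the GRPO paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / std(r), computed within one prompt's group.

    eps is an assumed guard for groups where all rewards are identical."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of G = 4 responses to one prompt, scored 0 or 1.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every response scores the same, nothing in the group stands out.
flat_advantages = group_relative_advantages([0.5, 0.5, 0.5])
```

Each response’s advantage is then shared by all of its tokens and plugged into the PPO-clip objective in place of a critic-based estimate.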
GRPO was used to train DeepSeek-R1, which showed strong reasoning capabilities. The simplicity is appealing — one fewer network to train, fewer hyperparameters, less engineering complexity.
Flow-GRPO: Agents + RL
AgentFlow (ICLR 2026) extended GRPO to agentic systems with Flow-GRPO. The problem with standard GRPO for agents: rewards are sparse (only at the end of a multi-step task), so credit assignment is hard. Which step in a 10-step agent trajectory was responsible for the final success or failure?
Flow-GRPO addresses this with flow-based credit assignment — decomposing the agent’s trajectory into stages (planning, execution, verification, generation) and assigning credit at each stage boundary.
The result: a 7B parameter model beat GPT-4o on search, math, and science reasoning tasks. Small model + good RL + good architecture > big model. This is a strong signal that RL for agents is a frontier worth watching.
The Landscape
| Method | Critic? | Constraint | Complexity | Used In |
|---|---|---|---|---|
| REINFORCE | No | None | Very simple | Teaching |
| Actor-Critic | Yes | None | Moderate | Classic RL |
| TRPO | Yes | KL constraint (hard) | Complex | Research |
| PPO-Clip | Yes | Clip ratio | Simple | RLHF (InstructGPT, ChatGPT) |
| GRPO | No | Clip ratio + group norm | Simpler | DeepSeek-R1 reasoning |
| Flow-GRPO | No | Flow credit assignment | Moderate | AgentFlow (agents) |
The trend is toward simpler methods that leverage the structure of the LLM setting (reward models, group sampling) rather than general-purpose RL machinery.
What’s Next
We now have both the transformer architecture (Part 1-2 of that series) and the RL foundations. The next step is to combine them: how do you actually train a language model with human feedback? That’s the RLHF pipeline — and it turns out the whole thing can be viewed as divergence estimation in disguise.