From Policy Gradient to PPO — Part 2: Trust Regions, PPO, and GRPO

How trust regions stabilize policy optimization, why PPO became the default for RLHF, and how GRPO eliminates the critic entirely.

March 18, 2026 | Estimated reading time: 7 min | 1280 words | Author: khanhnn

From Policy Gradient to PPO — Part 1: Foundations

MDPs, value functions, the REINFORCE algorithm, actor-critic methods, and generalized advantage estimation — the RL foundations you need before understanding RLHF.

March 5, 2026 | Estimated reading time: 6 min | 1153 words | Author: khanhnn