Policy-Gradient-to-Ppo on My Learning Notes

Recent content in Policy-Gradient-to-Ppo on My Learning Notes
https://learning-notes-dz2.pages.dev/series/policy-gradient-to-ppo/
Tue, 16 Jun 2026 07:19:01 +0000

From Policy Gradient to PPO — Part 2: Trust Regions, PPO, and GRPO
https://learning-notes-dz2.pages.dev/posts/2026-03-18/
Wed, 18 Mar 2026 00:00:00 +0700
How trust regions stabilize policy optimization, why PPO became the default for RLHF, and how GRPO eliminates the critic entirely.

From Policy Gradient to PPO — Part 1: Foundations
https://learning-notes-dz2.pages.dev/posts/2026-03-05/
Thu, 05 Mar 2026 00:00:00 +0700
MDPs, value functions, the REINFORCE algorithm, actor-critic methods, and generalized advantage estimation: the RL foundations you need before understanding RLHF.