From Policy Gradient to PPO — Part 1: Foundations

Why RL Now?

If you’re reading this blog for the LLM content, you might be wondering why we’re taking a detour into reinforcement learning. The reason is simple: RL is how we train LLMs to be helpful, harmless, and honest. The “HF” in RLHF stands for Human Feedback, but the “RL” is what actually does the optimization. And if you don’t understand policy gradients and PPO, the RLHF pipeline is just a black box.

So this two-part series builds RL from scratch. Part 1: the foundations (MDPs, value functions, REINFORCE, actor-critic). Part 2: from TRPO to PPO to GRPO, and why PPO specifically was chosen for language model alignment.

Markov Decision Processes

An MDP is defined by a tuple $(S, A, P, R, \gamma)$:

$S$: set of states
$A$: set of actions
$P(s'|s, a)$: transition probability: if you’re in state $s$ and take action $a$, what’s the probability of ending up in state $s'$?
$R(s, a)$: reward function — how good is taking action $a$ in state $s$?
$\gamma \in [0, 1)$: discount factor — how much we value future rewards vs. immediate ones

A policy $\pi(a|s)$ is a probability distribution over actions given a state. The agent’s goal: find the policy that maximizes cumulative discounted reward.
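To make the tuple concrete, here is a minimal sketch of a two-state MDP with a fixed stochastic policy. The states, actions, probabilities, and rewards are all made up for illustration; only the structure ($S$, $A$, $P$, $R$, $\gamma$, $\pi$) comes from the definition above.

```python
import random

# Hypothetical two-state MDP; every number here is invented for illustration.
STATES = ["cold", "warm"]
ACTIONS = ["wait", "heat"]
GAMMA = 0.9

# P[s][a] maps next state -> probability
P = {
    "cold": {"wait": {"cold": 1.0}, "heat": {"warm": 0.8, "cold": 0.2}},
    "warm": {"wait": {"warm": 0.7, "cold": 0.3}, "heat": {"warm": 1.0}},
}
# R[s][a]: being warm pays off, heating costs energy
R = {
    "cold": {"wait": 0.0, "heat": -1.0},
    "warm": {"wait": 2.0, "heat": 1.0},
}
# POLICY[s] maps action -> probability, i.e. pi(a|s)
POLICY = {
    "cold": {"wait": 0.1, "heat": 0.9},
    "warm": {"wait": 0.8, "heat": 0.2},
}

def sample(dist, rng):
    """Draw a key from a {outcome: probability} dict."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def rollout(start, steps, rng):
    """Run the policy for a fixed number of steps; return (s, a, r) triples."""
    s, traj = start, []
    for _ in range(steps):
        a = sample(POLICY[s], rng)
        traj.append((s, a, R[s][a]))
        s = sample(P[s][a], rng)
    return traj

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over the trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rng = random.Random(0)
trajectory = rollout("cold", steps=20, rng=rng)
print(discounted_return([r for _, _, r in trajectory], GAMMA))
```

The printed number is the discounted return of one sampled trajectory; the agent’s objective is the expectation of that quantity over many such rollouts.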

For LLMs, the mapping is:

State: the tokens generated so far (the context)
Action: the next token to generate
Reward: comes from a reward model (trained on human preferences) after the full response is generated
Policy: the language model itself — $\pi_\theta(a_t | s_t)$ is the probability of generating token $a_t$ given the context $s_t$
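A rough sketch of this mapping, assuming the reward is zero at every token except the last one. That is one common way to place a response-level score, not something the mapping itself requires. The function name `to_trajectory` and the `final_reward` argument are hypothetical; `final_reward` stands in for whatever a reward model would return.

```python
def to_trajectory(prompt_tokens, response_tokens, final_reward):
    """Unroll one response into (state, action, reward) steps.

    final_reward is a stand-in for a reward model's score of the full response.
    """
    steps = []
    context = tuple(prompt_tokens)
    last = len(response_tokens) - 1
    for t, token in enumerate(response_tokens):
        # Assumption: only the final token carries the response-level reward.
        reward = final_reward if t == last else 0.0
        steps.append((context, token, reward))
        # The transition is deterministic: the next state is the context plus the token.
        context = context + (token,)
    return steps

episode = to_trajectory(["Hello"], ["there", "!"], final_reward=0.8)
for state, action, reward in episode:
    print(state, action, reward)
```

Note that the “environment” here has no randomness of its own: all the stochasticity comes from the policy choosing tokens.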

Value Functions

The state value function tells us the expected cumulative reward from state $s$ under policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$$

The action-value function (Q-function) conditions on both state and action:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$

The relationship: $V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$.

The advantage function measures how much better action $a$ is compared to the average:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

If $A^\pi(s, a) > 0$, action $a$ is better than what the policy would typically do. If $A^\pi(s, a) < 0$, it’s worse. This will be important for policy gradient methods.
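A small numeric illustration for a single state, using made-up Q-values and action probabilities:

```python
# Hypothetical Q-values and policy for one state with three actions.
q = {"left": 1.0, "stay": 2.0, "right": 4.0}
pi = {"left": 0.5, "stay": 0.25, "right": 0.25}

# V(s) = E_{a ~ pi}[Q(s, a)]
v = sum(pi[a] * q[a] for a in q)

# A(s, a) = Q(s, a) - V(s)
advantage = {a: q[a] - v for a in q}

# The policy-weighted average advantage is zero by construction.
mean_advantage = sum(pi[a] * advantage[a] for a in q)

print(v)               # 2.0
print(advantage)       # {'left': -1.0, 'stay': 0.0, 'right': 2.0}
print(mean_advantage)  # 0.0
```

Here “right” has positive advantage even though the policy picks it only a quarter of the time, which is exactly the kind of action a policy gradient update should make more likely.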

Bellman Equations

Value functions satisfy recursive relationships (the Bellman equations):

$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[R(s, a) + \gamma \mathbb{E}_{s' \sim P}[V^\pi(s')]\right]$$

This says: the value of a state equals the immediate reward plus the discounted value of the next state, in expectation. It’s the foundation for all dynamic programming and TD learning methods.
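One way to see the recursion at work is iterative policy evaluation: start from $V = 0$ and apply the right-hand side of the Bellman equation repeatedly until the values stop changing. The sketch below does this for a hypothetical 2-state, 2-action MDP; the transition table, rewards, and policy are invented for illustration.

```python
GAMMA = 0.9

# Hypothetical MDP (all numbers made up).
# P[s][a][s2] = transition probability, R[s][a] = reward, PI[s][a] = pi(a|s).
P = [[[1.0, 0.0], [0.2, 0.8]],
     [[0.3, 0.7], [0.0, 1.0]]]
R = [[0.0, -1.0],
     [2.0, 1.0]]
PI = [[0.1, 0.9],
      [0.8, 0.2]]

def bellman_backup(v):
    """Apply the right-hand side of the Bellman equation once to every state."""
    return [
        sum(
            PI[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(2)))
            for a in range(2)
        )
        for s in range(2)
    ]

def evaluate_policy(tol=1e-10, max_iters=10_000):
    """Iterate the backup until successive value estimates differ by less than tol."""
    v = [0.0, 0.0]
    for _ in range(max_iters):
        new_v = bellman_backup(v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v

V = evaluate_policy()
print(V)
```

Because $\gamma < 1$, each backup shrinks the error by at least a factor of $\gamma$, so the loop converges to the unique fixed point $V^\pi$.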

The Policy Gradient Theorem

We want to find the policy parameters $\theta$ that maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory sampled from the policy.

The policy gradient theorem gives us the gradient of this objective:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time step $t$.

This is a beautiful result. The intuition: $\nabla_\theta \log \pi_\theta(a_t | s_t)$ points in the direction that makes action $a_t$ more likely. We weight this by $G_t$ — if the return was high, push the policy toward those actions; if low, push away.

The proof relies on the “log-derivative trick”: $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$. This lets us express the gradient as an expectation under the policy itself, which we can estimate with samples.
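For readers who want the missing step, here is a sketch of the derivation. The shorthand $p_\theta(\tau)$ for the probability of a trajectory, $R(\tau) = \sum_t \gamma^t r_t$ for its discounted return, and $\rho(s_0)$ for the initial-state distribution are notation introduced here for convenience.

$$\nabla_\theta J(\theta) = \nabla_\theta \sum_\tau p_\theta(\tau) R(\tau) = \sum_\tau p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau) R(\tau)\right]$$

$$\log p_\theta(\tau) = \log \rho(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t | s_t) + \sum_{t=0}^{T} \log P(s_{t+1} | s_t, a_t) \;\Rightarrow\; \nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t)$$

The initial-state and transition terms don’t depend on $\theta$, so the gradient never needs the environment dynamics. Finally, rewards earned before step $t$ can’t be influenced by $a_t$, so their contribution vanishes in expectation. Dropping them leaves $\sum_{k=t}^{T} \gamma^k r_k = \gamma^t G_t$ as the weight on step $t$; the form stated above omits the $\gamma^t$ factor, as most practical implementations do.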

REINFORCE

REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. It directly implements the policy gradient theorem:

Sample a trajectory $\tau$ by running the policy
Compute returns $G_t$ for each time step
Update: $\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot G_t$

That’s it. Sample, compute returns, update. Simple and unbiased.
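The three steps can be sketched in plain Python. Everything about the setup is an assumption made for illustration: a toy corridor environment (start at state 0, reach state 3, reward $-1$ per step), a tabular softmax policy with one logit per state-action pair, and the particular values of `ALPHA` and `GAMMA`.

```python
import math
import random

N_STATES, GOAL = 3, 3          # non-terminal states 0..2, terminal state 3
GAMMA, ALPHA = 0.99, 0.01      # illustrative hyperparameters

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def run_episode(theta, rng, max_steps=50):
    """Step 1: sample (state, action, reward) triples by running the policy."""
    s, traj = 0, []
    for _ in range(max_steps):
        a = rng.choices([0, 1], weights=softmax(theta[s]))[0]  # 0 = left, 1 = right
        s_next = max(0, s - 1) if a == 0 else s + 1
        traj.append((s, a, -1.0))  # every step costs 1
        if s_next == GOAL:
            break
        s = s_next
    return traj

def returns(rewards, gamma):
    """Step 2: G_t = r_t + gamma * G_{t+1}, computed backwards."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def reinforce_update(theta, traj, alpha=ALPHA):
    """Step 3: theta += alpha * sum_t grad log pi(a_t|s_t) * G_t."""
    gs = returns([r for _, _, r in traj], GAMMA)
    grad = [[0.0, 0.0] for _ in range(N_STATES)]
    for (s, a, _), g in zip(traj, gs):
        probs = softmax(theta[s])
        for b in range(2):
            # For a softmax policy, d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
            grad[s][b] += ((1.0 if a == b else 0.0) - probs[b]) * g
    for s in range(N_STATES):
        for b in range(2):
            theta[s][b] += alpha * grad[s][b]

rng = random.Random(0)
theta = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(2000):
    reinforce_update(theta, run_episode(theta, rng))
print([round(softmax(t)[1], 2) for t in theta])  # P(right) in each state
```

The gradient is accumulated over the whole trajectory before any parameter changes, so every term is evaluated at the same $\theta$ that generated the data. Since moving right is the only way to end the episode, the printed probabilities should drift toward 1, though how quickly depends on the seed and step size.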

The problem: variance. $G_t$ is a noisy estimate because it depends on the entire future trajectory. One lucky trajectory can send the gradient in a wildly wrong direction. In practice, REINFORCE is slow to converge and unstable.

Variance Reduction with Baselines

We can subtract a baseline $b(s_t)$ from the return without introducing bias:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot (G_t - b(s_t))\right]$$

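Why is this still unbiased? Because the baseline depends only on the state, not on the action, it factors out of the expectation: $\mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \sum_a \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s) = b(s) \sum_a \nabla_\theta \pi_\theta(a|s) = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0$.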
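The identity is easy to check numerically. The sketch below assumes a softmax policy over three actions with arbitrary logits and an arbitrary baseline value, then computes the expectation exactly by summing over actions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score(logits, a):
    """Gradient of log pi(a) with respect to the logits of a softmax policy."""
    probs = softmax(logits)
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]

logits = [0.3, -1.2, 2.0]  # arbitrary policy parameters
probs = softmax(logits)
baseline = 7.5             # any value that does not depend on the action

# E_{a ~ pi}[grad log pi(a) * b], one component per logit
expected = [
    sum(probs[a] * score(logits, a)[i] * baseline for a in range(3))
    for i in range(3)
]
print(expected)  # every component is zero up to floating-point error
```

The same calculation with a baseline that varies by action would generally not come out to zero, which is why $b$ is restricted to be a function of the state.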

The variance-minimizing baseline turns out to be close to $V^\pi(s_t)$, the expected return from state $s_t$. With this baseline, we’re effectively using the advantage: since $\mathbb{E}[G_t \mid s_t, a_t] = Q^\pi(s_t, a_t)$, the quantity $G_t - V^\pi(s_t)$ is a single-sample estimate of $A^\pi(s_t, a_t)$. Actions that are better than average get a positive gradient; worse-than-average actions get a negative one.

This reduces variance dramatically, but we need to estimate $V^\pi$ somehow, which brings us to actor-critic methods.

Actor-Critic

The actor-critic framework uses two networks:

Actor ($\pi_\theta$): the policy network — decides which actions to take
Critic ($V_\phi$): the value network — estimates $V^\pi(s)$ to reduce variance

The critic provides the baseline. Instead of waiting for the full trajectory return $G_t$, we can use the one-step TD (temporal difference) error as the advantage estimate:

$$A_t \approx r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

This is a one-step advantage estimate. It has lower variance than $G_t - V_\phi(s_t)$ (because we bootstrap from the critic instead of using the full return) but introduces bias (because the critic is an approximation).

The actor update: $$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A_t$$

The critic update (minimize the squared TD error, holding the target $r_t + \gamma V_\phi(s_{t+1})$ fixed when taking the gradient): $$\phi \leftarrow \phi - \beta \nabla_\phi (V_\phi(s_t) - (r_t + \gamma V_\phi(s_{t+1})))^2$$
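Both updates fit in a few lines for the tabular case. This is a minimal sketch under stated assumptions: a softmax actor with one logit per state-action pair (`theta[s][a]`), a critic stored as a list `v[s]`, illustrative step sizes, and a `done` flag (not part of the equations above) that zeroes the bootstrap term at the end of an episode.

```python
import math

GAMMA, ALPHA, BETA = 0.99, 0.1, 0.1  # illustrative hyperparameters

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(theta, v, s, a, r, s_next, done):
    """Apply one actor update and one critic update for a single transition."""
    # TD target and one-step advantage estimate A_t
    target = r + (0.0 if done else GAMMA * v[s_next])
    td_error = target - v[s]

    # Actor: theta += alpha * grad log pi(a|s) * A_t
    probs = softmax(theta[s])
    for b in range(len(probs)):
        theta[s][b] += ALPHA * ((1.0 if a == b else 0.0) - probs[b]) * td_error

    # Critic: gradient step on (v[s] - target)^2 with the target held fixed.
    # The derivative is 2 * (v[s] - target), hence the factor of 2.
    v[s] += 2.0 * BETA * td_error
    return td_error

theta = [[0.0, 0.0], [0.0, 0.0]]
v = [0.0, 0.0]
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
print(delta, theta[0], v[0])
```

Unlike REINFORCE, nothing here waits for the episode to finish: each transition produces an update as soon as $r_t$ and $s_{t+1}$ are observed.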

Actor-critic is the foundation for essentially all modern policy gradient methods, including PPO.

Generalized Advantage Estimation (GAE)

The one-step TD advantage $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is low variance but biased. The full Monte Carlo advantage $G_t - V(s_t)$ is unbiased but high variance. GAE (Schulman et al., 2015) interpolates between them:

$$A_t^{GAE} = \sum_{l=0}^{T-t} (\gamma \lambda)^l \delta_{t+l}$$

where $\lambda \in [0, 1]$ controls the bias-variance trade-off:

$\lambda = 0$: pure one-step TD (low variance, high bias)
$\lambda = 1$: equivalent to Monte Carlo advantage (no bias, high variance)
In practice: $\lambda \approx 0.95$ works well

Equivalently (in the infinite-horizon case), this is an exponentially weighted average of $n$-step advantage estimates, with weights decaying as $\lambda^{n-1}$. The decaying weights favor short-horizon estimates, which lean on the critic and have lower variance, while still mixing in long-horizon estimates, which lowers the bias relative to pure one-step TD.
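In code, the sum is usually computed with a backward recursion, $A_t = \delta_t + \gamma \lambda A_{t+1}$. The sketch below assumes `values` holds one more entry than `rewards`: the last entry is the critic’s estimate for the state after the final step, or 0 if the episode terminated there.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: r_0 .. r_T
    values:  V(s_0) .. V(s_{T+1}); pass 0.0 as the last entry for a terminal state
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]

print(gae(rewards, values, gamma=1.0, lam=0.0))  # [1.0, 1.0, 0.5]: the TD errors
print(gae(rewards, values, gamma=1.0, lam=1.0))  # [2.5, 1.5, 0.5]: G_t - V(s_t)
```

The two extremes match the cases listed above: with $\lambda = 0$ each advantage is just its own TD error, and with $\lambda = 1$ the TD errors telescope into the Monte Carlo return minus the critic’s value.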

GAE is used in virtually every modern policy gradient implementation, including the PPO implementations used for RLHF.

Summary

What we covered:

MDPs: the formal framework for sequential decision-making
Value functions: $V^\pi$, $Q^\pi$, advantage $A^\pi$, and Bellman equations
Policy gradient theorem: how to compute gradients of expected return
REINFORCE: simplest implementation, high variance
Baselines and actor-critic: variance reduction through learned value functions
GAE: the practical bias-variance trade-off for advantage estimation

These are the building blocks. In Part 2, we’ll see how trust regions and clipped objectives lead to PPO — and why PPO became the algorithm of choice for training language models with human feedback.
