Why RL Now?
If you’re reading this blog for the LLM content, you might be wondering why we’re taking a detour into reinforcement learning. The reason is simple: RL is how we train LLMs to be helpful, harmless, and honest. The “HF” in RLHF stands for Human Feedback, but the “RL” is what actually does the optimization. And if you don’t understand policy gradients and PPO, the RLHF pipeline is just a black box.
So this two-part series builds RL from scratch. Part 1: the foundations (MDPs, value functions, REINFORCE, actor-critic). Part 2: from TRPO to PPO to GRPO, and why PPO specifically was chosen for language model alignment.
Markov Decision Processes
An MDP is defined by a tuple $(S, A, P, R, \gamma)$:
- $S$: set of states
- $A$: set of actions
- $P(s'|s, a)$: transition probability (if you’re in state $s$ and take action $a$, what’s the probability of ending up in state $s'$?)
- $R(s, a)$: reward function — how good is taking action $a$ in state $s$?
- $\gamma \in [0, 1)$: discount factor — how much we value future rewards vs. immediate ones
A policy $\pi(a|s)$ is a probability distribution over actions given a state. The agent’s goal: find the policy that maximizes the expected cumulative discounted reward.
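To make the tuple concrete, here is a minimal sketch of a two-state MDP in plain Python. Every state name, action name, probability, and reward below is invented for illustration; only the structure ($P$, $R$, $\gamma$, and a policy that maps a state to action probabilities) comes from the definition above.

```python
import random

# Hypothetical two-state MDP; all names and numbers are illustrative.
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]

# P[s][a] -> list of (next_state, probability)
P = {
    "cool": {"slow": [("cool", 1.0)], "fast": [("cool", 0.5), ("hot", 0.5)]},
    "hot": {"slow": [("cool", 0.5), ("hot", 0.5)], "fast": [("hot", 1.0)]},
}

# R[s][a] -> immediate reward
R = {
    "cool": {"slow": 1.0, "fast": 2.0},
    "hot": {"slow": 1.0, "fast": -10.0},
}

GAMMA = 0.9


def uniform_policy(state):
    """pi(a|s): here, the same probability for every action."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}


def sample(dist, rng):
    """Draw a key from a {key: probability} dict."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def rollout(policy, start, steps, rng):
    """Run the policy for a fixed number of steps and collect the rewards."""
    state, rewards = start, []
    for _ in range(steps):
        action = sample(policy(state), rng)
        rewards.append(R[state][action])
        state = sample(dict(P[state][action]), rng)
    return rewards


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over the trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


rng = random.Random(0)
rewards = rollout(uniform_policy, "cool", steps=20, rng=rng)
print(discounted_return(rewards, GAMMA))
```

Averaging `discounted_return` over many rollouts gives a Monte Carlo estimate of the quantity the agent is trying to maximize.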
For LLMs, the mapping is:
- State: the tokens generated so far (the context)
- Action: the next token to generate
- Reward: comes from a reward model (trained on human preferences) after the full response is generated
- Policy: the language model itself — $\pi_\theta(a_t | s_t)$ is the probability of generating token $a_t$ given the context $s_t$
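The same mapping can be sketched as a toy generation loop. Nothing here is a real model: `toy_policy` and `toy_reward_model` are hypothetical stand-ins (a uniform distribution and a keyword check), and the four-token vocabulary is made up. The point is only the shape of the trajectory: the state grows by one token per step, and the reward is zero until the response is finished.

```python
import random

VOCAB = ["yes", "no", "maybe", "<eos>"]  # toy vocabulary (assumption)


def toy_policy(context):
    """Stand-in for pi_theta(a_t | s_t): uniform over the vocabulary."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


def toy_reward_model(tokens):
    """Stand-in for a learned reward model: scores the finished response."""
    return 1.0 if "yes" in tokens else 0.0


def generate(prompt, max_new_tokens, rng):
    """Return a list of [state, action, reward] steps (max_new_tokens >= 1)."""
    state = tuple(prompt)  # s_0: the prompt tokens
    trajectory = []
    for _ in range(max_new_tokens):
        probs = toy_policy(state)
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        trajectory.append([state, action, 0.0])  # no reward mid-sequence
        state = state + (action,)  # deterministic transition: append the token
        if action == "<eos>":
            break
    # The only non-zero reward arrives once the response is complete.
    response = state[len(prompt):]
    trajectory[-1][2] = toy_reward_model(response)
    return trajectory


for state, action, reward in generate(["is", "it", "ok", "?"], 5, random.Random(0)):
    print(state, "->", action, "reward:", reward)
```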
Value Functions
The state value function tells us the expected cumulative reward from state $s$ under policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$$
The action-value function (Q-function) conditions on both state and action:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$
The relationship: $V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$.
The advantage function measures how much better action $a$ is compared to the average:
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
If $A^\pi(s, a) > 0$, action $a$ is better than what the policy would typically do. If $A^\pi(s, a) < 0$, it’s worse. This will be important for policy gradient methods.
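A small worked example of these three quantities for a single state. The Q-values and policy probabilities are made-up numbers, chosen so the arithmetic is easy to check by hand.

```python
# Hypothetical Q-values and policy probabilities for one state.
q_values = {"left": 1.0, "right": 3.0, "stay": 2.0}
policy = {"left": 0.25, "right": 0.5, "stay": 0.25}

# V(s) = E_{a ~ pi}[Q(s, a)]
v = sum(policy[a] * q_values[a] for a in q_values)

# A(s, a) = Q(s, a) - V(s)
advantages = {a: q_values[a] - v for a in q_values}

print(v)           # 2.25
print(advantages)  # {'left': -1.25, 'right': 0.75, 'stay': -0.25}

# Averaged under the policy, the advantage is zero by construction.
mean_advantage = sum(policy[a] * advantages[a] for a in q_values)
print(mean_advantage)  # 0.0
```

Only `right` has a positive advantage, so it is the only action here that beats the policy's own average.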
Bellman Equations
Value functions satisfy recursive relationships (the Bellman equations):
$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[R(s, a) + \gamma \mathbb{E}_{s' \sim P}[V^\pi(s')]\right]$$
This says: the value of a state equals the immediate reward plus the discounted value of the next state, in expectation. It’s the foundation for all dynamic programming and TD learning methods.
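One way to see the recursion at work is iterative policy evaluation: start from $V = 0$ and apply the right-hand side of the Bellman equation as an update until the values stop changing. The sketch below assumes a hypothetical two-state MDP and a fixed policy, both invented for this example.

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] = {next_state: prob}, R[s][a] = reward.
P = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
R = {
    "A": {"stay": 0.0, "go": 1.0},
    "B": {"stay": 2.0, "go": 0.0},
}
# Fixed policy to evaluate: pi[s][a]. It always moves A -> B, then stays in B.
pi = {
    "A": {"stay": 0.0, "go": 1.0},
    "B": {"stay": 1.0, "go": 0.0},
}


def bellman_backup(V, s):
    """Right-hand side of the Bellman equation for state s."""
    total = 0.0
    for a, p_a in pi[s].items():
        expected_next = sum(p * V[s2] for s2, p in P[s][a].items())
        total += p_a * (R[s][a] + GAMMA * expected_next)
    return total


def evaluate_policy(tol=1e-10, max_iters=10_000):
    """Repeat the backup for every state until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_V = {s: bellman_backup(V, s) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V
    return V


V = evaluate_policy()
print(V)
```

The result can be checked by hand: staying in `B` forever earns $2 / (1 - 0.9) = 20$, and `A` earns $1 + 0.9 \cdot 20 = 19$.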
The Policy Gradient Theorem
We want to find the policy parameters $\theta$ that maximize the expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$
where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory sampled from the policy.
The policy gradient theorem gives us the gradient of this objective:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot G_t\right]$$
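where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time step $t$. (Strictly, differentiating the discounted objective puts an extra factor of $\gamma^t$ in front of each term; most implementations drop it, and we follow that convention here.)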
This is a beautiful result. The intuition: $\nabla_\theta \log \pi_\theta(a_t | s_t)$ points in the direction that makes action $a_t$ more likely. We weight this by $G_t$ — if the return was high, push the policy toward those actions; if low, push away.
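The return $G_t$ that weights each term does not need a nested sum. Since $G_t = r_t + \gamma G_{t+1}$, one backward pass over the rewards is enough. A minimal sketch (the function name is ours):

```python
def returns_to_go(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed from the last step backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(returns_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```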
The proof relies on the “log-derivative trick”: $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$. This lets us express the gradient as an expectation under the policy itself, which we can estimate with samples.
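A sketch of the argument, using notation that is not in the text above: $p_\theta(\tau)$ for the probability of a trajectory, $R(\tau)$ for its total discounted reward, and $\rho_0$ for the initial-state distribution. It is written with sums over trajectories; the continuous case replaces them with integrals.

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \sum_\tau p_\theta(\tau) R(\tau) = \sum_\tau p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau) R(\tau)\right] \\
\log p_\theta(\tau) &= \log \rho_0(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t | s_t) + \sum_{t=0}^{T} \log P(s_{t+1} | s_t, a_t) \\
\nabla_\theta \log p_\theta(\tau) &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t)
\end{aligned}$$

Only the policy terms depend on $\theta$, so the initial-state and transition terms vanish under the gradient, and we never need a model of the environment. Replacing $R(\tau)$ with the return from step $t$ onward uses one more fact: an action cannot influence rewards that were received before it was taken.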
REINFORCE
REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. It directly implements the policy gradient theorem:
- Sample a trajectory $\tau$ by running the policy
- Compute returns $G_t$ for each time step
- Update: $\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot G_t$
That’s it. Sample, compute returns, update. Simple and unbiased.
The problem: variance. $G_t$ is a noisy estimate because it depends on the entire future trajectory. One lucky trajectory can send the gradient in a wildly wrong direction. In practice, REINFORCE is slow to converge and unstable.
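Here is a minimal sketch of those three steps with a tabular softmax policy. The environment is a hypothetical four-state corridor invented for this example (start at state 0, reward +1 for reaching state 3, -0.1 for every other step), and the hyperparameters are arbitrary. For a softmax over logits, the gradient of $\log \pi(a|s)$ with respect to logit $b$ is $\mathbb{1}[a = b] - \pi(b|s)$, which is what the inner loop computes.

```python
import math
import random

N_STATES, GOAL, MAX_STEPS = 4, 3, 20  # hypothetical corridor: states 0..3
ACTIONS = [-1, +1]                    # move left / move right
GAMMA, ALPHA = 0.99, 0.1              # illustrative hyperparameters


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def run_episode(theta, rng):
    """Step 1: sample a trajectory. Returns states, action indices, rewards."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(MAX_STEPS):
        probs = softmax(theta[s])
        a = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        s_next = min(max(s + ACTIONS[a], 0), GOAL)
        states.append(s)
        actions.append(a)
        rewards.append(1.0 if s_next == GOAL else -0.1)
        s = s_next
        if s == GOAL:
            break
    return states, actions, rewards


def reinforce_update(theta, states, actions, rewards):
    """theta <- theta + alpha * sum_t grad log pi(a_t|s_t) * G_t (in place)."""
    # Step 2: returns G_t, computed backwards.
    g, returns = 0.0, []
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()
    # Step 3: accumulate the full gradient first, then apply it once, so
    # every term uses the policy that actually generated the trajectory.
    grad = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    for s, a, g in zip(states, actions, returns):
        probs = softmax(theta[s])
        for b in range(len(ACTIONS)):
            grad[s][b] += ((1.0 if a == b else 0.0) - probs[b]) * g
    for s in range(N_STATES):
        for b in range(len(ACTIONS)):
            theta[s][b] += ALPHA * grad[s][b]


def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    for _ in range(episodes):
        reinforce_update(theta, *run_episode(theta, rng))
    return theta


theta = train()
# Probability of moving right in each non-terminal state.
print([round(softmax(row)[1], 3) for row in theta[:GOAL]])
```

Changing `seed` is a quick way to see the variance problem: the learning curve can look quite different from run to run.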
Variance Reduction with Baselines
We can subtract a baseline $b(s_t)$ from the return without introducing bias:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot (G_t - b(s_t))\right]$$
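Why is this still unbiased? Because the baseline does not depend on the action, it factors out of the expectation: $\mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \sum_a \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s) = b(s) \sum_a \nabla_\theta \pi_\theta(a|s) = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0$.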
The optimal baseline turns out to be close to $V^\pi(s_t)$ — the expected return from state $s_t$. With this baseline, we’re effectively using the advantage: $G_t - V^\pi(s_t) \approx A^\pi(s_t, a_t)$. Actions better than average get positive gradient, worse than average get negative gradient.
This reduces variance dramatically, but we need to estimate $V^\pi$ somehow, which brings us to actor-critic methods.
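Both claims (same expected gradient, lower variance) can be checked exactly on a tiny example. The sketch below assumes a single state with three actions, a softmax policy, and a fixed return per action; all the numbers are invented. Because there are only three actions, we can enumerate them instead of sampling.

```python
import math

logits = [0.0, 0.5, -0.5]    # hypothetical policy parameters theta
rewards = [10.0, 11.0, 9.0]  # hypothetical return G for each action

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]


def score(a):
    """grad_theta log pi(a): entries 1[a == b] - pi(b) for each logit b."""
    return [(1.0 if a == b else 0.0) - probs[b] for b in range(len(probs))]


def mean_and_variance(baseline):
    """Exact mean and total variance of score(a) * (G(a) - b) under a ~ pi."""
    n = len(probs)
    mean = [
        sum(probs[a] * score(a)[k] * (rewards[a] - baseline) for a in range(n))
        for k in range(n)
    ]
    var = sum(
        probs[a] * (score(a)[k] * (rewards[a] - baseline) - mean[k]) ** 2
        for a in range(n)
        for k in range(n)
    )
    return mean, var


v = sum(p * r for p, r in zip(probs, rewards))  # V = E_pi[G]
mean_no_b, var_no_b = mean_and_variance(0.0)
mean_v, var_v = mean_and_variance(v)

print("no baseline:", mean_no_b, var_no_b)
print("baseline V: ", mean_v, var_v)
```

The two mean vectors agree up to floating-point error, while the variance with the baseline is far smaller. The gap is large here because every return sits near 10, so without a baseline each sample is dominated by that shared offset rather than by the differences between actions.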
Actor-Critic
The actor-critic framework uses two networks:
- Actor ($\pi_\theta$): the policy network — decides which actions to take
- Critic ($V_\phi$): the value network — estimates $V^\pi(s)$ to reduce variance
The critic provides the baseline. Instead of waiting for the full trajectory return $G_t$, we can use the TD (temporal difference) error as the advantage estimate:
$$A_t \approx r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
This is a one-step advantage estimate. It has lower variance than $G_t - V_\phi(s_t)$ (because we bootstrap from the critic instead of using the full return) but introduces bias (because the critic is an approximation).
The actor update: $$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A_t$$
The critic update (minimize the squared TD error, treating the target $r_t + \gamma V_\phi(s_{t+1})$ as a constant when differentiating): $$\phi \leftarrow \phi - \beta \nabla_\phi (V_\phi(s_t) - (r_t + \gamma V_\phi(s_{t+1})))^2$$
Actor-critic is the foundation for essentially all modern policy gradient methods, including PPO.
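A minimal tabular sketch of one actor-critic step, to show how the two updates fit together. It assumes a softmax actor with one logit per (state, action) pair and a critic stored as one number per state. The `done` flag, the function name, and the hyperparameters are our additions for illustration, not something the equations above prescribe.

```python
import math

GAMMA, ALPHA, BETA = 0.99, 0.1, 0.1  # illustrative hyperparameters


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done):
    """Apply one actor and one critic update for the transition (s, a, r, s_next).

    theta[s] holds the action logits for state s (the actor);
    V[s] is the critic's value estimate for state s.
    Both are modified in place; the TD error is returned.
    """
    # TD target, with no bootstrapping from a terminal state.
    target = r + (0.0 if done else GAMMA * V[s_next])
    td_error = target - V[s]  # one-step advantage estimate A_t

    # Actor: theta <- theta + alpha * grad log pi(a|s) * A_t
    probs = softmax(theta[s])
    for b in range(len(probs)):
        grad_log_pi = (1.0 if a == b else 0.0) - probs[b]
        theta[s][b] += ALPHA * grad_log_pi * td_error

    # Critic: gradient step on (V(s) - target)^2 with the target held constant.
    # The gradient is 2 * (V(s) - target); the factor 2 is folded into BETA.
    V[s] += BETA * td_error
    return td_error


theta = [[0.0, 0.0], [0.0, 0.0]]
V = [0.0, 0.0]
td = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
print(td, theta[0], V[0])
```

With a neural network critic, "target held constant" typically means computing the target without tracking gradients through it.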
Generalized Advantage Estimation (GAE)
The one-step TD advantage $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is low variance but biased. The full Monte Carlo advantage $G_t - V(s_t)$ is unbiased but high variance. GAE (Schulman et al., 2015) interpolates between them:
$$A_t^{GAE} = \sum_{l=0}^{T-t} (\gamma \lambda)^l \delta_{t+l}$$
where $\lambda \in [0, 1]$ controls the bias-variance trade-off (taking $V(s_{T+1}) = 0$ once the episode has terminated):
- $\lambda = 0$: pure one-step TD (low variance, high bias)
- $\lambda = 1$: equivalent to Monte Carlo advantage (no bias, high variance)
- In practice: $\lambda \approx 0.95$ works well
This is an exponentially-weighted average of multi-step TD estimates. The decaying weights mean that nearby rewards are trusted more than distant ones (lower variance), while still incorporating long-horizon information (lower bias than pure one-step).
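In code, the sum is usually computed with a backward recursion, $A_t = \delta_t + \gamma \lambda A_{t+1}$, which is equivalent to the formula above. A minimal sketch: the function name and the convention that `values` carries one extra entry for the state after the last reward are our assumptions.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    values must have len(rewards) + 1 entries; the last one is the value of
    the state after the final reward (0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]  # made-up critic outputs; terminal value is 0

# lambda = 0 recovers the one-step TD errors.
print(compute_gae(rewards, values, gamma=1.0, lam=0.0))  # [1.0, 1.0, 0.5]
# lambda = 1 recovers the Monte Carlo advantage G_t - V(s_t).
print(compute_gae(rewards, values, gamma=1.0, lam=1.0))  # [2.5, 1.5, 0.5]
```

The example uses $\gamma = 1$ only to keep the arithmetic easy to verify: the returns are $3, 2, 1$, and subtracting $V = 0.5$ gives the second line.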
GAE is the default advantage estimator in most PPO implementations, including those used for RLHF.
Summary
What we covered:
- MDPs: the formal framework for sequential decision-making
- Value functions: $V^\pi$, $Q^\pi$, advantage $A^\pi$, and Bellman equations
- Policy gradient theorem: how to compute gradients of expected return
- REINFORCE: simplest implementation, high variance
- Baselines and actor-critic: variance reduction through learned value functions
- GAE: the practical bias-variance trade-off for advantage estimation
These are the building blocks. In Part 2, we’ll see how trust regions and clipped objectives lead to PPO — and why PPO became the algorithm of choice for training language models with human feedback.