SafeDPO and Friends: Preference Optimization That Doesn’t Sacrifice Safety
March 30, 2026 | Estimated reading time: 6 min | 1067 words | Author: khanhnn
Does RL Actually Make LLMs Reason Better?
March 28, 2026 | Estimated reading time: 6 min | 1098 words | Author: khanhnn
RLHF Is Just Divergence Estimation in Disguise
March 22, 2026 | Estimated reading time: 6 min | 1137 words | Author: khanhnn
From Policy Gradient to PPO — Part 2: Trust Regions, PPO, and GRPO
March 18, 2026 | Estimated reading time: 7 min | 1280 words | Author: khanhnn
From Policy Gradient to PPO — Part 1: Foundations
March 5, 2026 | Estimated reading time: 6 min | 1153 words | Author: khanhnn