👋 Welcome to my learning notes!

AI Product. Agentic Safety / RLHF / Interpretability. Writing to understand, from first principles to frontier research. I hope you gain something here. For feedback, don't hesitate to contact me by email.

Paper Roundup: LLM Safety & RLHF at NeurIPS 2025 and ICLR 2026

A curated list of papers on alignment, preference optimization, mechanistic interpretability, and reasoning from the two biggest ML conferences this cycle — with personal takes on the ones that matter most.

April 29, 2026 | Estimated reading time: 9 min | 1784 words | Author: khanhnn

Ara: What If Research Papers Were Executable?

A deep look at Agent-Native Research Artifacts (Ara) — a proposed replacement for academic PDFs that packages research as machine-executable knowledge bundles. What it gets right, what it gets wrong, and why it matters for AI-assisted research.

April 28, 2026 | Estimated reading time: 7 min | 1282 words | Author: khanhnn

Pluralistic Alignment: One Model, Many Values

RLHF optimizes for an average human preference — but humans disagree. The Artificial Hivemind problem, counterfactual alignment, and why one-size-fits-all safety is a design choice we should question.

April 15, 2026 | Estimated reading time: 6 min | 1114 words | Author: khanhnn

Sparse Autoencoders: The Swiss Army Knife of Interpretability

SAEs went from a niche interpretability tool to a dominant research theme in one year. Where they’re being applied, what they reveal, and the fundamental limitations nobody has solved yet.

April 8, 2026 | Estimated reading time: 6 min | 1233 words | Author: khanhnn

SafeDPO and Friends: Preference Optimization That Doesn't Sacrifice Safety

DPO has problems — preference reversals, reward degradation, and a safety-helpfulness trade-off. Here’s how SafeDPO, RePO, and other recent variants are fixing them.

March 30, 2026 | Estimated reading time: 6 min | 1067 words | Author: khanhnn

Does RL Actually Make LLMs Reason Better?

The evidence is more complicated than the hype suggests. RL improves sampling efficiency but may not expand reasoning capacity — and longer chains of thought don’t always help.

March 28, 2026 | Estimated reading time: 6 min | 1098 words | Author: khanhnn

RLHF Is Just Divergence Estimation in Disguise

A unifying view of RLHF, DPO, and Constitutional AI — they’re all estimating the divergence between safe and unsafe output distributions. Plus a clean derivation of why DPO works.

March 22, 2026 | Estimated reading time: 6 min | 1137 words | Author: khanhnn

From Policy Gradient to PPO — Part 2: Trust Regions, PPO, and GRPO

How trust regions stabilize policy optimization, why PPO became the default for RLHF, and how GRPO eliminates the critic entirely.

March 18, 2026 | Estimated reading time: 7 min | 1280 words | Author: khanhnn

From Policy Gradient to PPO — Part 1: Foundations

MDPs, value functions, the REINFORCE algorithm, actor-critic methods, and generalized advantage estimation — the RL foundations you need before understanding RLHF.

March 5, 2026 | Estimated reading time: 6 min | 1153 words | Author: khanhnn

VAE Variants and Modern Interpretations

A survey of where the VAE idea went after 2014 — VQ-VAE, hierarchical VAEs, adversarial hybrids, flow-based posteriors — and what the VAE really gave us beyond a specific architecture.

February 25, 2026 | Estimated reading time: 8 min | 1660 words | Author: khanhnn