👋 Welcome to my learning notes!

AI Product. Agentic Safety / RLHF / Interpretability. Writing to understand, from first principles to frontier research. I hope you gain something here. For feedback, don't hesitate to contact me by email.

Paper Roundup: LLM Safety & RLHF at NeurIPS 2025 and ICLR 2026

A curated list of papers on alignment, preference optimization, mechanistic interpretability, and reasoning from the two biggest ML conferences this cycle — with personal takes on the ones that matter most.

Ara: What If Research Papers Were Executable?

A deep look at Agent-Native Research Artifacts (Ara) — a proposed replacement for academic PDFs that packages research as machine-executable knowledge bundles. What it gets right, what it gets wrong, and why it matters for AI-assisted research.

Pluralistic Alignment: One Model, Many Values

RLHF optimizes for an average human preference — but humans disagree. The Artificial Hivemind problem, counterfactual alignment, and why one-size-fits-all safety is a design choice we should question.

Sparse Autoencoders: The Swiss Army Knife of Interpretability

SAEs went from a niche interpretability tool to a dominant research theme in one year. Where they’re being applied, what they reveal, and the fundamental limitations nobody has solved yet.

SafeDPO and Friends: Preference Optimization That Doesn't Sacrifice Safety

DPO has problems — preference reversals, reward degradation, and a safety-helpfulness trade-off. Here’s how SafeDPO, RePO, and other recent variants are fixing them.

Does RL Actually Make LLMs Reason Better?

The evidence is more complicated than the hype suggests. RL improves sampling efficiency but may not expand reasoning capacity — and longer chains of thought don’t always help.

RLHF Is Just Divergence Estimation in Disguise

A unifying view of RLHF, DPO, and Constitutional AI — they’re all estimating the divergence between safe and unsafe output distributions. Plus a clean derivation of why DPO works.

From Policy Gradient to PPO — Part 2: Trust Regions, PPO, and GRPO

How trust regions stabilize policy optimization, why PPO became the default for RLHF, and how GRPO eliminates the critic entirely.

From Policy Gradient to PPO — Part 1: Foundations

MDPs, value functions, the REINFORCE algorithm, actor-critic methods, and generalized advantage estimation — the RL foundations you need before understanding RLHF.

VAE Variants and Modern Interpretations

A survey of where the VAE idea went after 2014 — VQ-VAE, hierarchical VAEs, adversarial hybrids, flow-based posteriors — and what the VAE really gave us beyond a specific architecture.