Paper Roundup: LLM Safety & RLHF at NeurIPS 2025 and ICLR 2026

A curated list of papers on alignment, preference optimization, mechanistic interpretability, and reasoning from the two biggest ML conferences this cycle — with personal takes on the ones that matter most.

April 29, 2026 | Estimated reading time: 9 min | 1784 words | Author: khanhnn

Ara: What If Research Papers Were Executable?

A deep look at Agent-Native Research Artifacts (Ara) — a proposed replacement for academic PDFs that packages research as machine-executable knowledge bundles. What it gets right, what it gets wrong, and why it matters for AI-assisted research.

April 28, 2026 | Estimated reading time: 7 min | 1282 words | Author: khanhnn

Sparse Autoencoders: The Swiss Army Knife of Interpretability

SAEs went from a niche interpretability tool to a dominant research theme in one year. Where they’re being applied, what they reveal, and the fundamental limitations nobody has solved yet.

April 8, 2026 | Estimated reading time: 6 min | 1233 words | Author: khanhnn

A Curated Guide to LLMs, Reinforcement Learning, and AI Safety

Books, papers, conferences, and researchers — a personal resource list for anyone going deep into LLMs, RL, and AI safety.

December 28, 2025 | Estimated reading time: 8 min | 1495 words | Author: khanhnn