Paper Roundup: LLM Safety & RLHF at NeurIPS 2025 and ICLR 2026

A curated list of papers on alignment, preference optimization, mechanistic interpretability, and reasoning from the two biggest ML conferences this cycle — with personal takes on the ones that matter most.

April 29,2026 | Estimated reading time: 9 min | 1784 words | Author: khanhnn