Llm on My Learning Notes

https://learning-notes-dz2.pages.dev/tags/llm/
Recent content in Llm on My Learning Notes
Generated by Hugo 0.124.0; language: en
Last updated: Tue, 16 Jun 2026 07:19:01 +0000

Paper Roundup: LLM Safety & RLHF at NeurIPS 2025 and ICLR 2026
https://learning-notes-dz2.pages.dev/posts/2026-04-29/
Wed, 29 Apr 2026 00:00:00 +0700
A curated list of papers on alignment, preference optimization, mechanistic interpretability, and reasoning from the two biggest ML conferences this cycle, with personal takes on the ones that matter most.

Pluralistic Alignment: One Model, Many Values
https://learning-notes-dz2.pages.dev/posts/2026-04-15/
Wed, 15 Apr 2026 00:00:00 +0700
RLHF optimizes for an average human preference, but humans disagree. The Artificial Hivemind problem, counterfactual alignment, and why one-size-fits-all safety is a design choice we should question.

Sparse Autoencoders: The Swiss Army Knife of Interpretability
https://learning-notes-dz2.pages.dev/posts/2026-04-08/
Wed, 08 Apr 2026 00:00:00 +0700
SAEs went from niche interpretability tool to dominant research theme in one year. Where they’re being applied, what they reveal, and the fundamental limitations nobody has solved yet.

SafeDPO and Friends: Preference Optimization That Doesn't Sacrifice Safety
https://learning-notes-dz2.pages.dev/posts/2026-03-30/
Mon, 30 Mar 2026 00:00:00 +0700
DPO has problems: preference reversals, reward degradation, and a safety-helpfulness trade-off. Here’s how SafeDPO, RePO, and other recent variants are fixing them.

Does RL Actually Make LLMs Reason Better?
https://learning-notes-dz2.pages.dev/posts/2026-03-28/
Sat, 28 Mar 2026 00:00:00 +0700
The evidence is more complicated than the hype suggests. RL improves sampling efficiency but may not expand reasoning capacity, and longer chains of thought don’t always help.

RLHF Is Just Divergence Estimation in Disguise
https://learning-notes-dz2.pages.dev/posts/2026-03-22/
Sun, 22 Mar 2026 00:00:00 +0700
A unifying view of RLHF, DPO, and Constitutional AI: they’re all estimating the divergence between safe and unsafe output distributions. Plus a clean derivation of why DPO works.

Transformers from First Principles — Part 2: What Scale Reveals
https://learning-notes-dz2.pages.dev/posts/2026-02-20/
Fri, 20 Feb 2026 00:00:00 +0700
Sparse attention patterns, head specialization, rotary embeddings, gated attention, and the modern efficiency tricks that make large transformers actually trainable.

Transformers from First Principles — Part 1: Attention Is All You Need (Really)
https://learning-notes-dz2.pages.dev/posts/2026-02-08/
Sun, 08 Feb 2026 00:00:00 +0700
A first-principles walkthrough of the Transformer (self-attention, positional encoding, multi-head attention), with the math that makes it work.

A Curated Guide to LLMs, Reinforcement Learning, and AI Safety
https://learning-notes-dz2.pages.dev/posts/2025-12-28/
Sun, 28 Dec 2025 00:00:00 +0700
Books, papers, conferences, and researchers: a personal resource list for anyone going deep into LLMs, RL, and AI safety.