Interpretability on My Learning Notes

Recent content in Interpretability on My Learning Notes
https://learning-notes-dz2.pages.dev/tags/interpretability/
Last updated: Tue, 16 Jun 2026 07:19:01 +0000

Sparse Autoencoders: The Swiss Army Knife of Interpretability
https://learning-notes-dz2.pages.dev/posts/2026-04-08/
Wed, 08 Apr 2026 00:00:00 +0700
SAEs went from niche interpretability tool to dominant research theme in one year. Where they’re being applied, what they reveal, and the fundamental limitations nobody has solved yet.

Safety Neurons: 5% of Your Model Controls 90% of Safety
https://learning-notes-dz2.pages.dev/posts/2026-01-18/
Sun, 18 Jan 2026 00:00:00 +0700
Mechanistic interpretability meets alignment: how researchers found that a tiny fraction of neurons are responsible for almost all safety behavior in LLMs, and what that means.