Sparse Autoencoders: The Swiss Army Knife of Interpretability
SAEs went from niche interpretability tool to dominant research theme in one year. Where they’re being applied, what they reveal, and the fundamental limitations nobody has solved yet.
SAEs went from niche interpretability tool to dominant research theme in one year. Where they’re being applied, what they reveal, and the fundamental limitations nobody has solved yet.
Mechanistic interpretability meets alignment — how researchers found that a tiny fraction of neurons are responsible for almost all safety behavior in LLMs, and what that means.