Sparse Autoencoders: The Swiss Army Knife of Interpretability

Why SAEs Are Suddenly Everywhere

If you looked at the NeurIPS 2025 and ICLR 2026 proceedings, you’d notice something: Sparse Autoencoders (SAEs) are in everything. At least 10 papers at NeurIPS alone, another 5+ at ICLR. Language models, diffusion models, pathology models, code correctness, in-context learning — SAEs are being applied to all of them.

Two years ago, SAEs were a niche tool in the mechanistic interpretability community. Now they’re arguably the dominant method for understanding what neural networks learn. So what happened?

What SAEs Do

The core idea is simple. Neural network activations (the hidden states between layers) live in a high-dimensional space. Individual dimensions usually don’t correspond to a single interpretable concept. Instead, networks are thought to use superposition, where more concepts are represented than there are dimensions, with each concept encoded as a direction in activation space.
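
To make superposition concrete, here is a toy sketch that packs three concept directions into a two-dimensional space. The directions and the dot-product readout are assumptions chosen for illustration, not something taken from a real model.

```python
import math

# Three unit-length "concept" directions packed into a 2-D activation space,
# spaced 120 degrees apart. (Toy numbers chosen for illustration.)
directions = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Activate only concept 0: the activation vector is just its direction.
activation = directions[0]

# Reading each concept back with a dot product recovers concept 0 at full
# strength, while the other two pick up interference.
readout = [round(dot(activation, d), 3) for d in directions]
print(readout)  # [1.0, -0.5, -0.5]
```

The nonzero readouts for concepts 1 and 2 are the price of squeezing three concepts into two dimensions. If only a few concepts are active at once, that interference stays manageable.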

An SAE is trained to decompose these activations into sparse, interpretable features:

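$$\hat{x} = \text{Dec}(\text{Enc}(x)) = \sum_{i} f_i(x) \cdot d_i \approx x$$

where $x$ is the activation vector being decomposed, $\hat{x}$ is the SAE’s reconstruction of it, $f_i(x)$ are the (sparse) feature activations, and $d_i$ are learned feature directions (dictionary vectors).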

The sparsity constraint means that for any given input, only a small number of features are active. The hope is that this makes each feature interpretable: you can look at which inputs activate feature $i$ and get a coherent description of what it represents.
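
A minimal sketch of the encode/decode step, in pure Python to stay dependency-free. The ReLU encoder with a bias is a common choice but an assumption here, and the weights are hand-set (hypothetical) rather than learned; a real SAE would train them to minimize reconstruction error plus a sparsity penalty.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def encode(x, W_enc, b_enc):
    # f(x) = ReLU(W_enc x + b_enc): one activation per dictionary feature.
    return relu([sum(w * a for w, a in zip(row, x)) + b
                 for row, b in zip(W_enc, b_enc)])

def decode(f, dictionary):
    # Reconstruction = sum_i f_i * d_i, where dictionary[i] is direction d_i.
    dim = len(dictionary[0])
    return [sum(f_i * d[j] for f_i, d in zip(f, dictionary)) for j in range(dim)]

# Hypothetical hand-set weights: 3 features over a 2-D activation.
dictionary = [[1.0, 0.0], [0.0, 1.0], [-0.6, 0.8]]
W_enc = dictionary               # tied weights, an assumption for brevity
b_enc = [-0.5, -0.5, -0.5]       # negative biases suppress weak matches

x = [2.0, 0.0]
f = encode(x, W_enc, b_enc)      # only feature 0 survives the ReLU
x_hat = decode(f, dictionary)
print(f, x_hat)                  # [1.5, 0.0, 0.0] [1.5, 0.0]
```

Note that the reconstruction is shrunk relative to the input (1.5 instead of 2.0). The negative bias that buys sparsity also costs some reconstruction accuracy, which is the basic trade-off every SAE has to balance.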

For example, an SAE trained on GPT-2’s activations might produce features like:

Feature 47: activates on Python function definitions
Feature 203: activates on mentions of geographic locations in Europe
Feature 891: activates on negation words in the context of safety refusals

This gives us a vocabulary for describing what the model is doing at each layer.
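
Labels like these usually come from looking at the inputs on which a feature fires hardest. A rough sketch of that step, with made-up snippets and activations (the function name `top_activating` is hypothetical):

```python
import heapq

def top_activating(feature_acts, texts, k=3):
    """Return the k (activation, text) pairs where the feature fires hardest."""
    pairs = [(a, t) for a, t in zip(feature_acts, texts) if a > 0]
    return heapq.nlargest(k, pairs)

# Hypothetical activations of one feature over five text snippets.
texts = ["def parse(x):", "Paris is lovely", "def main():", "not allowed", "class Foo:"]
acts = [4.1, 0.0, 3.7, 0.0, 0.9]

for act, text in top_activating(acts, texts, k=2):
    print(f"{act:.1f}  {text}")
```

A human or an LLM then reads the top examples and proposes a description such as “Python function definitions”.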

The Application Explosion

Here’s where SAEs are being applied, based on the 2025-2026 conference papers:

Language Models (Expected)

This is where SAEs started and where most work happens. The standard application: train an SAE on a specific layer’s activations, identify features, and use them to understand or steer model behavior.
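
Steering, in its simplest form, means adding a multiple of a feature’s direction to the activation before the model continues its forward pass. A bare-bones sketch, assuming you already have the activation and the feature direction as plain lists (in practice this would happen inside a hook on the model):

```python
def steer(activation, direction, strength):
    """Add `strength` times a feature's decoder direction to an activation."""
    return [a + strength * d for a, d in zip(activation, direction)]

# Hypothetical 3-D activation and a unit feature direction.
activation = [0.2, -0.1, 0.5]
feature_direction = [0.0, 1.0, 0.0]

steered = steer(activation, feature_direction, strength=2.0)
print(steered)
```

The `strength` value is a free parameter here; how large it can get before the model’s output degrades is an empirical question.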

NeurIPS 2025 had several papers pushing this forward:

Feature explanations — improving the quality of automated descriptions of what each feature represents
Feature sensitivity — measuring how reliably features activate on similar inputs (answer: not as reliably as you’d want)
SAE Neural Operators — generalizing SAEs to infinite-dimensional function spaces, extending the linear representation hypothesis to a “functional representation hypothesis”

Diffusion Models (New)

“One-Step is Enough” (NeurIPS 2025) extended SAE interpretability to SDXL Turbo, a text-to-image diffusion model. They found interpretable features corresponding to visual concepts — textures, object types, compositional patterns. This is significant because it shows the SAE approach generalizes beyond language.

Pathology Models (Unexpected)

“Evaluating the Utility of SAEs for Interpreting a Pathology Foundation Model” (NeurIPS 2025) applied SAEs to a medical image model and found that features correlated strongly with cell type counts. So SAEs extract biologically meaningful concepts from pathology models, not just vaguely interpretable directions.
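
One way to check a claim like “features correlate with cell type counts” is a plain Pearson correlation between a feature’s activation per image and the count of a given cell type. The sketch below uses made-up numbers purely for illustration; it is not the paper’s data or necessarily its exact metric.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up values: one feature's mean activation per image, and a cell count.
feature_activation = [0.1, 0.4, 0.5, 0.9, 1.3]
cell_count = [3, 11, 14, 25, 40]

r = pearson(feature_activation, cell_count)
print(round(r, 3))
```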

Code Correctness (ICLR 2026)

“Mechanistic Interpretability of Code Correctness in LLMs via SAEs” identified feature directions that correspond to whether the model believes its generated code is correct. This could enable runtime monitoring of code generation confidence.

In-Context Learning (ICLR 2026)

“Mechanistic Interpretability of In-Context Learning Generalization” used SAEs to find “common structures” in transformer QK circuits that enable in-context learning. This connects SAEs to understanding one of the most mysterious capabilities of LLMs.

The Limitations Nobody Has Solved

Despite the hype, SAEs have fundamental issues that the community is actively wrestling with.

Feature Absorption

This is the most important open problem. “Feature Absorption in Sparse Autoencoders” (NeurIPS 2025) showed that when a model represents hierarchical features (e.g., “animal” → “dog” → “golden retriever”), SAE training can cause a general feature to be absorbed into a more specific one: the general feature stops firing on inputs where the specific feature already accounts for it.

What this means: the SAE might have a “dog” feature that looks clean but stays silent on golden retrievers, because the “golden retriever” feature has absorbed the “dog” direction. You’d see “golden retriever” activate, but the “dog” feature would have a hole you can’t see from its top activations. The hierarchy collapses.

The paper showed that varying SAE size and sparsity level is insufficient to fix this. It’s a structural problem with how SAEs decompose hierarchical representations. “Matching Pursuit SAE” (NeurIPS 2025) proposes a different architecture using Matching Pursuit that handles hierarchical features better, but it’s not a complete solution.
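
For readers who haven’t met it, Matching Pursuit is a classic greedy sparse-coding algorithm: repeatedly pick the dictionary vector that best matches what is still unexplained, subtract its contribution, and repeat. The sketch below is the textbook algorithm on a toy dictionary, not the architecture from the paper, whose details may differ.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(x, dictionary, n_steps):
    """Greedy sparse coding; assumes every dictionary vector has unit norm."""
    residual = list(x)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_steps):
        # Pick the atom most correlated with what is still unexplained.
        scores = [dot(residual, d) for d in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        coeffs[best] += scores[best]
        # Subtract that atom's contribution and repeat on the remainder.
        residual = [r - scores[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return coeffs, residual

# Toy orthonormal dictionary; real dictionaries are overcomplete.
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = [2.0, 0.0, -0.5]
coeffs, residual = matching_pursuit(x, dictionary, n_steps=2)
print(coeffs, residual)
```

The key difference from a one-shot encoder is that each step conditions on the residual left by earlier steps, so a later, more specific feature only has to explain what the general one missed.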

Sensitivity and Reliability

How reliably does a feature activate on similar inputs? “Measuring SAE Feature Sensitivity” (NeurIPS 2025) found that reliability varies significantly across features. Some features activate consistently on related inputs; others are noisy and context-dependent.
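
A simple stand-in for this kind of measurement is the fraction of similar inputs on which a feature fires at all. The metric below is an assumption for illustration and may not match the paper’s definition; the activations are made up.

```python
def sensitivity(activations, threshold=0.0):
    """Fraction of inputs on which the feature fires above `threshold`."""
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)

# Hypothetical activations of two features on eight paraphrases of one input.
reliable_feature = [2.1, 1.8, 2.4, 1.9, 2.0, 2.2, 1.7, 2.3]
noisy_feature = [2.1, 0.0, 0.0, 1.9, 0.0, 2.2, 0.0, 0.0]

print(sensitivity(reliable_feature))  # 1.0
print(sensitivity(noisy_feature))     # 0.375
```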

If you’re building safety monitoring on top of SAE features (e.g., “alert if the safety-refusal feature drops below threshold”), you need features to be reliable. Current SAEs may not be reliable enough for safety-critical applications.
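
The monitoring idea itself is only a few lines. This sketch assumes a single hypothetical “safety-refusal” feature with a ReLU encoder row and bias, and a threshold picked by hand; all names and numbers are illustrative.

```python
def feature_activation(activation, enc_row, enc_bias):
    # One SAE feature: ReLU of a linear readout of the model activation.
    pre = sum(w * a for w, a in zip(enc_row, activation)) + enc_bias
    return max(0.0, pre)

def check_safety_feature(activation, enc_row, enc_bias, threshold):
    """Return (ok, value); ok is False when the feature is below threshold."""
    value = feature_activation(activation, enc_row, enc_bias)
    return value >= threshold, value

# Hypothetical encoder row, bias, and alert threshold.
enc_row, enc_bias, threshold = [0.0, 1.0], -0.2, 0.5

ok_high, _ = check_safety_feature([0.3, 1.5], enc_row, enc_bias, threshold)
ok_low, _ = check_safety_feature([0.3, 0.4], enc_row, enc_bias, threshold)
print(ok_high, ok_low)  # True False
```

The code is the easy part. The hard part is that a noisy feature will trip this alert on benign inputs and stay quiet on some harmful ones, which is exactly the reliability problem described above.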

Identifiability

“Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?” (ICLR 2026) asks the fundamental question: can we uniquely identify the “true” mechanistic description of a network, or are there multiple equally valid decompositions?

If there are multiple valid decompositions, then different SAEs trained on the same model might give different but equally correct feature sets. This doesn’t invalidate SAEs, but it means we need to be careful about claiming we’ve found “the” representation rather than “a” representation.
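
A toy illustration of non-uniqueness (my own construction, not the paper’s argument): two different orthonormal dictionaries that reconstruct the same pair of activations perfectly and use the same total number of active coefficients. Neither reconstruction error nor sparsity can tell them apart.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def code(x, basis):
    # For an orthonormal basis, the coefficients are just dot products.
    return [dot(x, b) for b in basis]

def reconstruct(coeffs, basis):
    return [sum(c * b[j] for c, b in zip(coeffs, basis))
            for j in range(len(basis[0]))]

def n_active(coeffs, eps=1e-9):
    return sum(1 for c in coeffs if abs(c) > eps)

basis_a = [[1.0, 0.0], [0.0, 1.0]]
basis_b = [[0.6, 0.8], [0.8, -0.6]]   # a second orthonormal basis
inputs = [[1.0, 0.0], [0.6, 0.8]]

results = {}
for name, basis in [("A", basis_a), ("B", basis_b)]:
    codes = [code(x, basis) for x in inputs]
    error = max(abs(r - t) for x, c in zip(inputs, codes)
                for r, t in zip(reconstruct(c, basis), x))
    results[name] = (error, sum(n_active(c) for c in codes))
    print(name, "max error:", round(error, 9), "active:", results[name][1])
```

Each basis is sparse on one input and dense on the other, so the two tie overall. With more data one of them might win, but the toy shows why “best reconstruction at a given sparsity” does not automatically pick out a single answer.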

LayerNorm Interference

“Small Transformers Don’t Need LayerNorm at Inference Time” (ICLR 2026) found that LayerNorm hinders mechanistic interpretability. They created LN-free analogs of GPT-2 XL that enable more precise analysis. This suggests that some of the noise in SAE feature quality might come from the interaction between SAEs and LayerNorm, not from the SAE method itself.
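
One commonly cited reason LayerNorm gets in the way is that it is not linear: normalizing a sum of contributions is not the same as summing the normalized contributions, so additive decompositions of the residual stream don’t pass through it cleanly. A quick check, using a LayerNorm without the learned gain and bias (an assumption to keep the sketch short):

```python
import math

def layer_norm(x, eps=1e-5):
    # Subtract the mean and divide by the standard deviation (no affine part).
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

# Two hypothetical contributions to the same residual-stream position.
a = [1.0, 0.0, 0.0, 0.0]
b = [0.0, 3.0, 0.0, 0.0]

ln_of_sum = layer_norm([p + q for p, q in zip(a, b)])
sum_of_ln = [p + q for p, q in zip(layer_norm(a), layer_norm(b))]

gap = max(abs(p - q) for p, q in zip(ln_of_sum, sum_of_ln))
print([round(v, 3) for v in ln_of_sum])
print([round(v, 3) for v in sum_of_ln])
print("max difference:", round(gap, 3))
```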

SAEs and Safety

How does this connect to the safety neurons work and the broader alignment story?

SAEs give us a feature-level vocabulary for talking about model behavior. Safety neurons (from the earlier post) are specific circuits identified at the neuron level. SAEs operate at a higher level of abstraction — they decompose activations into features that may span multiple neurons.

The connection: SAE features for safety-related behavior should activate in the same contexts where safety neurons fire. If they don’t, something is being missed by one or both methods.

The promise: if SAEs give us reliable safety features, we can monitor model alignment in real time during deployment. “Is the model’s safety feature active?” is a much more informative signal than “did the model refuse?” (which only tells you after the output is generated).

The reality: feature absorption might hide safety features inside other features, and sensitivity issues mean feature activations might not be reliable enough for safety-critical monitoring. We’re not there yet.

Where This Goes

My read on the SAE landscape in mid-2026:

Adoption is ahead of understanding. People are applying SAEs to everything because they work — you get interpretable features out. But the fundamental limitations (absorption, sensitivity, identifiability) mean we don’t fully understand what we’re getting or missing.

The tool is improving faster than alternatives. Despite their problems, SAEs are more practical and scalable than other interpretability methods (circuit analysis, linear probes, attention head analysis). The volume of papers suggests the community has converged on SAEs as the primary approach, and incremental improvements are compounding.

Safety applications are the highest-stakes use case. If SAEs become reliable enough for real-time safety monitoring, that changes the alignment game significantly. But “reliable enough” is a high bar, and we’re honestly not sure how close we are.

The Swiss Army Knife metaphor is apt: SAEs are versatile and useful for many things, but you wouldn’t use a Swiss Army Knife for surgery. For safety-critical interpretability, we might need more specialized tools — and SAEs as currently designed might not be enough.
