The Question

Here’s something that sounds like it should be hard to answer: when an LLM refuses a harmful request, which specific neurons are responsible for that refusal?

You might expect safety behavior to be distributed across the whole network, with billions of parameters working together to produce “I can’t help with that.” After all, safety training (RLHF, DPO, etc.) can update every parameter in the model.

But a NeurIPS 2025 paper (“Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons”) found something surprising: safety is concentrated. Intervening on roughly 5% of neurons is enough to restore over 90% of safety performance. They call these “safety neurons.”


How They Found Them

The method is conceptually simple, even if the execution is involved:

Step 1: Generate paired outputs. For a set of harmful prompts, record the model’s activations when it refuses (aligned behavior) and when a jailbroken version complies (unaligned behavior). The difference in activations between these two cases highlights which neurons “activate” for safety.

Step 2: Identify safety-critical neurons. Using the activation differences, rank neurons by how much their activation changes between safe and unsafe behavior. The top neurons by this metric are the “safety neurons.”

Step 3: Validate with patching. Take a jailbroken (unsafe) model and patch in the activations of only the safety neurons from the aligned model. If safety is really concentrated in these neurons, patching just that ~5% of all neurons should restore safety behavior.

It does. Patching ~5% of neurons restores >90% of safety performance. The remaining 95% of neurons can be in their jailbroken state, and the model still refuses harmful requests.
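
The three steps can be sketched on toy data. This is a minimal illustration, not the paper’s code: the hand-written activation lists, the mean-absolute-difference score, and the function names rank_safety_neurons and patch_activations are all assumptions made for this example. In a real setup the activations would be recorded from the two models rather than typed in.

```python
from statistics import mean

def rank_safety_neurons(aligned_acts, unaligned_acts, top_fraction=0.05):
    """Rank neurons by mean absolute activation difference across prompts.

    aligned_acts / unaligned_acts: one list of floats per prompt, one float
    per neuron (hypothetical recorded data).
    Returns the indices of the top `top_fraction` neurons, highest score first.
    """
    n_neurons = len(aligned_acts[0])
    scores = [
        mean(abs(a[i] - u[i]) for a, u in zip(aligned_acts, unaligned_acts))
        for i in range(n_neurons)
    ]
    k = max(1, round(top_fraction * n_neurons))
    ranked = sorted(range(n_neurons), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def patch_activations(unaligned, aligned, neuron_ids):
    """Return a copy of `unaligned` with only the chosen neurons overwritten."""
    patched = list(unaligned)
    for i in neuron_ids:
        patched[i] = aligned[i]
    return patched

# Toy data: 20 neurons, 2 prompts; only neuron 7 differs between the models.
aligned = [[0.1] * 20 for _ in range(2)]
unaligned = [[0.1] * 20 for _ in range(2)]
for row in aligned:
    row[7] = 3.0

safety_ids = rank_safety_neurons(aligned, unaligned)  # 5% of 20 -> 1 neuron
patched = patch_activations(unaligned[0], aligned[0], safety_ids)
```

Here safety_ids picks out neuron 7, and patched is the jailbroken activation vector with only that one entry swapped for the aligned value. The validation in Step 3 would then compare the patched model’s refusal rate against the fully aligned model’s.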


What Safety Neurons Actually Do

The paper goes further than just identifying which neurons matter. They also analyzed what these neurons compute:

Early layers: Safety neurons in early layers appear to detect harmful intent in the input — recognizing patterns associated with requests for dangerous information, manipulation, etc.

Middle layers: These neurons seem to activate “refusal circuitry” — they transform the hidden state in ways that suppress harmful completion pathways.

Late layers: Safety neurons near the output layers steer the token probabilities toward refusal responses (“I can’t”, “I’m not able to”, etc.) and away from harmful completions.

So there’s a pipeline: detect harm → activate refusal → steer output. And each stage has its own concentrated set of neurons.
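
One simple way to look at this staging is to bucket the identified neurons by depth and count them. A rough sketch: the equal-thirds split, the 12-layer model, and the (layer, neuron_index) pairs are all made up for illustration, since the exact layer boundaries are not specified here.

```python
from collections import Counter

def stage_of(layer, n_layers):
    """Map a layer index to 'early', 'middle', or 'late' using equal thirds
    (an assumed split, not one taken from the paper)."""
    third = n_layers / 3
    if layer < third:
        return "early"
    if layer < 2 * third:
        return "middle"
    return "late"

# Hypothetical (layer, neuron_index) pairs in a 12-layer model.
safety_neurons = [(0, 14), (1, 3), (5, 88), (6, 2), (10, 41), (11, 7), (11, 9)]
counts = Counter(stage_of(layer, 12) for layer, _ in safety_neurons)
```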


Why Concentration Happens

You may wonder why safety would be concentrated rather than distributed. I think there are a few reasons:

Safety training is narrow. RLHF and DPO optimize on a relatively small set of safety-related examples compared to the massive pre-training corpus. The gradient updates from safety training affect a limited subset of neurons strongly, rather than all neurons weakly.

Safety is a “mode switch.” Safe behavior often requires a categorical decision (comply or refuse) rather than a gradual adjustment. Categorical decisions tend to be implemented by a small number of high-impact neurons (think of a ReLU unit as a gate: it outputs zero for negative inputs and passes positive inputs through).

Superposition. Neural networks represent many features in superposition (more features than neurons, with each feature as a direction in activation space). Safety might be one such feature — a specific direction that a small set of neurons are aligned with.
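
To make “a direction that a small set of neurons are aligned with” concrete, you can ask how much of a direction’s squared norm sits in its largest coordinates. A small sketch with a made-up direction vector; concentration is a hypothetical helper for this post, not a metric from the paper.

```python
def concentration(direction, top_fraction=0.05):
    """Fraction of the squared norm carried by the largest `top_fraction`
    of coordinates (by magnitude)."""
    energies = sorted((x * x for x in direction), reverse=True)
    k = max(1, round(top_fraction * len(direction)))
    total = sum(energies)
    return sum(energies[:k]) / total if total else 0.0

# Made-up "safety direction" over 100 neurons: 5 large weights, 95 small ones.
safety_direction = [1.0] * 5 + [0.05] * 95
share = concentration(safety_direction)        # roughly 0.95

# A perfectly spread-out direction for comparison.
uniform_share = concentration([1.0] * 100)     # 0.05
```

A direction like the first one is what a concentrated safety feature would look like in the neuron basis: most of its weight on a handful of coordinates, with a small tail spread across everything else.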


Implications for Alignment

This finding has both encouraging and worrying implications.

The Good News

Targeted safety interventions. If we know which neurons are responsible for safety, we can monitor them specifically during training and deployment. If safety neuron activations drop, that’s an early warning.
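
A monitoring check along these lines could be as simple as comparing the current mean activation of the safety neurons against a baseline. This is only a sketch: the neuron indices, the baseline value, the 50% drop threshold, and the function name safety_activation_alert are all assumptions.

```python
from statistics import mean

def safety_activation_alert(activations, safety_ids, baseline, drop_ratio=0.5):
    """Return True when the mean safety-neuron activation falls below
    `drop_ratio` times a baseline measured on prompts the model refused."""
    current = mean(activations[i] for i in safety_ids)
    return current < drop_ratio * baseline

safety_ids = [2, 5]   # hypothetical neuron indices
baseline = 2.0        # hypothetical mean activation on refused prompts

normal = [0.1, 0.0, 2.1, 0.3, 0.2, 1.9]
suppressed = [0.1, 0.0, 0.4, 0.3, 0.2, 0.2]

alert_normal = safety_activation_alert(normal, safety_ids, baseline)          # False
alert_suppressed = safety_activation_alert(suppressed, safety_ids, baseline)  # True
```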

Efficient safety patching. You don’t need to retrain the whole model to fix a safety issue. Patching a small number of neurons could restore safety. This is much cheaper and faster than full RLHF cycles.

Interpretability toolkit. Safety neurons give us a concrete, mechanistic handle on alignment. Instead of treating the model as a black box that sometimes refuses and sometimes doesn’t, we can trace the decision through specific components.

The Bad News

Concentrated = vulnerable. If 5% of neurons control 90% of safety, then attacking those specific neurons could disable safety much more efficiently than attacking the model as a whole. An adversary who knows which neurons to target could craft inputs that suppress safety neuron activations.

Fragility. Concentration means there’s less redundancy. In a distributed safety system, corrupting a few neurons has limited impact because many others compensate. In a concentrated system, corrupting the right few neurons can catastrophically disable safety.
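
A toy calculation shows the difference in redundancy. It assumes safety is a plain weighted sum of neuron contributions, which real networks are not, and the weights are invented so that 5 of 100 neurons carry 90% of the total in the concentrated case.

```python
def remaining_safety(weights, n_ablate):
    """Share of total safety weight left after zeroing the n_ablate
    largest weights (a worst-case, targeted ablation)."""
    ordered = sorted(weights, reverse=True)
    return sum(ordered[n_ablate:]) / sum(weights)

n = 100
distributed = [1.0] * n                        # every neuron contributes equally
concentrated = [18.0] * 5 + [10.0 / 95] * 95   # 5 neurons carry 90 of 100 units

left_distributed = remaining_safety(distributed, 5)     # about 0.95
left_concentrated = remaining_safety(concentrated, 5)   # about 0.10
```

Knocking out the same five neurons costs the distributed system 5% of its safety weight and the concentrated system 90%.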

Jailbreaking explained. Some jailbreak techniques might work precisely because they suppress safety neuron activations. Understanding this mechanism could lead to better defenses — or better attacks. This is the dual-use problem of interpretability research.


Connection to Broader Interpretability

This work fits into the mechanistic interpretability research program that Chris Olah and others have been building. The big picture:

Sparse Autoencoders (SAEs) decompose neural network activations into interpretable features. Recent work (5+ SAE papers at NeurIPS 2025 alone) is making this practical at LLM scale.

Circuit analysis traces how information flows through specific pathways in the network. Safety neurons are one such circuit.

Feature absorption (NeurIPS 2025) shows that hierarchical features can be absorbed into other features during SAE optimization, making some features invisible. Could safety features be similarly absorbed or hidden?

The connection: safety neurons are a specific circuit within the broader feature landscape that SAEs try to map. Understanding this circuit helps, but we need the full map to really trust our interpretability tools.


What This Changes

Before this paper, safety alignment was mostly a training-time concern: get the right training data, use the right loss function, hope it generalizes. Safety neurons shift the conversation to a mechanistic level:

  • We can audit specific components for safety properties
  • We can detect when safety is being undermined (activation monitoring)
  • We can intervene surgically (patching, pruning, amplification)
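
Pruning and amplification can both be viewed as scaling the activations of a chosen set of neurons. A minimal sketch, where scale_neurons and the neuron indices are hypothetical and the activations are a hand-written list rather than values from a real model:

```python
def scale_neurons(activations, neuron_ids, factor):
    """Return a copy with the chosen neurons multiplied by `factor`.
    factor=0.0 prunes them; factor>1.0 amplifies them."""
    ids = set(neuron_ids)
    return [a * factor if i in ids else a for i, a in enumerate(activations)]

acts = [0.5, 2.0, 0.1, 1.5]
safety_ids = [1, 3]   # hypothetical safety-neuron indices

pruned = scale_neurons(acts, safety_ids, 0.0)
amplified = scale_neurons(acts, safety_ids, 2.0)
```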

But the concentration also means we should worry more about the robustness of alignment. A model that implements most of its safety behavior through 5% of its neurons is a model where safety can be largely disabled by targeting those same 5%.

The question going forward: is this concentration an artifact of current training methods that could be improved, or is it a fundamental property of how neural networks implement mode-switching behavior? If it’s fundamental, we need to build defenses that assume concentration rather than trying to eliminate it.