A noob’s guide to mechanistic interpretability
Here’s a fun fact: nobody fully understands why large language models work. We know the math, we know the architecture, we can train them. But ask “why did it output this specific token?” and the honest answer is usually a shrug. Mechanistic interpretability is trying to change that.
In this post, I’ll cover the core concepts and techniques you need to know to get started: what features and superposition are, how to figure out which parts of the model are responsible for specific behaviors (attribution), how to peek inside and understand what’s happening (discovery), and how to actually modify the model based on what you find (intervention).
What even is Mechanistic Interpretability?
We used to treat LLMs as black boxes. We knew the architecture (attention, MLPs, residual connections), but we still didn’t understand why the model did specific things. Which weights, which layers, which neurons were responsible for a particular behavior? No idea.
In mechanistic interpretability (mech interp), we try to actually reverse engineer the model. We look inside, find the components responsible for specific behaviors, and once we identify them, we can steer, amplify, or suppress those behaviors precisely.
This is a relatively new field, but it’s already proving to be critical. Companies like Anthropic have some of the strongest mech interp teams and are considered pioneers in this area. Their investment in interpretability research is a big part of how they build safer, more capable models.
There are three broad categories of techniques in mech interp:
Attribution - figuring out which inputs or components are responsible for the output:
- Input Attribution (Gradient-based, Perturbation-based)
- Direct Logit Attribution
Discovery - understanding what’s happening inside the model:
- Logit Lens
- Probing
- Attention Pattern Analysis
- Ablation Studies
- Circuits
- Sparse Autoencoders (SAEs)
Intervention - making changes based on what we found:
- Activation Steering
- Activation Patching
- Matrix Orthogonalization
- Rank-One Model Editing (ROME)
Before jumping into techniques, let’s cover a few concepts that everything else builds on.
The Residual Stream
Think of a transformer as a highway. The residual stream is that highway: a vector (one per token position) that flows from the input embedding all the way to the final output.
Each layer (attention head, MLP) reads from this stream, does some computation, and adds its result back. That’s the residual connection: x + layer(x). So the final output is basically the input embedding plus the sum of everything every layer contributed.
Why does this matter? Because every layer’s contribution is additive. You can ask “what did layer 7’s MLP add to the residual stream?” and get a meaningful answer. You can also remove a layer’s contribution (ablation) and see what breaks. This additive structure is what makes most of the techniques below possible.
This decomposition was formalized in Anthropic’s A Mathematical Framework for Transformer Circuits.
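To make the additivity concrete, here’s a toy numpy sketch with random linear “layers” standing in for attention and MLP blocks (the layer count, shapes, and scales are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Toy "layers": each reads the stream and adds its contribution back.
# (Hypothetical random linear maps standing in for attention/MLP blocks.)
layers = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]

x_embed = rng.normal(size=d_model)       # input embedding
stream = x_embed.copy()
contributions = []
for W in layers:
    out = W @ stream                     # layer reads the current stream
    contributions.append(out)
    stream = stream + out                # residual connection: x + layer(x)

# The final stream is exactly the embedding plus every layer's contribution.
reconstructed = x_embed + sum(contributions)
print(np.allclose(stream, reconstructed))  # True
```

Because the sum is exact, you can inspect (or delete) any single term in `contributions` and know precisely what it added to the final state.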
Superposition
In an ideal world, each neuron would represent one clean concept: “this token is a verb”, “this sentence is toxic”, “this is French”. But models have far more concepts to represent than they have neurons. So they pack multiple features into the same neurons.
This is superposition. A single neuron might partially activate for “toxicity”, partially for “French language”, and partially for “code syntax”, all at once. The model can get away with this because these features rarely co-occur, so the interference is tolerable.
Think of it this way: imagine you have 3 dimensions but need to store 100 different direction vectors. You can’t make them all orthogonal, but you can make them almost orthogonal. Close enough that you can recover them most of the time. That’s what the model does.
This is why interpreting individual neurons is often meaningless. The real unit of analysis is features (directions in activation space), not neurons. Anthropic’s Toy Models of Superposition paper gives a detailed mathematical treatment of how and why this happens.
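The almost-orthogonal packing from the analogy above is easy to demonstrate with random unit vectors (the dimension counts here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 1000, 50   # 1000 feature directions packed into 50 dims

# Random unit vectors are "almost orthogonal" in moderately high dimensions.
V = rng.normal(size=(n_features, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T
interference = np.abs(cos[~np.eye(n_features, dtype=bool)])
print(f"mean |cosine| between features: {interference.mean():.2f}")

# Recovery: activate feature 7 alone; reading out against all feature
# directions still identifies it despite the interference.
signal = V[7]
readout = V @ signal
print("recovered feature:", int(readout.argmax()))  # 7
```

The mean interference is small even though we packed 20x more directions than dimensions, which is exactly the trade-off superposition exploits.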
Linear Representation Hypothesis
This is the claim that concepts are encoded as directions (vectors) in activation space, not as individual neurons. The idea was formalized by Park et al. in The Linear Representation Hypothesis and the Geometry of Large Language Models.
Here’s the evidence: if you take the activations for many “happy” sentences and many “sad” sentences, you’ll often find a single direction in activation space that cleanly separates them. Move along that direction and you shift the model’s internal representation from happy to sad.
This is exactly why probing works: a linear probe is literally finding that separating direction. And it’s why activation steering works: you add a vector along that direction to push the model toward a concept.
If features were encoded in some nonlinear, tangled way, none of these techniques would work. The fact that they do is what makes mech interp tractable.
Input Attribution
Before looking inside the model, there’s a more basic question: which input tokens were most important for the output? This is input attribution, and it’s one of the oldest families of interpretability techniques.
A comprehensive survey of these methods (along with everything else in this post) can be found in Ferrando et al.’s A Primer on the Inner Workings of Transformer-based Language Models.
Gradient-based Methods
The core idea: compute the gradient of the output with respect to each input token’s embedding. If the gradient is large, that token matters a lot.
Vanilla gradients - just \(\frac{\partial y}{\partial x_i}\). Simple, but noisy and can be unreliable.
Gradient x Input - multiply the gradient by the input embedding itself. This accounts for the actual magnitude of the input, not just sensitivity. Gives more meaningful scores than raw gradients.
Integrated Gradients - vanilla gradients only tell you sensitivity at one point. Integrated Gradients (Sundararajan et al., 2017) fixes this by averaging gradients along a straight path from a “blank” baseline input to your actual input. It satisfies nice theoretical axioms (sensitivity, implementation invariance) and is probably the most widely used gradient method.
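Here’s a toy sketch of Gradient x Input on a deliberately simple linear “model” (the embeddings and scorer are made up, and the model is linear so the gradient can be written down by hand instead of via autograd):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
tokens = ["the", "movie", "was", "great"]
embeds = {t: rng.normal(size=d) for t in tokens}   # hypothetical embeddings

# Toy scorer: y = w . sum_i e_i  (a stand-in for "the model's output logit").
w = rng.normal(size=d)
y = sum(w @ embeds[t] for t in tokens)

# For this linear model, dy/de_i = w for every token, so:
#   vanilla gradient  -> identical score for all tokens (uninformative)
#   gradient x input  -> w . e_i, which differs per token
scores = {t: float(w @ embeds[t]) for t in tokens}
ranked = sorted(scores, key=lambda t: abs(scores[t]), reverse=True)
print("attribution ranking:", ranked)
print("scores sum to output:", np.isclose(sum(scores.values()), y))  # True
```

Note the completeness property that falls out here: per-token Gradient x Input scores sum exactly to the output, which is one reason it is preferred over raw gradients.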
Perturbation-based Methods
Instead of using gradients, just remove or mask tokens and see what happens:
Ablation/Occlusion - remove a token, see how much the output changes. Most intuitive, but expensive since you need one forward pass per token.
LIME - LIME (Ribeiro et al., 2016) perturbs the input many times randomly, then trains a simple linear model that approximates the model’s behavior locally. The linear model’s weights become your feature importances.
SHAP - SHAP (Lundberg and Lee, 2017) is based on Shapley values from game theory. Each token’s importance is its average marginal contribution across all possible subsets of tokens. Theoretically principled but very expensive to compute exactly, so various approximations exist.
Attention Rollout
Raw attention weights from a single layer don’t tell you much because information compounds across layers. Attention Rollout (Abnar and Zuidema, 2020) multiplies attention matrices across layers to get a token-to-token attribution that accounts for the full path through the network. Much better than looking at individual attention heatmaps in isolation.
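A minimal numpy sketch of the rollout recursion, assuming head-averaged attention matrices and the common 0.5 identity mix for the residual connection:

```python
import numpy as np

def rollout(attentions):
    """Attention rollout: add identity for the residual connection,
    renormalize rows, and multiply across layers."""
    n = attentions[0].shape[-1]
    result = np.eye(n)
    for A in attentions:                     # A: (n_tokens, n_tokens)
        A_res = 0.5 * A + 0.5 * np.eye(n)    # account for the residual stream
        A_res /= A_res.sum(axis=-1, keepdims=True)
        result = A_res @ result
    return result

rng = np.random.default_rng(0)
n_layers, n_tokens = 4, 5
# Hypothetical per-layer attention matrices (rows sum to 1).
atts = []
for _ in range(n_layers):
    A = rng.random((n_tokens, n_tokens))
    atts.append(A / A.sum(axis=-1, keepdims=True))

R = rollout(atts)
print(np.allclose(R.sum(axis=-1), 1.0))  # rows still sum to 1: True
```

Row `i` of `R` is then read as “how much each input token contributed to position `i` through the whole stack,” rather than through one layer in isolation.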
Direct Logit Attribution (DLA)
This is different from Logit Lens (which we’ll cover next). DLA asks “how much did each component contribute to the final logit of a specific token?”
Because of the residual stream’s additive structure, the final logit for any token can be decomposed as a sum of contributions from every attention head and every MLP layer:
\[\text{logit}(t) = \sum_{\text{heads}} W_U \cdot h_{\text{head}} + \sum_{\text{layers}} W_U \cdot h_{\text{mlp}} + \text{bias terms}\]

You just project each component’s output through the unembedding matrix \(W_U\). This gives you a clean bar chart showing exactly which heads and MLPs pushed the model toward or away from a specific token. No training required, no interventions needed, just linear algebra on a single forward pass.
DLA was formalized in Anthropic’s A Mathematical Framework for Transformer Circuits and is one of the most commonly used tools in the mech interp workflow.
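Here’s a toy numpy sketch of the decomposition, with random vectors standing in for component outputs and the unembedding matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 50

# Hypothetical per-component outputs that were added to the residual stream.
component_outs = {
    "head_0.1": rng.normal(size=d_model),
    "head_3.2": rng.normal(size=d_model),
    "mlp_5":    rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, vocab))    # unembedding matrix
t = 7                                      # token of interest

# DLA: project each component's output through W_U and read off token t.
contribs = {name: float(h @ W_U[:, t]) for name, h in component_outs.items()}

# Additivity check: contributions sum to the logit of the summed stream.
final_stream = sum(component_outs.values())
print(np.isclose(sum(contribs.values()), final_stream @ W_U[:, t]))  # True
```

In a real model you would pull `component_outs` from a cached forward pass; the additivity check is the same.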
Logit Lens
This is probably the simplest interpretability technique out there, originally introduced by nostalgebraist (2020). At any intermediate layer, you take the residual stream and project it through the model’s final unembedding matrix (the one that normally converts the last hidden state into vocabulary logits).
This gives you a probability distribution over tokens at that layer, essentially asking “if the model stopped thinking right here, what would it predict?”
You can watch how predictions evolve layer by layer:
| Layer | Top prediction | Confidence |
|---|---|---|
| 2 | “the” | low |
| 8 | “Paris” | medium |
| 15 | “Paris” | high |
This shows you where in the model knowledge gets resolved. A variant called Tuned Lens (Belrose et al., 2023) trains a small affine transformation per layer to get cleaner results, since raw projections can be noisy in early layers.
Note the difference from DLA: Logit Lens asks “what would the model predict at this layer?” while DLA asks “how much did this specific component contribute to the final prediction?” Both are useful, but they answer different questions.
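A minimal sketch of the projection step, with a made-up vocabulary and random residual states (in a real model you’d grab the residual stream from a forward hook and typically apply the final LayerNorm before unembedding):

```python
import numpy as np

def logit_lens(resid, W_U, vocab_list):
    """Project an intermediate residual-stream vector through the
    unembedding matrix and return the top token with its probability."""
    logits = resid @ W_U
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    top = int(probs.argmax())
    return vocab_list[top], float(probs[top])

rng = np.random.default_rng(0)
d_model = 16
vocab_list = ["the", "Paris", "London", "cat", "a"]
W_U = rng.normal(size=(d_model, len(vocab_list)))

# Hypothetical residual-stream states at three layers.
for layer in [2, 8, 15]:
    resid = rng.normal(size=d_model) * (layer / 5)
    tok, p = logit_lens(resid, W_U, vocab_list)
    print(f"layer {layer:2d}: top={tok!r}  p={p:.2f}")
```

Running this per layer on real activations produces exactly the kind of layer-by-layer table shown above.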
Probing
In probing, we train a simple linear model (like logistic regression) on the activations of specific layers using contrastive prompts, for example toxic vs. safe, Hindi vs. English, tagged vs. untagged.
If the probe trains successfully, it means the feature is linearly separable at that layer. In simpler words, there exists a direction in the activation space that cleanly separates the two classes. The hyperplane learned during training defines a probe vector, which is essentially the direction where that feature lives inside the model.
We can later use this vector to:
- Steer the model (move activations in that direction)
- Orthogonalize against it (remove that feature entirely)
- Amplify or suppress the behavior depending on our objective
If you’ve read my previous post on removing the refusal direction, that’s exactly what we did: we found the refusal direction via contrastive activations and projected it out.
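Here’s a self-contained sketch of probe training on synthetic “activations”, using hand-rolled logistic regression so there are no dependencies beyond numpy (the data generation plays the role of contrastive prompts):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Synthetic "activations": two classes separated along a hidden direction,
# plus noise (a stand-in for, say, toxic vs. safe contrastive prompts).
n = 200
noise = rng.normal(size=(2 * n, d))
X = noise + np.concatenate([np.ones(n), -np.ones(n)])[:, None] * true_dir * 2
y = np.concatenate([np.ones(n), np.zeros(n)])

# Train a linear probe (logistic regression via plain gradient descent).
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)

probe_dir = w / np.linalg.norm(w)
acc = ((X @ w > 0) == y.astype(bool)).mean()
print(f"probe accuracy: {acc:.2f}")
print(f"cosine(probe, true direction): {abs(probe_dir @ true_dir):.2f}")
```

The probe’s weight vector, normalized, is the direction you later steer along or orthogonalize against.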
Attention Pattern Analysis
Each attention head decides which tokens to “look at” when processing a given token. You can visualize this as a heatmap showing which tokens attend to which.
Some heads have clean, interpretable patterns:
- Previous token heads - always attend to the immediately preceding token. Simple but important for bigram-style predictions.
- Induction heads - look for patterns like “A B … A” and predict “B” will follow. This is one of the core mechanisms behind in-context learning, as shown in Anthropic’s In-context Learning and Induction Heads paper (Olsson et al., 2022).
- Name mover heads - in a sentence like “John gave the book to Mary”, these heads move “Mary” information to the final position for prediction.
By identifying what specific heads do, you start building a picture of the model’s internal circuits, small subnetworks that implement specific behaviors.
Circuits
A circuit is a small subnetwork within the model that implements a specific behavior. Not a single head or single MLP, but a pipeline of components working together.
The most famous example is the Indirect Object Identification (IOI) circuit from Wang et al. (2022). Given a prompt like:
“John and Mary went to the store. John gave a drink to ___”
The model predicts “Mary”. The researchers identified ~26 attention heads across multiple layers that form a pipeline:
- Duplicate token heads - detect that “John” appears twice
- S-inhibition heads - suppress the repeated name
- Name mover heads - copy the remaining name (“Mary”) to the output
This is mech interp at its most powerful. Not just “this neuron lights up” but a full mechanistic explanation of how the model computes the answer, step by step.
Ablation Studies
The simplest causal technique. You remove or disable a component and observe what changes.
There are a few types:
- Zero ablation - set the component’s output to zero. Simple, but can be misleading since zero might itself be an unusual input to downstream layers.
- Mean ablation - replace with the average activation across many inputs. More realistic, it’s like asking “what if this component contributed nothing special?”
- Resample ablation - replace with the activation from a different input. Useful for testing if a component carries specific information.
If you ablate attention head 9.1 and the model stops doing in-context learning, you’ve found evidence that head 9.1 is part of the in-context learning circuit.
Ablation is how you go from correlation (“this head activates during X”) to causation (“this head is necessary for X”).
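A toy sketch of the three ablation variants, with a single additive “head” standing in for a real component (all vectors here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(x, head_out):
    """Toy forward pass: the output mixes the input with one
    component's contribution (a stand-in for an attention head)."""
    return x + head_out

x = rng.normal(size=8)
head_out = rng.normal(size=8)
mean_head_out = rng.normal(size=8) * 0.1   # avg over a hypothetical dataset
other_run_out = rng.normal(size=8)         # same head, different input

clean    = toy_model(x, head_out)
zero_abl = toy_model(x, np.zeros(8))       # zero ablation
mean_abl = toy_model(x, mean_head_out)     # mean ablation
resample = toy_model(x, other_run_out)     # resample ablation

for name, out in [("zero", zero_abl), ("mean", mean_abl), ("resample", resample)]:
    print(f"{name:8s} ablation effect: {np.linalg.norm(clean - out):.2f}")
```

In a real experiment, `toy_model` is a hooked transformer and the “effect” is measured on task performance or a target logit, not raw vector distance.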
Sparse Autoencoders (SAEs)
Remember the superposition problem? One neuron = multiple features. SAEs are designed to untangle that.
In transformer models, the MLP layers are the hardest components to interpret precisely because of superposition. Sparse Autoencoders help by introducing a sparse intermediate layer. The idea is that by forcing most neurons to be zero, each active neuron is more likely to correspond to a single interpretable feature.
The architecture is simple:
- Take the MLP activations as input
- Expand into a higher-dimensional hidden layer
- Apply a sparsity constraint (L1 penalty)
- Reconstruct the original activations
During reconstruction, the model learns a sparse representation where individual neurons map more cleanly to distinct features. The training is similar in spirit to representation learning methods like Word2Vec: you’re learning a useful internal representation through reconstruction.
The downside? SAEs need to be trained, and they typically require more data compared to probing. But the payoff is big: you get a dictionary of interpretable features rather than just a single direction.
Anthropic’s work on SAEs (Scaling Monosemanticity) showed that you can find features for surprisingly specific concepts, things like “Golden Gate Bridge”, “code bugs”, “deceptive behavior”, all as individual SAE neurons.
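Here’s a forward-pass sketch of the architecture in numpy. The weights are random rather than trained, so the “features” are not meaningful, but the shapes and the ReLU-induced sparsity are the real mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64   # expand into a higher-dimensional hidden layer

# Hypothetical (untrained) SAE weights; in practice these are learned by
# minimizing reconstruction error plus an L1 sparsity penalty.
W_enc = rng.normal(size=(d_model, d_hidden)) * 0.1
b_enc = -0.5 * np.ones(d_hidden)     # negative bias encourages sparsity
W_dec = rng.normal(size=(d_hidden, d_model)) * 0.1

def sae_forward(acts):
    features = np.maximum(acts @ W_enc + b_enc, 0.0)   # sparse feature codes
    recon = features @ W_dec                            # reconstruction
    return features, recon

acts = rng.normal(size=d_model)          # an MLP activation vector
features, recon = sae_forward(acts)
sparsity = (features == 0).mean()
print(f"fraction of inactive features: {sparsity:.2f}")
```

After training, each row of `W_dec` is a feature direction, and the handful of nonzero entries in `features` tell you which concepts are active on a given input.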
Intervention Techniques
Once you’ve identified the relevant components and directions, you can actually modify the model’s behavior. Here’s a quick overview:
Activation Steering
Add a direction vector to the residual stream during inference, or subtract one from it. Want the model to be more positive? Add the “positive sentiment” direction. Want to remove toxicity? Subtract the “toxic” direction. No fine-tuning needed: it’s a runtime intervention.
This was formalized as Representation Engineering by Zou et al. (2023), which showed you can control a wide range of model behaviors (honesty, fairness, harmlessness) by steering along specific directions.
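A minimal sketch of the intervention, with a steering direction computed as a difference of means over synthetic “activations” (in practice you’d register a forward hook on a real model’s residual stream):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical "positive sentiment" direction: difference of mean
# activations on positive vs. negative prompts (synthetic data here).
pos_acts = rng.normal(size=(100, d_model)) + 1.5
neg_acts = rng.normal(size=(100, d_model)) - 1.5
steer_dir = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
steer_dir /= np.linalg.norm(steer_dir)

def steering_hook(resid, direction, alpha=3.0):
    """Runtime intervention: nudge the residual stream along a direction."""
    return resid + alpha * direction

resid = rng.normal(size=d_model)
steered = steering_hook(resid, steer_dir)
# The steered stream now projects more strongly onto the concept direction.
print(f"before: {resid @ steer_dir:+.2f}  after: {steered @ steer_dir:+.2f}")
```

The scale `alpha` is a knob you tune by eye: too small does nothing, too large degrades fluency.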
Activation Patching
Replace activations at a specific layer from one run with activations from another. For example, patch layer 10’s output from a “clean” run into a “corrupted” run. If performance recovers, layer 10 carries the critical information. This is the main tool for causal analysis in circuits research.
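A toy two-layer sketch of the patch-and-check loop (real experiments hook one specific site in a transformer; here everything flows through a single activation, so the recovery is total):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def model(x, patch=None):
    """Toy two-layer model; `patch` optionally replaces the mid activation."""
    mid = W1 @ x
    if patch is not None:
        mid = patch            # activation patching at "layer 1"
    return W2 @ mid

clean_x = rng.normal(size=d)
corrupt_x = rng.normal(size=d)

clean_mid = W1 @ clean_x                  # cache from the clean run
corrupt_out = model(corrupt_x)
patched_out = model(corrupt_x, patch=clean_mid)

# If patching restores the clean output, the patched site carries
# the information that matters for this behavior.
print(np.allclose(patched_out, model(clean_x)))  # True
```

In circuits work you repeat this for every layer and position, and plot how much of the clean behavior each patch recovers.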
Matrix Orthogonalization
Permanently modify the model’s weight matrices to remove a specific direction. You project out the unwanted direction from the output weights:
\[P = I - \vec{d}\vec{d}^T\]

Then apply \(W' = P \cdot W\) to the relevant layers. This is what I did to remove the refusal direction from Param-1: a permanent edit, no hooks needed at inference time. The technique was demonstrated in Refusal in Language Models Is Mediated by a Single Direction (Arditi et al., 2024).
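The projection can be sketched in a few lines of numpy (the direction and weight matrix are random stand-ins for a real extracted direction and a real layer):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

d_vec = rng.normal(size=d_model)          # e.g. a refusal direction
d_vec /= np.linalg.norm(d_vec)            # must be unit norm

W = rng.normal(size=(d_model, d_model))   # some output weight matrix

# Projection that removes the d component: P = I - d d^T
P = np.eye(d_model) - np.outer(d_vec, d_vec)
W_prime = P @ W

# No input can now produce output along d through this matrix.
x = rng.normal(size=d_model)
print(f"before: {abs(d_vec @ (W @ x)):.2f}  "
      f"after: {abs(d_vec @ (W_prime @ x)):.2e}")
```

Applying this to every matrix that writes into the residual stream guarantees the model can never reconstruct the direction downstream.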
Rank-One Model Editing (ROME)
A technique for editing factual associations in the model, introduced by Meng et al. (2022). Want to change “The Eiffel Tower is in Paris” to “The Eiffel Tower is in London”? ROME identifies the specific MLP layer that stores this fact and applies a rank-one update to the weight matrix. It’s precise, targeted, and only affects the specific fact you’re editing.
Where to go from here?
If you want to get into mech interp, here are some good starting points:
- Anthropic’s Transformer Circuits Thread - the gold standard for mech interp research
- A Primer on the Inner Workings of Transformer-based Language Models - comprehensive survey of all the techniques covered in this post and more
- Neel Nanda’s TransformerLens - the go-to library for mech interp experiments
- ARENA 3.0 Curriculum - structured exercises that walk you through mech interp concepts hands-on
- 200 Concrete Open Problems in Mechanistic Interpretability by Neel Nanda - if you’re looking for research directions
This field is still very young. Most of the work has been done on text models, and there’s barely any work on audio or multimodal models. If that sounds interesting, stay tuned: I’ll be writing about mech interp for audio language models next.