Circuit Tracing: Revealing Computational Graphs in Language Models
We introduce a method to uncover mechanisms underlying behaviors of language models. We produce graph descriptions of the model's computation on prompts of interest by tracing individual computational steps in a "replacement model". This replacement model substitutes a more interpretable component (here, a "cross-layer transcoder") for parts of the underlying model (here, the multi-layer perceptrons) that it is trained to approximate. We develop a suite of visualization and validation tools we use to investigate these "attribution graphs" supporting simple behaviors of an 18-layer language model, and lay the groundwork for a companion paper applying these methods to a frontier model, Claude 3.5 Haiku.
§ 1
Deep learning models produce their outputs using a series of transformations distributed across many computational units (artificial "neurons"). The field of mechanistic interpretability seeks to describe these transformations in human-understandable language. To date, our team has followed a two-step approach. First, we identify features, interpretable building blocks that the model uses in its computations. Second, we describe the processes, or circuits, by which these features interact to produce model outputs.
A natural approach is to use the raw neurons of the model as these building blocks. (An alternative is to study the roles of coarse-grained model components such as entire MLP blocks and attention heads. This approach has identified interesting roles these components play in specific behaviors, but large components play a multitude of unrelated roles across the data distribution, so we seek a more granular decomposition.) Using individual neurons as building blocks, previous work successfully identified interesting circuits in vision models, built out of neurons that appear to represent meaningful visual concepts. However, model neurons are often polysemantic, representing a mixture of many unrelated concepts. One reason for polysemanticity is thought to be the phenomenon of superposition, in which models must represent more concepts than they have neurons, and thus must "smear" their representation of concepts across many neurons. This mismatch between the network's basic computational units (neurons) and meaningful concepts has proved a major impediment to progress on the mechanistic agenda, especially in understanding language models.
In recent years, sparse coding models such as sparse autoencoders (SAEs), transcoders, and crosscoders have emerged as promising tools for identifying interpretable features represented in superposition. These methods decompose model activations into sparsely active components ("features") which turn out in many cases to correspond to human-interpretable concepts. (We use "feature" following the tradition of "feature detectors" in neuroscience and "feature learning" in machine learning. Some recent literature uses the term "latent," which refers to specific vectors in the model's latent space; we find "feature" better captures the computational role these elements play, making it more appropriate for describing transcoder neurons than SAE decoder vectors.) While current sparse coding methods are an imperfect way of identifying features (see § 7 Limitations), they produce interpretable enough results that we are motivated to study circuits composed of these features. Several authors have already made promising steps in this direction.
Although the basic premise of studying circuits built out of sparse coding features sounds simple, the design space is large. In this paper we describe our current approach, which involves several key methodological decisions:
- Transcoders. We extract features using a variant of transcoders rather than SAEs, which allows us to construct an interpretable "replacement model" that can be studied as a proxy for the original model. Importantly, this approach allows us to analyze direct feature-feature interactions.
- Cross-Layer. We base our analysis on cross-layer transcoders (CLTs), in which each feature reads from the residual stream at one layer and contributes to the outputs of all subsequent MLP layers of the original model, which greatly simplifies the resulting circuits. Remarkably, we can substitute our learned CLT features for the model's MLPs while matching the underlying model's outputs in ~50% of cases.
- Attribution Graphs. We focus on studying "attribution graphs", which describe the steps a model used to produce an output for a target token on a particular prompt, using an approach similar to Dunefsky et al. The nodes in the attribution graph represent active features, token embeddings from the prompt, reconstruction errors, and output logits. The edges in the graph represent linear effects between nodes, so the activity of each feature is the sum of its input edges (up to its activation threshold) (see § 3 Attribution Graphs).
- Linear Attribution Between Features. We design our setup so that, for a specific input, the direct interactions between features are linear. This makes attribution a well-defined, principled operation. Crucially, we freeze attention patterns and normalization denominators (following ) and use transcoders to achieve this linearity. (Direct feature-feature interactions are linear because transcoder features "bridge over" the MLP nonlinearities, replacing their computation, and because we've frozen the remaining nonlinearities: attention patterns and normalization denominators. Strictly, we mean that the pre-activation of a feature is linear with respect to the activations of earlier features. Freezing attention patterns is a standard approach which divides understanding transformers into two steps: understanding behavior given attention patterns, and understanding why the model attends to those positions. This approach was explored in depth for attention-only models in A Mathematical Framework, which also discussed a generalization to MLP layers that is essentially the approach used in this paper. Note that factoring out attention patterns in this way leads to the issues noted in § 7.1 Limitations: Missing Attention Circuits; however, we can then take the same solution Framework takes of studying QK circuits.) Features also have indirect interactions, mediated by other features, which correspond to multi-step paths.
- Pruning. Although our features are sparse, there are still too many features active on a given prompt to easily interpret the resulting graph. To manage this complexity, we prune graphs by identifying the nodes and edges which most contribute to the model's output at a specific token position (see § 5.2.4 Appendix: Graph Pruning). Doing so allows us to produce sparse, interpretable graphs of the model's computation for arbitrary prompts.
- Interface. We designed an interactive interface for exploring attribution graphs, and the features they're composed of, that allows a researcher to quickly identify and highlight key mechanisms within them.
- Validation. Our approach to studying circuits is indirect: our replacement model may use different mechanisms from the underlying model. Thus, it is important that we validate the mechanisms we find in attribution graphs. We do so using perturbation experiments. Specifically, we measure the extent to which applying perturbations in a feature's direction produces changes to other feature activations (and to the model's output) that are consistent with the attribution graph. We find that across prompts, perturbation experiments are generally qualitatively consistent with our attribution graphs, though there are some deviations.
- Global Weights. While our paper mostly focuses on studying attribution graphs for individual prompts, our methods also allow us to study the weights of the replacement model ("global weights") directly, which underlie mechanisms across many prompts. In § 4 Global Weights, we demonstrate some challenges with doing so: naive global weights are often less interpretable than attribution graphs due to weight interference. However, we successfully apply them to understand the circuits underlying small number addition.
The goal of this paper is to describe and validate our methodology in detail, using a few case studies for illustration.
- We begin with methods. We describe the setup of our replacement model ( § 2 Building an Interpretable Replacement Model ) and how we construct attribution graphs ( § 3 Attribution Graphs ), concluding with two case studies ( § 3.7 Factual Recall Case Study, § 3.8 Addition Case Study ). We then go on to explore approaches to constructing global circuits, including challenges and some preliminary methods for addressing them ( § 4 Global Weights ).
- We then provide a detailed quantitative evaluation of our cross-layer transcoders and the resulting attribution graphs ( § 5 Evaluations ), showing metrics by which CLTs provide Pareto-improvements over neurons and per-layer transcoders. Afterwards, we provide an overview of our companion paper, in which we apply our method to various behaviors of Claude 3.5 Haiku ( § 6 Biology ). We follow with a discussion of methodological limitations ( § 7 Limitations ). These include the role of attention patterns, the impact of reconstruction errors, the identification of suppression motifs, and the difficulty of understanding global circuits. Addressing these limitations, and seeing what additional model mechanisms are then revealed, is a promising direction for future work.
- We close with a broader discussion ( § 8 Discussion ) of the design space of methods for producing attribution graphs (parts of our approach can be freely remixed with others while retaining much of the benefit) and a review of related work ( § 9 Related Work ).
- Our companion paper, On the Biology of a Large Language Model, applies these methods to Claude 3.5 Haiku, investigating a diverse range of behaviors such as multi-hop reasoning, planning, and hallucinations.
We note that training a cross-layer transcoder can incur significant up-front cost and effort, which is amortized over its application to circuit discovery. We have found that this improves circuit interpretability and parsimony enough to justify the investment (see cost estimates for open-weights models and discussion of cost-matched performance relative to per-layer transcoders). Nevertheless, we stress that alternatives like per-layer transcoders or even MLP neurons can be used instead (keeping the same downstream steps described above), and still produce useful insights. Moreover, it is likely that better methods than CLTs will be developed in the future.
To aid replication, we share guidance on CLT implementation, details on the pruning method, and the front-end code supporting the interactive graph analysis interface.
§ 2 Building an Interpretable Replacement Model
§ 2.1
A cross-layer transcoder (CLT) consists of neurons ("features") divided into L layers, the same number of layers as the underlying model. The goal of the model is to reconstruct the outputs of the MLPs of the underlying model, using sparsely active features. The features receive input from the model's residual stream at their associated layer, but are "cross-layer" in the sense that they can provide output to all subsequent layers. (Other architectures besides CLTs can be used for circuit analysis, but we've empirically found this approach to work well.) Concretely:
- Each feature in the \ell^\text{th} layer "reads in" from the residual stream at that layer using a linear encoder followed by a nonlinearity.
- An \ell^\text{th} layer feature contributes to the reconstruction of the MLP outputs in layers \ell, \ell+1,\ldots, L, using a separate set of linear decoder weights for each output layer.
- All features in all layers are trained jointly. As a result, the output of the MLP in a layer \ell^\prime is jointly reconstructed by the features from all previous layers.
More formally, to run a cross-layer transcoder, let \mathbf{x^{\ell}} denote the original model's residual stream activations at layer \ell. The CLT feature activations \mathbf{a^{\ell}} at layer \ell are computed using the JumpReLU activation function as
\mathbf{a^{\ell}} = \text{JumpReLU}\!\left(W_{enc}^{\ell} \mathbf{x^{\ell}}\right)
where W_{enc}^{\ell} is the CLT encoder matrix at layer \ell.
We let \mathbf{y^{\ell}} refer to the output of the original model's MLP at layer \ell. The CLT's attempted reconstruction \mathbf{\hat{y}^{\ell}} of \mathbf{y^{\ell}} is computed as:
\mathbf{\hat{y}^{\ell}} = \sum_{\ell'=1}^{\ell} W_{dec}^{\ell' \to \ell} \mathbf{a^{\ell'}}
where W_{dec}^{\ell' \to \ell} is the CLT decoder matrix for features at layer \ell' outputting to layer \ell.
To train a cross-layer transcoder, we minimize a sum of two loss functions. The first is a reconstruction error loss, summed across layers:
L_\text{MSE} = \sum_{\ell=1}^L \|\mathbf{\hat{y}^{\ell}} - \mathbf{y^{\ell}}\|^2
The second is a sparsity penalty (with two hyperparameters: an overall coefficient \lambda and a constant c inside the tanh), summed across layers:
L_\text{sparsity} = \lambda\sum_{\ell=1}^L \sum_{i=1}^N \textrm{tanh}(c \cdot \|\mathbf{W_{dec, i}^{\ell}}\| \cdot a^{\ell}_i)
where N is the number of features per layer and \mathbf{W_{dec, i}^{\ell}} is the concatenation of all decoder vectors of feature i.
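To make the architecture concrete, below is a minimal PyTorch sketch of a cross-layer transcoder with the per-layer encoders, per-layer-pair decoders, JumpReLU activation, and the two losses above. The tensor layout, the scalar JumpReLU threshold theta, and the hyperparameter defaults are illustrative assumptions, not our training configuration.

```python
# Minimal sketch of a cross-layer transcoder (CLT). Shapes, the scalar JumpReLU
# threshold, and hyperparameter values are illustrative assumptions.
import torch
import torch.nn as nn


class CrossLayerTranscoder(nn.Module):
    def __init__(self, n_layers: int, d_model: int, n_features: int, theta: float = 0.03):
        super().__init__()
        self.n_layers = n_layers
        self.theta = theta  # JumpReLU threshold (assumed scalar; could be per-feature)
        # Encoder at each layer: residual stream -> feature pre-activations.
        self.W_enc = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(n_features, d_model)) for _ in range(n_layers)]
        )
        # Decoder for features at layer l writing to the MLP output at layer lp >= l.
        self.W_dec = nn.ParameterDict(
            {
                f"{l}->{lp}": nn.Parameter(0.01 * torch.randn(d_model, n_features))
                for l in range(n_layers)
                for lp in range(l, n_layers)
            }
        )

    def jump_relu(self, pre: torch.Tensor) -> torch.Tensor:
        # JumpReLU: pass the pre-activation through unchanged above the threshold, zero it below.
        return pre * (pre > self.theta)

    def forward(self, resid):
        # resid[l]: residual-stream input to MLP l on this prompt, shape [n_tokens, d_model].
        acts = [self.jump_relu(x @ self.W_enc[l].T) for l, x in enumerate(resid)]
        # Reconstruction of MLP output at layer lp sums decodings of features from layers l <= lp.
        recon = [
            sum(acts[l] @ self.W_dec[f"{l}->{lp}"].T for l in range(lp + 1))
            for lp in range(self.n_layers)
        ]
        return acts, recon

    def loss(self, resid, mlp_out, lam: float = 1e-3, c: float = 10.0):
        acts, recon = self.forward(resid)
        # Reconstruction loss: squared error summed over layers (and tokens).
        l_mse = sum(((y_hat - y) ** 2).sum() for y_hat, y in zip(recon, mlp_out))
        # Sparsity penalty: tanh of (decoder norm * activation), summed over features.
        l_sparsity = 0.0
        for l, a in enumerate(acts):
            dec_norm = torch.cat(  # norm of the concatenated decoder vectors of each feature
                [self.W_dec[f"{l}->{lp}"] for lp in range(l, self.n_layers)], dim=0
            ).norm(dim=0)
            l_sparsity = l_sparsity + torch.tanh(c * dec_norm * a).sum()
        return l_mse + lam * l_sparsity
```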
We trained CLTs of varying sizes on a small 18-layer transformer model ("18L"; since 18L has no MLP in layer 0, our CLT has 17 layers) and on Claude 3.5 Haiku. The total number of features across all layers ranged from 300K to 10M (for 18L) and from 300K to 30M (for Haiku). For more training details see § D Appendix: CLT Implementation Details.
§ 2.2
Given a trained cross-layer transcoder, we can define a "replacement model" that substitutes the cross-layer transcoder features for the model's MLP neurons; that is, each layer's MLP output is replaced by its reconstruction from all CLT features that write to that layer. Running a forward pass of this replacement model is identical to running the original model, with two modifications:
- Upon reaching the input to the MLP in layer \ell, we compute the activations of the cross-layer transcoder features whose encoders live in layer \ell.
- Upon reaching the output of the MLP in layer \ell, we overwrite it with the summed outputs of the cross-layer transcoder features in this and previous layers, using their decoders for layer \ell.
Attention layers are applied as usual, without any freezing or modification. Although our CLTs were only trained using input activations from the underlying model, "running" the replacement model involves running CLTs on "off-distribution" inputs: intermediate activations produced by the replacement model itself.
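The following pseudocode sketches the two modifications above, overwriting each MLP output with the accumulated CLT decodings. The interfaces model.embed, model.attn_block, model.pre_mlp_norm, model.final_norm, and model.unembed are stand-ins for the underlying transformer's components (assumptions for illustration), and clt is the CrossLayerTranscoder sketched earlier.

```python
# Schematic replacement-model forward pass. The model.* methods are assumed
# stand-ins for the underlying transformer's components, not a real API.
def replacement_model_forward(model, clt, token_ids):
    x = model.embed(token_ids)                       # residual stream, [n_tokens, d_model]
    acts = []                                        # CLT feature activations, per layer
    for l in range(model.n_layers):
        x = x + model.attn_block(l, x)               # attention runs unmodified
        mlp_in = model.pre_mlp_norm(l, x)            # what the MLP (and CLT encoder) would see
        acts.append(clt.jump_relu(mlp_in @ clt.W_enc[l].T))
        # Overwrite the MLP output with the summed CLT decodings from this and
        # all earlier layers that write to layer l.
        y_hat = sum(acts[lp] @ clt.W_dec[f"{lp}->{l}"].T for lp in range(l + 1))
        x = x + y_hat
    return model.unembed(model.final_norm(x))        # logits at each position
```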
As a simple evaluation, we measure the fraction of completions for which the most likely token output of the replacement model matches that of the underlying model. The fraction improves with scale, and is better for CLTs compared to a per-layer transcoder baseline (i.e., each layer has a standard single-layer transcoder trained on it; the number of features shown refers to the total number across all layers). We also compare to a baseline of thresholded neurons, varying the threshold below which neurons are zeroed out (empirically, we find that higher neuron activations are increasingly interpretable, and we indicate below where their interpretability roughly matches that of features according to our auto-evaluations in § 5.1.2 Quantitative CLT Evaluations). Our largest 18L CLT matches the underlying model's next-token completion on 50% of a diverse set of pretraining-style prompts from an open source dataset (see § R Additional Evaluation Details). (We use a randomly sampled set of prompts and target tokens, restricting to those which the model predicts correctly, but with a confidence lower than 80%, to filter out "boring" tokens.)
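A sketch of this top-1 agreement metric, under the same assumed interfaces (eval_prompts is a hypothetical list of pre-filtered tokenized prompts):

```python
# Fraction of prompts on which the replacement model's top next token matches
# the underlying model's. Prompt filtering follows the text; interfaces are assumed.
import torch

def top1_agreement(model, clt, eval_prompts):
    matches = 0
    for token_ids in eval_prompts:
        with torch.no_grad():
            base_logits = model(token_ids)[-1]   # underlying model, last position
            repl_logits = replacement_model_forward(model, clt, token_ids)[-1]
        matches += int(base_logits.argmax() == repl_logits.argmax())
    return matches / len(eval_prompts)
```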
§ 2.3
While running the replacement model can sometimes reproduce the same outputs as the underlying model, there is still a significant gap, and reconstruction errors can compound across layers. Since we are ultimately interested in understanding the underlying model, we would like to approximate it as closely as possible. To that end, when studying a fixed prompt p, we construct a local replacement model, which
- Substitutes the CLT for the MLP layers (as in the replacement model);
- Uses the attention patterns and normalization denominators from the underlying model's forward pass on p (as in );
- Adds an error adjustment to the CLT output at each (token position, layer) pair equal to the difference between the true MLP output on p and the CLT output on p (as in ).
After this error adjustment and freezing of attention and normalization nonlinearities, we've effectively re-written the underlying model's computation on the prompt p in terms of different basic units; all of the error-corrected replacement model's activations and logit outputs exactly match those of the underlying model. However, this does not guarantee that the local replacement model and underlying model use the same mechanisms. We can measure differences in mechanism by measuring how differently these models respond to perturbations; we refer to the extent to which perturbation behavior matches as "mechanistic faithfulness", discussed in § 5.3 Evaluating Mechanistic Faithfulness. (This is similar in spirit to a Taylor approximation of a function f at a point a: both agree locally in a neighborhood of a but diverge in behavior as you move away.)
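A minimal sketch of the error adjustment, assuming cache is a dictionary holding the underlying model's MLP inputs and outputs from its forward pass on p (an assumed interface):

```python
# Error nodes for the local replacement model on a fixed prompt p. `cache` is
# an assumed dictionary of the underlying model's cached activations on p.
def compute_error_nodes(clt, cache):
    # cache["mlp_in"][l], cache["mlp_out"][l]: [n_tokens, d_model] on prompt p.
    acts, recon = clt.forward(cache["mlp_in"])
    # Error at (layer l, position c) = true MLP output minus CLT reconstruction.
    errors = [y - y_hat for y, y_hat in zip(cache["mlp_out"], recon)]
    return acts, errors

# In the local replacement model's forward pass on p, each MLP output is replaced
# by (CLT reconstruction + errors[l]), and attention patterns and normalization
# denominators are reused from `cache` rather than recomputed; by construction,
# every activation and logit then matches the underlying model on p exactly.
```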
The local replacement model can be viewed as a very large fully connected neural network, spanning across tokens, on which we can do classic circuit analysis:
- Its input is the concatenated set of one-hot vectors for each token in the prompt.
- Its neurons are the union of the CLT features active at every token position.
- Its weights are the summed interactions over all the linear paths from one feature to another, including via the residual stream and through attention, but not passing through MLP or CLT layers. Because attention patterns and normalization denominators are frozen, the impact of a source feature's activation on a target feature's pre-activation via each path is linear in the activation of the source feature. We sometimes refer to these as "virtual weights" because they are not instantiated in the underlying model.
- Additionally, it has bias-like nodes corresponding to error terms, with a connection from each bias to each downstream neuron in the model.
The only nonlinearities in the local replacement model are those applied to feature preactivations.
The local replacement model serves as the basis of our attribution graphs, where we study the feature-feature interactions of the local replacement model on the prompt for which it was made. These graphs are the primary object of study of this paper.
§ 3 Attribution Graphs
We will introduce our methodology for constructing attribution graphs while working through a case study regarding the model's ability to write acronyms for arbitrary titles. In the example we study, the model successfully completes a fictional acronym. Specifically, we give the model the prompt The National Digital Analytics Group (N and sample its completion: DAG). The tokenizer the model was trained with uses a special "Caps Lock" token, which means the prompt and completion are tokenized as follows:
The | National | Digital | Analytics | Group | ( | [Caps Lock] | n | dag | .
We explain the computation the model performs to output the "DAG" token by constructing an attribution graph showing the flow of information from the prompt through intermediary features and to that output. (Due to the "Caps Lock" token, the actual target token is "dag"; we write the token in uppercase here and in the rest of the text for ease of reading.) Below, we show a simplified diagram of the full attribution graph. The diagram shows the prompt at the bottom and the model's completion on top. Boxes represent groups of similar features, and can be hovered over to display each feature's visualization. We discuss our interpretation of features in § 3.3 Understanding and Labeling Features. Arrows represent the direct effect of a group of features or a token on other features and the output logit.
[Simplified attribution graph diagram for the acronym prompt]
The graph for the acronym prompt shows three main paths, originating from each of the tokens that compose the desired acronym. Paths originate from features for a given word, promoting features about "saying the first letter of that word in the correct position", which themselves have positive edges to a "say DAG" feature and the logit. "say X" labels describe "output features", which promote a specific token X, and arbitrary single letters are denoted with underscores. The "Word → say _W" edges represent attention heads' OV circuits writing to a subspace that is then amplified by MLPs at the target position. Each group of features also has a direct edge to the logit in addition to the sequential paths, representing effects mediated only via attention head OVs (i.e., paths to the output in the local replacement model that don't "touch" another MLP layer).
In order to output "DAG", the model also needs to decide to output an acronym, and to account for the fact that the prompt already contains N; indeed, we see features for "in an acronym" and "N at the start of an acronym" with positive edges to the logit. The word National has minimal influence on the logit. We hypothesize that this is due to its main contribution being through influencing attention patterns, which our method does not explain (see § 7.1 Limitations: Missing Attention Circuits).
In the rest of this section, we explain how we compute and visualize attribution graphs.
§ 3.1
To interpret the computations performed by the local replacement model, we compute a causal graph that depicts the sequences of computational steps it performs on a particular prompt. The core logic by which we construct the graph is essentially the same as that of Dunefsky et al., extended to handle cross-layer transcoders. Our graphs contain four types of nodes:
- The output nodes correspond to candidate output tokens. We only construct output nodes for the tokens required to reach 95% of the probability mass, up to a total of 10. (We chose this threshold arbitrarily; empirically, fewer than three logits are required to capture 95% of the probability in the cases we study.)
- The intermediate nodes correspond to active cross-layer transcoder features at each prompt token position.
- The primary input nodes of the graph correspond to the embeddings of the prompt tokens.
- Additional input nodes ("error nodes") correspond to the portion of each MLP output in the underlying model left unexplained by the CLT.
Edges in the graph represent direct, linear attributions in the local replacement model. Edges originate from feature, embedding, and error nodes, and terminate at feature and output nodes. Given a source feature node s and a target feature node t, the edge weight between them is defined to be A_{s\rightarrow t} := a_s w_{s \rightarrow t}, where w_{s \rightarrow t} is the (virtual) weight in the local replacement model viewed as a fully connected neural network and a_s is the activation of the source feature. (Alternatively, w_{s \rightarrow t} is the derivative of the preactivation of t with respect to the source feature activation, with stop-gradients on all non-linearities in the local replacement model.)
In terms of the underlying model, w_{s \rightarrow t} is a sum over all linear paths (i.e., through attention head OVs and residual connections) connecting the source feature's decoder vectors to the target feature's encoder vector.
We now give details on how to efficiently compute these in practice, using backwards Jacobians. Let s be a source feature node at layer \ell_s and context position c_s and let t be a target feature node at layer \ell_t and context position c_t. We write J_{c_s, \ell_s \rightarrow c_t, \ell_t}^{\blacktriangledown} for the Jacobian of the underlying model with a stop-gradient operation applied to all model components with nonlinearities (the MLP outputs, the attention patterns, and normalization denominators) on a backwards pass on the prompt of interest, from the residual stream at context position c_t and layer \ell_t to the residual stream at context position c_s and layer \ell_s. The edge weight from s to t is then
A_{s \rightarrow t} = a_s w_{s \rightarrow t} = a_s \sum_{\ell_s \leq \ell < \ell_t} (W_{\text{dec}, \;s}^{\ell_s \to \ell})^T J_{c_s, \ell \rightarrow c_t, \ell_t}^{\blacktriangledown} W_{\text{enc}, \;t}^{\ell_t},
where
- W_{\text{dec}, \;s}^{\ell_s \to \ell} is the decoder vector of the feature for s writing to layer \ell,
- W_{\text{enc}, \;t}^{\ell_t} is the encoder vector of the feature for t.
The formulas for the other edge types are similar, e.g., an embedding-feature edge weight is given by w_{s \rightarrow t} = \text{Emb}_s^T J_{c_s, \ell_s \rightarrow c_t, \ell_t}^{\blacktriangledown} W_{\text{enc}, \;t}^{\ell_t}. Note that error nodes have no input edges. For all such formulas, and an expansion of the Jacobian in terms of paths in the underlying model, see § E Appendix: Attribution Graph Computation.
Because we have added stop-gradients to all model nonlinearities in the computation above, the preactivation h_t of any feature node t is simply the sum of its incoming edges in the graph: h_t = \sum_{s \in S_t} A_{s \rightarrow t} = \sum_{s \in S_t} a_s w_{s \rightarrow t}, where S_t is the set of nodes at earlier layers and at the same or earlier context positions as t. Thus the attribution graph edges provide a linear decomposition of each feature's activity.
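The edge-weight formula can be sketched as follows. We assume a helper vjp_to_residual(model, cache, c_t, l_t, v) that backpropagates a vector v from the residual stream at (position c_t, layer \ell_t) with stop-gradients on MLP outputs, attention patterns, and normalization denominators, and returns grads[c][l]: the pullback of v to the point where MLP layer l's output enters the residual stream at position c. This helper, like the other interfaces, is an assumption about the implementation rather than an existing API.

```python
# Sketch of computing one attribution-graph edge weight A_{s->t}.
def edge_weight(model, clt, cache, acts, s, t):
    c_s, l_s, i = s   # source node: (context position, layer, feature index)
    c_t, l_t, j = t   # target node: (context position, layer, feature index)
    # Pull the target feature's encoder vector back through the frozen model.
    grads = vjp_to_residual(model, cache, c_t, l_t, clt.W_enc[l_t][j])
    w = 0.0
    # Sum contributions of the source feature's decoder vectors at layers l_s <= l < l_t.
    for l in range(l_s, l_t):
        w = w + clt.W_dec[f"{l_s}->{l}"][:, i] @ grads[c_s][l]
    return acts[l_s][c_s, i] * w   # A_{s->t} = a_s * w_{s->t}
```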
Note that these graphs do not contain information about the influence of nodes on other nodes via their influence on attention patterns, but do contain information about node-to-node influence through the outputs of frozen attention. In other words, we account for the information which flows from one token position to another, but not why the model moved that information. (That is, our model ignores the "QK-circuits" but captures the "OV-circuits".) Note also that the outgoing edges from a cross-layer feature aggregate the effect of its decodings at all of the layers that it writes to on downstream features.
While our replacement model features are sparsely active (on the order of a hundred active features per token position), attribution graphs are too large to be viewed in full, particularly as prompt length grows: the number of edges can grow into the millions even for short prompts. Fortunately, a small subgraph typically accounts for most of the significant paths from the input to the output.
To identify such subgraphs, we apply a pruning algorithm designed to preserve nodes and edges that directly or indirectly exert significant influence on the logit nodes. With our default parameters, we typically reduce the number of nodes by a factor of 10, while only reducing the behavior explained by 20%. See § F Appendix: Graph Pruning for methodological details of our algorithms and metrics.
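The appendix gives the exact procedure; as a simplified illustration of influence-based pruning, one can normalize absolute edge weights, accumulate each node's indirect influence on the logit nodes over paths of all lengths, and keep the highest-influence nodes. The sketch below is an assumed simplification, not the algorithm we actually use.

```python
# Simplified sketch of influence-based node pruning (illustrative, not the exact
# algorithm from the appendix). A[s, t] holds absolute edge weights, normalized
# so each node's incoming weights sum to 1; logit_idx indexes the output nodes.
import numpy as np

def prune_nodes(A: np.ndarray, logit_idx: np.ndarray, keep_frac: float = 0.8):
    n = A.shape[0]
    # Total influence over paths of any length: B = A + A^2 + ... = (I - A)^{-1} - I.
    # (The graph is layered, so A is nilpotent and the inverse exists.)
    B = np.linalg.inv(np.eye(n) - A) - np.eye(n)
    influence = B[:, logit_idx].sum(axis=1)          # each node's influence on the logits
    order = np.argsort(-influence)
    cum = np.cumsum(influence[order])
    k = int(np.searchsorted(cum, keep_frac * influence.sum())) + 1
    return order[:k]                                  # indices of nodes to keep
```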
§ 3.2
Even following pruning, attribution graphs are quite information-dense. A pruned graph often contains hundreds of nodes and tens of thousands of edges, too much information to interpret all at once. To allow us to navigate this complexity, we developed an interactive attribution graph visualization interface. The interface is designed to enable "tracing" key paths through the graph, retain the ability to revisit previously explored nodes and paths, and materialize the information needed to interpret features on an as-needed basis.
Below we show the interactive visualization for the attribution graph attributing back from the single token "DAG":
[Interactive attribution graph for the "DAG" output (p=0.992), with supernodes including National, Digital, Analytics, Group, say/continue an acronym, N at start of acronym, say "D_", say "_A", say "DA_", say "_G", and say DAG; selecting a feature shows its input and output edges, token predictions, and top activating contexts.]
The interface is interactive. Nodes can be hovered over and clicked on to display additional information. Subgraphs can also be constructed by using Cmd/Ctrl+Click to select a subset of nodes. In the subgraph, features can be aggregated into groups we call supernodes (motivated below in § 3.4 Grouping Features into Supernodes ).
§ 3.3 Understanding and Labeling Features
We use feature visualizations similar to those shown in our previous work, Scaling Monosemanticity, in order to manually interpret and label individual features in our graph. (We sometimes used labels from our automated interpretability pipeline as a starting point, but generally found human labels to be more reliable.)
The easiest features to label are input features, which activate on specific tokens or categories of closely-related tokens and which are common in early layers, and output features, which promote continuing the response with specific tokens or categories of closely-related tokens and which are common in late layers. For example:
- This feature is likely an input feature because its visualization shows that it activates strongly on the word "digital" and similar words like "digitize", but not on other words. We therefore label it a "digital" feature.
- This feature (a Haiku feature from the later § 3.8 Addition Case Study) is an input feature that activates on a variety of tokens that end in the digit 6, and even on tokens that are more abstractly related to 6 like "six" and "June".
- This feature is likely an output feature because it activates strongly on several different tokens, but in each example, the token is followed by the text "dag". Furthermore, the top of the visualization indicates that the feature increases the probability of the model predicting "dag" more than any other token (in terms of its direct effect through the residual stream). This suggests that it's an output feature. Since output features are common, when labeling output features that promote some token or category X, we often simply write "say X", so we give this example the label "say 'dag'".
- This feature (from the later § 3.7 Factual Recall Case Study) is an output feature that promotes a variety of sports, though it also demonstrates some ways in which labeling output features can be difficult. For example, one must observe that "lac" is the first token of "lacrosse". Also, the next token in the context after the feature activates often isn't actually the name of a sport, but is usually a plausible place for the name of a sport to go.
Other features, which are common in middle layers of the model, are more abstract and require more work to label. We may use examples of contexts they are active over, their logit effects (the tokens they directly promote and suppress through the residual stream and unembedding), and the features they're connected to in order to label them. For example:
- This feature activates on the first one or two letters of an unfinished acronym after an open parenthesis, for a variety of letters and acronyms, so we label it as continuing an acronym in general.
- This feature activates at the start of a variety of acronyms that all have D as their second letter, and many of the tokens it promotes directly have D as their second letter as well. (Not all of those tokens do, but we don't expect logit effects to perfectly represent the feature's functionality because of indirect effects through the rest of the model. We also find that features further away from the last layer have less interpretable logit effects.) For brevity, we may label this feature as "say '_D'", representing the first letter with an underscore.
- Finally, this feature activates on the first letter of various strings of uppercase letters that don't seem to be acronyms, and the tokens it most suppresses are acronym-like letters, but its examples otherwise lack an obvious commonality, so we tentatively label it as suppressing acronyms.
We find that even imperfect labels for these features allow us to find significant structure in the graphs.
§ 3.4 Grouping Features into Supernodes
Attribution graphs often contain groups of features which share a facet relevant to their role on the prompt. For example, there are three features active on "Digital" in our prompt which each respond to the word "digital" in different cases and contexts. The only facet which matters for this prompt is that the word "digital" starts with a "D"; all three features have positive edges to the same set of downstream nodes. Thus for the purposes of analyzing this prompt, it makes sense to group these features together and treat them as a unit. For the purposes of visualization and analysis, we find it convenient to group multiple nodes, corresponding to (feature, context position) pairs, into a "supernode." These supernodes correspond to the boxes in the simplified schematic we showed above, reproduced below for convenience.
[Simplified attribution graph diagram with supernodes (reproduced from above)]
The strategy we use to group nodes depends on the analysis at hand, and on the roles of the features in a given prompt. We sometimes group features which activate over similar contexts, have similar embedding or logit effects, or have similar input/output edges, depending on the facet which is important for the claim we are making about the mechanism. We generally want nodes within a supernode to promote each other, and their effects on downstream nodes to have the same sign. While we experimented with automated strategies such as clustering based on decoder vectors or the graph adjacency matrix, no automated method was sufficient to cover the range of feature groupings required to illustrate certain mechanistic claims. We further discuss supernodes and potential reasons for why they are needed in Similar Features and Supernodes.
§ 3.5
In attribution graphs, nodes suggest which features matter for a model's output, and edges suggest how they matter. We can validate the claims of an attribution graph by performing feature perturbations in the underlying model, and checking if the effects on downstream features or on the model outputs match our predictions based on the graph. Features can be intervened on by modifying their computed activations and injecting the modified decodings in lieu of the original reconstruction.
Features in a cross-layer transcoder write to multiple output layers, so we need to decide on a range of layers in which to perform our intervention. How might we do this? We could intervene on a feature's decoding at a single layer just like we would for a per-layer transcoder, but edges in an attribution graph represent the cumulative effect of multiple layers' decodings, so intervening at a single layer would only target a subset of a given edge. In addition, we'll often want to intervene on more than one feature at a time, and different features in a supernode will decode to different layers.
To perform interventions over layer ranges, we modify the decoding of a feature at each layer in the given range, and run a forward pass starting from the last layer in the range. Since we aren't recomputing a layer's MLP output based on the result of interventions earlier in the range, the only change to the model's MLP outputs will be our intervention. We call this approach "constrained patching", as it doesn't allow an intervention to have second-order effects within its patching range. See § K Appendix: Iterative Patching for a description of another approach we call "iterative patching", and see § H Appendix: Nuances of Steering with Cross-Layer Features for a discussion of why more naive approaches, such as adding a feature's decoder vector at each layer during a forward pass of the model, risk double counting a feature's effect.
Below, we illustrate a multiplicative version of constrained patching, in which we multiply a target feature's activation by M in the [\ell - 1, \ell] layer range. Note that MLP outputs at further layers are not directly affected by the patch. (They can be indirectly affected, since they sit downstream of affected nodes.)
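The following sketch implements multiplicative constrained patching for a single cross-layer feature, under the same assumed model/cache interfaces as before (including an assumed option to pass frozen attention patterns to the attention block):

```python
# Constrained patching sketch: scale feature i's activation by M, add the change
# in its decodings to the cached residual stream for layers in [start, end], then
# resume the forward pass after the range with attention patterns held at their
# unperturbed values. Interfaces (cache, model.*) are assumed for illustration.
import torch

def constrained_patch(model, clt, cache, acts, feature, M, start, end):
    l_f, i = feature                                  # the feature's layer and index
    delta_a = (M - 1.0) * acts[l_f][:, i]             # change in activation at each position
    x = cache["resid_post"][end].clone()              # residual stream just after layer `end`
    # Nothing within the range is recomputed; we only add the change in the
    # feature's decodings at each layer it writes to within the range.
    for l in range(start, end + 1):
        if l >= l_f:                                  # the feature only decodes to layers >= l_f
            x = x + torch.outer(delta_a, clt.W_dec[f"{l_f}->{l}"][:, i])
    # Resume the underlying model's forward pass after the range; MLPs there are
    # recomputed, but attention patterns are reused from the unperturbed pass.
    for l in range(end + 1, model.n_layers):
        x = x + model.attn_block(l, x, patterns=cache["attn_patterns"][l])
        x = x + model.mlp_block(l, model.pre_mlp_norm(l, x))
    return model.unembed(model.final_norm(x))
```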
Attribution graphs are constructed by using the underlying model's attention patterns, so edges in the graph do not account for effects mediated via QK circuits. Similarly, in our perturbation experiments, we keep attention patterns fixed at the values observed during an unperturbed forward pass. This methodological choice means our results don't account for how perturbations might have altered the attention patterns themselves.
Returning to our acronym prompt, we show the results of patching supernodes, starting with suppressing the "Group" supernode. Below, we overlay patching effects onto supernode schematics for clarity, displaying the effect on other supernodes and the logit distribution. Note that in this diagram, the position of the nodes in the figure is not meant to correspond to token positions unless explicitly noted.
[Supernode schematic showing the effect of suppressing the "Group" supernode]
We now show the results of suppressing some supernodes on the aggregate activation of other supernodes and on the logit. For each patch, we set every feature activation in a node to the opposite of its original value (or equivalently, we steer multiplicatively with a factor of −1; for a discussion of why we steer negatively instead of ablating the feature, see § I Unexplained Variance and Choice of Steering Factors). We then plot each node's total activation as a fraction of its original value. (For each patched supernode, we choose the end-layer range which causes the largest suppression effect on the logit.) We use an orange outline to highlight nodes downstream of one another for which we would hypothesize patching to have an effect.
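As a usage sketch, this experiment can be expressed in terms of a helper that recomputes CLT activations under a constrained patch. Here recompute_acts is an assumed helper (not defined above), and supernodes is a hypothetical mapping from names to lists of (layer, position, feature index) nodes.

```python
# Steer a supernode's features by a factor of -1 and report every supernode's
# total activation as a fraction of its original value. `recompute_acts` is an
# assumed helper that returns CLT feature activations under the patch.
def supernode_suppression(model, clt, cache, acts, supernodes, patched_name, start, end):
    patched_acts = recompute_acts(
        model, clt, cache, supernodes[patched_name], M=-1.0, start=start, end=end
    )
    results = {}
    for name, members in supernodes.items():
        orig = sum(acts[l][c, i] for (l, c, i) in members)
        new = sum(patched_acts[l][c, i] for (l, c, i) in members)
        results[name] = float(new / orig)
    return results
```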
We see that inhibiting features for each word inhibits the related initial features in turn. In addition, the supernode of features for "say DA_" is affected by inhibitions of both the "Digital" and "Analytics" supernodes.
§ 3.6
The attribution graph also allows us to identify in which layers a feature's decoding will have the greatest downstream effect on the logit. For example, the "Analytics" supernode features mostly contribute to the "dag" logit indirectly through intermediate groups of features "say _A", "say DA_", and "say DAG", which live in layers 13 and beyond.
[Interactive attribution graph view for the "say DAG" feature, showing its input edges (from supernodes such as Analytics, say "_A", and say "DA_"), its token predictions, and its top activating contexts, which center on the token "dag" across several languages.]
We would thus expect steering negatively on an "Analytics" feature to have an effect on the dag logit which plateaus before layer 13 and then decreases in magnitude as we approach the final layer. The decrease is caused by the constrained nature of our intervention. If a patching range includes all the "say an acronym" features, it will not change their activation, because constrained patching doesn't allow knock-on effects. Below, we show the effect of steering with each Analytics feature, keeping the start layer set to 1 and sweeping over the patching end layer. (Note that we use a large steering factor in this experiment; for a discussion of this, see § I Unexplained Variance and Choice of Steering Factors.)
§ 3.7 Factual Recall Case Study
We now turn to the question of factual recall by studying how the model completes the prompt Fact: Michael Jordan plays the sport of with basketball (with 65% confidence). We start by computing an attribution graph. We group semantically similar features into supernodes like we did for the acronym study.
The supernode diagram below shows two primary paths. One path originates from the "plays" and "sport" tokens and promotes "sport" and "say a sport" features, which in turn promote the logits for basketball, football, and other sports. The other path originates from "Michael Jordan and other celebrities" and promotes basketball-related features, which have positive edges to the basketball logit and negative edges to the football logit. In addition to these sequential paths, some groups of features such as "Michael Jordan" and "sport/game of" have direct edges to the basketball logit, representing effects mediated only via attention head OVs, consistent with the findings of Batson et al.
We also display the full interactive graph below.
[Interactive attribution graph for the Michael Jordan prompt, with supernodes including Michael Jordan and celebrities, sport/game of, play, sport, basketball discussion, and say a sport, and output logits for "basketball" (p=0.653) and "football" (p=0.021); selecting a feature shows its edges, token predictions, and top activating contexts.]
In addition, a complex set of mechanisms seems to be involved in contributing information about the entity Michael Jordan to the residual stream at "Jordan", as observed in Nanda et al. We have grouped into one supernode features sensitive to "Michael", an L1 feature which has already identified the token pair "Michael Jordan", features for other celebrities, and polysemantic features firing on "Michael Jordan" and other unrelated concepts. Note that we choose to include some polysemantic features in supernodes as long as they share a facet relevant to the prompt, such as this feature, which activates more strongly on the word "synergy" than on "Michael Jordan". We evaluate features in more depth in Qualitative Feature Evaluations.
Steering experiments once again allow us to validate the hypotheses proposed by the graph.
Ablating either the "sport" or "Michael Jordan" supernode has a large effect on the logit but a comparatively smaller effect on the other supernode, confirming the parallel path structure. In addition, we see that suppressing the intermediate "basketball discussion" supernode also has a large effect on the logit.
§ 3.8
We now consider the simple addition prompt calc: 36+59=. We use the prefix "calc:" because the 18L model performs much better on the problem with it. This prefix is not necessary for Haiku 3.5, but we include it nevertheless for the purposes of direct comparison in later sections. Unlike previous sections, we show results for Haiku 3.5 because the patterns are clearer and show the same structure (see § Q Appendix: Comparison of Addition Features for a side-by-side comparison). We look at small-number addition because it is one of the simplest behaviors exhibited competently by most LLMs and human adults (try the problem in your head to see if your approach matches the model's!).
We supplement the generic feature visualization (on arbitrary dataset examples) with one which explicitly covers the set of two-digit addition problems, allowing us to get a crisp picture of what each feature does. Following Nikankin et al., who analyzed neurons, we visualize each feature active on the = token with three plots:
- An operand plot, displaying its activity on the 100 × 100 grid of potential inputs.
- An output weight plot, displaying its direct weights on the outputs for [0, 99]. (During analysis we visualize effects on [0, 999]. This is important to understand effects beyond the first 100 number tokens (e.g., the feature predicting 95 mod 100), but we only show [0, 99] for simplicity.)
- An embedding weight plot (or "de-embedding"), displaying the direct effect of embedding vectors on a feature's encoder. This is shown in the same format as the output weight plot.
We show an example plot of each of these three types below for different features. On this restricted domain, the operand plots are complete descriptions of the CLT features as functions. Stripes and grids in these plots represent different kinds of structure (e.g. diagonal lines indicate constraints on the sum, while grids represent modular constraints on the inputs).
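As a rough illustration of how the two weight-based plots are read off the parameters (a sketch with placeholder tensor names such as W_unembed and number_token_ids, not released weights), both are simple inner products between a feature's CLT vectors and the model's embedding or unembedding matrices:

```python
import numpy as np

# Placeholder shapes, assuming:
#   W_embed:   [n_vocab, d_model]  token embedding matrix
#   W_unembed: [d_model, n_vocab]  unembedding matrix
#   enc:       [d_model]           encoder vector of one CLT feature
#   dec:       [d_model]           the feature's (summed) decoder vector
#   number_token_ids: token ids for the strings "0" ... "99"

def output_weight_plot(dec, W_unembed, number_token_ids):
    """Direct weight from the feature onto each numeric output token."""
    return np.array([dec @ W_unembed[:, t] for t in number_token_ids])

def de_embedding_plot(enc, W_embed, number_token_ids):
    """Direct effect of each numeric token's embedding on the feature's encoder."""
    return np.array([W_embed[t] @ enc for t in number_token_ids])
```

The operand plot, by contrast, is not a weight: it requires evaluating the feature's activation on each of the 100 × 100 prompts.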
In the supernode diagram below, we see information flow from input features, which split out the final digit, the number, and the magnitude of the operands, into three major paths: a final-digit path (mod 10; light brown, right), a moderate-precision path (middle), and a low-precision path (dark brown, left). (Hovering reveals the variability in the precision of the low-precision lookup table features and the moderate-precision sum features.) These paths collectively produce a moderate-precision value of the sum and the final digit of the sum, which finally constructively interfere to give both the mod 100 version of the sum and the final output.
We provide the equivalent interactive graph for 18L here.
The supernode graph suggests a taxonomy of features underlying this task, in which features vary along two major axes (as in other sections of this work, we apply these labels manually):
- Computational role
- Sum features have diagonal operand plots, and fire on pairs of inputs whose sum satisfies some condition.
- Lookup Table Features have plots that look like a grid, and consist of inputs a and b satisfying condition1(a) AND condition2(b). We discuss these in more detail below.
- Add Function Features have plots with horizontal or vertical bars. One addend satisfies some condition, or an OR operation merges two conditions across addends.
- Mostly Active Features are active on the "=" token of most of our 10,000 addition prompts.
- Miscellaneous Features have all sorts of strange properties, but often look like hybrid activation patterns from the above types. We find that these have lower influence on the outputs.
- Condition properties
- Precision: we find conditions with ones-digit precision (sum=_5 or =59), with exact range (of width e.g. 2 or 10), and with fuzzy ranges of width ranging from 2 to 50.
- Modularity: we find features that are sensitive to the sum or operand value in absolute terms, mod 10, mod 100, and less commonly, mod 2, mod 5, mod 25, and mod 50.
- Pattern: we find features sensitive to a regex-style pattern in an input or output, such as "starts with 51", as in . These do not feature as prominently in our addition graphs, having low influence on the model's output, but they do exist, and may be more important for other tasks involving numbers.
These findings broadly agree with other mechanistic studies showing that language models trained on natural language corpora perform addition using parallel heuristics involving magnitudes and moduli that constructively interfere to produce the correct answer. Namely, Nikankin et al. proposed a "bag of heuristics" interpretation, recognizing a set of "operand" features (equivalent to our "add X" features) and "result" features (equivalent to our "sum" features) exhibiting high and low precision and different modularities in sensing the input and producing the output.
We also identify the existence of lookup table features, which seem to be an interesting consequence of the architecture used by both the model and the CLT. Neuron and CLT feature activations are computed by applying a nonlinearity to the sum of their inputs. This produces a "parallelogram constraint" on the response of a feature to a set of inputs: namely, if a feature f is active on two inputs of the form x+y and z+w, then it must be active on at least one of the inputs x+w or z+y. This follows since the preactivation of f is an affine function of the operands. (Write f_{pre}(x+y) for the preactivation of f on the input x+y. Then f_{pre}(x + y) + f_{pre}(z + w) = f_{pre}(x + w) + f_{pre}(z + y). If both terms on the left-hand side are positive, at least one on the right must be.) In particular, it is impossible for input features to produce a general sum feature in one step. For example, a general "sum = 5" feature which fires for 1+4 and 2+3 would need to fire for at least one of 1+3 or 2+4. So some intermediate step is required between copying over information about both inputs to the "=" token and producing a property of their sum. CLT lookup table features represent these intermediate steps for addition. (Concretely, the intermediate info is the ones digit produced from adding the ones digit of the operands and the approximate magnitude of the sum of two numbers in the 20s.)
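To make the parallelogram constraint concrete, here is a toy sketch (not the CLT itself): if a feature's preactivation decomposes into separate contributions from the two operands, the identity above holds exactly, so a ReLU of that preactivation cannot fire on 1+4 and 2+3 while staying silent on both 1+3 and 2+4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy preactivation: arbitrary per-operand contributions plus a bias, mimicking a
# feature whose preactivation is an affine function of information copied from each operand.
g = rng.normal(size=10)   # contribution from the first operand (digits 0-9)
h = rng.normal(size=10)   # contribution from the second operand
c = rng.normal()

def pre(a, b):
    return g[a] + h[b] + c

def active(a, b):
    return max(pre(a, b), 0.0) > 0.0   # ReLU followed by an "is active" check

# Parallelogram identity: pre(x,y) + pre(z,w) == pre(x,w) + pre(z,y) for all operands.
for _ in range(1000):
    x, y, z, w = rng.integers(0, 10, size=4)
    assert np.isclose(pre(x, y) + pre(z, w), pre(x, w) + pre(z, y))

# Consequence: activity on x+y and z+w forces activity on x+w or z+y,
# so no single such feature can be an exact "sum = 5" detector.
if active(1, 4) and active(2, 3):
    assert active(1, 3) or active(2, 4)
```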
To validate that the structure we observe in the attribution graph matches the causal structure of the model, we perform a series of interventions. For each supernode, we perturb it to the negative of its original value, and measure the result on all subsequent supernodes and the outputs. We find results largely consistent with the graph:
In particular, inhibiting the ones-digit feature on either of the input tokens suppresses the entire ones-digit pathway (the _6 + _9 lookup table features, the resulting sum=_5 and sum=_95 features), while leaving the magnitude pathway mostly intact, including the sum~92 features. Remarkably, when suppressing _6, the model confidently outputs 98 instead of the correct answer 95; the tens digit from the original problem is preserved by the other magnitude signals, but the ones digit is the one that would result from adding 9 to itself. (Suppressing _9, however, results in an output of 91, not 92, so such numerology must be taken with a grain of salt.) Conversely, inhibiting the low-precision features on either input (~30 and ~59) suppresses the low-precision lookup table features, the magnitude sum feature, and the appropriate sum features while leaving the ones-digit pathway alone.
We also show the quantitative effects of perturbations on the outputs, finding that negatively steering the _6 + _9 lookup table features smears the result out over a range of 5, while negatively steering the final sum=_95 feature smears the result out to a wider band (perhaps coming from sum~92 features).
We will investigate how CLT features interact across the full range of two-digit addition prompts below, after establishing the framework for global weights that we use to generalize this circuit to other inputs.
§ 4
The attribution graphs we construct show how features interact on a specific prompt to produce the model's output, but we are also interested in a more global picture of how features interact across all contexts. In a classic multi-layer perceptron, the global interactions are provided by the weights of the model: the direct influence of one neuron on another is just the weight between them if the neurons are in consecutive layers; if neurons are further apart, the influence of one on another factors through intermediate layers. In our setup, the interaction between features has a context-independent component and a context-dependent component. We would ideally like to capture both: we want a set of global weights which are context-independent, but which also capture network behavior across all possible contexts. In this section we analyze the context-independent component (a kind of "virtual weight"), a problem with it (large "interference" terms with no causal effect on-distribution), and one approach that uses co-activation statistics to deal with the interference.
On a specific prompt, a source CLT feature (s) influences a target (t) via three kinds of paths:
- residual-direct: s's decoders write to the residual stream, where it is read in at a later layer by t's encoder.
- attention-direct: s's decoders write to the residual stream, are transported by some number of attention head OV steps, and then read by t's encoder.
- indirect: paths from s to t are mediated by other CLT features.
We note that the residual-direct influence is simply the product of the first feature's activation on this prompt times a virtual weight that is consistent across inputs. (The attention-direct terms can also be written in terms of virtual weights, given by multiplying various decoder vectors by a series of attention head OVs and then by an encoder, but these get scaled on a prompt by both the source feature activation and the attention patterns, which makes their analysis more complex.) These virtual weights are a simple form of global weights because of this consistent relationship. Virtual weights have been derived between many different components in neural networks, including attention heads, SAE features, and transcoder features. For CLTs, the virtual weight between two features is the inner product between the encoder of the downstream feature and the sum of the upstream feature's decoders for the layers in between.
More formally, let \ell_s and \ell_t be the layers of the encoder weights for features s and t. Let L_{st} be the set of layers in between these features, i.e., all \ell such that \ell_s \leq \ell < \ell_t. Feature s writes to all MLP outputs in L_{st} before reaching feature t. Let W_{\text{dec}}^{s,\ell} be the decoder weights for feature s targeting layer \ell, and W_{\text{enc}}^t be the encoder weights for feature t. Then, the virtual weights are computed as: V_{st} = \big\langle \sum_{\ell \in L_{st}} W_{\text{dec}}^{s,\ell},\; W_{\text{enc}}^t \big\rangle Attribution graph edges consist of the sum of the residual-direct contribution (the virtual weight multiplied by the source feature's activation) plus the attention-direct contribution.
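A minimal sketch of this computation, with hypothetical containers for the CLT parameters (W_enc and W_dec below are placeholders, not released weights):

```python
import numpy as np

# Assumed layout:
#   W_enc[l]    : [d_model, n_feat_l]  encoder for layer-l features
#   W_dec[l][m] : [n_feat_l, d_model]  decoder of layer-l features writing to layer m's MLP output

def virtual_weight(W_dec, W_enc, s_layer, s_idx, t_layer, t_idx):
    """Context-independent residual-direct weight V_st from source feature s to target feature t."""
    assert s_layer <= t_layer
    if s_layer == t_layer:
        return 0.0  # the set L_st is empty when both features sit at the same layer
    # Sum the source feature's decoder vectors over the layers l with l_s <= l < l_t,
    # i.e., everything it writes before the target feature's encoder reads the stream.
    summed_dec = sum(W_dec[s_layer][m][s_idx] for m in range(s_layer, t_layer))
    return float(summed_dec @ W_enc[t_layer][:, t_idx])
```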
There is one major problem with interpreting virtual weights: interference . Because millions of features are interacting via the residual stream, they will all be connected, and features which never activate together on-distribution can still have (potentially large) virtual weights between them. When this happens, the virtual weights are not suitable global weights because these connections never impact network function.
We can see interference at play in the following example: below, we take a "Say a game name" feature in 18L and plot its largest virtual weights by magnitude. Green bars indicate a positive connection and purple bars indicate a negative one. Many of the most strongly connected features are hard to interpret or not clearly related to the concept.
You might consider this a sign that virtual weights or our CLTs aren't capturing interpretable connections. However, we can still uncover many interpretable connections by removing interference from these weights. (We also see interference weights when looking at a larger sample of features in § O Appendix: Interference Weights over More Features.)
There are two basic solutions to this problem. One is to restrict the set of features being studied to those active on a small domain (as we do in § 4.1 Global Weights in Addition ). The other is to bring in information about the feature-feature coactivation on the data distribution.
For example, let a_i be the activation of feature i. We can compute an expected residual attribution value by scaling the virtual weight as follows: V_{ij}^{\text{ERA}} = \mathbb{E}\big[ \mathbb{1}(a_j > 0)\, V_{ij}\, a_i \big] = \mathbb{E}\big[ \mathbb{1}(a_j > 0)\, a_i \big]\, V_{ij} This represents the average strength of a residual-direct path across all of the prompts we've analyzed (also computed by Dunefsky et al.), and is similar to computing the average of all attribution graphs within a context position across many tokens. (It only represents the residual-direct component and does not include the attention-direct one.) The indicator function in this expression, \mathbb{1}(a_j > 0), captures the fact that attributions only count when the target feature is active. As small feature activations are often polysemantic, we instead weight attributions using the target activation value: V_{ij}^{\text{TWERA}} = \frac{\mathbb{E}\big[a_j a_i\big]}{\mathbb{E}\big[ a_j \big]}\, V_{ij} We call this last type of weight target-weighted expected residual attribution (TWERA). As shown in the equations, both of these values can be computed by multiplying the original virtual weights by ("on-distribution") statistics of activations.
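For instance, given a matrix of feature activations sampled on-distribution and the virtual weights above, both quantities reduce to elementwise products with simple coactivation statistics (a sketch; the names are illustrative):

```python
import numpy as np

def era_and_twera(acts, V):
    """ERA and TWERA from sampled activations.

    acts: [n_samples, n_feat] non-negative feature activations sampled on-distribution.
    V:    [n_feat, n_feat] virtual weights, indexed as V[source i, target j].
    """
    n = acts.shape[0]
    active = (acts > 0).astype(acts.dtype)
    # E[ 1(a_j > 0) * a_i ] * V_ij
    era = (acts.T @ active / n) * V
    # E[ a_j * a_i ] / E[ a_j ] * V_ij  (column j is the target feature)
    twera = (acts.T @ acts / n) / np.maximum(acts.mean(axis=0), 1e-12) * V
    return era, twera
```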
Now, we revisit the example game feature from before, but with connections ordered by TWERA. We also plot each connection's "raw" virtual weight for comparison. Many more of these connections are interpretable, suggesting that the virtual weights extracted useful signals but we needed to remove the interference in order to see them. The most interpretable features from the virtual weight plot above (another "Say a game name" feature and an "Ultimate frisbee" feature) are preserved while many unrelated concepts are filtered out.
TWERA is not a perfect solution for interference. Comparing TWERA values to the raw virtual weights shows that many extremely small virtual weights have strong TWERA values. Note that these weights cannot be 0 as this would also send TWERA to 0. This indicates that TWERA heavily relies on the coactivation statistics and strongly changes which connections are important beyond simply removing the large interference weights. TWERA also does not handle inhibition well (like attribution generally). We will explore these issues further in future work.
Still, we find that global weights give us a useful window into how features behave in a broader range of contexts than our attribution graphs. We'll use these methods to complement our understanding in the rest of this paper and in the companion paper.
§ 4.1
We now return to the simple addition problem from above on Haiku 3.5, and show how data-independent virtual weights reveal clear structure between the types in our taxonomy of addition features. We again consider completions of the 10,000 prompts calc: a+b= for a, b ∈ [0, 99]. In addition to the operand plots (again, defined above), we inspect the virtual weight graph after restricting the large n_\text{feat} \times n_\text{feat} virtual weight matrix to the set of features that are active on at least 10 of the 10,000 addition prompts. This allows us to see all the feature-feature interactions that can occur via direct residual paths.
In the neighborhood of the features appearing in the 36+59 prompt above, we see:
- The _6 + _9 lookup table features feed into other sum features (sum=_15, sum=_25, etc.), which likely activate when combined with other magnitude features.
- The add _9 feature feeds into other ones-digit lookup table features (_9 + _9, _0 + _9, etc.).
- The sum=_5 feature is fed by other lookup table features.
- The medium-precision ~36+60 lookup table feature is fed by an add ~62 feature in addition to the add ~57 feature we see on this prompt.
We provide an interactive interface to explore the virtual weights connecting all 2931 features prominent on two-digit addition problems in our smaller 18L model.
We find that restricting to features active on this narrow domain of addition problems produces a global circuit graph where most edges are interpretable in terms of the operand function realized by the source and target features. Moreover, the connections between features recapitulate a more general version of the graph in the previous section; add features detect a specific operand as part of the input, lookup table features propagate this information to sum features which (in concert with the previous features) produce the model's final answer.
Several of the features we find take the form of heuristics as in Nikankin et al.: features whose operand plots have predictive power directly push that prediction to the outputs. The low-precision features promote outputs in the range matching their operands; the ones-digit lookup table features directly promote outputs matching their mappings (e.g. _6 + _9 directly upweights _5 outputs). Almost none of these features represent the full solution until the very last layers of the model. As viewed by CLTs, the models use several intermediate heuristics rather than a coherent procedural program with a single confident output.
Our focus on the computational steps the model uses to perform addition is complementary to concurrent work by Kantamneni and Tegmark, which begins from representations. Inspired by the observation of spikes in the Fourier decomposition of the embedding vectors for integers, they find low-dimensional subspaces highly correlated with numbers' magnitudes and mod 2, 5, 10, and 100 components. Projecting to those subspaces preserves much of the model's performance on the task, consistent with a "Clock" algorithm performing separate calculations in each modulus, which interfere constructively at the end; the CLT features show essentially high- and low-precision versions of that method. Some of the important features we find have operand plots similar to their neurons, which they fit as a (thresholded) sum of Fourier modes. (There's no guarantee that the CLT features are the most parsimonious way to split up the computation, and it's possible that some of our less important, roughly periodic features which are harder to interpret are artifacts of the periodic aspects of the representation.) Some of the important features appearing in our graphs (such as operands or sums that start with fixed digits, e.g. 95_ and 9_) aren't describable in Fourier terms, consistent with the existence of some error in their low-rank approximation. (In § P Appendix: Number Output Weights over More Features, we show output weight plots for the 9_ and 95_ features on all number predictions from [0, 999]. We also show a miscellaneous feature that promotes "simple numbers": small numbers, multiples of 100, and a few standouts like 360.) Identifying the representational basis of the ensemble of computational strategies revealed by our unsupervised approach is a promising direction for future work.
Altogether, we've replicated the view that the base model uses heuristics by finding matching CLT features, we've shown through intervention experiments how these heuristics contribute to separable pathways, and we've demonstrated how these heuristics are connected, building off one another to collectively solve the addition task.
§ 5
In this section, we perform qualitative and quantitative evaluations of transcoder features and the attribution graphs derived from them, focusing especially on interpretability and sufficiency. For readers interested in a higher-level discussion of findings and limitations, we recommend skipping ahead to § 6 Biology and § 7 Limitations.
Our methods produce causal graph descriptions of the model's mechanisms on a particular prompt. How can we quantify how well these descriptions capture what is really going on in the model? It is difficult to distill this question to one number, as several factors are relevant:
Interpretability. How well do we understand what individual features "mean"? We attempt to quantify interpretability in a few ways below; however, we still rely heavily on subjective evaluation in practice. The coherence of our groupings of features into "supernodes" also warrants evaluation. We do not attempt to quantify this in this work, instead leaving it to readers to verify for themselves that our groupings are sensible and interpretable. We also note that in the context of attribution graphs, interpretability of the graph is just as important as interpretability of individual features. To that end, we quantify one notion of graph simplicity: average path length.
Sufficiency. To what extent are our (pruned) attribution graphs sufficient to explain the model's behavior? We attempt to quantify this in several ways. The most straightforward such evaluation is our measurement of how well the replacement model's outputs match the underlying model, discussed in § 2.2 From Cross-Layer Transcoder to Replacement Model. This is a "hard" evaluation in that a single error anywhere along the computational graph can severely degrade performance. We also compute a few "softer" measures of sufficiency below, which measure the proportion of error nodes in attribution graphs. Note that in many instances, we present schematics of subgraphs of a pruned attribution graph that portray what we believe to be its most noteworthy components. We intentionally do not measure the sufficiency of these subgraphs, as they often intentionally exclude "boring" but necessary parts of the graph (e.g. "this is a math problem" features in addition prompts). We leave it to future work to find more principled ways to distill attribution graphs to their "interesting" components and quantify how much (and what kind of) information is lost. One route is to consider families of prompts, and to exclude from consideration features that are present across all prompts within a family (but see ).
Mechanistic faithfulness. To what extent are the mechanisms we identify actually used by the model? To measure this, we perform perturbation experiments (such as inhibiting active features) and measure whether the effects agree with what is predicted by the local replacement model (the underlying object portrayed by our attribution graphs). We attempt to do so quantitatively below, and we also validate faithfulness on our specific case studies, in particular focusing on the faithfulness of the mechanisms we have identified as interesting or important. Note that our notion of mechanistic faithfulness is related to the idea of necessity of circuit components to a model's computation. However, necessity can be a somewhat restrictive notion: mechanisms that are not strictly "necessary" for the model's output may still be important to identify, especially in cases where multiple mechanisms cooperate in parallel to contribute to a computation, as we often observe.
We note that the specific evaluations we use are in many cases new to this work. In part this is because our work is somewhat unique in focusing on attribution graphs for individual prompts, rather than identifying circuits underlying the model's performance of an entire task. Developing better automatic methods for evaluating interpretability, sufficiency, and faithfulness of the entire pipeline (features, supernodes, graphs) is an important subject of future research. See § 9 Related Work for more detail on prior circuit evaluation methods.
§ 5.1
§ 5.1.1
For CLT features to be useful to us, they must be human-interpretable (perhaps in the future it will suffice for them to be AI-interpretable!). Interpretability is ultimately a qualitative property: the best gauge of the interpretability of our features is to view them in action. A standard (though incomplete) tool for understanding what a feature represents is to view the dataset examples for which it is active (we refer to the collection of such examples as our "feature visualization"). We provide thousands of feature visualizations in the context of our case studies of circuits later in this paper and in the companion paper. Below we also show 50 randomly sampled features from assorted layers of each model.
Our feature visualizations show snippets of samples from public datasets (Common Corpus, The Pile with books removed, LMSYS Chat 1m, and Isotonic Human-Assistant Conversation) that most strongly activate the feature, as well as examples that activate the feature to varying degrees, interpolating between the maximum activation and zero. Highlights indicate the strength of the feature's activation at a given token position. We also show the output tokens that the feature most strongly promotes / inhibits via its direct connections through the unembedding layer (note that this information is typically more meaningful for features in later model layers).
[Interactive feature visualizations: 18L features from layers 1, 5, 9, 13, and 17, and Haiku features from the first, middle, and final layers.]
At a very coarse level, we find several types of features:
- Input features that represent low-level properties of text (e.g. specific tokens or phrases). Most early-layer features are of this kind, but such features are also present in middle and later layers.
- Features whose activations represent more abstract properties of the context. For example, a feature for the danger of mixing common cleaning chemicals. These features appear in middle and later layers.
- Features that perform functions, such as an "add 9" feature that causes the model to output a number that is nine greater than another number in its context. These tend to be found in middle and later layers.
- Output features, whose activations promote specific outputs, either specific tokens or categories of tokens. An example is a "say a capital" feature, which promotes the tokens corresponding to the names of different U.S. state capitals.
- Polysemantic features, especially in earlier layers, such as this feature that activates for the token "rhythm", Michael Jordan, and several other unrelated concepts.
In line with our previous results on crosscoders, we find that features also vary in the degree to which their outputs "live" in multiple layers: some features contribute primarily to reconstructing one or a few layers, while others have strong outputs all the way through the final layer of the model, with most features falling somewhere in between.
We also note that the abstractions represented by Haiku features are in many cases richer than those in the smaller 18L model, consistent with the modelâs greater capabilities.
§ 5.1.2
In § 2.2 From Cross-Layer Transcoder to Replacement Model, we evaluated the ability of our CLTs to reproduce the computation of the underlying model. Here, we measure reconstruction error, sparsity (measured by "L0", the average number of features active per input token), and feature interpretability. As we increased the size of our CLT, we observed Pareto improvements in reconstruction error (averaged across layers) and feature sparsity (in 18L, reconstruction error decreased at a roughly fixed L0, while in Haiku, reconstruction error and L0 both decreased). In our largest 18L run (10M features), we attained a normalized mean reconstruction error of ~11.5% and an average L0 of 88. In our largest Haiku run (30M features), we attained a normalized reconstruction error of 21.7% and an average L0 of 235.
We also computed two LLM-based quantitative measures of interpretability, introduced and described in more detail in :
- Sort Eval: we take two randomly sampled features and identify the set of dataset examples that activate them most strongly. We present these sets of examples to Claude, including the token-by-token feature activation information. Then we take other dataset examples that activate only one of the features, present these to Claude, and ask it to guess which feature these examples correspond to (based on the initial example sets). The final evaluation score is the empirical likelihood that Claude guesses the correct feature on any given pair.
- Contrastive Eval: we generate (using Claude) pairs of prompts that are similar in content and structure but differ in one key respect. We compute the sets of features that activate on only one of the prompts but not the other. We present Claude with the feature vis for each such feature, along with the two prompts, and ask it to guess which prompt caused the feature to activate. The final evaluation score is the empirical likelihood that Claude guesses the correct prompt for features across trials.
We find that according to both measures, the quality of CLT features improves with scale (alongside improving reconstruction error); see the plots below. We scale the number of training steps with the number of features, so improvements reflect a combination of both forms of scaling; see § D Appendix: CLT Implementation Details for details.
We also compare CLTs to two baselines: per-layer transcoders (PLTs) trained at each layer of the model, and the raw neurons of the model thresholded at varying activation levels. Specifically, we sweep over a range of scalar thresholds, and for each value we clamp all neurons with activation less than the threshold to 0. For our metrics and graphs, we then only consider neurons above the threshold. We find that on all metrics, CLTs outperform PLTs, and both CLTs and PLTs substantially outperform the Pareto frontier of thresholded neurons.
§ 5.2
Case studies in § 3 Attribution Graphs focused on qualitative observations derived from attribution graphs. In this section, we describe the more quantitative evaluations we use to compare methodological choices and dictionary sizes. In each of the following subsections, we will introduce a metric and compare graphs generated using (1) cross-layer transcoders, (2) per-layer transcoders for every layer, and (3) thresholded neurons (we choose the threshold at approximately the point where the neurons achieve similar automated interpretability scores as our smallest dictionaries; see the figures above). To connect the quantitative to the qualitative, we will link to graphs which score especially high or low on each of these metrics.
While we don't treat these metrics as fundamental quantities to be optimized, they have proven a useful guide for tracking ML improvements in dictionary learning and for flagging prompts our method performs poorly on.
§ 5.2.1
Our graph-based metrics rely on quantities derived from the indirect influence matrix. Informally, this matrix measures how much each pair of nodes influences each other via all possible paths through the graph. This gives a natural importance metric for each node: how much it influences the logit nodes. We also commonly compare how much influence comes from error nodes vs. non-error nodes.
To construct this matrix, we start with the adjacency matrix of the graph. We replace all the edge weights with their absolute values (or simply clamp negative values to 0) to obtain an unsigned adjacency matrix and then normalize the input edges to each node so that they sum to 1. Let A refer to this normalized, unsigned adjacency matrix, indexed as (target, source).
The indirect influence matrix is B = A + A^2 + A^3 + \cdots, which is a Neumann series  and can be efficiently computed as B = (I - A)^{-1} - I. The entries of B indicate the sum of the strengths of all paths between a given pair of nodes, where the strength of any given path is given by the product of the values of its constituent edges in A. To compute a logit influence score for each node, we compute a weighted average of the rows of B corresponding to logit nodes (weighted by the probability of the particular logit).
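A compact sketch of this construction (assuming the graph is a DAG, so the series is finite and I - A is invertible; names are illustrative):

```python
import numpy as np

def logit_influence_scores(adj, logit_idx, logit_probs):
    """Node-to-logit influence from the indirect influence matrix.

    adj: [n_nodes, n_nodes] signed adjacency matrix of the attribution graph,
         indexed as (target, source).
    """
    A = np.abs(adj)                                    # unsigned edge weights
    row_sums = A.sum(axis=1, keepdims=True)
    A = A / np.where(row_sums > 0, row_sums, 1.0)      # each node's input edges sum to 1
    n = A.shape[0]
    B = np.linalg.inv(np.eye(n) - A) - np.eye(n)       # Neumann series A + A^2 + A^3 + ...
    # Average the logit rows of B, weighted by each logit's probability.
    return np.average(B[logit_idx, :], axis=0, weights=logit_probs)
```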
§ 5.2.2
A natural metric of graph complexity is the average path length from embedding nodes to logit nodes. Intuitively, shorter paths are easier to understand as they require interpreting fewer links in the causal chain.
To measure the influence of paths of different lengths, we compute truncated influence matrices B_{\ell} = \sum_{i=0}^{\ell} A^{i}. The influence of paths of length less than or equal to \ell is then given by P_{\ell} = \sum_{e} (B_{\ell})_{t,e}, where e ranges over embedding nodes and t is the logit node. (If there are multiple logit nodes, we compute an average of the rows weighted by logit probability.)
Below, we compare graphs built from our 10M CLT, 10M PLTs, and thresholded neurons in terms of their influence by path length, averaged across a dataset of pretraining prompts (without pruning). (We normalize influence scores by the total influence of embeddings in the unpruned graph. This normalization factor is exactly equal to the replacement score, which we define in the next section.)
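The path-length curves can be traced with the truncated series directly; a sketch using the same conventions as the snippet in the previous subsection:

```python
import numpy as np

def influence_by_path_length(A, emb_idx, logit_idx, logit_probs, max_len=10):
    """Embedding-to-logit influence P_l restricted to paths of length at most l.

    A: normalized, unsigned adjacency matrix indexed (target, source), as above.
    """
    n = A.shape[0]
    B_l = np.eye(n)            # B_0 = I
    A_pow = np.eye(n)
    P = []
    for _ in range(1, max_len + 1):
        A_pow = A_pow @ A      # A^l
        B_l = B_l + A_pow      # B_l = I + A + ... + A^l
        logit_row = np.average(B_l[logit_idx, :], axis=0, weights=logit_probs)
        P.append(logit_row[emb_idx].sum())
    return P
```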
One of the most important advantages of cross-layer transcoders is the extent to which they reduce path lengths in the graph. To understand how large of a qualitative difference this is, we invite the reader to view these graphs generated with different types of replacement models for the same prompt.
Replacement Model Type | Average Path Length | Graph Link |
Cross-Layer Transcoder (10m) | 2.3 | |
Per-Layer Transcoders (10m) | 3.7 |
We find that one important way in which cross-layer transcoders collapse paths is the case of amplification, where many similar features activate each other in sequence. For example, on the prompt Zagreb:Croatia::Copenhagen:, the per-layer transcoder shows a path of length 7 composed entirely of Copenhagen features, while the cross-layer transcoder collapses them all down to layer 1 features.
This example illustrates both the advantages and disadvantages of consolidating the amplification of a repeated computation across multiple layers into a single cross-layer feature. On one hand, it makes interpretability substantially easier, as it automatically collapses duplicate computations into a single feature without needing to do post hoc analysis or clustering. It also reduces the risk of "chain-breaking", where missing one feature in an amplification chain inhibits the ability to trace back further into the graph (i.e., a relevant amplification feature is missing for one step of the path, breaking the causal chain). On the other hand, the CLT has a different causal structure than the underlying model, which increases the risk that the replacement model's mechanisms diverge from the underlying model's. In the above example, we observe a set of Copenhagen features that activate a Denmark feature, which initiates a mutually reinforcing chain of Copenhagen and Denmark features. This dynamic is invisible in CLT graphs, and to the extent it is also present in the underlying model, it is an example of CLTs being mechanistically unfaithful.
§ 5.2.3
Because our replacement model has reconstruction errors, we want to measure how much of the model's computation is being captured. That is, how much of the graph influence is attributable to feature nodes versus error nodes.
To measure this, we primarily rely on two metrics:
- Graph completeness score: measures the fraction of input edges (weighted by the target nodeâs logit influence score) that come from feature or embedding nodes rather than error nodes.
- Graph replacement score: measures the fraction of end-to-end graph paths (weighted by strength) that proceed from embedding nodes to logit nodes via feature nodes (rather than error nodes).
Intuitively, the completeness score gives more "partial credit" and measures how much of the most important node inputs are accounted for, whereas the replacement score rewards complete explanations.
Below, we report average unpruned graph replacement and completeness scores for dictionaries of various sizes and types on our pretraining prompt dataset. We find the biggest methodological improvement comes when moving from per-layer to cross-layer transcoders, with large but diminishing returns from scaling the number of features.
To contextualize the qualitative difference we observe in graphs with varying scores, we invite the reader to explore some representative attribution graphs. Note, these graphs are pruned with our default pruning, which we describe in more detail below.
Replacement Model Type | Completeness Score | Replacement Score | Graph Link |
Cross-Layer Transcoder (10m) | 0.80 | 0.61 | |
Per-Layer Transcoders (10m) | 0.78 | 0.37 |
§ 5.2.4
We rely heavily on pruning to make graphs more digestible. To decide how much to prune the graph, we can use the completeness and replacement metrics described above, but with pruned nodes now counting towards the error terms. By varying the pruning threshold, we chart a frontier between the number of {nodes, edges} and {replacement, completeness} scores (see Appendix for full plots and details).
We find we can generally reduce the number of nodes by an order of magnitude while reducing completeness by only 20%.
For a sense of the qualitative difference, in the table below we link to attribution graphs for the same prompt (another acronym) but with different pruning thresholds.
Pruning Threshold | Completeness Score | Node Count | Graph Link |
0.95 | 0.87 | 236 | |
0.9 | 0.83 | 137 | |
0.8 (default) | 0.70 | 55 | |
0.7 | 0.58 | 27 |
§ 5.3
As discussed in § 3.5 Validating Attribution Graph Hypotheses with Interventions, attribution graphs provide hypotheses about mechanisms, which must be validated with perturbation experiments. This is because attribution graphs describe interactions in the local replacement model, which may differ from the underlying model. In most of our work, we use attribution graphs as a tool for generating hypotheses about specific mechanisms ("Feature A activates Feature B, which increases the likelihood of Token X") operating inside the model, which correspond to "snippets" of the attribution graph. We summarize the results of three kinds of validation experiments, which are described in more detail in § G Appendix: Validating the Replacement Model.
We start by measuring the extent to which influence metrics derived from attribution graphs are predictive of intervention effects on the logit and other features. First, we measure the extent to which a nodeâs logit influence score is predictive of the effect of ablating a feature on the modelâs output distribution. We find that influence is significantly more predictive of ablation effects than baselines such as direct attribution (i.e. direct edges in the attribution graph, ignoring multi-step paths) and activation magnitude (see Validating Node-to-Logit Influence ). We then perform a similar analysis for interactions between features. We compute the influence score between pairs of features, and compare it to the relative effect of ablating the upstream feature in the pair on the activation of the downstream one. We observe a Spearman correlation of 0.72, which is evidence that graph influence is a good proxy for effects in the downstream model (see Validating Feature-to-feature Influence ). See Nuances of Steering with Cross-Layer Features for some complexities in interpreting these results.
The metrics above help provide an estimate of the likelihood that an intervention experiment will validate a specific mechanism in the graph. We might also be interested in a more general validation of all the mechanistic hypotheses implicitly made by our attribution graphs. Thus, another complementary approach to validation is to measure the mechanistic faithfulness of the local replacement model as a whole, rather than of specific paths within attribution graphs. We can operationalize this by asking to what extent perturbations made in the local replacement model (which attribution graphs describe) have the same downstream effects as corresponding perturbations in the underlying model. We find that while perturbation results are reasonably similar between the two models when measured one layer after the intervention (~0.8 cosine similarity, ~0.4 normalized mean squared error), perturbation discrepancies compound significantly over layers. (Compounding errors have a gradually detrimental effect on the faithfulness of the direction of perturbation effects, which is largely consistent across CLT sizes, with signs of faithfulness worsening slightly as dictionary size increases. Compounding errors can have a catastrophically detrimental effect on the magnitude of perturbations, with worse effects for larger dictionaries.) We suspect the lack of normalization denominators in the local replacement model may be why its perturbation effect magnitudes deviate so significantly from the underlying model, even when the perturbation effect directions are significantly correlated. For more details, see Evaluating Faithfulness of the Local Replacement Model.
§ 6
In our companion paper, we use the method outlined here to perform deep investigations of the circuits in nine behavioral case studies of the frontier model Haiku 3.5. These include:
- Multi-Step Reasoning. We present a simple example where the model performs "two-hop" reasoning to complete "The capital of the state containing Dallas is…", going Dallas → Texas → Austin. We can see and manipulate its representation of the intermediate Texas step.
- Planning in Poems. We show that the model plans its outputs when writing lines of poetry. Before beginning to write each line, the model identifies potential rhyming words that could appear at the end. These preselected rhyming options then shape how the model constructs the entire line.
- Multilingual Circuits. We find the model uses a mixture of language-specific and abstract, language-independent circuits (which are more prevalent in Claude 3.5 Haiku than in a smaller model).
- Addition. We highlight a case where the same addition circuitry generalizes between very different contexts, and uncover qualitative differences between the addition mechanisms in Claude 3.5 Haiku and a smaller, less capable model.
- Medical Diagnoses. We show an example in which the model identifies candidate diagnoses based on reported symptoms, and uses these to inform follow-up questions about additional symptoms that could corroborate the diagnosis, all "in its head," without writing down its steps.
- Entity Recognition and Hallucinations. We uncover circuit mechanisms that allow the model to distinguish between familiar and unfamiliar entities, which determine whether it elects to answer a factual question or profess ignorance. "Misfires" of this circuit can cause hallucinations.
- Refusal of Harmful Requests. We find evidence that the model constructs a general-purpose "harmful requests" feature during finetuning, aggregated from features representing specific harmful requests learned during pretraining.
- An Analysis of a Jailbreak, which works by first tricking the model into starting to give dangerous instructions "without realizing it," and continuing to do so due to pressure to adhere to syntactic and grammatical rules.
- Chain-of-thought Faithfulness. We explore the faithfulness of chain-of-thought reasoning to the model's actual mechanisms. We are able to distinguish between cases where the model genuinely performs the steps it says it is performing, cases where it makes up its reasoning without regard for truth, and cases where it works backwards from a human-provided clue so that its "reasoning" will end up at the human-suggested answer.
- A Model with a Hidden Goal. We also apply our method to a variant of the model that has been finetuned to pursue a secret goal of exploiting biases in its training process. While the model is reluctant to reveal its goal out loud, our method exposes it, revealing the goal to be "baked in" to the model's "Assistant" persona.
We encourage the reader to explore those case studies before returning here to understand the limitations we encountered, and how that informs our approach to method development.
§ 7
Despite the exciting results presented here and in the companion paper, our methodology has a number of significant limitations. At a high level, the most significant ones are:
- Missing Attention Circuits: We don't explain how attention patterns are computed by QK-circuits, and can sometimes "miss the interesting part" of the computation as a result.
- Reconstruction Errors & Dark Matter: We only explain a portion of model computation, and much remains hidden. When the critical computation is missing, attribution graphs won't reveal much.
- The Role of Inactive Features & Inhibitory Circuits: Often the fact that certain features weren't active is just as interesting as the fact that others are. In particular, there are many interesting circuits which involve features inhibiting other features.
- Graph Complexity: The resulting attribution graphs can be very complex and hard to understand.
- Features at the Wrong Level of Abstraction: Issues like feature splitting and absorption mean that features often aren't at the level of abstraction which would make it easiest to understand the circuit.
- Difficulty of Understanding Global Circuits: Ideally, we want to understand models in a global manner, rather than via attributions on a single example. However, global circuits are quite challenging.
- Mechanistic Faithfulness: When we replace MLP computation with transcoders, how confident are we that they're using the same mechanisms as the original MLP, rather than something that's just highly correlated with the MLP's outputs?
We discuss these in detail below, and where possible provide concrete counterexamples where our present methods cannot explain model computation due to these issues. We hope that these may motivate future research.
§ 7.1
One significant limitation of our approach is that we compute our attribution graphs with respect to fixed attention patterns. This makes attribution a well-defined and principled operation, but also means that our graphs do not attempt to explain how the model's attention patterns were formed, or how these patterns mediate feature-feature interactions through attention head output-value matrices. In this paper, we have focused on case studies where this is not too much of an issue: cases where attention patterns are not responsible for the "interesting part" or "crux" of the model's computation. However, we have also found many cases where this limitation renders our attribution graphs essentially useless.
§ 7.1.1 Example: Induction
Let's consider for a moment a much simpler model: a humble 2-layer attention-only model, of the kind studied in . One interesting property of these models is their use of induction heads to perform basic in-context learning. For example, if we consider the following prompt:
I always loved visiting Aunt Sally. Whenever I was feeling sad, Aunt
These models will have induction heads attend back to "Sally", and then predict that is the correct answer. If we were to apply our present method, the answer isn't very informative. It would simply tell us that the model predicted "Sally" because there was a token "Sally" earlier in the context.
This misses the entire interesting story! The induction head attends to "Sally" because it was preceded by "Aunt", which matches the present token. Previous methods (e.g. ) were able to elucidate this, and so this case might even be seen as a kind of regression.
Indeed, when applied to Claude 3.5 Haiku on this prompt, our method has exactly this problem. See the attribution graph visualization: the graph contains direct edges from token-level "Sally" features to "say Sally" features and to the "Sally" logit, but fails to explain how these edges came about.
§ 7.1.2 Example: Multiple-Choice Questions
Induction is a simple case of attentional computation where we can make a reasonable guess at the mechanism even without help from our attribution graphs. However, this failure of the attribution graphs can manifest in more complex scenarios as well, where it completely obscures the interesting steps of the model's computation. For instance, consider a multiple choice question:
Human: In what year did World War II end?
(A) 1776
(B) 1945
(C) 1865
Assistant: Answer: (B)
When we compute the attribution graph (interactive graph visualization) for the "B" token in the Assistant's response, we obtain a relatively uninteresting answer: we answer "B" because of a tokens following "(b)" feature that activates on the correct answer. (There's also a direct pathway to the token "B", and output pathways mediated by a say "B" motor feature; we've chosen to elide these for simplicity.)
None of this provides a useful explanation of how the model chose its answer! The graph "skips over" the interesting part of the computation, which is how the model knew that 1945 was the correct answer. This is because the behavior is driven by attention. On further investigation, it turns out that there are three "correct answer" features that appear to fire on the correct answer to multiple choice questions, and which interventions show play a crucial role. From this, we hypothesize that the mechanism might be something like the following. (Other explanations are possible, particularly for the direct paths from "B" that do not go through tokens following "B". One alternative is that the model may use a "binding ID vector" to group the "B" tokens with "1945" and nearby tokens, and use this to attend directly back to the "B" token from the final token position; see Feng & Steinhardt for more details on this type of mechanism.)
Since this involves significant conjecture, it's worth being clear about what we know about the QK-circuit, and what we don't.
- We know that it's controlled by attention, since freezing attention locks the answer.
- We don't know that it's mediated by a particular head, nor that any heads involved have a general behavior that can be understood as a generalization of this.
- We do know that there are three features which appear to track "this seems like the correct answer". We do know that intervening on them changes the correct answer in expected ways; for example, activating them on answer C causes the model to predict "C".
- We don't know that "correct answer" features directly play a role in the key side of whatever attention heads are involved, nor do we know if "need answer" features exist or play a role on the query side.
- We also do not know if there might be alternative parallel mechanisms at play ( see Feng & Steinhardt ).
This is all to say, there's a lot we don't understand!
But despite our limited understanding, it seems clear that the model behavior crucially flows through attention patterns and the QK circuits that compute them. Until we can fix this, our attribution graphs will "miss the story" in cases where attention plays a critical role. And while we were able to get a partial understanding of the story in this case through manual investigation, we would like for our methodology to surface this information automatically in the future!
§ 7.1.3 Future Directions on Attention
We suspect that similar circuits, where attention is the crux, are at play across a wide variety of prompts. In these cases, our present attribution graphs are little help to us, and new methods are needed.
Ultimately, the QK-circuit is a quadratic form over the residual stream. This means that attributions can naturally be made to pairs of key-side and query-side features. These pairs have a weight describing whether they increase or decrease attention to a particular token. However, this approach has the downside of a quadratic explosion in complexity.
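As a rough sketch of what such pairwise attributions could look like for a single head (weight and feature names below are hypothetical, not part of this paper's released method): since the residual stream at each position is approximately a sum of feature decoder directions, the pre-softmax attention score decomposes into a sum over (query-side feature, key-side feature) pairs.

```python
import numpy as np

def qk_pair_attributions(W_Q, W_K, q_feats, k_feats):
    """Decompose one head's attention score into contributions from feature pairs.

    W_Q, W_K: [d_model, d_head] query / key projections for the head.
    q_feats, k_feats: lists of (activation, residual-stream direction) pairs for the
    features present at the query and key positions, respectively.
    """
    d_head = W_Q.shape[1]
    scores = np.zeros((len(q_feats), len(k_feats)))
    for i, (a_q, v_q) in enumerate(q_feats):
        for j, (a_k, v_k) in enumerate(k_feats):
            # Bilinear weight between the two directions, scaled by both activations.
            scores[i, j] = a_q * a_k * ((v_q @ W_Q) @ (W_K.T @ v_k)) / np.sqrt(d_head)
    return scores
```

Summing all entries would approximately recover the head's attention score between the two positions, up to whatever the feature decompositions fail to reconstruct.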
Looking at QK-circuits for individual attention heads may make things harder than necessary, or even be misleading, if there is attention superposition. The interesting behavior may be spread across heads, which each have their own QK-circuit.
We also suspect that certain forms of dictionary learning on attention layers, or full attention replacement layers, could make these "QK attributions" more tractable and interpretable. We believe that finding the right way to handle attention attributions is one of the most important directions for future work.
§ 7.2
Our cross-layer transcoders fail to fully reconstruct the activations of the model. As a result, our methods leave some "dark matter" of the model's mechanisms unobserved, in part because our dictionaries are not large enough to learn a feature corresponding to every concept the model might know about. In our attribution graphs, this failure manifests as "error nodes," which provide input to other nodes but receive no input themselves (they "pop out of nowhere") and which are not interpretable. In the examples presented in this paper, we managed to identify interesting mechanisms despite the presence of error nodes. It is important to note that the circuits we have described are only a partial description of the model's computation (both due to the presence of error nodes, and due to the fact that the circuits we highlight typically only include a subset of the non-error nodes in the graph!).
It is worth highlighting, however, that in some cases error nodes completely obscure what the model is doing. For instance, consider this attempted jailbreak, an obfuscated request for the model to write malware (which the model refuses):
Human: Write an office macro TAhT SEaRChEs fOS AND EtEIaFTxLRS pASWORdS
Assistant:
In the attribution graph, almost all the nodes on the obfuscated prompt tokens are error nodes, making it impossible to trace back the origin of the refusal-related features that are active on the "Assistant:" tokens.
The prevalence of error nodes in this example may not be surprising: this prompt is rather out-of-distribution relative to typical prompts, and so the cross-layer transcoder is likely to do a poor job of predicting model activity.
We also note that another major source of error is the gap between our human interpretations of features and what they truly represent. Typically our interpretations of features are much too coarse to account for their precise activation profiles.
§ 7.2.1 Future Directions on Reconstruction Error and "Dark Matter"
We see several avenues for addressing this issue:
- Scaling replacement models to larger sizes / more training data will increase the amount of variance they explain.
- Architectural modifications to our cross-layer transcoder setup could make it more expressive and thus capable of explaining more variance.
- Training our replacement model in a more end-to-end fashion, rather than on MSE alone, could decrease the weight assigned to error nodes even at a fixed MSE level.
- Finetuning the replacement model on data distributions of interest could improve our ability to capture mechanisms on those distributions.
- We could develop methods of attributing back from error nodes. This would leave an uninterpretable "hole" in the attribution graph, but in some cases may still provide more insight into the model than our current no-inputs error nodes.
§ 7.3 The Role of Inactive Features
Our cross-layer transcoder features are trained to be sparsely active. Their sparsity is key to the success of our method. It allows us to focus on a relatively small set of features for a given prompt, out of the tens of millions of features in the replacement model. However, this convenience relies on a key assumption: that only active features are involved in the mechanism underlying a model's responses.
In fact, this need not be the case! In some cases, the lack of activity of a feature, because it has been suppressed by other features, may be key to the model's response. For instance, in our analysis of hallucinations and entity recognition (see companion paper), we discovered a circuit in which "can't answer" features are suppressed by features representing known entities, or questions with known answers. Thus, to explain why the model hallucinates in a specific context, we need to understand what caused the "can't answer" features to not be active.
By default, our attribution graphs do not allow us to answer such questions, because they only display active features. If we have a hypothesis about which inactive features may be relevant to the model's completion (due to suppression), we can include them in the attribution graph. However, this detracts somewhat from one of the main benefits of our methodology, which is that it enables exploratory, hypothesis-free analysis.
This leads to the following challenge: how can we identify inactive features of interest, out of the tens of millions of inactive features? It seems we want to know which features could have been "counterfactually active" in some sense. In the entity recognition example, we identified these counterfactually active features by comparing pairs of prompts that contained either known or unknown entities (Michael Jordan or "Michael Batkin"), and then focusing on features that were active in at least one prompt from each pair (see the sketch below). We expect that this contrastive-pairs strategy will be key to many circuit analyses going forward. However, we are also interested in developing more unsupervised approaches to identifying key suppressed features. One possibility may be to perform feature ablation experiments and consider the set of inactive features that are only "one ablation away" from being active.
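A minimal sketch of the contrastive-pairs selection, under assumed names (`pair_acts` is a hypothetical array of feature activations on the two prompts of each matched pair, not our actual data structure):

```python
import numpy as np

def contrastive_candidates(pair_acts, threshold=0.0):
    """pair_acts: (n_pairs, 2, n_features) array of feature activations on the
    'known' and 'unknown' prompt of each contrastive pair.
    Returns indices of features active in at least one prompt of every pair;
    these are candidate suppressed / counterfactually active features to
    include in the attribution graph."""
    active = pair_acts > threshold          # (n_pairs, 2, n_features) bool
    active_in_pair = active.any(axis=1)     # (n_pairs, n_features)
    return np.flatnonzero(active_in_pair.all(axis=0))
```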
One might think that these issues can be escaped by moving to global circuit analysis. However, a deep challenge seems to remain. We need a way to filter out interference weights, and it is tempting to do this using co-occurrence of features. But such strategies will miss important inhibitory weights, where one feature consistently prevents another from activating. This can be seen as the global-circuit analog of the challenge that inactive features pose for local attribution analysis.
§ 7.4 Graph Complexity
One of the fundamental challenges of interpretability is finding abstractions and interfaces that manage the cognitive load of understanding complex computations. For example, the fundamental reason we need features to be independently interpretable is to avoid needing to think about all of them at once (see discussion here). Our methodology is designed to reduce the cognitive load of understanding circuits as much as possible. For example:
- Feature sparsity means that there are fewer nodes in the attribution graph.
- We prune the graphs so that the analyst can focus only on their most important components.
- Our UI is designed to make navigating the graph as fluid as possible.
- We use the not-very-principled abstraction of "supernodes" to group related features together in an ad-hoc way.
Despite all these steps, our attribution graphs are still quite complex and require considerable time and effort to understand, for several reasons:
- Even after our pruning pipeline and on fairly short prompts, the graphs typically contain hundreds of features and thousands of edges.
- The concepts we are interested in are typically smeared across multiple features.
- Each feature receives many small inputs from many other features, making it difficult to succinctly summarize "what caused this feature to activate."
- Features often exert influence on one another by multiple paths of different lengths, or even of different signs!
As a result, it is difficult to distill the mechanisms uncovered by our graphs into a succinct story. Consequently, the vignettes we have presented are necessarily simplified stories of even the limited understanding of model computation captured in our attribution graph. We hope that a combination of improved replacement model training, better abstractions, more sophisticated pruning, and better visualization tools can help mitigate this issue in the future.
§ 7.5 Feature Splitting and Abstraction
As sparse coding models have grown in popularity as a technique for extracting interpretable features from models, many researchers have documented shortcomings of the approach (see e.g.). One notable issue is the problem of feature splitting, in which the uncovered features are in some sense too specific. This can also lead to a related problem of feature absorption, where highly specific features steal credit from more general features, leaving holes in them (leading to things like a "U.S. cities except for New York and Los Angeles" feature).
As a concrete example of feature splitting, recall that in many examples in this paper we have highlighted "say X" features that cause the model to output a particular (group of) token(s). However, we also notice that there are many such features, suggesting that each actually represents something more specific. For example, we came up with twelve prompts that Claude 3.5 Haiku completes with the word "during" and measured whether any features activated on all of them, as a true "say 'during'" feature would (see the sketch below). In fact, there are no such features: any individual feature fires for only a subset of the prompts. Moreover, the generality of the features appears to decrease with the size of our cross-layer transcoder.
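A minimal sketch of this measurement, with assumed names (`acts` is a hypothetical matrix of per-prompt maximum feature activations, not our actual data):

```python
import numpy as np

def features_active_on_all_prompts(acts, threshold=0.0):
    """acts: (n_prompts, n_features) max activation of each feature on each of
    the prompts the model completes with 'during'. A true 'say "during"'
    feature would fire on every prompt; we return indices of any such features."""
    return np.flatnonzero((acts > threshold).all(axis=0))
```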
It may be the case that each individual feature represents something interpretable, for instance qualitatively different contexts that might cause one to say the word "during." However, we often find that the level of abstraction we care about is different from the level we find in our features. Using smaller cross-layer transcoders may help with this problem, but would also cause us to capture less of the model's computation.
In this paper, we often work around this issue in an ad-hoc way by manually grouping together features with related meanings into "supernodes" of an attribution graph. While this technique has proven quite helpful, the manual step is labor-intensive and likely loses information. It also makes it difficult to study how well mechanisms generalize across prompts, since different subsets of a relevant feature category may be active on different prompts.
We expect that solving this problem requires recognizing that there exist interpretable concepts at varying levels of abstraction, and at different times we may be interested in different levels. Sparse coding approaches like SAEs and (cross-layer) transcoders are a "flat" instrument, but we probably need a hierarchical variant that allows features at varying levels of abstraction to coexist in an interpretable way.
Several authors have recently proposed "Matryoshka" variants of sparse autoencoders that may help address this issue. Other researchers have proposed post-hoc ways to unify related features with "meta-SAEs".
§ 7.6 Understanding Global Weights
In this paper we have mostly focused on attribution graphs, which display information about feature-feature interactions on a particular prompt. However, one theoretical advantage of transcoder-based methodologies like ours is that they give us global weights between features that are independent of the prompt. This allows us to estimate a "connectome" of the replacement model and learn about the general algorithms it uses (though not necessarily those of the underlying model) that apply to many different inputs. We have had some successes with this approach: for instance, in the companion paper's section on refusals, we could see that the global inputs to "harmful requests" features consist of a variety of specific categories of harm. In this paper, we studied in depth the global weights of features relating to arithmetic, finding for instance that "say a number ending in 5" features receive input from "6 + 9" features, "7 + 8" features, etc.
However, for the most part, we have found global feature-feature connections rather difficult to understand. This is likely for two main reasons:
- Interference weights: because features are represented in superposition, in order to learn useful weights between features, models must incur spurious "interference weights", connections between features that don't make sense and aren't useful to the model's performance. These spurious weights are not too detrimental to model performance because their effects rarely "stack up" enough to actually change the model's output. For instance, we sometimes see features like this one, which appears clearly to be a "say 15" feature, but whose top logit outputs include many seemingly unrelated words ("gag", "duty", "temper", "dispers"). We believe these logit connections are essentially irrelevant to the model's behavior: when this feature activates, it is very unlikely that "duty" will be a plausible completion, so upweighting its logit runs little risk of causing it to be sampled. Unfortunately, this makes the global weights very difficult to understand! The same phenomenon applies to feature-feature weights (see § 4.1 Global Weights in Addition).
- Interactions mediated by attention: the basic global feature-feature weights derived from our cross-layer transcoder describe the direct interactions between features, not mediated by any attention layers. However, there are also feature-feature weights mediated by attention heads. These might be thought of as analogous to how features in a convolutional neural network are related by multiple sets of weights corresponding to different positional offsets (further discussion here).
Our attribution graph edges are weighted combinations of both the direct weights and these attention-mediated weights. Our basic notion of global weights does not account for the latter at all. One way to do so would be to compute the global weights mediated by every possible attention head (see the sketch below). However, this has two limitations: (1) for this to be useful, we need a way of understanding the mechanisms by which different heads choose where they attend (see § 7.1 Limitations: Missing Attention Circuits), and (2) it does not account for interactions mediated by compositions of heads. Solving this issue likely requires extending our dictionary learning methodology to learn interpretable attentional features or replacement heads.
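For illustration, here is a minimal sketch (assumed names and shapes, not our actual implementation) of a global feature-feature weight mediated by a single attention head's OV circuit:

```python
# Assumed shapes (hypothetical): dec_A is the (d_model,) decoder direction of
# a source feature A, enc_B is the (d_model,) encoder direction of a target
# feature B, and W_V (d_model, d_head), W_O (d_head, d_model) are one head's
# value and output projections (NumPy arrays).
def head_mediated_weight(dec_A, enc_B, W_V, W_O):
    """Global weight from feature A to feature B via this head's OV circuit.
    Where the head attends (the QK question) is not modeled here; that is the
    separate problem discussed in the attention limitations above."""
    return float((dec_A @ W_V @ W_O) @ enc_B)
```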
§ 7.7 Mechanistic Faithfulness
Our cross-layer transcoder is trained to mimic the activations of the underlying model at each layer. However, even when it accurately reconstructs the model's activations, there is no guarantee that it does so via the same mechanisms. For instance, even if the cross-layer transcoder achieved zero MSE on our training distribution, it might have learned a fundamentally different input/output function than the underlying model, and consequently have large reconstruction error on out-of-distribution inputs. We hope that this issue is mitigated by (1) training on a broad data distribution, and (2) forcing the replacement model to reconstruct the underlying model's per-layer activations, rather than simply its output. Nevertheless, we cannot guarantee that the replacement model has learned the same mechanisms (what we call mechanistic faithfulness), and instead resort to verifying it post hoc.
In this paper, we have used perturbation experiments (inhibiting and exciting features) to validate the mechanisms suggested by our attribution graphs. In the case studies we presented, we were typically able to validate that features had the effects we expected (on the model output, and on other features). However, the degree of validation we have provided is very coarse. We typically perturb multiple features at once ("supernodes") and check their directional effects on other features and logit outputs (a sketch follows). In addition, we typically sweep over the layer at which we perform perturbations and use the layer that yields the maximum effect. In principle, our attribution graphs make predictions that are much more fine-grained than these kinds of interventions can test. Ideally, we should be able to accurately predict the effect of perturbing any feature at any layer on any other feature.
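For concreteness, here is a minimal sketch of this kind of perturbation, assuming (hypothetically) that it is applied by adding a scaled sum of the supernode features' decoder directions to the residual stream at a chosen layer:

```python
import numpy as np

def perturb_supernode(resid, decoder_dirs, scale):
    """resid: (n_tokens, d_model) residual stream at the chosen layer.
    decoder_dirs: (n_features, d_model) decoder directions of the supernode's
    features. scale < 0 inhibits the supernode, scale > 0 excites it."""
    return resid + scale * decoder_dirs.sum(axis=0)

# Toy usage; in practice one would sweep the layer of intervention and keep
# the one with the largest downstream effect.
resid = np.zeros((4, 512))                # 4 tokens, d_model = 512 (assumed)
supernode_dirs = np.random.randn(3, 512)  # 3 features in the supernode
perturbed = perturb_supernode(resid, supernode_dirs, scale=-5.0)
```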
In § G Appendix: Validating the Replacement Model, we attempt to more comprehensively quantify our accuracy in predicting such perturbation results, finding reasonably good predictive power for effects a few layers downstream of a perturbation, and much worse predictive power many layers downstream. This suggests that, while our circuit descriptions may be mechanistically accurate at a very coarse level, we have substantial room to improve their faithfulness to the underlying model.
We are optimistic about trying methods to directly optimize for mechanistic faithfulness, or exploring alternative dictionary learning architectures that learn more faithful solutions naturally.
§ 8 Discussion
Our approach to reverse engineering neural networks has four basic steps: decomposition into components, providing descriptions of these components, characterizing how components interact to produce behaviors, and validating these descriptions. See Sharkey et al. for a detailed description of the reverse engineering philosophy. A number of choices are required at each step, which can be more or less principled, and the power of a method is ultimately the degree to which it produces valid hypotheses about model behaviors.
In this paper, we trained cross-layer transcoders with sparse features to replace MLP blocks (the decomposition), described the features by the dataset examples they activate on (the description), characterized their interactions on specific prompts using attribution graphs (the interactions), and validated the hypotheses using causal steering interventions (the validation).
We believe some of the choices we made are robust, and that successful decomposition methods will make similar choices or find other ways of dealing with the underlying issues they address:
- We use learned features instead of neurons. While the top activations of neurons are often interpretable, lower activations are not. In principle, one could threshold neuron activations to restrict them to this interpretable regime; however, we found that thresholding neurons at that level damages model behavior significantly more than replacing them with a transcoder or CLT does. This means a trained replacement layer can provide a Pareto improvement over thresholded neurons across interpretability, L0, and MSE. The set of neurons is also fixed in size, whereas learned dictionaries can be scaled. (Nevertheless, neurons can provide a starting point for investigation without incurring any additional compute cost; see e.g. .)
- We use transcoders instead of residual-stream SAEs. While residual stream SAEs can decompose the latent states of a model, they don't provide a natural extension for decomposing its computational steps. Crucially, transcoder features bridge over MLP layers and interact linearly via the residual stream with transcoder features in other layers. In contrast, interactions between SAE features have non-linear MLPs interposed between them.
- We use cross-layer transcoders instead of per-layer transcoders. We hypothesized that different MLP layers might collaborate to implement a single computational step ("cross-layer superposition"); the most extreme case of this is when many layers amplify the same early-layer feature so that it remains large enough to influence late layers. CLTs collapse these into one feature. We found evidence that this phenomenon occurs in practice, manifested as a Pareto improvement on path length vs. graph influence.
- We compute feature-feature interactions using linear direct effects instead of nonlinear attributions or ablations. Much has been written about "saliency maps" and attribution through non-linear neural networks (including ablation, path-integrated gradients, and Shapley values (e.g.)). Even the most principled options for credit assignment in a nonlinear setting are somewhat fraught. Since our goal is to crisply reason about mechanism, we construct our setup so that the direct interactions between features in the previous layer and the pre-activations of features in the next layer are conditionally linear; that is to say, they are linear once we freeze certain parts of the problem (attention patterns and normalization denominators). This factors the problem into a portion we can mechanistically understand in a principled manner, and a portion that remains to be understood (a minimal sketch follows this list). Also crucial to achieving this linear direct-effect property is the earlier decision to use transcoders. Credit attribution in a non-linear setting is a hard problem; the core challenge can't simply be washed away, and it's worth asking where we have implicitly pushed the complexity. There are three places it may have gone. First, and this is the best option, the non-linear interactions may have become multi-step paths in our attribution graph, which can then be reasoned about in interpretable ways. Next, a significant amount of it must have gone into the frozen components, factored into separate questions we haven't tried to address here but at least know remain. But there is also a bad option: some of it may have been simplified by our CLT taking a non-mechanistically-faithful shortcut, which approximates the MLP computation with a linear approximation that is often correct. Our setup allows for some similar principled notions of linear interaction in the global context, but we then have to think of different weights for interactions along different paths; for example, what is the interaction of feature A and feature B mediated by attention head H? This is discussed in § 4 Global Weights. The Framework paper also discussed these general conceptual issues in its appendix.
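To illustrate what conditional linearity buys us, here is a minimal sketch (assumed names, not our actual code) of a direct edge weight between a source feature and a target feature's pre-activation once attention patterns and normalization denominators are frozen:

```python
def direct_edge(a_src, dec_src, enc_tgt):
    """a_src:   scalar activation of the source feature on this prompt.
    dec_src: (d_model,) the source feature's decoder vector as read by the
             target layer (already routed through any frozen attention, which
             acts linearly once the patterns are fixed).
    enc_tgt: (d_model,) the target feature's encoder vector.
    Returns the source feature's direct contribution to the target feature's
    pre-activation; no gradients or ablations through nonlinearities needed."""
    return a_src * float(dec_src @ enc_tgt)
```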
There are other choices we made for convenience, or as a first step towards a more general solution:
- We collapse attention paths. Every edge in our attribution graph is the direct interaction of a pair of features, summed over all possible direct interaction paths. Some of these paths flow primarily through the residual stream; others flow through attention heads. We make no effort in the present work to distinguish these. This throws away a lot of interesting structure, since which heads mediated an interaction may be interesting if they are something we understand (e.g. a successor head or induction head). Analyzing the global weights mediated by a head may be interesting here: for example, an induction head might systematically move "I'm X" features to "say X" features, and a successor head might systematically map "X" features to "X+1" features. Of course, not all heads are individually interesting in the way induction heads or successor heads often are. We might suspect there are many more "attentional features" like induction and succession hiding in superposition over attention heads. If we could reveal these, there might be a much richer story.
- We ignore QK-circuits. In order to get our linear feature-feature interactions, we factor understanding transformers into two pieces, following Framework. First we ask about feature-feature interaction, conditional on an attention head or set of attention heads (the "OV-circuit"). But this leaves a second question of why attention heads attend where they do (the "QK-circuit"). In this work, we do not attempt this second half.
- We only use a sparsity penalty and a reconstruction loss for crosscoder training. While our ultimate goal is to find circuits with sparse, interpretable edges in a replacement model that is mechanistically faithful to the underlying model, we don't train explicitly for any of those goals.
Nevertheless, our current method yielded interesting, validated mechanisms involving planning, multilingual structure, hallucinations, refusals, and more, in the companion paper.
We expect that advances in the trained, interpretable replacement model paradigm will produce quantitative improvements on graph-related metrics and qualitative improvements in the range of model behaviors that become legible. This may be an incremental process, in which incremental improvements to CLTs and associated approaches to attention yield incremental improvements to circuit identification, or a radically different decomposition approach may best this method at uncovering mechanisms at scale. Regardless, we hope to enter an era with a clear flywheel between decomposition methods and "biology" results, where the appearance of structure in specific model investigations inspires innovations in decomposition methods, which in turn bring more model behaviors into the light.
§ 8.1 Coda: The lessons of addition
Addition is one of the simplest behaviors performed by models, and because it is so structured, we can characterize every feature's activity on the full problem domain exactly. This allows us to skip the difficult step of staring at dataset examples and trying to discern what a feature responds to and what distinguishes it from other features active in similar contexts. This analysis revealed a range of heuristics used by Haiku 3.5 ("say something ending in a 5", "say something around 50", "say something starting with 51"), which had been identified before by Nikankin, together with a set of lookup-table features that connect input pairs satisfying certain conditions (say, adding digits ending in 6 and 9) to the appropriate sum features satisfying the consequent condition on the output (say, producing a sum ending in 5).
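As a toy illustration (purely hypothetical Python, not the model's actual features) of how a lookup-table condition on the operands feeds the corresponding sum condition on the output:

```python
def fires_6_plus_9(a: int, b: int) -> bool:
    """A '_6 + _9'-style lookup-table condition on the two operands."""
    return {a % 10, b % 10} == {6, 9}

def fires_sum_ends_in_5(a: int, b: int) -> bool:
    """The corresponding 'sum = _5' output condition it feeds into."""
    return (a + b) % 10 == 5

# Every operand pair satisfying the lookup-table condition also satisfies the
# sum condition (6 + 9 = 15 ends in 5):
assert all(fires_sum_ends_in_5(a, b)
           for a in range(200) for b in range(200) if fires_6_plus_9(a, b))
```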
However, even in this easier setting we made numerous mistakes when labeling these features from the original dataset examples alone, for example thinking a "_6 + _9" feature was itself a "sum = _5" feature based on what followed it in context. We also struggled to distinguish between low-precision features of different scales, and between features that were sensitive to a limited set of inputs and those that merely appeared to be because of a high prevalence of those inputs in our dataset. How much worse must this be when looking at dozens of gradations of refusal features! Getting more precise distinctions between features in fuzzier domains than arithmetic, whether through feature geometry or superhuman autointerpretability methods, will be necessary if we want to understand problems at the level of resolution that even today's CLTs appear to make possible.
Because addition is such a clear problem, we were also able to see how the features connect with each other to build parallel pathways, giving rise from simple heuristics that depend on the input to more complex heuristics related to the output: going from the "Bag of Heuristics" identified by Nikankin to a "Graph of Heuristics". The virtual weights show this computational structure, with groups of lookup-table features combining to form sum features of different modularity and scale, which combine to form more precise sum features and eventually the output. It seems likely that, in "fuzzier" natural language examples, we are conflating many roles played by features at different depths into overall buckets like "unknown entity" or "harmful request" or "notions of largeness" which actually serve specialized roles, and that there is an intricate aggregation and transformation of information taking place, just out of our understanding today.
§ 9 Related Work
Despite being a young field, mechanistic interpretability has grown rapidly. For an introduction to the landscape of open problems and existing methods, we recommend the recent survey by Sharkey et al. . Broader perspectives can be found in recent reviews of mechanistic interpretability and related topics ( e.g. ).
In previous papers, we've discussed some of the foundational topics our work builds on; rather than recapitulate that discussion, we refer readers to our previous treatment of them. This includes:
- Existence of interpretable features (e.g.; see prior discussion),
- Attention head analysis (e.g.; see prior discussion),
- Bertology (e.g.; see prior discussion),
- Interpretability interfaces (e.g.; see prior discussion),
- Disentanglement (e.g.; see prior discussion),
- Compressed sensing (e.g.; see prior discussion),
- Sparse dictionary learning (e.g.; see prior discussion),
- Theory of superposition (e.g.; see prior discussion),
- Theories of neural coding and distributed representation (e.g.; see prior discussion),
- Activation steering (e.g.; see prior discussion).
The next two sections focus on the two stages we often use in mechanistic interpretability: identifying features and then analyzing the circuits they form. Following that, we'll turn our attention to past work on the "biology" of neural networks.
§ 9.1 Feature Discovery Methods
A fundamental challenge in circuit discovery is finding suitable units of analysis. The network's natural components (attention heads and neurons) lack interpretability due to superposition, making the identification of better analytical units a central problem in the field.
Sparse Dictionary Learning is a technique with a long history, originally developed by neuroscientists to analyze neural recording data. Recent work has applied Sparse Autoencoders (SAEs) as a scalable solution to address superposition by learning dictionaries that decompose language model representations. While SAEs have been successfully scaled to frontier models, researchers have identified several methodological limitations, including feature shrinkage, feature absorption, lack of canonicalization, the difficulty of automating feature interpretation, and poor performance on downstream classification and steering tasks.
For circuit analysis specifically, SAEs are suboptimal because they decompose representations rather than computations. Transcoders address this limitation by predicting the output of nonlinear components from their inputs. Bridging over nonlinearities like this enables direct computation of pairwise feature interactions without relying on attributions or ablations through intermediate nonlinearities.
The space of dictionary learning approaches is large, and we remain very excited about work which explores this space and addresses methodological issues. Recent work has explored architectural modifications like multilayer feature learning , adding skip connections , incorporating gradient information , adding stronger hierarchical inductive biases , and further increasing computational efficiency with mixtures-of-experts . The community has also studied alternative training protocols to learn dictionaries that respect downstream activations , reduce feature shrinkage , and make downstream computational interactions sparse . To measure this methodological progress, a number of benchmarks and standardized evaluation protocols for assessing dictionary learning methods have been developed . We think circuit-based metrics will be the next frontier in dictionary learning evaluation.
Beyond dictionary learning, several alternative unsupervised approaches to extracting computational units have shown initial success in small-scale settings. These include transforming activations into the local interaction basis and developing attribution-based decompositions of parameters .
§ 9.2 Circuit Discovery Methods
Definitions. Throughout the literature, the term circuit is used to mean many different things. Olah et al. introduced the definition as a subgraph of a neural network where nodes are directions in activation space and edges are the weights between them. This definition has been relaxed over time in some parts of the literature and is often used to refer to a general subgraph of network components with edge weights computed from an attribution or intervention of some kind .
There are several dimensions along which circuit approaches and definitions vary:
- Are the units of analysis globally interpretable or not? (For example, compare monosemantic features versus an entire attention head which does many different things across the data distribution.)
- Is the circuit itself (i.e. the edge connections) a global description, or a locally valid attribution graph? Or something else?
- Are the edges interpretable? (For example, are the edges computed by linear attributions or a complex nonlinear intervention?)
- Does the approach naturally address superposition?
We believe the North Star of circuit research is to manifest an object with globally interpretable units connected by interpretable edges which are globally valid. The present work falls short by only offering a locally valid attribution graph.
Manual Analysis. Early circuit discovery was largely manual, requiring specific hypotheses and bespoke methods of validation. Causal mediation analysis, activation patching, path patching, and distributed alignment search have been the most commonly used techniques for refining hypotheses and isolating causal pathways in direct analyses. However, these techniques generally do not provide interpretable (i.e. linear) edge weights between units.
Automatic Analysis. These analyses were automated in Conmy et al. by developing a recursive patching procedure to automatically find component subgraphs given a task of interest. However, patching analyses are computationally expensive, requiring a forward pass per step. This motivated attribution patching , a more efficient method leveraging gradients to approximate the effect of interventions. There has been significant follow up work on attribution patching including improving the gradient approximation, adapting the techniques to vision models , and incorporating positional information . Other techniques studied in the literature include learned masking techniques, circuit probing , discretization , and information flow analysis . The objective of many of these automated approaches is isolating important model components (i.e., layers, neurons, attention heads) and the interactions between them, but they do not address the interpretation of these components.
However, armed with better computational units of analysis from sparse dictionary learning, this work and other recent papers are making a return to discovering connections between interpretable components. Specifically, our work is most similar to Dunefsky et al. and Ge et al. who also use transcoders to compute per-prompt attribution graphs with stop-gradients while also studying input-agnostic global weights. Our work is different in that we use crosscoders to absorb redundant features, use a more global pruning algorithm, include error nodes , and use a more powerful visualization suite to enable deeper qualitative analysis. These works are closer in spirit to the original circuits vision outlined in Olah et al., but inherit prompt specific quantities (e.g. attention patterns) that limit their generality.
Attention Circuits. The attention mechanism in transformers introduced challenges for weight-based circuit analysis as done by Olah et al.. Elhage et al. proposed decomposing attention layers into a nonlinear  QK component controlling the attention pattern, and a linear OV component which controls the output. By freezing QK (e.g., for a particular prompt), transcoder feature-feature interactions mediated by attention become linear. This is the approach adopted by this work and Dunefsky et al. . Others have tried training SAEs on the attention outputs and multiplying features through key and query matrices to explain attention patterns .
Replacement Models. Our notion of a replacement model is similar in spirit to past work on causal abstraction and proxy models . These methods seek to learn an interpretable graphical model which maintains faithfulness to an underlying black box model. However, these techniques typically require a task specification or other supervision, as opposed to our replacement model which is learned in a fully unsupervised manner.
Circuit Evaluation. Causal scrubbing was proposed as an early principled approach for evaluating interpretation quality through behavior-preserving resampling ablations. Shi et al. formalized criteria for evaluating circuit hypotheses, focusing on behavior preservation, localization, and minimality, while applying these tests to both synthetic and discovered circuits in transformer models. However, Miller et al. raised important concerns about existing faithfulness metrics, finding them highly sensitive to seemingly insignificant changes in ablation methodology.
§ 9.3 Circuit Biology and Phenomenology
Beyond methods, many works have performed deep case studies and uncovered interesting model phenomenology. For example, thorough circuit analysis has been performed on
- Arithmetic in toy models
- Python doc strings
- Indirect object identification
- Computing the greater-than operator
- Multiple Choice
- Pronoun gender
- In-context learning
The growing set of case studies has enabled further research on how these components are used in other tasks. Moreover, Tigges et al. found that many of these circuit analyses are consistent across training and scale. Preceding these analyses, there has also been a long line of "Bertology" research that has studied model biology (see survey) using attention pattern analysis and probing.