May 20, 2025
How do language models (LMs) represent characters’ beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze Llama-3-70B-Instruct’s ability to reason about characters’ beliefs using causal mediation and abstraction. We construct a dataset that consists of simple stories where two characters each separately change the state of two objects, potentially unaware of each other’s actions. Our investigation uncovered a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating reference information about them, represented as their Ordering IDs (OIs), in low-rank subspaces of the state token’s residual stream. When asked about a character’s beliefs regarding the state of an object, the binding lookback retrieves the corresponding state OI, and then an answer lookback retrieves the state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character’s beliefs. Our work provides insights into the LM’s belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
The ability to infer the mental states of others—known as Theory of Mind (ToM)—is an essential aspect of social and collective intelligence [1], [2]. Recent studies have established that language models (LMs) can solve some tasks requiring ToM reasoning [3]–[5], while others have highlighted shortcomings [6]–[8]. Nonetheless, most existing work relies on behavioral evaluations, which do not shed light on the internal mechanisms by which LMs encode and manipulate representations of mental states to solve (or fail to solve) such tasks [9], [10].
In this work, we investigate how LMs represent and update characters’ beliefs, which is a fundamental element of ToM [11], [12]. For instance, the Sally-Anne test [13], a canonical test of ToM in humans, evaluates this ability by asking individuals to track Sally’s belief, which diverges from reality due to missing information, and Anne’s belief, which updates based on new observations.
We construct CausalToM, a dataset of simple stories involving two characters, each interacting with an object to change its state, with the possibility of observing one another. We then analyze the internal mechanisms that enable Llama-3-70B-Instruct [14] to reason about and answer questions regarding the characters’ beliefs about the state of each object (for a sample story, see Section 3; for the full prompt, refer to Appendix 9).
We discover a pervasive computation that performs multiple subtasks, which we refer to as the lookback mechanism. This mechanism enables the model to recall important information only when it becomes necessary. In a lookback, two copies of a single piece of information are transferred to two distinct tokens. This allows attention heads at the latter token to look back at the earlier one when needed and retrieve vital information stored there, rather than transferring that information directly (see Fig. 1).
We identify three key lookback mechanisms that collectively perform belief tracking: 1) Binding lookback (Fig. 2 (i)): First, the LM assigns ordering IDs (OIs) [15] that encode whether a character, object, or state token appears first or second. Then, the character and object OIs are copied to low-rank subspaces of the corresponding state token and the final token residual stream. Later, when the LM needs to answer a question about a character’s beliefs, it uses this information to retrieve the answer state OI. 2) Answer lookback (Fig. 2 (ii)): Uses the answer state OI from the binding lookback to retrieve the answer state token value. 3) Visibility lookback (Fig. 6): When an explicit visibility condition between characters is mentioned, the model employs additional reference information, called the visibility ID, to retrieve information about the observed character, augmenting the observing character’s awareness.
Overall, this work not only advances our understanding of the internal computations in LMs that enable ToM capability but also uncovers a pervasive mechanism that serves as the foundation for executing complex logical reasoning with conditionals. All code and data supporting this study are available at https://belief.baulab.info.
Our investigation of belief tracking uncovers a recurring pattern of computation that we call the lookback mechanism.² Here we give a brief overview of this mechanism; subsequent sections provide detailed experiments and analyses. In a lookback, source information is copied (via attention) into an address copy in the residual stream of a recalled token and a pointer copy in the residual stream of a lookback token that occurs later in the text. The LM places the address alongside a payload in the recalled token’s residual stream, which can be brought forward to the lookback token if necessary. Fig. 1 schematically describes a generic lookback.
That is, the LM can use attention to dereference the pointer and retrieve the payload present in the residual stream of the recalled token (which might contain aggregated information from previous tokens), bringing it to the residual stream of the lookback token. Specifically, the pointer at the lookback token forms an attention query vector, while the address at the recalled token forms a key vector. Because the pointer and the address are copies of the same source information, they have a high dot product, so a QK-circuit [16] is established, forming a bridge from the lookback token to the recalled token. The LM uses this bridge to move the payload, which contains the information needed to complete the subtask, through the OV-circuit.
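To make the QK/OV description above concrete, here is a toy numerical sketch of a lookback, with identity query/key/value projections and randomly drawn vectors standing in for the source, address, pointer, and payload representations; nothing here is taken from the model’s actual weights.

```python
import torch

torch.manual_seed(0)
d = 16  # toy residual-stream width

# A single piece of source information (e.g., an ordering ID) is copied twice.
source = torch.randn(d)
address = source.clone()   # stays in the recalled token's residual stream
pointer = source.clone()   # lands in the lookback token's residual stream

# The payload sits alongside the address in the recalled token's residual stream.
payload = torch.randn(d)
recalled = address + payload
distractor = torch.randn(d)   # some unrelated token's residual stream
lookback = pointer

# Toy QK-circuit (identity projections): the pointer acts as the query and the
# address acts as the key, so the score on the recalled token dominates.
keys = torch.stack([recalled, distractor])
attn = torch.softmax(keys @ lookback / d**0.5, dim=0)

# Toy OV-circuit: the attended residual stream (address + payload) is brought
# to the lookback token, carrying the payload with it.
retrieved = attn @ keys
print(attn)                                                # weight ≈ 1 on the recalled token
print(torch.cosine_similarity(retrieved, recalled, dim=0)) # ≈ 1: the payload arrives with the address
```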
To develop an intuition for why an LM would learn to implement lookback mechanisms for reasoning tasks such as our belief tracking task, note that during training LMs process text in sequence with no foreknowledge of what might come next. It is therefore useful to mark addresses alongside payloads that might be needed for downstream tasks. In our setting, the LM constructs a representation of the story without knowing what questions it may be asked, so it concentrates pieces of information in the residual streams of certain tokens, which later become payloads and addresses. When the question text is reached, pointers are constructed that reference this crucial story information and dereference it as the answer to the question.
Dataset Existing datasets for evaluating ToM capabilities of LMs are designed for behavioral testing and lack the counterfactual pairs needed for causal analysis [18]. To address this, we constructed CausalToM, a structured dataset of simple stories, where each story involves two characters, each interacting with a distinct object and causing that object to take a unique state. For example: “Character1 and Character2 are working in a busy restaurant. To complete an order, Character1 grabs an opaque Object1 and fills it with State1. Then Character2 grabs another opaque Object2 and fills it with State2.” We then ask the LM to reason about one of the characters’ beliefs regarding the state of an object: “What does Character1 believe Object2 contains?” We analyze the LM’s ability to track characters’ beliefs in two distinct settings: (1) No Visibility, where both characters are unaware of each other’s actions, and (2) Explicit Visibility, where explicit information about whether a character can or cannot observe the other’s actions is provided, e.g., “Bob can observe Carla’s actions. Carla cannot observe Bob’s actions.” We also provide general task instructions (e.g., answer unknown when a character is unaware); refer to Appendix 9 & 10 for the full prompt and additional dataset details.

Our experiments analyze the Llama-3-70B-Instruct model in half-precision, using NNsight [19]. The model demonstrates high behavioral performance in both the no-visibility and explicit-visibility settings, achieving accuracies of 95.7% and 99%, respectively. For all subsequent experiments, we filter out samples that the model fails to answer correctly.
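For concreteness, the sketch below fills a simplified story template and prints the expected answers in the two settings; the entity lists, template strings, and variable names are illustrative stand-ins rather than the released CausalToM generation code.

```python
import random

# Illustrative stand-ins for the full entity lists (103 characters,
# 21 objects, and 23 states in the released dataset).
CHARACTERS = ["Bob", "Carla", "Maya", "Omar"]
OBJECTS = ["bottle", "cup", "jar", "bowl"]
STATES = ["beer", "coffee", "tea", "milk"]

STORY = ("{c1} and {c2} are working in a busy restaurant. "
         "To complete an order, {c1} grabs an opaque {o1} and fills it with {s1}. "
         "Then {c2} grabs another opaque {o2} and fills it with {s2}.")
VISIBILITY = "{a} can observe {b}'s actions. {b} cannot observe {a}'s actions."
QUESTION = "What does {c} believe the {o} contains?"

rng = random.Random(0)
c1, c2 = rng.sample(CHARACTERS, 2)
o1, o2 = rng.sample(OBJECTS, 2)
s1, s2 = rng.sample(STATES, 2)

story = STORY.format(c1=c1, c2=c2, o1=o1, o2=o2, s1=s1, s2=s2)

# No-visibility setting: c1 has no belief about o2, so the answer is "unknown".
print(story)
print(QUESTION.format(c=c1, o=o2), "->", "unknown")

# Explicit-visibility setting: c1 observes c2, so c1 believes o2 contains s2.
print(story + " " + VISIBILITY.format(a=c1, b=c2))
print(QUESTION.format(c=c1, o=o2), "->", s2)
```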
Causal Mediation Analysis
Our goal is to develop a mechanistic understanding of how Llama-3-70B-Instruct reasons about characters’ beliefs and answers related questions [20]. A key method for conducting causal analysis is interchange interventions [21]–[23], in which the LM is run on paired examples: an original input \(\mathbf{o}\) and a counterfactual input \(\mathbf{c}\). Certain internal activations in the run on the original are replaced with those computed from the counterfactual run.
Drawing inspiration from existing literature [21], [24], [25], we begin our analysis by performing interchange interventions with counterfactuals that are identical to the original except for key input tokens. We trace the causal path from these key tokens to the final output. This is a type of Causal Mediation Analysis [26]. Specifically, we construct a counterfactual dataset where \(\mathbf{o}\) contains a question about the belief of a character not mentioned in the story, while \(\mathbf{c}\) is identical except that the story includes the queried character. The expected outcome of this intervention is a change in the final output of \(\mathbf{o}\) from unknown to a state token, such as beer. We conduct similar interchange interventions for object and state tokens (refer to Appendix 11 for more details).
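The following sketch shows the shape of such a residual-stream interchange intervention using plain PyTorch forward hooks on a HuggingFace causal LM; the paper’s experiments use NNsight [19] on Llama-3-70B-Instruct, and the smaller checkpoint, layer index, token position, and shortened prompts below are placeholder assumptions chosen only to keep the example light.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper analyzes Llama-3-70B-Instruct with NNsight.
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto")
model.eval()

LAYER, POS = 20, -1   # layer and token position to patch (illustrative choices)

# Simplified CausalToM-style prompts: the original o asks about a character
# absent from the story (answer: unknown); the counterfactual c includes her.
original_prompt = ("Story: Bob grabs an opaque bottle and fills it with beer. "
                   "Question: What does Carla believe the bottle contains? Answer:")
counterfactual_prompt = ("Story: Carla grabs an opaque bottle and fills it with beer. "
                         "Question: What does Carla believe the bottle contains? Answer:")

def residual_at(prompt):
    """Cache the residual stream at (LAYER, POS) for a prompt."""
    cache = {}
    def hook(_, __, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cache["h"] = hidden[:, POS, :].detach().clone()
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt").to(model.device))
    handle.remove()
    return cache["h"]

def patched_next_token(prompt, resid):
    """Run a prompt while overwriting the residual stream at (LAYER, POS)."""
    def hook(_, __, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, POS, :] = resid.to(hidden.dtype)   # in-place interchange
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt").to(model.device))
    handle.remove()
    return tok.decode(out.logits[0, -1].argmax().item())

# Interchange intervention: patch the counterfactual activation into the original run.
print(patched_next_token(original_prompt, residual_at(counterfactual_prompt)))
```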
Figure 3 presents the aggregated results of this experiment for the key input tokens \(\boldsymbol{\textcolor{nicegreen}{Character1}}\), \(\boldsymbol{\textcolor{niceyellow}{Object1}}\), and \(\boldsymbol{\textcolor{nicepurple}{State1}}\). The cells are color-coded to indicate the interchange intervention accuracy (IIA) [27]. Even at this coarse level of analysis, several significant insights emerge: 1) Information from the correct state token (beer) flows directly from its residual stream to that of the final token in later layers, consistent with prior findings [28], [29]; 2) Information associated with the queried character and the queried object is retrieved from their earlier occurrences and passed to the final token before being replaced by the correct state token.
Desiderata-Based Patching via Causal Abstraction
The causal mediation experiments provide a coarse-grained analysis of how information flows from an input token to the output, but they do not identify what that information is. In a transformer, the input to the first layer contains the input tokens and the output of the final layer determines the output token; but what is the information content of the representations along the active causal path between input and output?
To answer this question, we turn to Causal Abstraction [30], [31]. We align the variables of a high-level causal model with the LM’s internal activations and verify the alignment by conducting targeted interchange interventions for each variable. Specifically, we perform aligned interchange interventions at both levels: interventions that target high-level causal variables and interventions that modify low-level features of the LM’s hidden activations. If the LM produces the same output as the high-level causal model under these aligned interventions, it provides evidence supporting the hypothesized causal model. The effect of these interventions is quantified using interchange intervention accuracy (IIA), which measures the proportion of instances where the intervened high-level causal model and the intervened low-level LM produce the same output (refer to Appendix 12 for more details about the causal abstraction framework and Appendix 13 for the belief tracking causal model).
In addition to performing interchange interventions on entire residual stream vectors in the LM, we also intervene on specific subspaces to further localize causal variables. To identify the subspace encoding a particular variable, we employ the Desiderata-based Component Masking (DCM) [29], [32], [33] technique, which learns a sparse binary mask over the internal activation space by maximizing the logit of the causal model’s output token. Specifically, we train a mask to select the singular vectors of the activation space that encode a high-level variable (see Appendix 14 for details).
The LM solves the no-visibility setting of the belief tracking task using three key mechanisms: Ordering ID assignment, binding lookback, and answer lookback. Figure 2 illustrates the hypothesized high-level causal model implemented by the LM, which we evaluate in the following subsections. The LM first assigns ordering IDs (OIs) to each character, object, and state in the story that encode their order of appearance (e.g., the second character Carla is assigned the second character OI). These OIs are used in two lookback mechanisms. (i) Binding lookback: Address copies of each character OI and object OI are placed alongside their corresponding state OI payload in the residual stream of each state token, binding together each character-object-state triple. When the model is asked about the belief of a specific character about a specific object, it moves pointer copies of the corresponding OIs to the final token’s residual stream. These pointers are dereferenced, bringing the correct state OI into the final token residual stream. (ii) Answer lookback: An address copy of the state OI is placed alongside the state token payload in the residual stream of the correct state token, while a pointer copy is moved to the final token residual stream via the binding lookback. The pointer is dereferenced, bringing the answer state token payload into the final token residual stream, which is predicted as the final output.
Refer to Appendix 13 for pseudocode defining the causal model for the belief tracking task. In Appendices 20 and 19, we show that parts of our analysis generalize to the Llama-3.1-405B-Instruct model and the BigToM dataset [34].
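As a reading aid, the following minimal Python rendering captures the input–output behavior of the high-level causal model just described, for both the no-visibility and explicit-visibility settings; the authoritative pseudocode is in Appendix 13, and all names below (Story, belief_answer, visible) are our own.

```python
from dataclasses import dataclass

@dataclass
class Story:
    characters: tuple   # (character1, character2), in order of appearance
    objects: tuple      # (object1, object2), object i is handled by character i
    states: tuple       # (state1, state2), state i fills object i
    visible: dict       # e.g., {(0, 1): True} means character 0 observes character 1

def belief_answer(story: Story, queried_char: str, queried_obj: str) -> str:
    # Ordering IDs: position of the queried character and object, if present.
    char_oi = story.characters.index(queried_char) if queried_char in story.characters else None
    obj_oi = story.objects.index(queried_obj) if queried_obj in story.objects else None
    if char_oi is None or obj_oi is None:
        return "unknown"
    # Binding lookback: the state OI bound to the queried object's OI.
    state_oi = obj_oi
    # The character knows the state if they caused it or can observe its causer.
    if char_oi == state_oi or story.visible.get((char_oi, state_oi), False):
        # Answer lookback: dereference the state OI to the state token.
        return story.states[state_oi]
    return "unknown"

story = Story(("Bob", "Carla"), ("bottle", "cup"), ("beer", "coffee"),
              visible={(0, 1): True})          # Bob can observe Carla
print(belief_answer(story, "Bob", "cup"))      # coffee
print(belief_answer(story, "Carla", "bottle")) # unknown
```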
The LM assigns an Ordering ID (OI; [15]) to the character, object, and state tokens. These OIs, encoded in a low-rank subspace of the internal activations, serve as a reference that indicates whether an entity is the first or second of its type, independent of its token value. For example, in Fig. 2, Bob is assigned the first character OI, while Carla receives the second. In the subsequent subsections and Appendices 15 & 16, we validate the presence of OIs through multiple experiments, where intervening on tokens with identical token values but different OIs alters the model’s internal computation, leading to systematic changes in the final output predicted by our high-level causal model. The LM then uses these OIs as building blocks, feeding them into lookback mechanisms to track and retrieve beliefs.
The Binding Lookback is the first operation applied to these OIs. The character and object OIs, serving as the source information, are each copied twice. One copy, referred to as the address, is placed in the residual stream of the state token (recalled token), alongside the state OI as the payload to transfer. The other copy, referred to as the pointer, is moved to the residual stream of the final token (lookback token). These pointer and address copies are then used to form the QK-circuit at the lookback token, which dereferences the state OI payload, transferring it from the state token to the final token. See Fig. 2 (i) for a schematic of this lookback and Fig. 1 for the general mechanism.
Localizing the Address and Payload In our first experiment, we localize the address copies of the character and object OIs and the state OI payload to the residual stream of the state token (recalled token, Fig. 2). We sampled a counterfactual dataset where each example consists of an original input \(\mathbf{o}\) with an answer that is not unknown and a counterfactual input \(\mathbf{c}\) in which the character, object, and state tokens are identical, except that the order of the two story sentences is swapped while the question remains unchanged, as illustrated in Fig. 4. The expected outcome predicted by our high-level causal model under this intervention is the other state token from the original example, e.g., beer, because reversing the address and payload values without changing the pointer flips the output. In the LM, the QK-circuit, formed using the pointer at the lookback token, attends to the other state token and retrieves its state OI as the payload.
We perform an interchange intervention experiment layer by layer, replacing the residual stream vectors at the first state token in the original run with those of the second state token in the counterfactual run, and vice versa for the other state token. It is important to note that if the intervention targets state token values instead of their OIs, it should not produce the expected output. (This happens in the earlier layers.)
As shown in Fig. 4, the strongest alignment occurs between layers \(33\) and \(38\), supporting our hypothesis that the state token’s residual stream contains both the address information (character and object OIs) and the payload information (state OI). These components are subsequently used to form the appropriate QK and OV circuits of binding lookback.
Localizing the Source Information As shown in Fig. 2, the source information is copied as both the address and the pointer at different token positions. To localize the source information, we conduct intervention experiments with a dataset where the counterfactual example \(\mathbf{c}\) swaps the order of the characters and objects and replaces the state tokens with entirely new ones, while keeping the question the same as in \(\mathbf{o}\).
With this dataset, an interchange intervention on the high-level causal model that targets the source information will have downstream effects on both the address and the pointer, so no change in output occurs. However, if we additionally freeze the payloads and addresses, the causal model outputs the other state token, e.g., beer in Fig. 5, due to the mismatch between address and pointer.
In the LM, we interchange the residual streams of the character and object tokens while keeping the residual stream of the state token fixed. When the output of the intervened LM aligns with that of the intervened causal model, it indicates that the QK-circuit at the final token is attending to the alternate state token. As shown in Fig. 5, the second experiment reveals alignment between layers 20 and 34. This suggests that source information—specifically, the character and object OIs—is represented in their respective token residual streams within this layer range.
We provide more experimental results in Appendix 15 where we show in Fig. 13 that freezing the residual stream of the state token is necessary. In sum, these results not only provide evidence for the presence of source information but also establish its transfer to the recalled and lookback tokens as addresses and pointers, respectively.
Localizing the Pointer Information The pointer copies of the character and object OIs are first formed at the character and object tokens in the question before being moved again to the final token for dereferencing (see Appendix 16 for experiments and more details).
The LM answers the question using the Answer Lookback. The state OI of the correct answer serves as the source information, which is copied into two instances. One instance, the address copy of the state OI, resides in the residual stream of the state token (recalled token), with the state token itself as the payload. The other instance, the pointer copy of the state OI, is transferred to the residual stream of the final token (lookback token) as the payload of the binding lookback. This pointer is then dereferenced, bringing the state token as the payload into the residual stream of the final token, which is predicted as the final output. See Fig. 2 (ii) for the answer lookback and Fig. 1 for the general mechanism.
Localizing the Pointer Information We first localize the pointer of the answer lookback, which is the payload of the binding lookback. To do this, we conduct an interchange intervention experiment where the residual vectors at the final token position in the original run are replaced with those from the counterfactual run, one layer at a time. The counterfactual inputs have swapped objects and characters and randomly sampled states. If the answer pointer is targeted for intervention in the high-level causal model, the output is the other state in the original input, e.g., beer. As shown in Fig. [fig:binding_pointer], alignment begins at layer \(34\), indicating that this layer contains the pointer information in a low-rank subspace, which remains causally relevant until layer \(52\).
Localizing the Payload To determine where the model uses the state OI pointer to retrieve the state token, we use the same interchange intervention experiment. However, if the answer payload is targeted for intervention in the high-level causal model, the output is the correct state token from the counterfactual example, e.g., tea, rather than the state token from the original example, as illustrated in Fig. [fig:binding_pointer]. The alignment occurs after layer \(56\), indicating that the model retrieves the correct state token (payload) into the final token’s residual stream by layer \(56\), where it is subsequently used to generate the final output.
In the previous section, we demonstrated how the LM uses ordering IDs and two lookback mechanisms to track the beliefs of characters that cannot observe each other. Now, we explore how the LM updates the beliefs of characters when provided with additional information that one character (the observing character) can observe the actions of another (the observed character). We hypothesize that the LM employs another lookback mechanism, which we refer to as the Visibility Lookback, to incorporate information about the observed character.
As illustrated in Fig. 6, we hypothesize that the LM first generates a Visibility ID in the residual stream of the visibility sentence, serving as the source information. The address copy of the visibility ID remains in the residual stream of the visibility sentence, while its pointer copy is transferred to the residual streams of the subsequent tokens, which are the lookback tokens. The LM then forms a QK-circuit at the lookback tokens and dereferences the visibility ID pointer to bring forward the payload.
Although we were unable to determine the exact semantics of the payload in this lookback, we speculate that it encodes the observed character’s OI. We propose the existence of another lookback, in which the story sentence associated with the observed character serves as the source and its payload encodes information about the observed character. That payload, encoding the observed character’s OI, is then retrieved by the lookback tokens of the Visibility lookback, which contributes to the queried character’s enhanced awareness (see Appendix 17 for more details).
Localizing the Source Information To localize the source information, we conduct an interchange intervention experiment where the counterfactual is a different story with altered visibility information. In the original example, the first character cannot observe the second character’s actions, whereas in the counterfactual example, the first character can observe them (Fig. 7). The causal model outcome of this intervention is a change in the final output of the original run from “unknown” to the state token associated with the queried object. The interchange intervention is executed on the visibility sentence tokens. As shown in Fig. 7, alignment occurs between layers \(10\) and \(23\), indicating that the visibility ID remains encoded in the visibility sentence until layer \(23\), after which it is duplicated into address and pointer copies on the visibility sentence and subsequent tokens, respectively.
Localizing the Payload To localize the payload information, we use the same counterfactual dataset. However, instead of intervening on the source or recalled tokens, we intervene on the lookback tokens, specifically the question and answer tokens. As in the previous experiment, we replace the residual vectors of these tokens in the original run with those from the counterfactual run. As shown in Fig. 7, alignment occurs only after layer \(31\), indicating that the information enhancing the queried character’s awareness is present in the lookback tokens only after this layer.
Localizing the Address and Pointer The previous two experiments suggest the presence of a lookback mechanism, as there is no signal indicating that the source or payload has been formed between layers 24 and 31. We hypothesize that this lack of signal is due to a mismatch between the address and pointer information at the recalled and lookback tokens. Specifically, intervening only on the recalled token after layer \(25\) does not update the pointer, whereas intervening only on the lookback tokens leaves the address unaltered, leading to the mismatch. To test this hypothesis, we conduct another intervention using the same counterfactual dataset, but this time we intervene on the residual vectors of both the recalled and lookback tokens, i.e., the visibility sentence as well as the question and answer tokens. As shown in Fig. 7, alignment occurs after layer \(10\) and remains stable, supporting our hypothesis. This intervention replaces both the address and pointer copies of the visibility IDs, enabling the LM to form a QK-circuit and retrieve the payload.
Theory of mind in LMs A large body of work has focused on benchmarking different aspects of ToM through various tasks that assess LMs’ performance, e.g., [35]–[42], among many others. In addition, various methods have been proposed to improve ToM ability in LMs through prompting [43]–[47].
Entity tracking in LMs Entity tracking and variable binding are crucial abilities for LMs to exhibit not only coherent ToM capabilities, but also neurosymbolic reasoning. Many existing works have attempted to decipher this ability in LMs [15], [29], [33], [48]–[51]. Our work builds on their empirical insights and extends the current understanding of how LMs bind various entities defined in context.
Mechanistic interpretability of theory of mind Only a few empirical studies have explored the underlying mechanisms of ToM in LMs [52]–[54]. These studies use probing techniques [55], [56] to identify internal representations of beliefs and steering techniques [57], [58] to improve LM performance by manipulating activations. However, the mechanism by which LMs solve such tasks remains a black box, limiting our ability to understand, predict, and control LMs’ behaviors.
Through a series of desiderata-based patching experiments, we have mapped the mechanisms underlying the processing of partial knowledge and false beliefs in a set of simple stories. We are surprised by the pervasive appearance of a single recurring computational pattern: the lookback, which resembles a pointer dereference inside a transformer. The LMs use a combination of several lookbacks to reason about nontrivial visibility and belief states. Our improved understanding of these fundamental computations gives us optimism that it may be possible to fully reveal the algorithms underlying not only Theory of Mind, but also other forms of reasoning in LMs.
This research was supported in part by Open Philanthropy (N.P., N.S., A.S.S., D.B., A.G., Y.B.), the NSF National Deep Inference Fabric award #2408455 (D.B.), the Israel Council for Higher Education (N.S.), the Zuckerman STEM Leadership Program (T.R.S.), the Israel Science Foundation (grant No. 448/20; Y.B.), an Azrieli Foundation Early Career Faculty Fellowship (Y.B.), a Google Academic Gift (Y.B.), and a Google Gemma Academic Award (D.B.). This research was partly funded by the European Union (ERC, Control-LM, 101165402). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
Instruction: 1. Track the belief of each character as described in the story. 2. A character’s belief is formed only when they perform an action themselves or can observe the action taking place. 3. A character does not have any beliefs about the container and its contents which they cannot observe. 4. To answer the question, predict only what is inside the queried container, strictly based on the belief of the character, mentioned in the question. 5. If the queried character has no belief about the container in question, then predict ‘unknown’. 6. Do not predict container or character as the final output.
Story: Bob and Carla are working in a busy restaurant. To complete an order, Bob grabs an opaque bottle and fills it with beer. Then Carla grabs another opaque cup and fills it with coffee.
Question: What does Bob believe the bottle contains?
Answer:
Instruction: 1. Track the belief of each character as described in the story. 2. A character’s belief is formed only when they perform an action themselves or can observe the action taking place. 3. A character does not have any beliefs about the container and its contents which they cannot observe. 4. To answer the question, predict only what is inside the queried container, strictly based on the belief of the character, mentioned in the question. 5. If the queried character has no belief about the container in question, then predict ‘unknown’. 6. Do not predict container or character as the final output.
Story: Bob and Carla are working in a busy restaurant. To complete an order, Bob grabs an opaque bottle and fills it with beer. Then Carla grabs another opaque cup and fills it with coffee. Bob can observe Carla’s actions. Carla cannot observe Bob’s actions.
Question: What does Bob believe the cup contains?
Answer:
In total, there are 4 templates (one without and 3 with explicit visibility statements). Each template allows 4 different types of questions (CharacterX asked about ObjectY). We used lists of 103 characters, 21 objects, and 23 states. In our interchange intervention experiments, we randomly sample 80 pairs of original and counterfactual stories.
In addition to the experiment shown in Fig. 8, we conduct similar experiments for the object and state tokens by replacing them in the story with random tokens, which alters the original example’s final output. However, patching the residual stream vectors of these tokens from the counterfactual run restores the relevant information, enabling the model to predict the causal model output. The results of these experiments are collectively presented in Fig. 3, with separate heatmaps shown in Figs. 9, 10, and 11.
Causal Models and Interventions A deterministic causal model \(\mathcal{M}\) has variables that take on values. Each variable has a mechanism that determines the value of the variable based on the values of parent variables. Variables without parents, denoted \(\mathbf{X}\), can be thought of as inputs that determine the setting of all other variables, denoted \(\mathcal{M}(\mathbf{x})\). A hard intervention \(A \leftarrow a\) overrides the mechanism of variable \(A\), fixing it to a constant value \(a\).
Interchange Interventions We perform interchange interventions [21], [22] where a variable (or set of features) \(A\) is fixed to the value it would take on if the LM were processing counterfactual input \(\mathbf{c}\). We write \(A \leftarrow \mathsf{Get}(\mathcal{M}(\mathbf{c}), A)\), where \(\mathsf{Get}(\mathcal{M}(\mathbf{c}), A)\) is the value of variable \(A\) when \(\mathcal{M}\) processes input \(\mathbf{c}\). In experiments, we feed an original input \(\mathbf{o}\) to a model under an interchange intervention: \(\mathcal{M}_{A \leftarrow \mathsf{Get}(\mathcal{M}(\mathbf{c}), A)}(\mathbf{o})\).
Featurizing Hidden Vectors The dimensions of hidden vectors are not an ideal unit of analysis [59], so it is typical to featurize a hidden vector using some invertible function, e.g., an orthogonal matrix, to project it into a new variable space with more interpretable dimensions called “features” [26]. A feature intervention \(\mathbf{F}_{\mathbf{h}} \leftarrow \mathbf{f}\) edits the mechanism of a hidden vector \(\mathbf{h}\) to fix the value of features \(\mathbf{F}_{\mathbf{h}}\) to \(\mathbf{f}\).
Alignment The LM is a low-level causal model \(\mathcal{L}\) where variables are dimensions of hidden vectors and the hypothesis about LM structure is a high-level causal model \(\mathcal{H}\). An alignment \(\Pi\) assigns each high-level variable \(A\) to features of a hidden vector \(\mathbf{F}^{A}_{\mathbf{h}}\), e.g., orthogonal directions in the activation space of \(\mathbf{h}\). To evaluate an alignment, we perform intervention experiments to evaluate whether high-level interventions on the variables in \(\mathcal{H}\) have the same effect as interventions on the aligned features in \(\mathcal{L}\).
Causal Abstraction We use interchange interventions to reveal whether the hypothesized causal model \(\mathcal{H}\) is an abstraction of an LM \(\mathcal{L}\). To simplify, assume both models share an input and output space. The high-level model \(\mathcal{H}\) is an abstraction of the low-level model \(\mathcal{L}\) under a given alignment when each high-level interchange intervention and the aligned low-level intervention result in the same output. For a high-level intervention on \(A\) aligned with low-level features \(\mathbf{F}^{A}_{\mathbf{h}}\), with a counterfactual input \(\mathbf{c}\) and original input \(\mathbf{o}\), we write \[\label{eq:abstraction} \mathsf{GetOutput}(\mathcal{L}_{\mathbf{F}_{\mathbf{h}}^{A} \leftarrow \mathsf{Get}(\mathcal{L}(\mathbf{c}), \mathbf{F}^A_\mathbf{h})}(\mathbf{o})) = \mathsf{GetOutput}(\mathcal{H}_{A \leftarrow \mathsf{Get}(\mathcal{H}(\mathbf{c}), A)}(\mathbf{o}))\tag{1}\] If the low-level interchange intervention on the LM produces the same output as the aligned high-level intervention on the algorithm, this is a piece of evidence in favor of the hypothesis. This extends naturally to multi-variable interventions [31].
Graded Faithfulness Metric We construct counterfactual datasets for each causal variable, where an example consists of a base prompt \(\mathbf{o}\) and a counterfactual prompt \(\mathbf{c}\). The counterfactual label is the expected output of the algorithm after the high-level interchange intervention, i.e., the right-hand side of Equation 1. The interchange intervention accuracy (IIA) is the proportion of examples for which Equation 1 holds, i.e., the degree to which \(\mathcal{H}\) faithfully abstracts \(\mathcal{L}\).
While interchange interventions on residual vectors reveal where a causal variable might be encoded in the LM’s internal activations, they do not localize the variable to specific subspaces. To address this, we apply the Desiderata-based Component Masking technique [29], [32], [33], which learns a sparse binary mask over the singular vectors of the LM’s internal activations. First, we cache the internal activations from \(500\) samples at the token positions where residual-level interchange interventions align with the expected output. Next, we apply Singular Value Decomposition to compute the singular vectors, which are then used to construct a projection matrix. Rather than replacing the entire residual vector with that from the counterfactual run, we perform subspace-level interchange interventions using the following equations:
\[\mathsf{W_{proj}} = V V^{\top}\] \[\mathsf{h_{original} \leftarrow W_{proj}\, h_{counterfactual} + (I - W_{proj})\, h_{original}}\]
Here, \(V\) is a matrix containing stacked singular vectors, while \(h_{counterfactual}\) and \(h_{original}\) represent the residual stream vectors from the counterfactual and original runs, respectively. The core idea is to first remove the existing information from the subspace defined by the projection matrix and then insert the counterfactual information into that same subspace using the same projection matrix. However, in DCM, instead of utilizing the entire internal activation space, we learn a binary mask over the matrix containing stacked singular vectors to identify the desired subspace. Specifically, before computing the projection matrix, we use the following equations to select the relevant singular vectors:
\[\mathsf{V \leftarrow V * mask}\]
We train the mask on \(80\) examples of the same counterfactual dataset and use another \(80\) as the validation set. We use the following objective function, which maximizes the logit of the causal model output token:
\[\mathcal{L} = -\mathsf{logit_{expected\_output}} + \lambda \sum \mathsf{W}\]
where \(\lambda\) is a hyperparameter that controls the rank of the subspace and \(\mathsf{W}\) denotes the parameters of the learnable mask. We train the mask for one epoch with the Adam optimizer, a batch size of \(4\), and a learning rate of \(0.01\).
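The sketch below puts the pieces above together: SVD of cached activations, a learnable mask over singular vectors, the subspace interchange of the two equations, and the masked-logit objective. The sigmoid relaxation, the thresholding step, the random stand-in activations, and the linear readout standing in for the LM’s patched forward pass are illustrative assumptions, not the released implementation.

```python
import torch

torch.manual_seed(0)
d_model, n_samples = 64, 500

# Stand-in for cached residual-stream activations at the relevant token
# position (the real experiments cache them from the LM).
acts = torch.randn(n_samples, d_model)

# Singular vectors of the cached activations define the candidate basis.
_, _, Vt = torch.linalg.svd(acts, full_matrices=False)
V = Vt.T                                   # columns are singular vectors

# Learnable mask over singular vectors (relaxed via a sigmoid during training).
mask_logits = torch.zeros(V.shape[1], requires_grad=True)

def subspace_patch(h_orig, h_cf):
    """Swap only the masked subspace: W_proj h_cf + (I - W_proj) h_orig."""
    m = torch.sigmoid(mask_logits)
    W = (V * m) @ V.T                      # projection onto the selected directions
    return h_cf @ W.T + h_orig @ (torch.eye(d_model) - W).T

# Toy stand-in for "run the LM on the patched activation and read off the
# logit of the causal-model answer token"; the real objective back-propagates
# through the LM itself.
readout = torch.randn(d_model)

opt = torch.optim.Adam([mask_logits], lr=0.01)
lam = 1e-2
for h_orig, h_cf in zip(acts[:80], acts[80:160]):        # 80 training pairs
    patched = subspace_patch(h_orig, h_cf)
    expected_logit = patched @ readout                   # placeholder logit
    loss = -expected_logit + lam * torch.sigmoid(mask_logits).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

binary_mask = (torch.sigmoid(mask_logits) > 0.5).float() # final sparse mask
print(int(binary_mask.sum()), "singular vectors selected")
```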
As mentioned in Section [sec:binding_lookback_source], the source information, consisting of the character and object OIs, is duplicated to form the address and pointer of the binding lookback. Here, we describe another experiment to verify that the source information is copied to both the address and the pointer. More specifically, we conduct the same interchange intervention experiment as described in Fig. 5, but without freezing the residual vectors at the state tokens. Based on our hypothesis, this intervention should not change the output state of the original run, since intervening on the source information affects both the address and the pointer, hence the model forms the original QK-circuit.
In Section [sec:binding_lookback_source], we identified the source of the information but did not fully determine the locations of each character and object OI. To address this, we now localize the character and object OIs separately to gain a clearer understanding of the layers at which they appear in the residual streams of their respective tokens, as shown in Fig. 14 and Fig. 15.
In Section 4.2, we localized the pointer information of the binding lookback. However, we found that this information is transferred to the lookback token (last token) through two intermediate tokens: the queried character and the queried object. In this section, we separately localize the OIs of the queried character and queried object, as shown in Fig. 16 and Fig. 17.
As mentioned in Section 5, the payload of the Visibility lookback remains undetermined. In this section, we attempt to disambiguate its semantics using the Attention Knockout technique introduced in [60], which helps reveal the flow of crucial information. We apply this technique to understand which previous tokens are vital for the formation of the payload information. Specifically, we “knock out” all attention heads at all layers of the second visibility sentence, preventing them from attending to one or more of the previous sentences. Then, we allow the attention heads to attend to the knocked-out sentence one layer at a time.
If the LM is fetching vital information from the knocked-out sentence, the interchange intervention accuracy (IIA) post-knockout will decrease. Therefore, a decrease in IIA will indicate which attention heads, at which layers, are bringing in the vital information from the knocked-out sentence. If, however, the model is not fetching any critical information from the knocked-out sentence, then knocking it out should not affect the IIA.
To determine whether any vital information influences the formation of the Visibility lookback payload, we perform three knockout experiments: 1) knock out attention from the second visibility sentence to both the first visibility sentence and the second story sentence (which contains information about the observed character), 2) knock out attention from the second visibility sentence to only the first visibility sentence, and 3) knock out attention from the second visibility sentence to only the second story sentence. In each experiment, we measure the effect of the knockout using IIA; a standalone sketch of the masking operation follows this list.
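The following sketch illustrates the masking operation behind attention knockout; it is not wired into Llama’s attention implementation, where the knockout is applied to all heads of the chosen layers, and the token-position ranges below are arbitrary stand-ins for the second visibility sentence and the sentence being knocked out.

```python
import torch
import torch.nn.functional as F

def knockout_attention(q, k, v, blocked_queries, blocked_keys):
    """Scaled dot-product attention with a causal mask, except that the given
    query positions are prevented from attending to the given key positions."""
    T, d = q.shape
    scores = q @ k.T / d**0.5
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    bq = torch.tensor(list(blocked_queries))
    bk = torch.tensor(list(blocked_keys))
    block = torch.zeros(T, T, dtype=torch.bool)
    block[bq.unsqueeze(1), bk] = True            # block every (query, key) pair
    scores = scores.masked_fill(block, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy setup: 10 positions; positions 6-9 stand in for the second visibility
# sentence, positions 2-5 for the sentence whose information is knocked out.
torch.manual_seed(0)
T, d = 10, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = knockout_attention(q, k, v, blocked_queries=range(6, 10), blocked_keys=range(2, 6))
print(out.shape)   # torch.Size([10, 8])
```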
Fig. 18 shows the experimental results. Knocking out any of the previous sentences affects the model’s ability to produce the correct output. The decrease in IIA in the early layers can be explained by the restriction on the movement of character OIs. Specifically, the second visibility sentence mentions the first and second characters, whose character OIs must be fetched before the model can perform any further operations. Therefore, we believe the decrease in IIA until layer 15, when the character OIs are formed (based on the results from Section 15), can be attributed to the model being restricted from fetching the character OIs. However, the persistently low IIA even after this layer, especially when both the second and first visibility sentences are involved, indicates that some vital information is being fetched by the second visibility sentence, which is essential for forming a coherent Visibility lookback payload. Thus, we speculate that the Visibility payload encodes information about the observed character, specifically their character OI, which is later used to fetch the correct state OI.
This section identifies the attention heads that align with the causal subspaces discovered in the previous sections. Specifically, first we focus on attention heads whose query projections are aligned with the subspaces—characterized by the relevant singular vectors—that contain the correct answer state OI. To quantify this alignment between attention heads and causal subspaces, we use the following computation.
Let \(Q \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}\) denote the query projection weight matrix for a given layer. We normalize \(Q\) column-wise:
\[\tilde{Q}_{:, j} = \frac{Q_{:, j}}{\|Q_{:, j}\|} \quad \text{for each column } j\]
Let \(S \in \mathbb{R}^{d_{\text{model}} \times k}\) represent the matrix of \(k\) singular vectors (i.e., the causal subspace basis). We project the normalized query weights onto this subspace:
\[Q_{\text{sv}} = \tilde{Q} \cdot S\]
We then reshape the resulting projection into per-head components. Assuming \(Q_{\text{sv}} \in \mathbb{R}^{d_{\text{model}} \times k}\), and each attention head has dimensionality \(d_h\), we write:
\[Q_{\text{head}}^{(i)} = Q_{\text{sv}}^{(i)} \in \mathbb{R}^{d_h \times k} \quad \text{for } i = 1, \dots, n_{\text{heads}}\]
Finally, we compute the norm of each attention head’s projection:
\[\text{head\_norm}_i = \left\| Q_{\text{head}}^{(i)} \right\|_{F} \quad \text{for } i = 1, \dots, n_{\text{heads}}\]
We compute \(\text{head\_norm}_i\) for each attention head in every layer, which quantifies how strongly a given head reads from the causal subspace present in the residual stream. The results are presented in Fig. 19 and align with our previous findings: attention heads in the later layers form the QK-circuit by using pointer and address information to retrieve the payload during the Answer lookback.
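A direct transcription of the computation above, using random stand-ins for the query projection matrix and the causal-subspace basis, and toy dimensions in place of the model’s actual sizes:

```python
import torch

torch.manual_seed(0)
d_model, n_heads, k = 512, 8, 10           # toy sizes; the real model is larger
d_h = d_model // n_heads

Q = torch.randn(d_model, d_model)          # stand-in for a layer's query projection
S = torch.randn(d_model, k)                # stand-in causal-subspace basis

Q_tilde = Q / Q.norm(dim=0, keepdim=True)  # column-wise normalization
Q_sv = Q_tilde @ S                         # project onto the subspace: (d_model, k)
Q_heads = Q_sv.reshape(n_heads, d_h, k)    # split the rows into per-head blocks

head_norm = Q_heads.flatten(start_dim=1).norm(dim=1)   # Frobenius norm per head
print(head_norm.shape)                     # torch.Size([8])
```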
We perform a similar analysis to check which attention heads’ value projection matrices align with the causal subspace that encodes the payload of the Answer lookback. Results are shown in Fig. 20, indicating that attention heads at later layers primarily align with the causal subspace containing the answer token.
This section presents preliminary evidence that the mechanisms outlined in Sections 4 and 5 generalize to other benchmark datasets. Specifically, we demonstrate that Llama-3-70B-Instruct answers the belief questions (true belief and false belief) in the BigToM dataset [34] in a manner similar to that observed for CausalToM: by first converting token values to their corresponding OIs and then performing logical operations on them using lookbacks. However, as noted in Section 3, BigToM—like other benchmarks—lacks the coherent structure necessary for causal analysis. As a result, we were unable to replicate all experiments conducted on CausalToM. Thus, the results reported here provide only preliminary evidence of a similar underlying mechanism.
To justify the presence of OIs, we conduct an interchange intervention experiment, similar to the one described in Section 16, aiming to localize the character OI at the character token in the question sentence. We construct an original sample by replacing its question sentence with that of a counterfactual sample, selected directly from the unaltered BigToM dataset. Consequently, when processing the original sample, the model has no information about the queried character and, as a result, produces unknown as the final output. However, if we replace the residual vector at the queried character token in the original sample with the corresponding vector from the counterfactual sample (which contains the character OI), the model’s output changes from unknown to the state token(s) associated with the queried object. This is because inserting the character OI at the queried token provides the correct pointer information, aligning with the address information at the correct state token(s), thereby enabling the model to form the appropriate QK-circuit and retrieve the state’s OI. As shown in Fig. 21, we observe high IIA between layers \(9\)–\(28\)—similar to the pattern seen in CausalToM—suggesting that the queried character token encodes the character OI in its residual vector within these layers.
Next, we investigate the Answer lookback mechanism in BigToM, focusing specifically on localizing the pointer and payload information at the final token position. To localize the pointer information, which encodes the correct state OI, we construct original and counterfactual samples by selecting two completely different examples from the BigToM dataset, such that the correct answer corresponds to a different-ordered state in each. For example, as illustrated in Fig. 22, the counterfactual sample designates the first state as the answer, thrilling plot, whereas the original sample designates the second state, almond milk. We perform an intervention by swapping the residual vector at the last token position from the counterfactual sample into the original run. The causal model outcome of this intervention is that the model outputs the alternative state token from the original sample, oat milk. As shown in Fig. 22, this alignment occurs between layers 33 and 51, similar to the layer range observed for the pointer information in the Answer lookback of CausalToM.
Further, to localize the payload of the Answer lookback in BigToM, we perform an interchange intervention experiment using the same original and counterfactual samples as mentioned in the previous experiment, but with a different expected output—namely, the correct state from the counterfactual sample instead of the other state from the original sample. As shown in Fig. 23, alignment emerges after layer 59, consistent with the layer range observed for the Answer lookback payload in CausalToM.
Finally, we investigate the impact of the visibility condition on the underlying mechanism and find that, similar to CausalToM, the model uses the Visibility lookback to enhance the observing character’s awareness based on the observed character’s actions. To localize the effect of the visibility condition, we perform an interchange intervention in which the original and counterfactual samples differ in belief type: if the original sample involves a false belief, the counterfactual involves a true belief, and vice versa. The expected output of this experiment is the other (incorrect) state of the original sample. Following the methodology in Section 5, we conduct three types of interventions: (1) only at the visibility condition sentence, (2) only at the subsequent question sentence, and (3) at both the visibility condition and the question sentence. As shown in Fig. 24, intervening only at the visibility sentence results in alignment at early layers, up to layer \(17\), while intervening only at the subsequent question sentence leads to alignment after layer \(26\). Intervening on both the visibility and question sentences results in alignment across all layers. These results align with those found in the CausalToM setting shown in Fig. 7.
Previous experiments suggest that the underlying mechanisms responsible for answering belief questions in BigToM are similar to those in CausalToM. However, we observed that the subspaces encoding various types of information are not shared between the two settings. For example, although the pointer information in the Answer lookback encodes the correct state’s OI in both cases, the specific subspaces that represent this information at the final token position differ significantly. We leave a deeper investigation of this phenomenon—shared semantics across distinct subspaces in different distributions—for future work.
This section presents all the interchange intervention experiments described in the main text, conducted using the same set of counterfactual examples on Llama-3.1-405B-Instruct, using NDIF [19]. Each experiment was performed on 80 samples. Due to computational constraints, subspace interchange intervention experiments were not conducted. The results indicate that Llama-3.1-405B-Instruct employs the same underlying mechanism as Llama-3-70B-Instruct to reason about belief and answer related questions. This suggests that the identified belief-tracking mechanism generalizes to other models capable of reliably performing the task.
Correspondence to prakash.nik@northeastern.edu.
Although this mechanism may resemble induction heads [16], [17], they differ fundamentally. In induction heads, information from a previous token occurrence is passed only to the token that follows it, without being duplicated at its next occurrence. In contrast, the lookback mechanism copies the same information not only to the location where the vital information resides but also to the target location that needs to retrieve it.