May 20, 2025
How do language models (LMs) represent characters’ beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze Llama-3-70B-Instruct’s ability to reason about characters’ beliefs using causal mediation and abstraction. We construct a dataset that consists of simple stories where two characters each separately change the state of two objects, potentially unaware of each other’s actions. Our investigation uncovered a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating reference information about them, represented as their Ordering IDs (OIs), in low-rank subspaces of the state token’s residual stream. When asked about a character’s beliefs regarding the state of an object, the binding lookback retrieves the corresponding state OI, and then an answer lookback retrieves the state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character’s beliefs. Our work provides insights into the LM’s belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
The ability to infer the mental states of others—known as Theory of Mind (ToM)—is an essential aspect of social and collective intelligence [1], [2]. Recent studies have established that language models (LMs) can solve some tasks requiring ToM reasoning [3]–[5], while others have highlighted shortcomings [6]–[8]. Nonetheless, most existing work relies on behavioral evaluations, which do not shed light on the internal mechanisms by which LMs encode and manipulate representations of mental states to solve (or fail to solve) such tasks [9], [10].
In this work, we investigate how LMs represent and update characters’ beliefs, which is a fundamental element of ToM [11], [12]. For instance, the Sally-Anne test [13], a canonical test of ToM in humans, evaluates this ability by asking individuals to track Sally’s belief, which diverges from reality due to missing information, and Anne’s belief, which updates based on new observations.
We construct CausalToM, a dataset of simple stories involving two characters, each interacting with an object to change its state, with the possibility of observing one another. We then analyze the internal mechanisms that enable Llama-3-70B-Instruct [14] to reason about and answer questions regarding the characters’ beliefs about the state of each object (for a sample story, see Section 3; for the full prompt, refer to Appendix 9).
We discover a pervasive computation that performs multiple subtasks, which we refer to as the lookback mechanism. This mechanism enables the model to recall important information only when it becomes necessary. In a lookback, two copies of a single piece of information are transferred to two distinct tokens. This allows attention heads at the latter token to look back at the earlier one when needed and retrieve vital information stored there, rather than transferring that information directly (see Fig. 1).
We identify three key lookback mechanisms that collectively perform belief tracking: 1) Binding lookback (Fig. 2 (i)): First, the LM assigns ordering IDs (OIs) [15] that encode whether a character, object, or state token appears first or second. Then, the character and object OIs are copied to low-rank subspaces of the corresponding state token and the final token residual stream. Later, when the LM needs to answer a question about a character’s beliefs, it uses this information to retrieve the answer state OI. 2) Answer lookback (Fig. 2 (ii)): Uses the answer state OI from the binding lookback to retrieve the answer state token value. 3) Visibility lookback (Fig. 6): When an explicit visibility condition between characters is mentioned, the model employs additional reference information, called the visibility ID, to retrieve information about the observed character, augmenting the observing character’s awareness.
Overall, this work not only advances our understanding of the internal computations in LMs that enable ToM capability but also uncovers a pervasive mechanism that serves as the foundation for executing complex logical reasoning with conditionals. All code and data supporting this study are available at https://belief.baulab.info.
Our investigation of belief tracking uncovers a recurring pattern of computation that we call the lookback mechanism.² Here we give a brief overview of this mechanism; subsequent sections provide detailed experiments and analyses. In a lookback, source information is copied (via attention) into an address copy in the residual stream of a recalled token and a pointer copy in the residual stream of a lookback token that occurs later in the text. The LM places the address alongside a payload in the recalled token’s residual stream, which can be brought forward to the lookback token if necessary. Fig. 1 schematically describes a generic lookback.
That is, the LM can use attention to dereference the pointer and retrieve the payload present in the residual stream of the recalled token (which might contain aggregated information from previous tokens), bringing it to the residual stream of the lookback token. Specifically, the pointer at the lookback token forms an attention query vector, while the address at the recalled token forms a key vector. Because the pointer and the address are copies of the same source information, they have a high dot product, so a QK-circuit [16] is established, forming a bridge from the lookback token to the recalled token. The LM uses this bridge to move the payload, which contains the information needed to complete the subtask, through the OV-circuit.
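To make the QK/OV description above concrete, here is a toy numerical sketch of a lookback, with identity query/key/value projections and randomly drawn vectors standing in for the source, address, pointer, and payload representations; nothing here is taken from the model’s actual weights.

```python
import torch

torch.manual_seed(0)
d = 16  # toy residual-stream width

# A single piece of source information (e.g., an ordering ID) is copied twice.
source = torch.randn(d)
address = source.clone()   # stays in the recalled token's residual stream
pointer = source.clone()   # lands in the lookback token's residual stream

# The payload sits alongside the address in the recalled token's residual stream.
payload = torch.randn(d)
recalled = address + payload
distractor = torch.randn(d)   # some unrelated token's residual stream
lookback = pointer

# Toy QK-circuit (identity projections): the pointer acts as the query and the
# address acts as the key, so the score on the recalled token dominates.
keys = torch.stack([recalled, distractor])
attn = torch.softmax(keys @ lookback / d**0.5, dim=0)

# Toy OV-circuit: the attended residual stream (address + payload) is brought
# to the lookback token, carrying the payload with it.
retrieved = attn @ keys
print(attn)                                                # weight ≈ 1 on the recalled token
print(torch.cosine_similarity(retrieved, recalled, dim=0)) # ≈ 1: the payload arrives with the address
```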
To develop an intuition for why an LM would learn to implement lookback mechanisms for reasoning tasks such as our belief tracking task, note that during training LMs process text in sequence with no foreknowledge of what might come next. It is therefore useful to mark addresses alongside payloads that might be needed for downstream tasks. In our setting, the LM constructs a representation of the story without knowing what questions it may be asked, so it concentrates pieces of information in the residual streams of certain tokens, which later become payloads and addresses. When the question text is reached, pointers are constructed that reference this crucial story information and dereference it as the answer to the question.
Dataset Existing datasets for evaluating ToM capabilities of LMs are designed for behavioral testing and lack the counterfactual pairs needed for causal analysis [18]. To address this, we constructed CausalToM, a structured dataset of simple stories, where each story involves two characters, each interacting with a distinct object and causing that object to take a unique state. For example: “Character1 and Character2 are working in a busy restaurant. To complete an order, Character1 grabs an opaque Object1 and fills it with State1. Then Character2 grabs another opaque Object2 and fills it with State2.” We then ask the LM to reason about one of the characters’ beliefs regarding the state of an object: “What does Character1 believe Object2 contains?” We analyze the LM’s ability to track characters’ beliefs in two distinct settings: (1) No Visibility, where both characters are unaware of each other’s actions, and (2) Explicit Visibility, where explicit information about whether a character can or cannot observe the other’s actions is provided, e.g., “Bob can observe Carla’s actions. Carla cannot observe Bob’s actions.” We also provide general task instructions (e.g., answer unknown when a character is unaware); refer to Appendix 9 & 10 for the full prompt and additional dataset details.

Our experiments analyze the Llama-3-70B-Instruct model in half-precision, using NNsight [19]. The model demonstrates high behavioral performance in both the no-visibility and explicit-visibility settings, achieving accuracies of 95.7% and 99%, respectively. For all subsequent experiments, we filter out samples that the model fails to answer correctly.
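For concreteness, the sketch below fills a simplified story template and prints the expected answers in the two settings; the entity lists, template strings, and variable names are illustrative stand-ins rather than the released CausalToM generation code.

```python
import random

# Illustrative stand-ins for the full entity lists (103 characters,
# 21 objects, and 23 states in the released dataset).
CHARACTERS = ["Bob", "Carla", "Maya", "Omar"]
OBJECTS = ["bottle", "cup", "jar", "bowl"]
STATES = ["beer", "coffee", "tea", "milk"]

STORY = ("{c1} and {c2} are working in a busy restaurant. "
         "To complete an order, {c1} grabs an opaque {o1} and fills it with {s1}. "
         "Then {c2} grabs another opaque {o2} and fills it with {s2}.")
VISIBILITY = "{a} can observe {b}'s actions. {b} cannot observe {a}'s actions."
QUESTION = "What does {c} believe the {o} contains?"

rng = random.Random(0)
c1, c2 = rng.sample(CHARACTERS, 2)
o1, o2 = rng.sample(OBJECTS, 2)
s1, s2 = rng.sample(STATES, 2)

story = STORY.format(c1=c1, c2=c2, o1=o1, o2=o2, s1=s1, s2=s2)

# No-visibility setting: c1 has no belief about o2, so the answer is "unknown".
print(story)
print(QUESTION.format(c=c1, o=o2), "->", "unknown")

# Explicit-visibility setting: c1 observes c2, so c1 believes o2 contains s2.
print(story + " " + VISIBILITY.format(a=c1, b=c2))
print(QUESTION.format(c=c1, o=o2), "->", s2)
```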
Causal Mediation Analysis
Our goal is to develop a mechanistic understanding of how Llama-3-70B-Instruct reasons about characters’ beliefs and answers related questions [20]. A key method for conducting causal analysis is interchange interventions [21]–[23], in which the LM is run on paired examples: an original input \(\mathbf{o}\) and a counterfactual input \(\mathbf{c}\). Certain internal activations in the run on the original are replaced with those computed from the counterfactual run.
Drawing inspiration from existing literature [21], [24], [25], we begin our analysis by performing interchange interventions with counterfactuals that are identical to the original except for key input tokens. We trace the causal path from these key tokens to the final output. This is a type of Causal Mediation Analysis [26]. Specifically, we construct a counterfactual dataset where \(\mathbf{o}\) contains a question about the belief of a character not mentioned in the story, while \(\mathbf{c}\) is identical except that the story includes the queried character. The expected outcome of this intervention is a change in the final output of \(\mathbf{o}\) from unknown to a state token, such as beer. We conduct similar interchange interventions for object and state tokens (refer to Appendix 11 for more details).
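The following sketch shows the shape of such a residual-stream interchange intervention using plain PyTorch forward hooks on a HuggingFace causal LM; the paper’s experiments use NNsight [19] on Llama-3-70B-Instruct, and the smaller checkpoint, layer index, token position, and shortened prompts below are placeholder assumptions chosen only to keep the example light.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper analyzes Llama-3-70B-Instruct with NNsight.
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto")
model.eval()

LAYER, POS = 20, -1   # layer and token position to patch (illustrative choices)

# Simplified CausalToM-style prompts: the original o asks about a character
# absent from the story (answer: unknown); the counterfactual c includes her.
original_prompt = ("Story: Bob grabs an opaque bottle and fills it with beer. "
                   "Question: What does Carla believe the bottle contains? Answer:")
counterfactual_prompt = ("Story: Carla grabs an opaque bottle and fills it with beer. "
                         "Question: What does Carla believe the bottle contains? Answer:")

def residual_at(prompt):
    """Cache the residual stream at (LAYER, POS) for a prompt."""
    cache = {}
    def hook(_, __, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cache["h"] = hidden[:, POS, :].detach().clone()
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt").to(model.device))
    handle.remove()
    return cache["h"]

def patched_next_token(prompt, resid):
    """Run a prompt while overwriting the residual stream at (LAYER, POS)."""
    def hook(_, __, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, POS, :] = resid.to(hidden.dtype)   # in-place interchange
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt").to(model.device))
    handle.remove()
    return tok.decode(out.logits[0, -1].argmax().item())

# Interchange intervention: patch the counterfactual activation into the original run.
print(patched_next_token(original_prompt, residual_at(counterfactual_prompt)))
```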
Figure 3 presents the aggregated results of this experiment for the key input tokens \(\boldsymbol{\textcolor{nicegreen}{Character1}}\), \(\boldsymbol{\textcolor{niceyellow}{Object1}}\), and \(\boldsymbol{\textcolor{nicepurple}{State1}}\). The cells are color-coded to indicate the interchange intervention accuracy (IIA) [27]. Even at this coarse level of analysis, several significant insights emerge: 1) Information from the correct state token (beer) flows directly from its residual stream to that of the final token in later layers, consistent with prior findings [28], [29]; 2) Information associated with the queried character and the queried object is retrieved from their earlier occurrences and passed to the final token before being replaced by the correct state token.
Desiderata-Based Patching via Causal Abstraction
The causal mediation experiments provide a coarse-grained analysis of how information flows from an input token to the output, but they do not identify what that information is. In a transformer, the input to the first layer contains the input tokens and the output of the final layer determines the output token; but what is the information content of the representations along the active causal path between input and output?
To answer this question, we turn to Causal Abstraction [30], [31]. We align the variables of a high-level causal model with the LM’s internal activations and verify the alignment by conducting targeted interchange interventions for each variable. Specifically, we perform aligned interchange interventions at both levels: interventions that target high-level causal variables and interventions that modify low-level features of the LM’s hidden activations. If the LM produces the same output as the high-level causal model under these aligned interventions, it provides evidence supporting the hypothesized causal model. The effect of these interventions is quantified using interchange intervention accuracy (IIA), which measures the proportion of instances where the intervened high-level causal model and the intervened low-level LM produce the same output (refer to Appendix 12 for more details about the causal abstraction framework and Appendix 13 for the belief tracking causal model).
In addition to performing interchange interventions on entire residual stream vectors in the LM, we also intervene on specific subspaces to further localize causal variables. To identify the subspace encoding a particular variable, we employ the Desiderata-based Component Masking (DCM) [29], [32], [33] technique, which learns a sparse binary mask over the internal activation space by maximizing the logit of the causal model’s output token. Specifically, we train a mask to select the singular vectors of the activation space that encode a high-level variable (see Appendix 14 for details).
The LM solves the no-visibility setting of the belief tracking task using three key mechanisms: Ordering ID assignment, binding lookback, and answer lookback. Figure 2 illustrates the hypothesized high-level causal model implemented by the LM, which we evaluate in the following subsections. The LM first assigns ordering IDs (OIs) to each character, object, and state in the story that encode their order of appearance (e.g., the second character Carla is assigned the second character OI). These OIs are used in two lookback mechanisms. (i) Binding lookback: Address copies of each character OI and object OI are placed alongside their corresponding state OI payload in the residual stream of each state token, binding together each character-object-state triple. When the model is asked about the belief of a specific character about a specific object, it moves pointer copies of the corresponding OIs to the final token’s residual stream. These pointers are dereferenced, bringing the correct state OI into the final token residual stream. (ii) Answer lookback: An address copy of the state OI is placed alongside the state token payload in the residual stream of the correct state token, while a pointer copy is moved to the final token residual stream via the binding lookback. The pointer is dereferenced, bringing the answer state token payload into the final token residual stream, which is predicted as the final output.
Refer to Appendix 13 for pseudocode defining the causal model for the belief tracking task. In Appendices 20 and 19, we show that parts of our analysis generalize to the Llama-3.1-405B-Instruct model and the BigToM dataset [34].
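As a reading aid, the following minimal Python rendering captures the input–output behavior of the high-level causal model just described, for both the no-visibility and explicit-visibility settings; the authoritative pseudocode is in Appendix 13, and all names below (Story, belief_answer, visible) are our own.

```python
from dataclasses import dataclass

@dataclass
class Story:
    characters: tuple   # (character1, character2), in order of appearance
    objects: tuple      # (object1, object2), object i is handled by character i
    states: tuple       # (state1, state2), state i fills object i
    visible: dict       # e.g., {(0, 1): True} means character 0 observes character 1

def belief_answer(story: Story, queried_char: str, queried_obj: str) -> str:
    # Ordering IDs: position of the queried character and object, if present.
    char_oi = story.characters.index(queried_char) if queried_char in story.characters else None
    obj_oi = story.objects.index(queried_obj) if queried_obj in story.objects else None
    if char_oi is None or obj_oi is None:
        return "unknown"
    # Binding lookback: the state OI bound to the queried object's OI.
    state_oi = obj_oi
    # The character knows the state if they caused it or can observe its causer.
    if char_oi == state_oi or story.visible.get((char_oi, state_oi), False):
        # Answer lookback: dereference the state OI to the state token.
        return story.states[state_oi]
    return "unknown"

story = Story(("Bob", "Carla"), ("bottle", "cup"), ("beer", "coffee"),
              visible={(0, 1): True})          # Bob can observe Carla
print(belief_answer(story, "Bob", "cup"))      # coffee
print(belief_answer(story, "Carla", "bottle")) # unknown
```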
The LM assigns an Ordering ID (OI; [15]) to the character, object, and state tokens. These OIs, encoded in a low-rank subspace of the internal activations, serve as a reference that indicates whether an entity is the first or second of its type, independent of its token value. For example, in Fig. 2, Bob is assigned the first character OI, while Carla receives the second. In the subsequent subsections and Appendices 15 & 16, we validate the presence of OIs through multiple experiments, where intervening on tokens with identical token values but different OIs alters the model’s internal computation, leading to systematic changes in the final output predicted by our high-level causal model. The LM then uses these OIs as building blocks, feeding them into lookback mechanisms to track and retrieve beliefs.
The Binding Lookback is the first operation applied to these OIs. The character and object OIs, serving as the source information, are each copied twice. One copy, referred to as the address, is placed in the residual stream of the state token (recalled token), alongside the state OI as the payload to transfer. The other copy, referred to as the pointer, is moved to the residual stream of the final token (lookback token). These pointer and address copies are then used to form the QK-circuit at the lookback token, which dereferences the state OI payload, transferring it from the state token to the final token. See Fig. 2 (i) for a schematic of this lookback and Fig. 1 for the general mechanism.
Localizing the Address and Payload In our first experiment, we localize the address copies of the character and object OIs and the state OI payload to the residual stream of the state token (recalled token, Fig. 2). We sampled a counterfactual dataset where each example consists of an original input \(\mathbf{o}\) with an answer that is not unknown and a counterfactual input \(\mathbf{c}\) in which the character, object, and state tokens are identical, except that the order of the two story sentences is swapped while the question remains unchanged, as illustrated in Fig. 4. The expected outcome predicted by our high-level causal model under this intervention is the other state token from the original example, e.g., beer, because reversing the address and payload values without changing the pointer flips the output. In the LM, the QK-circuit, formed using the pointer at the lookback token, attends to the other state token and retrieves its state OI as the payload.
We perform an interchange intervention experiment layer by layer, replacing the residual stream vectors at the first state token in the original run with those of the second state token in the counterfactual run, and vice versa for the other state token. It is important to note that if the intervention targets state token values instead of their OIs, it should not produce the expected output. (This happens in the earlier layers.)
As shown in Fig. 4, the strongest alignment occurs between layers \(33\) and \(38\), supporting our hypothesis that the state token’s residual stream contains both the address information (character and object OIs) and the payload information (state OI). These components are subsequently used to form the appropriate QK and OV circuits of binding lookback.
Localizing the Source Information As shown in Fig. 2, the source information is copied as both the address and the pointer at different token positions. To localize the source information, we conduct intervention experiments with a dataset where the counterfactual example \(\mathbf{c}\) swaps the order of the characters and objects and replaces the state tokens with entirely new ones, while keeping the question the same as in \(\mathbf{o}\).
With this dataset, an interchange intervention on the high-level causal model that targets the source information will have downstream effects on both the address and the pointer, so no change in output occurs. However, if we additionally freeze the payloads and addresses, the causal model outputs the other state token, e.g., beer in Fig. 5, due to the mismatch between address and pointer.
In the LM, we interchange the residual streams of the character and object tokens while keeping the residual stream of the state token fixed. When the output of the intervened LM aligns with that of the intervened causal model, it indicates that the QK-circuit at the final token is attending to the alternate state token. As shown in Fig. 5, the second experiment reveals alignment between layers 20 and 34. This suggests that source information—specifically, the character and object OIs—is represented in their respective token residual streams within this layer range.
We provide more experimental results in Appendix 15 where we show in Fig. 13 that freezing the residual stream of the state token is necessary. In sum, these results not only provide evidence for the presence of source information but also establish its transfer to the recalled and lookback tokens as addresses and pointers, respectively.
Localizing the Pointer Information The pointer copies of the character and object OIs are first formed at the character and object tokens in the question before being moved again to the final token for dereferencing (see Appendix 16 for experiments and more details).
The LM answers the question using the Answer Lookback. The state OI of the correct answer serves as the source information, which is copied into two instances. One instance, the address copy of the state OI, resides in the residual stream of the state token (recalled token), with the state token itself as the payload. The other instance, the pointer copy of the state OI, is transferred to the residual stream of the final token (lookback token) as the payload of the binding lookback. This pointer is then dereferenced, bringing the state token as the payload into the residual stream of the final token, which is predicted as the final output. See Fig. 2 (ii) for the answer lookback and Fig. 1 for the general mechanism.
Localizing the Pointer Information We first localize the pointer of the answer lookback, which is the payload of the binding lookback. To do this, we conduct an interchange intervention experiment where the residual vectors at the final token position in the original run are replaced with those from the counterfactual run, one layer at a time. The counterfactual inputs have swapped objects and characters and randomly sampled states. If the answer pointer is targeted for intervention in the high-level causal model, the output is the other state in the original input, e.g., beer. As shown in Fig. [fig:binding_pointer], alignment begins at layer \(34\), indicating that this layer contains the pointer information in a low-rank subspace, which remains causally relevant until layer \(52\).
Localizing the Payload To determine where the model uses the state OI pointer to retrieve the state token, we use the same interchange intervention experiment. However, if the answer payload is targeted for intervention in the high-level causal model, the output is the correct state token from the counterfactual example, e.g., tea, rather than the state token from the original example, as illustrated in Fig. [fig:binding_pointer]. The alignment occurs after layer \(56\), indicating that the model retrieves the correct state token (payload) into the final token’s residual stream by layer \(56\), where it is subsequently used to generate the final output.
In the previous section, we demonstrated how the LM uses ordering IDs and two lookback mechanisms to track the beliefs of characters that cannot observe each other. Now, we explore how the LM updates the beliefs of characters when provided with additional information that one character (the observing character) can observe the actions of another (the observed character). We hypothesize that the LM employs another lookback mechanism, which we refer to as the Visibility Lookback, to incorporate information about the observed character.
As illustrated in Fig. 6, we hypothesize that the LM first generates a Visibility ID in the residual stream of the visibility sentence, serving as the source information. The address copy of the visibility ID remains in the residual stream of the visibility sentence, while its pointer copy is transferred to the residual streams of the subsequent tokens, which are the lookback tokens. The LM then forms a QK-circuit at the lookback tokens and dereferences the visibility ID pointer to bring forward the payload.
Although we were unable to determine the exact semantics of the payload in this lookback, we speculate that it encodes the observed character’s OI. We propose the existence of another lookback, in which the story sentence associated with the observed character serves as the source and its payload encodes information about the observed character. That payload, encoding the observed character’s OI, is then retrieved by the lookback tokens of the Visibility lookback, which contributes to the queried character’s enhanced awareness (see Appendix 17 for more details).
Localizing the Source Information To localize the source information, we conduct an interchange intervention experiment where the counterfactual is a different story with altered visibility information. In the original example, the first character cannot observe the second character’s actions, whereas in the counterfactual example, the first character can observe them (Fig. 7). The causal model outcome of this intervention is a change in the final output of the original run from “unknown” to the state token associated with the queried object. The interchange intervention is executed on the visibility sentence tokens. As shown in Fig. 7, alignment occurs between layers \(10\) and \(23\), indicating that the visibility ID remains encoded in the visibility sentence until layer \(23\), after which it is duplicated into address and pointer copies on the visibility sentence and subsequent tokens, respectively.
Localizing the Payload To localize the payload information, we use the same counterfactual dataset. However, instead of intervening on the source or recalled tokens, we intervene on the lookback tokens, specifically the question and answer tokens. As in the previous experiment, we replace the residual vectors of these tokens in the original run with those from the counterfactual run. As shown in Fig. 7, alignment occurs only after layer \(31\), indicating that the information enhancing the queried character’s awareness is present in the lookback tokens only after this layer.
Localizing the Address and Pointer The previous two experiments suggest the presence of a lookback mechanism, as there is no signal indicating that the source or payload has been formed between layers 24 and 31. We hypothesize that this lack of signal is due to a mismatch between the address and pointer information at the recalled and lookback tokens. Specifically, intervening only on the recalled token after layer \(25\) does not update the pointer, whereas intervening only on the lookback tokens leaves the address unaltered, leading to the mismatch. To test this hypothesis, we conduct another intervention using the same counterfactual dataset, but this time we intervene on the residual vectors of both the recalled and lookback tokens, i.e., the visibility sentence as well as the question and answer tokens. As shown in Fig. 7, alignment occurs after layer \(10\) and remains stable, supporting our hypothesis. This intervention replaces both the address and pointer copies of the visibility IDs, enabling the LM to form a QK-circuit and retrieve the payload.
Theory of mind in LMs A large body of work has focused on benchmarking different aspects of ToM through various tasks that assess LMs’ performance, e.g., [35]–[42], among many others. In addition, various methods have been proposed to improve ToM ability in LMs through prompting [43]–[47].
Entity tracking in LMs Entity tracking and variable binding are crucial abilities for LMs to exhibit not only coherent ToM capabilities, but also neurosymbolic reasoning. Many existing works have attempted to decipher this ability in LMs [15], [29], [33], [48]–[51]. Our work builds on their empirical insights and extends the current understanding of how LMs bind various entities defined in context.
Mechanistic interpretability of theory of mind Only a few empirical studies have explored the underlying mechanisms of ToM in LMs [52]–[54]. These studies use probing techniques [55], [56] to identify internal representations of beliefs and steering techniques [57], [58] to improve LM performance by manipulating activations. However, the mechanism by which LMs solve such tasks remains a black box, limiting our ability to understand, predict, and control LMs’ behaviors.
Through a series of desiderata-based patching experiments, we have mapped the mechanisms underlying the processing of partial knowledge and false beliefs in a set of simple stories. We are surprised by the pervasive appearance of a single recurring computational pattern: the lookback, which resembles a pointer dereference inside a transformer. The LMs use a combination of several lookbacks to reason about nontrivial visibility and belief states. Our improved understanding of these fundamental computations gives us optimism that it may be possible to fully reveal the algorithms underlying not only Theory of Mind, but also other forms of reasoning in LMs.
This research was supported in part by Open Philanthropy (N.P., N.S., A.S.S., D.B., A.G., Y.B.), the NSF National Deep Inference Fabric award #2408455 (D.B.), the Israel Council for Higher Education (N.S.), the Zuckerman STEM Leadership Program (T.R.S.), the Israel Science Foundation (grant No. 448/20; Y.B.), an Azrieli Foundation Early Career Faculty Fellowship (Y.B.), a Google Academic Gift (Y.B.), and a Google Gemma Academic Award (D.B.). This research was partly funded by the European Union (ERC, Control-LM, 101165402). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
Instruction: 1. Track the belief of each character as described in the story. 2. A character’s belief is formed only when they perform an action themselves or can observe the action taking place. 3. A character does not have any beliefs about the container and its contents which they cannot observe. 4. To answer the question, predict only what is inside the queried container, strictly based on the belief of the character, mentioned in the question. 5. If the queried character has no belief about the container in question, then predict ‘unknown’. 6. Do not predict container or character as the final output.
Story: Bob and Carla are working in a busy restaurant. To complete an order, Bob grabs an opaque bottle and fills it with beer. Then Carla grabs another opaque cup and fills it with coffee.
Question: What does Bob believe the bottle contains?
Answer:
Instruction: 1. Track the belief of each character as described in the story. 2. A character’s belief is formed only when they perform an action themselves or can observe the action taking place. 3. A character does not have any beliefs about the container and its contents which they cannot observe. 4. To answer the question, predict only what is inside the queried container, strictly based on the belief of the character, mentioned in the question. 5. If the queried character has no belief about the container in question, then predict ‘unknown’. 6. Do not predict container or character as the final output.
Story: Bob and Carla are working in a busy restaurant. To complete an order, Bob grabs an opaque bottle and fills it with beer. Then Carla grabs another opaque cup and fills it with coffee. Bob can observe Carla’s actions. Carla cannot observe Bob’s actions.
Question: What does Bob believe the cup contains?
Answer:
In total, there are 4 templates (one without and 3 with explicit visibility statements). Each template allows 4 different types of questions (CharacterX asked about ObjectY). We used lists of 103 characters, 21 objects, and 23 states. In our interchange intervention experiments, we randomly sample 80 pairs of original and counterfactual stories.
In addition to the experiment shown in Fig. 8, we conduct similar experiments for the object and state tokens by replacing them in the story with random tokens, which alters the original example’s final output. However, patching the residual stream vectors of these tokens from the counterfactual run restores the relevant information, enabling the model to predict the causal model output. The results of these experiments are collectively presented in Fig. 3, with separate heatmaps shown in Figs. 9, 10, and 11.
Causal Models and Interventions A deterministic causal model \(\mathcal{M}\) has variables that take on values. Each variable has a mechanism that determines the value of the variable based on the values of parent variables. Variables without parents, denoted \(\mathbf{X}\), can be thought of as inputs that determine the setting of all other variables, denoted \(\mathcal{M}(\mathbf{x})\). A hard intervention \(A \leftarrow a\) overrides the mechanism of variable \(A\), fixing it to a constant value \(a\).
Interchange Interventions We perform interchange interventions [21], [22] where a variable (or set of features) \(A\) is fixed to the value it would take on if the LM were processing counterfactual input \(\mathbf{c}\). We write \(A \leftarrow \mathsf{Get}(\mathcal{M}(\mathbf{c}), A)\), where \(\mathsf{Get}(\mathcal{M}(\mathbf{c}), A)\) is the value of variable \(A\) when \(\mathcal{M}\) processes input \(\mathbf{c}\). In experiments, we feed an original input \(\mathbf{o}\) to a model under an interchange intervention: \(\mathcal{M}_{A \leftarrow \mathsf{Get}(\mathcal{M}(\mathbf{c}), A)}(\mathbf{o})\).
Featurizing Hidden Vectors The dimensions of hidden vectors are not an ideal unit of analysis [59], so it is typical to featurize a hidden vector using some invertible function, e.g., an orthogonal matrix, to project it into a new variable space with more interpretable dimensions called “features” [26]. A feature intervention \(\mathbf{F}_{\mathbf{h}} \leftarrow \mathbf{f}\) edits the mechanism of a hidden vector \(\mathbf{h}\) to fix the value of features \(\mathbf{F}_{\mathbf{h}}\) to \(\mathbf{f}\).
Alignment The LM is a low-level causal model \(\mathcal{L}\) where variables are dimensions of hidden vectors and the hypothesis about LM structure is a high-level causal model \(\mathcal{H}\). An alignment \(\Pi\) assigns each high-level variable \(A\) to features of a hidden vector \(\mathbf{F}^{A}_{\mathbf{h}}\), e.g., orthogonal directions in the activation space of \(\mathbf{h}\). To evaluate an alignment, we perform intervention experiments to evaluate whether high-level interventions on the variables in \(\mathcal{H}\) have the same effect as interventions on the aligned features in \(\mathcal{L}\).
Causal Abstraction We use interchange interventions to reveal whether the hypothesized causal model \(\mathcal{H}\) is an abstraction of an LM \(\mathcal{L}\). To simplify, assume both models share an input and output space. The high-level model \(\mathcal{H}\) is an abstraction of the low-level model \(\mathcal{L}\) under a given alignment when each high-level interchange intervention and the aligned low-level intervention result in the same output. For a high-level intervention on \(A\) aligned with low-level features \(\mathbf{F}^{A}_{\mathbf{h}}\), with a counterfactual input \(\mathbf{c}\) and original input \(\mathbf{o}\), we write \[\label{eq:abstraction} \mathsf{GetOutput}(\mathcal{L}_{\mathbf{F}_{\mathbf{h}}^{A} \leftarrow \mathsf{Get}(\mathcal{L}(\mathbf{c}), \mathbf{F}^A_\mathbf{h})}(\mathbf{o})) = \mathsf{GetOutput}(\mathcal{H}_{A \leftarrow \mathsf{Get}(\mathcal{H}(\mathbf{c}), A)}(\mathbf{o}))\tag{1}\] If the low-level interchange intervention on the LM produces the same output as the aligned high-level intervention on the algorithm, this is a piece of evidence in favor of the hypothesis. This extends naturally to multi-variable interventions [31].
Graded Faithfulness Metric We construct counterfactual datasets for each causal variable, where an example consists of a base prompt \(\mathbf{o}\) and a counterfactual prompt \(\mathbf{c}\). The counterfactual label is the expected output of the algorithm after the high-level interchange intervention, i.e., the right-hand side of Equation 1. The interchange intervention accuracy (IIA) is the proportion of examples for which Equation 1 holds, i.e., the degree to which \(\mathcal{H}\) faithfully abstracts \(\mathcal{L}\).
While interchange interventions on residual vectors reveal where a causal variable might be encoded in the LM’s internal activations, they do not localize the variable to specific subspaces. To address this, we apply the Desiderata-based Component Masking technique [29], [32], [33], which learns a sparse binary mask over the singular vectors of the LM’s internal activations. First, we cache the internal activations from \(500\) samples at the token positions where residual-level interchange interventions align with the expected output. Next, we apply Singular Value Decomposition to compute the singular vectors, which are then used to construct a projection matrix. Rather than replacing the entire residual vector with that from the counterfactual run, we perform subspace-level interchange interventions using the following equations:
\[\mathsf{W_{proj}} = V V^{\top}\] \[\mathsf{h_{original} \leftarrow W_{proj}\, h_{counterfactual} + (I - W_{proj})\, h_{original}}\]
Here, \(V\) is a matrix containing stacked singular vectors, while \(h_{counterfactual}\) and \(h_{original}\) represent the residual stream vectors from the counterfactual and original runs, respectively. The core idea is to first remove the existing information from the subspace defined by the projection matrix and then insert the counterfactual information into that same subspace using the same projection matrix. However, in DCM, instead of utilizing the entire internal activation space, we learn a binary mask over the matrix containing stacked singular vectors to identify the desired subspace. Specifically, before computing the projection matrix, we use the following equations to select the relevant singular vectors:
\[\mathsf{V \leftarrow V * mask}\]
We train the mask on \(80\) examples of the same counterfactual dataset and use another \(80\) as the validation set. We use the following objective function, which maximizes the logit of the causal model output token:
\[\mathcal{L} = -\mathsf{logit_{expected\_output}} + \lambda \sum \mathsf{W}\]
where \(\lambda\) is a hyperparameter that controls the rank of the subspace and \(\mathsf{W}\) denotes the parameters of the learnable mask. We train the mask for one epoch with the Adam optimizer, a batch size of \(4\), and a learning rate of \(0.01\).
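The sketch below puts the pieces above together: SVD of cached activations, a learnable mask over singular vectors, the subspace interchange of the two equations, and the masked-logit objective. The sigmoid relaxation, the thresholding step, the random stand-in activations, and the linear readout standing in for the LM’s patched forward pass are illustrative assumptions, not the released implementation.

```python
import torch

torch.manual_seed(0)
d_model, n_samples = 64, 500

# Stand-in for cached residual-stream activations at the relevant token
# position (the real experiments cache them from the LM).
acts = torch.randn(n_samples, d_model)

# Singular vectors of the cached activations define the candidate basis.
_, _, Vt = torch.linalg.svd(acts, full_matrices=False)
V = Vt.T                                   # columns are singular vectors

# Learnable mask over singular vectors (relaxed via a sigmoid during training).
mask_logits = torch.zeros(V.shape[1], requires_grad=True)

def subspace_patch(h_orig, h_cf):
    """Swap only the masked subspace: W_proj h_cf + (I - W_proj) h_orig."""
    m = torch.sigmoid(mask_logits)
    W = (V * m) @ V.T                      # projection onto the selected directions
    return h_cf @ W.T + h_orig @ (torch.eye(d_model) - W).T

# Toy stand-in for "run the LM on the patched activation and read off the
# logit of the causal-model answer token"; the real objective back-propagates
# through the LM itself.
readout = torch.randn(d_model)

opt = torch.optim.Adam([mask_logits], lr=0.01)
lam = 1e-2
for h_orig, h_cf in zip(acts[:80], acts[80:160]):        # 80 training pairs
    patched = subspace_patch(h_orig, h_cf)
    expected_logit = patched @ readout                   # placeholder logit
    loss = -expected_logit + lam * torch.sigmoid(mask_logits).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

binary_mask = (torch.sigmoid(mask_logits) > 0.5).float() # final sparse mask
print(int(binary_mask.sum()), "singular vectors selected")
```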
As mentioned in Section [sec:binding_lookback_source], the source information, consisting of the character and object OIs, is duplicated to form the address and pointer of the binding lookback. Here, we describe another experiment to verify that the source information is copied to both the address and the pointer. More specifically, we conduct the same interchange intervention experiment as described in Fig. 5, but without freezing the residual vectors at the state tokens. Based on our hypothesis, this intervention should not change the output state of the original run, since intervening on the source information affects both the address and the pointer, hence the model forms the original QK-circuit.
In Section [sec:binding_lookback_source], we identified the source of the information but did not fully determine the locations of each character and object OI. To address this, we now localize the character and object OIs separately to gain a clearer understanding of the layers at which they appear in the residual streams of their respective tokens, as shown in Fig. 14 and Fig. 15.
In Section 4.2, we localized the pointer information of the binding lookback. However, we found that this information is transferred to the lookback token (last token) through two intermediate tokens: the queried character and the queried object. In this section, we separately localize the OIs of the queried character and queried object, as shown in Fig. 16 and Fig. 17.
As mentioned in Section 5, the payload of the Visibility lookback remains undetermined. In this section, we attempt to disambiguate its semantics using the Attention Knockout technique introduced in [60], which helps reveal the flow of crucial information. We apply this technique to understand which previous tokens are vital for the formation of the payload information. Specifically, we “knock out” all attention heads at all layers of the second visibility sentence, preventing them from attending to one or more of the previous sentences. Then, we allow the attention heads to attend to the knocked-out sentence one layer at a time.
If the LM is fetching vital information from the knocked-out sentence, the interchange intervention accuracy (IIA) post-knockout will decrease. Therefore, a decrease in IIA will indicate which attention heads, at which layers, are bringing in the vital information from the knocked-out sentence. If, however, the model is not fetching any critical information from the knocked-out sentence, then knocking it out should not affect the IIA.
To determine whether any vital information influences the formation of the Visibility lookback payload, we perform three knockout experiments: 1) knock out attention from the second visibility sentence to both the first visibility sentence and the second story sentence (which contains information about the observed character), 2) knock out attention from the second visibility sentence to only the first visibility sentence, and 3) knock out attention from the second visibility sentence to only the second story sentence. In each experiment, we measure the effect of the knockout using IIA; a standalone sketch of the masking operation follows this list.
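The following sketch illustrates the masking operation behind attention knockout; it is not wired into Llama’s attention implementation, where the knockout is applied to all heads of the chosen layers, and the token-position ranges below are arbitrary stand-ins for the second visibility sentence and the sentence being knocked out.

```python
import torch
import torch.nn.functional as F

def knockout_attention(q, k, v, blocked_queries, blocked_keys):
    """Scaled dot-product attention with a causal mask, except that the given
    query positions are prevented from attending to the given key positions."""
    T, d = q.shape
    scores = q @ k.T / d**0.5
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    bq = torch.tensor(list(blocked_queries))
    bk = torch.tensor(list(blocked_keys))
    block = torch.zeros(T, T, dtype=torch.bool)
    block[bq.unsqueeze(1), bk] = True            # block every (query, key) pair
    scores = scores.masked_fill(block, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy setup: 10 positions; positions 6-9 stand in for the second visibility
# sentence, positions 2-5 for the sentence whose information is knocked out.
torch.manual_seed(0)
T, d = 10, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = knockout_attention(q, k, v, blocked_queries=range(6, 10), blocked_keys=range(2, 6))
print(out.shape)   # torch.Size([10, 8])
```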
Fig. 18 shows the experimental results. Knocking out any of the previous sentences affects the model’s ability to produce the correct output. The decrease in IIA in the early layers can be explained by the restriction on the movement of character OIs. Specifically, the second visibility sentence mentions the first and second characters, whose character OIs must be fetched before the model can perform any further operations. Therefore, we believe the decrease in IIA until layer 15, when the character OIs are formed (based on the results from Section 15), can be attributed to the model being restricted from fetching the character OIs. However, the persistently low IIA even after this layer, especially when both the second and first visibility sentences are involved, indicates that some vital information is being fetched by the second visibility sentence, which is essential for forming a coherent Visibility lookback payload. Thus, we speculate that the Visibility payload encodes information about the observed character, specifically their character OI, which is later used to fetch the correct state OI.
This section identifies the attention heads that align with the causal subspaces discovered in the previous sections. Specifically, first we focus on attention heads whose query projections are aligned with the subspaces—characterized by the relevant singular vectors—that contain the correct answer state OI. To quantify this alignment between attention heads and causal subspaces, we use the following computation.
Let \(Q \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}\) denote the query projection weight matrix for a given layer. We normalize \(Q\) column-wise:
\[\tilde{Q}_{:, j} = \frac{Q_{:, j}}{\|Q_{:, j}\|} \quad \text{for each column } j\]
Let \(S \in \mathbb{R}^{d_{\text{model}} \times k}\) represent the matrix of \(k\) singular vectors (i.e., the causal subspace basis). We project the normalized query weights onto this subspace:
\[Q_{\text{sv}} = \tilde{Q} \cdot S\]
We then reshape the resulting projection into per-head components. Assuming \(Q_{\text{sv}} \in \mathbb{R}^{d_{\text{model}} \times k}\), and each attention head has dimensionality \(d_h\), we write:
\[Q_{\text{head}}^{(i)} = Q_{\text{sv}}^{(i)} \in \mathbb{R}^{d_h \times k} \quad \text{for } i = 1, \dots, n_{\text{heads}}\]
Finally, we compute the norm of each attention head’s projection:
\[\text{head\_norm}_i = \left\| Q_{\text{head}}^{(i)} \right\|_{F} \quad \text{for } i = 1, \dots, n_{\text{heads}}\]
We compute \(\text{head\_norm}_i\) for each attention head in every layer, which quantifies how strongly a given head reads from the causal subspace present in the residual stream. The results are presented in Fig. 19 and align with our previous findings: attention heads in the later layers form the QK-circuit by using pointer and address information to retrieve the payload during the Answer lookback.
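A direct transcription of the computation above, using random stand-ins for the query projection matrix and the causal-subspace basis, and toy dimensions in place of the model’s actual sizes:

```python
import torch

torch.manual_seed(0)
d_model, n_heads, k = 512, 8, 10           # toy sizes; the real model is larger
d_h = d_model // n_heads

Q = torch.randn(d_model, d_model)          # stand-in for a layer's query projection
S = torch.randn(d_model, k)                # stand-in causal-subspace basis

Q_tilde = Q / Q.norm(dim=0, keepdim=True)  # column-wise normalization
Q_sv = Q_tilde @ S                         # project onto the subspace: (d_model, k)
Q_heads = Q_sv.reshape(n_heads, d_h, k)    # split the rows into per-head blocks

head_norm = Q_heads.flatten(start_dim=1).norm(dim=1)   # Frobenius norm per head
print(head_norm.shape)                     # torch.Size([8])
```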
We perform a similar analysis to check which attention heads’ value projection matrices align with the causal subspace that encodes the payload of the Answer lookback. Results are shown in Fig. 20, indicating that attention heads at later layers primarily align with the causal subspace containing the answer token.
This section presents preliminary evidence that the mechanisms outlined in Sections 4 and 5 generalize to other benchmark datasets. Specifically, we demonstrate that Llama-3-70B-Instruct answers the belief questions (true belief and false belief) in the BigToM dataset [34] in a manner similar to that observed for CausalToM: by first converting token values to their corresponding OIs and then performing logical operations on them using lookbacks. However, as noted in Section 3, BigToM—like other benchmarks—lacks the coherent structure necessary for causal analysis. As a result, we were unable to replicate all experiments conducted on CausalToM. Thus, the results reported here provide only preliminary evidence of a similar underlying mechanism.
To justify the presence of OIs, we conduct an interchange intervention experiment, similar to the one described in Section 16, aiming to localize the character OI at the character token in the question sentence. We construct an original sample by replacing its question sentence with that of a counterfactual sample, selected directly from the unaltered BigToM dataset. Consequently, when processing the original sample, the model has no information about the queried character and, as a result, produces unknown as the final output. However, if we replace the residual vector at the queried character token in the original sample with the corresponding vector from the counterfactual sample (which contains the character OI), the model’s output changes from unknown to the state token(s) associated with the queried object. This is because inserting the character OI at the queried token provides the correct pointer information, aligning with the address information at the correct state token(s), thereby enabling the model to form the appropriate QK-circuit and retrieve the state’s OI. As shown in Fig. 21, we observe high IIA between layers \(9\)–\(28\)—similar to the pattern seen in CausalToM—suggesting that the queried character token encodes the character OI in its residual vector within these layers.
Next, we investigate the Answer lookback mechanism in BigToM, focusing specifically on localizing the pointer and payload information at the final token position. To localize the pointer information, which encodes the correct state OI, we construct original and counterfactual samples by selecting two completely different examples from the BigToM dataset, such that the correct answer corresponds to a different-ordered state in each. For example, as illustrated in Fig. 22, the counterfactual sample designates the first state as the answer, thrilling plot, whereas the original sample designates the second state, almond milk. We perform an intervention by swapping the residual vector at the last token position from the counterfactual sample into the original run. The causal model outcome of this intervention is that the model outputs the alternative state token from the original sample, oat milk. As shown in Fig. 22, this alignment occurs between layers 33 and 51, similar to the layer range observed for the pointer information in the Answer lookback of CausalToM.
Further, to localize the payload of the Answer lookback in BigToM, we perform an interchange intervention experiment using the same original and counterfactual samples as mentioned in the previous experiment, but with a different expected output—namely, the correct state from the counterfactual sample instead of the other state from the original sample. As shown in Fig. 23, alignment emerges after layer 59, consistent with the layer range observed for the Answer lookback payload in CausalToM.
Finally, we investigate the impact of the visibility condition on the underlying mechanism and find that, similar to CausalToM, the model uses the Visibility lookback to enhance the observing character’s awareness based on the observed character’s actions. To localize the effect of the visibility condition, we perform an interchange intervention in which the original and counterfactual samples differ in belief type: if the original sample involves a false belief, the counterfactual involves a true belief, and vice versa. The expected output of this experiment is the other (incorrect) state of the original sample. Following the methodology in Section 5, we conduct three types of interventions: (1) only at the visibility condition sentence, (2) only at the subsequent question sentence, and (3) at both the visibility condition and the question sentence. As shown in Fig. 24, intervening only at the visibility sentence results in alignment at early layers, up to layer \(17\), while intervening only at the subsequent question sentence leads to alignment after layer \(26\). Intervening on both the visibility and question sentences results in alignment across all layers. These results align with those found in the CausalToM setting shown in Fig. 7.
Previous experiments suggest that the underlying mechanisms responsible for answering belief questions in BigToM are similar to those in CausalToM. However, we observed that the subspaces encoding various types of information are not shared between the two settings. For example, although the pointer information in the Answer lookback encodes the correct state’s OI in both cases, the specific subspaces that represent this information at the final token position differ significantly. We leave a deeper investigation of this phenomenon—shared semantics across distinct subspaces in different distributions—for future work.
This section presents all the interchange intervention experiments described in the main text, conducted using the same set of counterfactual examples on Llama-3.1-405B-Instruct, using NDIF [19]. Each experiment was performed on 80 samples. Due to computational constraints, subspace interchange intervention experiments were not conducted. The results indicate that Llama-3.1-405B-Instruct employs the same underlying mechanism as Llama-3-70B-Instruct to reason about belief and answer related questions. This suggests that the identified belief-tracking mechanism generalizes to other models capable of reliably performing the task.
Correspondence to prakash.nik@northeastern.edu.
Although this mechanism may resemble induction heads [16], [17], they differ fundamentally. In induction heads, information from a previous token occurrence is passed only to the token that follows it, without being duplicated at its next occurrence. In contrast, the lookback mechanism copies the same information not only to the location where the vital information resides but also to the target location that needs to retrieve it.