Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks

Xingcheng Xu\(^{1}\), Zibo Zhao\(^{2}\), Haipeng Zhang\(^{2,*}\), Yanqing Yang\(^{1,3}\)
\(^1\) Shanghai Artificial Intelligence Laboratory
\(^2\) ShanghaiTech University
\(^3\) Fudan University
xingcheng.xu18@gmail.com, andrewzhao054@gmail.com
zhanghp@shanghaitech.edu.cn, yanqingyang@fudan.edu.cn


Abstract

Large language models (LLMs) have demonstrated impressive versatility across numerous tasks, yet their generalization capabilities remain poorly understood. Arithmetic tasks serve as important venues for investigating these behaviors. In previous studies, seemingly unrelated mysteries still exist – (1) models with appropriate positional embeddings can correctly perform longer unseen arithmetic operations such as addition, but their effectiveness varies in more complex tasks like multiplication; (2) models perform well on longer unseen cases in modular addition under specific moduli (e.g., modulo 100) but struggle under very close moduli (e.g., modulo 101), regardless of the positional encoding used. We believe previous studies have been treating the symptoms rather than addressing the root cause – they have paid excessive attention to improving model components, while overlooking the differences in task properties that may be the real drivers. This is confirmed by our unified theoretical framework for different arithmetic scenarios. For example, unlike multiplication, the digital addition task has the property of translation invariance, which naturally aligns with the relative positional encoding, and this combination leads to successful generalization of addition to unseen longer domains. The discrepancy between operations modulo 100 and 101 arises from the base. The modulus 100, unlike 101, is compatible with the decimal system (base 10), such that unseen information in digits beyond the units and tens digits is actually not needed for the task. Extensive experiments with GPT-like models validate our theoretical predictions. These findings deepen our understanding of the generalization mechanisms and facilitate more data-efficient model training and objective-oriented AI alignment.

1 Introduction↩︎

Since the introduction of Transformer [1], Transformer-based models including large language models (LLMs) and large multimodal models (LMMs) have experienced a rapid rise. For example, models like GPT-4 [2], Claude [3], Gemini [4], and Llama [5], [6], along with their vision counterparts, have showcased impressive versatility. They excel in a wide range of tasks, including natural language processing, coding, mathematical reasoning, vision understanding, and more [7], [8]. However, the generalization capabilities of these foundation models are not yet fully understood in areas such as natural language understanding [9] and mathematical reasoning [10], [11].

In the field of natural language processing, the issue of out-of-distribution (OOD) generalization is highly complex and challenging. LLMs perform exceptionally well on some generalization tasks while producing factual errors or misinformation on others, e.g., [8], [9]. Studies therefore try to figure out why these differences exist between generalization tasks (e.g., [12]), what LLMs are actually learning on failed ones (e.g., [13]), and how they manage to generalize on successful tasks (e.g., [11], [14]).

Given the complexity of natural language tasks and the black-box nature of these models, researchers use basic mathematical tasks like \(n\)-digit addition, multiplication, and modular operations as essential channels for investigating model generalization behaviors. However, mysterious discrepancies in results still exist – (1) certain tasks (e.g., addition) succeed in unseen generalization with certain positional encodings (e.g., relative) but not other tasks (e.g., multiplication), and (2) there is a significant generalization difference between very close moduli in modular operations (e.g., modulo 100 and 101). Specifically, previous studies have observed that when training models with absolute positional embeddings (APE) on \(n\)-digit operations (e.g., addition), where both input operands are no longer than \(n\)-digit in length such as \(1234+5678\) for \(n=4\), the models successfully generalize on unseen \(n\)-digit inputs such as \(4321+8765\) (termed in-distribution (ID) generalization). However, they fail on longer unseen cases such as \(91234+15678\) (termed OOD generalization) as shown by [10], [11], [15], and [13]. Besides, models equipped with relative positional embeddings (RPE) can generalize to longer unseen inputs for addition tasks but struggle with multiplication tasks, according to [11] and [14]. Additionally, models trained on modular operations with specific moduli such as 100 can perfectly generalize to any longer unseen inputs with either absolute or relative positional embeddings. However, they fail to generalize to longer unseen inputs for other very close moduli such as 101, as noted by [11]. These seemingly unrelated OOD generalization mysteries are summarized in Table 1.

Table 1: Length Generalization of Transformers with APE and RPE on Arithmetic Tasks (✓: successful length generalization to longer unseen operands; ✗: failure)
| | Addition | Multiplication | Modular Addition \(p=100\) | Modular Addition \(p=101\) | Modular Multiplication \(p=100\) | Modular Multiplication \(p=101\) |
| APE | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ |
| RPE | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ |

In summary, these previous efforts address generalization issues in specific tasks, modifying components of individual models, such as altering positional encodings [11], [14] or attention mechanisms [16]. Their failure to uncover the underlying mechanism calls for a reflective examination – we believe the field has been overly focused on improving model components, while overlooking the differences in task properties (e.g., addition vs. multiplication, modulo \(10^2\) vs. modulo \(10^2+1\)) that may be the real drivers. The perspective of mechanistic interpretability [17], [18] offers an angle in this direction. This data-driven and experimentally-based analytical approach has helped identify and interpret phenomena such as "grokking" [18] and analyze the impact of repeated data on the performance of LLMs [17]. In this paper, we develop a unified theoretical framework that builds on the principles of language modeling, the model’s ability for universal approximation, and detailed task property analysis in various arithmetic scenarios. It confirms that generalization behaviors depend on task properties and training data coverage. For example, digital addition possesses the property of translation invariance, meaning the operation yields consistent results regardless of the positional shift of digits (e.g., \(8 + 9 = 17\) and \(800 + 900 = 1700\)). This property aligns with RPE, which maintains the positional relationship between digits, whereas multiplication does not. This leads to successful generalization of addition to unseen longer domains under RPE but not for multiplication. The discrepancy in operations modulo 100 and 101 arises from the base. Modulo 100 is consistent with the decimal system (base 10), such that \(11234+15678 \equiv 1234+5678 \equiv 34+78 \pmod{100}\), so information from higher digits beyond the units digit and tens digit is actually not needed. In contrast, when modulo 101 is applied, higher digits do matter.

Furthermore, we have conducted more extensive generalization analyses than those in the literature. We assume that Transformer models are trained on \(n\)-digit operations with at least one operand having a length of \(n\), such as \(1234+567\) for \(n=4\). This differs from the literature where the length of both operands is no longer than \(n\). We categorize generalization into two types: inward OOD generalization and outward OOD generalization. Inward OOD generalization involves generalizing to shorter-length domains, such as \(120+235\) or \(11+32\), while outward OOD generalization involves generalizing to longer-length domains, such as \(12035+235\) or \(123456+323456\). The core conclusions of our theoretical analysis are as follows: (1) For addition, under APE, Transformer models can generalize to the shorter-length (inward) OOD domain, but not to the longer-length (outward) OOD domain. However, under RPE, the models can generalize to both shorter-length and longer-length OOD domains, benefiting from the translation invariance of digit addition. (2) For multiplication, even RPE has limited effectiveness in the longer-length OOD domain due to the lack of the translation invariance property. (3) For modular operations, if the modulus \(p\) divides \(10^n\), models can generalize to both shorter- and longer-length OOD domains regardless of the positional encoding, due to the compatibility with base 10 such that the information at higher-digit positions of the operands does not affect the result. When the modulus \(p\) does not divide \(10^n\), models can only generalize to the shorter-length OOD domain. For longer-length OOD domains, we have derived a theoretical accuracy formula based on the information loss and identification of the model’s final learned function.

As a note, the shorter-length (inward) OOD domain generalization is not trivial. If a model is trained on a smaller domain with a significant gap from the desired training dataset, such as training on \(n\)-digit addition with both operands having a length of \(n\) and the highest digits of both operands being greater than, for example, 5, the model fails to generalize to the shorter-length OOD domain. This challenge in generalization has significant implications for AI alignment [19], a crucial research area focused on ensuring that AI systems behave in ways that are aligned with human intentions and values. Our analysis provides valuable insights into this area by highlighting the importance of training data completeness. If the data excluded from the training dataset does not affect the desired ground truth support set, such as when the inward OOD domain is excluded during training, the model can still learn to generalize to the excluded inward OOD domain. However, if a significant amount of data is omitted, or a large number of training samples are mapped to the same answer, as shown in our counterexample above, the inward OOD domain generalization fails. Therefore, when our goal is to align the model to generalize certain OOD domains as expected, precise analysis of the task nature and careful control of the training data are necessary. This ensures that the models are trained on datasets that accurately reflect the desired outcomes, ultimately improving their alignment with intended behaviors.

To validate our theoretical framework, we also conduct extensive experimental analyses by training various generative language models, including NanoGPT, MicroGPT, and MiniGPT [20], on tasks involving \(n\)-digit addition, multiplication, and modular operations. We conduct robustness studies across different model scales, dataset sizes, and training data schemes.

Our main contributions are as follows:

1. Establishing a unified theoretical framework for understanding OOD generalization of Transformers: Our framework explains why and how these models generalize across different tasks, addressing both successes and failures in OOD generalization. Comprehensive experimental evidence supports our theoretical predictions. To facilitate relevant research, we open-source our code2.

2. Clarifying the inward and outward OOD generalization to boost model training and alignment: We have introduced the concepts of inward and outward generalization, which more clearly delineate the differences between generalization to shorter and longer lengths. This understanding supports more data-efficient model training and objective-oriented AI alignment.

3. Practical implications for AI research: This study redirects the (unnecessary, sometimes erroneous) efforts in modifying individual technical aspects (e.g., positional embeddings) for isolated problems towards a more comprehensive mechanistic understanding of model generalization.

2 Related Work↩︎

2.0.0.1 Generalization of Transformers and LLMs on Arithmetic.

Numerous studies have examined the performance of Transformer-based language models in tasks involving arithmetic operations and mathematical reasoning. [21], [7] and [8] investigated various LLMs, such as GPT-3, GPT-4, and Gemini, in performing basic arithmetic and mathematical reasoning. [22] explored the limitations of Transformers in learning arithmetic, highlighting the significant influence of surface representation on model accuracy and the need for improved tokenization and positional encoding strategies. Subsequent research such as [23], [10], [11], [15], and [13] further explored the generalization capabilities of Transformer models and LLMs in arithmetic tasks, revealing poor OOD generalization. [14] demonstrated the impact of embedding strategies on arithmetic capabilities, showing that positional encoding significantly improves performance. [24] introduced Attention Bias Calibration (ABC) to achieve near-perfect length generalization in arithmetic tasks (e.g. addition and parity), highlighting targeted attention biasing. [25] examined generalization on unseen logical functions, introducing the Degree-Curriculum approach to improve generalization by incrementally increasing task complexity.

While previous studies have mainly focused on evaluating or improving generalization capabilities, our work develops a comprehensive theoretical framework to analyze OOD generalization behaviors in Transformer models trained on arithmetic operations, bridging the gap between empirical observations and theoretical understanding.

2.0.0.2 Mechanistic Interpretability and General Understanding.

Many studies have focused on understanding and interpreting the working dynamics of neural networks and Transformer models [17], [26][29]. From the perspective of universal approximation, [30] demonstrated that Transformer models equipped with trainable positional encodings can act as universal approximators for continuous functions in a compact domain under the \(L^p\) norm. Extending this concept, [31] showed that the Sumformer architecture also maintains this universal approximation property under the supremum norm, addressing the worst-case scenario for approximation.

From a mechanistic viewpoint, [17] investigated the impact of repeated data on the performance of large language models, highlighting significant performance degradation when a small fraction of data is repeated multiple times. Their findings emphasize the necessity of data diversity in training to avoid memorization and maintain generalization capabilities. [18] addressed the phenomenon of delayed generalization or "grokking" using addition and modular addition tasks, providing intuitive explanations through effective theories and phase diagrams. [32] utilized modular addition to mechanistically explain algorithm discovery in neural networks.

Our work contributes to this growing field of mechanistic interpretability by providing a macroscopic explanation specifically for Transformer models. By developing a comprehensive theoretical framework to analyze OOD generalization behaviors in these models, we systematically identify biases and interpret model behaviors in arithmetic reasoning.

3 Theoretical Analysis on Generalization for Arithmetic Reasoning↩︎

In this section, we review the Transformer model and the universal approximation theorem, and then conduct theoretical analyses of the inward and outward OOD generalization capabilities of the Transformer in solving tasks related to addition, modular addition, multiplication, and modular multiplication.

3.1 Preliminaries on Transformer and Universal Approximation↩︎

A Transformer model [1] predicts the next token based on the preceding tokens within the input sequence. Its output is subsequently used as input for the next prediction. For a target token \(x_{i}\) at position \(i\) in the sequence, the model generates a probability distribution over the vocabulary of potential next tokens. To be precise, let \(x=x_{1}x_{2}\ldots x_{T}\in\mathcal{V}^{T}\) denote the input sequence of tokens. The probability of observing this sequence with respect to a Transformer model is given as follows: \[\mathbf{P}_{\theta}(x)=\prod_{i=1}^{T}\mathbf{P}_{\theta}(x_{i}|x_{1},x_{2},...,x_{i-1})=\prod_{i=1}^{T}\mathbf{P}_{\theta}(x_{i}|x_{<i}).\] The conditional probability \(\mathbf{P}_{\theta}(x_{i}|x_{<i})\) is computed using the softmax function applied to the last hidden state.
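To make this factorization concrete, the following minimal Python sketch computes a sequence probability by multiplying per-token conditionals. The function `cond_prob` is a hypothetical stand-in (here simply uniform over a three-symbol vocabulary) for the softmax output of a trained Transformer; it is not part of our implementation.

```python
import math

# Hypothetical stand-in for a trained model's conditional distribution P_theta(x_i | x_{<i});
# here it is uniform over a three-symbol vocabulary {"0", "1", "2"} purely for illustration.
def cond_prob(prefix: str, token: str) -> float:
    return 1.0 / 3.0

def sequence_log_prob(x: str) -> float:
    """log P(x) = sum_i log P_theta(x_i | x_{<i}): the autoregressive factorization."""
    return sum(math.log(cond_prob(x[:i], x[i])) for i in range(len(x)))

print(math.exp(sequence_log_prob("120")))  # (1/3)^3 ~= 0.037 for the uniform toy model
```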

3.1.0.1 Universal approximation theorem for Transformer models:

Transformer models have the capacity to universally approximate any arbitrary continuous sequence-to-sequence function within a compact domain. [30] have shown that, when equipped with trainable positional encodings, Transformers can serve as universal approximators for continuous functions in a compact domain under the \(L^p\) norm. [31] have extended this universal approximation property to the Sumformer architecture and demonstrated that the universal approximation also holds under the supremum norm, which represents the worst-case scenario for approximation. These characterizations highlight the representation power of fixed-width Transformer networks, despite the intrinsic parameter sharing and permutation equivariance.

3.2 Theoretical Analysis on Addition↩︎

Consider two natural numbers \(a=\sum_{i=1}^{n}a_{i}\times 10^{i-1}=(a_1,a_2,\cdots,a_{n})\) and \(b=\sum_{i=1}^{n}b_i\times 10^{i-1}=(b_1,b_2,\cdots,b_{n})\). The addition of these \(n\)-digit numbers, denoted as \(f(a,b)=a+b\), is expressed by \(c=\sum_{i=1}^{n+1}c_i\times 10^{i-1}=(c_1,c_2,\cdots,c_{n},c_{n+1})\).

Let the dataset \(\mathcal{D}_{n}:=\{(a,b)\in\mathbb{N}^2:a_n \vee b_n\geq 1, a_i=b_i\equiv 0,\forall i>n\}\). For notation simplicity, assume \((0,0)\in\mathcal{D}_{1}\). Here, \(a_n \vee b_n=\max\{a_n,b_n\}\). Note that \(\mathcal{D}_{n}\cap \mathcal{D}_{m}=\emptyset\) for \(n\neq m\) and \(\mathbb{N}^2=\bigcup_{n=1}^{\infty}\mathcal{D}_{n}\). Denote the shorter-length (inward) domain \(\mathcal{D}_{<n}:=\bigcup_{m=1}^{n-1}\mathcal{D}_{m}\) and the longer-length (outward) domain \(\mathcal{D}_{>n}:=\bigcup_{m=n+1}^{\infty}\mathcal{D}_{m}\).
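The domain definitions above can be made concrete with a small sampling sketch. This is a minimal illustration with hypothetical helper names, assuming operands are drawn uniformly at random within each domain; it is not the sampling code used in our experiments.

```python
import random

def sample_from_D(n: int, rng: random.Random):
    """Sample (a, b) from D_n: both operands have at most n digits and
    at least one has exactly n digits (a_n v b_n >= 1); D_1 also contains (0, 0)."""
    while True:
        a, b = rng.randrange(10 ** n), rng.randrange(10 ** n)
        if n == 1 or max(a, b) >= 10 ** (n - 1):
            return a, b

def sample_inward(n: int, rng: random.Random):
    """Shorter-length (inward) OOD domain D_{<n}."""
    return sample_from_D(rng.randrange(1, n), rng)

def sample_outward(n: int, n_max: int, rng: random.Random):
    """Longer-length (outward) OOD domain D_{>n}, capped at n_max digits."""
    return sample_from_D(rng.randrange(n + 1, n_max + 1), rng)

rng = random.Random(0)
print(sample_from_D(4, rng), sample_inward(4, rng), sample_outward(4, 9, rng))
```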

Theorem 1. Assume a Transformer model with absolute positional embedding (APE) is trained on a multi-digit addition dataset for the operands \((a,b)\in \mathcal{D}_{n}\) (\(n\geq 2\)) with enough training computation, then the learned model can perfectly generalize for the shorter-length OOD domain \(\mathcal{D}_{<n}\), but fail completely for the longer-length OOD domain \(\mathcal{D}_{>n}\).

3.2.0.1 Proof.

Define the functions \[\chi(x):=\lfloor x/10 \rfloor \text{ and } \zeta(x):=x \;\operatorname{mod}\;10, \text{ for } x\in \mathbb{N}.\] Then \(c_i=\zeta(a_i+b_i+\chi(a_{i-1}+b_{i-1})),\forall i\). For simplicity, assume \(a_0=b_0=0\).
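For reference, the digit-wise recursion can be written out as a short sketch that propagates carries explicitly with \(\chi\) and \(\zeta\); it is an illustrative reference implementation of ordinary column addition, not the computation learned by the model.

```python
def chi(x: int) -> int:   # carry: floor(x / 10)
    return x // 10

def zeta(x: int) -> int:  # units digit: x mod 10
    return x % 10

def digitwise_add(a: int, b: int, n: int):
    """Digits of a + b, least-significant first, via the carry recursion
    c_i = zeta(a_i + b_i + carry_{i-1}), carry_i = chi(a_i + b_i + carry_{i-1})."""
    A = [(a // 10 ** i) % 10 for i in range(n)]
    B = [(b // 10 ** i) % 10 for i in range(n)]
    digits, carry = [], 0
    for i in range(n):
        s = A[i] + B[i] + carry
        digits.append(zeta(s))
        carry = chi(s)
    digits.append(carry)              # c_{n+1}
    return digits

# 1234 + 5678 = 6912 -> [2, 1, 9, 6, 0]
assert digitwise_add(1234, 5678, 4) == [2, 1, 9, 6, 0]
```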

We define three forms of approximation:

  • Strong form: If \(\mathbf{P}_{\theta}(\tilde{c} = c_i \mid a + b = c_{<i}) = 1\) for any \(i \geq 1\). This means the model \(\mathbf{P}_{\theta}(\cdot \mid a + b = c_{<i})\) can perfectly learn the function \(c_i = \zeta(a_i + b_i+ \chi(a_{i-1} + b_{i-1})), \forall i\).

  • Standard form: If \(c_i = \arg\max_{\tilde{c}} \mathbf{P}_{\theta}(\tilde{c} \mid a + b = c_{<i})\) for any \(i \geq 1\). This means the model \(\mathbf{P}_{\theta}(\cdot \mid a + b = c_{<i})\) can approximate the function \(c_i = \zeta(a_i + b_i+ \chi(a_{i-1} + b_{i-1})), \forall i\) with the highest probability.

  • Weak form: If \(\mathbf{P}_{\theta}(\tilde{c} = c_i \mid a + b = c_{<i}) > 0\) for any \(i \geq 1\). This means the model \(\mathbf{P}_{\theta}(\cdot \mid a + b = c_{<i})\) can approximate the function \(c_i = \zeta(a_i + b_i+ \chi(a_{i-1} + b_{i-1})), \forall i\) with a non-zero probability.

In the following, we will use the standard form to demonstrate out-of-distribution (OOD) generalization. When training a Transformer model on \(\mathcal{D}_n\)-addition using absolute positional embedding (APE), the learned model approximates the function at each position of \(c\): \[\mathbf{P}_{\theta}(c_i \mid a + b = c_{<i})=\mathbf{P}_{\theta}(c_i \mid a_{i-1},a_{i},b_{i-1},b_{i})\to c_i = \zeta(a_i + b_i+ \chi(a_{i-1} + b_{i-1})).\]

Case I: Let us consider the shorter-length OOD domain \(\mathcal{D}_{<n}\) case. If \(i<n\), the model trained on a sample dataset in \(\mathcal{D}_n\) can at least approximate the function \(c_i\) in the standard form. If \(i=n\), \[\mathbf{P}_{\theta}(c_n \mid a_{n-1},a_{n},b_{n-1},b_{n})\to c_n = \zeta(a_n + b_n + \chi(a_{n-1} + b_{n-1}))\] for every \(a_n\vee b_n\geq 1\) except the case \(a_n=b_n=0\) simultaneously. If \(i=n+1\), \[\mathbf{P}_{\theta}(c_{n+1} \mid a_{n},a_{n+1},b_{n},b_{n+1})\to c_{n+1} = \zeta(a_{n+1} + b_{n+1} + \chi(a_{n} + b_{n}))=\chi(a_{n} + b_{n})\in \{0,1\}\] for every pair \((a_n,b_n)\) with \(a_n\vee b_n\geq 1\) and \(a_{n+1}=b_{n+1}=0\). In the case where \(a_n=b_n=0\), the conditions for both \(i=n\) and \(i=n+1\) necessitate OOD generalization. Since the model has been trained to approximate \(c_n\) accurately for \(a_n \vee b_n \geq 1\), it has learned the function for the carry-over mechanism properly. When \(a_n = b_n = 0\), the digit \(c_n\) purely depends on the carry from the previous position. For \(i = n+1\), the carry \(\chi(a_n + b_n)\) is correctly learned such that it maps \(\{0, 1\}\) depending on whether there was a carry from the \(n\)-th digit. With \(a_n = b_n = 0\), the model correctly sets \(c_{n+1} = 0\). The training on \(\mathcal{D}_n\) includes all possible carry scenarios and digit summations for \(a_n, b_n \in \{0, \ldots, 9\}\). The zero cases are naturally included in the learned patterns3. For \(i\geq n+2\), \[\mathbf{P}_{\theta}(c_i \mid a_{i-1},a_{i},b_{i-1},b_{i})\to c_i = \zeta(a_i + b_i+ \chi(a_{i-1} + b_{i-1}))\equiv 0,\] since \(a_i=b_i\equiv 0\) for any \((a,b)\in \mathcal{D}_n\) with \(i\geq n+1\). Thus, the model \(\mathbf{P}_{\theta}\) can approximate the function of \(c\) at every position for the shorter-length OOD domain \(\mathcal{D}_{<n}\).

Case II: Consider the longer-length OOD domain \(\mathcal{D}_{>n}\) case. If \(i\leq n\), the analysis remains the same as above. The learned model \(\mathbf{P}_{\theta}\) can predict the correct numbers at these positions. However, when \(i=n+1\), \[\mathbf{P}_{\theta}(c_{n+1} \mid a_{n},a_{n+1},b_{n},b_{n+1})\to c_{n+1} = \zeta(a_{n+1} + b_{n+1} + \chi(a_{n} + b_{n}))=\chi(a_{n} + b_{n})\in \{0,1\}\] for every pair \((a_n,b_n)\) with \(a_n\vee b_n\geq 1\) and \(a_{n+1}=b_{n+1}=0\). Note that for inference in the OOD domain \(\mathcal{D}_{>n}\), the model needs to predict each sample with \((a_{n+1},b_{n+1})\) at least for every \(a_{n+1}\vee b_{n+1}\geq 1\). However, the support of probability measure learned by the model \(\mathbf{P}_{\theta}\) is \(\operatorname{supp}\mathbf{P}_{\theta}=\{0,1\}\). For the model to predict \(c_{n+1}\) correctly even in the weak form, the support should be \(\operatorname{supp}\mathbf{P}_{\theta}=\{0,1,\cdots,9\}\). This indicates that the model \(\mathbf{P}_{\theta}\) cannot predict the number at position \(n+1\). Additionally, the learned probability \(\mathbf{P}_{\theta}(c_{n+1} \mid a_{n},a_{n+1},b_{n},b_{n+1})\) is actually independent of \((a_{n+1},b_{n+1})\). For \(i\geq n+2\), \[\mathbf{P}_{\theta}(c_i \mid a_{i-1},a_{i},b_{i-1},b_{i})\to c_i \equiv 0,\] since \(a_i=b_i\equiv 0\) for any \((a,b)\in \mathcal{D}_n\) with \(i\geq n+1\). This means that the learned model maps all inputs to zeros for positions \(i\geq n+2\). If the model could predict the numbers at positions \(i\geq n+2\), the requirement even in the weak form is that at least \(\{0,1\}\subset \operatorname{supp}\mathbf{P}_{\theta}(c_i\mid \cdots)\). This contradicts \(\operatorname{supp}\mathbf{P}_{\theta}(c_i\mid \cdots)=\{0\}\). Combining the above analysis, we conclude that the learned model \(\mathbf{P}_{\theta}\) cannot solve the problems in the OOD domain \(\mathcal{D}_{>n}\) but instead outputs the result \((a\;\operatorname{mod}\;10^n)+(b\;\operatorname{mod}\;10^n)\) for every sample in \(\mathcal{D}_{>n}\). 0◻

Remarks on Theorem 1: The challenging aspect of model prediction in the shorter-length OOD domain \(\mathcal{D}_{<n}\) arises from the need to generalize the \(n\)-th and \((n+1)\)-th positions in the result \(c\) when trained on \(\mathcal{D}_{n}\). Specifically, these positions must be generalized to the scenario where \(a_{n}=b_{n}=0\). Through our experimental analysis, we confirmed that the positions \(n\) and \(n+1\) are the last to be learned during the training process. An additional observation is that if the model is trained on the domain \(\mathcal{D}_{n-1,n}:=\mathcal{D}_{n-1}\cup \mathcal{D}_{n}\), the previously mentioned challenge is mitigated. This is because the case with \(a_{n}=b_{n}=0\) is already incorporated into the training dataset. Consequently, the positions \(n\) and \(n+1\) do not require OOD generalization; instead, they are learned directly from the training data. We have also conducted experiments based on this training scheme and found that learning on the domain that includes \(\mathcal{D}_{n-1,n}\) is significantly easier than learning on \(\mathcal{D}_{n}\) alone.

Based on the analysis above, we can immediately draw the following conclusion, which provides an explanation for the findings by [13].

Corollary 1. The learned Transformer model with APE approximates the function \(\hat{f}(a,b)=(a\;\operatorname{mod}\;10^n)+(b\;\operatorname{mod}\;10^n)\). The OOD generalization error is zero for the shorter-length OOD domain \(\mathcal{D}_{<n}\), but not less than \(10^n\) for every point in the longer-length OOD domain \(\mathcal{D}_{>n}\).
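The learned function \(\hat{f}\) identified in Corollary 1 is easy to simulate directly. The sketch below, with illustrative names, checks that truncating both operands to their \(n\) lowest digits is exact on the inward OOD domain and off by a multiple of at least \(10^n\) on the outward OOD domain.

```python
def learned_addition(a: int, b: int, n: int) -> int:
    """The function the APE-trained model ends up representing (Corollary 1):
    both operands are truncated to their n lowest digits before adding."""
    return a % 10 ** n + b % 10 ** n

n = 4
# Inward OOD (D_{<n}): truncation is a no-op, so the prediction is exact.
assert learned_addition(120, 235, n) == 120 + 235
# Outward OOD (D_{>n}): higher digits are dropped; the error is a nonzero multiple of 10^n.
a, b = 91234, 15678
err = (a + b) - learned_addition(a, b, n)
print(err, err % 10 ** n == 0, abs(err) >= 10 ** n)   # 100000 True True
```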

We are curious about the conditions under which a Transformer model can learn to perform addition operations. With absolute positional embedding, the model successfully generalizes inward, but fails to generalize outward. What would be the conclusion under relative positional embedding? Through theoretical and experimental analysis, we have arrived at the following conclusions.

Theorem 2. Assume a Transformer model with relative/abacus positional embedding (RPE) is trained on a multi-digit addition dataset for the operands \((a,b)\in \mathcal{D}_{n}\) (\(n\geq 2\)) with enough training computation, then the learned model can perfectly generalize both for the shorter-length OOD domain \(\mathcal{D}_{<n}\) and the longer-length OOD domain \(\mathcal{D}_{>n}\).

3.2.0.2 Sketch of Proof.

Under relative positional embedding, the key to the above conclusion is the following property of translation invariance: \[\mathbf{P}_{\theta}(c_i \mid a_{i-1},a_{i},b_{i-1},b_{i})=\mathbf{P}_{\theta}(c_{i+j} \mid a_{i+j-1},a_{i+j},b_{i+j-1},b_{i+j}), \text{ for any } i,j\in \mathbb{N}.\] For a Transformer model trained on a sample dataset in \(\mathcal{D}_{n}\) (\(n\geq 2\)) with RPE, the prediction at positions \(i\geq n+1\) for the OOD domain \(\mathcal{D}_{>n}\) can now be solved by \[\mathbf{P}_{\theta}(c_i \mid a_{i-1},a_{i},b_{i-1},b_{i})=\mathbf{P}_{\theta}(c_{2} \mid a_{1},a_{2},b_{1},b_{2}).\] The remaining point is that a Transformer model satisfying this translation-invariance property lies in the space spanned by Transformer models, which follows from their representation power; references in this direction include, e.g., [33], [34], and [35]. 0◻

Remarks on APE and RPE: APE encodes positional information based on the absolute positions of tokens in a sequence. This approach can limit a model’s ability to generalize to sequences of different lengths or to handle out-of-distribution scenarios effectively. In contrast, RPE captures translation-invariant positional dependencies by encoding the relative distances between tokens. This method allows the model to focus on the relationships between tokens regardless of their absolute positions, enhancing its ability to generalize across varying sequence lengths and to better understand contextual relationships. Consequently, RPE is more robust and adaptable in the addition context compared to APE. Our theoretical framework can explain the addition-based experimental findings reported in the following references: [11], [13], [24], and [14].

3.3 Theoretical Analysis on Modular Addition↩︎

Consider the function for modular addition with a modulus \(p\), expressed as \(f(a, b) = (a + b) \mod p\), which will be the focus of our analysis in the following section. Subsequently, we will also represent modular addition using the notation \(\overline{c}^p = \overline{a + b}^p\). For simplicity, we will omit the superscript \(p\) when it is clear from the context.

3.3.1 Scenarios on Divisibility of 10’s Power by Modulus↩︎

Theorem 3. Assume a Transformer model with either absolute or relative/abacus positional embedding is trained on a multi-digit modular addition dataset with a modulus \(p\) that divides \(10^m\) for the operands \((a,b)\in \mathcal{D}_{n}\) (\(n\geq 2\) and \(m\leq n\)) with enough training computation, then the learned model can perfectly generalize both for the shorter-length OOD domain \(\mathcal{D}_{<n}\) and the longer-length OOD domain \(\mathcal{D}_{>n}\).

3.3.1.1 Sketch of Proof.

We will initially focus on the scenario where \(p = 10^m\), and subsequently explore the general case where \(p\) is a divisor of \(10^m\).

Case I: Let us revisit the equation for modular addition, which states that \(\overline{c}^p = \overline{a + b}^p = \overline{\overline{a}^p + \overline{b}^p}^p\). The above equation shows that for the case \(p = 10^m\), the digits in positions higher than \(m\) in numbers \(a\) and \(b\) do not affect the result \(\overline{c}^p\); only the digits in positions \(m\) and lower have an impact. Furthermore, we have \(\overline{c}^p=(\overline{c}^p_{1},\overline{c}^p_{2},\cdots,\overline{c}^p_{m})=(c_1,c_2,\cdots,c_m)\), where \(c=a+b\). A model trained on \(\mathcal{D}_{n}\) is capable of approximating the digits at positions ranging from \(1\) to \(m\). This can be expressed as: \[\mathbf{P}_{\theta}(\overline{c}^p_{i} \mid a_{i-1},a_{i},b_{i-1},b_{i})\to \overline{c}^p_{i} = \zeta(a_i + b_i+ \chi(a_{i-1} + b_{i-1})),\;\forall i=1,\cdots,m.\] All these functions are learned directly from the training data without the need for out-of-distribution (OOD) generalization if \(m<n\); when \(m=n\), only the \(n\)-th term \(\overline{c}^p_{n}\) needs OOD generalization. For \(i>m\), the learned distribution concentrates on \(\overline{c}^p_{i} = 0\). The aforementioned conclusions apply to both domains \(\mathcal{D}_{<n}\) and \(\mathcal{D}_{>n}\).

Case II: Consider the case where \(p\) is a divisor of \(10^m\). Since we have \(\overline{c}^p = \overline{a + b}^p = \overline{\overline{a+b}^{10^m}}^p\), the result \(\overline{c}^p\) is indeed not influenced by the digits in positions higher than \(m\) in numbers \(a\) and \(b\). Let \(m\) be the smallest number such that \(10^{m}\) is divisible by the modulus \(p\), i.e., \(m=\arg\min\{\tilde{m}: p\mid 10^{\tilde{m}}\}\); then the model approximates the function at each position \(i\): \[\mathbf{P}_{\theta}(\overline{c}^p_{i} \mid a_{1},\cdots,a_{m},b_{1},\cdots,b_{m})\to \overline{c}^p_{i} = f_i^p(a_{1},\cdots,a_{m},b_{1},\cdots,b_{m}),\;\forall i=1,\cdots,m\] where \(f_i^p\) is the function for \(\overline{c}^p_{i}\) at position \(i\). As an aside, it is worth noting that in the case described above, the function is more intricate than standard addition or modular addition with a modulus that is exactly a power of 10. These functions generally rely on the digits at all positions of the numbers \(a\) and \(b\), from position \(1\) through \(m\). All these functions can be learned directly from the training data without the need for OOD generalization when training on \(\mathcal{D}_{n}\) (\(n\geq m\)) except the term \(\overline{c}^p_{n}\). 0◻
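A quick numeric check, with illustrative values, of the divisibility argument used above: when \(p \mid 10^m\), digits of \(a\) and \(b\) above position \(m\) never influence the result.

```python
# Modular addition with p | 10^m depends only on the m lowest digits of each operand.
p, m = 100, 2                          # p = 10^m
for a, b in [(11234, 15678), (1234, 5678), (34, 78)]:
    assert (a + b) % p == (a % 10 ** m + b % 10 ** m) % p
print((11234 + 15678) % 100, (34 + 78) % 100)   # both 12

# The same identity holds for any divisor of 10^m, e.g. p = 25:
p = 25
assert (987654 + 123456) % p == (987654 % 100 + 123456 % 100) % p
```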

3.3.2 Scenarios on Non-Divisibility of 10’s Power by Modulus↩︎

Theorem 4. (1) Assuming a Transformer model equipped with absolute positional embeddings is trained on a multi-digit modular addition dataset \(\mathcal{D}_{n}\) (\(n \geq 2\)) where the modulus \(p\) neither divides \(10^n\) nor exceeds \(10^n\), and provided that sufficient training computation is allocated, then the resulting trained model is capable of perfect generalization to the shorter-length OOD domain \(\mathcal{D}_{<n}\), while encountering difficulties in generalizing to the longer-length OOD domain \(\mathcal{D}_{>n}\).

(2) The function that the model has learned is \(\hat{f}^p(a,b)=\overline{\overline{a}^{10^n}+\overline{b}^{10^n}}^p\).

(3) Furthermore, the test accuracy on \(\widetilde{\mathcal{D}}_{n_{\text{test}}}\) (\(n_{\text{test}}>n\)) is given by \(\operatorname{Acc}(p,n,n_{\text{test}}) \approx \frac{\gcd(p, 10^n)}{p}\) if \(n_{\text{test}}\geq n+\log_{10}(p'/2+1)\), otherwise \(\operatorname{Acc}(p,n,n_{\text{test}}) =0\), where \(\gcd(p, 10^n)\) represents the greatest common divisor of \(p\) and \(10^n\), and \(p'=p/\gcd(p, 10^n)\).

3.3.2.1 Sketch of Proof.

In this case, the model approximates the function for each position \(i\) as follows when training on \(\mathcal{D}_n\): \[\mathbf{P}_{\theta}(\overline{c}^p_{i} \mid a_{1},\cdots,a_{n},b_{1},\cdots,b_{n})\to \overline{c}^p_{i} = f_i^p(a_{1},\cdots,a_{n},b_{1},\cdots,b_{n}),\;\forall i=1,\cdots,n\] where \(f_i^p\) represents the function for \(\overline{c}^p_{i}\) at position \(i\). Generally, the function \(f^p(a,b)=(a+b)-\lfloor (a+b)/p \rfloor p\). Each digit \(f^p_i\) depends on all positions of \(a\) and \(b\). If the model is trained on \(\mathcal{D}_{n}\), the aforementioned probabilities have been trained exclusively on scenarios where \(a_n \vee b_n \geq 1\). The case where \(a_n = b_n = 0\) requires OOD generalization for samples on the shorter-length domain \(\mathcal{D}_{<n}\). This can be addressed by aligning with the model trained on the domain containing \(\mathcal{D}_{n-1,n}\). If the model is trained on the dataset \(\mathcal{D}_{n-1,n}\), which includes the case where \(a_n = b_n = 0\), it learns the relevant patterns directly from the training data without the need for OOD generalization on the domain \(\mathcal{D}_{<n}\). However, the model typically struggles to generalize to the longer-length domain \(\mathcal{D}_{>n}\). This is because the model is expected to approximate the functions \(f^p(a,b)=\overline{a+b}^p\), which consider all digits of \(a\) and \(b\). Since the model is trained on \(\mathcal{D}_{n}\), it learns the function \(\hat{f}^p(a,b)=\overline{\overline{a}^{10^n}+\overline{b}^{10^n}}^p\), which is independent of the positions \(i>n\) of the numbers \(a\) and \(b\). 0◻

3.3.2.2 OOD Test Accuracy Analysis for Longer Length.

For the model’s output to be correct, it must satisfy the condition \(\overline{a+b}^p=\overline{\overline{a}^{10^n}+\overline{b}^{10^n}}^p\). This requirement also provides us with a method to estimate the OOD test accuracy on the longer-length domain \(\mathcal{D}_{>n}\).

Let \(H_n=\overline{a}^{10^n}+\overline{b}^{10^n}\), and \(R_n=(a+b)-H_n\). The OOD generalization error is then \[f^p(a,b)-\hat{f}^p(a,b) = R_n - \left(\left\lfloor (a+b)/p\right\rfloor - \left\lfloor H_n/p \right\rfloor \right)p.\] Denote \(\varepsilon_n^{R}:=\frac{R_n}{p}-\lfloor \frac{R_n}{p} \rfloor\in [0,1)\) and \(\varepsilon_n^{H}:=\frac{H_n}{p}-\lfloor \frac{H_n}{p} \rfloor\in [0,1)\). Then \[\begin{align} f^p(a,b)-\hat{f}^p(a,b) = (R_n/p - \left\lfloor (R_n+H_n)/p\right\rfloor + \left\lfloor H_n/p \right\rfloor)p = (\varepsilon_n^{R} - \lfloor\varepsilon_n^{R}+\varepsilon_n^{H} \rfloor)p. \end{align}\] That is, \[f^p(a,b)-\hat{f}^p(a,b)= \begin{cases} \varepsilon_n^{R} p\geq 0, &\text{if } \varepsilon_n^{R}+\varepsilon_n^{H}\in [0,1)\\ (\varepsilon_n^{R}-1) p<0, &\text{if } \varepsilon_n^{R}+\varepsilon_n^{H}\in [1,2) \end{cases}.\] For the special case where \(\varepsilon_n^{R}=0\) (i.e. \(R_n\) is divisible by \(p\)), we have \(\hat{f}^p(a,b)=f^p(a,b)\). This implies that the OOD test accuracy for a finite OOD test dataset may be greater than 0.

The OOD test accuracy on the domain \(\widetilde{\mathcal{D}}_{n_{test}}\) (\(n_{test}>n\)), in which the lengths of \(a\) and \(b\) are both \(n_{test}\), is \(\operatorname{Acc}(p,n,n_{\text{test}})=\frac{\#\{(a,b)\in \widetilde{\mathcal{D}}_{n_{test}}:\varepsilon_n^R=0\}}{\# \widetilde{\mathcal{D}}_{n_{test}}}\). This can be calculated by counting the number of samples with \(R_n\) divisible by \(p\) in this domain. The theoretical test accuracy on \(\widetilde{\mathcal{D}}_{n_{\text{test}}}\) is given by \(\operatorname{Acc}(p,n,n_{\text{test}}) \approx \frac{1}{p'}\) if \(n_{\text{test}}\geq n+\log_{10}(p'/2+1)\), otherwise 0. The proof can be found in Appendix 8.1.

Let us consider some examples. For \(p=151\) and \(n=4\), since \(\gcd(151,10^n)= 1\), the test accuracy is \(\operatorname{Acc}(151,4,n_{\text{test}}) = \frac{1}{151}\approx 0.66\%\) if \(n_{\text{test}}\geq 6\), but \(0\) when \(n_{\text{test}}=5\). For \(p=201\) and \(n=4\), the test accuracy is \(\operatorname{Acc}(201,4,n_{\text{test}}) = \frac{1}{201}\approx 0.5\%\) if \(n_{\text{test}}\geq 7\), but \(0\) when \(n_{\text{test}}=5,6\). Another example is \(p=150\) and \(n=4\), where the greatest common divisor is \(\gcd(150, 10^4)=50\) and \(p'=3\), resulting in a test accuracy of \(\operatorname{Acc}(150,4,n_{\text{test}}) = \frac{50}{150}\approx 33.3\%\) for all \(n_{\text{test}}\geq 5\). In the extreme case where \(p\) is a divisor of \(10^n\), the test accuracy \(\operatorname{Acc}(p,n,n_{\text{test}})\equiv 100\%\). This aligns with the results for the scenarios on the divisibility of a power of 10 by the modulus. All these findings are confirmed by our experimental analysis (see Table 3 and Table 4).
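These closed-form values can be cross-checked by evaluating the identified function \(\hat{f}^p\) directly on random longer operands. The sketch below compares the theoretical formula with a Monte Carlo estimate; it simulates the closed-form learned function rather than an actual trained model, and the sample size and helper names are illustrative.

```python
import math
import random

def theoretical_acc(p: int, n: int, n_test: int) -> float:
    """Acc(p, n, n_test) from Theorem 4(3)."""
    g = math.gcd(p, 10 ** n)
    p_prime = p // g
    return g / p if n_test >= n + math.log10(p_prime / 2 + 1) else 0.0

def empirical_acc(p: int, n: int, n_test: int, trials: int = 200_000, seed: int = 0) -> float:
    """Fraction of n_test-digit operand pairs on which the learned function
    ((a mod 10^n) + (b mod 10^n)) mod p matches the true (a + b) mod p."""
    rng = random.Random(seed)
    lo, hi = 10 ** (n_test - 1), 10 ** n_test
    hits = 0
    for _ in range(trials):
        a, b = rng.randrange(lo, hi), rng.randrange(lo, hi)
        hits += ((a % 10 ** n + b % 10 ** n) % p) == ((a + b) % p)
    return hits / trials

for p, n_test in [(151, 5), (151, 6), (201, 7), (150, 5)]:
    print(p, n_test, round(theoretical_acc(p, 4, n_test), 4), round(empirical_acc(p, 4, n_test), 4))
```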

Remark on Transformer models based on relative/abacus positional embedding: The standard addition benefits from the property of translation invariance, whereas modular addition with a modulus \(p\) that does not divide \(10^n\) lacks this property. Consequently, there is no apparent advantage to be gained from leveraging this characteristic.

3.4 Theoretical Analysis on Multiplication↩︎

Theorem 5. (1) Assuming a Transformer model equipped with absolute positional embeddings is trained on a multi-digit multiplication dataset \(\mathcal{D}_{n}\) (\(n \geq 2\)), and provided that sufficient training computation is allocated, then the resulting trained model is capable of perfect generalization to the shorter-length OOD domain \(\mathcal{D}_{<n}\), while it cannot generalize to the longer-length OOD domain \(\mathcal{D}_{>n}\). (2) The function that the model has learned is \(\hat{f}(a,b)=\overline{a}^{10^n}\times \overline{b}^{10^n}\).

Given two natural numbers \(a\) and \(b\), each represented by \(n\)-digit sequences \((a_1, a_2, \ldots, a_n)\) and \((b_1, b_2, \ldots, b_n)\), respectively, the product \(ab\) is expressed as a \(2n\)-digit number \(c = (c_1, c_2, \ldots, c_{2n})\).

To express each digit \(c_i\) of the product \(c\) in terms of the digits of \(a\) and \(b\), we need to understand the multiplication task and how the digits interact. The product \(ab\) can be represented as: \[ab = \left(\sum_{i=1}^{n} a_i \cdot 10^{i-1}\right) \left(\sum_{j=1}^{n} b_j \cdot 10^{j-1}\right) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i b_j \cdot 10^{(i-1)+(j-1)}.\] This gives us a double sum where each term \(a_i b_j\) contributes to a specific power of 10. To express the digit \(c_k\) (where \(1 \leq k \leq 2n\)) of the product, we need to collect all terms from the expansion that contribute to the \(10^{k-1}\) place.

For \(c_k\), we consider all pairs \((i, j)\) such that \(i + j - 2 = k - 1\), which simplifies to \(i + j = k + 1\). Define the raw sum \(c_k^{R}\) at the \(k\)-th position as follows: \[c_k^{R} = \sum_{\substack{1\leq i, j\leq n \\ i + j = k + 1}} a_i b_j.\] However, since this is a digital product and carries might affect higher places, the correct formulation needs to account for carries from previous steps. The process of digit-wise calculation and adjustment with carries is as follows:

1. Initialize carry \(c_0^{\chi} = 0\).

2. Calculate the sum for each digit place: \[S_i = c_i^{R} + c_{i-1}^{\chi} =\sum_{\substack{1 \leq i', j' \leq n \\ i' + j' = i + 1}} a_{i'} b_{j'} + c_{i-1}^{\chi},\] where \(a_{i'}\) and \(b_{j'}\) are zeros if their indices are out of bounds.

3. Determine the digit and carry: \[c_i = \zeta(S_i),\quad c_i^{\chi} = \chi(S_i).\]

Here, \(\zeta(x):=x \;\operatorname{mod}\;10\) and \(\chi(x):=\lfloor x/10 \rfloor\), for \(x\in \mathbb{N}\). This recursive formula provides the digits of the product considering the carries correctly. Denote that \(c_i=f_i(a_1,\cdots,a_{i\wedge n},b_1,\cdots,b_{i\wedge n})\) for \(i=1,2,\cdots,2n\). A Transformer model \(\mathbf{P}_{\theta}(c_i\mid a\times b=c_1\cdots c_{i-1})=\mathbf{P}_{\theta}(c_i\mid a_1,\cdots,a_{i\wedge n},b_1,\cdots,b_{i\wedge n})\) will learn to approximate these functions \(f_i\) when given enough data and computation power.
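The three steps above can be written out directly. The following sketch computes the raw sums \(c_k^{R}\) and propagates carries exactly as described; names are illustrative and the code serves only as a reference for the digit-wise recursion.

```python
def zeta(x: int) -> int:  # x mod 10
    return x % 10

def chi(x: int) -> int:   # floor(x / 10)
    return x // 10

def digitwise_mul(a: int, b: int, n: int):
    """Digits of a * b (least-significant first) via raw sums c_k^R and carries."""
    A = [(a // 10 ** i) % 10 for i in range(n)]
    B = [(b // 10 ** i) % 10 for i in range(n)]
    digits, carry = [], 0                        # c_0^chi = 0
    for k in range(1, 2 * n + 1):
        raw = sum(A[i - 1] * B[j - 1]            # c_k^R: all pairs with i + j = k + 1
                  for i in range(1, n + 1)
                  for j in range(1, n + 1) if i + j == k + 1)
        s = raw + carry                          # S_k = c_k^R + c_{k-1}^chi
        digits.append(zeta(s))
        carry = chi(s)
    return digits

# 1234 * 5678 = 7006652 -> [2, 5, 6, 6, 0, 0, 7, 0]
assert digitwise_mul(1234, 5678, 4) == [2, 5, 6, 6, 0, 0, 7, 0]
```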

Consider the longer length OOD domain \((a,b) \in \mathcal{D}_{>n}\). Let \(\overline{a} = \overline{a}^{10^n}\) and \(\overline{b} = \overline{b}^{10^n}\). The function learned by a Transformer model with absolute positional embeddings (APE) when trained with \((a,b) \in \mathcal{D}_{n-1,n}\) is then \[\hat{f}(a,b) = \overline{a}^{10^n} \cdot \overline{b}^{10^n} = \overline{c} = (\overline{c}_1,\overline{c}_2,\cdots,\overline{c}_{2n},0,\cdots,0)\] with \(\overline{c}_i = f_i(a_1,\cdots,a_{i \wedge n},b_1,\cdots,b_{i \wedge n}),\;1 \leq i \leq 2n\), as all terms related to \(a_i, b_i\) for \(i > n\) are discarded during the training process. If the true value of \(ab\) is \(c\), then \(\overline{c}_i = c_i\) for \(1 \leq i \leq n\), but generally differs from \(c_i\) when \(i > n\) since \(\overline{c}_i\) neglects the contribution of higher terms (greater than \(n\)) of \(a\) and \(b\).

Note that when a Transformer model is trained on domain \(\mathcal{D}_n\), if \(i<n\), the model learns the function \(f_i(a_1,\cdots,a_{i \wedge n},b_1,\cdots,b_{i \wedge n})\) directly from the training data. However, when \(i\geq n\), the model learns the function \(f_i(a_1,\cdots,a_{n},b_1,\cdots,b_{n})\) only for the case where \(a_n\vee b_n\geq 1\). In the scenario where \(a_n=b_n=0\), the model requires OOD generalization. The training on \(\mathcal{D}_n\) includes all possible carry scenarios and digit summations (here, we only need to consider the units and tens digits of \(c^R_i\) and \(c^{\chi}_{i-1}\)) for \(a_n, b_n \in \{0, \ldots, 9\}\). The zero cases where \(a_n=b_n=0\) are naturally included in the learned patterns.

3.4.0.1 Transition Invariance Property in Multiplication

The transition invariance property for multiplication refers to the idea that the position of digits in the multiplication process can be shifted or "transitioned" in a systematic way that still respects the overall structure of multiplication. In the context of digit-wise multiplication, each digit \(c_i\) should be adjusted by the previous carry. This process is transition invariant because each digit’s place calculation transitions in a smooth and systematic way from one digit place to the next, maintaining the structure of the multiplication.

Transformers can utilize properties like transition invariance to learn multiplication using proper positional embeddings such as relative or abacus PE. In fact, the structured nature of multiplication, especially when broken down into steps that involve digit-by-digit operations and carry propagation, aligns well with the capabilities of Transformer models to capture sequential dependencies and patterns. However, the most challenging aspect is computing the raw sums \(c_i^R\) at each position. Each \(c_i^{R}\) results from a sum of specific pairs of digits from the input sequences \(a\) and \(b\). For a given \(c_i^{R}\), the valid pairs \((i', j')\) must satisfy \(i' + j' = i + 1\). Identifying these pairs involves (1) ensuring \(1 \leq i', j' \leq n\), i.e., the indices must be within the bounds of the sequences, and (2) for each \(i\), determining which pairs contribute to \(c_i^{R}\) by iterating through potential values of \(i'\) and \(j'\) and checking whether their sum equals \(i + 1\). Digit multiplication depends on the positional significance of digits. Misalignment in positions can lead to incorrect contributions to the product. Therefore, positional encoding and accurate handling of positional values are necessary to ensure correct multiplication results. There are also efficiency considerations. Multiplication of large numbers involves many such sums. For large \(n\), directly computing \(c_i^{R}\) for each \(i\) involves nested loops or checks, leading to a time complexity of \(O(n^2)\) in the worst case. This makes computing the raw sums \(c_i^R\) particularly difficult.

This challenge can be understood through the following analysis. Suppose the model is provided with Chain-of-Thought (CoT) style intermediate steps of multiplication as part of the training data. The CoT-like training data format is: \[a \times b \to (c^{R}, c^{\chi}) \to c.\] In digit-wise format, this is: \[(a_1, \cdots, a_n) \times (b_1, \cdots, b_n) \to (c_1^{R}, c_1^{\chi}, \cdots, c_{2n-1}^{R}, c_{2n-1}^{\chi}) \to (c_1, \cdots, c_{2n}).\] The conditional probability equation is then given by: \[\begin{align} \mathbf{P}_{\theta}(c_i \mid a_1, \cdots, a_{i \wedge n}, b_1, \cdots, b_{i \wedge n}) & = \mathbf{P}_{\theta}^{\chi}(c_{i-1}^{\chi} \mid a_1, \cdots, a_{(i-1) \wedge n}, b_1, \cdots, b_{(i-1) \wedge n}) \\ & \;\times \mathbf{P}_{\theta}^{R}(c_i^{R} \mid a_1, \cdots, a_{i \wedge n}, b_1, \cdots, b_{i \wedge n}) \\ & \;\times \mathbf{P}_{\theta}(c_i \mid c_i^{R}, c_{i-1}^{\chi}), \end{align}\] and \[\begin{align} \mathbf{P}_{\theta}^{\chi}(c_i^{\chi} \mid a_1, \cdots, a_{i \wedge n}, b_1, \cdots, b_{i \wedge n}) & = \mathbf{P}_{\theta}^{\chi}(c_{i-1}^{\chi} \mid a_1, \cdots, a_{(i-1) \wedge n}, b_1, \cdots, b_{(i-1) \wedge n}) \\ & \;\times \mathbf{P}_{\theta}^{R}(c_i^{R} \mid a_1, \cdots, a_{i \wedge n}, b_1, \cdots, b_{i \wedge n}) \\ & \;\times \mathbf{P}_{\theta}^{\chi}(c_i^{\chi} \mid c_i^{R}, c_{i-1}^{\chi}). \end{align}\] For the carry at the \(i\)-th position, we then have that \[\mathbf{P}_{\theta}^{\chi}(c_i^{\chi} \mid a_1, \cdots, a_{i \wedge n}, b_1, \cdots, b_{i \wedge n}) = \prod_{j=1}^{i} \mathbf{P}_{\theta}^{R}(c_j^{R} \mid a_1, \cdots, a_{j \wedge n}, b_1, \cdots, b_{j \wedge n}) \mathbf{P}_{\theta}^{\chi}(c_j^{\chi} \mid c_j^{R}, c_{j-1}^{\chi}).\] Note that \(\mathbf{P}_{\theta}(c_i \mid c_i^{R}, c_{i-1}^{\chi})\) and \(\mathbf{P}_{\theta}^{\chi}(c_i^{\chi} \mid c_i^{R}, c_{i-1}^{\chi})\) exhibit transition invariance. This could be handled by relative or abacus positional embedding. The difficulty lies in the computation of the raw sums \(\mathbf{P}_{\theta}^{R}(c_i^{R} \mid a_1, \cdots, a_{i \wedge n}, b_1, \cdots, b_{i \wedge n})\) even when using relative or abacus positional embedding.
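To make the CoT-like format concrete, the sketch below serializes one multiplication example as \(a \times b \to (c^{R}, c^{\chi}) \to c\). The separator characters and the least-significant-first digit ordering are our own illustrative choices, not the exact format of any cited work.

```python
def cot_multiplication_sample(a: int, b: int, n: int) -> str:
    """Serialize a * b with intermediate raw sums and carries, then the final digits."""
    A = [(a // 10 ** i) % 10 for i in range(n)]
    B = [(b // 10 ** i) % 10 for i in range(n)]
    raws, carries, digits = [], [], []
    carry = 0
    for k in range(1, 2 * n):                     # raw sums c_1^R ... c_{2n-1}^R
        raw = sum(A[i] * B[j] for i in range(n) for j in range(n) if i + j == k - 1)
        s = raw + carry
        raws.append(raw)
        digits.append(s % 10)
        carry = s // 10
        carries.append(carry)
    digits.append(carry)                          # c_{2n} (a single digit since a*b < 10^{2n})
    intermediate = ",".join(f"{r}:{c}" for r, c in zip(raws, carries))
    result = "".join(map(str, digits))            # least-significant digit first
    return f"{a}*{b}={intermediate}|{result}"

print(cot_multiplication_sample(1234, 5678, 4))
# 1234*5678=32:3,52:5,61:6,60:6,34:4,16:2,5:0|25660070
```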

Experiments on Transformer models using relative or abacus positional embeddings to learn multiplication have been presented in the literature. [11] and [14] show that addition can successfully generalize to OOD regions with higher numerical digits, but multiplication has largely not succeeded. Our analysis provides insights into the difficulties behind generalizing to higher numerical digits, which helps us understand the reasons for the failure in learning multiplication.

3.5 Theoretical Analysis on Modular Multiplication↩︎

Theorem 6. (1) Assume a Transformer model with either absolute or relative/abacus positional embedding is trained on a multi-digit modular multiplication dataset with a modulus \(p\) that divides \(10^m\) for the operands \((a,b)\in \mathcal{D}_{n}\) (\(n\geq 2\) and \(m\leq n\)) with enough training computation, then the learned model can perfectly generalize both for the shorter-length OOD domain \(\mathcal{D}_{<n}\) and the longer-length OOD domain \(\mathcal{D}_{>n}\).

(2) If the modulus \(p\) neither divides \(10^n\) nor exceeds \(10^n\), and provided that sufficient training computation is allocated, then the resulting trained model is capable of perfect generalization to the shorter-length OOD domain \(\mathcal{D}_{<n}\), while encountering difficulties in generalizing to the longer-length OOD domain \(\mathcal{D}_{>n}\). The function that the model with APE has learned is \(\hat{f}^p(a,b)=\overline{\overline{a}^{10^n} \times \overline{b}^{10^n}}^p\).

The proof resembles the process for modular addition. Suppose \(\overline{c}^p = \overline{ab}^p\). When \(p\) is a divisor of \(10^m\), we have \(\overline{c}^p = \overline{\overline{ab}^{10^m}}^p\). The value of \(\overline{c}^p\) remains unaffected by the digits in positions beyond \(m\) in the numbers \(a\) and \(b\). Now, let \(m\) be the smallest number such that the \(m\)-th power of 10 is divisible by the modulus \(p\), i.e., \(m=\arg\min\{\tilde{m}: p\mid 10^{\tilde{m}}\}\). The model approximates the function for each position \(i\) as follows: \[\mathbf{P}_{\theta}(\overline{c}^p_{i} \mid a_{1},\cdots,a_{m},b_{1},\cdots,b_{m})\to \overline{c}^p_{i} = f_i^p(a_{1},\cdots,a_{m},b_{1},\cdots,b_{m}),\;\forall i=1,\cdots,m\] where \(f_i^p\) represents the function for the \(i\)-th digit of \(\overline{c}^p\). All these functions can be learned directly from the training data without the need for OOD generalization when training on \(\mathcal{D}_{n}\) (\(n\geq m\)) except the term \(\overline{c}^p_{n}\).
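The same kind of quick numeric check, with illustrative values, applies to modular multiplication: when \(p \mid 10^m\), digits above position \(m\) never influence \(\overline{ab}^p\).

```python
# Modular multiplication with p | 10^m depends only on the m lowest digits of each operand.
p, m = 100, 2
for a, b in [(91234, 15678), (1234, 5678), (34, 78)]:
    assert (a * b) % p == ((a % 10 ** m) * (b % 10 ** m)) % p
print((91234 * 15678) % 100, (34 * 78) % 100)   # both 52
```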

When \(p\) is not a divisor of \(10^n\) and \(p<10^n\), the model approximates the function \(\hat{f}^p(a,b)=\overline{\overline{a}^{10^n}\times \overline{b}^{10^n}}^p\) at each position \(i\). This is because the model has been trained on \(\mathcal{D}_{n}\), which is agnostic to the digits in positions \(i>n\) of the numbers \(a\) and \(b\).

Remark: As with modular addition with a modulus \(p\) that does not divide \(10^n\), modular multiplication with such a modulus also does not exhibit the property of translation invariance. Therefore, there is no evident advantage to be gained from exploiting this characteristic via relative/abacus positional embeddings.

4 Experiments↩︎

The aim of this section is to further validate our theoretical analysis through experiments. We first describe the experimental design, data format, and testing methods, and then conduct extensive experimental analyses for addition, modular addition, multiplication, and modular multiplication. To conserve space in the main text and enhance readability, we present the main results in this section, with other extensive experimental results included in the appendix.

4.1 Experimental Design↩︎

4.1.0.1 Model Description:

We utilize a GPT model framework, which is a Transformer with a decoder-only architecture consisting of multiple layers and multi-head attentions. Our models are trained from scratch using NanoGPT, MicroGPT, and MiniGPT [20] with increasing model size, employing character-level tokenization and the cross-entropy loss function for the standard next-token prediction. The training focuses on basic arithmetic operations, including addition, multiplication, modular addition, and modular multiplication of integers. All experiments are conducted on a single GeForce RTX 4090 GPU. Detailed hyperparameters of the models and training are provided in Table 2.

Table 2: Hyperparameter Information for Arithmetic Operations Training
| Hyperparameter | NanoGPT | MicroGPT | MiniGPT |
| num layer | | | |
| num head | | | |
| dim embd | | | |
| vocab size | | | |
| context window | | | |
| dropout prob | | | |
| optimizer | AdamW | AdamW | AdamW |
| learning rate | | | |
| betas | (0.9, 0.99) | (0.9, 0.99) | (0.9, 0.99) |
| weight decay | True | True | True |
| grad norm clip | | | |

4.1.0.2 Data Description:

In this section, we delve into the four primary arithmetic operations described above, which include the following:

  • Addition: \(c = a + b\).

  • Modular addition: \(c \equiv a + b \pmod{p}\).

  • Multiplication: \(c = a \times b\).

  • Modular multiplication: \(c \equiv a \times b \pmod{p}\).

We generate multiple datasets randomly for each arithmetic task. Each dataset is organized as a sequence of operand pairs in natural order, with the results of the operations in reversed order. This format has been shown to be more effective for learning in next-token prediction models [13], [15]. For example, consider an \(n\)-digit addition \(a + b = c\), represented in standard format as "\(a_{n} \cdots a_{2}a_{1} + b_{n} \cdots b_{2}b_{1} = c_{n+1} \cdots c_{2}c_{1}\)". By reversing the order of the output "\(c\)", we obtain the reversed data format "\(a_{n} \cdots a_{2}a_{1} + b_{n} \cdots b_{2}b_{1} = c_{1} \cdots c_{n}c_{n+1}\)".

Subsequently, the data undergoes character-level tokenization, and we add ";", "<bos>", "<eos>", and a "line break" token, resulting in a vocabulary size of 16. When the context window exceeds the required size for \(n\)-digit arithmetic operations, we pad zeros before the numbers "\(a\)", "\(b\)", and "\(c\)".

We control the length of arithmetic operations \(n\) and randomly generate multiple datasets from \(\mathcal{D}_n\) for different lengths \(n\). These datasets for each arithmetic task are categorized into three distinct subsets: a training set, an in-distribution (ID) test set, and several additional out-of-distribution (OOD) test sets, sampled from \(m\)-digit operations with \(m \neq n\). The case where \(m < n\) is referred to as the shorter-length (inward) OOD domain, and the case where \(m > n\) is termed the longer-length (outward) OOD domain. We also construct numerous combination sets of samples from different domains \(\mathcal{D}_n\), such as \(\mathcal{D}_{n-1,n}\), to be used as training and ID test datasets in our work. In this case, the OOD test sets are from \(\mathcal{D}_m\) with \(m \neq n-1\) and \(n\). In the experiments presented in this paper, test accuracy is measured using maximum probability sampling.
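As a concrete illustration of this data scheme, the sketch below generates one addition line from \(\mathcal{D}_n\) in the reversed-output format with zero padding. The exact token inventory and separator characters are assumptions for illustration and may differ in detail from the released code.

```python
import random

# Character-level vocabulary as described above (the exact inventory is an assumption).
VOCAB = list("0123456789+=;") + ["<bos>", "<eos>", "\n"]   # 16 tokens

def addition_sample(n: int, rng: random.Random) -> str:
    """One training line drawn from D_n: operands in natural order, zero-padded to
    n digits; result zero-padded to n + 1 digits and written in reversed order."""
    while True:
        a, b = rng.randrange(10 ** n), rng.randrange(10 ** n)
        if max(a, b) >= 10 ** (n - 1):           # at least one operand has exactly n digits
            break
    a_str, b_str = str(a).zfill(n), str(b).zfill(n)
    c_rev = str(a + b).zfill(n + 1)[::-1]        # e.g. 1801 -> "01801" -> "10810"
    return f"{a_str}+{b_str}={c_rev};"

rng = random.Random(0)
print(addition_sample(4, rng))                   # a line such as "1234+0567=10810;"
```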

4.2 Experiments on Addition↩︎

In this subsection, we train multiple models on different datasets (e.g., \(\mathcal{D}_4\), \(\mathcal{D}_5\), \(\mathcal{D}_{4,5}\)) and track the changes in their accuracy over the course of training on various test sets. Additionally, we demonstrate how the models learn each digit during the training process.

4.2.1 Generalization for Different Digit Tasks↩︎

In Figure 1, we present the results of three different experiments using distinct training datasets (i.e., \(\mathcal{D}_4\), \(\mathcal{D}_5\), \(\mathcal{D}_{4,5}\)). For all experiments, we employ the MiniGPT model equipped with a learned APE. Each subfigure illustrates the test accuracy on different test domains \(\mathcal{D}_{i}\) for these models throughout the training process. Figure 1 verifies our Theorem 1. It demonstrates that models incorporating APE are unable to generalize to longer digits than those they are trained on but can succeed with lower digits. Additionally, the model trained on \(\mathcal{D}_5\) has a much more challenging training process compared to the model trained on \(\mathcal{D}_4\), while the model trained on \(\mathcal{D}_{4,5}\) experiences the easiest and smoothest training process among the three models. The reason, as explained in Theorem 1, is that for \(\mathcal{D}_{4,5}\), the model learns addition tasks on lower digits directly from the training data. In contrast, \(\mathcal{D}_{4}\) and \(\mathcal{D}_{5}\) require OOD generalization for the edge positions.

More results can be found in Table 5 and Table 6. We test the final trained model on datasets with varying digit lengths. While the models do not learn the addition of higher digits, they successfully learn the operation \(\hat{f}(a,b)=\overline{a}^{10^n}+\overline{b}^{10^n}\), supporting our Corollary 1.

We also conduct extensive experiments using various training datasets, model scales, and data scales. The results of these experiments are robust, and presented in Appendix 9.1.

Figure 1: Test Accuracy of Transformer Models with APE for Different Multi-digit Addition Tasks.

4.2.2 Learning Dynamics for Each Digit Position↩︎

The models and training datasets are identical to those described in Figure 1. We have assembled a comprehensive test dataset that contains a random sample from \(\mathcal{D}_1\) to \(\mathcal{D}_9\). Our objective is to demonstrate how these Transformer models equipped with APE learn each digit at every position throughout the training phase. The digit-wise test accuracy is defined as the accuracy of the prediction for each position in the result \(c\).

The plots in Figure 5 (see Appendix 9.1) show whether these models are capable of accurately predicting the digits \(c_i\) at all positions, effectively illustrating the learning dynamics for each token in the addition task. The models reach high accuracy, approaching 1.0 as training progresses, on the first four or five digits for the datasets \(\mathcal{D}_4\), \(\mathcal{D}_5\), and \(\mathcal{D}_{4,5}\), respectively. However, accuracy declines sharply at the 5th or 6th digit and remains near zero for the 7th, 8th, and 9th digits. These findings show that while the models effectively learn and predict lower-position digits, they struggle significantly with higher-position digits, in line with the theorem that Transformer models with APE generalize well to shorter-length OOD domains but fail on longer-length OOD domains.

How Are Digits Learned During Training?↩︎

The experimental results in Figure 2 illustrate the learning dynamics of each function \(c_i\) during the training of Transformer models, using a DecisionTreeRegressor to approximate these functions. The \(R^2\) values, which measure how well the model’s predictions fit the actual data, indicate that the models learn lower-order digits with high accuracy, achieving \(R^2\) values close to 1, whereas higher-order digits are more challenging, yielding lower and less stable \(R^2\) values. Furthermore, early in training the models first learn the higher-order digits (with higher \(R^2\) values) and only then proceed to the lower-order digits.
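
One plausible reading of this probing step is sketched below: for each checkpoint and each position \(i\), a small decision tree is fit to the model's predicted digit as a function of the input digits that \(c_i\) should depend on, and the tree's \(R^2\) is recorded. The feature set and tree depth are our assumptions; the paper's exact regression setup may differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

def digit(x, k):
    """k-th decimal digit of x (k = 1 is the units digit); zeros for k < 1."""
    x = np.asarray(x)
    return np.zeros_like(x) if k < 1 else (x // 10 ** (k - 1)) % 10

def digit_r2(a, b, c_pred, i, max_depth=8):
    """R^2 of a decision tree approximating the model's i-th output digit from
    the input digits (a_i, b_i, a_{i-1}, b_{i-1})."""
    X = np.stack([digit(a, i), digit(b, i), digit(a, i - 1), digit(b, i - 1)], axis=1)
    y = digit(c_pred, i)
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
    return r2_score(y, tree.predict(X))
```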

From Figure 2, it is evident that the Transformer model trained on \(\mathcal{D}_4\) initially focuses on learning the digits at positions 4 and 5 before addressing positions below 4 (position 6 is trivial, since it always equals zero). The model trained on \(\mathcal{D}_5\) first learns the digits at positions 5 and 6, then proceeds to positions below 5. The model trained on \(\mathcal{D}_{4,5}\) starts with the digits at positions 4, 5, and 6, then moves to positions below 4. In our theoretical analysis, the most challenging parts are \(c_n\) and \(c_{n+1}\) when training on \(\mathcal{D}_n\), since these positions never encounter \(a_n = b_n = 0\) and therefore require OOD generalization. In these experiments, the models prioritize the hardest positions first and then the easier ones.

Figure 2: Learning Dynamics of Each Function \(c_i = \zeta(a_i + b_i+ \chi(a_{i-1} + b_{i-1}))\) for Addition.

Another notable result from the experiments is that the correlation of \(R^2\) values between different digit pairs is around zero (see Figure 6 in Appendix 9.1). This indicates that changes in the approximation for one position have little impact on other positions. This finding suggests that the Transformer model is flexible enough to handle different tokens independently, even though they share parameters.
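
Given the per-position \(R^2\) trajectories across training checkpoints, this independence check reduces to a pairwise correlation matrix; a minimal sketch (the array layout is our assumption):

```python
import numpy as np

def r2_correlations(r2_history):
    """Pearson correlation between the R^2 trajectories of different digit positions.
    r2_history has shape (num_checkpoints, num_positions)."""
    return np.corrcoef(np.asarray(r2_history).T)
```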

4.2.3 Generalization Under Relative/Abacus Positional Embeddings↩︎

[14] conducted experiments with a 16-layer decoder-only Transformer using abacus positional embeddings, trained on a random sample from \(\mathcal{D}_{\leq 20}\); the resulting model generalizes to 100-digit addition problems (see Figure 7 in Appendix 9.1).4 Additionally, [11] demonstrated that relative positional embeddings enable length generalization in addition tasks: encoder-only Transformer and Universal Transformer models trained to add 5-digit numbers could generalize to 20-digit operands.

These results provide empirical evidence for our Theorem 2 on longer-length OOD generalization. Since the findings are clear, we do not replicate the procedures here but instead refer to these studies directly.

4.3 Experiments on Modular Addition↩︎

Table 3: Modular Addition: Test Accuracy (%) w.r.t. the Ground Truth \(f^p(a,b)=\overline{a+b}^p\) on the Domain \(\widetilde{\mathcal{D}}_i\) for \(i=1,2,\ldots,9\). The last column gives the theoretical accuracy \(1/p'\).

| Modulus | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Theory \(1/p'\) |
|---|---|---|---|---|---|---|---|---|---|---|
| \(p=50\) | 100 | 100 | 100 | 100 | 99.3 | 92.0 | 93.1 | 95.2 | 91.4 | 100 |
| \(p=51\) | 100 | 98.5 | 99.9 | 99.3 | 0.3 | 1.8 | 1.9 | 1.9 | 1.6 | 1.96 |
| \(p=100\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(p=101\) | 100 | 100 | 100 | 100 | 0.0 | 1.2 | 0.9 | 1.1 | 1.0 | 0.99 |
| \(p=150\) | 100 | 100 | 100 | 100 | 33.2 | 33.6 | 32.3 | 33.0 | 33.7 | 33.3 |
| \(p=151\) | 100 | 99.9 | 99.9 | 100 | 0.0 | 0.6 | 0.7 | 0.7 | 0.6 | 0.66 |
| \(p=200\) | 100 | 100 | 100 | 100 | 99.8 | 98.9 | 93.7 | 94.1 | 93.5 | 100 |
| \(p=201\) | 100 | 100 | 99.9 | 99.9 | 0.0 | 0.0 | 0.5 | 0.4 | 0.5 | 0.50 |
Table 4: Modular Addition: Test Accuracy (%) w.r.t. the Modular Truth \(\hat{f}^p(a,b)=\overline{\overline{a}^{10^n}+\overline{b}^{10^n}}^p\) on the Domain \(\widetilde{\mathcal{D}}_i\) for \(i=1,2,\ldots,9\). The models and test methods are as in Table 3.

| Modulus | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| \(p=50\) | 100 | 100 | 100 | 100 | 99.3 | 92.0 | 93.1 | 95.2 | 91.4 |
| \(p=51\) | 100 | 98.5 | 99.9 | 99.3 | 95.1 | 94.4 | 92.6 | 91.3 | 92.4 |
| \(p=100\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(p=101\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(p=150\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.8 | 99.7 |
| \(p=151\) | 100 | 99.9 | 99.9 | 100 | 99.9 | 99.7 | 99.6 | 99.1 | 99.2 |
| \(p=200\) | 100 | 100 | 100 | 100 | 99.8 | 98.9 | 93.7 | 94.1 | 93.5 |
| \(p=201\) | 100 | 100 | 99.9 | 99.9 | 96.4 | 96.6 | 95.7 | 90.4 | 91.2 |

The results in Table 3 validate Theorem 3, which states that Transformer models with absolute positional embeddings trained on multi-digit modular addition exhibit distinct generalization capabilities depending on the modulus \(p\). For moduli such as \(p = 50, 100, 200\) that divide \(10^n\), the models achieve perfect test accuracy across all digit domains, generalizing flawlessly to both shorter-length and longer-length OOD domains. In contrast, for moduli such as \(p = 51, 101, 150, 151, 201\) that do not divide \(10^n\), the models maintain high accuracy on shorter-length domains but degrade sharply on longer-length ones.5

The OOD test accuracy for longer-length domains in Table 3 can be predicted exactly by Theorem 4, which states that the test accuracy on \(\widetilde{\mathcal{D}}_{n_{\text{test}}}\) (\(n_{\text{test}}>n\)) is \(\operatorname{Acc}(p,n,n_{\text{test}}) \approx 1/p'\) if \(n_{\text{test}}\geq n+\log_{10}(p'/2+1)\), and \(\operatorname{Acc}(p,n,n_{\text{test}}) = 0\) otherwise. These observations align well with the theoretical expectations of Theorem 3 and Theorem 4, and also explain the experimental results reported in the literature (see, e.g., [11]) on modular addition with different moduli.
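
The "Theory \(1/p'\)" column of Table 3 can be reproduced directly from this formula. A quick check is given below; we take the training length to be \(n=4\), which is what the table appears to use (any \(n \geq 2\) gives the same values for these moduli).

```python
from math import gcd

def theoretical_ood_accuracy(p, n):
    """Theorem 4 prediction for longer-length OOD accuracy (in %): 1/p' = gcd(p, 10^n)/p.
    This value applies once n_test >= n + log10(p'/2 + 1); below that threshold it is 0."""
    return 100 * gcd(p, 10 ** n) / p

for p in (50, 51, 100, 101, 150, 151, 200, 201):
    print(p, round(theoretical_ood_accuracy(p, 4), 2))
# -> 100, 1.96, 100, 0.99, 33.33, 0.66, 100, 0.5
```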

Furthermore, the results in Table 4 support Theorem 4, indicating that Transformer models with absolute positional embeddings trained on multi-digit modular addition learn the function \(\hat{f}^p(a,b)=\overline{\overline{a}^{10^n}+\overline{b}^{10^n}}^p\) for any modulus \(p\). These findings fully align with the theoretical predictions.

4.4 Experiments on Multiplication and Modular Multiplication↩︎

We also conducted extensive experiments on multiplication and modular multiplication, examining the performance and generalization capabilities of Transformer models under various configurations, including different positional encodings, model sizes, and training data schemes. Detailed results and additional analyses are available in Appendix 9.3 and 9.4. The outcomes consistently support our theoretical framework, demonstrating its robustness and providing further insight into the behavior of Transformer models on arithmetic reasoning tasks.

5 Discussion↩︎

Our study sheds light on the mechanistic interpretability and AI alignment of Transformer models. Understanding the mechanisms of Transformer models is crucial for ensuring their alignment with desired outcomes. Our theoretical framework provides a pathway for interpreting how these models generalize from training data to unseen tasks. This understanding is essential for aligning models with human-defined objectives, and reducing the risk of unintended behaviors.

We also observed phenomena akin to the satori phenomenon and emergence, where models suddenly exhibit a leap in understanding or capability once a critical threshold in training or data complexity is reached. This emergent behavior underscores the non-linear nature of model learning and highlights the need for further research into the conditions that trigger such phenomena.

Additionally, our work identifies challenges associated with different training data schemes, such as concatenation training without padding (e.g., "\(123+45=168;267+1=268;\)" as an input sequence) and line-by-line padding training (e.g., "\(123+45=168;\)<pad><pad><pad>" as an input sequence). These choices can significantly affect model performance and generalization, and understanding their effects is essential for refining training strategies to improve robustness and generalization.
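
To make the two schemes concrete, here is a small sketch of how the same samples would be serialized under each (the pad token and block length are our own illustrative choices):

```python
samples = ["123+45=168;", "267+1=268;"]

# Concatenation training without padding: samples are packed back-to-back into one stream
# that is later chunked into fixed-length training windows.
concat_stream = "".join(samples)                      # "123+45=168;267+1=268;"

# Line-by-line padding training: each sample is padded to a fixed block length.
BLOCK_LEN = 14
padded_lines = [s + "<pad>" * (BLOCK_LEN - len(s)) for s in samples]
# padded_lines[0] == "123+45=168;<pad><pad><pad>"
```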

6 Conclusion↩︎

In this paper, we developed a unified theoretical framework to explain various OOD generalization phenomena in Transformer models trained on arithmetic operations. This framework provides a principled understanding of why and how these models generalize across different scenarios, including \(n\)-digit addition, multiplication, and modular operations. We categorized generalization into inward OOD (generalizing to shorter-length domains) and outward OOD (generalizing to longer-length domains) to clearly delineate these behaviors.

Our theoretical analysis concludes that Transformer models with absolute positional encoding can generalize to the shorter-length OOD domain for addition, but not the longer-length domain. Relative positional encoding allows generalization to both shorter- and longer-length domains, benefiting from the translation invariance of digit addition. For multiplication, even relative positional encoding is less effective in the longer-length domain due to the lack of translation invariance. For modular operations, models generalize well to both shorter- and longer-length domains if the modulus \(p\) divides \(10^n\), regardless of positional encoding, due to the compatibility with base 10 where higher-digit positions do not affect the result. When \(p\) does not divide \(10^n\), models only generalize to the shorter-length domain, with theoretical accuracy derived for longer-length domains based on information loss and the identification of the model’s final learned function.

Through extensive experimental validation using NanoGPT, MicroGPT, and MiniGPT, we have supported our theoretical predictions. These experiments confirmed the robustness of our framework across different model scales, dataset sizes, and training data schemes. This work clarifies the mechanisms underlying generalization, addresses misconceptions in previous studies, and has significant implications for data-efficient model training and objective-oriented AI alignment, including for large language models (LLMs).

Future research should focus on extending this theoretical framework to more complex tasks and exploring additional factors that might influence OOD generalization. By continuing to build on this foundation, we can move closer to developing AI systems that are both powerful and reliably generalizable.

7 Appendix on Transformer↩︎

A Transformer model [1] predicts the next token based on the preceding tokens within the input sequence. Its output is subsequently used as input for the next prediction. For a target token \(x_{i}\) at position \(i\) in the sequence, the model generates a probability distribution over the vocabulary of potential next tokens. To be precise, let \(x=x_{1}x_{2}\ldots x_{T}\in\mathcal{V}^{T}\) denote the input sequence of tokens. The probability of observing this sequence with respect to a Transformer model is given as follows: \[\mathbf{P}_{\theta}(x)=\prod_{i=1}^{T}\mathbf{P}_{\theta}(x_{i}|x_{1},x_{2},...,x_{i-1})=\prod_{i=1}^{T}\mathbf{P}_{\theta}(x_{i}|x_{<i}).\] The conditional probability \(\mathbf{P}_{\theta}(x_{i}|x_{<i})\) is computed using the softmax function applied to the last hidden state. One way to design this model (see e.g. [20], [21]) is as follows: \[\begin{align}a^{\ell-1} & =h^{\ell-1}+\mathrm{MHA}_{\ell}(\mathrm{LN}_{\ell}^{A}(h^{\ell-1}))\\ h^{\ell} & =a^{\ell-1}+\mathrm{MLP}_{\ell}(\mathrm{LN}_{\ell}^{F}(a^{\ell-1})) \end{align}\] for \(\ell=1,2,\ldots,L\), with the initial embedding \(h^{0}=e_{tok}+e_{pos}\), where \(e_{tok}\) represents the initial token embedding and \(e_{pos}\) represents the positional embedding. In the context of GPT-series LLMs, \(\mathrm{MHA}_{\ell}\) refers to the masked multi-head attention of the \(\ell\)-th layer, \(\mathrm{MLP}_{\ell}\) is a multi-layer perceptron with one hidden layer, and \(\mathrm{LN}\) represents layer normalization. Define \(f_{\ell}\) such that \(h^{\ell}=f_{\ell}(h^{\ell-1})\). Consequently, the final hidden state of this LLM is \[h^{L}=f_{L}\circ\ldots\circ f_{2}\circ f_{1}(h^{0})\in\mathbb{R}^{d_{m}\times T},\] where \(d_{m}\) is the embedding dimension.

Let \(X=\mathrm{LN}(h^{L})=[X_{1},X_{2},\ldots,X_{T}]\). The final output conditional probability matrix is \[\mathbf{P}_{\theta}=\mathrm{softmax}(WX)=\left(\frac{\exp(WX_{i})}{\sum_{j=1}^{N_V}\exp(WX_{i})_{j}}\right)_{i=1,2,\cdots,T}\in[0,1]^{N_V\times T},\] where \(W\in\mathbb{R}^{N_V\times d_{m}}\) is a weight matrix and \(N_V=|\mathcal{V}|\) is the vocabulary size. The \(i\)-th column of the matrix \(\mathbf{P}_{\theta}\) represents the conditional probability \(\mathbf{P}_{\theta}(\tilde{x}_{i}|x_{<i})\) for any \(\tilde{x}_{i}\in\mathcal{V}\). By training on a large corpus of text, the LLM learns to estimate these probabilities.
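
A minimal PyTorch-style rendering of one such pre-LN block, matching the equations above, is given below. This is a simplified sketch rather than the exact nanoGPT implementation; stacking \(L\) such blocks on \(h^0 = e_{tok} + e_{pos}\) and applying the final LN and linear head yields \(\mathbf{P}_{\theta}\).

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-LN decoder layer:
    a^{l-1} = h^{l-1} + MHA_l(LN_l^A(h^{l-1})),  h^l = a^{l-1} + MLP_l(LN_l^F(a^{l-1}))."""
    def __init__(self, d_model, n_head):
        super().__init__()
        self.ln_a = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln_f = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, h):                       # h: (batch, T, d_model)
        T = h.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1)
        x = self.ln_a(h)
        a = h + self.attn(x, x, x, attn_mask=causal, need_weights=False)[0]  # masked MHA
        return a + self.mlp(self.ln_f(a))
```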

8 Theoretical OOD Test Accuracy for Modular Arithmetic↩︎

8.1 Theoretical OOD Test Accuracy for Modular Addition Learning↩︎

To derive an accurate analytic formula (in Theorem 4) for the OOD test accuracy on \(\widetilde{\mathcal{D}}_{m}\) with \(m>n\) when a Transformer model is trained on the domain \(\mathcal{D}_n\), we must carefully count the valid pairs \((a, b)\in \widetilde{\mathcal{D}}_{m}\) that satisfy \(\overline{a+b}^p=\overline{\overline{a}^{10^n}+\overline{b}^{10^n}}^p\).

Let \(a = A \cdot 10^n + a_0\) and \(b = B \cdot 10^n + b_0\), where \(A, B\) range from 1 to \(10^{m-n} - 1\) and \(a_0, b_0\) range from 0 to \(10^n - 1\). We require \(a + b \equiv (a \mod 10^n + b \mod 10^n) \pmod{p}\), which simplifies to \[(A + B) \cdot 10^n \equiv 0 \pmod{p}.\] Let \(p' = \frac{p}{\gcd(p, 10^n)}\). We are then left with the condition \(A + B \equiv 0 \pmod{p'}\).

The number of such pairs is determined by the frequency of multiples of \(p'\) in the valid range. The total number of pairs \((A, B)\) is \((10^{m-n} - 1)^2\). For each of the \(10^{m-n} - 1\) valid values of \(A\), the number of valid \(B\) values is the number of integers \(B\) in the range with \(B \equiv -A \pmod{p'}\), which is approximately \((10^{m-n} - 1)/p'\). The test accuracy is the ratio of valid pairs to the total number of pairs.

Note that for \(m \geq n + \log_{10}(p'/2 + 1)\), the range \(1 \leq A, B \leq 10^{m-n}-1\) contains at least one pair \((A, B)\) with \(A + B \equiv 0 \pmod{p'}\): the condition is equivalent to \(2(10^{m-n}-1) \geq p'\), i.e., the maximum of \(A+B\) reaches a multiple of \(p'\). Otherwise, no pair \((A,B)\) satisfies \(A + B \equiv 0 \pmod{p'}\).

The ultimate formula is as follows: \[\operatorname{Acc}(p,n,m) = \frac{\text{Number of Valid Pairs}}{\text{Total Number of Pairs}} \approx \frac{(10^{m-n} - 1) \cdot \left( \frac{10^{m-n} - 1}{p'} \right)}{(10^{m-n} - 1)^2} = \frac{1}{p'}\] for \(m \geq n + \log_{10}(p'/2 + 1)\), otherwise 0.

Given that \(p' = \frac{p}{\gcd(p, 10^n)}\), we have that \[\operatorname{Acc}(p,n,m) \approx \begin{cases} \frac{\gcd(p, 10^n)}{p}, & \text{if } m \geq n + \log_{10}(p'/2 + 1) \\ 0, & \text{otherwise} \end{cases}.\]
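
This counting argument is easy to verify numerically. The brute-force check below (our own script, not part of the paper's code) enumerates the pairs \((A, B)\) directly and reproduces \(1/p'\) above the threshold and \(0\) below it.

```python
from math import gcd

def exact_ood_accuracy(p, n, m):
    """Exact fraction of pairs (A, B), 1 <= A, B <= 10^(m-n) - 1, with A + B ≡ 0 (mod p'),
    i.e. the fraction of the longer-length domain on which the truncated sum agrees
    with the true modular sum."""
    p_prime = p // gcd(p, 10 ** n)
    top = 10 ** (m - n) - 1
    valid = sum(1 for A in range(1, top + 1) for B in range(1, top + 1)
                if (A + B) % p_prime == 0)
    return valid / top ** 2

print(exact_ood_accuracy(p=151, n=4, m=7))   # ≈ 0.0066 ≈ 1/151
print(exact_ood_accuracy(p=151, n=4, m=5))   # 0.0 (below the threshold n + log10(p'/2 + 1))
```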

8.2 Theoretical OOD Test Accuracy for Modular Multiplication Learning↩︎

To count the valid pairs \((a, b) \in \widetilde{\mathcal{D}}_{m}\) that satisfy \(a \times b \equiv ((a \mod 10^n) \times (b \mod 10^n)) \pmod{p}\), write \(a = A \cdot 10^n + a_0\) and \(b = B \cdot 10^n + b_0\), where \(A, B\) are the upper \((m-n)\)-digit parts and \(a_0, b_0\) are the lower \(n\)-digit parts; \(A, B\) range from 1 to \(10^{m-n} - 1\) (since they are non-zero leading parts) and \(a_0, b_0\) range from 0 to \(10^n - 1\). We need \[(A \cdot 10^n + a_0) \times (B \cdot 10^n + b_0) \equiv (a_0 \times b_0) \pmod{p}.\] This simplifies to \[A \cdot B \cdot 10^{2n} + (A \cdot b_0 + B \cdot a_0) \cdot 10^n \equiv 0 \pmod{p},\] and further to \[A \cdot B \cdot 10^n + A \cdot b_0 + B \cdot a_0 \equiv 0 \pmod{p'},\quad p'=\frac{p}{\gcd(p, 10^n)}.\]

A closed-form theoretical expression for this count is difficult to derive, but numerical solutions can be computed algorithmically for small-scale cases.
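
A small brute-force enumerator of this condition, usable for such small-scale checks, is sketched below (our own script; exhaustive enumeration is only feasible for small \(n\) and \(m-n\)).

```python
from math import gcd

def modmul_ood_accuracy(p, n, m):
    """Fraction of tuples (A, B, a0, b0) satisfying
    A*B*10^n + A*b0 + B*a0 ≡ 0 (mod p'), i.e. the inputs on which truncating both
    operands to their lower n digits still gives the correct product modulo p."""
    p_prime = p // gcd(p, 10 ** n)
    ten_n = 10 ** n
    top = 10 ** (m - n) - 1
    valid = total = 0
    for A in range(1, top + 1):
        for B in range(1, top + 1):
            for a0 in range(ten_n):
                for b0 in range(ten_n):
                    total += 1
                    valid += (A * B * ten_n + A * b0 + B * a0) % p_prime == 0
    return valid / total

# Example: modmul_ood_accuracy(p=51, n=1, m=2) enumerates 9 * 9 * 10 * 10 = 8100 cases.
```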

9 Further Results↩︎

9.1 Further Results on Addition↩︎

Figure 3: Training Loss & Out of Sample In-Distribution Test Accuracy on Addition.

Figure 4: Training Loss & Out of Sample In-Distribution Test Accuracy on Addition.

Figure 5: Digit-Wise Test Accuracy of Transformer Models with APE for Addition Tasks.

Figure 6: Correlation Between Digit Pairs of Learning \(c_i\) and \(c_j\) for Addition.

Table 5: Standard Addition: Test Accuracy (%) w.r.t. the Ground Truth \(f(a,b)=a+b\) on the Domain \(\mathcal{D}_i\) for \(i=1,2,\ldots,9\). All models are instances of MiniGPT. Accuracy is measured on 10,000 random test samples when \(n > 2\), and otherwise on the entire dataset. Model outputs are generated using maximum probability sampling.

| Training Data | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| \(\mathcal{D}_{4}\) | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 |
| \(\widetilde{\mathcal{D}}_{4}\) | 100 | 100 | 72.6 | 100 | 0 | 0 | 0 | 0 | 0 |
| \(\mathcal{D}_{5}\) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 |
| \(\mathcal{D}_{6}\) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 |
| \(\mathcal{D}_{4,5}\) | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 |
| \(\mathcal{D}_{5,6}\) | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 |
| \(\mathcal{D}_{6,7}\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 |
Table 6: Standard Addition: Test Accuracy (%) w.r.t. the Modular Truth \(\hat{f}(a,b)=\overline{a}^{10^n}+\overline{b}^{10^n}\) on the Domain \(\mathcal{D}_i\) for \(i=1,2,\ldots,9\). All models are instances of MiniGPT; test methods are as in Table 5.

| Training Data | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| \(\mathcal{D}_{4}\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(\widetilde{\mathcal{D}}_{4}\) | 100 | 99.9 | 72.3 | 100 | 99.7 | 99.7 | 99.6 | 99.7 | 99.5 |
| \(\mathcal{D}_{5}\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(\mathcal{D}_{6}\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(\mathcal{D}_{4,5}\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(\mathcal{D}_{5,6}\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(\mathcal{D}_{6,7}\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |

Figure 7: Test Accuracy on Addition When Training Short and Testing Long using a 16-Layer Transformer (Decoder only) Model with Abacus Positional Embedding.

9.2 Further Results on Modular Addition↩︎

Figure 8: Digit-wise In-Distribution Test Accuracy & Total Accuracy for Modular addition.

9.3 Further Results on Multiplication↩︎

Figure 9: Digit-wise In-Distribution Test Accuracy & Total Accuracy for Multiplication.

Table 7: Standard Multiplication: Test Accuracy (%) w.r.t. the Ground Truth \(f(a,b)=a\cdot b\) on the Domain \(\mathcal{D}_i\) for \(i=1,2,\ldots,9\). The models trained on \(\mathcal{D}_{1,2}\) and \(\mathcal{D}_{2}\) are instances of MicroGPT; the others are instances of MiniGPT. Accuracy is measured on 10,000 random test samples when \(n > 2\), and otherwise on the entire dataset. Model outputs are generated using maximum probability sampling.

| Training Data | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| \(\mathcal{D}_{1,2}\) | 100 | 100 | 0.1 | 0 | 0 | 0 | 0 | 0 | 0 |
| \(\mathcal{D}_{2}\) | 80.0 | 99.4 | 0.1 | 0 | 0 | 0 | 0 | 0 | 0 |
| \(\mathcal{D}_{3}\) | 100 | 96.4 | 99.0 | 0 | 0 | 0 | 0 | 0 | 0 |
| \(\mathcal{D}_{2,3,4}\) | 100 | 100 | 98.9 | 80.5 | 0 | 0 | 0 | 0 | 0 |
Table 8: Standard Multiplication: Test Accuracy (%) w.r.t. the Modular Truth \(\hat{f}(a,b)=\overline{a}^{10^n}\cdot\overline{b}^{10^n}\) on the Domain \(\mathcal{D}_i\) for \(i=1,2,\ldots,9\). The models and test methods are as in Table 7.

| Training Data | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| \(\mathcal{D}_{1,2}\) | 100 | 99.9 | 93.0 | 90.1 | 86.0 | 82.6 | 80.6 | 78.2 | 77.7 |
| \(\mathcal{D}_{2}\) | 85.0 | 99.4 | 98.1 | 96.7 | 89.0 | 88.9 | 88.4 | 89.8 | 88.7 |
| \(\mathcal{D}_{3}\) | 100 | 96.2 | 98.8 | 98.9 | 99.0 | 97.9 | 97.9 | 97.2 | 97.1 |
| \(\mathcal{D}_{2,3,4}\) | 100 | 100 | 98.9 | 81.0 | 75.6 | 76.2 | 73.8 | 67.5 | 66.9 |

Figure 10: Test Accuracy on Multiplication When Training Short and Testing Long Using a Looped Transformer Model with Abacus Positional Embedding.

9.4 Further Results on Modular Multiplication↩︎

Figure 11: Digit-wise In-Distribution Test Accuracy & Total Accuracy for Modular Multiplication.

Table 9: Modular Multiplication: Test Accuracy (%) w.r.t. the Ground Truth \(f^p(a,b)=\overline{a\cdot b}^p\) on the Domain \(\widetilde{\mathcal{D}}_i\) for \(i=1,2,\ldots,9\). The last column gives the theoretical accuracy.

| Modulus | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Theor. Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| \(p=50\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(p=51\) | 100 | 100 | 99.7 | 2.6 | 2.5 | 2.8 | 2.4 | 2.5 | 3.2 | 2.4 |
| \(p=100\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(p=101\) | 100 | 100 | 100 | 1.1 | 1.0 | 1.2 | 0.9 | 1.1 | 1.0 | 1.0 |
| \(p=150\) | 30.0 | 56.4 | 55.5 | 46.9 | 46.5 | 46.3 | 47.4 | 46.9 | 47.0 | 40.8 |
| \(p=200\) | 100 | 63.3 | 61.8 | 62.1 | 62.6 | 62.9 | 62.4 | 61.7 | 62.6 | 100 |
| \(p=201\) | 80.0 | 78.3 | 92.2 | 0.7 | 0.6 | 0.5 | 0.6 | 0.6 | 0.6 | 0.6 |
Table 10: Modular Multiplication: Test Accuracy (%) w.r.t. the Modular Truth \(\hat{f}^p(a,b)=\overline{\overline{a}^{10^n}\cdot \overline{b}^{10^n}}^p\) on the Domain \(\widetilde{\mathcal{D}}_i\) for \(i=1,2,\ldots,9\). The models and test methods are as in Table 9.

| Modulus | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| \(p=50\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(p=51\) | 100 | 100 | 99.7 | 99.8 | 98.4 | 84.4 | 81.9 | 68.6 | 57.2 |
| \(p=100\) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| \(p=101\) | 100 | 100 | 100 | 86.6 | 73.6 | 71.7 | 68.1 | 65.7 | 54.5 |
| \(p=150\) | 42.0 | 55.7 | 56.0 | 51.0 | 51.2 | 50.0 | 50.0 | 50.3 | 50.1 |
| \(p=200\) | 100 | 62.6 | 62.2 | 62.7 | 62.3 | 62.4 | 62.7 | 62.3 | 61.9 |
| \(p=201\) | 71.0 | 79.5 | 92.1 | 90.9 | 90.7 | 90.5 | 88.7 | 87.9 | 85.0 |

References↩︎

[1]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
[2]
OpenAI (2023). Gpt-4 technical report. ArXiv, abs/2303.08774.
[3]
Anthropic (2023). Model card and evaluations for claude models. Technical Report 2023.
[4]
Gemini, T., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
[5]
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023a). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
[6]
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023b). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
[7]
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
[8]
Lu, C., Qian, C., Zheng, G., Fan, H., Gao, H., Zhang, J., Shao, J., Deng, J., Fu, J., Huang, K., et al. (2024). From gpt-4 to gemini and beyond: Assessing the landscape of mllms on generalizability, trustworthiness and causality through four modalities. arXiv preprint arXiv:2401.15071.
[9]
Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623.
[10]
Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G., Dyer, E., and Neyshabur, B. (2022). Exploring length generalization in large language models. Advances in Neural Information Processing Systems, 35:38546–38556.
[11]
Jelassi, S., d’Ascoli, S., Domingo-Enrich, C., Wu, Y., Li, Y., and Charton, F. (2023). Length generalization in arithmetic transformers. arXiv preprint arXiv:2306.15400.
[12]
Briakou, E., Cherry, C., and Foster, G. (2023). Searching for needles in a haystack: On the role of incidental bilingualism in palm’s translation capability. arXiv preprint arXiv:2305.10266.
[13]
Xu, X., Pan, Z., Zhang, H., and Yang, Y. (2024). It ain’t that bad: Understanding the mysterious performance drop in ood generalization for generative transformer models. arXiv preprint arXiv:2308.08268, The 33rd International Joint Conference on Artificial Intelligence (IJCAI-24).
[14]
McLeish, S., Bansal, A., Stein, A., Jain, N., Kirchenbauer, J., Bartoldson, B. R., Kailkhura, B., Bhatele, A., Geiping, J., Schwarzschild, A., et al. (2024). Transformers can do arithmetic with the right embeddings. arXiv preprint arXiv:2405.17399.
[15]
Lee, N., Sreenivasan, K., Lee, J. D., Lee, K., and Papailiopoulos, D. (2023). Teaching arithmetic to small transformers. arXiv preprint arXiv:2307.03381.
[16]
Dubois, Y., Dagan, G., Hupkes, D., and Bruni, E. (2019). Location attention for extrapolation to longer sequences. arXiv preprint arXiv:1911.03872.
[17]
Hernandez, D., Brown, T., Conerly, T., DasSarma, N., Drain, D., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Henighan, T., Hume, T., et al. (2022). Scaling laws and interpretability of learning from repeated data. arXiv preprint arXiv:2205.10487.
[18]
Liu, Z., Kitouni, O., Nolte, N. S., Michaud, E., Tegmark, M., and Williams, M. (2022). Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651–34663.
[19]
Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y., He, Z., Zhou, J., Zhang, Z., et al. (2023). Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852.
[20]
Karpathy, A. (2023). The simplest, fastest repository for training/finetuning medium-sized gpts: nanogpt. GitHub https://github.com/karpathy/nanoGPT.
[21]
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
[22]
Nogueira, R., Jiang, Z., and Lin, J. (2021). Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019.
[23]
Qian, J., Wang, H., Li, Z., Li, S., and Yan, X. (2022). Limitations of language models in arithmetic and symbolic induction. arXiv preprint arXiv:2208.05051.
[24]
Duan, S., Shi, Y., and Xu, W. (2024). From interpolation to extrapolation: Complete length generalization for arithmetic transformers. arXiv preprint arXiv:2310.11984.
[25]
Abbe, E., Bengio, S., Lotfi, A., and Rizk, K. (2023). Generalization on the unseen, logic reasoning and degree curriculum. International Conference on Machine Learning.
[26]
Zhang, Y., Tiňo, P., Leonardis, A., and Tang, K. (2021). A survey on neural network interpretability. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(5):726–742.
[27]
Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al. (2022). Toy models of superposition. arXiv preprint arXiv:2209.10652.
[28]
Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. (2023). Language models can explain neurons in language models. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (accessed 14.05.2023).
[29]
Templeton, A. (2024). Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Anthropic.
[30]
Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S. J., and Kumar, S. (2019). Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077.
[31]
Alberti, S., Dern, N., Thesing, L., and Kutyniok, G. (2023). Sumformer: Universal approximation for efficient transformers. In Topological, Algebraic and Geometric Learning Workshops 2023, pages 72–86. PMLR.
[32]
Zhong, Z., Liu, Z., Tegmark, M., and Andreas, J. (2023). The clock and the pizza: Two stories in mechanistic explanation of neural networks. arXiv preprint arXiv:2306.17844.
[33]
Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). Self-attention with relative position representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2:464–468.
[34]
Su, J., Lu, Y., Pan, S., Wen, B., and Liu, Y. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
[35]
Wennberg, U. and Henter, G. E. (2021). The case for translation-invariant self-attention in transformer-based language models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2:130–140.

  1.  Corresponding authors.↩︎

  2. The code is available at https://github.com/xingchengxu/ArithmeticLLM↩︎

  3. If the training dataset has significant gaps, such as when a model is trained on \(n\)-digit addition but only with \(a_n, b_n \geq n_0\) (e.g., \(a_n, b_n \geq 6\)), it means the model never encounters pairs where both \(a_n < 6\) and \(b_n < 6\). While the digit-wise addition and carry mechanisms for positions 1 through \(n-1\) are learned correctly, since these positions involve a full range of digit pairs during training, the model fails to learn proper behavior for the \(n\)-th and \((n+1)\)-th positions. Specifically, for these positions, the model will not encounter any pairs where both digits are simultaneously less than 6. In this scenario, \(\zeta(a_n + b_n) \in \{2, 3, \ldots, 8\}\) (missing the digits 0, 1, 9), and \(\chi(a_n + b_n) \equiv 1\) (missing the digit 0). Consequently, the training dataset lacks complete coverage of all possible carry scenarios and digit summations. This substantial gap negatively affects the model’s ability to handle these edge situations. Thus, the final learned model cannot generalize to the OOD domain \(\mathcal{D}_{<n}\). Specifically, you will observe that the \((n+1)\)-th position value \(c_{n+1} \equiv 1\) for all samples in \(\mathcal{D}_{<n}\).↩︎

  4. Code to reproduce the results can be found on GitHub: https://github.com/mcleish7/arithmetic↩︎

  5. The task of performing addition modulo 150 requires an extended training duration in our experiment. To facilitate this, we prime the training process with samples that have shorter-length additions.↩︎