April 02, 2024

Off-the-shelf pre-trained language models have become the de facto standard in NLP pipelines for a multitude of downstream tasks. However, the inability of these models to properly encode numerals limits their performance on tasks requiring numeric
comprehension. We introduce strategies to semantically prime numerals in any corpus by generating anchors governed by the distribution of numerals in said corpus, thereby enabling mathematically grounded representations of these numeral tokens. We
establish the superiority of our proposed techniques through evaluation on a range of numeracy tasks for both in-domain (seen) and out-domain (unseen) numerals. Further, we expand our empirical evaluations to numerals ranging from 1 to 10 billion, a
significantly broader range compared to previous studies of the same nature, and we demonstrate significant improvements in the mathematical grounding of our learned embeddings.^{1}

*Numeracy*, at its core, is the comprehension of numbers, akin to the comprehension of words in literacy. The magnitude of a number is especially tied to its meaning [1]; as such, in developmental psychology, children able to distinguish numbers based on their magnitudes are said to possess the concept of numbers [2]. In the context of NLP, because numbers often grant objectivity to language [3], language models that can comprehend
numeric magnitude and scales allow for better inference [4], information extraction [5], and data-to-text generation [6], [7].

Numeric comprehension can indeed be induced in language models through explicit supervision [8]; however, the inherent numeric capabilities of
off-the-shelf language models induced from unsupervised training have been shown to be inadequate [4] and often fail to extrapolate to numerals *not
seen* in the training set [9], [10] - referred to as *out-of-domain* (OOD)
numerals. Approaches for numeracy induction to date either involve strategies that learn representations for numerals separately from regular tokens [11], [12] or do so by training models on numeracy-specific tasks [13], [14]. In contrast, we *prime* the numerals in the training corpus by laying anchors such that numeracy is induced via the
unsupervised pre-training of the model itself without separately training numerical embeddings. As illustrated in Figure 1, our model shows substantial improvements in numeral representations for both numerals present in the
training corpus (in-domain) as well as numerals absent from the training corpus (out-domain) over the state-of-the-art baselines.

Further, the evaluation of numeracy in language models through their ability to predict numbers in a manner similar to textual tokens [11], [15] omits the influence of rote-memorization [16]. In order to decouple the rote-memorization of numerals with respect to the linguistic context in which they appear, our study follows the evaluation protocols of [9], wherein the quality of the learned representations is assessed through a set of numeric comprehension tasks. Our contributions can be summarized as follows:

We develop new techniques for mathematical grounding of numerals in a corpus and quantitatively demonstrate significant improvements in model numeracy.

We evaluate our models on a range of numerical tasks for numerals 1 to 10 billion (\(10^{10}\)), the largest analysis scope to the best of our knowledge, and evaluate their extrapolation capabilities to unseen (out-domain) numerals.

Through rigorous evaluation, we demonstrate that the anchoring mechanisms lead to improved magnitude estimation (from *compressive representations*) and relative ordering (from *directional priming*) of numerals.

**How does one prime numerals?** The *priming* effect is a temporary change in the perception of a target stimulus that frequently occurs in conjunction with a priming stimulus [17]. Similarly, semantic priming establishes the strength of relations among items belonging to the same or different categories [18].

Now, what does this mean in the context of numerals in a training corpus? Consider numerals 0 and 10 that are both equidistant from a supposed anchor numeral 5. If a language model has never seen the numerals 0 and 10 in its training corpus, the anchor numeral 5, which the model has seen during its training, can be used to ground the magnitudes of these unseen numerals such that the model can reason about their magnitudes. *Essentially, we intend to ground the magnitudes of numerals that the model rarely sees or has never seen based on the magnitudes of the numerals that it has frequently seen, known as the anchors*.

**How are the anchors determined?** First, we extract all numerals \(X\) from a training corpus \(C\) through which we intend to induce our anchors. The intuition that anchors should be numerals widely represented (frequent) in the corpus leads to the choice of Gaussian mixture models (GMMs) in contrast to clustering methods such as k-means that lack probabilistic cluster assignment. The set of anchors is induced from the means \(\mu_{k}\) of each Gaussian \(k \in \{1, \dots, K\}\) such that each numeral \(n \in X\) can be tied to its closest anchor (Equation 1). Here, \(\mathcal{N}\) represents the Gaussian probability density function, and \(\pi_{k}\) and \(\sigma_{k}\) represent the mixing coefficient and standard deviation of the \(k\)-th Gaussian component. The initialization and the choice of \(K\) are described in A.1.

\[\label{gmm} p(n) = \sum_{k=1}^{K} \pi_{k} \mathcal{N} (n; \mu_{k}, \sigma_{k}^2) \tag{1}\]
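To make the anchor induction concrete, the following is a minimal pure-Python sketch of fitting the one-dimensional mixture in Equation 1 with expectation-maximization and reading the anchors off the component means. It is an illustration under stated assumptions, not the implementation used in this work: the function names `fit_gmm_1d` and `anchors_from_gmm`, the variance floor, and the optional `init_means` argument are all hypothetical, and a toy \(K\) is used in place of the value reported in A.1.

```python
import math
import random


def fit_gmm_1d(values, k, init_means=None, iters=200, tol=1e-3, seed=0):
    """Fit a 1-D Gaussian mixture with EM; return (weights, means, variances)."""
    rng = random.Random(seed)
    # Initialise the means by sampling data points unless the caller supplies them.
    means = list(init_means) if init_means is not None else rng.sample(list(values), k)
    spread = (max(values) - min(values)) or 1.0
    variances = [(spread / k) ** 2] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(iters):
        # E-step: responsibility of each component for each numeral.
        resp, ll = [], 0.0
        for x in values:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate pi_k, mu_k and sigma_k^2 from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, 1e-6)  # hypothetical variance floor for stability
        # Stop once the average log-likelihood changes by less than tol.
        if abs(ll - prev_ll) < tol * len(values):
            break
        prev_ll = ll
    return weights, means, variances


def anchors_from_gmm(means):
    """Anchors are the component means, sorted along the number line."""
    return sorted(means)


# Toy usage: numerals clustered around 10 and around 1000.
numerals = [9.0, 10.0, 11.0] * 20 + [990.0, 1000.0, 1010.0] * 20
weights, means, variances = fit_gmm_1d(numerals, k=2, init_means=[9.0, 1010.0])
anchors = anchors_from_gmm(means)
```

On real corpora a library implementation of GMMs would normally be preferable; the hand-written loop is used here only to keep the sketch dependency-free.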

**Devising the four categories of anchors:** Theories of the mental representation of cardinality further divide our implementation of these anchors into two halves: a continuous linear representation [19] and a compressive representation where the difference between numerals \(n\) and \(n+1\) decreases as
\(n\) increases [20]. As such, for linear representation of the number line, we associate numerals with their closest
anchor without alteration - giving us our first model *Anchors*. Similarly, for compressive representation, a given numeral \(n\) is anchored to \(m\) from a set of log-normalized
anchors such that \(\ln{(n)} \approx m\) - our second model *ln Anchors*. In both these methods, the priming is implemented through a specialized token <ANC> added to the
tokenizer.
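The two non-directional schemes can be sketched as follows. This is an illustrative sketch only: the helper names `nearest` and `prime_tokens` are hypothetical, the layout "numeral, <ANC>, anchor" is an assumed serialization, and the log-normalized anchors are assumed here to be the natural logarithms of the linear anchors, which the text above does not state explicitly.

```python
import bisect
import math


def nearest(sorted_anchors, value):
    """Return the anchor closest to `value` from an ascending list of anchors."""
    i = bisect.bisect_left(sorted_anchors, value)
    candidates = sorted_anchors[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda a: abs(a - value))


def prime_tokens(tokens, anchors, log_space=False):
    """Insert '<ANC> anchor' after every whole-number token (assumed layout)."""
    anchors = sorted(anchors)
    if log_space:
        # Assumption: log-normalized anchors are ln(anchor); anchors must be > 0.
        anchors = [math.log(a) for a in anchors]
    out = []
    for tok in tokens:
        out.append(tok)
        if not (tok.isascii() and tok.isdigit()):
            continue
        n = int(tok)
        if log_space and n == 0:
            continue  # ln(0) is undefined, so 0 is left unprimed in this sketch
        target = math.log(n) if log_space else n
        anchor = nearest(anchors, target)
        out += ["<ANC>", f"{anchor:.2f}" if log_space else str(anchor)]
    return out


sentence = "the tower is 7 metres tall".split()
linear = prime_tokens(sentence, [5, 100, 2000])
compressed = prime_tokens(sentence, [5, 100, 2000], log_space=True)
```

In the linear variant the numeral 7 is tied to the anchor 5; in the compressive variant it is tied to the log-anchor nearest to \(\ln 7\).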

Further, this priming effect is known to be symmetric with respect to the priming direction and additive to the effect of repetition priming [21].
This notion leads to our second category of models, viz. *directional anchors* represented with bi-directional arrows \(\rightleftarrows\). Thus, in addition to attaching anchors to numerals in the corpus, we signify
*where* the anchor lies in the number line with respect to the target numeral using specialized tokens <LA> (stating the anchor lies to the left of the target numeral in the number line) and <RA> (stating the anchor lies to the right of the target numeral in the number line). Training samples augmented with both <ANC> and <LA>/<RA> are depicted in Figure 2.
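A minimal sketch of the directional variant is given below. The function names and the exact token order ("numeral, <ANC>, anchor, direction token") are assumptions made for illustration; the layout actually used for training samples is the one depicted in Figure 2. When the numeral coincides with its anchor, this sketch simply omits the direction token, which is also an assumption.

```python
def direction_token(numeral, anchor):
    """<LA> if the anchor lies to the left of the numeral, <RA> if to the right."""
    if anchor < numeral:
        return "<LA>"
    if anchor > numeral:
        return "<RA>"
    return None  # numeral equals its anchor: no direction to signal


def prime_directional(tokens, anchors):
    """Attach the nearest anchor and its direction to every whole-number token."""
    out = []
    for tok in tokens:
        out.append(tok)
        if tok.isascii() and tok.isdigit():
            n = int(tok)
            anchor = min(anchors, key=lambda a: abs(a - n))
            out += ["<ANC>", str(anchor)]
            direction = direction_token(n, anchor)
            if direction is not None:
                out.append(direction)
    return out


primed = prime_directional(["7", "and", "90"], [5, 100])
```

Here the anchor 5 lies to the left of 7, so 7 receives <LA>, while the anchor 100 lies to the right of 90, so 90 receives <RA>.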

As delineated in the previous section, we evaluate four configurations of our model pre-trained on the *anchor-augmented* WikiText-103 corpus [22]: Anchors, *ln* Anchors, Anchors (\(\rightleftarrows\)), and *ln* Anchors (\(\rightleftarrows\)). The details of the datasets, pre-training and fine-tuning configurations, and embedding retrieval are described in A.2.

Models | Decoding (Log-RMSE) | | | | | Addition (Log-RMSE) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Range | [1,100] | [100, 1k] | [1k, 10k] | [10k, \(10^{10}\)] | \(\forall\) Z \(\in\) C | [1,100] | [100, 1k] | [1k, 10k] | [10k, \(10^{10}\)] | \(\forall\) Z \(\in\) C |
GenBERT | 0.0926 | 0.0301 | 0.0215 | 0.0639 | 0.0700 | 0.0250 | 0.0204 | 0.0237 | 0.0905 | 0.0752 |
MWP-BERT | 0.0633 | 0.0213 | 0.0150 | 0.0540 | 0.0575 | 0.0077 | 0.0128 | 0.0200 | 0.0871 | 0.0533 |
Anchors | 0.1279 | 0.0196 | 0.0074 | 0.0344 | 0.0424 | 0.0449 | 0.0172 | 0.0102 | 0.0442 | 0.0401 |
Anchors (\(\rightleftarrows\)) | 0.1269 | 0.0123 | 0.0057 | 0.0290 | 0.0422 | 0.0180 | 0.0122 | 0.0089 | 0.0426 | 0.0378 |
ln Anchors | 0.0279 | 0.0087 | 0.0049 | 0.0375 | 0.0304 | 0.0119 | 0.0067 | 0.0084 | 0.0572 | 0.0329 |
ln Anchors (\(\rightleftarrows\)) | 0.1729 | 0.0109 | 0.0054 | 0.0375 | 0.0525 | 0.0157 | 0.0079 | 0.0106 | 0.0585 | 0.0443 |

Models | List Maximum (Accuracy) | | | | | List Minimum (Accuracy) | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Range | [1,100] | [100, 1k] | [1k, 10k] | [10k, \(10^{10}\)] | \(\forall\) Z \(\in\) C | [1,100] | [100, 1k] | [1k, 10k] | [10k, \(10^{10}\)] | \(\forall\) Z \(\in\) C |
GenBERT | 92.49% | 91.49% | 82.50% | 82.50% | 83.50% | 94.99% | 81.50% | 83.50% | 70.49% | 86.00% |
MWP-BERT | 93.00% | 91.50% | 85.00% | 79.00% | 87.25% | 96.00% | 88.50% | 88.50% | 75.00% | 87.00% |
Anchors | 92.50% | 91.00% | 63.00% | 87.00% | 87.75% | 90.49% | 88.99% | 92.00% | 86.00% | 88.87% |
Anchors (\(\rightleftarrows\)) | 93.00% | 83.00% | 82.50% | 83.00% | 88.37% | 92.50% | 90.00% | 86.50% | 85.50% | 91.00% |
ln Anchors | 92.00% | 88.00% | 88.50% | 81.50% | 89.37% | 93.50% | 92.00% | 81.00% | 85.00% | 90.50% |
ln Anchors (\(\rightleftarrows\)) | 89.00% | 93.50% | 90.50% | 88.00% | 89.87% | 94.00% | 93.50% | 92.50% | 91.50% | 92.50% |

**GenBERT** [13]: This model is based on the pre-trained BERT model and is additionally trained for quantitative reasoning (arithmetic,
list minimum/maximum operations) with a corpus of 1 million synthetically generated quantitative reasoning prompts.

**MWP-BERT** [14]: Also based on the pre-trained BERT model, MWP-BERT is trained for solving math word problems (MWP) through the injection
of numerical properties via multiple numeracy-grounded pre-training objectives that encourage contextual representations to capture numerical information.

In line with the premise set by [9], we evaluate the performance of the model embeddings on the tasks described below for different numerical ranges. The configurations of the regressors and classifiers for these tasks are described in A.3.

**Decoding**: Given embeddings for a set of numerals, the task is to regress them to their numerical values, thus assessing the fidelity of the numerical magnitudes captured by the embeddings.

**Addition**: Given sets of concatenated embeddings of two numerals, the task is to regress them to the numerical sum of the two numerals. Beyond assessing magnitude fidelity, this task also requires number manipulation.

**List Maximum-Minimum**: While the first two tasks assess the magnitude captured by the embeddings, the task of predicting the maximum or minimum numeral in a set of randomly sampled numerals assesses whether the embeddings capture
relative ordering.
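The three probing tasks above can be turned into supervised datasets roughly as sketched here. This is a hedged illustration: the function names, the toy two-dimensional "embeddings", and the list length of 5 are hypothetical choices, and the actual regressors and classifiers are those described in A.3.

```python
import random


def decoding_pairs(emb):
    """Decoding: each embedding is paired with the value of its numeral."""
    return [(vec, float(n)) for n, vec in emb.items()]


def addition_pairs(emb, n_pairs, rng):
    """Addition: two concatenated embeddings are paired with the numerals' sum."""
    nums = list(emb)
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.choice(nums), rng.choice(nums)
        pairs.append((emb[a] + emb[b], float(a + b)))  # '+' concatenates the lists
    return pairs


def list_extreme_samples(emb, n_lists, list_len, rng, pick=max):
    """List maximum/minimum: the label is the position of the extreme numeral."""
    nums = list(emb)
    samples = []
    for _ in range(n_lists):
        chosen = rng.sample(nums, list_len)
        label = chosen.index(pick(chosen))
        samples.append(([emb[n] for n in chosen], label))
    return samples


# Toy stand-in for numeral embeddings retrieved from a language model.
toy_emb = {n: [float(n), float(n) ** 0.5] for n in range(1, 21)}
rng = random.Random(0)
dec = decoding_pairs(toy_emb)
add = addition_pairs(toy_emb, n_pairs=10, rng=rng)
mx = list_extreme_samples(toy_emb, n_lists=10, list_len=5, rng=rng, pick=max)
mn = list_extreme_samples(toy_emb, n_lists=10, list_len=5, rng=rng, pick=min)
```

Decoding and addition yield regression targets, whereas list maximum and minimum yield a class index over positions in the list.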

The results of the four tasks above are reported in Table 1 for in-domain numerals and in Table 2 for out-of-domain numerals^{2} (see A.3). Our findings paint a consistent picture:

For the lower numeral ranges \([1,100]\) and \([100,1k]\), all models appear to do well. However, the performance of the baselines decreases sharply as the magnitude of the numerals increases (for ranges \([1k,10k]\) and \([10k, 10^{10}]\)). In contrast, *Anchors* and its variants perform consistently across all the numeral ranges for both in-domain and out-of-domain numerals.

**Estimation of Numeral Magnitudes (I)**: Within our models, the first notable phenomenon we observe is that, for the decoding and addition tasks designed to assess the fidelity of numerical magnitudes captured by the numeral embeddings, the logarithmic compression (*ln* Anchors) contributes more to model performance than directional anchors (Anchors (\(\rightleftarrows\))).

**Estimation of Numeral Magnitudes (II)**: As the GMM-based anchors favor numerals frequent in the corpus, the anchors become sparse at higher numeral ranges - \([10k, 10^{10}]\). Thus, for this range specifically, we see that the model that strictly relies on directional anchors outperforms the log-compressive anchors on magnitude estimation tasks. Essentially, when the anchors are further from each other, knowing which direction they reside in with respect to the target numeral aids the model in reasoning about that numeral.

**Estimation of Relative Ordering**: The second phenomenon we observe is that, for the task of retrieving the maximum/minimum numeral from a list of numerals, designed to assess the relative ordering capabilities of the numeral embeddings, the model that leverages both compressive representations and directional priming [21] (*ln* Anchors (\(\rightleftarrows\))) has the best performance. This establishes that incorporating directional priming through directional anchors further improves the relative ordering capabilities of the numeral embeddings.

For easier comparisons among models, the measure employed for the decoding and addition tasks is *log-RMSE*; as the error is log-compressed, seemingly small changes in the log-RMSE score translate to visible changes in the numerical estimates recovered from the embeddings, as depicted in Figure 1.
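To illustrate why a small log-compressed error can correspond to a visible change in the estimated number, the sketch below uses one plausible reading of log-RMSE: the root-mean-square error between the natural logarithms of the predicted and true values. This definition, and the helper names `log_rmse` and `relative_factor`, are assumptions made for illustration; the exact formula is not spelled out in this section.

```python
import math


def log_rmse(predictions, targets):
    """RMSE between natural logs of predicted and true values (all must be > 0).

    Assumed definition for illustration only.
    """
    squared = [(math.log(p) - math.log(t)) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def relative_factor(log_error):
    """Multiplicative deviation implied by an error of `log_error` in log space."""
    return math.exp(log_error)


# A prediction that is off by a factor of e has a log-space error of 1.
score = log_rmse([10.0 * math.e], [10.0])
factor = relative_factor(score)
```

Under this reading, an error in log space maps to a multiplicative deviation in the original scale, so the absolute deviation grows with the magnitude of the numeral being estimated.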

In this paper, we have presented a simple plug-and-play BERT variant with enhanced numerical capabilities. Through our rigorous interpolation (in-domain) and extrapolation (out-of-domain) analyses, we showcase the superiority of our model in numeric comprehension while outlining the impact of logarithmic compression on magnitude estimation and the impact of directionality on relative ordering capabilities. Further, as a consequence of introducing anchors, we observe that the model learns niche pockets of similar embeddings for numerals close in magnitude (A.4).

Although the majority of recent scholarly work in this domain revolves around training models to solve math problems [14], [23], [24] or strict arithmetic [25], [26], several notable articles have looked exclusively into numeracy. [11] and [12] devise strategies with Gaussian mixture models to generate embeddings for out-of-vocabulary numeral tokens. Similarly, [10] study the impact of numeral frequency in the pre-training corpus for few-shot arithmetic reasoning. [4], [9], and [27] perform exploratory analysis of numeric comprehension through probing strategies.

The restrictions from our in-house GPU resources do not allow scaling this study to more recent models that exceed 1 billion parameters. Nevertheless, recently published baselines that we evaluate against use the same underlying architecture that we employ, viz. the base BERT model. Given that larger models also depend on the base transformer architecture [28] and use similar learning mechanisms, we believe that these observations will carry over to larger models as well.

The datasets we use in this study are established benchmark datasets from publicly accessible websites and do not contain any personally identifiable information. Our analyses do not involve human subjects and thus do not fall within the purview of the IRB.

Models | Decoding (Log-RMSE) | | Addition (Log-RMSE) | |
---|---|---|---|---|
Range | OOD [1k, 10k] | OOD [10k, \(10^{10}\)] | OOD [1k, 10k] | OOD [10k, \(10^{10}\)] |
GenBERT | 0.0132 | 0.0602 | 0.0130 | 0.0922 |
MWP-BERT | 0.0097 | 0.0537 | 0.1205 | 0.0788 |
Anchors | 0.0059 | 0.0328 | 0.0082 | 0.0419 |
Anchors (\(\rightleftarrows\)) | 0.0043 | 0.0278 | 0.0067 | 0.0409 |
ln Anchors | 0.0033 | 0.0338 | 0.0043 | 0.0557 |
ln Anchors (\(\rightleftarrows\)) | 0.0029 | 0.0347 | 0.0033 | 0.0625 |

Models | List Maximum (Accuracy) | | List Minimum (Accuracy) | |
---|---|---|---|---|
Range | OOD [1k, 10k] | OOD [10k, \(10^{10}\)] | OOD [1k, 10k] | OOD [10k, \(10^{10}\)] |
GenBERT | 86.50% | 78.49% | 90.00% | 76.00% |
MWP-BERT | 87.00% | 82.50% | 88.50% | 77.00% |
Anchors | 84.50% | 83.50% | 89.49% | 83.50% |
Anchors (\(\rightleftarrows\)) | 86.00% | 88.50% | 90.00% | 81.50% |
ln Anchors | 87.50% | 86.99% | 90.00% | 83.50% |
ln Anchors (\(\rightleftarrows\)) | 88.00% | 87.00% | 91.50% | 84.00% |

As Gaussian mixture models are sensitive to initialization methods [29], we initialize our models with random sampling from the dataset. The heterogeneous nature of the numeral distribution in the dataset makes this the most suitable initialization strategy. The models are trained to a convergence tolerance of 0.001, with each component given its own general covariance matrix. The choice of \(K\) = 1000 Gaussian components was established by identifying where the AIC and BIC values stabilize in a parameter sweep with \(K\) ranging from 10 to 5000.

The WikiText-103 corpus [22] consists of 611,725 training instances (comprising over 100 million tokens) extracted from the set of verified
*good* and *featured* articles on Wikipedia. Numeral tokens account for 2.4% of the corpus tokens, with four-digit numbers accounting for the greatest concentration of numerals (41.8%).

For both our baselines GenBERT [13] and MWP-BERT [14], the pre-trained models that the authors have provided are used as-is, thus ensuring no performance degradation as a consequence of in-house training/replication. For our Anchor models, the scheme for training follows BERT’s standard training protocol of using masked-language modeling. However, instead of randomly masking 15% of the tokens as done in BERT, we mask the anchor numeral as we intend to ground the learning of the target numerals based on their anchors. With the standard sequence size of 512 for BERT, the models were trained for 6 epochs each in a cluster of 4 Tesla P100 GPUs. The pre-trained BERT models are loaded from the Huggingface library [30].
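The anchor-masking scheme can be sketched at the word level as follows. This is a simplified illustration under assumptions: the function name `mask_anchors` is hypothetical, the anchor is assumed to be the single token immediately following <ANC>, and unmasked positions carry a `None` label. With a subword tokenizer an anchor may span several pieces, which this sketch does not handle.

```python
def mask_anchors(tokens, mask_token="[MASK]"):
    """Mask only the anchor numeral (the token right after <ANC>).

    Returns (masked_tokens, labels); labels hold the original token at masked
    positions and None elsewhere, so only anchors contribute to the MLM loss.
    """
    masked, labels = [], []
    previous = None
    for tok in tokens:
        if previous == "<ANC>":
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
        previous = tok
    return masked, labels


masked, labels = mask_anchors(["the", "tower", "is", "7", "<ANC>", "5", "metres"])
```

Compared with random masking, every prediction target here is an anchor, so the model has to infer the anchor from the target numeral and its surrounding context.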

As recommended in the original BERT configuration, we tested hidden representations from the last hidden layer as well as from the sum of the last 4 hidden layers. We observed the best performance using a sum of the last 4 hidden layer representations, which we adopt for our experimentation.
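As a small illustration of this pooling choice, the sketch below sums the last four per-layer vectors for a single token. The function name and the toy 13-layer input (an embedding layer plus 12 encoder layers, as in base BERT) are assumptions for illustration; with the Huggingface library, the per-layer states would typically be obtained by requesting `output_hidden_states=True` from the model.

```python
def sum_last_layers(hidden_states, n_layers=4):
    """Element-wise sum of the last `n_layers` per-layer vectors for one token.

    `hidden_states` is a list ordered from the first layer to the last, where
    each entry is that layer's vector (a list of floats) for the token.
    """
    selected = hidden_states[-n_layers:]
    return [sum(dims) for dims in zip(*selected)]


# Toy example: 13 layers with 2-dimensional vectors.
toy_states = [[float(layer), 1.0] for layer in range(13)]
pooled = sum_last_layers(toy_states)
```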

For consistency in our experimental results, we opted for Extreme Gradient Boosting (XGBoost) [31] for regression over standard neural networks for its robustness to parameterization. The regressors were initialized with 1000 components, with each tree having a maximum depth of 5, and trained with a learning rate of 0.01. Similarly, a standard LSTM setup with 4 stacked LSTMs coupled with a sigmoid activation for the final linear layer was used as the classifier. Each classifier was trained for 150 epochs with a learning rate of 1e-4.

As depicted in Table 1 for in-domain numerals, we perform the same set of evaluations for out-of-domain (unseen) numerals in Table 2, corroborating the same performance gains that we observed for in-domain numerals. Please note that all numerals in range [1,100] and [100, 1k] appear in the training corpus, thus only the ranges [1k, 10k] and [10k, \(10^{10}\)] qualify for OOD evaluation.

As an alternative visualization tool, we contrast heatmaps generated through the cosine similarities of numeral embeddings for the base BERT model and our model. As illustrated in Figure 3, the heatmap for the base BERT model has uniformly low cosine similarity throughout, leading to little distinction between numeral embeddings. In contrast, the heatmap for our model demonstrates sophisticated patterns of similarity for proximal numerals along its diagonal. Also seen are sections of low similarity scores in the top right and bottom left - indicating the ability to discern numerical magnitudes of lower and higher number ranges.
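The similarity matrix underlying such a heatmap can be computed as in the following sketch. The function names and the toy vectors are hypothetical; in practice the rows would be the numeral embeddings retrieved as described in A.2, ordered by numeral magnitude so that proximal numerals sit near the diagonal.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Toy vectors standing in for embeddings of numerals sorted by magnitude.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
matrix = similarity_matrix(toy_vectors)
```

Plotting `matrix` with any heatmap routine then shows high similarity along the diagonal when neighbouring numerals have similar embeddings.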

[1]

Stanislas Dehaene, Ghislaine Dehaene-Lambertz, and Laurent Cohen. 1998. Abstract representations of numbers in the animal and human brain. *Trends in neurosciences*,
21(8):355–361.

[2]

Jean Piaget. 1952. *The Child’s Conception of Number*. London: Routledge and Kegan Paul.

[3]

Theodore M Porter. 1996. Trust in numbers. Princeton University Press.

[4]

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In *Proceedings of the 27th
International Conference on Computational Linguistics*, pages 2340–2353.

[5]

Aman Madaan, Ashish Mittal, Ganesh Ramakrishnan, Sunita Sarawagi, et al. 2016. Numerical relation extraction with minimal supervision. In *Proceedings of the AAAI Conference on
Artificial Intelligence*, volume 30.

[6]

Mandar Sharma, John S Brownstein, and Naren Ramakrishnan. 2021. T3: Domain-agnostic neural time-series narration. In *2021 IEEE International Conference on Data Mining (ICDM)*,
pages 1324–1329. IEEE.

[7]

Mandar Sharma, Ajay Gogineni, and Naren Ramakrishnan. 2022. Innovations in neural data-to-text generation. *arXiv preprint arXiv:2207.12571*.

[8]

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. In *4th International Conference on Learning Representations, ICLR
2016*.

[9]

Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do nlp models know numbers? probing numeracy in embeddings. In *Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing*, pages 5307–5315.

[10]

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot reasoning. *arXiv preprint arXiv:2202.07206*.

[11]

Georgios Spithourakis and Sebastian Riedel. 2018. Numeracy for language models: Evaluating and improving their ability to predict numbers. In *Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 2104–2115.

[12]

Chengyue Jiang, Zhonglin Nian, Kaihao Guo, Shanbo Chu, Yinggong Zhao, Libin Shen, and Kewei Tu. 2020. Learning numeral embedding. In *Findings of the Association for Computational
Linguistics: EMNLP 2020*, pages 2586–2599.

[13]

Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In *Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics*, pages 946–958.

[14]

Zhenwen Liang, Jipeng Zhang, Lei Wang, Wei Qin, Jie Shao, and Xiangliang Zhang. 2022. Mwp-bert: A numeracy-augmented pre-trained encoder for math word problems. *36th Conference on
Neural Information Processing Systems (NeurIPS 2022) Workshop on Math-AI*.

[15]

Chung-Chi Chen, Hen-Hsen Huang, Hiroya Takamura, and Hsin-Hsi Chen. 2019. Numeracy-600k: Learning numeracy for detecting exaggerated information in market comments. In *Proceedings of
the 57th Annual Meeting of the Association for Computational Linguistics*, pages 6307–6313.

[16]

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, and Yoram Singer. 2020. Identity crisis: Memorization and generalization under extreme overparameterization. In *8th
International Conference on Learning Representations, ICLR 2020*.

[17]

John A. Bargh and Tanya L. Chartrand. 2000. The mind in the middle: A practical guide to priming and automaticity research.

[18]

Marco Zorzi, Ivilin Peev Stoianov, and Carlo Umilta. 2004. Computational modeling of numerical cognition.

[19]

Stanislas Dehaene. 2003. The neural basis of the weber–fechner law: a logarithmic mental number line. *Trends in cognitive sciences*, 7(4):145–147.

[20]

Stanislas Dehaene, Emmanuel Dupoux, and Jacques Mehler. 1990. Is numerical comparison digital? analogical and symbolic effects in two-digit number comparison. *Journal of experimental
Psychology: Human Perception and performance*, 16(3):626.

[21]

Bert Reynvoet, Marc Brysbaert, and Wim Fias. 2002. Semantic priming in number naming. *The Quarterly Journal of Experimental Psychology: Section A*, 55(4):1127–1139.

[22]

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. *International Conference on Learning Representations (ICLR)*.

[23]

Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In *Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing*, pages 845–854.

[24]

Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2021. Investigating the limitations of transformers with simple arithmetic tasks. *arXiv preprint arXiv:2102.13019*.

[25]

Mandar Sharma, Nikhil Muralidhar, and Naren Ramakrishnan. 2022. Overcoming barriers to skill injection in language modeling: Case study in arithmetic. *36th Conference on
Neural Information Processing Systems (NeurIPS 2022) Workshop on Math-AI*.

[26]

Mandar Sharma, Nikhil Muralidhar, and Naren Ramakrishnan. 2023. https://doi.org/10.18653/v1/2023.acl-long.340. In *Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers)*, pages 6178–6191, Toronto, Canada. Association for Computational Linguistics.

[27]

Kuntal Kumar Pal and Chitta Baral. 2021. Investigating numeracy learning ability of a text-to-text transfer model. In *Findings of the Association for Computational Linguistics: EMNLP
2021*, pages 3095–3101.

[28]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. *Advances in
neural information processing systems*, 30.

[29]

Johannes Blömer and Kathrin Bujna. 2013. Simple methods for initializing the em algorithm for gaussian mixture models. *CoRR*.

[30]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s
transformers: State-of-the-art natural language processing.

[31]

Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In *Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data
mining*, pages 785–794.

^{1} Our codebase, along with the data and pre-trained models, is anonymously hosted at https://github.com/Mandar-Sharma/Laying-Anchors

^{2} Please note that as all numerals in range [1,100] and [100, 1k] appear in the training corpus, only numeral ranges [1k, 10k] and [10k, \(10^{10}\)] qualify for OOD evaluation.