April 02, 2024
Off-the-shelf pre-trained language models have become the de facto standard in NLP pipelines for a multitude of downstream tasks. However, the inability of these models to properly encode numerals limits their performance on tasks requiring numeric comprehension. We introduce strategies to semantically prime numerals in any corpus by generating anchors governed by the distribution of numerals in said corpus, thereby enabling mathematically grounded representations of these numeral tokens. We establish the superiority of our proposed techniques through evaluation on a range of numeracy tasks for both in-domain (seen) and out-domain (unseen) numerals. Further, we expand our empirical evaluations to numerals ranging from 1 to 10 billion, a significantly broader range compared to previous studies of the same nature, and we demonstrate significant improvements in the mathematical grounding of our learned embeddings.\(^{1}\)
Numeracy, at its core, is the comprehension of numbers, akin to the comprehension of words in literacy. The magnitude of a number is especially tied to its meaning [1]; as such, in developmental psychology, children able to distinguish numbers based on their magnitudes are said to possess the concept of numbers [2]. In the context of NLP, because numbers often grant objectivity to language [3], language models that can comprehend numeric magnitude and scales allow for better inference [4], information extraction [5], and data-to-text generation [6], [7].
Numeric comprehension can indeed be induced in language models through explicit supervision [8]; however, the inherent numeric capabilities that off-the-shelf language models acquire from unsupervised training have been shown to be inadequate [4], and these models often fail to extrapolate to numerals not seen in the training set [9], [10], referred to as out-of-domain (OOD) numerals. Approaches to numeracy induction to date either learn representations for numerals separately from regular tokens [11], [12] or train models on numeracy-specific tasks [13], [14]. In contrast, we prime the numerals in the training corpus by laying anchors such that numeracy is induced through the unsupervised pre-training of the model itself, without separately training numerical embeddings. As illustrated in Figure 1, our model shows substantial improvements in numeral representations for both numerals present in the training corpus (in-domain) and numerals absent from the training corpus (out-domain) over the state-of-the-art baselines.
Further, evaluating numeracy in language models through their ability to predict numbers in a manner similar to textual tokens [11], [15] omits the influence of rote-memorization [16]. In order to decouple the rote-memorization of numerals from the linguistic context in which they appear, our study follows the evaluation protocols of [9], wherein the quality of learned representations is assessed through a set of numeric comprehension tasks. Our contributions can be summarized as:
We develop new techniques for mathematical grounding of numerals in a corpus and quantitatively demonstrate significant improvements in model numeracy.
We evaluate our models on a range of numerical tasks for numerals from 1 to 10 billion (\(10^{10}\)), the largest analysis scope to the best of our knowledge, and evaluate their extrapolation capabilities on unseen (out-domain) numerals.
Through rigorous evaluation, we demonstrate that the anchoring mechanisms lead to improved magnitude estimation (from compressive representations) and relative ordering (from directional priming) of numerals.
How does one prime numerals? The priming effect is a temporary change in the perception of a target stimulus that frequently occurs in conjunction with a priming stimulus [17]. Similarly, semantic priming establishes the strength of relations among items belonging to the same or different categories [18].
Now, what does this mean in the context of numerals in a training corpus? Consider the numerals 0 and 10, both equidistant from a supposed anchor numeral 5. If a language model has never seen the numerals 0 and 10 in its training corpus, the anchor numeral 5, which the model has seen during training, can be used to ground the magnitudes of these unseen numerals so that the model can reason about them. Essentially, we intend to ground the magnitudes of numerals that the model rarely or never sees in the magnitudes of the numerals that it sees frequently, known as the anchors.
How are the anchors determined? First, we extract all numerals \(X\) from a training corpus \(C\) through which we intend to induce our anchors. The intuition that anchors should be numerals widely represented (frequent) in the corpus leads to the choice of Gaussian mixture models (GMMs), in contrast to clustering methods such as k-means that lack probabilistic cluster assignment. The set of anchors is induced from the means \(\mu_{k}\) of each Gaussian \(k \in K\) such that each numeral \(n \in X\) can be tied to its closest anchor (Equation 1).
\[p(n) = \sum_{k=1}^{K} \pi_{k} \, \mathcal{N}(n; \mu_{k}, \sigma_{k}^{2}) \tag{1}\]
Here, \(\mathcal{N}\) represents the probability density function, and \(\pi_{k}\) and \(\sigma_{k}\) represent the mixing coefficient and standard deviation of the \(k\)-th Gaussian component. The initialization and the choice of \(K\) are described in A.1.
Devising the four categories of anchors: Theories of the mental representation of cardinality divide our implementation of these anchors into two halves: a continuous linear representation [19] and a compressive representation in which the difference between numerals \(n\) and \(n+1\) decreases as \(n\) increases [20]. As such, for the linear representation of the number line, we associate numerals with their closest anchor without alteration, giving us our first model, Anchors. Similarly, for the compressive representation, a given numeral \(n\) is anchored to \(m\) from a set of log-normalized anchors such that \(\ln{(n)} \approx m\), giving us our second model, ln Anchors. In both methods, the priming is implemented through a specialized token <ANC> added to the tokenizer.
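As a minimal sketch of Equation 1, the snippet below evaluates the mixture density and ties a numeral to its closest anchor. The component parameters are toy values chosen for illustration (a real run would estimate them from \(X\)), and "closest" is assumed here to mean smallest absolute distance to a component mean; the function names are hypothetical.

```python
import math


def gaussian_pdf(n, mu, sigma):
    """Density of N(n; mu, sigma^2)."""
    z = (n - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(n, weights, means, sigmas):
    """p(n) = sum_k pi_k * N(n; mu_k, sigma_k^2), as in Equation 1."""
    return sum(w * gaussian_pdf(n, m, s) for w, m, s in zip(weights, means, sigmas))


def closest_anchor(n, means):
    """Tie a numeral to the nearest component mean (assumed distance: |n - mu_k|)."""
    return min(means, key=lambda m: abs(n - m))


# Toy parameters for K = 3 components; illustrative only, not fitted values.
weights = [0.5, 0.3, 0.2]      # mixing coefficients pi_k, summing to 1
means = [5.0, 100.0, 2000.0]   # component means mu_k, used as anchors
sigmas = [2.0, 30.0, 500.0]    # standard deviations sigma_k

anchor_for_7 = closest_anchor(7, means)
anchor_for_1500 = closest_anchor(1500, means)
density_at_5 = mixture_density(5, weights, means, sigmas)
```

Under these toy parameters, 7 is tied to the anchor 5 and 1500 to the anchor 2000.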
Further, this priming effect is known to be symmetric with respect to the priming direction and additive to the effect of repetition priming [21]. This notion leads to our second category of models, viz. directional anchors represented with bi-directional arrows \(\rightleftarrows\). Thus, in addition to attaching anchors to numerals in the corpus, we signify where the anchor lies in the number line with respect to the target numeral using specialized tokens <LA> (stating the anchor lies to the left of the target numeral in the number line) and <RA> (stating the anchor lies to the right of the target numeral in the number line). Training samples augmented with both <ANC> and <LA>/<RA> are depicted in Figure 2.
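The sketch below illustrates how a numeral might be augmented with its anchor and a direction token. The exact layout of the augmented samples is given in Figure 2, which is not reproduced here, so the ordering `numeral <ANC> [<LA>|<RA>] anchor`, the treatment of ties as left anchors, and the helper name `prime_numeral` are all assumptions made for illustration.

```python
import math


def prime_numeral(n, anchors, directional=False, log_space=False):
    """Return an assumed token layout for numeral n (n > 0) with its anchor attached."""
    # For ln Anchors, compare ln(n) against log-normalized anchors.
    value = math.log(n) if log_space else float(n)
    anchor = min(anchors, key=lambda a: abs(value - a))
    tokens = [str(n), "<ANC>"]
    if directional:
        # <LA>: anchor lies to the left of the numeral; <RA>: to the right.
        tokens.append("<LA>" if anchor <= value else "<RA>")
    tokens.append(f"{anchor:g}")
    return tokens


anchors = [5.0, 100.0, 2000.0]                 # toy anchors, illustrative only
log_anchors = [math.log(a) for a in anchors]   # toy log-normalized anchors

plain = prime_numeral(7, anchors)
left = prime_numeral(7, anchors, directional=True)
right = prime_numeral(1500, anchors, directional=True)
compressed = prime_numeral(1500, log_anchors, log_space=True)
```

With these toy anchors, `plain` is `['7', '<ANC>', '5']` and `right` is `['1500', '<ANC>', '<RA>', '2000']`.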
As delineated in the previous section, we evaluate four configurations of our model pre-trained on the anchor-augmented WikiText-103 corpus [22]: Anchors, ln Anchors, Anchors (\(\rightleftarrows\)), and ln Anchors (\(\rightleftarrows\)). The details of the datasets, pre-training and fine-tuning configurations, and embedding retrieval are described in A.2.
Table 1: Results for in-domain numerals. Decoding and Addition report Log-RMSE; List Maximum and List Minimum report accuracy.

Models | Decoding [1, 100] | Decoding [100, 1k] | Decoding [1k, 10k] | Decoding [10k, \(10^{10}\)] | Decoding \(\forall\) Z \(\in\) C | Addition [1, 100] | Addition [100, 1k] | Addition [1k, 10k] | Addition [10k, \(10^{10}\)] | Addition \(\forall\) Z \(\in\) C |
---|---|---|---|---|---|---|---|---|---|---|
GenBERT | 0.0926 | 0.0301 | 0.0215 | 0.0639 | 0.0700 | 0.0250 | 0.0204 | 0.0237 | 0.0905 | 0.0752 |
MWP-BERT | 0.0633 | 0.0213 | 0.0150 | 0.0540 | 0.0575 | 0.0077 | 0.0128 | 0.0200 | 0.0871 | 0.0533 |
Anchors | 0.1279 | 0.0196 | 0.0074 | 0.0344 | 0.0424 | 0.0449 | 0.0172 | 0.0102 | 0.0442 | 0.0401 |
Anchors (\(\rightleftarrows\)) | 0.1269 | 0.0123 | 0.0057 | 0.0290 | 0.0422 | 0.0180 | 0.0122 | 0.0089 | 0.0426 | 0.0378 |
ln Anchors | 0.0279 | 0.0087 | 0.0049 | 0.0375 | 0.0304 | 0.0119 | 0.0067 | 0.0084 | 0.0572 | 0.0329 |
ln Anchors (\(\rightleftarrows\)) | 0.1729 | 0.0109 | 0.0054 | 0.0375 | 0.0525 | 0.0157 | 0.0079 | 0.0106 | 0.0585 | 0.0443 |

Models | List Max. [1, 100] | List Max. [100, 1k] | List Max. [1k, 10k] | List Max. [10k, \(10^{10}\)] | List Max. \(\forall\) Z \(\in\) C | List Min. [1, 100] | List Min. [100, 1k] | List Min. [1k, 10k] | List Min. [10k, \(10^{10}\)] | List Min. \(\forall\) Z \(\in\) C |
---|---|---|---|---|---|---|---|---|---|---|
GenBERT | 92.49% | 91.49% | 82.50% | 82.50% | 83.50% | 94.99% | 81.50% | 83.50% | 70.49% | 86.00% |
MWP-BERT | 93.00% | 91.50% | 85.00% | 79.00% | 87.25% | 96.00% | 88.50% | 88.50% | 75.00% | 87.00% |
Anchors | 92.50% | 91.00% | 63.00% | 87.00% | 87.75% | 90.49% | 88.99% | 92.00% | 86.00% | 88.87% |
Anchors (\(\rightleftarrows\)) | 93.00% | 83.00% | 82.50% | 83.00% | 88.37% | 92.50% | 90.00% | 86.50% | 85.50% | 91.00% |
ln Anchors | 92.00% | 88.00% | 88.50% | 81.50% | 89.37% | 93.50% | 92.00% | 81.00% | 85.00% | 90.50% |
ln Anchors (\(\rightleftarrows\)) | 89.00% | 93.50% | 90.50% | 88.00% | 89.87% | 94.00% | 93.50% | 92.50% | 91.50% | 92.50% |
GenBERT [13]: This model is based on the pre-trained BERT model and is additionally trained for quantitative reasoning (arithmetic, list minimum/maximum operations) with a corpus of 1 million synthetically generated quantitative reasoning prompts.
MWP-BERT [14]: Also based on the pre-trained BERT model, MWP-BERT is trained for solving math word problems (MWP) through the injection of numerical properties via multiple numeracy-grounded pre-training objectives that encourage contextual representations to capture numerical information.
In line with the premise set by [9], we evaluate the performance of the model embeddings on the tasks described below for different numerical ranges. The configurations of the regressors and classifiers for these tasks are described in A.3.
Decoding: Given embeddings for a set of numerals, the task is to regress them to their numerical values, thus assessing the fidelity of the numerical magnitudes captured by the embeddings.
Addition: Given sets of concatenated embeddings of two numerals, the task is to regress them to the numerical sum of the two numerals. In addition to assessing the magnitude fidelity, this task additionally requires number manipulation.
List Maximum-Minimum: While the first two tasks assess the magnitude captured by the embeddings, the task of predicting the maximum or minimum numeral in a set of randomly sampled numerals assesses whether the embeddings capture relative ordering.
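To make the three probing setups concrete, the following sketch builds one input/target pair for each task. The `toy_embed` function is a hypothetical stand-in for a model's numeral embedding, and treating the list task target as the index of the maximum or minimum element is an assumption rather than a detail stated above.

```python
import math
import random


def toy_embed(n):
    """Hypothetical stand-in for a learned numeral embedding."""
    return [math.log(n), float(n % 10)]


def decoding_example(n, embed):
    """Input: embedding of n. Target: the numeric value of n."""
    return embed(n), float(n)


def addition_example(a, b, embed):
    """Input: concatenated embeddings of a and b. Target: a + b."""
    return embed(a) + embed(b), float(a + b)


def list_extreme_example(numbers, embed, mode="max"):
    """Input: embeddings of a list. Target: index of its max or min (assumed)."""
    pick = max if mode == "max" else min
    return [embed(n) for n in numbers], numbers.index(pick(numbers))


rng = random.Random(0)
numbers = rng.sample(range(1, 100), 5)  # list length 5 is an arbitrary choice

dec_x, dec_y = decoding_example(42, toy_embed)
add_x, add_y = addition_example(42, 58, toy_embed)
max_x, max_y = list_extreme_example(numbers, toy_embed, mode="max")
min_x, min_y = list_extreme_example(numbers, toy_embed, mode="min")
```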
The results for the four tasks above (decoding, addition, list maximum, and list minimum) are reported in Table 1 for in-domain numerals and in Table 2 for out-of-domain numerals\(^{2}\) (see A.3). Our findings paint a consistent picture:
For the lower numeral ranges \([1,100]\) and \([100,1k]\), all models perform seemingly well. However, the performance of the baselines decreases sharply as the magnitude of the numerals increases (for ranges \([1k,10k]\) and \([10k, 10^{10}]\)). In contrast, Anchors and its variants perform consistently across all numeral ranges for both in-domain and out-of-domain numerals.
Estimation of Numeral Magnitudes (I): Within our models, the first notable phenomenon we observe is that, for the decoding and addition tasks designed to assess the fidelity of numerical magnitudes captured by the numeral embeddings, logarithmic compression (ln Anchors) contributes more to model performance than directional anchors (Anchors (\(\rightleftarrows\))).
Estimation of Numeral Magnitudes (II): As the GMM-based anchors favor numerals frequent in the corpus, the anchors become sparse at higher numeral ranges - \([10k, 10^{10}]\). Thus, for this range specifically, we see that the model that strictly relies on directional anchors outperforms the log-compressive anchors on magnitude estimation tasks. Essentially, when the anchors are further from each other, knowing which direction they reside in with respect to the target numeral aids the model in reasoning about that numeral.
Estimation of Relative Ordering: The second phenomenon we observe is that, for the task of retrieving the maximum/minimum numeral from a list of numerals, designed to assess the relative ordering capabilities of the numeral embeddings, the model that leverages both compressive representations and directional priming [21] (ln Anchors (\(\rightleftarrows\))) has the best performance. This establishes that incorporating directional priming through directional anchors further improves the relative ordering capabilities of the numeral embeddings.
For easier comparisons among models, the measure employed for the decoding and addition tasks is log-RMSE; as the error is log-compressed, seemingly small changes in the log-RMSE score translate to visible changes in numerical estimation through the embeddings, as depicted in Figure 1.
In this paper, we have presented a simple plug-and-play BERT variant with enhanced numerical capabilities. Through our rigorous interpolation (in-domain) and extrapolation (out-of-domain) analyses, we showcase the superiority of our model in numeric comprehension while outlining the impact of logarithmic compression on magnitude estimation and the impact of directionality on relative ordering capabilities. Further, as a consequence of introducing anchors, we find the learning of niche pockets of similar embeddings for numerals closer in their magnitudes (A.4).
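The text does not spell out the exact formula for log-RMSE. The sketch below assumes one common reading, the root mean squared error between natural logarithms of predictions and targets, purely to illustrate why a small log-space error corresponds to a large absolute error at high magnitudes. Both the definition and the function name are assumptions.

```python
import math


def log_rmse(predictions, targets):
    """RMSE between natural logs of predictions and targets (assumed definition)."""
    squared = [(math.log(p) - math.log(t)) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Toy targets; every prediction is off by the same factor exp(0.05) (about 5%).
targets = [1_000.0, 1_000_000.0]
predictions = [t * math.exp(0.05) for t in targets]

score = log_rmse(predictions, targets)
absolute_errors = [p - t for p, t in zip(predictions, targets)]
```

Under this assumed definition, a score of 0.05 corresponds to an absolute error of roughly 51 for a target of one thousand and roughly 51,000 for a target of one million.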
Although the majority of recent scholarly work in this domain revolves around training models to solve math problems [14], [23], [24] or strict arithmetic [25], [26], several notable articles have looked exclusively into numeracy. [11] and [12] devise strategies with Gaussian mixture models to generate embeddings for out-of-vocabulary numeral tokens. Similarly, [10] study the impact of numeral frequency in the pre-training corpus for few-shot arithmetic reasoning. [4], [9], and [27] perform exploratory analysis of numeric comprehension through probing strategies.
The restrictions from our in-house GPU resources do not allow scaling this study to more recent models that exceed 1 billion parameters. Nevertheless, recently published baselines that we evaluate against use the same underlying architecture that we employ, viz. the base BERT model. Given that larger models also depend on the base transformer architecture [28] and use similar learning mechanisms, we believe that these observations will carry over to larger models as well.
The datasets we use in this study are established benchmark datasets from publicly accessible websites and do not contain any personally identifiable information. Our analyses do not involve human subjects and thus do not fall within the purview of the IRB.
Table 2: Results for out-of-domain (OOD) numerals. Decoding and Addition report Log-RMSE; List Maximum and List Minimum report accuracy.

Models | Decoding OOD [1k, 10k] | Decoding OOD [10k, \(10^{10}\)] | Addition OOD [1k, 10k] | Addition OOD [10k, \(10^{10}\)] |
---|---|---|---|---|
GenBERT | 0.0132 | 0.0602 | 0.0130 | 0.0922 |
MWP-BERT | 0.0097 | 0.0537 | 0.1205 | 0.0788 |
Anchors | 0.0059 | 0.0328 | 0.0082 | 0.0419 |
Anchors (\(\rightleftarrows\)) | 0.0043 | 0.0278 | 0.0067 | 0.0409 |
ln Anchors | 0.0033 | 0.0338 | 0.0043 | 0.0557 |
ln Anchors (\(\rightleftarrows\)) | 0.0029 | 0.0347 | 0.0033 | 0.0625 |

Models | List Max. OOD [1k, 10k] | List Max. OOD [10k, \(10^{10}\)] | List Min. OOD [1k, 10k] | List Min. OOD [10k, \(10^{10}\)] |
---|---|---|---|---|
GenBERT | 86.50% | 78.49% | 90.00% | 76.00% |
MWP-BERT | 87.00% | 82.50% | 88.50% | 77.00% |
Anchors | 84.50% | 83.50% | 89.49% | 83.50% |
Anchors (\(\rightleftarrows\)) | 86.00% | 88.50% | 90.00% | 81.50% |
ln Anchors | 87.50% | 86.99% | 90.00% | 83.50% |
ln Anchors (\(\rightleftarrows\)) | 88.00% | 87.00% | 91.50% | 84.00% |
As Gaussian mixture models are sensitive to initialization [29], we initialize our models by randomly sampling from the dataset. The heterogeneous nature of the numeral distribution in the dataset makes this the optimal initialization strategy. The models are trained to a convergence tolerance of 0.001, with each component given its own general covariance matrix. The choice of \(K = 1000\) Gaussian components was established by identifying where the AIC and BIC values stabilize in a parameter sweep with \(K\) ranging from 10 to 5000.
The WikiText-103 corpus [22] consists of 611,725 training instances (comprising over 100 million tokens) extracted from the set of verified good and featured articles on Wikipedia. Numeral tokens account for 2.4% of the corpus tokens, with four-digit numbers accounting for the greatest concentration of numerals (41.8%).
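For reference, the two information criteria used in the sweep can be computed from a fitted model's log-likelihood as sketched below. The parameter count assumes a one-dimensional mixture (K means, K variances, and K - 1 free mixing coefficients). The stabilization rule in `first_stable_k`, its threshold, and the example scores are hypothetical and are not the selection procedure or values from this study.

```python
import math


def n_free_params(k):
    """Free parameters of a 1-D GMM: k means + k variances + (k - 1) weights."""
    return 3 * k - 1


def aic(log_likelihood, k):
    """Akaike information criterion: 2p - 2 ln L."""
    return 2 * n_free_params(k) - 2 * log_likelihood


def bic(log_likelihood, k, n_samples):
    """Bayesian information criterion: p ln(N) - 2 ln L."""
    return n_free_params(k) * math.log(n_samples) - 2 * log_likelihood


def first_stable_k(scores, rel_tol=0.01):
    """Return the first k whose score changes by at most rel_tol at the next k.

    `scores` is a list of (k, criterion_value) pairs sorted by k.
    This stopping rule is a hypothetical illustration of "stabilizing".
    """
    for (k_prev, s_prev), (_, s_next) in zip(scores, scores[1:]):
        if abs(s_next - s_prev) <= rel_tol * abs(s_prev):
            return k_prev
    return scores[-1][0]


# Made-up criterion values for a toy sweep; not results from this study.
toy_scores = [(1, 100.0), (2, 80.0), (4, 70.0), (8, 69.9)]
chosen_k = first_stable_k(toy_scores)
```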
For both our baselines GenBERT [13] and MWP-BERT [14], the pre-trained models that the authors have provided are used as-is, thus ensuring no performance degradation as a consequence of in-house training/replication. For our Anchor models, the scheme for training follows BERT’s standard training protocol of using masked-language modeling. However, instead of randomly masking 15% of the tokens as done in BERT, we mask the anchor numeral as we intend to ground the learning of the target numerals based on their anchors. With the standard sequence size of 512 for BERT, the models were trained for 6 epochs each in a cluster of 4 Tesla P100 GPUs. The pre-trained BERT models are loaded from the Huggingface library [30].
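The anchor-masking step can be sketched as follows. The snippet operates on whitespace-level tokens rather than BERT word pieces, and it assumes the anchor numeral is the first non-marker token after `<ANC>`, `<LA>`, or `<RA>`. Both are simplifying assumptions, as is the helper name `mask_anchors`.

```python
MARKERS = {"<ANC>", "<LA>", "<RA>"}


def mask_anchors(tokens, mask_token="[MASK]"):
    """Mask only anchor numerals; return masked tokens and per-position labels.

    Labels hold the original token at masked positions and None elsewhere,
    mirroring how masked-language-modeling targets are usually laid out.
    """
    masked, labels = [], []
    after_marker = False
    for token in tokens:
        is_anchor = after_marker and token not in MARKERS
        masked.append(mask_token if is_anchor else token)
        labels.append(token if is_anchor else None)
        after_marker = token in MARKERS
    return masked, labels


# Assumed layout of an augmented sample; the actual format is shown in Figure 2.
sample = ["The", "tower", "is", "1500", "<ANC>", "<RA>", "2000", "feet", "tall"]
masked_sample, mlm_labels = mask_anchors(sample)
```

In this example only the anchor `2000` is replaced by `[MASK]`; the target numeral `1500` and the marker tokens remain visible to the model.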
As recommended in the original BERT configuration, we tested hidden representations from the last hidden layer as well as from the sum of the last 4 hidden layers. We observed the best performance using a sum of the last 4 hidden layer representations, which we adopt for our experimentation.
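The layer-summing step amounts to an element-wise sum over the last four per-layer vectors for a token. The sketch below uses plain Python lists and toy values in place of real model outputs; with the Hugging Face library, the per-layer states would typically be obtained by requesting `output_hidden_states=True`, and how sub-word pieces of a numeral are pooled is not covered here.

```python
def sum_last_layers(hidden_states, n_layers=4):
    """Element-wise sum of the last n_layers vectors for a single token.

    `hidden_states` is a list of per-layer vectors ordered from the embedding
    layer to the final layer, each a list of floats of equal length.
    """
    selected = hidden_states[-n_layers:]
    return [sum(values) for values in zip(*selected)]


# Toy stand-in: 13 layers (embeddings + 12 encoder layers), hidden size 3.
toy_hidden_states = [[float(layer)] * 3 for layer in range(13)]

last_layer_only = toy_hidden_states[-1]
summed_last_four = sum_last_layers(toy_hidden_states)
```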
For consistency in our experimental results, we opted for Extreme Gradient Boosting (XGBoost) [31] for regression over standard neural networks for its robustness to parameterization. The regressors were initialized with 1000 components, with each tree having a maximum depth of 5, and trained with a learning rate of 0.01. Similarly, a standard LSTM setup with 4 stacked LSTMs coupled with a sigmoid activation for the final linear layer was used as the classifier. Each classifier was trained for 150 epochs with a learning rate of 1e-4.
The evaluations reported in Table 1 for in-domain numerals are repeated for out-of-domain (unseen) numerals in Table 2, corroborating the performance gains that we observed for in-domain numerals. Please note that all numerals in the ranges [1,100] and [100, 1k] appear in the training corpus; thus, only the ranges [1k, 10k] and [10k, \(10^{10}\)] qualify for OOD evaluation.
As an alternative visualization tool, we contrast heatmaps generated through the cosine similarities of numeral embeddings for the base BERT model and our model. As illustrated in Figure 3, the heatmap for the base BERT model has uniformly low cosine similarity throughout, leading to little distinction between numeral embeddings. In contrast, the heatmap for our model demonstrates sophisticated patterns of similarity for proximal numerals along its diagonal. Also seen are sections of low similarity scores in the top right and bottom left - indicating the ability to discern numerical magnitudes of lower and higher number ranges.
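The matrix underlying such a heatmap can be computed as sketched below. The embeddings are toy vectors chosen so that two nearby numerals point in similar directions and a distant one does not; they are illustrative and not taken from any model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities; entry [i][j] compares numerals i and j."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


# Toy embeddings for three numerals (two small, one large); illustrative only.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
heatmap = similarity_matrix(toy_embeddings)
```

The diagonal of the matrix is 1 by construction, so the informative structure lies in how quickly similarity falls off away from the diagonal.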
1. Our codebase, along with the data and pre-trained models, is hosted anonymously at https://github.com/Mandar-Sharma/Laying-Anchors
2. Please note that as all numerals in the ranges [1,100] and [100, 1k] appear in the training corpus, only the numeral ranges [1k, 10k] and [10k, \(10^{10}\)] qualify for OOD evaluation.