April 02, 2024
Open-domain Question Answering (OpenQA) aims at answering factual questions with an external large-scale knowledge corpus. However, real-world knowledge is not static; it updates and evolves continually. Such a dynamic characteristic of knowledge poses a vital challenge for these models, as the trained models need to constantly adapt to the latest information to make sure that the answers remain accurate. In addition, it is still unclear how well an OpenQA model can transfer to completely new knowledge domains. In this paper, we investigate the generalization performance of a retrieval-augmented QA model in two specific scenarios: 1) adapting to updated versions of the same knowledge corpus; 2) switching to completely different knowledge domains. We observe that the generalization challenges of OpenQA models stem from the reader’s over-reliance on memorizing the knowledge from the external corpus, which hinders the model from generalizing to a new knowledge corpus. We introduce Corpus-Invariant Tuning (CIT), a simple but effective training strategy, to mitigate the knowledge over-memorization by controlling the likelihood of retrieved contexts during training. Extensive experimental results on multiple OpenQA benchmarks show that CIT achieves significantly better generalizability without compromising the model’s performance in its original corpus and domain.
Open-domain Question Answering (OpenQA) [1] aims at answering factual questions using a large-scale external knowledge corpus. This is in contrast to closed-book question answering [2] wherein the model is expected to directly answer questions with no access to external knowledge. In general, closed-book QA optimizes for memorization of knowledge in model parameters, while OpenQA focuses on retrieving relevant knowledge from an external corpus. OpenQA typically employs a retrieval-augmented approach [3]–[5], involving a two-stage process: a retriever to select relevant documents, followed by a reader to derive answers from these documents. It is more practical for real-world applications as it enables the use of extensive and varied knowledge sources for answering questions.
Retrieval-augmented OpenQA models rely on an external corpus to physically store the knowledge. However, real-world knowledge is not static; it updates and evolves continually. Therefore, it is essential to build models that are able to use fresh and real-time knowledge [6], [7], but the dynamic nature of knowledge poses a vital challenge, as trained models need to constantly adapt to the latest information to ensure that their answers remain relevant and accurate. In addition, closed-book QA systems have proven limited in adapting to new information or domains due to their reliance on pre-existing knowledge, and updating their parametric knowledge requires extensive large-scale pre-training. Nevertheless, it is still unclear how well OpenQA systems can transfer to corpora and domains that were unseen during training.
In this paper, we first investigate how well state-of-the-art retrieval-augmented models, such as Atlas [8], can adapt to new and diverse knowledge corpora. Specifically, we explore the model’s performance in two scenarios: 1) adapting to updated versions of the same corpus (in §2.2.1); 2) switching to completely different domains (in §2.2.2). Our investigation involves three settings: directly applying a pre-trained model, fine-tuning the model with the new corpora, and training the model afresh on the new corpora. Initial experiments reveal that the model faces challenges in both scenarios. When directly transitioning to an updated corpus, there is a noticeable performance decline. Even additional tuning on the newer version doesn’t achieve the same effectiveness as training from scratch with the new data (56.9\(\rightarrow\)59.5 vs 62.2 in Table 1). Similar outcomes are observed when shifting from a general domain, like Wikipedia, to a specialized one, such as biomedical (41.2\(\rightarrow\)68.8 vs 69.7 in Table 2).
We hypothesize and validate that such generalization challenges stem from the reader’s over-reliance on memorizing the knowledge retrieved from the external corpus. This reliance primarily arises because the reader, whose primary training objective is optimized for QA accuracy, often opts to hard-code a substantial amount of retrieved knowledge into its parametric memory. This kind of over-memorization reduces the reader’s dependency on the retriever to choose more relevant contexts. This phenomenon hampers the model’s generalizability, particularly to updates in the knowledge corpus or changes in the knowledge domain. For instance, given the question “Who is the prime minister of the UK?”, if a model has already hard-coded the outdated answer “Boris Johnson” into its parameters (while being trained on an old corpus), it is harder to change its response even if the new information “Rishi Sunak” is available from an updated corpus.
To address this issue, we introduce Corpus-Invariant Tuning (CIT), a simple but effective training strategy to improve the corpus generalizability of retrieval-augmented text generation models. CIT aims to mitigate the reader’s tendency to memorize the documents retrieved from the corpus during training. This pushes the reader to rely more on retrieved documents to answer the input questions, rather than on facts memorized in its parameters. To achieve this, we propose a novel loss term that prevents memorization during training by controlling the likelihood of the retrieved documents. Through extensive experiments across various OpenQA benchmarks [9]–[11], carried out in both zero-shot and continual fine-tuning scenarios, we demonstrate that a retrieval-augmented model trained using our proposed CIT loss exhibits considerably enhanced generalizability across different corpora. This is evidenced by considerable improvements in exact match (EM) scores, reaching up to a 2.1% absolute gain.
Our contributions can be summarized as follows:
We propose to mitigate knowledge over-memorization of the reader to improve the generalization ability of retrieval-augmented text generation models.
We introduce Corpus-Invariant Tuning (CIT), a straightforward but effective training strategy that regularizes the reader’s likelihood of the retrieved documents to keep it from over-memorizing the corpus during training.
Through extensive experiments on multiple benchmarks, we demonstrate that training models with CIT greatly improves the generalization of OpenQA models across both newer versions of the corpora and unseen domains.
Open-domain QA aims to answer questions using only a large-scale unified corpus, where the background documents for each question are not specified in advance. Given a natural language question \(x\), our objective is to build a model \(f(\cdot)\) to predict an answer \(\hat{y}\) using a unified list of background documents \(Z\), where \(\hat{y}=f(x, Z)\). Such a setting is more practical for real-world applications because it mirrors the vast and unstructured nature of real-world knowledge.
Since the external corpus collectively stores all essential information for answering the questions, the typical strategy to tackle the OpenQA problem is to implement a retrieval-augmented approach with a two-stage framework: 1) a retriever to select a small subset of documents that are most relevant to the current question, and 2) a reader to seek useful information from the retrieved documents and generate the answer. Specifically, the probability of a predicted answer \(\hat{y}\) is decomposed as \[p\left(\hat{y}\mid x, Z\right)=\sum_{\mathcal{C}\subset Z}p\left(\mathcal{C}\mid x;\theta\right)\cdot p\left(\hat{y}\mid \mathcal{C}, x;\phi\right),\] where \(\mathcal{C}\) denotes the set of retrieved documents, and \(\theta\) and \(\phi\) are the parameters of the retriever and the reader respectively. During training, the retriever (\(\theta\)) and the reader (\(\phi\)) are often jointly optimized to ensure their effective collaboration. The optimization is typically conducted with iterative training [4] or with an Expectation-Maximization based approach that treats the retrieved documents as hidden variables [5].
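To make the decomposition concrete, the following minimal sketch evaluates the sum over a small, explicitly enumerated collection of candidate document sets. The function name `marginal_answer_prob` and all probability values are hypothetical and chosen only for illustration; a real system would typically approximate the sum with a handful of top-ranked document sets rather than all subsets of \(Z\).

```python
def marginal_answer_prob(retriever_probs, reader_probs):
    """Return sum_C p(C | x; theta) * p(y | C, x; phi) over the listed sets C."""
    return sum(
        retriever_probs[doc_set] * reader_probs[doc_set]
        for doc_set in retriever_probs
    )


# Hypothetical toy values: two candidate document sets for one question.
retriever_probs = {("d1", "d2"): 0.7, ("d1", "d3"): 0.3}  # p(C | x; theta)
reader_probs = {("d1", "d2"): 0.9, ("d1", "d3"): 0.2}     # p(y | C, x; phi)

p_answer = marginal_answer_prob(retriever_probs, reader_probs)
print(round(p_answer, 2))  # 0.7 * 0.9 + 0.3 * 0.2 = 0.69
```

In this toy case the answer probability is dominated by the document set that the retriever ranks highly and the reader finds useful, which is the collaboration that joint training is meant to encourage.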
We aim to tackle the generalization challenge for retrieval-augmented models as discussed in Section 1. Specifically, we focus on the following two main research questions (RQs):
RQ1: How to improve the model’s generalization ability across different versions (temporal snapshots) of the same corpus?
RQ2: How to improve the model’s generalization ability across the corpora in different domains?
We conduct proof-of-concept experiments to test whether current retrieval-augmented OpenQA models can remain effective when the external corpus is updated to a newer version. Specifically, we adopt the most recent retrieval-augmented model Atlas-XL [8] and test it on the Natural Questions (NQ) benchmark1 with two different versions of Wikipedia (Wiki-2017 and Wiki-2018)2 as the external corpus. We first fine-tune the Atlas-XL model on each version of Wikipedia, and then evaluate the model’s generalization ability by both zero-shot testing (train the model with Wiki-2017 and directly test it with Wiki-2018) and continual fine-tuning (train the model with Wiki-2017 and further fine-tune it with Wiki-2018). As shown in Table 1, the model performs best when initially fine-tuned with Wiki-2018, which shows that the updated knowledge base (KB) can improve the performance.3 However, there is a significant performance degradation (62.2 to 56.9 EM) when the model trained with Wiki-2017 is directly tested on Wiki-2018. Even after subsequent fine-tuning, the performance (59.5 EM) still falls short of the result obtained by training and testing with Wiki-2018 from the start. These results indicate that current retrieval-augmented models still struggle to generalize effectively when the background corpus evolves or is updated.
Training Corpus | Testing Corpus | EM Score |
---|---|---|
Wiki-2017 | Wiki-2018 | 56.9 |
Wiki-2017 \(\rightarrow\) Wiki-2018 | Wiki-2018 | 59.5 |
Wiki-2018 | Wiki-2018 | 62.2 |
We conduct similar experiments with Atlas-XL to evaluate its generalization ability across different domains. We train the model on NQ with Wiki-2018 in the general domain, and test it on the Biomedical split in RobustQA4 with PubMed in the biomedical domain. The results presented in Table 2 reveal similar performance declines in both zero-shot and continual-fine-tuning settings, which indicates that the current OpenQA models also have inherent difficulty in generalizing across different domains.
Training Corpus | Testing Corpus | EM Score |
---|---|---|
Wiki-2018 (NQ) | PubMed (Bio) | 41.2 |
Wiki-2018 (NQ) \(\rightarrow\) PubMed (Bio) | PubMed (Bio) | 68.8 |
PubMed (Bio) | PubMed (Bio) | 69.7 |
Motivated by the observed limitations in the generalization capabilities of retrieval-augmented models, we introduce Corpus-Invariant Tuning (CIT) to mitigate memorizing the lexical content of retrieved documents. Specifically, we posit that the generalization difficulties encountered by retrieval-augmented text generation models arise from the reader’s excessive memorization of the documents retrieved from the external corpus. In order to achieve higher question-answering accuracy during training, the reader tends to “hard-code” a large volume of retrieved documents rather than relying on an improved retriever for a better selection of relevant contexts, as is empirically validated in the document retrieval evaluations of Section 4.5. This limits the model’s ability to generalize: when the external corpus is updated or replaced by one from a different domain, a reader that has memorized the old documents has more difficulty adapting and correcting its knowledge than a model trained from scratch.5
Here we provide an empirical validation of our hypothesis that the degradation of model generalization ability is caused by over-memorization of retrieved knowledge. We replace the retrieved contexts with ground-truth retrieval results on Wiki-2018, and conduct a stand-alone evaluation of the reader. We report both the EM score and the overlap rate, i.e., the percentage of incorrectly predicted answers that overlap with the ground-truth retrieval results in Wiki-2017. The results are shown in Table 3. While the models transferred from Wiki-2017 perform only slightly worse in terms of EM score, they have far more error cases that overlap with the retrieved documents on Wiki-2017 (76.3% vs. 30.2% on NQ, and 80.8% vs. 41.0% on TriviaQA). These results indicate that over-memorization of contexts is the primary cause of the degradation of model generalizability.
Dataset | Training Corpus | EM Score | Overlap Rate |
---|---|---|---|
NQ | Wiki-2017\(\rightarrow\)2018 | 63.6 | 76.3 |
NQ | Wiki-2018 | 65.2 | 30.2 |
TriviaQA | Wiki-2017\(\rightarrow\)2018 | 78.1 | 80.8 |
TriviaQA | Wiki-2018 | 78.7 | 41.0 |
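The two metrics reported in Table 3 can be sketched as follows. This is only an illustration under stated assumptions: the exact answer normalization and the precise definition of "overlap" are not given in the text, so the sketch assumes a simple lowercase and whitespace normalization and counts a wrong prediction as overlapping when it appears as a substring of any old-corpus passage. The function names, dictionary keys, and the example passage are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace (a simplified, assumed normalization)."""
    return " ".join(text.lower().split())


def em_and_overlap_rate(examples):
    """Return (EM score, overlap rate), both in percent.

    Each example is a dict with 'prediction', 'gold_answers', and
    'old_docs' (passages from the older corpus version).
    """
    wrong = []
    for ex in examples:
        golds = {normalize(g) for g in ex["gold_answers"]}
        if normalize(ex["prediction"]) not in golds:
            wrong.append(ex)
    em = 100.0 * (len(examples) - len(wrong)) / len(examples)
    # Assumed overlap criterion: the wrong answer occurs in an old passage.
    overlapping = sum(
        1
        for ex in wrong
        if any(normalize(ex["prediction"]) in normalize(d) for d in ex["old_docs"])
    )
    overlap_rate = 100.0 * overlapping / len(wrong) if wrong else 0.0
    return em, overlap_rate


old_docs = ["Boris Johnson is the prime minister of the UK."]  # made-up passage
examples = [
    {"prediction": "Rishi Sunak", "gold_answers": ["Rishi Sunak"], "old_docs": old_docs},
    {"prediction": "rishi  sunak", "gold_answers": ["Rishi Sunak"], "old_docs": old_docs},
    {"prediction": "Boris Johnson", "gold_answers": ["Rishi Sunak"], "old_docs": old_docs},
    {"prediction": "Theresa May", "gold_answers": ["Rishi Sunak"], "old_docs": old_docs},
]
em, overlap = em_and_overlap_rate(examples)
```

Here two of the four predictions are correct, and one of the two errors repeats content from the old passage, so both metrics come out at 50%. A high overlap rate among errors is the signal the paragraph above interprets as memorization of the older corpus.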
To solve this problem, we propose Corpus-Invariant Tuning (CIT), a straightforward but effective method to temper the reader’s tendency to over-memorize the contents of externally retrieved documents, thereby improving the model’s generalization abilities for downstream tasks like OpenQA. As depicted in Figure 1, the core idea of CIT is to control the reader’s memorization (likelihood) of the corpus to be “invariant” by introducing an additional loss term that ensures the reader’s likelihood of the retrieved documents does not increase during training. Specifically, for each training QA pair \((x,y)\) and its retrieved document set \(\mathcal{C}\), the loss term can be written as \[\mathcal{L}_{\textit{CIT}}=\sum_{c\in\mathcal{C}}\|\log p_{\phi}\left(c\right)-\log p_{\phi_{0}}(c)\|^2,\tag{1}\] where \(\phi\) and \(\phi_0\) denote the current parameters and the original parameters6 of the reader respectively. We use \(p_{\phi}(c)\) to represent the reader’s likelihood of a retrieved document \(c\). In our experiments, we adopt the Masked Span Prediction (MSP) probability from the T5 model [13] to maintain consistency with the Atlas architecture. Essentially, we randomly mask out a fixed number of spans of the input sentence and use the model’s probability of generating these spans in the correct order as the likelihood. The overall training objective is a combination of the original loss for question answering \(\mathcal{L}_{\textit{QA}}\) and the CIT loss \(\mathcal{L}_{\textit{CIT}}\): \[\mathcal{L} = \mathcal{L}_{\textit{QA}} + \alpha\cdot\mathcal{L}_{\textit{CIT}},\tag{2}\] where \(\alpha\) is a configurable hyper-parameter.
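Equations 1 and 2 can be illustrated with a small numerical sketch. It assumes that the per-document log-likelihoods under the current reader (\(\phi\)) and the initial reader (\(\phi_0\)) have already been computed, for example by a masked-span scoring routine that is not shown here. The function names and all numeric values are hypothetical.

```python
def cit_loss(logp_current, logp_initial):
    """Equation 1: sum over documents of the squared log-likelihood drift."""
    return sum(
        (cur - init) ** 2 for cur, init in zip(logp_current, logp_initial)
    )


def total_loss(qa_loss, logp_current, logp_initial, alpha):
    """Equation 2: QA loss plus alpha times the CIT loss."""
    return qa_loss + alpha * cit_loss(logp_current, logp_initial)


# Hypothetical log-likelihoods of three retrieved documents.
logp_initial = [-42.0, -37.5, -51.0]   # under phi_0 (reader before training)
logp_current = [-40.0, -37.5, -50.0]   # under phi (reader during training)

penalty = cit_loss(logp_current, logp_initial)   # 2.0**2 + 0.0**2 + 1.0**2
loss = total_loss(1.3, logp_current, logp_initial, alpha=0.2)
```

With these values the penalty is 5.0 and the combined loss is 1.3 + 0.2 × 5.0 = 2.3. Note that, as written, the squared term in Equation 1 penalizes drift of the document likelihood in either direction, so it also discourages the likelihood from rising above its initial value.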
Figure 1: Our proposed Corpus-Invariant Tuning (CIT) Framework. In addition to the existing loss for question answering, we introduce an auxiliary CIT loss to make sure that the reader does not over memorize the retrieved contexts. Specifically, given each batch of QA pairs and the relevant documents retrieved from the corpus, the CIT loss makes sure that the reader’s likelihood of these documents does not increase.
Retrieval-augmented QA models typically maximize answer accuracy as the end-to-end training objective. However, given the distinct roles of retrievers and readers, there exist two distinct approaches through which this goal can be achieved: the model can either choose to enhance the retriever to fetch more relevant documents or simply allow the reader to memorize pertinent knowledge. While both methods can contribute to performance improvements, the former approach increases generalization and CIT biases the model away from rigid memorization by the reader.
Our experiments are conducted on two general-domain OpenQA datasets, Natural Questions (NQ) [9] and TriviaQA [10], and a large cross-domain benchmark, RobustQA [11]. Detailed statistics of these datasets are listed in Table 4.
NQ and TriviaQA are the two most widely-used open-domain QA benchmarks, which contain factual question and answer pairs created and annotated based on Wikipedia. There are 79,168 and 78,785 QA pairs for training in NQ and TriviaQA respectively. In our experiments, we conduct both fully supervised training and few-shot training settings for model evaluation.
RobustQA7 is a large-scale OpenQA evaluation benchmark specifically designed for evaluating the cross-domain generalization capabilities of OpenQA models. RobustQA includes 8 distinct domains, each equipped with its own test set and a corresponding list of background documents. The QA pairs and the documents are adopted and annotated from FiQA,8 SearchQA [14], BioASQ [15], and LOTTE [16].
Benchmark | Domain | # Test Questions | Corpus Size |
---|---|---|---|
NQ | Wikipedia | 3,610 | - |
TriviaQA | Wikipedia | 11,313 | - |
RobustQA | Web Search | 31,760 | 13,791,373 |
RobustQA | Biomedical | 1,956 | 15,559,026 |
RobustQA | Finance | 3,669 | 57,638 |
RobustQA | Lifestyle | 2,214 | 119,461 |
RobustQA | Recreation | 2,096 | 166,975 |
RobustQA | Technology | 2,115 | 638,509 |
RobustQA | Science | 1,426 | 1,694,164 |
RobustQA | Writing | 2,696 | 199,994 |
We adopt the state-of-the-art retrieval-augmented language model Atlas-XL [8] as our main baseline, which uses Contriever [17] as the retriever and a Fusion-in-Decoder (FiD) model [18] as the reader. The primary objective of our experiments is to evaluate whether the baseline model demonstrates improved performance when trained using our proposed CIT loss. In addition, we include other recent models, Flan-T5 [19], RGF [20], ReAtt [21], FiE+PAQ [22], and FiD-KD [23], for comparison. Our model is labeled Atlas-XL+CIT, which applies an additional CIT loss to control the knowledge over-memorization of the reader.
Model | Setting | Training Corpus | Testing Corpus | NQ | TriviaQA |
---|---|---|---|---|---|
Atlas-XL [8] | Closed Book | - | - | 30.2 | 41.6 |
FiD-KD [23] | Original | Wiki-2018 | Wiki-2018 | 54.7 | 67.6 |
ReAtt [21] | Original | Wiki-2018 | Wiki-2018 | 54.7 | - |
FiE+PAQ [22] | Original | Wiki-2018 | Wiki-2018 | 58.4 | 72.6 |
Atlas-XL | Original | Wiki-2017 | Wiki-2017 | 58.8 | 75.5 |
Atlas-XL + CIT | Original | Wiki-2017 | Wiki-2017 | 58.9 | 75.5 |
Atlas-XL | Zero-shot Transfer | Wiki-2017 | Wiki-2018 | 56.9 | 75.1 |
Atlas-XL + CIT | Zero-shot Transfer | Wiki-2017 | Wiki-2018 | 58.6 | 75.5 |
Atlas-XL | Full-training Transfer | Wiki-2017\(\rightarrow\)Wiki-2018 | Wiki-2018 | 59.5 | 76.8 |
Atlas-XL + CIT | Full-training Transfer | Wiki-2017\(\rightarrow\)Wiki-2018 | Wiki-2018 | 61.6 | 77.4 |
Corresponding to the two research questions proposed in previous sections, we first focus on evaluating our model’s ability to generalize across different versions of the external corpus. We adopt the Wikipedia-domain benchmarks NQ and TriviaQA in our experiments, and test their cross-corpus generalization abilities on different Wikipedia versions. Similar to the setting in our preliminary experiments presented in Section 2.2, we use Wiki-2017 and Wiki-2018 as our background corpora, and we consider both zero-shot and fully-supervised settings to test the generalization ability of a trained model. Specifically, we use the following terms to denote different experiment settings:
Closed Book: The model is trained and tested without a retriever. The reader alone is responsible for understanding questions and providing answers.
Original: Also known as the Open Book setting. This is the most typical experiment setting for retrieval-augmented models, where a retriever retrieves a set of documents from the external corpus, and the reader uses these documents to generate an answer. We use the label Original to emphasize the absence of cross-corpus generalization in this setting, providing a baseline for comparison with the following settings to evaluate the model’s generalization ability.
Zero-shot Transfer: The model is initially trained with the older version of a knowledge corpus, and directly tested with the updated corpus version in a zero-shot manner without any additional fine-tuning.
Full-training Transfer: In contrast to the zero-shot transfer setting, after being initially trained with the older version of a knowledge corpus, the model is further fine-tuned on the same training QA pairs with an updated version of the corpus, before being tested with the new corpus.
We conduct generalization experiments in both zero-shot and full-training settings on the NQ and TriviaQA datasets, and the results are shown in Table 5. In the Original setting, where the model is trained and tested on the same corpus (Wiki-2017), incorporating a CIT loss to reduce knowledge over-memorization does not diminish the task performance compared with the baseline Atlas-XL; in fact, it can even slightly enhance it in certain cases (58.8 to 58.9 EM on NQ with Wiki-2017). This is probably because when the reader is discouraged from rigid knowledge memorization, the retriever still has enough room for improvement to retrieve better documents and enhance the performance. Moreover, in both the Zero-shot Transfer and Full-training Transfer settings, the CIT loss significantly improves the generalization performance of the model across different versions of a knowledge corpus (for example, 56.9 to 58.6 and 59.5 to 61.6 EM on NQ). This is likely because when the reader is discouraged from hard-coding knowledge into its parameters, it becomes more receptive to assimilating and utilizing new information from the retriever. In summary, our proposed CIT loss significantly improves the model’s ability to generalize across different versions of the external corpus, without compromising the absolute task performance for OpenQA.
To address our second research question, in this section we conduct experiments to evaluate whether our proposed CIT loss can help the model generalize better across different knowledge domains. We conduct evaluations using the RobustQA benchmark, which encompasses eight diverse domains specifically tailored for OpenQA. We first assess the model’s ability to generalize from a general domain (Wikipedia) to these eight diverse domains. Subsequently, we evaluate the model’s effectiveness in generalizing interchangeably across these eight domains.
Method | Average | Biomedical | Search | Finance | Life | Recreation | Technology | Science | Writing |
---|---|---|---|---|---|---|---|---|---|
RGF | 23.5 | 33.8 | 49.0 | 13.2 | 20.2 | 19.1 | 17.1 | 15.5 | 20.3 |
Flan-T5-XL | 32.1 | 43.1 | 70.9 | 14.6 | 25.5 | 25.4 | 21.3 | 23.9 | 32.1 |
Atlas-base | 28.3 | 40.0 | 59.2 | 15.6 | 23.8 | 22.8 | 19.8 | 18.3 | 27.3 |
Atlas-base + CIT | 30.5 | 40.3 | 60.9 | 15.5 | 26.7 | 24.9 | 19.9 | 23.6 | 32.3 |
Atlas-XL | 33.2 | 41.2 | 61.0 | 19.9 | 32.0 | 27.9 | 22.2 | 24.8 | 36.7 |
Atlas-XL + CIT | 35.4 | 43.7 | 71.5 | 20.1 | 33.8 | 28.1 | 22.2 | 26.9 | 37.0 |
We first evaluate how well a model trained with Wikipedia can generalize to the eight specific domains in RobustQA. Specifically, all the models are first fine-tuned on the NQ dataset with Wiki-2018, and then tested with the eight domain-specific benchmarks. The results are presented in Table 6. In general, adding the CIT loss to an Atlas-XL model significantly improves the average F1 score across the eight domains in RobustQA (33.2 to 35.4), setting a new state of the art among all 3B (XL)-sized models. Our proposed CIT loss also helps smaller models, such as Atlas-base, with a 2.2% absolute improvement in F1 score. Within these domains, we can see that domains like Life show significant improvement. This is probably due to their larger overlap with Wikipedia, which makes it more crucial to avoid over-memorization, so that the old Wikipedia knowledge does not affect generalization to the new domain. In contrast, domains like Biomedical exhibit smaller improvements. This is possibly because a Biomedical-domain KB has a smaller overlap with Wikipedia, thereby reducing the negative impact of knowledge over-memorization on the model’s generalizability.
Figure 2: The result heatmaps for cross-domain generalization experiments. Each value in the heatmap represents the absolute improvement (compared with Atlas-XL) of cross-domain relative performance (CRP) defined in Equation 3 . Darker green indicates larger improvements in cross-domain generalization.
In addition to testing the model trained on Wikipedia, we also assess the model’s capability to generalize between each pair of domains in RobustQA. We define the cross-domain relative performance (CRP) as the evaluation metric to intuitively characterize how well the model can generalize from a source domain \(s\) to a target domain \(t\). Specifically, \(\textit{CRP}(s,t)\) is defined as the ratio of cross-domain performance to intra-domain performance: \[\textit{CRP}(s,t)=\frac{\textit{score}(s,t)}{\textit{score}(t,t)}\tag{3}\] where \(\textit{score}(s,t)\) is the performance (F1 score) obtained by training the model in the source domain \(s\) and testing it in the target domain \(t\). In Figure 2, we vary the CIT strength \(\alpha\) and visualize the model’s generalizability as heatmaps. Each heat value \(h(s,t)\) stands for the absolute improvement in CRP over the baseline model Atlas-XL: \[h(s,t)=\textit{CRP}_{\alpha}(s,t) - \textit{CRP}_{\textit{Atlas-XL}}(s,t),\] where darker green indicates larger improvements and darker red indicates larger declines. We can already observe improvements for most domain pairs with \(\alpha=0.1\), and when \(\alpha\) reaches \(0.3\), the improvements become much more significant and all domain pairs benefit from CIT in terms of cross-domain generalization.
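As a worked illustration of Equation 3 and the heat value \(h(s,t)\), the sketch below computes both from a table of scores indexed by (training domain, testing domain). The function names and every score in the example are hypothetical; they are not results from the paper.

```python
def crp(scores, source, target):
    """Equation 3: cross-domain score divided by the intra-domain score."""
    return scores[(source, target)] / scores[(target, target)]


def heat_value(scores_cit, scores_baseline, source, target):
    """h(s, t): CRP of the CIT-trained model minus CRP of the baseline."""
    return crp(scores_cit, source, target) - crp(scores_baseline, source, target)


# Made-up F1 scores keyed by (training domain, testing domain).
baseline = {("bio", "finance"): 15.0, ("finance", "finance"): 30.0}
with_cit = {("bio", "finance"): 18.0, ("finance", "finance"): 30.0}

h = heat_value(with_cit, baseline, "bio", "finance")  # 0.6 - 0.5
```

In this toy case the baseline retains 50% of the in-domain score when transferred and the CIT variant retains 60%, giving a heat value of about 0.1. Normalizing by the in-domain score makes values comparable across target domains whose absolute difficulty differs.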
The proposed CIT training loss reduces the reader model’s memorization tendency, leading to greater reliance on the documents retrieved. This enhanced dependency during training on the retrieved documents appears to enhance the retriever’s performance, as seen from the improvements in retrieval performance observed in Table 7 upon integrating CIT loss. Additionally, we measured the coverage of the reader’s predicted answer in the retrieved documents for the NQ benchmark, noting an increase in coverage within the top 40 documents from 66.9% to 69.1%. This suggests that our proposed CIT training loss leads to an increased reliance by the reader on retrieved documents rather than corpus memorization.
Model | R@10 | R@20 | R@40 |
---|---|---|---|
Atlas-XL | 79.7 | 84.3 | 88.4 |
Atlas-XL + CIT | 85.2 | 88.9 | 91.5 |
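The text does not spell out how R@k is computed. A common convention in OpenQA, assumed here, is to count a question as a hit when at least one of its top-k retrieved passages contains a gold answer string. The sketch below follows that assumption; the function name and the example passages are hypothetical.

```python
def recall_at_k(ranked_docs, gold_answers, k):
    """Percent of questions whose top-k passages contain a gold answer string."""
    hits = 0
    for docs, answers in zip(ranked_docs, gold_answers):
        top_k = [d.lower() for d in docs[:k]]
        if any(a.lower() in d for d in top_k for a in answers):
            hits += 1
    return 100.0 * hits / len(ranked_docs)


# Made-up retrieval results for two questions, best-ranked passage first.
ranked_docs = [
    ["Paris is the capital of France.", "France is in Europe."],
    ["Sydney is a large city.", "Canberra is the capital of Australia."],
]
gold_answers = [["Paris"], ["Canberra"]]

r_at_1 = recall_at_k(ranked_docs, gold_answers, k=1)
r_at_2 = recall_at_k(ranked_docs, gold_answers, k=2)
```

In the example, only the first question has an answer-bearing passage at rank 1, so R@1 is 50% while R@2 is 100%.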
Figure 3: Parameter sensitivity on choices of \(\alpha\).
We then conduct a more in-depth study of the model’s sensitivity to the most important hyper-parameter \(\alpha\), which controls the strength of the corpus-invariant tuning (as shown in Equation 2). By choosing different values of \(\alpha\), we compute the average cross-domain relative performance across both different corpus versions (RQ1) and different domains (RQ2), and the results are shown in Figure 3. We can observe that as \(\alpha\) becomes larger, the generalization performance initially improves, and then starts to decline after reaching its peak. This trend is likely because if memorization is suppressed too strongly, the reader may fail to retain certain shared global knowledge that is actually beneficial for knowledge generalization. From Figure 3 we can observe that \(\alpha=0.2\) is best for generalization across different corpus versions (RQ1), and \(\alpha=0.3\) is best for the cross-domain setting (RQ2).
Enhancing text generation through the use of retrieved contexts has proven effective in a variety of knowledge-intensive downstream tasks. The most typical design for retrieval augmentation is to employ a retriever, which is jointly optimized with the reader in an end-to-end manner [3], [18], [21]–[24]. The training methods include iterative training [4], and also EM-based algorithms to treat retrieved documents as hidden variables [5]. Retrieval augmentation can also act as plug-in modules [25]. In this setting, the retrievers are not jointly trained with the reader, and the retrieved documents are used in an in-context manner [26], [27]. Recently, Atlas [8] trains and releases a new set of retrieval-augmented models based on the T5 architectures [13], which achieves state-of-the-art performances on OpenQA tasks in few-shot settings.
The task of OpenQA [1], [9], [10] aims at building models to answer questions without pre-specified background documents. Given the high demand for external knowledge in this task, the standard approach involves integrating external corpora with a knowledge retriever to supply supporting evidence for answering questions [18], [28]. Recently, researchers have also focused on new problem settings such as conversational QA [29] and multi-hop QA [30], and on new knowledge-enhanced solutions such as knowledge graphs [31] and multi-hop reasoning [32].
Because retrieval-augmented OpenQA models require an external KB to provide supporting documents, it is important to make sure that the model is robust and generalizable across different versions and domains of knowledge. A few previous studies focus on the generalizability of OpenQA models. For example, [33] focuses on the question generalizability of OpenQA models, and conducts a detailed analysis of the generalization performance of current OpenQA models. [34]–[36] propose training with synthetic data to improve the robustness of retrieval models in OpenQA settings. [37] focuses on the domain adaptation problem of OpenQA, and proposes a reconstruction-based auxiliary loss to improve the model’s generalizability. Regarding dataset development, RobustQA [11] creates a new benchmark that involves multiple real-world domains. However, no previous studies aim to tackle the model’s generalization ability across both different corpus versions and different knowledge domains. Also, we are the first to tackle this problem from a knowledge memorization perspective, enhancing the model’s generalization ability by reducing rigid memorization by the reader module.
In this paper, we present Corpus-Invariant Tuning (CIT), a simple but effective training strategy to improve the generalization ability of a retrieval-augmented text generation model across different corpus versions and different knowledge domains. The main idea of CIT is to mitigate rigid knowledge memorization during training, so that the reader module can easily accept new knowledge in the retrieved documents and adapt to novel unseen domains. Specifically, we control the reader’s likelihood of the retrieved documents during training, to make sure that over-memorization of corpus knowledge is prevented. Extensive experiments are conducted on multiple OpenQA datasets in both zero-shot and fully-supervised training settings, and the results demonstrate that training the model with the proposed CIT loss significantly improves the model’s generalizability across different corpus versions and knowledge domains, without sacrificing the model’s inherent performance in its original domain.
Although retrieval-augmented text generation models are effective for many knowledge-intensive tasks, they have an inherent limitation in their large memory requirements. To ensure time efficiency in the retrieval process, an index of the external corpus, often vast in size, must be pre-constructed. Another notable limitation of CIT is that the extent of memorization mitigation depends on a hyper-parameter, which is tuned and chosen manually. Ideally, the model should be able to automatically determine the best level of memorization mitigation to reach an optimal balance between parametric knowledge and retrieval augmentation. This is an exciting new research topic, and we will explore it in future work.
We thank the anonymous reviewers for their constructive suggestions. This research is based upon work supported by U.S. DARPA KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
We train our model on 4 NVIDIA A100 GPUs with 80 GB memory, and the total training time is about 3.5 hours for the model to converge on an OpenQA dataset. During training, the index of the external corpus is pre-computed and equally sharded across all 4 GPUs. We adopt distributed data parallelism, placing a copy of the model on each GPU and splitting each data batch across the 4 devices.
Detailed hyper-parameter search range and choices are shown in Table 8 and Table 9 respectively. The choices are made by grid search.
Hyper-parameters | Searching Range |
---|---|
Maximum Length of FiD | [384, 512, 768] |
# Retrieved Contexts | [20, 30, 40] |
Generation Length | [10, 15, 20, 25, 30] |
Masked Percentage for CIT | [0.1, 0.15, 0.2] |
Strength of CIT \(\alpha\) | [0.1, 0.2, 0.3, 0.4, 0.5, 0.6] |
Learning Rate | [1e-5, 2e-5, 3e-5, 4e-5, 5e-5] |
Batch Size | [4, 8, 12, 16] |
Maximum Training Steps | [500, 1,000, 1,500] |
Warm-up Steps | [50, 100, 150] |
Weight Decay | [1e-2, 1e-3, 1e-4] |
Retriever Dropout | [0.1, 0.2, 0.3] |
Reader Dropout | [0.1, 0.2, 0.3] |
Hyper-parameters | Values |
---|---|
Maximum Length of FiD | 512 |
# Retrieved Contexts | 40 |
Generation Length | 20 |
Masked Percentage of CIT | 0.15 |
Strength of CIT \(\alpha\) | 0.2 (RQ1); 0.3 (RQ2) |
Learning Rate | 4e-5 |
Batch Size | 8 |
Maximum Training Steps | 500 |
Warm-up Steps | 50 |
Weight Decay | 1e-2 |
Retriever Dropout | 0.1 |
Reader Dropout | 0.1 |
The main idea of CIT is to control knowledge over-memorization to improve the model’s generalization ability across different corpora and different domains. Such a training strategy is not only effective for retrieval-augmented encoder-decoder models like Atlas, but can theoretically also apply to larger-scale auto-regressive foundation models. Therefore, we also conduct preliminary experiments with LLaMA-2-7b [38] and Contriever as a case study to evaluate the robustness and generality of our method across different language model architectures. Specifically, we freeze the retriever, and use the retrieved documents as the input prompt to generate the answer for each question. While maximizing the probabilities of the correct answers, we apply CIT in a similar way by maintaining the direct log-likelihood of these retrieved contexts \(\mathcal{C}\): \[\mathcal{L}_{\textit{CIT}}=\sum_{c\in\mathcal{C}}\|\log p'_{\phi}\left(c\right)-\log p'_{\phi_{0}}(c)\|^2.\] Different from the masked span prediction probability \(p_{\phi}\) in Equation 1, the \(p'_{\phi}\) here represents the language modeling probability of the entire passage \(c\). Denoting the \(i\)-th word in \(c\) as \(c_i\) and the number of words in \(c\) as \(N\), \(p'_{\phi}\) can be formulated as \[p'_{\phi}(c) = \prod_{i=1}^{N-1}p\left(c_{i+1} \mid c_{1:i}\right).\] As shown in Table 10, with auto-regressive language models like LLaMA, our proposed CIT can still achieve considerable improvements in the model’s generalization ability.
Dataset | Model | EM Score |
---|---|---|
NQ | LLaMA2-7b | 54.2 |
NQ | LLaMA2-7b + CIT | 57.7 |
TriviaQA | LLaMA2-7b | 69.9 |
TriviaQA | LLaMA2-7b + CIT | 71.4 |
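For the auto-regressive variant, the passage likelihood \(p'_{\phi}(c)\) is a product of next-token probabilities, so its logarithm is a sum of per-token log-probabilities. The sketch below assumes those conditional probabilities have already been read off the language model; the function name and the probability values are hypothetical.

```python
import math


def sequence_log_likelihood(next_token_probs):
    """log p'(c): sum of log p(c_{i+1} | c_{1:i}) over the passage."""
    return sum(math.log(p) for p in next_token_probs)


# Made-up next-token probabilities for one short passage.
probs_initial = [0.5, 0.25, 0.5]   # under phi_0 (model before training)
probs_current = [0.5, 0.5, 0.5]    # under phi (model during training)

logp_initial = sequence_log_likelihood(probs_initial)   # log(1/16)
logp_current = sequence_log_likelihood(probs_current)   # log(1/8)

# Contribution of this passage to the CIT loss: squared log-likelihood drift.
drift_penalty = (logp_current - logp_initial) ** 2       # (log 2) ** 2
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow for long passages, which is why the loss is naturally expressed in log space.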
The Wikipedia dumps from 2017 and 2018, respectively.
The NQ benchmark was annotated in 2018, so Wiki-2018 is a more up-to-date background KB for the task.
Such a phenomenon can be caused by the exposure bias problem as discussed in [12].
\(\phi_0\) denotes the initial reader’s parameters before training.