Predicting the Performance of Foundation Models via Agreement-on-the-Line

Aman Mehra\(^{*1}\), Rahul Saxena\(^{*1}\), Taeyoun Kim\(^{*1}\), Christina Baek\(^{1}\),
Zico Kolter\(^{1,2}\), Aditi Raghunathan\(^{1}\)

Carnegie Mellon University\(^1\), Bosch Center for AI\(^2\)
{amanmehr, rsaxena2, taeyoun3, kbaek, zkolter, raditi}@cs.cmu.edu


Abstract

Estimating out-of-distribution (OOD) performance in regimes where labels are scarce is critical to safely deploying foundation models. Recently, it was shown that ensembles of neural networks observe the phenomenon “agreement-on-the-line”, which can be leveraged to reliably predict OOD performance without labels. However, in contrast to classical neural networks that are trained on in-distribution data from scratch for numerous epochs, foundation models undergo minimal finetuning from heavily pretrained weights, which may reduce the ensemble diversity needed to observe agreement-on-the-line. In our work, we first demonstrate that when lightly finetuning multiple runs from a single foundation model, the choice of randomness during training (linear head initialization, data ordering, and data subsetting) can lead to drastically different levels of agreement-on-the-line in the resulting ensemble. Surprisingly, only random head initialization reliably induces agreement-on-the-line in finetuned foundation models across vision and language benchmarks. Second, we demonstrate that ensembles of multiple foundation models pretrained on different datasets but finetuned on the same task can also show agreement-on-the-line. In total, by careful construction of a diverse ensemble, we can utilize agreement-on-the-line-based methods to predict the OOD performance of foundation models with high precision.

Figure 1: The ID versus OOD accuracy and agreement trends for CIFAR10 to CIFAR10C “Pixelate” in linear probed CLIP, MNLI to SNLI in full finetuned OPT, and SQuAD to SQuAD-Shifts “Amazon” in full finetuned GPT2 models. Models observe different agreement linear fits depending on the diversity source used to generate the model ensemble (columns: Random Head, Data Ordering, Data Subsetting).

1 Introduction↩︎

Foundation models (FMs), or large models first pretrained on open world data then finetuned or prompted for a specific downstream task, have proven to be powerful solutions for many common machine learning problems. A notable trait of FMs is that they are far more robust to distribution shift than other deep learning approaches: across image and language benchmarks, they suffer a smaller performance degradation on out-of-distribution (OOD) data that may vary substantially from the in-distribution (ID) finetuning data [1][6]. From clinical decision-making in different hospitals to navigating robots through unseen terrains, FMs are increasingly utilized for tasks prone to distribution shift. However, evaluating these models in OOD settings remains difficult: in many cases, acquiring labels for OOD data is costly and inefficient, while unlabeled OOD data is much easier to collect. Although the field has explored other means of estimating OOD accuracy without labeled data, they are not ideal for FMs. A reliable FM performance estimator has the following desirable properties. First, the method must be computationally efficient to account for FMs’ large model size. Second, FMs are leveraged for many different tasks (e.g., classification, question-answering, regression), so the method should also be versatile across tasks. Third, as we will see, methods for finetuned FMs may require different model assumptions from neural networks trained from scratch.

Recently, [7] proposed a promising method for estimating the OOD accuracy of deep networks using the agreement between pairs of these classifiers (i.e., how often two classifiers make the same prediction). For distribution shifts where models observe a strong linear correlation in ID versus OOD accuracy, a common phenomenon in vision and language benchmarks [8], [9], a strong linear correlation also holds for ID versus OOD agreement with extremely similar slopes and intercepts. These effects are referred to as accuracy-on-the-line (ACL) and agreement-on-the-line (AGL), respectively, and together they provide a simple method for estimating OOD accuracy via unlabeled data alone. Namely, without any OOD labels, we can instead measure the linear fit of ID versus OOD agreement as a proxy for the linear fit of ID versus OOD accuracy. We can then verify whether ACL holds using the correlation strength of agreement’s linear trend and estimate each model’s OOD accuracy by linearly transforming ID accuracy using this approximate slope and bias. This simple approach has been shown to reliably predict the OOD accuracy of models within a few percentage points across classification and question-answering tasks.

Unfortunately, while the method has several practical advantages, it is unclear whether finetuned FMs also observe the necessary AGL phenomenon. Intuitively, a prerequisite to observing AGL is a diverse ensemble of classifiers. Since OOD accuracy falls below ID accuracy, if the linear trend in ID versus OOD agreement is to match that of accuracy, models must also agree much less OOD than ID. For this to happen, errors between any two models must be sufficiently decorrelated. [7] observes AGL in ensembles of neural networks trained from scratch for hundreds of epochs, where it is conceivable that the stochasticity between training runs leads to large divergences in the weight space, so that the corresponding models have diverse OOD predictions. However, in the case of finetuned FMs, models are much closer in the weight space. FMs are often either linear probed over the same pretrained weights or full finetuned for a few epochs with a small learning rate, and intuitively, such light finetuning may lead to models that “revert” to their pretrained behavior and make highly correlated predictions OOD.

This raises the question: can we enforce AGL in this paradigm of lightly finetuning heavily pretrained models? In this work, we conduct an extensive study across several modalities, e.g., CLIP-based image classification and LLM-based question-answering, and training regimens, e.g., full finetuning and linear probing, to understand when AGL holds for finetuned FMs. We first investigate whether AGL appears in an ensemble of finetuned models from a single base FM. To collect a deep ensemble, the following sources of diversity can be injected into the finetuning process: 1) random initialization of the linear head; 2) random data ordering; and 3) random data subsetting. We find that not every source of diversity during finetuning on ID data manifests in sufficient diversity OOD, breaking the matching linear fits in ID versus OOD accuracy and agreement. Interestingly, finetuning models from different random initializations of the linear head consistently induces AGL in the resulting ensemble across benchmarks. In contrast, neural networks trained from scratch observe AGL irrespective of these diversity sources.

Second, we show that finetuned models from multiple different base FMs can be leveraged for AGL-based performance estimation. As base FMs can be pretrained with entirely different datasets, architectures, and training regimens, the linear trends in ID versus OOD accuracy and agreement may break altogether in such ensembles. Indeed, previous works indicate that on vision tasks, FMs pretrained on different image corpora can have different levels of OOD robustness for the same ID performance [1], [10], [11]. On the contrary, we find that on language tasks, FMs pretrained on different text corpora observe both ACL and AGL across question-answering and text classification tasks.

In total, we develop simple techniques for applying AGL-based performance estimation methods to predict the OOD performance of foundation models. We demonstrate that the AGL phenomenon is not limited to ensembles of neural networks trained from scratch. By simply finetuning FMs from random initializations of the linear head, we can observe the phenomenon across a wide variety of tasks (classification, question-answering), modalities (vision, language), and training procedures (linear probing, full finetuning). We find that AGL is the only method that accurately estimates the performance of finetuned FMs across all tasks, surpassing other performance estimation baselines by margins as large as \(20\%\) mean absolute percentage error.

2 Background and related work↩︎

2.1 Setup↩︎

We are interested in evaluating models that map an input \(x \in {\mathbb{X}}\) to a discrete output \(y \in {\mathbb{Y}}\). In particular, we finetune foundation models. For a base model \(\mathsf{B}\), let \(f(\mathsf{B})\) denote a finetuned version of \(\mathsf{B}\). In this work, we consider a variety of foundation models: GPT2 [2], OPT [12], Llama2 [13], BERT [6], and CLIP [1].

2.1.0.1 Finetuning strategies.

We have access to labeled data from some distribution \(\mathcal{D}_\text{ID}\) that we use for obtaining \(f(\mathsf{B})\) from \(\mathsf{B}\). In this work, we consider the following standard finetuning procedures.

  1. Linear probing (LP): Given features from the base model \(\mathsf{B}_\theta\), we train a linear head \(v\) such that the final classifier maps the score \(v^\top \mathsf{B}_\theta(x)\) to a predicted class. We randomly initialize \(v\) and update \(v\) via gradient steps on a suitable loss function. The base model parameters remain frozen. We refer to \(v\) as either a linear probe (classification) or a span prediction head (question-answering) depending on the task (a minimal sketch of this procedure follows the list).

  2. Full finetuning (FFT): We update all parameters of the backbone \(\mathsf{B}_\theta\) and the linear head \(v\) using a small learning rate. When it is infeasible to update all parameters, we perform low-rank adaptation (LoRA) [14] to reduce the number of trainable parameters while still effectively updating the feature extractor \(\mathsf{B}_\theta\). In this work, we do not distinguish between LoRA and FFT as they conceptually achieve the same effect and seem to show similar empirical trends in our studies.
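To make the linear probing procedure concrete, below is a minimal PyTorch sketch; the `backbone`, its feature dimension, the data `loader`, and all hyperparameters are illustrative placeholders rather than the paper's released code.

```python
import torch
import torch.nn as nn


def linear_probe(backbone, feat_dim, num_classes, loader, epochs=10, lr=1e-3, seed=0):
    """Train a randomly initialized linear head v on top of frozen backbone features."""
    torch.manual_seed(seed)                      # the seed controls the random head initialization
    head = nn.Linear(feat_dim, num_classes)      # v in the notation above
    for p in backbone.parameters():              # base model parameters B_theta stay frozen
        p.requires_grad_(False)
    backbone.eval()
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = backbone(x)              # B_theta(x), computed without gradients
            loss = loss_fn(head(feats), y)       # score v^T B_theta(x) against labels
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```

Full finetuning differs only in that the backbone parameters remain trainable (optionally through LoRA adapters) and are updated with a much smaller learning rate.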

2.1.0.2 OOD performance estimation.

Given access to a labeled validation set from \(\mathcal{D}_\text{ID}\) and unlabeled samples from a related but different distribution \(\mathcal{D}_\text{OOD}\), our goal is to estimate performance on \(\mathcal{D}_\text{OOD}\). We consider the standard performance metrics for various tasks: Accuracy \(\ell_\text{0-1}:{\mathbb{Y}} \times {\mathbb{Y}} \mapsto [0,1]\) for classification, and Exact Match \(\ell_\text{EM}:{\mathbb{Y}} \times {\mathbb{Y}} \mapsto [0,1]\) and Macro-averaged F1 score \(\ell_\text{F1}: {\mathbb{Y}} \times {\mathbb{Y}} \mapsto [0,1]\) for question-answering. We use \(\ell\) to denote the appropriate metric in the context.
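For reference, here is a minimal sketch of these metrics as they are commonly computed for classification and extractive question-answering; the simple whitespace tokenization and lowercasing below are assumptions for illustration, and the official SQuAD evaluation applies additional answer normalization.

```python
from collections import Counter


def zero_one(pred, label):
    """0-1 accuracy for a single classification prediction."""
    return float(pred == label)


def exact_match(pred, answer):
    """Exact match: 1 if the predicted span equals the gold answer string."""
    return float(pred.strip().lower() == answer.strip().lower())


def f1(pred, answer):
    """Token-overlap F1 between a predicted span and the gold answer."""
    pred_toks, gold_toks = pred.lower().split(), answer.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

In the agreement computation of Section 2.3, the same \(\ell\) is applied to a pair of model predictions rather than a prediction and a label.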

2.2 Background on OOD accuracy estimation↩︎

There is rich literature on OOD performance estimation for deep networks, with a variety of proposed approaches. Initial works focused on upper bounding the degree of distribution shift through data and/or model dependent metrics, e.g., uniform convergence bounds using \(\mathcal{H}\)-divergence [15][18]. However, these bounds tend to be loose for deep networks  [8]. The following works try to estimate the performance exactly.

For classification, a popular approach is to leverage the model’s confidence to predict the OOD performance [19][23]. Since deep models are typically overconfident, it is common practice to first calibrate these models in-distribution by temperature scaling. Related are uncertainty quantification works that directly try to calibrate models under distribution shift [24][26]. Confidence-based methods are commonly utilized in practice and are favorable for foundation models as they are not computationally intensive and are model-agnostic. However, they have a couple of downsides. Namely, they often fail in the presence of large shifts [21] and are often only well-defined for accuracy and ill-defined for other common metrics like F1 score. These can be limiting factors for foundation models, which are applied to a broad array of tasks. Still, as they are the most common estimation methods, we utilize them as the baselines in our work.

Another approach involves measuring model behavior on known auxiliary tasks to understand how the model will behave under the distribution shift at hand [27][29]. These approaches, however, tend to be overfit to specific datasets or modalities. Similar to AGL, there are prediction methods that utilize information from ensembles. Oftentimes a separate “reference” ensemble is trained on some objective to predict the performance of a “target” model [30][32]. These methods generally have a much higher computational cost than AGL. Although AGL also requires at least three models to compute agreements between, these models only undergo generic finetuning. Thus, it is a better suited approach for evaluating foundation models, especially if off-the-shelf finetuned models are readily available, e.g., from Huggingface (see Section 4).

Overall, there is growing attention towards understanding the safety and reliability of foundation models. To understand the effective robustness of FMs under distribution shift, recent works focus on studying the “accuracy-on-the-line” phenomenon [8], which we cover in more detail in the next subsection, and on designing benchmarks that try to expose different failure modes of large models [33], [34]. On the other hand, unsupervised OOD performance estimation has been underexplored in this modern setting, both in terms of new methods and the transferability of existing methods to large pretrained models.

2.3 Accuracy and agreement on the line↩︎

We are interested in adapting the method “agreement-on-the-line” (AGL) [7] for OOD estimation as it obtains state-of-the-art performance estimation across a wide variety of distribution shifts. AGL is based on an earlier observation called “accuracy-on-the-line” (ACL) — across common distribution shift benchmarks, there is a strong linear correlation between the ID and OOD performance of models [8], [11], [35][39]. ACL has been observed in foundation models for image classification, e.g., CIFAR10C [20], ImageNetV2 [36], FMoW-WILDS [40], and question-answering, e.g., SQuAD-Shifts [39]. However, ACL does not always hold, e.g., Camelyon-WILDS [8] and SearchQA [9].

While ACL is a striking phenomenon, it does not immediately provide a practical method to estimate OOD performance: computing the linear fit of ID versus OOD accuracy requires labeled samples from \(\mathcal{D}_\text{OOD}\). Alternatively, [7] suggests that we can estimate this linear trend exactly using only the agreement between neural networks. Formally, given a pair of models \(f_1\) and \(f_2\) that map inputs to labels, accuracy and agreement are defined as \[\begin{align} \mathsf{Acc}(f_i) = \mathbb{E}_{x,y \sim \mathcal{D}}[\ell(f_i(x), y)], ~~ \mathsf{Agr}(f_1, f_2) = \mathbb{E}_{x,y \sim \mathcal{D}}[\ell(f_1(x), f_2(x))], \end{align}\] where \(\ell\) is the appropriate performance metric of interest. While accuracy requires access to the ground truth labels \(y\), note that agreement only requires access to unlabeled data and a pair of models. [7] observes that when ID versus OOD accuracy is strongly linearly correlated across neural networks, i.e., ACL, then the ID versus OOD agreement of pairs of these models also observes a strong linear correlation with the same slope and bias. Furthermore, when accuracies do not show a linear correlation, agreements also do not. This coupled phenomenon is dubbed “agreement-on-the-line” (AGL).
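A minimal sketch of the empirical versions of these quantities, assuming each model's predictions on a shared evaluation set have been precomputed and `metric` is one of the \(\ell\) defined above:

```python
import numpy as np
from itertools import combinations


def accuracy(preds, labels, metric):
    """Empirical Acc(f): average metric between a model's predictions and ground-truth labels."""
    return float(np.mean([metric(p, y) for p, y in zip(preds, labels)]))


def agreement(preds_a, preds_b, metric):
    """Empirical Agr(f_a, f_b): average metric between two models' predictions; no labels needed."""
    return float(np.mean([metric(pa, pb) for pa, pb in zip(preds_a, preds_b)]))


def pairwise_agreements(all_preds, metric):
    """Agreement for every pair of models in an ensemble, keyed by model index pairs."""
    return {(i, j): agreement(all_preds[i], all_preds[j], metric)
            for i, j in combinations(range(len(all_preds)), 2)}
```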

[7] uses AGL for OOD performance estimation: one may obtain the slope and bias of the agreement line with unlabeled data, and then estimate the OOD performance by linearly transforming the ID validation performance. As in [7] and [8] we apply probit scaling to the accuracies and agreements, which induces a stronger linear fit. We refer the reader to [7] for formal AGL-based performance estimation algorithms (ALine-S and ALine-D), which we also provide in Appendix 7.1.1.

3 Predicting OOD performance: single base foundation model↩︎

We first evaluate whether AGL appears in an ensemble of multiple finetuned runs of a single base foundation model. This would enable precise OOD performance estimates for each ensemble member. A practitioner may naively gather a finetuned ensemble by training a couple of runs with different seeds or hyperparameters. However, an overriding concern is that even with some randomness in the finetuning process, linear probing or light full finetuning over the same base model may lead to solutions with highly correlated predictions. We extensively evaluate the following methods of introducing diversity into the finetuning process to see which approach (if any) leads to AGL (a sketch of how each ensemble can be constructed follows the list).

  1. Random linear heads: We initialize the last layer of the network (i.e., the linear head) randomly, instead of via some zero-shot or pre-specified initialization.

  2. Data ordering: We present the same training data to each model but shuffle its order, i.e., each model observes different minibatches.

  3. Data subsetting: We sample an i.i.d. \(p\%\) subset of the data to train over. In the main body, we report models trained on independently sampled \(10\%\) subsets of the training data; results for \(30\%\) and \(50\%\) subsets are reported in Appendix 7.4.
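The sketch below illustrates how each diversity source could be isolated when constructing an ensemble; `train_set`, the seeds, the subset fraction, and the `linear_probe` routine (from the earlier sketch in Section 2.1) are illustrative placeholders rather than the paper's actual pipeline.

```python
import random
import torch
from torch.utils.data import DataLoader, Subset


def make_loader(train_set, seed, subset_frac=1.0, batch_size=128):
    """Build a training loader whose shuffling (and optionally its data subset) depends on `seed`."""
    indices = list(range(len(train_set)))
    if subset_frac < 1.0:                              # data subsetting: draw this model's own p% sample
        random.Random(seed).shuffle(indices)
        indices = indices[: int(subset_frac * len(indices))]
    generator = torch.Generator().manual_seed(seed)    # data ordering: seed controls minibatch order
    return DataLoader(Subset(train_set, indices), batch_size=batch_size,
                      shuffle=True, generator=generator)


# 1) Random linear heads: fixed data and ordering, varied head initialization.
# heads = [linear_probe(backbone, d, k, make_loader(train_set, seed=0), seed=s) for s in range(5)]
# 2) Data ordering: fixed head initialization, varied shuffling.
# heads = [linear_probe(backbone, d, k, make_loader(train_set, seed=s), seed=0) for s in range(5)]
# 3) Data subsetting: fixed head initialization, independent 10% subsets.
# heads = [linear_probe(backbone, d, k, make_loader(train_set, seed=s, subset_frac=0.1), seed=0)
#          for s in range(5)]
```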

We perturb one source of diversity at a time and study whether AGL occurs in each resulting model ensemble. For each setting, we also vary the number of training epochs to collect models with different ID performances, which is necessary to obtain a meaningful linear correlation in accuracy. The additional randomness induced by varying the number of training epochs does not affect any conclusions we make in this section.

3.1 VLM-based Image Classification↩︎

We first investigate the effect of diversity source on AGL behavior for vision benchmarks. For image classification, a common pipeline is to finetune over a CLIP [1] pretrained foundation model.

3.1.0.1 CLIP Linear Probing

We finetune over an OpenCLIP ViT-B/32 model trained on LAION-2B [41]. Given its well-established zero-shot capabilities, a popular method of finetuning CLIP is to simply employ linear probing on top of the CLIP representation. We take particular interest in evaluating the OOD performance of an ensemble of linear models trained on top of frozen base model representations.

3.1.0.2 Datasets

We evaluate ensembles on synthetic corruptions (CIFAR10C, CIFAR100C, ImageNetC), dataset replication shifts (CIFAR10.1, ImageNetV2), style shifts (OfficeHome), geographical and temporal shifts (FMoW-WILDS, iWildCam-WILDS), and interlaboratory shifts in medicine (Camelyon17-WILDS). iWildCam-WILDS exhibits weak ACL and Camelyon17-WILDS doesn’t exhibit any ACL [8]. We test on iWildCam-WILDS and Camelyon17-WILDS to verify AGL’s negative condition, i.e., when the linear correlation does not exist in ID versus OOD accuracy, it also does not exist in agreement.

Table 1: We evaluate models on the following distribution shift benchmarks.
ID OOD
CIFAR10 [42] CIFAR10C [20], CIFAR10.1 [35]
CIFAR100 [42] CIFAR100C [20]
ImageNet [43] ImageNetC [20], ImageNetV2 [36]
FMoW ID [40] FMoW OOD [40]
iWildCam ID [40] iWildCam OOD [40]
Camelyon17 ID [40] Camelyon17 OOD [40]
OfficeHome [44] All (ID, OOD) pairings of domains Art, ClipArt, Product, Real World
MNLI [45] MNLI-Mismatched [45], SNLI [46]
SQuAD [47] SQuAD-Shifts [39]

Figure 2: In ensembles with diverse random initializations, ACL and AGL hold across benchmarks in linear probed CLIP models. Similar to [7], neither ACL nor AGL holds for Camelyon17-WILDS.

3.1.0.3 Results

Across vision benchmarks, linear probed CLIP models observe ACL, meaning there is a strong linear correlation in ID versus OOD performance. Similarly, on the same datasets, we observe a correspondingly strong linear correlation in agreement across ensembles injected with diversity in linear head initialization, data ordering, and data subsetting (Figure 1). However, we find that only ensembles with diverse initialization lead to AGL, where the linear trends of agreement and accuracy have matching slope and bias (Figure 2). See Appendix 7.3.2 for results on other datasets. In model ensembles obtained by data ordering and data subsetting, we observe a consistent trend where the agreement line has a much higher slope, close to the diagonal \(y = x\) line. These results are not specific to linear probing alone: in full finetuned CLIP models, we also observe that random linear heads induce the most reliable AGL behavior (see Appendix 7.3). Note that this setting is still notably different from [7], where models are heavily trained for tens to hundreds of epochs, often with a large learning rate, which makes AGL behavior more robust to the source of diversity used to induce the ensemble (see Appendix 7.3.4).

3.2 LLM-based Question-Answering and Text Classification↩︎

We conduct a similar systematic investigation of AGL in finetuned runs of a single base language model. Similar to CLIP linear probing, we find that AGL cannot be observed without random head initialization in language models evaluated on text classification and question-answering tasks.

3.2.0.1 Full Finetuned Language Models

We evaluate a collection of \(450\) full finetuned runs of the base FMs GPT2-Medium [2] and OPT-125M [12]. Models are full finetuned for up to \(20\) epochs with a small learning rate (\(\leq 10^{-6}\)). Hyperparameter specifics can be found in Appendix 7.2. We do not conduct a linear probing study for question-answering as it leads to poorly performing models. For text classification, we also conduct a linear probing study in Appendix 7.5.

3.2.0.2 Datasets

We test models on a text classification shift from MNLI [45] in the GLUE benchmark [48] to MNLI-Mismatched [45] and SNLI [46]. We also evaluate extractive question-answering models on the shift from SQuAD v1.1 [47] to SQuAD-Shifts [39].

3.2.0.3 Results

We evaluate models on accuracy for text classification and F1 score for question-answering. Similar to our findings in CLIP, in both text classification and question-answering benchmarks, ensembles of full finetuned LLMs observe AGL when models are trained from different randomly initialized linear or span heads, while data ordering and data subsetting yield an agreement trend closer to the diagonal \(y = x\) line (Figure 1 and Appendix 7.5). We note that with full finetuning, the differences in AGL behavior between diversity sources are not as stark as with linearly probed models. In some sense, how model diversity is achieved becomes less important for observing AGL as the base model parameters also diverge, with ensembles of models heavily trained from scratch at the extreme [7].


3.3 Summary and Implications↩︎

Across image and language modalities, we demonstrate that ensembles of finetuned FMs can also observe agreement-on-the-line, similar to heavily trained CNNs [7]. A natural hypothesis about finetuning a single base FM may be that heavy pretraining and light finetuning lead to downstream models with highly correlated behavior under distribution shift. However, simply randomizing the initialization of the linear head induces sufficiently decorrelated models for observing AGL. The diversity of the ensemble becomes important when predicting the OOD performance of models using downstream AGL-based methods. In Table [tab:diversity95table], we show that AGL-based methods can only accurately predict the OOD performance of models in ensembles with diverse initialization, while they fail for ensembles diversified by data subsetting or ordering.

Furthermore, our findings contrast with previous works that suggest AGL is a neural-network-specific phenomenon [7], [49], unlike ACL which is model agnostic [8]. Specifically, [7] reports that linear models trained on top of flattened CIFAR10 images do not observe AGL. However, we find that, on top of CLIP features, linear models can exhibit AGL with random initialization. Previous works on the Generalization Disagreement Equality [50] also contend that data subsetting leads to the most diversity in model predictions. Specifically, in-distribution, the agreement rate between pairs of models was shown to equal their expected accuracy in ensembles with diverse data subsetting, while diverse random initialization leads to slightly higher agreement [50], [51]. Out-of-distribution, on the other hand, we find that randomizing the linear head is important for diverse predictions and AGL behavior.

4 Predicting OOD performance: multiple foundation models↩︎

Alternatively, we consider a scenario where the ensemble consists of multiple base foundation models. First, it is unclear whether ACL itself holds amongst such models. The base models are heavily pretrained on different data corpora, which may cause the respective downstream models to have different ID versus OOD accuracy trends or “effective robustness” [10]. On vision tasks, for example, linear probing over CLIP, EfficientNet [52], ViT [53], and BYOL [54] observes varying robustness trends [1]. Second, even when ACL does indeed hold, it is unclear whether these model ensembles will also observe AGL. Here the problem is seemingly different from the single base model setting: any pair of foundation models finetuned from different base models may agree too little, or the OOD agreement rate may vary across model pairs depending on the similarity of the pretraining corpora, breaking the linear correlation of agreement entirely. Yet, we observe that for language models and tasks, ensembles of finetuned FMs from a wide range of base models observe both ACL and AGL.

4.0.0.1 Models

We finetune models from OPT-125M, OPT-350M, OPT-1.3B [12], GPT2, GPT2-Medium, GPT2-Large, GPT2-XL [2], GPT-Neo-135M [55], Llama2-7B [13], Alpaca-7B [56], and Vicuna-7B [57]. We full finetune OPT and GPT models and LoRA finetune Llama, Alpaca, and Vicuna. These models are pretrained on different mixtures of BookCorpus [58], Stories [59], PILE [60], CCNews v2 corpus, and PushShift.io Reddit [61]. Alpaca and Vicuna are additionally instruction-finetuned over Llama2.

Figure 3: AGL can be observed between models finetuned from different base models (Llama, GPT, OPT), measured by F1 score on the question-answering shift (SQuAD to SQuAD-Shifts) and accuracy on the text classification shifts (MNLI-Matched to MNLI-Mismatched and SNLI).

4.1 Results↩︎

We investigate the AGL behavior of an ensemble of foundation models finetuned from diverse base models in Figure 3. First, note that base LLMs pretrained on different text corpora lead to finetuned models that lie on the same linear trend in accuracy. In contrast to the differing accuracy trends observed across vision foundation models [2], we suspect that the pretraining datasets of the language models in our study are much more homogeneous. Second, the ID versus OOD agreement between pairs of models in this ensemble, including pairs from different base foundation models, is also strongly correlated, and the slope and intercept closely match those of accuracy. In other words, ensembles of different base models also observe AGL without any special regularization for ensemble diversity.

5 Estimating OOD Accuracy using AGL in Diverse Ensembles↩︎

By constructing a diverse ensemble of foundation models, we can leverage AGL to extract precise estimates of model performance under distribution shift. We construct diverse ensembles by collecting models trained from randomly-initialized heads (Section 3) and from different base models (Section 4). For image classification, our model collection consists of just the former, i.e., linear models over CLIP representations. For text classification and question-answering, we include GPT, OPT, and Llama models, each full finetuned from differently initialized heads. In Table [tab:ALine95comparison], we compare the Mean Absolute Percentage Error (MAPE) of the AGL-based prediction algorithms, ALine-S and ALine-D [7], to other baselines.

We compare against the confidence-based methods ATC [21], AC [62], and DOC-Feat [23], as well as Naive Agreement, which directly uses the agreement between model pairs as a proxy for their average accuracy [50], [63]. For confidence-based methods, we first temperature scale the models using ID validation data and pick the lower error rate from the estimates obtained with and without temperature scaling. Since confidence baselines are designed to estimate the accuracy metric on classification tasks, there are several limitations when attempting to naively apply them to estimate performance on question-answering. First, there is no easy analogous formulation of confidence baselines for the F1 score, so we estimate the exact-match score instead for a fair comparison. On the other hand, note that AGL can predict the performance of models across the accuracy, F1, and exact-match metrics. Second, extractive question-answering is a joint classification task where models predict both the start and end token index of the answer span in the context. More details on how we calibrate baselines for this setting are provided in Appendix 7.1.2.

Note that ALine-S and ALine-D only provide estimation guarantees in circumstances where the linear correlation in agreement is strong. Consistent with [7], we filter out datasets with low correlation strength \(R^2 \leq 0.95\). These shifts include iWildCam-WILDS, Camelyon-WILDS, and a few corruptions in CIFAR10C, CIFAR100C, and ImageNetC. We evaluate ALine-S/D for these specific failure cases in Appendix 7.1.4. Across datasets that observe a strong linear correlation in agreement, ALine-S and ALine-D provide precise estimates of OOD performance in finetuned foundation models, surpassing other baselines by a large margin. In particular, they perform notably better on the question-answering task SQuAD, where the next best confidence-based method has as much as \(20\%\) higher error.
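A minimal sketch of how the filtering criterion and MAPE above can be computed; the arrays of ID/OOD agreements, ALine estimates (`aline_estimates`), and true OOD accuracies (`true_ood_accuracy`, needed only to evaluate the estimator) are assumed placeholders:

```python
import numpy as np
from scipy import stats


def probit(p, eps=1e-6):
    """Probit transform Phi^{-1} applied to accuracy/agreement values."""
    return stats.norm.ppf(np.clip(np.asarray(p, dtype=float), eps, 1 - eps))


def agreement_r2(agr_id, agr_ood):
    """Correlation strength (R^2) of the ID vs OOD agreement fit in probit space."""
    r, _ = stats.pearsonr(probit(agr_id), probit(agr_ood))
    return r ** 2


def mape(estimates, targets):
    """Mean absolute percentage error of OOD performance estimates."""
    estimates, targets = np.asarray(estimates), np.asarray(targets)
    return 100.0 * float(np.mean(np.abs(estimates - targets) / targets))


# Report AGL-based estimates only on shifts with a strong agreement fit:
# if agreement_r2(agr_id, agr_ood) > 0.95:
#     print(mape(aline_estimates, true_ood_accuracy))
```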


6 Conclusion↩︎

We develop methods for extending AGL to foundation models to enable OOD performance prediction in this emerging paradigm. We find that utilizing AGL for performance estimation requires careful tuning of ensemble diversity. Unlike the original paradigm of AGL, where models undergo tens or hundreds of epochs of training on the ID dataset, we find that randomness in specific optimization choices, especially linear head initialization, is crucial for foundation models. In fact, in contrast to [7], we find that linear models can also observe AGL, specifically in the CLIP representation space, suggesting that AGL may not be a neural-network-specific phenomenon.

Our conclusions on AGL also shed light on the robustness of foundation models. First, our experiments show that light finetuning alone can produce models with diverse behaviors. Next, in contrast to vision models, where previous works show different forms of pretraining lead to different slopes in the linear correlations [1], a property often called “effective robustness”, we find that all the language models we evaluate, e.g., OPT, GPT2, GPT-Neo, Alpaca, Llama, lie on the same accuracy line. This is particularly intriguing because it goes against the common wisdom that the pretraining data influences a model’s effective robustness. We leave these questions for future analysis.

7 Appendix↩︎

7.1 Details Regarding OOD Performance Estimation Baselines↩︎

7.1.1 AGL-Based Estimation Methods: ALine-S/D↩︎

ALine algorithms are the AGL-based performance estimation methods proposed in [7]. When the AGL phenomenon occurs, i.e., models observe a strong linear correlation in both ID versus OOD agreement and accuracy with matching slopes and biases, algorithms ALine-S and ALine-D effectively apply the linear transformation calculated using agreements to map the ID performances to OOD performance estimates. We describe the algorithms in more detail below.

7.1.1.1 AGL

Provided a collection of models \(\mathcal{F} = \{f_1, f_2, ..., f_n\}\), AGL states that ID versus OOD accuracy observes a strong linear correlation if and only if ID versus OOD agreement observes a strong linear correlation, and when they do, the slopes and biases match: for all \(f_i, f_j \in \mathcal{F}\) with \(i \neq j\),

\[\begin{gather} \Phi^{-1}(\text{Acc}_{\text{OOD}}(f_i)) = a \cdot \Phi^{-1}(\text{Acc}_{\text{ID}}(f_i)) + b \\ \Updownarrow \quad \\ \Phi^{-1}(\text{Agr}_{\text{OOD}}(f_i, f_j)) = a \cdot \Phi^{-1}(\text{Agr}_{\text{ID}}(f_i, f_j)) + b \end{gather} \label{eq:AGL}\tag{1}\]

Here \(\Phi^{-1}\) is the probit transform used to induce a better linear fit, as in [7] and [8]. Provided access to \(\text{Acc}_{\text{ID}}(f_i), \text{Agr}_{\text{ID}}(f_i, f_j), \text{Agr}_{\text{OOD}}(f_i, f_j) \;\forall i, j\), we would like to estimate \(\text{Acc}_{\text{OOD}}(f_i)\) for all \(f_i \in \mathcal{F}\).

7.1.1.2 ALine-S

The algorithm ALine-S simply estimates the slope \(a\) and bias \(b\) of the accuracy line by computing the linear fit of agreement.

\[\label{eq:aline95s} \hat{a}, \hat{b} = \arg \min_{a, b \in \mathbb{R}} \sum_{i \neq j} \left( \Phi^{-1}(\hat{\text{Agr}}_{\text{OOD}}(f_i, f_j)) - a \cdot \Phi^{-1}(\hat{\text{Agr}}_{\text{ID}}(f_i, f_j)) - b \right)^2\tag{2}\]

With \(\hat{a}\) and \(\hat{b}\), we estimate \(\text{Acc}_{\text{OOD}}(f_i) \approx \Phi\left(\hat{a} \cdot \Phi^{-1}(\text{Acc}_{\text{ID}}(f_i)) + \hat{b}\right)\). This method is called ALine-S.
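A minimal sketch of ALine-S consistent with the equations above, assuming flat arrays of pairwise ID/OOD agreements (in a consistent pair ordering) and per-model ID accuracies:

```python
import numpy as np
from scipy.stats import norm


def probit(p, eps=1e-6):
    """Phi^{-1}: probit transform of accuracy/agreement values."""
    return norm.ppf(np.clip(np.asarray(p, dtype=float), eps, 1 - eps))


def aline_s(agr_id, agr_ood, acc_id):
    """ALine-S: estimate each model's OOD accuracy from its ID accuracy and the agreement fit."""
    a_hat, b_hat = np.polyfit(probit(agr_id), probit(agr_ood), deg=1)   # slope and bias of Eq. 2
    acc_ood_hat = norm.cdf(a_hat * probit(acc_id) + b_hat)              # map back through Phi
    return acc_ood_hat, (a_hat, b_hat)
```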

7.1.1.3 ALine-D

This method instead constructs the following system of linear equations. Provided the relation in Equation 1, one can derive that for any \(f_i, f_j \in \mathcal{F}\),

\[\begin{align} &\frac{1}{2}\left(\Phi^{-1}(\text{Acc}_{\text{OOD}}(f_i)) + \Phi^{-1}(\text{Acc}_{\text{OOD}}(f_j))\right) \\ &\approx \Phi^{-1}(\text{Agr}_{\text{OOD}}(f_i, f_j)) + \hat{a} \cdot \left(\frac{\Phi^{-1}(\text{Acc}_{\text{ID}}(f_i)) + \Phi^{-1}(\text{Acc}_{\text{ID}}(f_j))}{2} - \Phi^{-1}(\text{Agr}_{\text{ID}}(f_i, f_j))\right) \end{align}\]

Treating \(\Phi^{-1}(\text{Acc}_{\text{OOD}}(f_i))\) for all \(i\) as unknown variables, note that the right-hand side is known, and we can construct a linear system of equations using all \(n \choose 2\) pairs of models. ALine-D solves this approximate system of linear equations with least squares.
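A corresponding sketch of the ALine-D system, assuming the slope \(\hat{a}\) from the agreement fit (e.g., from the ALine-S sketch above) and pairwise agreements stored in dictionaries keyed by model index pairs:

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm


def probit(p, eps=1e-6):
    return norm.ppf(np.clip(np.asarray(p, dtype=float), eps, 1 - eps))


def aline_d(acc_id, agr_id, agr_ood, a_hat):
    """ALine-D: solve for probit-scaled OOD accuracies z_i over all model pairs by least squares."""
    n = len(acc_id)
    acc_id_p = probit(acc_id)
    pairs = list(combinations(range(n), 2))
    A = np.zeros((len(pairs), n))
    b = np.zeros(len(pairs))
    for row, (i, j) in enumerate(pairs):
        A[row, i] = A[row, j] = 0.5                  # left-hand side: (z_i + z_j) / 2
        b[row] = probit(agr_ood[(i, j)]) + a_hat * (
            (acc_id_p[i] + acc_id_p[j]) / 2 - probit(agr_id[(i, j)]))
    z, *_ = np.linalg.lstsq(A, b, rcond=None)        # z_i approximates Phi^{-1}(Acc_OOD(f_i))
    return norm.cdf(z)
```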

7.1.2 Temperature Scaling for Confidence-Based Estimation Methods↩︎

We compare ALine against the confidence-based methods ATC [21], AC [62], and DOC-Feat [23]. These methods notably perform better after calibrating the models in-distribution by temperature scaling.

7.1.2.1 Classification

For classification tasks, we optimize a temperature \(T\) for each model \(f\) on the cross-entropy loss over the in-distribution validation data. \[\begin{align} \min_T \sum_{x, y} \mathsf{CE}(\sigma(f(x) \exp(T)), y) \end{align}\] where \(\sigma(\cdot)\) is the softmax.
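A minimal PyTorch sketch of this objective, assuming stacked ID validation logits (an \(N \times C\) tensor) and integer labels; `F.cross_entropy` applies the softmax internally, so the scaled logits are passed directly, and the optimizer choice and step count here are illustrative:

```python
import torch
import torch.nn.functional as F


def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Fit a scalar T so that logits * exp(T) minimizes the ID validation cross-entropy."""
    T = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([T], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits * torch.exp(T), labels)
        loss.backward()
        opt.step()
    return T.detach()


# Calibrated confidences used by the confidence-based baselines:
# probs = F.softmax(logits * torch.exp(T), dim=-1)
# confidence = probs.max(dim=-1).values
```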

7.1.2.2 Question-Answering

For extractive question-answering tasks, the model has to predict two labels – the start and end token index \(y = [y_s, y_e]\) of the context span that answers the question. For each model, we attach a span prediction head \(v = [v_s, v_e] \in \mathbb{R}^{d \times 2}\) on top of the base \(\mathsf{B}_\theta(x) \in \mathbb{R}^{d \times N}\) where \(N\) is the token length of \(x\). \(s(x) = v_s^\top \mathsf{B}_\theta(x)\) and \(e(x) = v_e^\top \mathsf{B}_\theta(x)\) predict the start and end token index, respectively.

We’re interested in evaluating question-answering models on the exact match (EM) objective, \[\begin{align} \mathsf{EM}(\hat{y}, y) = \mathbb{1}\left[{\hat{y}_s = y_s}\right] \cdot \mathbb{1}\left[{\hat{y}_e = y_e}\right] \end{align}\]

EM treats question-answering as a classification problem over \(N \times N\) choices of start and end index pairs. This allows us to utilize confidence-based methods that are designed for classification tasks. We can calculate the model confidence for index pair \([i, j]\) as \(\sigma(s(x))_i \cdot \sigma(e(x))_j\).

We jointly optimize separate temperatures \(T_s\) and \(T_e\) for the start and end logits by minimizing the cross-entropy loss over the in-distribution validation data. \[\begin{align} \min_{T_s, T_e} \sum_{x, y} \mathsf{CE}(\sigma(s(x) \exp(T_s)) \, \sigma(e(x) \exp(T_e))^\top , y) \end{align}\]
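A minimal sketch of this joint objective; because the cross-entropy of the outer-product distribution against a one-hot (start, end) label factorizes into a start term plus an end term, the sketch optimizes that equivalent sum. Batched start/end logits and labels from the ID validation set are assumed, and the constraint that the start index precede the end index is ignored for simplicity:

```python
import torch
import torch.nn.functional as F


def fit_span_temperatures(start_logits, end_logits, start_labels, end_labels, steps=200, lr=0.01):
    """Fit separate temperatures T_s, T_e for the start and end logits on ID validation data."""
    T_s = torch.zeros(1, requires_grad=True)
    T_e = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([T_s, T_e], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # CE of the N x N outer-product distribution = CE(start) + CE(end) for a one-hot (y_s, y_e)
        loss = (F.cross_entropy(start_logits * torch.exp(T_s), start_labels)
                + F.cross_entropy(end_logits * torch.exp(T_e), end_labels))
        loss.backward()
        opt.step()
    return T_s.detach(), T_e.detach()


def span_confidence(start_logits, end_logits, T_s, T_e):
    """Calibrated confidence of the top (start, end) pair: max_i sigma(s)_i * max_j sigma(e)_j."""
    p_s = F.softmax(start_logits * torch.exp(T_s), dim=-1)
    p_e = F.softmax(end_logits * torch.exp(T_e), dim=-1)
    return p_s.max(dim=-1).values * p_e.max(dim=-1).values
```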

7.1.3 Comparison to ProjNorm↩︎

In this section, we present a comparison with ProjNorm [31], a method that yields a score shown to be correlated with the OOD performance of the model. We study the same setting as in Section 4, where we estimate the OOD performance of foundation models pretrained on different text corpora. Unlike AGL, ProjNorm does not provide a direct estimate of the OOD performance; hence, we compare the linear correlation between the predicted score and OOD performance. From Table [tab:projnorm], it can be seen that estimates from ALine-D are more strongly correlated with OOD performance than ProjNorm scores.


7.1.4 Failure Datasets with Low Linear Correlation↩︎

In our comparison with baselines in Section 5, we filter out datasets with a low correlation coefficient \(\leq 0.95\) in ID vs OOD agreement. When the linear correlation of agreement is weak, AGL tells us that the correlation is also low for accuracy, and AGL-based methods are not guaranteed to be reliable in such circumstances. We provide ID vs OOD accuracy and agreement scatter plots for all datasets in Appendix 7.3.2.

Below, we separately provide the comparison with baselines for the excluded datasets. We generally find that the baseline ATC [21] is significantly better in circumstances where AGL-based methods are unreliable.


7.2 Finetuning Hyperparameters↩︎

We state here the hyperparameters used to finetune the models for diversity experiments reported in Section 3.

7.2.1 Linear Probing over CLIP for Vision Tasks↩︎

We train all linear probes using SGD. Models are trained for different numbers of training steps to achieve an even distribution of ID accuracies.


7.2.2 Full Finetuning GPT, OPT, BERT on Language Tasks↩︎

We use AdamW [64] to full finetune language models. We keep the learning rate small. Models are trained for different numbers of training steps to achieve an even distribution of ID accuracies.


7.3 More Experiments on the Effect of Diversity Source with CLIP↩︎

7.3.1 OfficeHome Linear Probing Diversity Experiments↩︎

In Section 3.1, we examine how diversity source impacts whether AGL is observed in linear probed CLIP models from CIFAR10 to CIFAR10C. Here, we perform the same experiment on Office-Home [44], which consists of 4 domains or image styles (“Art”, “Clip Art”, “Product”, and “Real World”) for 65 common objects. We train models on one domain and treat the remaining three domains as OOD. Similarly, only Random Initialization yields AGL or matching slopes in accuracy and agreement, and as a result, the corresponding MAPE of estimating the OOD performance of this diverse ensemble is the smallest.

Figure 4: ID vs OOD accuracy and agreement of linear probed CLIP models on OfficeHome Art (top row), Product (middle row), and Real World (bottom row). The figure title is the OOD domain. Columns: Random Head, Data Ordering, Data Subsetting.

The ALine-S MAPE(%) for ensembles trained on each domain of OfficeHome.

7.3.2 AGL appears in Random Head CLIP Ensembles across Datasets↩︎

We report the strength of AGL in linear probed CLIP models with randomly initialized linear heads across all datasets discussed in Section 3.1.

Figure 5: AGL and ACL for all CIFAR10C shifts with random head initialization finetuning.

Figure 6: AGL and ACL for the CIFAR10.1 shifts with random head initialization finetuning.

Figure 7: AGL and ACL for the CIFAR100C shifts with random head initialization finetuning.

Figure 8: AGL and ACL for the ImageNetC shifts with random head initialization finetuning.

Figure 9: AGL and ACL for the ImageNetV2 shifts with random head initialization finetuning.

Figure 10: AGL and ACL for 3 benchmarks from the WILDS dataset with random head initialization finetuning.

Figure 11: AGL and ACL for the OfficeHome ClipArt, Product, Real shifts with random head initialization finetuning over OfficeHome Art.

Figure 12: AGL and ACL for the OfficeHome Art, Product, Real shifts with random head initialization finetuning over OfficeHome ClipArt.

Figure 13: AGL and ACL for the OfficeHome ClipArt, Art, Real shifts with random head initialization finetuning over OfficeHome Product.

Figure 14: AGL and ACL for the OfficeHome Art, ClipArt, Product shifts with random head initialization finetuning over OfficeHome Real.

7.3.3 Diversity Matters in Full Finetuned CLIP Ensembles↩︎

We show that light full finetuning over CLIP also observes similar effects of diversity source on the strength of AGL. We verify this on shifts from CIFAR10 to CIFAR10C.

Figure 15: ID vs OOD accuracy and agreement of full finetuned CLIP models on the CIFAR10 to CIFAR10C “JPEG Compression” (top row) and “Pixelate” (bottom row) shifts. Only Random Initialization (first column) yields AGL, i.e., matching slopes in accuracy and agreement. Columns: Random Head, Data Ordering, Data Subsetting.


7.3.4 Any diverse ensemble displays AGL in models heavily trained from scratch↩︎

On the other hand, we demonstrate that in models heavily trained from scratch, AGL can be observed irrespective of the diversity source. Consistent with the models in [7], we train ResNet18 on CIFAR10 from scratch, varying the different sources of randomness. These models are trained heavily with SGD with a learning rate of \(1\times10^{-2}\), batch size 128, and weight decay of \(1 \times 10^{-5}\) for up to 200 epochs. We do not use any data augmentation.
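For reference, a minimal sketch of this from-scratch configuration, assuming the standard torchvision ResNet18 (the CIFAR10 data loading and training loop are elided):

```python
import torch
import torchvision

# From-scratch ResNet18 with the hyperparameters listed above; no pretrained weights.
model = torchvision.models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-5)
# Train with batch size 128 for up to 200 epochs, without data augmentation.
```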

Figure 16: Effect of Diversity Source on ResNet18 from CIFAR10 to CIFAR10C-Snow

7.3.5 Effect of Training Dataset Size↩︎

To remove the confounding factor of data subsetting observing a smaller amount of data, we evaluate all randomness sources at different training dataset sizes. We track the effect of diversity in random initialization, data ordering, and data subsetting for different portions of the training data (100\(\%\), 50\(\%\), 30\(\%\), \(10\%\)). For each percentage \(x\%\), the Random Initialization and Data Ordering ensembles are trained on the same randomly sampled \(x\%\) proportion of the data, while each model in the Data Subsetting ensemble observes a different randomly sampled \(x\%\) portion.

Figure 17: ID vs OOD accuracy and agreement of linear probe CLIP models finetuned on 100% of CIFAR10 training data, evaluated on the CIFAR10C “JPEG Compression” (top row) and “Pixelate” (bottom row) shifts. Columns: Random Head, Data Ordering.

Figure 18: ID vs OOD accuracy and agreement of linear probe CLIP models finetuned on 50% of CIFAR10 training data, evaluated on the CIFAR10C “JPEG Compression” (top) and “Pixelate” (bottom) shifts. Columns: Random Head, Data Ordering, Data Subsetting.

Figure 19: ID vs OOD accuracy and agreement of linear probe CLIP models finetuned on 30% of the CIFAR10 training data, evaluated on the CIFAR10C “JPEG Compression” (top) and “Pixelate” (bottom) shifts. Columns: Random Head, Data Ordering, Data Subsetting.

Figure 20: ID vs OOD accuracy and agreement of linear probe CLIP models finetuned on 10% of the CIFAR10 training data, evaluated on the CIFAR10C “JPEG Compression” (top) and “Pixelate” (bottom) shifts. Columns: Random Head, Data Ordering, Data Subsetting.

7.4 More Experiments on the Effect of Diversity Source for Extractive Question Answering↩︎

7.4.1 Single-Base Diversity Experiments Using OPT and BERT↩︎

In Section 3.2, we report the effect of diversity source in full finetuned GPT2-Medium models for the extractive QA task SQuAD to SQuAD-Shifts. In this section, we also provide the same experiments on OPT-125M and BERT. As observed in GPT, using Random Heads yields the strongest AGL behavior and achieves the smallest ALine-D MAPE.

In Tables [tab:mape95opt9550] and [tab:mape95bert9550], we report the MAPE of OOD performance estimation for models trained with Random Initialization and Data Ordering using \(100\%\) of the training data and Data Subsetting with \(10\%\) of the training data.


7.4.2 Diversity in Full Finetuned GPT2 for Different Data Portions↩︎

We track the effect of diversity in random initialization, data ordering, and data subsetting for different portions of the training data (100\(\%\), 50\(\%\), 30\(\%\), \(10\%\)). For each percentage \(x\%\), the Random Initialization and Data Ordering ensembles are trained on the same randomly sampled \(x\%\) proportion of the data, while each model in the Data Subsetting ensemble observes a different randomly sampled \(x\%\) portion.

Figure 21: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained GPT2 model with 100% of the training data. Columns: Random Head, Data Ordering.

Figure 22: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained GPT2 model with 50% of the training data. Columns: Random Head, Data Ordering, Data Subsetting.

Figure 23: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained GPT2 model with 10% of the training data. Columns: Random Head, Data Ordering, Data Subsetting.

7.4.3 Diversity in Full Finetuned OPT for Different Data Portions↩︎

We track the effect of diversity in random initialization, data ordering, and data subsetting for different portions of the training data (100\(\%\), 50\(\%\), 30\(\%\), \(10\%\)). For each percentage \(x\%\), the Random Initialization and Data Ordering ensembles are trained on the same randomly sampled \(x\%\) proportion of the data, while each model in the Data Subsetting ensemble observes a different randomly sampled \(x\%\) portion.

Figure 24: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained OPT-125m with 100% of the training data. Columns: Random Head, Data Ordering.

Figure 25: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained OPT-125m with 50% of the training data. Columns: Random Head, Data Ordering, Data Subsetting.

Figure 26: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained OPT-125m with 30% of the training data. Columns: Random Head, Data Ordering, Data Subsetting.

Figure 27: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained OPT-125m with 10% of the training data. Columns: Random Head, Data Ordering, Data Subsetting.

7.4.4 Diversity in Full Finetuned BERT for Different Data Portions↩︎

We track the effect of diversity in random initialization, data ordering, and data subsetting for different portions of the training data (100\(\%\), 50\(\%\), 30\(\%\), \(10\%\)). For each percentage \(x\%\), the Random Initialization and Data Ordering ensembles are trained on the same randomly sampled \(x\%\) proportion of the data, while each model in the Data Subsetting ensemble observes a different randomly sampled \(x\%\) portion.

Figure 28: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained BERT with 100% of the training data. Columns: Random Head, Data Ordering.

Figure 29: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained BERT with 50% of the training data. Columns: Random Head, Data Ordering, Data Subsetting.

Figure 30: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained BERT with 30% of the training data. Columns: Random Head, Data Ordering, Data Subsetting.

Figure 31: ID vs OOD accuracy and agreement of models finetuned on SQuAD from a single pretrained BERT with 10% of the training data. Columns: Random Head, Data Ordering, Data Subsetting.

7.5 More Experiments on the Effect of Diversity Source for Text Classification↩︎

7.5.1 Diversity Source for Linear Probing↩︎

In Section 3.2, we reported the effect of diversity source for full finetuned OPT. Here, we demonstrate similar results on the text classification shift from MNLI-Matched to SNLI with linear probed GPT2-Medium (Figure 32) and OPT-125M (Figure 33). Similarly, random initialization yields the strongest AGL behavior. In Table [tab:mape95tc], we report the average MAPE of OOD performance estimation using ALine across models.

Figure 32: ID vs OOD accuracy and agreement of models finetuned on MNLI from GPT2-Medium. Random Head and Data Ordering ensembles are trained on \(100\%\) of the training data, while the Data Subsetting ensemble is trained on \(10\%\). Columns: Random Head, Data Ordering, Data Subsetting.

Figure 33: ID vs OOD accuracy and agreement of models finetuned on MNLI from OPT-125M. Random Head and Data Ordering ensembles are trained on \(100\%\) of the training data, while the Data Subsetting ensemble is trained on \(10\%\). Columns: Random Head, Data Ordering, Data Subsetting.


References↩︎

[1]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
[2]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 (8): 9, 2019.
[3]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901, 2020.
[4]
Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7959–7971, 2022.
[5]
Dequan Wang, Xiaosong Wang, Lilong Wang, Mengzhang Li, Qian Da, Xiaoqiang Liu, Xiangyu Gao, Jun Shen, Junjun He, Tian Shen, et al. Medfmc: A real-world dataset and benchmark for foundation model adaptation in medical image classification. arXiv preprint arXiv:2306.09579, 2023.
[6]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[7]
Christina Baek, Yiding Jiang, Aditi Raghunathan, and J Zico Kolter. Agreement-on-the-line: Predicting the performance of neural networks under distribution shift. Advances in Neural Information Processing Systems, 35: 19274–19289, 2022.
[8]
John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, pp. 7721–7735. PMLR, 2021.
[9]
Anas Awadalla, Mitchell Wortsman, Gabriel Ilharco, Sewon Min, Ian Magnusson, Hannaneh Hajishirzi, and Ludwig Schmidt. Exploring the landscape of distributional robustness for question answering models. arXiv preprint arXiv:2210.12517, 2022.
[10]
Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, and Ludwig Schmidt. Data determines distributional robustness in contrastive language image pre-training (clip). In International Conference on Machine Learning, pp. 6216–6234. PMLR, 2022.
[11]
Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33: 18583–18599, 2020.
[12]
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022.
[13]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
[14]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[15]
Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19, 2006.
[16]
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
[17]
Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. Advances in neural information processing systems, 23, 2010.
[18]
Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In International Conference on Machine Learning, pp. 942–950. PMLR, 2013.
[19]
Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
[20]
Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations, ICLR, 2019.
[21]
Saurabh Garg, Sivaraman Balakrishnan, Zachary C Lipton, Behnam Neyshabur, and Hanie Sedghi. Leveraging unlabeled data to predict out-of-distribution performance. International Conference on Learning Representations, 2022.
[22]
Hady Elsahar and Matthias Gallé. To annotate or not? predicting performance drop under domain shift. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2163–2173, 2019.
[23]
Devin Guillory, Vaishaal Shankar, Sayna Ebrahimi, Trevor Darrell, and Ludwig Schmidt. Predicting with confidence on unseen distributions. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1134–1144, 2021.
[24]
Yaodong Yu, Stephen Bates, Yi Ma, and Michael Jordan. Robust calibration with multi-domain temperature scaling. Advances in Neural Information Processing Systems, 35: 27510–27523, 2022.
[25]
Yuli Zou, Weijian Deng, and Liang Zheng. Adaptive calibrator ensemble for model calibration under distribution shift. arXiv preprint arXiv:2303.05331, 2023.
[26]
Aleksandr Podkopaev and Aaditya Ramdas. Distribution-free uncertainty quantification for classification under label shift. In Uncertainty in artificial intelligence, pp. 844–853. PMLR, 2021.
[27]
Sebastian Schelter, Tammo Rukat, and Felix Biessmann. Learning to validate the predictions of black box classifiers on unseen data. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1289–1299, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450367356.
[28]
Weijian Deng and Liang Zheng. Are labels always necessary for classifier accuracy evaluation? In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15064–15073. IEEE Computer Society, 2021.
[29]
Weijian Deng, Stephen Gould, and Liang Zheng. What does rotation prediction tell us about classifier accuracy under varying testing environments? arXiv preprint arXiv:2106.05961, 2021.
[30]
Ching-Yao Chuang, Antonio Torralba, and Stefanie Jegelka. Estimating generalization under distribution shifts via domain-invariant representations. arXiv preprint arXiv:2007.03511, 2020.
[31]
Yaodong Yu, Zitong Yang, Alexander Wei, Yi Ma, and Jacob Steinhardt. Predicting out-of-distribution error with the projection norm, 2022.
[32]
Jiefeng Chen, Frederick Liu, Besim Avci, Xi Wu, Yingyu Liang, and Somesh Jha. Detecting errors and estimating accuracy on unlabeled data with self-training ensembles. arXiv preprint arXiv:2106.15728, 2021.
[33]
Andrey Malinin, Neil Band, Alexander Ganshin, German Chesnokov, Yarin Gal, Mark J. F. Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Denis Roginskiy, Mariya Shmatova, Panos Tigas, and Boris Yangel. Shifts: A dataset of real distributional shift across multiple large-scale tasks, 2022.
[34]
Dustin Tran, Jeremiah Liu, Michael W. Dusenberry, Du Phan, Mark Collier, Jie Ren, Kehang Han, Zi Wang, Zelda Mariet, Huiyi Hu, Neil Band, Tim G. J. Rudner, Karan Singhal, Zachary Nado, Joost van Amersfoort, Andreas Kirsch, Rodolphe Jenatton, Nithum Thain, Honglin Yuan, Kelly Buchanan, Kevin Murphy, D. Sculley, Yarin Gal, Zoubin Ghahramani, Jasper Snoek, and Balaji Lakshminarayanan. Plex: Towards reliability using pretrained large model extensions, 2022.
[35]
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do cifar-10 classifiers generalize to cifar-10? arXiv preprint arXiv:1806.00451, 2018.
[36]
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pp. 5389–5400. PMLR, 2019.
[37]
Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. Advances in Neural Information Processing Systems, 32, 2019.
[38]
Chhavi Yadav and Léon Bottou. Cold case: The lost mnist digits. Advances in neural information processing systems, 32, 2019.
[39]
John Miller, Karl Krauth, Benjamin Recht, and Ludwig Schmidt. The effect of natural distribution shift on question answering models. In International conference on machine learning, pp. 6905–6916. PMLR, 2020.
[40]
Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp. 5637–5664. PMLR, 2021.
[41]
Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
[42]
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[43]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014. URL http://arxiv.org/abs/1409.0575.
[44]
Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018–5027, 2017.
[45]
Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference, 2018.
[46]
Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
[47]
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[48]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR, 2018.
[49]
Donghwan Lee, Behrad Moniri, Xinmeng Huang, Edgar Dobriban, and Hamed Hassani. Demystifying disagreement-on-the-line in high dimensions, 2023.
[50]
Yiding Jiang, Vaishnavh Nagarajan, Christina Baek, and J Zico Kolter. Assessing generalization of sgd via disagreement. International Conference on Learning Representations, 2022.
[51]
Preetum Nakkiran and Yamini Bansal. Distributional generalization: A new kind of generalization. arXiv preprint arXiv:2009.08092, 2020.
[52]
Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. PMLR, 2019.
[53]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[54]
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33: 21271–21284, 2020.
[55]
Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow, March 2021. URL https://doi.org/10.5281/zenodo.5297715.
[56]
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
[57]
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
[58]
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27, 2015.
[59]
Trieu H Trinh and Quoc V Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847, 2018.
[60]
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
[61]
Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media, volume 14, pp. 830–839, 2020.
[62]
Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. International Conference on Learning Representations, 2017.
[63]
Omid Madani, David Pennock, and Gary Flake. Co-validation: Using model disagreement on unlabeled data to validate classification algorithms. Advances in neural information processing systems, 17, 2004.
[64]
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.