April 02, 2024

Estimating out-of-distribution (OOD) performance in regimes where labels are scarce is critical to safely deploying foundation models. Recently, it was shown that ensembles of neural networks exhibit the phenomenon of “agreement-on-the-line”, which can be
leveraged to reliably predict OOD performance without labels. However, in contrast to classical neural networks that are trained on in-distribution data from scratch for numerous epochs, foundation models undergo minimal finetuning from heavily pretrained
weights, which may reduce the ensemble diversity needed to observe agreement-on-the-line. In our work, we demonstrate that when lightly finetuning multiple runs from a *single* foundation model, the choice of randomness during training (linear head
initialization, data ordering, and data subsetting) can lead to drastically different levels of agreement-on-the-line in the resulting ensemble. Surprisingly, only random head initialization is able to reliably induce agreement-on-the-line in finetuned
foundation models across vision and language benchmarks. Second, we demonstrate that ensembles of *multiple* foundation models pretrained on different datasets but finetuned on the same task can also show agreement-on-the-line. In total, by careful
construction of a diverse ensemble, we can utilize agreement-on-the-line-based methods to predict the OOD performance of foundation models with high precision.

Foundation models (FMs), or large models first pretrained on open world data then finetuned or prompted for a specific downstream task, have proven to be powerful solutions for many common machine learning problems. A notable trait of FMs is that they
are far more robust to distribution shift than other deep learning approaches: across image and language benchmarks, they suffer a smaller performance degradation on out-of-distribution (OOD) data, which may vary substantially from the in-distribution (ID)
finetuning data [1]–[6]. From clinical decision-making in different hospitals to navigating robots through unseen terrains, FMs are increasingly utilized for tasks prone to distribution shift. However, evaluating these models in OOD settings
remains difficult: in many cases, acquiring labels for OOD data is costly and inefficient, while unlabeled OOD data is much easier to collect. Although the field has explored other means for estimating OOD accuracy without labeled data, they are not ideal
for FMs. A reliable FM performance estimator has the following desirable properties. First, the method must be computationally efficient to account for FMs’ large model size. Second, FMs are leveraged for many different tasks (e.g., classification,
question-answering, regression), so the method should also be versatile across tasks. Third, as we will see, methods for finetuned FMs may require *different model assumptions from neural networks trained from scratch*.

Recently, [7] proposed a promising method for estimating the OOD accuracy of deep networks using the *agreement* between pairs
of these classifiers (i.e., how often two classifiers make the same prediction). For distribution shifts where models observe a strong linear correlation in ID versus OOD accuracy – a common phenomenon in vision and language benchmarks [8], [9] – a strong linear correlation also holds for ID
versus OOD agreement with extremely similar slopes and intercepts. These effects are referred to as accuracy-on-the-line (ACL) and agreement-on-the-line (AGL) respectively, and together they provide a simple method for estimating OOD accuracy via unlabeled
data alone. Namely, without any OOD labels, we can instead measure the linear fit of ID versus OOD agreement as a proxy for the linear fit of ID versus OOD accuracy. We can then verify whether ACL holds using the correlation strength of agreement’s linear
trend and estimate each model’s OOD accuracy by linearly transforming ID accuracy using this approximate slope and bias. This simple approach has been shown to reliably predict the OOD accuracy of models within a few percentage points across classification and
question-answering tasks.

Unfortunately, while the method has several practical advantages, it is unclear whether finetuned FMs also observe the necessary AGL phenomenon. Intuitively, a prerequisite to observing AGL is a *diverse ensemble* of classifiers. Since OOD
accuracy falls below ID accuracy, if the linear trend in ID versus OOD agreement is to match that of accuracy, models must also *agree much less OOD than ID*. For this to happen, errors between any two models must be sufficiently decorrelated. [7] observes AGL in ensembles of neural networks trained for hundreds of epochs from scratch, where it is conceivable that the stochasticity
between training runs leads to large divergences in the weight space, and corresponding models have diverse OOD predictions. However, in the case of finetuned FMs, models are much closer in the weight space. FMs are often either linear probed over the same
pretrained weights or full finetuned for a few epochs with a small learning rate. Intuitively, such *light* finetuning may lead to models that “revert” to their pretrained behavior and make highly correlated predictions OOD.

This raises the question: can we enforce AGL in this paradigm of lightly finetuning heavily pretrained models? In this work, we conduct an extensive study across several modalities, e.g., CLIP-based image classification and LLM-based question-answering,
and training regimens, e.g., full finetuning and linear probing, to understand when AGL holds for finetuned FMs. We first investigate whether AGL appears in an ensemble of finetuned models from a *single* base FM. To collect a deep ensemble, the
following sources of diversity can be injected into the finetuning process: 1) random initialization of the linear head; 2) random data ordering; and 3) random data subsetting. We find that not every source of diversity during fine-tuning on ID data
manifests in sufficient diversity OOD, breaking the matching linear fits in ID versus OOD accuracy and agreement. Interestingly, finetuning models from *different random initializations of the linear head consistently induces AGL* in the resulting
ensemble across benchmarks. In contrast, neural networks trained from scratch observe AGL irrespective of these diversity sources.

Second, we show that finetuned models from *multiple* different base FMs can be leveraged for AGL-based performance estimation. As base FMs can be pretrained with entirely different datasets, architectures, and training regimens, the linear
trends in ID versus OOD accuracy and agreement may break altogether in such ensembles. Indeed, previous works indicate that on vision tasks, FMs pretrained on different image corpora can have different levels of OOD robustness for the same ID performance
[1], [10], [11]. In contrast, we find that on language tasks, FMs pretrained on different text corpora observe both AGL and ACL across question-answering and text classification tasks.

In total, we develop simple techniques for applying AGL-based performance estimation methods to predict the OOD performance of foundation models. We demonstrate that the AGL phenomenon is not limited to ensembles of neural networks trained from scratch. By simply finetuning FMs from random initializations of the linear head, we can observe the phenomenon in FMs across a wide variety of tasks (classification, question-answering), modalities (vision, language), and training procedures (linear probing, full finetuning). We find that AGL is the only method to accurately estimate the performance of finetuned FMs across all tasks, surpassing other performance estimation baselines by a significant margin as large as \(20\%\) mean absolute percentage error.

We are interested in evaluating models that map an input \(x \in {\mathbb{X}}\) to a discrete output \(y \in {\mathbb{Y}}\). In particular, we finetune foundation models. For a base model \(\mathsf{B}\), let \(f(\mathsf{B})\) denote a finetuned version of \(\mathsf{B}\). In this work, we consider a variety of foundation models: GPT2 [2], OPT [12], Llama2 [13], BERT [6], and CLIP [1].

We have access to labeled data from some distribution \(\mathcal{D}_\text{ID}\) that we use for obtaining \(f(\mathsf{B})\) from \(\mathsf{B}\). In this work, we consider the following standard finetuning procedures.

**Linear probing (LP):** Given features from the base model \(\mathsf{B}_\theta\), we train a linear head \(v\) such that the final classifier maps the score \(v^\top \mathsf{B}_\theta(x)\) to a predicted class. We randomly initialize \(v\) and update \(v\) via gradient steps on a suitable loss function. The base model parameters remain frozen. We refer to \(v\) as either a linear probe (classification), or span prediction head (question-answering) depending on the task.

**Full finetuning (FFT):** We update all parameters of the backbone \(\mathsf{B}_\theta\) and the linear head \(v\) *using a small learning rate*. When infeasible to update all parameters, we perform *low-rank adaptation* (LoRA) [14] to reduce the number of trainable parameters while still effectively updating the feature extractor \(\mathsf{B}_\theta\). In this work, we do not distinguish between LoRA and FFT as they conceptually achieve the same effect, and seem to show similar empirical trends in our studies.

Given access to a labeled validation set from \(\mathcal{D}_\text{ID}\) and *unlabeled* samples from a related but different distribution \(\mathcal{D}_\text{OOD}\), our goal is to
estimate performance on \(\mathcal{D}_\text{OOD}\). We consider the standard performance metrics for various tasks: Accuracy \(\ell_\text{0-1}:{\mathbb{Y}}\times{\mathbb{Y}}\to [0,1]\) for classification,
and Exact Match \(\ell_\text{EM}:{\mathbb{Y}}\times{\mathbb{Y}}\to [0,1]\) and Macro-averaged F1 score \(\ell_\text{F1}: {\mathbb{Y}}\times{\mathbb{Y}}\to [0,1]\) for question-answering. We use \(\ell\) to denote the appropriate metric in the context.

There is rich literature on OOD performance estimation for deep networks, with a variety of proposed approaches. Initial works focused on upper bounding the degree of distribution shift through data and/or model dependent metrics, e.g., uniform convergence bounds using \(\mathcal{H}\)-divergence [15]–[18]. However, these bounds tend to be loose for deep networks [8]. The following works try to estimate the performance exactly.

For classification, a popular approach is to leverage the model’s confidence to predict the OOD performance [19]–[23]. Since deep models are typically overconfident, it is common practice to first calibrate these models in-distribution by temperature scaling. On a related note are uncertainty quantification works that directly try to calibrate models under distribution shift [24]–[26]. Confidence based methods are commonly utilized in practice, and favorable for foundation models as they are not computationally intensive and model-agnostic. However, they do have a couple downsides. Namely, they often fail in the presence of large shifts [21] and are often only well-defined for accuracy and ill-defined for other common metrics like F1 score. These can be limiting factors for foundation models which are applied to a broad array of tasks. Still, as they are the most common estimation methods, we utilize them as the baselines in our work.

Another approach involves measuring model behavior on known auxiliary tasks to understand how the model will behave under the distribution shift at hand [27]–[29]. These approaches, however, tend to be overfit to specific datasets or modalities. Similar to AGL, there are prediction methods that utilize information from ensembles. Oftentimes a separate “reference” ensemble is trained on some objective to predict the performance of a “target” model [30]–[32]. These methods generally have a much higher computational cost than AGL. Although AGL also requires at least 3 models to compute agreement between, these models only undergo generic finetuning. Thus, it is a better suited approach for evaluating foundation models, especially if off-the-shelf finetuned models are readily available, e.g., from Huggingface (see Section 4).

Overall, there is growing attention towards understanding the safety and reliability of foundation models. To understand the effective robustness of FMs under distribution shift, recent works focus on studying the “accuracy-on-the-line” phenomenon [8], which we cover in more detail in the next subsection, and on designing benchmarks that try to expose different failure modes of large models [33], [34]. On the other hand, unsupervised OOD performance estimation has been underexplored in this modern setting, in terms of new methods and the transferability of old methods to large pretrained models.

We are interested in adapting the method “agreement-on-the-line” (AGL) [7] for OOD estimation as it obtains state-of-the-art performance estimation across a wide variety of distribution shifts. AGL is based on an earlier observation called “accuracy-on-the-line” (ACL): across common distribution shift benchmarks, there is a strong linear correlation between the ID and OOD performance of models [8], [11], [35]–[39]. ACL has been observed in foundation models for image classification, e.g., CIFAR10C [20], ImageNetV2 [36], FMoW-WILDS [40], and question-answering, e.g., SQuAD-Shifts [39]. However, ACL does not always hold, e.g., Camelyon17-WILDS [8] and SearchQA [9].

While ACL is a striking phenomenon, it does not immediately provide a practical method to estimate OOD performance—computing the linear fit of ID versus OOD accuracy requires labeled samples from \(\mathcal{D}_\text{OOD}\). Alternatively, [7] suggests that we can estimate this linear trend exactly using
only the agreement between neural networks. Formally, given a pair of models \(f_1\) and \(f_2\) that map inputs to labels, accuracy and agreement are defined as \[\begin{align} \mathsf{Acc}(f_i) = \mathbb{E}_{x,y \sim \mathcal{D}}[\ell(f_i(x), y)], ~~ \mathsf{Agr}(f_1, f_2) = \mathbb{E}_{x,y \sim \mathcal{D}}[\ell(f_1(x), f_2(x))],
\end{align}\] where \(\ell\) is the appropriate performance metric of interest. While accuracy requires access to the ground truth labels \(y\), note that agreement only requires
access to unlabeled data and a pair of models. [7] observes that when ID versus OOD accuracy is strongly linearly correlated between
neural networks, i.e., ACL, then the ID versus OOD agreement of pairs of these models also observes a strong linear correlation with the *same* linear slope and bias. Furthermore, when accuracies do not show a linear correlation, agreements also do
not. This coupled phenomenon is dubbed “agreement-on-the-line” (AGL).
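As a minimal sketch of the definitions above for the 0-1 metric, the following computes accuracy and pairwise agreement from hard predictions. The function names and the toy predictions are our own illustrative choices, not part of [7].

```python
from itertools import combinations


def accuracy(preds, labels):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def agreement(preds_a, preds_b):
    """Fraction of inputs on which two models predict the same output.

    No labels are needed, so this can be measured on unlabeled OOD data.
    """
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def pairwise_agreement(all_preds):
    """Agreement for every unordered model pair, keyed by (i, j) with i < j."""
    return {
        (i, j): agreement(all_preds[i], all_preds[j])
        for i, j in combinations(range(len(all_preds)), 2)
    }


# Toy example: three models, four inputs (values are made up for illustration).
labels = [0, 1, 1, 0]
preds = [[0, 1, 1, 1], [0, 1, 0, 1], [1, 1, 1, 0]]
print(accuracy(preds[0], labels))
print(pairwise_agreement(preds))
```

For other metrics such as F1, the equality check would be replaced by the corresponding metric \(\ell\) applied to the two models' outputs.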

[7] uses AGL for OOD performance estimation: one may obtain the slope and bias of the agreement line with unlabeled data, and then estimate the OOD performance by linearly transforming the ID validation performance. As in [7] and [8] we apply probit scaling to the accuracies and agreements, which induces a stronger linear fit. We refer the reader to [7] for formal AGL-based performance estimation algorithms (ALine-S and ALine-D), which we also provide in Appendix 7.1.1.

We first evaluate whether AGL appears in an ensemble of multiple finetuned runs of a *single* base foundation model. This would enable precise OOD performance estimates for each ensemble member. A practitioner may naively gather a finetuned
ensemble by training a couple runs with different seeds or hyperparameters. However, an overriding concern is that even with some randomness in the finetuning process, linear probing or light full finetuning over the same base model may lead to solutions
with very correlated predictions. We extensively evaluate the following methods of introducing diversity into the finetuning process to see what approach (if any) can lead to AGL.

**Random linear heads** We initialize the last layer of the network (i.e., the linear head) randomly, instead of via some zero-shot or pre-specified manner.

**Data ordering** We present the same training data to each model but shuffle the order of the data, i.e., each model observes different minibatches.

**Data subsetting** We sample an i.i.d. \(p\%\) subset of the data to train over. In the main body, we report models trained on independently sampled \(10\%\) subsets of the training data; other proportions of \(30\%\) and \(50\%\) are reported in Appendix 7.4.
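The sketch below illustrates one way to vary exactly one of these three diversity sources per ensemble member while holding the others fixed. It is a hypothetical helper rather than our training code: the name `make_run`, the seed handling, and sampling the subset without replacement are all assumptions made for illustration.

```python
import random


def make_run(source, run_seed, n_train, subset_frac=0.1, base_seed=0):
    """Return (head_seed, train_indices) for one ensemble member.

    Only the randomness named by `source` depends on `run_seed`;
    everything else is tied to the shared `base_seed`.
    """
    if source not in {"head_init", "data_order", "data_subset"}:
        raise ValueError(f"unknown diversity source: {source}")

    # Seed that would be used to initialize the linear head.
    head_seed = run_seed if source == "head_init" else base_seed

    indices = list(range(n_train))
    if source == "data_subset":
        # Assumption: draw the subset uniformly without replacement.
        k = max(1, int(subset_frac * n_train))
        indices = random.Random(run_seed).sample(indices, k)

    # Order in which training examples are visited.
    order_seed = run_seed if source == "data_order" else base_seed
    random.Random(order_seed).shuffle(indices)
    return head_seed, indices


for source in ("head_init", "data_order", "data_subset"):
    head_seed, idx = make_run(source, run_seed=1, n_train=100)
    print(source, head_seed, len(idx), idx[:5])
```

Each ensemble member would then be finetuned with its own `head_seed` and `train_indices`, so that any difference in AGL behavior can be attributed to a single source of randomness.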

We perturb one source of diversity at a time and study whether AGL occurs in each resulting model ensemble. For each setting, we also vary the number of training epochs to collect models with different ID performances, which is necessary to obtain a meaningful linear correlation in accuracy. The additional randomness induced by varied training epochs does not affect any conclusions we make in this section.

We first investigate the effect of diversity source on AGL behavior for vision benchmarks. For image classification, a common pipeline is to finetune over a CLIP [1] pretrained foundation model.

We finetune over the OpenCLIP ViT-B/32 model trained on LAION-2B [41]. Given its well-established zero-shot capabilities, a popular method of finetuning CLIP is to simply employ linear probing on top of the CLIP representation. We take particular interest in evaluating the OOD performance of an ensemble of linear models trained on top of frozen base model representations.

We evaluate ensembles on synthetic corruptions (CIFAR10C, CIFAR100C, ImageNetC), dataset replication shifts (CIFAR10.1, ImageNetV2), style shifts (OfficeHome), geographical and temporal shifts (FMoW-WILDS, iWildCam-WILDS), and interlaboratory shifts in medicine (Camelyon17-WILDS). iWildCam-WILDS exhibits weak ACL and Camelyon17-WILDS doesn’t exhibit any ACL [8]. We test on iWildCam-WILDS and Camelyon17-WILDS to verify AGL’s negative condition, i.e., when the linear correlation does not exist in ID versus OOD accuracy, it also does not exist in agreement.

ID | OOD |
---|---|
CIFAR10 [42] | CIFAR10C [20], CIFAR10.1 [35] |
CIFAR100 [42] | CIFAR100C [20] |
ImageNet [43] | ImageNetC [20], ImageNetV2 [36] |
FMoW ID [40] | FMoW OOD [40] |
iWildCam ID [40] | iWildCam OOD [40] |
Camelyon17 ID [40] | Camelyon17 OOD [40] |
OfficeHome [44] | All (ID, OOD) pairings of domains Art, ClipArt, Product, Real World |
MNLI [45] | MNLI-Mismatched [45], SNLI [46] |
SQuAD [47] | SQuAD-Shifts [39] |

Across vision benchmarks, linear probed CLIP models observe ACL, meaning there is a strong linear correlation in the ID versus OOD performance. Similarly, on the same datasets, we can observe a corresponding strong linear correlation in agreement across
ensembles injected with diversity in linear head initialization, data ordering, and data subsetting (Figure 1). However, we find that only ensembles with diverse initialization lead to AGL, where the linear trends of
agreement and accuracy match in slope and bias (Figure 2). See Appendix 7.3.2 for results on other datasets. In model ensembles obtained by data ordering and data subsetting, we observe a
consistent trend where the agreement line has a much higher slope, *close to the diagonal \(y = x\) line*. These results are not specific to linear probing alone. In full finetuned CLIP models, we also observe
that random linear heads induce the most reliable AGL behavior (see Appendix 7.3). Note that this setting is still notably different from [7], where models are heavily trained for *tens to hundreds of epochs*, often with a large learning rate, which causes AGL behavior to be more robust to the source of diversity used to
induce the ensemble (see Appendix 7.3.4).

We conduct a similar systematic investigation of AGL in finetuned runs of a single base *language* model. Similar to CLIP linear probing, we find that AGL cannot be observed without random head initialization in language models evaluated on text
classification and question-answering tasks.

We evaluate over a collection of \(450\) full finetuned runs of several base FMs: GPT2-Medium [2] and OPT-125M [12]. Models are full finetuned for up to \(20\) epochs with a small learning rate (\(\leq 10^{-6}\)). Hyperparameter specifics can be found in Appendix 7.2. We do not conduct a linear probing study for question-answering as it leads to poorly performing models. For text classification, we also conduct a linear probing study in Appendix 7.5.

We test models on a text classification shift from MNLI [45] in the GLUE benchmark [48] to MNLI-Mismatched [45] and SNLI [46]. We also evaluate extractive question-answering models on the shift from SQuAD v1.1 [47] to SQuAD-Shifts [39].

We evaluate models on accuracy for text classification and F1 score for question-answering. Similar to our findings in CLIP, in both text classification and question-answering benchmarks, ensembles of full finetuned LLMs observe AGL when models are trained from different randomly initialized linear or span heads, while data ordering and data subsetting observe an agreement trend closer to the diagonal \(y = x\) line (Figure 1 and Appendix 7.5). We note that with full finetuning, the differences in AGL behavior between diversity sources are not as stark as with linearly probed models. In some sense, how model diversity is achieved becomes increasingly less important for observing AGL as the base model parameters also diverge, with ensembles of models heavily trained from scratch at the extreme [7].


Across image and language modalities, we demonstrate that ensembles of finetuned FMs can also observe agreement-on-the-line, similar to heavily trained CNNs [7]. A natural hypothesis about finetuning a single base FM may be that heavy pretraining and light finetuning leads to downstream models with highly correlated behavior under distribution shift. However, simply randomizing the initialization of the linear head alone induces sufficiently decorrelated models for observing AGL. The diversity in the ensemble becomes important when predicting the OOD performance of models using downstream AGL-based methods. In Table [tab:diversity95table], we show that AGL-based methods can only accurately predict the OOD performance of models in ensembles with diverse initialization, while methods fail for ensembles with diverse data subsetting or ordering.

Furthermore, our findings contrast with previous works that suggest AGL is a neural-network-specific phenomenon [7], [49], unlike ACL which is model agnostic [8]. Specifically, [7] reports that linear models trained on top of the flattened CIFAR10 images do not observe AGL. However, we find that, on top of CLIP
features, *linear models can exhibit AGL* with random initialization. Previous works on the Generalization Disagreement Equality [50] also contend that data subsetting leads to the most diversity in model predictions. Specifically, in-distribution, the agreement rate between pairs of models was shown to equal their expected accuracy in ensembles with
diverse data subsetting, while diverse random initialization leads to slightly higher agreement [50], [51]. On the other hand, *out-of-distribution*, we find that randomizing the linear head is important for diverse predictions and AGL behavior.

Alternatively, we consider a scenario where the ensemble consists of *multiple* base foundation models. First, it is unclear whether ACL itself holds amongst such models. The base models are heavily pretrained on different data corpora, which may
cause respective downstream models to have different ID versus OOD accuracy trends or “effective robustness” [10]. On vision tasks, for example,
linear probing over CLIP, EfficientNet [52], ViT [53], and BYOL [54] observe varying robustness trends [1]. Second, even when ACL does indeed hold, it is unclear whether these model ensembles will also observe AGL. Here the problem is
seemingly different from the single base model setting: any pair of foundation models finetuned from different base models may agree *too little*, or OOD agreement rate may vary across model pairs depending on the similarity of the pretraining
corpus, breaking the linear correlation of agreement entirely. Yet, we observe that for language models and tasks, ensembles of finetuned FMs from a wide range of base models *observe both ACL and AGL*.

We finetune models from OPT-125M, OPT-350M, OPT-1.3B [12], GPT2, GPT2-Medium, GPT2-Large, GPT2-XL [2], GPT-Neo-135M [55], Llama2-7B [13], Alpaca-7B [56], and Vicuna-7B [57]. We full finetune OPT and GPT models and LoRA finetune Llama, Alpaca, and Vicuna. These models are pretrained on different mixtures of BookCorpus [58], Stories [59], PILE [60], CCNews v2 corpus, and PushShift.io Reddit [61]. Alpaca and Vicuna are additionally instruction-finetuned over Llama2.

We investigate the AGL behavior of an ensemble of foundation models finetuned from diverse base models in Figure 3 for question-answering. First note that base LLM models pretrained on different text corpora lead to
finetuned models that lie on the *same linear trend in accuracy*. In contrast to the different accuracy trends observed across vision foundation models [1], we suspect that the pretraining datasets for the language models in our study are much more homogeneous. Second, the ID versus OOD agreement between pairs of models in this ensemble, including those between
different base foundation models, is also strongly correlated, and the slope and intercept closely match those of accuracy. In other words, ensembles of different base models also observe AGL without any special regularization for ensemble diversity.

By constructing a diverse ensemble of foundation models, we can leverage AGL to extract precise estimates of model performances under distribution shift. We construct diverse ensembles by collecting models trained from randomly-initialized heads (Section 3) and different base models (Section 4). For image classification, our model collection consists of just the former, i.e., linear models over CLIP representations. For text classification and question-answering, we include GPT, OPT, and Llama models individually full finetuned from differently initialized heads. In Table [tab:ALine95comparison], we compare the Mean Absolute Percentage Error (MAPE) of AGL-based prediction algorithms, ALine-S and ALine-D [7], to other baselines.

We compare against confidence-based methods ATC [21], AC [62] and DOC-Feat [23] and Naive Agreement, which directly uses agreement between model pairs as a proxy for their average accuracy [50], [63]. For confidence-based methods, we first temperature scale the models using ID validation data, and pick the lower error rate from the estimations obtained with and without temperature scaling. Since confidence baselines are designed to estimate the accuracy metric on classification tasks, there are several limitations when attempting to naively apply the baselines to estimate performance on question-answering. First, there is no easy analogous formulation of confidence baselines for the F1 score, so we estimate the exact-match score instead for fair comparison. On the other hand, note that AGL can predict the performance of models across the accuracy, F1, and exact-match metrics. Second, extractive question-answering is a joint classification task where models predict both the start and end token index of the answer span in the context. More details on how we calibrate baselines for this setting are provided in Appendix 7.1.2.

Note that ALine-S and ALine-D only provide estimation guarantees in circumstances where the linear correlation in agreement is strong. Consistent with [7], we filter out datasets with low correlation strength \(R^2 \leq 0.95\). These shifts include iWildCam-WILDS, Camelyon17-WILDS, and a few corruptions in CIFAR10C, CIFAR100C, and ImageNetC. We evaluate ALine-S/D for these specific failure cases in Appendix 7.1.3. Across datasets that observe a strong linear correlation in agreement, ALine-S and ALine-D provide precise estimates of OOD performance in finetuned foundation models, surpassing other baselines by a large margin. In particular, they perform notably better on the question-answering task SQuAD, where the next best confidence method incurs up to \(20\%\) higher error.
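The correlation filter can be checked from unlabeled data alone. The following is a rough sketch, assuming that \(R^2\) is computed on probit-scaled ID versus OOD agreement and that a shift is kept only when \(R^2 > 0.95\); the function names are hypothetical.

```python
from statistics import NormalDist

probit = NormalDist().inv_cdf  # Phi^{-1}; inputs must lie strictly in (0, 1)


def r_squared(xs, ys):
    """Squared Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


def agreement_is_on_the_line(agr_id, agr_ood, threshold=0.95):
    """True when probit-scaled agreement is strongly linearly correlated.

    agr_id, agr_ood: ID and OOD agreement, one entry per model pair,
    listed in the same pair order.
    """
    xs = [probit(v) for v in agr_id]
    ys = [probit(v) for v in agr_ood]
    return r_squared(xs, ys) > threshold
```

Shifts that fail this check would be excluded before applying ALine-S or ALine-D, since neither algorithm is expected to be reliable without a strong linear trend in agreement.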


We develop methods for extending AGL to foundation models to enable OOD performance prediction in this emerging paradigm. We find that utilizing AGL for performance estimation requires a careful tuning of ensemble diversity. Unlike the original paradigm of AGL, where models observed tens or hundreds of epochs of training on the ID dataset, we find that randomness in specific optimization choices, especially linear head initialization, is crucial for foundation models. In fact, in contrast to [7], we find that linear models can also observe AGL, specifically in the CLIP representation space, suggesting that AGL may not be a neural-network-specific phenomenon.

Our conclusion on AGL also sheds light on the robustness of foundation models. First, our experiments show that light finetuning alone can corrupt models to have diverse behaviors. Next, in contrast to vision models, where previous works show different
forms of pretraining lead to different slopes in the linear correlations [1], a term that is often called “effective robustness”, we
find that all the language models we evaluate, e.g., OPT, GPT2, GPT-Neo, Alpaca, Llama, lie on the *same* accuracy line. This is particularly intriguing because it goes against the common wisdom that the pretraining data influences the models’
effective robustness. We leave these questions for future analysis.

ALine algorithms are the AGL-based performance estimation methods proposed in [7]. When the AGL phenomenon occurs, i.e., models observe a strong linear correlation in both ID versus OOD agreement and accuracy with matching slopes and biases, algorithms ALine-S and ALine-D effectively apply the linear transformation calculated using agreements to map the ID performances to OOD performance estimates. We describe the algorithms in more detail below.

Provided a collection of models \(\mathcal{F} = \{f_1, f_2, ..., f_n\}\), AGL suggests that ID versus OOD accuracy observe a strong linear correlation if and only if ID versus OOD agreement observes a strong linear correlation and when they do, the slopes and biases match: \(\forall f_i, f_j \in \mathcal{F}\) where \(i \neq j\)

\[\begin{gather} \Phi^{-1}(\text{Acc}_{\text{OOD}}(f_i)) = a \cdot \Phi^{-1}(\text{Acc}_{\text{ID}}(f_i)) + b \\ \Updownarrow \quad \\ \Phi^{-1}(\text{Agr}_{\text{OOD}}(f_i, f_j)) = a \cdot \Phi^{-1}(\text{Agr}_{\text{ID}}(f_i, f_j)) + b \end{gather} \label{eq:AGL}\tag{1}\]

\(\Phi^{-1}\) is the probit transform used to induce a better linear fit as used in [7] and [8]. Provided access to \(\text{Acc}_{\text{ID}}(f_i), \text{Agr}_{\text{ID}}(f_i, f_j), \text{Agr}_{\text{OOD}}(f_i, f_j) \;\forall i, j\), we’d like to estimate \(\text{Acc}_{\text{OOD}}(f_i)\) for all \(f_i \in \mathcal{F}\).

The algorithm ALine-S simply estimates the slope \(a\) and bias \(b\) of accuracy by computing the linear fit of agreement.

\[\hat{a}, \hat{b} = \arg \min_{a, b \in \mathbb{R}} \sum_{i \neq j} \left( \Phi^{-1}(\hat{\text{Agr}}_{\text{OOD}}(f_i, f_j)) - a \cdot \Phi^{-1}(\hat{\text{Agr}}_{\text{ID}}(f_i, f_j)) - b \right)^2\tag{2}\]

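With \(\hat{a}\) and \(\hat{b}\), we estimate \(\Phi^{-1}({\text{Acc}_{\text{OOD}}}(f_i)) \approx \hat{a} \cdot \Phi^{-1}({\text{Acc}_{\text{ID}}}(f_i)) + \hat{b}\), consistent with Equation 1, and recover the accuracy estimate by applying \(\Phi\). This method is called ALine-S.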
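A minimal sketch of ALine-S, written from the description above and Equation 1 rather than taken from the reference implementation of [7], is given below. It uses only the Python standard library; `fit_line` and `aline_s` are illustrative names, and all accuracies and agreements are assumed to lie strictly between 0 and 1 so that the probit transform is defined.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()
probit = _STD_NORMAL.inv_cdf  # Phi^{-1}
phi = _STD_NORMAL.cdf         # Phi


def fit_line(xs, ys):
    """Ordinary least-squares slope and bias of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def aline_s(acc_id, agr_id, agr_ood):
    """Estimate each model's OOD accuracy without OOD labels.

    acc_id:          ID accuracy of each model.
    agr_id, agr_ood: ID and OOD agreement, one entry per model pair,
                     listed in the same pair order.
    """
    # Fit the agreement line in probit space (Equation 2).
    a_hat, b_hat = fit_line([probit(v) for v in agr_id],
                            [probit(v) for v in agr_ood])
    # Apply the same line to probit-scaled ID accuracy, then undo the probit.
    estimates = [phi(a_hat * probit(acc) + b_hat) for acc in acc_id]
    return estimates, a_hat, b_hat
```

Because the fit uses one point per model pair, an ensemble of \(n\) models contributes \(\binom{n}{2}\) points to the regression.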

ALine-D instead constructs the following system of linear equations. Provided the relation in Equation 1, one can derive that for any \(f_i, f_j \in \mathcal{F}\),

\[\begin{align} &\frac{1}{2}\left(\Phi^{-1}(\text{Acc}_{\text{OOD}}(f_i)) + \Phi^{-1}(\text{Acc}_{\text{OOD}}(f_j))\right) \\ &\approx \Phi^{-1}(\text{Agr}_{\text{OOD}}(f_i, f_j)) + \hat{a} \cdot \left(\frac{\Phi^{-1}(\text{Acc}_{\text{ID}}(f_i)) + \Phi^{-1}(\text{Acc}_{\text{ID}}(f_j))}{2} - \Phi^{-1}(\text{Agr}_{\text{ID}}(f_i, f_j))\right) \end{align}\]

Treating \(\Phi^{-1}(\text{Acc}_{\text{OOD}}(f_i)) \;\forall i\) as the unknown variables, every term on the right-hand side is known, so we can construct a linear system of equations using all \(\binom{n}{2}\) pairs of models. The algorithm solves this approximate system by least squares and applies \(\Phi\) to each solution to recover \(\text{Acc}_{\text{OOD}}(f_i)\).
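A minimal sketch of ALine-D follows, assuming agreements are stored as symmetric \(n \times n\) matrices. The helper names and the synthetic data are assumptions made for illustration; the slope \(\hat{a}\) is taken from the agreement fit of Equation 2, and each model pair contributes one row to the least-squares system.

```python
import itertools
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()


def probit(p, eps=1e-6):
    """Elementwise inverse standard-normal CDF with clipping for finiteness."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.vectorize(_STD_NORMAL.inv_cdf)(p)


def inv_probit(z):
    """Elementwise standard-normal CDF."""
    return np.vectorize(_STD_NORMAL.cdf)(np.asarray(z, dtype=float))


def aline_d(acc_id, agr_id, agr_ood):
    """acc_id: shape (n,); agr_id, agr_ood: symmetric (n, n) agreement matrices."""
    n = len(acc_id)
    pairs = list(itertools.combinations(range(n), 2))
    rows, cols = (np.array(idx) for idx in zip(*pairs))

    z_acc_id = probit(acc_id)
    z_agr_id = probit(np.asarray(agr_id)[rows, cols])
    z_agr_ood = probit(np.asarray(agr_ood)[rows, cols])

    # Slope of the agreement line, as in Equation 2.
    a_hat, _ = np.polyfit(z_agr_id, z_agr_ood, deg=1)

    # One equation per pair: 0.5 * z_i + 0.5 * z_j = right-hand side.
    design = np.zeros((len(pairs), n))
    design[np.arange(len(pairs)), rows] = 0.5
    design[np.arange(len(pairs)), cols] = 0.5
    rhs = z_agr_ood + a_hat * (0.5 * (z_acc_id[rows] + z_acc_id[cols]) - z_agr_id)

    # Unknowns are the probit-scaled OOD accuracies; map back with the normal CDF.
    z_acc_ood, *_ = np.linalg.lstsq(design, rhs, rcond=None)
    return inv_probit(z_acc_ood)


# Synthetic check with four models whose agreements lie exactly on a known line.
rng = np.random.default_rng(0)
acc_id = np.array([0.60, 0.70, 0.80, 0.90])
agr_id = np.eye(4)
for i, j in itertools.combinations(range(4), 2):
    agr_id[i, j] = agr_id[j, i] = rng.uniform(0.5, 0.9)
agr_ood = inv_probit(0.8 * probit(agr_id) - 0.3)
acc_ood_hat = aline_d(acc_id, agr_id, agr_ood)
print(acc_ood_hat)
```

At least three models are needed for the system to determine every unknown, since \(\binom{n}{2} \geq n\) only when \(n \geq 3\).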

We compare ALine against the confidence-based methods ATC [21], AC [62], and Doc-Feat [23]. These methods perform notably better after the models are calibrated on ID data by temperature scaling.

For classification tasks, we optimize a temperature \(T\) for each model \(f\) on the cross-entropy loss over the in-distribution validation data. \[\begin{align} \min_T \sum_{x, y} \mathsf{CE}(\sigma(f(x) \exp(T)), y) \end{align}\] where \(\sigma(\cdot)\) is the softmax.
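The calibration objective above is one-dimensional, so it can be sketched with a simple bounded search. The example below assumes logits and labels are NumPy arrays and uses a golden-section search over \(T\); the optimizer choice, search bounds, and synthetic data are assumptions made here, since the section does not state how the minimization is carried out.

```python
import numpy as np


def nll(logits, labels, T):
    """Mean cross-entropy of softmax(logits * exp(T)) against integer labels."""
    z = logits * np.exp(T)
    z = z - z.max(axis=1, keepdims=True)  # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def fit_temperature(logits, labels, lo=-5.0, hi=5.0, iters=80):
    """Golden-section search for the T in [lo, hi] that minimizes the NLL."""
    ratio = (np.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if nll(logits, labels, c) < nll(logits, labels, d):
            hi = d  # the minimum lies in [lo, d]
        else:
            lo = c  # the minimum lies in [c, hi]
    return 0.5 * (lo + hi)


# Synthetic validation set: labels follow softmax(true_logits), but the model
# reports logits that are three times too large (overconfident).
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(2000, 5))
probs = np.exp(true_logits) / np.exp(true_logits).sum(axis=1, keepdims=True)
labels = np.array([rng.choice(5, p=p) for p in probs])
logits = 3.0 * true_logits

T_hat = fit_temperature(logits, labels)
print(T_hat, np.exp(T_hat))
```

In this parameterization the logits are multiplied by \(\exp(T)\), so a negative fitted \(T\) softens an overconfident model.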

For extractive question-answering tasks, the model has to predict two labels – the start and end token index \(y = [y_s, y_e]\) of the context span that answers the question. For each model, we attach a span prediction head \(v = [v_s, v_e] \in \mathbb{R}^{d \times 2}\) on top of the base \(\mathsf{B}_\theta(x) \in \mathbb{R}^{d \times N}\) where \(N\) is the token length of \(x\). \(s(x) = v_s^\top \mathsf{B}_\theta(x)\) and \(e(x) = v_e^\top \mathsf{B}_\theta(x)\) predict the start and end token index, respectively.

We’re interested in evaluating question-answering models on the exact match (EM) objective, \[\begin{align} \mathsf{EM}(\hat{y}, y) = \mathbb{1}\left[{\hat{y}_s = y_s}\right] \cdot \mathbb{1}\left[{\hat{y}_e = y_e}\right] \end{align}\]

EM treats question-answering as a classification problem over \(N \times N\) choices of start and end index pairs. This allows us to utilize confidence-based methods that are designed for classification tasks. We can calculate the model confidence for index pair \([i, j]\) as \(\sigma(s(x))_i \cdot \sigma(e(x))_j\).
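To make the span-level confidence concrete, the sketch below forms the \(N \times N\) table of confidences as an outer product of the start and end softmax distributions and scores a prediction with exact match. The function names and toy logits are illustrative assumptions; no constraint such as start index \(\leq\) end index is imposed, matching the \(N \times N\) formulation above.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def span_confidence(start_logits, end_logits):
    """Return the most confident (start, end) index pair and its confidence."""
    # joint[i, j] = softmax(start)[i] * softmax(end)[j]
    joint = np.outer(softmax(start_logits), softmax(end_logits))
    i, j = np.unravel_index(np.argmax(joint), joint.shape)
    return (int(i), int(j)), float(joint[i, j])


def exact_match(pred_span, gold_span):
    """1 if both the start and end indices match, else 0."""
    return int(pred_span[0] == gold_span[0] and pred_span[1] == gold_span[1])


# Toy logits over a four-token context.
start_logits = [0.0, 2.0, 0.0, 0.0]
end_logits = [0.0, 0.0, 3.0, 0.0]
span, conf = span_confidence(start_logits, end_logits)
print(span, conf, exact_match(span, (1, 2)))
```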

We jointly optimize separate temperatures \(T_s\) and \(T_e\) for the start and end logits by minimizing the cross-entropy loss over the in-distribution validation data. \[\begin{align} \min_{T_s, T_e} \sum_{x, y} \mathsf{CE}(\sigma(s(x) \exp(T_s)) \sigma(e(x) \exp(T_e))^\top , y) \end{align}\]

In this section, we present a comparison with ProjNorm [31], a method that yields a score which is shown to be correlated with the OOD performance of the model. We study the same setting as presented in Section 4 where we estimate OOD performance of foundation models pretrained on different text corpora. Unlike AGL, ProjNorm doesn’t provide an estimate of the OOD performance hence we compare the linear correlation between the predicted value and OOD performance. From Table [tab:projnorm] it can be seen that estimates from ALine-D are more strongly correlated with OOD performance than ProjNorm.

In our comparison with baselines in Section 5, we filter out datasets with a low correlation coefficient \(\leq 0.95\) in ID vs OOD agreement. When the linear correlation of agreement is weak, AGL tells us that the correlation is also low for accuracy, and AGL-based methods are not guaranteed to be reliable in such circumstances. We provide ID vs OOD accuracy and agreement scatter plots for all datasets in Appendix 7.3.2.
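The filtering rule can be sketched as a check on the correlation of the agreement points. The example assumes the Pearson correlation coefficient is computed on probit-scaled agreements and compared against the 0.95 threshold; whether the paper uses exactly this coefficient and scaling is an assumption, and the function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()


def probit(p, eps=1e-6):
    """Elementwise inverse standard-normal CDF with clipping for finiteness."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.vectorize(_STD_NORMAL.inv_cdf)(p)


def agreement_correlation(agr_id, agr_ood):
    """Pearson correlation between probit-scaled ID and OOD agreement."""
    return float(np.corrcoef(probit(agr_id), probit(agr_ood))[0, 1])


def passes_agl_filter(agr_id, agr_ood, threshold=0.95):
    """Keep a dataset only if its agreement correlation exceeds the threshold."""
    return agreement_correlation(agr_id, agr_ood) > threshold


# Toy pairwise agreements: one shift with a clean trend, one without.
agr_id = [0.60, 0.70, 0.80, 0.90]
agr_ood_clean = [0.50, 0.61, 0.72, 0.85]
agr_ood_noisy = [0.70, 0.50, 0.75, 0.55]
print(passes_agl_filter(agr_id, agr_ood_clean))
print(passes_agl_filter(agr_id, agr_ood_noisy))
```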

Below, we separately provide the comparison with baselines for the excluded datasets. We generally find that the baseline ATC [21] is significantly better in circumstances where AGL-based methods are unreliable.

We state here the hyperparameters used to finetune the models for diversity experiments reported in Section 3.

We train all linear probes using SGD. Models are trained for different timesteps to achieve an even distribution of ID accuracies.

We use AdamW [64] for full finetuning of language models, with a small learning rate. Models are trained for different timesteps to achieve an even distribution of ID accuracies.

In Section 3.1, we examine how the diversity source impacts whether AGL is observed in linear probed CLIP models from CIFAR10 to CIFAR10C. Here, we perform the same experiment on Office-Home [44], which consists of 4 domains or image styles (“Art”, “Clip Art”, “Product”, and “Real World”) for 65 common objects. We train models on one domain and treat the remaining three domains as OOD. As on CIFAR10, only Random Initialization yields AGL, i.e., matching slopes in accuracy and agreement, and as a result, the corresponding MAPE of estimating the OOD performance of this diverse ensemble is the smallest.

The ALine-S MAPE(%) for ensembles trained on each domain of OfficeHome.

We report the strength of AGL in linear probed CLIP models with randomly initialized linear heads across all datasets discussed in Section 3.2.

We show that light full finetuning over CLIP also observes similar effects of diversity source on the strength of AGL. We verify this on shifts from CIFAR10 to CIFAR10C.

On the other hand, we demonstrate that in models heavily trained from scratch, AGL can be observed irrespective of the diversity source. Consistent with the models in [7], we train ResNet18 on CIFAR10 from scratch, varying the different sources of randomness. These models are trained heavily with SGD with learning rate of \(1\times10^{-2}\), batch size 128, and weight decay of \(1 \times 10^{-5}\) for up to 200 epochs. We do not use any data augmentation.

To rule out confounding from Data Subsetting models seeing a smaller amount of data, we evaluate all randomness sources on different training dataset sizes. We track the effect of diversity in random initialization, data ordering, and data subsetting for different portions of the training data (100\(\%\), 50\(\%\), 30\(\%\), \(10\%\)). For each percentage \(x\%\), Random Initialization and Data Ordering ensembles are trained on the same randomly sampled \(x\%\) portion of the data, while each model in the Data Subsetting ensemble observes a different randomly sampled \(x\%\) portion.
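The sampling protocol can be sketched as follows, assuming subsets are drawn uniformly without replacement from the training indices. The function names and seeding scheme are hypothetical; the point is only that one subset is shared across the Random Initialization and Data Ordering ensembles, while Data Subsetting draws a fresh subset per model.

```python
import numpy as np


def shared_subset(n_train, fraction, seed=0):
    """One index subset reused by every model in the ensemble."""
    rng = np.random.default_rng(seed)
    k = int(round(fraction * n_train))
    return np.sort(rng.choice(n_train, size=k, replace=False))


def per_model_subsets(n_train, fraction, n_models, seed=0):
    """A different index subset for each model (Data Subsetting)."""
    return [shared_subset(n_train, fraction, seed=seed + 1 + m) for m in range(n_models)]


n_train = 50_000  # size of the CIFAR10 training set
for fraction in (1.0, 0.5, 0.3, 0.1):
    common = shared_subset(n_train, fraction)           # Random Init / Data Ordering
    distinct = per_model_subsets(n_train, fraction, 3)  # Data Subsetting
    print(fraction, len(common), [len(s) for s in distinct])
```

At 100% the three settings see identical data, so any remaining diversity in the Data Subsetting ensemble at that size cannot come from the subset itself.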

In Section 3.2, we report the effect of diversity source in full finetuned GPT2-Medium models for the extractive QA task SQuAD to SQuAD-Shifts. In this section, we also provide the same experiments on OPT-125M and BERT. As observed in GPT, using Random Heads yields the strongest AGL behavior and achieves the smallest ALine-D MAPE.

In Tables [tab:mape95opt9550] and [tab:mape95bert9550], we report the MAPE of OOD performance estimation for models trained with Random Initialization and Data Ordering using \(100\%\) of the training data and with Data Subsetting using \(10\%\) of the training data.
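For reference, the sketch below computes MAPE under the standard definition, the mean of the absolute estimation error relative to the true OOD accuracy, expressed in percent. This section does not restate the paper's formula, so the standard definition is assumed here, and the numbers are toy values.

```python
import numpy as np


def mape(acc_true, acc_est):
    """Mean absolute percentage error (%) between true and estimated accuracies."""
    acc_true = np.asarray(acc_true, dtype=float)
    acc_est = np.asarray(acc_est, dtype=float)
    return float(100.0 * np.mean(np.abs(acc_est - acc_true) / acc_true))


# Toy example: two models with true OOD accuracies 0.50 and 0.80.
error = mape([0.50, 0.80], [0.55, 0.76])
print(error)
```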

We track the effect of diversity in random initialization, data ordering, and data subsetting for different portions of the training data (100\(\%\), 50\(\%\), 30\(\%\), \(10\%\)). For each percentage \(x\%\), Random Initialization and Data Ordering ensembles are trained on the same randomly sampled \(x\%\) portion of the data, while each model in the Data Subsetting ensemble observes a different randomly sampled \(x\%\) portion.

In Section 3.2, we reported the effect of diversity source for full finetuned OPT. Here, we demonstrate similar results on text classification shift from MNLI-matched to SNLI with linear probed GPT2-Medium (Figure 32) and OPT-125M (Figure 33). Similarly, random initialization observes the strongest AGL behavior. In Table [tab:mape95tc], we report the average MAPE of OOD performance estimation using ALine across models.

[1]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In *International conference on machine learning*, pp. 8748–8763. PMLR, 2021.

[2]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. *OpenAI blog*, 1 (8): 9, 2019.

[3]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. *Advances in neural information processing systems*, 33: 1877–1901, 2020.

[4]

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust
fine-tuning of zero-shot models. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 7959–7971, 2022.

[5]

Dequan Wang, Xiaosong Wang, Lilong Wang, Mengzhang Li, Qian Da, Xiaoqiang Liu, Xiangyu Gao, Jun Shen, Junjun He, Tian Shen, et al. Medfmc: A real-world dataset and benchmark for
foundation model adaptation in medical image classification. *arXiv preprint arXiv:2306.09579*, 2023.

[6]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. *arXiv preprint
arXiv:1810.04805*, 2018.

[7]

Christina Baek, Yiding Jiang, Aditi Raghunathan, and J Zico Kolter. Agreement-on-the-line: Predicting the performance of neural networks under distribution shift. *Advances in Neural
Information Processing Systems*, 35: 19274–19289, 2022.

[8]

John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: on the strong correlation
between out-of-distribution and in-distribution generalization. In *International Conference on Machine Learning*, pp. 7721–7735. PMLR, 2021.

[9]

Anas Awadalla, Mitchell Wortsman, Gabriel Ilharco, Sewon Min, Ian Magnusson, Hannaneh Hajishirzi, and Ludwig Schmidt. Exploring the landscape of distributional robustness for question
answering models. *arXiv preprint arXiv:2210.12517*, 2022.

[10]

Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, and Ludwig Schmidt. Data determines distributional robustness in contrastive language image
pre-training (clip). In *International Conference on Machine Learning*, pp. 6216–6234. PMLR, 2022.

[11]

Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. *Advances
in Neural Information Processing Systems*, 33: 18583–18599, 2020.

[12]

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt
Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022.

[13]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher,
Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian
Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein,
Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur,
Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.

[14]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. *arXiv preprint
arXiv:2106.09685*, 2021.

[15]

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. *Advances in neural information processing systems*, 19,
2006.

[16]

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. *arXiv preprint arXiv:0902.3430*, 2009.

[17]

Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. *Advances in neural information processing systems*, 23, 2010.

[18]

Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In *International Conference on Machine Learning*, pp. 942–950. PMLR, 2013.

[19]

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In *5th International Conference on Learning
Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings*, 2017.

[20]

Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In *7th International Conference on Learning Representations,
ICLR*, 2019.

[21]

Saurabh Garg, Sivaraman Balakrishnan, Zachary C Lipton, Behnam Neyshabur, and Hanie Sedghi. Leveraging unlabeled data to predict out-of-distribution performance. *International
Conference on Learning Representations*, 2022.

[22]

Hady Elsahar and Matthias Gallé. To annotate or not? predicting performance drop under domain shift. In *Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, pp. 2163–2173, 2019.

[23]

Devin Guillory, Vaishaal Shankar, Sayna Ebrahimi, Trevor Darrell, and Ludwig Schmidt. Predicting with confidence on unseen distributions. In *Proceedings of the IEEE/CVF international
conference on computer vision*, pp. 1134–1144, 2021.

[24]

Yaodong Yu, Stephen Bates, Yi Ma, and Michael Jordan. Robust calibration with multi-domain temperature scaling. *Advances in Neural Information Processing Systems*, 35:
27510–27523, 2022.

[25]

Yuli Zou, Weijian Deng, and Liang Zheng. Adaptive calibrator ensemble for model calibration under distribution shift. *arXiv preprint arXiv:2303.05331*, 2023.

[26]

Aleksandr Podkopaev and Aaditya Ramdas. Distribution-free uncertainty quantification for classification under label shift. In *Uncertainty in artificial intelligence*, pp.
844–853. PMLR, 2021.

[27]

Sebastian Schelter, Tammo Rukat, and Felix Biessmann. Learning to validate the predictions of black box classifiers on unseen data. In *Proceedings of the 2020 ACM SIGMOD
International Conference on Management of Data*, pp. 1289–1299, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450367356.

[28]

Weijian Deng and Liang Zheng. Are labels always necessary for classifier accuracy evaluation? In *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, pp.
15064–15073. IEEE Computer Society, 2021.

[29]

Weijian Deng, Stephen Gould, and Liang Zheng. What does rotation prediction tell us about classifier accuracy under varying testing environments? *arXiv preprint
arXiv:2106.05961*, 2021.

[30]

Ching-Yao Chuang, Antonio Torralba, and Stefanie Jegelka. Estimating generalization under distribution shifts via domain-invariant representations. *arXiv preprint
arXiv:2007.03511*, 2020.

[31]

Yaodong Yu, Zitong Yang, Alexander Wei, Yi Ma, and Jacob Steinhardt. Predicting out-of-distribution error with the projection norm, 2022.

[32]

Jiefeng Chen, Frederick Liu, Besim Avci, Xi Wu, Yingyu Liang, and Somesh Jha. Detecting errors and estimating accuracy on unlabeled data with self-training ensembles. *arXiv preprint
arXiv:2106.15728*, 2021.

[33]

Andrey Malinin, Neil Band, Ganshin, Alexander, German Chesnokov, Yarin Gal, Mark J. F. Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina,
Vyas Raina, Roginskiy, Denis, Mariya Shmatova, Panos Tigas, and Boris Yangel. Shifts: A dataset of real distributional shift across multiple large-scale tasks, 2022.

[34]

Dustin Tran, Jeremiah Liu, Michael W. Dusenberry, Du Phan, Mark Collier, Jie Ren, Kehang Han, Zi Wang, Zelda Mariet, Huiyi Hu, Neil Band, Tim G. J. Rudner, Karan Singhal, Zachary Nado,
Joost van Amersfoort, Andreas Kirsch, Rodolphe Jenatton, Nithum Thain, Honglin Yuan, Kelly Buchanan, Kevin Murphy, D. Sculley, Yarin Gal, Zoubin Ghahramani, Jasper Snoek, and Balaji Lakshminarayanan. Plex: Towards reliability using pretrained large model
extensions, 2022.

[35]

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do cifar-10 classifiers generalize to cifar-10? *arXiv preprint arXiv:1806.00451*, 2018.

[36]

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In *International conference on machine learning*, pp.
5389–5400. PMLR, 2019.

[37]

Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. *Advances in
Neural Information Processing Systems*, 32, 2019.

[38]

Chhavi Yadav and Léon Bottou. Cold case: The lost mnist digits. *Advances in neural information processing systems*, 32, 2019.

[39]

John Miller, Karl Krauth, Benjamin Recht, and Ludwig Schmidt. The effect of natural distribution shift on question answering models. In *International conference on machine
learning*, pp. 6905–6916. PMLR, 2020.

[40]

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A
benchmark of in-the-wild distribution shifts. In *International Conference on Machine Learning*, pp. 5637–5664. PMLR, 2021.

[41]

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali
Farhadi, and Ludwig Schmidt. Openclip, July 2021. URL https://doi.org/10.5281/zenodo.5143773.

[42]

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[43]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and
Li Fei-Fei. Imagenet large scale visual recognition challenge. *CoRR*, abs/1409.0575, 2014. URL http://arxiv.org/abs/1409.0575.

[44]

Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In *Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition*, pp. 5018–5027, 2017.

[45]

Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference, 2018.

[46]

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. *arXiv preprint arXiv:1508.05326*,
2015.

[47]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. *arXiv preprint arXiv:1606.05250*, 2016.

[48]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. 2018. In the
Proceedings of ICLR.

[49]

Donghwan Lee, Behrad Moniri, Xinmeng Huang, Edgar Dobriban, and Hamed Hassani. Demystifying disagreement-on-the-line in high dimensions, 2023.

[50]

Yiding Jiang, Vaishnavh Nagarajan, Christina Baek, and J Zico Kolter. Assessing generalization of sgd via disagreement. *International Conference on Learning Representations*,
2022.

[51]

Preetum Nakkiran and Yamini Bansal. Distributional generalization: A new kind of generalization. *arXiv preprint arXiv:2009.08092*, 2020.

[52]

Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In *International conference on machine learning*, pp. 6105–6114. PMLR,
2019.

[53]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An
image is worth 16x16 words: Transformers for image recognition at scale. *arXiv preprint arXiv:2010.11929*, 2020.

[54]

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad
Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. *Advances in neural information processing systems*, 33: 21271–21284, 2020.

[55]

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow, March 2021. URL https://doi.org/10.5281/zenodo.5297715.

[56]

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model.
https://github.com/tatsu-lab/stanford_alpaca, 2023.

[57]

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An
open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.

[58]

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching
movies and reading books. In *Proceedings of the IEEE international conference on computer vision*, pp. 19–27, 2015.

[59]

Trieu H Trinh and Quoc V Le. A simple method for commonsense reasoning. *arXiv preprint arXiv:1806.02847*, 2018.

[60]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text
for language modeling. *arXiv preprint arXiv:2101.00027*, 2020.

[61]

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. The pushshift reddit dataset. In *Proceedings of the international AAAI conference on web and
social media*, volume 14, pp. 830–839, 2020.

[62]

Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. *International Conference on Learning Representations*,
2017.

[63]

Omid Madani, David Pennock, and Gary Flake. Co-validation: Using model disagreement on unlabeled data to validate classification algorithms. *Advances in neural information processing
systems*, 17, 2004.

[64]

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. *arXiv preprint arXiv:1711.05101*, 2017.