AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation

Taeckyung Lee, Sorn Chottananurak, Taesik Gong, Sung-Ju Lee
KAIST, Nokia Bell Labs
{taeckyung,sorn111930,profsj}@kaist.ac.kr, taesik.gong@nokia-bell-labs.com


Abstract

Test-time adaptation (TTA) has emerged as a viable solution to adapt pre-trained models to domain shifts using unlabeled test data. However, TTA faces the challenge of adaptation failures due to its reliance on blind adaptation to unknown test samples in dynamic scenarios. Traditional methods for out-of-distribution performance estimation are limited by unrealistic assumptions in the TTA context, such as requiring labeled data or re-training models. To address this issue, we propose AETTA, a label-free accuracy estimation algorithm for TTA. We propose prediction disagreement as the accuracy estimate, calculated by comparing the target model's predictions with dropout inferences. We then improve the prediction disagreement to extend the applicability of AETTA under adaptation failures. Our extensive evaluation with four baselines and six TTA methods demonstrates that AETTA achieves an average of 19.8%p more accurate estimation than the baselines. We further demonstrate the effectiveness of accuracy estimation with a model recovery case study, showcasing the practicality of recovering models based on estimated accuracy. The source code is available at https://github.com/taeckyung/AETTA.

1 Introduction↩︎

The rise of deep learning has impacted various fields with remarkable achievements [1][5]. In real-world deep learning applications, the divergence between training and test data, known as domain shifts, often leads to poor accuracy. For instance, object detection models encountering previously unseen data (e.g., variations of objects) or distributional shifts (e.g., weather changes) might suffer from performance degradation. To overcome this challenge, Test-Time Adaptation (TTA) [6][13] has recently been regarded as a promising solution and actively studied. TTA aims to adapt pre-trained models to domain shifts on the fly with only unlabeled test data.

Despite recent advances in TTA, significant challenges hinder its practical applications. The core issue is that TTA's reliance on unlabeled test-domain samples makes it susceptible to adaptation failures, especially in dynamic environments where the domain continuously changes [7], [14]. Although recent TTA studies deal with dynamic test streams [6], [7], [9], [10], [12], the inherent risk of TTA, blind adaptation to unseen test samples without ground-truth labels, remains a critical vulnerability. Notably, the absence of ground-truth labels makes it difficult to monitor the correctness of the adaptation. While various out-of-distribution performance estimation approaches have been proposed [15][18], such methods necessitate labeled training data for accuracy estimation, which is impractical for TTA scenarios.

Figure 1: AETTA estimates the model’s accuracy after adaptation using unlabeled test data without needing source data or ground-truth labels. AETTA can be integrated into existing TTA methods to estimate their accuracy under various scenarios.

In light of these challenges, we propose AETTA (Accuracy Estimation for Test-Time Adaptation), a novel accuracy estimation method designed for TTA without reliance on labeled data or source data access (Figure 1). AETTA leverages prediction disagreement with dropout inferences, where the prediction disagreement between the adapted model and dropout inferences serves as a basis for performance estimation. To enhance AETTA's robustness to adaptation failure scenarios, we propose robust disagreement equality, which dynamically adjusts the accuracy estimates based on model failures. The key idea is to extend the well-calibration assumption (i.e., the predicted probabilities of the expected model predictions are neither over- nor under-confident [19]) to cover over-confident models (e.g., adaptation failures) via adaptive scaling of the predicted probability. In addition, we provide a theoretical analysis of how AETTA can estimate accuracy with unlabeled test data.

We evaluate AETTA on three TTA benchmarks (CIFAR10-C, CIFAR100-C, and ImageNet-C [20]) under two scenarios: fully TTA (i.e., adapting to each corruption) [11] and continual TTA (i.e., continuously adapting to 15 corruptions) [9]. We evaluate the accuracy estimation of AETTA integrated with six state-of-the-art TTA algorithms [7], [9][13]. We compare AETTA with four baselines that could be applied in the TTA setting. The results show that AETTA achieves an average of 19.8%p more accurate estimation than the baselines across various TTA methods and evaluation scenarios.

Furthermore, we explore the impact of performance estimation in TTA through a case study in which we avoid undesirable accuracy drops in TTA based on AETTA. We propose a simple model recovery algorithm, which resets the model when consecutive estimated accuracy degradation or a sudden accuracy drop is observed. Our case study shows that our model recovery algorithm with accuracy estimation achieved an 11.7%p performance improvement, outperforming by 3.0%p the best baseline that knows when the distribution changes. The result shows an example where accuracy estimation could benefit TTA in practice.

2 Preliminaries↩︎

2.1 Test-Time Adaptation (TTA)↩︎

Consider the source data distribution \(\mathcal{D^{\mathcal{S}}}\) and the target data distribution \(\mathcal{D^{\mathcal{T}}}\) with its random variable \({({X}, {Y})}\), where \(Y\) is typically unknown to the learning algorithm, and \(K\) is the total number of classes. The covariate shift assumption [21] asserts a disparity between the source and target data distributions, defined by \(\mathcal{D}^{\mathcal{S}}(\mathbf{x}) \neq \mathcal{D}^{\mathcal{T}}(\mathbf{x})\), while maintaining consistency in the conditional label distribution: \(\mathcal{D}^{\mathcal{S}}(y|\mathbf{x}) = \mathcal{D}^{\mathcal{T}}(y|\mathbf{x})\).

Let \(h \sim {\mathcal{H}}_{\mathcal{A}}\) denote a hypothesis that predicts a single class for a single input and \(f\) denote a corresponding softmax value before class prediction. We define the hypothesis space \({\mathcal{H}}_{\mathcal{A}}\) as a hypothesis space \({\mathcal{H}}\) induced by a stochastic training algorithm \(\mathcal{A}\) [19]. The stochasticity could arise from a different random initialization or data ordering.

Assuming an off-the-shelf model \(h_0 \sim {\mathcal{H}}_{\mathcal{A}}\) pre-trained on \(\mathcal{D^{\mathcal{S}}}\), the goal of (fully) test-time adaptation (TTA) [11] is to adapt \(h_0\) for the target distribution \(D^{\mathcal{T}}\) to produce \(h\), using a batch of the unlabeled test set in an online manner.

2.2 Accuracy Estimation in TTA↩︎

We adopt a common TTA setup where source data is unavailable and target test data lacks labels [7], [9][13]. The objective of TTA accuracy estimation is to predict the test accuracy (or error) with unlabeled test streams.

Given an adapted model \(h(\cdot; \Theta)\) at time \(t\), we denote the test error of model \(h(\cdot; \Theta)\) by: \[{\tt Err}_{\mathcal{D}^{\mathcal{T}}} (h) \triangleq \mathbb{E}_{\mathcal{D}^{\mathcal{T}}} [ \mathbb{1} ( h({X}) \neq Y ) ].\] Note that we use the terms test accuracy and test error depending on the context; their sum is 1. Given the temporal nature of TTA, we consider estimating the accuracy of the model \(h(\cdot; \Theta)\), which has been updated before time \(t\), with the test batch \(\mathbf{X}_t\). Following the estimation, the test batch \(\mathbf{X}_t\) is used for adaptation.

3 Methodology↩︎

3.1 Disagreement Equality↩︎

We introduce an approach for estimating the test error of a model that is adapted at test time. The key idea is to compare the model’s output against outputs generated through dropout inference. Remarkably, this estimation process does not rely on access to the original training or labeled test data, which contrasts with existing accuracy estimation methods [15][19]. For example, generalization disagreement equality (GDE) [19] proposes a theoretical ground for estimating model error by measuring the disagreement rate between two networks. However, GDE requires multiple pre-trained models from different training procedures to calculate the disagreement rate.

Instead of multiple pre-trained models, our strategy utilizes dropout inference sampling, a technique where random parts of a model’s intermediate layer outputs are omitted during the inference process [22]. From a single adapted model, we simulate the behavior of independent and identically distributed (i.i.d.) models by dropout inference sampling.

Definition 1. The hypothesis space \({\mathcal{H}}_{\mathcal{A}}\) satisfies the dropout independence if for any \(h \sim {\mathcal{H}}_{\mathcal{A}}\), \(h\) and its dropout inference samples are i.i.d. over \({\mathcal{H}}_{\mathcal{A}}\).

To estimate the accuracy of the model, we propose prediction disagreement with dropout inferences (PDD) that calculates a disagreement between the adapted model \(h(\cdot; \Theta)\) and the dropout inferences \(h(\cdot; \Theta^{{\tt dropout}})\) with respect to test samples as: \[\begin{align} {\tt PDD}_{\mathcal{D}^{\mathcal{T}}} \! (h) \triangleq \mathbb{E}_{\mathcal{D}^{\mathcal{T}}} \!\!\! \left[ \frac{1}{N} \! \sum_{i=1}^N \mathbb{1} \! \left[ h({X} ; \Theta) \neq {h}({X} ; \Theta^{{\tt dropout}_i}) \right] \! \right] \!\!, \end{align}\] where \(N\) is the number of dropout inferences.
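For illustration, a minimal PyTorch-style sketch of computing PDD on a single test batch is given below; the dropout-toggling strategy and the function name are our assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

def prediction_disagreement(model: nn.Module, x: torch.Tensor, n_dropout: int = 10) -> torch.Tensor:
    """Empirical PDD for one test batch x (sketch); assumes `model` contains
    nn.Dropout layers that can be switched to train mode independently."""
    model.eval()
    with torch.no_grad():
        base_pred = model(x).argmax(dim=1)               # h(x; Theta)

        # Activate dropout only; BatchNorm and other layers stay in eval mode.
        for m in model.modules():
            if isinstance(m, nn.Dropout):
                m.train()

        rates = []
        for _ in range(n_dropout):                       # N dropout inferences
            drop_pred = model(x).argmax(dim=1)           # h(x; Theta^{dropout_i})
            rates.append((drop_pred != base_pred).float().mean())

        model.eval()                                     # restore full eval mode
    return torch.stack(rates).mean()
```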

We now provide the theoretical background to estimate test error with PDD. We first define the expectation function \(\tilde{h}\) [19] over the hypothesis space \({\mathcal{H}}_{\mathcal{A}}\), which produces a probability vector of size \(K\). For the \(k\)-th element \(\tilde{h}_k(\mathbf{x})\), we define: \[\tilde{h}_k(\mathbf{x}) \triangleq \mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [ \mathbb{1} [ h(\mathbf{x}) = k]],\] which indicates the probability of a sample \(\mathbf{x}\) drawn from \({D^{\mathcal{T}}}\) being classified as class \(k\). Note that the expectation function does not represent the model's accuracy; it indicates the probability of the input being classified as a particular class, regardless of the ground-truth labels.

Then, we define a confidence-prediction calibration assumption, indicating that the value of \(\tilde{h}\) for a particular class equals the probability of the sample having the same ground-truth label [19].

Definition 2. The hypothesis space \({\mathcal{H}}_{\mathcal{A}}\) and corresponding expectation function \(\tilde{h}\) satisfy confidence-prediction calibration on \(\mathcal{D}^{\mathcal{T}}\) if for any confidence value \(q \in [0, 1]\) and class \(k \in [1, \cdots ,K]\): \[p({Y} = k | \tilde{h}_k ({X}) = q) = q.\]

With PDD and the assumptions of dropout independence and confidence-prediction calibration, we can estimate the prediction error of the model \(h\) (Theorem 1). A detailed proof is provided in Appendix 8.1.

Theorem 1 (Disagreement Equality). If the hypothesis space \({\mathcal{H}}_{\mathcal{A}}\) and corresponding expectation function \(\tilde{h}\) satisfy dropout independence and confidence-prediction calibration, prediction disagreement with dropouts (PDD) approximates the test error over \({\mathcal{H}}_{\mathcal{A}}\): \[\mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt{Err}}_{\mathcal{D}^{\mathcal{T}}}(h)] = \mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt{PDD}}_{\mathcal{D}^{\mathcal{T}}}(h)].\]

3.2 Robust Disagreement Equality↩︎

Figure 2: Batch-wise accuracy, confidence, and prediction distribution when a model fails to adapt. (a) Test batch accuracy and confidence. (b) Predicted class distribution. TENT [11] is used on CIFAR100-C with continually changing domains. The model becomes over-confident, and predictions are skewed.

Figure 3: Correlations between the confidence value of the estimated expectation function \(\tilde{h}\) and (1) ground-truth accuracy (GroundTruth), (2) the conditional probability \(p(Y = k' | \tilde{h}_{k'} (X) = q)\) of confidence-prediction calibration (CPC), and (3) robust confidence-prediction calibration (RCPC). We used six TTA methods on CIFAR100-C with continual domain changes. We observed accuracy degradation in TENT and EATA and improvement in SAR, CoTTA, RoTTA, and SoTTA. When models failed to adapt, the original CPC misaligned with the ground truth. In contrast, our RCPC dynamically scaled the probability \(p\), thus showing better alignment.

Adaptation failures in TTA are often coupled with over-confident incorrect predictions. Figure 2 shows an illustrative example of this case; as the expectation function’s accuracy drops, the confidence increases, and predictions are skewed towards a few classes. This violates the confidence-prediction calibration, leading to a high misalignment between test error and PDD (red lines in Figure 3).

To tackle this issue, we propose robust confidence-prediction calibration to provide the theoretical ground for accuracy estimation with both well-calibrated and over-confident expectation functions \(\tilde{h}\).

Definition 3. The hypothesis space \({\mathcal{H}}_{\mathcal{A}}\) and corresponding expectation function \(\tilde{h}\) satisfy robust confidence-prediction calibration on \(\mathcal{D}^{\mathcal{T}}\) if for any confidence value \(q \in [0, 1]\), any class \(k \in [1, \cdots ,K]\), and the over-confident class \(k'\), there exist a weighting constant \(b \geq 1\) and a corresponding \(0 \leq a \leq 1\) that satisfy: \[p({Y} = {k'} | \tilde{h}_{k'} ({X}) = q) = a q,\] and \[p({Y} = {k} | \tilde{h}_{k} ({X}) = q) = b q \;\; \text{for } k \neq k'.\]

Robust confidence-prediction calibration adjusts the over-confident expectation function \(\tilde{h}\) to assign a lower probability to the misclassified class \(k'\) by multiplying it by \(a \leq 1\). Note that Definition 3 can easily be extended to multiple over-confident classes. Then, we estimate the test error with Theorem 2 (detailed proof in Appendix 8.2).

Theorem 2 (Robust Disagreement Equality). If the hypothesis space \({\mathcal{H}}_{\mathcal{A}}\) and corresponding expectation function \(\tilde{h}\) satisfy dropout independence and robust confidence-prediction calibration with a weighting constant \(b\), prediction disagreement with dropouts (PDD) approximates the test error over \({\mathcal{H}}_{\mathcal{A}}\): \[\mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt{Err}}_{\mathcal{D}^{\mathcal{T}}}(h)] = b \; \mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt{PDD}}_{\mathcal{D}^{\mathcal{T}}}(h)] - C,\] where \[C = \int_{q \in [0, 1]} {(b-a)} \; q (1 - q) \; p (\tilde{h}_{k'} ({X}) = q) dq.\]

3.3 Accuracy Estimation for TTA↩︎

With Theorem 2, we propose an empirical approach to estimate the test error of a single model. Our experiments show that a single model's disagreement (and test error) lies close to the robust disagreement equality. This aligns with the previous finding that the disagreement rate (and test error) of a single pair of differently-trained models lies close to the disagreement equality [19]. Therefore, we approximate a single model's test error as: \[\label{eq:general} {\tt{Err}}_{\mathcal{D}^{\mathcal{T}}}(h) \approx b \; {\tt{PDD}}_{\mathcal{D}^{\mathcal{T}}}(h),\tag{1}\] where we omit \(C\) due to insufficient information about the true value of \(p (\tilde{h}_{k'} ({X}) = q)\). Note that \(C \approx 0\) for well-calibrated models.

Now, we discuss selecting a proper weighting constant \(b\). Note that a desirable \(b\) should dynamically suppress the over-confident expectation function depending on the context so that the confidence-prediction calibration assumption holds. To this end, we use the skewness of the predicted outputs as an indicator of model over-confidence. Our intuition is based on the observation that the predicted class distribution is highly skewed when the adaptation fails (Figure 2 (b)), which aligns with the findings from prior studies [23], [24]. Specifically, we estimate the skewness of predictions by calculating the entropy (\(\tt Ent\)) of the batch-aggregated softmax values from the dropout inferences over a test batch \(\mathbf{X}_t\): \[\begin{align} E^{{\tt avg}} &= {\tt {Ent}} \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|\mathbf{X}_t|} \sum_{\mathbf{x} \in \mathbf{X}_t} {f}(\mathbf{x}; \Theta^{{\tt dropout}_i}) \right), \end{align}\] where \(E^{{\tt avg}}\) is maximized at \(E^{\tt{max}} = {\tt Ent} (\vec{1}_K / K)\) when predictions are uniformly distributed over the batch (i.e., no failures), and reaches its minimum of \(0\) when the entire batch predicts a single class (i.e., adaptation failures).

We then model \(b\) with \(E^{{\tt avg}}\) as: \[\label{eq:approx-b} \begin{align} & b = \left( \frac{E^{{\tt avg}}}{E^{{\tt max}}} \right)^{-\alpha}, \end{align}\tag{2}\] where \(\alpha \in [0, \infty)\) is a hyperparameter. If the adaptation does not fail, predictions are uniformly distributed as \(E^{{\tt avg}} = E^{{\tt max}}\) and \(b = 1\). Note that \(a=b=1\) drives Theorem 2 to be equivalent to Theorem 1. We found that modeling \(b\) with the average batch-wise entropy effectively corrects the correlation between confidence and prediction probability, as illustrated in Figure 3 (blue dots).

Finally, with Equation 1 and Equation 2 , we propose Accuracy Estimation for TTA (AETTA): \[\label{eq:wpdd} {\tt{Err}}_{\mathcal{D}^{\mathcal{T}}}(h) \approx \left( \frac{E^{{\tt avg}}}{E^{{\tt max}}} \right)^{-\alpha} {\tt{PDD}}_{\mathcal{D}^{\mathcal{T}}}(h).\tag{3}\]

Observe that \(\alpha=0\) and \(\infty\) result in \({\tt{Err}}_{\mathcal{D}^{\mathcal{T}}}(h)={\tt{PDD}}_{\mathcal{D}^{\mathcal{T}}}(h)\) and \({\tt{Err}}_{\mathcal{D}^{\mathcal{T}}}(h)=1\), respectively. Setting a small \(\alpha\) would result in a lesser penalty with adaptation failures. On the other hand, choosing a high \(\alpha\) would undesirably penalize model improvement cases. Our experiment found that accuracy estimation is not too sensitive to \(\alpha\) (Figure 6 (b)), and we chose \(\alpha=3\) for the other experiments.

Figure 4: AETTA: batchwise TTA accuracy estimation

We summarize the accuracy estimation procedure in Algorithm 4. We first infer with the adapted model on the current test batch \(\mathbf{X}_t\). Then, we repeatedly perform dropout inference sampling. With \(N\) samples from dropout inferences, we estimate the entropy of the batch-aggregated softmax output \(E^{{\tt avg}}\). Finally, we calculate the expected error of the model with AETTA. We apply an exponential moving average to the final accuracy estimate for stable error estimation.
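To make Algorithm 4 concrete, the sketch below performs one estimation step on a test batch, assuming a model that contains nn.Dropout layers; the clipping of the error to [0, 1] and the EMA momentum value are our assumptions, and this helper is not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def aetta_step(model: nn.Module, x: torch.Tensor, prev_acc=None,
               n_dropout: int = 10, alpha: float = 3.0, ema: float = 0.2):
    """One AETTA estimation step (sketch). Returns the EMA-smoothed accuracy estimate."""
    model.eval()
    with torch.no_grad():
        base_pred = model(x).argmax(dim=1)

        for m in model.modules():                        # enable dropout only
            if isinstance(m, nn.Dropout):
                m.train()

        probs, disagree = [], []
        for _ in range(n_dropout):
            logits = model(x)
            probs.append(F.softmax(logits, dim=1))
            disagree.append((logits.argmax(dim=1) != base_pred).float().mean())
        model.eval()

    pdd = torch.stack(disagree).mean()                   # empirical PDD
    mean_probs = torch.stack(probs).mean(dim=(0, 1))     # batch-aggregated softmax, shape (K,)
    e_avg = -(mean_probs * torch.log(mean_probs + 1e-12)).sum()
    e_max = torch.log(torch.tensor(float(mean_probs.numel())))   # Ent(1_K / K) = ln K
    b = (e_avg / e_max) ** (-alpha)                      # Eq. (2)

    err = torch.clamp(b * pdd, 0.0, 1.0)                 # Eq. (3), clipped to [0, 1] as a safety assumption
    acc = 1.0 - err.item()
    return acc if prev_acc is None else ema * acc + (1.0 - ema) * prev_acc
```

In a TTA loop, the returned estimate would be fed back as prev_acc for the next batch, and the batch would then be used for adaptation as usual.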

4 Experiments↩︎

Table [tab:main_experiment_iid]: Accuracy estimation error (mean absolute error, %p) in the fully TTA setting. Lower is better.

Dataset | Method | TENT [11] | EATA [13] | SAR [7] | CoTTA [9] | RoTTA [10] | SoTTA [12] | Avg. (\(\downarrow\))
Fully CIFAR10-C | SrcValid | 18.37 ± 0.29 | 14.37 ± 0.33 | 21.28 ± 0.27 | 18.43 ± 0.16 | 20.35 ± 1.31 | 13.13 ± 0.85 | 17.66 ± 0.24
Fully CIFAR10-C | SoftmaxScore [25] | 6.26 ± 0.49 | 4.78 ± 0.12 | 5.21 ± 0.22 | 10.96 ± 0.28 | 6.01 ± 0.23 | 4.97 ± 0.50 | 6.37 ± 0.10
Fully CIFAR10-C | GDE [19] | 18.69 ± 0.28 | 16.95 ± 0.22 | 21.25 ± 0.27 | 14.50 ± 0.03 | 23.27 ± 0.43 | 16.45 ± 0.21 | 18.52 ± 0.13
Fully CIFAR10-C | AdvPerturb [26] | 23.06 ± 1.17 | 24.97 ± 1.00 | 21.89 ± 0.95 | 18.00 ± 0.82 | 19.35 ± 0.99 | 23.68 ± 0.85 | 21.83 ± 0.92
Fully CIFAR10-C | AETTA | 4.00 ± 0.03 | 3.87 ± 0.14 | 3.89 ± 0.07 | 6.83 ± 0.47 | 6.44 ± 1.35 | 5.28 ± 0.87 | 5.05 ± 0.46
Fully CIFAR100-C | SrcValid | 38.96 ± 0.22 | 10.71 ± 0.31 | 42.68 ± 0.21 | 44.58 ± 0.30 | 23.50 ± 0.51 | 19.34 ± 0.63 | 29.96 ± 0.09
Fully CIFAR100-C | SoftmaxScore [25] | 17.34 ± 0.10 | 27.86 ± 1.11 | 24.56 ± 0.25 | 34.50 ± 0.35 | 24.18 ± 0.19 | 23.98 ± 0.21 | 25.40 ± 0.23
Fully CIFAR100-C | GDE [19] | 40.11 ± 0.05 | 71.53 ± 2.12 | 42.51 ± 0.23 | 33.21 ± 0.24 | 48.02 ± 0.56 | 34.24 ± 0.12 | 44.94 ± 0.23
Fully CIFAR100-C | AdvPerturb [26] | 24.17 ± 0.41 | 8.22 ± 0.56 | 22.91 ± 0.60 | 20.53 ± 0.14 | 17.84 ± 0.65 | 25.77 ± 0.47 | 19.91 ± 0.26
Fully CIFAR100-C | AETTA | 6.89 ± 0.15 | 20.15 ± 1.70 | 6.54 ± 0.15 | 6.05 ± 0.12 | 6.88 ± 0.10 | 5.29 ± 0.18 | 8.63 ± 0.24
Fully ImageNet-C | SrcValid | 39.13 ± 0.89 | 35.89 ± 0.79 | 29.77 ± 0.94 | 41.09 ± 0.53 | 10.28 ± 0.28 | 16.00 ± 0.33 | 28.69 ± 0.54
Fully ImageNet-C | SoftmaxScore [25] | 20.67 ± 0.01 | 21.06 ± 0.03 | 24.42 ± 0.08 | 19.62 ± 0.02 | 21.03 ± 0.04 | 23.60 ± 0.07 | 21.73 ± 0.03
Fully ImageNet-C | GDE [19] | 70.58 ± 0.01 | 66.17 ± 0.07 | 63.48 ± 0.03 | 72.76 ± 0.02 | 66.39 ± 0.04 | 52.74 ± 0.02 | 65.35 ± 0.02
Fully ImageNet-C | AdvPerturb [26] | 12.56 ± 0.03 | 14.52 ± 0.01 | 18.76 ± 0.06 | 11.05 ± 0.02 | 12.93 ± 0.04 | 22.90 ± 0.02 | 15.45 ± 0.02
Fully ImageNet-C | AETTA | 6.14 ± 0.03 | 6.48 ± 0.02 | 6.43 ± 0.09 | 6.02 ± 0.03 | 14.82 ± 0.01 | 17.40 ± 0.26 | 9.55 ± 0.07

Table [tab:main_experiment_cont]: Accuracy estimation error (mean absolute error, %p) in the continual TTA setting. Lower is better.

Dataset | Method | TENT [11] | EATA [13] | SAR [7] | CoTTA [9] | RoTTA [10] | SoTTA [12] | Avg. (\(\downarrow\))
Continual CIFAR10-C | SrcValid | 10.84 ± 1.83 | 11.06 ± 0.11 | 21.29 ± 0.26 | 18.30 ± 0.25 | 13.37 ± 0.89 | 9.40 ± 0.85 | 14.04 ± 0.58
Continual CIFAR10-C | SoftmaxScore [25] | 41.10 ± 11.66 | 15.40 ± 4.73 | 5.21 ± 0.22 | 12.96 ± 0.37 | 12.57 ± 0.43 | 4.37 ± 0.09 | 15.27 ± 2.51
Continual CIFAR10-C | GDE [19] | 46.29 ± 10.93 | 26.44 ± 5.16 | 21.25 ± 0.27 | 14.69 ± 0.15 | 17.50 ± 0.30 | 17.03 ± 0.70 | 23.87 ± 2.43
Continual CIFAR10-C | AdvPerturb [26] | 15.56 ± 1.53 | 20.93 ± 2.83 | 21.88 ± 0.93 | 17.79 ± 0.74 | 22.95 ± 0.82 | 23.63 ± 0.78 | 20.45 ± 1.17
Continual CIFAR10-C | AETTA | 9.05 ± 1.02 | 7.13 ± 3.33 | 3.89 ± 0.06 | 5.82 ± 0.30 | 5.36 ± 1.22 | 4.73 ± 0.34 | 6.00 ± 0.35
Continual CIFAR100-C | SrcValid | 11.00 ± 0.58 | 1.68 ± 0.18 | 38.20 ± 0.22 | 46.09 ± 0.38 | 19.43 ± 1.17 | 17.16 ± 1.57 | 22.32 ± 0.52
Continual CIFAR100-C | SoftmaxScore [25] | 58.29 ± 1.82 | 76.58 ± 0.71 | 24.05 ± 0.29 | 36.27 ± 0.68 | 27.19 ± 0.12 | 21.89 ± 0.35 | 40.71 ± 0.43
Continual CIFAR100-C | GDE [19] | 80.87 ± 1.29 | 94.01 ± 0.43 | 39.21 ± 0.22 | 35.43 ± 0.30 | 41.68 ± 0.45 | 35.29 ± 0.27 | 54.41 ± 0.18
Continual CIFAR100-C | AdvPerturb [26] | 10.12 ± 0.24 | 1.97 ± 0.33 | 24.93 ± 0.57 | 19.62 ± 0.15 | 21.18 ± 0.71 | 25.12 ± 0.39 | 17.16 ± 0.32
Continual CIFAR100-C | AETTA | 5.85 ± 0.36 | 4.18 ± 0.82 | 6.67 ± 0.12 | 6.55 ± 0.17 | 5.86 ± 0.10 | 5.32 ± 0.18 | 5.74 ± 0.13
Continual ImageNet-C | SrcValid | 33.30 ± 0.93 | 36.42 ± 0.76 | 22.30 ± 0.55 | 41.06 ± 0.54 | 9.56 ± 0.26 | 14.28 ± 0.28 | 26.15 ± 0.53
Continual ImageNet-C | SoftmaxScore [25] | 19.34 ± 0.02 | 20.16 ± 0.05 | 21.91 ± 0.16 | 19.63 ± 0.01 | 17.56 ± 0.08 | 19.67 ± 0.50 | 19.71 ± 0.53
Continual ImageNet-C | GDE [19] | 68.30 ± 0.01 | 66.58 ± 0.03 | 64.36 ± 0.15 | 72.81 ± 0.07 | 73.76 ± 0.22 | 55.76 ± 0.45 | 66.93 ± 0.14
Continual ImageNet-C | AdvPerturb [26] | 14.82 ± 0.02 | 14.15 ± 0.06 | 19.17 ± 0.14 | 11.06 ± 0.02 | 11.05 ± 0.05 | 20.83 ± 0.39 | 15.18 ± 0.09
Continual ImageNet-C | AETTA | 5.66 ± 0.05 | 6.73 ± 0.03 | 6.68 ± 0.04 | 5.98 ± 0.04 | 11.19 ± 0.12 | 19.22 ± 0.79 | 9.24 ± 0.14

We describe our experimental setup and present the results. Please refer to the Appendix 11 for further details.

Scenario. We consider both fully (non-continual) and continual test-time adaptation scenarios. In the fully TTA setting, the target domain is each corruption type [11], while in the continual setting, the target domain continually changes over 15 different corruptions [9]. During adaptation, we calculate the accuracy estimate for every batch and report the mean absolute error between the estimated and ground-truth batch-wise accuracy. We ran experiments with three random seeds (0, 1, 2) and report the average values. We use a test batch size of 64 for all TTA baselines, with a memory size of 64 for RoTTA [10] and SoTTA [12]. We specify further details of the hyperparameters in Appendix 11.2.

Figure 5: Qualitative results on continual CIFAR10-C (a-d), CIFAR100-C (e-h), and ImageNet-C (i-l).

Datasets. We use three standard benchmarks for test-time adaptation: CIFAR10-C, CIFAR100-C, and ImageNet-C [20]. Each dataset contains 15 different corruptions at five severity levels, where we use severity level 5. CIFAR10-C/CIFAR100-C/ImageNet-C contain 10/100/1,000 classes with 10,000/10,000/50,000 test images, respectively. We use a pre-trained ResNet18 [1] as the adaptation target, following a recent study [12].

TTA Methods. We consider six state-of-the-art TTA methods. TENT [11] updates BN parameters with entropy minimization. EATA [13] utilizes entropy-thresholding-based sample filtering and anti-forgetting regularization. SAR [7] also adopts sample filtering, combined with sharpness minimization [27]. CoTTA [9] addresses the continual setting with augmentations and stochastic restoration of model weights to avoid catastrophic forgetting. RoTTA [10] adapts with robust batch normalization and category-balanced sampling based on timeliness and uncertainty. SoTTA [12] utilizes high-confidence uniform sampling and entropy-sharpness minimization for robust adaptation to noisy data streams [27].

Accuracy Estimation Baselines. We evaluate four distinct accuracy estimation baselines that could be applied to TTA settings: SrcValid, SoftmaxScore, GDE, and AdvPerturb.

  • SrcValid is a widely used technique that validates performance by leveraging labeled source data. It computes the accuracy using a hold-out labeled source dataset to estimate the target performance. Importantly, the hold-out source data for validation were not used for training in other baselines to ensure they do not affect the model performance. Note that TTA usually assumes that source data are unavailable during test time; hence, this baseline is unrealistic in TTA. We nonetheless include SrcValid as one of our baselines to understand its performance when the source data are accessible.

  • SoftmaxScore [25] utilizes the confidence scores derived from the last softmax layer as the model’s accuracy, which is also a widely used baseline [17], [28]. It estimates the target-domain accuracy by averaging softmax confidence scores computed from the current test batch. In addition, we apply temperature scaling [29] to improve the estimation performance [30] (a brief sketch follows this list).

  • Generalization disagreement equality (GDE) [19] aims to estimate test accuracy by quantifying the (dis)agreement rate between predictions on a test batch generated by a pair of models. Since training multiple models is impractical, we compare the currently adapted model and the previous model right before the adaptation. We also report a comparison with the original GDE using multiple pre-trained models in Appendix 9.

  • Adversarial perturbation (AdvPerturb) [26] also aims to estimate the OOD accuracy by calculating the agreement between the domain-adapted model and the source model, where adversarial perturbations on a test batch are applied to penalize unconfident samples near the decision boundary. We note that the original paper aims to predict the accuracy of the source model, while our goal is to predict the accuracy of the adapted model.
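As referenced in the SoftmaxScore item above, a minimal sketch of that baseline (average maximum softmax probability with temperature scaling, T=2 as in our setup) might look as follows; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def softmax_score_estimate(model, x, temperature: float = 2.0) -> float:
    """Average maximum softmax probability of the test batch, with temperature scaling (sketch)."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x) / temperature, dim=1)
    return probs.max(dim=1).values.mean().item()
```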

Results. Table [tab:main_experiment_iid] and Table [tab:main_experiment_cont] show the results in the fully and continual TTA settings. We observe that none of the baselines could reliably predict the accuracy across different scenarios. On the other hand, AETTA achieves the lowest mean absolute error, including in adaptation failure cases (e.g., TENT in continual CIFAR10/100-C). On average, AETTA outperforms the baselines by 19.8%p, validating the effectiveness of our robust prediction disagreement in diverse scenarios. More details are in Appendix 13.

Qualitative Analysis. We qualitatively analyze the results of the baselines and AETTA to understand their behavior. Figure 5 visualizes the ground-truth accuracy and the estimated accuracy from the baselines and AETTA under adaptation failure and non-failure cases. A Gaussian filter is applied for visualization. We observe that AETTA generally provides a reliable estimate of the ground-truth accuracy across diverse scenarios (fully and continual) and datasets (CIFAR10/100-C and ImageNet-C). SrcValid correctly estimated when model accuracy decreases; however, it consistently predicted high accuracy when the adaptation did not fail. This limitation might be due to the distributional gap between source and target data. SoftmaxScore [25] captures the trend of the ground-truth accuracy in some cases, but it overestimates the accuracy when the model accuracy drops, mostly due to the over-confident predictions from the model. GDE [19] constantly predicted high values across different TTA methods. Note that GDE was originally designed to utilize various pre-trained models; to use GDE in TTA, we utilize adapted models sampled at different stages of adaptation. The result suggests that multiple models from a single stochastic learning process might not be sufficient to form an independent and identically distributed (i.i.d.) ensemble, leading to inaccurate estimation. AdvPerturb [26] provides reasonable estimates when the ground-truth accuracy decreases but shows high errors in other cases. We believe this happens because it aims to evaluate the performance of the source model, not the adapted model. We observed similar patterns with different TTA methods.

Figure 6: Impact of hyperparameters on the accuracy estimation performance. (a) Number of dropout inferences \(N\). (b) Scaling hyperparameter \(\alpha\).

Impact of Hyperparameter \(N\). The number of dropout inferences, \(N\), is a hyperparameter for calculating the test error. We conducted an ablation study on continual CIFAR100-C with varying \(N \in \{5, 10, 15\}\). As shown in Figure 6 (a), we found that the effect of the hyperparameter \(N\) is negligible. We attribute this to calculating the prediction disagreement over a sufficiently large batch under dropout independence, which reduces the probabilistic variance of dropout inference sampling. We adopt \(N=10\) for the other experiments.

Impact of Hyperparameter \(\alpha\). We investigate the impact of \(\alpha\), a hyperparameter that controls the strength of robust confidence-prediction calibration. We conduct an ablation study on continual CIFAR100-C with varying \(\alpha \in \{0, \dots, 5\}\), where \(\alpha = 0\) indicates no weighting, thus \(\tt Err = PDD\). Figure 6 (b) shows the result. Note that estimations are often inaccurate when \(\alpha = 0\), which shows the importance of our robust equality. Setting a reasonable \(\alpha\) is important to properly predict failed adaptation cases (TENT and EATA), but the estimation is generally robust beyond a certain value. We adopt \(\alpha = 3\) for the other experiments.

5 Case Study: Model Recovery↩︎

Table 1: Average accuracy improvement (%p) with model recovery. Bold number is the highest improvement. Averaged over three different random seeds for 15 types of corruption.

Method | TENT [11] | EATA [13] | SAR [7] | CoTTA [9] | RoTTA [10] | SoTTA [12] | Avg. (\(\uparrow\))
Episodic [31] | 33.58 ± 1.04 | 51.28 ± 0.52 | -7.00 ± 0.26 | 1.65 ± 0.10 | -22.57 ± 0.85 | -26.40 ± 0.51 | 5.09 ± 0.24
MRS [7] | 24.12 ± 2.11 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | -1.97 ± 2.23 | 0.00 ± 0.00 | 3.69 ± 0.22
Stochastic [9] | 35.93 ± 0.78 | -0.01 ± 0.47 | -2.00 ± 0.48 | 0.00 ± 0.00 | -2.55 ± 0.49 | 0.35 ± 0.51 | 5.29 ± 0.19
FisherStochastic [32] | 40.27 ± 1.29 | 0.12 ± 1.16 | -4.85 ± 0.13 | 0.13 ± 0.03 | -2.89 ± 0.13 | -1.36 ± 0.51 | 5.24 ± 0.29
DistShift | 38.93 ± 1.15 | 22.17 ± 2.38 | -3.25 ± 0.10 | 1.51 ± 0.09 | -7.63 ± 0.23 | 0.68 ± 0.19 | 8.74 ± 0.55
AETTA | 36.79 ± 1.20 | 48.64 ± 0.74 | -5.66 ± 0.20 | 1.64 ± 0.11 | -6.03 ± 0.89 | -4.97 ± 1.58 | 11.73 ± 0.34

The deployment of TTA algorithms encounters a significant challenge when exposed to extreme test streams, such as continuously changing corruptions [9]. Several TTA algorithms (e.g., TENT [11]) were not designed to be robust under such extreme conditions. Consequently, the model weights are poorly updated, leading to performance degradation, even below that of the source model. Although recent studies attempt to manage dynamic test streams [6], [9], [12], TTA algorithms are still susceptible to adaptation failures [14]. To tackle this issue, we perform a case study of model recovery based on accuracy estimation.
Recovery Algorithm. We introduce a simple reset algorithm based on our accuracy estimation with AETTA. Our reset algorithm detects two cases: (1) consecutive low accuracies and (2) a sudden accuracy drop. First, we reset the model if the five most recent consecutive estimated accuracies (i.e., \(t-4, \cdots, t\)) are lower than the five preceding consecutive estimates (i.e., \(t-9, \cdots, t-5\)). This way, we can detect the gradual degradation of TTA accuracy. Second, we apply hard lower-bound thresholding, which resets the model if the estimated accuracy is below a threshold (e.g., 0.2). This could prevent catastrophic failure of TTA algorithms.
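A minimal sketch of this reset rule follows; interpreting "lower" as a lower mean over each five-batch window and the bookkeeping class itself are our assumptions.

```python
from collections import deque

class RecoveryMonitor:
    """Reset heuristic driven by per-batch accuracy estimates (sketch)."""

    def __init__(self, window: int = 5, low_acc: float = 0.2):
        self.window = window
        self.low_acc = low_acc
        self.history = deque(maxlen=2 * window)          # last 10 estimates

    def should_reset(self, est_acc: float) -> bool:
        self.history.append(est_acc)
        # Case (2): sudden drop below the hard lower bound.
        if est_acc < self.low_acc:
            self.history.clear()
            return True
        # Case (1): the 5 most recent estimates are lower (on average) than the previous 5.
        if len(self.history) == 2 * self.window:
            older = list(self.history)[:self.window]
            recent = list(self.history)[self.window:]
            if sum(recent) / self.window < sum(older) / self.window:
                self.history.clear()
                return True
        return False
```

When should_reset returns True, the model weights and the optimizer state would be restored to their source initialization, as described in Appendix 11.4.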
Baselines. Some TTA studies have covered model recovery/reset as a part of the TTA algorithm: Episodic resetting (Episodic) [31], where the model resets after every batch; Model Recovery Scheme (MRS) [7], where the model resets when the moving average of the entropy loss falls below a certain threshold; Stochastic restoration (Stochastic) [9], where a small number of model weights are stochastically restored to the initial weights of the source model; and Fisher-information-based restoration (FisherStochastic) [32], which applies stochastic restoration according to layer importance measured by the Fisher information matrix. We also include a baseline (DistShift), which assumes that the model knows when the distribution changes and thus acts as an oracle. DistShift resets the model when the test data distribution (corruption) changes, which is not feasible in practice.

Results. Our simple recovery algorithm outperforms the baselines, including DistShift, which relies on the impractical assumption of knowing when the corruption changes. Episodic [31] showed high accuracy improvements under adaptation failures; however, it prevents continuous adaptation even when there are no failures. MRS [7] fails to recover across various TTA methods due to its hard-coded loss threshold. Stochastic [9] and FisherStochastic [32] show marginal improvements while failing to recover EATA. Our proposed reset algorithm successfully recovers from adaptation failures while minimizing the negative effect on TTA without failures.

Figure 7: An example of model recovery compared with DistShift. Reset points are marked over the x-axis.

Qualitative Analysis. Figure 7 shows an example of our model recovery compared with DistShift. Notably, our recovery algorithm resets only when an accuracy degradation trend is detected. On the other hand, DistShift failed to recover in the early steps since it resets the model only on distribution shifts. This implies that estimating performance degradation is more beneficial than knowing when the domain changes to improve TTA performance.

6 Related Work↩︎

Test-Time Adaptation. Recent progress in the field of test-time adaptation (TTA) has focused on improving model robustness [6][10], [12], [13] and addressing novel forms of domain shifts [6], [9], [12]. On the other hand, an analysis [14] pointed out that conventional TTA approaches remain prone to adaptation failures and demonstrated the importance of model recovery. In alignment with this insight, our work not only showcases the feasibility of accuracy estimation for TTA but also investigates a promising model recovery solution to enhance the robustness of TTA.
Accuracy Estimation. Existing accuracy estimation approaches mainly focus on the ensemble of multiple pre-trained models [15][19]. Accuracy-on-the-line [16] and Agreement-on-the-line [15] have demonstrated a notable linear relationship between performances in a wide range of models and distribution shifts, relying on the consistency of model predictions between in-distribution (ID) and out-of-distribution (OOD) data. The Difference of Confidence (DoC) [18] leverages differences in the model’s confidence between ID and OOD data to estimate the accuracy gap under distribution shifts for calculating the final OOD accuracy. Self-training ensemble [17] estimates the accuracy of the pre-trained classifier by iteratively learning an ensemble of models with a training dataset, unlabeled test dataset, and wrongly classified samples. All these methods require labeled ID data to estimate OOD accuracy. To our knowledge, no existing studies target the accuracy estimation in TTA where source data and labels are unavailable.

7 Conclusion↩︎

We proposed a label-free TTA performance estimation method without access to source data and target labels. Based on the dropout inference sampling, we proposed calculating the prediction disagreement to estimate the TTA accuracy. We further improved the method with robust disagreement equality by utilizing the batch-aggregated distribution to penalize skewed predictions. Our method outperformed the baselines in diverse scenarios and datasets. Finally, our case study of model recovery showed the practicality of accuracy estimation. Our findings suggest that accuracy estimation is not only feasible but also a valuable tool in advancing the field of TTA without the need for labeled data.

Acknowledgements↩︎

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2022-0-00495, On-Device Voice Phishing Call Detection).

8 Proof of Theorems↩︎

8.1 Proof of Theorem 1↩︎

We start expanding test error \({\tt Err}\) with few modifications from GDE [19]: \[\begin{align} &\mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt Err}_{\mathcal{D^T}}(h)]&\\ &\triangleq \mathbb{E}_{{\mathcal{H}}_{\mathcal{A}}} [\mathbb{E}_{\mathcal{D^T}} [ \mathbb{1} ( h({X}; \Theta) \neq {Y} ) ]] & \\ &= \mathbb{E}_{\mathcal{D^T}} [ \mathbb{E}_{{\mathcal{H}}_{\mathcal{A}}} [\mathbb{1} ( h({X}; \Theta) \neq {Y} )] & \text{(exchanging expectations)} \\ &= \mathbb{E}_{\mathcal{D^T}} [ 1 - \tilde{h}_{{Y}} ({X}) ]\\ &= \sum^{K-1}_{k=0} \int_{\mathbf{x}}^{} (1 - \tilde{h}_k(\mathbf{x})) \; p({X} = \mathbf{x}, {Y} = k) d\mathbf{x} & (\text{by definition of expectation}) \\ &= \int_{\boldsymbol{q} \in \Delta^K} \sum^{K-1}_{k=0} \int_{\mathbf{x}}^{} (1 - \tilde{h}_k(\mathbf{x})) \; p({X} = \mathbf{x}, {Y} = k, \tilde{h}({X}) = \boldsymbol{q}) d\mathbf{x} d\boldsymbol{q} & (\text{introducing} \; \tilde{h} \; \text{as a r.v.}) \\ &= \int_{\boldsymbol{q} \in \Delta^K} \sum^{K-1}_{k=0} \int_{\mathbf{x}}^{} (1 - \tilde{h}_k(\mathbf{x})) \; p({Y} = k, \tilde{h}({X}) = \boldsymbol{q}) p({X} = \mathbf{x} | {Y} = k, \tilde{h}({X}) = \boldsymbol{q}) d\mathbf{x} d\boldsymbol{q} & \\ &= \int_{\boldsymbol{q} \in \Delta^K} \sum^{K-1}_{k=0} p({Y} = k, \tilde{h}({X}) = \boldsymbol{q}) \int_{\mathbf{x}}^{} (1 - \underbrace{\tilde{h}_k(\mathbf{x})}_{= q_k}) \; p({X} = \mathbf{x} | {Y} = k, \tilde{h}({X}) = \boldsymbol{q}) d\mathbf{x} d\boldsymbol{q} & \\ &= \int_{\boldsymbol{q} \in \Delta^K} \sum^{K-1}_{k=0} p({Y} = k, \tilde{h}({X}) = \boldsymbol{q}) \int_{\mathbf{x}}^{} \underbrace{(1 - q_k)}_{\text{constant w.r.t.} \int_{\mathbf{x}}} \; p({X} = \mathbf{x} | {Y} = k, \tilde{h}({X}) = \boldsymbol{q}) d\mathbf{x} d\boldsymbol{q} & \\ &= \int_{\boldsymbol{q} \in \Delta^K} \sum^{K-1}_{k=0} p({Y} = k, \tilde{h}({X}) = \boldsymbol{q}) (1 - q_k) \underbrace{\int_{\mathbf{x}}^{} \; p({X} = \mathbf{x} | {Y} = k, \tilde{h}({X}) = \boldsymbol{q}) d\mathbf{x}}_{=1} d\boldsymbol{q} & \\ &= \int_{\boldsymbol{q} \in \Delta^K} \sum^{K-1}_{k=0} p({Y} = k, \tilde{h}({X}) = \boldsymbol{q}) (1 - q_k) d\boldsymbol{q} & \\ &= \int_{q \in [0, 1]} \sum^{K-1}_{k=0} p({Y} = k, \tilde{h}_k({X}) = q) (1 - q) dq & (\text{refer \cite{disagreement}}) \\ &= \int_{q \in [0, 1]} \sum^{K-1}_{k=0} \underbrace{p({Y} = k | \tilde{h}_k({X}) = q)}_{= q} p(\tilde{h}_k({X}) = q) (1 - q) dq & \\ &= \int_{q \in [0, 1]} q (1 - q) \sum^{K-1}_{k=0} p(\tilde{h}_k({X}) = q) dq.& (\text{confidence-prediction calibration})~\label{eq:err} \end{align}\tag{4}\] Then, we expand the prediction disagreement with dropouts (\({\tt PDD}\)) from its definition: \[\begin{align} &\mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt PDD}_{\mathcal{D^T}}(h)]&\\ &\triangleq \mathbb{E}_{{\mathcal{H}}_{\mathcal{A}}} \left[\mathbb{E}_{\mathcal{D^T}} \left[ \frac{1}{N} \sum_{i=1}^N \mathbb{1} [ h({X}; \Theta) \neq h({X}; \Theta^{{\tt dropout}_i}) ] \right] \right]&\\ &= \mathbb{E}_{\mathcal{D^T}} \left[\mathbb{E}_{{\mathcal{H}}_{\mathcal{A}}} \left[ \frac{1}{N} \sum_{i=1}^N \mathbb{1} [ h({X}; \Theta) \neq h({X}; \Theta^{{\tt dropout}_i}) ] \right] \right]&(\text{exchanging expectations})\\ &= \mathbb{E}_{\mathcal{D^T}} \left[ \mathbb{E}_{{\mathcal{H}}_{\mathcal{A}}} \left[ \frac{1}{N} \sum_{i=1}^N \sum_{k=0}^{K-1} \mathbb{1} [h({X}; \Theta) = k] (1 - \mathbb{1} [h({X}; \Theta^{{\tt dropout}_i}) = k]) \right] \right] &\\ &= \mathbb{E}_{\mathcal{D^T}} \left[ \sum_{k=0}^{K-1} \frac{1}{N} \sum_{i=1}^N \mathbb{E}_{{\mathcal{H}}_{\mathcal{A}}} \left[ \mathbb{1} [h({X}; \Theta) = 
k] (1 - \mathbb{1} [h({X}; \Theta^{{\tt dropout}_i}) = k]) \right] \right] &\\ &= \mathbb{E}_{\mathcal{D^T}} \left[ \sum_{k=0}^{K-1} \mathbb{E}_{{\mathcal{H}}_{\mathcal{A}}} \left[\mathbb{1} [h({X}; \Theta) = k] \right] \mathbb{E}_{{\mathcal{H}}_{\mathcal{A}}} [1 - \frac{1}{N} \sum_{i=1}^N \mathbb{1} [h({X}; \Theta^{{\tt dropout}_i}) = k]] \right] & (\text{Dropout independence (Definition~\ref{def:efa})})\\ &= \mathbb{E}_{\mathcal{D^T}} \left[ \sum_{k=0}^{K-1} \tilde{h}_k ({X}) (1 - \tilde{h}_k ({X}) ) \right]&\\ &= \int_{\mathbf{x}} \sum_{k=0}^{K-1} \tilde{h}_k (\mathbf{x}) (1 - \tilde{h}_k (\mathbf{x})) p({X} = \mathbf{x}) d\mathbf{x} & (\text{by definition of expectation})\\ &= \int_{\boldsymbol{q} \in \Delta^K} \int_{\mathbf{x}}^{} \sum^{K-1}_{k=0} {\tilde{h}_k (\mathbf{x}) (1 - \tilde{h}_k (\mathbf{x}))} p\left({X} = \mathbf{x}, \tilde{h} ({X}) = \boldsymbol{q} \right) d\mathbf{x} d\boldsymbol{q} &(\text{introducing} \; \tilde{h} \; \text{as a r.v.})\\ &= \int_{\boldsymbol{q} \in \Delta^K} p (\tilde{h} ({X}) = \boldsymbol{q}) \int_{\mathbf{x}}^{} \sum^{K-1}_{k=0} \underbrace{\tilde{h}_k (\mathbf{x}) (1 - \tilde{h}_k (\mathbf{x}))}_{\tilde{h}_k (\mathbf{x}) = q_k} p\left( {X} = \mathbf{x} | \tilde{h} ({X}) = \boldsymbol{q} \right) d\mathbf{x} d\boldsymbol{q} & \\ &= \int_{\boldsymbol{q} \in \Delta^K} p (\tilde{h} ({X}) = \boldsymbol{q}) \int_{\mathbf{x}}^{} \underbrace{\sum^{K-1}_{k=0}}_{\text{bring to the front}} q_k (1 - q_k) p( {X} = \mathbf{x} | \tilde{h} ({X}) = \boldsymbol{q} ) d\mathbf{x} d\boldsymbol{q} &\\ &= \sum^{K-1}_{k=0} \int_{\boldsymbol{q} \in \Delta^K} p (\tilde{h} ({X}) = \boldsymbol{q}) \int_{\mathbf{x}}^{} \underbrace{q_k (1 - q_k)}_{\text{constant w.r.t.} \; \int_{\mathbf{x}}} p( {X} = \mathbf{x} | \tilde{h} ({X}) = \boldsymbol{q} ) d\mathbf{x} d\boldsymbol{q} &\\ &= \underbrace{\sum^{K-1}_{k=0} \int_{\boldsymbol{q} \in \Delta^K}}_{\text{swap}} p (\tilde{h} ({X}) = \boldsymbol{q}) q_k (1 - q_k) \underbrace{\int_{\mathbf{x}}^{} p( {X} = \mathbf{x} | \tilde{h} ({X}) = \boldsymbol{q} ) d\mathbf{x}}_{=1} d\boldsymbol{q} &\\ &= \int_{\boldsymbol{q} \in \Delta^K} \sum^{K-1}_{k=0} q_k (1 - q_k) p (\tilde{h} ({X}) = \boldsymbol{q}) d\boldsymbol{q} &\\ &= \int_{q \in [0, 1]} q (1 - q) \sum^{K-1}_{k=0} p (\tilde{h}_k ({X}) = q) dq. & (\text{refer~\cite{disagreement}})~\label{eq:pda} \end{align}\tag{5}\] Equation 4 is equivalent to Equation 5 : \[\begin{align} & \mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt Err}_{\mathcal{D^T}}(h)] = \mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt PDD}_{\mathcal{D^T}}(h)],& \end{align}\] which concludes the proof of Theorem 1.

8.2 Proof of Theorem 2↩︎

From robust confidence-prediction calibration, the over-confident model’s conditional probability of the major class \(k'\) is scaled by \(a\), while other classes’ conditional probabilities are equally scaled up by \(b\). Then, Equation 4 now becomes: \[\begin{align} &\mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt Err}_{\mathcal{D^T}}(h)] &\\ &= \int_{q \in [0, 1]} \sum^{K-1}_{k=0} {p({Y} = k | \tilde{h}_k({X}) = q)} p(\tilde{h}_k({X}) = q) (1 - q) dq & \\ &= \int_{q \in [0, 1]} {p({Y} = k' | \tilde{h}_{k'}({X}) = q)} p(\tilde{h}_{k'}({X}) = q) (1 - q) + \sum^{}_{k \neq k'} {p({Y} = k | \tilde{h}_k({X}) = q)} p(\tilde{h}_k({X}) = q) (1 - q) dq & \\ &= \int_{q \in [0, 1]} a q \; p(\tilde{h}_{k'}({X}) = q) (1 - q) + \sum^{}_{k \neq k'} b q \; p(\tilde{h}_k({X}) = q) (1 - q) dq & \nonumber \\ & \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\;\; (\text{robust confidence-prediction calibration}) \\ &= \int_{q \in [0, 1]} a q (1 - q) \; p(\tilde{h}_{k'}({X}) = q) dq \; + \; b \int_{q \in [0, 1]} \sum^{}_{k \neq k'} q (1 - q) \; p(\tilde{h}_k({X}) = q) dq. &~\label{eq:rewrite} \end{align}\tag{6}\] We rewrite Equation 6 as: \[\begin{align} & \int_{q \in [0, 1]} \sum^{}_{k \neq k'} q (1 - q) \; p(\tilde{h}_k({X}) = q) dq = \frac{1}{b} \mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt Err}_{\mathcal{D^T}}(h)] - \int_{q \in [0, 1]} \frac{a}{b} \; q (1 - q) \; p(\tilde{h}_{k'}({X}) = q) dq . &~\label{eq:subst} \end{align}\tag{7}\] Then, we rewrite PDD (Equation 5 ): \[\begin{align} &\mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt PDD}_{\mathcal{D^T}}(h)] &\\ &= \int_{q \in [0, 1]} q (1 - q) \sum^{K-1}_{k=0} p (\tilde{h}_k ({X}) = q) dq&\\ &= \int_{q \in [0, 1]} q (1 - q) \; p (\tilde{h}_{k'} ({X}) = q) + q (1 - q) \sum^{}_{k \neq k'} p (\tilde{h}_k ({X}) = q) dq&\\ &= \int_{q \in [0, 1]} q (1 - q) \; p (\tilde{h}_{k'} ({X}) = q) dq \; + \; \int_{q \in [0, 1]} q (1 - q) \sum^{}_{k \neq k'} p (\tilde{h}_k ({X}) = q) dq&\\ &= \int_{q \in [0, 1]} \frac{b-a}{b} \; q (1 - q) \; p (\tilde{h}_{k'} ({X}) = q) dq \; + \; \frac{1}{b} \mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt Err}_{\mathcal{D^T}}(h)]. & (\text{Equation~\ref{eq:subst}}) \end{align}\] Finally, we obtain the equality between \(\tt Err\) and \(\tt PDD\): \[\begin{align} & \mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt Err}_{\mathcal{D^T}}(h)] & \\ & = b \; \mathbb{E}_{h \sim {\mathcal{H}}_{\mathcal{A}}} [{\tt PDD}_{\mathcal{D^T}}(h)] - \int_{q \in [0, 1]} {(b-a)} \; q (1 - q) \; p (\tilde{h}_{k'} ({X}) = q) dq, & \end{align}\] which concludes the proof of Theorem 2. Note that without weighting (\(a=b=1\)), the result is identical to Theorem 1.

9 Additional Experiments↩︎

9.0.0.1 GDE with multiple pre-trained models.

We compare AETTA with the original version of GDE (denoted as GDE*), utilizing multiple pre-trained models with access to training data. We report the result in Table [tab:supp_gde]. Due to the misalignment of confidence-prediction calibration, GDE* underperforms AETTA even with full access to source data.

Table [tab:supp_gde]: Comparison with the original GDE (GDE*) using multiple pre-trained models (mean absolute error, %p).

Dataset | Method | TENT [11] | EATA [13] | SAR [7] | CoTTA [9] | RoTTA [10] | SoTTA [12] | Avg. (\(\downarrow\))
Continual CIFAR100-C | GDE* [19] | 14.54 ± 8.14 | 4.11 ± 2.43 | 7.27 ± 0.16 | 9.89 ± 0.29 | 7.44 ± 0.13 | 5.79 ± 0.23 | 8.17 ± 0.87
Continual CIFAR100-C | AETTA | 5.85 ± 0.36 | 4.18 ± 0.82 | 6.67 ± 0.12 | 6.55 ± 0.17 | 5.86 ± 0.10 | 5.32 ± 0.18 | 5.74 ± 0.13

9.0.0.2 ImageNet-R.

To demonstrate the dataset generality of AETTA, we report the accuracy estimation result on ResNet18 architecture on ImageNet-R (Table [tab:imagenetR]). AETTA outperformed the baselines in all TTA methods, showing AETTA is applicable in various datasets (e.g., CIFAR10-C, CIFAR100-C, ImageNet-C, and ImageNet-R).

Table [tab:imagenetR]: Accuracy estimation error (mean absolute error, %p) on ImageNet-R. Lower is better.

Dataset | Method | TENT [11] | EATA [13] | SAR [7] | CoTTA [9] | RoTTA [10] | SoTTA [12] | Avg. (\(\downarrow\))
ImageNet-R | SrcValid | 37.00 ± 0.14 | 37.91 ± 0.18 | 36.58 ± 0.24 | 35.05 ± 0.16 | 34.43 ± 0.05 | 37.82 ± 0.41 | 36.46 ± 0.05
ImageNet-R | SoftmaxScore [25] | 10.79 ± 0.17 | 13.87 ± 0.09 | 15.02 ± 0.14 | 14.76 ± 0.07 | 14.08 ± 0.04 | 12.25 ± 0.42 | 13.46 ± 0.06
ImageNet-R | GDE [19] | 62.81 ± 0.12 | 61.36 ± 0.14 | 63.27 ± 0.18 | 64.86 ± 0.10 | 62.64 ± 0.12 | 55.23 ± 0.29 | 61.70 ± 0.02
ImageNet-R | AdvPerturb [26] | 13.42 ± 0.28 | 16.04 ± 0.36 | 17.90 ± 0.28 | 21.19 ± 0.22 | 31.12 ± 0.12 | 9.91 ± 0.73 | 18.26 ± 0.03
ImageNet-R | AETTA | 8.02 ± 0.12 | 6.87 ± 0.08 | 7.06 ± 0.19 | 7.07 ± 0.11 | 8.63 ± 0.19 | 6.79 ± 0.29 | 7.41 ± 0.05

9.0.0.3 ResNet50.

To demonstrate the model generality of AETTA, we report the accuracy estimation result on ResNet50 architecture on ImageNet-C (Table [tab:resnet50]). AETTA outperformed the baselines in general, showing AETTA is applicable to diverse model architectures.

Table [tab:resnet50]: Accuracy estimation error (mean absolute error, %p) with ResNet50 on ImageNet-C. Lower is better.

Dataset | Method | TENT [11] | EATA [13] | SAR [7] | CoTTA [9] | RoTTA [10] | SoTTA [12] | Avg. (\(\downarrow\))
Fully ImageNet-C | SrcValid | 46.46 ± 0.15 | 34.19 ± 0.67 | 30.35 ± 0.75 | 46.47 ± 0.17 | 12.28 ± 0.11 | 19.28 ± 0.18 | 31.50 ± 0.26
Fully ImageNet-C | SoftmaxScore [25] | 23.58 ± 0.03 | 24.14 ± 0.04 | 26.22 ± 0.05 | 23.57 ± 0.04 | 24.32 ± 0.02 | 17.87 ± 0.24 | 23.28 ± 0.03
Fully ImageNet-C | GDE [19] | 68.39 ± 0.04 | 57.08 ± 0.08 | 55.81 ± 0.06 | 68.36 ± 0.04 | 58.69 ± 0.09 | 48.15 ± 0.33 | 59.41 ± 0.04
Fully ImageNet-C | AdvPerturb [26] | 12.77 ± 0.04 | 21.16 ± 0.05 | 23.66 ± 0.08 | 12.77 ± 0.05 | 16.44 ± 0.00 | 25.28 ± 0.32 | 18.68 ± 0.05
Fully ImageNet-C | AETTA | 6.14 ± 0.05 | 9.15 ± 0.03 | 8.50 ± 0.07 | 6.15 ± 0.04 | 28.28 ± 0.03 | 36.90 ± 0.36 | 15.85 ± 0.09
Continual ImageNet-C | SrcValid | 46.38 ± 0.10 | 35.83 ± 0.74 | 24.35 ± 1.86 | 46.46 ± 0.22 | 13.79 ± 0.16 | 5.12 ± 0.29 | 28.65 ± 0.46
Continual ImageNet-C | SoftmaxScore [25] | 23.58 ± 0.03 | 21.34 ± 0.06 | 16.64 ± 0.25 | 23.61 ± 0.01 | 19.99 ± 0.25 | 51.60 ± 0.75 | 26.13 ± 0.12
Continual ImageNet-C | GDE [19] | 68.36 ± 0.03 | 58.41 ± 0.14 | 60.20 ± 0.24 | 68.38 ± 0.01 | 68.98 ± 0.52 | 86.08 ± 0.36 | 68.40 ± 0.09
Continual ImageNet-C | AdvPerturb [26] | 12.80 ± 0.04 | 19.82 ± 0.12 | 21.50 ± 0.14 | 12.77 ± 0.02 | 13.77 ± 0.35 | 4.79 ± 0.17 | 14.24 ± 0.07
Continual ImageNet-C | AETTA | 6.15 ± 0.05 | 10.81 ± 0.01 | 6.41 ± 0.08 | 6.00 ± 0.04 | 14.90 ± 0.30 | 4.21 ± 0.12 | 8.08 ± 0.04

10 Discussion↩︎

10.0.0.1 Potential Societal Impact.

The computational overheads associated with test-time adaptation (TTA) could raise environmental concerns, particularly regarding carbon emissions. Our algorithm introduces \(N\) extra model inferences for accuracy estimation. Importantly, our approach of utilizing dropout inference is computationally lightweight compared to baseline methods involving model retraining [19] and adversarial backpropagation [26]. Recent advancements, such as the memory-economic TTA [33], are anticipated to tackle these challenges effectively. This implies that, despite the computational demands, the environmental impact of our approach could be mitigated by integrating emerging strategies for resource-efficient TTA implementations.

10.0.0.2 Limitations and Future Directions.

Our research investigates the possibility of accuracy estimation for TTA with only unlabeled data. A promising direction for further improvements is the (1) optimization of the weighting constant \(b\) (or corresponding \(a\)), which stands to fine-tune the calibration process, and (2) estimation of the variable \(C\) for more precise error estimates. Also, we presented a case study on model recovery to demonstrate the practicality of accuracy estimation. While we chose a heuristic method to reset the model for the simplicity of analysis, there exists room for improvement to be more effective. Beyond model recovery, we also envision the potential of accuracy estimation in broader applications, such as model refinement and maintenance processes, and enhancing the dynamics of human-AI interactions, which we leave as future work.

11 Experiment Details↩︎

We conducted all experiments under three random seeds (0, 1, 2) and reported the average values with standard deviations. The experiments were performed on NVIDIA GeForce RTX 3090 and NVIDIA TITAN RTX GPUs.

11.1 Accuracy Estimation Details↩︎

11.1.0.1 AETTA (Ours).

We used the number of dropout inference samples \(N=10\) and the prediction disagreement weighting hyperparameter \(\alpha=3\) for all experiments. The maximum entropy for the model, \(E^{\tt{max}}\), is calculated as \(E^{\tt{max}} = {\tt Ent} (\vec{1}_K / K)\), where \(K\) is the number of classes and \(\vec{1}_K\) is the all-ones vector of size \(K\); this results in 2.3, 4.6, and 6.9 for 10, 100, and 1,000 classes. We applied a Dropout module for each residual block layer in ResNet18 [1], where the dropout rate is 0.4, 0.3, and 0.2 for 10, 100, and 1,000 classes, following previous studies that apply different hyperparameters for different numbers of classes [9], [12], [13].
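One possible way to attach such Dropout modules to each residual block of a torchvision ResNet18 is sketched below; the exact placement in our code may differ, so treat this as an illustration rather than the official implementation.

```python
import torch.nn as nn
from torchvision.models import resnet18

def add_block_dropout(model: nn.Module, p: float) -> nn.Module:
    """Append an nn.Dropout after every residual block of ResNet18 (sketch;
    p = 0.4 / 0.3 / 0.2 for 10 / 100 / 1,000 classes)."""
    for name in ["layer1", "layer2", "layer3", "layer4"]:
        layer = getattr(model, name)
        blocks = [nn.Sequential(block, nn.Dropout(p)) for block in layer]
        setattr(model, name, nn.Sequential(*blocks))
    return model

model = add_block_dropout(resnet18(num_classes=100), p=0.3)  # e.g., CIFAR100-C
```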

11.1.0.2 SrcValid.

For SrcValid, we used labeled source-domain validation data and calculated the accuracy. We used 1,000 random samples from the validation set of the source dataset.

11.1.0.3 SoftmaxScore.

For SoftmaxScore [25], we utilized the average softmax score of the current test batch as the estimated accuracy. We additionally applied temperature scaling [29] with a temperature value of \(T=2\), which showed the best estimation performance on CIFAR10-C.

11.1.0.4 GDE.

For generalization disagreement equality (GDE) [19], we calculated the (dis)agreement rate between predictions of the test batch over a pair of models. Unlike the setting in domain adaptation of utilizing multiple pre-trained models, we utilized the models in different adaptation stages. Specifically, we compared the two models: (1) the currently adapted model and (2) the previous model right before the adaptation. This follows the suggestion that utilizing only two models is sufficient to calculate disagreement [19].
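For reference, a sketch of this agreement computation between the current and previous models is shown below; the snapshot bookkeeping is an assumption about how the pair is maintained, not the authors' exact code.

```python
import copy
import torch

def gde_estimate(current_model, previous_model, x) -> float:
    """Accuracy estimate as the agreement rate between the adapted model and
    a snapshot taken before the most recent adaptation step (sketch)."""
    current_model.eval()
    previous_model.eval()
    with torch.no_grad():
        agree = (current_model(x).argmax(dim=1) ==
                 previous_model(x).argmax(dim=1)).float().mean()
    return agree.item()

# Illustrative bookkeeping inside a TTA loop:
#   snapshot = copy.deepcopy(model)     # model state before adapting on the batch
#   adapt(model, x)                     # hypothetical adaptation step
#   est_acc = gde_estimate(model, snapshot, x)
```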

11.1.0.5 AdvPerturb.

Adversarial perturbation [26] estimates the source model accuracy by calculating the agreement between the domain-adapted and source models by applying adversarial perturbation on the source model side. In the TTA setting, we compared the test-time-adapted model with the source model and applied the FGSM [34] adversarial attack with attack size following the original paper (\(\epsilon = 1/255\)).

11.2 TTA Method Details↩︎

In this study, we followed the official implementation of TTA methods. To maintain consistency, we adopted the optimal hyperparameters reported in the corresponding papers or source code repositories. We also provide additional implementation details and the use of hyperparameters if not specified in the original paper or the source code.

11.2.0.1 TENT.

For TENT [11], we configured the learning rate as \(LR = 0.001\) for CIFAR10-C/CIFAR100-C and \(LR = 0.00025\) for ImageNet-C, aligning with the guidelines outlined in the original paper. The implementation followed the official code.

11.2.0.2 EATA.

For EATA [13], we followed the original configuration of \(LR = 0.005/0.005/0.00025\) for CIFAR10-C/CIFAR100-C/ImageNet-C, entropy constant \(E_0 = 0.4 \times \ln K\), where \(K\) represents the number of classes. Additionally, we set the cosine sample similarity threshold \(\epsilon = 0.4/0.4/0.05\), trade-off parameter \(\beta = 1/1/2,000\), and moving average factor \(\alpha = 0.1\). The Fisher importance calculation involved 2,000 samples, as recommended. The implementation followed the official code.

11.2.0.3 SAR.

For SAR [7], we selected a batch size of 64 for fair comparisons. We set a learning rate of \(LR=0.00025\), sharpness threshold \(\rho = 0.5\), and entropy threshold \(E_0 = 0.4 \times \mathrm{ln} K\), following the recommendations from the original paper. The top layer (layer 4 for ResNet18) was frozen, consistent with the original paper. The implementation followed the official code.

11.2.0.4 CoTTA.

For CoTTA [9], we set the restoration factor \(p=0.01\), and exponential moving average (EMA) factor \(\alpha=0.999\). For augmentation confidence threshold \(p_{th}\), we followed the authors’ guidelines as \(p_{th}=0.92\) for CIFAR10-C, \(p_{th}=0.72\) for CIFAR100-C, and \(p_{th}=0.1\) for ImageNet-C. The implementation followed the official code.

11.2.0.5 RoTTA.

For RoTTA [10], we utilized the Adam optimizer [35] with a learning rate of \(LR = 0.001\) and \(\beta = 0.9\). We followed the original hyperparameters, including BN-statistic exponential moving average updating rate \(\alpha = 0.05\), Teacher model’s exponential moving average updating rate \(\nu = 0.001\), timeliness parameter \(\lambda_t = 1.0\), and uncertainty parameter \(\lambda_u = 1.0\). The implementation followed the original code.

11.2.0.6 SoTTA.

For SoTTA [12], the Adam optimizer [35] was employed, featuring a BN momentum of \(m = 0.2\) and a learning rate of \(LR = 0.001\) with a single adaptation epoch. The memory size was set to 64, with the confidence threshold \(C_0\) configured as 0.99 for CIFAR10-C (10 classes), 0.66 for CIFAR100-C (100 classes), and 0.33 for ImageNet-C (1,000 classes). The entropy-sharpness L2-norm constraint \(\rho\) was set to 0.5, aligning with the suggestion [27]. The top layer was frozen following the original paper. The implementation followed the original code.

11.3 Experiment Setting Details↩︎

11.3.0.1 Datasets.

CIFAR10-C/CIFAR100-C/ImageNet-C [20] are the most widely used benchmarks for test-time adaptation (TTA) [6], [7], [9][13]. All datasets contain 15 corruption types, including Gaussian, Snow, Frost, Fog, Brightness, Contrast, Elastic Transformation, Pixelate, and JPEG Compression. Each corruption is applied in 5 levels of severity, where we adopt the highest severity level of 5. CIFAR10-C and CIFAR100-C consist of 50,000 train images and 10,000 test images for 10 and 100 classes. ImageNet-C consists of 1,281,167 train images and 50,000 test images for 1,000 classes.

11.3.0.2 Pre-Training.

We employed ResNet18 [1] as the backbone network. For CIFAR10-C/CIFAR100-C, we trained the model on the corresponding clean training set using stochastic gradient descent with a batch size of 128, a learning rate of 0.1, and a momentum of 0.9, with cosine annealing learning rate scheduling [36] for 200 epochs. For ImageNet-C, we used the pre-trained model from TorchVision [37].
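
The CIFAR pre-training recipe above corresponds to the sketch below; the training-loop body is omitted, and note that CIFAR variants of ResNet18 commonly modify the stem for 32×32 inputs, while this sketch uses the stock TorchVision model for brevity.

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)   # num_classes=100 for CIFAR100
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... one epoch over the clean training set with batch size 128 ...
    scheduler.step()
```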

11.3.0.3 Test-Time Adaptation.

For the fully TTA setting, each TTA method adapts to one corruption at a time. For the continual TTA setting, each TTA method continually adapts to the 15 corruptions in the predefined order of [Gaussian Noise, Shot Noise, Impulse Noise, Defocus Blur, Glass Blur, Motion Blur, Zoom Blur, Snow, Frost, Fog, Brightness, Contrast, Elastic Transformation, Pixelate, and JPEG Compression], following the previous study [9]. For all experiments, we used a batch size of 64, with a memory size of 64 for RoTTA [10] and SoTTA [12] for a fair comparison.
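
The continual setting can be summarized as the loop below, where the corruption order matches the list above; `make_loader` and `adapt_and_predict` are placeholders for the data pipeline and for one adaptation-and-prediction step of any TTA method.

```python
CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur", "glass_blur",
    "motion_blur", "zoom_blur", "snow", "frost", "fog", "brightness", "contrast",
    "elastic_transform", "pixelate", "jpeg_compression",
]

def continual_tta(model, make_loader, adapt_and_predict, batch_size=64):
    """Adapt to all 15 corruptions back-to-back without resetting the model."""
    predictions = {}
    for corruption in CORRUPTIONS:
        loader = make_loader(corruption, batch_size=batch_size)
        # Labels are never used for adaptation; only the test inputs are seen.
        predictions[corruption] = [adapt_and_predict(model, x) for x, _ in loader]
    return predictions
```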

11.4 Model Recovery Details (Section 5)↩︎

11.4.0.1 AETTA (Ours).

With AETTA, our reset algorithm detects two cases: (1) consecutive low accuracies and (2) sudden accuracy drops. For consecutive low accuracies, we use the estimated accuracies of the most recent five batches. For hard lower-bound thresholding, we use a threshold value of 0.2. Upon reset, we restore the model's weights to those of the source model and the optimizer's state to its initial value.
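
The two triggers above can be sketched as follows; this is an illustrative simplification rather than the exact rule in our released code, and the sudden-drop margin is an assumed value.

```python
from collections import deque

class ResetDetector:
    """Fire a reset when (1) the estimated accuracies of the last `window`
    batches all stay below the hard lower bound, or (2) the newest estimate
    drops sharply below the running window average."""

    def __init__(self, window: int = 5, lower_bound: float = 0.2, drop_margin: float = 0.3):
        self.history = deque(maxlen=window)
        self.lower_bound = lower_bound
        self.drop_margin = drop_margin

    def should_reset(self, estimated_acc: float) -> bool:
        previous = list(self.history)
        self.history.append(estimated_acc)
        full = len(self.history) == self.history.maxlen
        consecutive_low = full and max(self.history) < self.lower_bound
        sudden_drop = bool(previous) and (
            sum(previous) / len(previous) - estimated_acc > self.drop_margin
        )
        if consecutive_low or sudden_drop:
            self.history.clear()
            return True
        return False
```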

11.4.0.2 Episodic.

Episodic resetting was first introduced by MEMO [31], where the model is reset after every batch. We reset both the model's weights and the optimizer's state to their values before adaptation.
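
A minimal sketch of episodic resetting is shown below; the names are ours, and `adapt_fn` stands in for one adaptation-and-prediction step of any TTA method.

```python
import copy

def episodic_step(model, optimizer, adapt_fn, x_batch):
    """Snapshot the model and optimizer, adapt on one batch, then restore,
    so every batch starts from the pre-adaptation state."""
    model_state = copy.deepcopy(model.state_dict())
    optim_state = copy.deepcopy(optimizer.state_dict())
    predictions = adapt_fn(model, optimizer, x_batch)
    model.load_state_dict(model_state)
    optimizer.load_state_dict(optim_state)
    return predictions
```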

11.4.0.3 MRS.

The Model Recovery Scheme (MRS) was introduced by SAR [7] to recover the model from collapse. A reset occurs when the moving average of the entropy loss falls below a certain threshold; we used the threshold value of 0.2 from the original paper. Upon reset, we restore the model's weights to those of the source model and the optimizer's state to its initial value.
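
The MRS trigger reduces to tracking an exponential moving average of the entropy loss, as sketched below; the EMA momentum here is an assumed illustrative value, while the threshold of 0.2 is the one used above.

```python
from typing import Optional, Tuple

def mrs_should_reset(entropy_loss: float, ema: Optional[float],
                     threshold: float = 0.2, momentum: float = 0.9) -> Tuple[bool, float]:
    """Update the EMA of the entropy loss and signal a reset when it drops
    below the threshold."""
    ema = entropy_loss if ema is None else momentum * ema + (1.0 - momentum) * entropy_loss
    return ema < threshold, ema
```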

11.4.0.4 Stochastic.

Stochastic restoration was first introduced by CoTTA [9]. A small number of model weights are stochastically restored to the initial weights of the source model, with a probability given by the restoration factor. We used a restoration factor of 0.01, as in the original work.
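
Stochastic restoration amounts to an element-wise Bernoulli mask over the weights, as sketched below; `source_state` is assumed to hold the source model's parameters keyed by name.

```python
import torch

@torch.no_grad()
def stochastic_restore(model, source_state, p: float = 0.01):
    """Independently reset each weight to its source value with probability p."""
    for name, param in model.named_parameters():
        mask = (torch.rand_like(param) < p).float()
        param.copy_(mask * source_state[name] + (1.0 - mask) * param)
```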

11.4.0.5 FisherStochastic.

Fisher-information-based restoration was proposed by PETAL [32], building on stochastic restoration [9]. It applies stochastic restoration based on layer importance measured by the Fisher information matrix (FIM). We used an FIM-based parameter restoration quantile of 0.03 for CIFAR100-C, as recommended in the original paper; parameters with FIM values below the 0.03-quantile are restored to the original source weights.
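
A sketch of the quantile rule above is given below; `fim` is assumed to map parameter names to element-wise Fisher-information estimates (e.g., squared gradients of the adaptation loss), and how it is estimated is omitted.

```python
import torch

@torch.no_grad()
def fisher_restore(model, source_state, fim, quantile: float = 0.03):
    """Reset parameters whose FIM values fall below the given quantile
    to the source model's weights."""
    threshold = torch.quantile(torch.cat([v.flatten() for v in fim.values()]), quantile)
    for name, param in model.named_parameters():
        mask = (fim[name] < threshold).float()
        param.copy_(mask * source_state[name] + (1.0 - mask) * param)
```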

11.4.0.6 DistShift.

DistShift assumes that the model knows when the distribution changes and thus acts as an oracle. A reset occurs whenever the test data distribution (corruption) changes. Upon reset, we restore the model's weights to those of the source model and the optimizer's state to its initial value.

12 License of Assets↩︎

12.0.0.1 Datasets.

CIFAR10/CIFAR100 (MIT License), CIFAR10-C/CIFAR100-C (Creative Commons Attribution 4.0 International) and ImageNet-C (Apache 2.0).

12.0.0.2 Codes.

Torchvision for ResNet18 and ResNet50 (Apache 2.0), the official repository of TENT (MIT License), the official repository of EATA (MIT License), the official repository of SAR (BSD 3-Clause License), the official repository of CoTTA (MIT License), the official repository of RoTTA (MIT License), and the official repository of SoTTA (MIT License).

13 Result Details↩︎

We report the detailed per-corruption results of the main experiments. Table [tab:main_experiment_iid] in the main paper is detailed in Table [tab:main_experiment_cifar10_full], Table [tab:main_experiment_cifar100_full], and Table [tab:main_experiment_imagenet_full]. Table [tab:main_experiment_cont] in the main paper is detailed in Table 8, Table 9, and Table 10. Table 1 in the main paper is detailed in Table 11.

\begin{table*}[H]
\centering
\caption{Mean absolute error (MAE) (\%) of the accuracy estimation on CIFAR10-C under the fully TTA setting. Averaged over three different random seeds.}
\label{tab:main_experiment_cifar10_full}
\scriptsize
\setlength\tabcolsep{3pt}
\begin{tabularx}{\textwidth}{ll*{16}Y}
\Xhline{2\arrayrulewidth}
 &  & \multicolumn{3}{c}{Noise} & \multicolumn{4}{c}{Blur} & \multicolumn{4}{c}{Weather} & \multicolumn{4}{c}{Digital} &  \\ 
\addlinespace[-0.05cm]
\cmidrule(lr){3-5} \cmidrule(lr){6-9} \cmidrule(lr){10-13} \cmidrule(lr){14-17}
\addlinespace[-0.05cm]
TTA Method & Acc. Estimation & Gau. & Shot & Imp. & Def. & Gla. & Mot. & Zoom & Snow & Fro. & Fog & Brit. & Cont. & Elas. & Pix. & JPEG & \cellcolor[HTML]{EDEEFF}Avg.($\downarrow$) \\ \midrule
  & SrcValid & \makecell[c]{24.85\\ \scalebox{\std}{± 0.83}} & \makecell[c]{21.82\\ \scalebox{\std}{± 0.74}} & \makecell[c]{32.41\\ \scalebox{\std}{± 0.90}} & \makecell[c]{11.60\\ \scalebox{\std}{± 0.53}} & \makecell[c]{32.49\\ \scalebox{\std}{± 1.50}} & \makecell[c]{12.78\\ \scalebox{\std}{± 0.45}} & \makecell[c]{11.16\\ \scalebox{\std}{± 0.62}} & \makecell[c]{15.90\\ \scalebox{\std}{± 0.57}} & \makecell[c]{17.90\\ \scalebox{\std}{± 1.04}} & \makecell[c]{13.12\\ \scalebox{\std}{± 0.78}} & \makecell[c]{8.46\\ \scalebox{\std}{± 0.35}} & \makecell[c]{12.62\\ \scalebox{\std}{± 0.33}} & \makecell[c]{21.53\\ \scalebox{\std}{± 1.26}} & \makecell[c]{16.35\\ \scalebox{\std}{± 0.68}} & \makecell[c]{22.57\\ \scalebox{\std}{± 1.26}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{18.37\\ \scalebox{\std}{± 0.29}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{8.40\\ \scalebox{\std}{± 0.10}} & \makecell[c]{6.57\\ \scalebox{\std}{± 0.91}} & \makecell[c]{12.95\\ \scalebox{\std}{± 1.53}} & \makecell[c]{3.96\\ \scalebox{\std}{± 0.50}} & \makecell[c]{14.30\\ \scalebox{\std}{± 2.12}} & \makecell[c]{3.76\\ \scalebox{\std}{± 0.12}} & \makecell[c]{3.52\\ \scalebox{\std}{± 0.36}} & \makecell[c]{4.40\\ \scalebox{\std}{± 0.08}} & \makecell[c]{6.20\\ \scalebox{\std}{± 0.94}} & \makecell[c]{4.08\\ \scalebox{\std}{± 0.50}} & \makecell[c]{3.13\\ \scalebox{\std}{± 0.08}} & \makecell[c]{4.06\\ \scalebox{\std}{± 0.36}} & \makecell[c]{6.62\\ \scalebox{\std}{± 1.00}} & \makecell[c]{4.63\\ \scalebox{\std}{± 0.47}} & \makecell[c]{7.32\\ \scalebox{\std}{± 1.40}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{6.26\\ \scalebox{\std}{± 0.49}} \\
 & GDE~\cite{disagreement} & \makecell[c]{25.29\\ \scalebox{\std}{± 0.67}} & \makecell[c]{22.16\\ \scalebox{\std}{± 0.78}} & \makecell[c]{32.75\\ \scalebox{\std}{± 1.04}} & \makecell[c]{12.07\\ \scalebox{\std}{± 0.92}} & \makecell[c]{33.26\\ \scalebox{\std}{± 1.83}} & \makecell[c]{12.78\\ \scalebox{\std}{± 0.55}} & \makecell[c]{11.25\\ \scalebox{\std}{± 0.84}} & \makecell[c]{16.02\\ \scalebox{\std}{± 0.55}} & \makecell[c]{18.57\\ \scalebox{\std}{± 1.07}} & \makecell[c]{13.56\\ \scalebox{\std}{± 1.14}} & \makecell[c]{8.81\\ \scalebox{\std}{± 0.38}} & \makecell[c]{12.72\\ \scalebox{\std}{± 0.25}} & \makecell[c]{21.58\\ \scalebox{\std}{± 1.23}} & \makecell[c]{16.64\\ \scalebox{\std}{± 1.03}} & \makecell[c]{22.97\\ \scalebox{\std}{± 1.34}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{18.69\\ \scalebox{\std}{± 0.28}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{48.04\\ \scalebox{\std}{± 2.26}} & \makecell[c]{44.69\\ \scalebox{\std}{± 3.91}} & \makecell[c]{41.20\\ \scalebox{\std}{± 5.55}} & \makecell[c]{28.88\\ \scalebox{\std}{± 3.38}} & \makecell[c]{10.96\\ \scalebox{\std}{± 2.02}} & \makecell[c]{18.75\\ \scalebox{\std}{± 2.07}} & \makecell[c]{22.50\\ \scalebox{\std}{± 1.34}} & \makecell[c]{5.56\\ \scalebox{\std}{± 0.43}} & \makecell[c]{14.27\\ \scalebox{\std}{± 2.42}} & \makecell[c]{10.58\\ \scalebox{\std}{± 2.22}} & \makecell[c]{2.48\\ \scalebox{\std}{± 0.08}} & \makecell[c]{52.38\\ \scalebox{\std}{± 1.67}} & \makecell[c]{4.73\\ \scalebox{\std}{± 0.25}} & \makecell[c]{35.74\\ \scalebox{\std}{± 1.12}} & \makecell[c]{5.20\\ \scalebox{\std}{± 0.10}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{23.06\\ \scalebox{\std}{± 1.17}} \\
\multirow{-5}{*}{TENT~\cite{tent}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.12\\ \scalebox{\std}{± 0.45}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.67\\ \scalebox{\std}{± 0.42}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.59\\ \scalebox{\std}{± 0.45}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.14\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.05\\ \scalebox{\std}{± 0.49}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.20\\ \scalebox{\std}{± 0.03}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.07\\ \scalebox{\std}{± 0.35}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.66\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.07\\ \scalebox{\std}{± 0.58}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.32\\ \scalebox{\std}{± 0.26}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{2.70\\ \scalebox{\std}{± 0.05}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.29\\ \scalebox{\std}{± 0.19}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.01\\ \scalebox{\std}{± 0.38}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.69\\ \scalebox{\std}{± 0.13}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.41\\ \scalebox{\std}{± 0.35}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.00\\ \scalebox{\std}{± 0.03}} \\  \midrule 
 & SrcValid & \makecell[c]{18.53\\ \scalebox{\std}{± 0.43}} & \makecell[c]{16.06\\ \scalebox{\std}{± 1.16}} & \makecell[c]{24.04\\ \scalebox{\std}{± 1.75}} & \makecell[c]{10.55\\ \scalebox{\std}{± 0.25}} & \makecell[c]{23.95\\ \scalebox{\std}{± 2.07}} & \makecell[c]{11.72\\ \scalebox{\std}{± 0.61}} & \makecell[c]{10.26\\ \scalebox{\std}{± 0.11}} & \makecell[c]{12.81\\ \scalebox{\std}{± 0.31}} & \makecell[c]{12.85\\ \scalebox{\std}{± 0.48}} & \makecell[c]{10.27\\ \scalebox{\std}{± 0.36}} & \makecell[c]{7.78\\ \scalebox{\std}{± 0.32}} & \makecell[c]{8.88\\ \scalebox{\std}{± 0.82}} & \makecell[c]{19.00\\ \scalebox{\std}{± 1.34}} & \makecell[c]{12.62\\ \scalebox{\std}{± 0.56}} & \makecell[c]{16.18\\ \scalebox{\std}{± 1.01}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{14.37\\ \scalebox{\std}{± 0.33}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{4.75\\ \scalebox{\std}{± 0.62}} & \makecell[c]{3.85\\ \scalebox{\std}{± 0.37}} & \makecell[c]{9.08\\ \scalebox{\std}{± 0.24}} & \makecell[c]{4.14\\ \scalebox{\std}{± 0.41}} & \makecell[c]{8.77\\ \scalebox{\std}{± 2.09}} & \makecell[c]{3.92\\ \scalebox{\std}{± 0.19}} & \makecell[c]{4.44\\ \scalebox{\std}{± 0.42}} & \makecell[c]{3.50\\ \scalebox{\std}{± 0.30}} & \makecell[c]{3.54\\ \scalebox{\std}{± 0.39}} & \makecell[c]{4.02\\ \scalebox{\std}{± 0.24}} & \makecell[c]{4.79\\ \scalebox{\std}{± 0.48}} & \makecell[c]{4.22\\ \scalebox{\std}{± 0.47}} & \makecell[c]{4.38\\ \scalebox{\std}{± 0.40}} & \makecell[c]{3.58\\ \scalebox{\std}{± 0.37}} & \makecell[c]{4.70\\ \scalebox{\std}{± 0.42}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{4.78\\ \scalebox{\std}{± 0.12}} \\
 & GDE~\cite{disagreement} & \makecell[c]{22.88\\ \scalebox{\std}{± 0.72}} & \makecell[c]{20.42\\ \scalebox{\std}{± 0.72}} & \makecell[c]{30.68\\ \scalebox{\std}{± 0.46}} & \makecell[c]{11.07\\ \scalebox{\std}{± 0.19}} & \makecell[c]{30.04\\ \scalebox{\std}{± 2.50}} & \makecell[c]{12.58\\ \scalebox{\std}{± 0.56}} & \makecell[c]{10.78\\ \scalebox{\std}{± 0.38}} & \makecell[c]{14.99\\ \scalebox{\std}{± 0.28}} & \makecell[c]{14.52\\ \scalebox{\std}{± 0.56}} & \makecell[c]{11.48\\ \scalebox{\std}{± 0.13}} & \makecell[c]{8.27\\ \scalebox{\std}{± 0.46}} & \makecell[c]{9.32\\ \scalebox{\std}{± 0.48}} & \makecell[c]{21.01\\ \scalebox{\std}{± 0.92}} & \makecell[c]{14.85\\ \scalebox{\std}{± 0.11}} & \makecell[c]{21.31\\ \scalebox{\std}{± 0.97}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{16.95\\ \scalebox{\std}{± 0.22}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{50.21\\ \scalebox{\std}{± 4.36}} & \makecell[c]{45.60\\ \scalebox{\std}{± 3.84}} & \makecell[c]{44.13\\ \scalebox{\std}{± 4.72}} & \makecell[c]{31.34\\ \scalebox{\std}{± 2.85}} & \makecell[c]{16.32\\ \scalebox{\std}{± 0.62}} & \makecell[c]{19.30\\ \scalebox{\std}{± 1.78}} & \makecell[c]{22.95\\ \scalebox{\std}{± 2.10}} & \makecell[c]{6.43\\ \scalebox{\std}{± 0.19}} & \makecell[c]{17.52\\ \scalebox{\std}{± 3.03}} & \makecell[c]{13.62\\ \scalebox{\std}{± 0.85}} & \makecell[c]{2.73\\ \scalebox{\std}{± 0.15}} & \makecell[c]{56.00\\ \scalebox{\std}{± 1.18}} & \makecell[c]{4.93\\ \scalebox{\std}{± 0.06}} & \makecell[c]{37.56\\ \scalebox{\std}{± 0.69}} & \makecell[c]{5.99\\ \scalebox{\std}{± 0.70}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{24.97\\ \scalebox{\std}{± 1.00}} \\
\multirow{-5}{*}{EATA~\cite{eata}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.86\\ \scalebox{\std}{± 0.28}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.09\\ \scalebox{\std}{± 0.25}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.56\\ \scalebox{\std}{± 0.77}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.07\\ \scalebox{\std}{± 0.23}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.88\\ \scalebox{\std}{± 1.05}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.42\\ \scalebox{\std}{± 0.20}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.07\\ \scalebox{\std}{± 0.29}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.25\\ \scalebox{\std}{± 0.03}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.48\\ \scalebox{\std}{± 0.12}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.31\\ \scalebox{\std}{± 0.33}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{2.73\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{2.76\\ \scalebox{\std}{± 0.23}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.45\\ \scalebox{\std}{± 0.10}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.34\\ \scalebox{\std}{± 0.35}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.72\\ \scalebox{\std}{± 0.75}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.87\\ \scalebox{\std}{± 0.14}} \\  \midrule 
 & SrcValid & \makecell[c]{32.05\\ \scalebox{\std}{± 1.03}} & \makecell[c]{30.47\\ \scalebox{\std}{± 0.81}} & \makecell[c]{37.29\\ \scalebox{\std}{± 0.79}} & \makecell[c]{12.22\\ \scalebox{\std}{± 0.20}} & \makecell[c]{33.83\\ \scalebox{\std}{± 0.43}} & \makecell[c]{13.71\\ \scalebox{\std}{± 0.10}} & \makecell[c]{12.62\\ \scalebox{\std}{± 0.36}} & \makecell[c]{18.37\\ \scalebox{\std}{± 0.36}} & \makecell[c]{19.73\\ \scalebox{\std}{± 0.52}} & \makecell[c]{14.61\\ \scalebox{\std}{± 0.57}} & \makecell[c]{9.26\\ \scalebox{\std}{± 0.24}} & \makecell[c]{13.12\\ \scalebox{\std}{± 0.43}} & \makecell[c]{23.28\\ \scalebox{\std}{± 0.20}} & \makecell[c]{20.67\\ \scalebox{\std}{± 0.02}} & \makecell[c]{28.02\\ \scalebox{\std}{± 0.54}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{21.28\\ \scalebox{\std}{± 0.27}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{4.21\\ \scalebox{\std}{± 0.33}} & \makecell[c]{4.08\\ \scalebox{\std}{± 0.17}} & \makecell[c]{5.52\\ \scalebox{\std}{± 0.59}} & \makecell[c]{6.28\\ \scalebox{\std}{± 0.31}} & \makecell[c]{4.89\\ \scalebox{\std}{± 0.20}} & \makecell[c]{5.92\\ \scalebox{\std}{± 0.26}} & \makecell[c]{6.49\\ \scalebox{\std}{± 0.52}} & \makecell[c]{4.85\\ \scalebox{\std}{± 0.27}} & \makecell[c]{4.86\\ \scalebox{\std}{± 0.29}} & \makecell[c]{5.80\\ \scalebox{\std}{± 0.59}} & \makecell[c]{7.11\\ \scalebox{\std}{± 0.48}} & \makecell[c]{5.34\\ \scalebox{\std}{± 0.30}} & \makecell[c]{4.26\\ \scalebox{\std}{± 0.24}} & \makecell[c]{4.55\\ \scalebox{\std}{± 0.09}} & \makecell[c]{3.94\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{5.21\\ \scalebox{\std}{± 0.22}} \\
 & GDE~\cite{disagreement} & \makecell[c]{31.88\\ \scalebox{\std}{± 1.08}} & \makecell[c]{30.38\\ \scalebox{\std}{± 0.85}} & \makecell[c]{37.17\\ \scalebox{\std}{± 0.78}} & \makecell[c]{12.22\\ \scalebox{\std}{± 0.20}} & \makecell[c]{33.72\\ \scalebox{\std}{± 0.46}} & \makecell[c]{13.71\\ \scalebox{\std}{± 0.10}} & \makecell[c]{12.62\\ \scalebox{\std}{± 0.36}} & \makecell[c]{18.37\\ \scalebox{\std}{± 0.36}} & \makecell[c]{19.73\\ \scalebox{\std}{± 0.52}} & \makecell[c]{14.61\\ \scalebox{\std}{± 0.57}} & \makecell[c]{9.26\\ \scalebox{\std}{± 0.24}} & \makecell[c]{13.12\\ \scalebox{\std}{± 0.43}} & \makecell[c]{23.28\\ \scalebox{\std}{± 0.20}} & \makecell[c]{20.67\\ \scalebox{\std}{± 0.02}} & \makecell[c]{27.99\\ \scalebox{\std}{± 0.55}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{21.25\\ \scalebox{\std}{± 0.27}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{42.25\\ \scalebox{\std}{± 2.34}} & \makecell[c]{37.73\\ \scalebox{\std}{± 2.89}} & \makecell[c]{38.52\\ \scalebox{\std}{± 4.53}} & \makecell[c]{30.55\\ \scalebox{\std}{± 3.11}} & \makecell[c]{9.85\\ \scalebox{\std}{± 1.85}} & \makecell[c]{18.17\\ \scalebox{\std}{± 1.49}} & \makecell[c]{21.66\\ \scalebox{\std}{± 2.37}} & \makecell[c]{4.60\\ \scalebox{\std}{± 0.60}} & \makecell[c]{14.48\\ \scalebox{\std}{± 2.98}} & \makecell[c]{11.73\\ \scalebox{\std}{± 0.52}} & \makecell[c]{2.71\\ \scalebox{\std}{± 0.14}} & \makecell[c]{52.81\\ \scalebox{\std}{± 2.08}} & \makecell[c]{4.92\\ \scalebox{\std}{± 0.04}} & \makecell[c]{31.36\\ \scalebox{\std}{± 0.89}} & \makecell[c]{6.98\\ \scalebox{\std}{± 0.44}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{21.89\\ \scalebox{\std}{± 0.95}} \\
\multirow{-5}{*}{SAR~\cite{sar}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.91\\ \scalebox{\std}{± 0.84}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.15\\ \scalebox{\std}{± 0.89}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.75\\ \scalebox{\std}{± 0.19}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{2.92\\ \scalebox{\std}{± 0.14}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.42\\ \scalebox{\std}{± 0.47}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.09\\ \scalebox{\std}{± 0.06}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.18\\ \scalebox{\std}{± 0.21}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.55\\ \scalebox{\std}{± 0.37}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.65\\ \scalebox{\std}{± 0.17}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.25\\ \scalebox{\std}{± 0.34}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{2.81\\ \scalebox{\std}{± 0.06}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.58\\ \scalebox{\std}{± 0.61}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.87\\ \scalebox{\std}{± 0.15}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.69\\ \scalebox{\std}{± 0.43}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.47\\ \scalebox{\std}{± 0.23}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.89\\ \scalebox{\std}{± 0.07}} \\  \midrule 
 & SrcValid & \makecell[c]{23.70\\ \scalebox{\std}{± 0.75}} & \makecell[c]{21.84\\ \scalebox{\std}{± 0.28}} & \makecell[c]{28.79\\ \scalebox{\std}{± 0.21}} & \makecell[c]{12.46\\ \scalebox{\std}{± 0.33}} & \makecell[c]{29.57\\ \scalebox{\std}{± 0.62}} & \makecell[c]{13.92\\ \scalebox{\std}{± 0.06}} & \makecell[c]{12.75\\ \scalebox{\std}{± 0.25}} & \makecell[c]{17.30\\ \scalebox{\std}{± 0.41}} & \makecell[c]{17.21\\ \scalebox{\std}{± 0.30}} & \makecell[c]{14.75\\ \scalebox{\std}{± 0.51}} & \makecell[c]{9.26\\ \scalebox{\std}{± 0.21}} & \makecell[c]{15.14\\ \scalebox{\std}{± 0.06}} & \makecell[c]{21.29\\ \scalebox{\std}{± 0.43}} & \makecell[c]{17.69\\ \scalebox{\std}{± 0.14}} & \makecell[c]{20.78\\ \scalebox{\std}{± 0.38}} &  \cellcolor[HTML]{EDEEFF}\makecell[c]{18.43\\ \scalebox{\std}{± 0.16}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{16.82\\ \scalebox{\std}{± 0.51}} & \makecell[c]{17.21\\ \scalebox{\std}{± 0.63}} & \makecell[c]{16.33\\ \scalebox{\std}{± 0.22}} & \makecell[c]{6.71\\ \scalebox{\std}{± 0.42}} & \makecell[c]{12.30\\ \scalebox{\std}{± 0.19}} & \makecell[c]{7.02\\ \scalebox{\std}{± 0.43}} & \makecell[c]{7.30\\ \scalebox{\std}{± 0.71}} & \makecell[c]{9.69\\ \scalebox{\std}{± 0.48}} & \makecell[c]{12.00\\ \scalebox{\std}{± 0.63}} & \makecell[c]{7.54\\ \scalebox{\std}{± 0.45}} & \makecell[c]{7.18\\ \scalebox{\std}{± 0.56}} & \makecell[c]{7.90\\ \scalebox{\std}{± 0.61}} & \makecell[c]{12.01\\ \scalebox{\std}{± 0.38}} & \makecell[c]{11.76\\ \scalebox{\std}{± 0.74}} & \makecell[c]{12.69\\ \scalebox{\std}{± 0.63}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{10.96\\ \scalebox{\std}{± 0.28}} \\
 & GDE~\cite{disagreement} & \makecell[c]{15.65\\ \scalebox{\std}{± 0.77}} & \makecell[c]{14.29\\ \scalebox{\std}{± 0.35}} & \makecell[c]{19.35\\ \scalebox{\std}{± 0.33}} & \makecell[c]{12.08\\ \scalebox{\std}{± 0.17}} & \makecell[c]{21.18\\ \scalebox{\std}{± 0.52}} & \makecell[c]{13.08\\ \scalebox{\std}{± 0.16}} & \makecell[c]{12.20\\ \scalebox{\std}{± 0.07}} & \makecell[c]{14.43\\ \scalebox{\std}{± 0.15}} & \makecell[c]{13.44\\ \scalebox{\std}{± 0.17}} & \makecell[c]{13.60\\ \scalebox{\std}{± 0.49}} & \makecell[c]{9.20\\ \scalebox{\std}{± 0.16}} & \makecell[c]{13.16\\ \scalebox{\std}{± 0.30}} & \makecell[c]{16.40\\ \scalebox{\std}{± 0.06}} & \makecell[c]{13.91\\ \scalebox{\std}{± 0.58}} & \makecell[c]{15.46\\ \scalebox{\std}{± 0.45}}
 & \cellcolor[HTML]{EDEEFF}\makecell[c]{14.50\\ \scalebox{\std}{± 0.03}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{16.79\\ \scalebox{\std}{± 0.32}} & \makecell[c]{15.09\\ \scalebox{\std}{± 0.89}} & \makecell[c]{18.84\\ \scalebox{\std}{± 3.83}} & \makecell[c]{31.44\\ \scalebox{\std}{± 3.23}} & \makecell[c]{6.81\\ \scalebox{\std}{± 0.51}} & \makecell[c]{20.83\\ \scalebox{\std}{± 2.16}} & \makecell[c]{23.18\\ \scalebox{\std}{± 2.46}} & \makecell[c]{6.05\\ \scalebox{\std}{± 0.33}} & \makecell[c]{11.83\\ \scalebox{\std}{± 1.56}} & \makecell[c]{17.25\\ \scalebox{\std}{± 1.20}} & \makecell[c]{2.64\\ \scalebox{\std}{± 0.06}} & \makecell[c]{55.25\\ \scalebox{\std}{± 1.82}} & \makecell[c]{14.63\\ \scalebox{\std}{± 1.63}} & \makecell[c]{21.96\\ \scalebox{\std}{± 1.58}} & \makecell[c]{7.41\\ \scalebox{\std}{± 0.68}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{18.00\\ \scalebox{\std}{± 0.82}} \\
\multirow{-5}{*}{CoTTA~\cite{cotta}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{15.34\\ \scalebox{\std}{± 1.06}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{15.26\\ \scalebox{\std}{± 1.54}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{14.98\\ \scalebox{\std}{± 0.87}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.02\\ \scalebox{\std}{± 0.06}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.45\\ \scalebox{\std}{± 0.60}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.22\\ \scalebox{\std}{± 0.17}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.20\\ \scalebox{\std}{± 0.28}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.11\\ \scalebox{\std}{± 0.24}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.57\\ \scalebox{\std}{± 0.48}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.48\\ \scalebox{\std}{± 0.39}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{2.79\\ \scalebox{\std}{± 0.05}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.24\\ \scalebox{\std}{± 0.10}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.56\\ \scalebox{\std}{± 0.63}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.10\\ \scalebox{\std}{± 0.92}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.18\\ \scalebox{\std}{± 0.89}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.83\\ \scalebox{\std}{± 0.47}} \\  \midrule 
 & SrcValid & \makecell[c]{27.12\\ \scalebox{\std}{± 7.07}} & \makecell[c]{27.75\\ \scalebox{\std}{± 6.21}} & \makecell[c]{14.88\\ \scalebox{\std}{± 3.34}} & \makecell[c]{12.12\\ \scalebox{\std}{± 3.16}} & \makecell[c]{25.35\\ \scalebox{\std}{± 1.09}} & \makecell[c]{5.02\\ \scalebox{\std}{± 0.23}} & \makecell[c]{4.88\\ \scalebox{\std}{± 0.72}} & \makecell[c]{14.33\\ \scalebox{\std}{± 3.02}} & \makecell[c]{36.52\\ \scalebox{\std}{± 1.99}} & \makecell[c]{11.62\\ \scalebox{\std}{± 1.44}} & \makecell[c]{35.55\\ \scalebox{\std}{± 2.35}} & \makecell[c]{35.93\\ \scalebox{\std}{± 0.82}} & \makecell[c]{20.00\\ \scalebox{\std}{± 0.14}} & \makecell[c]{7.96\\ \scalebox{\std}{± 0.72}} & \makecell[c]{26.23\\ \scalebox{\std}{± 0.98}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{20.35\\ \scalebox{\std}{± 1.31}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{4.68\\ \scalebox{\std}{± 0.46}} & \makecell[c]{4.64\\ \scalebox{\std}{± 0.15}} & \makecell[c]{5.19\\ \scalebox{\std}{± 0.31}} & \makecell[c]{7.29\\ \scalebox{\std}{± 0.50}} & \makecell[c]{4.77\\ \scalebox{\std}{± 0.27}} & \makecell[c]{7.12\\ \scalebox{\std}{± 0.50}} & \makecell[c]{7.73\\ \scalebox{\std}{± 0.65}} & \makecell[c]{6.49\\ \scalebox{\std}{± 0.31}} & \makecell[c]{6.28\\ \scalebox{\std}{± 0.47}} & \makecell[c]{7.32\\ \scalebox{\std}{± 0.76}} & \makecell[c]{8.40\\ \scalebox{\std}{± 0.59}} & \makecell[c]{4.71\\ \scalebox{\std}{± 0.47}} & \makecell[c]{5.25\\ \scalebox{\std}{± 0.04}} & \makecell[c]{5.89\\ \scalebox{\std}{± 0.43}} & \makecell[c]{4.39\\ \scalebox{\std}{± 0.14}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{6.01\\ \scalebox{\std}{± 0.23}} \\
 & GDE~\cite{disagreement} & \makecell[c]{32.94\\ \scalebox{\std}{± 0.75}} & \makecell[c]{30.87\\ \scalebox{\std}{± 0.91}} & \makecell[c]{39.40\\ \scalebox{\std}{± 0.72}} & \makecell[c]{12.02\\ \scalebox{\std}{± 0.22}} & \makecell[c]{34.18\\ \scalebox{\std}{± 0.90}} & \makecell[c]{13.48\\ \scalebox{\std}{± 0.48}} & \makecell[c]{12.01\\ \scalebox{\std}{± 0.26}} & \makecell[c]{17.90\\ \scalebox{\std}{± 0.80}} & \makecell[c]{21.73\\ \scalebox{\std}{± 0.82}} & \makecell[c]{13.93\\ \scalebox{\std}{± 0.51}} & \makecell[c]{8.90\\ \scalebox{\std}{± 0.42}} & \makecell[c]{40.52\\ \scalebox{\std}{± 2.50}} & \makecell[c]{22.68\\ \scalebox{\std}{± 0.44}} & \makecell[c]{21.22\\ \scalebox{\std}{± 0.37}} & \makecell[c]{27.31\\ \scalebox{\std}{± 0.69}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{23.27\\ \scalebox{\std}{± 0.43}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{40.38\\ \scalebox{\std}{± 2.57}} & \makecell[c]{36.59\\ \scalebox{\std}{± 2.88}} & \makecell[c]{35.02\\ \scalebox{\std}{± 3.67}} & \makecell[c]{29.64\\ \scalebox{\std}{± 2.94}} & \makecell[c]{9.29\\ \scalebox{\std}{± 1.72}} & \makecell[c]{17.31\\ \scalebox{\std}{± 1.01}} & \makecell[c]{21.45\\ \scalebox{\std}{± 2.59}} & \makecell[c]{4.81\\ \scalebox{\std}{± 0.35}} & \makecell[c]{11.96\\ \scalebox{\std}{± 2.41}} & \makecell[c]{11.24\\ \scalebox{\std}{± 0.41}} & \makecell[c]{2.70\\ \scalebox{\std}{± 0.09}} & \makecell[c]{26.49\\ \scalebox{\std}{± 4.84}} & \makecell[c]{5.42\\ \scalebox{\std}{± 0.24}} & \makecell[c]{29.92\\ \scalebox{\std}{± 0.45}} & \makecell[c]{8.06\\ \scalebox{\std}{± 0.59}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{19.35\\ \scalebox{\std}{± 0.99}} \\
\multirow{-5}{*}{RoTTA~\cite{rotta}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{13.47\\ \scalebox{\std}{± 5.78}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{13.35\\ \scalebox{\std}{± 5.69}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.74\\ \scalebox{\std}{± 3.87}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.55\\ \scalebox{\std}{± 0.44}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.42\\ \scalebox{\std}{± 1.13}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.68\\ \scalebox{\std}{± 0.27}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.88\\ \scalebox{\std}{± 0.43}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.45\\ \scalebox{\std}{± 0.93}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.93\\ \scalebox{\std}{± 1.02}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.51\\ \scalebox{\std}{± 0.62}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.31\\ \scalebox{\std}{± 0.33}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{12.13\\ \scalebox{\std}{± 6.65}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.66\\ \scalebox{\std}{± 1.17}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.97\\ \scalebox{\std}{± 1.13}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.57\\ \scalebox{\std}{± 0.24}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.44\\ \scalebox{\std}{± 1.35}} \\  \midrule 
 & SrcValid & \makecell[c]{11.98\\ \scalebox{\std}{± 4.00}} & \makecell[c]{10.86\\ \scalebox{\std}{± 0.74}} & \makecell[c]{8.25\\ \scalebox{\std}{± 0.47}} & \makecell[c]{9.73\\ \scalebox{\std}{± 2.10}} & \makecell[c]{23.16\\ \scalebox{\std}{± 0.58}} & \makecell[c]{4.54\\ \scalebox{\std}{± 0.68}} & \makecell[c]{4.74\\ \scalebox{\std}{± 1.04}} & \makecell[c]{5.55\\ \scalebox{\std}{± 0.66}} & \makecell[c]{16.12\\ \scalebox{\std}{± 2.74}} & \makecell[c]{4.56\\ \scalebox{\std}{± 0.17}} & \makecell[c]{13.62\\ \scalebox{\std}{± 2.37}} & \makecell[c]{38.68\\ \scalebox{\std}{± 6.48}} & \makecell[c]{19.00\\ \scalebox{\std}{± 0.45}} & \makecell[c]{5.57\\ \scalebox{\std}{± 0.67}} & \makecell[c]{20.67\\ \scalebox{\std}{± 0.57}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{13.13\\ \scalebox{\std}{± 0.85}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{4.10\\ \scalebox{\std}{± 0.13}} & \makecell[c]{4.46\\ \scalebox{\std}{± 0.27}} & \makecell[c]{4.50\\ \scalebox{\std}{± 0.17}} & \makecell[c]{5.45\\ \scalebox{\std}{± 1.04}} & \makecell[c]{5.05\\ \scalebox{\std}{± 0.65}} & \makecell[c]{5.47\\ \scalebox{\std}{± 0.69}} & \makecell[c]{6.14\\ \scalebox{\std}{± 0.62}} & \makecell[c]{4.82\\ \scalebox{\std}{± 0.53}} & \makecell[c]{4.91\\ \scalebox{\std}{± 0.97}} & \makecell[c]{5.61\\ \scalebox{\std}{± 0.82}} & \makecell[c]{6.17\\ \scalebox{\std}{± 1.00}} & \makecell[c]{4.36\\ \scalebox{\std}{± 0.85}} & \makecell[c]{4.23\\ \scalebox{\std}{± 0.41}} & \makecell[c]{5.25\\ \scalebox{\std}{± 0.70}} & \makecell[c]{4.07\\ \scalebox{\std}{± 0.26}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{4.97\\ \scalebox{\std}{± 0.50}} \\
 & GDE~\cite{disagreement} & \makecell[c]{23.46\\ \scalebox{\std}{± 0.77}} & \makecell[c]{20.15\\ \scalebox{\std}{± 0.46}} & \makecell[c]{29.27\\ \scalebox{\std}{± 0.45}} & \makecell[c]{10.60\\ \scalebox{\std}{± 0.38}} & \makecell[c]{28.84\\ \scalebox{\std}{± 0.86}} & \makecell[c]{11.03\\ \scalebox{\std}{± 0.11}} & \makecell[c]{10.00\\ \scalebox{\std}{± 0.56}} & \makecell[c]{13.53\\ \scalebox{\std}{± 0.58}} & \makecell[c]{14.42\\ \scalebox{\std}{± 0.16}} & \makecell[c]{10.67\\ \scalebox{\std}{± 0.40}} & \makecell[c]{7.09\\ \scalebox{\std}{± 0.09}} & \makecell[c]{13.81\\ \scalebox{\std}{± 1.96}} & \makecell[c]{19.08\\ \scalebox{\std}{± 0.52}} & \makecell[c]{14.25\\ \scalebox{\std}{± 0.50}} & \makecell[c]{20.51\\ \scalebox{\std}{± 0.82}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{16.45\\ \scalebox{\std}{± 0.21}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{47.94\\ \scalebox{\std}{± 3.36}} & \makecell[c]{43.98\\ \scalebox{\std}{± 3.57}} & \makecell[c]{44.74\\ \scalebox{\std}{± 4.05}} & \makecell[c]{30.14\\ \scalebox{\std}{± 2.33}} & \makecell[c]{12.47\\ \scalebox{\std}{± 2.81}} & \makecell[c]{18.81\\ \scalebox{\std}{± 1.68}} & \makecell[c]{22.85\\ \scalebox{\std}{± 2.61}} & \makecell[c]{5.48\\ \scalebox{\std}{± 0.77}} & \makecell[c]{16.01\\ \scalebox{\std}{± 2.86}} & \makecell[c]{12.78\\ \scalebox{\std}{± 1.13}} & \makecell[c]{2.68\\ \scalebox{\std}{± 0.20}} & \makecell[c]{49.76\\ \scalebox{\std}{± 2.32}} & \makecell[c]{4.95\\ \scalebox{\std}{± 0.33}} & \makecell[c]{37.03\\ \scalebox{\std}{± 1.34}} & \makecell[c]{5.63\\ \scalebox{\std}{± 0.43}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{23.68\\ \scalebox{\std}{± 0.85}} \\
\multirow{-5}{*}{SoTTA~\cite{sotta}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.08\\ \scalebox{\std}{± 2.79}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.51\\ \scalebox{\std}{± 2.49}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.23\\ \scalebox{\std}{± 2.73}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.58\\ \scalebox{\std}{± 0.27}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.01\\ \scalebox{\std}{± 0.52}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.63\\ \scalebox{\std}{± 0.45}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.77\\ \scalebox{\std}{± 0.57}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.86\\ \scalebox{\std}{± 0.42}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.99\\ \scalebox{\std}{± 1.17}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.24\\ \scalebox{\std}{± 0.90}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{3.05\\ \scalebox{\std}{± 0.07}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.15\\ \scalebox{\std}{± 1.28}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.33\\ \scalebox{\std}{± 0.63}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.43\\ \scalebox{\std}{± 1.39}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.41\\ \scalebox{\std}{± 0.51}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.28\\ \scalebox{\std}{± 0.87}} \\
\Xhline{2\arrayrulewidth}
\end{tabularx}
\end{table*}

\begin{table*}[H]
\centering
\caption{Mean absolute error (MAE) (\%) of the accuracy estimation on CIFAR100-C under the fully TTA setting. Averaged over three different random seeds.}
\label{tab:main_experiment_cifar100_full}
\scriptsize
\setlength\tabcolsep{3pt}
\begin{tabularx}{\textwidth}{ll*{16}Y}
\Xhline{2\arrayrulewidth}
 &  & \multicolumn{3}{c}{Noise} & \multicolumn{4}{c}{Blur} & \multicolumn{4}{c}{Weather} & \multicolumn{4}{c}{Digital} &  \\ 
\addlinespace[-0.05cm]
\cmidrule(lr){3-5} \cmidrule(lr){6-9} \cmidrule(lr){10-13} \cmidrule(lr){14-17}
\addlinespace[-0.05cm]
TTA Method & Acc. Estimation & Gau. & Shot & Imp. & Def. & Gla. & Mot. & Zoom & Snow & Fro. & Fog & Brit. & Cont. & Elas. & Pix. & JPEG & \cellcolor[HTML]{EDEEFF}Avg.($\downarrow$) \\ \midrule
  & SrcValid & \makecell[c]{46.38\\ \scalebox{\std}{± 1.17}} & \makecell[c]{45.01\\ \scalebox{\std}{± 1.16}} & \makecell[c]{51.81\\ \scalebox{\std}{± 1.79}} & \makecell[c]{31.42\\ \scalebox{\std}{± 0.67}} & \makecell[c]{49.43\\ \scalebox{\std}{± 0.58}} & \makecell[c]{33.51\\ \scalebox{\std}{± 0.64}} & \makecell[c]{31.37\\ \scalebox{\std}{± 0.35}} & \makecell[c]{39.05\\ \scalebox{\std}{± 0.33}} & \makecell[c]{38.89\\ \scalebox{\std}{± 0.28}} & \makecell[c]{34.25\\ \scalebox{\std}{± 0.38}} & \makecell[c]{28.72\\ \scalebox{\std}{± 0.33}} & \makecell[c]{33.04\\ \scalebox{\std}{± 0.61}} & \makecell[c]{41.81\\ \scalebox{\std}{± 0.75}} & \makecell[c]{35.52\\ \scalebox{\std}{± 0.51}} & \makecell[c]{44.24\\ \scalebox{\std}{± 0.66}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{38.96\\ \scalebox{\std}{± 0.22}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{13.70\\ \scalebox{\std}{± 0.41}} & \makecell[c]{14.53\\ \scalebox{\std}{± 0.58}} & \makecell[c]{11.33\\ \scalebox{\std}{± 0.70}} & \makecell[c]{20.57\\ \scalebox{\std}{± 0.40}} & \makecell[c]{13.53\\ \scalebox{\std}{± 0.49}} & \makecell[c]{19.91\\ \scalebox{\std}{± 0.60}} & \makecell[c]{21.14\\ \scalebox{\std}{± 0.25}} & \makecell[c]{17.30\\ \scalebox{\std}{± 0.27}} & \makecell[c]{17.10\\ \scalebox{\std}{± 0.23}} & \makecell[c]{18.95\\ \scalebox{\std}{± 0.23}} & \makecell[c]{21.15\\ \scalebox{\std}{± 0.12}} & \makecell[c]{17.26\\ \scalebox{\std}{± 0.66}} & \makecell[c]{18.07\\ \scalebox{\std}{± 0.57}} & \makecell[c]{19.69\\ \scalebox{\std}{± 0.13}} & \makecell[c]{15.89\\ \scalebox{\std}{± 0.14}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{17.34\\ \scalebox{\std}{± 0.10}} \\
 & GDE~\cite{disagreement} & \makecell[c]{49.21\\ \scalebox{\std}{± 0.79}} & \makecell[c]{47.38\\ \scalebox{\std}{± 0.87}} & \makecell[c]{54.91\\ \scalebox{\std}{± 0.53}} & \makecell[c]{31.71\\ \scalebox{\std}{± 0.36}} & \makecell[c]{50.52\\ \scalebox{\std}{± 0.75}} & \makecell[c]{33.61\\ \scalebox{\std}{± 0.70}} & \makecell[c]{31.61\\ \scalebox{\std}{± 0.37}} & \makecell[c]{40.02\\ \scalebox{\std}{± 0.45}} & \makecell[c]{40.32\\ \scalebox{\std}{± 0.13}} & \makecell[c]{36.41\\ \scalebox{\std}{± 0.33}} & \makecell[c]{28.89\\ \scalebox{\std}{± 0.17}} & \makecell[c]{32.74\\ \scalebox{\std}{± 0.61}} & \makecell[c]{42.20\\ \scalebox{\std}{± 0.70}} & \makecell[c]{36.20\\ \scalebox{\std}{± 0.11}} & \makecell[c]{45.87\\ \scalebox{\std}{± 0.35}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{40.11\\ \scalebox{\std}{± 0.05}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{36.92\\ \scalebox{\std}{± 1.76}} & \makecell[c]{38.54\\ \scalebox{\std}{± 0.24}} & \makecell[c]{35.93\\ \scalebox{\std}{± 0.49}} & \makecell[c]{31.08\\ \scalebox{\std}{± 0.42}} & \makecell[c]{26.47\\ \scalebox{\std}{± 1.53}} & \makecell[c]{18.05\\ \scalebox{\std}{± 0.82}} & \makecell[c]{24.02\\ \scalebox{\std}{± 0.77}} & \makecell[c]{8.33\\ \scalebox{\std}{± 0.53}} & \makecell[c]{21.92\\ \scalebox{\std}{± 1.07}} & \makecell[c]{19.26\\ \scalebox{\std}{± 0.03}} & \makecell[c]{4.58\\ \scalebox{\std}{± 0.38}} & \makecell[c]{48.38\\ \scalebox{\std}{± 0.25}} & \makecell[c]{5.43\\ \scalebox{\std}{± 0.10}} & \makecell[c]{37.17\\ \scalebox{\std}{± 2.42}} & \makecell[c]{6.42\\ \scalebox{\std}{± 0.25}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{24.17\\ \scalebox{\std}{± 0.41}} \\
\multirow{-5}{*}{TENT~\cite{tent}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.55\\ \scalebox{\std}{± 0.59}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.09\\ \scalebox{\std}{± 0.17}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.60\\ \scalebox{\std}{± 0.31}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.57\\ \scalebox{\std}{± 0.18}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{10.27\\ \scalebox{\std}{± 0.36}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.05\\ \scalebox{\std}{± 0.37}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.60\\ \scalebox{\std}{± 0.53}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.26\\ \scalebox{\std}{± 0.09}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.21\\ \scalebox{\std}{± 0.03}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.81\\ \scalebox{\std}{± 0.09}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.54\\ \scalebox{\std}{± 0.28}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.32\\ \scalebox{\std}{± 0.56}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.99\\ \scalebox{\std}{± 0.66}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.50\\ \scalebox{\std}{± 0.23}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.96\\ \scalebox{\std}{± 0.08}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.89\\ \scalebox{\std}{± 0.15}} \\ \midrule
 & SrcValid & \makecell[c]{7.65\\ \scalebox{\std}{± 0.94}} & \makecell[c]{7.33\\ \scalebox{\std}{± 0.33}} & \makecell[c]{7.51\\ \scalebox{\std}{± 0.42}} & \makecell[c]{15.32\\ \scalebox{\std}{± 1.91}} & \makecell[c]{8.35\\ \scalebox{\std}{± 0.45}} & \makecell[c]{13.56\\ \scalebox{\std}{± 1.08}} & \makecell[c]{12.54\\ \scalebox{\std}{± 1.32}} & \makecell[c]{9.21\\ \scalebox{\std}{± 1.68}} & \makecell[c]{8.91\\ \scalebox{\std}{± 0.48}} & \makecell[c]{9.31\\ \scalebox{\std}{± 0.73}} & \makecell[c]{17.52\\ \scalebox{\std}{± 1.75}} & \makecell[c]{15.60\\ \scalebox{\std}{± 0.77}} & \makecell[c]{9.99\\ \scalebox{\std}{± 1.31}} & \makecell[c]{9.80\\ \scalebox{\std}{± 0.28}} & \makecell[c]{8.09\\ \scalebox{\std}{± 0.44}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{10.71\\ \scalebox{\std}{± 0.31}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{36.65\\ \scalebox{\std}{± 1.55}} & \makecell[c]{35.32\\ \scalebox{\std}{± 1.08}} & \makecell[c]{40.06\\ \scalebox{\std}{± 2.16}} & \makecell[c]{15.38\\ \scalebox{\std}{± 1.48}} & \makecell[c]{35.65\\ \scalebox{\std}{± 2.20}} & \makecell[c]{19.64\\ \scalebox{\std}{± 2.12}} & \makecell[c]{20.05\\ \scalebox{\std}{± 5.94}} & \makecell[c]{27.09\\ \scalebox{\std}{± 4.80}} & \makecell[c]{31.54\\ \scalebox{\std}{± 4.28}} & \makecell[c]{24.67\\ \scalebox{\std}{± 4.45}} & \makecell[c]{15.80\\ \scalebox{\std}{± 2.67}} & \makecell[c]{25.68\\ \scalebox{\std}{± 7.41}} & \makecell[c]{31.88\\ \scalebox{\std}{± 2.77}} & \makecell[c]{26.64\\ \scalebox{\std}{± 2.28}} & \makecell[c]{31.81\\ \scalebox{\std}{± 0.85}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{27.86\\ \scalebox{\std}{± 1.11}} \\
 & GDE~\cite{disagreement} & \makecell[c]{83.95\\ \scalebox{\std}{± 1.83}} & \makecell[c]{83.36\\ \scalebox{\std}{± 1.31}} & \makecell[c]{88.21\\ \scalebox{\std}{± 0.66}} & \makecell[c]{55.73\\ \scalebox{\std}{± 3.68}} & \makecell[c]{84.37\\ \scalebox{\std}{± 1.49}} & \makecell[c]{62.93\\ \scalebox{\std}{± 2.76}} & \makecell[c]{60.31\\ \scalebox{\std}{± 7.35}} & \makecell[c]{73.24\\ \scalebox{\std}{± 4.38}} & \makecell[c]{77.51\\ \scalebox{\std}{± 2.86}} & \makecell[c]{70.26\\ \scalebox{\std}{± 4.24}} & \makecell[c]{40.71\\ \scalebox{\std}{± 3.37}} & \makecell[c]{62.07\\ \scalebox{\std}{± 9.58}} & \makecell[c]{79.02\\ \scalebox{\std}{± 2.84}} & \makecell[c]{71.39\\ \scalebox{\std}{± 1.01}} & \makecell[c]{79.95\\ \scalebox{\std}{± 1.07}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{71.53\\ \scalebox{\std}{± 2.12}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{9.32\\ \scalebox{\std}{± 0.83}} & \makecell[c]{8.30\\ \scalebox{\std}{± 0.59}} & \makecell[c]{5.08\\ \scalebox{\std}{± 0.16}} & \makecell[c]{16.05\\ \scalebox{\std}{± 1.45}} & \makecell[c]{5.75\\ \scalebox{\std}{± 0.87}} & \makecell[c]{7.72\\ \scalebox{\std}{± 0.40}} & \makecell[c]{10.48\\ \scalebox{\std}{± 1.96}} & \makecell[c]{4.03\\ \scalebox{\std}{± 0.44}} & \makecell[c]{6.07\\ \scalebox{\std}{± 1.16}} & \makecell[c]{6.79\\ \scalebox{\std}{± 1.07}} & \makecell[c]{3.90\\ \scalebox{\std}{± 0.18}} & \makecell[c]{21.65\\ \scalebox{\std}{± 6.92}} & \makecell[c]{2.93\\ \scalebox{\std}{± 0.09}} & \makecell[c]{12.38\\ \scalebox{\std}{± 0.61}} & \makecell[c]{2.87\\ \scalebox{\std}{± 0.07}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{8.22\\ \scalebox{\std}{± 0.56}} \\
\multirow{-5}{*}{EATA~\cite{eata}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{18.86\\ \scalebox{\std}{± 2.19}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{19.47\\ \scalebox{\std}{± 1.46}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{18.76\\ \scalebox{\std}{± 5.27}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{17.54\\ \scalebox{\std}{± 1.56}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{26.65\\ \scalebox{\std}{± 2.14}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{21.51\\ \scalebox{\std}{± 1.48}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{19.45\\ \scalebox{\std}{± 2.91}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{20.39\\ \scalebox{\std}{± 2.07}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{18.76\\ \scalebox{\std}{± 1.79}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{19.32\\ \scalebox{\std}{± 0.12}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{11.18\\ \scalebox{\std}{± 1.82}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{23.44\\ \scalebox{\std}{± 4.90}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{24.58\\ \scalebox{\std}{± 0.69}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{20.48\\ \scalebox{\std}{± 2.46}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{21.82\\ \scalebox{\std}{± 5.09}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{20.15\\ \scalebox{\std}{± 1.70}} \\ \midrule
 & SrcValid & \makecell[c]{53.44\\ \scalebox{\std}{± 0.56}} & \makecell[c]{51.46\\ \scalebox{\std}{± 0.76}} & \makecell[c]{59.19\\ \scalebox{\std}{± 0.77}} & \makecell[c]{32.52\\ \scalebox{\std}{± 0.17}} & \makecell[c]{53.90\\ \scalebox{\std}{± 0.52}} & \makecell[c]{35.00\\ \scalebox{\std}{± 0.71}} & \makecell[c]{33.63\\ \scalebox{\std}{± 0.21}} & \makecell[c]{43.03\\ \scalebox{\std}{± 0.48}} & \makecell[c]{43.55\\ \scalebox{\std}{± 0.30}} & \makecell[c]{38.69\\ \scalebox{\std}{± 0.40}} & \makecell[c]{30.12\\ \scalebox{\std}{± 0.20}} & \makecell[c]{33.17\\ \scalebox{\std}{± 0.40}} & \makecell[c]{43.77\\ \scalebox{\std}{± 0.40}} & \makecell[c]{39.64\\ \scalebox{\std}{± 0.44}} & \makecell[c]{49.16\\ \scalebox{\std}{± 0.23}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{42.68\\ \scalebox{\std}{± 0.21}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{20.82\\ \scalebox{\std}{± 0.65}} & \makecell[c]{21.98\\ \scalebox{\std}{± 0.41}} & \makecell[c]{18.42\\ \scalebox{\std}{± 0.24}} & \makecell[c]{27.91\\ \scalebox{\std}{± 0.39}} & \makecell[c]{20.67\\ \scalebox{\std}{± 0.57}} & \makecell[c]{26.77\\ \scalebox{\std}{± 0.56}} & \makecell[c]{27.81\\ \scalebox{\std}{± 0.20}} & \makecell[c]{24.33\\ \scalebox{\std}{± 0.51}} & \makecell[c]{24.01\\ \scalebox{\std}{± 0.40}} & \makecell[c]{26.76\\ \scalebox{\std}{± 0.48}} & \makecell[c]{27.92\\ \scalebox{\std}{± 0.16}} & \makecell[c]{25.97\\ \scalebox{\std}{± 0.20}} & \makecell[c]{25.72\\ \scalebox{\std}{± 0.41}} & \makecell[c]{26.20\\ \scalebox{\std}{± 0.38}} & \makecell[c]{23.11\\ \scalebox{\std}{± 0.08}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{24.56\\ \scalebox{\std}{± 0.25}} \\
 & GDE~\cite{disagreement} & \makecell[c]{53.09\\ \scalebox{\std}{± 0.53}} & \makecell[c]{51.11\\ \scalebox{\std}{± 0.69}} & \makecell[c]{58.81\\ \scalebox{\std}{± 0.77}} & \makecell[c]{32.45\\ \scalebox{\std}{± 0.21}} & \makecell[c]{53.63\\ \scalebox{\std}{± 0.55}} & \makecell[c]{34.91\\ \scalebox{\std}{± 0.76}} & \makecell[c]{33.60\\ \scalebox{\std}{± 0.26}} & \makecell[c]{42.90\\ \scalebox{\std}{± 0.57}} & \makecell[c]{43.42\\ \scalebox{\std}{± 0.32}} & \makecell[c]{38.63\\ \scalebox{\std}{± 0.43}} & \makecell[c]{30.05\\ \scalebox{\std}{± 0.19}} & \makecell[c]{33.08\\ \scalebox{\std}{± 0.44}} & \makecell[c]{43.65\\ \scalebox{\std}{± 0.37}} & \makecell[c]{39.50\\ \scalebox{\std}{± 0.44}} & \makecell[c]{48.87\\ \scalebox{\std}{± 0.21}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{42.51\\ \scalebox{\std}{± 0.23}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{35.15\\ \scalebox{\std}{± 1.78}} & \makecell[c]{36.42\\ \scalebox{\std}{± 1.64}} & \makecell[c]{32.87\\ \scalebox{\std}{± 1.13}} & \makecell[c]{30.78\\ \scalebox{\std}{± 0.58}} & \makecell[c]{23.87\\ \scalebox{\std}{± 1.07}} & \makecell[c]{16.74\\ \scalebox{\std}{± 0.22}} & \makecell[c]{22.52\\ \scalebox{\std}{± 0.46}} & \makecell[c]{7.21\\ \scalebox{\std}{± 0.46}} & \makecell[c]{21.01\\ \scalebox{\std}{± 0.61}} & \makecell[c]{18.54\\ \scalebox{\std}{± 0.59}} & \makecell[c]{4.30\\ \scalebox{\std}{± 0.42}} & \makecell[c]{48.28\\ \scalebox{\std}{± 0.16}} & \makecell[c]{5.65\\ \scalebox{\std}{± 0.37}} & \makecell[c]{34.21\\ \scalebox{\std}{± 2.97}} & \makecell[c]{6.09\\ \scalebox{\std}{± 0.46}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{22.91\\ \scalebox{\std}{± 0.60}} \\
\multirow{-5}{*}{SAR~\cite{sar}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.75\\ \scalebox{\std}{± 0.45}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.33\\ \scalebox{\std}{± 0.29}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.90\\ \scalebox{\std}{± 0.24}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.19\\ \scalebox{\std}{± 0.22}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.56\\ \scalebox{\std}{± 0.54}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.43\\ \scalebox{\std}{± 0.32}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.82\\ \scalebox{\std}{± 0.30}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.34\\ \scalebox{\std}{± 0.28}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.15\\ \scalebox{\std}{± 0.38}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.47\\ \scalebox{\std}{± 0.18}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.37\\ \scalebox{\std}{± 0.19}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.04\\ \scalebox{\std}{± 0.23}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.57\\ \scalebox{\std}{± 0.50}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.72\\ \scalebox{\std}{± 0.33}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.50\\ \scalebox{\std}{± 0.24}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.54\\ \scalebox{\std}{± 0.15}} \\ \midrule
 & SrcValid & \makecell[c]{53.11\\ \scalebox{\std}{± 0.39}} & \makecell[c]{51.88\\ \scalebox{\std}{± 0.44}} & \makecell[c]{57.18\\ \scalebox{\std}{± 0.18}} & \makecell[c]{36.41\\ \scalebox{\std}{± 0.61}} & \makecell[c]{53.90\\ \scalebox{\std}{± 0.59}} & \makecell[c]{38.59\\ \scalebox{\std}{± 0.67}} & \makecell[c]{37.33\\ \scalebox{\std}{± 0.46}} & \makecell[c]{44.93\\ \scalebox{\std}{± 0.62}} & \makecell[c]{44.30\\ \scalebox{\std}{± 0.32}} & \makecell[c]{43.99\\ \scalebox{\std}{± 0.46}} & \makecell[c]{32.17\\ \scalebox{\std}{± 0.16}} & \makecell[c]{41.50\\ \scalebox{\std}{± 1.22}} & \makecell[c]{45.74\\ \scalebox{\std}{± 0.20}} & \makecell[c]{40.52\\ \scalebox{\std}{± 0.10}} & \makecell[c]{47.13\\ \scalebox{\std}{± 0.13}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{44.58\\ \scalebox{\std}{± 0.30}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{34.06\\ \scalebox{\std}{± 0.49}} & \makecell[c]{35.00\\ \scalebox{\std}{± 0.50}} & \makecell[c]{32.31\\ \scalebox{\std}{± 0.35}} & \makecell[c]{32.88\\ \scalebox{\std}{± 0.53}} & \makecell[c]{32.16\\ \scalebox{\std}{± 0.58}} & \makecell[c]{33.73\\ \scalebox{\std}{± 0.76}} & \makecell[c]{34.70\\ \scalebox{\std}{± 0.60}} & \makecell[c]{36.25\\ \scalebox{\std}{± 0.96}} & \makecell[c]{36.42\\ \scalebox{\std}{± 0.36}} & \makecell[c]{36.35\\ \scalebox{\std}{± 0.55}} & \makecell[c]{30.59\\ \scalebox{\std}{± 0.77}} & \makecell[c]{32.11\\ \scalebox{\std}{± 0.49}} & \makecell[c]{36.57\\ \scalebox{\std}{± 0.30}} & \makecell[c]{38.82\\ \scalebox{\std}{± 0.21}} & \makecell[c]{35.60\\ \scalebox{\std}{± 0.09}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{34.50\\ \scalebox{\std}{± 0.35}} \\
 & GDE~\cite{disagreement} & \makecell[c]{36.44\\ \scalebox{\std}{± 0.63}} & \makecell[c]{35.59\\ \scalebox{\std}{± 0.34}} & \makecell[c]{38.06\\ \scalebox{\std}{± 0.50}} & \makecell[c]{31.10\\ \scalebox{\std}{± 0.04}} & \makecell[c]{38.74\\ \scalebox{\std}{± 0.55}} & \makecell[c]{31.56\\ \scalebox{\std}{± 0.59}} & \makecell[c]{30.66\\ \scalebox{\std}{± 0.31}} & \makecell[c]{32.25\\ \scalebox{\std}{± 0.84}} & \makecell[c]{31.99\\ \scalebox{\std}{± 0.24}} & \makecell[c]{32.19\\ \scalebox{\std}{± 0.47}} & \makecell[c]{29.77\\ \scalebox{\std}{± 0.36}} & \makecell[c]{33.19\\ \scalebox{\std}{± 0.08}} & \makecell[c]{33.02\\ \scalebox{\std}{± 0.17}} & \makecell[c]{29.45\\ \scalebox{\std}{± 0.39}} & \makecell[c]{34.16\\ \scalebox{\std}{± 0.50}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{33.21\\ \scalebox{\std}{± 0.24}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{26.80\\ \scalebox{\std}{± 0.88}} & \makecell[c]{26.32\\ \scalebox{\std}{± 0.95}} & \makecell[c]{26.90\\ \scalebox{\std}{± 0.45}} & \makecell[c]{32.56\\ \scalebox{\std}{± 0.56}} & \makecell[c]{12.67\\ \scalebox{\std}{± 0.28}} & \makecell[c]{24.33\\ \scalebox{\std}{± 0.97}} & \makecell[c]{27.17\\ \scalebox{\std}{± 0.14}} & \makecell[c]{5.68\\ \scalebox{\std}{± 0.18}} & \makecell[c]{11.58\\ \scalebox{\std}{± 0.31}} & \makecell[c]{30.58\\ \scalebox{\std}{± 0.42}} & \makecell[c]{5.21\\ \scalebox{\std}{± 0.30}} & \makecell[c]{47.59\\ \scalebox{\std}{± 0.76}} & \makecell[c]{10.03\\ \scalebox{\std}{± 0.60}} & \makecell[c]{14.24\\ \scalebox{\std}{± 1.60}} & \makecell[c]{6.28\\ \scalebox{\std}{± 0.19}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{20.53\\ \scalebox{\std}{± 0.14}} \\
\multirow{-5}{*}{CoTTA~\cite{cotta}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.24\\ \scalebox{\std}{± 0.38}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.97\\ \scalebox{\std}{± 0.55}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.79\\ \scalebox{\std}{± 0.53}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.03\\ \scalebox{\std}{± 0.20}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.83\\ \scalebox{\std}{± 0.22}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.73\\ \scalebox{\std}{± 0.21}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.92\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.66\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.76\\ \scalebox{\std}{± 0.43}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.41\\ \scalebox{\std}{± 0.16}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.04\\ \scalebox{\std}{± 0.36}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.68\\ \scalebox{\std}{± 0.24}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.78\\ \scalebox{\std}{± 0.22}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.27\\ \scalebox{\std}{± 0.28}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.65\\ \scalebox{\std}{± 0.15}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.05\\ \scalebox{\std}{± 0.12}} \\ \midrule
 & SrcValid & \makecell[c]{28.06\\ \scalebox{\std}{± 2.60}} & \makecell[c]{29.10\\ \scalebox{\std}{± 2.49}} & \makecell[c]{16.22\\ \scalebox{\std}{± 0.43}} & \makecell[c]{34.38\\ \scalebox{\std}{± 1.94}} & \makecell[c]{10.57\\ \scalebox{\std}{± 1.13}} & \makecell[c]{7.34\\ \scalebox{\std}{± 0.25}} & \makecell[c]{17.58\\ \scalebox{\std}{± 2.21}} & \makecell[c]{14.31\\ \scalebox{\std}{± 1.55}} & \makecell[c]{30.17\\ \scalebox{\std}{± 1.28}} & \makecell[c]{15.75\\ \scalebox{\std}{± 2.45}} & \makecell[c]{30.84\\ \scalebox{\std}{± 2.27}} & \makecell[c]{28.00\\ \scalebox{\std}{± 1.54}} & \makecell[c]{31.30\\ \scalebox{\std}{± 1.65}} & \makecell[c]{15.53\\ \scalebox{\std}{± 1.36}} & \makecell[c]{43.38\\ \scalebox{\std}{± 1.76}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{23.50\\ \scalebox{\std}{± 0.51}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{18.70\\ \scalebox{\std}{± 0.62}} & \makecell[c]{19.63\\ \scalebox{\std}{± 0.63}} & \makecell[c]{17.09\\ \scalebox{\std}{± 0.46}} & \makecell[c]{29.83\\ \scalebox{\std}{± 0.32}} & \makecell[c]{21.67\\ \scalebox{\std}{± 0.49}} & \makecell[c]{28.86\\ \scalebox{\std}{± 0.09}} & \makecell[c]{30.26\\ \scalebox{\std}{± 0.35}} & \makecell[c]{25.59\\ \scalebox{\std}{± 0.36}} & \makecell[c]{21.84\\ \scalebox{\std}{± 0.54}} & \makecell[c]{28.33\\ \scalebox{\std}{± 0.32}} & \makecell[c]{29.95\\ \scalebox{\std}{± 0.16}} & \makecell[c]{13.55\\ \scalebox{\std}{± 0.61}} & \makecell[c]{27.63\\ \scalebox{\std}{± 0.23}} & \makecell[c]{26.73\\ \scalebox{\std}{± 0.37}} & \makecell[c]{23.09\\ \scalebox{\std}{± 0.49}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{24.18\\ \scalebox{\std}{± 0.19}} \\
 & GDE~\cite{disagreement} & \makecell[c]{59.99\\ \scalebox{\std}{± 1.07}} & \makecell[c]{59.43\\ \scalebox{\std}{± 1.11}} & \makecell[c]{63.72\\ \scalebox{\std}{± 0.99}} & \makecell[c]{33.71\\ \scalebox{\std}{± 0.18}} & \makecell[c]{56.11\\ \scalebox{\std}{± 0.24}} & \makecell[c]{36.10\\ \scalebox{\std}{± 0.55}} & \makecell[c]{34.68\\ \scalebox{\std}{± 0.67}} & \makecell[c]{46.27\\ \scalebox{\std}{± 0.67}} & \makecell[c]{52.01\\ \scalebox{\std}{± 0.59}} & \makecell[c]{41.07\\ \scalebox{\std}{± 0.17}} & \makecell[c]{31.91\\ \scalebox{\std}{± 0.62}} & \makecell[c]{62.02\\ \scalebox{\std}{± 0.22}} & \makecell[c]{45.38\\ \scalebox{\std}{± 0.57}} & \makecell[c]{44.14\\ \scalebox{\std}{± 1.05}} & \makecell[c]{53.79\\ \scalebox{\std}{± 0.67}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{48.02\\ \scalebox{\std}{± 0.56}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{25.47\\ \scalebox{\std}{± 1.49}} & \makecell[c]{26.56\\ \scalebox{\std}{± 1.30}} & \makecell[c]{25.08\\ \scalebox{\std}{± 0.90}} & \makecell[c]{28.92\\ \scalebox{\std}{± 0.25}} & \makecell[c]{19.97\\ \scalebox{\std}{± 0.92}} & \makecell[c]{15.41\\ \scalebox{\std}{± 0.66}} & \makecell[c]{21.68\\ \scalebox{\std}{± 0.69}} & \makecell[c]{6.37\\ \scalebox{\std}{± 0.19}} & \makecell[c]{13.92\\ \scalebox{\std}{± 0.74}} & \makecell[c]{15.02\\ \scalebox{\std}{± 0.93}} & \makecell[c]{4.50\\ \scalebox{\std}{± 0.35}} & \makecell[c]{20.75\\ \scalebox{\std}{± 1.25}} & \makecell[c]{6.84\\ \scalebox{\std}{± 0.53}} & \makecell[c]{28.80\\ \scalebox{\std}{± 3.46}} & \makecell[c]{8.34\\ \scalebox{\std}{± 0.62}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{17.84\\ \scalebox{\std}{± 0.65}} \\
\multirow{-5}{*}{RoTTA~\cite{rotta}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.37\\ \scalebox{\std}{± 1.21}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.88\\ \scalebox{\std}{± 0.64}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.81\\ \scalebox{\std}{± 0.43}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.74\\ \scalebox{\std}{± 0.23}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.28\\ \scalebox{\std}{± 0.34}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.29\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.18\\ \scalebox{\std}{± 0.42}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.22\\ \scalebox{\std}{± 0.43}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.58\\ \scalebox{\std}{± 1.46}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.74\\ \scalebox{\std}{± 0.26}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.79\\ \scalebox{\std}{± 0.48}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{21.01\\ \scalebox{\std}{± 3.35}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.53\\ \scalebox{\std}{± 0.49}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.28\\ \scalebox{\std}{± 0.56}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.43\\ \scalebox{\std}{± 1.21}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.88\\ \scalebox{\std}{± 0.10}} \\ \midrule
 & SrcValid & \makecell[c]{13.43\\ \scalebox{\std}{± 2.25}} & \makecell[c]{12.55\\ \scalebox{\std}{± 1.41}} & \makecell[c]{8.28\\ \scalebox{\std}{± 1.68}} & \makecell[c]{37.61\\ \scalebox{\std}{± 3.05}} & \makecell[c]{12.41\\ \scalebox{\std}{± 3.74}} & \makecell[c]{6.47\\ \scalebox{\std}{± 0.88}} & \makecell[c]{12.99\\ \scalebox{\std}{± 1.60}} & \makecell[c]{16.44\\ \scalebox{\std}{± 1.42}} & \makecell[c]{13.05\\ \scalebox{\std}{± 1.58}} & \makecell[c]{7.58\\ \scalebox{\std}{± 0.80}} & \makecell[c]{10.12\\ \scalebox{\std}{± 1.70}} & \makecell[c]{51.02\\ \scalebox{\std}{± 4.34}} & \makecell[c]{35.45\\ \scalebox{\std}{± 0.57}} & \makecell[c]{11.12\\ \scalebox{\std}{± 4.35}} & \makecell[c]{41.55\\ \scalebox{\std}{± 0.87}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{19.34\\ \scalebox{\std}{± 0.63}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{21.66\\ \scalebox{\std}{± 0.49}} & \makecell[c]{22.18\\ \scalebox{\std}{± 0.45}} & \makecell[c]{19.31\\ \scalebox{\std}{± 0.05}} & \makecell[c]{25.96\\ \scalebox{\std}{± 0.20}} & \makecell[c]{21.08\\ \scalebox{\std}{± 0.62}} & \makecell[c]{25.26\\ \scalebox{\std}{± 0.47}} & \makecell[c]{26.54\\ \scalebox{\std}{± 0.43}} & \makecell[c]{24.57\\ \scalebox{\std}{± 0.52}} & \makecell[c]{24.12\\ \scalebox{\std}{± 0.19}} & \makecell[c]{25.49\\ \scalebox{\std}{± 0.26}} & \makecell[c]{26.25\\ \scalebox{\std}{± 0.47}} & \makecell[c]{23.62\\ \scalebox{\std}{± 0.52}} & \makecell[c]{25.30\\ \scalebox{\std}{± 0.45}} & \makecell[c]{25.17\\ \scalebox{\std}{± 0.29}} & \makecell[c]{23.23\\ \scalebox{\std}{± 0.39}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{23.98\\ \scalebox{\std}{± 0.21}} \\
 & GDE~\cite{disagreement} & \makecell[c]{41.62\\ \scalebox{\std}{± 0.27}} & \makecell[c]{40.60\\ \scalebox{\std}{± 0.39}} & \makecell[c]{48.16\\ \scalebox{\std}{± 0.43}} & \makecell[c]{26.49\\ \scalebox{\std}{± 0.24}} & \makecell[c]{44.28\\ \scalebox{\std}{± 0.48}} & \makecell[c]{28.68\\ \scalebox{\std}{± 0.42}} & \makecell[c]{26.67\\ \scalebox{\std}{± 0.34}} & \makecell[c]{33.51\\ \scalebox{\std}{± 0.51}} & \makecell[c]{34.05\\ \scalebox{\std}{± 0.18}} & \makecell[c]{30.44\\ \scalebox{\std}{± 0.51}} & \makecell[c]{24.30\\ \scalebox{\std}{± 0.32}} & \makecell[c]{27.73\\ \scalebox{\std}{± 0.69}} & \makecell[c]{36.20\\ \scalebox{\std}{± 0.21}} & \makecell[c]{31.28\\ \scalebox{\std}{± 0.44}} & \makecell[c]{39.52\\ \scalebox{\std}{± 0.24}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{34.24\\ \scalebox{\std}{± 0.12}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{41.39\\ \scalebox{\std}{± 1.33}} & \makecell[c]{41.35\\ \scalebox{\std}{± 1.19}} & \makecell[c]{38.06\\ \scalebox{\std}{± 0.94}} & \makecell[c]{32.19\\ \scalebox{\std}{± 0.31}} & \makecell[c]{27.92\\ \scalebox{\std}{± 1.51}} & \makecell[c]{18.64\\ \scalebox{\std}{± 0.14}} & \makecell[c]{24.83\\ \scalebox{\std}{± 0.76}} & \makecell[c]{10.19\\ \scalebox{\std}{± 0.29}} & \makecell[c]{24.05\\ \scalebox{\std}{± 0.89}} & \makecell[c]{21.34\\ \scalebox{\std}{± 0.74}} & \makecell[c]{4.84\\ \scalebox{\std}{± 0.51}} & \makecell[c]{48.66\\ \scalebox{\std}{± 0.26}} & \makecell[c]{6.74\\ \scalebox{\std}{± 0.18}} & \makecell[c]{38.48\\ \scalebox{\std}{± 1.95}} & \makecell[c]{7.88\\ \scalebox{\std}{± 0.29}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{25.77\\ \scalebox{\std}{± 0.47}} \\
\multirow{-5}{*}{SoTTA~\cite{sotta}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.11\\ \scalebox{\std}{± 0.91}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.95\\ \scalebox{\std}{± 0.39}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.35\\ \scalebox{\std}{± 0.55}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.67\\ \scalebox{\std}{± 0.32}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.03\\ \scalebox{\std}{± 0.26}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.28\\ \scalebox{\std}{± 0.18}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.50\\ \scalebox{\std}{± 0.28}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.68\\ \scalebox{\std}{± 0.21}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.73\\ \scalebox{\std}{± 0.42}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.00\\ \scalebox{\std}{± 0.37}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.18\\ \scalebox{\std}{± 0.04}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.30\\ \scalebox{\std}{± 0.20}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.58\\ \scalebox{\std}{± 0.07}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.94\\ \scalebox{\std}{± 0.48}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.02\\ \scalebox{\std}{± 0.19}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.29\\ \scalebox{\std}{± 0.18}} \\
\Xhline{2\arrayrulewidth}
\end{tabularx}

\begin{table*}[H]
\centering
\caption{Mean absolute error (MAE) (\%) of the accuracy estimation on full ImageNet-C. Averaged over three different random seeds.}
\label{tab:main95experiment95imagenet95full}
\scriptsize
\setlength\tabcolsep{3pt}
\begin{tabularx}{\textwidth}{ll*{16}Y}
\Xhline{2\arrayrulewidth}
 &  & \multicolumn{3}{c}{Noise} & \multicolumn{4}{c}{Blur} & \multicolumn{4}{c}{Weather} & \multicolumn{4}{c}{Digital} &  \\ 
\addlinespace[-0.05cm]
\cmidrule(lr){3-5} \cmidrule(lr){6-9} \cmidrule(lr){10-13} \cmidrule(lr){14-17}
\addlinespace[-0.05cm]
TTA Method & Acc. Estimation & Gau. & Shot & Imp. & Def. & Gla. & Mot. & Zoom & Snow & Fro. & Fog & Brit. & Cont. & Elas. & Pix. & JPEG & \cellcolor[HTML]{EDEEFF}Avg.($\downarrow$) \\ \midrule
 & SrcValid & \makecell[c]{53.54\\ \scalebox{\std}{± 1.08}} & \makecell[c]{52.12\\ \scalebox{\std}{± 1.12}} & \makecell[c]{53.28\\ \scalebox{\std}{± 0.95}} & \makecell[c]{54.72\\ \scalebox{\std}{± 0.95}} & \makecell[c]{53.66\\ \scalebox{\std}{± 1.07}} & \makecell[c]{42.96\\ \scalebox{\std}{± 0.92}} & \makecell[c]{32.78\\ \scalebox{\std}{± 0.83}} & \makecell[c]{37.25\\ \scalebox{\std}{± 0.96}} & \makecell[c]{38.77\\ \scalebox{\std}{± 0.97}} & \makecell[c]{25.06\\ \scalebox{\std}{± 0.94}} & \makecell[c]{10.47\\ \scalebox{\std}{± 0.67}} & \makecell[c]{53.56\\ \scalebox{\std}{± 0.77}} & \makecell[c]{27.86\\ \scalebox{\std}{± 0.85}} & \makecell[c]{22.35\\ \scalebox{\std}{± 0.70}} & \makecell[c]{28.55\\ \scalebox{\std}{± 0.71}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{39.13\\ \scalebox{\std}{± 0.89}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{11.11\\ \scalebox{\std}{± 0.14}} & \makecell[c]{11.97\\ \scalebox{\std}{± 0.06}} & \makecell[c]{11.35\\ \scalebox{\std}{± 0.03}} & \makecell[c]{9.82\\ \scalebox{\std}{± 0.09}} & \makecell[c]{10.59\\ \scalebox{\std}{± 0.07}} & \makecell[c]{19.16\\ \scalebox{\std}{± 0.09}} & \makecell[c]{26.76\\ \scalebox{\std}{± 0.09}} & \makecell[c]{22.44\\ \scalebox{\std}{± 0.05}} & \makecell[c]{21.08\\ \scalebox{\std}{± 0.06}} & \makecell[c]{31.41\\ \scalebox{\std}{± 0.05}} & \makecell[c]{34.87\\ \scalebox{\std}{± 0.01}} & \makecell[c]{9.52\\ \scalebox{\std}{± 0.04}} & \makecell[c]{28.83\\ \scalebox{\std}{± 0.05}} & \makecell[c]{32.26\\ \scalebox{\std}{± 0.14}} & \makecell[c]{28.88\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{20.67\\ \scalebox{\std}{± 0.01}} \\
 & GDE~\cite{disagreement} & \makecell[c]{85.03\\ \scalebox{\std}{± 0.14}} & \makecell[c]{83.75\\ \scalebox{\std}{± 0.06}} & \makecell[c]{84.81\\ \scalebox{\std}{± 0.04}} & \makecell[c]{85.74\\ \scalebox{\std}{± 0.09}} & \makecell[c]{84.87\\ \scalebox{\std}{± 0.08}} & \makecell[c]{74.26\\ \scalebox{\std}{± 0.10}} & \makecell[c]{64.29\\ \scalebox{\std}{± 0.08}} & \makecell[c]{68.95\\ \scalebox{\std}{± 0.06}} & \makecell[c]{70.47\\ \scalebox{\std}{± 0.07}} & \makecell[c]{56.71\\ \scalebox{\std}{± 0.02}} & \makecell[c]{41.78\\ \scalebox{\std}{± 0.02}} & \makecell[c]{84.43\\ \scalebox{\std}{± 0.09}} & \makecell[c]{59.48\\ \scalebox{\std}{± 0.07}} & \makecell[c]{53.93\\ \scalebox{\std}{± 0.12}} & \makecell[c]{60.29\\ \scalebox{\std}{± 0.14}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{70.58\\ \scalebox{\std}{± 0.01}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{13.38\\ \scalebox{\std}{± 0.12}} & \makecell[c]{14.08\\ \scalebox{\std}{± 0.05}} & \makecell[c]{13.44\\ \scalebox{\std}{± 0.04}} & \makecell[c]{4.43\\ \scalebox{\std}{± 0.15}} & \makecell[c]{5.21\\ \scalebox{\std}{± 0.14}} & \makecell[c]{11.48\\ \scalebox{\std}{± 0.05}} & \makecell[c]{11.47\\ \scalebox{\std}{± 0.13}} & \makecell[c]{17.17\\ \scalebox{\std}{± 0.13}} & \makecell[c]{8.67\\ \scalebox{\std}{± 0.10}} & \makecell[c]{26.80\\ \scalebox{\std}{± 0.03}} & \makecell[c]{6.56\\ \scalebox{\std}{± 0.16}} & \makecell[c]{11.99\\ \scalebox{\std}{± 0.03}} & \makecell[c]{17.97\\ \scalebox{\std}{± 0.03}} & \makecell[c]{19.11\\ \scalebox{\std}{± 0.14}} & \makecell[c]{6.69\\ \scalebox{\std}{± 0.19}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{12.56\\ \scalebox{\std}{± 0.03}} \\
\multirow{-5}{*}{TENT~\cite{tent}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.88\\ \scalebox{\std}{± 0.10}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.27\\ \scalebox{\std}{± 0.02}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.93\\ \scalebox{\std}{± 0.09}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.23\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.03\\ \scalebox{\std}{± 0.15}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.53\\ \scalebox{\std}{± 0.09}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.72\\ \scalebox{\std}{± 0.12}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.35\\ \scalebox{\std}{± 0.13}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.37\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.50\\ \scalebox{\std}{± 0.12}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.88\\ \scalebox{\std}{± 0.05}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.24\\ \scalebox{\std}{± 0.02}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.34\\ \scalebox{\std}{± 0.06}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.95\\ \scalebox{\std}{± 0.21}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.82\\ \scalebox{\std}{± 0.17}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.14\\ \scalebox{\std}{± 0.03}} \\ \midrule
 & SrcValid & \makecell[c]{50.34\\ \scalebox{\std}{± 1.01}} & \makecell[c]{48.89\\ \scalebox{\std}{± 0.88}} & \makecell[c]{49.88\\ \scalebox{\std}{± 0.76}} & \makecell[c]{52.47\\ \scalebox{\std}{± 0.59}} & \makecell[c]{51.11\\ \scalebox{\std}{± 0.57}} & \makecell[c]{40.01\\ \scalebox{\std}{± 0.83}} & \makecell[c]{28.80\\ \scalebox{\std}{± 0.84}} & \makecell[c]{32.41\\ \scalebox{\std}{± 0.71}} & \makecell[c]{35.23\\ \scalebox{\std}{± 0.95}} & \makecell[c]{21.78\\ \scalebox{\std}{± 0.93}} & \makecell[c]{8.88\\ \scalebox{\std}{± 0.53}} & \makecell[c]{51.16\\ \scalebox{\std}{± 0.97}} & \makecell[c]{23.11\\ \scalebox{\std}{± 1.18}} & \makecell[c]{19.03\\ \scalebox{\std}{± 0.72}} & \makecell[c]{25.30\\ \scalebox{\std}{± 0.92}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{35.89\\ \scalebox{\std}{± 0.79}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{12.31\\ \scalebox{\std}{± 0.08}} & \makecell[c]{13.18\\ \scalebox{\std}{± 0.15}} & \makecell[c]{12.71\\ \scalebox{\std}{± 0.04}} & \makecell[c]{10.26\\ \scalebox{\std}{± 0.14}} & \makecell[c]{10.97\\ \scalebox{\std}{± 0.05}} & \makecell[c]{19.72\\ \scalebox{\std}{± 0.03}} & \makecell[c]{27.54\\ \scalebox{\std}{± 0.10}} & \makecell[c]{23.79\\ \scalebox{\std}{± 0.12}} & \makecell[c]{21.46\\ \scalebox{\std}{± 0.08}} & \makecell[c]{30.90\\ \scalebox{\std}{± 0.11}} & \makecell[c]{32.96\\ \scalebox{\std}{± 0.08}} & \makecell[c]{10.51\\ \scalebox{\std}{± 0.08}} & \makecell[c]{29.11\\ \scalebox{\std}{± 0.07}} & \makecell[c]{31.61\\ \scalebox{\std}{± 0.08}} & \makecell[c]{28.90\\ \scalebox{\std}{± 0.04}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{21.06\\ \scalebox{\std}{± 0.03}} \\
 & GDE~\cite{disagreement} & \makecell[c]{81.21\\ \scalebox{\std}{± 0.16}} & \makecell[c]{79.99\\ \scalebox{\std}{± 0.27}} & \makecell[c]{80.82\\ \scalebox{\std}{± 0.03}} & \makecell[c]{77.04\\ \scalebox{\std}{± 0.50}} & \makecell[c]{77.95\\ \scalebox{\std}{± 0.07}} & \makecell[c]{70.22\\ \scalebox{\std}{± 0.13}} & \makecell[c]{60.30\\ \scalebox{\std}{± 0.02}} & \makecell[c]{64.37\\ \scalebox{\std}{± 0.15}} & \makecell[c]{67.25\\ \scalebox{\std}{± 0.16}} & \makecell[c]{53.43\\ \scalebox{\std}{± 0.10}} & \makecell[c]{40.26\\ \scalebox{\std}{± 0.01}} & \makecell[c]{76.64\\ \scalebox{\std}{± 0.03}} & \makecell[c]{55.33\\ \scalebox{\std}{± 0.09}} & \makecell[c]{50.67\\ \scalebox{\std}{± 0.07}} & \makecell[c]{57.02\\ \scalebox{\std}{± 0.03}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{66.17\\ \scalebox{\std}{± 0.07}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{15.44\\ \scalebox{\std}{± 0.07}} & \makecell[c]{16.26\\ \scalebox{\std}{± 0.19}} & \makecell[c]{15.70\\ \scalebox{\std}{± 0.07}} & \makecell[c]{4.89\\ \scalebox{\std}{± 0.07}} & \makecell[c]{6.05\\ \scalebox{\std}{± 0.14}} & \makecell[c]{13.36\\ \scalebox{\std}{± 0.11}} & \makecell[c]{14.92\\ \scalebox{\std}{± 0.09}} & \makecell[c]{20.93\\ \scalebox{\std}{± 0.14}} & \makecell[c]{10.72\\ \scalebox{\std}{± 0.22}} & \makecell[c]{29.38\\ \scalebox{\std}{± 0.13}} & \makecell[c]{5.97\\ \scalebox{\std}{± 0.17}} & \makecell[c]{13.31\\ \scalebox{\std}{± 0.14}} & \makecell[c]{22.32\\ \scalebox{\std}{± 0.09}} & \makecell[c]{21.23\\ \scalebox{\std}{± 0.19}} & \makecell[c]{7.34\\ \scalebox{\std}{± 0.14}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{14.52\\ \scalebox{\std}{± 0.01}} \\
\multirow{-5}{*}{EATA~\cite{eata}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.28\\ \scalebox{\std}{± 0.12}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.72\\ \scalebox{\std}{± 0.05}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.28\\ \scalebox{\std}{± 0.07}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.45\\ \scalebox{\std}{± 0.09}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.12\\ \scalebox{\std}{± 0.10}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.37\\ \scalebox{\std}{± 0.07}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.23\\ \scalebox{\std}{± 0.18}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.60\\ \scalebox{\std}{± 0.13}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.73\\ \scalebox{\std}{± 0.13}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{10.46\\ \scalebox{\std}{± 0.12}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.22\\ \scalebox{\std}{± 0.09}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.38\\ \scalebox{\std}{± 0.09}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.25\\ \scalebox{\std}{± 0.16}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.47\\ \scalebox{\std}{± 0.10}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.68\\ \scalebox{\std}{± 0.05}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.48\\ \scalebox{\std}{± 0.02}} \\ \midrule
 & SrcValid & \makecell[c]{39.53\\ \scalebox{\std}{± 1.22}} & \makecell[c]{37.69\\ \scalebox{\std}{± 1.46}} & \makecell[c]{39.79\\ \scalebox{\std}{± 1.36}} & \makecell[c]{43.89\\ \scalebox{\std}{± 1.00}} & \makecell[c]{43.75\\ \scalebox{\std}{± 0.82}} & \makecell[c]{33.12\\ \scalebox{\std}{± 0.77}} & \makecell[c]{25.52\\ \scalebox{\std}{± 1.06}} & \makecell[c]{27.90\\ \scalebox{\std}{± 0.86}} & \makecell[c]{31.43\\ \scalebox{\std}{± 1.03}} & \makecell[c]{17.14\\ \scalebox{\std}{± 0.46}} & \makecell[c]{9.03\\ \scalebox{\std}{± 0.71}} & \makecell[c]{40.82\\ \scalebox{\std}{± 0.33}} & \makecell[c]{19.89\\ \scalebox{\std}{± 1.25}} & \makecell[c]{16.74\\ \scalebox{\std}{± 0.89}} & \makecell[c]{20.27\\ \scalebox{\std}{± 1.14}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{29.77\\ \scalebox{\std}{± 0.94}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{18.07\\ \scalebox{\std}{± 0.26}} & \makecell[c]{19.20\\ \scalebox{\std}{± 0.20}} & \makecell[c]{18.08\\ \scalebox{\std}{± 0.21}} & \makecell[c]{14.90\\ \scalebox{\std}{± 0.25}} & \makecell[c]{15.23\\ \scalebox{\std}{± 0.29}} & \makecell[c]{24.02\\ \scalebox{\std}{± 0.06}} & \makecell[c]{29.50\\ \scalebox{\std}{± 0.06}} & \makecell[c]{26.57\\ \scalebox{\std}{± 0.09}} & \makecell[c]{23.87\\ \scalebox{\std}{± 0.16}} & \makecell[c]{33.16\\ \scalebox{\std}{± 0.11}} & \makecell[c]{34.57\\ \scalebox{\std}{± 0.09}} & \makecell[c]{12.88\\ \scalebox{\std}{± 1.75}} & \makecell[c]{31.62\\ \scalebox{\std}{± 0.09}} & \makecell[c]{33.20\\ \scalebox{\std}{± 0.20}} & \makecell[c]{31.47\\ \scalebox{\std}{± 0.02}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{24.42\\ \scalebox{\std}{± 0.08}} \\
 & GDE~\cite{disagreement} & \makecell[c]{75.10\\ \scalebox{\std}{± 0.28}} & \makecell[c]{73.20\\ \scalebox{\std}{± 0.21}} & \makecell[c]{75.19\\ \scalebox{\std}{± 0.25}} & \makecell[c]{77.65\\ \scalebox{\std}{± 0.21}} & \makecell[c]{77.44\\ \scalebox{\std}{± 0.28}} & \makecell[c]{66.24\\ \scalebox{\std}{± 0.07}} & \makecell[c]{58.62\\ \scalebox{\std}{± 0.07}} & \makecell[c]{61.48\\ \scalebox{\std}{± 0.10}} & \makecell[c]{65.23\\ \scalebox{\std}{± 0.16}} & \makecell[c]{50.56\\ \scalebox{\std}{± 0.07}} & \makecell[c]{40.51\\ \scalebox{\std}{± 0.12}} & \makecell[c]{74.62\\ \scalebox{\std}{± 0.99}} & \makecell[c]{53.29\\ \scalebox{\std}{± 0.08}} & \makecell[c]{49.16\\ \scalebox{\std}{± 0.20}} & \makecell[c]{53.97\\ \scalebox{\std}{± 0.02}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{63.48\\ \scalebox{\std}{± 0.03}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{22.95\\ \scalebox{\std}{± 0.27}} & \makecell[c]{24.15\\ \scalebox{\std}{± 0.14}} & \makecell[c]{22.86\\ \scalebox{\std}{± 0.25}} & \makecell[c]{7.84\\ \scalebox{\std}{± 0.11}} & \makecell[c]{10.46\\ \scalebox{\std}{± 0.36}} & \makecell[c]{18.99\\ \scalebox{\std}{± 0.17}} & \makecell[c]{17.74\\ \scalebox{\std}{± 0.13}} & \makecell[c]{24.62\\ \scalebox{\std}{± 0.12}} & \makecell[c]{13.19\\ \scalebox{\std}{± 0.07}} & \makecell[c]{32.89\\ \scalebox{\std}{± 0.05}} & \makecell[c]{6.06\\ \scalebox{\std}{± 0.06}} & \makecell[c]{21.28\\ \scalebox{\std}{± 0.94}} & \makecell[c]{25.45\\ \scalebox{\std}{± 0.13}} & \makecell[c]{23.36\\ \scalebox{\std}{± 0.15}} & \makecell[c]{9.55\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{18.76\\ \scalebox{\std}{± 0.06}} \\
\multirow{-5}{*}{SAR~\cite{sar}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.38\\ \scalebox{\std}{± 0.27}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.24\\ \scalebox{\std}{± 0.16}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.11\\ \scalebox{\std}{± 0.15}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.06\\ \scalebox{\std}{± 0.04}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.19\\ \scalebox{\std}{± 0.27}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.53\\ \scalebox{\std}{± 0.08}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.17\\ \scalebox{\std}{± 0.14}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.97\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.71\\ \scalebox{\std}{± 0.04}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{9.73\\ \scalebox{\std}{± 0.16}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.68\\ \scalebox{\std}{± 0.10}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.39\\ \scalebox{\std}{± 0.33}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.76\\ \scalebox{\std}{± 0.17}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.66\\ \scalebox{\std}{± 0.19}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.89\\ \scalebox{\std}{± 0.03}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.43\\ \scalebox{\std}{± 0.09}} \\ \midrule
 & SrcValid & \makecell[c]{55.20\\ \scalebox{\std}{± 0.50}} & \makecell[c]{54.09\\ \scalebox{\std}{± 0.55}} & \makecell[c]{54.87\\ \scalebox{\std}{± 0.49}} & \makecell[c]{56.70\\ \scalebox{\std}{± 0.86}} & \makecell[c]{55.43\\ \scalebox{\std}{± 0.45}} & \makecell[c]{45.10\\ \scalebox{\std}{± 0.55}} & \makecell[c]{34.91\\ \scalebox{\std}{± 0.54}} & \makecell[c]{39.13\\ \scalebox{\std}{± 0.55}} & \makecell[c]{40.12\\ \scalebox{\std}{± 0.54}} & \makecell[c]{27.92\\ \scalebox{\std}{± 0.56}} & \makecell[c]{10.68\\ \scalebox{\std}{± 0.44}} & \makecell[c]{56.19\\ \scalebox{\std}{± 0.65}} & \makecell[c]{29.77\\ \scalebox{\std}{± 0.49}} & \makecell[c]{24.40\\ \scalebox{\std}{± 0.41}} & \makecell[c]{31.80\\ \scalebox{\std}{± 0.56}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{41.09\\ \scalebox{\std}{± 0.53}}
 \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{9.93\\ \scalebox{\std}{± 0.07}} & \makecell[c]{10.66\\ \scalebox{\std}{± 0.11}} & \makecell[c]{10.24\\ \scalebox{\std}{± 0.03}} & \makecell[c]{8.36\\ \scalebox{\std}{± 0.10}} & \makecell[c]{9.36\\ \scalebox{\std}{± 0.05}} & \makecell[c]{17.69\\ \scalebox{\std}{± 0.14}} & \makecell[c]{25.68\\ \scalebox{\std}{± 0.03}} & \makecell[c]{21.47\\ \scalebox{\std}{± 0.07}} & \makecell[c]{20.57\\ \scalebox{\std}{± 0.05}} & \makecell[c]{30.40\\ \scalebox{\std}{± 0.06}} & \makecell[c]{35.07\\ \scalebox{\std}{± 0.08}} & \makecell[c]{7.79\\ \scalebox{\std}{± 0.13}} & \makecell[c]{28.04\\ \scalebox{\std}{± 0.03}} & \makecell[c]{31.69\\ \scalebox{\std}{± 0.11}} & \makecell[c]{27.39\\ \scalebox{\std}{± 0.08}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{19.62\\ \scalebox{\std}{± 0.02}}
 \\
 & GDE~\cite{disagreement} & \makecell[c]{86.85\\ \scalebox{\std}{± 0.06}} & \makecell[c]{85.76\\ \scalebox{\std}{± 0.13}} & \makecell[c]{86.52\\ \scalebox{\std}{± 0.02}} & \makecell[c]{88.18\\ \scalebox{\std}{± 0.12}} & \makecell[c]{87.04\\ \scalebox{\std}{± 0.05}} & \makecell[c]{76.83\\ \scalebox{\std}{± 0.14}} & \makecell[c]{66.63\\ \scalebox{\std}{± 0.03}} & \makecell[c]{70.86\\ \scalebox{\std}{± 0.07}} & \makecell[c]{71.81\\ \scalebox{\std}{± 0.06}} & \makecell[c]{59.64\\ \scalebox{\std}{± 0.06}} & \makecell[c]{42.23\\ \scalebox{\std}{± 0.09}} & \makecell[c]{87.87\\ \scalebox{\std}{± 0.13}} & \makecell[c]{61.50\\ \scalebox{\std}{± 0.03}} & \makecell[c]{56.13\\ \scalebox{\std}{± 0.11}} & \makecell[c]{63.53\\ \scalebox{\std}{± 0.08}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{72.76\\ \scalebox{\std}{± 0.02}}
 \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{11.73\\ \scalebox{\std}{± 0.07}} & \makecell[c]{12.19\\ \scalebox{\std}{± 0.09}} & \makecell[c]{11.84\\ \scalebox{\std}{± 0.03}} & \makecell[c]{3.81\\ \scalebox{\std}{± 0.06}} & \makecell[c]{4.33\\ \scalebox{\std}{± 0.03}} & \makecell[c]{9.70\\ \scalebox{\std}{± 0.09}} & \makecell[c]{9.46\\ \scalebox{\std}{± 0.06}} & \makecell[c]{15.50\\ \scalebox{\std}{± 0.13}} & \makecell[c]{7.86\\ \scalebox{\std}{± 0.13}} & \makecell[c]{24.07\\ \scalebox{\std}{± 0.09}} & \makecell[c]{6.72\\ \scalebox{\std}{± 0.20}} & \makecell[c]{9.11\\ \scalebox{\std}{± 0.13}} & \makecell[c]{15.96\\ \scalebox{\std}{± 0.09}} & \makecell[c]{17.56\\ \scalebox{\std}{± 0.16}} & \makecell[c]{5.96\\ \scalebox{\std}{± 0.34}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{11.05\\ \scalebox{\std}{± 0.02}}
 \\
\multirow{-5}{*}{CoTTA~\cite{cotta}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.76\\ \scalebox{\std}{± 0.07}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.00\\ \scalebox{\std}{± 0.04}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.79\\ \scalebox{\std}{± 0.07}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.69\\ \scalebox{\std}{± 0.18}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{5.04\\ \scalebox{\std}{± 0.06}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.63\\ \scalebox{\std}{± 0.08}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.49\\ \scalebox{\std}{± 0.15}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.39\\ \scalebox{\std}{± 0.08}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{4.25\\ \scalebox{\std}{± 0.10}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.97\\ \scalebox{\std}{± 0.12}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.81\\ \scalebox{\std}{± 0.09}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.70\\ \scalebox{\std}{± 0.18}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.00\\ \scalebox{\std}{± 0.02}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.44\\ \scalebox{\std}{± 0.19}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.27\\ \scalebox{\std}{± 0.19}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{6.02\\ \scalebox{\std}{± 0.03}}
 \\ \midrule
 & SrcValid & \makecell[c]{15.32\\ \scalebox{\std}{± 0.10}} & \makecell[c]{13.73\\ \scalebox{\std}{± 0.68}} & \makecell[c]{15.71\\ \scalebox{\std}{± 0.14}} & \makecell[c]{4.79\\ \scalebox{\std}{± 0.54}} & \makecell[c]{17.18\\ \scalebox{\std}{± 1.34}} & \makecell[c]{5.81\\ \scalebox{\std}{± 0.55}} & \makecell[c]{9.04\\ \scalebox{\std}{± 1.50}} & \makecell[c]{8.44\\ \scalebox{\std}{± 0.80}} & \makecell[c]{5.54\\ \scalebox{\std}{± 0.33}} & \makecell[c]{8.84\\ \scalebox{\std}{± 1.21}} & \makecell[c]{10.68\\ \scalebox{\std}{± 0.77}} & \makecell[c]{14.01\\ \scalebox{\std}{± 0.55}} & \makecell[c]{7.27\\ \scalebox{\std}{± 0.34}} & \makecell[c]{5.59\\ \scalebox{\std}{± 0.27}} & \makecell[c]{12.32\\ \scalebox{\std}{± 0.91}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{10.28\\ \scalebox{\std}{± 0.28}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{11.98\\ \scalebox{\std}{± 0.07}} & \makecell[c]{12.84\\ \scalebox{\std}{± 0.28}} & \makecell[c]{12.31\\ \scalebox{\std}{± 0.13}} & \makecell[c]{9.96\\ \scalebox{\std}{± 0.18}} & \makecell[c]{10.93\\ \scalebox{\std}{± 0.12}} & \makecell[c]{19.51\\ \scalebox{\std}{± 0.09}} & \makecell[c]{27.15\\ \scalebox{\std}{± 0.18}} & \makecell[c]{23.28\\ \scalebox{\std}{± 0.02}} & \makecell[c]{20.61\\ \scalebox{\std}{± 0.11}} & \makecell[c]{31.75\\ \scalebox{\std}{± 0.10}} & \makecell[c]{33.58\\ \scalebox{\std}{± 0.05}} & \makecell[c]{11.05\\ \scalebox{\std}{± 0.14}} & \makecell[c]{29.18\\ \scalebox{\std}{± 0.04}} & \makecell[c]{32.08\\ \scalebox{\std}{± 0.09}} & \makecell[c]{29.22\\ \scalebox{\std}{± 0.10}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{21.03\\ \scalebox{\std}{± 0.04}} \\
 & GDE~\cite{disagreement} & \makecell[c]{80.31\\ \scalebox{\std}{± 0.05}} & \makecell[c]{79.07\\ \scalebox{\std}{± 0.33}} & \makecell[c]{80.18\\ \scalebox{\std}{± 0.21}} & \makecell[c]{82.25\\ \scalebox{\std}{± 0.26}} & \makecell[c]{81.56\\ \scalebox{\std}{± 0.12}} & \makecell[c]{70.30\\ \scalebox{\std}{± 0.10}} & \makecell[c]{60.45\\ \scalebox{\std}{± 0.17}} & \makecell[c]{64.30\\ \scalebox{\std}{± 0.07}} & \makecell[c]{67.01\\ \scalebox{\std}{± 0.25}} & \makecell[c]{52.64\\ \scalebox{\std}{± 0.18}} & \makecell[c]{38.45\\ \scalebox{\std}{± 0.03}} & \makecell[c]{77.50\\ \scalebox{\std}{± 0.21}} & \makecell[c]{55.15\\ \scalebox{\std}{± 0.07}} & \makecell[c]{50.24\\ \scalebox{\std}{± 0.01}} & \makecell[c]{56.48\\ \scalebox{\std}{± 0.20}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{66.39\\ \scalebox{\std}{± 0.04}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{13.96\\ \scalebox{\std}{± 0.07}} & \makecell[c]{14.55\\ \scalebox{\std}{± 0.30}} & \makecell[c]{14.24\\ \scalebox{\std}{± 0.13}} & \makecell[c]{4.68\\ \scalebox{\std}{± 0.09}} & \makecell[c]{5.18\\ \scalebox{\std}{± 0.05}} & \makecell[c]{11.83\\ \scalebox{\std}{± 0.05}} & \makecell[c]{12.11\\ \scalebox{\std}{± 0.12}} & \makecell[c]{17.54\\ \scalebox{\std}{± 0.06}} & \makecell[c]{7.91\\ \scalebox{\std}{± 0.12}} & \makecell[c]{27.54\\ \scalebox{\std}{± 0.11}} & \makecell[c]{6.71\\ \scalebox{\std}{± 0.08}} & \makecell[c]{12.27\\ \scalebox{\std}{± 0.11}} & \makecell[c]{18.72\\ \scalebox{\std}{± 0.11}} & \makecell[c]{19.66\\ \scalebox{\std}{± 0.02}} & \makecell[c]{7.09\\ \scalebox{\std}{± 0.19}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{12.93\\ \scalebox{\std}{± 0.04}} \\
\multirow{-5}{*}{RoTTA~\cite{rotta}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{14.33\\ \scalebox{\std}{± 0.14}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{14.97\\ \scalebox{\std}{± 0.22}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{14.49\\ \scalebox{\std}{± 0.10}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.62\\ \scalebox{\std}{± 0.48}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.53\\ \scalebox{\std}{± 0.33}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{14.33\\ \scalebox{\std}{± 0.27}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{13.02\\ \scalebox{\std}{± 0.16}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{11.06\\ \scalebox{\std}{± 0.16}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{12.36\\ \scalebox{\std}{± 0.13}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{17.18\\ \scalebox{\std}{± 0.13}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{16.82\\ \scalebox{\std}{± 0.13}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{8.27\\ \scalebox{\std}{± 0.27}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{15.24\\ \scalebox{\std}{± 0.05}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{32.03\\ \scalebox{\std}{± 0.07}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{22.07\\ \scalebox{\std}{± 0.11}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{14.82\\ \scalebox{\std}{± 0.01}} \\ \midrule
 & SrcValid & \makecell[c]{28.39\\ \scalebox{\std}{± 0.39}} & \makecell[c]{25.90\\ \scalebox{\std}{± 0.41}} & \makecell[c]{28.46\\ \scalebox{\std}{± 0.79}} & \makecell[c]{9.76\\ \scalebox{\std}{± 2.10}} & \makecell[c]{7.02\\ \scalebox{\std}{± 0.68}} & \makecell[c]{21.34\\ \scalebox{\std}{± 2.88}} & \makecell[c]{12.78\\ \scalebox{\std}{± 1.76}} & \makecell[c]{18.59\\ \scalebox{\std}{± 0.86}} & \makecell[c]{6.08\\ \scalebox{\std}{± 0.71}} & \makecell[c]{18.38\\ \scalebox{\std}{± 0.76}} & \makecell[c]{12.63\\ \scalebox{\std}{± 0.92}} & \makecell[c]{22.00\\ \scalebox{\std}{± 0.98}} & \makecell[c]{13.36\\ \scalebox{\std}{± 1.19}} & \makecell[c]{9.43\\ \scalebox{\std}{± 0.81}} & \makecell[c]{5.84\\ \scalebox{\std}{± 0.67}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{16.00\\ \scalebox{\std}{± 0.33}} \\
 & SoftmaxScore~\cite{softmaxscore} & \makecell[c]{19.01\\ \scalebox{\std}{± 0.14}} & \makecell[c]{20.58\\ \scalebox{\std}{± 0.17}} & \makecell[c]{19.52\\ \scalebox{\std}{± 0.27}} & \makecell[c]{16.57\\ \scalebox{\std}{± 0.42}} & \makecell[c]{17.93\\ \scalebox{\std}{± 0.07}} & \makecell[c]{24.41\\ \scalebox{\std}{± 0.09}} & \makecell[c]{27.95\\ \scalebox{\std}{± 0.15}} & \makecell[c]{26.76\\ \scalebox{\std}{± 0.28}} & \makecell[c]{23.75\\ \scalebox{\std}{± 0.09}} & \makecell[c]{30.23\\ \scalebox{\std}{± 0.10}} & \makecell[c]{30.71\\ \scalebox{\std}{± 0.14}} & \makecell[c]{6.66\\ \scalebox{\std}{± 0.36}} & \makecell[c]{30.04\\ \scalebox{\std}{± 0.14}} & \makecell[c]{30.52\\ \scalebox{\std}{± 0.09}} & \makecell[c]{29.35\\ \scalebox{\std}{± 0.21}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{23.60\\ \scalebox{\std}{± 0.07}} \\
 & GDE~\cite{disagreement} & \makecell[c]{62.46\\ \scalebox{\std}{± 0.16}} & \makecell[c]{60.09\\ \scalebox{\std}{± 0.20}} & \makecell[c]{62.04\\ \scalebox{\std}{± 0.18}} & \makecell[c]{64.31\\ \scalebox{\std}{± 0.25}} & \makecell[c]{63.47\\ \scalebox{\std}{± 0.16}} & \makecell[c]{53.92\\ \scalebox{\std}{± 0.34}} & \makecell[c]{48.60\\ \scalebox{\std}{± 0.35}} & \makecell[c]{50.02\\ \scalebox{\std}{± 0.29}} & \makecell[c]{54.63\\ \scalebox{\std}{± 0.18}} & \makecell[c]{42.15\\ \scalebox{\std}{± 0.15}} & \makecell[c]{35.24\\ \scalebox{\std}{± 0.17}} & \makecell[c]{64.78\\ \scalebox{\std}{± 0.07}} & \makecell[c]{43.55\\ \scalebox{\std}{± 0.18}} & \makecell[c]{40.82\\ \scalebox{\std}{± 0.15}} & \makecell[c]{45.04\\ \scalebox{\std}{± 0.07}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{52.74\\ \scalebox{\std}{± 0.02}} \\
 & AdvPerturb~\cite{advperturb} & \makecell[c]{27.73\\ \scalebox{\std}{± 0.10}} & \makecell[c]{29.55\\ \scalebox{\std}{± 0.21}} & \makecell[c]{28.38\\ \scalebox{\std}{± 0.31}} & \makecell[c]{12.10\\ \scalebox{\std}{± 0.29}} & \makecell[c]{17.45\\ \scalebox{\std}{± 0.17}} & \makecell[c]{24.33\\ \scalebox{\std}{± 0.18}} & \makecell[c]{23.11\\ \scalebox{\std}{± 0.23}} & \makecell[c]{30.31\\ \scalebox{\std}{± 0.26}} & \makecell[c]{17.63\\ \scalebox{\std}{± 0.08}} & \makecell[c]{36.11\\ \scalebox{\std}{± 0.08}} & \makecell[c]{5.52\\ \scalebox{\std}{± 0.12}} & \makecell[c]{21.22\\ \scalebox{\std}{± 0.32}} & \makecell[c]{31.03\\ \scalebox{\std}{± 0.08}} & \makecell[c]{26.34\\ \scalebox{\std}{± 0.10}} & \makecell[c]{12.61\\ \scalebox{\std}{± 0.06}} & \cellcolor[HTML]{EDEEFF}\makecell[c]{22.90\\ \scalebox{\std}{± 0.02}} \\
\multirow{-5}{*}{SoTTA~\cite{sotta}} & \cellcolor[HTML]{DCFFDC}\system{} & \cellcolor[HTML]{DCFFDC}\makecell[c]{17.92\\ \scalebox{\std}{± 0.25}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{18.73\\ \scalebox{\std}{± 0.95}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{16.32\\ \scalebox{\std}{± 2.09}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{14.69\\ \scalebox{\std}{± 1.33}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{7.64\\ \scalebox{\std}{± 0.29}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{18.94\\ \scalebox{\std}{± 0.59}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{17.21\\ \scalebox{\std}{± 0.54}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{16.92\\ \scalebox{\std}{± 0.51}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{14.49\\ \scalebox{\std}{± 0.37}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{20.84\\ \scalebox{\std}{± 0.02}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{18.54\\ \scalebox{\std}{± 0.44}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{12.70\\ \scalebox{\std}{± 1.06}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{20.46\\ \scalebox{\std}{± 0.39}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{25.44\\ \scalebox{\std}{± 0.42}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{20.13\\ \scalebox{\std}{± 0.17}} & \cellcolor[HTML]{DCFFDC}\makecell[c]{17.40\\ \scalebox{\std}{± 0.26}} \\
\Xhline{2\arrayrulewidth}
\end{tabularx}
\end{table*}
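The MAE values reported above measure how far the estimated accuracy deviates from the ground-truth accuracy of each test batch, averaged over the corrupted test stream. The snippet below is a minimal sketch of that metric, assuming per-batch estimated and true accuracies have already been collected; the function and array names are illustrative assumptions, not the exact evaluation code.

```python
import numpy as np

def mae_of_accuracy_estimation(estimated_acc, true_acc):
    """Mean absolute error (in %) between estimated and ground-truth accuracies.

    Both inputs are 1-D sequences of per-batch accuracies in [0, 1].
    """
    estimated_acc = np.asarray(estimated_acc, dtype=float)
    true_acc = np.asarray(true_acc, dtype=float)
    return float(np.mean(np.abs(estimated_acc - true_acc)) * 100.0)

# Toy example: estimates off by 5, 2, and 8 percentage points -> MAE of 5.0%.
print(mae_of_accuracy_estimation([0.80, 0.62, 0.48], [0.75, 0.60, 0.40]))
```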

Figure 8: Mean absolute error (MAE) (%) of the accuracy estimation on continual CIFAR10-C. Averaged over three different random seeds.

Figure 9: Mean absolute error (MAE) (%) of the accuracy estimation on continual CIFAR100-C. Averaged over three different random seeds.

Figure 10: Mean absolute error (MAE) (%) of the accuracy estimation on continual ImageNet-C. Averaged over three different random seeds.

Figure 11: Average accuracy improvement (%p) with model recovery. Averaged over three different random seeds.
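Figure 11 summarizes a recovery case study in which the accuracy estimate triggers a rollback of the adapted model. The sketch below shows one way such a trigger could be wired up, assuming a simple rule that restores the source weights when the estimate drops far below its running best; the threshold and full-reset policy are illustrative assumptions, not the exact recovery procedure behind Figure 11.

```python
import copy
import torch

class AccuracyBasedRecovery:
    """Illustrative recovery rule driven by label-free accuracy estimates."""

    def __init__(self, source_model: torch.nn.Module, drop_ratio: float = 0.5):
        # Snapshot of the source (pre-adaptation) weights to roll back to.
        self.source_state = copy.deepcopy(source_model.state_dict())
        self.drop_ratio = drop_ratio  # assumed threshold, not taken from the paper
        self.best_estimate = 0.0

    def maybe_recover(self, model: torch.nn.Module, estimated_acc: float) -> bool:
        """Call once per test batch with the accuracy estimate in [0, 1]."""
        self.best_estimate = max(self.best_estimate, estimated_acc)
        if estimated_acc < self.drop_ratio * self.best_estimate:
            model.load_state_dict(self.source_state)  # roll back the adapted model
            self.best_estimate = estimated_acc
            return True
        return False
```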

References↩︎

[1]
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[2]
O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241.
[3]
I. Goodfellow et al., “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020, doi: 10.1145/3422622.
[4]
A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[5]
T. Brown et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[6]
T. Gong, J. Jeong, T. Kim, Y. Kim, J. Shin, and S.-J. Lee, “NOTE: Robust continual test-time adaptation against temporal correlation,” in Advances in Neural Information Processing Systems (NeurIPS), 2022. [Online]. Available: https://openreview.net/forum?id=E9HNxrCFZPV
[7]
S. Niu et al., “Towards stable test-time adaptation in dynamic wild world,” in International Conference on Learning Representations (ICLR), 2023. [Online]. Available: https://openreview.net/forum?id=g2YraF75Tj
[8]
M. Boudiaf, R. Mueller, I. Ben Ayed, and L. Bertinetto, “Parameter-free online test-time adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 8344–8353.
[9]
Q. Wang, O. Fink, L. Van Gool, and D. Dai, “Continual test-time domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 7201–7211.
[10]
L. Yuan, B. Xie, and S. Li, “Robust test-time adaptation in dynamic scenarios,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 15922–15932.
[11]
D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,” in International Conference on Learning Representations (ICLR), 2021. [Online]. Available: https://openreview.net/forum?id=uXl3bZLkr3c
[12]
T. Gong, Y. Kim, T. Lee, S. Chottananurak, and S.-J. Lee, “SoTTA: Robust test-time adaptation on noisy data streams,” in Advances in Neural Information Processing Systems (NeurIPS), 2023. [Online]. Available: https://openreview.net/forum?id=3bdXag2rUd
[13]
S. Niu et al., “Efficient test-time model adaptation without forgetting,” in International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 162, 2022, pp. 16888–16905. [Online]. Available: https://proceedings.mlr.press/v162/niu22a.html
[14]
O. Press, S. Schneider, M. Kümmerer, and M. Bethge, “RDumb: A simple approach that questions our progress in continual test-time adaptation,” in Advances in Neural Information Processing Systems (NeurIPS), 2023. [Online]. Available: http://arxiv.org/abs/2306.05401
[15]
C. Baek, Y. Jiang, A. Raghunathan, and J. Z. Kolter, “Agreement-on-the-line: Predicting the performance of neural networks under distribution shift,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 19274–19289. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/7a8d388b7a17df480856dff1cc079b08-Paper-Conference.pdf
[16]
J. P. Miller et al., “Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization,” in International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 139, 2021, pp. 7721–7735. [Online]. Available: https://proceedings.mlr.press/v139/miller21b.html
[17]
J. Chen, F. Liu, B. Avci, X. Wu, Y. Liang, and S. Jha, “Detecting errors and estimating accuracy on unlabeled data with self-training ensembles,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 14980–14992. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2021/file/7dd3ed2e12d7967b656d156d50308263-Paper.pdf
[18]
D. Guillory, V. Shankar, S. Ebrahimi, T. Darrell, and L. Schmidt, “Predicting with confidence on unseen distributions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1134–1144.
[19]
Y. Jiang, V. Nagarajan, C. Baek, and J. Z. Kolter, “Assessing generalization of SGD via disagreement,” in International Conference on Learning Representations (ICLR), 2022.
[20]
D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” in International Conference on Learning Representations (ICLR), 2019. [Online]. Available: https://openreview.net/forum?id=HJz6tiCqYm
[21]
J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. MIT Press, 2008.
[22]
Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 48, 2016, pp. 1050–1059. [Online]. Available: https://proceedings.mlr.press/v48/gal16.html
[23]
W. Liu, X. Wang, J. Owens, and Y. Li, “Energy-based out-of-distribution detection,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 21464–21475. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/f5496252609c43eb8a3d147ab9b9c006-Paper.pdf
[24]
D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in International Conference on Learning Representations (ICLR), 2017. [Online]. Available: https://openreview.net/forum?id=Hkg4TI9xl
[25]
H. Elsahar and M. Gallé, “To annotate or not? Predicting performance drop under domain shift,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2163–2173.
[26]
J. Lee, J. O. Woo, H. Moon, and K. Lee, “Unsupervised accuracy estimation of deep visual models using domain-adaptive adversarial perturbation without source samples,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 16443–16452.
[27]
P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” in International Conference on Learning Representations (ICLR), 2021. [Online]. Available: https://openreview.net/forum?id=6Tm1mposlrM
[28]
C.-Y. Chuang, A. Torralba, and S. Jegelka, “Estimating generalization under distribution shifts via domain-invariant representations,” in International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 119, 2020, pp. 1984–1994. [Online]. Available: https://proceedings.mlr.press/v119/chuang20a.html
[29]
C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 70, 2017, pp. 1321–1330. [Online]. Available: https://proceedings.mlr.press/v70/guo17a.html
[30]
S. Garg, S. Balakrishnan, Z. C. Lipton, B. Neyshabur, and H. Sedghi, “Leveraging unlabeled data to predict out-of-distribution performance,” in International Conference on Learning Representations (ICLR), 2022. [Online]. Available: https://openreview.net/forum?id=o_HsiMPYh_x
[31]
M. Zhang, S. Levine, and C. Finn, “MEMO: Test time robustness via adaptation and augmentation,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 38629–38642. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/fc28053a08f59fccb48b11f2e31e81c7-Paper-Conference.pdf
[32]
D. Brahma and P. Rai, “A probabilistic framework for lifelong test-time adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 3582–3591.
[33]
J. Hong, L. Lyu, J. Zhou, and M. Spranger, “MECTA: Memory-economic continual test-time model adaptation,” in International Conference on Learning Representations (ICLR), 2023. [Online]. Available: https://openreview.net/forum?id=N92hjSf5NNh
[34]
I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
[35]
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
[36]
I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in International Conference on Learning Representations (ICLR), 2017.
[37]
TorchVision maintainers and contributors, “TorchVision: PyTorch’s computer vision library,” GitHub repository, 2016. [Online]. Available: https://github.com/pytorch/vision

  1. We rename the term from class-wise calibration [19] to clearly state the purpose of the calibration.↩︎

  2. Using the probabilistic property of the expectation over dropout inferences [22], we approximate \(\tilde{h}(X)\) as \(\mathbb{E}_{{\mathcal{H}}_{\mathcal{A}}} [ \mathbb{E}_{{\tt dropout}} [h(X; \Theta^{\tt dropout})] ]\); a code sketch of this Monte Carlo approximation is given after these notes.↩︎

  3. https://github.com/DequanWang/tent↩︎

  4. https://github.com/mr-eggplant/EATA↩︎

  5. https://github.com/mr-eggplant/SAR↩︎

  6. https://github.com/qinenergy/cotta↩︎

  7. https://github.com/BIT-DA/RoTTA↩︎

  8. https://github.com/taeckyung/sotta↩︎
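
Footnote 2 relies on averaging dropout inferences. The snippet below sketches that Monte Carlo approximation for a single test batch, keeping dropout layers active while normalization layers stay in evaluation mode [22]. The number of passes and the presence of standard torch.nn.Dropout layers are illustrative assumptions, not the exact configuration used in the experiments.

```python
import torch

@torch.no_grad()
def mc_dropout_mean_softmax(model: torch.nn.Module, x: torch.Tensor, n_passes: int = 10) -> torch.Tensor:
    """Approximate the expectation over dropout inferences by averaging softmax
    outputs across n_passes stochastic forward passes with dropout kept active."""
    was_training = model.training
    # Keep normalization layers in eval mode; re-enable only the dropout layers.
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_passes)])
    model.train(was_training)
    return probs.mean(dim=0)  # shape: (batch_size, num_classes)
```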