WATCH: Weighted Adaptive Testing for Changepoint Hypotheses via Weighted-Conformal Martingales


Abstract

Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but moreover continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Statistical methods for nonparametric change-point detection—especially the tools of conformal test martingales (CTMs) and anytime-valid inference—offer promising approaches to this monitoring task. However, existing methods are restricted to monitoring limited hypothesis classes or “alarm criteria,” such as data shifts that violate certain exchangeability assumptions, or do not allow for online adaptation in response to shifts. In this paper, we expand the scope of these monitoring methods by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected changepoints in the data distribution while controlling false alarms. For practical applications, we propose specific WCTM algorithms that accommodate online adaptation to mild covariate shifts (in the marginal input distribution) while raising alarms in response to more severe shifts, such as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.

1 Introduction↩︎

As AI/ML systems become integral to real-world applications, ensuring their safety and utility under evolving conditions is essential for responsible deployment. However, even meticulously trained models with apparent reliability guarantees can fail abruptly when shifts in the data distribution or operational environment violate the underlying conditions they were designed for [1]. That is, unforeseen deployment conditions can make it impossible to guarantee reliability for all situations in advance. Consequently, there is growing recognition of the need to continuously monitor deployed AI systems to determine when model updates are required to mitigate downstream harm [2][5]. In this work, we moreover argue that such monitoring methods should ideally perform at least three key functions: (1) maintain end-user reliability and minimize unnecessary alarms by adapting online to mild or benign data shifts; (2) rapidly detect more extreme or harmful shifts that necessitate updates; and (3) identify the root-cause of degradation to inform appropriate recovery.

For example, consider a healthcare use case where the goal is to predict the risk of sepsis, a life-threatening infection, \(Y\) from electronic health record inputs \(X\) (e.g., vital signs, lab tests, medical history, demographics) using an AI/ML system (e.g., [6]). Various clinical data shifts pose challenges to monitoring in practice [7]. Figure 1 illustrates hypothetical synthetic-data shifts that can be interpreted through this sepsis example, where for simplicity \(X\) only represents a patient’s age. An example of a benign shift is a mild shift in patient demographics primarily toward young adults; Figure 1a shows a corresponding covariate shift in the marginal \(X\) distribution. Such a mild shift to a younger population would be benign, as younger individuals are well-represented in the training data and also tend to have lower, less variable (easier to predict) sepsis risk—so, there is no need for an alarm. On the other hand, examples of harmful shifts include the AI tool being deployed with a much older population than observed in the training data (e.g., Figure 1b), or a new microbial strain arising that is especially severe in children (e.g., Figure 1c). In any such harmful-shift case, swift detection and root-cause analysis are essential to initiate and inform retraining, and ultimately to minimize harm.


Figure 1: Each column represents a data shift scenario: the top row is a simulated shift example and the bottom row shows WATCH’s response, averaged over 20 random seeds. WATCH raises an alarm to retrain the AI/ML once the WCTM (blue) exceeds its alarm threshold; meanwhile, an \(X\)-CTM (gray)—a standard CTM that only depends on inputs \(X\), and thus only detects covariate shifts—dynamically initiates the WCTM’s adaptation phase and aids in root-cause analysis. In (a), the \(X\)-CTM starts the WCTM’s adaptation phase, which allows the WCTM to avoid raising an unnecessary alarm. In (b), the extreme covariate shift causes the WCTM to raise an alarm, indicating that the covariate shift is too severe to be adapted to. In (c), the illustrated concept shift causes WATCH to raise an alarm, but without the \(X\)-CTM detecting a shift in covariates \(X\)—this allows WATCH to diagnose the root-cause of the alarm as a concept shift in \(Y\mid X\).

Recent advances in anytime-valid inference [8] and especially conformal test martingales (CTMs) [9], [10] offer promising tools for AI monitoring with sequential, nonparametric guarantees. However, existing CTM monitoring methods (e.g., [2]) all rely on some form of exchangeability (e.g., IID) assumption in their null hypotheses—informally, meaning that the data distribution is the same across time or data batches—and as a result, standard CTMs can raise unnecessary alarms even when a shift is mild or benign (e.g., Figure 1a). Meanwhile, existing comparable monitoring methods for directly tracking the risk of a deployed AI (e.g., [3]) tend to be less efficient than CTMs regarding their computational complexity, data usage, and/or speed in detecting harmful shifts (Sec. 4.3).

Our paper’s contributions can be summarized as follows:

  • Our main theoretical contribution is to propose weighted-conformal test martingales (WCTMs), constructed from sequences of weighted-conformal \(p\)-values, which generalize their standard conformal precursors. WCTMs lay a theoretical foundation for sequential and continual testing of a broad class of null hypotheses beyond exchangeability, such as shifts that one aims to model and adapt to.

  • For practical applications, we propose WATCH: Weighted Adaptive Testing for Changepoint Hypotheses, a framework for AI monitoring using WCTMs. WATCH continuously adapts to mild or benign distribution shifts (e.g., Figure 1a) to maintain end-user safety and utility (and avoid unnecessary alarms), while quickly detecting harmful shifts (e.g., Figure 1b & c) and enabling root-cause analysis.

2 Background↩︎

Notation: Assume an initial dataset \(Z_{1:n}:= \{Z_i\}_{i=1}^n\), where each datapoint \(Z_i := (X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}=\mathcal{Z}\) is a real-valued (\(\mathcal{X} \times \mathcal{Y} \subseteq \mathbb{R}^d \times \mathbb{R}\)) feature-label pair. For simplicity, further assume that an AI/ML model \(\widehat{\mu}\) is pretrained on a separate dataset.1 After deploying the AI/ML model, the test points are observed sequentially at each time \(t=1, ..., T\) (batch data can be given random ordering), though for simpler exposition, we initially focus on \(t=1\). We abbreviate indices \([m]:=\{1, ..., m\}\). Random variables are denoted with capital letters (e.g., \(Z_i\)) and observed values with lowercase (e.g., \(z_i\)). For \(Z := Z_{1:(n+t)}\), we let \(F_Z\) denote the joint distribution function, \(f_Z\) the joint density function, and boldface \(\mathbf{F_Z}\) a set of distributions.

2.1 Conformal Prediction and Conformal \(p\)-Values↩︎

Standard conformal prediction (CP) [12] is an approach to predictive inference: the task of converting a black-box AI/ML prediction, \(\widehat{\mu}(X_{n+1})\), into a predictive confidence interval/set, \(\widehat{C}_{[n], \alpha}(X_{n+1})\), that should contain the true label with a user-specified rate, \(1-\alpha \in (0, 1)\) (e.g., 90%). This objective is called valid (marginal) coverage:2 \[\begin{align} \mathbb{P}\big\{Y_{n+1} \in \widehat{C}_{[n], \alpha}(X_{n+1})\big\} \geq 1 - \alpha. \label{eq:coverage95condition} \end{align}\tag{1}\] Constructing the CP set, \(\widehat{C}_{[n], \alpha}(X_{n+1})\), requires a labeled calibration dataset,3 \(Z_{1:n}\), and a “nonconformity” score function, \(\mathcal{\widehat{S}}: \mathcal{X}\times\mathcal{Y} \rightarrow \mathbb{R}\), which generally uses the prefit ML predictor to quantify how “strange” a point \((x, y)\) is relative to the training data. A common example is the absolute-value residual score, \(\mathcal{\widehat{S}}(x, y) = |y - \widehat{\mu}(x)|\).

Though not always framed as such, the CP set \(\widehat{C}_{[n], \alpha}(X_{n+1})\) can be computed via a conformal p-value, which places the “strangeness” of the test point’s score in context of the calibration-data scores. That is, with \(v_i := \widehat{\mathcal{S}}(x_{i}, y_{i})\) for all \(i\in [n+1]\), [12] defines the conformal \(p\)-value for test point \(Z_{n+1}\) relative to calibration data \(Z_{1:n}\) as the fraction of the \(n+1\) scores, \(v_{1:(n+1)}\), that are at least as large as \(v_{n+1}\), with ties broken uniformly at random: \[\begin{align} p_{n+1} := \frac{|\{i : v_i > v_{n+1}\}| + u_{n+1}|\{i : v_i = v_{n+1}\}|}{n+1},\label{eq:unweighted95p95vals} \end{align}\tag{2}\] where \(U_{n+1}\stackrel{iid}{\sim} \text{Unif}[0, 1]\); that is, \(u_{n+1}\) is obtained from an independent standard uniform distribution.4 The CP set \(\widehat{C}_{[n], \alpha}(X_{n+1})\) can then be understood as the subset of labels \(y\in \mathcal{Y}\) that would not result in too “extreme” of a test-point \(p\)-value, \(p_{n+1}(X_{n+1}, y)\) (this is explicit notation for \(p_{n+1}\) to emphasize its dependence on the candidate label \(y\)): \[\begin{align} \label{eq:cp95set95def95via95p95val} & \widehat{C}_{[n], \alpha}(X_{n+1}) := \big\{y \in \mathcal{Y} : p_{n+1}(X_{n+1}, y) > \alpha \big\}. \end{align}\tag{3}\]
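To make Eq. 2 concrete, below is a minimal sketch of computing a randomized conformal \(p\)-value in Python/NumPy (the function name and interface are our own illustrative choices, not from the paper’s codebase):

```python
import numpy as np

def conformal_p_value(cal_scores, test_score, rng=None):
    """Randomized conformal p-value (Eq. 2): the fraction of the n+1 scores
    at least as large as the test score, with ties broken uniformly."""
    rng = rng if rng is not None else np.random.default_rng()
    v = np.append(np.asarray(cal_scores, dtype=float), test_score)  # v_1, ..., v_{n+1}
    n_greater = np.sum(v > test_score)   # |{i : v_i > v_{n+1}}|
    n_equal = np.sum(v == test_score)    # includes the test point itself
    u = rng.uniform()                    # independent u_{n+1} ~ Unif[0, 1]
    return (n_greater + u * n_equal) / len(v)
```

Under exchangeability, this value is uniformly distributed on [0, 1], which is exactly the validity property formalized in the next subsection.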

2.2 Exchangeability Underlies Standard CP Validity↩︎

For the standard CP set in Eq. 3 , the coverage guarantee in Eq. 1 holds assuming that the calibration data \(Z_{1:n}\) and test point \(Z_{n+1}\) are all exchangeable, meaning that any ordering of the observations is equally likely—independent and identically distributed (IID) data are a prime example. Formally, exchangeability means that the joint distribution function, \(F_Z\), is invariant to permutations \(\sigma\): that is, \(F_Z(z_{\sigma(1)}, ..., z_{\sigma(n+1)}) = F_Z(z_{1}, ..., z_{n+1})\) for all \(z_1, ..., z_{n+1}\in \mathcal{Z}\) and all \(\sigma\). With \(\mathbf{F_Z^{\boldsymbol{ex}}}\) denoting the set of exchangeable joint distributions, we can write the assumption of exchangeability as a (nonparametric) null hypothesis: \[\begin{align} \mathcal{H}_0^{\text{ex}} \;: \;Z_1, Z_2, ..., Z_{n+1} \sim F_Z \;: \;F_Z \in \mathbf{F_Z^{\boldsymbol{ex}}}. \label{eq:H95095exchangeability} \end{align}\tag{4}\] Standard conformal \(p\)-values are valid or “bona fide” \(p\)-values, in the usual statistical testing sense, for the null hypothesis of exchangeability [12]. That is, assuming \(\mathcal{H}_0^{\text{ex}}\), then \(P_{n+1}\stackrel{iid}{\sim}\text{Unif}[0,1]\) [10] and \[\begin{align} \label{eq:cp95p95val95validity} \mathbb{P}_{\mathcal{H}_0^{\text{ex}}}(P_{n+1}\leq \alpha) \leq \alpha \qquad \forall \;\alpha \in (0, 1). \end{align}\tag{5}\] Thus, observing an extreme value of \(p_{n+1}\) is evidence against \(\mathcal{H}_0^{\text{ex}}\) (Eq. 4 ), or evidence for a distribution shift.
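As a quick empirical sanity check of Eq. 5, one can simulate exchangeable scores and verify that the resulting conformal \(p\)-values behave uniformly (this snippet assumes the illustrative `conformal_p_value` helper sketched above):

```python
import numpy as np

rng = np.random.default_rng(0)
p_vals = []
for _ in range(5000):
    scores = rng.normal(size=101)   # IID scores, hence exchangeable
    p_vals.append(conformal_p_value(scores[:-1], scores[-1], rng))
p_vals = np.array(p_vals)
print(np.mean(p_vals <= 0.1))       # ~0.10, consistent with Eq. 5 at alpha = 0.1
```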

2.3 Standard Conformal Test Martingales: Testing Exchangeability Online via Betting↩︎

Standard conformal test martingales (CTMs) [2], [9], [10] continually aggregate information from a sequence of conformal \(p\)-values (Eq. 2 ) to perform online testing of the exchangeability null, \(\mathcal{H}_0^{\text{ex}}\). They can be constructed on top of any standard CP method, and thus on top of any conformalized AI/ML model, and used to monitor for deviations from \(\mathcal{H}_0^{\text{ex}}\). We will describe CTMs from a game-theoretic “testing-by-betting” interpretation [14], [15].

Under \(\mathcal{H}_0^{\text{ex}}\), a stochastic process \(M_0, M_1, ..., M_t, ...\) is a CTM if it is a nonnegative martingale—i.e., \(M_t\geq 0\) for all \(t\) and \(\mathbb{E}_{\mathcal{H}_0^{\text{ex}}}[M_t \mid M_0, ..., M_{t-1}] = M_{t-1}\)—constructed from a corresponding sequence of conformal \(p\)-values \(p_1, ..., p_t, ...\) via an appropriately defined “betting process.” That is, in the game-theoretic interpretation, a bettor has initial wealth \(M_0\), and at each time \(t=0, 1, ...\), she may use wealth \(M_t\) to bet on the value of \(p_{t+1}\) to be observed next; once \(p_{t+1}\) is revealed, she receives her reward and/or pays her losses, resulting in a new total wealth \(M_{t+1}\). The bettor bets against the null hypothesis \(\mathcal{H}_0^{\text{ex}}\): if \(\mathcal{H}_0^{\text{ex}}\) is true, then \(P_t \stackrel{iid}{\sim} \text{Unif}[0,1]\) and she cannot expect to outperform random guessing; so, if the bettor grows her wealth by a large factor \(M_t/M_0>1\), this is evidence that the bettor “knows” an alternative hypothesis more accurate than \(\mathcal{H}_0^{\text{ex}}\) [10]. \(M_t/M_0\) can be taken as “anytime-valid” evidence against \(\mathcal{H}_0^{\text{ex}}\)—in other terms, \(M_t/M_0\) is an “\(e\)-value” for \(\mathcal{H}_0^{\text{ex}}\) at all \(t\) [8], [16].

More formally, define a betting function as a function \(g:[0, 1]\rightarrow [0, \infty]\) that integrates to one, i.e., \(\int_0^1g(u)\text{d}u=1\). We follow [10] and assume the betting function \[\begin{align} g_{\epsilon}(p) := 1 + \epsilon (p - 0.5), \label{eq:betting95function} \end{align}\tag{6}\] where \(\epsilon\in \mathbf{E}:= \{-1, 0, 1\}\) (\(\mathbf{E}\) is selected somewhat arbitrarily). The intuition is that \(\epsilon<0\) corresponds to betting on smaller \(p\)-values, \(\epsilon>0\) corresponds to betting on larger \(p\)-values, and \(\epsilon=0\) represents not betting. A CTM can be constructed by selecting a betting strategy \(g_{\epsilon_t}\) at each time \(t\) that may depend only on the past \(p\)-value observations \(p_1, ..., p_{t-1}\); this paper uses the “composite jumper martingale” strategy described in [12]. Then, a conformal (test) martingale \(M_t : [0, 1]^* \rightarrow [0, \infty], t=0, 1, ...\), can be constructed by accumulating the “wealth” attained by the previous bets on the conformal \(p\)-values, \(p_1, p_2, ...\), \[\begin{align} M_t := \int \Big(\prod_{i=1}^tg_{\epsilon_i}(p_i)\Big)\nu\big(\text{d}(\epsilon_0, \epsilon_1, ...)\big), \;\forall \;t, \label{eq:ctm95wealth95def} \end{align}\tag{7}\] where \(\nu\) is the composite jumper Markov chain described by [12] to determine how the “wealth” values are spread across the betting space \(\mathbf{E}\) at each timepoint.
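For illustration, the following sketch tracks a CTM’s wealth using the betting function in Eq. 6 with the Simple Jumper strategy, a simplified relative of the composite jumper mixture in [12]; the jump rate and initialization here are our own illustrative choices:

```python
import numpy as np

EPSILONS = (-1.0, 0.0, 1.0)  # the betting space E from Eq. 6

def simple_jumper_martingale(p_values, jump=0.01):
    """Wealth path M_1, M_2, ... of a conformal test martingale.
    At each step, a fraction `jump` of total capital is redistributed
    evenly over E, then each epsilon-account is settled against the new
    p-value with g_eps(p) = 1 + eps * (p - 0.5)  (Eq. 6)."""
    capital = {e: 1.0 / len(EPSILONS) for e in EPSILONS}  # M_0 = 1, split evenly
    wealth_path = []
    for p in p_values:
        total = sum(capital.values())
        for e in EPSILONS:   # "jump": mix capital across betting choices
            capital[e] = (1.0 - jump) * capital[e] + jump * total / len(EPSILONS)
        for e in EPSILONS:   # settle each bet; E[g_eps(P)] = 1 under the null
            capital[e] *= 1.0 + e * (p - 0.5)
        wealth_path.append(sum(capital.values()))
    return np.array(wealth_path)
```

Because each \(g_{\epsilon}\) integrates to one over [0, 1] and is nonnegative, the total wealth is a nonnegative martingale under \(\mathcal{H}_0^{\text{ex}}\), so thresholding it inherits the Ville-type false-alarm guarantee discussed next (Eq. 8 ).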

A standard CTM (Eq. 7 ) can be used to test the exchangeability null \(\mathcal{H}_0^{\text{ex}}\) (Eq. 4 ) online either by using \(M_t/M_0\) as an anytime-valid evidence metric, or by raising an alarm when \(M_t/M_0\) exceeds some user-defined threshold \(c\). That is, by Ville’s inequality for martingales [17], CTMs achieve anytime-valid control over the false alarm rate (i.e., over the probability of an alarm despite \(\mathcal{H}_0^{\text{ex}}\) being true), \[\begin{align} \mathbb{P}_{\mathcal{H}_0^{\text{ex}}}\big(\exists \;t : M_t / M_0 \geq c\big) \leq 1/c, \label{eq:ctm95anytime95valid95guarantee} \end{align}\tag{8}\] which is sometimes referred to as strong validity due to its control over ever raising a false alarm [10].

2.4 Weighted Conformal Prediction for Adapting to Distribution Shifts↩︎

Whereas Standard CP computes valid predictive confidence sets assuming exchangeable data, weighted conformal prediction (WCP) (e.g., [18][22]) generalizes Standard CP to attain valid coverage (Eq. 1 ) even under various distribution shifts. Weighted CP methods are thus an approach to adapting to distribution shift: they compute CP sets on a reweighted version of the empirical calibration scores, which in effect modulates the size of the prediction sets to maintain coverage.

As we will see in the next section, weighted CP methods are associated with weighted-conformal \(p\)-values, a special case of which was introduced in [23] for standard covariate shifts, based on [18]. Most recently, [22] also leverage weighted-conformal \(p\)-values to unify various theories of conformal prediction. However, there are several key differences from our contributions. Regarding motivation, that work discusses connections to (one-shot) hypothesis testing as a means of framing conformal prediction, while in our paper (sequential) hypothesis testing is the primary focus. The conformal \(p\)-values in that paper are deterministic and conservative, attaining conservative validity, whereas we incorporate randomness to achieve exact validity. Lastly, whereas that paper primarily focuses on a batch-data setting where all the data are observed simultaneously, we focus on an online monitoring setting where each \(p\)-value is determined only by the current and past data, and we importantly present new theory for when a sequence of WCP \(p\)-values is distributed uniformly and independently.

3 Theory and Methods: Weighted-Conformal \(p\)-Values and Test Martingales↩︎

Our main theoretical and methodological contribution is to introduce weighted-conformal test martingales (WCTMs), constructed from a generalized version of weighted-conformal \(p\)-values, which expand the scope of their standard conformal analogs. WCTMs enable more customizable and informative alarm criteria than standard CTMs, and our specific variants adapt to mild shifts while, in response to severe shifts, raising alarms and enabling root-cause analysis.

3.1 Generalized Weighted-Conformal \(p\)-Values↩︎

In this section we present a general version of weighted-conformal \(p\)-values.5 Sequences of these generalized weighted-conformal \(p\)-values will lay a theoretical foundation for online testing of a broad range of null hypotheses beyond exchangeability. To begin, observe that standard conformal \(p\)-values (Eq. 2 ) can be equivalently written as \[\begin{align} p_{n+1} = \sum_{i=1}^{n+1}\frac{1}{n+1}\Big[\mathbb{1}\{v_i > v_{n+1}\} + u_{n+1}\mathbb{1}\{v_i = v_{n+1}\}\Big], \end{align}\] where the purpose of this notation is to isolate where the comparison of \(v_{n+1}\) to each \(i\)-th score is given uniform weight \(\tfrac{1}{n+1}\). For some arbitrary weight vector \(\tilde{w} = (\tilde{w}_1, ..., \tilde{w}_{n+1})\in [0,1]^{n+1}\) where \(\sum_{i=1}^{n+1}\tilde{w}_i=1\), it is then straightforward to define weighted-conformal \(p\)-values as \[\begin{align} p_{n+1}^{\tilde{w}} = \sum_{i=1}^{n+1}\tilde{w}_i\Big[\mathbb{1}\{v_i > v_{n+1}\} + u_{n+1}\mathbb{1}\{v_i = v_{n+1}\}\Big]. \label{eq:weighted95p95vals95arbitrary95def} \end{align}\tag{9}\] Note that existing (split or full) weighted CP methods (e.g., [18], [19], [21] and certain methods in [20]) can also be defined via weighted-conformal \(p\)-values by plugging \(p_{n+1}^{\tilde{w}}(X_{n+1}, y)\) in for \(p_{n+1}(X_{n+1}, y)\) in Eq. 3 , where \(\tilde{w}\) is an appropriately chosen weight vector.6
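A minimal sketch of Eq. 9 follows (again with illustrative names of our own); with uniform weights \(\tilde{w}_i = 1/(n+1)\) it reduces to the standard conformal \(p\)-value sketched earlier:

```python
import numpy as np

def weighted_conformal_p_value(scores, weights, rng=None):
    """Weighted conformal p-value (Eq. 9). `scores` holds v_1, ..., v_{n+1}
    with the test score last; `weights` holds w~_1, ..., w~_{n+1} and must
    sum to one. Ties are broken by a single independent uniform draw."""
    rng = rng if rng is not None else np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    v_test = scores[-1]
    u = rng.uniform()
    return float(np.sum(weights * ((scores > v_test) + u * (scores == v_test))))
```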

To understand the meaning of the weights \(\tilde{w}\) for our hypothesis testing purpose, we draw from the general view of weighted CP described in [21], which expanded on analysis from [18]. For setup, let \(E_z\) denote the event \(\{Z_1, ..., Z_{n+1}\} = \{z_1, ..., z_{n+1}\}\), meaning that the empirical distribution of the datapoints has been observed, but we do not know whether \(Z_{i}=z_i\), and so on. Then, the oracle weights \(\tilde{w}^o\) would be given by entries \[\begin{align} \label{eq:general95weights} \tilde{w}_i^o = & \;\mathbb{P}\{V_{n+1} = v_i \mid E_z\} \\ = & \;\frac{\sum_{\sigma:\sigma(n+1)=i}f(z_{\sigma(1)}, ..., z_{\sigma(n+1)})}{\sum_{\sigma}f(z_{\sigma(1)}, ..., z_{\sigma(n+1)})},\nonumber \end{align}\tag{10}\] where \(f\) is the joint probability density function (PDF) and \(\sigma\) is a permutation of \([n+1]\). In words, the oracle weight \(\tilde{w}_i^o\) we would ideally use for \(\tilde{w}_i\) is the probability that the test score \(V_{n+1}\) took on the value \(v_i\), conditioned on the empirical distribution \(E_z\). For further exposition, see [21]. For arbitrary \(f\), computing Eq. 10 is intractable, as it requires knowledge of \(f\) and a sum over factorially many permutations; we next turn to how simplifying or approximating \(\tilde{w}^o\) can be useful in practical hypothesis testing.

3.2 Expanded Hypothesis Testing with Weighted-Conformal \(p\)-Values↩︎

A main theoretical insight in our paper is that the weighted-conformal \(p\)-values introduced in Eq. 9 can be used to test a variety of null hypotheses that are broader than exchangeability. Let us begin by using \(\mathcal{H}_0(\hat{f})\) to denote any set of assumptions on \(f\) used to estimate Eq. 10 , and hereon let \(\tilde{w}:= \tilde{w}(\mathcal{H}_0(\hat{f}))\) denote the weight vector computed with this approximation. For example, if we assume exchangeability as our null hypothesis (\(\mathcal{H}_0(\hat{f})=\mathcal{H}_0^{\text{(ex)}}\)), then \(f(z_{\sigma(1)}, ..., z_{\sigma(n+1)})=f(z_{1}, ..., z_{n+1})\) and Eq. 10 reduces to \(\tilde{w}_i=\frac{1}{n+1}\), which recovers standard conformal \(p\)-values. More generally, \(\mathcal{H}_0(\hat{f})\) can denote some other set of invariance assumptions on \(f\) or density-ratio estimates (see Sec. 3.3 for an example). Then, assuming \(\mathcal{H}_0(\hat{f})\), the weights \(\tilde{w}\) yield a weighted empirical distribution \[\begin{align} \label{eq:weighted95conditional95score95dist} V_{n+1}\mid E_z \sim \sum_{i=1}^{n+1}\tilde{w}_i\delta_{v_i}, \end{align}\tag{11}\] from which it follows (via the probability integral transform and marginalizing over the draw of \(E_z\)) that \(P^{\tilde{w}}_{n+1}\) is uniformly distributed on [0, 1] and is thus a valid \(p\)-value for testing \(\mathcal{H}_0(\hat{f})\). We next state this formally.

Theorem 1. (Exact validity and independence of online WCP \(p\)-values.) For any \(T\in \mathbb{N}\), let \(\mathcal{H}_0(\hat{f})\) denote all assumptions on or approximations of \(f\) used to estimate \(\tilde{w}^o\) (Eq. 10 ). Let \(\tilde{w}:=\tilde{w}(\mathcal{H}_0(\hat{f}))\) denote the corresponding estimated weights, and assume user-defined significance levels \((\alpha_1, ..., \alpha_T)\in(0, 1)^T\). Then, assuming \(\mathcal{H}_0(\hat{f})\), the sequence \(P^{\tilde{w}}_1, P^{\tilde{w}}_2, ...\) is IID uniform on [0,1]; its elements are thus independent and exactly valid \(p\)-values for testing the null hypothesis \(\mathcal{H}_0(\hat{f})\): \[\begin{align} \label{eq:H040hat40f414195iid95bernoulli} \mathbb{P}_{\mathcal{H}_0(\hat{f})}\big\{P_1^{\tilde{w}} \leq \alpha_1, ..., P_T^{\tilde{w}} \leq \alpha_T \big\}= \alpha_1 \cdots \alpha_T. \end{align}\tag{12}\]

We defer the proof to Appendix 7. It builds on the proofs for standard CTMs in [24] and [25], which leverage the idea of “reversing time”: the sequence of observations \((z_1, ..., z_T)\) is imagined to be generated by first drawing the unordered bag \(\{z_1, ..., z_T\}\) from some distribution, and then drawing an ordering of that bag, where—generalizing [24] and [25]—each possible sequence \((z_{\sigma(1)}, ..., z_{\sigma(T)})\) is chosen with probability proportional to \(f(z_{\sigma(1)}, ..., z_{\sigma(T)})\) rather than uniformly. Conditionally on the bag, this construction makes each \(P_t^{\tilde{w}^o}\) standard uniform, and peeling off the observations one at a time (from \(t=T\) backwards) yields the joint independence statement; the detailed argument is given in Appendix 7.


Figure 2: Tabular data results for the benign covariate shift setting to evaluate the adaptation ability of proposed WCTM methods (blue); all values are averaged over 200 random seeds. Training and calibration sets were sampled uniformly at random (with 1/3 of the total data used for training and calibration each), while post-changepoint test-set datapoints were bias-sampled from the remaining holdout data with probability proportional to \(\exp(\lambda \cdot h(x))\). The shift-magnitude scalar \(\lambda\) for each dataset was set as \(\lambda_{\text{mep}}=5.0\), \(\lambda_{\text{sup}}=2.5\), and \(\lambda_{\text{bik}}=5.0\). The function \(h\) was selected to simulate realistic shifts as described in the main text. Error regions represent standard errors for coverage, martingale paths, and Shiryaev-Roberts paths, and interquartile range for interval widths. Whereas standard CTMs (orange) raise unnecessary alarms with their anytime-valid and scheduled monitoring criteria, WCTMs avoid doing so by adapting online to the shift. That is, WCTMs maintain target coverage, adapt by increasing interval sharpness, and avoid unneeded alarms.

3.3 Main Practical Testing Objective: Testing for {Concept Shift or Unanticipated Covariate Shift}↩︎

We now provide a worked example for how weighted-conformal \(p\)-values can be used to test a common assumption in the robust ML literature: namely, the covariate shift [26], [27] assumption, where the marginal input distribution \(F_X\) may shift to some other distribution \(F_X^T\) at test time, but the conditional label distribution \(F_{Y \mid X}\) is assumed to remain invariant. Under the covariate shift assumption, the oracle weights \(\tilde{w}_i^o\) in Eq. 10 are proportional to density-ratio weights \(f_X^T(X_i)/f_X(X_i)\) [18]. A common goal is to adapt to covariate shift by reweighting data with this density ratio, which requires learning an approximate density-ratio function \(\widehat{f}_X^T(x)/f_X(x)\) (e.g., [18], [27][29]).
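As a concrete recipe for the density-ratio estimation referenced above, one can train a probabilistic classifier to distinguish source from target inputs and convert its probabilities into odds. The helper below is an illustrative sketch in the spirit of classifier-based estimators like [29], not the paper’s exact implementation; the function name and interface are our own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_density_ratio_weights(X_source, X_target, X_query):
    """Estimate f_X^T(x) / f_X(x) at the query points: fit a classifier to
    separate source (label 0) from target (label 1) inputs, then use the
    class-balance-corrected odds P(1|x)/P(0|x) * (n_source/n_target)."""
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    prob_target = clf.predict_proba(X_query)[:, 1]
    odds = prob_target / np.clip(1.0 - prob_target, 1e-12, None)
    return odds * (len(X_source) / max(len(X_target), 1))
```

Normalizing these estimated ratios over the calibration points and the test point yields a weight vector \(\tilde{w}\) that can be plugged into Eq. 9.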

However, there is a need for testing for misspecifications in \(\widehat{f}_X^T(x)/f_X(x)\) or unsafe shifts in \(Y\mid X\), which our work enables. That is, by Theorem 1, weighted-conformal \(p\)-values (Eq. 9 ) with weights set by an approximated density-ratio function \(\tilde{w}_i=\widehat{f}_X^T(X_i)/f_X(X_i)\) permit testing the following null hypothesis: \[\begin{align} \label{eq:cs95null} \mathcal{H}_0^{\text{(cs)}} :=\begin{cases} (X_{i}, Y_{i}) \sim F_{Y \mid X}\times F_{X}=F_Z\in \mathbf{F_{Z}^{\text{ex}}} , \;& i \in [n] \\ (X_{n+t}, Y_{n+t}) \sim F_{Y \mid X}\times \widehat{F}_{X}^T, & t>0 \\ \qquad \qquad \qquad \quad F_{Y \mid X} \in \mathbf{F^{\text{ex}}_{Y \mid X}}& \end{cases} \end{align}\tag{13}\] which can be roughly read as assuming both that \(Y\mid X\) is invariant and that \(\widehat{f}_X^T(x)/f_X(x)\) is a close approximation of the true density ratio function. Thus, observing a small value of \(p^{\tilde{w}}_{n+1}\) would convey evidence against \(\mathcal{H}_0^{\text{(cs)}}\), meaning that there has either been a concept shift in \(Y\mid X\) or that the density-ratio adaptation \(\widehat{f}_X^T(x)/f_X(x)\) is inaccurate.

3.4 Weighted-Conformal Test Martingales for Continual Monitoring↩︎

With Theorem 1 at hand, weighted-conformal test martingales (WCTMs) can be constructed from a sequence of weighted-conformal \(p\)-values to enable continual, anytime-valid monitoring of a customizable null hypothesis \(\mathcal{H}_0(\hat{f})\). Just as in the construction of standard CTMs (Section 2.3), WCTMs require a betting function \(g\) (we assume the betting function in Eq. 6 ) and a strategy \(\nu\) for placing bets \(\epsilon_t\) at each time \(t\) that can only depend on past \(p\)-value observations (we use the composite jumper strategy from [12]). Then, a WCTM is constructed by feeding a sequence of weighted-conformal \(p\)-values \(p_1^{\tilde{w}(\hat{f})}, ..., p_t^{\tilde{w}(\hat{f})}, ...\) into Eq. 7 (i.e., by setting \(p_i:=p_i^{\tilde{w}(\hat{f})}\) in Eq. 7 ). Because the \(P_t^{\tilde{w}}\) are distributed IID uniformly on [0,1] (Theorem 1), the stochastic process \(\tilde{M}_0, \tilde{M}_1, ..., \tilde{M}_t\) is a nonnegative test martingale for \(\mathcal{H}_0(\hat{f})\), and by Ville’s inequality [17] it achieves the anytime-valid control over false alarms in the following proposition (proof in Appendix 7).

Proposition 2. (WCTM anytime-valid false-alarm control) Let \(\tilde{M}_0, \tilde{M}_1, ..., \tilde{M}_t, ...\) be a WCTM constructed from the sequence of weighted-conformal \(p\)-values \(p_1^{\tilde{w}}, p_2^{\tilde{w}}, ..., p_t^{\tilde{w}}, ...\). Then, assuming \(\mathcal{H}_0(\hat{f})\), \[\begin{align} \mathbb{P}_{\mathcal{H}_0(\hat{f})}\big(\exists \;t : \tilde{M}_t / \tilde{M}_0 \geq c \big) \leq 1/c. \end{align}\]

However, the strength of the anytime-valid guarantee in Proposition 2 can sometimes be overly conservative. Thus, to improve efficiency (speed in detecting true shifts), we can follow [2], [10] by augmenting WCTMs with standard changepoint detection metrics such as CUSUM [30] and Shiryaev-Roberts [31], [32], while achieving a weaker form of multistage validity. For example, consider the Shiryaev-Roberts procedure applied to WCTMs \(\tilde{M}_t\), whose \(k\)-th stage alarm time is \[\begin{align} \tau_k := \min\big\{t > \tau_{k-1} : \sum_{i=0}^{t-1}\frac{\tilde{M}_t}{\tilde{M}_i}\geq c\big\}, \;k \in \mathbb{N}. \label{eq:SR95procedure} \end{align}\tag{14}\] This Shiryaev-Roberts procedure controls the average run length (ARL) under \(\mathcal{H}_0(\hat{f})\): on average, the procedure runs for at least \(c\) timesteps before being reset, as formalized next.

Proposition 3. (WCTM-based Shiryaev-Roberts ARL control) Let \(\tau_1\) denote the Shiryaev-Roberts stopping time (Eq. 14 ) based on a WCTM \(\tilde{M}_0, \tilde{M}_1, ..., \tilde{M}_t, ...\) constructed for the null hypothesis \(\mathcal{H}_0(\hat{f})\). Then, assuming \(\mathcal{H}_0(\hat{f})\), \[\begin{align} \mathbb{E}_{\mathcal{H}_0(\hat{f})}\big[\tau_1\big]\geq c. \end{align}\]
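Although Eq. 14 sums over all past start times, the Shiryaev-Roberts statistic \(S_t = \sum_{i=0}^{t-1}\tilde{M}_t/\tilde{M}_i\) factors as \(\tilde{M}_t\cdot\sum_{i<t}1/\tilde{M}_i\), so it admits a constant-time recursive update; this is what keeps the overall procedure \(\mathcal{O}(t)\) (cf. Sec. 4.3). The following is a minimal sketch, with names of our choosing:

```python
def shiryaev_roberts_alarm(wealth_path, threshold):
    """Scan a martingale path M_0, M_1, ... and return the first time t
    with S_t = sum_{i=0}^{t-1} M_t / M_i >= threshold (Eq. 14), else None.
    Uses S_t = M_t * H_t with the running sum H_t = sum_{i < t} 1 / M_i."""
    inv_sum = 0.0                      # H_t, updated in O(1) per step
    for t, m_t in enumerate(wealth_path):
        if t > 0 and m_t * inv_sum >= threshold:
            return t                   # stage alarm; in practice, reset and continue
        inv_sum += 1.0 / m_t
    return None
```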

3.5 WCTM Implementation in WATCH: Dynamic Online Adaptation and Root-Cause Analysis↩︎

All our implementations and experiments in this paper focus on WCTMs that continuously adapt to mild covariate shifts (e.g., Fig 1a) while raising alarms in response to either extreme covariate shifts (e.g., Fig 1b) or concept shifts in \(Y|X\) (e.g., Fig 1c). These goals correspond to monitoring the \(\mathcal{H}_0^{(cs)}\) null hypothesis given in Eq. 13 . Our adaptation procedure performs online density ratio estimation with an online probabilistic classifier (e.g., a logistic regression model similar to that in [29]) to update the estimate \(\widehat{f}_X^{T(t)}(x)/f_X(x)\) at each timestep \(t\).

Online Adaptation with Dynamic Initialization  For the initial adaptation stage, a key question is when to begin the adaptation procedure, i.e., when to begin estimating \(\widehat{f}_X^{T(t)}(x)/f_X(x)\). Our methods make this determination automatically and dynamically by running a secondary standard CTM that is restricted to only monitoring for changepoints in the marginal \(X\) distribution via a nearest-neighbor nonconformity score [2]. At deployment, the main WCTM method is initially a standard CTM (with uniform weights); then, once the secondary “\(X\)-CTM” method exceeds a pre-determined adaptation threshold (see gray paths in Fig 1) at some time \(t_{\text{ad}}\), this triggers the adaptation of the main WCTM monitoring method (blue paths in Fig 1) and corresponding weighted CP intervals. For efficiency purposes, after the adaptation time \(t_{\text{ad}}\), the CP calibration set \([n+t_{\text{ad}}]\) is treated as fixed (no longer adding test points to it online), meaning the weighted-conformal \(p\)-values are computed summing only over \([n+t_{\text{ad}}]\cup \{n+t\}\), \[\begin{align} p_{n+t}^{\tilde{w}} = \sum_{i\in[n+t_{\text{ad}}]\cup \{n+t\}}\tilde{w}_i^{(t)}\Big[\mathbb{1}\{v_i > v_{n+t}\} + u_{n+t}\mathbb{1}\{v_i = v_{n+t}\}\Big], \label{eq:weighted95p95vals95fixed95cal} \end{align}\tag{15}\] where here \(\tilde{w}_i^{(t)}\propto \widehat{f}_X^{T(t)}(X_i)/f_X(X_i)\).
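Putting the pieces together, the sketch below illustrates one possible WATCH monitoring loop using the hypothetical helpers sketched earlier (`conformal_p_value`, `weighted_conformal_p_value`, `simple_jumper_martingale`, `estimate_density_ratio_weights`). It deliberately simplifies the paper’s procedure: it freezes the calibration set at the original \(n\) points rather than \([n+t_{\text{ad}}]\), recomputes the martingales from scratch rather than incrementally, and the thresholds are placeholder values:

```python
import numpy as np

def run_watch(X_cal, Y_cal, stream, score_fn, x_score_fn,
              adapt_threshold=10.0, alarm_threshold=100.0):
    """Simplified WATCH loop: a covariate-only X-CTM triggers the WCTM's
    adaptation phase; afterwards, weighted p-values (cf. Eq. 15) use
    classifier-based density-ratio weights over the frozen calibration set."""
    cal_scores = np.array([score_fn(x, y) for x, y in zip(X_cal, Y_cal)])
    x_cal_scores = np.array([x_score_fn(x) for x in X_cal])
    p_main, p_x, X_seen, t_ad = [], [], [], None
    for t, (x_t, y_t) in enumerate(stream, start=1):
        X_seen.append(x_t)
        # Secondary X-CTM on covariate-only scores (uniform weights).
        p_x.append(conformal_p_value(x_cal_scores, x_score_fn(x_t)))
        if t_ad is None and simple_jumper_martingale(p_x)[-1] >= adapt_threshold:
            t_ad = t                              # begin the adaptation phase
        if t_ad is None:                          # pre-adaptation: standard CTM
            p_main.append(conformal_p_value(cal_scores, score_fn(x_t, y_t)))
        else:                                     # adaptation: weighted p-values
            query = np.vstack([X_cal, [x_t]])
            w = estimate_density_ratio_weights(X_cal, np.array(X_seen), query)
            w = w / w.sum()                       # normalize over cal set + test point
            v = np.append(cal_scores, score_fn(x_t, y_t))
            p_main.append(weighted_conformal_p_value(v, w))
        if simple_jumper_martingale(p_main)[-1] >= alarm_threshold:
            x_alarmed = simple_jumper_martingale(p_x)[-1] >= alarm_threshold
            return t, x_alarmed                   # alarm time + root-cause signal
    return None, False
```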

Root-Cause Analysis with Parallelized WCTM and \(X\)-CTM  The parallel implementation of both the primary WCTM and the secondary \(X\)-CTM method furthermore enables root-cause analysis, that is, to determine whether performance degradation was due to a harmful covariate shift (Fig 1b) or a fundamental concept shift (Fig 1c). That is, if both the WCTM and the \(X\)-CTM have detected changepoints, then a harmful shift can be diagnosed as an extreme covariate shift (Fig 1b); on the other hand, if the \(X\)-CTM does not detect a changepoint, this suggests that no covariate shift is present, and the WCTM alarm was instead due to a concept shift in \(Y|X\) (Fig 1c).
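The diagnostic logic just described reduces to a simple decision rule (an illustrative sketch; the alarm flags would come from a monitoring loop like the one sketched above):

```python
def diagnose_alarm(wctm_alarmed, x_ctm_alarmed):
    """Root-cause analysis per Sec. 3.5."""
    if wctm_alarmed and x_ctm_alarmed:
        return "extreme covariate shift"      # Fig. 1b: both detect a change
    if wctm_alarmed and not x_ctm_alarmed:
        return "concept shift in Y | X"       # Fig. 1c: no covariate change detected
    return "no harmful shift detected"        # Fig. 1a: benign / adapted-to shift
```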


Figure 3: Example martingale trajectories of WCTM and CTM on CIFAR-10-C with increasing levels of corruption. WCTM adapts more finely to varying severity and only triggers alarms when necessary, whereas CTM shows limited adaptability. To show the behavior of an actual (unaggregated) martingale path, the plots are from a single random experiment.

4 Experiments↩︎

We conduct a comprehensive empirical analysis of the WATCH framework on real-world datasets with various distribution shifts. Our results show that WATCH adapts effectively to benign shifts (Section 4.1) and triggers alarms only when it fails to adapt (Section 4.2), while also detecting potentially harmful shifts with minimal delay (Section 4.3). Details on the datasets, models, and additional results can be found in Appendix 8. Code to reproduce all experiments is available at the following repository: https://github.com/aaronhan223/wtr.

Baselines  We compare the proposed WCTM methods that constitute WATCH against standard CTMs [2] in all experiments. To evaluate detection speed on true harmful shifts (Sec. 4.3), we also compare against methods for directly performing sequential hypothesis testing and changepoint detection on the set-prediction miscoverage risk, as proposed by [3].

In all experiments and for all baselines, the underlying ML predictor being monitored was a neural network. On the tabular data, we used the scikit-learn [33] MLPRegressor (with L-BFGS solver and logistic activation); for the image data, we used a 3-layer MLP with ReLU activations on the MNIST datasets and a ResNet-32 [34] on CIFAR-10 datasets. For weight estimation, we use a 3-layer MLP with ReLU activations to distinguish between source and target distributions.

4.1 WCTMs Adapt to Mild and Benign Shifts to Avoid Unnecessary Alarms↩︎

Table 1: ADD (in timesteps) and wall-clock runtime (in minutes) computed over 100 seeds. Anytime-valid monitoring methods are WCTMs (proposed), CTMs [10], and sequential testing from [3] (PR-ST); multistage monitoring methods are the Shiryaev-Roberts (SR) procedure applied to WCTMs (proposed), SR applied to CTMs [10], and changepoint detection on the miscoverage risk from [3] in both online (PR-CD-online) and minibatched (PR-CD-50) variants.
| Dataset | CP Method | (W)CTM ADD / Time | PR-ST ADD / Time | SR via (W)CTM ADD / Time | PR-CD-online ADD / Time | PR-CD-50 ADD / Time |
|---|---|---|---|---|---|---|
| Superconduct | Weighted | 176.33 / 4.92E-03 | 985.35 / 0.24 | 149.68 / 7.95E-03 | 224.27 / 2.28 | 441.5 / 5.31E-03 |
| Superconduct | Standard | 173.85 / 4.86E-03 | 887.26 / 0.18 | 149.43 / 7.88E-03 | 217.47 / 2.07 | 438.5 / 5.34E-03 |
| Bike Sharing | Weighted | 182.59 / 4.09E-03 | 685.29 / 0.08 | 145.09 / 6.25E-03 | 162.71 / 0.79 | 2841 / 0.95 |
| Bike Sharing | Standard | 183.01 / 4.21E-03 | 686.61 / 0.08 | 146.50 / 6.33E-03 | 163.53 / 0.78 | 2149 / 0.48 |
| MEPS | Weighted | 120.74 / 3.97E-03 | 513.79 / 0.07 | 103.79 / 5.19E-03 | 107.05 / 0.56 | 196.5 / 1.70E-03 |
| MEPS | Standard | 120.59 / 3.95E-03 | 479.51 / 0.06 | 103.59 / 5.17E-03 | 107.98 / 0.56 | 191.5 / 1.68E-03 |

(W)CTM and PR-ST use anytime-valid monitoring criteria; SR via (W)CTM, PR-CD-online, and PR-CD-50 use scheduled, multistage monitoring criteria.

WCTMs Adapt to Benign Covariate Shifts  Figure 2 compares the average performance (across 200 random seeds) of WCTMs to standard CTMs across 3000 sequentially-observed test datapoints, where the true changepoint shift is induced after the 500th test point. From the conformal coverage plots in the first column, it is clear that all the shifts are benign in that coverage for the corresponding (standard or weighted) CP methods—a metric of prediction safety—does not degrade below the target level (0.9) after the changepoint. In fact, for the baseline method (orange), coverage increases, which could be considered a “beneficial” shift; nonetheless, the standard CTM raises unnecessary alarms across all datasets, for both its anytime-valid (fourth column) and scheduled (fifth column) monitoring criteria. In contrast, the proposed WCTM (blue) avoids these unnecessary alarms across all datasets and both monitoring criteria. That is, the WCTM adapts to the benign covariate shift by decreasing its interval widths (second column)—indicating more informative predictions—while maintaining target coverage. The relatively uniform distribution of the post-changepoint weighted \(p\)-values (third column, blue) is empirical evidence validating Theorem 1; meanwhile, the martingale (fourth column) and Shiryaev-Roberts (fifth column) WCTM paths avoiding alarms support Propositions 2 and 3, respectively.

4.2 From Mild to Extreme Covariate Shifts: WCTMs Raise Alarms if Unable to Adapt↩︎

We demonstrate this behavior of WCTMs using image-corruption experiments. The models were trained on original clean data and then evaluated on corrupted variants. To enable greater flexibility in controlling the level of distribution shift, we combine clean and corrupted samples using user-defined mixture ratios when defining the source and target distributions. Figure 3 illustrates the behavior of WCTM in comparison to CTM on CIFAR-10 under varying levels of brightness corruption. When the target distribution is completely clean (a), neither method triggers a false alarm, but the martingale paths of WCTM are more stable. We then introduce level-1 brightness corruption while retaining over \(50\%\) clean samples (b), creating a very mild, benign shift. In this scenario, CTM quickly raises an unnecessary alarm, whereas WCTM successfully avoids it. Next, we increase the corruption to level-3 and reduce the clean-sample ratio to \(40\%\) (c). Although WCTM reacts to this shift, it still does not trigger an alarm. Finally, at corruption level-5, with the clean-sample ratio reduced to \(30\%\) (d), WCTM does raise the alarm. Throughout these changes, the baseline CTM shows little variation, while our method displays strong adaptivity to different degrees of distribution shift.

We conduct a quantitative evaluation of monitoring performance on the image corruption experiments. Our evaluation metrics include the average detection delay (ADD), following [3], as well as the average numbers of unnecessary alarms and missed alarms. First, WCTM exhibits shorter detection delays compared to CTM; CTM often fails to trigger alarms at all in the face of severe corruption, which significantly increases its overall ADD. Additionally, as the corruption level intensifies, WCTM detects shifts more quickly, whereas CTM’s detection speed shows minimal change—consistent with our earlier visualizations. To evaluate unnecessary alarms, we again create a mild, benign shift in the target distribution by mixing clean samples into MNIST and level-1 CIFAR-10 corruption data, while treating higher-level corruptions as harmful. We find that WCTM’s unnecessary alarm rate is roughly one-third that of CTM. Moreover, WCTM also exhibits fewer missed alarms, especially under more severe corruptions. These findings highlight the flexibility of our framework, which not only avoids unnecessary alarms but also better detects harmful shifts.

Table 2: WCTM achieves a lower Average Detection Delay (ADD), especially under severe image corruptions, and also exhibits fewer false alarms and missed alarms. Results are averaged over 10 random seeds and all corruption types.
| Metric | Method | MNIST-C | CIFAR10-C L1 | CIFAR10-C L3 | CIFAR10-C L5 |
|---|---|---|---|---|---|
| ADD | WCTM | 156.4 | 188.0 | 163.6 | 129.7 |
| ADD | CTM | 285.3 | 176.5 | 175.2 | 170.3 |
| Unnecessary Alarm | WCTM | 7.3 | 12.2 | — | — |
| Unnecessary Alarm | CTM | 25.4 | 34.3 | — | — |
| Missed Alarm | WCTM | 2.9 | 6.6 | 3.8 | 0.9 |
| Missed Alarm | CTM | 2.4 | 5.9 | 4.6 | 1.2 |

Unnecessary alarms are evaluated only in the benign settings (MNIST-C and CIFAR10-C L1; see Sec. 4.2).

4.3 WCTMs Detect Harmful Concept Shifts Faster than Sequentially Tracking Loss Metrics↩︎

Because concept shifts fundamentally change the \(Y|X\) relationship, they are generally harmful, and monitoring methods should detect them as quickly as possible. Our experiments compare the average detection delay (ADD) of WCTM methods relative to comparable CTM methods, as well as against methods for sequentially tracking the loss metrics directly with comparable false-alarm control. That is, the sequential testing procedures proposed in [3] (PR-ST) have anytime-valid control comparable to (W)CTMs, while the changepoint detection procedure in [3] (PR-CD) has average-run-length control comparable to running the Shiryaev-Roberts procedure (Eq. 14 ) on top of (W)CTMs. Both PR-ST and PR-CD methods can be used to monitor the miscoverage risks of either weighted CP or standard CP methods, though they are less data efficient (requiring an extra holdout set for computing concentration inequalities). We compared to the betting-based \(e\)-process in [3] (with \(\epsilon_{\text{tol}}=0\)) because those methods were reported to perform the best and are more comparable to (W)CTMs, which are also betting-based. We compared WCTMs to the sequential-testing variant with a standardized anytime false-alarm rate of 0.01; SR-WCTMs were compared against the changepoint-detection variant with a common average run length (under the null) of 20,000.

Table 1 reports the average detection delay (ADD) results for monitoring methods grouped by false-alarm control type. Among the anytime-valid monitoring methods, WCTMs and CTMs achieve comparably fast ADD; however, relative to the comparable PR-ST methods, (W)CTMs are over three times faster. Meanwhile, among the multistage monitoring methods, WCTM and CTM-based SR procedures are comparable, but with significantly faster ADD than PR-CD run either online or in minibatches. PR-CD-online has lower ADD than PR-CD-50, but its wall-clock runtime is far slower—this is due to the method’s \(\mathcal{O}(t^2)\) time complexity, whereas the proposed WCTM and SR-WCTM methods are \(\mathcal{O}(t)\).

5 Summary and Future Directions↩︎

In this paper we introduced novel weighted conformal test martingales (WCTMs), which we constructed from sequences of weighted conformal \(p\)-values, and we demonstrated how WCTMs enable continual monitoring of deployed AI/ML models via flexible sequential hypothesis testing for changepoints. The proposed approach, WATCH, continuously adapts to benign covariate shifts without raising unnecessary alarms, while also detecting harmful shifts more rapidly. Our empirical results show that WATCH’s adaptation reduces unnecessary alarms relative to standard CTMs [2] while still detecting harmful shifts faster than comparable betting-based \(e\)-processes that directly track the risk (i.e., from [3]). WATCH is further able to perform root-cause analysis to diagnose whether a harmful shift causing degradation was an extreme covariate shift or a fundamental concept shift. Promising future theory directions include developing WCTMs for testing other data-generating assumptions, connections to conditional permutation tests [35], and finer-grained root-cause analysis. For applications, extensions of WATCH to monitor the risks of generative models and AI agents could be valuable.

Impact Statement↩︎

This paper presents research aimed at propelling advancements in the broad domain of machine learning. The implications of our findings are wide-ranging, with potential high-stakes applications in sectors including healthcare, autonomous driving, and e-commerce. Based on our current understanding, this research does not warrant an ethics review, and a detailed discussion of the potential societal impacts is not required at the current stage.

Acknowledgements↩︎

D.P. was funded by a PhD fellowship from the Amazon Initiative for Interactive AI (AI2AI); D.P., X.H., A.L., and S.S. were partially funded by the Gordon and Betty Moore Foundation grant #12128; D.P., X.H., and S.S. were partially funded by National Science Foundation (NSF) grant #1840088. A.L. is also supported by an Amazon Research Award. The authors would like to thank Yaniv Romano for helpful discussions on a draft of this paper; Jwala Dhamala and Jie Ding for helpful discussions throughout the JHU + AI2AI collaboration; and the anonymous ICML reviewers for informative feedback.

Appendix for
“WATCH: Weighted Adaptive Testing for Changepoint Hypotheses
via Weighted-Conformal Martingales”

6 Related Works↩︎

This work is motivated by developing methods for monitoring AI/ML deployments that perform three key functions: (1) online adaptation to mild or benign data shifts; (2) rapid detection of extreme or harmful shifts that necessitate updates; and (3) identifying the root-cause of degradation to inform appropriate recovery. Although these monitoring goals could be viewed from many perspectives, in discussing related work we primarily focus on methods in anytime-valid inference, sequential testing of nonparametric null hypotheses, and especially conformal test martingales, which are most closely related to our own.

Sequential testing for changepoints in the data distribution: Sequential hypothesis testing to detect changes in the data distribution is an old and widely-studied problem, dating at least to Wald’s sequential probability ratio test [36]. However, classic sequential changepoint detection methods often required a prespecified stopping time and were primarily designed for testing simple, often parametrically-specified null hypotheses—for an overview of classic parametric methods, we defer to a relevant textbook [37] and review articles [38], [39]. In contrast, our weighted-conformal test martingale (WCTM) methods belong to a more recent literature called sequential anytime-valid inference (SAVI), which allow for arbitrary stopping times, and specifically our work is situated in the recent SAVI literature on testing composite and nonparametric null hypotheses. We refer readers to [8] for a recent review of the SAVI literature, as well as to the textbook by [14] on the closely related topic of testing-by-betting.

Standard conformal test martingales: Within the SAVI literature, our work is most closely related to conformal test martingales (CTMs), which are martingales constructed from a sequence of standard conformal \(p\)-values for continually testing the assumption that the data are exchangeable or independent and identically distributed (IID). Our WCTM methods generalize standard CTMs for testing a broader range of null hypotheses, including those where one wishes to accommodate or adapt to certain anticipated changes in the data distribution (e.g., adapting to mild covariate shifts). Standard CTMs were initially introduced by [25], while drawing on theory for the calibration of online conformal prediction methods developed in [24]. Since then, various works have further developed standard CTMs, such as by introducing new betting and score functions for more efficient changepoint detection, ensembling CTMs, and demonstrating their performance on real-world applications [2], [9], [10], [40][44]. CTMs are also discussed in textbooks on conformal prediction [45], [46].

In particular, [9] developed inductive CTMs (i.e., CTMs based on inductive or split conformal [11], [47]) with score and betting functions specifically tailored to fast changepoint detection; [2] proposed an approach to using standard CTMs to monitor for when a deployed AI/ML system should be retrained; and [10] provided a comprehensive and detailed review of CTMs as methods for testing the IID or exchangeability assumptions. [48] implement methods based on CTMs for test-time adaptation of a classifier’s point prediction, and [49] leverage multiple CTMs over different feature spaces to aid in diagnostic runtime monitoring. Several works have developed or used CTMs for testing for concept shift (i.e., shift in \(Y\mid X\)) [40], [43], [44], [50], but all of these have focused on a classification setting with a limited number of classes (e.g., to implement label-conditional CTMs, or using standard CTMs that lack the ability to disambiguate between covariate and concept shifts). In contrast, the specific WCTMs implemented in this paper are able to test for concept shift in both regression and classification settings while also being able to disambiguate between concept shifts and extreme covariate shifts with the aid of an additional \(X\)-CTM.

Confidence sequences, \(e\)-processes, and other multiple-hypothesis testing methods: Our work also relates more broadly to other SAVI and testing-by-betting methods outside of CTMs, such as confidence sequences and \(e\)-processes—we refer to the review paper of [8] for a full exposition of these topics. In particular, in the main paper we empirically compared our WCTM methods against the betting-based sequential testing and changepoint detection methods in [3] (the theory for which was developed in [51]), because that paper is the closest to ours in its motivation of monitoring deployed AI/ML systems among (non-CTM) SAVI papers. As we note in the main paper, for monitoring the risk of a set-valued (e.g., conformalized) AI/ML predictor, our WCTM methods are generally more data-efficient (not requiring separate datasets for conformalization and testing, as [3] does); our WCTM-based Shiryaev-Roberts procedure is more computationally efficient than the comparable changepoint detection method in [3] (i.e., \(\mathcal{O}(t)\) versus \(\mathcal{O}(t^2)\)); and, empirically, we found that our methods often detect harmful concept shifts faster than comparable methods in [3]. More recently, [52] built on [3] by developing similar methods for when ground-truth labels do not become available and need to be estimated.

A key target of the SAVI literature on testing composite or nonparametric null hypotheses has been developing new approaches for testing the IID or exchangeability assumptions. Other than CTMs, the other main nontrivial approach to sequentially testing exchangeability was developed in [53] based on the ideas of universal inference [54]; more recently, other approaches have emerged based on pairwise betting [55] and sequential Monte Carlo testing [56], the latter of which can be viewed as a special case of CTMs [10] streamlined for a particular alternative hypothesis. Otherwise, there are various and proliferating other methods for sequential nonparametric changepoint detection (e.g., [57][61]). SAVI and testing-by-betting methods are also being leveraged for a wide variety of applications including interpretability [62], conditional independence testing [63], applications in finance [14], and more.

Testing-by-betting, which is fundamental to SAVI [8], can be understood as one approach to multiple hypothesis testing that is especially advantageous in sequential settings. Other related works that leverage conformal prediction for hypothesis testing while accounting for multiple-testing corrections, but in batch settings, include [64][68].

Weighted conformal prediction and adapting to distribution shifts: Weighted conformal prediction (e.g., [18][20], [22], [28], [69][76]) is broadly an approach to proactively adapting the validity of conformal predictive sets to distribution shift by reweighting nonconformity scores using either knowledge or estimates of the distribution shift. Any weighted CP prediction set is associated with a weighted-conformal \(p\)-value, as described in the main paper; accordingly, a WCTM can be constructed on top of any weighted CP method deployed on a sequence of data observations to continually monitor the assumptions or approximations underlying that WCP method’s implementation. There are of course many other approaches to adapting to distribution shifts at test time; for example, one work that is similar to ours with regard to this motivation, but that does not fall under weighted CP, is [48], which leverages standard CTMs to guide the test-time adaptation of a classifier’s point prediction by entropy matching.

7 Proof of Theorems↩︎

7.1 Proof for Theorem 1 (Weighted Conformal \(p\)-value Validity)↩︎

The proof can be viewed as a generalization of the proofs for Theorem 1 in [25] and Theorem 2 in [24], while drawing on analysis from [18] and exposition and discussion from [21]. The key difference relative to [24] and [25] is that, whereas those papers use the assumption of exchangeability to place equal weight on every permutation of the data observations, here we avoid this assumption by first proving a general result for an arbitrary (potentially non-exchangeable) joint distribution. We then describe how this implies that the validity of more specific and tractable methods is premised on the assumptions or approximations used for practical implementation.

The basic idea for the proof begins with setup from [24], for “reversing time.” In particular, we imagine that the sequence of data observations \((z_1, ..., z_T)\) is generated in two steps: first, the unordered bag or multiset of data observations, \(\{z_1, ..., z_T\}\), is generated from some probability distribution (that is, the image of \(F_Z\) under the mapping \((z_1, z_2, ...) \rightarrow \{z_1, ..., z_T\}\)); then—here generalizing [24] and [25] by weighting permutations according to their likelihood—from all possible permutations \(\sigma\) of the values \(\{z_1, ..., z_T\}\), each possible sequence \((z_{\sigma(1)}, ..., z_{\sigma(T)})\) is chosen with probability proportional to \(f(z_{\sigma(1)}, ..., z_{\sigma(T)})\), where \(f:=f_Z\) is the probability-density function8 for the distribution \(F_Z\). Roughly (ignoring borderline effects), the second step ensures that, conditionally on knowing \(\{z_1, ..., z_T\}\) (and therefore unconditionally), that \(P_{T}^{\tilde{w}^o}\) has a standard uniform distribution; when \(Z_T=z_{\sigma(T)}\) is observed, this settles the value of \(P_{T}^{\tilde{w}^o}=p_{\sigma(T)}^{\tilde{w}^o}\), and conditionally on knowing \(\{z_1, ..., z_T\}\) and \(Z_T=z_{\sigma(T)}\) (and therefore, after relabeling indices, on knowing \(\{z_1, ..., z_{T-1}\}\)), that \(P_{T-1}^{\tilde{w}^o}\) also has a standard uniform distribution, and so on.

Lemma 1. For any trial \(t\) and any confidence level \(\alpha \in (0, 1)\), \[\begin{align} \label{eq:lemma1} \mathbb{P}\{P_{t}^{\tilde{w}^o} \leq \alpha \mid E_{z}^{(t)} \} = \alpha. \end{align}\tag{16}\]

Proof of Lemma 1. We begin by conditioning on the event \(\{Z_1, ..., Z_t\}=\{z_1, ..., z_t\}\), which we denote as \(E_{z}^{(t)}\), and we consider drawing any particular ordering or permutation \(\sigma\) of the data values with probability according to \(f\), that is with probability proportional to \(f(z_{\sigma(1)}, ..., z_{\sigma(t)})\).

Note that for any \(i\in [t]\), if a permutation is drawn such that \(\sigma(t)=i\), this means that \(Z_t=z_{\sigma(t)}=z_i\); moreover, because the score function \(\widehat{\mathcal{S}}\) is bijective, this further implies that \(V_t=v_{\sigma(t)}=v_i\). Thus, given the bag of data \(E_z^{(t)}\), recall that for each \(i\in [t]\), the probability of drawing such a permutation is given by the “oracle weights” \[\begin{align} \label{eq:general95weights95app} \tilde{w}^o_i := \mathbb{P}\{V_{t} = v_i \mid E_{z}^{(t)}\} = \mathbb{P}\{Z_{t} = z_i \mid E_{z}^{(t)}\} = \frac{\sum_{\sigma:\sigma(t)=i}f(z_{\sigma(1)}, ..., z_{\sigma(t)})}{\sum_{\sigma}f(z_{\sigma(1)}, ..., z_{\sigma(t)})}, \end{align}\tag{17}\] which we assume to be well-defined; this will generally hold in practice, since at least \(f(z_1, ..., z_T)>0\) for the true (identity-permutation) ordering of the data observations \((z_1, ..., z_T)\).

This implies that the distribution of \(V_t\mid E_{z}^{(t)}\), the conditional distribution of the test-point score given the bag of data values \(E_{z}^{(t)}\), is given by \[\begin{align} V_t \mid E_{z}^{(t)} \sim \sum_{i=1}^t\tilde{w}^o_i\cdot \delta_{v_i}. \end{align}\]

For any \(i\in[t]\), define a conservative WCP \(p\)-value, \(p_i^{\tilde{w}^o+}\), and an anticonservative WCP \(p\)-value, \(p_i^{\tilde{w}^o-}\), as \[\begin{align} p_i^{\tilde{w}^o+} & := \sum_{j=1}^t\tilde{w}^o_j\cdot \mathbb{1}\{v_j\geq v_i\} \\ p_i^{\tilde{w}^o-} & := \sum_{j=1}^t\tilde{w}^o_j\cdot \mathbb{1}\{v_j> v_i\}. \end{align}\] It is worth noting that \(p_t^{\tilde{w}^o+}\) is a valid \(p\)-value for \(f\): \(\mathbb{P}\{P_t^{\tilde{w}^o+}\leq \alpha \mid E_{z}^{(t)}\} \leq \alpha \implies \mathbb{P}\{P_t^{\tilde{w}^o+}\leq \alpha\} \leq \alpha\). However, the lemma claims exact validity for \(p_t^{\tilde{w}^o}\) conditional on \(E_{z}^{(t)}\), which we will now proceed to show.

Observe that for all \(i\in [t]\), \(p_i^{\tilde{w}^o-}<p_i^{\tilde{w}^o+}\) and \[\begin{align} p_i^{\tilde{w}^o+}-p_i^{\tilde{w}^o-} = \sum_{j=1}^t\tilde{w}^o_j\cdot \mathbb{1}\{v_j = v_i\}. \end{align}\] Moreover, observe that as in the proof of Lemma 1 in [24], the semi-closed intervals \([p_i^{\tilde{w}^o-}, p_i^{\tilde{w}^o+})\) either coincide or are disjoint, and \(\cup_{i=1}^t[p_i^{\tilde{w}^o-}, p_i^{\tilde{w}^o+})=[0,1)\).

As in the proof of Lemma 1 in [24], for an \(\alpha\in (0, 1)\), let us partition the indices as follows, where we say that an index \(i\) is

  • “strange” if \(p_i^{\tilde{w}^o+}\leq \alpha\),

  • “ordinary” if \(p_i^{\tilde{w}^o-} > \alpha\),

  • and “borderline” if \(p_i^{\tilde{w}^o-} \leq \alpha < p_i^{\tilde{w}^o+}\).

Let \(i'\) denote the index of any borderline example, and denote \(p^{\tilde{w}^o+}:=p_{i'}^{\tilde{w}^o+}\) and \(p^{\tilde{w}^o-}:=p_{i'}^{\tilde{w}^o-}\). Then, the probability (conditional on \(E_{z}^{(t)}\), drawing each permutation \(\sigma\) with probabilities according to \(f\)) that the last index \(\sigma(t)\) is strange is \(p^{\tilde{w}^o-}\); the probability that \(\sigma(t)\) is ordinary is \(1-p^{\tilde{w}^o+}\); and, the probability that \(\sigma(t)\) is borderline is \(p^{\tilde{w}^o+}-p^{\tilde{w}^o-}\). Moreover, observe that if \(\sigma(t)\) is strange, then \(p_{\sigma(t)}^{\tilde{w}^o}\leq \alpha\) (by definition) and if \(\sigma(t)\) is borderline then the event that \(p_{\sigma(t)}^{\tilde{w}^o}\leq \alpha\) is determined by the independent uniform \(u_t\), and thus the probability of this event is \(\frac{\alpha-p^{\tilde{w}^o-}}{p^{\tilde{w}^o+}-p^{\tilde{w}^o-}}\). That is,

\[\begin{align} \mathbb{P}\{P_{t}^{\tilde{w}^o} \leq \alpha \mid E_{z}^{(t)} \} = & \;\mathbb{P}\{P_{t}^{\tilde{w}^o} \leq \alpha \mid E_{z}^{(t)} , \;\sigma(t) \text{ is strange} \}\cdot\mathbb{P}\{\sigma(t) \text{ is strange} \mid E_{z}^{(t)} \} \nonumber \\ & + \mathbb{P}\{P_{t}^{\tilde{w}^o} \leq \alpha \mid E_{z}^{(t)} , \;\sigma(t) \text{ is ordinary} \}\cdot\mathbb{P}\{\sigma(t) \text{ is ordinary} \mid E_{z}^{(t)} \} \nonumber \\ & + \mathbb{P}\{P_{t}^{\tilde{w}^o} \leq \alpha \mid E_{z}^{(t)} , \;\sigma(t) \text{ is borderline} \}\cdot\mathbb{P}\{\sigma(t) \text{ is borderline} \mid E_{z}^{(t)} \} \\ = & \;p^{\tilde{w}^o-} + 0 + (p^{\tilde{w}^o+}-p^{\tilde{w}^o-})\cdot\frac{\alpha-p^{\tilde{w}^o-}}{p^{\tilde{w}^o+}-p^{\tilde{w}^o-}} \\ = & \;\alpha. \end{align}\] ◻
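To make the quantities in this proof concrete, here is a minimal sketch (assuming normalized weights over the bag) that computes the conservative, anticonservative, and randomized WCP \(p\)-values for the test score; the independent uniform \(u_t\) resolves the borderline case exactly as in the argument above.

```python
import numpy as np

def wcp_p_values(scores, weights, u=None):
    """Conservative (p+), anticonservative (p-), and randomized WCP p-values
    for the last (test) score, given normalized weights over the bag."""
    v_test = scores[-1]
    p_plus = weights[scores >= v_test].sum()   # weight on {v_j >= v_t}
    p_minus = weights[scores > v_test].sum()   # weight on {v_j >  v_t}
    u = np.random.uniform() if u is None else u
    # Randomizing over the borderline mass (p+ - p-) yields exact conditional
    # uniformity, as shown in Lemma 1.
    return p_plus, p_minus, p_minus + u * (p_plus - p_minus)

scores = np.array([1.2, 0.4, 0.4, 2.0])
weights = np.array([0.1, 0.2, 0.3, 0.4])  # oracle or approximated weights, summing to 1
print(wcp_p_values(scores, weights))
```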

With Lemma 1 in hand, we can now proceed with the proof of the main theorem. Temporarily fix a positive integer \(T\); following the strategy of [25], we first prove by induction that, for any \(t=1, ..., T\) and any \((\alpha_1, ..., \alpha_t)\in[0,1]^t\), \[\begin{align} \label{eq:conditional_iid_bernoulli} \mathbb{P}\{P_t^{\tilde{w}^o} \leq \alpha_t, ..., P_1^{\tilde{w}^o} \leq \alpha_1 \mid E_{z}^{(t)}\} = \alpha_t \cdots \alpha_1. \end{align}\tag{18}\]

For \(t=1\), Eq. 18 immediately follows from Lemma 1. For \(t>1\), by the law of total probability over \(\sigma(t)\) (i.e., over the last index value after drawing a permutation \(\sigma\) from \(E_z^{(t)}\)), the fundamental bridge between probability and expectation, and properties of the indicator function, we have \[\begin{align} \mathbb{P}&\{P_t^{\tilde{w}^o} \leq \alpha_t, ..., P_1^{\tilde{w}^o} \leq \alpha_1 \mid E_{z}^{(t)}\} \\ & = \sum_{\sigma(t)=1}^t\mathbb{P}\{P_t^{\tilde{w}^o} \leq \alpha_t, ..., P_1^{\tilde{w}^o} \leq \alpha_1 \mid E_{z}^{(t)}, Z_t = z_{\sigma(t)}\}\cdot \mathbb{P}\{Z_t = z_{\sigma(t)} \mid E_{z}^{(t)}\}\\ & = \sum_{\sigma(t)=1}^t\mathbb{E}\big[\mathbb{1}\{P_t^{\tilde{w}^o} \leq \alpha_t, ..., P_1^{\tilde{w}^o} \leq \alpha_1\} \mid E_{z}^{(t)}, Z_t = z_{\sigma(t)}\big]\cdot \mathbb{P}\{Z_t = z_{\sigma(t)} \mid E_{z}^{(t)}\} \\ & = \sum_{\sigma(t)=1}^t\mathbb{E}\big[\mathbb{1}\{P_t^{\tilde{w}^o} \leq \alpha_t\}\cdot\mathbb{1}\{P_{t-1}^{\tilde{w}^o} \leq \alpha_{t-1}, ..., P_1^{\tilde{w}^o} \leq \alpha_1\} \mid E_{z}^{(t)}, Z_t = z_{\sigma(t)}\big]\cdot \mathbb{P}\{Z_t = z_{\sigma(t)} \mid E_{z}^{(t)}\}. \end{align}\] Next, observe that conditioning on \(E_{z}^{(t)}\) and \(Z_t = z_{\sigma(t)}\) in the expectation term settles the value of \(\mathbb{1}\{P_t^{\tilde{w}^o} \leq \alpha_t\}\) to be \(\mathbb{1}\{p_{\sigma(t)}^{\tilde{w}^o} \leq \alpha_t\}\); that is, noting that the score function \(\widehat{\mathcal{S}}\) is bijective, \(\{E_{z}^{(t)}, Z_t = z_{\sigma(t)}\}\implies \{E_{z}^{(t)}, V_t = v_{\sigma(t)}\} \implies P_{t}^{\tilde{w}^o}=p_{\sigma(t)}^{\tilde{w}^o}\). So, \(\mathbb{1}\{P_t^{\tilde{w}^o} \leq \alpha_t\}=\mathbb{1}\{p_{\sigma(t)}^{\tilde{w}^o} \leq \alpha_t\}\) can be pulled out of the expectation to obtain \[\begin{align} \mathbb{P}& \{P_t^{\tilde{w}^o} \leq \alpha_t, ..., P_1^{\tilde{w}^o} \leq \alpha_1 \mid E_{z}^{(t)}\} \\ & = \sum_{\sigma(t)=1}^t\mathbb{1}\{p_{\sigma(t)}^{\tilde{w}^o} \leq \alpha_t\}\cdot\mathbb{E}\big[\mathbb{1}\{P_{t-1}^{\tilde{w}^o} \leq \alpha_{t-1}, ..., P_1^{\tilde{w}^o} \leq \alpha_1\} \mid E_{z}^{(t)}, Z_t = z_{\sigma(t)}\big]\cdot \mathbb{P}\{Z_t = z_{\sigma(t)} \mid E_{z}^{(t)}\}. \end{align}\]

Now, observe that conditioning on \(E_{z}^{(t)}\) and \(Z_t = z_{\sigma(t)}\) implies that \(\{Z_1, ..., Z_{t-1}\}=\{z_1, ..., z_t\}\backslash \{z_{\sigma(t)}\}\), which we can denote as \(E_{z}^{(t-1)}\) (relabeling data indices as needed); in other words, observing the bag of \(t\) data values (i.e., observing \(E_{z}^{(t)}\)) and the value taken on by the \(t\)-th random variable (\(Z_t = z_{\sigma(t)}\)) implies that each of \(Z_1, ..., Z_{t-1}\) takes a value in \(\{z_1, ..., z_t\}\backslash \{z_{\sigma(t)}\}\). Moreover, as the value taken on by \(P_{t'}^{\tilde{w}^o}\) is a function of the values taken on by \(Z_1, ..., Z_{t'}\), the \(P_{t-1}^{\tilde{w}^o}, ..., P_{1}^{\tilde{w}^o}\) do not depend on the value of \(Z_t\), so \[\begin{align} \mathbb{P}& \{P_t^{\tilde{w}^o} \leq \alpha_t, ..., P_1^{\tilde{w}^o} \leq \alpha_1 \mid E_{z}^{(t)}\}\\ & = \sum_{\sigma(t)=1}^t\mathbb{1}\{p_{\sigma(t)}^{\tilde{w}^o} \leq \alpha_t\}\cdot\mathbb{P}\{P_{t-1}^{\tilde{w}^o} \leq \alpha_{t-1}, ..., P_1^{\tilde{w}^o} \leq \alpha_1 \mid E_{z}^{(t-1)}\}\cdot \mathbb{P}\{Z_t = z_{\sigma(t)} \mid E_{z}^{(t)}\}. \end{align}\]

Using the inductive assumption and Lemma 1, this becomes \[\begin{align} \mathbb{P}\{P_t^{\tilde{w}^o} \leq \alpha_t, ..., P_1^{\tilde{w}^o} \leq \alpha_1 \mid E_{z}^{(t)}\}& = \sum_{\sigma(t)=1}^t\mathbb{1}\{p_{\sigma(t)}^{\tilde{w}^o}\leq \alpha_t\}\cdot\alpha_{t-1}\cdots\alpha_{1}\cdot \mathbb{P}\{Z_t = z_{\sigma(t)} \mid E_{z}^{(t)}\}\nonumber \\ & = \mathbb{P}\big\{P_t^{\tilde{w}^o}\leq \alpha_t \mid E_{z}^{(t)}\big\}\cdot\alpha_{t-1}\cdots\alpha_{1} \nonumber \\ & = \alpha_{t}\cdots\alpha_{1}, \end{align}\] which proves Eq. 18. Note that Eq. 18 is a conditional result for any \(t=1, ..., T\); marginalizing over the event \(E_{z}^{(t)}\) and taking \(t=T\) implies \[\begin{align} \label{eq:unconditional_iid_bernoulli} \mathbb{P}\{P_T^{\tilde{w}^o} \leq \alpha_T, ..., P_1^{\tilde{w}^o} \leq \alpha_1 \} = \alpha_T \cdots \alpha_1. \end{align}\tag{19}\] We have proven that \(P_1^{\tilde{w}^o},P_2^{\tilde{w}^o}, ..., P_T^{\tilde{w}^o}\stackrel{iid}{\sim}\text{Unif}[0, 1]^T\); this implies the analogous result for the infinite sequence, that is, \(P_1^{\tilde{w}^o},P_2^{\tilde{w}^o}, ...\stackrel{iid}{\sim}\text{Unif}[0, 1]^{\infty}\) [25], [77].

Note that Eq. 19 holds in theory for an arbitrary joint PDF \(f\), but it is an abstract statement because it relies on the oracle weights in Eq. 17, which are generally intractable: they require knowledge of \(f\) as well as a summation over factorially many permutations. Thus, if the oracle weights in Eq. 17 are simplified using some assumptions or approximations on \(f\) (e.g., conditional independence or invariance assumptions, density-ratio estimates) that we denote by \(\mathcal{H}_0(\hat{f})\), and denoting the resulting approximated weights as \(\tilde{w}\), then the \(P_{t}^{\tilde{w}}\) are IID uniformly distributed, assuming \(\mathcal{H}_0(\hat{f})\): \[\begin{align} \label{eq:H0_hat_f_iid_bernoulli_app} \mathbb{P}_{\mathcal{H}_0(\hat{f})}\big\{P_T^{\tilde{w}} \leq \alpha_T, ..., P_1^{\tilde{w}} \leq \alpha_1 \big\}= \alpha_T \cdots \alpha_1. \end{align}\tag{20}\]

Eq. 20 implies that \(P_1^{\tilde{w}},P_2^{\tilde{w}}, ..., P_T^{\tilde{w}}\mid \mathcal{H}_0(\hat{f})\stackrel{iid}{\sim}\text{Unif}[0, 1]^T\). As before, this result for a fixed number of points \(T\) implies the corresponding result for infinite sequences [25], [77]; that is, \(P_1^{\tilde{w}},P_2^{\tilde{w}},... \mid \mathcal{H}_0(\hat{f}) \stackrel{iid}{\sim}\text{Unif}[0, 1]^{\infty}\).
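As a quick numerical sanity check on this claim (in the special case of IID scores, where the oracle weights reduce to the equal weights \(1/t\)), the following Monte Carlo sketch confirms that the randomized \(p\)-value is standard uniform:

```python
import numpy as np

rng = np.random.default_rng(0)
t, n_rep = 20, 20000
p_vals = np.empty(n_rep)
for r in range(n_rep):
    v = rng.normal(size=t)         # IID scores: exchangeable, so oracle weights = 1/t
    w = np.full(t, 1.0 / t)
    p_plus = w[v >= v[-1]].sum()
    p_minus = w[v > v[-1]].sum()
    p_vals[r] = p_minus + rng.uniform() * (p_plus - p_minus)

# Empirical quantiles should be close to their nominal levels under Unif[0, 1].
print(np.quantile(p_vals, [0.1, 0.5, 0.9]))  # approximately [0.1, 0.5, 0.9]
```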

7.2 Proofs for Proposition 2 (WCTM Anytime False-Alarm Control) and Proposition 3 (Average-Run-Length Control for WCTM-based Shiryaev-Roberts Procedure)↩︎

The proofs for Proposition 2 and Proposition 3 follow from Theorem 1 together with [10]'s results for standard conformal test martingales. By Theorem 1, weighted conformal \(p\)-values are IID uniform on \([0,1]\) premised on some assumptions \(\mathcal{H}_0(\hat{f})\); that is, assuming the null hypothesis \(\mathcal{H}_0(\hat{f})\) is true, a sequence of weighted conformal \(p\)-values is a sequence of IID \(\text{Unif}[0,1]\) random variables. Because the construction of a weighted conformal test martingale (e.g., Eq. 7) is only allowed to depend on the data via the sequence of weighted conformal \(p\)-values, we can invoke [10]'s results, which construct betting martingales that assume only a sequence of IID uniform random variables on \([0,1]\). That is, by construction, a weighted conformal test martingale is a betting martingale [10] whose validity is premised on the assumptions \(\mathcal{H}_0(\hat{f})\) (used to compute the weights). Proposition 2 then follows from Ville's inequality [17], and Proposition 3 follows from Proposition 4.2 in [10].
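To illustrate the mechanics of false-alarm control, here is a minimal sketch of a betting martingale over a \(p\)-value sequence. For simplicity it uses a single power calibrator \(\epsilon p^{\epsilon - 1}\) (which integrates to 1 over \([0,1]\), so the wealth process is a nonnegative martingale under the null) rather than the composite jumper used in our experiments; by Ville's inequality, alarming when the wealth reaches \(1/\alpha\) bounds the false-alarm probability by \(\alpha\) under \(\mathcal{H}_0(\hat{f})\).

```python
import numpy as np

def betting_martingale_alarm(p_values, eps=0.5, alpha=0.01):
    """Wealth M_t = prod_i eps * p_i^(eps - 1); E[eps * P^(eps - 1)] = 1 for
    P ~ Unif[0, 1], so M is a nonnegative martingale under the null.
    Ville's inequality: P(sup_t M_t >= 1/alpha) <= alpha."""
    wealth, path = 1.0, []
    for t, p in enumerate(p_values, start=1):
        wealth *= eps * max(p, 1e-12) ** (eps - 1.0)  # small p-values grow wealth
        path.append(wealth)
        if wealth >= 1.0 / alpha:
            return t, path   # alarm time; false-alarm prob. <= alpha under the null
    return None, path        # no alarm raised
```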

8 Experiment Details↩︎

8.1 Datasets↩︎

The evaluation datasets include both real-world tabular and image datasets. The tabular datasets are for regression tasks (where conformal methods used the absolute-residual nonconformity score \(\widehat{S}(x,y)=|y-\widehat{\mu}(x)|\)), and the datasets span various sizes and dimensionalities: the Medical Expenditure Panel Survey (MEPS) dataset (33005 samples, 107 features) [78], the UCI Superconductivity dataset (21263 samples, 81 features) [79], and the UCI Bike Sharing dataset (17379 samples, 12 features) [80]. The image datasets are for classification tasks (where conformal methods used the one-minus-softmax score \(\widehat{S}(x,y)=1-\widehat{p}(y\mid x)\)): MNIST-C [81] (60000 clean samples, 10000 corrupted samples) and CIFAR-10-C [82] (50000 clean samples, 10000 corrupted samples), which are standard benchmarks for assessing distribution shifts.
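For concreteness, here are minimal sketches of the two nonconformity scores (vectorized over a batch; \(\widehat{\mu}\) and the class-probability estimates are assumed given):

```python
import numpy as np

def regression_score(y, mu_hat):
    """Absolute-residual score for the tabular regression tasks: |y - mu_hat(x)|."""
    return np.abs(y - mu_hat)

def classification_score(probs, y):
    """One-minus-softmax score for the image tasks: 1 - p_hat(y | x).
    `probs` has shape (n, n_classes); `y` holds integer class labels."""
    return 1.0 - probs[np.arange(len(y)), y]
```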

8.2 Simulating Shifts in Tabular Data↩︎

To evaluate the online-adaptation performance of our proposed WCTMs on the tabular, regression-task datasets, we simulated mild or benign covariate shifts by exponentially tilting (i.e., up-sampling) the test samples from the full dataset based on a tilting function \(h\) of selected covariates. On the MEPS healthcare dataset, \(h\) was selected to simulate a demographic shift towards younger, higher-educated patients (expected to be a benign shift, since youth tends to correlate with fewer, less variable health issues). On the Bike Sharing dataset, \(h\) simulated a shift in weather patterns toward colder, windier days; meanwhile, a more complex shift was simulated on the Superconductivity dataset by sampling proportionally to the projection onto the first principal component of the training data. Harmful concept shifts in tabular data were simulated by biasing the label values as a function of selected input covariates for each dataset.
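As an illustration, here is a minimal sketch of the exponential-tilting sampler (the function name and interface are hypothetical, not our exact code); Appendix 9.2 uses \(h(x) = |x - 18|\) with severity parameter \(\lambda\):

```python
import numpy as np

def tilted_sample(X_test, h, lam, size, rng=None):
    """Draw test indices with probability proportional to exp(lam * h(x)),
    i.e., exponentially tilt the empirical test distribution along h."""
    rng = np.random.default_rng() if rng is None else rng
    logits = lam * np.array([h(x) for x in X_test])
    probs = np.exp(logits - logits.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(X_test), size=size, replace=True, p=probs)

# Hypothetical usage: a severity-lambda shift toward extreme ages (cf. Appendix 9.2).
# idx = tilted_sample(X_test, h=lambda x: abs(x[0] - 18.0), lam=0.5, size=1000)
```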

9 Additional Experimental Results↩︎

9.1 Root-Cause Analysis with WCTMs↩︎


Figure 4: Results for root-cause analysis with a WCTM (blue) and a secondary \(X\)-CTM (gray) on the MEPS dataset, averaged over 100 random seeds. Each plot corresponds to a shift setting analogous to those in the bottom row of Figure 1. The left plot corresponds to a benign covariate shift setting (the same setting as in Figure 2), and WATCH diagnoses it as such, with the \(X\)-CTM achieving a large value while the WCTM adapts to the shift, avoiding an alarm. The middle plot corresponds to an extreme covariate shift setting, and WATCH diagnoses it by the \(X\)-CTM identifying covariate shift and the WCTM raising an alarm due to potential harm. Lastly, the right plot corresponds to a harmful concept shift, and WATCH identifies it as such by the WCTM raising an alarm while the \(X\)-CTM maintains lower values, suggesting that no covariate shift has occurred.

9.2 What Makes a Covariate Shift Mild, Moderate, or Extreme?↩︎

While the distinctions between mild, moderate, and extreme covariate shifts can be gradual, problem-specific, and sometimes user-driven, they are not arbitrary, and WCTMs arguably even provide an approach to such delineation with statistical guarantees. Intuitively, benign shifts can be considered those where the “safety” of the CP coverage is maintained nontrivially (i.e., without the prediction set covering the whole label space). This intuition corresponds to the WCTMs' null hypothesis (and harmful shifts violate it), as follows:

  • Benign: The betting martingale's null hypothesis is that the WCP \(p\)-values are IID \(\text{Unif}[0, 1]\) (Appendix 7.2); this null implies that coverage is satisfied (exactly) for all \(\alpha\in(0,1)\) (i.e., the martingale's null \(\implies\) the intuitive definition of “benign” regarding coverage validity).

  • Harmful: (Contrapositive of the above.) If coverage is not satisfied (exactly) for some \(\alpha\in (0, 1)\), then the \(p_{t}^{\tilde{w}}\) are not IID \(\text{Unif}[0, 1]\) (i.e., violation of coverage validity \(\implies\) violation of the martingale's null, and thus the possibility of detection). Larger violations are easier to detect, and thus more likely to quickly raise an alarm. Note that this can be due to under-coverage (safety violation) or over-coverage (uninformative prediction sets); we further penalize trivial over-coverage—i.e., when \(\widehat{C}(X_{n+t})=\mathcal{Y}\)—by using anticonservative WCP \(p\)-values whenever this occurs. (See pseudocode in Appendix 10.)

  • Moderate: A shift may initially be “harmful” as described above, due to the density-ratio estimator having insufficient data, but later become “benign” once enough data has been collected.

Figure 5 provides an ablation study on a synthetic-data example, illustrating WATCH's performance for different magnitudes of covariate shift (in the input \(X\) distribution). Each row corresponds to a specific magnitude of covariate shift and illustrates WATCH's response regarding coverage (prediction safety), interval widths (prediction informativeness), and WCTMs (monitoring criteria for alarms). The post-changepoint test points are sampled from the full source distribution with probabilities proportional to \(\exp(\lambda\,|x-18|)\); larger values of \(\lambda\) thus correspond to more severe covariate shift toward extreme (and particularly toward large) values of the input \(X\). Experiments are averaged over 20 seeds.


Figure 5: Ablation study on a synthetic-data example, illustrating WATCH's performance for different magnitudes of covariate shift (in the input \(X\) distribution).

9.3 Ablation Experiments for Density-Ratio Weight Estimation↩︎

Another factor determining whether a covariate shift can be considered benign or harmful to deployment is whether the deployed density-ratio estimator is well-specified and thus able to approximate the shift. Figure 6 provides a selected synthetic-data example where logistic regression is a misspecified probabilistic classifier for distinguishing between pre- and post-changepoint data, whereas a neural network (MLP) is able to accurately discriminate between the same pre- and post-changepoint data. That is, in this example the pre- and post-changepoint data are not linearly separable in the input \(X\) domain, so logistic regression cannot reliably discriminate between them, and it is thereby unable to reliably estimate density-ratio weights via probabilistic classification. The result is that the changepoint causes a large increase in coverage, despite some adaptation (decreasing interval widths); the estimator's misspecification thus causes the WCTM to raise an alarm, indicating that the covariate shift cannot be adapted to by the estimator. In contrast, the MLP estimator is able to adapt appropriately by maintaining target coverage, improving interval sharpness, and avoiding unnecessary alarms.
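A minimal sketch of density-ratio weight estimation via probabilistic classification (a standard reduction; the function name and the specific MLP hyperparameters below are illustrative, not our exact configuration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def density_ratio_weights(X_source, X_target, clf=None):
    """Estimate w(x) ~ p_target(x) / p_source(x) by probabilistic classification:
    train a classifier to distinguish source (label 0) from target (label 1);
    then w(x) = [p(1|x) / p(0|x)] * (n_source / n_target)."""
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500) if clf is None else clf
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf.fit(X, y)
    proba = np.clip(clf.predict_proba(X_source)[:, 1], 1e-6, 1 - 1e-6)
    return (proba / (1.0 - proba)) * (len(X_source) / len(X_target))

# Misspecified alternative (cf. Figure 6): a linear model cannot separate
# pre/post data that are not linearly separable, so its weights are unreliable.
# w_bad = density_ratio_weights(X_source, X_target, clf=LogisticRegression())
```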


Figure 6: Ablation study on the density-ratio estimator for synthetic data.

9.4 Ablation Experiments for Betting Function↩︎

The primary role of the betting function in (W)CTMs, and in testing-by-betting more broadly, is to quickly reject the null hypothesis (i.e., raise an alarm) when it is violated. Figure 7 provides an ablation experiment on the betting function used for \(X\)-CTMs and WCTMs, on three settings of the synthetic-data example. The “Composite” Jumper betting function is the betting function used in all other experiments; it is an average of Simple Jumper betting functions over the “jumping parameters” \(J\in \{0.0001, 0.001, 0.01, 0.1, 1\}\), whereas here we set the Simple Jumper baseline to have \(J=0.01\). See Vovk et al. (2021) for pseudocode and exposition of the Simple Jumper algorithm. \(J=1\) means conservatively spreading bets across all options to avoid cumulative losses, while smaller \(J\) encourages “doubling down” on bets that were previously successful. The CTMs with Composite betting are thus lower bounded at \(M_t = 0.2\), whereas those with Simple betting can continually decrease, resulting in slightly delayed detection relative to Composite.
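For reference, here is a brief sketch of the Simple Jumper and its composite average, following our reading of the pseudocode in Vovk et al. (2021); the state set \(\epsilon \in \{-1, 0, 1\}\) and betting function \(f_\epsilon(p) = 1 + \epsilon(p - 1/2)\) are paraphrased from that reference and should be checked against it.

```python
import numpy as np

def simple_jumper(p_values, J=0.01):
    """Simple Jumper betting martingale (sketch): capital is split across constant
    bets eps in {-1, 0, 1} with betting function f_eps(p) = 1 + eps * (p - 1/2);
    at each step a fraction J of total capital 'jumps' uniformly across states."""
    eps = np.array([-1.0, 0.0, 1.0])
    capital = np.full(3, 1.0 / 3.0)
    path = []
    for p in p_values:
        total = capital.sum()
        capital = (1.0 - J) * capital + (J / 3.0) * total  # redistribute ("jump")
        capital *= 1.0 + eps * (p - 0.5)                   # settle bets on this p-value
        path.append(capital.sum())
    return np.array(path)

def composite_jumper(p_values, Js=(1e-4, 1e-3, 1e-2, 1e-1, 1.0)):
    """Average of Simple Jumper martingales over the jumping parameters."""
    return np.mean([simple_jumper(p_values, J) for J in Js], axis=0)
```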


Figure 7: Ablation study on the betting function.

9.5 WCTM Quickly Reacts to Harmful Shifts↩︎

For the image-data experiments, Figure 10 provides additional example results on harmful shifts with different corruption types.


Figure 10: Results on CIFAR-10 with various corruption types, all at the highest severity level. The WCTM reacts more quickly than the standard CTM under these conditions; moreover, with several types of corruption, the standard CTM does not raise any alarms at all.

9.6 Further Details of Image Classification Experiments↩︎

We provide further discussion of the image classification experiments from Section 4.2. Figure 11 shows the coverage rates and prediction-set sizes under the different corruption levels discussed in the main paper. Details of model architectures and training configurations can be found in Tables 3–6.


Figure 11: These results supplement Figure 3 in the main paper. They demonstrate the coverage rate and prediction set size under four different corruption scenarios. In the multi-class classification setting, we adopt metrics different from those used in the regression experiments and follow [83] in measuring the prediction set size and coverage rate (defined as the proportion of true classes in the specified range, NOT conformal coverage) as principled risk metrics for distinguishing benign from harmful shifts. We increased the size of the validation set and the number of samples visualized to yield more robust performance and a clearer view of the trajectory; the results are averaged over a window size of 200, while all other configurations remain unchanged from the original setting. As discussed in the paper, we mixed test samples (target, corrupted) with validation samples (source, clean) to improve the estimation of weights for CTMs. Under corrupted scenarios, the “starting points” before the changepoints therefore differ for the WCTM and CTM, as the mixture allows the validation set to contain corrupted data; however, this difference is not clearly reflected in the martingale paths. Overall, the models initially exhibit relatively high classification performance under the clean setting, while the CTM rapidly declines to a lower performance level under all corruption conditions. Although the WCTM adapts to changes in benign scenarios, it eventually exhibits severe metric changes under extreme shifts as well, which corresponds to the results in Figure 3.

Table 3: MNISTDiscriminator Model Details

  • Model Name: MNISTDiscriminator
  • Purpose: Binary classifier to distinguish between source (uncorrupted) and target (corrupted) MNIST data
  • Architecture Type: Convolutional Neural Network (CNN) with 2 conv blocks + FC layers
  • Input Shape: (Batch_size, 1, 28, 28), grayscale MNIST images
  • Output Shape: (Batch_size, 2), binary classification logits
  • Total Parameters: \(\approx\)1.3M
  • Layers: First Conv Block: Conv2d(1\(\rightarrow\)32, 3\(\times\)3) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) MaxPool(2\(\times\)2) \(\rightarrow\) Dropout; Second Conv Block: Conv2d(32\(\rightarrow\)64, 3\(\times\)3) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) MaxPool(2\(\times\)2) \(\rightarrow\) Dropout; Fully Connected: Linear(64\(\times\)7\(\times\)7\(\rightarrow\)128) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) Dropout; Output: Linear(128\(\rightarrow\)2)
  • Regularization: Dropout (rate=0.3), BatchNormalization
  • Calibration Method: Temperature scaling (initial temp=1.5)
  • Activation Functions: ReLU
  • Loss Function: Cross-entropy (implicit in the code)
  • Training Epochs: 30
  • Batch Size: 64
  • Learning Rate: 0.001
  • Optimizer: Adam
  • Special Features: Temperature scaling parameter for improved probability calibration
  • Usage Context: Used for estimating likelihood ratios in weighted conformal prediction
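For reproducibility, here is a PyTorch sketch consistent with Table 3; this is our reconstruction from the table, not the verbatim training code (padding is inferred so that the \(64\times 7\times 7\) flattened size works out, and the parameter count of this sketch may differ from the reported \(\approx\)1.3M).

```python
import torch
import torch.nn as nn

class MNISTDiscriminator(nn.Module):
    """Binary source-vs-target discriminator with temperature scaling (cf. Table 3)."""
    def __init__(self, dropout=0.3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(dropout),                     # 28 -> 14
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(dropout),                     # 14 -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 2),
        )
        # Learnable temperature for probability calibration (initialized at 1.5).
        self.temperature = nn.Parameter(torch.tensor(1.5))

    def forward(self, x):
        return self.classifier(self.features(x)) / self.temperature
```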
Table 4: CIFAR10 Discriminator Model Details

  • Model Name: CIFAR10Discriminator
  • Purpose: Binary classifier to distinguish between source (uncorrupted) and target (corrupted) CIFAR-10 data
  • Architecture Type: Convolutional Neural Network (CNN) with 3 conv blocks + FC layers
  • Input Shape: (Batch_size, 3, 32, 32), RGB CIFAR-10 images
  • Output Shape: (Batch_size, 2), binary classification logits
  • Total Parameters: \(\approx\)4.8M
  • Layers: First Conv Block: Conv2d(3\(\rightarrow\)64, 3\(\times\)3) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) MaxPool(2\(\times\)2) \(\rightarrow\) Dropout; Second Conv Block: Conv2d(64\(\rightarrow\)128, 3\(\times\)3) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) MaxPool(2\(\times\)2) \(\rightarrow\) Dropout; Third Conv Block: Conv2d(128\(\rightarrow\)256, 3\(\times\)3) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) MaxPool(2\(\times\)2) \(\rightarrow\) Dropout; First FC Layer: Linear(256\(\times\)4\(\times\)4\(\rightarrow\)512) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) Dropout; Second FC Layer: Linear(512\(\rightarrow\)128) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) Dropout; Output: Linear(128\(\rightarrow\)2)
  • Regularization: Dropout (rate=0.3), BatchNormalization at each layer
  • Calibration Method: Temperature scaling (initial temp=1.5)
  • Activation Functions: ReLU throughout the network
  • Loss Function: Cross-entropy
  • Training Epochs: 30
  • Batch Size: 64
  • Learning Rate: 0.001
  • Optimizer: Adam
  • Special Features: Temperature scaling parameter for improved probability calibration
  • Usage Context: Used for estimating likelihood ratios in weighted conformal prediction
Table 5: RegularizedMNISTModel Details

  • Model Name: RegularizedMNISTModel
  • Purpose: Classification model for MNIST digits (0-9) with robustness to corrupted images
  • Architecture Type: Convolutional Neural Network (CNN) with 3 conv blocks + FC layers
  • Input Shape: (Batch_size, 1, 28, 28), grayscale MNIST images
  • Output Shape: (Batch_size, 10), logits for 10-class digit classification
  • Total Parameters: \(\approx\)600K
  • Layers: First Conv Block: Conv2d(1\(\rightarrow\)32, 3\(\times\)3) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) MaxPool(2\(\times\)2) \(\rightarrow\) Dropout; Second Conv Block: Conv2d(32\(\rightarrow\)64, 3\(\times\)3) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) MaxPool(2\(\times\)2) \(\rightarrow\) Dropout; Third Conv Block: Conv2d(64\(\rightarrow\)128, 3\(\times\)3) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) MaxPool(2\(\times\)2) \(\rightarrow\) Dropout; Fully Connected: Linear(128\(\times\)3\(\times\)3\(\rightarrow\)256) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU \(\rightarrow\) Dropout; Output: Linear(256\(\rightarrow\)10)
  • Regularization: Dropout (rate=0.3), BatchNormalization at each layer
  • Activation Functions: ReLU throughout the network
  • Loss Function: Cross-entropy
  • Training Epochs: 30
  • Batch Size: 64
  • Learning Rate: 0.001
  • Optimizer: Adam
  • Special Features: Extensive regularization for robustness to corrupted images
  • Usage Context: Primary classification model for MNIST digits in the conformal prediction framework
Table 6: ResNet20 Model Details

  • Model Name: ResNet20
  • Purpose: Classification model for CIFAR-10 images
  • Architecture Type: Residual Network (ResNet v1) with basic blocks
  • Input Shape: (Batch_size, 3, 32, 32), RGB CIFAR-10 images
  • Output Shape: (Batch_size, 10), logits for 10-class CIFAR-10 classification
  • Total Parameters: \(\approx\)270K
  • Depth: 20 layers (1 initial conv + 18 layers in blocks + 1 final linear)
  • Block Structure: BasicBlock: Conv \(\rightarrow\) BN \(\rightarrow\) ReLU \(\rightarrow\) Conv \(\rightarrow\) BN + shortcut connection, followed by ReLU
  • Network Architecture: Initial Layer: Conv2d(3\(\rightarrow\)16, 3\(\times\)3) \(\rightarrow\) BatchNorm \(\rightarrow\) ReLU; Stage 1: 3 BasicBlocks (16 channels, stride=1); Stage 2: 3 BasicBlocks (32 channels, stride=2); Stage 3: 3 BasicBlocks (64 channels, stride=2); Output: Global AvgPool \(\rightarrow\) Linear(64\(\rightarrow\)10)
  • Regularization: BatchNormalization in each block
  • Activation Functions: ReLU
  • Weight Initialization: Kaiming normal for convolutional layers, constant for batch normalization
  • Loss Function: Cross-entropy
  • Training Epochs: 30
  • Batch Size: 64
  • Learning Rate: 0.001
  • Optimizer: Adam
  • Special Features: Skip connections (residual learning) for better gradient flow
  • Usage Context: Primary classification model for CIFAR-10 in the conformal prediction framework

10 Algorithms↩︎

Only limited algorithm pseudocode is provided at this time; more comprehensive pseudocode will be included in a final version of this paper.


Figure 12: Calculate weighted conformal prediction set for covariate shift [18].
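Pending the full pseudocode, here is a minimal sketch of the weighted split-CP set of [18] for regression with absolute-residual scores (the function name and interface are illustrative):

```python
import numpy as np

def weighted_conformal_interval(mu_hat_x, cal_scores, cal_w, test_w, alpha=0.1):
    """Weighted split-CP interval (Tibshirani et al., 2019): calibration scores
    receive likelihood-ratio weights, the test point places weight test_w on a
    point mass at +infinity, and the half-width is the level-(1 - alpha)
    quantile of the resulting weighted score distribution."""
    order = np.argsort(cal_scores)
    scores, w = cal_scores[order], cal_w[order]
    total = w.sum() + test_w
    cum = np.cumsum(w) / total               # weighted CDF over finite scores
    idx = np.searchsorted(cum, 1.0 - alpha)  # smallest score reaching level 1 - alpha
    if idx >= len(scores):                   # quantile falls on the +infinity mass
        return -np.inf, np.inf               # trivial (noninformative) interval
    q = scores[idx]
    return mu_hat_x - q, mu_hat_x + q
```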


Figure 13: Calculate weighted conformal \(p\)-value that penalizes noninformativeness.
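Similarly, here is a minimal sketch of the weighted conformal \(p\)-value with the noninformativeness penalty described in Appendix 9.2 (the anticonservative variant is used whenever the prediction set was trivial; the interface is illustrative):

```python
import numpy as np

def wcp_p_value_penalized(cal_scores, cal_w, v_test, w_test, trivial_set, u=None):
    """WCP p-value with a noninformativeness penalty: the randomized p-value is
    used by default, but the anticonservative variant (strict inequalities only)
    is used whenever the prediction set was trivial, i.e., covered all of Y."""
    total = cal_w.sum() + w_test
    p_plus = (cal_w[cal_scores >= v_test].sum() + w_test) / total
    p_minus = cal_w[cal_scores > v_test].sum() / total
    if trivial_set:
        return p_minus  # penalize trivial over-coverage (cf. Appendix 9.2)
    u = np.random.uniform() if u is None else u
    return p_minus + u * (p_plus - p_minus)
```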


Figure 14: WCTMs for (1) adapting to mild shifts in \(X\), (2) detecting harmful shifts, and (3) root-cause analysis.

References↩︎

[1]
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
[2]
Vovk, V., Petej, I., Nouretdinov, I., Ahlberg, E., Carlsson, L., and Gammerman, A. (2021). Retrain or not retrain: Conformal test martingales for change-point detection. In Conformal and Probabilistic Prediction and Applications, pages 191–210. PMLR.
[3]
Podkopaev, A. and Ramdas, A. (2021b). Tracking the risk of a deployed model and detecting harmful distribution shifts. arXiv preprint arXiv:2110.06177.
[4]
Feng, J., Phillips, R. V., Malenica, I., Bishara, A., Hubbard, A. E., Celi, L. A., and Pirracchio, R. (2022). Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. NPJ Digital Medicine, 5(1):66.
[5]
Feng, J., Xia, F., Singh, K., and Pirracchio, R. (2025). Not all clinical ai monitoring systems are created equal: Review and recommendations. NEJM AI, 2(2):AIra2400657.
[6]
Adams, R., Henry, K. E., Sridharan, A., Soleimani, H., Zhan, A., Rawat, N., Johnson, L., Hager, D. N., Cosgrove, S. E., Markowski, A., et al. (2022). Prospective, multi-site study of patient outcomes after implementation of the trews machine learning-based early warning system for sepsis. Nature medicine, 28(7):1455–1460.
[7]
Finlayson, S. G., Subbaswamy, A., Singh, K., Bowers, J., Kupke, A., Zittrain, J., Kohane, I. S., and Saria, S. (2021). The clinician and dataset shift in artificial intelligence. New England Journal of Medicine, 385(3):283–286.
[8]
Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). Game-theoretic statistics and safe anytime-valid inference. Statistical Science, 38(4):576–601.
[9]
Volkhonskiy, D., Burnaev, E., Nouretdinov, I., Gammerman, A., and Vovk, V. (2017). Inductive conformal martingales for change-point detection. In Conformal and Probabilistic Prediction and Applications, pages 132–153. PMLR.
[10]
Vovk, V. (2021). Testing randomness online. Statistical Science, 36(4):595–611.
[11]
Papadopoulos, H. (2008). Inductive conformal prediction: Theory and application to neural networks. In Tools in artificial intelligence. Citeseer.
[12]
Vovk, V., Gammerman, A., and Shafer, G. (2022). Algorithmic Learning in a Random World. Springer Nature.
[13]
Foygel Barber, R., Candes, E. J., Ramdas, A., and Tibshirani, R. J. (2021). The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455–482.
[14]
Shafer, G. and Vovk, V. (2019). Game-theoretic foundations for probability and finance, volume 455. John Wiley & Sons.
[15]
Shafer, G. (2021). Testing by betting: A strategy for statistical and scientific communication. Journal of the Royal Statistical Society Series A: Statistics in Society, 184(2):407–431.
[16]
Vovk, V. and Wang, R. (2021). E-values: Calibration, combination and applications. The Annals of Statistics, 49(3):1736–1754.
[17]
Ville, J. (1939). Etude critique de la notion de collectif. Gauthier-Villars Paris.
[18]
Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019). Conformal prediction under covariate shift. Advances in neural information processing systems, 32.
[19]
Podkopaev, A. and Ramdas, A. (2021a). Distribution-free uncertainty quantification for classification under label shift. In Uncertainty in artificial intelligence, pages 844–853. PMLR.
[20]
Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. (2023). Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845.
[21]
Prinster, D., Stanton, S. D., Liu, A., and Saria, S. (2024). Conformal validity guarantees exist for any data distribution (and how to find them). Forty-first International Conference on Machine Learning.
[22]
Barber, R. F. and Tibshirani, R. J. (2025). Unifying different theories of conformal prediction. arXiv preprint arXiv:2504.02292.
[23]
Jin, Y. and Candès, E. J. (2023). Model-free selective inference under covariate shift via weighted conformal p-values. arXiv preprint arXiv:2307.09291.
[24]
Vovk, V. (2002). On-line confidence machines are well-calibrated. In The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings., pages 187–196. IEEE.
[25]
Vovk, V., Nouretdinov, I., and Gammerman, A. (2003). Testing exchangeability on-line. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 768–775.
[26]
Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244.
[27]
Sugiyama, M., Krauledat, M., and Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(5).
[28]
Yang, Y., Kuchibhotla, A. K., and Tchetgen Tchetgen, E. (2024). Doubly robust calibration of prediction sets under covariate shift. Journal of the Royal Statistical Society Series B: Statistical Methodology, page qkae009.
[29]
Zhang, Y.-J., Zhang, Z.-Y., Zhao, P., and Sugiyama, M. (2024). Adapting to continuous covariate shift via online density ratio estimation. Advances in Neural Information Processing Systems, 36.
[30]
Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2):100–115.
[31]
Roberts, S. (1966). A comparison of some control chart procedures. Technometrics, 8(3):411–430.
[32]
Shiryaev, A. N. (1963). On optimum methods in quickest detection problems. Theory of Probability & Its Applications, 8(1):22–46.
[33]
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
[34]
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
[35]
Berrett, T. B., Wang, Y., Barber, R. F., and Samworth, R. J. (2020). The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(1):175–197.
[36]
Wald, A. (1945). Sequential tests of statistical hypotheses. In Breakthroughs in statistics: Foundations and basic theory, pages 117–186. Springer.
[37]
Tartakovsky, A., Nikiforov, I., and Basseville, M. (2014). Sequential analysis: Hypothesis testing and changepoint detection. CRC press.
[38]
Veeravalli, V. V. and Banerjee, T. (2014). Quickest change detection. In Academic press library in signal processing, volume 3, pages 209–255. Elsevier.
[39]
Xie, L., Zou, S., Xie, Y., and Veeravalli, V. V. (2021). Sequential (quickest) change detection: Classical results and new directions. IEEE Journal on Selected Areas in Information Theory, 2(2):494–514.
[40]
Ho, S.-S. (2005). A martingale framework for concept change detection in time-varying data streams. In Proceedings of the 22nd international conference on Machine learning, pages 321–327.
[41]
Fedorova, V., Gammerman, A., Nouretdinov, I., and Vovk, V. (2012). Plug-in martingales for testing exchangeability on-line. arXiv preprint arXiv:1204.3251.
[42]
Ho, S.-S., Schofield, M., Sun, B., Snouffer, J., and Kirschner, J. (2019). A martingale-based approach for flight behavior anomaly detection. In 2019 20th IEEE International Conference on Mobile Data Management (MDM), pages 43–52. IEEE.
[43]
Eliades, C. and Papadopoulos, H. (2022). A betting function for addressing concept drift with conformal martingales. In Conformal and Probabilistic Prediction with Applications, pages 219–238. PMLR.
[44]
Eliades, C. and Papadopoulos, H. (2023). A conformal martingales ensemble approach for addressing concept drift. In Conformal and Probabilistic Prediction with Applications, pages 328–346. PMLR.
[45]
Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic learning in a random world, volume 29. Springer.
[46]
Angelopoulos, A. N., Barber, R. F., and Bates, S. (2024). Theoretical foundations of conformal prediction. arXiv preprint arXiv:2411.11824.
[47]
Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. (2002). Inductive confidence machines for regression. In Machine learning: ECML 2002: 13th European conference on machine learning Helsinki, Finland, August 19–23, 2002 proceedings 13, pages 345–356. Springer.
[48]
Bar, Y., Shaer, S., and Romano, Y. (2024). Protected test-time adaptation via online entropy matching: A betting approach. arXiv preprint arXiv:2408.07511.
[49]
Hindy, A., Luo, R., Banerjee, S., Kuck, J., Schmerling, E., and Pavone, M. (2024). Diagnostic runtime monitoring with martingales. arXiv preprint arXiv:2407.21748.
[50]
Vovk, V. (2020). Testing for concept shift online. arXiv preprint arXiv:2012.14246.
[51]
Waudby-Smith, I. and Ramdas, A. (2024). Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1):1–27.
[52]
I Amoukou, S., Bewley, T., Mishra, S., Lecue, F., Magazzeni, D., and Veloso, M. (2024). Sequential harmful shift detection without labels. Advances in Neural Information Processing Systems, 37:129279–129302.
[53]
Ramdas, A., Ruf, J., Larsson, M., and Koolen, W. M. (2022). Testing exchangeability: Fork-convexity, supermartingales and e-processes. International Journal of Approximate Reasoning, 141:83–109.
[54]
Wasserman, L., Ramdas, A., and Balakrishnan, S. (2020). Universal inference. Proceedings of the National Academy of Sciences, 117(29):16880–16890.
[55]
Saha, A. and Ramdas, A. (2024). Testing exchangeability by pairwise betting. In International Conference on Artificial Intelligence and Statistics, pages 4915–4923. PMLR.
[56]
Fischer, L. and Ramdas, A. (2025). Sequential monte carlo testing by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, page qkaf014.
[57]
Shin, J., Ramdas, A., and Rinaldo, A. (2022). E-detectors: A nonparametric framework for sequential change detection. arXiv preprint arXiv:2203.03532.
[58]
Shekhar, S. and Ramdas, A. (2023a). Nonparametric two-sample testing by betting. IEEE Transactions on Information Theory, 70(2):1178–1203.
[59]
Shekhar, S. and Ramdas, A. (2023b). Reducing sequential change detection to sequential estimation. arXiv preprint arXiv:2309.09111.
[60]
Shekhar, S. and Ramdas, A. (2023c). Sequential changepoint detection via backward confidence sequences. In International Conference on Machine Learning, pages 30908–30930. PMLR.
[61]
Podkopaev, A. and Ramdas, A. (2023). Sequential predictive two-sample and independence testing. Advances in neural information processing systems, 36:53275–53307.
[62]
Teneggi, J. and Sulam, J. (2024). Testing semantic importance via betting. Advances in Neural Information Processing Systems, 37:76450–76499.
[63]
Shaer, S., Maman, G., and Romano, Y. (2023). Model-x sequential testing for conditional independence via testing by betting. In International Conference on Artificial Intelligence and Statistics, pages 2054–2086. PMLR.
[64]
Bates, S., Candès, E., Lei, L., Romano, Y., and Sesia, M. (2023). Testing for outliers with conformal p-values. The Annals of Statistics, 51(1):149–178.
[65]
Bashari, M., Epstein, A., Romano, Y., and Sesia, M. (2023). Derandomized novelty detection with fdr control via conformal e-values. Advances in Neural Information Processing Systems, 36:65585–65596.
[66]
Vovk, V. and Wang, R. (2023). Confidence and discoveries with e-values. Statistical Science, 38(2):329–354.
[67]
Gauthier, E., Bach, F., and Jordan, M. I. (2025). E-values expand the scope of conformal prediction. arXiv preprint arXiv:2503.13050.
[68]
Lee, Y. and Ren, Z. (2025). Selection from hierarchical data with conformal e-values. arXiv preprint arXiv:2501.02514.
[69]
Xu, C. and Xie, Y. (2021). Conformal prediction interval for dynamic time-series. In International Conference on Machine Learning, pages 11559–11569. PMLR.
[70]
Fannjiang, C., Bates, S., Angelopoulos, A. N., Listgarten, J., and Jordan, M. I. (2022). Conformal prediction under feedback covariate shift for biomolecular design. Proceedings of the National Academy of Sciences, 119(43):e2204569119.
[71]
Prinster, D., Liu, A., and Saria, S. (2022). Jaws: Auditing predictive uncertainty under covariate shift. Advances in Neural Information Processing Systems, 35:35907–35920.
[72]
Prinster, D., Saria, S., and Liu, A. (2023). Jaws-x: addressing efficiency bottlenecks of conformal prediction under standard and feedback covariate shift. In International Conference on Machine Learning, pages 28167–28190. PMLR.
[73]
Stanton, S., Maddox, W., and Wilson, A. G. (2023). Bayesian optimization with conformal prediction sets. In International Conference on Artificial Intelligence and Statistics, pages 959–986. PMLR.
[74]
Farinhas, A., Zerva, C., Ulmer, D., and Martins, A. F. (2023). Non-exchangeable conformal risk control. arXiv preprint arXiv:2310.01262.
[75]
Nair, Y. and Janson, L. (2023). Randomization tests for adaptively collected data. arXiv preprint arXiv:2301.05365.
[76]
Feldman, S. and Romano, Y. (2024). Robust conformal prediction using privileged information. arXiv preprint arXiv:2406.05405.
[77]
Shiryaev, A. N. (2016). Probability-1, volume 95. Springer.
[78]
Cohen, J. W., Cohen, S. B., and Banthin, J. S. (2009). The medical expenditure panel survey: a national information resource to support healthcare cost research and inform policy and practice. Medical care, 47(7_Supplement_1):S44–S50.
[79]
Hamidieh, K. (2018). Superconductivity Data. UCI Machine Learning Repository. https://doi.org/10.24432/C53P47.
[80]
Fanaee-T, H. (2013). Bike Sharing Dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C5W894.
[81]
Mu, N. and Gilmer, J. (2019). Mnist-c: A robustness benchmark for computer vision. arXiv preprint arXiv:1906.02337.
[82]
Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations.
[83]
Romano, Y., Sesia, M., and Candes, E. (2020). Classification with valid and adaptive coverage. Advances in neural information processing systems, 33:3581–3591.

  1. This corresponds to a split conformal setting [11], but our theory also extends to full conformal [12], which avoids splitting data at a heavy computational cost.↩︎

  2. Here, marginal means on average over the draw of calibration and test data; see, e.g., [13] for more details.↩︎

  3. For \(t>1\), the calibration data might be kept fixed (i.e., \(Z_{1:n}\)), or it may include past test points (i.e., \(Z_{1:(n+t-1)}\)); for now, we focus on \(t=1\) to avoid this distinction and simplify exposition.↩︎

  4. Conservative conformal \(p\)-values set \(u_{n+1}:=1\); the random variable \(P_{n+1}\) corresponding to \(p_{n+1}\) is called a \(p\)-variable, but we will often refer to both as \(p\)-values for more common terminology.↩︎

  5. Here, “weighted” refers to weights on the score distribution prior to computing a \(p\)-value, not to weighting the \(p\)-value itself.↩︎

  6. Conservatively valid weighted CP methods would need to further set \(u_{n+1}=1\) in Eq. 9 to reduce appropriately.↩︎

  7. More generally, the Radon–Nikodym derivative.↩︎

  8. More generally, the Radon–Nikodym derivative.↩︎