September 09, 2025
Increased access to computing resources has led to the development of algorithms that can run efficiently on multi-core processing units or in distributed computing environments. In the context of Bayesian inference, many parallel computing approaches to fitting statistical models have been proposed within the framework of Markov chain Monte Carlo methods, but they either offer limited gains due to high latency costs or rely on model-specific decompositions. Alternatively, adaptive importance sampling, sequential Monte Carlo, and recursive Bayesian methods provide a parallel-friendly and asymptotically exact framework with well-developed theory for error estimation. We propose a recursive adaptive importance sampling approach that alternates between fast recursive weight updates and sample replenishment steps to balance computational efficiency and sample quality. We derive theoretical results to determine the optimal allocation of replenishment steps, and demonstrate the efficacy of our method in simulated experiments and in an application to sea surface temperature prediction in the Gulf of Mexico using Gaussian processes.
Recursive Adaptive Importance Sampling With Optimal Replenishment
Keywords: Bayesian, Gaussian Process, Parallelization, Sea Surface Temperature, Spatial Statistics
The development of methods to fit Bayesian models has been an active area of research, with impact in essentially all fields of applied science. However, with the recent increase in the availability of large data sets, it has become necessary to make use of distributed computing environments and widely available multi-core processors to solve large statistical problems quickly while facilitating accurate inference.
To achieve this, within the well-established framework of Markov chain Monte Carlo (MCMC) metropolis1953equation?, hastings1970monte?, geman1984stochastic?, many methods have been proposed. Other approaches immediately benefited from the availability of parallel computing resources due to their intrinsic structure, such as sequential Monte Carlo (SMC) gordon1993novel?, kitagawa1996monte?, adaptive importance sampling (AIS) kloek1978bayesian?, naylor1988econometric?, oh1992adaptive?, and, more recently, recursive Bayesian (RB) methods hooten2021making?, taylor2024fast?. We propose a general approach to fit Bayesian models that combines AIS and RB strategies. To balance computational efficiency and estimation accuracy, our method alternates between fast deterministic weight updates and sample replenishment steps, and we derive theoretical results that leverage the concentration rate of the posterior distribution to ensure an asymptotically optimal trade-off even under model misspecification. Our approach uses parallelization to accelerate sampling from general low-dimensional continuous or discrete distributions, and we show it yields appreciable computational gains even when used to fit well-optimized and widely applied models, such as the geostatistical model with Vecchia approximation.
In the context of MCMC, there are three main approaches to leveraging parallel computing. The first two aim to either parallelize computation within each step of a single chain (e.g., brockwell2006parallel?, yan2007parallelizing?, byrd2008reducing?) or run chains in parallel while sharing information intermittently to accelerate mixing (e.g., geyer1991markov?, nishihara2014parallel?, biron2025automatic?), both of which incur a high latency cost due to frequent information transmission between parallel processors. The third approach relies on running independent chains in parallel and combining the output afterward. A naive implementation with identical chains is known to yield diminishing returns due to burn-in rosenthal2000parallel?, wilkinson2006parallel?, and thus modified approaches that decompose the posterior distribution to reduce the cost of each chain were introduced (e.g., neiswanger2014asymptotically?, wang2015parallelizing?, scott2016bayes?). These are known as embarrassingly parallel MCMC methods. However, efficient decompositions of the posterior distribution thus far rely on strong model assumptions, and hence are model-specific.
As alternatives, AIS and SMC can naturally leverage parallel resources to accelerate computation, similar to importance sampling (IS), while having well-developed error estimation and convergence theory (see geweke1989bayesian?, delyon2018asymptotic?). Despite their promising features, both approaches also have limited gains with parallelization due to the communication imposed by the operations of normalizing IS weights and updating proposal distributions rosen2011efficient?. Ways of mitigating these issues have been considered (e.g., chao2010efficient?, murray2016parallel?), but latency remains a potential bottleneck when it comes to parallelization.
More recently, there has been renewed interest in recursive Bayesian methods (e.g., taylor2025generative?, ren2025multi?, scharf2025strategy?). These methods use the natural prior-to-posterior update from Bayes’ theorem to incorporate sample information in a multi-stage approach. RB methods fall between AIS and SMC, and therefore share many of their features and advantages such as ease of parallelization, as well as disadvantages (e.g., latency and sample degeneracy).
We introduce a RB approach to fit Bayesian models that uses AIS to minimize issues associated with latency and degeneracy. Following gelfand1990sampling?, we use \([\cdot]\) to generically represent probability density or mass functions, and we denote the data, prior, and posterior distributions as \([\boldsymbol{y}_{1:n} | \boldsymbol{\theta}]\), \([\boldsymbol{\theta}]\), and \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\), respectively, where \(\boldsymbol{y}_{1:n} = (y_{1}, \dots, y_{n})'\) represents the vector of observations and \(\boldsymbol{\theta}\) is the parameter of interest. Following the convention of recursive methods, we decompose the full joint posterior distribution as \[\label{eq:recursion32identity} [\boldsymbol{\theta} | \boldsymbol{y}_{1:n+1}] = \dfrac{[\boldsymbol{y}_{1:n+1} | \boldsymbol{\theta}] [\boldsymbol{\theta}]}{[\boldsymbol{y}_{1:n+1}]} = \dfrac{[\boldsymbol{y}_{n+1} | \boldsymbol{\theta}, \boldsymbol{y}_{1:n}] [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]}{[\boldsymbol{y}_{n+1} | \boldsymbol{y}_{1:n}]} \propto [\boldsymbol{y}_{n+1} | \boldsymbol{\theta}, \boldsymbol{y}_{1:n}] [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\tag{1}\] to update the IS weights in an “online” manner. To counteract degeneracy, we use the weights to build approximations of partially updated posterior distributions that are then used as proposal distributions for sample replenishment. A naive introduction of replenishment (i.e., sampling from a proposal after every weight update) leads to high computational cost due to the frequent recalculation of the IS weights and construction of the approximation. Thus, we propose to allow gaps between replenishment times to grow exponentially, and derive theoretical results that guarantee that this choice is asymptotically optimal even under model misspecification, allowing us to control the asymptotic cost and degeneracy of the procedure while minimizing latency. For the aforementioned reasons, we refer to the resulting approach as recursive adaptive importance sampling with optimal replenishment (RAISOR).
We begin with a review of basic concepts to introduce the notation and present our proposed approach in Section 2. Section 3 contains our theoretical contributions, which include the proof of the asymptotic optimality of our proposed replenishment heuristic and an analysis of the asymptotic cost of our method. Section 4 demonstrates the effectiveness of our method in simulated experiments and in the problem of sea surface temperature estimation using in-situ data from ships, moored buoys, and Argo floats. Finally, Section 5 discusses advantages and weaknesses of the proposed approach, and includes our perspective on future work.
Our interest lies in evaluating integrals of the form \[\label{eq:IS32goal} I_{n}(f) = \int f(\boldsymbol{\theta}) \;[\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] \;d\boldsymbol{\theta} = \mathbb{E}\bigg(f(\boldsymbol{\theta}) \bigg| \boldsymbol{y}_{1:n} \bigg),\tag{2}\] for some measurable \(f : \mathbb{R}^{d} \to \mathbb{R}\) with \(\mathbb{E}(|f(\boldsymbol{\theta})| | \boldsymbol{y}_{1:n} ) < +\infty\). If we cannot directly evaluate \(I_{n}(f)\), a simple and well-known Monte Carlo estimator is given by \[\hat{I}_{n}(f) = \frac{1}{M}\sum_{m=1}^{M}f(\boldsymbol{\theta}_{m}), \quad \text{ where } \boldsymbol{\theta}_{1}, \dots, \boldsymbol{\theta}_{M} \overset{\text{i.i.d.}}{\sim} [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}],\] which is both unbiased and consistent for \(I_{n}(f)\). However, this estimator requires us to sample from the posterior distribution directly, which often cannot be achieved. An alternative is to sample from an auxiliary proposal distribution \([\boldsymbol{\theta}]_{*}\), and use the estimator \[\begin{align} \tilde{I}_{n}(f) &= \sum_{m=1}^{M} \tilde{w}(\boldsymbol{\theta}_{m}) f(\boldsymbol{\theta}_{m}), &\text{ where } \tilde{w}(\boldsymbol{\theta}_{m}) &= \frac{w(\boldsymbol{\theta}_{m})}{\sum_{k=1}^{M} w(\boldsymbol{\theta}_{k})}, & w(\boldsymbol{\theta}) &= \frac{[\boldsymbol{\theta} | \boldsymbol{y}_{1:n} ]}{[\boldsymbol{\theta}]_{*}}, \end{align}\] and \(\boldsymbol{\theta}_{1}, \dots, \boldsymbol{\theta}_{M} \overset{\text{i.i.d.}}{\sim} [\boldsymbol{\theta}]_{*}\). In this case, \(\tilde{I}_{n}(f)\) is known as the self-normalized IS estimator, \([\boldsymbol{\theta}]_{*}\) is the IS distribution, \(\{w(\boldsymbol{\theta}_{m})\}_{m=1}^{M}\) are the IS weights, and \(\{\tilde{w}(\boldsymbol{\theta}_{m})\}_{m=1}^{M}\) are the self-normalized weights. It is worth highlighting that to evaluate \(\tilde{w}(\boldsymbol{\theta})\) we only need to compute the posterior up to a normalizing constant, hence \(\tilde{I}_{n}(f)\) can be used without knowledge of \([\boldsymbol{y}_{1:n}]\). Additionally, note that \(\{(\boldsymbol{\theta}_{m}, \tilde{w}(\boldsymbol{\theta}_{m}))\}_{m=1}^{M}\) allows us to estimate \(I_{n}(f)\) for any choice of \(f\), which includes posterior moments and probabilities, and therefore it can be seen as a weighted sample from the posterior distribution.
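To make the notation concrete, the following R sketch computes a self-normalized IS estimate in a toy setting (a normal target approximated with a heavier-tailed Student-t proposal); the target, proposal, and constants are illustrative and are not taken from the paper.

```r
## Self-normalized importance sampling in a toy problem: the "posterior" is
## Normal(2, 1), the proposal is a Student-t, and we estimate the posterior
## mean. Only the unnormalized target density is needed.
set.seed(1)
M <- 1e5
theta <- rt(M, df = 5)                                   # draws from [theta]_*
log_w <- dnorm(theta, mean = 2, sd = 1, log = TRUE) -    # log target (up to a constant)
  dt(theta, df = 5, log = TRUE)                          # minus log proposal density
w_tilde <- exp(log_w - max(log_w))                       # stabilize before normalizing
w_tilde <- w_tilde / sum(w_tilde)                        # self-normalized weights
sum(w_tilde * theta)                                     # SNIS estimate of the posterior mean; close to 2
```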
Under mild conditions, see geweke1989bayesian?, self-normalized IS estimators are known to be consistent with a bias of order \(O(M^{-1})\) and satisfy a central limit theorem. The latter result can then be used to obtain asymptotically valid confidence intervals for any estimated quantity of interest, allowing for consistent estimation of the Monte Carlo error.
Despite their generality, IS estimators are known to be sensitive to the IS distribution \([\boldsymbol{\theta}]_{*}\), with poor choices leading to large variances, see for instance owen2000safe?. Originally proposed by kloek1978bayesian? and naylor1988econometric?, adaptive IS (AIS) estimators mitigate this issue by introducing a family of distributions \([\boldsymbol{\theta}]_{*} = [\boldsymbol{\theta} | \boldsymbol{\eta}]\), indexed by some potentially infinite dimensional quantity \(\boldsymbol{\eta}\), and searching for the value \(\boldsymbol{\eta}_{*}\) that yields the best estimator.
Many different criteria can be used to determine the optimal choice of \(\boldsymbol{\eta}_{*}\), but one can typically represent this as an optimization problem of the form \[\label{eq:AIS32minimization} \boldsymbol{\eta}_{*} = \arg\min_{\boldsymbol{\eta}} D_{f}\bigg( [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] \bigg\| [\boldsymbol{\theta} | \boldsymbol{\eta}] \bigg),\tag{3}\] for some objective function \(D_{f}(\cdot \| \cdot)\), which is a function of the IS and target distributions that can also depend on the choice of integrand \(f\) in (2 ). In the context of Bayesian inference, we are usually interested in estimating \(I_{n}(f)\) for \(f\) in a class of functions that might be unknown prior to model fitting, so \(D_{f} = D\) is usually chosen to be agnostic with respect to \(f\). Common choices are to take \(D\) as a distance between moments or a divergence between the IS density and the target (e.g., cappe2008adaptive?, cornuet2012adaptive?, guilmeau2024adaptive?). Among divergences, the Kullback-Leibler (KL) divergence is the most popular due to its balance between computational convenience and quality of the resulting approximation, but other alternatives have been considered ryu2014adaptive?.
Choosing \(\boldsymbol{\eta}_{*}\) as the minimizer of a divergence may not be tractable in practice. Such a minimization typically requires one to evaluate posterior expectations, which is the goal of IS estimation in the first place. For this reason, AIS methods involve an iterative approach, with each iteration using the newly-generated samples from the IS distribution to improve the estimate of the objective function, which is then minimized to find a better proposal.
Since its initial conception, multiple variations and improvements of AIS have been considered (e.g., zhang1996nonparametric?, cappe2004population?, martino2015adaptive?), including hybrid AIS and MCMC methods (e.g., botev2013markov?, martino2015adaptive?). For a more comprehensive review, see bugallo2017adaptive?. AIS follows an algorithmically simple structure, but showing the convergence of the corresponding IS estimates in the general case can be challenging due to the flexibility one has when building proposal distributions that depend on previous iterations. Initial efforts to analyze convergence considered particular cases (e.g., oh1992adaptive?, zhang1996nonparametric?), but, more recently, convergence in more general cases was established under suitable regularity conditions delyon2018asymptotic?, akyildiz2021convergence?.
Although AIS is a flexible and parallel-friendly framework to fit general classes of Bayesian models, shortcomings hinder its widespread adoption. In what follows, we discuss these limitations and propose RAISOR as a sequence of adaptations that address them.
In the context of AIS techniques, a common challenge is the choice of the initial IS distribution because it influences the speed of convergence and numerical stability of the procedure. Ideally, it should be close to the optimal proposal, but due to the nature of the problem this is often infeasible. In the context of fitting Bayesian models, guidance regarding the initial proposal is often vague, but a natural idea is to use the prior distribution. In principle, this can be a good option if the prior closely matches the posterior distribution, but when this is not the case (e.g., when using vague priors or when the data are highly informative), the problem persists.
To address this issue, annealed IS methods were proposed (e.g., evans1991chaining?, gramacy2008importance?), introducing a sequence of distributions that lie between the initial proposal and the target, thus smoothly bridging the gap between the distributions and increasing the stability of the procedure. From the perspective of RB methods, a simple way to build a bridge of distributions is to consider the sequence \(\{ [\boldsymbol{\theta} | \boldsymbol{y}_{1:k}] \}_{k=0}^{n}\), which changes from prior to posterior as \(k\) goes to \(n\), with the convention that \([\boldsymbol{\theta} | \boldsymbol{y}_{1:0}] = [\boldsymbol{\theta}]\). Thus, if \(\{(\boldsymbol{\theta}_{m}, \tilde{w}_{k}(\boldsymbol{\theta}_{m}))\}_{m=1}^{M}\) represents a weighted sample targeting \([\boldsymbol{\theta} | \boldsymbol{y}_{1:k}]\), we can use (1 ) to update the weights recursively and without the need to resample by taking \[\label{eq:recursive32weights} \begin{align} \tilde{w}_{k+1}(\boldsymbol{\theta}_{m}) &= \frac{\frac{[\boldsymbol{\theta}_{m} | \boldsymbol{y}_{1:k+1}]}{[\boldsymbol{\theta}_{m}]_{*}}}{ \sum_{j=1}^{M} \frac{[\boldsymbol{\theta}_{j} | \boldsymbol{y}_{1:k+1}]}{[\boldsymbol{\theta}_{j}]_{*}} } = \frac{\frac{[\boldsymbol{\theta}_{m} | \boldsymbol{y}_{1:k}]}{[\boldsymbol{\theta}_{m}]_{*}} [\boldsymbol{y}_{k+1} | \boldsymbol{y}_{1:k}, \boldsymbol{\theta}_{m}]}{ \sum_{j=1}^{M} \frac{[\boldsymbol{\theta}_{j} | \boldsymbol{y}_{1:k}]}{[\boldsymbol{\theta}_{j}]_{*}} [\boldsymbol{y}_{k+1} | \boldsymbol{y}_{1:k}, \boldsymbol{\theta}_{j}] } \\ &= \frac{\tilde{w}_{k}(\boldsymbol{\theta}_{m}) [\boldsymbol{y}_{k+1} | \boldsymbol{y}_{1:k}, \boldsymbol{\theta}_{m}]}{ \sum_{j=1}^{M} \tilde{w}_{k}(\boldsymbol{\theta}_{j}) [\boldsymbol{y}_{k+1} | \boldsymbol{y}_{1:k}, \boldsymbol{\theta}_{j}] } \propto \tilde{w}_{k}(\boldsymbol{\theta}_{m}) [\boldsymbol{y}_{k+1} | \boldsymbol{y}_{1:k}, \boldsymbol{\theta}_{m}], \end{align}\tag{4}\] for all \(k \in \{ 0, \dots, n-1\}\). Therefore, to update the weights from prior to posterior, one only needs an i.i.d. sample from the prior and the sequence of conditional likelihood functions evaluated at each sampled value. Due to its similarity with the prior-proposal-recursive Bayesian (PP-RB) method of hooten2021making?, we call this procedure “importance PP-RB” and elaborate on it in Algorithm 1. Importance PP-RB is equivalent to directly using the prior as the proposal, which, as argued before, is known to be suboptimal generally. For that reason, we present an improved version of this approach in what follows.
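As a minimal illustration of the recursion in (4 ), the R sketch below runs importance PP-RB for the conjugate normal example of Section 4, using prior draws as the proposal; the data-generating values and Monte Carlo sizes are illustrative, not the paper's settings.

```r
## Importance PP-RB (Algorithm 1 sketch): i.i.d. prior draws with weights
## updated one observation at a time via (4). With many observations the
## weights degenerate, which is what motivates replenishment below.
set.seed(1)
n <- 2000; sigma <- 1; mu0 <- 0; sigma0 <- 10
y <- rnorm(n, mean = 0.5, sd = sigma)
M <- 2e4
theta <- rnorm(M, mu0, sigma0)     # i.i.d. sample from the prior (the proposal)
log_w <- rep(0, M)                 # log of unnormalized tilde-w_0
for (k in seq_len(n)) {
  ## recursion (4): tilde-w_k proportional to tilde-w_{k-1} * [y_k | theta]
  log_w <- log_w + dnorm(y[k], mean = theta, sd = sigma, log = TRUE)
}
w <- exp(log_w - max(log_w)); w <- w / sum(w)
sum(w * theta)                     # weighted-sample estimate of E(mu | y_{1:n})
```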
Considering the importance PP-RB procedure and similar to SMC methods, we treat \(n\) as a time index and track how the quality of the resulting IS estimators deteriorates over time. Because the initial sample is generated directly from the prior, it does not suffer from degeneracy when taking the prior distribution as the target, but as we update the weights and move from prior to posterior, we expect the weighted samples targeting their corresponding distributions to yield estimators with increasing variance over time. To improve upon importance PP-RB, a natural idea is to introduce replenishment to the procedure using some form of adaptation, leading to the recursive AIS (RAIS) framework.
This can be done by choosing a sequence of points in time \(\{n_{k}^{*}\}\) and, at those indices, using the corresponding weighted samples to estimate an objective function, find the estimated optimal approximation, generate from it to replenish the samples, and recalculate the IS weights with the new samples. When updating the weights outside the chosen indices, we use the recursive formula in (4 ), resulting in Algorithm 2. As long as the new approximation is closer to the current target distribution than the previous one, we expect the variance of the corresponding estimators to decrease at replenishment times, so we can choose the indices to ensure high-quality estimators throughout the fitting process.
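A schematic R version of this loop for the conjugate normal example is sketched below; the moment-matched normal refit used at replenishment times is an illustrative choice of approximation family, not necessarily the one used in our implementation.

```r
## RAIS sketch (Algorithm 2): recursive weight updates between pre-chosen
## replenishment indices n_star; at each replenishment a moment-matched normal
## proposal is refit, new samples are drawn, and the weights are recomputed
## against the current partial posterior.
loglik_batch <- function(theta, y_batch, sigma) {  # sum_i log[y_i | theta], vectorized in theta
  k <- length(y_batch)
  -k / 2 * log(2 * pi * sigma^2) -
    (sum(y_batch^2) - 2 * theta * sum(y_batch) + k * theta^2) / (2 * sigma^2)
}

rais <- function(y, n_star, M = 1e4, sigma = 1, mu0 = 0, sigma0 = 10) {
  theta <- rnorm(M, mu0, sigma0)                   # initial proposal: the prior
  log_w <- rep(0, M)
  for (k in seq_along(y)) {
    log_w <- log_w + loglik_batch(theta, y[k], sigma)       # recursive update (4)
    if (k %in% n_star) {
      w <- exp(log_w - max(log_w)); w <- w / sum(w)
      m <- sum(w * theta)                                    # weighted mean
      s <- sqrt(sum(w * (theta - m)^2))                      # weighted standard deviation
      theta <- rnorm(M, m, s)                                # replenish from the refit proposal
      log_w <- dnorm(theta, mu0, sigma0, log = TRUE) +       # recompute IS weights:
        loglik_batch(theta, y[1:k], sigma) -                 # log [theta | y_{1:k}] (up to a constant)
        dnorm(theta, m, s, log = TRUE)                       # minus log proposal density
    }
  }
  w <- exp(log_w - max(log_w))
  list(theta = theta, w = w / sum(w))
}

set.seed(1)
y <- rnorm(2000, 0.5, 1)
fit <- rais(y, n_star = 2^(1:10))     # exponentially growing replenishment times
sum(fit$w * fit$theta)                # estimate of the posterior mean of mu
```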
Despite this, the RAIS method can scale poorly with the sample size \(n\) depending on the frequency of adaptation. To show this, we assume that the computational cost of evaluating an expression proportional to \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\) and \([\boldsymbol{y}_{n+1} | \boldsymbol{y}_{1:n}, \boldsymbol{\theta} ]\) is \(O(n^{p})\) and \(O(n^{p-1})\) respectively, for some constant \(p \geq 1\) and for all \(n\). Then, the cost per Monte Carlo sample \(C_{R}\) of doing all recursive updates will be \(C_{R}(n) = O\left( \sum_{k=1}^{n} k^{p-1} \right) = O(n^{p})\), which is the same order as evaluating \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\) once, and therefore the recursive updates are relatively cheap. However, assuming that we replenish the samples at every time step, the cost per Monte Carlo sample \(C_{W}\) of recalculating the weights throughout the procedure will be \(C_{W}(n) = O\left( \sum_{k=1}^{n} k^{p} \right) = O(n^{p+1})\), which becomes prohibitively large as \(n\) grows. For this reason, when considering only the computational cost, it becomes necessary to allocate replenishment times sparingly, which we address next.
We seek to introduce as many adaptation steps as possible to ensure high quality estimation, while minimizing the cost. Considering the i.i.d. case, under relatively mild conditions, Theorem 1 of Section 3 guarantees that an asymptotically optimal trade-off can be achieved with a simple choice of adaptation times, for some notion of sample quality that we define below.
Taking \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] \propto [\boldsymbol{y}_{1:n} | \boldsymbol{\theta}] [\boldsymbol{\theta}]\) to represent the usual posterior obtained via Bayes’ theorem, we define the relative effective sample size (RESS) between \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\) and \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]\) as \[\label{eq:RESS} \begin{align} \text{RESS}(n | n_{0}) = \left\{ 1 + D_{\chi^{2}} \left( [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] \left\| \vphantom{\tfrac{1}{1}} \right. [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \right) \right\}^{-1}, \end{align}\tag{5}\] for \(n \geq n_{0}\), where \(D_{\chi^{2}}(f(\boldsymbol{\theta}) \| g(\boldsymbol{\theta})) = \int \frac{(f(\boldsymbol{\theta}) - g(\boldsymbol{\theta}))^{2}}{g(\boldsymbol{\theta})} d\boldsymbol{\theta}\) represents the \(\chi^{2}\)-divergence between two distributions with corresponding densities \(f\) and \(g\). The RESS is closely related to the effective sample size (ESS), commonly used to track degeneration in particle filters (e.g., del2012adaptive?), and can be seen as the proportion of information retained when using samples from \([\boldsymbol{\theta}|\boldsymbol{y}_{1:n_{0}}]\) to approximate \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\). For details about these connections, see Appendix 11.
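In practice, the RESS is estimated from the current self-normalized weights; the sketch below uses the standard ESS formula divided by \(M\), which is the usual particle-filtering convention and a stand-in for the estimator detailed in Appendix 11.

```r
## Weight-based estimate of the RESS: the standard effective sample size
## (sum w)^2 / sum(w^2), divided by the number of particles M. For already
## normalized weights this reduces to 1 / (M * sum(w^2)).
ress_hat <- function(w_tilde) 1 / (length(w_tilde) * sum(w_tilde^2))
ress_hat(rep(1 / 1000, 1000))   # equal weights: estimated RESS equal to 1
```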
Theorem 1 establishes that, assuming perfect replenishment (i.e., sampling directly from \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{k}^{*}}]\) at replenishment times \(n_{k}^{*}\)) and under the conditions stated in Section 3.1, if one takes replenishment times \(\{n^{*}_{k} \}\) with \(n^{*}_{k} = \lceil \alpha^{-k}_{n} \rceil\) for some sequence of constants \(\{\alpha_{n}\}\) with \(\alpha_{n} \in (0,1)\) and \(\alpha_{n} \to \alpha \in (0,1)\), then \[\text{RESS}(n_{k+1}^{*} | n_{k}^{*}) \overset{d}{\to} \text{RESS}_{\theta^{*},\alpha}(\boldsymbol{z}),\] as \(k\) goes to \(+\infty\), where \(\text{RESS}_{\theta^{*},\alpha}(\boldsymbol{z}) \in (0,1)\) is a non-degenerate random variable with \(\text{RESS}_{\theta^{*},\alpha}(\boldsymbol{z}) \to 0\) as \(\alpha\) goes to \(0\). Therefore, for sufficiently small values of \(\alpha\), the RESS eventually becomes arbitrarily close to \(0\). Taking \(\alpha_{n} = \alpha\) for all \(n\) corresponds to an exponential sequence of replenishment times, so the stated result indicates that the gaps between replenishment times cannot grow faster than exponentially if one wants to avoid degeneration (see Section 3 for details). However, when analyzing the computational cost, taking replenishment times \(\{n_{k}^{*}\}\), for \(k \in \{1, \dots, k_{n} \}\) with \(k_{n} = \left\lfloor -\frac{\log(n)}{\log(\alpha_{n})} \right\rfloor\), results in \[\begin{gather} C_{W}(n) = \sum_{k=1}^{k_{n}} \lceil \alpha^{-k}_{n} \rceil^{p} \leq (1+\alpha_{n})^{p} \sum_{k=1}^{k_{n}} \alpha_{n}^{-kp} = (1+\alpha_{n})^{p} \frac{\alpha_{n}^{-k_{n}p}-1}{1-\alpha_{n}^{p}} \leq \frac{(1 + \alpha_{n})^{p}}{1-\alpha_{n}^{p}} (n^{p}-1) \end{gather}\] and \[\label{eq:cost32lower32bound} C_{W}(n) = \sum_{k=1}^{k_{n}} \lceil \alpha_{n}^{-k} \rceil^{p} \geq \sum_{k=1}^{k_{n}} \alpha_{n}^{-kp} = \frac{\alpha_{n}^{-k_{n}p}-1}{1-\alpha_{n}^{p}} \geq \frac{(\alpha_{n} n)^{p}-1}{1-\alpha_{n}^{p}},\tag{6}\] where the last inequality uses \(\alpha_{n}^{-k_{n}} \geq \alpha_{n} n\). Therefore, taking \(\alpha_{n} = \alpha\) yields \(C_{W}(n) = O\left( n^{p} \right)\) per Monte Carlo sample. Combining this with the cost of the recursive update and the fact that other parts of the procedure do not depend on \(n\) results in a total cost of \(C_{T}(n) = C_{R}(n) + C_{W}(n) = O(n^{p})\) per Monte Carlo sample for RAIS with exponential replenishment, which is of the same order as that of other well-established methods (e.g., the Metropolis-Hastings algorithm).
| Replenishment Times | Computational Cost | Estimation Quality |
|---|---|---|
| \(\alpha \to 1\) (sub-exponential) | \(C_{\text{SUB}}(n) > O(n^{p})\) | \(\text{RESS}_{\theta^{*},\alpha}(\boldsymbol{z}) \overset{\mathbb{P}}{\to} 1\) |
| \(\alpha \in (0,1)\) (exponential) | \(C_{\text{EXP}}(n) = O(n^{p})\) | \(\text{RESS}_{\theta^{*},\alpha}(\boldsymbol{z}) \in (0,1)\) |
| \(\alpha \to 0\) (super-exponential) | \(C_{\text{SUP}}(n) = O(n^{p})\) | \(\text{RESS}_{\theta^{*},\alpha}(\boldsymbol{z}) \overset{\mathbb{P}}{\to} 0\) |

Table 1: Asymptotic cost per Monte Carlo sample and limiting sample quality (RESS) for replenishment times \(n_{k}^{*} = \lceil \alpha^{-k} \rceil\) with different limiting values of \(\alpha\).
Table 1 summarizes the role \(\alpha\) plays in the RAIS algorithm, and makes it clear that, when aiming for both computational efficiency and sample quality, replenishment times should grow exponentially to achieve an asymptotically optimal trade-off, resulting in the RAIS with optimal replenishment (RAISOR) framework. Knowing that exponential replenishment is optimal for any deterministic sequence \(\{\alpha_{n}\}\) is a useful reference; however, this alone is not enough to ensure high-quality estimators throughout the fitting process because the stochasticity of the limiting distribution can introduce instability.
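A quick numerical check of the bound in (6 ) is given below; the values of \(\alpha\) and \(p\) are arbitrary and only meant to show that the total weight-recalculation cost grows like \(n^{p}\).

```r
## Total weight-recalculation cost C_W(n) under exponential replenishment
## times n_k = ceiling(alpha^{-k}), divided by n^p; the ratios stay bounded,
## consistent with C_W(n) = O(n^p).
cost_exponential <- function(n, alpha, p) {
  k_n <- floor(-log(n) / log(alpha))
  sum(ceiling(alpha^-(1:k_n))^p)
}
sapply(c(1e3, 1e4, 1e5, 1e6), function(n) cost_exponential(n, alpha = 0.5, p = 1) / n)
```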
A natural alternative is to choose \(\{n_{k}^{*}\}\) to be a sequence of stopping times. Following the SMC literature (e.g. del2012adaptive?), one option would be to track an estimate of the RESS over time (see Appendix 11), and introduce a replenishment step when it falls below a certain threshold, for example, taking the next replenishment time \(n_{k+1}^{*}\) to be \[\label{eq:RESS32threshold} n_{k+1}^{*} = \inf\left\{ n > n_{k}^{*} : \widehat{\text{RESS}}(n | n_{k}^{*}) \leq r \right\},\tag{7}\] where \(r\) corresponds to a replenishment threshold (see Appendix 10 for guidance regarding the choice of \(r\)), and \(\widehat{\text{RESS}}(n | n_{k}^{*})\) is an estimate of \(\text{RESS}(n | n_{k}^{*})\). This not only helps with the stability of the procedure, but also ensures that the resulting weighted sample satisfies minimum quality requirements and naturally accounts for imperfect replenishment. Even though optimality becomes harder to establish in this case, the illustrations in Section 4 indicate that this strategy can work well in practice considering both simulated and real data.
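The stopping-time rule in (7 ) can be sketched as follows in the toy normal setting, with the previous replenishment at \(n_{k}^{*} = 0\) (a fresh prior sample) and the weight-based RESS estimate from above; the threshold and prior are illustrative.

```r
## Stopping-time replenishment: update the weights recursively and trigger a
## replenishment at the first index where the estimated RESS drops below r.
## With a diffuse prior as the proposal, this happens after few observations.
set.seed(1)
y <- rnorm(5000, 0.5, 1)
M <- 2e4; r <- 0.2
theta <- rnorm(M, 0, 10); log_w <- rep(0, M)
next_replenish <- NA
for (n in seq_along(y)) {
  log_w <- log_w + dnorm(y[n], theta, 1, log = TRUE)
  w <- exp(log_w - max(log_w)); w <- w / sum(w)
  if (1 / (M * sum(w^2)) <= r) { next_replenish <- n; break }
}
next_replenish   # n_{k+1}^*: index at which a replenishment step is triggered
```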
The time complexity analysis of RAISOR shows that its computational cost must eventually become larger than other competing approaches (e.g., the Metropolis-Hastings algorithm) for a sufficiently large \(n\), which is due to a comparatively large asymptotic constant. However, RAISOR can compensate for this disadvantage because, assuming perfect parallelization and that posterior evaluations are the dominant operation, its computational cost decreases linearly with the number of cores. Therefore, by using parallelization efficiently with enough processors, we can decrease the asymptotic constant to make RAISOR competitively fast.
One way of improving the efficiency of parallelization is to minimize transitions between parallel and sequential tasks because, after each transition, there is a need to synchronize and transmit information among processors. To this end, the Bayesian recursion identity in (1 ) can be generalized to \[\label{eq:recursion32identity32batch} [\boldsymbol{\theta} | \boldsymbol{y}_{1:n+b}] = \dfrac{[\boldsymbol{y}_{1:n+b} | \boldsymbol{\theta}] [\boldsymbol{\theta}]}{[\boldsymbol{y}_{1:n+b}]} = \dfrac{[\boldsymbol{y}_{n+1:n+b} | \boldsymbol{\theta}, \boldsymbol{y}_{1:n}] [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]}{[\boldsymbol{y}_{n+1:n+b} | \boldsymbol{y}_{1:n}]} \propto [\boldsymbol{y}_{n+1:n+b} | \boldsymbol{\theta}, \boldsymbol{y}_{1:n}] [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\tag{8}\] for all \(b \geq 1\), implying that we obtain the same IS weights regardless of whether we update based on a single new observation \(\boldsymbol{y}_{n+1}\) or a batch of observations \(\boldsymbol{y}_{n+1:n+b}\). In the case of RAISOR, most of the communication cost comes from weight updates, therefore, using (8 ) to batch the updates leads to substantial computational improvements. See Appendix 10 for a discussion on batching strategies.
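The identity in (8 ) is easy to verify numerically for the conditionally i.i.d. normal example: one batched update yields the same log-weight increments as \(b\) single-observation updates, as in the sketch below (toy values).

```r
## Batched versus one-at-a-time weight updates: with conditionally i.i.d.
## observations the two give identical log-weight increments, so updates can
## be blocked to reduce synchronization between parallel workers.
set.seed(1)
theta <- rnorm(1e4, 0, 10)
y <- rnorm(100, 0.5, 1)
one_by_one <- Reduce(`+`, lapply(y, function(yk) dnorm(yk, theta, 1, log = TRUE)))
batched <- -length(y) / 2 * log(2 * pi) -
  (sum(y^2) - 2 * theta * sum(y) + length(y) * theta^2) / 2
max(abs(one_by_one - batched))   # zero up to floating-point error
```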
Following the particular case of i.i.d. observations from kleijn2012bernstein?, let \(\{\boldsymbol{y}_{i}\}_{i=1}^{+\infty}\) be a sequence of i.i.d. random vectors in a probability space \((\Omega, \mathcal{F}, \mathbb{P}_{*})\) and generated from an unknown distribution \(F_{*}\). Consider the potentially misspecified Bayesian hierarchical model \[\begin{align} \boldsymbol{y}_{1}, \dots, \boldsymbol{y}_{n} | \boldsymbol{\theta} &\overset{\text{iid}}{\sim} F(\cdot | \boldsymbol{\theta}), \\ \boldsymbol{\theta} &\sim F_{\theta}, \end{align}\] for all \(n \geq 1\), where \(F(\cdot | \boldsymbol{\theta})\) is a parametric family of distributions indexed by \(\boldsymbol{\theta} \in \Theta\), with \(\Theta \subseteq \mathbb{R}^{d}\) open, and \(F_{\theta}\) is the prior distribution for \(\boldsymbol{\theta}\). We denote the log-likelihood function associated with observations \(\boldsymbol{y}_{n_{0}+1:n_{1}}\) as \(\ell_{n_{0}+1}^{n_{1}}(\boldsymbol{\theta}) = \log [\boldsymbol{y}_{n_{0}+1:n_{1}} | \boldsymbol{\theta}]\).
Note that we do not assume the parametric family of distributions \(F(\cdot | \boldsymbol{\theta})\) contains the true \(F_{*}\), but instead that there exists a unique \(\boldsymbol{\theta}^{*}\) in the interior of \(\Theta\) satisfying \[\label{eq:true32theta} \boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta} \in \Theta} \mathbb{E}_{*}\left( \log\frac{[\boldsymbol{y}_{1}]_{*}}{[\boldsymbol{y}_{1} | \boldsymbol{\theta}]} \right),\tag{9}\] where \([\boldsymbol{y}]_{*}\) is the true density of \(\boldsymbol{y}_{i}\) under \(\mathbb{P}_{*}\), and we use the subscript “\(*\)” to indicate that the corresponding object (e.g., expectation) is taken with respect to \(\mathbb{P}_{*}\). To ensure (9 ) is well-defined, we assume there exists a single measure that dominates \(F_{*}\) and all \(F(\cdot | \boldsymbol{\theta})\). Importantly, even though our model assigns a prior distribution to \(\boldsymbol{\theta}\), we treat \(\boldsymbol{\theta}\) as an indexing quantity.
We assume that the prior distribution \(F_{\theta}\) has a Lebesgue density \([\boldsymbol{\theta}]\) that is continuous and positive in a neighborhood \(B_{\theta^{*}}\) of \(\boldsymbol{\theta}^{*}\). We also assume that the likelihood function of the model is such that \(\ell(\boldsymbol{\theta}) := \log [\boldsymbol{y} | \boldsymbol{\theta}]\) is differentiable at \(\boldsymbol{\theta}^{*}\) with derivative \(\dot{\ell}(\boldsymbol{\theta}) = \frac{\partial}{\partial\boldsymbol{\theta}} \log [\boldsymbol{y} | \boldsymbol{\theta}]\), that there exists a neighborhood \(U\) of \(\boldsymbol{\theta}^{*}\) and a square-integrable function \(m_{\theta^{*}}(\boldsymbol{y})\) satisfying the inequality \[|\log [\boldsymbol{y}|\boldsymbol{\theta}_{1}] - \log[\boldsymbol{y} | \boldsymbol{\theta}_{2}] | \leq m_{\theta^{*}}(\boldsymbol{y}) \| \boldsymbol{\theta}_{1} - \boldsymbol{\theta}_{2} \|_{2}\] for all \(\boldsymbol{\theta}_{1}, \boldsymbol{\theta}_{2} \in U\), \(\mathbb{P}_{*}\)-almost surely, and that there exists a positive-definite \(d \times d\) matrix \(\boldsymbol{V}_{\theta^{*}}\) such that \[\label{eq:taylor32series32approximation} -\mathbb{E}_{*}\left( \log\frac{[\boldsymbol{y}|\boldsymbol{\theta}]}{[\boldsymbol{y}|\boldsymbol{\theta}^{*}]} \right) = \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}^{*})' \boldsymbol{V}_{\theta^{*}} (\boldsymbol{\theta} - \boldsymbol{\theta}^{*}) + o(\| \boldsymbol{\theta} - \boldsymbol{\theta}^{*} \|_{2}^{2}),\tag{10}\] as \(\boldsymbol{\theta} \to \boldsymbol{\theta}^{*}\). Although not strictly necessary, for simplicity we also assume that for all \(n > n_{0} \geq 0\), the maximum likelihood estimator (MLE) \(\hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}\) based on \(\boldsymbol{y}_{n_{0}+1:n}\) exists and lies in the interior of \(\Theta\), \(\mathbb{P}_{*}\)-almost surely, and that regularity conditions are satisfied to ensure that \(\hat{\boldsymbol{\theta}}_{1}^{n} \overset{\mathbb{P}_{*}}{\to} \boldsymbol{\theta}^{*}\); see for instance van2000asymptotic? for sufficient conditions for the consistency of the MLE.
As discussed in Section 2, RAISOR relies on the estimated RESS to determine where replenishment should take place throughout the fitting process, and therefore, analyzing the behavior of this statistic is essential to understand how well the procedure is expected to perform. Theorem 1 establishes the asymptotic distribution of the RESS in the case of i.i.d. observations while accounting for model misspecification.
Theorem 1. Let \(\textrm{RESS}(n | n_{0})\) be as defined in (5 ) and \(n_{0} = \alpha_{n} n\) for all \(n \geq 1\) and some sequence of constants \(\{ \alpha_{n} \}\) such that \(\alpha_{n} \in \left\{\frac{1}{n}, \dots, \frac{n}{n} \right\}\) and \(\alpha_{n} \to \alpha \in (0,1)\). Then, under the assumptions stated in Section 3.1, \[\label{eq:limit32RESS} \textrm{RESS}(n|n_{0}) \overset{d}{\to} \textrm{RESS}_{\theta^{*},\alpha}(\boldsymbol{z}) := \left\{\alpha(2-\alpha)\right\}^{\frac{d}{2}} \exp\left\{-\tfrac{1-\alpha}{2-\alpha} \boldsymbol{z}' \boldsymbol{M}_{\theta^{*}} \boldsymbol{z} \right\},\qquad{(1)}\] as \(n\) goes to \(+\infty\), where \(\boldsymbol{z} \sim \textrm{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\) and \(\boldsymbol{M}_{\theta^{*}} = (\boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}})'\boldsymbol{V}_{\theta^{*}}^{-1} (\boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}})\), with \[\boldsymbol{W}_{\theta^{*}} = \mathbb{E}_{*}\left(\left. \tfrac{\partial}{\partial \boldsymbol{\theta}} \log [\boldsymbol{y} | \boldsymbol{\theta}] \;\left\{ \tfrac{\partial}{\partial \boldsymbol{\theta}} \log [\boldsymbol{y} | \boldsymbol{\theta}] \right\}' \right|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{*}} \right),\] \(\boldsymbol{V}_{\theta^{*}}\) as in (10 ), and \(\boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}}\) such that \(\boldsymbol{W}_{\theta^{*}} = (\boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}})(\boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}})'\).
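To get a feel for the limit, one can simulate draws of \(\textrm{RESS}_{\theta^{*},\alpha}(\boldsymbol{z})\); the sketch below assumes the correctly specified case, in which \(\boldsymbol{M}_{\theta^{*}}\) reduces to the identity and \(\boldsymbol{z}' \boldsymbol{M}_{\theta^{*}} \boldsymbol{z}\) is a \(\chi^{2}_{d}\) random variable (an illustrative simplification).

```r
## Monte Carlo draws from the limiting RESS distribution of Theorem 1,
## assuming M_{theta*} = I_d (correct specification) so that z'Mz ~ chi^2_d.
ress_limit <- function(alpha, d = 1, n_draws = 1e5) {
  z2 <- rchisq(n_draws, df = d)
  (alpha * (2 - alpha))^(d / 2) * exp(-(1 - alpha) / (2 - alpha) * z2)
}
sapply(c(0.9, 0.5, 0.1), function(a) mean(ress_limit(a)))   # smaller alpha, lower RESS
```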
The technical details of the proof of Theorem 1 are presented in Appendix 9, but the proof is conceptually straightforward. We can analytically characterize the distribution of the RESS when considering the normal case for both the data and posterior distributions (see Appendix 11) and, under regularity conditions, the posterior density and likelihood function will converge to a normal density by the Bernstein-von-Mises theorem and classical results. Therefore, the asymptotic behavior of the RESS for general i.i.d. models reduces to the normal case, yielding the desired result.
We can decompose \(\text{RESS}_{\theta^{*},\alpha}(\boldsymbol{z})\) in Theorem 1 into the factors \(u_{1}(\alpha) = \left\{ \alpha (2-\alpha )\right\}^{\frac{d}{2}}\) and \(u_{2}(\alpha, \boldsymbol{z}) = \exp\left\{-\frac{1-\alpha}{2-\alpha} \boldsymbol{z}' \boldsymbol{M}_{\theta^{*}} \boldsymbol{z}\right\}\), which represent two separate aspects of the limiting behavior of \(\text{RESS}(n|n_{0})\) (see Appendix 11 for details). The factor \(u_{1}(\alpha)\) gives us a deterministic upper bound and, assuming for simplicity that \(n_{0} = \lceil \alpha n \rceil\), we can use it to show that \[\label{eq:RESS32upper32bound} \begin{align} u_{1}(\alpha) &\geq r_{\text{min}} & &\Longrightarrow & n &\leq c(r_{\text{min}},d) \cdot n_{0}, & c(r_{\text{min}},d) =r_{\text{min}}^{-\frac{2}{d}} \left\{ 1 + \sqrt{1-r_{\text{min}}^{\frac{2}{d}}} \right\}, \end{align}\tag{11}\] for all \(r_{\text{min}} \in (0,1)\). This implies that, if one uses weighted samples from \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]\) targeting \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\), the quality of the approximation, measured in terms of \(\text{RESS}(n|n_{0})\), is guaranteed to fall below \(r_{\text{min}}\) for sufficiently large values of \(n,n_{0}\) satisfying \(n > c(r_{\text{min}},d) \cdot n_{0}\). Note that even though taking \(n \leq c(r_{\text{min}}, d) \cdot n_{0}\) does not guarantee an asymptotically high-quality approximation, it can still be used to derive computational heuristics (see Appendix 10) for applying RAISOR in practice.
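The bound in (11 ) translates directly into a maximum growth factor for the data between replenishments; a small R helper (illustrative values) makes the dependence on \(r_{\text{min}}\) and \(d\) explicit.

```r
## c(r_min, d) from (11): the largest n / n_0 ratio for which the deterministic
## upper bound u_1(alpha) on the limiting RESS stays above r_min.
c_bound <- function(r_min, d) r_min^(-2 / d) * (1 + sqrt(1 - r_min^(2 / d)))
outer(c(0.5, 0.2, 0.1), c(1, 2, 5, 10), c_bound)   # rows: r_min; columns: d
## larger d shrinks the allowed ratio, forcing more frequent replenishment
```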
We test our method in simulated data experiments that demonstrate both the efficiency and numerical stability of the proposed approach, followed by an application to sea surface temperature estimation using a geostatistical model with a Matérn covariance function.
We consider two simulation experiments: mean estimation using a conjugate normal model with i.i.d. observations, to show the asymptotic behavior of the RESS under the assumptions of Theorem 1, and function estimation using a Gaussian process (GP) regression, as a way of evaluating how our method behaves in a non-trivial example outside the scope of the theorem.
We let \(y_{1}, \dots, y_{n} \overset{\text{i.i.d.}}{\sim} \text{Normal}(0, \sigma^{2})\), with \(\sigma^{2}\) known, and consider the Bayesian model \[\begin{align} y_{i} | \mu &\overset{\text{ind}.}{\sim} \text{Normal}(\mu, \sigma^{2}), \\ \mu &\sim \text{Normal}(\mu_{0}, \sigma^{2}_{0}), \end{align}\] for \(i \in \{1, \dots, n\}\). Well-known algebraic manipulations show that the posterior distribution is given by \(\mu | \boldsymbol{y}_{1:n} \sim \text{Normal}(\mu_{n}, \sigma^{2}_{n})\), where \(\sigma^{2}_{n} = \left\{ \sigma^{-2}_{0} + n \sigma^{-2} \right\}^{-1}\) and \(\mu_{n} = \sigma^{2}_{n} \left\{ \sigma^{-2}_{0}\mu_{0} + \sigma^{-2} \sum_{i=1}^{n} y_{i} \right\}\).
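For reference, the closed-form update can be coded directly; the sketch below is a straightforward implementation of \(\mu_{n}\) and \(\sigma^{2}_{n}\) with illustrative inputs.

```r
## Conjugate normal update: posterior mean and variance of mu given y_{1:n}.
post_normal <- function(y, mu0, sigma0, sigma) {
  n <- length(y)
  v_n <- 1 / (1 / sigma0^2 + n / sigma^2)            # sigma_n^2
  m_n <- v_n * (mu0 / sigma0^2 + sum(y) / sigma^2)   # mu_n
  c(mean = m_n, var = v_n)
}
post_normal(rnorm(100, 0, 2), mu0 = 0, sigma0 = 10, sigma = 2)
```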
Considering a sample size of \(n = 10^{4}\), Figure 3 shows 30 independent replications of the evolution of the \(\text{RESS}(n|n_{0})\) trajectory, calculated with \(M = 50000\) samples from an initial posterior distribution \([\mu | \boldsymbol{y}_{1:n_{0}}]\) with \(n_{0} = 250\). The dashed line corresponds to the asymptotic theoretical upper bound \(u_{1}(\alpha)\), derived in Section 3, calculated with \(\alpha = \frac{n_{0}}{n}\). Because no sample replenishment is performed, the trajectories in Figure 3 decay noisily toward zero, although the speed of decay varies significantly across replicates. It is worth pointing out that, as indicated in Section 3 and in Appendix 11, the speed of decay is also expected to vary with the parameter dimension and under model misspecification.
To show the effect of replenishment, we performed the same simulation experiment with \(n = 10^{6}\), but introducing sample replenishment at \(n_{0} \in \{10^{0}, 10^{2}, 10^{4}\}\). Because the posterior distribution has a known form for all \(n\), we can replenish our samples by directly generating new values from the partial posterior distribution without the need to introduce approximations, resulting in perfect replenishment. Figure 4 shows the resulting \(\text{RESS}(n|n_{0})\) trajectories.
In Figure 4, the replenishment naturally divides the plot into three regions, with the RESS returning to 1 after each replenishment time. Noting that the \(x\)-axis is on a logarithmic scale and that replenishment times were taken to grow exponentially, comparing the regions reveals evidence of the convergence in distribution stated in Theorem 1. Although Theorem 1 does not guarantee the joint convergence in distribution of the trajectories, Figure 4 indicates that \(\text{RESS}(n|n_{0})\) may also converge to a continuous stochastic process indexed by \(\alpha = \lim_{n \to +\infty} \frac{n_{0}}{n} \in [0,1]\). Additionally, Figure 4 suggests that, when accounting for the randomness of the data generation, the RESS trajectories can get arbitrarily close to 0 at any deterministic replenishment time, indicating that stochastic replenishment times, as in (7 ), may be required to ensure robustness.
We consider the problem of function estimation using a geostatistical model (i.e., Gaussian process regression) of the form \[\label{eq:GP32data32process} \begin{align} \boldsymbol{y}_{1:n} | \boldsymbol{\beta}, \sigma^{2}, \tau^{2}, \phi &\sim \text{Normal}_{n}(\boldsymbol{X}\boldsymbol{\beta}, \boldsymbol{\Sigma}(\sigma^{2}, \tau^{2}, \phi)), \\ \boldsymbol{\beta} &\sim \text{Normal}_{p}(\boldsymbol{\mu}_{\beta}, \boldsymbol{\Sigma}_{\beta}), \\ \sigma^{2} &\sim \text{Inverse-Gamma}\left( \tfrac{\alpha_{1}}{2}, \tfrac{\alpha_{2}}{2} \right), \\ \tau^{2} &\sim \text{Uniform}(0, 1), \\ \phi &\sim \text{Half-Normal}(0, \gamma^{2}), \end{align}\tag{12}\] where \(\boldsymbol{y}_{1:n} = (y(\boldsymbol{s}_{1}), \dots, y(\boldsymbol{s}_{n}))'\) is a vector of (noisy) observations of the function of interest at the spatial locations \(\boldsymbol{s}_{1:n} = (\boldsymbol{s}_{1}, \dots, \boldsymbol{s}_{n})' \in \mathcal{S}^{n}\), \(\boldsymbol{X}\) is a matrix of covariates, \(\boldsymbol{\beta}\) is the vector of corresponding regression coefficients, and \[\label{eq:covariance} \begin{align} \boldsymbol{\Sigma}(\sigma^{2}, \tau^{2}, \phi) &= \sigma^{2} \{(1-\tau^{2}) \boldsymbol{R}(\phi) + \tau^{2}\boldsymbol{I}_{n}\}, & \text{with } \boldsymbol{R}(\phi) &= \tfrac{2^{1-\nu}}{\Gamma(\nu)}\left( \sqrt{2\nu} \tfrac{\boldsymbol{D}}{\phi} \right)^{\nu} K_{\nu}\left( \sqrt{2\nu} \tfrac{\boldsymbol{D}}{\phi} \right), \end{align}\tag{13}\] corresponds to the Matérn covariance matrix matern1986spatial?, stein1999interpolation? with a nugget effect, \(\boldsymbol{D}\) is the \(n \times n\) matrix of pairwise distances between the locations \(\boldsymbol{s}_{1:n}\), \(K_{\nu}\) is the modified Bessel function of the second kind (see abramowitz1965handbook? and Appendix 10), and \(\sigma^{2}\), \(\tau^{2}\), and \(\phi\) are spatial parameters to be estimated.
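For small \(n\), the covariance in (13 ) can be formed directly with base R's Bessel function, as in the sketch below (parameter values mirror the simulation settings described next, and the diagonal is set to one to handle the zero-distance limit). The NNGP fits reported later avoid building this dense matrix altogether.

```r
## Dense Matern covariance matrix from (13) with a nugget effect; O(n^2)
## storage, so only feasible for small n.
matern_cov <- function(D, sigma2, tau2, phi, nu) {
  x <- sqrt(2 * nu) * D / phi
  R <- 2^(1 - nu) / gamma(nu) * x^nu * besselK(x, nu)
  R[D == 0] <- 1                                  # correlation equals 1 at distance zero
  sigma2 * ((1 - tau2) * R + tau2 * diag(nrow(D)))
}
set.seed(1)
s <- cbind(runif(50), runif(50))                  # 50 locations on the unit square
D <- as.matrix(dist(s))
Sigma <- matern_cov(D, sigma2 = 4, tau2 = 0.05, phi = 0.05, nu = 3/2)
```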
The geostatistical model in (12 ) is notoriously challenging to fit even to moderately large data sets (e.g., for \(n \approx 5000\)). The cost of likelihood evaluations is \(O(n^{3})\), due to the need to invert a dense \(n \times n\) covariance matrix, while the memory requirement is \(O(n^{2})\). To mitigate this, many approximations of GPs have been proposed in the literature (e.g., liu2020gaussian?, heaton2019case?), so we consider the Vecchia approximation vecchia1988estimation?, katzfuss2021general?, and in particular the Nearest-Neighbor GP (NNGP) approximation datta2016hierarchical?, to fit the GP regression model.
In our simulations, we generated six data sets for different values of \(n\), with spatial locations \(s_{ij} \overset{\text{i.i.d.}}{\sim} \text{Uniform}(0,1)\) for \(i \in \{1, \dots, n\}\) and \(j \in \{1, 2\}\), \(\boldsymbol{X}\) consisting of a column of ones and the two spatial coordinates, \(\boldsymbol{\beta} = (8, 4, 16)'\), \(\sigma^{2} = 4\), \(\tau^{2} = 0.05\), \(\phi = 0.05\), \(\nu = \frac{3}{2}\), \(\{D\}_{ij} = \| \boldsymbol{s}_{i} - \boldsymbol{s}_{j} \|_{2}\), and the vector of observations \(\boldsymbol{y}_{1:n}\) generated according to the exact GP regression model in (12 ). We fitted the NNGP approximation of the model with \(k_{n} = \left\lceil 1.2 \log_{10}^{2} n \right\rceil\) neighbors and random ordering to each simulated data set using MCMC and our proposed approach. Implementation details are provided in Appendix 10.
We generated \(M = 50000\) samples using R team2024r? with both RAISOR and MCMC, discarding the initial \(5000\) as burn-in for the MCMC approach, with the RAISOR making use of six parallel cores to update and calculate the importance sampling weights. As shown in the left plot in Figure 5, the use of parallelization yields a moderate speed improvement for the RAISOR approach when compared to our MCMC implementation, with a time reduction of \(\sim40\%\) for \(n \geq 640\) (note that both axes are on a logarithmic scale).
We also compared the quality of the generated samples. The right plot in Figure 5 shows the number of effective samples generated per minute. The ESS was calculated for the MCMC output following gamerman2006markov? and for the RAISOR output as \(M\) times the estimated RESS (see Appendix 11). Comparing the two curves we can see that RAISOR provides a significantly greater ESS per minute, resulting in an overall more effective sampler.
To visualize how the RAISOR approach fits the model, considering the data set with \(n = 10240\), the top panel of Figure 6 shows the RESS trajectory evaluated after each weight update or recalculation step. The dashed line represents the replenishment threshold \(r = 0.2\), so the method introduces replenishment after any weight update that reduces the RESS below this line. Comparing it with Figure 4, it is clear that the RESS does not reach one after each replenishment time, which occurs due to approximation error. The bottom panel of Figure 6 shows the evolution of the estimate of the spatial range parameter \(\phi\) as the fitting process progresses. The uncertainty associated with our estimate shrinks as the sample size increases, and the estimate eventually approaches the truth.
We consider sea surface temperature (SST) measurements in the Gulf of Mexico during January of 2021. The data were collected as part of the in situ SST Quality Monitor (iQuam) system developed by the National Oceanic and Atmospheric Administration (NOAA) xu2014situ?, which compiles on-site measurements from multiple sources (e.g., ships, buoys, and Argo floats). These data can be used for quality control of SST obtained from other sources, such as satellite measurements, to produce high-resolution products, or in ecological, environmental, and climatological applications. Figure 7 shows \(n = 6014\) daytime observations, with \(4951\) of those arising from ships, \(63\) from Argo floats, and \(1000\) from moored buoy measurements.
We fitted the Vecchia approximation of the GP regression model in (12 ) to the SST data using \(k = 18\) neighbors, considering both the MCMC and RAISOR approaches as described previously. We used the geodesic distance \(\{D\}_{ij} = d_{\text{GEO}}(\boldsymbol{s}_{i}, \boldsymbol{s}_{j})\) in the covariance model in (13 ). The total runtime was 5 hours 51 minutes for MCMC and 3 hours 40 minutes for the RAISOR approach, whereas it would take weeks to months to fit the exact geostatistical model. Figure 8 shows the posterior predictive mean and standard deviation of the SST at a grid of locations obtained using the proposed approach. The resulting maps can be used in downstream analyses that facilitate ecological and environmental research. For instance, liu2025rapid? studied the effect of sea temperature on the formation of Hurricane Ian, which caused extensive damage after reaching the state of Florida in 2022, and established a causal link between anomalies in the sea temperature in the Gulf of Mexico and Ian’s rapid intensification from a Category 3 to a Category 5 hurricane.
In the context of distributed computing environments, RAISOR combines the strengths of RB and AIS methods to compensate for their respective shortcomings. Compared with the importance PP-RB algorithm, it introduces replenishment to mitigate the sample degeneracy issue, and compared with standard AIS approaches, it targets a sequence of partial posteriors, reducing the cost of likelihood evaluations at early iterations and circumventing the initialization problem. Replenishment and batching strategies allow RAISOR to use parallel processing efficiently, making it an attractive alternative to other well-established approaches. Additionally, because the method learns using the prior-to-posterior update, it has strong connections to sequential and annealed methods, indicating that it could be used in the context of online learning or to handle sampling from strongly multimodal distributions.
When considering the effect of the dimension \(d\) of \(\boldsymbol{\theta}\) on the asymptotic distribution of the RESS, Theorem 1 establishes that sample degeneration is expected to accelerate significantly as \(d\) increases (see also bengtsson2008curse? and beskos2014stability?). A similar problem also occurs when using non-parametric density estimators to approximate the current target distribution because the approximation error increases exponentially with \(d\). Thus, RAISOR is less effective when used to sample from high-dimensional posterior distributions, but it yields appreciable gains in models with low-dimensional parameters (e.g., \(d \leq 10\)). Another aspect worth considering is that RAISOR, as proposed, does not address posterior distributions with dimension increasing with the sample size, such as generalized linear mixed models, latent Markov random fields, and state-space models, although extensions could be proposed similar to hooten2021making?. These points raise the question of whether RAISOR could be combined with other approaches, such as MCMC or SMC, to generate methods better equipped to address these issues while still allowing for efficient parallelization.
Theorem 1 provides intuition regarding the problem of sample depletion, but it only holds under the assumptions stated in Section 3.1, so caution must be taken when extrapolating it to other instances. Of the assumptions, the imposition of identically distributed observations can likely be weakened to encompass more complex models (e.g., generalized linear models), because similar asymptotic results for the posterior distribution and the MLE can still hold. The assumption of independence, however, is more crucial to the proof of Theorem 1, and we conjecture that the asymptotic behavior of the RESS should be assessed on a case-by-case basis when either the model or the true data-generating process is not independent. Finally, although imposing \(\Theta\) to be an open set is restrictive, as it automatically excludes models with discrete parameters, which are known to not satisfy asymptotic normality, we suspect that the limit of the RESS could still be described for models with discrete or mixed parameter spaces, provided that the discrete components of the posterior and the MLE collapse to a point mass.
The authors thank Alex Barth, Berkeley Ho, Justin Van Ee, Michael Schwob, Myungsoo Yoo, Nikunj Goel, and Rachael Ren for helpful discussions and insights.
The authors have no knowledge of any conflict of interests to declare.
The SST data used in this project are available at the following URL:
https://www.star.nesdis.noaa.gov/socd/sst/iquam/.
For organizational purposes, we split the proof of Theorem 1 into lemmas that are stated and proven in this Appendix. The first lemma allows us to rewrite the \(\text{RESS}(n | n_{0})\) as the ratio of two expectations with respect to a normal approximation of the posterior \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]\) using the Bernstein-von Mises theorem under model misspecification of kleijn2012bernstein?.
Lemma 1. Let \(\textrm{RESS}(n | n_{0})\) be as defined in (5 ) and \(n_{0} = \alpha_{n} n\), for all \(n \geq 1\) and some sequence of constants \(\{\alpha_{n}\}_{n}\) such that \(\alpha_{n} \in \left\{ \frac{1}{n}, \dots, \frac{n}{n}\right\}\) and \(\alpha_{n} \to \alpha \in (0,1)\). Then, under assumptions stated in Section 3.1, \[\left| \textrm{RESS}(n|n_{0}) - \textrm{RESS}_{\phi}(n|n_{0}) \right| \overset{\mathbb{P}_{*}}{ \to} 0,\] as \(n\) goes to \(+\infty\), where \[\label{eq:RESS32normal} \textrm{RESS}_{\phi}(n | n_{0}) = \frac{ \left\{ \int R_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta} \right\}^{2}}{ \int R_{n_{0}+1:n}^{2}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta}},\qquad{(2)}\] \(R_{n_{0}+1:n}(\boldsymbol{\theta}, \tilde{\boldsymbol{\theta}}) = \exp\left\{ \ell_{n_{0}+1}^{n}(\boldsymbol{\theta}) - \ell_{n_{0}+1}^{n}(\tilde{\boldsymbol{\theta}}) \right\}\) is the likelihood ratio, and \(\phi(\cdot | \boldsymbol{\mu}, \boldsymbol{\Sigma})\) represents the density of a normal random vector with mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\).
Proof. To avoid unnecessary complications, note that \(\alpha_{n} \to \alpha \in (0,1)\), which implies that \(\alpha_{n} < 1\) for all large \(n\), and so we assume without loss of generality that \(\alpha_{n} < 1\) for all \(n \geq 2\).
To make \(\text{RESS}(n|n_{0})\) and \(\text{RESS}_{\phi}(n | n_{0})\) compatible, we begin by rewriting \[\begin{align} \text{RESS}(n|n_{0}) &= \left\{ 1 + D_{\chi^{2}}( [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] \;\| \;[\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] ) \right\}^{-1} \\ &= \left\{ \int \frac{[\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]^{2}}{[\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]} \;d\boldsymbol{\theta} \right\}^{-1} = \frac{[\boldsymbol{y}_{1:n}]^{2}}{[\boldsymbol{y}_{1:n_{0}}]} \left\{ \int \frac{[\boldsymbol{y}_{1:n} | \boldsymbol{\theta}]^{2}}{[\boldsymbol{y}_{1:n_{0}} | \boldsymbol{\theta}]} [\boldsymbol{\theta}] \;d\boldsymbol{\theta} \right\}^{-1} \\ &= \frac{ \left\{ \int [\boldsymbol{y}_{1:n} | \boldsymbol{\theta}] [\boldsymbol{\theta}] \;d\boldsymbol{\theta} \right\}^{2}}{ [\boldsymbol{y}_{1:n_{0}}] \int \frac{[\boldsymbol{y}_{1:n} | \boldsymbol{\theta}]^{2}}{[\boldsymbol{y}_{1:n_{0}} | \boldsymbol{\theta}]} [\boldsymbol{\theta}] \;d\boldsymbol{\theta}} = \frac{ \left\{ \int [\boldsymbol{y}_{n_{0}+1:n} | \boldsymbol{y}_{1:n_{0}}, \boldsymbol{\theta}] [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \;d\boldsymbol{\theta} \right\}^{2}}{ \int [\boldsymbol{y}_{n_{0}+1:n} | \boldsymbol{y}_{1:n_{0}}, \boldsymbol{\theta}]^{2} [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \;d\boldsymbol{\theta}} \\ &\overset{\text{ind}}{=} \frac{ \left\{ \int [\boldsymbol{y}_{n_{0}+1:n} | \boldsymbol{\theta}] [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \;d\boldsymbol{\theta} \right\}^{2}}{ \int [\boldsymbol{y}_{n_{0}+1:n} | \boldsymbol{\theta}]^{2} [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \;d\boldsymbol{\theta}} = \frac{ \left\{ \int R_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;[\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \;d\boldsymbol{\theta} \right\}^{2}}{ \int R_{n_{0}+1:n}^{2}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;[\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \;d\boldsymbol{\theta}}, \end{align}\] for all \(n \geq 2\). Note that the above expression is only well-defined because \(\alpha_{n} < 1\), which implies \(n_{0} < n\). 
To streamline the calculations, we introduce the notation \[\begin{align} a_{n,\gamma} &:= \int R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;[\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \;d\boldsymbol{\theta}, \\ b_{n,\gamma} &:= \int R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta}, \end{align}\] so we have \[\label{eq:bound32RESS} \begin{align} &\left| \text{RESS}(n|n_{0}) - \text{RESS}_{\phi}(n|n_{0}) \right| = \left| \frac{a_{n,1}^{2}}{a_{n,2}} - \frac{b_{n,1}^{2}}{b_{n,2}} \right| = \left| \frac{b_{n,2} a_{n,1}^{2} - a_{n,2} b_{n,1}^{2}}{a_{n,2} b_{n,2}} \right| \\ &= \left| \frac{\left\{ b_{n,2} - a_{n,2} \right\} a_{n,1}^{2} - a_{n,2} \left\{ b_{n,1}^{2} - a_{n,1}^{2} \right\} }{a_{n,2} b_{n,2}} \right| \leq \frac{\left| b_{n,2} - a_{n,2} \right| a_{n,1}^{2} }{a_{n,2} b_{n,2}} + \frac{ a_{n,2} \left| b_{n,1}^{2} - a_{n,1}^{2} \right| }{a_{n,2} b_{n,2}} \\ &= \frac{\left| b_{n,2} - a_{n,2} \right| }{ b_{n,2}} \cdot \frac{a_{n,1}^{2}}{a_{n,2}} + \frac{ (b_{n,1} + a_{n,1}) \left| b_{n,1} - a_{n,1} \right| }{ b_{n,2}} \leq \frac{\left| b_{n,2} - a_{n,2} \right| + 2\left| b_{n,1} - a_{n,1} \right| }{ b_{n,2}}, \end{align}\tag{14}\] where on the last inequality we used that \(a_{n,1}^{2} \leq a_{n,2}\), by Jensen’s inequality, and \(a_{n,1} + b_{n,1} \leq 2\), because the likelihood ratio is bounded by 1. Note that \[\label{eq:bound32l1} \begin{align} \left| a_{n,\gamma} - b_{n,\gamma} \right| &\leq \int \underbrace{R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n})}_{\leq 1} \left| [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] - \phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \right| \;d\boldsymbol{\theta} \\ &\leq \int \left| [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] - \phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \right| \;d\boldsymbol{\theta}, \end{align}\tag{15}\] for \(\gamma > 0\), so combining (14 ) and (15 ) yields \[\begin{align} \left| \text{RESS}(n|n_{0}) - \text{RESS}_{\phi}(n|n_{0}) \right| \leq 3\frac{\int \left| [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] - \phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \right| \;d\boldsymbol{\theta}}{\int R^{2}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta}}. \end{align}\] Because \(\alpha_{n} \to \alpha > 0\) as \(n\) goes to \(+\infty\), we also have \(n_{0} \to +\infty\), so, by Lemma 2, the denominator converges in distribution to a positive random variable and, by Theorem 2.1 of kleijn2012bernstein?, the numerator converges in \(\mathbb{P}_{*}\)-probability to 0. Combining those with the continuous mapping theorem yields the desired result. ◻
In the next lemma, we establish the limiting distribution of \(\text{RESS}_{\phi}(n|n_{0})\), defined in equation (?? ), which forms the foundation of the proof of Theorem 1.
Lemma 2. Let \(n_{0} = \alpha_{n} n\) for all \(n \geq 1\) and some sequence of constants \(\{ \alpha_{n} \}_{n}\) such that \(\alpha_{n} \in \left\{\frac{1}{n}, \dots, \frac{n}{n} \right\}\) and \(\alpha_{n} \to \alpha \in (0,1)\). Additionally, let \(f:(0,1)^{k} \to \mathbb{R}^{m}\) be a function that is continuous \(\lambda\)-almost everywhere in \((0,1)^{k}\), where \(\lambda\) represents the Lebesgue measure, and \[\mathbb{I}_{n,\gamma}(\boldsymbol{\theta}^{*}) = \int R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta}\] for all \(\gamma \geq 0\). Then, under the assumptions stated in Section 3.1, \[f\left( \vphantom{\frac{1}{1}} \mathbb{I}_{n,\gamma_{1}}(\boldsymbol{\theta}^{*}), \dots, \mathbb{I}_{n,\gamma_{k}}(\boldsymbol{\theta}^{*}) \right) \overset{d}{\to} f\left( \vphantom{\frac{1}{1}} \mathbb{I}_{\theta^{*},\gamma_{1}}(\boldsymbol{z}), \dots, \mathbb{I}_{\theta^{*},\gamma_{k}}(\boldsymbol{z})\right),\] where \(\boldsymbol{z} \sim \textrm{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\), \(\gamma_{1}, \dots, \gamma_{k}\) are positive constants, and \[\begin{align} \log \mathbb{I}_{\theta^{*},\gamma}(\boldsymbol{z}) := \tfrac{d}{2} \log\left( \tfrac{\alpha}{\gamma + (1-\gamma)\alpha} \right) - \tfrac{\gamma}{2(\gamma + (1-\gamma) \alpha)} \boldsymbol{z}' \boldsymbol{M}_{\theta^{*}} \boldsymbol{z}, \end{align}\] for all \(\gamma > 0\) and \(\boldsymbol{z} \in \mathbb{R}^{d}\).
Proof. To avoid excessive repetition, all limits are taken as \(n\) goes to \(+\infty\) unless otherwise stated. Additionally, for the same reasons considered in the proof of Lemma 1, because \(\alpha_{n} \to \alpha \in (0,1)\), we can assume without loss of generality that \(\alpha_{n} < 1\) for all \(n \geq 2\), which we do to ensure that \(\mathbb{I}_{n,\gamma}(\boldsymbol{\theta}^{*})\) is well-defined.
To show that \(\mathbb{I}_{n,\gamma}(\boldsymbol{\theta}^{*})\) converges in distribution, our strategy is to first obtain an asymptotically tight bound of the likelihood ratio when restricting the integral over fixed compact sets. Then, following the proof of Theorem 2.1 in kleijn2012bernstein?, we let the compact sets grow in a controlled manner to obtain the desired result. With that in mind, we separate \(\mathbb{I}_{n,\gamma}(\boldsymbol{\theta}^{*})\) into two parts, obtaining \[\label{eq:decomposition} \begin{align} \int R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta} &= \mathbb{I}_{n,\gamma}(\boldsymbol{\theta}^{*}, r_{n-n_{0}}) + \mathbb{I}^{c}_{n,\gamma}(\boldsymbol{\theta}^{*}, r_{n-n_{0}}) \end{align}\tag{16}\] where \[\begin{align} \mathbb{I}_{n,\gamma}(\boldsymbol{\theta}^{*}, r_{n-n_{0}}) &= \int_{B(\boldsymbol{\theta}^{*},r_{n-n_{0}})} R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta}, \\ \mathbb{I}^{c}_{n,\gamma}(\boldsymbol{\theta}^{*}, r_{n-n_{0}}) &= \int_{B^{c}(\boldsymbol{\theta}^{*},r_{n-n_{0}})} R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta}, \end{align}\] \(B(\boldsymbol{\theta}^{*},r)\) is an \(L_{2}\) neighborhood of radius \(r\) around \(\boldsymbol{\theta}^{*}\) (i.e., \(B(\boldsymbol{\theta}^{*},r) = \{ \boldsymbol{\theta} \in \Theta : \| \boldsymbol{\theta} - \boldsymbol{\theta}^{*} \|_{2} \leq r \}\)), and \(r_{x} = r(x)\) can be any continuous function in \(\mathbb{R}_{+}\) satisfying \[\label{eq:radii32condition321} \begin{align} \lim_{x \to +\infty} r(x) &= 0 && \text{and} & \lim_{x \to +\infty} x r^{2}(x) = +\infty. \end{align}\tag{17}\] Considering the two terms of the decomposition in (16 ), we can bound the second one and use a change of variables to write \[\label{eq:term32132bound} \begin{align} 0 \leq \mathbb{I}^{c}_{n,\gamma}(\boldsymbol{\theta}^{*}, r_{n-n_{0}}) &= \int_{B^{c}(\boldsymbol{\theta}^{*}, r_{n-n_{0}})} \underbrace{R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n})}_{\leq 1} \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta} \\ &\leq \int_{B^{c}(\boldsymbol{\theta}^{*},r_{n-n_{0}})} \phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta} \\ &= \int_{B^{c}(\boldsymbol{\theta}^{*},1)} \phi(\boldsymbol{\theta} | r_{n-n_{0}}^{-1}\{ \hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*} \} +\boldsymbol{\theta^{*}}, \{n_{0} r_{n-n_{0}}^{2}\}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta}. 
\end{align}\tag{18}\] Using that \(n_{0} \to +\infty\) because \(\alpha_{n} \to \alpha \in (0,1)\), and noting that the MLE is consistent by assumption and asymptotically normal by Lemma 2.2 of kleijn2012bernstein?, we get \[\label{eq:MLE32normality} \sqrt{n_{0}} (\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*}) \overset{d}{\to} \text{Normal}_{d}(\boldsymbol{0}, \boldsymbol{\Sigma}_{\theta^{*}})\tag{19}\] where \(\boldsymbol{\Sigma}_{\theta^{*}} = \boldsymbol{V}_{\theta^{*}}^{-1} \boldsymbol{W}_{\theta^{*}} \boldsymbol{V}_{\theta^{*}}^{-1}\) is the sandwich matrix. Furthermore, with \(n\{ 1-\alpha_{n}\} \to +\infty\), \(\frac{\alpha_{n}}{1-\alpha_{n}} \to \frac{\alpha}{1-\alpha} \in (0, +\infty)\), and \(r\) continuous we have \[\begin{align} n_{0}r^{2}_{n-n_{0}} &= \alpha_{n} n r^{2}\left( n \left\{ 1 - \alpha_{n} \right\} \right) = \frac{\alpha_{n}}{1-\alpha_{n}} n\{1-\alpha_{n}\} r^{2}(n\{1-\alpha_{n}\}) \to +\infty, \end{align}\] which we can combine with (19 ) to obtain \[r_{n-n_{0}}^{-1} (\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*}) +\boldsymbol{\theta}^{*} = \underbrace{ \{ n_{0}r_{n-n_{0}}^{2} \}^{-\frac{1}{2}} }_{\to 0} \underbrace{ \sqrt{n_{0}} ( \hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*} )}_{= O_{P}(1)} +\boldsymbol{\theta}^{*} \overset{\mathbb{P}_{*}}{\to} \boldsymbol{\theta}^{*}\] by Slutsky’s theorem. Note that the normal density in the upper bound of (18 ) converges to a Dirac delta at \(\boldsymbol{\theta}^{*}\), while the region of integration remains constant and bounded away from \(\boldsymbol{\theta}^{*}\). Hence, the integral converges to \(0\) and we have \[\label{eq:term32132convergence} \mathbb{I}^{c}_{n,\gamma}(\boldsymbol{\theta}^{*}, r_{n-n_{0}}) =\int_{B^{c}(\boldsymbol{\theta}^{*}, r_{n-n_{0}})} R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta} \overset{\mathbb{P}_{*}}{\to} 0.\tag{20}\]
Next, to address the first term of the decomposition in equation (16 ), we rewrite \(r(x) = x^{-\frac{1}{2}} r^{*}(x)\) for some positive function \(r^{*}(x) = r_{x}^{*}\) in \(\mathbb{R}_{+}\), denote the convergence rate by \(\delta_{x} = x^{-\frac{1}{2}}\), and use a change of variables to write \[\label{eq:term32232split} \begin{align} &\mathbb{I}_{n,\gamma}(\boldsymbol{\theta}^{*}, r_{n-n_{0}}) =\int_{B(\boldsymbol{\theta}^{*},r_{n-n_{0}})} R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta} \\ &= \int_{B(\boldsymbol{\theta}^{*}, r_{n-n_{0}})} R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \boldsymbol{\theta}^{*}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta} \;\left\{ R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}^{*}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \right\} \\ &= \underbrace{ \int_{B(\boldsymbol{0},r^{*}_{n-n_{0}})} R^{\gamma}_{n_{0}+1:n}\left( \boldsymbol{\theta}^{*} + \delta_{n-n_{0}}\boldsymbol{\theta} , \boldsymbol{\theta}^{*} \right) \phi\left( \boldsymbol{\theta} \left| \delta_{n-n_{0}}^{-1}(\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*}), \tfrac{n-n_{0}}{n_{0}} \boldsymbol{V}_{\theta^{*}}^{-1} \right. \right) \;d\boldsymbol{\theta} }_{:= \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}})} \\ &\quad \times R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}^{*}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) = \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) \;R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}^{*}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}). \end{align}\tag{21}\] Before proceeding, we define the auxiliary quantities \[\begin{align} a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, \boldsymbol{\theta}) &:= \log R_{n_{0}+1:n}\left( \boldsymbol{\theta}^{*} + \delta_{n-n_{0}} \boldsymbol{\theta} , \boldsymbol{\theta}^{*} \right) - \boldsymbol{\theta}' \delta_{n-n_{0}}\dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}) + \tfrac{1}{2}\boldsymbol{\theta}' \boldsymbol{V}_{\theta^{*}} \boldsymbol{\theta} \\ a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}) &:= \sup_{\boldsymbol{\theta} \in B(\boldsymbol{0}, r^{*})} \left| a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, \boldsymbol{\theta}) \right|, \end{align}\] so, for all \(\boldsymbol{\theta} \in B(\boldsymbol{0}, r^{*}_{n-n_{0}})\), we can write \[\label{eq:ratio32upper32bound} \begin{align} \log R_{n_{0}+1:n}\left( \boldsymbol{\theta}^{*} + \delta_{n-n_{0}}\boldsymbol{\theta} , \boldsymbol{\theta}^{*} \right) &= a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, \boldsymbol{\theta}) + \boldsymbol{\theta}'\delta_{n-n_{0}}\dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}) - \tfrac{1}{2}\boldsymbol{\theta}' \boldsymbol{V}_{\theta^{*}} \boldsymbol{\theta} \\ &\leq \left| a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, \boldsymbol{\theta}) \right| + \boldsymbol{\theta}'\delta_{n-n_{0}}\dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}) - \tfrac{1}{2}\boldsymbol{\theta}' \boldsymbol{V}_{\theta^{*}} \boldsymbol{\theta} \\ &\leq a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) + \boldsymbol{\theta}'\delta_{n-n_{0}}\dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}) - \tfrac{1}{2}\boldsymbol{\theta}' \boldsymbol{V}_{\theta^{*}} \boldsymbol{\theta}, \end{align}\tag{22}\] and analogously, we have \[\label{eq:ratio32lower32bound} \log R_{n_{0}+1:n}\left( \boldsymbol{\theta}^{*} + 
\delta_{n-n_{0}}\boldsymbol{\theta} , \boldsymbol{\theta}^{*} \right) \geq -a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) + \boldsymbol{\theta}'\delta_{n-n_{0}}\dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}) - \tfrac{1}{2}\boldsymbol{\theta}' \boldsymbol{V}_{\theta^{*}} \boldsymbol{\theta}.\tag{23}\] In what follows, we use inequalities (22 ) and (23 ), and Lemma 2.1 of kleijn2012bernstein? to find asymptotically tight upper and lower bounds of \(\mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}})\). Starting with the upper bound, we write \[\label{eq:integral32upper32bound} \begin{align} &\mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) \leq \exp\left\{ \gamma a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}})\right\} \\ &\quad \times \int_{B(\boldsymbol{0},r^{*}_{n-n_{0}})} e^{ \gamma\boldsymbol{\theta}'\delta_{n-n_{0}}\dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}) - \tfrac{1}{2}\boldsymbol{\theta}'\gamma \boldsymbol{V}_{\theta^{*}} \boldsymbol{\theta} } \phi\left( \boldsymbol{\theta} \left| \delta_{n-n_{0}}^{-1}(\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*}), \tfrac{n-n_{0}}{n_{0}} \boldsymbol{V}_{\theta^{*}}^{-1} \right. \right) \;d\boldsymbol{\theta} \\ &= \left(\tfrac{n_{0}}{\gamma n + (1-\gamma) n_{0}}\right)^{\frac{d}{2}} e^{\boldsymbol{b}_{n, \gamma}(\boldsymbol{\theta}^{*}) + \gamma a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}})} \int_{B(\boldsymbol{0},r^{*}_{n-n_{0}})} \phi\left( \boldsymbol{\theta} \left| \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}), \boldsymbol{\Omega}^{-1}_{n,\gamma} (\boldsymbol{\theta}^{*}) \right. \right) \;d\boldsymbol{\theta}, \end{align}\tag{24}\] where the quantities \[\begin{align} \boldsymbol{\Omega}_{n,\gamma} (\boldsymbol{\theta}^{*}) &= \tfrac{\gamma n + (1-\gamma)n_{0}}{n-n_{0}}\boldsymbol{V}_{\theta^{*}}, \\ \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}) &= \boldsymbol{\Omega}^{-1}_{n,\gamma} (\boldsymbol{\theta}^{*}) \left\{ \gamma \delta_{n-n_{0}} \dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}) + \tfrac{n_{0}}{(n-n_{0})^{\frac{1}{2}}} \boldsymbol{V}_{\theta^{*}} (\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*}) \right\}, \end{align}\] were identified by the usual quadratic form for combining Gaussian kernels, and the notation \[\begin{align} \boldsymbol{b}_{n, \gamma}(\boldsymbol{\theta}^{*}) &= -\tfrac{1}{2}\left\{ (\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*})' n_{0} \boldsymbol{V}_{\theta^{*}}(\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*}) - \boldsymbol{\mu}'_{n,\gamma}(\boldsymbol{\theta}^{*}) \boldsymbol{\Omega}_{n,\gamma} (\boldsymbol{\theta}^{*}) \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}) \right\}, \end{align}\] was introduced to simplify the expressions that follow. Analogously, the lower bound is of the form \[\begin{gather} \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) \geq \left(\tfrac{n_{0}}{\gamma n + (1-\gamma) n_{0}}\right)^{\frac{d}{2}} \exp\left\{ \boldsymbol{b}_{n, \gamma}(\boldsymbol{\theta}^{*}) - \gamma a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) \right\} \\ \int_{B(\boldsymbol{0},r^{*}_{n-n_{0}})} \phi\left( \boldsymbol{\theta} \left| \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}), \boldsymbol{\Omega}^{-1}_{n,\gamma} (\boldsymbol{\theta}^{*}) \right. 
\right) \;d\boldsymbol{\theta}, \end{gather}\] which differs from the upper bound in equation (24 ) only by the presence of a minus sign in front of \(\gamma a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}})\). Leveraging this last fact allows us to write \[\label{eq:bound32term322} \begin{align} \left| \log\left\{ \frac{ \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) e^{-\boldsymbol{b}_{n,\gamma}(\boldsymbol{\theta}^{*})} \left( \tfrac{n_{0}}{\gamma n + (1-\gamma)n_{0}} \right)^{-\frac{d}{2}} }{ \int_{B(\boldsymbol{0}, r^{*}_{n-n_{0}})} \phi \left(\boldsymbol{\theta} | \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}), \boldsymbol{\Omega}^{-1}_{n, \gamma}(\boldsymbol{\theta}^{*}) \right) d\boldsymbol{\theta} } \right\} \right| \leq \gamma a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}), \end{align}\tag{25}\] which we can use along with the triangle inequality to obtain the inequality \[\begin{gather} \label{eq:bound32term32232final} \left| \log \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) - \log \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}) \right| \\ \leq \underbrace{ \gamma a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}})}_{(\text{I})} + \underbrace{ \left| \log \int_{B(\boldsymbol{0}, r^{*}_{n-n_{0}})} \phi \left(\boldsymbol{\theta} | \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}), \boldsymbol{\Omega}^{-1}_{n, \gamma}(\boldsymbol{\theta}^{*}) \right) \;d\boldsymbol{\theta} \right|}_{(\text{II})}, \end{gather}\tag{26}\] for all \(r^{*}_{n-n_{0}} > 0\), where \[\mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}) = \exp\left\{ b_{n,\gamma}(\boldsymbol{\theta}^{*}) +\tfrac{d}{2} \log\left( \tfrac{n_{0}}{\gamma n + (1-\gamma)n_{0}} \right) \right\}.\] Therefore, establishing the limiting distribution of \(\mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}})\) becomes easier if we show that the upper bound above converges to \(0\) in \(\mathbb{P}_{*}\)-probability for some choice of function \(r^{*}(x) = r^{*}_{x}\). Note that any continuous choice of \(r^{*}(x)\) for which \(r(x) = \delta(x) r^{*}(x)\) still satisfies (17 ), or equivalently, for which \[\label{eq:radii32condition322} \begin{align} \lim_{x \to +\infty} r_{x}^{*} &= +\infty, & \lim_{x \to +\infty} x^{-\frac{1}{2}}r_{x}^{*} &= 0, \end{align}\tag{27}\] holds, guarantees that (20 ) remains valid. Starting with term (I) of equation (26 ), Lemma 2.1 of kleijn2012bernstein? states that we have \(a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}) \overset{\mathbb{P}_{*}}{\to} 0\) for any fixed \(r^{*}\). So, in particular, if \(\{ \tilde{r}_{m} \}_{m=1}^{+\infty}\) is a sequence of positive radii satisfying limiting properties analogous to equation (27 ), then the convergence of \(a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, \tilde{r}_{m})\) still holds for all fixed \(m \geq 1\), but may fail if we directly take the limit of \(a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, \tilde{r}_{n-n_{0}})\). To address this, similar to the proof of Theorem 2.1 of kleijn2012bernstein?, it is always possible to construct a monotonic subsequence \(\{ \tilde{r}_{m_{n}} \}_{n=1}^{+\infty}\) that grows sufficiently slowly to ensure that \[0 \leq a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, \tilde{r}_{m_{n-n_{0}}}) \leq a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, \tilde{r}_{m_{n}}) \overset{\mathbb{P}_{*}}{\to} 0\] and with \(\tilde{r}_{m_{n}} \leq \tilde{r}_{n}\) for all \(n\), while maintaining \(\lim_{n \to +\infty} \tilde{r}_{m_{n-n_{0}}} = +\infty\).
Extending such a sequence to a continuous function on the positive real numbers (e.g., with linear interpolation) results in a function \(r^{*}_{x} = r^{*}(x)\) satisfying equation (27 ) and such that \[\label{eq:bound32convergence321} a_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) \overset{\mathbb{P}_{*}}{\to} 0,\tag{28}\] as desired. Before addressing term (II) of equation (26 ), note that by the central limit theorem we have \[\label{eq:score32normality} \begin{align} \delta_{n-n_{0}}\dot{\ell}_{n_{0}+1}^{n}( \boldsymbol{\theta}^{*}) = \frac{1}{\sqrt{n-n_{0}}}\sum_{i=n_{0}+1}^{n} \frac{\partial}{\partial\boldsymbol{\theta}} \log [y_{i} | \boldsymbol{\theta}]\Bigg|_{\theta = \theta^{*}} \overset{d}{\to} \text{Normal}_{d}\left( \boldsymbol{0}, \boldsymbol{W}_{\theta^{*}} \right), \end{align}\tag{29}\] so, from equations (19 ) and (29 ), we know \[\label{eq:mean32and32variance32convergence} \begin{align} \boldsymbol{\Omega}_{n,\gamma} (\boldsymbol{\theta}^{*}) &= \tfrac{\gamma n + (1-\gamma)n_{0}}{n-n_{0}}\boldsymbol{V}_{\theta^{*}} = \tfrac{\gamma n + (1-\gamma)\lceil \alpha_{n} n \rceil}{n-\lceil \alpha_{n} n \rceil }\boldsymbol{V}_{\theta^{*}} \overset{\mathbb{P}_{*}}{\to} \tfrac{\gamma + (1-\gamma)\alpha}{1-\alpha }\boldsymbol{V}_{\theta^{*}}, \\ \boldsymbol{\mu}_{n,\gamma} (\boldsymbol{\theta}^{*}) &= \boldsymbol{\Omega}^{-1}_{n,\gamma} (\boldsymbol{\theta}^{*}) \left\{ \gamma \delta_{n-n_{0}} \dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}) + \tfrac{n_{0}}{(n-n_{0})^{\frac{1}{2}}} \boldsymbol{V}_{\theta^{*}} (\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*}) \right\} \\ &= \boldsymbol{\Omega}^{-1}_{n,\gamma} (\boldsymbol{\theta}^{*}) \bigg\{ \left( \tfrac{n_{0}}{n-n_{0}} \right)^{\frac{1}{2}} \underbrace{ \boldsymbol{V}_{\theta^{*}} \sqrt{n_{0}} (\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*})}_{\to \boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}} \boldsymbol{z}_{0}} + \gamma \underbrace{\delta_{n-n_{0}} \dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*})}_{\to \boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}}\boldsymbol{z}_{1}} \bigg\} \\ &\overset{d}{\to} \tfrac{1-\alpha }{\gamma + (1-\gamma)\alpha}\boldsymbol{V}_{\theta^{*}}^{-1} \boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}} \bigg\{ \left(\tfrac{\alpha}{1-\alpha}\right)^{\frac{1}{2}} \boldsymbol{z}_{0} + \gamma \boldsymbol{z}_{1} \bigg\}, \end{align}\tag{30}\] where \(\boldsymbol{z}_{0}, \boldsymbol{z}_{1} \overset{\text{iid}}{\sim} \text{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\), and \(\boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}}\) is the left factor of the Cholesky decomposition of \(\boldsymbol{W}_{\theta^{*}}\). Note that the independence between \(\boldsymbol{z}_{0}\) and \(\boldsymbol{z}_{1}\) is not immediate, because we only established convergence in distribution, but it comes from the fact that \(\delta_{n-n_{0}} \dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*})\) and \(\boldsymbol{V}_{\theta^{*}} \sqrt{n_{0}} (\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*})\) are independent for all \(n\), since they depend on disjoint subsets of \(\boldsymbol{y}_{1:n}\).
Next, considering term (II) of equation (26 ), we can use a change of variables to write \[\begin{gather} \label{eq:annoying32integral} \int_{B(\boldsymbol{0}, r^{*}_{n-n_{0}})} \phi \left(\boldsymbol{\theta} | \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}), \boldsymbol{\Omega}^{-1}_{n, \gamma}(\boldsymbol{\theta}^{*}) \right) \;d\boldsymbol{\theta} \\ = \int_{B(\boldsymbol{0}, 1)} \phi \left( \boldsymbol{\theta} \left| \{r^{*}_{n-n_{0}}\}^{-1} \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}), \{r^{*}_{n-n_{0}}\}^{-2} \boldsymbol{\Omega}^{-1}_{n, \gamma}(\boldsymbol{\theta}^{*}) \right. \right) \;d\boldsymbol{\theta}, \end{gather}\tag{31}\] but we have \(r^{*}_{n-n_{0}} \to +\infty\), so \[\begin{align} \{r^{*}_{n-n_{0}}\}^{-1} \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}) &\overset{\mathbb{P}_{*}}{\to} \boldsymbol{0} &&\text{and}& \{r^{*}_{n-n_{0}}\}^{-2} \boldsymbol{\Omega}^{-1}_{n, \gamma}(\boldsymbol{\theta}^{*}) &\overset{\mathbb{P}_{*}}{\to} \boldsymbol{0}, \end{align}\] which imply that the densities on the r.h.s. of equation (31 ) converge to a point mass at \(\boldsymbol{0}\), while the region of integration is a fixed neighborhood around \(\boldsymbol{0}\). Therefore the integral must converge to 1, and we have \[\label{eq:bound32convergence322} \left| \log \int_{B(\boldsymbol{0}, r^{*}_{n-n_{0}})} \phi \left(\boldsymbol{\theta} | \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}), \boldsymbol{\Omega}^{-1}_{n, \gamma}(\boldsymbol{\theta}^{*}) \right) \;d\boldsymbol{\theta} \right| \overset{\mathbb{P}_{*}}{\to} 0,\tag{32}\] which, when considering (26 ) and (28 ), implies \[\label{eq:bound32convergence32final} \left| \log \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) - \log \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}) \right| \overset{\mathbb{P}_{*}}{\to} 0.\tag{33}\] Next, by assumption we have \(\hat{\boldsymbol{\theta}}_{n_{0}+1}^{n} \overset{\mathbb{P}_{*}}{\to} \boldsymbol{\theta}^{*}\) and \[\begin{align} \log R_{n_{0}+1:n}(\boldsymbol{\theta}^{*}, \boldsymbol{\theta}) = -\tfrac{n-n_{0}}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}^{*})' \boldsymbol{V}_{\theta^{*}}(\boldsymbol{\theta} - \boldsymbol{\theta}^{*}) + o(\{n-n_{0}\}\| \boldsymbol{\theta} - \boldsymbol{\theta}^{*} \|_{2}^{2}), \end{align}\] for \(\boldsymbol{\theta} \to \boldsymbol{\theta}^{*}\), so \[\begin{gather} \log R_{n_{0}+1:n}(\boldsymbol{\theta}^{*}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \\ = -\tfrac{n-n_{0}}{2} (\hat{\boldsymbol{\theta}}_{n_{0}+1}^{n} - \boldsymbol{\theta}^{*})'\boldsymbol{V}_{\theta^{*}}(\hat{\boldsymbol{\theta}}_{n_{0}+1}^{n} - \boldsymbol{\theta}^{*}) + o\left( \{n -n_{0} \} \| \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n} - \boldsymbol{\theta}^{*} \|_{2}^{2} \right), \end{gather}\] which we can combine with the asymptotic normality of the MLE to show that \[\label{eq:liklihood32ratio32convergence} \log R_{n_{0}+1:n}(\boldsymbol{\theta}^{*}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \overset{d}{\to} -\tfrac{1}{2} \tilde{\boldsymbol{z}}_{1}' \underbrace{( \boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}} )' \boldsymbol{V}_{\theta^{*}}^{-1} \boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}}}_{:= \boldsymbol{M}_{\theta^{*}}} \tilde{\boldsymbol{z}}_{1} = -\tfrac{1}{2} \tilde{\boldsymbol{z}}_{1}' \boldsymbol{M}_{\theta^{*}} \tilde{\boldsymbol{z}}_{1},\tag{34}\] where \(\tilde{\boldsymbol{z}}_{1} \sim \text{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\). 
Similarly, from (19 ) and (30 ), we get \[\label{eq:integral32convergence} \begin{align} \log \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}) &= - \tfrac{n_{0}}{2}(\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*})' \boldsymbol{V}_{\theta^{*}} (\hat{\boldsymbol{\theta}}_{1}^{n_{0}} - \boldsymbol{\theta}^{*}) + \tfrac{1}{2}\boldsymbol{\mu}'_{n,\gamma}(\boldsymbol{\theta}^{*}) \boldsymbol{\Omega}_{n,\gamma} (\boldsymbol{\theta}^{*}) \boldsymbol{\mu}_{n,\gamma}(\boldsymbol{\theta}^{*}) \\ &\quad + \tfrac{d}{2} \log\left( \tfrac{n_{0}}{\gamma n + (1-\gamma) n_{0}} \right) \\ &\overset{d}{\to} \tfrac{1-\alpha }{2(\gamma + (1-\gamma)\alpha)} \left\{ \left(\tfrac{\alpha}{1-\alpha}\right)^{\frac{1}{2}} \boldsymbol{z}_{0} + \gamma \boldsymbol{z}_{1} \right\}' ( \boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}} )' \boldsymbol{V}_{\theta^{*}}^{-1} \boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}} \left\{ \left(\tfrac{\alpha}{1-\alpha}\right)^{\frac{1}{2}} \boldsymbol{z}_{0} + \gamma \boldsymbol{z}_{1} \right\} \\ & \quad -\tfrac{1}{2}\boldsymbol{z}_{0}' ( \boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}} )' \boldsymbol{V}_{\theta^{*}}^{-1} \boldsymbol{W}_{\theta^{*}}^{\frac{1}{2}} \boldsymbol{z}_{0} + \tfrac{d}{2} \log\left( \tfrac{\alpha}{\gamma + (1-\gamma)\alpha} \right) \\ &= \tfrac{1-\alpha }{2(\gamma + (1-\gamma)\alpha)} \left\{ \left(\tfrac{\alpha}{1-\alpha}\right)^{\frac{1}{2}} \boldsymbol{z}_{0} + \gamma \boldsymbol{z}_{1} \right\}' \boldsymbol{M}_{\theta^{*}} \left\{ \left(\tfrac{\alpha}{1-\alpha}\right)^{\frac{1}{2}} \boldsymbol{z}_{0} + \gamma \boldsymbol{z}_{1} \right\} \\ & \quad -\tfrac{1}{2} \boldsymbol{z}_{0}' \boldsymbol{M}_{\theta^{*}} \boldsymbol{z}_{0} + \tfrac{d}{2} \log\left( \tfrac{\alpha}{\gamma + (1-\gamma)\alpha} \right) =: \log \mathbb{J}_{\theta^{*},\gamma} (\boldsymbol{z}_{0}, \boldsymbol{z}_{1}), \end{align}\tag{35}\] where \(\boldsymbol{z}_{0},\boldsymbol{z}_{1} \overset{\text{ind}}{\sim} \text{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\). We highlight that Lemma 2.2 of kleijn2012bernstein? implies that \[\left\| \delta_{n-n_{0}}^{-1} (\hat{\boldsymbol{\theta}}_{n_{0}+1}^{n} - \boldsymbol{\theta}^{*}) - \delta_{n-n_{0}} \boldsymbol{V}_{\theta^{*}}^{-1} \dot{\ell}_{n_{0}+1}^{n}(\boldsymbol{\theta}^{*}) \right\|_{2} \overset{\mathbb{P}_{*}}{\to} 0,\] thus, when considering the convergence of \(\mathbb{I}_{n,\gamma}(\boldsymbol{\theta}^{*},r^{*}_{n-n_{0}})\), the random variables \(\tilde{\boldsymbol{z}}_{1}\) and \(\boldsymbol{z}_{1}\), in (34 ) and (35 ) respectively, must be almost surely identical, and for simplicity we denote both of them by \(\boldsymbol{z}_{1}\). 
Combining (35 ) with equation (33 ) yields \[\label{eq:annoying32integral32convergence} \log \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) \overset{d}{\to} \log \mathbb{J}_{\theta^{*},\gamma}(\boldsymbol{z}_{0}, \boldsymbol{z}_{1}),\tag{36}\] and from equations (34 ) and (36 ) we have \[\label{eq:term32232convergence} \begin{align} &\log \mathbb{I}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) = \gamma \log R_{n_{0}+1:n}(\boldsymbol{\theta}^{*}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) + \log \mathbb{J}_{n,\gamma}(\boldsymbol{\theta}^{*}, r^{*}_{n-n_{0}}) \\ &\overset{d}{\to} \tfrac{d}{2} \log\left( \tfrac{\alpha}{\gamma + (1-\gamma)\alpha} \right) -\tfrac{1}{2} \boldsymbol{z}_{0}' \boldsymbol{M}_{\theta^{*}} \boldsymbol{z}_{0} -\tfrac{\gamma}{2} \boldsymbol{z}_{1}' \boldsymbol{M}_{\theta^{*}} \boldsymbol{z}_{1} \\ & \quad +\tfrac{1-\alpha }{2(\gamma + (1-\gamma)\alpha)} \left\{ \left( \tfrac{\alpha}{1-\alpha} \right)^{\frac{1}{2}} \boldsymbol{z}_{0} + \gamma \boldsymbol{z}_{1} \right\}' \boldsymbol{M}_{\theta^{*}} \left\{ \left(\tfrac{\alpha}{1-\alpha}\right)^{\frac{1}{2}} \boldsymbol{z}_{0} + \gamma \boldsymbol{z}_{1} \right\} \\ &= \tfrac{d}{2} \log\left( \tfrac{\alpha}{\gamma + (1-\gamma)\alpha} \right) - \left\{ \tfrac{\gamma (1-\alpha)}{2(\gamma + (1-\gamma) \alpha)} \right\} \boldsymbol{z}_{0}'\boldsymbol{M}_{\theta^{*}}\boldsymbol{z}_{0} - \left\{ \tfrac{\gamma \alpha}{2(\gamma + (1-\gamma) \alpha)} \right\} \boldsymbol{z}_{1}' \boldsymbol{M}_{\theta^{*}} \boldsymbol{z}_{1} \\ & \quad +\left\{ \tfrac{\gamma \alpha^{\frac{1}{2}} (1-\alpha)^{\frac{1}{2}}}{\gamma + (1-\gamma) \alpha} \right\} \boldsymbol{z}_{0}'\boldsymbol{M}_{\theta^{*}} \boldsymbol{z}_{1} \\ &= \tfrac{d}{2} \log\left( \tfrac{\alpha}{\gamma + (1-\gamma)\alpha} \right) - \tfrac{\gamma}{2(\gamma + (1-\gamma) \alpha)} \left\{ (1-\alpha)^{\frac{1}{2}} \boldsymbol{z}_{0} - \alpha^{\frac{1}{2}} \boldsymbol{z}_{1} \right\}' \boldsymbol{M}_{\theta^{*}} \left\{ (1-\alpha)^{\frac{1}{2}} \boldsymbol{z}_{0} - \alpha^{\frac{1}{2}} \boldsymbol{z}_{1} \right\} \\ &= \tfrac{d}{2} \log\left( \tfrac{\alpha}{\gamma + (1-\gamma)\alpha} \right) - \tfrac{\gamma}{2(\gamma + (1-\gamma) \alpha)} \boldsymbol{z}' \boldsymbol{M}_{\theta^{*}} \boldsymbol{z} = \log \mathbb{I}_{\theta^{*},\gamma}(\boldsymbol{z}), \end{align}\tag{37}\] where \(\boldsymbol{z} = (1-\alpha)^{\frac{1}{2}}\boldsymbol{z}_{0} - \alpha^{\frac{1}{2}}\boldsymbol{z}_{1} \sim \text{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\). Finally, the triangle inequality, the continuous mapping theorem, and equations (16 ), (20 ), and (37 ) imply that \[\begin{align} \int R^{\gamma}_{n_{0}+1:n}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{n_{0}+1}^{n}) \;\phi(\boldsymbol{\theta} | \hat{\boldsymbol{\theta}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \;d\boldsymbol{\theta} \overset{d}{\to} \mathbb{I}_{\theta^{*},\gamma}(\boldsymbol{z}), \end{align}\] for all \(\gamma \geq 0\). Importantly, when choosing a set of constants \(\gamma_{1}, \dots, \gamma_{k} > 0\) and accounting for the dependence between the corresponding random variables \[\mathbb{I}_{\theta^{*},\gamma_{1}}(\boldsymbol{z}^{(1)}), \dots, \mathbb{I}_{\theta^{*},\gamma_{k}}(\boldsymbol{z}^{(k)}),\] we must have \(\boldsymbol{z}^{(i)} = \boldsymbol{z}^{(j)}\) almost surely for all \(i,j \in \{1, \dots, k\}\). 
Hence, considering that the limiting distribution \(\mathbb{I}_{\theta^{*},\gamma}(\boldsymbol{z})\) is absolutely continuous with respect to the Lebesgue measure \(\lambda\) (and therefore places no mass on the \(\lambda\)-null set of discontinuity points of \(f\)), an application of the continuous mapping theorem concludes the proof. ◻
Lastly, we provide the proof of Theorem 1 below.
Proof of Theorem 1. Choosing \(f:(0,1)^{2} \to \mathbb{R}\) with \(f(x,y) = \frac{x^{2}}{y}\) and applying Lemma 2 yields \[\label{eq:convergence32surrogate} \begin{align} \text{RESS}_{\phi}(n | n_{0}) = \frac{\left\{ \mathbb{I}_{n,1}(\boldsymbol{\theta}^{*}) \right\}^{2}}{ \mathbb{I}_{n,2}(\boldsymbol{\theta}^{*}) } \overset{d}{\to} \frac{\{ \mathbb{I}_{\theta^{*},1}(\boldsymbol{z}) \}^{2}}{\mathbb{I}_{\theta^{*},2}(\boldsymbol{z})}, \end{align}\tag{38}\] where \[\begin{align} \frac{\{ \mathbb{I}_{\theta^{*},1}(\boldsymbol{z}) \}^{2}}{\mathbb{I}_{\theta^{*},2}(\boldsymbol{z})} &= \frac{ \left( \frac{\alpha}{1+(1-1)\alpha} \right)^{2\frac{d}{2}} \exp\left\{ -\frac{2}{2(1+(1-1)\alpha)} \boldsymbol{z}'\boldsymbol{M}_{\theta^{*}} \boldsymbol{z} \right\} }{\left( \frac{\alpha}{2+(1-2)\alpha} \right)^{\frac{d}{2}} \exp\left\{ -\frac{2}{2(2+(1-2)\alpha)} \boldsymbol{z}'\boldsymbol{M}_{\theta^{*}} \boldsymbol{z} \right\}} \\ &= \left\{ \alpha (2-\alpha) \right\}^{\frac{d}{2}} \exp\left\{- \tfrac{1-\alpha}{2-\alpha} \boldsymbol{z}'\boldsymbol{M}_{\theta^{*}} \boldsymbol{z} \right\}, \end{align}\] and \(\boldsymbol{z} \sim \text{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\). Next, note that from Lemma 1, we have \[\left| \text{RESS}(n|n_{0}) - \text{RESS}_{\phi}(n|n_{0}) \right| \overset{\mathbb{P}_{*}}{\to} 0,\] which, combined with (38 ), implies \[\text{RESS}(n|n_{0}) \overset{d}{\to} \left\{ \alpha (2-\alpha) \right\}^{\frac{d}{2}} \exp\left\{- \tfrac{1-\alpha}{2-\alpha} \boldsymbol{z}'\boldsymbol{M}_{\theta^{*}} \boldsymbol{z} \right\},\] for all \(\alpha \in (0, 1)\), where \(\boldsymbol{z} \sim \text{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\). ◻
For the sake of reproducibility, in what follows we present details of the implementations mentioned in Section 4. Section 10.1 presents the MCMC sampler used, and Section 10.2 describes the RAISOR sampler used, along with general guidance regarding implementation choices.
For both methods we fit the NNGP approximation of the geostatistical model given by \[\begin{align} \boldsymbol{y}_{1:n} | \boldsymbol{\beta}, \sigma^{2}, \tau^{2}, \phi &\sim \text{Normal}_{n}(\boldsymbol{X}\boldsymbol{\beta}, \boldsymbol{\Sigma}(\sigma^{2}, \tau^{2}, \phi)), \\ \boldsymbol{\beta} &\sim \text{Normal}_{p}(\boldsymbol{\mu}_{\beta}, \boldsymbol{\Sigma}_{\beta}), \\ \sigma^{2} &\sim \text{Inverse-Gamma}\left( \tfrac{\alpha_{1}}{2}, \tfrac{\alpha_{2}}{2} \right), \\ \tau^{2} &\sim \text{Uniform}(0, 1), \\ \phi &\sim \text{Half-Normal}(0, \gamma^{2}), \end{align}\] with the resulting posterior distribution of the form \[[\boldsymbol{\beta}, \sigma^{2}, \tau^{2}, \phi | \boldsymbol{y}_{1:n}] \propto [\boldsymbol{y}_{1:n} | \boldsymbol{\beta}, \sigma^{2}, \tau^{2}, \phi] [\boldsymbol{\beta}] [\sigma^{2}] [\tau^{2}] [\phi].\] To use the approximation effectively, we write \(\boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^{2}}\boldsymbol{L}'\boldsymbol{L}\), where \(\boldsymbol{L}\) is a sparse lower triangular matrix (see katzfuss2021general?), and we use \(\boldsymbol{L}\) instead of \(\boldsymbol{\Sigma}\) in all calculations, leveraging sparse linear algebra routines when appropriate.
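To make the role of \(\boldsymbol{L}\) concrete, the short R sketch below (not the code used for our experiments; it assumes \(\boldsymbol{L}\) is lower triangular with a positive diagonal, as produced by the NNGP factorization) evaluates the Gaussian log-likelihood directly from the sparse factor, using \(\log|\boldsymbol{\Sigma}| = n \log \sigma^{2} - 2\sum_{i} \log L_{ii}\).

```r
## Sketch: NNGP Gaussian log-likelihood evaluated from the sparse factor L,
## where Sigma^{-1} = (1/sigma2) L'L and L is lower triangular with positive diagonal.
library(Matrix)

nngp_loglik <- function(y, X, beta, sigma2, L) {
  n <- length(y)
  r <- as.vector(L %*% (y - X %*% beta))   # L (y - X beta)
  # log|Sigma| = n * log(sigma2) - 2 * sum(log(diag(L)))
  -0.5 * n * log(2 * pi * sigma2) + sum(log(diag(L))) - 0.5 * sum(r^2) / sigma2
}
```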
For the MCMC approach, we use a Metropolis-within-Gibbs sampler. Considering \(\boldsymbol{\beta}\) and \(\sigma^{2}\) we leverage conjugacy to sample from their full-conditional distributions given by \[\begin{align} \boldsymbol{\beta} | \cdot &\sim \text{Normal}_{p}( \tilde{\boldsymbol{\mu}}_{\beta}, \tilde{\boldsymbol{\Sigma}}_{\beta} ), & \sigma^{2} | \cdot &\sim \text{Inverse-Gamma}\left( \tfrac{\tilde{\alpha}_{1}}{2}, \tfrac{\tilde{\alpha}_{2}}{2} \right), \end{align}\] where \(\tilde{\boldsymbol{\mu}}_{\beta} = \tilde{\boldsymbol{\Sigma}}_{\beta} \left( \frac{1}{\sigma^{2}} \boldsymbol{X}' \boldsymbol{L}'\boldsymbol{L} \boldsymbol{y} + \boldsymbol{\Sigma}^{-1}_{\beta} \boldsymbol{\mu}_{\beta}\right )\), \(\tilde{\boldsymbol{\Sigma}}_{\beta} = \left( \frac{1}{\sigma^{2}}\boldsymbol{X}'\boldsymbol{L}'\boldsymbol{L}\boldsymbol{X} + \boldsymbol{\Sigma}_{\beta}^{-1} \right)^{-1}\), \(\tilde{\alpha}_{1} = \alpha_{1} + n\), and \(\tilde{\alpha}_{2} = \alpha_{2} + (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta})' \boldsymbol{L}'\boldsymbol{L} (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta})\).
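A minimal sketch of these two conjugate draws is given below. It is not the implementation used in our experiments; the argument names simply mirror the prior hyperparameters above, and the sparse factor \(\boldsymbol{L}\) is assumed to be available.

```r
## Sketch: Gibbs draws for beta and sigma2 from their full conditionals,
## using the sparse factor L so that X' Sigma^{-1} X = (1/sigma2) (LX)'(LX).
draw_beta_sigma2 <- function(y, X, L, sigma2, mu_beta, Sigma_beta, alpha1, alpha2) {
  LX <- L %*% X
  Ly <- L %*% y
  XtLtLX <- as.matrix(crossprod(LX))          # X' L'L X
  XtLtLy <- as.matrix(crossprod(LX, Ly))      # X' L'L y
  Sigma_beta_inv <- solve(Sigma_beta)

  Sigma_tilde <- solve(XtLtLX / sigma2 + Sigma_beta_inv)
  mu_tilde <- Sigma_tilde %*% (XtLtLy / sigma2 + Sigma_beta_inv %*% mu_beta)
  beta <- as.vector(mu_tilde + t(chol(Sigma_tilde)) %*% rnorm(ncol(X)))

  resid <- as.vector(Ly - LX %*% beta)        # L (y - X beta)
  a1 <- alpha1 + length(y)
  a2 <- alpha2 + sum(resid^2)
  sigma2_new <- 1 / rgamma(1, shape = a1 / 2, rate = a2 / 2)   # Inverse-Gamma draw
  list(beta = beta, sigma2 = sigma2_new)
}
```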
For \(\tau^{2}\) and \(\phi\) we consider a joint Metropolis update, which reduces the number of times we need to construct \(\boldsymbol{L}\) and evaluate the likelihood function (note that, even with the NNGP approximation, these operations remain the most costly for moderate \(n\)). For convenience, we work with the transformed parameters \((\tilde{\tau}^{2}, \tilde{\phi}) = \left(\log \frac{\tau^{2}}{1-\tau^{2}}, \log \phi \right)\), and use a random walk proposal with noise arising from a \(\text{Normal}_{2} (\boldsymbol{0}, \sigma^{2}_{\text{tune},n} \boldsymbol{I}_{2} )\), where \(\sigma^{2}_{\text{tune},n}\) is a hyperparameter that needs to be tuned for each sample size \(n\) to avoid poor mixing. Using a few test runs for guidance, we took \(\log \sigma^{2}_{\text{tune},n_{1:6}} = (0.93, 0.34, -0.86, -1.91, -2.68, -3.35)'\). Additionally, for each chain we adapted \(\sigma^{2}_{\text{tune},n}\) during the burn-in iterations to improve mixing.
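The sketch below illustrates this joint random-walk update on the transformed scale, including the change-of-variables Jacobian. It is not the code used in our experiments: `build_L()` and `loglik()` are hypothetical helpers that rebuild the sparse factor and evaluate the NNGP log-likelihood (e.g., a function like `nngp_loglik()` above), and `gamma_sd` stands for the half-normal scale \(\gamma\).

```r
## Sketch: joint random-walk Metropolis update for (tau2, phi) on the scale
## (logit(tau2), log(phi)). The uniform prior on tau2 contributes only through its
## support, which the logit transform enforces; the half-normal prior on phi enters
## through the normal kernel dnorm(phi, 0, gamma_sd).
update_tau2_phi <- function(tau2, phi, L, curr_loglik, sigma2_tune, gamma_sd,
                            build_L, loglik) {
  prop <- c(log(tau2 / (1 - tau2)), log(phi)) + rnorm(2, sd = sqrt(sigma2_tune))
  tau2_p <- 1 / (1 + exp(-prop[1]))
  phi_p <- exp(prop[2])
  L_p <- build_L(tau2_p, phi_p)
  loglik_p <- loglik(L_p)
  log_ratio <- loglik_p - curr_loglik +
    dnorm(phi_p, 0, gamma_sd, log = TRUE) - dnorm(phi, 0, gamma_sd, log = TRUE) +
    log(tau2_p * (1 - tau2_p) * phi_p) - log(tau2 * (1 - tau2) * phi)  # Jacobians
  if (log(runif(1)) < log_ratio) {
    list(tau2 = tau2_p, phi = phi_p, L = L_p, loglik = loglik_p, accepted = TRUE)
  } else {
    list(tau2 = tau2, phi = phi, L = L, loglik = curr_loglik, accepted = FALSE)
  }
}
```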
Similarly to other importance sampling-based techniques, RAISOR requires user choices during implementation. In what follows, we describe those and provide general heuristic guidance. Note that because RAISOR can be used to sample from different families of models, optimal choices are likely to be model-specific, and therefore the heuristics provided are meant to be initial values in a tuning process, rather than default choices. We later provide details on how the RAISOR used in Section 4 was implemented.
The first choice to consider is that of the approximating family of distributions \([\boldsymbol{\theta} | \boldsymbol{\eta}]\). As with any other approximation, a suitable choice has to meet basic general requirements, such as having the same support as the target partial posterior \([\boldsymbol{\theta}|\boldsymbol{y}_{1:n}]\); however, because \([\boldsymbol{\theta} | \boldsymbol{\eta}]\) fulfills the role of an IS distribution, a theoretically good choice should also satisfy the following additional criteria:
(sampling) one can efficiently generate \(\boldsymbol{\theta}_{1}, \dots, \boldsymbol{\theta}_{M} \overset{\text{iid}}{\sim} [\boldsymbol{\theta} | \boldsymbol{\eta}]\) for any choice of \(\boldsymbol{\eta}\);
(tractability) \([\boldsymbol{\theta} | \boldsymbol{\eta}]\) is analytically available up to a proportionality constant and cheap to evaluate;
(closeness) for all \(\varepsilon > 0\), there is \(\boldsymbol{\eta}\) such that \(D_{\chi^{2}}([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] \| [\boldsymbol{\theta} | \boldsymbol{\eta}] ) < \varepsilon\).
The sampling condition ensures fast replenishment, tractability makes it possible to calculate exact IS weights (assuming \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\) is also tractable), and closeness guarantees that one can always increase \(\text{RESS}(n|n_{0})\) after replenishment for some choice of \(\boldsymbol{\eta}\). When \(\boldsymbol{\theta}\) is continuous, the first two can be easily satisfied by common density estimators such as kernel density estimation (KDE) and infinite mixtures of normals applied to some transformation of the original parameter \(g(\boldsymbol{\theta})\), while mixtures of discrete distributions can generally address the case of \(\boldsymbol{\theta}\) discrete. Closeness requires a bit more consideration because it also depends on the posterior, but is usually guaranteed if \([\boldsymbol{\theta}|\boldsymbol{\eta}]\) is a family of non-parametric density (or probability mass) estimators with heavy tails (e.g., mixtures of \(t\) distributions or KDE with a \(t\) kernel). This is related to the robustness of the IS estimator (see, for instance, hesterberg1995weighted? and owen2000safe?) and, although relevant from a theoretical point of view, its practical implications may depend on the problem at hand, as discussed in delyon2021safe?.
Regarding the measure of discrepancy in Algorithm 2, used to determine which element from the family of approximations is best, one convenient possibility is to take \[D_{f}\bigg( [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] \bigg\| [\boldsymbol{\theta} | \boldsymbol{\eta}] \bigg) = \mathbb{E}_{\boldsymbol{\theta} \sim [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]} \left\{ \phi\left( \frac{[\boldsymbol{\theta} | \boldsymbol{\eta}]}{[\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]} \right) \right\},\] where \(\phi : [0,+\infty) \to (-\infty, +\infty]\) is a convex function such that \(\phi(x) < +\infty\) for all \(x > 0\), \(\phi(1) = 0\), and \(\phi(0) = \lim_{x \to 0^{+}} \phi(x)\). This family of divergences was introduced by csiszar1967information? and, given a weighted sample \(\{\boldsymbol{\theta}_{m}, w( \boldsymbol{\theta}_{m}) \}_{m=1}^{M}\) from \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\), can be consistently estimated using weighted averages, resulting in an optimization of the form \[\label{eq:optimal32approximation} \hat{\boldsymbol{\eta}} = \arg\min_{\boldsymbol{\eta}} \hat{D}_{f}\bigg( [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] \bigg\| [\boldsymbol{\theta} | \boldsymbol{\eta}] \bigg) = \arg\min_{\boldsymbol{\eta}} \frac{1}{M} \sum_{m=1}^{M} w(\boldsymbol{\theta}_{m}) \;\phi\left( \frac{[\boldsymbol{\theta}_{m} | \boldsymbol{\eta}]}{[\boldsymbol{\theta}_{m} | \boldsymbol{y}_{1:n}]} \right).\tag{39}\] A few common instances are \(\phi(x) = \frac{(x-1)^{2}}{x}\), \(\phi(x) = -\log x\), and \(\phi(x) = \frac{1}{2}|x-1|\), which correspond to the \(\chi^{2}\)-divergence, the KL divergence, and the total variation distance, respectively.
Although the \(\chi^{2}\)-divergence is theoretically attractive due to its direct connection to the RESS (see Appendix 11), when making this choice, one must also account for the computational cost of obtaining \(\hat{\boldsymbol{\eta}}\). For this reason, the KL divergence is significantly more popular in the context of AIS because it empirically achieves a good trade-off between computational efficiency and quality of the approximation. In this case, the optimization reduces to \[\hat{\boldsymbol{\eta}} = \arg\max_{\boldsymbol{\eta}} \frac{1}{M} \sum_{m=1}^{M} w(\boldsymbol{\theta}_{m}) \log [\boldsymbol{\theta}_{m} | \boldsymbol{\eta}],\] which is equivalent to weighted maximum likelihood estimation. Hence, depending on the choice of \([\boldsymbol{\theta} | \boldsymbol{\eta}]\), solutions can be obtained analytically or with the use of efficient algorithms. For instance, when choosing \([\boldsymbol{\theta} | \boldsymbol{\eta}]\) to be a mixture of normals, the optimization can be done using a weighted version of the Expectation Maximization (EM) algorithm dempster1977maximum?, for which many efficient implementations are readily available.
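As a concrete, deliberately simplified illustration, the sketch below carries out the weighted maximum likelihood step for a single multivariate normal component, for which the solution is available in closed form; the mixture case used in our applications replaces this closed form with a weighted EM pass (e.g., via mclust).

```r
## Sketch: weighted MLE of a single multivariate normal component, i.e., the
## minimizer of the weighted KL objective when [theta | eta] is a single Gaussian.
fit_normal_weighted <- function(theta, w) {
  # theta: M x d matrix of particles; w: importance weights on any positive scale
  w <- w / sum(w)
  mu <- colSums(w * theta)                 # weighted mean
  centered <- sweep(theta, 2, mu)
  Sigma <- crossprod(sqrt(w) * centered)   # sum_m w_m (theta_m - mu)(theta_m - mu)'
  list(mu = mu, Sigma = Sigma)
}
```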
Additionally, the cost of the optimization in (39 ) depends on the size \(M\) of the weighted sample \(\{ \boldsymbol{\theta}_{m}, w_{m} \}_{m=1}^{M}\) targeting \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\), which typically has a relatively small ESS at replenishment times. Therefore, we can use sample reduction techniques to obtain a smaller weighted sample \(\{ \boldsymbol{\theta}^{*}_{m}, w^{*}_{m} \}_{m=1}^{N}\) that also targets the posterior and has a similar ESS. For instance, one can use the sampling-importance-resampling (SIR) technique rubin1988using? to obtain a smaller unweighted sample, although the introduction of potentially repeated samples tends to result in a substantial reduction in ESS. To circumvent this, similarly to delyon2021safe?, we suggest using a weighted version of SIR, described in Algorithm 9 and sketched below, which corresponds to running SIR until \(N\) unique values are sampled and then grouping repeated values using weights proportional to the number of replicates.
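The following sketch reflects our reading of this weighted SIR step (argument names are illustrative, not from the original implementation): indices are resampled until \(N\) unique particles appear, and each unique particle receives a weight proportional to its replicate count.

```r
## Sketch: weighted SIR reduction of a weighted sample {theta_m, w_m}_{m=1}^M
## to N unique particles with weights proportional to their replicate counts.
## Assumes N does not exceed the number of particles with positive weight.
weighted_sir <- function(theta, w, N) {
  p <- w / sum(w)
  counts <- integer(length(p))
  while (sum(counts > 0) < N) {
    i <- sample.int(length(p), size = 1, prob = p)
    counts[i] <- counts[i] + 1
  }
  keep <- which(counts > 0)
  list(theta = theta[keep, , drop = FALSE],
       w = counts[keep] / sum(counts[keep]))
}
```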
When considering the initialization of RAISOR, one has two choices: begin the procedure targeting the prior distribution \([\boldsymbol{\theta}]\) or a partial posterior \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]\). Choosing the prior can work well if sampling is easily available and the distribution is reasonably informative, because a vague prior can lead to unstable weight updates. Otherwise, it is generally better to bypass potentially ill-behaved initial iterations by initially sampling from a partial posterior and proceeding as usual. In either case, sampling can be done in a variety of ways (e.g., direct sampling, MCMC, or IS), as long as it results in a (potentially weighted) sample targeting the corresponding distribution. The optimal choice of \(n_{0}\) will depend on multiple factors, but our experiments indicate that \(n_{0} \geq 10\) is sufficient to ensure numerical stability of the updates for posterior distributions with weakly informative priors.
Regarding the batch size \(b\) (see Section 2), note that including all observations in a single batch corresponds to an importance PP-RB update (i.e., no replenishment in between), while taking each batch to contain a single observation significantly increases the communication cost, so the ideal value lies between these two extremes. As argued in Section 2, taking replenishment times to grow exponentially avoids sample degeneracy, indicating that the batch size \(b = b(n_{k})\) could depend on the sample size \(n_{k}\) of the current partial posterior \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{k}}]\) and increase geometrically without introducing instability. From this, a strategy could be to take \(b(n_{k}) = \left\lceil \tfrac{1}{\alpha} n_{k} \right\rceil - n_{k}\) for some \(\alpha \in (0,1)\). As discussed in Section 3, we can use the upper bound on \(\text{RESS}_{\theta^{*}, \alpha}\) to narrow down the range of reasonable values of \(\alpha\), as \[\begin{align} b(n_{k}) &> \{c(r_{\text{min}}, d)-1\} n_{k} &&\Longrightarrow &\text{RESS}(n_{k}+b(n_{k}) | n_{k}) &< r_{\text{min}}, \end{align}\] for large values of \(n_{k}\), where \(c(r_{\text{min}}, d)\) is as defined in Section 3 and \(r_{\text{min}}\) represents the smallest RESS tolerable during model fitting (see below for more details). Therefore, heuristically we can take \[\alpha \in \left( r_{\text{min}}^{\frac{2}{d}} \left\{ 1+ \sqrt{1 - r_{\text{min}}^{\frac{2}{d}}} \right\}^{-1}, 1 \right),\] but note that the lower bound above is optimistic in the sense that it assumes perfect replenishment, so a value of \(\alpha\) between these limits is a reasonable starting point (e.g., the geometric mean of the endpoints of the interval above). Alternative heuristics may consider the entire limiting distribution rather than just its upper bound, but we leave this for future work.
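For reference, a small helper implementing this heuristic is sketched below; the value of \(d\) in the usage line is purely illustrative, while \(r_{\text{min}} = 0.1\) matches the setting used in our applications.

```r
## Sketch: admissible interval for alpha implied by the heuristic above, together
## with the geometric-mean starting value; r_min is the smallest tolerable RESS.
alpha_heuristic <- function(r_min, d) {
  lower <- r_min^(2 / d) / (1 + sqrt(1 - r_min^(2 / d)))
  c(lower = lower, upper = 1, start = sqrt(lower))   # start = geometric mean of (lower, 1)
}
alpha_heuristic(r_min = 0.1, d = 4)   # d = 4 is illustrative only
```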
After taking \(n_{k+1} = n_{k} + b(n_{k})\), due to the stochasticity inherent in the limiting distribution, it is not possible to guarantee that the estimated \(\widehat{\text{RESS}}(n_{k+1}|n_{k})\) (see Appendix 11) calculated with the sample \(\{\boldsymbol{\theta}_{m}, w(\boldsymbol{\theta}_{m})\}_{m=1}^{M}\) targeting the partial posterior distribution \([\boldsymbol{\theta}| \boldsymbol{y}_{1:n_{k+1}} ]\) will remain above the threshold \(r_{\text{min}}\). This can be problematic because a low value of the RESS generally leads to a poor estimator of the objective function \(D_{f}\), resulting in a poor approximation of \([ \boldsymbol{\theta} | \boldsymbol{y}_{1:n_{k+1}} ]\) and consequently an unstable procedure. To mitigate this issue, we considered two options. The first is to calculate the updated weights for a sequence of \(B\) batch sizes \(b_{1}(n_{k}) < \cdots < b_{B}(n_{k})\) (which can be exponentially spaced) and take the weights corresponding to the largest batch that yields an RESS above \(r_{\text{min}}\), as sketched below. Alternatively, one can use annealed IS to bridge the distributions with controlled stability until an RESS above \(r_{\text{min}}\) is achieved.
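A sketch of the first option follows; `loglik_batch(b)` is a hypothetical helper returning, for each particle, the log-likelihood of the next \(b\) observations given that particle, and `w_star` holds the current (unnormalized) weights.

```r
## Sketch: try a grid of candidate batch sizes and keep the updated weights for the
## largest batch whose estimated RESS stays at or above r_min; a NULL return signals
## that the annealed IS fallback described above is needed.
choose_batch <- function(w_star, batch_sizes, loglik_batch, r_min) {
  best <- NULL
  for (b in sort(batch_sizes)) {
    logw <- log(w_star) + loglik_batch(b)
    w <- exp(logw - max(logw))               # stabilize before exponentiating
    ress <- mean(w)^2 / mean(w^2)
    if (ress >= r_min) best <- list(b = b, w = w, ress = ress)
  }
  best
}
```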
We provide guidance regarding the Monte Carlo sample size \(M\), the replenishment threshold \(r\) and the minimal threshold \(r_{\text{min}}\), based on a reference ESS value \(N\). We define \(N\) as the number of independent samples that would be needed to satisfactorily estimate the densities (or p.m.f.s) of the partial posterior distributions \(\{ [ \boldsymbol{\theta} | \boldsymbol{y}_{1:k} ] \}_{k=0}^{n}\) and to obtain posterior estimates or predictions with satisfactory accuracy. Because \(N\) is case dependent and varies with the model in question, the chosen family of approximations, the dimension of \(\boldsymbol{\theta}\), and how the final posterior samples will be used, we do not provide specific guidance on how to make this choice, but note that a similar assessment is often implicitly done when using other sampling techniques, such as MCMC and AIS.
For a weighted sample of size \(M\) with RESS \(r_{\text{min}}\), the ESS is given by the product \(M r_{\text{min}}\), so, for a specified value of \(N\), a natural idea is to impose the constraint \(M r_{\text{min}} \geq N\), because it ensures that the ESS remains high enough throughout the procedure to reliably construct approximations of the partial posteriors and to complete downstream inferential and prediction tasks. Additionally, we include the constraint \(r > r_{\text{min}}\) because replenishment must occur before the RESS drops below \(r_{\text{min}}\) in order to avoid numerical instability.
As we presented in Section 2, replenishment steps are relatively expensive due to the cost of recomputing IS weights. Considering that an increase in \(r_{\text{min}}\) indirectly results in more replenishment steps, we choose \(M\) to minimize \(r_{\text{min}}\) under the aforementioned constraints (i.e., \(M\) should be taken as large as possible), which has the added benefit of increasing the compatibility of the method with parallelization. Considering the weighted SIR procedure to reduce the cost of obtaining the approximation \([\boldsymbol{\theta} |\hat{\boldsymbol{\eta}}]\), discussed previously, the main limiting factor for the choice of \(M\) in this case is storage, because the memory requirement of the procedure will be at least \(O(Md)\) while the cost of optimizing \(\hat{D}_{f}\) will scale with the reduced sample size.
When choosing \(r\), two relevant aspects need to be considered. The first is the ratio \(r/r_{\text{min}} > 1\), which creates a region of RESS values where replenishment can be performed without numerical instability. This is relevant when using batched weight updates because large batches tend to accelerate the drop in RESS, potentially leading to values below \(r_{\text{min}}\) (see the section on heuristics for batch sizes for strategies that account for these scenarios). Thus, increasing \(r\) can help mitigate the need for potentially expensive interventions; however, the second consideration is that increasing \(r\) also increases the frequency of replenishment steps. For well-specified models with low-dimensional parameters (\(d \leq 6\)), our experiments indicate that a ratio around 2 (i.e., \(r = 2 r_{\min}\)) yields good results.
Considering the applications presented in Section 4, we made similar implementation choices for all data sets. We assessed that \(N = 5000\) independent samples would be sufficient to estimate the posterior density and to perform all tasks of interest (in our case prediction). Therefore, to make the results compatible with MCMC, we used a total of \(M = 50000\) IS samples with \(r_{\text{min}} = 0.1\) and \(r = 2r_{\text{min}} = 0.2\).
For convenience, we considered the reparametrization \(\tilde{\boldsymbol{\theta}} = \left(\boldsymbol{\beta}', \log \sigma^{2}, \log \frac{\tau^2}{1-\tau^{2}}, \log \phi \right)'\) to ensure that the support is \(\mathbb{R}^{p+3}\), and chose the family of approximating distributions \([\tilde{\boldsymbol{\theta}} | \boldsymbol{\eta}]\) to be a mixture of multivariate normals. Although we suspected that the posterior density contains regions with heavier tails than a normal, there were no signs of strong instability during model fitting, implying that even if such regions exist their influence is negligible for all practical purposes. For computational efficiency, we chose the KL divergence between our approximation and the target as our objective function, and used the weighted EM algorithm implemented in the R package mclust scrucca2023model? to perform the optimization. Knowing that our posterior distribution does not exhibit strong multi-modality but that the covariance parameters are highly dependent, we conservatively used 10 components in the mixture to account for this dependence.
We followed the batching strategy described previously and considered the values \(\alpha \in \left\{ \frac{1}{2}, \frac{2}{3}, \frac{3}{4} \right\}\). All values of \(\alpha\) resulted in a similar computational cost, but because \(\alpha = \frac{1}{2}\) showed some signs of instability, we opted to use \(\alpha = \frac{2}{3}\) instead. We used a total of \(B = 20\) intermediate batch sizes, picking the largest value that resulted in weights with an RESS above \(r_{\text{min}}\). If the RESS fell below \(r_{\text{min}}\) for all batch sizes, we chose the batch with the highest corresponding RESS and used annealed IS until generating a sample with \(\text{RESS} \geq r\). Regarding initialization, we considered directly sampling from the prior or using MCMC to sample from a partial posterior. Because the prior was chosen to be vague for some of the parameters, using it as the initial IS distribution required many annealed IS correction steps for stability and had an overall computational cost similar to sampling from \([\tilde{\boldsymbol{\theta}} | \boldsymbol{y}_{1:10}]\) using MCMC. Thus, for simplicity we used the latter in all simulations.
In this section we present a well-known connection between importance sampling, the \(\chi^{2}\)-divergence, and the RESS, and we provide insight into the limiting behavior of the RESS presented in Theorem 1.
Considering a target distribution \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\) and a proposal \([\boldsymbol{\theta} | \boldsymbol{\eta}]\), we can express the \(\chi^{2}\)-divergence between them as \[\label{eq:chi-squared32to32var} \begin{align} D_{\chi^{2}}\bigg( [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] \bigg\| [\boldsymbol{\theta} | \boldsymbol{\eta}] \bigg) &= \int \frac{([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] - [\boldsymbol{\theta} | \boldsymbol{\eta}])^{2}}{[\boldsymbol{\theta} | \boldsymbol{\eta}]} \;d\boldsymbol{\theta} = \int \frac{[\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]^{2}}{[\boldsymbol{\theta} | \boldsymbol{\eta}]} \;d\boldsymbol{\theta} - 1^{2} \\ &= \mathbb{E}_{\boldsymbol{\theta} \sim [\boldsymbol{\theta} | \boldsymbol{\eta}]}\left( w(\boldsymbol{\theta})^{2} \right) - \left\{ \mathbb{E}_{\boldsymbol{\theta} \sim [\boldsymbol{\theta} | \boldsymbol{\eta}]}(w(\boldsymbol{\theta})) \right\}^{2} =\text{Var}_{\boldsymbol{\theta} \sim [\boldsymbol{\theta} | \boldsymbol{\eta}]}( w(\boldsymbol{\theta}) ), \end{align}\tag{40}\] where \(w(\boldsymbol{\theta}) = \frac{[\boldsymbol{\theta | \boldsymbol{y}_{1:n}}]}{[\boldsymbol{\theta} | \boldsymbol{\eta}]}\). This connection implies that minimizing the variance of the IS weights is equivalent to minimizing the chi-squared divergence between target and proposal, and thus we can use the weights to infer the quality of \([\boldsymbol{\theta}|\boldsymbol{\eta}]\) as an approximation of \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\).
When defining the RESS between \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\) and \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]\) as \[\label{eq:RESS2} \text{RESS}(n | n_{0}) = \left\{ 1 + D_{\chi^{2}} \left( [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}] \left. \vphantom{\tfrac{0}{0}} \right\| [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \right) \right\}^{-1},\tag{41}\] which lies in \([0,1]\), equation (40 ) allows us to reinterpret this quantity as the proportion of information retained when using a weighted sample from \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]\) to target \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\). This interpretation comes from the comparison between the variance of a simple Monte Carlo estimator \(\hat{I}_{n}(f)\) and an IS estimator \(\tilde{I}_{n}(f)\) with proposal \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]\), which typically has a higher variance. Because of this, traditionally the RESS with respect to a function \(f\) would be defined as \[\text{RESS}(f) = \frac{\text{Var}_{\boldsymbol{\theta} \sim [\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]}\left( \left. \hat{I}_{n}(f) \right| \boldsymbol{y}_{1:n} \right)}{\text{Var}_{\boldsymbol{\theta} \sim [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]}\left( \left. \tilde{I}_{n}(f) \right| \boldsymbol{y}_{1:n_{0}} \right)},\] but, under certain conditions (see Appendix A of martino2017effective?), this can be well-approximated by \[\label{eq:RESS32approximation} \text{RESS}(f) \approx \left\{1+ \text{Var}_{\boldsymbol{\theta} \sim [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]}\left( \left. w(\boldsymbol{\theta}) \right| \boldsymbol{y}_{1:n_{0}} \right) \right\}^{-1} = \frac{\left\{\int w(\boldsymbol{\theta}) [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \;d\boldsymbol{\theta} \right\}^{2}}{\int w^{2}(\boldsymbol{\theta}) [\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}] \;d\boldsymbol{\theta}},\tag{42}\] which remarkably does not depend on the integrand \(f\), and therefore can be seen as a universal measure of sample quality. For this reason, we take what would typically be considered an approximation as the definition and use it as a measure of the quality of a weighted sample, which we justify by writing \(\text{RESS}\) as a decreasing function of the \(\chi^{2}\)-divergence in equation (41 ). Note that equation (42 ) naturally leads to the estimator \[\label{eq:sample32RESS} \widehat{\text{RESS}}(n | n_{0}) = \frac{\left\{ \frac{1}{M}\sum_{m=1}^{M} w(\boldsymbol{\theta}_{m}) \right\}^{2}}{\frac{1}{M} \sum_{m=1}^{M} w^{2}(\boldsymbol{\theta}_{m})} = \frac{\left\{ \frac{1}{M}\sum_{m=1}^{M} \tilde{w}(\boldsymbol{\theta}_{m}) \right\}^{2}}{\frac{1}{M} \sum_{m=1}^{M} \tilde{w}^{2}(\boldsymbol{\theta}_{m})},\tag{43}\] where \(\{ \boldsymbol{\theta}_{m}, w^{*}(\boldsymbol{\theta}_{m}) \}_{m=1}^{M}\) is a weighted sample from \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]\), and \[w(\boldsymbol{\theta}) = w^{*}(\boldsymbol{\theta})[\boldsymbol{y}_{n_{0}+1:n} | \boldsymbol{y}_{1:n_{0}}, \boldsymbol{\theta}],\] which is relatively simple and fast to compute during model fitting and is the estimator we use to track the RESS in all applications.
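In code, the estimator in (43) is a one-liner; note that it is invariant to rescaling of the weights, so unnormalized weights can be passed directly (the sketch below is illustrative, not the exact implementation used in our experiments).

```r
## Sketch: RESS estimator of equation (43) from (possibly unnormalized) weights.
ress_hat <- function(w) mean(w)^2 / mean(w^2)

## e.g., with w = w_star * exp(log-likelihood of the new batch), ress_hat(w)
## tracks RESS(n | n0) as observations are incorporated.
```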
Recall that the limiting distribution in Theorem 1 admits a decomposition of the form \[\text{RESS}_{\theta^{*}, \alpha}(\boldsymbol{z}) = \left\{ \alpha (2-\alpha) \right\}^{\frac{d}{2}} \exp\left\{ -\tfrac{1-\alpha}{2-\alpha} \boldsymbol{z}'\boldsymbol{M}_{\theta^{*}}\boldsymbol{z} \right\} = u_{1}(\alpha) \cdot u_{2}(\alpha, \boldsymbol{z}),\] where \(\boldsymbol{z} \sim \text{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\), \(u_{1}(\alpha) = \{\alpha (2-\alpha)\}^{\frac{d}{2}}\), and \(u_{2}(\alpha, \boldsymbol{z}) = \exp\left\{ -\tfrac{1-\alpha}{2-\alpha}\boldsymbol{z}'\boldsymbol{M}_{\theta^{*}}\boldsymbol{z} \right\}\). In what follows, we use two examples to show that this decomposition leads to an intuitive understanding of the asymptotic behavior of \(\text{RESS}(n|n_{0})\), beginning with the following example:
Example 1. Let \(\boldsymbol{\theta}|\boldsymbol{y}_{1:n} \sim \textrm{Normal}_{d}(\boldsymbol{\theta}^{*}, n^{-1}\boldsymbol{V}_{\theta^{*}}^{-1})\) and \(\boldsymbol{\theta}|\boldsymbol{y}_{1:n_{0}} \sim \textrm{Normal}_{d}(\boldsymbol{\theta}^{*}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1})\) represent posterior distributions, where \(n_{0} = \alpha n \in \mathbb{N}^{*}\) and \(\phi(\cdot | \boldsymbol{\mu}, \boldsymbol{\Sigma})\) is the density of a multivariate normal random variable with mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\). Then we have \[\label{eq:factor321} \begin{align} \textrm{RESS}(n | n_{0}) &=\left\{ \int \frac{ \phi^{2}(\boldsymbol{\theta} | \boldsymbol{\theta}^{*}, n^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) }{ \phi( \boldsymbol{\theta} | \boldsymbol{\theta}^{*}, n_{0}^{-1} \boldsymbol{V}_{\theta^{*}}^{-1}) } \;d\boldsymbol{\theta} \right\}^{-1} \\ &= \left\{ \int \left(\tfrac{2\pi\alpha}{n}\right)^{-\frac{d}{2}} |\boldsymbol{V}_{\theta^{*}}|^{\frac{1}{2}} \exp\left\{ -\tfrac{n(2-\alpha)}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}^{*})'\boldsymbol{V}_{\theta^{*}}(\boldsymbol{\theta} - \boldsymbol{\theta}^{*}) \right\} \;d\boldsymbol{\theta} \right\}^{-1} \\ &= \left\{\alpha (2-\alpha)\right\}^{\frac{d}{2}} = u_{1}(\alpha). \end{align}\qquad{(3)}\]
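As a sanity check (our own illustration, not part of the paper's derivation), the sketch below estimates \(\text{RESS}(n|n_{0})\) by importance weighting between the two Gaussian posteriors of Example 1 and compares it with the closed form \(\{\alpha(2-\alpha)\}^{d/2}\); the choice \(\boldsymbol{V}_{\theta^{*}} = \boldsymbol{I}_{d}\) is for simplicity only.

```python
# Monte Carlo check of Example 1: sampling from the "coarse" posterior
# N_d(theta*, n0^{-1} V^{-1}) and weighting toward N_d(theta*, n^{-1} V^{-1})
# should give a RESS close to {alpha (2 - alpha)}^{d/2}.
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
d, n, alpha, M = 3, 1000, 0.5, 200_000
n0 = int(alpha * n)
V = np.eye(d)                    # V_{theta*}; identity chosen for simplicity
theta_star = np.zeros(d)

prop = mvn(theta_star, np.linalg.inv(n0 * V))   # [theta | y_{1:n0}]
targ = mvn(theta_star, np.linalg.inv(n * V))    # [theta | y_{1:n}]

theta = prop.rvs(M, random_state=rng)
log_w = targ.logpdf(theta) - prop.logpdf(theta)
w = np.exp(log_w - log_w.max())                 # rescaling does not change the RESS
ress = w.mean()**2 / (w**2).mean()
print(ress, (alpha * (2 - alpha))**(d / 2))     # both ~ 0.65 for d = 3, alpha = 0.5
```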
The result in Example 1 is not a coincidence: the connection comes from the asymptotic normality of the posterior distribution, and it shows that \(u_{1}(\alpha)\) describes the decrease in RESS caused by the difference in scale between \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}}]\) and \([\boldsymbol{\theta} | \boldsymbol{y}_{1:n}]\), as the posterior distribution concentrates around \(\boldsymbol{\theta}^{*}\) when \(n\) goes to \(+\infty\). Note that for any posterior based on i.i.d. observations the difference in scale is a consequence of the posterior concentration rate, which depends only on the sample size \(n\), and thus \(u_{1}(\alpha)\) is a deterministic function of \(n\) and \(n_{0}\). Intuitively, \(u_{2}(\alpha, \boldsymbol{z})\) should appear in the limit when we also account for the difference in location between the two distributions, which we confirm in the example below.
Example 2. Let \(\boldsymbol{y}_{1}, \dots, \boldsymbol{y}_{n} \overset{\text{iid}}{\sim} \textrm{Normal}_{d}(\boldsymbol{\theta}^{*}, \boldsymbol{V}_{\theta^{*}}^{-1} \boldsymbol{W}_{\theta^{*}} \boldsymbol{V}_{\theta^{*}}^{-1})\), and let \[\begin{align} \boldsymbol{\theta} | \boldsymbol{y}_{1:n} &\sim \textrm{Normal}_{d}(\bar{\boldsymbol{y}}_{1}^{n}, n^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) & &\text{and}& \boldsymbol{\theta} | \boldsymbol{y}_{1:n_{0}} &\sim \textrm{Normal}_{d}(\bar{\boldsymbol{y}}_{1}^{n_{0}}, n_{0}^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) \end{align}\] represent posterior distributions, where \(n_{0} = \alpha n \in \mathbb{N}^{*}\), \(\bar{\boldsymbol{y}}_{i}^{j} = \frac{1}{j-i+1} \sum_{k=i}^{j} \boldsymbol{y}_{k}\), and \(\phi(\cdot | \boldsymbol{\mu}, \boldsymbol{\Sigma})\) is the density of a multivariate normal random variable with mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\). Then it can be shown that \[\label{eq:factor322} \begin{align} \textrm{RESS}(n | n_{0}) &= \left\{ \int \frac{ \phi^{2}(\boldsymbol{\theta} | \bar{\boldsymbol{y}}_{1}^{n}, n^{-1}\boldsymbol{V}_{\theta^{*}}^{-1}) }{ \phi( \boldsymbol{\theta} | \bar{\boldsymbol{y}}_{1}^{n_{0}}, n_{0}^{-1} \boldsymbol{V}_{\theta^{*}}^{-1}) } \;d\boldsymbol{\theta} \right\}^{-1} \\ &= \left\{\tfrac{n_{0}(2n-n_{0})}{n^{2}}\right\}^{\frac{d}{2}} \exp\left\{ -\tfrac{nn_{0}}{2n-n_{0}} (\bar{\boldsymbol{y}}_{1}^{n} - \bar{\boldsymbol{y}}_{1}^{n_{0}})'\boldsymbol{V}_{\theta^{*}} (\bar{\boldsymbol{y}}_{1}^{n} - \bar{\boldsymbol{y}}_{1}^{n_{0}}) \right\} \\ &= \left\{ \alpha(2-\alpha)\right\}^{\frac{d}{2}}\exp\left\{ -\tfrac{1-\alpha}{2-\alpha} \boldsymbol{z}' \boldsymbol{M}_{\theta^{*}}\boldsymbol{z} \right\} = \textrm{RESS}_{\theta^{*},\alpha}(\boldsymbol{z}), \end{align}\qquad{(4)}\] where \(\boldsymbol{z} = \left\{ \frac{n}{(n-n_{0})n_{0}} \right\}^{-\frac{1}{2}} \boldsymbol{W}_{\theta^{*}}^{-\frac{1}{2}}\boldsymbol{V}_{\theta^{*}} (\bar{\boldsymbol{y}}_{n_{0}+1}^{n} - \bar{\boldsymbol{y}}_{1}^{n_{0}} ) \sim \textrm{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\).
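A similar brute-force check can be made for Example 2 (again our own toy illustration): for a single simulated data set, a weight-based Monte Carlo estimate of \(\text{RESS}(n|n_{0})\) should agree with the second line of equation (4), which involves only the two posterior means and \(\boldsymbol{V}_{\theta^{*}}\); here we take \(\boldsymbol{V}_{\theta^{*}} = \boldsymbol{W}_{\theta^{*}} = \boldsymbol{I}_{d}\), so the data covariance \(\boldsymbol{V}_{\theta^{*}}^{-1}\boldsymbol{W}_{\theta^{*}}\boldsymbol{V}_{\theta^{*}}^{-1}\) is the identity.

```python
# Brute-force check of the closed form in Example 2 for one simulated data set:
# compare a weight-based Monte Carlo estimate of RESS(n | n0) with the second
# line of the display, which depends only on the posterior means and V_{theta*}.
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(3)
d, n, alpha, M = 2, 400, 0.5, 200_000
n0 = int(alpha * n)
V = np.eye(d)
y = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)  # V^{-1} W V^{-1} = I
ybar_n, ybar_n0 = y.mean(axis=0), y[:n0].mean(axis=0)

prop = mvn(ybar_n0, np.linalg.inv(n0 * V))   # [theta | y_{1:n0}]
targ = mvn(ybar_n, np.linalg.inv(n * V))     # [theta | y_{1:n}]
theta = prop.rvs(M, random_state=rng)
w = np.exp(targ.logpdf(theta) - prop.logpdf(theta))
print("Monte Carlo RESS :", w.mean()**2 / (w**2).mean())

delta = ybar_n - ybar_n0
closed = (n0 * (2*n - n0) / n**2)**(d / 2) * np.exp(-n * n0 / (2*n - n0) * delta @ V @ delta)
print("Closed form      :", closed)
```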
Example 2 not only allows us to identify more clearly the source of randomness of the limit in Theorem 1, which comes from the randomness in the estimation of \(\boldsymbol{\theta}^{*}\) using \(\boldsymbol{y}_{1:n}\), but also provides an example we can use to understand the effect of model misspecification. Assuming a correctly specified model and under the classical regularity conditions, the matrices \(\boldsymbol{V}_{\theta^{*}}\) and \(\boldsymbol{W}_{\theta^{*}}\) are both equal to the Fisher information matrix at \(\boldsymbol{\theta}^{*}\), and we have the simplification \(\boldsymbol{M}_{\theta^{*}} = \boldsymbol{I}_{d}\). Considering now a misspecified model with \(\boldsymbol{W}_{\theta^{*}} = c\boldsymbol{V}_{\theta^{*}}\) for some \(c > 0\), we can rewrite (4) as \[\text{RESS}_{\theta^{*},\alpha}(\boldsymbol{z}) = \left\{ \alpha(2-\alpha)\right\}^{\frac{d}{2}}\exp\left\{- \tfrac{1-\alpha}{2-\alpha} c \boldsymbol{z}'\boldsymbol{z} \right\}.\] Taking \(c > 1\), which corresponds to a posterior distribution that underestimates the variance of the true data-generating process, therefore decreases \(\text{RESS}(n|n_{0})\), while the opposite holds for \(c < 1\). This implies that underestimation of uncertainty can significantly slow model fitting, because more replenishment steps are needed throughout, and that the fitting process itself could be used to detect anomalous observations or other simple forms of model misspecification. However, these connections are the subject of ongoing research.
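The effect of \(c\) can also be illustrated numerically (a sketch under the \(\boldsymbol{W}_{\theta^{*}} = c\boldsymbol{V}_{\theta^{*}}\) setting above, with \(d\), \(\alpha\), and the values of \(c\) chosen only for illustration): drawing \(\boldsymbol{z} \sim \text{Normal}_{d}(\boldsymbol{0}, \boldsymbol{I}_{d})\) and averaging \(\text{RESS}_{\theta^{*},\alpha}(\boldsymbol{z})\) shows the limiting RESS shrinking as \(c\) grows.

```python
# Illustration of misspecification: evaluate the limiting RESS
# {alpha(2 - alpha)}^{d/2} exp{-(1 - alpha)/(2 - alpha) * c * z'z}
# over draws z ~ Normal_d(0, I_d) for several values of c.
# Larger c (underestimated uncertainty) lowers the average limiting RESS.
import numpy as np

rng = np.random.default_rng(2)
d, alpha, reps = 3, 0.5, 100_000
z2 = rng.chisquare(d, size=reps)   # z'z for z ~ Normal_d(0, I_d)

for c in (0.5, 1.0, 2.0):
    ress = (alpha * (2 - alpha))**(d / 2) * np.exp(-(1 - alpha) / (2 - alpha) * c * z2)
    print(f"c = {c:3.1f}  mean limiting RESS = {ress.mean():.3f}")
```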