April 02, 2024
When building statistical models for Bayesian data analysis tasks, required and optional iterative adjustments and different modelling choices can give rise to numerous candidate models. In particular, checks and evaluations throughout the modelling process can motivate changes to an existing model or the consideration of alternative models to ultimately obtain models of sufficient quality for the problem at hand. Additionally, failing to consider alternative models can lead to overconfidence in the predictive or inferential ability of a chosen model. The search for suitable models requires modellers to work with multiple models without jeopardising the validity of their results. Multiverse analysis offers a framework for transparent creation of multiple models at once based on different sensible modelling choices, but the number of candidate models arising in the combination of iterations and possible modelling choices can become overwhelming in practice. Motivated by these challenges, this work proposes iterative filtering for multiverse analysis to support efficient and consistent assessment of multiple models and meaningful filtering towards fewer models of higher quality across different modelling contexts. Given that causal constraints have been taken into account, we show how multiverse analysis can be combined with recommendations from established Bayesian modelling workflows to identify promising candidate models by assessing predictive abilities and, if needed, tending to computational issues. We illustrate our suggested approach in different realistic modelling scenarios using real data examples.
Keywords: Bayesian workflow, safe iterative modelling, model evaluation, multiverse analysis.
When specifying statistical models to analyse data, a modeller faces a range of tasks, some of which can be ambiguous or are tackled ad-hoc in practice. In the Bayesian framework, the desire for principled pathways from initial models to reliable inferences and decision-making motivated the notion of Bayesian workflows [1]–[5]. These are explicit collections of required and recommended steps to ensure valid model development. The guidelines provided by Bayesian workflows can request the modeller to navigate potentially many iterations within and across models.
Consider the task of applying Bayesian models to evaluate the effect of a treatment in a randomised controlled trial. This requires the modeller to choose covariates, hierarchical model components, and distributional families for the observations and (marginal) priors. The modeller has to iterate, for example, if a model is of interest but not usable because computational issues result in posterior samples of insufficient quality. To inform decisions and understand the effect of treatment in the given study, the modeller investigates the posterior results for the quantity of interest as well as the predictive abilities of the models. Even models that are substantively and practically relevant may turn out to be inferior candidates, for example, due to poor predictive performance. While using one model without analysing alternatives risks an over- or underestimated effect of the treatment, investigating numerous models can result in an overwhelming amount of additional joint evaluation and model comparisons. Many candidate models are possible, but not all of these will be substantively relevant for assessing the effect of the treatment. As we demonstrate in Section 4.1 and Section 4.2, different models can also imply different conclusions about the treatment under evaluation. Various iterative adjustments in conjunction with different possible modelling choices unavoidably require the modeller to examine multiple models. However, working efficiently and consistently with sets of models is not straightforward; especially when iteration alters the set of considered models.
Motivated by these challenges, we explore how to support modellers throughout iterative Bayesian modelling by combining workflow tasks with multiverse analysis: the joint and transparent creation of multiple models based on different sensible modelling choices [6], [7].
In particular, we
propose iterative filtering for multiverse analysis as a tool for Bayesian modelling workflows that reduces large sets of candidate models using assessment of predictive abilities and evaluation of computation (see Figure 1 and Section 3)
investigate desirable properties of iterative Bayesian modelling workflows that can serve as a practical foundation for developing tools to support modellers (see Section 2)
illustrate the suggested filtering approach and the advantages of jointly evaluating models and computation in real data examples (see Section 4)
A Bayesian modelling workflow is a systematic collection of iterative steps and sequential decisions for building, evaluating, and comparing models that provides principled pathways towards valid models within a Bayesian framework [1]–[5]. Several aspects of Bayesian modelling necessitate iteration within and across models. For example, model checking can reveal a misspecified model or inherent aspects of the modelling task, like the complexity of a model, can complicate numerical approximations. Modifications can be essential for reliable posterior inferences, or to obtain posterior samples in the first place. Iteration can be desirable, for instance, if predictive performance is not sufficient or new data becomes available. The iterative facet of model building also implies that more resources like time or modelling ideas can motivate “remodelling” [8], that is, iteration towards improved models.
Building models iteratively has been a long-standing topic of concern and interest, especially in the Bayesian paradigm, potentially inspired by its ability to incorporate new evidence and its reliance on numerical approximations [4], [8]–[11]. [9] identifies learning as the main motivation of any modelling endeavour and, consequently, describes it as a process that unfolds in a “motivated iteration between theory and practice” [9]. Adopting the hypothetico-deductive perspective discussed by [12], models encapsulate hypotheses from which implications for observed data can be deduced, for example, by obtaining posterior replicates for the outcome of interest. These hypotheses are to be questioned, if, for example, the model is not providing sufficiently accurate predictions or is failing to capture essential aspects of the phenomenon of interest. As such, modelling necessarily evolves in an iteration between formulating hypotheses and testing their implications.
Even if existing scientific knowledge suggests an ideal model, it can still be infeasible to evaluate and debug this model directly. As discussed by [4], building models iteratively with increasing complexity is common in practice. At the same time, it can be challenging to make all considered models explicit. Moreover, due to the selective nature of any step-wise procedure, we inevitably risk overlooking a model that would contradict previous conclusions. Adjusting and refining models step-by-step seems intuitive, but it is not inherently efficient or transparent, and does not come with automatic guarantees of sufficient exploration.
Instead of hand-picking models or iteratively refining one model, multiverse analysis provides a framework for joint and transparent consideration of several candidate models [6]. An explicit multiverse of possible results is motivated by the “garden of forking paths” [16], that is, the issue of arriving at different models given variations in observed data. A “multiverse of models” emerges from a range of possible modelling choices [6].
The R-package multiverse [17] enables straightforward multiverse analyses from existing R code by adding branches at decision points in a given data analysis with multiverse::branch(). While simultaneously considering multiple models increases transparency and reduces the risk of overconfidence in a single model, models in a multiverse can appear equally plausible without additional assessment of the validity and relevance of models [7], [18]. Challenges in investigating results of multiverse analyses motivated ideas for explorable multiverse analysis reports [19] and an interactive workflow for multiverse analysis in [20]. The accompanying language-agnostic tool Boba [20] can distinguish models, for example, based on normalised root mean squared error. To assist a flexible consideration of multiple Bayesian models, [21], [22] introduce multi-model probabilistic programming for the probabilistic programming language Stan [23], and explore how to navigate a multiverse of models along paths of increasing predictive performance [22].
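For illustration, a minimal sketch of how such a decision point could be declared with the multiverse package is shown below; the data frame df, the outcome count, and the covariate Trt are hypothetical placeholders rather than part of the original analyses.

```r
library(multiverse)

M <- multiverse()

# One decision point with two options; each option defines one universe
# (here: the distributional family assumed for the observations).
inside(M, {
  fit <- branch(obs_family,
    "poisson" ~ glm(count ~ Trt, data = df, family = poisson()),
    "negbinomial" ~ MASS::glm.nb(count ~ Trt, data = df)
  )
})

# Evaluate the analysis in every universe of the multiverse
execute_multiverse(M)
```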
Albeit increasing transparency, multiverses ask the modeller to keep track of many variations of models, which might itself become unmanageable [6], [22]. A joint consideration of numerous candidate models can overwhelm and mislead due to the increased amount of parallel model evaluations and visual biases when reporting results [18]. To make use of the advantages of multiverse analysis within Bayesian workflows, the question is how to identify models of interest in potentially overwhelming and dynamic multiverses. In particular, modellers need tools for efficient and consistent joint evaluation of different aspects of models, including but also going beyond predictive performance.
Driven by the need to jointly investigate several models in Bayesian modelling workflows, we propose iterative filtering for multiverse analysis in connection with the three steps of a Bayesian data analysis task proposed by [24], that is,
(I) model building,
(II) conditioning on data and deriving posterior inference, and
(III) model evaluation.
Instead of building and evaluating models one by one in step (I), we immediately create multiple initial models using multiverse analysis, as illustrated in Figure 2.
In contrast to focusing on one model until it fails, we connect step (II) and step (III) by allowing for iterative improvements of computation and models while checking and evaluating multiple models jointly.
Specifically, we perform iterative filtering by assessing computational issues in inference, performing posterior predictive checks, evaluating predictive performance, and checking the reliability of estimates for predictive performance. Given that causal constraints have been taken into account, the suggested approach can reduce a set of candidates to a subset of models of higher quality, as illustrated in Figure 1. Iterative filtering for multiverse analysis considers a multiverse of models by design. Instead of building and choosing models based on opaque assessments, the suggested approach ensures transparency and provides clear evaluation criteria to systematically reduce a set of models. This avoids an arbitrary selection of models and allows comparisons between different modelling contexts.
This paper is structured as follows. Relations between iterative modelling and multiverse analysis are discussed in Section 2.1. We motivate the need for meaningful filtering in Section 2.2 and introduce the chosen filtering criteria in Section 2.3. Iterative filtering for multiverse analysis is described in Section 3 and connects the tasks of checking computation (see step (II) above and Section 3.2) as well as evaluating obtained models and their implied conclusions (see step (III) above and Section 3.1). Section 3.3 discusses how the suggested approach adds the possibility of iterating towards improved sets of models to the steps of a Bayesian data analysis task by [24]. In Section 4, real-data examples illustrate that iterative filtering allows clear communication about modelling choices and transparent reduction towards fewer models of higher quality. Section 5 summarises results, and discusses limitations of the presented approach as well as future directions.
For a given data analysis task that involves model building, potentially infinitely many candidate models are contained in a space of models \(\mathcal{M}\). As introduced already briefly in Section 1.2, a multiverse can be defined as a finite set of models \(M = \{M_1, \cdots, M_K\} \subset \mathcal{M}\), where each model \(M_k\) is characterised by a unique combination of modelling choices [6], [19], [25]. In a Bayesian framework, such choices need to be made, for example, when considering which distributional families for the observations are suitable or what different prior information can be encoded into the model. While [6] focus on multiverses emerging due to different data preprocessing, this work revolves around multiverses of Bayesian models. Each model \(M_k = p_k(y, \theta_k, {\rm{X}}_k)\) is defined by different selected covariates in \({\rm{X}}_k\), as well as distributions for the response \(y\), \(p_k(y \mid \theta_k)\), with parameters \(\theta_k\), and prior distributions \(p_k(\theta_k)\) for the same observed data \(y\).
Note that we do not suggest that any individual combination of modelling choices is the truth. Instead, we want to investigate how the breadth of analysis conclusions can be impacted by selecting only one of these combinations. Indeed, the ability to consider them all in parallel using multiverse analysis can provide a more representative view of possible conclusions than any single model could offer. Multiverse analysis offers a convenient framework for embedding transparency in Bayesian modelling workflows. It ensures accessibility of all modelling attempts and the explicit formulation of all considered modelling choices. Moreover, the possibility of joint exploration of several models is attractive since the modeller can iterate over multiple models as opposed to trying different models sequentially.
From the perspective of Bayesian model selection and Bayesian model averaging, multiverse analysis is likely interpreted as assuming the same weight for all models a priori: a best guess by the modeller as to which aspects could be relevant for modelling the problem at hand. Meaningful filtering criteria can enable transparent evaluation and ranking of models. This allows the modeller to update which models they consider to be more promising candidates.
As discussed in Section 1.2, models in a multiverse can seem equally plausible and a multiverse does not directly include validations and rankings of models [6], [7], [19]. Acknowledging the inherent limitations of every model, it is essential to identify which models are “importantly wrong” [9], that is, which models are not useful in the given modelling context and decision problem. We can validate models in a multiverse by checking model quality and evaluate them jointly with model comparison techniques. Filtering can reduce complexity by excluding candidates from the modelling process that turn out to be largely inferior candidates. Subsequently, the non-arbitrarily filtered set of models enables a more detailed investigation of implied conclusions of the remaining models.
[6], [19] and [18] emphasise that multiverse analyses should only consider models based on “reasonable” or “defensible” modelling choices to ensure meaningful comparisons between models. Naturally, the definition of a reasonable modelling choice can change over time and differ between modellers and fields; see, for example, “many-analysts” studies presented and discussed by [26] and [27]. We argue that it is less ambiguous to agree on a set of filtering criteria reflecting desired properties of models than to fix a generalisable notion of justified modelling choices. For consistent assessment of models within and across different modelling endeavours, filtering can incorporate shared indicators of model quality informed by desired model properties as well as widely accepted recommendations.
Filtering a set of models in a transparent and principled way requires explicit filtering criteria. These criteria can be used to encode the concept of a sufficiently good model in the filtering procedure. To define meaningful criteria, we make use of recommendations in Bayesian workflows summarised by [4] as well as ideas of “utility dimensions” of Bayesian models from [28]. In particular, our approach focuses on evaluating computation and predictive abilities as minimum requirements of a viable Bayesian model because
predictive performance is a well-established and widely adopted model selection criterion related to usefulness and practical applicability of a model [29]–[31];
predictive ability is connected to parameter recoverability for causally consistent models [32];
and evaluating convergence, efficiency, and reliability of computations as well as obtained estimates is necessary to ensure the validity of our results when relying on approximations and resampling methods [4], [28].
Multiverse analysis enables model building with several models, as opposed to considering one model at a time. When combined with the above filtering criteria, we can make use of the increased transparency while efficiently focusing required iterative adjustments on smaller sets of models. [22] searches a multiverse along paths of increasing predictive performance in a network of connected models. Similarly, predictive abilities are an important characteristic of a model in our suggested approach. Iterative filtering involves the evaluation of predictive performance in tandem with other criteria of model quality, as well as the implied conclusions about the quantities of interest. The suggested approach conserves some advantages of multiverse analysis by providing joint and transparent assessment of different aspects of multiple models. To work towards fewer models of higher quality, iterative filtering is not primarily concerned with a path through a network of models but instead with the distinction between inferior and more promising subsets of models therein.
As opposed to repeatedly considering one model at a time until it fails, we suggest a different approach by simultaneously evaluating multiple possible models, as illustrated in Figure 1.
For a set of models transparently created with multiverse analysis, the modeller obtains posterior results for the different models by conditioning on the observed data. In line with the goal of identifying useful models for applications and decision problems, a filtering effort starts and concludes by comparing predictive abilities of the models (see “Filtering by predictive abilities” in Figure 1). In contrast to model selection, the modeller takes the negative view: having produced a pool of candidate models, the goal is to remove all models not satisfying a given set of criteria reflecting predictive abilities of models. In particular, the proposed approach employs visual posterior predictive checks [2], [24], predictive performance evaluated by expected log point-wise predictive densities (elpd) [33], and the reliability of estimates of elpd [34].
To identify useful models, we need to investigate the hypotheses encapsulated in each model and, in particular, their implications for the observations. This process is often summarised as model checking and comprises an integral part of Bayesian data analysis [4], [12]. [2] explore and summarise the effective use of different visualisations for model checking and other essential parts of Bayesian modelling workflows. [35] show how confidence bands for the empirical cumulative distribution function (ECDF) of probability integral transformation values can enhance posterior predictive checking. For example, visual posterior predictive checks with ECDFs can be used to identify misspecified models by comparing observed data and posterior replicates of a given model (see examples in Figure 5 in Section 4.1 and Figure 13 in Appendix 8).
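As a minimal sketch of such a check, assuming fit is a fitted brms model from the multiverse and y is the corresponding vector of observed outcomes, the PIT-ECDF comparison could be produced as follows.

```r
library(brms)
library(bayesplot)

# Posterior replicates of the outcome for the fitted model
yrep <- posterior_predict(fit)

# PIT-ECDF posterior predictive check with simultaneous confidence bands
ppc_pit_ecdf(y = y, yrep = yrep, prob = 0.99)
```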
To assess and compare predictive performance for a set of models, we evaluate the predictive density of each model using the log score as a default utility [8], [36]. Following [36], [33] denote the expected log point-wise predictive density (elpd) for one model and \(n\) new observations as \[\begin{align} \label{eq:pointwise-pred-density} \textrm{elpd} = \sum_{i=1}^n \int p_t(\tilde{y_i}) \log p(\tilde{y_i} \mid y) d \tilde{y_i}, \end{align}\tag{1}\] where \(p_t(\tilde{y_i})\) is the unknown true data generating process for the new \(\tilde{y_i}\). Since we do not have access to \(p_t(\tilde{y_i})\), elpd can be estimated with cross-validation [36], [37], for example, using importance sampling following [38] and [39] to compute approximate leave-one-out cross-validation (LOO-CV). In particular, we obtain estimates \(\widehat{\mathrm{elpd}}\) with LOO-CV and Pareto smoothed importance sampling [40] to stabilise the importance weights for each data point [33].
To compare models in a set of models \(M = \{M_1, \cdots, M_K\}\), we make use of the difference in estimated \(\mathrm{elpd}^k\) of each model \(M_k\) compared to the model with the highest estimated elpd. Based on estimates \(\widehat{\mathrm{elpd}}\) for all models in \(M\), the difference \(\Delta \widehat{\mathrm{elpd}}^k\) of model \(M_k\) is obtained as \[\begin{align} \label{eq:elpd-difference} \Delta \widehat{\mathrm{elpd}}^k = \widehat{\mathrm{elpd}}^k - \max\left(\widehat{\mathrm{elpd}}\right). \end{align}\tag{2}\] For the model \(M_j\) in \(M = \{M_1, \cdots, M_K\}\) with highest \(\widehat{\mathrm{elpd}}\), that is, the model with the best predictive performance according to estimated elpd, we have \(\Delta \widehat{\mathrm{elpd}}^j = \max(\widehat{\mathrm{elpd}}) - \max(\widehat{\mathrm{elpd}}) = 0\). Therefore, for any other model \(M_k\), the closer \(\Delta \widehat{\mathrm{elpd}}^k\) to zero, the better \(M_k\) performs. When using \(\Delta \widehat{\mathrm{elpd}}\) to compare models, we also obtain estimated standard errors, denoted \(\widehat{\text{se}}(\Delta \widehat{\textrm{elpd}})\). We can identify a set of models with indistinguishable predictive performance compared to the best model as \[\begin{align} \label{eq:indist-models} \left\{ M_l: 0 \in \left[\Delta \widehat{\textrm{elpd}}^l \pm 2 \widehat{\text{se}}\left(\Delta \widehat{\textrm{elpd}}^l\right)\right] \right\}_{l=1, \cdots, L \leq K}. \end{align}\tag{3}\] [34] discuss the conditions when such a normal distribution approximation is justified and \(\widehat{\text{se}}\) are reasonably calibrated. Specifically, \(\widehat{\text{se}}\) can be underestimated in case of small data (say \(n<100\)) or when the models have very similar predictions. [41] justify why an ad hoc rule of \(|\Delta\widehat{\mathrm{elpd}}|<4\) can be used to determine when the models have similar predictions and \(\widehat{\text{se}}\) might be underestimated. For iterative filtering, we are not interested in model selection based on optimal prediction, but instead want to reduce a set of models towards more promising candidates. To this end, we can use \(\Delta\widehat{\mathrm{elpd}}\) and associated \(\widehat{\text{se}}\) to identify models with largely inferior predictive abilities.
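The comparison in equations 2 and 3 can, for example, be carried out with the loo package; in the sketch below, fits is assumed to be a named list of fitted brms models from the multiverse.

```r
library(brms)
library(loo)

# PSIS-LOO-CV estimates of elpd for every model in the multiverse
loos <- lapply(fits, loo)

# Differences to the model with the highest estimated elpd (equation 2),
# together with estimated standard errors of the differences
comp <- loo_compare(loos)

# Models whose predictive performance is indistinguishable from the best
# model according to equation 3 (elpd_diff is zero or negative by construction)
keep <- rownames(comp)[comp[, "elpd_diff"] + 2 * comp[, "se_diff"] >= 0]
```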
When using importance weighting methods like PSIS [40] and data partitioning techniques like Bayesian LOO-CV to obtain estimates for evaluation criteria like elpd, it is essential to check the reliability of the obtained estimates. In particular, since we want to filter by reliably assessed predictive performance, we diagnose estimates for elpd obtained with PSIS-LOO-CV by investigating the tail indices of the Pareto distribution used to smooth the importance sampling weights with the Pareto-\(\hat{k}\) diagnostic [40]. Intuitively, reliable importance sampling is hard, if, for example, leaving out one observation affects the resulting posterior strongly. Following recommendations by [33] and [40], elpd estimates have to be considered unreliable if any of their associated Pareto-\(\hat{k}\) values are larger than \(0.7\).
Given that we detect unreliable estimates for elpd obtained with PSIS-LOO-CV, different computational methods for computing leave-one-out predictive distributions can provide pathways towards trustworthy estimates. For example, problems due to a few extreme observations can often be alleviated with moment matching in PSIS-LOO-CV as suggested by [42] or estimates can be recomputed with K-fold-CV or brute-force LOO-CV2 for problematic observations. Section 4.2 provides an example of problems in estimation of elpd when models include intercepts varying on the level of each observation. We illustrate how to obtain reliable estimates in this scenario using a combination of default and integrated PSIS-LOO-CV, as well as re-computation with brute-force LOO-CV (see details in Appendix 7).
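A possible sequence of such checks and fixes is sketched below for a single model; fit is a placeholder for a fitted brms model whose PSIS-LOO-CV result contains high Pareto-\(\hat{k}\) values.

```r
library(brms)

loo_fit <- loo(fit)

# Observations with Pareto-k above 0.7 indicate unreliable importance sampling
high_k <- which(loo_fit$diagnostics$pareto_k > 0.7)

if (length(high_k) > 0) {
  # First attempt: moment matching for the problematic observations
  loo_fit <- loo_moment_match(fit, loo = loo_fit)
}

if (any(loo_fit$diagnostics$pareto_k > 0.7)) {
  # Fall back to brute-force LOO-CV (refitting) for remaining problematic observations
  loo_fit <- reloo(fit, loo = loo_fit)
}
```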
After having filtered out models with largely inferior predictive abilities, the remaining models enter the loop of “Computation checks and iterative adjustments” in Figure 1. Usually, posterior distributions in Bayesian data analysis tasks are too complex to be tractable, and we have to resort to numerical approximations to enable posterior inference. We focus on Markov chain Monte Carlo (MCMC) methods, in particular a variant of the dynamic Hamiltonian Monte Carlo no-U-turn sampler [43], as implemented in the probabilistic programming framework Stan [23], [44]. When using MCMC methods like HMC-NUTS, draws from the target distribution are obtained once a Markov chain has reached its equilibrium distribution [45], [46]. In practice, computation is not perfect, and we want to ensure sufficiently accurate sampling with Markov chains that get as close as possible to a common stationary distribution.
Diagnostics for degrees of convergence and efficiency in sampling can provide important insights into the reliability of the obtained posterior draws when using MCMC methods like HMC-NUTS. In particular, comparing between- and within-chain variances with the improved potential scale reduction statistic \(\hat{R}\) [47] or recent extensions like nested-\(\hat{R}\) [48] provide tools to diagnose insufficiently mixed MCMC chains.
Measures of bulk and tail effective sample size (ESS) can be used to assess the efficiency of sampling for the given model in the bulk and the tails of the distribution, respectively [24], [46], [47], [49]. Specific to HMC-NUTS, divergent transitions can point out insufficient exploration of the posterior distribution [24], [50], [51]. Section 4.3 illustrates iterative filtering with computational issues in inference indicated by divergent transitions, and provides an example of iteration towards an improved model parameterisation to alleviate issues in the sampling procedure.
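These diagnostics can be obtained, for example, with the posterior package and the sampler diagnostics stored in a fitted brms model; fit is again a placeholder for one model from the multiverse.

```r
library(brms)
library(posterior)

# Improved R-hat and bulk/tail effective sample sizes per parameter
draws <- as_draws_df(fit)
summarise_draws(draws, rhat, ess_bulk, ess_tail)

# Number of divergent transitions reported by the HMC-NUTS sampler
np <- nuts_params(fit)
sum(np$Value[np$Parameter == "divergent__"])
```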
For many real-world applications, it is sensible to assume that the underlying data generating process is complex and not directly accessible. A certain degree of experimenting and active assessment of computational issues for different models is likely required to successfully move from a statistical model to a working implementation in code [4]. In this way, computational issues become an essential driver of the amount of necessary iteration, and iterative filtering with computation checks can direct limited resources to models that benefit most from improving computation.
As discussed in Section 1.1 and 2 and illustrated in Section 3.1 and 3.2, moving from modelling ideas to valid posterior inferences often involves iteration, for example, motivated by necessary adjustments or possibilities for improvement. The suggested filtering approach reflects central aspects of Bayesian workflows described by [4] like validating computation, addressing computational issues as well as evaluating and comparing models (see Figure 1 in [4]). In that way, iterative filtering provides access to essential modular components of the larger multi-objective modelling endeavour.
By design, iterative filtering identifies more promising candidate models from a larger set of models in the sense of the minimum requirements of a viable Bayesian model described in Section 2.3. At the same time, any filtered set of models is a preliminary result, and more iteration can lead to an improved filtered set. In particular, having defined and filtered an initial set of modelling choices, the modeller may be prompted by the observed results to change the hypotheses encapsulated in the initial models, or even to consider new model structures. Our proposed framework enables the user to explicitly incorporate these discoveries through an iteration of the previous steps. That is, after having obtained a filtered set of models, the set can be extended and filtered again until a sufficient model or set of models is obtained.
Albeit sometimes unavoidable, modellers might not want to focus on one model until it fails, but instead find efficient, consistent and transparent transitions from an initial set of candidate models to more promising models. Non-arbitrarily filtered models can offer a meaningful baseline for further analyses. If parallel exploration of multiple models is computationally costly, several simple models can be set up at once and filtered iteratively taking predictive abilities and computational issues into account. Subsequently, the modeller can increase the complexity of the remaining models and filter again. Section 4.2 provides an example of filtering an extension of a set of models from Section 4.1.
In the following, we demonstrate how the suggested approach can simplify initial multiverses in real data examples where interest lies not only in sufficient predictive performance but also the effect of a treatment (Section 4.1 and 4.2) or specific day of the year (Section 4.3) across different candidate models. Section 4.1 presents filtering for predictive abilities for a set of models analysing the effect of an anticonvulsant therapy on epileptic seizure counts in patients with epilepsy. Subsequently, we extend the filtered set of models with additional modelling choices in Section 4.2 and demonstrate filtering for predictive abilities while taking into account the reliability of the estimated filtering criteria. Section 4.3 illustrates how to iteratively filter a large set of models analysing the relative number of births on Halloween in the USA while detecting computational issues in inference and, if needed, adjusting the parameterisation for problematic models. The code for the case studies is available at https://github.com/annariha/multi-workflows.
We use software tools for Bayesian subworkflows implemented in the R Statistical Software [52] for visualisations at different workflow steps (bayesplot, [53]; ggdist, [54]), working cleanly with posterior draws and predictions [55], as well as assessment of predictive performance and model comparisons [56]. Similar tools exist to some extent in other widely-used programming languages (see, e.g., ArviZ, [57] with native packages in, e.g., Python, [58] and Julia, [59]).
We use the data brms::epilepsy, openly available in the R-package brms [60], initially published by [61] and previously analysed, for example, by [62] and [63], containing \(236\) observations of a randomised trial of an anticonvulsant therapy for epilepsy patients with a treatment group that received the drug progabide \((N_t=124)\) and a control group \((N_c=112)\). All patients visited the clinic four times and the data contains information on seizure counts during and before the study as well as age of the patients.
We combine different modelling choices to obtain the initial set of \(24\) models summarised in Table [tab:models-epi-1-ppc] in Appendix 8. To model the number of epileptic seizures, that is, non-negative counts, two immediate candidates for the distributional family of the observations are a Poisson and a negative Binomial distribution. Additionally, we consider different covariates and an interaction effect between treatment and baseline seizure count. Lastly, we compare default priors in brms [60] as well as a combination of default and regularised horseshoe (RHS) priors [64]–[66] with three degrees of freedom for the Student-\(t\) prior of the local shrinkage parameters, using brms::horseshoe(3). The left subplot in Figure 3 shows the median, and 50% and 95% posterior intervals for the treatment coefficient for each model. We see relatively high variation in conclusions that would be made based on these \(24\) models.
Choosing one model over any other affects the conclusions about the effect of treatment and, clearly, a joint consideration of models is a safer starting point for our modelling efforts than picking only one of the models.
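A condensed sketch of how such a grid of modelling choices could be expanded into fitted brms models is given below; the formulas correspond to Table [tab:models-epi-1-ppc], while the object names and the sequential fitting loop are illustrative rather than the exact code used for this case study.

```r
library(brms)

# Modelling choices: formula, distributional family, and prior setting
formulas <- c("count ~ zBase * Trt", "count ~ Trt", "count ~ Trt + zBase",
              "count ~ zBase * Trt + zAge", "count ~ Trt + zAge",
              "count ~ Trt + zBase + zAge")
families <- list(poisson = poisson(), negbinomial = negbinomial())
# "default" keeps the default priors; "rhs" adds a regularised horseshoe
# prior on the population-level effects while keeping defaults elsewhere
priors   <- list(default = NULL, rhs = set_prior("horseshoe(3)"))

choices <- expand.grid(formula = formulas, family = names(families),
                       prior = names(priors), stringsAsFactors = FALSE)

# One model per combination of modelling choices (24 models in total)
fits <- lapply(seq_len(nrow(choices)), function(i) {
  brm(as.formula(choices$formula[i]), data = brms::epilepsy,
      family = families[[choices$family[i]]],
      prior  = priors[[choices$prior[i]]])
})
```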
The left subplot in Figure 4 shows \(\Delta \widehat{\mathrm{elpd}}\) and \(\widehat{\text{se}}(\Delta \widehat{\mathrm{elpd}})\) for all models and the right subplot shows only the seven models that are indistinguishable from the model with the highest estimated elpd according to their mean difference in elpd and the associated uncertainty \(2 \cdot \widehat{\text{se}}(\Delta \widehat{\mathrm{elpd}})\). The right subplot in Figure 3 shows posterior median, and 50% and 95% posterior intervals for the coefficient of treatment for the remaining seven models. We see that conclusions implied by the remaining models are similar.
In the left subplot of Figure 4, elpd results where the Pareto-\(\hat{k}\) diagnostic indicated unreliable computation for PSIS-LOO-CV are highlighted in red. Instead of using computationally more intensive CV approaches, we can use posterior predictive checking to rule out these models. For the given multiverse, all models with high Pareto-\(\hat{k}\) assume a Poisson distribution as the distributional family for the observations. Figure 5 shows posterior predictive checking results for the best performing model among models assuming a Poisson distribution (Model 21) and its counterpart that differs only with respect to the chosen distributional family for the observations (Model 22). The results suggest that the Poisson model is not an appropriate choice for this data.
Figure 13 in Appendix 8 illustrates similar results of posterior predictive checks for other models, suggesting that there is no need to improve our CV methods for these models. Figure 12 in Appendix 8 shows that in fact the filtering results would stay the same.
With filtering, we are able to remove a large proportion of models that are clearly worse. The remaining set of seven filtered models can be analysed as usual in multiverse analysis.
We extend the filtered set of seven models in Section 4.1 to \(175\) models. Previously, with the help of posterior predictive checking, we were able to avoid resolving computational issues in estimation of elpd for some models. With the extended set of models, iteration towards improved reliability of estimates for elpd is obligatory, as conclusions about the filtered set of models are affected.
For a more detailed comparison of choosing a negative Binomial or a Poisson distribution under overdispersion, we add models with an additional effect varying by each observation. Additionally, models can include additional variation on the level of each patient and each of the four visits and combinations thereof. Table [tab:models-epi-2-psisloocv] in Appendix 9 summarises the resulting \(175\) models. The left subplot in Figure 6 shows how the posterior 50% and 95% intervals and the posterior median for the coefficient of treatment vary across all \(175\) models.
We evaluate predictive performance of the models using \(\Delta \widehat{\mathrm{elpd}}\) and \(\widehat{\text{se}}(\Delta \widehat{\mathrm{elpd}})\) (see Figure 7).
In contrast to the results presented in Section 4.1, the top right subplot in Figure 7 shows that direct filtering for models indistinguishable with respect to \(\Delta \widehat{\mathrm{elpd}}\) and \(2 \cdot \widehat{\text{se}}(\Delta \widehat{\mathrm{elpd}})\) would focus exclusively on models with unreliably estimated elpd. As discussed in Section 3.1, any conclusions about promising candidate models should be based on a reliable assessment of predictive performance. This is an example of changing and checking in Figure 1, since computation of estimates of elpd needs to be adjusted to resolve the detected issues and obtain reliably estimated filtering criteria before further filtering attempts.
Investigation of the modelling choices underlines that issues in reliability of estimates of elpd are mostly related to models including additional variation on the level of each observation. To alleviate challenges of importance sampling and, in turn, PSIS-LOO-CV, for models with an effect varying by each observation, we can separate the varying intercept from the rest of the model and integrate over the observation-level intercepts. This approach is called integrated PSIS-LOO-CV. Appendix 7 explains the procedure in more detail. After obtaining estimates with integrated PSIS-LOO-CV, only a few high (\(> 0.7\)) Pareto-\(\hat{k}\) values remain for some models for which estimates for elpd are recomputed with brute-force LOO-CV using brms::reloo() from the brms package [60]. Comparing results before and after modifying computation in the top and bottom left subplots in Figure 7, we observe lower mean absolute differences in \(\widehat{\mathrm{elpd}}\) and larger associated standard errors \(\widehat{\text{se}}(\Delta \widehat{\mathrm{elpd}})\) for most models after obtaining reliable estimates for all models.
As illustrated in the top right and bottom central subplots in Figure 7, improving reliability of estimates for elpd with integrated PSIS-LOO-CV and brute-force LOO-CV substantially changes the set of filtered models.
After improving computation, the bottom central subplot in Figure 7 shows that some models have large absolute \(\Delta \widehat{\mathrm{elpd}}\) but also large estimated standard errors \(\widehat{\text{se}}(\Delta \widehat{\mathrm{elpd}})\). It seems unlikely that these models are actually promising candidates. As touched upon in Section 3.1, summarising expected log point-wise predictive densities with \(\Delta \widehat{\mathrm{elpd}}\) and associated standard errors \(\widehat{\text{se}}(\Delta \widehat{\mathrm{elpd}})\) relies on approximating the distribution of point-wise differences in estimated elpd with a normal distribution. Large standard errors can indicate issues with the validity of this approximation. To investigate whether results for all \(86\) models are based on valid approximations, we diagnose point-wise differences in estimated elpd by comparing the Pareto-\(\hat{k}\) values to the Pareto-\(\hat{k}\) threshold and the minimum required sample size for valid approximation with a normal distribution [40] using posterior::pareto_diags() from the posterior package [55]. The Pareto-\(\hat{k}\) value for the point-wise differences is larger than the sample size specific threshold (here: \(0.58\)) and the required minimum sample size exceeds the number of available observations (here: \(236\)) for several models for which, consequently, the filtering using \(\Delta \widehat{\mathrm{elpd}}\) and \(2 \cdot \widehat{\text{se}}(\Delta \widehat{\mathrm{elpd}})\) is not accurate.
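A sketch of this diagnostic check is shown below, assuming loo_best and loo_candidate are the loo objects of the model with the highest estimated elpd and of one candidate model, respectively.

```r
library(posterior)

# Point-wise differences in estimated elpd between a candidate model and
# the model with the highest estimated elpd
elpd_diff_pointwise <- loo_candidate$pointwise[, "elpd_loo"] -
  loo_best$pointwise[, "elpd_loo"]

# Pareto-khat, the sample-size-specific khat threshold, and the minimum
# sample size required for a valid normal approximation of the mean difference
pareto_diags(elpd_diff_pointwise)
```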
Investigating the point-wise values for elpd and differences in elpd reveals that predictive performance for some models is largely affected by a few extreme observations, which translates to large standard errors and failure of the normal approximation. Checking for extreme observations, we identify patient \(25\) and patient \(49\) with magnitude or pattern of seizure counts that differ largely from the rest of the patients. Indeed, extreme observations for these patients largely correspond to the points where the problematic models perform a lot worse than other models, which then translates to large estimated standard errors of the mean difference in elpd (see examples in Figure 15 in Appendix 9). In absence of expert knowledge that would allow us to identify data entry or measurement errors in these cases, we consult previous research using this data set [62], [63]. In particular, [62] advise that for extreme observations in this data set “deletion [...] has no clinical basis” [62]. Therefore, given the available data, our filtered set of models should only include models that are able to model the extreme observations in the data set sufficiently well.
Out of \(175\) initial models, we finally obtain \(29\) models. Given a larger set of more complex models, iteration towards improved reliability of estimates for elpd is obligatory when conclusions about the filtered set of models are affected. The full model specifications for the remaining \(29\) models are highlighted in Table [tab:models-epi-2-psisloocv] in Appendix 9. The right subplot of Figure 6 shows posterior median, and 50% and 95% posterior intervals for the filtered set of models. Now, tools from multiverse analysis can be used for this reduced set of models.
In the set of \(29\) models, we can identify the most and least complex models, as well as models that are nested within each other. In particular, the most complex model assuming a negative Binomial distribution as the distributional family for the observations (Model 192 in Table [tab:models-epi-2-psisloocv]) includes all covariates and group-level effects on the level of each observation, each patient and each visit, and assumes a combination of default and regularised horseshoe priors. Moreover, the most complex model assuming a Poisson distribution (Model 181 in Table [tab:models-epi-2-psisloocv]) contains all covariates as well as an interaction effect between treatment and baseline seizure count and group-level effects on the level of each observation, each patient and each visit, and assumes default prior settings in brms::brm(). The least complex model in the set of \(29\) models uses a Poisson distribution as the distributional family for the observations, includes a covariate indicating whether the patient received treatment and a group-level effect for each patient, and assumes a combination of default and regularised horseshoe priors (see Model 31 in Table [tab:models-epi-2-psisloocv]).
After giving an example of tending to computational issues in filtering criteria, we now present a scenario in which difficult posterior geometry can cause computational issues in inference when analysing the relative number of daily births in the USA from \(1969\) to \(1988\). This requires the modeller to combine assessment of predictive performance with iterative modification of models via reparameterisation until reliable sampling is possible. We use natality data from the National Vital Statistics System provided by Google BigQuery and exported by Chris Mulligan and Robert Kern, containing \(7305\) observations of the number of births with the corresponding day, month, and year as well as day of the year and day of the week in the USA from \(01.01.1969\) to \(31.12.1988\).
We model the registered number of daily births with models of varying complexity, ranging from a normal hierarchical model for the day of the year effect to a model which combines Gaussian process (GP) priors [67] on multiple unknown time dependent functions with a regularised horseshoe (RHS) prior [64]–[66] for the day of the year effect. The modelling choices are inspired by a subset of the components of the model showcased by [24]. In total, we consider \(144\) different models, constructed by individually switching out model components. Our most complex model combines a GP, a yearly periodic GP, a GP controlling the strength of a day of the week effect, a floating holiday effect and a day of the year effect with an RHS prior. Table [tab:models-birthdays-reparam] in Appendix 10.2 summarises the model components and the resulting models.
The high number of observations (\(N=7305\)) and the nonlinear influence of the unknown time dependent functions make exact covariance matrix based GP implementations computationally slow. We instead use a Hilbert space basis function based approximation [68], which has come to be known as the Hilbert space basis function GP (HSGP) [69]. However, HSGPs as well as the RHS prior come with their own computational challenges. Both lead to model parameter priors with scales that depend strongly on other model parameters, similar to the situation exemplified in Neal’s funnel [70]. Neither monolithic centring nor monolithic non-centring removes these dependencies due to the effect of the likelihood. For many of the considered models, this leads to a significant number of divergences when sampling, ranging up to almost 50% divergent transitions (see Figure 8).
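As one way to make the HSGP component concrete, a reduced-rank GP term can be specified in brms via its basis function approximation; the outcome, data, and number of basis functions below are illustrative placeholders and not the exact specification used in this case study.

```r
library(brms)

# Approximate GP (HSGP) on standardised time with k basis functions and
# boundary factor c, instead of an exact covariance-matrix-based GP
fit_trend <- brm(
  births_rel ~ gp(date_std, k = 20, c = 5/4),
  data = births, family = gaussian()
)
```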
If the number of divergent transitions is zero and the other convergence diagnostics are unproblematic, we assume that sampling has worked. However, even a small fraction of divergent transitions can mean that the sampler has failed to explore parts of the posterior.
For models for which we encounter divergences during sampling, we use the obtained draws to estimate a model parameterisation which minimises the Kullback–Leibler divergence between the posterior distribution and a multivariate normal distribution. As demonstrated by [71], such a reparameterisation tends to improve the posterior geometry and facilitate sampling. This inferred reparameterisation together with an increased target acceptance rate allows us to sample even from the challenging posteriors with no or only very few divergent transitions.
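The inferred reparameterisation itself follows [71] and is model-specific; as a rough, simplified illustration of the general idea, the sketch below estimates an affine transformation from preliminary posterior draws (all object names are hypothetical), which could then be used to refit a reparameterised model with a higher target acceptance rate.

```r
library(posterior)

# Draws from a preliminary run of a problematic model (placeholder object)
draws_mat <- as_draws_matrix(fit_preliminary)

# A simple affine instance of the idea: centre and decorrelate the parameter
# space using the empirical posterior mean and covariance, so that sampling
# happens in a space closer to a standard multivariate normal
m <- colMeans(draws_mat)
L <- t(chol(cov(draws_mat)))  # lower-triangular factor with cov = L %*% t(L)

# m and L would then enter a reparameterised version of the model
# (x = m + L %*% z), refitted with an increased target acceptance rate.
```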
As before, to establish a ranking in the set of models and to investigate whether we have to further improve sampling, we compare predictive performance between the models using estimated elpd (see Figure 9).
This time, the reliability of the elpd estimates is affected strongly by the reliability of the sampling, but not at all by the reliability of the PSIS-LOO-CV. As in Section 4.2, it is not only the lowest ranking models where the reliability of the elpd estimates is in question, but it is in particular the highest ranking and most complex models where sampling seems to have been problematic. After improving sampling, the ranking of the two best models flips and all diagnostics indicate that estimation was successful for the important models.
Following the original analysis by [72], we look at the effect of Halloween on the number of registered births (see Figure 10).
As in Section 4.1 an unfiltered multiverse of models would have suggested considerable variation in plausible effect sizes. By filtering out models with very bad predictive performance and iteratively improving computation for the good model candidates, we are able to narrow down the set of \(144\) models to just two candidate models, which both give similar predictions and lead to the same conclusions about the effect in question. As in Section 4.1, but unlike in Section 4.2, the set of candidate models does not change, even though their internal ranking changes. However, as in Section 4.2, correcting the computational issues does not notably affect the conclusions about the effect in question.
While our procedure has left us with two models that we can tentatively accept, there is no reason to blindly trust their predictions or their inference. Our procedure placing the most complex models at the top should instead encourage us to consider these models as the new baseline, from which we can continue to explore the model space, if needed or wanted. If we choose to do so, we can and should expect to find models which capture the underlying process considerably or even just slightly better, see, for example, the inspiration for our model components by [24]. These models may or may not change the conclusions about the effect of interest, and as such we should accept any conclusions only tentatively, until new information is added.
Iterative filtering for multiverse analysis can identify promising candidates in large sets of models, given that causal constraints have been taken into account. Tending to computational issues in tandem with filtering for predictive abilities enables pruning a large set of models to a manageable amount, for example, from \(24\) to seven models in Section 4.1, from \(175\) to \(29\) models in Section 4.2 and from \(144\) to two models in Section 4.3. Naturally, this work is not without limitations. The task of supporting iterative Bayesian modelling workflows is multi-faceted and raises questions outside the scope of this paper.
Our filtering approach focuses on predictive abilities, investigation of posterior results of the quantity of interest and computation checks as aspects of central importance. [73] illustrate that predictive accuracy alone can be an insufficient criterion for model comparison. Similarities in predictive performance are not necessarily an indicator for similarity in the posteriors of the quantity of interest. After filtering based on predictive performance, multiverse analysis provides means for assessing possible differences in interesting posterior quantities.
We have assumed that no causally nonsensical models are included in the set of candidate models. If causal constraints are not taken into account, filtering may remove causally sensible models with worse predictive performance than causally nonsensical models with better predictive performance [74].
Even though the proposed approach enables modellers to jointly investigate multiple models with less overhead, it is clear that with more time and more iterations, we can always obtain an even larger set of models [7], [8]. As of now, iterative filtering does not provide an automatic tool, but requires activity and engagement from the modeller. Creating and investigating sets of models is subject to computational and cognitive constraints, limitations in modelling ideas, as well as availability of data and time. If the modeller is faced with an extremely large number of models, resource limitations can make it infeasible to consider all models at once, and the proposed approach can fail to reduce the set of models to a manageable amount.
We do not compare different possible multiverses, and instead assume a fixed set of modelling choices that give rise to one multiverse of models in the examples presented in Section 4. As [25] point out, it is to be expected that the choices required for any multiverse analysis might in turn affect the results: “choice of search space is itself a choice” [25]. This is also related to “model list selection” [75], that is, the question of how to select appropriate lists of models to compare that sufficiently represent the underlying but inaccessible entirety of possible models, which [75] identify as one of the pressing issues for future research on assessment of predictive abilities of several candidate models in a Bayesian framework.
After having obtained an iteratively filtered set of models, it remains to be explored how the modeller continues; especially if interest lies in decision-making based on the filtered set of models. Iterative filtering focuses on removing clearly inferior models from the multiverse, and not identifying a single best model. If a good reference model is available and interest lies predominantly in predictions, projection predictive inference could provide a systematic way to identify a simpler model that matches the predictive performance of the reference model from a set of candidates [76]–[79]. Another possibility is model averaging for the remaining models, for example, via model stacking [80], since the joint inferences are then based on a pruned set excluding largely inferior models.
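For instance, stacking weights for the remaining models could be computed directly from the LOO results that are already available from filtering; loos_filtered below is a placeholder for the list of loo objects of the filtered models.

```r
library(loo)

# Stacking weights for model averaging over the filtered set of models
loo_model_weights(loos_filtered, method = "stacking")
```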
Iterative filtering for multiverse analysis is an attempt to make required and desired iteration in a Bayesian modelling workflow more tangible. When building Bayesian models, modellers routinely face the tasks of evaluating computation as well as assessing predictive and inferential abilities for multiple models. To further support Bayesian data analysis tasks, it is important to develop more approaches that
allow joint consideration of multiple models;
implement connections between different subtasks in Bayesian workflows;
and support required and desired iteration throughout.
The goal of supporting Bayesian workflows and, more specifically, the iterative aspects therein, is connected to a larger discussion about vital properties of useful modelling workflows. [81] emphasise that trust in modelling practices is linked to the robustness of conclusions to different underlying assumptions. Every model can be seen as a hypothesis of the modeller, formulated by a combination of modelling choices [12]. By acknowledging the existence of several such hypotheses, trustworthy Bayesian modelling workflows need to allow the consistent and transparent consideration of several candidate models.
We thank Andrew Gelman and Seth Axen for helpful comments. We acknowledge the computational resources provided by the Aalto Science-IT project. This work was supported by the Research Council of Finland Flagship programme: Finnish Center for Artificial Intelligence (FCAI), and Research Council of Finland project “Safe iterative model building” (340721). We also thank our colleagues who provided helpful feedback, particularly Andrew R. Johnson, Yann McLatchie, Frank Weber, and Noa Kallioinen.
For a model with an intercept varying on the level of each observation, it is possible to compute the log likelihood by separating the varying intercept \(r_i\) from the rest of the model and integrating over each \(r_i\) which leads to \[\begin{align} \label{eq:intloolikelihood} \log p(y_i \mid X_i, \theta^s_{-r_i}) = \log \int_{-\infty}^{\infty} p(y_i \mid X_i, r_i, \theta^s_{-r_i}) p(r_i \mid \theta^s) d r_i , \end{align}\tag{4}\] for each observation \(i=1,\cdots, N\) and each posterior draw \(s=1, \cdots, S\). For the \(s\)-th posterior draw, \(\theta^s_{-r_i}\) denotes the vector of model parameters excluding the intercept varying by observations \(r_i\) for the \(i\)-th observation. Using PSIS-LOO-CV means that we are using importance sampling, for example, with the posterior density \(p(\theta \mid Y)\) as the proposal and \(p(\theta \mid Y_{-i})\) as the target densities for each observation in the leave-one-out approach. Since the posterior density \(p(\theta \mid Y)\) includes the marginal effect of each observation-level intercept \(r_i\), leaving out one observation at a time means that we will end up comparing the marginal \(p(r_i \mid Y)\) given all observations of the outcome with \(p(r_i \mid Y_{-i})\) with observation \(i\) removed. Clearly, the difference between these marginals will be big, when we are removing the \(i\)-th observation from the marginal for \(r_i\), whereas, for example, removing the \(j\)-th observation will have a smaller effect on the marginal of \(r_i\) but again a big effect on the marginal of \(r_j\) and so on. This means that \[\begin{align} \| p(r_i | Y) - p(r_i | Y_{-i}) \| \gg 0, \end{align}\] and we observe that this translates to the proposal density \(p(\theta | Y)\) and the target density \(p(\theta | Y_{-i})\) as well since they contain the marginals \(p(r_i | Y)\) and \(p(r_i | Y_{-i})\), respectively.
Since the marginal posterior of the intercept varying on the level of each observation, denoted \(r_i\), differs largely between each observation while the rest of the posterior is not affected, we can separate the intercept from the rest of the model and integrate over the intercepts that vary by observations. As detailed in equation 4, this allows us to incorporate \(r_i\) into our proposal and target distribution without having to consider \(p(r_i \mid Y)\) and \(p(r_i \mid Y_{-i})\) separately, by integrating over \(r_i\) for each observation \(i\). Notably, due to the influence of the marginal of \(r_i\), we can assume that the proposal posterior density \(p(\theta \mid Y)\) differs more from the target \(p(\theta \mid Y_{-i})\) than the posterior density \(p(\theta_{-r_i} \mid Y)\) differs from the corresponding target \(p(\theta_{-r_i} \mid Y_{-i})\). In particular, \[\begin{align} \| p(\theta \mid Y) - p(\theta \mid Y_{-i}) \| \gg \| p(\theta_{-r_i} \mid Y) - p(\theta_{-r_i} \mid Y_{-i}) \| . \end{align}\] Intuitively, one would expect that this relationship makes it easier to obtain reliable importance sampling results due to a thinner tail of the distribution of importance sampling weights when the proposal and target distribution are closer to each other. With the log likelihood evaluations obtained by integrating over the intercepts varying on the level of each observation, we remove the direct influence of each individual observation and can then again use PSIS-LOO-CV to obtain reliable elpd estimates, as exemplified in Section 4.2.
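As a minimal numerical sketch of equation 4, assuming a Poisson observation model with a normal observation-level intercept; the linear predictor value eta_si and the standard deviation sigma_r_s stand in for quantities taken from one posterior draw.

```r
# Integrated log-likelihood for observation i and posterior draw s,
# integrating the observation-level intercept r_i out numerically
integrated_log_lik <- function(y_i, eta_si, sigma_r_s) {
  integrand <- function(r) {
    dpois(y_i, lambda = exp(eta_si + r)) * dnorm(r, mean = 0, sd = sigma_r_s)
  }
  log(integrate(integrand, lower = -Inf, upper = Inf)$value)
}

# Example: one observation and one posterior draw (illustrative values)
integrated_log_lik(y_i = 3, eta_si = 1.2, sigma_r_s = 0.5)
```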
Table [tab:models-epi-1-ppc] (Appendix 8):

Model ID & Obs. family & Priors & Formula
Model 1 & Poisson & default in brms & count ~ zBase * Trt
Model 2 & Neg. Binomial & default in brms & count ~ zBase * Trt
Model 3 & Poisson & brms::horseshoe(3) & count ~ zBase * Trt
Model 4 & Neg. Binomial & brms::horseshoe(3) & count ~ zBase * Trt
Model 5 & Poisson & default in brms & count ~ Trt
Model 6 & Neg. Binomial & default in brms & count ~ Trt
Model 7 & Poisson & brms::horseshoe(3) & count ~ Trt
Model 8 & Neg. Binomial & brms::horseshoe(3) & count ~ Trt
Model 9 & Poisson & default in brms & count ~ Trt+zBase
Model 10 & Neg. Binomial & default in brms & count ~ Trt+zBase
Model 11 & Poisson & brms::horseshoe(3) & count ~ Trt+zBase
Model 12 & Neg. Binomial & brms::horseshoe(3) & count ~ Trt+zBase
Model 13 & Poisson & default in brms & count ~ zBase * Trt+zAge
Model 14 & Neg. Binomial & default in brms & count ~ zBase * Trt+zAge
Model 15 & Poisson & brms::horseshoe(3) & count ~ zBase * Trt+zAge
Model 16 & Neg. Binomial & brms::horseshoe(3) & count ~ zBase * Trt+zAge
Model 17 & Poisson & default in brms & count ~ Trt+zAge
Model 18 & Neg. Binomial & default in brms & count ~ Trt+zAge
Model 19 & Poisson & brms::horseshoe(3) & count ~ Trt+zAge
Model 20 & Neg. Binomial & brms::horseshoe(3) & count ~ Trt+zAge
Model 21 & Poisson & default in brms & count ~ Trt+zBase+zAge
Model 22 & Neg. Binomial & default in brms & count ~ Trt+zBase+zAge
Model 23 & Poisson & brms::horseshoe(3) & count ~ Trt+zBase+zAge
Model 24 & Neg. Binomial & brms::horseshoe(3) & count ~ Trt+zBase+zAge
Model ID & Obs. family & Priors & Formula
Model 2 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt
Model 4 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt
Model 10 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase
Model 12 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase
Model 16 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge
Model 22 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+zAge
Model 24 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge
Model 25 & Poisson & default in brms & count \(\sim\) zBase * Trt+(1 | patient)
Model 26 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+(1 | patient)
Model 27 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | patient)
Model 28 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | patient)
Model 29 & Poisson & default in brms & count \(\sim\) Trt+(1 | patient)
Model 30 & Neg. Binomial & default in brms & count \(\sim\) Trt+(1 | patient)
Model 31 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+(1 | patient)
Model 32 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+(1 | patient)
Model 33 & Poisson & default in brms & count \(\sim\) Trt+zBase+(1 | patient)
Model 34 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+(1 | patient)
Model 35 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | patient)
Model 36 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | patient)
Model 37 & Poisson & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | patient)
Model 38 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | patient)
Model 39 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | patient)
Model 40 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | patient)
Model 41 & Poisson & default in brms & count \(\sim\) Trt+zAge+(1 | patient)
Model 42 & Neg. Binomial & default in brms & count \(\sim\) Trt+zAge+(1 | patient)
Model 43 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | patient)
Model 44 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | patient)
Model 45 & Poisson & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | patient)
Model 46 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | patient)
Model 47 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | patient)
Model 48 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | patient)
Model 49 & Poisson & default in brms & count \(\sim\) zBase * Trt+(1 | visit)
Model 50 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+(1 | visit)
Model 51 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | visit)
Model 52 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | visit)
Model 53 & Poisson & default in brms & count \(\sim\) Trt+(1 | visit)
Model 54 & Neg. Binomial & default in brms & count \(\sim\) Trt+(1 | visit)
Model 55 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+(1 | visit)
Model 56 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+(1 | visit)
Model 57 & Poisson & default in brms & count \(\sim\) Trt+zBase+(1 | visit)
Model 58 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+(1 | visit)
Model 59 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | visit)
Model 60 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | visit)
Model 61 & Poisson & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | visit)
Model 62 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | visit)
Model 63 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | visit)
Model 64 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | visit)
Model 65 & Poisson & default in brms & count \(\sim\) Trt+zAge+(1 | visit)
Model 66 & Neg. Binomial & default in brms & count \(\sim\) Trt+zAge+(1 | visit)
Model 67 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | visit)
Model 68 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | visit)
Model 69 & Poisson & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | visit)
Model 70 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | visit)
Model 71 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | visit)
Model 72 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | visit)
Model 73 & Poisson & default in brms & count \(\sim\) zBase * Trt+(1 | patient)+(1 | visit)
Model 74 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+(1 | patient)+(1 | visit)
Model 75 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | patient)+(1 | visit)
Model 76 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | patient)+(1 | visit)
Model 77 & Poisson & default in brms & count \(\sim\) Trt+(1 | patient)+(1 | visit)
Model 78 & Neg. Binomial & default in brms & count \(\sim\) Trt+(1 | patient)+(1 | visit)
Model 79 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+(1 | patient)+(1 | visit)
Model 80 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+(1 | patient)+(1 | visit)
Model 81 & Poisson & default in brms & count \(\sim\) Trt+zBase+(1 | patient)+(1 | visit)
Model 82 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+(1 | patient)+(1 | visit)
Model 83 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | patient)+(1 | visit)
Model 84 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | patient)+(1 | visit)
Model 85 & Poisson & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | visit)
Model 86 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | visit)
Model 87 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | visit)
Model 88 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | visit)
Model 89 & Poisson & default in brms & count \(\sim\) Trt+zAge+(1 | patient)+(1 | visit)
Model 90 & Neg. Binomial & default in brms & count \(\sim\) Trt+zAge+(1 | patient)+(1 | visit)
Model 91 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | patient)+(1 | visit)
Model 92 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | patient)+(1 | visit)
Model 93 & Poisson & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | visit)
Model 94 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | visit)
Model 95 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | visit)
Model 96 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | visit)
Model 97 & Poisson & default in brms & count \(\sim\) zBase * Trt+(1 | obs)
Model 98 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+(1 | obs)
Model 99 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | obs)
Model 100 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | obs)
Model 101 & Poisson & default in brms & count \(\sim\) Trt+(1 | obs)
Model 102 & Neg. Binomial & default in brms & count \(\sim\) Trt+(1 | obs)
Model 103 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+(1 | obs)
Model 104 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+(1 | obs)
Model 105 & Poisson & default in brms & count \(\sim\) Trt+zBase+(1 | obs)
Model 106 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+(1 | obs)
Model 107 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | obs)
Model 108 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | obs)
Model 109 & Poisson & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | obs)
Model 110 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | obs)
Model 111 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | obs)
Model 112 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | obs)
Model 113 & Poisson & default in brms & count \(\sim\) Trt+zAge+(1 | obs)
Model 114 & Neg. Binomial & default in brms & count \(\sim\) Trt+zAge+(1 | obs)
Model 115 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | obs)
Model 116 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | obs)
Model 117 & Poisson & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | obs)
Model 118 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | obs)
Model 119 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | obs)
Model 120 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | obs)
Model 121 & Poisson & default in brms & count \(\sim\) zBase * Trt+(1 | patient)+(1 | obs)
Model 122 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+(1 | patient)+(1 | obs)
Model 123 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | patient)+(1 | obs)
Model 124 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | patient)+(1 | obs)
Model 125 & Poisson & default in brms & count \(\sim\) Trt+(1 | patient)+(1 | obs)
Model 126 & Neg. Binomial & default in brms & count \(\sim\) Trt+(1 | patient)+(1 | obs)
Model 127 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+(1 | patient)+(1 | obs)
Model 128 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+(1 | patient)+(1 | obs)
Model 129 & Poisson & default in brms & count \(\sim\) Trt+zBase+(1 | patient)+(1 | obs)
Model 130 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+(1 | patient)+(1 | obs)
Model 131 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | patient)+(1 | obs)
Model 132 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | patient)+(1 | obs)
Model 133 & Poisson & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | obs)
Model 134 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | obs)
Model 135 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | obs)
Model 136 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | obs)
Model 137 & Poisson & default in brms & count \(\sim\) Trt+zAge+(1 | patient)+(1 | obs)
Model 138 & Neg. Binomial & default in brms & count \(\sim\) Trt+zAge+(1 | patient)+(1 | obs)
Model 139 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | patient)+(1 | obs)
Model 140 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | patient)+(1 | obs)
Model 141 & Poisson & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | obs)
Model 142 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | obs)
Model 143 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | obs)
Model 144 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | obs)
Model 145 & Poisson & default in brms & count \(\sim\) zBase * Trt+(1 | visit)+(1 | obs)
Model 146 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+(1 | visit)+(1 | obs)
Model 147 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | visit)+(1 | obs)
Model 148 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | visit)+(1 | obs)
Model 149 & Poisson & default in brms & count \(\sim\) Trt+(1 | visit)+(1 | obs)
Model 150 & Neg. Binomial & default in brms & count \(\sim\) Trt+(1 | visit)+(1 | obs)
Model 151 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+(1 | visit)+(1 | obs)
Model 152 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+(1 | visit)+(1 | obs)
Model 153 & Poisson & default in brms & count \(\sim\) Trt+zBase+(1 | visit)+(1 | obs)
Model 154 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+(1 | visit)+(1 | obs)
Model 155 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | visit)+(1 | obs)
Model 156 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | visit)+(1 | obs)
Model 157 & Poisson & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | visit)+(1 | obs)
Model 158 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | visit)+(1 | obs)
Model 159 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | visit)+(1 | obs)
Model 160 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | visit)+(1 | obs)
Model 161 & Poisson & default in brms & count \(\sim\) Trt+zAge+(1 | visit)+(1 | obs)
Model 162 & Neg. Binomial & default in brms & count \(\sim\) Trt+zAge+(1 | visit)+(1 | obs)
Model 163 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | visit)+(1 | obs)
Model 164 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | visit)+(1 | obs)
Model 165 & Poisson & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | visit)+(1 | obs)
Model 166 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | visit)+(1 | obs)
Model 167 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | visit)+(1 | obs)
Model 168 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | visit)+(1 | obs)
Model 169 & Poisson & default in brms & count \(\sim\) zBase * Trt+(1 | patient)+(1 | visit)+(1 | obs)
Model 170 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+(1 | patient)+(1 | visit)+(1 | obs)
Model 171 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | patient)+(1 | visit)+(1 | obs)
Model 172 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+(1 | patient)+(1 | visit)+(1 | obs)
Model 173 & Poisson & default in brms & count \(\sim\) Trt+(1 | patient)+(1 | visit)+(1 | obs)
Model 174 & Neg. Binomial & default in brms & count \(\sim\) Trt+(1 | patient)+(1 | visit)+(1 | obs)
Model 175 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+(1 | patient)+(1 | visit)+(1 | obs)
Model 176 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+(1 | patient)+(1 | visit)+(1 | obs)
Model 177 & Poisson & default in brms & count \(\sim\) Trt+zBase+(1 | patient)+(1 | visit)+(1 | obs)
Model 178 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+(1 | patient)+(1 | visit)+(1 | obs)
Model 179 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | patient)+(1 | visit)+(1 | obs)
Model 180 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+(1 | patient)+(1 | visit)+(1 | obs)
Model 181 & Poisson & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 182 & Neg. Binomial & default in brms & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 183 & Poisson & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 184 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) zBase * Trt+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 185 & Poisson & default in brms & count \(\sim\) Trt+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 186 & Neg. Binomial & default in brms & count \(\sim\) Trt+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 187 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 188 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 189 & Poisson & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 190 & Neg. Binomial & default in brms & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 191 & Poisson & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | visit)+(1 | obs)
Model 192 & Neg. Binomial & brms::horseshoe(3) & count \(\sim\) Trt+zBase+zAge+(1 | patient)+(1 | visit)+(1 | obs)
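Each row of the tables above corresponds to one brms model defined by its observation family, prior, and formula. As a minimal sketch, assuming the epilepsy data set shipped with brms, Model 28 could be fit and evaluated as follows; this is illustrative and not necessarily the exact code used for the case study.

```r
library(brms)

# Model 28 from the table: negative binomial observation family,
# regularised horseshoe prior on the regression coefficients, and a
# varying intercept per patient.
fit_28 <- brm(
  count ~ zBase * Trt + (1 | patient),
  data   = epilepsy,
  family = negbinomial(),
  prior  = set_prior("horseshoe(3)", class = "b"),
  seed   = 1
)

loo(fit_28)  # PSIS-LOO-CV estimate of predictive performance
```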
We use an additive model \(y(t) = f_1(t) + f_2(t) + f_3(t) + f_4(t) + f_5(t) + \varepsilon_t\) where
\(t\) is the time in days since 01.01.1969;
\(y(t)\) are the registered numbers of births or their natural logarithm, depending on the chosen likelihood;
\(f_1, f_2, f_3, f_4\) and \(f_5\) are individual model components and
\(\varepsilon_t\) are the residuals, which we assume to be normally distributed.
The multiverse of \(144\) models consists of all possible combinations of the likelihood (normal or log-normal) and the additive model components listed below, giving \(2 \times 2 \times 2 \times 3 \times 2 \times 3 = 144\) combinations (see the enumeration sketch after this list). The additive model components can be:
\(f_1\) (baseline):
intercept only, \(f_1(t) = f_1(0) \sim \mathcal{N}(0,1)\), or
slowly evolving baseline, \(f_1(t) \sim \mathcal{GP}\left(\mathcal{N}(0,1), k_1\right)\) with \[k_1(t, t') = \sigma_1^2\exp\left(-\frac{|t-t'|^2}{2l^2_1}\right)\].
\(f_2\) (seasonal variation):
no seasonal variation, \(f_2(t) = 0\), or
yearly periodic GP, \(f_2(t) \sim \mathcal{GP}(0, k_2)\) with \[k_2(t, t') = \sigma_2^2\exp\left(-\frac{1}{l^2_2} 2\sin^2(\pi(t-t')/365.25)\right)\].
\(f_3\) (day of the week effect):
no day of the week effect, \(f_3(t) = 0\),
constant day of the week effect, \(f_3(t) = f_{3, \mathrm{dow}(t)}\) with \(f_{3,i} \sim \mathcal{N}(0,1)\) for \(i=1,\cdots,7\), or
slowly evolving day of the week effect, \(f_3(t) = f_{3, \mathrm{dow}(t)} f_{3,GP}(t)\) with \(f_{3,GP}(t) \sim \mathcal{GP}(0, k_3)\) and \(k_3(t, t') = \sigma_3^2\exp\left(-\frac{|t-t'|^2}{2l^2_3}\right)\).
\(f_4\) (floating holiday effect):
no floating holiday effect, \(f_4(t) = 0\), or
constant floating holiday effect, \(f_4(t) = f_{4, \mathrm{floating}(t)}\) with \(f_{4, i} \sim \mathcal{N}(0,1)\), which overrides the intercept, the day of the week effect and the day of the year effect, and includes the floating holidays Memorial Day, Labour Day and Thanksgiving as separate effects.
\(f_5\) (day of the year effect), \(f_5(t) = f_{5,\mathrm{doy}(t)}\), with
normal distribution hierarchical prior, \(f_{5, i} \sim \mathcal{N}(0, \sigma)\),
Student’s t-distribution hierarchical prior, \(f_{5, i} \sim t_\nu(0, \sigma)\), or
regularised horseshoe prior, \(f_{5, i} \sim \mathrm{RHS}(0, c, \lambda, \tau)\) with \(\nu_c=100\), \(\nu_\lambda=1\), \(\sigma_\lambda = 2\), \(\sigma_\tau = 0.1\) and \(\nu_\tau=100\).
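As a sketch of how the component choices listed above combine into the 144-model multiverse (the column and level names are illustrative, not the authors' code):

```r
# Enumerate the multiverse as all combinations of the component choices.
multiverse <- expand.grid(
  obs_family  = c("normal", "log-normal"),
  baseline    = c("constant", "evolving"),
  seasonal    = c("none", "periodic"),
  day_of_week = c("none", "constant", "evolving"),
  holidays    = c("none", "constant"),
  day_of_year = c("normal", "student-t", "RHS"),
  stringsAsFactors = FALSE
)
nrow(multiverse)  # 144
```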
Model ID & Baseline & Seasonal & Day of Week & Floating Holidays & Day of Year & Obs. family
Model 1 & constant & none & none & none & normal & normal
Model 2 & constant & none & none & none & normal & log normal
Model 3 & constant & none & none & none & student’s t & normal
Model 4 & constant & none & none & none & student’s t & log normal
Model 5 & constant & none & none & none & RHS & normal
Model 6 & constant & none & none & none & RHS & log normal
Model 7 & constant & none & none & constant & normal & normal
Model 8 & constant & none & none & constant & normal & log normal
Model 9 & constant & none & none & constant & student’s t & normal
Model 10 & constant & none & none & constant & student’s t & log normal
Model 11 & constant & none & none & constant & RHS & normal
Model 12 & constant & none & none & constant & RHS & log normal
Model 13 & constant & none & constant & none & normal & normal
Model 14 & constant & none & constant & none & normal & log normal
Model 15 & constant & none & constant & none & student’s t & normal
Model 16 & constant & none & constant & none & student’s t & log normal
Model 17 & constant & none & constant & none & RHS & normal
Model 18 & constant & none & constant & none & RHS & log normal
Model 19 & constant & none & constant & constant & normal & normal
Model 20 & constant & none & constant & constant & normal & log normal
Model 21 & constant & none & constant & constant & student’s t & normal
Model 22 & constant & none & constant & constant & student’s t & log normal
Model 23 & constant & none & constant & constant & RHS & normal
Model 24 & constant & none & constant & constant & RHS & log normal
Model 25 & constant & none & evolving & none & normal & normal
Model 26 & constant & none & evolving & none & normal & log normal
Model 27 & constant & none & evolving & none & student’s t & normal
Model 28 & constant & none & evolving & none & student’s t & log normal
Model 29 & constant & none & evolving & none & RHS & normal
Model 30 & constant & none & evolving & none & RHS & log normal
Model 31 & constant & none & evolving & constant & normal & normal
Model 32 & constant & none & evolving & constant & normal & log normal
Model 33 & constant & none & evolving & constant & student’s t & normal
Model 34 & constant & none & evolving & constant & student’s t & log normal
Model 35 & constant & none & evolving & constant & RHS & normal
Model 36 & constant & none & evolving & constant & RHS & log normal
Model 37 & constant & periodic & none & none & normal & normal
Model 38 & constant & periodic & none & none & normal & log normal
Model 39 & constant & periodic & none & none & student’s t & normal
Model 40 & constant & periodic & none & none & student’s t & log normal
Model 41 & constant & periodic & none & none & RHS & normal
Model 42 & constant & periodic & none & none & RHS & log normal
Model 43 & constant & periodic & none & constant & normal & normal
Model 44 & constant & periodic & none & constant & normal & log normal
Model 45 & constant & periodic & none & constant & student’s t & normal
Model 46 & constant & periodic & none & constant & student’s t & log normal
Model 47 & constant & periodic & none & constant & RHS & normal
Model 48 & constant & periodic & none & constant & RHS & log normal
Model 49 & constant & periodic & constant & none & normal & normal
Model 50 & constant & periodic & constant & none & normal & log normal
Model 51 & constant & periodic & constant & none & student’s t & normal
Model 52 & constant & periodic & constant & none & student’s t & log normal
Model 53 & constant & periodic & constant & none & RHS & normal
Model 54 & constant & periodic & constant & none & RHS & log normal
Model 55 & constant & periodic & constant & constant & normal & normal
Model 56 & constant & periodic & constant & constant & normal & log normal
Model 57 & constant & periodic & constant & constant & student’s t & normal
Model 58 & constant & periodic & constant & constant & student’s t & log normal
Model 59 & constant & periodic & constant & constant & RHS & normal
Model 60 & constant & periodic & constant & constant & RHS & log normal
Model 61 & constant & periodic & evolving & none & normal & normal
Model 62 & constant & periodic & evolving & none & normal & log normal
Model 63 & constant & periodic & evolving & none & student’s t & normal
Model 64 & constant & periodic & evolving & none & student’s t & log normal
Model 65 & constant & periodic & evolving & none & RHS & normal
Model 66 & constant & periodic & evolving & none & RHS & log normal
Model 67 & constant & periodic & evolving & constant & normal & normal
Model 68 & constant & periodic & evolving & constant & normal & log normal
Model 69 & constant & periodic & evolving & constant & student’s t & normal
Model 70 & constant & periodic & evolving & constant & student’s t & log normal
Model 71 & constant & periodic & evolving & constant & RHS & normal
Model 72 & constant & periodic & evolving & constant & RHS & log normal
Model 73 & evolving & none & none & none & normal & normal
Model 74 & evolving & none & none & none & normal & log normal
Model 75 & evolving & none & none & none & student’s t & normal
Model 76 & evolving & none & none & none & student’s t & log normal
Model 77 & evolving & none & none & none & RHS & normal
Model 78 & evolving & none & none & none & RHS & log normal
Model 79 & evolving & none & none & constant & normal & normal
Model 80 & evolving & none & none & constant & normal & log normal
Model 81 & evolving & none & none & constant & student’s t & normal
Model 82 & evolving & none & none & constant & student’s t & log normal
Model 83 & evolving & none & none & constant & RHS & normal
Model 84 & evolving & none & none & constant & RHS & log normal
Model 85 & evolving & none & constant & none & normal & normal
Model 86 & evolving & none & constant & none & normal & log normal
Model 87 & evolving & none & constant & none & student’s t & normal
Model 88 & evolving & none & constant & none & student’s t & log normal
Model 89 & evolving & none & constant & none & RHS & normal
Model 90 & evolving & none & constant & none & RHS & log normal
Model 91 & evolving & none & constant & constant & normal & normal
Model 92 & evolving & none & constant & constant & normal & log normal
Model 93 & evolving & none & constant & constant & student’s t & normal
Model 94 & evolving & none & constant & constant & student’s t & log normal
Model 95 & evolving & none & constant & constant & RHS & normal
Model 96 & evolving & none & constant & constant & RHS & log normal
Model 97 & evolving & none & evolving & none & normal & normal
Model 98 & evolving & none & evolving & none & normal & log normal
Model 99 & evolving & none & evolving & none & student’s t & normal
Model 100 & evolving & none & evolving & none & student’s t & log normal
Model 101 & evolving & none & evolving & none & RHS & normal
Model 102 & evolving & none & evolving & none & RHS & log normal
Model 103 & evolving & none & evolving & constant & normal & normal
Model 104 & evolving & none & evolving & constant & normal & log normal
Model 105 & evolving & none & evolving & constant & student’s t & normal
Model 106 & evolving & none & evolving & constant & student’s t & log normal
Model 107 & evolving & none & evolving & constant & RHS & normal
Model 108 & evolving & none & evolving & constant & RHS & log normal
Model 109 & evolving & periodic & none & none & normal & normal
Model 110 & evolving & periodic & none & none & normal & log normal
Model 111 & evolving & periodic & none & none & student’s t & normal
Model 112 & evolving & periodic & none & none & student’s t & log normal
Model 113 & evolving & periodic & none & none & RHS & normal
Model 114 & evolving & periodic & none & none & RHS & log normal
Model 115 & evolving & periodic & none & constant & normal & normal
Model 116 & evolving & periodic & none & constant & normal & log normal
Model 117 & evolving & periodic & none & constant & student’s t & normal
Model 118 & evolving & periodic & none & constant & student’s t & log normal
Model 119 & evolving & periodic & none & constant & RHS & normal
Model 120 & evolving & periodic & none & constant & RHS & log normal
Model 121 & evolving & periodic & constant & none & normal & normal
Model 122 & evolving & periodic & constant & none & normal & log normal
Model 123 & evolving & periodic & constant & none & student’s t & normal
Model 124 & evolving & periodic & constant & none & student’s t & log normal
Model 125 & evolving & periodic & constant & none & RHS & normal
Model 126 & evolving & periodic & constant & none & RHS & log normal
Model 127 & evolving & periodic & constant & constant & normal & normal
Model 128 & evolving & periodic & constant & constant & normal & log normal
Model 129 & evolving & periodic & constant & constant & student’s t & normal
Model 130 & evolving & periodic & constant & constant & student’s t & log normal
Model 131 & evolving & periodic & constant & constant & RHS & normal
Model 132 & evolving & periodic & constant & constant & RHS & log normal
Model 133 & evolving & periodic & evolving & none & normal & normal
Model 134 & evolving & periodic & evolving & none & normal & log normal
Model 135 & evolving & periodic & evolving & none & student’s t & normal
Model 136 & evolving & periodic & evolving & none & student’s t & log normal
Model 137 & evolving & periodic & evolving & none & RHS & normal
Model 138 & evolving & periodic & evolving & none & RHS & log normal
Model 139 & evolving & periodic & evolving & constant & normal & normal
Model 140 & evolving & periodic & evolving & constant & normal & log normal
Model 141 & evolving & periodic & evolving & constant & student’s t & normal
Model 142 & evolving & periodic & evolving & constant & student’s t & log normal
Model 143 & evolving & periodic & evolving & constant & RHS & normal
Model 144 & evolving & periodic & evolving & constant & RHS & log normal
Similar and related approaches have been suggested for working with sets of possible models, among them, “factorial model comparison” [13], “vibration of effects” [14], “model influence analysis” [15], or “specification curve analysis” [7].↩︎
[42] demonstrate that brute-force LOO-CV can underestimate elpd in the presence of outliers compared to the analytically obtained elpd, that is, a bias in the opposite direction to PSIS-LOO-CV in their experiments. Here, we do not assume access to an analytical solution and consider pessimistic elpd estimates less costly for the modeller than unreliably overestimated predictive performance.↩︎