Introducing RobustiPy: An efficient next generation multiversal library with model selection, averaging, resampling, and explainable artificial intelligence
June 24, 2025
We present RobustiPy, a next generation Python-based framework for model-uncertainty quantification and multiverse analysis, released under the GNU GPL v3.0. Through the integration of efficient bootstrap-based confidence intervals, combinatorial exploration of dependent-variable specifications, model selection and averaging, and two complementary joint-inference routines, RobustiPy transcends existing uncertainty-quantification tools. Its design further supports rigorous out-of-sample evaluation and apportions the predictive contribution of each covariate. We deploy the library across five carefully constructed simulations and ten empirically grounded case studies drawn from high-impact literature and teaching examples, including a novel re-analysis of ‘unexplained discrepancies’ in famous prior work. To illustrate its performance, we time-profile RobustiPy over roughly 672 million simulated linear regressions. These applications showcase how RobustiPy not only accelerates robust inference but also deepens our interpretive insight into model sensitivity across the vast analytical multiverse within which scientists operate.
Keywords: Model Uncertainty, Statistical Software, Open Science
“Analysts can test thousands of plausible models, trying out different model assumptions. Consumers are still constrained to know only the handful of model specifications that are curated for publication. Technology created this problem, and technology will have to be a part of the solution.”
– [2]
The act of constructing quantitative models to examine scientific phenomena represents the cornerstone of most intellectual processes [3]. Yet, the scientific community – particularly within the health and social sciences [4] – has long recognised that the process of testing hypotheses using such models is susceptible to substantial variation in outcomes, even under relatively conservative assumptions [5]. While Leamer’s infamous “Let’s Take the Con Out of Econometrics” article [6] is often credited with dramatically drawing attention to pervasive risks in applied work, the concept and even formal tools for handling model uncertainty date back decades earlier [7], [8]. Numerous factors influence the model-building process, ranging from uncontrollable elements to more structured components like the conditions of data collection; the choices which responsible researchers make over this space are otherwise known as the ‘garden of forking paths’ [9]. The range of possible outcomes in scientific modelling may, in fact, mirror the vast variation inherent in reality itself, all when focusing on just one estimand of interest [10]. This is at a time when it is increasingly recognised that complex computational routines can themselves be utilised for theory-building [11]. We create a tractable solution which takes steps to reduce this consequential source of variation: the analytical choices made by researchers. These choices can – in principle – be systematically examined and controlled. Developed in Python, our readily available, continuously maintained, and highly accessible tool aims to bring transparency to research practice, representing what we believe is the next generation of model uncertainty software. RobustiPy is also agnostic to philosophies regarding whether models in a multiverse should be weighted [12], [13]: it plots the full model space, metrics of fit, the specifications which minimise information criteria, and both weighted and unweighted estimates of the estimand. We document the five different types of analysis available through simulated analyses, conduct a time-profiling exercise, and apply RobustiPy to ten high-profile empirical examples across the research and teaching-related landscape. These empirical examples span the economic, social, psychological and medical sciences, and beyond. All of our simulated and empirical examples are made available as Python scripts and interactive notebooks within the first release of our library, which has been indexed into Zenodo [1].
Analytical decisions – in comparison to more ethereal, uncontrollable concerns – are not only more amenable to scrutiny, but also potentially more influential for the scientific enterprise. While other sources of variation (such as data collection) may be expected to cancel out with replication, research decisions regarding the modelling of scientific processes are often inherently non-random and systematically biasing. They can introduce systematic biases that remain hidden unless explicitly interrogated. This typically results in an asymmetry of information [2]: current common practice involves researchers varying their decisions, selectively reporting what they choose. Although exploring alternative specifications is not inherently problematic, the persistent lack of transparency in these practices enables questionable research behaviours, such as p-hacking; the process of iteratively modifying models until statistically significant results are obtained [14], [15]. A related issue is ‘HARKing’ (Hypothesising After the Results are Known, see for example [16]), in which researchers test multiple hypotheses and present significant findings as though they were specified in advance. Beyond concerns about scientific integrity, such selective reporting contributes to poor reproducibility and a misleading sense of empirical consensus, ultimately undermining confidence in research and the scientific endeavour more broadly [17]. A range of solutions have been proposed to address these challenges. These include greater transparency through preregistration [18], a stronger emphasis on replication [19], improved statistical power [14], [20], the adoption of Bayesian inference [21], and a shift towards predictive modelling as an evaluation tool [22], [23]. One particularly promising approach to improving transparency is the systematic reporting of all defensible combinations of analytical decisions made by a researcher. This strategy – independently developed as specification curve analysis [24] and multiverse analysis [25] – explicitly counters selective reporting; the former by looking at the specification based choices which researchers can make, and the latter by analysing a broader choice space. Rather than presenting a single preferred model, it aims to identify and estimate all plausible analysis branches, presenting the full distribution of permissible results in a single, usually visual output. This is what we build upon.
Despite their conceptual appeal, multiverse and specification-curve analyses are computationally demanding. Among the many degrees of freedom available to researchers, covariate selection alone can result in an exponential number of model specifications. For example, a dataset with ten candidate covariates produces \(2^{10} = 1,024\) possible models; twenty covariates yield over a million. Historically, researchers have tackled this complexity using bespoke computational routines to estimate and summarise the results of thousands [6], millions [26], billions [27], or even trillions [28] of regression models. The expansion of empirical research has brought increased attention to the role of ‘researcher degrees of freedom’ – the wide range of defensible yet subjective decisions made during data collection, model specification and result reporting [29] – all of which undermine the validity of empirical findings and contribute to broader issues such as the replication crisis. These problems are pervasive across almost all quantitative academic disciplines, especially psychology [30], the social sciences [31], and medical research [32]. However, custom implementations are often inefficient and insufficiently feature-rich, despite the ever-increasing computational power available to researchers in the limit of human progress [33], [34]. To overcome these limitations, we introduce RobustiPy, which streamlines model selection, averaging, out-of-sample validation and feature contribution while performing statistical inference testing and evaluating fit across the entire space of feasible specifications. By offering a standardised, efficient, and unit-tested solution, RobustiPy enhances computational reproducibility and mitigates the risks associated with costly and limited ad-hoc implementations of multiverse and specification curve analysis.
Several computational tools have been developed to facilitate the implementation of specification curve analysis – a way to visualise all specifications of a model space – and multiverse analysis – the systematic estimation of every defensible analytic pathway – in applied research. In the R programming ecosystem, the multiverse package by [35] introduces a declarative syntax for defining multiple decision points within a single analysis script. Users can specify branches corresponding to alternative models and specifications, which are then executed across the entire, permissible model space; the package integrates with R Markdown and supports structured workflows for running and reporting multiverse analyses. Specifically regarding specification curve analysis – as opposed to multiversal analysis – and likely most formally introduced by [36], the specr package [37] offers a comprehensive set of tools to define, estimate, and visualise a range of model specifications. It is particularly well suited for observational studies, enabling researchers to explore variations in outcome variables, independent variables, and covariate adjustments. Results are presented through specification curves that summarise the distribution of estimates and confidence intervals across the analytical space. [38] advocate a Bayesian approach, with code available on the Open Science Framework [39]. Similarly, but with more limited functionality, spec_curve allows users to create a specification chart using base R [40]. In Stata, there exists the MULTIVRS module to conduct multiverse analysis [41], and MROBUST, a module to estimate model robustness and model influence [42].
In contrast, the development of comparable toolkits in the Python ecosystem has been more limited, despite the fact that Python has become the world’s foremost language for highly performant data science research [43]. The spec_curve package by [44] provides foundational functionality for conducting specification curve analyses. While it offers a useful entry point, it lacks the breadth and extensibility of its R-based counterparts. The Boba domain-specific language (DSL, see [45]) provides a programming-language-agnostic tool and visualisation system for multiverse analysis. It automatically expands and executes all valid combinations of user decision points. The results are then presented through an interactive, linked-view interface. While Boba supports execution in both R and Python environments, its use of a custom DSL makes it difficult to integrate seamlessly into standard analysis pipelines. To address this gap more broadly, we introduce RobustiPy. RobustiPy extends beyond existing tools by incorporating functionality for model selection, bootstrapping, joint inference, out-of-sample evaluation, and the systematic exploration of multiple estimands and outcome spaces. In doing so, it enables a more rigorous and transparent approach to robustness analysis. A summary of the main packages in this space is described in Table 1.
| Feature / Toolkit | multiverse (R) | specr (R) | MULTIVRS (Stata) | spec_curve (Python) | RobustiPy (Python) |
|---|---|---|---|---|---|
| Multiverse analysis | ✔ | – | ✔ | – | ✔ (limited) |
| Specification curve analysis | – | ✔ | – | ✔ | ✔ |
| High-abstraction syntax | ✔ | (partial) | (limited) | – | ✔ |
| Visualisation of results | ✔ | ✔ (built-in) | ✔ (kdensity) | ✔ (basic) | ✔ (custom) |
| Joint inference | – | ✔ | – | ✔ | ✔ |
| Influence analysis | – | – | ✔ | – | ✔ |
| Bootstrapping | – | ✔ | – | ✔ | ✔ |
| Out-of-sample | – | – | – | – | ✔ |
| Multiple estimands | – | ✔ (limited) | – | ✔ (partial) | ✔ |
In order to describe the problem which RobustiPy addresses, we formalise it below. Consider the key association between variables \(\mathbf{Y}\) and \(\mathbf{X}\), where a set of covariates \(\mathbf{Z}\) can influence the relationship between the former as follows:
\[\label{objective_func} \mathbf{Y} = F(\mathbf{X}, \mathbf{Z}) + \boldsymbol{\epsilon} .\tag{1}\]
Assume \(\mathbf{Y}\) is generated by an unknown data generating function \(F()\) which takes the values of \(\mathbf{X}\) conditioned by the values of the set of covariates \(\mathbf{Z}\), plus an independent and identically distributed random error term (\(\boldsymbol{\epsilon}\)). In this conceptualisation of the problem, Eq. 1 represents the real data generation process that researchers approximate, but its perfect reproduction is (almost always) unattainable. Following the example of [36], \(\mathbf{Y}\) and \(\mathbf{X}\) are usually imprecisely defined latent variables (for which we cannot have a direct measure). Likewise, the set of covariates \(\mathbf{Z}\) can also be composed of imprecisely defined variables, the number of which can be unknown and/or non-finite. Under these conditions, it is evident that researchers must make concessions, usually in the form of finite sets of ‘reasonable’ operationalisations of \(\mathbf{Y}\), \(\mathbf{X}\), \(\mathbf{Z}\) and \(F()\). The way ‘reasonable’ is defined is in itself a point of contention; [36] somewhat canonically defines this as a set where each element is theoretically justified, statistically valid, and non-redundant. Let’s define the set of ‘reasonable’ operationalisations of \(\mathbf{Y}\), \(\mathbf{X}\), \(\mathbf{Z}\) and \(F()\) as \(\overleftrightarrow{\mathbf{Y}}\), \(\overleftrightarrow{\mathbf{X}}\), \(\overleftrightarrow{\mathbf{Z}}\) and \(\overleftrightarrow{F()}\). Then, we have:
\[\label{spec_func} \overleftrightarrow{\mathbf{Y}}_{\pi} = \overleftrightarrow{F}_{\pi}(\overleftrightarrow{\mathbf{X}}_{\pi}, \overleftrightarrow{\mathbf{Z}}_{\pi}) + \boldsymbol{\epsilon}.\tag{2}\]
Eq. 2 corresponds to a single possible choice (\(\pi\)) of all the possible choices (\(\Pi\), such that \(\pi \in \Pi\)) which we can generate using our current formalisation. The total number of ‘researcher degrees of freedom’ [46] can then be calculated as \(2^D\) where \(D\) equals \(d_{\overleftrightarrow{\mathbf{Y}}} + d_{\overleftrightarrow{\mathbf{X}}} + d_{\overleftrightarrow{\mathbf{Z}}} + d_{\overleftrightarrow{F()}}\) and \(d\) is the length of the vectors containing the number of choices for each parameter in Eq. 2 . For example, an analysis with two functional forms, two target variables, two predictors and four covariates will yield a space of \(\Pi\) (\(2^{10}\) choices). We split the computation based on the operationalised variables in Eq. 2 by creating a ‘reasonable specification space’ for each: \(\Pi_{\overleftrightarrow{\mathbf{Y}}}\), \(\Pi_{\overleftrightarrow{F()}}\), \(\Pi_{\overleftrightarrow{\mathbf{X}}}\), \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\). The entire model space can be expressed as:
\[\Pi = \Pi_{\overleftrightarrow{\mathbf{Y}}} \times \Pi_{\overleftrightarrow{F()}} \times \Pi_{\overleftrightarrow{\mathbf{X}}} \times \Pi_{\overleftrightarrow{\mathbf{Z}}}\]
Given that \(\Pi\) represents a collection of multiple possible versions of the true relationship between \(\mathbf{Y}\) and \(\mathbf{X}\), an increasingly large space of random possible choices (\(D\)) will in principle always lead to a more reasonable approximation of \(\mathbf{Y} = F(\mathbf{X},\mathbf{Z})\). The main problem with current research practices is that researchers usually choose a small and non-random sample of \({\Pi}\) to represent the variety of results in the full space of possible choices [36].2
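To make the combinatorics of \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\) concrete, the following minimal Python sketch (purely illustrative, not RobustiPy internals) enumerates every subset of ten candidate covariates, recovering the \(2^{10} = 1,024\) specifications noted above.

```python
# Illustrative only: how the covariate choice space grows combinatorially.
from itertools import chain, combinations

z_candidates = [f"z_{i}" for i in range(1, 11)]  # ten candidate covariates

def all_subsets(items):
    """Every subset of `items`, including the empty set."""
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

pi_z = list(all_subsets(z_candidates))
print(len(pi_z))  # 2**10 = 1024 covariate specifications
```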
To formalise the intuition that larger model spaces can yield better approximations, consider a growing sequence of specification sets \(\{\Pi_D\}_{D=1}^\infty\) with \(|\Pi_D| = 2^D\) and \(\Pi_D \subset \Pi_{D+1}\). Let \(\widehat{Y}_\pi = \widehat{F}_\pi(X,Z)\) denote the fitted response from model \(\pi \in \Pi_D\). Suppose that the true conditional expectation \(Y = F^\star(X,Z) + \varepsilon\) lies within the closure of the union \(\bigcup_D \Pi_D\), and each \(\pi\) in \(\Pi_D\) is explored with probability bounded below by \(c > 0\) as \(D \to \infty\) (e.g., uniform or random subset sampling). Then for any \(\varepsilon > 0\), the probability that the best model in \(\Pi_D\) achieves mean-squared prediction error within \(\varepsilon\) of the truth converges to 1: \[\label{eq:oracle-approx} \lim_{D \to \infty} \Pr\!\left( \min_{\pi \in \Pi_D} \bigl\| \mathbb{E}\bigl[\widehat{Y}_\pi\bigr] - F^\star(X,Z) \bigr\|_2 < \varepsilon \right) = 1,\tag{3}\] where \(\|\cdot\|_2\) denotes the Euclidean norm. This is not a convergence claim for a randomly chosen specification, but rather a statement that with enough diversity in \(\Pi_D\), some model in the set will approximate the true data-generating process arbitrarily well. The result depends on sufficient coverage of the model space and identifiability, and does not hold under adversarial or highly correlated specifications.
RobustiPy enables an approximation of the real data generation process (Eq. 1) by generating and organising specification spaces (\(\Pi_{\overleftrightarrow{\mathbf{Y}}}\), \(\Pi_{\overleftrightarrow{\mathbf{X}}}\), \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\)). In particular, RobustiPy allows a full examination across the entirety of \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\) assuming a single \(\overleftrightarrow{\mathbf{Y}}\), a fixed set of \(\overleftrightarrow{\mathbf{X}}\) predictors and a fixed set of \(\overleftrightarrow{F()}\) estimators.3 Additional functionality allows users to generate and compute \(\Pi_{\overleftrightarrow{\mathbf{Y}}}\), assuming multiple valid dependent measurements. Let’s review the most common use cases – including two stand-alone Code Blocks – where our online repository hosts five simulations pertaining to each type of analysis and Section 2 discusses ten highly varied empirical applications of each type. Section 3 discusses these results, and Section 4 contains a full delineation of every metric and method which is undertaken and available within RobustiPy.
Our first simulated example (‘Code Block 1: Vanilla computation’) assumes a linear data generating process. Here we fit an OLS estimator with 10-fold cross-validation and 1,000 bootstrap resamples, and then we plot results and print a large number of summary outputs. The data generation process described in Code Block 1 is analogous to a typical modelling problem whereby a researcher seeks to evaluate the relationship between a target variable of interest with only one valid operationalisation, and a predictor with only one possible operationalisation which is conditional on a ‘reasonable’ set of covariates (and, by extension, a researcher’s idea about the data generating process, be it theoretical or otherwise). In this situation (i.e. with no variation in the outcome space, the variable of interest space, or the functional form), \(\Pi_{\overleftrightarrow{\mathbf{Z}}} \equiv \Pi\). For the purposes of delineation – as in Code Block 1 – the OLSRobust class allows us to declare an object (vanilla_obj) containing the target variable (\(Y_{1}\)), the main predictor as part of an array (\(\overleftrightarrow{\mathbf{X}}\)), and the source of data. The .fit() method allows us to declare the array of possible covariates (\(\overleftrightarrow{\mathbf{Z}}\)) – which a researcher may be agnostic about including – to be used in the analysis. RobustiPy automatically adds an intercept unless the user has already included one in \(\overleftrightarrow{\mathbf{X}}\) or \(\overleftrightarrow{\mathbf{Z}}\) (i.e., a feature with any name which has zero variance). When a constant is present in \(\overleftrightarrow{\mathbf{X}}\), the software leaves the matrix unchanged. A constant placed in \(\overleftrightarrow{\mathbf{Z}}\) is treated as a varying regressor, so the iterator yields some specifications with an intercept and others without. We – by default – parallelise computation to decrease runtime.4 Finally, to return the results of fitted models, OLSRobust has a method entitled get_results, returning a class object containing estimates as well as utility functions for summarisation and visualisation. We allow the user to specify which specifications they wish to visually highlight – which allows direct comparisons with already published results – alongside multiple other options.5
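A minimal sketch of this ‘vanilla’ workflow is given below. It is not a reproduction of the authors’ Code Block 1: the import path and keyword names (y, x, data, controls, draws, kfold, seed) are assumptions based on the description above, so the released notebooks [1] should be treated as the authoritative reference.

```python
# Hedged sketch of a 'vanilla' RobustiPy run; keyword names and import path are
# assumptions based on the text, not a verified reproduction of Code Block 1.
import pandas as pd
from robustipy.models import OLSRobust  # import path assumed

df = pd.read_csv("simulated_data.csv")  # hypothetical simulated dataset

vanilla_obj = OLSRobust(y=["y_1"],            # single operationalisation of Y
                        x=["x_1"],            # focal predictor
                        data=df)
vanilla_obj.fit(controls=["z_1", "z_2", "z_3"],  # covariates varied combinatorially (2**3 specs)
                draws=1000,                      # bootstrap resamples
                kfold=10,                        # cross-validation folds
                seed=192735)
results = vanilla_obj.get_results()
results.summary()   # printed summary across the specification space
results.plot()      # specification-curve visualisation
```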
Figure 1.
RobustiPy also allows for the inclusion of multiple ‘fixed’ covariates, which are not varied across the covariate space. Programmatically, this set of fixed covariates will be included in all specifications, but not considered in the combinatoric operation. They act as a set of always-present features. This functionality is useful in cases where researchers have a strong determination that these variables always need to be included in a model, likely due to theoretical reasoning (or for specific hypothesis testing, see Section 2.2). It also has the secondary effect of reducing the specification space and hence the compute cost, making it an alternative path to follow in cases with very large sets of covariates (see Section 2.3). To do this, we simply need to append additional variables into our \(\overleftrightarrow{\mathbf{X}}\) array; RobustiPy will take the first name of the list as the main estimand of interest, and the rest as fixed, never-varying covariates. A stand-alone example of this type of use case is shown in Supplementary Code Block 1. An alternative approach to reducing compute cost with a large covariate space is to use our z_specs_sample_size() functionality.
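Continuing the sketch above (same assumed imports and data), fixed covariates are passed as additional entries of \(\overleftrightarrow{\mathbf{X}}\): the first element is the focal estimand and the rest are always included but never varied. The keyword names remain assumptions.

```python
# Hedged sketch: everything after the first element of x is a fixed, always-included
# covariate; only the 'controls' list is varied combinatorially (keyword names assumed).
fixed_obj = OLSRobust(y=["y_1"],
                      x=["x_1", "z_fixed_1", "z_fixed_2"],  # first = estimand; rest fixed
                      data=df)
fixed_obj.fit(controls=["z_1", "z_2", "z_3", "z_4"],  # 2**4 = 16 varying specifications
              draws=1000, kfold=10, seed=192735)
# For very large covariate sets, the z_specs_sample_size() functionality mentioned
# above can instead sample from Π_Z; its exact signature is not reproduced here.
```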
RobustiPy also provides the option to run a fixed-effects OLS model. These types of models are commonly used in the analysis of panel data, in which repeated observations are nested within the observed unit (e.g. individuals, households, etc.), usually over time. In these contexts, fixed-effects OLS is used to isolate and eliminate the intraclass variation of the observations. RobustiPy allows the user to specify a grouping variable, ‘group’, which enters into the .fit() method. By providing this, RobustiPy automatically performs demeaning of the relevant variables and estimates its sets of models on the resultant, transformed data. A stand-alone example use case is shown in Supplementary Code Block 2.
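A hedged sketch of the fixed-effects pattern follows; the group argument is described in the text, while the remaining keyword names, variable names, and the long-format panel file are illustrative assumptions.

```python
# Hedged sketch of a fixed-effects (within-group demeaned) run; keyword names assumed.
import pandas as pd
from robustipy.models import OLSRobust  # import path assumed

panel = pd.read_csv("panel_data.csv")   # hypothetical long-format panel

fe_obj = OLSRobust(y=["wellbeing"], x=["spending"], data=panel)
fe_obj.fit(controls=["age", "income", "employed"],
           group="person_id",   # grouping variable: relevant variables are demeaned within groups
           draws=1000, kfold=10)
fe_results = fe_obj.get_results()
fe_results.summary()
```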
RobustiPy also provides the opportunity to apply a logistic estimator to handle binary outcome variables (via the Newton method, see [47], with a tolerance of \(1\times10^{-7}\)). We chose logistic regression as it is one of the most common and computationally efficient estimators for binary choice models; an important consideration when running specification curves. This functionality is implemented through the LRobust class, which offers essentially the same functionality as OLSRobust, but only accepts binary outcome variables. Naturally, it is also possible to estimate the model space for a binary dependent variable with OLSRobust, which would represent a Linear Probability Model (see, for example, [48]). A stand-alone example use case is shown in Supplementary Code Block 3.
Let’s now modify our assumed data generation process by allowing the dependent variable to be a vector of four possible operationalisations. We describe the process to simulate this data in Code Block 2. In this scenario, researchers might not only want to compute \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\) over each operationalisation in \(\overleftrightarrow{\mathbf{Y}}\) but also to compute all the possible combinations of composites derived from \(\overleftrightarrow{\mathbf{Y}}\) (or, in other words, \(\Pi_{\overleftrightarrow{\mathbf{Y}}}\)) in order to properly obtain \(\Pi\). It might also involve an array of never-changing constants (Section 1.3.2). RobustiPy takes the list of operationalisations and creates combinations of composites by first standardising their values, and then combining them using the row-wise average. ‘Code Block 2: Multiple dependents’ extends our simulation from above. In this particular example, we highlight a specification containing two covariates and the \(Y_1\) and \(Y_2\) composites: [‘y_1’, ‘y_2’, ‘z_1’, ‘z_2’], and a specification containing only one covariate (\(Z_4\)) and a single, non-composite operationalisation of the dependent variable in the form of \(Y_4\): [‘y_4’, ‘z_4’]. If researchers are interested in running multiple fit commands – one for each dependent variable – RobustiPy contains a concat_results function within the utils.py module which stacks results objects together. This is naturally the way by which a researcher can conduct analyses with multiple binary dependents.
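A sketch of the multiple-dependents workflow is given below; the list-of-dependents pattern and the concat_results location follow the description above, but the keyword names and the exact concat_results signature are assumptions.

```python
# Hedged sketch: multiple admissible operationalisations of Y are passed as a list,
# and RobustiPy forms standardised, row-wise-averaged composites of every combination.
import pandas as pd
from robustipy.models import OLSRobust       # import path assumed
from robustipy.utils import concat_results   # location per the text; signature assumed

df_multi = pd.read_csv("simulated_multi_y.csv")  # hypothetical dataset

multi_obj = OLSRobust(y=["y_1", "y_2", "y_3", "y_4"],  # all admissible operationalisations
                      x=["x_1"], data=df_multi)
multi_obj.fit(controls=["z_1", "z_2", "z_3", "z_4"], draws=1000, kfold=10)
res_a = multi_obj.get_results()

# Separate runs (e.g. one per binary dependent) can then be stacked, e.g.:
# combined = concat_results([res_a, res_b])
```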
All of the results in the following Sections 2.1-2.5 – unless otherwise specified – are reported with 1,000 bootstrapped draws, ten folds, a seed of 192735, and options of the HQIC information criterion for model selection and the pseudo-R\(^2\) metric for out-of-sample evaluation.6\(^,\)7 We provide the results of ten empirical examples which enunciate the five types of analysis as detailed in Sections 1.3.1-1.3.5; some are in more detail than others in the main text, each with a corresponding notebook as part of our Online Supplementary Materials (i.e., [1]). All results are discussed in Section 3 (and in particular, Table [tab:results]).
Figure 2.
Figure 4.
Our first example utilises the infamous union.dta dataset; a subsample of the National Longitudinal Survey of Young Women [49]. Women were surveyed in each of the 21 years from 1968 to 1988, except for the six years 1974, 1976, 1979, 1981, 1984, and 1986. This canonical dataset is also well examined as an exemplary dataset in other work which considers model uncertainty and provides multiversal tools [42]. It is also a teaching resource – used in undergraduate and graduate econometrics classes around the world – to illustrate the estimation of treatment effects in observational data. Textbooks which utilise it include [50] and [51], as just two examples. Through OLS regression, students and researchers estimate \(\hat{\beta}\), which is interpreted as the average percentage difference in wages attributable to union membership, holding other factors constant. Importantly, this dataset is entirely anonymised and publicly available. It contains information on things like wage (which, when logarithmically transformed, represents the dependent variable, often written as ‘\(y_i\)’), union membership (and the associated estimand of interest, ‘\(\hat{\beta}_1\)’ in the canonical reduced-form linear regression), and a multitude of often arbitrarily specified controls. Results are visualised in Figure 3; a simple OLS regression on the full sample yields an estimate of the estimand of interest of 10.18; lower than ‘conventional estimates’, which are nearer to a 15% premium (Hirsch 2004). [42] estimates the (unweighted) median effect across the multiverse as 14.00; the median of our unweighted estimates across all samples with no bootstraps is 13.5, across all samples and bootstraps it is 13.6, and the BIC-weighted estimate is 9.8 (as per the recommendation of [13]). We are agnostic as to which number is most statistically valid, simply providing the tools by which a researcher can consider the range of plausible results.
We also undertake a ‘vanilla’ application to criminology: a revisitation of [52] based on a dataset from [53], where the dependent variable is state-level crime rate, and the estimand of interest is income inequality (defined as the percentage of families earning below one-half of the median income). The results from the .summary() method – inherent to RobustiPy – are shown in ‘Code Block 3: Summarizing a re-analysis of criminality’ and Supplementary Information Figure 10. The Supplementary Information (Figure 11) also contains a re-analysis of the ‘Colonial Origins of Comparative Development’, based on the influential paper of [54].
We next detail the case where we have an array of variables that we always want to include in the model space, but never vary. In terms of RobustiPy functionality, this simply involves passing an array of variable names (\(\overleftrightarrow{\mathbf{X}}\)) which has length greater than one, with the array of variable controls specified as before (\(\overleftrightarrow{\mathbf{Z}}\)). This involves using updated variables from the Penn World Table (version 10.01, see [56]) to revisit [55], which contributes to the ‘Empirics of Economic Growth’. We map modern data from 73 of the 98 countries in the original 1988 sample, using observed values of all parts of the \(\log(n+g+\delta)\) and \(\log(\frac{I}{Y})\) terms which are canonical to the Solow Growth Model. We also include various possible extensions – as done in [55] – such as including logged metrics of human capital, and previous values of GDP in levels. By running RobustiPy twice – with \(\overleftrightarrow{\mathbf{X}} = [\log(n+g+\delta), \log(\frac{I}{Y})]\) and \(\overleftrightarrow{\mathbf{X}} = [\log(\frac{I}{Y}), \log(n+g+\delta)]\), both of which we always want in our core model – we are able to recover two large sets of bootstrapped estimates of the relevant estimands accompanying the first variable in each of these two arrays. This allows us to extract estimates from the results object of RobustiPy to assess whether the coefficients are of equal magnitude and opposite sign. RobustiPy also allows us to visualise the distribution of the \(\bar{R}^2\) values (a key test of the Solow Growth Model); both are shown in Figure 5, with full RobustiPy outputs for the two runs shown in Figures 12-13.
Regarding fixed-effect estimation, our empirical replication comes from [57], which takes a longitudinal approach to local government spending on adult social care and carers’ subjective well-being in England. Specifically, we analyse the effect of a Local Authority measure of Adult Social Care Spending on a General Health Questionnaire (GHQ12) metric from the United Kingdom Household Longitudinal Survey (see Appendix I, column (1) of [57]). A static group of controls (i.e., inside \(\overleftrightarrow{\mathbf{X}}\)) includes seven sociodemographic and health variables, alongside thirteen year fixed-effect dummy variables, whereas our array of controls – which do vary – contains a further eight variables. Importantly, we document the inclusion of a grouping variable (‘pipd’), common in longitudinal survey analysis. Figure 6 shows the results, with the median value of the estimand of interest (0.203 for all specifications and all bootstrapped resamples, 0.202 for all specifications and no resampling, and 0.182, 0.182 and 0.183 for AIC-, BIC-, and HQIC-weighted estimates). This result is extremely close to that reported in the original paper (0.1825). Given that the size of this feature space is so large, had we not specified a larger \(\overleftrightarrow{\mathbf{X}}\) (an array of always-present variables), RobustiPy would have had to estimate a relatively intractable model space of \(2^{D}\) = 268,435,456 specifications. To this extent, we also use this as an example to show our resampling of \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\), with results shown in Table [tab:results]. A complementary pedagogical example – with the Understanding Society Longitudinal Teaching Dataset (Waves 1-9, from October 2021, see [58]) – is provided in Supplementary Figure 14, where the estimand of interest is ‘sf1_dv’ (general health) and a fixed-effect grouping is again specified based on a unique cross-wave person identifier.
We next describe an example of how RobustiPy can be used when our dependent variable is binarized, as part of an application common to the medical sciences. In particular, we utilise the Framingham Heart Survey [59] for the estimation of models related to a dependent variable termed ‘TenYearCHD’: a binary indicator equal to one if a study participant experienced a coronary heart disease (CHD) event within ten years of their baseline examination, and zero otherwise. This is used in multiple, high-profile papers such as [60], and as part of a composite in papers such as [61]. The dataset includes over 4,240 records across sixteen features. We fit our LRobust class with ‘TenYearCHD’ as our target dependent, ‘BMI’ as the single element of \(\overleftrightarrow{\mathbf{X}}\), for which we calculate estimands of interest, and we use ‘age’, ‘male’, ‘totChol’, ‘sysBP’, ‘BPMeds’, ‘currentSmoker’, ‘cigsPerDay’, ‘diabetes’, ‘heartRate’, and ‘glucose’ as elements of our array of controls (\(\overleftrightarrow{\mathbf{Z}}\)). This example accepts the ‘oddsratio’ boolean into the .plot() command on the results object to calculate odds ratios, with results shown in Figure 16. Our online replication archive also includes models fitted to a Kaggle dataset on a binarized measure of airline customer satisfaction (from an undisclosed airline); the resulting coefficient estimates indicate how RobustiPy can be used as a method for model selection and averaging, and are presented in Figure 15 and, again, summarized in Table [tab:results] (which also contains RobustiPy results from estimating the problem as a Linear Probability Model, i.e., with the OLSRobust class).
We next present the case whereby a study might have multiple candidates for a dependent variable that measures the same underlying concept. While this idea might be uncommon in some disciplines, having a composite target variable that combines multiple measures is extremely common practice in myriad academic disciplines. The most common case of multiple target variables arises from the use of multi-item measures in surveys. Here, we attempt to replicate the results of Study 3a in [62], alongside the further replication analysis conducted by DataColada for the same study. In their analysis (see [63]), DataColada alluded to the manipulation of the results through the alteration of the data used in the original analyses, ultimately leading to a retraction of the original study (see [64]). Here, we attempt to replicate the effect of the experimental condition on measures of moral impurity (seven items) and networking intentions (four items), separating the datasets used in both the original study, and that which has been reconstructed. In the case of multiple dependent measures, RobustiPy combines them into all possible composite combinations, including single-item combinations. To ensure comparability between items, RobustiPy uses \(z\)-scores rather than raw measures (see Section 1.3.5). Finally, to replicate an ANOVA/ANCOVA analysis, we convert the categorical column containing the experimental condition group membership into dummy variables and use them as predictors in each case. The results are shown in Figure 7. The resultant outputs from RobustiPy are used to plot specification curves based on the published dataset, which would appear to support the main hypothesis of the study, as it shows that any combination of the outcome variable ‘moral impurity’ has a significant effect from the experimental condition. However, results using the reconstructed data show that the effect is much weaker and in the opposite direction than hypothesised. The same holds for a very similar outcome (‘networking intentions’). This example illustrates the flexibility of RobustiPy, demonstrating how it can be used to replicate both ANOVA and ANCOVA analyses – even though these are not explicitly implemented in the library – while also enabling the integration of custom visualisations to present results. More broadly, this example shows how RobustiPy can be employed as a tool for auditing existing studies by systematically exploring the robustness of published findings. By taking the output of the ‘.get_results()’ object, RobustiPy enables a multitude of custom comparisons beyond that which one run alone makes possible. We also use RobustiPy to replicate [65], which aims to understand the ‘association between adolescent well-being and digital technology use’. In this article, data from well-known adolescent well-being panel studies are used across a series of measures of well-being (dependent variables) and digital use (independent variables). Under this setting, the authors opt for a specification curve analysis approach to obtain a distribution of the effect of many forms of technology use across many measures of well-being; we replicate one of the main analyses, where the results – which are also a function of RobustiPy’s ability to rescale \(\overleftrightarrow{\mathbf{X}}\) and \(\overleftrightarrow{\mathbf{Z}}\) – are seen in Figure 8 and Table [tab:results].
Table [tab:results] outputs all results from our ten empirical replications. It explicates the enormous variation in both the estimand of interest – across a researcher’s set of reasonable choices – and also the variation in out-of-sample accuracy across specification spaces. It also shows, for example, how we can consider whether coefficient values are approximately equal and opposite across various different realisations of an estimand (weighted or unweighted), as done by [55] in their test of the Solow Model. It emphasizes the ability of RobustiPy to act as a tool for the evaluation of research already published in the scientific record, not least through its ability to highlight specific specifications of interest in the visualisation; it acts as an essential audit tool when the validity of research is under question (such as through the retraction of [62]). While not emphasized elsewhere in this manuscript, RobustiPy also outputs the exact specifications which maximize out-of-sample evaluation metrics, and those which minimize certain information criteria; this function is immensely useful for helping to maximize model fit (see Code Block 3 for an example of this). While Table [tab:results] is compiled from the .summary() method, this is just one of the three ways by which RobustiPy provides output; it also provides a large number of (separable) visualisations alongside all raw estimates (which can be used for bespoke downstream analyses).
We also time-profile RobustiPy. This involves two simple simulations of data generating processes (see online Supplementary Material), and contains an example of each of the OLSRobust and LRobust classes. We generate 25 integer values on a log scale between 10 and 10,000 which represent a varying number of draws (\(b\)). We use the ‘.fit()’ method with 2, 5, 10, 15, 20, and 25 folds (\(k\)). We use 3, \(\ldots\), 7 controls (which results in groups of \(\{2^3, \ldots, 2^7\}\) specifications). For each combination of folds, draws, and number of specifications, the problem is estimated ten times. The seed is unchanged for each iteration, and we use twenty cores of a 14900F processor. This results in approximately 671,512,080 regressions being estimated for our time profiling in total, including those done to create joint-inference tests (see Section 4). With the results shown in Figure 9, RobustiPy operates as an approximately \(O(K(2b+k))\) process.
We aim to provide an accessible tool for researchers to run the most common associative models in an era when computational power is vast [33], [34]. We offer this easy-to-use interface in the hope that transparent results-reporting will spread and become the default standard for statistical analysis. By lowering the user-experience barrier, we aim to accelerate the adoption – and routine practice – of robust modelling. We include as many tools, tests, and features as possible whilst attempting to maintain the simplest possible interface in order to encourage researchers to explore the robustness of the models which test their research hypotheses. Importantly, our application is highly modular and easily extendible.8 We envision tools such as RobustiPy as a necessity for a large majority of empirical workflows, yet there are some natural limitations which warrant further elaboration. First, while RobustiPy offers a comprehensive array of tools to conduct various forms of specification curve analyses, many aspects of the process rely heavily on decisions made by the researcher. The selection of a reasonable array of covariates, the choice of the underlying model, and ultimately the hypothesis to be tested are all left to the researcher’s discretion; spurious or poorly specified models will still yield spurious and poorly specified results, even when implemented using tools like RobustiPy (despite the fact that we also weight by information criteria, as advocated by [13]). Our application does not presently conduct mis-specification tests, for example, which is a natural extension. While analysis of this kind provides a computationally powerful framework for testing the robustness of statistical models, readers should not assume that it is the ultimate or only approach for such assessments. Like any method, it is most useful when applied critically and in conjunction with substantive theoretical reasoning. The natural extensions to RobustiPy are limitless. We plan to explore incorporating Laplacian approximations into RobustiPy by expanding each model’s log-posterior around its maximum a-posteriori estimate, using the resulting Gaussian integral as an efficient proxy for the marginal likelihood (thereby enabling scalable comparisons across large model spaces [66]) in order to reduce compute costs [67]. Other natural extensions involve the incorporation of additional estimators beyond the OLS, Logit, and Fixed Effects models currently supported. There is also pioneering work which attempts to equate estimators [68], much of which will be incorporated as we continue to support enhancements to the RobustiPy framework over time.
RobustiPy provides an array of in-sample metrics to evaluate the robustness and quality of analysis. These metrics are calculated for all models and samples, and stored in the results class to maximize their availability to the end-user.
For each ordinary least squares specification, RobustiPy performs the standard – but adjusted – in-sample coefficient of determination (\(\bar{R}^2\)) computation,
\[\bar{R}^2 = 1 - \frac{ \displaystyle \sum_{i=1}^N \frac{(y_i - \hat{y}_i)^2 }{(N - P - 1)} }{ \displaystyle \sum_{i=1}^N \frac{(y_i - \bar y)^2 }{(N - 1)} }\]
where \(y_i\) are observed values of each observation \(i \in \{1, \dots, N\}\), \(\hat{y}_i\) are the predicted values, \(\bar{y}\) is the mean of the observed data, \(N\) is the number of observations for that estimation, and \(P\) is the number of parameters. RobustiPy also prints a large amount of summary information about all of the calculations which it undertakes, including the maximum value of \(\bar{R}^{2}\) and the specification \(\pi^*_{\bar{R}^2}\) to which it belongs, as follows:
\[\pi^*_{\bar{R}^2} = \arg\max_{\pi \in \{1, \dots, |\Pi|\}} \bar{R}^2_\pi \label{r2_max}\tag{4}\]
where \(\Pi\) is the set of models in the choice space (see Section 1.2) and \(\bar{R}^{2}_\pi\) is the \(\bar{R}^2\) value of the \(\pi\)-th model. In the case of logistic regressions, we calculate
\[R_{\mathrm{McF}}^2 = 1 - \frac{ \displaystyle \sum_{i=1}^N \bigl[y_i\ln\hat{y}_i \;+\;(1-y_i)\ln(1-\hat{y}_i)\bigr] }{ \displaystyle \sum_{i=1}^N \bigl[y_i\ln\bar y \;+\;(1-y_i)\ln(1-\bar y)\bigr] }\label{eq:mcfadden}\tag{5}\]
which is one minus the ratio of the model’s log-likelihood to that of a null (intercept-only) model, where \(\hat{y}_i\) is the model’s predicted probability that observation \(i\) has an outcome equal to one.
RobustiPy also calculates the specification which maximizes this metric in the logistic case, equivalent to Eq. 4. See [69] for more details. In the case of the LRobust class, the full model’s likelihood is also returned. Similarly, for OLSRobust – for each specification – the model’s log-likelihood when assuming homoscedastic errors is calculated for expediency as follows:
\[\log L = -\frac{N}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^N (y_i - \hat{y}_i)^2\] The specification \(\pi^*_{\textrm{log L}}\) which maximizes the log-likelihood is presented in the summary method, determined as follows: \[\pi^*_{\textrm{log L}} = \arg\max_{\pi \in \{1, \dots, |\Pi|\}} \log L_\pi\]
where \(|\Pi|\) is the number of models in the specification space (2\(^D\)), and \(\log L_\pi\) is the log-likelihood value of the \(\pi\)-th model.
RobustiPy also computes three information criteria for each specification: the Akaike Information Criterion (AIC, see [70]), the Bayesian Information Criterion (BIC, see [71]), and the Hannan-Quinn Information Criterion (HQIC, see [72]). Each information criterion is calculated from the previously computed log-likelihood, as follows: \[\textrm{AIC} = 2P - 2 \log L\] \[\label{eq:bic}
\textrm{BIC}= P \log N - 2 \log L\tag{6}\] \[\textrm{HQIC} = 2P \log (\log N) - 2 \log L\]
where \(P\) is the number of estimated parameters, \(N\) is the number of observations, and \(\log L\) is the log-likelihood of the model. As before, all information criteria are saved in the results class, and the minimum value and its associated model are provided by the summary method:
\[\pi^*_{\textrm{IC}} = \arg\min_{\pi \in \{1, \dots, |\Pi|\}} IC_\pi,\]
where IC \(\in \{\textrm{AIC}, \textrm{BIC}, \textrm{HQIC}\}\).
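The criteria above can be reproduced directly from a specification's fitted values; the short numpy sketch below mirrors the formulas in the text (it is not RobustiPy's internal implementation).

```python
# Worked sketch of the in-sample criteria defined above, mirroring the text's formulas.
import numpy as np

def gaussian_loglik(y, y_hat):
    """Simplified homoscedastic log-likelihood used in the text."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)

def information_criteria(y, y_hat, n_params):
    """AIC, BIC and HQIC for one specification."""
    n = len(y)
    ll = gaussian_loglik(y, y_hat)
    return {"AIC": 2 * n_params - 2 * ll,
            "BIC": n_params * np.log(n) - 2 * ll,
            "HQIC": 2 * n_params * np.log(np.log(n)) - 2 * ll}

# Model selection across a specification space: the smallest criterion wins, e.g.
# best = min(range(len(preds)), key=lambda s: information_criteria(y, preds[s], p[s])["HQIC"])
```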
In addition to in-sample metrics, RobustiPy provides multiple k-fold cross-validated out-of-sample metrics for each estimated model. Given a dataset \(\mathcal{D}\), defined multiple times by different sets of predictors \(\overleftrightarrow{\mathbf{X}}^{(\pi)}\), where \(\pi\) indexes different model specifications, we perform analysis with \(K\)-fold cross-validation to evaluate a predictive model. Each specification and associated dataset \(\mathcal{D}^{(\pi)}\) consists of:
\[\mathcal{D}^{(\pi)} = \{(\overleftrightarrow{\mathbf{X}}_i^{(\pi)}, \overleftrightarrow{\mathbf{Z}}_i^{(\pi)}, \overleftrightarrow{\mathbf{Y}}_i^{(\pi)})\}_{i=1}^{N^{(\pi)}},\]
where \(N^{(\pi)}\) is the sample size specific to specification \(\pi\) after all preprocessing, \(\overleftrightarrow{\mathbf{X}}^{(\pi)}\) represents a specific choice of predictors \(\overleftrightarrow{\mathbf{X}}\), \(\overleftrightarrow{\mathbf{Z}}^{(\pi)}\) represents a specific choice of covariates \(\overleftrightarrow{\mathbf{Z}}\) and \(\overleftrightarrow{\mathbf{Y}}^{(\pi)}\) represents a specific choice of dependent variable \(\overleftrightarrow{\mathbf{Y}}\). For each specification \(\pi\), we divide \(\mathcal{D}^{(\pi)}\) into \(K\) approximately equal-sized, mutually exclusive subsets \(\{ \mathcal{D}_1^{(\pi)}, \mathcal{D}_2^{(\pi)}, \dots, \mathcal{D}_K^{(\pi)} \}\), such that:
\[\mathcal{D}_k^{(\pi)} \cap \mathcal{D}_j^{(\pi)} = \emptyset \quad \text{for } k \neq j, \quad \bigcup_{k=1}^{K} \mathcal{D}_k^{(\pi)} = \mathcal{D}^{(\pi)}.\]
For each fold \(k \in \{1, \dots, K\}\) and specification \(\pi\), we define a training dataset \(\mathcal{D}_{\text{train}}^{(k, \pi)} = \mathcal{D}^{(\pi)} \setminus \mathcal{D}_k^{(\pi)}\) and a validation set \(\mathcal{D}_{\text{val}}^{(k, \pi)} = \mathcal{D}_k^{(\pi)}\). The model is trained on \(\mathcal{D}_{\text{train}}^{(k, \pi)}\) to obtain the learned parameters \(\theta^{(k, \pi)}\), and then evaluated on \(\mathcal{D}_{\text{val}}^{(k, \pi)}\). Users then specify – or are prompted to select – their preferred out-of-sample metric (‘oos_metric’) in the .fit() method. For both OLSRobust and LRobust, we calculate the Root Mean Square Error (selected by choosing oos_metric=‘rmse’) for fold \(k\) and specification \(\pi\), given by:
\[\text{RMSE}_{k}^{(\pi)} = \sqrt{\frac{1}{|\mathcal{D}_k^{(\pi)}|} \sum_{i \in \mathcal{D}_k^{(\pi)}} \left( y_i - f(\overleftrightarrow{\mathbf{X}}_i^{(\pi)},\overleftrightarrow{\mathbf{Z}}_i^{(\pi)}; \theta^{(k, \pi)}) \right)^2 }.\]
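The cross-validated RMSE above can be illustrated with scikit-learn's KFold and a plain OLS fit; this is a stand-alone sketch of the procedure rather than RobustiPy's own code.

```python
# Minimal sketch of the K-fold out-of-sample RMSE for one specification.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def cv_rmse(X, y, k=10, seed=192735):
    """Average root-mean-squared error over K validation folds."""
    rmses = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        resid = y[val_idx] - model.predict(X[val_idx])
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(rmses))
```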
We also allow users to evaluate their out-of-sample model performance with the pseudo-\(R^2\) metric as popularised by [73], selected through oos_metric=‘pseudo-r2’. This is otherwise known as the Brier Skill Score when evaluating probabilities – as in the case of LRobust – and was introduced by [74]. For fold \(k\) and specification \(\pi\):
\[\textrm{Pseudo-R}^{2 (\pi)}_{\space k} = 1 - \frac{\sum_{i \in \mathcal{D}_k^{(\pi)}} \left( y_i - f(\overleftrightarrow{\mathbf{X}}_i^{(\pi)},\overleftrightarrow{\mathbf{Z}}_i^{(\pi)}; \theta^{(k, \pi)}) \right)^2 }{\sum_{i \in \mathcal{D}_k^{(\pi)}} \left( y_i - \bar{y}^{(\pi)} \right)^2 },\]
where \(\bar{y}^{(\pi)}\) is the mean of \(y_i\) in the training set of each fold. In the case of LRobust, when calculating an out-of-sample R\(^2\), we also allow the user to evaluate their model with McFadden’s R\(^2\) metric (oos_metric=‘mcfaddens-r2’), which instead uses the mean of the training dataset in the calculation of the denominator of Eq. 5.
We additionally include the Cross Entropy (also referred to as Log Loss) as an out-of-sample metric for evaluating probabilistic predictions in binary classification models (oos_metric=‘cross-entropy’). This metric penalises confident but incorrect predictions more heavily, making it particularly useful for assessing well-calibrated models. For fold \(k\) and specification \(\pi\), Cross Entropy is defined as:
\[\text{Cross Entropy}_{k}^{(\pi)} = -\frac{1}{|\mathcal{D}_k^{(\pi)}|} \sum_{i \in \mathcal{D}_k^{(\pi)}} \left[ y_i \log(f(\overleftrightarrow{\mathbf{X}}_i^{(\pi)},\overleftrightarrow{\mathbf{Z}}_i^{(\pi)}; \theta^{(k, \pi)}) ) + (1 - y_i) \log (1 - f(\overleftrightarrow{\mathbf{X}}_i^{(\pi)},\overleftrightarrow{\mathbf{Z}}_i^{(\pi)}; \theta^{(k, \pi)}) ) \right].\]
In addition, we include the InterModel Vigorish (IMV) – recently introduced by [75], [76], and [77] – for binary classifications (oos_metric=‘imv’). For fold \(k\) and specification \(\pi\), it is defined as:
\[\text{IMV}_{k}^{(\pi)} = \frac{w_{\text{enhanced}}^{(k,\pi)} - w_{\text{null}}^{(k,\pi)}}{w_{\text{null}}^{(k,\pi)}},\] where \(w_{\text{null}}^{(k,\pi)}\) and \(w_{\text{enhanced}}^{(k,\pi)}\) are entropic representations of model fit. Each \(w\) is the value minimising the difference between \(w \log w + (1 - w) \log (1 - w)\) and \(\frac{1}{|\mathcal{D}_k^{(\pi)}|}\sum_{i\in \mathcal{D}_k^{(\pi)}}y_i\log \hat{y}_i+(1-y_i)\log (1-\hat{y}_i)\). Here, \(\hat{y}_i\) is the predicted value from the model fitted under choice \(\pi\) when computing \(w_{\text{enhanced}}^{(k,\pi)}\), and is replaced by \(\bar{y}\) in the validation set \(\mathcal{D}_k^{(\pi)}\) when computing \(w_{\text{null}}^{(k,\pi)}\). These metrics are then averaged over the corresponding \(K\) folds, as \(\text{RMSE}_{\text{CV}, \pi} = \frac{1}{K} \sum_{k=1}^{K} \text{RMSE}_{k,\pi}\), \(\text{Pseudo-R}^{2}_{\text{CV},\pi} = \frac{1}{K}\sum_{k=1}^{K}\text{Pseudo-}R^{2}_{k,\pi}\), \(\text{McFadden's-R}^{2}_{\text{CV},\pi} = \frac{1}{K}\sum_{k=1}^{K}\text{McFaddens-R}^{2}_{k,\pi}\), \(\text{Cross Entropy}_{\text{CV}, \pi} = \frac{1}{K}\sum_{k=1}^{K} \text{Cross Entropy}_{k,\pi}\) and \(\text{IMV}_{\text{CV},\pi} \;=\; \frac{1}{K}\sum_{k=1}^{K}\text{IMV}_{k,\pi}\).
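The IMV calculation can be sketched as a small root-finding problem: for each model, find the \(w\) whose Bernoulli (negative) entropy matches the model's mean log-likelihood, then compare the enhanced model against the prevalence-only null. This is an illustrative implementation of the definition above, assuming the model predicts at least as well as its null.

```python
# Illustrative sketch of the IMV; assumes mean log-likelihoods lie above -log(2)
# so that the bracket for the root-finder is valid.
import numpy as np
from scipy.optimize import brentq

def w_from_loglik(mean_ll):
    """Solve w*log(w) + (1-w)*log(1-w) = mean_ll for w in (0.5, 1)."""
    f = lambda w: w * np.log(w) + (1 - w) * np.log(1 - w) - mean_ll
    return brentq(f, 0.5 + 1e-9, 1 - 1e-9)

def imv(y_true, p_model):
    """InterModel Vigorish of model probabilities versus the prevalence baseline."""
    eps = 1e-12
    y_true = np.asarray(y_true, dtype=float)
    p_model = np.clip(np.asarray(p_model, dtype=float), eps, 1 - eps)
    p_null = np.clip(np.full_like(p_model, y_true.mean()), eps, 1 - eps)
    ll = lambda p: np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    w_enh, w_null = w_from_loglik(ll(p_model)), w_from_loglik(ll(p_null))
    return (w_enh - w_null) / w_null
```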
To compare multiple specifications, we compute, store, and make readily available the metrics for each \(\pi\) and analyse their distributions across arrays stored as follows: \[\mathcal{S}_{\text{RMSE}} = [ \text{RMSE}_{\text{CV}, \pi} ]_{\pi=1}^{|\Pi|},\] \[\mathcal{S}_{\text{Pseudo-R}^2} = [ \text{Pseudo-R}^2_{\text{CV}, \pi} ]_{\pi=1}^{|\Pi|},\] \[\mathcal{S}_{\text{McFaddens-R}^2} = [\text{McFaddens-R}^2_{\text{CV}, \pi} ]_{\pi=1}^{|\Pi|},\] \[\mathcal{S}_{\text{Cross Entropy}} = [\text{Cross Entropy}_{\text{CV}, \pi}]_{\pi=1}^{|\Pi|},\] \[\mathcal{S}_{\text{IMV}} = [ \text{IMV}_{\text{CV}, \pi} ]_{\pi=1}^{|\Pi|},\]
where \(|\Pi|\) is the total number of models estimated.
RobustiPy undertakes an array of computations for analysing each model in the choice space (\(\pi \in \Pi\)).
To quantify uncertainty in the regression estimates in the specification/choice curve, RobustiPy performs a bootstrap procedure to estimate the confidence interval of the effect between \(\overleftrightarrow{\mathbf{Y}}\) and the key estimand of interest, \(\mathbf{X}_{1}\). For each choice \(\pi\in \Pi\):
Fit the model to observed data \[\hat{\omega}_{\pi} = F(\overleftrightarrow{\mathbf{Y}}_{\pi}, \overleftrightarrow{\mathbf{X}}_{\pi}, \overleftrightarrow{\mathbf{Z}_{\pi}})\] where \(\hat{\omega}_{\pi}\) is typically the coefficient of the predictor of interest \(\mathbf{X}_{1,\pi}\)
Draw \(B\) bootstrap replicates \(b\) from the dataset \(\mathcal{D}^{(\pi)}\), each of size \(N^{(\pi)}\).
Re-estimate the model in each replicate \(b\) \[\hat{\omega}_{\pi}^{(b)} = F(\overleftrightarrow{\mathbf{Y}}_{\pi}^{(b)}, \overleftrightarrow{\mathbf{X}}_{\pi}^{(b)}, \overleftrightarrow{\mathbf{Z}_{\pi}}^{(b)}) \qquad b = 1,\dots,B.\]
Construct the empirical distribution of coefficients \(\hat{\omega}_{\pi}^{(b)}\) across all bootstrap replicates \[\mathcal{S}_{\pi}^{\text{boot}} = \{ \hat{\omega}_{\pi}^{(b)} \}_{b=1}^{B}\]
The \((1 - \alpha) \times 100\%\) confidence interval for the full set of choices \(\Pi\), computed using the empirical quantiles (with a user-specified confidence level ‘ci’, defaulting to 0.95, where \(\alpha = 1 - \text{ci}\)) is: \[\text{CI}_{\omega} = \left\{\left[ Q_{\alpha/2}(\mathcal{S}_{\pi}^{\text{boot}}), \quad Q_{1-\alpha/2}(\mathcal{S}_{\pi}^{\text{boot}}) \right] \colon \pi \in \Pi \right\}\] where \(Q_{\alpha}\) denotes the empirical quantile function. Values (for the specification curve subfigure) can be optionally LOESS-smoothed for better visualisation, but raw estimates are kept.
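The per-specification bootstrap above reduces to a short loop; the sketch below uses numpy and statsmodels for a single specification \(\pi\), with the focal predictor as the first column of X (illustrative, not RobustiPy internals).

```python
# Sketch of the percentile bootstrap CI for the focal coefficient of one specification.
import numpy as np
import statsmodels.api as sm

def bootstrap_ci(y, X, draws=1000, ci=0.95, seed=192735):
    """Percentile confidence interval for the first regressor after the constant."""
    rng = np.random.default_rng(seed)
    y, X = np.asarray(y, dtype=float), np.asarray(X, dtype=float)
    Xc = sm.add_constant(X)
    boot = np.empty(draws)
    for b in range(draws):
        idx = rng.integers(0, len(y), size=len(y))   # resample rows with replacement
        boot[b] = sm.OLS(y[idx], Xc[idx]).fit().params[1]
    alpha = 1 - ci
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])
```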
RobustiPy performs Bayesian Model Averaging (BMA) for variables specified in \(\overleftrightarrow{\mathbf{Z}}_{\pi}\). The focal predictor(s) in \(\overleftrightarrow{\mathbf{X}}_{\pi}\) are included in every specification, so they are not averaged over. For clarity, the equations below follow one generic control variable (indexed as \(p\)), where the same steps are repeated for every other control;
\[\overleftrightarrow{\mathbf{Y}}_{\pi} = \bigl[ \overleftrightarrow{\mathbf{X}}_{\pi}, \overleftrightarrow{\mathbf{Z}}_{\pi} \bigr] \boldsymbol{\beta}_{\pi} +\boldsymbol{\epsilon}_{\pi}, \qquad \boldsymbol{\epsilon}_{\pi}\sim \mathcal{N}\!\bigl(\mathbf{0},\, \sigma_{\pi}^{2}\mathbf{I}_{N_\pi}\bigr).\]
The \(p^{\text{th}}\) coefficient in \(\boldsymbol{\beta}_{\pi}\) is denoted \[\omega_{p,\pi}\equiv\beta_{p,\pi}.\]
With a uniform prior over models, the posterior weight of model \(M_\pi\) is \[P(M_\pi\mid\mathcal{D}) = \frac{\exp\!\bigl[-\tfrac12\,\text{BIC}_{\pi}\bigr]}{\displaystyle\sum_{\pi'\in\Pi} \exp\!\bigl[-\tfrac12\,\text{BIC}_{\pi'}\bigr]},\] where \[\label{eq:bic_pi} \text{BIC}_{\pi} = P_{\pi}\,\log N_{\pi} \;-\; 2\,\log\hat{L}_{\pi}, \qquad\text{(cf.\;Eq.~\ref{eq:bic})}\tag{7}\]
Let \(\hat{\omega}_{p,\pi}\) be the within-model posterior mean (equal to the OLS/ML estimate under a vague prior). The BMA estimate for index \(p\) is \[\mathbb{E}[\omega_{p}\mid\mathcal{D}] = \sum_{\pi\in\Pi} P(M_\pi\mid\mathcal{D})\, \hat{\omega}_{p,\pi}.\]
Repeating this step for every \(p=1,\dots,P\) yields the full vector \[\widehat{\boldsymbol{\omega}}^{\text{BMA}} = \bigl( \mathbb{E}[\omega_{1}\mid\mathcal{D}], \dots, \mathbb{E}[\omega_{P}\mid\mathcal{D}] \bigr)^{\!\top}.\]
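Operationally, the BIC-based weights and BMA estimate reduce to a softmax over \(-\tfrac12\,\text{BIC}_\pi\); a short sketch follows, where bics and coefs are assumed to be arrays collected across the specifications containing the variable of interest.

```python
# Sketch of BIC-implied posterior model weights and the BMA coefficient estimate.
import numpy as np

def bma_estimate(bics, coefs):
    """BIC-weighted average of a coefficient across specifications (uniform model prior)."""
    bics = np.asarray(bics, dtype=float)
    delta = bics - bics.min()              # shift by the minimum for numerical stability
    weights = np.exp(-0.5 * delta)
    weights /= weights.sum()               # posterior model probabilities
    return float(np.sum(weights * np.asarray(coefs, dtype=float)))
```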
RobustiPy computes SHapley Additive exPlanations (SHAP) for the single ‘full’ specification containing all predictors and controls. After dropping missing values (and optional group demeaning), the data are split into an 80% training set and a 20% held-out test set. We fit ordinary least squares (or a Logistic Regression with an L2 penalty and inverse regularisation strength of 0.1, depending on the class used) on the training data:
\[\widehat{Y}_{i} = [\overleftrightarrow{\mathbf{X}}_{i},\overleftrightarrow{\mathbf{Z}}_{i}]\;\widehat{\boldsymbol{\beta}},\] for \(i=1,\dots,N_{\mathrm{test}}\) where \(\overleftrightarrow{\mathbf{X}}_{i}\) and \(\overleftrightarrow{\mathbf{Z}}_{i}\) are row‐vectors of the test‐set features (including the intercept). Let \[\mu_{p} = \mathbb{E}_{\mathrm{train}}[X_{p}]\] and \[v(\emptyset) = \sum_{p=1}^{P}\widehat\beta_{p}\,\mu_{p}.\] Under linearity, the exact SHAP values are \[\widehat\phi_{p,i} = \widehat\beta_{p}\,\bigl(X_{p,i} - \mu_{p}\bigr),\] for \(p=1,\dots,P\) and \(i=1,\dots,N_{\mathrm{test}}\), which satisfy \[\widehat{Y}_{i} = v(\emptyset) + \sum_{p=1}^{P}\widehat\phi_{p,i}.\]
Rather than aggregating them internally, RobustiPy
simply returns \[\texttt{shap\_return}
=
\Bigl( [\widehat\phi_{p,i}]_{i=1,\dots,N_{\mathrm{test}};\,p=1,\dots,P}, \;\mathbf{X}^{\mathrm{test}}
\Bigr).\] Users may then compute summaries (e.g. mean absolute SHAP values per feature) or visualise these values as desired.
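The closed-form decomposition above can be reproduced in a few lines of NumPy; the sketch below uses simulated data and an explicit intercept column, and is intended only to illustrate the arithmetic, not RobustiPy’s internal routine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated design: an explicit intercept column plus three features.
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=N)

# 80/20 train-test split, then an OLS fit on the training rows.
n_train = int(0.8 * N)
X_tr, X_te, y_tr = X[:n_train], X[n_train:], y[:n_train]
beta_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

# Exact linear SHAP values: phi_{p,i} = beta_p * (x_{p,i} - mean_train(x_p)),
# with the baseline v(empty set) given by the prediction at the training means.
mu = X_tr.mean(axis=0)
phi = beta_hat * (X_te - mu)
baseline = beta_hat @ mu

# Sanity check: baseline plus the row-sum of SHAP values recovers each prediction.
assert np.allclose(baseline + phi.sum(axis=1), X_te @ beta_hat)
print("mean |SHAP| per feature:", np.abs(phi).mean(axis=0))
```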
Beyond per–specification uncertainty, RobustiPy
carries out a curve-level hypothesis test that asks whether the pattern of coefficients across the whole choice space (\(\Pi\)) could plausibly arise
under the sharp null \(H_{0}\!:\beta_{1}=0\). The procedure mirrors the bootstrap introduced earlier, but aggregates after each coefficient is estimated.
Full-sample fits: Fit every specification, once on its observed data \(\mathcal{D}^{(\pi)}\) and store the coefficient and p-value (\(\hat{\omega}_{\pi}\) and \(p_{\pi}\) for \(\pi\in\Pi\)).
Outcome bootstrap: Draw \(B\) i.i.d. resamples of rows—the same row indices for every specification—and re-estimate \(\hat{\omega}_{\pi}^{(b)}\) for \(\pi\in\Pi\), and for \(b=1,\dots, B\).
Null bootstrap: Force the null via \(Y^{\star}_{\pi}=Y_{\pi}-\mathbf{X}_{1,\pi}\hat{\omega}_{\pi}\) and repeat the shared-row bootstrap to obtain \(\hat{\omega}_{\pi}^{\star(b)}\) (again for \(\pi\in\Pi\) and \(b=1,\dots,B\)).
Curve-level summary functionals: For a collection \(A=\{a_{\pi}\}_{\pi\in\Pi}\) define \[\begin{array}{ll} S_{1}(A) &= \operatorname{median}_{\pi} a_{\pi} \\[4pt] S_{2}(A) &= \displaystyle\min_{\pi} a_{\pi} \\[4pt] S_{3}(A) &= \displaystyle\max_{\pi} a_{\pi} \\[4pt] S_{4}(A) &= \displaystyle\sum_{\pi}\mathbb{1}\{a_{\pi}>0\} \\[4pt] S_{5}(A) &= \displaystyle\sum_{\pi}\mathbb{1}\{a_{\pi}<0\} \\[4pt] S_{6}(A,p) &= \displaystyle\sum_{\pi}\mathbb{1}\{p_{\pi}<0.05\} \\[4pt] S_{7}(A,p) &= \displaystyle\sum_{\pi}\mathbb{1}\{a_{\pi}>0\}\, \mathbb{1}\{p_{\pi}<0.05\} \\[4pt] S_{8}(A,p) &= \displaystyle\sum_{\pi}\mathbb{1}\{a_{\pi}<0\}\, \mathbb{1}\{p_{\pi}<0.05\} \end{array}\] and evaluate each \(S_{k}\) on the full-sample vector \(\hat{\omega}_{\pi}\) and on every bootstrap replicate \(\hat{\omega}_{\pi}^{(b)}\).
Null \(p\)-values and model-class caveat: Compare each observed statistic with its null distribution; for the median, for example, \[p_{\text{med}} \;=\; \frac{1}{B}\sum_{b=1}^{B} \mathbb{1}\!\Bigl\{S_{1}\!\bigl(\hat{\omega}_{\pi}^{\star(b)}\bigr) > S_{1}\!\bigl(\hat{\omega}_{\pi}\bigr)\Bigr\},\] with the analogous comparison made for each of the summary functionals \(S_{k}\), \(k\in\{1,\dots,8\}\); a minimal sketch of these curve-level computations follows this list.
These null \(p\)-values are valid under the ideal i.i.d. setting assumed by ordinary least squares. For clustered fixed-effects or logistic specifications, the shared-row bootstrap may under- or over-estimate tail probabilities, because fixed-effects models violate the i.i.d. assumption across clusters and logit residuals are heteroskedastic and bounded. We therefore still report these \(p\)-values for completeness, but users should treat them as descriptive diagnostics rather than strict hypothesis tests in those two model classes.
Meta-analytic Stouffer test: Calculate the combined \(Z\) score and its corresponding \(p\)-value: \[Z =\frac{1}{\sqrt{|\Pi|}} \sum_{\pi\in\Pi}\Phi^{-1}\!\bigl(1-p_{\pi}\bigr)\] and \[p_{\text{Stouffer}}=\Phi(-Z).\]
Outputs stored: All statistics—point estimates, counts, null p-values and \((Z,p_{\text{Stouffer}})\)—are saved to OLSResult.inference
for use in .summary()
and
visualisations.
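For intuition, the following sketch illustrates the curve-level comparison for the median functional and the Stouffer combination; the arrays of observed coefficients, p-values, and null-bootstrap coefficients are simulated placeholders rather than RobustiPy outputs.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Placeholder inputs: observed coefficients and p-values across |Pi| specifications,
# plus coefficients re-estimated on B bootstrap resamples drawn under the null.
n_specs, B = 64, 500
obs_coefs = rng.normal(0.2, 0.05, size=n_specs)        # hat(omega)_pi
obs_pvals = rng.uniform(0.001, 0.2, size=n_specs)      # p_pi
null_coefs = rng.normal(0.0, 0.05, size=(B, n_specs))  # hat(omega)*_pi^(b)

# Curve-level functional S_1 (the median), evaluated on the observed curve and on
# every null replicate; the empirical p-value is the share of more extreme draws.
s_obs = np.median(obs_coefs)
s_null = np.median(null_coefs, axis=1)
p_median = np.mean(s_null > s_obs)

# Stouffer's meta-analytic combination of the per-specification p-values.
z = norm.ppf(1 - obs_pvals).sum() / np.sqrt(n_specs)
p_stouffer = norm.cdf(-z)

print(f"null p-value for the median: {p_median:.3f}")
print(f"Stouffer Z = {z:.2f}, p = {p_stouffer:.4g}")
```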
When the number of admissible outcome composites \(\Pi_{\overleftrightarrow{\mathbf{Y}}}\) or covariate sets \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\) is very large, estimating
every model in \(\Pi=\Pi_{\overleftrightarrow{\mathbf{Y}}}\!\times\! \Pi_{\overleftrightarrow{\mathbf{X}}}\!\times\! \Pi_{\overleftrightarrow{\mathbf{Z}}}\!\times\! \Pi_{\overleftrightarrow{F}}\) can be
computationally prohibitive. RobustiPy
therefore lets the user draw a random subset of specifications before any fitting begins.
Determine target sizes. Choose integers \(m_{Y}\le |\Pi_{\overleftrightarrow{\mathbf{Y}}}|\) and \(m_{Z}\le |\Pi_{\overleftrightarrow{\mathbf{Z}}}|\), the numbers of outcome and covariate options you wish to keep (the defaults are the full sizes).
Uniform sub-sampling. Independently sample, without replacement, \[\tilde{\Pi}_{\overleftrightarrow{\mathbf{Y}}} \subset \Pi_{\overleftrightarrow{\mathbf{Y}}}, \qquad |\tilde{\Pi}_{\overleftrightarrow{\mathbf{Y}}}|=m_{Y},\] \[\tilde{\Pi}_{\overleftrightarrow{\mathbf{Z}}} \subset \Pi_{\overleftrightarrow{\mathbf{Z}}}, \qquad |\tilde{\Pi}_{\overleftrightarrow{\mathbf{Z}}}|=m_{Z},\] using NumPy’s fast random sampler.
Construct the reduced model space. The analysis then proceeds on \[\tilde{\Pi} \;=\; \tilde{\Pi}_{\overleftrightarrow{\mathbf{Y}}} \times \Pi_{\overleftrightarrow{\mathbf{X}}} \times \tilde{\Pi}_{\overleftrightarrow{\mathbf{Z}}} \times \Pi_{\overleftrightarrow{F}},\] which typically cuts runtime by orders of magnitude while still giving a representative picture of the full specification curve.
The random seeds are user-controllable, ensuring that sub-samples—and therefore results—are fully replicable.
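A minimal sketch of this sub-sampling step, using a seeded NumPy generator and illustrative choice sets, is given below; the option lists and target sizes are placeholders, and the fixed-effects dimension is omitted for brevity.

```python
import numpy as np
from itertools import combinations, product

rng = np.random.default_rng(42)  # a user-controllable seed keeps the draw replicable

# Illustrative choice sets: all non-empty outcome composites and all control subsets.
y_items = ["y1", "y2", "y3"]
z_items = ["z1", "z2", "z3", "z4", "z5"]
y_options = [c for r in range(1, len(y_items) + 1) for c in combinations(y_items, r)]
z_options = [c for r in range(len(z_items) + 1) for c in combinations(z_items, r)]

# Target sizes m_Y and m_Z (the defaults would simply be the full sizes).
m_y, m_z = 5, 10

# Uniform sub-sampling without replacement along each choice dimension.
y_sub = [y_options[i] for i in rng.choice(len(y_options), size=m_y, replace=False)]
z_sub = [z_options[i] for i in rng.choice(len(z_options), size=m_z, replace=False)]

# Reduced model space: the Cartesian product of the sub-sampled dimensions (the
# focal predictors and any fixed-effects options would be crossed in as well).
reduced_space = list(product(y_sub, z_sub))
print(f"{len(reduced_space)} specifications instead of "
      f"{len(y_options) * len(z_options)}")
```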
For correspondence: Daniel Valdenegro and Charles Rahal, Leverhulme Centre for Demographic Science and the Centre for Care, Oxford Population Health, University of Oxford, OX1 1JD, United Kingdom. Tel: 01865 286170. Email: daniel.valdenegro@demography.ox.ac.uk and charles.rahal@demography.ox.ac.uk. Code Availability Statement: The extensive code library which accompanies this work can be found at github.com/RobustiPy and is indexed on Zenodo as ‘Online Supplementary Material’, available as [1]. Documentation is available at https://robustipy.readthedocs.io/. Data Availability Statement: All but two of our empirical examples utilise publicly available information; the remaining two require an account with the UK Data Service. The authors declare no conflicts of interest and are grateful for comments on earlier versions of the work from participants at the International Conference on Computational Social Science, the British Society for Population Studies, the European Consortium for Sociological Research, and the Population Association of America, as well as for internal feedback received from the Leverhulme Centre for Demographic Science, Reproducible Research Oxford, and the Pandemic Sciences Institute. We also thank all attendees at a previous hackathon hosted at the University of Oxford in June 2024. The authors are grateful for funding from the ESRC (an ESRC Large Centre grant), the Leverhulme Trust (Grant RC-2018-003) for the Leverhulme Centre for Demographic Science, and a Grand Union DTP ESRC studentship. All errors remain our own.
This space is usually as sparse as a single ‘baseline’ model and a few scant robustness checks – which coincidentally always seem to concur with the results from the baseline – at the behest of peer reviewers; see [42] for a quantification of this at two leading Sociology journals.
We call this the ‘Vanilla’ computation in Sections 1.3.1 and 2.1.
RobustiPy explicitly asks the user how many of their machine’s CPU cores they want to use for the analysis (the ‘n_cpu’ option to the .fit() command, see below). If the number of cores is not specified, the user is first asked whether one fewer than the maximum number of available cores is permissible, and, if not, how many cores they want to use.
By default, for the case of a single dependent variable, the ‘full’ model (the one containing all \(\overleftrightarrow{\mathbf{Z}}\) covariates) and the ‘null’ model (which contains only the smallest guaranteed model space) are always highlighted in the relevant visual analysis. When more than one dependent variable is specified – see Section 1.3.5 – RobustiPy highlights the model containing only the first dependent variable in the list and the model incorporating all dependent and independent variables.
All of these options can be varied, along with additional ones; see Section 4 and the accompanying RobustiPy documentation for further details.
If no number of draws, number of folds, seed, out-of-sample metric, or number of CPUs is specified, the user is prompted to enter valid options via a system input.
We are constantly looking for suggestions regarding new features to incorporate. Please contact the corresponding authors to discuss!