Introducing RobustiPy: An efficient next generation multiversal library with model selection, averaging, resampling, and explainable artificial intelligence
June 24, 2025
We present RobustiPy, a next generation Python-based framework for model-uncertainty quantification and multiverse analysis, released under the GNU GPL v3.0. Through the integration of efficient bootstrap-based confidence intervals, combinatorial exploration of dependent-variable specifications, model selection and averaging, and two complementary joint-inference routines, RobustiPy transcends existing uncertainty-quantification tools. Its design further supports rigorous out-of-sample evaluation and apportions the predictive contribution of each covariate. We deploy the library across five carefully constructed simulations and ten empirically grounded case studies drawn from high-impact literature and teaching examples, including a novel re-analysis of ‘unexplained discrepancies’ in famous prior work. To illustrate its performance, we time-profile RobustiPy over roughly 672 million simulated linear regressions. These applications showcase how RobustiPy not only accelerates robust inference but also deepens our interpretive insight into model sensitivity across the vast analytical multiverse within which scientists operate.
Keywords: Model Uncertainty, Statistical Software, Open Science
“Analysts can test thousands of plausible models, trying out different model assumptions. Consumers are still constrained to know only the handful of model specifications that are curated for publication. Technology created this problem, and technology will have to be a part of the solution.”
– [2]
The act of constructing quantitative models to examine scientific phenomena represents the cornerstone of most intellectual processes [3]. Yet, the scientific community – particularly within the health and social sciences [4] – has long recognised that the process of testing hypotheses using such models is susceptible to substantial variation in outcomes, even under relatively conservative assumptions [5]. While Leamer’s infamous “Let’s Take the Con Out of Econometrics” article [6] is often credited with dramatically drawing attention to pervasive risks in applied work, the concept and even formal tools for handling model uncertainty date back decades earlier [7], [8]. Numerous factors influence the model-building process, ranging from uncontrollable elements to more structured components like the conditions of data collection; the choices which responsible researchers make over this space are otherwise known as the ‘garden of forking paths’ [9]. The range of possible outcomes in scientific modelling may, in fact, mirror the vast variation inherent in reality itself, all when focusing on just one estimand of interest [10]. This is at a time when it is increasingly recognised that complex computational routines can themselves be utilised for theory-building [11]. We create a tractable solution which takes steps to reduce this consequential source of variation: the analytical choices made by researchers. These choices can – in principle – be systematically examined and controlled. Developed in Python, our readily available, continuously maintained, and highly accessible tool aims to bring transparency to research practice, representing what we believe is the next generation of model uncertainty software. RobustiPy is also agnostic to philosophies regarding whether models in a multiverse should be weighted [12], [13]: it plots the full model space, metrics of fit, the specifications which minimise information criteria, and both weighted and unweighted estimates of the estimand. We document the five different types of analysis available through simulated analyses, conduct a time-profiling exercise, and apply RobustiPy to ten high-profile empirical examples across the research and teaching-related landscape. These empirical examples span the economic, social, psychological and medical sciences, and beyond. All of our simulated and empirical examples are made available as Python scripts and interactive notebooks within the first release of our library, which has been indexed into Zenodo [1].
Analytical decisions – in comparison to more ethereal, uncontrollable concerns – are not only more amenable to scrutiny, but also potentially more influential for the scientific enterprise. While other sources of variation (such as data collection) may be expected to cancel out with replication, research decisions regarding the modelling of scientific processes are often inherently non-random and systematically biasing. They can introduce systematic biases that remain hidden unless explicitly interrogated. This typically results in an asymmetry of information [2]: current common practice involves researchers varying their decisions, selectively reporting what they choose. Although exploring alternative specifications is not inherently problematic, the persistent lack of transparency in these practices enables questionable research behaviours, such as p-hacking; the process of iteratively modifying models until statistically significant results are obtained [14], [15]. A related issue is ‘HARKing’ (Hypothesising After the Results are Known, see for example [16]), in which researchers test multiple hypotheses and present significant findings as though they were specified in advance. Beyond concerns about scientific integrity, such selective reporting contributes to poor reproducibility and a misleading sense of empirical consensus, ultimately undermining confidence in research and the scientific endeavour more broadly [17]. A range of solutions have been proposed to address these challenges. These include greater transparency through preregistration [18], a stronger emphasis on replication [19], improved statistical power [14], [20], the adoption of Bayesian inference [21], and a shift towards predictive modelling as an evaluation tool [22], [23]. One particularly promising approach to improving transparency is the systematic reporting of all defensible combinations of analytical decisions made by a researcher. This strategy – independently developed as specification curve analysis [24] and multiverse analysis [25] – explicitly counters selective reporting; the former by looking at the specification based choices which researchers can make, and the latter by analysing a broader choice space. Rather than presenting a single preferred model, it aims to identify and estimate all plausible analysis branches, presenting the full distribution of permissible results in a single, usually visual output. This is what we build upon.
Despite their conceptual appeal, multiverse and specification-curve analyses are computationally demanding. Among the many degrees of freedom available to researchers, covariate selection alone can result in an exponential number of model specifications. For example, a dataset with ten candidate covariates produces \(2^{10} = 1,024\) possible models; twenty covariates yield over a million. Historically, researchers have tackled this complexity using bespoke computational routines to estimate and summarise the results of thousands [6], millions [26], billions [27], or even trillions [28] of regression models. The expansion of empirical research has brought increased attention to the role of ‘researcher degrees of freedom’ – the wide range of defensible yet subjective decisions made during data collection, model specification and result reporting [29] – all of which undermine the validity of empirical findings and contribute to broader issues such as the replication crisis. These problems are pervasive across almost all quantitative academic disciplines, especially psychology [30], the social sciences [31], and medical research [32]. However, custom implementations are often inefficient and insufficiently feature-rich, despite the ever-increasing computational power available to researchers in the limit of human progress [33], [34]. To overcome these limitations, we introduce RobustiPy, which streamlines model selection, averaging, out-of-sample validation and feature contribution while performing statistical inference testing and evaluating fit across the entire space of feasible specifications. By offering a standardised, efficient, and unit-tested solution, RobustiPy enhances computational reproducibility and mitigates the risks associated with costly and limited ad-hoc implementations of multiverse and specification curve analysis.
Several computational tools have been developed to facilitate the implementation of specification curve analysis – a way to visualise all specifications of a model space – and multiverse analysis – the systematic estimation of every defensible analytic pathway – in applied research. In the R programming ecosystem, the multiverse package by [35] introduces a declarative syntax for defining multiple decision points within a single analysis script. Users can specify branches corresponding to alternative models and specifications, which are then executed across the entire, permissible model space; the package integrates with R Markdown and supports structured workflows for running and reporting multiverse analyses. Specifically regarding specification curve analysis – as opposed to multiversal analysis – and likely most formally introduced by [36], the specr package [37] offers a comprehensive set of tools to define, estimate, and visualise a range of model specifications. It is particularly well suited for observational studies, enabling researchers to explore variations in outcome variables, independent variables, and covariate adjustments. Results are presented through specification curves that summarise the distribution of estimates and confidence intervals across the analytical space. [38] advocate a Bayesian approach, with code available on the Open Science Framework [39]. Similarly, but with more limited functionality, spec_curve allows users to create a specification chart using base R [40]. In Stata, there exists the MULTIVRS module to conduct multiverse analysis [41], and MROBUST, a module to estimate model robustness and model influence [42].
In contrast, the development of comparable toolkits in the Python ecosystem has been more limited, despite the fact that Python has become the world’s foremost language for highly performant data science research [43]. The spec_curve package by [44] provides foundational functionality for conducting specification curve analyses. While it offers a useful entry point, it lacks the breadth and extensibility of its R-based counterparts. The Boba domain-specific language (DSL, see [45]) provides a programming-language-agnostic tool and visualisation system for multiverse analysis. It automatically expands and executes all valid combinations of user decision points. The results are then presented through an interactive, linked-view interface. While Boba supports execution in both R and Python environments, its use of a custom DSL makes it difficult to integrate seamlessly into standard analysis pipelines. To address this gap more broadly, we introduce RobustiPy. RobustiPy extends beyond existing tools by incorporating functionality for model selection, bootstrapping, joint inference, out-of-sample evaluation, and the systematic exploration of multiple estimands and outcome spaces. In doing so, it enables a more rigorous and transparent approach to robustness analysis. A summary of the main packages in this space is described in Table 1.
| Feature / Toolkit | multiverse (R) | specr (R) | MULTIVRS (Stata) | spec_curve (Python) | RobustiPy (Python) |
|---|---|---|---|---|---|
| Multiverse analysis | ✔ | – | ✔ | – | ✔ (limited) |
| Specification curve analysis | – | ✔ | – | ✔ | ✔ |
| High-abstraction syntax | ✔ | (partial) | (limited) | – | ✔ |
| Visualisation of results | ✔ | ✔ (built-in) | ✔ (kdensity) | ✔ (basic) | ✔ (custom) |
| Joint inference | – | ✔ | – | ✔ | ✔ |
| Influence analysis | – | – | ✔ | – | ✔ |
| Bootstrapping | – | ✔ | – | ✔ | ✔ |
| Out-of-sample | – | – | – | – | ✔ |
| Multiple estimands | – | ✔ (limited) | – | ✔ (partial) | ✔ |
In order to describe the problem which RobustiPy addresses, we formalise it below. Consider the key association between variables \(\mathbf{Y}\) and \(\mathbf{X}\), where a set of covariates \(\mathbf{Z}\) can influence the relationship between the former as follows:
\[\label{objective_func} \mathbf{Y} = F(\mathbf{X}, \mathbf{Z}) + \boldsymbol{\epsilon} .\tag{1}\]
Assume \(\mathbf{Y}\) is generated by an unknown data generating function \(F()\) which takes the values of \(\mathbf{X}\) conditioned by the values of the set of covariates \(\mathbf{Z}\), plus an independent and identically distributed random error term (\(\boldsymbol{\epsilon}\)). In this conceptualisation of the problem, Eq. 1 represents the real data generation process that researchers approximate, but its perfect reproduction is (almost always) unattainable. Following the example of [36], \(\mathbf{Y}\) and \(\mathbf{X}\) are usually imprecisely defined latent variables (for which we cannot have a direct measure). Likewise, the set of covariates \(\mathbf{Z}\) can also be composed of imprecisely defined variables, the number of which can be unknown and/or non-finite. Under these conditions, it is evident that researchers must make concessions, usually in the form of finite sets of ‘reasonable’ operationalisations of \(\mathbf{Y}\), \(\mathbf{X}\), \(\mathbf{Z}\) and \(F()\). The way ‘reasonable’ is defined is in itself a point of contention; [36] somewhat canonically defines this as a set where each element is theoretically justified, statistically valid, and non-redundant. Let’s define the set of ‘reasonable’ operationalisations of \(\mathbf{Y}\), \(\mathbf{X}\), \(\mathbf{Z}\) and \(F()\) as \(\overleftrightarrow{\mathbf{Y}}\), \(\overleftrightarrow{\mathbf{X}}\), \(\overleftrightarrow{\mathbf{Z}}\) and \(\overleftrightarrow{F()}\). Then, we have:
\[\label{spec_func} \overleftrightarrow{\mathbf{Y}}_{\pi} = \overleftrightarrow{F}_{\pi}(\overleftrightarrow{\mathbf{X}}_{\pi}, \overleftrightarrow{\mathbf{Z}}_{\pi}) + \boldsymbol{\epsilon}.\tag{2}\]
Eq. 2 corresponds to a single possible choice (\(\pi\)) of all the possible choices (\(\Pi\), such that \(\pi \in \Pi\)) which we can generate using our current formalisation. The total number of ‘researcher degrees of freedom’ [46] can then be calculated as \(2^D\) where \(D\) equals \(d_{\overleftrightarrow{\mathbf{Y}}} + d_{\overleftrightarrow{\mathbf{X}}} + d_{\overleftrightarrow{\mathbf{Z}}} + d_{\overleftrightarrow{F()}}\) and \(d\) is the length of the vectors containing the number of choices for each parameter in Eq. 2 . For example, an analysis with two functional forms, two target variables, two predictors and four covariates will yield a space of \(\Pi\) (\(2^{10}\) choices). We split the computation based on the operationalised variables in Eq. 2 by creating a ‘reasonable specification space’ for each: \(\Pi_{\overleftrightarrow{\mathbf{Y}}}\), \(\Pi_{\overleftrightarrow{F()}}\), \(\Pi_{\overleftrightarrow{\mathbf{X}}}\), \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\). The entire model space can be expressed as:
\[\Pi = \Pi_{\overleftrightarrow{\mathbf{Y}}} \times \Pi_{\overleftrightarrow{F()}} \times \Pi_{\overleftrightarrow{\mathbf{X}}} \times \Pi_{\overleftrightarrow{\mathbf{Z}}}\]
Given that \(\Pi\) represents a collection of multiple possible versions of the true relationship between \(\mathbf{Y}\) and \(\mathbf{X}\), an increasingly large space of random possible choices (\(D\)) will in principle always lead to a more reasonable approximation of \(\mathbf{Y} = F(\mathbf{X},\mathbf{Z})\). The main problem with current research practices is that researchers usually choose a small and non-random sample of \({\Pi}\) to represent the variety of results in the full space of possible choices [36].2
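To make the combinatorics of \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\) concrete, the following minimal Python sketch (purely illustrative, not RobustiPy internals) enumerates every subset of ten candidate covariates, recovering the \(2^{10} = 1,024\) specifications noted above.

```python
# Illustrative only: how the covariate choice space grows combinatorially.
from itertools import chain, combinations

z_candidates = [f"z_{i}" for i in range(1, 11)]  # ten candidate covariates

def all_subsets(items):
    """Every subset of `items`, including the empty set."""
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

pi_z = list(all_subsets(z_candidates))
print(len(pi_z))  # 2**10 = 1024 covariate specifications
```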
To formalise the intuition that larger model spaces can yield better approximations, consider a growing sequence of specification sets \(\{\Pi_D\}_{D=1}^\infty\) with \(|\Pi_D| = 2^D\) and \(\Pi_D \subset \Pi_{D+1}\). Let \(\widehat{Y}_\pi = \widehat{F}_\pi(X,Z)\) denote the fitted response from model \(\pi \in \Pi_D\). Suppose that the true conditional expectation \(Y = F^\star(X,Z) + \varepsilon\) lies within the closure of the union \(\bigcup_D \Pi_D\), and each \(\pi\) in \(\Pi_D\) is explored with probability bounded below by \(c > 0\) as \(D \to \infty\) (e.g., uniform or random subset sampling). Then for any \(\varepsilon > 0\), the probability that the best model in \(\Pi_D\) achieves mean-squared prediction error within \(\varepsilon\) of the truth converges to 1: \[\label{eq:oracle-approx} \lim_{D \to \infty} \Pr\!\left( \min_{\pi \in \Pi_D} \bigl\| \mathbb{E}\bigl[\widehat{Y}_\pi\bigr] - F^\star(X,Z) \bigr\|_2 < \varepsilon \right) = 1,\tag{3}\] where \(\|\cdot\|_2\) denotes the Euclidean norm. This is not a convergence claim for a randomly chosen specification, but rather a statement that with enough diversity in \(\Pi_D\), some model in the set will approximate the true data-generating process arbitrarily well. The result depends on sufficient coverage of the model space and identifiability, and does not hold under adversarial or highly correlated specifications.
RobustiPy enables an approximation of the real data generation process (Eq. 1) by generating and organising specification spaces (\(\Pi_{\overleftrightarrow{\mathbf{Y}}}\), \(\Pi_{\overleftrightarrow{\mathbf{X}}}\), \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\)). In particular, RobustiPy allows a full examination across the entirety of \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\) assuming a single \(\overleftrightarrow{\mathbf{Y}}\), a fixed set of \(\overleftrightarrow{\mathbf{X}}\) predictors and a fixed set of \(\overleftrightarrow{F()}\) estimators.3 Additional functionality allows users to generate and compute \(\Pi_{\overleftrightarrow{\mathbf{Y}}}\), assuming multiple valid dependent measurements. Let’s review the most common use cases – including two stand-alone Code Blocks – where our online repository hosts five simulations pertaining to each type of analysis and Section 2 discusses ten highly varied empirical applications of each type. Section 3 discusses these results, and Section 4 contains a full delineation of every metric and method which is undertaken and available within RobustiPy.
Our first simulated example (‘Code Block 1: Vanilla computation’) assumes a linear data generating process. Here we fit an OLS estimator with 10-fold cross-validation and 1,000 bootstrap resamples, and then we plot results and print a large number of summary outputs. The data generation process described in Code Block 1 is analogous to a typical modelling problem whereby a researcher seeks to evaluate the relationship between a target variable of interest with only one valid operationalisation, and a predictor with only one possible operationalisation which is conditional on a ‘reasonable’ set of covariates (and, by extension, a researcher’s idea about the data generating process, be it theoretical or otherwise). In this situation (i.e. with no variation in the outcome space, the variable of interest space, or the functional form), \(\Pi_{\overleftrightarrow{\mathbf{Z}}} \equiv \Pi\). For the purposes of delineation – as in Code Block 1 – the OLSRobust class allows us to declare an object (vanilla_obj) containing the target variable (\(Y_{1}\)), the main predictor as part of an array (\(\overleftrightarrow{\mathbf{X}}\)), and the source of data. The .fit() method allows us to declare the array of possible covariates (\(\overleftrightarrow{\mathbf{Z}}\)) – which a researcher may be agnostic about including – to be used in the analysis. RobustiPy automatically adds an intercept unless the user has already included one in \(\overleftrightarrow{\mathbf{X}}\) or \(\overleftrightarrow{\mathbf{Z}}\) (i.e., a feature with any name which has zero variance). When a constant is present in \(\overleftrightarrow{\mathbf{X}}\), the software leaves the matrix unchanged. A constant placed in \(\overleftrightarrow{\mathbf{Z}}\) is treated as a varying regressor, so the iterator yields some specifications with an intercept and others without. We – by default – parallelise computation to decrease runtime.4 Finally, to return the results of fitted models, OLSRobust has a method entitled get_results, returning a class object containing estimates as well as utility functions for summarisation and visualisation. We allow the user to specify which specifications they wish to visually highlight – which allows direct comparisons with already published results – alongside multiple other options.5
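A minimal sketch of this ‘vanilla’ workflow is given below. It is not a reproduction of the authors’ Code Block 1: the import path and keyword names (y, x, data, controls, draws, kfold, seed) are assumptions based on the description above, so the released notebooks [1] should be treated as the authoritative reference.

```python
# Hedged sketch of a 'vanilla' RobustiPy run; keyword names and import path are
# assumptions based on the text, not a verified reproduction of Code Block 1.
import pandas as pd
from robustipy.models import OLSRobust  # import path assumed

df = pd.read_csv("simulated_data.csv")  # hypothetical simulated dataset

vanilla_obj = OLSRobust(y=["y_1"],            # single operationalisation of Y
                        x=["x_1"],            # focal predictor
                        data=df)
vanilla_obj.fit(controls=["z_1", "z_2", "z_3"],  # covariates varied combinatorially (2**3 specs)
                draws=1000,                      # bootstrap resamples
                kfold=10,                        # cross-validation folds
                seed=192735)
results = vanilla_obj.get_results()
results.summary()   # printed summary across the specification space
results.plot()      # specification-curve visualisation
```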
Figure 1.
RobustiPy also allows for the inclusion of multiple ‘fixed’ covariates, which are not varied across the covariate space. Programmatically, this set of fixed covariates will be included in all specifications, but not considered in the combinatoric operation. They act as a set of always-present features. This functionality is useful in cases where researchers have a strong determination that these variables always need to be included in a model, likely due to theoretical reasoning (or for specific hypothesis testing, see Section 2.2). It also has the secondary effect of reducing the specification space and hence the compute cost, making it an alternative path to follow in cases with very large sets of covariates (see Section 2.3). To do this, we simply need to append additional variables into our \(\overleftrightarrow{\mathbf{X}}\) array; RobustiPy will take the first name of the list as the main estimand of interest, and the rest as fixed, never-varying covariates. A stand-alone example of this type of use case is shown in Supplementary Code Block 1. An alternative approach to reducing compute cost with a large covariate space is to use our z_specs_sample_size() functionality.
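Continuing the sketch above (same assumed imports and data), fixed covariates are passed as additional entries of \(\overleftrightarrow{\mathbf{X}}\): the first element is the focal estimand and the rest are always included but never varied. The keyword names remain assumptions.

```python
# Hedged sketch: everything after the first element of x is a fixed, always-included
# covariate; only the 'controls' list is varied combinatorially (keyword names assumed).
fixed_obj = OLSRobust(y=["y_1"],
                      x=["x_1", "z_fixed_1", "z_fixed_2"],  # first = estimand; rest fixed
                      data=df)
fixed_obj.fit(controls=["z_1", "z_2", "z_3", "z_4"],  # 2**4 = 16 varying specifications
              draws=1000, kfold=10, seed=192735)
# For very large covariate sets, the z_specs_sample_size() functionality mentioned
# above can instead sample from Π_Z; its exact signature is not reproduced here.
```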
RobustiPy also provides the option to run a fixed-effects OLS model. These types of models are commonly used in the analysis of panel data, in which repeated observations are nested within the observed unit (e.g. individuals, households, etc.), usually over time. In these contexts, fixed-effects OLS is used to isolate and eliminate the intraclass variation of the observations. RobustiPy allows the user to specify a grouping variable, ‘group’, which enters into the .fit() method. By providing this, RobustiPy automatically performs demeaning of the relevant variables and estimates its sets of models on the resultant, transformed data. A stand-alone example use case is shown in Supplementary Code Block 2.
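A hedged sketch of the fixed-effects pattern follows; the group argument is described in the text, while the remaining keyword names, variable names, and the long-format panel file are illustrative assumptions.

```python
# Hedged sketch of a fixed-effects (within-group demeaned) run; keyword names assumed.
import pandas as pd
from robustipy.models import OLSRobust  # import path assumed

panel = pd.read_csv("panel_data.csv")   # hypothetical long-format panel

fe_obj = OLSRobust(y=["wellbeing"], x=["spending"], data=panel)
fe_obj.fit(controls=["age", "income", "employed"],
           group="person_id",   # grouping variable: relevant variables are demeaned within groups
           draws=1000, kfold=10)
fe_results = fe_obj.get_results()
fe_results.summary()
```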
RobustiPy also provides the opportunity to apply a logistic estimator to handle binary outcome variables (via the Newton method, see [47], with a tolerance of \(1\times10^{-7}\)). We chose logistic regression as it is one of the most common and computationally efficient estimators for binary choice models; an important consideration when running specification curves. This functionality is implemented through the LRobust class, which offers essentially the same functionality as OLSRobust, but only accepts binary outcome variables. Naturally, it is also possible to estimate the model space for a binary dependent variable with OLSRobust, which would represent a Linear Probability Model (see, for example, [48]). A stand-alone example use case is shown in Supplementary Code Block 3.
Let’s now modify our assumed data generation process by allowing the dependent variable to be a vector of four possible operationalisations. We describe the process to simulate this data in Code Block 2. In this scenario, researchers might not only want to compute \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\) over each operationalisation in \(\overleftrightarrow{\mathbf{Y}}\) but also to compute all the possible combinations of composites derived from \(\overleftrightarrow{\mathbf{Y}}\) (or, in other words, \(\Pi_{\overleftrightarrow{\mathbf{Y}}}\)) in order to properly obtain \(\Pi\). It might also involve an array of never-changing constants (Section 1.3.2). RobustiPy takes the list of operationalisations and creates combinations of composites by first standardising their values, and then combining them using the row-wise average. ‘Code Block 2: Multiple dependents’ extends our simulation from above. In this particular example, we highlight a specification containing two covariates and the \(Y_1\) and \(Y_2\) composites: [‘y_1’, ‘y_2’, ‘z_1’, ‘z_2’], and a specification containing only one covariate (\(Z_4\)) and a single, non-composite operationalisation of the dependent variable in the form of \(Y_4\): [‘y_4’, ‘z_4’]. If researchers are interested in running multiple fit commands – one for each dependent variable – RobustiPy contains a concat_results function within the utils.py module which stacks results objects together. This is naturally the way by which a researcher can conduct analyses with multiple binary dependents.
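A sketch of the multiple-dependents workflow is given below; the list-of-dependents pattern and the concat_results location follow the description above, but the keyword names and the exact concat_results signature are assumptions.

```python
# Hedged sketch: multiple admissible operationalisations of Y are passed as a list,
# and RobustiPy forms standardised, row-wise-averaged composites of every combination.
import pandas as pd
from robustipy.models import OLSRobust       # import path assumed
from robustipy.utils import concat_results   # location per the text; signature assumed

df_multi = pd.read_csv("simulated_multi_y.csv")  # hypothetical dataset

multi_obj = OLSRobust(y=["y_1", "y_2", "y_3", "y_4"],  # all admissible operationalisations
                      x=["x_1"], data=df_multi)
multi_obj.fit(controls=["z_1", "z_2", "z_3", "z_4"], draws=1000, kfold=10)
res_a = multi_obj.get_results()

# Separate runs (e.g. one per binary dependent) can then be stacked, e.g.:
# combined = concat_results([res_a, res_b])
```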
All of the results in the following Sections 2.1-2.5 – unless otherwise specified – are reported with 1,000 bootstrapped draws, ten folds, a seed of 192735, and options of the HQIC information criterion for model selection and the pseudo-R\(^2\) metric for out-of-sample evaluation.6\(^,\)7 We provide the results of ten empirical examples which enunciate the five types of analysis as detailed in Sections 1.3.1-1.3.5; some are in more detail than others in the main text, each with a corresponding notebook as part of our Online Supplementary Materials (i.e., [1]). All results are discussed in Section 3 (and in particular, Table [tab:results]).
Figure 2.
Figure 4.
Our first example utilises the infamous union.dta dataset; a subsample of the National Longitudinal Survey of Young Women [49]. Women were surveyed in each of the 21 years from 1968 to 1988, except for the six years 1974, 1976, 1979, 1981, 1984, and 1986. This canonical dataset is also well examined as an exemplary dataset in other work which considers model uncertainty and provides multiversal tools [42]. It is also a teaching resource – used in undergraduate and graduate econometrics classes around the world – to illustrate the estimation of treatment effects in observational data. Textbooks which utilise it include [50] and [51], as just two examples. Through OLS regression, students and researchers estimate \(\hat{\beta}\), which is interpreted as the average percentage difference in wages attributable to union membership, holding other factors constant. Importantly, this dataset is entirely anonymised and publicly available. It contains information on things like wage (which, when logarithmically transformed, represents the dependent variable, often written as ‘\(y_i\)’), union membership (and the associated estimand of interest, ‘\(\hat{\beta}_1\)’ in the canonical reduced-form linear regression), and a multitude of often arbitrarily specified controls. Results are visualised in Figure 3; a simple OLS regression on the full sample yields an estimate of the estimand of interest of 10.18; lower than ‘conventional estimates’, which are nearer to a 15% premium (Hirsch 2004). [42] estimates the (unweighted) median effect across the multiverse as 14.00; the median of our unweighted estimates across all samples with no bootstraps is 13.5, across all samples and bootstraps it is 13.6, and the BIC-weighted estimate is 9.8 (as per the recommendation of [13]). We are agnostic as to which number is most statistically valid, simply providing the tools by which a researcher can consider the range of plausible results.
We also undertake a ‘vanilla’ application to criminology: a revisitation of [52] based on a dataset from [53], where the dependent variable is state-level crime rate, and the estimand of interest is income inequality (defined as the percentage of families earning below one-half of the median income). The results from the .summary() method – inherent to RobustiPy – are shown in ‘Code Block 3: Summarizing a re-analysis of criminality’ and Supplementary Information Figure 10. The Supplementary Information (Figure 11) also contains a re-analysis of the ‘Colonial Origins of Comparative Development’, based on the influential paper of [54].
We next detail the case where we have an array of variables that we always want to include in the model space, but never vary. In terms of RobustiPy functionality, this simply involves passing an array of variable names (\(\overleftrightarrow{\mathbf{X}}\)) which has length greater than one, with the array of variable controls specified as before (\(\overleftrightarrow{\mathbf{Z}}\)). This involves using updated variables from the Penn World Table (version 10.01, see [56]) to revisit [55], which contributes to the ‘Empirics of Economic Growth’. We map modern data from 73 of the 98 countries in the original 1988 sample, using observed values of all parts of the \(\log(n+g+\delta)\) and \(\log(\frac{I}{Y})\) terms which are canonical to the Solow Growth Model. We also include various possible extensions – as done in [55] – such as including logged metrics of human capital, and previous values of GDP in levels. By running RobustiPy twice – with \(\overleftrightarrow{\mathbf{X}} = [\log(n+g+\delta), \log(\frac{I}{Y})]\) and \(\overleftrightarrow{\mathbf{X}} = [\log(\frac{I}{Y}), \log(n+g+\delta)]\), both of which we always want in our core model – we are able to recover two large sets of bootstrapped estimates of the relevant estimands accompanying the first variable in each of these two arrays. This allows us to extract estimates from the results object of RobustiPy to assess whether the coefficients are of equal magnitude and opposite sign. RobustiPy also allows us to visualise the distribution of the \(\bar{R}^2\) values (a key test of the Solow Growth Model); both are shown in Figure 5, with full RobustiPy outputs for the two runs shown in Figures 12-13.
Regarding fixed-effect estimation, our empirical replication comes from [57], which takes a longitudinal approach to local government spending on adult social care and carers’ subjective well-being in England. Specifically, we analyse the effect of a Local Authority measure of Adult Social Care Spending on a General Health Questionnaire (GHQ12) metric from the United Kingdom Household Longitudinal Survey (see Appendix I, column (1) of [57]). A static group of controls (i.e., inside \(\overleftrightarrow{\mathbf{X}}\)) includes seven sociodemographic and health variables, alongside thirteen year fixed-effect dummy variables, whereas our array of controls – which do vary – contains a further eight variables. Importantly, we document the inclusion of a grouping variable (‘pipd’), common in longitudinal survey analysis. Figure 6 shows the results, with the median value of the estimand of interest (0.203 for all specifications and all bootstrapped resamples, 0.202 for all specifications and no resampling, and 0.182, 0.182 and 0.183 for AIC-, BIC-, and HQIC-weighted estimates). This result is extremely close to that reported in the original paper (0.1825). Given that the size of this feature space is so large, had we not specified a larger \(\overleftrightarrow{\mathbf{X}}\) (an array of always-present variables), RobustiPy would have had to estimate a relatively intractable model space of \(2^{D}\) = 268,435,456 specifications. To this extent, we also use this as an example to show our resampling of \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\), with results shown in Table [tab:results]. A complementary pedagogical example – with the Understanding Society Longitudinal Teaching Dataset (Waves 1-9, from October 2021, see [58]) – is provided in Supplementary Figure 14, where the estimand of interest is ‘sf1_dv’ (general health) and a fixed-effect grouping is again specified based on a unique cross-wave person identifier.
We next describe an example of how RobustiPy can be used when our dependent variable is binarized, as part of an application common to the medical sciences. In particular, we utilise the Framingham Heart Survey [59] for the estimation of models related to a dependent variable termed ‘TenYearCHD’: a binary indicator equal to one if a study participant experienced a coronary heart disease (CHD) event within ten years of their baseline examination, and zero otherwise. This is used in multiple, high-profile papers such as [60], and as part of a composite in papers such as [61]. The dataset includes over 4,240 records across sixteen features. We fit our LRobust class with ‘TenYearCHD’ as our target dependent, ‘BMI’ as the single element of \(\overleftrightarrow{\mathbf{X}}\), for which we calculate estimands of interest, and we use ‘age’, ‘male’, ‘totChol’, ‘sysBP’, ‘BPMeds’, ‘currentSmoker’, ‘cigsPerDay’, ‘diabetes’, ‘heartRate’, and ‘glucose’ as elements of our array of controls (\(\overleftrightarrow{\mathbf{Z}}\)). This example accepts the ‘oddsratio’ boolean into the .plot() command on the results object to calculate odds ratios, with results shown in Figure 16. Our online replication archive also includes models fitted to a Kaggle dataset on a binarized measure of airline customer satisfaction (from an undisclosed airline); the resulting coefficient estimates indicate how RobustiPy can be used as a method for model selection and averaging, and are presented in Figure 15 and, again, summarized in Table [tab:results] (which also contains RobustiPy results from estimating the problem as a Linear Probability Model, i.e., with the OLSRobust class).
We next present the case whereby a study might have multiple candidates for a dependent variable that measures the same underlying concept. While this idea might be uncommon in some disciplines, having a composite target variable that combines multiple measures is extremely common practice in myriad academic disciplines. The most common case of multiple target variables arises from the use of multi-item measures in surveys. Here, we attempt to replicate the results of Study 3a in [62], alongside the further replication analysis conducted by DataColada for the same study. In their analysis (see [63]), DataColada alluded to the manipulation of the results through the alteration of the data used in the original analyses, ultimately leading to a retraction of the original study (see [64]). Here, we attempt to replicate the effect of the experimental condition on measures of moral impurity (seven items) and networking intentions (four items), separating the datasets used in both the original study, and that which has been reconstructed. In the case of multiple dependent measures, RobustiPy combines them into all possible composite combinations, including single-item combinations. To ensure comparability between items, RobustiPy uses \(z\)-scores rather than raw measures (see Section 1.3.5). Finally, to replicate an ANOVA/ANCOVA analysis, we convert the categorical column containing the experimental condition group membership into dummy variables and use them as predictors in each case. The results are shown in Figure 7. The resultant outputs from RobustiPy are used to plot specification curves based on the published dataset, which would appear to support the main hypothesis of the study, as it shows that any combination of the outcome variable ‘moral impurity’ has a significant effect from the experimental condition. However, results using the reconstructed data show that the effect is much weaker and in the opposite direction than hypothesised. The same holds for a very similar outcome (‘networking intentions’). This example illustrates the flexibility of RobustiPy, demonstrating how it can be used to replicate both ANOVA and ANCOVA analyses – even though these are not explicitly implemented in the library – while also enabling the integration of custom visualisations to present results. More broadly, this example shows how RobustiPy can be employed as a tool for auditing existing studies by systematically exploring the robustness of published findings. By taking the output of the ‘.get_results()’ object, RobustiPy enables a multitude of custom comparisons beyond that which one run alone makes possible. We also use RobustiPy to replicate [65], which aims to understand the ‘association between adolescent well-being and digital technology use’. In this article, data from well-known adolescent well-being panel studies are used across a series of measures of well-being (dependent variables) and digital use (independent variables). Under this setting, the authors opt for a specification curve analysis approach to obtain a distribution of the effect of many forms of technology use across many measures of well-being; we replicate one of the main analyses, where the results – which are also a function of RobustiPy’s ability to rescale \(\overleftrightarrow{\mathbf{X}}\) and \(\overleftrightarrow{\mathbf{Z}}\) – are seen in Figure 8 and Table [tab:results].
Table [tab:results] outputs all results from our ten empirical replications. It explicates the enormous variation in both the estimand of interest – across a researcher’s set of reasonable choices – and also the variation in out-of-sample accuracy across specification spaces. It also shows, for example, how we can consider whether coefficient values are approximately equal and opposite across various different realisations of an estimand (weighted or unweighted), as done by [55] in their test of the Solow Model. It emphasizes the ability of RobustiPy to act as a tool for the evaluation of research already published in the scientific record, not least through its ability to highlight specific specifications of interest in the visualisation; it acts as an essential audit tool when the validity of research is under question (such as through the retraction of [62]). While not emphasized elsewhere in this manuscript, RobustiPy also outputs the exact specifications which maximize out-of-sample evaluation metrics, and those which minimize certain information criteria; this function is immensely useful for helping to maximize model fit (see Code Block 3 for an example of this). While Table [tab:results] is compiled from the .summary() method, this is just one of the three ways by which RobustiPy provides output; it also provides a large number of (separable) visualisations alongside all raw estimates (which can be used for bespoke downstream analyses).
We also time-profile RobustiPy. This involves two simple simulations of data generating processes (see online Supplementary Material), and contains an example of each of the OLSRobust and LRobust classes. We generate 25 integer values on a log scale between 10 and 10,000 which represent a varying number of draws (\(b\)). We use the ‘.fit()’ method with 2, 5, 10, 15, 20, and 25 folds (\(k\)). We use 3, \(\ldots\), 7 controls (which results in groups of \(\{2^3, \ldots, 2^7\}\) specifications). For each combination of folds, draws, and number of specifications, the problem is estimated ten times. The seed is unchanged for each iteration, and we use twenty cores of a 14900F processor. This results in approximately 671,512,080 regressions being estimated for our time profiling in total, including those done to create joint-inference tests (see Section 4). With the results shown in Figure 9, RobustiPy operates as an approximately \(O(K(2b+k))\) process.
We aim to provide an accessible tool for researchers to run the most common associative models in an era when computational power is vast [33], [34]. We offer this easy-to-use interface in the hope that transparent results-reporting will spread and become the default standard for statistical analysis. By lowering the user-experience barrier, we aim to accelerate the adoption – and routine practice – of robust modelling. We include as many tools, tests, and features as possible whilst attempting to maintain the simplest possible interface in order to encourage researchers to explore the robustness of the models which test their research hypotheses. Importantly, our application is highly modular and easily extendible.8 We envision tools such as RobustiPy as a necessity for a large majority of empirical workflows, yet there are some natural limitations which warrant further elaboration. First, while RobustiPy offers a comprehensive array of tools to conduct various forms of specification curve analyses, many aspects of the process rely heavily on decisions made by the researcher. The selection of a reasonable array of covariates, the choice of the underlying model, and ultimately the hypothesis to be tested are all left to the researcher’s discretion; spurious or poorly specified models will still yield spurious and poorly specified results, even when implemented using tools like RobustiPy (despite the fact that we also weight by information criteria, as advocated by [13]). Our application does not presently conduct mis-specification tests, for example, which is a natural extension. While analysis of this kind provides a computationally powerful framework for testing the robustness of statistical models, readers should not assume that it is the ultimate or only approach for such assessments. Like any method, it is most useful when applied critically and in conjunction with substantive theoretical reasoning. The natural extensions to RobustiPy are limitless. We plan to explore incorporating Laplacian approximations into RobustiPy by expanding each model’s log-posterior around its maximum a-posteriori estimate, using the resulting Gaussian integral as an efficient proxy for the marginal likelihood (thereby enabling scalable comparisons across large model spaces [66]) in order to reduce compute costs [67]. Other natural extensions involve the incorporation of additional estimators beyond the OLS, Logit, and Fixed Effects models currently supported. There is also pioneering work which attempts to equate estimators [68], much of which will be incorporated as we continue to support enhancements to the RobustiPy framework over time.
RobustiPy provides an array of in-sample metrics to evaluate the robustness and quality of analysis. These metrics are calculated for all models and samples, and stored in the results class to maximize their availability to the end-user.
For each ordinary least squares specification, RobustiPy performs the standard – but adjusted – in-sample coefficient of determination (\(\bar{R}^2\)) computation,
\[\bar{R}^2 = 1 - \frac{ \displaystyle \sum_{i=1}^N \frac{(y_i - \hat{y}_i)^2 }{(N - P - 1)} }{ \displaystyle \sum_{i=1}^N \frac{(y_i - \bar y)^2 }{(N - 1)} }\]
where \(y_i\) are observed values of each observation \(i \in \{1, \dots, N\}\), \(\hat{y}_i\) are the predicted values, \(\bar{y}\) is the mean of the observed data, \(N\) is the number of observations for that estimation, and \(P\) is the number of parameters. RobustiPy also prints a large amount of summary information about all of the calculations which it undertakes, including the maximum value of \(\bar{R}^{2}\) and the specification \(\pi^*_{\bar{R}^2}\) to which it belongs, as follows:
\[\pi^*_{\bar{R}^2} = \arg\max_{\pi \in \{1, \dots, |\Pi|\}} \bar{R}^2_\pi \label{r2_max}\tag{4}\]
where \(\Pi\) is the set of models in the choice space (see Section 1.2) and \(\bar{R}^{2}_\pi\) is the \(\bar{R}^2\) value of the \(\pi\)-th model. In the case of logistic regressions, we calculate
\[R_{\mathrm{McF}}^2 = 1 - \frac{ \displaystyle \sum_{i=1}^N \bigl[y_i\ln\hat{y}_i \;+\;(1-y_i)\ln(1-\hat{y}_i)\bigr] }{ \displaystyle \sum_{i=1}^N \bigl[y_i\ln\bar y \;+\;(1-y_i)\ln(1-\bar y)\bigr] }\label{eq:mcfadden}\tag{5}\]
which is one minus the ratio of the model’s log-likelihood to that of a null (intercept-only) model, where \(\hat{y}_i\) is the model’s predicted probability that observation \(i\) has an outcome equal to one.
RobustiPy also calculates the specification which maximizes this metric in the logistic case, equivalent to Eq. 4. See [69] for more details. In the case of the LRobust class, the full model’s likelihood is also returned. Similarly, for OLSRobust – for each specification – the model’s log-likelihood when assuming homoscedastic errors is calculated for expediency as follows:
\[\log L = -\frac{N}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^N (y_i - \hat{y}_i)^2\] The specification \(\pi^*_{\textrm{log L}}\) which maximizes the log-likelihood is presented in the summary method, determined as follows: \[\pi^*_{\textrm{log L}} = \arg\max_{\pi \in \{1, \dots, |\Pi|\}} \log L_\pi\]
where \(|\Pi|\) is the number of models in the specification space (2\(^D\)), and \(\log L_\pi\) is the log-likelihood value of the \(\pi\)-th model.
RobustiPy also computes three information criteria for each specification: the Akaike Information Criterion (AIC, see [70]), the Bayesian Information Criterion (BIC, see [71]), and the Hannan-Quinn Information Criterion (HQIC, see [72]). Each information criterion is calculated from the previously computed log-likelihood, as follows: \[\textrm{AIC} = 2P - 2 \log L\] \[\label{eq:bic}
\textrm{BIC}= P \log N - 2 \log L\tag{6}\] \[\textrm{HQIC} = 2P \log (\log N) - 2 \log L\]
where \(P\) is the number of estimated parameters, \(N\) is the number of observations, and \(\log L\) is the log-likelihood of the model. As before, all information criteria are saved in the results class, and the minimum value and its associated model are provided by the summary method:
\[\pi^*_{\textrm{IC}} = \arg\min_{\pi \in \{1, \dots, |\Pi|\}} IC_\pi,\]
where IC \(\in \{\textrm{AIC}, \textrm{BIC}, \textrm{HQIC}\}\).
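The criteria above can be reproduced directly from a specification's fitted values; the short numpy sketch below mirrors the formulas in the text (it is not RobustiPy's internal implementation).

```python
# Worked sketch of the in-sample criteria defined above, mirroring the text's formulas.
import numpy as np

def gaussian_loglik(y, y_hat):
    """Simplified homoscedastic log-likelihood used in the text."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)

def information_criteria(y, y_hat, n_params):
    """AIC, BIC and HQIC for one specification."""
    n = len(y)
    ll = gaussian_loglik(y, y_hat)
    return {"AIC": 2 * n_params - 2 * ll,
            "BIC": n_params * np.log(n) - 2 * ll,
            "HQIC": 2 * n_params * np.log(np.log(n)) - 2 * ll}

# Model selection across a specification space: the smallest criterion wins, e.g.
# best = min(range(len(preds)), key=lambda s: information_criteria(y, preds[s], p[s])["HQIC"])
```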
In addition to in-sample metrics, RobustiPy provides multiple k-fold cross-validated out-of-sample metrics for each estimated model. Given a dataset \(\mathcal{D}\), defined multiple times by different sets of predictors \(\overleftrightarrow{\mathbf{X}}^{(\pi)}\), where \(\pi\) indexes different model specifications, we perform analysis with \(K\)-fold cross-validation to evaluate a predictive model. Each specification and associated dataset \(\mathcal{D}^{(\pi)}\) consists of:
\[\mathcal{D}^{(\pi)} = \{(\overleftrightarrow{\mathbf{X}}_i^{(\pi)}, \overleftrightarrow{\mathbf{Z}}_i^{(\pi)}, \overleftrightarrow{\mathbf{Y}}_i^{(\pi)})\}_{i=1}^{N^{(\pi)}},\]
where \(N^{(\pi)}\) is the sample size specific to specification \(\pi\) after all preprocessing, \(\overleftrightarrow{\mathbf{X}}^{(\pi)}\) represents a specific choice of predictors \(\overleftrightarrow{\mathbf{X}}\), \(\overleftrightarrow{\mathbf{Z}}^{(\pi)}\) represents a specific choice of covariates \(\overleftrightarrow{\mathbf{Z}}\) and \(\overleftrightarrow{\mathbf{Y}}^{(\pi)}\) represents a specific choice of dependent variable \(\overleftrightarrow{\mathbf{Y}}\). For each specification \(\pi\), we divide \(\mathcal{D}^{(\pi)}\) into \(K\) approximately equal-sized, mutually exclusive subsets \(\{ \mathcal{D}_1^{(\pi)}, \mathcal{D}_2^{(\pi)}, \dots, \mathcal{D}_K^{(\pi)} \}\), such that:
\[\mathcal{D}_k^{(\pi)} \cap \mathcal{D}_j^{(\pi)} = \emptyset \quad \text{for } k \neq j, \quad \bigcup_{k=1}^{K} \mathcal{D}_k^{(\pi)} = \mathcal{D}^{(\pi)}.\]
For each fold \(k \in \{1, \dots, K\}\) and specification \(\pi\), we define a training dataset \(\mathcal{D}_{\text{train}}^{(k, \pi)} = \mathcal{D}^{(\pi)} \setminus \mathcal{D}_k^{(\pi)}\) and a validation set \(\mathcal{D}_{\text{val}}^{(k, \pi)} = \mathcal{D}_k^{(\pi)}\). The model is trained on \(\mathcal{D}_{\text{train}}^{(k, \pi)}\) to obtain the learned parameters \(\theta^{(k, \pi)}\), and then evaluated on \(\mathcal{D}_{\text{val}}^{(k, \pi)}\). Users then specify – or are prompted to select – their preferred out-of-sample metric (‘oos_metric’) in the .fit() method. For both OLSRobust and LRobust, we calculate the Root Mean Square Error (selected by choosing oos_metric=‘rmse’) for fold \(k\) and specification \(\pi\), given by:
\[\text{RMSE}_{k}^{(\pi)} = \sqrt{\frac{1}{|\mathcal{D}_k^{(\pi)}|} \sum_{i \in \mathcal{D}_k^{(\pi)}} \left( y_i - f(\overleftrightarrow{\mathbf{X}}_i^{(\pi)},\overleftrightarrow{\mathbf{Z}}_i^{(\pi)}; \theta^{(k, \pi)}) \right)^2 }.\]
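The cross-validated RMSE above can be illustrated with scikit-learn's KFold and a plain OLS fit; this is a stand-alone sketch of the procedure rather than RobustiPy's own code.

```python
# Minimal sketch of the K-fold out-of-sample RMSE for one specification.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def cv_rmse(X, y, k=10, seed=192735):
    """Average root-mean-squared error over K validation folds."""
    rmses = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        resid = y[val_idx] - model.predict(X[val_idx])
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(rmses))
```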
We also allow users to evaluate their out-of-sample model performance with the pseudo-\(R^2\) metric as popularised by [73], selected through oos_metric=‘pseudo-r2’. This is otherwise known as the Brier Skill Score when evaluating probabilities – as in the case of LRobust – and was introduced by [74]. For fold \(k\) and specification \(\pi\):
\[\textrm{Pseudo-R}^{2 (\pi)}_{\space k} = 1 - \frac{\sum_{i \in \mathcal{D}_k^{(\pi)}} \left( y_i - f(\overleftrightarrow{\mathbf{X}}_i^{(\pi)},\overleftrightarrow{\mathbf{Z}}_i^{(\pi)}; \theta^{(k, \pi)}) \right)^2 }{\sum_{i \in \mathcal{D}_k^{(\pi)}} \left( y_i - \bar{y}^{(\pi)} \right)^2 },\]
where \(\bar{y}^{(\pi)}\) is the mean of \(y_i\) in the training set of each fold. In the case of LRobust, when calculating an out-of-sample R\(^2\), we also allow the user to evaluate their model with McFadden’s R\(^2\) metric (oos_metric=‘mcfaddens-r2’), which instead uses the mean of the training dataset in the calculation of the denominator of Eq. 5.
We additionally include the Cross Entropy (also referred to as Log Loss) as an out-of-sample metric for evaluating probabilistic predictions in binary classification models (oos_metric=‘cross-entropy’). This metric penalises confident but incorrect predictions more heavily, making it particularly useful for assessing well-calibrated models. For fold \(k\) and specification \(\pi\), Cross Entropy is defined as:
\[\text{Cross Entropy}_{k}^{(\pi)} = -\frac{1}{|\mathcal{D}_k^{(\pi)}|} \sum_{i \in \mathcal{D}_k^{(\pi)}} \left[ y_i \log(f(\overleftrightarrow{\mathbf{X}}_i^{(\pi)},\overleftrightarrow{\mathbf{Z}}_i^{(\pi)}; \theta^{(k, \pi)}) ) + (1 - y_i) \log (1 - f(\overleftrightarrow{\mathbf{X}}_i^{(\pi)},\overleftrightarrow{\mathbf{Z}}_i^{(\pi)}; \theta^{(k, \pi)}) ) \right].\]
In addition, we include the InterModel Vigorish (IMV) – recently introduced by [75], [76], and [77] – for binary classifications (oos_metric=‘imv’). For fold \(k\) and specification \(\pi\), it is defined as:
\[\text{IMV}_{k}^{(\pi)} = \frac{w_{\text{enhanced}}^{(k,\pi)} - w_{\text{null}}^{(k,\pi)}}{w_{\text{null}}^{(k,\pi)}},\] where \(w_{\text{null}}^{(k,\pi)}\) and \(w_{\text{enhanced}}^{(k,\pi)}\) are entropic representations of model fit. Each \(w\) is the value minimising the difference between \(w \log w + (1 - w) \log (1 - w)\) and \(\frac{1}{|\mathcal{D}_k^{(\pi)}|}\sum_{i\in \mathcal{D}_k^{(\pi)}}y_i\log \hat{y}_i+(1-y_i)\log (1-\hat{y}_i)\). Here, \(\hat{y}_i\) is the predicted value from the model fitted under choice \(\pi\) when computing \(w_{\text{enhanced}}^{(k,\pi)}\), and is replaced by \(\bar{y}\) in the validation set \(\mathcal{D}_k^{(\pi)}\) when computing \(w_{\text{null}}^{(k,\pi)}\). These metrics are then averaged over the corresponding \(K\) folds, as \(\text{RMSE}_{\text{CV}, \pi} = \frac{1}{K} \sum_{k=1}^{K} \text{RMSE}_{k,\pi}\), \(\text{Pseudo-R}^{2}_{\text{CV},\pi} = \frac{1}{K}\sum_{k=1}^{K}\text{Pseudo-}R^{2}_{k,\pi}\), \(\text{McFadden's-R}^{2}_{\text{CV},\pi} = \frac{1}{K}\sum_{k=1}^{K}\text{McFaddens-R}^{2}_{k,\pi}\), \(\text{Cross Entropy}_{\text{CV}, \pi} = \frac{1}{K}\sum_{k=1}^{K} \text{Cross Entropy}_{k,\pi}\) and \(\text{IMV}_{\text{CV},\pi} \;=\; \frac{1}{K}\sum_{k=1}^{K}\text{IMV}_{k,\pi}\).
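The IMV calculation can be sketched as a small root-finding problem: for each model, find the \(w\) whose Bernoulli (negative) entropy matches the model's mean log-likelihood, then compare the enhanced model against the prevalence-only null. This is an illustrative implementation of the definition above, assuming the model predicts at least as well as its null.

```python
# Illustrative sketch of the IMV; assumes mean log-likelihoods lie above -log(2)
# so that the bracket for the root-finder is valid.
import numpy as np
from scipy.optimize import brentq

def w_from_loglik(mean_ll):
    """Solve w*log(w) + (1-w)*log(1-w) = mean_ll for w in (0.5, 1)."""
    f = lambda w: w * np.log(w) + (1 - w) * np.log(1 - w) - mean_ll
    return brentq(f, 0.5 + 1e-9, 1 - 1e-9)

def imv(y_true, p_model):
    """InterModel Vigorish of model probabilities versus the prevalence baseline."""
    eps = 1e-12
    y_true = np.asarray(y_true, dtype=float)
    p_model = np.clip(np.asarray(p_model, dtype=float), eps, 1 - eps)
    p_null = np.clip(np.full_like(p_model, y_true.mean()), eps, 1 - eps)
    ll = lambda p: np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    w_enh, w_null = w_from_loglik(ll(p_model)), w_from_loglik(ll(p_null))
    return (w_enh - w_null) / w_null
```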
To compare multiple specifications, we compute, store, and make readily available the metrics for each \(\pi\) and analyse their distributions across arrays stored as follows: \[\mathcal{S}_{\text{RMSE}} = [ \text{RMSE}_{\text{CV}, \pi} ]_{\pi=1}^{|\Pi|},\] \[\mathcal{S}_{\text{Pseudo-R}^2} = [ \text{Pseudo-R}^2_{\text{CV}, \pi} ]_{\pi=1}^{|\Pi|},\] \[\mathcal{S}_{\text{McFaddens-R}^2} = [\text{McFaddens-R}^2_{\text{CV}, \pi} ]_{\pi=1}^{|\Pi|},\] \[\mathcal{S}_{\text{Cross Entropy}} = [\text{Cross Entropy}_{\text{CV}, \pi}]_{\pi=1}^{|\Pi|},\] \[\mathcal{S}_{\text{IMV}} = [ \text{IMV}_{\text{CV}, \pi} ]_{\pi=1}^{|\Pi|},\]
where \(|\Pi|\) is the total number of models estimated.
RobustiPy undertakes an array of computations for analysing each model in the choice space (\(\pi \in \Pi\)).
To quantify uncertainty in the regression estimates in the specification/choice curve, RobustiPy performs a bootstrap procedure to estimate the confidence interval of the effect between \(\overleftrightarrow{\mathbf{Y}}\) and the key estimand of interest, \(\mathbf{X}_{1}\). For each choice \(\pi\in \Pi\):
Fit the model to observed data \[\hat{\omega}_{\pi} = F(\overleftrightarrow{\mathbf{Y}}_{\pi}, \overleftrightarrow{\mathbf{X}}_{\pi}, \overleftrightarrow{\mathbf{Z}_{\pi}})\] where \(\hat{\omega}_{\pi}\) is typically the coefficient of the predictor of interest \(\mathbf{X}_{1,\pi}\)
Draw \(B\) bootstrap replicates \(b\) from the dataset \(\mathcal{D}^{(\pi)}\), each of size \(N^{(\pi)}\).
Re-estimate the model in each replicate \(b\) \[\hat{\omega}_{\pi}^{(b)} = F(\overleftrightarrow{\mathbf{Y}}_{\pi}^{(b)}, \overleftrightarrow{\mathbf{X}}_{\pi}^{(b)}, \overleftrightarrow{\mathbf{Z}_{\pi}}^{(b)}) \qquad b = 1,\dots,B.\]
Construct the empirical distribution of coefficients \(\hat{\omega}_{\pi}^{(b)}\) across all bootstrap replicates \[\mathcal{S}_{\pi}^{\text{boot}} = \{ \hat{\omega}_{\pi}^{(b)} \}_{b=1}^{B}\]
The \((1 - \alpha) \times 100\%\) confidence interval for the full set of choices \(\Pi\), computed using the empirical quantiles (with a user-specified confidence level ‘ci’, defaulting to 0.95, where \(\alpha = 1 - \text{ci}\)) is: \[\text{CI}_{\omega} = \left\{\left[ Q_{\alpha/2}(\mathcal{S}_{\pi}^{\text{boot}}), \quad Q_{1-\alpha/2}(\mathcal{S}_{\pi}^{\text{boot}}) \right] \colon \pi \in \Pi \right\}\] where \(Q_{\alpha}\) denotes the empirical quantile function. Values (for the specification curve subfigure) can be optionally LOESS-smoothed for better visualisation, but raw estimates are kept.
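The per-specification bootstrap above reduces to a short loop; the sketch below uses numpy and statsmodels for a single specification \(\pi\), with the focal predictor as the first column of X (illustrative, not RobustiPy internals).

```python
# Sketch of the percentile bootstrap CI for the focal coefficient of one specification.
import numpy as np
import statsmodels.api as sm

def bootstrap_ci(y, X, draws=1000, ci=0.95, seed=192735):
    """Percentile confidence interval for the first regressor after the constant."""
    rng = np.random.default_rng(seed)
    y, X = np.asarray(y, dtype=float), np.asarray(X, dtype=float)
    Xc = sm.add_constant(X)
    boot = np.empty(draws)
    for b in range(draws):
        idx = rng.integers(0, len(y), size=len(y))   # resample rows with replacement
        boot[b] = sm.OLS(y[idx], Xc[idx]).fit().params[1]
    alpha = 1 - ci
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])
```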
RobustiPy performs Bayesian Model Averaging (BMA) for variables specified in \(\overleftrightarrow{\mathbf{Z}}_{\pi}\). The focal predictor(s) in \(\overleftrightarrow{\mathbf{X}}_{\pi}\) are included in every specification, so they are not averaged over. For clarity, the equations below follow one generic control variable (indexed as \(p\)), where the same steps are repeated for every other control;
\[\overleftrightarrow{\mathbf{Y}}_{\pi} = \bigl[ \overleftrightarrow{\mathbf{X}}_{\pi}, \overleftrightarrow{\mathbf{Z}}_{\pi} \bigr] \boldsymbol{\beta}_{\pi} +\boldsymbol{\epsilon}_{\pi}, \qquad \boldsymbol{\epsilon}_{\pi}\sim \mathcal{N}\!\bigl(\mathbf{0},\, \sigma_{\pi}^{2}\mathbf{I}_{N_\pi}\bigr).\]
The \(p^{\text{th}}\) coefficient in \(\boldsymbol{\beta}_{\pi}\) is denoted \[\omega_{p,\pi}\equiv\beta_{p,\pi}.\]
With a uniform prior over models, the posterior weight of model \(M_\pi\) is \[P(M_\pi\mid\mathcal{D}) = \frac{\exp\!\bigl[-\tfrac12\,\text{BIC}_{\pi}\bigr]}{\displaystyle\sum_{\pi'\in\Pi} \exp\!\bigl[-\tfrac12\,\text{BIC}_{\pi'}\bigr]},\] where \[\label{eq:bic_pi} \text{BIC}_{\pi} = P_{\pi}\,\log N_{\pi} \;-\; 2\,\log\hat{L}_{\pi}, \qquad\text{(cf.\;Eq.~\ref{eq:bic})}\tag{7}\]
Let \(\hat{\omega}_{p,\pi}\) be the within-model posterior mean (equal to the OLS/ML estimate under a vague prior). The BMA estimate for index \(p\) is \[\mathbb{E}[\omega_{p}\mid\mathcal{D}] = \sum_{\pi\in\Pi} P(M_\pi\mid\mathcal{D})\, \hat{\omega}_{p,\pi}.\]
Repeating this step for every \(p=1,\dots,P\) yields the full vector \[\widehat{\boldsymbol{\omega}}^{\text{BMA}} = \bigl( \mathbb{E}[\omega_{1}\mid\mathcal{D}], \dots, \mathbb{E}[\omega_{P}\mid\mathcal{D}] \bigr)^{\!\top}.\]
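Operationally, the BIC-based weights and BMA estimate reduce to a softmax over \(-\tfrac12\,\text{BIC}_\pi\); a short sketch follows, where bics and coefs are assumed to be arrays collected across the specifications containing the variable of interest.

```python
# Sketch of BIC-implied posterior model weights and the BMA coefficient estimate.
import numpy as np

def bma_estimate(bics, coefs):
    """BIC-weighted average of a coefficient across specifications (uniform model prior)."""
    bics = np.asarray(bics, dtype=float)
    delta = bics - bics.min()              # shift by the minimum for numerical stability
    weights = np.exp(-0.5 * delta)
    weights /= weights.sum()               # posterior model probabilities
    return float(np.sum(weights * np.asarray(coefs, dtype=float)))
```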
RobustiPy computes SHapley Additive exPlanations (SHAP) for the single ‘full’ specification containing all predictors and controls. After dropping missing values (and optional group demeaning), the data are split into an 80% training set and a 20% held-out test set. We fit ordinary least squares (or a Logistic Regression with an L2 penalty and inverse regularisation strength of 0.1, depending on the class used) on the training data:
\[\widehat{Y}_{i} = [\overleftrightarrow{\mathbf{X}}_{i},\overleftrightarrow{\mathbf{Z}}_{i}]\;\widehat{\boldsymbol{\beta}},\] for \(i=1,\dots,N_{\mathrm{test}}\) where \(\overleftrightarrow{\mathbf{X}}_{i}\) and \(\overleftrightarrow{\mathbf{Z}}_{i}\) are row‐vectors of the test‐set features (including the intercept). Let \[\mu_{p} = \mathbb{E}_{\mathrm{train}}[X_{p}]\] and \[v(\emptyset) = \sum_{p=1}^{P}\widehat\beta_{p}\,\mu_{p}.\] Under linearity, the exact SHAP values are \[\widehat\phi_{p,i} = \widehat\beta_{p}\,\bigl(X_{p,i} - \mu_{p}\bigr),\] for \(p=1,\dots,P\) and \(i=1,\dots,N_{\mathrm{test}}\), which satisfy \[\widehat{Y}_{i} = v(\emptyset) + \sum_{p=1}^{P}\widehat\phi_{p,i}.\]
Rather than aggregating them internally, RobustiPy
simply returns \[\texttt{shap\_return}
=
\Bigl( [\widehat\phi_{p,i}]_{i=1,\dots,N_{\mathrm{test}};\,p=1,\dots,P}, \;\mathbf{X}^{\mathrm{test}}
\Bigr).\] Users may then compute summaries (e.g. mean absolute SHAP values per feature) or visualise these values as desired.
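The closed-form decomposition above can be reproduced in a few lines of NumPy; the sketch below uses simulated data and an explicit intercept column, and is intended only to illustrate the arithmetic, not RobustiPy’s internal routine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated design: an explicit intercept column plus three features.
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=N)

# 80/20 train-test split, then an OLS fit on the training rows.
n_train = int(0.8 * N)
X_tr, X_te, y_tr = X[:n_train], X[n_train:], y[:n_train]
beta_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

# Exact linear SHAP values: phi_{p,i} = beta_p * (x_{p,i} - mean_train(x_p)),
# with the baseline v(empty set) given by the prediction at the training means.
mu = X_tr.mean(axis=0)
phi = beta_hat * (X_te - mu)
baseline = beta_hat @ mu

# Sanity check: baseline plus the row-sum of SHAP values recovers each prediction.
assert np.allclose(baseline + phi.sum(axis=1), X_te @ beta_hat)
print("mean |SHAP| per feature:", np.abs(phi).mean(axis=0))
```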
Beyond per–specification uncertainty, RobustiPy
carries out a curve-level hypothesis test that asks whether the pattern of coefficients across the whole choice space (\(\Pi\)) could plausibly arise
under the sharp null \(H_{0}\!:\beta_{1}=0\). The procedure mirrors the bootstrap introduced earlier, but aggregates after each coefficient is estimated.
Full-sample fits: Fit every specification, once on its observed data \(\mathcal{D}^{(\pi)}\) and store the coefficient and p-value (\(\hat{\omega}_{\pi}\) and \(p_{\pi}\) for \(\pi\in\Pi\)).
Outcome bootstrap: Draw \(B\) i.i.d. resamples of rows—the same row indices for every specification—and re-estimate \(\hat{\omega}_{\pi}^{(b)}\) for \(\pi\in\Pi\), and for \(b=1,\dots, B\).
Null bootstrap: Force the null via \(Y^{\star}_{\pi}=Y_{\pi}-\mathbf{X}_{1,\pi}\hat{\omega}_{\pi}\) and repeat the shared-row bootstrap to obtain \(\hat{\omega}_{\pi}^{\star(b)}\) (again for \(\pi\in\Pi\) and \(b=1,\dots,B\)).
Curve-level summary functionals: For a collection \(A=\{a_{\pi}\}_{\pi\in\Pi}\) define \[\begin{array}{ll} S_{1}(A) &= \operatorname{median}_{\pi} a_{\pi} \\[4pt] S_{2}(A) &= \displaystyle\min_{\pi} a_{\pi} \\[4pt] S_{3}(A) &= \displaystyle\max_{\pi} a_{\pi} \\[4pt] S_{4}(A) &= \displaystyle\sum_{\pi}\mathbb{1}\{a_{\pi}>0\} \\[4pt] S_{5}(A) &= \displaystyle\sum_{\pi}\mathbb{1}\{a_{\pi}<0\} \\[4pt] S_{6}(A,p) &= \displaystyle\sum_{\pi}\mathbb{1}\{p_{\pi}<0.05\} \\[4pt] S_{7}(A,p) &= \displaystyle\sum_{\pi}\mathbb{1}\{a_{\pi}>0\}\, \mathbb{1}\{p_{\pi}<0.05\} \\[4pt] S_{8}(A,p) &= \displaystyle\sum_{\pi}\mathbb{1}\{a_{\pi}<0\}\, \mathbb{1}\{p_{\pi}<0.05\} \end{array}\] and evaluate each \(S_{k}\) on the full-sample vector \(\hat{\omega}_{\pi}\) and on every bootstrap replicate \(\hat{\omega}_{\pi}^{(b)}\).
Null \(p\)-values and model-class caveat: Compare each observed statistic with its null distribution; for the median, for example, \[p_{\text{med}} \;=\; \frac{1}{B}\sum_{b=1}^{B} \mathbb{1}\!\Bigl\{S_{1}\!\bigl(\hat{\omega}_{\pi}^{\star(b)}\bigr) > S_{1}\!\bigl(\hat{\omega}_{\pi}\bigr)\Bigr\},\] with the analogous comparison made for each of the summary functionals \(S_{k}\), \(k\in\{1,\dots,8\}\); a minimal sketch of these curve-level computations follows this list.
These null \(p\)-values are valid under the ideal i.i.d. setting assumed by ordinary least squares. For clustered fixed-effects or logistic specifications, the shared-row bootstrap may under- or over-estimate tail probabilities, because fixed-effects models violate the i.i.d. assumption across clusters and logit residuals are heteroskedastic and bounded. We therefore still report these \(p\)-values for completeness, but users should treat them as descriptive diagnostics rather than strict hypothesis tests in those two model classes.
Meta-analytic Stouffer test: Calculate the combined \(Z\) score and its corresponding \(p\)-value: \[Z =\frac{1}{\sqrt{|\Pi|}} \sum_{\pi\in\Pi}\Phi^{-1}\!\bigl(1-p_{\pi}\bigr)\] and \[p_{\text{Stouffer}}=\Phi(-Z).\]
Outputs stored: All statistics—point estimates, counts, null p-values and \((Z,p_{\text{Stouffer}})\)—are saved to OLSResult.inference
for use in .summary()
and
visualisations.
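For intuition, the following sketch illustrates the curve-level comparison for the median functional and the Stouffer combination; the arrays of observed coefficients, p-values, and null-bootstrap coefficients are simulated placeholders rather than RobustiPy outputs.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Placeholder inputs: observed coefficients and p-values across |Pi| specifications,
# plus coefficients re-estimated on B bootstrap resamples drawn under the null.
n_specs, B = 64, 500
obs_coefs = rng.normal(0.2, 0.05, size=n_specs)        # hat(omega)_pi
obs_pvals = rng.uniform(0.001, 0.2, size=n_specs)      # p_pi
null_coefs = rng.normal(0.0, 0.05, size=(B, n_specs))  # hat(omega)*_pi^(b)

# Curve-level functional S_1 (the median), evaluated on the observed curve and on
# every null replicate; the empirical p-value is the share of more extreme draws.
s_obs = np.median(obs_coefs)
s_null = np.median(null_coefs, axis=1)
p_median = np.mean(s_null > s_obs)

# Stouffer's meta-analytic combination of the per-specification p-values.
z = norm.ppf(1 - obs_pvals).sum() / np.sqrt(n_specs)
p_stouffer = norm.cdf(-z)

print(f"null p-value for the median: {p_median:.3f}")
print(f"Stouffer Z = {z:.2f}, p = {p_stouffer:.4g}")
```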
When the number of admissible outcome composites \(\Pi_{\overleftrightarrow{\mathbf{Y}}}\) or covariate sets \(\Pi_{\overleftrightarrow{\mathbf{Z}}}\) is very large, estimating
every model in \(\Pi=\Pi_{\overleftrightarrow{\mathbf{Y}}}\!\times\! \Pi_{\overleftrightarrow{\mathbf{X}}}\!\times\! \Pi_{\overleftrightarrow{\mathbf{Z}}}\!\times\! \Pi_{\overleftrightarrow{F}}\) can be
computationally prohibitive. RobustiPy
therefore lets the user draw a random subset of specifications before any fitting begins.
Determine target sizes. Choose integers \(m_{Y}\le |\Pi_{\overleftrightarrow{\mathbf{Y}}}|\) and \(m_{Z}\le |\Pi_{\overleftrightarrow{\mathbf{Z}}}|\), the numbers of outcome and covariate options you wish to keep (the defaults are the full sizes).
Uniform sub-sampling. Independently sample, without replacement, \[\tilde{\Pi}_{\overleftrightarrow{\mathbf{Y}}} \subset \Pi_{\overleftrightarrow{\mathbf{Y}}}, \qquad |\tilde{\Pi}_{\overleftrightarrow{\mathbf{Y}}}|=m_{Y},\] \[\tilde{\Pi}_{\overleftrightarrow{\mathbf{Z}}} \subset \Pi_{\overleftrightarrow{\mathbf{Z}}}, \qquad |\tilde{\Pi}_{\overleftrightarrow{\mathbf{Z}}}|=m_{Z},\] using NumPy’s fast random sampler.
Construct the reduced model space. The analysis then proceeds on \[\tilde{\Pi} \;=\; \tilde{\Pi}_{\overleftrightarrow{\mathbf{Y}}} \times \Pi_{\overleftrightarrow{\mathbf{X}}} \times \tilde{\Pi}_{\overleftrightarrow{\mathbf{Z}}} \times \Pi_{\overleftrightarrow{F}},\] which typically cuts runtime by orders of magnitude while still giving a representative picture of the full specification curve.
The random seeds are user-controllable, ensuring that sub-samples—and therefore results—are fully replicable.
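A minimal sketch of this sub-sampling step, using a seeded NumPy generator and illustrative choice sets, is given below; the option lists and target sizes are placeholders, and the fixed-effects dimension is omitted for brevity.

```python
import numpy as np
from itertools import combinations, product

rng = np.random.default_rng(42)  # a user-controllable seed keeps the draw replicable

# Illustrative choice sets: all non-empty outcome composites and all control subsets.
y_items = ["y1", "y2", "y3"]
z_items = ["z1", "z2", "z3", "z4", "z5"]
y_options = [c for r in range(1, len(y_items) + 1) for c in combinations(y_items, r)]
z_options = [c for r in range(len(z_items) + 1) for c in combinations(z_items, r)]

# Target sizes m_Y and m_Z (the defaults would simply be the full sizes).
m_y, m_z = 5, 10

# Uniform sub-sampling without replacement along each choice dimension.
y_sub = [y_options[i] for i in rng.choice(len(y_options), size=m_y, replace=False)]
z_sub = [z_options[i] for i in rng.choice(len(z_options), size=m_z, replace=False)]

# Reduced model space: the Cartesian product of the sub-sampled dimensions (the
# focal predictors and any fixed-effects options would be crossed in as well).
reduced_space = list(product(y_sub, z_sub))
print(f"{len(reduced_space)} specifications instead of "
      f"{len(y_options) * len(z_options)}")
```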
For correspondence: Daniel Valdenegro and Charles Rahal, Leverhulme Centre for Demographic Science and the Centre for Care, Oxford Population Health, University of Oxford, OX1 1JD, United Kingdom. Tel: 01865 286170. Email: daniel.valdenegro@demography.ox.ac.uk and charles.rahal@demography.ox.ac.uk. Code Availability Statement: The extensive code library which accompanies this work can be found at github.com/RobustiPy and is indexed on Zenodo as ‘Online Supplementary Material’, available as [1]. Documentation is available at https://robustipy.readthedocs.io/. Data Availability Statement: All but two of our empirical examples utilise publicly available information; the remaining two require an account with the UK Data Service. The authors declare no conflicts of interest and are grateful for comments on earlier versions of the work from participants at the International Conference on Computational Social Science, the British Society for Population Studies, the European Consortium for Sociological Research, and the Population Association of America, as well as for internal feedback received from the Leverhulme Centre for Demographic Science, Reproducible Research Oxford, and the Pandemic Sciences Institute. We also thank all attendees at a previous hackathon hosted at the University of Oxford in June 2024. The authors are grateful for funding from the ESRC (an ESRC Large Centre grant), the Leverhulme Trust (Grant RC-2018-003) for the Leverhulme Centre for Demographic Science, and a Grand Union DTP ESRC studentship. All errors remain our own.
This space is usually as sparse as a single ‘baseline’ model and a few scant robustness checks – which coincidentally always seem to concur with the results from the baseline – at the behest of peer reviewers; see [42] for a quantification of this at two leading Sociology journals.
We call this the ‘Vanilla’ computation in Sections 1.3.1 and 2.1.
RobustiPy explicitly asks the user how many of their machine’s CPU cores they want to use for the analysis (the ‘n_cpu’ option to the .fit() command, see below). If the number of cores is not specified, the user is first asked whether one fewer than the maximum number of available cores is permissible, and, if not, how many cores they want to use.
By default, for the case of a single dependent variable, the ‘full’ model (the one containing all \(\overleftrightarrow{\mathbf{Z}}\) covariates) and the ‘null’ model (which contains only the smallest guaranteed model space) are always highlighted in the relevant visual analysis. When more than one dependent variable is specified – see Section 1.3.5 – RobustiPy highlights the model containing only the first dependent variable in the list and the model incorporating all dependent and independent variables.
All of these options can be varied, along with additional ones; see Section 4 and the accompanying RobustiPy documentation for further details.
If no number of draws, number of folds, seed, out-of-sample metric, or number of CPUs is specified, the user is prompted to enter valid options via a system input.
We are constantly looking for suggestions regarding new features to incorporate. Please contact the corresponding authors to discuss!