Robustly estimating heterogeneity in factorial data using Rashomon Partitions


1 Introduction

“You didn’t come here to make the choice, you’ve already made it. You’re here to try to understand why you made it. I thought you’d have figured that out by now.”

— The Oracle, The Matrix Reloaded

We prove a number of appealing results and provide meaningful new insights in three data examples. We propose an \(\ell_0\) prior on the complexity of models, which does not require specifying the correlation structure between variable effects a priori, a task that can otherwise be challenging in complex settings. We show that this prior, in contrast to typical methods that require strong assumptions to identify heterogeneity in settings with correlated effects, is minimax optimal. We characterize the approximation error of (functions of) parameters computed conditional on being in the RPS relative to the entire posterior. Finally, we propose conditions that restrict the set of candidate models to only those that are unique and interpretable, and, under these conditions, derive bounds on the size of the RPS. We propose an algorithm to enumerate the models in the RPS and demonstrate its potential to deliver new scientific insights in three data examples from diverse applied domains. In all three examples, we reach conclusions that diverge from the extant literature.

In the remainder of this section, we provide more background on our approach and then give a road map for the rest of the paper.

[standard BMA] does not accurately represent model uncertainty. Science is an iterative process in which competing models of reality are compared on the basis of how well they predict what is observed; models that predict much less well than their competitors are discarded. Most of the models in [standard BMA] have been discredited [...] so they should be discarded.

While [1] use this logic to justify a stochastic search that favors high posterior models, our approach, in contrast, uses the MAP as an anchor and then enumerates the set of models with similar posterior density.

Consider a simple experiment with two medications (Amoxicillin, Ibuprofen), each given at two dosages. Every patient is already taking both medications and the difference lies in the dosage (high or low for both medications). Nodes in the Hasse diagram correspond to combinations of levels of each factor, which we call feature combinations. The first entry in each node corresponds to the dosage of Amoxicillin (250mg or 500mg) and the second to the dosage of Ibuprofen (200mg or 400mg).

Figure 1: Running example. Each node in the Hasse diagram corresponds to a specific level of each factor. Pools aggregate across nodes.

We also impose permissibility conditions that limit our search to the space of interpretable models.

In constructing our permissibility rules, we think of the world in increments: how changing a level in one variable marginally affects the outcome. The overall effect of some feature combination, then, comes from adding these marginal effects. This perspective leads us to define variants: feature combinations that are otherwise identical but differ in a single feature by a single intensity value, taking one step up the Hasse diagram. In addition to restricting ourselves to partitions that identify marginal effects, we also rule out partitions that correspond to measure zero events.

To complement these permissibility restrictions, we use a prior that controls the complexity of models. In our running example, we have a low dimensional Hasse diagram in a context with well-known biological mechanisms, so we can speculate with high confidence about the impacts of moving from one level of each feature to another. The Hasse diagram allows us to explore complex interactions between factors, however, making it extremely difficult to supply a prior over the covariance between all of these marginal effects. What’s more, these marginal effects are rarely independent. As discussed previously, for example, the effect of taking more Ibuprofen with a low level of Amoxicillin is not independent of the effect of taking more Ibuprofen at a higher level of Amoxicillin. This situation, of course, becomes much more complicated as we add more features. To address this, we use an \(\ell_0\) prior, which does not impose any structure on the relationship between the variable effects. We show this prior is minimax optimal. When combined, our prior and our restrictions on the space of partitions, both motivated scientifically, substantially improve computational efficiency, making it possible to show that the RPS is bounded in size polynomially (in the number of features and levels) and to enumerate the entire RPS in realistic data examples.

The remainder of the paper is structured as follows. In Section 2, we define the RPS formally. Then, we give statistical properties for an arbitrary set of partitions and introduce our minimax optimal \(\ell_0\) penalty in Section 3. We then give a formal definition of our robust partition structure in Section 4. We show that this combination allows us to bound the size of the RPS in Section 5 and enumerate it entirely in Section 6. Section 7 provides simulation evidence and Section 8 gives three empirical examples, highlighting robust archetypes in each setting. Finally, Sections 9 and 10 provide a discussion of related work and future directions, respectively. All of our code is available at https://github.com/AparaV/rashomon-partition-sets.

2 Rashomon Partition Sets

Suppose that there are \(n\) units (or individuals) and each has \(M\) features. Features are general and can be properties or allocations of treatment status in an experiment. The feature matrix is given by \(\mathbf{X}_{1:n, 1:M}\) and outcomes are \(\mathbf{y}\in \mathbb{R}^n\). Every feature has \(R\) possible values, partially ordered. Let \(\mathcal{K}\) be the set of all \(K=R^M\) unique feature combinations. Depending on the context, we let \(k\) denote the vector of feature values or its index in \(\mathcal{K}\). Each combination of features \(k \in \mathcal{K}\) can be represented in a dummy binary matrix \(\mathbf{D}\) with entries \(D_{ik}=1\) if and only if observation \(i\) has feature combination \(k\). The dataset is \(\mathbf{Z}:= (\mathbf{y},\mathbf{X})\). So, the researcher studies \[\begin{align} \label{eq:mainreg} \mathbf{y}= \mathbf{D}\boldsymbol{\beta}+ \boldsymbol{\zeta}, \end{align}\tag{1}\] where \(\beta_k = \mathbb{E}[Y_i \mid D_{ik} = 1]\) is the expected outcome in the population given the feature combination and \(\zeta_i\) is some idiosyncratic mean-zero residual. A pool, \(\pi\), is a set of feature combinations having identical expected outcomes. Formally, we have:

Definition 1 (Pool). A pool, \(\pi \subseteq \mathcal{K}\), is a set of feature combinations with identical expected outcomes: \(\beta_k = \beta_{k^{\prime}}\) for all \(k, k^{\prime} \in \pi\).

A partition, \(\Pi\), is a set of pools such that every observation is assigned to a single pool. Observations in the same pool have the same expected outcome. Identifying heterogeneity in outcomes can be thought of as a search across possible partition structures.

Definition 2 (Partition).

The partition, \(\Pi\), in the space of all partitioning models, \(\mathcal{P}\), is a model of heterogeneity such that for every pool \(\pi \in \Pi\), possibly a singleton, if feature combinations \(k,k'\in \pi\), then \(\beta_k = \beta_{k'}\). The posterior given the data \(\mathbf{Z}\) is \(\mathbb{P}(\Pi \mid \mathbf{Z})\). Let \(\mathcal{P}^{\star} \subseteq \mathcal{P}\) be the set of permissible partitions that obey some permissibility rules (to be defined in 4).
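To fix ideas, here is a minimal sketch, in Python, of how a feature matrix \(\mathbf{X}\) maps to the dummy matrix \(\mathbf{D}\) in 1 for the running example; the variable names are ours and purely illustrative, not taken from the released code.

```python
import itertools
import numpy as np

# Running example: M = 2 features (Amoxicillin, Ibuprofen), R = 2 levels each,
# so K = R^M = 4 unique feature combinations.
levels = [(250, 500), (200, 400)]                 # dosage levels of each feature
combos = list(itertools.product(*levels))         # the set of feature combinations, K
combo_index = {k: j for j, k in enumerate(combos)}

# Toy feature matrix X (n x M): each row is one patient's dosage pair.
X = np.array([[250, 200], [250, 400], [500, 400], [500, 400]])

# Dummy matrix D (n x K): D[i, k] = 1 iff observation i has feature combination k,
# so that y = D @ beta + zeta as in (1).
D = np.zeros((X.shape[0], len(combos)), dtype=int)
for i, row in enumerate(X):
    D[i, combo_index[tuple(row)]] = 1
```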

Definition 3 (Rashomon Partition Set (RPS)). For some posterior probability threshold \(\tau \in (0,1)\), the Rashomon Partition Set relative to a reference partition \(\Pi_0\), \(\mathcal{P}_\tau(\Pi_0)\), is \[\begin{align} \mathcal{P}_{\tau}(\Pi_0) = \{ \Pi \in \mathcal{P}^{\star} : \ \mathbb{P}(\Pi \mid \mathbf{Z}) \geq (1-\tau) \cdot \mathbb{P}(\Pi_0 \mid \mathbf{Z}) \} .\label{eq:RPS} \end{align}\tag{2}\]

The RPS relative to \(\Pi_0\) is the set of partitions that have a similar or higher posterior value than the reference. In our analysis, we are interested in \(\Pi^{\text{MAP}}\)—the maximum a posteriori (MAP) partition, so we will focus on \(\mathcal{P}_\tau(\Pi^{\text{MAP}})\). That is, in our setting the RPS is the set of partitions whose posterior is sufficiently close to that of the MAP partition. We write this as \(\mathcal{P}_\tau\), dropping the reference argument unless explicitly needed. We could, more generally, define an RPS based on any posterior threshold, \(\theta\), such that \(\mathcal{P}_{\theta} = \{ \Pi \in \mathcal{P}^{\star} : \ \mathbb{P}(\Pi \mid \mathbf{Z}) \geq \theta \}\). To construct the RPS, we begin with an initialization partition, then enumerate all models with posteriors at least as high as this initialization partition (which by definition includes \(\Pi^{\text{MAP}}\)). We then construct the RPS by moving down the list of partitions, ordered by posterior value, until we reach \((1-\tau)\mathbb{P}(\Pi^{\text{MAP}}|\mathbf{Z})\).
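As a minimal sketch of Definition 3, assuming we have already enumerated the candidate partitions and computed their (possibly unnormalized) posterior values; the function name is ours:

```python
def rashomon_partition_set(posteriors, tau):
    """posteriors: dict mapping each candidate partition (any hashable
    representation) to its posterior value P(Pi | Z); unnormalized values
    are fine because the threshold is relative to the MAP.
    Returns the RPS relative to the MAP partition, as in Definition 3."""
    map_value = max(posteriors.values())
    threshold = (1.0 - tau) * map_value
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    return [pi for pi, p in ranked if p >= threshold]
```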

Given a posterior over the partition models, we have a posterior over the effects of various feature combinations, possibly pooled, on the outcome of interest conditional on the partition models in this set. So \[\begin{align} \label{eq:conditional-exp-RPS} \mathbb{P}(\boldsymbol{\beta}\mid \mathbf{Z}, \mathcal{P}_\tau) = \sum_{\Pi \in \mathcal{P}_{{\tau} }} \mathbb{P}(\boldsymbol{\beta}\mid \mathbf{Z}, \Pi )\mathbb{P}(\Pi \mid \mathbf{Z},\mathcal{P}_{{\tau }}), \end{align}\tag{3}\] and analogously for measurable functions of \(\boldsymbol{\beta}\). Using Hasse diagrams and our permissibility criteria, defined below, avoids imposing an artificial hierarchy on the partially ordered set of covariates, which allows us to interpret the partitions in the RPS.1 The posterior for \(\boldsymbol{\beta}\) restricted to the RPS, 3, is, of course, not the same as the distribution over all possible partitions, \(P_{\boldsymbol{\beta}\mid \mathbf{Z}} (\boldsymbol{\beta})\). We can, however, characterize the uniform approximation error of the posterior distribution of \(\boldsymbol{\beta}\), and measurable functions of it, restricting to the RPS (Theorem 1 in Section 3). This result does not depend on a specific prior over partitions, but we do need to specify a prior to find the RPS in practice. We propose an \(\ell_0\) prior that assumes only sparsity in heterogeneity and is robust to any potential correlation structure between covariates. We show that the \(\ell_0\) prior is minimax optimal (Theorem 2). When combined with our permissible partition rules (Section 4), we can calculate bounds on the size of the RPS (Theorem 3) and enumerate the entire RPS ([alg:r-aggregate] and Theorem 6) in Section 6.

3 Statistical properties of Rashomon Partition Sets

We elaborate on the statistical framework underlying the RPS. We give results for the general definition of the RPS with respect to some threshold, \(\theta\), though in practice we define the RPS relative to the MAP, setting \(\theta=(1-\tau)\cdot\mathbb{P}(\Pi^{\text{MAP}}|\mathbf{Z})\), i.e., \(\mathcal{P}_{\tau}(\Pi^{\text{MAP}}) = \mathcal{P}_{\theta}\). We first obtain posteriors over the (functions of) effects of feature combinations. Given a unique partition \(\Pi\) with some probability \(\mathbb{P}(\Pi \mid \mathbf{Z})\), it may be useful to know the likely effects of a specific feature combination \(k \in \pi \in \Pi\). This could be because, for instance, there may be scientific reasons to otherwise prefer one versus the other, heterogeneity in costs, logistical considerations, etc. There is also a statistical reason: the posterior may not be concentrated on just a few pools for some \(k\) but may be for others. Therefore, we may be interested in the posterior over the entire set of permissible pools \[\begin{align} P_{\boldsymbol{\beta}\mid \mathbf{Z}} (\boldsymbol{\beta}) = \sum_{\Pi \in \mathcal{P}^{\star}} P_{\boldsymbol{\beta}\mid \mathbf{Z}} (\boldsymbol{\beta}\mid \Pi) \cdot \mathbb{P}(\Pi \mid \mathbf{Z}), \end{align}\] where \(P_{\boldsymbol{\beta}\mid \mathbf{Z}}\) is the distribution function of \(\boldsymbol{\beta}\mid \mathbf{Z}\). Throughout our analysis, we will assume that \(P_{\boldsymbol{\beta}\mid \mathbf{Z}}\) is a proper distribution, i.e., it satisfies the Kolmogorov axioms. Our goal is to approximate functions of \(P_{\boldsymbol{\beta}\mid \mathbf{Z}}\) using only the RPS. That is, \[\begin{align} P_{\boldsymbol{\beta}\mid \mathbf{Z}, \mathcal{P}_\theta} (\boldsymbol{\beta}) = \sum_{\Pi \in \mathcal{P}_\theta} P_{\boldsymbol{\beta}\mid \mathbf{Z}, \mathcal{P}_\theta} (\boldsymbol{\beta}\mid \Pi) \cdot \mathbb{P}(\Pi \mid \mathbf{Z},\mathcal{P}_\theta), \qquad \mathbb{P}(\Pi \mid \mathbf{Z},\mathcal{P}_\theta) &= \frac{\mathbb{P}(\Pi \mid \mathbf{Z})}{\sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}} \mathbb{P}(\Pi^{\prime} \mid \mathbf{Z})}, \end{align}\] meaning that the approximation only evaluates models in the RPS and is also normalized over the RPS. The quality of this approximation, of course, depends on both the shape of the posterior (i.e., how concentrated the posterior is around the highest probability models) and the structure of the RPS. Our first goal, then, is to describe how well we can approximate \(P_{\boldsymbol{\beta}\mid \mathbf{Z}}\) using the RPS. Then, we discuss how to construct the posterior over partitions, \(\mathbb{P}(\Pi \mid \mathbf{Z})\), using generalized Bayesian inference. Technical details for results discussed here are deferred to Appendix 13.

3.1 Posterior over effects.

Theorem 1 (Rashomon approximation of posterior effects).

Setting \(f(\boldsymbol{\beta}) = \boldsymbol{\beta}\) recovers the posterior of \(\boldsymbol{\beta}\). \(f\) also covers other useful quantities derived from the vector \(\boldsymbol{\beta}\). One example is \(f(\boldsymbol{\beta}) = \max_k \beta_k\): conditional on a given feature combination \(k\) being estimated as the one with the maximum effect, there is a winner’s curse, since the selection of the maximum is positively biased, and the bias needs to be corrected (the posterior needs to be adjusted to have a lower mean) to undo this effect [2]. Other examples include the variability of outcomes across the feature combinations, \(\left\lVert \boldsymbol{\beta}- \sum_K \beta_k/K\right\rVert_2^2\), and quantiles of the expected outcome distribution.

We now focus specifically on estimating the full posterior mean, \(\mathbb{E}_{\Pi} \boldsymbol{\beta}= \sum_{\Pi \in \mathcal{P}^{\star}} \boldsymbol{\beta}_{\Pi} \mathbb{P}(\Pi \mid \mathbf{Z})\), using the RPS. If we simply restricted ourselves to only models in the RPS, we would have \(\mathbb{E}_{\Pi, \mathcal{P}_{\theta}} \boldsymbol{\beta}= \sum_{\Pi \in \mathcal{P}_{\theta}} \boldsymbol{\beta}_{\Pi} \mathbb{P}(\Pi \mid \mathbf{Z})\). For some priors on \(\boldsymbol{\beta}\), we could approximate \(\mathbb{P}(\Pi \mid \mathbf{Z})\), but this requires specifying a prior on \(\boldsymbol{\beta}\) and a corresponding approximation with adequate accuracy (Appendix 12.1 gives an example using Gaussian priors). More generally, the easiest way to compute the expectation is by taking the mean of the effects weighted by the self-normalized posterior probabilities as in 4. Of course, if the RPS captures most of the posterior density, then this method can validly approximate the posterior mean, but that is not the goal of this estimator. This estimator simply tells us what the effects are across the RPS. \[\begin{align} \mathbb{E}_{\Pi \mid \mathcal{P}_{\theta}} \boldsymbol{\beta}&= \sum_{\Pi \in \mathcal{P}_{\theta}} \boldsymbol{\beta}_{\Pi} \frac{\mathbb{P}(\Pi \mid \mathbf{Z}, \mathcal{P}_{\theta})}{\sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}} \mathbb{P}(\Pi^{\prime} \mid \mathbf{Z},\mathcal{P}_{\theta})} \label{eq:conditional-effects-RPS}. \end{align}\tag{4}\] We can then characterize the quality of this approximation for a given RPS construction.

Corollary 1.

The behavior of the error in Corollary 1 can be studied similarly to that in Theorem 1. This result extends to functions of \(\boldsymbol{\beta}\) as well.
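As a sketch of the estimator in 4: given the \(\boldsymbol{\beta}\) vector implied by each partition in the RPS and the matching posterior values, the RPS-restricted expectation is a self-normalized weighted average, and the same weights apply to any measurable \(f(\boldsymbol{\beta})\). The function and argument names below are ours and purely illustrative.

```python
import numpy as np

def rps_expectation(betas, posteriors, f=lambda b: b):
    """betas: list of beta vectors, one per partition in the RPS.
    posteriors: matching list of (possibly unnormalized) posterior values.
    f: measurable function of beta, e.g. lambda b: b.max() for the best
    feature combination."""
    w = np.asarray(posteriors, dtype=float)
    w = w / w.sum()                                  # self-normalize over the RPS
    vals = np.array([f(np.asarray(b, dtype=float)) for b in betas])
    return np.tensordot(w, vals, axes=1)             # weighted average over the RPS
```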

3.2 Posterior over partitions.

Now we turn to the proverbial elephant in the room: constructing a tractable posterior for \(\Pi\), \(\mathbb{P}(\Pi \mid \mathbf{Z}) \propto \mathbb{P}(\mathbf{y}\mid \mathbf{D},\Pi) \cdot \mathbb{P}(\Pi)\). We need to model the likelihood component and the prior. Nested within \(\mathbb{P}(\mathbf{y}\mid \mathbf{D},\Pi)\) is a prior over \(\boldsymbol{\beta}\). In work on Bayesian tree models (e.g., [3]), the typical strategy is to define a prior over partitions and then define conjugate priors on \(\boldsymbol{\beta}\) and related hyperparameters so that it is easy to evaluate each draw from the distribution over trees. We take a different approach based on generalized Bayesian inference [4], which requires specifying fewer distributions explicitly. However, in Appendix 12.2, we also give an example of a fully specified Bayesian model where maximizing the posterior is equivalent to minimizing the loss we define below, drawing a parallel with previous work in the Bayesian literature. Let \(\exp\{ -\mathcal{L}(\Pi;\mathbf{Z}) \}\) be the likelihood of \(\mathbf{Z}\), where \(\mathcal{L}(\Pi; \mathbf{Z})\) is the loss incurred by the partition \(\Pi\). Let \(\exp \{ -\lambda H(\Pi) \}\) be the prior over \(\mathcal{P}^{\star}\); then \[\begin{align} \mathbb{P}(\Pi \mid \mathbf{Z}) &\propto \exp \{ -\mathcal{L}(\Pi;\mathbf{Z}) \} \cdot \exp \{ -\lambda H(\Pi) \} =: \exp\{ -Q(\Pi) \}. \label{eq:objective_fcn} \end{align}\tag{5}\] Specifically, we use the mean-squared error for the loss function, \[\begin{align} \mathcal{L}(\Pi; \mathbf{Z}) &= \frac{1}{n} \sum_{\pi \in \Pi} \sum_{k(i) \in \pi} (y_i - \widehat{\mu}_{\pi})^2, \qquad \widehat{\mu}_{\pi} = \frac{\sum_{k(i) \in \pi} y_i}{ \sum_{k(i) \in \pi} 1}. \label{eq:mse-and-predictions} \end{align}\tag{6}\]

We define a prior over the number of distinct pools, i.e., \(H(\Pi) \propto \left\lvert\Pi\right\rvert\), the size of the partition. The prior plays a regularizing role, putting more weight on less granular aggregations. It corresponds to the \(\ell_0\) penalty: conditional on the number of pools in a partition, all permissible partitions are equally likely. In 2, we show that our key algorithm, yet to be described, can work with any non-negative error function and a penalty (prior) that is increasing in \(\left\lvert\Pi\right\rvert\). Critically, this prior does not impose a particular correlation structure on the association between possible pools. In most applications, it will be extremely difficult, if not impossible, to fully specify a prior over all possible associations in a factorial space. A researcher might have a prior that there is similarity between adjacent values of a single variable (e.g., the outcome of a medium dose of a medication is perhaps more similar to that of a high dose than the outcome of a low dose is). When we allow for interactions (particularly higher order interactions) between features, these associations become much more difficult to intuit a priori, all the while the space of possible prior values explodes. Our prior allows the researcher to exert a preference for more parsimonious models without specifying a distribution over all interactions.
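A sketch of the resulting objective in 5 and 6, assuming each observation carries a pool label under the candidate partition; the helper names are ours, not the paper's released code.

```python
import numpy as np

def neg_objective(y, pool_labels, lam):
    """Return -Q(Pi) = -(mean-squared error + lambda * |Pi|), so that
    P(Pi | Z) is proportional to exp(-Q(Pi)) as in (5).
    pool_labels[i] identifies the pool of observation i under Pi."""
    y = np.asarray(y, dtype=float)
    labels = np.asarray(pool_labels)
    loss = 0.0
    pools = np.unique(labels)
    for pool in pools:                    # within-pool squared error around mu_hat, eq. (6)
        y_pool = y[labels == pool]
        loss += np.sum((y_pool - y_pool.mean()) ** 2)
    loss /= y.size
    penalty = lam * pools.size            # l0 prior: penalize the number of pools
    return -(loss + penalty)
```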

The RPS, taken together with the \(\ell_0\) penalty, is similar in spirit to the Occam’s Window approach used in the context of Bayesian Model Averaging by [1] and [5]. These papers use a stochastic search over the discrete space of models that ultimately results in a set of high posterior models, discarding more complicated models when simpler models are found to have higher posterior probability. Our approach, which does not do discrete model averaging, formalizes this notion by including a prior with an \(\ell_0\) penalty as part of the model, rather than using it to guide the search. In Theorem 2 we show that this choice of prior is minimax optimal.

Let \(\mathcal{Q}\) be a family of priors for some expected outcome \(\boldsymbol{\beta}\). For any prior \(Q \in \mathcal{Q}\), denote the posterior over \(\boldsymbol{\beta}\) given some data \(\mathbf{Z}\) as \(\textrm{P}_{Q, \mathbf{Z}}\), i.e., \[\begin{align} \textrm{P}_{Q, \mathbf{Z}}(\boldsymbol{\beta}) &= \mathbb{P}(\boldsymbol{\beta}\mid \mathbf{Z}, \boldsymbol{\beta}\sim Q) = \frac{\mathbb{P}(\mathbf{y}\mid \mathbf{X}, \boldsymbol{\beta}) Q(\boldsymbol{\beta})}{\mathbb{P}( \mathbf{y}\mid \mathbf{X})}. \end{align}\]

Let us fix the sparsity at \(h\): there are \(h\) distinct pools in the partition. Define the restricted space of partitions as \(\mathcal{P}_{\mid h} = \{\Pi \in \mathcal{P}^{\star} : H(\Pi) = h\}\). Let \(N(h) = \left\lvert \mathcal{P}_{\mid h}\right\rvert\). The \(\ell_0\) penalty imposes a sparsity restriction on the number of pools. Therefore, at a fixed sparsity \(h\), the \(\ell_0\) penalty corresponds to a uniform prior over \(\mathcal{P}_{\mid h}\). Denote the \(\ell_0\) prior as \(P_{\ell_0}\). So for any \(\Pi \in \mathcal{P}_{\mid h}\), \(P_{\ell_0}(\Pi) = 1 / N(h)\).

For any given \(\boldsymbol{\beta}\), there is a corresponding permissible partition \(\Pi_{\boldsymbol{\beta}} \in \mathcal{P}^{\star}\). Then we can define \(\mathcal{Q}_{\mid h}\) to be the family of priors for the restricted space of \(\boldsymbol{\beta}\) such that \(\Pi_{\boldsymbol{\beta}} \in \mathcal{P}_{\mid h}\). Let \(\mathcal{Q}_{\mathcal{P} \mid h}\) denote the family of priors, derived from \(\mathcal{Q}_{\mid h}\), over partitions in \(\mathcal{P}_{\mid h}\). We can traverse from \(\mathcal{Q}_{\mid h}\) to \(\mathcal{Q}_{\mathcal{P} \mid h}\) by noticing that for a given \(\boldsymbol{\beta}\), there is a corresponding permissible partition \(\Pi_{\boldsymbol{\beta}} \in \mathcal{P}\), i.e., for any prior \(Q \in \mathcal{Q}_{\mid h}\), we can define a prior over \(\mathcal{P}_{\mid h}\), \[\begin{align} Q_{\mathcal{P} \mid h}(\Pi) &= \int_{\boldsymbol{\beta}} \mathbb{I}(\Pi_{\boldsymbol{\beta}} = \Pi) Q(\boldsymbol{\beta}) d\boldsymbol{\beta}, \quad \Pi \in \mathcal{P}_{\mid h}. \end{align}\] For reference, we define the supports for various priors in Table 2.

For two priors \(P, Q \in \mathcal{Q}_{\mathcal{P} \mid h}\), define the total variation distance as \[\begin{align} \delta(P, Q) &= \sup_{\Pi \in \mathcal{P}_{\mid h}} \left\lvert\textrm{P}_{P, \mathbf{Z}}(\Pi) - \textrm{P}_{Q, \mathbf{Z}}(\Pi)\right\rvert. \end{align}\]

Theorem 2. For a given sparsity \(h\), the \(\ell_0\) penalty is minimax optimal in the sense that \[\begin{align} \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \delta(\textrm{P}_{P_{\ell_0}, \mathbf{Z}}, \textrm{P}_{Q, \mathbf{Z}}) &= \inf_{P \in \mathcal{Q}_{\mathcal{P} \mid h}} \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \delta(\textrm{P}_{P, \mathbf{Z}}, \textrm{P}_{Q, \mathbf{Z}}). \end{align}\]

In other words, if one is unwilling to commit to some specific correlation structure for the model coefficients, the \(\ell_0\) penalty, which puts a prior on the number of selected features, is optimal for model selection.

3.3 Loss.

It is also useful to characterize the Rashomon threshold in the loss space. Specifically, we define \(\theta := \theta(q_0, \epsilon) = e^{-q_0(1+\epsilon)}/c\) where \(c := c(\mathbf{Z})\) is some scaling constant depending on the observed data, \(q_0 = Q(\Pi_0)\) is the loss incurred by some good model \(\Pi_0\), and \(\epsilon\) is the largest acceptable deviation from \(q_0\). Then, the \((q_0,\epsilon)\)-Rashomon Partition Set (RPS), \(\mathcal{P}_{q_0,\epsilon}\), is defined as \[\mathcal{P}_{q_0,\epsilon} = \{ \Pi \in \mathcal{P}^{\star} : \ \mathbb{P}(\Pi \mid \mathbf{Z}) \geq e^{-q_0(1+\epsilon)}/c \}.\] This allows us to interpret Rashomon partitions with respect to a reference partition or pooling. Without any context, it is difficult to choose a threshold \(\theta\). However, if we have some good reference model \(\Pi_0\), then we can pick a threshold that is not much worse than the performance of \(\Pi_0\). In particular, using a reference \(\Pi_0\), we can define a measure of performance of model \(\Pi\) with respect to \(\Pi_0\) as \[\begin{align} \xi(\Pi,\Pi_0) &= \frac{\log \mathbb{P}(\Pi \mid \mathbf{Z}) - \log \mathbb{P}(\Pi_0 \mid \mathbf{Z})}{\log \mathbb{P}(\Pi_0 \mid \mathbf{Z}) + \log \mathbb{P}(\mathbf{y}\mid \mathbf{X})} = \frac{Q(\Pi) - Q(\Pi_0)}{Q(\Pi_0)} = \frac{Q(\Pi) - q_0}{q_0}. \end{align}\] \(\xi\) is essentially the log-likelihood ratio of the two models weighted by the log-likelihood of the reference model. Since \(Q(\Pi) \geq 0\), when \(\Pi\) fits the data considerably better than the reference the measure approaches \(-1\). So when \(\Pi\) is a better fit than the reference, \(\xi(\Pi, \Pi_0) < 0\). Conversely, when \(\Pi\) is a poorer fit than the reference, \(\xi(\Pi, \Pi_0) > 0\). Note that if the two posteriors are identical, then \(\xi(\Pi, \Pi_0) = 0\).

Suppose we know that \(\Pi_0\) is a good model such that \(\mathbb{P}(\Pi_0 \mid \mathbf{Z}) = e^{-q_0}/c\). It makes sense to only look at partitions \(\Pi\) such that \(\xi(\Pi,\Pi_0) \leq \epsilon\) for some \(\epsilon > 0\). We show how to recover the RPS using \((q_0, \epsilon)\) in 1.2

Proposition 1. Fix \(\Pi_0 \in \mathcal{P}^{\star}\) and let \(q_0 = Q(\Pi_0)\). Then \(\mathcal{P}_{q_0,\epsilon} = \{\Pi \in \mathcal{P}^{\star} : \ \xi(\Pi,\Pi_0) \leq \epsilon \}\).

By construction, this result is almost trivial. However, Proposition 1 is very powerful because it circumvents the need to calculate the normalizing constant \(c(\mathbf{Z})\), which can be intractable. The loss tolerance, \(\epsilon\), is specified by the researcher. If the goal is to approximate the full posterior, then \(\epsilon\) should be chosen based on computational constraints, since adding more models to the RPS will improve the approximation. However, if the goal is scientific interpretation, we want to choose \(\epsilon\) such that the models in the RPS are all very high quality, since adding more models with little evidence in the data would dilute the results. We give an example of how to choose \(\epsilon\) in practice in Section 8.
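A minimal sketch of the membership check implied by Proposition 1, which needs only objective values \(Q(\Pi)\) and never the normalizing constant \(c(\mathbf{Z})\); function names are ours.

```python
def xi(q_pi, q_ref):
    """Relative performance of a partition with objective Q(Pi) = q_pi
    against a reference with q_ref = Q(Pi_0)."""
    return (q_pi - q_ref) / q_ref

def in_rps(q_pi, q_ref, eps):
    """Proposition 1: Pi is in the (q_0, eps)-RPS iff xi(Pi, Pi_0) <= eps,
    i.e. iff Q(Pi) <= q_0 * (1 + eps)."""
    return xi(q_pi, q_ref) <= eps
```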

In practice, our strategy has three steps. We first define a reference partition using an off-the-shelf algorithm. Second, we use this partition (which may or may not be in the final RPS) to get a sense of the magnitude of the loss and to enumerate an initial set in the neighborhood of the reference. By definition, this set includes the MAP. Third, we move down in posterior from the MAP until we reach the desired threshold. This gives us \(\mathcal{P}_{\tau}(\Pi^{\textrm{MAP}})\). We demonstrate this procedure with empirical examples in 8.

4 Permissible partitions

In this section, we describe our definition of a permissible partition, limiting ourselves to partitions that are interpretable and substantively meaningful, while also substantially reducing the computational burden. To begin, recall from 2 that we have \(M\) features taking on \(R\) discrete values and \(\mathcal{K}\) is the set of all \(K=R^M\) unique feature combinations. Therefore, we can partially order the feature combinations. For a feature combination \(k \in \mathcal{K}\), let \(k_m\) denote the value that the \(m\)-th feature takes. We say \(k \geq k^{\prime}\) if and only if \(k_m \geq k^{\prime}_m\) for all \(m = 1, \dots, M\). We say \(k > k^{\prime}\) if \(k \geq k^{\prime}\) but \(k \neq k^{\prime}\), and say that \(k\) and \(k^{\prime}\) are incomparable if there are two features \(m_1\) and \(m_2\) such that \(k_{m_1} > k^{\prime}_{m_1}\) and \(k_{m_2} < k^{\prime}_{m_2}\). We denote the expected outcome of feature combination \(k\) by \(\beta_k\). We will return to our running example from the introduction (see Figure 1).

4.1 Hasse diagrams.

We now formalize permissibility in terms of the geometry of the Hasse diagram. [6] introduced Hasse diagrams to identify heterogeneity in outcomes in the context of selecting a single model in the frequentist paradigm with an \(\ell_1\) penalty. We leverage this geometry to construct the RPS while also generalizing their implementation to address the strong assumptions the [6] method requires. We begin by formally defining the Hasse diagram.

Definition 4 (Hasse diagram). The Hasse diagram, \(\mathcal{H} = (\mathcal{K}, \mathcal{E})\), is a graph with nodes \(\mathcal{K}\) and edges \(\mathcal{E}\) relating the feature combinations through the partial ordering. Specifically, for two feature combinations \(k, k^{\prime} \in \mathcal{K}\), the edge \(\langle k, k^{\prime} \rangle \in \mathcal{E}\) if and only if \(k\) and \(k^{\prime}\) are variants (Definition 5). We will also denote the edge \(\langle k, k^{\prime} \rangle\) as \(e_{k, k^{\prime}}\).
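As a sketch of Definition 4 (using the variant relation formalized in Definition 5 below), the Hasse diagram can be built by connecting every feature combination to the combinations obtained by increasing a single feature by one level; the helper below is ours and purely illustrative.

```python
import itertools

def hasse_diagram(M, R):
    """Nodes are all R^M feature combinations (levels coded 0..R-1); an edge
    joins k and k' iff they are variants, i.e. identical except that one
    feature increases by exactly one level."""
    nodes = list(itertools.product(range(R), repeat=M))
    edges = []
    for k in nodes:
        for m in range(M):
            if k[m] + 1 < R:                         # one step up in feature m
                k_up = k[:m] + (k[m] + 1,) + k[m + 1:]
                edges.append((k, k_up))
    return nodes, edges

# Running example: M = 2 drugs, R = 2 dosage levels each -> 4 nodes and 4 edges.
nodes, edges = hasse_diagram(M=2, R=2)
```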

Figure 2: Hasse diagrams for the Amoxicillin and Ibuprofen example. Panels A and C show permissible partitions; Panel B shows a partition that is not permissible. To see why Panel B is not permissible, consider \(\pi_i = \{(250~\textrm{mg},400~\textrm{mg}), (500~\textrm{mg},400~\textrm{mg})\}\) and \(\pi_j = \{(500~\textrm{mg},200~\textrm{mg})\}\) with incomparable minima \((250~\textrm{mg},400~\textrm{mg}) \not\lessgtr(500~\textrm{mg},200~\textrm{mg})\).


4.2 Variants and Permissibility.

First, we imagine that feature combinations map to the physical world mechanistically through marginal changes: how changing the dosage of one drug marginally affects the outcome. To do this, we use variants, feature combinations that differ by only one level, as the finest-grained building blocks for partitions.

Definition 5 (Variants). Two feature combinations \(k < k^{\prime}\) are variants if they take the same values on all features but one, and differ in that feature by exactly one intensity value, i.e., \(k^{\prime}_{-m} = k_{-m}\) and \(k^{\prime}_m = k_m + 1\) for some feature \(m\).

Then one can think of the overall effect of some feature combination as the sum of these marginal effects. This perspective also means we can re-parameterize the overall outcome of each feature combination \(k\), \(\beta_k\), into a sum of marginals, which we denote by the vector \(\boldsymbol{\alpha}\). Then, we have, \[\begin{align} \beta_k &= \sum_{k^{\prime} \leq k; \rho(k) = \rho(k^{\prime})} \alpha_{k^{\prime}}. \label{eq:tva-climbing-up-hasse} \end{align}\tag{7}\] This says that the expected outcome of a feature combination is the sum of the expected marginal values leading up to it. For instance, the treatment \((500~\textrm{mg},400~\textrm{mg})\) has value \(\alpha_{500, 400} = (\beta_{500,400} - \beta_{500,200}) - (\beta_{250,400} - \beta_{250, 200})\). These marginals either capture a main effect of increasing a dosage (as on the sides of the Hasse diagram) or an interaction effect between multiple dosage increases (as in the interior of the Hasse diagram).
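A small numerical check of the re-parameterization in 7 for the running example; the \(\beta\) values below are made up purely for illustration.

```python
# Illustrative expected outcomes for each (Amoxicillin, Ibuprofen) dosage pair.
beta = {(250, 200): 1.0, (250, 400): 1.4, (500, 200): 1.3, (500, 400): 2.0}

# Marginal effects alpha obtained by differencing up the Hasse diagram.
alpha_base = beta[(250, 200)]                              # baseline combination
alpha_amox = beta[(500, 200)] - beta[(250, 200)]           # main effect: more Amoxicillin
alpha_ibu = beta[(250, 400)] - beta[(250, 200)]            # main effect: more Ibuprofen
alpha_inter = (beta[(500, 400)] - beta[(500, 200)]) \
            - (beta[(250, 400)] - beta[(250, 200)])        # interaction term from the text

# Eq. (7): summing the marginals below (500 mg, 400 mg) recovers its beta.
assert abs(beta[(500, 400)] - (alpha_base + alpha_amox + alpha_ibu + alpha_inter)) < 1e-12
```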

Second, we want to only consider partitions that represent robust effects. Specifically, we disregard partitions that are “measure zero,” which are non-robust in the sense that the only rationalization for these splits relies on exact marginal effects that offset in a specific way. Such partitions require tremendous coincidence. Since partitions are discrete objects, it makes sense to ignore any partition that may arise from such measure zero events. Other techniques such as Lasso or decision trees may falsely estimate a non-zero posterior probability for such unrealistic partitions. We give examples of this in Appendix 11. We present the formal definition of permissibility in Definition 6. We provide additional technical details with examples in Appendix 11.

Definition 6 (Permissible partitions). A partition \(\Pi_0\) is permissible if

  1. every \(\pi \in \Pi_0\) is a pool (cf. 1), and

  2. for every \(\boldsymbol{\beta}\) that generates \(\Pi_0\), with respect to the Lebesgue measure:

    1. the support of \(\boldsymbol{\alpha}(\boldsymbol{\beta})\), \(S_{\boldsymbol{\alpha}(\boldsymbol{\beta})} = \{\alpha_k \neq 0 \mid \alpha_k \in \boldsymbol{\alpha}(\boldsymbol{\beta})\}\), is measurable, and

    2. the support of \(\boldsymbol{\gamma}(\boldsymbol{\beta})\), \(S_{\boldsymbol{\gamma}(\boldsymbol{\beta})} = \{\gamma_k \neq 0 \mid \gamma_k \in \boldsymbol{\gamma}(\boldsymbol{\beta})\}\), is measurable.

We denote the set of all permissible partitions by \(\mathcal{P}^{\star}\).

\(\square\)

4.3 The geometry of permissibility.

Permissibility corresponds to topological restrictions on the Hasse diagram. We can leverage this when representing the search problem for high-quality partitions to improve storage and computational performance. These benefits are byproducts of modeling permissibility natively. We can define partitions on the Hasse diagram by removing (splitting on) edges in \(\mathcal{E}\) to form disjoint components in \(\mathcal{K}\). The removed edges correspond to non-zero marginal changes. This guarantees that all sets in a partition on the Hasse diagram contain only “connected” feature combinations and generates a convex structure in the permissible partitions.

At a high level, permissibility restrictions generate equivalence classes among the edges in the Hasse diagram, \(\mathcal{E}\). The equivalence classes are those edges that can only be removed together to generate a partition. Suppose we decompose \(\mathcal{E}\) into \(n\) mutually disjoint and exhaustive sets of edges \(E_1, \dots, E_n\), where each set \(E_j\) is an equivalence class. A partition \(\Pi\) induced by these equivalence classes removes an edge \(e\) if and only if it removes every \(e^{\prime}\) in the same equivalence class, i.e., every \(e^{\prime}\) such that \(e, e^{\prime} \in E_i\) for some \(i = 1, \dots, n\). Equivalently, we pool along one of these edges if and only if we pool along the others. Let \(\mathcal{E}^{\prime}\) represent the set of edges that remain after the partition. Then the pruned graph \((\mathcal{K}, \mathcal{E}^{\prime})\) specifies the corresponding partition.

Specifically, permissible partitions can be generated by identifying a unique decomposition of \(\mathcal{E}\) into equivalent edges. The decomposition is given by,

\[\begin{align} \mathcal{E} &= \bigcup_{m=1}^M \bigcup_{r=1}^R E_{m, r}, \end{align}\] where for some feature \(m\) taking on value \(r\), the equivalence class is \(E_{m,r} = \{e_{k, k^{\prime}} \in \mathcal{E} \mid k_m = r, k^{\prime}_m = r+1\}\). In other words, \(E_{m,r}\) contains the edges between all pairs of variants \(k\) and \(k^{\prime}\) that differ on the \(m\)-th feature, with \(k_m = r\) and \(k^{\prime}_m = r+1\). This decomposition of \(\mathcal{E}\) into equivalent edges corresponds to the “parallel” edges on the Hasse diagram (see Figure 2 for an example). We formalize this equivalent geometric interpretation of Definition 6 in 7.

The equivalence between the edges allows us to store partitions efficiently. If there are \(n\) equivalence classes, then there are \(2^n\) possible partitions – we either split or pool across the edges in the \(i\)-th equivalence class. Rather than storing the partition as a set of pools or through a tree data structure, we can reduce the storage and calculation by a logarithmic factor by just keeping track of the hyperplanes induced by splits. That is, we can simply store a binary vector of length \(n\). For interpretability purposes, we can reshape this vector into the matrix \(\boldsymbol{\Sigma}\in \{0, 1\}^{M \times (R-2)}\). Here \(\Sigma_{mr} = 1\) if and only if feature combinations with level \(r\) in feature \(m\) are pooled with feature combinations with level \(r+1\). We walk through detailed examples in Appendix 11.
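A sketch of this encoding, under the convention in the text that the active levels of feature \(m\) are \(1, \dots, R-1\) and \(\Sigma_{mr} = 1\) pools level \(r\) with level \(r+1\); pools within a profile are then products of contiguous level intervals. The helper names are ours.

```python
import numpy as np

def level_buckets(sigma_row, R):
    """Map active levels 1..R-1 of one feature to bucket ids.
    sigma_row[r-1] = 1 means level r is pooled with level r+1 (0 = split)."""
    bucket, buckets = 0, {1: 0}
    for r in range(1, R - 1):                 # adjacent pairs (r, r+1)
        if sigma_row[r - 1] == 0:
            bucket += 1
        buckets[r + 1] = bucket
    return buckets

def pool_of(k, sigma, R):
    """Pool label of an active feature combination k = (k_1, ..., k_M):
    the tuple of per-feature bucket ids."""
    return tuple(level_buckets(sigma[m], R)[k_m] for m, k_m in enumerate(k))

# Example: M = 2 features with R = 4 levels; pool levels 2 and 3 of feature 1.
sigma = np.array([[0, 1], [0, 0]])
assert pool_of((2, 1), sigma, R=4) == pool_of((3, 1), sigma, R=4)
```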

Any estimation strategy implicitly takes some stand on partition structure. In Appendix 11, we discuss permissibility restrictions in the context of several other commonly used models. We sketch examples using long/short regression, Lasso, decision trees, causal trees, and treatment variant aggregation. These alternative approaches impose permissibility restrictions that are naturally captured by our Hasse diagram topology.

So far, we assumed all features take on the same number of ordered values, \(R\). Without loss of generality, we can extend to the case where feature \(i\) takes on \(R_i\) values such that not all \(R_i\) are equal (some entries in the partition matrix will not be defined). Henceforth, we consider only permissible partitions and drop the “permissible” quantifier unless specifically needed. Our paper is modular: one could construct Rashomon Partitions with a different notion of permissibility.

4.4 Profiles.

A profile is the set of feature combinations that share the same pattern of active (non-zero) and inactive (control) features; we never pool feature combinations across profiles. Figure 3 illustrates the distinction using an extended Hasse diagram that includes the control level.

Figure 3: Extended Hasse diagram including the control (0mg) level. Panel A shows a permissible partition corresponding to increasing Ibuprofen while averaging over all levels of Amoxicillin (including control). Panel B shows increasing Ibuprofen while already taking Amoxicillin (e.g., comparing \(\{(250,200),(500,200)\}\) to \(\{(250,400),(500,400)\}\)), which is not permissible.

5 Size of the Rashomon Partition Set

Given that we would like to enumerate \(\mathcal{P}_{\theta}\), it is useful to calculate bounds on its size and on the size of \(\mathcal{P}^{\star}\). Since any permissible partition requires each profile to respect Definition 6, we can consider each profile independently. We will use \(m\) to denote the number of features with non-zero values in the profile we are focusing on, so \(m \in \{1,\dots,M\}\). Without loss of generality, we assume that every feature \(i\) takes on \(R_i = R\) ordered values; our proofs naturally extend to the general case. All technical details are deferred to Appendix 14. First, \(\mathcal{P}^{\star}\) is small relative to the total number of potential partitions.

Proposition 2. In each profile, the total number of possible partitions is \(\mathcal{O}\left(2^{2(R-1)^m} \right)\), while the number of permissible partitions is \(\mathcal{O}(2^{m(R-2)})\).

Next, we show that the size of the RPS is only polynomial in \(M\) and \(R\). In Lemma 1, we observe that the \(\ell_0\) prior bounds the number of pools in any Rashomon partition.

Lemma 1. For a given Rashomon threshold \(\theta\) and regularization parameter \(\lambda\), any partition in the RPS, \(\mathcal{P}_{\theta}\), can have at most \(H_{\theta}(\lambda)\) pools, \[\begin{align} H_{\theta}(\lambda) = \left\lfloor - \frac{\ln (c \theta)}{\lambda} \right\rfloor, \end{align}\] where \(c := c(\mathbf{Z})\) is a normalization constant depending only on \(\mathbf{Z}\) and \(\lfloor \cdot \rfloor\) is the floor function.
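For instance, under the loss-space threshold of Section 3.3, where \(\theta = e^{-q_0(1+\epsilon)}/c\), the normalizing constant cancels and the bound in Lemma 1 reduces to a quantity the researcher can compute directly:

\[\begin{align} H_{\theta}(\lambda) = \left\lfloor - \frac{\ln (c \theta)}{\lambda} \right\rfloor = \left\lfloor - \frac{\ln e^{-q_0(1+\epsilon)}}{\lambda} \right\rfloor = \left\lfloor \frac{q_0(1+\epsilon)}{\lambda} \right\rfloor. \end{align}\]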

Lemma 1 allows us to further reduce the number of partitions in Proposition 2 by considering only partitions that meet this requirement. Even when the regimes of scientific action, i.e., profiles, are unknown, we show that the size of the RPS is bounded polynomially in Theorem 3. In 7, we bound the size of the RPS when the profiles are known a priori. Such a relationship between regularization and the size of the model class was previously hypothesized and shown for empirical data by [7].

Theorem 3. For a given Rashomon threshold \(\theta\) and regularization parameter \(\lambda\), the size of the Rashomon Partition Set is bounded by \[\begin{align} \left\lvert\mathcal{P}_{\theta}\right\rvert &= \begin{cases} \mathcal{O}\left(M^{2H - 1} R^{H - 1} \right), & R > M^{c_{\mathrm{crit}}} \\ \mathcal{O}\left( (MR)^{H - 1} (\log_2 (MR))^{-1} \right), & \mathrm{else} \end{cases}, \end{align}\] where \(H := H_\theta(\lambda)\) and \(c_{\mathrm{crit}} = (\log_2 3 - 1) / (2 - \log_2 3)\).

Observe that \(c_{\mathrm{crit}} \approx 1.41\). In most settings, the number of factors \(R\) is fixed and the number of features \(M\) grows. Therefore, \(R \leq M^{c_{\mathrm{crit}}}\) as \(M\) grows, and we will fall under the second case which is smaller by a factor of \(M^H\). In our empirical examples, we see that with just hundreds of partitions in our RPS we get close to the full posterior.

6 Enumerating Rashomon Partitions

We first develop intuition for an algorithm that enumerates the RPS for a single profile. Since we do not pool across profiles, we can enumerate the Rashomon Partition Set for each profile independently and then combine the results at the end. The intuition behind our enumeration is that any split we make introduces a new set of pools. If for some reason this split is very bad, then no matter what other splits we make, we can never recover. Theorems 4 and 5 help us identify those poor splits. They rely on the fact that equivalent points, i.e., units with the exact same feature values, will always belong to the same pool. However, equivalent units may not have the same outcome. Therefore, we will always incur some loss from these equivalent units (see [8] or [9] for implementations of related strategies). We defer technical details of the results to Appendix 15.

Consider some partition matrix \(\boldsymbol{\Sigma}\), where the partition is given by \(\Pi := \Pi(\boldsymbol{\Sigma})\). Given some data \(\mathbf{Z}= (\mathbf{X}, \mathbf{y})\), we will use the mean squared error and the average outcome in pool \(\pi \in \Pi\), \(\widehat{\mu}_{\pi}\), as defined in 6. However, the results generalize to any non-negative loss, as we will see in Appendix 17 with a weighted mean-squared error.

Suppose we fix some indices \(\mathcal{M}\) in \(\boldsymbol{\Sigma}\). Define a new matrix \(\boldsymbol{\Sigma}_{\mathfrak{f}}\), \[\begin{align} \Sigma_{\mathfrak{f}, (i, j)} &= \begin{cases} \Sigma_{(i, j)}, & (i, j) \in \mathcal{M} \\ 0, & \text{else} \end{cases}. \end{align}\] In other words, \(\boldsymbol{\Sigma}_{\mathfrak{f}}\) is a partition where all heterogeneity splits made by \(\boldsymbol{\Sigma}\) corresponding to indices in \(\mathcal{M}\) are obeyed and we maximally split at all other places. Let \(\Pi_{\mathfrak{f}} := \Pi(\boldsymbol{\Sigma}_{\mathfrak{f}})\) correspond to this maximal partition respecting \(\boldsymbol{\Sigma}\) at indices \(\mathcal{M}\). Next, define \[\begin{align} \pi_{\mathfrak{f}} &= \{ k \in \mathcal{K} \mid k_i \leq j+1 \iff (i, j) \in \mathcal{M} \} \end{align}\] to be the set of all feature combinations covered by indices in \(\mathcal{M}\). And we define the complement \(\pi_{\mathfrak{f}}^c = \mathcal{K} \setminus \pi_{\mathfrak{f}}\). Finally, define \(H(\Pi, \mathcal{M}) = \sum_{\pi \in \Pi} \mathbb{I}\{\pi \cap \pi_{\mathfrak{f}} \neq \varnothing\}\) to be the number of pools in \(\Pi\) consisting of feature combinations corresponding to indices \(\mathcal{M}\).

Consider a procedure where we keep \(\boldsymbol{\Sigma}\) constant at \(\mathcal{M}\) and make further splits (not already implied by \(\mathcal{M}\)) at other indices only. Define \(\mathrm{child}(\boldsymbol{\Sigma}, \mathcal{M})\) to be all such \(\boldsymbol{\Sigma}^{\prime}\). Our search algorithm in Algorithm [alg:r-aggregate-profile] starts at some partition and fixes some heterogeneity splits. Theorem 4 says that if the loss incurred by these fixed heterogeneity splits is already too high, then we should discard this partition and its children.

Theorem 4. Let \(\theta_{\epsilon}\) be the Rashomon threshold in the loss space i.e., \(\Pi \in \mathcal{P}_{q, \epsilon}\) if and only if \(Q(\Pi) < \theta_{\epsilon}\). Given a partition \(\Pi := \Pi(\boldsymbol{\Sigma})\) for partition matrix \(\boldsymbol{\Sigma}\), a set of fixed indices \(\mathcal{M}\), and data \(\mathbf{Z}\) consisting of \(n\) observations, define \[\begin{align} b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) &= \frac{1}{n} \sum_{\pi \in \Pi_{\mathfrak{f}}} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \in \pi_{\mathfrak{f}} \right\} (y_i - \widehat{\mu}_{\pi})^2 + \lambda H(\Pi, \mathcal{M}). \label{eq:fixed-point-loss} \end{align}\tag{8}\] If \(b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) > \theta_{\epsilon}\), then \(\boldsymbol{\Sigma}\) and all \(\boldsymbol{\Sigma}^{\prime} \in \mathrm{child}(\boldsymbol{\Sigma}, \mathcal{M})\) are not in the Rashomon set \(\mathcal{P}_{q, \epsilon}\).

Theorem 5 “looks ahead” to see if this partition is of poor quality. If the loss incurred by feature combinations yet to be split is too high, then we abandon this partition.

Theorem 5. Consider the same setting as Theorem 4. Define \[\begin{align} b_{eq}(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) &= \frac{1}{n} \sum_{\pi \in \Pi_{\mathfrak{f}}} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \in \pi_{\mathfrak{f}}^c \right\} (y_i - \widehat{\mu}_{\pi})^2, \tag{9} \\ B(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) &= b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) + b_{eq}(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}). \tag{10} \end{align}\] If \(B(\boldsymbol{\Sigma}, \mathcal{M} ; \mathbf{Z}) > \theta_{\epsilon}\), then \(\boldsymbol{\Sigma}\) and all \(\boldsymbol{\Sigma}^{\prime} \in \mathrm{child}(\boldsymbol{\Sigma}, \mathcal{M})\) are not in the Rashomon set \(\mathcal{P}_{q, \epsilon}\).
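As a sketch of how the two bounds can be computed, assume we have, for each observation, its pool under the maximal partition \(\Pi_{\mathfrak{f}}\), its pool under \(\Pi\) itself, and an indicator of whether its feature combination lies in \(\pi_{\mathfrak{f}}\). All names below are ours, and for simplicity the pool count in the penalty is taken over observed feature combinations.

```python
import numpy as np

def rps_bounds(y, pools_max, pools_pi, covered, lam):
    """Compute (b, B) in the spirit of Theorems 4 and 5 (eqs. 8-10).
    pools_max[i]: pool of observation i under the maximal partition Pi_f.
    pools_pi[i]:  pool of observation i under the partition Pi itself.
    covered[i]:   True iff observation i's feature combination lies in pi_f."""
    y = np.asarray(y, dtype=float)
    pools_max = np.asarray(pools_max)
    pools_pi = np.asarray(pools_pi)
    covered = np.asarray(covered, dtype=bool)
    n = y.size
    sq_err = np.zeros(n)
    for pool in np.unique(pools_max):            # (y_i - mu_hat_pi)^2 within each pool of Pi_f
        idx = pools_max == pool
        sq_err[idx] = (y[idx] - y[idx].mean()) ** 2
    n_covered_pools = np.unique(pools_pi[covered]).size    # pools of Pi meeting pi_f
    b = sq_err[covered].sum() / n + lam * n_covered_pools  # eq. (8)
    b_eq = sq_err[~covered].sum() / n                      # eq. (9), equivalent-points loss
    return b, b + b_eq                                     # (b, B), eq. (10)
```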

Theorems 4 and 5 help aggressively cut down the search space by combining the lowest penalty on the splits already made and the lowest mean-squared error on the splits yet to be made. If this is already too high, then we abandon our search. We illustrate this in Algorithm [alg:r-aggregate-profile]. Here, we start with all feature combinations pooled together. We begin our search at the first feature trying to split the two variants with the lowest dosages into separate pools. We keep a queue of possible splits to consider. Whenever we remove a possible split from the queue, we check its viability using Lemma 1, and Theorems 4 and 5. If this is a bad split, we go to the next split in the queue. And if this is a good split (so far), we check if it already meets the Rashomon threshold and recursively add other possible splits to this queue. We also maintain a cache of splits that have been added to the queue at some point to avoid doubling back on old splits. As noted before, we can solve each profile independently. In Algorithm [alg:r-aggregate], we explicitly show how to do this. Note that in line [line:eq-bound], we once again leverage Theorem 5 by noting that each profile will always incur some loss. Once we solve each profile independently, Algorithm [alg:r-aggregate-across-profiles] describes how to pool across profiles as defined in 9. Putting all of these results together, we have the following theorem.

Theorem 6. Algorithm [alg:r-aggregate] correctly enumerates the Rashomon partition set.

Algorithm [alg:r-aggregate]

Input: \(M\) features, \(R\) factors per feature, max pools \(H\), data \(\mathbf{Z}\), reference loss \(q\), threshold \(\epsilon\). Output: Rashomon set \(\mathcal{P}_{q, \epsilon}\).

1. \(\mathcal{P}_{q, \epsilon} \leftarrow \varnothing\); \(\mathcal{R} \leftarrow\) all sets of candidate profiles.

2. For each set of profiles \(\rho \in \mathcal{R}\):

   1. \(H^{\prime} \leftarrow H - \left\lvert\rho\right\rvert + 1\); \(\mathcal{E} \leftarrow [\, b_{eq} \text{ of profile } \rho_i \text{ for } \rho_i \in \rho \,]\); \(\mathcal{P} \leftarrow \texttt{dict()}\).

   2. For each profile \(\rho_i \in \rho\): set \(q_i \leftarrow q(1+\epsilon) - \texttt{sum}(\mathcal{E}) + \mathcal{E}_{\rho_i}\), \(M_i \leftarrow\) active features in \(\rho_i\), and \(R_i \leftarrow R[M_i]\); compute \(\mathcal{P}[\rho_i] \leftarrow \texttt{EnumerateRPS\_genprofile}(M_i, R_i, H^{\prime}, \mathbf{Z}, q_i)\); sort the partition matrices in \(\mathcal{P}[\rho_i]\) by loss.

   3. \(\mathcal{P}^{\prime} \leftarrow \bigtimes_{\rho_i \in \rho} \mathcal{P}[\rho_i]\).

   4. \(\mathcal{P}_{q, \epsilon} \leftarrow \mathcal{P}_{q, \epsilon} \cup \texttt{PoolProfiles}(\mathcal{P}^{\prime}, \rho_0, \mathbf{Z}, q(1+\epsilon))\).

3. Return \(\mathcal{P}_{q, \epsilon}\).
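A much-simplified, self-contained sketch of the single-profile search described above (our own helper names, not the paper's released implementation): candidate partitions within a profile are matrices \(\boldsymbol{\Sigma} \in \{0,1\}^{M \times (R-2)}\), children of a state introduce one additional split, and for brevity we prune only with the penalty-based bound implied by Lemma 1 rather than the sharper bounds in Theorems 4 and 5.

```python
from collections import deque

def enumerate_rps_profile(M, R, loss_fn, lam, theta):
    """Enumerate partition matrices sigma whose objective
    Q = loss_fn(sigma) + lam * n_pools(sigma) is within the loss-space
    Rashomon threshold theta.  Convention as in the text: sigma[m][r] = 1
    pools two adjacent levels of feature m, 0 splits them; loss_fn is assumed
    to return the mean-squared error of the partition encoded by sigma
    (e.g. built with pool_of above)."""

    def n_pools(sigma):
        # within a profile, pools are products of per-feature level intervals
        count = 1
        for row in sigma:
            count *= (R - 1) - sum(row)
        return count

    start = tuple(tuple(1 for _ in range(R - 2)) for _ in range(M))  # fully pooled
    rps, queue, seen = [], deque([start]), {start}
    while queue:
        sigma = queue.popleft()
        # Penalty-only prune: children only add splits, so any completion has
        # at least n_pools(sigma) pools and hence objective >= lam * n_pools(sigma).
        if lam * n_pools(sigma) > theta:
            continue
        if loss_fn(sigma) + lam * n_pools(sigma) <= theta:
            rps.append(sigma)
        for m in range(M):                        # children: introduce one more split
            for r in range(R - 2):
                if sigma[m][r] == 1:
                    child = tuple(row if i != m else row[:r] + (0,) + row[r + 1:]
                                  for i, row in enumerate(sigma))
                    if child not in seen:
                        seen.add(child)
                        queue.append(child)
    return rps
```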

7 Simulations

In this section, we present two simulation studies to illustrate the performance of our method. First, consider a setting with four features. Each feature takes on four ordered factors, \(\{0, 1, 2, 3\}\). There are \(2^4 = 16\) possible combinations of active (\(>0\)) and inactive (level 0) features. We refer to “control” as the setting where all features are inactive, corresponding to an experiment with four distinct interventions, each of which has four levels (control, plus low-medium-high). \[\begin{align} \beta_{(0, 0, 0, 1)} &= 4.4, \ \sigma^2_{(0, 0, 0, 1)} = 1, \ \beta_{(0, 1, 0, 0)} = 4.3, \ \sigma^2_{(0, 1, 0, 0)} = 1, \ \beta_{(0, 1, 0, 1)} = 4.45, \ \sigma^2_{(0, 1, 0, 1)} = 1, \\ \beta_{(1, 0, 1, 0)} &= 4.5, \ \sigma^2_{(1, 0, 1, 0)} = 1.5, \ \beta_{(1, 1, 1, 1)} = 4.35, \ \sigma^2_{(1, 1, 1, 1)} = 1. \end{align}\] All other settings have outcome \(\beta = 0\) and variance \(\sigma^2 = 1\). We generated data with \(n_a = 30\) data points per feature combination (so \(30\times 16=480\) total data points). Each vector \(a\) has a different number of feature combinations depending on which features are active.3 The outcomes were drawn from \(\mathcal{N}(\beta_a, \sigma_a^2)\). The setting where \(a=(1, 0, 1, 0)\) should have the highest outcomes, so it is the true “best” setting or profile. We averaged the results over \(r = 100\) simulations.

Figure 4: Simulation results. The blue curve shows how often the true best profile is discovered as we increase the Rashomon threshold. With just \(\epsilon \approx 0.038\), we recover the true best profile in the Rashomon set about 90% of the time. The red dot corresponds to how often Lasso recovers the true best profile.

Figure 4 tells us how often the feature setting with the true highest expected outcome is present in the RPS as a function of the threshold \(\epsilon\). Lasso is able to correctly identify the true highest outcome setting only about half the time, whereas, as we increase \(\epsilon\), the true best profile is almost always identified correctly within the RPS.

In our second simulation study, we study the treatment effect in a binary treatment setting. We look at [10]’s Causal Random Forests (CRFs), which we discuss in 9 and 18. CRFs sample over the space of decision trees, so we ask: how many causal trees are permissible and in the RPS? Causal trees partition the features to find heterogeneity in the treatment effect directly. This is in contrast to partitions in the RPS, which find heterogeneity in the outcome. To make the comparison fair, we set the outcome of the control group to a constant, 0, without any noise.

We simulate data with four features, the first feature being a binary treatment variable. The second feature takes on 3 ordered levels and the last two features take on 4 ordered levels. The following are the outcomes for the treatment group: \[\begin{align} \beta_{(1, 1, 1:2, 1:3)} &= 2, \ \beta_{(1, 1, 1:2, 4)} = 4, \ \beta_{(1, 1, 3:4, 1:3)} = 2, \ \beta_{(1, 1, 3:4, 4)} = 0, \\ \beta_{(1, 2, 1:2, 1:3)} &= 3, \ \beta_{(1, 2, 1:2, 4)} = 5, \ \beta_{(1, 2, 3:4, 1:3)} = 7, \ \beta_{(1, 2, 3:4, 4)} = 1, \\ \beta_{(1, 3, 1:2, 1:3)} &= 1, \ \beta_{(1, 3, 1:2, 4)} = -1, \ \beta_{(1, 3, 3:4, 1:3)} = -1, \ \beta_{(1, 3, 3:4, 4)} = -2. \end{align}\]

Table 1: Results for the second simulation study. We compare how often CRFs find permissible partitions and how often those partitions are present in the RPS. We vary both the number of trees in the CRF and the Rashomon threshold. Each cell shows the fraction of CRF trees inside the RPS (absolute counts in parentheses). The numbers are averaged over 100 simulations.

| | # permissible = 1.29 | # permissible = 2.67 | # permissible = 10.32 |
|---|---|---|---|
| \(\epsilon = 0.1\) (\(\left\lvert\mathcal{P}_{\theta}\right\rvert = 7.46\)) | 0% (0) | 0% (0) | 0% (0) |
| \(\epsilon = 0.2\) (\(\left\lvert\mathcal{P}_{\theta}\right\rvert = 46.6\)) | 0% (0) | 0% (0) | 0% (0) |
| \(\epsilon = 0.3\) (\(\left\lvert\mathcal{P}_{\theta}\right\rvert = 126.54\)) | 0.41% (0.52) | 0.91% (1.15) | 3.35% (4.24) |
| \(\epsilon = 0.5\) (\(\left\lvert\mathcal{P}_{\theta}\right\rvert = 823.81\)) | 0.16% (1.29) | 0.32% (2.67) | 1.25% (10.32) |

We generate \(n_a = 10\) data points per feature combination. In the treatment group, we drew outcomes from a \(\mathcal{N}(\beta_a, 1)\) distribution. We averaged simulations over 100 iterations. The results are presented in Table 1. The vast majority of partitions sampled by CRFs are not scientifically coherent (permissible) partitions and thus cannot be interpreted as plausible explanations. This result is not specific to CRFs and would hold for any algorithm using unrestricted trees. Consequently, the number of trees that are in the RPS is also very small, meaning that, although averaging over trees has appealing asymptotic properties, the trees included in a particular sample are unlikely to be high-quality explanations. This result is, again, not specific to CRFs but highlights the value of exploring uncertainty by enumerating high-quality explanations compared to sampling.

8 Empirical data examples

In three distinct empirical settings, we use the RPS to identify robust conclusions, both affirming and challenging the established literature’s findings in each context.

8.1 Does price matter in charitable giving?

[11] used mail solicitations to prior donors of a non-profit political organization to study the effect of price on charitable donations. The data contain 50,083 individuals in the United States who had previously donated to the organization. All individuals received a letter soliciting donations. Those in the treatment group (33,396 people) received letters with an additional paragraph explaining that their donation would be matched in some way; the letters were identical otherwise. Three treatment arms were cross-randomized: (i) the maximum size of the matching gift across all donations ($25k, $50k, $100k, unspecified), (ii) the ratio of the price match (1:1, 2:1, 3:1), and (iii) an example suggested donation amount (\(1\times\), \(1.25\times\), \(1.5\times\) the person’s highest previous contribution). Additionally, they classify states as red or blue depending on whether they voted for George Bush or John Kerry in the 2004 U.S. presidential election. We restrict our analysis to individuals who received the treatment, with the goal of discovering robust patterns of treatment heterogeneity.

Figure 5: Results for the [11] dataset. The top two panels show the size of the RPS and a partial error term from 1 as a function of \(\epsilon\). Our choice of \(\epsilon = 10^{-4}\) is highlighted by the black dashed line. The bottom panel shows the RPS approximation of the effects of price match for $2:$1 and $3:$1 stratified by political leaning. Each row denotes the relative size of the effect of the price match relative to $1:$1, measured in standard deviations. Each column shows the fraction of the total RPS posterior mass with an effect of that size for each feature combination (gift size and suggested donation) in the RPS. For example, ($25k, 1x, $3:$1, Democrat) shows a strong positive effect (higher donations) while (Unstated, 1.5x, $3:$1, Democrat) shows a strong negative effect (lower donations) compared to $1:$1 (all else equal).

Figure 5 shows how the set size and error bound change with \(\epsilon\). Using Figure 5 as a guide, we choose \(\epsilon\) so that adding additional models to the RPS does not dramatically increase the approximation quality (akin to choosing the number of components in principal components analysis using a scree plot). We choose \(\epsilon = 10^{-4}\). A larger \(\epsilon\) would dilute the RPS by adding more models that have little support in the data.

8.2 Heterogeneity in telomere length.

Telomeres are regions of repeated nucleotide sequences near the end of the chromosome that protect the chromosome from damage. They reduce in length every time a cell divides, eventually becoming so short that the cell can no longer divide. A recent literature examines which features are associated with (or possibly cause) changes in telomere length. Telomere shortening has been associated with cellular senescence and may offer target biomarkers for genetic predispositions and anti-cancer therapies [12], [13]. Recent research suggests that there may only be a narrow range of healthy telomere lengths; anything outside this range is associated with increased risk of immune system problems or cancer [14], [15]. Research has found heterogeneity by race, ethnicity, age, and even stress [16]–[19].

We use data from the National Health and Nutrition Examination Survey (NHANES) collected in 1999 and 2002. The survey included blood draws; DNA analyses were performed on the samples and telomere length was estimated. Specifically, we consider the mean T/S ratio (telomere length relative to standard reference DNA).4 The dataset also contains socio-economic variables. To speak to the emerging literature on telomere heterogeneity, we focus on hours worked (a proxy for stress), age, gender, race, and education. Our goal is to study the RPS of this heterogeneity in T/S.5

Figure 6: The top two panels show how the size of the RPS and the partial error term from 1 change as we increase \(\epsilon\) in the NHANES dataset, highlighting our choice of \(\epsilon\). In the bottom panel, we highlight heterogeneity in telomere length across the four features (hours worked, gender, age, and education). Each set of histograms marginalizes one feature and stratifies it by race. Each column corresponds to a feature level. Each cell represents the difference in telomere length relative to the lowest level taken by that feature, and each column shows the distribution of that difference within the RPS.

We show our choice of \(\epsilon\) for the Rashomon threshold in the top two panels of Figure 6, and a visual summary of the conclusions of the RPS in the bottom panel, similar to Figure 5. In the RPS, we found robust heterogeneity in race – specifically, we found no partition that pools features across races. We found robust evidence of male/female heterogeneity only in White and Black races. All partitions for these races split males and females into separate pools. This was absent in Other races – only 23% of the partitions in the RPS split on male/female.

We find very few robust patterns. As discussed earlier, we find robust differences in telomere lengths across males and females in the Black population and a robust non-difference in Other races. Similarly, we find a robust non-difference in Black and Other races in telomere lengths for people who work fewer than 40 hours.

Our findings reveal an absence of robust evidence supporting the patterns highlighted in existing literature. Moreover, of the few robust patterns we do identify, several contradict prior research. We find that Black males have longer telomeres than Black females. Among White people, we find that older people have longer telomeres, which also contradicts existing research. This underscores the necessity for further exploration in this field using comprehensive data and appropriate statistical methods. This contradiction reveals the fragility of empirical research that relies on a single model.

8.3 Heterogeneity in the impact of microcredit access.↩︎

A large literature has looked at the impact of microfinance on several outcomes, ranging from private consumption to business outcomes to social outcomes (e.g., female empowerment). Mostly, the literature has found little beyond basic consumption effects [20]–[26], though there is suggestive evidence of some potential heterogeneity. One specific heterogeneity of interest concerns entrepreneurs: those with pre-existing businesses may particularly benefit from access to microfinance loans [27]. Another concerns family size [28]: the returns to credit access may vary by whether the household has more children.

We analyze data from [23] generated from a randomized controlled trial in which 102 neighborhoods in Hyderabad, India were randomly assigned to treatment or control, each with equal probability, where treatment meant that a partner microfinance organization, Spandana, entered. At baseline a number of characteristics of sampled individuals were collected, including the gender of the head of the household, the education status of the head of the household, the number of businesses previously owned by the household, and the number of children in the household. Additionally, at the neighborhood level, information about the share of households with debt, the share of households with businesses, total expenditure per capita in the region, and average literacy rates in the region were also collected at baseline. Motivated by the literature, we look at the regional debt and business variables. We consider outcomes from the second (longer term) endline, focusing on four spheres: (i) loans, (ii) household response (total expenditure, durables, temptation goods, labor supply), (iii) business (revenue, size, assets, profits), and (iv) female empowerment (female business participation, education of daughters). We discretized the regional characteristics and the number of businesses previously owned into four levels using quartiles. We set the first quartile as the “base control,” so the feature is active in the profile if it takes a higher level.

Figure 7: This plot visualizes the distribution of positive, zero, and negative effects, averaged across partitions in the RPS, for each feature combination. The distributions are normalized across each column. Each column corresponds to one of the five robust feature profiles described here, where the label denotes which features are active (i.e., do not take the lowest level). “None” means that all features take their lowest values.

To study the impact of access to microcredit, we allow features across treatment and control profiles to be pooled together (see 9). This allows for differences in the structure of heterogeneity (i.e., different Hasse diagrams) between treatment and control. When a treatment feature combination is not pooled with its control counterpart, \(\widehat{\textrm{CATE}}(\mathbf{x}) \neq 0\), indicating treatment effect heterogeneity. We present \(\widehat{\textrm{CATE}}(\mathbf{x})\) categorized into five bins based on effect sizes across the RPS, which captures robust (or non-robust) qualitative patterns. We sort \(\mathbf{x}\) into various profiles and count the number of features in each profile that have a large positive, small positive, zero, small negative, and large negative effect, depending on the posterior density relative to all partitions in the RPS. In Appendix 19, Figure 24 shows the full set of profiles and outcomes.

The RPS provides an avenue for the researcher to identify patterns of robust treatment effects across outcomes of interest. It also clearly demonstrates when, for numerous profiles, there is little robustness to be had. Without strong priors, data cannot robustly speak to the impacts of microcredit in most cases. Research that draws conclusions in such settings based on a very specific treatment of heterogeneity is therefore relying heavily on its (highly consequential) priors over the structure of that heterogeneity. The RPS also gives policymakers guidance on robust interventions. Consider a policymaker weighing regions with high baseline debt: since there are robustly no positive effects on profits and half the RPS suggests negative effects on profits, they may not wish for the microcredit firm to enter this market. In contrast, in other markets, e.g., those with low debt and business presence and large non-entrepreneurial families, there are robustly no effects on profits and robustly positive effects on consumption and leisure, so they can proceed with confidence.

9 Related work↩︎

Our work is related to a flourishing literature on the Rashomon effect [29]–[37]. One line, reminiscent of dealing with p-hacking, identifies sets of estimands that generate similar objective function values [38]–[40] and has been explored in the context of variable importance [41], [42]. The most related is [9], who define \(\epsilon\)-Rashomon sets and propose a decision tree algorithm to enumerate the set of estimands (trees) that have squared loss smaller than a threshold slightly higher than that of a reference model. Our work focuses on (Bayesian) inference with discrete (and continuous, as we outline in the Appendix) variables in the general regression setting, while they address prediction with binary variables only for classification. For a fixed dataset and hyperparameters, [9] prove inclusion relations between Rashomon sets defined under different empirical metrics and under controlled dataset modifications. Our results provide approximation bounds for the posterior distribution when restricting inference only to the RPS. Additionally, our work opts for a geometric representation based on Hasse diagrams, encoding scientific plausibility using permissibility rules, and defining the RPS in the context of a (Bayesian) probabilistic framework. This distinction allows us to leverage the geometry of the Hasse diagram to improve the efficiency of our enumeration algorithm. Additionally, [7] hypothesized and showed using simulations that regularization changes the size of the RPS. We formally establish and prove this relationship for the \(\ell_0\) penalty in Theorem 3.

Next, we relate our work to the immense literature on Bayesian model uncertainty, which we discuss in more detail in Appendix 18. Our setup is reminiscent of other work on Bayesian tree models that leverage priors over trees (e.g., [43],  [44],  [45] or Bayesian Additive Regression Trees (BART) [3]). In contrast to many subsequent papers in this line of work, our goal is not solely or even principally prediction, but the identification of sets of combinations of characteristics that are heterogeneous with respect to the outcome. We use Hasse diagrams that obey permissibility criteria and our computational approach does not involve sampling from the posterior, which allows researchers to focus on a set of the highest posterior explanations for heterogeneity while avoiding the computational issues associated with sampling the massive space of trees. We demonstrate in Appendix 17 how to extend our framework to functions across pools (e.g. [46]).

Our approach is also related to Bayesian Model Averaging (BMA) [47], [48]. Unlike BMA, the dimension of \(\boldsymbol{\beta}\) stays fixed throughout, though there are restrictions on \(\boldsymbol{\beta}\) given a particular partition. We thereby avoid searching the extremely large space of highly correlated models of different dimensions [47], [49], [50] while preserving a unified Bayesian inference framework. [51] and [52] use a similar strategy for causal discovery by finding high posterior equivalence classes of Bayesian networks.

Next, our work speaks to a rapidly growing literature that leverages ideas from machine learning to estimate treatment effect heterogeneity. We previously discussed [6] (and give a detailed comparison in Appendix 18). Our work is also related to existing tree-based methods for identifying treatment heterogeneity. [10], which we also discuss in detail in Appendix 18, construct regression trees (every tree corresponding to some partition \(\Pi\) in our language) to describe heterogeneity in the space of covariates and then sample from the distribution over trees to (honestly) estimate conditional average treatment effects. Similar to the comparison with Bayesian tree models, our work differs in moving from trees to Hasse diagrams, which do not impose false hierarchies on a partially ordered set, considering only plausible heterogeneity through permissibility rules, and emphasizing enumeration of high quality explanations rather than sampling across all possible trees.

Finally, we contrast with recent work that uses machine learning proxies, or predictions of the outcome that use features flexibly, to study heterogeneity in treatment effects. [53], which we discuss in Appendix 18, conduct inference while using an arbitrary machine learning algorithm to construct proxies and then cluster respondents into groups with the highest and lowest treatment efficacy using treatment outcomes predicted based on the proxies. These clusters, which are derived from amalgamating covariates through “black box” machine learning algorithms, can then be related back to observables. Our work, in contrast, enumerates a set of plausible explanations in the domain of observed covariates directly.

10 Discussion↩︎

In this paper, we present an approach for estimating heterogeneity in outcomes based on a set of discrete covariates. We derive a fully Bayesian framework and an algorithm to enumerate all possible poolings across feature combinations with the highest posterior density: the Rashomon Partition Set. We provide bounds on the portion of the posterior captured by models for heterogeneity in this set, allowing researchers to compute posteriors for marginal effects and evaluate specific treatment combinations. Appealing to a Bayesian framework addresses the issue of multiple testing and selection, which otherwise leaves practitioners with an unappealing choice between invalid inference and procedures, such as data splitting, that sacrifice power; this is particularly problematic in settings where the cost of data collection is high. Meanwhile, by enumerating a set of high posterior models rather than sampling from the entire posterior, we avoid the inefficiency and impracticality of existing Bayesian approaches to model uncertainty. By only considering scientifically plausible pools in a geometry that allows for partial ordering (Hasse diagrams), we substantially reduce the number of possible explanations for heterogeneity without sacrificing flexibility. Additionally, and critically, these choices mean that the resulting high posterior partitions are interpretable and useful for researchers and policymakers when designing future interventions or generalizations.

We now highlight two additional philosophical points about our approach. First, our approach is fundamentally generative in the sense that it produces insights that are directly interpretable. As we highlight in our empirical examples, we expect that Rashomon partitions themselves will be of interest for researchers or policymakers. They allow for the identification of the most robust conclusions, settings where policymakers can intervene without worrying about likely negative consequences, and defining “archetypes” for theory-building. In this way, our work contributes to a growing literature in artificial intelligence and machine learning that pushes back on the use of black box algorithms to make high-stakes decisions (see e.g., [54]). While machine learning models may be effective at estimating complex relationships between covariates, they also often do so in ways that obfuscate the influence of particular features (or combinations of features). Our approach presents an alternative that generates insights about sources of treatment effect heterogeneity based on combinations of observed covariates. Our work shows that admissibility restrictions and the geometry of the underlying problem provide enough structure to make search in the space of covariates possible, alleviating the need to use black box algorithms to summarize relationships between covariates through proxies.

Second, our work highlights the aperture that exists between statistical and practical decision-making. We take as given that in many moderate to high dimensional settings there will be interactions between features. With finite data, the result is a set of possible models for heterogeneity whose statistical performance is indistinguishable. Said another way, our work posits that the quest for the “best” statistical model is Sisyphean in essentially any scientifically interesting setting. While this may seem dire, it actually presents an opportunity to involve additional factors beyond model performance that are often critical in practice for making decisions. Amongst models in the RPS, a policymaker could choose based on, for example, implementation cost, equity considerations, or preserving privacy without sacrificing statistical performance.

There are many promising areas for future work in extending the framework we present here. First, we present results in terms of a posterior in a Bayesian framework. We could, however, also construct a similar structure under a frequentist paradigm. In such a setup, we would need to explore a re-splitting strategy (see [10], for example) to construct an “honest” set of Hasse diagrams. Furthermore, we could use our approach to identify groups that are systematically underrepresented in randomized trials (see [55], for example) and, as a further generalization, to compare results across trials (see for example [26]). We could also blend PPMx’s covariate‐driven cohesion with RPS’s guaranteed enumeration. By adopting a prior that upweights pools of treatment arms whose covariate profiles are more similar, RPS could focus on the most plausible partitions in applications where side‐information is available. We could likely adapt our branch‐and‐bound enumeration and associated uniform‐error and size bounds, ensuring that exhaustive uncertainty quantification is maintained even under this richer, covariate‐informed prior. Finally, our computational approach could be more generally valuable in a wide range of settings, such as model selection for graphical models or discrete model averaging more generally.

11 Permissibility and Hasse diagrams.↩︎

One way to understand permissibility is by arranging the data of feature combination assignments into a feature variant aggregation design matrix, \(\mathbf{F}\in\left\{ 0,1\right\} ^{n \times K}\). The entries of the matrix are as follows. If \(k\left(i\right)\) is the feature combination that \(i\) is assigned to, we set \[\begin{align} F_{i\ell} &:= \mathbb{I} \left\{ k\left(i\right)\geq\ell\ \cap\ \rho\left(k\left(i\right)\right) = \rho\left(\ell\right)\right\} . \end{align}\] So the variant design matrix switches on a dummy variable for all variants that are subordinate to \(k(i)\). The utility is that it allows us to understand the marginal value of climbing the ordering up from \(k(i)\), as in the treatment variant aggregation (TVA) procedure of [6].

To see this, it is useful to rewrite 1 in its variant form, \[\begin{align} \mathbf{y}&= \mathbf{F}\boldsymbol{\alpha}+ \epsilon, \label{eq:tva-regression} \end{align}\tag{11}\] which is just a linear transformation of 1 , with \(\boldsymbol{\beta}\) described by 7 , \[\begin{align} \beta_k &= \sum_{k^{\prime} \leq k; \rho(k) = \rho(k^{\prime})} \alpha_{k^{\prime}}. \end{align}\]
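To make the variant parameterization concrete, the following minimal sketch builds the matrix \(\mathbf{F}\) for a single profile and recovers \(\boldsymbol{\beta}\) from marginal increments; the function name `build_variant_matrix` and the toy numbers are our own illustration, not part of the original analysis.

```python
import numpy as np

def build_variant_matrix(assignments, feature_combinations):
    """Variant design matrix F: F[i, l] = 1 iff combination l is (weakly)
    dominated by unit i's combination k(i), i.e. k(i) >= l componentwise.
    All combinations here are assumed to share a single profile."""
    n, K = len(assignments), len(feature_combinations)
    F = np.zeros((n, K), dtype=int)
    for i, k_i in enumerate(assignments):
        for l, k_l in enumerate(feature_combinations):
            if all(a >= b for a, b in zip(k_i, k_l)):
                F[i, l] = 1
    return F

# Toy 2-feature example: beta_k is the sum of marginal increments alpha_{k'}
# over all k' <= k, mirroring y = F alpha + noise.
combos = [(1, 1), (1, 2), (2, 1), (2, 2)]
alpha = np.array([1.0, 0.5, 0.25, 0.0])     # marginal increments
F = build_variant_matrix(combos, combos)    # one unit per combination
beta = F @ alpha                            # implied outcomes: 1.0, 1.5, 1.25, 1.75
print(beta)
```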

It is useful to represent our framework in a Hasse diagram. Moving up from one node to an adjacent node in the Hasse diagram carries a value that corresponds to the marginal change in the outcome when moving from an immediate subordinate variant to the present variant.

Of course, in this particular parameterization of \(\boldsymbol{\beta}\), we chose to climb up the Hasse. We could have alternatively chosen to climb down the Hasse as \[\begin{align} \mathbf{y}&= \mathbf{G} \boldsymbol{\gamma}+ \epsilon, \\ \beta_k &= \sum_{k^{\prime} \geq k; \rho(k) = \rho(k^{\prime})} \gamma_{k^{\prime}}, \label{eq:tva-climbing-down-hasse} \end{align}\tag{12}\] where \(G_{i\ell} := \mathbb{I} \left\{ k\left(i\right)\leq \ell \ \cap\ \rho \left(k\left(i\right)\right) = \rho\left(\ell\right)\right\}\).

Consider the following definition of permissibility that places topological restrictions on any partition of the Hasse.

Definition 7 (Permissible partition of a profile). A partition \(\Pi_0\) of a profile \(\rho_0\) is permissible if and only if

  1. every \(\pi \in \Pi_0\) is a pool (cf. 1),

  2. every \(\pi \in \Pi_0\) is strongly convex i.e., \(k, k^{\prime} \in \pi\) and \(k \geq k^{\prime \prime} \geq k^{\prime}\) implies \(k^{\prime \prime} \in \pi\), and \(\min \pi\) and \(\max \pi\) both exist and are unique (and possibly equal to each other), and

  3. \(\Pi_0\) respects parallel splits, i.e., for every pair of distinct pools \(\pi_i, \pi_j \in \Pi_0\)

    1. if \(\min \pi_i \not\lessgtr\min \pi_j\), then there exists a \(\pi^\prime \in \Pi_0\) such that \(\min \pi^{\prime} = p^{\prime}\), where for each feature \(m\), \(p^{\prime}_m := \max\{ p^{(i)}_m, p^{(j)}_m\}\) where \(p^{(i)} = \min \pi_i\) and \(p^{(j)}= \min \pi_j\), and

    2. if \(\max \pi_i \not\lessgtr\max \pi_j\), then there exists a \(\pi^{\prime \prime} \in \Pi_0\) such \(\max \pi^{\prime \prime} = p^{\prime \prime}\), where for each feature \(m\), \(p^{\prime \prime}_m := \min \{ \tilde{p}^{(i)}_m , \tilde{p}^{(j)}_m \}\) where \(\tilde{p}^{(i)} = \max \pi_i\) and \(\tilde{p}^{(j)} = \max \pi_j\).

Condition [profile-permissibility:1] is obvious. Condition [profile-permissibility:2] says pools must contain contiguous features. Condition [profile-permissibility:3] is more technical but says that pools must be parallel on the Hasse.
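As a concrete illustration of condition [profile-permissibility:2], the sketch below checks strong convexity for a single pool of feature combinations. It does not check the parallel-splits condition, which requires inspecting the full partition, and the function names are our own.

```python
from itertools import product

def dominates(a, b):
    """Componentwise a >= b for two feature combinations."""
    return all(x >= y for x, y in zip(a, b))

def is_strongly_convex(pool):
    """Check the strong-convexity part of Definition 7 for one pool, given as a
    set of feature-combination tuples from a single profile."""
    pool = set(pool)
    mins = [k for k in pool if all(dominates(other, k) for other in pool)]
    maxs = [k for k in pool if all(dominates(k, other) for other in pool)]
    if len(mins) != 1 or len(maxs) != 1:      # a unique min and max must exist
        return False
    lo, hi = mins[0], maxs[0]
    # every combination sandwiched between min and max must belong to the pool
    box = product(*(range(l, h + 1) for l, h in zip(lo, hi)))
    return all(k in pool for k in box)

print(is_strongly_convex({(2, 2), (3, 2), (2, 3), (3, 3)}))   # True
print(is_strongly_convex({(1, 1), (2, 2)}))                   # False
```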

The geometric definition of permissibility presented in 7 is equivalent to the statistically measurable definition we presented in 6. We formally state this and prove it below.

Lemma 2. Definitions 7 and 6 are equivalent to each other.

We first look at permissibility as defined in 7. Condition [profile-permissibility:1] simply comes from the definition of a pool (cf. 1). To understand the necessity of strong convexity in condition [profile-permissibility:2] and the parallel splitting criteria in condition [profile-permissibility:3], we need to understand how the marginal increments \(\alpha_k\) affect the overall outcome \(\beta_{k^{\prime}}\). Observe that \(\alpha_{k}\) affects \(\beta_{k^{\prime}}\) for all feature combinations \(k^{\prime} \geq k\) by Equation 7 . We can define this “sphere of influence” of \(k\) as \(A_k = \{k^{\prime} \in \mathcal{K} \mid \rho(k) = \rho(k^{\prime}), k^{\prime} \geq k\}\). Further, when we are interested in the outcomes \(\beta_{k^{\prime}}\), it is sufficient to consider only spheres \(A_k\) where \(\alpha_{k} \neq 0\) i.e., the “active spheres.” Therefore, the intersection of all active spheres (taken either directly or through their complements) will give rise to a set of feature combinations with the same outcome i.e., a pool. It is easy to see that any such sphere is strongly convex in our sense. Therefore, any pool will also be strongly convex.

The parallel splitting criteria “from above” in condition [profile-permissibility:3a] of 7 also follows from this spheres-of-influence interpretation of the active \(\alpha_k\). Specifically, tracking the active \(\alpha_k\) ensures that if a segment through a Hasse is pooled, then any segment both parallel to it and below it must be pooled as well. For the sake of contradiction, assume to the contrary that the top segment is pooled while a parallel bottom segment is cleaved. There must be some node along the bottom segment that was responsible for this cleaving through its marginal effect. However, the sphere of influence of this marginal effect cuts through the top segment too, cleaving the top segment into distinct pooled sets, a contradiction.

Of course, as mentioned in 4, it is possible for marginal increments to exactly offset each other so that, despite two feature combinations \(k_1, k_2\) being influenced by two different spheres of influence, \(\beta_{k_1} = \beta_{k_2}\) (in this proof, we overload our notation and use subscripts \(k_i\) to denote distinct feature combinations for readability). However, we do not want to pool these feature combinations together because adding very little noise to one of the corresponding active \(\alpha_{k^{\prime}}\) would immediately render \(\beta_{k_1} \neq \beta_{k_2}\), i.e., this is a measure zero event.

The above shows that permissibility (with condition [profile-permissibility:3] limited to condition [profile-permissibility:3a]) is necessary from just using the spheres of influence of marginal effects, and nothing more. However, this version of permissibility is also sufficient: any permissible partition \(\Pi_0\) (as per 7) can be shown to derive from a common set of nodes \(k_1, \dots, k_n\) so that each \(\pi \in \Pi_0\) can be written as \(\pi = A_{k_1}^{a_1} \cap \cdots \cap A_{k_n}^{a_n}\), where \(a_i \in \{1,c\}\), i.e., denoting either the sphere of influence or its complement. A quick proof sketch is as follows. First, we define the spheres through \(\Pi_0\). For each \(\pi_i \in \Pi_0\), take \(k_i = \min \pi_i\), its unique minimum. These \(k_i\) will be the nodes that generate the spheres of influence. We will show that for any \(p_1, p_2 \in \pi_i \in \Pi_0\), \(p_1 \in A_{k_j} \iff p_2 \in A_{k_j}\). Observe that for any \(p \in \pi_i\), trivially \(p \in A_{k_i}\). Now consider the condition when \(p \in A_{k_j}\) for \(k_j \neq k_i\). Then, \(p > k_j\). It is not possible that \(k_j \not\lessgtr k_i\) as this would violate the parallel splits criteria (one can show there would be another \(\pi^{\prime} \in \Pi_0\) containing \(p\)). It is also not possible for \(k_j > k_i\) as this would violate convexity (\(p > k_j > k_i\) would imply \(k_j \in \pi_i\)). Thus, if \(p \in \pi_i\) and \(p \in A_{k_j}\), then necessarily \(k_j \leq k_i\). Then, it follows that if \(p_1, p_2 \in \pi_i\), then \(p_1 \in A_{k_j} \iff p_2 \in A_{k_j}\). Therefore, all members of each \(\pi \in \Pi_0\) are in the same unique intersection of spheres of influence, so each \(\pi \in \Pi_0\) is uniquely represented as \(\pi = A_{k_1}^{a_1} \cap \cdots \cap A_{k_n}^{a_n}\).

It is easy to see, with an analogous sphere-of-influence argument for \(\boldsymbol{\gamma}\), that permissibility (with condition [profile-permissibility:3] limited to [profile-permissibility:3b] this time) is a necessary and sufficient characterization of pools from active marginals in \(\boldsymbol{\gamma}\).

\(\square\)

When we wish to learn heterogeneity in \(\boldsymbol{\beta}\), there is no reason to prefer one parameterization of climbing the Hasse over the other. Since the parallel splitting criteria are linked to robustly estimating the pools of heterogeneity, we impose both of them simultaneously. This does run the risk of generating more granular partitions as a result of the stronger restrictions, but this is a small price to pay for robustly estimating heterogeneity when one wishes to be agnostic about the system. Hence the full criterion for permissibility in 7 respects parallel splits from both above (condition [profile-permissibility:3a]) and below (condition [profile-permissibility:3b]).

One might imagine that asking for more restrictions can complicate the search process. However, a by-product of being agnostic to the direction of Hasse traversal is that there is a bijective mapping between the \(\boldsymbol{\Sigma}\) partition matrices and permissible partitions. We show in 2 that this significantly reduces the size of the model class, and later show in 3 that the size of the RPS, which is our primary estimation goal, is only polynomial. This has very important practical implications for computational feasibility.

We discuss specific examples of using the \(\boldsymbol{\Sigma}\) matrix to represent partitions in Examples [example:hasse-1], [example:hasse-2], [example:hasse-3] below.

 

Figure 8: Hasse diagram for Examples [example:hasse-1] and [example:hasse-2]. The partition described in Example [example:hasse-1] is shown in blue ellipses on the left panel. The right panel describes a different admissible partition in red ellipses seen in Example [example:hasse-2].

Consider an example with \(M=2\) features, each with \(R = 3\) discrete values, \(\{1, 2, 3\}\). Then there are \(K=R^M = 9\) different feature combinations. The Hasse diagram is shown in Figure 8. In the partition shown in the left panel, we pool \((2, 2)\) with \((3, 2)\) and \((2, 3)\) with \((3, 3)\). The corresponding \(\boldsymbol{\Sigma}\in \{0, 1\}^{2 \times 2}\) matrix for this profile is \[\begin{align} \boldsymbol{\Sigma}&= \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}. \end{align}\] This indicates that we split variants with value 1 from value 2 in the first feature (by \(\Sigma_{11} = 0\)) and pool variants of value 2 with value 3 in the first feature (by \(\Sigma_{12} = 1\)). Further, we pool variants with value 1 and value 2 in the second feature (by \(\Sigma_{21} = 1\)) and split variants with value 2 from value 3 in the second feature (by \(\Sigma_{22} = 0\)).

Consider the same setup in Example [example:hasse-1] with \(M=2\) features, each with \(R = 3\) discrete values, \(\{1, 2, 3\}\). Another permissible partition can be defined by the matrix \[\begin{align} \boldsymbol{\Sigma}&= \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}. \end{align}\] The pools are \(\pi_1 = \{(a, b) \mid a = \{1, 2\}, b = \{1, 2, 3\}\}\) and \(\pi_2 = \{(a, b) \mid a = \{3\}, b = \{1, 2, 3\}\}\). This is illustrated in the right panel of Figure 8.

Figure 9: Hasse diagram for Example [example:hasse-3]. The admissible partition is shown in blue ellipses.

Consider a different setup with \(M=2\) features. The first feature takes on \(R_1 = 5\) discrete values \(\{1, 2, 3, 4, 5\}\) and the second feature takes on \(R_2 = 3\) discrete values, \(\{1, 2, 3\}\). A permissible partition can be defined by the matrix

\[\begin{align} \boldsymbol{\Sigma}&= \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & - & - \end{bmatrix}, \end{align}\] where we use “\(-\)” to denote that the second feature does not have dosages corresponding to those entries in the \(\boldsymbol{\Sigma}\) matrix. The pools are \(\pi_1 = \{(a, b) \mid a, b \leq 2\}\), \(\pi_2 = \{(a, b) \mid a \leq 2, b = 3\}\), \(\pi_3 = \{(a, b) \mid a = 3, 4, b \leq 2\}\), \(\pi_4 = \{(a, b) \mid a = 3, 4, b = 3\}\), \(\pi_5 = \{(a, b) \mid a = 5, b \leq 2\}\), and \(\pi_6 = \{(5, 3)\}\). This is illustrated in Figure 9.
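The mapping from a \(\boldsymbol{\Sigma}\) matrix to its pools can be made mechanical. The sketch below (with names of our own choosing) reads each row as pool/split flags between consecutive levels and forms pools as Cartesian products of the resulting level groups; running it on the matrix of this example reproduces \(\pi_1, \dots, \pi_6\).

```python
from itertools import product

def sigma_to_pools(sigma_rows):
    """Decode a Sigma matrix (one row of 0/1 flags per feature; entry j == 1
    pools level j with level j + 1) into the pools of the partition."""
    level_groups = []
    for row in sigma_rows:
        groups, current = [], [1]
        for level, pooled in enumerate(row, start=2):
            if pooled:
                current.append(level)
            else:
                groups.append(current)
                current = [level]
        groups.append(current)
        level_groups.append(groups)
    # a pool is the Cartesian product of one level group per feature
    return [set(product(*choice)) for choice in product(*level_groups)]

# Example 3: feature 1 has 5 levels with row [1, 0, 1, 0];
#            feature 2 has 3 levels with row [1, 0]  ->  six pools
for pool in sigma_to_pools([[1, 0, 1, 0], [1, 0]]):
    print(sorted(pool))
```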

One can quickly verify that Examples [example:hasse-1] - [example:hasse-3] satisfy permissibility as defined in 7 by visual inspection and identifying the corresponding \(\boldsymbol{\Sigma}\) matrices. In Example [example:hasse-not-permissible] below, we show an example of a partition that is not permissible. Interestingly, there is a valid decision tree that arrives at this partition.

 

Figure 10: Hasse diagram with the partition that is not permissible described in Example [example:hasse-not-permissible]. The pools are \(\pi_1 = \{(1, 1), (1, 2), (1, 3)\}\), \(\pi_2 = \{(2, 1), (2, 2) \}\), \(\pi_3 = \{(3, 1), (3, 2) \}\), and \(\pi_4 = \{(2, 3), (3, 3) \}\). The decision tree illustrates how to generate this partition.

Consider the same setup in Example [example:hasse-1] with \(M=2\) features, each with \(R = 3\) discrete values, \(\{1, 2, 3\}\). In Figure 10, we illustrate a partition that is not permissible. This is not permissible because we have pools \(\pi_1 = \{(1, 1), (1, 2), (1, 3)\}\), \(\pi_2 = \{(2, 1), (2, 2) \}\), \(\pi_3 = \{(3, 1), (3, 2) \}\), and \(\pi_4 = \{(2, 3), (3, 3) \}\). Permissibility (see condition [profile-permissibility:3] of Definition 7) says that if \(\pi_1\) is in the partition, then feature combinations \((\cdot, 2)\) and \((\cdot, 3)\) should always be pooled together. This contradicts what we observe in \(\pi_2\), \(\pi_3\), and \(\pi_4\). Similarly, if \(\pi_4\) is in the partition, permissibility would require that feature combinations \((2, \cdot)\) and \((3, \cdot)\) be pooled together, which contradicts \(\pi_2\) and \(\pi_3\). Since this partition is not permissible, we cannot represent it using the \(\boldsymbol{\Sigma}\) matrix.

To see why this is not permissible from the marginal perspective, let us look at \(\pi_3\) and \(\pi_4\). From these pools, it is evident that \(\alpha_{2,3} \neq 0\), \(\alpha_{3,1} \neq 0\), and \(\alpha_{3,3} \neq 0\). And we know that, \[\begin{align} \beta_{3,3} &= \alpha_{3,3} + \alpha_{3,2} + \alpha_{3,1} + \alpha_{2,3} + C \\ \beta_{2,3} &= \alpha_{2,3} + C, \end{align}\] where the term \(C\) is common to both equations. In the current pooling, it so happens that \(\beta_{3,3} = \beta_{2,3}\) – the terms \(\alpha_{3,3}, \alpha_{3,2}, \alpha_{3,1}\) jointly make this true by \(\alpha_{3,3} + \alpha_{3,2} + \alpha_{3,1} = 0\). However, we know that \(\alpha_{3,1} \neq 0\). If we add some noise \(\varepsilon > 0\) to \(\alpha_{3,1}\) to get \(\alpha^{\prime}_{3,1} := \alpha_{3,1} + \varepsilon\), then \(\beta_{3,3} \neq \beta_{2,3}\) anymore. In other words, the pool \(\pi_4\) is not robust to noise in the non-zero marginals as any noise will almost surely break \(\pi_4\) into \(\{(2,3)\}\) and \(\{(3, 3)\}\) as separate pools. Hence, this partition is not permissible.

Decision trees are not robust in this sense as they may generate partitions that are not permissible. The right panel of Figure 10 illustrates a decision tree that generates the partition that is not permissible discussed in this example.

11.1 Pooling across profiles.↩︎

So far, we have been describing how to pool different feature combinations if they belong to the same profile. Now, we turn our attention to pooling across profiles. Definition 6 captures permissibility within a single profile, but we also want to consider pooling across profiles. For example, Definition 7 does not speak to the question of pooling decisions for adding Ibuprofen, as a temporary pain reliever, to a prescription of Amoxicillin against a bacterial infection. Does introducing Ibuprofen make an appreciable difference (offering the patient relief while waiting for the antibiotic to clear the infection) or not (because the antibiotic itself offers pain relief by attacking the root cause)?

In order to reason about this, we consider a partial ordering of the profiles themselves using their binary representation. This also allows us to embed the profiles in an \(M\)-d unit hypercube with profiles as the vertices. By the same intuition behind convexity, we can pool two profiles if they are reachable on this hypercube. We can generalize the marginal re-parameterization to allow for marginal gains when moving between profiles through a \(\boldsymbol{\delta}\) vector, \[\begin{align} \mathbf{y}&= \mathbf{F} \boldsymbol{\alpha}+ \mathbf{A} \boldsymbol{\delta}+ \epsilon, \label{eq:tva-regression-general} \\ A_{i, (\ell, \rho)} &= \mathbb{I} \left\{ k (i) > \ell \ \cap\ \rho (k(i)) > \rho(\ell) = \rho \right\}, \nonumber \\ \beta_k &= \sum_{k^{\prime} \leq k; \rho(k) = \rho(k^{\prime})} \alpha_{k^{\prime}} + \sum_{k^{\prime} < k} \sum_{\rho; \rho(k^{\prime}) < \rho(k)} \delta_{k^{\prime}, \rho}. \end{align}\tag{13}\] Here, observe that \(\boldsymbol{\delta}\) is indexed by the profile \(\rho\) as well as the feature \(k^{\prime}\). Being feature-specific gives the freedom to pool across profiles without imposing strong cross-profile restrictions that prevent measure zero events.

By setting \(\boldsymbol{\delta}= \mathbf{0}\), we can immediately see that Equation 13 is a generalization of Equation 11 . In fact, if, depending on the context, we do not want to pool profile \(\rho_1\) with \(\rho_2\), this corresponds to setting the appropriate entries in \(\boldsymbol{\delta}\) to 0. This is exactly what [6] do in their analysis of cross-randomized behavioral nudges for improving immunization.

We formalize this in Definition 8.

Definition 8 (Permissible partition). A partition \(\Pi\) of the entire feature space \(\mathcal{K}\) is permissible if:

  1. for every profile \(\rho_0\), the partition induced by \(\Pi\) on \(\rho_0\), \(\Pi_0 = \{ \pi \setminus \{k \mid \rho(k) \neq \rho_0 \} \mid \pi \in \Pi \}\) is permissible by 7, and

  2. for every \(\boldsymbol{\beta}\) that generates \(\Pi\), with respect to the Lebesgue measure, the support of \(\boldsymbol{\delta}(\boldsymbol{\beta})\), \(S_{\boldsymbol{\delta}(\boldsymbol{\beta})} = \{\delta_{k, \rho} \neq 0 \mid \delta_{k, \rho} \in \boldsymbol{\delta}(\boldsymbol{\beta})\}\), is measurable.

Specifically, by allowing pooling across different profiles, 8 naturally allows us to explore heterogeneity in treatment effects where treatment and control are two distinct profiles. We illustrate this in the empirical data analysis of microcredit access in 8.

9 gives an equivalent geometric interpretation of 8 through the Hasse.

Definition 9 (Permissible partition). A partition \(\Pi\) of the entire feature space \(\mathcal{K}\) is permissible if and only if the following hold true:

  1. for every profile \(\rho_0\), the partition induced by \(\Pi\) on \(\rho_0\), \(\Pi_0 = \{ \pi \setminus \{k \mid \rho(k) \neq \rho_0 \} \mid \pi \in \Pi \}\) is permissible by 7,

  2. every \(\pi \in \Pi\) is connected in feature levels across profiles i.e., if \(k^{(1)}, k^{(2)} \in \pi\) such that \(\rho_1 = \rho(k^{(1)})\) and \(\rho_2 = \rho(k^{(2)})\) are adjacent on the hypercube, then there are feature combinations \(k^{(3)}, k^{{(4)}} \in \pi\) such that \(\rho(k^{(3)}) = \rho_1\), \(\rho(k^{(4)}) = \rho_2\) and \(\left\lVert k^{(3)} - k^{(4)}\right\rVert_1 = 1\),6 and

  3. every \(\pi \in \Pi\) is connected in profiles i.e., if \(\pi\) contains feature combinations from profiles \(\rho_0\) and \(\rho_k\) where \(\rho_0 < \rho_k\), then \(\pi\) also contains features in profiles \(\rho_1, \dots, \rho_{k-1}\) such that \(\left\lVert\rho_i - \rho_{i+1}\right\rVert_0 = 1\) for \(i = 0, \dots, k-1\).7

This representation agrees with our definition of permissibility in 8. Case [permissibility:case1] follows from the fact that this is a generalization of Equation 11 . Cases [permissibility:case2] and [permissibility:case3] follow from the definition of the \(\mathbf{A}\) matrix. For example, consider two feature combinations \(k^{(1)}, k^{(2)}\) that belong to two different profiles. We can only pool them if they are variants, i.e., \(||k^{(1)} - k^{(2)}||_1 = 1\). If they are variants, then the two profiles must be adjacent on the \(M\)-d hypercube.

At this point, it is important to note that there are no restrictions such as the parallel splitting criteria across different profiles. This is because the marginal \(\delta_k\) only contributes to the outcome across profiles i.e., from the perspective within a profile, the sphere of influence of \(\delta_k\) is indistinguishable from the sphere of influence of \(\alpha_{k^{\prime}}\) where \(k^{\prime}\) is at the lower boundary of the Hasse adjacent to the Hasse of \(k\). Since each pair of feature variants from different profiles have different across-profile marginals \(\delta_k\), they are not coupled together like the \(\boldsymbol{\alpha}\) marginals are.

Just like the \(\boldsymbol{\Sigma}\) matrix within each profile, we can also construct the intersection matrix \(\boldsymbol{\Sigma}^{\cap}\) to denote how features are pooled across two adjacent profiles. Consider partitions induced by \(\Pi\) on two profiles \(\rho_1\) and \(\rho_2\). Let us call these \(\Pi_1, \Pi_2\) respectively. \(\boldsymbol{\Sigma}^{\cap} \in \{0, 1, \infty\}^{\left\lvert\Pi_1\right\rvert \times \left\lvert\Pi_2\right\rvert}\), where \(\boldsymbol{\Sigma}_{i, j}^{\cap} = 0\) means that pools \(\pi_i \in \Pi_1\) and \(\pi_j \in \Pi_2\) are poolable according to [permissibility:case2] of 9 but are not pooled together in \(\Pi\). \(\boldsymbol{\Sigma}_{i, j}^{\cap} = 1\) means that these pools are poolable and are indeed pooled in \(\Pi\). Finally, \(\boldsymbol{\Sigma}_{i, j}^{\cap} = \infty\) means that these pools are not poolable by 9. Observe that if \(\boldsymbol{\Sigma}^{\cap}_{i, j} = 1\), then \(\boldsymbol{\Sigma}^{\cap}_{i, -j} = \infty\) and \(\boldsymbol{\Sigma}^{\cap}_{-i, j} = \infty\) in order to respect [permissibility:case1] of 9. This object is useful in our enumeration step in Algorithm [alg:r-aggregate].

11.2 Examples of other permissibility restrictions↩︎

Estimation strategies, in general, implicitly take some stand on partition structure. They impose permissibility restrictions, though they are not often presented formally as such. These restrictions are generated by the choice of technique, rather than through any specific scientific consideration. In the following examples, we show how these techniques can be framed as permissibility restrictions by defining them as partitions on the Hasse. This involves identifying sets of equivalent edges for each technique that, when removed together, generate corresponding partitions learned by that technique.

First, take a saturated or “long” regression. Here, every possible feature combination is its own pool i.e., the partition is the most granular partition possible. For example, consider the treatment outcome as a function of dosages of two drugs, \(A\), \(B\), and weight of the treated individual, \(W\). Suppose that each variable takes on three discrete levels. Then, \(\Pi^{\text{long}} = \{\{(a,b,w)\}: (a,b,w) \in \{0,1,2\}^2\times\{\text{low, med, high}\}\}\), which has 27 elements.

Second, consider a “short” regression where the researcher does not include all relevant variables in the regression. Then, the partition generated is identical to the long regression for variables included in the model i.e., it pools across all excluded variables. For example, if the researcher ignores weight, then \(\Pi^{\text{short}} = \{\{(a,b,w): w \in \{\text{low, med, high}\}\}: (a,b) \in \{0,1,2\}^2\}\), which pools across weight and has 9 elements, as sketched below.
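For concreteness, here is a minimal sketch (variable names ours) of the partitions implied by the long and short regressions in the drug-dosage example; the counts of 27 and 9 pools match the text.

```python
from itertools import product

dosages = [0, 1, 2]                        # levels of drugs A and B
weights = ["low", "med", "high"]

# Saturated ("long") regression: every feature combination is its own pool.
pi_long = [{(a, b, w)} for a, b, w in product(dosages, dosages, weights)]

# "Short" regression omitting weight: pool across the excluded variable.
pi_short = [{(a, b, w) for w in weights} for a, b in product(dosages, dosages)]

print(len(pi_long), len(pi_short))         # 27 pools vs. 9 pools
```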

Third, say the researcher uses Lasso to regularize the data, setting marginal dosage or weight increase effects to 0 and thereby generating pools. Then \(\Pi^{\text{Lasso}}\) is bijectively determined by the support of the Lasso: the zeroed elements generate the pooling structure.8

Fourth, consider decision trees, perhaps the most common approach used beyond imposed short and long regressions. In a decision tree, at every node, the statistician chooses whether or not to split based on the value of a given arm (e.g., Amoxicillin greater than 250mg). Conditional on this split, following a given decision, another variable is selected, and the process repeats recursively until termination (maybe defined by some maximum number of splits or until no more splits are possible). It is useful to note that this procedure generates parallel splits in the Hasse, conditional on previously made splits. Therefore, decision trees generate convex partitions.

It is useful to note that the decision trees are captured by the equivalent edge framing in a conditional setting. Initially, the set of equivalent edges for a decision tree is identical to those of the robust partitions we consider in this paper. However, the edges to split upon are chosen sequentially (rather than jointly, as in the case of robust partitions). Thus, after each split, the set of equivalent edges changes. Specifically, each set of equivalent edges could get decomposed into two smaller sets of equivalent edges upon splitting a different equivalence class (this means that there are more than \(2^n\) possible partitions where there were \(n\) equivalence classes originally). In other words, edges are equivalent conditional on previously made splits. The binary \(\sigma\) vector we used to represent splitting of jointly equivalent edges can be generalized to a list of indices that alternatingly indicate splitting and pooling along the edges. For example, we can order the edges in equivalence class \(E_i\). Then, the pooling decisions for \(E_i\) is represented by a list \(\sigma_i\) such that all edges until \(e_{\sigma_{i,1}}\) are pooled, all edges from \(e_{\sigma_{i,1}}\) to \(e_{\sigma_{i,2}}\) are split, all edges from \(e_{\sigma_{i,2}}\) to \(e_{\sigma_{i,3}}\) are pooled, and so on. This is essentially a tree data structure, i.e., the \(\sigma\) data structure is a tree in the limit where we only have conditionally equivalent edges (but not jointly). The key point here is the perspective of encoding the partition by keeping track of splits (made in equivalence classes) is general, takes us to scientifically relevant settings beyond robust partitions and decision trees, and generates vast improvements in refinements as well. However, the pooling restrictions imposed by trees suffer from a coherency issue, which we explore in our discussion of causal trees. We walk through a detailed example in [example:hasse-not-permissible].
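To fix ideas, the sketch below expands one such list \(\sigma_i\) into per-edge pool/split labels for an ordered equivalence class. The 0-based cut-point convention is our own illustrative reading of the data structure, not necessarily the one used in our implementation.

```python
def expand_sigma(sigma_i, n_edges):
    """Expand an alternating index list sigma_i into per-edge labels for an
    ordered equivalence class with n_edges edges: edges before the first cut
    are pooled, edges up to the next cut are split, then pooled, and so on."""
    labels, pooled, start = [], True, 0
    for cut in list(sigma_i) + [n_edges]:
        labels.extend(["pool" if pooled else "split"] * (cut - start))
        pooled, start = not pooled, cut
    return labels

print(expand_sigma([2, 4], 6))   # ['pool', 'pool', 'split', 'split', 'pool', 'pool']
```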

Decision trees cannot natively estimate heterogeneities in treatment effects. Causal trees [56] and causal random forests [10] can do this natively by modifying the fit criterion used to make splits. The fit criterion for causal trees is the MSE of the treatment effect. So each leaf of the tree needs to contain both treatment and control observations to estimate the treatment effect. The partitions generated by causal trees are identical to those of decision trees if we ignore the treatment indicator, as \((1, x)\) and \((0, x)\) are always in the same leaf or pool. Here, \(x\) is the feature combination, and 1/0 is the binary treatment indicator variable.

This differs from our robust partitions because we allow for pools with \(\{ (1, x_1), (1, x_2), (0, x_1)\}\) and \(\{ (0, x_2) \}\) by the cross-Hasse pooling rules. On the other hand, causal trees will split them as \(\{(1, x_1), (0, x_1)\}\) and \(\{(1, x_2), (0, x_2)\}\) (or pool them together) even though \((0, x_1)\) and \((1, x_2)\) have the same outcome. This raises a conceptual issue. If \((1, x_2)\) and \((1, x_1)\) are equivalent to each other, and \((1, x_1)\) and \((0, x_1)\) are equivalent to each other, then by transitivity, \((1, x_2)\) and \((0, x_1)\) should be equivalent to each other as well. However, causal trees force them apart because \((0, x_1)\) and \((0, x_2)\) are not equivalent to each other. Such a pooling restriction appears incoherent. Besides this, since we allow for flexibility through the transitivity of equivalences, we enjoy statistical properties such as lower bias and variance.

Causal random forests are an aggregation over sampled partitions, \(\{\Pi_b: \ b = 1,\ldots, B\}\), each of which is bijective with a partition created by causal trees.

Figure 11: Hasse diagram of an admissible partition that is not strongly convex but is robust. This partition cannot be represented by a decision tree.

In TVA [6], the authors study a Hasse diagram directly and impose permissibility restrictions slightly different from those of the present paper. In that setting, the natural structure allows only one-sided convexity. Note that this is disallowed by a decision tree yet scientifically desirable in certain settings.

Here, the equivalent edges framework does not hold. Consider the \(2 \times 2\) Hasse with \(\{11, 12, 21, 22\}\) as the four feature combinations. TVA allows for the following partitions of the Hasse:

  • \(\Pi_1 = \left\{ \{11, 12, 21\}, \{22\} \right\}\)

  • \(\Pi_2 = \left\{ \{11\}, \{21\}, \{12, 22\} \right\}\)

  • \(\Pi_3 = \left\{ \{11\}, \{12\}, \{21, 22\} \right\}\)

  • \(\Pi_4 = \left\{ \{11, 12, 21, 22\} \right\}\)

  • \(\Pi_5 = \left\{ \{11\}, \{12\}, \{21\}, \{22\} \right\}\)

  • \(\Pi_6 = \left\{ \{11, 12\}, \{21, 22\} \right\}\)

  • \(\Pi_7 = \left\{ \{11, 21\}, \{12, 22\} \right\}\)

In this \(2 \times 2\) Hasse, there are four edges \(\{\langle 11, 12 \rangle\), \(\langle 11, 21 \rangle\), \(\langle 12, 22 \rangle\), \(\langle 21, 22\rangle \}\). The partitions listed above illustrate that no edge is truly equivalent to another edge in the Hasse despite TVA imposing convexity restrictions. In other words, the set of equivalent edges appears to be degenerate i.e., each equivalence class is a singleton. However, there is a well-defined structure. We are not free to arbitrarily split on edges. For example, \(\Pi_8 = \left\{ \{11, 12, 22\}, \{21\} \right\}\) is not permissible as it violates convexity. This is why we say that the equivalent edges framing does not hold here.

However, we can still generalize the \(\sigma\) data structure to efficiently store this partition. This is because these partitions are “parallel from below,” i.e., if a feature combination \(k^{(1)} = [r_1, \dots, r_i, \dots, r_m]\) and \(k^{(2)} = [r_1, \dots, r_i + 1, \dots, r_m]\) are split, then all pairs of feature combinations \(k^{(3)} = [s_1, \dots, r_i, \dots, s_m]\) and \(k^{(4)} = [s_1, \dots, r_i + 1, \dots, s_m]\) where \(k^{(3)} > k^{(1)}\) and \(k^{(4)} > k^{(2)}\) are also split. Therefore, using the same set of equivalent edge decomposition as our robust partitions (described below), we allow \(\sigma_i\) to denote a vector of largest levels in all features \(j\), \(\ell_j\), besides the \(i\)-th one such that \(k^{(3)} = [\dots, \ell_j, \dots, r_i, \dots]\) and \(k^{(4)} = [\dots, \ell_j, \dots, r_i + 1, \dots]\) are pooled.

12 Laplace approximation and generalized Bayesian inference↩︎

12.1 Laplace approximation.↩︎

Here, we briefly outline how to approximate the full posterior using the Rashomon set and Laplace’s method. Our goal is to estimate \[\begin{align} p(\boldsymbol{\beta}\mid \mathbf{Z}) &= \sum_{\Pi \in \mathcal{P}_{\theta}} p(\boldsymbol{\beta}\mid \mathbf{Z}, \Pi) \mathbb{P}(\Pi \mid \mathbf{Z}) \end{align}\]

We will do this by constructing a specific data-generating process. Consider the following data-generating process after fixing a partition \(\Pi\). For each pool \(\pi_j \in \Pi\), draw \(\gamma_j \overset{i.i.d.}{\sim} \mathcal{N}(\mu_0, \tau^2)\) i.e., draw \(\boldsymbol{\gamma} \sim \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Lambda}_0)\).9 Here, \(\boldsymbol{\gamma} \in \mathbb{R}^{\left\lvert\Pi\right\rvert}\) and \(\boldsymbol{\Lambda}_0 = \tau^2 \mathcal{I}_{\left\lvert\Pi\right\rvert}\), where \(\mathcal{I}_m\) is an identity matrix of size \(m\).

Then, we can define a transformation matrix \(\boldsymbol{P}\in \{0, 1\}^{K \times \left\lvert\Pi\right\rvert}\), where \(K\) is the number of possible feature combinations, that assigns the \(\gamma\) of each pool to the feature combinations in that pool, \[\begin{align} P_{ij} &= \begin{cases} 1, & \text{feature combination } i \in \pi_j \\ 0, & \text{else} \end{cases}. \end{align}\] The mean vector for the feature combinations is given by \(\boldsymbol{\beta}= \boldsymbol{P}\boldsymbol{\gamma}\). By properties of the multivariate normal, we have \(\boldsymbol{\beta}\mid \Pi \sim \mathcal{N}(\boldsymbol{\mu}_{\Pi}, \boldsymbol{\Lambda}_{\Pi})\), where \(\boldsymbol{\mu}_{\Pi} = \boldsymbol{P}\boldsymbol{\mu}_0\) and \(\boldsymbol{\Lambda}_{\Pi} = \boldsymbol{P}\boldsymbol{\Lambda}_0 \boldsymbol{P}^{\intercal}\). Specifically, note that the feature combinations in a given pool do not merely share a common prior mean; their coefficients are identical to one another by construction.
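A minimal sketch of the pool-assignment matrix \(\boldsymbol{P}\) and the induced \(\boldsymbol{\beta}= \boldsymbol{P}\boldsymbol{\gamma}\) (the function name and toy partition are ours):

```python
import numpy as np

def pool_assignment_matrix(partition, K):
    """P in {0,1}^{K x |Pi|}: P[i, j] = 1 iff feature combination i lies in pool j.
    Feature combinations are indexed 0..K-1; 'partition' is a list of index sets."""
    P = np.zeros((K, len(partition)), dtype=int)
    for j, pool in enumerate(partition):
        for i in pool:
            P[i, j] = 1
    return P

# Toy example: K = 4 feature combinations grouped into two pools.
partition = [{0, 1}, {2, 3}]
P = pool_assignment_matrix(partition, K=4)
gamma = np.array([0.3, -1.2])          # one coefficient per pool
beta = P @ gamma                       # beta is constant within each pool
print(beta)                            # [ 0.3  0.3 -1.2 -1.2]
```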

Then, given some feature combinations \(\mathbf{D}\), we draw the outcomes as \[\begin{align} \mathbf{y}\mid \mathbf{D}, \boldsymbol{\beta}, \Pi &\sim \mathcal{N}(\mathbf{D}\boldsymbol{\beta}, \boldsymbol{\Sigma}) \implies \mathbf{y}\mid \mathbf{D}, \boldsymbol{\gamma}, \Pi \sim \mathcal{N}(\mathbf{D}\boldsymbol{P}\boldsymbol{\gamma}, \boldsymbol{\Sigma}). \end{align}\] Therefore, \(\boldsymbol{\gamma}\mid \mathbf{Z}, \Pi \sim \mathcal{N} \left( \boldsymbol{\mu}_n, \boldsymbol{\Lambda}_n^{-1} \right)\), where \(\boldsymbol{\Lambda}_n\) denotes the posterior precision, \[\begin{align} \boldsymbol{\mu}_n &= \boldsymbol{\Lambda}_n^{-1} \left((\mathbf{D}\boldsymbol{P})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{D}\boldsymbol{P}) \widehat{\boldsymbol{\gamma}} + \boldsymbol{\Lambda}_0^{-1} \boldsymbol{\mu}_0 \right) \\ \boldsymbol{\Lambda}_n &= (\mathbf{D}\boldsymbol{P})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{D}\boldsymbol{P}) + \boldsymbol{\Lambda}_0^{-1} \\ \widehat{\boldsymbol{\gamma}} &= ((\mathbf{D}\boldsymbol{P})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{D}\boldsymbol{P}))^{-1} (\mathbf{D}\boldsymbol{P})^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{y} \end{align}\]

Next, \(\mathbb{P}(\Pi \mid \mathbf{Z}) = \mathbb{P}(\Pi) \int_{\boldsymbol{\gamma}^{\prime}} p(\mathbf{Z}\mid \boldsymbol{\gamma}^{\prime}, \Pi) p(\boldsymbol{\gamma}^{\prime} \mid \Pi) d\boldsymbol{\gamma}^{\prime}\). We know that \(\mathbb{P}(\Pi) = C \exp\{-\lambda \left\lvert\Pi\right\rvert\}\) where \(C\) is the normalization constant (or the partition function). Therefore, we have

\[\begin{align} \mathbb{P}(\Pi \mid \mathbf{Z}) &= C A_{1, \Pi} A_{2, \Pi} e^{-\lambda \left\lvert\Pi\right\rvert} \int_{\boldsymbol{\gamma}^{\prime}} \exp \left\{-g(\boldsymbol{\gamma}^{\prime})\right\} d\boldsymbol{\gamma}^{\prime}, \\ g(\boldsymbol{\gamma}^{\prime}) &= \frac{1}{2} (\boldsymbol{\gamma}^{\prime} - \boldsymbol{\mu}_{0})^{\top} \boldsymbol{\Lambda}_{0}^{-1} (\boldsymbol{\gamma}^{\prime} - \boldsymbol{\mu}_{0}) + \frac{1}{2} (\mathbf{y}- \mathbf{D}\boldsymbol{P}\boldsymbol{\gamma}^{\prime})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{y}- \mathbf{D}\boldsymbol{P}\boldsymbol{\gamma}^{\prime}) \end{align}\] where \(A_{1, \Pi}, A_{2, \Pi}\) are known normalizing constants from the normal distributions. Since \(g\) is quadratic in \(\boldsymbol{\gamma}^{\prime}\), we can evaluate the integral in closed form, \[\begin{align} \int_{\boldsymbol{\gamma}^{\prime}} \exp \left\{-g(\boldsymbol{\gamma}^{\prime})\right\} d\boldsymbol{\gamma}^{\prime} &= (2 \pi)^{\left\lvert\Pi\right\rvert/2} \textrm{det}(\boldsymbol{\Lambda}_{0}^{-1} + (\mathbf{D}\boldsymbol{P})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{D}\boldsymbol{P}))^{-1/2} e^{-g(\boldsymbol{\gamma}^{\star})} \\ \implies \mathbb{P}(\Pi \mid \mathbf{Z}) &= C A_{1, \Pi} A_{2, \Pi} e^{-\lambda \left\lvert\Pi\right\rvert - g(\boldsymbol{\gamma}^{\star})} (2 \pi)^{\left\lvert\Pi\right\rvert/2} \textrm{det}(\boldsymbol{\Lambda}_{0}^{-1} + (\mathbf{D}\boldsymbol{P})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{D}\boldsymbol{P}))^{-1/2}, \\ \boldsymbol{\gamma}^{\star} &= (\boldsymbol{\Lambda}_{0}^{-1} + (\mathbf{D}\boldsymbol{P})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{D}\boldsymbol{P}))^{-1} (\boldsymbol{\Lambda}_{0}^{-1} \boldsymbol{\mu}_{0} + (\mathbf{D}\boldsymbol{P})^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{y}). \end{align}\]
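For intuition, the unnormalized log posterior over partitions can be computed without assembling the constants \(A_{1,\Pi}, A_{2,\Pi}\) explicitly: integrating out \(\boldsymbol{\gamma}\) is equivalent to evaluating the Gaussian marginal likelihood \(\mathbf{y}\mid \Pi \sim \mathcal{N}(\mathbf{D}\boldsymbol{P}\boldsymbol{\mu}_0, \tau^2 \mathbf{D}\boldsymbol{P}(\mathbf{D}\boldsymbol{P})^{\top} + \boldsymbol{\Sigma})\). The sketch below (our own function name, assuming \(\boldsymbol{\Sigma}= \sigma^2 \mathcal{I}_n\)) returns \(\log \mathbb{P}(\Pi \mid \mathbf{Z})\) up to the constant \(\log C\), which is common to all partitions and hence drops out when comparing candidates.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_partition_posterior(y, D, P, mu0, tau2, sigma2, lam):
    """Unnormalized log P(Pi | Z): the l_0 log prior, -lambda * |Pi|, plus the
    log marginal likelihood of y with gamma ~ N(mu0 * 1, tau2 * I) integrated
    out analytically (a sketch assuming Sigma = sigma2 * I)."""
    X = D @ P                                      # n x |Pi| pooled design
    n, n_pools = X.shape
    mean = X @ np.full(n_pools, mu0)
    cov = tau2 * (X @ X.T) + sigma2 * np.eye(n)
    return -lam * n_pools + multivariate_normal.logpdf(y, mean=mean, cov=cov)
```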

This allows us to estimate the original quantity of interest, \(p(\boldsymbol{\beta}\mid \mathbf{Z})\) through a variable transformation of \(\boldsymbol{\gamma}\) where all the constants are known except for \(C\). Exactly computing \(C\) is NP-Hard and is reminiscent of partition functions used in graphical models (not to be confused with the “partition” that we are using in this work). See [57] for a survey of methods used to estimate or approximate the constant \(C\).

12.2 Generalized Bayesian inference.↩︎

We have our mean squared error for a given partition \(\Pi\), \[\begin{align} \mathcal{L}(\Pi ; \mathbf{Z}) &= \frac{1}{n} \left( \mathbf{y}- \widehat{\mathbf{y}} \right)^{\intercal} \left( \mathbf{y}- \widehat{\mathbf{y}} \right), \nonumber \\ \widehat{y}_i &= \frac{\sum_{\pi \in \Pi} \mathbb{I}\{k(i) \in \pi\} \sum_{j} \mathbb{I}\{k(j) \in \pi\} y_j }{\sum_{\pi \in \Pi} \mathbb{I}\{k(i) \in \pi\} \sum_{j} \mathbb{I}\{k(j) \in \pi\}} \label{eq:loss-outcome-prediction} \end{align}\tag{14}\] where \(\widehat{y}_i\) is the mean outcome in the pool \(\pi \in \Pi\) containing the feature combination of unit \(i\), \(k(i)\).
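A minimal sketch of Equation 14 (names ours): \(\widehat{y}_i\) is the mean outcome among units whose feature combination lies in the same pool as unit \(i\)'s.

```python
import numpy as np

def pooled_mse(y, combo_of_unit, partition):
    """Mean squared error of Equation 14: y_hat_i is the mean outcome over all
    units whose feature combination falls in the same pool as unit i's.
    'partition' is a list of sets of feature-combination tuples."""
    y = np.asarray(y, dtype=float)
    y_hat = np.empty_like(y)
    for pool in partition:
        in_pool = np.array([k in pool for k in combo_of_unit])
        if in_pool.any():
            y_hat[in_pool] = y[in_pool].mean()
    return np.mean((y - y_hat) ** 2)

# Toy example: two pools, each containing one feature combination.
y = [1.0, 1.2, 3.0, 2.8]
combos = [(1, 1), (1, 1), (2, 1), (2, 1)]
print(pooled_mse(y, combos, [{(1, 1)}, {(2, 1)}]))   # 0.01
```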

Our goal is to show that minimizing \(\mathcal{L}(\Pi; \mathbf{Z})\) corresponds to maximizing the likelihood \(\mathbb{P}(\mathbf{y}\mid \mathbf{D}, \Pi)\). Consider the same data-generating process as in 12.1. Specifically, we require independence over \(\gamma_i\) and we will assume that the prior over \(\boldsymbol{\gamma}\) is diffuse i.e., \(\tau^2 \gg 1\). As before, given some feature combinations \(\mathbf{D}\), we draw the outcomes as \[\begin{align} \mathbf{y}\mid \mathbf{D}, \boldsymbol{\beta}, \Pi &\sim \mathcal{N}(\mathbf{D}\boldsymbol{\beta}, \boldsymbol{\Sigma}) \implies \mathbf{y}\mid \mathbf{D}, \boldsymbol{\gamma}, \Pi \sim \mathcal{N}(\mathbf{D}\boldsymbol{P}\boldsymbol{\gamma}, \boldsymbol{\Sigma}). \end{align}\] This allows us to find the likelihood, \[\begin{align} \mathbb{P}(\mathbf{y}\mid \mathbf{D}, \Pi) &= \int_{\boldsymbol{\beta}} \mathbb{P}(\mathbf{y}\mid \mathbf{D}, \Pi, \boldsymbol{\beta}) \mathbb{P}(\boldsymbol{\beta}\mid \Pi) d \boldsymbol{\beta}= \int_{\boldsymbol{\gamma}} \mathbb{P}(\mathbf{y}\mid \mathbf{D}, \Pi, \boldsymbol{\gamma}) \mathbb{P}(\boldsymbol{\gamma}\mid \Pi) d \boldsymbol{\gamma}\\ &= \int_{\boldsymbol{\gamma}} \mathbb{P}(\mathbf{y}\mid \mathbf{D}, \boldsymbol{P}, \boldsymbol{\gamma}) \mathbb{P}(\boldsymbol{\gamma}\mid \Pi) d \boldsymbol{\gamma}\\ &\propto \int_{\boldsymbol{\gamma}} \exp \left\{ - \frac{1}{2} ( \mathbf{y}- \mathbf{D}\boldsymbol{P}\boldsymbol{\gamma})^{\intercal} \boldsymbol{\Sigma}^{-1} ( \mathbf{y}- \mathbf{D}\boldsymbol{P}\boldsymbol{\gamma}) \right\} \exp \left\{ - \frac{1}{2} (\boldsymbol{\gamma}- \boldsymbol{\mu}_{0})^{\intercal} \boldsymbol{\Lambda}^{-1}_{0} (\boldsymbol{\gamma}- \boldsymbol{\mu}_{0}) \right\} d \boldsymbol{\gamma} \end{align}\] After re-arranging the terms in the exponent, we have \[\begin{align} - \frac{1}{2} \left( \left(\boldsymbol{\gamma}- \mathbf{M}^{-1} \boldsymbol{u}\right)^{\intercal} \mathbf{M}\left(\boldsymbol{\gamma}- \mathbf{M}^{-1} \boldsymbol{u}\right) + \mathbf{y}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{y}+ \boldsymbol{\mu}_{0}^{\intercal} \boldsymbol{\Lambda}^{-1}_0 \boldsymbol{\mu}_{0} - \boldsymbol{u}^{\intercal} \mathbf{M}^{-1} \boldsymbol{u}\right), \end{align}\] where \(\mathbf{M}= \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{D}\boldsymbol{P}+ \boldsymbol{\Lambda}^{-1}_{0}\) and \(\boldsymbol{u}= \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{y}+ \boldsymbol{\Lambda}^{-1}_{0} \boldsymbol{\mu}_{0}\). Notice that when integrating with respect to \(\boldsymbol{\gamma}\), the first quadratic term becomes a constant in \(\mathbf{y}\). Therefore, \[\begin{align} \mathbb{P}(\mathbf{y}\mid \mathbf{D}, \Pi) &\propto \exp \left\{ - \frac{1}{2} \left( \mathbf{y}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{y}+ \boldsymbol{\mu}_{0}^{\intercal} \boldsymbol{\Lambda}^{-1}_{0} \boldsymbol{\mu}_{0} - \boldsymbol{u}^{\intercal} \mathbf{M}^{-1} \boldsymbol{u}\right) \right\}. \end{align}\]

Now, as the prior over \(\boldsymbol{\gamma}\) becomes more diffuse i.e., as \(\tau^2 \to \infty\), we have that \(\boldsymbol{\Lambda}_{0}^{-1} \to \mathbf{0}\). Therefore, \(\boldsymbol{\mu}_{0}^{\intercal} \boldsymbol{\Lambda}^{-1}_{0} \boldsymbol{\mu}_{0} \to 0\), \(\mathbf{M}\to \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{D}\boldsymbol{P}\), and \(\boldsymbol{u}\to \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{y}\). This allows us to simplify, \[\begin{align} \mathbb{P}(\mathbf{y}\mid \mathbf{D}, \Pi) &\propto \exp \left\{ - \frac{1}{2} \left( \mathbf{y}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{y}- ( \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{y})^{\intercal} (\boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{D}\boldsymbol{P})^{-1} (\boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{y}) \right) \right\} \\ &\propto \exp \left\{ - \frac{1}{2} \mathbf{y}^{\intercal} \left( \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1} \mathbf{D}\boldsymbol{P}(\boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{D}\boldsymbol{P})^{-1} \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \right) \mathbf{y}\right\} \end{align}\]

The likelihood is maximized when the log-likelihood is maximized, \[\begin{align} \log \mathbb{P}(\mathbf{y}\mid \mathbf{D}, \Pi) &= - \frac{1}{2} \mathbf{y}^{\intercal} \left( \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1} \mathbf{D}\boldsymbol{P}(\boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{D}\boldsymbol{P})^{-1} \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \right) \mathbf{y}+ \textrm{c} \\ \frac{\partial \log \mathbb{P}(\mathbf{y}\mid \mathbf{D}, \Pi)}{\partial \mathbf{y}} &\stackrel{\textrm{set}}{=} 0 \\ % \implies -2 \left( \bSigma^{-1} - \bSigma^{-1} \D (\D^{\intercal} \bSigma^{-1} \D)^{-1} \D^{\intercal} \bSigma^{-1} \right) \y &= 0 \\ \implies \mathbf{y}&= \mathbf{D}\boldsymbol{P}( \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{D}\boldsymbol{P})^{-1} \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{y}\\ &= \mathbf{D}\boldsymbol{P}\widehat{\boldsymbol{\gamma}} = \mathbf{D}\widehat{\boldsymbol{\beta}}, \end{align}\] where \(\widehat{\boldsymbol{\beta}} = \boldsymbol{P}\widehat{\boldsymbol{\gamma}}\) and \(\widehat{\boldsymbol{\gamma}} = ( \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{D}\boldsymbol{P})^{-1} \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \boldsymbol{\Sigma}^{-1} \mathbf{y}\). This is exactly the solution to the following ordinary least-squares problem, \[\begin{align} \min_{\boldsymbol{\gamma}} \left\lVert\mathbf{y}- \mathbf{D}\boldsymbol{P}\boldsymbol{\gamma}\right\rVert_2^2. \end{align}\]

Next, we will show that this ordinary least squares problem is identical to \(\mathcal{L}(\Pi; \mathbf{Z})\). In order to make this argument cleaner, we will assume that \(\boldsymbol{\Sigma}= \sigma^2 \mathcal{I}_n\). Now, observe the structure of \(\mathbf{D}\). In any row \(i\), \(D_{ik} = 1\) if observation \(i\) is assigned to feature combination \(k\), and \(D_{ik} = 0\) otherwise. So, \(\mathbf{D}^{\intercal} \mathbf{D}\) is a diagonal matrix of size \(K \times K\) where \((\mathbf{D}^{\intercal} \mathbf{D})_{kk}\) is the number of observations assigned to feature combination \(k\), \(n_k\). And \(\mathbf{D}^{\intercal} \mathbf{y}\) sums all outcomes \(y_i\) corresponding to each feature combination \(k\).

Similar to how \(\mathbf{D}\) collects all observations into their respective feature combinations, \(\boldsymbol{P}\) collects all feature combinations into their respective pools. Therefore \((\boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \mathbf{D}\boldsymbol{P})^{-1} \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \mathbf{y}\) is the average outcome in each pool. This is exactly our estimated \(\widehat{\mathbf{y}}\) in Equation 14. In other words, \(\mathcal{L}(\Pi; \mathbf{Z})\) is exactly the minimized squared error (up to some scaling constant).
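To make this linear algebra concrete, here is a minimal sketch (using NumPy, with a hypothetical toy design of four feature combinations grouped into two pools) checking that \((\boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \mathbf{D}\boldsymbol{P})^{-1} \boldsymbol{P}^{\intercal} \mathbf{D}^{\intercal} \mathbf{y}\) is just the vector of pool averages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy design: K = 4 feature combinations, 10 units each,
# grouped into two pools {0, 1} and {2, 3}.
K, pools = 4, [[0, 1], [2, 3]]
assignments = np.repeat(np.arange(K), 10)
y = rng.normal(assignments.astype(float), 1.0)

# D: n x K indicator of feature combinations; P: K x |Pi| indicator of pools.
D = np.eye(K)[assignments]
P = np.zeros((K, len(pools)))
for j, pool in enumerate(pools):
    P[pool, j] = 1.0

# With Sigma = sigma^2 I, the solution reduces to per-pool averages.
gamma_hat = np.linalg.solve(P.T @ D.T @ D @ P, P.T @ D.T @ y)
pool_means = [y[np.isin(assignments, pool)].mean() for pool in pools]
assert np.allclose(gamma_hat, pool_means)
```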

Therefore, maximizing the posterior \(\mathbb{P}(\mathbf{y}\mid \mathbf{D}, \Pi) \mathbb{P}(\Pi)\) corresponds to minimizing the mean-squared error with the \(\ell_0\) penalty. This has connections to loss-based generalized Bayesian inference [4]. Here, we have described one possible prior on \(\boldsymbol{\beta}\) to recover the mean-squared error. However, other priors exist that yield other analytic loss functions. We refer the reader to Section 4 of [3] for examples of such priors used for BART.

13 Approximating the posterior↩︎

By the triangle inequality, we can write \[\begin{align} \sup_{\boldsymbol{t}} \left\lvert F_{\boldsymbol{\beta}\mid \mathbf{Z}, \mathcal{P}_\theta}(\boldsymbol{t}) - F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}) \right\rvert &= \sup_{\boldsymbol{t}} \left\lvert \sum_{\Pi \in \mathcal{P}_{\theta}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \frac{\mathbb{P}(\Pi \mid \mathbf{Z})}{\sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}} \mathbb{P}(\Pi^{\prime} \mid \mathbf{Z})} - \sum_{\Pi \in \mathcal{P}^{\star}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \mathbb{P}(\Pi \mid \mathbf{Z})\right\rvert, \\ &\leq (\textrm{I}) + (\textrm{II}) \\ \textrm{I} &= \sup_{\boldsymbol{t}} \left\lvert \sum_{\Pi \in \mathcal{P}_{\theta}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \frac{\mathbb{P}(\Pi \mid \mathbf{Z})}{\sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}} \mathbb{P}(\Pi^{\prime} \mid \mathbf{Z})} - \sum_{\Pi \in \mathcal{P}_{\theta}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \mathbb{P}(\Pi \mid \mathbf{Z})\right\rvert, \\ \textrm{II} &= \sup_{\boldsymbol{t}} \left\lvert \sum_{\Pi \in \mathcal{P}_{\theta}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \mathbb{P}(\Pi \mid \mathbf{Z}) - \sum_{\Pi \in \mathcal{P}^{\star}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \mathbb{P}(\Pi \mid \mathbf{Z})\right\rvert \end{align}\]

Let us denote \(K = \sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}}\mathbb{P}(\Pi^{\prime} \mid \mathbf{Z})\). Then the first term is, \[\begin{align} (\textrm{I}) &= \sup_{\boldsymbol{t}} \left\lvert \sum_{\Pi \in \mathcal{P}_{\theta}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \frac{\mathbb{P}(\Pi \mid \mathbf{Z})}{K} - \sum_{\Pi \in \mathcal{P}_{\theta}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \mathbb{P}(\Pi \mid \mathbf{Z})\right\rvert \\ &\leq \left\lvert\frac{1}{K} - 1\right\rvert \sup_{\boldsymbol{t}} \left\lvert \sum_{\Pi \in \mathcal{P}_{\theta}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \mathbb{P}(\Pi \mid \mathbf{Z})\right\rvert \\ &\leq \left\lvert\frac{1}{K} - 1\right\rvert \sup_{\boldsymbol{t}} \left\lvert \sum_{\Pi \in \mathcal{P}_{\theta}} \mathbb{P}(\Pi \mid \mathbf{Z})\right\rvert = \left\lvert\frac{1}{K} - 1\right\rvert \sup_{\boldsymbol{t}} \left\lvert K\right\rvert\\ &= 1 - \sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}}\mathbb{P}(\Pi^{\prime} \mid \mathbf{Z}) \end{align}\] where in the third line, we trivially bound \(F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \leq 1\).

Moving on to the second term, \[\begin{align} (\textrm{II}) &= \sup_{\boldsymbol{t}} \left\lvert \sum_{\Pi \in \mathcal{P}_{\theta}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \cdot \mathbb{P}(\Pi \mid \mathbf{Z}) - \sum_{\Pi \in \mathcal{P}^{\star}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \cdot \mathbb{P}(\Pi \mid \mathbf{Z})\right\rvert \\ &= \sup_{\boldsymbol{t}} \left\lvert \sum_{\Pi \in \mathcal{P}^{\star}\setminus \mathcal{P}_{\theta}} F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \cdot \mathbb{P}(\Pi \mid \mathbf{Z})\right\rvert \\ &\leq \sup_{\boldsymbol{t}} \left\lvert \sum_{\Pi \in \mathcal{P}^{\star}\setminus \mathcal{P}_{\theta}} 1 \cdot \mathbb{P}(\Pi \mid \mathbf{Z})\right\rvert = \sum_{\Pi \in \mathcal{P}^{\star}\setminus \mathcal{P}_{\theta}} \mathbb{P}(\Pi \mid \mathbf{Z}) \\ &= 1 - \sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}}\mathbb{P}(\Pi^{\prime} \mid \mathbf{Z}), \end{align}\] where in the third line, we again bound \(F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}\mid \Pi) \leq 1\).

Therefore, \[\begin{align} \sup_{\boldsymbol{t}} \left\lvert F_{\boldsymbol{\beta}\mid \mathbf{Z}, \mathcal{P}_\theta}(\boldsymbol{t}) - F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}) \right\rvert &\leq (\textrm{I}) + (\textrm{II}) \\ &\leq 2(1 - \sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}}\mathbb{P}(\Pi^{\prime} \mid \mathbf{Z})). \end{align}\]

There are two ways to bound this. First, \[\begin{align} 1 - \sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}}\mathbb{P}(\Pi^{\prime} \mid \mathbf{Z}) &\leq 1 - \sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}} \theta = 1 - \left\lvert\mathcal{P}_{\theta}\right\rvert \theta. \end{align}\] Second, \[\begin{align} 1 - \sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}}\mathbb{P}(\Pi^{\prime} \mid \mathbf{Z}) &= \sum_{\Pi \in \mathcal{P}^{\star}\setminus \mathcal{P}_{\theta}} \mathbb{P}(\Pi \mid \mathbf{Z}) \leq \sum_{\Pi \in \mathcal{P}^{\star}\setminus \mathcal{P}_{\theta}} \theta = (\left\lvert\mathcal{P}^{\star}\right\rvert - \left\lvert\mathcal{P}_{\theta}\right\rvert) \theta. \end{align}\] The first bound is smaller whenever \(\theta > 1 / \left\lvert\mathcal{P}^{\star}\right\rvert\). Therefore, \[\begin{align} \sup_{\boldsymbol{t}} \left\lvert F_{\boldsymbol{\beta}\mid \mathbf{Z}, \mathcal{P}_\theta}(\boldsymbol{t}) - F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}) \right\rvert &\leq \begin{cases} 2 \left( 1 - \left\lvert\mathcal{P}_{\theta}\right\rvert \theta \right), & \theta > 1 / \left\lvert\mathcal{P}^{\star}\right\rvert \\ 2 \left(\left\lvert\mathcal{P}^{\star}\right\rvert - \left\lvert\mathcal{P}_{\theta}\right\rvert \right) \theta, &\text{else} \end{cases}. \end{align}\] And finally, we add a \(\min\) operator because \(\sup_{\boldsymbol{t}} \left\lvert F_{\boldsymbol{\beta}\mid \mathbf{Z}, \mathcal{P}_\theta}(\boldsymbol{t}) - F_{\boldsymbol{\beta}\mid \mathbf{Z}}(\boldsymbol{t}) \right\rvert\) is trivially bounded by 1. \(\square\)
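As a small illustration, the following sketch evaluates the resulting bound for hypothetical values of \(\theta\), \(\left\lvert\mathcal{P}_{\theta}\right\rvert\), and \(\left\lvert\mathcal{P}^{\star}\right\rvert\); the final \(\min\) with 1 reflects the trivial bound noted above.

```python
def rps_cdf_error_bound(theta, n_rps, n_total):
    """Evaluate the sup-norm bound derived above.

    theta:   posterior threshold defining the Rashomon set
    n_rps:   |P_theta|, number of partitions in the Rashomon set
    n_total: |P*|, number of permissible partitions
    """
    if theta > 1.0 / n_total:
        bound = 2.0 * (1.0 - n_rps * theta)
    else:
        bound = 2.0 * (n_total - n_rps) * theta
    return min(1.0, bound)  # the total variation-style distance is at most 1

# Hypothetical numbers: 40 of 500 permissible partitions clear theta = 0.02.
print(rps_cdf_error_bound(theta=0.02, n_rps=40, n_total=500))  # 0.4
```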

This argument is similar to 1 except for how we bound the expectations. We have \[\begin{align} \left\lVert\mathbb{E}_{\Pi \mid \mathcal{P}_{\theta}} \boldsymbol{\beta}- \mathbb{E}_{\Pi, \mathcal{P}_{\theta}} \boldsymbol{\beta}\right\rVert &= \left\lVert\sum_{\Pi \in \mathcal{P}_{\theta}} \boldsymbol{\beta}_{\Pi} \frac{\mathbb{P}(\Pi \mid \mathbf{Z})}{\sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}} \mathbb{P}(\Pi^{\prime} \mid \mathbf{Z})} - \sum_{\Pi \in \mathcal{P}_{\theta}} \boldsymbol{\beta}_{\Pi} \mathbb{P}(\Pi \mid \mathbf{Z})\right\rVert \\ &= \left\lvert\frac{1}{\sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}}\mathbb{P}(\Pi^{\prime} \mid \mathbf{Z})} - 1\right\rvert \left\lVert\sum_{\Pi \in \mathcal{P}_{\theta}} \boldsymbol{\beta}_{\Pi} \mathbb{P}(\Pi \mid \mathbf{Z})\right\rVert \\ &= \left\lvert\frac{1}{K} - 1\right\rvert \left\lVert\mathbb{E}_{\Pi, \mathcal{P}_{\theta}} \boldsymbol{\beta}\right\rVert, \end{align}\] where \(K = \sum_{\Pi^{\prime} \in \mathcal{P}_{\theta}}\mathbb{P}(\Pi^{\prime} \mid \mathbf{Z})\). Note that by definition, \(K \geq \left\lvert\mathcal{P}_{\theta}\right\rvert \theta\). Further, \(K \leq 1\) gives us \(1 / K - 1 \geq 0\). Therefore, \[\begin{align} \left\lVert\mathbb{E}_{\Pi \mid \mathcal{P}_{\theta}} \boldsymbol{\beta}- \mathbb{E}_{\Pi, \mathcal{P}_{\theta}} \boldsymbol{\beta}\right\rVert &\leq \left(\frac{1}{\left\lvert\mathcal{P}_{\theta}\right\rvert \theta} - 1 \right) \left\lVert\mathbb{E}_{\Pi, \mathcal{P}_{\theta}} \boldsymbol{\beta}\right\rVert \\ \implies \frac{\left\lVert\mathbb{E}_{\Pi \mid \mathcal{P}_{\theta}} \boldsymbol{\beta}- \mathbb{E}_{\Pi, \mathcal{P}_{\theta}} \boldsymbol{\beta}\right\rVert}{\left\lVert\mathbb{E}_{\Pi, \mathcal{P}_{\theta}} \boldsymbol{\beta}\right\rVert } &= \mathcal{O} \left( \frac{1}{\left\lvert\mathcal{P}_{\theta}\right\rvert \theta} - 1 \right). \end{align}\]

If we assume that \(\left\lVert\boldsymbol{\beta}_{\Pi}\right\rVert < \infty\), then define \(C = \max_{\Pi \in \mathcal{P}^{\star}} \left\lVert\boldsymbol{\beta}_{\Pi}\right\rVert < \infty\). Then, the remainder of the proof is identical to 1 except for carrying the multiplicative constant \(C\) from the expectations. \(\square\)

Table 2: Notation used in Theorem 2.

| Notation | Definition |
|---|---|
| \(\mathcal{P}_{\mid h}\) | Set of permissible partitions with \(h\) pools |
| \(Q \in \mathcal{Q}\) | Prior over all \(\boldsymbol{\beta}\) |
| \(Q \in \mathcal{Q}_{\mid h}\) | Prior over \(\boldsymbol{\beta}\) such that there is some partition \(\Pi_{\boldsymbol{\beta}} \in \mathcal{P}_{\mid h}\) |
| \(Q \in \mathcal{Q}_{\mathcal{P} \mid h}\) | Prior for partitions \(\Pi \in \mathcal{P}_{\mid h}\) |
| \(P_{\ell_0}\) | The uniform prior over \(\mathcal{P}_{\mid h}\) (induced by \(\ell_0\) over \(\mathcal{P}^{\star}\)) |
| \(\textrm{P}_{Q, \mathbf{Z}}\) | Posterior density (over partitions or \(\boldsymbol{\beta}\)) with prior \(Q\) |
| \(\delta(P, Q)\) | Total variation distance between \(P\) and \(Q\) |

For any prior \(P \in \mathcal{Q}_{\mathcal{P} \mid h}\), we have, \[\begin{align} \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \delta(\textrm{P}_{P, \mathbf{Z}}, \textrm{P}_{Q, \mathbf{Z}}) &= \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \sup_{\Pi \in \mathcal{P}_{\mid h}} \left\lvert\textrm{P}_{P, \mathbf{Z}}(\Pi) - \textrm{P}_{Q, \mathbf{Z}}(\Pi)\right\rvert \\ &= \sup_{\Pi \in \mathcal{P}_{\mid h}} \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \left\lvert\textrm{P}_{P, \mathbf{Z}}(\Pi) - \textrm{P}_{Q, \mathbf{Z}}(\Pi)\right\rvert \\ &= \frac{1}{\mathbb{P}(\mathbf{y}\mid \mathbf{X})} \sup_{\Pi \in \mathcal{P}_{\mid h}} \mathbb{P}(\mathbf{y}\mid \mathbf{X}, \Pi) \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \left\lvert P(\Pi) - Q(\Pi)\right\rvert. \end{align}\]

First, consider the \(\ell_0\) prior. \[\begin{align} \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \delta(\textrm{P}_{P_{\ell_0}, \mathbf{Z}}, \textrm{P}_{Q, \mathbf{Z}}) &= \frac{1}{\mathbb{P}(\mathbf{y}\mid \mathbf{X})} \sup_{\Pi \in \mathcal{P}_{\mid h}} \mathbb{P}(\mathbf{y}\mid \mathbf{X}, \Pi) \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \left\lvert\frac{1}{N(h)} - Q(\Pi)\right\rvert. \end{align}\] Choose an adversarial prior \(Q^{\star}\) such that \(Q^{\star}(\Pi^{\star}) = 1\) for some arbitrary \(\Pi^{\star} \in \mathcal{P}_{\mid h}\). Then, \[\begin{align} \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \left\lvert\frac{1}{N(h)} - Q(\Pi)\right\rvert &= \left\lvert\frac{1}{N(h)} - Q^{\star}(\Pi^{\star})\right\rvert = 1 - \frac{1}{N(h)} \\ \implies \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \delta(\textrm{P}_{P_{\ell_0}, \mathbf{Z}}, \textrm{P}_{Q, \mathbf{Z}}) &= \left(1 - \frac{1}{N(h)} \right) \frac{\sup_{\Pi \in \mathcal{P}_{\mid h}}\mathbb{P}(\mathbf{y}\mid \mathbf{X}, \Pi)}{\mathbb{P}(\mathbf{y}\mid \mathbf{X})} \end{align}\]

Next, consider any other prior \(P \in \mathcal{Q}_{\mathcal{P} \mid h}\), \(P \neq P_{\ell_0}\). Let \(\Pi_m = \mathop{\mathrm{argmin}}_{\Pi} P(\Pi)\). Denote \(P(\Pi_m) =~p\). Observe that \(p < 1 / N(h)\) because \(P \neq P_{\ell_0}\). Construct an adversarial prior \(Q^{\star}\) such that \(Q^{\star}(\Pi_m) =~1\). Therefore, \[\begin{align} \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \left\lvert P(\Pi) - Q(\Pi)\right\rvert &= \left\lvert P(\Pi_m) - Q^{\star}(\Pi_m)\right\rvert = 1 - p \\ \implies \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \delta(\textrm{P}_{P, \mathbf{Z}}, \textrm{P}_{Q, \mathbf{Z}}) &= \frac{1}{\mathbb{P}(\mathbf{y}\mid \mathbf{X})} \sup_{\Pi \in \mathcal{P}_{\mid h}} \mathbb{P}(\mathbf{y}\mid \mathbf{X}, \Pi) (1 - p) \\ &= (1 - p) \frac{\sup_{\Pi \in \mathcal{P}_{\mid h}}\mathbb{P}(\mathbf{y}\mid \mathbf{X}, \Pi)}{\mathbb{P}(\mathbf{y}\mid \mathbf{X})} \\ &> \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \delta(\textrm{P}_{P_{\ell_0}, \mathbf{Z}}, \textrm{P}_{Q, \mathbf{Z}}). \end{align}\]

Thus, the \(\ell_0\) prior is minimax optimal, \[\begin{align} \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \delta(\textrm{P}_{P_{\ell_0}, \mathbf{Z}}, \textrm{P}_{Q, \mathbf{Z}}) &= \inf_{P \in \mathcal{Q}_{\mathcal{P} \mid h}} \sup_{Q \in \mathcal{Q}_{\mathcal{P} \mid h}} \delta(\textrm{P}_{P, \mathbf{Z}}, \textrm{P}_{Q, \mathbf{Z}}). \end{align}\] \(\square\)

We can easily see that \[\begin{align} \xi(\Pi,\Pi_0) &\leq \epsilon \\ \iff \exp (- Q(\Pi)) &\geq \exp \left\{ - Q(\Pi_0) (1 + \epsilon) \right\} = \exp \left\{ -q_0 (1 + \epsilon) \right\} \\ \iff \mathbb{P}(\Pi \mid \mathbf{Z}) &\geq \frac{e^{-q_0(1 + \epsilon)}}{c}, \end{align}\] where \(c := c(\mathbf{Z})\) is the normalization constant. \(\square\)

14 Appendix to Size of the Rashomon Set.↩︎

  To count the number of all possible partitions, we cast this as a decision tree problem. There are \((R-1)^m\) possible feature combinations in the profile with all arms turned on. These constitute possible nodes in a binary decision tree. The leaves in the decision tree are the pools. The number of binary trees with \(n\) nodes is given by \[\begin{align} C_n &= \frac{1}{n+1} \binom{2n}{n}, \end{align}\] where \(C_n\) is the \(n\)-th Catalan number (see An Invitation to Analytic Combinatorics [58]). Therefore, the number of trees we can construct (that may or may not be admissible) is \[\begin{align} T &= \sum_{n=1}^{(R-1)^m} C_n = \mathcal{O}\left(2^{2(R-1)^m} \right), \end{align}\] where the big-O bound is given by [59].
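As a sanity check on this counting argument, the sketch below computes \(C_n\) and the (deliberately loose) tree-count bound \(T\) for a small hypothetical profile.

```python
from math import comb

def catalan(n):
    # n-th Catalan number: C_n = binom(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

def tree_count_upper_bound(R, m):
    # T = sum_{n=1}^{(R-1)^m} C_n, the (loose) count of binary trees whose
    # nodes are the (R-1)^m feature combinations of the full profile.
    n_max = (R - 1) ** m
    return sum(catalan(n) for n in range(1, n_max + 1))

# Even a tiny hypothetical profile (R = 3, m = 2) already gives
# C_1 + C_2 + C_3 + C_4 = 1 + 2 + 5 + 14 = 22 candidate trees.
print(tree_count_upper_bound(R=3, m=2))
```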

To count the number of permissible partitions, conceptualize the binary matrix \(\boldsymbol{\Sigma}\in \{0, 1\}^{m\times (R-2)}\) again. Each element of \(\boldsymbol{\Sigma}\) tells us whether a particular pair of adjacent levels in a feature is pooled. In particular, we define \(\Sigma_{ij} = 1\) if and only if feature combinations at level \(j\) are pooled with feature combinations at level \(j+1\) in feature \(i\). Therefore, \(\boldsymbol{\Sigma}\) enumerates all admissible partitions. This gives us the desired result.

\(\square\)

From the definition of the Rashomon set, if \(\Pi \in \mathcal{P}_{\theta}\), then \[\begin{align} \mathbb{P}(\Pi \mid \mathbf{Z}) &\geq \theta \\ \implies \frac{\exp \left\{-\mathcal{L}(\Pi) - \lambda H(\Pi)\right\}}{c} &\geq \theta \\ \implies \exp \left\{ - \lambda H(\Pi) \right\} &\geq c \theta \\ \implies H(\Pi) &\leq - \frac{\ln (c \theta)}{\lambda}, \end{align}\] which gives our desired result. \(\square\)

We know that the total number of pools in any partition is bounded by \(H := H_\theta(\lambda)\).

Suppose there are \(k\) profiles. There is at least one pool in each profile. Therefore, the number of profiles is bounded by \(1 \leq k \leq H\). The profiles can be generated by making a tree-like partition of the feature space. In each feature, there are \(R-1\) places to split. Therefore, there are \(M(R-1)\) places to split overall. To generate \(k\) profiles, we need to choose \(k-1\) positions to split. This gives us \(\binom{M(R-1)}{k-1}\) possibilities.

For a given set of \(k\) profiles, Lemma 7 bounds the number of partitions as \[\begin{align} \left\lvert\mathcal{P}^{(k)}\right\rvert &= \begin{cases} \mathcal{O}\left( M^k R^{H-k} \right), & R > M^{c_{\mathrm{crit}}} \\ \mathcal{O}\left( (MR)^{k \log_2 H / k} (\log_2 (MR))^{-1} \right), & \mathrm{else} \end{cases}, \end{align}\] where \(c_{\mathrm{crit}} = (\log_2 3 - 1) / (2 - \log_2 3)\).

Now, all that remains is to count across the number of profiles. This gives us, \[\begin{align} \left\lvert\mathcal{P}_{\theta}\right\rvert &\leq \sum_{k=1}^H \binom{M(R-1)}{k-1} \left\lvert\mathcal{P}^{(k)}\right\rvert \\ &\leq C \sum_{k=1}^H (MR)^{k-1} \times \begin{cases} \mathcal{O}\left( M^k R^{H-k} \right), & R > M^{c_{\mathrm{crit}}} \\ \mathcal{O}\left( (MR)^{k \log_2 H / k} (\log_2 (MR))^{-1} \right), & \mathrm{else} \end{cases} \\ &\leq \begin{cases} \mathcal{O}\left( \sum_{k=1}^H M^{2k-1} R^{H-1} \right), & R > M^{c_{\mathrm{crit}}} \\ \mathcal{O}\left( \sum_{k=1}^H (MR)^{k (1 + \log_2 H / k) - 1} (\log_2 (MR))^{-1} \right), & \mathrm{else} \end{cases} \end{align}\] The first case simplifies as \[\begin{align} R^{H-1} \sum_{k=1}^H M^{2k-1} &= \mathcal{O}\left( M^{2H - 1} R^{H - 1} \right). \end{align}\] Next, look at the exponent on \((MR)\) in the second case, \(k(1 + \log_2 H/k) - 1\). Using the second derivative test, we can see that this is maximized when \(k = H\): \[\begin{align} \frac{\partial}{\partial k} &: \log_2 H - \log_2 k \stackrel{\mathrm{set}}{=} 0 \implies k = H \\ \frac{\partial^2}{\partial k^2} &: - \frac{1}{k} < 0. \end{align}\] Therefore, the second case can be bounded as, \[\begin{align} \sum_{k=1}^H (MR)^{k (1 + \log_2 H / k) - 1} &\leq \sum_{k=1}^H (MR)^{H - 1} = \mathcal{O}\left( (MR)^{H - 1} \right). \end{align}\]

Therefore, we have \[\begin{align} \left\lvert\mathcal{P}_{\theta}\right\rvert &= \begin{cases} \mathcal{O}\left(M^{2H - 1} R^{H - 1} \right), & R > M^{c_{\mathrm{crit}}} \\ \mathcal{O}\left( (MR)^{H - 1} (\log_2 (MR))^{-1} \right), & \mathrm{else} \end{cases}, \end{align}\] where \(c_{\mathrm{crit}} = (\log_2 3 - 1) / (2 - \log_2 3)\).

\(\square\)

14.1 Helpful results↩︎

We state a useful result that helps us count the number of pools generated by a partition matrix \(\boldsymbol{\Sigma}\).

Lemma 3. Let \(\boldsymbol{\Sigma}\) be the partition matrix for a profile with \(m\) active features. Suppose there are \(z_i\) 1’s in the \(i\)-th row of \(\boldsymbol{\Sigma}\). Then the number of pools created by \(\boldsymbol{\Sigma}\) is, \[\begin{align} H(\boldsymbol{\Sigma}) &= (R-1)^m - (R-1)^{m-1} \sum_i z_i + (R-1)^{m-2} \sum_{i_1 < i_2} z_{i_1} z_{i_2} \\ & - (R-1)^{m-3} \sum_{i_1 < i_2 < i_3} z_{i_1} z_{i_2} z_{i_3} + \dots + (-1)^m z_1 \dots z_m. \end{align}\]

Observe that there are \((R-1)^m\) feature combinations in total (\(R-1\) because we are assuming the \(R\) discrete values include the control). Suppose we set \(\boldsymbol{\Sigma}_{ij} = 1\); then we are pooling feature combinations of type \([r_1, \dots, r_{i-1}, j, r_{i+1}, \dots, r_m]\) with \([r_1, \dots, r_{i-1}, j-1, r_{i+1}, \dots, r_m]\), where \(r_{i^{\prime}}\) can take on \(R-1\) values. Therefore, \((R-1)^{m-1}\) feature combinations are pooled. So, if there are \(\text{nnz}(\boldsymbol{\Sigma}) = \sum_{i} z_i\) ones in total, then \((R-1)^{m-1} \sum_i z_i\) feature combinations are pooled.

However, if some of those 1’s are in a different treatment arm, then we end up double counting those. For example, if \(\boldsymbol{\Sigma}_{i_1, j} = 1\) and \(\boldsymbol{\Sigma}_{i_2, j^{\prime}} = 1\), then we remove features of type \([r_1, \dots, j, \dots, j^{\prime}, \dots, r_m]\) twice where \(j\) and \(j^{\prime}\) are at indices \(i_1\) and \(i_2\). So, we need to add them back. Similarly, the remaining non-linear terms account for this “double counting.” \(\square\)
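A brute-force check of Lemma 3 is straightforward for small profiles. The sketch below (with a hypothetical indexing convention in which \(\Sigma_{ij} = 1\) pools levels \(j\) and \(j+1\) of feature \(i\)) merges adjacent levels explicitly and compares the resulting pool count to the inclusion-exclusion formula, which is just \(\prod_i \left( (R-1) - z_i \right)\) expanded.

```python
import itertools
import math

def pools_by_unionfind(sigma, R):
    """Count pools for one profile by explicitly merging adjacent levels."""
    m = len(sigma)
    combos = list(itertools.product(range(1, R), repeat=m))
    parent = {c: c for c in combos}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for c in combos:
        for i in range(m):
            level = c[i]
            # sigma[i][level - 1] = 1 pools level with level + 1 in feature i.
            if level < R - 1 and sigma[i][level - 1] == 1:
                neighbour = c[:i] + (level + 1,) + c[i + 1:]
                parent[find(c)] = find(neighbour)
    return len({find(c) for c in combos})

def pools_by_formula(sigma, R):
    # Inclusion-exclusion expression from Lemma 3.
    m = len(sigma)
    z = [sum(row) for row in sigma]
    total = 0
    for k in range(m + 1):
        for idx in itertools.combinations(range(m), k):
            term = math.prod(z[i] for i in idx)
            total += (-1) ** k * (R - 1) ** (m - k) * term
    return total

# Spot check on a small hypothetical profile with m = 2 features, R = 4.
sigma = [[1, 0], [0, 1]]
assert pools_by_unionfind(sigma, R=4) == pools_by_formula(sigma, R=4) == 4
```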

 

Lemma 3 tells us how to count the number of pools given a partition matrix. We now state another result in Lemma 4 that bounds the sparsity of the partition matrix given some number of pools.

Lemma 4. Let \(\boldsymbol{\Sigma}\) be the matrix defined in Proposition 2 for a profile with \(m\) active features. Suppose there are \(H\) pools. Then, \[\begin{align} \sum_i z_i &\leq \frac{(2R-3)^m + 1 - 2H}{2(R-1)^{m-1}} \end{align}\]

Rearranging and dropping negative terms from Lemma 3, \[\begin{align} (R-1)^{m-1} \sum_i z_i &\leq -H + (R-1)^m + (R-1)^{m-2} \sum_{i_1 < i_2} z_{i_1} z_{i_2} \\ & \quad + (R-1)^{m-4} \sum_{i_1 < \dots < i_4} z_{i_1} z_{i_2} z_{i_3} z_{i_4} + \dots \\ &\leq -H + (R-1)^m + (R-1)^{m-2} (R-2)^2 \sum_{i_1 < i_2} 1 \\ & \quad + (R-1)^{m-4} (R-2)^4 \sum_{i_1 < \dots < i_4} 1 + \dots \\ &= -H + \sum_{n \text{ even}} \binom{m}{n}(R-1)^{m-n}(R-2)^n \\ \implies \sum_i z_i &\leq \frac{(2R-3)^m + 1 - 2H}{2(R-1)^{m-1}}, \end{align}\] where the second inequality uses \(z_j \leq R-2\) and the last step uses the well-known identity \[\begin{align} \sum_{k \text{ even}} \binom{n}{k} a^{n-k} b^k &= \frac{1}{2} \left((a+b)^n + (a-b)^n \right). \end{align}\] \(\square\)

 

We state a stronger result in Lemma 5 that tells us exactly how many partition matrices could have generated a given number of pools. This result is crucial in bounding the size of the RPS in a practically meaningful way as in Theorem 3.

Lemma 5. Let \(\boldsymbol{\Sigma}\) be the matrix defined in Proposition 2 for a profile with \(m\) active features. Then the number of \(\boldsymbol{\Sigma}\) matrices that generate \(h\) pools is given by \[\begin{align} N(h) &= \sum_{k=0}^m \binom{m}{k} \sum_{\prod_{i=1}^k (z_i + 1) = h} \prod_{i=1}^k \binom{R-2}{z_i}, \end{align}\] where we define \(N(1) = 1\) and \(\binom{n}{k} = 0\) for \(k > n\).

As \(m, R \to \infty\), we have \[\begin{align} N(h) &= \begin{cases} \mathcal{O}(mR^{h-1}), & R > m^{c_{\mathrm{crit}}} \\ \mathcal{O}((mR)^{\log_2 h}), & \text{else} \end{cases}, \end{align}\] where \(c_{\mathrm{crit}} = (\log_2 3 - 1) / (2 - \log_2 3)\).

This is an exercise in counting. When we make \(z\) splits in one feature, we generate \(z+1\) pools. When we make \(z_i\) splits in feature \(i\) and \(z_j\) splits in feature \(j\), we generate \((z_i + 1)(z_j + 1)\) pools.

When we want to generate \(h\) pools, we first choose the features where we want the splits to occur. This is what the first summation is doing. Suppose that we’ve chosen \(k\) features where we want to perform splits. Next, we need to identify how many splits can be made in each feature. This is what the inner summation is doing with the condition \(\prod_{i=1}^k (z_i + 1) = h\). Finally, we need to identify where those splits are made, which is where the binomial coefficient comes in.

To get the asymptotic bound, we first consider the term where the exponent on \(R\) is the largest. This is when we choose all splits in the same feature. Next, we consider the term where the exponent on \(m\) is the largest. For this to happen, we need to choose as many arms as possible i.e., make the smallest number of non-zero splits in each feature. This corresponds to making 1 split in each feature i.e., selecting \(\log_2 h\) features. Hence, we get the asymptotic bound \(\mathcal{O}(\max\{ m R^{h-1}, (mR)^{\log_2 h}\})\).

Observe that \[\begin{align} mR^{h-1} > (mR)^{\log_2 h} &\iff R > m^{\frac{\log_2 h - 1}{h - \log_2 h - 1}}. \end{align}\] The exponent on \(m\) is a decreasing function in \(h\). When \(h = 2\), \(mR^{h-1} = (mR)^{\log_2 h}\). And when \(h=3\), the exponent is \(c_{\mathrm{crit}} = (\log_2 3 - 1) / (2 - \log_2 3)\). Therefore \(mR^{h-1} > (mR)^{\log_2 h}\) whenever \(R > m^{c_{\mathrm{crit}}}\), which gives our desired result. \(\square\)

Lemma 5 has a nice implication. When \(h\) is a prime number, we expect \(N(h)\) to be small because all of the splits need to be made in the same feature. On the other hand, when \(h = 2^k\) is a power-of-two, we expect \(N(h)\) to be very large since we can make splits in multiple features at the same time.
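To see this concretely, the following brute-force sketch tabulates \(N(h)\) for a small hypothetical profile by enumerating every \(\boldsymbol{\Sigma}\) and using the fact that a row with \(z_i\) ones leaves \((R-1) - z_i\) groups of levels in feature \(i\).

```python
import itertools
from collections import Counter

def count_sigma_by_pools(m, R):
    """Tabulate N(h): how many Sigma matrices yield h pools in one profile."""
    counts = Counter()
    rows = list(itertools.product([0, 1], repeat=R - 2))
    for sigma in itertools.product(rows, repeat=m):
        h = 1
        for row in sigma:
            h *= (R - 1) - sum(row)   # groups of levels left in this feature
        counts[h] += 1
    return counts

# Small hypothetical profile: m = 3 features, R = 5 levels each.
N = count_sigma_by_pools(m=3, R=5)
for h in sorted(N):
    print(h, N[h])
# Prime h can only arise from splits concentrated in a single feature
# (so N(5) = N(7) = 0 in this small example), while composite h such as
# 4, 6, or 8 can spread splits across features and admit many matrices.
```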

Lemma 6. For \(a > 1\), \[\begin{align} \int_x a ^{\log_2 x} dx &= \frac{x a^{\log_2 x}}{1 + \log_2 a} + C. \end{align}\]

We use integration by parts to solve this, \[\begin{align} \int_x a ^{\log_2 x} dx &= a^{\log_2 x} \int_x dx - \int_x x \cdot \frac{a^{\log_2 x} \log_2 a}{x} dx \\ &= x a^{\log_2 x} - \log_2 a \int_x a^{\log_2 x} dx \\ \implies \int_x a^{\log_2 x} dx &= \frac{x a^{\log_2 x}}{1 + \log_2 a} + C. \end{align}\] \(\square\)

Lemma 7. Suppose there are \(k \geq 1\) fixed profiles across \(M\) features each taking on \(R\) discrete ordered values. Suppose that the maximum number of pools in any partition is \(H\). Then, the number of permissible partitions is bounded by, \[\begin{align} \left\lvert\mathcal{P}^{(k)}\right\rvert &= \begin{cases} \mathcal{O}\left( M^k R^{H-k} \right), & R > M^{c_{\mathrm{crit}}} \\ \mathcal{O}\left( (MR)^{k \log_2 H / k} (\log_2 (MR))^{-1} \right), & \mathrm{else} \end{cases}, \end{align}\] where \(c_{\mathrm{crit}} = (\log_2 3 - 1) / (2 - \log_2 3)\).

Let \(h_i\) denote the number of pools within each profile. Then, we know that \(k \leq \sum_{i=1}^k h_i \leq h\) where \(1 \leq h_i \leq h - k + 1\) for every profile \(i\) and \(h \leq H\). Observe that partitions within each profile are strongly convex. By Lemma 5, we have a bound on the number of partitions of size \(h_i\), \(N_i(h_i)\), \[\begin{align} N_i(h_i) &= \max \left\{ \mathcal{O}(MR^{h_i-1}), \mathcal{O}((MR)^{\log_2 h_i}) \right\}. %\\ % &= \max \left\{ \bigO(MR^{h - k}), \bigO((MR)^{\log_2 (h - k + 1)}) \right\}. \end{align}\] We also know from Lemma 5 that \(MR^{h_i-1}\) beats \((MR)^{\log_2 h_i}\) whenever \(R > M^{c_{\mathrm{crit}}}\) where \(c_{\mathrm{crit}} = (\log_2 3 - 1) / (2 - \log_2 3)\). For now, we will suppress this condition for readability. We will re-introduce this at the end.

For a given set of \(k\) profiles, the number of partitions is given by, \[\begin{align} \left\lvert\mathcal{P}^{(k)}\right\rvert &= \sum_{h=k}^H \sum_{h_1 + \dots + h_k = h} \prod_{i=1}^k N_i(h_i). \end{align}\] There are \(\binom{h-1}{k-1}\) positive integral solutions to the equation \(\sum_{i=1}^k h_i = h\). Thus, \[\begin{align} \left\lvert\mathcal{P}^{(k)}\right\rvert &= \sum_{h=k}^H \sum_{h_1 + \dots + h_k = h} \prod_{i=1}^k N_i(h_i) \\ &= \sum_{h=k}^H \binom{h-1}{k-1} \max \left\{ \mathcal{O}\left( M^kR^{\sum_{i=1}^k (h_i-1)} \right), \mathcal{O}\left((MR)^{\sum_{i=1}^k \log_2 h_i} \right) \right\} \\ &= \sum_{h=k}^H \binom{h-1}{k-1} \max \left\{ \mathcal{O}\left( M^kR^{h-k} \right), \mathcal{O}\left((MR)^{\sum_{i=1}^k \log_2 h_i} \right) \right\} \end{align}\] To bound \(\sum_{i=1}^k \log_2 h_i\) when \(\sum_{i=1}^k h_i = h\), we use the arithmetic-geometric mean inequality, \[\begin{align} \left( \prod_{i=1}^k h_i \right)^{1/k} &\leq \frac{\sum_{i=1}^k h_i}{k} \implies \prod_{i=1}^k h_i \leq \left(\frac{h}{k}\right)^k \\ \implies \sum_{i=1}^k \log_2 h_i &\leq k (\log_2 h - \log_2 k). \end{align}\] Therefore, \[\begin{align} \left\lvert\mathcal{P}^{(k)}\right\rvert &= \sum_{h=k}^H \binom{h-1}{k-1} \max \left\{ \mathcal{O}\left( M^kR^{h-k} \right), \mathcal{O}\left((MR)^{ k (\log_2 h/k)} \right) \right\} \\ &= \max \left\{ \mathcal{O}\left( M^k \sum_{h=k}^H R^{h-k} \right), \mathcal{O}\left( \sum_{h=k}^H (MR)^{ k (\log_2 h/k)} \right) \right\} \end{align}\] The first term simplifies as \[\begin{align} M^k \sum_{h=k}^H R^{h-k} &= \mathcal{O}(M^k R^{H-k}). \end{align}\] The second term can be bounded by the integral, \[\begin{align} \sum_{h=k}^H (MR)^{ k (\log_2 h/k)} &\leq \int_{h=k}^H (MR)^{ k (\log_2 h/k)} dh = \frac{H (MR)^{k \log_2 H / k} - k}{1 + k \log_2 (MR)} \\ &= \mathcal{O}\left( (MR)^{k \log_2 H / k} (\log_2 (MR))^{-1} \right) \end{align}\] where the integral is evaluated in Lemma 6.

Thus, for a given set of \(k\) profiles, the number of partitions is bounded by, \[\begin{align} \left\lvert\mathcal{P}^{(k)}\right\rvert &= \max \left\{ \mathcal{O}\left( M^k R^{H-k} \right), \mathcal{O}\left( (MR)^{k \log_2 H / k} (\log_2 (MR))^{-1} \right) \right\}. \end{align}\] Re-introducing the condition for when the first term beats the second gives us the desired result. \(\square\)

15 Appendix to Rashomon set enumeration and Generalizations↩︎

We organize this appendix into proofs for results in Section 6, additional algorithms used in Section 6, and proofs for results in Section 17.

15.1 Proofs in Section 6.↩︎

By definition, \[\begin{align} b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) &\leq \frac{1}{n} \sum_{\pi \in \Pi_{\mathfrak{f}}} \sum_{k(i) \in \pi} (y_i - \widehat{\mu}_{\pi})^2 + \lambda H(\Pi, \mathcal{M}) \end{align}\] Notice that \(\left\lvert\Pi\right\rvert \geq H(\Pi, \mathcal{M})\). Further, by making more splits, we can only reduce the total mean-squared error incurred. Therefore, \[\begin{align} Q(\Pi ; \mathbf{Z}) &= \mathcal{L}(\Pi ; \mathbf{Z}) + \lambda \left\lvert\Pi\right\rvert \\ &= \frac{1}{n} \sum_{\pi \in \Pi} \sum_{k(i) \in \pi} (y_i - \widehat{\mu}_{\pi})^2 + \lambda \left\lvert\Pi\right\rvert \\ &\geq \frac{1}{n} \sum_{\pi \in \Pi} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \cap \pi_{\mathfrak{f}} \neq \varnothing \right\} (y_i - \widehat{\mu}_{\pi})^2 + \lambda \left\lvert\Pi\right\rvert \\ &\geq \frac{1}{n} \sum_{\pi \in \Pi_{\mathfrak{f}}} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \cap \pi_{\mathfrak{f}} \neq \varnothing \right\} (y_i - \widehat{\mu}_{\pi})^2 + \lambda \left\lvert\Pi\right\rvert \\ &\geq \frac{1}{n} \sum_{\pi \in \Pi_{\mathfrak{f}}} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \cap \pi_{\mathfrak{f}} \neq \varnothing \right\} (y_i - \widehat{\mu}_{\pi})^2 + \lambda H(\Pi, \mathcal{M}) \\ &= b(\boldsymbol{\Sigma}_{\mathfrak{f}} ; \mathbf{Z}). \end{align}\] So if \(b(\boldsymbol{\Sigma}, \mathcal{M} ; \mathbf{Z}) > \theta_{\epsilon}\), then \(\boldsymbol{\Sigma}\) is not in the Rashomon set. Now consider \(\boldsymbol{\Sigma}^{\prime} \in \mathrm{child}(\boldsymbol{\Sigma}, \mathcal{M})\). Notice that the size of the fixed set of indices \(\mathcal{M}^{\prime}\) in any child of \(\boldsymbol{\Sigma}\) increases (because there are fewer places to make further splits). With any further split we make in \(\mathcal{M}\), the number of pools increases. Finally, the loss is non-negative. These together imply, \[\begin{align} b(\boldsymbol{\Sigma}^{\prime}, \mathcal{M}^{\prime}; \mathbf{Z}) &\geq b(\boldsymbol{\Sigma}, \mathcal{M} ; \mathbf{Z}) \\ \implies Q(\Pi(\boldsymbol{\Sigma}^{\prime}); \mathbf{Z}) &\geq b(\boldsymbol{\Sigma}^{\prime}, \mathcal{M}^{\prime} ; \mathbf{Z}) \geq b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}). \end{align}\] Therefore, if \(b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) > \theta_{\epsilon}\), then \(\boldsymbol{\Sigma}\) and all \(\boldsymbol{\Sigma}^{\prime} \in \mathrm{child}(\boldsymbol{\Sigma}, \mathcal{M})\) are not in the Rashomon set. \(\square\)

By definition of \(b_{eq}\), \[\begin{align} b_{eq}(\boldsymbol{\Sigma}, \mathcal{M} ; \mathbf{Z}) &\leq \frac{1}{n} \sum_{\pi \in \Pi} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \cap \pi_{\mathfrak{f}}^c \neq \varnothing \right\} (y_i - \widehat{\mu}_{\pi})^2 . \end{align}\] The idea in the inequality above is that any further split we make must obey the splits made at \(\mathcal{M}\). \[\begin{align} Q(\Pi; \mathbf{Z}) &= \mathcal{L}(\Pi ; \mathbf{Z}) + \lambda \left\lvert\Pi\right\rvert \\ &= \frac{1}{n} \sum_{\pi \in \Pi} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \cap \pi_{\mathfrak{f}} \neq \varnothing \right\} (y_i - \widehat{\mu}_{\pi})^2 + \lambda \left\lvert\Pi\right\rvert + \\ & \qquad \quad \frac{1}{n} \sum_{\pi \in \Pi} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \cap \pi_{\mathfrak{f}}^c \neq \varnothing \right\} (y_i - \widehat{\mu}_{\pi})^2 \\ &\geq b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) + b_{eq}(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) \\ &= B(\boldsymbol{\Sigma}, \mathcal{M} ; \mathbf{Z}). \end{align}\] Therefore, if \(B(\boldsymbol{\Sigma}, \mathcal{M} ; \mathbf{Z}) > \theta_{\epsilon}\), then \(Q(\Pi ; \mathbf{Z}) > \theta_{\epsilon}\). Now consider any \(\boldsymbol{\Sigma}^{\prime} \in \mathrm{child}(\boldsymbol{\Sigma}, \mathcal{M})\). Let \(\Pi^{\prime} := \Pi(\boldsymbol{\Sigma}^{\prime})\). Then, \[\begin{align} Q(\Pi^{\prime} ; \mathbf{Z}) &= \mathcal{L}(\Pi^{\prime}; Z) + \lambda \left\lvert\Pi^{\prime}\right\rvert \\ &= \frac{1}{n} \sum_{\pi \in \Pi^{\prime}} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \cap \pi_{\mathfrak{f}} \neq \varnothing \right\} (y_i - \widehat{\mu}_{\pi})^2 + \lambda \left\lvert\Pi^{\prime}\right\rvert + \\ & \qquad \quad \frac{1}{n} \sum_{\pi \in \Pi^{\prime}} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \cap \pi_{\mathfrak{f}}^c \neq \varnothing \right\} (y_i - \widehat{\mu}_{\pi})^2 \\ &\geq b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) + \frac{1}{n} \sum_{\pi \in \Pi^{\prime}} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \cap \pi_{\mathfrak{f}}^c \neq \varnothing \right\} (y_i - \widehat{\mu}_{\pi})^2 \\ &\geq b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) + b_{eq}(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) \\ &= B(\boldsymbol{\Sigma}, \mathcal{M} ; \mathbf{Z}). \end{align}\] In the steps above, we used the fact that making any split will increase the number of pools to say that \(\left\lvert\Pi^{\prime}\right\rvert \geq \left\lvert\Pi\right\rvert\). We also used the definition of \(b_{eq}\) and the idea of a minimum loss incurred by equivalent units in the final step.

Therefore, if \(B(\boldsymbol{\Sigma}, \mathcal{M} ; \mathbf{Z}) > \theta_{\epsilon}\), then \(Q(\Pi^{\prime}; \mathbf{Z}) > \theta_{\epsilon}\) for any \(\boldsymbol{\Sigma}^{\prime} \in \mathrm{child}(\boldsymbol{\Sigma}, \mathcal{M})\). So \(\boldsymbol{\Sigma}\) and all such \(\boldsymbol{\Sigma}^{\prime}\) are not in the Rashomon set. \(\square\)

By inspecting the proofs of Theorems 4 and 5, we have the following Corollary that generalizes to other non-negative loss functions and penalties that are increasing in \(\left\lvert\Pi\right\rvert\).

Corollary 2. Appropriate versions of Theorems 4 and 5 hold for non-negative error functions \(\mathcal{L}^{\prime}(\Pi; \boldsymbol{Z}) \geq 0\) for all \(\Pi, \boldsymbol{Z}\) and penalties \(H^{\prime}(\Pi)\) that are an increasing function of \(\left\lvert\Pi\right\rvert\).

This follows from the proofs of Theorems 4 and 5. Both of these proofs rely on two key ideas: (1) adding more (mean squared) error terms will never decrease the overall loss (keeping the scaling constant \(1/n\) fixed), and (2) adding more pools to \(\Pi\) will only increase the overall loss. Both of these things are true for any non-negative error function and increasing function of \(\left\lvert\Pi\right\rvert\). \(\square\)

First note that Algorithm [alg:r-aggregate-profile] correctly enumerates the Rashomon set for any given profile. This follows directly from Lemma 1, and Theorems 4 and 5.

Next, [alg:r-aggregate-across-profiles] performs a breadth-first search starting at the control profile. Since the \(M\)-dimensional hypercube has a unique source (the control profile) and sink (the profile with all features active), the breadth-first search will terminate after a finite time and traverse every possible path in the hypercube. When traversing an edge in the hypercube, [alg:pool-adjacent-profiles] attempts to pool adjacent profiles using the intersection matrix \(\boldsymbol{\Sigma}^{\cap}\) while obeying [permissibility:case1] and [permissibility:case2] of 9. This pooling attempt is done recursively, guaranteeing that all permissible partitions are considered for the Rashomon set.

The choice of Rashomon thresholds for each profile, described in Line [line:eq-bound], is justified by the usage of Theorem 5.

Correctness of Algorithm [alg:r-aggregate] immediately follows. \(\square\)

15.2 Additional algorithms.↩︎

Input: \(M\) features, \(R\) factors per feature, max pools \(H\), data \(\mathbf{Z}\), Rashomon threshold \(\theta_{\epsilon}\). Output: Rashomon set \(\mathcal{P}_{q, \epsilon}\).

\(\mathcal{P}_{q, \epsilon} = \varnothing\) \(\mathcal{S} = \texttt{cache()}\) \(\mathcal{Q} = \texttt{queue()}\) \(\boldsymbol{\Sigma}= \{1\}^{M \times (R-2)}\) \(\mathcal{Q}\).push\((\boldsymbol{\Sigma}, 1, 1)\) \((\boldsymbol{\Sigma}, i, j) = \mathcal{Q}\texttt{.dequeue()}\)

continue

\(\mathcal{S}\texttt{.insert}((\boldsymbol{\Sigma}, i, j))\)

continue

\(\boldsymbol{\Sigma}^{(1)} = \boldsymbol{\Sigma}\), \(\boldsymbol{\Sigma}^{(1)}_{i,j} = 1\) \(\boldsymbol{\Sigma}^{(0)} = \boldsymbol{\Sigma}\), \(\boldsymbol{\Sigma}^{(0)}_{i,j} = 0\)

\(j_1 = \min \{j \leq R-2 \mid \text{ not } \mathcal{S}\texttt{.seen}(\boldsymbol{\Sigma}^{(1)}, m, j) \}\) \(\mathcal{Q}\texttt{.enqueue}(\boldsymbol{\Sigma}^{(1)}, m, j_1)\) \(j_0 = \min \{j \leq R-2 \mid \text{ not } \mathcal{S}\texttt{.seen}(\boldsymbol{\Sigma}^{(0)}, m, j) \}\) \(\mathcal{Q}\texttt{.enqueue}(\boldsymbol{\Sigma}^{(0)}, m, j_0)\)

continue

\(\mathcal{P}_{q, \epsilon}.add(\boldsymbol{\Sigma}^{(1)})\) \(\mathcal{P}_{q, \epsilon}.add(\boldsymbol{\Sigma}^{(0)})\)

\(\mathcal{Q}\).enqueue\((\boldsymbol{\Sigma}^{(1)}, i, j+1)\) \(\mathcal{Q}\).enqueue\((\boldsymbol{\Sigma}^{(0)}, i, j+1)\)

\(\mathcal{P}_{q, \epsilon}\)
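Because the listing above lost its original line structure, the following self-contained Python sketch conveys the same branch-and-bound idea in a simplified form: a depth-first search over partition matrices for a single profile, pruning a subtree whenever a lower bound on \(Q(\Pi; \mathbf{Z})\) already exceeds the threshold. The bound used here (the \(\ell_0\) penalty of the most-pooled completion, which is valid because the squared error is non-negative) is a deliberately weaker stand-in for \(b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z})\) and \(B(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z})\), and the data are hypothetical.

```python
import itertools
import numpy as np

def feature_groups(row, R):
    """Group levels 1..R-1 of one feature according to one row of Sigma."""
    groups, current = [], [1]
    for level in range(2, R):
        if row[level - 2] == 1:      # levels (level - 1) and level are pooled
            current.append(level)
        else:
            groups.append(current)
            current = [level]
    groups.append(current)
    return groups

def pools(sigma, R):
    """All pools implied by Sigma, as sets of feature combinations."""
    per_feature = [feature_groups(row, R) for row in sigma]
    return [set(itertools.product(*choice))
            for choice in itertools.product(*per_feature)]

def objective(sigma, R, data, lam):
    """Mean squared error around pool means plus the l0 penalty."""
    pool_list = pools(sigma, R)
    sse, n = 0.0, 0
    for pool in pool_list:
        ys = np.concatenate([data[c] for c in pool])
        sse += ((ys - ys.mean()) ** 2).sum()
        n += len(ys)
    return sse / n + lam * len(pool_list)

def enumerate_rps(R, m, data, lam, theta):
    positions = [(i, j) for i in range(m) for j in range(R - 2)]
    found = []

    def search(sigma, depth):
        # Invariant: every entry at positions[depth:] currently equals 1,
        # i.e. the most-pooled completion, which has the fewest pools.
        if lam * len(pools(sigma, R)) > theta:
            return                   # no completion of this prefix qualifies
        if depth == len(positions):
            if objective(sigma, R, data, lam) <= theta:
                found.append([row[:] for row in sigma])
            return
        i, j = positions[depth]
        for value in (1, 0):
            sigma[i][j] = value
            search(sigma, depth + 1)
        sigma[i][j] = 1              # restore the invariant for the caller

    search([[1] * (R - 2) for _ in range(m)], 0)
    return found

# Hypothetical toy data: one profile, m = 2 features, R = 4 levels,
# heterogeneity only in whether the first feature exceeds level 1.
rng = np.random.default_rng(0)
data = {c: rng.normal(float(c[0] > 1), 0.5, size=20)
        for c in itertools.product(range(1, 4), repeat=2)}
rps = enumerate_rps(R=4, m=2, data=data, lam=0.02, theta=0.5)
print(len(rps), "partition matrices in the toy Rashomon set")
```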

Consider enumerating the RPS in a setup with \(M = 2\) features and \(R = 5\) levels in each feature. Let us take the profile where both features are active i.e., \(\rho = (1, 1)\). Algorithm [alg:r-aggregate-profile] starts with the singleton partition i.e., there is only one pool containing all feature combinations (see line 4). This is given by the partition matrix, \[\begin{align} \boldsymbol{\Sigma}&= \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}. \end{align}\] We start our search at (1, 1) in the matrix i.e., in the first feature at the first level and insert \((\boldsymbol{\Sigma}, 1, 1)\) into the queue \(\mathcal{Q}\). Since the queue is initially not empty, we enter the loop in line 6 and remove this partition matrix. As this is the first time we encounter this partition matrix, we mark it as seen using \(\mathcal{S}\).

Then, in lines 11 and 12, we create two copies of this partition matrix – \(\boldsymbol{\Sigma}^{(1)}\) with \(\boldsymbol{\Sigma}_{1, 1} = 1\) (we do not split at the first level of the first feature) and \(\boldsymbol{\Sigma}^{(0)}\) with \(\boldsymbol{\Sigma}_{1, 1} = 0\) (we split at the first level of the first feature).

For each copy, we scan the other features to identify the lowest feature level that we haven’t already seen in \(\mathcal{S}\). Since we haven’t tested the feasibility of these partitions yet, we add them to the queue \(\mathcal{Q}\). This operation happens in lines 13-17.

In line 18, we test the feasibility of the partition \(\boldsymbol{\Sigma}\) we dequeued being present in the RPS. If it is infeasible, we move on to the next element in the queue \(\mathcal{Q}\). If it is indeed feasible, then we check whether \(\boldsymbol{\Sigma}^{(1)}\) and \(\boldsymbol{\Sigma}^{(0)}\) meet the Rashomon threshold in lines 19 and 20, respectively. If they do, then we attempt to split at the next feature level in that feature by adding those to the queue \(\mathcal{Q}\) in lines 21-23.

Then, we dequeue the next item in \(\mathcal{Q}\) and repeat the process until the queue is empty. Any partition or split that we did not check explicitly is a child of one of the partitions we checked and deemed infeasible.

Walkthrough: Figure 12 shows the search tree in this scenario. For the sake of illustration, we follow only certain branches of \(\boldsymbol{\Sigma}^{(0)}\) and show what happens when certain partitions are found to be infeasible. The green numbers show the order in which the matrices are added to the queue (and therefore inspected by the algorithm). The red circled index shows which feature and level is being considered next.

After dequeuing \(\boldsymbol{\Sigma}^{(0)}\), we add matrices 3 and 4 to the queue as per lines 13-17. Then, we add matrices 5 and 6 once we find that \(\boldsymbol{\Sigma}^{(0)}\) satisfies Theorem 5.

Following the order of the queue, we dequeue matrix 3. Since the matrices that we would have added in lines 13-17 are already in the queue, we skip over them.10 Once we find that matrix 3 is feasible, we add matrices 7 and 8 to the queue as in lines 21-23.

Next, we dequeue matrix 4. We add matrices 9 and 10 as per lines 13-17. We find that matrix 4 does not satisfy Theorem 5. So we move on. Observe that in the top right corner of Figure 12, we list matrices that never get checked because we found that matrix 4 is infeasible.

When we dequeue matrix 5, we observe a similar behavior to matrix 3. And when we dequeue matrix 6, we observe a similar behavior to matrix 4. This process continues until the queue is empty.

It is worth noting that Algorithm [alg:r-aggregate-profile] will correctly enumerate the RPS independently of our starting search position. Obviously, some starting positions may be computationally favorable i.e., we do not need to search for too long before we encounter low posterior partitions. We believe domain experts will have a better understanding of the context and may be able to choose a starting location that can reduce computation costs. For instance, in Example [ex:algorithm-raggregate-profile], if we know for sure that there is heterogeneity in the lowest level in the second feature, we may wish to start at position (2, 1) instead of (1, 1) discarding several infeasible partitions when we dequeue the first element itself!

Finally, [alg:r-aggregate-cache] describes the implementation of the caching object used in [alg:r-aggregate].

Input: Partition \(\Pi\), adjacent profiles \(\rho_i, \rho_j\) such that \(\rho_i < \rho_j\). Output: Intersection matrix \(\boldsymbol{\Sigma}^{\cap}\).

\(\mathbf{m} = \rho_i \land \rho_j\) \(m^{\prime} = \rho_i \oplus \rho_j\)

\(\Pi_{\rho_i} = \{ \pi \setminus \{k \mid \rho(k) \neq \rho_i \} \mid \pi \in \Pi \}\) \(\Pi_{\rho_j} = \{ \pi \setminus \{k \mid \rho(k) \neq \rho_j \} \mid \pi \in \Pi \}\) \(\boldsymbol{\Sigma}^{\cap} = [\infty]^{\left\lvert\Pi_{\rho_i}\right\rvert \times \left\lvert\Pi_{\rho_j}\right\rvert}\) \(\mathcal{A} = \sum_{a_1 \in \pi_{k}} \sum_{a_2 \in \pi_{k^{\prime}}} \mathbb{I}\{ \left\lVert\mathbf{x}(a_1) - \mathbf{x}(a_2)\right\rVert_1 = 1\}\) \(\boldsymbol{\Sigma}^{\cap}_{k, k^{\prime}} = 0\)

\(\boldsymbol{\Sigma}^{\cap}\)

Input: Rashomon set \(\mathcal{P}_{q, \epsilon}\), partition \(\Pi\), list of pools that can be pooled across profiles \(\mathbf{z}\), data \(\mathbf{Z}\), Rashomon threshold \(\theta\), intersection matrices already seen \(\mathcal{S}\). Output: Rashomon set \(\mathcal{P}_{q, \epsilon}\).

\((k, k^{\prime}) = \mathbf{z}.\texttt{pop}()\) \(\mathcal{P}_{q, \epsilon} = \texttt{PoolAdjacentProfiles}(\mathcal{P}_{q, \epsilon}, \Pi, \mathbf{z}, \boldsymbol{\Sigma}^{\cap}, \theta)\) \(\boldsymbol{\Sigma}^{\cap, \prime} = \boldsymbol{\Sigma}^{\cap}\) \(\boldsymbol{\Sigma}^{\cap, \prime}_{k, k^{\prime}} = 1\) \(\boldsymbol{\Sigma}^{\cap, \prime}_{k, -k^{\prime}} = \infty\), \(\boldsymbol{\Sigma}^{\cap, \prime}_{-k, k^{\prime}} = \infty\) \(\Pi^{\prime} = \left(\Pi \setminus \{\pi_k, \pi_{k^{\prime}} \}\right) \cup (\pi_k \cup \pi_{k^{\prime}})\) \(\mathcal{P}_{q, \epsilon} = \Pi^{\prime} \cup \texttt{PoolAdjacentProfiles}(\mathcal{P}_{q, \epsilon}, \Pi^{\prime}, \mathbf{z}, \boldsymbol{\Sigma}^{\cap, \prime}, \theta)\)

\(\mathcal{P}_{q, \epsilon}\)

Input: Candidates for Rashomon set \(\mathcal{P}\), control profile \(\rho_0\), data \(\mathbf{Z}\), Rashomon threshold \(\theta\). Output: Rashomon set \(\mathcal{P}_{q, \epsilon}\).

\(\mathcal{P}_{q, \epsilon} = \varnothing\) \(\mathcal{Q} = \texttt{queue}()\) \(\rho_i = \mathcal{Q}.\texttt{dequeue()}\) \(\mathcal{N}(\rho_i) = \{ \rho_j \mid \left\lVert\rho_i - \rho_j\right\rVert_0 = 1, \rho_j > \rho_i \}\) \(\mathcal{Q}.\texttt{enqueue}(\rho_j)\) \(\boldsymbol{\Sigma}^{\cap} =\) IntersectionMatrix\((\Pi, \rho_i, \rho_j)\) \(\mathbf{z} = \{ (k, k^{\prime}) \mid \boldsymbol{\Sigma}^{\cap}_{k, k^{\prime}} = 0 \}\) \(\mathcal{P}_{q, \epsilon} =\) PoolAdjacentProfiles\((\mathcal{P}_{q, \epsilon}, \Pi, \mathbf{z}, \boldsymbol{\Sigma}^{\cap}, \mathbf{Z}, \theta)\)

\(\mathcal{P}_{q, \epsilon}\)

\(\mathcal{S} = \texttt{cache()}\) \(C = \{ \}\)

\(\mathcal{S}\texttt{.insert}(\boldsymbol{\Sigma}, i, j)\) \(\boldsymbol{\Sigma}[i, j:(R-2)] = \texttt{NA}\) \(C = C \cup \{ \boldsymbol{\Sigma}\}\)

\(\mathcal{S}\texttt{.seen}(\boldsymbol{\Sigma}, i, j)\) \(\boldsymbol{\Sigma}[i, j:(R-2)] = \texttt{NA}\) return \(\boldsymbol{\Sigma}\in C\)
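A minimal sketch of this caching object, assuming \(\boldsymbol{\Sigma}\) is passed around as a tuple of tuples: entries of row \(i\) from column \(j\) onwards are masked before hashing, mirroring the NA-masking in the listing above.

```python
class PartitionCache:
    """Cache keyed on Sigma with row i masked from column j onwards."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def _key(sigma, i, j):
        # Replace sigma[i][j:] with a sentinel so that matrices that agree
        # on every other entry map to the same key.
        return tuple(
            tuple(None if (r == i and c >= j) else v
                  for c, v in enumerate(row))
            for r, row in enumerate(sigma)
        )

    def insert(self, sigma, i, j):
        self._seen.add(self._key(sigma, i, j))

    def seen(self, sigma, i, j):
        return self._key(sigma, i, j) in self._seen

# Usage: matrices differing only in masked positions hash identically.
cache = PartitionCache()
cache.insert(((1, 1), (1, 1)), i=0, j=1)
assert cache.seen(((1, 0), (1, 1)), i=0, j=1)
```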

Input: \(K\), a list of \(n\) sorted lists containing a numerical score; \(\theta\), a threshold. Output: \(F\), a list of lists of length \(n\) with indices of elements from each of \(K_i\) such that their sum is less than \(\theta\).

\(F = \{ \}\) \(n = \texttt{len}(K)\)

return \(\{\}\)

\(K_{1, \text{feasible indices}} = \{ i \mid K_1[i] \leq \theta \}\)

\(F = \{K_{1, \text{feasible indices}}\}\)

\(F\)

\(x = \sum_{j=2}^n K_j[1]\)

\(\theta_i = \theta - K_1[i]\) break \(F_i = \texttt{select\_feasible\_combinations}(K[2:], \theta_i)\) \(F\texttt{.insert}([i]\texttt{.append}(f))\)

\(F\)
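A minimal recursive sketch of this routine, assuming each \(K_i\) is sorted in increasing order; because the lists are sorted, the scan over the current list can stop as soon as even the cheapest completion from the remaining lists would exceed the threshold.

```python
def select_feasible_combinations(K, theta):
    """All index tuples, one index per sorted list in K, whose selected
    scores sum to at most theta."""
    if not K:
        return [()]
    head, tail = K[0], K[1:]
    # Cheapest possible contribution of the remaining lists.
    rest_min = sum(scores[0] for scores in tail)
    feasible = []
    for i, score in enumerate(head):
        if score + rest_min > theta:
            break  # head is sorted, so no later index can be feasible
        for rest in select_feasible_combinations(tail, theta - score):
            feasible.append((i,) + rest)
    return feasible

# Usage: lists of per-profile scores, threshold on their sum.
K = [[0.1, 0.4, 0.9], [0.2, 0.3], [0.0, 0.5]]
print(select_feasible_combinations(K, theta=0.8))
```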

16 Appendix to simulations↩︎

16.1 Simulation 1: An example in medicine.↩︎

Imagine a setting where a pharmacist is interested in finding the best treatment as a combination of two drugs, A and B. In particular, we consider a scenario where drug B has no effect, and only the dosage level of drug A matters. Specifically, the treatment effect is directly proportional to the dosage of drug A. Suppose that each drug has 4 possible non-zero dosages \(d \in \{1, 2, 3, 4\}\). We will denote each of the \(4^2 = 16\) unique drug cocktails by their dosage \((d_A, d_B)\). Our parameters and partition are summarized in Figure 13.

Observe that, by design, all non-zero marginal increases in outcomes as we increase dosage levels are correlated. Therefore, we expect Lasso to perform poorly as the \(\ell_1\) regularization penalizes selecting correlated features. However, the \(\ell_0\) penalty does not impose any such independence assumptions.

Figure 13: Hasse diagram illustrating the partition used in Simulation 1.

In each dataset, we fixed the number of samples per feature combination to \(n_k\), and outcomes for each feature combination \(i\) were drawn from a \(\mathcal{N}(\beta_i, 1)\) distribution. We varied \(n_k \in \{10, 20, 50, 100, 500, 1000\}\) and for each \(n_k\) simulated \(r = 100\) datasets. For each dataset, we fit the TVA method [6] and a naive Lasso, and enumerate the RPS.
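A minimal sketch of this data-generating step (the exact \(\beta_i\) values are not reproduced here; the unit slope in the dosage of drug A below is a hypothetical stand-in consistent with the description above):

```python
import numpy as np

def simulate_dataset(beta, n_k, rng):
    """Draw n_k outcomes per feature combination from N(beta_k, 1).

    beta: dict mapping each drug cocktail (d_A, d_B) to its true mean effect
    """
    X, y = [], []
    for combo, mean in beta.items():
        X.extend([combo] * n_k)
        y.extend(rng.normal(mean, 1.0, size=n_k))
    return np.asarray(X), np.asarray(y)

# Hypothetical truth for Simulation 1: only the dosage of drug A matters.
beta = {(dA, dB): float(dA) for dA in range(1, 5) for dB in range(1, 5)}
rng = np.random.default_rng(0)
X, y = simulate_dataset(beta, n_k=10, rng=rng)
```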

The TVA method involves writing down Equation 1 in its variant form described in Equation 11. Then, we puffer-precondition this regression problem and set the \(\ell_1\) regularization parameter to \(\lambda_{\text{TVA}, n_k} = n_k \cdot \lambda_{\text{Lasso}}\), where \(\lambda_{\text{Lasso}} = 10^{-3}\) [60]. The Lasso model we fit is also Equation 11, but without the preconditioning, using the \(\ell_1\) regularization parameter \(\lambda_{\text{Lasso}}\). The Rashomon threshold was chosen to be \(1.5 \times \text{MSE}_{\text{Lasso}}\) where \(\text{MSE}_{\text{Lasso}}\) is the MSE of the Lasso model with \(n_k = 10\). We use the regularization parameter \(\lambda_{\text{RPS}} = 10^{-2}\) for the Rashomon \(\ell_0\) regularization.

Figure 14: Results for Simulation 1. From left to right: mean squared error, best policy set coverage and best policy mean squared error.

The results from the simulations are presented in Figure 14. The metrics reported here are as follows:

  1. Overall mean-squared error (MSE): Suppose \(\widehat{y}_i\) and \(y_i\) are the estimated and true outcomes for unit \(i\), then the overall MSE is defined as \[\begin{align} \text{MSE} &= \frac{1}{n} \sum_{i=1}^n (\widehat{y}_i - y_i)^2. \end{align}\]

  2. Best feature set coverage: Let \(\pi^{\star}\) and \(\widehat{\pi}^{\star}\) be the true and estimated set of features with the highest effect. Then, we define the best feature set coverage as the intersection-over-union of these two sets \[\begin{align} \text{IOU} &= \frac{\left\lvert\pi^{\star} \cap \widehat{\pi}^{\star}\right\rvert}{\left\lvert\pi^{\star} \cup \widehat{\pi}^{\star}\right\rvert}. \end{align}\]

  3. Best feature outcome MSE: Let \(y_{\max}\) be the true best policy effect and \(\widehat{y}_{\max}\) be the estimated best treatment effect. Then the best feature outcome MSE is \[\begin{align} \text{MSE}_{\text{best}} &= (\widehat{y}_{\max} - y_{\max})^2. \end{align}\]

These metrics are easily understood for a single model. For the RPS, we reported the performance metric averaged across all partitions in the RPS.
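For reference, the three metrics can be written compactly as below; the helper names are ours, and for the RPS each metric is averaged over the partitions in the set as described above.

```python
import numpy as np

def overall_mse(y_hat, y):
    """Overall mean squared error across all units."""
    return float(np.mean((np.asarray(y_hat) - np.asarray(y)) ** 2))

def best_set_coverage(best_true, best_est):
    """Intersection-over-union of the true and estimated best feature sets."""
    best_true, best_est = set(best_true), set(best_est)
    return len(best_true & best_est) / len(best_true | best_est)

def best_outcome_mse(y_max_hat, y_max):
    """Squared error of the estimated best treatment effect."""
    return (y_max_hat - y_max) ** 2
```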

From Figure 14, it is clear that the “average” model in the RPS performs similarly to Lasso and TVA in terms of overall MSE but outperforms them by being able to recover the drug combinations correctly. By looking only for the optimal model, Lasso and TVA consistently miss out on coverage for the drug combinations with the highest treatment effect. However, when we look at the RPS of near-optimal models, we can recover the true set of treatments with the highest effect at a higher rate. In fact, the true partition in Figure 13 is present in our RPS 100% of the time.

Our results also reveal that the poor performance in coverage of Lasso and TVA is not a sample size issue. They are simply not the right tool in situations with correlated parameters. Since \(\ell_1\) regularization assumes independence of parameters, Lasso and TVA break down. However, the RPS model does not make any such independence assumption and is thus able to partition the feature space freely.

We visualize the RPS through a heat map. An example heatmap with instructions on how to read it is shown in Figure 15. We also use these heatmaps in our empirical data examples in 19.

Figure 15: Visualizing the Rashomon set through a heat map. This heatmap actually reflects a 2D histogram binned by the model size (number of pools in a partition) and the relative posterior probability ratio i.e., \((\mathbb{P}(\Pi \mid \mathbf{Z}) - \max \mathbb{P}(\Pi \mid \mathbf{Z})) / \max \mathbb{P}(\Pi \mid \mathbf{Z})\). The color of the bin reflects the number of times, averaged per simulation, a model at that sparsity and probability (distinct models may be in the same bin) appears in some Rashomon set. One might refine the set of partitions further by the probability and the sparsity. For example, if we want models with a relative probability of at least -0.25, then we look only at models that are above the dashed black horizontal line. If we want models with fewer than 6 pools, then we look only at models to the left of the dashed black vertical line. If we want both criteria to be satisfied, we look at the top left box.

For Simulation 1, we visualize the RPS in a heatmap in Figure 16. As the number of samples increases, the number of diverse partitions in the Rashomon set becomes smaller since we are more confident in our estimates.

Figure 16: Visualizing the Rashomon set in Simulation 1. Notice how as the size of the data set grows, the Rashomon set concentrates around a few very good models, one of which corresponds to the data generating process.

In 19, we show the full Rashomon approximation error as in 1 for two specific settings. We show both of the error curves illustrating when each case kicks in as we vary \(\theta\).

 


Figure 17: Runtime of enumerating the RPS for a single profile.

 

Figure 18: Fraction of the model space covered by the RPS.

 

Figure 19: Rashomon approximation error as in 1.

17 Generalizations↩︎

Here, we consider two generalizations of the methods discussed so far. First, we consider a family of heterogeneous effect functions beyond just heterogeneity splits. For example, there might be some heterogeneity in slopes (and slopes of slopes, and so on). Second, we extend our method to pool on the space of the covariance between coefficients, rather than on the coefficients themselves. This means that coefficients no longer need to be exactly equal but, instead, related through a sparse covariance structure.

17.1 Pooling higher order derivatives.↩︎

We ask whether, given some feature combination \(k = (k_1, \dots, k_M)\), the marginal effect of increasing, say, \(k_1\) is linear. That is, we can just as easily allow the outcome, as we increase the intensity of a given feature, to follow not just a step function but, for example, a linear relationship.11 In this case, there is no “large” versus “small” effect and no natural pool in the space of levels. However, there is a natural low dimensional effect and even pools when considering the space of slopes. The result is a framework that captures extensions of Bayesian treed models (e.g., [46]).

Before we proceed, we first generalize the notion of pools described in Definition 1.

Definition 10 (Generalization of pools). Given \(M\) features taking on \(R\) partially ordered values each and some function \(g(k, \boldsymbol{\beta})\), a pool \(\pi\) is a set of feature combinations \(k\) whose outcomes are given by \(g(k, \boldsymbol{\beta}_{\pi})\) where \(\boldsymbol{\beta}_{\pi}\) depends on \(\pi\).

It is easy to see that the original pool defined in Definition 1 is recovered by setting \(g(k, \beta_{\pi}) = \beta_{\pi}\).

For instance, suppose we are interested in linear effects. Then the regression equation for pool \(\pi\) is \[\begin{align} y = g(k, \boldsymbol{\beta}_{\pi}) = \beta_{\pi, 0} + \sum_{m=1}^M \beta_{\pi, m} k_m = \beta_{\pi, 0} + \boldsymbol{\beta}^{\intercal}_{\pi} k \end{align}\] where \(\boldsymbol{\beta}_{\pi}\) is the linear slope within that pool. The estimated outcome for feature combination \(k \in \pi\) is \(\widehat{y} = g(k, \widehat{\boldsymbol{\beta}}_{\pi} ; \Pi)\), where \(\widehat{\boldsymbol{\beta}}_{\pi}\) is estimated within each pool using some procedure like least squares.

For some partition \(\Pi\), define the block vector \(\boldsymbol{\beta}= [\boldsymbol{\beta}_{\pi_1}, \dots, \boldsymbol{\beta}_{\pi_{\left\lvert\Pi\right\rvert}}]\) where \(\pi_i \in \Pi\). Then, the general outcome function for any feature combination \(k\) can be written as \[\begin{align} y &= g(k, \boldsymbol{\beta}; \Pi) = \sum_{\pi \in \Pi} \mathbb{I}\{k \in \pi\} g(k, \boldsymbol{\beta}_{\pi}). \end{align}\]

The practitioner is free to choose any domain-specific parametric function. For example, \(g\) could be a higher-order Taylor series-like expansion. Or, \(g\) could even be sinusoidal because the practitioner believes the outcomes are (piece-wise) sinusoidal. Of course, the more complex the estimation procedure for \(\boldsymbol{\beta}\), the harder it is to enumerate the RPS.
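As an illustration of this generalization with a linear \(g\), the following sketch fits \(\boldsymbol{\beta}_{\pi}\) by least squares within each pool and predicts piecewise-linearly; the pool structure and data here are hypothetical.

```python
import numpy as np

def fit_pool_linear(pool, data):
    """Least-squares fit of g(k, beta) = beta_0 + beta^T k within one pool.

    pool: list of feature combinations k (tuples of length M)
    data: dict mapping each k to a 1-D array of observed outcomes
    """
    rows, ys = [], []
    for k in pool:
        for y in data[k]:
            rows.append((1.0,) + tuple(float(v) for v in k))
            ys.append(y)
    X, y = np.asarray(rows), np.asarray(ys)
    # lstsq handles pools in which some feature is constant (rank deficiency).
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [intercept, slope_1, ..., slope_M]

def predict(k, partition, betas):
    """Piecewise-linear outcome: pick the pool containing k, apply its fit."""
    for pool, beta in zip(partition, betas):
        if k in pool:
            return beta[0] + float(np.dot(beta[1:], k))
    raise ValueError("feature combination not covered by the partition")

# Usage with a hypothetical two-pool partition over two features.
rng = np.random.default_rng(0)
partition = [[(1, 1), (1, 2)], [(2, 1), (2, 2)]]
data = {k: rng.normal(k[0] + 0.5 * k[1], 0.1, size=15)
        for pool in partition for k in pool}
betas = [fit_pool_linear(pool, data) for pool in partition]
print(predict((1, 2), partition, betas))
```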

Observe that the form of the posterior remains the same, \[\begin{align} \mathbb{P}(\Pi \mid \mathbf{Z}) &\propto \exp \left\{ -\mathcal{L}(\mathbf{Z}) - \lambda H(\Pi) \right\} = \exp \left\{ - \frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y}_i)^2 - \lambda H(\Pi) \right\}. \end{align}\] Therefore, the results in Section 3 still hold. We can freely choose any other non-negative loss function, \(\mathcal{L}(\mathbf{Z})\), and still use the same framework and algorithm to enumerate the RPS.

Further, the results in Section 6 are also valid when using an arbitrary parametric outcome function as discussed here. We summarize this in Theorem 7.

Theorem 7. Suppose the outcome function is \(g(k, \boldsymbol{\beta}; \Pi)\) for feature combination \(k\), admissible partition \(\Pi\), and some unknown parameter \(\boldsymbol{\beta}\). Let us denote the estimated outcome for unit \(i\) with feature combination \(k\) by \(\widehat{y}_i = g(k, \widehat{\boldsymbol{\beta}}; \Pi)\) where \(\widehat{\boldsymbol{\beta}}\) is estimated from the data. If we use \(\widehat{y}_i\) instead of \(\widehat{\mu}_{\pi}\) in Equations 8 and 10 , then

  1. Theorem 4 is still true,

  2. Theorem 5 is still true, and

  3. Algorithm [alg:r-aggregate] correctly enumerates the Rashomon partitions for outcome function \(g\).

The results follow directly from Theorems 4, 5, and 6. \(\square\)

Figure 20: Hasse diagram for simulation with linear outcomes, shown separately for drug B dosages \(b \in \{1, 2\}\) and \(b \in \{3, 4, 5\}\). \(y\) is young and \(o\) is old.

To see the usefulness of the generalization, we motivate a simple example where we are interested in how a person’s age affects their response to a treatment consisting of a combination of two drugs, A and B. Suppose that there are four possible dosages for drug A, \(\{0, 1, 2, 3\}\), and six possible dosages for drug B, \(\{0, 1, 2, 3, 4, 5\}\), where a dosage of 0 indicates control, and that people are classified as young or old. We assume that there is no treatment effect unless drug A and drug B are taken together. The partition matrix is \[\begin{align} \boldsymbol{\Sigma}&= \begin{bmatrix} 0 & - & - & - \\ 0 & 0 & - & - \\ 1 & 0 & 1 & 1 \end{bmatrix}. \end{align}\] We visualize the twelve pools in Figure 20, indicating heterogeneity in age and the dosages of drugs A and B.

Suppose that the treatment effects are piecewise linear (which generalizes the stepwise effects that we’ve assumed in previous simulations), \[\begin{align} \boldsymbol{\beta}_{1} &= [0, -1, 0, 1] \\ \boldsymbol{\beta}_{2} &= [1.5, -4, 0, 1.5] \\ \boldsymbol{\beta}_{3} &= [0, -1, 0, 1] \\ \boldsymbol{\beta}_{4} &= [4.5, -4, 0, 0.5] \\ \boldsymbol{\beta}_{5} &= [4, -2, -1, 1] \\ \boldsymbol{\beta}_{6} &= [1, 1, 1, -1] \\ \boldsymbol{\beta}_{7} &= [-3, 2, -3, 1] \\ \boldsymbol{\beta}_{8} &= [0, 0, 0, 0] \\ \boldsymbol{\beta}_{9} &= [4, 2, -3, -1] \\ \boldsymbol{\beta}_{10} &= [0, 0, 0, 0] \\ \boldsymbol{\beta}_{11} &= [5, 2, -3, 0] \\ \boldsymbol{\beta}_{12} &= [5, -1, 0, -1], \end{align}\] where the first coefficient is the intercept and the remaining elements are slopes on each feature. For feature profiles with zero treatment effect, we set the effect to be 0, a constant. For the feature profile where drugs A and B are administered together, a random error is drawn independently and identically from \(\mathcal{N}(0, 1)\). We draw 10 measurements for each feature combination. We set \(\lambda = 4 \times 10^{-3}\).
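The following sketch illustrates the mechanics of simulating data of this type. It uses a simplified, hypothetical pool assignment (not the twelve pools of Figure 20) and only two of the coefficient vectors above, purely to show how piecewise-linear pooled outcomes are drawn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature grid: drug A in {0..3}, drug B in {0..5}, age in {0 (young), 1 (old)}.
combos = np.array([(a, b, g) for a in range(4) for b in range(6) for g in range(2)])

def toy_pool(k):
    """Hypothetical pool assignment (NOT the twelve pools of Figure 20):
    pool 0 = no joint treatment (drug A or drug B at dose 0), else split by age."""
    a, b, g = k
    if a == 0 or b == 0:
        return 0
    return 1 + g

# Illustrative per-pool coefficients: [intercept, slope_A, slope_B, slope_age].
beta = {0: np.zeros(4),
        1: np.array([0.0, -1.0, 0.0, 1.0]),
        2: np.array([1.5, -4.0, 0.0, 1.5])}

rows = []
for k in combos:
    pool = toy_pool(k)
    mean = beta[pool][0] + beta[pool][1:] @ k
    # Noise only when both drugs are administered together, 10 draws per combination.
    noise = rng.normal(0.0, 1.0, size=10) if pool != 0 else np.zeros(10)
    for y in mean + noise:
        rows.append((*k, y))

data = np.array(rows)  # columns: dose_A, dose_B, age, outcome
```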

We illustrate the treatment effects for different combinations in black dashed lines in Figure 21. By choosing a linear function as the outcome for each pool, we can find the Rashomon set. In Figure 21, we show the estimated linear curves in 100 models present in the Rashomon set (\(\epsilon \approx 5 \times 10^{-4}\)) in blue. The denser the blue line, the more often it appears in the Rashomon set.

Figure 21: The black line corresponds to the true data-generating process and the blue lines correspond to effects estimated in each model in the Rashomon set. We estimate the outcome of each pool as a linear function of the features. The denser the blue line, the more often it appears in the Rashomon set.

17.2 Sparse correlation structure between coefficients.↩︎

Next, we explore the space of potential (sparse) covariance matrices between the coefficients. We now apply the Hasse structure to the elements of the variance-covariance matrix and pool on the space of covariances rather than the coefficients themselves. This generalization requires an additional distributional assumption on the coefficients. Specifically, assume that \[\begin{align} \boldsymbol{\beta}\mid \mu, \boldsymbol{\Lambda}\sim \mathcal{N}(\mu, \boldsymbol{\Lambda}), \end{align}\] where \(\mu\) is some mean vector and \(\boldsymbol{\Lambda}\) is some covariance matrix. Then the posterior has the form \[\begin{align} \mathbb{P}(\boldsymbol{\beta}, \boldsymbol{\Lambda}, \Pi \mid \mathbf{Z}) \propto \mathbb{P}(\mathbf{y}\mid \boldsymbol{\beta}, \boldsymbol{\Lambda}, \mathbf{D}, \Pi) \cdot \mathbb{P}(\boldsymbol{\beta}, \boldsymbol{\Lambda}, \Pi). \end{align}\]

The likelihood component of the loss is \[\begin{align} \mathbb{P}(\mathbf{y}\mid \boldsymbol{\beta}, \boldsymbol{\Lambda}, \mathbf{D}, \Pi) \propto \exp \left\{ - \frac{1}{N} (\mathbf{y}- \mathbf{D}\boldsymbol{\beta})^{\intercal} \boldsymbol{\Lambda}(\mathbf{y}- \mathbf{D}\boldsymbol{\beta}) \right\}, \end{align}\] where \(N\) is the number of observed data points.

We do not have additional information about the covariance structure (though this could, of course, also be included in a prior) beyond the following three assumptions. First, we think that \(\boldsymbol{\Lambda}\) is dense, i.e., only a small number of coefficient pairs are uncorrelated. Second, we neither know nor want to specify the correlations themselves: it is an \(\ell_0\) problem. Third, we assume independence between the mean and the correlation structure conditional on the covariance pooling; that is, \(\Pi\) is sufficient for determining which dependencies exist. Then \[\begin{align} \mathbb{P}(\boldsymbol{\beta}, \boldsymbol{\Lambda}, \Pi) &= \mathbb{P}(\boldsymbol{\beta}\mid \boldsymbol{\Lambda}, \Pi) \cdot \mathbb{P}(\boldsymbol{\Lambda}\mid \Pi) \cdot \mathbb{P}(\Pi). \end{align}\]

Suppose that we have a partition \(\Pi = \{\pi_1, \dots, \pi_H\}\) where \(H = \left\lvert\Pi\right\rvert\) and \(\Pi\) is now defined in the space of covariance matrices, so pooling means that the covariance values within a pool are non-zero. Then, consider the following procedure for drawing the covariance matrix, \(\boldsymbol{\Lambda}\in \mathbb{R}^{K \times K}\). For each pool \(\pi_i \in \Pi\) of size \(h_i = \left\lvert\pi_i\right\rvert\), draw \(\boldsymbol{\Lambda}_i \sim f_i\) independently where \(f_i\) is some prior (for example, inverse Wishart). Then, \(\boldsymbol{\Lambda}= \text{diag}(\boldsymbol{\Lambda}_1, \dots, \boldsymbol{\Lambda}_H)\). The number of non-zero elements of \(\boldsymbol{\Lambda}\) is given by \(\left\lVert\boldsymbol{\Lambda}\right\rVert_0 = \sum_{i=1}^H h_i^2\). Therefore, we penalize the number of zero elements, \(K^2 - \sum_{i=1}^H h_i^2\). Thus, the prior is \[\begin{align} \mathbb{P}(\Pi) &\propto \exp \left\{ -\lambda \left( K^2 - \sum_{i=1}^H h_i^2 \right) \right\}. \end{align}\]
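As an illustration of this generative scheme, the sketch below draws a block-diagonal \(\boldsymbol{\Lambda}\) with one inverse-Wishart block per pool and computes the penalty \(K^2 - \sum_i h_i^2\); the degrees of freedom and scale matrices are arbitrary choices made only for the example.

```python
import numpy as np
from scipy.stats import invwishart
from scipy.linalg import block_diag

def draw_block_covariance(pool_sizes, seed=0):
    """Draw Lambda = diag(Lambda_1, ..., Lambda_H) with one inverse-Wishart
    block per pool; pool_sizes = [h_1, ..., h_H]. Illustrative prior choice."""
    blocks = [invwishart.rvs(df=h + 2, scale=np.eye(h), random_state=seed + i)
              for i, h in enumerate(pool_sizes)]
    # invwishart.rvs returns a scalar when h == 1; promote to a 1x1 block.
    blocks = [np.atleast_2d(b) for b in blocks]
    return block_diag(*blocks)

def zero_penalty(pool_sizes):
    """Number of zero entries in Lambda: K^2 - sum_i h_i^2 (the prior's penalty)."""
    K = sum(pool_sizes)
    return K ** 2 - sum(h ** 2 for h in pool_sizes)

Lambda = draw_block_covariance([3, 2, 1])
print(Lambda.shape, zero_penalty([3, 2, 1]))  # (6, 6) and 36 - (9 + 4 + 1) = 22
```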

So our penalized loss function is just a weighted mean-squared error with a different penalty, \[\begin{align} Q(\Pi; \mathbf{Z}) &= \mathcal{L}(\Pi; \mathbf{Z}) + \lambda H(\Pi) = \frac{1}{n} \left( \mathbf{y}- \mathbf{D}\boldsymbol{\beta}\right)^{\intercal} \boldsymbol{\Lambda}\left( \mathbf{y}- \mathbf{D}\boldsymbol{\beta}\right) + \lambda \left( K^2 - \sum_{i=1}^H h_i^2 \right). \label{eq:sparse-cov-loss} \end{align}\tag{15}\]

Theorem 8. Consider the same setup as in Section 6 except that the loss function is the weighted mean squared error penalized by the number of zeros in the covariance matrix, as in Equation 15 . Specifically, Equations 8 and 9 are modified, respectively, as \[\begin{align} b(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) &= \frac{1}{n} \sum_{\pi \in \Pi_{\mathfrak{f}}} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \in \pi_{\mathfrak{f}} \right\} \widehat{\Lambda}_{k(i), k(i)}^2 (y_i - \widehat{\mu}_{\pi})^2 + \lambda H(\Pi, \mathcal{M}), \\ b_{eq}(\boldsymbol{\Sigma}, \mathcal{M}; \mathbf{Z}) &= \frac{1}{n} \sum_{\pi \in \Pi_{\mathfrak{f}}} \sum_{k(i) \in \pi} \mathbb{I} \left\{ k(i) \in \pi_{\mathfrak{f}}^c \right\} \widehat{\Lambda}_{k(i), k(i)}^2 (y_i - \widehat{\mu}_{\pi})^2, \end{align}\] where \(\widehat{\Lambda}_{k, k}^2\) is the estimated variance of feature combination \(k\).

Then

  1. Theorem 4 is still true,

  2. Theorem 5 is still true, and

  3. Algorithm [alg:r-aggregate] correctly enumerates the Rashomon partitions.

The results follow directly from Theorems 4, 5, and 6. \(\square\)

18 Further Details on Related Work.↩︎

It is useful to contrast our method with several other (some recent) approaches to studying heterogeneity. Specifically, we are interested in their application to settings with partial orderings (e.g., factorial structure and admissibility), which are easily interpretable.

We will focus on four main related approaches: (1) canonical Bayesian Hierarchical Models (BHM) [26], [61], [62]; (2) \(\ell_1\) regularization of marginal effects to identify heterogeneity [6]; (3) causal forests [10]; and (4) machine learned proxies [53]. We intend this discussion to be a guide for practitioners considering implementing our proposed method or one of these state-of-the-art alternatives. We discuss conceptually related work (e.g., Bayesian decision trees) in previous sections. Let us for the moment set aside the following immediate differences: our focus on robustness, profiles, and enumerating the entire Rashomon Partition Set is novel. Instead, it is useful to identify the philosophical differences across the various approaches and how they relate to ours. Every approach, as we will note, effectively uses partitions \(\Pi\) at some point to determine which data to pool or not. The specific techniques create distributions, possibly degenerate, over these partitions, and these distributions are sampled from and marginalized to estimate treatment effects \(\beta_k\). The interesting question, therefore, is how one builds a distribution over \(\Pi\).

18.1 Bayesian Hierarchical Models.↩︎

We now discuss how our work relates to a canonical representation of a Bayesian Hierarchical Model. As discussed previously, our work is more similar to Bayesian Tree(d) models than to other methods for learning heterogeneity, such as Bayesian Model Averaging. For context, however, we present how our approach compares to the canonical Bayesian approach. The Bayesian perspective provides a compromise between complete pooling and no pooling. Partial pooling occurs by encouraging similarity in the values of parameters without requiring strict equality. Using the notation from our model, for example, we could construct a model where

\[\mathbf{y}\sim N( \mathbf{D}\boldsymbol{\beta},\sigma_y^2)\] and, for the sake of exposition, all \(\beta\) are drawn independently from \[\beta_k \sim N(\mu_{\beta}, \sigma^2_\beta).\] Requiring that all values of \(\boldsymbol{\beta}\) come from the same distribution encourages sharing information across potential feature combinations and encourages the effects on heterogeneity to be similar (but not identical).  [26] uses this approach when comparing treatment effects across multiple domains. In that paper, the goal is not to pool across potentially similar treatment conditions but instead to (partially) pool across geographic areas.

As one example of the classical model, [26] has outcomes for household \(i\) in study \(k\) modeled as

\[y_{ik}\sim N (\mu_k+\tau_k{\mathbf{T}_{ik}},\sigma^2_{yk}) \quad \forall i,k\]

\[\begin{align} \begin{pmatrix}\mu_k\\ \tau_k \end{pmatrix} & \sim & N\left[\left(\begin{array}{c} \mu\\ \tau \end{array}\right),\left(\begin{array}{cc} \sigma^2_\mu & \sigma_{\mu\tau} \\ \sigma_{\mu\tau} & \sigma^2_\tau \end{array}\right)\right] \quad \forall k \end{align}\]

where \(\mu_k\) and \(\tau_k\) are the overall mean and treatment effect in area \(k\), respectively. The vector \(\mathbf{T}_{ik}\) is the treatment indicator for household \(i\) in study \(k\).

One way to measure the degree of pooling is the (partial) “pooling factor” metric defined in [62], \(\omega(\boldsymbol{\beta})={\sigma^2_y}/{(\sigma^2_y+\sigma^2_\beta)}\). The partial pooling metric quantifies how much the effect of treatment combinations varies compared to the overall heterogeneity in the outcomes. In the [26] context, the partial pooling metric refers to the variation due to differences between studies relative to sampling variability.

In contrast, we could think of our approach as using a prior on \(\boldsymbol{\beta}\), conditional on the partition, that potentially forces some values of \(\beta_k, \beta_{k^{\prime}}\) to be equal. In Appendix 12.2, we show that the objective function we use in Equation 5 corresponds to a hierarchical model where we draw the \(\boldsymbol{\beta}\) vector as \[\begin{align} \boldsymbol{\beta}\mid \Pi &\sim \mathcal{N}(\boldsymbol{\mu}_{\Pi}, \boldsymbol{\Lambda}), \end{align}\] where \(\boldsymbol{\mu}_{\Pi}\) is structured such that \(\mu_k = \mu_{k^{\prime}}\) for any \(k, k^{\prime} \in \pi \in \Pi\). Then, given some feature combinations \(\mathbf{D}\), we draw the outcomes as \[\begin{align} \mathbf{y}\mid \mathbf{D}, \boldsymbol{\beta}&\sim \mathcal{N}(\mathbf{D}\boldsymbol{\beta}, \boldsymbol{\Sigma}). \end{align}\] To understand the variation within the \(\boldsymbol{\beta}\) vector, we need to average across potential partitions, since some partitions will set \(\beta_k=\beta_{k^{\prime}}\) and others will not. This amounts to replacing the \(\sigma_\beta^2\) in the pooling factor with the variance of the distribution \(\mathbb{P}(\boldsymbol{\beta}\mid \mathbf{Z})\), which is defined in Equation 3 .
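A minimal sketch of this partition-conditional draw, using a hypothetical three-pool partition and diagonal \(\boldsymbol{\Lambda}\) and \(\boldsymbol{\Sigma}\) purely for simplicity, is:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical partition of K = 5 feature combinations into pools {0,1}, {2}, {3,4}.
pool_of = np.array([0, 0, 1, 2, 2])
pool_means = np.array([1.0, -0.5, 2.0])       # mu_pi for each pool
mu_Pi = pool_means[pool_of]                    # mu_k = mu_{k'} whenever k, k' share a pool

Lambda = 0.1 * np.eye(5)                       # spread of beta around mu_Pi (diagonal here)
beta = rng.multivariate_normal(mu_Pi, Lambda)  # beta | Pi ~ N(mu_Pi, Lambda)

D = rng.integers(0, 2, size=(100, 5)).astype(float)  # design: which combination each unit got
Sigma = np.eye(100)                            # outcome noise (diagonal here)
y = rng.multivariate_normal(D @ beta, Sigma)   # y | D, beta ~ N(D beta, Sigma)
```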

We could also conceptualize the above derivation in terms of equality of \(\boldsymbol{\beta}\) rather than of the means \(\boldsymbol{\mu}_\Pi\). If, for example, we replace \(\boldsymbol{\Lambda}\) with \(\boldsymbol{\Lambda}_\Pi\) where \(\mathrm{Var}(\mu_k, \mu_{k^{\prime}})=0\) for any \(k, k^{\prime} \in \pi \in \Pi\) (or, equivalently, when \(\mu_k = \mu_{k^{\prime}}\)), then we enforce that \(\beta_k=\beta_{k^{\prime}}\). Of course, if we go in the opposite direction and let the diagonal of \(\boldsymbol{\Lambda}_\Pi\) be unconstrained, then there is essentially no sharing of information across feature combinations.

Finally, hierarchical models of this type are, of course, quite flexible and we could construct more complex models that capture features of our pooling approach. Among those options would be to use the Bayesian version of penalized regression, such as the Bayesian Lasso, which would be philosophically related to the approach we describe in the next section.

18.2 Lasso regularization.↩︎

This is the approach taken in prior work by several of the authors of the present paper, [6]. There, the setting was one in which the researcher faced a factorial experimental design: a crossed randomized controlled trial (RCT). The paper developed the Hasse structure described above and an approach that required transforming \(\mathbf{D}\) into an equivalent form presented in Equation 11 in Appendix 11. Here, every parameter \(\alpha_k\) represents the marginal difference between \(\beta_k\) and \(\beta_{k'}\) where \(\rho(k) = \rho(k')\) (they are the same profile) and \(k\) differs from \(k'\) on exactly one arm by one dose. The parameter vector \(\boldsymbol{\alpha}\) records the marginal effects. Notice that the support of \(\boldsymbol{\alpha}\) therefore identifies \(\Pi\) (since non-zero entries determine splits).

The first difficulty in applying this to general settings of heterogeneity is that \(\ell_1\) regularization requires irrepresentability: that there is limited correlation between the regressors so that the support may be consistently recovered [63]. Unfortunately, the regression implied by the Hasse does not satisfy this condition, so some pre-processing is required. [6] apply the Puffer transformation of [60] to restore irrepresentability and then estimate the Lasso model. However, this is not free: the approach requires conditions on the minimum singular value of the design matrix. The authors leverage the structure of a crossed randomized controlled trial (which places considerable restriction on the design matrix) to argue that these conditions are indeed met. There is no such guarantee for general factorial data with arbitrary covariates, and it is unlikely that these conditions hold. So, tackling the much more general structure required moving away from regression (we use decision trees) and changing the regularization (we use \(\ell_0\)).
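For intuition, here is a sketch of the preconditioning step as we understand it from [60]: compute the thin SVD of the design matrix, premultiply both sides by \(F = U \,\text{diag}(1/d)\, U^{\intercal}\), and then run the lasso on the transformed problem. The design, tuning parameter, and variable names below are illustrative, not taken from [6].

```python
import numpy as np
from sklearn.linear_model import Lasso

def puffer_lasso(X, y, alpha=0.1):
    """Precondition (X, y) with F = U diag(1/d) U^T from the thin SVD X = U diag(d) V^T,
    then run the lasso on the transformed problem (sketch of the approach in [60]).
    Assumes X has full column rank so that 1/d is well defined."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    F = U @ np.diag(1.0 / d) @ U.T
    model = Lasso(alpha=alpha, fit_intercept=False)
    model.fit(F @ X, F @ y)
    return model.coef_   # the support of the recovered alpha identifies the splits

# Example with a correlated design (n = 200, p = 10).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10)) + 0.8 * rng.normal(size=(200, 1))  # shared factor -> correlation
alpha_true = np.zeros(10)
alpha_true[[0, 3]] = [2.0, -1.5]
y = X @ alpha_true + rng.normal(size=200)
print(np.round(puffer_lasso(X, y), 2))
```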

The second key observation is that, in the Bayesian interpretation of the lasso, the \(\ell_1\) penalty corresponds to a prior \(\mathbb{P}(\boldsymbol{\alpha})\) that is i.i.d. Laplace in every dimension \(k\). That is, up to an additive constant, \[\begin{align} - \log \mathbb{P}(\boldsymbol{\alpha}) &= -\log \prod_k \mathbb{P}(\alpha_k) = -\log \prod_k \exp(-\lambda \left\lvert\alpha_k\right\rvert) = \lambda \sum_k \left\lvert\alpha_k\right\rvert. \end{align}\] Note that this is true whether one uses the regular lasso, Puffer-transformed lasso, spike-and-slab lasso, group lasso (up to the group level), and so on. At whatever level the \(\ell_1\) sum is taken, it corresponds to independence at that level in the prior.

In practice, what this means is that given two partitions \(\Pi\) and \(\Pi'\) with the same number of pools and the same loss value, the one more consistent with independent values of \(\alpha_k\) will receive a higher posterior. There are at least two problems with this.

The main philosophical problem is that there is no reason to place the meta-structure that the marginal differences between adjacent variants should have an i.i.d. distribution. In fact, one might think that the basic science or social science dictates exactly the opposite. Independence means that a marginal increase in dosage of drug A, holding fixed B and C at some level, is thought to be independent of increasing A holding fixed B and C at (potentially very similar) different levels. Similarly, the marginal value of receiving a slightly larger loan given that the recipient has 10 years of schooling and started 5 previous businesses is independent of receiving a slightly larger loan if the recipient had 10 years of schooling and started 6 previous businesses. Independence is unreasonable in both examples.

There is a second issue: if the object of interest is \(\Pi\), this approach provides no direct way forward. Regularization delivers posteriors over \(\boldsymbol{\alpha}\): \(\mathbb{P}(\boldsymbol{\alpha}\mid \mathbf{y},\mathbf{X})\). This implies a posterior over the support \(S_{\boldsymbol{\alpha}}\). The map from \(S_{\boldsymbol{\alpha}}\) to \(\Pi\) is deterministic and is given by some \(\phi(S_{\boldsymbol{\alpha}}) = \Pi\), which means that \[\mathbb{P}(\Pi \mid \mathbf{y},\mathbf{X}) = \int_{\boldsymbol{\alpha}} {\boldsymbol{1}}\left\{\phi(S_{\boldsymbol{\alpha}}) = \Pi \right\} \cdot \mathbb{P}(\boldsymbol{\alpha}\mid \mathbf{y},\mathbf{X}) \, d\boldsymbol{\alpha}\] is the actual calculation of interest.
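If one did have posterior draws of \(\boldsymbol{\alpha}\) (say, from a Bayesian lasso sampler), this integral could be approximated by Monte Carlo, as in the sketch below; `partition_of_support` is a hypothetical stand-in for the map \(\phi\), and the toy draws are illustrative rather than output from any actual sampler.

```python
import numpy as np
from collections import Counter

def partition_of_support(alpha, tol=1e-8):
    """Hypothetical stand-in for phi(S_alpha): here we simply use the support
    pattern itself as the partition label."""
    return tuple(np.abs(alpha) > tol)

def posterior_over_partitions(alpha_draws):
    """Monte Carlo approximation of P(Pi | y, X) from posterior draws of alpha."""
    counts = Counter(partition_of_support(a) for a in alpha_draws)
    total = sum(counts.values())
    return {pi: c / total for pi, c in counts.items()}

# Toy "posterior" draws of alpha (in practice these would come from an MCMC sampler).
rng = np.random.default_rng(3)
draws = rng.laplace(loc=[1.0, 0.0, -0.5], scale=0.3, size=(1000, 3))
draws[np.abs(draws) < 0.2] = 0.0   # crude sparsification purely for illustration
print(posterior_over_partitions(draws))
```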

So the regularization approach requires the statistician to take all the marginal parameters to be i.i.d., and given this, integrate over possible coefficient vectors that are consistent with this specific aggregation. This makes calculating an RPS very difficult.

18.3 Causal Random Forests.↩︎

We now compare our approach to Causal Random Forests (CRFs) introduced by  [10]. CRFs construct regression trees over the space of potential combinations of covariates. Trees partition the space of covariates into “leaves.” Unlike our setting, trees are hierarchical; the procedure to construct trees involves splitting the observed data in two based on \(X_i\) being above or below a threshold. They then partition recursively, dividing each subsequent group until the leaves contain very few observations. This approach can also be thought of as finding nearest neighbors, where the number of neighbors is the number of observations in the leaf and distance on the tree is the closeness metric. CRFs construct a conditional average treatment effect at a pre-determined point \(X=x\), \(\tau(x)=\mathbb{E}[Y{(1)}-Y{(0)}|X=x]\), where \(Y{(1)}\) is the potential outcome under treatment and \(Y{(0)}\) is the potential outcome under control.

Relating this back to our work, take \(T\) to be a tree and \(\pi\in \Pi(T)\) to be a leaf in the tree, which corresponds to a pool in our language. Then, the estimated expected outcome for each leaf is \[\begin{align} \widehat{\beta}_{\pi} &= \frac{1}{\left\lvert\{i:X_i \in \pi \}\right\rvert}\sum_{\{i:X_i \in \pi \}} Y_i. \end{align}\]

Further, taking \(\tau_\pi\) to be the treatment effect of observations in pool (leaf) \(\pi\) and \(W_i\) as the treatment indicator, which we assume orthogonal to \(X\) and \(Y\), the estimated treatment effect for \(\pi\) is \[\begin{align} \widehat{\tau}_{\pi} &= \frac{1}{\left\lvert\{i:W_i=1, X_i \in \pi \}\right\rvert}\sum_{\{i:W_i=1,X_i \in \pi \}} Y_i - \frac{1}{\left\lvert\{i:W_i=0,X_i \in \pi \}\right\rvert}\sum_{\{i:W_i=0,X_i \in \pi \}} Y_i. \end{align}\]
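Both displays amount to simple within-leaf averages, as in the following sketch (variable names are illustrative; the CRF splitting rule ensures each leaf contains both treated and control observations, which the sketch assumes):

```python
import numpy as np

def leaf_estimates(Y, W, leaf_id):
    """Per-leaf outcome mean (beta_hat_pi) and difference-in-means treatment
    effect (tau_hat_pi), following the displays above.

    Y       : (n,) outcomes
    W       : (n,) binary treatment indicators
    leaf_id : (n,) leaf (pool) assignment for each observation
    """
    out = {}
    for leaf in np.unique(leaf_id):
        in_leaf = leaf_id == leaf
        beta_hat = Y[in_leaf].mean()
        tau_hat = Y[in_leaf & (W == 1)].mean() - Y[in_leaf & (W == 0)].mean()
        out[leaf] = (beta_hat, tau_hat)
    return out
```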

To summarize, the approach for forming trees splits the observed covariate space into partitions, known as leaves. Each leaf consists of a mix of people in treatment and control groups; in fact, the specification of the tree depends on this balance across treatment and control groups, since the algorithm requires that splitting preserves a minimum number of treatment and control observations in each leaf. To compute a treatment effect conditional on a particular value of \(X\), one looks at the difference in outcomes between treated and control people in a given leaf. Outcomes are not considered when constructing the tree (in contrast to our proposed approach), and treatment status is not used to split explicitly but does influence the construction of the tree through the sample size restriction.

Despite being similar in that we both use geometric objects that partition the space of covariates, there are three fundamental differences between our approach and CRFs. The first difference is geometric. CRFs use regression trees, whereas we use Hasse diagrams. Regression trees are appealing in many settings because of their flexibility in representing complex, nonlinear, relationships between variables. Regression trees, however, require imposing a hierarchy between variables that is not supported by the data. This hierarchy is “baked in” to the structure of the trees and is evident from how we describe constructing trees in the previous paragraph. The data, however, are not fully hierarchical and are instead partially ordered.

This mismatch creates an identification issue. Within education and within income, there are clear orderings. There is, however, no hierarchy between education and income. One tree may, therefore, split first on income and then split on education conditional on income, while another tree does the opposite. In both cases, we can trace the trees to end up with the same estimated treatment effects for any group of covariates (as shown in [10]). The trees themselves, however, arise from this arbitrary ordering and are, thus, not interpretable. Work such as [64] describes measures of variable importance in CRFs, but the problem of an arbitrarily imposed hierarchy is still present. Hasse diagrams, in contrast, are the natural geometry for partially ordered sets, alleviating this issue and allowing the researcher to interpret the pooling structure on the domain of the covariates directly.

The second difference is computational but has conceptual implications. In both our approach and CRFs, we do not take the structure of the partition as known. Both approaches must, therefore, account for additional uncertainty in treatment effect estimates that arises from not knowing the partition. In CRFs, bootstrap samples over the data propagate this uncertainty. CRFs then aggregate over trees using Monte Carlo averaging over \(b=1,\dots,B\) bootstrap samples of the covariates and outcomes, \(\{Z_1,\dots,Z_n\}\),

\[RF(\pi; Z_1,...,Z_n) \approx \frac{1}{B}\sum_{b=1}^B T(\pi; \xi_b; Z_{b1}^*,...,Z_{bs}^*),\]

where \(\pi\) represents a pool or leaf specifying a combination of features and levels. The \(\xi_b\) term is an additional stochastic component. The trees sampled as part of this process create a “forest” and are, by definition, random draws given the data. That is, given a different set of data, the distribution of likely trees would change. They are also not guaranteed to be optimal or nearly optimal. If the goal is to estimate average treatment effects, this approach represents a principled way to explore the space of trees. If the goal, however, is to identify potential models of heterogeneity, then sampling randomly is very unlikely to produce high quality trees. With Rashomon partitions, by definition, we guarantee that all models in our set have high posterior density.

To this point, we have not discussed inference in CRFs. A key contribution of [10] is forming so-called “honest” trees that account for issues arising when the same data are used to learn the trees and then to make inference conditional on that group of trees. In our work, we use a Bayesian framework to address this issue, which also has the advantage of being able to estimate functions of treatment effects (see 1). Future work, however, could consider Rashomon sets for honest regression trees. This work would build upon our own work as well as  [9], which introduces Rashomon sets for classification trees. The algorithm for inference would begin with splitting as proposed by [10] to preserve honest inference and then construct Rashomon sets using the algorithm from [9]. Since the space of trees is enormous, finding the “best” tree is computationally prohibitive in general, which creates issues for finding the Rashomon set since the best tree is used to define the reference partition. Fortunately, a recent paper by [65] provides an algorithm. While this approach would allow the CRF framework to find optimal trees, it does not address the identifiability issue that arises when using trees for data that are only partially ordered. Related work by [66] estimates heterogeneous treatment effects using a sum of Bayesian regression trees, which they refer to as the Bayesian causal forest. They decompose the outcome into a mean outcome and a treatment effect. Since they are only interested in the treatment effect, the mean outcome becomes a nuisance parameter. They impose a vague prior on the mean and a strong prior on the treatment effect. Otherwise, the tree estimation procedure is identical to Bayesian Additive Regression Trees [3].

Third, both our approach and CRFs impose regularization but do so in philosophically very different ways. We take the perspective that we do not know and cannot fully enumerate the correlation structure in a high dimensional space. So we use the \(\ell_0\) prior, which we show is the least informative prior in 2. In other words, we regularize, and impose a prior, on the size of the partition. In doing so, we trade off information on the full distribution in order to robustly identify partitions. Causal forests, on the other hand, regularize on the number of observations in each leaf of the tree. Specifically, they require at least \(k\) samples in each leaf. This choice is sensible because with insufficient data there is no information. At the same time, it is odd in that the regularization depends directly on the data. Elaborating on this, we can write the posterior for some tree \(T\) given data \(\mathbf{y}, \mathbf{X}\) as \[\begin{align} \mathbb{P}(T \mid \mathbf{y}, \mathbf{X}) &\propto \mathbb{P}(\mathbf{y}\mid T, \mathbf{X}) \mathbb{P}( T \mid \mathbf{X}) \text{ where } \mathbb{P}(T \mid \mathbf{X}) \propto \exp \left\{ - \frac{\lambda}{\min_{\pi \in \Pi(T)} n_{\pi}(\mathbf{X})} \right\}, \end{align}\] where \(\Pi(T)\) is the set of pools (leaves) in \(T\) and \(n_{\pi}(\mathbf{X})\) is the number of observations in \(\mathbf{X}\) that belong to pool \(\pi\). This prior down-weights and discards partitions in which some pool \(\pi\) has few observations. In that sense, the prior effectively assumes that in the background there is a kind of stratification: observations are sampled from some process such that all pools have enough observations, even though the true partition is unknown. This is awkward because it ties the data collection process to the actual true partitioning; the user of the causal forest is implicitly assuming that they have effectively stratified data collection against the unknown partitions.
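For concreteness, the implied prior weight in the display above can be computed as follows; the exponential form is the stylized one we wrote down, not code from any causal forest software.

```python
import numpy as np

def crf_style_prior_weight(leaf_sizes, lam=1.0):
    """Prior weight proportional to exp(-lam / min_pi n_pi(X)): trees with a
    very small leaf are heavily down-weighted."""
    return np.exp(-lam / min(leaf_sizes))

print(crf_style_prior_weight([50, 40, 45]))  # ample data in every leaf
print(crf_style_prior_weight([50, 2, 45]))   # one tiny leaf -> much smaller weight
```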

Together, these differences mean that the scope of our method is wider than that of CRFs. While both methods can estimate heterogeneity in treatment effects and control for multiple testing, we also produce interpretable explanations of heterogeneity. For the reasons outlined above, namely identification and sampling, it is not possible to extract information on the relationship between covariates from elements of the random forest. We can, of course, test for any hypothesis about potential heterogeneity between arbitrary combinations of features, but CRFs require that we specify the hypothesis a priori. In our setting, however, finding the set of high posterior probability partitions gives a policymaker or researcher a set of potential models of heterogeneity and interaction between the covariates that can be used to design future policies or generate new research hypotheses. On the other hand, our method assumes that the posterior has separated modes. If the posterior distribution is very flat or has many (many) very similar modes, then the Rashomon set will be very large and our benefits in terms of interpretability will diminish.

18.4 Treatment heterogeneity via Machine Learning Proxies.↩︎

[53] propose a general framework for using machine learning proxies to explore treatment effect heterogeneity. They allow for estimation of multiple outcomes, including conditional average treatment effects and treatment effect heterogeneity between the most and least impacted groups. Rather than search the space of covariates directly, [53] uses a machine learning method to create a “proxy” for the heterogeneous effects. This approach has the advantage that it can be applied in high dimensional settings. A downside, however, is that the machine learning proxies are often uninterpretable in terms of the original covariates, making it necessary to post-process the treatment effect distributions to gain insights about particular covariates.

We now give a brief overview to unify notation but do not exhaustively cover all the estimators presented in [53]. Say that \(s_0(Z)\) is the true conditional average treatment effect, \(\mathbb{E}[Y(1)|Z]-\mathbb{E}[Y(0)|Z]\). Ascertaining the functional form of the relationship between the non-intervention covariates \(X\) and the outcome \(Y\), though, is complicated when \(X\) is high dimensional. In response,  [53] use a machine learning method (e.g., neural networks, random forests, etc.) to construct a proxy for \(s_0(Z)\) using an auxiliary dataset. In a heuristic sense, this proxy serves the role of a partition \(\pi\), in that it aggregates across covariates to separate the data based on the treatment effect. This analogy is most direct when the machine learning model is a decision tree (which it need not be), since in that case leaves of the tree would correspond to partitions of the covariate space based on treatment effect. After computing the machine learning proxy, [53] then project it back to the space of the observed outcomes. It is then also possible to construct clusterings based on the proxies and relate those clusterings back to the outcomes.  [53] differs from our approach in many of the same ways as the comparison with [10], namely that we focus on identifying multiple explanations for heterogeneity and that we utilize the Hasse diagram as a geometric representation of partial ordering. We also find that this structure is sufficient to explore models for heterogeneity on the space of the covariates without the need to use proxies.

19 Appendix to Empirical Data Examples↩︎

19.1 2D histograms.↩︎

Figures 22 and 23 visualize the Rashomon sets for the charitable giving dataset of [11] and the NHANES telomere length data using the 2D histogram described in Figure 15 in Appendix 16.

Figure 22: Visualizing the Rashomon set for the [11] charitable donations dataset. The top two panels show the distribution of partition sizes and a 2D histogram of how partition sizes and relative posterior probabilities vary. The black dotted line in the 2D histogram shows our chosen Rashomon threshold. The bottom two panels show the same after pruning low-posterior models.

Figure 23: Visualizing the Rashomon set for the NHANES telomeres dataset. The top two panels show the distribution of model sizes and their relative posterior probabilities. The black dashed vertical and horizontal lines show the sparsity cutoff and Rashomon cutoff, respectively. The bottom two panels show the same after pruning low-posterior models.

19.2 Heterogeneity in the impact of microcredit access.↩︎

For the microcredit data from [23], we present the results for all profiles in Figure 24. This includes the robust profiles we discussed in Figure 7 as well as the non-robust ones.

Figure 24: Here, we visualize the average number of models in the Rashomon set indicating a positive, zero, or negative effect. Each column corresponds to a different feature profile, where the label denotes which features are active (i.e., do not take the lowest level). “None” means that all features take their lowest values. We also allow the gender of the household head and the education status of the household head to take on any value in all sixteen feature profiles.

Additionally, we look at the treatment effect heterogeneity across genders, \[\begin{align} \textrm{HTE}(\mathbf{x}) &= \mathbb{E}\left[ \left\{Y_i(1, F, \mathbf{x}) - Y_i(0, F, \mathbf{x}) \right\} - \left\{ Y_i(1, M, \mathbf{x}) - Y_i(0, M, \mathbf{x})\right\} \right], \end{align}\] where \(Y_i(\cdot, F, \cdot)\) is interpreted as the potential outcome of household \(i\) were it headed by a woman, and \(Y_i(\cdot, M, \cdot)\) is the potential outcome of household \(i\) were it headed by a man. As before, we use the sample means \(\widehat{y}(\cdot)\) to find \(\widehat{\textrm{HTE}}(\mathbf{x})\) and \(\textrm{sign}\{ \widehat{\textrm{HTE}}(\mathbf{x}) \}\). Again, we sort \(\mathbf{x}\) into profiles and repeat the same counting and averaging exercise. We visualize the results as before in Figure 25. For most profiles, we see essentially no robust conclusions about gender heterogeneity in treatment effects. We highlight a few robust items below.
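The counting exercise itself is straightforward; the sketch below computes the plug-in \(\widehat{\textrm{HTE}}(\mathbf{x})\) from one model's fitted cell means and tallies signs across models in the Rashomon set. The data container and names are hypothetical, shown only to make the calculation explicit.

```python
import numpy as np

def hte_hat(y_means, x):
    """Plug-in estimate of HTE(x) from one model's fitted cell means; y_means maps
    (treatment, gender, features) -> fitted mean outcome (hypothetical container)."""
    return ((y_means[(1, "F", x)] - y_means[(0, "F", x)])
            - (y_means[(1, "M", x)] - y_means[(0, "M", x)]))

def sign_counts(hte_values):
    """Tally how many models in the Rashomon set imply a positive, zero, or
    negative HTE for a given feature combination."""
    signs = np.sign(hte_values)
    return {"positive": int((signs > 0).sum()),
            "zero": int((signs == 0).sum()),
            "negative": int((signs < 0).sum())}
```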

Figure 25: Here, we visualize the average number of models in the Rashomon set indicating a positive, zero, or negative effect. The axis labels should be read as in Figure 24.

We see an increase in loans procured by households headed by women with past business experience when compared to households headed by men. When these households are already in debt with no previous experience, they tend to borrow less. We see no heterogeneity by gender in the amount of informal loans procured.

Households headed by women tend to consistently spend more; in particular, they spend more money on durable goods than households headed by men. We also see that, in the absence of past experience, there is a decline in expenditure on temptation goods compared to households headed by men. We also see a higher tendency for women to invest in business assets than men.

We find that households headed by women with no past experience have lower revenue than households headed by men, but this effect is reversed when the households do have previous business experience. However, there is no heterogeneity by gender in profits or the number of employees. We also find that households headed by women tend to spend fewer hours working when they are in debt or when there is regional competition, but this makes a negligible difference to profits.

We find that in households headed by women, there is less participation in the business by women if the household is in debt and there is competition from neighbors. We also find that fewer girls attend school in households headed by women with no previous experience than in households headed by men.

References↩︎

[1]
Madigan, D. and Raftery, A. E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam’s window. Journal of the American Statistical Association, 89(428):1535–1546.
[2]
Andrews, I., Kitagawa, T., and McCloskey, A. (2019). Inference on winners. Technical report, National Bureau of Economic Research.
[3]
Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1).
[4]
Bissiri, P. G., Holmes, C. C., and Walker, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):1103–1130.
[5]
Madigan, D., Raftery, A. E., Volinsky, C. T., and Hoeting, J. A. (1996). Bayesian model averaging. Integrating Multiple Learned Models (IMLM-96), (P. Chan, S. Stolfo, and D. Wolpert, eds).
[6]
Banerjee, A., Chandrasekhar, A. G., Dalpath, S., Duflo, E., Floretta, J., Jackson, M. O., Kannan, H., Loza, F. N., Sankar, A., Schrimpf, A., et al. (2021). Selecting the most effective nudge: Evidence from a large-scale experiment on immunization. Technical report, National Bureau of Economic Research.
[7]
Semenova, L., Rudin, C., and Parr, R. (2022). On the existence of simpler machine learning models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1827–1858.
[8]
Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M., and Rudin, C. (2017). Learning certifiably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 35–44.
[9]
Xin, R., Zhong, C., Chen, Z., Takagi, T., Seltzer, M., and Rudin, C. (2022). Exploring the whole Rashomon set of sparse decision trees. Advances in Neural Information Processing Systems, 35:14071–14084.
[10]
Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242.
[11]
Karlan, D. and List, J. A. (2007). Does price matter in charitable giving? Evidence from a large-scale natural field experiment. American Economic Review, 97(5):1774–1793.
[12]
Rossiello, F., Jurk, D., Passos, J. F., and d’Adda di Fagagna, F. (2022). Telomere dysfunction in ageing and age-related diseases. Nature cell biology, 24(2):135–147.
[13]
Srinivas, N., Rachakonda, S., and Kumar, R. (2020). Telomeres and telomere length: a general overview. Cancers, 12(3):558.
[14]
Alder, J. K., Hanumanthu, V. S., Strong, M. A., DeZern, A. E., Stanley, S. E., Takemoto, C. M., Danilova, L., Applegate, C. D., Bolton, S. G., Mohr, D. W., et al. (2018). Diagnostic utility of telomere length testing in a hospital-based setting. Proceedings of the National Academy of Sciences, 115(10):E2358–E2365.
[15]
Protsenko, E., Rehkopf, D., Prather, A. A., Epel, E., and Lin, J. (2020). Are long telomeres better than short? relative contributions of genetically predicted telomere length to neoplastic and non-neoplastic disease risk and population health burden. PloS one, 15(10):e0240185.
[16]
Chae, D. H., Nuru-Jeter, A. M., Adler, N. E., Brody, G. H., Lin, J., Blackburn, E. H., and Epel, E. S. (2014). Discrimination, racial bias, and telomere length in African-American men. American Journal of Preventive Medicine, 46(2):103–111.
[17]
Geronimus, A. T., Pearson, J. A., Linnenbringer, E., Schulz, A. J., Reyes, A. G., Epel, E. S., Lin, J., and Blackburn, E. H. (2015). Race-ethnicity, poverty, urban stressors, and telomere length in a Detroit community-based sample. Journal of Health and Social Behavior, 56(2):199–224.
[18]
Hamad, R., Tuljapurkar, S., and Rehkopf, D. H. (2016). Racial and socioeconomic variation in genetic markers of telomere length: a cross-sectional study of US older adults. EBioMedicine, 11:296–301.
[19]
Vyas, C. M., Ogata, S., Reynolds, C. F., Mischoulon, D., Chang, G., Cook, N. R., Manson, J. E., Crous-Bou, M., De Vivo, I., and Okereke, O. I. (2021). Telomere length and its relationships with lifestyle and behavioural factors: variations by sex and race/ethnicity. Age and ageing, 50(3):838–846.
[20]
Angelucci, M., Karlan, D., and Zinman, J. (2015). Microcredit impacts: Evidence from a randomized microcredit program placement experiment by compartamos banco. American Economic Journal: Applied Economics, 7(1):151–182.
[21]
Attanasio, O., Augsburg, B., De Haas, R., Fitzsimons, E., and Harmgart, H. (2015). The impacts of microfinance: Evidence from joint-liability lending in Mongolia. American Economic Journal: Applied Economics, 7(1):90–122.
[22]
Augsburg, B., De Haas, R., Harmgart, H., and Meghir, C. (2015). The impacts of microcredit: Evidence from Bosnia and Herzegovina. American Economic Journal: Applied Economics, 7(1):183–203.
[23]
Banerjee, A., Duflo, E., Glennerster, R., and Kinnan, C. (2015). The miracle of microfinance? Evidence from a randomized evaluation. American Economic Journal: Applied Economics, 7(1):22–53.
[24]
Crépon, B., Devoto, F., Duflo, E., and Parienté, W. (2015). Estimating the impact of microcredit on those who take it up: Evidence from a randomized experiment in Morocco. American Economic Journal: Applied Economics, 7(1):123–150.
[25]
Tarozzi, A., Desai, J., and Johnson, K. (2015). The impacts of microcredit: Evidence from Ethiopia. American Economic Journal: Applied Economics, 7(1):54–89.
[26]
Meager, R. (2019). Understanding the average impact of microcredit expansions: A Bayesian hierarchical analysis of seven randomized experiments. American Economic Journal: Applied Economics, 11(1):57–91.
[27]
Banerjee, A., Breza, E., Duflo, E., and Kinnan, C. (2019). Can microfinance unlock a poverty trap for some entrepreneurs? Technical report, National Bureau of Economic Research.
[28]
Baland, J.-M., Somanathan, R., and Vandewalle, L. (2008). Microfinance Lifespans: A Study of Attrition and Exclusion in Self-Help Groups in India. In India Policy Forum, volume 4(1), pages 159–210. National Council of Applied Economic Research.
[29]
Chatfield, C. (1995). Model uncertainty, data mining and statistical inference. Journal of the Royal Statistical Society Series A: Statistics in Society, 158(3):419–444.
[30]
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231.
[31]
McAllister, J. W. (2007). Model selection and the multiplicity of patterns in empirical data. Philosophy of Science, 74(5):884–894.
[32]
Tulabandhula, T. and Rudin, C. (2014). Robust optimization using machine learning for uncertainty sets. arXiv preprint arXiv:1407.1097.
[33]
Pawelczyk, M., Broelemann, K., and Kasneci, G. (2020). On counterfactual explanations under predictive multiplicity. In Peters, J. and Sontag, D., editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pages 809–818. PMLR.
[34]
Black, E., Raghavan, M., and Barocas, S. (2022). Model multiplicity: Opportunities, concerns, and solutions. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 850–863.
[35]
D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., Chen, C., Deaton, J., Eisenstein, J., Hoffman, M. D., et al. (2022). Underspecification presents challenges for credibility in modern machine learning. The Journal of Machine Learning Research, 23(1):10237–10297.
[36]
Kobylińska, K., Krzyziński, M., Machowicz, R., Adamek, M., and Biecek, P. (2023). Exploration of Rashomon set assists explanations for medical data. arXiv preprint arXiv:2308.11446.
[37]
Zhong, C., Chen, Z., Seltzer, M., and Rudin, C. (2023). Exploring and interacting with the set of good sparse generalized additive models. arXiv e-prints, pages arXiv–2303.
[38]
Marx, C., Calmon, F., and Ustun, B. (2020). Predictive multiplicity in classification. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6765–6774. PMLR.
[39]
Coker, B., Rudin, C., and King, G. (2021). A theory of statistical inference for ensuring the robustness of scientific results. Management Science, 67(10):6174–6197.
[40]
Watson-Daniels, J., Parkes, D. C., and Ustun, B. (2023). Predictive multiplicity in probabilistic classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37(9), pages 10306–10314.
[41]
Fisher, A., Rudin, C., and Dominici, F. (2019). All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res., 20(177):1–81.
[42]
Dong, J. and Rudin, C. (2020). Exploring the cloud of variable importance for the set of all good models. Nature Machine Intelligence, 2(12):810–824.
[43]
Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948.
[44]
Denison, D. G., Mallick, B. K., and Smith, A. F. (1998). A Bayesian CART algorithm. Biometrika, 85(2):363–377.
[45]
Wu, Y., Tjelmeland, H., and West, M. (2007). Bayesian CART: Prior specification and posterior simulation. Journal of Computational and Graphical Statistics, 16(1):44–66.
[46]
Chipman, H. A., George, E. I., and McCulloch, R. E. (2002). Bayesian treed models. Machine Learning, 48:299–320.
[47]
Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92(437):179–191.
[48]
Clyde, M. (2003). Model averaging. Subjective and objective Bayesian statistics, pages 636–642.
[49]
Hans, C., Dobra, A., and West, M. (2007). Shotgun stochastic search for “large p” regression. Journal of the American Statistical Association, 102(478):507–516.
[50]
Onorante, L. and Raftery, A. E. (2016). Dynamic model averaging in large model spaces using dynamic Occam’s window. European Economic Review, 81:2–14.
[51]
Tian, J. and He, R. (2009). Computing posterior probabilities of structural features in Bayesian networks. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 538–547, Arlington, Virginia, USA. AUAI Press.
[52]
Chen, Y. and Tian, J. (2014). Finding the k-best equivalence classes of Bayesian network structures for model averaging. In Proceedings of the AAAI conference on artificial intelligence, volume 28(1).
[53]
Chernozhukov, V., Demirer, M., Duflo, E., and Fernandez-Val, I. (2018). Generic machine learning inference on heterogeneous treatment effects in randomized experiments, with an application to immunization in India. Technical report, National Bureau of Economic Research.
[54]
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature machine intelligence, 1(5):206–215.
[55]
Parikh, H., Ross, R., Stuart, E., and Rudolph, K. (2024). Who are we missing? a principled approach to characterizing the underrepresented population. arXiv preprint arXiv:2401.14512.
[56]
Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360.
[57]
Agrawal, D., Pote, Y., and Meel, K. S. (2021). Partition function estimation: A quantitative study. arXiv preprint arXiv:2105.11132.
[58]
Flajolet, P. and Sedgewick, R. (2009). Analytic Combinatorics. Cambridge University Press.
[59]
Topley, K. (2016). Computationally efficient bounds for the sum of catalan numbers. arXiv preprint arXiv:1601.04223.
[60]
Jia, J. and Rohe, K. (2015). Preconditioning the lasso for sign consistency. Electronic Journal of Statistics, 9:1150–1172.
[61]
Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4):377–401.
[62]
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515–534.
[63]
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563.
[64]
Bénard, C. and Josse, J. (2023). Variable importance for causal forests: breaking down the heterogeneity of treatment effects. arXiv preprint arXiv:2308.03369.
[65]
Hu, X., Rudin, C., and Seltzer, M. (2019). Optimal sparse decision trees. Advances in Neural Information Processing Systems, 32.
[66]
Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2020). Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects (with discussion). Bayesian Analysis, 15(3):965–1056.

  1. We discuss this distinction in more detail in Section 18 in the context of frequentist tree-based methods that use the same geometry.↩︎

  2. This is related to the candidate models for Bayesian model selection used by [1]. Their set of models is \(\mathcal{A} = \{\Pi : \mathbb{P}(\Pi_0 \mid \mathbf{Z}) / \mathbb{P}(\Pi \mid \mathbf{Z}) \leq \tilde{c}\}\) where \(\Pi_0\) is the maximum a posteriori estimate. In the language of Rashomon sets, \(\tilde{c} = (\mathbb{P}(\Pi_0 \mid \mathbf{Z}) \mathbb{P}(\mathbf{Z}))^{-\epsilon}\).↩︎

  3. For example, \(a=(1, 0, 0, 0)\) will have three feature combinations, \([(1, 0, 0, 0), (2, 0, 0, 0), (3, 0, 0, 0)]\) and \(a=(1, 1, 0, 0)\) will have \(3^2 = 9\) feature combinations, \([(1, 1:3, 0, 0), (2, 1:3, 0, 0), (3, 1:3, 0, 0)]\).↩︎

  4. See https://wwwn.cdc.gov/nchs/nhanes/1999-2000/TELO_A.htm for details. Website last accessed on 2024-01-29.↩︎

  5. We removed all individuals who were missing data for relevant covariates. We binned the number of hours worked: \(\leq 20\) hours, \(21-40\) hours, and \(\geq 40\) hours. Age was categorized into five ordered discrete factors – \(\leq 18\) years, \(19-30\) years, \(31-50\) years, \(51-70\) years, and \(>70\) years. Education was categorized into 3 ordered discrete factors (no GED, GED no college, completed college).↩︎

  6. Along with [permissibility:case1], this means that we can reach \(k^{(1)}\) from \(k^{(2)}\) by traversing the Hasse for \(\rho_1\) to \(k^{(3)}\), then jumping to \(k^{(4)}\) along an edge on the \(M\)-d hypercube, and then moving from \(k^{(4)}\) to \(k^{(2)}\) while respecting the Hasse for \(\rho_2\).↩︎

  7. Along with [permissibility:case1] and [permissibility:case2], this means that we can reach \(\rho_k\) from \(\rho_0\) by traversing the \(M\)-d hypercube while staying within \(\pi\) and respecting the Hasse at each vertex of the hypercube.↩︎

  8. In a frequentist perspective, under “beta-min” classical assumptions and irrepresentability, this will correctly identify the true generative partition with probability tending to one (as seen in [6]).↩︎

  9. In this case, these draws need not be independent or identical. The computations just become a little more tedious.↩︎

  10. Strictly speaking, the algorithm would add them to the queue anyway and would later skip them when they get dequeued. For simplicity of illustration, we skip over them immediately.↩︎

  11. Extensions of this kind can be made to accommodate higher order derivatives and other bases as well, e.g., sinusoidal effects.↩︎