Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization

Caio de Prospero Iglesias caiopigl@mit.edu
Sloan School of Management
Massachusetts Institute of Technology

Kimberly Villalobos Carballo kimberly.v@nyu.edu
Tandon School of Engineering
New York University

Dimitris Bertsimas dbertsim@mit.edu
Sloan School of Management
Massachusetts Institute of Technology


Abstract

We address the problem of policy selection in contextual stochastic optimization (CSO), where covariates are available as contextual information and decisions must satisfy hard feasibility constraints. In many CSO settings, multiple candidate policies—arising from different modeling paradigms—exhibit heterogeneous performance across the covariate space, with no single policy uniformly dominating. We propose Prescribe-then-Select (PS), a modular framework that first constructs a library of feasible candidate policies and then learns a meta-policy to select the best policy for the observed covariates. We implement the meta-policy using ensembles of Optimal Policy Trees trained via cross-validation on the training set, making policy choice entirely data-driven. Across two benchmark CSO problems—single-stage newsvendor and two-stage shipment planning—PS consistently outperforms the best single policy in heterogeneous regimes of the covariate space and converges to the dominant policy when such heterogeneity is absent. All the code to reproduce the results can be found at https://anonymous.4open.science/r/Prescribe-then-Select-TMLR.

1 Introduction

Optimization under uncertainty arises in numerous applications, including supply chain management, energy markets, and financial planning, and remains an important research area in the optimization community [1]–[5]. In these problems, decisions are made under uncertainty about future outcomes, with the objective of minimizing total cost. A wide variety of data-driven methods have been developed to address such problems, including sample average approximation [6]–[8], stochastic approximation algorithms [9], [10], robust optimization techniques [11]–[13], and distributionally robust optimization formulations [14], [15]. These approaches differ in how they model and mitigate uncertainty: some aim to minimize expected costs (e.g., sample average or stochastic approximation methods), while others focus on guarding against worst-case outcomes (as in robust optimization) or balance the two by minimizing worst-case expected cost under a range of plausible distributions (as in distributionally robust optimization).

In many real-world settings, however, the decision-maker also observes auxiliary data—known as covariates, context, or side information—prior to making a decision. For example, in transportation, real-time weather and traffic conditions can help reduce uncertainty in travel times; in retail, promotions and seasonal signals can inform demand; and in healthcare, patients’ comorbidities and clinical histories provide valuable information about health outcomes. When such covariates are available, they can help infer relevant characteristics of the underlying uncertainty, narrowing the set of plausible scenarios and enabling more informed decisions. This observation has led to the development of contextual stochastic optimization [16], where the goal is to learn a policy that maps covariates to decisions.

A diverse set of methodologies has emerged for solving contextual stochastic optimization problems. One approach is to fit a parametric decision rule—such as a linear function [17], [18], kernel-based model [17], [19]–[22], or neural network [23]–[25]—by minimizing empirical cost (potentially regularized) over historical data. These methods are computationally efficient at deployment since no optimization is needed once the policy is trained. However, their performance is often sensitive to the choice of function class and may suffer if the true policy is poorly approximated within the chosen function space.

An alternative family of methods, commonly referred to as predict-then-optimize, follows a two-stage approach: first, the conditional distribution of the uncertain parameters is estimated given the observed covariates; then, this estimate is used to approximate the expected cost and solve the corresponding optimization problem [17], [18], [26], [27]. Such two-stage approaches offer modularity and interpretability; yet, they risk suboptimal decisions when accurate predictions do not translate into good prescriptions—particularly when the loss function used in training is misaligned with the downstream cost. To mitigate this, recent work has focused on end-to-end training schemes that integrate optimization objectives with predictive modeling [28]–[32]. By taking into account the optimization task loss during predictive model training, these methods can yield stronger performance. However, they introduce new challenges, such as increased computational burden during training.

Each of these paradigms presents trade-offs between statistical guarantees, computational complexity, and decision quality. In particular, when multiple methods or policies are available—perhaps developed using different paradigms, modeling choices, or feature sets—an important question arises: how should one select among a set of candidate policies? In this paper, we address this question through the following contributions:

  1. We formalize the problem of policy selection in contextual stochastic optimization, where covariates are available as contextual information, and decisions must satisfy hard feasibility constraints.

  2. We introduce Prescribe-then-Select, a modular framework that first generates a diverse library of feasible candidate policies and then uses supervised learning to train a meta-policy that maps each covariate to the best candidate policy.

  3. We conduct extensive computational experiments on two benchmark CSO problems in operations management: newsvendor and shipment planning. Results show that the proposed meta-policy improves over the best single policy in settings where different policies perform best across different regions of the covariate space, and matches the best single policy otherwise.

2 Related Work

Our work relates to several strands of literature on selecting among a set of candidate decision policies in the presence or absence of contextual information. While there is extensive research on policy ensembling, mixture-of-experts architectures, and divide-and-conquer strategies in reinforcement learning (RL), the problem has received far less attention in the setting of contextual optimization under uncertainty within operations research, where feasibility constraints and explicit cost function minimization are critical.

One line of work in reinforcement learning examines policy ensembling in context-free settings, where multiple policies are combined to improve stability, robustness, or exploration without conditioning on the current context [33], [34]. These methods typically aggregate policies by averaging their action outputs or using majority voting, aiming to reduce variance, mitigate overfitting, and hedge against the weaknesses of individual learners. While effective in improving average performance, these methods do not adapt policy choice to specific states, tasks, or other contextual information, and therefore cannot exploit scenarios where different policies are optimal in different contexts.

To address such heterogeneity, RL has developed context-dependent policy selection mechanisms, most prominently through mixture-of-experts architectures. Here, a meta-policy—often referred to as a gating function—maps the current context to a distribution over expert policies. In early work [35]–[37], the context was simply the state of the environment, leading individual policies to specialize in different regions of the state space; the meta-policy would then either combine their actions (soft selection) or choose the highest-scoring policy (hard selection). More recently, [38] extended this approach to contextual policy transfer, in which the context incorporates higher-level task descriptors in addition to the state, allowing the meta-policy to select from a library of policies that were trained on different tasks but may transfer knowledge to the current task. Related approaches in divide-and-conquer RL partition the state space and assign a separate policy to each region [39], [40]. While these techniques achieve context-dependent specialization, they are typically studied in online settings and operate in unconstrained action spaces where policies can be aggregated without violating feasibility. Moreover, partitions of the contextual space are often learned to optimize surrogate objectives that are disconnected from the true cost function of interest in the downstream optimization problem.

In contrast, contextual optimization under uncertainty has seen almost no work on combining multiple policies. The only example we are aware of is [41], who recently proposed learning a fixed linear combination of candidate policies to improve decision quality. Their method uses historical data to estimate weights for each policy and forms a weighted sum of their outputs, demonstrating performance gains in the newsvendor problem [42]. While promising in unconstrained settings, this approach is not applicable when decisions must satisfy hard feasibility constraints, since averaging feasible policies does not necessarily produce another feasible policy. Furthermore, limiting aggregation to a fixed linear combination excludes more expressive functions that can capture complex, nonlinear relationships between context and the most suitable policy.

In contrast to these approaches, we propose a general framework for policy selection in single-stage and two-stage contextual optimization under uncertainty, where decisions do not affect the underlying uncertainty and must satisfy hard feasibility constraints. Our method first generates a library of feasible candidate policies, and then learns a supervised selection model to choose exactly one policy for each instance based on its context. By tailoring the policy assignment to each observed covariate, the approach can outperform any single candidate policy on average, especially when different policies are optimal in different regions of the covariate space. This enables flexible reuse of high-performing policies, adaptation to heterogeneous contexts, and robust performance across complex decision environments.

The remainder of the paper is organized as follows. Section 3 introduces the general formulation of contextual optimization under uncertainty, and Section 4 presents the proposed Prescribe-then-Select framework in detail. Section 5 reports the results of our computational experiments, and Section 6 offers concluding remarks.

3 Problem Setting

We consider the problem of contextual stochastic optimization (CSO) [16], in which a random vector of covariates \(X\in {\mathcal{X}}\subseteq \mathbb{R}^{d_x}\)—capturing context such as holidays, promotions, or seasonal trends—is observed as side information in a stochastic optimization problem. After observing \(X= {\boldsymbol{x}}\), the decision-maker selects an action \({\boldsymbol{z}}\) from a feasible set \({\mathcal{Z}}\subseteq \mathbb{R}^{d_z}\) so as to minimize the expected cost with respect to some unknown random variable \(Y\in {\mathcal{Y}}\subseteq \mathbb{R}^{d_y}\). We assume that a cost function \(c : {\mathcal{Z}}\times {\mathcal{Y}}\to \mathbb{R}\) is provided, and that decisions \({\boldsymbol{z}}\) do not affect the uncertainty \(Y\). Specifically, the problem can be written as \[\label{eq:cso} v^\star({\boldsymbol{x}})= \min_{{\boldsymbol{z}}\in{\mathcal{Z}}}\; \mathbb{E}\bigl[c({\boldsymbol{z}},Y)\mid X={\boldsymbol{x}}\bigr],\qquad \pi^\star({\boldsymbol{x}})\in \mathop{\mathrm{arg\,min}}_{{\boldsymbol{z}}\in{\mathcal{Z}}}\; \mathbb{E}\bigl[c({\boldsymbol{z}},Y)\mid X={\boldsymbol{x}}\bigr],\tag{1}\] where \(v^\star\) is the optimal value function and \(\pi^\star\) is the optimal policy function. Given a dataset \({\mathcal{D}}_N=\{({\boldsymbol{x}}_i,{\boldsymbol{y}}_i)\}_{i=1}^N\) with historical observations, the data-driven approach to solving equation 1 consists of using these observations to approximate the conditional expectation in the objective.

Two classical baselines illustrate the spectrum of data-driven approaches. At one extreme, the sample-average-approximation (see, e.g., [8]) method ignores covariates altogether, replacing the conditional expectation in equation 1 with its unconditional counterpart and approximating it via the empirical average. While simple to implement, sample-average approximation produces the same decision for all contexts, even when covariates could help narrow down the realized value of the uncertainty  \(Y\). At the other extreme lies the point-prediction approach (see, e.g., [18]), which fits a predictive model \(\hat{\mu}_N({\boldsymbol{x}})\) to estimate \(\mathbb{E}[Y\mid X={\boldsymbol{x}}]\), and then substitutes this prediction into the objective, converting the problem into \(\mathop{\mathrm{arg\,min}}_{{\boldsymbol{z}}\in{\mathcal{Z}}} c\bigl({\boldsymbol{z}},\hat{\mu}_N({\boldsymbol{x}})\bigr)\). This method tailors decisions to the observed context and is computationally efficient, but it reduces the entire conditional distribution to a single point estimate, ignoring its variance and how the cost function responds to uncertainty beyond the mean.

An important challenge in the CSO problem is that the relationship between \(X\) and \(Y\) can vary substantially across regimes. For instance, in retail sales, routine weekdays, promotional weekends, and holiday peaks may each exhibit very different conditional demand distributions \(f_{Y\mid X}(y \mid x)\). Such heterogeneity in the covariate space may also induce heterogeneity in policy performance, with different policies performing best in different covariate regimes. For instance, low‑variance regions often reward the precision of point-prediction, whereas high-variance regions may benefit from approaches that explicitly account for uncertainty in \(c({\boldsymbol{z}}, Y)\), such as sample average approximation. Likewise, many other existing approaches to the CSO problem may perform well in certain regions of the covariate space but not on average across the entire space. These observations motivate the use of multiple candidate policies, each designed to perform well under a specific regime, together with a selection mechanism that assigns the most cost-effective policy to each context. The remainder of the paper develops and analyzes this approach.

4 Methodology: Prescribe-then-Select

This section presents our proposed method for addressing covariate heterogeneity in contextual stochastic optimization by combining multiple decision policies within a unified framework. Instead of relying on a single model across the entire covariate space, we construct a library of diverse candidate policies and use a meta-policy to select one among them based on the observed covariates. Each candidate policy maps contextual features to feasible decisions under distinct modeling assumptions and inference mechanisms. They may differ in the type of predictive estimates they produce, their complexity, or their inductive biases. The meta-policy then leverages this diversity by jointly learning a partition of the covariate space and deciding the most effective policy for each region.

We illustrate the methodology using a few representative examples of individual policies that often exhibit heterogeneous performance across the covariate space. The framework, however, is readily extensible and can incorporate policies developed under other approaches to the CSO problem. We obtain the meta-policy by training an ensemble of policy trees, which are depth-constrained decision trees over the covariate space that aim to minimize the empirical cost of the CSO problem over historical data. We refer to the proposed method as Prescribe-then-Select, reflecting its two-stage structure: first prescribing a set of candidate policies, then selecting the most suitable one for each region of the covariate space. The following subsections describe each component of this method in detail.

4.1 Prescribing: Developing a library of candidate policies

The first stage of our framework constructs a library of prescriptive models, each representing a policy that makes feasible decisions given the observed covariates. Here we consider three families of prescriptive methods: sample-average-approximation policies, point–prediction policies, and predictive–prescriptive policies. However, we highlight that our framework could include any other feasible policies for the CSO problem.

4.1.1 Sample Average Approximation (SAA).

The sample-average-approximation method [8] is a classical approach for solving stochastic optimization problems by replacing the true expectation in equation 1 with an empirical average computed from a finite sample. Given \(N\) historical scenarios \(\{{\boldsymbol{y}}_i\}_{i=1}^N\), each representing a realization of the uncertain parameters, SAA approximates the expectation as \(\mathbb{E}[c({\boldsymbol{z}},Y)] \;\approx\; \frac{1}{N}\sum_{i=1}^N c({\boldsymbol{z}},{\boldsymbol{y}}_i)\) and finds a prescription by solving the following optimization problem: \[\label{eq:saa} \begin{align} \pi^{\mathrm{\small SAA}} \in \mathop{\mathrm{arg\,min}}_{{\boldsymbol{z}}\in{\mathcal{Z}}}\; \frac{1}{N}\sum_{i=1}^N c({\boldsymbol{z}},{\boldsymbol{y}}_i). \end{align}\tag{2}\]

Unlike other CSO policies, SAA does not condition on covariates \({\boldsymbol{x}}\) and instead treats all observed scenarios equally. As a result, it is well suited to settings where covariate information is absent or irrelevant. Solving equation 2 reduces the problem to a single deterministic optimization problem, making SAA straightforward to implement.
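
For concreteness, the following minimal sketch (ours, for illustration only) computes the SAA prescription in equation 2 for a toy single-product newsvendor by grid search over the order quantity; the optimization problems in our experiments are instead solved with Gurobi (Section 5.3).

```python
import numpy as np

rng = np.random.default_rng(0)
y_hist = rng.gamma(shape=5.0, scale=6.0, size=500)   # historical demand scenarios {y_i}

price, cost = 10.0, 6.0   # toy single-product newsvendor: c(z, y) = cost*z - price*min(y, z)

# SAA: minimize the empirical average cost over a grid of feasible order quantities.
grid = np.linspace(0.0, y_hist.max(), 1000)
avg_cost = np.array([(cost * z - price * np.minimum(y_hist, z)).mean() for z in grid])
z_saa = grid[avg_cost.argmin()]
print(f"SAA order quantity: {z_saa:.1f}")
```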

4.1.2 Point–Prediction Policies (PPt).

The point–prediction framework first fits a predictive model \(\hat{\mu}_N({\boldsymbol{x}}) \approx \mathbb{E}[Y\mid X= {\boldsymbol{x}}]\) and then solves a deterministic optimization problem where uncertainty is collapsed to its predicted mean: \[\label{eq:ppt} \pi^{\mathrm{\small PPt}}({\boldsymbol{x}}) = \mathop{\mathrm{arg\,min}}_{{\boldsymbol{z}}\in {\mathcal{Z}}} \, c\bigl({\boldsymbol{z}}, \hat{\mu}_N({\boldsymbol{x}})\bigr).\tag{3}\] This approach is efficient and interpretable, and often performs well when the cost function is approximately linear and the conditional distribution of \(Y\) has low variance. We illustrate this approach with two predictive models:

(a) PPt-kNN: Given a data point \({\boldsymbol{x}}\in {\mathcal{X}}\), the \(\mathrm{k}\)-nearest neighbor model (kNN) [43] identifies the \(\mathrm{k}\) training samples \(\{({\boldsymbol{x}}_i, {\boldsymbol{y}}_i)\}\) closest to \({\boldsymbol{x}}\) under a fixed distance metric. Denoting by \(\mathcal{N}_{\mathrm{k}}({\boldsymbol{x}})\) the set of indices \(i\) for these neighbors, we use \(\hat{\mu}_N({\boldsymbol{x}}) = \frac{1}{\mathrm{k}}\sum_{i\in \mathcal{N}_{\mathrm{k}}({\boldsymbol{x}})} \boldsymbol{y}_i\) in the case of numerical uncertainty, or majority vote in the case of categorical uncertainty.

(b) PPt-RF: Random forests (RF) are ensemble models that aggregate predictions from multiple decision trees, each trained on a different bootstrap sample of the data and typically using randomized feature selection at each split [44]. This procedure, known as bagging (bootstrap aggregation), reduces variance and enhances generalization. Let \(T_1, \dots, T_B\) denote the trees in the forest, and let \(\mathcal{L}_b({\boldsymbol{x}})\) be the set of training indices falling in the same leaf as \({\boldsymbol{x}}\) in tree \(T_b\). The predictions for a given input \({\boldsymbol{x}}\) are then given by \(\hat{\mu}_N({\boldsymbol{x}}) = \frac{1}{B} \sum_{b=1}^B \frac{1}{|\mathcal{L}_b({\boldsymbol{x}})|}\sum_{i \in \mathcal{L}_b({\boldsymbol{x}})} \boldsymbol{y}_i\).
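
The sketch below illustrates the PPt pipeline with scikit-learn regressors; the grid-search "solver" and the toy newsvendor cost are placeholders standing in for the actual optimization in equation 3, and the data and variable names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(500, 3))                 # covariates
y = 30 + 20 * X[:, 0] + rng.normal(0.0, 3.0, size=500)   # scalar demand (toy)

price, cost = 10.0, 6.0
grid = np.linspace(0.0, 80.0, 800)                       # feasible order quantities (toy Z)

def solve_point(mu_hat):
    """Deterministic surrogate problem min_z c(z, mu_hat), solved here by grid search
    as a stand-in for a real optimization solver."""
    return grid[(cost * grid - price * np.minimum(mu_hat, grid)).argmin()]

knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
rf = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

x_new = np.array([[0.8, 0.2, 0.5]])
z_ppt_knn = solve_point(knn.predict(x_new)[0])   # PPt-kNN prescription
z_ppt_rf = solve_point(rf.predict(x_new)[0])     # PPt-RF prescription
```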

4.1.3 Predictive–Prescriptive Policies (PP).

The predictive–prescriptive framework [18], [45] integrates local statistical estimation with scenario-based optimization, and can be interpreted as a form of weighted sample-average-approximation [16]. Given a predictive model that assigns weights \(w_{N,i}({\boldsymbol{x}})\) to each training pair \(({\boldsymbol{x}}_i, {\boldsymbol{y}}_i)\), this approach approximates the conditional expectation in equation 1 as \[\label{eq:pp-prescription} \begin{align} \mathbb{E}\!\big[c({\boldsymbol{z}},Y)\mid X={\boldsymbol{x}}\big] &\approx \sum_{i=1}^N w_{N,i}({\boldsymbol{x}})\,c({\boldsymbol{z}},{\boldsymbol{y}}_i) , \quad \text{and solves}\quad \pi^{\mathrm{\small PP}}({\boldsymbol{x}}) \in \mathop{\mathrm{arg\,min}}_{{\boldsymbol{z}}\in{\mathcal{Z}}}\; \sum_{i=1}^N w_{N,i}({\boldsymbol{x}})\,c({\boldsymbol{z}},{\boldsymbol{y}}_i). \end{align}\tag{4}\]

These normalized weights (\(\sum_{i=1}^N w_{N,i}({\boldsymbol{x}}) = 1\)) capture the local distributional information implied by each model and are used to generate prescriptions tailored to the covariate \({\boldsymbol{x}}\). Using the PP framework, the kNN and RF predictive models would lead to the following weighting schemes:

(a) PP-kNN: In this method, the nearest neighbors of a data point are not used to estimate \(\mathbb{E}[Y \mid X = {\boldsymbol{x}}]\), but rather they are used to estimate a local approximation of the conditional distribution \(f_{Y\mid X}(\cdot \mid {\boldsymbol{x}})\). Specifically, the weights in the PP approach are given by: \[w_{N,i}^{\mathrm{\small kNN}}({\boldsymbol{x}}) = \begin{cases} \frac{1}{\mathrm{k}}, & \text{if } i \in \mathcal{N}_{\mathrm{k}}({\boldsymbol{x}}), \\ 0, & \text{otherwise}. \end{cases}\]

(b) PP-RF: For a given data point \({\boldsymbol{x}}\in {\mathcal{X}}\), each tree in the forest routes \({\boldsymbol{x}}\) to a leaf node. The collection of training samples retrieved from the leaf nodes across all trees forms a local, ensemble-based approximation to the conditional distribution \(f_{Y\mid X}(\cdot \mid {\boldsymbol{x}})\). As with kNN, this distribution can be used to support scenario-based optimization. The PP approach in this case defines the weights as: \[\begin{align} w_{N,i}^{\mathrm{\small RF}}({\boldsymbol{x}}) = \frac{1}{B} \sum_{b=1}^B \frac{\mathbb{1}[i \in \mathcal{L}_b({\boldsymbol{x}})]}{|\mathcal{L}_b({\boldsymbol{x}})|}. \end{align}\]
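
Both weighting schemes are straightforward to extract from standard scikit-learn models. The sketch below (illustrative only, with a toy newsvendor cost and a grid-search solver standing in for equation 4) computes the kNN weights from the nearest-neighbor indices and the RF weights from the leaf memberships returned by forest.apply.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(500, 3))
y = 30 + 20 * X[:, 0] + rng.normal(0.0, 3.0, size=500)
price, cost = 10.0, 6.0
grid = np.linspace(0.0, 80.0, 800)                       # feasible decisions (toy, one product)

def knn_weights(X_train, x, k=5):
    """w_i = 1/k for the k nearest neighbors of x, 0 otherwise."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(X_train).kneighbors(x.reshape(1, -1))
    w = np.zeros(len(X_train))
    w[idx[0]] = 1.0 / k
    return w

def rf_weights(forest, X_train, x):
    """w_i = (1/B) * sum_b 1[i in L_b(x)] / |L_b(x)|, using the leaf indices from forest.apply()."""
    leaves_train = forest.apply(X_train)                 # shape (N, B): leaf of each sample per tree
    leaves_x = forest.apply(x.reshape(1, -1))[0]         # shape (B,): leaf of x per tree
    w = np.zeros(len(X_train))
    for b in range(leaves_train.shape[1]):
        in_leaf = leaves_train[:, b] == leaves_x[b]
        w += in_leaf / in_leaf.sum()
    return w / leaves_train.shape[1]

def weighted_saa(w, y_train):
    """min_z sum_i w_i c(z, y_i) over the grid (toy newsvendor cost)."""
    costs = np.array([np.dot(w, cost * z - price * np.minimum(y_train, z)) for z in grid])
    return grid[costs.argmin()]

forest = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)
x_new = np.array([0.8, 0.2, 0.5])
z_pp_knn = weighted_saa(knn_weights(X, x_new), y)          # PP-kNN prescription
z_pp_rf = weighted_saa(rf_weights(forest, X, x_new), y)    # PP-RF prescription
```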

4.2 Selecting: Learning the best policy for each context

The second stage of our framework learns a meta-policy that chooses the best candidate policy for the given context. Let the library of candidate policies be denoted by \(\Pi_M = \{\pi^{m}\}_{m=1}^M\), each mapping covariates \({\boldsymbol{x}}\in {\mathcal{X}}\) to a feasible decision \({\boldsymbol{z}}\in {\mathcal{Z}}\). To learn the meta-policy we adopt Optimal Policy Trees (OPT) [46], which partition the feature space into axis-aligned regions and assign a fixed policy index to each region. While extensions that allow for more general splits—such as linear combinations of features—do exist, we do not explore them in this paper.

Formally, a policy tree is a decision tree \(T(x; \Theta)\), parameterized by \(\Theta = \{(R_j, \gamma_j)\}_{j=1}^J\), where \(\{R_j\}_{j=1}^J\) is a disjoint partition of \({\mathcal{X}}\) and \(\gamma_j \in [M]\) denotes the policy index assigned to region \(R_j\), with \([n] \mathrel{\vcenter{:}}= \{1, \dots, n\}\). The policy induced by this tree is \(\pi^{T({\boldsymbol{x}};\Theta)}({\boldsymbol{x}})\), where \(T({\boldsymbol{x}};\Theta) = \sum_{j=1}^J \gamma_j\, \mathbb{1}\{{\boldsymbol{x}}\in R_j\}\). The policy tree itself does not produce decisions, but rather selects which candidate policy to invoke for a given input. Since all policies \(\pi^{m}\) produce feasible decisions, so does \(\pi^{T({\boldsymbol{x}};\Theta)}({\boldsymbol{x}})\). The goal is to find the policy tree that solves the following minimization problem: \[\label{eq:opt95obj} \begin{align} \hat{\Theta} \in \mathop{\mathrm{arg\,min}}_{\Theta} \quad & \mathbb{E}\left[ c\bigl(\pi^{T(X;\Theta)}(X), Y\bigr)\right]. \end{align}\tag{5}\] By construction, the optimal policy tree cannot perform worse than the best single policy in the library. Moreover, if there exists a region of the covariate space where another policy offers a sufficiently large improvement in conditional expected cost, the optimal policy tree will achieve a corresponding improvement in overall performance. The following lemma formalizes this property.

Lemma 1. Suppose that \(X, Y\) are obtained from a joint probability distribution, and let \(m^\star\) be the index of the policy with the smallest overall expected cost \(\mathbb{E}\left[ c\bigl(\pi^{m}(X), Y\bigr)\right]\) from the library \(\Pi_M = \{\pi^{m}\}_{m=1}^M\). Suppose there exists a region \(R \subseteq \mathcal{X}\) with \(\Pr(X\in R) > 0\) and a policy \(m \neq m^\star\) such that \(\mathbb{E}\left[c\bigl(\pi^{m}(X), Y\bigr) \,\middle|\, X\in R \right] \le \mathbb{E}\left[c\bigl(\pi^{m^\star}(X), Y\bigr) \,\middle|\, X\in R \right] - \delta\), for some \(\delta > 0\). Then \[\begin{align} \mathbb{E}\!\left[c\bigl({\pi}^{T({\boldsymbol{x}};\hat{\Theta})}(X), Y\bigr)\right] \le \mathbb{E}\!\left[c\bigl(\pi^{m^\star}(X), Y\bigr)\right] - \delta \cdot \Pr(X\in R). \end{align}\]

The proof follows because we can construct a tree that assigns policy \(m\) to \(R\) and policy \(m^\star\) to its complement \(R^c\). The expected cost of this tree is \[\Pr(X\in R) \, \mathbb{E}\left[c\bigl(\pi^{m}(X), Y\bigr) \mid X\in R \right] + \Pr(X\in R^c) \, \mathbb{E}\left[c\bigl(\pi^{m^\star}(X), Y\bigr) \mid X\in R^c \right].\] Subtracting the cost of always using \(m^\star\) yields an improvement of at least \(\delta \cdot \Pr(X\in R) > 0\). Since the optimal policy tree \(T(\cdot\;;\hat{\Theta})\) cannot do worse than this construction, the claim follows.

Although evaluating the expectation and handling the non-convex objective in equation 5 make computing \(\hat{\Theta}\) challenging, several heuristic algorithms have been proposed to address this problem. In particular, we adopt the Optimal Trees algorithm for its strong empirical performance in similar policy selection tasks and its ability to scale to large datasets [46]. This method uses a coordinate descent approach to minimize a regularized empirical risk version of equation 5. At each iteration, the current tree structure determines the best prescription for each leaf; these prescriptions are then evaluated in the objective, and the result guides the next coordinate descent step.
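
To make the empirical version of equation 5 concrete, the sketch below exhaustively evaluates depth-one axis-aligned partitions of a cost table and assigns the empirically best policy to each side. It is a conceptual illustration of the objective only, not the coordinate-descent algorithm of [46] used in our experiments, and all names are ours.

```python
import numpy as np

def best_depth_one_policy_tree(X, C):
    """Exhaustively search one axis-aligned split and assign the empirically best policy to each
    side, where C[i, m] = c(pi^m(x_i), y_i) is a cost table. Returns (total_cost, split), with
    split = (feature, threshold, policy_left, policy_right) or None if no split beats a single leaf."""
    n, d = X.shape
    best_cost, best_split = C.sum(axis=0).min(), None     # single leaf using the best overall policy
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:                  # candidate thresholds (both sides nonempty)
            left = X[:, j] <= t
            cost = C[left].sum(axis=0).min() + C[~left].sum(axis=0).min()
            if cost < best_cost:
                best_cost = cost
                best_split = (j, t,
                              int(C[left].sum(axis=0).argmin()),
                              int(C[~left].sum(axis=0).argmin()))
    return best_cost, best_split
```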

4.3 End-to-End Pipeline: Training and Inference Phases

We now integrate the Prescribe (Section 4.1) and Select (Section 4.2) components into a unified end-to-end framework Prescribe-then-Select (PS). The process consists of two stages: a training phase, which constructs both the candidate policies and a collection of policy trees for selecting the best among them, and an inference phase, which uses the ensemble of these trees to form a meta-policy that generates context-specific prescriptions.

Figure 1: End-to-end pipeline. Left: K-fold, R-replicate training produces an ensemble of OPTs, each trained on a distinct held-out fold using its corresponding cost table. We also train each policy on the full dataset to maximize downstream performance. Right: At inference, a new context is routed through the OPT ensemble, and the majority-vote policy is applied via the refit models to produce the final decision.

4.3.0.1 Training phase.

Let \({\mathcal{D}}_{\mathrm{train}}=\{({\boldsymbol{x}}_i,{\boldsymbol{y}}_i)\}_{i=1}^N\) denote the training set, and let \(\{{\mathcal{I}}^{(1)},\dots,{\mathcal{I}}^{(K)}\}\) be a \(K\)-fold partition of the index set \([N]\). For a given fold \(k\), the training indices \({\mathcal{I}}^{(-k)} \mathrel{\vcenter{:}}= [N] \setminus {\mathcal{I}}^{(k)}\) are used to fit the candidate policies \(\pi^{1},\dots,\pi^{M}\) according to the procedures in Section 4.1. For each held-out instance \(i \in {\mathcal{I}}^{(k)}\) and each candidate policy \(m \in [M]\), we assess the quality of this policy on this observation via the realized out-of-sample cost: \(C^{(k)}_{i,m} \;=\; c\,\bigl( \pi^{m}({\boldsymbol{x}}_i),\,{\boldsymbol{y}}_i\bigr)\). Arranging these values for all \(i \in {\mathcal{I}}^{(k)}\) yields the cost table \(C^{(k)} \in \mathbb{R}^{|{\mathcal{I}}^{(k)}| \times M}\), where \[C^{(k)} = \begin{bmatrix} c(\pi^{1}({\boldsymbol{x}}_{i_1}),{\boldsymbol{y}}_{i_1}) & \cdots & c(\pi^{M}({\boldsymbol{x}}_{i_1}),{\boldsymbol{y}}_{i_1})\\ \vdots & \ddots & \vdots\\ c(\pi^{1}({\boldsymbol{x}}_{i_{|{\mathcal{I}}^{(k)}|}}),{\boldsymbol{y}}_{i_{|{\mathcal{I}}^{(k)}|}}) & \cdots & c(\pi^{M}({\boldsymbol{x}}_{i_{|{\mathcal{I}}^{(k)}|}}),{\boldsymbol{y}}_{i_{|{\mathcal{I}}^{(k)}|}}) \end{bmatrix}, \quad \{i_1,\ldots,i_{|{\mathcal{I}}^{(k)}|}\} = {\mathcal{I}}^{(k)}.\]

For each fold \(k\), we train \(R\) independent OPTs \(T^{(k,1)},\dots,T^{(k,R)}\) which take as input the pairs \(({\boldsymbol{x}}_i, \mathbf{C}^{(k)}_{i,1:M})\) for \(i\in{\mathcal{I}}^{(k)}\), solving equation 5 with \(R\) different random seeds to mitigate the fact that the learning algorithm is a stochastic heuristic and may not always find a global optimum. After repeating this procedure for all \(K\) folds, we have a total of \(K \times R\) policy trees. Finally, all candidate policies \(\pi^{1},\dots,\pi^{M}\) are refit on the full training set \({\mathcal{D}}_{\mathrm{train}}\) to maximize the performance on the inference phase. A detailed, per-fold view of the procedure appears in Figure 2, while Algorithm 6 in the appendix summarizes the complete training phase.
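
A schematic version of the cost-table construction is sketched below; make_policies and cost_fn are problem-specific placeholders (e.g., wrappers around the policies of Section 4.1), and the subsequent OPT training is performed with the IAI software and is not reproduced here.

```python
import numpy as np
from sklearn.model_selection import KFold

def build_cost_tables(X, Y, make_policies, cost_fn, K=5, seed=0):
    """For each fold k: fit the M candidate policies on the other K-1 folds, then record the
    realized out-of-sample cost C[i, m] = c(pi^m(x_i), y_i) for every held-out instance i.
    `make_policies(X, Y)` must return a list of callables x -> feasible decision z (placeholder)."""
    tables = []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        policies = make_policies(X[train_idx], Y[train_idx])
        C = np.array([[cost_fn(pi(X[i]), Y[i]) for pi in policies] for i in test_idx])
        tables.append((test_idx, C))   # X[test_idx] and C together train one batch of R OPTs
    return tables
```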

4.3.0.2 Inference phase.

Upon observing a covariate \({\boldsymbol{x}}\in{\mathcal{X}}\), each OPT \(T^{(k,r)}\) outputs a policy index \(\gamma^{(k,r)}({\boldsymbol{x}})\in [M]\), and we aggregate these outputs by majority vote \(\gamma({\boldsymbol{x}}) \;=\; \mathrm{mode}\!\left(\{\gamma^{(k,r)}({\boldsymbol{x}}):\, k\in[K],\;r\in[R]\}\right)\), breaking ties uniformly at random. The final decision made by the meta-policy is obtained by applying the selected policy \(\pi^{\gamma({\boldsymbol{x}})}({\boldsymbol{x}})\) (see Figure 1). This procedure is explained in detail in Algorithm 7 in the appendix.
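
A minimal sketch of the majority-vote aggregation follows; trees and policies are represented as callables, and all names are ours.

```python
import numpy as np

def meta_policy_decision(x, trees, policies, rng=None):
    """Route x through every tree in the ensemble (each a callable x -> policy index),
    take the most frequent index (ties broken uniformly at random), and apply that policy."""
    rng = rng or np.random.default_rng()
    votes = np.array([tree(x) for tree in trees])
    counts = np.bincount(votes, minlength=len(policies))
    winners = np.flatnonzero(counts == counts.max())
    return policies[int(rng.choice(winners))](x)
```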

Figure 2: Detailed view of one fold in the training phase. In fold \(k\), models trained on \({\mathcal{I}}^{(-k)}\) produce sample weights or point predictions, mapped to decisions \(\pi^{m}({\boldsymbol{x}}_i)\) for each \(i \in {\mathcal{I}}^{(k)}\) and \(m \in [M]\). Evaluating these using the true outcomes \({\boldsymbol{y}}_i\) yields the cost matrix \(C^{(k)}\), which, together with \(\{{\boldsymbol{x}}_i : i \in {\mathcal{I}}^{(k)}\}\), trains \(R\) Optimal Policy Trees. Across all folds, this produces \(R \times K\) selectors. We show an example with \(M=3\) policies (predictive–prescriptive kNN, predictive–prescriptive RF, and point-prediction RF).

5 Computational Experiments

In this section we evaluate Prescribe-then-Select on two benchmark tasks commonly used in contextual stochastic optimization applied to operations management. The first task is a two-stage shipment-planning problem, where initial production decisions are made at multiple facilities before demand is realized, followed by additional production and shipment decisions once demand becomes known. The second task is a single-stage newsvendor problem, a classic inventory management setting where stocking levels must be chosen in advance under uncertain demand. For both tasks, we introduce context heterogeneity by generating synthetic datasets with distinct demand segments—for example, holiday spikes versus routine weekday or weekend cycles—each following different distributions.

5.1 Multi–Product Newsvendor

The multi–product newsvendor problem is a classical model in inventory management [42], where a decision-maker must determine order quantities for multiple products before uncertain demand is realized. It is widely used in retail, wholesale, and manufacturing to balance the costs of understocking (lost sales) and overstocking (excess holding or disposal). In the multi-product setting, items compete for a shared resource such as storage space or budget, and the decision-maker seeks to allocate this capacity so as to maximize expected profit. We consider the contextual version in which the decision-maker observes covariates as side information (e.g., calendar or seasonality indicators) before ordering.

Let \({\boldsymbol{p}},{\boldsymbol{c}},{\boldsymbol{s}}\in\mathbb{R}_+^{d}\) denote prices, costs, and storage requirements for \(d\) products, which share a total storage capacity \(S>0\). The random vector \(Y\in \mathbb{R}^d\) represents the (unknown) demand, and \({\boldsymbol{x}}\) the observed covariates. The decision maker chooses stocking levels \({\boldsymbol{q}}\) to maximize expected profit, formulated as \[\begin{align} \max_{{\boldsymbol{q}}\ge 0}\; \mathbb{E}\!\big[\,{\boldsymbol{p}}^\top \min\{Y,{\boldsymbol{q}}\}-{\boldsymbol{c}}^\top{\boldsymbol{q}}\,\big|\, X={\boldsymbol{x}}\big] \quad\text{s.t.}\quad {\boldsymbol{s}}^\top{\boldsymbol{q}}\le S, \end{align}\] where the minimum \(\min\{Y,{\boldsymbol{q}}\}\) is applied component-wise.

In our experiments we set \(d=4\) and \(S=1200\). Prices \({\boldsymbol{p}}\), costs \({\boldsymbol{c}}\), and storage coefficients \({\boldsymbol{s}}\) are set such that higher-price items carry proportionally higher unit costs and larger storage requirements—while maintaining positive margins (\({\boldsymbol{p}}>{\boldsymbol{c}}\)). This yields comparable profit magnitudes across products and nontrivial trade-offs under the knapsack constraint \({\boldsymbol{s}}^\top{\boldsymbol{q}}\le S\). The exact parameter values for prices, costs, and storage coefficients are provided in Appendix 8. Moreover, since \(\mathbb{E}\!\big[\,{\boldsymbol{p}}^\top \min\{Y,{\boldsymbol{q}}\} \,\big|\, X={\boldsymbol{x}}\big] =\sum_{i=1}^d p_i\,\mathbb{E}\!\big[\min\{Y_i,q_i\}\mid X={\boldsymbol{x}}\big]\) (by linearity of the expectation), the implementation of both the PP and PPt policies for this problem estimates the marginal distributions \(f_{Y_j\mid X}(\cdot\mid {\boldsymbol{x}})\) for all \(j\in[d]\) instead of the joint distribution \(f_{Y\mid X}(\cdot\mid {\boldsymbol{x}})\).
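
For reference, the weighted-SAA version of this problem (which covers SAA with uniform weights and the PP policies with covariate-dependent weights) can be written as a linear program by introducing auxiliary sales variables. The sketch below is an illustrative scipy formulation under our own naming; the programs in our experiments are solved with Gurobi through Julia (Section 5.3).

```python
import numpy as np
from scipy.optimize import linprog
from scipy.sparse import csr_matrix, eye, hstack, vstack, kron

def newsvendor_weighted_saa(w, Y, p, c, s, S):
    """Weighted-SAA multi-product newsvendor:
        max_q  sum_i w_i p'min(Y_i, q) - c'q   s.t.  s'q <= S,  q >= 0,
    linearized with auxiliary sales m_{ij} <= q_j and m_{ij} <= Y_{ij}. Returns q."""
    N, d = Y.shape
    # Decision vector: [q (d entries), m (N*d entries, scenario-major)]
    obj = np.concatenate([c, -(w[:, None] * p[None, :]).ravel()])   # linprog minimizes
    cap = csr_matrix(np.concatenate([s, np.zeros(N * d)])[None, :]) # s'q <= S
    link = hstack([kron(np.ones((N, 1)), -eye(d)), eye(N * d)])     # m_{ij} - q_j <= 0
    res = linprog(obj,
                  A_ub=vstack([cap, link], format="csr"),
                  b_ub=np.concatenate([[S], np.zeros(N * d)]),
                  bounds=[(0, None)] * d + [(0, float(y)) for y in Y.ravel()],
                  method="highs")
    return res.x[:d]

# SAA corresponds to uniform weights w_i = 1/N; the PP policies pass covariate-dependent weights.
```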

5.2 Shipment Planning

Shipment planning is a relevant problem for manufacturers, distributors, and logistics providers seeking to minimize transportation and production costs while meeting demand requirements [18]. In a two–stage setting, the planner commits to an initial production plan before demand is observed, and then decides on additional production and shipments once demand becomes known. We again assume the decision-maker has access to contextual information prior to making decisions.

Formally, we consider \(F\) production facilities, \(L\) demand locations, and covariates \({\boldsymbol{x}}\in{\mathcal{X}}\) available as side information. In the first stage, before demand is observed, the planner chooses \(\mathbf{u}_{1}\in\mathbb{R}^{F}_{+}\) with unit cost \(p_{1}\). After demand \(Y\in\mathbb{R}^{L}_{+}\) is realized, additional items \({\boldsymbol{e}}\in\mathbb{R}^{F}_{+}\) at price \(p_{2}>p_{1}\) are produced, and shipments \(u_{2, f\ell}\) from facility \(f\) to location \(\ell\) at per–unit costs \(c_{f\ell}\ge 0\) must be decided. Each product yields per-unit revenue \(a>0\) and full demand must be satisfied. The cost minimization CSO problem can then be written as \[\begin{align} &\min_{\mathbf{u}_1\in\mathbb{R}_+^{F}}\; p_{1}\,\mathbf{1}_F^\top\mathbf{u}_1 \;+\; \mathbb{E}\!\left[\,Q(\mathbf{u}_1;Y)-a\,\mathbf{1}_L^\top Y\mid X={\boldsymbol{x}}\right],\quad \text{where}\\ &Q(\mathbf{u}_1; {\boldsymbol{y}})= \min_{\mathbf{U}_2\in\mathbb{R}_+^{F\times L},\;\mathbf{e}\in\mathbb{R}_+^{F}} \;\sum_{f=1}^{F}\sum_{\ell=1}^{L} c_{f\ell}\,u_{2,f\ell} + p_{2}\,\mathbf{1}_F^\top\mathbf{e} \quad\text{s.t.}\quad \mathbf{U}_2^\top\mathbf{1}_F \ge {\boldsymbol{y}},\;\; \mathbf{e} \ge \mathbf{U}_2\mathbf{1}_L - \mathbf{u}_1. \end{align}\] In our experiments, we set \(F=L=4\), \(p_{1}=5\), \(p_{2}=10\), \(a=90\), and draw fixed shipping costs via \(c_{f\ell}=20+2(f-1)+\xi_{f\ell}\) with \(\xi_{f\ell}\sim\mathrm{Unif}[0,3]\).
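
The second-stage problem \(Q(\mathbf{u}_1;{\boldsymbol{y}})\) is itself a linear program; the sketch below gives an illustrative scipy formulation (our experiments solve it with Gurobi through Julia, and the function name is ours).

```python
import numpy as np
from scipy.optimize import linprog
from scipy.sparse import csr_matrix, eye, hstack, vstack, kron

def second_stage_cost(u1, y, C_ship, p2):
    """Q(u1; y): cheapest additional production e and shipments U2 that satisfy demand y,
    given first-stage production u1 and shipping costs C_ship (F x L)."""
    F, L = C_ship.shape
    obj = np.concatenate([C_ship.ravel(), p2 * np.ones(F)])                  # [U2 (row-major), e]
    demand = hstack([-kron(np.ones((1, F)), eye(L)), csr_matrix((L, F))])    # -sum_f U2_{f,l} <= -y_l
    recourse = hstack([kron(eye(F), np.ones((1, L))), -eye(F)])              # sum_l U2_{f,l} - e_f <= u1_f
    res = linprog(obj, A_ub=vstack([demand, recourse], format="csr"),
                  b_ub=np.concatenate([-np.asarray(y, float), np.asarray(u1, float)]),
                  bounds=(0, None), method="highs")
    return res.fun
```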

5.3 Implementation Details

Data Generation: We simulate demand data with explicit covariate-driven heterogeneity to create distinct demand regimes. In the multi-product newsvendor setting, regimes capture patterns such as holiday-driven spikes, smooth seasonal variation with weekday effects, and short-term disruptions. In the shipment-planning setting, regimes reflect operational patterns such as contracted replenishment periods, event-driven surges, and routine demand with calendar effects. In both cases, regime activation is determined by calendar covariates, and noise is added to introduce variability across products or locations. Full specifications, including functional forms, parameter values, and regime definitions, are provided in Appendix 8 and 9.

Software Implementation: Our implementation combines Python for the machine-learning components with Julia for the optimization layer. The main packages are Gurobi (v12.0.1) for mathematical programming, the Optimal Policy Tree (OPT) framework from Interpretable AI (IAI, v3.2.2) for model selection, and, in Python, scikit-learn (v1.6.0) and pandas (v2.2.3). Gurobi is accessed through Julia, while preprocessing and evaluation pipelines run in Python.

Implementation Setup: We use \(K=5\) folds and \(R=10\) repetitions, yielding \(K\times R=50\) trees per ensemble. The PS selector chooses among the base policies \(\{\mathrm{SAA}, \mathrm{\small PPt}\text{–RF}, \mathrm{\small PP}\text{–RF}, \mathrm{\small PPt}\text{–kNN}, \mathrm{\small PP}\text{–kNN}\}\). We evaluate training sizes \(N \in \{250, 500, 750, 1000, 1250, 1500, 2000, 3000, 5000\}\); for each \(N\) we draw 100 independent training samples and assess strictly out-of-sample performance on a fixed test horizon. For the prediction models, we use kNN with \(5\) neighbors and RF with number of trees \(B = 5\).

Evaluation: We evaluate policies on data not used for training or meta-policy construction. Let \({\mathcal{I}}^{\mathrm{test}}\) be the index set of held-out test points and \({\mathcal{D}}_{\mathrm{test}} \mathrel{\vcenter{:}}= \{({\boldsymbol{x}}_i,{\boldsymbol{y}}_i) : i \in {\mathcal{I}}^{\mathrm{test}}\}\). To evaluate a policy \(\pi\), we compute its average test cost \(cost(\pi) = \frac{1}{|{\mathcal{I}}^{\mathrm{test}}|} \sum_{i \in {\mathcal{I}}^{\mathrm{test}}} c\!\left(\pi({\boldsymbol{x}}_i),\,{\boldsymbol{y}}_i\right)\). In all our results we report this quantity for the candidate policies \(\{\pi^m\}_{m = 1}^M\) as well as for the meta-policy \(\pi^{\gamma(\cdot)}\). To quantify variability as a function of the training-set size \(N\), we draw \(S=100\) independent training samples for each of the sizes considered and run the full pipeline in Section 4.3 end to end, yielding base policies \(\{\pi^{m}_{N,s}\}_{m=1}^M\) and meta-policy \(\pi^{\gamma(\cdot)}_{N,s}\) for \(s\in[S]\). For policy \(m\) in sample \(s\) with training size \(N\), its average test cost is then \(cost(\pi^m_{N, s})\). We estimate the mean of \(\{cost(\pi^m_{N, s})\}_{s=1}^S\) over samples \(s\) and construct a two-sided \(100(1-\alpha)\%\) confidence interval (CI) using the Student-\(t\) approximation as: \[\begin{align} \hat{\mu}^m_{N} =\frac{1}{S}\sum_{s=1}^S cost(\pi^m_{N, s}), \qquad \hat{\sigma}^m_{N} =\sqrt{\frac{1}{S-1}\sum_{s=1}^S \!\bigl(cost(\pi^m_{N, s})-\hat{\mu}^m_{N}\bigr)^2}, \qquad \mathrm{CI}^m_{N} =\hat{\mu}^m_{N}\;\pm\;t_{1-\alpha/2,\;S-1}\;\frac{\hat{\sigma}^m_{N}}{\sqrt{S}}. \end{align}\]
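
The confidence-interval computation is standard; a minimal sketch (ours, using scipy's Student-\(t\) quantile) is shown below.

```python
import numpy as np
from scipy import stats

def mean_and_t_ci(costs, alpha=0.05):
    """Sample mean and two-sided 100(1-alpha)% Student-t confidence interval across the S runs."""
    costs = np.asarray(costs, dtype=float)
    S = costs.size
    mu, sd = costs.mean(), costs.std(ddof=1)
    half = stats.t.ppf(1 - alpha / 2, df=S - 1) * sd / np.sqrt(S)
    return mu, (mu - half, mu + half)
```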

5.4 Results

In this section we report performance on the held-out test set as described in Section 5.3 to address three main research questions. Unless stated otherwise, each curve in the plots shows average profit across repeated samples with two-sided 95% \(t\)-based confidence intervals. For consistency across tasks, we report profit as the negative of total net cost for shipment planning and as net revenue for the newsvendor problem, such that higher values are better in all figures.

5.4.1 Segment-wise heterogeneity

RQ1: Do the candidate CSO policies exhibit heterogeneous out-of-sample performance across the covariate space, or does a single policy dominate the rest? To answer this question, we compute out-of-sample performance as a function of training size for five candidate policies (SAA, PP–kNN, PPt–kNN, PP–RF, and PPt–RF) and three data regimes, and we analyze the results below.

Figure 3: Newsvendor: segment-wise profit vs. training size (average and 95% CIs shown).

5.4.1.1 Newsvendor.

As shown in Figure 3, we indeed observe heterogeneity across segments:

  • Segment A (small variance for holiday-influenced products). For small \(N\), PPt–RF attains the highest profit, consistent with a low-variance setting where point prediction works well. As \(N\) grows, PP–RF matches PPt–RF, suggesting improvements in the local distribution estimates of the PP method.

  • Segment B (medium variance for products with seasonal smooth demand). PP–kNN is best for small \(N\), which is expected as the averaging effect tends to work well for continuous functions. As \(N\) increases (e.g., 3000–5000), PP–RF becomes the best method.

  • Segment C (high-variance products with discontinuous summer demand). PP–RF dominates for all training sizes. This is expected, as the kNN averaging effect fails to capture abrupt changes and point-prediction variants can be very inaccurate when the prediction falls on the wrong side of the demand discontinuity.

Figure 4: Shipment: segment-wise mean test profit vs. training size (95% CIs; profit = −net cost).

5.4.1.2 Shipment Planning.

Heterogeneity is also observed in this task, as shown in Figure 4:

  • Segments A/B (small-variance products and high-variance products with random discontinuities). PP–kNN yields higher profit across training sizes. Forests were observed to place spurious calendar splits in these segments, increasing leaf variance and degrading the weights used by the PP method, whereas kNN neighborhoods remained fairly stable.

  • Segment C (medium-variance for products with seasonal smooth demand). PP–RF again delivers the best performance, as the problem involves a complex continuous demand function. By contrast, kNN fails to capture the segment structure, lacking the flexibility of RF to adapt effectively.

Across both tasks the answer to RQ1 is affirmative: test performance is heterogeneous, and no single policy dominates uniformly. Which candidate policy is best depends on the segment and the training size. With limited data, the best policies differ by context (e.g., PPt–RF in low-noise segments; PP–kNN where local averaging reduces variance), whereas in segments with abrupt regime changes PP–RF is consistently superior at all \(N\). As \(N\) grows, PP–RF closes the gap with point-prediction in simpler segments and usually overtakes other methods. We also observe cross-segment spillovers: because models are trained on pooled data, noise introduced by other segments (e.g., latent holiday/event effects) can make tree leaves noisier and depress RF performance even in otherwise regular segments, while kNN’s locality is less sensitive to such noise.

5.4.2 Benefit of Prescribe-then-Select

RQ2: Does Prescribe-then-Select outperform the best single policy, is the improvement statistically significant, and how does it vary with training size? To answer this question, we conduct experiments on both benchmark problems, comparing PS against the best single-policy baseline across a range of training sizes. Figures 5 (a) and 5 (b) show average profit as a function of the training size with two-sided 95% \(t\)-based confidence intervals for PS and all candidate policies. Illustrative examples of the trained OPTs are shown in Appendix 10.


Figure 5: Profit vs. training size for PS and the best single candidate policy in the (left) newsvendor and (right) shipment tasks. Points indicate averages and vertical lines indicate 95% confidence intervals. In the shipment task, profit is reported as negative cost. a — Newsvendor. b — Shipment Planning.

5.4.2.1 Newsvendor.

For small samples (\(N\!\in\![750,1500]\)), PS is statistically significantly better than all individual policies (non-overlapping or barely overlapping CIs at multiple \(N\)), reflecting its ability to route contexts toward PPt–RF (best in Segment A for small \(N\)) and PP–kNN (best in segment B for small \(N\)). The magnitude of the gain in this regime is modest in absolute terms but comparable to the improvement obtained when moving from point-prediction to predictive-prescriptive baselines. As data size increases (\(N\!>\!1500\)), PP–RF dominates and PS rapidly converges to PP–RF (by \(N\approx\!2000\)).

5.4.2.2 Shipment Planning.

PS gains become apparent once the training set exceeds \(N\!\approx\!1000\), mirroring the pattern observed in Section 5.4.1: Segments A/B favor PP–kNN, whereas Segment C favors PP–RF in this data size regime. Improvements strengthen at \(N\!\in\!\{3000,5000\}\): at \(N=3000\), the 95% CIs become disjoint, and at \(N=5000\) they slightly overlap. Since the problem is two-stage with costly recourse (\(p_{2}\!\gg\!p_{1}\)), choosing the wrong policy is highly expensive, which dampens average gains even as PS increasingly routes A/B to PP–kNN and C to PP–RF.

Across both tasks, PS yields statistically significant gains by learning partitions of the covariate space from cross-validation alone and finding the best candidate policy for each region. Notably in the shipment setting, policies that are very weak on average and would be disregarded in general frameworks—such as PP–kNN and PPt–kNN—become useful: PS deploys them in segments where they excel (A/B) and defaults to PP–RF elsewhere (e.g., C), thereby improving overall performance.

5.4.3 Uniform dominance and convergence

RQ3: In the absence of substantial segment heterogeneity, does PS effectively revert to the best base policy, or does the selection step risk choosing an inferior policy? In other words, we want to evaluate whether, in regimes where a single base policy is uniformly superior across the covariate space, PS asymptotically selects that policy and thus incurs negligible regret relative to the best single policy. In the newsvendor task (Figure 5 (a)), for \(N \gtrsim 2000\) the mean profits of PS and PP–RF are statistically indistinguishable within two-sided 95% \(t\)-based confidence intervals, indicating effective convergence of PS to the dominant policy. Because PS is trained solely via cross-validated folds and never observes test data, this demonstrates that PS is a low-risk default: when uniform dominance emerges, PS matches the best single policy, while in heterogeneous cases it retains the performance gains.

5.4.4 Summary of findings and computational considerations

Our results show that (i) out-of-sample performance is segment-dependent, with no single candidate policy dominating the others; (ii) PS delivers statistically significant improvements over all single policies in heterogeneous regimes; and (iii) when a single base policy becomes uniformly dominant, PS converges to it, incurring negligible regret. Relative to simply training each base policy and selecting the best overall, PS incurs additional cost from the cross-validation stage. For each type of base policy (e.g., kNN, RF), the machine learning model is trained \(K\) times, once per fold, while the associated optimization model must be solved for all \(M\) variants within each fold, leading to \(K \times M\) optimization solves in total.

6 Conclusion

We introduced Prescribe-then-Select, a modular framework for adaptive policy selection in contextual stochastic optimization. PS generates a diverse set of candidate policies and learns a selection model that matches each context to its most effective policy, enabling flexible integration of diverse policies while preserving constraint feasibility. Our experiments on two benchmark problems showed that PS consistently exploits heterogeneity in the covariate space to improve upon the best single policy, while converging to the dominant policy in homogeneous regimes. These findings position PS as a practical, low-risk approach for decision-making environments where diverse, high-quality prescriptive models are already available. Future work includes extending PS to multi-stage settings, exploring other selection models beyond policy trees, and evaluating performance in large-scale, real-world applications with high-dimensional and partially observed covariates.

7 Algorithms

Figure 6: Training phase: fold-wise construction of cost matrices and training of an OPT ensemble. Each \(C^{(k)}\) contains realized out-of-sample costs from prescriptions \(\pi^{m}({\boldsymbol{x}}_i)\) (fit on \({\mathcal{I}}^{(-k)}\)) and true outcomes \({\boldsymbol{y}}_i\) for \(i\in{\mathcal{I}}^{(k)}\).
Figure 7: Decision phase: majority-vote selection via \(\gamma^{(k,r)}({\boldsymbol{x}})\), followed by applying the chosen refit policy.

8 Multi-Product Newsvendor: Experimental Details

8.1 Data Generation

We simulate demands for \(d_y\) products over time \(t\) using the following calendar covariates: day of week \(dow_t\in[6]\), day of month \(dom_t\in [31]\), month \(M_t\in[12]\), day of year \(\mathrm{doy}_t\in[366]\), weekend indicator \(\omega_t=\mathbb{1}\{dow_t\ge 5\}\), and holiday indicator \(h_t\sim\mathrm{Bernoulli}(p_{\mathrm{hol}})\) with \(p_{\mathrm{hol}}=0.1\). We define three covariate regimes: Segment A models holiday–sensitive products (e.g., gifts); Segment B features smooth seasonality with weekday modulation; Segment C introduces abrupt midsummer weekday jumps, representing short promotions or disruptions. In Segment C, \(s_{M_t}\) is a month–specific offset: \(s_7=-7\) (July) and \(s_8=+8\) (August), producing discontinuous changes without trend.

Realized demands on day \(t\) are \(Y_{jt} = \max\{0,\, \mu_{jt} + \varepsilon_{jt}\}\), with noise \(\varepsilon_{jt} \sim \mathcal{N}(0,\sigma_A^2) A_{jt} + \mathcal{N}(0,\sigma_B^2) B_{jt} + \mathcal{N}(0,\sigma_C^2) C_t\), where \(A_{jt}\), \(B_{jt}\), and \(C_t\) are binary indicators for product \(j\) at time \(t\) being in Segment A, Segment B, or Segment C, respectively. Noise is independent across products on any given day and segment. The specification for each regime is summarized in Table 1.


Table 1: Segment specification for multi–product newsvendor demand. Mean \(\mu_{jt}\) and noise scale depend on the active segment. Baseline \(B=30\), holiday lifts \(\alpha_0=8\), \(\alpha_1=5\), and step adjustments \(s_{M_t}\in\{-7,+8\}\) for \(M_t\in\{7,8\}\).

| | Segment A (holiday–sensitive) | Segment B (seasonal/weekday) | Segment C (summer jump) |
| --- | --- | --- | --- |
| Activation | \(A_{jt} = \mathbb{1}\{h_t = 1,\, j \in \{0,1\}\}\) | \(B_{jt} = 1 - \max(A_{jt}, C_t)\) | \(C_t = \mathbb{1}\{M_t \in \{7,8\},\;dow_t \le 3\}\) |
| Business rationale | Holiday–driven lift for gift-suitable items | Seasonal cycle with weekday variation | Short promotions or disruptions in midsummer |
| Qualitative pattern | Sharp, low–variance holiday spikes | Smooth annual wave modulated by weekdays | Large weekday jumps in July/August |
| Mean \(\mu_{jt}\) | \(B+\alpha_j\) | \(B+6\sin\!\tfrac{2\pi M_t}{12}\cdot\tfrac{dow_t+1}{5}\cdot(1+0.15\,j)\) | \(B+s_{M_t}+4j\) |
| Noise scale | \(\sigma_A=0.5\) | \(\sigma_B=3.0\) | \(\sigma_C=4.0\) |
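
For intuition, the sketch below re-implements the Table 1 generator under our own simplifying conventions (0-indexed day of week, an approximate month index via pandas dates, and Segment A taking precedence when A and C are both active); it is illustrative only and not the released code.

```python
import numpy as np
import pandas as pd

def simulate_newsvendor_demand(T=730, d=4, p_hol=0.1, seed=0):
    """Sketch of the Table 1 regimes: holiday lift for products 0-1 (A),
    seasonal/weekday wave (B), midsummer weekday jumps (C)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2022-01-01", periods=T, freq="D")
    dow, month = dates.dayofweek.values, dates.month.values   # dow in {0,...,6}, Monday = 0
    holiday = rng.random(T) < p_hol
    B0, alpha = 30.0, np.array([8.0, 5.0, 0.0, 0.0])
    Y = np.zeros((T, d))
    for t in range(T):
        seg_c = month[t] in (7, 8) and dow[t] <= 3
        for j in range(d):
            if holiday[t] and j in (0, 1):                     # Segment A: holiday spike, low noise
                mu, sig = B0 + alpha[j], 0.5
            elif seg_c:                                        # Segment C: midsummer weekday jump
                mu, sig = B0 + (-7.0 if month[t] == 7 else 8.0) + 4.0 * j, 4.0
            else:                                              # Segment B: smooth seasonality x weekday
                mu = B0 + 6.0 * np.sin(2 * np.pi * month[t] / 12) * (dow[t] + 1) / 5 * (1 + 0.15 * j)
                sig = 3.0
            Y[t, j] = max(0.0, mu + rng.normal(0.0, sig))
    return pd.DataFrame(Y, index=dates, columns=[f"product_{j}" for j in range(d)])
```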

8.2 Parameters

Table 2 reports the selling prices \(p_i\), procurement costs \(c_i\), and storage coefficients \(s_i\) used in our experiments.

Table 2: Selling prices, procurement costs, and storage coefficients for the products used in the multi–product newsvendor experiments.
| Product | Price (\(p_i\)) | Cost (\(c_i\)) | Storage (\(s_i\)) |
| --- | --- | --- | --- |
| Product 0 | 500.0 | 350.0 | 3.0 |
| Product 1 | 800.0 | 600.0 | 15.0 |
| Product 2 | 50.0 | 30.0 | 1.5 |
| Product 3 | 10.0 | 6.0 | 0.5 |

9 Shipment Planning: Experimental Details

9.1 Data Generation

For the shipment–planning setting, we create demand regimes distinct from the multi–product newsvendor case, while maintaining realistic patterns. This yields a complementary experiment with different sources of heterogeneity. We simulate demands at the \(L\) locations using the same calendar covariates \((dow_t, dom_t, M_t, \mathrm{doy}_t, \omega_t, h_t)\), together with a latent driver \(H_t\sim\mathcal{N}(0,10^2)\) that is not included as a feature in the model. Exactly one of three segments (A/B/C) is active per day, summarized in Table 3.

Each location \(\ell\) has offset \(\delta_\ell=\sin\!\big(\frac{2\pi(\ell-1)}{L}\big)\), and realized demands are \(Y_{\ell t}=\max\{0,\,\mu_t+\delta_\ell+\varepsilon_{\ell t}\}\), with independent noise \(\varepsilon_{\ell t}\sim\mathcal{N}(0,\sigma_A^2)A_t+\mathcal{N}(0,\sigma_B^2)B_t+\mathcal{N}(0,\sigma_C^2)C_t\).


Table 3: Segment specification for shipment demand. One segment is active per day \(t\); \(\mu_t\) and noise scale depend on the active segment. Baseline \(B=30\).

| | Segment A (early–month) | Segment B (holiday/event) | Segment C (routine) |
| --- | --- | --- | --- |
| Activation | \(A_t=\mathbb{1}\{dom_t\le 8,\,M_t\le 4\}\) | \(B_t=\mathbb{1}\{h_t=1\}\) | \(C_t=1-\max(A_t,B_t)\) |
| Business rationale | Contracted early–month replenishment | Event–driven surges | Regular operations |
| Qualitative pattern | Flat mean, low variance | Short, high–variance spikes | Gradual trend with weekday/weekend shifts |
| Mean \(\mu_t\) | \(B+25\) | \(B+5+20\,H_t\) | \(B+0.08\sqrt{\mathrm{doy}_t} + 4(dow_t)^2 + 10\,\omega_t\) |
| Noise scale | \(\sigma_A=0.3\) | \(\sigma_B=4.0\) | \(\sigma_C=1.2\) |

10 Illustrative Trees from Ensemble


Figure 8: Illustrative OPT selectors for the multi–product newsvendor. Two randomly chosen trees from the PS ensemble (\(K=5\) folds, \(R=10\) repetitions, \(K\times R=50\) trees) on one sample with \(N=1000\). Left: the tree approximates regime C (summer jumps) using day_of_week \(<5.5\) (weekdays) and month \(\ge 7\) (July–August); it prescribes PPt–RF in that region, which is the second-best model. For the remaining cases with day_of_week \(<5.5\), it routes to kNN, which is best for regime B (most of the remaining data). Right: the tree partially isolates regime A by splitting on is_holiday and prescribing PPt–RF on holidays, which is best for regime A; non-holiday days go to kNN, which is best in the larger regime B. Each tree is trained on a cross-validation fold and is imperfect on its own, but the ensemble (majority vote over 50 trees) aggregates these partial signals into an effective meta-policy. a — Tree A. b — Tree B


Figure 9: Illustrative OPT selectors for the shipment–planning task. Two randomly chosen trees from the PS ensemble (\(K=5\) folds, \(R=10\) repetitions, \(K\times R=50\) trees) on a sample with \(N=3000\). Left: the tree uses day_of_year \(\le 184.5\) (roughly the first half of the year) and day_of_month \(<4.5\) (very early month) to approximate regime A, prescribing PP–kNN in that region (the best for regime A). For the remaining contexts it prescribes PP–RF or PPt–RF, consistent with regime C (routine), but it does not isolate regime B. Right: the tree splits on is_holiday, cleanly isolating regime B (holiday/event) and prescribing PP–kNN there (the best for regime B), while routing non-holiday days to RF, which aligns with the larger regime C. Each tree is trained on a cross-validation fold and is imperfect on its own; the ensemble (majority vote over 50 trees) aggregates these partial signals to recover an approximation to the regime structure. a — Tree A. b — Tree B

References

[1]
John R. Birge and François Louveaux. Introduction to Stochastic Programming. Springer, New York, 2nd edition, 2011. ISBN 978-1-4614-0236-8.
[2]
Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochastic programming: modeling and theory. SIAM, 2021.
[3]
Antonio J Conejo, Miguel Carrión, Juan M Morales, et al. Decision making under uncertainty in electricity markets, volume 1. Springer, 2010.
[4]
Lawrence V Snyder. Facility location under uncertainty: a review. IIE transactions, 38 (7): 547–564, 2006.
[5]
Lawrence V Snyder and Zuo-Jun Max Shen. Fundamentals of supply chain theory. John Wiley & Sons, 2019.
[6]
Alexander Shapiro. Monte Carlo sampling methods. Handbooks in operations research and management science, 10: 353–425, 2003.
[7]
Alexander Shapiro and Arkadi Nemirovski. On complexity of stochastic programming problems. In Continuous optimization: Current trends and modern applications, pp. 111–146. Springer, 2005.
[8]
Anton J Kleywegt, Alexander Shapiro, and Tito Homem-de Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on optimization, 12 (2): 479–502, 2002.
[9]
Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.
[10]
Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19 (4): 1574–1609, 2009.
[11]
Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. Robust sample average approximation. Mathematical Programming, 171 (1): 217–282, 2018.
[12]
Aharon Ben-Tal, Arkadi Nemirovski, and Laurent El Ghaoui. Robust Optimization. Princeton University Press, 2009.
[13]
Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. Data-driven robust optimization. Mathematical Programming, 167 (2): 235–292, 2018.
[14]
Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations research, 58 (3): 595–612, 2010.
[15]
Giuseppe Carlo Calafiore and L El Ghaoui. On distributionally robust chance-constrained linear programs. Journal of Optimization Theory and Applications, 130 (1): 1–22, 2006.
[16]
Utsav Sadana, Abhilash Chenreddy, Erick Delage, Alexandre Forel, Emma Frejinger, and Thibaut Vidal. A survey of contextual optimization methods for decision-making under uncertainty. European Journal of Operational Research, 320: 271–289, 2025.
[17]
Gah-Yi Ban and Cynthia Rudin. The big data newsvendor: Practical insights from machine learning. Operations Research, 67 (1): 90–108, 2019.
[18]
Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. Management Science, 66 (3): 1025–1044, 2020.
[19]
Thierry Bazier-Matte and Erick Delage. Generalization bounds for regularized portfolio selection with market side information. INFOR: Information Systems and Operational Research, 58 (2): 374–401, 2020.
[20]
Dimitris Bertsimas and Nihal Koduri. Data-driven optimization: A reproducing kernel Hilbert space approach. Operations Research, 70 (1): 454–471, 2022.
[21]
Pascal M Notz and Richard Pibernik. Prescriptive analytics for flexible capacity management. Management Science, 68 (3): 1756–1775, 2022.
[22]
Dimitris Bertsimas and Kimberly Villalobos Carballo. Multistage stochastic optimization via kernels. arXiv preprint arXiv:2303.06515, 2023.
[23]
Afshin Oroojlooyjadid, Lawrence V Snyder, and Martin Takáč. Applying deep learning to the newsvendor problem. IISE Transactions, 52 (4): 444–463, 2020.
[24]
Jakob Huber, Sebastian Müller, Moritz Fleischmann, and Heiner Stuckenschmidt. A data-driven newsvendor problem: From data to decision. European Journal of Operational Research, 278 (3): 904–915, 2019.
[25]
Yanfei Zhang and Junbin Gao. Assessing the performance of deep learning algorithms for newsvendor problem. In International conference on neural information processing, pp. 912–921. Springer, 2017.
[26]
Rohit Kannan, Güzin Bayraksan, and James R Luedtke. Residuals-based distributionally robust optimization with covariate information. Mathematical Programming, 207 (1): 369–425, 2024.
[27]
Yunxiao Deng and Suvrajeet Sen. Predictive stochastic programming. Computational Management Science, 19 (1): 65–98, 2022.
[28]
Yoshua Bengio. Using a financial training criterion rather than a prediction criterion. International journal of neural systems, 8 (04): 433–443, 1997.
[29]
Priya Donti, Brandon Amos, and J Zico Kolter. Task-based end-to-end model learning in stochastic optimization. Advances in neural information processing systems, 30, 2017.
[30]
Nathan Kallus and Xiaojie Mao. Stochastic optimization forests. Management Science, 69 (4): 1975–1994, 2023.
[31]
Meng Qi, Paul Grigas, and Zuo-Jun Max Shen. Integrated conditional estimation-optimization. arXiv preprint arXiv:2110.12351, 2021.
[32]
Adam N Elmachtoub and Paul Grigas. Smart “predict, then optimize.” Management Science, 68 (1): 9–26, 2022.
[33]
Marco A Wiering and Hado Van Hasselt. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38 (4): 930–936, 2008.
[34]
Siegmund Duell and Steffen Udluft. Ensembles for continuous actions in reinforcement learning. In ESANN, 2013.
[35]
Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. Multiple model-based reinforcement learning. Neural computation, 14 (6): 1347–1369, 2002.
[36]
Kazuyuki Samejima, Kenji Doya, and Mitsuo Kawato. Inter-module credit assignment in modular reinforcement learning. Neural Networks, 16 (7): 985–994, 2003.
[37]
Harm Van Seijen, Bram Bakker, Leon Kester, et al. Switching between different state representations in reinforcement learning. In Proceedings of the 26th IASTED International Conference on Artificial Intelligence and Applications, pp. 226–231, 2008.
[38]
Michael Gimelfarb, Scott Sanner, and Chi-Guhn Lee. Contextual policy transfer in reinforcement learning domains via deep mixtures-of-experts. In Uncertainty in Artificial Intelligence, pp. 1787–1797. PMLR, 2021.
[39]
Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.
[40]
Anirudh Goyal, Shagun Sodhani, Jonathan Binas, Xue Bin Peng, Sergey Levine, and Yoshua Bengio. Reinforcement learning with competitive ensembles of information-constrained primitives. arXiv preprint arXiv:1906.10667, 2019.
[41]
Xiangyu Cui, Nicholas G. Hall, Yun Shi, and Tianyuan Su. Collective wisdom: Policy averaging with an application to the newsvendor problem, 2025. URL https://arxiv.org/abs/2503.17638.
[42]
Moutaz Khouja. The single-period (news-vendor) problem: literature review and suggestions for future research. Omega, 27 (5): 537–553, 1999.
[43]
T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13 (1): 21–27, 1967.
[44]
Leo Breiman. Random forests. Machine learning, 45 (1): 5–32, 2001.
[45]
D. Bertsimas and J. Dunn. Machine Learning Under a Modern Optimization Lens. Dynamic Ideas LLC, 2019. ISBN 9781733788502. URL https://books.google.com/books?id=g3ZWygEACAAJ.
[46]
Maxime Amram, Jack Dunn, and Ying Daisy Zhuo. Optimal policy trees. Machine Learning, 111 (7): 2741–2768, 2022.