August 31, 2025
We consider off-policy evaluation (OPE) in contextual bandits with a finite action space. Inverse Propensity Score (IPS) weighting is a widely used method for OPE due to its unbiasedness, but it suffers from significant variance when the action space is large or when some parts of the context-action space are underexplored. Recently introduced Marginalized IPS (MIPS) estimators mitigate this issue by leveraging action embeddings. However, these embeddings are not chosen to minimize the mean squared error (MSE) of the estimator and do not take context information into account. To address these limitations, we introduce Context-Action Embedding Learning for MIPS, or CAEL-MIPS, which learns context-action embeddings from offline data to minimize the MSE of the MIPS estimator. Building on a theoretical analysis of the bias and variance of MIPS, we present an MSE-minimizing objective for CAEL-MIPS. In empirical studies on a synthetic dataset and a real-world dataset, we demonstrate that our estimator outperforms the baselines in terms of MSE.
Off-policy evaluation (OPE) has been widely applied to real-world applications such as recommendation systems [1], digital marketing [2], and medical treatments [3]. OPE evaluates the performance of a target policy directly from offline data collected by a different behavior policy. This offline evaluation alleviates the need to run A/B tests in the underlying environment, which can be costly in practice [4]–[8].
For OPE in contextual bandits, the main estimators are Inverse Propensity Score (IPS) weighting, the direct method (DM), and doubly robust (DR) methods [5], [9], [10]. The main challenge comes from the mismatch between the target policy and the sampling distribution of actions in the data; IPS corrects the sampling distribution to that of the target policy by reweighting with IPS weights. While IPS is an unbiased estimator [9], it suffers from high variance when there is a large mismatch between the target and behavior policies, or when the number of actions is large [6]. DM fits a reward model by supervised learning and uses it to estimate the reward of actions sampled from the target policy. Since DM relies on the model's output, it has low variance, but it can suffer severe bias due to the estimation error of the model [4]. The doubly robust (DR) estimator combines IPS and DM to leverage their respective strengths [10]. In this work, we focus on reducing the variance of IPS.
To reduce the high variance of IPS, Marginalized IPS (MIPS) methods that use action embeddings were introduced in [6]. MIPS methods use action embeddings from a space smaller than the action space to substantially lower variance while inducing only a small bias. These methods either assume that the given action embeddings are similar when their corresponding rewards are similar [6], or attempt to learn such action embeddings [11]. However, because these methods learn embeddings for actions without conditioning on contexts, two actions can receive similar embeddings even when their rewards differ depending on the context, which can be critical for OPE in medical treatments. For example, adrenaline can be life-saving or lethal depending on the condition of a patient [12], but these methods learn the same embedding for the same adrenaline dosage regardless of the patient's condition (context). Furthermore, the embeddings used in these methods do not minimize the MSE of MIPS.
In this work, we propose learning context-action embeddings for MIPS by minimizing its MSE. In our theoretical studies, we first derive upper bounds on the bias and variance (and hence the MSE) of MIPS, and propose an embedding learning objective based on these bounds. The proposed objective learns embeddings that reduce variance by smoothing out the IPS weights across the action space, while inducing a smaller bias compared to the previous work of [11]. We also show an information-theoretic connection to our objective. In our empirical studies, we conduct experiments on a synthetic dataset and a real-world bandit dataset collected from a fashion e-commerce platform [13]. In both experiments, our method further reduces the MSE of MIPS by learning embeddings that balance the bias-variance trade-off, compared to [11], which only considers reward prediction for embedding learning.
In this section, we introduce the problem setting for off-policy evaluation of contextual bandits, and the Marginalized IPS (MIPS) estimator which we build our proposed method upon.
Let \(\mathcal{X}\subseteq \mathbb{R}^d\) be the context space, \(\mathcal{A}\) be a finite set of actions, and \(\mathcal{R}\) be the space of rewards, all of which are assumed to be known. Data are collected offline by following a behavior policy \(\mu \colon \mathcal{X}\to \Delta(\mathcal{A})\) as follows. First, a context \(X \sim P_X\) is sampled, where \(P_X\) is the unknown context distribution. Then an action \(A \sim \mu(\cdot | X)\) is sampled. Finally, a reward \(R \sim P_{R|X,A}\) is received, where \(P_{R|X,A}\) is the unknown reward distribution given context \(X\) and action \(A\). The reward \(R\) is bandit feedback, as it only reveals the value of the action taken in a context. We collect \(n\) such i.i.d. samples in a dataset \(\mathcal{D}= ((X_i,A_i,R_i))_{i=1}^n\).
For a policy \(\pi \colon \mathcal{X}\to \Delta(\mathcal{A})\), let \[\begin{align} v(\pi) \coloneq \int_{\mathcal{X}} \sum_{a \in \mathcal{A}} p_X(x) \pi(a|x) q(x,a) dx \,. \end{align}\] Here \(p_X\) is the probability density function (PDF) of \(P_X\) and \(q \colon \mathcal{X}\times \mathcal{A}\to \mathbb{R}\) is the mean reward function \(q(x,a) \coloneq \int r p_R(r|x,a) dr\), where \(p_R\) is the PDF of \(P_{R|X,A}\).
The goal in OPE is to estimate \(v(\pi)\) for a target policy \(\pi\) using the offline dataset \(\mathcal{D}\). The performance of an estimator \(\hat{v}(\pi)\) is measured using the mean squared error or MSE: \[\begin{align} \text{MSE}[\hat{v}(\pi)] &\mathrel{\vcenter{:}}= \mathbb{E}[(\hat{v}(\pi) - v(\pi))^2] \\ &=\text{Bias}[\hat{v}(\pi)]^2 + \operatorname{\mathbb{V}}[\hat{v}(\pi)]\,, \end{align}\] where \(\text{Bias}[\hat{v}(\pi)]\) and \(\operatorname{\mathbb{V}}[\hat{v}(\pi)]\) denote the bias and variance of the estimator \(\hat{v}(\pi)\) respectively, and the expectation is over the data generating distribution.
The OPE problem has been extensively studied. Here we describe some of the most common estimators for OPE. Perhaps the most widely used estimator is Inverse Propensity Score (IPS) weighting, sometimes referred to as importance sampling. Given a behavior policy \(\mu\), a target policy \(\pi\), and a dataset \(\mathcal{D}= ((X_i, A_i, R_i))_{i=1}^n\), the IPS estimator estimates \(v(\pi)\) as \[\begin{align} \hat{v}_{\text{IPS}}(\pi) \coloneq \frac{1}{n} \sum_{i=1}^n \underbrace{\frac{\pi(A_i|X_i)}{\mu(A_i|X_i)}}_{=:w(X_i,A_i)} R_i\,, \end{align}\] where \(w(x,a)\) is the IPS weight for \((x,a) \in \mathcal{X}\times\mathcal{A}\). Under the assumption that \(\pi(a|x) > 0 \implies \mu(a|x) > 0\) for all \(x \in \mathcal{X}\) and \(a \in \mathcal{A}\), IPS is unbiased: \(\mathbb{E}[\hat{v}_{\text{IPS}}(\pi)] = v(\pi)\). However, in practice, this assumption is often violated, making IPS biased. The bias is given by [6], [8]: \[\begin{align} |\text{Bias}[\hat{v}_{\text{IPS}}(\pi)]| = \mathbb{E}\left[ \sum_{a \in \mathcal{U}(X,\mu)} \pi(a|X) q(X,a) \right] \,, \end{align}\] where \(\mathcal{U}(x, \mu) = \{ a \in \mathcal{A}: \mu(a|x) = 0 \}\) is the set of unsupported actions in context \(x\) under policy \(\mu\). A bigger issue is the variance of IPS [6], [10], given by \[\begin{align} \operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)] = \frac{1}{n}\Bigl(\mathbb{E}\!\left[ w(X,A)^2 \operatorname{\mathbb{V}}[R|X,A] \right] + \operatorname{\mathbb{V}}\left[ \mathbb{E}_\mu[w(X,A) q(X,A)] \right] + \mathbb{E}\left[ \operatorname{\mathbb{V}}_\mu[w(X,A) q(X,A)] \right]\Bigr)\,\notag \end{align}\] where \(\mathbb{E}_\mu\) and \(\operatorname{\mathbb{V}}_\mu\) denote the expectation and variance under \(\mu\) given the context, and the outer expectation and variance are over the context distribution. The equation shows that the variance of IPS is high when the IPS weights \(w\) are large or when \(\operatorname{\mathbb{V}}[R|X,A]\) is large, i.e., the rewards are noisy.
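To make the estimator concrete, a minimal NumPy sketch of the IPS estimate on logged data is given below; the array names and shapes are illustrative rather than part of our method.

```python
import numpy as np

def ips_estimate(pi_probs, mu_probs, actions, rewards):
    """Vanilla IPS estimate of v(pi) from logged bandit data.

    pi_probs, mu_probs: (n, |A|) arrays holding pi(a|X_i) and mu(a|X_i).
    actions: (n,) array of logged action indices A_i ~ mu(.|X_i).
    rewards: (n,) array of logged rewards R_i.
    """
    idx = np.arange(len(actions))
    w = pi_probs[idx, actions] / mu_probs[idx, actions]  # IPS weights w(X_i, A_i)
    return np.mean(w * rewards)
```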
To alleviate the issues of IPS, [6] introduced the Marginalized IPS (MIPS) estimator. The MIPS estimator assumes access to embeddings from some set \(\mathcal{E}\subseteq \mathbb{R}^{d_\mathcal{E}}\). The embeddings are sampled from an unknown embedding distribution \(P_{E|X,A}\) with density \(p_E\). The rewards are assumed to be sampled from \(P_{R|X,A,E}\). Now the value of a policy is defined as \[\begin{align} v(\pi) \coloneq \int_\mathcal{X}\sum_{a\in \mathcal{A}} \int_\mathcal{E}p_X(x) \pi(a|x)p_E(e|x,a) q(x,a,e) dx de \,, \end{align}\] where \(q(x,a,e) \mathrel{\vcenter{:}}= \int r p_R(r|x,a,e) dr\) is the mean reward function for context \(x\), action \(a\), and embedding \(e\), and \(p_R\) is the PDF of \(P_{R|X,A,E}\). The marginal embedding distribution induced by a policy \(\pi\) is defined as \(p_E(e|x,\pi) \coloneq \sum_{a \in \mathcal{A}} p_E(e|x,a) \pi(a|x)\). The goal is still to estimate \(v(\pi)\). Given a dataset \(\mathcal{D}\coloneq ((X_i, A_i, E_i, R_i))_{i=1}^n\), the MIPS estimator [13] is defined as \[\begin{align} \label{eq:mips} \hat{v}_{\text{MIPS}}(\pi) \mathrel{\vcenter{:}}= \frac{1}{n} \sum_{i=1}^n \underbrace{ \sum_{a \in \mathcal{A}} \mu(a|X_i, E_i) w(X_i, a)}_{=:w(X_i,E_i)} R_i \,, \end{align}\tag{1}\] where \(w(x,e)\) is the marginal IPS weight for \((x,e) \in \mathcal{X}\times\mathcal{E}\) and \[\begin{align} \mu(a|x,e) \coloneq \frac{\mu(a|x) \cdot p_E(e|x,a)}{p_E(e|x,\mu)} \end{align}\] is the posterior of an action given a context and an embedding. The estimator \(\hat{v}_{\text{MIPS}}(\pi)\) is computed by estimating \(\mu(\cdot|X_i,E_i)\) using logistic regression for all \(i \in [n]\) [13] and then plugging the estimates into 1 .
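For comparison with the IPS sketch above, here is a sketch of the MIPS estimate in 1 , assuming the posterior \(\hat{\mu}(a|x,e)\) has already been estimated (e.g., by logistic regression on the \((E_i, A_i)\) pairs); all names are illustrative.

```python
import numpy as np

def mips_estimate(pi_probs, mu_probs, post_probs, rewards):
    """MIPS estimate of v(pi) as in Eq. (1).

    pi_probs, mu_probs: (n, |A|) arrays of pi(a|X_i) and mu(a|X_i).
    post_probs: (n, |A|) array of the estimated posterior mu_hat(a|X_i, E_i).
    rewards: (n,) array of logged rewards R_i.
    """
    w = pi_probs / mu_probs                      # IPS weights w(X_i, a) for every action
    w_marginal = np.sum(post_probs * w, axis=1)  # marginal weights w(X_i, E_i)
    return np.mean(w_marginal * rewards)
```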
The MIPS estimator assumes access to embeddings, which might not be available in many cases. Even when embeddings are given, they might not be the embeddings that minimize the MSE of the MIPS estimator. [11] proposed a method to learn action embeddings from data for MIPS. However, their method learns embeddings that do not necessarily minimize the MSE of the MIPS estimator. Specifically, they fit a linear reward model with a neural network and use its final layer as the action embedding. While this approach shows promising results, embeddings learned only via reward prediction (by minimizing the reward MSE) do not explicitly account for their predictive power with respect to actions, i.e., how concentrated the posterior \(\mu(\cdot \mid x, e)\) is for \(x \in \mathcal{X}\) and \(e \in \mathcal{E}\).
In addition, [11] learn only action embeddings, but the embedding assigned to an action should depend on the context: an action can yield different rewards in different contexts. For example, a particular drug dosage could be effective or lethal depending on the patient. To address these issues, we introduce our method, Context-Action Embedding Learning for MIPS, or CAEL-MIPS, in the next section.
Our proposed method, CAEL-MIPS, learns context-action embeddings by minimizing the MSE of MIPS. To this end, we derive upper bounds on the bias and variance of MIPS and propose an embedding learning objective that combines the reward prediction error from [11] with these upper bounds.
In the MIPS estimator setup (2.2), the expressiveness of embeddings with respect to rewards is characterized by the no direct effect assumption [6]: \(A \perp R \mid X, E\). Intuitively, this assumption states that the embedding \(E\) captures all information about \(A\) relevant for predicting \(R\). Under this condition, the MIPS estimator is unbiased. [6] further showed that when the no direct effect assumption is violated, the bias of MIPS can be decomposed as \[\begin{align} \text{Bias}[\hat{v}_{\text{MIPS}}(\pi)] = \mathbb{E}\bigg[ \sum_{i < j} \mu(i|X,E)\mu(j|X,E) \cdot (q(X,i,E) - q(X, j, E)) \cdot (w(X,j) - w(X,i)) \bigg] \,. \end{align}\] We now derive an upper bound on the bias of MIPS. From the previous equation, \[\begin{align} & |\text{Bias}[\hat{v}_{\text{MIPS}}(\pi)]| \\ & \le \mathbb{E}\biggl[ \biggl\lvert \sum_{i < j}\mu(i|X,E)\mu(j|X,E) \cdot (q(X,i,E) - q(X, j, E)) \cdot (w(X,j) - w(X,i)) \biggr\rvert \biggr]\\ &\le \mathbb{E}\biggl[ \sum_{i < j} \bigl\lvert \mu(i|X,E)\mu(j|X,E) \cdot(q(X,i,E) - q(X, j, E)) \cdot (w(X,j) - w(X,i))\bigr\rvert \biggr]\\ &\le (b-a) \cdot \mathbb{E}\biggl[ \sum_{i < j} \mu(i|X,E)\mu(j|X,E) \cdot \bigl\lvert w(X,j) - w(X,i)\bigr\rvert \biggr] \,, \end{align}\] where the first inequality uses Jensen's inequality, the second uses the triangle inequality, and the third uses the boundedness of rewards in \([a, b]\). To learn embeddings with low bias, we parameterize the embedding \(E = f_\theta(X,A)\) using a function approximator \(f_\theta\) and minimize the square of the bias upper bound. We define the objective based on bias as \[\begin{align} \label{eq:bias-loss} \mathcal{L}_{\text{bias}}(\theta) \coloneq \mathbb{E}\biggl[ \sum_{i < j} {\mu(i | X, f_\theta(X,A))\mu(j|X, f_\theta(X,A))} \cdot \bigl\lvert w(X,j) - w(X,i)\bigr\rvert \biggr]^2. \end{align}\tag{2}\] We observe that \(\mathcal{L}_{\text{bias}}(\theta) \to 0\) when the posterior \(\mu(\cdot \mid x, e)\) concentrates on a single action for each \(x \in \mathcal{X}\) and \(e \in \mathcal{E}\). Intuitively, this corresponds to the case where \(e\) fully predicts \(a\). Moreover, when \(\pi\) and \(\mu\) have little overlap, the difference \(|w(x,j) - w(x,i)|\) becomes large for some \(x \in \mathcal{X}\) and \(i, j \in \mathcal{A}\). In such cases, the gradient of \(\mathcal{L}_{\text{bias}}(\theta)\) is large, and the loss decreases more rapidly toward zero. Hence, minimizing \(\mathcal{L}_{\text{bias}}(\theta)\) is particularly effective when \(\pi\) and \(\mu\) exhibit limited overlap.
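A minimal PyTorch sketch of the empirical bias objective in 2 is given below, assuming the posterior \(\hat{\mu}(\cdot \mid x, e)\) is produced by a differentiable head so that gradients flow back to the embedding parameters \(\theta\); the tensor names are illustrative. Note the \(O(|\mathcal{A}|^2)\) pairwise computation, which we revisit in the limitations.

```python
import torch

def bias_loss(post, w):
    """Empirical bias surrogate (Eq. 2): the squared batch average of
    sum_{i<j} mu_hat(i|x,e) * mu_hat(j|x,e) * |w(x,j) - w(x,i)|.

    post: (n, K) tensor of mu_hat(a | X_k, f_theta(X_k, A_k)), differentiable in theta.
    w:    (n, K) tensor of IPS weights w(X_k, a), treated as constants.
    """
    pp = post.unsqueeze(2) * post.unsqueeze(1)    # (n, K, K) products mu_i * mu_j
    dw = (w.unsqueeze(2) - w.unsqueeze(1)).abs()  # (n, K, K) gaps |w(x, j) - w(x, i)|
    # The full K x K matrix counts each unordered pair {i, j} twice, so halve it
    # (diagonal terms vanish because dw is zero there).
    per_sample = 0.5 * (pp * dw).sum(dim=(1, 2))
    return per_sample.mean() ** 2
```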
[6] showed that when the bias of MIPS is zero, the variance reduction of MIPS is given by \[\begin{align} \operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)] - \operatorname{\mathbb{V}}[\hat{v}_{\text{MIPS}}(\pi)] = \frac{1}{n}\mathbb{E}[ \mathbb{E}_{P_{R|X,E}}[R^2]\operatorname{\mathbb{V}}_{\mu(\cdot |X,E)}[w(X,A)]] \,, \end{align}\] where \(P_{R|X,E}\) is the reward distribution, which depends only on \(X\) and \(E\). From the previous equation, \[\begin{align} \label{eq:mips-var} \operatorname{\mathbb{V}}[\hat{v}_{\text{MIPS}}(\pi)] &= \operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)] - \frac{1}{n}\mathbb{E}[\mathbb{E}_{P_{R|X,E}}[R^2]\operatorname{\mathbb{V}}_{\mu(\cdot|X,E)}[w(X,A)]] \nonumber\\ &= \operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)] - \frac{1}{n}\mathbb{E}[R^2\operatorname{\mathbb{V}}_{\mu(\cdot|X,E)}[w(X,A)]] \,, \end{align}\tag{3}\] where the last equality follows because \(\operatorname{\mathbb{V}}_{\mu(\cdot|X,E)}[w(X,A)]\) is a function of \(X\) and \(E\) only, so the inner expectation over \(R\) can be absorbed into the outer expectation. We now upper bound the RHS of 3 as follows: \[\begin{align} \label[ineq]{ineq:mips-var-ub} & \operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)] - \frac{1}{n}\mathbb{E}[R^2\operatorname{\mathbb{V}}_{\mu(\cdot|X,E)}[w(X,A)]] \nonumber \\ &= \operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)] - \frac{1}{n}\mathbb{E}\left[R^2 \left(\mathbb{E}_{\mu(\cdot|X,E)}[w(X,A)^2] - \mathbb{E}_{\mu(\cdot|X,E)}[w(X,A)]^2 \right) \right] \nonumber \\ &= \operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)] + \frac{1}{n}\mathbb{E}\left[ R^2 \mathbb{E}_{\mu(\cdot|X,E)}[w(X,A)]^2 \right] - \frac{1}{n}\mathbb{E}\left[ R^2 \mathbb{E}_{\mu(\cdot|X,E)}\left[w(X,A)^2\right] \right] \nonumber \\ & \le \operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)] + \frac{1}{n}\mathbb{E}\Biggl[ R^2 \left(\sum_{a \in \mathcal{A}} \mu(a|X,E)w(X,a)\right)^2 \Biggr] \nonumber \\ &\le \operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)] + \frac{1}{n}\mathbb{E}\left[ R^2 \sum_{a \in \mathcal{A}} \mu(a|X,E)^2 \sum_{a \in \mathcal{A}} w(X,a)^2 \right] \,, \end{align}\] where the first inequality drops the nonpositive term \(-\frac{1}{n}\mathbb{E}\left[ R^2 \mathbb{E}_{\mu(\cdot|X,E)}[w(X,A)^2] \right]\) and the second follows from the Cauchy-Schwarz inequality. Combining 3 with the inequality above, we get \[\begin{align} \operatorname{\mathbb{V}}[\hat{v}_{\text{MIPS}}(\pi)] \le \operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)] + \frac{1}{n}\mathbb{E}\left[ R^2 \sum_{a \in \mathcal{A}} \mu(a|X,E)^2 \sum_{a \in \mathcal{A}} w(X,a)^2 \right] \,. \end{align}\] Similar to the bias objective in 2 , we parameterize \(E = f_\theta(X,A)\) and minimize the above upper bound. Since the first term does not depend on the embeddings, we define the objective based on variance as \[\begin{align} \label{eq:var-loss} \mathcal{L}_{\text{var}}(\theta) \mathrel{\vcenter{:}}= \frac{1}{n} \mathbb{E}\left[ R^2 \sum_{a \in \mathcal{A}} \mu(a|X, f_\theta(X, A))^2 \sum_{a \in \mathcal{A}} w(X, a)^2 \right]. \end{align}\tag{4}\]
Note that \(\mathcal{L}_{\text{var}}(\theta)\) is minimized when \(\sum_a \mu(a \mid X, f_\theta(X,A))^2\) is minimized. This sum of squared probabilities is minimized by the uniform distribution, i.e., when \(\mu(a \mid X, f_\theta(X,A)) = 1/|\mathcal{A}|\) for all \(a \in \mathcal{A}\). Additionally, \(\mathcal{L}_{\text{var}}(\theta)\) decreases more rapidly when \(w(x,a)\) is large for some \(x \in \mathcal{X}\) and \(a \in \mathcal{A}\). Since large values of \(w(x,a)\) increase \(\operatorname{\mathbb{V}}[\hat{v}_{\text{IPS}}(\pi)]\) [6], [10], minimizing \(\mathcal{L}_{\text{var}}(\theta)\) is especially valuable in scenarios where \(\hat{v}_{\text{IPS}}(\pi)\) exhibits high variance. We also highlight an interesting connection between \(\sum_a \mu(a \mid X, f_\theta(X,A))^2\) and entropy: the negative logarithm of this term is the collision entropy, i.e., the Rényi entropy with parameter \(\alpha = 2\): \(H_\alpha(X) = (1-\alpha)^{-1}\log\sum_{i=1}^m p(x_i)^\alpha\), where \(X\) is a random variable with probability mass function \(p(\cdot)\).
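The variance surrogate in 4 is simpler to compute; a PyTorch sketch under the same assumptions as the bias sketch above is shown below (the leading \(1/n\) factor is folded into the coefficient \(\gamma\) in practice, as in the empirical objective later).

```python
import torch

def var_loss(post, w, r):
    """Empirical variance surrogate (Eq. 4): batch average of
    r^2 * sum_a mu_hat(a|x,e)^2 * sum_a w(x,a)^2.

    post: (n, K) tensor of mu_hat(a | X_k, f_theta(X_k, A_k)), differentiable in theta.
    w:    (n, K) tensor of IPS weights w(X_k, a), treated as constants.
    r:    (n,) tensor of rewards (or reward predictions, as used in Eq. 6).
    """
    return (r.pow(2) * post.pow(2).sum(dim=1) * w.pow(2).sum(dim=1)).mean()
```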
For our embedding learning objective, we combine the upper bounds on bias and variance derived in 2 and 4 with the reward prediction error used by [11]: \[\begin{align} \label{eq:final-obj} \mathcal{L}(\theta) \mathrel{\vcenter{:}}= \mathcal{L}_R(\theta) + \rho\mathcal{L}_{\text{bias}}(\theta) + \gamma\mathcal{L}_{\text{var}}(\theta) \,, \end{align}\tag{5}\] where \(\mathcal{L}_R(\theta)\) is the MSE of reward prediction, \(\mathcal{L}_{\text{bias}}(\theta)\) and \(\mathcal{L}_{\text{var}}(\theta)\) are defined in 2 and 4 , and \(\rho\) and \(\gamma\) are positive coefficients. We estimate 5 using offline data \(\mathcal{D}\) by replacing expectations with empirical averages to get \(\hat{\mathcal{L}}(\theta; \mathcal{D}) \coloneq \hat{\mathcal{L}}_R(\theta; \mathcal{D}) + \rho\hat{\mathcal{L}}_{\text{bias}}(\theta; \mathcal{D}) + \gamma\hat{\mathcal{L}}_{\text{var}}(\theta; \mathcal{D})\). Recall that \(f_\theta(x,a)\) outputs the embedding of \((x,a) \in \mathcal{X}\times\mathcal{A}\), which is used in the reward prediction \(\hat{R}(\theta)\). The components of the loss \(\hat{\mathcal{L}}(\theta; \mathcal{D})\) are: \[\begin{align} \label{eq:emp-r-loss} \hat{\mathcal{L}}(\theta; \mathcal{D}) &= \underbrace{\frac{1}{n} \sum_{i=1}^n (\hat{R}_i(\theta) - R_i)^2 }_{=\hat{\mathcal{L}}_R(\theta; \mathcal{D})}\\ &+ \underbrace{ \frac{\rho}{n^2} \biggl(\sum_{k=1}^n \sum_{i < j} {\hat{\mu}(i | X_k, f_\theta(X_k,A_k))\hat{\mu}(j|X_k, f_\theta(X_k,A_k))} \cdot \bigl\lvert w(X_k,j) - w(X_k,i)\bigr\rvert \biggr)^2}_{=\rho\hat{\mathcal{L}}_{\text{bias}}(\theta; \mathcal{D})} \notag\\ & + \underbrace{\frac{\gamma}{n^2} \sum_{i=1}^n \hat{R}_i^2(\theta) \sum_{a \in \mathcal{A}} \hat{\mu}(a|X_i, f_\theta(X_i, A_i))^2 \sum_{a \in \mathcal{A}} w(X_i, a)^2}_{=\gamma\hat{\mathcal{L}}_{\text{var}}(\theta; \mathcal{D})}. \end{align}\tag{6}\]
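Putting the pieces together, one way to compute the empirical objective in 6 on a mini-batch is sketched below; it reuses the `bias_loss` and `var_loss` sketches above, the coefficients \(\rho = 10\) and \(\gamma = 0.1\) match the values used later in the experiments, and everything else is illustrative.

```python
def cael_mips_loss(r_hat, r, post, w, rho=10.0, gamma=0.1):
    """Empirical CAEL-MIPS objective (Eq. 6) on a mini-batch.

    r_hat: (n,) reward predictions from the embedding head.
    r:     (n,) logged rewards.
    post:  (n, K) estimated posterior mu_hat(a | x, f_theta(x, a)).
    w:     (n, K) IPS weights w(x, a).
    """
    l_reward = ((r_hat - r) ** 2).mean()  # reward prediction MSE
    return l_reward + rho * bias_loss(post, w) + gamma * var_loss(post, w, r_hat)
```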
Minimizing \(\hat{\mathcal{L}}(\theta; \mathcal{D})\) over \(\theta\) gives us the parameters defining the embeddings. The pseudocode of the MIPS estimator using our method CAEL-MIPS is presented in Algorithm 1.
Our method CAEL-MIPS (Algorithm 1) uses a three-layer feedforward neural network \(f_\theta\) to parameterize the embeddings and the rewards. The hidden layers use batch normalization, ReLU activations, and a dropout rate of 0.2. The network takes \((x,a) \in \mathcal{X}\times \mathcal{A}\) as input and outputs an embedding \(\hat{e} \coloneq f_\theta(x,a)\). Following [11], the reward prediction is given by \(\hat{r} \coloneq \hat{e}^\top x\).
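A PyTorch sketch of this architecture is shown below; the hidden width and other details not fixed by the description above are illustrative choices.

```python
import torch
import torch.nn as nn

class CAELNet(nn.Module):
    """Three-layer feedforward embedding network with batch norm, ReLU and
    dropout 0.2, mapping (context, action representation) to an embedding.
    The reward prediction is the inner product of the embedding and the
    context, following the linear reward head of [11]."""

    def __init__(self, d_context, d_action, d_embed, d_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_context + d_action, d_hidden),
            nn.BatchNorm1d(d_hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(d_hidden, d_hidden),
            nn.BatchNorm1d(d_hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(d_hidden, d_embed),
        )

    def forward(self, x, a_repr):
        e_hat = self.net(torch.cat([x, a_repr], dim=1))  # embedding f_theta(x, a)
        r_hat = (e_hat * x).sum(dim=1)                   # reward prediction e^T x (requires d_embed == d_context)
        return e_hat, r_hat
```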
To compute \(\hat{v}_{\text{MIPS}}(\pi)\), Algorithm 1 first learns the context-action embeddings \((\hat{E}_i)_{i=1}^n\) by performing gradient updates on the loss defined in 6 . After learning the embeddings, it estimates \(\hat{\mu}_\phi\) using supervised learning on \(((\hat{E}_i, A_i))_{i=1}^n\), following [13] and [11]. Finally, it computes the MIPS estimator using \(\hat{\mu}_\phi\).
We provide experimental results with both synthetic and real-world data.
To generate synthetic data, we first choose the context space and the action space to be \(\mathcal{X}= [0, 1]^d\) and \(\mathcal{A}= [k]\), respectively. For an action \(a \in [k]\), we construct its vector representation by choosing the first coordinate as \(a/k\) and the remaining \(k-1\) coordinates arbitrarily in \([0, 1]\). That is, for \(a \in [k]\), its vector representation is \((a/k, b_1, \dots, b_{k-1})^\top\) where \(b_1, \dots, b_{k-1} \in [0, 1]\). This is done so that the expected reward depends only on the first coordinate of the action representation. The expected reward function is \(q(x,a) = 10 e^{-(x_1 - a_1)^2}\), where \(a_1 = a/k\). We choose the behavior policy \(\mu\) to be uniformly random over the actions, and the target policy \(\pi\) to be \(\epsilon\)-greedy with \(\epsilon \in [0,1]\). The context distribution is \(P_X = \text{Unif}[0, 1]^d\) and the reward distribution is \(P_{R|X,A} = \mathcal{N}(q(x,a), \sigma^2)\). To generate a dataset, we first sample \(X \sim P_X\). Then we sample \(A \sim \mu(\cdot|X)\) and construct its vector representation. Using \(X\) and the vector representation of \(A\), we sample \(R \sim \mathcal{N}(q(X,A), \sigma^2)\). We repeat this process \(n\) times to obtain \(\mathcal{D}= ((X_i, A_i, R_i))_{i=1}^n\).
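The following NumPy sketch reproduces this data-generating process; choices not pinned down above (how the remaining action coordinates \(b_1, \dots, b_{k-1}\) are drawn, and that the \(\epsilon\)-greedy target policy is greedy with respect to the true expected reward) are illustrative assumptions.

```python
import numpy as np

def make_synthetic_data(n, d=5, k=500, eps=0.2, sigma=1.0, seed=0):
    """Generate logged data (X_i, A_i, R_i) plus the pi and mu probability tables."""
    rng = np.random.default_rng(seed)
    # Action representations: first coordinate a/k, remaining coordinates in [0, 1].
    a_first = np.arange(1, k + 1) / k
    a_repr = np.concatenate([a_first[:, None], rng.uniform(size=(k, k - 1))], axis=1)

    X = rng.uniform(size=(n, d))                                # contexts ~ Unif[0, 1]^d
    q_all = 10.0 * np.exp(-(X[:, :1] - a_first[None, :]) ** 2)  # q(x, a), shape (n, k)

    mu = np.full((n, k), 1.0 / k)                               # uniform behavior policy
    pi = np.full((n, k), eps / k)                               # eps-greedy target policy
    pi[np.arange(n), q_all.argmax(axis=1)] += 1.0 - eps

    A = rng.integers(k, size=n)                                 # A_i ~ mu(.|X_i) (uniform)
    R = rng.normal(q_all[np.arange(n), A], sigma)               # noisy rewards
    return X, a_repr, A, R, pi, mu
```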
We compare our method CAEL-MIPS with the Direct Method (DM), IPS, and the embedding learning method of [11] (AEL-MIPS). For DM, we use the same reward model as in our embedding learning method, and estimate the policy value as \(\hat{v}_{\text{DM}}(\pi) = (1/n) \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a|x_i) \hat{q}(x_i,a)\), where \(\hat{q}\) is the learned reward model. The results are shown in Figure 2. To obtain the results, we use \(n = 1000\) data points and repeat our experiments 30 times to estimate the MSE, bias, and variance of all the estimators. We set the context and embedding dimensions to \(d = d_e = 5\), and the number of actions to \(|\mathcal{A}| = 500\). For the target policy, we set \(\epsilon = 0.2\). The variance of the rewards is set to \(\sigma^2 = 1\). For the loss computation, we set \(\gamma = 0.1\) and \(\rho = 10\).
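For reference, the DM estimate above reduces to a one-line computation given the fitted reward model's predictions; in this sketch, `q_hat` is an assumed precomputed table of \(\hat{q}(x_i, a)\).

```python
import numpy as np

def dm_estimate(pi_probs, q_hat):
    """Direct Method estimate: for each context, average the fitted rewards
    over actions under the target policy, then average over contexts.

    pi_probs, q_hat: (n, |A|) arrays of pi(a|x_i) and q_hat(x_i, a).
    """
    return np.mean(np.sum(pi_probs * q_hat, axis=1))
```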
In Figure 2, we observe that CAEL-MIPS significantly improves the MSE compared to the other estimators. In the figure, we see that IPS has small bias but large variance compared to DM. AEL-MIPS has a smaller MSE than IPS since it learns action embeddings for variance reduction, where the embeddings are learned from a reward prediction loss. CAEL-MIPS learns context-action embeddings that balance bias and variance by minimizing the MSE of MIPS. Since AEL-MIPS considers only the reward prediction loss to reduce the variance of IPS, CAEL-MIPS achieves a lower MSE.
Figure 3 shows the MSE of the estimators as a function of various problem parameters. First, we vary the dataset size, with 100 trials for each dataset size. We observe that CAEL-MIPS performs better than AEL-MIPS, and significantly outperforms IPS and DM. This is expected, as CAEL-MIPS and AEL-MIPS leverage embeddings while IPS and DM do not. The remaining panels of Figure 3 show how the MSE changes as we change the environment, where each experiment is repeated 30 times for each parameter value. As the size of the action space increases, we expect the MSE to go up, which is what we observe. Next, we vary \(\epsilon\), which controls the randomness of the \(\epsilon\)-greedy target policy. When \(\epsilon=0\), the target policy is greedy with respect to the true expected reward. Since the behavior policy is uniformly random, there is a large mismatch between the policies, and CAEL-MIPS significantly improves the MSE. When \(\epsilon=1\), the behavior and target policies coincide and the MSE of all the estimators is low. Finally, we vary the noise in the reward. Again, CAEL-MIPS performs better than all the other estimators.
We evaluate our method on the Open Bandit Dataset, a real-world bandit dataset collected on the fashion e-commerce platform ZOZOTOWN [13]. The dataset was collected via an A/B test of two policies: a uniformly random behavior policy and a Thompson sampling target policy. We use \(n=10000\) samples. The context dimension is \(d=20\), the total number of actions is \(|\mathcal{A}| = 240\), and the reward is binary, indicating whether the user clicked the recommended item. To learn the embeddings, we use the same three-layer feedforward neural network as in the synthetic data experiment, with one-hot action representations as input. For DM, we use the same network as CAEL-MIPS as a reward model to estimate \(v(\pi)\). As in the synthetic experiments, we set \(\gamma=0.1\) and \(\rho=10\) for the loss computation. We repeat the experiments 30 times to estimate the squared errors of the estimators.
Following [6] and [11], we plot the empirical cumulative distribution function (CDF) of squared errors relative to IPS in Figure 4. In other words, the squared errors of all the estimators are first normalized by the squared error of IPS, and then their empirical CDF is computed. From Figure 4, we observe that CAEL-MIPS outperforms IPS in more than 75% of the runs. Furthermore, while the relative errors of AEL-MIPS and DM never fall below 0.1 and 0.01, respectively, CAEL-MIPS's relative error falls below 0.1 in about 40% of the runs and below 0.01 in about 20% of the runs.
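To make the normalization explicit, a small NumPy sketch of the relative-error CDF computation is given below; array names are illustrative.

```python
import numpy as np

def relative_error_cdf(sq_errors, sq_errors_ips):
    """Empirical CDF of squared errors normalized by the IPS squared error.

    sq_errors, sq_errors_ips: (n_runs,) arrays of per-run squared errors.
    Returns sorted relative errors and their empirical CDF values, which can be
    plotted against each other (typically with a log-scaled x-axis).
    """
    rel = np.sort(sq_errors / sq_errors_ips)
    cdf = np.arange(1, len(rel) + 1) / len(rel)
    return rel, cdf
```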
In this work, we proposed CAEL-MIPS, a method for learning context-action embeddings that can be incorporated into the MIPS estimator for OPE. We derived an objective function based on the bias and variance of the MIPS estimator, which can be optimized to obtain embeddings with low MSE for the MIPS estimator. In our empirical evaluation on a synthetic and a real-world bandit dataset, we showed that the MIPS estimator using embeddings learned by CAEL-MIPS achieves a lower MSE than all the baselines.
Although our method optimizes an upper bound on the MSE, the size of the gap between this bound and the true MSE remains unclear. Due to this gap, our method introduces new hyperparameters \(\rho\) and \(\gamma\) for the objectives in 2 and 4 , respectively, which may be difficult to tune. Moreover, computing the bias upper bound in Algorithm 1 requires \(O(|\mathcal{A}|^2)\) operations, which can be prohibitive in large action spaces. In addition, the algorithm relies on multiple calls to a supervised learning subroutine to estimate the posterior \(\hat{\mu}(\cdot | x, e)\). While using batch updates reduces the computational cost of estimating \(\hat{\mu}(\cdot | x, e)\), there remains room for further improvement.
Several directions remain for future exploration. First, reducing the \(O(|\mathcal{A}|^2)\) complexity of the bias loss computation would make our method applicable to very large action spaces. Second, deriving a tighter bound on the MSE of MIPS could eliminate the need for additional hyperparameters, simplifying the method. Finally, extending CAEL-MIPS to policy optimization and the full reinforcement learning setting is another promising direction.
Work done during an internship at RBC Borealis.