April 02, 2024

We introduce a novel doubly-robust (DR) off-policy evaluation (OPE) estimator for Markov decision processes, `DRUnknown`

, designed for situations where both the logging policy and the value function are unknown. The proposed estimator
initially estimates the logging policy and then estimates the value function model by minimizing the asymptotic variance of the estimator while considering the estimating effect of the logging policy. When the logging policy model is correctly specified,
`DRUnknown`

achieves the smallest asymptotic variance within the class containing existing OPE estimators. When the value function model is also correctly specified, `DRUnknown`

is optimal as its asymptotic variance reaches the
semiparametric lower bound. We present experimental results conducted in contextual bandits and reinforcement learning to compare the performance of `DRUnknown`

with that of existing methods.

In various decision-making problems, estimating the value, the expected reward of a policy is a crucial question that needs to be addressed. Online evaluation requiring a comprehensive evaluation of policy value can be expensive and may not be applicable to multiple target policies.

Alternatively, *off-policy* evaluation (OPE) refers to a technique that estimates the value of a *target* policy by utilizing log data generated from a different *logging* policy. This approach has attracted considerable interest in
the domains of contextual bandits (CB) [1], [2] and
reinforcement learning (RL) [3]–[5].

Several off-policy evaluation algorithms [1], [6]–[9] currently in use rely on having complete knowledge of the logging policy in order to utilize inverse probability weighting (IPW). However, in situations where information about the data logger is not available, such as when using human decision data, it becomes necessary to estimate the logging policy. Applying existing estimators directly to cases where the logging policy is unknown can compromise the desired properties of these methods, as they fail to take into account the impact of logging policy estimation.

In this paper, we present a novel efficient estimator called `DRUnknown`

, which simultaneously estimate both the logging policy model and the value function model. Our proposed approach, without the need for external data, effectively
captures the interdependency between the two models.

In the field of conventional statistics, [10] employs the influence function to estimate parameters in the doubly-robust (DR) estimator for population means when there are missing observations. The study focused on situations where data is incomplete and the missing mechanisms are unknown and estimated. To advance this concept, we utilized the influence function approach to address the OPE problem. Here, we treat the rewards of unselected actions as missing values and estimate the unknown logging policy. This enables us to derive the asymptotic variance of the OPE estimator and estimate the parameters that minimize it.

The proposed OPE estimator estimates both the logging policy model and the value function model, and is referred to as *doubly-robust* due to its consistency if either the model for the logging policy or the value function is correctly specified.
When the logging policy is correctly specified, the proposed estimator is the most efficient among the class of DR OPE estimators with an estimated logging policy, and is at least as efficient as existing methods. Moreover, if the value function model is
correctly specified, the estimator locally efficient, achieving the semiparametric lower bound for asymptotic variance and is asymptotically optimal. In order to demonstrate the effectiveness of the proposed estimator, we conducted simulation experiments
to compare its performance with previous methods in contextual bandits and reinforcement learning problems.

The main contributions of this paper are as follows:

We present a new DR OPE estimator for Markov decision process, for the case where both the logging policy and the value function are unknown. The proposed estimator is consistent when either the logging policy model or the value function model is correctly specified.

We propose a method to estimate both the logging policy model parameter \(\hat{\phi}\) and the value function model parameter \(\hat{\beta}\) simultaneously. We use the maximum likelihood estimator (MLE) for \(\hat{\phi}\) and minimize the asymptotic variance to estimate \(\hat{\beta}\), accounting for the impact of estimating \(\hat{\phi}\).

The proposed estimator has the smallest asymptotic variance among estimators using MLE for \(\hat{\phi}\) when the logging policy model is correctly specified. When the value function model is also correctly specified the proposed estimator is optimal, as its asymptotic variance reaches the semiparametric lower bound.

Simulations on contextual bandits and reinforcement learning problems show that the proposed estimator consistently shows smaller mean-squared errors compared to the benchmarks methods.

In this paper, we model the decision-making problem as a reinforcement learning framework in which the learner’s interaction with the system is represented as a *Markov Decision Process* (MDP). In this section, we provide definitions of MDP and
the corresponding off-policy evaluation problem for our study.

An MDP is defined as a tuple (\(\mathcal{X}\), \(\mathcal{A}\), \(R, P, P_0, \gamma\)), where \(\mathcal{X}\) and \(\mathcal{A}\) represent the state and the finite action spaces with \(|\mathcal{A}|=K\), \(R(x, a)\) denotes the distribution of the random variable \(r(x, a)\) of the bounded immediate reward of taking action \(a\) in state \(x\), \(P(\cdot|x, a)\) is the transition probability distribution. \(P_0 : X \rightarrow [0, 1]\) is the initial state distribution, and \(\gamma \in [0, 1]\) is the discounting factor. The learner utilizes a policy \(\pi : \mathcal{X}\times \mathcal{A}\rightarrow [0, 1]\), a stochastic mapping from states to actions. Here, \(\pi(a|x)\) represents the probability of taking action \(a\) in state \(x\).

Let \(\mathcal{H}_{T}\) = \(\{(x_t, a_t, r_t)\} _{t=0}^{T-1}\) represent a \(T\)-step trajectory containing state-action-rewards generated by policy \(\pi\). Denote \(R_{t_1:t_2} = \sum\limits_{t=t_1}^{t_2} \gamma^{t-t_1} r_t\) as the sum of discounted return of from time step \(t_1\) to \(t_2\). Our goal is to estimate the *policy value*, the expected return of trajectories generated by policy \(\pi\), i.e., \(V^{\pi,T}= \text{\mathbb{E}}_{\pi}
[R_{0:T-1}]\). Here \(\text{\mathbb{E}}_{\pi}\) denotes the expectation over the trajectory sampled from \(\pi\). We also define the value function of a policy \(\pi\) at state \(x\) and at state-action pair \((x,a)\) from time step \(t\) by \(V^{\pi,t}(x)=\text{\mathbb{E}}_{\pi} [R_{t:T-1}|x_t = x]\) and \(Q^{\pi,t}(x,a)=\text{\mathbb{E}}_{\pi} [R_{t:T-1}|x_t = x, a_t=a]\), respectively.

Throughout this paper, we fix the trajectory length as \(T\). We also omit the \(\pi\) from \(V^{\pi,t}(s)\) and \(Q^{\pi,t}(s,a)\) to simplify them as \(V^t(s)\) and \(Q^t(s,a)\), respectively, since we are only interested in the value of \(\pi\). Note that, unlike the case with \(T=\infty\), where the \(Q\) and \(V\) value functions remain unchanged, the value functions \(Q^{t}\) and \(V^{t}\) can vary for each time step \(t\) in general.

The contextual bandits (CB) is a special case of the MDP, in which the trajectory length \(T\) equals to \(1\), and the transition dynamics \(P\) do not exist. In CB, the state is represented as the context, and the action is called as the arm.

In an OPE problem, we are given a dataset \(\mathcal{D}_n = \{\mathcal{H}_{T}^{(i)}\}_{i=1}^n\) comprising \(n\) independent \(T\)-step trajectories drawn
from a logging policy \(\mu\). We use the superscript \((i)\) to refer to the data from the \(i^{\text{th}}\) trajectory, denoted as \(\mathcal{H}_{T}^{(i)}\). The objective is to estimate the policy value \(V^{\pi}\) of a separate target policy \(\pi\). We define the *importance ratio*
\(\rho_{t}=\pi(a_t|x_t) / \mu(a_t|x_t)\), and the *cumulative importance ratio* from time step \(t_1\) to \(t_2\) between \(\mu\) and \(\pi\) as \(\rho_{t_1: t_2}= \Pi_{t=t_1}^ {t_2} \rho_t\). In cases where \(t_1 > t_2\), we define \(\rho_{t_1:t_2}\) to be 1. Similarly, we denote the estimated importance ratio as \(\hat{\rho}\), where the probability of the logging policy \(\mu(a|x)\) is
replaced with its estimated version \(\hat{\mu}(a|x;\hat{\phi})\) for the logging policy model \(\hat{\mu}(\cdot;\phi)\).

The policy value \(\widehat{V}\) is a point estimator, and there are various metrics to assess the quality of the provided estimator. One commonly used is the *mean squared error* (MSE), defined as \(\text{MSE}(\widehat{V}) = \text{\mathbb{E}}_{\mu}[{(\widehat{V}- V^{\pi})^2}]\) with the expectation taken with respect to the data sampled from \(\mu\). MSE quantifies the performance of a
given estimator in finite samples. Another measure is the *asymptotic variance*, which assesses the efficiency of an estimator with a large number of samples. The estimator \(\widehat{V}\) is called
*asymptotically linear* if there exists a function \(\psi\) of trajectory \(\mathcal{H}_{T}\) such that \[\widehat{V}- V^{\pi} = \frac{1}{n}\sum_{i=1}^n
\psi(\mathcal{H}_{T}^{(i)}) + o_p(\frac{1}{\sqrt{n}}).\] with \(\text{\mathbb{E}}_{\mu}{[\psi(\mathcal{H}_{T})]}=0\) and \(\mathrm{Var}_{\mu}{[\psi(\mathcal{H}_{T})^2]} < \infty\).
The quantity \(\mathrm{Var}_{\mu}{[\psi(\mathcal{H}_{T})^2]}\) is denoted as the asymptotic variance of \(\widehat{V}\), and the function \(\psi(\mathcal{H}_{T})\) is referred to as the *influence function*. In our study, we derive the specific form of the influence function for OPE estimators and conduct an analysis to minimize the asymptotic variance.

For our analysis, we use the following standard regularity assumptions for OPE problem.

*Assumption 1* (Absolute Continuity).

For all state-action pair \((x,a) \in \mathcal{X}\times \mathcal{A}\), if \(\pi(a|x)>0\) then \(\mu(a|x)>0\).

*Assumption 2* (Square Integrable).

\(\text{\mathbb{E}}_{\mu}{[\rho_{0:t}^2]}<\infty\) for all \(t<T\).

Assumption 1 is crucial in order to avoid an infinite IPW and the OPE estimators which employ IPW are well-defined. Assumption 2 is necessary to guarantee that estimators utilizing the IPW method have finite variance.

The doubly-robust type estimator for the OPE problem was first proposed by [1]. The estimator depends on a regression model for value function that is estimated using a separate independent dataset, which may not always be available.

The works of [6] and [7] introduce estimators that improve efficiency by modifying the IPW component of the DR OPE estimator. These methods still require the independently estimated value function using external data. [8] employed the idea of [11], which estimates the value function model by minimizing the estimation variance without external dataset. All these methods assume that the true logging policy is known and do not require estimation.

Several studies [12]–[15] have proposed the IPW estimators for OPE utilizing the estimated logging policy and examined their theoretical and numerical properties of these estimators. However, these methods do not consider the value function, making them not doubly-robust and suboptimal in terms of asymptotic variance.

There are three standard approaches to estimate \(V^{\pi}\) in OPE problem, and we provide a brief overview of these methods.

The first method is *direct method* (DM) [1], [16], which
directly approximates the state-action value function \(Q(x,a)\) of the target policy. The DM does not use the information from the logging policy \(\mu\) and relies heavily on the accuracy
of the prediction of the value function. While the DM estimators often exhibits low variance, it is not guaranteed to be unbiased.

The second method is called *inverse probability weighting* (IPW) estimator [17], which utilize the importance
ratio term to correct the discrepancy between the target and logging policies. The IPW estimator is unbiased in case the logging policy is known, but it can exhibit a large variance when there is a significant difference between the logging and target
policies and the importance ratio have large value. The Assumption 1 is required in order for the IPW estimator to be well-defined: one cannot know about the pair \((x,a)\) never explored by the logging policy.

The third approach, named *doubly-robust* (DR) estimator [1], [5], [18], [19], combines the DM and IPW, leveraging their respective
strengths to obtain favorable characteristics. The IPW estimator is a special case of DR with the value function fixed to zero. Our work focuses on the DR type estimators, as our main objective is to introduce an efficient method for simultaneously
estimating the logging policy and the value function model.

The standard DR OPE estimator with known logging policy is given by \[\widehat{V}^{\text{DR}} = \frac{1}{n} \sum\limits_{i=1}^n\sum\limits_{t=0}^{T-1}\gamma^{t} \rho_{0:{t-1}}^{(i)} \bigl[\rho_t^{(i)}[r_t^{(i)} - \widehat{Q}(x_t^{(i)}, a_t^{(i)};\beta)] + \widehat{V}(x_t^{(i)};\beta)\bigr],\]

where \(\widehat{Q}(\cdot ;\beta)\) is a *value function model* of the state-action value function parameterized by \(\beta\), and \(\widehat{V}(x ;\beta)
= \sum_{a \in \mathcal{A}} \pi(a|x) \widehat{Q}(x,a ;\beta).\)

Here, \(\widehat{Q}\) depends solely on the state-action pair. However, as mentioned in Section 2.1, the \(Q^t(x,a)\) may vary for different time steps, even if the state-action pair remains the same when the horizon length \(T\) is finite. Hence, we cannot ensure that the model \(\widehat{Q}\) can express the \(Q^t\) for all \(t\) in the OPE problem with a finite \(T\). To address this issue, we utilize the function class \(\widehat{Q}(t,x,a)\), which incorporates \(t\) as an input parameter in practice. Also, to ensure that the class of DR OPE estimator contains the IPW estimator, we assume that the class of \(\widehat{Q}\) contains the constant functions. This can be easily satisfied by bringing the parameters \(\nu_0, \nu_1 \in \mathbb{R}\) to extend the function class by \(\nu_0 + \nu_1 \widehat{Q}\).

The DR OPE estimator [eq:DR] utilizes the importance ratio using the true probability of the logging policy. When the logging policy \(\mu\) is
unknown we use the logging policy model \(\hat{\mu}(\cdot ;\hat{\phi})\) to approximate the unknown \(\mu\). We say that the model \(\hat{\mu}\) is
*correctly specified* if there exists a value of \(\phi\) such that \(\hat{\mu}(x,a;\phi)=\mu(x,a)\) for all \(x\) and \(a\). The correct specification of \(\hat{\mu}\) is necessary for the consistent estimation of the policy value \(V^{\pi}\), as correctly specifying the value
function model \(\widehat{Q}\) is difficult in many OPE problems.

In the following sections, our focus will be on the DR OPE estimator, which estimates both the logging policy parameter \(\phi\) and the value function parameter \(\beta\). Our goal is to build an efficient OPE estimator that exhibits lower asymptotic variance compared to existing methods.

`DRUnknown`

↩︎In this section, we present our class of `DRUnknown`

estimators. The central concept of `DRUnknown`

is to first estimate the unknown logging policy parameter \(\hat{\phi}\) for the IPW component of
the DR estimator, and then learn the value function parameter \(\hat{\beta}\) to minimize the asymptotic variance of the estimator. We utilize the influence function of the estimator to derive a feasible objective function
to minimize the asymptotic variance of `DRUnknown`

.

`DRUnknown`

for Contextual Bandits↩︎We first present the `DRUnknown`

for contextual bandits problem with \(T=1\). Consider a logging policy model \(\hat{\mu}(x,a ; \phi)\) parameterized by \(\phi\). Let \(\Delta_{a}^{(i)}\) represent the indicator for an action \(a\) chosen in trajectory \(i\), and denote the partial
derivative of \(\hat{\mu}\) with respect to \(\phi\) as \(\dot{\hat{\mu}} (a|x; \phi)\). The *maximum-likelihood estimator* (MLE) \(\hat{\phi}= \hat{\phi}_n\) is given by the solution of the estimating equation \(U_n(\phi)\) given by \[U_n(\phi)=\sum\limits_{i=1}^n\sum\limits_{a \in
\mathcal{A}}\Delta_{a}^{(i)} \frac{\dot{\hat{\mu}} (a|x^{(i)}; \phi)}{\hat{\mu}(a|x^{(i)}; \phi)}=0.\] If the logging policy model is correctly specified, the estimating equation is unbiased, and \(\hat{\phi}\) is
consistent for \(\phi\).

Plugging in the estimated value of \(\hat{\phi}\), we obtain the following class of DR OPE estimator \[\widehat{V}^{\text{DR}}(\beta, \hat{\phi}) = \frac{1}{n} \sum\limits_{i=1}^n\Bigl[ \frac{\pi^{(i)}}{\hat{\mu}^{(i)}} (r^{(i)} - \widehat{Q}^{(i)}(\beta)) + \widehat{V}^{(i)}(\beta) \Bigr]\] where the notations \(\pi^{(i)}, \hat{\mu}^{(i)}, \widehat{Q}^{(i)}(\beta)\) and \(\widehat{V}^{(i)}(\beta)\) denote the \(\pi(a^{(i)}|x^{(i)})\), \(\hat{\mu}(a^{(i)}|x^{(i)};\hat{\phi})\), \(\widehat{Q}(x^{(i)}, a^{(i)}; \beta)\) and \(\widehat{V}(x^{(i)}; \beta)\) respectively. Henceforth in the paper, if there is no ambiguity, we simplify terms containing \(x^{(i)}\) and \(a^{(i)}\) by using only the superscript \((i)\).

The expression of the bias and variance of the estimator \(\widehat{V}^{\text{DR}}(\beta, \hat{\phi})\) is not simple in general, even for a fixed \(\beta\). The estimator is not a mean of the independent and identically distributed (i.i.d) variables, due to the estimated \(\hat{\phi}\) in the denominator. Consequently, estimating the regression parameter \(\beta\) to minimize the MSE of the proposed estimator poses challenges. Hence, we instead aim to find \(\hat{\beta}\) minimizing the asymptotic variance of the estimator.

To attain this, we derive the influence function asymptotically equivalent to \(\widehat{V}\), through a Taylor expansion with respect to \(\phi\) and \(\beta\). Denote the new regression function \(F\), the product of the target policy \(\pi\) and the \(\widehat{Q}\) with an additional estimating effect term of \(\phi\), given by \[F(x,a;\beta, c, \phi)=\pi(a|x)\widehat{Q}(x,a;\beta) + c^{\top} \dot{\hat{\mu}}(a|x;\phi),\] for \(\beta\) and \(c \in \mathbb{R}^{\text{dim}(\phi)}.\) The influence function of \(\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})\) with arbitrary estimator \(\hat{\beta}\) can be formulated employing \(F\), as described in the following proposition.

**Proposition 1** (Asymptotic Equivalence of \(\widehat{V}\) for CB). * Consider an estimator \(\hat{\beta}\)
converging to some \(\beta^*\) in probability. If the model \(\hat{\mu}(\cdot;\phi)\) is correctly specified, the DR OPE estimator is asymptotically linear with influence function \(\eta\): \[\widehat{V}^{\text{DR}}(\hat{\beta}, \hat{\phi}) = \widetilde{V}(\beta^*, c(\beta^*)) + o_p(n^{-1/2})
= \frac{1}{n} \sum\limits_{i=1}^n\eta^{(i)}(\beta,c(\beta^*)) + o_p(n^{-1/2}),\] where \(c(\beta)\) is a vector that solely depends on \(\beta\), and \[\eta^{(i)}(\beta,c)=\frac{1}{\mu^{(i)}}\bigl[\pi^{(i)}r^{(i)} -F^{(i)}(\beta,c,\phi) \bigr] \\
+\sum\limits_{a \in \mathcal{A}}F(x^{(i)},a; \beta, c,\phi).\]*

As the value of the vector \(c(\beta^*)\) is unknown, we estimate both \(\beta\) and \(c\) that minimize the variance of \(\widetilde{V}(\beta,c)\). Denote the vector \(\vec{F}(x; \beta,c,\phi)=(F(x, a; \beta,c,\phi))_{a \in \mathcal{A}}\) and \(\pi\vec{Q}(x) = (\pi(a|x) Q(x,a))_{a \in \mathcal{A}}\), and the gradient matrix \(f(x;\beta,c,\phi) =\displaystyle\frac{\partial \vec{F}}{\partial(\beta,c)}.\) By the law of total variance, the variance of \(\widetilde{V}^{\text{DR}}(\beta,c)\) can be expressed as the square of stochastic seminorm added by a constant independent of \(\beta\) and \(c\).

**Proposition 2** (Variance of \(\widetilde{V}^{\text{DR}}\)). *The variance of \(\widetilde{V}(\beta,c)\) is given by \[n\mathrm{Var}{(\widetilde{V}(\beta,c))} = \mathrm{Var}(V^{\pi}(x)) + \text{\mathbb{E}}_{\mu}\displaystyle\left\Vert \pi\vec{r} - \pi\vec{Q}(x) \right\Vert _{M_{\mu}}^2
+ \text{\mathbb{E}}_{\mu}\displaystyle\left\Vert \vec{F}(x;\beta,c,\phi)-\pi\vec{Q}(x) \right\Vert _{M_{\mu}}^2\] where \(M_{\mu}=\operatorname{diag}(\mu(a|x)^{-1})_{a \in \mathcal{A}} - J_K\) and \(J_K \in \mathbb{R}^{K \times K}\) the matrix of ones.*

The Proposition 2 tells minimizing the square of seminorm \[\label{eq:bandit95seminorm}
\text{\mathbb{E}}_{\mu}{\left\Vert \vec{F}(x;\beta,c,\phi)- \pi \vec{Q}(x)\right\Vert _{M_{\mu}}^2}\tag{1}\] with respect to \(\beta\) and \(c\) is equivalent to minimizing the
variance of the \(\widetilde{V}(\beta,c)\). Denote the minimizer of 1 by \((\beta_{\text{opt}}, c_{\text{opt}})\), the zero of its gradient
\[\label{eq:bandit95gradient}
\begin{align}
\text{\mathbb{E}}_{\mu} \bigl[ f^{\top}(x;\beta,c,\phi) M_{\mu} (\vec{F}(x;\beta,c,\phi)- \pi \vec{Q}(x)) \bigr]= 0.
\end{align}\tag{2}\] We can observe that the solution satisfies \(c_{\text{opt}}=c(\beta_{\text{opt}})\), so that the variance of \(\widetilde{V}(\beta_{\text{opt}},c(\beta_{\text{opt}}))=\widetilde{V}(\beta_{\text{opt}},c_{\text{opt}})\) is minimized at \((\beta_{\text{opt}}, c_{\text{opt}})\). Therefore by Proposition [prop:influence], the smallest asymptotic variance of DR OPE estimator \(\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})\) under the
correctly specified \(\hat{\mu}\), is achieved by \(\hat{\beta}\) converging in probability to this \(\beta_{\text{opt}}\).The following estimating equation
\(S_n(\beta,c)\) is a tractable equation that jointly solves for \((\beta,c)\), with its solution \(\hat{\beta}\) consistent for \(\beta_{\text{opt}}\): \[\label{eq:bandit95gradient95data}
S_n(\beta,c)= \sum\limits_{i=1}^nf^{(i) \top}(\beta,c)
M^{(i)} \bigl( F^{(i)}(\beta,c) - \operatorname{diag}(\pi(a|x^{(i)}))_{a \in \mathcal{A}} ~ \vec{r}^{(i)}\bigr)= 0\tag{3}\] for \(M^{(i)}=\displaystyle\operatorname{diag}(\hat{\mu}(a|x^{(i)})^{-1})_{a \in
\mathcal{A}} - J_K\) and the pseudo reward \(\vec{r}^{(i)} = \displaystyle\bigl(\frac{\Delta_a^{(i)}}{\hat{\mu}(a|x^{(i)})} r^{(i)}\bigr)_{a \in \mathcal{A}}\). Plugging in the solution \(\hat{\beta}\) of \(S_n(\beta,c)=0\) as \(\widehat{V}^{\text{DR}}(\hat{\beta}, \hat{\phi})\), we obtain our proposed estimator, `DRUnknown`

. The \(c\) serves as an auxiliary parameter that adjusts for the effect of estimating \(\hat{\phi}\), and does not have a direct role in the final estimator \(\widehat{V}^{\text{DR}}(\hat{\beta}, \hat{\phi})\).

The proposed `DRUnknown`

is doubly-robust within the statistical context, as it remains consistent if either the logging policy model \(\hat{\mu}\) or the value function model \(\widehat{Q}\) is correctly specified. We have addressed the scenario with the correct logging policy model above. Regarding the second scenario, the minimizer of 1 is \((\beta_{\text{opt}},c_{\text{opt}})=(\beta_0, 0)\) since \(\Gamma(\beta_0)=0\), where \(\beta_0\) is the true value of the parameter satisfying \(Q(\cdot)=\widehat{Q}(\cdot;\beta_0).\) Consequently, the solution of 3 converges to \((\beta_0, 0)\), establishing the consistency of
`DRUnknown`

. This observation can be summarized as the following proposition.

**Proposition 3** (Doubly-Robustness). *The DRUnknown \(\widehat{V}^{\text{DR}}(\hat{\beta}, \hat{\phi})\) for contextual bandits is a doubly robust: it converges to \(V^{\pi}\) in probability if either the logging policy model \(\hat{\mu}\) or the value function model \(\widehat{Q}\) is correctly specified.*

`DRUnknown`

for CB: extended function class↩︎For an intuitive understanding of `DRUnknown`

for contextual bandits, suppose that the target policy \(\pi\) always have a nonzero value. Then, \(F\) can be rewritten as \(F(x,a;\beta,\phi) = \pi(a|x)\widetilde{Q}(x,a;\beta,c)\) for the extended function class \[\widetilde{Q}(x,a;\beta,c)=\widehat{Q}(x,a;\beta,c) + c^{\top}
\displaystyle\frac{\dot{\hat{\mu}}(a|x;\phi)}{\pi(a|x)}.\] The objective function 1 we aim to minimize in `DRUnknown`

can then be expressed as \[\text{\mathbb{E}}_{\mu}{\left\Vert \bigl( \pi(a|x)[\widetilde{Q}(x,a;\beta,c) - Q(x,a)] \bigr)_{a \in \mathcal{A}} \right\Vert _{M_{\mu}}^2}.\] The additional linear term \(\displaystyle
c^{\top}\frac{\dot{\hat{\mu}}(a|x;\phi)}{\pi(a|x)}\) from \(\widetilde{Q}\) can be seen as the projection of \(Q^{\pi}\) onto the inverse probability tangent space, the linear space
spanned by the score function of \(\hat{\phi}\).

Hence, the proposed estimator can be seen as employing a linear parameter \(c\) within the expanded function class \(\widetilde{Q}\) to remove the effect of estimating \(\hat{\phi}\), while seeking the optimal \(\hat{\beta}\) for \(\widehat{Q}\) that minimizes the asymptotic variance.

`DRUnknown`

for Reinforcement Learning↩︎The proposed `DRUnknown`

for RL is constructed similarly to the case of CB. However, its analysis is more complex, as the estimator is expressed as a weighted sum of terms observed from time step \(0\) to
\(T-1\). Below, we introduce the construction of `DRUnknown`

for RL, commencing with the estimation of \(\hat{\phi}\).

The MLE \(\hat{\phi}_n = \hat{\phi}\) of \(\phi\) for RL is given by the solution of \[U_n(\phi)=\sum\limits_{i=1}^n\sum\limits_{t=0}^{T-1}\sum\limits_{a \in
\mathcal{A}}\Delta_{a,t}^{(i)} \frac{\dot{\hat{\mu}}(a|x_t^{(i)}; \phi)}{\hat{\mu}(a|x_t^{(i)}; \phi)}=0,\] where \(\Delta_{a,t}^{(i)}\) is the indicator variable for selected action at time step \(t\) from trajectory \(i\). Substituting the MLE \(\hat{\phi}\) into the DR estimator [eq:DR], the estimator is given by \(\widehat{V}^{\text{DR}}(\beta,\hat{\phi}) = \displaystyle\sum\limits_{t=0}^{T-1}\gamma^{t} \widehat{V}_{t}(\beta,\hat{\phi}),\) with \[\widehat{V}_{t}(\beta,\hat{\phi}) = \frac{1}{n} \sum\limits_{i=1}^n\hat{\rho}_{0:{t-1}}^{(i)}
\bigl[\hat{\rho}_t^{(i)}[r_t^{(i)} - \widehat{Q}^{(i)}(\beta)] + \widehat{V}^{(i)}(\beta)\bigr].\] The estimator of the value at each time step \(t\), \(\widehat{V}_t(\beta,\hat{\phi})\), takes the same form of `DRUnknown`

for CB, weighted by \(\hat{\rho}^{(i)}_{0:{t-1}}\). However, additional analysis is needed in the case of RL, as
the \(\widehat{V}_t\) for each \(t\) are stochastically correlated. Now, as done in CB, we define the value regression function \(F_t(x,a;\beta,c,\phi)\) for
each \(t\), given by \[F_{t}(x,a;\beta, c,\phi)
=\rho_{0:t-1}\pi(a|x)\widehat{Q}(x,a;\beta) + \gamma^{-t} c^{\top} \dot{\hat{\mu}}(a|x;\phi).\] and we derive the influence function of the proposed estimator as stated in the following proposition.

**Proposition 4** (Asymptotic Equivalence of \(\widehat{V}\) for RL). *Consider an estimator \(\hat{\beta}\) converging to some \(\beta^*\) in probability. If the logging policy model \(\hat{\mu}(\cdot;\phi)\) is correctly specified, then the proposed DR OPE estimator is asymptotically linear, with influence function \(\eta=\displaystyle\sum\limits_{t=0}^{T-1}\gamma^{t}\eta_t\): \[\widehat{V}^{\text{DR}}(\hat{\beta}, \hat{\phi}) = \widetilde{V}(\beta^*, c(\beta^*)) + o_p(n^{-1/2}) = \frac{1}{n}
\sum\limits_{i=1}^n\eta^{(i)}(\beta,c(\beta^*)) + o_p(n^{-1/2}),\] where \(c(\beta)\) is a vector that solely depends on \(\beta\), and \[\eta_t^{(i)}(\beta,c) = \frac{1}{\mu^{(i)}_t}\bigl[ \rho_{0:t-1}^{(i)} \pi^{(i)}_t r_t^{(i)} -F_{t}^{(i)}(\beta,c,\phi) \bigr]
+\sum\limits_{a \in \mathcal{A}}F_{t}(x_t^{(i)},a; \beta, c,\phi).\]*

Now, to derive the asymptotic variance of the proposed estimator for RL, we denote the vector \(\vec{F}_{t}(x; \beta,c, \phi)=(F_t(x, a; \beta,c, \phi))_{a \in \mathcal{A}},\) \(\pi \vec{Q}^t(x) = (\pi(a|x) Q^t(x,a))_{a \in \mathcal{A}}\) and the gradient matrix \(f_t(x;\beta,c, \phi) =\displaystyle\frac{\partial \vec{F}_t}{\partial(\beta,c)}.\)

The asymptotic variance of the DR OPE estimator for RL can be expressed with these notations as in the following proposition.

**Proposition 5** (Variance of \(\widetilde{V}\) for RL). *The variance of \(\widetilde{V}(\beta,c)\) is given by \[n\mathrm{Var}{(\widetilde{V}(\beta,c))} = C_T +\sum\limits_{t=0}^{T-1}\gamma^{2t} \text{\mathbb{E}}_{\mu} \left\Vert \vec{F}_{t}(x_t;\beta,c,\phi)- \rho_{0:t-1} \pi \vec{Q}^t(x_t)\right\Vert ^2_{M_{t,\mu}}\] where \(M_{t,\mu}=\operatorname{diag}(\mu(a|x_t)^{-1})_{a \in \mathcal{A}} - J_K\) and \[C_T = \sum\limits_{t=0}^{T-1}\gamma^{2t} \text{\mathbb{E}}_{\mu} \bigl[ \mathrm{Var}_t \left[ \rho_{0:t-1}V^t(x_t)
\right] + \mathrm{Var}_{t+1}[\rho_{0:t}r_t] \bigr],\] with \(\mathrm{Var}_{t}\) the variance conditioned on the history up to time step \(t-1\), \(\{x_0,a_0,\dots, x_{t-1},a_{t-1}\}\).*

Similar to the case of CB, the proposition implies that minimizing the weighted sum of seminorms with respect to \(\beta\) and \(c\) is equivalent to minimizing the asymptotic variance of \(\widehat{V}^{\text{DR}}\). The solution \(\hat{\beta}\) of the following estimating equation is consistent for \(\beta_{\text{opt}}\), minimizing the asymptotic variance of \(\widehat{V}^{\text{DR}}\) by \(\sigma^2(\beta^*)=\mathrm{Var}_{\mu}{[\eta(\beta^*, c(\beta^*))]}\). \[\label{eq:RL95gradient95data} S_n(\beta,c)=\sum\limits_{i=1}^n\sum\limits_{t=0}^{T-1}\gamma^{2t} f_{t}^{(i) \top}(\beta,c,\hat{\phi}) M_t^{(i)}\bigl(\vec{F}_{t}^{(i)}(\beta,c) - \hat{\rho}_{0:t-1} \operatorname{diag}(\pi_t^{(i)})_{a \in \mathcal{A}} ~ \vec{r}_{t:T-1}^{(i)}\bigr)= 0,\tag{4}\] with the pseudo-reward \(\displaystyle{\vec{r}^{(i)}_{t:T-1}} = \bigl( \frac{\Delta_{a,t}^{(i)}}{\hat{\mu}(a|x_t^{(i)})} \bar{R}_{t:T-1}^{(i)} \bigr)_{a \in \mathcal{A}}\) for \(\bar{R}_{t:T-1}^{(i)} = r_t^{(i)} + \displaystyle\sum\limits_{\tau=t+1}^{T-1} \gamma^{\tau-t} \hat{\rho}_{t+1:\tau}^{(i)} r_{\tau}^{(i)},\) and \(M_t^{(i)}=\displaystyle\operatorname{diag}(\hat{\mu}(a|x_t^{(i)})^{-1})_{a \in \mathcal{A}} - J_K\).

We now present the theoretical properties of the proposed estimator, with a focus on its asymptotic distribution. Combining Proposition [prop:influence], 2, 4 and 5, the `DRUnknown`

has the following asymptotic normality for both contextual bandits and reinforcement learning problems.

**Theorem 6** (Asymptotic distribution). *When the logging policy model \(\hat{\mu}\) is correctly specified, the proposed DRUnknown estimator is asymptotically normal, given by \[\sqrt{n}(\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi}) - V^{\pi}) \xrightarrow{d} \mathcal{N} (0, \sigma^2)\] for \(\sigma^2=\sigma^2(\beta^*).\)*

To calculate the confidence interval of \(V^{\pi}\), one needs to estimate the unknown variance \(\sigma^2\). As we know that \[\begin{align} \sigma^2 &= \sigma^2(\beta^*, c(\beta^*), \phi) \\ &= \mathrm{Var}_{\mu}{[\widetilde{V}(\beta^*,c(\beta^*))]} = n \mathrm{Var}_{\mu}[\eta(\beta^*,c(\beta^*))], \end{align}\] we have \(\hat{\sigma}^2 (\beta, c(\beta^*), \phi) = \displaystyle\frac{1}{n} \sum\limits_{i=1}^n(\eta^{(i)}(\hat{\beta},\hat{c},\hat{\phi}) - \bar{\eta})^2\) is a consistent estimator for \(\sigma^2\), with \(\bar{\eta}=\displaystyle\frac{1}{n}\sum\limits_{i=1}^n\eta^{(i)}(\hat{\beta},\hat{c},\hat{\phi})\) the average of \(\eta^{(i)}\) and \(\hat{\beta}, \hat{c}\) are the solutions of 4 and the MLE \(\hat{\phi}\). By the Slutzky’s theorem, we have \[\hat{\sigma}^2 = \hat{\sigma}^2(\hat{\beta}, \hat{c}, \hat{\phi}) \xrightarrow{p} \sigma^2,\] and combining the result with the theorem we have the \((1-\alpha)\) confidence interval as \[\bigl[ \widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi}) - z_{\alpha/2} \frac{\hat{\sigma}^2}{\sqrt{n}}, \widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi}) + z_{\alpha/2} \frac{\hat{\sigma}^2}{\sqrt{n}} \bigr],\] with \(z_{\alpha/2}\) the \(\alpha/2\) standard Gaussian quantile.

When the value function model \(\widehat{Q}\) is also correctly specified, the proposed `DRUnknown`

is asymptotically equivalent to the DR OPE estimator using the true logging policy and the value function,
and is locally efficient.

**Proposition 7** (Local Efficiency). *When both the logging policy model \(\hat{\mu}\) and the value function model \(\widehat{Q}\) are correctly specified, The
asymptotic variance of the proposed estimator achieves the semiparametric lower bound and is asymptotically optimal.*

Proposition 7 states that `DRUnknown`

can be asymptotically optimal when the model \(\widehat{Q}\) is
properly chosen. Also, for any choice of value function model, whether it is correct or not, the proposed estimator is asymptotically more efficient compared to existing OPE algorithms, including IPW, DR, and MRDR, in the OPE problem with an unknown
logging policy.

**Proposition 8** (Intrinsic Efficiency). *The proposed DR OPE estimator has the smallest asymptotic variance among the class of DR OPE using MLE for \(\phi\), when the logging policy model \(\hat{\mu}\) is correctly specified. The same holds for arbitrary consistent estimator of \(\phi\).*

**Corollary 9** (Comparison to Existing Algorithms). *The proposed estimator has at least smaller asymptotic variance than IPW, DR and MRDR utilizing the same estimator for \(\phi\), when the logging
policy model \(\hat{\mu}\) is correctly specified.*

In this section, we compare the performance of the four estimators: (i) IPW [17] (ii) MLIPW [14] (iii) MRDR [8] and
(iv) the proposed `DRUnknown`

on CB and RL problems. IPW requires knowledge of the true logging policy \(\mu\) and cannot be applied in our experimental scenario. Therefore it serves as a baseline for the other
three estimators, and we compare the relative MSE of each estimator.

We use the simulation environments described as follows. We generate \(N=100\) test datasets with size \(n\), given by \(D_{j} = \{(x_{i}^{j}, a_{i}^{j}, r_{i}^{j})\}_{i=1}^n,~j \in [N]\). The context \(x_{i}^{j}\) for \(i^{\text{th}}\) sample from \(j^{\text{th}}\) dataset contains \(d=5\) dimensional \(K=10\) vectors randomly sampled from uniform distribution \(U(-1/\sqrt{d},1/\sqrt{d})\). The rewards are generated from the Gaussian distribution with mean \(\exp (x_{i}^{j \top} \beta)\) and variance 1, where \(\beta\) is a fixed coefficient also sampled from the uniform distribution \(U(-1/\sqrt{d},1/\sqrt{d})\).

The logging policy \(\mu\) and target policy \(\pi\) follows the linear logistic model with random coefficients \(\phi_{\mu}\) and \(\phi_{\pi}\) as \(\mu(a|x) = \displaystyle \frac{\exp(x_a^{\top} \phi_{\mu})}{\sum\limits_{i=1}^K \exp(x_i^{\top} \phi_{\mu})}\) and \(\pi(a|x) = \displaystyle \frac{\exp(x_a^{\top} \phi_{\pi})}{\sum\limits_{i=1}^K \exp(x_i^{\top} \phi_{\pi})}.\) The logging policy model \(\hat{\mu}\) also follows the linear logistic model, while the value function model \(\hat{Q}\) is defined by the linear regression model. Consequently, only the logging policy model is correctly specified.

The Table 1 and Figure 1 reports the relative MSE values of MLIPW, MRDR and the proposed `DRUnknown`

, calculated by
dividing their MSE by the MSE of IPW, for the different sizes \(n\) of each dataset. The proposed estimator achieves the lowest relative MSE. Figure 2 displays boxplots representing the
estimated values of four methods, each computed with \(N=100\) repeats and a dataset size of \(n=10000\).

sample size | MLIPW | MRDR | `DRUnknown` |
---|---|---|---|

\(5000\) | 0.8911 | 0.8126 | 0.8038 |

\(6000\) | 0.8416 | 0.7957 | 0.7731 |

\(7000\) | 0.8366 | 0.8043 | 0.7620 |

\(8000\) | 0.8234 | 0.8073 | 0.7560 |

\(9000\) | 0.8222 | 0.8200 | 0.7662 |

\(10000\) | 0.7575 | 0.7369 | 0.6792 |

For the real data experiment, we transform six classification datasets of the UCI Machine Learning Repository into contextual bandit problems : `glass`

, `letter`

, `zoo``image`

, `iris`

and
`handwritten`

[20]–[25]. Assigning the data to each class is considered as pulling an arm in the bandit.
When the class is correct the reward is 1 and otherwise 0.

We construct the logging policy \(\mu\) as follows: we train the policy \(\mu_0\) using a linear logistic model on a separate dataset, and mix the policy with a random policy \(\mu_{\text{random}}\) as \(\mu=\alpha \mu_0 + (1-\alpha)\mu_{\text{random}}\), for \(\alpha \in (0,1)\). The target policy \(\pi\) is constructed as in the simulation data experiment. The logging policy class is defined as \(\{\hat{\mu}=\alpha \mu_0 + (1-\alpha)\mu_{\text{random}}:\alpha \in (0,1)\}\), where \(\alpha\) serves as the parameter \(\phi\), and we use the constant value function model for \(\widehat{Q}\).

We generate \(N=100\) datasets, each with a size of \(n=10000\), by randomly sampling contexts from the original dataset and selecting arms using the logging policy \(\mu\). Table 2 displays the relative MSE values of MLIPW, MRDR, and our proposed `DRUnknown`

. Additionally, Figure 3 depicts the log-relative MSE of each estimator on the `glass`

dataset as a function of the logging policy parameter \(\alpha\). The `DRUnknown`

consistently
demonstrates the smallest relative MSE across all six datasets.

dataset | MLIPW | MRDR | `DRUnknown` |
---|---|---|---|

glass | 0.9548 | 0.8308 | 0.8125 |

letter | 0.9677 | 0.9323 | 0.9265 |

zoo | 0.9987 | 0.9842 | 0.9583 |

image | 0.8250 | 0.8245 | 0.8062 |

iris | 0.6116 | 0.5715 | 0.5454 |

handwritten | 0.9521 | 0.9432 | 0.9407 |

In this section, we provide the experimental results of OPE in the context of RL. We conducted the experiments on the ModelWin and ModelFail domains introduced in [6]. Detailed descriptions of these environments are available in the Appendix. As done in Section 6.1.2, we utilize a mixture of the separately trained optimal policy and the uniform random policy with a rate \(\alpha \in (0,1)\) as the logging policy \(\mu\). We employ a linear logistic policy for the target policy \(\pi\) and a linear model as the value function model \(\widehat{Q}\).

Table 4 displays the relative MSE of each estimator in ModelWin with \(T=20\) and ModelFail
environments. The proposed `DRUnknown`

attains the lowest relative MSE in most scenarios, if the number of sample \(n\) is large enough. Figure 4 illustrates the cumulative distribution
function (CDF) of squared error values (larger is better) on the ModelWin domain with \(T=20\) and \(n=20\). The `DRUnknown`

exhibits the highest CDF values, indicating that the
estimated values are more concentrated around the true value with high probability.

In conclusion, our study has introduced \(\texttt{DRUnknown}{}\), a novel doubly-robust off-policy evaluation estimator for contextual bandits and reinforcement learning problems where both the logging policy and the
value function remain unknown. Through a two-step estimation process, `DRUnknown`

first estimates the logging policy using maximum likelihood estimator and subsequently estimates the value function model, aiming to minimize the asymptotic
variance of the estimator while considering the impact of the logging policy estimation.

When the logging policy model is correctly specified, `DRUnknown`

achieves the smallest asymptotic variance within the class that encompasses existing OPE estimators. Furthermore, if the value function model is also correctly specified,
`DRUnknown`

attains optimality by reaching the semiparametric lower bound for asymptotic variance.

To show the effectiveness of `DRUnknown`

, we conducted experiments in both contextual bandits and reinforcement learning settings. The empirical results demonstrate the superior performance of `DRUnknown`

compared to existing
methods.

This paper addresses the off-policy evaluation problem, aiming to contribute to the field of Machine Learning. It includes theoretical contributions and numerical experimental results, with no apparent societal or ethical issues that require further discussion.

ModelWin | MLIPW | MRDR | `DRUnknown` |
---|---|---|---|

\(n=10\) | 1.1604 | 1.9963 | 0.8916 |

\(n=20\) | 1.0187 | 2.5577 | 0.8273 |

\(n=30\) | 1.1064 | 1.7349 | 0.8529 |

\(n=40\) | 1.0676 | 1.5270 | 0.9117 |

ModelFail | MLIPW | MRDR | `DRUnknown` |
---|---|---|---|

\(n=10\) | 0.7794 | 0.7963 |
0.8476 |

\(n=20\) | 0.5608 | 0.5591 | 0.2951 |

\(n=30\) | 0.5532 | 0.5605 | 0.3154 |

\(n=40\) | 0.5273 | 0.5285 | 0.1451 |

*Proof.* By the Taylor expansion, \[\widehat{V}^{\text{DR}}(\hat{\beta}, \hat{\phi}) = V(\beta^*) + \text{\mathbb{E}}_{\mu}{[\frac{\partial \widehat{V}}{\partial \beta}]^{\top}}(\hat{\beta}- \beta^*) +
\text{\mathbb{E}}_{\mu}{[\frac{\partial \widehat{V}}{\partial \phi}]^{\top}}(\hat{\phi}-\phi) + o_p(n^{-1/2}),\] where \[V(\beta) = \frac{1}{n} \sum\limits_{i=1}^n\Bigl[ \frac{\pi^{(i)}}{\mu^{(i)}}
(r^{(i)} - \widehat{Q}^{(i)}(\beta)) + \widehat{V}^{(i)}(\beta) \Bigr]\] and \(\hat{\beta}\xrightarrow{p} \beta^*\) for some \(\beta^*\). As \(\hat{\mu}\) is correctly specified, we have \[\text{\mathbb{E}}_{\mu}{[\frac{\partial \widehat{V}}{\partial \beta}]} = \frac{\partial }{\partial
\beta}\text{\mathbb{E}}_{\mu}{[-\frac{\pi(a|x)}{\mu(a|x)} \widehat{Q}(x,a; \beta^*) + \widehat{V}(x; \beta^*)]} = 0.\] For the following term, \[-\text{\mathbb{E}}_{\mu}{[\frac{\partial \widehat{V}}{\partial \phi}]} =
-\text{\mathbb{E}}_{\mu}{[-\frac{\pi \dot{\hat{\mu}}}{\mu^2} (r - \widehat{Q}(x,a;\beta^*)]} = \text{\mathbb{E}}_{\pi}{[-\frac{ \dot{\hat{\mu}}}{\mu} (r - \widehat{Q}(x,a;\beta^*)]} := \Gamma(\beta).\]

For \(\hat{\phi}-\phi\), from \(U_n(\hat{\phi})=0\) we have \(0 = U(\phi) + \text{\mathbb{E}}{[\dot{U}(\phi)]}(\hat{\phi}-\phi) + o_p(n^{-1/2}),\) where \[\text{\mathbb{E}}{[\dot{U}(\phi)]} = \displaystyle\sum\limits_{a \in \mathcal{A}}\text{\mathbb{E}}_{\mu}{[\frac{1}{\mu(a|x)^2} \dot{\hat{\mu}}(a|x;\phi) \dot{\hat{\mu}}(a|x;\phi)^{\top}]} := \Sigma_{\phi \phi}\] and \[\hat{\phi}-\phi = -\Sigma_{\phi \phi}^{-1} U(\phi) + o_p(n^{-1/2}).\] Plugging in the results above and denoting \(c(\beta)=\Sigma_{\phi \phi}^{-1} \Gamma(\beta)\), we have \[\begin{align} \widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi}) &= V(\beta^*) + \Gamma(\beta^*)^{\top}\Sigma_{\phi \phi}^{-1}U(\phi) + o_p(n^{-1/2}) \\ &= V(\beta^*) + c(\beta^*)^{\top} U(\phi) + o_p(n^{-1/2}) \\ &= \frac{1}{n} \sum\limits_{i=1}^n\eta^{(i)}(\beta^*,c(\beta^*)) + o_p(n^{-1/2}), \end{align}\] for \(\eta^{(i)}(\beta,c)=\displaystyle\frac{1}{\mu(a^{(i)}|x^{(i)})}\bigl[\pi(a^{(i)}|x^{(i)})r^{(i)} -F(x,a;\beta,c,\phi) \bigr] +\sum\limits_{a \in \mathcal{A}}F(x^{(i)},a; \beta, c,\phi)\), since \(\sum\limits_{a \in \mathcal{A}}\dot{\hat{\mu}}(a|x;\phi)=0.\) ◻

*Proof.* By the law of total variance, we decompose the variance of \(\widetilde{V}(\beta,c)\) with \(n=1\), given contexts \(x\) and rewards \(r\): \[\mathrm{Var}{(\widetilde{V}(\beta,c))} = \mathrm{Var}\text{\mathbb{E}}{(\widetilde{V}(\beta,c))|x,r)} + \text{\mathbb{E}}\mathrm{Var}{(\widetilde{V}(\beta,c))|x,r)}.\] For the first
term, we have \(\text{\mathbb{E}}_{\mu}{(\widetilde{V}(\beta,c))|x,r)} = V^{\pi}(x),\) and for the second term, \[\begin{align}
& \mathrm{Var}{(\widetilde{V}(\beta,c))|x,r)} = \mathrm{Var}_{\mu}\bigl[\frac{1}{\mu(a|x)}(\pi(a|x)r - F(x,a;\beta,c,\phi)) \bigr | x,r]\\
&= \text{\mathbb{E}}_{\mu}\bigl[\frac{1}{\mu(a|x)^2}(\pi(a|x)r - F(x,a;\beta,c,\phi))^2 \bigr | x,r] - \text{\mathbb{E}}_{\mu}\bigl[\frac{1}{\mu(a|x)}(\pi(a|x)r - F(x,a;\beta,c,\phi)) \bigr | x,r]^2 \\
&= \sum\limits_{a \in \mathcal{A}}\frac{1}{\mu(a|x)}(\pi(a|x)r - F(x,a;\beta,c,\phi))^2 - [\sum\limits_{a \in \mathcal{A}}(\pi(a|x)r - F(x,a;\beta,c,\phi)) ]^2 \\
&= \displaystyle\left\Vert \vec{F}(x;\beta,c,\phi)-\pi\vec{r} \right\Vert _{M_{\mu}}^2
\end{align}\] and since \(\text{\mathbb{E}}{[\pi \vec{Q}(x) - \pi \vec{r}]} =0\), we have \[\begin{align}
n\mathrm{Var}{(\widetilde{V}(\beta,c))} &= \mathrm{Var}(V^{\pi}(x)) + \text{\mathbb{E}}_{\mu}\displaystyle\left\Vert \vec{F}(x;\beta,c,\phi)-\pi\vec{r} \right\Vert _{M_{\mu}}^2 \\
&=\mathrm{Var}(V^{\pi}(x)) + \text{\mathbb{E}}_{\mu}\displaystyle\left\Vert \vec{F}(x;\beta,c,\phi)-\pi\vec{Q}(x) \right\Vert _{M_{\mu}}^2 + \text{\mathbb{E}}_{\mu}\displaystyle\left\Vert \pi\vec{Q}(x)- \pi\vec{r}\right\Vert _{M_{\mu}}^2.
\end{align}\] ◻

*Proof.* The proof is similar to that of Proposition [prop:influence] for contextual bandits, with \(T=1\).

For each \(t \in [T]\), applying the Taylor expansion, we have \[\widehat{V}_t(\hat{\beta},\hat{\phi}) = V_t(\beta^*) + \text{\mathbb{E}}_{\mu}{[\frac{\partial \widehat{V}_t}{\partial \beta}]^{\top}}(\hat{\beta}- \beta^*) + \text{\mathbb{E}}_{\mu}{[\frac{\partial \widehat{V}_t}{\partial \phi}]^{\top}}(\hat{\phi}-\phi) + o_p(n^{-1/2}),\] where \[V_t(\beta) = \frac{1}{n} \sum\limits_{i=1}^n\rho^{(i)}_{0:{t-1}} \bigl[\rho^{(i)}_t [r_t^{(i)} - \widehat{Q}^{(i)}_t(\beta)] + \widehat{V}^{(i)}_t(\beta)\bigr]\] and \(\hat{\beta}\xrightarrow{p} \beta^*\) for some \(\beta^*\). As \(\hat{\mu}\) is correctly specified, we have \[\text{\mathbb{E}}_{\mu}{[\frac{\partial \widehat{V}_t}{\partial \beta}]} = \frac{\partial }{\partial \beta}\displaystyle\text{\mathbb{E}}_{\mu}{[\rho_{0:t-1}[-\frac{\pi(a_t|x_t)}{\mu(a_t|x_t)} \widehat{Q}(x_t,a_t; \beta^*) + \widehat{V}(x_t; \beta^*)]]} = 0.\] For the following term, \[\begin{align} \text{\mathbb{E}}_{\mu}{[\frac{\partial \widehat{V}_t}{\partial \phi}]} &= \displaystyle\text{\mathbb{E}}_{\mu}{\bigl[\frac{\partial \rho_{0:t-1}}{\partial \phi}[\rho_t(r_t - \widehat{Q}_t(x_t,a_t;\beta^*))+\widehat{V}_t(x_t,\beta^*)] + \rho_{0:t-1} \frac{\partial \rho_t}{\partial \phi}(r_t-\widehat{Q}_t(x_t,a_t;\beta^*))\bigr]}\\ &=\text{\mathbb{E}}_{\mu}\bigl[ \frac{\partial \rho_{0:t-1}}{\partial \phi} \text{\mathbb{E}}_{\mu}[\rho_t(r_t - \widehat{Q}_t(x_t,a_t;\beta^*))+\widehat{V}_t(x_t,\beta^*)|\mathcal{H}_{t-1}] \bigr] \\ &+ \text{\mathbb{E}}_{\mu} \bigl[\rho_{0:t-1} \frac{\partial \rho_t}{\partial \phi}(r_t-\widehat{Q}_t(x_t,a_t;\beta^*))\bigr] \\ &= \text{\mathbb{E}}_{\mu} \rho_{0:t-1} \text{\mathbb{E}}_{\mu} \bigl[ \frac{\partial \rho_t}{\partial \phi}(r_t-\widehat{Q}_t(x_t,a_t;\beta^*)) | \mathcal{H}_{t-1} \bigr] \text{ (the first term equals to zero)} \\ &= -\text{\mathbb{E}}_{\mu} \rho_{0:t-1} \text{\mathbb{E}}_{\pi} \bigl[ \frac{\dot{\hat{\mu}}(a_t|x_t;\phi)}{\mu(a_t|x_t)}(Q^t(x_t,a_t)-\widehat{Q}^t(x_t,a_t;\beta^*)) \bigr] :=\Gamma_t(\beta^*). \end{align}\]

For \(\hat{\phi}-\phi\), from \(U_n(\hat{\phi})=0\) we have \(0 = U(\phi) + \text{\mathbb{E}}{[\dot{U}(\phi)]}(\hat{\phi}-\phi) + o_p(n^{-1/2}),\) where \[\displaystyle\sum\limits_{t=0}^{T-1} \sum\limits_{a \in \mathcal{A}} \text{\mathbb{E}}_{\mu}{[\frac{1}{\mu(a|x_t)^2} \dot{\hat{\mu}}(a|x_t;\phi) \dot{\hat{\mu}}(a|x_t;\phi)^{\top}]} := \Sigma_{\phi \phi}\] and \[\hat{\phi}-\phi = -\Sigma_{\phi \phi}^{-1} U(\phi) + o_p(n^{-1/2}).\] \[\widehat{V}_t(\hat{\beta},\hat{\phi}) = V_t(\beta^*) + \Gamma_t(\beta^*)^{\top} \Sigma_{\phi\phi}^{-1}U_n(\phi) + o_p(n^{-1/2}).\] Plugging in the results above and denoting \(c(\beta)=\displaystyle\sum\limits_{t=0}^{T-1}\gamma^t \Sigma_{\phi \phi}^{-1} \Gamma_t(\beta)\), we have \[\begin{align} \widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi}) =\sum\limits_{t=0}^{T-1}\gamma^t \widehat{V}_t(\hat{\beta},\hat{\phi}) &= V(\beta^*) + c(\beta^*)^{\top} U_n(\phi) + o_p(n^{-1/2}) \\ &= \frac{1}{n} \sum\limits_{i=1}^n\eta^{(i)}(\beta^*,c(\beta^*)) + o_p(n^{-1/2}), \end{align}\] for \(V(\beta)=\sum\limits_{t=0}^{T-1}\gamma^t V_t(\beta)\) and \(\eta^{(i)}(\beta,c) = \displaystyle\sum\limits_{t=0}^{T-1}\gamma^t \eta^{(i)}_t(\beta,c)\) with \[\eta_t^{(i)}(\beta,c) = \frac{1}{\mu(a_t^{(i)}|x_t^{(i)})}\bigl[ \rho_{0:t-1}^{(i)}\pi(a_t^{(i)}|x_t^{(i)})r_t^{(i)} -F_{t}(x_t^{(i)},a_t^{(i)};\beta,c,\phi) \bigr]+\sum\limits_{a \in \mathcal{A}} F_{t}(x_t^{(i)},a; \beta, c,\phi)\] ◻

*Proof.* Section 4.1 of [5] introduces an inductive definition for a DR OPE estimator with a known logging policy and a fixed value
function model. The Theorem 1 from the same paper provides the variance of this DR OPE estimator, presented as follows:> \[\begin{align}
n\mathrm{Var}[\widehat{V}] =\sum\limits_{t=0}^{T-1}\gamma^{2t} \bigl[ \text{\mathbb{E}}_{\mu} \bigl[ \mathrm{Var}_t \left[ \rho_{0:t-1}V^t(x_t) \right] + \mathrm{Var}_{t+1}[\rho_{0:t}r_t] \bigr] + \text{\mathbb{E}}_{\mu} \bigl[\mathrm{Var}_t
[\rho_{0:t}(Q^t(x_t,a_t)-\widehat{Q}(x_t,a_t))|x_t] \bigr] \bigr],
\end{align}\] where \(\mathrm{Var}_t\) refers to the conditional variance \(\mathrm{Var}(\cdot|x_0,a_0, \dots, x_{t-1},a_{t-1})\). The proof of the theorem still holds for a more
general class of \(\widehat{Q}\), which takes the state-action trajectory \((x_0,a_0, \dots, x_t,a_t)\) and the time step \(t\) as arguments. Therefore,
following the same approach as the proof of Theorem 1, we obtain the following representation of our proposed `DRUnknown`

for RL: \[\begin{align}
n\mathrm{Var}[\widehat{V}^{\text{DR}}] &=\sum\limits_{t=0}^{T-1}\gamma^{2t} \bigl[ \text{\mathbb{E}}_{\mu} \bigl[ \mathrm{Var}_t \left[ \rho_{0:t-1}V^t(x_t) \right]+ \mathrm{Var}_{t+1}[\rho_{0:t}r_t] \bigr]\\
&+ \text{\mathbb{E}}_{\mu} \bigl[\mathrm{Var}_t [\rho_{0:t}Q^t(x_t,a_t) - \mu(a_t|x_t)^{-1} F_{t}(x_t,a_t,\beta,c,\phi)|x_t] \bigr] \bigr].
\end{align}\] The first term does not depend on the parameters \(\beta\) and \(c\), so we denote it as a constant \[C_T = \sum\limits_{t=0}^{T-1}\gamma^{2t}
\text{\mathbb{E}}\bigl[ \mathrm{Var}_t \left[ \rho_{0:t-1}V^t(x_t) \right] + \mathrm{Var}_{t+1}[\rho_{0:t}r_t] \bigr].\] The variance in the second term is conditioned on \((x_0,a_0,\dots,x_{t-1},a_{t-1},x_t)\), and
thus the randomness of this variable only incurs from the action \(a_t\) at time step \(t\). Therefore, it can be calculated similarly to the proof of Proposition 2, and can be represented as a stochastic semi-norm as below, and we have the desired result. \[\begin{align}
&\mathrm{Var}_t [\rho_{0:t}Q^t(x_t,a_t) - \mu(a_t|x_t)^{-1} F_{t}(x_t,a_t,\beta,c,\phi)|x_t]\\
&= \mathrm{Var}_t [\sum\limits_{a \in \mathcal{A}}\Delta_a^t [\rho_{0:t-1} \frac{\pi(a|x_t)}{\mu(a|x_t)} Q^t(x_t,a) - \mu(a|x_t)^{-1} F_{t}(x_t,\beta,c,\phi)|x_t]]\\
&= \left\Vert \vec{F}_{t}(x_t;\beta,c,\phi)- \rho_{0:t-1} \pi \vec{Q}^t(x_t)\right\Vert ^2_{M_{t,\mu}}
\end{align}\] ◻

*Proof.* Theorem 3 of [26] states that in the OPE problem, the DR estimator with the true logging policy \(\mu\) and value function \(Q\), given by \[\widehat{V}^{\text{opt}} = \frac{1}{n} \sum\limits_{i=1}^n\sum\limits_{t=0}^{T-1}\gamma^{t} \rho_{0:{t-1}}^{(i)}
\bigl[\rho_t^{(i)}[r_t^{(i)} - Q^t(x_t^{(i)}, a_t^{(i)})] + V^t(x_t^{(i)})\bigr],\] achieves the smallest asymptotic variance, reaching the semiparametric lower bound, for the case with a discount factor \(\gamma=1\). When the value function model is also correctly specified, the proposed `DRUnknown`

is asymptotically equivalent to \(\widehat{V}^{\text{opt}}\), as the \(\Gamma_t(\beta^*)=0\) for all \(t\) and \(c(\beta^*)=0\). For the general problem with \(\gamma<1\), we can modify the MDP by
fixing the discount factor to 1, incorporating the time step \(t\) into the state variable \(x\), and changing the reward function \(R(x,a)\) to \(\gamma^t R(x,a)\). ◻

*Proof.* The IPW [17], DR [1], [5], and MRDR [8] estimators are originally designed for the OPE problem with a known logging policy \(\mu\). Therefore, for comparison, we assume that the logging policy model is estimated by MLE, as same
as in our proposed estimator. We can observe that all three estimators are asymptotically equivalent to \(\widetilde{V}(\beta,c)\) for some value of \(\beta\) and fixed \(c=0\). As all three estimators cannot take the estimation effect of \(\hat{\phi}\) into account, they all have \(c=0\).

The IPW sets the value of \(\beta\) such that the value function model \(\widehat{Q}\) becomes zero, the DR estimator finds \(\beta\) by minimizing the
least-squares error for the value function. MRDR minimizes the variance of the estimator only with respect to \(\beta\), with \(c\) fixed to zero. The class of estimators contains all these
estimators, and the proposed `DRUnknown`

achieves the smallest asymptotic variance among them, being at least more efficient. ◻

The maximum likelihood estimator used to estimate \(\hat{\phi}\) in this work is most efficient in many situations. However, the theoretical results in this paper are applicable to other estimating equations as given below, \[U_n(\phi)=\sum\limits_{i=1}^n\sum\limits_{t=0}^{T-1}\sum\limits_{a \in \mathcal{A}}(\Delta_{a,t}^{(i)} -\mu(a|x_t^{(i)}) )h(x_t^{(i)},a;\phi)=0,\] for any smooth function \(h\). The equation \(U_n(\phi)=0\) is an unbiased estimating equation as long as the logging policy model is correctly specified. A possible choice is to assign more weight to the state-action pair with high probability in \(\pi\). This estimator remains consistent, satisfies the theoretical properties, and may be particularly useful for scenarios with a small-sized finite sample.

To build the synthetic dataset for the simulation experiment, we generate elements for the context vectors \(x\) and the coefficient vector \(\beta\) from the uniform distribution \(U(-1/\sqrt{d},1/\sqrt{d})\). The reward mean is determined by the nonlinear function \(\exp(x^{t}\beta)\), making the linear value function model incorrectly specified.

For the logging policy \(\mu\) and the target policy \(\pi\), we generate the coefficients \(\phi_{\mu}\) and \(\phi_{\pi}\) from the uniform distribution \(U(-1/\sqrt{d},1/\sqrt{d})\) and \(U(-2/\sqrt{d},2/\sqrt{d})\), respectively.

The six datasets used in these experiments were initially designed for the classification problem. To transform the problem into a bandit setting, we interpret the assignment of the class label as the selection of an arm in a bandit, with a reward of 1 if the class is correct and 0 if incorrect. We train the classifier \(\mu_0\), which returns the probability of each class label given a context vector \(x\). We treat \(\mu_0\) as a policy and combine it with a uniform random policy with rates \(\alpha\) and \(1-\alpha\). The value function model is constant and can be regarded as an intercept value.

ModelWin consists of three states, starting from state \(s_1\). When choosing action \(a_1\), the agent moves to \(s_2\) with a probability of 0.6 and to \(s_3\) with a probability of 0.4. Conversely, action \(a_2\) leads to \(s_2\) with a probability of 0.4 and to \(s_3\) with a probability of 0.6. In states \(s_2\) and \(s_3\), both actions return the agent to \(s_1\) with a probability of 1. If the agent visits \(s_2\) or \(s_3\), it receives rewards of 1 and -1, respectively. The horizon is fixed at \(T = 20\).

ModelFail is an MDP with 4 states, but the learner cannot observe the current state of the agent. Starting from \(s_1\), action \(a_1\) leads to the upper middle state \(s_2\), and \(a_2\) to the lower middle state \(s_3\). From both, any action moves to the terminal state \(s_4\). If the transition is from the upper state, a reward of 1 is received; otherwise, a reward of -1. The horizon is always \(T = 2\).

For both environments, the target policy at \(s_1\) selects action \(a_1\) with a probability of 0.7 and \(a_2\) with a probability of 0.3. The logging policy chooses action \(a_1\) with a probability of 0.75 and \(a_2\) with a probability of 0.25. The parameter of the logging policy model \(\hat{\mu}\) for both environments is the probability of choosing \(a_1\), and the value function model \(\widehat{Q}\) is given by the linear model with intercepts.

This paper primarily focuses on the asymptotic properties of the proposed estimator for the OPE problem with an unknown logging policy. We do not extensively explore the estimator’s behavior with a finite sample. This aligns with similar statistical works, such as [10], which addresses estimated missing mechanisms without providing finite-sample theory. Additionally, current DR OPE methods do not address scenarios with an unknown logging policy.

[1]

Miroslav Dudı́k, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In *Proceedings of the 28th International Conference on International
Conference on Machine Learning*, pages 1097–1104, 2011.

[2]

Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. *Advances in Neural
Information Processing Systems*, 30, 2017.

[3]

Doina Precup. Eligibility traces for off-policy policy evaluation. *Computer Science Department Faculty Publication Series*, page 80, 2000.

[4]

A Rupam Mahmood, Hado P Van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation. *Advances in neural information
processing systems*, 27, 2014.

[5]

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In *International Conference on Machine Learning*, pages 652–661. PMLR, 2016.

[6]

Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In *International Conference on Machine Learning*, pages 2139–2148. PMLR,
2016.

[7]

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudık. Optimal and adaptive off-policy evaluation in contextual bandits. In *International Conference on Machine Learning*,
pages 3589–3597. PMLR, 2017.

[8]

Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In *International Conference on Machine Learning*, pages 1447–1456.
PMLR, 2018.

[9]

Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudı́k. Doubly robust off-policy evaluation with shrinkage. In *International Conference on Machine
Learning*, pages 9167–9176. PMLR, 2020.

[10]

Weihua Cao, Anastasios A Tsiatis, and Marie Davidian. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. *Biometrika*,
96 (3): 723–734, 2009.

[11]

Daniel B Rubin and Mark J van der Laan. Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis. *The
International Journal of Biostatistics*, 4 (1), 2008.

[12]

Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In *Artificial Intelligence and Statistics*, pages 608–616.
PMLR, 2015.

[13]

Aniruddh Raghu, Omer Gottesman, Yao Liu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, and Emma Brunskill. Behaviour policy estimation in off-policy policy evaluation:
Calibration matters. *arXiv preprint arXiv:1807.01066*, 2018.

[14]

Yuan Xie, Boyi Liu, Qiang Liu, Zhaoran Wang, Yuan Zhou, and Jian Peng. Off-policy evaluation and learning from logged bandit feedback: Error reduction via surrogate policy. In *7th
International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019*. OpenReview.net, 2019. URL https://openreview.net/forum?id=HklKui0ct7.

[15]

Josiah P Hanna, Scott Niekum, and Peter Stone. Importance sampling in reinforcement learning with an estimated behavior policy. *Machine Learning*, 110 (6): 1267–1317, 2021.

[16]

Christoph Rothe. The value of knowing the propensity score for estimating average treatment effects. *Available at SSRN 2797560*, 2016.

[17]

Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. *Journal of the American statistical Association*, 47 (260):
663–685, 1952.

[18]

Claes M Cassel, Carl E Särndal, and Jan H Wretman. Some results on generalized difference estimation and generalized regression estimation for finite populations.
*Biometrika*, 63 (3): 615–620, 1976.

[19]

James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. *Journal of the American Statistical Association*, 90 (429):
122–129, 1995.

[20]

B. German. . UCI Machine Learning Repository, 1987. : https://doi.org/10.24432/C5WW2P.

[21]

David Slate. . UCI Machine Learning Repository, 1991. : https://doi.org/10.24432/C5ZP40.

[22]

Richard Forsyth. . UCI Machine Learning Repository, 1990. : https://doi.org/10.24432/C5R59V.

[23]

Image Segmentation. UCI Machine Learning Repository, 1990. : https://doi.org/10.24432/C5GP4N.

[24]

R. A. Fisher. . UCI Machine Learning Repository, 1988. : https://doi.org/10.24432/C56C76.

[25]

E. Alpaydin and C. Kaynak. . UCI Machine Learning Repository, 1998. : https://doi.org/10.24432/C50P49.

[26]

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient and robust off-policy evaluation. In *International Conference on Machine Learning*, pages
5078–5088. PMLR, 2020.