Robust Offline Reinforcement Learning with Linearly Structured \(f\)-Divergence Regularization


Abstract

The Distributionally Robust Markov Decision Process (DRMDP) is a popular framework for addressing dynamics shift in reinforcement learning by learning policies robust to the worst-case transition dynamics within a constrained set. However, solving its dual optimization oracle poses significant challenges, limiting theoretical analysis and computational efficiency. The recently proposed Robust Regularized Markov Decision Process (RRMDP) replaces the uncertainty set constraint with a regularization term on the value function, offering improved scalability and theoretical insights. Yet, existing RRMDP methods rely on unstructured regularization, often leading to overly conservative policies by considering transitions that are unrealistic. To address these issues, we propose a novel framework, the \(d\)-rectangular linear robust regularized Markov decision process (\(d\)-RRMDP), which introduces a linear latent structure into both transition kernels and regularization. For the offline RL setting, where an agent learns robust policies from a pre-collected dataset in the nominal environment, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI), employing linear function approximation and \(f\)-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, showing these bounds depend on how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. This term is further shown to be fundamental to \(d\)-RRMDPs via information-theoretic lower bounds. Finally, numerical experiments validate that R2PVI learns robust policies and is computationally more efficient than methods for constrained DRMDPs.

1 Introduction↩︎

Offline reinforcement learning (Offline RL) [1] enables policy learning without direct interaction with the environment, necessitating robust policies that remain effective under environment shifts [2][5]. A widely adopted framework for learning such policies is the distributionally robust Markov decision process (DRMDP) [6], [7], which models dynamics changes as an uncertainty set around a nominal transition kernel. In this setup, the agent seeks to perform well even in the worst-case environment within the uncertainty set. The most common design of these uncertainty sets is the \((s,a)\)-rectangularity [4], [8], [9], which models uncertainty independently for each state-action pair. Although mathematically elegant, the \((s,a)\)-rectangularity can include unrealistic dynamics, often resulting in overly conservative policies. To address this issue, [10] introduce the \(r\)-rectangular set, which parameterizes transition kernels using latent factors. When focusing on linear function classes, this concept has been extended to \(d\)-rectangular linear DRMDPs (\(d\)-DRMDPs), where latent structures in transition kernels and uncertainty sets are linearly parameterized by feature mappings. Building on \(d\)-DRMDPs, recent works [11][13] propose computationally efficient algorithms with provable guarantees that leverage function approximation for robust policy learning and extrapolation.

However, the \(d\)-DRMDP framework has several limitations that remain unaddressed in existing works, which we summarize as follows. Theoretical Gaps: Current understanding of \(d\)-DRMDPs is largely restricted to uncertainty sets defined by the Total Variation (TV) divergence [13], [14]. For uncertainty sets defined by the Kullback-Leibler (KL) divergence, prior works [8], [11] rely on additional regularity assumptions regarding the KL dual variable, which is hard to validate in practice. Moreover, \(\chi^2\)-divergence-based uncertainty sets has been shown to be effective in certain empirical RL applications [15] and also been analyzed under the \((s,a)\)-rectangularity [16]. Yet there are no theoretical results or efficient algorithms in \(d\)-DRMDPs. Practical Challenges: Existing algorithms [11][13] depend on a dual optimization oracle (see Remark 4.2 in [14]) to estimate the robust value function. The computational complexity of these methods is proportional to the feature dimension \(d\) and the planning horizon \(H\). While heuristic methods like the Nelder-Mead algorithm [17] can approximate the oracle, they become computationally expensive for high-dimensional features (large \(d\)) and extended planning horizons (large \(H\)), which are common in real-world applications. These limitations raise an important question:

Can we design provably and computationally efficient algorithms for robust RL using general \(f\)-divergence5 for modeling uncertainty in structured transition dynamics?

In this work, we provide a positive answer to this question. Inspired by the robust regularized MDP (RRMDP) framework under the \((s,a)\)-rectangularity condition [4], [18], [19], where the uncertainty set constraint in DRMDP is replaced by a regularization penalty term measuring the divergence between the nominal and perturbed dynamics, we propose the \(d\)-rectangular linear RRMDP (\(d\)-RRMDP) framework. Specifically, \(d\)-RRMDP replaces the \(d\)-rectangular uncertainty set in \(d\)-DRMDPs with a carefully designed penalty term that preserves the linear structure. The motivations are two folds: (1) it has been shown by [18] that the robust policy under the RRMDP is equivalent to that under the DRMDP for unstructured uncertainty as long as the regularizer is properly chosen; (2) removing the uncertainty set constraint simplifies the dual problem for certain divergences [20], potentially improving computational efficiency and facilitating theoretical analysis. In this paper, we show that the above two advantages hold for our proposed \(d\)-RRMDP. We summarize our contributions as follows:

  • We establish that key dynamic programming principles, including the robust Bellman equation and the existence of deterministic optimal robust policies, hold under the \(d\)-RRMDP framework. Additionally, we derive dual formulations of robust Q-functions for TV, KL, and \(\chi^2\) divergences, highlighting their linear structures.

  • We propose a computationally efficient meta-algorithm, Regularized Robust Pessimistic Value Iteration (R2PVI), for offline \(d\)-RRMDPs with general \(f\)-divergence regularization. For TV, KL, and \(\chi^2\) divergences, we provide instance-dependent upper bounds on the suboptimality gap of policies learned by R2PVI. These bounds take the general form: \[\beta \sup_{P \in \mathcal{U}^\lambda(P^0)} \sum_{h=1}^H \mathbb{E}^{\pi^\star, P} \bigg[\sum_{i=1}^d \| \phi_i(s, a) \mathbf{1}i \|{\Lambda_h^{-1}} \mid s_1 = s \bigg],\] where \(d\) is the feature dimension, \(H\) is the horizon length, \(\bphi(s, a)\) is the feature mapping, \(\lambda\) is the regularization parameter, and \(\beta\) is a problem-dependent parameter whose specific form depends on the choice of the divergence (see for details). The set \(\mathcal{U}^\lambda(P^0)\) is derived from our theoretical analysis, though it does not represent an uncertainty set in the conventional DRMDP framework. We further construct an information-theoretic lower bound, demonstrating that this instance-dependent uncertainty function is intrinsic to the problem.

  • We conduct experiments in simulated environments, including a linear MDP setting [14] and the American Put Option environment [11]. Our findings show that: 1. The \(d\)-RRMDP framework yields equivalent robust policies as \(d\)-DRMDP with appropriately chosen regularization parameters. 2. R2PVI significantly improves computational efficiency compared to algorithms for \(d\)-DRMDPs. 3. R2PVI with TV and KL divergences achieves computational efficiency comparable to algorithms designed for standard linear MDPs.

Notations. In this paper, we denote \(\Delta(\cS)\) as the probability distribution over the state space \(\cS\). For any \(H \in \NN\), \([H]\) represents the set \(\{ 1,2,3,\cdots,H \}\). For a vector \(\bv\in\RR^d\), we denote \(v_i\) as the \(i\)-th element. For any function \(V:\cS\rightarrow[0,H]\), we denote \(V_{\min}= \min_{s\in\cS}V(s)\) and \(V_{\max} = \max_{s\in\cS}V(s)\). For any distribution \(\mu\in\Delta(\cS)\), we denote \(\Var_{s\sim\mu}V(s)\) as the variance of the random variable \(V(s)\) under \(\mu\). For any two probability measures \(P\) and \(Q\) satisfying that \(P\) is absolute continuous with respect to \(Q\), the \(f\)-divergence is defined as \(D_{f}(P \| Q)=\int_{\mathcal{S}}f(P(s)/Q(s))Q(s) d s\), where \(f\) is a convex function on \(\RR\) and differentiable on \(\RR_+\) satisfying \(f(1)=0\) and \(f(t)=+\infty, \forall t<0\). The Total Variation (TV) divergence, Kullback-Leibler (KL) divergence and Chi-Square (\(\chi^2\)) divergence between \(P\) and \(Q\) are defined by \(f(x) = |x-1|/2, f(x) =x \log x, f(x) =(x-1)^2\), respectively. Given a scalar \(\alpha\), we denote \([V(s)]_\alpha = \min\{V(s), \alpha\}\). We denote \(\mathbf{I}\) as the identity matrix and \(\mathbf{1}_i\in\RR^d\) as the one-hot vector with the \(i\)-th element equals to one.

2 Related Work↩︎

Distributionally Robust MDPs. The seminal works of [6], [7] proposed the framework of DRMDP. There are several lines of works studying DRMDP under different settings. [4], [9], [21] studied the offline DRMDP assuming access to an offline dataset, and provided sample complexity bounds under the coverage assumption on the offline dataset. [14], [22], [23] studied the online DRMDP where an agent learns robust policies by actively interacting with the nominal environment. [8], [21] studied the DRMDP with general function approximation, they focused on the offline setting with the \((s,a)\)-rectangularity assumption. [11][13] studied the offline \(d\)-DRMDP, they proposed provably and computationally efficient algorithms and provided sample complexity bounds under different kinds of coverage assumptions on the offline dataset.

RRMDPs. The work of [18], [20] proposed the RRMDP, which can be regarded as a generalization of the DRMDP by substituting the uncertainty set constraint in DRMDP with the regularization term defined based on the divergence between the perturbed dynamics and the nominal dynamics. In particular, [18] studied the tabular RRMDP and proposed a model-free algorithms assuming access to a simulator. [20] studied the offline RRMDP, they established connections between RRMDPs with risk sensitive MDPs, and derived the policy gradient principle. Moreover, they studied general function approximation and proposed a computationally efficient algorithm, RFZI, under the RRMDP with the regularization term defined based on the KL-divergence. [20] firstly discovered that the duality of the robust value function has a closed expression under the KL-divergence. [19] studied the offline RRMDP with regularization terms defined by general \(f\)-divergence. They studied general function approximation and provided sample complexity result under the general \(f\)-divergence. They further proposed a hybrid algorithm which learns robust policies with both historical data and online sampling, for RRMDP with TV-divergence regularization term. Their works mostly focus on \((s,a)\)-rectangularity uncertainty regularization, which is different from ours.

3 Problem formulation↩︎

3.0.0.1 Markov decision process (MDP).

We first introduce the concept of MDPs, which is the basis of our settings. Specifically, we denote \(\text{MDP}(\cS, \mathcal{A}, H, P^0, r)\) as a finite horizon MDP, where the \(\mathcal{S}\) is the state space, \(\mathcal{A}\) is the action space, \(H\) is the horizon length, \(P^0 = \{P^0_h \}_{h = 1}^H\) are nominal transitional kernels, and the \(r(s,a) \in [0,1]\) is the deterministic reward function assumed to be known in advance. For any policy \(\pi\), the value function and Q-function at time step \(h\) is defined as \[\begin{align} &V_h^{\pi}(s) = \EE^{P^0} \bigg[\sum_{t = h}^H r_t(s_t, a_t) \Big| s_h = s, \pi \bigg], &Q^{\pi}_h(s,a) = \EE^{P^0} \bigg[ \sum_{t = h}^H r_t(s_t, a_t) \Big| s_h = s, a_h=a, \pi \bigg]. \end{align}\]

3.0.0.2 The robust regularized MDP (RRMDP)

The RRMDP framework with a finite horizon is defined by \(\text{RRMDP}(\cS, \mathcal{A}, H, P^0, r, \lambda, D, \mathcal{F})\), where \(\lambda\) is the regularizer, \(D\) is the probability divergence metric, and \(\cF\) is the feasible set of all perturbed transition kernels. For any policy \(\pi\), the robust regularized value function and Q-function are defined as \[\begin{align} &V_h^{\pi, \lambda}(s) = \inf_{P \in \mathcal{F}} \EE^{P} \bigg[\sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda D(P_t(|s_t,a_t)\|P_t^0(\cdot|s_t,a_t))\big] \Big| s_h = s, \pi \bigg], \tag{1}\\ &Q^{\pi, \lambda}_h(s,a) = \inf_{P \in \mathcal{F}} \EE^{P} \bigg[ \sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda D(P_t(\cdot|s_t,a_t)\|P_t^0(\cdot|s_t,a_t))\big] \tag{2} \Big| s_h = s, a_h=a, \pi \bigg]. \end{align}\] The RRMDP framework has been referred to by different names in the literature, including the penalized robust MDP [18], the soft robust MDP [20], and the robust \(\phi\)-regularized MDP (RRMDP) [19]. For consistency, we adopt the term RRMDP in this work. In RRMDPs, the transition kernel class \(\cF\) typically encompasses all possible kernels. However, for environments with large state-action spaces, \(\cF\) may be overly broad, including transitions that are unrealistic or irrelevant. To address this, we introduce latent structures on transition kernels and design regularization penalties that capture dynamics changes in the latent structure, which is similar to \(r\)-rectangular MDPs [10] and \(d\)-rectangular linear DRMDPs [11].

3.0.0.3 The \(d\)-rectangular linear RRMDP (\(d\)-RRMDP).

In this paper, we propose a novel \(d\)-RRMDP, which admits linear structure of feasible set and reward function. Specifically, a \(d\)-RRMDP is a RRMDP where the nominal environment \(P^0\) is a special case of linear MDP with a simplex feature space [24], and the feasible sets \(\cF\) involves kernels defined based on the linear structure of the nominal transition kernel. Specifically, we make the following assumption:

Assumption 1 ([24]). Given a known state-action feature mapping \(\bphi: \cS \times \cA \rightarrow \RR^d\) satisfying \(\sum_{i=1}^d\phi_i(s,a) = 1, \phi_i(s,a) \geq 0\). Further more, we assume the reward function \(\{r_h\}_{h=1}^H\) and the nominal transition kernels \(\{P_h^0\}_{h=1}^H\) admit linear structures. Specifically, we have \[\begin{align} r_h(s,a) = \langle \bphi(s,a), \btheta_h \rangle,~ P_h^0(\cdot |s,a) = \langle \bphi(s,a), \bmu_h^0(\cdot) \rangle, ~\text{for all} ~(h,s,a)\in[H]\times\cS\times\cA, \end{align}\] where \(\{ \btheta_h \}_{h=1}^H\) are known vectors with bounded norm \(\| \btheta_h \|_2 \leq \sqrt{d}\) and \(\{ \bmu^0_h \}_{h=1}^H\) are unknown probability measure vectors over \(\cS\), i.e., \(\bmu^0_h = (\mu^0_{h,1}, \mu^0_{h,2}, \cdots ,\mu^0_{h, d})\), \(\mu^0_{h, i} \in \Delta(\cS), \forall i \in [d]\).

With , due to our concentration on linear-structured based feasible set \(\mathcal{F}\), the robust regularized value function and Q-function are defined as \[\begin{align} V_h^{\pi, \lambda}(s) &= \inf_{\bmu_t \in \Delta(\cS)^d,P_t=\la \bphi,\bmu_t \ra} \EE^{\{P_t\}_{t=h}^H} \bigg[\sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\bmu_t||\bmu_t^0) \ra\big] \Big| s_h = s, \pi \bigg], \label{def:definition32of32d-rec32value32function}\\ Q_h^{\pi, \lambda}(s,a) &= \inf_{\bmu_t \in \Delta(\cS)^d,P_t=\la \bphi,\bmu_t \ra} \EE^{\{P_t\}_{t=h}^H} \bigg[\sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\bmu_t||\bmu_t^0) \ra\big] \Big| s_h = s, a_h = a, \pi \bigg], \nonumber \end{align}\tag{3}\] where \(\bD(\bmu||\bmu^0) = \big( D(\mu_{1} \| \mu^0_{1}), D(\mu_{2} \| \mu^0_{2}) ,\cdots, D(\mu_{d} \| \mu^0_{d}) \big)^\top\) is the concatenated vector of \(D(\mu_{i} \| \mu^0_{i})\). In other words, we only consider perturbed kernels in the linear feasible set \[\begin{align} \cF_{\text{L}} = \{P=\{P_h\}_{h=1}^H | P_h(\cdot|s,a) = \la\bphi(s,a),\bmu_h(\cdot)\ra, \bmu_h = (\mu_{h, 1}, \mu_{h, 2}, ... \mu_{h,d})^{\top}, \mu_{h, i} \in \Delta(\cS), \forall i \in [d]\}. \end{align}\] The optimal robust regularized value function and Q-function is defined as: \[\begin{align} V_h^{\star, \lambda}(s) = \sup_{\pi}V_h^{\pi, \lambda}(s), Q_h^{\star, \lambda}(s,a) = \sup_{\pi}Q_h^{\pi. \lambda}(s,a). \label{def:optimal32value32function} \end{align}\tag{4}\] Based on 4 , the optimal robust policy is defined as the policy that achieves the optimal robust regularized value function at time step 1, i.e, \(\pi^{\star,\lambda} = \argmax_{\pi}V_1^{\pi, \lambda}(s),~\forall s\in\cS\).

3.0.0.4 Dynamic Programming Principles for \(d\)-RRMDP

For completeness, we first show that the dynamic programming principles [25] hold for the \(d\)-rectangular linear RRMDP.

(Robust Regularized Bellman Equation) Under the \(d\)-rectangular linear RRMDP, for any policy \(\pi\) and any \((h,s,a) \in [H]\times \cS \times \cA\), we have \[\begin{align} Q_{h}^{\pi, \lambda}(s, a) &= r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big], \tag{5} \\ V_h^{\pi, \lambda}(s) &= \EE_{a \sim \pi(\cdot |s)}\big[Q_{h}^{\pi, \lambda}(s, a)\big]. \tag{6} \end{align}\]

Next, we show that the optimal robust policy is deterministic and stationary.

Under the setting of \(d\)-rectangular linear RRMDP, there exists a deterministic and stationary policy \(\pi^\star\), such that for any \((h,s,a) \in [H]\times \cS \times \cA\), \[\begin{align} V^{\pi^\star, \lambda}_h(s) = V^{\star, \lambda}_{h}(s), Q^{\pi^\star, \lambda}_h(s, a) = Q^{\star, \lambda}_h(s,a). \label{eq:optimal32value32function} \end{align}\tag{7}\]

With these results, we can restrict the policy class \(\Pi\) to the deterministic and stationary one. This leads to the robust regularized Bellman optimality equation: \[\begin{align} Q_{h}^{\star, \lambda}(s, a) &= r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big], \nonumber \\ V_h^{\star, \lambda}(s) &= \max_{a\in\cA}Q_{h}^{\pi, \lambda}(s, a). \label{eq:32robust32regularized32Bellman32optimality32equation32-322} \end{align}\tag{8}\]

3.0.0.5 Offline Dataset and Learning Goal.

Assume that we have an offline dataset \(\cD\) consisting \(K\) i.i.d. trajectories collected from the nominal environment by a behavior policy \(\pi^b\). Specifically, for the \(\tau\)-th trajectory \(\{(s_h^{\tau}, a_h^{\tau}, r_h^{\tau})\}_{h=1}^H\), we have \(a_h^{\tau}\sim \pi_h^b(\cdot|s_h^{\tau})\), \(r_h^{\tau}=r_h(s_h^{\tau},a_h^{\tau})\), and \(s_{h+1}^{\tau}\sim P^0_h(\cdot|s_h^{\tau},a_h^{\tau})\) for any \(h\in[H]\). The goal of the offline RRMDP is to learn the optimal robust policy \(\pi^{\star}\) from the offline dataset \(\cD\). The suboptimality gap between any policy \(\hat{\pi}\) and the optimal robust policy \(\pi^{\star}\) is \[\begin{align} \label{eq:definition32of32suboptimality} \text{SubOpt}(\hat{\pi}, s_1, \lambda) := V_1^{\star, \lambda}(s_1) - V_1^{\hat{\pi}, \lambda}(s_1). \end{align}\tag{9}\] The goal of an agent is to learn a \(\hat{\pi}\) that minimizes the suboptimality gap \(\text{SubOpt}(\hat{\pi}, s, \lambda), \forall s\in\cS\).

4 Robust Regularized Pessimistic Value Iteration↩︎

In this section, we first develop a meta-algorithm for offline \(d\)-rectangular linear RRMDP with \(D\) being the general \(f\)-divergence. Then in order to instantiate the meta-algorithm, we provide exact dual form solution under specific the \(f\)-divergence \(D\) as Total-Variation (TV) divergence, Kullback-Leibler (KL) divergence and Chi-Square (\(\chi^2\)) divergence, respectively.

4.1 Regularized Robust Pessimism Iteration (R2PVI)↩︎

We first show that under the \(d\)-rectangular linear RRMDP, \(Q_h^{\pi, \lambda}(s,a)\) admits a linear representation.

Under , for any \((\pi, s, a, h) \in \Pi \times \cS \times \cA \times [H]\), we have \[\begin{align} \label{eq:linear32representation} Q_h^{\pi, \lambda}(s,a) = \langle \bphi(s,a), \btheta_h + \bw_h^{\pi, \lambda} \rangle, \end{align}\tag{10}\] where \(\bw_h^{\pi, \lambda} = \big(w_{h, 1}^{\pi, \lambda}, w_{h,2}^{\pi, \lambda},\cdots, w_{h,d}^{\pi, \lambda}\big)^\top \in \RR^d\), and \(w_{h, i}^{\pi, \lambda} = \inf_{\mu_{h,i} \in \Delta(s)} \big[\EE^{\mu_{h,i}}\big[V_{h+1}^{\pi, \lambda}(s)\big] + \lambda D(\mu_{h,i} \| \mu_{h, i}^0)\big]\).

The linear representation of the robust Q-function enables linear function approximation. Then, it remains to obtain an estimation, \(\hat{\bw}_h^\lambda\), of the parameter \(\bw_h^\lambda\). For now, we focus on the algorithm design and assume that we have an oracle to get \(\hat{\bw}_h^\lambda\). We will instantiate the estimation procedure under different cases with different probability divergence \(D\) in the following sections. Leveraging the robust regularized Bellman equation and the pessimism principle [13], [26] well-developed to take account of the distribution shift of the offline dataset, we propose the meta-algorithm in .

Figure 1: Robust Regularized Pessimistic Value Iteration (R2PVI)

presents a meta-algorithm for the \(d\)-rectangular linear RRMDP with any probability divergence metric \(D\). The algorithm follows a value iteration framework, integrating the pessimism principle to iteratively estimate the robust regularized Q-function. Based on the robust regularized Bellman optimality equation 8 , the estimated robust policy is derived as the greedy policy with respect to the estimated robust regularized Q-function. In subsequent sections, we instantiate \(D\) as commonly studied \(f\)-divergences, including the TV-, KL-, and \(\chi^2\)-divergences. We also detail the parameter estimation procedure in Line 1 and the construction of the pessimism penalty in Line [eq:pessimism32penalty] of .

4.2 R2PVI with the TV-Divergence↩︎

In this section, we show how to get the estimation in Line 1 of R2PVIfor TV divergence based regularization. We first present a result on the duality of the TV-divergence.

Given any probability measure \(\mu^0\in\Delta(\cS)\) and value function \(V:\cS\rightarrow[0,H]\), if the distance \(D\) is chosen as the TV-divergence, the dual formulation of the original regularized optimization problem is formed as: \[\begin{align} \label{eq:32duality32for32TV} \inf_{\mu \in \Delta (S)}\EE_{s\sim\mu}V(s) + \lambda D_{\text{TV}}(\mu\| \mu^0) =\EE_{s\sim\mu^0}[V(s)]_{\min_{s'}\{V(s')\}+ \lambda}. \end{align}\tag{11}\]

We compare the duality of the regularized problem in with the duality of the constraint problem with the TV-divergence in DRMDP [9]: \[\begin{align} \inf_{P \in \mathcal{U}^{\rho}(P^0)} \EE^{P} V(s) = \max_{\alpha \in [V_{\min}, V_{\max}]} \big\{ \EE^{P^0}[V(s)]_{\alpha} - \rho(\alpha - \min_{s'}[V(s')]_{\alpha}) \big\}. \end{align}\] We highlight that the former has a closed form, while the later needs to be solved through optimization. And both involve a truncation on the value function. We will see later this distinction makes R2PVI much more computationally efficient compared to algorithms designed for DRMDP.

Next, we present the parameter estimation procedure. Given an estimated robust value function \(\hat{V}^{\lambda}_{h+1}\), we denote \(\alpha_{h+1} = \min_{s'} \hat{V}^{\lambda}_{h+1}(s') + \lambda\). According to the linear representation of the Q-function , the duality for TV-divergence and the linear structure of the nominal kernel in , we estimate the parameter \(\bw_h^\lambda\) as follows \[\begin{align} \hat{\bw}_h^{\lambda} =\argmin_{\bw \in R^d} \sum_{\tau=1}^K \big([\hat{V}_{h+1}^{\lambda}(s_{h+1}^\tau)]_{\alpha_{h+1}} - \bphi(s_h^\tau, a_h^{\tau})^{\top}\bw\big)^2 + \gamma \| \bw \|_2^2, \label{TV-estimation} \end{align}\tag{12}\] where \(\gamma\) is a stabilization parameter to ensure numerical stability and prevent the matrix from becoming ill-conditioned or singular.

Thanks to the closed form expression of the duality for TV in , R2PVI does not need the dual oracle as the DRPVI algorithm proposed for the \(d\)-rectangular linear DRMDP [13]. They need to solve dual oracle separately for each dimension in each iteration, which is not necessary in our algorithm.

4.3 R2PVI with the KL-Divergence↩︎

Similar to TV-divergence, we next show how to get the estimation in Line 1 of Algorithm for KL divergence based regularization. We first present a result on the duality of the KL-divergence.

[20] Given any probability measure \(\mu^0\in\Delta(\cS)\) and value function \(V:\cS\rightarrow[0,H]\), if the probability divergence \(D\) is chosen as KL-divergence, the dual formulation of the original regularized optimization problem is: \[\begin{align} \inf_{\mu \in \Delta(S)}\EE_{s\sim\mu}V(s) + \lambda D_{\text{KL}}(\mu\|\mu^0) = -\lambda \log \EE_{s\sim \mu^0}\big[e^{-\frac{V(s)}{\lambda}}\big]. \label{duality:32Kl} \end{align}\tag{13}\]

The duality of KL also has a closed form, which is first observed in [20]. We will see that this property will not only lead to more computationally efficient algorithm, but also simplify the theoretical analysis. Next, we present the parameter estimation procedure. According to the linear representation of the Q-function in , the duality for TV-divergence in and the linear structure of the nominal kernel in , we estimate the parameter \(\bw_h^\lambda\) by a two-step procedure. Given an estimated robust value function \(\hat{V}^{\lambda}_{h+1}\), we first solve the following ridge-regression to get an estimation of \(\EE_{s\sim\bmu^0}e^{-\hat{V}^{\lambda}_{h+1}(s)/\lambda}\) \[\begin{align} \hat{\bw}'_h = \argmin_{\bw \in \RR^d} \sum_{\tau=1}^K \Big(e^{-\frac{\hat{V}_{h+1}^{\lambda}(s^\tau_{h+1})}{\lambda}} - \bphi(s_h^\tau, a_h^{\tau})^{\top}\bw \Big)^2 + \gamma \| \bw \|_2^2. \label{ridge-regression-KL} \end{align}\tag{14}\] We then take a log-transformation to get an estimation of \(\bw_h^\lambda\): \[\begin{align} \hat{\bw}^\lambda_h= -\lambda \log \max\{\hat{\bw}'_h, e^{-H/\lambda}\} \label{KL-estimation} \end{align}\tag{15}\] Note that the max operator is to ensure the ridge-regression estimator is well-defined to take log-transformation, and \(e^{-H/\lambda}\) is a lower bound on \(\EE_{s\sim\bmu^0}e^{-\hat{V}^{\lambda}_{h+1}(s)/\lambda}\).

[8], [11] need to deal with hard dual oracle under KL divergence, while our algorithms have close form solution. Their works require sophisticated techniques to guarantee the estimated parameter well-defined to take log-transformation, while our algorithm does not need such techniques, which reduces the computational cost and eases the theoretical analysis (see and for detailed comparison).

4.4 R2PVI with the \(\chi^2\)-Divergence↩︎

Now it remains to show how to get the estimation in Line 1 of Algorithm for \(\chi^2\)-divergence based regularization. We first present a result on the duality of the \(\chi^2\)-divergence.

Given any probability measure \(\mu^0\in\Delta(\cS)\) and value function \(V:\cS\rightarrow[0,H]\), if the probability divergence \(D\) is chosen as the \(\chi^2\)-divergence, the dual formulation of the original regularized optimization problem is: \[\begin{align} \label{eq:32x232robust32bellman32equation} \inf_{\mu \in \Delta(S)} \EE_{s\sim\mu}V(s) + \lambda D_{\chi^2}(\mu\| \mu^0) = \sup_{\alpha \in [V_{\min}, V_{\max}]}\Big\{ \EE_{s\sim\mu^0}[V(s)]_{\alpha} - \frac{1}{4\lambda}\Var_{s\sim\mu^0}[V(s)]_{\alpha} \Big\} \end{align}\tag{16}\]

Next, we present the parameter estimation procedure. According to the linear representation of the Q-function in , the duality for \(\chi^2\)-divergence in and the linear structure of the nominal kernel in , estimate the parameter \(\bw_h^\lambda\) as follows. To estimate the variance of the value function in 16 , we propose a new method motivated by the variance estimation in [13]. Specifically, given an estimated robust value function \(\hat{V}^{\lambda}_{h+1}\) and dual variable \(\alpha\), we first estimate \(\EE_{s\sim\mu^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha}\) and \(\EE_{s\sim\mu^0}[\hat{V}^{\lambda}_{h+1}(s)]^2_{\alpha}\) as follows: \[\begin{align} \hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]_{\alpha} = \Bigg[\argmin_{\bw \in R^d} \sum_{\tau=1}^K ([\hat{V}_{h+1}^{\lambda}(s_{h+1}^\tau)]_{\alpha} - \bphi(s_h^\tau, a_h^{\tau})^{\top}\bw)^2 + \gamma \| \bw \|_2^2 \Bigg]_{[0,H]}^i, \tag{17}\\ \hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]^2_{\alpha} = \Bigg[\argmin_{\bw \in R^d} \sum_{\tau=1}^K ([\hat{V}_{h+1}^{\lambda}(s_{h+1}^\tau)]^2_{\alpha} - \bphi(s_h^\tau, a_h^{\tau})^{\top}\bw)^2 + \gamma \| \bw \|_2^2 \Bigg]_{[0,H^2]}^i, \tag{18} \end{align}\] where the superscript \(i\) represents the \(i\)-th element of a vector. Then we construct the estimator \(\hat{\bw}_h^\lambda\) as follows \[\begin{align} \hat{w}_{h,i}^{\lambda} &= \sup_{\alpha \in [(\hat{V}_{h+1}^{\lambda})_{\min},(\hat{V}_{h+1}^{\lambda})_{\max}]}\Big\{ \hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]_{\alpha} - \frac{1}{4\lambda}\widehat{\Var}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]_{\alpha} \Big\} \notag\\ & = \max_{\alpha \in [(\hat{V}_{h+1}^{\lambda})_{\min},(\hat{V}_{h+1}^{\lambda})_{\max}]} \Big \{ \hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]_{\alpha} + \frac{1}{4\lambda}(\hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]_{\alpha})^2- \frac{1}{4\lambda}\hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]^2_{\alpha} \Big \}. \label{eq:parameter32estimation32x2} \end{align}\tag{19}\] We note that the parameter estimation procedure appears to be distinct from that of TV and KL, and this is due to the fact that the duality of \(\chi^2\) does not admit a closed form expression. Specifically, it estimates the parameter \(\bw_h^{\lambda}\) element-wisely, and for each dimension, it solves an optimization problem over an estimated dual formulation. This parameter estimation procedure shares a similar spirit with that in the \(d\)-DRMDP with TV divergence [13], [14].

5 Suboptimality Analysis↩︎

In this section, we establish theoretical guarantees for the algorithms proposed in . First, we derive instance-dependent upper bounds on the suboptimality gap of the policies learned by the instantiated algorithms. Next, under a partial coverage assumption, we present an instance-independent upper bound for the suboptimality gap and compare it with results from existing works. Finally, we provide a matching information-theoretic lower bound to highlight the intrinsic characteristics of offline \(d\)-RRMDPs.

5.1 Instance-Dependent Upper Bound↩︎

Theorem 1. Under , for any \(\delta \in (0,1)\), if we set \(\gamma=1\) and \(\Gamma_h(s,a) = \beta \sum_{i=1}^d\|\phi_i(\cdot, \cdot) \mathbf{1}_i\| _{\bLambda_h^{-1}}\) in ,

  • (TV) \(\beta = 16 Hd \sqrt{\xi_{\text{TV}}}\), where \(\xi_{\text{TV}} = 2\log(1024Hd^{1/2} K^2 /\delta)\);

  • (KL) \(\beta = 16d\lambda e^{H/\lambda} \sqrt{(H/\lambda + \xi_{\text{KL}})}\), where \(\xi_{\text{KL}} = \log(1024d\lambda^2 K^3H/\delta)\);

  • (\(\chi^2\)) \(\beta = 8 dH^2(1+ 1/\lambda) \sqrt{\xi_{\chi^2}}\), where \(\xi_{\chi^2} = \log (192 K^5H^6 d^3 (1 + H/2\lambda)^3/ \delta)\),

then with probability at least \(1-\delta\), for all \(s \in \mathcal{S}\), the suboptimality of satisfies: \[\begin{align} \text{SubOpt}(\hat{\pi}, s, \lambda) \leq 2\beta\sup_{P \in \mathcal{U}^\lambda(P^0)}\sum_{h=1}^H\EE^{\pi^*, P}\Big[\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\Lambda_h^{-1}}|s_1 = s\Big], \end{align}\] where \(\mathcal{U}^\lambda (P^0)\) is the robustly admissible set defined as \[\begin{align} \label{def:robustly32admissible32set} \mathcal{U}^\lambda(P^0)=\bigotimes_{(h, s, a) \in [H] \times \mathcal{S} \times \mathcal{A}} \mathcal{U}_h^\lambda(s, a ; \bmu_h^0), \end{align}\qquad{(1)}\] with \(\mathcal{U}_h^\lambda(s, a ; \bmu_h^0) = \{\sum_{i=1}^d \phi_i(s, a) \mu_{h, i}(\cdot):D(\mu_{h, i}\| \mu_{h,i}^0) \leq \max_{s\in\cS} V_{h+1}^{*, \lambda}(s) /\lambda, \forall i \in[d] \}\).

provides an instance-dependent upper bound on the suboptimality gap, closely resembling the bounds established for algorithms tailored to the \(d\)-DRMDP [12], [13]. Notably, the set \(\mathcal{U}^\lambda (P^0)\) in represents a subset of the feasible set \(\cF_{\text{L}}\) in the \(d\)-RRMDP. While the RRMDP framework does not impose explicit uncertainty set constraints, this term naturally arises from our theoretical analysis (see for details). Specifically, we show that only distributions within \(\mathcal{U}^\lambda (P^0)\) are relevant when considering the infimum in the robust regularized value and Q-functions 3 . Intuitively, the regularization term in 3 should not exceed the change in expected cumulative rewards, which is bounded above by the optimal robust regularized value function. Similar terms are also found in Definition 1 of [20] and Assumption 1 of [21].

5.2 Instance-Independent Upper Bound↩︎

Next, we derive instance-independent upper bounds on the suboptimality gap, building on the results in . To achieve this, we adapt the robust partial coverage assumption on the offline dataset, originally proposed for \(d\)-DRMDP (Assumption A.2 of [8]), to the \(d\)-RRMDP setting. This adaptation is straightforward and involves replacing the uncertainty set in the \(d\)-DRMDP framework with the robustly admissible set defined in ?? .

Assumption 2 (Robust Regularized Partial Coverage). For the offline dataset \(\cD\), we assume that there exists some constant \(c^\dagger >0\), such that \(\forall (h, s, P) \in [H] \times \cS \times \mathcal{U}^{\lambda}(P^0)\), we have \[\begin{align} \bLambda_h \succeq \gamma \mathbf{I} + K \cdot c^\dagger \cdot \EE^{\pi^*, P}\big[(\phi_i(s,a)\mathbf{1}_i)(\phi_i(s,a)\mathbf{1}_i)^{\top}| s_1 = s\big]. \end{align}\]

Intuitively, assumes that the offline dataset has good coverage on the state-action space visited by the optimal robust policy \(\pi^{\star}\) under any transition kernel in the robustly admissible set \(\mathcal{U}^{\lambda}(P^0)\). With and , we can provide instance-independent bound as:

Corollary 1. Under the same assumption and condition in , if we further assume holds for the offline dataset \(\cD\), then for any \(\delta \in (0,1)\), with probability at least \(1-\delta\), we have

  • (TV) \(\text{SubOpt}(\hat{\pi}, s, \lambda) \leq 16 H^2d^2\sqrt{\xi_{\text{TV}}}/{\sqrt{c^\dagger K}}\);

  • (KL) \(\text{SubOpt}(\hat{\pi}, s, \lambda) \leq 16\lambda e^{H/\lambda}d^2 H \sqrt{(H/\lambda + \xi_{\text{KL}})}/\sqrt{c^\dagger K}\);

  • (\(\chi^2\)) \(\text{SubOpt}(\hat{\pi}, s, \lambda) \leq 8d^2H^3 (1 + 1/\lambda)\sqrt{\xi_{\chi^2}}/\sqrt{c^\dagger K}\).

We compare the suboptimality upper bound of with those algorithms proposed in previous works for the offline \(d\)-DRMDP in . For the case with TV-divergence, the suboptimality bound of R2PVI matches that of P2MPO in terms of \(d\) and \(H\). DRPVI and DROP admit tighter bounds on the suboptimality bound, simply because their bounds are derived based on advanced techniques and a stronger assumption on the offline dataset. We remark that our analysis can be tailored to adopt the same techniques and assumption, and thus get tighter bounds.

For the case with KL-divergence, existing theoretical results [8], [11] rely on an additional regularity assumption regarding the KL dual variable, stating that the optimal dual variable for the KL duality admits a positive lower bound \(\underline{\beta}\) under any feasible transition kernel (see Assumption F.1 in [8]). However, this assumption presents the following drawbacks. First, it is challenging to verify the assumption’s validity in practice; second, even if such a lower bound holds, there is no straightforward method to determine the magnitude of the lower bound. It can be seen from that the suboptimality bound of R2PVI matches that of DRVI-L [11] in terms of \(d\) and \(H\). However, our result depends on \(\lambda\) which is the regularization parameter and can be arbitrarily chosen, while the result of DRVI-L depends on \(\underline{\beta}\) which can be extremely small such that \(\sqrt{\underline{\beta}}e^{H/\underline{\beta}}\ll \sqrt{\lambda}e^{H/\lambda}\). Moreover, [20] studied the RRMDP with the regularization term defined by the KL-divergence in their Theorem 5 and the suboptimality bound also depends on the term \(\sqrt{\lambda}e^{H/\lambda}\)6. Further, comparing the bounds of P2MPO [8] and R2PVI, we can qualitatively conclude that the regularization parameter \(\lambda\) in \(d\)-RRMDP plays a role analogous to \(1/\rho\) in \(d\)-DRMDP. This relation aligns with the intuition that a smaller \(\lambda\) in \(d\)-RRMDP or a larger \(\rho\) in \(d\)-DRMDP can induce a more robust policy.

For the case with \(\chi^2\) divergence, our bound is the first result in literature. Compared with the TV divergence, the sample complexity with \(\chi^2\) divergence is higher due to the difficulty in solving the dual oracle. This observation aligns with the findings of [16]. While existing RRMDP works have focused on the \((s,a)\)-rectangular structured regularization, our work fills the theoretical gaps in RRMDP by introducing the \(d\)-rectangular structured regularization, a contribution that may be of independent interest.

Table 1: Comparison of the suboptimality gap between our works and previous works. The \(^{\star}\) symbol denotes results that require an additional assumption (Assumption 4.4 of [11] and Assumption F.1 of [8]) on the KL dual variable, an assumption not required by our R2PVI algorithm. The parameter \(\rho\) represents the uncertainty level in DRMDP, while \(\lambda\) represents the regularization term in RRMDP. The Coverage column indicates the assumption used to derive the suboptimality gap: the robust partial coverage assumption refers to Assumption 6.2 of [8], and the regularized partial coverage assumption represents .
Algorithm Setting Divergence Coverage Suboptimality Gap
[13] \(d\)-DRMDP TV full \(\tilde{O}(dH^2 K^{-1/2})\)
[12] \(d\)-DRMDP TV robust partial \(\tilde{O}(d^{3/2}H^2 K^{-1/2})\)
[8] \(d\)-DRMDP TV robust partial \(\tilde{O}(d^{2}H^2 K^{-1/2})\)
\(d\)-RRMDP TV regularized partial \(\tilde{O}(d^{2}H^2 K^{-1/2})\)
[11] \(d\)-DRMDP KL robust partial \(\tilde{O}(\sqrt{\underline{\beta}} e^{H/\underline{\beta}}d^{2}H^{3/2} K^{-1/2})^{\star}\)
[8] \(d\)-DRMDP KL robust partial \(\tilde{O}(e^{H/\underline{\beta}}d^{2}H^{2}\rho^{-1} K^{-1/2})^{\star}\)
\(d\)-RRMDP KL regularized partial \(\tilde{O}(\sqrt{\lambda} e^{H/ \lambda} d^2 H^{3/2} K^{-1/2})\)
\(d\)-RRMDP \(\chi^2\) regularized partial \(\tilde{O}(d^2H^3(1+\lambda^{-1})K^{-1/2})\)

5.3 Information-Theoretic Lower Bound↩︎

We highlight that in , suboptimality bounds under cases with TV-, KL-, \(\chi^2\)-divergence share the same term \(\sup_{P\in\cU^{\lambda}(P^0)} \sum_{h=1}^H\EE^{\pi^\star, P}[\sum_{i=1}^d\Vert \phi_i(s_h,a_h)\mathbf{1}_i\Vert_{\bLambda_h^{-1}} |s_1=s]\). In this section, we establish information theoretic lower bounds to show that this term is intrinsic in \(d\)-RRMDP problem.

The construction of the information-theoretic lower bound relies on a novel family of hard instances. We illustrate one such instance in . Both the nominal and target environments satisfy . The environment consists of two states, \(s_1\) and \(s_2\). In the nominal environment , \(s_1\) represents the good state with a positive reward. For any transition originating from \(s_1\), there is a \(1-\epsilon\) probability of transitioning to itself and an \(\epsilon\) probability of transitioning to \(s_2\), regardless of the action taken, where \(\epsilon\) is a parameter to be determined. The state \(s_2\) is the fail state with zero reward and can only transition to itself. The worst-case target environment is obtained by perturbing the transition probabilities in the nominal environment. The perturbation magnitude \(\Delta_h^{\lambda}(\epsilon,D)\) depends on the stage \(h\), regularizer \(\lambda\), divergence metric \(D\), and parameter \(\epsilon\).

We would like to highlight the difference between our hard instances and the hard instances developed in [13]. We find out that the instances developed in [13] only allow perturbations measured in TV-divergence. The reason is that in their nominal environment, both \(s_1\) and \(s_2\) are absorbing states, and thus \(P^0_h(\cdot|s,a)\) only has support on \(s\), which could be either \(s_1\) or \(s_2\). In this case, any perturbation to \(P^0_h(\cdot|s,a)\) would cause a violation of the absolute continuous condition in the definition of the KL-divergence and the \(\chi^2\)-divergence (as well as the TV-divergence if strictly speaking). In comparison, we inject a small error \(\epsilon\) in the nominal kernel such that \(P^0_h(\cdot|s_1,a)\) has full support \(\{s_1,s_2\}\) when the transition starts from \(s_1\). Hence, we can make perturbation on \(P^0_h(\cdot|s_1,a)\) safely without violating the absolutely continuous condition. Additionally, while [13] only construct perturbation in first time step, we admits perturbation in every time step \(h\) in order to make our instance more general. For details on the construction of hard instances and the proof of , we refer readers to .

Figure 2: The nominal environment and the worst case environment. The value on each arrow represents the transition probability.The MDP has two states, \(s_1\) and \(s_2\), and \(H\) steps. For he nominal environment on the left, the \(s_1\) is the good state where the transition is determined by an error term \(\epsilon\), and \(s_2\) is a fail state with reward 0 and only transitions to itself.The worst case environment on the right is obtained by perturbing the transition probability at each step of the nominal environment. The magnitude of the perturbation \(\Delta_h^\lambda(\epsilon,D)\) at each stage \(h\) depends on the divergence metric \(D\), the regularized \(\lambda\) and the parameter \(\epsilon\).

In order to give a formal presentation of the information-theoretical lower bound, we define \(\cM\) as a class of \(d\)-RRMDPs and \(\text{SubOpt}(M,\hat{\pi},s,\rho)\) as the suboptimality gap specific to one \(d\)-RRMDP instance \(M\in\cM\). Hence, we state the information-theoretic lower bound in the following theorem:

Theorem 2. Given a regularizer \(\lambda\), dimension \(d\), horizon length \(H\) and sample size \(K>\max\{\tilde{O}(d^6), \tilde{O}(d^3H^2/\lambda^2)\}\), there exists a class of \(d\)-rectangular linear RRMDPs \(\cM\) and an offline dataset \(\cD\) of size \(K\) such that for all \(s\in\cS\) and any divergence \(D\) among \(D_{\text{TV}}, D_{\text{KL}}\) and \(D_{\chi^2}\), with probability at least \(1-\delta\), we have \[\begin{align} \inf_{\hat{\pi}}\sup_{M\in\cM} \text{SubOpt}(M, \hat{\pi},s,\lambda, D) \geq c \cdot \sup_{P\in\cU^{\lambda}(P^0)} \sum_{h=1}^H\EE^{\pi^\star, P}\bigg[\sum_{i=1}^d\Vert \phi_i(s_h,a_h)\mathbf{1}_i\Vert_{\bLambda_h^{-1}} \big|s_1=s\bigg], \end{align}\] where \(c\) is a universal constant.

is a universal information theoretic lower bound for \(d\)-RRDMPs with all three divergences studied in . shows that the instance-dependent term is actually intrinsic to the offline \(d\)-RRDMPs, and is near-optimal up to a factor \(\beta\), for which the definition varies among different divergence metric \(D\) as shown in . The proof outline of is inspired by that of Theorem 6.1 in [13], but that here we need careful treatment on bounding the robust regularized value function by duality under different choice of \(f\)-divergence, especially considering the error \(\epsilon\) in each stage would accumulate throughout the \(H\) steps (see and its proof for more details).

6 Experiment↩︎

In this section, we conduct numerical experiments to explore (1) the robustness of R2PVI when facing adversarial dynamics, (2) the role of regularizer \(\lambda\) in determining the robustness of R2PVI, and (3) the computation cost of R2PVI. We evaluate our algorithm in two off-dynamics experiments that has been used in the literature [11], [14]. All experiments are conducted on a machine with an 11th Gen Intel(R) Core(TM) i5-11300H @ 3.10GHz processor, featuring 8 logical CPUs, 4 physical cores, and 2 threads per core.

6.0.0.1 Baselines.

We compare our algorithms with three types of baseline frameworks: (1) non-robust pessimism-based algorithm: PEVI [26], (2) \(d\)-DRMDP based algorithm under TV divergence: DRPVI [13], (3) \(d\)-DRMDP based algorithm under KL divergence: DRVI-L [11]. We do not implement P2MPO and DROP mentioned in in our experiment, due to the lack of code base and numerical experiment in those works.

6.1 Simulated Linear MDP↩︎

We borrow the simulated linear MDP constructed in [14] and adapt it to the offline setting. We set the behavior policy \(\pi^b\) such that it chooses actions uniformly at random. The sample size of the offline dataset is set to 100. For completeness, we present more details on the experiment set up and result in .

a

b

c

Figure 3: Simulated results for linear MDP. In and , the \(x\)-axis refers to the perturbation in the testing environment. In , the \(x\)-axis represents different robust level \(\rho\) and regularized penalty \(\lambda\), respectively..

In , we compare R2PVIwith its non-robust counterpart PEVI [26]. We conclude that PEVI outperforms R2PVIwhen the perturbation of the environment is small, but underperforms when the environment encounters a significant shift, which verifies the robustness of R2PVI. The regularizer \(\lambda\) controls the extend of robustness of R2PVIby determining the magnitude of the penalty as shown in . From , we can conclude that a smaller \(\lambda\) leads to a more robust policy. To illustrate the relation between the \(d\)-rectangular linear RRMDP and the \(d\)-rectangular DRMDP, we fix a target environment, and then test R2PVIwith different \(\lambda\) and DRPVI (the algorithm designed for \(d\)-rectangular linear DRMDP in [13]) with different \(\rho\). We find from that the ranges of the average reward are approximately the same for the two algorithms, though the behaviors w.r.t. \(\lambda\) and \(\rho\) are opposite. Thus, we verify the Theorem 3.1 of [18] that for each given RRMDP, there exists a DRMDP, whose value functions are exactly the same, and the regularizer \(\lambda\) plays a similar role in the RRMDP as the inversed robustness parameter \(1/\rho\) in the DRMDP.

6.2 Simulated American Put Option↩︎

In this section, we test our algorithm in a simulated American Put Option environment [27], [28] that does not belong to the \(d\)-rectangular linear RRMDP. This environment is a finite horizon MDP with \(H=20\), and is controlled by a hyperparameter \(p_0\), which is set to be 0.5 in the nominal environment. We collect offline data from the nominal environment by a uniformly random behavior policy. An agent use the collected offline dataset to learn a policy which decides at each state whether or not to exercise the option. To implement our algorithm, we use a manually designed feature mapping of dimension \(d\). For more details on the experiment setup, we refer readers to Section 6.2 of [14] for more details.

a

b

c

Figure 4: Simulation results for the simulated American put option task. shows the computation time of R2PVIwith respect to the sample size \(N\). shows the robustness of algorithms under different extend of perturbation in \(p_0\). shows the computation time of algorithms with respect to the feature dimension \(d\)..

All experiment results are shown in . In particular, from and , we can conclude that the computation cost of R2PVIis as low as its non-robust counterpart PEVI [26], and improves that of DRPVI [13] and DRVI-L [11] designed for the \(d\)-rectangular linear DRMDP. This is due to the closed form duality of TV and \(KL\) under the \(d\)-rectangular linear RRMDP. From , we conclude that R2PVIon one hand is robust to the environment perturbation. On the other hand, for each value of the regularizer \(\lambda\) in R2PVI, there exist a value of the uncertainty level \(\rho\) for DRPVI such that they achieve almost the same performance. This verifies the equivalence between DRMDP and RRMDP.

7 Conclusion↩︎

In this work, we propose a novel \(d\)-rectangular linear Robust Regularized Markov Decision Process (\(d\)-RRMDP) framework to overcome the theoretical limitations and computational challenges inherent in the \(d\)-rectangular DRMDP framework. We design a provable and practical algorithm R2PVIthat learns robust policies under the offline \(d\)-RRMDP setting across three divergence metrics: Total Variation, Kullback-Leibler, and \(\chi^2\) divergence. Our theoretical results demonstrate the superiority of the \(d\)-RRMDP framework compared to the \(d\)-DRMDP framework, primarily through the simplification of the duality oracle. Extensive experiments validate the robustness and high computational efficiency of R2PVIwhen compared to \(d\)-DRMDP-based algorithms. It remains an intriguing open question to improve the current upper and lower bounds to study the fundamental hardness of the \(d\)-RRMDP.

8 Additional Details on Experiments↩︎

In this section, we provide details on experiment setup.

8.1 Simulated Linear MDP↩︎

8.1.0.1 Construction of the simulated linear MDP

We leverage the simulated linear MDP instance proposed by [14]. The state space is \(\cS = \{x_1,\cdots,x_5\}\) and the action space is \(\cA=\{-1,1\}^4\subset\RR^4\). At each episode, the initial state is always \(x_1\). From \(x_1\), the next state can be \(x_2, x_4, x_5\) with probability defined on the arrows. Both \(x_4\) and \(x_5\) are absorbing states. \(x_4\) is the fail state with 0 reward and \(x_5\) is the goal state with reward 1. The hyperparameter \(\bxi\in\RR^4\) is designed to determine the reward functions and transition probabilities and \(\delta\) is the parameter defined to determine the environment. We perturb the transition probability at the initial stage to construct the source environment. The extend of perturbation is controlled by the hyperparameter \(q\in(0,1)\). For more details on the simulated linear DRMDP, we refer readers to the Supplementary A.1 of [14].

8.1.0.2 Hyper-parameters

The hyper-parameters in our setting are shown in . The horizon is 3, the \(\beta, \gamma, \delta\) are set the same in all tasks, the \(\| \bxi \|_1\) is set as 0.3, 0.2, 0.1 in in order to illustrate the versatility of our algorithms.

Table 2: Hyper-parameters.
Hyper-parameters Value
\(H\) (Horizon) 3
\(\beta\) (pessimism parameter) 1
\(\gamma\) 0.1
\(\delta\) 0.3
\(\|\bxi\|_1\) 0.3, 0.2, 0.1

8.2 Simulated American Put Option↩︎

8.2.0.1 Construction of the simulated American put option

In each episode, there are \(H=20\) stages, and each state \(h\), the dynamics evolves following the Bernoulli distribution: \[\begin{align} \label{American32put32option32transit} s_{h+1} = \begin{cases} 1.02s_h, \text{ w.p } p_0 \\ 0.98 s_h, \text{ w.p } 1-p_0 \end{cases}, \end{align}\tag{20}\] where \(p_0 \in (0,1)\) is the probability of price up. At each step, the agent has two actions to take: exercise the option \(a_h = 1\) or not exercise \(a=0\). If exercising the option \(a_h =0\), the agent will obtain reward \(r_h = \max \{0, 100-s_h\}\) and the state comes to an end. If not exercising the option \(a_h=1\), The state will continue to transit based on 20 and no reward will be received. To implement our algorithms, we use the following feature mapping: \[\begin{align} \phi\left(s_h, a\right)= \begin{cases}{\left[\varphi_1\left(s_h\right),\cdots, \varphi_d\left(s_h\right), 0\right]} & \text{ if } a=1 \\ {\left[0, \cdots, 0, \max \left\{0,100-s_h\right\}\right]} & \text{ if } a=0 \end{cases}, \end{align}\] where \(\varphi_i(s)=\max \left\{0,1-\left|s_h-s_i\right| / \Delta\right\}\), \(\{s_i\}_{i=1}^d\) are anchor states, \(s_1=80\) \(s_{i+1}-s_i=\Delta\) and \(\Delta = 60/d\). For more details on the simulated American put option environment, we refer readers to the Appendix C of [11].

8.2.0.2 Offline dataset and hyper-parameters

We set \(p_0=0.5\) in the nominal environment, from which trajectories are collected by fixed behavior policy, which chooses \(a_h = 0\). The \(\beta = 0.1\) and \(\gamma = 1\) are set hyper-parameters in all tasks. For the time efficiency comparison in and , we counted the time it took for the agent to train once and repeated 5 times to take the average.

9 Proof of Properties of \(d\)-RRMDPs↩︎

In this section, we provide the proofs of results in , namely, the robust regularized Bellman equation, the existence of the optimal robust policy, and the linear representation of the robust regularized Q-function under the \(d\)-rectangular linear RRMDP.

9.1 Proof of↩︎

Proof. We prove the a stronger proposition by induction from the last stage \(H\). Specifically, besides the equations in hold, we further assume that there exist transition kernels \(\{ \hat{\bmu}_t \}_{t=1}^H, \hat{P}_t = \la \bphi, \hat{\bmu}_t \ra\), such that for any \((h,s)\in[H]\times\cS\), \[\begin{align} V_h^{\pi, \lambda}(s) &= \EE^{\{\hat{P}_t \}_{t=h}^H} \bigg[\sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\hat{\bmu}_t||\bmu_t^0) \ra \Big| s_h = s, \pi \bigg]. \label{existence32of32transition32kernel} \end{align}\tag{21}\] As there is no transitional kernel involved, the base case holds trivially. Suppose the conclusion holds for stage \(h+1\), that is to say, there exists \(\hat{P}_{t}, t=h+1, h+2,\cdots, H\) such that \[\begin{align} V_{h+1}^{\pi,\lambda}(s) = \EE^{\{\hat{P}_i\}_{i=h+1}^H} \bigg[ \sum_{t = h+1}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\hat{\bmu}_t||\bmu_t^0)\big] \Big| s_{h+1}= s, \pi \bigg]. \end{align}\] For the case of \(h\), recall the definition of \(Q_h^{\pi}\), we have \[\begin{align} Q^{\pi, \lambda}_h(s,a) & =\inf_{\bmu_t \in \Delta(\cS)^d,P_t=\la \bphi,\bmu_t \ra} \EE^{\{P_t\}_{t=h}^H} \bigg[\sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\bmu_t||\bmu_t^0) \ra\big] \Big| s_h = s, a_h = a, \pi \bigg] \nonumber\\ & = r_h(s,a) + \inf_{\bmu_t \in \Delta(\cS)^d,P_t=\la \bphi,\bmu_t \ra} \lambda \la\bphi(s_h, a_h),\bD(\bmu_h||\bmu_h^0)\ra \nonumber\\ &\quad + \int_{\cS} P_{h}(ds'| s,a)\EE^{\{P_t\}_{t=h+1}^H} \bigg[ \sum_{t = h+1}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\bmu_t||\bmu_t^0)\ra\big] \Big| s_{h+1} = s' , \pi \bigg] \nonumber\\ & \leq r_h(s,a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\lambda \la\bphi(s_h, a_h),\bD(\bmu_h||\bmu_h^0)\ra \nonumber\\ &\quad + \int_{\cS} P_{h}(ds'| s,a)\EE^{\{\hat{P}_t\}_{t=h+1}^H} \bigg[ \sum_{t = h+1}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\hat{\bmu}_t||\bmu_t^0)\ra\big] \Big| s_{h+1} = s' , \pi \bigg] \nonumber \\ & = r_h(s,a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra} \lambda \la\bphi(s_h, a_h),\bD(\bmu_h||\bmu_h^0)\ra + \EE_{s'\sim P_h(\cdot |s,a)}[V_{h+1}^{\pi, \lambda}(s')] \label{B2}, \end{align}\tag{22}\] where 22 follows by the inductive hypothesis of \(V_{h+1}^{\pi, \lambda}(s)\). On the other hand, we can lower bound \(Q_h^{\pi, \lambda}(s,a)\) as \[\begin{align} &Q^{\pi, \lambda}_h(s,a) \nonumber \\ & = r_h(s,a) + \inf_{\bmu_t \in \Delta(\cS)^d,P_t=\la \bphi,\bmu_t \ra} \lambda \la\bphi(s_h, a_h),\bD(\bmu_h||\bmu_h^0)\ra \nonumber\\ &\quad + \int_{\cS} P_{h}(ds'| s,a)\EE^{\{P_t\}_{t=h+1}^H} \bigg[ \sum_{t = h+1}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\bmu_t||\bmu_t^0)\ra\big] \Big| s_{h+1} = s' , \pi \bigg] \nonumber\\ &\geq r_h(s,a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra} \lambda \la\bphi(s_h, a_h),\bD(\bmu_h||\bmu_h^0)\ra \tag{23}\\ & \quad + \int_{\cS} P_{h}^\pi(ds'| s,a) \inf_{\bmu_t \in \Delta(\cS)^d,P_t=\la \bphi,\bmu_t \ra} \EE^{\{P_t\}_{t=h+1}^H} \bigg[ \sum_{t = h+1}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\bmu_t||\bmu_t^0)\ra\big] \Big| s_{h+1} = s' , \pi \bigg] \nonumber\\ & = r_h(s,a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra} \lambda \la\bphi(s_h, a_h),\bD(\bmu_h||\bmu_h^0)\ra + \EE_{s'\sim P_h(\cdot |s,a)}[V_{h+1}^{\pi, \lambda}(s')] \tag{24}, \end{align}\] where 23 follows by the Fatou’s lemma, 24 follows by the definition of \(V_{h+1}^{\pi, \lambda}(s)\). Hence, combining the two above inequalities, we conclude the 5 . Next we focus on the proof of the (21 ), by which we aim to proof the existence of transition kernel \(\{\hat{P}_t\}_{t=h}^H\). By the fact that \[\begin{align} Q_{h}^{\pi, \lambda}(s, a) &= r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big], \end{align}\] we notice that the \(\inf\) problem above is constraint by the distance \(D\). Therefore by Lagrange duality and the closeness of distribution \(\Delta(\cS)\), there exists \(\hat{\bmu}_h \in \Delta(\cS)^d, \hat{P}_h = \la \bphi, \hat{\bmu}_h \ra\) such that \[\begin{align} Q^{\pi, \lambda}_h(s,a) = r_h(s, a) + \EE_{s'\sim \hat{P}_h(\cdot|s,a)}\big[V_{h+1}^{\pi, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\hat{\bmu}_h||\bmu_h^0) \ra. \label{40A46141} \end{align}\tag{25}\] Now it remains to proof 6 . By the definition of \(V_{h}^{\pi, \lambda}(s)\), we have \[\begin{align} &V_{h}^{\pi, \lambda}(s) \\ &= \inf_{\bmu_t \in \Delta(\cS)^d,P_t=\la \bphi,\bmu_t \ra} \EE^{\{P_t\}_{t=h}^H} \bigg[\sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\bmu_t||\bmu_t^0) \ra\big] \Big| s_h = s, \pi \bigg] \nonumber\\ &= \inf_{\bmu_t \in \Delta(\cS)^d,P_t=\la \bphi,\bmu_t \ra}\sum_{a \in \cA} \pi(a|s) \EE^{\{P_t\}_{t=h}^H} \bigg[\sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\bmu_t||\bmu_t^0) \ra\big] \Big| s_h = s, a_h =a, \pi \bigg] \nonumber\\ & \leq \sum_{a \in \cA} \pi(a|s) \EE^{\{\hat{P}_t\}_{t=h}^H} \bigg[ \sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\hat{\bmu}_t||\bmu_t^0)\ra\big] \Big| s_{h}= s, a_h = a, \pi \bigg] \nonumber\\ & = \sum_{a \in \cA} \pi(a|s) Q_{h}^{\pi, \lambda}(s,a), \label{B6} \end{align}\tag{26}\] where 26 comes from 25 and the inductive hypothesis. On the other hand, by the definition of \(Q_{h}^{\pi, \lambda}(s,a)\), we have \[\begin{align} &\sum_{a \in \cA} \pi(a|s) Q_{h}^{\pi, \lambda}(s,a) \nonumber\\ &= \sum_{a \in \cA} \pi(a|s) \inf_{\bmu_t \in \Delta(\cS)^d,P_t=\la \bphi,\bmu_t \ra} \EE^{\{P_t\}_{t=h}^H} \bigg[\sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\bmu_t||\bmu_t^0) \ra\big] \Big| s_h = s, a_h = a, \pi \bigg] \nonumber\\ & \leq \inf_{\bmu_t \in \Delta(\cS)^d,P_t=\la \bphi,\bmu_t \ra}\sum_{a \in \cA} \pi(a|s) \EE^{\{P_t\}_{t=h}^H} \bigg[\sum_{t = h}^H \big[r_t(s_t, a_t) + \lambda \la\bphi(s_t, a_t),\bD(\bmu_t||\bmu_t^0) \ra\big] \Big| s_h = s, a_h = a, \pi \bigg] \nonumber \\ & = V_{h}^{\pi, \lambda}(s), \label{B5} \end{align}\tag{27}\] where 27 comes from the definition of \(V_h^{\pi, \lambda}(s)\). Combining the two inequalities 26 and 27 , we have \[\begin{align} V_{h}^{\pi, \lambda}(s) = \EE_{a \sim \pi(\cdot | s)}\big[Q_{h}^{\pi}(s,a)\big]. \end{align}\] This proves the 6 for stage h. Therefore, by using an induction argument, we finish the proof of . ◻

9.2 Proof of↩︎

Proof. We define the optimal stationary policy \(\pi^\star = \{ \pi_h^\star \}_{h=1}^H\) as: for all \((h,s) \in [H]\times\cS\), \[\begin{align} \pi^\star_h(s) = \argmax_{a \in \cA} \Big[r_h(s,a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big]\Big]. \end{align}\] Now it remains to show that the regularized robust value function \(V_h^{\pi^\star, \lambda}, Q_h^{\pi^\star, \lambda}\) induced by policy \(\pi^\star\) is optimal, i.e., for all \((h,s) \in [H] \times \cS\), \[\begin{align} V^{\pi^\star, \lambda}_h(s) = V^{\star, \lambda}_{h}(s), Q^{\pi^\star, \lambda}_h(s, a) = Q^{\star, \lambda}_h(s,a). \end{align}\] By the [prop:regularized32Robust32Bellman32equation], we only need to prove the first equation above, then the optimality of the \(Q\) holds trivially. we prove this statement by induction from \(H\) to \(1\). For stage \(H\), the conclusion holds by: \[\begin{align} V_H^{\star, \lambda}(s) &= \sup_{\pi}V_H^{\pi, \lambda}(s)\\ &= \sup_{\pi} \inf_{\bmu_H \in \Delta(\cS)^d,P_H=\la \bphi,\bmu_H \ra} \EE^{P_H} \bigg[\big[r_H(s_H, a_H) + \lambda \la\bphi(s_H, a_H),\bD(\bmu_H||\bmu_H^0) \ra\big] \Big| s_H = s, \pi \bigg] \\ & = \sup_{\pi} \Big[ r_H(s_H,\pi_H(s_H)) + \inf_{\bmu_H \in \Delta(\cS)^d,P_H=\la \bphi,\bmu_H \ra} \lambda \la\bphi(s, a),\bD(\bmu_H||\bmu_H^0) \ra \Big] \\ & = V_H^{\pi^\star, \lambda}(s). \end{align}\] Now assume that the conclusion holds by stage \(h+1\). Hence, we have that for all \(s \in \cS\), \[\begin{align} V_{h+1}^{\pi^\star, \lambda}(s) = V_{h+1}^{\star, \lambda}(s). \end{align}\] For the case of \(h\), by [prop:regularized32Robust32Bellman32equation], we have \[\begin{align} &V_{h}^{\pi^\star, \lambda}(s) \nonumber\\&= \EE_{a \sim \pi^\star_{h}(\cdot|s)} \Big[ Q_h^{\pi^\star, \lambda}(s,a) \Big] \nonumber\\ & = \EE_{a \sim \pi^\star_{h}(\cdot|s)} \Big[r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi^\star, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \Big] \nonumber\\ & = \EE_{a \sim \pi^\star_{h}(\cdot|s)} \Big[r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \Big] \tag{28}\\ & = \max_{a \in \cA} \Big[r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \Big], \tag{29} \end{align}\] where 28 holds by the inductive hypothesis, 29 holds by the definition of \(\pi^\star\). On the other hand, recall the definition of \(V^{\star,\lambda}_h(s)\), then for any \(s \in \cS\), by [prop:regularized32Robust32Bellman32equation] we have \[\begin{align} &V_h^{\star,\lambda}(s) \nonumber \\&= \sup_{\pi}V_h^{\pi, \lambda}(s) \nonumber\\ &= \sup_{\pi}\EE_{a \sim \pi_{h}(\cdot|s)} \Big[ Q_h^{\pi, \lambda}(s,a) \Big] \nonumber\\ & = \sup_{\pi} \EE_{a \sim \pi_{h}(\cdot|s)} \Big[r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \Big] \nonumber\\ & \leq \sup_{\pi} \EE_{a \sim \pi_{h}(\cdot|s)} \Big[r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \Big] \tag{30}\\ & = \max_{a \in \cA} \Big[r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \Big] \nonumber\\ & = V_{h}^{\pi^\star, \lambda}(s), \tag{31} \end{align}\] where 30 holds by the fact that \(V^{\pi^\star, \lambda}_{h+1}(s) \leq V^{\star,\lambda}_{h+1}(s) ,\forall s \in \cS\), 31 holds by (29 ). In turn, we trivially have \(V_h^{\star, \lambda}(s) \geq V_h^{\pi^\star, \lambda}(s)\) due to the optimality of the value function. Hence, we obtain \(V_{h}^{\pi^\star, \lambda}(s) = V_h^{\star, \lambda}(s), \forall s \in \cS\). Therefore, by the induction argument, we conclude the proof. ◻

9.3 Proof of↩︎

Proof. By , we have \[\begin{align} Q_h^{\pi, \lambda}(s,a) \nonumber &= r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \nonumber\\ & = \big\langle \bphi(s,a) , \btheta_h \big\rangle + \inf_{\bmu_h \in \Delta(\cS)^d} \Big[ \big\langle \bphi(s,a), \EE_{s' \sim \bmu_h}[V_{h+1}^{\pi, \lambda}(s')] \big\rangle + \lambda \sum_{i=1}^d\phi_i(s,a)D(\mu_{h, i} \| \mu_{h,i}^0) \Big] \nonumber\\ & = \big\langle \bphi(s,a) ,\btheta_h \big\rangle +\big\langle \bphi(s,a) , \bw^{\pi, \lambda}_h \big\rangle \nonumber\\ & =\big\langle \bphi(s,a) ,\btheta_h + \bw_h^{\pi, \lambda} \big\rangle.\nonumber \end{align}\] Hence we conclude the proof. ◻

9.4 Proof of↩︎

Proof. The optimization problem can be formalized as: \[\begin{align} \inf_{\mu}\EE_{s\sim\mu}V(s) + \lambda D_{\text{TV}}(\mu\|\mu^0) ~~ \text{subject to} \sum_s\mu(s) = 1, \mu(s) \geq 0. \end{align}\] Denote \(y(s) = \mu(s) - \mu^0(s)\), the objective function can be rewritten as: \[\begin{align} \EE_{s\sim\mu}V(s) + \lambda D_{\text{TV}}(\mu\| \mu^0) &=\sum_{s} \mu(s)V(s) + \lambda/2\sum_s|\mu(s)-\mu^0(s)| \\ &= \sum_{s} V(s)(y(s) + \mu^0(s)) + \lambda/2 \sum |y(s)| \\&= \EE_{s\sim\mu^0}V(s)+ \sum_{s} V(s)y(s) + \lambda /2 \sum |y(s)|. \end{align}\] Recall the constraint \(\sum_{s}y(s) = 0, y(s) \geq -\mu^0(s)\), by the Lagrange duality, we establish the Lagrangian function: \[\begin{align} \cL = \min_y\max_{\mu \geq 0, r \in R}\Big(\sum_s [y(s)(V(s)-\mu(s)-r)) + \lambda/2|y(s)|] - \sum_s \mu(s)\mu^0(s) \Big). \end{align}\] In order to achieve the minimax optimality, for any \(s\), term \(y(s)(V(s)-\mu(s)-r)) + \lambda/2|y(s)|\) should obtain a bounded lower bound with respect to \(y(s)\), which requires that \(\mu(s), r\) should satisfy the following conditions: \[\begin{align} \forall s \in \cS, |V(s) - \mu(s) - r| \leq \lambda/2 \Rightarrow \max_{s \in \cS}\{V(s)-\mu(s)\} - \min_{s \in \cS}\{V(s) - \mu (s)\} \leq \lambda. \end{align}\] With the constraint above, we denote \(g(s): = V(s) - \mu(s)\), we have \[\begin{align} \cL &= \max_{\mu \geq 0, r \in R}\min_y\Big \{\sum_s [y(s)(V(s)-\mu(s)-r)) + \lambda/2|y(s)|] - \sum_s \mu(s)\mu^0(s) \Big \} \\ &= \max_{\max_{s \in \cS}(V(s)-\mu(s)) - \min_{s \in \cS}(V(s) - \mu (s)) \leq \lambda} - \sum\mu(s)\mu^0(s) \\ & = \max_{\max_sg(s) - \min_sg(s) \leq \lambda, g(s) \leq V(s)}\Big \{ \sum g(s)\mu^0(s)\Big \} - \EE_{s\sim\mu^0}V(s). \end{align}\] Thus we have, \[\begin{align} &\EE_{s\sim\mu}V(s) + \lambda D_{\text{TV}}(\mu\| \mu^0) \nonumber\\ &= \EE_{s\sim\mu^0}V(s) +\max_{\max_sg(s) - \min_sg(s) \leq \lambda, g(s) \leq V(s)}\Big \{\sum g(s)\mu^0(s)\Big \} -\EE_{s\sim\mu^0}V(s) \nonumber \\ &= \EE_{s\sim\mu^0}[V(s)]_{V_{\min}+ \lambda}, \label{3462} \end{align}\tag{32}\] where 32 holds by directly solving the max problem. Hence we conclude the proof. ◻

9.5 Proof of↩︎

Proof. Similar to the proof of , define \(y(s) = \mu(s) - \mu^0(s)\), with lagrange duality, we have: \[\begin{align} \mathcal{L} &= \sum_{s} V(s) (y(s) + \mu^0(s)) + \lambda \sum_{s} \frac{y(s)^2}{\mu^0(s)} - \sum_{s} (\mu^0(s) + y(s)) \mu(s) - r \sum_{s} y(s) \\ &= \sum_{s} \Big( \frac{\lambda y(s)^2}{\mu^0(s)} + y(s) (V(s) - \mu(s) - r) + \sum_{s} V(s) \mu^0(s) - \sum_{s} \mu^0(s) \mu(s) \Big). \end{align}\] Noticing that \(\cL\) is a quadratic function with respect to \(y(s)\), therefore after we fix the term \(\mu(s), r\) and compute the min with respect to \(y(s)\), we have \[\begin{align} \mathcal{L} &= -\frac{1}{4 \lambda} \sum_{s} \mu^0(s) (V(s) - \mu(s) - r)^2 + \sum_{s} V(s) \mu^0(s) - \sum_{s} \mu^0(s) \mu(s) \nonumber\\ & = -\frac{1}{4 \lambda} \Big[ \EE_{s\sim\mu^0} [V - \mu]^2 - \big(\EE_{s\sim\mu^0} [V - \mu] \big)^2 \Big] + \EE_{s\sim\mu^0} [V - \mu] \tag{33}\\ & = \EE_{s\sim\mu^0} [V - \mu] - \frac{1}{4 \lambda} \Var_{s\sim\mu^0}[V - \mu] \nonumber\\ &= \sup_{\alpha \in [V_{\min}, V_{\max}]} \EE_{s\sim\mu^0}[V(s)]_{\alpha} - \frac{1}{4\lambda}\Var_{s\sim\mu^0}[V(s)]_{\alpha} \tag{34}, \end{align}\] where 33 comes from maximizing \(r\), and 34 comes from maximizing \(\mu(s), s \in \cS\) and the observation that \(\mu(s) = 0 \text{ or } V(s), \forall s \in \cS\) when achieving its maximum. Hence, we conclude the proof. ◻

10 Proof of the Upper Bounds of Suboptimality↩︎

In this section, we prove and . For simplicity, we denote \(\bphi_h^{\tau} = \bphi(s_h^{\tau}, a_h^{\tau})\). According to the robust regularized Bellman equation in , we first define the robust regularized Bellman operator: for any \((h,s,a) \in [H]\times\cS\times\cA\) and any function \(V:\cS\times\cA\rightarrow [0,H]\), \[\begin{align} \cT^{\lambda} _hV(s,a):= r_h(s, a) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big]. \label{def:32regularized32bellman32operator} \end{align}\tag{35}\] We have \(Q_h^{\pi, \lambda}(s,a) = \cT^{\lambda} _hV_{h+1}^{\pi, \lambda}(s,a)\).

10.1 Proof of↩︎

We start from bounding the suboptimality gap by the estimation uncertainty in the following Lemma.

Lemma 1. If the following inequality holds for any \((h,s,a) \in [H] \times \cS \times \cA\): \[\begin{align} |\mathcal{T}_h^{\lambda}\hat{V}^{\lambda}_{h+1}(s,a) - \la \bphi(s,a), \hat{\bw}_h^\lambda \ra| \leq \Gamma_h(s,a), \end{align}\] then we have \[\begin{align} \text{SupOpt}(\hat{\pi}, s, \lambda) \leq 2\sup_{P \in \mathcal{U}^\lambda(P^0)}\sum_{h=1}^H\EE^{\pi^*, P}\big[\Gamma_h(s_h, a_h)|s_1 = s\big]. \end{align}\]

10.1.1 Proof of - Case with the TV Divergence↩︎

Figure 5: Robust Regularized Pessimistic Value Iteration under TV distance (R2PVI-TV)

For completeness, we present R2PVI specific to the TV distance in , which gives a closed form solution of 12 . Now we present the upper bound of weights as follows.

Lemma 2. (Bound of weights - TV) For any \(h \in [H]\), we have \[\begin{align} \|\bw_h^\lambda\|_2 \leq H\sqrt{d}, \|\hat{\bw}_h^{\lambda}\|_2 \leq H\sqrt{\frac{Kd}{\gamma}}. \end{align}\]

Proof of - TV. The R2PVIwith TV-divergence is presented in . We derive the upper bound on the estimation uncertainty \(\Gamma_h(s,a)\) to prove the theorem. We first decompose the difference between the regularized robust bellman operator \(\mathcal{T}_h^{\lambda}\) and the empirical regularized robust bellman operator \(\hat{\mathcal{T}}_h^{\lambda}\) as \[\begin{align} &\big|\mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a)-\la \bphi(s,a), \hat{\bw}_h^\lambda \ra\big| \\&= \Big|\sum_{i=1}^d \phi_i(s,a)(w_{h,i}^\lambda - \hat{w}_{h,i}^{\lambda})\Big| \nonumber\\ &= \Big|\sum_{i =1}^d\phi_i(s,a)\mathbf{1}_i(\bw_h^\lambda - \hat{\bw}_{h}^{\lambda}) \Big|\nonumber\\ &= \Big| \gamma\sum_{i =1 }^d\phi_i(s,a)\mathbf{1}_i\bLambda_h^{-1}\bw_h^\lambda + \sum_{i =1 }^d\phi_i(s,a)\mathbf{1}_i\bLambda_h^{-1} \sum_{\tau =1}^K\bphi(s_h^\tau, a_h^{\tau})\eta_h^{\tau}([\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_{h+1}})\Big| \tag{36} \\ & \leq \gamma \sum_{i =1 }^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\underbrace{\|\bw_h^\lambda\|_{\bLambda_h^{-1}}}_{\text{(i)}} + \sum_{i =1 }^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} \underbrace{\|\sum_{\tau =1}^K\bphi(s_h^\tau, a_h^{\tau})\eta_h^{\tau}([\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_{h+1}})\|_{\bLambda_h^{-1}}}_{\text{(ii)}}, \tag{37} \end{align}\] where 36 comes from the definition of \(\hat{\bw}_h^\lambda\), while 37 follows by the Cauchy-Schwartz inequality. By and the fact that \(\hat{V}^{\lambda}_{h+1}(s) \leq H\) and \(\gamma = 1\), we have \[\begin{align} \text{(i)} = \|\bw_h^\lambda\|_{\bLambda_h^{-1}}\leq \|\bLambda_h^{-1}\|^{1/2} \|\bw_h^\lambda\|_2 \leq H\sqrt{d}, \end{align}\] where the last inequality comes from the fact that \(\|\bLambda_h^{-1}\| \leq \gamma^{-1}\). Now it remains to bound term (ii), as \(\hat{V}^{\lambda}_{h+1}\) depends on data, which makes it difficult to bound it directly by concentration equality. Instead, we consider focus on the function class \(\mathcal{V}_{h}(R_0, B_0, \gamma)\): \[\begin{align} \mathcal{V}_{h}(R_0, B_0, \gamma) = \{ V_h(x; \btheta, \beta, \bLambda): \mathcal{S} \rightarrow [0, H], \|\btheta\|_2 \leq R_0, \beta \in [0, B_0], \gamma_{\min}(\bLambda_h) \geq \gamma \}, \end{align}\] where \(V_h(x; \btheta, \beta, \bLambda) = \max_{a \in \mathcal{A}}[\bphi(s,a)^\top\btheta - \beta \sum_{i=1}^d\|\phi_i(s,a)\|_{\bLambda_h^{-1}}]_{[0, H-h+1]}\). By and the definition of \(\hat{V}_{h+1}^\lambda\), when we set \(R_0 = H\sqrt{Kd/\gamma}, B_0 = \beta = 16 Hd \sqrt{\xi_{\text{TV}}}\), it suffices to show that \(\hat{V}_{h+1}^\lambda \in \mathcal{V}_{h+1}(R_0, B_0, \gamma)\). Next we aim to find a union cover of the \(\mathcal{V}_{h+1}(R_0, B_0, \gamma)\), hence the term (ii) can be upper bounded. Let \(\mathcal{N}_h(\epsilon; R_0, B_0, \gamma)\) be the minimum \(\epsilon\)-cover of \(\mathcal{V}_h(R_0, B_0, \lambda)\) with respect to the supreme norm, \(\mathcal{N}_{h}([0, H])\) be the minimum \(\epsilon\)-cover of \([0, H]\) respectively. In other words, for any function \(V \in \mathcal{V}_{h}(R_0, B_0, \gamma), \alpha_{h+1} \in [0, H]\), there exists a function \(V' \in \mathcal{V}_{h}(R_0, B_0, \gamma)\) and a real number \(\alpha_{\epsilon} \in [0,H]\) such that: \[\begin{align} \sup_{s \in \mathcal{S}}|V(s) - V'(s)| \leq \epsilon, |\alpha_{\epsilon} - \alpha_{h+1}| \leq \epsilon. \end{align}\] By Cauchy-Schwartz inequality and the fact that \(\|a + b\|_{\bLambda_h^{-1}}^2 \leq 2\|a\|_{\bLambda_h^{-1}}^2 + 2\|b\|_{\bLambda_h^{-1}}^2\) and the definition of the term (ii), we have \[\begin{align} \text{(ii)}^2 &\leq 2 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + 2 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{h+1}} - [\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} \nonumber \\ & \leq 4 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([V'_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + 4 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}} - [V'_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + \frac{2\epsilon^2K^2}{\gamma}, \label{40eq46141} \end{align}\tag{38}\] where 38 follows by the fact that \[\begin{align} 2 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{h+1}} - [\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} \leq 2 \epsilon^2 \sum_{\tau = 1, \tau' = 1}^{K}|\bphi_h^\tau\bLambda_h^{-1}\bphi_h^{\tau\top}| \leq \frac{2\epsilon^2K^2}{\gamma}. \end{align}\] Meanwhile, by the fact that \(|[\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}} - [V'_{h+1}]_{\alpha_{\epsilon}}| \leq |\hat{V}^{\lambda}_{h+1} - V'_{h+1}|\), we have \[\begin{align} &4 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}} - [V'_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}}\\ &\leq 4 \sum_{\tau = 1, \tau' = 1}^{K}|\bphi_h^\tau\bLambda_h^{-1}\bphi_h^{\tau'}| \max|\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}} - [V'_{h+1}]_{\alpha_{\epsilon}})|^2 \nonumber\\ &\leq 4 \epsilon^2\sum_{\tau = 1, \tau' = 1}^{K}|\bphi_h^\tau\bLambda_h^{-1}\bphi_h^{\tau'}| \nonumber\\ &\leq \frac{4\epsilon^2K^2}{\gamma}. \label{TV-100} \end{align}\tag{39}\] By applying the (39 ) into (38 ), we have \[\begin{align} (\text{ii})^2 \leq 4 \sup_{V' \in \mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma), \alpha_{\epsilon} \in \mathcal{N}_{h}([0, H])}\Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([V'_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + \frac{6\epsilon^2K^2}{\gamma}. \label{TV-1} \end{align}\tag{40}\] By , applying a union bound over \(\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\) and \(\mathcal{N}_{h}([0, H])\), with probability at least \(1 - \delta / 2H\), we have \[\begin{align} &4 \sup_{V' \in \mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma), \alpha_{\epsilon} \in \mathcal{N}_{h}([0, H])}\Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([V'_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + \frac{6\epsilon^2K^2}{\gamma} \nonumber\\ &\leq 4 H^2\Big(2\log\frac{2H|\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\|\mathcal{N}_{h}([0, H])|}{\delta} + d\log(1+ K/\gamma)\Big) + \frac{6\epsilon^2K^2}{\gamma} . \label{TV-2} \end{align}\tag{41}\] Applying , we have \[\begin{align} \log|\mathcal{N}_h(\epsilon;R_0, B_0, \lambda)| &\leq d\log(1+4R_0/\epsilon) + d^2\log(1+8d^{1/2}B_0^2/\gamma \epsilon^2) \nonumber\\ &= d \log(1+ 4K^{3/2}d^{-1/2}) + d^2\log(1+ 8d^{-3/2}B_0^2K^2H^{-2}) \nonumber\\ & \leq 2d^2\log(1+8d^{-3/2}B_0^2K^2H^{-2}). \label{TV-3} \end{align}\tag{42}\] Similarly, by , we have \[\begin{align} |\mathcal{N}_h([0, H])| \leq 3H/\epsilon. \end{align}\] Combining (42 ) with (40 ) and (41 ), by setting \(\epsilon = dH/K\), we have \[\begin{align} \text{(ii)}^2 &\leq 4 H^2\Big(2\log\frac{2H|\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\|\mathcal{N}_{h}([0, H])|}{\delta} + d\log(1+ K/\gamma)\Big) + \frac{6\epsilon^2K^2}{\gamma} \\ &\leq 4H^2(4d^2 \log(1 + 8d^{-3/2} B_0^2 K^2 H^{-2}) + \log(3K/d) + d \log(1+ K) + 2\log 2H/\delta) + 6d^2H^2 \\ & \leq 16H^2d^2 (\log(1 + 8d^{-3/2} B_0^2 K^2 H^{-2}) + \log(1+K)/d + 3/8 + \log H/\delta) \\ & \leq 32H^2 d^2 \log 8d^{-3/2} B_0^2 K^2 H^{-1}/\delta \\ & = 32H^2 d^2 \log 1024Hd^{1/2} K^2 \xi_{\text{TV}} /\delta\\ &= 32H^2 d^2 (\log 1024Hd^{1/2} K^2/\delta + \log \xi_{\text{TV}}) \\ & \leq 64 H^2 d^2 \xi_{\text{TV}} := \frac{\beta^2}{4}. \end{align}\] Recall the upper bound in (37 ), we have with probability at least \(1- \delta\), \[\begin{align} &|\mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a)-\la \bphi(s,a), \hat{\bw}_h^\lambda \ra| \nonumber\\ & \leq \gamma \sum_{i =1 }^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\|\bw_h^\lambda\|_{\bLambda_h^{-1}} + \sum_{i =1 }^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big\|\sum_{\tau =1}^K\bphi(s_h^\tau, a_h^{\tau})\eta_h^{\tau}([\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_{\epsilon}})\Big\|_{\bLambda_h^{-1}} \nonumber\\ & \leq \sum_{i =1 }^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}(H\sqrt{d} + \beta/2) \nonumber\\ &\leq \beta\sum_{i =1 }^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} ,\label{C1462} \end{align}\tag{43}\] where 43 follows by the fact that \(2H\sqrt{d} \leq \beta\). Hence, the prerequisite is satisfied in , we can upper bound the suboptimality gap as: \[\begin{align} \text{SubOpt}(\hat{\pi}, s, \lambda) &\leq 2\sup_{P \in \mathcal{U}^\lambda(P^0)}\sum_{h=1}^H\EE^{\pi^*, P}\big[\Gamma_h(s_h, a_h)|s_1 = s\big] \\ & = 2\beta\cdot\sup_{P \in \mathcal{U}^\lambda(P^0)}\sum_{h=1}^H\EE^{\pi^*, P}\Big[\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}|s_1 = s\Big]. \end{align}\] This concludes the proof. ◻

10.1.2 Proof of - Case with KL Divergence↩︎

Figure 6: Robust Regularized Pessimistic Value Iteration under KL distance (R2PVI-KL)

For completeness, we present the R2PVIalgorithm specific to the KL distance in , which gives closed form solution of 15 . Our proof relies on the following lemmas on bounding the regression parameter and \(\epsilon\)-covering number of the robust value function class.

Lemma 3 (Bound of weights - KL). For any \(h \in [H]\), \[\begin{align} \|\bw^\lambda_h\|_2 \leq \sqrt{d}, \|\hat{\bw}'_h\|_2 \leq \sqrt{\frac{Kd}{\gamma}}. \end{align}\]

Lemma 4 (Bound of covering number - KL). For any \(h \in [H]\), let \(\mathcal{V}_h\) denote a class of functions mapping from \(\mathcal{S}\) to \(\RR\) with the following form: \[\begin{align} V_h(x; \btheta, \beta, \bLambda_h) = \max_{a \in \mathcal{A}} \Big\{ \bphi(s,a)^{\top} \btheta - \lambda \log\Big(1+\beta \sum_{i=1}^d\|\phi_i(\cdot, \cdot) \mathbf{1}_i\|_{\bLambda_h^{-1}}\Big) \Big\}_{[0, H-h+1]}, \end{align}\] the parameters \((\btheta, \beta, \bLambda_h)\) satisfy \(\|\btheta\|_2 \leq L, \beta \in [0, B], \gamma_{\min}(\bLambda_h) \geq \gamma\). Let \(\mathcal{N}_h(\epsilon)\) be the \(\epsilon\)-covering number of \(\mathcal{V}\) with respect to the distance \(\text{dist}(V_1, V_2) = \sup_x|V_1(x) - V_2(x)|\).Then \[\begin{align} \log|\mathcal{N}_h(\epsilon)| \leq d\log(1+4L/\epsilon) + d^2\log(1+8\lambda^2d^{1/2}B^2/\gamma \epsilon^2). \end{align}\]

Proof. The R2PVIwith KL-divergence is presented in . Similar to the proof of TV divergence, we decompose the estimation uncertainty between \(\mathcal{T}_h^{\lambda}\) and \(\hat{\mathcal{T}}_h^{\lambda}\) as: \[\begin{align} |\mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a)-\la \bphi(s,a), \hat{\bw}_h^\lambda \ra| &= \big|\bphi(s,a)^{\top}\big(\btheta_h - \lambda \log \bw_h^{\lambda} - \btheta_h + \lambda \log \max \{\hat{\bw}'_h, e^{-H/\lambda} \}\big)\big| \nonumber \\ &=\big |\bphi(s,a)^{\top}\big(\lambda \log \max \{\hat{\bw}'_h, e^{-H/\lambda} \} - \lambda \log \bw_h^{\lambda}\big)\big| \nonumber \\ &= \lambda \Big|\sum_{i=1}^d \phi_i(s,a) \log \frac{\max \{ \hat{w}'_{h,i}, e^{-H/\lambda}\}}{w_{h,i}^{\lambda}}\Big| \nonumber \\ &\leq \lambda \sum_{i=1}^d \phi_i(s,a) \Big|\log \frac{\max \{\hat{w}'_{h,i}, e^{-H/\lambda}\}}{w_{h,i}^{\lambda}} \Big| \nonumber \\ & \leq \lambda \sum_{i=1}^d \phi_i(s,a) \big|\log \big(1+ e^{H/\lambda}|\max \{\hat{w}'_{h,i}, e^{-H/\lambda}\} - w_{h,i}^{\lambda}|\big)\big| \tag{44}\\ & \leq \lambda \sum_{i=1}^d \phi_i(s,a) \log \big(1+ e^{H/\lambda}|\hat{w}'_{h,i} - w_{h,i}^{\lambda}|\big) \tag{45}\\ & \leq \lambda \log \Big(\sum_{i=1}^d\phi_i(s,a)+ e^{H/\lambda}\sum_{i=1}^d\phi_i(s,a)|\hat{w}'_{h,i} - w_{h,i}^{\lambda}|\Big) \tag{46} \\ & = \lambda \log\Big(1 + e^{H/\lambda}\sum_{i=1}^d\phi_i(s,a) \mathbf{1}_i^\top|\hat{\bw}'_h - \bw_h^\lambda|\Big),\nonumber \end{align}\] where 44 and 45 comes from the fact that: \[\begin{align} w_{h,i}^{\lambda} = \EE_{s' \sim \mu_{h,i}^0}\Big[e^{-\frac{\hat{V}_h^{\lambda}(s')}{\lambda}}\Big] \geq \EE_{s' \sim \mu_{h,i}^0}[e^{-H/\lambda}]=e^{-H/\lambda}, |\log A - \log B| = \log \Big(1 + \frac{|A-B|}{\min \{ A, B\} }\Big), \end{align}\] and 46 comes from the Jensen’s inequality applying to function \(\log(x)\). Therefore, our next goal is to bound the term \(\sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top|\hat{\bw}'_h - \bw_h^{\lambda}|\). Specifically, we have \[\begin{align} &\sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top|\hat{\bw}'_h - \bw_h^{\lambda}| \\&= \sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top\Big|\bw_h^\lambda - \bLambda_h^{-1}\sum_{\tau=1}^K(\bphi^{\tau}_h) e^{-\frac{\hat{V}^{\lambda}_{h+1}(s_{h+1})}{\lambda}}\Big| \\ &=\sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top\Big|\bw_h^\lambda - \bLambda_h^{-1}\sum_{\tau=1}^K\bphi^{\tau}_h(\bphi_h^\tau)^{\top}\bw_h^\lambda +\bLambda_h^{-1}\sum_{\tau=1}^K\bphi^{\tau}_h(\bphi_h^\tau)^{\top}\bw_h^\lambda -\bLambda_h^{-1}\sum_{\tau=1}^K(\bphi^{\tau}_h) e^{-\frac{\hat{V}_{h+1}^{\lambda}(s_{h+1})}{\lambda}}\Big| \\ &=\sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top\Big(\underbrace{\Big|\bw_h^\lambda - \bLambda_h^{-1}\sum_{\tau=1}^K\bphi^{\tau}_h(\bphi_h^\tau)^{\top}\bw_h^\lambda\Big|}_{\text{(i)}} +\underbrace{\Big|\bLambda_h^{-1}\sum_{\tau=1}^K \bphi^{\tau}_h (\bphi_h^\tau)^{\top} \bw_h^\lambda - \bLambda_h^{-1}\sum_{\tau=1}^K(\bphi^{\tau}_h) e^{-\frac{\hat{V}_{h+1}^{\lambda}(s_{h+1})}{\lambda}}\Big|}_{\text{(ii)}} \Big). \end{align}\] Next, we upper bound term (i) and (ii), respectively. For the first term, we have: \[\begin{align} \sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top \cdot \text{(i)} &= \sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top(|\bw_h^{\lambda} - \bLambda_h^{-1} (\bLambda_h - \gamma \mathbf{I})\bw_h^{\lambda}|) \nonumber\\ &=\gamma \sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top\bLambda_h^{-1}|\bw_h^{\lambda}| \nonumber\\ &\leq \gamma \sum_{i=1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\|\bw_h^{\lambda}\|_{\bLambda_h^{-1}} \tag{47}\\ &\leq \gamma \sqrt{d}\sum_{i=1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}},\tag{48} \end{align}\] where 47 follows from the Cauchy-Schwartz inequality, 48 follows from the fact that: \[\begin{align} \|\bw_h^\lambda\|_{\bLambda_h^{-1}} \leq \|\bLambda_h^{-1}\|^{1/2}\|\bw_h^\lambda\|_2 \leq \sqrt{d/\gamma}, \end{align}\] where the last inequality follows from and the fact that \(\|\bLambda_h^{-1}\| \leq \gamma^{-1}\). Now it remains to bound the term (ii), by the definition of \(\eta_h^{\tau}(f) = \EE_{s' \sim P_h^0(\cdot |s^{\tau}_h, a_h^{\tau})}[f(s')] - f(s_{h+1}^{\tau})\), the term (ii) can be rewritten as: \[\begin{align} \text{(ii)} &= \sum_{i=1}^d\phi_i(s,a) \mathbf{1}_i^\top \Big|\bLambda_h^{-1} \sum_{\tau=1}^K \bphi_h^\tau\Big[(\bphi_h^\tau)^{\top}\bw^\lambda_h - e^{-\frac{\hat{V}^{\lambda}_{h+1}(s_{h+1})}{\lambda}}\Big]\Big| \\ &=\sum_{i=1}^d\phi_i(s,a) \mathbf{1}_i^\top\Big| \bLambda_h^{-1} \sum_{\tau=1}^K \bphi_h^\tau\eta_h^{\tau}\Big(e^{-\frac{\hat{V}^{\lambda}_{h+1}(s)}{\lambda}}\Big)\Big| \\ & \leq \sum_{i=1}^d \|\phi_i(s,a) \mathbf{1}_i\|_{\bLambda_h^{-1}} \underbrace{\Big\|\sum_{\tau=1}^K \bphi^{\tau}_h \eta_h^{\tau}\Big(e^{-\frac{\hat{V}^{\lambda}_{h+1}(s)}{\lambda}}\Big)\Big\|_{\bLambda_h^{-1}}}_{\text{(iii)}}. \end{align}\] For the rest of the proof, it’s left to bound the term (iii). As the \(\hat{V}^{\lambda}_{h+1}\) depends on the offline dataset, which makes it difficult to upper bound directly from concentration equality due to the dependence issue, we seek for providing a uniform concentration bound applied to the term (iii), i.e. we aim to upper bound the following term: \[\begin{align} \sup_{V \in \mathcal{V}_{h+1}(R, B, \gamma)}\Big\|\sum_{\tau = 1}^{K} \bphi_{h}^{\tau}\eta_h^{\tau}(e^{-\frac{V}{\lambda}})\Big\|_{\bLambda_h^{-1}}. \end{align}\] Here for all \(h \in [H]\), the function class is defined as: \[\begin{align} &\mathcal{V}_h(R,B, \gamma) = \{ V_h(x; \btheta, \beta, \bLambda_h) : \|\btheta\|_2 \leq R, \beta \in [0,B], \gamma_{\min}(\bLambda_h) \geq \gamma \}, \end{align}\] where \(V_h(x; \btheta, \beta, \bLambda_h) = \max_{a \in \mathcal{A}} \{ \bphi(s,a)^{\top} \btheta - \lambda \log(1+\beta \sum_{i=1}^d\|\phi_i(\cdot, \cdot) \mathbf{1}_i\|_{\bLambda_h^{-1}}) \}_{[0, H-h+1]}\). In order to ensure \(\hat{V}^{\lambda}_{h+1} \in \mathcal{V}_{h+1}(R_0, B_0, \lambda)\), we need to bound \(\hat{\btheta}_h = \btheta_h - \lambda \log \max\{\hat{\bw}'_h, e^{-H/\lambda}\}\). Following the fact that: \[\begin{align} \|\hat{\btheta}_h\|_2 \leq \|\btheta_h\|_2 + \lambda \|\log \max\{\hat{\bw}'_h, e^{-H/\lambda}\}\|_2. \end{align}\] By , \(e^{-H/\lambda} \leq \max\{\hat{w}'_{h,i}, e^{-H/\lambda}\} \leq \max\{\|\hat{\bw}'_h\|, e^{-H/\lambda}\} \leq \max \{ \sqrt{Kd/\lambda},e^{-H/\lambda}\}\), therefore the term can be bounded as: \[\begin{align} \|\hat{\btheta}_h\|_2 &\leq \sqrt{d} + \lambda \sqrt{d} \max{\Big(\log\sqrt{\frac{Kd}{\lambda}}, H/\lambda\Big)} \nonumber\\ &\leq H\sqrt{d} + d \sqrt{K\lambda} \nonumber\\ &\leq 2Hd\sqrt{K\lambda}.\label{bound32of32weights32in32KL} \end{align}\tag{49}\] Hence, we can choose \(R_0 = 2Hd\sqrt{K\lambda}\) and \(B_0 = \beta = 16d\lambda e^{H/\lambda} \sqrt{(H/\lambda + \xi_{\text{KL}})}\), then we have for all \(h \in [H], \hat{V}^{\lambda}_{h+1} \in \mathcal{V}_{h+1}(R_0, B_0, \lambda)\). Next we aim to find a union cover of the \(\mathcal{V}_{h+1}(R_0, B_0, \gamma)\), hence the term (iii) can be upper bounded. For all \(\epsilon \in (0, \lambda), h \in [H]\), let \(\mathcal{N}_h(\epsilon; R, B, \lambda): = \mathcal{N}_h(\epsilon)\) denote the minimal \(\epsilon\)-cover of \(\mathcal{V}_h(R,B, \lambda)\) with respect to the supreme norm. In other words, for any function \(\hat{V}^{\lambda} \in \mathcal{V}_h(R, B, \lambda)\), there exists a function \(V' \in \mathcal{N}_{h+1}(\epsilon)\) such that \[\begin{align} \sup_{x \in \mathcal{S}}|\hat{V}^{\lambda}_{h+1}(x) - V'_{h+1}(x)| \leq \epsilon. \end{align}\] Hence, given \(\hat{V}^{\lambda}_{h+1}, V'_{h+1}\) satisfying the inequality above, recall the definition of \(\eta_h^\tau = \eta_h^{\tau}(f)= \EE_{s' \sim P_h^0(\cdot |s^{\tau}_h, a_h^{\tau})}[f(s')] - f(s_{h+1}^{\tau})\), we have: \[\begin{align} &\Big|\eta_h^{\tau}\Big(e^{-\frac{\hat{V}^{\lambda}_{h+1}(s)}{\lambda}}\Big) - \eta_h^{\tau}\Big(e^{-\frac{V'_{h+1}(s)}{\lambda}}\Big)\Big| \nonumber\\ &\leq \Big|\EE_{s \sim P_h^0(\cdot |s^{\tau}_h, a_h^{\tau})}\Big[e^{-\frac{\hat{V}^{\lambda}_{h+1}(s)}{\lambda}} - e^{-\frac{V'_{h+1}(s)}{\lambda}}\Big] - e^{-\frac{\hat{V}^{\lambda}_{h+1}(s_{h+1})}{\lambda}} + e^{-\frac{V'_{h+1}(s_{h+1})}{\lambda}}\Big| \nonumber\\ &\leq \Big|\EE_{s \sim P_h^0(\cdot |s^{\tau}_h, a_h^{\tau})}\Big[e^{-\frac{\hat{V}^{\lambda}_{h+1}(s)}{\lambda}} - e^{-\frac{V'_{h+1}(s)}{\lambda}}\Big]\Big| + \Big|e^{-\frac{\hat{V}^{\lambda}_{h+1}(s_{h+1})}{\lambda}} - e^{-\frac{V'_{h+1}(s_{h+1})}{\lambda}}\Big| \nonumber \\ &\leq 2\epsilon /\lambda + 2\epsilon / \lambda=4\epsilon /\lambda, \label{KL6} \end{align}\tag{50}\] where 50 follows from the fact that for any \(s \in \mathcal{S}\), \[\begin{align} \Big|e^{-\frac{\hat{V}^{\lambda}_{h+1}(s)}{\lambda}} - e^{-\frac{V'_{h+1}(s)}{\lambda}}\Big| \leq e^{\frac{|\hat{V}^{\lambda}_{h+1}(s)-V'_{h+1}(s)|}{\lambda}} - 1 \leq e^{\frac{\epsilon}{\lambda}}-1 \leq 2\epsilon /\lambda, \end{align}\] where the last inequality is held by the fact that \(\epsilon \in (0, \lambda)\). By the Cauchy-Schwartz inequality, for any two vectors \(a,b \in \RR^d\) and positive definite matrix \(\bLambda \in \RR^{d \times d}\), it holds that \(\|a+b\|^2_{\bLambda} \leq 2\|a\|_{\bLambda}^2 + 2\|b\|_{\bLambda}^2\), hence for all \(h \in [H]\), we have: \[\begin{align} |\text{(iii)}|^2 &\leq 2\Big\|\sum_{\tau = 1}^{K} \bphi_{h}^{\tau}\eta_h^{\tau}\Big(e^{-\frac{V'_{h+1}(s)}{\lambda}}\Big)\Big\|^2_{\bLambda_h^{-1}} + 2\Big\|\sum_{\tau = 1}^{K} \bphi_{h}^{\tau}\Big[\eta_h^{\tau}\Big(e^{-\frac{V'_{h+1}(s)}{\lambda}}\Big) - \eta_h^{\tau}\Big(e^{-\frac{\hat{V}^{\lambda}_{h+1}(s)}{\lambda}}\Big)\Big]\Big\|^2_{\bLambda_h^{-1}} \nonumber\\ &\leq 2\Big\|\sum_{\tau = 1}^{K} \bphi_{h}^{\tau}\eta_h^{\tau}\Big(e^{-\frac{V'_{h+1}(s)}{\lambda}}\Big)\Big\|_{\bLambda_h^{-1}}^2 + 32\epsilon^2/\lambda^2\sum_{\tau, \tau' =1}^K|\bphi_h^\tau\bLambda_h^{-1}\bphi_h^{\tau'}| \nonumber\\ &\leq 2\sup_{V \in \mathcal{N}_{h+1}(\epsilon)}\Big\|\sum_{\tau = 1}^{K} \bphi_{h}^{\tau}\eta_h^{\tau}\Big(e^{-\frac{V(s)}{\lambda}}\Big)\Big\|_{\bLambda_h^{-1}}^2 + \frac{32\epsilon^2K^2}{\lambda^2 \gamma}. \label{KL10} \end{align}\tag{51}\] We set \(f(s) = e^{-\frac{V(s)}{\lambda}}\), by applying , for any fixed \(h \in [H], \delta \in (0, 1)\), we have: \[\begin{align} P\Big(\sup_{V \in \mathcal{N}_{h+1}(\epsilon)}\Big\|\sum_{\tau = 1}^K \bphi_h^\tau\eta_{h}^{\tau}\Big(e^{-\frac{V(s)}{\lambda}}\Big)\Big\|_{\bLambda_h^{-1}}^2 \geq4\Big(2\log\frac{H |\mathcal{N}_{h+1}(\epsilon)|}{\delta}+ d\log\Big(1+ \frac{K}{\gamma}\Big)\Big)\Big) \leq \delta/H. \label{KL7} \end{align}\tag{52}\] Hence, combining 52 with 51 and let \(\gamma = 1\), then for all \(h \in [H]\), it holds that \[\begin{align} \Big\|\sum_{\tau = 1}^K \bphi_h^\tau\eta_{h}^{\tau}\Big(e^{-\frac{\hat{V}^{\lambda}_{h+1}(s)}{\lambda}}\Big)\Big\|_{\bLambda_h^{-1}}^2 \leq 8\Big(2\log\frac{H |\mathcal{N}_{h+1}(\epsilon)|}{\delta}+ d\log(1+ K) + \frac{4\epsilon^2K^2}{\lambda^2}\Big), \label{E465} \end{align}\tag{53}\] with probability at least \(1-\delta\). By , recall \(L = R_0 = 2Hd\sqrt{K\lambda}\) in this setting, we have \[\begin{align} \log(|\mathcal{N}_{h+1}(\epsilon)|) \leq d\log(1+ 4R_0/\epsilon) + d^2\log(1+ 8\lambda^2 d^{1/2}B^2/ \epsilon^2). \label{E466} \end{align}\tag{54}\] We then set \(\epsilon = d\lambda/{K} \in (0, \lambda)\) and define \(\beta' = \beta / \lambda e^{H/\lambda} = 16d \sqrt{(H/\lambda + \xi_{\text{KL}})}\) for brevity, then (54 ) can be bounded as: \[\begin{align} \log(|\mathcal{N}_{h+1}(\epsilon)|) &\leq d\log(1+ 4R_0K/d\lambda) + d^2\log(1+ 8\lambda^2 K^2d^{-3/2}e^{\frac{2H}{\lambda}}\beta'^2) \\ & = d \log (1+ 4R_0 K/d\lambda) + d^2 \log(e^{-2H/\lambda}+ 8\lambda^2 K^2d^{-3/2}\beta'^2) + 2d^2H/\lambda\\ &\leq 2d^2\log(8\lambda^2K^2d^{-3/2}\beta'^2) + 2d^2H/\lambda. \end{align}\] Therefore, by combining the result with the inequality (53 ), we can get \[\begin{align} & \Big\|\sum_{\tau = 1}^K \bphi_h^\tau\eta_{h}^{\tau}\Big(e^{-\frac{\hat{V}^{\lambda}_{h+1}(s)}{\lambda}}\Big)\Big\|_{\bLambda_h^{-1}}^2 \\ &\leq 8\Big(2\log\frac{H}{\delta} + 4d^2H/\lambda + 4d^2\log(8\lambda^2K^2d^{-3/2}\beta'^2) + 4d^2 + d\log(1+K)\Big) \nonumber\\ &\leq 8(4d^2H/\lambda + 4d^2\log(8\lambda^2K^3Hd^{-3/2}\beta'^2/\delta)) \tag{55} \\ & = 8(4d^2H/\lambda + 4d^2\log(8\lambda^2K^3Hd^{-3/2}/\delta) + 4d^2\log(\beta'^2))\nonumber\\ & \leq \beta'^2/4, \tag{56} \end{align}\] where 55 follows by the fact that \(2\log\frac{H}{\delta} + 4d ^2 + d\log(1+K) \leq 4d^2 \log (\frac{HK}{\delta})\), and 56 is held due to the fact that \[\begin{align} \beta'^2/4 &= 64d^2\Big(H/\lambda + \log\frac{1024d\lambda^2 K^3H}{\delta}\Big)\nonumber \\ &= 8\Big(8d^2 H /\lambda + 4d^2\log(8\lambda^2K^3Hd^{-3/2}/\delta)+4d^2\log\frac{1024d^{7/2} \lambda^2 K^3H}{\delta} + 4d^2 \log(128)\Big) \nonumber\\ &\geq 8\Big(4d^2 H /\lambda + 4d^2\log(8\lambda^2K^3Hd^{-3/2}/\delta) + 4d^2 \log(\beta'^2)\Big), \label{KL13} \end{align}\tag{57}\] where 57 holds by \[\begin{align} \log(\beta'^2) &= \log\Big(256d^2 \Big(H/\lambda + \log\frac{1024d\lambda^2 K^3H}{\delta}\Big) \Big) \\ & \leq \log(256d^2) + \Big(H/\lambda + \log\frac{1024d\lambda^2 K^3H}{\delta}\Big) \\ & \leq \log(128) + \log\frac{1024d^{7/2} \lambda^2 K^3H}{\delta} + H/\lambda. \end{align}\] By the bound on (i), (ii), (iii), for all \(h\in H\) and \((s,a) \in \mathcal{S} \times \mathcal{A}\), with probability at least \(1-\delta\), it holds that \[\begin{align} |\mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a)-\la \bphi(s,a), \hat{\bw}_h^\lambda \ra| &\leq\lambda \log\Big(1 + e^{H/\lambda}(\sqrt{d} +\beta'/2)\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big) \nonumber\\ &\leq\lambda \log\Big(1 + e^{H/\lambda}\beta'\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big) \tag{58}\\ & \leq \beta\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}},\tag{59} \end{align}\] where 58 follows by the fact that \(\beta' \geq 2\sqrt{d}\), 59 follows by the fact that \(\log(1+x) \leq x\) holds for any positive \(x\). Thus, by , we can upper bound the suboptimality gap as: \[\begin{align} \text{SubOpt}(\hat{\pi}_1, s) &\leq 2\sup_{P \in \mathcal{U}^\lambda(P^0)}\sum_{h=1}^H\EE^{\pi^*, P}[\Gamma_h(s_h, a_h)|s_1 = s] \\ & = 2\beta\sup_{P \in \mathcal{U}^\lambda(P^0)}\sum_{h=1}^H\EE^{\pi^*, P}\Big[\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}|s_1 = s\Big]. \end{align}\] Therefore, we conclude the proof. ◻

10.1.3 Proof of - Case with \(\chi^2\) Divergence↩︎

Figure 7: Robust Regularized Pessimistic Value Iteration under \(\chi^2\) distance (R2PVI-\(\chi^2\))

For completeness, we present the R2PVIalgorithm specific to the \(\chi^2\) distance in , which gives closed form solution of 17 and 18 . Before the proof, we first present the bound on weights under \(\chi^2\)-divergence:

Lemma 5 (Bound of weights - \(\chi^2\)). For any \(h \in [H]\), \[\begin{align} \|\hat{\bw}_h^{\lambda}\|_2 \leq \sqrt{d}\Big(H + \frac{H^2}{2\lambda}\Big). \end{align}\]

Proof of - \(\chi^2\). The R2PVIwith \(\chi^2\)-divergence is presented in . By the definition of \(\mathcal{T}_h^{\lambda}, \hat{\mathcal{T}}_h^{\lambda}\), we have \[\begin{align} &\mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a)-\la \bphi(s,a), \hat{\bw}_h^\lambda \ra \nonumber\\ &= \bphi(s,a)^{\top}(\btheta_h + \bw_h^{\lambda} - \btheta_h - \hat{\bw}_h^{\lambda}) \nonumber\\ &= \sum_{i=1}^d \phi_i(s,a)(w_{h, i}^{\lambda} - \hat{w}'_{h,i}) \nonumber\\ & = \sum_{i=1}^d \phi_i(s,a)\Big[ \sup_{\alpha \in [0, H]}\Big\{\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha} + \frac{1}{4\lambda}(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha})^2 - \frac{1}{4\lambda}\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha}^2 \Big\} \nonumber\\ & \quad - \sup_{\alpha \in [0, H]}\Big\{\hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha} + \frac{1}{4\lambda}(\hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha})^2 - \frac{1}{4\lambda}\hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha}^2 \Big\} \Big]. \label{chi2-first32decompose} \end{align}\tag{60}\] To continue, for any \(i \in [d]\), we denote \[\begin{align} \alpha_i = \argmax_{\alpha \in [0, H]}\Big\{\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha} + \frac{1}{4\lambda}(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha})^2 - \frac{1}{4\lambda}\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha}^2 \Big\}. \end{align}\] Hence, 60 can be further upper bounded as \[\begin{align} &\mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a)-\la \bphi(s,a), \hat{\bw}_h^\lambda \ra \nonumber \\ &\leq \underbrace{\sum_{i=1}^d \phi_i(s,a)\big(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} - \hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}\big)\Big(\frac{1}{4\lambda}\big(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} + \hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}\big)+1\Big)}_{\text{(i)}} \nonumber \\ & \quad- \underbrace{\sum_{i=1}^d \phi_i(s,a)\big(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]^2_{\alpha_i} - \hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]^2_{\alpha_i}\big)}_{\text{(ii)}}. \label{upper32bound32for32suboptimality} \end{align}\tag{61}\] Next, we bound (i) and (ii), respectively.

10.1.3.1 Bounding term (i).

We define \[\begin{align} \tilde{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^\lambda(s)]_{\alpha} &= \Big[\argmin_{\bw \in R^d} \sum_{\tau=1}^K ([\hat{V}_{h+1}^{\lambda}(s_{h+1}^\tau)]_{\alpha} - \bphi(s_h^\tau, a_h^{\tau})^{\top}\bw)^2 + \gamma \| \bw \|_2^2 \Big]^i. \end{align}\] Considering the gap between the \(\hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}\) and \(\tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}\) due to the definition that \(\hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} = [\tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}]_{[0,H]}\), we eliminate the clip operator at first. We rewrite (i) as follows: \[\begin{align} \text{(i)} &= \sum_{i=1}^d \phi_i(s,a)\big(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} - \hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}\big)\Big(\frac{1}{4\lambda}\big(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} + \hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}\big)+1\Big) \\ & = \sum_{i=1}^d \phi_i(s,a)(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} - \tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}) \\ &\qquad \times\underbrace{\Big(\frac{1}{4\lambda}\big(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} + \hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}\big)+1\Big) \frac{\EE^{\mu_{h, i}^0}[\hat{V}_{h+1}^\lambda(s)]_{\alpha_i} -\hat{\EE}^{\mu_{h, i}^0}[\hat{V}_{h+1}^\lambda(s)]_{\alpha_i}}{\EE^{\mu_{h, i}^0}[\hat{V}_{h+1}^\lambda(s)]_{\alpha_i}- \tilde{\EE}^{\mu_{h, i}^0}[\hat{V}_{h+1}^\lambda(s)]_{\alpha_i}}}_{:=C_i}. \end{align}\] We claim that \(|C_i| \leq 1 + H/2\lambda, \forall i \in [H]\). We prove the claim by discussing the value of \(\tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}\) in the following three cases:

10.1.3.2 Case I.

\(\tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} \leq 0\). By the fact that \(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_{i}} \leq H\), we have: \[\begin{align} |C_i| &= \Bigg|\Big( \frac{1}{4\lambda}\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} + 1 \Big) \frac{\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}}{\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} - \tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}} \Bigg| \leq 1 + H/4\lambda, \end{align}\] where the equality holds by \(\frac{1}{4\lambda}\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} + 1 \leq 1 + H/4\lambda\). Hence the claim holds by Case I.

10.1.3.3 Case II.

\(0\leq \tilde{\EE}^{\mu_{h,i}^0} [\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} \leq H\). The claim holds trivially, as we have: \[\begin{align} |C_i| = \frac{1}{4\lambda}\big(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} + \tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}\big)+1 \leq 1 + H/2\lambda. \end{align}\] Hence, we conclude the claim.

10.1.3.4 Case III.

\(\tilde{\EE}^{\mu_{h,i}^0} [\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} > H\). Notice that \[\begin{align} |C_i| &= \Bigg|\Big(\frac{1}{4\lambda}\big(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} + H\big) + 1 \Big) \frac{\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} - H}{\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} - \tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}} \Bigg| \nonumber\\ & = \Big(\frac{1}{4\lambda}\big(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} + H\big) + 1 \Big)\frac{H - \EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}}{\tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} - \EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i}} \nonumber\\ &\leq H/2\lambda + 1, \label{chi24610} \end{align}\tag{62}\] where 62 holds by the fact that \(\tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} >H\).

With the upper bound for \(C_i\), we can upper bound (i) as \[\begin{align} |\text{(i)}| & = \Big|\sum_{i=1}^d \phi_i(s,a)(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i} - \tilde{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]_{\alpha_i})C_i\Big| \nonumber \\ &= \Big|\gamma \sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i\bLambda_h^{-1}\EE^{\bmu_h^0}[\hat{V}^{\lambda}_h(s)]_{\alpha_i}C_i + \sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i\bLambda_h^{-1}\sum_{\tau = 1}^K\bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau} ([\hat{V}^{\lambda}_{h+1}]_{\alpha_i})C_i \Big| \nonumber\\ &\leq (1 + H/2\lambda)\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big(\gamma H +\Big\|\sum_{\tau = 1}^K\bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_i})\Big\|_{\bLambda_{h}^{-1}}\Big). \label{upper32bound32for3240i41} \end{align}\tag{63}\]

10.1.3.5 Bounding term (ii).

Similar to bounding (i), we can deduce that: \[\begin{align} |\text{(ii)}| & = \Big |\sum_{i=1}^d \phi_i(s,a)(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]^2_{\alpha_i} - \hat{\EE}^{\mu_{h,i}^0}[\hat{V}^{\lambda}_{h+1}(s)]^2_{\alpha_i}) \Big | \nonumber \\ &\leq \gamma H ^2\sum_{i = 1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} + \sum_{i=1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big\|\sum_{\tau=1}^{K} \bphi(s_h^\tau, a_h^{\tau})\eta_h^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_i}^2)\Big\|_{\bLambda_h^{-1}}, \label{upper32bound32for3240ii41} \end{align}\tag{64}\] where 64 follows by the Cauchy Schwartz inequality and the fact that \(\EE_{s \sim \mu_{h,i}^0}[\hat{V}^{\lambda}_h(s)]_{\alpha_i}^2 \leq H^2, \forall i \in [d]\). Hence combining 61 , 63 and 64 , we have \[\begin{align} &\mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a) -\la \bphi(s,a), \hat{\bw}_h^\lambda \ra \\ & \leq (1 + H/2\lambda)\Big(\gamma H \sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} +\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big\|\sum_{\tau = 1}^K\bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha'_i})\Big\|_{\bLambda_{h}^{-1}} \Big) \\ &\quad + \gamma H ^2\sum_{i = 1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} + \sum_{i=1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big\|\sum_{\tau=1}^{K} \bphi(s_h^\tau, a_h^{\tau})\eta_h^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha'_i}^2)\Big\|_{\bLambda_h^{-1}}. \end{align}\] On the other hand, we can similarly deduce that there exists \(\alpha'_i\) s.t. \[\begin{align} &\la \bphi(s,a), \hat{\bw}_h^\lambda \ra- \mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a) \\ & \leq (1 + H/2\lambda)\Big(\gamma H \sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} +\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big\|\sum_{\tau = 1}^K\bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha'_i})\Big\|_{\bLambda_{h}^{-1}} \Big)\\ &\quad + \gamma H ^2\sum_{i = 1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} + \sum_{i=1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big\|\sum_{\tau=1}^{K} \bphi(s_h^\tau, a_h^{\tau})\eta_h^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha'_i}^2)\Big\|_{\bLambda_h^{-1}}. \end{align}\] Then for all \(i \in [d]\), there exists \(\hat{\alpha}_i \in \{ \alpha_i, \alpha'_i \}\), such that \[\begin{align} &| \mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a) -\la \bphi(s,a), \hat{\bw}_h^\lambda \ra| \\& \leq (1 + H/2\lambda)\Big(\gamma H \sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big)+ \gamma H ^2\sum_{i = 1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} \\ &\quad +\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big(( 1+ H/2\lambda) \Big\|\sum_{\tau = 1}^K\bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{ \hat{\alpha}_i})\Big\|_{\bLambda_{h}^{-1}} + \Big\|\sum_{\tau=1}^{K} \bphi(s_h^\tau, a_h^{\tau})\eta_h^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\hat{\alpha}_i}^2)\Big\|_{\bLambda_h^{-1}} \Big). \end{align}\] Now it remains to bound the terms \[\begin{align} \underbrace{\Big\|\sum_{\tau = 1}^K\bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\hat{\alpha}_i})\Big\|_{\bLambda_{h}^{-1}}}_{(\text{iii})}~\text{and}~\underbrace{\Big\|\sum_{\tau=1}^{K} \bphi(s_h^\tau, a_h^{\tau})\eta_h^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\hat{\alpha}_i}^2)\Big\|_{\bLambda_h^{-1}}}_{\text{(iv)}}. \end{align}\] Similar to the proof in KL divergence, we aim to find a union function class \(\mathcal{V}_{h+1}(R_0, B_0, \lambda)\), which holds uniformly that \(\hat{V}^{\lambda}_{h+1} \in \mathcal{V}_{h+1}(R_0, B_0, \lambda)\), here for all \(h \in [H]\), the function class is defined as: \[\begin{align} \mathcal{V}_{h}(R_0, B_0, \lambda) = \{ V_h(x; \btheta, \beta, \bLambda): \mathcal{S} \rightarrow [0, H], \|\btheta\|_2 \leq R_0, \beta \in [0, B_0], \gamma_{\min}(\bLambda_h) \geq \gamma \}, \end{align}\] where \(V_h(x; \btheta, \beta, \bLambda) = \max_{a \in \mathcal{A}}[\bphi(s,a)^\top\btheta - \beta \sum_{i=1}^d\|\phi_i(s,a)\|_{\bLambda_h^{-1}}]_{[0, H-h+1]}\). By , when we set \(R_0 = \sqrt{d}(H + H^2/2\lambda), B_0 = \beta = 8 dH^2(1+ 1/\lambda) \sqrt{\xi_{\chi^2}}\), it suffices to show that \(\hat{V}_{h+1}^\lambda \in \mathcal{V}_{h+1}(R_0, B_0, \gamma)\). Next we aim to find a union cover of the \(\mathcal{V}_{h+1}(R_0, B_0, \gamma)\), hence the term (iii) and (iv) can be upper bounded. Let \(\mathcal{N}_h(\epsilon; R_0, B_0, \gamma)\) be the minimum \(\epsilon\)-cover of \(\mathcal{V}_h(R, B, \lambda)\) with respect to the supreme norm, \(\mathcal{N}_{h}([0, H])\) be the minimum \(\epsilon\)-cover of \([0, H]\) respectively. In other words, for any function \(V \in \mathcal{V}_{h}(R, B, \lambda), \alpha \in [0, H]\), there exists a function \(V' \in \mathcal{V}_{h}(R, B, \lambda)\) and a real number \(\alpha_{\epsilon} \in [0,H]\) such that: \[\begin{align} \sup_{s \in \mathcal{S}}|V(s) - V'(s)| \leq \epsilon, |\alpha - \alpha_{\epsilon}| \leq \epsilon. \end{align}\] Recall the definition of (iii) and (iv). By Cauchy-Schwartz inequality and the fact that \(\|a + b\|_{\bLambda_h^{-1}}^2 \leq 2\|a\|_{\bLambda_h^{-1}}^2 + 2\|b\|_{\bLambda_h^{-1}}^2\), we have \[\begin{align} \text{(iii)}^2 &\leq 2 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + 2 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\hat{\alpha}} - [\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} \nonumber \\ & \leq 4 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([V'_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + 4 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}} - [V'_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + \frac{2\epsilon^2K^2}{\gamma}, \label{40chi2eq46141} \end{align}\tag{65}\] where 65 follows by the fact that \[\begin{align} 2 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\hat{\alpha}} - [\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} \leq 2 \epsilon^2 \sum_{\tau = 1, \tau' = 1}^{K}|\bphi_h^\tau\bLambda_h^{-1}\bphi_h^{\tau'}| \leq \frac{2\epsilon^2K^2}{\gamma}. \end{align}\] Meanwhile, by the fact that \(|[\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}} - [V'_{h+1}]_{\alpha_{\epsilon}}| \leq |\hat{V}^{\lambda}_{h+1} - V'_{h+1}|\), we have \[\begin{align} &4 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}} - [V'_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} \\ &\leq 4 \sum_{\tau = 1, \tau' = 1}^{K}|\bphi_h^\tau\bLambda_h^{-1}\bphi_h^{\tau'}| \max|\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}} - [V'_{h+1}]_{\alpha_{\epsilon}})|^2 \\ &\leq 4 \epsilon^2\sum_{\tau = 1, \tau' = 1}^{K}|\bphi_h^\tau\bLambda_h^{-1}\bphi_h^{\tau'}| \\ &\leq \frac{4\epsilon^2K^2}{\gamma}. \end{align}\] By applying the above two inequalities and the union bound into (65 ), we have \[\begin{align} (\text{iii})^2 \leq 4 \sup_{V' \in \mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma), \alpha_{\epsilon} \in \mathcal{N}_{h}([0, H])}\Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([V'_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + \frac{6\epsilon^2K^2}{\gamma}. \end{align}\] By , applying a union bound over \(\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\) and \(\mathcal{N}_{h}([0, H])\), with probability at least \(1 - \delta / 2H\), we have \[\begin{align} &4 \sup_{V' \in \mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma), \alpha_{\epsilon} \in \mathcal{N}_{h}([0, H])}\Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([V'_{h+1}]_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + \frac{6\epsilon^2K^2}{\gamma} \\ &\leq 4 H^2\Big(2\log\frac{2H|\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\|\mathcal{N}_{h}([0, H])|}{\delta} + d\log(1+ K/\gamma)\Big) + \frac{6\epsilon^2K^2}{\gamma} . \end{align}\] Similarly, by the fact that \(\|a + b\|_{\bLambda_h^{-1}}^2 \leq 2\|a\|_{\bLambda_h^{-1}}^2 + 2\|b\|_{\bLambda_h^{-1}}^2\), noticing (iv) has the almost same form as (iii), we have \[\begin{align} \text{(iv)}^2 &\leq 2 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]^2_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + 2 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]^2_{\hat{\alpha}} - [\hat{V}^{\lambda}_{h+1}]^2_{\alpha_{\epsilon}})\Big \|^2_{\bLambda_h^{-1}} \nonumber \\ & \leq 4 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([V'_{h+1}]^2_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} + \frac{24H^2\epsilon^2K^2}{\gamma}, \label{eq1} \end{align}\tag{66}\] where 66 follows by the fact that \[\begin{align} 2 \Big\|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]^2_{\hat{\alpha}} - [\hat{V}^{\lambda}_{h+1}]^2_{\alpha_{\epsilon}})\Big\|^2_{\bLambda_h^{-1}} &\leq 8H^2 \epsilon^2 \sum_{\tau = 1, \tau' = 1}^{K}|\bphi_h^\tau\bLambda_h^{-1}\bphi_h^{\tau'}| \leq \frac{8H^2\epsilon^2K^2}{\gamma}, \end{align}\] and \[\begin{align} & 4 \|\sum_{\tau = 1}^K \bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]^2_{\alpha_{\epsilon}} - [V'_{h+1}]^2_{\alpha_{\epsilon}})\|^2_{\bLambda_h^{-1}} \\ &\leq 4 \sum_{\tau = 1, \tau' = 1}^{K}|\bphi_h^\tau\bLambda_h^{-1}\bphi_h^{\tau'}| \max|\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\alpha_{\epsilon}} - [V'_{h+1}]_{\alpha_{\epsilon}})|^2 \\ &\leq 16 H^2 \epsilon^2\sum_{\tau = 1, \tau' = 1}^{K}|\bphi_h^\tau\bLambda_h^{-1}\bphi_h^{\tau'}| \\ &\leq \frac{16H^2\epsilon^2K^2}{\gamma}. \end{align}\] We apply the union bound and , with probability at least \(1 - \delta/2H\) \[\begin{align} \text{(iv)}^2 & \leq 4 H^4\Big(2\log\frac{2H|\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\|\mathcal{N}_{h}([0, H])|}{\delta} + d\log(1+ K/\gamma)\Big) + \frac{24H^2\epsilon^2K^2}{\gamma}. \end{align}\] Therefore, with probability at least \(1 - \delta\), \[\begin{align} &| \mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a) -\la \bphi(s,a), \hat{\bw}_h^\lambda \ra| \\&\leq \gamma H(1 + H/2\lambda) \sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}+ \gamma H ^2\sum_{i = 1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} \nonumber \\ &\quad+\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Big(( 1+ H/2\lambda) \Big\|\sum_{\tau = 1}^K\bphi(s_h^\tau, a_h^{\tau})\eta_{h}^{\tau}([\hat{V}^{\lambda}_{h+1}]_{ \hat{\alpha}_i})\Big\|_{\bLambda_{h}^{-1}} + \Big\|\sum_{\tau=1}^{K} \bphi(s_h^\tau, a_h^{\tau})\eta_h^{\tau}([\hat{V}^{\lambda}_{h+1}]_{\hat{\alpha}_i}^2)\Big\|_{\bLambda_h^{-1}} \Big) \nonumber \\ &\leq\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}\Bigg[\gamma H(1 + H + H/2\lambda) \nonumber \\ &\quad+ (1 + H/2\lambda)\sqrt{4 H^2\Big(2\log\frac{2H|\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\|\mathcal{N}_{h}([0, H])|}{\delta} + d\log(1+ K/\gamma)\Big) + \frac{6\epsilon^2K^2}{\gamma}} \nonumber\\ & \quad+ \sqrt{4H^4\Big(2\log\frac{2H|\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\|\mathcal{N}_{h}([0, H])|}{\delta} + d\log(1+ K/\gamma)\Big) + \frac{24H^2\epsilon^2K^2}{\gamma}} \Bigg]. \label{40chi241} \end{align}\tag{67}\] By the fact that \(R_0 = \sqrt{d}(H + H^2/2\lambda)\), and , we can upper bound the term \(|\mathcal{N}_h(\epsilon; R_0,B_0,\gamma)|\) and \(|\mathcal{N}_{h}([0,H])|\) as follows: \[\begin{align} \log|\mathcal{N}_h(\epsilon; R_0,B_0,\gamma)| \leq d\log(1+ 4R_0/\epsilon) + d^2\log(1+ 8d^{1/2}B^2/\gamma \epsilon^2), \log |\mathcal{N}_{h}([0,H])| \leq \log(3H/\epsilon). \end{align}\] We set \(\epsilon = \frac{1}{K}, \gamma = 1\), with the upper bound above, we have \[\begin{align} &4 H^2\Big(2\log\frac{2H|\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\|\mathcal{N}_{h}([0, H])|}{\delta} + d\log(1+ K/\gamma)\Big) + \frac{6\epsilon^2K^2}{\gamma} \\ &\leq 4 H^2\Big(2\log\frac{6H^2K}{ \delta} + d\log(1 + K) + d\log(1+d^{1/2}(1+H/2\lambda)K) + d^2\log(1 + d^{1/2}B^2K^2) + 3/2\Big) \\ &\leq 4 H^2 \Big(2d\log\frac{6H^2K}{ \delta} + 2d\log(K) + d\log(d^{1/2}(1+H/2\lambda)K) + 2d^2 \log d^{1/2}B^2K^2\Big) \\ & \leq 8H^2d^2\Big(\log\frac{6H^2K}{ \delta} + \log(K) + \log(d^{1/2}(1+H/2\lambda)K) + \log 8d^{1/2}B^2K^2\Big) \\ & = 8H^2d^2 (\log 48 K^5 H^2 B^2 d (1 + H/2\lambda)/ \delta). \end{align}\] Similarly, we can upper bound the third term in (67 ) as follows: \[\begin{align} &4H^4\Big(2\log\frac{2H|\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\|\mathcal{N}_{h}([0, H])|}{\delta} + d\log(1+ K/\gamma)\Big) + \frac{24H^2\epsilon^2K^2}{\gamma} \\ & = H^2 \Big( 4 H^2(2\log\frac{2H|\mathcal{N}_{h}(\epsilon; R_0, B_0, \gamma)\|\mathcal{N}_{h}([0, H])|}{\delta} + d\log(1+ K/\gamma)) + \frac{24\epsilon^2K^2}{\gamma}\Big) \\ &\leq 8H^4d^2 (\log 48 K^5H^2 B^2 d (1 + H/2\lambda)/ \delta). \end{align}\] Hence, we apply this bound into the (67 ), we have \[\begin{align} &|\la \bphi(s,a), \hat{\bw}_h^\lambda \ra- \mathcal{T}_h^{\lambda} \hat{V}^{\lambda}_{h+1}(s,a)| \nonumber\\ & \leq \sum_{i=1}^d \|\bphi(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}(H(1 + H + H/2\lambda) \notag\\ &\qquad + (1+ H/2\lambda + H)\sqrt{8H^2d^2 (\log 48 K^5H^2 B^2 d (1 + H/2\lambda)/ \delta)}) \nonumber\\ &\leq \sum_{i=1}^d \|\bphi(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}2(H/\lambda + H)\sqrt{8H^2d^2 (\log 48 K^5H^2 B^2 d (1 + H/2\lambda)/ \delta)} \tag{68}\\ & \leq \sum_{i=1}^d \|\bphi(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} 2(H/\lambda + H)Hd \sqrt{8(\log 192 K^5H^6 d^3 (1 + H/2\lambda)^3/ \delta) + \log \xi_{\chi^2})} \nonumber\\ &= \sum_{i=1}^d \|\bphi(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} 2(H/\lambda + H)Hd \sqrt{8(\xi_{\chi^2} + \log \xi_{\chi^2})} \nonumber\\ & \leq \beta\sum_{i=1}^d\|\bphi(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}} ,\tag{69} \end{align}\] where 68 comes from the fact that \(1+ 1/\lambda \leq 1 + H/2\lambda\), the 69 comes from the fact that \(\log \xi_{\chi^2} \leq \xi_{\chi^2}\). Hence, the prerequisite is satisfied in , we can upper bound the suboptimality gap as: \[\begin{align} \text{SubOpt}(\hat{\pi}, s, \lambda) &\leq 2\sup_{P \in \mathcal{U}^\lambda(P^0)}\sum_{h=1}^H\EE^{\pi^*, P} \big[\Gamma_h(s_h, a_h)|s_1 = s\big] \\ & = 2\beta\sup_{P \in \mathcal{U}^\lambda(P^0)}\sum_{h=1}^H \EE^{\pi^*, P}\Big[\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\bLambda_h^{-1}}|s_1 = s\Big]. \end{align}\] This concludes the proof. ◻

10.2 Proof of↩︎

Proof. The proof follows the argument in (F.15) and (F.16) of [8]. Specifically, we denote \[\begin{align} &\bLambda_{h, i}^P=\mathbb{E}^{\pi^*, P}\Big[(\phi_i(s_h, a_h) \mathbf{1}_i)(\phi_i(s_h, a_h) \mathbf{1}_i)^{\top}\big|s_1=s\Big], \quad \forall(h, i, P) \in[H] \times[d] \times \mathcal{U}^\lambda(P^0). \label{definition32of32Lambda95hi} \end{align}\tag{70}\] By , setting \(\gamma = 1\), we have \[\begin{align} & \sup _{P \in \mathcal{U}^\lambda(P^0)} \sum_{h=1}^H \mathbb{E}^{\pi^*, P}\bigg[\sum_{i=1}^d\|\phi_i(s_h, a_h) \mathbf{1}_i\|_{\bLambda_h^{-1}} \big| s_1=s\bigg] \nonumber\\ &= \sup _{P \in \mathcal{U}^\lambda(P^0)} \sum_{h=1}^H \sum_{i=1}^d \mathbb{E}^{\pi^*, P}\Big[\sqrt{\operatorname{Tr}\big((\phi_i(s_h, a_h) \mathbf{1}_i)(\phi_i(s_h, a_h) \mathbf{1}_i)^{\top} \bLambda_h^{-1}\big)} \big | s_1=s\Big] \nonumber\\ & \leq \sup _{P \in \mathcal{U}^\lambda(P^0)} \sum_{h=1}^H \sum_{i=1}^d \sqrt{\operatorname{Tr}\big(\mathbb{E}^{\pi^*, P}\big[(\phi_i(s_h, a_h) \mathbf{1}_i)(\phi_i(s_h, a_h) \mathbf{1}_i)^{\top} | s_1=s\big] \bLambda_h^{-1}\big)} \tag{71}\\ & \leq \sup_{P \in \mathcal{U}^\lambda(P^0)} \sum_{h=1}^H \sum_{i=1}^d \sqrt{\operatorname{Tr}\big(\bLambda_{h, i}^P \cdot \big(\mathbf{I}+K \cdot c^\dagger \cdot \bLambda_{h, i}^P\big)^{-1}\big)} \tag{72}\\ & =\sup _{P \in \mathcal{U}^\lambda(P^0)} \sum_{h=1}^H \sum_{i=1}^d \sqrt{\frac{( \EE^{\pi^*,P}[\phi_i(s_h, a_h) | s_1=s])^2}{1+c^\dagger \cdot K \cdot(\EE^{\pi^*,P}[\phi_i(s_h, a_h) | s_1=s])^2}} \nonumber\\ & \leq \sup _{P \in \mathcal{U}^\lambda\left(P^0\right)} \sum_{h=1}^H \sum_{i=1}^d \sqrt{\frac{1}{c^\dagger \cdot K}} \tag{73}\\ & =\frac{d H}{ \sqrt{c^\dagger K}},\nonumber \end{align}\] where (71 ) is due to the Jensen’s inequality, (72 ) holds by the definition in (70 ) and , (73 ) holds by the fact that the only nonzero element of \(\bLambda_{h, i}^P\) is the \(i\)-th diagonal element. Thus, by , with probability at least \(1-\delta\), for any \(s \in \cS\) the suboptimality can be upper bounded as: \[\begin{align} \text{SubOpt}(\hat{\pi}, s, \lambda) \leq \beta \sup_{P \in \mathcal{U}^\lambda(P^0)}\sum_{h=1}^H\EE^{\pi^*, P}\Big[\sum_{i=1}^d \| \phi_i(s,a)\mathbf{1}_i\|_{{\bLambda}_h^{-1}}\big|s_1 = s\Big] \leq \frac{\beta dH}{\sqrt{c^\dagger K}}, \end{align}\] where \[\begin{align} \beta = \begin{cases} 16 Hd \sqrt{\xi_{\text{TV}}}, & \text{if D is TV;} \\ 16d\lambda e^{H/\lambda} \sqrt{(H/\lambda + \xi_{\text{KL}})}, & \text{if D is KL;} \\ 8 dH^2(1+ 1/\lambda) \sqrt{\xi_{\chi^2}}, &\text{if D is \chi^2.} \end{cases} \end{align}\] Hence, we conclude the proof. ◻

11 Proof of the Information-Theoretic Lower Bound↩︎

In this section, we prove the information-theoretic lower bound. We first introduce the construction of hard instances in , then we prove in

11.1 Construction of Hard Instances↩︎

We design a family of \(d\)-rectangular linear RRMDPs parameterized by a Boolean vector \(\bxi=\{\bxi_h\}_{h\in[H]}\), where \(\bxi_h\in\{-1,1\}^d\). For a given \(\bxi\) and regularizer \(\lambda\), the corresponding \(d\)-rectangular linear RRMDP \(M_{\bxi}^\rho\) is constructed as follows. The state space \(\cS=\{x_1, x_2\}\) and the action space \(\cA=\{0,1\}^d\). The initial state distribution \(\mu_0\) is defined as \[\begin{align} \mu_0(s_1)=\frac{d+1}{d+2}\quad \text{and} \quad \mu_0(x_2)=\frac{1}{d+2}. \end{align}\] The feature mapping \(\bphi:\cS\times\cA\rightarrow\RR^{d+2}\) is defined as \[\begin{align} &\bphi(s_1,a)^{\top}=\Big(\frac{a_1}{d}, \frac{a_2}{d}, \cdots, \frac{a_d}{d}, 1-\sum_{i=1}^d\frac{a_i}{d}, 0 \Big),\\ &\bphi(s_2,a)^{\top}=\big(0, 0, \cdots, 0, 0, 1\big), \end{align}\] which satisfies \(\phi_i(s,a)\geq 0\) and \(\sum_{i=1}^d\phi_i(s,a)=1\). The nominal distributions \(\{\bmu_h^0\}_{h\in[H]}\) are defined as \[\begin{align} \bmu_h^0=\big(\underbrace{(1-\epsilon)\delta_{s_1} +\epsilon \delta_{s_2}, (1-\epsilon)\delta_{s_1} + \epsilon \delta_{s_2}, \cdots, (1-\epsilon)\delta_{s_1} + \epsilon \delta_{s_2}}_\text{d+1}, \delta_{s_2}\big)^{\top}, \forall h\in[H], \end{align}\] where \(\epsilon\) is an error term injected into the nominal model, which is to be determined later. Thus, the transition is homogeneous and does not depend on action but only on state. The reward parameters \(\{ \btheta_h \}_{h \in [H]}\) are defined as \[\begin{align} \btheta_h^{\top} = \delta\cdot\Big(\frac{\xi_{h1}+1}{2}, \frac{\xi_{h2}+1}{2},\cdots\frac{\xi_{hd}+1}{2}, \frac{1}{2}, 0\Big), \forall h \in [H], \end{align}\] where \(\delta\) is a parameter to control the differences among instances, which is to be determined later. The reward \(r_h\) is generated from the normal distribution \(r_h\sim\cN(r_h(s_h,a_h),1)\), where \(r_h(s,a)=\bphi(s,a)^\top\btheta_h\). Note that \[\begin{align} r_h(s_1, a)=\bphi(s_1,a)^\top\btheta_h = \frac{\delta}{2d}\big(\la\bxi_h, a\ra+d\big)\geq 0\quad \text{and} \quad r_h(s_2, a)=\bphi(s_2,a)^\top\btheta_h=0,~ \forall a\in\cA, \end{align}\] Thus, the worst case transition kernel should have the highest possible transition probability to \(s_2\), and the optimal robust policy should lead to a transition probability to \(s_2\) as small as possible. Therefore the optimal action at step \(h\) is \[\begin{align} a_h^* = \Big(\frac{1+\xi_{h1}}{2}, \frac{1+\xi_{h2}}{2} \cdots,\frac{1+\xi_{hd}}{2}\Big). \end{align}\] We illustrate the designed \(d\)-rectangular linear RRMDP \(M_{\bxi}^{\lambda}\) in and .

Finally, the offline dataset is collected by the following procedure: the behavior policy \(\pi^b=\{\pi_h^b\}_{h\in[H]}\) is defined as \[\pi_h^b\sim \text{Unif}\big(\{\be_1, \cdots, \be_d, \mathbf{0}\}\big),\forall h\in [H],\] where \(\{\be_i\}_{i\in[d]}\) are the canonical basis vectors in \(\RR^d\). The initial state is generated according to \(\mu_0\) and then the behavior policy interact with the nominal environment \(K\) episodes to collect the offline dataset \(\cD\).

11.2 Proof of↩︎

With this family of hard instances, we are ready to prove the information-theoretic lower bound. For any \(\bxi \in \{-1,1\}^{dH}\), let \(\QQ_{\bxi}\) denote the distribution of dataset \(\cD\) collected from the instance \(M_{\bxi}\). Denote the family of parameters as \(\Omega = \{-1,1\}^{dH}\) and the family of hard instances as \(\cM=\{M_{\bxi}:\bxi\in\Omega\}\). Before the proof, we introduce the following lemma bounding the robust value function.

Lemma 6. Under the constructed hard instances in , let \(\delta = d^{3/2}/\sqrt{2K}\) and \(K>d^3H^2/(2\lambda^2)\). For any \(h\in [H]\) , we have \[\begin{align} 0 \leq \frac{\delta}{2d}\sum_{j=h}^H ( 1- \epsilon)^{j-h} \Big( d + \Big(\sum_{i=1}^d\xi_{ji}\EE^{\pi}a_{ji}\Big)\Big)-V^{\pi,\lambda}_h(s_1) \leq f_h^\lambda(\epsilon), \label{1} \end{align}\qquad{(2)}\] where \(f_h^\lambda(\epsilon)\) is a error term, which is defined as: \[\begin{align} f_h^\lambda(\epsilon) = \begin{cases} 0, & \text{if D is TV;} \\ (H-h)\lambda \epsilon (e-1), & \text{if D is KL;} \\ (H-h)\lambda \epsilon (1-\epsilon)/4, &\text{if D is \chi^2.} \end{cases} \end{align}\] Furthermore, if we set the \(\epsilon\) as \[\begin{align} \epsilon = \begin{cases} 1 - 2^{-1/H}, & \text{if D is TV;} \\ \min \{1 - 2^{-1/H}, d^{3/2}/(64\lambda\sqrt{2K})\} , &\text{if D is KL;} \\ \min \{1 - 2^{-1/H}, d^{3/2}/(8\lambda\sqrt{2K} )\}, & \text{if D is TV,} \end{cases} \label{definition32of32epsilon} \end{align}\qquad{(3)}\] then we have \(f_h^\lambda(\epsilon) \leq d^{3/2} H/{32\sqrt{2K}}\).

Proof of . Invoking , we have \[\begin{align} V^{\pi^*, \lambda}_1(s_1) - V^{\pi, \lambda}_1(s_1)& \geq \frac{\delta}{2d}\sum_{j=1}^H\sum_{i=1}^d(1-\epsilon)^{j-1}\Big(\frac{1+ \xi_{ji}}{2} - \xi_{ji}\EE^{\pi}a_{ji}\Big) - f_1^\lambda(\epsilon) \nonumber\\ & = \frac{\delta}{4d}\sum_{j=1}^H\sum_{i=1}^d(1-\epsilon)^{j-1}(1 - \xi_{ji}\EE^{\pi}(2a_{ji}-1)) - f_1^\lambda(\epsilon) \nonumber\\ & = \frac{\delta}{4d}\sum_{j=1}^H\sum_{i=1}^d(1-\epsilon)^{j-1}(\xi_{ji} - \EE^{\pi}(2a_{ji}-1))\xi_{ji} - f_1^\lambda(\epsilon)\nonumber\\ & = \frac{\delta}{4d}\sum_{j=1}^H\sum_{i=1}^d(1-\epsilon)^{j-1}|\xi_{ji} - \EE^{\pi}(2a_{ji}-1)| - f_1^\lambda(\epsilon) \tag{74}\\ & \geq \frac{\delta}{4d} (1-\epsilon)^{H-1}\sum_{j=1}^H\sum_{i=1}^d|\xi_{ji} - \EE^{\pi}(2a_{ji}-1)| - f_1^\lambda(\epsilon), \tag{75} \end{align}\] where 74 follows from the fact that \(\xi_{ji} \in \{ -1,1 \}\). To continue, \[\begin{align} \frac{\delta}{4d}\sum_{j=1}^H\sum_{i=1}^d|\xi_{ji} - \EE^{\pi}(2a_{ji}-1)| &\geq \frac{\delta}{4d}\sum_{j=1}^H\sum_{i=1}^d|\xi_{ji} - \EE^{\pi}(2a_{ji}-1)| \boldsymbol{1}\{ \xi_{hi} \neq \sign(\EE(2a_{ji}-1)) \}| \nonumber\\ & \geq \frac{\delta}{4d}\sum_{j=1}^H\sum_{i=1}^d \boldsymbol{1}\{ \xi_{hi} \neq \sign(\EE(2a_{ji}-1)) \}|\nonumber\\ & = \frac{\delta}{4d}D_{H}(\xi, \xi^{\pi}), \label{2} \end{align}\tag{76}\] where \(D_H(\cdot, \cdot)\) is the Hamming distance. Then applying the Assouad’s method [29], we have \[\begin{align} \inf_{\pi}\sup_{\xi \in \Omega}\EE_{\xi}[D_H(\xi, \xi^{\pi})] &\geq \frac{dH}{2}\min_{D_H(\xi, \xi^{\pi}) =1} \inf_{\phi}[\QQ_{\xi}(\psi(\mathcal{D}) \neq \xi) + \QQ_{\xi^{\pi}}(\psi(\mathcal{D}) \neq \xi^{\pi})] \nonumber\\ & \geq \frac{dH}{2}\Big(1 - \Big(\frac{1}{2}\max_{D_H(\xi, \xi^{\pi})=1}D_{\text{KL}}(\QQ_{\xi}\|\QQ_{\xi^{\pi}})\Big)^{1/2}\Big), \label{3} \end{align}\tag{77}\] where \(D_{\text{KL}}\) represents the KL divergence. Next we bound \(D_{\text{KL}}(\QQ_{\xi}\|\QQ_{\xi^{\pi}})\), according to the definition of \(\QQ_{\xi}(\cD)\), we have \[\begin{align} \QQ_{\xi}(\cD) = \prod_{k=1}^K \prod_{\tau=1}^H \pi^b_{h}(a_{h}^\tau | s_{h}^\tau)P_h^0(s_{h+1}^{\tau} | s_h^{\tau}, a_h^{\tau})R_h(r_h^{\tau}|s_h^{\tau}, a_h^{\tau}), \end{align}\] where \(R_h(r_h^{\tau}|s_h^{\tau}, a_h^{\tau})\) refers to the density function of \(\mathcal{N}(r_h(s_h^{\tau},a_h^{\tau}), 1)\) at \(r_h^{\tau}\). By the fact that the difference between the two distributions \(\QQ_{\xi}(\cD)\) and \(\QQ_{\xi^{\pi}}(\cD)\) lie only in the reward distribution corresponding to the index where \(\xi\) and \(\xi^{\pi}\) differ, we have \[\begin{align} D_{KL}(\QQ_{\xi}(\cD)\|\QQ_{\xi^{\pi}}(\cD)) = \sum_{\tau=1}^{\frac{K}{d+2}}D_{KL}\Big(\mathcal{N}\Big(\frac{d+1}{2d}\delta,1 \Big) \| \mathcal{N}\Big(\frac{d-1}{2d}\delta, 1\Big)\Big) = \frac{K}{d+2} \frac{\delta^2}{d^2} \leq \frac{1}{2}, \label{4} \end{align}\tag{78}\] where the last inequality follows from the definition of \(\delta\). By the fact that \(\delta = d^{3/2}/\sqrt{2K}\), we have \[\begin{align} \inf_{\hat{\pi}}\sup_{M \in \mathcal{M}} \text{subopt}(M, \hat{\pi}, s, \lambda) &\geq \frac{\delta H(1-\epsilon)^{H-1}}{4}\Big( 1 - \Big(\frac{1}{2}\max_{D_H(\xi, \xi^{\pi})=1}D_{KL} (\QQ_{\xi}\|\QQ_{\xi^{\pi}})\Big)^{1/2}\Big) - f_h^\lambda(\epsilon)\tag{79}\\ & \geq \frac{\delta H(1-\epsilon)^{H-1}}{8} - f_h^\lambda(\epsilon) \tag{80}\\ &= \frac{d^{3/2} H(1-\epsilon)^{H-1}}{8\sqrt{2K}} - f_h^\lambda(\epsilon)\nonumber \\ \tag{81} & \geq \frac{d^{3/2} H}{16\sqrt{2K}} - \frac{d^{3/2} H}{32\sqrt{2K}} \\ & \geq \frac{1}{128\sqrt{2}} \sum_{h=1}^H\EE^{\pi^*, P}\Big[\sum_{i=1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{{\Lambda}_h^{-1}}|s_1 = s\Big], \tag{82} \end{align}\] where 79 holds by applying the inequality 75 , (76 ) and (77 ) in order, 80 holds by (78 ), 81 holds by the definition of \(\epsilon\) in ?? , and 82 holds by . Hence, it is sufficient for taking \(c = 1/128\sqrt{2}\). This concludes the proof. ◻

12 Proof of Technical Lemmas↩︎

12.1 Proof of↩︎

Proof. We first decompose the \(\text{SubOpt}(\pi, s, \lambda)\) as follows: \[\begin{align} \text{SubOpt}(\hat{\pi}, s, \lambda) &= V_1^{\star, \lambda}(s) - V_1^{\hat{\pi},\lambda}(s) \\ &= \underbrace{V_1^{\star, \lambda}(s) - \hat{V}_{1}^{\lambda}(s)}_{\text{(i)}} + \underbrace{\hat{V}_{1}^{\lambda}(s) - V_1^{\hat{\pi}, \lambda}(s)}_{\text{(ii)}}, \end{align}\] where \(\hat{V}^{\lambda}_1(s)\) is computed in the algorithm. We next bound the term (i) and (ii) respectively. For term (i), the error comes from the estimated error of the value function and the Q-function, therefore by [prop:regularized32Robust32Bellman32equation] and the definition of the \(\hat{Q}_h^{\lambda}(s, a)\) in meta-algorithm, for any \(h \in [H]\), we can decompose the error as: \[\begin{align} V_h^{\pi^\star, \lambda}(s) - \hat{V}^{\lambda}_h(s) &= Q_h^{\pi^\star, \lambda}(s, \pi^\star_h(s)) - \hat{Q}_h^{\lambda}(s, \hat{\pi}_h(s)) \nonumber\\ & \leq Q_h^{\pi^\star, \lambda}(s, \pi^\star_h(s)) - \hat{Q}_h^{\lambda}(s, \pi^\star(s)) \tag{83}\\ & = \mathcal{T}_h^{\lambda}V_{h+1}^{\pi^\star, \lambda}(s, \pi_h^\star(s)) - \mathcal{T}_h^{\lambda}\hat{V}^{\lambda}_{h+1}(s, \pi_h^\star(s)) + \mathcal{T}_h^{\lambda}\hat{V}^{\lambda}_{h+1}(s, \pi_h^\star(s)) - \hat{Q}_h^{\lambda}(s, \pi^\star(s)) \nonumber\\ & = \mathcal{T}_h^{\lambda}V_{h+1}^{\pi^\star, \lambda}(s, \pi_h^\star(s)) - \mathcal{T}_h^{\lambda}\hat{V}^{\lambda}_{h+1}(s, \pi_h^\star(s)) + \delta^\lambda_h(s,\pi_h^\star(s)), \tag{84} \end{align}\] where 83 comes from the fact that \(\hat{\pi}_h\) is the greedy policy with respect to \(\hat{Q}_h^{\lambda}(s,a)\), the regularized robust Bellman update error \(\delta_h^\lambda\) is defined as: \[\begin{align} \delta_h^\lambda(s,a) := \mathcal{T}_h^{\lambda}\hat{V}^{\lambda}_{h+1}(s, a) - \hat{Q}_h^{\lambda}(s, a), \forall (s,a) \in \cS \times \cA, \label{definition32of32regularized32bellman32error} \end{align}\tag{85}\] which aims to eliminate the clip operator in the definition of \(\hat{Q}_h^{\lambda}(s, a)\). Denote the worst transition kernel w.r.t the regularized Bellman operator as \(\hat{P} = \{ \hat{P}_h \}_{h \in [H]}\), where \(\hat{P}_h\) is defined as: \[\begin{align} \hat{P}_h(\cdot |s,a) &= \argmin_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[\hat{V}_{h+1}^{\lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \\ & = \sum_{i=1}^d\phi_i(s,a) \argmin_{\mu_{h, i} \in \Delta(\cS)}\big[\EE_{s' \sim \mu_{h, i}}[\hat{V}^\lambda_{h+1}(s')]+ \lambda D(\mu_{h, i} \|\mu_{h, i}^0)\big] \\ & = \sum_{i=1}^d\phi_i(s,a) \hat{\mu}_{h,i}(\cdot) , \end{align}\] where the \(\hat{\mu}_{h,i}\) is defined as \(\hat{\mu}_{h,i} = \argmin_{\mu_{h, i} \in \Delta(\cS)}\big[\EE_{s' \sim \mu_{h, i}}[\hat{V}_{h+1}(s')]+ \lambda D(\mu_{h, i} \|\mu_{h, i}^0)\big]\). Hence the difference between the regularized Bellman operator and the empirical regularized Bellman operator can be bounded as \[\begin{align} &\mathcal{T}_h^{\lambda}V_{h+1}^{\pi^\star, \lambda}(s, \pi_h^\star(s)) - \mathcal{T}_h^{\lambda}\hat{V}^{\lambda}_{h+1}(s, \pi_h^\star(s)) \nonumber\\ &= r_h(s, \pi_h^\star(s)) + \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,\pi_h^\star(s))}\big[V_{h+1}^{\pi^\star, \lambda}(s')\big]+ \lambda \la\bphi(s, \pi_h^\star(s)),\bD(\bmu_h||\bmu_h^0) \ra\big]\nonumber \\ &\quad - r_h(s, \pi_h^\star(s)) - \inf_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,\pi_h^\star(s))}\big[\hat{V}_{h+1}^{\lambda}(s')\big]+ \lambda \la\bphi(s, \pi_h^\star(s)),\bD(\bmu_h||\bmu_h^0) \ra\big] \nonumber\\ &\leq \EE_{s' \sim \hat{P}_h(\cdot |s,\pi_h^\star(s))}[\hat{V}^{\lambda}_{h+1}(s')] - \EE_{s' \sim \hat{P}_h(\cdot |s,\pi_h^\star(s))}[V^{\star,\lambda}_{h+1}(s')] \nonumber\\ &= \EE_{s' \sim \hat{P}_h(\cdot |s,\pi_h^\star(s))}[\hat{V}^{\lambda}_{h+1}(s')-V^{\star, \lambda}_{h+1}(s')]. \label{EQ2} \end{align}\tag{86}\] Combining inequality (84 ) and (86 ), we have for any \(h \in [H]\) \[\begin{align} V_h^{\pi^\star, \lambda}(s) - \hat{V}^\lambda_h(s) \leq \EE_{s' \sim \hat{P}_h(\cdot |s,\pi_h^\star(s))}[\hat{V}^\lambda_{h+1}(s')-V^{\star, \lambda}_{h+1}(s')] + \delta^\lambda_h(s,\pi_h^\star(s)). \label{EQ3} \end{align}\tag{87}\] Recursively applying (87 ), we have \[\begin{align} \text{(i)} = V_1^{\pi^\star, \lambda}(s) - \hat{V}^\lambda_1(s) \leq \sum_{h=1}^H \EE^{\pi^\star, \hat{P}}\big[\delta^\lambda_h(s_h,a_h)\|s_1 = s\big]. \end{align}\] Next we bound term (ii), similar to term (i), by 35 , the error can be decomposed to \[\begin{align} \hat{V}^\lambda_h(s) - V_h^{\hat{\pi}, \lambda}(s) &= \hat{Q}_h^{\lambda}(s, \hat{\pi}_h(s)) - Q_h^{\hat{\pi}, \lambda}(s, \hat{\pi}_h(s))\nonumber \\ & = \mathcal{T}_h^{\lambda} \hat{V}^\lambda_{h+1}(s, \hat{\pi}_h(s)) - \delta_h^\lambda(s, \hat{\pi}_h(s)) - \mathcal{T}_h^{\lambda} V_{h+1}^{\hat{\pi}, \lambda}(s, \hat{\pi}_h(s)). \label{EQ4} \end{align}\tag{88}\] Denote \(P^{\hat{\pi}} = \{ P^{\hat{\pi}}_h \}_{h \in [H]}\) where \(P^{\hat{\pi}}_h\) is defined as: \(\forall (s,a) \in \cS \times \cA\), \[\begin{align} P^{\hat{\pi}}_h(\cdot |s,a) = \argmin_{\bmu_h \in \Delta(\cS)^d,P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[\hat{V}_{h+1}^{\hat{\pi}}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big]. \end{align}\] Hence similar to the bound in (86 ), the difference between the regularized Bellman operator and the empirical regularized Bellman operator can be bounded as \[\begin{align} \mathcal{T}_h^{\lambda} \hat{V}^\lambda_{h+1}(s, \hat{\pi}_h(s))- \mathcal{T}_h^{\lambda} V_{h+1}^{\hat{\pi}, \lambda}(s, \hat{\pi}_h(s)) \leq \EE_{s' \sim P^{\hat{\pi}}_h(\cdot |s,\hat{\pi}_h(s))}[\hat{V}^\lambda_{h+1}(s')-V^{\hat{\pi}, \lambda}_{h+1}(s')]. \label{EQ5} \end{align}\tag{89}\] Combining inequality (88 ), (89 ), we have for any \(h \in [H]\) \[\begin{align} \hat{V}^\lambda_h(s) - V^{\hat{\pi}, \lambda}_h(s) \leq \EE_{s' \sim P^{\hat{\pi}}_h(\cdot |s,\hat{\pi}_h(s))}[\hat{V}^\lambda_{h+1}(s')-V^{\hat{\pi}, \lambda}_{h+1}(s')] - \delta_h^\lambda(s, \hat{\pi}_h(s)). \label{EQ6} \end{align}\tag{90}\] Recursively applying (90 ), we have the "pessimisim" of the estimated value function that \(\forall h \in [H]\) \[\begin{align} \hat{V}^\lambda_1(s) - V^{\hat{\pi}, \lambda}_1(s) \leq \sum_{h=1}^H \EE^{\hat{\pi}, P^{\hat{\pi}}}\big[-\delta^\lambda_h(s_h,a_h)\|s_1 = s\big]. \end{align}\] Therefore combining the two bounds above, we have \[\begin{align} \text{SubOpt}(\hat{\pi}, s, \lambda) = \text{(i)} + \text{(ii)} \leq \sum_{h=1}^H \EE^{\pi^\star, \hat{P}}\big[\delta^\lambda_h(s_h,a_h)|s_1 = s\big] + \sum_{h=1}^H \EE^{\hat{\pi}, P^{\hat{\pi}}}\big[-\delta^\lambda_h(s_h,a_h)\|s_1 = s\big]. \label{upper32bound32for32subopt32in32prep} \end{align}\tag{91}\] Hence, it requires to estimate the range of the regularized Bellman update error \(\delta^\lambda_h(s,a)\). Recall the definition in 85 , we claim that \[\begin{align} 0 \leq \delta^\lambda_h(s,a) \leq 2\Gamma_h(s,a) \label{bound32of32bellman32error} \end{align}\tag{92}\] holds for \(\forall (s,a,h) \in \cS \times \cA \times [H]\). For the LHS of 92 , first we notice that if \(\la \bphi(s,a), \hat{\bw}_h^\lambda \ra - \Gamma_h(s, a)\leq 0\), the inequality holds trivially as \(\hat{Q}_h^\lambda(s,a)= 0\). Next we consider the case where \(\la \bphi(s,a), \hat{\bw}_h^\lambda \ra - \Gamma_h(s, a)\geq 0\). By the definition of \(\hat{Q}_h^\lambda(s,a)\) and the assumption in the lemma, we have \[\begin{align} \delta^\lambda_h(s,a) &= \mathcal{T}_h^{\lambda} \hat{V}_{h+1}^{\lambda}(s, a) - \hat{Q}_h^\lambda(s,a) \\ & = \mathcal{T}_h^{\lambda} \hat{V}_{h+1}^{\lambda}(s, a) - \min \big \{\la \bphi(s,a), \hat{\bw}_h^\lambda \ra- \Gamma_h(s,a), H-h+1 \big \} \\ & \geq \mathcal{T}_h^{\lambda} \hat{V}_{h+1}^{\lambda}(s, a) - \la \bphi(s,a), \hat{\bw}_h^\lambda \ra + \Gamma_h(s,a) \\ & \geq 0. \end{align}\] On the other hand, by the assumption in the lemma, we have \[\begin{align} \la \bphi(s,a), \hat{\bw}_h^\lambda \ra- \Gamma_h(s,a) \leq \mathcal{T}_h^{\lambda} \hat{V}_{h+1}^{\lambda}(s, a) \leq H-h+1. \end{align}\] Hence, we can upper bound \(\delta_h^\lambda(s,a)\) as \[\begin{align} \delta^\lambda_h(s,a) &= \mathcal{T}_h^{\lambda} \hat{V}_{h+1}^{\lambda}(s, a) - \hat{Q}_h^\lambda(s,a) \\ & = \mathcal{T}_h^{\lambda} \hat{V}_{h+1}^{\lambda}(s, a) - \max \big \{\la \bphi(s,a), \hat{\bw}_h^\lambda \ra- \Gamma_h(s,a), 0 \big \} \\ & \leq \mathcal{T}_h^{\lambda} \hat{V}_{h+1}^{\lambda}(s, a) - \la \bphi(s,a), \hat{\bw}_h^\lambda \ra +\Gamma_h(s,a) \\ & \leq 2\Gamma_h(s,a). \end{align}\] This concludes the claim. Now it remains to bound the empirical transition kernel \(\hat{P}\). Noticing the fact that \(\forall h \in [H], (s,a) \in \cS \times \cA,\) \[\begin{align} \lambda D(\hat{\mu}_{h,i} \|\mu^0_{h,i}) &\leq \EE_{s' \sim \hat{\mu}_{h,i}}[\hat{V}^\lambda_{h+1}(s')]+ \lambda D(\hat{\mu}_{h,i} \|\mu^0_{h,i}) \nonumber\\ & = \inf_{\mu_{h, i} \in \Delta(\cS)}\big[\EE_{s' \sim \mu_{h, i}}[\hat{V}_{h+1}(s')]+ \lambda D(\mu_{h, i} \|\mu_{h, i}^0)\big] \nonumber\\ & \leq \inf_{\mu_{h, i} \in \Delta(\cS)}\big[\EE_{s' \sim \mu_{h, i}}[V_{h+1}^{\star, \lambda}(s')]+ \lambda D(\mu_{h, i} \|\mu^0_{h,i})\big] \label{D2461}\\ & \leq \EE_{s' \sim \mu_{h, i}^0}[V_{h+1}^{\star, \lambda}(s')] \nonumber \\ & \leq \max_{s \in \cS} V_{h+1}^{\star, \lambda}(s), \nonumber \end{align}\tag{93}\] where 93 comes from the pessimism of value function, i.e \(\hat{V}^\lambda_{h+1}(s) \leq V^{\star, \lambda}_h(s), \forall h \in [H]\). Hence, the empirical transition kernel \(\hat{P}_h(\cdot | s,a)\) is contained in the set \(\mathcal{U}^\lambda(P^0)\) defined in . Hence, by 91 and 92 , we have \[\begin{align} \text{SubOpt}(\hat{\pi},s,\lambda) &\leq \sum_{h=1}^H \EE^{\pi^\star, \hat{P}}\big[\delta^\lambda_h(s_h,a_h)|s_1 = s\big] + \sum_{h=1}^H \EE^{\hat{\pi}, P^{\hat{\pi}}}\big[-\delta^\lambda_h(s_h,a_h)\|s_1 = s\big] \\ & \leq 2 \sum_{h=1}^{H}\EE^{\pi^\star, \hat{P}}\big[\Gamma_h(s_h, a_h)|s_1 = s\big] \\ &\leq 2 \sup_{P \in \mathcal{U}^\lambda(P^0)} \sum_{h=1}^{H}\EE^{\pi^\star, P}\big[\Gamma_h(s_h, a_h)|s_1 = s\big]. \end{align}\] This concludes the proof. ◻

12.2 Proof of↩︎

Proof. For all \(h \in [H]\), from the definition of \(w_h\), \[\begin{align} \|\bw_h^\lambda\|_2 = \|\EE_{s \sim \bmu^0_h}[\hat{V}^\lambda_{h+1}(s)]_{\alpha_{h+1}}\|_2 \leq H\sqrt{d}, \end{align}\] where the inequality follows from the fact that \(\hat{V}_{h+1}^\lambda \leq H\), for all \(h \in [H]\). Meanwhile, by the definition of \(\hat{\bw}_h^{\lambda}\) in , and the triangle inequality, \[\begin{align} \|\hat{w}_h^{\lambda}\|_2 &= \Big\|\bLambda_h^{-1}\sum_{\tau=1}^K \bphi(s_h^{\tau}, a_h^{\tau})[\hat{V}^\lambda_{h+1}(s)]_{\alpha_{h+1}}\Big\|_2 \nonumber\\ &\leq H\sum_{\tau=1}^K\|\bLambda_h^{-1}\bphi(s_h^{\tau}, a_h^{\tau})\|_2 \nonumber\\ &= H\sum_{\tau = 1}^K \sqrt{\bphi(s_h^{\tau}, a_h^{\tau})^{\top} \bLambda_h^{-1/2} \bLambda_h^{-1} \bLambda_h^{-1/2} \bphi(s_h^{\tau}, a_h^{\tau})} \nonumber\\ &\leq \frac{H}{\sqrt{\gamma}} \sum_{\tau=1}^{K}\sqrt{\bphi(s_h^{\tau}, a_h^{\tau})^{\top}\bLambda_h^{-1}\bphi(s_h^{\tau}, a_h^{\tau})} \tag{94}\\ &\leq \frac{H\sqrt{K}}{\sqrt{\gamma}} \sqrt{\sum_{\tau=1}^{K}\bphi(s_h^{\tau}, a_h^{\tau})^{\top}\bLambda_h^{-1}\bphi(s_h^{\tau}, a_h^{\tau})} \tag{95}\\ &=\frac{H\sqrt{K}}{\sqrt{\gamma}} \sqrt{\text{Tr}(\bLambda_h^{-1}(\bLambda_h - \gamma \mathbf{I}))} \nonumber\\ &\leq \frac{H\sqrt{K}}{\sqrt{\gamma}}\sqrt{\text{Tr}(\mathbf{I})} \nonumber\\ & =H\sqrt{\frac{Kd}{\gamma}},\nonumber \end{align}\] where 94 follows from the fact that \(\|\bLambda_h^{-1}\| \leq \gamma^{-1}\), 95 follows from the Cauchy-Schwartz inequality. Then we conclude the proof. ◻

12.3 Proof of↩︎

Proof. By definition, we have \[\begin{align} \big \|\bw_h^\lambda\|_2 = \Big\|\EE_{s \sim \bmu^{0}_{h}}\Big[e^{-\frac{\hat{V}^\lambda_{h+1}(s)}{\lambda}}\Big]\Big\|_2 = \int _{\mathcal{S}}\Big\|e^{-\frac{\hat{V}^\lambda_{h+1}(s)}{\lambda}}\bmu_h^0(s)\Big\|_2ds \leq \int_{\mathcal{S}}\|\bmu_h^0(s)\|_2ds \leq \sqrt{d}, \end{align}\] this concludes the proof of \(\bw_h^\lambda\). For \(\hat{\bw}_h^{\lambda}\), \[\begin{align} \|\hat{\bw}_h^{\lambda}\|_2 &= \Big\|\bLambda_h^{-1}\sum_{\tau=1}^K \bphi(s_h^{\tau}, a_h^{\tau})e^{-\frac{\hat{V}^\lambda_{h+1}(s_{h+1})}{\lambda}}\Big\|_2 \nonumber\\ &\leq \sum_{\tau=1}^K\|\bLambda_h^{-1}\bphi(s_h^{\tau}, a_h^{\tau})\|_2 \nonumber\\ & = \sum_{\tau = 1}^K \sqrt{\bphi(s_h^{\tau}, a_h^{\tau})^{\top} \bLambda_h^{-1/2} \bLambda_h^{-1} \bLambda_h^{-1/2} \bphi(s_h^{\tau}, a_h^{\tau})} \nonumber\\ &\leq \frac{1}{\sqrt{\gamma}} \sum_{\tau=1}^{K}\sqrt{\bphi(s_h^{\tau}, a_h^{\tau})^{\top}\bLambda_h^{-1}\bphi(s_h^{\tau}, a_h^{\tau})} \tag{96}\\ &\leq \frac{\sqrt{K}}{\sqrt{\gamma}} \sqrt{\sum_{\tau=1}^{K}\bphi(s_h^{\tau}, a_h^{\tau})^{\top}\bLambda_h^{-1}\bphi(s_h^{\tau}, a_h^{\tau})} \tag{97} \\ &=\frac{\sqrt{K}}{\sqrt{\gamma}} \sqrt{\text{Tr}(\bLambda_h^{-1}(\bLambda_h -\gamma \mathbf{I}))} \nonumber\\ &\leq \frac{\sqrt{K}}{\sqrt{\gamma}}\sqrt{\text{Tr}(\mathbf{I})} =\sqrt{\frac{Kd}{\gamma}}, \nonumber \end{align}\] where 96 follows from the fact that \(\|\bLambda_h^{-1}\| \leq \gamma^{-1}\), 97 follows from the Cauchy-Schwartz inequality. Then we conclude the proof. ◻

12.4 Proof of↩︎

Proof. Denote \(\bA = \beta^2 \Lambda_h^{-1}\), then we have \(\|\btheta\|_2 \leq L, \|\bA\|_2 \leq B^2 \gamma^{-1}\). For any two functions \(V_1, V_2 \in \mathcal{V}\) with parameters \((\btheta_1, \bA_1), (\btheta_2, \bA_2)\), since both \(\{ \cdot \}_{[0, H-h+1]}\) and \(\max_a\) are contraction maps, \[\begin{align} &\text{dist}(V_1, V_2) \\ &\leq \sup_{s,a}\Big|\bphi(s,a)^{\top}(\btheta_1 - \btheta_2) - \lambda \Big(\log\Big(1+ \sum_{i=1}^d\|\phi_i(s,a) \mathbf{1}_i\|_{\bA_1}\Big) - \log\Big(1+ \sum_{i=1}^d\|\phi_i(s,a) \mathbf{1}_i\|_{\bA_2}\Big) \Big) \Big|\nonumber \\ &\leq \sup_{\bphi \in \RR^d, \|\bphi\| \leq 1}\Big|\bphi^{\top}(\btheta_1 - \btheta_2) - \lambda \log \frac{1+ \sum_{i=1}^d\|\phi_i(s,a) \mathbf{1}_i\|_{\bA_1}}{1+ \sum_{i=1}^d\|\phi_i(s,a) \mathbf{1}_i\|_{\bA_2}}\Big| \nonumber\\ &\leq \sup_{\bphi \in \RR^d: \|\bphi\| \leq 1}|\bphi^{\top}(\btheta_1 - \btheta_2)| +\lambda \sup_{\bphi \in \RR^d: \|\bphi\| \leq 1} \Big|\log \frac{1+ \sum_{i=1}^d\|\phi_i(s,a) \mathbf{1}_i\|_{\bA_1}}{1+ \sum_{i=1}^d\|\phi_i(s,a) \mathbf{1}_i\|_{\bA_2}}\Big| , \label{G1} \end{align}\tag{98}\] we notice the fact that: for any \(x >0, y>0\), \[\begin{align} \Big|\log\frac{1+x}{1+y}\Big| = \Big|\log\Big(\frac{x-y}{1+y} + 1\Big)\Big| \leq \log(|x-y| + 1) \leq |x-y|. \end{align}\] Therefore, 98 can be bounded as: \[\begin{align} \eqref{G1} &\leq \sup_{\bphi \in \RR^d: \|\bphi\| \leq 1}|\bphi^{\top}(\btheta_1 - \btheta_2)| +\lambda \sup_{\bphi \in \RR^d: \|\bphi\| \leq 1} \Big|\sum_{i =1 }^d\|\phi_i(s,a) \mathbf{1}_i\|_{\bA_1} - \sum_{i =1 }^d\|\phi_i(s,a) \mathbf{1}_i\|_{\bA_2}\Big| \nonumber\\ & = \sup_{\bphi \in \RR^d: \|\bphi\| \leq 1}|\bphi^{\top}(\btheta_1 - \btheta_2)| +\lambda \sup_{\bphi \in \RR^d: \|\bphi\| \leq 1} \Big|\sum_{i =1 }^d\sqrt{\phi_i\mathbf{1}_i^{\top}\bA_1\phi_i\mathbf{1}_i} - \sum_{i =1 }^d\sqrt{\phi_i\mathbf{1}_i^{\top}\bA_2\phi_i\mathbf{1}_i}\Big| \nonumber \\ &\leq \|\btheta_1 - \btheta_2\|_2 + \lambda \sup_{\bphi \in \RR^d:\|\bphi\| \leq 1}\sum_{i=1}^d\sqrt{\phi_i\mathbf{1}_i^{\top}(\bA_1-\bA_2)\phi_i \mathbf{1}_i} \tag{99} \\ &\leq \|\btheta_1 - \btheta_2\|_2 + \lambda \sqrt{\|\bA_1 - \bA_2\|}\sup_{\bphi \in \RR^d:\|\bphi\| \leq 1}\sum_{i=1}^d \|\phi_i \mathbf{1}_i\| \nonumber \\ &\leq \|\btheta_1 - \btheta_2\|_2 + \lambda \sqrt{\|\bA_1 - \bA_2\|_{F}} \tag{100}, \end{align}\] where the 99 follows from the triangular inequality and the fact \(|\sqrt{x}- \sqrt{y}| \leq \sqrt{|x-y|}\), and \(\|\cdot\|_F\) denotes the Frobenius norm. We next define that \(\mathcal{C_{\btheta}}\) is an \(\epsilon/2\)-cover of \(\{ \btheta \in \RR^d| \|\btheta\|_2 \leq L \}\), and the \(\mathcal{C}_{\bA}\) is an \(\epsilon^2/{4\lambda^2}\)-cover of \(\{ \bA \in \RR^{d \times d}| \|\bA\|_F \leq d^{1/2}B^2\gamma^{-1} \}\). By , we have that: \[\begin{align} |\mathcal{C_{\btheta}}| \leq (1+ 4L/\epsilon)^d, |\mathcal{C}_{\bA}| \leq (1+ 8\lambda^2d^{1/2}B^2/\gamma \epsilon^2)^{d^2}. \end{align}\] By (100 ), for any \(V_1 \in \mathcal{V}\), there exists \(\btheta_2 \in \mathcal{C}_{\btheta}\) and \(\bA_2 \in \mathcal{C}_{\bA}\) s.t \(V_2\) parametrized by \((\btheta_2, \bA_2)\) satisfying \(\text{dist}(V_1, V_2) \leq \epsilon\). Therefore, we have the following: \[\begin{align} \log|\mathcal{N}(\epsilon)| \leq \log|\mathcal{C_{\btheta}}| + \log|\mathcal{C}_{\bA}| \leq d\log(1+4L/\epsilon) + d^2\log(1+8\lambda^2d^{1/2}B^2/\gamma \epsilon^2). \end{align}\] Hence we conclude the proof. ◻

12.5 Proof of↩︎

Proof. By definition, we have that \[\begin{align} &\|\hat{\bw}_h^{\lambda}\|_2\notag\\ &= \Big\|\Big[\max_{\alpha \in [(\hat{V}_{h+1}^{\lambda})_{\min},(\hat{V}_{h+1}^{\lambda})_{\max}]} \Big \{ \hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]_{\alpha} + \frac{1}{4\lambda}(\hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]_{\alpha})^2- \frac{1}{4\lambda}\hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]^2_{\alpha} \Big \}\Big]_{i \in [d]}\Big\|_2 \nonumber\\ & \leq \Big\|\Big[H + \frac{H^2}{2\lambda}\Big]_{i \in [d]}\Big\|_2 \label{chi1} \\&= \sqrt{d}\Big(H + \frac{H^2}{2\lambda}\Big),\nonumber \end{align}\tag{101}\] where 101 follows by the fact that \(\hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]_{\alpha} \in [0,H], \hat{\EE}^{\mu_{h,i}^0}[\hat{V}_{h+1}^{\lambda}(s)]^2_{\alpha} \in [0, H^2]\). ◻

12.6 Proof of↩︎

Proof. We first proof the LHS of the lemma by induction from last stage \(H\). From the definition of \(V^{\pi, \lambda}_H\) and \(\btheta_h\), we can learn that \[\begin{align} V_H^{\pi, \lambda}(s_1) = r_H(s_1, \pi_H(s_1)) = \bphi(s_1, \pi(s_1))^{\top}\btheta_h = \frac{\delta}{2d}\Big(d + \sum_{i=1}^d\xi_{Hi}\EE^{\pi}a_{Hi}\Big). \end{align}\] This is the base case. Now suppose the conclusion holds for stage \(h+1\), that is to say, \[\begin{align} V^{\pi, \lambda}_{h+1}(s_1) \leq \frac{\delta}{2d}\sum_{j=h+1}^H ( 1- \epsilon)^{j-h-1} \Big( d + \Big(\sum_{i=1}^d\xi_{ji}\EE^{\pi}a_{ji}\Big)\Big). \end{align}\] Recall the regularized robust bellman equation in and the regularized duality of the three divergences, we have \[\begin{align} Q_{h}^{\pi, \lambda}(s_1, a) &= r_h(s_1, a) + \inf_{\bmu_{h} \in \Delta(\cS)^{d+2},P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \nonumber \\ & \leq r_h(s_1,a) +\EE_{s' \sim P_h^0(\cdot |s_1,a)}[V_{h+1}^{\pi, \lambda}(s')] \\ &= r_h(s_1,a) + (1-\epsilon) V_{h+1}^{\pi, \lambda}(s_1). \end{align}\] Then with regularized robust bellman equation in and the inductive hypothesis, we have \[\begin{align} V_{h}^{\pi, \lambda}(s_1) &= Q_h^{\pi, \lambda}(s_1, \pi(s_1)) \nonumber\\ &\leq r_h(s_1, \pi_h(s_1)) + (1-\epsilon) V_{h+1}^{\pi, \lambda}(s_1) \label{for32TV} \\& =\frac{\delta}{2d}\Big(d + \sum_{i=1}^d\xi_{hi}\EE^{\pi}a_{hi}\Big) + \frac{\delta}{2d}\sum_{j=h+1}^H ( 1- \epsilon)^{j-h} \Big( d + \Big(\sum_{i=1}^d\xi_{ji}\EE^{\pi}a_{ji}\Big)\Big) \nonumber\\ & = \frac{\delta}{2d}\sum_{j=h}^H ( 1- \epsilon)^{j-h} \Big( d + \Big(\sum_{i=1}^d\xi_{ji}\EE^{\pi}a_{ji}\Big)\Big). \nonumber \end{align}\tag{102}\] Hence, by the induction argument, we conclude the proof of the RHS. Furthermore, for any \(h \in [H]\), we can upper bound \(V_{h}^{\pi, \lambda}(s)\) as \[\begin{align} V_{h}^{\pi, \lambda}(s) \leq \frac{\delta}{2d}\sum_{j=h}^H ( 1- \epsilon)^{j-h} \Big( d + \Big(\sum_{i=1}^d\xi_{ji}\EE^{\pi}a_{ji}\Big)\Big) \leq \delta(H-h) \leq \lambda (H-h)/H \leq \lambda,\label{ineq:32upper32bound32for32V} \end{align}\tag{103}\] where the third inequality holds by the definition of \(\delta\). For the left, we prove by discussing the KL, \(\chi^2\) and TV cases respectively.

12.6.0.1 Case I - TV.

The case for TV holds trivially as by , we have \[\begin{align} Q_{h}^{\pi, \lambda}(s_1, a) &= r_h(s_1, a) + \inf_{\bmu_{h} \in \Delta(\cS)^{d+2},P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \nonumber \\ & =r_h(s_1, a) + \la \bphi(s_1,a), \EE_{s' \sim \bmu_h^0}[V_{h+1}^{\pi, \lambda}(s')]_{\min_{s'}(V_{h+1}^{\pi, \lambda}(s')) + \lambda} \ra \\ & = r_h(s_1, a) + \EE_{s' \sim P_h^0(\cdot |s,a)}[V_{h+1}^{\pi, \lambda}(s')]_{\min_{s'}(V_{h+1}^{\pi, \lambda}(s'))+ \lambda} \nonumber\\ & = r_h(s_1,a) + (1-\epsilon) V_{h+1}^{\pi, \lambda}(s_1),\label{lower5} \end{align}\tag{104}\] where 104 holds by (103 ). Hence, the inequality in (102 ) holds for equality. This concludes the proof for TV-divergence.

12.6.0.2 Case II - KL.

We prove by induction. The case holds trivially in last stage \(H\). Suppose \[\begin{align} V^{\pi, \lambda}_{h+1}(s_1) \geq \frac{\delta}{2d}\sum_{j=h+1}^H ( 1- \epsilon)^{j-h-1} \Big( d + \Big(\sum_{i=1}^d\xi_{ji}\EE^{\pi}a_{ji}\Big)\Big)- (H-h)\lambda \epsilon (e-1). \end{align}\] Recall the duality form of , the Q-function at stage \(h\) can be upper bounded as: \[\begin{align} Q_{h}^{\pi, \lambda}(s_1, a) &= r_h(s_1, a) + \inf_{\bmu_{h} \in \Delta(\cS)^{d+2},P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \nonumber \\ & = r_h(s_1,a) + \la \bphi(s_1,a), - \lambda \log \EE_{s' \sim \bmu_h^0} e^{-V_{h+1}^{\pi, \lambda}(s')/\lambda} \ra \nonumber\\ & = r_h(s_1,a) - \lambda \log \big(\epsilon + (1-\epsilon)e^{-V_{h+1}^{\pi, \lambda}(s_1)/\lambda }\big) \nonumber\\ & = r_h(s_1,a) + V_{h+1}^{\pi, \lambda}(s_1) - \lambda \log \big(\epsilon e^{V_{h+1}^{\pi, \lambda}(s_1)/\lambda} + (1-\epsilon)\big) \nonumber\\ &\geq r_h(s_1,a) + V_{h+1}^{\pi, \lambda}(s_1) -\lambda \epsilon \big(e^{V_{h+1}^{\pi, \lambda}(s_1)/\lambda}-1\big) \tag{105}\\ & \geq r_h(s_1,a) + V_{h+1}^{\pi, \lambda}(s_1) - \lambda \epsilon (e - 1), \tag{106} \end{align}\] where 105 follows by the fact that \(\log(1+x) \leq x, \forall x >0\), 106 follows by (103 ). Therefore, by the inductive hypothesis, we have \[\begin{align} V_{h}^{\pi, \lambda}(s_1) &= Q_h^{\pi, \lambda}(s_1, \pi(s_1)) \\ &\geq r_h(s_1, \pi_h(s_1)) + (1-\epsilon) V_{h+1}^{\pi, \lambda}(s_1) - \lambda \epsilon (e-1) \\& = \frac{\delta}{2d}\sum_{j=h}^H ( 1- \epsilon)^{j-h} \Big( d + \Big(\sum_{i=1}^d\xi_{ji}\EE^{\pi}a_{ji}\Big)\Big) - (H-h) \lambda \epsilon (e-1). \end{align}\] This finishes the KL setting.

12.6.0.3 Case III - \(\chi^2\).

Similar to the case in TV, KL, by the duality of \(\chi^2\) in , we have \[\begin{align} Q_{h}^{\pi, \lambda}(s_1, a) &= r_h(s_1, a) + \inf_{\bmu_{h} \in \Delta(\cS)^{d+2},P_h=\la \bphi,\bmu_h \ra}\big[\EE_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi, \lambda}(s')\big]+ \lambda \la\bphi(s, a),\bD(\bmu_h||\bmu_h^0) \ra\big] \nonumber \\ & = r_h(s_1, a) + (1-\epsilon)\sup_{\alpha \in [V_{\min}, V_{\max}]} \Big \{ [V_{h+1}^{\pi, \lambda}(s_1)]_{\alpha} - \frac{\epsilon}{4\lambda}[V_{h+1}^{\pi, \lambda}(s_1)]_{\alpha}^2 \Big\} \nonumber\\ &\geq r_h(s_1, a) + (1-\epsilon)\Big[V_{h+1}^{\pi, \lambda}(s_1) - \frac{\epsilon}{4\lambda}[V_{h+1}^{\pi, \lambda}(s_1)]^2 \Big] \nonumber\\ & \geq r_h(s_1, a) + (1-\epsilon)V_{h+1}^{\pi, \lambda}(s_1) - \frac{\epsilon \lambda(1-\epsilon)}{4}, \label{lower2} \end{align}\tag{107}\] where 107 follows by (103 ). Hence, similar to Case II, by induction, we have \[\begin{align} V_{h}^{\pi, \lambda}(s_1) \geq \frac{\delta}{2d}\sum_{j=h}^H ( 1- \epsilon)^{j-h} \Big( d + \Big(\sum_{i=1}^d\xi_{ji}\EE^{\pi}a_{ji}\Big)\Big) - (H-h) \frac{\epsilon \lambda(1-\epsilon)}{4}. \end{align}\] This finishes the \(\chi^2\) setting. This concludes the proof. ◻

13 Auxiliary Lemmas↩︎

Lemma 7 (Lemma D.3 of [14]). For any \(h \in [H]\), let \(\mathcal{V}_h\) denote a class of functions mapping from \(\mathcal{S}\) to \(\RR\) with the following form: \[\begin{align} V_h(x; \btheta, \beta, \bLambda_h) = \max_{a \in \mathcal{A}} \Big\{ \bphi(s,a)^{\top} \btheta - \beta \sum_{i=1}^d\|\phi_i(\cdot, \cdot) \mathbf{1}_i\|_{\bLambda_h^{-1}} \Big\}_{[0, H-h+1]}, \end{align}\] the parameters \((\btheta, \beta, \bLambda_h)\) satisfy \(\|\btheta\|_2 \leq L, \beta \in [0, B], \gamma_{\min}(\bLambda_h) \geq \gamma\). Let \(\mathcal{N}_h(\epsilon)\) be the \(\epsilon\)-covering number of \(\mathcal{V}\) with respect to the distance \(\text{dist}(V_1, V_2) = \sup_x|V_1(x) - V_2(x)|\). Then \[\begin{align} \log\mathcal{N}_h(\epsilon) \leq d\log(1+4L/\epsilon) + d^2\log(1+8d^{1/2}B^2/\gamma \epsilon^2). \end{align}\]

Lemma 8 (Corollary 4.2.11 of [30]). Denote the \(\epsilon\)-covering number of the closed interval \([a, b]\) for some real number \(b > a\) with respect to the distance metric \(d(\alpha_1, \alpha_2) = |\alpha_1 - \alpha_2|\) as \(\mathcal{N}_{\epsilon}([a,b])\), then we have \(\mathcal{N}_{\epsilon}([a,b]) \leq 3(b-a)/{\epsilon}\).

Lemma 9 (Lemma B.2 of [26]). Let \(f:\mathcal{S} \rightarrow [0, R-1]\) be any fixed function. For any \(\delta \in (0, 1)\), we have \[\begin{align} P\bigg(\|\sum_{\tau = 1}^{K}\bphi(s_h^{\tau}, a_h^{\tau})\eta_h^{\tau}(f)\|_{\Lambda_h^{-1}}^2 \geq R^2 (2\log(1/\delta) + d\log(1 + K/\gamma))\bigg) \leq \delta, \end{align}\] where \(\eta_h^{\tau}(f)= \EE_{s' \sim P_h^0(\cdot |s^{\tau}_h, a_h^{\tau})}[f(s')] - f(s_{h+1}^{\tau})\).

Lemma 10 (Lemma F.3 of [13]). If \(K \geq \mathcal{O}(d^6)\) and the feature map is define as , then with probability at least \(1- \delta\), we have for any transition P, \[\begin{align} \sum_{h=1}^H\EE^{\pi^\star, P}\Big[\sum_{i=1}^d \|\phi_i(s,a)\mathbf{1}_i\|_{{\Lambda}_h^{-1}}|\mathbf{s}_1 = s_1\Big] \leq \frac{4d^{3/2}H}{\sqrt{K}}. \end{align}\]

Lemma 11 (Lemma 5.2 of [31]). For any \(\epsilon > 0\), the \(\epsilon\) -covering number of the Euclidean ball in \(\RR^d\) with radius \(R >0\) is upper bounded by \((1+2R/{\epsilon})^d\).

References↩︎

[1]
Levine, S., Kumar, A., Tucker, G. and Fu, J.(2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 .
[2]
Garcıa, J. and Fernández, F.(2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research16 1437–1480.
[3]
Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V. and Song, D.(2018). Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282 .
[4]
Zhang, H., Chen, H., Xiao, C., Li, B., Liu, M., Boning, D. and Hsieh, C.-J.(2020). Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems33 21024–21037.
[5]
Wang, R., Yang, Y., Liu, Z., Zhou, D. and Xu, P.(2024). Return augmented decision transformer for off-dynamics reinforcement learning. arXiv preprint arXiv:2410.23450 .
[6]
Iyengar, G. N.(2005). Robust dynamic programming. Mathematics of Operations Research30 257–280.
[7]
Nilim, A. and El Ghaoui, L.(2005). Robust control of markov decision processes with uncertain transition matrices. Operations Research53 780–798.
[8]
Blanchet, J., Lu, M., Zhang, T. and Zhong, H.(2024). Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage. Advances in Neural Information Processing Systems36.
[9]
Shi, L. and Chi, Y.(2024). Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity. Journal of Machine Learning Research25 1–91.
[10]
Goyal, V. and Grand-Clement, J.(2023). Robust markov decision processes: Beyond rectangularity. Mathematics of Operations Research48 203–226.
[11]
Ma, X., Liang, Z., Blanchet, J., Liu, M., Xia, L., Zhang, J., Zhao, Q. and Zhou, Z.(2022). Distributionally robust offline reinforcement learning with linear function approximation. arXiv preprint arXiv:2209.06620 .
[12]
Wang, H., Shi, L. and Chi, Y.(2024). Sample complexity of offline distributionally robust linear markov decision processes. arXiv preprint arXiv:2403.12946 .
[13]
Liu, Z. and Xu, P.(2024). Minimax optimal and computationally efficient algorithms for distributionally robust offline reinforcement learning. arXiv preprint arXiv:2403.09621 .
[14]
Liu, Z. and Xu, P.(2024). Distributionally robust off-dynamics reinforcement learning: Provable efficiency with linear function approximation. In International Conference on Artificial Intelligence and Statistics. PMLR.
[15]
Lee, J., Jeon, W., Lee, B.-J., Pineau, J. and Kim, K.-E.(2021). Optidice: Offline policy optimization via stationary distribution correction estimation.
[16]
Shi, L., Li, G., Wei, Y., Chen, Y., Geist, M. and Chi, Y.(2024). The curious price of distributional robustness in reinforcement learning with a generative model. Advances in Neural Information Processing Systems36.
[17]
Nelder, J. A. and Mead, R.(1965). A simplex method for function minimization. The computer journal7 308–313.
[18]
Yang, W., Wang, H., Kozuno, T., Jordan, S. M. and Zhang, Z.(2023). Robust markov decision processes without model estimation. arXiv preprint arXiv:2302.01248 .
[19]
Panaganti, K., Wierman, A. and Mazumdar, E.(2024). Model-free robust \(\varphi\)-divergence reinforcement learning using both offline and online data. arXiv preprint arXiv:2405.05468 .
[20]
Zhang, R., Hu, Y. and Li, N.(2023). Regularized robust mdps and risk-sensitive mdps: Equivalence, policy gradient, and sample complexity. arXiv preprint arXiv:2306.11626 .
[21]
Panaganti, K., Xu, Z., Kalathil, D. and Ghavamzadeh, M.(2022). Robust reinforcement learning using offline data. Advances in neural information processing systems35 32211–32224.
[22]
Liu, Z., Wang, W. and Xu, P.(2024). Upper and lower bounds for distributionally robust off-dynamics reinforcement learning. arXiv preprint arXiv:2409.20521 .
[23]
Lu, M., Zhong, H., Zhang, T. and Blanchet, J.(2024). Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near-optimal algorithm. arXiv preprint arXiv:2404.03578 .
[24]
Jin, C., Yang, Z., Wang, Z. and Jordan, M. I.(2020). Provably efficient reinforcement learning with linear function approximation. In Conference on learning theory. PMLR.
[25]
Sutton, R. S.(2018). Reinforcement learning: An introduction. A Bradford Book .
[26]
Jin, Y., Yang, Z. and Wang, Z.(2021). Is pessimism provably efficient for offline rl? In International Conference on Machine Learning. PMLR.
[27]
Tamar, A., Mannor, S. and Xu, H.(2014). Scaling up robust mdps using function approximation. In International conference on machine learning. PMLR.
[28]
Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J. and Glynn, P.(2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics. PMLR.
[29]
Tsybakov, A. B.(2009). Introduction to Nonparametric Estimation. Springer, New York.
[30]
Vershynin, R.(2018). High-dimensional probability: An introduction with applications in data science, vol. 47. Cambridge university press.
[31]
Vershynin, R.(2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 .

  1. Tsinghua University; email: c-tang21@mails.tsinghua.edu.cn↩︎

  2. Duke University; email: zhishuai.liu@duke.edu↩︎

  3. Equal contribution↩︎

  4. Duke University; email: pan.xu@duke.edu↩︎

  5. The general \(f\)-divergence includes widely studied divergences such as TV, KL, and \(\chi^2\) divergences.↩︎

  6. [20] studied the infinite horizon RRMDP with a discounted factor \(\gamma\), we replace the effective horizon length \(\frac{1}{1-\gamma}\) by the horizon length \(H\) in the finite horizon setting.↩︎