Soft Switching Expert Policies
for Controlling Systems
with Uncertain Parameters
October 23, 2025
This paper proposes a simulation-based reinforcement learning algorithm for controlling systems with uncertain and varying system parameters. While simulators are useful for safely learning control policies for physical systems, mitigating the reality gap remains a major challenge. To address this challenge, we propose a two-stage algorithm. In the first stage, multiple control policies are learned for systems with different parameters in a simulator. In the second stage, the control policies learned in the first stage are smoothly switched, based on observations from a real system, using an online convex optimization algorithm. Our proposed algorithm is demonstrated through numerical experiments.
reinforcement learning, deep learning, simulation, reality gap, policy adaptation, online convex optimization
Reinforcement learning (RL) is a machine learning framework that addresses sequential decision-making problems [1]. In RL, an agent collects experience data through interactions with a system and automatically learns a control policy based on the collected data. Recently, deep RL (DRL) has attracted much attention as an effective framework for controlling complex real-world systems. Its applications are wide-ranging, including robot manipulation [2], autonomous driving [3], and industrial plant operations [4]. These successes can be attributed to the ability of deep neural networks (DNNs) to effectively handle high-dimensional data and accurately approximate nonlinear functions. However, DRL requires a large amount of data, collected through repeated interactions with systems, to fully exploit its potential. In particular, for physical systems, data collection tends to be costly, and nearly random interactions during the early stages of learning can cause catastrophic equipment damage. In such cases, simulators are commonly used. Simulators are valuable for leveraging the capability of DRL to control physical systems because they can be easily accelerated and parallelized to safely collect a large amount of data. In the robotics community in particular, high-fidelity physics simulators have been developed [5].
Although simulators provide substantial advantages for DRL, learning in simulators faces a major challenge: bridging the reality gap, i.e., the mismatch in behaviors between simulated and real systems. If a control policy is learned for a simulated system that poorly represents the real system, it may not perform as expected. In general, even if a high-fidelity simulator is available, it remains challenging to accurately identify the system parameters of the real system (e.g., mass, friction coefficients, and actuator gains). Thus, domain randomization (DR) has been proposed to learn control policies that are robust to the reality gap [6], [7]. In DRL with DR, the uncertain system parameters are randomized within a given range during the learning phase in a simulator, rather than being fixed to a single set of parameters. This enables the agent to experience diverse scenarios. Despite its conceptual simplicity, DR has successfully bridged the reality gap in complex robotic systems. Nevertheless, a control policy learned with DR generally needs the system parameters of the real system to determine control actions appropriately. When the system parameters of the real system are unobservable, it is necessary to employ DNNs with recurrent architectures, such as recurrent neural networks and long short-term memory networks, which may complicate the learning procedure due to issues such as vanishing or exploding gradients. Additionally, the effectiveness of DR may be limited when the range of the potential reality gap is large. Since DR evaluates a control policy by an expected objective over randomly sampled system parameters, the learned control policy may not perform well across all parameters within the given range.
To synthesize a control policy that copes with a wide reality gap without recurrent architectures, we propose a two-stage learning algorithm. In the first stage, multiple representative points are selected within a given range of system parameters, referred to as the system parameter space. For each representative point, an expert policy is learned for the corresponding simulated system in a simulator. These expert policies are learned using a standard DRL algorithm, such as the deep deterministic policy gradient (DDPG) algorithm [8]. In the second stage, we construct a high-level policy, called an adaptive policy, in the form of a normalized weighted-sum formulation that combines the expert policies, where the weights are adjusted online based on observations from a real system. The adaptive policy can be regarded as a smooth switching mechanism among the expert policies. Specifically, we estimate a belief vector, which quantifies the similarity between the behavior of the real system and that of each representative simulated system, using online convex optimization (OCO) [9], and use it directly as the weights of the adaptive policy. OCO is a well-established theoretical framework in online learning, providing a solid foundation for sequential decision-making under uncertainty. To apply OCO, we define a convex loss function based on observations from the real system.
The paper is organized as follows: Section 2 formulates the problem. Section 3 briefly reviews the fundamentals. Section 4 presents a two-stage learning algorithm with a simulator. Section 5 demonstrates the results of the proposed algorithm. Finally, Section 6 concludes the paper and discusses future work.
Notation: \(\mathbb{R}\) denotes the set of real numbers. \(\mathbb{R}_{\ge0}\) denotes the set of nonnegative real numbers. \(\mathbb{R}^{n}\) denotes the \(n\)-dimensional Euclidean space. \(O(\cdot)\) denotes Landau’s notation; that is, \(f(n)=O(g(n))\Leftrightarrow \limsup_{n\to\infty}f(n)/g(n)<\infty\). \(E[\cdot]\) denotes the expectation operator.
We consider a discrete-time nonlinear system governed by \[\begin{align} x_{t+1}=f(x_{t},a_{t};\xi_{t}),\label{dynamical_system} \end{align}\tag{1}\] where \(x_{t}\in\mathcal{X}\subseteq\mathbb{R}^{n_x}\) and \(a_{t}\in\mathcal{A}\subseteq\mathbb{R}^{n_a}\) denote the state and control input at time \(t=1,2,...\), respectively. Let \(\mathcal{X}\) and \(\mathcal{A}\) denote the state and control input spaces, respectively. \(\xi_{t}\in\Xi\subseteq\mathbb{R}^{n_{\xi}}\) is the vector of system parameters at time \(t\), where \(\Xi\) denotes the system parameter space. \(f:\mathcal{X}\times\mathcal{A}\times\Xi\to\mathcal{X}\) denotes the nonlinear dynamics of the system. In this study, we assume that \(f\) is known, whereas \(\xi_t\) is uncertain and may vary gradually or abruptly. The initial state \(x_1\) is drawn from a probability density function \(\rho_1:\mathcal{X}\to\mathbb{R}_{\ge0}\).
Our goal is to synthesize a parameterized (deterministic) control policy \(\mu_{\theta}:\mathcal{X}\times\Xi\to\mathcal{A}\) that maximizes \[\begin{align} &&J(\theta)=E_{x_1\sim \rho_1}\left[\sum_{t=1}^{\infty}\gamma^{t}R(x_t,\mu_{\theta}(x_t,\xi_t))\right],\nonumber\\ &&\mathrm{s.t.}\;\;\;x_{t+1}=f(x_{t},\mu_{\theta}(x_{t},\xi_{t});\xi_{t}),\;\;\;x_{1}\sim\rho_{1},\label{objective} \end{align}\tag{2}\] where \(\gamma\in(0,1)\) is a discount factor and \(R:\mathcal{X}\times\mathcal{A}\to\mathbb{R}\) is an immediate reward function. \(\theta\) denotes the parameter vector of the control policy. In this study, a simulator with knowledge of \(f\) can be used. However, during control operation, the actual parameter vector \(\xi_t\) is not observable from the real system.
An RL problem is formulated by a Markov decision process (MDP) \(\mathcal{M}=\left<\mathcal{X},\mathcal{A},R,p_f,\rho_1\right>\), where \(\mathcal{X}\) is the state space of a system, \(\mathcal{A}\) is the control action (control input) space of an agent, \(R:\mathcal{X}\times\mathcal{A}\to\mathbb{R}\) is the immediate reward function, \(p_{f}:\mathcal{X}\times\mathcal{X}\times\mathcal{A}\to\mathbb{R}_{\ge0}\) is the probability density of state transitions, and \(\rho_1:\mathcal{X}\to\mathbb{R}_{\ge0}\) is the probability density of an initial state. The goal of RL is to learn a deterministic control policy \(\pi:\mathcal{X}\to\mathcal{A}\) or a stochastic control policy \(\pi:\mathcal{A}\times\mathcal{X}\to\mathbb{R}_{\ge0}\), which maximizes the sum of discounted rewards \(G_{t}=\sum_{k=1}^{\infty}\gamma^{k}R(x_{t+k},a_{t+k})\), where \(\gamma\in(0,1)\) is a discount factor. To solve the problem, the value function and the Q-function underlying a control policy \(\pi\) are commonly used and defined as \(V^{\pi}(x)=E[G_t|x_t=x]\) and \(Q^{\pi}(x,a)=E[G_t|x_t=x,a_t=a]\), respectively. Additionally, the optimal value function and the optimal Q-function are defined by \(V^{*}(x)=\max_{\pi}V^{\pi}(x),\;\forall{x}\in\mathcal{X}\) and \(Q^{*}(x,a)=\max_{\pi}Q^{\pi}(x,a),\;\forall x\in\mathcal{X},\;\forall a\in\mathcal{A}\), respectively. In the Q-learning algorithm, which is a standard RL algorithm, an agent learns the optimal Q-function \(Q^{*}\) through interactions with a system and determines its greedy control action as follows: \(a_t=\arg\max_{a\in\mathcal{A}}Q^{*}(x_{t},a)\). On the other hand, when \(\mathcal{X}\) and \(\mathcal{A}\) are continuous spaces and \(Q^{*}\) is modeled by a non-convex function such as a DNN, it is difficult to maximize \(Q^{*}(x_t,a)\) analytically. To address the problem, the DDPG algorithm has been proposed [8].
The DDPG algorithm can be viewed as an extension of the Q-learning algorithm for problems with continuous-valued states and control actions. In the DDPG algorithm, an agent learns the optimal Q-function \(Q_{\theta_{Q}}:\mathcal{X}\times\mathcal{A}\to\mathbb{R}\) and the optimal deterministic control policy \(\mu_{\theta_{\mu}}:\mathcal{X}\to\mathcal{A}\) using two types of DNNs, which are called the critic DNN and the actor DNN, respectively. Let \(\theta_{Q}\) and \(\theta_{\mu}\) be the parameter vectors of the critic DNN and the actor DNN, respectively.
At each time \(t=1,2,...,T\), the agent observes the state of the system \(x_t\) and determines the greedy control action \(\mu_{\theta_{\mu}}(x_t)\). The agent then injects noise \(\epsilon_t\), drawn from an arbitrary stochastic process, into the greedy control action, \(a_t=\mu_{\theta_{\mu}}(x_t)+\epsilon_t\), for exploration. By executing \(a_t\), the agent receives the next state \(x_{t+1}\) and the corresponding immediate reward \(r_t=R(x_t,a_t)\). The agent stores the experience \(e_t=(x_t,a_t,x_{t+1},r_{t})\) in the buffer \(\mathcal{D}\), which is called the replay buffer. In parallel with exploration, the agent randomly samples \(N\) experiences \(e^{(n)}=(x^{(n)},a^{(n)},x'^{(n)},r^{(n)}),\;n=1,2,...,N\) from \(\mathcal{D}\) and updates the parameter vectors \(\theta_{Q}\) and \(\theta_{\mu}\) using these experiences, a technique called experience replay. This technique helps to prevent the agent from learning from data that are temporally correlated or biased. The parameter vector of the critic DNN \(\theta_{Q}\) is updated by minimizing the following temporal difference error. \[\begin{align} &&L=\frac{1}{N}\sum_{n=1}^{N}(y^{(n)}-Q_{\theta_{Q}}(x^{(n)},a^{(n)}))^{2},\label{td_error}\\ &&y^{(n)}=r^{(n)}+\gamma Q_{\theta_{Q}^{-}}(x'^{(n)},\mu_{\theta_{\mu}^{-}}(x'^{(n)})),\label{td_target} \end{align}\] where the target values \(y^{(n)}\) are generated by the target critic DNN \(Q_{\theta_{Q}^{-}}\) and the target actor DNN \(\mu_{\theta_{\mu}^{-}}\), rather than the original critic and actor DNNs, which improves the stability of the learning process. The parameter vector of the actor DNN \(\theta_{\mu}\) is updated using the following policy gradient. \[\begin{align} &&\nabla_{\theta_{\mu}}J(\theta_{\mu})\simeq\nonumber\\ &&\frac{1}{N}\sum_{n=1}^{N}\nabla_{a}Q_{\theta_{Q}}(x,a)|_{x=x^{(n)},a=\mu_{\theta_{\mu}}(x^{(n)})}\nabla_{\theta_{\mu}}\mu_{\theta_{\mu}}(x)|_{x=x^{(n)}}.\label{PG} \end{align}\qquad{(1)}\] The parameter vectors of the target DNNs are updated as follows: \(\theta_{Q}^{-}\leftarrow\eta\theta_{Q}+(1-\eta)\theta_{Q}^{-},\;\theta_{\mu}^{-}\leftarrow\eta\theta_{\mu}+(1-\eta)\theta_{\mu}^{-},\) where \(\eta\in(0,1)\). When \(\eta\ll 1\), the target DNNs are updated slowly and smoothly. The aforementioned process for \(t=1,2,\dots,T\), known as an episode, is repeated iteratively to learn the critic and actor DNNs.
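As a rough illustration of these updates, the following sketch shows one critic/actor update step in PyTorch. The networks `critic` and `actor`, their target copies, the optimizers, and the minibatch layout are hypothetical placeholders, not the implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, critic, actor, critic_target, actor_target,
                critic_opt, actor_opt, gamma=0.99, eta=0.005):
    """One DDPG update from a minibatch of N experiences (illustrative sketch)."""
    x, a, x_next, r = batch  # tensors of shape (N, n_x), (N, n_a), (N, n_x), (N, 1)

    # Critic update: minimize the temporal-difference error L, with targets y^(n)
    # produced by the target critic and target actor DNNs.
    with torch.no_grad():
        y = r + gamma * critic_target(x_next, actor_target(x_next))
    critic_loss = F.mse_loss(critic(x, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the deterministic policy gradient (1) by minimizing -Q(x, mu(x)).
    actor_loss = -critic(x, actor(x)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target DNNs: theta^- <- eta * theta + (1 - eta) * theta^-.
    with torch.no_grad():
        for p, p_targ in zip(critic.parameters(), critic_target.parameters()):
            p_targ.mul_(1.0 - eta).add_(eta * p)
        for p, p_targ in zip(actor.parameters(), actor_target.parameters()):
            p_targ.mul_(1.0 - eta).add_(eta * p)
```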
When DRL is used to synthesize a control policy for a physical system, such as a robotic system or an industrial plant, the learning process is typically conducted in a simulator to ensure safety and reduce operational costs. However, this approach faces a major challenge: the reality gap. When there exists a large reality gap, the learned control policy tends to perform poorly on the real system. Even if \(f\) is known, identifying the system parameters of the real system remains a challenge. In real systems, \(\xi_t\) is not only uncertain but also varying due to factors such as disturbances. To mitigate the reality gap, DR is commonly employed [6], [7]. In DRL with DR, the agent learns a control policy through interactions with a simulator whose system parameters are randomly varied. The objective is as follows: \[\begin{align} \max_{\theta_{\mu}} E_{\xi\sim \rho_{\Xi}}\left[E_{x_1\sim\rho_{1},x_{t+1}\sim p_{f,\xi}}\left[\sum_{t=1}^{\infty}\gamma^{t}R(x_{t},\mu_{\theta_{\mu}}(x_{t},\xi))\right]\right],\nonumber\\\label{DR_objective} \end{align}\tag{3}\] where \(\rho_{\Xi}:\Xi\to\mathbb{R}_{\ge0}\) denotes a probability density of the system parameters, and \(p_{f,\xi}\) denotes a probability density of state transitions with a vector of the system parameters \(\xi\). Practically, at the start of each episode, \(\xi\) is drawn from \(\rho_{\Xi}\) (e.g., a uniform distribution \(\mathcal{U}(\Xi)\)) and remains fixed for the entire duration of the episode.
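For concreteness, one DR training episode might be organized as follows; `f`, `policy`, `R`, and `rho1` are hypothetical stand-ins for the simulator dynamics, the DR policy, the reward function, and the initial-state sampler.

```python
import numpy as np

def run_dr_episode(f, policy, R, rho1, xi_low, xi_high, T=200, rng=None):
    """Collect one DR episode: xi is drawn once from U(Xi) and held fixed (sketch)."""
    rng = rng or np.random.default_rng()
    xi = rng.uniform(xi_low, xi_high)      # randomize the system parameters for this episode
    x = rho1(rng)                          # sample an initial state x_1 ~ rho_1
    experiences = []
    for _ in range(T):
        a = policy(x, xi)                  # DR policy conditioned on (x, xi)
        x_next = f(x, a, xi)               # simulated transition with parameters xi
        experiences.append((x, a, x_next, R(x, a)))
        x = x_next
    return experiences
```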
In general, a control policy learned by DRL with DR depends not only on the current state \(x_t\) but also on the parameters of the real system \(\xi_t\). When \(\xi_t\) is not observable directly from a real system, it is necessary to estimate them using past observations from the real system. Additionally, in DRL with DR, we optimize the expected performance with respect to randomly sampled system parameters, following a risk-neutral formulation. The learned policy may not perform well on certain systems in \(\Xi\).
We use DRL in a simulator to synthesize a control policy for the nonlinear system (1). Since we cannot identify the system parameter \(\xi_{t}\) beforehand, it is necessary to account for the effects of the reality gap across the system parameter space \(\Xi\). Instead of DR, which relies on DNNs with recurrent architectures when the system parameters are unobservable and is difficult to apply over a large system parameter space, we propose a two-stage algorithm for controlling the system (1) while accounting for a large reality gap, as shown in Fig. 1. In the first stage, multiple representative points are selected from the system parameter space \(\Xi\), and a control policy, referred to as an expert policy, is learned in a simulator for each corresponding simulated system. In the second stage, the expert policies are combined using a normalized weighted-sum formulation, and switching among them is performed smoothly by adjusting the weights online based on observations from the real system. The algorithm is called the soft switching expert policies (SSEP) algorithm.
To prepare multiple expert policies, we define the following performance measure of a control policy \(\mu\) for a system with \(\xi\in\Xi\). \[\begin{align} &&G(\mu|\xi)=\sum_{t=1}^{H}R(x_t,\mu(x_t)),\nonumber\\ &&x_{t+1}=f(x_t,\mu(x_t);\xi),\;x_1=\tilde{x},\nonumber \end{align}\] where \(\tilde{x}\) denotes an initial state given for the evaluation and \(H\) denotes the evaluation horizon. We say that a control policy \(\mu\) performs poorly on a system with \(\xi\) when \(G(\mu|\xi)\) is below a given threshold \(d\). In this study, several representative points \(\{\xi^{(j)}\}_{j=1}^{M}\) are manually selected from \(\Xi\), and expert policies are learned for the corresponding simulated systems. To ensure coverage of the parameter space \(\Xi\), we construct a sufficient number of expert policies such that, for any \(\xi \in \Xi_{\text{grid}}\subseteq\Xi\) (a finite grid introduced in Section 5), at least one expert policy performs well on the system with \(\xi\). The development of a practical method to automatically select the representative points \(\{\xi^{(j)}\}_{j=1}^{M}\) remains future work.
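Under the same notation, \(G(\mu|\xi)\) can be estimated by a single rollout of length \(H\) in the simulator; the function and argument names below are hypothetical.

```python
def evaluate_return(policy, f, R, xi, x_init, horizon):
    """Finite-horizon return G(mu | xi) of a policy on the simulated system f(., .; xi)."""
    x, total = x_init, 0.0
    for _ in range(horizon):
        a = policy(x)
        total += R(x, a)          # accumulate R(x_t, mu(x_t))
        x = f(x, a, xi)           # x_{t+1} = f(x_t, mu(x_t); xi)
    return total

# The policy "performs poorly" on xi when evaluate_return(...) falls below the threshold d.
```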
We consider the following normalized weighted sum of the expert policies learned in a simulator as an adaptive policy. \[\begin{align} \mu(x,w)=\sum_{j=1}^{M}w_{j}\mu_{\theta_{\mu,j}}(x),\label{adaptive_policy} \end{align}\tag{4}\] where \(w=[w_1,w_2,\dots,w_M]^{\top}\) is an element of the probability simplex \(\Delta_M\), i.e., \(\sum_{j=1}^{M}w_{j}=1,\;w_{j}\ge0,\;\forall{j}\in\{1,2,\dots,M\}\). Based on the history of observations from the real system, we estimate the belief vector \(w=[w_1,w_2,\dots,w_M]^{\top}\in\Delta_{M}\), which quantifies the similarity between the behavior of the real system and that of each representative simulated system. We then use this belief vector directly as the weights of the adaptive policy. This is because an expert policy that is effective for a representative system is likely to perform well on systems exhibiting similar behaviors.
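The adaptive policy (4) is simply a convex combination of the expert actions; a minimal sketch, with `experts` a list of hypothetical expert-policy callables, is:

```python
import numpy as np

def adaptive_policy(x, w, experts):
    """Normalized weighted sum of expert actions, with w in the probability simplex (eq. (4))."""
    actions = np.stack([expert(x) for expert in experts])   # shape (M, n_a)
    return w @ actions                                      # sum_j w_j * mu_j(x)
```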
We apply an OCO algorithm to estimate \(w_t\) online. The loss function for the OCO algorithm is defined as follows: \[\begin{align} \ell_{t}(w_{t})=\left\|x_{t+1}-\sum_{j=1}^{M}w_{t,j}f(x_{t},a_{t};\xi^{(j)})\right\|_{2}^{2},\label{loss_OCO} \end{align}\tag{5}\] where \(x_{t+1}\) results from executing the control action \(a_{t}\) in the state \(x_{t}\) of the real system. The loss function is convex with respect to \(w_t\). In this study, we apply the follow-the-regularized-leader (FTRL) algorithm to estimate \(w_{t}\) [9]. Let us define \(\mathrm{f}(x_{t},a_{t})w_{t}:=\sum_{j=1}^{M}w_{t,j}f(x_{t},a_{t};\xi^{(j)})\), where \(\mathrm{f}(\cdot,\cdot)\) is an \((n_x\times M)\)-matrix. The gradient of the loss \(\ell_{t}(w_t)\) with respect to \(w_{t}\) is \[\begin{align} \nabla_{w}\ell_{t}(w_t)=-2\mathrm{f}(x_t,a_t)^{\top}(x_{t+1}-\mathrm{f}(x_t,a_t)w_t),\nonumber \end{align}\] which is computed by the loss generator shown in Fig. 1. In the FTRL algorithm, at each time \(t\), the belief vector is computed by \[\begin{align} w_{t}=\arg\min_{w\in\Delta_M} \left\{\sum_{\tau=1}^{t-1}\nabla_{w}\ell_{\tau}(w_{\tau})^{\top}w+\frac{1}{\eta}\Phi(w)\right\},\label{FTRL} \end{align}\tag{6}\] where \(\eta>0\) denotes a learning rate and \(\Phi:\Delta_{M}\to\mathbb{R}\) is a regularizer, for which we choose the unnormalized negentropy \(\Phi(w)=\sum_{j=1}^{M}(w_{j}\log w_{j}-w_{j})\). Specifically, \(w_{t}\) is computed by \[\begin{align} w_{t,i}=\frac{\exp(-\eta\sum_{\tau=1}^{t-1} \nabla_{w}\ell_{\tau}(w_{\tau})_{i})}{\sum_{j=1}^{M}\exp(-\eta\sum_{\tau=1}^{t-1}\nabla_{w}\ell_{\tau}(w_{\tau})_{j})},\;i\in\{1,2,...,M\}. \nonumber\\ \label{w_update} \end{align}\qquad{(2)}\] The standard performance measure of OCO algorithms is the static regret, which is the difference between the cumulative loss incurred by the OCO algorithm and that of the best fixed decision in hindsight: \[\begin{align} Reg(T)=\sum_{t=1}^{T}\ell_{t}(w_{t})-\min_{w\in\Delta_{M}}\sum_{t=1}^{T}\ell_{t}(w).\nonumber \end{align}\] With the learning rate \(\eta=c\sqrt{\frac{\log M}{T}}\) for a constant \(c\), the regret of the algorithm is bounded as \(Reg(T)= O(\sqrt{T\log M})\).
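A minimal sketch of the entropic FTRL update follows, assuming the simulator can evaluate \(f(x_t,a_t;\xi^{(j)})\) for all representative points so that the matrix \(\mathrm{f}(x_t,a_t)\) (denoted `F_t` below) is available; the class and method names are hypothetical.

```python
import numpy as np

class EntropicFTRL:
    """FTRL on the simplex with the unnormalized negentropy regularizer (eqs. (6) and (2))."""

    def __init__(self, num_experts, eta):
        self.eta = eta
        self.grad_sum = np.zeros(num_experts)    # accumulated loss gradients

    def update(self, x_next, F_t, w_t):
        # Gradient of the quadratic loss (5): -2 f(x_t, a_t)^T (x_{t+1} - f(x_t, a_t) w_t),
        # where F_t stacks f(x_t, a_t; xi^(j)) column-wise (an n_x-by-M matrix).
        grad = -2.0 * F_t.T @ (x_next - F_t @ w_t)
        self.grad_sum += grad

    def weights(self):
        # Closed-form solution (2): exponential weights on the accumulated gradients.
        z = -self.eta * self.grad_sum
        z -= z.max()                             # subtract the max for numerical stability
        w = np.exp(z)
        return w / w.sum()
```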
Remark: An adaptive policy could also be constructed using a DNN that generates the weights of the expert policies. However, whenever the set of expert policies is modified, such a DNN-based adaptive policy must be retrained, which reduces its practicality. Therefore, we apply an OCO-based approach to adjust the weights of the adaptive policy.
The FTRL algorithm (6) achieves a sublinear static regret bound with respect to \(T\), which is valuable when the system parameters of a real system are nearly fixed. However, in general, the system parameters may vary gradually or abruptly due to disturbances. The FTRL algorithm estimates the belief vector using all past losses, which implies that losses generated under past system parameters are treated equally. These past system parameters may differ from the current ones, which can potentially degrade the estimation of the current belief. To mitigate this effect, we apply the following discounted FTRL algorithm. \[\begin{align} w_{t}=\arg\min_{w\in\Delta_M} \left\{\sum_{\tau=1}^{t-1}\beta^{t-1-\tau}\nabla_{w}\ell_{\tau}(w_{\tau})^{\top}w+\frac{1}{\eta}\Phi(w)\right\},\nonumber\\ \label{discount_FTRL} \end{align}\qquad{(3)}\] where \(\eta\) is a learning rate and \(\beta\in(0,1)\) is a discount factor to gradually reduce the effect of losses from the distant past [10]. Specifically, \(w_t\) is computed by \[\begin{align} &&w_{t,i}=\frac{\exp(-\eta\sum_{\tau=1}^{t-1}\beta^{t-1-\tau}\nabla_{w}\ell_{\tau}(w_{\tau})_{i})}{\sum_{j=1}^{M}\exp(-\eta\sum_{\tau=1}^{t-1}\beta^{t-1-\tau} \nabla_{w}\ell_{\tau}(w_{\tau})_{j})},\;\nonumber\\ && i\in\{1,2,...,M\}. \label{w_update_discount} \end{align}\qquad{(4)}\]
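Relative to the sketch above, the discounted variant only changes how the gradients are accumulated: the running sum is multiplied by \(\beta\) before each new gradient is added, which reproduces the update (4). A self-contained sketch with the same hypothetical interface:

```python
import numpy as np

class DiscountedEntropicFTRL:
    """Discounted FTRL (eqs. (3) and (4)): older gradients are geometrically down-weighted."""

    def __init__(self, num_experts, eta, beta):
        self.eta, self.beta = eta, beta
        self.grad_sum = np.zeros(num_experts)

    def update(self, x_next, F_t, w_t):
        grad = -2.0 * F_t.T @ (x_next - F_t @ w_t)        # gradient of the loss (5)
        self.grad_sum = self.beta * self.grad_sum + grad  # discount past gradients by beta

    def weights(self):
        z = -self.eta * self.grad_sum
        z -= z.max()                                      # numerical stabilization
        w = np.exp(z)
        return w / w.sum()                                # closed form (4)
```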
We consider the following discrete-time nonlinear system. \[\begin{align} \begin{bmatrix} x_{t+1,1}\\ x_{t+1,2} \end{bmatrix}=\begin{bmatrix} x_{t,1}+\delta x_{t,2}\\ x_{t,2}+\delta\left(\frac{\mathrm{g}}{\mathrm{l}}\sin x_{t,1}-\frac{\xi_{t,2}x_{t,2}}{\xi_{t,1}\mathrm{l}^{2}}+\frac{10.0 a_{t,1}}{\xi_{t,1}\mathrm{l}^2}\right) \end{bmatrix},\nonumber\\ \label{pendulum} \end{align}\tag{7}\] where \(\mathrm{g}=9.81\), \(\delta=0.05\), and \(\mathrm{l}=1.0\). The state space is \(\mathcal{X}=\mathbb{R}^{2}\), and the control action (control input) space is \(\mathcal{A}=[-1,1]\subset\mathbb{R}\). Let \(\xi=[\xi_{1},\xi_{2}]^{\top}\) denote the vector of system parameters. We assume that \(\xi\) is uncertain but lies within the system parameter space \(\Xi=\{\xi\in\mathbb{R}^{2}|\;0.1\le\xi_{1}\le2.0,\;0.0\le\xi_{2}\le 2.0\}\). The immediate reward function \(R\) is defined by \[\begin{align} R(x_{t},a_{t})=-x_{t,1}^{2}-0.1x_{t,2}^{2}-10.0a_{t,1}^{2},\label{reward_example} \end{align}\tag{8}\] that is, our goal is to stabilize the system at the target state \(x^{*}=[x_{1}^{*},x_{2}^{*}]^{\top}=[0,0]^{\top}\), which is a fixed point of (7). To learn expert policies in a simulator, we apply the DDPG algorithm with the same network architecture for the actor and critic DNNs. Each DNN consists of two fully connected hidden layers with 128 units per layer. The activation function of the hidden layers is ReLU, and the output layers of the actor DNNs use the hyperbolic tangent function. The parameter vectors of the DNNs are updated using Adam [11] with learning rates of \(1.0\times10^{-4}\) for the actor DNNs and \(1.0\times 10^{-3}\) for the critic DNNs.
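For reference, one step of the simulated system (7) and the reward (8) can be written as follows; the names `mass` and `damping` for \(\xi_1\) and \(\xi_2\) are only suggestive labels, not identifications made in the paper.

```python
import numpy as np

G_ACC, LENGTH, DELTA = 9.81, 1.0, 0.05   # g, l, delta in (7)

def pendulum_step(x, a, xi):
    """One step of the discrete-time pendulum (7) with parameters xi = [xi_1, xi_2]."""
    theta, omega = x
    mass, damping = xi                   # suggestive names for xi_1 and xi_2
    omega_next = omega + DELTA * (
        (G_ACC / LENGTH) * np.sin(theta)
        - damping * omega / (mass * LENGTH**2)
        + 10.0 * a / (mass * LENGTH**2)
    )
    return np.array([theta + DELTA * omega, omega_next])

def reward(x, a):
    """Immediate reward (8)."""
    return -x[0]**2 - 0.1 * x[1]**2 - 10.0 * a**2
```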
To visualize the robustness of a control policy \(\mu\) to discrepancies in system parameters, we plot \(G(\mu|\xi)\) for \(\xi\in\Xi_{\mathrm{grid}}=\{0.15,0.25,...,1.95\}\times\{0.05,0.15,...,1.95\}\). The initial state for \(G(\mu|\xi)\) is set to \(\tilde{x}=[\pi\;0]^{\top}\), the evaluation horizon \(H\) is 1000, and the threshold is set to \(d=-1500\), i.e., when \(G(\mu|\xi)\) is below \(-1500\), the policy \(\mu\) does not perform well on the system with \(\xi\).
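The robustness maps in Figs. 2 and 3 correspond to sweeping this grid. A sketch of the sweep, reusing the hypothetical `evaluate_return`, `pendulum_step`, and `reward` helpers from the earlier sketches, where `policy` is whichever control policy is being assessed:

```python
import numpy as np

xi1_grid = np.arange(0.15, 2.0, 0.1)   # {0.15, 0.25, ..., 1.95}
xi2_grid = np.arange(0.05, 2.0, 0.1)   # {0.05, 0.15, ..., 1.95}
x_tilde = np.array([np.pi, 0.0])       # evaluation initial state
d = -1500.0                            # performance threshold

# G_map[i, j]: return of `policy` on the system with xi = (xi1_grid[i], xi2_grid[j]).
G_map = np.array([[evaluate_return(policy, pendulum_step, reward,
                                   np.array([xi1, xi2]), x_tilde, horizon=1000)
                   for xi2 in xi2_grid] for xi1 in xi1_grid])
poor = G_map < d                       # True where the policy does not perform well
```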
The performances of the control policies learned using DDPG with DR are shown in Fig. 2, where \(\xi\) is uniformly sampled from \(\Xi\). Fig. 2 (a) shows the performance of the policy learned without access to the true system parameters, whereas Fig. 2 (b) shows that of the policy learned with access to them. These results indicate that it is difficult to learn a robust policy using DR without access to the true system parameters. Additionally, even if the true system parameters are available, the learned policy may not perform well for some parameters in \(\Xi\) as shown in Fig.2 (b). Thus, we apply the SSEP algorithm.
Figure 2: \(G(\mu|\xi)\) of the policies learned using DDPG with DR. (a) The policy learned without access to the true system parameters. (b) The policy learned with access to the true system parameters.
To prepare multiple expert policies, we manually select representative points from \(\Xi\) based on the simulation results \(G(\mu|\xi)\) obtained for several candidate simulated systems \(f(\cdot,\cdot;\xi)\). Specifically, we select the following three representative points: \(\xi^{(1)}=[0.1,1.0],\;\xi^{(2)}=[2.0, 0.0]\), and \(\xi^{(3)}=[2.0,2.0]\). We then synthesize expert policies using DDPG for these representative simulated systems. The performances of the three expert policies are shown in Fig. 3. Each expert policy performs well on systems whose behaviors are similar to those of the corresponding representative simulated system. Additionally, for any \(\xi\in\Xi_{\text{grid}}\), there exists at least one expert policy that performs well.
Figure 3: \(G(\mu|\xi)\) of three expert policies learned for the representative simulated systems using DDPG. (a) Expert policy learned for \(f(\cdot,\cdot;\xi^{(1)})\). (b) Expert policy learned for \(f(\cdot,\cdot;\xi^{(2)})\). (c) Expert policy learned for \(f(\cdot,\cdot;\xi^{(3)})\).
We first consider a real system whose system parameters are fixed. The weights of the adaptive policy are adjusted by estimating the belief vector \(w\) with the FTRL algorithm every five control steps. We set the learning rate to \(\eta=0.5\). The performance of the adaptive policy is shown in Fig. 4. By smoothly switching the expert policies according to the belief estimated from the loss (5) computed from observations of the real system, the adaptive policy achieves high performance for all systems in \(\Xi_{\text{grid}}\).
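Putting the pieces together, the closed loop of Fig. 1 for this experiment might be organized as follows; `real_step` (the real system), the `experts` list, and the reuse of `EntropicFTRL`, `adaptive_policy`, and `pendulum_step` from the earlier sketches are all assumptions of this sketch.

```python
import numpy as np

def run_adaptive_control(real_step, experts, xi_points, T=500, eta=0.5, update_every=5):
    """Closed loop: act with the adaptive policy and re-estimate the belief w by FTRL (sketch)."""
    M = len(experts)
    ftrl = EntropicFTRL(M, eta)
    w = np.ones(M) / M                        # uniform initial belief
    x = np.array([np.pi, 0.0])
    for t in range(T):
        a = adaptive_policy(x, w, experts)    # soft switching of expert policies (4)
        x_next = real_step(x, a)              # observation from the real system
        # Predicted next states of the representative simulated systems (columns of f(x_t, a_t)).
        F_t = np.stack([pendulum_step(x, a, xi_j) for xi_j in xi_points], axis=1)
        ftrl.update(x_next, F_t, w)
        if (t + 1) % update_every == 0:       # re-estimate the belief every five control steps
            w = ftrl.weights()
        x = x_next
    return w
```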
Next, we consider a real system whose system parameters vary abruptly several times. We adjust the weights of the adaptive policy by estimating the belief vector \(w\) using the discounted FTRL algorithm every five control steps. We set the learning rate to \(\eta=1.0\) and the discount factor to \(\beta=0.9\). We assume that the system parameters vary as follows: \[\begin{align} \xi_{t,1}=\begin{cases} 1.2 & t\in T_1\\ 0.2 & t\in T_2\\ 1.8 & t\in T_3, \end{cases}\;\;\;\; \xi_{t,2}=\begin{cases} 0.0 & t\in T_1\\ 1.0 & t\in T_2\\ 2.0 & t\in T_3, \end{cases}\nonumber \end{align}\] where the time intervals are \(T_1=[1,100]\), \(T_2=[101,200]\), and \(T_3=[201,500]\). The state is reset to \([\pi,0]^{\top}\) at the beginning of each time interval. The time response is shown in Fig. 5. The adaptive policy smoothly switches among the expert policies by adjusting the weights \(w\) in response to the varying system parameters.
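In a simulation of this scenario, the real system can be emulated by driving `pendulum_step` with a piecewise-constant parameter schedule and replacing `EntropicFTRL` with `DiscountedEntropicFTRL` (\(\eta=1.0\), \(\beta=0.9\)); the helper below is hypothetical.

```python
import numpy as np

def xi_schedule(t):
    """Piecewise-constant system parameters used in the experiment (1-indexed time)."""
    if t <= 100:
        return np.array([1.2, 0.0])   # T_1 = [1, 100]
    if t <= 200:
        return np.array([0.2, 1.0])   # T_2 = [101, 200]
    return np.array([1.8, 2.0])       # T_3 = [201, 500]

# The state is reset to [pi, 0] at t = 1, 101, and 201, the start of each interval.
```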
We proposed a simulation-based two-stage algorithm. In the first stage, we select representative systems \(f(\cdot,\cdot;\xi^{(j)})\), \(j=1,2,...,M\), and synthesize an expert policy for each representative system using a DRL algorithm in a simulator. In the second stage, we construct an adaptive policy as a normalized weighted sum of expert policies. The weights are adjusted by an OCO algorithm using observations from a real system. The effectiveness of our proposed algorithm was validated through numerical experiments. One direction for future research is to develop a method that automatically selects representative systems for preparing expert policies. Another important direction is to apply our proposed algorithm to complex real-world systems (e.g., industrial plants and power grids).