Data-Driven LQR using Reinforcement Learning and Quadratic Neural Networks

Soroush Asri, Luis Rodrigues
Department of Electrical Engineering, Concordia University


Abstract

This paper introduces a novel data-driven approach to design a linear quadratic regulator (LQR) using a reinforcement learning (RL) algorithm that does not require the system model. The key contribution is to perform policy iteration (PI) by designing the policy evaluator as a two-layer quadratic neural network (QNN). This network is trained through convex optimization. To the best of our knowledge, this is the first time that a QNN trained through convex optimization is employed as the Q-function approximator (QFA). The main advantage is that the QNN’s input-output mapping has an analytical expression as a quadratic form, which can then be used to obtain an analytical expression for policy improvement. This is in stark contrast to available techniques in the literature that must train a second neural network to obtain the policy improvement. The article establishes the convergence of the learning algorithm to the optimal controller provided the system is controllable and one starts from a stabilizing policy. A quadrotor example demonstrates the effectiveness of the proposed approach.

Keywords: LQR, Reinforcement Learning, Policy Iteration, Quadratic Neural Networks.

1 Introduction

Optimal controllers minimize a given cost function subject to a dynamic model. This design method has received a lot of attention, especially due to its potential in applications such as autonomous vehicle navigation and robotics, economics and management, as well as energy optimization [1], to cite a few. Traditionally, optimal control design relies on well-established techniques such as dynamic programming, calculus of variations, and Pontryagin’s maximum principle [2]. While these methods have proven effective in many cases, they often face challenges when dealing with large-scale, nonlinear, or uncertain systems [3], [4]. To address this issue, one often approximates a nonlinear system by a linear model around an operating point, thus enabling the design of a linear quadratic regulator (LQR) [5]. One of the main advantages of the LQR [5] is that it has a closed-form solution [6], [7] as a function of a matrix that is the positive definite solution of a Riccati equation [2]. An iterative technique for solving the Riccati equation is discussed in reference [8], which requires an accurate linear model of the system.

However, there are instances where complex nonlinear systems cannot be adequately approximated by linear models or where the system dynamics remain unknown. In such cases, conventional techniques for solving optimal control problems may not be applicable. As a result, researchers have explored data-driven methods such as RL, imitation learning, and Gaussian process regression as alternatives for solving optimal control problems [9], [10], [11]. RL is a powerful data-driven approach that enables an agent to learn optimal control policies by interacting with the environment and adjusting its actions or control policies to minimize a cumulative cost [12]. The motivation to use RL as an approximation to optimal control design stems from several key factors, including adaptability, the ability to handle complex dynamics, learning from real-world experience, and overcoming inaccuracies in system models [13], [14].

The majority of research dedicated to RL in optimal control investigates discrete-time optimal control using value-based RL algorithms [15]. In value-based RL, the core concept is to estimate value functions, such as the Q-function [13], which can then be used to derive optimal policies. A common approach to estimate value functions is to use the temporal difference (TD) equation [16]. In such an approach, the Bellman error [17] is reduced by iteratively updating the value function approximation. This is called adaptive dynamic programming (ADP) [15]. ADP is widely favored due to its link with dynamic programming and the Bellman equation [18]. The application of ADP methods such as policy iteration (PI) and value iteration (VI) to feedback control systems is discussed in references [15], [19]. For optimal control, PI is preferred over VI due to its stability arising from the policy improvement guarantee, which leads to reliable convergence to the optimal policy [20]. Additionally, Q-learning, an ADP algorithm that approximates the Q-function, can be used to achieve optimal control without relying on a system model [21].

It is common to use a neural network (NN) to approximate the value function during the policy evaluation step of a PI algorithm [22], [23]. However, the drawbacks of using NNs as the value function approximator (VFA) are that (i) the optimization of the neural network’s weights is not convex and training the neural network yields only locally optimal weights, (ii) selecting an appropriate architecture for the NN typically involves a trial-and-error procedure, (iii) the input-output mapping of the NN lacks an analytical expression and, as a consequence, a second neural network is needed to perform policy improvement, and (iv) providing a comprehensive proof of convergence to the optimal control is difficult or even unattainable. To address these issues, a two-layer quadratic neural network (QNN) trained by convex optimization, introduced in reference [24], will be chosen as the VFA. The advantages of using a two-layer QNN as the VFA compared to other neural networks are: (i) two-layer QNNs are trained by solving a convex optimization problem and therefore the globally optimal weights are found [24], (ii) the optimal number of neurons in the hidden layer is obtained as a by-product of the training [24], (iii) the input-output mapping of the QNN is a quadratic form [24], [25] and therefore one can analytically minimize the value function with respect to the control policy. Reference [25] addresses applications of QNNs to system identification and control of dynamical systems.

LQR design problems have been addressed using various RL approaches since they can serve as benchmarks for evaluating the performance of RL algorithms [26], [27], given that their solution is known in closed form. This fact has motivated the use of ADP to solve the discrete-time LQR problem in reference [28], the discrete-time LQ Gaussian problem in references [29], [30], and discrete-time LQ tracking in references [31], [32] without using the system model. Moreover, in the Q-learning scheme proposed in reference [28] one can obtain the desired LQR controller by solving a least-squares optimization to derive the Q-function, provided the system is controllable and a stabilizing linear state feedback controller is known.

This paper designs a discrete-time LQR controller using Q-learning. More specifically, we propose to use a two-layer QNN as the VFA in Q-learning for an LQR problem, for which the value function is known to be quadratic. To the best of our knowledge, this represents the first result in which a QNN trained through convex optimization is used as a VFA. Additionally, the designed controller is proven to converge to the LQR controller provided the system is controllable and an initial stabilizing policy is known. Simulations conducted in MATLAB illustrate the convergence of the learning algorithm to the optimal controller for different initial stabilizing policies.

This paper is organized as follows. Section 2 presents the problem statement. Section 3 reviews two-layer QNNs. Section 4 presents the general form of the Q-learning algorithm. Section 5 contains the proposed approach with the proof of convergence. Section 6 focuses on examples, which is followed by the conclusions.

2 Problem statement

An unknown controllable linear system model is written as \[\begin{gather} x_{k+1}=Ax_k+Bu_k \label{linear95dynamics} \end{gather}\tag{1}\]

where \(x_k \in \mathbb{R}^{n_x}\) is the state vector, \(u_k \in \mathbb{R}^{n_u}\) is the input vector, and \(A, \: B\) are unknown matrices such that the pair \((A,B)\) is controllable. Define the policy \(\pi (.)\) as a linear mapping from the state vector to the input vector as \[\begin{gather} u_{k}=\pi (x_k) = -K^\pi x_k \end{gather}\]

where \(K^\pi\) is a matrix to be determined by the designer. Define the local cost as \[\begin{gather} c(x_k,u_k)= x_k^T Q x_k + u_k^T R u_k \end{gather}\]

where \(Q\) and \(R\) are both positive definite. The value function obtained when one follows the control policy \(\pi(.)\) is defined as \[V^\pi(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} \left( x_i^T Q x_i + u_i^T R u_i \right)\]

where \(0<\gamma \leq 1\) is a discount factor.

The objective is to obtain the optimal policy \(\pi^* (.)\) that minimizes \(V^{\pi}(x_k)\) for all states \(x_k\) subject to the unknown dynamics of the system. This is done using QNNs as a VFA.

Remark 1. If \(\gamma =1\) then the optimal control problem is an LQR problem.
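For reference, when \(A\) and \(B\) are known and \(\gamma=1\), the optimal policy is the standard LQR state feedback [2], [5], which the data-driven algorithm proposed in this paper is expected to recover: \[\begin{gather} \pi^*(x_k) = -K^* x_k, \qquad K^* = \left( R + B^T P B \right)^{-1} B^T P A \end{gather}\] where \(P>0\) is the solution of the discrete-time algebraic Riccati equation \[\begin{gather} P = Q + A^T P A - A^T P B \left( R + B^T P B \right)^{-1} B^T P A. \end{gather}\]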

3 Two-layer QNNs

Consider the neural network in Fig. 1 with one hidden layer, one output, and a degree-two polynomial activation function, where \(\mathcal{X}_i \in \mathbb{R}^n\) is the \(i\)-th input data vector, \(\mathcal{\hat{Y}}_i \in \mathbb{R}\) is the corresponding output, \(\mathcal{Y}_i \in \mathbb{R}\) is the output label corresponding to the input \(\mathcal{X}_i\), \(L\) is the number of hidden neurons, \(f(z)=az^2+bz+c\) is the polynomial activation function, and \(a \neq 0\), \(b\), and \(c\) are pre-defined constant coefficients. The notation \(w_{kj}\) represents the weight from the \(k\)-th input neuron to the \(j\)-th hidden neuron, and \(\rho_{j}\) represents the weight from the \(j\)-th hidden neuron to the output. The input-output mapping is

Figure 1: Two-layer QNN with one output

\[\mathcal{\hat{Y}}_i=\sum_{j=1}^{L} f(\mathcal{X}_i^TW_j)\rho_{j}\] where \(W_j = \begin{bmatrix} w_{1j} & w_{2j} & \ldots & w_{nj} \end{bmatrix}^T\).
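As an illustration, a minimal NumPy sketch of this forward pass is given below; the weight array W (of shape \(L \times n\), one row per hidden neuron) and the vector rho are hypothetical placeholders for trained weights.

```python
import numpy as np

def qnn_forward(X_i, W, rho, a=1.0, b=0.0, c=0.0):
    """Two-layer QNN forward pass: Y_hat = sum_j f(X_i^T W_j) * rho_j,
    with polynomial activation f(z) = a*z^2 + b*z + c."""
    z = W @ X_i                              # pre-activations X_i^T W_j for all hidden neurons
    return float(np.sum((a * z**2 + b * z + c) * rho))

# example with random weights (each W_j normalized to unit norm, as required in (2))
rng = np.random.default_rng(0)
L, n = 4, 3
W = rng.standard_normal((L, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)
rho = rng.standard_normal(L)
print(qnn_forward(rng.standard_normal(n), W, rho))
```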

Reference [24] proposes the training optimization \[\begin{align} \label{eq:32changed32neural32net32optimizaion} \min_{W_k,\rho_k} \: \: & l(\mathcal{\hat{Y}}-\mathcal{Y}) + \beta \sum_{j=1}^{L} | \rho_j |\\ s.t. \: \: & \mathcal{\hat{Y}}_i=\sum_{j=1}^{L} f(\mathcal{X}_i^TW_j)\rho_{j}\\ & \Vert W_k \Vert_2 =1, \: \: \: \: k=1,2,...,L, \: \: \: \:i=1,2,...,N \end{align}\tag{2}\]

where \(\beta\ge 0\) is a pre-defined regularization parameter, \(l(.)\) is a convex loss function, \(N\) is the number of data points, \(\mathcal{\hat{Y}} = \begin{bmatrix} \mathcal{\hat{Y}}_1 & \mathcal{\hat{Y}}_2 & \ldots & \mathcal{\hat{Y}}_N \end{bmatrix}^T\), and \(\mathcal{Y} = \begin{bmatrix} \mathcal{Y}_1 & \mathcal{Y}_2 & \ldots & \mathcal{Y}_N \end{bmatrix}^T\).

The optimization problem 2 can be equivalently solved by the dual convex optimization [24] \[\begin{align} \label{convex95formula} \min_{Z^+ , Z^-} \: \: & l(\mathcal{\hat{Y}} - \mathcal{Y}) + \beta \: Trace(Z_{1}^+ + Z_{1}^-) \\ s.t. \: \: & \mathcal{\hat{Y}}_{i}=a\mathcal{X}_{i}^T(Z_{1}^+ - Z_{1}^-)\mathcal{X}_{i} + b\mathcal{X}_{i}^T(Z_{2}^+ - Z_{2}^-) + \\ & \: \: \: \: \: \: \: \: \: \: c \: Trace(Z_{1}^+ - Z_{1}^-), \\ & Z^+=\begin{bmatrix} Z_{1}^+ & Z_{2}^+ \\(Z_{2}^+)^T &Trace(Z_{1}^+) \end{bmatrix} \geq 0 , \\ & Z^-=\begin{bmatrix} Z_{1}^- & Z_{2}^- \\(Z_{2}^-)^T &Trace(Z_{1}^-) \end{bmatrix} \geq 0, \\ & i=1, 2, \ldots N \end{align}\tag{3}\] where \(Z_1^+\), \(Z_2^+\), \(Z_1^-\), \(Z_2^-\) are optimization parameters [24]. After training the neural network and obtaining \(Z^+\), \(Z^-\) from 3 , the quadratic input-output mapping is [24], [25]

\[\label{quadratic95mapping95QNN} \mathcal{\hat{Y}}_i=\begin{bmatrix} \mathcal{X}_i \\ 1 \end{bmatrix}^T H \begin{bmatrix} \mathcal{X}_i \\ 1 \end{bmatrix}\tag{4}\] where \[\begin{align} H=\begin{bmatrix} a(Z_{1}^+ - Z_{1}^-) & 0.5b(Z_{2}^+ - Z_{2}^-) \\0.5b(Z_{2}^+ - Z_{2}^-)^T & c \left [ Trace \left (Z_{1}^+ - Z_{1}^- \right ) \right ] \end{bmatrix} \end{align}\]

Remark 2. If \(b=c=0,\) and \(a =1\) then \[\mathcal{\hat{Y}}_i= \mathcal{X}^T_i (Z_{1}^+ - Z_{1}^-) \mathcal{X}_i\]
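For the case used later in this paper (Remark 2: \(a=1\), \(b=c=0\)) and with a squared loss chosen for \(l(.)\), a minimal CVXPY sketch of this training step is given below. With \(b=c=0\) the blocks \(Z_2^+\), \(Z_2^-\) do not affect the prediction, so the LMI constraints reduce to \(Z_1^+ \succeq 0\) and \(Z_1^- \succeq 0\) in this sketch; the function name and solver defaults are illustrative assumptions, not part of reference [24].

```python
import cvxpy as cp
import numpy as np

def train_qnn_quadratic(X, Y, beta=1e-3):
    """Simplified version of the convex program (3) with a=1, b=c=0:
    fit Y_i ~ X_i^T (Z1p - Z1m) X_i with a trace penalty on Z1p + Z1m."""
    N, n = X.shape
    Z1p = cp.Variable((n, n), symmetric=True)    # Z_1^+
    Z1m = cp.Variable((n, n), symmetric=True)    # Z_1^-
    H = Z1p - Z1m                                # quadratic input-output mapping (Remark 2)
    preds = cp.hstack([cp.quad_form(X[i], H) for i in range(N)])
    objective = cp.sum_squares(preds - Y) + beta * cp.trace(Z1p + Z1m)
    prob = cp.Problem(cp.Minimize(objective), [Z1p >> 0, Z1m >> 0])
    prob.solve()
    return H.value                               # the matrix H in (4) for b=c=0
```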

4 Policy iteration-based Q-learning

This section presents the Q-learning algorithm used to calculate the optimal policy \(\pi^* (.)\). The Q-function [33] is \[Q^{\pi}(x_k,u_k) = c(x_k,u_k) + \gamma V^{\pi}(x_{k+1}) \label{eqn:32Q-function95def}\tag{5}\]

Using Bellman’s principle of optimality [15] and the definition of the Q-function, equation 5 can be rewritten as \[\begin{gather} Q^{\pi}(x_k,u_k) = c(x_k,u_k) + \gamma Q^{\pi}(x_{k+1},\pi(x_{k+1})) \label{eqn:32Bellman-eqn} \end{gather}\tag{6}\] Obtaining the Q-function from equation 6 is known as the policy evaluation step [21]. After the policy evaluation step, the improved stabilizing policy \(\pi^\prime (.)\) associated with \(Q^\pi\) is given by \[\begin{gather} \pi^\prime (x_k) = \arg \min_{u_k} Q^{\pi}(x_k,u_k) \label{eq:32policy32improvement32vanila} \end{gather}\tag{7}\] where \(V^{\pi^\prime}(x_k) < V^{\pi}(x_k)\) for all \(x_k\) [23]. Solving 7 is known as the policy improvement step. The optimal policy \(\pi^* (.)\) can be attained from any initial stabilizing policy \(\pi_0(.)\) by repeatedly performing the policy evaluation and policy improvement steps until convergence. Algorithm 2 is the PI-based Q-learning algorithm.

Figure 2: Policy iteration-based Q-learning

Solving equation 6 directly for \(Q^\pi\) can be challenging since, for a fixed policy, it is a Lyapunov-type equation [15]. However, exploiting its fixed-point nature, we can obtain \(Q^\pi(.)\) for the stabilizing policy \(\pi(.)\) as \[\label{eq:32iterative32Bellman32van} Q^{\pi}_{i+1}(x_{k},u_k) = c(x_k,u_k) + \gamma Q^{\pi}_{i}(x_{k+1},\pi(x_{k+1}))\tag{8}\] Under certain conditions that will be detailed later, starting with any initial value \(Q^\pi_0\) and using the iteration (8) will lead to \(Q^\pi_i \rightarrow Q^\pi\). Consequently, equation 6 will be replaced by the iterative equation 8.
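As a sketch (all names hypothetical), one sweep of the iteration 8 amounts to computing temporal-difference targets from recorded transitions and then fitting the next approximation \(Q^\pi_{i+1}\) to those targets with the chosen approximator; the concrete quadratic parametrization used in this paper follows in Section 5.

```python
def bellman_targets(transitions, q_i, policy, cost, gamma):
    """One sweep of iteration (8): for each recorded transition (x_k, u_k, x_{k+1}),
    compute the target value that Q_{i+1} should match at (x_k, u_k)."""
    targets = []
    for x_k, u_k, x_next in transitions:
        u_next = policy(x_next)                               # u_{k+1} = pi(x_{k+1})
        targets.append(cost(x_k, u_k) + gamma * q_i(x_next, u_next))
    return targets
```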

Remark 3. PI algorithms require persistent excitation (PE) [[15]][29]. To achieve PE, a probing noise term \(n_k\) can be added to the input \(u_k\). It is shown in reference [29] that the solution computed by PI differs from the actual value corresponding to the Bellman equation when the probing noise term \(n_k\) is added. It is discussed in the same reference that adding the discount factor \(\gamma\) reduces this harmful effect of \(n_k\).

5 The proposed approach

This section presents how to perform policy evaluation and policy improvement steps by designing a two-layer QNN. According to reference [28], the Q-function is a quadratic form \[\begin{gather} \label{Q95function95linear95systems} Q^{\pi}(x_k,u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H^\pi \begin{bmatrix} x_k \\ u_k \end{bmatrix} \end{gather}\tag{9}\] where \(H^\pi \in \mathbb{R}^{(n_x + n_u) \times (n_x + n_u)}\). As a result, a QNN with coefficients \(b=c=0\), \(a=1\) is the perfect candidate to approximate the Q-function. The policy evaluation step then obtains \(H^\pi\) for the stabilizing policy \(\pi(.)\) by solving, \[\begin{gather} \label{eq:32policy95evaluation95vanilla95LQR950} \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T {H}^\pi \begin{bmatrix} x_k \\ u_k \end{bmatrix} = x_k^T Q x_k + u_k ^T R u_k + \\ \gamma \begin{bmatrix} x_{k+1} \\ u_{k+1} \end{bmatrix}^T {H}^\pi \begin{bmatrix} x_{k+1} \\ u_{k+1} \end{bmatrix} \end{gather}\tag{10}\]

Remark 4. Note that 11 is a scalar equation, \(H^\pi\) is symmetric, and \[\begin{gather} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \in \mathbb{R}^{n_x + n_u} \end{gather}\] Therefore, the matrix \(H^\pi\) has \(M = \frac{ (n_x + n_u) (n_x + n_u + 1)}{2}\) unknown independent elements and \(N \geq M\) data samples are needed to obtain \(H^\pi\) from equation 11 .

We propose to obtain \(H^\pi\) in the policy evaluation step by employing the iterative equation 11 . The proof of convergence is presented in Lemma 1.

Lemma 1. Assume that the unknown system (1) is controllable. Then the solution of the iterative equation 11, starting from any initial value \(H^\pi_0\), converges to \(H^\pi\) provided the policy \(\pi(.)\) is stabilizing. \[\begin{gather} \label{eq:32policy95evaluation95vanilla95LQR} \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T {H}^\pi_{i+1} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = x_k^T Q x_k + u_k^T R u_k + \\ \gamma \begin{bmatrix} x_{k+1} \\ u_{k+1} \end{bmatrix}^T {H}^\pi_i \begin{bmatrix} x_{k+1} \\ u_{k+1} \end{bmatrix} \end{gather}\tag{11}\]

Proof. Applying equation 11 recursively yields \[\begin{gather} \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T {H}_{i+1}^\pi \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \sum^{i}_{j=0} \gamma^j c(x_{k+j}, u_{k+j}) + \\ \gamma^{i+1} \begin{bmatrix} x_{k+i+1} \\ u_{k+i+1} \end{bmatrix}^T {H}_{0}^\pi \begin{bmatrix} x_{k+i+1} \\ u_{k+i+1} \end{bmatrix} \end{gather}\] It should be noted that the policy \(\pi(.)\) is a stabilizing policy, therefore \(\lim_{i\to\infty}x_{k+i+1}=0\) and \(\lim_{i\to\infty}u_{k+i+1}=0\) for all \(k\). Consequently, for any \(H^\pi_0\), \[\begin{gather} \label{eq:32one32use} \lim_{i\to\infty}\gamma^{i+1} \begin{bmatrix} x_{k+i+1} \\ u_{k+i+1}^\pi \end{bmatrix}^T {H}_{0}^\pi \begin{bmatrix} x_{k+i+1} \\ u_{k+i+1}^\pi \end{bmatrix}= 0 \end{gather}\tag{12}\]

and therefore \[\begin{align} \label{mainresult} \lim_{i\to\infty} \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T {H}_{i+1}^\pi \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \sum^{\infty}_{j=0} \gamma^j c(x_{k+j}, u_{k+j})=\nonumber\\ =c(x_k,u_k)+\gamma V^\pi(x_{k+1}) \end{align}\tag{13}\]

Therefore, from equations (5 ), (9 ), and (13 ), \(H_i^\pi\to H^\pi\) when \(i\to\infty\). ◻

Remark 5. In practice the condition \[\lVert H^\pi_i - H^\pi_{i-1} \rVert < \epsilon,\] for a small pre-defined tolerance \(\epsilon > 0\), is used as the stopping criterion of the algorithm.

As a consequence of Lemma 1, the policy evaluation step reduces to the task of computing \(H^\pi_{i+1}\) from equation 11 given \(H^\pi_{i}\). This can be done by training a two-layer QNN. Since it is assumed that one has access to the state \(x_k\), one can calculate \(\mathcal{Y}_k\) defined as \[\begin{gather} \mathcal{Y}_k = x_k^T Q x_k + u_k^T R u_k + \gamma \begin{bmatrix} x_{k+1} \\ u_{k+1} \end{bmatrix}^T {H}_{i}^\pi \begin{bmatrix} x_{k+1} \\ u_{k+1} \end{bmatrix}. \end{gather}\] Therefore, from (11), \[\mathcal{X}_k^T {H}_{i+1}^\pi \mathcal{X}_k = \mathcal{Y}_k\] where \(\mathcal{X}_k^T = \begin{bmatrix} x_k^T & u_{k}^T \end{bmatrix}\). Thus, \(H^\pi_{i+1}\) can be obtained by training a QNN, that is, by solving the convex optimization 3 with the input data points \(\mathcal{X}_k\), the corresponding labels \(\mathcal{Y}_k\), and the coefficients \(a=1, \: b=0, \: c=0\).
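A minimal sketch of one iteration of this policy evaluation step is given below, assuming recorded transitions \((x_k,u_k,x_{k+1})\), a current gain matrix \(K\) with \(u_{k+1}=-Kx_{k+1}\), and a QNN training routine such as the hypothetical train_qnn_quadratic sketched in Section 3.

```python
import numpy as np

def policy_evaluation_step(transitions, H_i, K, Q, R, gamma, train_qnn):
    """One iteration of equation (11): build inputs X_k = [x_k; u_k] and labels
    Y_k from H_i, then fit H_{i+1} by training the QNN (a = 1, b = c = 0)."""
    X, Y = [], []
    for x_k, u_k, x_next in transitions:
        u_next = -K @ x_next                               # u_{k+1} = pi(x_{k+1})
        z_next = np.concatenate([x_next, u_next])
        label = x_k @ Q @ x_k + u_k @ R @ u_k + gamma * z_next @ H_i @ z_next
        X.append(np.concatenate([x_k, u_k]))
        Y.append(label)
    return train_qnn(np.array(X), np.array(Y))             # H_{i+1}
```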

5.1 Policy improvement

In this section, the policy improvement step is addressed for the stabilizing policy \(\pi(\cdot)\) using \(H^\pi\) from the policy evaluation. We first partition \(H^\pi\) as \[\begin{gather} \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T {H}^\pi \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \\ \end{bmatrix}^T \begin{bmatrix} {H}_{xx}^\pi & {H}_{xu}^\pi \\ ({H}_{xu}^\pi)^T & {H}_{uu}^\pi \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \end{gather}\]

Lemma 2. The policy improvement for the stabilizing policy \(\pi(.)\) is given by \[\begin{gather} \pi^\prime (x_k) = - ({H}_{uu}^\pi)^{-1}({H}_{xu}^\pi)^Tx_k \label{eq:32policy32improvement} \end{gather}\tag{14}\]

Proof. The improved policy \(\pi^\prime (.)\) is obtained by \[\begin{gather} \pi^\prime (x_k) = \arg \min_{u_k} Q^{\pi}(x_k,u_k) \label{eq:32policy32improvement32vanila952} \end{gather}\tag{15}\] Since \(Q^{\pi}(x_k,u_k )\) is a quadratic form, the necessary and sufficient conditions of optimality are \[\begin{gather} \nonumber \frac{\partial Q^{\pi}(x_k,u_k )}{\partial u_k } = 0 \\ \frac{\partial^2 {Q}^{\pi}(x_k,u_k )}{\partial u_k^2} >0 \label{eq:32solve95policy95imp95by95derivative} \end{gather}\tag{16}\]

The Q-function can be written as \[\begin{gather} \nonumber Q^{\pi}(x_k , u_k ) = \begin{bmatrix} x_k \\ u_k \\ \end{bmatrix}^T \begin{bmatrix} {H}_{xx}^\pi & {H}_{xu}^\pi \\ ({H}_{xu}^\pi)^T & {H}_{uu}^\pi \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \\ x_k^T H_{xx}^\pi x_k + x_k^T H_{xu}^\pi u_k + u_k^T (H_{xu}^\pi)^T x_k + u_k^T H_{uu}^\pi u_k \label{QFA95open95form95for95b61c610} \end{gather}\tag{17}\]

Therefore, the first condition in 16 yields \[\begin{gather} \nonumber \frac{\partial Q^{\pi}(x_k,u_k )}{\partial u_k } = 0\Leftrightarrow {H}_{uu}^\pi u_k + ({H}_{xu}^\pi)^Tx_k = 0 \Leftrightarrow \\ u_k = - ({H}_{uu}^\pi)^{-1}({H}_{xu}^\pi)^Tx_k \label{dbuotzag} \end{gather}\tag{18}\]

Note that the matrix \(H^\pi\) is positive definite, since \(Q>0\), \(R>0\), and \(V^\pi(x_{k+1}) \geq 0\) imply \(Q^{\pi}(x_k,u_k)>0\) for all \((x_k,u_k) \neq 0\). Therefore, all block matrices on the main diagonal of \(H^\pi\), including \(H_{uu}^\pi\), are also positive definite. As a result, the inverse of \(H_{uu}^\pi\) exists.

The second condition in 16 is thus satisfied since \(H_{uu}^\pi >0\). Therefore, the policy obtained in 14 is the unique minimizer \(\pi^\prime (x_k)\). ◻
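A minimal sketch of this policy improvement step, assuming \(H^\pi\) is available as a matrix and \(n_x\) is the number of state components, is:

```python
import numpy as np

def policy_improvement(H_pi, n_x):
    """Extract the blocks of H^pi and return the improved gain K' such that
    pi'(x_k) = -K' x_k = -(H_uu)^{-1} (H_xu)^T x_k (equation 14)."""
    H_xu = H_pi[:n_x, n_x:]
    H_uu = H_pi[n_x:, n_x:]
    return np.linalg.solve(H_uu, H_xu.T)    # K' = (H_uu)^{-1} (H_xu)^T
```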

The complete procedure for solving the LQR is presented in Algorithm 3.

Figure 3: Solving LQR with QNN and Q-learning
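Since the algorithm itself is given in Figure 3 and is not reproduced here, the following is only a rough sketch of how its steps can be assembled from the pieces above (data collection under the current policy with probing noise, policy evaluation with the QNN, and policy improvement from Lemma 2); all routine names are the hypothetical ones introduced earlier.

```python
import numpy as np

def qnn_q_learning(collect_data, train_qnn, K0, Q, R, gamma=1.0,
                   eps=1e-4, max_pi_iters=50, max_pe_iters=200):
    """PI-based Q-learning with a QNN policy evaluator (sketch of Algorithm 3)."""
    K = K0                                                   # initial stabilizing gain
    n_x = Q.shape[0]
    for _ in range(max_pi_iters):
        transitions = collect_data(K)                        # (x_k, u_k, x_{k+1}) with probing noise
        H = np.zeros((n_x + K.shape[0], n_x + K.shape[0]))   # any initial H_0 (Lemma 1)
        for _ in range(max_pe_iters):                        # policy evaluation: iterate (11)
            H_new = policy_evaluation_step(transitions, H, K, Q, R, gamma, train_qnn)
            if np.linalg.norm(H_new - H) < eps:              # stopping criterion (Remark 5)
                H = H_new
                break
            H = H_new
        K_new = policy_improvement(H, n_x)                   # Lemma 2
        if np.linalg.norm(K_new - K) < eps:
            return K_new
        K = K_new
    return K
```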

6 Simulations

This section considers a quadrotor flying at a constant altitude and moving in the \(x_1\) direction. It is shown that the control policy converges to the LQR controller obtained by conventional model-based methods. The quadrotor is modeled by the state-space equations \[\begin{gather} \begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3\\ \dot{x}_4 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & -0.1 & 10 & 0\\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{bmatrix} + \begin{bmatrix} 0 \\ 0\\ 0\\ 4.35 \end{bmatrix}u \end{gather}\] where \(x_1\) is the horizontal position, \(x_2\) is the speed, \(x_3\) is the pitch angle, \(x_4\) is the pitch rate and the initial state is \[\begin{bmatrix} x_1(0) \\ x_2(0) \\ x_3(0) \\ x_4(0) \end{bmatrix} = \begin{bmatrix} -10 \\ 0 \\ 0 \\ 0 \end{bmatrix}\] The sampling time is \(T=0.1\) seconds. The discretized model 19 is obtained using Tustin’s method in order to compare the result with the discrete-time LQR. \[\begin{gather} \begin{bmatrix} x_{1,k+1} \\ x_{2,k+1} \\ x_{3,k+1} \\ x_{4,k+1} \end{bmatrix} = \begin{bmatrix} 1 & 0.1 & 0.05 & 0.003 \\ 0 & 0.99 & 0.99 & 0.05 \\ 0 & 0 & 1 & 0.1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{1,k } \\ x_{2,k } \\ x_{3,k } \\ x_{4,k } \end{bmatrix} + \begin{bmatrix} 0.001 \\ 0.011 \\ 0.022 \\ 0.435 \end{bmatrix}u_k \label{eq:32linear95quadrotor95tustin} \end{gather}\tag{19}\]

Rewrite the control policy \(\pi_j(.)\) as: \[\begin{gather} \pi_j(x_k)= \begin{bmatrix} k_1^{\pi_j} & k_2^{\pi_j} & k_3^{\pi_j} & k_4^{\pi_j} \end{bmatrix} \begin{bmatrix} x_{1,k} \\ x_{2,k} \\ x_{3,k} \\ x_{4,k} \end{bmatrix} \label{eq:32shand} \end{gather}\tag{20}\] In this example, we choose the following parameter values: \[\begin{gather} \nonumber R = 100,\: N=100, \: \gamma=1, \:\beta=0.005 \\ Q = \begin{bmatrix} 0.01 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 10 \end{bmatrix} \end{gather}\]
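For reference (this step is not part of the data-driven algorithm), the discrete-time model 19 and the benchmark LQR gain reported next can be reproduced with standard tools; a sketch using SciPy is shown below.

```python
import numpy as np
from scipy.signal import cont2discrete
from scipy.linalg import solve_discrete_are

# continuous-time quadrotor model and sampling time
A = np.array([[0, 1, 0, 0], [0, -0.1, 10, 0], [0, 0, 0, 1], [0, 0, 0, 0]])
B = np.array([[0], [0], [0], [4.35]])
T = 0.1

# Tustin (bilinear) discretization (cf. the discrete model (19))
Ad, Bd, *_ = cont2discrete((A, B, np.eye(4), np.zeros((4, 1))), T, method='bilinear')

# benchmark LQR gain from the discrete-time algebraic Riccati equation (gamma = 1)
Q = np.diag([0.01, 1, 1, 10])
R = np.array([[100.0]])
P = solve_discrete_are(Ad, Bd, Q, R)
K_star = np.linalg.solve(R + Bd.T @ P @ Bd, Bd.T @ P @ Ad)
print(K_star)   # should be close to the optimal gain reported next
```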

The optimal LQR controller is \[\begin{gather} \pi^*(x_k)=- \begin{bmatrix} 0.046 & 0.464 & 4.347 & 2.014 \end{bmatrix} \begin{bmatrix} x_{1,k} \\ x_{2,k} \\ x_{3,k} \\ x_{4,k} \end{bmatrix} \end{gather}\] We now pretend that the model of the quadrotor is unknown and run Algorithm 3. To illustrate the convergence of the learning algorithm, we conducted five simulations using the random initial stabilizing policies given in Table 1. Fig. 4 shows the convergence of \(k_1^{\pi_j}\), \(k_2^{\pi_j}\), \(k_3^{\pi_j}\), \(k_4^{\pi_j}\) to their optimal values.

Figure 4: Convergence of the policy iteration to the LQR controller.

The trajectories of the quadrotor’s position and speed over time under the optimal policy are depicted in Fig. 5, where one sees the convergence to the origin.

Figure 5: Position and speed trajectories using the optimal policy.

Table 1: The random initial stabilizing policies
Simulation number \(K^{\pi_0}\)
Simulation 1 \(\begin{pmatrix} 0.082 & 0.169 & 1.592 & 0.838 \end{pmatrix}\)
Simulation 2 \(\begin{pmatrix} 0.236 & 0.420 & 2.978 & 1.156 \end{pmatrix}\)
Simulation 3 \(\begin{pmatrix} 0.626 & 0.998 & 5.578 & 1.660 \end{pmatrix}\)
Simulation 4 \(\begin{pmatrix} 1.851 & 1.709 & 7.491 & 1.858 \end{pmatrix}\)
Simulation 5 \(\begin{pmatrix} 0.701 & 0.781 & 4.380 & 1.379 \end{pmatrix}\)

References

[1]
Arthur E Bryson and Yu-Chi Ho. Applied optimal control: optimization, estimation, and control. Routledge, 2018.
[2]
Frank L Lewis, Draguna Vrabie, and Vassilis L Syrmos. Optimal control. John Wiley & Sons, 2012.
[3]
Minggang Gan, Jingang Zhao, and Chi Zhang. Extended adaptive optimal control of linear systems with unknown dynamics using adaptive dynamic programming. Asian Journal of Control, 23(2):1097–1106, 2021.
[4]
Yu Jiang and Zhong-Ping Jiang. Global adaptive dynamic programming for continuous-time nonlinear systems. IEEE Transactions on Automatic Control, 60(11):2917–2929, 2015.
[5]
Rudolf Emil Kalman et al. Contributions to the theory of optimal control. Bol. soc. mat. mexicana, 5(2):102–119, 1960.
[6]
Brian DO Anderson and John B Moore. Optimal control: linear quadratic methods. Courier Corporation, 2007.
[7]
Vasile Sima. Algorithms for linear-quadratic optimization. CRC Press, 2021.
[8]
David Kleinman. On an iterative technique for Riccati equation computations. IEEE Transactions on Automatic Control, 13(1):114–115, 1968.
[9]
Dharmesh Tailor and Dario Izzo. Learning the optimal state-feedback via supervised imitation learning. Astrodynamics, 3:361–374, 2019.
[10]
Joshua L Proctor, Steven L Brunton, and J Nathan Kutz. Dynamic mode decomposition with control. SIAM Journal on Applied Dynamical Systems, 15(1):142–161, 2016.
[11]
Joschka Boedecker, Jost Tobias Springenberg, Jan Wülfing, and Martin Riedmiller. Approximate real-time optimal control based on sparse Gaussian process models. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1–8. IEEE, 2014.
[12]
Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
[13]
Dimitri Bertsekas. Reinforcement learning and optimal control. Athena Scientific, 2019.
[14]
Bahare Kiumarsi, Kyriakos G Vamvoudakis, Hamidreza Modares, and Frank L Lewis. Optimal and autonomous control using reinforcement learning: A survey. IEEE transactions on neural networks and learning systems, 29(6):2042–2062, 2017.
[15]
Frank L Lewis and Draguna Vrabie. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE circuits and systems magazine, 9(3):32–50, 2009.
[16]
Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep rl for model-based control. arXiv preprint arXiv:1802.09081, 2018.
[17]
Russ Tedrake. Underactuated robotics: Algorithms for walking, running, swimming, flying, and manipulation. Course Notes for MIT, 6, 2016.
[18]
Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 4. Athena scientific, 2012.
[19]
Danil V Prokhorov and Donald C Wunsch. Adaptive critic designs. IEEE transactions on Neural Networks, 8(5):997–1007, 1997.
[20]
Bo Pang and Zhong-Ping Jiang. Robust reinforcement learning: A case study in linear quadratic regulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9303–9311, 2021.
[21]
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
[22]
Said G Khan, Guido Herrmann, Frank L Lewis, Tony Pipe, and Chris Melhuish. Reinforcement learning and optimal adaptive control: An overview and implementation examples. Annual reviews in control, 36(1):42–59, 2012.
[23]
Dongbin Zhao, Zhongpu Xia, and Ding Wang. Model-free optimal control for affine nonlinear systems with convergence analysis. IEEE Transactions on Automation Science and Engineering, 12(4):1461–1468, 2014.
[24]
Burak Bartan and Mert Pilanci. Neural spectrahedra and semidefinite lifts: Global convex optimization of polynomial activation neural networks in fully polynomial-time. arXiv preprint arXiv:2101.02429, 2021.
[25]
Luis Rodrigues and Sidney Givigi. System identification and control using quadratic neural networks. IEEE Control Systems Letters, 7:2209–2214, 2023.
[26]
Nikolai Matni, Alexandre Proutiere, Anders Rantzer, and Stephen Tu. From self-tuning regulators to reinforcement learning and back again. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3724–3740. IEEE, 2019.
[27]
Farnaz Adib Yaghmaie, Fredrik Gustafsson, and Lennart Ljung. Linear quadratic control using model-free reinforcement learning. IEEE Transactions on Automatic Control, 2022.
[28]
Steven J Bradtke, B Erik Ydstie, and Andrew G Barto. Adaptive linear quadratic control using policy iteration. In Proceedings of 1994 American Control Conference-ACC’94, volume 3, pages 3475–3479. IEEE, 1994.
[29]
Frank L Lewis and Kyriakos G Vamvoudakis. Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(1):14–25, 2010.
[30]
Syed Ali Asad Rizvi and Zongli Lin. Experience replay–based output feedback Q-learning scheme for optimal output tracking control of discrete-time linear systems. International Journal of Adaptive Control and Signal Processing, 33(12):1825–1842, 2019.
[31]
Bahare Kiumarsi, Frank L Lewis, Hamidreza Modares, Ali Karimpour, and Mohammad-Bagher Naghibi-Sistani. Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica, 50(4):1167–1175, 2014.
[32]
Bahare Kiumarsi, Frank L Lewis, Mohammad-Bagher Naghibi-Sistani, and Ali Karimpour. Optimal tracking control of unknown discrete-time linear systems using input-output measured data. IEEE transactions on cybernetics, 45(12):2770–2779, 2015.
[33]
Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.