January 06, 2025
The paper derives analytical expressions for the asymptotic average updating direction of the adaptive moment estimation (ADAM) algorithm when applied to recursive identification of nonlinear systems. It is proved that the standard hyper-parameter setting results in the same asymptotic average updating direction as a diagonally power normalized stochastic gradient algorithm. With the internal filtering turned off, the asymptotic average updating direction is instead equivalent to that of a sign-sign stochastic gradient algorithm. Global convergence to an invariant set follows, where a subset of the parameters contains those that give a correct input-output description of the system. The paper also exploits a nonlinear dynamic model to embed structure in recurrent neural networks. A Monte-Carlo simulation study validates the results.
The adaptive moment estimation algorithm (ADAM) [1] has become a workhorse in machine learning. The majority of the many successful applications use batch processing, sometimes with a large complexity and cost. In such cases recursiveness over time can reduce the complexity to be linear in the amount of data. However, properties like convergence then need to be analysed with averaging, see [2] and [3]. The paper therefore studies the asymptotic updating direction and convergence properties of a recursive variant of ADAM.
From a system identification point of view, the neural network, e.g. [4], provides a parametrization that complements classical nonlinear model structures like piecewise linear static models [5], block oriented models [6], [7], the NARMAX class of models [8], as well as general state space models [9], [10]. Some of the modern sequential Monte-Carlo (SMC) algorithms, like the bootstrap particle filter, are also recursive [11]. The machine learning field has instead favoured general models like recurrent neural networks (RNNs) when applying so called supervised learning [4]. The advantage is generality; however, the canonical model structures used in the system identification field minimize the number of parameters to avoid ambiguity and to maximize the accuracy, see [12]. Performance improvements can therefore be expected when structure is embedded into RNN models.
Convergence analysis of recursive identification algorithms was pioneered in [2], [3] and [13]. These publications state how global stability of an averaged ordinary differential equation (ODE) associated with the algorithm is related to global convergence of the algorithm. The asymptotic paths of the recursive algorithm are also proved to converge to the solutions of the associated ODE, which enables the study of the average updating direction of ADAM performed in the paper. Recently, related results have been used to prove convergence of ADAM to a minimizer of the criterion function, see e.g. [14], [15], [16]. No study focused on the average updating direction seems to have appeared.
The paper suggests a dynamic model that combines a discrete delay line chain and a static nonlinear function. When this nonlinear function equals a neural network, the result is an RNN with embedded structure, which is believed to be novel and constitutes the first contribution. The second contribution assumes a typical hyper-parameter setting of ADAM and then proves that the asymptotic average updating direction coincides with that of a diagonally power normalized stochastic gradient algorithm. Thirdly, with the internal filtering turned off, the asymptotic average updating direction is proved to correspond to that of a sign-sign stochastic gradient algorithm. The fourth contribution proceeds to prove that the algorithm converges globally to an invariant set, with a subset of parameters that give a correct input-output description of the system. The final Monte-Carlo simulation study verifies the theoretical results and indicates the validity of the structured RNN.
The paper is organized as follows. Section II presents the structured RNN and the recursive version of ADAM. The convergence analysis appears in Sections III and IV. Experiments and conclusions follow in Sections V and VI.
To begin, a canonical nonlinear dynamic model is proposed to embed structure in RNNs. The model signals are the input signal vector \(\mathbf{u}(t)\) consisting of \(K\) vector signals, and the \(n\)-dimensional state vector \(\hat{\mathbf{x}}(t, \boldsymbol{\theta})\), given by \[\mathbf{u}(t)=\left( \mathbf{u}_1^T(t) \;\;... \;\;\mathbf{u}_K^T(t) \right)^T, \label{eq:eq2}\tag{1}\] \[\mathbf{u}_k(t)=\left(u_k(t) \;\;... \;\; u_k^{(n_k)}(t) \right)^T, \;k=1,...,K, \label{eq:eq3}\tag{2}\] \[\hat{\mathbf{x}}(t, \boldsymbol{\theta}) = \left( \hat{x}_1(t,\boldsymbol{\theta}) \;\;... \;\;\hat{x}_n(t, \boldsymbol{\theta}) \right)^T. \label{eq:eq4}\tag{3}\] The superscript \(^{(i)}\) denotes differentiation with respect to time \(t\), \(i\) times, to handle potential zero dynamics, and \(\boldsymbol{\theta}\) denotes the unknown parameter vector with dimension \(d\). The ODE underpinning the model then follows as \[\left( \begin{array}{c} \dot{\hat{x}}_1(t,\boldsymbol{\theta}) \\ \vdots \\ \dot{\hat{x}}_{n-1}(t,\boldsymbol{\theta}) \\ \dot{\hat{x}}_n(t,\boldsymbol{\theta}) \end{array} \right) = \left( \begin{array}{c} \hat{x}_2(t,\boldsymbol{\theta}) \\ \vdots \\ \hat{x}_{n}(t,\boldsymbol{\theta}) \\ f\left(\hat{\mathbf{x}}(t,\boldsymbol{\theta}),\mathbf{u}(t), \boldsymbol{\theta} \right) \end{array} \right). \label{eq:eq1}\tag{4}\] Here \(f\left(\hat{\mathbf{x}}(t,\boldsymbol{\theta}),\mathbf{u}(t), \boldsymbol{\theta} \right)\) parameterizes the \(n\)th component of the ODE. When it is selected as a neural network \(f^{nn}\left(\hat{\mathbf{x}}(t,\boldsymbol{\theta}),\mathbf{u}(t), \boldsymbol{\theta} \right)\), an RNN results.
Using the Euler method to discretize (4 ) with sampling period \(T_s\) gives \[\begin{align} \hat{\mathbf{x}}(t+T_s,\boldsymbol{\theta})= \left( \begin{array}{c} \hat{x}_1(t+T_s,\boldsymbol{\theta}) \\ \vdots \\ \hat{x}_{n-1}(t+T_s,\boldsymbol{\theta}) \\ \hat{x}_n(t+T_s,\boldsymbol{\theta}) \end{array} \right) \end{align}\] \[= \left( \begin{array}{c} \hat{x}_1(t,\boldsymbol{\theta}) \\ \vdots \\ \hat{x}_{n-1}(t,\boldsymbol{\theta}) \\ \hat{x}_n(t,\boldsymbol{\theta}) \end{array} \right) + T_s \left( \begin{array}{c} \hat{x}_2(t,\boldsymbol{\theta}) \\ \vdots \\ \hat{x}_{n}(t,\boldsymbol{\theta}) \\ f\left(\hat{\mathbf{x}}(t,\boldsymbol{\theta}),\mathbf{u}(t), \boldsymbol{\theta} \right) \end{array} \right). \label{eq:eq0}\tag{5}\] The \(p\)-dimensional output measurement model is \[\hat{\mathbf{y}}(t, \boldsymbol{\theta})=\mathbf{C}\hat{\mathbf{x}}(t, \boldsymbol{\theta}), \label{eq:eq5}\tag{6}\] where \(\mathbf{C}\) is the measurement matrix. In case \(n=0\), a nonlinear static model results. The parameterization of (5 ) is given by the details of \(f\left(\hat{\mathbf{x}}(t,\boldsymbol{\theta}),\mathbf{u}(t), \boldsymbol{\theta} \right)\), e.g. by the static neural network \(f^{nn}\left(\hat{\mathbf{x}}(t,\boldsymbol{\theta}),\mathbf{u}(t), \boldsymbol{\theta} \right)\) and its hyper-parameters.
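The discretized chain (5 ) is straightforward to simulate. The sketch below is a minimal illustration for \(n=2\); the scalar function and the input are illustrative stand-ins, not the neural network of the paper.

```python
import numpy as np

Ts = 0.1  # sampling period

def euler_step(x, u, theta, f):
    """One Euler step of the discretized chain (5): the first n-1
    states integrate the chain, the last is driven by f."""
    x_next = x.copy()
    x_next[:-1] += Ts * x[1:]          # x_i(t+Ts) = x_i(t) + Ts*x_{i+1}(t)
    x_next[-1] += Ts * f(x, u, theta)  # x_n(t+Ts) driven by the nonlinearity
    return x_next

# Illustrative stable stand-in for f (not the paper's f^nn):
f = lambda x, u, th: -th[0] * x[-1] - th[1] * np.tanh(x[0]) + u

x = np.zeros(2)                        # n = 2 state chain
theta = np.array([0.8, 0.4])
for k in range(100):
    x = euler_step(x, np.sin(0.1 * k * Ts), theta, f)
```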
The gradient of the model (6 ) is given by \[\boldsymbol{\psi}^{\top}(t,\boldsymbol{\theta})=\frac{\partial \hat{\mathbf{y}}(t, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbf{C} \frac{\partial \hat{\mathbf{x}}(t, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} =\mathbf{C}\boldsymbol{\Psi}(t,\boldsymbol{\theta}). \label{eq:eq6}\tag{7}\] To obtain the matrix \(\boldsymbol{\Psi}(t,\boldsymbol{\theta})\), the components of the difference equation (5 ) are differentiated with respect to \(\boldsymbol{\theta}\) to give \[\begin{align} \left( \begin{array}{c} \frac{\partial \hat{x}_1(t+T_s,\boldsymbol{\theta}) }{\partial \boldsymbol{\theta}} \\ \vdots \\ \frac{\partial \hat{x}_{n-1}(t+T_s,\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ \frac{\partial \hat{x}_n(t+T_s,\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \end{array} \right)=\boldsymbol{\Psi}(t+T_s,\boldsymbol{\theta}) \end{align}\] \[=\boldsymbol{\Psi}(t,\boldsymbol{\theta})+T_s \left( \begin{array}{c} \frac{\partial \hat{x}_2(t,\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ \vdots \\ \frac{\partial \hat{x}_{n}(t,\boldsymbol{\theta})}{\partial \boldsymbol{\theta} } \\ \frac{\partial}{\partial \boldsymbol{\theta} }f\left(\hat{\mathbf{x}}(t,\boldsymbol{\theta}),\mathbf{u}(t), \boldsymbol{\theta} \right) \end{array} \right). \label{eq:eq6a}\tag{8}\] \(\frac{\partial}{\partial \boldsymbol{\theta}}f\left(\hat{\mathbf{x}}(t,\boldsymbol{\theta}),\mathbf{u}(t), \boldsymbol{\theta} \right)\) can be computed by auto-differentiation at run-time, cf. (8 ) and \({\rm diff}_{\boldsymbol{\theta}}(\cdot)\) of (11 ).
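For concreteness, the sensitivity recursion (8 ) can be propagated alongside the state recursion. The sketch below assumes an illustrative linear-in-parameters \(f(\hat{\mathbf{x}},u,\boldsymbol{\theta})=\theta_1\hat{x}_1+\theta_2\hat{x}_2+u\), for which the total derivative \((\partial f/\partial \hat{\mathbf{x}})\boldsymbol{\Psi}+\partial f/\partial \boldsymbol{\theta}\) is available in closed form; in the general case this term is produced by auto-differentiation.

```python
import numpy as np

Ts = 0.1
theta = np.array([-0.4, -0.8])  # illustrative f = theta[0]*x1 + theta[1]*x2 + u

def model_step(x, u):
    # Euler step of (5) for the illustrative f
    return x + Ts * np.array([x[1], theta @ x + u])

def sensitivity_step(Psi, x, u):
    """One step of (8): the first rows shift like the state chain,
    the last row receives the total derivative of f w.r.t. theta."""
    Psi_next = Psi.copy()
    Psi_next[:-1] += Ts * Psi[1:]
    Psi_next[-1] += Ts * (theta @ Psi + x)  # (df/dx) Psi + df/dtheta
    return Psi_next

x, Psi = np.zeros(2), np.zeros((2, 2))
for k in range(200):
    u = np.sin(0.05 * k)
    # both updates use the time-t quantities (tuple RHS is evaluated first)
    x, Psi = model_step(x, u), sensitivity_step(Psi, x, u)
```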
When ADAM is applied to (5 )-(8 ), there is a significant difference as compared to other recursive identification algorithms, since ADAM does not process the gradient of the model separately. Instead the full gradient of the criterion \[\begin{align} V_A(\boldsymbol{\theta})=\frac{1}{2} \lim_{t\rightarrow \infty}E[\boldsymbol{\varepsilon}^{\top}(t,\boldsymbol{\theta})\boldsymbol{\varepsilon}(t,\boldsymbol{\theta})] \end{align}\] \[= \frac{1}{2} \lim_{t\rightarrow \infty}E[\left( \mathbf{y}(t)-\hat{\mathbf{y}}(t,\boldsymbol{\theta}) \right)^{\top}\left( \mathbf{y}(t)-\hat{\mathbf{y}}(t,\boldsymbol{\theta}) \right)] \label{eq:eq7}\tag{9}\] is processed, where \(\mathbf{y}(t)\) is the measurement. The gradient that is approximated by ADAM is hence \[\mathbf{g}(t, \boldsymbol{\theta})=\left( \frac{\partial V_A(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right)^{\top}=- \lim_{t \rightarrow \infty}E[\boldsymbol{\psi}(t,\boldsymbol{\theta})\boldsymbol{\varepsilon}(t, \boldsymbol{\theta})]. \label{eq:eq9}\tag{10}\] Hence, when ADAM estimates approximate second order properties, this is done for \(\boldsymbol{\psi}(t,\boldsymbol{\theta})\boldsymbol{\varepsilon}(t, \boldsymbol{\theta})\) rather than for \(\boldsymbol{\psi}(t,\boldsymbol{\theta})\), as would be the case for a Gauss-Newton based recursive identification algorithm, cf. [12]. This has significant consequences that will be discussed in Section IV.
After reordering the equations of Algorithm 1 of [1] to coincide with [17] to facilitate the convergence analysis, and after replacing model signals with running estimates, the recursive algorithm becomes \[\begin{array}{rcl} \boldsymbol{\varepsilon}(t) & = & \mathbf{y}(t)-\hat{\mathbf{y}}(t) \\ \mathbf{m}(t) & = & \beta_1\mathbf{m}(t-T_s) \\ & & +(1-\beta_1)(-\boldsymbol{\psi}(t)\boldsymbol{\varepsilon}(t)) \\ \hat{\mathbf{m}}(t) & = & \frac{\mathbf{m}(t)}{1-\beta_1^{{\rm round}(t/T_s)}} \\ \mathbf{v}(t) & = & \beta_2\mathbf{v}(t-T_s) +(1-\beta_2)\\ & & \times ({\rm \boldsymbol{vec}}({\rm \boldsymbol{diag}}(\boldsymbol{\psi}(t)\boldsymbol{\varepsilon}(t)\boldsymbol{\varepsilon}^{\top}(t)\boldsymbol{\psi}^{\top}(t)))) \\ \hat{\mathbf{v}}(t) & = & \frac{\mathbf{v}(t)}{1-\beta_2^{{\rm round}(t/T_s)}} \\ \hat{\boldsymbol{\theta}}(t) & = & \hat{\boldsymbol{\theta}}(t-T_s)-\alpha(t) \left( \hat{\mathbf{v}}^{ \cdot \frac{1}{2}}(t)+\epsilon \right)^{ \cdot -1} \cdot \hat{\mathbf{m}}(t) \\ \hat{\mathbf{x}}(t+T_s) & = & \left( \begin{array}{c} \hat{x}_1(t) \\ \vdots \\ \hat{x}_{n-1}(t) \\ \hat{x}_n(t) \end{array} \right) \\ & & + T_s \left( \begin{array}{c} \hat{x}_2(t) \\ \vdots \\ \hat{x}_{n}(t) \\ f\left(\hat{\mathbf{x}}(t),\mathbf{u}(t), \hat{\boldsymbol{\theta}}(t) \right) \end{array} \right) \\ \hat{\mathbf{y}}(t+T_s) & = & \mathbf{C}\hat{\mathbf{x}}(t+T_s) \\ \boldsymbol{\Psi}(t+T_s) & = & \boldsymbol{\Psi}(t) \\ & & +T_s \left( \begin{array}{c} \frac{\partial \hat{x}_2(t,\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ \vdots \\ \frac{\partial \hat{x}_{n}(t,\boldsymbol{\theta})}{\partial \boldsymbol{\theta} } \\ {\rm diff}_{\boldsymbol{\theta}} \left(f\left( \hat{\mathbf{x}}(t),\mathbf{u}(t), \hat{\boldsymbol{\theta}}(t) \right) \right) \end{array} \right) \\ \boldsymbol{\psi}(t+T_s) & = & \left(\mathbf{C}\boldsymbol{\Psi}(t+T_s)\right)^{\top}. 
\end{array} \label{eq:eq10}\tag{11}\] To explain (11 ) and its notation, it is first noted that the element-wise multiplication \(\mathbf{g}(t)\odot \mathbf{g}(t)\) of [1] may be re-written as a vectorizing operation on the matrix \({\rm \boldsymbol{diag}}(\boldsymbol{\psi}(t)\boldsymbol{\varepsilon}(t)\boldsymbol{\varepsilon}^{\top}(t)\boldsymbol{\psi}^{\top}(t))\), where \({\rm \boldsymbol{diag}}(\cdot)\) extracts the diagonal matrix from a matrix and \({\rm \boldsymbol{vec}(\cdot)}\) creates a vector of the diagonal elements. The additional element-wise operations of ADAM use a notation with a dot \((\cdot)\) before the mathematical operation. When the operation is implicit like for multiplication, a dot means element-wise operation.
The quantities of the algorithm include \(\mathbf{m}(t)\), which denotes the first (order) moment, and \({\hat{\mathbf{m}}}(t)\), which denotes the bias corrected first moment, while \(\mathbf{v}(t)\) and \({\hat{\mathbf{v}}}(t)\) denote the second (order) moment and its bias corrected counterpart, used to approximate a Newton search. \(\beta_1\) and \(\beta_2\) are filtering hyper-parameters with standard values \(0.9\) and \(0.999\), and \(\alpha(t)\propto t^{-1}\) is the gain sequence that replaces the constant step size \(\alpha\) in the convergence analysis.
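For a single-output model (\(p=1\)), one recursion of (11 ) reduces to the element-wise update sketched below; the final data-fitting loop is an illustrative linear-in-parameters example, with the regressor playing the role of \(\boldsymbol{\psi}(t)\).

```python
import numpy as np

def adam_step(theta, m, v, psi, eps_err, k,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One recursion of (11) for p = 1: psi is the d-dimensional model
    gradient, eps_err the scalar prediction error, k = round(t/Ts) >= 1."""
    g = -psi * eps_err                    # instantaneous criterion gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g   # vec(diag(psi eps eps' psi')) for p = 1
    m_hat = m / (1 - beta1 ** k)          # bias corrections
    v_hat = v / (1 - beta2 ** k)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

rng = np.random.default_rng(0)
theta_true = np.array([0.7, -0.3])        # illustrative linear model
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for k in range(1, 5001):
    phi = rng.normal(size=2)              # plays the role of psi(t)
    eps_err = phi @ theta_true - phi @ theta
    theta, m, v = adam_step(theta, m, v, phi, eps_err, k, alpha=0.01)
```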
The analysis follows a path similar to that of [17]. First regularity conditions are defined, so that the averaging results of [2] and [13] can be re-used. The average updating direction can then be analysed, interpreted and compared to classical gradient descent algorithms, for different hyper-parameters. Global convergence of (11 ) finally follows from the Lyapunov stability of the associated ODE of (11 ). The argument \((t,\boldsymbol{\theta})\) indicates a fixed \(\boldsymbol{\theta}\) when expectations are computed.
The averaging analysis requires regularity conditions, defined below. The conditions M1 - M4 define a model set for which exponential stability and ergodicity hold. This may seem restrictive; however, projection algorithms need to enforce asymptotic stability of simulated models and gradients in recursive identification of linear systems as well [12]. M2 restricts the scope to continuously differentiable activation functions in case a neural network is used in (5 ) and (11 ). The conditions A1, S1 and S2 imply that also the data generating system is exponentially stable. The condition G1 ensures an appropriate decay rate of the gain sequence.
A2 lists the quantities of the average updating direction. Referring to [12], average updating directions for \(\hat{\boldsymbol{\theta}}(t)\) and \(\hat{\mathbf{v}}(t)\) are needed, both with gain sequences \(\sim 1/t\) to fit the general algorithm of [13], cf. equation (A1) of [17]. To achieve this, the analysis for the first set of standard hyper-parameters needs to be restricted by \(\beta_2 \rightarrow 1\). When the bandwidth of the unity gain autoregressive filtering then tends to \(0\), \(\mathbf{v}(t)\) and consequently \(\hat{\mathbf{v}}(t)\) approach the expected value of the input to the autoregressive filter, and \[\begin{align} \lim_{\beta_2 \rightarrow 1} \lim_{t \rightarrow \infty} \hat{\mathbf{v}}(t,\boldsymbol{\theta})=\lim_{\beta_2 \rightarrow 1} \lim_{t \rightarrow \infty} \mathbf{v}(t,\boldsymbol{\theta}) \end{align}\] \[\begin{align} =\lim_{t \rightarrow \infty} E \left[ {\rm \boldsymbol{vec}}({\rm \boldsymbol{diag}}(\boldsymbol{\psi}(t,\boldsymbol{\theta})\boldsymbol{\varepsilon}(t,\boldsymbol{\theta})\boldsymbol{\varepsilon}^{\top}(t,\boldsymbol{\theta}) \boldsymbol{\psi}^{\top}(t,\boldsymbol{\theta}))) \right] \end{align}\] \[\label{eq:eq200} =\lim_{t \rightarrow \infty} \frac{T_s}{t}\sum_{k=1}^{\frac{t}{T_s}} {\rm \boldsymbol{vec}}({\rm \boldsymbol{diag}}(\boldsymbol{\psi}(kT_s)\boldsymbol{\varepsilon}(kT_s)\boldsymbol{\varepsilon}^{\top}(kT_s) \boldsymbol{\psi}^{\top}(kT_s))).\tag{12}\] The first equality follows since \(\left(1-\beta_2^{{\rm round}(t/T_s)} \right)^{-1}\) \(\rightarrow 1\) as \(t \rightarrow \infty\). The equality with the sample average follows from ergodicity.
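The sample average in (12 ) can be accumulated recursively with gain \(T_s/t = 1/k\), which is a standard running-mean recursion; a minimal numerical sketch, with the vec-diag summand replaced by an illustrative scalar stand-in:

```python
import numpy as np

# Running-mean recursion with gain Ts/t = 1/k, cf. the sample
# average in (12). The illustrative stand-in s has mean 2.0.
Ts = 0.1
rng = np.random.default_rng(0)
v = 0.0
for k in range(1, 10001):
    t = k * Ts
    s = rng.normal(loc=2.0)   # stand-in for vec(diag(psi eps eps' psi'))
    v = v + (Ts / t) * (s - v)
# v now estimates E[s] = 2.0
```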
The equation underpinning the sample average of (12 ) can then be written recursively as \[\begin{align} \mathbf{v}(t)=\mathbf{v}(t-T_s) \end{align}\] \[\label{eq:eq201} +\frac{T_s}{t}\left( {\rm \boldsymbol{vec}}({\rm \boldsymbol{diag}}(\boldsymbol{\psi}(t)\boldsymbol{\varepsilon}(t)\boldsymbol{\varepsilon}^{\top}(t) \boldsymbol{\psi}^{\top}(t)))-\mathbf{v}(t-T_s) \right).\tag{13}\] The updating structure of (11 ) and (13 ) then appears in A2:
The system and model are single output, i.e. \(p=1\).
The model set \(\mathcal{D}_\mathcal{M}\) is a compact subset of \(\mathcal{R}^{d+d}\), such that \(\left( \boldsymbol{\theta}^{\top} \;\;\mathbf{v}^{\top} \right)^{\top} \in \mathcal{D}_\mathcal{M}\) implies continuously differentiable, exponentially stable and bounded state dynamics, state gradient dynamics, and derivatives.
\(\left( \boldsymbol{\theta}^{\top} \;\;\mathbf{v}^{\top} \right)^{\top} \in \mathcal{D}_\mathcal{M}\) implies that \(\mathbf{v}(t)>\delta_\mathbf{v}\mathbf{1}\), \(\delta_\mathbf{v}>0\).
\(\mathbf{u}(t)=\left( u_1(t) \;\;... \;\;u_K(t) \right)^{\top}\), without time derivatives, is generated from i.i.d. bounded random vectors \(\{\bar{\mathbf{u}}(t) \}\), by asymptotically stable linear filtering.
\(\lim_{t \rightarrow \infty} t \alpha(t)= \bar{\alpha}\), \(0<\bar{\alpha}<\infty\).
The data \(\{ \mathbf{z}(t) \} = \{ \left( y(t) \;\;\mathbf{u}^{\top}(t) \right)^{\top} \}\) is strictly stationary, ergodic and \(\| \mathbf{z}(t) \| \leq C<\infty\), w.p.1, \(\forall t\).
The following limits exist for \(\left( \boldsymbol{\theta}^{\top} \;\;\mathbf{v}^{\top} \right)^{\top} \in \mathcal{D}_\mathcal{M}\) when \(\beta_2 \rightarrow 1\): \[\begin{align} \mathbf{f}(\boldsymbol{\theta}, \mathbf{v})= \lim_{t \rightarrow \infty} E\left[ \left( \mathbf{v}^{\cdot \frac{1}{2}} +\epsilon\mathbf{I} \right)^{\cdot -1} \cdot \hat{\mathbf{m}}(t,\boldsymbol{\theta}) \right], \end{align}\] \[\begin{align} \mathbf{G}(\boldsymbol{\theta})=\lim_{t \rightarrow \infty} E \left[ {\rm \boldsymbol{vec}}({\rm \boldsymbol{diag}}(\boldsymbol{\varepsilon}^2(t,\boldsymbol{\theta}) \boldsymbol{\psi}(t,\boldsymbol{\theta}) \boldsymbol{\psi}^{\top}(t,\boldsymbol{\theta}))) \right]. \end{align}\]
For each \(t, s\) with \(t\geq s\), there exists a random vector \(\mathbf{z}_s^0(t)\) that belongs to the \(\sigma\)-algebra generated by \(\mathbf{z}^t\) but is independent of \(\mathbf{z}^s\) (for \(s=t\) take \(\mathbf{z}_s^0(t)=\mathbf{0}\)), such that \(E[\| \mathbf{z}(t)-\mathbf{z}_s^0(t) \|^4]<C \lambda^{t-s}, \;C<\infty, |\lambda|<1\).
The data generating system is described by \(y(t)=\mathbf{C}\mathbf{x}(t)+ w(t)\), where \(\mathbf{x}(t)\) is generated by sampling of the states of a continuously differentiable, bounded and exponentially stable ODE, and where \(w(t)\) is generated from a sequence of i.i.d. random vectors independent of \(\{ \mathbf{u}(t) \}\), by asymptotically stable filtering.
Since there is no projection algorithm defined for ADAM, the boundedness condition needs to be included as an assumption, see [2]. The boundedness condition is related to time varying exponential stability, which can be secured by a projection algorithm in combination with a limitation of the adaptation rate, [18]. The boundedness condition is given by:
The Boundedness Condition: There is a random variable \(\mathcal{C}\) and an infinite subsequence \(\{ t_k \}\), such that \(\left( \hat{\boldsymbol{\theta}}^{\top}(t_k) \;\;\mathbf{v}^{\top}(t_k) \right)^{\top} \in \bar{\mathcal{D}}_{\mathcal{M}} \subset \mathcal{D}_{\mathcal{M}} \setminus \partial \mathcal{D}_{\mathcal{M}}\) and with \(\hat{\mathbf{x}}(t_k)\), \(\boldsymbol{\Psi}(t_k)\), \(\boldsymbol{\psi}(t_k)\), \(\mathbf{x}(t_k)\), \(\mathbf{u}(t_k)\), \(w(t_k)\) bounded by \(\mathcal{C}\), \(\forall t_k\), w.p.1.
Theorem 1 now follows from [2] and [13]:
Theorem 1: Consider (11 ) and assume that M1-M4, G1, A1, A2, S1, S2 and the boundedness condition hold. Also assume that there exists a twice differentiable positive function \(V(\boldsymbol{\theta},\mathbf{v})\) such that \[\begin{align} \frac{d}{d \tau}V(\boldsymbol{\theta}_D(\tau), \mathbf{v}_D(\tau)) \leq 0, \end{align}\] for \(\left(\boldsymbol{\theta}_D^{\top}(\tau) \;\;\mathbf{v}_D^{\top}(\tau) \right)^{\top} \in \mathcal{D}_M \setminus \partial \mathcal{D}_M\) when evaluated along solutions of the associated system of ODEs \[\begin{align} \frac{d}{d\tau}\boldsymbol{\theta}_D(\tau) = - \bar{\alpha} \mathbf{f}(\boldsymbol{\theta}_D(\tau), \mathbf{v}_D(\tau)), \end{align}\] \[\begin{align} \frac{d}{d\tau}\mathbf{v}_D(\tau)= \mathbf{G}(\boldsymbol{\theta}_D(\tau)) -\mathbf{v}_D(\tau). \end{align}\] Then \[\begin{align} \left( \hat{\boldsymbol{\theta}}^{\top}(t) \;\;\mathbf{v}^{\top}(t) \right)^{\top} \rightarrow \mathcal{D}_C \end{align}\] \[\begin{align} = \left\{ \left( \boldsymbol{\theta}_D^{\top}(\tau) \;\;\mathbf{v}_D^{\top}(\tau) \right)^{\top} \in \mathcal{D}_M \setminus \partial \mathcal{D}_M \right. \end{align}\] \[\begin{align} \left. | \;\frac{d}{d \tau} V(\boldsymbol{\theta}_D(\tau),\mathbf{v}_D(\tau))=0 \right\} \end{align}\] \({\rm w.p.1 \;as} \;t \rightarrow \infty\), or \(\left( \hat{\boldsymbol{\theta}}^{\top}(t) \;\;\mathbf{v}^{\top}(t) \right)^{\top} \rightarrow \partial \mathcal{D}_M\).
Proof: The proof is omitted due to page constraints and since it parallels the corresponding proof of the downloadable open access paper [17], with minor changes. The proof of [17] is preceded by [19], supervised by the first author.
ADAM is then analysed for two hyper-parameter settings.
The first case with close to standard filtering is defined by
First it follows from (12 ) and M1 that \[\begin{align} \lim_{\beta_2 \rightarrow 1}\lim_{t\rightarrow \infty} \mathbf{v}(t,\boldsymbol{\theta}) \end{align}\] \[\label{eq:eq102} =\lim_{t \rightarrow \infty}E\left[ \varepsilon^2(t,\boldsymbol{\theta}) {\rm \boldsymbol{vec}}{\rm (\boldsymbol{diag}}(\boldsymbol{\psi}(t,\boldsymbol{\theta}) \boldsymbol{\psi}^{\top}(t,\boldsymbol{\theta}))) \right].\tag{14}\]
Then consider \(\mathbf{f}(\boldsymbol{\theta},\mathbf{v})\). An analysis of the element-wise operations of \(\left( \mathbf{v}^{\cdot \frac{1}{2}}+\epsilon \mathbf{I} \right)^{\cdot -1}\) shows that the quantity transforms as follows when moved before the expectation \[\begin{align} \lim_{\epsilon \rightarrow 0} \lim_{\beta_2 \rightarrow 1}\mathbf{f}(\boldsymbol{\theta}, \mathbf{v}) \end{align}\] \[\begin{align} = -\left( \lim_{t \rightarrow \infty}E\left[ \varepsilon^2(t,\boldsymbol{\theta}) {\rm (\boldsymbol{diag}}(\boldsymbol{\psi}(t,\boldsymbol{\theta}) \boldsymbol{\psi}^{\top}(t,\boldsymbol{\theta}))) \right] \right)^{-\frac{1}{2}} \end{align}\] \[\times \lim_{t \rightarrow \infty} E\left[ \mathbf{m}(t, \boldsymbol{\theta}) \right]. \label{eq:eq16}\tag{15}\] This follows since \(\lim_{t \rightarrow \infty} \left(1-\beta_1^{{\rm round}(t/T_s)}\right)^{-1}=1\) implies that \(\lim_{t \rightarrow \infty}\hat{\mathbf{m}}(t,\boldsymbol{\theta})\) \(=\) \(\lim_{t \rightarrow \infty}\mathbf{m}(t,\boldsymbol{\theta})\). 
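The limiting behaviour of the first-moment filtering can be checked numerically: for fixed \(\boldsymbol{\theta}\), the unity-gain autoregressive filter reproduces the mean of its input regardless of \(\beta_1\). A scalar sketch with an illustrative input mean:

```python
import numpy as np

# Steady state of the unity-gain AR filter m(t) = b1*m(t-Ts) + (1-b1)*g(t):
# the time average of m(t) matches the mean of g(t), independent of b1.
rng = np.random.default_rng(0)
beta1, g_mean = 0.9, -1.5      # g_mean is an illustrative stand-in for -E[psi eps]
m, ms = 0.0, []
for k in range(50_000):
    g = g_mean + rng.normal()
    m = beta1 * m + (1 - beta1) * g
    ms.append(m)
# the time average of ms is close to g_mean
```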
The unity gain of the autoregressive filtering of \(\mathbf{m}(t, \boldsymbol{\theta})\) in (11 ) then gives \[\begin{align} \lim_{t \rightarrow \infty}E\left[ \mathbf{m}(t,\boldsymbol{\theta})\right]=\beta_1 \lim_{t \rightarrow \infty}E\left[\mathbf{m}(t,\boldsymbol{\theta})\right] \end{align}\] \[\label{eq:eq203} +(1-\beta_1) \lim_{t \rightarrow \infty}E \left[ -\boldsymbol{\psi}(t,\boldsymbol{\theta}) \varepsilon(t, \boldsymbol{\theta}) \right].\tag{16}\] Since \(\beta_1\) of (11 ) always fulfils \(0<\beta_1<1\), it follows that \[\label{eq:eq15a} \lim_{t \rightarrow \infty}E \left[ \mathbf{m}(t,\boldsymbol{\theta}) \right]=-\lim_{t\rightarrow \infty}E \left[ \boldsymbol{\psi}(t,\boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta}) \right].\tag{17}\] When (17 ) is inserted in (15 ) the result is \[\begin{align} \lim_{\epsilon \rightarrow 0} \lim_{\beta_2 \rightarrow 1}\mathbf{f}(\boldsymbol{\theta}, \mathbf{v}) =\mathbf{f}(\boldsymbol{\theta}) \end{align}\] \[\begin{align} = -\left( \lim_{t \rightarrow \infty}E\left[ \varepsilon^2(t,\boldsymbol{\theta}) {\rm (\boldsymbol{diag}}(\boldsymbol{\psi}(t,\boldsymbol{\theta}) \boldsymbol{\psi}^{\top}(t,\boldsymbol{\theta}))) \right] \right)^{-\frac{1}{2}} \end{align}\] \[\times \lim_{t \rightarrow \infty} E\left[ \boldsymbol{\psi}(t, \boldsymbol{\theta}) \varepsilon (t, \boldsymbol{\theta}) \right], \label{eq:eq16a}\tag{18}\] where the diagonal matrix is positive definite by M3. A comparison of (18 ) with the average updating direction of a steepest descent gradient algorithm, see e.g. [12], then gives:
Theorem 2: Assume that M1-M4, A1-A3, S1, S2 and the boundedness condition hold. Then the asymptotic behaviour of the parameter update of (11 ) coincides with that of a stochastic gradient algorithm with diagonal normalization.
The second case with filtering turned off assumes
Turned-off filtering does not represent a recommended hyper-parameter setting. However, the analysis contributes to an understanding of the behaviour of ADAM for hyper-parameter settings between the turned-off and standard filtering.
In this case \(\mathbf{f}(\boldsymbol{\theta},\mathbf{v})\) is evaluated by direct simplification, without consideration of (12 ). Instead A5 replaces A2:
Application of A4 in \(\mathbf{f}(\boldsymbol{\theta},\mathbf{v})\) of A5, and using M1 gives \[\begin{align} \lim_{\epsilon \rightarrow 0} \mathbf{f}(\boldsymbol{\theta},\mathbf{v})=\mathbf{f}(\boldsymbol{\theta})=-\lim_{t \rightarrow \infty} E \left[ \right. \end{align}\] \[\begin{align} \left. \frac{ \varepsilon(t,\boldsymbol{\theta})}{\sqrt{\left( \varepsilon(t,\boldsymbol{\theta}) \right)^2}} \left( {\rm \boldsymbol{vec} ( \boldsymbol{diag}}(\boldsymbol{\psi}(t,\boldsymbol{\theta}) \boldsymbol{\psi}^{\top}(t,\boldsymbol{\theta}))) \right)^{\cdot - \frac{1}{2}} \cdot \boldsymbol{\psi}(t,\boldsymbol{\theta}) \right] \end{align}\] \[=-\lim_{t \rightarrow \infty} E \left[{\rm sign}(\varepsilon(t,\boldsymbol{\theta})) {\rm \boldsymbol{sign}}(\cdot \boldsymbol{\psi}(t,\boldsymbol{\theta})) \right]. \label{eq:eq11}\tag{19}\] The result (19 ) is summarized in:
Theorem 3: Assume that M1-M4, A1, A4, A5, S1, S2 and the boundedness condition hold. Then the asymptotic behaviour of the parameter update of (11 ) coincides with that of a stochastic gradient sign-sign algorithm.
The sign-sign behaviour is related to the fact that ADAM adapts and normalizes the parameter update for the complete gradient \(\boldsymbol{\psi}(t,\boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta})\) in an element-wise way, contrary to Gauss-Newton algorithms, [12]. Referring to [20] and [21], it is well known that sign-sign algorithms converge significantly more slowly than stochastic gradient algorithms.
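The slower convergence is easy to reproduce in a toy example; the sketch below runs the sign-sign update on an illustrative noiseless linear-in-parameters model, where the regressor plays the role of \(\boldsymbol{\psi}(t,\boldsymbol{\theta})\).

```python
import numpy as np

# Sign-sign stochastic gradient update: only the signs of the
# prediction error and of the gradient components are used.
rng = np.random.default_rng(1)
theta_true = np.array([1.0, -2.0])
theta, alpha = np.zeros(2), 0.001
for k in range(20_000):
    phi = rng.normal(size=2)                  # plays the role of psi
    eps_err = phi @ theta_true - phi @ theta  # prediction error
    theta += alpha * np.sign(eps_err) * np.sign(phi)
# theta drifts toward theta_true, at a rate bounded by alpha per step
```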
The global convergence analysis of both hyper-parameter cases above is based on the Lyapunov function candidate \[V(\boldsymbol{\theta},\mathbf{v})=V_A(\boldsymbol{\theta})= \frac{1}{2}\lim_{t\rightarrow \infty}E[\boldsymbol{\varepsilon}^{2}(t,\boldsymbol{\theta})]\geq 0. \label{eq:eq20}\tag{20}\] Since \(\mathbf{f}(\boldsymbol{\theta},\mathbf{v})\) of (18 ), (19 ) and A2 no longer depend on \(\mathbf{v}\), and since the associated ODE for \(\mathbf{v}\) of A2 is linear and asymptotically stable, it is sufficient to use (20 ). Using M1-M4, G1, A1, A2, S1 and S2, the time derivative of the Lyapunov function candidate along the solutions of the associated differential equations of Theorem 1 becomes \[\begin{align} \frac{dV(\boldsymbol{\theta}_D(\tau),\mathbf{v}_D(\tau))}{d\tau}=\frac{d}{d\tau} \lim_{t\rightarrow \infty}\frac{1}{2} E \left[ \varepsilon^2(t,\boldsymbol{\theta}_D(\tau)) \right] \end{align}\] \[\begin{align} =\lim_{t \rightarrow \infty} \frac{1}{2} E\left[ \frac{\partial \varepsilon^2(t,\boldsymbol{\theta})}{\partial \boldsymbol{\theta} } \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)} \frac{d \boldsymbol{\theta}_D(\tau)}{d\tau} \end{align}\] \[\begin{align} =\lim_{t \rightarrow \infty}E \left[ -\boldsymbol{\psi}^{\top}(t,\boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta}) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)} \left( -\bar{\alpha} \mathbf{f}(\boldsymbol{\theta}_D(\tau))\right) \end{align}\] \[=\bar{\alpha}\mathbf{f}^{\top}(\boldsymbol{\theta}_D(\tau))\lim_{t \rightarrow \infty}E \left[ \boldsymbol{\psi}(t,\boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta}) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)}. \label{eq:eq17}\tag{21}\]
Proceeding from (21 ) and using (18 ) immediately gives \[\begin{align} \frac{dV(\boldsymbol{\theta}_D(\tau),\mathbf{v}_D(\tau))}{d\tau} \end{align}\] \[\begin{align} =-\bar{\alpha} \left( \lim_{t \rightarrow \infty} E\left[ \boldsymbol{\psi}(t, \boldsymbol{\theta}) \varepsilon (t, \boldsymbol{\theta}) \right] \right)^{\top}_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)} \end{align}\] \[\begin{align} \times \left( \lim_{t \rightarrow \infty}E\left[ \varepsilon^2(t,\boldsymbol{\theta}) {\rm (\boldsymbol{diag}}(\boldsymbol{\psi}(t,\boldsymbol{\theta}) \boldsymbol{\psi}^{\top}(t,\boldsymbol{\theta}))) \right] \right)^{-\frac{1}{2}}_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)} \end{align}\] \[\times\lim_{t \rightarrow \infty} E\left[ \boldsymbol{\psi}(t, \boldsymbol{\theta}) \varepsilon (t, \boldsymbol{\theta}) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)} \leq 0. \label{eq:eq30}\tag{22}\] Equality holds if and only if \(\lim_{t \rightarrow \infty} E\left[ \boldsymbol{\psi}(t, \boldsymbol{\theta}) \varepsilon (t, \boldsymbol{\theta}) \right]=0\), referring to M3 and G1.
Combining (21 ) and (19 ) gives \[\begin{align} \frac{dV(\boldsymbol{\theta}_D(\tau),\mathbf{v}_D(\tau))}{d\tau} \end{align}\] \[\begin{align} =-\bar{\alpha}\lim_{t\rightarrow \infty}E \left[ {\rm sign}(\varepsilon(t,\boldsymbol{\theta})){\rm \boldsymbol{sign}}\left(\cdot \boldsymbol{\psi}^{\top}(t,\boldsymbol{\theta}) \right) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)} \end{align}\] \[\begin{align} \times \lim_{t \rightarrow \infty} E \left[ \boldsymbol{\psi}(t, \boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta}) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)} \end{align}\] \[\begin{align} =-\bar{\alpha}\lim_{t\rightarrow \infty}E \left[ {\rm \boldsymbol{sign}}\left( \cdot \boldsymbol{\psi}(t,\boldsymbol{\theta})\varepsilon(t,\boldsymbol{\theta})) \right)^{\top} \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)} \end{align}\] \[\times \lim_{t \rightarrow \infty} E \left[ \boldsymbol{\psi}(t, \boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta}) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)}. \label{eq:eq21}\tag{23}\] The following assumption is now introduced
The reason why A6 is introduced is the following result:
Lemma 1: Assume that the distribution \(p_X\) of the stochastic variable \(X\) is symmetric around its mean \(\bar{x}\). Then \(E\left[ {\rm sign}(X) \right]\) has the same sign as \(E \left[ X \right]\).
Proof: The proof is by direct calculation. \[\begin{align} E \left[ {\rm sign}(X) \right]=\int_{-\infty}^{\infty} {\rm sign}(x)p_X(x)dx \end{align}\] \[\begin{align} =\int_{-\infty}^{0} (-1)p_X(x)dx+\int_{0}^{\infty} (1)p_X(x)dx \end{align}\] \[\begin{align} =\int_{-\bar{x}}^{\infty} p_X(z+\bar{x})dz - \int_{-\infty}^{-\bar{x}} p_X(z+\bar{x})dz \end{align}\]
\[\begin{align} =\int_{-\bar{x}}^{\bar{x}} p_X(z+\bar{x})dz, \end{align}\] where the last equality uses the symmetry \(p_X(\bar{x}+z)=p_X(\bar{x}-z)\) to rewrite \(\int_{-\infty}^{-\bar{x}} p_X(z+\bar{x})dz\) as \(\int_{\bar{x}}^{\infty} p_X(z+\bar{x})dz\). Noting that the remaining integral is positive if \(\bar{x}>0\) and negative if \(\bar{x}<0\), Lemma 1 follows.
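As a quick numerical sanity check of Lemma 1, with a Gaussian, which is symmetric around its mean:

```python
import numpy as np

# X ~ N(0.3, 1) is symmetric around its mean 0.3; the estimated
# E[sign(X)] is positive, agreeing in sign with E[X] = 0.3.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=200_000)
e_sign = np.mean(np.sign(x))   # estimate of E[sign(X)], about 0.24
e_mean = np.mean(x)            # estimate of E[X], about 0.3
```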
A6 and an element-wise use of Lemma 1 in (23 ) give \[\begin{align} \frac{dV(\boldsymbol{\theta}_D(\tau),\mathbf{v}_D(\tau))}{d\tau} \end{align}\] \[=-\bar{\alpha} \sum_{i} \lim_{t\rightarrow \infty}E \left[ {\rm sign} \left( \psi_i(t,\boldsymbol{\theta})\varepsilon(t,\boldsymbol{\theta}) \right) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)} \lim_{t \rightarrow \infty} E \left[ \psi_i(t, \boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta}) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)} \leq 0, \label{eq:eq23}\tag{24}\] since Lemma 1 ensures that the two expectations of each term have the same sign. Again, equality holds if and only if \(\lim_{t \rightarrow \infty} E \left[ \boldsymbol{\psi}(t, \boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta}) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)}=0\).
It is now noted that the derived condition for global convergence for both hyper-parameter settings is given by \(\lim_{t \rightarrow \infty} E \left[ \boldsymbol{\psi}(t, \boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta}) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)}=0\). The following assumption is therefore convenient:
By S3, the condition \(\lim_{t \rightarrow \infty} E \left[ \boldsymbol{\psi}(t, \boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta}) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)}=0\) holds for all \(\boldsymbol{\theta}^{\star}\) since \(\boldsymbol{\psi}(t, \boldsymbol{\theta})\) is generated only from \(\mathbf{u}(t)\) which is independent of \(\varepsilon(t,\boldsymbol{\theta})\). Theorem 1 now implies:
Theorem 4: Assume that the boundedness condition, M1-M4, G1, A1, and S1-S3 hold for (11). If i) A2 and A3 hold, or ii) A4, A5 and A6 hold, then \(\hat{\boldsymbol{\theta}}(t) \rightarrow \mathcal{D}_C \;{\rm w.p.1} \;{\rm as} \;t \rightarrow \infty\), or \(\hat{\boldsymbol{\theta}}(t) \rightarrow \partial \mathcal{D}_M\), where \(\boldsymbol{\theta}^{\star} \in \mathcal{D}_C\) is defined in Theorem 1.
Convergence is global and Theorem 4 is valid for both cases treated by the paper. However, there may be other sub-optimal classes of points in the invariant set \(\mathcal{D}_C\) than \(\boldsymbol{\theta}^{\star}\), cf. e.g. [14]. Such sub-optimal points then do not meet S3, but they do fulfil \(\lim_{t \rightarrow \infty} E \left[ \boldsymbol{\psi}(t, \boldsymbol{\theta}) \varepsilon(t,\boldsymbol{\theta}) \right]_{|\boldsymbol{\theta}=\boldsymbol{\theta}_D(\tau)}=0\). Note also that S3 implies that \(w(t)\) of S2 can replace \(\varepsilon(t,\boldsymbol{\theta}^{\star})\).
To test the proposed RNN and to validate the results of the averaging analysis, a Python-based Monte-Carlo analysis of a simulated automotive cruise control system was performed. The vehicle traveling with velocity \(x_1(t)\) is subject to thrust, friction, air resistance and gravitational forces in hilly terrain, see e.g. [22]. Here, the friction and gravitational forces are treated as a disturbance \(w(t)\). Newton’s second law gives \[\dot{x}_1(t)=u(t)-\frac{\rho A C_{x_1}}{2m}x_1^2(t)-w(t). \label{eq:eq100}\tag{25}\] In (25), \(u(t)\) is the accelerator command, \(m\) the mass of the vehicle, \(A\) the frontal area, \(\rho\) the density of the air, and \(C_{x_1}\) the air resistance coefficient. The system was sampled with \(T_s=0.1\) \(s\). The mass of the vehicle was \(m=1500\) \(kg\), while \(\rho\), \(A\) and \(C_{x_1}\) were set to give the vehicle a maximum speed of \(60\) \(m/s\). The maximum acceleration and retardation were \(\pm 3.0\) \(m/s^2\). The standard deviation of the white velocity measurement noise was \(0.1\) \(m/s\), while the standard deviation of \(w(t)\) was \(0.01\) \(m/s^2\). To identify the dynamics, \(f^{nn}\left(\hat{\mathbf{x}}(t,\boldsymbol{\theta}),\mathbf{u}(t), \boldsymbol{\theta} \right)\) was selected with one hidden layer of width 8. The analysis averaged twenty runs for each of the two hyper-parameter settings analysed in Section IV, using fixed step sizes \(\alpha=0.001\) and \(\alpha=0.0001\) to also illustrate tracking. The \(\tanh\) activation function was used. The results in Fig. 1 and Fig. 2 are consistent with the analysis, with the standard hyper-parameter setting performing significantly better than the sign-sign one. The sign-sign hyper-parameter setting with \(\alpha=0.0001\) leads to very slow convergence. This may be due to the algorithm, a suboptimal \(\boldsymbol{\theta}^{\star}\), or that A5 fails to hold.
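A minimal sketch of how the sampled model (25) can be simulated, assuming a forward-Euler discretization with \(T_s=0.1\) \(s\) and a lumped drag coefficient backed out from the stated top speed and maximum acceleration; the discretization choice and the helper names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Sketch of the cruise-control dynamics (25), Euler-sampled at Ts = 0.1 s.
# The lumped drag coefficient c = rho*A*C_x1/(2*m) is chosen so that full
# thrust u = 3.0 m/s^2 balances drag at the stated top speed of 60 m/s.
Ts = 0.1                      # sampling period [s]
v_max, a_max = 60.0, 3.0      # top speed [m/s], max acceleration [m/s^2]
c = a_max / v_max**2          # rho*A*C_x1/(2*m), from a_max = c*v_max^2

def step(x1, u, w, rng):
    """One Euler step of (25) plus a noisy velocity measurement."""
    x1_next = x1 + Ts * (u - c * x1**2 - w)
    y = x1_next + rng.normal(scale=0.1)   # measurement noise std 0.1 m/s
    return x1_next, y

rng = np.random.default_rng(0)
x1 = 0.0
for _ in range(5000):                     # 500 s at full throttle
    w = rng.normal(scale=0.01)            # disturbance std 0.01 m/s^2
    x1, _ = step(x1, a_max, w, rng)
print(x1)                                 # approaches the 60 m/s top speed
```

The pairs \((u(t), y(t))\) generated this way are the kind of input-output data on which the RNN model would be trained recursively.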
The paper derived the asymptotic average updating direction of ADAM for two hyper-parameter settings. It was proved that the setting close to the standard hyper-parameters behaves as a diagonally power normalized stochastic gradient algorithm. The case with filtering turned off instead behaves as a stochastic sign-sign algorithm. In addition, it was proved that the algorithm converges globally to an invariant set that is a superset of the parameter vectors that represent perfect input-output models. The paper also proposed a model structure that embeds structure in RNNs. A Monte-Carlo simulation study validated the results. In view of the asymptotic similarity to other diagonally power scaled gradient descent algorithms, [23], [24], a significant performance advantage for ADAM with respect to conventional normalized gradient descent algorithms is expected, cf. [24].