May 23, 2025
This paper proposes a novel theoretical framework for guaranteeing and evaluating the resilience of long short-term memory (LSTM) networks in control systems. We introduce recovery time as a new metric of resilience in order to quantify the time required for an LSTM to return to its normal state after anomalous inputs. By mathematically refining incremental input-to-state stability (\(\delta\)ISS) theory for LSTM, we derive a practical, data-independent upper bound on the recovery time. This upper bound also enables resilience-aware training. Experimental validation on simple models demonstrates the effectiveness of our resilience estimation and control methods, establishing a foundation for rigorous quality assurance in safety-critical AI applications.
In recent years, artificial intelligence (AI) has become increasingly integrated into various aspects of our daily lives, from healthcare to autonomous vehicles. This widespread adoption has raised concerns regarding AI system reliability and highlighted the need for robust quality assurance measures. In response to these trends, multiple organizations and regulatory bodies have published guidelines for ensuring AI safety and reliability, such as the European Union’s AI Act [1] and AI HLEG guidelines [2]. Although these guidelines have contributed to the gradual establishment of AI quality assurance processes, the development of specific evaluation metrics and implementation methodologies remains in its nascent stages. The development of these metrics and implementation methodologies is a critical issue, especially for control systems incorporating AI, because such systems interact directly with the physical world, potentially impacting human lives in extreme scenarios.
In this paper, we focus on control with time-series models, particularly long short-term memory (LSTM) networks, because of their real-time capability, an important property in control systems. Typical time-series models can extract nonlinear features from time-series data, store them as internal states, and utilize them to make inferences at each time step; among such models, LSTM achieves particularly high prediction accuracy [3]–[5]. This ability is well-suited to predicting future outputs of nonlinear systems in model predictive control (MPC), and consequently LSTM-based MPC tracks target values better than traditional control methods [6]–[8]. Moreover, LSTM can be implemented on Field Programmable Gate Arrays (FPGAs), which are suitable for embedding in machinery [9]. However, the state-space nature of time-series models also presents significant risks: if a model receives temporarily anomalous input values, it may retain inaccurate state memories, potentially leading to persistent prediction errors that could significantly compromise the system’s safety and reliability.
The concerns arising from the use of time-series models in control systems can be mitigated by ensuring stability, robustness, and resilience in these models [10]. Stability refers to consistent performance across similar inputs and forms the foundation for robustness and resilience. Robustness refers to tolerance of perturbations, while resilience refers to the ability to return to a normal state after transitioning to an abnormal mode due to anomalous inputs. See also Section 0.0.2.1 for related work on these concepts. Time-series models with these properties should exhibit consistent behavior under normal conditions. However, these properties cannot be adequately evaluated using conventional metrics focused on inference accuracy, such as precision, recall, and F1-score [11]. Therefore, it is necessary to formulate new evaluation metrics and develop techniques to achieve the required metrics. This issue might be addressed by comprehensively collecting and learning from data across various scenarios, but defining such comprehensiveness and collecting data at scale are both difficult. Given this context, we mathematically formulate the resilience of LSTM, which differs from data-driven approaches. For theoretical and experimental approaches to assuring LSTM stability and quality, see also Section 0.0.2.2 and Section 0.0.2.3.
Figure 1: Schematic diagram of recovery time. The blue line depicts the system’s response to normal input without perturbations, while the orange and gray lines illustrate the responses of systems with and without resilience, respectively, to anomalous inputs subjected to instantaneous noise. We define the recovery time as the interval required for a system’s response to revert to its nominal behavior following a perturbation induced by noise. A resilient system is one that has a finite recovery time.
Our research focuses on formulating evaluation metrics and implementation methodologies to assess and ensure the resilience of LSTMs. Specifically, our goals are as follows:
To introduce a novel mathematical formulation of resilience, the recovery time, defined as the duration required for a system to return to its normal state after transitioning to an abnormal mode due to anomalous inputs (see Figure 1).
To theoretically evaluate the upper bound of recovery time and to develop a practical data-independent method for estimating it.
To present parameter adjustment techniques for controlling the recovery time of the model during the training process.
The key tool to achieve our goals is incremental input-to-state stability (\(\delta\)ISS) [12], [13], because \(\delta\)ISS guarantees the resilience of LSTM. Specifically, even if temporary perturbations in the input sequence disrupt inference, the state variables can converge to the same inference values that would have been obtained without perturbations. Sufficient conditions for the LSTM to exhibit the \(\delta\)ISS property have already been derived as inequalities on the parameters [14], [15]. However, these studies do not clarify how long it takes for the state variables to converge. We refine and improve upon these previous studies to achieve the above three goals.
The main contributions of this study are as follows:
Refinement of an evaluation of LSTM’s invariant sets, which form the foundation for stability analysis (Section 0.0.4).
Improvement of sufficient conditions for LSTM to be \(\delta\)ISS (Section 0.0.4).
Definition of recovery time that quantifies the resilience of LSTM and derivation of a data-independent upper bound for it (Section 0.0.5).
Introduction of a quantitative metric for quality assurance through the recovery time (Section 0.0.6).
Development of a method to control the trade-off between recovery time and inference accuracy for LSTMs (Section 0.0.6).
In control theory, stability and resilience are important properties. Furthermore, it is well-established that the stability of controlled objects fundamentally determines the overall stability and resilience of control systems. For instance, systems possessing input-to-state stability (ISS) properties achieve global asymptotic stability under closed-loop control [16]. Moreover, systems with \(\delta\)ISS characteristics acquire robustness when implemented with model predictive control (MPC) [13].
Therefore, in control methods such as MPC where controlled objects are modeled as dynamical systems, stability requirements for the dynamical systems are essential. While stability can be comprehensively guaranteed by general theory for linear systems, practical applications require specific and detailed engineering for nonlinear systems. This research focuses on LSTM networks as a specific case of nonlinear systems.
Theoretical approaches have investigated sufficient conditions for LSTM networks to satisfy various stability properties. For single-layer LSTMs, conditions for global asymptotic stability [17], ISS [18], and \(\delta\)ISS [14] have been successively established. Research on sufficient conditions for \(\delta\)ISS has been extended to multilayer LSTM architectures [15].
However, existing studies face a significant limitation: while they can guarantee recovery, they cannot provide estimates of the time to recover. Our research addresses this limitation by advancing the stability theory of LSTM to enable both quantitative evaluation and adjustment of LSTM resilience.
This paper discusses theoretical LSTM quality assurance; however, there are also some experimental methods. For example, [19] evaluated the robustness of LSTM inference using noise-added datasets. [20] proposed creating robust LSTMs by expanding training datasets to include abnormal scenes. [21] proposed a data-driven stability assurance method.
These data-driven verification methods, like the ones mentioned above, are comprehensive when sufficient data are available, yet face challenges in determining adequate data quantities and provide only sparse guarantees for nonlinear models like LSTMs. Theoretical verification offers limited but exhaustive guarantees within its scope, and is therefore complementary to, rather than in conflict with, a data-driven approach.
In this section, we introduce essential mathematical preliminaries, an important stability concept: \(\delta\)ISS, and the LSTM.
For a vector \(v \in \mathbb{R}^n\), \(\|v\|=\|v\|_2\) represents the Euclidean norm of \(v\), and \(\|v\|_{\infty}=\max_{i \in [n]} |v_{(i)}|\) represents the maximum norm, where \(v_{(i)}\) is the \(i\)-th component of \(v\) and \([n]=\{1,2,\ldots,n\}\). For a discrete-time sequence of vectors \(v(k), k \in \mathbb{Z}_{\geq 0}\), we denote \(v(k_1 : k_2)=[v(k_1),\ldots, v(k_2)]\), and \(\|v(k_1:k_2)\|_{2,\infty}=\|(\|v(k_1)\|_2,\ldots,\|v(k_2)\|_2)^\top\|_\infty\).
For a matrix \(A \in \mathbb{R}^{m \times n}\), \(\|A\|=\|A\|_2\) represents the induced 2-norm of \(A\), and \(\|A\|_{\infty}\) denotes the induced \(\infty\)-norm, that is, \(\|A\|_{\infty}=\max_i \sum_j |A_{(i,j)}|\). In addition, \(\left\lvert{A}\right\rvert\) denotes the matrix whose components are the absolute values of those of \(A\), that is, \(\left\lvert{A}\right\rvert \mathrel{\vcenter{:}}= (|A_{(i,j)}|)\). For a matrix \(A \in \mathbb{R}^{n \times n}\), \(\rho(A)\) denotes the spectral radius of \(A\), that is, \(\rho(A) \mathrel{\vcenter{:}}= \max_{i \in [n]} |\lambda_i|\), where \(\lambda_{i}\) are the eigenvalues of \(A\).
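For concreteness, these norms can be computed directly; the following is a minimal NumPy sketch (the function names are illustrative, not part of the paper's notation):

```python
import numpy as np

def mixed_norm(v_seq):
    """The norm ||v(k1:k2)||_{2,infty}: maximum over time of the Euclidean norms.

    v_seq has shape (T, n), holding v(k1), ..., v(k2) row-wise."""
    return np.max(np.linalg.norm(v_seq, axis=1))

def induced_inf_norm(A):
    """Induced infinity-norm of a matrix: maximum absolute row sum."""
    return np.max(np.sum(np.abs(A), axis=1))

def spectral_radius(A):
    """rho(A): largest modulus of the eigenvalues."""
    return np.max(np.abs(np.linalg.eigvals(A)))
```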
We denote the sigmoid function as \(\sigma(w)=\frac{1}{1+e^{-w}}\) and the hyperbolic tangent as \(\phi(w)=\tanh{w}=\frac{e^w-e^{-w}}{e^w+e^{-w}}\).
Definition 1 (Class \(\mathcal{K}\), \(\mathcal{KL}\), and \(\mathcal{K}_\infty\)). A function \(\alpha:\mathbb{R}_{\ge 0}\to \mathbb{R}_{\ge 0}\) is called a \(\mathcal{K}\)-function if it is continuous, strictly increasing, and \(\alpha(0)=0\). A function \(\beta:\mathbb{R}_{\ge 0}\times \mathbb{R}_{\ge 0}\to\mathbb{R}_{\ge 0}\) is called a \(\mathcal{KL}\)-function if \(\beta(\cdot,t)\) is a \(\mathcal{K}\)-function for each \(t\ge 0\), and \(\beta(s,\cdot)\) is decreasing with \(\beta(s,t)\to 0\) as \(t\to \infty\) for each \(s\ge 0\). A function \(\gamma:\mathbb{R}_{\ge 0}\to \mathbb{R}_{\ge 0}\) is called a \(\mathcal{K}_\infty\)-function if it is a \(\mathcal{K}\)-function and \(\lim_{t \to \infty}\gamma(t)=\infty\).
Now we consider the (autonomous) discrete-time nonlinear system \[\label{NS} s(t+1)=f(s(t),x(t))\tag{1}\] with \(f(0,0)=0\), for initial state \(s(0)\) and input \(x(t), t \geq 0\). In the following, we denote by \(s_i(t)\) the state trajectory of the system whose initial state is \(s_i(0)\) and whose input is \(x_i(t)\).
Definition 2 (\(\delta\)ISS, [12], [13]). System 1 is called incrementally input-to-state stable (\(\delta\)ISS) if there exist continuous functions \(\beta \in \mathcal{KL}\) and \(\gamma \in \mathcal{K}_{\infty}\) such that, for any initial states \(s_1(0),s_2(0) \in \mathbb{R}^{n_s}\) and for any input sequence \(x_1(0:t)\), \(x_2(0:t) \in [-x_\mathrm {max},x_\mathrm {max}]^{n_x \times (t+1)}\), \(s_1(t)\) and \(s_2(t)\) satisfy \[\begin{align} \label{def-delta-ISS} \|s_1(t)-s_2(t)\| \leq \, &\beta(\|s_1(0)-s_2(0)\|,t)\notag \\ &+\gamma(\|x_1(0:t)-x_2(0:t)\|_{2,\infty}) \end{align}\tag{2}\] for all \(t \in \mathbb{Z}_{\geq 0}\).
The LSTM is described by the following relations: \[\begin{align} c^{(l)}(t+1) &=\sigma(W^{(l)}_f x^{(l)}(t) + U^{(l)}_f h^{(l)}(t) + b^{(l)}_f) \odot c^{(l)}(t) \notag\\ &\quad + \sigma(W^{(l)}_i x^{(l)}(t) + U^{(l)}_i h^{(l)}(t) + b^{(l)}_i) \notag\\ &\qquad \odot \phi(W^{(l)}_c x^{(l)}(t) + U^{(l)}_c h^{(l)}(t) + b^{(l)}_c), \tag{3}\\ h^{(l)}(t+1) &= \sigma(W^{(l)}_o x^{(l)}(t) + U^{(l)}_o h^{(l)}(t) + b^{(l)}_o)\notag \\ &\quad \odot \phi(c^{(l)}(t+1)), \tag{4}\\ y(t)&= U_y h^{(L)}(t) + b_y, \tag{5}\\ x^{(l)}(t)&= \begin{cases} x & \text{for } l=1,\\h^{(l-1)}(t+1) & \text{for } l=2,\ldots, L,\end{cases} \tag{6} \end{align}\] where \(x \in [-x_\mathrm {max},x_\mathrm {max}]^{n_x}\), \(c^{(l)} \in \mathbb{R}^{n_c}\), \(h^{(l)} \in (-1,1)^{n_c}\), \(y \in \mathbb{R}^{n_y}\) are the input, the cell state, the hidden state, and the output, respectively, for each time \(t \in \mathbb{Z}_{\geq 0}\), each layer \(l \in [L]\), and \(x_\mathrm {max}>0\). Moreover, \(W^{(l)}_{\ast} \in \mathbb{R}^{n_c \times n_x}, U^{(l)}_{\ast} \in \mathbb{R}^{n_c \times n_c}, U_y \in \mathbb{R}^{n_y \times n_c}\) are the weight matrices, \(b^{(l)}_{\ast} \in \mathbb{R}^{n_c}\) and \(b_y \in \mathbb{R}^{n_y}\) are the biases, where \(\ast\) denotes \(f, i, c\), or \(o\).
In this paper, we regard the LSTM as a discrete-time nonlinear system 1. We define the state \(s(t)\) and the input \(x(t)\) of 1 as \(s^{(1:L)}(t)=[s^{(1)}(t),\dots, s^{(L)}(t)]\) and \((x^{(l)}(t)^\top,b_c^{(l)\top})^\top\), respectively, where \(s^{(l)}(t)=(c^{(l)}(t)^\top, h^{(l)}(t)^\top)^\top\).
In this section, we relax the existing \(\delta\)ISS condition of the LSTM [14] by refining the invariant set of the LSTM dynamics. This relaxation contributes to an accurate evaluation of the recovery time introduced in Section 0.0.5.
Define \(G^{(l)}_*, G^{(l)}_c:\mathbb{R}_{\ge 0}\to \mathbb{R}_{\ge 0}\) (\(*\) denotes \(f, i\), or \(o\)) by \[\begin{align} G^{(l)}_*(\eta)&= \left\|{\left(x_\mathrm {max}|W^{(l)}_*| \boldsymbol{1}_{n_x}+ \eta |U^{(l)}_*| \boldsymbol{1}_{n_c} + b^{(l)}_*\right)_+}\right\|_{\infty}, \\ G^{(l)}_c(\eta)&=\left\|{x_\mathrm {max}|W^{(l)}_c| \boldsymbol{1}_{n_x}+ \eta |U^{(l)}_c| \boldsymbol{1}_{n_c} + |b^{(l)}_c|}\right\|_{\infty}, \end{align}\] where \(\boldsymbol{1}_n\) is an \(n\)-dimensional vector whose components are all one, and \(v_+\) denotes the vector \((\max\{v_{(i)},0\})_{i \in [n_c]}\) for a vector \(v\). Note that \(x_\mathrm {max}=1\) for the layers \(l\ge 2\), since the input of such a layer is the hidden state \(h^{(l-1)}\), which lies in \((-1,1)^{n_c}\). In addition, we define the sequences \(\{\overline{\sigma}^{(l)}_*(k)\}_{k=0}^{\infty}\), \(\{\overline{\phi}^{(l)}_c(k)\}_{k=0}^{\infty}\), \(\{\eta^{(l)}(k)\}_{k=-1}^{\infty}\), and \(\{\overline{c}^{(l)}(k)\}_{k=0}^{\infty}\) using the following recursive formulas: \[\begin{align} \overline{\sigma}^{(l)}_*(k)&=\sigma(G^{(l)}_*(\eta^{(l)}(k-1))), \tag{7}\\ \overline{\phi}^{(l)}_c(k)&=\phi(G^{(l)}_c(\eta^{(l)}(k-1))),\tag{8}\\ \eta^{(l)}(k)&=\phi(\overline{c}^{(l)}(k)) \overline{\sigma}^{(l)}_o(k),\tag{9} \\ \overline{c}^{(l)}(k) &=\frac{\overline{\sigma}^{(l)}_i(k) \overline{\phi}^{(l)}_c(k)}{1-\overline{\sigma}^{(l)}_f(k)}, \end{align}\] for \(k \in \mathbb{Z}_{\geq 0}\), with initial term \(\eta^{(l)}(-1)=1\), where \(\ast\) denotes \(f,i\), or \(o\). Moreover, we define \[\begin{align} \mathcal{C}^{(l)}(k) &\mathrel{\vcenter{:}}= \left\{c^{(l)}\in \mathbb{R}^{n_c}:\|c^{(l)}\|_{\infty} \leq \overline{c}^{(l)}(k) \right\},\\ \mathcal{H}^{(l)}(k) &\mathrel{\vcenter{:}}= \left\{h^{(l)}\in \mathbb{R}^{n_c}:\|h^{(l)}\|_{\infty} \leq \phi(\overline{c}^{(l)}(k))\overline{\sigma}^{(l)}_o(k) \right\}. \end{align}\]
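These recursions depend only on the weights and can be evaluated numerically. The following is a minimal NumPy sketch for a single layer; the dictionary interface and names are illustrative assumptions, not part of the method:

```python
import numpy as np

def sigma(w):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-w))

def invariant_set_bounds(W, U, b, x_max, k_max):
    """Evaluate the recursions 7-9 for one LSTM layer.

    W, U, b are dicts over the gates 'f', 'i', 'c', 'o' holding the weight
    matrices and biases of the layer; x_max bounds the layer input.  Returns,
    for k = 0, ..., k_max, tuples (sigma_bar, phi_bar_c, c_bar, eta), where
    sigma_bar is itself a dict over the gates f, i, o."""
    def G(gate, eta):
        v = x_max * np.abs(W[gate]).sum(axis=1) + eta * np.abs(U[gate]).sum(axis=1)
        if gate == 'c':                      # G_c uses |b_c| and no positive part
            return np.max(v + np.abs(b['c']))
        return np.max(np.maximum(v + b[gate], 0.0))   # (.)_+ then infinity-norm

    eta = 1.0                                # eta(-1) = 1
    bounds = []
    for _ in range(k_max + 1):
        sigma_bar = {g: sigma(G(g, eta)) for g in ('f', 'i', 'o')}
        phi_bar_c = np.tanh(G('c', eta))
        c_bar = sigma_bar['i'] * phi_bar_c / (1.0 - sigma_bar['f'])
        eta = np.tanh(c_bar) * sigma_bar['o']          # eta(k)
        bounds.append((sigma_bar, phi_bar_c, c_bar, eta))
    return bounds
```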
Proposition 1 (Invariant Set). Let \(L\in\mathbb{N}\) and \(k \in \mathbb{Z}_{\geq 0}\). \(\mathcal{S}^{(l)}(k)=\mathcal{C}^{(l)}(k) \times \mathcal{H}^{(l)}(k)\) is an invariant set, that is, for all \(t \in \mathbb{Z}_{\ge 0}\), \(l\in[L]\) and \(x \in [-x_\mathrm {max},x_\mathrm {max}]^{n_x}\), if \(s^{(l)}(0)=(c^{(l)}(0), h^{(l)}(0))\in\mathcal{S}^{(l)}(k)\), then \(s^{(l)}(t)=(c^{(l)}(t), h^{(l)}(t)) \in \mathcal{S}^{(l)}(k)\). Furthermore, \(\{\mathcal{S}^{(l)}(k)\}_{k=0}^\infty\) is a decreasing sequence in the sense of set inclusion.
Proof sketch. Since the activation functions are strictly monotonically increasing and bounded, the sequences are monotonically decreasing and converge to some limit. This proposition is proved by induction on \(k\). ◻
The exact proof is described in Appendix 0.0.10.1. We note that the invariant set \(\mathcal{S}^{(l)}(0)\) is considered in the literature [14], [15].
Using Proposition 1, we provide the following sufficient condition of \(\delta\)ISS:
Theorem 2 (\(\delta\)ISS of LSTM). Let \(L \in \mathbb{N}\) and \(k \in \mathbb{Z}_{\geq 0}\). We assume that \(s^{(l)}(0)\in\mathcal{S}^{(l)}(k)\) for all \(l \in [L]\). The LSTM is \(\delta\)ISS if \(\rho(A^{{(l)}}_{s}(k)) <1\) for all \(l \in [L]\), where \[\begin{align} &A_s^{(l)}(k)\\ &=\begin{bmatrix} \overline{\sigma}_f^{(l)}(k) & \alpha_s^{(l)}(k)\\ \overline{\sigma}_o^{(l)}(k)\overline{\sigma}_f^{(l)}(k) & \alpha_s^{(l)}(k)\overline{\sigma}_o^{(l)}(k)+\dfrac{1}{4}\phi(\overline{c}^{(l)}(k))\|U_o^{(l)}\| \end{bmatrix},\\ &\alpha_s^{(l)}(k) =\frac{1}{4}\|U_f^{(l)}\|\overline{c}^{(l)}(k)+\overline{\sigma}_i^{(l)}(k)\|U_c^{(l)}\|+\frac{1}{4}\|U_i^{(l)}\|\overline{\phi}_c^{(l)}(k). \end{align}\] Furthermore, \(\rho(A^{{(l)}}_{s}(k))\) is monotonically decreasing with respect to \(k \in \mathbb{Z}_{\geq 0}\).
Proof sketch. The matrix \(A_s^{(l)}(k)\) is obtained by the invariant set in Proposition 1, Lipschitz continuity of the activation functions, triangle inequality, and norm properties. The monotonic decrease of \(\rho(A^{{(l)}}_{s}(k))\) is demonstrated using the Perron–Frobenius theorem. ◻
The rigorous proof is described in Appendix 0.0.10.3, where we also show that a similar relaxation can be obtained for ISS.
Remark 3. We give some important observations on Theorem 2:
Theorem 2 coincides with existing results [14], [15] in the case \(k=0\). The monotonic decrease of \(\rho(A^{(l)}_s(k))\) makes it possible to relax the constraints on the model in the case \(k > 0\).
Because \(\{\mathcal{S}^{(l)}(k)\}_{k=0}^\infty\) is a decreasing sequence, the assumption on the initial state becomes stronger as \(k\) increases. However, since the initial states of LSTMs are usually set to zero in practice, we believe that this poses no practical limitation.
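In practice, the sufficient condition of Theorem 2 can be checked numerically from the weights. A minimal sketch, assuming the scalar bounds \(\overline{\sigma}^{(l)}_*(k)\), \(\overline{\phi}^{(l)}_c(k)\), and \(\overline{c}^{(l)}(k)\) have already been computed by the recursion of Section 0.0.4.1 (for example, by the sketch above):

```python
import numpy as np

def delta_iss_matrix(s_bar, phi_c, c_bar, U):
    """A_s(k) of Theorem 2 for one layer.

    s_bar is a dict of the bounds sigma_bar_f/i/o(k); phi_c and c_bar are the
    bounds phi_bar_c(k) and c_bar(k); U is a dict of the recurrent weight
    matrices U_f, U_i, U_c, U_o (illustrative interface)."""
    nU = {g: np.linalg.norm(U[g], 2) for g in U}      # induced 2-norms
    alpha = 0.25 * nU['f'] * c_bar + s_bar['i'] * nU['c'] + 0.25 * nU['i'] * phi_c
    return np.array([
        [s_bar['f'], alpha],
        [s_bar['o'] * s_bar['f'],
         alpha * s_bar['o'] + 0.25 * np.tanh(c_bar) * nU['o']],
    ])

def satisfies_delta_iss(A):
    """Sufficient condition of Theorem 2: spectral radius strictly below one."""
    return np.max(np.abs(np.linalg.eigvals(A))) < 1.0

# usage: s_bar, phi_c, c_bar, _ = invariant_set_bounds(W, U, b, x_max, k)[k]
#        ok = satisfies_delta_iss(delta_iss_matrix(s_bar, phi_c, c_bar, U))
```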
In this section, we define a novel mathematical formulation, the recovery time of LSTM, which accurately expresses the time required for the system to recover from disturbances and return to its normal state. This quantity is difficult to evaluate directly in practical applications because it depends on the data, but we overcome this limitation by deriving a practical, computable upper bound for the recovery time. This upper bound constitutes the primary novel contribution of our research, offering a fresh perspective on LSTM performance evaluation.
Definition 3 (Recovery Time). Let inputs \(x(t), \widehat{x}(t) \in [-x_\mathrm {max}, x_\mathrm {max}]^{n_x}\) for \(t \in \mathbb{Z}_{\geq 0}\). Assume that there exists \(t_0>0\) such that \(x(t)=\widehat{x}(t)\) for any \(t \geq t_0\). We define recovery time \(T_R \geq 0\) for given \(e \geq 0\) as \[\begin{align} &T_R(x,\widehat{x},s(0);e)\\ &=\min\{t \ge t_0:\|y(t', s(0), x(0:t'))\\ & - y(t', s(0), \widehat{x}(0:t'))\| \leq e \;\text{for any} \;t' \geq t \}-t_0, \end{align}\] and as \(T_R=\infty\) if the set on the right-hand side is empty.
Evaluating the recovery time \(T_R\) directly is difficult because it depends on the input data; we therefore utilize the function \(\beta\) of \(\delta\)ISS in Definition 2.
Theorem 2 indicates that there exists a \(\mathcal{KL}\) function \(\beta\) satisfying \[\begin{align} &\| s(t, s_1(0), x(0:t))- s(t, s_2(0), x(0:t)) \| \\ &\le \beta(\|s_1(0) - s_2(0) \|,t) \label{beta} \end{align}\tag{10}\] for any initial states \(s_1(0), s_2(0)\in\prod^L_{l=1}\mathcal{S}^{(l)}(k)\) and input sequence \(x(0:t) \in [-x_\mathrm {max},x_\mathrm {max}]^{n_x \times (t+1)}\) of the LSTM and \(t>0\). The proof of Theorem 2 also leads to a more precise and explicit evaluation of 10 :
Proposition 4 (Explicit Representation of \(\delta\)ISS). Let \[\begin{align} &\tilde{\beta}(s^{(1:L)},t;k)\\ &= \mu^{(L)}(t) \rho(A_s^{(L)}(k))^t s^{(L)} \\ &\quad+\sum^{L-1}_{l=1} \mu^{(l:L)}(t) \frac{(t+L-l)!}{t!(L-l)!} \\ &\qquad \times \max_{i = l,\cdots,L} \rho(A_s^{(i)}(k))^t \left( \prod^L_{i=l+1} \|a_x^{(i)}(k)\| \right) s^{(l)}. \label{beta-multi-layer-LSTM} \end{align}\qquad{(1)}\] Then, \[\begin{align} &\| s^{(L)}(t, s_1(0), x(0:t))- s^{(L)}(t, s_2(0), x(0:t)) \| \\ &\le \tilde{\beta}(\|s^{(1)}_1(0) - s^{(1)}_2(0) \|,\dots,\|s^{(L)}_1(0) - s^{(L)}_2(0) \|,t;k) \end{align}\] holds for any initial states \(s_1(0), s_2(0)\in\prod^L_{l=1}\mathcal{S}^{(l)}(k)\) and input sequence \(x(0:t) \in [-x_\mathrm {max},x_\mathrm {max}]^{n_x \times (t+1)}\) of the LSTM and \(t>0\). Here, \[\begin{align} a_x^{(l)}(k) &= \begin{pmatrix} \alpha_x^{(l)}(k) \\ \alpha_x^{(l)}(k) \bar{\sigma}^{(l)}_o(k) + \frac{1}{4}\phi(\overline{c}^{(l)}(k)) \|W^{(l)}_o\| \end{pmatrix},\\ \alpha_x^{(l)}(k)&=\frac{1}{4}\|W_f^{(l)}\|\overline{c}^{(l)}(k)+\overline{\sigma}_i^{(l)}(k)\|W_c^{(l)}\|\\ &\quad +\frac{1}{4}\|W_i^{(l)}\|\overline{\phi}_c^{(l)}(k),\\ \mu^{(l:L)}(t) &= \prod^{L}_{i=l}\mu^{(i)}(t), \\ \mu^{(l)}(t) &= \sqrt{(1+(r^{(l)})^{2t}) +\frac{|\nu^{(l)}|^2}{|\lambda^{(l)}_1|^2} {\left( \frac{1-(r^{(l)})^t}{1-r^{(l)}} \right)}^2}, \end{align}\] for any decomposition such as \[\begin{align} A^{(l)}_s(k) &= U \begin{bmatrix} \lambda^{(l)}_1 & \nu^{(l)}\\ 0 & \lambda^{(l)}_2 \end{bmatrix} U^*, \quad |\lambda^{(l)}_1|\ge|\lambda^{(l)}_2|,\\ r^{(l)}&:= |\lambda^{(l)}_2|/|\lambda^{(l)}_1| \in [0,1], \end{align}\] where \(U\) is a suitable unitary matrix (note that any square matrix admits such a unitary triangularization by the Schur decomposition). Also, we regard the sum \(\sum_{l=1}^0\) as \(0\).
Proof sketch. This representation is given in the proof of Theorem 2, specifically by repeatedly evaluating the inputs of each layer in terms of the state variables of the preceding layers according to 6. ◻
Theorem 5 (Upper Bound of Recovery Time). We assume that \(s^{(l)}(0)\in\mathcal{S}^{(l)}(k)\) and \(\rho(A^{{(l)}}_{s}(k)) <1\) for all \(l \in [L]\). Let \(x(t), \widehat{x}(t) \in [-x_\mathrm {max},x_\mathrm {max}]^{n_x}\), \(t\ge0\), be inputs of infinite length and let \(e > 0\). We assume that there exists \(t_0\) such that \(x(t)=\widehat{x}(t)\) for any \(t \ge t_0\). Then, we have \[\begin{align} &\sup_{s(0) \in \prod^L_{l=1}\mathcal{S}^{(l)}(k)}T_R(x, \widehat{x},s(0);e) \notag \\ &\le \min \{t \ge 0 : \tilde{\beta}(2\sqrt{n_c}\bar{s}(k),t'; k) \le e/\|U_y\| \;\text{for any} \;t' \ge t \}, \end{align}\] where \(\bar{s}(k)=(\bar{s}^{(1)}(k),\dots,\bar{s}^{(L)}(k))\) and \(\bar{s}^{(l)}(k)=\|(\bar{c}^{(l)}(k) , \phi (\bar{c}^{(l)}(k)) \bar{\sigma}^{(l)}_o (k))\|\).
Proof sketch. This is from Definition 3, Proposition 4 (shifting the initial time by \(t_0\)), and the fact that the diameter of the invariant set is \(2\sqrt{n_c}\bar{s}^{(l)}(k)\). ◻
The detailed proof is stated in Appendix 10.
Remark 6. Proposition 4 and Theorem 5 imply the following:
\(\tilde{\beta}(s^{(1:L)},t;k)\) is calculated from the weights of the LSTM and the recursive formulas 7, 8, and 9. Since the recovery time can be estimated using only \(\tilde{\beta}(s^{(1:L)},t;k)\), this upper bound is a data-independent formula.
As \(k\) increases, \(\rho(A^{{(l)}}_s(k))\) decreases, consequently leading to a reduction in \(\tilde{\beta}(\bar{s}(k),t;k)\). This reduction in \(\tilde{\beta}\) results in a sharper upper bound for \(T_R\).
We provide a comprehensive strategy for translating theoretical foundations into practical applications.
Quantitatively assess the resilience of LSTMs.
Evaluate the obtained model’s resilience using the following index: \[\begin{align} &\bar{T}_{R}(e;k) \\ &=\min \{t_1 \ge 0 \;: \;\tilde{\beta} (2\sqrt{n_c}\bar{s}(k),t; k) \le e/\|U_y\| \;\text{for any} \;t \ge t_1\} \end{align}\] for some \(k\). Since \(\bar{T}_{R}\) depends solely on the model’s weight parameters, it can be evaluated independently of the data. According to Theorem 5, a smaller \(\bar{T}_{R}\) indicates a stronger resilience capacity of the model.
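For the single-layer case (\(L=1\)), the index \(\bar{T}_R(e;k)\) can be evaluated directly from Proposition 4 and Theorem 5, where \(\tilde{\beta}(s,t;k)=\mu(t)\rho(A_s(k))^t s\). The sketch below follows the form of Theorem 5 and uses a finite search horizon as a practical cut-off; all names are illustrative:

```python
import numpy as np

def schur_2x2(A):
    """Unitary triangularization of a 2x2 matrix with |lambda_1| >= |lambda_2|."""
    lam, V = np.linalg.eig(A)
    order = np.argsort(-np.abs(lam))              # dominant eigenvalue first
    lam = lam[order]
    u1 = V[:, order[0]] / np.linalg.norm(V[:, order[0]])
    u2 = np.array([-np.conj(u1[1]), np.conj(u1[0])])   # orthonormal complement of u1
    U = np.column_stack([u1, u2])
    T = U.conj().T @ A @ U                        # upper triangular, T[0,0] = lambda_1
    return lam[0], lam[1], T[0, 1]

def beta_tilde_single_layer(s0, t, A):
    """beta_tilde(s0, t; k) of Proposition 4 for L = 1: mu(t) * rho(A)^t * s0."""
    lam1, lam2, nu = schur_2x2(A)
    rho = np.abs(lam1)
    r = np.abs(lam2) / np.abs(lam1)
    geom = t if np.isclose(r, 1.0) else (1.0 - r**t) / (1.0 - r)
    mu = np.sqrt(1.0 + r**(2 * t) + (np.abs(nu) / np.abs(lam1))**2 * geom**2)
    return mu * rho**t * s0

def recovery_time_index(A, c_bar, sigma_o_bar, n_c, Uy_norm, e, horizon=10_000):
    """T_bar_R(e; k): smallest t after which the bound stays below e / ||U_y||.

    A is A_s(k) for the single layer; c_bar and sigma_o_bar are the bounds of
    Section 0.0.4.1; the finite horizon is a practical cut-off only."""
    s_bar = np.linalg.norm([c_bar, np.tanh(c_bar) * sigma_o_bar])
    vals = np.array([beta_tilde_single_layer(2.0 * np.sqrt(n_c) * s_bar, t, A)
                     for t in range(horizon)])
    fail = np.where(vals > e / Uy_norm)[0]
    return 0 if fail.size == 0 else int(fail[-1]) + 1
```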
We propose to establish this metric as a formal requirement for assessing LSTM quality, contextualized within the framework of control periods.
Build an LSTM with a specified degree of resilience.
Control the model’s resilience (i.e., \(T_R\)) during training by the following loss function: \[\begin{align} Loss &= \mathcal{L} + \lambda \Phi(\theta,\varepsilon), \tag{11}\\ \Phi(\theta,\varepsilon) &= \sum^{L}_{l=1} \max \{\rho(A^{(l)}_s(k))-1+\varepsilon, 0 \}, \tag{12} \end{align}\] where \(\mathcal{L}\) denotes the existing loss function, \(\Phi\) denotes the penalty term for the resilience, \(\lambda>0\) denotes the penalty intensity, and \(\varepsilon \in [0, 0.5]\) denotes the control parameter for the resilience. The resilience of LSTM can be ensured through the design of the loss function, without modifications to the model architecture. In addition, the degree of resilience can be precisely calibrated by adjusting the parameter \(\varepsilon\). In fact, as training progresses, \(\Phi(\theta,\varepsilon)\) is expected to approach 0, and after training is completed, the model is expected to satisfy \(\rho(A^{(l)}_s(k)) \thickapprox 1-\varepsilon\). Based on Theorems 2 and 5, \(T_R\) is expected to decrease as \(\rho(A^{(l)}_s(k))\) decreases. Consequently, increasing the value of \(\varepsilon\) enhances the resilience of the model. This amplification of resilience concurrently imposes stronger constraints on the model, potentially leading to a trade-off between resilience and model accuracy.
We tune \(\varepsilon\) to balance resilience and accuracy requirements, observing the trade-off and selecting an optimal model.
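A possible implementation of the penalty 12 for a single layer and \(k=0\) is sketched below in PyTorch; the \(2\times 2\) spectral radius is written in closed form so that the penalty remains differentiable with respect to the weights. The weight-dictionary interface and the surrounding training step are illustrative assumptions, not the exact setup used in our experiments:

```python
import torch

def gate_bound(W, U, b, x_max, eta, positive_part=True):
    """G_*(eta): infinity-norm bound on a gate pre-activation (k = 0 uses eta = 1)."""
    v = x_max * W.abs().sum(dim=1) + eta * U.abs().sum(dim=1)
    v = torch.clamp(v + b, min=0.0) if positive_part else v + b.abs()
    return v.max()

def delta_iss_penalty(p, x_max=1.0, eps=0.3):
    """max{rho(A_s(0)) - 1 + eps, 0} for a single layer.

    p is a dict of tensors W_f, U_f, b_f, ... named after equations 3-4
    (illustrative interface)."""
    s_f = torch.sigmoid(gate_bound(p['W_f'], p['U_f'], p['b_f'], x_max, 1.0))
    s_i = torch.sigmoid(gate_bound(p['W_i'], p['U_i'], p['b_i'], x_max, 1.0))
    s_o = torch.sigmoid(gate_bound(p['W_o'], p['U_o'], p['b_o'], x_max, 1.0))
    phi_c = torch.tanh(gate_bound(p['W_c'], p['U_c'], p['b_c'], x_max, 1.0,
                                  positive_part=False))
    c_bar = s_i * phi_c / (1.0 - s_f)
    norm2 = lambda M: torch.linalg.matrix_norm(M, ord=2)     # induced 2-norm
    alpha = (0.25 * norm2(p['U_f']) * c_bar + s_i * norm2(p['U_c'])
             + 0.25 * norm2(p['U_i']) * phi_c)
    a11, a12 = s_f, alpha
    a21 = s_o * s_f
    a22 = alpha * s_o + 0.25 * torch.tanh(c_bar) * norm2(p['U_o'])
    # spectral radius of the nonnegative 2x2 matrix [[a11, a12], [a21, a22]]
    rho = 0.5 * (a11 + a22 + torch.sqrt((a11 - a22) ** 2 + 4.0 * a12 * a21))
    return torch.relu(rho - 1.0 + eps)

# inside a training step (criterion and lambda_pen are illustrative):
# loss = criterion(model(x), y) + lambda_pen * delta_iss_penalty(weights)
```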
In this section, we verify the following two properties of our proposed method:
Increasing \(k\) in \(\bar{T}_R(k)\) leads to more accurate recovery time estimates.
Adjusting \(\varepsilon\) enables the tuning of \(T_R\), and there is a trade-off between recovery time and inference loss.
Detailed experimental settings are given in Appendix 0.0.11.
We also compare our resilience training method and adversarial training in Appendix 0.0.12.
In this section, we employ a simplified model to validate (A).
We employed a simplified model using an LSTM network with randomly initialized weight parameters. We applied step input to this model, introduced instantaneous noise, and observed the time required for inference recovery (for a visual representation of the recovery time measurement, see Figure 1).
We calculated \(T_R\) and \(\bar{T}_R(k)\) \((k=0,1,\cdots,20)\) for \(20\) randomly initialized models, deriving the average estimation errors and correlation coefficients. As shown in Figure 2, the estimation error consistently decreases as \(k\) increases, eventually converging at around \(k=20\) steps. Furthermore, the correlation coefficient exhibits a generally increasing trend (except for a temporary decrease) as \(k\) increases, which indicates a strong relationship between the true and estimated values.
Figure 3 compares the estimation accuracy for \(k=0\) and \(k=20\), demonstrating improved accuracy with a larger \(k\). The region to the left of the red line is the area where \(\delta\)ISS is guaranteed by Theorem 2. Increasing \(k\) from \(0\) to \(20\) decreases \(\rho(A(k))\) by 32% on average, so that all data points fall within the \(\delta\)ISS-guaranteed area.
To illustrate this improvement in a specific case, we focus on one of these models. As evident in Figure 4, the \(\bar{T}_R(k)\) value for \(k=20\) is smaller than that for \(k=0\). Furthermore, this value is closer to the true \(T_R\). This indicates that using a larger \(k\) value enables a more accurate assessment.
These results offer empirical support for our theoretical statements in Theorem 5 and Remark 6: increasing \(k\) not only relaxes the sufficient condition of \(\delta\)ISS but also enhances the accuracy of recovery time estimation.
Figure 4: Recovery time of the simplified model. The blue and orange lines depict the LSTM outputs for the normal input without noise and the anomalous input subjected to instantaneous noise, respectively. Vertical lines denote the recovery times, indicating that using a larger \(k\) value enables a more accurate assessment.
Here, we demonstrate the effectiveness of our approach from the perspective of (B) in a more practical scenario using the two-tank system, a recognized benchmark in control experiments [22], [23].
The two-tank system is described by the following differential equations [23]: \[\begin{align} \frac{dh_1}{dt}&=-p_1\sqrt{2gh_1}+p_2u,\\ \frac{dh_2}{dt}&=p_3\sqrt{2gh_1}-p_4\sqrt{2gh_2}, \end{align}\] where \(u\) is the control input, \(h_1\) and \(h_2\) are the tank levels, \(g\) is the gravitational acceleration, and \(p_1\), \(p_2\), \(p_3\), and \(p_4\) are parameters depending on the system properties. The LSTM model is designed to predict the water level at the next time step based on the current water level and control input. The current water level is obtained from sensors, which are assumed to be subject to sensor noise. Formally, we define the input-output relationship of the LSTM as \(x(t) =(u(t), h(t)+w(t))\) and \(y(t) = h(t+1)\), where \(w \sim \mathcal{N}(0, {0.01}^2)\).
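For reference, a forward-Euler rollout of the two-tank equations can be used to generate such input-output pairs. The parameter values, step size, and initial levels below are illustrative placeholders; only the noise standard deviation \(0.01\) is taken from the text:

```python
import numpy as np

def simulate_two_tank(u_seq, h0=(0.1, 0.1), dt=1.0, g=9.81,
                      p=(0.002, 0.005, 0.002, 0.002), noise_std=0.01, seed=0):
    """Forward-Euler rollout of the two-tank model; returns noisy level
    measurements (LSTM inputs) and next-step levels (LSTM targets)."""
    rng = np.random.default_rng(seed)
    p1, p2, p3, p4 = p
    h1, h2 = h0
    xs, ys = [], []
    for u in u_seq:
        meas = np.array([h1, h2]) + rng.normal(0.0, noise_std, size=2)
        xs.append(np.array([u, *meas]))                      # x(t) = (u(t), h(t)+w(t))
        dh1 = -p1 * np.sqrt(2 * g * max(h1, 0.0)) + p2 * u
        dh2 = p3 * np.sqrt(2 * g * max(h1, 0.0)) - p4 * np.sqrt(2 * g * max(h2, 0.0))
        h1, h2 = h1 + dt * dh1, h2 + dt * dh2
        ys.append(np.array([h1, h2]))                        # y(t) = h(t+1)
    return np.array(xs), np.array(ys)

# example: a random piecewise-constant control input
# x, y = simulate_two_tank(np.repeat(np.random.rand(20), 50))
```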
To evaluate our proposed recovery time estimation method and adjustment technique, we prepared a baseline model: a standard model trained without considering resilience. As shown in Figure 5, the baseline model closely approximates the target system for inference.
We will compare the models trained using our proposed loss function 11 against the baseline model.
First, regarding (A), although the result in Figure 6 does not contradict Theorem 5 in the sense of being weakly decreasing, the magnitude of this decay was notably smaller than in the results of Section 0.0.5.1.
Next, concerning (B), we observed a trade-off between recovery times and MAE loss when comparing the baseline model and our models, and when varying \(\varepsilon\) of our model, as illustrated in Figure 6. Specifically, as \(\varepsilon\) increases from \(0.1\) to \(0.5\), we observe a trend of increasing MAE loss and decreasing recovery time. This trend aligns with our theoretical predictions in Section 0.0.6.2.
The experimental results demonstrate that LSTM resilience can be tuned via \(\varepsilon\), validating our proposed method 12 that introduces recovery time as a quality assurance metric and balances inference accuracy with recovery time.
Theoretical and data-independent guarantees may not always reflect actual data distributions. Therefore, we strongly recommend complementing our approach with data-driven quality assurance methods for practical implementation. Also, be aware that applying our method to an unstable system can be counterproductive.
As tasks become increasingly dependent on long-term memory, the decline in the accuracy of recovery time estimation and the accuracy degradation caused by the constraints for resilience are likely to emerge as significant challenges. Relaxing the \(\delta\)ISS sufficient conditions could be a key approach to address these challenges. This is because, as demonstrated in Theorem 5, the \(\delta\)ISS conditions form the foundation for the recovery time estimation, and relaxing these conditions would contribute to weakening model constraints and mitigating inference accuracy degradation.
Although our current research focused on the resilience of LSTM modules in control systems, comprehensive quality assurance for control systems remains a challenging issue. Refining studies that have demonstrated the relationship between LSTM stability and control system stability [14] could be a key approach. By more precisely analyzing the calculations used to prove control system stability in these studies, it may be possible to quantitatively determine the convergence time of the state of the system to the target value.
We are grateful to Professor Tomoyuki Tanaka of Yokohama National University for helpful discussions and comments on the manuscript.
Yoshihara Sota was financially supported by JST SPRING, Grant Number JPMJSP2125. He would like to take this opportunity to thank the “THERS Make New Standards Program for the Next Generation Researchers”.
In this paper, we define recovery time as the duration from when different input sequences align until the output difference falls below a threshold. We prove that LSTMs with \(\delta\)ISS properties can effectively estimate this time.
Finite time stability (FTS) [24] rigorously discusses convergence time, characterized by a settling-time function. Finite-time input-to-state stability (FTISS), combining FTS and ISS, enables robust control against disturbances [25]. Recent studies [26], [27] propose Lyapunov functions satisfying FTISS, eliminating the need to directly consider the settling-time function.
However, deriving sufficient FTS conditions involves strict model constraints, established only for specific models like Hopfield networks [28] and Clifford-valued RNNs [29], but not for conventional RNNs or LSTMs. While achieving FTS for LSTMs would theoretically allow for precise recovery time evaluation, we argue that from a practical standpoint, considering numerical errors and rounding, it suffices for output differences to fall below a threshold rather than strictly reaching equilibrium. Thus, we opted for \(\delta\)ISS instead of FTS.
We show the following lemma before proving Proposition 1.
Lemma 1 (Monotone Decrease and Convergence). Let \(L\in\mathbb{N}\). The sequences \(\{\overline{\sigma}^{(l)}_*(k)\}_{k=0}^{\infty}\), \(\{\overline{\phi}^{(l)}_c(k)\}_{k=0}^{\infty}\), and \(\{\eta^{(l)}(k)\}_{k=-1}^{\infty}\) are monotonically decreasing and convergent for all \(l\in[L]\), where \(\ast\) denotes \(f,i\), or \(o\).
Proof. From the definition in Section 0.0.4.1, \(G^{(l)}_*(\eta^{(l)}(k-1)), G^{(l)}_c(\eta^{(l)}(k-1)) \ge 0\). Since \(\sigma\) and \(\phi\) are strictly monotonically increasing functions, we have \(1/2=\sigma(0)\le\overline{\sigma}^{(l)}_*(k)<1\) and \(0=\phi(0)\le\overline{\phi}^{(l)}_c(k)<1\). Therefore, \(\overline{\sigma}^{(l)}_i(k)\overline{\phi}^{(l)}_c(k)\ge 0\) and \(0<1-\overline{\sigma}^{(l)}_f(k)\le 1/2\) are obtained, and then \[\begin{align} \eta^{(l)}(k)=\phi\left(\frac{\overline{\sigma}^{(l)}_i(k)\overline{\phi}^{(l)}_c(k)}{1-\overline{\sigma}^{(l)}_f(k)}\right)\overline{\sigma}^{(l)}_o(k)\ge\phi\left(\frac{0}{2}\right)\frac{1}{2}=0. \end{align}\] Hence, \(\{\overline{\sigma}^{(l)}_*(k)\}_{k=0}^{\infty}\), \(\{\overline{\phi}^{(l)}_c(k)\}_{k=0}^{\infty}\), and \(\{\eta^{(l)}(k)\}_{k=-1}^{\infty}\) are sequences bounded from below.
In the following, we show these sequences are monotonically decreasing by induction, therefore converging.
Firstly, consider the case of \(k=0\). From the above inequalities, the fact that \(\phi(w)<1\) for any \(w\ge 0\), and the fact that \(0<\overline{\sigma}^{(l)}_o(0)<1\), we have \[\begin{align} 0&\le\phi\left(\frac{\overline{\sigma}^{(l)}_i(0)\overline{\phi}^{(l)}_c(0)}{1-\overline{\sigma}^{(l)}_f(0)}\right)< 1=\eta^{(l)}(-1),\\ \eta^{(l)}(0)&=\phi\left(\frac{\overline{\sigma}^{(l)}_i(0)\overline{\phi}^{(l)}_c(0)}{1-\overline{\sigma}^{(l)}_f(0)}\right)\overline{\sigma}_o^{(l)}(0)\le\overline{\sigma}_o^{(l)}(0)<\eta^{(l)}(-1). \end{align}\] Substituting \(\eta^{(l)}(0)<\eta^{(l)}(-1)\) into \(G_*(\eta)\) and \(G_c(\eta)\), we obtain \[\begin{align} G_*(\eta^{(l)}(0))&= \left\|{(x_\mathrm {max}\left\lvert{W^{(l)}_*}\right\rvert \boldsymbol{1}_{n_x}+ \eta^{(l)}(0) \left\lvert{U^{(l)}_*}\right\rvert \boldsymbol{1}_{n_c} + b^{(l)}_*)_+}\right\|_{\infty}\\ &\le\left\|{(x_\mathrm {max}\left\lvert{W^{(l)}_*}\right\rvert \boldsymbol{1}_{n_x}+ \eta^{(l)}(-1) \left\lvert{U^{(l)}_*}\right\rvert \boldsymbol{1}_{n_c} + b^{(l)}_*)_+}\right\|_{\infty}=G_*(\eta^{(l)}(-1)),\\ G_c(\eta^{(l)}(0))&=\left\|{x_\mathrm {max}\left\lvert{W^{(l)}_c}\right\rvert \boldsymbol{1}_{n_x}+ \eta^{(l)}(0) \left\lvert{U^{(l)}_c}\right\rvert \boldsymbol{1}_{n_c} + \left\lvert{b^{(l)}_c}\right\rvert}\right\|_{\infty}\\ &\le\left\|{x_\mathrm {max}\left\lvert{W^{(l)}_c}\right\rvert \boldsymbol{1}_{n_x}+ \eta^{(l)}(-1) \left\lvert{U^{(l)}_c}\right\rvert \boldsymbol{1}_{n_c} + \left\lvert{b^{(l)}_c}\right\rvert}\right\|_{\infty}=G_c(\eta^{(l)}(-1)). \end{align}\] Since \(\sigma\) and \(\phi\) are strictly monotonically increasing functions, the following hold: \[\begin{align} \overline{\sigma}^{(l)}_*(1)&=\sigma(G_*(\eta^{(l)}(0))) \le\sigma(G_*(\eta^{(l)}(-1)))=\overline{\sigma}^{(l)}_*(0)\tag{13},\\ \overline{\phi}^{(l)}_c(1)&=\phi(G_c(\eta^{(l)}(0)))\le\phi(G_c(\eta^{(l)}(-1)))=\overline{\phi}^{(l)}_c(0)\tag{14}. \end{align}\] Therefore, \(\overline{\sigma}^{(l)}_*(1)\le\overline{\sigma}^{(l)}_*(0)\) and \(\overline{\phi}^{(l)}_c(1)\le\overline{\phi}^{(l)}_c(0)\). Applying these inequalities to \(\overline{c}^{(l)}(1)\) and \(\eta^{(l)}(1)\), we have \[\begin{align} \overline{c}^{(l)}(1)&=\frac{\overline{\sigma}^{(l)}_i(1)\overline{\phi}^{(l)}_c(1)}{1-\overline{\sigma}^{(l)}_f(1)}\le\frac{\overline{\sigma}^{(l)}_i(0)\overline{\phi}^{(l)}_c(0)}{1-\overline{\sigma}^{(l)}_f(1)}\le\frac{\overline{\sigma}^{(l)}_i(0)\overline{\phi}^{(l)}_c(0)}{1-\overline{\sigma}^{(l)}_f(0)}=\overline{c}^{(l)}(0), \\ \eta^{(l)}(1)&=\phi(\overline{c}^{(l)}(1))\overline{\sigma}^{(l)}_o(1)\le\phi(\overline{c}^{(l)}(0))\overline{\sigma}^{(l)}_o(0)=\eta^{(l)}(0). \end{align}\] Thus, we obtain that \(\overline{\sigma}^{(l)}_*(1)\le\overline{\sigma}^{(l)}_*(0)\), \(\overline{\phi}^{(l)}_c(1)\le \overline{\phi}^{(l)}_c(0)\) and \(\eta^{(l)}(1)\le \eta^{(l)}(0) \le \eta^{(l)}(-1)=1\).
Secondly, consider the case of \(k=k_0\in \mathbb{Z}_{\geq 0}\) and assume that \(\overline{\sigma}^{(l)}_*(k_0+1)\le\overline{\sigma}^{(l)}_*(k_0)\), \(\overline{\phi}^{(l)}_c(k_0+1)\le \overline{\phi}^{(l)}_c(k_0)\), and \(\eta^{(l)}(k_0+1)\le \eta^{(l)}(k_0)\). From the assumption, \(G_*(\eta^{(l)}(k_0+1))\le G_*(\eta^{(l)}(k_0))\) and \(G_c(\eta^{(l)}(k_0+1))\le G_c(\eta^{(l)}(k_0))\) hold in the same way as in the case of \(k=0\). Therefore, \(\overline{\sigma}^{(l)}_*(k_0+2)\le\overline{\sigma}^{(l)}_*(k_0+1)\) and \(\overline{\phi}^{(l)}_c(k_0+2) \le\overline{\phi}^{(l)}_c(k_0+1)\) are obtained. By using these inequalities, we have \[\begin{align} \overline{c}^{(l)}(k_0+2)=\frac{\overline{\sigma}^{(l)}_i(k_0+2)\overline{\phi}^{(l)}_c(k_0+2)}{1-\overline{\sigma}^{(l)}_f(k_0+2)}\le\frac{\overline{\sigma}^{(l)}_i(k_0+1)\overline{\phi}^{(l)}_c(k_0+1)}{1-\overline{\sigma}^{(l)}_f(k_0+2)}\le\frac{\overline{\sigma}^{(l)}_i(k_0+1)\overline{\phi}^{(l)}_c(k_0+1)}{1-\overline{\sigma}^{(l)}_f(k_0+1)}=\overline{c}^{(l)}(k_0+1). \end{align}\] In addition, \(\overline{\sigma}^{(l)}_o(k_0+2)\le\overline{\sigma}^{(l)}_o(k_0+1)\) and \(\phi\) is strictly monotonically increasing. Then, we obtain that \(\eta^{(l)} (k_0+2)\le \eta^{(l)}(k_0+1)\). ◻
Here we restate and prove Proposition 1:
Proposition 7 (Invariant Set). Let \(L\in\mathbb{N}\) and \(k \in \mathbb{Z}_{\geq 0}\). \(\mathcal{S}^{(l)}(k)=\mathcal{C}^{(l)}(k) \times \mathcal{H}^{(l)}(k)\) is an invariant set, that is, for all \(t \in \mathbb{Z}_{\ge 0}\), \(l\in[L]\), and \(x \in [-x_\mathrm {max},x_\mathrm {max}]^{n_x}\), if \(s^{(l)}(0)=(c^{(l)}(0), h^{(l)}(0))\in\mathcal{S}^{(l)}(k)\), then \(s^{(l)}(t)=(c^{(l)}(t), h^{(l)}(t)) \in \mathcal{S}^{(l)}(k)\). Furthermore, the sequence \(\{\mathcal{S}^{(l)}(k)\}_{k=0}^\infty\) is a decreasing sequence with respect to \(k\) in the sense of set inclusion.
Proof. From Lemma 1 and the definition of \(\mathcal{C}^{(l)}(k)\) and \(\mathcal{H}^{(l)}(k)\) in Section 0.0.4.1, it is obvious that \(\{\mathcal{S}^{(l)}(k)\}_k\) is a decreasing sequence.
Below, we show \(\{\mathcal{S}^{(l)}(k)\}_k\) is an invariant set. Assume \(s^{(l)}(0)=(c^{(l)}(0), h^{(l)}(0))\in\mathcal{S}^{(l)}(k)\); then it suffices to check that \[\begin{align} s^{(l)}(t)=(c^{(l)}(t), h^{(l)}(t))\in\mathcal{S}^{(l)}(k) \;\Longrightarrow \;s^{(l)}(t+1)=(c^{(l)}(t+1), h^{(l)}(t+1)) \in \mathcal{S}^{(l)}(k) \end{align}\] for all \(t \in \mathbb{Z}_{\geq 0}\) and \(x \in [-x_\mathrm {max},x_\mathrm {max}]^{n_x}\).
First, we assume \(s^{(l)}(t)=(c^{(l)}(t), h^{(l)}(t))\in\mathcal{S}^{(l)}(k)\). Since \(h^{(l)}(t)\in \mathcal{H}^{(l)}(k)\), \[\begin{align} \|h^{(l)}(t)\|_{\infty}\le \phi(\overline{c}^{(l)}(k))\overline{\sigma}^{(l)}_o(k)=\eta^{(l)}(k). \end{align}\] Therefore, for all \(j\in\{1,\dots,n_c\}\), it holds that \[\begin{align} \sigma(W^{(l)}_* x^{(l)}(t) + U^{(l)}_* h^{(l)}(t) + b^{(l)}_*)_{(j)}&\le \sigma\left( \left\|{(x_\mathrm {max}\left\lvert{W^{(l)}_*}\right\rvert \boldsymbol{1}_{n_x}+ \left\lvert{U^{(l)}_*}\right\rvert \left\lvert{h^{(l)}(t)}\right\rvert + b^{(l)}_*)_+}\right\|_{\infty}\right)\\ &\le \sigma\left( \left\|{(x_\mathrm {max}\left\lvert{W^{(l)}_*}\right\rvert \boldsymbol{1}_{n_x}+ \eta^{(l)}(k)\left\lvert{U^{(l)}_*}\right\rvert \boldsymbol{1}_{n_c} + b^{(l)}_*)_+}\right\|_{\infty}\right)\\ &\le\sigma\left( \left\|{(x_\mathrm {max}\left\lvert{W^{(l)}_*}\right\rvert \boldsymbol{1}_{n_x}+ \eta^{(l)}(k-1)\left\lvert{U^{(l)}_*}\right\rvert \boldsymbol{1}_{n_c} + b^{(l)}_*)_+}\right\|_{\infty}\right)\\ &=\sigma(G^{(l)}_*(\eta^{(l)}(k-1)))\\ &=\overline{\sigma}^{(l)}_*(k), \end{align}\] where \(*\) denotes \(f,i\) or \(o\). We have \(\left\lvert{\phi(W^{(l)}_c x^{(l)}(t) + U^{(l)}_c h^{(l)}(t) + b^{(l)}_c)_{(j)}}\right\rvert\le\overline{\phi}^{(l)}_c(k)\) in the same way. Thus, each component of \(c^{(l)}(t+1)\) is evaluated as follows: \[\begin{align} \left\lvert{c^{(l)}(t+1)_{(j)}}\right\rvert &\le\sigma(W^{(l)}_f x^{(l)}(t) + U^{(l)}_f h^{(l)}(t) + b^{(l)}_f)_{(j)} \left\lvert{c^{(l)}(t)_{(j)}}\right\rvert \\ &\quad+ \sigma(W^{(l)}_i x^{(l)}(t) + U^{(l)}_i h^{(l)}(t) + b^{(l)}_i)_{(j)} \left\lvert{\phi(W^{(l)}_c x^{(l)}(t) + U^{(l)}_c h^{(l)}(t) + b^{(l)}_c)_{(j)}}\right\rvert \\ &\le\overline{\sigma}^{(l)}_f(k) \|c^{(l)}(t)\|_{\infty}+\overline{\sigma}^{(l)}_i(k)\overline{\phi}^{(l)}_c(k)\\ &\le\overline{\sigma}^{(l)}_f(k) \frac{\overline{\sigma}^{(l)}_i(k) \overline{\phi}^{(l)}_c(k)}{1-\overline{\sigma}^{(l)}_f(k)}+\overline{\sigma}^{(l)}_i(k) \overline{\phi}^{(l)}_c(k)\\ &=\overline{c}^{(l)}(k). \end{align}\] This inequality holds for all \(j \in \{1,\dots,n_c\}\), therefore \(c^{(l)}(t+1)\in\mathcal{C}^{(l)}(k)\).
Next, we assume \(h^{(l)}(t)\in \mathcal{H}^{(l)}(k)\). Since the hyperbolic tangent \(\phi\) is odd and strictly monotonically increasing, \[\begin{align} \left\lvert{h^{(l)}(t+1)_{(j)}}\right\rvert&=\sigma(W^{(l)}_o x^{(l)}(t) + U^{(l)}_o h^{(l)}(t) + b^{(l)}_o)_{(j)}\left\lvert{\phi(c^{(l)}(t+1))_{(j)}}\right\rvert\\ &\le\overline{\sigma}^{(l)}_o(k)\phi(\overline{c}^{(l)}(k)). \end{align}\] This inequality also holds for all \(j \in \{1,\dots,n_c\}\), therefore \(h^{(l)}(t+1)\in\mathcal{H}^{(l)}(k)\).
Hence, we conclude that \(s^{(l)}(t+1)\in\mathcal{S}^{(l)}(k)\) when \(s^{(l)}(t)\in\mathcal{S}^{(l)}(k)\). ◻
Denote the limits of sequences \(\{\overline{\sigma}^{(l)}_*(k)\}_{k=0}^{\infty}\) (\(*\) denotes \(f,i\), or \(o\)), \(\{\overline{\phi}^{(l)}_c(k)\}_{k=0}^{\infty}\), and \(\{\eta^{(l)}(k)\}_{k=-1}^{\infty}\) as \(\overline{\sigma}^{(l)}_*(\infty)\), \(\overline{\phi}^{(l)}_c(\infty)\) and \(\eta^{(l)}(\infty)\), respectively. Since \(\overline{\sigma}^{(l)}_*(\infty)=\sigma(G^{(l)}_*(\eta^{(l)}(\infty)))\) and \(\overline{\phi}^{(l)}_c(\infty)=\phi(G^{(l)}_c(\eta^{(l)}(\infty)))\), we have \[\begin{align} \eta^{(l)}(\infty)&=\phi\left(\frac{\sigma(G^{(l)}_i(\eta^{(l)}(\infty)))\phi(G^{(l)}_c(\eta^{(l)}(\infty)))}{1-\sigma(G^{(l)}_f(\eta^{(l)}(\infty)))}\right)\sigma(G^{(l)}_o(\eta^{(l)}(\infty))). \label{eq:eta-infty} \end{align}\tag{15}\]
If the equation 15 could be solved, that is, \(\overline{\sigma}^{(l)}_*(\infty)\) and \(\overline{\phi}^{(l)}_c(\infty)\) could be obtained, we would find the loosest sufficient condition for the LSTM to be \(\delta\)ISS, and the optimal estimate of recovery time. However, it is difficult to solve it analytically, so we approximate it using properties of convex functions.
Let \(w \geq 0\). Since \(\sigma(w)\) is concave for \(w \ge 0\) and \(\frac{d\sigma(w)}{dw} = \sigma(w) (1 - \sigma(w))\), for \(w\) and \(w_0 \ge 0\) we have \[\sigma(w) \leq \sigma(w_0) + \sigma(w_0) \left( 1 - \sigma(w_0) \right) (w - w_0) \eqqcolon \widecheck{\sigma}(w; w_0).\] Similarly, since \(\phi\) is concave for \(w \ge 0\) and \(\frac{d\phi(w)}{dw} = 1 - \phi^2(w)\), it holds that \[\phi(w) \leq \phi(w_0) + (1 - \phi^2(w_0)) (w - w_0) \eqqcolon \widecheck{\phi}(w; w_0).\] Therefore, for \(w\) and \(w_0\ge 0\), we have \[\begin{align} \sigma(G^{(l)}_*(w)) &\leq \widecheck{\sigma}(G^{(l)}_*(w); G^{(l)}_*(w_0)),\\ \phi(G^{(l)}_c(w)) &\leq \widecheck{\phi}(G^{(l)}_c(w); G^{(l)}_c(w_0)). \end{align}\] From these inequalities, the following inequality holds: \[\begin{align} &\sigma(G^{(l)}_o(w)) \phi\left( \frac{ \sigma(G^{(l)}_i(w)) \phi(G^{(l)}_c(w)) }{1 - \sigma(G^{(l)}_f(w))} \right) \eqqcolon \overline{g}^{(l)}(w) \\ &\leq \widecheck{\sigma}(G^{(l)}_o(w); G^{(l)}_o(w_0)) \, \widecheck{\phi}\left( \frac{ \widecheck{\sigma}(G^{(l)}_i(w); G^{(l)}_i(w_0)) \widecheck{\phi}(G^{(l)}_c(w); G^{(l)}_c(w_0)) }{1 - \sigma(G^{(l)}_f(w))} ; \frac{ \sigma(G^{(l)}_i(w_0)) \phi(G^{(l)}_c(w_0)) }{1 - \sigma(G^{(l)}_f(w_0))} \right) \\ &\eqqcolon {\widecheck{g}}^{(l)}(w; w_0). \end{align}\] Here, \(G_*(w)\), \(\widecheck{\sigma}(w, w_0)\), \(\widecheck{\phi}(w, w_0)\), and \((1 - \sigma(w))^{-1}\) are non-negative, monotonically increasing, and convex functions. Since \(\widecheck{g}^{(l)}(w; w_0)\) is represented by the composition and product of these functions, it is also convex. Hence, for \(0 \leq \kappa^{(l)}\leq \eta^{(l)}\), \(0 \leq \lambda^{(l)}\leq 1\), and \(\eta^{{(l)}+} \mathrel{\vcenter{:}}= (1 - \lambda^{(l)}) \kappa^{(l)}+ \lambda^{(l)} \eta^{(l)}\), it follows that \[\overline{g}^{(l)}(\eta^{{(l)}+}) \leq \widecheck{g}^{(l)}(\eta^{{(l)}+}; \eta^{(l)}) \leq (1 - \lambda^{(l)}) \widecheck{g}^{(l)}(\kappa^{(l)}; \eta^{(l)}) + \lambda^{(l)}\overline{g}^{(l)}(\eta^{(l)}).\] In particular, if \(\eta^{(l)}(\infty)\) satisfies \(\overline{g}^{(l)}(\eta^{(l)}(\infty)) = \eta^{(l)}(\infty)\) and \(\kappa^{(l)}\leq \eta^{(l)}(\infty) \leq \eta^{(l)}\), there exists \(\lambda^{(l)}(\infty)\) such that \(\eta^{(l)}(\infty) = (1 - \lambda^{(l)}(\infty)) \kappa^{(l)}+ \lambda^{(l)}(\infty) \eta^{(l)}\). In this case, we have the following: \[\begin{gather} (1 - \lambda^{(l)}(\infty)) \kappa^{(l)}+ \lambda^{(l)}(\infty) \eta^{(l)} \leq (1 - \lambda^{(l)}(\infty)) \widecheck{g}^{(l)}(\kappa^{(l)}; \eta^{(l)}) + \lambda^{(l)}(\infty) \overline{g}^{(l)}(\eta^{(l)}), \\ \lambda^{(l)}(\infty) \leq \frac{ \widecheck{g}^{(l)}(\kappa^{(l)}; \eta^{(l)}) - \kappa^{(l)} }{ \eta^{(l)}- \kappa^{(l)}- \overline{g}^{(l)}(\eta^{(l)}) + \widecheck{g}^{(l)}(\kappa^{(l)}; \eta^{(l)}) }, \\ \eta^{(l)}(\infty) \leq \frac{ \eta^{(l)}\, \widecheck{g}^{(l)}(\kappa^{(l)}; \eta^{(l)}) - \kappa^{(l)}\, \overline{g}^{(l)}(\eta^{(l)}) }{ \eta^{(l)}- \kappa^{(l)}- \overline{g}^{(l)}(\eta^{(l)}) + \widecheck{g}^{(l)}(\kappa^{(l)}; \eta^{(l)}) } = \eta^{(l)}- \frac{ (\eta^{(l)}- \kappa^{(l)}) (\eta^{(l)}- \overline{g}^{(l)}(\eta^{(l)})) }{ \eta^{(l)}- \kappa^{(l)}- \overline{g}^{(l)}(\eta^{(l)}) + \widecheck{g}^{(l)}(\kappa^{(l)}; \eta^{(l)}) }. \end{gather}\]
Therefore, we can approximate \(\eta^{(l)}(\infty)\) by the following three steps (a numerical sketch follows the list):
Take \(\kappa^{(l)}\) and \(\eta^{(l)}\) such that \(\kappa^{(l)}\leq \eta^{(l)}(\infty) \leq \eta^{(l)}\), for example, \(\kappa^{(l)}=\overline{g}^{(l)}(0)\) and \(\eta^{(l)}=\overline{g}^{(l)}(1)\).
Calculate \(\dfrac{ \eta^{(l)}\, \widecheck{g}(\kappa^{(l)}; \eta^{(l)}) - \kappa^{(l)}\, \overline{g}^{(l)}(\eta^{(l)})}{ \eta^{(l)}- \kappa^{(l)}- \overline{g}^{(l)}(\eta^{(l)}) + \widecheck{g}^{(l)}(\kappa^{(l)}; \eta^{(l)})}\).
Substitute the result of Step 2 for \(\eta^{(l)}\). Return to Step 2.
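A numerical sketch of these three steps for a single layer is given below, reusing the \(G^{(l)}_*\) bounds of Section 0.0.4.1; the stopping tolerance and the dictionary interface for the weights are implementation choices, not part of the derivation:

```python
import numpy as np

def eta_infinity(W, U, b, x_max, tol=1e-10, max_iter=100):
    """Approximate eta(infinity) of equation 15 for one layer by Steps 1-3.

    W, U, b are dicts over the gates 'f', 'i', 'c', 'o' (illustrative
    interface); x_max bounds the layer input."""
    sig = lambda w: 1.0 / (1.0 + np.exp(-w))

    def G(gate, w):
        v = x_max * np.abs(W[gate]).sum(axis=1) + w * np.abs(U[gate]).sum(axis=1)
        if gate == 'c':
            return np.max(v + np.abs(b['c']))
        return np.max(np.maximum(v + b[gate], 0.0))

    # tangent-line upper bounds of sigma and tanh around w0 (both concave on w >= 0)
    sig_chk = lambda w, w0: sig(w0) + sig(w0) * (1.0 - sig(w0)) * (w - w0)
    phi_chk = lambda w, w0: np.tanh(w0) + (1.0 - np.tanh(w0) ** 2) * (w - w0)

    def g_bar(w):
        return sig(G('o', w)) * np.tanh(
            sig(G('i', w)) * np.tanh(G('c', w)) / (1.0 - sig(G('f', w))))

    def g_chk(w, w0):
        inner = (sig_chk(G('i', w), G('i', w0)) * phi_chk(G('c', w), G('c', w0))
                 / (1.0 - sig(G('f', w))))
        inner0 = sig(G('i', w0)) * np.tanh(G('c', w0)) / (1.0 - sig(G('f', w0)))
        return sig_chk(G('o', w), G('o', w0)) * phi_chk(inner, inner0)

    kappa, eta = g_bar(0.0), g_bar(1.0)                       # Step 1
    for _ in range(max_iter):
        denom = eta - kappa - g_bar(eta) + g_chk(kappa, eta)
        if abs(denom) < 1e-15:                                # kappa and eta coincide
            return eta
        new_eta = (eta * g_chk(kappa, eta) - kappa * g_bar(eta)) / denom   # Step 2
        if abs(new_eta - eta) < tol:
            return new_eta
        eta = new_eta                                          # Step 3
    return eta
```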
In this section, we prove the input-to-state stability and the incremental input-to-state stability of LSTM.
First, we introduce the definition of input-to-state stability.
Definition 4 (ISS, [16], [30]). System 1 is called input-to-state stable (ISS) if there exist continuous functions \(\beta \in \mathcal{KL}\) and \(\gamma \in \mathcal{K}_{\infty}\) such that for any initial state \(s(0) \in \mathbb{R}^{n_s}\) and for any input sequence \(x(0:t) \in [-x_\mathrm {max},x_\mathrm {max}]^{n_x \times (t+1)}\), \(s(t)\) satisfies \[\|s(t)\| \leq \beta(\|s(0)\|,t) +\gamma(\|x(0:t)\|_{2,\infty})\] for all \(t \in \mathbb{Z}_{\geq 0}\).
By using the invariant set in Proposition 1, the sufficient condition for ISS of the LSTM is relaxed (in the case of \(k \geq 1\)) compared with [14], as follows.
Theorem 8 (ISS of LSTM). Let \(L \in \mathbb{N}\) and \(k \in \mathbb{Z}_{\geq 0}\). We assume that \(s^{(l)}(0)\in\mathcal{S}^{(l)}(k)\) for all \(l \in [L]\). The LSTM is ISS if \(\rho(A^{{(l)}}_{s}(k)) <1\) for all \(l \in [L]\), where \[\begin{align} A_s^{(l)}(k) =\begin{bmatrix} \overline{\sigma}^{(l)}_f(k) & \overline{\sigma}^{(l)}_i(k) \left\|{U^{(l)}_c}\right\|\\ \overline{\sigma}^{(l)}_o(k)\overline{\sigma}^{(l)}_f(k) & \overline{\sigma}^{(l)}_o(k)\overline{\sigma}^{(l)}_i(k) \left\|{U^{(l)}_c}\right\| \end{bmatrix}. \end{align}\] Furthermore, \(\rho(A^{{(l)}}_{s}(k))\) is monotonically decreasing with respect to \(k \in \mathbb{Z}_{\geq 0}\).
Proof. From 3 and the fact Lipschitz constant of \(\phi\) is 1, we have \[\begin{align} \left\|{c^{(l)}(t+1)}\right\| &\leq \left\|{\sigma\left(W^{{(l)}}_{f} x^{{(l)}}(t) + U^{{(l)}}_{f} h^{{(l)}}(t) +b^{{(l)}}_{f}\right)}\right\|_{\infty}\left\|{c^{(l)}(t)}\right\|\\ &\quad +\left\|{\sigma\left(W^{{(l)}}_{i} x^{{(l)}}(t) + U^{{(l)}}_{i} h^{{(l)}}(t) +b^{{(l)}}_{i}\right)}\right\|_{\infty}\left\|{\phi\left(W^{{(l)}}_{c} x^{{(l)}}(t) + U^{{(l)}}_{c} h^{{(l)}}(t) +b^{{(l)}}_{c}\right)}\right\| \\ &\leq \sigma\left(\left\|{W^{{(l)}}_{f} x^{{(l)}}(t) + U^{{(l)}}_{f} h^{{(l)}}(t) +b^{{(l)}}_{f}}\right\|_{\infty}\right)\left\|{c^{(l)}(t)}\right\|\\ &\quad+\sigma\left(\left\|{W^{{(l)}}_{i} x^{{(l)}}(t) + U^{{(l)}}_{i} h^{{(l)}}(t) +b^{{(l)}}_{i}}\right\|_{\infty}\right)\left\|{W^{{(l)}}_{c} x^{{(l)}}(t) + U^{{(l)}}_{c} h^{{(l)}}(t) +b^{{(l)}}_{c}}\right\| \\ &\leq \sigma\left(G_f^{(l)}(\eta^{(l)}(k-1))\right)\left\|{c^{(l)}(t)}\right\|+\sigma\left(G_i^{(l)}(\eta^{(l)}(k-1))\right)\left(\left\|{W^{(l)}_c}\right\|\left\|{x^{(l)}(t)}\right\|+\left\|{U^{(l)}_c}\right\|\left\|{h^{(l)}(t)}\right\|+\left\|{b^{(l)}_c}\right\|\right)\\ &= \overline{\sigma}^{(l)}_f(k) \left\|{c^{(l)}(t)}\right\|+\overline{\sigma}^{(l)}_i(k)\left(\left\|{W^{(l)}_c}\right\|\left\|{x^{(l)}(t)}\right\|+\left\|{U^{(l)}_c}\right\|\left\|{h^{(l)}(t)}\right\|+\left\|{b^{(l)}_c}\right\|\right) \end{align}\] for any \(k \in \mathbb{Z}_{\geq 0}\) and \(l \in [L]\). Similarly, 4 imply that \[\begin{align} \left\|{h^{(l)}(t+1)}\right\| &\leq \left\|{\sigma\left(W^{{(l)}}_{o} x^{{(l)}}(t) + U^{{(l)}}_{o} h^{{(l)}}(t) +b^{{(l)}}_{o}\right)}\right\|_{\infty}\left\|{\phi\left(c^{(l)}(t+1)\right)}\right\|\\ &\leq \sigma\left(\left\|{W^{{(l)}}_{o} x^{{(l)}}(t) + U^{{(l)}}_{o} h^{{(l)}}(t) +b^{{(l)}}_{o}}\right\|_{\infty}\right)\left\|{c^{(l)}(t+1)}\right\|\\ &\leq \overline{\sigma}^{(l)}_o(k)\left(\overline{\sigma}^{(l)}_f(k) \left\|{c^{(l)}(t)}\right\|+\overline{\sigma}^{(l)}_i(k)\left(\left\|{W^{(l)}_c}\right\|\left\|{x^{(l)}(t)}\right\|+\left\|{U^{(l)}_c}\right\|\left\|{h^{(l)}(t)}\right\|+\left\|{b^{(l)}_c}\right\|\right)\right). \end{align}\] Then, \[\begin{align} &\begin{pmatrix} \left\|{c^{(l)}(t+1)}\right\| \\ \left\|{h^{(l)}(t+1)}\right\| \end{pmatrix}\\ &\leq \begin{bmatrix} \overline{\sigma}^{(l)}_f(k) & \overline{\sigma}^{(l)}_i(k) \left\|{U^{(l)}_c}\right\|\\ \overline{\sigma}^{(l)}_o(k)\overline{\sigma}^{(l)}_f(k) &\overline{\sigma}^{(l)}_o(k)\overline{\sigma}^{(l)}_i(k) \left\|{U^{(l)}_c}\right\| \end{bmatrix} \begin{pmatrix} \left\|{c^{(l)}(t)}\right\| \\ \left\|{h^{(l)}(t)}\right\| \end{pmatrix}+ \begin{pmatrix} \overline{\sigma}^{(l)}_i(k) \left\|{W^{(l)}_c}\right\|\\ \overline{\sigma}^{(l)}_o(k) \overline{\sigma}^{(l)}_i(k) \left\|{W^{(l)}_c}\right\| \end{pmatrix} \left\|{x^{(l)}}\right\| + \begin{pmatrix} \overline{\sigma}^{(l)}_i(k)\\ \overline{\sigma}^{(l)}_o(k) \overline{\sigma}^{(l)}_i(k) \end{pmatrix} \left\|{b^{(l)}_c}\right\| \\ & \eqcolon A^{(l)}_s(k) \begin{pmatrix} \left\|{c^{(l)}(t)}\right\| \\ \left\|{h^{(l)}(t)}\right\| \end{pmatrix} + a^{(l)}_x(k) \left\|{x^{(l)}}\right\| + a^{(l)}_b(k) \left\|{b^{(l)}_c}\right\|. \end{align}\]
For each layer \(l \in [L]\) of LSTM, we estimate for any \(k \in \mathbb{Z}_{\geq 0}\) that \[\begin{align} \begin{pmatrix} \left\|{c^{(l)}(t)}\right\| \\ \left\|{h^{(l)}(t)}\right\| \end{pmatrix} &\leq A_s^{(l)}(k) \begin{pmatrix} \left\|{c^{(l)}(t-1)}\right\| \\ \left\|{h^{(l)}(t-1)}\right\| \end{pmatrix} +a_x^{(l)}(k) \left\|{x^{(l)}(t-1)}\right\|+a_b^{(l)}(k) \left\|{b_c^{(l)}}\right\| \\ & \leq A_s^{(l)}(k)^2 \begin{pmatrix} \left\|{c^{(l)}(t-2)}\right\| \\ \left\|{h^{(l)}(t-2)}\right\| \end{pmatrix} +A_s^{(l)}(k)a_x^{(l)}(k) \left\|{x^{(l)}(t-2)}\right\|+A_s^{(l)}(k)a_b^{(l)}(k) \left\|{b_c^{(l)}}\right\| \\ &\quad +a_x^{(l)}(k) \left\|{x^{(l)}(t-1)}\right\|+a_b^{(l)}(k) \left\|{b_c^{(l)}}\right\|\\ & \vdots \\ &\leq A_s^{(l)}(k)^t \begin{pmatrix} \left\|{c^{(l)}(0)}\right\| \\ \left\|{h^{(l)}(0)}\right\| \end{pmatrix} +\sum_{\tau=0}^{t-1}A_s^{(l)}(k)^{t-\tau-1} a_x^{(l)}(k)\left\|{x^{(l)}(\tau)}\right\| +\left\|{b_c^{(l)}}\right\|\sum_{\tau=0}^{t-1} A_s^{(l)}(k)^{\tau} a_b^{(l)}(k), \label{el-est} \end{align}\tag{16}\] where \(a_b^{(l)}\mathrel{\vcenter{:}}= 0\) for \(l=1,\ldots,L-1\).
For single-layer LSTM (the case of \(L=1\)), we obtain that, for some constant \(\mu>1\), \[\begin{align} \left\|{s(t)}\right\| &=\left\|{ \begin{pmatrix} c(t) \\ h(t) \end{pmatrix} }\right\| =\sqrt{\left\|{c(t)}\right\|^2+\left\|{h(t)}\right\|^2} =\left\|{ \begin{pmatrix} \left\|{c(t)}\right\| \\ \left\|{h(t)}\right\| \end{pmatrix} }\right\| \\ &\leq \left\|{A_s(k)^t}\right\|\left\|{s(0)}\right\| +\left\|{\sum_{j=0}^{t-1}A_s(k)^{t-j-1}a_x(k)\left\|{x(j)}\right\|}\right\| +\left\|{b_c}\right\|\left\|{\sum_{j=0}^{t-1}A_s(k)^j a_b(k)}\right\| \tag{17}\\ &\leq \left\|{A_s(k)^t}\right\|\left\|{s(0)}\right\| +\left\|{\sum_{j=0}^{t-1}A_s(k)^j a_x(k)}\right\|\max_{j \in [t]}\left\|{x(j-1)}\right\| +\left\|{\sum_{j=0}^{t-1}A_s(k)^j a_b(k)}\right\|\left\|{b_c}\right\| \\ &\leq \mu \lambda_1(A_s(k))^t \left\|{s(0)}\right\| +\left\|{(I-A_s(k)^t)(I-A_s(k))^{-1}a_x(k)}\right\|\left\|{x}\right\|_{2,\infty} +\left\|{(I-A_s(k))^{-1}a_b(k)}\right\|\left\|{b_c}\right\|. \tag{18} \end{align}\] Here, we note that \(\lambda_1(A_s(k)) \in (0,1)\) since \(\rho(A_s(k))<1\). Therefore, the first term of 18 belongs to class \(\mathcal{KL}(\left\|{s(0)}\right\|,t)\), the second term belongs to class \(\mathcal{K}_\infty (\left\|{x}\right\|_{2,\infty})\), and the third term belongs to class \(\mathcal{K}_\infty (\left\|{b_c}\right\|)\). This shows that the single-layer LSTM is ISS if \(\rho(A_s(k))<1\).
For multi-layer LSTM (the case of \(L \geq 2\)), it follows from 17 that \[\begin{align} \left\|{s^{(L)}(t)}\right\| &\leq \left\|{A_s^{(L)}(k)^t}\right\|\left\|{s^{(L)}(0)}\right\| +\left\|{\sum_{\tau=0}^{t-1}A_s^{(L)}(k)^{t-\tau-1}a_x^{(L)}(k)\left\|{x^{(L)}(\tau)}\right\|}\right\| +\left\|{b_c^{(L)}}\right\|\left\|{\sum_{\tau=0}^{t-1}A_s^{(L)}(k)^\tau a_b^{(L)}(k)}\right\|. \label{ml-est} \end{align}\tag{19}\] Also, the estimate 16 imply that \[\begin{align} \left\|{h^{(l)}(t)}\right\| \leq \llcorner A_s^{(l)}(k)^t \lrcorner \begin{pmatrix} \left\|{c^{(l)}(0)}\right\| \\ \left\|{h^{(l)}(0)}\right\| \end{pmatrix} +\sum_{\tau=0}^{t-1} \llcorner A_s^{(l)}(k)^{t-\tau-1} \lrcorner a_x^{(l)}(k)\left\|{x^{(l)}(\tau)}\right\| \label{ml-est95h} \end{align}\tag{20}\] for \(l=1,\ldots,L-1\), where \(\llcorner A \lrcorner\) denotes the lower horizontal vector \((A_{(2,1)}, A_{(2,2)})\) for a matrix \(A=\begin{bmatrix} A_{(1,1)} & A_{(1,2)} \\ A_{(2,1)} & A_{(2,2)} \end{bmatrix}\). Applying the relation 6 and inequality 20 to the second term of 19 repeatedly, we estimate that \[\begin{align} &\left\|{\sum_{\tau=0}^{t-1}A_s^{(L)}(k)^{t-\tau-1}a_x^{(L)}(k)\left\|{x^{(L)}(\tau)}\right\|}\right\| = \left\|{\sum_{\tau_L=1}^t A_s^{(L)}(k)^{t-\tau_L} a_x^{(L)}(k) \left\|{h^{(L-1)}(\tau_L)}\right\|}\right\| \\ &\leq \left\|{\sum_{\tau_L=1}^t A_s^{(L)}(k)^{t-\tau_L} a_x^{(L)}(k) \llcorner A_s^{(L-1)}(k)^{\tau_L}\lrcorner}\right\| \left\|{s^{(L-1)}(0)}\right\|\\ &\quad +\left\|{\sum_{\tau_L=1}^t A_s^{(L)}(k)^{t-\tau_L} a_x^{(L)}(k) \sum_{\tau_{L-1}=1}^{\tau_L} \llcorner A_s^{(L-1)}(k)^{\tau_L-\tau_{L-1}}\lrcorner a_x^{(L-1)}(k) \left\|{h^{(L-2)}(\tau_{L-1})}\right\|}\right\|\\ &\leq \left\|{\sum_{\tau_L=1}^t A_s^{(L)}(k)^{t-\tau_L} a_x^{(L)}(k) \llcorner A_s^{(L-1)}(k)^{\tau_L}\lrcorner}\right\| \left\|{s^{(L-1)}(0)}\right\|\\ &\quad +\left\|{\sum_{\tau_L=1}^t A_s^{(L)}(k)^{t-\tau_L} a_x^{(L)}(k) \sum_{\tau_{L-1}=1}^{\tau_L} \llcorner A_s^{(L-1)}(k)^{\tau_L-\tau_{L-1}}\lrcorner a_x^{(L-1)}(k) \llcorner A_s^{(L-2}(k)^{\tau_{L-1}}\lrcorner}\right\|\left\|{s^{(L-2)}(0)}\right\|\\ &\quad\vdots\\ &\quad +\left\|{\sum_{\tau_L=1}^t A_s^{(L)}(k)^{t-\tau_L} a_x^{(L)}(k) \sum_{\tau_{L-1}=1}^{\tau_L}\llcorner A_s^{(L-1)}(k)^{\tau_L-\tau_{L-1}}\lrcorner a_x^{(L-1)}(k)\cdots \sum_{\tau_2=1}^{\tau_3}\llcorner A_s^{(2)}(k)^{\tau_3-\tau_2}\lrcorner a_x^{(2)}(k) \llcorner A_s^{(1)}(k)^{\tau_2}\lrcorner}\right\|\left\|{s^{(1)}(0)}\right\|\\ &\quad +\left\|{\sum_{\tau_L=1}^t A_s^{(L)}(k)^{t-\tau_L} a_x^{(L)}(k) \sum_{\tau_{L-1}=1}^{\tau_L}\llcorner A_s^{(L-1)}(k)^{\tau_L-\tau_{L-1}}\lrcorner a_x^{(L-1)}(k) \cdots \sum_{\tau_1=1}^{\tau_2}\llcorner A_s^{(1)}(k)^{\tau_2-\tau_1}\lrcorner a_x^{(1)}(k) \left\|{x(\tau_1-1)}\right\|}\right\|\\ &=\left\|{\sum_{\tau_L=1}^t A_s^{(L)}(k)^{t-\tau_L} a_x^{(L)}(k) \llcorner A_s^{(L-1)}(k)^{\tau_L}\lrcorner}\right\| \left\|{s^{(L-1)}(0)}\right\|\\ &\quad +\sum_{m=1}^{L-2}\Bigg\|\sum_{\tau_L=1}^t A_s^{(L)}(k)^{t-\tau_L} a_x^{(L)}(k) \sum_{\tau_{L-1}=1}^{\tau_L} \llcorner A_s^{(L-1)}(k)^{\tau_L-\tau_{L-1}}\lrcorner a_x^{(L-1)}(k) \sum_{\tau_{L-2}=1}^{\tau_{L-1}} \cdots \\ & \cdots \sum_{\tau_{m+1}=1}^{\tau_{m+2}}\llcorner A_s^{(m+1)}(k)^{\tau_{m+2}-\tau_{m+1}}\lrcorner a_x^{(m+1)}(k) \llcorner A_s^{(m)}(k)^{\tau_{m+1}}\lrcorner\Bigg\|\left\|{s^{(m)}(0)}\right\|\\ &\quad +\left\|{\sum_{\tau_L=1}^t A_s^{(L)}(k)^{t-\tau_L} a_x^{(L)}(k) \sum_{\tau_{L-1}=1}^{\tau_L} \llcorner A_s^{(L-1)}(k)^{\tau_L-\tau_{L-1}}\lrcorner a_x^{(L-1)}(k) \sum_{\tau_{L-2}=1}^{\tau_{L-1}} \cdots 
\sum_{\tau_1=1}^{\tau_2}\llcorner A_s^{(1)}(k)^{\tau_2-\tau_1}\lrcorner a_x^{(1)}\left\|{x(\tau_1-1)}\right\|}\right\|\\ &\leq \sum_{\tau=1}^t \left\|{A_s^{(L)}(k)^{(t-\tau)}}\right\|\left\|{a_x^{(L)}}\right\|\left\|{\llcorner A_s^{(L-1)}(k)^\tau \lrcorner}\right\|\left\|{s^{(L-1)}(0)}\right\|\\ &\quad +\sum_{m=1}^{L-2} \left(\sum_{\tau_L=1}^t \sum_{\tau_{L-1}=1}^{\tau_L} \cdots \sum_{\tau_{m+1}=1}^{\tau_{m+2}} \left\|{A_s^{(L)}(k)^{t-\tau_L}}\right\| \prod_{l=m+1}^{L-1}\left\|{\llcorner A_s^{(l)}(k)^{\tau_{l+1}-\tau_l}\lrcorner}\right\|\left\|{\llcorner A_s^{(m)}(k)^{\tau_{m+1}}\lrcorner}\right\|\prod_{j=m+1}^L\left\|{a_x^{(j)}(k)}\right\|\right)\left\|{s^{(m)}(0)}\right\|\\ &\quad +\sum_{\tau_L=1}^t \sum_{\tau_{L-1}=1}^{\tau_L} \cdots \sum_{\tau_1=1}^{\tau_2} \left\|{A_s^{(L)}(k)^{t-\tau_L}}\right\|\prod_{l=1}^{L-1}\left\|{\llcorner A_s^{(l)}(k)^{\tau_{l+1}-\tau_l}\lrcorner}\right\|\prod_{j=1}^L\left\|{a_x^{(j)}(k)}\right\|\left\|{x(\tau_1-1)}\right\|, \end{align}\] where the last inequality follows from the inequality \(\left\|{A a_1}\right\| \leq \left\|{A}\right\|\left\|{a_1}\right\|\) and the equality \(\left\|{a_1 a_2^\top}\right\|=\max_{|\xi|=1}\left\|{a_1 a_2^\top\xi}\right\|=\left\|{a_1}\right\|\max_{|\xi|=1}|a_2^\top\xi|=\left\|{a_1}\right\|\left\|{a_2}\right\|\) for matrix \(A \in \mathbb{R}^{2 \times 2}\) and vectors \(a_1,a_2 \in \mathbb{R}^2\). Then, we estimate 19 again as \[\begin{align} \left\|{s^{(L)}(t)}\right\| &\leq \left\|{A_s^{(L)}(k)^t}\right\|\left\|{s^{(L)}(0)}\right\| +\sum_{m=1}^{L-1}\left(\sum_{\sum_{l=m}^L\tau_l=t, \tau_l \geq 0} \prod_{l=m}^L \left\|{A_s^{(l)}(k)^{\tau_l}}\right\| \prod_{l=m+1}^L \left\|{a_x^{(l)}(k)}\right\|\right) \left\|{s^{(m)}(0)}\right\| \\ &\quad +\sum_{\tau=0}^{t-1} \sum_{\sum_{l=1}^L \tau_l=t-\tau-1, \tau_l \geq 0} \left(\prod_{l=1}^L \left\|{A_s^{(l)}(k)^{\tau_l}}\right\| \left\|{a_x^{(l)}}\right\|\right) \left\|{x(\tau)}\right\| +\sum_{\tau=0}^{t-1} \left\|{A_s^{(L)}(k)^\tau}\right\|\left\|{a_b^{(L)}}\right\|\left\|{b_c^{(L)}}\right\| \\ &\leq \mu^{(L)}\rho(A_s^{(L)}(k))^t \left\|{s^{(L)}(0)}\right\| +\sum_{m=1}^{L-1} \left(\sum_{\sum_{l=m}^L \tau_l=t} \prod_{l=m}^L \mu^{(l)}\rho(A_s^{(l)}(k))^{\tau_l}\right)\left(\prod_{l=m+1}^L \left\|{a_x^{(l)}(k)}\right\|\right) \left\|{s^{(m)}(0)}\right\| \\ &\quad +\sum_{\tau=0}^{t-1} \left(\sum_{\sum_{l=1}^L \tau_l=t-\tau-1} \prod_{l=1}^L \mu^{(l)}\rho(A_s^{(l)}(k))^{\tau_l}\right)\left(\prod_{l=1}^L \left\|{a_x^{(l)}(k)}\right\|\right)\left\|{x(\tau)}\right\| +\sum_{\tau=0}^{t-1} \mu^{(L)}\rho(A_s^{(L)}(k))^\tau \left\|{a_b^{(L)}}\right\|\left\|{b_c^{(L)}}\right\| \\ &\leq \mu^{(L)}\rho(A_s^{(L)}(k))^t \left\|{s^{(L)}(0)}\right\| +\sum_{m=1}^{L-1} \mu^{(m:L)} \frac{(t+L-m)!}{t!(L-m)!} \max_{l=m,\ldots,L} \rho(A_s^{(l)}(k))^t \left(\prod_{l=m+1}^L \left\|{a_x^{(l)}(k)}\right\|\right) \left\|{s^{(m)}(0)}\right\| \\ &\quad + \sum_{\tau=0}^{t-1} \mu^{(1:L)} \frac{(t+L-\tau-2)!}{(t-\tau-1)!(L-1)!} \max_{l=1,\ldots,L} \rho(A_s^{(l)}(k))^{t-\tau-1}\left(\prod_{l=1}^L \left\|{a_x^{(l)}(k)}\right\|\right)\left\|{x(\tau)}\right\| \\ &\quad +\sum_{\tau=0}^{t-1} \mu^{(L)}\rho(A_s^{(L)}(k))^\tau \left\|{a_b^{(L)}}\right\|\left\|{b_c^{(L)}}\right\| \\ &\leq M_L(k) \rho_L(k)^t \left\|{s^{(1:L)}(0)}\right\|_{2,\infty} +M'_L(k) \sum_{\tau=0}^{t-1} \rho'_L(k)^{t-\tau-1} \left\|{x}\right\|_{2,\infty} +M''_L(k) \sum_{\tau=0}^{t-1} \rho(A_s^{(L)}(k))^\tau \left\|{b_c^{(L)}}\right\|, \end{align}\] where \(M_L(k), M'_L(k)\), and \(M''_L(k)\) are constants independent of \(t\), and \(0<\rho_L(k),\rho'_L(k)<1\).
Therefore, if \(\rho(A_s^{(l)}(k)) \in (0,1)\) for all \(l \in [L]\), the multi-layer LSTM is ISS.
We note that \(\mu^{(l)}\) can be expressed by direct calculation as \[\begin{align} \left\|{A_s^{(l)}(k)^t}\right\| &=\left\|{U \begin{bmatrix} \lambda^{(l)}_1 & \nu^{(l)}\\ 0 & \lambda^{(l)}_2 \end{bmatrix} U^*}\right\| =\left\|{\begin{bmatrix} (\lambda_1^{(l)})^t & \nu^{(l)}\sum_{\tau=1}^t (\lambda_1^{(l)})^{t-\tau}(\lambda_2^{(l)})^{\tau-1} \\ 0 & (\lambda_2^{(l)})^t \end{bmatrix}}\right\| \\ &\leq \frac{1}{\sqrt{2}} \left\{(1+(r^{(l)})^{2t})(\lambda_1^{(l)})^2 + (\nu^{(l)})^2 \sum_{\tau=1}^t (r^{(l)})^{\tau-1} \sum_{\upsilon=1}^t (r^{(l)})^{\upsilon-1}\right\} (\lambda_1^{(l)})^{2(t-1)} \\ &\quad +\left[\left\{(1+(r^{(l)})^{2t})(\lambda_1^{(l)})^2 + (\nu^{(l)})^2 \sum_{\tau=1}^t (r^{(l)})^{\tau-1} \sum_{\upsilon=1}^t (r^{(l)})^{\upsilon-1}\right\}^2 -4(r^{(l)})^{2t} (\lambda_1^{(l)})^4 \right]^{\frac{1}{2}} (\lambda_1^{(l)})^{2(t-1)} \\ &\leq \rho(A_s^{(l)}(k))^t\sqrt{(1+(r^{(l)})^{2t}) +\frac{|\nu^{(l)}|^2}{|\lambda^{(l)}_1|^2} {\left( \frac{1-(r^{(l)})^t}{1-r^{(l)}} \right)}^2}. \label{mu} \end{align}\tag{21}\]
Finally, we show that \(\rho(A^{{(l)}}_{s}(k))\) is monotonically decreasing in \(k\). Suppose, for contradiction, that \(\rho(A^{{(l)}}_{s}(k))<\rho(A^{{(l)}}_{s}(k+1))\). Since \(A^{(l)}_s(k)\) and \(A^{(l)}_s(k+1)\) are non-negative matrices, \(\rho(A^{{(l)}}_{s}(k))\) and \(\rho(A^{{(l)}}_{s}(k+1))\) are the Perron-Frobenius eigenvalues of \(A^{(l)}_s(k)\) and \(A^{(l)}_s(k+1)\), respectively. Also, there exists a non-negative eigenvector \(v\) of \(A_s^{(l)}(k+1)\) for the eigenvalue \(\rho(A^{{(l)}}_{s}(k+1))\). By Lemma 1, \(A^{(l)}_s(k)-A^{(l)}_s(k+1)\) is a non-negative matrix. Then, the vector \[\begin{align} \label{non-negative} (A_s^{(l)}(k)-\rho(A^{{(l)}}_{s}(k+1))I)v =A^{(l)}_s(k)v-\rho(A^{{(l)}}_{s}(k+1))v=(A^{(l)}_s(k)-A^{(l)}_s(k+1))v \end{align}\tag{22}\] is non-negative. Since all eigenvalues of \(A^{{(l)}}_{s}(k)\) are less than \(\rho(A^{{(l)}}_{s}(k+1))\) in modulus, the matrix \(\rho(A^{{(l)}}_{s}(k+1))I-A^{(l)}_s(k)\) is invertible, and its inverse can be expanded as \[\begin{align} (\rho(A^{{(l)}}_{s}(k+1))I-A^{(l)}_s(k))^{-1}=\frac{1}{\rho(A^{{(l)}}_{s}(k+1))}\sum^{\infty}_{q=0}\left(\frac{A^{(l)}_s(k)}{\rho(A^{{(l)}}_{s}(k+1))}\right)^q. \end{align}\] Therefore, \((\rho(A^{{(l)}}_{s}(k+1))I-A^{(l)}_s(k))^{-1}\) is a positive matrix. Hence, 22 implies that \(v\) is non-positive. Together with the non-negativity of \(v\), this forces \(v=0\), which contradicts the fact that \(v\) is an eigenvector (\(v \neq 0\)). Thus, we conclude that \(\rho(A^{{(l)}}_{s}(k))\ge\rho(A^{{(l)}}_{s}(k+1))\). ◻
The next theorem is identical to Theorem 2.
Theorem 9 (\(\delta\)ISS of LSTM). Let \(L \in \mathbb{N}\) and \(k \in \mathbb{Z}_{\geq 0}\). We assume that \(s^{(l)}(0)\in\mathcal{S}^{(l)}(k)\) for all \(l \in [L]\). The LSTM is \(\delta\)ISS if \(\rho(A^{{(l)}}_{s}(k)) <1\) for all \(l \in [L]\), where \[\begin{align} A_s^{(l)}(k) &=\begin{bmatrix} \overline{\sigma}_f^{(l)}(k) & \alpha_s^{(l)}(k)\\ \overline{\sigma}_o^{(l)}(k)\overline{\sigma}_f^{(l)}(k) & \alpha_s^{(l)}(k)\overline{\sigma}_o^{(l)}(k)+\dfrac{1}{4}\phi(\overline{c}^{(l)}(k))\|U_o^{(l)}\| \end{bmatrix},\\ \alpha_s^{(l)}(k) &=\frac{1}{4}\|U_f^{(l)}\|\overline{c}^{(l)}(k)+\overline{\sigma}_i^{(l)}(k)\|U_c^{(l)}\|+\frac{1}{4}\|U_i^{(l)}\|\overline{\phi}_c^{(l)}(k). \end{align}\] Furthermore, \(\rho(A^{{(l)}}_{s}(k))\) is monotonically decreasing with respect to \(k \in \mathbb{Z}_{\geq 0}\).
Proof. From 3 and 4 , we have \[\begin{align} &c_1^{(l)}(t+1)-c_2^{(l)}(t+1)\\ &= \sigma\left(W^{{(l)}}_{f} x_1^{{(l)}}(t) + U^{{(l)}}_{f} h_1^{{(l)}}(t) +b^{{(l)}}_{f}\right) \odot \left(c_1^{(l)}(t)-c_2^{(l)}(t)\right)\\ &\quad +\left\{\sigma\left(W^{{(l)}}_{f} x_1^{{(l)}}(t) + U^{{(l)}}_{f} h_1^{{(l)}}(t) +b^{{(l)}}_{f}\right)-\sigma\left(W^{{(l)}}_{f} x_2^{{(l)}}(t) + U^{{(l)}}_{f} h_2^{{(l)}}(t) +b^{{(l)}}_{f}\right)\right\} \odot c_2^{(l)}(t)\\ &\quad +\sigma\left(W^{{(l)}}_{i} x_1^{{(l)}}(t) + U^{{(l)}}_{i} h_1^{{(l)}}(t) +b^{{(l)}}_{i}\right) \\ &\qquad \odot \left\{\phi\left(W^{{(l)}}_{c} x_1^{{(l)}}(t) + U^{{(l)}}_{c} h_1^{{(l)}}(t) +b^{{(l)}}_{c}\right)-\phi\left(W^{{(l)}}_{c} x_2^{{(l)}}(t) + U^{{(l)}}_{c} h_2^{{(l)}}(t) +b^{{(l)}}_{c}\right)\right\}\\ &\quad +\left\{\sigma\left(W^{{(l)}}_{i} x_1^{{(l)}}(t) + U^{{(l)}}_{i} h_1^{{(l)}}(t) +b^{{(l)}}_{i}\right)-\sigma\left(W^{{(l)}}_{i} x_2^{{(l)}}(t) + U^{{(l)}}_{i} h_2^{{(l)}}(t) +b^{{(l)}}_{i}\right)\right\} \\ &\qquad \odot \phi\left(W^{{(l)}}_{c} x_2^{{(l)}}(t) + U^{{(l)}}_{c} h_2^{{(l)}}(t) +b^{{(l)}}_{c}\right), \end{align}\] and \[\begin{align} &h_1^{(l)}(t+1)-h_2^{(l)}(t+1) \\ &= \sigma\left(W^{{(l)}}_{o} x_1^{{(l)}}(t) + U^{{(l)}}_{o} h_1^{{(l)}}(t) +b^{{(l)}}_{o}\right) \odot \left\{\phi\left(c_1^{(l)}(t+1)\right)-\phi\left(c_2^{(l)}(t+1)\right)\right\}\\ &\quad + \left\{\sigma\left(W^{{(l)}}_{o} x_1^{{(l)}}(t) + U^{{(l)}}_{o} h_1^{{(l)}}(t) +b^{{(l)}}_{o}\right)-\sigma\left(W^{{(l)}}_{o} x_2^{{(l)}}(t) + U^{{(l)}}_{o} h_2^{{(l)}}(t) +b^{{(l)}}_{o}\right)\right\} \odot \phi\left(c_2^{(l)}(t+1)\right). \end{align}\] Since the Lipschitz constant of \(\sigma\) is \(1/4\), we obtain that \[\begin{align} &\left\|{c_1^{(l)}(t+1)-c_2^{(l)}(t+1)}\right\|\\ &\leq \left\|{\sigma\left(W^{{(l)}}_{f} x_1^{{(l)}}(t) + U^{{(l)}}_{f} h_1^{{(l)}}(t) +b^{{(l)}}_{f}\right)}\right\|_{\infty}\left\|{c_1^{(l)}(t)-c_2^{(l)}(t)}\right\| \\ &\quad +\frac{1}{4}\left\|{\left(W^{{(l)}}_{f} x_1^{{(l)}}(t) + U^{{(l)}}_{f} h_1^{{(l)}}(t) +b^{{(l)}}_{f}\right)-\left(W^{{(l)}}_{f} x_2^{{(l)}}(t) + U^{{(l)}}_{f} h_2^{{(l)}}(t) +b^{{(l)}}_{f}\right)}\right\|\left\|{c_2^{(l)}(t)}\right\|_{\infty}\\ &\quad +\left\|{\sigma\left(W^{{(l)}}_{i} x_1^{{(l)}}(t) + U^{{(l)}}_{i} h_1^{{(l)}}(t) +b^{{(l)}}_{i}\right)}\right\|_{\infty}\left\|{\left(W^{{(l)}}_{c} x_1^{{(l)}}(t) + U^{{(l)}}_{c} h_1^{{(l)}}(t) +b^{{(l)}}_{c}\right)-\left(W^{{(l)}}_{c} x_2^{{(l)}}(t) + U^{{(l)}}_{c} h_2^{{(l)}}(t) +b^{{(l)}}_{c}\right)}\right\|\\ &\quad +\frac{1}{4}\left\|{\left(W^{{(l)}}_{i} x_1^{{(l)}}(t) + U^{{(l)}}_{i} h_1^{{(l)}}(t) +b^{{(l)}}_{i}\right)-\left(W^{{(l)}}_{i} x_2^{{(l)}}(t) + U^{{(l)}}_{i} h_2^{{(l)}}(t) +b^{{(l)}}_{i}\right)}\right\|\left\|{\phi\left(W^{{(l)}}_{c} x_2^{{(l)}}(t) + U^{{(l)}}_{c} h_2^{{(l)}}(t) +b^{{(l)}}_{c}\right)}\right\|_{\infty} \\ &\leq \overline{\sigma}_f^{(l)}(k) \left\|{c_1^{(l)}(t)-c_2^{(l)}(t)}\right\| + \frac{1}{4}\left(\left\|{W_f^{(l)}}\right\|\left\|{x_1^{(l)}(t)-x_2^{(l)}(t)}\right\|+\left\|{U_f^{(l)}}\right\|\left\|{h_1^{(l)}(t)-h_2^{(l)}(t)}\right\|\right)\overline{c}^{(l)}(k) \\ &\quad +\overline{\sigma}_i^{(l)}(k) \left(\left\|{W_c^{(l)}}\right\|\left\|{x_1^{(l)}(t)-x_2^{(l)}(t)}\right\|+\left\|{U_c^{(l)}}\right\|\left\|{h_1^{(l)}(t)-h_2^{(l)}(t)}\right\|\right) \\ &\quad +\frac{1}{4}\overline{\phi}_c^{(l)}(k) \left(\left\|{W_i^{(l)}}\right\|\left\|{x_1^{(l)}(t)-x_2^{(l)}(t)}\right\|+\left\|{U_i^{(l)}}\right\|\left\|{h_1^{(l)}(t)-h_2^{(l)}(t)}\right\|\right), \end{align}\] and \[\begin{align} 
&\left\|{h_1^{(l)}(t+1)-h_2^{(l)}(t+1)}\right\|\\ &\leq \overline{\sigma}_o^{(l)}(k) \left\|{c_1^{(l)}(t+1)-c_2^{(l)}(t+1)}\right\|+\frac{1}{4}\phi\left(\overline{c}^{(l)}(k)\right) \left(\left\|{W_o^{(l)}}\right\|\left\|{x_1^{(l)}(t)-x_2^{(l)}(t)}\right\|+\left\|{U_o^{(l)}}\right\|\left\|{h_1^{(l)}(t)-h_2^{(l)}(t)}\right\|\right) \\ &\leq \overline{\sigma}_o^{(l)}(k) \Big\{\overline{\sigma}_f^{(l)}(k) \left\|{c_1^{(l)}(t)-c_2^{(l)}(t)}\right\| + \frac{1}{4}\left(\left\|{W_f^{(l)}}\right\|\left\|{x_1^{(l)}(t)-x_2^{(l)}(t)}\right\|+\left\|{U_f^{(l)}}\right\|\left\|{h_1^{(l)}(t)-h_2^{(l)}(t)}\right\|\right)\overline{c}^{(l)}(k) \\ & +\overline{\sigma}_i^{(l)}(k) \left(\left\|{W_c^{(l)}}\right\|\left\|{x_1^{(l)}(t)-x_2^{(l)}(t)}\right\|+\left\|{U_c^{(l)}}\right\|\left\|{h_1^{(l)}(t)-h_2^{(l)}(t)}\right\|\right) \\ & +\frac{1}{4}\overline{\phi}_c^{(l)}(k) \left(\left\|{W_i^{(l)}}\right\|\left\|{x_1^{(l)}(t)-x_2^{(l)}(t)}\right\|+\left\|{U_i^{(l)}}\right\|\left\|{h_1^{(l)}(t)-h_2^{(l)}(t)}\right\|\right)\Big\}\\ &\quad +\frac{1}{4}\phi\left(\overline{c}^{(l)}(k)\right)\left(\left\|{W_o^{(l)}}\right\|\left\|{x_1^{(l)}(t)-x_2^{(l)}(t)}\right\|+\left\|{U_o^{(l)}}\right\|\left\|{h_1^{(l)}(t)-h_2^{(l)}(t)}\right\|\right). \end{align}\] Then, \[\begin{align} \begin{pmatrix} \left\|{c_1^{(l)}(t+1)-c_2^{(l)}(t+1)}\right\| \\ \left\|{h_1^{(l)}(t+1)-h_2^{(l)}(t+1)}\right\| \end{pmatrix} &\leq \begin{bmatrix} \overline{\sigma}_f^{(l)}(k) & \alpha_s^{(l)}(k) \\ \overline{\sigma}_o^{(l)}(k) \overline{\sigma}_f^{(l)}& \alpha_s^{(l)}(k) \overline{\sigma}_o^{(l)}(k)+\frac{1}{4}\phi\left(\overline{c}^{(l)}(k)\right)\left\|{U_o^{(l)}}\right\| \end{bmatrix} \begin{pmatrix} \left\|{c_1^{(l)}(t)-c_2^{(l)}(t)}\right\| \\ \left\|{h_1^{(l)}(t)-h_2^{(l)}(t)}\right\| \end{pmatrix}\\ &\quad + \begin{pmatrix} \alpha_x^{(l)}(k) \\ \alpha_x^{(l)}(k) \overline{\sigma}_o^{(l)}(k)+\frac{1}{4}\phi\left(\overline{c}^{(l)}(k)\right)\left\|{W_o^{(l)}}\right\| \end{pmatrix} \left\|{x_1^{(l)}(t)-x_2^{(l)}(t)}\right\|\\ &\eqqcolon A_s^{(l)}(k) \begin{pmatrix} \left\|{c_1^{(l)}(t)-c_2^{(l)}(t)}\right\| \\ \left\|{h_1^{(l)}(t)-h_2^{(l)}(t)}\right\| \end{pmatrix} + a_x^{(l)}(k) \left\|{x_1^{(l)}(t)-x_2^{(l)}(t)}\right\|, \end{align}\] where \(\alpha_s^{(l)}(k)=\frac{1}{4}\overline{c}^{(l)}(k)\|U_f^{(l)}\|+\overline{\sigma}_i^{(l)}(k)\|U_c^{(l)}\|+\frac{1}{4}\overline{\phi}_c^{(l)}(k)\|U_i^{(l)}\|\) and \(\alpha_x^{(l)}(k)=\frac{1}{4}\overline{c}^{(l)}(k)\|W_f^{(l)}\|+\overline{\sigma}_i^{(l)}(k)\|W_c^{(l)}\|+\frac{1}{4}\overline{\phi}_c^{(l)}(k)\|W_i^{(l)}\|\).
Similar to the proof of Theorem 8, we have \[\begin{align} \left\|{s_1(t)-s_2(t)}\right\| &\leq \mu \lambda_1(A_s(k))^t \left\|{s_1(0)-s_2(0)}\right\| +\left\|{(I-A_s(k)^t)(I-A_s(k))^{-1}a_x(k)}\right\|\left\|{x_1-x_2}\right\|_{2,\infty} \label{s-LSTM95dISS-upper} \end{align}\tag{23}\] for single-layer LSTM, and \[\begin{align} \left\|{s_1^{(L)}(t)-s_2^{(L)}(t)}\right\| &\leq \mu^{(L)}\rho(A_s^{(L)}(k))^t \left\|{s_1^{(L)}(0)-s_2^{(L)}(0)}\right\|\\ &\quad +\sum_{m=1}^{L-1} \mu^{(m:L)} \frac{(t+L-m)!}{t!(L-m)!} \max_{l=m,\ldots,L} \rho(A_s^{(l)}(k))^t \left(\prod_{l=m+1}^L \left\|{a_x^{(l)}(k)}\right\|\right) \left\|{s_1^{(m)}(0)-s_2^{(m)}(0)}\right\| \\ &\quad + \sum_{\tau=0}^{t-1} \mu^{(1:L)} \frac{(t+L-\tau-2)!}{(t-\tau-1)!(L-1)!} \max_{l=1,\ldots,L} \rho(A_s^{(l)}(k))^{t-\tau-1}\left(\prod_{l=1}^L \left\|{a_x^{(l)}(k)}\right\|\right)\left\|{x_1(\tau)-x_2(\tau)}\right\| \label{delta-ISS-explicit} \end{align}\tag{24}\] for multi-layer LSTM. Thus, LSTM is \(\delta\)ISS if \(\rho(A_s^{(l)}(k))<1\) for all \(l \in [L]\).
The proof that \(\rho(A^{{(l)}}_{s}(k))\) is monotonically decreasing is the same as in the proof of Theorem 8. ◻
Theorem 10 (Upper Bound of Recovery Time). We assume that \(s^{(l)}(0)\in\mathcal{S}^{(l)}(k)\) and \(\rho(A^{{(l)}}_{s}(k)) <1\) for all \(l \in [L]\). Let \(x(t), \widehat{x}(t) \in [-x_\mathrm {max},x_\mathrm {max}]^{n_x}\), \(t\ge0\), be inputs of infinite length, and let \(e > 0\). We assume that there exists \(t_0\) such that \(x(t)=\widehat{x}(t)\) for any \(t \ge t_0\). Then, we have \[\begin{align} &\sup_{s(0) \in \prod^L_{l=1}\mathcal{S}^{(l)}(k)}T_R(x, \widehat{x},s(0);e) \le \min \{t_1 \ge 0 : \tilde{\beta}(2\sqrt{n_c}\bar{s}(k),t; k) \le e/\|U_y\| \;\text{for any} \;t \ge t_1 \}, \end{align}\] where \(\bar{s}(k)=(\bar{s}^{(1)}(k),\dots,\bar{s}^{(L)}(k))\) and \(\bar{s}^{(l)}(k)=\|(\bar{c}^{(l)}(k) , \phi (\bar{c}^{(l)}(k)) \bar{\sigma}^{(l)}_o (k))\|\).
Proof. Denote \[\begin{align} s_1^{(l)} = \begin{pmatrix} c_1^{(l)} \\ h_1^{(l)} \end{pmatrix} = s^{(l)}(t_0, s(0), x(0:t_0)), \\ s_2^{(l)} = \begin{pmatrix} c_2^{(l)} \\ h_2^{(l)} \end{pmatrix} = s^{(l)}(t_0, s(0), \widehat{x}(0:t_0)) \end{align}\] for \(l = 1, \dots, L\). Proposition 4 implies that there exists \(\tilde{\beta}\) such that \[\begin{align} \MoveEqLeft \|s^{(L)}(t - t_0, s_1, x(t_0 : t)) - s^{(L)}(t - t_0, s_2, x(t_0 : t))\| \leq \tilde{\beta}(\lVert s_1^{(1)} - s_2^{(1)} \rVert, \dots, \lVert s_1^{(L)} - s_2^{(L)} \rVert, t - t_0; k) \end{align}\] and \(\tilde{\beta}(s^{(1)}, \dots, s^{(L)}, t; k)\) is increasing with respect to \(s^{(1)}, \dots, s^{(L)}\) for any \(t\) and \(k\). On the other hand, \(s_1^{(l)}, s_2^{(l)} \in \mathcal{S}^{(l)}(k)\) implies \[\begin{align} \lVert c_1^{(l)} - c_2^{(l)} \rVert &\le \sqrt{n_c} \lVert c_1^{(l)} - c_2^{(l)} \rVert _\infty \le 2 \sqrt{n_c} \bar{c}^{(l)}(k), \\ \lVert h_1^{(l)} - h_2^{(l)} \rVert &\le \sqrt{n_c} \lVert h_1^{(l)} - h_2^{(l)} \rVert _\infty \le 2 \sqrt{n_c} \phi(\bar{c}^{(l)}(k)) \bar{\sigma}_o^{(l)}(k), \end{align}\] and hence \[\lVert s_1^{(l)} - s_2^{(l)} \rVert = \sqrt{ \lVert c_1^{(l)} - c_2^{(l)} \rVert^2 + \lVert h_1^{(l)} - h_2^{(l)} \rVert^2 } \le 2 \sqrt{n_c} \bar{s}^{(l)}(k).\] Thus, we have \[\begin{align} \|y(t, s(0), x(0:t)) - y(t, s(0), \widehat{x}(0:t))\| &\leq \|U_y\| \|s^{(L)}(t, s(0), x(0:t)) - s^{(L)}(t, s(0), \widehat{x}(0:t))\| \\ &= \|U_y\| \|s^{(L)}(t - t_0, s_1, x(t_0 : t)) - s^{(L)}(t - t_0, s_2, x(t_0 : t))\| \\ &\leq \lVert U_y \rVert \tilde{\beta}( 2 \sqrt{n_c} \bar{s}(k), t - t_0; k ). \end{align}\] The theorem follows immediately from this inequality. ◻
This section provides a detailed description of the settings used in the experiments discussed in the main text. The experiments were conducted using the PyTorch framework.
The model dimensions are provided in Table 1. Twenty models were generated with weights sampled uniformly from the intervals \(W_* \in [0,0.1]\), \(U_* \in [0, 1.0]\), and \(C \in [0,1.0]\).
\(L\) | \(n_x\) | \(n_c\) | \(n_y\) |
---|---|---|---|
1 | 1 | 1 | 1 |
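To make this setup concrete, the following is a minimal sketch of how such toy models could be instantiated in PyTorch; the helper name `sample_toy_lstm`, the per-model seeding, and the zeroed biases are illustrative assumptions rather than the authors' exact procedure (the interval denoted \(C\) above is not reproduced here).

```python
import torch

def sample_toy_lstm(seed: int) -> torch.nn.LSTM:
    """One toy model from Table 1 (L=1, n_x=n_c=1) with weights drawn
    uniformly from the intervals stated above; bias handling is assumed."""
    torch.manual_seed(seed)
    lstm = torch.nn.LSTM(input_size=1, hidden_size=1, num_layers=1, batch_first=True)
    with torch.no_grad():
        lstm.weight_ih_l0.uniform_(0.0, 0.1)   # W_f, W_i, W_c, W_o (stacked) in [0, 0.1]
        lstm.weight_hh_l0.uniform_(0.0, 1.0)   # U_f, U_i, U_c, U_o (stacked) in [0, 1.0]
        lstm.bias_ih_l0.zero_()                # biases are not specified above; zeroed here
        lstm.bias_hh_l0.zero_()
    return lstm

models = [sample_toy_lstm(seed) for seed in range(20)]  # twenty random toy models
```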
We define the nominal input data \(x\) and the perturbed input data \(\widehat{x}\), which satisfy the assumptions of Theorem 5 with \(t_0=20\), as follows: \[\begin{align} x(t) = 0.5 , \;\widehat{x}(t) = \begin{cases} 1.0 & \text{if } t = 20, \\ 0.5 & \text{else}, \end{cases} \end{align}\] where \(t=0,1,2,\ldots,99\). \(T_R(x,\widehat{x})\) and \(\bar{T}_R(k)\) were calculated according to Algorithm 7 and Algorithm 8 with \(\mathcal{T} =100\). It should be noted that when a model is unstable or \(\rho(A_s(k)) \ge 1\), these algorithms return \(T_R = \infty\) or \(\bar{T}_R(k) = \infty\). Indeed, for some values of \(k\) and specific models, our experiments yielded \(\bar{T}_R(k) = \infty\) and \(\rho(A_s(k)) \ge 1\); this is theoretically correct and does not indicate any deficiency in the theory. However, computing statistical measures such as estimation errors or correlation coefficients from infinite values would produce arbitrary and meaningless numbers. To address this issue, we replaced infinite values with the finite proxy \(\bar{T}_R(k)=\mathcal{T}=100\) when computing these statistical measures.
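The pseudocode of Algorithm 7 is not reproduced in this appendix, so the following is only a minimal sketch of how an empirical recovery time can be computed from the two output trajectories, under the assumption that \(T_R\) is the earliest offset after \(t_0\) from which the output discrepancy stays within \(e\) up to the horizon \(\mathcal{T}\); the function name and array shapes are illustrative.

```python
import numpy as np

def recovery_time(y, y_hat, t0: int, e: float, horizon: int):
    """Empirical recovery time: smallest t1 >= 0 such that the discrepancy
    between nominal and perturbed outputs stays within e for every step
    t >= t0 + t1 (up to `horizon`); np.inf if it never settles.
    y, y_hat: arrays of shape (T, n_y)."""
    err = np.linalg.norm(np.asarray(y) - np.asarray(y_hat), axis=-1)
    for t1 in range(horizon - t0):
        if np.all(err[t0 + t1:horizon] <= e):
            return t1
    return np.inf
```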
The model dimensions are provided in Table 2.
\(L\) | \(n_x\) | \(n_c\) | \(n_y\) |
---|---|---|---|
1 | 3 | 22 | 2 |
The training and test data were randomly generated through the following process [22], [23]. First, the continuous two-tank system \[\begin{align} \frac{dh_1}{dt}&= -p_1\sqrt{2gh_2}+p_2u,\\ \frac{dh_2}{dt}&= p_3\sqrt{2gh_1}-p_4\sqrt{2gh_2}, \end{align}\] was discretized using the Runge-Kutta method with \(dt = 0.01\). The system parameters are shown in Table 3. We randomly generated a sequence of control inputs \(\{u(t)\}_t\), set the initial state of the system to \((h_1(0), h_2(0)) = (0,0)\), and obtained the system states \(\{(h_1(t),h_2(t))\}_t\). The control-input sequence has a time-series length of \(100,000\), and its value is switched every \(4,000\) steps to a fresh draw from the uniform distribution \(\mathcal{U}(0.5, 3.0)\). Using this method, we created two independent control input sequences, one for training and one for testing.
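A minimal sketch of this data-generation step is given below, assuming a classical fourth-order Runge-Kutta step and the parameter values of Table 3; the clipping of negative levels inside the square roots is an added numerical guard, not part of the stated model, and the helper names are illustrative.

```python
import numpy as np

P1 = P2 = P3 = P4 = G = 0.5   # system constants from Table 3
DT = 0.01                     # discretization step

def two_tank(h, u):
    """Right-hand side of the two-tank dynamics written above."""
    h1, h2 = np.maximum(h, 0.0)               # numerical guard for the square roots
    return np.array([-P1 * np.sqrt(2 * G * h2) + P2 * u,
                      P3 * np.sqrt(2 * G * h1) - P4 * np.sqrt(2 * G * h2)])

def rk4_step(h, u):
    k1 = two_tank(h, u)
    k2 = two_tank(h + 0.5 * DT * k1, u)
    k3 = two_tank(h + 0.5 * DT * k2, u)
    k4 = two_tank(h + DT * k3, u)
    return h + DT / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def simulate(total_steps=100_000, switch_every=4_000, seed=0):
    """Piecewise-constant inputs u ~ U(0.5, 3.0), redrawn every 4,000 steps."""
    rng = np.random.default_rng(seed)
    h, u = np.zeros(2), 0.0
    us, hs = [], []
    for t in range(total_steps):
        if t % switch_every == 0:
            u = rng.uniform(0.5, 3.0)
        us.append(u)
        hs.append(h)
        h = rk4_step(h, u)
    return np.array(us), np.array(hs)
```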
As the final stage of the data generation process, we created training and test datasets. In this step, we simulated more realistic scenarios by adding noise to the previously generated state variables. Specifically, we constructed the input data as follows: \[\begin{align} x(t) = (u(t), h_1(t) + w_1(t), h_2(t)+w_2(t)). \end{align}\] This addition of noise simulates real-world factors such as sensor measurement errors and environmental uncertainties. The noise distribution was set to \(w_1,w_2 \sim \mathcal{N}(0,{0.1}^2)\) for the training data and to \(w_1,w_2 \sim \mathcal{N}(0,{0.01}^2)\) for the test data. While noise with a standard deviation of about \(0.01\) is expected under operational conditions, we intentionally introduce larger noise during training to improve the model's robustness and accuracy. The inference target, in contrast, was set to the state at the next time step: \[\begin{align} y(t) = (h_1(t+1), h_2(t+1)). \end{align}\] This setup requires the model to predict the next-step state from the current control input and the noisy state, which corresponds to the practically important task of one-step-ahead state prediction in actual system control.
The time series data are normalized to the interval \([-1, 1]\) for both the input \(x(t)=(u(t),h_1(t)+w_1(t), h_2(t)+w_2(t))\) and the output \(y(t)=(h_1(t+1),h_2(t+1))\) of the LSTM before conducting inference. The normalization is given by \[\begin{align} x &= 2 (x-x_\mathrm {min})/(x_\mathrm {max}-x_\mathrm {min}) -1, \\ y &= 2 (y-y_\mathrm {min})/(y_\mathrm {max}-y_\mathrm {min}) -1, \end{align}\] where \(x_\mathrm {min}= (0.5, 0,0)\), \(x_\mathrm {max}= (3.0, 10,10)\), \(y_\mathrm {min}= (0,0)\), and \(y_\mathrm {max}= (10,10)\).
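The noise injection, next-step targets, and min-max normalization can be combined as in the following sketch; array shapes and the helper name are assumptions, and \(u\), \(h\) are taken to come from the simulation step described earlier.

```python
import numpy as np

X_MIN, X_MAX = np.array([0.5, 0.0, 0.0]), np.array([3.0, 10.0, 10.0])
Y_MIN, Y_MAX = np.array([0.0, 0.0]), np.array([10.0, 10.0])

def build_dataset(u, h, noise_std, rng):
    """u: (T,), h: (T, 2). Inputs carry Gaussian measurement noise on the
    water levels; targets are the next-step (noise-free) levels."""
    w = rng.normal(0.0, noise_std, size=h.shape)
    x = np.column_stack([u[:-1], (h + w)[:-1]])   # x(t) = (u(t), h1+w1, h2+w2)
    y = h[1:]                                     # y(t) = (h1(t+1), h2(t+1))
    x = 2 * (x - X_MIN) / (X_MAX - X_MIN) - 1     # normalize to [-1, 1]
    y = 2 * (y - Y_MIN) / (Y_MAX - Y_MIN) - 1
    return x, y

# Example: x_train, y_train = build_dataset(u, h, noise_std=0.1, rng=np.random.default_rng(0))
```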
System constants | \(p_1\) | \(p_2\) | \(p_3\) | \(p_4\) | \(g\) |
---|---|---|---|---|---|
| | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
Actuator | \(u_{\min}\) | \(u_{\max}\) | | | |
| | 0.5 | 3.0 | | | |
The training dataset is split into training and validation sets, with the first \(50\%\) of the sequences allocated for training and the remaining \(50\%\) used for validation. To reduce computational load and augment the data, the training dataset is created by segmenting one time series of length \(50,000\) into \(49,000\) individual sequences, each of length \(1,000\), using a sliding window with step size \(1\). Since data augmentation is not required for the validation dataset, we use both a window length and a step size of \(1,000\) there, purely for computational efficiency.
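A sketch of this windowing step is given below, assuming the series are held as NumPy arrays; the stride-1 setting reproduces the training segmentation and stride = window length reproduces the validation one.

```python
import numpy as np

def sliding_windows(x, y, length=1_000, step=1):
    """Cut long series x: (T, d_x), y: (T, d_y) into overlapping sequences of
    `length` with stride `step` (step=1 for training, step=1_000 for validation)."""
    starts = range(0, len(x) - length + 1, step)
    xs = np.stack([x[s:s + length] for s in starts])
    ys = np.stack([y[s:s + length] for s in starts])
    return xs, ys

# A 50,000-step series with length 1,000 and step 1 yields the roughly
# 49,000 training windows mentioned above.
```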
We design the loss function for backpropagation as follows: \[\begin{align} \mathcal{L} = \text{MAE}(y_{\textrm{true}}, y_{\textrm{pred}}) + \lambda \Phi(\theta,\varepsilon). \end{align}\] When calculating the MAE on the batched data, we discard the first 10 steps of the time series predictions. This adjustment accounts for the transient instability that is characteristic of LSTM predictions during the initial time steps.
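A minimal sketch of this loss in PyTorch follows; the resilience regularizer \(\Phi(\theta,\varepsilon)\) is defined elsewhere in the paper, so it is passed in here as an already-computed scalar, which is an assumption of this sketch.

```python
import torch

def training_loss(y_pred, y_true, phi_value, lam=1.0, warmup=10):
    """MAE over batched sequences (batch, time, n_y), discarding the first
    `warmup` steps, plus the weighted regularizer lambda * Phi(theta, eps)."""
    mae = torch.mean(torch.abs(y_pred[:, warmup:, :] - y_true[:, warmup:, :]))
    return mae + lam * phi_value
```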
We used the following hyperparameters. The weights were initialized from the uniform distribution \(\mathcal{U}(-1/\sqrt{n_c},1/\sqrt{n_c})\) and optimized using Adam (\(\beta =(0.9, 0.999)\), \(\varepsilon=10^{-8}\)). The mini-batch size was \(64\), the number of epochs was \(200\), and the learning rate was \(0.001\) with a ReduceLROnPlateau (\(\text{mode}=\text{min}\), \(\text{factor}=0.1\), \(\text{patience}=10\)) scheduler. The baseline model was constructed with \(\lambda = 0\), while the stability-guaranteed model was implemented with \(\lambda = 1.0\). The recovery time control parameter \(\varepsilon\) was selected from the set \(\{0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45, 0.5 \}\).
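These settings map onto standard PyTorch APIs as sketched below; the model definition is an illustrative stand-in for the predictor of Table 2, and PyTorch's default LSTM initialization already draws weights from \(\mathcal{U}(-1/\sqrt{n_c},1/\sqrt{n_c})\).

```python
import torch

# Illustrative stand-in for the predictor in Table 2 (n_x = 3, n_c = 22, n_y = 2).
lstm = torch.nn.LSTM(input_size=3, hidden_size=22, num_layers=1, batch_first=True)
readout = torch.nn.Linear(22, 2)
params = list(lstm.parameters()) + list(readout.parameters())

optimizer = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10)

# Training loop (not shown): mini-batches of 64 for 200 epochs,
# then scheduler.step(validation_loss) once per epoch.
```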
Inference accuracy was evaluated on the test data using MAE as the metric. For the time series, we calculated the MAE ignoring the first \(10\) steps and using non-normalized values.
Separately from the data used for evaluating inference accuracy, we generate the nominal input data \(x\) and the perturbed input data \(\widehat{x}\) for the estimation of recovery time \(T_R\) using the following steps. First, we set the time series length to \(3500\) steps, the control input to \(u = 1.0\), and the initial water level of the system to \((h_1(0),h_2(0)) = (0,0)\). Then, following the data generation process, we created \((h_1(t), h_2(t))\) and set \(x(t) = (u(t),h_1(t), h_2(t))\). Next, we define \(\widehat{x}\), which satisfies the assumptions of Theorem 5 with \(t_0=1010\), as \[\begin{align} \widehat{x}(t) = \begin{cases} (u(t), h_1(t)+ 1.0, h_2(t) +1.0) & \text{if } 1000 \le t \le 1010, \\ x(t) & \text{else} .\end{cases} \end{align}\] This definition models the phenomenon where observational noise temporarily has a magnitude larger than expected. Based on these settings, we calculate \(T_R\) and \(\bar{T}_R\). The algorithm is similar to the one used for the toy model, but one point requires care because of data normalization. Since the LSTM output is normalized, \(T_R\) is calculated so that the error remains within \(e\) on the original scale. Consequently, \(\bar{T}_R\) must also be adjusted to the original scale. In the calculation of \(\bar{T}_R\), we apply the following scale transformation to \(e\) in Theorem 5: \[\begin{align} e \gets e/(h_\mathrm {max}- h_\mathrm {min}) = e/(10-0) =e/10. \end{align}\]
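A short sketch of this perturbation and threshold rescaling is given below; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def perturb(x, t_start=1000, t_end=1010, offset=1.0):
    """x: (T, 3) nominal inputs (u, h1, h2); adds a temporary offset to the
    water-level channels on [t_start, t_end], leaving the control input unchanged."""
    x_hat = x.copy()
    x_hat[t_start:t_end + 1, 1:] += offset
    return x_hat

# Threshold rescaling to the normalized output scale used in the bound:
# e_normalized = e / (h_max - h_min) = e / 10.0
```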
We conducted comparative experiments between our proposed theoretical method and conventional data-driven approaches. In the data-driven approach, training data were augmented by randomly adding pulses along the time axis.
For adversarial learning, training and validation data were generated by adding pulse signals to the input data generated in Section 0.0.11.2, excluding the control inputs. For data with time series length \(T=50,000\), we generated randomly occurring sustained pulses. Pulse occurrences follow a Poisson process with an average rate of \(\lambda = 0.001\) per unit time, so the total number of pulses over the entire period \(T\) is drawn from the Poisson distribution Poisson\((\lambda T)\). Each pulse has the following characteristics (a brief code sketch is given after the list):
The start time is randomly selected from a uniform distribution in the range \([0, T-d]\), where \(d=10\) is the duration of each pulse.
The amplitude is sampled from a uniform distribution within \([0, M]\), where \(M\) is the maximum intensity of the pulse.
Each pulse is added to the input signal at a constant value for \(d\) steps from the selected start time.
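The following is a brief sketch of the pulse-generation procedure described above; the helper name and the choice to leave the control-input channel untouched (as stated earlier) are the only assumptions beyond the listed characteristics.

```python
import numpy as np

def add_random_pulses(x, M, rate=0.001, d=10, seed=0):
    """Add sustained pulses to the water-level channels of x (shape (T, 3)).
    Pulse count ~ Poisson(rate * T), start ~ U[0, T - d], amplitude ~ U[0, M];
    each pulse is held constant for d steps. Column 0 (control input) is untouched."""
    rng = np.random.default_rng(seed)
    T = len(x)
    x_pulsed = x.copy()
    for _ in range(rng.poisson(rate * T)):
        start = int(rng.integers(0, T - d + 1))   # uniform start time in [0, T - d]
        amp = rng.uniform(0.0, M)                 # uniform amplitude in [0, M]
        x_pulsed[start:start + d, 1:] += amp      # constant pulse for d steps
    return x_pulsed
```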
We created training and validation data with variations of \(M = 1, 3, 10\) using the method described above. Models trained with each of these datasets are referred to as "DD pulse \(M\) model" (where DD stands for data-driven), while our model was trained using data without added pulses.
We used the same loss function as defined in Section 0.0.11.2.
With the exception of the following two points, we used the same hyperparameters as in Section 0.0.11.2:
For our model, \(\varepsilon=0.05\).
For the DD pulse \(M\) models, \(\lambda=0\).
We evaluated each model using the following evaluation data, under the same settings as in Section 0.0.11.2.
We define \(\widehat{x}_p\), which satisfies the assumptions of Theorem 5 with \(t_0=1010\), as \[\begin{align} \widehat{x}_p(t) = \begin{cases} (u(t), h_1(t)+ p, h_2(t) + p) & \text{if } 1000 \le t \le 1010, \\ x(t) & \text{else}, \end{cases} \end{align}\] where \(p\) is selected from the set \(\{1.0, 5.0, 9.0\}\); \(\widehat{y}_p\) denotes the output of the LSTM when \(\widehat{x}_p\) is used as the input.
As shown in Table 4, our proposed method demonstrated superior recovery times \(T_R\): our model recovered faster than the models obtained using data-driven methods. Although models trained with data containing stronger noise pulses showed improved recovery times, they did not reach the performance of our theoretically guaranteed model. These results indicate that our proposed method is effective in enhancing LSTM's recovery capability.
Our theoretically guaranteed model also showed a clear advantage with respect to the stability criterion: the models trained with data-driven techniques failed to meet the theoretical stability criterion \(\rho(A_s)<1\). This result suggests that approaches relying solely on data are insufficient to provide theoretical guarantees of system resilience.
On the other hand, in terms of noise resistance, models trained with data-driven approaches outperformed our model in certain cases. This is likely because our proposed method does not directly aim to maximize noise resistance. Noise resistance can be theoretically bounded using \(\gamma\) from Definition 2, so explicitly deriving \(\gamma\) is a promising direction for improving the noise-resistance metrics in future work.
Overall, these results indicate that our theoretical approach provides a more dependable means of ensuring resilience than data-driven methods under the noise conditions tested.
Metric | Baseline | DD pulse \(1\) | DD pulse \(5\) | DD pulse \(10\) | Our |
---|---|---|---|---|---|
MAE | 0.007 | 0.017 | 0.014 | 0.012 | 0.010 |
\(T_R(x,\widehat{x}_1)\) | 21.31 | 17.04 | 9.31 | 4.49 | 0.86 |
\(T_R (x,\widehat{x}_5)\) | 22.91 | 19.74 | 8.11 | 4.74 | 1.04 |
\(T_R (x,\widehat{x}_9)\) | 24.39 | 17.33 | 8.11 | 4.90 | 1.04 |
\(\overline{T}_R\) | \(\infty\) | \(\infty\) | \(\infty\) | \(\infty\) | 19.32 |
\(\rho(A)\) | 1.537e+11 | 6.001e+7 | 2.627e+12 | 1.295e+6 | 0.9476 |
\(\max_t|y(t)-\widehat{y}_1(t)|\) | 0.03 | 0.06 | 0.16 | 0.26 | 0.21 |
\(\max_t|y(t)-\widehat{y}_5(t)|\) | 0.40 | 0.35 | 1.83 | 3.05 | 1.30 |
\(\max_t|y(t)-\widehat{y}_9(t)|\) | 1.23 | 0.82 | 3.30 | 5.88 | 2.80 |