February 28, 2024
The vector field of a controlled differential equation (CDE) describes the relationship between a control path and the evolution of a solution path. Neural CDEs (NCDEs) treat time series data as observations from a control path, parameterise a CDE’s vector field using a neural network, and use the solution path as a continuously evolving hidden state. As their formulation makes them robust to irregular sampling rates, NCDEs are a powerful approach for modelling real-world data. Building on neural rough differential equations (NRDEs), we introduce Log-NCDEs, a novel, effective, and efficient method for training NCDEs. The core component of Log-NCDEs is the Log-ODE method, a tool from the study of rough paths for approximating a CDE’s solution. Log-NCDEs are shown to outperform NCDEs, NRDEs, the linear recurrent unit, S5, and MAMBA on a range of multivariate time series datasets with up to \(50{,}000\) observations.
Neural controlled differential equations (NCDEs) offer a number of advantages for modelling real-world multivariate time series. These include being robust to irregular sampling rates and decoupling the number of forward passes through their neural network from the number of observations in the time series. However, there exists a gap in performance between NCDEs and current state-of-the-art approaches for time series modelling, such as S5, the linear recurrent unit (LRU), and MAMBA [1]–[3].
This paper demonstrates that, on a range of multivariate time series benchmarks, NCDEs can outperform current state-of-the-art approaches by utilising a tool from the study of rough paths, the Log-ODE method. We refer to this new approach as Log-NCDEs.
Let \(\{(t_i,x_i)\}_{i=0}^n\) denote a set of observations from a multivariate time series, where \(x_i\in\mathbb{R}^{v-1}\). Let \(X:[t_0,t_n]\rightarrow \mathbb{R}^{v}\) be a continuous interpolation, such that \(X_{t_i}=(t_i, x_i)\), where subscript on time-dependent variables denotes evaluation. Let \(h_t\in\mathbb{R}^u\) and \(z_t\in\mathbb{R}^w\) be the NCDE’s hidden state and output at time \(t\) respectively. Let \(\xi_{\phi}:\mathbb{R}^{v}\rightarrow\mathbb{R}^u\) and \(f_{\theta}:\mathbb{R}^u\rightarrow\mathbb{R}^{u \times v}\) be neural networks, and \(l_{\psi}:\mathbb{R}^u\rightarrow\mathbb{R}^{w}\) be a linear map, where \(\phi\), \(\theta\), and \(\psi\) represent the learnable parameters. A NCDE is defined by \[\begin{align} \label{eq:ncde} h_{t_0} &= \xi_{\phi}(t_0, x_0), \\ h_t &= h_{t_0} + \int_{t_0}^t f_{\theta}(h_s)\text{d} X_s, \\ z_t&=l_{\psi}(h_t), \end{align}\tag{1}\] for \(t\in[t_0,t_n]\), where \(f_{\theta}(h_s)\text{d} X_s\) is matrix-vector multiplication [4]. Details on the regularity required for existence and uniqueness of the solution to 1 can be found in Appendix 8.1. By an extension of the Picard-Lindelöf Theorem, a sufficient condition is \(X\) being of bounded variation and \(f_{\theta}\) being Lipschitz continuous [5].
NCDEs are an attractive option for modelling multivariate time series. They are universal approximators of continuous real-valued functions on time series data [6]. Additionally, since they interact with time series data through \(X\), NCDEs are unaware of when the time series was observed. This makes them robust to irregular sampling rates. Furthermore, the number of forward passes through \(f_{\theta}\) when evaluating 1 is controlled by the differential equation solver. This is opposed to recurrent models, where it is controlled by the number of observations \(n\). By decoupling the number of forward passes through their neural network from the number of observations in the time series, NCDEs can mitigate exploding or vanishing gradients on highly-sampled time series.
To make use of the techniques developed for training neural ordinary differential equations (ODEs) [7], NCDEs are typically rewritten as an ODE, \[\label{eq:ncde95ode} \tilde{h}_t = \tilde{h}_{t_0} + \int_{t_0}^t g_{\theta, X}(\tilde{h}_s)\text{d}s,\tag{2}\] where \(\tilde{h}_{t_0}=h_{t_0}\). [4] proposed taking \(X\) to be a differentiable interpolation and \[\label{eq:cde95diff} g_{\theta, X}(\tilde{h}_s) = f_{\theta}(\tilde{h}_s)\frac{\text{d} X}{\text{d} s}.\tag{3}\]
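As an illustrative sketch, assume a piecewise-linear interpolation, so that \(\frac{\text{d} X}{\text{d} s}\) is constant between observations, and an explicit Euler solver with one step per observation interval. Under these assumptions, 2 and 3 reduce to the update below; the two-layer vector field and its dimensions are placeholders rather than the architectures used in the experiments.

```python
import jax
import jax.numpy as jnp

def f_theta(params, h):
    # Illustrative two-layer CDE vector field f_theta : R^u -> R^{u x v}.
    W1, b1, W2, b2 = params
    z = jax.nn.silu(W1 @ h + b1)
    return (W2 @ z + b2).reshape(h.shape[0], -1)

def ncde_euler(params, h0, xs):
    # xs stacks the interpolation points (t_i, x_i), shape (n + 1, v).
    # With a linear interpolation, one Euler step of equation 2 per observation
    # interval is h <- h + f_theta(h) (X_{t_{i+1}} - X_{t_i}).
    def step(h, dX):
        h_next = h + f_theta(params, h) @ dX
        return h_next, None
    h_final, _ = jax.lax.scan(step, h0, jnp.diff(xs, axis=0))
    return h_final
```

In practice an adaptive or coarser-stepping solver replaces the fixed Euler steps, which is what allows the number of vector field evaluations to be controlled by the solver rather than by the number of observations, as discussed above.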
Neural rough differential equations (NRDEs) are based on the Log-ODE method, which is discussed briefly here, and in full detail in Section 2.
Given a set of intervals \(\{[r_i, r_{i+1}]\}_{i=0}^{m-1}\) satisfying \(t_0=r_0<\ldots< r_m=t_n\), the Log-ODE method replaces a CDE with an ODE on each interval. For a depth\(-N\) method, the ODE on \([r_i,r_{i+1}]\) is defined by \[\label{eq:logode} g_{\theta,X}(\tilde{h}_s) = \bar{f}_{\theta}\left(\tilde{h}_s\right)\frac{\log(S^{N}(X)_{[r_i,r_{i+1}]})}{r_{i+1}-r_i},\tag{4}\] where \(\bar{f}_{\theta}\) is constructed using the iterated Lie brackets of \(f_{\theta}\) and \(\log(S^{N}(X)_{[r_i,r_{i+1}]})\) is the depth-\(N\) truncated log-signature of X over \([r_i,r_{i+1}]\) [8]. Informally, \(\bar{f}_{\theta}\) is a high order description of the vector field \(f_{\theta}\) and \(\log(S^{N}(X)_{[r_i,r_{i+1}]})\) is a high order description of the control path \(X\) over \([r_i,r_{i+1}]\). Note that when using 3 , \(\tilde{h}_t\) is exactly \(h_t\) for all \(t\in[t_0,t_n]\), whereas when using 4 , \(\tilde{h}_t\) is an approximation of \(h_t\) when \(t\in\{r_i\}_{i=1}^m\).
NRDEs replace 3 with 4 , but instead of calculating \(\bar{f}_{\theta}\) using the iterated Lie brackets of \(f_{\theta}\), it is treated as a neural network \(\bar{f}_{\theta}:\mathbb{R}^u\rightarrow\mathbb{R}^{u\times \beta(v,N)}\), where \(\beta(v,N)\) is the dimension of a depth\(-N\) truncated log-signature of a \(v\) dimensional path. By neglecting the structure of \(\bar{f}_{\theta}\), NRDEs are able to reduce the computational burden of evaluating the vector field, at the cost of increasing the output dimension of the neural network.
Compared to NCDEs, NRDEs can reduce the number of forward passes through the network while evaluating the model, as the vector field is autonomous on each interval \([r_i,r_{i+1}]\). This has been shown to lead to improved classification accuracy, alongside reduced run time and memory usage, on time series with up to \(17{,}000\) observations [9]. Furthermore, as it is no longer necessary to apply a differentiable interpolation to the time series data, NRDEs are applicable to a wider range of input signals.
This paper introduces Log-NCDEs, which build on NRDEs by constructing \(\bar{f}_{\theta}\) using the iterated Lie brackets of a NCDE’s vector field, \(f_{\theta}\). For depths \(N\geq2\), this change drastically reduces the output dimension of the model’s neural network, without impacting the model’s expressivity. Furthermore, it improves model performance on every dataset considered in this paper. Calculating the Lie brackets requires that \(f_{\theta}\) satisfies a regularity constraint, specifically being \(\mathrm{Lip}(\gamma)\) for \(\gamma\in(N-1,N]\). Section 3.2 presents a novel theoretical result bounding the \(\mathrm{Lip}(\gamma)-\)norm for a class of fully connected neural networks when \(1<\gamma\leq 2\). Following this, Section 3.3 details how to efficiently calculate the Lie brackets of a \(\mathrm{Lip}(\gamma)\) neural network using standard machine learning tools. The paper concludes by showing that, over a range of multivariate time series benchmarks, Log-NCDEs outperform NCDEs, NRDEs, S5, LRU, and MAMBA.
The following section provides a thorough mathematical description of the Log-ODE method. It will introduce \(\mathrm{Lip}(\gamma)\) regularity, the Lie bracket of two vector fields, and the signature and log-signature of a path, alongside their respective spaces, the tensor algebra and the free Lie algebra.
Definition 1. Let \(U\), \(V\), and \(W\) be vector spaces. The tensor product space \(U\otimes V\) is the unique (up to isomorphism) space such that for all bilinear functions \(\kappa:U\times V \rightarrow W\) there exists a unique linear map \(\tau:U\otimes V \rightarrow W\), such that \(\kappa = \tau \circ \otimes\) [10].
As an example, let \(V=\mathbb{R}^2\) and \(W=\mathbb{R}^3\). In this case, the tensor product is the outer product of the two vectors, and the resulting tensor product space is the space of \(2\times 3\) matrices, \(\mathbb{R}^2\otimes\mathbb{R}^3 = \mathbb{R}^{2\times 3}\). The tensor product of \(v\in \mathbb{R}^2\) and \(w\in \mathbb{R}^3\) is defined by \[v\otimes w = \begin{bmatrix} a \\ b \end{bmatrix} \otimes \begin{bmatrix} c \\ d \\ e \end{bmatrix} = \begin{bmatrix} ac & ad & ae \\ bc & bd & be \end{bmatrix},\] where any bilinear function \(\kappa(v,w)\) can be written as a linear function \(\tau(v\otimes w)\).
Definition 2. The tensor algebra space is the space \[T((V)) = \{x=(x^0,x^1,\ldots) | x^i \in V^{\otimes i}\},\] where \(V^{\otimes 0}=\mathbb{R}\), \(V^{\otimes 1} = V\), and \(V^{\otimes j} = V \otimes V^{\otimes j-1}\) [10].
Details on the choice of norm for \(V^{\otimes i}\) when \(V\) is a complete normed vector space, i.e. a Banach space, can be found in Appendix 8.2.
Let \(V\) and \(W\) be Banach spaces and \(\mathbf{L}(V, W)\) denote the space of linear mappings from \(V\) to \(W\) equipped with the operator norm.
Definition 3. A linear map \(l\in\mathbf{L}(V^{\otimes j}, W)\) is \(j\) symmetric if for all \(v_1 \otimes \cdots \otimes v_j \in V^{\otimes j}\) and all bijective functions \(p:\{1, \ldots, j\}\rightarrow \{1, \ldots, j\}\) [10], \[l(v_1 \otimes \ldots \otimes v_j) = l(v_{p(1)} \otimes \cdots \otimes v_{p(j)}).\] Let \(\mathbf{L}_s(V^{\otimes j}, W)\) denote the set of all \(j\) symmetric linear maps.
Definition 4. Let \(\gamma>0\), \(k\in\mathbb{Z}\) such that \(\gamma \in (k, k+1]\), \(F\) be a closed subset of \(V\), and \(f^0:F\rightarrow W\). For \(j\in\{1,\ldots,k\}\), let \(f^j:F\rightarrow \mathbf{L}_s(V^{\otimes j}, W)\). The collection \((f^0, f^1, \ldots, f^k)\) is an element of \(\text{Lip}(\gamma)\) if there exists \(M\geq0\) such that for \(j\in\{0,\ldots,k\}\), \[\label{eq:lipbound1} \sup_{x\in F}||f^j(x)||_{\mathbf{L}(V^{\otimes j}, W)} \leq M,\tag{5}\] and for \(j\in\{0,\ldots,k\}\), all \(x,y \in F\), and each \(v \in V^{\otimes j}\) [11], \[\label{eq:lipbound2} \frac{\Big|\Big|f^j(y)(v)-\sum_{l=0}^{k-j}\frac{f^{j+l}(x)(v \otimes (y-x)^{\otimes l})}{l!}\Big|\Big|_{W}}{|x-y|_V^{\gamma-j}} \leq M.\tag{6}\]
If \(f=(f^0, f^1, \ldots, f^k)\in\text{Lip}(\gamma)\), then the \(\text{Lip}(\gamma)-\)norm, denoted \(||f||_{\text{Lip}(\gamma)}\), is the smallest \(M\) for which 5 and 6 hold. When \(0<\gamma \leq 1\), then \(k=0\) and \(f^0\in\text{Lip}(\gamma)\) implies \(f^0\) is bounded and \(\gamma-\)Hölder continuous. When \(\gamma=1\), then \(f^0\) is bounded and Lipschitz. In this paper, we are concerned with the regularity of vector fields on \(\mathbb{R}^u\). In this case, \(f\in\mathrm{Lip}(\gamma)\) for \(\gamma \in (k, k+1]\) implies that \(f\) is bounded, has \(k\) bounded derivatives, and the \(k^{\text{th}}\) derivative satisfies \[||D^kf(y)-D^kf(x)|| \leq M|x-y|^{\gamma-k},\] for all \(x,y\in\mathbb{R}^u\).
Definition 5. A Lie algebra is a vector space \(V\) with a bilinear map \([\cdot,\cdot]:V\times V\rightarrow V\) satisfying \([w,w]=0\) and the Jacobi identity, \[[[x,y],z]+[[y,z],x]+[[z,x],y] = 0,\] for all \(w,x,y,z\in V\). The map \([\cdot,\cdot]\) is called the Lie bracket [12].
Any associative algebra, \((V,\times)\), has a Lie bracket structure with Lie bracket defined by \[[x, y] = x\times y - y\times x,\] for all \(x,y \in V\). For example, \(V=\mathbb{R}^{n\times n}\) with the matrix product. For another example, consider the vector space of infinitely differentiable functions from \(\mathbb{R}^u\) to \(\mathbb{R}^u\) with pointwise addition, denoted \(C^{\infty}(\mathbb{R}^u,\mathbb{R}^u)\). This space is a Lie algebra when equipped with the Lie bracket \[[a, b](x) = J_b(x)a(x) - J_a(x)b(x),\] for \(a,b\in C^{\infty}(\mathbb{R}^u,\mathbb{R}^u)\) and \(x\in\mathbb{R}^u\), where \(J_a(x)\in\mathbb{R}^{u\times u}\) is the Jacobian of \(a\) with entries given by \(J_{a}^{ij}(x)=\partial_ja^i(x)\) for \(i,j\in\{1,\ldots,u\}\) [13].
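Because this Lie bracket only involves Jacobian-vector products, it can be evaluated without materialising the Jacobians. The following is a minimal JAX sketch of the identity above; the two example fields are arbitrary.

```python
import jax
import jax.numpy as jnp

def lie_bracket(a, b):
    """Return the vector field [a, b](x) = J_b(x) a(x) - J_a(x) b(x)."""
    def bracket(x):
        _, Jb_a = jax.jvp(b, (x,), (a(x),))  # J_b(x) a(x)
        _, Ja_b = jax.jvp(a, (x,), (b(x),))  # J_a(x) b(x)
        return Jb_a - Ja_b
    return bracket

# Example: two smooth vector fields on R^2.
a = lambda x: jnp.array([x[1], -x[0]])
b = lambda x: jnp.array([x[0] ** 2, x[0] * x[1]])
print(lie_bracket(a, b)(jnp.array([1.0, 2.0])))
```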
Definition 6. Let \(A\) be a non-empty set, \(L_0\) be a Lie algebra, and \(\phi:A\rightarrow L_0\) be a map. The Lie algebra \(L_0\) is said to be the free Lie algebra generated by \(A\) if for any Lie algebra \(L\) and any map \(f:A\rightarrow L\), there exists a unique Lie algebra homomorphism \(g:L_0\rightarrow L\) such that \(g \circ \phi = f\) [12].
The free Lie algebra generated by \(V\) is the space \[\mathfrak{L}((V)) = \{(l_0,l_1,\ldots):l_i\in L_i\},\] where \(L_0=0\), \(L_1=V\), and \(L_{i+1}\) is the span of \([v,l]\) for \(v\in V\) and \(l\in L_i\).
Let \(X:[t_0,t_n]\rightarrow V\) have bounded variation and define \[X^n_{[t_0,t_n]} = \underbrace{\int\cdots\int}_{\substack{u_1<\cdots<u_n \\ u_i\in [t_0,t_n]}}\text{d}X_{u_1}\otimes \cdots \otimes \text{d}X_{u_n} \in V^{\otimes n}.\] The signature of the path \(X\) is \[S(X)_{[t_0,t_n]}=(1,X^1_{[t_0,t_n]},\ldots,X^n_{[t_0,t_n]},\ldots) \in \tilde{T}((V)),\] where \(\tilde{T}((V))=\{x\in T((V)) | x=(1,\ldots)\}\) [5]. The depth-\(N\) truncated signature is defined as \(S^N(X)_{[t_0,t_n]}=(1,X^1_{[t_0,t_n]},\ldots,X^N_{[t_0,t_n]}) \in \tilde{T}^N(V)\). The signature is an infinite dimensional vector which describes the path \(X\) over the interval \([t_0,t_n]\). In fact, assuming \(X\) contains time as a channel, linear maps on \(S(X)_{[t_0,t_n]}\) are universal approximators for continuous, real-valued functions of \(X\) [14], [15]. This property of the signature relies on the shuffle-product identity, which states that polynomials of the elements in a truncated signature can be rewritten as linear maps on the signature truncated at greater depth [16], [17]. A consequence of the shuffle-product identity is that not every term in the signature provides new information about the path \(X\). The transformation which removes the information redundancy is the logarithm.
Definition 7. For \(\mathbf{x}\in\tilde{T}((V))\), the logarithm is defined by \[\log(\mathbf{x}) = \log(1+\mathbf{t}) = \sum_{n=1}^\infty \frac{(-1)^{n+1}}{n}\mathbf{t}^{\otimes n},\] where \(\mathbf{t}=(0,x^1,x^2,\ldots)\) [12].
It was shown by [18] that \[\log(S(X)_{[t_0,t_n]}) \in \mathfrak{L}((V)).\] This result plays a crucial role in the Log-ODE method.
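At depth \(N=2\), these objects admit a short concrete implementation. The signature of a piecewise-linear path can be built one linear segment at a time using the tensor algebra product (Chen's identity, see Appendix 8.2), and the truncated logarithm reduces to \(\log(\mathbf{x})^1 = x^1\) and \(\log(\mathbf{x})^2 = x^2 - \frac{1}{2}x^1\otimes x^1\). The sketch below is illustrative and is not the log-signature routine used in the experiments.

```python
import jax.numpy as jnp

def depth2_signature(path):
    """Depth-2 truncated signature of a piecewise-linear path, path of shape (n + 1, v)."""
    v = path.shape[1]
    sig1 = jnp.zeros(v)
    sig2 = jnp.zeros((v, v))
    for delta in jnp.diff(path, axis=0):
        seg1, seg2 = delta, 0.5 * jnp.outer(delta, delta)  # segment signature (1, delta, delta⊗delta/2)
        sig2 = sig2 + jnp.outer(sig1, seg1) + seg2          # Chen's identity at level 2
        sig1 = sig1 + seg1                                  # Chen's identity at level 1
    return sig1, sig2

def depth2_logsignature(path):
    """Depth-2 truncated log-signature: (X^1, X^2 - 0.5 X^1 ⊗ X^1)."""
    sig1, sig2 = depth2_signature(path)
    return sig1, sig2 - 0.5 * jnp.outer(sig1, sig1)
```

Consistent with the result above, the level-two part of the depth-\(2\) log-signature is antisymmetric, so its independent entries correspond to the basis elements \([e_i,e_j]\), \(i<j\), of \(\mathfrak{L}^2(\mathbb{R}^v)\).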
The vector field of a CDE is typically thought of as a matrix-valued function \(f:\mathbb{R}^u\rightarrow\mathbb{R}^{u\times v}\). An equivalent formulation is \(f\) being a linear map acting on \(\text{d}X\in\mathbb{R}^{v}\) and returning a vector field on \(\mathbb{R}^u\). In other words, \(f(\cdot)\text{d}X:\mathbb{R}^u\rightarrow\mathbb{R}^u\). This formulation will prove more useful in the following section.
Let \(X:[t_0,t_n]\rightarrow\mathbb{R}^v\) be a continuous path. The (truncated) log-signature is a map which takes \(X\) to the (truncated) free Lie algebra \(\mathfrak{L}^N(\mathbb{R}^v)\). If \(f(\cdot)\text{d}X\) is restricted to smooth vector fields on \(\mathbb{R}^u\), then \(f(\cdot)\text{d}X\) is a linear map from \(\mathbb{R}^v\) to the Lie algebra \(C^{\infty}(\mathbb{R}^u, \mathbb{R}^u)\). By definition 6, there exists a unique linear map \(\bar{f}\) from \(\mathfrak{L}^N(\mathbb{R}^v)\) to the smooth vector fields on \(\mathbb{R}^u\). Figure 1 is a schematic diagram of this relationship. Since \(\bar{f}\) is a Lie algebra homomorphism, it can be defined recursively by \[\label{eq:logoderecurs1} \bar{f}(\cdot)a = f(\cdot)a, \;\; a\in \mathbb{R}^v\tag{7}\] and \[\label{eq:logoderecurs2} \bar{f}(\cdot)[a,b] = [\bar{f}(\cdot)a,\bar{f}(\cdot)b]\tag{8}\] for \([a,b]\in\mathfrak{L}^N(\mathbb{R}^v)\).
Over an interval, the Log-ODE method approximates a CDE using an autonomous ODE constructed by applying the linear map \(\bar{f}\) to the truncated log-signature of the control, as seen in 9. There exist theoretical results bounding the error in the Log-ODE method’s approximation, including when the control and solution paths live in infinite dimensional Banach spaces [8]. However, for a given set of intervals, the series of vector fields \(\{g_{X}(\cdot)\}_{N=1}^{\infty}\) is not guaranteed to converge. In practice, \(N\) is typically chosen as the smallest \(N\) such that a reasonably sized set of intervals \(\{r_i\}_{i=0}^m\) gives an approximation error of the desired level. A recent development has been the introduction of an algorithm which adaptively updates \(N\) and \(\{r_i\}_{i=0}^m\) [19].
Log-NCDEs use the same underlying model as NRDEs \[\label{eq:Log-NCDE} \begin{align} h_t &= h_{t_0} + \int_{t_0}^t g_{\theta, X}(h_s)\text{d}s, \\ g_{\theta, X}(h_s) &= \bar{f}_{\theta}\left(h_s\right)\frac{\log(S^{N}(X)_{[r_i,r_{i+1}]})}{r_{i+1}-r_i}, \end{align}\tag{9}\] but with two major changes. First, instead of parameterising \(\bar{f}_{\theta}\) using a neural network, it is constructed using the iterated Lie brackets of a NCDE’s neural network, \(f_{\theta}\), via 7 and 8 . Second, \(f_{\theta}\) is ensured to be a \(\mathrm{Lip}(\gamma)\) function for \(\gamma\in(N-1,N]\). Figure 2 is a schematic diagram of a Log-NCDE.
These changes have a major benefit. For \(N>1\), Log-NCDEs are exploring a drastically smaller output space during training than NRDEs, while maintaining the same expressivity, as NCDEs are universal approximators. This is because the output dimension of \(f_{\theta}\) is \(u \times v\), whereas the output dimension of \(\bar{f}_{\theta}\) is \(u\times \beta(v,N)\). Figure 3 compares these values for paths of dimension \(v\) from \(1\) to \(15\) and truncation depths of \(N=1\) and \(N=2\). This benefit comes at the cost of needing to calculate the iterated Lie brackets when evaluating Log-NCDEs, which will be quantified in Section 3.4 and explored empirically in Section 5.2.
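For reference, \(\beta(v,N)\) is given by Witt's formula for the graded dimensions of the free Lie algebra, \(\beta(v,N)=\sum_{k=1}^{N}\frac{1}{k}\sum_{d\mid k}\mu(d)v^{k/d}\), where \(\mu\) is the Möbius function. The short sketch below reproduces the comparison of output dimensions numerically; the values of \(u\), \(v\), and \(N\) are arbitrary examples.

```python
def mobius(n):
    """Möbius function via trial factorisation (sufficient for small n)."""
    if n == 1:
        return 1
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # squared prime factor
            result = -result
        p += 1
    return -result if n > 1 else result

def beta(v, N):
    """Dimension of the depth-N truncated log-signature of a v-dimensional path (Witt's formula)."""
    return sum(
        sum(mobius(d) * v ** (k // d) for d in range(1, k + 1) if k % d == 0) // k
        for k in range(1, N + 1)
    )

# Output dimension of the NCDE/Log-NCDE field (u * v) vs. the NRDE field (u * beta(v, N)).
u, v, N = 64, 6, 2
print(u * v, u * beta(v, N))  # 384 vs. 1344
```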
When \(N=1\), 9 simplifies to \[\label{eq:Log-NCDE95N1} g_{\theta, X}(h_s) = f_{\theta}\left(h_s\right)\frac{X_{r_{i+1}}-X_{r_i}}{r_{i+1}-r_i}.\tag{10}\] Hence, in this case the only difference between Log-NCDEs and NRDEs is the regularisation of \(f_{\theta}\). Furthermore, 10 and 3 are equivalent when \(X\) is a linear interpolation. Therefore, the approaches of NCDEs, NRDEs, and Log-NCDEs coincide when using a depth\(-1\) Log-ODE approximation [9].
The composition of two \(\mathrm{Lip}(\gamma)\) functions is \(\mathrm{Lip}(\gamma)\) [20]. Hence, a simple approach to ensuring a fully connected neural network (FCNN) is \(\mathrm{Lip}(\gamma)\) is to make each layer \(\mathrm{Lip}(\gamma)\). This can be achieved by choosing an infinitely differentiable activation function, such as \(\text{SiLU}\) [21]. However, in practice this alone may not be enough to control the size of the \(\mathrm{Lip}(\gamma)-\)norm, as demonstrated by Theorem 1.
Theorem 1. Let \(f_{\theta}\) be a FCNN with input dimension \(n_{in}\), hidden dimension \(n_h\), depth \(m\), and activation function \(\text{SiLU}\). Assume the input \(\mathbf{x}=[x_1,\ldots,x_{n_{in}}]^T\) satisfies \(|x_j|\leq1\) for \(j=1,\ldots,n_{in}\). Then \(f_{\theta}\in\mathrm{Lip}(2)\) and \[\label{eq:lipnn} ||f_{\theta}||_{\text{Lip}(2)} \leq CP_{m!}\left(\{||W^i||_2, ||\mathbf{b}^i||_2\}_{i=1}^m\right)\tag{11}\] where \(C\) is a constant depending on \(n_{in}, n_h,\) and \(m\), \(\{W^i\}_{i=1}^m\) and \(\{\mathbf{b}^i\}_{i=1}^m\) are the weights and biases of the \(i^{\text{th}}\) layer of \(f_{\theta}\), and \(P_{m!}\) is a polynomial of order \(m!\).
Proof. A proof is given in Appendix 9. ◻
Assuming that each layer \(\{L^i\}_{i=1}^m\) of \(f_{\theta}\) satisfies \(||L^i||_{\text{Lip}(2)}=1\), an explicit evaluation of 11 gives \[||f_{\theta}||_{\text{Lip}(2)} \leq 5^{2^{m-1}-1}.\] For a depth \(7\) FCNN, this bound is greater than the maximum value of a single precision floating point number. Hence, it may be necessary to control \(||f_{\theta}||_{\text{Lip}(2)}\) explicitly during training. This is achieved by modifying the neural network’s loss function \(L\) to \[\label{eq:weightpenalty} L \mapsto L + \lambda\left(\sum_{i=1}^m||W^i||_2 + ||\mathbf{b}^i||_2\right),\tag{12}\] where \(\lambda\) is a hyperparameter controlling the weight of the penalty. This is an example of weight regularisation, which has long been understood to improve generalisation in NNs [22], [23]. Equation 12 is specifically a variation of spectral norm regularisation [24].
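A minimal sketch of the penalty in 12, assuming the parameters are available as a list of (weight, bias) pairs and interpreting \(||W^i||_2\) as the spectral norm and \(||\mathbf{b}^i||_2\) as the Euclidean norm.

```python
import jax.numpy as jnp

def lip_penalty(layers, lam):
    # Penalty term of equation 12: lam * sum_i (||W^i||_2 + ||b^i||_2).
    total = 0.0
    for W, b in layers:
        total = total + jnp.linalg.norm(W, ord=2) + jnp.linalg.norm(b)  # spectral norm + Euclidean norm
    return lam * total

# Hypothetical usage: loss = task_loss(params, batch) + lip_penalty(layer_params, lam=1e-4)
```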
The linear map \(\bar{f}_{\theta}\) in 9 is defined recursively by 7 and 8 . Assuming \(f_{\theta}(\cdot)a\) is infinitely differentiable, then \(f_{\theta}(\cdot)a\) is an element of the Lie algebra \(C^{\infty}(\mathbb{R}^u, \mathbb{R}^u)\) and \[[f_{\theta}(\cdot)a,f_{\theta}(\cdot)b] = J_{f_{\theta}(\cdot)a}f_{\theta}(\cdot)b - J_{f_{\theta}(\cdot)b}f_{\theta}(\cdot)a,\] as discussed in Section 2.3. Let \(\{e_j\}_{j=1}^{v}\) be the usual basis of \(\mathbb{R}^{v}\). A choice of basis for \(\mathfrak{L}^N(\mathbb{R}^{v})\) is a Hall basis, denoted \(\{\hat{e}_k\}_{k=1}^{\beta(v, N)}\), which is a specific subset of up to the \((N-1)^{\text{th}}\) iterated Lie brackets of \(\{e_j\}_{j=1}^{v}\) [25]. Rewriting 9 using a Hall basis, \[\label{eq:hall} \bar{f}_{\theta}\left(h_s\right)\frac{\log(S^{N}(X)_{[r_i,r_{i+1}]})}{r_{i+1}-r_i} = \sum_{k=1}^{\beta(v, N)}\lambda_k \bar{f}_{\theta}(h_s)\hat{e}_k,\tag{13}\] where \(\lambda_k\) is the term in the scaled log-signature corresponding to the basis element \(\hat{e}_k\). Since each \(\hat{e}_k\) can be written as iterated Lie brackets of \(\{e_j\}_{j=1}^{v}\), it is possible to replace \(\bar{f}_{\theta}(\cdot)\hat{e}_k\) with the iterated Lie brackets of \(f_{\theta}(\cdot)e_i\) using 7 and 8 . Each \(f_{\theta}(\cdot)e_i:\mathbb{R}^u \rightarrow \mathbb{R}^u\) is a vector field defined by the \(i^{\text{th}}\) column of the neural network’s output. Hence, \(g_{\theta, X}\) can be evaluated at a point using iterated Jacobian-vector products (JVPs) of \(f_{\theta}\).
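For \(N=2\), a Hall basis of \(\mathfrak{L}^2(\mathbb{R}^v)\) consists of the \(e_i\) together with the brackets \([e_i,e_j]\) for \(i<j\), so 13 becomes a weighted sum of the columns of \(f_{\theta}\) and their pairwise Lie brackets. The sketch below is illustrative rather than the exact implementation; it assumes the scaled log-signature is supplied as its level-one part (shape \((v,)\)) and an antisymmetric level-two part (shape \((v,v)\)) whose \((i,j)\) entry, \(i<j\), is the coefficient of \([e_i,e_j]\).

```python
import jax
import jax.numpy as jnp

def column_field(f_theta, i):
    # f_theta(h) has shape (u, v); its i-th column is a vector field on R^u.
    return lambda h: f_theta(h)[:, i]

def g_theta(f_theta, h, logsig1, logsig2):
    """Depth-2 Log-ODE vector field: sum_k lambda_k * bar{f}_theta(h) e_hat_k."""
    v = logsig1.shape[0]
    out = f_theta(h) @ logsig1                       # first-level Hall elements e_i
    for i in range(v):
        for j in range(i + 1, v):
            fi, fj = column_field(f_theta, i), column_field(f_theta, j)
            _, Jj_fi = jax.jvp(fj, (h,), (fi(h),))   # J_{f e_j}(h) f(h) e_i
            _, Ji_fj = jax.jvp(fi, (h,), (fj(h),))   # J_{f e_i}(h) f(h) e_j
            out = out + logsig2[i, j] * (Jj_fi - Ji_fj)  # [f e_i, f e_j](h)
    return out
```

Both Jacobian-vector products in each bracket are taken at the same point \(h_s\), which is what allows the batching discussed in Section 3.4.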
When the signature truncation depth \(N\) is greater than \(1\), NRDEs and Log-NCDEs incur an additional computational cost for each evaluation of the vector field, which we quantify here
for \(N=2\). Assume that a NCDE, NRDE, and Log-NCDE are all using an identical FCNN as their vector field, except for the dimension of the final layer in the NRDE. Let \(m\) and \(n_h\) be the depth and dimension of the FCNN’s hidden layers respectively, and \(u\) and \(v\) be the dimension of \(h_t\) and
\(X\) from 1 respectively. Letting \(F_{\text{x}}\) be the number of FLOPs to evaluate model x’s vector field, \[\begin{align}
F_{\text{NCDE}}&= 2un_h + 2(m-1)n_h^2 + 2uvn_h,\\ F_{\text{NRDE}}&= 2un_h + 2(m-1)n_h^2 + u(v^2-v)n_h,\\ F_{\text{Log-NCDE}}&= 3vF_{\text{NCDE}}, \end{align}\] where the number of FLOPs to calculate a JVP is \(3\) times that of evaluating the FCNN and \(v\) JVPs of \(f_{\theta}\) are needed to evaluate 13 when \(N=2\) [26]. Log-NCDEs and NRDEs have the same asymptotic computational complexity in each variable. However, each JVP
is evaluated at the same point \(h_s\). This allows the computational burden of Log-NCDEs on high-dimensional time series to be reduced by constructing a batched function using Jax’s vmap [27]. The computational cost is evaluated empirically in Section 5.2.
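As a sketch of this batching, all the required tangent directions can be pushed through a single vmap-ed Jacobian-vector product sharing the primal point \(h_s\); stacking the directions row-wise is an assumption of the example.

```python
import jax

def batched_jvp(field, h, tangents):
    # tangents: array of shape (num_directions, u); every JVP shares the primal point h.
    return jax.vmap(lambda t: jax.jvp(field, (h,), (t,))[1])(tangents)
```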
In this paper, we only consider Log-NCDEs which use a depth\(-1\) or depth\(-2\) Log-ODE approximation. This is due to the following two limitations. First, there are no theoretical results explicitly bounding the \(\mathrm{Lip}(\gamma)\) norm of a neural network for \(\gamma>2\). Second, the computational cost required to evaluate \(g_{\theta, X}\) grows rapidly with the depth \(N\). This can make \(N>2\) computationally infeasible, especially for high-dimensional time series. Another general limitation of NCDEs is the need to solve the differential equation recursively, preventing parallelisation. This is in contrast to non-selective structured state-space models, whose underlying model is a differential equation that can be solved parallel in time [28].
Log-NCDEs are compared against six models, which represent the state-of-the-art for a range of deep learning approaches to time series modelling. Four of these models are stacked recurrent models, which use the same general architecture, but with different recurrent layers. The architecture used in this paper is based on the one introduced by [1] and the four different recurrent layers considered are LRU, S5, MAMBA, and S6, the selective state-space recurrence introduced as a component of MAMBA [1]–[3]. The other two baseline models are continuous models; a NCDE using a Hermite cubic spline with backward differences as the interpolation and a NRDE [4], [9]. Further details on all model architectures can be found in Appendices 10.1 and 10.2.
We construct a toy dataset of \(100{,}000\) time series with \(6\) channels and \(100\) regularly spaced samples each. For every time step, the change in each channel is sampled independently from the discrete probability distribution with probability mass function \[p(n) = \int_{n-0.5}^{n+0.5}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}\text{d}x,\] where \(n\in\mathbb{Z}\). In other words, the change in a channel at each time step is a sample from a standard normal distribution rounded to the nearest integer. Figure 4 is a plot of a sample path from the toy dataset.
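A minimal sketch of this generation procedure with an arbitrary seed; each path starts at the origin and accumulates integer-rounded standard normal increments, with the initial point stored in addition to the \(100\) sampled steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n_series, n_steps, n_channels = 100_000, 100, 6

# Change in each channel at each step: a standard normal sample rounded to the nearest integer.
increments = np.rint(rng.standard_normal((n_series, n_steps, n_channels)))
paths = np.concatenate(
    [np.zeros((n_series, 1, n_channels)), np.cumsum(increments, axis=1)], axis=1
)  # shape (n_series, n_steps + 1, n_channels)
```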
We consider four different binary classifications on the toy dataset. Each classification is a specific term in the signature of the path which depends on a different number of channels.
1. Was the change in the third channel, \(\int_0^1\text{d}X^3_s\), greater than zero?
2. Was the area integral of the third and sixth channels, \(\int_0^1\int_0^u\text{d}X^3_s\text{d}X^6_u\), greater than zero?
3. Was the volume integral of the third, sixth, and first channels, \(\int_0^1\int_0^v\int_0^u\text{d}X^3_s\text{d}X^6_u\text{d}X^1_v\), greater than zero?
4. Was the \(4\)D volume integral of the third, sixth, first, and fourth channels, \(\int_0^1\int_0^w\int_0^v\int_0^u\text{d}X^3_s\text{d}X^6_u\text{d}X^1_v\text{d}X^4_w\), greater than zero?
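Because the interpolated toy paths are piecewise linear, each label is a sign check on an exactly computable signature coefficient, which can be evaluated segment by segment with Chen's identity. The sketch below is illustrative; channels are zero-indexed and `path` is a single array of shape \((101, 6)\) generated as above.

```python
import math
import numpy as np

def iterated_integral(path, channels):
    # Signature coefficient of the word `channels` for a piecewise-linear path (Chen's identity).
    k = len(channels)
    prefix = np.zeros(k + 1)
    prefix[0] = 1.0  # prefix[j] holds the iterated integral of the first j letters so far
    for delta in np.diff(path, axis=0):
        new = prefix.copy()
        for j in range(1, k + 1):
            # Split the word at position i: value over the path so far for the first i letters
            # times the linear segment's contribution for the remaining j - i letters.
            new[j] = sum(
                prefix[i] * math.prod(delta[c] for c in channels[i:j]) / math.factorial(j - i)
                for i in range(j + 1)
            )
        prefix = new
    return prefix[k]

path = np.concatenate(
    [np.zeros((1, 6)), np.cumsum(np.rint(np.random.default_rng(0).standard_normal((100, 6))), axis=0)]
)
words = ([2], [2, 5], [2, 5, 0], [2, 5, 0, 3])  # third; third and sixth; etc. (zero-indexed)
labels = [iterated_integral(path, w) > 0 for w in words]
```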
On this dataset, all models use a hidden state of dimension \(64\) and Adam with a learning rate of \(0.0003\) [29]. LRU, S5, S6, and MAMBA use six blocks. NRDE and Log-NCDE take \(r_{i+1}-r_i\) to be \(4\) observations and the signature truncation depth \(N\) to be \(2\). Full hyperparameter choices are given in Appendix 10.3.
The models considered in this paper are evaluated on six datasets from the UEA multivariate time series classification archive (UEA-MTSCA). These six datasets were chosen via the following two criteria. First, only datasets with more than \(200\) total cases were considered. Second, the six datasets with the most observations were chosen, as datasets with many observations have previously proved challenging for deep learning approaches to time series modelling. Further details on the chosen datasets can be found in Appendix 10.4. Following [9], the original train and test cases are combined and resplit into new random train, validation, and test cases using a \(70:15:15\) split.
Hyperparameters for all models are found using a grid search over the validation accuracy on a fixed random split of the data. Full details on the hyperparameter grid search are in Appendix 10.4. Having fixed their hyperparameters, models are compared on their average test set accuracy over five different random splits of the data. In order to compare models on their average GPU memory and run time, \(1000\) steps of training are run on an NVIDIA RTX 4090. The average run time is estimated by combining the time for \(1000\) training steps with the average total number of training steps from the five runs over the random data splits.
PPG-DaLiA is a multivariate time series regression dataset, where the aim is to predict a person’s heart rate using data collected from a wrist-worn device [30]. The dataset consists of fifteen individuals with around \(150\) minutes of recording each at a maximum sampling rate of \(128 \,\mathrm{Hz}\). There are six channels: blood volume pulse, electrodermal activity, body temperature, and three-axis acceleration. The data is split into a train, validation, and test set following a \(70:15:15\) split for each individual. After splitting the data, a sliding window of length \(49920\) and step size \(4992\) is applied.
Hyperparameters are found using the same method as for the UEA-MTSCA, but with validation mean squared error and slightly different hyperparameter choices given the high number of observations. Full details can be found in Appendix 10.4. Having fixed their hyperparameters, models are compared on their average mean squared error over five different runs on the same fixed data split.
Figure 5 compares the performance of the models considered in this paper on the four different toy dataset classifications. As expected, given that the classifications considered are solutions to CDEs, NCDEs are the best performing model. Since NRDEs and Log-NCDEs fix \(r_{i+1}-r_i\) to \(4\) observations and \(N=2\), they are both approximations of a CDE. Notably, Log-NCDEs consistently outperform NRDEs, providing empirical evidence that NRDEs do not always accurately learn the Lie bracket structure of \(\bar{f}_{\theta}\). All of the stacked recurrent models perform well when the label depends on one or two channels. However, their performance begins to decrease for three channels, and only MAMBA performs well when the label depends on four channels.
| Dataset | LRU | S5 | S6 | MAMBA | NCDE | NRDE | Log-NCDE |
|---|---|---|---|---|---|---|---|
| EigenWorms | \(\mathbf{87.8 \pm 2.8}\) | \(81.1 \pm 3.7\) | \(85.0 \pm 16.1\) | \(70.9 \pm 15.8\) | \(75.0 \pm 3.9\) | \(83.9 \pm 7.3\) | \(\underline{85.6 \pm 5.1}\) |
| EthanolConcentration | \(21.5 \pm 2.1\) | \(24.1 \pm 4.3\) | \(26.4 \pm 6.4\) | \(27.9 \pm 4.5\) | \(\underline{29.9 \pm 6.5}\) | \(25.3 \pm 1.8\) | \(\mathbf{34.4 \pm 6.4}\) |
| Heartbeat | \(\mathbf{78.4 \pm 6.7}\) | \(\underline{77.7 \pm 5.5}\) | \(76.5 \pm 8.3\) | \(76.2 \pm 3.8\) | \(73.9 \pm 2.6\) | \(72.9 \pm 4.8\) | \(75.2 \pm 4.6\) |
| MotorImagery | \(48.4 \pm 5.0\) | \(47.7 \pm 5.5\) | \(\underline{51.3 \pm 4.7}\) | \(47.7 \pm 4.5\) | \(49.5 \pm 2.8\) | \(47.0 \pm 5.7\) | \(\mathbf{53.7 \pm 5.3}\) |
| SelfRegulationSCP1 | \(82.6 \pm 3.4\) | \(\mathbf{89.9 \pm 4.6}\) | \(82.8 \pm 2.7\) | \(80.7 \pm 1.4\) | \(79.8 \pm 5.6\) | \(80.9 \pm 2.5\) | \(\underline{83.1 \pm 2.8}\) |
| SelfRegulationSCP2 | \(51.2 \pm 3.6\) | \(50.5 \pm 2.6\) | \(49.9 \pm 9.5\) | \(48.2 \pm 3.9\) | \(53.0 \pm 2.8\) | \(\underline{\mathbf{53.7 \pm 6.9}}\) | \(\underline{\mathbf{53.7 \pm 4.1}}\) |
| Av. | \(61.7\) | \(61.8\) | \(\underline{62.0}\) | \(58.6\) | \(60.2\) | \(60.6\) | \(\mathbf{64.3}\) |
| Av. Rank | \(\underline{3.5}\) | \(4.0\) | \(\underline{3.5}\) | \(5.5\) | \(4.5\) | \(4.9\) | \(\mathbf{2.1}\) |
| Model | Av. GPU Mem. (MB) | Av. run time (s) |
|---|---|---|
| LRU | 4121.67 | 466.09 |
| S5 | 2815.00 | 244.78 |
| S6 | 2608.00 | 578.15 |
| MAMBA | 4450.33 | 1553.83 |
| NCDE | 1759.67 | 6649.91 |
| NRDE | 2676.33 | 7284.20 |
| Log-NCDE | 1999.67 | 2128.32 |
Table 1 reports the average and standard deviation of each model’s test set accuracy over five data splits. MAMBA achieves the lowest average accuracy. NCDEs and NRDEs have similar accuracies except on EigenWorms, the dataset with the most observations, where NRDEs significantly improve over NCDEs. However, NRDEs are still outperformed on average accuracy by three stacked recurrent models: LRU, S5, and S6. Log-NCDEs are the best performing model on average accuracy and rank. Compared to NRDEs, they achieve an equal or higher average accuracy on all six datasets and a lower standard deviation on four datasets. These results highlight the improvement in performance which can be achieved by calculating the Lie brackets.
Table 2 details the average GPU memory and run time for each model, with the results for individual datasets given in Appendix 10.5. The major contributors to NCDEs’ high average run time are time series with many observations. Using a depth\(-2\) Log-ODE method decreases the computational burden on datasets with many observations for both NRDEs and Log-NCDEs. Although NRDEs and Log-NCDEs have the same asymptotic computational complexity, using a batched function to calculate the Lie brackets leads to Log-NCDEs having a lower computational burden on high-dimensional time series than NRDEs, as discussed in Section 3.4. Hence, Log-NCDEs have a lower average run time than both NCDEs and NRDEs. Despite these improvements, all four stacked recurrent models have lower average run times than Log-NCDEs.
Table 3 contains the average and standard deviation of each model’s test set mean squared error on the PPG-DaLiA dataset. In contrast to the UEA-MTSCA experiments, MAMBA is the best performing stacked recurrent model. However, Log-NCDEs still achieve the best performance, obtaining the lowest average test set mean squared error and the second lowest standard deviation.
| Model | MSE \((\times 10^{-2})\) |
|---|---|
| LRU | \(12.17 \pm 0.49\) |
| S5 | \(12.63 \pm 1.25\) |
| S6 | \(12.88 \pm 2.05\) |
| MAMBA | \(10.65 \pm 2.20\) |
| NCDE | \(13.54 \pm 0.69\) |
| NRDE | \(\underline{9.90\pm 0.97}\) |
| Log-NCDE | \(\mathbf{9.56 \pm 0.59}\) |
Recent theoretical work on the expressive power of structured state-space models may provide an explanation for their performance on the toy dataset. It has been shown that the recurrent layers of non-selective state-space models, such as S5, are unable to capture terms in the signature that depend on more than one channel. Instead, the computational burden is placed on the non-linear mixing in-between recurrent blocks. Furthermore, it has been shown that stacked selective state-space models, such as MAMBA, can capture higher order terms in the signature with only linear mixing layers [31]. Although the toy dataset highlights a potential limitation of the stacked recurrent models, this did not translate into poor performance on the real-world datasets considered in this paper.
Log-NCDEs achieve the highest average accuracy on the UEA-MTSCA datasets and the lowest mean squared error on the PPG-DaLiA dataset. In particular, they improve upon NRDEs on every dataset. These results highlight the effectiveness of the Log-ODE method for improving the performance of NCDEs and the importance of calculating the Lie brackets when applying the Log-ODE method. Furthermore, despite increasing the computational cost of each vector field evaluation, Log-NCDEs have a lower average run time than both NCDEs and NRDEs on the UEA-MTSCA datasets. Empirical evidence suggests this is due to the Log-ODE method improving efficiency on time series with many observations, and calculating the Lie brackets lowering the computational burden of the Log-ODE method on high-dimensional time series. In addition to achieving state-of-the-art performance on the regularly sampled datasets considered in this paper, Log-NCDEs maintain the ability of NCDEs to naturally handle irregular sampling, making them an attractive option for real-world applications.
Building on NRDEs, this paper introduced Log-NCDEs, which utilise the Log-ODE method to train NCDEs in an effective and efficient manner. This required proving a novel theoretical result bounding the \(\mathrm{Lip}(\gamma)\) norm of fully connected neural networks for \(1<\gamma\leq 2\), as well as developing an efficient method for calculating the iterated Lie brackets of a neural network. A thorough empirical evaluation demonstrated the benefits of calculating the Lie brackets when applying the Log-ODE method. Furthermore, it showed that Log-NCDEs can achieve state-of-the-art performance on a range of multivariate time series datasets.
A reasonable direction of future work is extending Log-NCDEs to depth\(-N\) Log-ODE methods for \(N>2\). This would require proving an equivalent result to Theorem 1 for \(\gamma>2\). Furthermore, it would be necessary to address the computational cost of the iterated Lie brackets. This could be achieved by using a structured neural network with cheap Jacobian-vector products as the CDE vector field. Another avenue of future work could be incorporating the recently developed adaptive version of the Log-ODE method [19].
This paper presents Log Neural Controlled Differential Equations, a novel approach aimed at advancing the field of time series modelling. Potential applications of the method include healthcare, finance, and biology, where accurate time series modeling plays a crucial role. Despite the clear potential for positive impact, care must be taken to further understand the capabilities and limitations of the model before real-world deployment. Additionally, structured state-space models, an alternative approach to time series modelling, have recently been integrated into large language models (LLMs). The advancement of LLMs has many potential societal consequences, both positive and negative.
Benjamin Walker was funded by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA). Andrew McLeod was funded in part by the EPSRC [grant number EP/S026347/1] and in part by The Alan Turing Institute under the EPSRC grant EP/N510129/1. Terry Lyons was funded in part by the EPSRC [grant number EP/S026347/1], in part by The Alan Turing Institute under the EPSRC grant EP/N510129/1, the Data Centric Engineering Programme (under the Lloyd’s Register Foundation grant G0095), the Defence and Security Programme (funded by the UK Government) and the Office for National Statistics & The Alan Turing Institute (strategic partnership) and in part by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA). The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work. http://dx.doi.org/10.5281/zenodo.22558
Let \(V\) and \(W\) be Banach spaces, \(X:[0,T]\rightarrow V\) and \(Y:[0,T]\rightarrow W\) be continuous paths, and \(f(\cdot)\nu\) be a linear map taking \(\nu\in V\) to a vector field \(f(\cdot)\nu\) on \(W\). Assume that \(X\), \(Y\), and \(f\) are regular enough for the integral \[\int_{0}^tf(Y_s)\text{d}X_s\] to be defined for all \(t\in [0,T]\) in the Young sense [32]. The path \(Y\) is said to obey a controlled differential equation (CDE) if \[\label{eq:cde} Y_t = Y_{0} + \int_{0}^tf(Y_s)\text{d}X_s,\tag{14}\] for \(t\in [0,T]\), where \(Y_{0}\in W\) is the initial condition and \(X\) is the control [5]. The existence and uniqueness of the solution to a CDE depend on the smoothness of the control path \(X\) and the vector field \(f\). We will measure the smoothness of a path by the smallest \(p\geq1\) for which the \(p-\)variation is finite and the smoothness of a vector field by the largest \(\gamma>0\) such that the function is \(\text{Lip}(\gamma)\) (defined in Section 2.2).
Definition 8. (Partition) A partition of a real interval \([0,T]\) is a set of real numbers \(\{r_i\}_{i=0}^m\) satisfying \(0=r_0<\ldots<r_m=T\).
Definition 9. (\(p-\)variation [32]) Let \(V\) be a Banach space, \(\mathcal{D}=(r_0, \cdots, r_m) \subset [0,T]\) be a partition of \([0,T]\), and \(p\geq 1\) be a real number. The \(p-\)variation of a path \(X:[0,T]\rightarrow V\) is defined as \[||X||_{p} = \left[\sup_{\mathcal{D}}\sum_{r_i \in \mathcal{D}}|X_{r_i}-X_{r_{i+1}}|^p\right]^{\frac{1}{p}}.\]
Theorem 2. Let \(1\leq p <2\) and \(p-1 < \gamma \leq 1\). If \(W\) is finite-dimensional, \(X\) has finite \(p-\)variation, and \(f\) is \(\text{Lip}(\gamma)\), then 14 admits a solution for every \(Y_{0} \in W\) [33].
Theorem 3. Let \(1\leq p <2\) and \(p<\gamma\). If \(X\) has finite \(p-\)variation and \(f\) is \(\text{Lip}(\gamma)\), then 14 admits a unique solution for every \(Y_{0} \in W\) [33].
These theorems extend the classic differential equation existence and uniqueness results to controls with unbounded variation but finite \(p-\)variation for \(p<2\). A proof of these theorems can be found in [5]. These theorems are sufficient for the differential equations considered in this paper. However, there are many settings where the control has infinite \(p-\)variation for all \(p<2\), such as Brownian motion. The theory of rough paths was developed in order to give meaning to 14 when the control’s \(p-\)variation is finite only for \(p\geq2\) [34]. An introduction to rough path theory can be found in [5].
Let \(V\) be a Banach space and \(V^{\otimes n}\) denote the tensor powers of \(V\), \[V^{\otimes n} = V \underbrace{\otimes \cdots \otimes}_{\text{n-1\;times}} V.\] There is a choice of norm on \(V^{\otimes n}\). In this paper, we follow the setting of [35] and [36]. It is assumed that each \(V^{\otimes n}\) is endowed with a norm such that the following conditions hold for all \(v\in V^{\otimes n}\) and \(w\in V^{\otimes m}\):
\(||v|| = ||v_1\otimes \cdots \otimes v_n|| = ||v_{p(1)} \otimes \cdots \otimes v_{p(n)}||\) for all bijective functions \(p:\{1, \ldots, n\}\rightarrow \{1, \ldots, n\}\),
\(|| v \otimes w || \leq ||v||\;||w||\),
for any bounded linear functional \(f\) on \(V^{\otimes n}\) and \(g\) on \(V^{\otimes m}\), there exists a unique bounded linear functional \(f \otimes g\) on \(V^{\otimes (m+n)}\) such that \((f \otimes g)(v \otimes w) = f(v)g(w).\)
Definition 10. (The Tensor Algebra [5]) For \(n\geq1\), let \(V^{\otimes n}\) be equipped with a norm satisfying the above conditions, and define \(V^{\otimes 0}=\mathbb{R}\). The tensor algebra space is the set \[T((V)) = \{\mathbf{x} = (x^0, x^1, \ldots ) | x^k \in V^{\otimes k}\}\] with product \(\mathbf{z}=\mathbf{x}\otimes\mathbf{y}\) defined by \[z^k = (\mathbf{x} \otimes \mathbf{y})^k = \sum_{j=0}^k x^j \otimes y^{k-j}.\]
The tensor algebra’s product is associative and has unit \(\mathbf{1} = (1,0,0,\ldots)\). As \(T((V))\) is an associative algebra, it has a Lie algebra structure, with Lie bracket \[[\mathbf{x}, \mathbf{y}] = \mathbf{x} \otimes \mathbf{y} - \mathbf{y} \otimes \mathbf{x}\] for \(\mathbf{x},\mathbf{y} \in T((V))\) [12].
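As a sketch, the product truncated at depth \(2\) (the case relevant to the experiments) is the convolution-like sum below, with \(\mathbf{x}\) and \(\mathbf{y}\) represented by their level-one and level-two components and the level-zero components fixed to \(1\).

```python
import jax.numpy as jnp

def tensor_product_depth2(x, y):
    # x = (x1, x2) and y = (y1, y2) represent (1, x1, x2) and (1, y1, y2) in T((V)).
    x1, x2 = x
    y1, y2 = y
    z1 = x1 + y1                         # level 1: x^0 ⊗ y^1 + x^1 ⊗ y^0
    z2 = x2 + jnp.outer(x1, y1) + y2     # level 2: x^0 ⊗ y^2 + x^1 ⊗ y^1 + x^2 ⊗ y^0
    return z1, z2
```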
The proof of Theorem 1 relies on two lemmas. The first is a bound on the \(\text{Lip}(\gamma)-\)norm of the composition of two \(\mathrm{Lip}(\gamma)\) functions. The second is a bound on the \(\mathrm{Lip}(2)-\)norm of each layer of a fully connected neural network (FCNN).
Lemma 1. (Composed \(\text{Lip}(\gamma)-\)norm [20]) Let \(U\), \(V\), and \(W\) be Banach spaces and \(\Sigma\subset U\) and \(\Omega \subset V\) be closed. For \(\gamma\geq 1\), let \(f \in \text{Lip}(\gamma,\Sigma,\Omega)\) and \(g \in \text{Lip}(\gamma,\Omega,W)\). Then the composition \(g\circ f: \Sigma \rightarrow W\) is \(\text{Lip}(\gamma)\) with \[\label{eq:normcomp} ||g\circ f||_{\text{Lip}(\gamma)} \leq C_{\gamma}||g||_{\text{Lip}(\gamma)}\max\left\{||f||^{k+1}_{\text{Lip}(\gamma)},1\right\},\tag{15}\] where \(k\) is the unique integer such that \(\gamma\in(k, k+1]\) and \(C_{\gamma}\) is a constant independent of \(f\) and \(g\).
The original statement of lemma 1 in [20] gives 15 as \[\label{eq:compboundincorrect} ||g\circ f||_{\text{Lip}(\gamma)} \leq C_{\gamma}||g||_{\text{Lip}(\gamma)}\max\left\{||f||^{\textcolor{red}{k}}_{\text{Lip}(\gamma)},1\right\}.\tag{16}\] We believe this is a small erratum, as for \(g:[0,1]\rightarrow[0,1]\) defined as \(g(x)=x\), 16 implies there exists \(C_1>0\) such that \[||g\circ f||_{\text{Lip}(1)} = ||f||_{\text{Lip}(1)} \leq C_1||g||_{\text{Lip}(1)}=C_1\] for all bounded and Lipschitz \(f:[0,1]\rightarrow[0,1]\). As a counterexample, for any \(C_1>0\), take \(f(x)=x^{n}\) with \(n>\max\{C_1, 1\}\). The following proof of lemma 1 is given in [20].
Proof. Let \((g\circ f)^0, \ldots, (g\circ f)^k\) be defined by the generalisation of the chain rule to higher derivatives. Explicit calculation can be used to verify that if \(f\) and \(g\) are \(\text{Lip}(\gamma)\), definition 4 implies \(g\circ f\) is \(\text{Lip}(\gamma)\) with \(||g\circ f||_{\text{Lip}(\gamma)}\) obeying 15 . ◻
Bounding the \(\text{Lip}(\gamma)-\)norm of a neural network (NN) requires an explicit form for \(C_{\gamma}\) in 15 . This can be obtained via the explicit calculations mentioned in the proof of lemma 1. Here, we present the case \(\gamma \in (1,2]\).
Lemma 2. Let \(U\), \(V\), and \(W\) be Banach spaces and \(\Sigma\subset U\) and \(\Omega \subset V\) be closed. For \(\gamma \in (1,2]\), let \(f = ( f^{(0)} , f^{(1)} ) \in \text{Lip}(\gamma,\Sigma,\Omega)\) and \(g = (g^{(0)} , g^{(1)} ) \in \text{Lip}(\gamma,\Omega,W)\). Consider \(h^{(0)} : \Sigma \to W\) and \(h^{(1)} : \Sigma \to \mathbf{L}(V,W)\) defined for \(p \in \Sigma\) and \(v \in V\) by \[\label{eq:hdef} h^{(0)}(p) := g^{(0)} \left( f^{(0)}(p) \right) \qquad \text{and} \qquad h^{(1)}(p)[v] := g^{(1)}\left( f^{(0)}(p) \right) \left[ f^{(1)}(p)[v] \right].\tag{17}\] Then \(h := \left( h^{(0)} , h^{(1)} \right) \in \text{Lip}(\gamma,\Sigma,W)\) and \[\label{eq:normcomplip2} || h ||_{\text{Lip}(\gamma,\Sigma,W)} \leq \left( 1 + 2^{\gamma} \right) || g ||_{\text{Lip}(\gamma,\Omega,W)} \max \left\{ 1 , || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)}^{\gamma} \right\}.\tag{18}\]
Proof. From definition 4, \(f^{(0)} : \Sigma \to \Omega\), \(f^{(1)} : \Sigma \to \mathbf{L}(U,V)\), \(g^{(0)} : \Omega \to W\) and \(g^{(1)} : \Omega \to \mathbf{L}(V,W)\). Furthermore, for all \(p\in\Sigma\) \[\label{f95pointwise95bounds} ({\boldsymbol{I}}) \quad \left|\left| f^{(0)}(p) \right|\right|_V \leq ||f||_{\text{Lip}(\gamma,\Sigma,\Omega)} \qquad \text{and} \quad ({\boldsymbol{I}I}) \quad \left|\left| f^{(1)}(p) \right|\right|_{\mathbf{L}(U,V)} \leq ||f||_{\text{Lip}(\gamma,\Sigma,\Omega)}.\tag{19}\] Similarly, for all \(x \in \Omega\) we have that \[\label{g95pointwise95bounds} ({\boldsymbol{I}}) \quad \left|\left| g^{(0)}(x) \right|\right|_W \leq ||g||_{\text{Lip}(\gamma,\Omega,W)} \qquad \text{and} \quad ({\boldsymbol{I}I}) \quad \left|\left| g^{(1)}(x) \right|\right|_{\mathbf{L}(V,W)} \leq ||g||_{\text{Lip}(\gamma,\Omega,W)}.\tag{20}\] Define \(R^f_0 : \Sigma \times \Sigma \to V\) and \(R^f_1 : \Sigma \times \Sigma \to \mathbf{L}(U,V)\) by \[\label{f95remainder95term95defs} \begin{align} R^f_0 (p,q) &:= f^{(0)}(q) - f^{(0)}(p) - f^{(1)}(p)[q-p], \\ R^f_1(p,q)[u] &:= f^{(1)}(q)[u] - f^{(1)}(p)[u], \end{align}\tag{21}\] for any \(p,q \in \Sigma\) and \(u \in U\). Then \[\label{f95remainder95term95bounds} \begin{align} & ({\boldsymbol{I}}) \quad \left|\left| R^f_0(p,q) \right|\right|_V \leq ||f||_{\text{Lip}(\gamma,\Sigma,\Omega)} ||q-p||_U^{\gamma}, \\ & ({\boldsymbol{I}I}) \quad \left|\left| R^f_1(p,q) \right|\right|_{\mathbf{L}(U,V)} \leq ||f||_{\text{Lip}(\gamma,\Sigma,\Omega)} ||q-p||_U^{\gamma - 1}. \end{align}\tag{22}\] Similarly, define \(R^g_0 : \Omega \times \Omega \to W\) and \(R^g_1 : \Omega \times \Omega \to \mathbf{L}(V,W)\) by \[\label{g95remainder95term95defs} \begin{align} R^g_0 (x,y) &:= g^{(0)}(y) - g^{(0)}(x) - g^{(1)}(x)[y-x], \\ R^g_1(x,y)[v] &:= g^{(1)}(y)[v] - g^{(1)}(x)[v], \end{align}\tag{23}\] for \(x,y \in \Omega\) and \(v \in V\). Then, \[\label{g95remainder95term95bounds} \begin{align} & ({\boldsymbol{I}}) \quad \left|\left| R^g_0(x,y) \right|\right|_W \leq || g ||_{\text{Lip}(\gamma,\Omega,W)} ||y-x||_V^{\gamma} \\ & ({\boldsymbol{I}I}) \quad \left|\left| R^g_1(x,y) \right|\right|_{\mathbf{L}(V,W)} \leq || g ||_{\text{Lip}(\gamma,\Omega,W)} ||y-x||_V^{\gamma - 1}. \end{align}\tag{24}\] Define \(h^{(0)} : \Sigma \to W\) and \(h^{(1)} : \Sigma \to \mathbf{L}(V,W)\) as in 17 , \[\label{lip95gamma95chain95rule95h95def95proof} h^{(0)}(p) := g^{(0)} \left( f^{(0)}(p) \right) \qquad \text{and} \qquad h^{(1)}(p)[u] := g^{(1)}\left( f^{(0)}(p) \right) \left[ f^{(1)}(p)[u] \right],\tag{25}\] for \(p \in \Sigma\) and \(u \in U\). Finally, define remainder terms \(R^h_0 : \Sigma \times \Sigma \to W\) and \(R^h_1 : \Sigma \times \Sigma \to \mathbf{L}(U,W)\) by \[\label{h95remain95terms95def} \begin{align} R^h_0(p,q) &:= h^{(0)}(q) - h^{(0)}(p) - h^{(1)}(p)[q-p], \\ R^h_1(p,q)[u] &:= h^{(1)}(q)[u] - h^{(1)}(p)[u], \end{align}\tag{26}\] for \(p,q \in \Sigma\) and \(u \in U\). We now establish that \(h = ( h^{(0)} , h^{(1)} ) \in \text{Lip}(\gamma,\Sigma,W)\) and that the norm estimate claimed in 18 is satisfied.
First we consider the bounds on \(h^{(0)}\) and \(h^{(1)}\). For any \(p \in \Sigma\), (I) in 20
implies that \[\label{h095bd} \left|\left| h^{(0)}(p) \right|\right|_W = \left|\left| g^{(0)} \left( f^{(0)}(p) \right) \right|\right|_W \leq || g
||_{\text{Lip}(\gamma,\Omega,W)}\tag{27}\] since \(f^{(0)}(p) \in \Omega\). Further, for any \(p \in \Sigma\) and any \(u \in U\), 20 and (II) in 19 imply that \[\begin{align} \left|\left| h^{(1)}(p)[u] \right|\right|_W &= \left|\left| g^{(1)} \left( f^{(0)}(p)
\right) \left[ f^{(1)}(p)[u] \right] \right|\right|_W \\ &\leq \left|\left| g^{(1)} \left( f^{(0)}(p) \right) \right|\right|_{\mathbf{L}(V,W)} \left|\left| f^{(1)}(p) \right|\right|_{\mathbf{L}(U,V)} ||u||_U \\ &\leq || g
||_{\text{Lip}(\gamma,\Omega,W)} || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} ||u||_U
\end{align}\] since \(f^{(0)}(p) \in \Omega\). Taking the supremum over \(u \in U\) with unit \(U\)-norm, it follows that
\[\label{h195bd} \left| \left| h^{(1)}(p) \right|\right|_{\mathbf{L}(U,W)} \leq || g ||_{\text{Lip}(\gamma,\Omega,W)} || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)}.\tag{28}\] Now we
consider the bounds on \(R^h_0\) and \(R^h_1\). For this purpose we fix \(p,q \in \Sigma\) and \(u \in U\). We first assume
that \(||q - p||_U > 1\). In this case we may use 27 and 28 to compute that \[\begin{align} \left|\left| R^h_0(p,q) \right|\right|_W &=
\left| \left| h^{(0)}(q) - h^{(0)}(p) - h^{(1)}(p)[q-p] \right|\right|_W \\ &\leq 2 || g ||_{\text{Lip}(\gamma,\Omega,W)} + || g ||_{\text{Lip}(\gamma,\Omega,W)} || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} || q - p ||_U.
\end{align}\] Since \(\gamma > 1\) means that \(1 < || q - p ||_U < || q - p ||_U^{\gamma}\), we deduce that \[\label{Rh095bd95a} \left| \left| R^h_0(p,q) \right|\right|_W \leq || g ||_{\text{Lip}(\gamma,\Omega,W)} \left( 2 + || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} \right) || q - p ||_U^{\gamma}.\tag{29}\] Similarly, we may
use 28 and that \(||q-p||_U^{\gamma - 1} > 1\) to compute that \[\label{Rh195bd95a95with95v} \left|\left| R^h_1(p,q)[u]
\right|\right|_W = \left|\left| h^{(1)}(q)[u] - h^{(1)}(p)[u] \right|\right|_W \leq 2 || g ||_{\text{Lip}(\gamma,\Omega,W)} || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} || q - p ||_U^{\gamma - 1} ||u||_U.\tag{30}\] Taking the supremum over
\(u \in U\) with unit \(U\)-norm in 30 yields the estimate that \[\label{Rh195bd95a}
\left|\left| R^h_1(p,q) \right|\right|_{\mathbf{L}(V,W)} \leq 2 || g ||_{\text{Lip}(\gamma,\Omega,W)} || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} || q - p ||_U^{\gamma - 1}.\tag{31}\] Together, 29 and 31 establish the remainder term estimates required to conclude that \(h = (h^{(0)} , h^{(1)} ) \in \text{Lip}(\gamma,\Sigma,W)\) in the case that \(|| q - p ||_U >
1\).
We next establish similar remainder term estimates when \(|| q - p ||_U < 1\). Thus we fix \(p,q \in \Sigma\) and assume that \(||q-p||_U < 1\). Note
that \(\gamma > 1\) means that \(||q - p||_U^{\gamma} < ||q-p||_U < 1\). Additionally, \[\label{f095diff95bd}
\begin{align} \left|\left| f^{(0)}(q) - f^{(0)}(p) \right|\right|_V &\stackrel{(\ref{f95remainder95term95defs})}{=} \left|\left|f^{(1)}(p)[q-p] + R^f_0(p,q) \right|\right|_V, \\ &\;\;\leq || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} \left( ||q-p||_U
+ ||q-p||_U^{\gamma} \right), \\ &\;\;\leq 2 || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} ||q-p||_U, \end{align}\tag{32}\] where (II) in 19 and (I) in 22 have been used. We now consider the term \(R^h_0(p,q)\). We start by observing that \[\begin{align} R^h_0(p,q)
&\stackrel{(\ref{h95remain95terms95def})}{=} h^{(0)}(q) - h^{(0)}(p) - h^{(1)}(p)[q-p] \\ &\stackrel{(\ref{lip95gamma95chain95rule95h95def95proof})}{=} g^{(0)}\left( f^{(0)}(q) \right) - g^{(0)} \left( f^{(0)}(p) \right) - g^{(1)}\left(f^{(0)}(p)
\right) \left[ f^{(1)}(p)[q-p] \right] \\ &\stackrel{(\ref{g95remainder95term95defs})}{=} g^{(1)}\left( f^{(0)}(p) \right) \left[ f^{(0)}(q) - f^{(0)}(p) - f^{(1)}(p)[q-p] \right] + R^g_0\left( f^{(0)}(p) , f^{(0)}(q) \right) \\
&\stackrel{(\ref{f95remainder95term95defs})}{=} g^{(1)}\left( f^{(0)}(p) \right) \left[ R^f_0(p,q) \right] + R^g_0\left( f^{(0)}(p) , f^{(0)}(q) \right).
\end{align}\] Consequently, by using (II) in 20 to estimate the term \(g^{(1)}\left( f^{(0)}(p) \right)\), (I) in 22 to estimate the term \(R^f_0(p,q)\), and (I) in 24 to estimate the term \(R^g_0\left(
f^{(0)}(p) , f^{(0)}(q) \right)\), we may deduce that \[\label{Rh095bd95b95almost} \left|\left| R^h_0(p,q) \right|\right|_W \leq || g ||_{\text{Lip}(\gamma,\Omega,W)} \left(
|| f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} || q - p ||_U^{\gamma} + \left|\left| f^{(0)}(q) - f^{(0)}(p) \right|\right|_V^{\gamma} \right).\tag{33}\] The combination of 32 and 33 yields the estimate \[\label{Rh095bd95b} \left|\left| R^h_0(p,q) \right|\right|_W \leq || g ||_{\text{Lip}(\gamma,\Omega,W)} \left( || f
||_{\text{Lip}(\gamma,\Sigma,\Omega)} + 2^{\gamma} ||f||_{\text{Lip}(\gamma,\Sigma,\Omega)}^{\gamma} \right) || q - p ||_U^{\gamma}.\tag{34}\] Turning our attention to \(R^h_1\), we fix \(u \in U\) and compute that \[\begin{align} R^h_1(p,q)[u] &\stackrel{(\ref{h95remain95terms95def})}{=} h^{(1)}(q)[u] - h^{(1)}(p)[u] \\
&\stackrel{(\ref{lip95gamma95chain95rule95h95def95proof})}{=} g^{(1)}\left( f^{(0)}(q) \right) \left[ f^{(1)}(q)[u] \right] - g^{(1)}\left( f^{(0)}(p) \right) \left[ f^{(1)}(p)[u] \right] \\ &\stackrel{(\ref{g95remainder95term95defs})}{=} g^{(1)}
\left( f^{(0)}(p) \right) \left[ f^{(1)}(q)[u] - f^{(1)}(p)[u] \right] + R^g_1 \left( f^{(0)}(p) , f^{(0)}(q) \right) \left[ f^{(1)}(q)[u] \right] \\ &\stackrel{(\ref{f95remainder95term95defs})}{=} g^{(1)} \left( f^{(0)}(p) \right) \left[ R^f_1(p,q)[u]
\right] + R^g_1 \left( f^{(0)}(p) , f^{(0)}(q) \right) \left[ f^{(1)}(q)[u] \right].
\end{align}\] Consequently, by using (II) in 20 to estimate the term \(g^{(1)} \left( f^{(0)}(p) \right)\), (II) in 19 to estimate the term \(f^{(1)}(q)\), (II) in 22 to estimate the term \(R^f_1(p,q)\), and
(II) in 24 to estimate the term \(R^g_1 \left( f^{(0)}(p) , f^{(0)}(q) \right)\), we may deduce that \[\label{Rh195bd95b95almost} \left|\left| R^h_1(p,q)[u] \right|\right|_W \leq || g ||_{\text{Lip}(\gamma,\Omega,W)} || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} \left( || q - p ||_U^{\gamma - 1} + \left|\left| f^{(0)}(q) - f^{(0)}(p)
\right|\right|_{V}^{\gamma - 1} \right) ||u||_U.\tag{35}\] The combination of 32 and 35 yields the estimate that \[\label{Rh195bd95b95almost2} \left|\left| R^h_1(p,q)[u] \right|\right|_W \leq || g ||_{\text{Lip}(\gamma,\Omega,W)} \left( || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} + 2^{\gamma - 1}||f||_{\text{Lip}(\gamma,\Sigma,\Omega)}^{\gamma}
\right) ||q - p||_U^{\gamma - 1} ||u||_U.\tag{36}\] Taking the supremum over \(u \in U\) with unit \(U\)-norm in 36 yields the estimate
that \[\label{Rh195bd95b} \left|\left| R^h_1(p,q) \right|\right|_{\mathbf{L}(V,W)} \leq || g ||_{\text{Lip}(\gamma,\Omega,W)} \left( || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} + 2^{\gamma -
1} ||f||_{\text{Lip}(\gamma,\Sigma,\Omega)}^{\gamma} \right) || q - p ||_U^{\gamma - 1}.\tag{37}\] Finally, we complete the proof by combining the various estimates we have established for \(h\) to obtain the
\(\text{Lip}(\gamma,\Sigma,W)\)-norm bound claimed in 18 .
We start this task by combining 29 and 34 to deduce that for every \(p,q \in \Sigma\) we have \[\label{Rh095bd95c} \begin{align} \left|\left| R^h_0(p,q) \right|\right|_W \leq \left\{ \begin{array}{ll} || g ||_{\text{Lip}(\gamma,\Omega,W)} \left( 2 + || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} \right) || q - p ||_U^{\gamma} & \text{if } ||q-p||_U > 1 \\ || g ||_{\text{Lip}(\gamma,\Omega,W)} \left( || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} + 2^{\gamma} ||f||_{\text{Lip}(\gamma,\Sigma,\Omega)}^{\gamma} \right) || q - p ||_U^{\gamma} & \text{if } ||q - p ||_U \leq 1. \end{array} \right. \end{align}\tag{38}\] Moreover, the combination of 31 and 37 yields the estimate that \[\small \label{Rh195bd95c} \left|\left| R^h_1(p,q) \right|\right|_{\mathbf{L}(V,W)} \leq \left\{ \begin{array}{ll} 2 || g ||_{\text{Lip}(\gamma,\Omega,W)} || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} || q - p ||_U^{\gamma - 1} & \text{if } ||q-p||_U > 1 \\ || g ||_{\text{Lip}(\gamma,\Omega,W)} \left( || f ||_{\text{Lip}(\gamma,\Sigma,\Omega)} + 2^{\gamma - 1} ||f||_{\text{Lip}(\gamma,\Sigma,\Omega)}^{\gamma} \right) || q - p ||_U^{\gamma - 1} & \text{if } ||q - p ||_U \leq 1. \end{array} \right.\tag{39}\] A consequence of 38 is that \[\label{Rh095bd95d} \left|\left| R^h_0(p,q) \right|\right|_W \leq \left( 1 + 2^{\gamma} \right)|| g ||_{\text{Lip}(\gamma,\Omega,W)} \max \left\{||f||_{\text{Lip}(\gamma,\Sigma,\Omega)}^{\gamma} , 1 \right\} ||q - p||_U^{\gamma},\tag{40}\] whilst a consequence of 39 is that \[\label{Rh195bd95d} \left|\left| R^h_1(p,q) \right|\right|_{\mathbf{L}(V,W)}\leq \left( 1 + 2^{\gamma - 1} \right) || g ||_{\text{Lip}(\gamma,\Omega,W)} \max \left\{||f||_{\text{Lip}(\gamma,\Sigma,\Omega)}^{\gamma}, 1 \right\} ||q - p||_U^{\gamma - 1}.\tag{41}\] Therefore, by combining 27 , 28 , 40 , and 41 , we conclude both that \(h = (h^{(0)} , h^{(1)} ) \in \text{Lip}(\gamma,\Sigma,W)\) and that \[\label{h95lip95gamma95norm} || h ||_{\text{Lip}(\gamma,\Sigma,W)} \leq \left( 1 + 2^{\gamma} \right) || g ||_{\text{Lip}(\gamma,\Omega,W)} \max \left\{ ||f||^{\gamma}_{\text{Lip}(\gamma,\Sigma,\Omega)}, 1 \right\}.\tag{42}\] ◻
Note that (18) is a stricter bound than (15), as for \(\gamma\in(k, k+1]\), \[\max \left\{|| f ||_{\text{Lip}(\gamma,\Sigma,\Omega)}^{\gamma}, 1 \right\} \leq \max\left\{||f||^{k+1}_{\text{Lip}(\gamma)},1\right\}.\] There is equality when \(||f||_{\text{Lip}(\gamma)}\leq 1\) or \(\gamma=2\).
Lemma 2 allows us to bound the \(\text{Lip}(2)\)-norm of a neural network (NN) given a bound on the \(\text{Lip}(2)\)-norm of each of its layers. We demonstrate this here for a simple NN.
Definition 11. (Fully Connected NN) Let \(m,n_{in},n_{out},n_h\in\mathbb{N}\) and \(f_{\theta}\) be a fully connected NN with \(m\) layers, input dimension \(n_{in}\), output dimension \(n_{out}\), hidden dimension \(n_h\), and activation function \(\sigma\). Given an input \(\mathbf{x}\in\mathbb{R}^{n_{in}}\), the output of the NN is defined as \[\mathbf{y} = L^m(\cdots (L^1(\mathbf{x}))\cdots),\] where \(L^1:\mathbb{R}^{n_{in}}\rightarrow\mathbb{R}^{n_h}\), \(L^i:\mathbb{R}^{n_h}\rightarrow\mathbb{R}^{n_h}\) for \(i=2,\ldots,m-1\), and \(L^m:\mathbb{R}^{n_h}\rightarrow\mathbb{R}^{n_{out}}\). Each layer is defined by \[L^i(\mathbf{y}) = \begin{bmatrix} L^i_1(\mathbf{y}) \\ \vdots \\ L^i_\alpha(\mathbf{y}) \end{bmatrix} = \begin{bmatrix} \sigma(l^i_1(\mathbf{y})) \\ \vdots \\ \sigma(l^i_\alpha(\mathbf{y}) )\end{bmatrix} = \begin{bmatrix} \sigma(W^i_1 \cdot \mathbf{y} + b^i_1) \\ \vdots \\ \sigma(W^i_\alpha \cdot \mathbf{y} + b^i_\alpha) \end{bmatrix},\] where \(\mathbf{y}\in\mathbb{R}^\beta\), \(W^i=[W^i_1, \ldots, W^i_\alpha]^T\in\mathbb{R}^{\alpha\times \beta}\) and \(\mathbf{b}^i =[b^i_1, \ldots, b^i_\alpha]^T\in\mathbb{R}^\alpha\) are the learnable parameters and \[(\alpha,\beta) = \begin{cases} (n_h,n_{in}), \qquad i=1, \\ (n_h, n_h), \qquad \; i=2,\ldots,m-1, \\ (n_{out}, n_h), \qquad i=m. \end{cases}\]
Definition 12. (SiLU [21]) The activation function \(\text{\emph{SiLU}}:\mathbb{R}\rightarrow\mathbb{R}\) is defined as \[\text{\emph{SiLU}}(y) = \frac{y}{1+e^{-y}}.\]
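To make Definitions 11 and 12 concrete, the following is a minimal JAX sketch of a forward pass through such a network. The initialisation scheme and the function names are illustrative assumptions, not the configuration used in the experiments.

```python
import jax
import jax.numpy as jnp

def silu(y):
    # SiLU(y) = y / (1 + exp(-y)), as in Definition 12.
    return y / (1.0 + jnp.exp(-y))

def init_fcnn(key, n_in, n_h, n_out, m):
    # Illustrative initialisation of an m-layer fully connected NN (Definition 11).
    sizes = [n_in] + [n_h] * (m - 1) + [n_out]
    keys = jax.random.split(key, m)
    return [
        (jax.random.normal(k, (b, a)) / jnp.sqrt(a), jnp.zeros(b))
        for k, a, b in zip(keys, sizes[:-1], sizes[1:])
    ]

def fcnn(params, x):
    # y = L^m(...L^1(x)...), with each layer L^i(y) = sigma(W^i y + b^i).
    y = x
    for W, b in params:
        y = silu(W @ y + b)
    return y

params = init_fcnn(jax.random.PRNGKey(0), n_in=3, n_h=8, n_out=2, m=3)
print(fcnn(params, jnp.ones(3)))
```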
Lemma 3. Let \(f_{\theta}\) be a fully connected NN with activation function \(\text{SiLU}\). Assume the input is normalised such that \(\mathbf{x}=[x_1,\ldots,x_{n_{in}}]^T\) satisfies \(|x_j|\leq1\) for \(j=1,\ldots,n_{in}\). Then \[||L^i||_{\text{Lip}(2)} \leq \left(\sum_{j=1}^n\max\left\{0.5||W^i_j||_2^2, 1.1||W^i_j||_2, \Gamma^i ||W^i_j||_2 + |b^i_j|\right\}^2\right)^{\frac{1}{2}},\] where \[\Gamma^i = \sqrt{n_h}||W^{i-1}||_2\bigg(\cdots\Big(\sqrt{n_h}||W^{2}||_2\big(\sqrt{n_{in}}||W^1||_2+||\mathbf{b}^1||_2\big)+||\mathbf{b}^{2}||_2\Big)+\cdots\bigg)+||\mathbf{b}^{i-1}||_2.\]
Proof. Consider \(L^1:A\rightarrow\mathbb{R}^n\), where \(A\subset\mathbb{R}^{n_{in}}\) is the set of \(\mathbf{x}=[x_1,\ldots,x_{n_{in}}]^T\) satisfying \(|x_j|\leq1\) for \(j=1,\ldots,n_{in}\). Each \(L^1_j:\mathbb{R}^{n_{in}}\rightarrow\mathbb{R}\) is the composition of a linear map \(l^1_j\) and a \(\text{SiLU}\). First, we show that \[(L^1_j, \nabla L^1_j) = \left(\frac{l^1_j}{1+e^{-l^1_j}}, \frac{e^{l^1_j}(e^{l^1_j}+l^1_j+1)}{(e^{l^1_j}+1)^2}W^1_j\right) \in \text{Lip}(2),\] where \(l^1_j(\mathbf{x})=W^1_j\cdot\mathbf{x}+b^1_j\). Since \(|\text{SiLU}(y)|\leq|y|\) for all \(y\in\mathbb{R}\), \[\max_{\mathbf{x}\in A}|L_j^1(\mathbf{x})|\leq \max_{\mathbf{x}\in A}|l_j^1(\mathbf{x})| \leq \sqrt{n_{in}}||W^1_j||_2+|b^1_j|,\] and, since \(|\text{SiLU}'(y)|\leq 1.1\) for all \(y\in\mathbb{R}\), \[||\nabla L^1_j||_2 \leq 1.1||W^1_j||_2.\] Since \(L^1_j\) is at least twice differentiable, \[\frac{|L^1_j(\mathbf{y})-L^1_j(\mathbf{x}) - \nabla L^1_j(\mathbf{x})(\mathbf{y}-\mathbf{x})|}{||\mathbf{y}-\mathbf{x}||_2^2} = \frac{|(\mathbf{y}-\mathbf{x})^T\nabla^2 L^1_j(\mathbf{b})(\mathbf{y}-\mathbf{x})|}{2||\mathbf{y}-\mathbf{x}||_2^2}\leq \frac{1}{2}\beta(\nabla^2 L^1_j(\mathbf{b})),\] where \(\mathbf{b} = \mathbf{x} + t(\mathbf{y}-\mathbf{x})\) for some \(t\in(0, 1)\) and \(\beta(M) = \max\{|\lambda_{\max}(M)|, |\lambda_{\min}(M)|\}\). Similarly, \[\frac{||\nabla L^1_j(\mathbf{y})-\nabla L^1_j(\mathbf{x})||_2}{||\mathbf{y}-\mathbf{x}||_2} = \frac{||\nabla^2 L^1_j(\mathbf{z})(\mathbf{y}-\mathbf{x})||_2}{||\mathbf{y}-\mathbf{x}||_2} \leq ||\nabla^2 L^1_j(\mathbf{z})||_2 = \beta(\nabla^2 L^1_j(\mathbf{z})),\] where \(\mathbf{z} = \mathbf{x} + \alpha(\mathbf{y}-\mathbf{x})\) for some \(\alpha\in(0, 1)\). Now, \[\nabla^2 L^1_j = \frac{e^{l^1_j}\left(2 + 2e^{l^1_j}+l^1_j(1-e^{l^1_j})\right)}{(e^{l^1_j}+1)^3}W^1_j \otimes W^1_j,\] which has a single non-zero eigenvalue \[\lambda = \frac{e^{l^1_j}\left(2 + 2e^{l^1_j}+l^1_j(1-e^{l^1_j})\right)}{(e^{l^1_j}+1)^3}||W^1_j||_2^2,\] satisfying \[\max_{\mathbf{x}\in\mathbb{R}^{n_{in}}}\beta(\nabla^2 L^1_j(\mathbf{x})) = 0.5||W^1_j||_2^2.\] Therefore, \[||L^1_j||_{\text{Lip}(2)} \leq \max\left\{0.5||W^1_j||_2^2, 1.1||W^1_j||_2, \sqrt{n_{in}}||W^1_j||_2+|b^1_j|\right\},\] and \[||L^1||_{\text{Lip}(2)} \leq \left(\sum_{j=1}^n\max\left\{0.5||W^1_j||_2^2, 1.1||W^1_j||_2, \sqrt{n_{in}}||W^1_j||_2+|b^1_j|\right\}^2\right)^{\frac{1}{2}}.\] The calculations for subsequent layers are very similar, except that the input to each layer is no longer restricted to \(A\). For example, \[\begin{align} \max_{\mathbf{x}\in A}|L^2_j(L^1(\mathbf{x}))| &\leq \max_{\mathbf{x}\in A}|W^2_j\cdot\text{SiLU}(W^1\mathbf{x}+\mathbf{b}^1)+b^2_j| \\ &\leq \left( \sqrt{n_{in}}||W^1||_2+||\mathbf{b}^1||_2\right)||W^2_j||_2+|b^2_j|. \end{align}\] In general, \[||L^i||_{\text{Lip}(2)} \leq \left(\sum_{j=1}^n\max\left\{0.5||W^i_j||_2^2, 1.1||W^i_j||_2, \Gamma^i ||W^i_j||_2 + |b^i_j|\right\}^2\right)^{\frac{1}{2}},\] where \[\Gamma^i = \sqrt{n_h}||W^{i-1}||_2\bigg(\cdots\Big(\sqrt{n_h}||W^{2}||_2\big(\sqrt{n_{in}}||W^1||_2+||\mathbf{b}^1||_2\big)+||\mathbf{b}^{2}||_2\Big)+\cdots\bigg)+||\mathbf{b}^{i-1}||_2.\] ◻
Theorem 4. Let \(f_{\theta}\) be a fully connected NN (FCNN) with input dimension \(n_{in}\), hidden dimension \(n_h\), depth \(m\), and activation function \(\text{SiLU}\) [21]. If the input \(\mathbf{x}=[x_1,\ldots,x_{n_{in}}]^T\) satisfies \(|x_j|\leq1\) for \(j=1,\ldots,n_{in}\), then \(f_{\theta}\in\mathrm{Lip}(2)\) and \[\label{eq:lip2nnbound95app} ||f_{\theta}||_{\text{Lip}(2)} \leq CP_{m!}(||W^1||_2,\ldots,||W^m||_2,||\mathbf{b}^1||_2,\ldots,||\mathbf{b}^m||_2),\tag{43}\] where \(C\) is a constant depending on \(n_{in}, n_h,\) and \(m\), \(W^i\) and \(\mathbf{b}^i\) are the weights and biases of the \(i^{\text{th}}\) layer of \(f_{\theta}\), and \(P_{m!}\) is a polynomial of order \(m!\).
Proof. Lemma 3 means that each layer \(L^j\) of \(f_{\theta}\) is \(\mathrm{Lip}(2)\), with norm satisfying \[||L^j||_{\text{Lip}(2)} \leq CP_j(||W^1||_2,\ldots,||W^j||_2,||\mathbf{b}^1||_2,\ldots,||\mathbf{b}^j||_2),\] where \(C\) is a constant depending on \(n_{in}\) and \(n_h\), and \(P_j\) is a \(j^{\text{th}}\) order polynomial. Applying Lemma 2 to compose the layers gives the bound in 43. ◻
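As an illustration of how Lemma 3 and Theorem 4 could be used in practice, the sketch below evaluates the per-layer bounds and chains them with the composition estimate 42 (Lemma 2) at \(\gamma=2\). It is a sketch only: the function names are ours, the matrix norm \(||W^i||_2\) is assumed to be the spectral norm, and the exact constants and composition used in the paper's implementation may differ.

```python
import jax
import jax.numpy as jnp

def layer_lip2_bound(W, b, gamma_i):
    # Per-layer bound from Lemma 3: for each row j, take
    # max{0.5||W_j||^2, 1.1||W_j||, gamma_i ||W_j|| + |b_j|}, then combine in l2.
    row_norms = jnp.linalg.norm(W, axis=1)
    per_row = jnp.maximum(0.5 * row_norms**2,
                          jnp.maximum(1.1 * row_norms,
                                      gamma_i * row_norms + jnp.abs(b)))
    return jnp.sqrt(jnp.sum(per_row**2))

def fcnn_lip2_bound(weights, biases, n_in, n_h):
    # weights[i-1], biases[i-1] hold (W^i, b^i) from Definition 11.
    # Gamma^i follows the recursion implied by Lemma 3, and the layer bounds are
    # chained with the composition estimate 42 at gamma = 2, i.e.
    # ||g o f||_Lip(2) <= (1 + 2^2) ||g||_Lip(2) max{||f||_Lip(2)^2, 1}.
    gamma = jnp.sqrt(float(n_in))                     # first layer: |x_j| <= 1
    bound = layer_lip2_bound(weights[0], biases[0], gamma)
    gamma = gamma * jnp.linalg.norm(weights[0], ord=2) + jnp.linalg.norm(biases[0])
    for W, b in zip(weights[1:], biases[1:]):
        bound = 5.0 * layer_lip2_bound(W, b, gamma) * jnp.maximum(bound**2, 1.0)
        gamma = jnp.sqrt(float(n_h)) * jnp.linalg.norm(W, ord=2) * gamma + jnp.linalg.norm(b)
    return bound

# Example: a width-8, depth-3 FCNN with 3-dimensional inputs.
keys = jax.random.split(jax.random.PRNGKey(0), 3)
ws = [0.1 * jax.random.normal(k, s) for k, s in zip(keys, [(8, 3), (8, 8), (2, 8)])]
bs = [jnp.zeros(8), jnp.zeros(8), jnp.zeros(2)]
print(fcnn_lip2_bound(ws, bs, n_in=3, n_h=8))
```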
The stacked recurrent models considered in this paper are based on the official implementation of S5 located at https://github.com/lindermanlab/S5 [1]. A recurrent block consists of a batch or layer normalisation, a recurrent layer, a GLU layer [37], dropout with rate \(0.1\), and a skip connection. A full model consists of a linear encoder, a number of stacked recurrent blocks, and a final linear layer. The four different recurrent layers considered are the linear recurrent unit, S5, S6, and MAMBA, where S6 refers to the selective state-space recurrence introduced by [3] and MAMBA refers to the combination of a gated MLP, convolution, and S6 recurrence that was also introduced by [3]. S5 and LRU use batch normalisation, whereas S6 and MAMBA use layer normalisation.
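For orientation, here is a minimal functional sketch of one such recurrent block. It is not the official S5 code: the parameter layout and the `recurrent_layer` argument are placeholders, and LRU and S5 would use batch normalisation in place of the layer normalisation shown.

```python
import jax
import jax.numpy as jnp

def layer_norm(x, eps=1e-6):
    # Normalise each feature vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / jnp.sqrt(var + eps)

def glu(y, w_a, b_a, w_b, b_b):
    # GLU layer: a linear branch gated elementwise by a sigmoid branch.
    return (y @ w_a + b_a) * jax.nn.sigmoid(y @ w_b + b_b)

def recurrent_block(x, params, recurrent_layer, *, key, dropout_rate=0.1, train=True):
    # One block: normalisation -> recurrent layer -> GLU -> dropout -> skip connection.
    y = layer_norm(x)                              # S6/MAMBA; LRU and S5 use batch norm
    y = recurrent_layer(y, params["recurrent"])    # e.g. an LRU, S5, or S6 scan
    y = glu(y, *params["glu"])
    if train:
        keep = jax.random.bernoulli(key, 1.0 - dropout_rate, y.shape)
        y = jnp.where(keep, y / (1.0 - dropout_rate), 0.0)
    return x + y                                   # skip connection
```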
NCDEs, NRDEs, and Log-NCDEs use a single linear layer as \(\xi_{\phi}\). NCDEs and NRDEs use FCNNs as their vector fields, configured in the same way as in their original papers [4], [9]. NCDEs use \(\text{ReLU}\) activation functions for the hidden layers and a final activation function of \(\tanh\). NRDEs use the same, but move the \(\tanh\) activation function to before the final linear layer in the FCNN. Log-NCDEs use a FCNN with SiLU activation functions for the hidden layers and a final activation function of \(\tanh\). NRDEs and Log-NCDEs take their intervals \(r_{i+1} - r_i\) to be a fixed number of observations, referred to as the Log-ODE step.
On the toy dataset, all models use a hidden state of dimension \(64\) and Adam with a learning rate of \(0.0003\) [29]. LRU, S5, S6, and MAMBA use \(6\) blocks and S5, S6, and MAMBA use a state dimension of \(64\). S5 uses \(2\) initialisation blocks and MAMBA uses a convolution dimension of \(4\) and an expansion factor of \(2\). NCDEs, NRDEs, and Log-NCDEs use a FCNN with width \(128\) and depth \(3\) as their vector field. Furthermore, they all use Heun as their differential equation solver with a fixed stepsize of \(0.01\). NRDEs and Log-NCDEs use a Log-ODE step of \(4\) and a signature truncation depth of \(2\). Log-NCDEs do not use any \(\mathrm{Lip}(\gamma)\) regularisation, i.e. \(\lambda=0\).
Table 4 provides details on the dimension, number of observations, and number of classes for the datasets chosen from the UEA-MTSCA for the experiments conducted in this paper. Care is necessary when using the EigenWorms dataset. As of June \(1^{\text{st}}\) 2024, the train and test splits obtained from https://timeseriesclassification.com/description.php?Dataset=EigenWorms contain repeated time series. The repeated data was removed for the experiments in this paper.
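One way to remove such repeats is sketched below. This is an illustration only, with all names ours; it is not the preprocessing code used for the experiments.

```python
import numpy as np

def remove_repeated_series(X_train, y_train, X_test, y_test):
    # Keep only the first occurrence of each exact series, and drop test series
    # that also appear in the training split.
    def key(x):
        return np.ascontiguousarray(x).tobytes()
    seen, keep_train, keep_test = set(), [], []
    for x in X_train:
        k = key(x)
        keep_train.append(k not in seen)
        seen.add(k)
    for x in X_test:
        k = key(x)
        keep_test.append(k not in seen)
        seen.add(k)
    keep_train, keep_test = np.array(keep_train), np.array(keep_test)
    return (X_train[keep_train], y_train[keep_train],
            X_test[keep_test], y_test[keep_test])
```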
| Dataset | Dimension | Number of Observations | Classes |
|---|---|---|---|
| EigenWorms | \(6\) | \(17984\) | \(5\) |
| EthanolConcentration | \(3\) | \(1751\) | \(4\) |
| Heartbeat | \(61\) | \(405\) | \(2\) |
| MotorImagery | \(64\) | \(3000\) | \(2\) |
| SelfRegulationSCP1 | \(6\) | \(896\) | \(2\) |
| SelfRegulationSCP2 | \(7\) | \(1152\) | \(2\) |
Tables 5 and 6 give an overview of the hyperparameters optimised over during the UEA-MTSCA and PPG-DaLiA experiments for the stacked recurrent models and the NCDE models respectively. The optimisation was performed using a grid search, selecting the configuration with the highest validation accuracy for the UEA-MTSCA datasets and the lowest mean squared error for the PPG-DaLiA dataset. All models and experiments used Adam as their optimiser and a batch size of \(32\), except the stacked recurrent models on PPG-DaLiA, which used a batch size of \(4\) due to memory constraints. NCDEs, NRDEs, and Log-NCDEs use Heun as their differential equation solver with a fixed stepsize of \(1 / \max\{500, 1 + (\text{Time series length} / \text{Log-ODE step})\}\), with \(\text{Log-ODE step}=1\) for NCDEs. Additionally, Log-NCDEs scale down their initial FCNN parameters by a factor of \(1000\) to reduce the starting \(\mathrm{Lip}(2)\) norm of the vector field.
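For concreteness, the stepsize rule above can be read as follows. This is a sketch: the function name is ours, and it assumes the integration interval has been rescaled to \([0,1]\).

```python
def heun_stepsize(series_length: int, log_ode_step: int = 1) -> float:
    # Fixed stepsize from the rule above, assuming the integration interval is
    # rescaled to [0, 1]; log_ode_step = 1 recovers the NCDE setting.
    n_steps = max(500, 1 + series_length / log_ode_step)
    return 1.0 / n_steps

# e.g. EigenWorms (17,984 observations) with a Log-ODE step of 4:
print(heun_stepsize(17984, 4))   # 1 / 4497
```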
| Hyperparameters | Options | LRU | S5 | S6 | MAMBA |
|---|---|---|---|---|---|
| Learning Rate | \(10^{-3}\) | ||||
| SCP2, PPG | |||||
| PPG | |||||
| SCP2 | EW, EC, PPG | ||||
| \(10^{-4}\) | HB | EW, SCP2 | SCP1, PPG | HB, SCP2 | |
| \(10^{-5}\) | EC | EC | EC | MI, SCP1 | |
| Include Time | True | EC, HB, SCP2 | |||
| PPG | |||||
| PPG | EW, EC, SCP2 | ||||
| False | |||||
| PPG | HB, MI, SCP1 | EW, SCP1, SCP2 | |||
| PPG | |||||
| Hidden Dimension | 16 | MI | MI, SCP2, PPG | ||
| MI, SCP2 | EW | ||||
| 64 | |||||
| SCP2 | EW | SCP1, PPG | EC, HB, SCP2 | ||
| 128 | HB, PPG | EC, HB, SCP1 | MI, SCP1, PPG | ||
| Number of Layers | 2 | HB, SCP1, SCP2 | EW, EC, SCP2 | SCP1, SCP2, PPG | MI, SCP1 |
| 4 | EW | HB | EW, EC, HB, MI | EC, HB, PPG | |
| 6 | EC, MI, PPG | MI, SCP1, PPG | EW, SCP2 | ||
| State Dimension | 16 | EC, SCP1, SCP2 | |||
| SCP1 | EC, HB, SCP1 | SCP1 | |||
| 64 | EW | MI, SCP2, PPG | EW, PPG | ||
| PPG | |||||
| 256 | HB, MI, PPG | MI, SCP2 | EC, HB | ||
| S5 Initialisation Blocks | 2 | SCP2, PPG | |||
| 4 | HB, MI | ||||
| 8 | EW, EC, SCP1 | ||||
| Convolution Dimension | 2 | EW, HB, SCP2 | |||
| 3 | MI, PPG | ||||
| 4 | EC, SCP1 | ||||
| Expansion Factor | 1 | EW, MI, SCP1 | |||
| 2 | SCP2, PPG | ||||
| 4 | EC, HB | ||||
| Hyperparameters | Options | NCDE | NRDE | Log-NCDE |
|---|---|---|---|---|
| Learning Rate | \(10^{-3}\) | |||
| MI, SCP2, PPG | ||||
| SCP1, PPG | ||||
| PPG | ||||
| \(10^{-4}\) | SCP1 | MI, SCP2 | EC, SCP1, SCP2 | |
| \(10^{-5}\) | ||||
| Include Time | True | |||
| MI, PPG | ||||
| PPG | ||||
| PPG | ||||
| False | SCP1, SCP2 | EW, HB, MI | MI, SCP1, SCP2 | |
| Hidden Dimension | 16 | MI, PPG | HB, MI | |
| 64 | ||||
| SCP1 | EC, SCP1 | |||
| 128 | ||||
| SCP1, SCP2 | MI, SCP2, PPG | EW, SCP2, PPG | ||
| Vector Field (Depth, Width) | (2, 32) | EW, SCP2 | ||
| (3, 64) | EW | EC, MI, PPG | ||
| (3, 128) | EW | HB | HB, SCP1 | |
| (4, 128) | ||||
| SCP1, SCP2, PPG | ||||
| SCP2, PPG | ||||
| Log-ODE (Depth, Step) | (1, 1) | EC, MI, SCP1, SCP2 | EC | |
| (2, 2) | HB | HB | ||
| (2, 4) | EW | SCP2 | ||
| (2, 8) | ||||
| (2, 12) | EW | |||
| (2, 16) | MI, SCP1 | |||
| (1, 10) | PPG | PPG | ||
| (2, 10) | ||||
| (2, 100) | ||||
| (2, 1000) | ||||
| Regularisation \(\lambda\) | \(10^{-3}\) | EW, MI, SCP2 | ||
| \(10^{-6}\) | EC, HB | |||
| \(0.0\) | SCP1, PPG | |||
Models are compared on their average GPU memory usage and run time for the UEA-MTSCA datasets. For this comparison, \(1000\) steps of training were run on an NVIDIA RTX 4090, with each model using the hyperparameters obtained from the hyperparameter optimisation. The specific hyperparameter choices are listed in Tables 5 and 6. In addition to the time for \(1000\) steps and GPU memory usage, shown in Figures 6 (a) and 6 (b) respectively, the average number of total training steps taken to produce the results in Table 1 is recorded in Figure 6 (c). Combining the time for \(1000\) training steps with the total number of training steps gives an approximation of the total run time on the same hardware; these results are shown in Figure 6 (d).
Although the time per training step is lower for stacked recurrent models than for NCDEs, NRDEs, or Log-NCDEs, they also require more training steps to converge. Additionally, NCDEs, NRDEs, and Log-NCDEs require less GPU memory. The largest contributors to the average run time of NCDEs are the datasets with the most observations, EigenWorms and MotorImagery. The positive impact of the Log-ODE method on computational burden is demonstrated empirically by the decrease in run time achieved by NRDEs and Log-NCDEs on EigenWorms when using a depth\(-2\) Log-ODE method. When a depth\(-1\) Log-ODE method is used, as with NRDEs on MotorImagery, the same decrease is not observed. Section 3.4 demonstrated that Log-NCDEs and NRDEs have the same asymptotic computational complexity. However, when using a depth\(-2\) Log-ODE approximation and the same stepsize, NRDEs and Log-NCDEs exhibit drastically different run times on Heartbeat, a high-dimensional dataset. This difference is partly explained by the two models having different optimal hyperparameter choices, but even when using hyperparameters identical to the NRDE's, the Log-NCDE's run time for \(1000\) steps of training is \(1673\) seconds, whereas the NRDE's run time is \(9539\) seconds. The remaining difference in run time is due to calculating the JVPs of \(f_{\theta}\) using a batched function, as discussed in Section 3.4. If the JVPs are instead calculated using a for-loop, the Log-NCDE's run time increases to \(17045\) seconds.
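To illustrate the final point, the sketch below contrasts the two ways of computing the JVPs of \(f_{\theta}\): a single `jax.vmap` over `jax.jvp` versus a Python for-loop. The vector field `f_theta` and the shapes are placeholders, not the paper's model.

```python
import jax
import jax.numpy as jnp

def f_theta(h):
    # Placeholder vector field: maps a hidden state in R^u to a u x v matrix.
    v = 4
    return jnp.tanh(jnp.outer(h, jnp.arange(1.0, v + 1.0)))

h = jnp.ones(8)                 # hidden state
directions = jnp.eye(8)         # one JVP per direction

# Batched: a single vmap over jax.jvp evaluates all JVPs together.
batched_jvps = jax.vmap(lambda d: jax.jvp(f_theta, (h,), (d,))[1])(directions)

# For-loop: one jax.jvp call per direction, evaluated sequentially.
looped_jvps = jnp.stack([jax.jvp(f_theta, (h,), (d,))[1] for d in directions])

assert jnp.allclose(batched_jvps, looped_jvps)
```

Under `jax.jit`, the batched version compiles to a single fused computation, whereas the loop unrolls into many separate JVP evaluations, which is consistent with the run-time gap reported above.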
Figure 6: Memory, time for \(1000\) steps, number of steps, and approximate total time for each model and dataset from the UEA-MTSCA considered in this paper on an NVIDIA RTX 4090. The following abbreviations are used: EigenWorms (EW), EthanolConcentration (EC), Heartbeat (HB), MotorImagery (MI), SelfRegulationSCP1 (SCP1), and SelfRegulationSCP2 (SCP2). a — Memory, b — Time, c — Number of steps, d — Total time
As of June \(1^{\text{st}}\) 2024, the EigenWorms dataset at https://timeseriesclassification.com has \(23\) duplicated time series, which were removed for the experiments in this paper.↩︎