Sliding-Window Signatures for Time Series: Application to Electricity Demand Forecasting


Abstract

Nonlinear and delayed effects of covariates often render time series forecasting challenging. To this end, we propose a novel forecasting framework based on ridge regression with signature features calculated on sliding windows. These features capture complex temporal dynamics without relying on learned or hand-crafted representations. Focusing on the discrete-time setting, we establish theoretical guarantees, namely universality of approximation and stationarity of signatures. We introduce an efficient sequential algorithm for computing signatures on sliding windows. The method is evaluated on both synthetic and real electricity demand data. Results show that signature features effectively encode temporal and nonlinear dependencies, yielding accurate forecasts competitive with those based on expert knowledge.

1 INTRODUCTION↩︎

Time series forecasting is essential across many domains where anticipated scenarios guide critical decisions, including energy management, healthcare systems, and financial markets. The task is inherently challenging: series often exhibit complex temporal dynamics, requiring models capable of capturing them.

This paper focuses in particular on forecasting electricity demand, a key challenge for maintaining the balance between supply and demand on the electric grid. As energy production is set according to predicted consumption, reliable forecasts are crucial to keeping the grid operating around the clock. Many state-of-the-art approaches rely on Generalised Additive Models (GAMs), which are a flexible way of representing demand as a sum of smooth functions applied to pre-processed calendar and weather variables. We refer to [1] for an in-depth review of GAMs and other models for electricity forecasting. However, the selection of smooth functions and covariate pre-processing requires specialist knowledge. For example, due to the thermal inertia of buildings, changes in temperature affect electricity demand with a delay, which can be accounted for by exponential smoothing [2]. Furthermore, the relationship between temperature and demand is nonlinear and can be modelled using splines (see, among others, [3]). We propose a new generic framework to automatically extract these complex dependencies of the target on exogenous variables, without prior assumptions about the nature of the relationships or the impact of the past.

The signature transform, introduced by [4] and further developed in the context of rough path theory by [5], provides a representation of sequential data through iterated integrals referred to as signatures. These objects have been shown to capture key geometric and analytic properties of multi-dimensional paths ([6]). Casting data as continuous paths and computing their signatures has proven to be an effective way of obtaining non-parametric feature sets for time-ordered data in machine learning tasks [7], [8]. Recently, a growing line of work has highlighted the relevance of signatures in modern AI pipelines, ranging from their connection to recurrent neural networks [9] to their integration with transformers for time-series tasks [10].

Building on these ideas, we aim to leverage signatures for discrete-time time series forecasting. To this end, we propose a regression model on sliding-window signatures. Sliding windows are key in our approach as they enable the signatures to focus on the most recent past, discarding distant history as opposed to using exponentially fading memory as in [11]. While [12] already used sliding windows for nowcasting in finance, we propose a novel increment-based model that is both theoretically grounded and straightforward to implement.

The goal of this paper is to present this new forecasting framework that consists of three simple steps: data augmentation, sliding-window signature calculation and ridge regression on the increments of the target time series. We support it with the following main contributions.
Theoretical foundations. We establish universal approximation results for sliding-window signatures in discrete time and prove that these features preserve stationarity when derived from a stationary series, providing a solid theoretical basis for our method.
Efficient algorithm. We develop a sequential algorithm that exploits sliding window overlap for fast and efficient signature calculation.
Empirical validation. Through experiments on synthetic and real-world data, we demonstrate that our method successfully leverages signatures as non-parametric features to capture complex temporal dynamics and outperform baselines with incorporated expert knowledge.

2 SIGNATURES FOR TIME SERIES FORECASTING↩︎

In this section, we present signatures along with the necessary theoretical background, frame them as features, and place them in a discrete-time forecasting setting. For a more rigorous mathematical treatment, we refer to Appendix 7.

2.1 Preliminaries↩︎

We define a continuous path in \(\mathbb{R}^d\) as any continuous mapping \(x : [s, t] \longrightarrow \mathbb{R}^d\). We assume \(d\geq2\).

Definition 1. Let \(x:[s,t] \longrightarrow \mathbb{R}^d\) be a continuous path. The 1-variation of \(x\) is defined by \[\|x\|_{\mathrm{1}-\mathrm{var}}=\sup _{\left(t_0, \ldots, t_k\right) \in P} \sum_{i=1}^k\left\|x_{t_i}-x_{t_{i-1}}\right\|,\] where \(P=\{(t_0, \ldots, t_k) \mid k \geq 0,\) \(s=t_0<\cdots<t_k=t\}\) denotes the set of all finite partitions of \([s, t]\).

Intuitively, \(\|x\|_{\mathrm{1}-\mathrm{var}}\) can be seen as the length of the path \(x\). From now on, we consider only paths of finite length, i.e. \(\|x\|_{\mathrm{1}-\mathrm{var}} < \infty\). We denote by \(\otimes\) the tensor product and by \((\mathbb{R}^d)^{\otimes k}\) the \(k\)-th tensor power of \(\mathbb{R}^d\), with \((\mathbb{R}^d)^{\otimes 0} := \mathbb{R}\). We note that \((\mathbb{R}^d)^{\otimes k}\) is a Hilbert space of dimension \(d^k\).

Definition 2. We denote by \(\mathscr{T}(\mathbb{R}^d)\) the space of square-summable sequences of tensors of increasing order: \[\mathscr{T}(\mathbb{R}^d) = \left\{ (a_k)_{k\ge0} \;\middle|\; a_k \in (\mathbb{R}^d)^{\otimes k},\; \sum_{k=0}^\infty \|a_k\|^2_{(\mathbb{R}^d)^{\otimes k}} < \infty \right\}.\]

We endow \(\mathscr{T}(\mathbb{R}^d)\) with the scalar product \(\langle \mathbf{a}, \mathbf{b}\rangle_{\mathscr{T}(\mathbb{R}^d)} = \sum_{k=0}^{\infty} \langle a_k, b_k\rangle_{(\mathbb{R}^d)^{\otimes k}} \,,\) which induces the norm \(\|\mathbf{a}\|_{\mathscr{T}\left(\mathbb{R}^d\right)} = \sqrt{\sum_{k=0}^{\infty}\left\|a_k\right\|_{\left(\mathbb{R}^d\right)^{\otimes k}}^2}\).

Proposition 1. \(\left(\mathscr{T}\left(\mathbb{R}^d\right),\langle\cdot, \cdot\rangle_{\mathscr{T}\left(\mathbb{R}^d\right)}\right)\) is a separable Hilbert space.

2.2 Signatures↩︎

Definition 3. The signature of a finite-length path \(x\) on \([s, t]\) is defined as an infinite tensor sequence: \[S\left( x_{[s,t]} \right)=\left(1, \;S^1\left( x_{[s,t]} \right), \ldots, \;S^k\left( x_{[s,t]} \right), \;\ldots \right),\] where the \(k\)-th element (called level) is given by \[{S^k \left( x_{[s,t]} \right)}=\idotsint \displaylimits_{s < u_1<\cdots<u_k < t} \mathrm{d} x_{u_1} \otimes \cdots \otimes \mathrm{d} x_{u_k} \in {\left(\mathbb{R}^d\right)}^{\otimes k}.\]

Each tensor \(S^k(x)\) can be written in terms of its elements, referred to as signature coefficients, indexed by multi-indices \((i_1,\ldots,i_k) \in \{1,\ldots,d\}^k\): \[S^{\left(i_1, \ldots, i_k\right)} = \idotsint \displaylimits_{s < u_1<\cdots<u_k < t} \mathrm{d}x^{\left(i_1\right)}_{u_1} \cdots \mathrm{d}x^{\left(i_k\right)}_{u_k}.\]

In practice, instead of dealing with infinite sequences, we only consider the levels up to (truncation) order \(N\), defining the truncated signature as \[S^{\leq N}\left( x_{[s,t]} \right) = \left(1, \;S^1\left( x_{[s,t]} \right), \;\ldots, \;S^N\left( x_{[s,t]} \right) \right).\] Furthermore, we often ignore the tensor structure and view \(S^{\leq N}\) as a collection of all signature coefficients with multi-index of length \(k \leq N\), arranged in a vector of size \(s_d(N)= \sum_{k=0}^{N} d^k = \left(d^{N+1}-1\right) /(d-1)\). We also note that, when the interval is not of importance, we use the abbreviation \(S(x)\).

While the definition may look complex, we now illustrate that in the simple case of linear paths there is no need to evaluate or numerically approximate the iterated integrals. Instead, we obtain a closed formula for the signature coefficients, which we can interpret as monomials of path increments.

Example 1 (Signature of a linear path). Let \(x:[s,t] \rightarrow \mathbb{R}^d\), \(u \longmapsto \frac{x_{t} - x_s}{t-s}(u-s) + x_{s}\) be a \(d\)-dimensional linear path. It follows by integration that \(S^{\left(i_1, \ldots, i_k\right)}\left(x_{[s,t]}\right) = \frac{1}{k!} \prod_{j=1}^k\left(x_t^{(i_j)}-x_s^{(i_j)}\right)\) and \(S^k\left( x_{[s,t]} \right)=\frac{1}{k!}\left(x_t-x_s\right)^{\otimes k}\). (Detailed calculation in Example 4.)
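For illustration, the closed formula is easy to check numerically. The following Python sketch (our own, assuming the level-by-level, lexicographic coefficient ordering used by the iisignature package) compares the formula with a library computation:

```python
import numpy as np
import iisignature

rng = np.random.default_rng(0)
d, N = 3, 4
xs, xt = rng.normal(size=d), rng.normal(size=d)   # segment endpoints

# iisignature computes the signature of the piecewise linear path through
# the given points; two points give a single linear segment.
sig = iisignature.sig(np.stack([xs, xt]), N)

# Closed formula: level k equals (x_t - x_s)^{tensor k} / k!.
delta, levels, tensor, fact = xt - xs, [], np.ones(()), 1.0
for k in range(1, N + 1):
    tensor = np.multiply.outer(tensor, delta)
    fact *= k
    levels.append((tensor / fact).ravel())

assert np.allclose(sig, np.concatenate(levels))
```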

We now turn to key signature properties leveraged in our framework. It follows from their definition as iterated integrals that signatures are invariant to both translation and time reparametrization (see Proposition 4 in the Appendix). Consequently, they do not encode the path’s starting point nor the speed at which it was traversed. A standard approach to reintroduce the latter information is \(\textit{time augmentation}\), i.e. adding time as a path component.

The following proposition and example will prove crucial for efficient signature calculation.

Proposition 2 (Algebraic properties). Let \(x:[s, t] \mapsto \mathbb{R}^d\) and \(y:[t, u] \mapsto \mathbb{R}^d\) denote two paths of finite length.

  1. (Chen’s identity) Let \(x * y:[s, u] \mapsto \mathbb{R}^d\) be the concatenation of \(x\) and \(y\), meaning \((x * y)_v=x_v\) for \(v \in[s, t]\) and \((x * y)_v=x_t+y_v-y_t\) for \(v \in[t, u]\). Then \(S(\left(x * y\right)_{[s,u]})=S\left( x_{[s,t]} \right) \otimes S(y_{[t,u]}).\)

  2. (Time reversal) We denote the time-reversal of \(x\) as the path \(\overleftarrow{x}:[s, t] \mapsto \mathbb{R}^d\) where \(\overleftarrow{x}(u)=x_{s+t-u}\). Then \(S\left( x_{[s,t]} \right) \otimes S\left( \overleftarrow{x}_{[s,t]} \right) = (1,0,0,\ldots).\)

Chen’s identity allows us to extend the result from Example 1 to piecewise linear paths.

Example 2. Let \(x:[s,t] \rightarrow \mathbb{R}^d\) be a piecewise linear path and let \(s=u_0<u_1<\cdots<u_k=t\) be a partition such that \(x\) is linear on each \(\left[ u_{j-1}, u_j \right]\). From Chen’s identity we have: \[S\left( x_{[s,t]} \right)=S\left( x_{[s,u_1]} \right)\otimes \cdots \otimes S\left( x_{[u_{k-1},t]} \right),\] where every \(S\left( x_{[u_{j-1},u_j]} \right)\) is given in Example 1.
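This concatenation rule can also be checked numerically. A minimal sketch, assuming the sigcombine routine available in recent versions of iisignature (which implements the truncated tensor product of Chen's identity):

```python
import numpy as np
import iisignature

rng = np.random.default_rng(1)
d, N = 2, 3
pts = rng.normal(size=(6, d))          # knots of a piecewise linear path

full = iisignature.sig(pts, N)         # signature of the whole path
left = iisignature.sig(pts[:4], N)     # first three segments
right = iisignature.sig(pts[3:], N)    # last two segments (shared knot)

# Chen's identity: the tensor product of the two truncated signatures
# equals the truncated signature of the concatenated path.
assert np.allclose(full, iisignature.sigcombine(left, right, d, N))
```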

The next result ([13], Lemma 5.1) is key for switching from infinite to finite-dimensional objects.

Proposition 3. Let \(x:[s,t] \longrightarrow \mathbb{R}^d\) be a path of finite length. It holds that: \[\begin{align} {\|{S^k \left( x_{[s,t]} \right)}\|}_{ {\left( \mathbb{R}^d \right)}^{\otimes k}} & \leq \frac{\|x\|_{1 \text{-} var}^k}{k!}, \\ \|S\left( x_{[s,t]} \right)\|_{\mathscr{T}\left(\mathbb{R}^d\right)} & \leq \exp \left(\|x\|_{1 \text{-} var}\right)<\infty. \end{align}\]

The first inequality can be seen as a bound on the information carried by each signature level: the factorial decay of the \(k\)-th level norm means that information decreases as the level increases. Consequently, for sufficiently large \(N\), \(S^{\leq N}(x)\) provides a good approximation of \(S(x)\), justifying truncation in practice. The second inequality shows that signatures lie in the separable Hilbert space \(\mathscr{T}(\mathbb{R}^d)\), a property we will exploit when working with probability distributions.

Finally, we present the main motivation for using truncated signatures as features in machine learning tasks.

Theorem 1 (Universal approximation). Let \(K\) be a compact subset of the space of finite-length paths from \([s,t]\) to \(\mathbb{R}^d\) and such that for any \(x \in K,\) \(x_s=a\) for some \(a \in \mathbb{R}^d\) and \(x\) has at least one monotone coordinate. Let \(f: K \rightarrow \mathbb{R}\) be continuous. Then, for every \(\varepsilon>0\), there exists \(N \ge 1, \theta \in \mathbb{R}^{s_d(N)}\), such that, for any \(x \in K\), \[\left\lvert f(x)- \theta^{\top} {S^{\leq N}\left(x_{[s,t]}\right)}\right\rvert \leq \varepsilon.\]

This result follows directly from the Stone-Weierstrass theorem, showing that signature coefficients serve as the path analogue of monomials. Therefore, they can be leveraged to linearize the problem of learning a more complex (non-linear) function on a compact set of paths.

So far, our treatment of paths and signatures has been deterministic. Turning to the statistical perspective, from now on we consider random paths \(X : [s, t] \to \mathbb{R}^d\). Indeed, for each \(\omega\) in the sample space \(\Omega\), the realisation \(X(\omega)\) is a path for which we can define the signature \(S\left( X(\omega) \right)\) as before. In other words, \(S(X)\) is a random variable taking values in a separable Hilbert space (Proposition 1).

2.3 Signature Features for Time Series↩︎

We now present the pipeline for using signatures as a feature extraction method for discrete-time time series forecasting.

Let \(\left( Y_t \right)_t\) denote the target time series and \(\left( X_t \right)_t\) denote the covariate time series, with \(Y_t \in \mathbb{R}\) and \(X_t \in \mathbb{R}^d\). We consider predictions for \(Y_t\) of the form \(f(X_{u \le t})\) for some continuous, measurable, possibly non-linear function \(f\). In practice, when forecasting electricity demand, covariates from the distant past, such as temperature from a year ago, have little predictive value. Hence, we assume short-term dependence: the forecast of \(Y_t\) depends only on a finite number of the most recent observations of \(\left( X_t \right)_t\).

To apply the previously established results, discrete data must be first embedded into continuous time to yield a path. The choice of embedding can significantly impact the performance of signature-based methods [14]. We turn to linear interpolation, meaning that the path on the interval \(\left[t, t+1\right]\) is defined by \(u \longmapsto X_{t}+{(u-t)}\left(X_{t+1}-X_{t}\right).\) By concatenating finitely many of these segments, we obtain piecewise linear paths. We note that these are finite-length paths and therefore fall within the theoretical framework established earlier. Moreover, as illustrated in Example 2, they allow for a straightforward iterative computation, consisting in calculating the signatures on each linear segment as in Example 1 and concatenating them inductively through the tensor product (see Remark 7). This procedure is efficiently implemented in the Python package iisignature ([15]), used in this work.

In classical frameworks like  [16], signatures summarize the entire path. In time series, this corresponds to expanding windows, where all past points contribute equally. This is unsuitable for our application, where only recent observations matter. A more natural choice is a sliding window of size \(w\), considering only data from \(t-w\) to \(t\). As shown in Section 3, sliding windows also help preserve stationarity, though they introduce \(w\) as a hyperparameter, which we discuss in Section 5.

Finally, we apply time augmentation within each sliding window: \(\left(0, X_{t-w} \right), \left(1, X_{t-w+1} \right), \ldots, \left( w, X_t \right)\) at time \(t\). This ensures that the resulting paths have at least one monotone coordinate.

Remark 1. This approach yields the same signature values as if the time coordinate was added over the entire time horizon: \(\left(t-w, X_{t-w} \right), \left(t-w+1, X_{t-w+1} \right),\ldots, \left( t, X_t \right)\). Consequently, when the window slides, only the new observation needs to be added, without adjusting previous timestamps.

Figure 1: Method Pipeline: leveraging sliding-window signature features to linearize the forecasting task.

Our method consists of the following steps. We first fix a sliding window size \(w\). At each time \(t\), we consider \(\left(0, X_{t-w} \right), \left(1, X_{t-w+1} \right), \ldots, \left( w, X_t \right)\). We linearly interpolate these points to obtain a piecewise linear path, denoted by \(X_{[t-w,t]}\). Having constructed a continuous path, we can calculate its signature \(S\left( X_{[t-w,t]} \right)\) and leverage the theory presented in Section 2. In terms of application, we can now use the truncated signature \(S^{\leq N} \left( X_{[t-w,t]} \right)\) as a feature set for \(\left(X_t\right)_t\) on \([t-w,t]\). We summarize this as a pipeline in Figure 1. The final step of fitting a linear model is motivated by the universal approximation theorem, which has yet to be established for our sliding-window, discrete-time setting. In the following section, we identify conditions under which this result holds and examine the important property of stationarity within our framework.
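To make the pipeline concrete, the following minimal sketch implements the feature-extraction step naively, recomputing each window from scratch (the efficient sequential update is the subject of Section 4.2); the function name and array conventions are ours:

```python
import numpy as np
import iisignature

def sliding_signature_features(X, w, N):
    """Signature features on sliding windows, recomputed from scratch.

    X : array of shape (T, d) holding the covariate series.
    Returns an array of shape (T - w, siglength(d + 1, N)), one row per
    window [t - w, t], with rescaled time added as an extra coordinate."""
    T, d = X.shape
    time = np.arange(w + 1) / w                        # rescaled time, cf. Section 5.2
    feats = np.empty((T - w, iisignature.siglength(d + 1, N)))
    for i, t in enumerate(range(w, T)):
        window = np.column_stack([time, X[t - w:t + 1]])   # time augmentation
        feats[i] = iisignature.sig(window, N)
    return feats
```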

3 THEORETICAL CONTRIBUTIONS↩︎

We introduce two assumptions on the discrete-time time series of covariates \({(X_t)}_t\).

Assumption 1 (Uniform boundedness). Suppose that \(|X^{(i)}_t| \leq M\) a.s. for some \(M > 0\) and all \(1 \leq i \leq d\).

Assumption 2 (Stationarity of increments). Suppose that the increments \((X_{t+1}-X_{t})_{t \ge 1}\) are strictly stationary.

Assuming the time series is uniformly bounded is not restrictive, as real-world data such as temperature lie within physical limits. Similarly, stationarity of increments is a milder assumption than stationarity of the full series; for example, while temperature exhibits seasonality, its half-hour increments can reasonably be considered stationary (see Appendix 10).

3.1 Universal Approximation Theorem↩︎

We treat the paths on sliding windows as functions \({X}_{[t-w, t]}(\omega) : [0,w] \rightarrow \mathbb{R}^{d+1}\), which share a common time domain, allowing us to consider the set of such paths and functions over this set.

Theorem 2 (Universal approximation on sliding windows). Let \({(X_t)}_t\) be a \(d\)-dimensional time series such that Assumption 1 holds. Let \(w \in \mathbb{N}\) be a fixed window size and \(f : C\left([0,w],\mathbb{R}^{d+1}\right) \rightarrow \mathbb{R}\) a continuous function with respect to the uniform topology. It holds that for every \(\varepsilon>0\), there exists \(N \in \mathbb{N}, \theta \in \mathbb{R}^{s_{d+1}(N)}\), such that \[\sup_{t}\left\lvert f\left({X}_{[t-w, t]} - (0,X_{t-w})\right)- \theta^{\top} S^{\leq N} \left({X}_{[t-w, t]}\right) \right\rvert \leq \varepsilon.\]

Sketch of proof. We first introduce paths on windows translated by the starting point, \({X}_{[t-w, t]} - \left( 0, X_{t-w} \right)\). We consider the set of all realisations of these random paths on all sliding windows. These paths have a monotone coordinate (time augmentation), all begin at the same point (translation), are uniformly bounded (Assumption 1) and are all Lipschitz with the same constant (Assumption 1 and piecewise linearity). By constructing a superset of paths with the same properties that is compact in the uniform topology, we can apply Theorem 1 to it, completing the proof. For more detail, see Appendix 8.1.

Remark 2. Assuming all paths start at the same point is vital, since signatures cannot encode the path’s starting value due to translation invariance. This is commonly addressed in practice through basepoint augmentation, prepending a zero to each path ([17]). As it was not suitable for our sliding-window framework, we instead subtract the initial value from each window. The truncated signature remains unchanged under this transformation, \(S^{\le N}\bigl(X_{[t-w,t]} - (0, X_{t-w})\bigr) = S^{\le N}(X_{[t-w,t]})\), again due to translation invariance. This is key for efficient sequential computation, as there is no need to modify the path at each step. The subtraction has mainly a theoretical implication: we consider the function \(f\) as acting on translated paths, though in practice the starting point still carries predictive information. We address this explicitly in Section 4.

The main takeaway from this theorem is that we can effectively linearize the task of learning the mapping \(f\) by approximating it with a linear functional on the truncated signature. The result allows for an approximation of a wide range of complicated functions \(f\) (such as GAMs or conditional expectations) without requiring assumptions on their specific form, provided only that they are continuous with respect to the input.

3.2 Stationarity of Signatures↩︎

Computing the signature of \((X_t)_t\) over a sliding window produces a new signature series \({\left( S(X_{[t-w,t]}) \right)}_t\). Since these signatures form the features of our forecasting model, we study their stationarity to better characterize the resulting prediction process \((\widehat{Y}_t)_t\).

Remark 3. Although signatures are defined as infinite sequences, the usual definition of strict stationarity (via equality in distribution) still applies, since they lie in the separable Hilbert space \(\mathscr{T}(\mathbb{R}^d)\), where finite-dimensional marginal distributions determine the law of the process [18].

Theorem 3. Let \({(X_t)}_{t}\) be a discrete \(d\)-dimensional time series such that Assumption 2 holds. Then, for any truncation order \(N \in \mathbb{N}\), the time series \({\left({S^{\leq N}\left(X_{[t-w,t]}\right)}\right)}_t\) is strictly stationary. Furthermore, the time series \({\left({S\left(X_{[t-w,t]}\right)}\right)}_t\) is strictly stationary.

Sketch of proof. The main argument is that signature coefficients, and consequently truncated signatures, are deterministic, measurable functions of the series’ increments. Applying any measurable, deterministic function to two random vectors with the same distribution preserves their equality in distribution. Therefore, given Assumption 2, the stationarity of truncated signatures holds. The conclusion for signatures follows immediately from the previous result and Remark 3. For the full proof, see Appendix 8.2.

Remark 4. Let us now suppose that Assumptions 1 and 2 hold for our time series of covariates \((X_t)_t\). Combining the two theoretical results, we can now conclude that the prediction process \(( \widehat{Y}_t)_t\), with \(\widehat{Y}_t = \theta^{\top} {S^{\leq N}\left({X}_{[t-w, t]} \right)}\), is also strictly stationary as a linear transformation of a stationary process. This has important implications for choosing a model.

4 IMPLEMENTATION↩︎

4.1 Model↩︎

The preceding theoretical considerations yield two key implications for the specification of our model. First, our forecast is a function of the translated paths \(X_{[t-w,t]} - \left(0,X_{t-w}\right)\). Second, in accordance with Remark 4, under Assumption 2 our forecast is stationary. It is therefore natural to apply our model to a stationary target variable. These requirements are, however, restrictive and fail to hold in the context of electricity demand. To address this, we introduce an additional hyperparameter: a delay \(D\) in the target variable. The delay \(D\) is selected to be sufficiently large to ensure that at time \(t\) the true value of \(Y\) at time \(t-D\) is observable, but small enough that the increment process \(\Delta_D Y_t := Y_t - Y_{t-D}\) can be treated as stationary. Furthermore, it is far more realistic that the increments \(\Delta_D Y_t\), rather than the raw values \(Y_t\), depend on the translated paths (see Remark 2).

We now present our pipeline applied to \(\Delta_D Y_t\). The training procedure consists of fitting a ridge regression \[\begin{align} \hat{\theta} \in \arg\min\limits_\theta \sum\limits_{t=1}^{t_{\mathrm{train}}} \left(\Delta_D Y_t - {\theta}^{\top} S^{\leq N} \left( x_{[t-w,t]} \right)\right)^2 + \lambda \|\theta\|_2^2, \end{align}\] where the regularization constant \(\lambda > 0\) is chosen on the validation set. Our final forecast of the target for any \(t> w\) is then \(\widehat{Y}_t = y_{t-D} + \widehat{\Delta_D Y_t}\), with \(\widehat{\Delta_D Y_t} = {\hat{\theta}}^{\top} S^{\leq N} \left( x_{[t-w,t]} \right)\). See Appendix 9 for the full algorithm in pseudocode.
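The following sketch illustrates the training and forecasting steps, using scikit-learn's Ridge as a stand-in for the ridge regression above; the helper name and index alignment are ours, and the feature matrix is assumed to be computed as in Section 2.3:

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_sig_forecast(feats, Y, D, w, t_train, lam):
    """feats[i] holds the signature features of the window ending at t = w + i;
    Y is the target series, aligned with the covariates."""
    dY = Y[w:] - Y[w - D:-D]                  # increments Delta_D Y_t for t >= w
    # fit_intercept=True plays the role of the constant level-0 coefficient
    model = Ridge(alpha=lam, fit_intercept=True)
    model.fit(feats[:t_train], dY[:t_train])
    # Y_hat_t = Y_{t-D} + predicted increment, for every t >= w
    return Y[w - D:-D] + model.predict(feats)
```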

Several challenges arise when implementing this model. We begin by addressing the choice of hyperparameters, namely \(D, w\) and \(N\). The delay \(D\) depends on the data and is chosen so that the assumptions of stationarity of \(\left( \Delta_D Y_t \right)_t\) and the availability of \(Y_{t-D}\) at time \(t\) hold. When selecting the truncation order \(N\), we highlight that the size of the truncated signature \(s_d(N)\) grows polynomially with the dimension of the covariate series \(d\), but exponentially with \(N\). Consequently, large values of \(N\) (above 10) are typically infeasible. In practice, \(N\) is either chosen on a validation set or fixed to a moderate value. The window size \(w\) can likewise be selected on the validation set, noting that only \(w \geq D\) are considered so that \(\Delta_D Y_t\) may be forecasted as a function of \(X_{[t-w,t]} - \left(0,X_{t-w}\right)\).

The use of ridge regression is motivated by the ill-conditioning of the design matrix. High correlations among signature components, amplified by overlapping sliding windows, lead to unstable coefficient estimates. We note that lasso and ridge regression are common when working with signature features [16], [19]; we adopt ridge, as it performs better in the presence of collinearity. Additionally, we omit all signature coefficients that depend solely on time. The \(k\)-th such coefficient is the constant \(S^{(1,\ldots,1)}(x_{[t-w,t]}) = \frac{w^k}{k!}\), contributing only redundant intercept terms and worsening the collinearity. We thus redefine \(S^{\leq N}\) to exclude these coefficients, as in the sketch below.
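Assuming time is the first path coordinate and coefficients are stored level by level in lexicographic order (as in iisignature), the time-only coefficients occupy the first position of each level block and can be dropped as follows:

```python
import numpy as np

def time_only_indices(d, N):
    """Positions of the coefficients S^{(1,...,1)} within the flattened
    truncated signature: the multi-index (1, ..., 1) is the first entry
    of each level block of size d**k."""
    idx, offset = [], 0
    for k in range(1, N + 1):
        idx.append(offset)
        offset += d ** k
    return idx

# e.g. feats = np.delete(feats, time_only_indices(d + 1, N), axis=1)
```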

Finally, the truncated signature’s rapidly growing size (in \(N\) and \(d\)) and the tensor multiplications involved in its computation make calculating signatures for large and numerous windows computationally expensive. To address this, we propose an algorithm that incrementally updates the signature as the window slides, rather than recomputing it from scratch.

4.2 Efficient Signature Computation↩︎

The key observation is that in our pipeline, there is a large overlap in paths captured within two consecutive sliding windows. To exploit this, we use the algebraic properties of signatures from Proposition 2. Recall that the time-reversal property allows us to remove the contribution of \(X_{t-w}\) by computing the signature of the linear section from \(\left(t-w+1, X_{t-w+1}\right)\) back to \(\left(t-w, X_{t-w}\right)\) (reversing time) and concatenating it on the left of the path using Chen’s identity. Similarly, to incorporate the new data point \(X_{t+1}\), it suffices to compute the signature of the linear section between \(\left(t, X_t\right)\) and \(\left(t+1, X_{t+1}\right)\) and concatenate it on the right, again via Chen’s identity. These sequential updates are possible because the path remains unchanged as the window slides, which is still consistent with the theoretical insights, as highlighted in Remarks 1 and 2.

With the proposed procedure, presented in Algorithm 2 and implemented in Python, the number of tensor products required to compute the signature on a new window drops from \(w\) to just 2. This ensures that, even with a very fine time grid and a large window size, the procedure remains computationally feasible.
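A minimal sketch of one update step, assuming iisignature's sigcombine for the tensor product of truncated signatures; the helper names are ours:

```python
import numpy as np
import iisignature

def segment_sig(p, q, N):
    """Truncated signature of the single linear segment from p to q."""
    return iisignature.sig(np.stack([p, q]), N)

def slide(sig_old, p_out, p_next, p_last, p_in, N):
    """One window update: drop the oldest segment, append the newest.

    sig_old : signature over the current window, ending at p_last
    p_out   : oldest (time-augmented) point, leaving the window
    p_next  : its successor; p_in is the incoming point after p_last
    """
    d = len(p_out)
    # Time reversal: left-concatenating the reversed first segment
    # cancels its contribution (Proposition 2, item 2).
    sig = iisignature.sigcombine(segment_sig(p_next, p_out, N), sig_old, d, N)
    # Chen's identity: right-concatenate the new linear segment.
    return iisignature.sigcombine(sig, segment_sig(p_last, p_in, N), d, N)
```

Each call to slide performs exactly two tensor products, independently of the window size \(w\).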

Figure 2: Signature update on sliding windows

5 NUMERICAL EXPERIMENTS↩︎

This section demonstrates a proof of concept of our sliding-window signature approach to time series forecasting. Our focus is on forecasting electricity demand in France, which exhibits a strong nonlinear dependence on recent past values [1]. This time series also depends heavily on its own past values, so it can be useful to consider its available observations as features (see, among others, [20]). Our experiments on both synthetic and real data examine the effectiveness of signatures in capturing such dependencies. Both code and data are made open source, see Appendix 10.

5.1 Dataset↩︎

We gather half-hourly electricity demand data from the Eco2mix open-source dataset published by RTE, France’s electricity transmission system operator. We combine it with temperature observation data from Météo France. To obtain a time series of temperatures at the national level, we average the observations across French weather stations. We then linearly interpolate the data to get a unified dataset comprising 48 observations per day, covering the period from January 1, 2012, to December 30, 2015. This stable period was chosen in order to avoid the significant fluctuations in electricity demand brought about by the COVID-19 pandemic and the 2022 energy crisis. In what follows, we denote by \(Y_t\) the electricity demand and by \(T_t\) the temperature at any time \(t=1, 2, \dots\) We apply our method with a single feature, \(T_t\), and underline that in practice it is replaced with temperature forecasts. Furthermore, for any smoothing parameter \(\alpha \in [0,1]\), we define the exponentially smoothed temperature as \[\left\{ \begin{array}{ll} \overline{T}_1^\alpha = T_1 \\ \overline{T}_t^\alpha = (1-\alpha)\overline{T}_{t-1}^\alpha + \alpha T_t \,, \,\textrm{ for any } t \geq 2 \,. \end{array} \right.\] As mentioned in the introduction, smoothed temperature values are often used to model thermal inertia and have been shown to enhance the quality of forecasts. These variables stem from expert knowledge and will be useful for generating synthetic data and comparing the signature method to relevant baselines.
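For reference, the smoothing recursion translates directly into code; a minimal sketch:

```python
import numpy as np

def exp_smooth(T_series, alpha):
    """Exponentially smoothed temperature, following the recursion above."""
    out = np.empty(len(T_series))
    out[0] = T_series[0]
    for t in range(1, len(T_series)):
        out[t] = (1 - alpha) * out[t - 1] + alpha * T_series[t]
    return out
```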

5.2 Experimental Setup↩︎

In what follows, we represent the data as a two-dimensional path consisting of temperature augmented with rescaled time, i.e., \((t/w, T_t)\), where \(w\) denotes the window size. Time is rescaled so that its increments, and therefore the scale of signature coefficients, remain comparable across different choices of \(w\). The selection of window size is data-driven: since half-hourly demand strongly depends on the time of day, we preserve this structure by considering window sizes in units of full days. For the delay, we fix \(D=2 \text{ days}\), reflecting the availability of observed demand. We use the years 2012 and 2013 to train the model, the year 2014 as a validation set on which we fix the hyperparameters (namely the ridge regularization constant \(\lambda\)) and the year 2015 as a test set.

Given a window size \(w\) and a signature truncation order \(N\), we predict the target time series using the approach described in Subsection 4.1, which we denote by \(\mathrm{RidgeSig}\). We evaluate our method and benchmark models, namely linear regressions performed on various features \(X^1,X^2,\dots\) and denoted \(\mathrm{LR}(X^1,X^2,\dots)\), using the two following standard error metrics on the test set: Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). We recall that for target values \(Y_1,\dots, Y_T\) and the associated forecasts \(\hat{Y}_1,\dots, \hat{Y}_T\), we have: \[\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^T \big( \hat{Y}_t-Y_t \big)^2}, \qquad \mathrm{MAPE} = \frac{100}{T}\sum_{t=1}^T\frac{|\hat{Y}_t-Y_t|}{|Y_t|}.\]
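Both metrics are straightforward to implement; a small sketch (note the square root in the RMSE):

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y_hat - y) ** 2))

def mape(y, y_hat):
    return 100.0 * np.mean(np.abs(y_hat - y) / np.abs(y))
```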

5.3 Proof of Concept on Synthetic Data↩︎

5.3.0.1 Data generation.

For a smoothing parameter \(\alpha\) and any time step \(t\), we generate the electricity demand time series \(\big(\widetilde{Y}^\alpha_t\big)_t\) as: \[\widetilde{Y}^\alpha_t = \theta_1 \overline{T}_t^\alpha + \theta_2 \big(\overline{T}_t^\alpha\big)^2 + \varepsilon_t, \quad \varepsilon_t \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,\sigma^2)\,.\] It therefore depends non-linearly on recent and present temperatures, with \(\alpha\) controlling the impact of past temperature values: the smaller \(\alpha\), the longer the memory. We refer to Appendix D for details on fitting the model parameters (\(\theta_1\), \(\theta_2\) and \(\sigma\)) and the fidelity of the synthetic data to real demand data.
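A sketch of the generation step (the parameter values and random seed are placeholders):

```python
import numpy as np

def synthetic_demand(T_series, alpha, theta1, theta2, sigma, seed=0):
    """Demand depending non-linearly on exponentially smoothed temperature."""
    rng = np.random.default_rng(seed)
    Tbar = np.empty(len(T_series))
    Tbar[0] = T_series[0]
    for t in range(1, len(T_series)):
        Tbar[t] = (1 - alpha) * Tbar[t - 1] + alpha * T_series[t]
    eps = rng.normal(0.0, sigma, size=len(T_series))
    return theta1 * Tbar + theta2 * Tbar**2 + eps
```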

5.3.0.2 Capturing temporal and non-linear dependencies.

Fixing the smoothing parameter \(\alpha = 0.005\) and the signature truncation order \(N=4\), the optimal window obtained on the test dataset is \(w=9\). Results in Table 1 show that our approach outperforms all the benchmarks except \(\mathrm{LR}(\overline{T}_t^\alpha, (\overline{T}_t^\alpha)^2)\), which constitutes the best achievable result, since this model uses exactly the variables, and their relation to the target, that were used for data generation. These results suggest that our signature approach captures both temporal dependencies, as it outperforms \(\mathrm{LR}(\overline{T}_t^\alpha)\), and nonlinear effects, as it outperforms \(\mathrm{LR}(T_t, T_t^2)\).

Table 1: RMSE (MW) and MAPE (%) on test data set for synthetic data generated with \(\alpha=0.005\).
Model | \(\mathrm{LR}(T)\) | \(\mathrm{LR}(T, T^2)\) | \(\mathrm{LR}(\overline{T}^{\alpha})\) | RidgeSig | \(\mathrm{LR}\big(\overline{T}_t^\alpha, (\overline{T}_t^\alpha)^2\big)\)
---|---|---|---|---|---
RMSE (MW) | 5 435 | 4 729 | 3 079 | 1 637 | 995
MAPE (%) | 8.3 | 6.4 | 4.9 | 2.5 | 1.5

5.3.0.3 Optimal sliding window size.

We now investigate how the optimal window size is affected by the strength of past dependencies in the data. Figure 3 shows the RMSE of our approach with truncation order fixed at \(N=5\), plotted for window lengths ranging from 2 to 32 days, applied to synthetic data generated with two different smoothing parameters, \(\alpha = 0.005\) and \(\alpha = 0.05\). This confirms the intuition that as \(\alpha\) decreases, the optimal window size increases. Capturing this additional structure requires a larger window so that the signature representation has access to a broader history of the covariates.

Figure 3: RMSE (MW) on the test set as a function of the sliding window size for synthetic data generated with \(\alpha=0.005\) (top: optimal window = 9 days) and \(\alpha=0.05\) (bottom: optimal window = 3 days).

Remark 5. We can show by induction that \(\overline{T}_t^\alpha = \sum_{s=0}^{t-1} \alpha (1-\alpha)^{s}T_{t-s} + (1- \alpha)^{t} T_1\), so the ratio between the weight of \(T_{t-s}\) and that of \(T_{t}\) in the smoothing equals \((1-\alpha)^s\). After \(3\) days (i.e. \(s=144\) half-hourly steps), this ratio still equals \(0.48\) for \(\alpha = 0.005\), while it is already \(6 \times 10^{-4}\) for \(\alpha = 0.05\). This suggests that the temperature from three days ago continues to have a significant impact on the data generated by the first model, whereas it has already been forgotten by the second. Figure 3 illustrates this phenomenon.

5.3.0.4 Optimal signature truncation order.

We investigate the impact of the signature truncation order by running our algorithm for \(N=2\) to \(N=7\) and window sizes from \(2\) to \(32\) days. As shown in Figure 4, an order that is too small makes it difficult to detect temporal and non-linear dependencies, which degrades performance; \(N\) must therefore be large enough. On the other hand, we highlight that the larger the value of \(N\), the longer the computational time and the more unstable the results. This suggests that the truncation order should be between 4 and 6.

Figure 4: RMSE (MW) on the test set for synthetic data (\(\alpha=0.005\)) as a function of the sliding window size and for different truncation orders.

5.4 Electricity Demand Forecasting↩︎

Finally, we apply our procedure to real electricity demand data. Since real demand, unlike the synthetic data, exhibits a pronounced weekly cycle, we account for it by changing the forecasting delay to \(D = 7\) days. We add to the previous comparison benchmarks using lagged demand from one week prior \(\left(Y_{t-7\text{ days}}\right)\). We emphasize that we tested various smoothing parameters to tune the benchmarks: \(\alpha = 0.005\) gives the best performance. As suggested by the synthetic experiments, we set the window size to \(w=9\) and the truncation order to \(N=6\). The results in Table 2 demonstrate that our procedure yields the best results across both error metrics. Furthermore, in Figure 5, we compare demand predictions from our model and the best performing baseline from Table 2, focusing on a summer and a winter week in 2015. Our model follows the observed values more closely, particularly in winter when demand is more sensitive to temperature changes.

Table 2: Test RMSE and MAPE on real demand.
Model | RMSE (MW) | MAPE (%)
---|---|---
\(\mathrm{LR}\left( \overline{T}, \overline{T}^2 \right)\) | 6 518 | 10.7
\(\mathrm{LR}\left( T, T^2, \overline{T}, \overline{T}^2 \right)\) | 6 172 | 9.9
\(\mathrm{LR}\left( Y_{t-7\text{ days}} \right)\) | 4 149 | 5.3
\(\mathrm{LR}\left( T, T^2, \overline{T}, \overline{T}^2, Y_{t-7\text{ days}} \right)\) | 3 714 | 5.3
RidgeSig | 3 150 | 4.4
Figure 5: Observed and predicted demand for a winter (top) and summer (bottom) week in the test set.

6 CONCLUSION↩︎

We established theoretical foundations for a new time series forecasting framework that relies on sliding-window signatures, proving universality of approximation and stationarity of signatures in the discrete-time setting. We highlighted the importance of aligning paths at a common starting point and including appropriate timestamps as an additional coordinate. Incorporating these insights, we built a model that leverages signatures as non-parametric features for covariate time series. Our sequential algorithm exploits sliding window overlap for efficient signature computation. We demonstrated, both on synthetic and real electricity demand data, that signature features capture nonlinear and temporal dependencies without learned representations or expert knowledge. Our approach does require sensible choices of hyperparameters, particularly window size and target delay, to match the data’s temporal structure and granularity. A limitation lies in the rapid growth of signature size with truncation order and path dimension, though we demonstrated good performance at modest orders (e.g., \(N=6\)). Future directions include extending the framework to discrete-time signature kernels, and adapting our sequential algorithm to an online learning setting.

Acknowledgements↩︎

This research was supported by the European Union’s Horizon Europe research and innovation programme through the Marie Skłodowska-Curie Grant Agreement No.101081674.

The authors would like to thank Prof. Christa Cuchiero for valuable inputs during meetings in Vienna, and Adeline Fermanian for constructive discussions that helped improve this work.

7 THEORETICAL BACKGROUND↩︎

In this section, we present a more rigorous treatment of results concerning paths and path integration, tensor spaces, and signatures. To ensure completeness and readability, some definitions and results from the main body are restated.

7.1 Paths and path integration↩︎

We define a continuous path in \(\mathbb{R}^d\) as any continuous mapping from some interval \([s,t]\) to \(\mathbb{R}^d\), i.e. \(x : [s, t] \longrightarrow \mathbb{R}^d\), \(u \longmapsto x_u = \big(x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(d)}\big)\). We consider multi-dimensional paths, meaning \(d \geq 2\). Moreover, we restrict ourselves to working with paths of finite length, which we define through the notion of the \(1\)-variation.

Definition 1. Let \(x:[s,t] \longrightarrow \mathbb{R}^d\) be a continuous path. The 1-variation of \(x\) is defined by \[\|x\|_{\mathrm{1}-\mathrm{var}}=\sup _{\left(t_0, \ldots, t_k\right) \in P} \sum_{i=1}^k\left\|x_{t_i}-x_{t_{i-1}}\right\|,\] where \(P=\{(t_0, \ldots, t_k) \mid k \geq 0,\) \(s=t_0<\cdots<t_k=t\}\) denotes the set of all finite partitions of \([s, t]\).

Intuitively, \(\|x\|_{\mathrm{1}-\mathrm{var}}\) can be interpreted as the length of the path \(x\), therefore we work with paths such that \(\|x\|_{\mathrm{1}-\mathrm{var}} < \infty\). These paths are also referred to as being of bounded variation. This assumption will later prove important for theoretical guarantees, and it also allows us to define path integration in the \(\textit{Riemann-Stieltjes}\) sense. We refer to Chapters 2 and 3 in [6] for more details.

Definition 2. Let \(x, y\) be two continuous paths from \([s,t]\) to \(\mathbb{R}\), with \(x\) being of finite length. Let \(P^n=\left\{s=t_0^n<t_1^n<\cdots<t_{N_n}^n=t\right\}\) for \(n \geq 0\) be a sequence of partitions with vanishing mesh size \(|P^n|=\max_{1 \leq i \leq N_n}|t_i - t_{i-1}|\), meaning \(\left|P^n\right| \rightarrow 0\) as \(n \rightarrow \infty\). The Riemann-Stieltjes integral of \(y\) against \(x\), is defined as \[\int_s^t y_u \mathrm{~d} x_u:=\lim _{n \rightarrow \infty} \sum_{i=0}^{N_n-1} y_{t_i^n}\left(x_{t_{i+1}^n}-x_{t_i^n}\right),\] where the limit does not depend on the choice of the sequence of partitions \(\left(P^n\right)_{n \geq 0}\).

This definition can be generalized to the case when \(x\) and \(y\) are vector-valued; for \(x, y\) : \([s,t] \rightarrow \mathbb{R}^d\) we have \[\int_s^t y_u \mathrm{~d} x_u=\left(\begin{array}{c} \int_s^t y_u^{(1)} \mathrm{~d} x_u^{(1)} \\ \vdots \\ \int_s^t y_u^{(d)} \mathrm{~d} x_u^{(d)} \end{array}\right).\]

Remark 6. Take two continuous paths \(x:[s,t] \longrightarrow \mathbb{R}\) and \(y:[s,t] \longrightarrow \mathbb{R}\), where \(x\) is continuously differentiable. Then the \(\textit{Riemann-Stieltjes}\) integral of \(y\) against \(x\) comes down to \[\int_s^t y_u \mathrm{~d} x_u=\int_s^t y_u {x}^{'}_u \mathrm{~d} u,\] where the last integral is the classical Riemann integral and \({x}^{'}_u = \mathrm{~d}x_u/\mathrm{~d}u\) denotes differentiation with respect to a single variable.

Example 3. Consider the constant path \(y_u = 1\), \(u \in [s,t]\). It follows that the integral of \(y\) against any \(x:[s,t] \longrightarrow \mathbb{R}\) that is continuously differentiable is just the increment of \(x\) on \([s,t]\): \[\int_s^t \mathrm{~d} x_u = \int_s^t {x}^{'}_u \mathrm{~d} u = x_t - x_s.\]

7.2 Tensor spaces↩︎

We denote by \(\otimes\) the tensor product and by \((\mathbb{R}^d)^{\otimes k}\) the \(k\)-th tensor power of \(\mathbb{R}^d\), with \((\mathbb{R}^d)^{\otimes 0} := \mathbb{R}\). Let \(e_1, \ldots, e_d\) be the canonical basis of \(\mathbb{R}^d\), then \(\left(e_{i_1} \otimes \cdots \otimes e_{i_k}\right)_{1 \leq i_1, \ldots, i_k \leq d}\) is an orthonormal basis of \(\left(\mathbb{R}^d\right)^{\otimes k}\), meaning that any element \(a \in\left(\mathbb{R}^d\right)^{\otimes k}\) can be written as \(a=\sum_{1 \leq i_1, \ldots, i_k \leq d} a^{\left(i_1, \ldots, i_k\right)} e_{i_1} \otimes \cdots \otimes e_{i_k}\), where \(a^{\left(i_1, \ldots, i_k\right)} \in \mathbb{R}\).

Furthermore, \(\left(\mathbb{R}^d\right)^{\otimes k}\) is a Hilbert space of dimension \(d^k\), with the following scalar product and norm: \[\langle a, b\rangle_{\left(\mathbb{R}^d\right)^{\otimes k}}=\sum_{1 \leq i_1, \ldots, i_k \leq d} a^{\left(i_1, \ldots, i_k\right)} b^{\left(i_1, \ldots, i_k\right)}, \quad \|a\|_{\left(\mathbb{R}^d\right)^{\otimes k}} = \sqrt{\sum_{1 \leq i_1, \ldots, i_k \leq d} {\left( a^{\left(i_1, \ldots, i_k\right)} \right)} ^2} \quad .\]

Definition 3. We define the extended tensor algebra \(T\left(\left(\mathbb{R}^d\right)\right)\) as \[T\left(\left(\mathbb{R}^d\right)\right):=\left\{\left(a_0, a_1, \ldots, a_k,\ldots \right) \mid a_k \in\left(\mathbb{R}^d\right)^{\otimes k}, k \geq 0\right\}.\]

In other words, the extended tensor algebra is a set of infinite sequences of tensors of increasing order: \(a_0\) is a scalar, \(a_1\) a vector, \(a_2\) a matrix, \(a_3\) a “cube” (a third-order tensor), and so on. It can be shown that \(T\left(\left(\mathbb{R}^d\right)\right)\) is a non-commutative algebra under the tensor product \(\otimes\), with the neutral element \(\mathbf{1}=(1,0,0,\ldots)\). Furthermore, it holds that the subset \(T_1\left(\left(\mathbb{R}^d\right)\right):=\left\{\mathbf{a} \mid \mathbf{a} \in T\left(\left(\mathbb{R}^d\right)\right) \text{ with } a_0=1\right\}\) is a Lie group. Finally, the following subspace will be of special interest.

Definition 4. We denote by \(\mathscr{T}\left(\mathbb{R}^d\right)\) the space of square-summable elements of \(T\left(\left(\mathbb{R}^d\right)\right)\):

\[\mathscr{T}\left(\mathbb{R}^d\right)=\left\{\mathbf{a} \in T\left(\left(\mathbb{R}^d\right)\right) \mid \sum_{k=0}^{\infty}\left\|a_k\right\|_{\left(\mathbb{R}^d\right)^{\otimes k}}^2<\infty\right\}.\]

We can endow this space with the following scalar product and associated norm: for any \(\mathbf{a}, \mathbf{b} \in \mathscr{T}\left(\mathbb{R}^d\right)\), \[\langle \mathbf{a}, \mathbf{b}\rangle_{\mathscr{T}\left(\mathbb{R}^d\right)}=\sum_{k=0}^{\infty} \langle a_k, b_k\rangle_{\left(\mathbb{R}^d\right)^{\otimes k}}, \quad \|\mathbf{a}\|_{\mathscr{T}\left(\mathbb{R}^d\right)} = \sqrt{\sum_{k=0}^{\infty}\left\|a_k\right\|_{\left(\mathbb{R}^d\right)^{\otimes k}}^2} . \label{normhilbert}\tag{1}\]

Proposition 1 (Proposition 1). \(\left(\mathscr{T}\left(\mathbb{R}^d\right),\langle\cdot, \cdot\rangle_{\mathscr{T}\left(\mathbb{R}^d\right)}\right)\) is a separable Hilbert space.

Proof. (Completeness.) The detailed proof that \(\left(\mathscr{T}\left(\mathbb{R}^d\right), \langle\cdot, \cdot\rangle_{\mathscr{T}\left(\mathbb{R}^d\right)}\right)\) is a Hilbert space can be found in [9], Appendix A.

(Separability.) Let \(e_1, \ldots, e_d\) be the canonical basis of \(\mathbb{R}^d\), and \(B_k = \left\{ e_{i_1} \otimes \cdots \otimes e_{i_k} \mid 1 \leq i_1, \ldots, i_k \leq d \right\}\) denote the associated orthonormal basis of \(\left(\mathbb{R}^d\right)^{\otimes k}\) that has \(d^k\) elements. We now embed all of the elements of \(B_k, k \geq 0\) in the space \(\mathscr{T}\left(\mathbb{R}^d\right)\) by setting the other tensor sequence elements to zero, i.e. we define the sets \[\mathbf{B}_k = \left\{ \left(0, 0, \ldots, \overbrace{e_{i_1} \otimes \cdots \otimes e_{i_k} }^{k\text{-th position}}, 0, \ldots \right) \mid 1 \leq i_1, \ldots, i_k \leq d \right\},\] with \(\mathbf{B}_0 = \left\{ \left( 1, 0, 0, \ldots \right)\right\}\). Let \(\mathbf{B} = \bigcup_{k=0}^{\infty}\mathbf{B}_k\). The first thing to note is that \(\mathbf{B}\) is a countable union of finite sets and is therefore countable. Secondly, \(\mathbf{B}\) is a Schauder basis for \(\mathscr{T}\left(\mathbb{R}^d\right)\), meaning that every element \(\mathbf{a} \in \mathscr{T}\left(\mathbb{R}^d\right)\) can be uniquely represented as \(\mathbf{a}=\sum_{n=1}^{\infty} \alpha_n \mathbf{v}_n,\) \(\mathbf{v}_n \in \mathbf{B}\), where the convergence of the infinite sum is that of the topology induced by the norm (1). Indeed, we have that \(\mathbf{a}= \sum_{k=0}^{\infty} (0,0 \ldots, a_k, 0, \ldots)\) and \((0,0 \ldots, a_k, 0, \ldots) = \sum_{1 \leq i_1, \ldots, i_k \leq d} a^{\left(i_1, \ldots, i_k\right)} \left(0, 0, \ldots, e_{i_1} \otimes \cdots \otimes e_{i_k}, 0 \ldots \right)\), which gives us a unique representation of \(\mathbf{a}\) in terms of elements of \(\mathbf{B}\). Furthermore, from the fact that \(B_k\) is an orthonormal basis for \({\left(\mathbb{R}^d\right)}^{\otimes k}\) and the definition of the scalar product (1), it is easy to see that \(\mathbf{B}\) is also orthonormal, making it a countable orthonormal basis. Finally, a Hilbert space is separable if and only if it has a countable orthonormal basis, which leads to the conclusion that \(\left(\mathscr{T}\left(\mathbb{R}^d\right),\langle\cdot, \cdot\rangle_{\mathscr{T}\left(\mathbb{R}^d\right)}\right)\) is a separable Hilbert space. ◻

7.3 Signatures↩︎

We can now fully define signatures, in the light of the theory presented earlier.

7.3.1 Definitions↩︎

Definition 5. The signature of a finite-length path \(x\) on \([s, t]\) is defined as an infinite tensor sequence: \[S\left( x_{[s,t]} \right)=\left(1, \;S^1\left( x_{[s,t]} \right), \ldots, \;S^k\left( x_{[s,t]} \right), \;\ldots \right) \in T\left(\left(\mathbb{R}^d\right)\right),\] where the \(k\)-th element (called level) is given by \[{S^k \left( x_{[s,t]} \right)}=\idotsint \displaylimits_{s < u_1<\cdots<u_k < t} \mathrm{d} x_{u_1} \otimes \cdots \otimes \mathrm{d} x_{u_k} \in {\left(\mathbb{R}^d\right)}^{\otimes k}.\]

Using the orthonormal basis for \({\left(\mathbb{R}^d\right)}^{\otimes k}\), we can write \[{S^k \left( x_{[s,t]} \right)} = \sum_{\left(i_1, \ldots, i_k\right) \in \{1, \ldots, d\}^k} S^{\left(i_1, \ldots, i_k\right)}\left(x_{[s,t]}\right) e_{i_1} \otimes \cdots \otimes e_{i_k},\] where \(S^{\left(i_1, \ldots, i_k\right)}\left(x_{[s,t]}\right)\) is referred to as the signature coefficient of \(x\) along the multi-index \(\left(i_1, \ldots, i_k\right) \in \{1, \ldots, d\}^k, k \geq 1\) on \([s,t]\) and is given by \[S^{\left(i_1, \ldots, i_k\right)} = \idotsint \displaylimits_{s < u_1<\cdots<u_k < t} \mathrm{d}x^{\left(i_1\right)}_{u_1} \cdots \mathrm{d}x^{\left(i_k\right)}_{u_k}.\]

In practice, we do not work with infinite sequences of tensors. Firstly, we only consider a finite number of signature levels, defining the truncated signature as \[S^{\leq N}\left( x_{[s,t]} \right) = \left(1, \;S^1\left( x_{[s,t]} \right), \;\ldots, \;S^N\left( x_{[s,t]} \right) \right),\] where we refer to \(N\) as the truncation order. Secondly, in applications, we disregard the underlying tensor structure and consider the truncated signature as a collection of all signature coefficients with multi-index of length \(k \leq N\), arranged in a vector \[\left(1, S^{(1)}\left( x_{[s,t]} \right), \ldots, S^{(d)}\left( x_{[s,t]} \right), S^{(1,1)}\left( x_{[s,t]} \right), S^{(1,2)}\left( x_{[s,t]} \right), \ldots, S^{\overbrace{ \left( d,d, \ldots, d \right)}^{N \text{ times}}} \left( x_{[s,t]} \right) \right),\] that is of size \(s_d(N)= \sum_{k=0}^{N} d^k = \left(d^{N+1}-1\right) /(d-1)\).

We now illustrate the definitions on the example of a linear path, deriving a closed formula for signature coefficients in this simple case. This will prove an important building block in our methodology.

Example 4 (Example 1 revisited). Let \(x:[s,t] \rightarrow \mathbb{R}^d\), \(u \longmapsto \frac{x_{t} - x_s}{t-s}(u-s) + x_{s}\) be a \(d\)-dimensional linear path. For any \(\left(i_1, \ldots, i_k\right) \in\{1, \ldots, d\}^k\) we have the following: \[\begin{align} S^{\left(i_1, \ldots, i_k\right)}\left(x_{[s,t]}\right) &= \idotsint \displaylimits_{s < u_1<\cdots<u_k < t} \mathrm{~d} x_{u_1}^{(i_1)} \cdots \mathrm{~d} x_{u_k}^{(i_k)} = \idotsint \displaylimits_{s < u_1<\cdots<u_k < t} \frac{x_t^{(i_1)}-x_s^{(i_1)}}{t-s} \mathrm{~d} u_1 \ldots \frac{x_t^{(i_k)}-x_s^{(i_k)}}{t-s} \mathrm{~d} u_k \\ &=\prod_{j=1}^k \left(\frac{x_t^{(i_j)}-x_s^{(i_j)}}{t-s}\right) \cdot \idotsint \displaylimits_{s < u_1<\cdots<u_k < t} \mathrm{~d} u_1 \ldots \mathrm{~d} u_k \\ &= \prod_{j=1}^k \left(\frac{x_t^{(i_j)}-x_s^{(i_j)}}{t-s}\right) \cdot \int_s^t \int_s^{u_k} \ldots \left( \int_s^{u_2} \mathrm{~d} u_1 \right) \ldots \mathrm{~d} u_{k-1} \mathrm{~d} u_k \\ &= \prod_{j=1}^k \left(\frac{x_t^{(i_j)}-x_s^{(i_j)}}{t-s}\right) \cdot \int_s^t \int_s^{u_k} \ldots \left( \int_s^{u_3} \left(u_2 - s\right) \mathrm{~d} u_2 \right) \ldots \mathrm{~d} u_{k-1} \mathrm{~d} u_k \\ &= \prod_{j=1}^k \left(\frac{x_t^{(i_j)}-x_s^{(i_j)}}{t-s}\right) \cdot \int_s^t \int_s^{u_k} \ldots \left( \int_s^{u_4} \frac{{\left(u_3 - s\right)}^2}{2} \mathrm{~d} u_3 \right) \ldots \mathrm{~d} u_{k-1} \mathrm{~d} u_k \\ &= \ldots = \prod_{j=1}^k \left(\frac{x_t^{(i_j)}-x_s^{(i_j)}}{t-s}\right) \cdot \frac{{(t-s)}^k}{k!} =\frac{1}{k!} \prod_{j=1}^k\left(x_t^{(i_j)}-x_s^{(i_j)}\right) . \label{siglin} \end{align}\qquad{(1)}\]

More compactly written, \(S^k\left( x_{[s,t]} \right)=\frac{1}{k!}\left(x_t-x_s\right)^{\otimes k}\).

7.3.2 Properties↩︎

Proposition 4 (Invariances). Let \(x:[s,t] \mapsto \mathbb{R}^d\) be a path of finite length. The following holds:

  1. Let \(\widetilde{x}_u=x_{\psi(u)}\) be the smooth reparametrization of \(x\), where \(\psi:[s,t] \rightarrow[s,t]\) is a continuously differentiable non-decreasing surjection. Then, for any \([v,w] \subset [s,t]\) we have \(S\left(\widetilde{x}_{[v,w]} \right)=S\left(x_{\left[\psi(v), \psi(w)\right]}\right) .\)

  2. Let \(\overline{x}_u = x_u + a\) be a path obtained by translating \(x\) by some \(a \in \mathbb{R}^d\). Then \(S\left(\overline{x}_{[s,t]}\right) = S\left( x_{[s,t]} \right).\)

Both of these properties can be derived straight from the definition of signatures as iterated integrals. Invariance to reparametrization is a consequence of the change of variable formula in Riemann-Stieltjes integration, while invariance to translation follows from the fact that \(d\overline{x}_t = dx_t\).

Proposition 2 (Algebraic properties). Let \(x:[s, t] \mapsto \mathbb{R}^d\) and \(y:[t, u] \mapsto \mathbb{R}^d\) denote two paths of finite length.

  1. (Chen’s identity) Let \(x * y:[s, u] \mapsto \mathbb{R}^d\) be the concatenation of \(x\) and \(y\), meaning \((x * y)_v=x_v\) for \(v \in[s, t]\) and \((x * y)_v=x_t+y_v-y_t\) for \(v \in[t, u]\). Then \(S(\left(x * y\right)_{[s,u]})=S\left( x_{[s,t]} \right) \otimes S(y_{[t,u]})\)

  2. (Time reversal) We denote the time-reversal of \(x\) as the path \(\overleftarrow{x}:[s, t] \mapsto \mathbb{R}^d\) where \(\overleftarrow{x}(u)=x_{s+t-u}\). Then \(S\left( x_{[s,t]} \right) \otimes S\left( \overleftarrow{x}_{[s,t]} \right) = (1,0,0,\ldots).\)

We can rephrase Chen’s relation in terms of signature coefficients: for any multi-index \(\left(i_1, \ldots, i_k\right) \in \{1, \ldots, d\}^k\), it holds that \[S^{(i_1, \ldots, i_k)}\left((x * y)_{[s,u]}\right) = \sum_{\ell=0}^k S^{(i_1, \ldots, i_\ell)}\left(x_{[s,t]}\right) \cdot S^{(i_{\ell+1}, \ldots, i_k)}\left(y_{[t,u]}\right) \label{chencomp}.\tag{2}\]

This result is crucial for the operationalization of signature computation as it provides a recursive formula for the signature of a concatenation of paths. As noted earlier, it also enables an easy calculation in the case of piecewise linear paths, which we explain in greater detail in the following remark.

Remark 7.

Suppose that \(x\) is a piecewise linear path such that for \(s = u_1 < u_2 < \cdots < u_n = t\), \(x|_{[u_{\ell}, u_{\ell+1}]}\), \(1 \leq \ell \leq n-1\), is linear. In order to obtain its signature, we first need to calculate the signature coefficients for each linear segment, as in Example 4: \[{S^{(i_1, \ldots, i_k)}\left(x_{[u_{\ell},u_{\ell+1}]} \right)} = \frac{1}{k!} \prod_{j=1}^k\left(x_{u_{\ell+1}}^{(i_j)}-x_{u_{\ell}}^{(i_j)}\right). \label{siglin1}\qquad{(2)}\]

We proceed to calculate the signature on the whole time horizon \([s,t]\) by inductively concatenating the linear segments using Chen’s relation (2): \[\begin{align} {S^{(i_1, \ldots, i_k)}\left(x_{[s,u_{\ell+1}]} \right)} &= \sum_{m=0}^k {S^{(i_1, \ldots, i_m)}\left(x_{[s,u_\ell]} \right)} \cdot {S^{(i_{m+1}, \ldots, i_k)}\left(x_{[u_{\ell},u_{\ell+1}]} \right)} \nonumber \\ &= \sum_{m=0}^k \left( {S^{(i_1, \ldots, i_m)}\left(x_{[s,u_{\ell}]} \right)} \cdot \frac{1}{(k - m) !} \prod_{j=m + 1}^{k} \left(x_{u_{\ell+1}}^{(i_j)} - x_{u_\ell}^{(i_j)}\right) \right) \label{siglin2} . \end{align}\qquad{(3)}\]

Computing the signature of a piecewise linear path therefore reduces to applying the iterative formula stated above, requiring no numerical integration. This procedure is efficiently implemented in the Python package iisignature ([15]).
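For illustration, this recursion translates into a short, unoptimized Python implementation; multi-indices are 0-based tuples here, whereas the text uses 1-based indices, and iisignature performs the same computation far more efficiently:

```python
import itertools
import math
import numpy as np

def pw_linear_sig(points, N):
    """Coefficients {multi-index: value} of the truncated signature of the
    piecewise linear path through `points`, via the recursion above."""
    d = points.shape[1]
    indices = [idx for k in range(1, N + 1)
               for idx in itertools.product(range(d), repeat=k)]
    sig = {(): 1.0, **{idx: 0.0 for idx in indices}}   # empty path: (1, 0, ...)
    for p, q in zip(points[:-1], points[1:]):
        delta = q - p
        new = {(): 1.0}
        for idx in indices:
            k = len(idx)
            # Chen: sum over splits of the multi-index; the right factor is
            # the linear-segment coefficient prod(delta) / (k - m)!
            new[idx] = sum(
                sig[idx[:m]]
                * np.prod([delta[j] for j in idx[m:]])
                / math.factorial(k - m)
                for m in range(k + 1)
            )
        sig = new
    return sig
```

For instance, pw_linear_sig(points, 2)[(0, 1)] returns the coefficient \(S^{(1,2)}\) in the notation of the text.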

Going back to the time reversal property, we remark that, by definition, the first element of the signature is set to 1 and therefore \(S\left( x_{[s,t]} \right) \in T_1\left(\left(\mathbb{R}^d\right)\right)\). This result now states that the inverse of the signature of a path \(x\) in the tensor group \(T_1\left(\left(\mathbb{R}^d\right)\right)\) is exactly the signature of \(x\) traversed backwards in time.

8 PROOFS OF THEORETICAL CONTRIBUTIONS↩︎

8.1 Proof of Theorem 2↩︎

Theorem 1 (Universal approximation on sliding windows). Let \({(X_t)}_t\) be a \(d\)-dimensional time series such that Assumption 1 holds. Let \(w \in \mathbb{N}\) be a fixed window size and \(f : C\left([0,w],\mathbb{R}^{d+1}\right) \rightarrow \mathbb{R}\) a continuous function with respect to the uniform topology. It holds that for every \(\varepsilon>0\), there exists \(N \in \mathbb{N}, \theta \in \mathbb{R}^{s_{d+1}(N)}\), such that \[\sup_{t}\left\lvert f\left({X}_{[t-w, t]} - (0,X_{t-w})\right)- \theta^{\top} S^{\leq N} \left({X}_{[t-w, t]}\right) \right\rvert \leq \varepsilon.\]

Proof. We first introduce \(\overline{X}_{[t-w, t]} := {X}_{[t-w, t]} - (0,X_{t-w})\) as a random path obtained by linear interpolation of \(\left(0, 0\right), \left(1, X_{t-w+1} - X_{t-w}\right), \ldots, \left( w, X_{t} - X_{t-w} \right)\). In other words, we consider paths on windows translated by the starting point. We now define \(U = \left\{ \overline{X}_{[t-w, t]}(\omega) \mid \omega \in \Omega; t \in \mathbb{Z}\right\}\) as the set of all realisations of random paths on all sliding windows. We note that \(U\) is a set of piecewise linear functions whose first coordinate (corresponding to time) is strictly monotone and such that \(\bigl|\overline{X}^{(i)}_{[t-w, t]}(\omega)\bigr| \leq \max(2M,w) =: B\) for every coordinate \(i\). They are therefore Lipschitz-continuous with the same Lipschitz constant \(L := \max(4M, 1)\).

Let us define a set of paths \(K \subset C\left([0,w],\mathbb{R}^{d+1}\right)\) such that:

  1. every \(x \in K\) has at least one coordinate that is linear, i.e. there exists \(j \in \{1,2,\ldots, d\}\) with \(x^{(j)}_u = a_x u\), and there exists \(\varepsilon > 0\) such that \(a_x \geq \varepsilon\) for all \(x \in K\),

  2. every \(x\in K\) starts at the origin, i.e. \(x_0 = 0\in \mathbb{R}^{d+1}\),

  3. the paths are uniformly bounded by \(B\), i.e. \(\left|x^{(i)}_u \right| \leq B\) \(\forall i \in \{1,2,\ldots,d\}\), \(\forall x \in K\) ,

  4. the paths are Lipschitz with the same constant \(L\), i.e. \(\| x_v -x_u\|_{\infty} \leq L |v-u|\), \(\forall x \in K.\)

Functions in \(K\) share the same Lipschitz constant and are therefore uniformly equicontinuous. Furthermore, it can be shown that the set \(K\) is closed in \(C\left([0,w], \mathbb{R}^{d+1}\right)\). The Arzelà–Ascoli theorem then implies that \(K\) is compact in \(C\left([0,w], \mathbb{R}^{d+1}\right)\) in the uniform topology, which allows us to apply Theorem 1 to \(K\).

Leveraging Theorem 1 and the fact that \(U \subset K\), we obtain: \[\sup_{U}\left|f\left(\overline{X}_{[t-w, t]}(\omega)\right)- \theta^{\top} S^{\leq N} \left(\overline{X}_{[t-w, t]}\right) \right| \leq \sup_{K} \left| f\left({x}\right) - \theta^{\top} S^{\leq N} \left({x}_{[0,w]}\right) \right| \leq \varepsilon .\] The desired result follows from the identity \(S^{\leq N} \left(\overline{X}_{[t-w, t]}\right)=S^{\leq N} \left({X}_{[t-w, t]}\right)\), which is a consequence of the translation invariance of signatures; see Proposition 4, 2. ◻

8.2 Proof of Theorem 3↩︎

We present a more formal statement of the result together with its proof; a brief numerical sanity check is sketched after the proof.

Theorem 2. Let \({(X_t)}_{t \ge 1}\) be a discrete \(d\)-dimensional time series such that Assumption 2 is satisfied. The following statements hold:

  1. For any multi-index \(\left(i_1, \ldots, i_k\right)\) of any length \(k \in \mathbb{N}\), the time series \({\left({S^{(i_1, \ldots, i_k)}(X_{[t-w,t]})}\right)}_t\) is strictly stationary.

  2. For any truncation order \(N \in \mathbb{N}\), the time series \({\left({S^{\leq N}\left(X_{[t-w,t]}\right)}\right)}_t\) is strictly stationary.

  3. The time series \({\left({S\left(X_{[t-w,t]}\right)}\right)}_t\) is strictly stationary.

Proof.

  1. For any multi-index \(I=(i_1, \ldots, i_k)\) of length \(k \in \mathbb{N}\), we denote by \(f^{I}\) the function from \(\mathbb{R}^{w \times d}\) to \(\mathbb{R}\) defined by: \[(X_{t-w+1} - X_{t-w},\ldots, X_{t} - X_{t-1}) \overset{f^I}{\longrightarrow} {S^{I}(X_{[t-w,t]})} \,.\]

    Examining the iterative formula (3), we see that \(f^I\), which yields the signature component \({S^{I}(X_{[t-w,t]})}\), is a deterministic, measurable function (a composition of measurable operations: products, sums and powers) of the increments \((X_{t-w+1} - X_{t-w}, X_{t-w+2} - X_{t-w+1}, \ldots, X_{t} - X_{t-1})\) of the original time series \({(X_t)}_{t}\), and that it does not depend on \(t\). We also note that if the \(d\)-dimensional time series \(\left( X_t - X_{t-1}\right)_t\) is strictly stationary, then it follows directly from the definition of strict stationarity that the \((d \times w)\)-dimensional time series \({\left(X_{t-w+1} - X_{t-w}, \ldots, X_{t} - X_{t-1}\right)}_t\) is also strictly stationary. Recalling that for any deterministic, measurable function \(g\) and any random vectors \(\mathbf{X}, \mathbf{Y}\) such that \(\mathbf{X} \overset{D}{=} \mathbf{Y}\) we have \(g(\mathbf{X}) \overset{D}{=} g(\mathbf{Y})\), and applying this to \(f^I\), we obtain the strict stationarity of \({\left({S^{I}(X_{[t-w,t]})}\right)}_t\).

  2. Let us now fix \(N\) and consider the \(s_d(N)\)-dimensional time series of the signature truncated at level \(N\), \({\left({S^{\leq N}\left(X_{[t-w,t]}\right)}\right)}_t\). Let \(I_1, I_2, \ldots, I_{s_d(N)}\) be the multi-indices of the components of the truncated signature, in lexicographic order. We now have: \[\begin{align} f^{\leq N} \left( X_{t-w+1} - X_{t-w}, \ldots, X_{t} - X_{t-1} \right) & := \left( f^{I_1}, \ldots, f^{I_{s_d(N)}} \right) ( X_{t-w+1} - X_{t-w}, \ldots, X_{t} - X_{t-1} ) \\ &= \left( {S^{I_1}\left(X_{[t-w,t]}\right)}, \ldots, S^{I_{s_d(N)}}\left(X_{[t-w,t]}\right) \right) \\ &= {S^{\leq N}\left(X_{[t-w,t]}\right)} , \end{align}\] where \(f^{\leq N} = \left( f^{I_1}, \ldots, f^{I_{s_d(N)}} \right) : \mathbb{R}^{w \times d} \longrightarrow \mathbb{R}^{s_d(N)}\) is a deterministic, measurable function of the increments that does not depend on \(t\), as noted above. Following the same arguments as before, we conclude that \({\left({S^{\leq N}\left(X_{[t-w,t]}\right)}\right)}_t\) is strictly stationary.

  3. The previous argument holds not only for the time series of signatures truncated at level \(N\), but for the time series of any finite collection of signature components indexed by multi-indices \(\left( I_1, \ldots, I_j \right)\). Additionally, as discussed in Remark 3, the finite-dimensional distributions of the signature components uniquely determine the law of the entire infinite-dimensional time series of signatures. Combining these two arguments, it follows that \({\left({S \left(X_{[t-w,t]}\right)}\right)}_t\) is strictly stationary.

 ◻
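As a quick numerical sanity check of Theorem 2, one can simulate a series with strictly stationary increments, compute the sliding-window signatures at two different window positions over many replications, and verify that their empirical moments agree up to Monte Carlo error. A minimal sketch, with our variable names and an arbitrary choice of window size, truncation depth and replication count:

```python
import numpy as np
import iisignature

rng = np.random.default_rng(0)
w, depth, n_rep = 10, 2, 20000

def window_signature(t, T=60):
    """Signature of the time-augmented, translated window [t-w, t]
    of a fresh Gaussian random walk (i.i.d. increments)."""
    X = np.cumsum(rng.normal(size=T))
    seg = X[t - w:t + 1] - X[t - w]
    path = np.column_stack([np.arange(w + 1, dtype=float), seg])
    return iisignature.sig(path, depth)

# Empirical means of the signature coefficients at two window positions;
# by stationarity they should coincide up to Monte Carlo error.
m_early = np.mean([window_signature(15) for _ in range(n_rep)], axis=0)
m_late = np.mean([window_signature(50) for _ in range(n_rep)], axis=0)
print(np.max(np.abs(m_early - m_late)))
```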

9 IMPLEMENTATION DETAILS↩︎

Our method consists of efficiently computing signatures on sliding windows and fitting a ridge regression model that uses the covariate signatures as predictors and the increments of the target series as the response. Hyperparameters are tuned on a validation set. The full procedure is detailed in the pseudocode below.

Figure 6: Fitting a ridge regression on signatures
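Since the pseudocode figure does not reproduce here, the following minimal Python sketch outlines the fitting step, using iisignature for the window signatures and scikit-learn's Ridge for the regression. The variable names (x, y, w, D, depth, lam) and the choice fit_intercept=False are ours, not part of the released code; the sketch assumes \(w \geq D\).

```python
import numpy as np
import iisignature
from sklearn.linear_model import Ridge

def sliding_signatures(x, w, depth):
    """Signatures of the time-augmented covariate path on each window
    [t-w, t], translated to start at the origin; x has shape (T, d)."""
    time = np.arange(w + 1, dtype=float).reshape(-1, 1)
    feats = [iisignature.sig(np.hstack([time, x[t - w:t + 1] - x[t - w]]), depth)
             for t in range(w, len(x))]
    return np.array(feats)

def fit_signature_ridge(x, y, w, D, depth, lam):
    """Ridge regression of the D-step target increments on the
    sliding-window covariate signatures."""
    S = sliding_signatures(x, w, depth)      # rows correspond to t = w, ..., T-1
    dY = y[w:] - y[w - D:len(y) - D]         # increments y_t - y_{t-D}
    return Ridge(alpha=lam, fit_intercept=False).fit(S, dY)
```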

Finally, to obtain the forecast \(\widehat{Y}_t\) for \(t\) outside the train-validation set, it suffices to update the signature as in Algorithm 2 and use the estimated ridge regression coefficients \(\widehat{\theta}\) to compute the forecast \(\widehat{Y}_t = y_{t-D} + \widehat{\theta}^\top S(x_{[t-w,t]})\).

We note that this sequential procedure can easily be transferred to an online learning setting.
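In that setting, a single sliding-window update can reuse the helpers linear_signature and chen_concat from the sketch following Remark 7. By Chen's relation, \(S(x_{[a,b]})\) factors as the signature of the oldest segment times the signature of the rest, so the oldest segment is removed by concatenating its time-reversal inverse (the signature of the negated increment) on the left, and the newest segment is appended on the right. A sketch, with increments understood to include the unit time step of the time-augmented path:

```python
def slide_window_signature(sig, oldest_increment, newest_increment, depth):
    """One sliding-window update of a truncated signature.

    Removes the oldest linear segment by left-concatenating its
    time-reversal inverse, then appends the newest segment on the right.
    """
    inverse_oldest = linear_signature(-oldest_increment, depth)
    sig = chen_concat(inverse_oldest, sig, depth)
    return chen_concat(sig, linear_signature(newest_increment, depth), depth)
```

Each update costs a fixed number of truncated tensor operations, independent of the window length \(w\).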

10 EXPERIMENTAL DETAILS↩︎

Technical details. The data and code used in this paper are publicly available in our GitHub repository: https://github.com/ninadrobac/slidesig. All experiments were conducted on a standard laptop (MacBook Air, Apple M2 chip, 16 GB RAM), demonstrating that our implementation can be executed efficiently on standard hardware.

Real data. We use half-hourly electricity demand data from RTE (France’s electricity transmission system operator), aggregated on the national level, and temperature observations from Météo France recorded on a three-hour grid. The temperature series on the national level is obtained by averaging across weather stations and linearly interpolating to match the demand frequency. This yields a unified dataset with 48 observations per day from January 1, 2012, to December 30, 2015, resulting in 70128 data points of temperature and demand. This period is selected to avoid the demand fluctuations associated with the COVID-19 pandemic and the 2022 energy crisis. We denote by \(Y_t\) the electricity demand and by \(T_t\) the temperature at any time \(t \geq 1\).

Synthetic data. Electricity consumption depends on many factors, including calendar variables such as weekdays and public holidays. Since our goal is to isolate the role of temperature, we construct a synthetic dataset in which both the strength of past dependency and the functional form of the relationship can be controlled. To model memory effects, we introduce the exponentially smoothed temperature \[\left\{ \begin{array}{ll} \overline{T}_1^\alpha = T_1 \\ \overline{T}_t^\alpha = (1-\alpha)\overline{T}_{t-1}^\alpha + \alpha T_t \,, \,\textrm{ for any } t \geq 2 \,, \end{array} \right.\] where the smoothing parameter \(\alpha \in (0,1]\) controls the influence of past values: the smaller \(\alpha\), the stronger the influence of the past. We then fit a linear regression model with observed consumption \(Y_t\) as the target variable and \(\overline{T}_t^\alpha\) and \(\big(\overline{T}_t^\alpha\big)^2\) as covariates: \[\widehat{Y}^\alpha_t = \widehat{\theta_1} \overline{T}_t^\alpha + \widehat{\theta_2} \big(\overline{T}_t^\alpha\big)^2, \label{eq:lin95model95synthetic}\tag{3}\]

thereby imposing a quadratic dependence of simulated demand on smoothed temperature. The final synthetic demand time series \(\big(\widetilde{Y}^\alpha_t\big)_t\) is obtained by adding normally distributed noise to the fitted values \(\widehat{Y}^\alpha_t\): \[\widetilde{Y}^\alpha_t = \widehat{\theta_1} \overline{T}_t^\alpha + \widehat{\theta_2} \big(\overline{T}_t^\alpha\big)^2 + \varepsilon_t, \quad \varepsilon_t \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,\sigma^2)\,. \label{eq:model95synthetic}\tag{4}\]
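A minimal sketch of this generation procedure, with our function and variable names, and an ordinary least squares fit standing in for the linear regression of Model 3:

```python
import numpy as np

def exp_smooth(temp, alpha):
    """Exponentially smoothed temperature, as defined above."""
    out = np.empty(len(temp))
    out[0] = temp[0]
    for t in range(1, len(temp)):
        out[t] = (1 - alpha) * out[t - 1] + alpha * temp[t]
    return out

def synthetic_demand(temp, demand, alpha, sigma, seed=0):
    """Fit Model 3 by least squares and add Gaussian noise as in Model 4."""
    Tbar = exp_smooth(temp, alpha)
    X = np.column_stack([Tbar, Tbar ** 2])    # no intercept, as in Model 3
    theta, *_ = np.linalg.lstsq(X, demand, rcond=None)
    rng = np.random.default_rng(seed)         # fixed seed for reproducibility
    return X @ theta + rng.normal(0.0, sigma, size=len(temp))
```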

In all experiments presented, we fix \(\sigma\) to introduce a moderate level of noise (\(\sigma=1000\) for the first and \(\sigma=500\) for the second and third synthetic data experiments) and we set a fixed random seed when simulating the noise to ensure reproducibility. The synthetic demand series closely follows the main patterns of the observed series, effectively capturing the yearly variations driven by temperature and providing a realistic proxy for evaluation. However, as illustrated in Figure 7, Model 4 does not reproduce finer weekly patterns that depend on calendar variables such as the day of the week or the time of day, as it only contains temperature information. This is reflected in our choice of \(D\): we fix \(D=2\) for experiments on synthetic data and \(D=7\) for experiments on real data, to account for weekly patterns. Although we could have adapted Model 4 by adding daily seasonality to better match the data locally, the synthetic setting is a proof of concept, so we opted for the simplest possible framework.

We note that by varying the smoothing parameter \(\alpha\), we can generate different versions of the dataset in which past values exert more or less influence on the simulated demand. Among several tested values, we mainly focus on \(\alpha = 0.005\), as it offers the best fit of Model 4 to the real data.

Figure 7: Comparison of observed and simulated electrical demand. Left: full four-year dataset showing that the synthetic series captures yearly patterns driven by temperature. Right: zoom-in on a weekly segment (Feb 11–17, 2013) illustrating that weekly fluctuations driven by calendar variables are not retained in synthetic demand.

Assumptions in practice. As discussed earlier, Assumption 1 is satisfied since temperature values in the dataset remain within natural bounds (between \(-10\,^{\circ}\mathrm{C}\) and \(35\,^{\circ}\mathrm{C}\)). In contrast, verifying Assumption 2 is less straightforward, as it concerns the distributional properties of temperature increments. To assess this, we apply two standard statistical tests, the Augmented Dickey–Fuller (ADF) and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test; both indicate that the time series of half-hourly temperature increments is weakly stationary at a significance level of \(0.05\). The same analysis was performed for weekly demand increments and two-day increments of simulated demand, yielding satisfactory results.
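A check along these lines can be run with statsmodels. A sketch, where the helper name is ours and the test options are library defaults (the exact options used in the paper are not specified); for example, stationarity_checks(demand, lag=48 * 7) tests weekly increments of the half-hourly demand series:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

def stationarity_checks(series, lag=1, alpha=0.05):
    """ADF (null: unit root) and KPSS (null: level stationarity)
    applied to the lag-`lag` increments of `series`."""
    x = np.asarray(series, dtype=float)
    increments = x[lag:] - x[:-lag]
    adf_p = adfuller(increments)[1]
    kpss_p = kpss(increments, regression="c", nlags="auto")[1]
    return {"ADF rejects unit root": adf_p < alpha,
            "KPSS consistent with stationarity": kpss_p >= alpha}
```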

References↩︎

[1]
Antoniadis, A., Cugliari, J., Fasiolo, M., Goude, Y., and Poggi, J.-M. (2024). Statistical Learning Tools for Electricity Load Forecasting. Springer.
[2]
Goude, Y., Nedellec, R., and Kong, N. (2013). Local short and middle term electricity load forecasting with semi-parametric additive models. IEEE Transactions on Smart Grid, 5(1):440–446.
[3]
Nedellec, R., Cugliari, J., and Goude, Y. (2014). GEFCom2012: Electric load forecasting and backcasting with semi-parametric models. International Journal of Forecasting, 30(2):375–381.
[4]
Chen, K.-T. (1958). Integration of paths—a faithful representation of paths by non-commutative formal power series. Transactions of the American Mathematical Society, 89:395–407.
[5]
Lyons, T., Caruana, M., and Lévy, T. (2007). Differential Equations Driven by Rough Paths, volume 1908 of Lecture Notes in Mathematics. Springer.
[6]
Friz, P. K. and Victoir, N. B. (2010). Multidimensional Stochastic Processes as Rough Paths: Theory and Applications, volume 120 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge.
[7]
Chevyrev, I. and Kormilitzin, A. (2025). A primer on the signature method in machine learning.
[8]
Fermanian, A. (2021b). Learning time-dependent data with the signature transform. Theses, Sorbonne Université.
[9]
Fermanian, A., Marion, P., Vert, J.-P., and Biau, G. (2021). Framing RNN as a kernel method: A neural ODE approach.
[10]
Moreno-Pino, F., Arroyo, Á., Waldon, H., Dong, X., and Cartea, Á. (2025). Rough transformers: Lightweight and continuous time series modelling through signature patching.
[11]
Jaber, E. A. and Sotnikov, D. (2025). Exponentially fading memory signature.
[12]
Cohen, S. N., Lui, S., Malpass, W., Mantoan, G., Nesheim, L., de Paula, A., Reeves, A., Scott, C., Small, E., and Yang, L. (2023). Nowcasting with signature methods. arXiv preprint arXiv:2305.10256.
[13]
Lyons, T. (2014). Rough paths, signatures and the modelling of functions on streams.
[14]
Fermanian, A. (2021a). Embedding and learning with signatures. Computational Statistics & Data Analysis, 157:107148.
[15]
Reizenstein, J. and Graham, B. (2020). Algorithm 1004: The iisignature library: Efficient calculation of iterated-integral signatures and log signatures. ACM Transactions on Mathematical Software (TOMS).
[16]
Fermanian, A. (2022). Functional linear regression with truncated signatures. Journal of Multivariate Analysis, 192:105031.
[17]
Morrill, J., Fermanian, A., Kidger, P., and Lyons, T. (2021). A generalised signature method for multivariate time series feature extraction.
[18]
Bosq, D. (2000). Linear Processes in Function Spaces: Theory and Applications, volume 149 of Lecture Notes in Statistics. Springer.
[19]
Guo, X., Wang, B., Zhang, R., and Zhao, C. (2024). On consistency of signature using lasso.
[20]
De Vilmarest, J. and Goude, Y. (2022). State-space models for online post-covid electricity load forecasting competition. IEEE Open Access Journal of Power and Energy, 9:192–201.

  1. https://www.rte-france.com/eco2mix↩︎

  2. https://donneespubliques.meteofrance.fr↩︎