September 09, 2025
This paper presents a kernel-based framework for physics-informed nonlinear system identification. The key contribution is a structured methodology that extends kernel-based techniques to seamlessly integrate partially known physics-based models, improving parameter estimation and overall model accuracy. The proposed method enhances traditional modeling approaches by integrating a parametric model, which provides physical interpretability, with a kernel-based function, which accounts for unmodeled dynamics. The two model components are identified from data simultaneously, by minimizing a suitable cost that balances the relative importance of the physical and the black-box parts of the model. Additionally, nonlinear state smoothing is employed to address scenarios involving state-space models with not fully measurable states. Numerical simulations on an experimental benchmark system demonstrate the effectiveness of the proposed approach, with performance comparisons against state-of-the-art identification techniques.
Nonlinear systems identification, Grey-box modeling, Estimation, Kernel methods.
Nonlinear system identification plays a crucial role in engineering, aiming to construct models that accurately represent the complex dynamics of physical systems using measured data [1]. Traditional approaches often rely on parametric models derived from physical principles (also referred to as off-white models) [2] or more flexible representations (black-box models), such as neural networks [3] or combinations of nonlinear basis functions [4]. Recently, the combination of these two approaches has been proposed to tackle the challenge of compensating for unmodeled dynamics in physical models that arise in real-world applications (see, e.g., [5], [6]).
Within this context, some identification approaches follow a two-step process: first, they estimate the physical parameters while assuming no unmodeled dynamics, and then they introduce corrections to model the resulting discrepancy [7]–[9]. This strategy inevitably produces biased physical parameter estimates, which must be handled a posteriori by compensating for them via the black-box component of the model. This term, therefore, must not only account for modeling errors, but also compensate for the bias introduced in the parametric identification phase. An alternative perspective is presented in, e.g., [6], [10]. By modeling unmodeled dynamics explicitly from the beginning and estimating both the physical parameters and correction terms simultaneously, the interference between the two is minimized, leading to a more accurate and reliable identification process that prevents biased parameter estimates. In this context, sparsification is essential to ensure that corrections remain interpretable and do not overshadow the underlying physical model. However, despite their effectiveness in identifying nonlinear models, a key limitation of these methods is the need for a careful selection of appropriate basis functions that can adequately capture the underlying unmodeled system dynamics [11].
Kernel methods [11], [12] are a class of nonparametric machine learning techniques that provide a powerful framework for overcoming these challenges by enabling the construction of regularized models directly from data, without the need for explicitly defining basis functions. These methods are widely used in the context of input-output system identification (see, e.g., [13], [14]), where the relationship between inputs and outputs is learned directly from measured data. However, conventional kernel approaches do not incorporate physical system knowledge, which can be crucial for developing interpretable and reliable models with improved generalization capabilities. In this work, we aim to fill this gap by proposing a novel identification framework that integrates kernel methods with available physics-based models. The approach leverages kernel-based function approximation to systematically compensate for unmodeled dynamics, while preserving the interpretability of the physical part of the model. Unlike traditional methods that rely on predefined, often heuristically chosen, basis function dictionaries, the proposed formulation adapts directly to the data. Thus, by leveraging the representer theorem [15], the proposed framework provides a regularized, data-driven functional approximation mechanism that eliminates the need for manual selection of the basis functions while preserving both interpretability and accuracy through the embedded physical structure.
Moreover, to encompass a broader class of dynamical systems [16], we rely on a more general state-space setting, extending the proposed method beyond traditional input-output identification models (e.g., NARX). In fact, many physical systems are better described by state-space models, which explicitly capture system dynamics over time [11]. This formulation also unifies various prediction models, including nonlinear output error, ARMAX, and ARX models [16], and serves as a foundation for numerous controller and observer design methods, making it particularly valuable for system identification. Unlike static regression models, a state-space formulation accounts for the evolution of hidden states, requiring estimating both system parameters and unmeasured state trajectories. One common approach to this challenge is multi-step identification, where unmeasured states are recursively estimated through repeated model evaluation (see, e.g., [6], [17], [18]). While this strategy naturally estimates latent states by iterating the estimation model, it often makes the optimization challenging due to its strong nonlinear parameter dependencies. Moreover, extending the kernel-based framework to multi-step settings introduces further complexity. To circumvent these issues, we adopt an alternative strategy inspired by [19], [20], in which prior state estimates, derived from available data, are used within prediction-based state-space optimization problems. To this end, we combine an unscented Kalman filter (UKF) [21] with an unscented Rauch–Tung–Striebel smoother (URTSS) [22], [23] to reconstruct the hidden state trajectories, enabling kernel-based model integration in a state-space setting.
The effectiveness of the proposed approach is validated through an experimental benchmark using real data, showcasing its advantages over state-of-the-art techniques in both predictive accuracy and simulation performance.
The remainder of the paper is structured as follows. In Section 2, we provide an introduction to nonlinear function estimation using kernel theory. The main contribution of this paper, i.e., integrating the parameterized physical models with data-driven kernels to account for unmodeled dynamics, is presented in Section 3. Then, in Section 4 we extend the kernel-based framework to state-space systems via nonlinear state smoothing, enabling its application to a broader class of systems. Finally, numerical results on the experimental benchmark are discussed in Section 5 and main conclusions are drawn in Section 6.
In this section, we provide a brief introduction to nonlinear function approximation using kernels, which represents the core foundation for the main results presented in this paper. First, we introduce two definitions related to kernels.
Definition 1 (positive definite kernel [15]). Let \(\mathcal{X}\) be a nonempty set. A real-valued, continuous, symmetric function \(\kappa: \mathcal{X}\times\mathcal{X} \to \mathbb{R}\) is a positive-definite kernel (on \(\mathcal{X}\)) if \(\sum_{i=1}^n\sum_{j=1}^n c_ic_j\kappa(x_i,x_j)\ge0\) holds for all \(n\in\mathbb{N}\), \(x_1,\dots,x_n\in\mathcal{X}\), \(c_1,\dots,c_n\in\mathbb{R}\).
Definition 2 (reproducing kernel Hilbert space [24], [25]). Let \(\mathcal{H}\) be a Hilbert space of real-valued functions defined on a nonempty set \(\mathcal{X}\), with inner product \(\langle\cdot,\cdot\rangle_\mathcal{H}\). A function \(\kappa: \mathcal{X}\times\mathcal{X} \to \mathbb{R}\) is a reproducing kernel of \(\mathcal{H}\), and \(\mathcal{H}\) is a reproducing kernel Hilbert space (RKHS) on \(\mathcal{X}\) if the following conditions hold:
a. For any \(x\in\mathcal{X}\), \(\kappa(\cdot, x)\in\mathcal{H}\).
b. The reproducing property holds, i.e., for any \(x\in\mathcal{X}\) and \(h\in\mathcal{H}\), \(\langle h(\,\cdot\,),\kappa(\cdot,x)\rangle_\mathcal{H}=h(x)\).
From the reproducing property in Definition 2.b, we observe that the value of \(h\) at \(x\) can be represented as an inner product in the feature space. Hence, applying this property to the kernel \(\kappa\), for any \(x, x^{\prime} \in \mathcal{X}\), we have that \(\kappa(x,x^\prime) = \langle \kappa(\cdot,x),\kappa(\cdot,x^\prime)\rangle_\mathcal{H}\). Moreover, according to the Moore–Aronszajn theorem [24], we know that every positive definite kernel \(\kappa\) uniquely defines an RKHS \(\mathcal{H}\) in which \(\kappa\) serves as the reproducing kernel. Conversely, each RKHS is associated with a unique positive definite kernel. Common choices for the kernel function \(\kappa(x, x^{\prime})\) include the Gaussian (radial basis function) kernel \(\kappa(x, x^{\prime}) = \exp(-\|x - x^{\prime}\|^2 / 2\sigma^2)\), the Laplacian kernel \(\kappa(x, x^{\prime}) = \exp(-\|x - x^{\prime}\| / \sigma)\), the polynomial kernel \(\kappa(x, x^{\prime}) = (x^\top x^{\prime} + c)^d\), and the linear kernel \(\kappa(x, x^{\prime}) = x^\top P x^{\prime}\), where \(\sigma\), \(c\), \(d\), and \(P\succeq0\) are hyperparameters.
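For illustration, a minimal numpy sketch of these kernel choices and of the associated kernel (Gram) matrix is reported below; the function names and default hyperparameters are illustrative and not part of the methodology.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # Gaussian (RBF) kernel: exp(-||x - x'||^2 / (2*sigma^2))
    return np.exp(-np.sum((np.atleast_1d(x) - np.atleast_1d(xp)) ** 2) / (2.0 * sigma ** 2))

def laplacian_kernel(x, xp, sigma=1.0):
    # Laplacian kernel: exp(-||x - x'|| / sigma)
    return np.exp(-np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(xp)) / sigma)

def polynomial_kernel(x, xp, c=1.0, d=2):
    # Polynomial kernel: (x^T x' + c)^d
    return (np.dot(np.atleast_1d(x), np.atleast_1d(xp)) + c) ** d

def kernel_matrix(X, kernel):
    # Gram matrix K with K_ij = kappa(x_i, x_j); X stores the T inputs row-wise
    T = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(T)] for i in range(T)])
```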
Given a kernel \(\kappa\), we next consider a nonlinear input-output relation given by an unknown nonlinear function \(g:\mathcal{X}\to\mathbb{R}\) \[y = g(x) + e, \label{eqn:system}\tag{1}\] where \(x\in \mathcal{X}\), \(y\in \mathbb{R}\) are the input and output, respectively, \(g\) is assumed to belong to the native RKHS \(\mathcal{H}\) associated with the given kernel \(\kappa\), and \(e \in \mathbb{R}\) is an error term which represents measurement noise, as well as possible structural errors on \(g\). Let \(\mathcal{D}=\{(x_1,y_1),\dots,(x_T,y_T)\}\) be a given sequence of \(T\) input-output data pairs, collected from the system 1 . The goal is to find an estimate \(\hat{g}\) of function \(g\), accurately representing the observed data while ensuring that, for any new pair of data \((x,y)\), the predicted value \(\hat{g}(x)\) remains close to \(y\).
A standard approach to estimate \(g\) using the dataset \(\mathcal{D}\) involves minimizing a loss function that combines a quadratic data-fit term (i.e., the prediction error) with a regularization term. Hence, the unknown function \(g\) can be estimated by solving the following optimization problem \[\hat{g} = \arg \min_{g \in \mathcal{H}} \sum_{t=1}^{T} (y_t - g(x_t))^2 + \gamma\|g\|^2_{\mathcal{H}}, \label{eqn:optprob46init}\tag{2}\] where \(\gamma>0\) is a trade-off weight that balances data-fit and regularization, and \(\|g\|_{\mathcal{H}}=\sqrt{\langle g,g\rangle_\mathcal{H}}\) is the norm in \(\mathcal{H}\), induced by the inner product \(\langle \cdot,\cdot\rangle_\mathcal{H}\). Then, the representer theorem [15] guarantees the uniqueness of the solution to 2 , expressed as a sum of \(T\) basis functions determined by the kernel, with their contributions weighted by coefficients obtained through the solution of a system of linear equations (see also, e.g., [25]). In particular, given a positive-definite, real-valued kernel \(\kappa: \mathcal{X}\times\mathcal{X}\to \mathbb{R}\), and defining the kernel matrix \(\mathbf{K} \in \mathbb{R}^{T\times T}\) with the \((i,j)\) entry defined as \(\mathbf{K}_{ij} \doteq \kappa(x_i,x_j)\), and \(Y \doteq [y_1,\dots,y_T]^\top\), the application of the representer theorem yields \(\hat{g}\) in closed form as follows \[\hat{g}(x) = \sum^T_{j=1} \omega_j \kappa(x, x_j), \, \forall x, \label{eqn:reprthm46standard}\tag{3}\] with the weight vector \(\omega=[\omega_1,\dots,\omega_T]^\top\) given by \[\omega=(\mathbf{K}+\gamma I_T)^{-1}Y.\]
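As a reference implementation, the closed-form solution 3 amounts to a single linear solve; the sketch below assumes a kernel function as in the previous snippet and uses a linear solver rather than an explicit matrix inverse.

```python
import numpy as np

def fit_kernel_model(X, Y, kernel, gamma):
    """Solve 2 in closed form: omega = (K + gamma*I)^{-1} Y, cf. 3."""
    T = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(T)] for i in range(T)])
    omega = np.linalg.solve(K + gamma * np.eye(T), np.asarray(Y, dtype=float))
    return omega, K

def g_hat(x, X, omega, kernel):
    """Evaluate the estimate g_hat(x) = sum_j omega_j * kappa(x, x_j)."""
    return float(sum(w * kernel(x, xj) for w, xj in zip(omega, X)))
```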
The result of this theorem is relevant as it demonstrates that a broad class of learning problems admits solutions that can be expressed as expansions of the training data. Building on this result, the next section explores how the representer theorem extends to the problem of integrating parameterized physical models with data-driven kernels to account for unmodeled dynamics and to estimate interpretable parameters.
Let us now consider a nonlinear map of the form \[y = f(x, \bar \theta) + \Delta(x) + e, \label{eqn:system46comp}\tag{4}\] where \(f:\mathcal{X}\times \Theta\to\mathbb{R}\), parametrized in \(\bar \theta\in\Theta\subseteq\mathbb{R}^{n_\theta}\), is a known function derived, e.g., from physical principles, whereas \(\Delta:\mathcal{X}\to\mathbb{R}\) is an unknown term representing, e.g., modeling errors, uncertainties, or dynamic perturbations. Let a set of \(T\) input-output data \(\mathcal{D}=\{(x_1,y_1),\dots,(x_T,y_T)\}\) be given, collected from a realization of 4 with true parameters \(\bar \theta\in\Theta\). The goal is to find an estimate \(\theta^\star\) of \(\bar\theta\), and a black-box approximation \(\delta:\mathcal{X}\to\mathbb{R}\) of \(\Delta(x)\). Such an estimate and approximation can be found by solving the following optimization problem: \[(\theta^\star, \delta^\star) = \arg \min_{\theta\in\Theta,\delta \in \mathcal{H}} \sum_{t=1}^{T} \left[y_t - \hat{y}_t\left(\theta,\delta\right)\right]^2 + \gamma\|\delta\|^2_{\mathcal{H}}, \label{eqn:optprob46theta}\tag{5}\] where \(\hat{y}_t = f(x_t,\theta) + \delta(x_t)\) is the prediction of \(y_t\) at time \(t\), and the notation \(\hat{y}_t\equiv\hat{y}_t(\theta,\delta)\) is used to stress its dependence on \(\theta\) and \(\delta\).
Remark 1 (on the role of \(\gamma\)). The role of \(\gamma\) in 5 differs slightly from the one in 2 . While in 2 \(\gamma\) balances data fit and regularization, in 5 it also controls the relative importance assigned to \(\delta\) and to the physical model. Specifically, \(\gamma\) determines the trade-off between enforcing the structure provided by the physical model and allowing deviations captured by \(\delta\). A smaller \(\gamma\) increases the influence of \(\delta\), allowing more flexibility in capturing deviations from the physical model, whereas a larger \(\gamma\) enforces stronger adherence to the model structure. For these reasons, a proper tuning of \(\gamma\) is necessary to achieve the correct trade-off between physical reliability and model flexibility. These aspects have been extensively analyzed in [6], where the reader can find additional examples and a detailed theoretical discussion.
The optimization problem in 5 aims to estimate the vector of physical parameters associated with the known component of the model \(f\) while simultaneously identifying a function \(\delta\) that captures the unmodeled term \(\Delta\). This approach integrates available prior knowledge, allows the identification of interpretable parameters, and systematically compensates for unmodeled effects, ensuring a more comprehensive and structured representation of the system. Assuming that the unknown term \(\Delta\) belongs to the RKHS \(\mathcal{H}\) associated with the chosen kernel ensures that the solution \(\delta^\star\) admits a kernel representation. Moreover, it also implies that \(\Delta\) is a smooth function that can be effectively approximated using a finite number of kernel evaluations centered at the observed data points, consistently with Definition 2. This assumption is common in nonparametric regression [25] and provides a well-posed framework for learning unmodeled dynamics while ensuring regularization and generalization properties. Moreover, although this assumption restricts \(\Delta\) to a reproducing kernel space, RKHSs are flexible enough to approximate a broad class of nonlinear functions [26], making it reasonable in many practical system identification scenarios.
The primary goal of the identification process is to accurately estimate the physical parameters \(\bar\theta\) entering the physical model \(f\). The kernel-based representation of \(\Delta\) captures and compensates for unmodeled dynamics while preserving the underlying physical structure. This approach ensures that the learned correction term complements the physics-based model rather than overshadowing it.
The following key result extends the representer theorem to the system identification framework under consideration.
Proof: Given 5 and \(\hat{y}_t = f(x_t,\theta) + \delta(x_t)\), define \(J(\theta,\delta) = \sum_{t=1}^{T} \left[y_t - f(x_t,\theta) - \delta(x_t)\right]^2 + \gamma\|\delta\|^2_{\mathcal{H}},\) such that 5 is \((\theta^\star,\delta^\star) = \arg \min_{\theta\in\Theta,\delta \in \mathcal{H}} J(\theta,\delta)\). Considering that \(\min_{\theta\in \Theta,\delta\in\mathcal{H}}J(\theta,\delta)=\min_{\theta\in \Theta} \min_{\delta\in\mathcal{H}}J(\theta,\delta)\), we have that a minimizer of 5 must satisfy \[\begin{align} \delta^\star(\cdot) &= (\arg \min_{\delta\in\mathcal{H}} J(\theta,\delta))|_{\theta=\theta^\star},\\ \theta^\star &= \arg \min_{\theta\in\Theta} p(\theta), \label{eqn:thm146tstar46p} \end{align}\tag{6}\] with \(p(\theta)\doteq\min_{\delta\in\mathcal{H}} J(\theta,\delta)\). Here, the inner minimization is performed with respect to \(\delta\) and reduces to the standard kernel regression problem 2 . Indeed, considering \(\tilde{y}_t \doteq y_t - f(x_t,\theta)\), we have \[\delta^\star (\cdot,\theta) = \arg\min_{\delta\in\mathcal{H}}\sum_{t=1}^{T} \left[\tilde{y}_t - \delta(x_t)\right]^2 + \gamma\|\delta\|^2_{\mathcal{H}}. \label{eqn:thm146dstar}\tag{7}\] By the representer theorem [15], the optimal solution to 7 is \[\delta^\star(x,\theta) = \sum^T_{j=1} \omega_j(\theta) \kappa(x, x_j), \label{eqn:thm1dstarrep}\tag{8}\] where the weight vector \(\omega(\theta)\) is given by \(\omega=(\mathbf{K}+\gamma I_T)^{-1}\tilde{Y},\) with \(\tilde{Y} \doteq [\tilde{y}_1,\dots,\tilde{y}_T]^\top\). Thus, [eqn:repthm46defs95b] is obtained given [eqn:repthm46defs95a] and substituting \(\tilde{y}_t \doteq y_t - f(x_t,\theta)\) in \(\tilde{Y}\). Moreover, considering the function \(p(\theta)\) in 6 , we have \(p(\theta)=\min_{\delta\in\mathcal{H}}J(\theta,\delta)=J(\theta,\delta^\star(\cdot,\theta))\), which, substituting 8 , simplifies to \[p(\theta) = \textstyle\sum_{t=1}^{T}(y_t\!-\!f(x_t,\theta)\!-\!\mathbf{K}^\top_t\omega(\theta))^2\!+\!\gamma\|\sqrt{\mathbf{K}}\omega(\theta)\|_2^2, \label{eqn:thm1pstar}\tag{9}\] noting that \[\begin{align} \|\delta\|_\mathcal{H}^2&=\langle\textstyle\sum^T_{i=1} \omega_i(\theta) \kappa(\cdot, x_i),\textstyle\sum^T_{j=1} \omega_j(\theta) \kappa(\cdot, x_j)\rangle_\mathcal{H}\\ &=\textstyle\sum^T_{i=1}\textstyle\sum^T_{j=1} \omega_i(\theta) \omega_j(\theta)\langle \kappa(\cdot, x_i), \kappa(\cdot, x_j)\rangle_\mathcal{H}\\ &=\textstyle\sum^T_{i=1}\textstyle\sum^T_{j=1} \omega_i(\theta) \omega_j(\theta) \kappa(x_i, x_j)\\ &= \omega(\theta)^\top \mathbf{K}\omega(\theta) = \|\sqrt{\mathbf{K}}\omega(\theta)\|_2^2, \end{align}\] which follows from the linearity of the inner product and the reproducing property in Definition 2.b. Here, \(\sqrt{\mathbf{K}}\) denotes the principal square root of \(\mathbf{K}\), i.e., \(\mathbf{K}=\sqrt{\mathbf{K}}\sqrt{\mathbf{K}}=\sqrt{\mathbf{K}}^\top\sqrt{\mathbf{K}}\), noting that \(\mathbf{K}\) is guaranteed to be symmetric and at least positive semi-definite by Definition 1. Thus, we obtain [eqn:reprthm46ext] by substituting the solution to 6 , with \(p(\theta)\) given by 9 , into 8 , which concludes the proof. \(\square\)
Theorem [thm:repthmII] establishes that the optimal solution to the estimation problem 5 can be formulated using kernel-based functions. In particular, the unmodeled component \(\Delta\) is approximated by \(\delta\), defined as a linear combination of kernel evaluations centered at the observed data points, thus leading to the following optimal predictive model \[\hat{y} = f(x, \theta^\star) + \delta^\star(x) = f(x, \theta^\star)+\textstyle\sum^T_{j=1} \omega_j^\star \kappa(x, x_j).\] Moreover, Theorem [thm:repthmII] is particularly relevant as it allows for seamless integration of prior physical knowledge with the adaptability of kernel methods, avoiding the use of heuristically chosen basis functions.
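To make the structure of the solution concrete, the following sketch evaluates, for a candidate \(\theta\), the kernel weights \(\omega(\theta)\) and the resulting physics-plus-kernel prediction; the names are illustrative and the kernel matrix \(\mathbf{K}\) is assumed to have been precomputed.

```python
import numpy as np

def correction_weights(theta, X, Y, K, f, gamma):
    """omega(theta) = (K + gamma*I)^{-1} (Y - [f(x_1,theta), ..., f(x_T,theta)]^T)."""
    residual = np.asarray(Y, dtype=float) - np.array([f(x, theta) for x in X])
    return np.linalg.solve(K + gamma * np.eye(len(X)), residual)

def y_hat(x, theta, X, omega, f, kernel):
    """Predictive model: y_hat(x) = f(x, theta) + sum_j omega_j * kappa(x, x_j)."""
    return f(x, theta) + float(sum(w * kernel(x, xj) for w, xj in zip(omega, X)))
```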
A relevant special case arises when the function \(f(x, \theta)\) is affine in \(\theta\). In this setting, the map 4 becomes \[y = f_0(x) + f(x)^\top \bar{\theta} + \Delta(x) + e, \label{eqn:system46comp46lin}\tag{10}\] where \(f_0:\mathcal{X}\to\mathbb{R}\), \(f:\mathcal{X}\to\mathbb{R}^{n_\theta}\), \(\bar \theta\in\Theta\subseteq\mathbb{R}^{n_\theta}\). Thus, the optimization problem [eqn:reprthm46ext1] simplifies significantly, leading to a convex optimization problem with a closed-form solution, as formalized in the following theorem.
Theorem 1 (closed-form solution of [eqn:reprthm46ext]). Consider the same setup of Theorem [thm:repthmII]. If the system model in 4 is affine in \(\theta\), as in 10 , then the solution of [eqn:reprthm46ext1] is given by \[\theta^\star = \left(F(x)^\top \Psi^\top \Psi F(x)\right)^{-1}F(x)^\top \Psi^\top \Psi {Y_0}, \label{eqn:thmlincf}\qquad{(1)}\] with \(F(x) \doteq [f(x_1), \dots, f(x_T)]^\top\), \(Y_0 \doteq [y_1-f_0(x_1), \dots, y_T-f_0(x_T)]^\top\), and \[\Psi = \left[\begin{array}{c} I_T-\mathbf{K}(\mathbf{K}+\gamma I_T)^{-1}\\ \sqrt{\gamma}\sqrt{\mathbf{K}}(\mathbf{K}+\gamma I_T)^{-1} \end{array}\right],\] with \(\mathbf{K}\) the kernel matrix associated with \(\kappa\) and \(\mathcal{D}\).
Proof: Considering [eqn:reprthm46ext1] applied to 10 , we obtain \[\begin{align} \theta^\star=\arg\min_{\theta\in\Theta} &\sum_{t=1}^{T}\Bigl(y_t{-f_0(x_t)}-f(x_t)^\top\theta-\mathbf{K}^\top_t\omega(\theta)\Bigr)^2\\&+\gamma\|\sqrt{\mathbf{K}}\omega(\theta)\|_2^2. \end{align}\] Considering the centered output \(y_t-f_0(x_t)\), rewriting the summation, and substituting [eqn:repthm46defs], we obtain \[\begin{align} \theta^\star=\arg\min_{\theta\in\Theta} &\|{Y_0}-F(x)\theta-\mathbf{K}(\mathbf{K}+\gamma I_T)^{-1}({Y_0}-F(x)\theta)\|_2^2\\ &+\gamma\|\sqrt{\mathbf{K}}(\mathbf{K}+\gamma I_T)^{-1}({Y_0}-F(x)\theta)\|_2^2, \end{align} \label{eqn:linproof1}\tag{11}\] where we used the fact that [eqn:repthm46defs95a] corresponds to \(\Gamma(\theta)=F(x)\theta\), and, according to [eqn:repthm46defs95b], \(\omega(\theta) = (\mathbf{K}+\gamma I_T)^{-1}(Y_0-\Gamma(\theta))\). To simplify this expression further, we define \(\Psi_1 \doteq I_T-\mathbf{K}(\mathbf{K}+\gamma I_T)^{-1}\), \(\Psi_2 \doteq \sqrt{\gamma} \sqrt{\mathbf{K}}(\mathbf{K}+\gamma I_T)^{-1}\). Substituting these definitions into 11 , we write \[\theta^\star=\arg\min_{\theta\in\Theta} \|\Psi_1 ({Y_0}-F(x)\theta)\|_2^2+\|\Psi_2({Y_0}-F(x)\theta)\|_2^2. \label{eqn:prooflin2}\tag{12}\] Problem 12 is now a standard least-squares problem, since the objective can be compactly written as \({\Bigl\|\Bigl[\begin{array}{c} \Psi_1\\\Psi_2 \end{array}\Bigr] \bigl({Y_0}-F(x)\theta\bigr)\Bigr\|_2^2} = {\Bigl\| \Psi \bigl({Y_0}-F(x)\theta\bigr)\Bigr\|_2^2}\). Therefore, the optimal solution is given in closed form by ?? , which concludes the proof. \(\square\)
This result is particularly significant as it shows that, when the model is affine in \(\theta\), the optimization problem [eqn:reprthm46ext1] becomes convex, ensuring a unique and efficiently computable solution. Importantly, the structure in 10 is quite general, as it does not impose linearity with respect to the input \(x\), but only in the parameters. Notably, many systems that are nonlinear in their inputs can still be expressed in this form, making the framework and Theorem 1 broadly applicable. On the other hand, when affinity in the parameters does not hold, we can still tackle problem [eqn:reprthm46ext1] by means of nonlinear programming methods, such as gradient-based techniques, Gauss–Newton-type algorithms, or similar iterative approaches, acknowledging however that, due to the potential non-convexity of the problem, the obtained solution may only be locally optimal.
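For the affine-in-the-parameters case, a minimal sketch of the closed-form estimate of Theorem 1 is given below, assuming the kernel matrix \(\mathbf{K}\) has been precomputed; a least-squares solver is used in place of the explicit normal equations for numerical robustness. For non-affine \(f\), the reduced objective \(p(\theta)\) in 9 can instead be passed to any nonlinear programming routine. The function names are illustrative.

```python
import numpy as np

def psd_sqrt(K):
    # principal square root of the symmetric positive semi-definite kernel matrix K
    w, V = np.linalg.eigh(K)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def closed_form_theta(F, Y0, K, gamma):
    """Theorem 1: theta* minimizing ||Psi (Y0 - F theta)||_2^2, with
    F = [f(x_1), ..., f(x_T)]^T and Y0 = [y_1 - f0(x_1), ..., y_T - f0(x_T)]^T."""
    T = K.shape[0]
    M = np.linalg.inv(K + gamma * np.eye(T))
    Psi1 = np.eye(T) - K @ M
    Psi2 = np.sqrt(gamma) * psd_sqrt(K) @ M
    Psi = np.vstack([Psi1, Psi2])
    theta, *_ = np.linalg.lstsq(Psi @ F, Psi @ np.asarray(Y0, dtype=float), rcond=None)
    return theta
```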
In the previous section, we considered a static input-output identification setting, where the goal was to estimate the physical parameters \(\bar\theta\) exploiting physical priors \(f(\cdot)\) and approximating an unknown function \(\Delta(x)\) based on measured data pairs \((x_t, y_t)\). In this section, we extend the kernel-based framework to dynamic settings with not fully measurable states, introducing a state-smoothing approach to effectively address the associated challenges.
We consider a discrete-time system of the form \[x_{t+1} = f(x_t, u_t, \bar\theta) + \Delta(x_t,u_t) + e_t, \label{eqn:system46ssform}\tag{13}\] where \(x_t\in\mathbb{R}^n\) denotes the state at time \(t\), \(u_t \in\mathbb{R}^{n_u}\) is the external, measured input, and \(e_t\in\mathbb{R}^n\) represents the process noise. The functions \(f\) and \(\Delta\) are now vector-valued, each comprising \(n\) components, i.e., \(f_i(x_t,u_t,\bar\theta_i)\) and \(\Delta_i(x_t,u_t)\), \(i = 1, \dots, n\). If all state variables are directly measurable, each parameter vector \(\bar\theta_i\) and function \(\Delta_i\) can be estimated using Theorem [thm:repthmII] directly. In this case, the state \(x_{t}\) serves both as the input – along with the measured input \(u_{t}\) – and as the measured output (\(y_{t} = x_{t}\)), allowing for a direct application of Theorem [thm:repthmII]. However, this approach becomes infeasible when certain state components are not directly measurable. In such cases, the system 13 is extended to also incorporate the output equation, i.e., \[\begin{align} x_{t+1} &= f(x_t, u_t, \bar\theta) + \Delta(x_t,u_t) + e_t,\\ y_t &= g(x_t,u_t, \bar\theta) + w_t, \end{align} \label{eqn:system46sswoform}\tag{14}\] where \(w_t\) represents the measurement noise, and the state \(x_t\) is not directly accessible. However, this framework adds a further layer of complexity to the estimation procedure, which can be addressed by adopting two principal approaches: (i) multi-step identification methods, or (ii) prior state estimation. As anticipated in the introduction, the optimization becomes challenging in multi-step identification due to significant nonlinear dependencies among the parameters. Moreover, the recursive dependence on \(\delta\) within \(f\), and consequently on \(\theta\), prevents a straightforward application of Theorem [thm:repthmII]. To circumvent this issue, in the following we focus on the second class of approaches, adopting nonlinear state smoothing techniques.
We consider the system described by 14 , where the state variable \(x_t\in\mathcal{X}\) evolves according to known functions \(f:\mathcal{X}\times\mathbb{R}^{n_u}\times\Theta\to\mathcal{X}\), and \(g:\mathcal{X}\times\mathbb{R}^{n_u}\times\Theta\to\mathbb{R}\), and an unknown term \(\Delta:\mathcal{X}\times\mathbb{R}^{n_u}\to\mathcal{X}\), related to unmodeled dynamics. The goal of nonlinear state smoothing is to estimate a state trajectory \({x}_{0:T-1}\doteq\{ x_0,..., x_{T-1}\}\) from a given dataset of measurements \(\mathcal{D}=\{(u_0,y_0),\dots,(u_{T-1},y_{T-1})\}\cup\{y_T\}\) and a nominal nonlinear model [22].
First, let us consider the known components of 14 , which define the nominal model, i.e., \[\begin{align} x_{t+1} &= f(x_t, u_t, \theta_0) + e_t,\\ y_t &= g(x_t,u_t, \theta_0) + w_t, \end{align} \label{eqn:system46sswoform46init}\tag{15}\] where \(\theta_0\) represents the initial parameter estimate used for the state smoothing. This can correspond, for example, to an initial guess or to the central point in the parameter space \(\Theta\), which will be refined during the identification process.
Moreover, without loss of generality, we assume that an initial estimate of the initial condition, denoted as \(\hat{x}_0\), is available. This estimate can be seamlessly incorporated into the identification problem alongside \(\theta\) if needed (see, e.g., [6]).
To perform the nonlinear state reconstruction, in this work we employ a nonlinear state smoother based on a two-step strategy. First, we apply forward filtering, based on an unscented Kalman filter, for state estimation. Then, we employ backward smoothing to refine the state estimates. We note, however, that any state reconstruction approach could be employed for this task.
Therefore, we begin by imposing standard conditions ensuring that the state can be estimated and the physical parameters can be identified from the available measurements.
Assumption 1 (observability and identifiability). The system state \(x\) is observable along the trajectory induced by the applied input, and the physical parameters \(\theta\) are identifiable [27], provided that the input is persistently exciting over the observation window.
Next, we detail the proposed two-step approach as applied to the considered state-space framework.
1. Forward filtering: The UKF aims to approximate the posterior state distribution using a set of sigma points, which are propagated through the nominal nonlinear system dynamics. Here, we assume that the process and measurement noise, \(e_t\) and \(w_t\), can be characterized as zero-mean Gaussian with covariances \(P_e\) and \(P_w\), respectively, i.e., \(e_t \sim \mathcal{N}(0, P_e)\), \(w_t \sim \mathcal{N}(0, P_w)\). The state estimate uncertainty, represented by \(P_t\), is initialized as \(P_0\) and updated recursively.
All common variants of the UKF for discrete-time systems adhere to the same prediction–correction structure, though they may differ in specific formulations and weight definitions. The reader is referred to, e.g., [21] for details on the UKF and its implementation.
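As a reference, a compact sketch of one common UKF formulation with additive Gaussian noise is reported below; the sigma-point weights follow the standard \((\alpha,\beta,\kappa)\) parameterization, which differs from the specific weight values adopted in Section 5, and all names and default values are illustrative.

```python
import numpy as np

def sigma_points(x, P, alpha=1.0, beta=2.0, kappa=0.0):
    """2n+1 sigma points and mean/covariance weights (one common parameterization)."""
    n = x.size
    lam = alpha ** 2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * P)          # S @ S.T = (n + lam) * P
    chi = np.vstack([x, x + S.T, x - S.T])         # rows are the sigma points
    wm = np.full(2 * n + 1, 0.5 / (n + lam))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + 1.0 - alpha ** 2 + beta
    return chi, wm, wc

def ukf_step(x, P, u, u_next, y_next, f, g, theta0, Pe, Pw):
    """One prediction-correction cycle for 15 with additive noise:
    predict x_{t+1} through f(., u_t, theta0), correct with y_{t+1} = g(., u_{t+1}, theta0)."""
    # prediction
    chi, wm, wc = sigma_points(x, P)
    Xf = np.array([f(c, u, theta0) for c in chi])
    x_pred = wm @ Xf
    P_pred = Pe + sum(w * np.outer(d, d) for w, d in zip(wc, Xf - x_pred))
    # correction (sigma points are redrawn around the predicted state)
    chi2, wm2, wc2 = sigma_points(x_pred, P_pred)
    Yp = np.array([g(c, u_next, theta0) for c in chi2])
    y_pred = wm2 @ Yp
    Pyy = Pw + sum(w * np.outer(d, d) for w, d in zip(wc2, Yp - y_pred))
    Pxy = sum(w * np.outer(dx, dy) for w, dx, dy in zip(wc2, chi2 - x_pred, Yp - y_pred))
    G = Pxy @ np.linalg.inv(np.atleast_2d(Pyy))    # Kalman gain
    x_upd = x_pred + G @ np.atleast_1d(y_next - y_pred)
    P_upd = P_pred - G @ np.atleast_2d(Pyy) @ G.T
    return x_upd, P_upd
```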
At the end of the forward filtering step, we obtain the filtered state sequence \(\hat{x}_{1:T}\doteq\{\hat{x}_1,\dots,\hat{x}_T\}\) with the associated covariance matrices \(P_{1:T}\doteq\{P_1,\dots,P_T\}\). These estimates serve as the input for the subsequent smoothing process, which is described next.
2. Backward smoothing: The unscented Rauch–Tung–Striebel smoother [22], [23] aims to obtain the final state estimates. Given the filtered estimates, sigma points are generated and propagated through the model to compute predicted means, covariances, and cross-covariances [22]. The smoother gain matrix is then used to iteratively refine the state and covariance estimates, running backward from time \(t = T-1\) to \(t = 0\), yielding the final smoothed state sequence \(\hat{x}^s_{0:T-1}\doteq\{\hat{x}^s_0,\dots,\hat{x}^s_{T-1}\}\) and covariances \(P^s_{0:T-1}\doteq\{P^s_0,\dots,P^s_{T-1}\}\).
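For completeness, a sketch of the backward recursion is given below; it reuses the sigma_points helper from the filtering sketch above and assumes the filtered means and covariances (including the initial estimate) are available as lists.

```python
import numpy as np

def urts_backward(xf, Pf, U, f, theta0, Pe):
    """Unscented RTS backward pass: refines the filtered estimates xf[0..N], Pf[0..N]."""
    xs, Ps = list(xf), list(Pf)                    # initialized with the filtered estimates
    for t in range(len(xf) - 2, -1, -1):
        chi, wm, wc = sigma_points(xf[t], Pf[t])   # helper from the filtering sketch above
        Xf = np.array([f(c, U[t], theta0) for c in chi])
        m_pred = wm @ Xf
        P_pred = Pe + sum(w * np.outer(d, d) for w, d in zip(wc, Xf - m_pred))
        C = sum(w * np.outer(dx, df) for w, dx, df in zip(wc, chi - xf[t], Xf - m_pred))
        G = C @ np.linalg.inv(P_pred)              # smoother gain
        xs[t] = xf[t] + G @ (xs[t + 1] - m_pred)
        Ps[t] = Pf[t] + G @ (Ps[t + 1] - P_pred) @ G.T
    return xs, Ps
```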
Once the sequence of unmeasured states has been reconstructed through state smoothing, i.e., \(\hat{x}^s_{0:T-1}\), the problem effectively reduces to the one in 5 , to which Theorem [thm:repthmII] is directly applicable. This can be done, for instance, by defining \(z_t \doteq [\hat{x}^s_{t-1}, u_{t-1}, u_t]\) as the new input variable, and computing the predictions at time \(t\), i.e., \(\hat{y}_t(\theta,\delta)\) in 5 , using the following prediction model \[\hat{y}_t = \xi(z_t,\theta) + \delta(z_t), \label{eqn:predmodel46new}\tag{16}\] with \(\xi(z_t)\doteq g\big(f(\hat{x}^s_{t-1},u_{t-1},\theta),u_t,\theta\big)\). In this formulation, \(\delta\) captures the discrepancies arising from unknown or unmodeled components in \(f\) 14 , which influence the system evolution and are subsequently mapped to the output space by \(g\).
Remark 2 (estimation without future data). When employing the identified model for simulation, where both the learned kernel component and the identified parameters are used, future data is not always available to apply the smoother. In this case, the natural solution is to rely solely on the filtering step to provide state estimates up to time \(t\). Beyond this point, the nominal model is used in open-loop to propagate the unmeasured states, which are then fed into the kernel-extended model to generate output predictions.
We note that, although alternative models exist, the selected formulation fully leverages the available physical priors by incorporating both \(f\) and \(g\). The proposed methodology enables the estimation of physical parameters \(\theta\) and the approximation of the unknown component \(\Delta\) via a kernel-based approach, effectively integrating prior knowledge with data. This principled combination enhances both interpretability and accuracy. Additionally, the nonlinear state smoother preserves the well-behaved properties of single-step identification while implicitly capturing multi-step dependencies. This leads to improved accuracy in both single- and multi-step settings, without requiring explicit multi-step optimization.
The overall procedure for implementing the proposed identification approach in a state-space setting is sketched in Algorithm [alg:kernel95state95space].
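A high-level outline of this procedure is sketched below; it is an illustrative composition of the previous snippets rather than the exact pseudocode of Algorithm [alg:kernel95state95space], and solve_problem_5 is a hypothetical placeholder for the solution of 5 (closed form when affine in \(\theta\), nonlinear programming otherwise).

```python
import numpy as np

def identify_state_space(U, Y, f, g, theta0, x0_hat, P0, Pe, Pw, kernel, gamma):
    """Outline: smooth the states with the nominal model, then solve 5."""
    # 1) forward UKF with the nominal parameters theta0 (ukf_step from the sketch above),
    #    followed by the backward URTS pass
    xf, Pf = [x0_hat], [P0]
    for t in range(len(U) - 1):
        x, P = ukf_step(xf[-1], Pf[-1], U[t], U[t + 1], Y[t + 1], f, g, theta0, Pe, Pw)
        xf.append(x); Pf.append(P)
    xs, _ = urts_backward(xf, Pf, U, f, theta0, Pe)
    # 2) regressors z_t = [x^s_{t-1}, u_{t-1}, u_t] and one-step targets y_t, as in 16
    Z = [np.concatenate([np.atleast_1d(xs[t - 1]), np.atleast_1d(U[t - 1]), np.atleast_1d(U[t])])
         for t in range(1, len(U))]
    Ytgt = [Y[t] for t in range(1, len(U))]
    # 3) solve problem 5 on (Z, Ytgt); 'solve_problem_5' is a hypothetical placeholder
    theta_star, omega_star = solve_problem_5(Z, Ytgt, kernel, gamma)
    return theta_star, omega_star, xs
```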
In this section, we illustrate the effectiveness of the proposed identification method on an academic example and a cascade tank system (CTS) benchmark [28].
We consider a regression problem where the goal is to estimate the parameters \(\bar \theta\) of a linear-in-parameter model, using the kernel-based approach and comparing it to the solution obtained when no kernel integration is employed, solved with ordinary least-squares methods.
First, we generate \(T = 1000\) samples with inputs \(x\in[-1,1]\) and outputs following a polynomial and sinusoidal relationship. Relying on 10 , we have \[\begin{align} f_0(x) &= 0,\, f(x)^\top=[1, x, u, x^2, u^2],\\ \Delta(x) &= 0.7\sin(5x) + 0.5\cos(3x) + 0.4x^2 + 0.3x^3 \\&\quad\,\,- 0.2\sin(7x)\cos(2x), \end{align}\] where \(u = \sin(2\pi x) + 0.5 \cos(3\pi x)\) and \(e\) follows a Gaussian distribution with standard deviation \(0.1\). For this case study, we set \(\bar\theta = [2,3,4,1.5,-0.8]\). To improve the estimation accuracy in the presence of unknown nonlinearities \(\Delta(x)\), we employ a Gaussian kernel function, where the kernel matrix \(\mathbf{K}\) is computed as \(\mathbf{K}_{ij} = \exp\left(-\frac{(x_i - x_j)^2}{2\sigma^2}\right)\), with a bandwidth parameter \(\sigma = 1\). The kernel-based estimator follows the formulation derived in Theorem [thm:repthmII] and Theorem 1, where the correction is introduced through a kernel, and the operator \(\Psi\) projects the observations into a feature space that captures unmodeled nonlinear effects.
The obtained results are reported in Table 1, where the solution obtained with the proposed approach is compared with the least-squares solution (LS) and the discrepancy modeling approach (DM) described in the introduction (see, e.g., [8]). The estimated parameter vector \(\theta^\star\), obtained using the closed-form solution of Theorem 1 with \(\gamma = 0.4\), is significantly closer to the true \(\bar\theta\) compared to the ordinary least-squares estimate \(\theta^{LS}\), which does not account for unknown nonlinearities. In addition, we computed the root mean square error (RMSE) to evaluate the accuracy of the predictions. The proposed kernel-based model achieves an RMSE of \(0.005\), whereas the model obtained using \(\theta^{LS}\), which neglects the unmodeled effects, results in a significantly higher RMSE of \(0.010\).
The estimate \(\theta^{LS}\) also corresponds to the solution given by the discrepancy modeling approach. In this two-step procedure, the physical parameters are first estimated without accounting for \(\Delta\), and the resulting discrepancy is then modeled separately – here, using a kernel-based correction. While this approach reduces the RMSE to \(0.0074\), it falls short of the performance achieved by the proposed method, which estimates both the physical parameters and the unmodeled dynamics simultaneously. Moreover, it yields more biased parameter estimates.
| | \(\theta_1\) | \(\theta_2\) | \(\theta_3\) | \(\theta_4\) | \(\theta_5\) | RMSE |
|---|---|---|---|---|---|---|
| True | \(2\) | \(3\) | \(4\) | \(1.5\) | \(-0.8\) | / |
| LS | \(2.55\) | \(3.03\) | \(4.32\) | \(0.80\) | \(-1.05\) | \(0.010\) |
| DM | \(2.55\) | \(3.03\) | \(4.32\) | \(0.80\) | \(-1.05\) | \(0.007\) |
| Proposed | \(\mathbf{2.18}\) | \(\mathbf{3.12}\) | \(\mathbf{4.17}\) | \(\mathbf{1.43}\) | \(\mathbf{-0.78}\) | \(\mathbf{0.005}\) |
Then, to illustrate the impact of \(\gamma\) on the proposed identification process, we analyze its effect on the model performance using a validation dataset. Figure 2 depicts the RMSE as a function of \(\gamma\), selected in the range \([10^{-2}, 10]\), highlighting the trade-off between model flexibility and physical consistency. As expected, small values of \(\gamma\) lead to high RMSE due to kernel overfitting to deviations and neglecting the physical priors. On the other hand, large values result in biased parameter estimates as they enforce strict adherence to the physical model at the expense of not capturing relevant unmodeled terms. Clearly, achieving the correct trade-off between physical priors and the kernel-based representation of unmodeled terms is crucial for accurate identification.
These results illustrate the benefits of using a kernel-based integration approach. By incorporating the unmodeled effects into the estimation process, the kernel-based approach better represents the system dynamics, leading to significantly lower RMSE values and improved parameter estimates.
In the second example, we test the proposed approach on a CTS benchmark [28] and compare it with other state-of-the-art estimation methods. The CTS regulates the fluid levels in two connected tanks fed by a pump. First, water is pumped into the upper tank, then it flows to the lower tank and back to the reservoir. Overflow occurs when a tank is full: the excess from the upper tank partially drains into the lower one, with the rest exiting the system. In [28], an approximate nonlinear, continuous-time state-space model of the CTS is derived using Bernoulli's principle and mass conservation. Here, we rely on its discretized version, i.e., \[\begin{align} x_{1,t+1} &= x_{1,t} + T_s\left(-k_1 \sqrt{x_{1,t}} + k_4 u_t + e_{1,t}\right),\\ x_{2,t+1} &= x_{2,t} + T_s\left( k_2 \sqrt{x_{1,t}} - k_3\sqrt{x_{2,t}} + e_{2,t}\right),\\ y_t &= x_{2,t} + w_t, \end{align} \label{eqn:tctmodel}\tag{17}\] where \(u_t \in \mathbb{R}\) is the input signal, \(x_{1,t}\in \mathbb{R}\) and \(x_{2,t}\in \mathbb{R}\) are the states of the system, \(y_t \in \mathbb{R}\) is the output, and \(e_{t}\in \mathbb{R}^2\), \(w_t\in \mathbb{R}\) are the additive noise sources. The sampling time is set to \(T_s = 4\) s. Moreover, the system is characterized by four unknown physical constants, \(k_1\), \(k_2\), \(k_3\), and \(k_4\), which depend on the properties of the system and need to be estimated. Since this model ignores the water overflow effect, the physical model 17 is affected by unmodeled dynamics. The training and validation datasets consist of \(T = 1024\) input-output samples.
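For reference, a noise-free sketch of the nominal discretized dynamics 17 is reported below; the nonnegativity guard on the tank levels is an implementation detail, not part of the model, and the function names are illustrative.

```python
import numpy as np

def cts_step(x, u, k, Ts=4.0):
    """One noise-free step of the discretized CTS dynamics 17."""
    k1, k2, k3, k4 = k
    x1 = max(float(x[0]), 0.0)          # guard the square roots (implementation detail)
    x2 = max(float(x[1]), 0.0)
    x1_next = x1 + Ts * (-k1 * np.sqrt(x1) + k4 * u)
    x2_next = x2 + Ts * ( k2 * np.sqrt(x1) - k3 * np.sqrt(x2))
    return np.array([x1_next, x2_next])

def cts_output(x):
    """Output equation of 17: the lower-tank level."""
    return float(x[1])
```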
The goal is to estimate the dynamics of the system using only the available training data. For the identification, we employ the nonlinear state smoothing described by Algorithm [alg:kernel95state95space], assuming that \(f\) and \(g\) are defined according to the discretized model 17 and selecting \(P_e=10^{-3}I_2\), \(P_w=10^{-2}\), \(P_0=0.5 I_2\), \(\theta_0= [0.05, 0.05, 0.05, 0.05]^\top\), and \(\hat{x}_0=[y_0,y_0]^\top\). Moreover, we set the UKF weights according to the formulation in [21], that is \(a=2.74\), \(w^m_0 = 0.33\), \(w^c_0 = 2.33\), and \(w^m_i = w^c_i = 0.67\), for \(i=1,\dots,2n\). Then, we solve an optimization problem of the form 5 with \(\gamma=10^{-1}\). Specifically, once the smoothed trajectory of the unmeasured state \(\hat{x}_{1,0:T-1}^s\) is computed, we define a predictive model of the form 16 and solve Problem 5 by applying Theorem [thm:repthmII]. In particular, we employ the fmincon solver with an SQP method to solve Problem [eqn:reprthm46ext1]. Hence, considering 17 and selecting \(z_t \doteq [\hat{x}_{1,t-1}^s, y_t, u_{t-1}]\), we define \(\xi(z_t) \doteq y_{t} + T_sk_2 \sqrt{\hat{x}^s_{1,t-1} + T_s\left(-k_1 \sqrt{\hat{x}^s_{1,t-1}} + k_4 u_{t-1}\right)} - T_sk_3\sqrt{y_{t}}\). Thus, once we obtain \((\theta^\star,\delta^\star)\) as the solution of 5 according to Theorem [thm:repthmII], the optimal estimation model becomes \(\hat{y}_{t+1} = \xi(z_t,\theta^\star) + \delta^\star(z_t)\), selecting the Gaussian kernel \(\kappa(x, x^{\prime}) = \exp(-\|x - x^{\prime}\|^2 / 2\sigma^2)\) with \(\sigma=11\).
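The corresponding one-step predictor \(\xi(z_t,\theta)\) can be sketched as follows, with the kernel correction \(\delta^\star(z_t)\) then added on top as in 16 ; the function name is illustrative.

```python
import numpy as np

def xi_cts(z, k, Ts=4.0):
    """xi(z_t, theta) for the CTS, with z_t = [x^s_{1,t-1}, y_t, u_{t-1}] and theta = (k1, k2, k3, k4)."""
    x1s, y_t, u_prev = z
    k1, k2, k3, k4 = k
    x1_pred = x1s + Ts * (-k1 * np.sqrt(x1s) + k4 * u_prev)   # predicted upper-tank level x_{1,t}
    return y_t + Ts * k2 * np.sqrt(x1_pred) - Ts * k3 * np.sqrt(y_t)
```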
To evaluate the effectiveness of the estimation algorithm, as suggested in [28], the RMSE is employed as the performance metric. Table [tab:comparison] presents the performance of the proposed kernel-based identification method compared with the solution obtained by solving 5 when no kernel is used to compensate for unmodeled dynamics in 17 . The results are reported for the estimation and validation datasets, considering: (i) the prediction task, i.e., given \(z_t\), we estimate \(y_{t+1}\), and (ii) the simulation task, i.e., we recursively estimate \(y_{t+1}\) by defining \(z_t\) with \(\hat{y}_t\) (the previous estimate of \(y_t\)). Moreover, for the simulation results, we also report the fit values in Table [tab:comparison]. These first results highlight the significant reduction in the RMSE achieved through kernel integration, particularly in the simulation setting. Notably, despite the optimization being performed minimizing the prediction error, the joint use of the kernel-based approach and the smoother substantially improves the multi-step simulation accuracy.
| | Pred. RMSE (train.) | Pred. RMSE (val.) | Sim. RMSE, fit (train.) | Sim. RMSE, fit (val.) |
|---|---|---|---|---|
| Model 17 only | \(0.06\) | \(0.06\) | \(0.37\), \(83.14\%\) | \(0.37\), \(82.46\%\) |
| Model 17 + kernel | \(0.04\) | \(0.05\) | \(0.17\), \(92.11\%\) | \(0.18\), \(91.55\%\) |
This improvement is also reflected in the results reported in Table 3, where we compare the simulation performance for the validation data with other state-of-the-art approaches from the literature. Among these methods, we observe that the proposed approach also outperforms the solution of our multi-step identification approach presented in [6]. This improvement is likely due to the ability of the kernel-based method to automatically select an optimal representation for the approximating term \(\delta\), whereas in [6] this term was chosen from a limited, predefined dictionary of basis functions.
Finally, we report the identified parameters for completeness. The smoothed initial condition is \(\hat{x}^s_0 = [4.78, 5.20]\) whereas, for \(\hat{k}\), we have: (i) with no kernel, \(\hat{k} = [-0.01, 0.05, 0.06, 0.01]\), and (ii) with kernel, \(\hat{k} = [0.08, 0.05, 0.05, 0.06]\).
Method | RMSE | Method | RMSE |
---|---|---|---|
Svensson et al. [4] | \(0.45\) | PWARX [29] | \(0.35\) |
Volt.FB [13] | \(0.39\) | SED-MPK [14] | \(0.48\) |
INN [3] | \(0.41\) | PNLSS-I [30] | \(0.45\) |
NLSS2 [30] | \(0.34\) | NOMAD [31] | \(0.37\) |
Donati et al. [6] | \(0.26\) | Proposed (kernel) | \(0.178\) |
This work introduced a kernel-based framework for physics-informed nonlinear system identification, effectively integrating kernel methods while preserving physical model interpretability. Then, to tackle the case of unmeasured states, we incorporated a nonlinear state smoother. The numerical results confirm the effectiveness of the proposed approach in finding meaningful parametric models complemented by compensating unstructured components.
Future investigations will explore the extension to multi-step optimization and advanced strategies for integrating state smoothing with identification, analyzing the role of model uncertainty. For the latter, potential directions include iterative refinement of the filtering model and the tuning of noise covariances to account for modeling inaccuracies, as well as the derivation of error bounds in the smoothing step.
The authors are grateful to T. Alamo for his valuable suggestion to incorporate kernel models in the framework of off-white system identification.
\(^\mathsection\) This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.↩︎
\(^{\star}\)Corresponding author: cesare.donati@polito.it.↩︎
\(^{1}\)DET, Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy.↩︎
\(^{2}\)CNR-IEIIT, c/o Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy.↩︎
\(^{3}\)EECS, The Pennsylvania State University, University Park, PA, USA.↩︎