April 02, 2024

We consider the problem of designing a machine learning-based model of an unknown dynamical system from a finite number of (state-input)-successor state data points, such that the model obtained is also suitable for optimal control design. We propose a specific NN architecture that yields a hybrid system with PWA dynamics that is differentiable with respect to the network’s parameters, thereby enabling the use of derivative-based training procedures. We show that a careful choice of our NN’s weights produces a hybrid system model with structural properties that are highly favourable when used as part of a finite horizon OCP. Specifically, we show that optimal solutions with strong local optimality guarantees can be computed via NLP, in contrast to classical OCPs for general hybrid systems which typically require mixed-integer optimization. In addition to being well-suited for optimal control design, numerical simulations illustrate that our NN-based technique enjoys very similar performance to state-of-the-art system identification methodologies for hybrid systems and it is competitive on nonlinear benchmarks.

Interest in data-driven learning techniques is reshaping conventional systems modelling and controller design, with the convergence of accessible datasets and advanced machine learning methods offering a transformative opportunity [1].

Within the systems-and-control community, *system identification* methods have benefited significantly from powerful tools drawn from the machine learning literature [2]–[5]. In particular,
NNs [6] have been used since at least the 1980s [7] to generate data-based system model surrogates, which has sparked renewed interest in black-box modelling of functions within the
state–space. Their flexibility, owing both to their universal approximation properties [8] and to the many NN architectures available, has motivated research towards the development of NN-based techniques for (non)linear system identification
adopting various neural model structures [9]–[14].

However, with a few exceptions [15], [16], available learning-based system identification techniques do not consider whether the models constructed will have an internal structure favourable for use within an optimal control problem. In fact, unless one assumes a certain simplified structure, deep NNs are generally hard to analyze due to their nonlinear and large-scale structure [6], [17]. As an undesirable consequence, adopting the resulting system model surrogates into OCPs may make the latter extremely hard to solve efficiently (whenever a solution is guaranteed to exist at all). In this paper we propose a NN-based technique to identify a hybrid system representation in the form of continuous PWA dynamics with structure tailored specifically for optimal control design.

Although hybrid models represent a broad class of systems [18], [19] whose identification is known to be NP-hard in general [20], [21], to the best of our knowledge little attention has been devoted to the derivation of hybrid system models through (deep) NN-based surrogates.

Starting from the pioneering work in [22], where a continuous PWA representation was obtained through a NN with a specific two-layer structure building on the results developed in [23], the authors of [24] constructed a PWA local model by introducing an input-space tessellation via hyperpolyhedral convex cells, associating to each cell a NN granule with local interpolation capability. The authors of [25] instead introduced a constructive algorithm that builds a NN with PWA sigmoidal nodes and one hidden layer. In particular, nodes were added one at a time by fitting the residual error, a task accomplished by searching for the best fit through a sequence of QPs. A single hidden layer network was also proposed in [26], which constructed a continuous PWA error function and developed an efficient, two-step algorithm working in the weight space to minimize it. More recently, [27] proposed a series of experiments in which newly developed libraries are employed to identify dynamical models with NNs, even for complicated hybrid systems, while more applied contributions have analyzed the positive effects of imposing structured knowledge on deep NN-based surrogates in describing the multi-modal behaviour exhibited by the system at hand [28], [29].

In a similar spirit to our work, [30] proposed to infer control-oriented LC dynamics (equivalent to PWA dynamics [31]) directly from data. In particular, their approach built upon the minimization of a tailored violation-based loss function, which allowed one to learn LC dynamics via standard gradient-based methods. In view of the results established later in the present paper, we also mention the study conducted in [32]. There, a (strong) stationarity condition was identified as necessary and sufficient to characterize local optima of a program with a ReLU NN entering both the cost and constraints, with the corresponding LC-based, lifted reformulation yielding an MPCC.

Following from our recent work [33], [34], in this paper we take a NN-based approach to the problem of system identification for control. In particular, we wish to answer the following question: given \(N\) (state-input)-successor state samples \(\{(x^{(i)},u^{(i)}, x^{+,{(i)}})\}_{i=1}^N\), how can we obtain a descriptive system model that is also suitable for optimal control design? While the meaning of the terms “descriptive” and “suitable” will be clarified later in the paper, we summarize the contribution as follows:

We employ a NN-based method to obtain a hybrid system with continuous PWA dynamics from available data. In particular, the proposed NN is characterized by a simple architecture, and its output is differentiable wrt the NN’s parameters;

Given its end-to-end trainability, a careful choice of some weights of the NN allows us to produce a hybrid dynamics model with a specific structure. We then show that the latter can be controlled (locally) optimally by solving the KKT conditions of the underlying finite horizon OCP;

Extensive numerical simulations show that, as a NN-based identification procedure, our technique has very similar performance compared to state-of-the-art hybrid system identification methods.

Our method thus requires a simple identification step, represented by a careful training of a NN with specific structure via standard tools, which yields a hybrid system model that is well-suited to optimal control design. In particular, solving an OCP involving a hybrid system with PWA dynamics as an NLP has been proven to require shorter computation times and feature better scaling in the problem dimensions than standard approaches based on mixed-integer optimization [35], [36]. In addition, we recall that PWA regression is NP-hard in general [21], since it requires simultaneous classification of the \(N\) samples \(\{(x^{(i)},u^{(i)}, x^{+,{(i)}})\}_{i=1}^N\) into modes, thereby calling for a regression of a submodel for each mode. By taking a NN-based perspective, we instead circumvent any challenge resulting from (possibly expensive) identification procedures, since the output differentiability of the proposed NN architecture allows for a standard training procedure requiring little computational effort.

Unlike other approaches, we obtain an LC model suitable for control as a direct consequence of a careful choice of a NN with specific architecture, thereby circumventing the requirement of strict complementarity to recover differentiability wrt the main parameters as in [28], [30]. In contrast to [32], we provide an explicit structure of a NN for which local stationarity conditions, coinciding with the standard KKT system, are known to hold for the MPCC obtained by using our particular NN as a hybrid system model in an OCP.

The rest of the paper is organized as follows: we formalize the problem addressed in §2, and in §3 introduce our NN-based approach to system identification for control. In §4 we describe the OCP obtained when using our identified model structure, and develop our main theoretical results relating to that problem in §5. Finally, §6 reports a series of numerical experiments.

\(\mathbb{N}\), \(\mathbb{R}\) and \(\mathbb{R}_{\geq 0}\) denote the set of natural, real and nonnegative real numbers, respectively. \(\mathbb{S}^{n}\) is the space of \(n \times n\) symmetric matrices and \(\mathbb{S}_{\succ 0}^{n}\) (\(\mathbb{S}_{\succcurlyeq 0}^{n}\)) is the cone of positive (semi-)definite matrices. Bold \(\boldsymbol{1}\) (\(\boldsymbol{0}\)) is a vector of ones (zeros). Given a matrix \(A \in \mathbb{R}^{m \times n}\), \(A^\top\) denotes its transpose. For \(A \in \mathbb{S}_{\succ 0}^{n}\), \(\| v \|_A \mathrel{\vcenter{:}}=\sqrt{ v^\top A v }\). The operator \(\textrm{col}(\cdot)\) stacks its arguments in column vectors or matrices of compatible dimensions. \(A \otimes B\) is the Kronecker product of \(A\) and \(B\). We sometimes use \(x_{k+1}\), \(k \in \mathbb{N}_0\), as opposed to \(x^+\), to make time dependence explicit when describing the state evolution of discrete-time systems. The uniform distribution on the interval \([a,b]\) is denoted by \(U(a,b)\), whereas \(N(\rho, \sigma^2)\) stands for the normal distribution with mean \(\rho\) and standard deviation \(\sigma\).

We will assume that we have available a finite collection of (state-input)-successor state measured triplets, \((x,u, x^{+})\), \(x\), \(x^+\in\mathbb{R}^n\), \(u\in\mathbb{R}^m\), for an unknown but deterministic dynamical system. Our first aim is to produce a model of this unknown system *without* running further
experiments or successive identification procedures. In particular, the model we wish to compute from the available data should be “descriptive enough”, i.e., it should belong to a model class capable of capturing a wide variety of system behaviours, while
at the same time being suited for optimal control.

To clarify what we mean with the latter statement, we introduce next the finite horizon OCP considered throughout the paper: \[\label{eq:OCP} \left. \begin{align} &\underset{\boldsymbol{u}, \boldsymbol{x}}{\textrm{min}} && \ell_T\left(x_T\right)+\sum_{t\in\mathcal{T}} \ell_t\left(x_t, u_t\right)\\ & \textrm{ s.t. } && x_{t+1} = \mathcal{N}_\theta(x_t, u_t), \text{ for all } t \in \mathcal{T},\\ &&& x_0 = x(0), \end{align} \right.\tag{1}\] where \(x_t\), \(t \in \mathcal{T} \mathrel{\vcenter{:}}=\{0, \ldots, T-1\}\), denotes the predicted system state \(t\) timesteps into the future through the (possibly nonlinear) dynamical model \(\mathcal{N}_\theta : \mathbb{R}^p \times \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n\), which is to be inferred from data and is characterized by parameters \(\theta \in \mathbb{R}^p\). We denote by \(\boldsymbol{u}\mathrel{\vcenter{:}}=\textrm{col}((u_t)_{t \in \mathcal{T}}) \in \mathbb{R}^{mT}\) the collection of control inputs to be chosen and by \(\boldsymbol{x}\mathrel{\vcenter{:}}=\textrm{col}((x_t)_{t \in \mathcal{T} \cup \{T\}}) \in \mathbb{R}^{n(T+1)}\) the resulting state trajectory starting from the measured initial state \(x_0=x(0)\). In addition, let \(\ell_t\) and \(\ell_T\) denote the stage and terminal costs respectively, which for technical reasons are assumed to meet the following standard condition:

**Standing Assumption 1**. *The stage cost \(\ell_t : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}\) and the terminal cost \(\ell_T: \mathbb{R}^n \to \mathbb{R}\) are
convex functions.*

Therefore, given \(N\) samples \(\{(x^{(i)},u^{(i)}, x^{+,{(i)}})\}_{i=1}^N\), our goal is to develop a system model \(\mathcal{N}_\theta : \mathbb{R}^p \times \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n\) such that, once \(\mathcal{N}_\theta\) is used as part of the OCP in 1 , an optimal solution to the latter can be characterized via standard optimality conditions. To this end, we choose for \(\mathcal{N}_\theta\) a NN-based representation of a hybrid system, ultimately leading to an OCP equivalent to 1 in the form of an MPCC [37], whose classical KKT conditions [38] will be necessary and sufficient to characterize an optimal solution.

For technical reasons, we will assume in the remainder that the state-input pair takes values in some bounded set as follows:

**Standing Assumption 2**. *\((x,u) \in \Omega\), where \(\Omega \subset \mathbb{R}^n \times \mathbb{R}^m\) is a convex polytope.*

To address the problem introduced in §2, we will design \(\mathcal{N}_\theta\) as a two-layer NN with the simplified architecture illustrated in Fig. 1, and then make use of available data \(\{(x^{(i)},u^{(i)}, x^{+,{(i)}})\}_{i=1}^N\) to train the associated parameters, generically described by \(\theta\in \mathbb{R}^p\). In particular, the underlying NN consists of an OptNet layer [39], [40], which takes the input pair \(\textrm{col}(x, u) \eqqcolon y\) as a parameter to solve the following QP: \[\label{eq:optnet} \left. \begin{align} &\underset{z}{\textrm{min}} && \tfrac{1}{2} z^\top Q\left(y\right) z+q\left(y\right)^\top z \\ &~\textrm{s.t. } && A\left(y\right) z=b\left(y\right), \\ &&& G\left(y\right) z \leq h\left(y\right), \end{align} \right.\tag{2}\] thus returning an optimal solution \(z^\star : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^s\) as output. This output is then passed through a buffer layer performing an affine transformation so that \[\label{eq:neural95network} x^+ = W z^\star(x, u) + c \eqqcolon\mathcal{N}_\theta(x,u),\tag{3}\] with appropriate matrix \(W \in \mathbb{R}^{n \times s}\) and affine term \(c \in \mathbb{R}^n\). The parameters characterizing \(\mathcal{N}_\theta\) are then \(\theta \mathrel{\vcenter{:}}=\{(Q, q, A, b, G, h), W, c\} \in \mathbb{R}^p\), which must be determined during some offline training procedure using our available data set \(\{(x^{(i)},u^{(i)}, x^{+,{(i)}})\}_{i=1}^N\).
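As a concrete illustration of 2 – 3 , the forward pass of \(\mathcal{N}_\theta\) can be sketched in a few lines. The sketch below is not the actual training pipeline: it uses SciPy's SLSQP routine as a stand-in QP solver (equality constraints omitted) and purely hypothetical toy weights, chosen so that the QP simply projects \(y=\textrm{col}(x,u)\) onto \(\{z \le \boldsymbol{1}\}\).

```python
import numpy as np
from scipy.optimize import minimize

def optnet_forward(x, u, Q, R, p, G, h, W, c):
    # Stack the input pair y = col(x, u); it parameterizes the QP (2).
    y = np.concatenate([np.atleast_1d(x), np.atleast_1d(u)])
    q = R @ y + p                       # linear cost term q(y), affine in y
    res = minimize(                     # solve min 0.5 z'Qz + q'z  s.t. G z <= h
        lambda z: 0.5 * z @ Q @ z + q @ z,
        np.zeros(Q.shape[0]),
        jac=lambda z: Q @ z + q,
        constraints=[{"type": "ineq", "fun": lambda z: h - G @ z,
                      "jac": lambda z: -G}],
        method="SLSQP",
    )
    return W @ res.x + c                # affine buffer layer (3)

# Toy weights: Q = I and q(y) = -y, so the QP projects y onto {z <= 1}.
Q, R, p = np.eye(2), -np.eye(2), np.zeros(2)
G, h = np.eye(2), np.ones(2)
W, c = np.array([[1.0, -1.0]]), np.zeros(1)
x_plus = optnet_forward(0.5, 2.0, Q, R, p, G, h, W, c)
```

Here \(z^\star = [0.5,\, 1.0]\) and thus \(x^+ = W z^\star = -0.5\); in practice a differentiable solver such as an OptNet layer [39] would replace the generic NLP call, so that gradients wrt \(\theta\) are also returned.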

To ensure a unique optimizer for 2 so that there is no ambiguity in the definition of the dynamics in 3 , during the training phase we will impose strict convexity of 2 , i.e., \(Q\in\mathbb{S}^s_{\succ 0}\). This will be made possible in view of the main technical features possessed by the NN \(\mathcal{N}_\theta\) illustrated in Fig. 1, which are established next:

**Proposition 1**. *The NN \(\mathcal{N}_\theta : \mathbb{R}^p \times \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n\) in 3 , consisting of an OptNet layer and an affine buffer layer in cascade, enjoys the following properties:*

*(i) In 2 , assume that \(Q\in\mathbb{S}^s_{\succ 0}\) and that \(A\) has full row rank. Then the output of \(\mathcal{N}_\theta\) in 3 is differentiable wrt the whole set of parameters \(\theta\);*

*(ii) \(\mathcal{N}_\theta\) can represent any continuous PWA mapping defined over \(\Omega\). Specifically, in case the mapping to be modelled is defined over a regular partition of \(\Omega\) with \(r\) pieces, then the number of constraints required to reproduce it is no more than \(2nr\), i.e., \(l \le 2 n r\), and \(s\le2n\).*

*Proof.* (i) Since the composition of differentiable mappings remains differentiable, the proof follows by combining [39], which
proves differentiability of \(z^\star(x,u)\) wrt the OptNet parameters under the postulated conditions on matrices \(Q\) and \(A\), and the fact that any successor state \(x^+\) is defined by an affine combination of \(z^\star(x, u)\) for any
state-input pair \((x, u)\), which is hence differentiable wrt both \(W\) and \(c\).

(ii) This part follows from Standing Assumption 2 and [41] directly. In fact, if one chooses \(c=0\), and parameters \(Q, q, A, b, G, h\) so that 2 becomes:
\[\label{eq:optnet95PWA} \left. \begin{align} &\underset{z}{\mathrm{min}} && \tfrac{1}{2} z^\top Q z+ (p + R y)^\top z\\ &~\mathrm{s.t. } && F z + G y \leq h,
\end{align} \right.\tag{4}\] with \(Q\in\mathbb{S}^s_{\succ 0}\), then the structure in [41] is immediately recovered. Note that these choices are always possible since differentiability of the NN output wrt \(\theta\), which follows from the first property shown in this statement, implies that \(\mathcal{N}_\theta\) is an end-to-end trainable NN. ◻

Proposition 1(i) thus makes \(\mathcal{N}_\theta\) in 3 , with \(z^\star(\cdot)\) solving 2 , an end-to-end trainable NN, meaning that one is actually able to set values for the main parameters \(\theta = \{(Q, q, A, b, G, h), W, c\}\) exploiting (state-input)-successor state samples, as commonly happens with any other NN. A differentiable output enables the use of a typical backpropagation strategy to find the gradient of the training loss function. Consequently, one can train \(\mathcal{N}_\theta\) with standard gradient descent methods for finding a (local) minimum of that function.

As a main consequence of the second property in Proposition 1, instead, we note that 1 turns out to coincide with an OCP involving continuous PWA dynamics. The latter class is known to be equivalent to a number of hybrid system model classes [31], such as LC [30], [42] or MLD models [43]. For this reason, the family of continuous PWA dynamics excels in capturing a wide variety of real-world system behaviours. In particular, since \(\mathcal{N}_\theta\) will be trained on the basis of available data \(\{(x^{(i)},u^{(i)}, x^{+,{(i)}})\}_{i=1}^N\), 1 actually amounts to an OCP involving a data-based hybrid system representation.

We now derive an equivalent formulation for the OCP in 1 , which will shed further light on the type of control problem we are dealing with. We will then present in §4.2 available results that are applicable to MPCCs with the particular structure we obtain, before applying those results to our problem in §5.

Suppose that we have trained \(\mathcal{N}_\theta\) in 3 to obtain the structure of the OptNet layer as in 4 with \(Q\in\mathbb{S}^s_{\succ 0}\) and imposing \(c=0\). The following KKT conditions are then necessary and sufficient to characterize the optimal solution \(z^\star\): \[\label{eq:KKT} \left\{ \begin{align} &Q z^\star +R y+p+F^{\top} \lambda =0,\\ &0 \leq (h-F z^\star-G y) \perp \lambda \geq 0, \end{align} \right.\tag{5}\] where \(\lambda \in \mathbb{R}^l_{\ge0}\) denotes the vector of Lagrange multipliers associated with the linear constraints. By recalling that \(y = \textrm{col}(x,u)\), from the KKT system above we obtain a so-called LC model of the system dynamics [44] associated with the NN \(\mathcal{N}_\theta\), with architecture as in Fig. 1: \[\label{eq:LC95model} \left\{ \begin{align} &x^{+} =-W Q^{-1} R\begin{bmatrix} x \\ u \end{bmatrix}-W Q^{-1} F^{\top} \lambda-W Q^{-1} p, \\ &0 \leq \left(F Q^{-1} R-G\right)\begin{bmatrix} x \\ u \end{bmatrix}+F Q^{-1} F^{\top} \lambda+F Q^{-1} p+h \perp \lambda \geq 0. \end{align} \right.\tag{6}\]
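The passage from 5 to 6 can be checked numerically. The sketch below uses a scalar toy instance of 4 with hypothetical weights (\(Q=1\), \(R=-1\), \(p=0\), \(F=1\), \(G=0\), \(h=1\), \(W=1\), \(c=0\)), for which the primal-dual solution is available in closed form, and verifies that it satisfies the LC model 6 .

```python
# Scalar toy instance of the QP (4): min_z 0.5*z^2 - y*z  s.t.  z <= 1,
# i.e. hypothetical weights Q = 1, R = -1, p = 0, F = 1, G = 0, h = 1,
# with buffer layer W = 1, c = 0.
Q, R, p, F, G, h, W = 1.0, -1.0, 0.0, 1.0, 0.0, 1.0, 1.0

def lc_step(y):
    # Closed-form primal-dual solution of the toy QP.
    z_star = min(y, 1.0)              # projection of y onto {z <= 1}
    lam = max(0.0, y - 1.0)           # multiplier of the constraint z <= 1
    # LC model (6): successor state and complementarity residual.
    x_plus = -W / Q * R * y - W / Q * F * lam - W / Q * p
    residual = (F / Q * R - G) * y + F / Q * F * lam + F / Q * p + h
    return x_plus, lam, residual

for y in (-0.5, 0.3, 2.0):
    x_plus, lam, res = lc_step(y)
    assert abs(x_plus - W * min(y, 1.0)) < 1e-12          # x+ = W z*(y)
    assert res >= 0 and lam >= 0 and abs(lam * res) < 1e-12  # 0 <= res, lam; res*lam = 0
```

For instance, \(y=2\) gives \(z^\star=1\), \(\lambda=1\), hence \(x^+=1\) with an active (zero) complementarity residual, exactly as 6 prescribes.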

Note that all of the terms in 6 except \((x,u)\) and \(\lambda\) are determined according to some offline training process characterizing the NN \(\mathcal{N}_\theta\) in 3 . Hence the terminology “data-based representation”.

Substituting our LC model from 6 into 1 yields: \[\label{eq:OCP95LC} \left. \begin{align} &\underset{\boldsymbol{u}, \boldsymbol{x}, \boldsymbol{w}}{\textrm{min}} && \ell_T\left(x_T\right)+\sum_{t\in\mathcal{T}} \ell_t\left(x_t, u_t\right)\\ & ~\textrm{ s.t. } && x_{t+1} = A x_t + B_u u_t + B_w w_t + d, \text{ for all } t \in \mathcal{T},\\ &&& 0 \le E_w w_t + E_x x_t +E_u u_t + e \perp w_t \geq 0, \text{ for all } t \in \mathcal{T},\\ &&& x_0 = x(0), \end{align} \right.\tag{7}\] where the matrices \(A\), \(B_u\), \(B_w\), \(E_w\), \(E_x\), \(E_u\), and vectors \(d\) and \(e\) can be obtained via straightforward rearrangement of terms and definition of the complementarity variable \(w \mathrel{\vcenter{:}}=\lambda\). Specifically, \(B_w \mathrel{\vcenter{:}}=-W Q^{-1} F^{\top}\), \(d \mathrel{\vcenter{:}}=-W Q^{-1} p\), \(E_w \mathrel{\vcenter{:}}= F Q^{-1} F^{\top}\), \(e \mathrel{\vcenter{:}}= F Q^{-1} p+h\). The matrices \(A\) and \(B_u\) (respectively, \(E_x\) and \(E_u\)) are obtained by partitioning the matrix \(-W Q^{-1} R\) (resp., \(F Q^{-1} R-G\)) with \(n+m\) columns into two matrices with \(n\) and \(m\) columns, respectively. The notation is then analogous to 1 , and we denote by \(\boldsymbol{w} \mathrel{\vcenter{:}}=\textrm{col}((w_t)_{t \in \mathcal{T}})\) the trajectory of complementarity variables corresponding to \(\boldsymbol{u}\) and \(\boldsymbol{x}\). A distinctive feature of the LC dynamics is the set of complementarity constraints inherited from 6 , which turns the OCP 7 , equivalent to 1 under our choice of \(\mathcal{N}_\theta\) in 3 , into a data-based MPCC.
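The rearrangement of terms just described is plain linear algebra, and can be sketched as follows; the dimensions and the randomly generated stand-ins for the trained weights are purely illustrative.

```python
import numpy as np

def lc_matrices(Q, R, p, F, G, h, W, n, m):
    # Assemble the MPCC data in (7) from (hypothetical) trained NN weights.
    Qi = np.linalg.inv(Q)
    AB = -W @ Qi @ R              # n x (n+m), split into [A, B_u]
    A, B_u = AB[:, :n], AB[:, n:]
    B_w = -W @ Qi @ F.T
    d = -W @ Qi @ p
    DE = F @ Qi @ R - G           # l x (n+m), split into [E_x, E_u]
    E_x, E_u = DE[:, :n], DE[:, n:]
    E_w = F @ Qi @ F.T
    e = F @ Qi @ p + h
    return A, B_u, B_w, d, E_x, E_u, E_w, e

# Toy sizes: n = 2 states, m = 1 input, s = 3 latent variables, l = 4 constraints.
rng = np.random.default_rng(0)
n, m, s, l = 2, 1, 3, 4
Q = np.eye(s)
R = rng.standard_normal((s, n + m)); p = rng.standard_normal(s)
F = rng.standard_normal((l, s));     G = rng.standard_normal((l, n + m))
h = rng.standard_normal(l);          W = rng.standard_normal((n, s))
A, B_u, B_w, d, E_x, E_u, E_w, e = lc_matrices(Q, R, p, F, G, h, W, n, m)
```

Note that \(E_w = F Q^{-1} F^{\top}\) is symmetric positive semidefinite by construction, whatever the trained values of \(F\).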

**Remark 1**. *Unlike [30], where an LC model must
be inferred from data directly, we obtain it as a consequence of choosing to train a NN \(\mathcal{N}_\theta\) with architecture as in Fig. 1. For this reason, we do not require strict complementarity of the resulting LC model to recover differentiability wrt the main parameters, as postulated instead in [28], [30].*

In general, MPCCs amount to nonlinear, nonconvex optimization problems that can be very challenging to solve [45]. Indeed, for such problems the standard constraint qualifications for NLP (e.g., the classical Mangasarian-Fromovitz one [38]) typically fail to hold at any feasible point [46].

On the other hand, it has recently been proven in [35] that if the MPCC has a specific structure, then one is able to establish strong stationarity conditions characterizing a (local) optimal solution. In particular, referring to the data-based MPCC in 7 , the following three requirements on the LC model are sufficient to recover the strong stationarity conditions derived in [35]:

**Condition 1**. *Given any state-input pair \((x, u)\), every complementarity variable \(w\) that solves \(0 \le E_w w + E_x x +E_u u + e \perp
w \geq 0\) results in the same successor state \(x^{+}\).*

**Condition 2**. *The complementarity problem \(0 \le E_w w + E_x x +E_u u + e \perp w \geq 0\) can be decomposed elementwise wrt
the complementarity variable \(w\), i.e., \[\label{eq:decomposition} \forall i \in \{1, \ldots, l\}: \quad 0 \leq m_i w_i + D_i x + G_i u + e_i \perp w_i \geq
0,\qquad{(1)}\] for \(m_i > 0\) so that \(E_w = \mathrm{diag}((m_i)_{i=1}^l)\), \(E_x = \textrm{col}((D_i)_{i=1}^l)\), \(E_u = \textrm{col}((G_i)_{i=1}^l)\) and \(e=\textrm{col}((e_i)_{i=1}^l)\).*

**Condition 3**. *Given any state-input pair \((x, u)\), there exists some \(i \in \{1, \ldots, l\}\) such that \(D_i x+G_i
u+e_i<0\).*

While Condition 1 guarantees the well-posedness of the LC model we consider, entailing a deterministic behaviour for the resulting dynamics, Conditions 2 and 3 are rather technical and partially limit the LC models we are allowed to consider. Specifically, Condition 2 means that the solution set of the complementarity problem \(0 \le E_w w + E_x x +E_u u + e \perp w \geq 0\) is given by the Cartesian product of the solution sets of the \(l\) scalar problems appearing therein, while Condition 3 requires the existence of a solution \(w_i \neq 0\) to at least one of those scalar problems, for any fixed pair \((x, u)\).
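Under Condition 2, each scalar complementarity problem with \(m_i > 0\) admits the unique closed-form solution \(w_i = \max(0, -g_i/m_i)\), where \(g_i = D_i x + G_i u + e_i\). The following sketch, with purely hypothetical data, illustrates this and a case where Condition 3 holds.

```python
import numpy as np

def solve_complementarity(m_diag, D, G, e, x, u):
    # Under Condition 2 (m_i > 0), each scalar problem
    #   0 <= m_i w_i + g_i  complementary to  w_i >= 0,  g_i = D_i x + G_i u + e_i,
    # has the unique solution w_i = max(0, -g_i / m_i).
    g = D @ x + G @ u + e
    w = np.maximum(0.0, -g / m_diag)
    return w, m_diag * w + g          # variable and complementarity residual

# Hypothetical data with l = 2, n = m = 1.
m_diag = np.array([1.0, 2.0])
D = np.array([[1.0], [-1.0]])
G = np.zeros((2, 1))
e = np.array([-1.0, 0.2])
w, res = solve_complementarity(m_diag, D, G, e, np.array([0.5]), np.array([0.0]))
# Condition 3 holds at this (x, u): g_1 = -0.5 < 0, hence w_1 = 0.5 > 0.
```

By construction the residual is nonnegative and vanishes wherever \(w_i > 0\), so complementarity holds elementwise, consistent with the Cartesian-product structure of the solution set.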

**Remark 2**. *We consider an elementwise decomposition of the complementarity problem \(0 \le E_w w + E_x x +E_u u + e \perp w \geq 0\) for simplicity, although a generalization of Condition 2 allowing for a block-diagonal decomposition is also possible – see [35].*

Armed with these requirements, our next result provides necessary and sufficient conditions to characterize a local solution to the optimal control MPCC in 7 :

**Lemma 1**. *([35]) Let \(\left(\boldsymbol{x}^\star, \boldsymbol{u}^\star, \boldsymbol{w}^\star\right)\) be feasible for the MPCC in 7 with an LC model satisfying Conditions 1–3. Then, \(\left(\boldsymbol{x}^\star, \boldsymbol{u}^\star, \boldsymbol{w}^\star\right)\) is locally optimal if and only if the standard KKT conditions for 7 admit a primal-dual solution pair.*

The statement in Lemma 1 enables one to seek a local solution to the optimal control MPCC 7 through the solution of a classical KKT system, i.e., as an NLP. This is typically computationally advantageous, with significantly better scaling compared to more traditional mixed-integer approaches to solving OCPs based on generic PWA or hybrid models [35], [36].

Thus, if we are able to train our NN \(\mathcal{N}_\theta\) in 3 so that Conditions 1–3 are satisfied, then \(\mathcal{N}_\theta\) would also bring major computational advantages when used as part of an OCP.

We now show how the NN \(\mathcal{N}_\theta\) in 3 , where \(z^\star(x,u)\) minimizes 2 (or 4 ) for a given pair \((x,u) \in \Omega\), can be trained so that Conditions 1–3 are met, thereby enabling us to solve the data-based OCP 7 as an NLP.

Specifically, in [35] it was shown that the following inverse optimization model of PWA dynamics leads to an LC model such that, once it is embedded into an OCP as in 1 , all three conditions are satisfied collectively:

\[\tag{8} \begin{align} & x^{+}=\alpha^\star(x,u)-\beta^\star(x,u)=\alpha^\star-\beta^\star, \text{ with } \tag{9}\\ &\notag\\ & \alpha^\star = \left\{ \begin{aligned} &\underset{\alpha \in \mathbb{R}^n}{\textrm{argmin}} && \tfrac{1}{2}\|\alpha-A_\psi x - B_\psi u- c_\psi\|_{Q_\alpha}^2 \\ &~~~~\textrm{ s.t. } && \alpha \geq A_{\alpha, i} x+ B_{\alpha, i} u +c_{\alpha, i}, \textrm{ for all } i \in \{1,\ldots, n_r^\alpha\} \end{aligned} \right.\tag{10}\\ & \beta^\star = \left\{ \begin{align} &\underset{\beta \in \mathbb{R}^n}{\textrm{argmin}} && \tfrac{1}{2}\|\beta-A_\phi x - B_\phi u - c_\phi\|_{Q_\beta}^2 \\ &~~~~\textrm{ s.t. } && \beta \geq A_{\beta, j} x + B_{\beta, j} u + c_{\beta, j}, \textrm{ for all } j \in \{1,\ldots,n_r^\beta\}, \end{align} \right.\tag{11} \end{align}\]

where \(Q_\alpha\), \(Q_\beta \in \mathbb{S}^n_{\succ0}\) are diagonal, but otherwise arbitrary, matrices. Moreover, the elements \(\{(A_{\alpha, i},B_{\alpha, i},c_{\alpha, i})\}_{i=1}^{n_r^\alpha}\) and \(\{(A_{\beta, j},B_{\beta, j},c_{\beta, j})\}_{j=1}^{n_r^\beta}\) were determined in [35] on the basis of a PWA partition of the state-input space \(\Omega \subset \mathbb{R}^n\times\mathbb{R}^m\) introduced in Standing Assumption 2, computed through available methods such as, e.g., Delaunay triangulations and Voronoi diagrams – see, for instance, the discussion in [41], [48]. In addition, the affine functions in the costs originating from \((A_\psi,B_\psi,c_\psi)\) (respectively, \((A_\phi,B_\phi,c_\phi)\)) were required to lower bound \(\{(A_{\alpha, i},B_{\alpha, i},c_{\alpha, i})\}_{i=1}^{n_r^\alpha}\) (resp., \(\{(A_{\beta, j},B_{\beta, j},c_{\beta, j})\}_{j=1}^{n_r^\beta}\)) in each region of the underlying partition. Specifically, one has to satisfy the following: \[\label{eq:strict95inequalities} \begin{align} &A_\psi x + B_\psi u + c_\psi<\underset{i \in \{1,\ldots,n_r^\alpha\}}{\textrm{max}}~ \{A_{\alpha, i} x+B_{\alpha, i} u+c_{\alpha, i}\}, \text{ for all } (x, u) \in \Omega,\\ &A_\phi x + B_\phi u + c_\phi<\underset{j \in \{1,\ldots,n_r^\beta\}}{\textrm{max}}~ \{A_{\beta, j} x+B_{\beta, j} u+c_{\beta, j}\}, \text{ for all } (x, u) \in \Omega. \end{align}\tag{12}\] Note that the above can always be satisfied for a given PWA system, e.g., by choosing any index \(i \in \{1,\ldots,n_r^\alpha\}\) and \(j \in \{1,\ldots,n_r^\beta\}\) and by setting \[\label{eq:possible95choice} \begin{align} & A_\psi x + B_\psi u + c_\psi=A_{\alpha, i} x+B_{\alpha, i} u+c_{\alpha, i}-\eta, \\ & A_\phi x + B_\phi u + c_\phi=A_{\beta, j} x+B_{\beta, j} u+c_{\beta, j}-\zeta, \end{align}\tag{13}\] with arbitrary terms \(\eta\), \(\zeta > 0\).
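Since \(Q_\alpha\) and \(Q_\beta\) are diagonal, the QPs in 10 and 11 are separable, and their solutions reduce elementwise to the maximum of the target affine function and the lower bounds; 9 is thus a difference of max-affine maps. The following sketch represents the scalar PWA map \(x \mapsto |x|\) under hypothetical choices that satisfy 12 on any bounded \(\Omega\) around the origin.

```python
import numpy as np

def max_affine_qp(target, lower_bounds):
    # Solution of argmin 0.5*||a - target||^2_Q  s.t.  a >= every lower
    # bound, for diagonal Q > 0: the elementwise maximum.
    return np.maximum(target, np.max(lower_bounds, axis=0))

def pwa_abs(x):
    # alpha*: lower bounds x and -x; the target psi(x) = x - 1 lies
    # strictly below max(x, -x) everywhere, as required by (12).
    alpha = max_affine_qp(np.array([x - 1.0]), np.array([[x], [-x]]))
    # beta*: a single trivial lower bound 0, with target -1 below it.
    beta = max_affine_qp(np.array([-1.0]), np.array([[0.0]]))
    return float((alpha - beta)[0])   # x+ = alpha* - beta*, cf. (9)
```

Here the targets play the role of \(A_\psi x + c_\psi\) and \(A_\phi x + c_\phi\), and the lower bounds the role of the regional affine pieces; the diagonal weighting matrices drop out of the closed-form solution.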

By taking the NN-based perspective proposed in this work, however, one does not need to compute all of the aforementioned elements explicitly. As will be made clear in our main result, once the number of regions determining each partition is fixed, i.e., \(n_r^\alpha\), \(n_r^\beta\ge1\) (which will therefore coincide with the hyperparameters of our data-driven approach, along with matrices \(Q_\alpha\), \(Q_\beta\)), we will obtain matrices and vectors \(\{(A_{\alpha, i},B_{\alpha, i},c_{\alpha, i})\}_{i=1}^{n_r^\alpha}\), \(\{(A_{\beta, j},B_{\beta, j},c_{\beta, j})\}_{j=1}^{n_r^\beta}\), \((A_\psi,B_\psi,c_\psi)\) and \((A_\phi,B_\phi,c_\phi)\), as weights of the NN \(\mathcal{N}_\theta\) in 3 . In particular, we can claim the following:

**Theorem 1**. *Let \(n_r^\alpha\), \(n_r^\beta\ge1\) be fixed. The NN \(\mathcal{N}_\theta : \mathbb{R}^p \times \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n\) in 3 , consisting of an OptNet layer and an affine buffer layer in cascade, can be trained to produce a
hybrid system model satisfying Conditions 1–3.*

*Proof.* To start we note that, for given \(n_r^\alpha\), \(n_r^\beta\ge1\) and any state-input pair \((x,u) \in \Omega\), 10 and 11 can be consolidated to obtain the following separable QP:

\[\left. \begin{align} &\underset{(\alpha,\beta)\in \mathbb{R}^{2n}}{\textrm{min}} && \frac{1}{2} \left\|\begin{bmatrix} \alpha\\\beta\end{bmatrix} - \begin{bmatrix}A_\psi\\A_\phi\end{bmatrix} x - \begin{bmatrix}B_\psi\\B_\phi\end{bmatrix} u - \begin{bmatrix}c_\psi\\c_\phi\end{bmatrix} \right\|_H^2\\ &~~~~\textrm{ s.t. } && \begin{bmatrix} I_n\otimes\boldsymbol{1}_{n_r^\alpha} & \boldsymbol{0}_{n n_r^\alpha\times n}\\ \boldsymbol{0}_{n n_r^\beta\times n} & I_n\otimes\boldsymbol{1}_{n_r^\beta} \end{bmatrix} \begin{bmatrix} \alpha\\\beta \end{bmatrix} \ge \begin{bmatrix} A_{\alpha,1}\\ \vdots\\ A_{\alpha,n_r^\alpha}\\ A_{\beta,1}\\ \vdots\\ A_{\beta,n_r^\beta}\\ \end{bmatrix} x + \begin{bmatrix} B_{\alpha,1}\\ \vdots\\ B_{\alpha,n_r^\alpha}\\ B_{\beta,1}\\ \vdots\\ B_{\beta,n_r^\beta}\\ \end{bmatrix} u+ \begin{bmatrix} c_{\alpha,1}\\ \vdots\\ c_{\alpha,n_r^\alpha}\\ c_{\beta,1}\\ \vdots\\ c_{\beta,n_r^\beta}\\ \end{bmatrix}, \end{align} \right.\]

where, for simplicity, we have imposed \(\mathrm{diag}(Q_\alpha,Q_\beta) = H\in \mathbb{S}^{2n}_{\succ0}\). Then, by introducing the decision variable \(\xi \mathrel{\vcenter{:}}=\textrm{col}(\alpha,\beta)\) and after suitably redefining all matrices/vectors, the QP above admits the following compact reformulation:

\[\label{eq:single95QP95compact} \left. \begin{align} &\underset{\xi\in \mathbb{R}^{2n}}{\textrm{min}} && \tfrac{1}{2} \| \xi - A_\gamma x - B_\gamma u - c_\gamma \|_H^2\\ &~\textrm{ s.t. } && S_\xi \xi \ge A_\xi x + B_\xi u + c_\xi. \end{align} \right. \implies \left. \begin{align} &\underset{\xi\in \mathbb{R}^{2n}}{\textrm{min}} && \tfrac{1}{2} \xi^\top H\xi - (A_\gamma x + B_\gamma u + c_\gamma)^\top H \xi\\ &~\textrm{ s.t. } && S_\xi \xi \ge A_\xi x + B_\xi u + c_\xi. \end{align} \right.\tag{14}\]

In particular, \(A_\gamma \in \mathbb{R}^{2n \times n}\), \(B_\gamma \in \mathbb{R}^{2n \times m}\), \(c_\gamma \in \mathbb{R}^{2n}\), while \(A_\xi \in \mathbb{R}^{n(n_r^\alpha+n_r^\beta) \times n}\), \(B_\xi \in \mathbb{R}^{n(n_r^\alpha+n_r^\beta) \times m}\) and \(c_\xi \in \mathbb{R}^{n(n_r^\alpha+n_r^\beta)}\). Note that, in addition, the matrix \(S_\xi \in \mathbb{R}^{n(n_r^\alpha+n_r^\beta) \times 2n}\) has full column rank for any \(n_r^\alpha\), \(n_r^\beta \ge 1\). The statement then follows by making a one-to-one correspondence between the model in 3 , where \(z^\star(x,u)\) minimizes 4 for some pair \((x,u) \in \Omega\) with the affine layer in cascade, and 14 . In particular, for given \(n_r^\alpha\), \(n_r^\beta\ge1\), and diagonal \(H\in \mathbb{S}^{2n}_{\succ0}\), to fully recover 14 while training \(\mathcal{N}_\theta\) one can treat \(z\) in 4 as a decision variable living in \(\mathbb{R}^{2n}\), additionally imposing \(Q=H \in \mathbb{S}^{2n}_{\succ0}\), \(F=-S_\xi\), \(W=[I_n \; -I_n]\), and \(c=0\). Thus, once the NN \(\mathcal{N}_\theta\) has been trained, we obtain the equivalences \(-H^{-1} R=[A_\gamma \; B_\gamma]\), \(-H^{-1} p=c_\gamma\), \(G = [A_\xi \; B_\xi]\), and \(-h=c_\xi\). To conclude, note that the required conditions on the parameters \(\theta\), specifically on the fixed matrices/vectors \(Q\), \(F\), \(W\), and \(c\), can be easily imposed during the training as a direct consequence of Proposition 1.(i). ◻
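The block structure of \(S_\xi\) invoked in the proof is easy to verify numerically; the sketch below builds it for toy sizes (all hypothetical) and confirms full column rank.

```python
import numpy as np

# Build S_xi = blockdiag(I_n kron 1_{n_r^alpha}, I_n kron 1_{n_r^beta})
# for hypothetical toy sizes n = 2, n_r^alpha = 3, n_r^beta = 2.
n, nra, nrb = 2, 3, 2
S_xi = np.vstack([
    np.hstack([np.kron(np.eye(n), np.ones((nra, 1))), np.zeros((n * nra, n))]),
    np.hstack([np.zeros((n * nrb, n)), np.kron(np.eye(n), np.ones((nrb, 1)))]),
])

# n(n_r^alpha + n_r^beta) rows, 2n columns, full column rank.
assert S_xi.shape == (n * (nra + nrb), 2 * n)
assert np.linalg.matrix_rank(S_xi) == 2 * n
```

Each of the \(2n\) columns carries a block of ones in rows touched by no other column, which is why full column rank holds for any \(n_r^\alpha\), \(n_r^\beta \ge 1\).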

Theorem 1 asserts that, under a careful choice of some of the weights characterizing the NN \(\mathcal{N}_\theta\) in 3 , the latter can be trained directly by means of samples \(\{(x^{(i)},u^{(i)}, x^{+,{(i)}})\}_{i=1}^N\), so that a data-based LC model meeting Conditions 1–3 is indeed produced. In particular, our data-driven approach reduces the set of parameters that must be learned during the training phase of \(\mathcal{N}_\theta\) to \(\{R, p, G, h\}\) – see Fig. 2 for a schematic representation of the proposed NN-based hybrid system identification-for-control mechanism.

To conclude, Corollary 1 essentially particularizes the result in Lemma 1 to the optimal control of the data-based hybrid model designed through the NN \(\mathcal{N}_\theta\) in Fig. 1:

**Corollary 1**. *Let the NN \(\mathcal{N}_\theta : \mathbb{R}^p \times \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n\) in 3 be trained to produce a hybrid system model satisfying Conditions 1–3, along with the inequalities in 12 . Let \(\left(\boldsymbol{x}^\star, \boldsymbol{u}^\star,
\boldsymbol{w}^\star\right)\) be feasible for the resulting OCP in the form of 7 . Then, \(\left(\boldsymbol{x}^\star,
\boldsymbol{u}^\star, \boldsymbol{w}^\star\right)\) is locally optimal if and only if the standard KKT conditions for 7 admit a primal-dual
solution pair.*

Corollary 1 requires that if our NN model \(\mathcal{N}_\theta\) is to be included in the OCP 7 , its parameters must also satisfy the strict inequalities in 12 in order for the local optimality results of Lemma 1 to apply.

However, these inequalities cannot be enforced uniformly over \(\Omega\) during the training phase, i.e., the strict inequalities are guaranteed to hold only for the considered (state-input)-successor state samples, \(\{(x^{(i)},u^{(i)}, x^{+,{(i)}})\}_{i=1}^N\), and not for all \((x,u) \in \Omega\). The relations in 13 provide a possible remedy, namely setting \((A_\psi,B_\psi,c_\psi)\) and \((A_\phi,B_\phi,c_\phi)\) to meet 12 . Note that imposing the equalities in 13 can be done either before or after the training process of \(\mathcal{N}_\theta\).

**Remark 3**. *In [32] a strong stationarity condition is identified as necessary and sufficient to characterize local optima of a program with a ReLU entering both the cost and the constraints, with the corresponding LC-based, lifted reformulation yielding an MPCC. In contrast, we give an explicit expression of a NN \(\mathcal{N}_\theta\) for which local stationarity conditions, coinciding with the standard KKT system, are known to hold for the MPCC obtained once the NN \(\mathcal{N}_\theta\) is employed as a data-based hybrid system model in the OCP 1 .*

We now present several examples in which the proposed NN \(\mathcal{N}_\theta\) in 3 –4 is adopted to learn continuous PWA models from datasets produced by PWA and nonlinear dynamics. In particular, in §6.1 we compare our approach with the PARC algorithm [5]. In §6.2 we analyze how sensitive our technique is to hyperparameter tuning. Finally, in §6.3 we make use of our NN-based model to solve OCPs with sensitivity-based solvers.

To do so, we focus on the structure reported in [eq:alfa95beta] directly, since Theorem 1 shows that the two representations are indeed equivalent, with the possibility to recover 3 –4 from [eq:alfa95beta] and vice versa. Specifically, learning the latter amounts to determining the elements \(\{(A_{\alpha, i},B_{\alpha, i},c_{\alpha, i})\}_{i=1}^{n_r^\alpha}\), \(\{(A_{\beta, j},B_{\beta, j},c_{\beta, j})\}_{j=1}^{n_r^\beta}\), \((A_\psi,B_\psi,c_\psi)\), and \((A_\phi,B_\phi,c_\phi)\), with hyperparameters \(n_r^\alpha\) and \(n_r^\beta\). In this regard, we note that the choice of the Hessian matrices in [eq:second] and [eq:third], i.e., \(Q_\alpha\) and \(Q_\beta\), is also arbitrary. Although they could be included in the learning process as well, in the following we set \(Q_\alpha=Q_\beta=I\) and \(n_r^\alpha=n_r^\beta=7\), unless otherwise stated. It is also interesting to note that, in this case, the mappings \(\alpha^\star(x,u)\) and \(\beta^\star(x,u)\) in [eq:second] and [eq:third], respectively, recover the structure proposed in [49].

We implement the proposed approach in Python, utilizing JAX [50], JAXopt [51] and SciPy [52]. The models have been trained to minimize the sum of the mean squared error and a regularization term penalizing the Euclidean norm of the parameter vector \(\theta\). The overall training problem is therefore given by: \[\label{eq:training} \left. \begin{align} &\underset{\theta}{\textrm{min}} & & \frac{1}{N} \sum_{i=1}^{N} \|x^{+,(i)}-\hat{x}^{+,(i)}\|_2^2 + \lambda \|\theta\|_2^2\\ &\textrm{ s.t. } & & \hat{x}^{+,(i)}=\alpha_\theta(x^{(i)},u^{(i)})-\beta_\theta(x^{(i)},u^{(i)}),~ i=1,\dots,N, \end{align} \right.\tag{15}\] where the subscripts of \(\alpha_\theta(x,u)\) and \(\beta_\theta(x,u)\) make explicit the dependence of [eq:second] and [eq:third], respectively, on \(\theta = \{(A_{\alpha, i},B_{\alpha, i},c_{\alpha, i})\}_{i=1}^{7} \cup \{(A_{\beta, j},B_{\beta, j},c_{\beta, j})\}_{j=1}^{7} \cup \{(A_\psi,B_\psi,c_\psi), (A_\phi,B_\phi,c_\phi)\}\). The training problem was solved using the SLSQP algorithm [53] with \(\lambda=0.01\); as the internal QP solver, we relied on the \(\mathtt{BoxCDQP}\) solver from JAXopt. We then evaluate the approximation quality of the models obtained on the basis of open-loop predictions, generated by simulating the trained \(\mathcal{N}_\theta\) when excited with an unseen input sequence for 1000 time steps, and employing standard metrics [3], [14], [54], [55] such as the BFR: \[\mathrm{BFR}=1-\frac{\|\hat{X}^+ -X^+\|_F}{\| X^+ - \bar X^+\|_F},\] and the RMS error, defined as: \[\mathrm{RMS}=\sqrt{\frac{1}{N} \sum_{i=1}^{N}\|x^{+,(i)}-\hat{x}^{+,(i)}\|_2^2}~.\] Referring to the BFR, all matrices have dimension \(n\times N\): specifically, \(X^+\) stacks the true data, \(\bar X^+\) stacks the componentwise average of the true data, while \(\hat{X}^+\) collects the predictions of \(X^+\) obtained in open-loop simulation.
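As an illustration, the two metrics above can be computed as follows (a minimal sketch; the only assumption is the array layout stated in the text, i.e., \(n \times N\) with one column per sample, and the standard convention that \(\mathrm{BFR}=1\) corresponds to a perfect fit):

```python
import numpy as np

def bfr(X_pred, X_true):
    """Best Fit Rate on n x N arrays (one column per sample): 1 minus the
    ratio between the Frobenius norm of the prediction error and that of
    the deviation of the true data from its componentwise mean. A perfect
    fit scores 1; the trivial mean predictor scores 0."""
    X_bar = X_true.mean(axis=1, keepdims=True)
    return 1.0 - (np.linalg.norm(X_pred - X_true, "fro")
                  / np.linalg.norm(X_true - X_bar, "fro"))

def rms(X_pred, X_true):
    """Root-mean-square prediction error over the N columns (samples)."""
    N = X_true.shape[1]
    return np.sqrt(np.sum((X_pred - X_true) ** 2) / N)
```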

Unless otherwise stated, all tests have been conducted using 5000 normalized samples. We add white noise \(N(0,0.01)\) to the sequence of state measurements to simulate the presence of measurement disturbances.

We first contrast the performance achieved by the proposed approach with that obtained by PARC, a state-of-the-art PWA regression method that has been configured to perform up to 25 iterations and consider 20 clusters. To take into account the nonconvex nature of both learning problems, each test has been repeated for 10 different initial conditions of the optimizer.

For reference, solving the training problem in 15 required less than 3 minutes on a laptop equipped with an Intel Core i7-1165G7 processor.

To compare the two approaches on learning hybrid systems, we consider the following benchmarks:

A system described by the following equation: \[\Sigma_{\mathrm{PWA}}: x_{k+1}= A x_k + B u_k + W_A~\mathtt{clip}_{[0,2]}(W_B x_k),\] where \(A\), \(W_A\), \(W_B \in \mathbb{R}^{4\times4}\), \(B \in \mathbb{R}^{4\times2}\), and \(\mathtt{clip}(\cdot)\) is a mapping that saturates its argument elementwise to the range indicated by the subscript. The entries of \(A\), \(B\) have been drawn from the uniform distributions \(U(0,1)\) and \(U(0,1/3)\), respectively, while the entries of \(W_A\), \(W_B\) follow a normal distribution \(N(0,1/2)\);

The problem adopted in [54], which we simulate from a random initial condition using an excitation signal with the features described in [54]. We will refer to this system, which has also been used as benchmark in [3], [55], as \(\Sigma_{\mathrm{B-PWA}}\);

The queuing network described in [56], in which we assume the second and third components of the vector \(\mu\) are each restricted to the interval \([10,100]\), while its first component remains set to \(30\) and the vector \(s\) set to \([1000,11,11]\). The system is excited using a random white sequence drawn from a uniform distribution defined on the range above. The resulting ODEs are integrated using the LSODA suite [57] via Scipy. The sampling time is set to 5 seconds. We will refer to this system as \(\Sigma_{\mathrm{QN}}\).
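As an illustration, the first benchmark \(\Sigma_{\mathrm{PWA}}\) can be sketched and sampled as follows. This is an assumption-laden sketch: the seed and excitation signal are arbitrary, \(N(0,1/2)\) is read as standard deviation \(1/2\), and one-step \((x,u)\) pairs are drawn at random rather than collected along a single long trajectory (whose norm could grow unboundedly, since the spectral radius of \(A\) is not controlled).

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed
n, m = 4, 2
A   = rng.uniform(0.0, 1.0, size=(n, n))
B   = rng.uniform(0.0, 1/3, size=(n, m))
W_A = rng.normal(0.0, 0.5, size=(n, n))    # N(0, 1/2) read as std = 1/2
W_B = rng.normal(0.0, 0.5, size=(n, n))

def step(x, u):
    """One step of Sigma_PWA: a linear part plus a saturated term, which
    makes the overall dynamics continuous piecewise affine in (x, u)."""
    return A @ x + B @ u + W_A @ np.clip(W_B @ x, 0.0, 2.0)

# Collect (state-input)-successor-state samples {(x, u, x^+)} for training.
data = []
for _ in range(5000):
    x = rng.uniform(-1.0, 1.0, size=n)     # illustrative sampling of Omega
    u = rng.uniform(-1.0, 1.0, size=m)
    data.append((x, u, step(x, u)))
```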

The numerical results for the examples above are reported in Table 1, which shows that our methodology performs well, achieving performance comparable with PARC and, in turn, with the current state of the art.

| (PARC in parentheses) | median BFR | best BFR | median RMS | best RMS |
|---|---|---|---|---|
| \(\Sigma_{\mathrm{PWA}}\) | 0.952 (0.936) | 0.959 (0.943) | 0.0484 (0.0641) | 0.0422 (0.057) |
| \(\Sigma_{\mathrm{B-PWA}}\) | 0.88 (0.852) | 0.932 (0.855) | 0.084 (0.109) | 0.05 (0.08) |
| \(\Sigma_{\mathrm{QN}}\) | 0.838 (0.81) | 0.851 (0.829) | 0.192 (0.224) | 0.175 (0.204) |

We now compare the performance of our NN-based approach and PARC on the following nonlinear or parameter-varying system models:

The two-tank system available with the MATLAB system identification toolbox [58]. As the dataset contains input-output data only, we define the state as \(x_t=[y_{t-2} \; y_{t-1} \; y_{t}]^\top\). For this example, we use 2000 samples for training and 1000 for validation without adding any white noise to the measurements. We will refer to this system as \(\Sigma_{\mathrm{Tank}}\);

The linear parameter-varying dynamics in [54], excited according to the related discussion in that paper. We will refer to this system, which has also been used as benchmark in [3], [55], as \(\Sigma_{\mathrm{B-LPV}}\);

The tank system “\(\Sigma_2\)” described in [14], which exhibits a strong nonlinear input-output behaviour. For this example, the noise treatment and system excitation have been performed as in [14].

The numerical results for the examples above are reported in Table 2, where we can see that, on the \(\Sigma_{\mathrm{B-LPV}}\) benchmark, our method greatly outperforms PARC. In addition, we note that the performance obtained on \(\Sigma_{\mathrm{Tank}}\) is comparable with that in [14], with our methodology featuring far fewer parameters and requiring a much shorter training time than the NN in [14]. Finally, the results on \(\Sigma_2\) are also very competitive with respect to those in [14], especially in view of the larger dataset employed in the latter paper (20000 versus our 5000 samples).

| (PARC in parentheses) | median BFR | best BFR | median RMS | best RMS |
|---|---|---|---|---|
| \(\Sigma_{\mathrm{Tank}}\) | 0.901 (0.9) | 0.931 (0.91) | 0.107 (0.109) | 0.07 (0.09) |
| \(\Sigma_{\mathrm{B-LPV}}\) | 0.696 (0.511) | 0.731 (0.528) | 0.286 (0.46) | 0.254 (0.44) |
| \(\Sigma_{2}\) | 0.924 (0.921) | 0.947 (0.931) | 0.061 (0.063) | 0.042 (0.055) |

We now investigate the sensitivity of our NN-based approach with respect to the main hyperparameters characterizing our technique, i.e., \(n_r^\alpha\) and \(n_r^\beta\). Although the diagonal Hessian matrices \(Q_\alpha\), \(Q_\beta\) represent further possible hyperparameters to tune, we note that, according to both the discussion in [35], [41] and our own observations, the performance of our method is not particularly affected by their values. We therefore omit the related sensitivity analysis for these terms.

We report in Table 3 the median and best BFR achieved on \(\Sigma_2\) and \(\Sigma_{\mathrm{B-PWA}}\) for different values of \(n_r^\alpha = n_r^\beta\). In general, we observe that, while a higher \(n_r^\alpha\) is beneficial to the best-case performance, it also makes the training problem more difficult. This is clearly shown by the median BFR obtained, especially when dealing with \(\Sigma_2\). From our numerical experience, the difficulty of training models featuring higher values of \(n_r^\alpha\) (and therefore capable of representing richer continuous PWA mappings) can often be overcome by making use of larger training datasets. This behaviour is illustrated in Fig. 3, where we report a scatter plot relating the logarithm of the ratio between the number of available samples and \(n_r^\alpha\) to the median BFR achieved on \(\Sigma_{\mathrm{B-PWA}}\). These findings also suggest that, in practical settings in which the training data are fixed in advance, the problem of choosing appropriate values for \(n_r^\alpha\) and \(n_r^\beta\) may become central. To this end, the application of techniques such as the group LASSO [59] or the approach in [60] is currently under investigation.

| | \(\Sigma_{\mathrm{2}}\) median | \(\Sigma_{\mathrm{2}}\) best | \(\Sigma_{\mathrm{B-PWA}}\) median | \(\Sigma_{\mathrm{B-PWA}}\) best |
|---|---|---|---|---|
| \(n_r^\alpha=2\) | 0.915 | 0.943 | 0.863 | 0.863 |
| \(n_r^\alpha=3\) | 0.897 | 0.925 | 0.889 | 0.913 |
| \(n_r^\alpha=5\) | 0.904 | 0.932 | 0.915 | 0.924 |
| \(n_r^\alpha=7\) | 0.883 | 0.932 | 0.924 | 0.947 |
| \(n_r^\alpha=11\) | 0.845 | 0.936 | 0.939 | 0.956 |
| \(n_r^\alpha=17\) | 0.875 | 0.910 | 0.946 | 0.952 |

The computational advantages of employing a model as in [eq:alfa95beta] for optimal control purposes have already been analyzed in [35]. Here, we have shown that, thanks to the strong stationarity conditions offered by Lemma 1, the resulting OCP can be solved through standard NLP solvers such as, e.g., IPOPT [61], outperforming classical mixed-integer programming approaches to hybrid system optimal control.

Nevertheless, the reader might wonder if the proposed NN model can be used in an OCP scheme in conjunction with a sensitivity-based NLP solver. This is especially interesting when working with systems, such as those introduced in §6.1.2, that do not exhibit logic and dynamics at the same time. To this end, we design a simple OCP for \(\Sigma_2\), mimicking the setup of [14], with prediction horizon \(T = 7\). We solve, at each step, the following optimization problem: \[\left. \begin{align} &\underset{u_1,\dots,u_7}{\textrm{min}} & & \sum_{i=1}^{7}(\hat{x}^{+}_{i,2}-r_{i+1})^2 + 0.01~ u_i^2 + 0.001~\delta_i^2\\ &~~\textrm{ s.t. } & & \hat{x}^{+}_i=\alpha_\theta(x_i,u_i)-\beta_\theta(x_i,u_i),~i=1,\dots,7,\\ &&& \delta_i =u_i-u_{i-1},~i=1,\dots,7, \\ &&& 0.95 \leq u_i \leq 1.2,~i=1,\dots,7, \end{align} \right.\] where \(\hat{x}^{+}_{i,2}\) denotes the second component of \(\hat{x}^{+}_{i}\).
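A single-shooting sketch of this receding-horizon problem is given below, with the learned model abstracted as a generic callable `model(x, u)` returning the successor state. All names, the initial guess, and the toy stand-in dynamics in the usage note are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

T = 7  # prediction horizon

def solve_ocp(model, x0, r, u_prev, lam_u=0.01, lam_d=0.001):
    """Solve the OCP above by single shooting: the learned one-step model
    `model(x, u) -> x_next` is rolled out inside the cost, so only the
    input sequence u_1, ..., u_T remains as decision variable, while SLSQP
    enforces the box constraints 0.95 <= u_i <= 1.2."""
    def cost(u):
        x, prev, J = x0, u_prev, 0.0
        for i in range(T):
            x = model(x, u[i])                    # \hat{x}^+_i
            d = u[i] - prev                       # input increment delta_i
            J += (x[1] - r[i])**2 + lam_u * u[i]**2 + lam_d * d**2
            prev = u[i]
        return J
    res = minimize(cost, np.full(T, 1.0), method="SLSQP",
                   bounds=[(0.95, 1.2)] * T)
    return res.x  # apply res.x[0], then re-solve at the next time step
```

For instance, with a hypothetical one-dimensional-in-the-tracked-state model `model = lambda x, u: np.array([x[0], 0.7 * x[1] + u])` and a constant reference, `solve_ocp(model, np.zeros(2), np.full(7, 1.0), 1.0)` returns a length-7 input sequence respecting the bounds.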

We then compare the closed-loop trajectory achieved when using the SLSQP solver with the one obtained with the global optimization solver `DIRECT_L` [62]. The results are shown in Fig. 4 where, although the selected input sequences are slightly different, the closed-loop trajectories achieved by the two controllers are essentially indistinguishable. This suggests that derivative-based solvers may also be capable of exploiting the proposed NN-based models. We note that, on our reference hardware, the whole closed-loop simulation with the SLSQP-based controller required approximately one millisecond per time step. Finally, we note that this test highlights the robustness of the surrogate NN model with respect to covariate shift phenomena [63].

We have proposed a NN-based methodology for system identification for control that only requires, as its identification step, the training of a NN via standard tools, and yields a hybrid system model suited for optimal control design. Specifically, we have employed a NN with a specific, yet simple, architecture, which turns out to be end-to-end trainable and produces a hybrid system with PWA dynamics from the available data. Extensive numerical simulations have illustrated that, as a NN-based identification procedure, our technique achieves performance very similar to state-of-the-art hybrid system identification methods. At the same time, we have shown that, under a careful choice of some weights of the NN, the resulting hybrid dynamics can be controlled (locally) optimally by simply solving the KKT system associated with the underlying finite horizon OCP. This is computationally advantageous compared to traditional approaches to the optimal control of hybrid systems, which usually require mixed-integer programming.

[1]

A. M. Annaswamy, K. H. Johansson, and G. J. Pappas. Control for societal-scale challenges: Road map 2030, 2023. Available:
https://ieeecss.org/control-societal-scale-challenges-roadmap-2030.

[2]

G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Kernel methods in system identification, machine learning and function estimation: A survey.
*Automatica*, 50(3):657–682, 2014.

[3]

V. Breschi, D. Piga, and A. Bemporad. Piecewise affine regression via recursive multiple least squares and multicategory discrimination. *Automatica*, 73:155–162, 2016.

[4]

J. Schoukens and L. Ljung. Nonlinear system identification: A user-oriented road map. *IEEE Control Systems Magazine*, 39(6):28–99, 2019.

[5]

A. Bemporad. A piecewise linear regression and classification algorithm with application to learning and model predictive control of hybrid systems. *IEEE Transactions on Automatic
Control*, 68(6):3194–3209, 2023.

[6]

I. Goodfellow, Y. Bengio, and A. Courville. *Deep Learning*. MIT Press, 2016.

[7]

P. J. Werbos. Neural networks for control and system identification. In *Proceedings of the 28th IEEE Conference on Decision and Control*, pages 260–265, 1989.

[8]

K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. *Neural Networks*, 2(5):359–366, 1989.

[9]

M. Forgione and D. Piga. Continuous-time system identification with neural networks: Model structures and fitting criteria. *European Journal of Control*, 59:69–81,
2021.

[10]

M. Forgione, A. Muni, D. Piga, and M. Gallieri. On the adaptation of recurrent neural networks for system identification. *Automatica*, 155:111092, 2023.

[11]

M. Forgione and D. Piga. Model structures and fitting criteria for system identification with neural networks. In *2020 IEEE 14th International Conference on Application of
Information and Communication Technologies (AICT)*, pages 1–6, 2020.

[12]

C. Andersson, A. H. Ribeiro, K. Tiels, N. Wahlström, and T. B. Schön. Deep convolutional networks in system identification. In *2019 IEEE 58th Conference on Decision and Control
(CDC)*, pages 3670–3676, 2019.

[13]

B. Mavkov, M. Forgione, and D. Piga. Integrated neural networks for nonlinear continuous-time system identification. *IEEE Control Systems Letters*, 4(4):851–856, 2020.

[14]

D. Masti and A. Bemporad. Learning nonlinear state–space models using autoencoders. *Automatica*, 129:109666, 2021.

[15]

J. H. Hoekstra, B. Cseppento, G. I. Beintema, M. Schoukens, Z. Kollár, and R. Tóth. Computationally efficient predictive control based on ANN
state-space models. In *2023 62nd IEEE Conference on Decision and Control (CDC)*, pages 6336–6341. IEEE, 2023.

[16]

G. P. Liu and V. Kadirkamanathan. Predictive control for non-linear systems using neural networks. *International Journal of Control*, 71(6):1119–1132, 1998.

[17]

M. T. Hagan, H. B. Demuth, and M. Beale. *Neural network design*. PWS Publishing Co., 1997.

[18]

S. Paoletti, A. L. Juloski, G. Ferrari-Trecate, and R. Vidal. Identification of hybrid systems: A tutorial. *European Journal of Control*, 13(2):242–260, 2007.

[19]

A. Garulli, S. Paoletti, and A. Vicino. A survey on switched and piecewise affine system identification. *IFAC Proceedings Volumes*, 45(16):344–355, 2012. 16th IFAC Symposium on
System Identification.

[20]

J. Roll, A. Bemporad, and L. Ljung. Identification of piecewise affine systems via mixed-integer programming. *Automatica*, 40(1):37–50, 2004.

[21]

F. Lauer. On the complexity of piecewise affine system identification. *Automatica*, 62:148–153, 2015.

[22]

R. Batruni. A multilayer neural network with piecewise-linear structure and back-propagation learning. *IEEE Transactions on Neural Networks*, 2(3):395–403, 1991.

[23]

L. O. Chua and A.-C. Deng. Canonical piecewise-linear representation. *IEEE Transactions on Circuits and Systems*, 35(1):101–111, 1988.

[24]

C.-H. Choi and J. Y. Choi. Constructive neural networks with piecewise interpolation capabilities for function approximations. *IEEE Transactions on Neural Networks*,
5(6):936–944, 1994.

[25]

D. R. Hush and B. Horne. Efficient algorithms for function approximation with piecewise linear sigmoidal networks. *IEEE Transactions on Neural Networks*, 9(6):1129–1141,
1998.

[26]

E. F. Gad, A. F. Atiya, S. Shaheen, and A. El-Dessouki. A new algorithm for learning in piecewise-linear neural networks. *Neural Networks*, 13(4):485–505, 2000.

[27]

M. Fält and P. Giselsson. System identification for hybrid systems using neural networks. *arXiv preprint arXiv:1911.12663*, 2019.

[28]

F. de Avila Belbute-Peres, K. Smith, K. Allen, J. Tenenbaum, and J. Z. Kolter. End-to-end differentiable physics for learning and control. *Advances in Neural Information Processing
Systems*, 31, 2018.

[29]

Y. Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. In *International Conference on
Learning Representations*, 2018.

[30]

W. Jin, A. Aydinoglu, M. Halm, and M. Posa. Learning linear complementarity systems. In *Learning for Dynamics and Control Conference*, pages 1137–1149. PMLR, 2022.

[31]

M. Heemels, B. De Schutter, and A. Bemporad. Equivalence of hybrid dynamical models. *Automatica*, 37(7):1085–1091, 2001.

[32]

D. Yang, P. Balaprakash, and S. Leyffer. Modeling design and control problems involving neural network surrogates. *Computational Optimization and Applications*, pages 1–42,
2022.

[33]

F. Fabiani and P. J. Goulart. Reliably-stabilizing piecewise-affine neural network controllers. *IEEE Transactions on Automatic Control*, 68(9):5201–5215, 2023.

[34]

F. Fabiani and P. J. Goulart. Robust stabilization of polytopic systems via fast and reliable neural network-based approximations. *International Journal of Robust and Nonlinear
Control*, 2024. (In press).

[35]

A. B. Hempel, P. J. Goulart, and J. Lygeros. Strong stationarity conditions for optimal control of hybrid systems. *IEEE Transactions on Automatic Control*, 62(9):4512–4526,
2017.

[36]

J. Hall, A. Nurkanović, F. Messerer, and M. Diehl. A sequential convex programming approach to solving quadratic programs and optimal control problems with linear
complementarity constraints. *IEEE Control Systems Letters*, 6:536–541, 2021.

[37]

Z.-Q. Luo, J.-S. Pang, and D. Ralph. *Mathematical programs with equilibrium constraints*. Cambridge University Press, 1996.

[38]

D. Bertsekas, A. Nedic, and A. Ozdaglar. *Convex analysis and optimization*, volume 1. Athena Scientific, 2003.

[39]

B. Amos and J. Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks. In *International Conference on Machine Learning*, pages 136–145. PMLR, 2017.

[40]

B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter. Differentiable MPC for end-to-end planning and control. *Advances in Neural Information Processing
Systems*, 31, 2018.

[41]

A. B. Hempel, P. J. Goulart, and J. Lygeros. Inverse parametric optimization with an application to hybrid system control. *IEEE Transactions on Automatic Control*,
60(4):1064–1069, 2014.

[42]

M. Heemels, M. K. Camlibel, A. J. Van Der Schaft, and J. M. Schumacher. Modelling, well-posedness, and stability of switched electrical networks. In *International Workshop on Hybrid
Systems: Computation and Control*, pages 249–266. Springer, 2003.

[43]

A. Bemporad and M. Morari. Control of systems integrating logic, dynamics, and constraints. *Automatica*, 35(3):407–427, 1999.

[44]

M. Heemels and B. Brogliato. The complementarity class of hybrid dynamical systems. *European Journal of Control*, 9(2-3):322–360, 2003.

[45]

D. Ralph. Mathematical programs with complementarity constraints in traffic and telecommunications networks. *Philosophical Transactions of the Royal Society A: Mathematical, Physical
and Engineering Sciences*, 366(1872):1973–1987, 2008.

[46]

Y. Chen and M. Florian. The nonlinear bilevel programming problem: Formulations, regularity and optimality conditions. *Optimization*, 32(3):193–209, 1995.

[47]

J. Daafouz, M. D. Di Benedetto, V. D. Blondel, and L. Hetel. *Switched and piecewise affine systems*, page 87–138. Cambridge University Press, 2009.

[48]

A. B. Hempel, P. J. Goulart, and J. Lygeros. Every continuous piecewise affine function can be obtained by solving a parametric linear program. In *2013 European Control Conference
(ECC)*, pages 2657–2662. IEEE, 2013.

[49]

A. Magnani and S. P. Boyd. Convex piecewise-linear fitting. *Optimization and Engineering*, 10:1–17, 2009.

[50]

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

[51]

M. Blondel, Q. Berthet, M. Cuturi, R. Frostig, S. Hoyer, F. Llinares-Lopez, F. Pedregosa, and J.-P. Vert. Efficient and modular implicit differentiation. In *Advances in Neural
Information Processing Systems*, volume 35, pages 5230–5242, 2022.

[52]

P. Virtanen and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. *Nature Methods*, 17:261–272, 2020.

[53]

D. Kraft. A software package for sequential quadratic programming. *Forschungsbericht-Deutsche Forschungs-und Versuchsanstalt fur Luft-und Raumfahrt*, 1988.

[54]

V. Breschi, A. Bemporad, and D. Piga. Identification of hybrid and linear parameter varying models via recursive piecewise affine regression and discrimination. In *2016 European
Control Conference (ECC)*, pages 2632–2637. IEEE, 2016.

[55]

M. D. Mejari. *Towards automated data-driven modeling of linear parameter-varying systems*. PhD thesis, IMT School for Advanced Studies Lucca, 2018.

[56]

G. Garbi, E. Incerto, and M. Tribastone. Learning queuing networks by recurrent neural networks. In *Proceedings of the ACM/SPEC International Conference on Performance
Engineering*, pages 56–66, 2020.

[57]

A. C. Hindmarsh. ODEPACK, a systematized collection of ODE solvers. *Scientific computing*, 1983.

[58]

MathWorks Inc. Control of a Two-Tank System. Available: https://it.mathworks.com/help/robust/ug/control-of-a-two-tank-system.html.

[59]

J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. *arXiv preprint arXiv:1001.0736*, 2010.

[60]

V. Breschi and M. Mejari. Shrinkage strategies for structure selection and identification of piecewise affine models. In *2020 59th IEEE Conference on Decision and Control (CDC)*,
pages 1626–1631. IEEE, 2020.

[61]

A. Wächter and L. T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. *Mathematical
Programming*, 106:25–57, 2006.

[62]

J. M. Gablonsky and C. T. Kelley. A locally-biased form of the DIRECT algorithm. *Journal of Global Optimization*, 21:27–37, 2001.

[63]

M. Sugiyama. *Introduction to statistical machine learning*. Morgan Kaufmann, 2015.

*Email addresses:* `{filippo.fabiani, daniele.masti}@imtlucca.it`, `bstellato@princeton.edu`, `paul.goulart@eng.ox.ac.uk`.↩︎

The MPCC in 7 is an OCP with hybrid dynamics, and consequently can be shown to be NP-hard. We therefore do not expect globally optimal solutions to be efficiently computable in general [47], regardless of the method employed.↩︎