October 22, 2025
Efficient energy management is essential for reliable and sustainable microgrid operation amid increasing renewable integration. This paper proposes an imitation learning–based framework to approximate mixed-integer Economic Model Predictive Control (EMPC) for microgrid energy management. The proposed method trains a neural network to imitate expert EMPC control actions from offline trajectories, enabling fast, real-time decision making without solving optimization problems online. To enhance robustness and generalization, the learning process includes noise injection during training to mitigate distribution shift and explicitly incorporates forecast uncertainty in renewable generation and demand. Simulation results demonstrate that the learned policy achieves economic performance comparable to EMPC while only requiring \(10\%\) of the computation time of optimization-based MPC in practice.
Model predictive control ,Imitation learning ,Control and management of energy systems ,Learning methods for optimal control ,Optimal control of hybrid systems
The integration of distributed energy resources—such as photovoltaics, wind turbines, and Energy Storage Systems (ESSs)—has led to the growing deployment of microgrids, which enable more localized generation, storage, and consumption of electricity [1]. Designing an effective Energy Management System (EMS) for microgrids is crucial for ensuring reliability and reducing operational costs under uncertain generation and load profiles. The core of an EMS lies in solving an optimization problem, for which a wide range of methods have been explored, including quadratic programming [2], stochastic programming for explicitly handling uncertainties in generation and demand [3], and meta-heuristic approaches for addressing non-convex or non-smooth problem structures [4]. The interested reader is referred to a comprehensive review by [5] for other relevant methods. Recent trends in this direction have also shifted toward the specific use of Model Predictive Control (MPC) [6]–[8], which has emerged as an appealing framework since it seamlessly integrates system dynamics, forecast information, and operational constraints within an optimization-based control scheme. Typically, Economic MPC (EMPC) has attracted considerable attention [8]–[10], as it extends conventional MPC by directly optimizing an economic performance metric—such as operating cost and profit—rather than focusing solely on trajectory or reference tracking. This formulation makes EMPC well suited for modern microgrids, where achieving economic efficiency and sustainable energy integration are primary objectives.
Recognizing its potential benefits, numerous studies have implemented EMPC for microgrid energy management, demonstrating its practical effectiveness [8]. [9] incorporated controllable loads into the model, allowing load curtailment as part of the energy management strategy. [10] considered a hybrid ESS, resulting in a mixed-logical dynamical model in the EMPC formulation. In general, logic variables provide a compact and effective means to represent the discrete behaviors commonly encountered in microgrids, such as the ON/OFF status of fuel generators, charging/discharging modes of ESS, and purchasing/selling decisions when exchanging electricity with the main grid [7], [8], [10], [11]. Consequently, when MPC is applied to such systems, mixed-integer MPC (MI-MPC) formulations [12] are typically required to handle both continuous and discrete dynamics. However, a major limitation of MI-MPC lies in its substantial computational burden. At each sampling instant, a constrained mixed-integer program must be solved online, which can become prohibitively expensive for large-scale or fast-evolving microgrids [7], [11], [13]. This challenge is further exacerbated by the presence of longer prediction horizons, nonlinear system dynamics, and short sampling intervals, all of which significantly hinder real-time implementation even when using advanced optimization solvers.
Following several seminal works on approximating MPC policies using machine learning [12], [14]–[16], collectively known as learning-based approximate MPC, extensive research has explored the use of different neural networks for approximate MPC [17], [18]. In parallel, some papers have focused on improving sample efficiency through fast data augmentation [19] and providing safety guarantees by adding a projection layer to the network [20]. A key advantage of approximate MPC lies in its ability to enable fast online computation, making it particularly attractive for real-time applications. As a result, it has been successfully applied to domains such as smart buildings [12], [15] and solar trough plants [21].
In the context of microgrid energy management, however, only indirect approximate MPC has so far been used, i.e., machine learning is primarily applied to predict optimized integer or binary variables within MI-MPC formulations [11], [13]. Once these discrete decisions are learned, the remaining problem reduces to a linear or quadratic program, which can be solved very efficiently. Nevertheless, results on direct approximate MI-MPC (i.e., learning the MPC policy directly) remain scarce [12], [22], and such approaches have not yet been applied to microgrid energy management. On the other hand, many approximate MPC methods in the literature used open-loop data uniformly sampled over a grid [14], [16], and recent research has therefore focused on using Imitation Learning (IL) to directly sample from closed-loop trajectories in training [12], [15], [23], aiming to simultaneously improve sample efficiency and optimize closed-loop performance. However, despite the high potential of IL to drastically reduce the computational burden of solving MI-MPC problems in microgrid energy management, its direct application to approximate MI-MPC remains largely unexplored. Moreover, the issue of distribution shift [24], i.e., the mismatch between the state distributions encountered during training and those visited by the learned policy at deployment, has not been adequately addressed in existing applications using MI-MPC [12], [15].
This current work proposes an IL–based framework for approximate MI-MPC tailored to microgrid energy management. The considered microgrid scheduling and operation problem includes explicitly modeling fuel generators, renewable energy sources (RESs), curtailable loads, ESS dynamics, and operational constraints. The core contribution lies in applying IL to directly approximate the EMPC control policy, thereby replacing repeated online optimization with a lightweight, learned controller. To the best of our knowledge, this is the first work to leverage imitation learning for direct approximation of mixed-integer EMPC for microgrid energy management. The main contributions of the current paper is as follows:
A novel framework that directly approximates MI-MPC policies using IL is proposed. This IL paradigm uses special features tailored for microgrid energy management and incorporates a noisy expert to mitigate distribution shift.
Forecast uncertainties in renewable energy generation and load demand are explicitly considered in both data generation for offline training and online deployment of the learned controller.
The proposed approach achieves performance comparable to the optimization-based EMPC while significantly reducing computation time in simulation experiments, demonstrating its potential in real-time management of microgrids.
The effectiveness of the proposed method is demonstrated through simulations comparing the original EMPC, the proposed IL-based approximate EMPC, and a conventional IL-based approximate EMPC. The remainder of the paper is organized as follows. Section 2 introduces the microgrid model, including the logic relations. Section 3 formulates the mixed-integer EMPC problem. The proposed IL-based approximate EMPC is presented in Section 4. Section 5 details the simulation setup and provides comparative results. Finally, Section 6 concludes the paper and outlines directions for future research.
Notations: The set of real and natural numbers are denoted, respectively, by \(\mathbb{R}\) and \(\mathbb{N}\). The set of positive integers is denoted by \(\mathbb{N}_+\), and \(\mathbb{I}_{[a:b]} := \mathbb{N}\cap [a, b]\). The zero column vector of length \(n\) is denoted by \(\mathbf{0}_n\). The operator \(\|\cdot\|\) denotes the \(2\)-norm for vectors. Given \(\mathcal{X}\subset \mathbb{R}^n\) and \(x \in \mathbb{R}^n\), the projection operator \(\text{Proj}_{\mathcal{X}}(\cdot)\) is defined by \(\text{Proj}_{\mathcal{X}}(x):=\min_{x'\in\mathcal{X}}\|x - x'\|\).
This section provides the modeling of a typical grid-connected microgrid, consisting of fuel generators, an ESS, a unified RES, and curtailable loads [8]. The considered model unifies the ones proposed by [8] and [7] with slight modifications, and discrete-time formulations are considered universally with the time index denoted by \(t\). A visualization of the considered microgrid configuration is given in Fig. 1.
Consider \(N_{\mathrm{fg}}\) independent fuel generators. For the \(i\)-th generator (\(i \in \mathbb{I}_{[1:N_{\mathrm{fg}}]}\)), its operational cost is given by \[\begin{gather} \label{eq:cost95generator} C_{\mathrm{fg},i}(t) = \theta_{1,i}P_{\mathrm{fg},i}(t) + \theta_{2,i}P_{\mathrm{fg},i}^2(t) + O_{\mathrm{fg,i}}\delta_{\mathrm{fg},i}(t) \\ + S_{\mathrm{fg},i}^{\mathrm{[on]}}\delta_{\mathrm{fg},i}(t)(1 - \delta_{\mathrm{fg},i}(t-1)) + S_{\mathrm{fg},i}^{\mathrm{[off]}}\delta_{\mathrm{fg},i}(t-1)(1 - \delta_{\mathrm{fg},i}(t)), \end{gather}\tag{1}\] where the binary variable \(\delta_{\mathrm{fg},i}(t) \in \{0, 1\}\) represents the OFF (\(0\))/ON (\(1\)) mode of the generator and \(P_{\mathrm{fg},i}(t) \geq 0\) is the power generation. In 1 , the polynomial \(\theta_{1,i}P_{\mathrm{fg},i}(t) + \theta_{2,i}P_{\mathrm{fg},i}^2(t)\) is the fuel consumption cost, \(O_{\mathrm{fg,i}}\delta_{\mathrm{fg},i}(t)\) is the operational cost, and \(S_{\mathrm{fg},i}^{\mathrm{[on]}}\) and \(S_{\mathrm{fg},i}^{\mathrm{[off]}}\) are the switch cost when starting up and shutting down the \(i\)-th generator, respectively. Besides, the following logic relation should be satisfied: \[\label{eq:logic95generator} \begin{align} & \delta_{\mathrm{fg},i}(t) = 0 \iff P_{\mathrm{fg},i}(t) = 0, \\ & \delta_{\mathrm{fg},i}(t) = 1 \iff P_{\mathrm{fg},i}(t) > 0. \end{align}\tag{2}\] No dynamic behavior is considered for the fuel generators.
For the ESS modeling, its State of Charge (SoC), denoted by \(x_{\mathrm{ess}}(t)\), is governed by the following dynamics [7]: \[\label{eq:dynamics95ess} x_{\mathrm{ess}}(t+1) - x_{\mathrm{ess}}(t)= \begin{cases} T_{\mathrm{s}}\eta_{\mathrm{c}}P_{\mathrm{ess}}(t) - T_{\mathrm{s}}x_{\mathrm{dg}}& P_{\mathrm{ess}}(t) \geq 0 \\ T_{\mathrm{s}}\eta_{\mathrm{d}}^{-1}P_{\mathrm{ess}}(t) - T_{\mathrm{s}}x_{\mathrm{dg}}& P_{\mathrm{ess}}(t) < 0 \end{cases},\tag{3}\] where \(P_{\mathrm{ess}}(t) \in \mathbb{R}\) is the power flow, \(T_{\mathrm{s}}\) is the sampling period, \(x_{\mathrm{dg}}\) is the constant energy degradation [8], and \(\eta_{\mathrm{c}}\) and \(\eta_{\mathrm{d}}\) are the charging and discharging coefficients, respectively. A binary variable \(\delta_{\mathrm{ess}}(t) \in \{0,1\}\) is used to characterize the discharging (\(0\))/charging (\(1\)) mode, and additional logic relations involving \(\delta_{\mathrm{ess}}(t)\) and \(P_{\mathrm{ess}}(t)\) are given by \[\label{eq:logic95ess} \begin{align} & \delta_{\mathrm{ess}}(t) = 1 \iff P_{\mathrm{ess}}(t) \geq 0, \\ & \delta_{\mathrm{ess}}(t) = 0 \iff P_{\mathrm{ess}}(t) < 0. \end{align}\tag{4}\] The operational cost related to the ESS is given by \[\label{eq:cost95ess} C_{\mathrm{ess}}(t) = O_{\mathrm{ess}}|P_{\mathrm{ess}}(t)|,\tag{5}\] where \(O_{\mathrm{ess}}\) is the operational cost coefficient of the ESS.
Following [8], we consider a unified RES that aggregates multiple generation technologies, such as wind turbines and solar panels. The total generated power of the unified RES is denoted by \(P_{\mathrm{res}}(t)\), which is constrained as \[\label{eq:cons95res} P_{\mathrm{res}}^- \leq P_{\mathrm{res}}(t) \leq P_{\mathrm{res}}^+.\tag{6}\] Although physically connected to the microgrid, the output power of the RES is uncontrollable and is therefore treated as an exogenous input. Moreover, its internal dynamics and operational costs are not taken into account for the microgrid optimization.
In this paper, all loads are treated as a single aggregate that consumes power \(P_{\mathrm{load}}(t)\). In addition, partial curtailment of the loads is allowed, while respecting the limits of users’ tolerance for discomfort [8]. The curtailed load percentage is denoted by \(\beta(t) \in [0,1]\), which is one of the controllable inputs. The cost by supplying the loads is then given by \[\label{eq:cost95loads} C_{\mathrm{load}}(t) = \rho\beta(t)P_{\mathrm{load}}(t),\tag{7}\] where \(\rho\) is the penalty weight on curtailments. Similar to \(P_{\mathrm{res}}(t)\), the load power \(P_{\mathrm{load}}(t)\) is bounded as \[\label{eq:cons95load} P_{\mathrm{load}}^- \leq P_{\mathrm{load}}(t) \leq P_{\mathrm{load}}^+,\tag{8}\] and it is also an exogenous input to the system that cannot be managed by the microgrid itself.
The microgrid can trade electricity with the main grid to meet load demand efficiently and to enhance the overall economic benefit of operation. This exchanged power is denoted by \(P_{\mathrm{exg}}(t) \in \mathbb{R}\) and its associated binary variable \(\delta_{\mathrm{exg}}(t) \in \{0,1\}\) describes the selling (\(0\))/purchasing (\(1\)) decision. The cost of power exchange is given by \[\label{eq:cost95exchange} C_{\mathrm{exg}}(t) = \begin{cases} c_{\mathrm{p}}P_{\mathrm{exg}}(t) & P_{\mathrm{exg}}(t) \geq 0 \\ c_{\mathrm{s}}P_{\mathrm{exg}}(t) & P_{\mathrm{exg}}(t) < 0 \end{cases},\tag{9}\] where \(c_{\mathrm{p}}\) and \(c_{\mathrm{s}}\) are the constant2purchasing and selling prices, respectively. Besides, \(\delta_{\mathrm{exg}}(t)\) and \(P_{\mathrm{exg}}(t)\) satisfy \[\label{eq:logic95exchange} \begin{align} & \delta_{\mathrm{exg}}(t) = 1 \iff P_{\mathrm{exg}}(t) \geq 0, \\ & \delta_{\mathrm{exg}}(t) = 0 \iff P_{\mathrm{exg}}(t) < 0. \end{align}\tag{10}\] Finally, the powers of all the units in the microgrid must satisfy the following balance equation: \[\begin{gather} \label{eq:balance} \sum^{N_{\mathrm{fg}}}_{i=1}P_{\mathrm{fg},i}(t) + P_{\mathrm{res}}(t) + P_{\mathrm{exg}}(t) =\\ P_{\mathrm{ess}}(t) + (1 - \beta(t))P_{\mathrm{load}}(t), \end{gather}\tag{11}\] where \((1 - \beta(t))P_{\mathrm{load}}(t)\) is the consumed power by the loads after curtailment.
The total operational cost of the microgrid is the sum of all independent costs, i.e., \[\label{eq:cost95microgrid} C_{\mathrm{grid}}(t) = \sum^{N_{\mathrm{fg}}}_{i=1}C_{\mathrm{fg},i}(t) + C_{\mathrm{ess}}(t) + C_{\mathrm{load}}(t) + C_{\mathrm{exg}}(t).\tag{12}\] Several variables of the microgrid are subject to operational constraints. First, the discharging/charging power and SoC are both required to stay within a certain range to protect the ESS, i.e., \[\tag{13} \begin{align} \tag{14} & |P_{\mathrm{ess}}(t)| \leq \bar{P}_{\mathrm{ess}}\\ \tag{15} & x_{\mathrm{ess}}^{-} \leq x_{\mathrm{ess}}(t) \leq x_{\mathrm{ess}}^{+}. \end{align}\] Besides, if a fuel generator is ON, its produced power and power variation are both constrained, i.e., \[\tag{16} \begin{align} \tag{17} P_{\mathrm{fg},i}^{-}\delta_{\mathrm{fg},i}(t) \leq P_{\mathrm{fg},i}(t) \leq P_{\mathrm{fg},i}^{+}\delta_{\mathrm{fg},i}(t), \\ \tag{18} |P_{\mathrm{fg},i}(t) - P_{\mathrm{fg},i}(t-1)| \leq \Delta P_{\mathrm{fg},i}\delta_{\mathrm{fg},i}(t). \end{align}\] Moreover, we only consider \(P_{\mathrm{fg},i}^{-} \leq \Delta P_{\mathrm{fg},i}\) such that a fuel generator is always allowed to start up at time step \(t\) when it is OFF at \(t-1\), but not vice versa. The power exchange should not exceed a given threshold, i.e., \[\label{eq:input95constraint95grid} |P_{\mathrm{exg}}(t)| \leq \bar{P}_{\mathrm{exg}},\tag{19}\] and \(\beta^+\) is an upper bound for the curtail percentage, i.e., \[\label{eq:input95constraint95curtail} 0 \leq \beta(t) \leq \beta^{+}.\tag{20}\] The scalars describing the constraints in 6 , 8 , and 13 –20 (i.e., \(P_{\mathrm{res}}^-\), \(P_{\mathrm{res}}^+\), \(P_{\mathrm{load}}^-\), \(P_{\mathrm{load}}^+\), \(\bar{P}_{\mathrm{ess}}\), \(x_{\mathrm{ess}}^-\), \(x_{\mathrm{ess}}^+\), \(P_{\mathrm{fg},i}^-\), \(P_{\mathrm{fg},i}^+\), \(\Delta P_{\mathrm{fg},i}\), \(\bar{P}_{\mathrm{exg}}\), and \(\beta^+\)) are all constant and positive. Besides, the following inequalities should be satisfied: \[\label{eq:viable95balancing} \begin{align} & \sum_{i=1}^{N_{\mathrm{fg}}}P_{\mathrm{fg},i}^+ + \bar{P}_{\mathrm{exg}}\geq \bar{P}_{\mathrm{ess}}+ P_{\mathrm{load}}^+ - P_{\mathrm{res}}^-, \\ & \sum_{i=1}^{N_{\mathrm{fg}}}P_{\mathrm{fg},i}^- - \bar{P}_{\mathrm{exg}}\leq -\bar{P}_{\mathrm{ess}}+ (1 - \beta^+)P_{\mathrm{load}}^- - P_{\mathrm{res}}^+, \end{align}\tag{21}\] which means the manageable power inputs (i.e., \(P_{\mathrm{fg},i}(t), i=1,2,\dots,N_{\mathrm{fg}}\) and \(P_{\mathrm{exg}}(t)\)) can always be regulated to satisfy the power balance equation 11 regardless of the uncontrollable generation \(P_{\mathrm{res}}(t)\) and load \(P_{\mathrm{load}}(t)\).
For optimization, the SoC dynamics 3 , ESS cost 5 , and exchange cost 9 requires a reformulation, respectively, as \[\tag{22} \begin{align} \tag{23} & x_{\mathrm{ess}}(t+1) = x_{\mathrm{ess}}(t) + T_{\mathrm{s}}(\eta_{\mathrm{c}}- \eta_{\mathrm{d}}^{-1})\delta_{\mathrm{ess}}(t)P_{\mathrm{ess}}(t) \notag \\ & +T_{\mathrm{s}}\eta_{\mathrm{d}}^{-1}P_{\mathrm{ess}}(t) - T_{\mathrm{s}}x_{\mathrm{dg}}, \\ \tag{24} & C_{\mathrm{ess}}(t) = O_{\mathrm{ess}}\big(2\delta_{\mathrm{ess}}(t)P_{\mathrm{ess}}(t) - P_{\mathrm{ess}}(t)\big), \\ \tag{25} & C_{\mathrm{exg}}(t) = (c_{\mathrm{p}}- c_{\mathrm{s}})\delta_{\mathrm{exg}}(t)P_{\mathrm{exg}}(t) + c_{\mathrm{s}}P_{\mathrm{exg}}(t). \end{align}\]
Accordingly, \(\delta_{\mathrm{ess}}(t)P_{\mathrm{ess}}(t)\) and \(\delta_{\mathrm{exg}}(t)P_{\mathrm{exg}}(t)\) require auxiliary continuous variables \(z_{\mathrm{ess}}(t) := \delta_{\mathrm{ess}}(t)P_{\mathrm{ess}}(t)\) and \(z_{\mathrm{exg}}(t) := \delta_{\mathrm{exg}}(t)P_{\mathrm{exg}}(t)\) to eliminate bilinearity in the optimization formulation. Moreover, additional logic relations are also needed [26] to characterize \(z_{\mathrm{ess}}(t)\) and \(z_{\mathrm{exg}}(t)\) using linear constraints, i.e., \[\label{eq:cons95auxiliary} \begin{align} & -\delta_{\mathrm{ess}}(t)\bar{P}_{\mathrm{ess}}\leq z_{\mathrm{ess}}(t) \leq \delta_{\mathrm{ess}}(t)\bar{P}_{\mathrm{ess}}, \\ & -\delta_{\mathrm{exg}}(t)\bar{P}_{\mathrm{exg}}\leq z_{\mathrm{exg}}(t) \leq \delta_{\mathrm{exg}}(t)\bar{P}_{\mathrm{exg}}, \\ & P_{\mathrm{ess}}(t) - \bar{P}_{\mathrm{ess}}(1 - \delta_{\mathrm{ess}}(t)) \leq z_{\mathrm{ess}}(t), \\ & z_{\mathrm{ess}}(t) \leq P_{\mathrm{ess}}(t) + \bar{P}_{\mathrm{ess}}(1 - \delta_{\mathrm{ess}}(t)), \\ & P_{\mathrm{exg}}(t) - \bar{P}_{\mathrm{exg}}(1 - \delta_{\mathrm{exg}}(t)) \leq z_{\mathrm{exg}}(t), \\ & z_{\mathrm{exg}}(t) \leq P_{\mathrm{exg}}(t) + \bar{P}_{\mathrm{exg}}(1 - \delta_{\mathrm{exg}}(t)). \end{align}\tag{26}\] Besides, the bilinear term \(\delta_{\mathrm{fg},i}(t)\delta_{\mathrm{fg},i}(t-1)\) in 1 also needs to be reformulated by introducing an auxiliary binary variable \(z_{\mathrm{fg},i}(t) := \delta_{\mathrm{fg},i}(t)\delta_{\mathrm{fg},i}(t-1)\) with the following additional linear constraints: \[\label{eq:cons95abs95auxiliary} \begin{align} & z_{\mathrm{fg},i}(t) \leq \delta_{\mathrm{fg},i}(t), \\ & z_{\mathrm{fg},i}(t) \leq \delta_{\mathrm{fg},i}(t-1), \\ & z_{\mathrm{fg},i}(t) \geq \delta_{\mathrm{fg},i}(t-1) + \delta_{\mathrm{fg},i}(t) - 1. \end{align}\tag{27}\]
EMPC is used to dynamically minimize the operational cost. The state and input of the system are defined, respectively, as \[\tag{28} \begin{align} \tag{29} x(t) &:= x_{\mathrm{ess}}(t), \\ \tag{30} u(t) &:= [\mathbf{P}_{\mathrm{fg}}^\top(t), P_{\mathrm{exg}}(t), \beta(t)]^\top, \end{align}\] where \(\mathbf{P}_{\mathrm{fg}}(t) := [P_{\mathrm{fg},1}(t), P_{\mathrm{fg},2}(t), \dots, P_{\mathrm{fg},N_{\mathrm{fg}}}(t)]^\top\), and the disturbance is defined as \[\label{eq:disturance95definition} w(t) := [P_{\mathrm{res}}(t), P_{\mathrm{load}}(t)]^\top.\tag{31}\] Given a prediction horizon \(T \in \mathbb{N}_+\), the EMPC controller seeks to minimize the \(T\)-step-ahead cumulative cost subject to: the dynamics constraint 23 ; the power balance equation 11 , state and input constraints 13 –20 ; logic constraints 2 , 4 , and 10 ; and other additional constraints 26 and 27 involving the auxiliary variables \(z_{\mathrm{ess}}(t)\), \(z_{\mathrm{exg}}(t)\) and \(z_{\mathrm{fg},i}(t)\). At each time step \(t\), the real SoC is denoted by \(x^{[\mathrm{r}]}_{\mathrm{ess}}(t)\), and the real output and status of the \(i\)-th generator are denoted, respectively, by \(P^{[\mathrm{r}]}_{\mathrm{fg},i}(t)\) and \(\delta^{[\mathrm{r}]}_{\mathrm{fg},i}(t)\). Furthermore, we assume the availability of a load forecaster [27] and a RES forecaster [28], [29], which provide the predicted disturbance \(\hat{w}(t) := [\widehat{P}_{\mathrm{res}}(t), \widehat{P}_{\mathrm{load}}(t)]^\top\) at time step \(t\), where \(\widehat{P}_{\mathrm{res}}(t)\) and \(\widehat{P}_{\mathrm{load}}(t)\) are the predicted RES generation and load consumption, respectively. As such, the EMPC optimization problem can then be formulated as \[\begin{align} \mathrm{P}_{\mathrm{MPC}}(I_{t}(T)): & \min \sum^{t + T-1}_{\tau = t}C_{\mathrm{grid}}(\tau) \end{align}\] \[\begin{align} \text{s.t.} & \; \eqref{eq:state95constraints952},\; \forall \tau \in \mathbb{I}_{[t:t+T]}; \\ & \; \eqref{eq:balance}, \eqref{eq:state95constraints951},\eqref{eq:input95constraint95grid}, \eqref{eq:input95constraint95curtail},\eqref{eq:dynamics95ess95reformulated},\; \forall \tau \in \mathbb{I}_{[t:t+T-1]}; \\ & \; \eqref{eq:logic95generator}, \eqref{eq:logic95ess}, \eqref{eq:logic95exchange}, \eqref{eq:cons95auxiliary}, \; \forall \tau \in \mathbb{I}_{[t:t+T-1]}; \\ & \; \eqref{eq:input95constraints95fg}, \eqref{eq:cons95abs95auxiliary}, \forall i \in \mathbb{I}_{[1:N_{\mathrm{fg}}]}, \tau \in \mathbb{I}_{[t:t+T-1]} \\ & \; x(t) = x^{[\mathrm{r}]}_{\mathrm{ess}}(t); \\ & \; \delta_{\mathrm{fg},i}(t-1) = \delta^{[\mathrm{r}]}_{\mathrm{fg},i}(t-1), \; \forall i \in \mathbb{I}_{[1:N_{\mathrm{fg}}]}; \\ & \; P_{\mathrm{fg},i}(t-1) = P^{[\mathrm{r}]}_{\mathrm{fg},i}(t-1), \; \forall i \in \mathbb{I}_{[1:N_{\mathrm{fg}}]}; \\ & \; w(\tau) = \hat{w}(\tau), \forall \tau \in \mathbb{I}_{[t:t+T-1]}. \end{align}\] The optimization problem \(\mathrm{P}_{\mathrm{MPC}}(I_{t}(T))\) is a mixed-integer quadratic program (MIQP) parameterized by the information tuple \(I_{t}(T)\) defined as \[\label{eq:information} I_{t}(T) := \left(x^{[\mathrm{r}]}_{\mathrm{ess}}(t), \{P^{[\mathrm{r}]}_{\mathrm{fg},i}(t-1)\}^{N_{\mathrm{fg}}}_{i=1}, \{\hat{w}(\tau)\}^{t+T-1}_{\tau=t}\right).\tag{32}\] For simplicity, all generators are considered to be OFF initially, i.e., \(P^{[\mathrm{r}]}_{\mathrm{fg},i}(-1) = 0, \forall i \in \mathbb{I}_{[1:N_{\mathrm{fg}}]}\). Solving \(\mathrm{P}_{\mathrm{MPC}}(I_{t}(T))\) returns \(\{u^\star(\tau;I_{\tau}(T))\}^{t+T-1}_{\tau=t}\), and only \(u^\star(t;I_{t}(T))\) is applied due to the moving-horizon mechanism of EMPC. Moreover, \(u^\star(t;I_{t}(T))\) implicitly defines an EMPC control policy as \(\pi_{\mathrm{MPC}}(I_{t}(T)) := u^\star(\cdot;I_{t}(T))\) [30]. After obtaining \(P^{[\mathrm{r}]}_{\mathrm{res}}(t)\) and \(P^{[\mathrm{r}]}_{\mathrm{load}}(t)\), and applying \(\pi_{\mathrm{MPC}}(I_{t}(T))\), the system state evolves one step forward, and then the EMPC optimization is solved again at \(t+1\). The optimization-based EMPC policy \(\pi_{\mathrm{MPC}}\) is hereafter referred to as the expert EMPC policy.
This section discusses the details of IL-based approximate EMPC for microgrid energy management. In Section 4.1, the motivations for directly approximating MI-MPC policy and using IL are further discussed. Section 4.2 elaborates on the details of the proposed methodology, covering learning paradigm, feature design, data generation, and noise injection used to handle distribution shift.
The EMPC optimization problem \(\mathrm{P}_{\mathrm{MPC}}(I_{t}(T))\) exhibits several notable characteristics. First, the system has only \(N_{\mathrm{fg}}+2\) control inputs, while the number of integer variables required to formulate \(\mathrm{P}_{\mathrm{MPC}}(I_{t}(T))\) equals \(2T(N_{\mathrm{fg}}+1)\), including \(\delta_{\mathrm{fg},i}(\tau)\), \(\delta_{\mathrm{ess}}(\tau)\), \(\delta_{\mathrm{exg}}(\tau)\), and \(z_{\mathrm{fg},i}(\tau)\), for \(\tau \in \mathbb{I}_{t:t+T-1}\). It is obvious that \(2T(N_{\mathrm{fg}}+1)> N_{\mathrm{fg}}+2\), thus the number of outputs of the neural network used to approximate the EMPC policy is smaller when adopting a direct approximate EMPC approach than when using indirect approaches that aim to learn integer variables instead [11], [13], especially for long prediction horizons. Moreover, among the total \(2T(N_{\mathrm{fg}}+1)\) integer variables, \(2(T-1)(N_{\mathrm{fg}}+1)\) (corresponding to a fraction \(1-\frac{1}{T}\) of the total) are directly coupled with the predicted future inputs from \(\tau = t+1\) to \(\tau = t+T-1\), which are of limited interest since the primary objective is to learn the control policy only at the current step \(t\).
Most importantly, the power balance constraint 11 induces a strong coupling between the control inputs \(u(t)\) and the disturbances \(w(t)\). Consequently, the integer variables also become disturbance-dependent, as they are partially determined by the inputs through the logical constraints (see 2 and 10 ). The optimality of the learned integer variables thus relies heavily on the accuracy of the RES and load forecasts. In practice, forecast errors tend to grow with the prediction horizon, especially for renewable generation [28], [29] and load demand [27] in microgrids. Longer-term forecasts are therefore more uncertain, which can result in higher operational costs and increased variability in system performance. As a result, indirect methods, which heavily rely on the predicted future disturbances to determine integer variables, are more sensitive to these errors and less robust. In contrast, direct methods, which only determine the current control action, are inherently less affected by such inaccuracies since future disturbances have exponential-decaying impact on the current input [31], [32]. In short, for microgrid energy management, direct approximate EMPC constitutes a more parsimonious and robust approach, as it reduces output dimensionality and limits the influence of forecast uncertainty on policy optimality.
On the other hand, the SoC of the ESS is the only state variable. However, due to the presence of the generator switching costs 1 , the input constraint 18 , and the influence of exogenous disturbances, the number of parameters required to characterize the parametric EMPC policy \(\pi_{\mathrm{MPC}}\) amounts to \(1 +N_{\mathrm{fg}}+2T\). This high dimensionality makes the conventional approximate MPC with grid-based sampling [14], [16] computationally infeasible, particularly for long prediction horizons or systems with many fuel generators. Consequently, sampling state-input data from closed-loop trajectories provides a more scalable alternative [12], [15], which falls within the broader class of IL-based approximate MPC methods [23].
Given that the inputs 30 are continuous-valued, approximating \(\pi_{\mathrm{MPC}}\) constitutes a regression problem. Therefore, supervised learning with a standard Multi-Layer Perceptron (MLP) is sufficient for training [12], [23]. In this setting, our imitation learning (IL) approach, owing to its supervised-learning nature, reduces to behavior cloning [24], [33], where a neural network policy is trained to mimic the expert EMPC policy \(\pi_{\mathrm{MPC}}\).
For convenience, define the augmented system state as \(\bar{x}(t) := [x_{\mathrm{ess}}(t), \mathbf{P}_{\mathrm{fg}}^\top(t-1)]^\top\). By substituting 11 into 3 , an explicit function \(f\) can be characterized to describe the evolution of \(\bar{x}(t)\) as3 \[\label{eq:implicit95dynamics} \bar{x}(t+1) = f\big(\bar{x}(t), u(t), w(t)\big)\tag{33}\] with \(\bar{x}(0) = [x_{\mathrm{ess}}(0), \mathbf{0}^\top_{N_{\mathrm{fg}}}]^\top\). The power balance equation 11 indicates that the virtual disturbance \(\hat{w}_{\beta}(t, \beta(t)) := \widehat{P}_{\mathrm{res}}(t) - (1 - \beta(t))\widehat{P}_{\mathrm{load}}(t)\), which depends on \(\beta(t)\), is affecting the state dynamics 3 when solving \(\mathrm{P}_{\mathrm{MPC}}(I_t(T))\). Therefore, an extra feature \(\hat{\mathbf{w}}_{\beta}(t)\) is designed as \[\label{eq:heuristic95feature} \hat{\mathbf{w}}_{\beta}(t) = \{\hat{w}_{\beta}(t, \beta_i)\}^{N_\beta}_{i=1},\tag{34}\] where \(N_\beta\) is the resolution of \(\hat{\mathbf{w}}_{\beta}(t)\), and \(\beta_i := \frac{(i-1)\beta^+}{N_{\beta}-1}\) for \(i \in \mathbb{I}_{[1:N_\beta]}\) with \(\beta^+\) given in 20 . Incorporating \(\hat{\mathbf{w}}_{\beta}(t)\), an augmented information tuple \(\bar{I}_t(T)\) is defined by \[\label{eq:augmented95tuple} \bar{I}_t(T) := \left(I_t(T), \{\hat{\mathbf{w}}_{\beta}(\tau)\}^{t+T_{w}-1}_{\tau = t}\right),\tag{35}\] where \(I_t(T)\) is given in 32 and \(T_{w} \in \mathbb{I}_{[0:T]}\) is the horizon depth of this extra feature. While increasing the resolution and horizon depth can enrich the feature set, it also leads to longer offline training and may necessitate a larger network with additional neurons in the hidden layers.
Given a control horizon \(T_{\mathrm{sim}}\), define a scenario \(S\) as a tuple \[S := \left(\bar{x}(0), \left\{\left(P^{[\mathrm{r}]}_{\mathrm{load}}(t), P^{[\mathrm{r}]}_{\mathrm{res}}(t)\right)\right\}^{T_{\mathrm{sim}}-1}_{t=0}\right).\] For \(N_{\mathrm{sim}}\) simulation scenarios, the training data set \(\mathcal{D}\) is obtained by rolling out 33 \(T_{\mathrm{sim}}\) times for each of the scenarios. However, it is well known that behavior cloning suffers from distribution shift [24], [33], which degrades its performance. Therefore, inspired by [33], we apply a simple off-policy noise injection technique, which can be effective in continuous control and computationally cheaper than on-policy methods (e.g., Dagger [24]) to tackle distribution shift. Specifically, given a convariance matrix \(\Sigma \in \mathbb{R}^{(N_{\mathrm{fg}}+2)\times(N_{\mathrm{fg}}+2)}\), the noisy expert input for the \(j\)-th scenario is given by \[\label{eq:noisy} \tilde{u}_{\mathrm{MPC},t}^{[j]} = \pi_{\mathrm{MPC}}\left(I^{[j]}_t(T)\right) + \epsilon^{[j]}_t,\tag{36}\] where \(\epsilon^{[j]}_t \sim \mathcal{N}(0, \Sigma)\) is Gaussian noise, and \(I^{[j]}_t(T)\) is the information tuple acquired when simulating the \(j\)-th scenario. To satisfy the input constraints 16 , 19 , and 20 , post processing of the noisy input is needed. At time step \(t\), define the input constraint set \(\mathcal{U}(t)\) as \[\label{eq:input95set} \mathcal{U}(t) = \prod^{N_{\mathrm{fg}}}_{i=1}\mathcal{U}_{\text{fg},i}(t) \times [-\bar{P}_{\mathrm{exg}}, \bar{P}_{\mathrm{exg}}] \times [0,\beta^+],\tag{37}\] where \(\mathcal{U}_{\text{fg},i}(t) = [P_{\mathrm{fg},i}^{-},P_{\mathrm{fg},i}^{+}] \cap [P_{\mathrm{fg},i}(t-1)-\Delta P_{\mathrm{fg},i}, P_{\mathrm{fg},i}(t-1)+\Delta P_{\mathrm{fg},i}]\). The final applied input \(\tilde{\nu}_{\mathrm{MPC},t}^{[j]}\) is obtained via \[\label{eq:projection} \tilde{\nu}_{\mathrm{MPC},t}^{[j]} = \text{Proj}_{\mathcal{U}(t)}\left(\tilde{u}_{\mathrm{MPC},t}^{[j]}\right).\tag{38}\] As a result, the closed-loop system is generated by applying \(\tilde{\nu}_{\mathrm{MPC},t}^{[j]}\), and the training data set is given by \[\label{eq:data95set} \mathcal{D}= \left\{\left(\bar{I}^{[j]}_t(T), \tilde{\nu}_{\mathrm{MPC},t}^{[j]}\right)\right\}_{t=0,1,\dots,T_{\mathrm{sim}}, j=1,2,\dots,N_{\mathrm{sim}}},\tag{39}\] and the cardinality of \(\mathcal{D}\) is given by \(|\mathcal{D}| = T_{\mathrm{sim}}N_{\mathrm{sim}}\).
The learned policy using MLP is denoted by \(\pi_{\mathrm{MLP,\theta}}\), which is parameterized by \(\theta\), and it is a mapping from the space of augmented information tuples \(\bar{\mathcal{I}} \subset \mathbb{R}^{1+N_{\mathrm{fg}}+2T+N_{\beta}T_{w}}\) to the input space \(\mathcal{U}\subset \mathbb{R}^{N_{\mathrm{fg}}+2}\). The parameter \(\theta\) includes all the weights and biases of the MLP, whose structure is depicted in Fig. 2.
For the considered regression-based behavior cloning, the mean-squared-error loss is adopted as the learning metric [12], [23], i.e., \[\label{eq:final95loss}\mathcal{L}(\theta; \mathcal{D}) := \sum^{N_{\mathrm{sim}}}_{j=1}\sum^{T_{\mathrm{sim}}-1}_{t=0}\left\| \pi_{\mathrm{MLP,\theta}}\left(\bar{I}^{[j]}_t(T)\right) - \tilde{\nu}_{\mathrm{MPC},t}^{[j]}\right\|^2.\tag{40}\] The training objective is to find the best policy \(\pi_{\mathrm{MLP,\theta^\star}}\) parameterized by \(\theta^\star\) through solving the following optimization problem: \[\label{eq:solving95theta} \theta^\star = \arg\min_{\theta}\mathcal{L}(\theta;\mathcal{D}).\tag{41}\] In practice, this problem 41 is typically non-convex and highly nonlinear, making it infeasible to guarantee convergence to the global optimum. Consequently, stochastic gradient descent or its variants are commonly used to find a local minimum \(\hat{\theta}^\star\), which is then adopted as a suboptimal surrogate in most applications [34]. To guarantee that the learned policy satisfies the input constraints 16 , 19 , and 20 , post processing of the MLP network output is needed. The final applied learning-based approximate EMPC input is given by4\(^,\)5 \[\label{eq:final95policy} u_{\text{MLP},t,\hat{\theta}^\star} = \text{Proj}_{\mathcal{U}(t)}\left(\pi_{\mathrm{MLP,\hat{\theta}^\star}}\left(\bar{I}_t(T)\right)\right),\tag{42}\] where \(\mathcal{U}(t)\) is defined in 37 .
This section presents the performance evaluation of the proposed IL-based approximate EMPC approach applied to a microgrid consisting of photovoltaic panels and wind turbines, along with two fuel generators (i.e., \(N_{\mathrm{fg}}= 2\)). The case study considers a time horizon of \(24~\mathrm{h}\) (i.e., day-ahead scheduling) with a sampling interval of \(T_{\mathrm{s}}=1~\mathrm{h}\), resulting in a total simulation horizon of \(T_{\mathrm{sim}}= 24\). A set of \(50\) distinct disturbance realizations,
representing variations in RES generation and load demand, is employed, as shown in Fig. 3. Four different initial states \(x_{\mathrm{ess}}(0) \in \{44\mathrm{kWh}, 48\mathrm{kWh}, 52\mathrm{kWh},
56\mathrm{kWh}\}\) are considered, resulting in a total of \(N_{\mathrm{sim}}= 200\) scenarios. The implemented MLP network comprises \(3\) hidden layers, each containing \(12\) neurons, and employs the GELU activation function [35]. All simulations are implemented in
Python 3.12.9, using PyTorch for the construction and training of the MLP network, while the MIQP problem of the expert EMPC is solved using Gurobi 12.0.2 [36] with gurobipy.
For each simulation scenario, the predicted disturbance used by EMPC is obtained by adding bounded noise to the true one, thereby approximating the effect of prediction errors. The EMPC controller employs \(T=6\) as prediction horizon, and each scenario is generated over a two-day period accordingly. Although the EMPC assumes constant nominal market prices \(c_{\mathrm{p}}\) and \(c_{\mathrm{s}}\), time-varying prices are used in simulation, as shown in Fig. 4.
For the extra feature, the hyperparameters of our proposed method are chosen as \(N_\beta = 3\) and \(T_w = 6\). Three different controllers are considered for comparison: (a) the expert optimization-based EMPC controller, (b) the proposed IL-based approximate EMPC controller, and (c) the baseline IL-based approximate EMPC controller without extra features and noise injection mechanism. For a typical scenario, the control inputs are shown in Fig. 5. It can be observed that the input of our proposed IL-based approximate EMPC method matches the expert EMPC more closely than the basic IL-based method does.
To demonstrate the superiority of our approach, we also compare the closed-loop economic metric defined by \[\label{eq:final95metric95economic} J_{\text{eco}} = \sum^{T_{\mathrm{sim}}-1}_{t=0}C_{\mathrm{grid}}(t),\tag{43}\] as well as the computation-time metric defined by \[\label{eq:final95metric95time} J_{\text{time}} = \sum^{T_{\mathrm{sim}}-1}_{t=0}\Delta(t),\tag{44}\] where \(\Delta(t)\) is the time used to obtain the control input at time step \(t\). For the expert EMPC controller, \(\Delta(t)\) accounts only for the solving time and does not include the time required to construct and formulate the optimization problem. The comparison results for all \(200\) scenarios are given in Fig. 6. It can be observed that both the basic IL method and the proposed IL-based approach achieve comparable economic performance to the expert EMPC controller. Moreover, the proposed IL-based method outperforms the basic IL in this regard. In terms of computation time, the approximate EMPC implemented with an MLP network as the control input generator offers significant advantages, requiring only approximate \(10\%\) of the computation time of the standard EMPC.
This paper proposes a IL–based framework for approximate EMPC for microgrid energy management, which outperforms a basic IL approach in terms achieving near-optimal economic performance with novel designed features and noise injection mechanism to tackle distribution shift. In simulation experiments, the learned approximate EMPC controller reduces the required online computation time to only 10% of that of the optimization-based EMPC. Future work will focus on extending the approach to networked microgrids, incorporating parametric model mismatch as part of the learning features, and integrating more sophisticated uncertainty models for renewable generation and demand.
This paper is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 101018826 - CLariNet) and U.S. NSF ECCS #2234032. Corresponding author: Changrui Liu↩︎
In the MPC formulation, constant nominal prices are used for prediction and optimization within the control horizon, following a common simplification in microgrid energy management when prices remain stable over short time spans [25]. The true market prices, however, may fluctuate around these nominal values in real operation.↩︎
Deriving \(f\) is not necessary since \(f\) is only used to describe the closed-loop system evolution compactly. Hence, solving \(\mathrm{P}_{\mathrm{MPC}}(I_{t}(T))\) does not require knowing the closed form of \(f\) explicitly.↩︎
Since \(\mathcal{U}(t)\) is a hyperrectangle, the projection in 42 reduces to simple element-wise clipping of each of the inputs to their respective bounds, which is computationally trivial.↩︎
The projection step enforces input constraints only; satisfaction of the state constraints in 13 is not guaranteed and will be investigated in future work, possibly through control barrier functions.↩︎