Long-Term Mapping of the Douro River Plume
with Multi-Agent Reinforcement Learning
October 03, 2025
We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river as a representative use case. We propose an energy- and communication-efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with larger fleets improving both mean squared error (MSE) and operational endurance. In some instances, doubling the number of AUVs more than doubles endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize to unseen seasonal regimes across different months and years, demonstrating promise for future data-driven long-term monitoring of dynamic plume environments.
Monitoring dynamic coastal environments in real time is a persistent challenge in both environmental science and robotics [1], [2]. Coastal waters evolve under the interplay of currents, winds, tides, and river discharge, producing transient patterns that are difficult to capture with conventional methods. A prominent example of this dynamism is a river plume: a buoyant outflow of freshwater that extends into the ocean. River plumes, which are characterized by steep salinity gradients, are integral to the mixing processes that affect fisheries, water quality, and the spread of pollutants in coastal areas [3], [4]. However, given their large extent in the ocean (hundreds of square kilometers) and rapid variability, tracking and mapping river plumes via fixed sensors or manned surveys is usually impractical [5].
Autonomous underwater vehicles (AUVs) offer a promising alternative. Indeed, AUVs can adapt their trajectories to the evolving conditions of aquatic environments, collect measurements across large spatial domains, and operate continuously over extended periods. Yet using AUVs for long-term plume monitoring introduces several challenges: (i) the salinity field that characterizes the plume, a scalar function of space and time, evolves on the same timescale as vehicle motion, meaning that the process shifts significantly while measurements are being collected; (ii) ocean currents in the plume area can dramatically hinder the AUVs' mobility; (iii) endurance is constrained by onboard energy reserves, resulting in a trade-off between coverage and longevity; and (iv) inter-agent communication in ocean plume waters is severely constrained by horizontal and vertical density variations, so AUVs usually need to dwell at the sea surface to transmit information [6]. Hence, coordinating multiple AUVs to map a river plume presents unique challenges, which we aim to address in this work.
The Douro River plume, where freshwater from the river with the greatest discharge on Portugal’s northwest coast enters the Atlantic, exemplifies these challenges and motivates this work. Its shape, extent, and orientation vary widely with river discharge, wind forcing, and tidal cycles. During high-flow periods, strong freshwater outflow drives the plume offshore, where it can be advected for tens of kilometers along the coast [7]. More than 500,000 residents of Porto and Vila Nova de Gaia live along the Douro banks and draw economic value from the estuary. Hence, timely and accurate salinity maps, which capture the state of the plume, are highly valuable for informed management [8], [9].
Contributions. In this paper, we propose and evaluate a cooperative data-driven multi-agent control framework for long-duration (multiple days), energy-aware mapping of the Douro river plume. In our proposed system architecture, a server remotely coordinates a fleet of multiple light AUVs (LAUVs) [5], handling computationally heavy decisions and requiring only minimal, intermittent radio communication with the vehicles, which devote energy to low-level navigation and sensing. Within this setting, we propose an algorithm that combines a simple and reliable non-parametric Gaussian process regression (GPR) estimator with a reinforcement learning (RL) decision-making module based on a deep Q-network (DQN). Since most of the energy consumption of the considered LAUVs comes from the vehicle propulsion, we introduce a multi-head DQN architecture that decouples per-agent direction and speed decisions, paired with a reward function that allows us to control the tradeoff between estimation accuracy and energy efficiency.
We evaluate our approach through a simulation study based on a numerical model of the Douro estuary implemented with the open-source Delft3D ocean model [10], [11], which combines real coastal geometry and historical environmental data of the study area to generate high-fidelity, realistic salinity and flow fields under distinct wind, tidal, and discharge regimes [11]. Our results show that the proposed approach (i) effectively maps the plume and generalizes to unseen conditions across different months and years; (ii) consistently outperforms baselines and benchmarks in both single- and multi-agent settings; and (iii) effectively leverages multi-agent collaboration in terms of both estimation accuracy and fleet endurance, with scaling from 3 to 6 vehicles, for example, more than doubling mission lifetime.
Related work. Early coastal-robotics surveys rely on pre-planned coverage (lawn-mower transects, yo-yo depth cycling) [12], approaches that are still common operationally [13] but prone to oversampling and poor adaptability. Classic adaptive approaches frame the problem as informative path-planning (IPP) with Gaussian process surrogates [14], selecting waypoints that maximize posterior variance reduction or mutual information under travel constraints. Longer horizons have been pursued via rapidly exploring random tree embeddings [15] and non-myopic combinatorial methods such as GP-aware MILP and submodular branch-and-bound [16], yet these approaches generally rely on fixed graph discretizations, offer limited online re-planning, and scale poorly. Within plume monitoring, recent works focus on classification-oriented objectives such as expected integrated Bernoulli variance (EIBV) minimization, which prioritize sampling along uncertain river-ocean interface zones [17], [18]. While relevant, these methods are single-vehicle, myopic (single-step look-ahead), energy-unaware, and limited to short missions (\(\sim\)4 h). Multi-AUV studies, on the other hand, remain largely heuristic and oriented toward short-term front following rather than long-term full mapping [5]. Model-free RL offers an alternative to traditional IPP and plume monitoring methods. Convolutional neural network (CNN)-based DQN and multi-head DQN approaches have surpassed lawn-mower baselines for multi-agent lake monitoring [19], though only on static processes and coarse grids. Similarly, the work in [20] also explored the idea of combining GPR and RL on static, synthetic datasets. Aerial domains show similar patterns for static pollution plumes, wildfire tracking, and radio maps [21]–[23].
We model the salinity field of the Douro river plume as a spatiotemporal scalar function \[f:\Omega \times \mathbb{R}_+ \to \mathbb{R}\] where \(\Omega \subset\mathbb{R}^2\) denotes a planar polygonal domain and \(\mathbb{R}_+\) is time. In this work, we consider 2-dimensional mapping of the plume, which is usually defined as the region in \(\Omega\) where salinity is lower than the open-ocean background level \(f_{\mathrm{ocn}}\) (\(\approx35\) psu). For a chosen threshold \(\zeta > 0\), we define the plume at time \(t\) as \[\label{eq:plume_def} \mathcal{P}(t) = \{\, x \in \Omega : f_{\mathrm{ocn}} - f(x,t) \ge \zeta \,\}.\tag{1}\] The set \(\mathcal{P}(t)\) evolves dynamically and, depending on environmental conditions (such as wind, ocean flow, discharge level, and tidal cycle), at a speed comparable to the AUVs' mobility (\(\approx 1\) m/s), as we illustrate in Fig. 2.
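For concreteness, the plume-extraction rule in (1) amounts to thresholding a salinity snapshot. The following minimal sketch assumes a gridded snapshot and an illustrative value for \(\zeta\) (the formulation only requires \(\zeta > 0\)).

```python
import numpy as np

F_OCN = 35.0   # open-ocean background salinity (psu)
ZETA = 3.0     # illustrative threshold; the formulation only requires zeta > 0

def plume_mask(salinity: np.ndarray, f_ocn: float = F_OCN, zeta: float = ZETA) -> np.ndarray:
    """Boolean mask of the plume region P(t) = {x : f_ocn - f(x, t) >= zeta}."""
    return (f_ocn - salinity) >= zeta

# Toy example: a 3x3 salinity snapshot (psu); cells well below 35 psu belong to the plume.
snapshot = np.array([[34.9, 33.0, 28.0],
                     [35.0, 30.5, 25.0],
                     [35.0, 34.5, 31.0]])
print(plume_mask(snapshot))
```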
Our objective in this work is to construct estimates \(\hat{f}\) of the salinity field \(f\) over time, based on measurements collected by a fleet of \(N\) AUVs deployed over \(\Omega\). Vehicle \(n\) follows the differential-equation mobility model [24]: \[\label{eq:mobility_model} \dot{x}^{(n)}(t) \;=\; u^{(n)}(t) + c\!\left(x^{(n)}(t),\,t\right)\tag{2}\] where \(x^{(n)}(t)\in\Omega\) is the AUV position, \(u^{(n)}(t)\in\mathbb{R}^2\) is its commanded velocity, and \(c(x, t)\) is the ocean flow, described by a time-varying vector field \[c:\Omega \times \mathbb{R}_+ \to \mathbb{R}^2, \quad c(x,t) = (u(x,t), w(x,t)) \label{ocean-cur}\tag{3}\]
where \(u(x,t)\) and \(w(x,t)\) are the longitudinal and latitudinal velocity components of the current, respectively.
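As an illustration, the kinematic model in (2) can be integrated numerically over one 30-minute slot. The sketch below uses a simple forward-Euler step with hypothetical commanded-velocity and flow-field functions; it is not our operational controller.

```python
import numpy as np

def simulate_auv(x0, commanded_velocity, ocean_flow, t0=0.0, duration=1800.0, dt=10.0):
    """Forward-Euler integration of the kinematic model x_dot = u(t) + c(x, t).

    x0: initial position (2,); commanded_velocity(t) -> (2,); ocean_flow(x, t) -> (2,).
    Returns the sampled trajectory as an array of positions (meters).
    """
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for i in range(int(duration / dt)):
        t = t0 + i * dt
        x = x + dt * (commanded_velocity(t) + ocean_flow(x, t))
        traj.append(x.copy())
    return np.array(traj)

# Hypothetical example: head north-east at 1 m/s against a weak, uniform westward current.
u_cmd = lambda t: np.array([0.7071, 0.7071])      # commanded velocity, 1 m/s
c_flow = lambda x, t: np.array([-0.2, 0.0])       # 0.2 m/s westward drift
path = simulate_auv(x0=[0.0, 0.0], commanded_velocity=u_cmd, ocean_flow=c_flow)
print(path[-1])                                   # position after a 30-minute slot
```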
To construct the estimate \(\hat{f}\) of the unknown function \(f\) over space and time, we collect measurements of \(f\) with the AUVs. Given that the AUVs move according to (2), the measurements available for the estimation are constrained in space and time by the \(N\) trajectories. To formally state our problem formulation, we need to introduce some notation. Let us first define a vector of discrete time slots \(\sigma = [0, \Delta, 2\Delta, ..., T\Delta]\), indexed by \(\mathcal{K} = \{0, 1, ..., T\}\). In the rest of the paper, we are interested in updating the estimate \(\hat{f}\) over the slots \(\sigma[1], ..., \sigma[T]\) with a slot granularity of \(\Delta = 30\) minutes.
At time slot \(k \in\mathcal{K}\), agent \(n\) has navigated along a trajectory \(\gamma_k^{(n)}\in\Gamma(x^{(n)}((k-1)\Delta), x^{(n)}(k\Delta))\subset\Omega\)\(^{4}\). We denote the visited set of space-time coordinates along \(\gamma_k^{(n)}\) by \[Z_k^{(n)} = \left\{\left(x_{k, 1}^{(n)}, t_{k, 1}^{(n)}\right), ..., \left(x_{k, z_{k, n}}^{(n)}, t_{k, z_{k, n}}^{(n)} \right)\right\}, \label{measurement}\tag{4}\] with \(z_{k, n} = |Z_{k}^{(n)}|\). Note that \((k-1)\Delta \leq t_{k, 1}^{(n)} \leq \cdots \leq t_{k, z_{k,n}}^{(n)} \leq k\Delta\) and that \(Z_k^{(n)}\subset\gamma_k^{(n)}\), with \(x^{(n)}((k-1)\Delta) = x_{k, 1}^{(n)}\) and \(x^{(n)}(k\Delta) = x_{k, z_{k,n}}^{(n)}\). Let the set of space-time locations visited up to time slot \(k\) by agent \(n\) be denoted by \(D_k^{(n)} = \{Z_{1}^{(n)}, ..., Z_{k}^{(n)}\}\) and let \(D_k^{N} = \{D_k^{(n)}\}_{n = 1}^N\) be the overall set of coordinates visited by the fleet up to time \(k\). We denote the noisy fleet-measured salinity values up to slot \(k\) by \[f_k^N = \{f(x, t)+\epsilon_{x, t} :(x, t)\in D_k^N\},\] where \(\epsilon_{x,t}\) is a generic noise term. Given space-time coordinates \(A\), let \(f(A) =\{f(x, t)+\epsilon_{x, t} :(x, t)\in A\}\) be the corresponding measurements, and let \(\mathcal{M}_k^N = \{D_k^N, f_k^N\}\).
Given a measurement-based predictor \(\hat{f}(\cdot\mid\mathcal{M}_k^N)\) for the salinity field \(f(\cdot)\), and an evaluation grid \(G\subset\Omega\), we formulate the long-term mapping problem as follows:
\[\label{eq:setting-obj}
\begin{aligned}
\min_{\{D_k^N\}_{k=1}^T}\;\; & \sum_{k=1}^T \sum_{x\in G} e_k^2 \\
\text{s.t.}\;\; & Z_k^{(n)} \subset \gamma_k^{(n)}, \quad \forall\, k, n,
\end{aligned}\]
where \[e_k = f\left(x,\sigma[k]\right)-\hat{f}\left(x,\sigma[k]\mid \mathcal{M}_k^N\right),\] recalling that \(\mathcal{M}_k^N = \{D_k^N, f_k^N\}\), and \(D_k^N = \{D_k^{(n)}\}_{n=1}^N = \{\{Z_l^{(n)}\}_{l=1}^k\}_{n=1}^N\).
Discussion and challenges. Given that the function \(f(\cdot)\) is unknown and measurements become available only in real time, problem [eq:setting-obj] must be solved online, sequentially building the set \(\mathcal{M}_k^N\), for \(k = 1, ..., T\). The first main challenge in doing this
is due to the physical constraints in building the trajectories \(D_k^N\), combined with the fact that the field \(f(\cdot, t)\) changes in time at a speed comparable to the speed with
which AUVs can move in the water. Mathematically, the new measurements \(\{Z_k^{(n)}\}\) that we can add to the set \(D_{k-1}^N\) can only cover a limited space which is comparable to
the space that the plume \(\mathcal{P}(t)\) in 1 covers in the same amount of time. This is also referred to as the lack of synopticity in the trajectory-constrained
measurements [25]. We illustrate this in the sequence of frames in Fig. 2, where we plot the evolution of
the field \(f(\cdot, t)\) over 10 hours and the corresponding mobility of one AUV across the area of interest.
A second relevant challenge is that the ocean flow (illustrated in Fig. 3) can significantly impede or even nullify the mobility of an AUV. Note that the ocean flow can push at the same speed as the vehicle’s nominal speed of \(1\) m/s (mobility model in (2)), effectively stalling progress. At the same time, given that the current vector field \(c(\cdot)\) is correlated with the salinity field and wind forcing, it may also be exploited for navigation if learned appropriately. These two challenges strongly motivate the study of a data-driven, long-horizon sequential decision-making solution (e.g., RL), which we provide in this work. Beyond these constraints, we also address two fundamental engineering aspects. First, multi-agent coordination depends on communication, which is intermittent and bandwidth-limited in ocean plume environments. Can we effectively orchestrate multiple AUVs in solving problem [eq:setting-obj] with minimal and sporadic communication? Second, endurance hinges on energy efficiency. Can we take sequential decision-making actions that not only keep the mean squared error (MSE) in [eq:setting-obj] low, but also smartly regulate the propulsion used to generate the trajectories \(\gamma_{k}^{(n)}\)? Because propulsion power scales cubically with speed, halving velocity reduces energy consumption by a factor of eight [26], providing substantial gains in vehicle endurance. In the remainder of the paper, we answer these questions in the affirmative, illustrating our solution and the results of our simulation study.
LAUVs. The sensing agents considered in this work are light autonomous underwater vehicles (LAUVs) [5]. Each LAUV is equipped with a CTD probe (conductivity, temperature, depth) to measure salinity, navigation sensors, and communication modems (Wi-Fi, GSM, satellite) available when surfacing. The nominal survey speed of the LAUVs is \(1.0\) m/s, yielding an operational endurance of approximately 72 hours. As mentioned in the previous sections, smartly regulating the speed can provide dramatic energy savings, boosting the fleet’s endurance. To do this, we introduce a second cruise speed, which we set to \(0.4\) m/s. Traveling at this speed, the vehicle’s endurance can potentially be 8 times longer. However, note that the rapid variability of plume dynamics and the strength of coastal currents prevent exclusive reliance on the secondary speed, highlighting the need for adaptive velocity control. On the computational side, each vehicle runs lightweight processors sufficient for low-level navigation, sensing, and radio communication. We show a picture of LAUVs in Fig. 4.
In this section, we describe the system architecture and the algorithm we design to solve problem [eq:setting-obj]. In Fig. 5 we provide a graphical illustration of the system architecture, and Algorithm 6 outlines the main control pipeline. The key rationales behind the system architecture are (i) offloading computation as much as possible from the AUVs to the server, prioritizing their endurance, and (ii) relying only on intermittent and extremely low-overhead communication. The pipeline at a given time slot \(k\) is as follows: first, all agents dwell on the sea surface and establish a connection with the cloud server via Wi-Fi or GSM. During this communication, each AUV \(n=1, ...,N\) uplinks its latest measurements \(\{Z_k^{(n)}, f(Z_k^{(n)})\}\) collected over the past 30 minutes. In our implementation, the number of per-agent per-slot measurements \(z_{k,n}\) depends on speed and ocean currents, averaging \(5\) and limited to a maximum of \(10\). Note that transmitting \(5\) space-time coordinates and the corresponding measurements only involves transmitting \(20\) floating-point numbers, which, in the worst case, only requires \(160\) bytes (always \(<500\) bytes even with wireless transmission overhead). Upon receiving the new measurements \(\{Z_k^{(n)}, f(Z_k^{(n)})\}_{n=1}^N\), the server aggregates them with the previous set, resulting in \(\mathcal{M}_k^N = \{D_k^N, f_k^N\}= \mathcal{M}_{k-1}^N\cup \{Z_k^{(n)}, f(Z_k^{(n)})\}_{n=1}^N\), and updates the salinity map estimate \(\hat{f}(\cdot, \sigma[k]|\mathcal{M}_k^N)\). Based on \(\hat{f}\), the current agent trajectories \(D_k^N\), and a vector \(w_k\) containing the wind speed and direction, the server determines the control policy \(\pi(\hat{f}, \mathcal{M}_k^N, w_k)\) for all AUVs \(n=1, ..., N\) for the next 30-minute time slot, which results in directions \(\{b_k^{(n)}\}_{n=1}^N\) and speed levels \(\{v_k^{(n)}\}_{n=1}^N\). At that point, the server downlinks the commands to the AUVs, which resubmerge and resume sampling, starting to build \(\{Z_{k+1}^{(n)}, f(Z_{k+1}^{(n)})\}_{n=1}^N\). The estimator \(\hat{f}(\cdot, \sigma[k]|\mathcal{M}_k^N)\) and the policy \(\pi(\hat{f}, \mathcal{M}_k^N, w_k)\) form the core of our framework. We detail these two components in the following subsections, respectively.
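The per-slot server loop described above can be summarized schematically as follows. The object interfaces (uplink, downlink, estimator, policy, state builder) are hypothetical placeholders used only to make the control flow explicit; they do not reflect our operational implementation.

```python
def server_slot_update(k, agents, memory, estimator, policy, state_builder, wind_lookup):
    """One 30-minute coordination slot of the server pipeline (schematic sketch).

    agents: objects exposing uplink_measurements() and downlink_command(direction, speed);
    memory: running measurement set M_k^N with add() and trim(); estimator: GPR wrapper
    with posterior_mean(); policy: Q-network wrapper with act(); state_builder and
    wind_lookup: callables. All interfaces here are assumed placeholders.
    """
    # 1. Surface communication: each AUV uplinks its latest space-time samples Z_k^(n).
    for agent in agents:
        memory.add(agent.uplink_measurements())
    memory.trim()                                    # keep only the most recent M slots

    # 2. Update the salinity map estimate on the evaluation grid.
    f_hat = estimator.posterior_mean(memory, slot=k)

    # 3. Per-agent state -> (direction, speed) command from the multi-head Q-network.
    w_k = wind_lookup(k)
    for n, agent in enumerate(agents):
        s_n = state_builder(f_hat, memory, agent_index=n, wind=w_k)
        direction_deg, speed_ms = policy.act(s_n)

        # 4. Downlink the low-overhead command; the AUV resubmerges and samples
        #    until the next slot, producing Z_{k+1}^(n).
        agent.downlink_command(direction_deg, speed_ms)
    return f_hat
```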
To produce measurement-based estimates \(\hat{f}(\cdot, \sigma[k]|\mathcal{M}_k^N)\) we resort to non-parametric Gaussian process regression (GPR). GPR is a very appealing choice in this case because (i) it encodes spatial and temporal correlations through a simple, interpretable kernel function; and (ii) it is highly flexible and reliable when interpolating the sparse and highly irregular samples \(\mathcal{M}_k^N\) produced by the trajectory-constrained AUVs. Our GPR estimator places a spatiotemporal Gaussian prior on the salinity field of interest \[f(x,t)\;\sim\;\mathcal{GP}\!\big(m(x,t),\,K\big((x,t),(x',t')\big)\big),\] with mean function \(m(\cdot)\) and covariance kernel \(K(\cdot, \cdot)\). In this work, we fix the mean \(m(x, t) = f_{\text{ocn}}\). Given the set \(\mathcal{M}_k^N\), the posterior mean at \((x,t)\) can be computed in closed form as: \[\begin{align} \hat{f}(x,t \mid \mathcal{M}_k^N) = f_{\text{ocn}} + k_\ast \bar{K}^{-1}\!\big(f(D_k^N) - f_{\text{ocn}}\big) \end{align} \label{mean}\tag{5}\] where \(k_\ast=K\!\big((x,t),D_k^{N}\big)\) and \(\bar{K}=K\!\big(D_k^{N},D_k^{N}\big)+\sigma^2 I\) are the \((x,t)\)-kernel section and the kernel data matrix, respectively, with \(\sigma^2\) capturing the measurement noise. To keep computations tractable over long horizons, we retain only the most recent \(M\) slots in \(\mathcal{M}_k^N\), discarding older samples.
Spatiotemporal kernel. We adopt a space-time separable spatiotemporal kernel of the following form: \[K\big((x,t),(x',t')\big)=K_s(x,x')\,h\left(\tau\right),\] where \(\tau = |t-t'|\). We choose the functions \(K_s(x,x')\) and \(h\left(|t-t'|\right)\) by analyzing and fitting the empirical correlations from the historical data. In particular, we set \[K_s(x,x')=\lambda^2\exp\!\left(-\frac{\|x-x'\|}{\,\ell}\right),\] with hyperparameters \(\lambda, \ell >0\) and \[\begin{align} h(\tau)=\beta_0-\beta_1\,\tau+\beta_2\Big(\cos\!\big(\pi\tau/T_0\big)-1\Big), \end{align}\] which combines a linearly decaying term and a periodic oscillation, whose period \(T_0\), not surprisingly, is the same as the tidal period of 12.5 hours. All hyperparameters are fit to historical data by matching empirical correlations, as we illustrate in Figure 7.
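As a minimal sketch, the separable kernel above and the posterior-mean update in (5) can be implemented in a few lines. The hyperparameter values below are illustrative stand-ins for the ones fitted to historical data, and the toy inputs are hypothetical.

```python
import numpy as np

# Illustrative hyperparameters (the paper fits these to historical correlations).
LAM, ELL = 1.0, 3000.0                   # spatial scale and length scale (m)
B0, B1, B2, T0 = 1.0, 0.01, 0.2, 12.5    # temporal coefficients; T0 = tidal period (h)
F_OCN, SIGMA2 = 35.0, 0.01               # prior mean (psu) and noise variance

def kernel(X, T, Xp, Tp):
    """Separable spatiotemporal kernel K((x,t),(x',t')) = K_s(x,x') * h(|t-t'|)."""
    d = np.linalg.norm(X[:, None, :] - Xp[None, :, :], axis=-1)   # pairwise distances
    tau = np.abs(T[:, None] - Tp[None, :])                        # pairwise time lags (h)
    Ks = LAM**2 * np.exp(-d / ELL)
    h = B0 - B1 * tau + B2 * (np.cos(np.pi * tau / T0) - 1.0)
    return Ks * h

def gpr_posterior_mean(X_train, T_train, y, X_query, T_query):
    """Posterior mean f_hat(x,t | M) = f_ocn + k_* Kbar^{-1} (y - f_ocn), as in Eq. (5)."""
    Kbar = kernel(X_train, T_train, X_train, T_train) + SIGMA2 * np.eye(len(y))
    k_star = kernel(X_query, T_query, X_train, T_train)
    return F_OCN + k_star @ np.linalg.solve(Kbar, y - F_OCN)

# Toy usage: three past samples, one query point roughly half an hour later.
X = np.array([[0.0, 0.0], [500.0, 0.0], [0.0, 800.0]]); T = np.array([0.0, 0.2, 0.4])
y = np.array([30.0, 32.0, 34.0])
print(gpr_posterior_mean(X, T, y, np.array([[250.0, 100.0]]), np.array([0.9])))
```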
We cast the construction of trajectories \(D_k^N\) over time slots \(k=1,..., T\) to solve [eq:setting-obj] as a (centralized) multi-agent sequential decision-making problem. In particular, we want to design a decision-making policy \(\pi(\cdot)\) that, at each time slot \(k\), given the measurements and trajectory history in \(\mathcal{M}_k^N\) and exogenous inputs (e.g., wind), selects a direction and a speed level for each AUV for the next time slot (30 minutes). Given the large amount of historical environmental data and the availability of the advanced Delft3D simulator, a natural choice to learn \(\pi(\cdot)\) is model-free multi-agent reinforcement learning (MARL) [27]. Due to the well-known lack of scalability of centralized MARL [28], we adopt the typical solution of treating the problem as a decentralized MARL setting with full communication [29]. To do so, we need to define a per-agent state space \(\mathcal{S}\) and action space \(\mathcal{A}\), and a reward function \(R:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}\). We seek a policy map \(\pi: \mathcal{S} \rightarrow \mathcal{A}\) that maximizes the expected cumulative discounted reward \(\mathbb{E}\left[\sum_{k = 0}^\infty \gamma^k R(s_k, a_k)\right]\). At each time step \(k\), the server builds the state \(s_k^{(n)}\) for each agent using the current estimate \(\hat{f}\) and the set \(D_k^N\), and then selects an action \(a_k^{(n)}\), for \(n=1, ..., N\). To learn policy \(\pi(\cdot)\), we use Q-learning [27], which, for a given discount factor \(\gamma\in(0,1)\), targets the optimal action–value function \[Q^{\star}(s,a) = \mathbb{E}\Bigl[\, r + \gamma \max_{a'} Q^{\star}(s',a') \,\big|\, s,a \Bigr],\] and uses it to select actions. Since the state space is high-dimensional, we approximate \(Q(s,a)\) with a neural network with parameters \(\theta\) and denote it by \(Q_{\theta}(s,a)\).
We factor the action space \(\mathcal{A}\) into discrete direction and speed components \[\begin{align} a &= (b,v) \;\in\; \mathcal{H}\times\mathcal{V} = \mathcal{A},\;\text{where}\\ \mathcal{H}&=\{0,45, ..., 315\}\;\deg, \;\mathcal{V}=\{0.4,\,1.0\}\;\text{m/s}. \end{align}\] The two velocity levels enable us to control the trade-off between endurance and agility: the lower cruise speed (\(0.4\) m/s) extends mission duration by conserving propulsion energy, while the higher speed (\(1.0\) m/s) enables more accurate plume monitoring at the expense of battery life.
To exploit the structure of the action space, we decompose \(Q_\theta\) into two components, which share a state-value function \(V_{\bar{\theta}}(s)\) and use two advantage functions \(A_{\theta_1,\text{dir}}(s,b)\), \(A_{\theta_2,\text{spd}}(s,v)\), \[\label{eq:Qangle} \begin{align} Q_{\theta,\mathrm{dir}}(s,b) &= V_{\bar{\theta}}(s) + A_{\theta_1,\mathrm{dir}}(s,b) - \frac{1}{|\mathcal{H}|}\sum_{b'\in\mathcal{H}} A_{\theta_1,\mathrm{dir}}(s,b') \\ Q_{\theta,\mathrm{spd}}(s,v) &= V_{\bar\theta}(s) + A_{\theta_2,\mathrm{spd}}(s,v) - \frac{1}{|\mathcal{V}|}\sum_{v'\in\mathcal{V}} A_{\theta_2,\mathrm{spd}}(s,v'), \end{align}\tag{6}\] where \(\theta = \{\bar{\theta}, \theta_1, \theta_2\}\). Action selection is then \[\label{eq:Q_policy} \begin{align} b^{(n)}_k &= \arg\max_{b\in\mathcal{H}} Q_{\theta,\mathrm{dir}}(s_k^{(n)},b),\\ v^{(n)}_k &= \arg\max_{v\in\mathcal{V}} Q_{\theta,\mathrm{spd}}(s_k^{(n)},v), \end{align}\tag{7}\] which lets each agent decide where to go and how fast.
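A minimal PyTorch sketch of the two-head dueling decomposition in (6) and the greedy selection in (7) is given below. For brevity, the network operates on a precomputed state embedding rather than the full CNN/wind encoders described next; the trunk and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadDuelingQNet(nn.Module):
    """Shared value V(s) plus direction and speed advantage heads, as in Eq. (6)."""
    def __init__(self, state_dim: int, n_dir: int = 8, n_spd: int = 2, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)          # V(s)
        self.adv_dir = nn.Linear(hidden, n_dir)    # A_dir(s, b), 8 headings
        self.adv_spd = nn.Linear(hidden, n_spd)    # A_spd(s, v), 2 speed levels

    def forward(self, s):
        z = self.trunk(s)
        v, a_dir, a_spd = self.value(z), self.adv_dir(z), self.adv_spd(z)
        q_dir = v + a_dir - a_dir.mean(dim=-1, keepdim=True)   # mean-centered advantages
        q_spd = v + a_spd - a_spd.mean(dim=-1, keepdim=True)
        return q_dir, q_spd

# Greedy action selection as in Eq. (7), for a batch of per-agent state embeddings.
net = MultiHeadDuelingQNet(state_dim=64)
q_dir, q_spd = net(torch.randn(3, 64))             # e.g. N = 3 agents
headings = q_dir.argmax(dim=-1) * 45                # degrees in {0, 45, ..., 315}
speeds = torch.tensor([0.4, 1.0])[q_spd.argmax(dim=-1)]
print(headings.tolist(), speeds.tolist())
```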
State construction. The state representation \(s_k^{(n)}\) comprises three components. Following prior work on image-based policy inputs [30], we render both the spatial estimate and the trajectory traces into a fixed-resolution \(3\)-channel image, compressed by a convolutional neural network (CNN) into a compact embedding. Specifically, the three channels for the \(n\)-th agent are (i) an image of the GP estimate of the salinity field \(\hat{f}(\cdot,\sigma[k])\); (ii) an image marking in white, over a black background, the agent's recent measurement locations \(Z_k^{(n)}, Z_{k-1}^{(n)}, ...\); and (iii) an image marking in white, over a black background, the recent measurement locations of agent \(n\)'s teammates, \(\{Z_k^{(l)}, Z_{k-1}^{(l)}, ...\}_{l\neq n}\). The trajectory mark intensities are log-scale weighted by recency. In addition, a wind vector \(w_k = [\text{angle}, \text{speed}]\), processed by a neural network to produce an embedding, is concatenated with the image embedding. See Fig. 5 for an illustration of the state construction.
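A schematic rendering of the three image channels is sketched below. The image resolution, the normalization, and the geometric recency decay (standing in for the log-scale weighting) are illustrative assumptions rather than our exact preprocessing.

```python
import numpy as np

def render_state_channels(f_hat_grid, own_tracks, mate_tracks, decay=0.5):
    """Render the 3-channel image state: (GP estimate, own recent tracks, teammates' tracks).

    f_hat_grid: 2-D array with the GP salinity estimate resampled on the image grid.
    own_tracks / mate_tracks: lists of per-slot (row, col) pixel locations, newest first.
    Track intensities fade with age (geometric decay as a stand-in for log-scale weighting).
    """
    h, w = f_hat_grid.shape
    chans = np.zeros((3, h, w), dtype=np.float32)
    chans[0] = f_hat_grid / 35.0                      # normalize by the ocean background
    for c, tracks in ((1, own_tracks), (2, mate_tracks)):
        for age, slot_points in enumerate(tracks):
            weight = decay ** age                     # older slots fade out
            for row, col in slot_points:
                chans[c, row, col] = max(chans[c, row, col], weight)
    return chans
```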
Reward design. We design a reward function that combines cooperative and competitive components. This incentivizes agents to cooperate by minimizing the global MSE and to compete by receiving individual credit for reducing the MSE along their own trajectories. Indeed, when experimenting with a purely cooperative global reward, we ran into the well-known credit assignment problem [31]. Let us introduce a salinity contrast score \[F_k \;=\; \Bigl|\,f_{\mathrm{ocn}}-\tfrac{1}{|G|}\sum_{x\in G} f(x,\sigma[k])\Bigr|, \quad f_{\mathrm{ocn}}\approx 35~\text{psu},\] which characterizes how much freshwater is present in the ocean at time slot \(k\).
Our reward function for agent \(n\) is \[\label{eq:reward} \begin{align} r_k^{(n)} &= r_{k,g} + r_{k,n}, \;\text{where} \\ r_{k,g} &= -\eta_0e_k + \eta_1\frac{F_k}{1+e_k} - \eta_2\sum_{m=1}^Nv^{(m)}_k, \\ r_{k,n} &= \eta_3\sum_{(x, t) \in Z_k^{(n)}} \left( \hat{f}_{k-1}(x) - f(x, \sigma[k]) \right)^2, \end{align}\tag{8}\] where \(e_k\) is the global MSE on the evaluation grid, \(\hat{f}_{k-1}(x) = \hat{f}(x, \sigma[k-1]|\mathcal{M}_{k-1}^N)\), and \(\eta_0, ..., \eta_3>0\) are hyperparameters. The term \(r_{k,n}\) is the key individual credit that the agent earns for visiting locations \(x\) along its trajectory \(Z_k^{(n)}\) where the previous estimate \(\hat{f}_{k-1}(x)\) was significantly inaccurate relative to the actual field \(f(x, \sigma[k])\). The term \(\sum_{m=1}^Nv^{(m)}_k\) penalizes higher speeds at the fleet level. The salinity contrast score \(F_k\) incentivizes reducing the MSE when there is more freshwater in the plume.
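For clarity, the per-agent reward in (8) can be computed as in the sketch below. The weight values are placeholders: \(\eta_2 = 50\) matches the setting reported for Table 1, while \(\eta_0, \eta_1, \eta_3\) are purely illustrative.

```python
import numpy as np

def agent_reward(f_true_grid, f_hat_grid, f_hat_prev_at_samples, f_true_at_samples,
                 speeds, eta=(1.0, 1.0, 50.0, 1.0), f_ocn=35.0):
    """Per-agent reward r_k^(n) = r_{k,g} + r_{k,n} from Eq. (8) (illustrative weights).

    f_true_grid / f_hat_grid: ground truth and current estimate on the evaluation grid G.
    f_hat_prev_at_samples / f_true_at_samples: previous estimate and truth at the agent's
    new sample locations Z_k^(n). speeds: commanded speeds of all N agents in this slot.
    """
    eta0, eta1, eta2, eta3 = eta
    e_k = np.mean((f_true_grid - f_hat_grid) ** 2)          # global MSE on G
    F_k = abs(f_ocn - np.mean(f_true_grid))                 # salinity contrast score
    r_global = -eta0 * e_k + eta1 * F_k / (1.0 + e_k) - eta2 * np.sum(speeds)
    r_local = eta3 * np.sum((f_hat_prev_at_samples - f_true_at_samples) ** 2)
    return r_global + r_local
```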
In this section, we present the details of our simulation study and our numerical results.
Dataset. We train our algorithm on \(4\) months of data from the year 2018 (February, April, October, and December) and evaluate on different months from 2016-2018 (see Table 1). Overall, we use a total of \(\approx 6\)k time frames for training, with a \(30\)-minute time resolution. Each frame spans \(\approx5\times10^4\) grid locations, and measurement noise is i.i.d. \(\mathcal{N}(0,\,0.01)\).
Training setup. We minimize the Bellman loss for the Q-network, computed over mini-batches of size \(64\) sampled from a replay buffer. We use the Adam optimizer and perform soft updates with a target network. Each policy used in this section was trained for \(6500\) episodes of length \(150\) slots (\(\approx 3\) days) with a discount factor \(\gamma = 0.9\), with the first \(1500\) episodes devoted to pure exploration. The Q-network uses a 3-layer CNN followed by two fully connected layers.
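A minimal sketch of one Q-network update with soft target updates is shown below. The per-head Bellman targets and the value of the soft-update rate \(\tau\) are assumptions introduced for illustration; \(\gamma = 0.9\) and the batched replay sampling follow the setup above.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.9, tau=0.005):
    """One Bellman-loss step for the two-head Q-network with soft target updates.

    batch: (s, b, v, r, s_next) tensors sampled from the replay buffer, where b and v
    are the indices of the chosen direction and speed. The independent per-head targets
    and tau are illustrative assumptions, not necessarily the exact training objective.
    """
    s, b, v, r, s_next = batch
    q_dir, q_spd = q_net(s)
    q_dir_sa = q_dir.gather(1, b.unsqueeze(1)).squeeze(1)
    q_spd_sa = q_spd.gather(1, v.unsqueeze(1)).squeeze(1)

    with torch.no_grad():                              # bootstrapped targets (target network)
        nq_dir, nq_spd = target_net(s_next)
        y_dir = r + gamma * nq_dir.max(dim=1).values
        y_spd = r + gamma * nq_spd.max(dim=1).values

    loss = F.mse_loss(q_dir_sa, y_dir) + F.mse_loss(q_spd_sa, y_spd)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Polyak (soft) update of the target network parameters.
    for p, tp in zip(q_net.parameters(), target_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)
    return loss.item()
```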
Two baselines. In Fig. 8, we illustrate two different sampling schemes on ground-truth salinity fields \(f\) (left) and their GP estimates \(\hat{f}\) (right) for October 2018. Both methods use the same budget of 15 measurements every 30 minutes with a memory window of 24 frames, and all estimates employ the GPR model of Sec. 3.1. The top panels show an unconstrained uniform strategy: measurements are placed uniformly at random on the non-homogeneous spatial grid of the Delft3D numerical model, which has a higher density of grid points in higher-variance areas. In practice, this effectively biases sampling toward high-variance locations closer to the river mouth. Note that since uniform placement ignores the trajectory constraint \(Z_k^{(n)}\subset\gamma_{k}^{(n)},\;\forall k,n\), it is not physically realizable with the AUVs, but it does provide a fundamental baseline: if one could freely deploy measurements without vehicle constraints, this uniform scheme would be a natural choice for estimating \(f\). In the bottom panels of Figure 8, we show an example of a predetermined strategy that complies with the constraints in [eq:setting-obj]. This particular strategy is based on rotations around strategic core points in the plume, and, for the sake of this illustration, we assume that the robots are able to perform the rotations without being impacted by the ocean currents. Although the trajectories are strategically placed in key areas, one can already see from Fig. 8 how the constrained sampling estimate struggles to capture the state of the plume, while the uniform sampling strategy provides a better estimate. To illustrate this numerically, in Figure 9 we show the evolution of the MSE over time for these two baselines (“uniform” and “ideal rotations”). One can see the notable difference in MSE between the two strategies, with the average MSE of the rotations roughly 6 times higher than that of the unconstrained uniform baseline under the same sampling budget.
Single-agent performance. First, we analyze the performance of our algorithm when deploying only one AUV (\(N=1\)) to map the river plume. This allows us to compare our solution directly with the one proposed in [18], which only considered a single-agent setting and is based on evaluating the expected integrated Bernoulli variance (EIBV) for each potential direction at every step. For this benchmark, we implement an ideal version in which the decision maker has access to the actual IBV for each candidate direction, and we term this algorithm “ideal [18]”. This allows us to compare against the best possible version of [18] and to obtain results that do not depend on the specific numerical way of computing the EIBV. A fundamental parameter of [18] is a salinity threshold that defines the plume front. While the original paper fixed this at 24 psu, we found this too low for the Douro plume and selected the best threshold from \(\{28,30,32,34\}\), choosing 32 psu. We show an example of a simulation outcome in Figure 10, where on the left we plot the MSE evolution over three days for four algorithms, and on the right we report monthly MSE box plots for February 2018. The two plots reveal how, despite [18] performing well during certain parts of the process, it tends to lose track of the plume over the long run, while our RL policy, in contrast, consistently maintains low MSE. This result is not surprising, as the quick temporal evolution of the plume can make the decisions taken by [18], which are based on the current prior Gaussian field, myopic with respect to the spatiotemporal evolution of the plume, whereas our learned policy explicitly optimizes long-horizon mapping performance.
Table 1: Test-month MSE and fleet endurance in days (“End. (d)”) for \(N=3\) and \(N=6\), with \(\eta_2 = 50\), compared against [32].

| Month | Ours MSE (N=3) | End. (d) (N=3) | [32] MSE (N=3) | Ours MSE (N=6) | End. (d) (N=6) | [32] MSE (N=6) |
|---|---|---|---|---|---|---|
| Mar ’18 | 23.14 | 3.5 | 47.87 | 11.46 | 3.1 | 21.07 |
| Sep ’18 | 10.36 | 4.6 | 18.48 | 5.44 | 5.0 | 6.27 |
| Nov ’18 | 13.96 | 4.2 | 20.99 | 7.88 | 4.3 | 8.24 |
| Jan ’16 | 27.03 | 3.2 | 49.19 | 12.84 | 3.3 | 19.53 |
| Feb ’16 | 24.26 | 3.3 | 49.74 | 14.07 | 3.4 | 21.32 |
| Oct ’17 | 3.69 | 13.0 | 4.04 | 2.79 | 15.6 | 2.04 |
Multi-agent performance. We now evaluate performance in the multi-agent setting, focusing on: (i) the MSE gains provided by multi-agent coordination as \(N\) increases, together with a comparison against existing literature benchmarks; and (ii) the energy-efficiency improvements induced by varying the \(\eta_2\) reward hyperparameter in (8) and the corresponding MSE-energy tradeoff. As a multi-agent benchmark, we implement the recent work in [32], which proposed an algorithm based on adaptive Voronoi partitioning of the space of interest to track a time-varying process. We empirically tune the exploration-exploitation hyperparameters in [32] and show the best results in terms of MSE. In Fig. 11, we compare the different schemes for \(N=3\) and \(N=6\) for March 2018. We first note how the adaptive Voronoi partition solution [32] significantly outperforms the strategic rotations solution. We also note that our algorithm provides maps with half the MSE of [32], for both \(N=3\) and \(N=6\). In Fig. 13, we show a sequence of frames to illustrate the multi-agent control performance of the agents with respect to the evolution of the salinity field \(f\) over a \(15\)-hour time interval. In Fig. 12 we show the endurance-MSE trade-off obtained by varying the reward weight \(\eta_2\) in (8), using October 2017 and October 2018 as exemplars. Increasing \(\eta_2\) biases the policy toward lower propulsion use, thus extending endurance at a modest cost in accuracy, while scaling from \(N=3\) to \(N=6\) both reduces MSE and more than doubles endurance. Table 1 summarizes results obtained with \(\eta_2 = 50\) and further confirms that our algorithm generalizes across years and seasonal regimes unseen during training.
We conducted a simulation study showing promising performance in mapping the Douro river plume with multiple underwater vehicles over long time horizons. Future work includes 3D mapping and real-world deployment of the LAUVs in the river plume.
N. Dal Fabbro acknowledges the support from his AI x Science Postdoctoral Fellowship awarded through the University of Pennsylvania’s IDEAS, DDDI, and Penn AI initiatives. R. Mendes and J. Borges de Sousa were partially funded by national funds through FCT - Fundação para a Ciência e a Tecnologia, I.P., within the scope of the project with reference UIDB/50022/2020.
\(^{1}\)Department of Electrical and Systems Engineering, University of Pennsylvania, USA. Correspondence to: ndf96@seas.upenn.edu.
\(^{2}\)Laboratório de Sistemas e Tecnologia Subaquática (LSTS), Faculdade de Engenharia da Universidade do Porto, Portugal.
\(^{4}\)\(\Gamma(A, B)\) denotes the set of admissible curves connecting \(A\) and \(B\).