March 11, 2025
To navigate crowds without collisions, robots must interact with humans by forecasting their future motion and reacting accordingly. While learning-based prediction models have shown success in generating likely human trajectory predictions, integrating these stochastic models into a robot controller presents several challenges. The controller needs to account for interactive coupling between planned robot motion and human predictions while ensuring both predictions and robot actions are safe (i.e. collision-free). To address these challenges, we present a receding horizon crowd navigation method for single-robot multi-human environments. We first propose a diffusion model to generate joint trajectory predictions for all humans in the scene. We then incorporate these multi-modal predictions into a SICNav Bilevel MPC problem that simultaneously solves for a robot plan (upper-level) and acts as a safety filter to refine the predictions for non-collision (lower-level). Combining planning and prediction refinement into one bilevel problem ensures that the robot plan and human predictions are coupled. We validate the open-loop trajectory prediction performance of our diffusion model on the commonly used ETH/UCY benchmark and evaluate the closed-loop performance of our robot navigation method in simulation and extensive real-robot experiments demonstrating safe, efficient, and reactive robot motion.
1.0
For humans, walking within crowds is considered a trivial task. For robots, however, there exist considerable challenges in achieving safe, collision-free, and efficient crowd navigation. In an environment like the one illustrated in Fig. 1, a robot must anticipate human movement and adjust its trajectory in real-time. This problem has inspired the development of a variety of trajectory prediction methods (e.g. [1]–[3]) that learn distributions of future human trajectories from historical data and demonstrate promising performance in open-loop prediction, generating trajectory forecasts for a test set of historical data excluded during training. However, incorporating open-loop predictions into a robot controller remains a challenge. Firstly, the robot’s actions influence the future motion of humans, and vice versa, i.e. robot and human futures are coupled due to interaction. Secondly, the robot needs to account for potentially multimodal prediction distributions while choosing only one action to take. Finally, the robot needs to balance opposing objectives of efficiently reaching its goal, while ensuring safety.
One approach towards addressing the interaction challenge attempts to predict both human and robot futures and then penalize robot plans that deviate from the prediction [4], but as the plan is optimized towards the robot goal, it inevitably deviates from the robot prediction, reducing the relevance of the human predictions as well. Other methods dynamically represent predictions, for example by outputting predictions as affine dynamical systems [5] instead of trajectory tracks, or evaluating prediction model gradients during the planning step [6], [7]. While dynamically representing predictions addresses the multimodality challenge as well, these methods struggle to balance the safety and efficiency. All the aforementioned methods simplify the probabilistic nature of the problem to minimizing some cost, e.g. expected cost [4], risk metric [6], worst-case cost [7]. However, combining safety and efficiency in the cost function reveals a critical trade-off between the competing aims. In contrast, introducing non-collision constraints maximizes efficiency while ensuring actions avoid collisions.
Figure 1: We propose a method for interactive robotics crowd navigation. We use the diffusion prediction model to forecast the future motion of humans as joint trajectory samples. The SICNav Prediction Safety Filter uses a particle-filter
inspired approach to jointly refine these predictions (lower-level) and produce robot actions that satisfy safety constraints (upper-level). The robot uses solutions in receding horizon fashion. Video: tiny.cc/sicnav_diffusion
In this work, we introduce SICNav-Diffusion, a novel approach for safe robotic crowd navigation, illustrated in Fig. 1. We propose the JMID prediction model and combine with an extension of SICNav [8] to simultaneously plan robot actions while filtering the JMID predictions for safety.
Our contributions are threefold. First, we introduce JMID, whereur we extend a single-agent prediction model MID [3] to model a joint trajectory distribution for multiple humans. Second, we propose an extension to SICNav [8] to use multimodal prediction samples from JMID. SICNav embeds ORCA optimization problems, which model human collision-avoidance behavior, as constraints within the robot’s planner to form a bilevel MPC optimization problem. We extend the bilevel problem such that the lower-level ORCA refines JMID predictions to ensure non-collision, while the bilevel structure of the MPC couples the robot plan with these refined predictions. Our approach enables the optimizer to explore trajectory prediction modes generated by JMID and select the one that best optimizes the robot planning objective, while guaranteeing satisfaction of non-collision safety constraints. Third, to evaluate our proposal, we first validate the open-loop prediction performance of JMID on the ETH/UCY trajectory benchmark [9], [10] and demonstrate JMID outperforms the joint prediction performance of state-of-the-art models. We then evaluate the closed-loop planning performance of our method in simulation and extensive real-robot experiments (240 runs) to compare our approach, SICNav-JMID, with other state-of-the-art prediction models in a single-robot multi-human environment.
Human trajectory forecasting: Human trajectory forecasting models use the history of agent positions in a scene to predict their future 2D bird’s-eye-view trajectory for some horizon into the future. Recent methods focus on capturing the multi-modality of agent futures using a variety of learning-based architectures, e.g. CVAE [1], autoregressive Transformers [2], and Diffusion [3]. While all methods use the history of all agents to the scene as input, most predict the future of a single human in each inference [1], [3]. However, the future of all humans in the scene is coupled. Thus, we extend [3] to produce joint forecast samples for all agents in one inference.
Combined trajectory forecasting with planning and control: Recent works have tackled incorporating human trajectory prediction models into robot planners, focusing on distinct challenges: interactivity of planning with predictions, accounting for multimodal prediction distributions, and ensuring safety through non-collision. Some methods [5], [7] concentrate on interactivity by leveraging prediction model gradients [7] in the planning step or outputting predictions as time-varying dynamics used in the planner [5]. While these methods account for multimodal prediction distributions, they incorporate non-collision as an additional objective in the cost function rather than introducing constraints. Other research [11] ensures probabilistic non-collision constraint satisfaction via conformal prediction but considers unimodal predictions without modeling interactions in the planning step. A blended approach [12] addresses constraints primarily for the initial MPC step but remains reactive. In contrast, our method addresses the three challenges by selecting prediction modes based on wether the predictions are compatible (i.e. non-colliding) among the humans and with the robot plan, utilizing ORCA to refine the predictions. Our method incorporates this safety-filtering refinement into a bilevel MPC planner to explicitly model interactions between the humans and the robot while guaranteeing that generated action satisfies collision-avoidance constraints with the refined predictions.
Fig. 1 presents a block diagram of our method. In short, we first use the history of agent motion to generate a set of JMID prediction samples, \(\mathbold{\Upsilon}\). Then, we use the predictions and the latest measured state of the system as the initial state, \({\mathbf{x}_{\text{o}}}\), in a Bilevel MPC optimization problem that optimizes the robot’s actions, \({\mathbf{u}}_{0:T-1}\) and system state, \({\mathbf{x}}_{0:T}\), containing the robot states and a refined version of the predicted human trajectories for a finite horizon \(T\). We use this solution in receding horizon fashion: the robot executes the first robot action from the solution, \({\mathbf{u}}_{0}\), and repeats the MPC optimization with a newly measured state of the system, \({\mathbf{x}_{\text{o}}}\), and a newly sampled set of JMID predictions, \(\mathbold{\Upsilon}\).
The environment consists of a robot, human agents, indexed by \(j\in \bigl\{1,\dots,N\bigr\}\), and static obstacles in the form of line segments, indexed by \(\tilde{l}\in \bigl\{N+1,\dots,N+M\bigr\}\). The state of the system is in continuous space and contains the state of the ego robot, the states of all \(N\) humans, and a set of weights describing the relative importance of each of \(S\) JMID samples in determining the intents of the humans.
Formally, at a discrete time step, \(t\), the state is denoted as \({\mathbf{x}}_{t} = ({{\mathbf{x}}_{t}^{(r)}}, {{\tilde{\mathbf{x}}}_{t}^{(1)}}, \dots, {{\tilde{\mathbf{x}}}_{t}^{(N)}}, \mathbf{w}_t) \in \mathcal{X},\) where the state of the robot, \({{\mathbf{x}}_{t}^{(r)}} \in \mathbb{R}^{4}\), consists of its 2D position, heading, and speed. The human state, \({{\tilde{\mathbf{x}}}_{t}^{(j)}} \in \mathbb{R}^{4}\), consists of 2D position, and 2D velocity \(\forall j\in \bigl\{1,\dots,N\bigr\}\). We use a set of importance weights at each time step, \(\mathbf{w}_t \in \mathbb{R}^S, \text{ s.t. } \left\lVert\mathbf{w}_t\right\rVert_1 = 1\), to calculate a weighted average of \(S\) trajectory prediction samples obtained from JMID, which models future trajectories of all humans in the scene, \({\mathbf{Y}} =({\mathbf{Y}^{(1)}},\dots,{\mathbf{Y}^{(N)}}) \in \mathbb{R}^{2TN}\), where \({\mathbf{Y}^{(j)}} = ({\mathbf{y}_{1}^{(j)}}, \dots, {\mathbf{y}_{T}^{(j)}})\in\mathbb{R}^{2T}\) is the predicted trajectory for individual agent \(j\), composed of a sequence of predicted 2D positions \({\mathbf{y}_{t}^{(j)}}\in\mathbb{R}^{2}\). We denote the set of samples as \(\mathbold{\Upsilon}= \bigl\{{\mathbf{Y}^{(1:N,s)}}\bigr\}_{s=1}^{S}\), and refer to the break down of a particular sample \(s\) as \({\mathbf{Y}^{(j,s)}} = ({\mathbf{y}_{1}^{(j,s)}}, \dots, {\mathbf{y}_{T}^{(j,s)}})\). We will also refer to the set of all samples for individual agent \(j\) at a specific time step \(t\) as \({\mathbold{\Upsilon}_{t}^{(j)}}=\bigl\{{\mathbf{y}_t^{(j,s)}}\bigr\}_{s=1}^{S}\).
We separate the dynamics of the system into separate functions for the robot, the humans, and the evolution of the importance weights through time. The robot dynamics, \({{{\mathbf{x}}_{{t+1}}^{(r)}}} = \mathbf{f}({{{\mathbf{x}}_{t}^{(r)}}}, {\mathbf{u}}_{t}),\) are modeled as a kinematic unicycle model, where the control input for the system, \({\mathbf{u}}_{t} \in \mathbb{R}^2\), is a vector of the linear and angular velocity of the robot for time step \(t\). The dynamics of each human \(\forall j\in \bigl\{1,\dots,N\bigr\}\), \({{\tilde{\mathbf{x}}}_{{t+1}}^{(j)}} = \mathbf{h}({{\tilde{\mathbf{x}}}_{t}^{(j)}}, {{{\tilde{\mathbf{u}}}_{t}}^{(j)}}),\) are modeled as kinematic integrators, where \({{{\tilde{\mathbf{u}}}_{t}}^{(j)}}\) denote actions optimal with respect to ORCA. Note that actions \({{{\tilde{\mathbf{u}}}_{t}}^{(j)}}\) are used to obtain the safety-filtered predictions of the humans. Finally, the importance weights evolve according to \(\mathbf{w}_{{t+1}} = \mathbf{g}({\mathbf{x}}_{t};{\mathbold{\Upsilon}_{t}^{(1)}},\dots,{\mathbold{\Upsilon}_{t}^{(N)}})\). We describe the method for finding the refined predicted human actions and the evolution of importance weights in Section 3.3.2.
The stage cost, \(l({\mathbf{x}}_{t}, {{\mathbf{u}}_{t}}) = {\mathbf{x}}_{t}^\top \mathbf{Q}{\mathbf{x}}_{t} + {\mathbf{u}}_{t}^\top \mathbf{R}{\mathbf{u}}_{t},\) penalizes deviation from the robot’s goal position and excessive control effort with positive semidefinite matrices \(\mathbf{Q}\) and \(\mathbf{R}\). The terminal penalty, \(l_{T}({\mathbf{x}}_{T}) = \beta {\mathbf{x}}_{T}^\top \mathbf{Q}{\mathbf{x}}_{T},\) where \(\beta \geq 1,\) ensures stability of the controller (see Theorem 2.41 in [13]) with \(\beta\) selected to be sufficiently large.
We add constraints of the form \({\mathbf{x}}_{t}^\top \mathbf{P}_j{\mathbf{x}}_{t} \geq {d}_j^2, \forall j\in \bigl\{1,\dots,N\bigr\},\) where \(\mathbf{P}_j\in \mathbb{R}^{n \times n}\) extracts the positions of the robot and the \(j^{th}\) human from the state and \({d}_j\) is the minimum distance to avoid collisions. We similarly add quadratic constraints to avoid collisions between the robot and static obstacles, which are represented as piece-wise linear functions. We also bound the input of the system to enforce the kino-dynamic limits of the real-world robot. Finally, for each time-step, we constrain the prediction of each human (indexed \(j\)) to lie in the solution set of the ORCA optimization problem [14], \({\mathcal{O}^{(j)}}({\mathbf{x}}_{t};{\mathbold{\Upsilon}_{{t+1}}^{(j)}})\), described in Section 3.3.1. In short, the ORCA QCQP solves for a predicted human velocity minimizing distance to an intended velocity, obtained from the JMID samples for that human, subject to avoiding collisions with other agents and static obstacles.
With these definitions, the MPC problem for the robot can be formulated as an optimization problem, \[\tag{1} \begin{alignat}{2} \omit\hfil\displaystyle\mathop{\mathrm{minimize}}_{\substack{{\mathbf{x}}_{0:T},{\mathbf{u}}_{0:T-1},\\{\tilde{\mathbf{u}}_{0:T-1}^{(1:N)}}, \mathbf{w}_{0:T-1}}}\hfil\ignorespaces & \omit\hfil\displaystyle\sum_{t=0}^{T-1} l({\mathbf{x}}_{t},{\mathbf{u}}_{t}) + l_{T}({\mathbf{x}}_{T})\hfil\ignorespaces & \omit\hfil\displaystyle\quad\quad\quad\quad\quad\hfil\ignorespaces \tag{2} \\ \omit\hfil\displaystyle\mathop{\mathrm{subject\ to}}\hfil\ignorespaces & \omit\hfil\displaystyle{\mathbf{x}}_{0}={\mathbf{x}_{\text{o}}}\hfil\ignorespaces & \omit\hfil\displaystyle\quad\quad\quad\quad\quad\hfil\ignorespaces \tag{3} \\ \omit\hfil\displaystyle\hfil\ignorespaces & \omit\hfil\displaystyle{{\mathbf{x}}_{{t+1}}^{(r)}}= \mathbf{f}({{\mathbf{x}}_{t}^{(r)}},{\mathbf{u}}_{t})\hfil\ignorespaces & \omit\hfil\displaystyle\quad\quad\quad\quad\quad\hfil\ignorespaces \tag{4} \\ \omit\hfil\displaystyle\hfil\ignorespaces & \omit\hfil\displaystyle{\mathbf{u}_{min}^{(r)}} \leq {{\mathbf{u}}_{t}} \leq {\mathbf{u}_{\max}^{(r)}}\hfil\ignorespaces & \omit\hfil\displaystyle\quad\quad\quad\quad\quad\hfil\ignorespaces \tag{5} \\ \omit\hfil\displaystyle\hfil\ignorespaces & \omit\hfil\displaystyle\Delta{\mathbf{u}_{min}^{(r)}} \leq {{\mathbf{u}}_{t} - {\mathbf{u}}_{{t-1}}} \leq \Delta{\mathbf{u}_{\max}^{(r)}}\hfil\ignorespaces & \omit\hfil\displaystyle\quad\quad\quad\quad\quad\hfil\ignorespaces \tag{6} \\ \omit\hfil\displaystyle\hfil\ignorespaces & \omit\hfil\displaystyle{\mathbf{x}}_{t}^\top \mathbf{P}_l{\mathbf{x}}_{t} \geq {{d}_{l}}^2\hfil\ignorespaces & \omit\hfil\displaystyle\quad\quad\quad\quad\quad\hfil\ignorespaces \tag{7} \\ \omit\hfil\displaystyle\hfil\ignorespaces & \omit\hfil\displaystyle{{{\tilde{\mathbf{u}}}_{t}}^{(j)}} \in {\mathcal{O}^{(j)}}({\mathbf{x}}_{t};{\mathbold{\Upsilon}_{{t+1}}^{(j)}})\hfil\ignorespaces & \omit\hfil\displaystyle\quad\quad\quad\quad\quad\hfil\ignorespaces \tag{8} \\ \omit\hfil\displaystyle\hfil\ignorespaces & \omit\hfil\displaystyle{{\tilde{\mathbf{x}}}_{{t+1}}^{(j)}} = \mathbf{h}({{\tilde{\mathbf{x}}}_{t}^{(j)}}, {{{\tilde{\mathbf{u}}}_{t}}^{(j)}})\hfil\ignorespaces & \omit\hfil\displaystyle\quad\quad\quad\quad\quad\hfil\ignorespaces \tag{9} \\ \omit\hfil\displaystyle\hfil\ignorespaces & \omit\hfil\displaystyle\mathbf{w}_{{t+1}} = \mathbf{g}({\mathbf{x}}_{t};{\mathbold{\Upsilon}_{t}^{(1)}},\dots,{\mathbold{\Upsilon}_{t}^{(N)}})\hfil\ignorespaces & \omit\hfil\displaystyle\quad\quad\quad\quad\quad\hfil\ignorespaces \tag{10} \end{alignat}\] where all the constraints 4 -9 are defined for each time step, \(\forall t\in\bigl\{0,\dots,T-1\bigr\}\), and the constraints 8 -9 are defined for each human, \(\forall j\in \bigl\{1,\dots,N\bigr\}\). The robot collision constraint 7 is defined for each human agent and each static obstacle in the environment, \(\forall l\in \bigl\{1,\dots,N,\dots,N+M\bigr\}\). The constraints 8 are in the solution sets of constrained ORCA optimization problems, thus constituting the lower-level problems of the bilevel problem 1 .
Figure 2: Block diagram of JMID. Embeddings of agent histories (gray pedestrians and yellow robot) are generated for each agent in parallel by the encoder of T++. The embeddings, along with one sample per agent from a multivariate standard normal distribution, are passed in parallel through fully connected (FC) layers. Outputs are concatenated and passed through a Transformer. Denoising (dashed box) is repeated for K denoising steps to generate the final predictions.
We propose a human trajectory forecasting model that predicts future trajectories for multiple agents jointly. The model generates trajectory samples, \(\mathbold{\Upsilon},\) which are used to parameterize the the lower-level ORCA problems, \(\mathcal{O}\) in 1 . Our model extends MID [3], a diffusion-based trajectory forecaster that predicts future motions for each human individually. The original model is an encoder-decoder network with two components: a borrowed encoder from Trajectron++ (T++) [1] and a Transformer-based decoder that learns a reverse diffusion process [15]. To make a prediction, MID requires the history of the target human and its neighboring agents. For the rest of the paper, we refer to MID as iMID.
We extend iMID to Joint-MID (JMID), which produces joint samples for all humans in the scene, i.e. each sample contains a forecast of horizon \(T\) for all humans. While iMID models the future trajectory distribution of each agent independently, \({\mathbf{Y}^{(j)}} \sim p({\mathbf{Y}^{(j)}}|{\mathbf{x}}_{-H:0}), \forall j\in \{1,\dots,N\}\), JMID models the joint future trajectory distribution of all agents \({\mathbf{Y}} \sim p({\mathbf{Y}^{(1)}}, \dots, {\mathbf{Y}^{(N)}}|{\mathbf{x}}_{-H:0})\). A block diagram of JMID is illustrated in Fig. 2. To obtain joint samples, we first query the encoder in parallel to generate a latent embedding for each human in the scene. We pass these together with noisy future trajectories through fully-connected layers to generate new latent embeddings. For iMID, the new embeddings would have been passed through the transformer decoder in parallel. Instead, we concatenate them and then pass them to the transformer. We do not use positional encoding between embeddings so transformer outputs are equivariant to concatenation ordering. Finally, we split the denoised output of the transformer into one prediction per agent, generating the joint samples \(\mathbold{\Upsilon}\). To train the model, we follow [3] to maximize the variational lower bound. However, we alter the loss inputs’ dimensions to account for the model output’s varying dimension (i.e. number of humans being predicted) for each scene.
Figure 3: Example walkthrough of steps occurring implicitly within the SICNav-JMID optimization 1 demonstrating our method’s ability to select a mode of predictions to satisfy safety constraints. (a) Multicolored scatter illustrates JMID predictions with \(S=9\) future trajectory samples for human 1 (\({\mathbf{y}_1^{(1,s)}}\) purple to \({\mathbf{y}_3^{(1,s)}}\) cyan). (b) Purple star illustrates the weighted average for the intent \({{\hat{\mathbf{y}}}_{1}^{(1)}}\) found using 12 . (c) Solution from \(t=0\) to \(t=1\) with robot plan (blue line) and refined human prediction (green line) results in the robot going right and human left. (d)-(f) After the weights are updated using the human position \({{\tilde{\mathbf{x}}}_{1}^{(1)}}\) in 13 -14 , the process from (b) and (c) is repeated for subsequent steps resulting in the mode with the human going left being selected in the refined predictions.
Within the bilevel optimization problem, our method refines the JMID predictions to satisfy explicit constraints using the sequences of ORCA optimization problems [14] embedded in 8 . The ORCA optimization problem for agent \(j\) is formulated as, \[\label{eq:relaxed95orca95argmin} {\mathcal{O}^{(j)}}({\mathbf{x}}_{t};{{{\mathbold{\Upsilon}}_{{t+1}}^{(j)}}}) := \mathop{\mathrm{arg\,min}}_{{\mathbf{v}}} \\ \left\{ \begin{alignedat}{0} \omit\hfil\displaystyle\left\lVert{\mathbf{v}} - {{\mathbf{v}_{\text{intent}}}}({\mathbf{x}}_{t};{{{\mathbold{\Upsilon}}_{{t+1}}^{(j)}}})\right\rVert_{2}^2 \text{}:\text{}\hfil\ignorespaces\\ \omit\hfil\displaystyle\text{collision-avoidance}\hfil\ignorespaces \end{alignedat}\right\}\tag{11}\] where the optimization variable \(\mathbf{v}\in \mathbb{R}^2\) is the velocity of agent \(j\) for one time step. The non-collision constraints are derived using a velocity obstacle approach (see Sec. 4 of [14]) and define linear collision avoidance constraints between agents \(j\) and agent \(l\neq j\) and for each static obstacle. The constraints are linear with respect to \(\mathbf{v}\) and parameterized by \({\mathbf{x}}_{t}\). The cost is based on a velocity vector from the agent’s current position towards the agent’s intended position at the next time step, \({{\hat{\mathbf{y}}}_{t+1}}\), obtained from JMID samples, \[\mathbf{v}_{\text{intent}}({\mathbf{x}}_{t}; {{\mathbold{\Upsilon}}_{{t+1}}^{(j)}}) = \frac{1}{\delta t} \left({{\hat{\mathbf{y}}}_{t+1}^{(j)}} - \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \end{bmatrix}{{\tilde{\mathbf{x}}}_{t}^{(j)}}\right),\] where \(\delta t\) is the discretization period in 1 . To obtain this intended position, we find the weighted average of the human’s predicted positions at time \(t+1\) from the JMID samples, \[\label{eq:weighted95avg95predpos} {{\hat{\mathbf{y}}}_{t+1}^{(j)}} = \sum_{s=1}^{S} {{w}_{t}^{(s)}} {{\mathbf{y}}_{t+1}^{(j,s)}}.\tag{12}\] Note that there is one importance weight per joint sample and not per human.
In 9 , we use the solution of 11 , \({{{\tilde{\mathbf{u}}}_{t}}^{(j)}}\), to find the refined predicted position of the agent at the next time step, \({{\tilde{\mathbf{x}}}_{{t+1}}^{(j)}}\). This way, if the intended velocity for the agent is feasible with respect to constraints in 11 , then the optimal velocity found by solving the problem is the one that moves the agent to the estimated intended position at the next time step. Otherwise, the ORCA optimization problem will find the closest feasible position to the estimated intent, acting as a safety filter for the JMID predictions.
ETH | Hotel | Univ | Zara1 | Zara2 | Average | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
2-13 Method | ADE/FDE | SAD/SFD | ADE/FDE | SAD/SFD | ADE/FDE | SAD/SFD | ADE/FDE | SAD/SFD | ADE/FDE | SAD/SFD | ADE/FDE | SAD/SFD |
1-13 T++ | 0.53/0.92 | 0.59/1.09 | 0.15/0.23 | 0.20/0.35 | 0.28/0.56 | 0.52/1.10 | 0.29/0.57 | 0.38/0.78 | 0.16/0.31 | 0.27/0.57 | 0.28/0.52 | 0.39/0.78 |
AgentFormer | 0.45/0.75 | 0.48/0.79 | 0.14/0.22 | 0.24/0.46 | 0.25/0.45 | 0.62/1.31 | 0.18/0.30 | 0.28/0.56 | 0.14/0.24 | 0.30/0.62 | 0.23/0.39 | 0.38/0.75 |
iMID (DDIM) | 0.63/1.02 | 0.66/1.13 | 0.24/0.41 | 0.28/0.52 | 0.29/0.56 | 0.47/1.00 | 0.26/0.52 | 0.35/0.74 | 0.20/0.37 | 0.32/0.66 | 0.32/0.58 | 0.42/0.81 |
iMID (DDPM) | 0.54/0.86 | 0.58/0.96 | 0.21/0.35 | 0.26/0.47 | 0.29/0.54 | 0.52/1.10 | 0.24/0.43 | 0.35/0.70 | 0.19/0.33 | 0.33/0.67 | 0.29/0.50 | 0.41/0.78 |
JMID (DDIM) | 0.59/1.00 | 0.61/1.05 | 0.21/0.34 | 0.24/0.40 | 0.34/0.70 | 0.45/0.96 | 0.23/0.45 | 0.29/0.59 | 0.19/0.39 | 0.24/0.50 | 0.31/0.58 | 0.37/0.70 |
JMID (DDPM) | 0.53/0.91 | 0.54/0.93 | 0.18/0.27 | 0.20/0.32 | 0.32/0.66 | 0.47/1.01 | 0.22/0.41 | 0.30/0.58 | 0.17/0.32 | 0.23/0.47 | 0.28/0.51 | 0.35/0.66 |
At \(t=0\), we initialize the importance weights based on the likelihood estimates of the JMID samples, which we obtain by fitting KDEs [16]. For subsequent steps, in 10 , we compare the ORCA-refined positions, \({{\tilde{\mathbf{x}}}_{t}^{(j)}}\), found by solving 11 at the previous time step with the JMID sample’s predicted positions, \({{\mathbf{y}}_{t}^{(j,s)}}\), \[\label{eq:weight95update95unnormed} {{\tilde{w}}_{t}^{(s)}} = \exp\left(-\frac{1}{N\sigma}\sum_{j=1}^{N}\left\lVert{{\mathbf{y}}_{t}^{(j,s)}} - \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \end{bmatrix}{{\tilde{\mathbf{x}}}_{t}^{(j)}}\right\rVert_{2}^2\right),\tag{13}\] where \(\sigma\) is a hyperparameter. We normalize these weights, \({{\bar{w}}_{t}^{(s)}} = {{{\tilde{w}}_{t}^{(s)}}}/{\sum_{s=1}^{S} {{\tilde{w}}_{t}^{(s)}}},\) then incorporate the influence of the importance weights from the previous step, \({{{w}}_{{t-1}}^{(s)}}\), and normalize once again, \[\label{eq:weight95update95normed} {{w}_{t}^{(s)}} = \frac{{{w}_{t-1}^{(s)}}{{\bar{w}}_{t}^{(s)}}}{\sum_{s=1}^{S} {{w}_{t-1}^{(s)}}{{\bar{w}}_{t}^{(s)}}}.\tag{14}\] Through this procedure, the importance weights filter out prediction and robot plan combinations that would violate safety constraints.
Fig. 3 illustrates an example of refined human predictions and associated robot plan. For agent 1, we see its JMID prediction has two modes: right or left. As the robot jointly optimizes its plan and refines the human predictions through the horizon, at \(t=0\) the robot plan moves right. Following the mode where the human also going to the right would result in a collision with the robot’s plan (blue line). Since the ORCA problem avoids collisions, the refined human prediction moves closer to the left mode as \(t\) increases. As the variables in 2 are optimized, the robot plan and ORCA cause the human predictions to move into non-colliding prediction modes.
We validate open-loop performance on the popular ETH/UCY benchmark. Our evaluation follows the standard leave-one-scene-out cross-validation protocol with the standard history length of 3.2s and forecast length of 4.8s. Models: We compare the following methods,
JMID: our proposed method that generates joint predictions (i.e. each forecast sample contains future positions for all humans).
iMID [3]: the original MID model.
AgentFormer [2]: a state-of-the-art method for joint human trajectory forecasting by an autoregressive transformer in contrast to the diffusion-based architecture we use in JMID.
Trajectron++ (T++) [1]: a CVAE-based trajectory prediction model. Note that iMID and JMID adopt the encoder of T++.
For inference with the diffusion models, iMID and JMID, we use a DDPM [17] version of the models, where denoising is conducted for 100 steps. However, the inference process may not run in real time with DDPMs, thus we also evaluate the models with a widely-used accelerated sampling version of the models, DDIM [15], where denoising can be conducted with a variable number of steps, 2 in our case. We also use DDIM for real-time deployment.
Metrics: We evaluate predictive performance using,
Best-of-20 ADE: the lowest Euclidean distance between the ground truth trajectory and 20 samples obtained from the model, for each agent.
Best-of-20 FDE: the lowest Euclidean distance between positions at the final prediction step, \(T\), of the ground truth trajectory and 20 samples obtained from the model, for each agent.
Best-of-20 SADE [18]: the lowest average ADE across all agents in the scene for 20 joint-prediction samples.
Best-of-20 SFDE [18]: the lowest FDE across all agents in the scene for 20 joint-prediction samples.
While ADE and FDE evaluate the performance of the model at the individual agent level (i.e. marginal distributions of each agent’s future), SADE and SFDE evaluate the performance of the model at the scene level (i.e. joint distribution of all agents’ futures). SADE and SFDE only evaluate predictions for different agents that come from the same joint sample, whereas ADE and FDE can mix-and-match predictions across different joint samples. Our end-goal is to operate a robot among a crowd of humans, so we place a greater emphasis on the scene-level metrics because they evaluate the joint behavior and interactions between agents that a model has learned.
Implementation: Our JMID model uses the same number of layers and dimensions as iMID [3]. We train our model using the Adam optimizer with a learning rate of \(10^{-4}\) and a batch size of 64. Results and Discussion: Table 1 summarizes the results of our experiments. Our method, JMID, outperforms the baselines in the scene-level metrics on most test splits and on average. While AgentFormer performs the best in the individual-level metrics, as discussed previously, the scene-level metrics are more meaningful than the individual-level metrics because they evaluate how well a model has learned the interactions between pedestrians. Therefore, our proposed method performs the best in the most meaningful metrics.
Testing environment: We evaluate the performance of our planner in the CrowdNav environment, a commonly used crowd navigation simulation environment [8]. We generate 500 random initial and final positions for the humans in the environments with a total of \(N=3\) simulated humans. The simulated testing area is a \(1.75m\)-wide corridor with initial positions and goals on either end. In order to ensure a high density of agents in the corridor, we constrain the space using static obstacles, formulated as line-segments. The simulated humans are modeled as ORCA agents, however with randomly sampled attributes for radius, buffer (unobservable amount added to radius), collision avoidance time horizon, goal location, and maximum velocity. None of these attributes are observable to the robot navigation methods.
Navigation Methods: Each method uses a different approach to generate forecast samples of future human motion, then uses the Bilevel MPC method that we propose in 1 to simultaneously refine those predictions and optimize robot actions. The robot follows all methods in receding horizon fashion. The four navigation methods tested are as following,
SICNav-JMID: our proposed method that uses the joint prediction JMID samples with SICNav.
SICNav-iMID: a baseline that uses individual agent samples from MID=iMID [3] with SICNav.
SICNav-AF: joint prediction samples from the AgentFormer [2] model along with SICNav.
SICNav-CVG: a constant-velocity-goal baseline that projects the human’s current velocity forward in time to infer their goal. This baseline does not use any learning-based trajectory forecasting.
Implementation: The prediction models JMID, iMID, and AgentFormer are implemented in PyTorch and wrapped with ROS to deploy inference on the robot at 10 Hz. We pretrain the models on the JRDB dataset [19], which contains a robot agent in its scenes. The training details are the same as Sec. 4. We then fine-tune the models using data collected from training scenarios in the simulator. Performance metrics: To evaluate robot navigation performance, we adopt the following metrics from the literature [8]:
Success Rate: The rate of scenarios in which the robot is able to arrive at its goal within \(30s\). It measures the success of the robot’s navigation through the crowd.
Average Navigation Time: The average time to completion, which indicates the robot’s navigation efficiency.
Collision Frequency: The ratio of time steps the robot is in a collision state with another agent compared to the total number of time-steps in the 500 scenarios, which measures the safety of the robot.
Freezing Frequency: The frequency of time steps where the robot needed to reduce its velocity to zero, which measures how often the robot can potentially get stuck in a deadlock with the human agents (encounters the FRP).
Results and Discussion: Table 2 summarizes the results of our simulation experiments. SICNav-JMID outperforms the other methods in success rate, average navigation time, and freezing frequency. The results suggest the joint prediction of human trajectories with JMID is more effective than individual predictions with MID or the joint predictions from AgentFormer. The results also suggest the simple projection of human velocity with SICNav-CVG is not as effective as the learning-based methods.
Approach | Success Rate \(\uparrow\) | Avg Nav Time (s) \(\downarrow\) | Collision Freq. \(\downarrow\) | Frozen Freq. \(\downarrow\) |
---|---|---|---|---|
SICNav-JMID | \(\mathbf{1.00}\) | \(\mathbf{6.82}\) | \(0.006\) | \(\mathbf{0.20}\) |
SICNav-iMID | \({0.98}\) | \(7.54\) | \(0.007\) | \(0.38\) |
SICNav-AF | \(0.97\) | \(7.67\) | \({0.007}\) | \(0.43\) |
SICNav-CVG | \(0.95\) | \(8.86\) | \(\mathbf{0.005}\) | \(0.73\) |
Variant 1
Variant 2
Figure 4: Real-robot experiment scenarios with variants 1 (a) and 2 (b). The green star indicates the robot’s goal, numbered circles indicate human start positions, and arrows their goals. In a scenario with \(N\in \bigl\{1,2,3\bigr\}\) humans, positions greater than \(N\) are empty.
Robot-Human Proxemics
Figure 5: Box-and-whisker plots of robot navigation performance in terms of (a) efficiency and (b) proxemics in scenarios with varying numbers of humans. Median values are illustrated by the line in the box-and-whisker plots and mean values are indicated by the \(\times\). Our method (SICNav-JMID) (a) results in the most efficient robot navigation with respect to both metrics, and (b) spends the least time in the intimate space of humans compared to other learning-based methods. SICNav-CVG spends less time in intimate space because it causes humans to deviate from their shortest paths to goal at higher human densities.
Figure 6: We illustrate two similar scenarios with SICNav-CVG (left) and SICNav-JMID (right) in the first 4s of the scenarios. The top images (a, b) show the top-down snapshot graphs of the scenarios with dashed lines (gray for humans, green for robot) indicating each agent’s shortest-path-to-goal. The bottom images (c, d) show the video of each scenario, with dashed arrows indicating paths traversed by each agent. Transparent images illustrate older frames of the video. The SICNav-CVG robot produces solutions that turn towards the blue-hat human, causing both the robot and the human to deviate from their optimal paths, while the JMID predictions allow SICNav-JMID to produce solutions where both the robot and humans stay closer to their shortest-paths-to-goal..
Testing Environment: We also evaluate the performance of our approach on a real robot in a controlled environment similar to [8]. We use a Clearpath Jackal differential drive robot pictured in Fig. 1. The robot weighs \(20kg\) and measures \(46cm\) W \(\times 60cm\) L \(\times 50cm\) H. Our operating environment is an indoor space measuring \(5m \times 9m\). We conduct a total of 240 runs with a varying number of humans \(N\in \bigl\{1,2,3\bigr\}\) and two configuration variants, illustrated in Fig. 4 (i.e. six scenarios). For each run of a particular scenario, all human agents were instructed to move from their labelled starting positions to the opposite corner of the room (illustrated by the arrows). In variant 1 (Fig. 4 (a)) all humans start from the opposite side of the room from the robot while in variant 2 one human moves in the same direction as the robot (Fig. 4 (b)). These scenarios are designed to cause a conflict zone in the middle of the room where all agents need to interact. We repeat 10 runs per variant and navigation method being tested. We use a VICON motion capture system to measure the positions of the robot and human agents in the environment and use an EKF and KFs to track the positions and velocities of the robot and humans, respectively.
Navigation Methods: We evaluate the same methods as in the simulation experiments: SICNav-JMID, SICNav-iMID, SICNav-AF, and SICNav-CVG.
Implementation: We use the ROS 1 middleware. The robot accepts linear and angular velocity commands to track by its low level controller. We run the algorithms on an 24-core Lenovo Legion laptop using Intel Core i9-14900HX @ 2.2 GHz CPUs and an NVIDIA GeForce RTX 4080 GPU. The prediction models are fine-tuned as in the simulation experiments, however with data collected from the robot. We wrap the inference code for the models in a ROS node that publishes the most recent prediction samples at 10 Hz. We implement a Python ROS node to run the controller in real-time at 10 Hz with CasADi as the modeling language and the Acados [20] solver for optimization. We discretize time at \(\delta t = 0.25s\) and use an MPC horizon of \(T= 2s\).
Results and Discussion: We evaluate the robot’s navigation efficiency using two metrics: the deviation from the shortest path to the goal and the robot’s navigation time (i.e. time to reach the goal in each run). Fig. 5 (a) shows the box-and-whisker plots of the robot navigation efficiency metrics for the different navigation methods. We observe for both metrics, in scenarios with \(N= 1\), all methods behave similarly. However, as the number of humans increases to \(N\in \bigl\{2,3\bigr\}\), using SICNav-JMID results in both a lower deviation from the shortest path to the goal and a lower navigation time compared to the other methods.
We also analyze the interaction patterns between the robot and humans when using the different navigation methods. First, we analyze the effect of the robot’s navigation on the humans by looking at the deviation of the humans from their shortest paths to the goal. Fig. 5 (b) shows the box-and-whisker plots of the humans’ deviation from their shortest paths to the goal for the different navigation methods (right). We observe in all three scenarios, the learning-based methods SICNav-JMID, SICNav-iMID, and SICNav-AF perform similarly on this metric. However, in the scenarios with \(N\in \bigl\{2,3\bigr\}\), SICNav-CVG causes a higher deviation of the humans from their shortest paths to the goal compared to the learning-based methods.
Second, to analyze proxemics, we look at the duration of time the robot spends within the intimate space [21], i.e. when the robot is within \(0.45m\) of a human. Fig. 5 (b) illustrates the box-and-whisker plots of the duration of time the robot spends in the intimate space of a human per run for the different navigation methods (left). In the scnarios with \(N= 1\), we see all methods spend almost no time in the personal space of the human. In the scenarios with \(N\in \bigl\{2,3\bigr\}\), we observe SICNav-JMID spends less time in the intimate space of the humans compared to the other learning-based methods, SICNav-iMID and SICNav-AF. However, the non-learning-based baseline SICNav-CVG spends the least time in the personal space of the humans. Recalling the results from Fig. 5 (b), we see that this lower amount of time spent in the personal space of the humans comes at the cost of a higher deviation of the humans from their shortest paths to the goal.
We interpret this result as meaning SICNav-CVG causes the humans to run away or avoid getting close to the robot. We illustrate an example occurrence of this phenomenon from the runs of the experiments in Fig. 6. We analyze two similar scenarios with SICNav-CVG (6 (a), 6 (c)) and SICNav-JMID (6 (b), 6 (d)). In the first snapshots for both methods, we observe the blue-hat agent’s heading (blue arrow) deviates more from their optimal path (gray dashed line) than the green-hat agent’s heading. With CVG predictions (6 (a), 6 (c)), the blue hat agent is predicted to deviate far from their optimal path. As a result, at the next control step (middle snapshot), the SICNav-CVG robot turns toward the blue-hat agent forcing the blue-hat agent to deviate even further from their optimal path in the third control step (final graph in the snapshots). In contrast, with JMID (6 (b), 6 (d)) predictions, at the second control step (middle graph in the snapshots), the JMID model predicts the blue-hat agent intends to return toward their optimal path, despite the agent’s heading pointing away from the path. As such, the robot’s MPC plan does not turn towards the blue-hat agent, allowing the robot and blue-hat agent to stay close to their optimal paths.
Figure 7: Box-and-whisker plots of prediction performance in the closed loop experiments. We evaluate for all pedestrians ADE over the MPC horizon (left) and prediction error at the next step the robot is used (right). The learning-based methods outperform the CVG baseline in terms of both metrics. Surprisingly, despite outperforming all methods in terms of navigation performance, the JMID model does not consistently outperform the other learning-based methods in terms of the prediction metrics.
Finally, we evaluate prediction performance in the closed-loop experiments. Fig 7 illustrates the ADEs over the 2s horizon (left) as well as the predictive error at the next control step (right). To describe the latter metric, recall we use the MPC solutions in receding horizon fashion, thus the action from a solution obtained at time \(t\) is only used until the next solution becomes available at \(t+\delta t\). We evaluate the predictive error at this moment as it is the latest time the solution is used by the robot. We see for both metrics, the learning-based methods outperform the CVG baseline. Surprisingly, although the trajectory prediction results from JMID do not consistently outperform the other learning-based methods in every scenario, JMID nonetheless achieves the best closed-loop navigation performance. This observation highlights the importance of joint prediction in crowd navigation, yet also suggests the possible mismatch between open-loop prediction metrics and closed-loop robot navigation performance, as also shown in [4].
In this work we propose a novel method for joint human trajectory forecasting and robot navigation in crowded environments. We introduce the JMID model, which generates joint trajectory forecasts for all agents in the scene. We then propose a Bilevel MPC method that uses these forecasts to simultaneously refine the predictions for safety and optimize the robot’s actions. We validate the open-loop performance of our prediction model on the ETH/UCY benchmark and show it outperforms state-of-the-art methods in terms of joint forecasting. We also conduct real-robot experiments to evaluate the performance of our method in a controlled environment. Our results show our method outperforms state-of-the-art methods in terms of robot navigation efficiency and safety.
Sepehr Samavi, Anthony Lem, Sirui Chen, Qiao Gu, Angela P. Schoellig, and Florian Shkurti are with the University of Toronto Robotics Institute and the Vector Institute for Artificial Intelligence, Toronto, Canada. Angela P. Schoellig is also with the Technical University of Munich and the Munich Institute for Robotics and Machine Intelligence (MIRMI), Munich, Germany. Fumiaki Sato and Keijiro Yano are with Konica Minolta Inc., Japan. (Corresponding author: Sepehr Samavi sepehr@robotics.utias.utoronto.ca)↩︎