QuAD: Query-based Interpretable Neural Motion Planning for Autonomous Driving

Sourav Biswas1\(^{*, 1, 2}\), Sergio Casas\(^{*, 1, 2}\), Quinlan Sykora\(^{1, 2}\), Ben Agro\(^{1, 2}\), Abbas Sadat\(^{1}\), Raquel Urtasun\(^{1, 2}\)
\(^{1}\)Waabi, \(^{2}\)University of Toronto
{sbiswas, sergio, qsykora, bagro, asadat, urtasun}@waabi.ai


A self-driving vehicle must understand its environment to determine the appropriate action. Traditional autonomy systems rely on object detection to find the agents in the scene. However, object detection assumes a discrete set of objects and loses information about uncertainty, so any errors compound when predicting the future behavior of those agents. Alternatively, dense occupancy grid maps have been utilized to understand free-space. However, predicting a grid for the entire scene is wasteful since only certain spatio-temporal regions are reachable and relevant to the self-driving vehicle. We present a unified, interpretable, and efficient autonomy framework that moves away from cascading modules that first perceive, then predict, and finally plan. Instead, we shift the paradigm to have the planner query occupancy at relevant spatio-temporal points, restricting the computation to those regions of interest. Exploiting this representation, we evaluate a candidate trajectory around key factors such as collision avoidance, comfort, and progress for safety and interpretability. Our approach achieves better highway driving quality than the state-of-the-art on high-fidelity closed-loop simulations.

1 Introduction↩︎

Self-driving vehicles (SDVs) strive to reach their destinations safely and comfortably by analyzing their surroundings, envisioning potential future scenarios, and using this information to determine a plan of action to carry out. This is repeated with every new observation.

The majority of autonomy frameworks are object-based [1][4], which implies detecting a discrete set of objects, typically obtained by thresholding output confidence scores from an object detector, predicting a small set of hypothetical future trajectories, and finally planning a safe trajectory. However, this approach loses information about the scene from thresholding and has limited representation of future object uncertainty. Furthermore, the expressivity of the future trajectory forecasts is limited, as keeping the number of hypotheses low is crucial for real-time inference.

Alternatively, sensor-to-plan imitation learning frameworks avoid reasoning about individual objects by learning to map sensor data directly to plans [5][8]. These learned policies are typically brittle to distributional shift since supervision is only provided in states visited by the expert during training, so the policy never learns to recover from its own mistakes [9]. Moreover, the decisions are not easy to interpret or explain, which is important for system validation and verification.

To tackle these limitations, occupancy-based approaches [10][13] have proposed a more interpretable object-free paradigm. Occupancy describes the probability a point in space and time is occupied by any traffic participant. This enables planners to decide on consistent and effective actions with respect to this interpretable representation, serving as an explanation of their decision. This paradigm is also more robust to shifts in the ego state distribution due to the supervision of the intermediate representations, making it better suited for closed-loop deployment. However, representing the occupancy in dense spatio-temporal grids over a large region of interest is resource-intensive since a high resolution is needed to attain accurate plans.

Figure 1: The set of potential plans for the ego vehicle, with time into the future colored from blue to green. Our neural motion planner, QuAD, builds upon two observations: (1) the plans’ reachable space is much smaller than the full spatio-temporal volume and (2) many ego states throughout the trajectories are in close proximity to each other.

We propose QuAD, an interpretable, effective and efficient neural motion planner. QuAD diverges from prior works that first perceive, then predict, and finally plan. Instead, our unified autonomy first generates candidate trajectories respecting kinematic constraints and traffic rules, and then queries an implicit occupancy model only at spatio-temporal points needed for planning, which is used to rank the safety of the candidates. Fig. 1 shows an example of the candidate trajectories, which we can see only occupy a small portion of the spatio-temporal volume prior works predict. Moreover, we note that many candidate trajectories heavily overlap, which motivates us to quantize the spatio-temporal query points to reduce redundant computation.

Through extensive evaluation we show that QuAD is able to achieve better closed loop performance in a state-of-the-art highway driving simulator while attaining better runtime than competitive baselines.

Figure 2: QuAD inference. Given the ego state and the map, the trajectory sampler generates candidate plans. These plans are converted into query points that cover the relevant areas around the ego vehicle future positions. Leveraging multi-sweep LiDAR and HD map, a scene encoder builds a BEV latent representation which we then use to query an implicit occupancy model. Finally, we gather the occupancy relevant to each trajectory, cost them, and select the one with the lowest cost.

2 Related Work↩︎

This section reviews prior approaches to end-to-end autonomy from sensor data.

5 1ex 1ex .2ex 0ex Object-based autonomy Most previous approaches have employed object-based reasoning using a three-stage pipeline: (1) object detection [14], [15] and tracking [16], [17], (2) trajectory prediction based on past tracks [18][21], and (3) motion planning to decide which action the self-driving vehicle should take [22][25]. This paradigm faces three key challenges [11], [26]: (i) the uncertainty from detection and tracking is not propagated to downstream prediction, (ii) the predicted future distributions tend to be overly simple in practice for computational tractability in scenes with many actors, and (iii) the planner is blind to any objects that fall below the confidence threshold used to determine whether an object exists or not in detection. Several works [27][31] have addressed (i) by optimizing jointly through multiple stages. The recent work of PlanT [4] tackles (ii) by learning a Transformer that plans a trajectory for the ego vehicle from object and route tokens coming from the detector and the map, skipping the second stage of trajectory prediction. However, (iii) is fundamentally difficult to address in this autonomy paradigm as the core assumption is that there is a finite set of objects with a clear boundary between positive and negative instances.

5 1ex 1ex .2ex 0ex Sensor-to-plan autonomy Autonomy approaches in this family learn to map sensor data directly to plans, without any other intermediate interpretable representations. The pioneering work of ALVINN [32] proposed the use of a single neural network that directly outputs driving control commands. Following the triumph of deep learning, direct control-based methods have made strides through the implementation of more expressive networks, advanced sensors, and scalable learning methods [5], [8], [33][36]. To leverage prior knowledge about vehicle kinematics, recent methods such as NMP [6] and Lift-Splat-Shoot [7] leverage analytical trajectory samplers to propose candidate plans, reducing the learning problem to predicting the cost of the trajectories. Despite not requiring any manual annotation for training, using only expert demonstrations as supervision tends to result in policies with stability and robustness issues, as they become highly vulnerable to distributional shift [9], [12], [37]. Note that some methods in this family [6], [8] perform auxiliary tasks for additional supervision, but the auxiliary outputs are not used during inference by the planner and therefore no consistency between those outputs and the plans is guaranteed. Thus, the planner’s decisions cannot be explained by those auxiliary outputs.

5 1ex 1ex .2ex 0ex Occupancy-based autonomy These methods predict spatio-temporal occupancy from sensor data, and plan a trajectory that is safe with respect to the predicted occupancy without considering individual actors. The occupancy is then used to assess the risk of a trajectory by measuring the probability of collision [10][12]. This approach has proven very effective and robust to shifts in state distribution during deployment. Different architectures have been proposed to predict occupancy. P3 [11], MP3 [12], FIERY[38] and OccFlow [13] all propose to predict similar variations of 3D spatio-temporal occupancy grids using convolutional neural networks (CNN). However, this relies on the CNN receptive field being large enough to “transport” the occupancy from where the evidence is in the sensor data, to where the object would be at the end of the forecasting horizon. Moreover, predicting a dense 3D grid is computationally demanding and unnecessary. To tackle these two shortcomings, ImplicitO [39] proposes an implicit occupancy model that can be queried at continuous spatio-temporal points by leveraging deformable attention [40], yet it does not propose how to make use of this model for driving. In this work, we propose a unified autonomy framework which utilizes a state-of-the-art implicit occupancy architecture to obtain a strong understanding of the scene along with a query point quantization strategy to attain superior plans in practical runtimes without degrading driving quality.

3 Query-based Autonomous Driving (QuAD)↩︎

The goal of our motion planner (QuAD) is to find the best trajectory plan to execute given a short history of LiDAR observations, \(L\), and a high-definition (HD) map, \(M\). As shown in 2, with every new sensor observation, QuAD plans a trajectory by optimizing the following objective over a finite set of trajectory candidates \(\mathcal{T}\) generated based on the current ego kinematic state and the map: \[\begin{align} \tau^* &= \underset{\tau \in \mathcal{T}}{\operatorname{arg}\,\operatorname{min}}\; J\big(\tau, \mathbf{Z}, M \big), \end{align}\] where \(\mathbf{Z} \in \mathbb{R}^{H \times W \times C}\) is a bird’s eye view (BEV) latent representation of the environment extracted from a voxelized multi-sweep LiDAR and a raster map by a learned scene encoder.

We define our objective function \(J\) as a linear combination of \(N\) explainable costs \(f_i\), weighted by learnable coefficients \(w_i\) \[\begin{align} J\big(\tau, \mathbf{Z}, M \big) &= f\big(\tau, \psi(\mathcal{Q}_\tau, \mathbf{Z}), M \big) \\ &= \sum_i^N w_i f_i\big(\tau, \psi(\mathcal{Q}^{(i)}_\tau, \mathbf{Z}), M \big). \end{align}\] The learned occupancy model \(\psi\) provides flexibility, expressivity and interpretability in our planner. It models occupancy implicitly and can be queried for occupancy at any set of continuous spatio-temporal locations \(\mathcal{Q}\). More precisely, \(\psi(\mathbf{q}, \mathbf{Z})\) outputs the probability that the continuous spatio-temporal location \(\mathbf{q}=(x, y, t)\) lies within an object surface in BEV. Given this implicit occupancy representation, our planner can understand complex environments by having the different costs \(f_i\) reason about occupancy at different spatio-temporal locations \(\mathcal{Q}^{(i)}_\tau\).

3.1 Trajectory Sampling and Query Points↩︎

5 1ex 1ex .2ex 0ex Trajectory Sampler We consider a set of candidate trajectories \(\tau \in \mathcal{T}\) starting from the current ego state and going 5 seconds into the future. The ego state consists of the SDV current location in BEV, speed, and steering. A trajectory \(\tau\) contains a sequence of kinematic bicycle model [41] states for each time step in the planning horizon. It is crucial that the set of sampled trajectories, while small enough to enable real-time computation, encompasses a range of maneuvers, including lane following, lane changes, nudges to avoid encroaching objects, hard brakes. To accomplish this efficiently, we opt for a sampling approach that takes into account the lane structure. More precisely, we use the lane centerlines from the HD map as base paths and sample longitudinal and lateral profiles in Frenet frame [24], [42]. As a result, the sampled trajectories align with appropriate lane-based driving, while incorporating lateral variations.

5 1ex 1ex .2ex 0ex Points of Interest The goal of these points of interest is to cover the relevant areas around the ego vehicle throughout the candidate plans. For every trajectory \(\tau\) and time step \(t\) (sampled every 0.5s), we consider points within the ego bounding box as well as points forwards, backwards, and to the sides of the ego. For simplicity, we sample a uniform grid of points within the ego box at a certain resolution, and simply shift this grid forward/backward by the length of the ego vehicle and right/left by the width to obtain all the points of interest. This process is shown in 2 for the final time step of two trajectories.

5 1ex 1ex .2ex 0ex Point Quantization Since the points of interest are sampled along and around the trajectories \(\mathcal{T}\), which are generated to ensure coverage of the available actions, the distance between multiple pairs of query points \(||\mathbf{q}_j - \mathbf{q}_k||_2\) from different trajectories can be very small, as depicted in Fig. 1. We leverage this observation to improve the efficiency of our planner by quantizing the query points with a certain spatial resolution and only querying \(\psi\) with the unique set of points after quantization. We tune the quantization resolution to maximize efficiency without sacrificing driving performance. Empirically, we find this to reduce the number of queries by two orders of magnitude, from millions to tens of thousands.

3.2 Implicit Occupancy Model↩︎

Our implicit occupancy model consists of a scene encoder that provides a BEV latent representation of the environment \(\mathbf{Z}\), and an implicit occupancy decoder \(\psi\) that attends to the latent scene representation to predict occupancy probability at query points. By maintaining an intermediate occupancy representation, we can make planning decisions which are interpretable and consistent with the occupancy predictions.

5 1ex 1ex .2ex 0ex Scene Encoder We use as input a sequence of LiDAR point clouds, containing the 5 latest LiDAR sweeps. Each sweep contains a set of points with coordinates \((p_x, p_y, p_h)\), where the \((p_x, p_y)\) is the point location in the SDV coordinate frame while \(p_h\) is the height over the ground. We then voxelize [43] the LiDAR in BEV to obtain a 3D tensor where the different sweeps are concatenated along the height dimension. Since the behavior of other traffic participants is highly influenced by the road topology, we make use of the prior knowledge stored in the HD map to provide important cues about the regions they might occupy and how they could move. More precisely, we raster the polylines representing the lane centerlines in the HD map as a BEV binary map with the same spatial resolution as the LiDAR. Following [28], [39], our scene encoder uses two convolutional stems for processing the voxelized LiDAR and map raster respectively. The resulting feature maps are concatenated along the channel dimension and passed through a lightweight Feature Pyramid Network [44] to get a fused BEV feature map \(\mathbf{Z}\) containing information from both modalities at half resolution of the inputs. Intuitively, the latent scene embeddings \(\mathbf{Z}\) contain local geometry, motion and semantic descriptors from the area within the receptive field of our encoder.

5 1ex 1ex .2ex 0ex Implicit Occupancy Decoder Leveraging the latent scene embedding \(\mathbf{Z}\), our implicit occupancy decoder predicts the occupancy probabilities at a set of query points \(\mathcal{Q} = \{\mathbf{q}_j\}_{j \in [1, |Q|]}\). In more detail, each query point \(\mathbf{q}=(x,y,t)\in\mathbb{R}^3\) denotes a spatio-temporal point in BEV at a future time \(t\). We exploit ImplicitO [39], a recently proposed state-of-the-art architecture for occupancy prediction. Given a query point, it bilinearly interpolates a latent vector at the query point BEV location \((x, y)\), and uses it to predict locations to attend using deformable attention [40]. With the attended latent vector, an MLP decoder predicts occupancy for a particular query point. This simple architecture has a great advantage over prior methods that predict occupancy grid maps using CNNs: it can attend anywhere in the BEV latent instead of being limited by the CNN’s receptive field to a small region. This is crucial since vehicles can travel very fast, so to accurately predict the occupancy into the future (e.g., at \(t=5\)s), the model needs to find the original LiDAR evidence at \(t=0\)s, which may be 150-200 meters behind. The specific query points \(\mathcal{Q}\) used during inference depend on the trajectory sampler and costs explained next.

3.3 Trajectory Costing↩︎

In order to plan an effective trajectory, we must consider various factors of driving such as collision likelihood, traffic violations, goal location, and comfort. To meet this desiderata in a way that the decisions are explainable, we consider a set of interpretable costs. We split costs \(f_i\) into agent-agnostic costs and agent-aware costs. At a high level, agent-agnostic costs describe the comfort, rule compliance, and progress of a candidate trajectory. Agent-aware costs evaluate the safety of the trajectories with respect to other agents using the outputs of our implicit occupancy model \(\psi\) at the query point locations \(\mathcal{Q}\). In the next paragraph we describe our agent-aware costs at a high-level.

A collision cost considers the maximum probability of collision for each time step \(t\) of each trajectory candidate \(\tau\). Specifically, we gather the occupancy at the query points within the ego bounding box \(B_\tau^t\), and take the maximum probability. For each trajectory, we aggregate the maximum probabilities over time steps with a cumulative sum to further penalize trajectories that collide earlier on. A longitudinal buffer cost penalizes trajectories with agents too close in front or behind the ego vehicle by gathering the occupancy at those locations. We apply a linear decay to the cost based on the distance with respect to the ego. Similarly, lateral buffer cost penalizes trajectories that remain in close lateral proximity to other agents in the scene.

3.4 Learning↩︎

We optimize our motion planner in two stages. We first train the implicit occupancy model to learn to perceive and forecast. In a second stage, we freeze the occupancy model and train the cost aggregation weights \(\{w_i\}\) to imitate an expert driver. This two-stage training maintains the interpretability of the occupancy intermediate representation and allows the cost aggregation weights to train with stable occupancy predictions.

5 1ex 1ex .2ex 0ex Occupancy An advantage of having intermediate representations is that one can use much richer and denser supervision to perceive the world and understand its dynamics rather than just imitating the expert trajectory. For each continuous query point \(\mathbf{q} \in \mathcal{Q}\), occupancy is supervised with binary cross entropy loss. Following [39], we train with a batch of continuous query points \(\mathcal{Q}\), uniformly sampled across the spatio-temporal volume.

5 1ex 1ex .2ex 0ex Costing We train the cost aggregation such that the behavior of our planner imitates that of an expert. Because selecting the trajectory with the minimum cost from a discrete set is not a differentiable process, we use the max-margin loss to penalize trajectories that are either unsafe or have a low cost but differ significantly from the expert driving trajectory following prior works [11], [24]. Intuitively, this loss incentivizes the expert trajectory \(\tau_e\) to have a smaller cost \(J\) than all other trajectories. More precisely, our objective is

\[\begin{align} \mathcal{L}_w=\max _{\tau}\left[\Delta J_r\left(\tau, \tau_e\right)+l_{\mathrm{im}}+\sum_t\left[\Delta J_c^t\left(\tau, \tau_e\right)+l_{\mathrm{c}}^t\right]_{+}\right]_{+}, \end{align}\] where \(\Delta J\left(\tau, \tau_e\right) = J\left(\tau_e\right)-J(\tau)\) is the difference between the cost of the expert trajectory \(\tau_e\) and the candidate trajectory \(\tau\); \(J_c^t\) is the collision cost at a particular time step into the future, and \(J_r\) are the rest of the costs, aggregated; \(\left[\right]_+\) represents the ReLU function; and \(l_{\mathrm{im}}\) and \(l_{\mathrm{c}}^t\) are the imitation and safety margins, respectively. Note we omit \(\mathbf{Z}, M\) from \(J\) for brevity. The imitation margin is simply the distance between the trajectory waypoints in \(\tau_e\) and \(\tau\), and the safety margin is whether the candidate trajectory \(\tau\) collides with any ground-truth object.

As a way to mitigate distribution shift from open-loop learning to closed-loop deployment, we exploit dataset aggregation [9] by combining the initial set consisting of expert demonstrations on expert states with another dataset generated with states visited by our learned autonomy model.

4 Experiments↩︎

Table 1: [Safety-focused set] Closed-loop simulation results
Mission Safety and Compliance Progress, Consistency and Comfort
(l5pt)1-1 (l5pt)2-2 (l5ptr5pt)3-9 (l5ptr5pt)10-13 GSR\(\uparrow\) ECR\(\downarrow\) PCR\(\downarrow\) MinTTC\(\uparrow\) TVR\(\downarrow\) Progr.\(\uparrow\) L2E\(\downarrow\) P2P\(\downarrow\) Jerk\(\downarrow\)
(l5ptr5pt)5-8 \(p10\) \(<1s\) \(<2s\) \(<5s\)
Expert 91.6% 0.0% 2.7% 4.64 0.0% 1.1% 17.9% 6.8% 454.2 0.0 15.1 0.30
PlanT [4] 23.7% 10.0% 2.9% 3.42 8.4% 8.9% 19.5% 63.7% 331.7 95.0 42.8 1.03
CIL [5] 0.0% 16.3% 5.6% 1.55 8.4% 15.3% 36.8% 100.0% 64.6 298.2 19.7 1.38
NMP [6] 51.6% 14.2% 21.4% 0.50 10.5% 16.3% 62.1% 18.4% 337.6 133.3 44.2 1.21
P3 [11] 76.3% 6.8% 8.3% 3.44 4.7% 5.3% 29.5% 11.6% 436.1 37.5 14.7 0.17
OccFlow [13] 60.5% 30.5% 28.0% 0.40 24.2% 28.4% 44.2% 36.8% 385.3 55.2 12.8 0.14
QuAD (Ours) 84.7% 2.1% 1.0% 4.67 1.6% 1.6% 14.2% 8.4% 430.6 36.6 15.6 0.29

In this section, we first compare QuAD to state-of-the-art autonomy models, measuring their ability to drive in closed-loop in both safety-focused and canonical highway driving. Then, we analyze trade-off between runtime and driving quality of different methods, and ablate the importance of QuAD’s query point quantization. Finally, we show qualitative results.

4.1 Data↩︎

We utilize our high-fidelity end-to-end simulator to generate datasets for open-loop evaluation and training as well as a closed-loop benchmark. Our simulator can generate both LiDAR sensor data [45] as well as intelligent actor behaviors [46], [47]. We generate several datasets and benchmarks with distinct purposes.

5 1ex 1ex .2ex 0ex Safety-focused set We generate a dataset composed of \(\sim900\) safety-focused scenarios, split into \(\sim700\) for training and \(\sim200\) for evaluation. The scenarios in this set contain scripted interactions such as cut ins, blocked lanes, agents merging onto the highway, ego merging onto the highway through an off-ramp, and exiting through an off-ramp, aggressively slow/fast moving actors. The ego is provided with a variety of mission routes including keep lane, lane change, or lane merge. These scripted interactions are parameterized by highly controllable parameters such as time-to-collision, time-to-arrival to a merge point, initial speeds. Training and validation utilize non-overlapping parameter values to ensure generalization is tested.

5 1ex 1ex .2ex 0ex Canonical driving set Contains a total of \(\sim1700\) scenarios, \(\sim1400\) for training and \(\sim300\) for evaluation, each lasting around 20 seconds. This set is generated from a set of randomly distributed discrete and continuous parameters which control the map (using a mix of simulated and real maps to get exposure to diverse curvatures, speed limits, number of lanes, and distinct topologies), ego starting location, initial agent conditions such as speed, location, and maximum acceleration limits to ensure a diverse distribution of scenarios, and a variety of vehicle types. In this set, the ego vehicle does not have a particular goal or mission (in contrast to safety-focused set), making the task easier. Similar to the safety-focused set, training and evaluation are non-overlapping by ensuring different random parameter values.

5 1ex 1ex .2ex 0ex Open-loop vs. closed-loop Both safety-focused and canonical scenario sets can be used for open-loop as well as closed-loop execution. In open-loop, an expert planner is in charge of driving and the policy under train/test is only proposing plans that do not get executed and hence do not affect the next state of the world. The expert is a privileged planner that has access to the internal state of the simulation, receiving ground truth current states for other actors in the world as well as their most recent future plan. In contrast, during closed-loop simulation, the planner policy under test is driving the ego vehicle. In other words, the states the ego vehicle visits are dependent on the policy under test. We rely on closed-loop simulation as our ultimate benchmark, since it best captures the driving performance a planner would have in the real-world, where the planner actions affect the environment.

4.2 Experimental setup↩︎

5 1ex 1ex .2ex 0ex Baselines There are three main families of baselines we consider. (1) Object-based autonomy composed of separate object detection and planning modules. We compare against PlanT [4], a recent method that leverages an object detector and then a Transformer-based planner that reasons about the detected objects together with the route. (2) Sensor-to-plan autonomy approaches including direct regression of the plan through Conditional Imitation Learning (CIL) [5] as well as NMP [6], which learns a cost-prediction network to rank samples. (3) Occupancy-based autonomy that leverages grid-based occupancy and a sample-based planner [24]. We consider OccFlow [13] and P3 [11] as two possible architectures to predict temporal occupancy grids.

For P3 and OccFlow we train in two stages. In the first stage, we train the occupancy module on nominal scenarios with dense traffic. In the second stage, we train the planner on safety focused scenarios. For the remaining baselines PlanT, CIL, and NMP since they are trained in a single stage we train the entire module on the combined set of canonical and safety training scenarios.

5 1ex 1ex .2ex 0ex Metrics We consider a comprehensive set of metrics to provide a holistic view of autonomy performance.

To evaluate safety we propose the following metrics. Execution Collision Rate (ECR) measures the percentage of scenarios with a collision. Plan Collision Rate (PCR) calculates the collisions of our planned trajectory with respect to the simulated actors’ future trajectories. Minimum Time-To-Collision (MinTTC) computes the minimum time buffer (in seconds) –across a scenario– before a collision occurs, taking into account the ego plan and the actor plan. To better illustrate the worst-case we compute percentile 10 (\(p10\)), and to get an idea of different levels of risk exposure we report cumulative TTC buckets (\(<1s\), \(<2s\), \(<5s\)), in decreasing order of severity. To measure compliance, we use Traffic Violation Rate (TVR) to show the percentage of scenarios where the ego vehicle violates a lane boundary or speed limit, goes off-road, or collides.

On top of safety and compliance, it is also important driving makes sufficient progress towards the goal and is smooth. We compute progress within map (Progr.) as the meters traveled without going off-road. L2 distance to Expert (L2E) measures distance from the expert demonstrations. In closed-loop, this corresponds to the distance between the executed trajectories. Plan-to-plan consistency (P2P) computes the distance over the finite horizon planned trajectories for consecutive plans. Jerk measures the level of discomfort of the planned trajectories. Note secondary metrics are only meaningful when primary metrics are similar, since avoiding collisions naturally causes higher jerk and plan-to-plan distance thus safety and compliance must be prioritized at all times. Finally, for self-driving vehicles it is critical to reach a target location. To measure this, we propose Goal Success Rate (GSR) as the ratio of scenarios where the ego vehicle reaches the goal without colliding or violating traffic rules.

4.3 Results↩︎

5 1ex 1ex .2ex 0ex Comparison against state-of-the-art Table 1 shows our main benchmark results, where we evaluate the baselines and our method in closed-loop, on the safety-focused highly-interactive scenarios. We find our method is much safer than the baselines, as showcased by lower ECR, lower PCR as well as higher TTC. Moreover, it complies better with the rules of the road as showcased by our lower TVR. On top of being safer and more compliant, QuAD achieves the highest GSR indicating the maneuvers we pick at a behavioral level are effective. On the secondary metrics our model ranks high in progress, best in expert imitation, and has fairly consistent plans over time. While we do observe higher jerk than some baselines, our jerk is closer to that exhibited by the expert.

    \begin{tabularx}{\textwidth}{l r r r r r r r}
         &ECR\da &PCR\da &MinTTC\ua &TVR\da &Progr.\ua &L2E\da &Jerk\da \\
        &      &       &$p10$     &       &          &       & \\
        \mathrm{\small Expert} &0.0\% &0.0\% &10.00 &2.2\% &317.6 &0.0 &0.40 \\
        \mathrm{\small PlanT} &29.6\% &18.2\% &0.40 &86.6\% &214.6 &66.3 &1.08 \\
        \mathrm{\small CIL} &49.0\% &28.8\% &0.40 &92.0\% &50.7 &255.1 &1.35 \\
        \mathrm{\small NMP} &7.0\% &22.8\% &1.67 &8.3\% &214.9 &80.6 &1.20 \\
        \mathrm{\small P3} &2.9\% &6.0\% &3.88 &4.8\% &\textbf{317.3} &36.2 &0.28 \\
        \mathrm{\small OccFlow} &20.4\% &20.9\% &0.40 &22.9\% &306.0 &39.6 &\textbf{0.23} \\
        \ourmodel{} (Ours) &\textbf{0.0\%} &\textbf{0.5\%} &\textbf{4.64} &\textbf{2.2\%} &299.1 &\textbf{33.7} &0.43 \\
    \caption{\textbf{[Canonical set] Closed-loop simulation results}}

Figure 3: Driving quality vs. runtime comparison

Table [tab:closed-loop-canon] showcases our closed-loop results in canonical driving scenarios. Similar to previous results, we observe a substantial safety improvement, being able to attain very close safety and compliance performance to that of the expert. QuAD has the closest imitation to expert and ranks highly on progress too, striking a good balance between safety and progress towards the goal. When taking into consideration the priority safety has over progress, our method clearly outperforms the baselines. Note that CIL and PlanT do not use lane based trajectory samples and this results in much worse TVR metric due to lane boundary violations.

5 1ex 1ex .2ex 0ex Inference runtime Fig. 3 shows how methods compare when considering the balance between driving quality and runtime, an important factor for real-world deployment. We profile in a crowded scenario with 74 agents and 11 lanes to stress-test the different autonomy systems. We highlight that QuAD is faster than other occupancy-based autonomy models (P3, OccFlow), while attaining the best driving quality both in terms of mission achievement and safety.

5 1ex 1ex .2ex 0ex Query point quantization ablation Our proposed query point quantization is necessary to reap the benefits from an implicit occupancy decoder without inquiring a prohibitive runtime. The naive alternative of using the continuous query points directly after the generation step incurs very high inference times of \(\sim\)​700ms. This is due to the high number of points, caused by the high overlap between the candidate plans especially at the beginning of the temporal horizon. Quantization becomes necessary in practice. Utilizing a resolution of 0.5m (the one used in the rest of the experiments), autonomy runtime reduces to \(\sim\)​250ms while the execution collision rate only increases from \(\sim1\%\) to \(\sim2\%\), thus providing significant improvements over the baselines.

Figure 4: Qualitative results. We visualize the LiDAR point cloud, map, predicted occupancy (by querying \(\psi\) at a regular grid, solely for illustration purposes), and cost associated with the trajectory samples. From top to bottom: a lane change, a re-incorporation near an off-ramp, and a merge.

5 1ex 1ex .2ex 0ex Qualitative results Fig. 4 showcases example scenarios where the goals instructed to the ego consist of semantically diverse and interesting maneuvers. From the cost distribution, we can observe that our planner can understand the mission route and progress towards the goal, it can speed up to pass slow moving vehicles when instructed to lane-change, and can plan smooth merge trajectories at on-ramps.

5 Conclusion↩︎

In this paper, we have proposed an interpretable motion planner leveraging spatio-temporal occupancy queries to effectively understand the current and future free-space from sensor data. We showcased our proposed autonomy can drive more safely and progress further than contemporary object-based, sensor-to-plan and occupancy-based autonomy models. Our framework also achieves faster runtime than its closest competitors.

6 Appendix↩︎

We divide the supplementary material into three primary sections. We start by providing additional details about our method and experimental setup in Section 7. We then include supporting quantitative results in Section 8, including tables with the complete set of metrics as well as an ablation study. Finally, we show additional qualitative results of QuAD as well as the baselines in Section 9.

7 Additional Experiment Details↩︎

7.1 Costing Methodology↩︎

In order to achieve satisfactory performance in both safety critical and nominal scenarios, we require a sophisticated combination of different costs to be utilized. In this section we will go into more detail on how we perform this costing process.

As mentioned in the main paper, this costing is carried out for every trajectory candidate in the set \(\mathcal{T} = \{ \tau_s \}_{s \in [1, S] }\), where \(\tau_s = \{(x_t, y_t, \theta_t, v_t, a_t, \kappa_t)\}_{t \in [0,T]}\) and \(T\) is the number of timesteps in the planning horizon and we are considering \(S\) sample trajectories.

7.1.1 Agent Agnostic Costs↩︎

Comfort cost penalizes high jerk, acceleration, lateral acceleration, and curvature. Corridor cost favors trajectories that stay close to lane centerlines and do not violate lane boundaries. Solid lane boundary cost helps avoid trajectories that drive over solid lane boundaries. Speed limit cost prevents the ego from reaching speeds above the limit over the planning horizon. Progress rewards distance traveled along the trajectory. Route prioritizes trajectories that bring us closer to the mission route, where the reward is modulated by the speed limit in the target lane of the trajectory plan.

Figure 5: Illustration of example costs. These costs are evaluated at multiple time steps \(t\) along candidate plans, and aggregated to extract the trajectory cost. For agent-aware costs, the cost is modulated by the occupancy probability at the query points.

Recall that \(w_{i}\) is the learnable weight for the \(i\)-th cost, used for the linear combination of costs. We define \(\text{SD}_{l}(x,y)\) as the signed distance to a polyline \(l\), and \(v_{l}\) as the speed limit for lane \(l\). Finally, we define \(\text{lat}(x)\) as a function that isolates the component of vector \(x\) that is perpendicular to the nearest centerline, and \(\text{long}(x)\) isolates the component along the nearest centerline.

5 1ex 1ex .2ex 0ex Comfort cost \[\begin{align} \sum_t w_{\text{acc\_lat}} \text{lat}(a_t)^2 \\ + w_{\text{acc\_long}} \text{long}(a_t)^2 + w_{\text{jerk}} \dot{a_t}^2 + w_{\text{curv}} \kappa_t^2 \end{align}\]

5 1ex 1ex .2ex 0ex Corridor Costs

\[\sum_t w_{\text{corr}} |\text{SD}_{\text{centerline}}(x_t,y_t)|\]

5 1ex 1ex .2ex 0ex Solid Lane Boundary Cost

\[\begin{align} \sum_t w_{\text{bound}}([\text{SD}_{\text{left\_boundary}}(x_t,y_t)]_+ \\ + [\text{SD}_{\text{right\_boundary}}(x_t,y_t)]_+) \end{align}\]

5 1ex 1ex .2ex 0ex Speed Limit

\[\sum_t w_{\text{speed}} ([v_t - v_l]_+)^2\]

5 1ex 1ex .2ex 0ex Progress

\[-\sum_t w_{\text{prog}} \sqrt{(x_t - x_{t-1}) ^ 2 + (y_t - y_{t-1}) ^ 2}\]

5 1ex 1ex .2ex 0ex Route

\[\sum_t w_{\text{route}} |\text{SD}_{\text{desired\_lane}}(x_t,y_t)|\]

7.1.2 Agent Aware Costing↩︎

Here we elaborate on the mathematical definitions of the agent-aware costing. We denote \(\mathcal{Q}_{B}\) to be the query points that we sample around a region \(B\), and let \(\mathcal{B}_t\) represent the motion bounding box at time t.

At highway speeds (e.g., 30 m/s) a vehicle can travel a large distance in this time period. To avoid missing dangerous events due to limited temporal resolution, we are conservative and consider motion-blurred bounding boxes \(\mathcal{B}_t\) that are stretched to cover the space traveled during the time segment from \(t\) to \(t+1\).

We use \(q = (x, y, t)\) to denote a spatio-temporal query point and \(\psi(q, \mathbf{Z})\) to denote the occupancy probability for query point \(q\).

5 1ex 1ex .2ex 0ex Collision

\[w_{\text{col}} \sum_t (T - t) \max_{q \in \mathcal{Q}_{\mathcal{B}_t}}{[\psi(q, \mathbf{Z}) ]}\]

where \(t = 0, ..., T - 1\).

5 1ex 1ex .2ex 0ex Buffer Costs

For both the longitudinal and lateral buffer costs, we use:

\[\begin{align} w_{\text{buf}} \sum_t (T - t) \max_{q \in \mathcal{Q}_{\text{Buf}_t}}{[ \frac{dis(q)}{\max_{p \in \mathcal{Q}_{\text{Buf}_t}}{[dis(p)]}} \psi(q, \mathbf{Z}) ]} \end{align}\]

where \(dis(q)\) is the euclidean distance between \(q\) and the center of the motion bounding box \(\mathcal{B}_t\).

We define the longitudinal buffer cost by \(Buffer_t = forward(\mathcal{B}_t) \cup backward(\mathcal{B}_t)\), where \(forward\) translates the region \(\mathcal{B}_t\) forward a distance equal to the length of \(\mathcal{B}_t\). Similarly, \(backward\) translates the region \(\mathcal{B}_t\) backward a distance of the length of \(\mathcal{B}_t\).

We define the lateral buffer cost by \(Buffer_t = right(\mathcal{B}_t) \cup left(\mathcal{B}_t)\) where \(right\) translates the region \(\mathcal{B}_t\) right a distance of the width of \(\mathcal{B}_t\). Similarly, \(left\) translates the region \(\mathcal{B}_t\) left a distance of the width of \(\mathcal{B}_t\).

7.2 Implementation Details↩︎

We train our autonomy model in two stages: First, our scene encoder and implicit query function \(\psi\) are trained in canonical scenarios with background traffic to ensure a diverse set with plenty of agents in the scene are seen during training. For this first stage, we use the Adam optimizer over 20 epochs with a learning rate of \(10^{-3}\) which decays to 0.25 of its value every 6 epochs, and a batch size of 2. We also use a weight decay factor of \(10^{-4}\). In the second stage, we train the cost aggregation weights on safety-focused scenarios since safety is the main priority of a planner, and this set is richer in ego-to-agent close interactions. For this second stage, we use the Adam optimizer over 200 epochs with a learning rate of \(10^{-1}\) and no weight decay along with a batch size of 64.

7.3 Dataset Evaluation↩︎

We employ the use of two different kinds of scenario sets. The first is the canonical driving scenario set that consists of generic highway driving with realistic actors. The second consists of the safety set that is made up of very challenging safety critical scenarios. We will now break down the scenarios in each of these subsets.

7.3.1 Canonical Set↩︎

The canonical training set consists of 663 simulations each 20s long. The closed loop canonical evaluation set consists of 321 scenarios simulated using a different set of initial condition and random seeds, but that ultimately follow the same distribution as the training set. We will discuss this distribution in the following section.

The maps are taken from real world highways, and the background is similarly reconstructed using real scenes. We now provide a conceptual decomposition of the different high-level scenes that are in each scenario.

5 1ex 1ex .2ex 0ex Map Categorization

To choose an initial road map for simulation, we select a stretch of map that contains the desired topology. In a third of the cases the desired topology is a merge lane, in another third it is an off-ramp, and the final third it is a stretch of road with neither of the above. Examples of these can be seen in figure 6

Figure 6: Map Topology Examples. We illustrate a scenario in our dataset with (left) a standard highway road, (center) a fork in the road and (right) a merge into the road.

5 1ex 1ex .2ex 0ex Unpredictable Vehicles

In order to simulate a wide diversity of different road actors we randomly sample how conservative vs. aggressive they are. There are additional static actors placed in the environment that the ego and the background actors have to navigate around. Finally, we also vary the density of the traffic in the scenes, with about 10% of the scenes having extremely sparse traffic and about 10% of the scenes being extremely dense traffic.

7.3.2 Safety Set↩︎

Here we will describe the types of scenarios used to categorize our safety set. We note that we duplicate many of the high level scenarios with slightly altered initial conditions to ensure better coverage. Each one of these scenarios is simulated largely in isolation without background actors or a complex background mesh. The intent is to replicate a specific challenging scenario to evaluate the planner’s interaction. For training, we have 690 scenarios that are simulated for up to 20s (depending on when the goal is reached this may be shorter).

The safety scenarios consist of the following: first we have lane change scenarios, where the goal lane for the self-driving vehicle (SDV) is a different lane and there are actors around the vehicle in different configurations. This also includes scenarios where scripted actors will try to merge into the target lane, or will cut off the SDV. Next, we have the lane follow scenarios where the goal of the SDV is to remain in the current lane while contending with a series of challenging scenarios, such as cutting in actors, suddenly braking actors, forking lanes, blocked lanes, and multiple on-ramps converging to the current SDV position. Finally, we have the lane merge scenarios. Here the self-driving vehicle must merge into the target lane under a number of different scenarios. Some of which include actors in the target lane moving at various speeds, and some of which there are actors in front and behind the SDV that make merging a complex balancing act. This also includes scenarios where other actors taking an on-ramp must merge into the SDV’s lane. Examples of each of these can be seen in figure 7.

Figure 7: Safety Scenario Examples. We illustrate scenarios in our safety set where (left) the self-driving vehicle must merge into a lane while contending with other actors, (center) the vehicle must slow down to avoid crashing into a slow vehicle in its lane and (right) the vehicle must merge into the highway lane on an on-ramp.

7.4 Expert Motion Planner↩︎

In our work, we assume access to a privileged motion planner that the different methods will learn to imitate during training. This expert motion planner has access to the true state and future plan of every actor on the road, and therefore is able to plan more optimal trajectories without any perception uncertainty.

In practice this expert is a sample-based motion planner [1], [24] with hand tuned costs. As mentioned before, it has access to the current intention of each of the agents. Therefore, while these intentions may change as time goes on, avoiding collisions with these plans is an effective heuristic to plan good trajectories. The actors generally re-plan based on the surroundings (i.e., other actors and ego actions). The specific costs that are used by this ground truth planner are as follows:

  • Collision cost: cost for colliding into any other actors.

  • Headway cost: cost for safety buffer violations between the SDV and the lead actor, or any future lead actors.

  • Comfort cost: cost for acceleration/jerk.

  • Crosstrack cost: cost for deviating from the centerline.

  • Progress cost: penalizes travelling shorter distances.

  • Speed limit: cost for exceeding the prescribed speed limit.

  • Corridor cost: cost for moving outside the boundaries of the lane.

  • Contingency: cost for penalizing scenarios where a collision is guaranteed given maximum deceleration and no lane changing.

The hand tuned weighted combination of these costs, while not flawless, still gives the best overall results for our planning metrics, and therefore serves as a good guide for supervising the other methods.

8 Additional Quantitative Results↩︎

8.1 Full Results↩︎

For completeness, we include here the main closed loop and open loop results on both the canonical and safety sets. Some less relevant metrics were omitted in the main body of the paper due to space constraints. We begin with the complete closed loop metrics on the canonical set. The results for this can be seen in Table 2.

Table 2: [Canonical set] Closed-loop simulation results
Safety and Compliance Progress, Consistency and Comfort
(l5ptr5pt)2-8 (l5ptr5pt)9-12 ECR\(\downarrow\) PCR\(\downarrow\) MinTTC\(\uparrow\) TVR\(\downarrow\) Progr.\(\uparrow\) L2E\(\downarrow\) P2P\(\downarrow\) Jerk\(\downarrow\)
(l5ptr5pt)4-7 \(p10\) \(< 1s\) \(< 2s\) \(< 5s\)
Expert 0.0% 0.0% 10.00 1.0% 1.0% 8.6% 2.2% 317.6 0.0 19.7 0.40
PlanT [4] 29.6% 18.2% 0.40 22.9% 27.4% 51.9% 86.6% 214.6 66.3 33.7 1.08
CIL [5] 49.0% 28.8% 0.40 28.3% 44.6% 66.2% 92.0% 50.7 255.1 25.9 1.35
NMP [6] 7.0% 22.8% 1.67 7.3% 13.1% 77.7% 8.3% 214.9 80.6 43.9 1.20
P3 [11] 2.9% 6.0% 3.88 2.9% 2.9% 25.5% 4.8% 317.3 36.2 22.7 0.28
OccFlow [13] 20.4% 20.9% 0.40 19.7% 19.7% 36.6% 22.9% 306.0 39.6 13.8 0.23
QuAD 0.0% 0.5% 4.64 0.3% 0.3% 12.1% 2.2% 299.1 33.7 30.3 0.43

The results tell a very similar story to the safety scenarios. Namely, that our proposed method outperforms the other when it comes to key safety metrics, while still achieving competitive progress, consistency and comfort. When driving in the real world it is of paramount importance that safety takes precedence over secondary progress based metrics. However, it is still important to be able to operate reasonably well in the nominal case, which is exactly what we observe in the canonical closed loop metrics. In particular, while we are the best on all safety metrics, we also have very good progress, the best \(L2E\), and comparable jerk to the expert.

We will now analyze the open loop results on the safety set, as seen in Table 3, and the open loop results on the canonical set as seen in Table 4.

Table 3: [Safety set] Open-loop simulation results
Safety and Compliance Progress, Consistency and Comfort
(l5ptr5pt)2-6 (l5ptr5pt)7-10 PCR\(\downarrow\) MinTTC\(\uparrow\) Progr.\(\uparrow\) L2E\(\downarrow\) P2P\(\downarrow\) Jerk\(\downarrow\)
(l5ptr5pt)3-6 \(p10\) \(< 1s\) \(< 2s\) \(< 5s\)
Expert 0.0% 10.00 0.0% 0.0% 0.0% 117.9 30.9 15.7 0.44
PlanT 35.2% 1.45 4.7% 14.2% 40.5% 88.9 74.4 28.8 1.04
CIL 18.6% 1.55 1.6% 18.4% 43.7% 102.5 40.8 17.5 1.94
NMP 11.8% 2.54 0.0% 3.2% 54.2% 103.3 43.8 46.1 1.39
P3 1.4% 4.30 0.0% 0.5% 20.0% 118.9 29.0 15.3 0.26
OccFlow 2.9% 3.50 0.0% 2.1% 23.7% 119.1 29.2 15.0 0.26
QuAD 0.0% 10.00 0.0% 1.1% 7.9% 118.6 37.3 31.2 0.41

Table 3 summarizes the quality of the plans in open-loop simulation on the same safety-focused scenario set. In this evaluation, autonomy is free of any distributional shift, making this a much easier task. The goal of this analysis is more so understanding which approaches are most robust to distribution shifts rather than finding out which approach did best in open-loop, as open-loop metrics are not very indicative of driving performance in the real-world. The first observation when comparing these results to closed-loop safety-focused results in the main paper is that the plan collision rate (\(PCR\)) when planning from expert states is much lower than when the state visitation is determined by the policy under test. While this is expected to an extent, we observe that our method goes from \(0\%\) to \(1.0\%\), while for the strongest baselines the increase is very substantial (e.g., from \(1.4\%\) to \(8.3\%\) for P3 and from \(2.9\%\) to \(28.0\%\) for OccFlow). Similar trends are observed in MinTTC. We note that this increased difficulty from open-loop to closed-loop is not obvious from the table for PlanT, CIL. After further inspection, the reason is that these methods quickly veer off-road during closed-loop simulation due to being extremely brittle to distribution shift, where there are no agents to collide with. In contrast, in open-loop the ego vehicle is driven by the expert and therefore forces these methods to plan when being surrounded by other agents. Another metric that shows our method’s robustness to distribution shift is the L2 to expert (\(L2E\)). In open-loop, our model plans do not show the best imitation to the expert executions. However, when re-planning in states determined by our own previous actions in closed-loop, our method attains a driving trajectory that is the closest to the expert.

Table 4: [Canonical set] Open-loop simulation results
Safety and Compliance Progress, Consistency and Comfort
(l5ptr5pt)2-6 (l5ptr5pt)7-10 PCR\(\downarrow\) MinTTC\(\uparrow\) Progr.\(\uparrow\) L2E\(\downarrow\) P2P\(\downarrow\) Jerk\(\downarrow\)
(l5ptr5pt)3-6 \(p10\) \(< 1s\) \(< 2s\) \(< 5s\)
Expert 0.0% 10.00 0.7% 0.7% 1.0% 93.5 20.4 12.3 0.56
PlanT 38.3% 2.15 1.0% 8.5% 60.7% 87.6 68.3 30.4 1.07
CIL 39.1% 1.35 1.4% 37.3% 75.6% 69.6 54.4 24.8 1.54
NMP 39.1% 1.35 1.4% 37.3% 75.6% 69.6 54.4 24.8 1.54
P3 1.3% 4.50 0.3% 0.3% 16.6% 95.7 30.8 34.2 0.35
OccFlow 1.8% 4.27 0.3% 0.3% 18.6% 95.7 27.3 16.0 0.29
QuAD 1.0% 4.74 0.3% 0.3% 12.5% 94.2 37.2 45.2 0.58

In general, we still observe the same overall trend seen in closed loop. Namely, that our method is able to achieve better safety and compliance performance when compared to other methods. We also see further evidence that occupancy-based methods are able to produce safer driving patterns while simultaneously being more interpretable than other methods.

8.2 Dataset Aggregation Iterations↩︎

Our model attempts to mitigate distributional shift from open loop learning by performing a Dagger-like [9] aggregation of the dataset. We start with a dataset generated by the expert planner, meaning the states visited are caused by actions from the expert. Then, over multiple iterations we consider unrolling the simulation with our learned planner to learn to compensate our own mistakes by training to imitate the expert trajectory plan at every state. Intuitively, this process exposes the network to out of distribution states that otherwise are only discovered during closed loop execution. Here we conduct a brief investigation into the effect of this aggregation on the performance of the model during closed loop evaluation, the results of which can be seen in Table 5.

Table 5: [Safety set] Dagger ablation
Mission Safety and Compliance Progress, Consistency and Comfort
(l5pt)2-2 (l5ptr5pt)3-9 (l5ptr5pt)10-13 GSR\(\uparrow\) ECR\(\downarrow\) PCR\(\downarrow\) MinTTC\(\uparrow\) TVR\(\downarrow\) Progr.\(\uparrow\) L2E\(\downarrow\) P2P\(\downarrow\) Jerk\(\downarrow\)
(l5ptr5pt)5-8 \(p10\) \(< 1s\) \(< 2s\) \(< 5s\)
QuAD w/o Dagger [9] 81.1% 1.6% 2.5% 4.35 1.1% 1.1% 20.0% 7.9% 433.5 37.3 15.0 0.18
QuAD 1 Iteration 84.7% 2.1% 1.0% 4.67 1.6% 1.6% 14.2% 8.4% 430.6 36.6 15.6 0.29
QuAD 2 Iterations 84.2% 1.6% 2.3% 4.45 1.1% 1.1% 13.2% 8.4% 418.3 35.7 15.6 0.27
QuAD 3 Iterations 83.7% 1.6% 1.0% 4.60 1.1% 1.1% 12.1% 7.9% 417.3 34.8 19.5 0.27

We observe that dataset aggregation does not necessarily increase the overall performance of the model after a certain point. There are some noticeable gains in the collision metrics after 1 iteration, but after this the improvements to performance gradually begin to wane. Importantly, we highlight that the results of our model without any dataset aggregation still outperform the baselines.

8.3 Training Dataset Composition↩︎

One of the notable aspects of our method is our two stage training paradigm. Namely, we train the backbone of \(\mathrm{\small QuAD}{}\) on the canonical set to generate a reasonable set of occupancy predictions, and optimize the cost aggregation weights in the training safety set so that we can have better safety guarantees. In this portion of the ablation we explore the effect of optimizing the cost aggregation weights with different combinations of the safety and canonical set. The results of these experiments can be seen in Tables 6 and 7. Similar to TRAVL [36], we observe that training on safety critical scenarios yields better evaluation performance on both safety critical scenarios as well as nominal cases.

Table 6: [Safety set] Ablating the dataset for cost aggregation learning
Mission Safety and Compliance Progress, Consistency and Comfort
(l5pt)2-2 (l5ptr5pt)3-9 (l5ptr5pt)10-13 GSR\(\uparrow\) ECR\(\downarrow\) PCR\(\downarrow\) MinTTC\(\uparrow\) TVR\(\downarrow\) Progr.\(\uparrow\) L2E\(\downarrow\) P2P\(\downarrow\) Jerk\(\downarrow\)
(l5ptr5pt)5-8 \(p10\) \(< 1s\) \(< 2s\) \(< 5s\)
QuAD 100% safety 81.1% 1.6% 2.5% 4.35 1.1% 1.1% 20.0% 7.9% 433.5 37.3 15.0 0.18
QuAD 50% / 50% 75.3% 8.9% 13.0% 3.00 6.8% 6.8% 28.9% 14.7% 381.1 40.9 13.6 0.12
QuAD 100% canonical 55.3% 36.3% 36.8% 0.40 29.5% 33.2% 42.6% 40.0% 310.9 91.2 13.2 0.09
Table 7: [Canonical set] Ablating the dataset for cost aggregation learning
Safety and Compliance Progress, Consistency and Comfort
(l5ptr5pt)2-8 (l5ptr5pt)9-12 ECR\(\downarrow\) PCR\(\downarrow\) MinTTC\(\uparrow\) TVR\(\downarrow\) Progr.\(\uparrow\) L2E\(\downarrow\) P2P\(\downarrow\) Jerk\(\downarrow\)
(l5ptr5pt)4-7 \(p10\) \(< 1s\) \(< 2s\) \(< 5s\)
QuAD 100% safety 0.0% 0.5% 4.90 0.3% 0.6% 10.9% 1.9% 303.3 36.6 23.6 0.28
QuAD 50% / 50% 2.2% 1.3% 4.66 2.2% 2.6% 13.1% 2.9% 302.9 35.9 9.6 0.14
QuAD 100% canonical 21.4% 23.5% 0.40 18.5% 19.5% 27.5% 22.0% 284.6 44.1 7.9 0.09

8.4 Sub Cost Ablation↩︎

Table 8, analyzes the effect of a cost being removed. Each row is named after the dropped cost.

Table 8: [Safety set] Sub Cost Ablation
GSR\(\uparrow\) ECR\(\downarrow\) TTC\(< 5s\) \(\downarrow\) TVR\(\downarrow\) Progr.\(\uparrow\) Jerk\(\downarrow\)
collision % % % %
buffer % % % 0.0%
comfort % % % %
corridor % % % 0.0%
boundary % 0.0% % % 0.00
speed limit % 0.0% % 0.0%
progress % % 0.0% %
route % % % %

The results are intuitive: no collision cost yields the worst ECR, no comfort cost yields the worst jerk, no progress cost yields the worst progress, no route cost yields the worst GSR.

8.5 Query Sample Quantization↩︎

\(\mathrm{\small QuAD}{}\) queries uniformly over a motion blurred bounding box to estimate the occupancy of the targeted regions of interest. Specifically, these regions correspond to the areas the self-driving vehicle could occupy at some future point in time. This point sampling effectively discretizes the continuous occupancy problem. We note that the density of these points in the motion bounding box can be tuned to achieve a different runtime to precision ratio. In Table 9 we explore the effect point sampling density has on the final close loop performance.

Table 9: [Safety set] Query point quantization ablation
Mission Safety and Compliance Progress, Consistency and Comfort Latency
(l5pt)2-2 (l5ptr5pt)3-9 (l5ptr5pt)10-13 (l5ptr5pt)14-14 GSR\(\uparrow\) ECR\(\downarrow\) PCR\(\downarrow\) MinTTC\(\uparrow\) TVR\(\downarrow\) Progr.\(\uparrow\) L2E\(\downarrow\) P2P\(\downarrow\) Jerk\(\downarrow\) Runtime\(\downarrow\)
(l5ptr5pt)5-8 \(p10\) \(< 1s\) \(< 2s\) \(< 5s\) (ms)
QuAD continuous points 82.9% 1.0% 0.4% 4.89 1.0% 1.0% 11.6% 11.1% 444.9 35.8 15.4 0.28 705
QuAD 0.1m 83.9% 1.0% 0.4% 4.90 1.0% 1.0% 11.1% 10.1% 443.4 37.8 15.7 0.28 293
QuAD 0.25m 85.9% 1.0% 0.5% 4.89 1.0% 1.0% 12.1% 8.0% 441.8 37.8 15.3 0.28 259
QuAD 0.5m (default) 84.9% 2.0% 1.0% 4.83 1.5% 1.5% 13.6% 9.0% 436.7 37.8 15.6 0.29 256
QuAD 1m 83.9% 1.5% 1.0% 4.72 1.0% 1.5% 14.1% 10.6% 442.5 36.0 19.8 0.28 254
QuAD 2m 84.4% 3.0% 2.0% 4.45 2.5% 2.5% 14.6% 9.0% 436.5 38.1 22.9 0.29 257

We find that having a finer discretization does improve results, but after a certain point the improvements seem to be marginal while coming at the expense of additional runtime. Thus, we selected 0.5 m resolution for our final model. This is to be expected since the actors in the scene are fairly large, and further refinement wouldn’t be necessary to identify their location during the costing process.

9 Qualitative Results↩︎

In the following figures 8, 9, 10 we have the additional qualitative results comparing the performance of our model in safety critical scenarios to other baselines. Note that in each of the visuals, we omit the lidar point clouds for clarity.

Figure 8: Safety Scenario Comparison 1 Here we compare the various models in a scenario where a car cuts in front of the SDV. While our method is able to adapt to the cut-in, the other occupancy models fail to consider that modality and crash into the vehicle. NMP and PlanT are however able to adapt. Note that the red coloring on the trajectories corresponds to a high cost while the blue cost on the trajectories corresponds to a low cost.

Figure 9: Safety Scenario Comparison 2 Here we have a scenario where the vehicle must merge into the desired lane from an on-ramp. Note how while \(\mathrm{\small QuAD}{}\) and P3 are both able to navigate into the traffic, OccFlow fails to adequately rank the trajectories resulting in a collision. PlanT suffers from out of distribution maps and therefore begins drifting off the road, and finally NMP acts aggressively resulting in another collision.

Figure 10: Safety Scenario Comparison 3 Here we have a scenario where the SDV must merge into traffic filled lane before the current lane ends. While \(\mathrm{\small QuAD}{}\) and P3 are able to efficiently perform the lane change, the other models struggle to do it in before their current lane comes to an end.


H. Fan et al., “Baidu apollo EM motion planner,” arXiv preprint, 2018.
W. Zeng, S. Wang, R. Liao, Y. Chen, B. Yang, and booktitle=Computer. V. 2020:. 16th. E. C. G. U. A. 23–28,. 2020,. P. P. X. 16. Urtasun Raquel, “Dsdnet: Deep structured self-driving network,” 2020 , organization={Springer}, pp. 156–172.
A. Cui, S. Casas, A. Sadat, R. Liao, and booktitle=ICCV. Urtasun Raquel, “Lookout: Diverse multi-future prediction and planning for self-driving,” 2021.
K. Renz, K. Chitta, O.-B. Mercea, A. Koepke, Z. Akata, and A. Geiger, “PlanT: Explainable planning transformers via object-level representations,” arXiv preprint arXiv:2210.14222, 2022.
F. Codevilla, M. Miiller, A. López, V. Koltun, and booktitle=ICRA. Dosovitskiy Alexey, “End-to-end driving via conditional imitation learning,” 2018.
W. Zeng et al., “End-to-end interpretable neural motion planner,” 2019.
J. Philion and booktitle=ECCV. Fidler Sanja, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” 2020.
A. Hu et al., “Model-based imitation learning for urban driving,” arXiv preprint arXiv:2210.07729, 2022.
S. Ross, G. Gordon, and booktitle=Proceedings. of the fourteenth international conference on artificial intelligence and statistics Bagnell Drew, “A reduction of imitation learning and structured prediction to no-regret online learning,” 2011 , organization={JMLR Workshop and Conference Proceedings}, pp. 627–635.
S. Hoermann, M. Bach, and booktitle=ICRA. Dietmayer Klaus, “Dynamic occupancy grid prediction for urban autonomous driving: A deep learning approach with fully automatic labeling,” 2018.
A. Sadat, S. Casas, M. Ren, X. Wu, P. Dhawan, and booktitle=ECCV. Urtasun Raquel, “Perceive, predict, and plan: Safe motion planning through interpretable semantic representations,” 2020.
S. Casas, A. Sadat, and booktitle=CVPR. Urtasun Raquel, “Mp3: A unified model to map, perceive, predict and plan,” 2021.
R. Mahjourian, J. Kim, Y. Chai, M. Tan, B. Sapp, and D. Anguelov, “Occupancy flow fields for motion forecasting in autonomous driving,” IEEE Robotics and Automation Letters, 2022.
B. Yang, W. Luo, and booktitle=CVPR. Urtasun Raquel, “Pixor: Real-time 3d object detection from point clouds,” 2018.
A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and booktitle=CVPR. Beijbom Oscar, “Pointpillars: Fast encoders for object detection from point clouds,” 2019.
X. Weng, J. Wang, D. Held, and booktitle=2020. I. I. Kitani Kris, “3d multi-object tracking: A baseline and new evaluation metrics,” 2020.
S. Sharma, J. A. Ansari, J. K. Murthy, and booktitle=ICRA. Krishna K Madhava, “Beyond pixels: Leveraging geometry and shape cues for online multi-object tracking,” 2018.
C. Tang and booktitle=Advances. in N. I. P. S. Salakhutdinov Russ R, “Multiple futures prediction,” 2019.
T. Phan-Minh, E. C. Grigore, F. A. Boulton, O. Beijbom, and E. M. Wolff, “CoverNet: Multimodal behavior prediction using trajectory sets,” arXiv preprint, 2019.
Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” CoRL, 2019.
T. Zhao et al., “Multi-agent tensor fusion for contextual trajectory prediction,” 2019.
H. Fan, Z. Xia, C. Liu, Y. Chen, and Q. Kong, “An auto-tuning framework for autonomous vehicles,” arXiv preprint, 2018.
N. Rhinehart, R. McAllister, and S. Levine, “Deep imitative models for flexible inference, planning, and control,” arXiv preprint, 2018.
A. Sadat, M. Ren, A. Pokrovsky, Y.-C. Lin, E. Yumer, and booktitle=IROS. Urtasun Raquel, “Jointly learnable behavior and trajectory planning for self-driving vehicles,” 2018.
N. Rhinehart et al., “Contingencies from observations: Tractable contingency planning with learned behavior models,” 2021 , organization={IEEE}, pp. 13663–13669.
A. Trabelsi, R. J. Beveridge, and N. Blanchard, “Drowned out by the noise: Evidence for tracking-free motion prediction,” arXiv preprint, 2021.
W. Luo, B. Yang, and R. Urtasun, “Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net , booktitle = CVPR,” 2018.
S. Casas, W. Luo, and booktitle=CoRL. Urtasun Raquel, “Intentnet: Learning to predict intention from raw sensor data,” 2018.
M. Liang et al., “PnPNet: End-to-end perception and prediction with tracking in the loop , booktitle = CVPR (CVPR),” 2020.
S. Casas, C. Gulino, S. Suo, K. Luo, R. Liao, and booktitle=ECCV. Urtasun Raquel, “Implicit latent variable model for scene-consistent motion forecasting,” 2020.
Y. Hu et al., “Planning-oriented autonomous driving,” 2023.
booktitle=NIPS. Pomerleau Dean A, “Alvinn: An autonomous land vehicle in a neural network,” 1989.
M. Bojarski et al., “End to end learning for self-driving cars,” arXiv preprint, 2016.
M. Müller, A. Dosovitskiy, B. Ghanem, and V. Koltun, “Driving policy transfer via modularity and abstraction,” arXiv preprint, 2018.
J. Hawke et al., “Urban driving with conditional imitation learning,” arXiv preprint, 2019.
C. Zhang et al., “Rethinking closed-loop training for autonomous driving.”
F. Codevilla, E. Santana, A. M. López, and booktitle=ICCV. Gaidon Adrien, “Exploring the limitations of behavior cloning for autonomous driving,” 2019.
A. Hu et al., “FIERY: Future instance prediction in bird’s-eye view from surround monocular cameras,” 2021.
B. Agro, Q. Sykora, S. Casas, and R. Urtasun, “Implicit occupancy flow fields for perception and prediction in self-driving,” CVPR, 2023.
X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv, 2020.
J. Kong, M. Pfeiffer, G. Schildbach, and booktitle=2015. I. intelligent vehicles symposium (IV). Borrelli Francesco, “Kinematic and dynamic vehicle models for autonomous driving control design,” 2015 , organization={IEEE}, pp. 1094–1099.
M. Werling, J. Ziegler, S. Kammel, and booktitle=ICRA. Thrun Sebastian, “Optimal trajectory generation for dynamic street scenarios in a frenet frame,” 2010.
B. Yang, M. Liang, and booktitle=CoRL. Urtasun Raquel, “Hdnet: Exploiting hd maps for 3d object detection,” 2018.
T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and booktitle=CVPR. Belongie Serge, “Feature pyramid networks for object detection,” 2017, pp. 2117–2125.
S. Manivasagam et al., “Lidarsim: Realistic lidar simulation by leveraging the real world,” 2020, pp. 11167–11176.
M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states in empirical observations and microscopic simulations,” Physical Review E, vol. 62, no. 2, Aug. 2000, [Online]. Available: https://doi.org/10.11032Fphysreve.62.1805.
A. Kesting, M. Treiber, and D. Helbing, “General lane-changing model MOBIL for car-following models,” Transportation Research Record, vol. 1999, no. 1, 2007 , bdsk-url-1 = {https://doi.org/10.3141/1999-10}, doi: 10.3141/1999-10 , eprint = {https://doi.org/10.3141/1999-10}.

  1. \(^{*}\)Denotes equal contribution↩︎