JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments

Duy Tho Le\(^{1}\), Chenhui Gou\(^{1}\) , Stavya Datta\(^{1}\), Hengcan Shi\(^{1}\),
Ian Reid\(^{2,3}\), Jianfei Cai\(^{1}\), Hamid Rezatofighi\(^{1}\)
\(^{1}\)Monash University, \(^{2}\)MBZUAI, \(^{3}\)University of Adelaide,
{tho.le1, chenhui.gou, hengcan.shi}@monash.edu
Equal contribution, Corresponding author
https://jrdb.erc.monash.edu/dataset/panotrack


Abstract

Autonomous robot systems have attracted increasing research attention in recent years, where environment understanding is a crucial step for robot navigation, human-robot interaction, and decision-making. Real-world robot systems usually collect visual data from multiple sensors and are required to recognize numerous objects and their movements in complex human-crowded settings. Traditional benchmarks, with their reliance on single sensors and limited object classes and scenarios, fail to provide the comprehensive environmental understanding robots need for accurate navigation, interaction, and decision-making. As an extension of the JRDB dataset, we unveil JRDB-PanoTrack, a novel open-world panoptic segmentation and tracking benchmark, towards more comprehensive environmental perception. JRDB-PanoTrack includes (1) various data involving indoor and outdoor crowded scenes, as well as comprehensive 2D and 3D synchronized data modalities; (2) high-quality 2D spatial panoptic segmentation and temporal tracking annotations, with additional 3D label projections for further spatial understanding; (3) diverse object classes for closed- and open-world recognition benchmarks, with OSPA-based metrics for evaluation. Extensive evaluation of leading methods shows the significant challenges posed by our dataset.

Table 1: Typical datasets for 2D-3D panoptic segmentation and tracking. Abbreviations: I (Image), P (Point Cloud), Car (Autonomous Car), Rob (Mobile Robot), Int (Internet images/videos), Temp (Temporal data), Pano Cov. (Panoramic Coverage), No. Class (The number of classes), Trk Len (Track Length), No. Seq (The number of sequences), No. Smp (The number of samples) and No. M (the number of masks).
Dataset | Data | Temp | Pano Cov. | Domain (Indoor / Outdoor / Platform) | No. Class (Thing / Stuff) | Open world | Trk Len | No. Seq | No. Smp | No. M
PanopticCOCO [1] I Int 80 91 - 164k
Cityscapes [2] I Car 8 11 500 3k 10k
VIPSeg [3] I Int 58 66 10s 3536 85k 926k
MOT-STEP [4] I Int 1 6 19s 4 2k 17k
KITTI-STEP [4] I Car 2 17 65s 50 19k 126k
Waymo [5] I 220° Car 8 20 1.2s 2060 100k
SemanticKITTI [6] P Car 14 11 21 43k
Nuscenes [7] P 360° Car 23 6 1000 40k 1.2M
JRDB-PanoTrack I/P 360° Rob 60 11 117s 54 20k 428k

1 Introduction↩︎

With the increasing demand for autonomous robots in human-crowded environments, environment understanding becomes paramount; it serves as a vital step in many robotic tasks, such as navigation and human-robot interaction. Specifically, human-centric environment understanding can be divided into two main aspects: spatial and temporal understanding. Spatial understanding aims to distinguish objects in human-crowded environments, while temporal understanding aims to recognize the temporal relations of such objects.

The existing datasets for environment understanding, sourced primarily from self-driving vehicles [2], [5]–[7] or internet images/videos [1], [3], exhibit clear domain gaps when applied to robotic environments. These sources typically offer perspectives different from those of robots and fail to encapsulate the challenges and interactions specific to robotic systems. As shown in Table 1, most of the existing datasets contain only a single data modality (RGB images or point clouds) [2], [3] and a small number of classes [2], [4]. They also lack temporal information [1] or 360-degree panoramic spatial perspectives [2], [4], [5]. In contrast, the real-world applications where robotic agents are deployed usually involve multi-modal data, diverse classes, and both spatial and temporal understanding.

Built on top of the JRDB dataset [8]–[10] and inheriting its comprehensive annotation suite for human bodies, we introduce JRDB-PanoTrack, a novel comprehensive dataset for human-crowded environment understanding. Firstly, JRDB-PanoTrack offers data from various indoor and outdoor crowded scenes with 2D and 3D synchronized data modalities, supporting both vision and robotics applications. Secondly, high-quality 2D panoptic segmentation and tracking annotations are provided for both spatial and temporal environment understanding, including 428K panoptic masks, 27K tracking labels and 7.3B annotated pixels. Additional 3D label projections are also provided for further spatial understanding. Thirdly, we introduce diverse object classes and open-world benchmarks for generalization research. Finally, JRDB-PanoTrack annotates multiple classes for some areas, such as objects behind glass or hung on walls (see [fig:thumbnail]). We propose metrics based on Optimal Sub-Pattern Assignment (OSPA) to handle such evaluation.

Based on the JRDB-PanoTrack dataset, we present several benchmarks, including closed-world (CW) and open-world (OW) panoptic segmentation and tracking. We extensively evaluate state-of-the-art (SOTA) methods on these benchmarks. Moreover, SOTA methods are also evaluated on our 3D label projections. The results underline the imperative need for advanced methodologies that can adeptly handle the complexities of human-crowded environments. Our main contributions are:

  • We present JRDB-PanoTrack, an extensive new dataset for spatial and temporal robotic environment understanding. In JRDB-PanoTrack, high-quality panoptic segmentation and tracking annotations are provided. We employ comprehensive data collected by a mobile robot, including 2D&3D modalities as well as indoor&outdoor human-crowded scenes.

  • Closed- and open-world benchmarks are proposed for generalizable environment understanding. Our dataset also contains multi-class annotations and OSPA-based metrics for evaluation.

  • We conduct extensive evaluations of SOTA closed- and open-world segmentation/tracking methods on JRDB-PanoTrack, and discuss their strengths and weaknesses.

2 Related Work↩︎

Figure 1: Distribution of object masks of thing (brown) and stuff (green) classes in the JRDB-PanoTrack train and test sets, where the x- and y-axes indicate class names and mask counts, respectively. Best viewed in color and zoomed in.

Panoptic Segmentation and Tracking Datasets. Panoptic segmentation, as introduced in [11], is a task to generate instance-level masks for thing objects (countable, distinct entities) and class-level masks for stuff objects (amorphous and uncountable regions) to achieve a more complete visual understanding. Datasets like PanopticCOCO [1], ADE20K [12] and Cityscapes [2] are widely popular in this space, primarily focusing on 2D images. However, these datasets only support spatial understanding.

Panoptic tracking further integrates multi-object tracking into panoptic segmentation, as seen in datasets like MOT-STEP [4], VIPSeg [3], and Waymo [5] for 2D tracking, and SemanticKITTI [6] and Panoptic nuScenes [7] for 3D tracking. However, these datasets are often sourced from self-driving cars [4]–[7], [13], single-view surveillance cameras [4] or miscellaneous internet videos [3]. Although useful and large-scale, they fall short in representing complex, human-centric environments for autonomous robotics due to the lack of synchronized multi-modal, multi-view data, diverse object classes, complex human-crowded scenes, and domain consistency. Our JRDB-PanoTrack dataset addresses this gap by providing synchronized 2D and 3D data from a social mobile manipulator, capturing the complexity of crowded human spaces and offering diverse objects and unique challenges in both closed-world and open-world settings.

OW Benchmarks. The development of OW benchmarks is crucial for assessing the generalization capabilities of models in diverse and unpredictable environments. Large-scale segmentation datasets such as COCO [1] and ADE20K [12], as well as OW segmentation datasets [14], are usually used for OW spatial understanding, while several datasets like TAO-OW [15] and OVTrack [16] have been introduced for OW bounding-box tracking. However, these datasets are all sourced from internet images/videos. Different from them, JRDB-PanoTrack introduces a unique and challenging OW benchmark for panoptic segmentation and tracking in robotic environments, with both 2D and 3D data modalities.

Previous JRDB Datasets. JRDB [8] is a large-scale and comprehensive dataset for autonomous robot research in human-centric environments. It collects 2D videos, 3D point cloud sequences, audio, and GPS positions with a social manipulator robot. The previous releases, JRDB [8], JRDB-Act [9] and JRDB-Pose [10], introduced annotations for 2D-3D human detection, tracking and forecasting, body skeleton pose estimation, human social grouping and activity recognition. In JRDB-PanoTrack, we complement JRDB by providing new open-world panoptic segmentation and tracking annotations for more comprehensive human-centered scene understanding.

SOTA Frameworks. For panoptic segmentation, initial approaches [11], [17], [18] handle semantic and instance segmentation as separate tasks using dual sub-networks. MaX-DeepLab [19] introduces transformer-based architectures, moving away from bounding-box-dependent models. Recent developments, including K-Net [20], MaskFormer [21], Mask2Former [22] and Mask DINO [23], unify semantic, instance and panoptic segmentation into a single mask-proposal prediction framework. In the OW domain, methods like [24]–[27] generate mask proposals for all panoptic objects and then align them with object names via large-scale vision-language pre-training. For multi-object tracking, traditional motion-model-based algorithms often outperform modern integrated systems. SORT [28] exemplifies this with its linear-motion-based track association. ByteTrack [29] introduces low-confidence detection associations to improve tracking. OC-SORT [30] enhances the filtering and recovery strategies to address non-linear motion. More recently, BoT-SORT [31] advances the field by optimizing the Kalman filter state and incorporating camera-motion compensation. Notably, to the best of our knowledge, there is no available OW panoptic tracking method. We use these strong and popular trackers as baselines in our experiments.

3 The JRDB-PanoTrack Dataset↩︎

3.1 Dataset and Statistics↩︎

Data. JRDB-PanoTrack encompasses 20,000 images, sampled at 1 Hz from 54 videos in the original JRDB dataset [8]. 4,000 360-degree panoramic images can be generated by merging the 5 original images from the 5 camera views, and 4,000 point clouds are also provided for 3D understanding.

Annotation. JRDB-PanoTrack retains all annotations from JRDB [8] and further enhances the dataset by introducing 428K 2D panoptic segmentation masks and 27K tracking annotations to enable environment understanding.

Annotation process. The annotation process starts with an open-ended list of classes that can be extended by any annotator. Any clearly visible and semantically meaningful object is annotated, and objects that are behind glass or hung on walls are annotated with multiple labels. Annotators then produce the labels, and senior annotators control quality through multiple inspection rounds.

Object Class. There are 72 object classes in JRDB-PanoTrack, divided into 61 thing classes (such as pedestrians, cars and laptops) and 11 stuff classes (such as sky and walls). Figure 1 depicts the class distributions.

Figure 2: Word cloud of the most frequent classes seen through glass in JRDB-PanoTrack, with the size of the word proportional to the frequency of the class.

Special Class Labeling. Our dataset aims to analyze common environments for autonomous robots, which leads to some differences from traditional environment understanding datasets. (1) Floor differentiation: human-robot interaction and navigation require robots to distinguish different floors. To address this, we provide instance segmentation labels for floors and regard them as thing objects. (2) Multi-class segmentation: in modern environments, objects are often seen behind windows and glass, and are sometimes hung on walls (the most frequently seen such objects are shown in Figure 2). Traditional datasets usually simply ignore these objects or interactions, although they might be crucial for environment understanding. In JRDB-PanoTrack, 9% of objects belong to such cases. Therefore, we label multiple classes for pixels belonging to these objects, including the front window or glass and the object behind it. We hope this will encourage the community to develop more robust models for better scene understanding.
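To make the multi-class labeling concrete, the sketch below is one possible way to represent overlapping labels, where a pixel seen through glass carries both the glass label and the label of the object behind it. It is illustrative only: the class names, image size and region coordinates are hypothetical, and this is not the official JRDB-PanoTrack annotation format.

```python
import numpy as np

# Hypothetical example: a chair visible through a window, so the overlapping
# pixels carry two labels ("window" in front, "chair" behind).
H, W = 480, 752
window_mask = np.zeros((H, W), dtype=bool)
chair_mask = np.zeros((H, W), dtype=bool)
window_mask[100:300, 200:500] = True   # glass pane region
chair_mask[180:260, 300:420] = True    # chair seen through the glass

annotations = [
    {"class": "window", "mask": window_mask},
    {"class": "chair",  "mask": chair_mask},
]

# Pixels covered by more than one mask carry multiple class labels.
overlap = window_mask & chair_mask
print(f"{overlap.sum()} pixels carry two labels")
print([a["class"] for a in annotations if a["mask"][200, 350]])  # ['window', 'chair']
```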

Figure 3: Analysis of Track length distribution (top) and Number of masks per frame (bottom) in the JRDB-PanoTrack dataset. Best viewed in color.

Tasks. Our dataset supports panoptic segmentation and tracking tasks. Panoptic segmentation [11] targets spatial understanding of environments by generating masks for all thing and stuff objects. Panoptic tracking [32] addresses both the spatial and temporal aspects: it segments both thing and stuff objects and tracks thing objects throughout a video. As shown in Figure 3, there are up to 81 masks in an image, and the average number of masks is 22. In panoramic views, the maximum and average mask counts per image are 245 and 80, respectively. Figure 3 also highlights the track length distribution in JRDB-PanoTrack. The maximum and average track lengths are 117s and 16s, respectively. The most populated scene in our dataset comprises a staggering 564 tracklets (1010 in panoramic views) in a single sequence, compared to an average of 101 tracklets (198 in panoramic views) per sequence. According to Figure 3, the testing set is more crowded than the training set, with more masks per image and more tracklets per sequence.

[Table: statistics of masks per image, tracklets per sequence, and track lengths; * denotes statistics for panoramic images]

Thing and Stuff classes. Following [11], we divide the object classes into thing and stuff classes. Thing classes are objects that can be segmented and tracked, such as person, car, bicycle, chair, table, laptop, bottle, etc. Stuff classes are background classes that can be segmented but not tracked, such as sky, ground, wall, etc. Figure 1 shows the distribution of object instances in JRDB-PanoTrack, where pedestrian is the most frequent class with more than 40k instances, followed by objects commonly seen in human-centric environments, such as chairs, bags, tables, doors, boards and machines, with 10k to 50k instances each.

CW and OW. Based on the class distributions shown in Figure 1, we divide our 72 panoptic classes into two sets: 43 common classes and 29 long-tail classes, serving as known and unknown classes, respectively. The 43 known classes can be used for training and evaluation in the closed-world (CW) scenario. In the open-world (OW) scenario, the 43 known classes can be employed for training, while 28 unknown classes are used for testing (one unknown class occurs in the training set only).

Tracklet statistics. [tab:track_inst_stats] offers detailed statistics on the number of masks per image, tracklets per sequence, and track lengths in the JRDB-PanoTrack dataset. It underscores the dataset’s depth and diversity, with some tracks extending up to 117 seconds across multiple camera views. According to [tab:track_inst_stats], the testing set is more crowded than the training set, with more masks per image and more tracklets per sequence.

Mask Size. The distribution of mask sizes in JRDB-PanoTrack is presented in Figure 4. Mask sizes are balanced in both the training and testing sets, which challenges panoptic segmentation and tracking methods to carefully handle objects of various sizes.
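For reference, a minimal helper (an illustrative sketch, not the official evaluation tooling) that buckets masks by the COCO-style area thresholds quoted in Figure 4:

```python
def mask_size_category(mask_area: int) -> str:
    """Bucket a mask by its pixel area: Small <= 32^2, Large > 96^2,
    and Medium in between (thresholds as stated in Figure 4)."""
    if mask_area <= 32 ** 2:
        return "small"
    if mask_area <= 96 ** 2:
        return "medium"
    return "large"

print(mask_size_category(50 * 50))  # -> 'medium'
```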

Figure 4: The count (top) and percentage (bottom) of Small, Medium and Large masks in the JRDB-PanoTrack training and testing sets. Small masks are \(\leq 32^2\) pixels, Large masks are \(> 96^2\) pixels, and Medium masks are in between. The image size is 752x480 (W x H).

3.2 Benchmark and Metrics↩︎

Benchmark. Based on our JRDB-PanoTrack dataset, we propose four benchmarks for environment understanding:

  • CW panoptic segmentation.

  • OW panoptic segmentation.

  • CW panoptic tracking.

  • OW panoptic tracking.

In all benchmarks, we use half of our dataset for training, i.e., 9365 images in 27 sequences. For testing, we employ 9280 images in the other 27 sequences. Panoramic images and point clouds corresponding to the 9365/9280 images can be used for panorama and 3D understanding. In CW benchmarks, we release annotations of the 43 known classes for both training and testing. In OW benchmarks, methods can use known classes for training, while being tested on all of the classes.

Metric. Current evaluation metrics for panoptic segmentation and tracking, despite their utility, exhibit limitations that can skew method rankings:

  • Threshold dependence, where the choice of threshold can change the ranking of methods, making them unreliable [33]: VPQ [13] and PTQ [32].

  • Inadvertently penalizing the rectification of errors (ID recovery): VPQ [13] and PTQ [32].

  • Inability to handle multi-label scenarios: VPQ [13], PTQ [32], and STQ [4].

Given the introduction of multi-label panoptic segmentation and tracking in JRDB-PanoTrack, these existing metrics become insufficient. To address these gaps, we introduce OSPA\(_{PS}\) and OSPA\(^2_{PT}\), specifically designed for panoptic segmentation and tracking, respectively.

[!tb]\centering
\scriptsize
\begin{adjustbox}{width=\textwidth}
    \begin{tabular}{p{1.1cm}|C{0.35cm}C{0.4cm}|C{0.35cm}C{0.43cm}|C{0.35cm}C{0.43cm}}
        \toprule
        \textbf{Method} & \textbf{PQ$\uparrow$}  & \textbf{\ospapsbold$\downarrow$}   & \textbf{PQ$^{\textbf{Th}}$ $\uparrow$} & \textbf{\ospapsboldt$\downarrow$}  & \textbf{PQ$^{\textbf{St}}$ $\uparrow$} &\textbf{\ospapsbolds $\downarrow$}\\
        \midrule
       \textbf{kMaX}\cite{yu2022k}             & 32.52  & 0.67 & 27.96&0.72 & 45.81& 0.53\\
        \textbf{2Former}\cite{cheng2021mask2former} & 33.25   &0.66 & 28.74& 0.71&46.38 &0.52 \\
        \textbf{DINO}\cite{li2023mask}             & \textbf{36.57}&\textbf{0.64} &\textbf{33.07} &\textbf{0.68} &\textbf{46.74} & \textbf{0.51}\\
        \bottomrule
    \end{tabular}
\end{adjustbox}
\caption{\small{\textbf{Results of CW panoptic segmentation methods on JRDB-PanoTrack.} All methods use a ResNet-50 backbone and COCO pre-training. There are 43 classes, 32 \textit{Thing} and 11 \textit{Stuff}. (kMaX for kMaX-DeepLab, 2Former for Mask2Former and DINO for Mask DINO.)}}
\label{tab:panoptic:closeworld_16th_11}

OSPA for Panoptic Segmentation. The Optimal Sub-Pattern Assignment (OSPA) metric, known for incorporating miss-distance in multi-object performance evaluation [34], has recently been adapted for bounding box/pose detection and tracking tasks [10], [33]. Building on this, we introduce OSPA\(_{PS}\) (\(\mathcal{O}_{PS}\)), a variant of OSPA specifically designed for multi-label panoptic segmentation.

Let \(X = \{x_1, x_2, \dots, x_m\}\) and \(Y = \{y_1, y_2, \dots, y_n\}\) be two sets of arbitrary mask regions (\(x,y\subset \mathbb{R}^2\)) on an image for all ground truths and predictions, with cardinalities \(|X|\) and \(|Y|\), where \(|Y| \geq |X|\) (otherwise swap \(X\) and \(Y\)). For a given class \(c \in \mathbb{C}\), we calculate the normalised base distance between masks \(d_{K}(x_i, y_j) = 1 - IOU(x_i, y_j) \in [0, 1]\). \(\mathcal{O}_{PS}(X_c, Y_c)\) is then obtained using the OSPA equation [34]. The overall OSPA error is calculated by averaging the per-class OSPA errors over all classes: \[\mathcal{O}_{PS}(X, Y) = \frac{1}{|\mathbb{C}|}\sum_{c \in \mathbb{C}}\mathcal{O}_{PS}(X_c, Y_c). \label{eq:ops}\tag{1}\]
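To make the computation concrete, below is a minimal sketch of OSPA\(_{PS}\) in Python (order \(p=1\), cutoff 1, optimal assignment on the \(1-IoU\) base distance). It follows the description above but is not the official evaluation code, and the function names are our own.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def ospa_ps_single_class(gt_masks, pred_masks, cutoff=1.0):
    """OSPA distance between two sets of masks of one class (order p = 1):
    base distance 1 - IoU, unmatched masks charged the cutoff (1), and the
    total divided by the larger cardinality."""
    m, n = len(gt_masks), len(pred_masks)
    if m == 0 and n == 0:
        return 0.0
    if m == 0 or n == 0:
        return cutoff
    dist = np.array([[1.0 - mask_iou(x, y) for y in pred_masks] for x in gt_masks])
    rows, cols = linear_sum_assignment(dist)            # optimal matching
    total = dist[rows, cols].sum() + cutoff * (max(m, n) - min(m, n))
    return total / max(m, n)

def ospa_ps(gt_by_class, pred_by_class):
    """Average the per-class OSPA errors over all classes, as in Eq. (1)."""
    classes = set(gt_by_class) | set(pred_by_class)
    return np.mean([ospa_ps_single_class(gt_by_class.get(c, []),
                                         pred_by_class.get(c, []))
                    for c in classes])
```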

OSPA for Panoptic Tracking. The OSPA metric is extended to assess panoptic tracking with the introduction of OSPA\(^2_{PT}\) (\(\mathcal{O}^2_{PT}\)). For each class \(c \in \mathbb{C}\), consider \(\mathbf{X}_c = \{X^{\mathcal{D}_1}_{1c}, X^{\mathcal{D}_2}_{2c}, \ldots, X^{\mathcal{D}_m}_{mc}\}\) and \(\mathbf{Y}_c = \{Y^{\mathcal{D}_1}_{1c}, Y^{\mathcal{D}_2}_{2c}, \ldots, Y^{\mathcal{D}_n}_{nc}\}\) as the sets of ground-truth and predicted mask trajectories, respectively, where \(\mathcal{D}_i\) contains the time indices at which track \(i\) exists. We then calculate the time-averaged distance between every pair of tracks \(X^{\mathcal{D}_i}_{ic}\) and \(Y^{\mathcal{D}_j}_{jc}\) similarly to [34], using the base distance \(d_{O}(\{X_{ic}^{t}\},\{Y_{jc}^{t}\}) = 1 - IOU(x^t_{ic}, y^t_{jc})\) at time steps where both tracks exist; if only one of \(\{X_{ic}^{t}\}\) or \(\{Y_{jc}^{t}\}\) exists at time \(t\), then \(d_{O}(\{X_{ic}^{t}\},\{Y_{jc}^{t}\}) = 1\), and if neither exists, \(d_{O}(\{X_{ic}^{t}\},\{Y_{jc}^{t}\}) = 0\). The remaining steps are the same as in the original OSPA\(^2\), and \(\mathcal{O}^2_{PT}\) is averaged over all classes as in Eq. (1).
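A corresponding sketch of the track-to-track distance used by OSPA\(^2_{PT}\) (again illustrative, reusing `mask_iou` from the previous sketch): per-frame base distances are averaged over the frames where at least one of the two tracks exists, and the resulting pairwise track distances then take the place of the mask distances in `ospa_ps_single_class` above before averaging over classes.

```python
def track_distance(track_x, track_y):
    """Time-averaged base distance between two mask tracks of one class.
    Each track is a dict {frame_id: boolean mask}. Where both tracks exist
    the distance is 1 - IoU; where only one exists it is 1; frames where
    neither exists fall outside the union below and contribute nothing."""
    frames = sorted(set(track_x) | set(track_y))
    if not frames:
        return 0.0
    total = 0.0
    for t in frames:
        if t in track_x and t in track_y:
            total += 1.0 - mask_iou(track_x[t], track_y[t])
        else:  # only one of the two tracks exists at time t
            total += 1.0
    return total / len(frames)
```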

[t]\centering
\scriptsize
\begin{adjustbox}{width=\textwidth}
    \begin{tabular}{p{1.45cm}|C{0.35cm}C{0.43cm}|C{0.35cm}C{0.43cm}|C{0.35cm}C{0.43cm}}
        \toprule
        \textbf{Method} & \textbf{PQ$\uparrow$}  & \textbf{\ospapsbold$\downarrow$}   & \textbf{PQ$^{\textbf{Th}}$ $\uparrow$} & \textbf{\ospapsboldt$\downarrow$}  & \textbf{PQ$^{\textbf{St}}$ $\uparrow$} &\textbf{\ospapsbolds $\downarrow$}\\
        \midrule
       \textbf{ODISE-L}\cite{xu2023open} & 10.57 & \textbf{0.85}  & 7.03 & 0.90 & \textbf{29.87} & \textbf{0.72}\\
        \textbf{ODISE-C}\cite{xu2023open} & \textbf{11.07} & \textbf{0.85} & \textbf{8.41} & \textbf{0.88} & 25.55 & 0.78 \\
        \textbf{FC-CLIP}\cite{yu2023convolutions} & 10.06 & 0.87 & 7.07 & 0.90 & 26.36 & 0.78\\
        \bottomrule
    \end{tabular}
\end{adjustbox}
\caption{\small\textbf{Results of SOTA OW panoptic segmentation models on the JRDB-PanoTrack testing set.} All models were trained solely on the COCO panoptic dataset and underwent zero-shot evaluation on JRDB. ODISE-L and ODISE-C denote the models trained with class-label and caption-label supervision, respectively.}
\label{tab:panoptic:openworld_16th_11}

In JRDB-PanoTrack, OSPA is preferred because it is a true metric in the mathematical sense (it satisfies the triangle inequality), is not threshold-based, treats masks equally regardless of their size, and does not penalise error rectification.

4 Experiments↩︎

To explore the distinct challenges of JRDB-PanoTrack, we first evaluate advanced panoptic segmentation methods in both 2D closed-world (CW) and open-world (OW) settings. We then investigate panoptic tracking methods. We also briefly evaluate 3D CW segmentation and tracking using pseudo-labels generated from the 2D annotations. Note that all experiments presented here are conducted on individual camera views, not stitched panoramic views. The results show that the JRDB-PanoTrack dataset provides a uniquely challenging environment for panoptic segmentation and tracking.

Evaluation protocol. The absence of models pretrained for multi-label closed/open-world panoptic segmentation/tracking limits our evaluation in multi-label settings. We therefore preprocess multi-label areas into single-label ones, selecting thing objects and omitting the stuff behind them, so that standard evaluation metrics can be used alongside OSPA\(_{PS}\) and OSPA\(^{2}_{PT}\). We hope future research can exploit the full potential of our dataset’s multi-class segmentation annotations.
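A rough sketch of this single-label preprocessing under our assumptions (reusing the hypothetical per-region annotation format sketched in Section 3.1; the real preprocessing pipeline may differ): wherever a thing mask overlaps a multi-label region, the thing label is kept and the co-occurring stuff label is removed from those pixels.

```python
def to_single_label(annotations, stuff_classes):
    """Flatten multi-label regions so every pixel keeps at most one label:
    thing objects win, overlapping stuff labels are suppressed."""
    things = [a for a in annotations if a["class"] not in stuff_classes]
    stuffs = [a for a in annotations if a["class"] in stuff_classes]
    for s in stuffs:
        for t in things:
            s["mask"] = s["mask"] & ~t["mask"]  # drop pixels claimed by a thing
    return things + stuffs
```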

4.1 Panoptic Segmentation↩︎

Implementation. We adopt a ResNet-50 backbone and COCO pretraining for all models, training with a batch size of 6 and a learning rate of \(1 \times 10^{-4}\) over 110K iterations on 2 RTX 4090 GPUs. For other settings, we adhere to the default configurations in [23], [35], [36]. In the OW experiments, we follow the official implementations of ODISE [25] and FC-CLIP [37]. For all cross-domain experiments, we use the weights pretrained on COCO and run inference on JRDB-PanoTrack. For in-domain experiments, we train FC-CLIP on our OW training set and run inference on our OW test set. The model is trained on two RTX 4090 GPUs with a batch size of 8 and a learning rate of \(5 \times 10^{-4}\); other training settings are the same as in [37]. We do not train ODISE due to its very high computational cost.
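For quick reference, the hyperparameters quoted above are collected in one place below (a sketch only; the actual runs follow the default configuration files of the respective codebases, and the key names are our own):

```python
# Closed-world panoptic segmentation training setup described in the text.
CW_SEG_CFG = {
    "backbone": "ResNet-50",
    "pretraining": "COCO",
    "batch_size": 6,
    "learning_rate": 1e-4,
    "iterations": 110_000,
    "gpus": "2x RTX 4090",
}

# In-domain FC-CLIP fine-tuning on the JRDB-PanoTrack OW training set.
FCCLIP_INDOMAIN_CFG = {
    "batch_size": 8,
    "learning_rate": 5e-4,
    "gpus": "2x RTX 4090",
}
```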

[t]\centering
\scriptsize
    \begin{adjustbox}{width=\textwidth,center}
        \begin{tabular}{{C{0.6cm}|C{0.5cm}|C{0.3cm}C{0.43cm}|C{0.35cm}C{0.43cm}|C{0.35cm}C{0.4cm}}}
            \toprule
           \multicolumn{2}{c|}{\textbf{Train strategy}} &
            \multicolumn{2}{c|}{\textbf{All}} &
            \multicolumn{2}{c|}{\textbf{Thing}}&
            \multicolumn{2}{c}{\textbf{Stuff}}\\
            \cmidrule(l{0pt}r{0pt}){1-8} 
            \textbf{COCO} & \textbf{JRDB}& \textbf{PQ$\uparrow$}     & \ospapsbold $\downarrow$  &\textbf{PQ$^{\textbf{Th}}\uparrow$}& \ospapsboldt$\downarrow$   & \textbf{PQ$^{\textbf{St}}$ $\uparrow$}&\ospapsbolds$\downarrow$  \\\midrule
           &  \textbf{✔}  & 31.41& 0.67& 27.12  & 0.72 &43.88 & 0.53\\
          \textbf{✔}&\textbf{✔}&\textbf{36.57}&\textbf{0.64} & \textbf{33.07}&\textbf{0.68}&\textbf{46.74}&\textbf{0.51}\\
            \bottomrule
         \end{tabular}
    \end{adjustbox}
    \caption{{\small\textbf{CW panoptic segmentation results of Mask DINO with different training strategies on JRDB-PanoTrack.} Top: the model is trained solely on JRDB-PanoTrack. Bottom: COCO pretraining followed by fine-tuning on JRDB-PanoTrack.}}
    \label{tab:panoptic:ablation}


CW Panoptic Segmentation. We evaluate SOTA methods on JRDB-PanoTrack ([tab:panoptic:closeworld_16th_11]) and obtain the following findings: (1) Performance is lower across all methods compared to their COCO results, particularly for Thing classes, highlighting challenges such as complex Thing instances and varied object scales in diverse environments. (2) Mask DINO stands out, achieving a PQ of 36.57% with COCO pretraining and 31.41% without it ([tab:panoptic:ablation]). One reason is that our dataset contains crowded objects, and Mask DINO uses more object queries to capture a large number of object candidates. Meanwhile, Mask DINO pretrained on JRDB-PanoTrack achieves higher performance on the COCO dataset (see the supplementary material), suggesting JRDB-PanoTrack’s ability to generalize to other domains. These insights emphasize JRDB-PanoTrack’s unique challenges in CW panoptic segmentation, leading us to further explore its role in the OW panoptic segmentation setting.


OW Panoptic Segmentation. SOTA OW panoptic segmentation methods such as FC-CLIP [37] and ODISE [25] show notably lower performance on JRDB-PanoTrack ([tab:panoptic:openworld_16th_11]) than on other datasets. For instance, the PQ of FC-CLIP is \(26.8\) on ADE20K but only \(10.06\) on our dataset, highlighting JRDB-PanoTrack’s distinct and challenging nature, especially in recognizing and segmenting Unknown classes. [tab:panopticSeg_known_unknow_cross_indomain] further shows cross- and in-domain evaluations for Known and Unknown classes on our dataset. Cross-domain results indicate that while prior knowledge from other datasets like COCO aids in understanding Known classes, it falls short on Unknown classes, underlining JRDB-PanoTrack’s OW segmentation challenge. In contrast, in-domain training improves segmentation performance for Known classes but slightly harms Unknown classes, suggesting that new approaches are needed to address this core challenge. Additionally, we assess the transferability of JRDB-PanoTrack knowledge to other domains. Training exclusively on JRDB-PanoTrack yields a \(13.7\) PQ on COCO (Table 2), demonstrating effective knowledge transfer to different domains. This finding indicates the potential of JRDB-PanoTrack to improve segmentation performance in other domains.


Generalizability of JRDB-PanoTrack. [tab:closed_world_onCOCO] presents comparative results of the Mask DINO model on the COCO validation set, highlighting the generalizability of JRDB-PanoTrack. Specifically, [tab:closed_world_onCOCO] compares the performance of training solely on the COCO dataset against a combined training scheme that includes both JRDB-PanoTrack and COCO. Notably, the model pretrained on JRDB-PanoTrack followed by COCO fine-tuning shows superior performance across all metrics compared to the model trained on COCO alone, supporting that JRDB-PanoTrack is also beneficial for other domains.

Knowledge transfer. [tab:closed_world_onCOCO] demonstrates knowledge transferability between datasets in the open-world setting.


Table 2: Cross-dataset validation results of FC-CLIP. The first row indicates training on COCO and testing on JRDB-PanoTrack, while the second row is the opposite.
Train data PQ
COCO 10.06
JRDB 13.70

In the closed-world setting, we also show that knowledge from JRDB-PanoTrack can help improve performance when fine-tuning on the COCO dataset (Table 2) and vice versa ([tab:panoptic:ablation]). The results in both open-world and closed-world settings show that using JRDB-PanoTrack improves segmentation performance in other domains. This suggests that the size of JRDB-PanoTrack does not significantly hinder performance.

4.2 Panoptic Tracking↩︎

Implementation. We utilize the default settings and implementations of recent popular tracking algorithms: ByteTrack [29], OC-SORT [30] and BoT-SORT [31]. Masks predicted by the CW and OW segmentation models are converted into bounding boxes and then fed into these trackers.
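A minimal sketch of this mask-to-box conversion (illustrative only; the function names and the exact detection layout expected by each tracker implementation are assumptions, not the trackers’ documented APIs):

```python
import numpy as np

def mask_to_xyxy(mask: np.ndarray) -> np.ndarray:
    """Tight axis-aligned box (x1, y1, x2, y2) around a boolean mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=float)

def detections_for_tracker(pred_masks, pred_scores, pred_classes):
    """Pack per-frame segmentation outputs into (x1, y1, x2, y2, score, class)
    rows, the kind of detection array that box trackers typically consume."""
    return np.array([[*mask_to_xyxy(m), s, c]
                     for m, s, c in zip(pred_masks, pred_scores, pred_classes)])
```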

CW Panoptic Tracking. In [tab:trackers_evaluation_CW], our evaluation highlights the diverse capabilities of SOTA tracking methods. BoT-SORT excels in the STQ and IDF1 metrics, showcasing its proficiency in object tracking and identity maintenance. OC-SORT, favored by the \(\mathcal{O}^2_{PT}\) metric, excels at consistently identifying objects across frames while minimizing noisy tracklets. BoT-SORT’s performance, though strong in tracking objects, shows signs of instability, often losing tracks and struggling with consistent ID maintenance. Breaking the results down into Thing and Stuff classes, \(\mathcal{O}^{2S}_{PT}\) for Stuff remains constant because cardinality error is not penalised. The lower performance on JRDB-PanoTrack compared to other datasets can be attributed to our dense annotations and numerous tracklets, which pose a significant challenge for segmentation and tracking.


OW Panoptic Tracking. The OW panoptic tracking results in [tab:trackers_evaluation_OW] indicate a different set of challenges. While BoT-SORT is good at maintaining object identities and delivering high-quality segmentation, it exhibits higher fragmentation, indicating inconsistency in track identity over time. In contrast, OC-SORT, though it does not always top the STQ or IDF1 scores, shows greater consistency with fewer fragmentations and lower OSPA errors. The overall lower performance on the JRDB-PanoTrack dataset reflects the complexities of OW tracking, especially when handling unknown objects. This underscores the need for advanced tracking algorithms that can adapt to unfamiliar objects and maintain consistent track identities.

4.3 3D Panoptic Segmentation & Tracking↩︎


In this work, we briefly touch on 3D CW panoptic segmentation and tracking, though it is not the main focus of this paper. Specifically, we project 2D panoptic labels onto 3D point clouds and use these projections as pseudo-labels for model training. For evaluation, we use our proposed OSPA and OSPA\(^2\), and adopt popular metrics for 3D panoptic segmentation (PQ, IoU) and tracking (LSTQ [38]). It is important to note that these 3D pseudo-labels may contain noise, potentially affecting the accuracy of the results. For 3D panoptic segmentation, as shown in [tab:3Dclosedworld], MaskPLS emerges as the superior method, excelling in all metrics, which indicates its enhanced ability to identify and segment objects precisely in 3D space. For 3D panoptic tracking, Mask4D takes the lead in LSTQ [38] and achieves the best OSPA score of 0.860, denoting its strength in maintaining object identities and tracking consistency over time. Moreover, the higher \(S_{assoc}\) scores compared to \(S_{cls}\) suggest that these methods are better at object association and tracking than at precise classification in 3D environments.
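A simplified sketch of how such 2D-to-3D pseudo-labels can be obtained, assuming an idealized pinhole camera with intrinsics `K` and extrinsics `T_cam_from_lidar`; the actual JRDB sensor rig uses cylindrical/fisheye cameras with calibrated extrinsics, so this is illustrative rather than the exact projection used in our pipeline.

```python
import numpy as np

def project_labels_to_points(points_xyz, panoptic_ids, K, T_cam_from_lidar,
                             img_h, img_w, ignore_id=0):
    """Assign each 3D point the panoptic ID of the image pixel it projects to.
    points_xyz: (N, 3) LiDAR points; panoptic_ids: (H, W) 2D panoptic label map;
    K: (3, 3) camera intrinsics; T_cam_from_lidar: (4, 4) LiDAR-to-camera pose."""
    pts_h = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]           # points in camera frame
    labels = np.full(len(points_xyz), ignore_id, dtype=np.int32)
    in_front = cam[:, 2] > 0                              # keep points in front of the camera
    uv = (K @ cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]                           # perspective divide
    u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    idx = np.flatnonzero(in_front)[valid]
    labels[idx] = panoptic_ids[v[valid], u[valid]]
    return labels
```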

5 Conclusion↩︎

In this paper, we have introduced JRDB-PanoTrack, a novel dataset designed for open-world panoptic segmentation and tracking, particularly for robotics and vision applications. The uniqueness and complexity of JRDB-PanoTrack set it apart from existing datasets. Our extensive evaluations underscore the dataset’s challenges, emphasizing the necessity for more robust methodologies in both closed-world and open-world scenarios. The dataset offers new ground for future research, especially in developing algorithms that can effectively handle the densely populated environments and diverse object interactions typical of real-world settings.

References↩︎

[1]
pp. 740–755. Springer, 2014.
[2]
pp. 3213–3223, 2016.
[3]
pp. 21033–21043, 2022.
[4]
.
[5]
pp. 53–72. Springer, 2022.
[6]
pp. 9297–9307, 2019.
[7]
.
[8]
.
[9]
.
[10]
pp. 4811–4820, 2023.
[11]
pp. 9404–9413, 2019.
[12]
.
[13]
pp. 9859–9868, 2020.
[14]
pp. 815–824, 2023.
[15]
pp. 19045–19055, 2022.
[16]
pp. 5567–5577, 2023.
[17]
pp. 6399–6408, 2019.
[18]
pp. 8818–8826, 2019.
[19]
pp. 5463–5474, 2021.
[20]
.
[21]
.
[22]
pp. 1290–1299, 2022.
[23]
pp. 3041–3050, 2023.
[24]
.
[25]
.
[26]
.
[27]
pp. 887–898, 2023.
[28]
pp. 3464–3468, 2016.
[29]
pp. 1–21. Springer, 2022.
[30]
pp. 9686–9696, 2023.
[31]
.
[32]
.
[33]
.
[34]
.
[35]
.
[36]
pp. 288–307. Springer, 2022.
[37]
.
[38]
pp. 5527–5537, 2021.