September 05, 2025
This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. In our method, such local and global appearance features are extracted in each frame using a Vision-Language Model (VLM) augmented for group detection. For further improvement, the group structure should be consistent over time. While previous methods stabilize detection under the assumption that groups do not change within a video, our method detects dynamically changing groups by global optimization over a graph built from the groupness probabilities of all frames, estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://github.com/irajisamurai/VLM-GroupDetection.git
To understand human social activities, the structure of human groups is a good cue because in-group members share a common purpose and/or behavior. For example, the group structure is used for group activity recognition [1]–[3], trajectory prediction [4]–[7], and anomaly detection [8]. For group detection (e.g., Fig. 1), interactions among in-group members should be recognized. While the dominant cue for recognizing such human interactions is the spatial relationship of people, various human attributes, such as face directions and postures, are also useful.
While the human attributes mentioned above can be recognized from the local appearance of each person, the spatial context observed in the image also provides helpful cues for group detection. The usefulness of the spatial context is validated in various tasks, such as object detection [9]–[12] and individual action recognition [13]–[15]. However, compared with these tasks that employ a relationship between each person and the scene, group detection is more difficult because it requires a complex relationship between multiple people and the scene (e.g., Fig. 1 (a)), which is called spatial scene context.
Figure 1: Effects of our method. People enclosed with the same color are detected as in-group members. (a) Our method using the spatial scene context can detect two in-group members because they are not facing each other but browsing the same background object. (b) Our method can detect dynamic group structures so that a group of three members in \(t_{1}\) splits into two groups in \(t_{2}\).
In addition to the spatial scene context, the temporal scene context is useful. For example, in the previous group detection methods [16]–[22], group detection is stabilized under the assumption that the set of all groups is fixed (i.e., not changed) throughout a video. We call this group detection, in which only a single representative set of groups is detected for a whole video, static group detection. However, people often go back and forth between different groups. Static group detection could handle dynamically changing groups by dividing a video into short clips within each of which the group structure does not change. However, since the frames in which the group structure changes are unknown, such an appropriate division cannot be obtained by pre-processing; it is a chicken-and-egg problem. Therefore, this paper aims at jointly achieving stabilization of group detection and detection of dynamically changing groups by employing the temporal scene context in a video. We call this group detection, in which a set of temporal structures of dynamically changing groups is detected, dynamic group detection (e.g., Fig. 1 (b)).
Figure 2: Bounding boxes fed into the feature extractor. (a) Ours: a bounding box includes a pair of circled people and the background. (b) Previous methods [21], [22]: each person’s bounding box is independently extracted.
To learn the spatial scene context, our method extracts image features from a bounding box that includes both people and the background (Fig. 2 (a)), which gives two advantages over previous methods [21]–[23]. (1) In [21]–[23], the features of each person are extracted independently and then compared and/or merged to estimate their groupness. Our method directly estimates the groupness from a bounding box that includes both people. (2) Not only the local features of people but also the spatial scene context, including the background, is used. To understand a group structure based on the spatial scene context in this bounding box, this paper proposes to augment a Vision-Language Model (VLM), namely CLIP [24], for extracting such visual features.
Such visual features are extracted independently in each frame of a video in our method, following previous methods [16]–[22]. However, unlike the previous methods in which the detected groups of all frames are merged (e.g., averaged), our method temporally connects the features of all frames for dynamic group detection to employ the temporal scene context.
The contributions of this paper are as follows:
In each frame, the spatial scene context expressing complex interactions is extracted by a VLM (i.e., CLIP) and used to estimate the so-called groupness probability, which expresses whether or not each pair is in the same group. Our augmented CLIP features outperform visual feature extractors trained only with images, such as DINOv2 [25], and improve framewise group detection, resulting in better static and dynamic group detection.
Framewise groupness probabilities are temporally connected to construct a graph representing the temporal scene context. This graph, called a temporal groupness graph in contrast to the static groupness graph generated by merging all frames for static group detection, is divided into dynamic groups (see Fig. 5).
The results of dynamic group detection using our method can be easily converted to those of static group detection.
Our experiments on public datasets demonstrate that our method outperforms state-of-the-art methods in various aspects of the static and dynamic detection tasks.
In the group detection task, all observed people are divided into groups, including groups that consist of only one person. Unlike dense crowd tracking [26]–[28], in which particle-like crowd flows are represented as a set of feature points, group detection assigns every observed individual to a group.
Early work on group detection [17], [18] depends only on the spatial relationship (i.e., locations) among people. In addition to the locations, facial directions are used to evaluate the so-called F-formation system [29], where groups are detected in conversational scenarios [30]–[32]. Group detection can be improved by using the locations and facial directions simultaneously [16].
While the aforementioned methods formulate the groupness in a handcrafted fashion, such groupness formulations have recently been replaced by machine learning-based models. In [19], [33]–[36], the characteristics of human trajectories are trained for group detection. Other human attributes can also be merged, e.g., locations and facial directions [37]–[40] and locations and postures [21]. In these methods, all the human attributes are pre-processed.
Figure 3: Network architecture of our method. Extracted image and trajectory features are merged to construct the temporal groupness graph representing the probabilistic connectivity among people. A set of nodes in this graph is divided into clusters for group detection.
However, instead of such independent pre-processing, jointly learning the pre-processes and the final task benefits the final task, as shown in various tasks. While such joint learning is also used for group detection, e.g., in still images [41] and videos [2], [22], [23], these methods take only human bounding boxes as input, so the important spatial scene context is unavailable. Our work aims to extract useful group features from the spatial and temporal scene contexts represented by VLMs and the temporal groupness graph, respectively.
In recent group detection methods for videos [20]–[22], group detection is formulated via graph clustering, which divides a set of nodes in a graph into several clusters, as follows. (1) The groupness probability of each person pair is estimated. (2) A graph is constructed so that each node corresponds to a person, and the nodes are connected via edges; each edge is given the estimated groupness probability of the corresponding pair as its weight. (3) Clusters detected by graph clustering are regarded as groups. These methods [20]–[22] use spectral clustering [42] and label propagation [43], [44] as graph clustering. While these simple clustering algorithms are useful for a small static groupness graph, dynamic group detection requires dividing a large temporal groupness graph. In addition, these simple algorithms need naïve cluster-related hyperparameters (e.g., the number of clusters and the maximum number of nodes in each cluster). Unlike these simple algorithms, the Louvain algorithm [45] used in our method efficiently allows for complex clustering in such a large graph by maximizing the modularity of the graph without such naïve hyperparameters.
Figure 3 shows our group detection network. The groupness probability for each pair is estimated in each frame by our augmented CLIP image features with trajectory-based features (Sec. 3.1). All groupness probabilities are pairwise and framewise. The groupness probabilities in all frames in a video are used to construct the temporal groupness graph (Sec. 3.2.1) for dynamic group detection (Sec. 3.2.2).
The groupness probability is estimated from two features, i.e., trajectory and image features.
The trajectory features are computed in two steps:
1st step (framewise feature extraction): Following [22], for any pair of people, a vector consisting of their appearance attributes such as bounding box locations and face directions (denoted by \(\boldsymbol{X}_{traj} \in \mathbb{R}^{D_{x}}\)) is embedded to its latent variable, \(\boldsymbol{E}_{traj} \in \mathbb{R}^{D_{e}}\), by the framewise trajectory encoder, \(\phi\), as follows: \[\boldsymbol{E}_{traj} = \phi(\boldsymbol{X}_{traj}) \label{eq:32embedding}\tag{1}\]
2nd step (temporal feature extraction): \(\boldsymbol{E}_{traj}\) of \(T\) frames, which are selected before, in, and after frame \(f\), are concatenated. The concatenated features, \(\hat{\boldsymbol{E}}_{traj}\), are fed into the Transformer-based temporal trajectory encoder [46], \(\chi\), to obtain the trajectory features at frame \(f\): \[\boldsymbol{Z}_{traj} = \chi(\hat{\boldsymbol{E}}_{traj}) \in \mathbb{R}^{D_{t}} \label{eq:32trajectory95features}\tag{2}\]
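A minimal PyTorch sketch of this two-step trajectory encoding is given below. The hidden sizes, layer counts, and the use of nn.TransformerEncoder are illustrative assumptions, not the exact architecture of \(\phi\) and \(\chi\).

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Two-step trajectory encoding: framewise MLP (phi, Eq. 1) + temporal Transformer (chi, Eq. 2).

    A minimal sketch; hidden sizes and layer counts are illustrative assumptions.
    """
    def __init__(self, d_x=8, d_e=128, n_heads=4, n_layers=2):
        super().__init__()
        # phi: framewise trajectory encoder mapping X_traj (D_x) to E_traj (D_e).
        self.phi = nn.Sequential(nn.Linear(d_x, d_e), nn.ReLU(), nn.Linear(d_e, d_e))
        # chi: Transformer-based temporal trajectory encoder over T frames.
        layer = nn.TransformerEncoderLayer(d_model=d_e, nhead=n_heads, batch_first=True)
        self.chi = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x_traj):
        # x_traj: (B, T, D_x) pairwise attribute vectors of T frames around frame f.
        e_traj = self.phi(x_traj)      # (B, T, D_e), Eq. (1) applied framewise
        z = self.chi(e_traj)           # (B, T, D_e), temporal encoding
        return z.flatten(start_dim=1)  # (B, T * D_e), i.e., Z_traj with D_t = D_e * T
```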
Figure 4: Preliminary experiments with the pretrained CLIP for zero-shot group classification. While a bounding box with no visual prompt is fed into the image encoder, rectangles with the same color in each image enclose people in the same group of ground truth for visualization.
Preliminary experiments: We expect that CLIP is pretrained to understand what a human group is. The potential of CLIP is validated in our experiments (Fig. 4). In these experiments, a bounding box including a target pair and the background is fed into the CLIP image encoder, \(\psi\), to obtain its visual feature vector, \(f_{v}\). Two texts, “individual people” and “a group of people”, are fed into the CLIP text encoder independently to obtain their text feature vectors, \(f_{i}\) and \(f_{g}\). The cosine similarities between \(f_{v}\) and the text feature vectors are computed and fed into the binary softmax function to obtain the probabilities that the target pair is in different groups and the same group. In Fig. 4 (a) and (b), where two in-group members and two individual people are observed, respectively, the estimated probabilities are relatively good. However, as shown in Fig. 4 (c), it is not easy for CLIP to pay attention to any pair because three or more people are observed. Figure 4 (d) shows another difficulty. Since the left person enclosed by the box is occluded, it is difficult for CLIP to recognize whether or not the two people enclosed by boxes are in the same group.
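This zero-shot test can be reproduced with a few lines of code. The sketch below assumes the public openai/CLIP package, a cropped bounding box saved as a hypothetical image file, and CLIP's standard logit scale of 100 for the softmax; it is not the finetuned model introduced later.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

# Cropped bounding box containing the target pair and the background (hypothetical file name).
image = preprocess(Image.open("pair_crop.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["individual people", "a group of people"]).to(device)

with torch.no_grad():
    f_v = model.encode_image(image)            # visual feature of the crop
    f_t = model.encode_text(texts)             # text features f_i and f_g
    f_v = f_v / f_v.norm(dim=-1, keepdim=True)
    f_t = f_t / f_t.norm(dim=-1, keepdim=True)
    sims = f_v @ f_t.T                         # cosine similarities
    probs = (100.0 * sims).softmax(dim=-1)     # binary softmax over the two texts

print(f"P(different groups) = {probs[0, 0]:.3f}, P(same group) = {probs[0, 1]:.3f}")
```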
CLIP attention to a target pair: Even if three or more people are observed in a bounding box (e.g., Fig. 4 (c)), our method proposes to direct the CLIP’s attention to a target pair by circling each person of this pair (Fig. 2 (a)). The effectiveness of circling a region of interest for CLIP is validated for the simple image-text classification task [47]. While only one region is circled in each image in [47], our method applies this circling scheme to the group detection task so that two people are circled separately (Fig. 2 (a)).
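For illustration, the two-circle visual prompt could be rendered as follows. The red color matches the red-circled pair used in the finetuning described next, while the line width and the one-ellipse-per-person-box approximation are assumptions.

```python
from PIL import Image, ImageDraw

def circle_pair(crop, box_a, box_b, color=(255, 0, 0), width=4):
    """Draw one ellipse around each person of the target pair (cf. Fig. 2 (a)).

    crop: PIL image of the pair bounding box; box_a / box_b: (x0, y0, x1, y1)
    person boxes in crop coordinates. Color and line width are illustrative choices.
    """
    img = crop.copy()
    draw = ImageDraw.Draw(img)
    for x0, y0, x1, y1 in (box_a, box_b):
        draw.ellipse((x0, y0, x1, y1), outline=color, width=width)
    return img
```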
Fine-tuning for groupness-augmented CLIP: While the potential of the pretrained CLIP is validated in our preliminary experiments, it is further finetuned in our method. The objectives of this finetuning are twofold. (1) CLIP is finetuned in a supervised manner using the aforementioned red-circled pair. (2) In addition to the two classes used in our preliminary experiments (i.e., “individual people” and “a group of people”), the “occlusion” class is added to cope with difficulty in occlusion handling, which is shown in Fig. 4 (d). We call the features obtained by this finetuning the groupness-augmented CLIP (GA-CLIP) features.
Our method finetunes the CLIP image encoder, \(\psi\), with the last softmax layer. This finetuning is done with the image-text training data for three-class classification:
A cropped bounding box with a pair of circled people is used, as described before.
Each bounding box is labeled with one of “a group of people,” “individual people,” or “occlusion.” These occlusion labels are available as annotations in the JRDB dataset [23] used in this experiment.
Our method uses “occlusion” to explicitly indicate that image features may be unreliable if they are extracted from a bounding box in which at least one of the two circled people is occluded. If the image features are unreliable, our method relies more on the trajectory features in the image-trajectory fusion features described in Sec. 3.1.3.
Note that, in the following processes, only the GA-CLIP image encoder is used, but the last softmax layer is not used.
Image feature extraction: While our trajectory features in frame \(f\) are computed from \(T\) frames, image features in frame \(f\) (denoted by \(\boldsymbol{Z}_{app}\)) are extracted only from frame \(f\): \[\boldsymbol{Z}_{app} = \psi(\boldsymbol{I}_{app}) \in \mathbb{R}^{D_{a}}, \label{eq:32image95features}\tag{3}\] where \(\boldsymbol{I}_{app} \in \mathbb{R}^{H \times W}\) denotes a cropped bounding box consisting of \(H \times W\) pixels in frame \(f\).
The trajectory features, \(\boldsymbol{Z}_{traj}\), and the image features, \(\boldsymbol{Z}_{app}\), of a person pair are fused as \(\boldsymbol{Z} = \boldsymbol{Z}_{app} \oplus \boldsymbol{Z}_{traj} \in \mathbb{R}^{D_{a}+D_{t}}\), where \(\oplus\) denotes a concatenation operator. \(\boldsymbol{Z}\) is used for estimating the groupness probability between this pair (denoted by \(\boldsymbol{R}\)) in each frame: \[\begin{align} \boldsymbol{R} &=& (P_{i}, P_{g})^{T} = SM(\rho(\boldsymbol{Z})) \in \mathbb{R}^{2}, \label{eq:32R} \end{align}\tag{4}\] where \(\rho\) and \(SM\) denote fully connected layers and a softmax function, respectively. \(P_{i}\) and \(P_{g}\) are the probabilities that the pair are individual people and in the same group, respectively. \(\boldsymbol{R}\) is estimated for all possible pairs of people in each frame.
With the image-trajectory fusion features mentioned above, our method trains the three learnable networks, namely the trajectory feature extractors (i.e., \(\phi\) in Eq. (1) and \(\chi\) in Eq. (2)) and the image-trajectory fusion feature extractor (i.e., \(\rho\) in Eq. (4)), while the GA-CLIP image encoder is fixed. All of \(\phi\), \(\chi\), and \(\rho\) are trained jointly in a supervised manner. In this joint learning, \(\boldsymbol{R}\) in Eq. (4) is used to compute the cross-entropy loss with its ground truth, \(\boldsymbol{R}_{gt} \in \{ (1, 0)^T, (0, 1)^T \}\).
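A minimal sketch of the fusion head \(\rho\) and its training objective is shown below; the two-layer MLP, its hidden size, and the value of \(T\) are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """rho in Eq. (4): predicts (P_i, P_g) from concatenated image/trajectory features."""
    def __init__(self, d_a=768, d_t=128 * 5, d_hidden=256):  # d_t = D_e * T; T = 5 is illustrative
        super().__init__()
        self.rho = nn.Sequential(nn.Linear(d_a + d_t, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, 2))

    def forward(self, z_app, z_traj):
        z = torch.cat([z_app, z_traj], dim=-1)  # Z = Z_app (+) Z_traj
        return self.rho(z)                      # logits; softmax yields (P_i, P_g)

# Supervised joint training of phi, chi, and rho (GA-CLIP image encoder is frozen):
# logits = head(z_app, z_traj)
# loss = nn.CrossEntropyLoss()(logits, labels)  # labels: 0 = individual people, 1 = a group of people
```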
Figure 5: Graph construction and clustering. (a) Ours: dynamic group detection by clustering the temporal groupness graph. (b) Previous methods: static group detection with a static groupness graph, which is generated by aggregating the graphs of all frames.
Our method uses \(P_{g}\) in Eq. (4) to construct a graph for dynamic group detection in a video, as shown in the left-hand part of Fig. 5, via the following two steps.
Step 1: In each frame, a framewise groupness graph is constructed so that each detected person serves as a node. \(P_{g}\) of each pair is given to the edge that connects the nodes of this pair as its weight.
Step 2: The framewise groupness graphs of all frames in a video are connected between subsequent frames to construct a temporal groupness graph. Any pair of nodes between frames \(f\) and \(f+1\) can be connected via an edge if this temporal pair’s identification probability (denoted by \(P_{t}\)) is available; \(P_{t}\) is given to this edge as its weight. \(P_{t}\) may be estimated by multiple object tracking (i.e., \(0 \leq P_{t} \leq 1\)) or given by ground-truth tracking IDs (i.e., \(P_{t} \in \{ 0, 1 \}\)).
All nodes in the temporal groupness graph are divided into clusters by graph clustering, as shown in the right-hand part of Fig. 5. The clusters are regarded as dynamic groups. For example, in the \(f\)-th framewise graph, nodes in the same cluster correspond to in-group members in frame \(f\). If the in-group members differ between \(f\) and \(f+1\), the group changes between \(f\) and \(f+1\). This graph clustering is achieved by the Louvain algorithm [45], as described in Sec. 2.2.
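Under these definitions, constructing and clustering the temporal groupness graph can be sketched with networkx as follows; the node naming convention and the dictionary layout of the probabilities are assumptions made for illustration.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

def dynamic_group_detection(P_g, P_t):
    """Cluster the temporal groupness graph with the Louvain algorithm.

    P_g[f][(i, j)]: groupness probability of persons i and j in frame f (spatial edges).
    P_t[f][(i, j)]: identification probability linking person i in frame f to person j
                    in frame f + 1 (temporal edges); 1.0 when ground-truth IDs are used.
    Returns a list of clusters, each a set of (frame, person_id) nodes, i.e., dynamic groups.
    """
    G = nx.Graph()
    for f, pairs in enumerate(P_g):                    # framewise groupness graphs (Step 1)
        for (i, j), p in pairs.items():
            G.add_edge((f, i), (f, j), weight=p)
    for f, links in enumerate(P_t):                    # temporal edges between f and f + 1 (Step 2)
        for (i, j), p in links.items():
            G.add_edge((f, i), (f + 1, j), weight=p)
    # Modularity maximization; no cluster-count hyperparameter is required.
    return louvain_communities(G, weight="weight", seed=0)
```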
Datasets. The following two public datasets are used:
JRDB dataset [23] includes 27 indoor and outdoor first-person-view videos captured from a mobile robot. Following [21]–[23], 20 and 7 videos are used for training and testing, respectively. Frames are sampled at 2 fps from the original videos. In total, 1,419 and 404 frames are included in the sampled training and test videos, respectively. Framewise group structures are annotated.
Café dataset [48] consists of 24 bird’s-eye-view videos of cafes. Each video is split into short clips at six-second intervals, and groups are annotated for each clip. In our experiments, five clips are concatenated to produce a 30-second video so that several 30-second clips capture dynamic group structure changes. All clips are split into training, validation, and test sets in two different ways, split-by-view and split-by-place, which are called CaféV and CaféP, respectively. The main paper reports the mean of the CaféV and CaféP results, which is called Café; the separate results of CaféV and CaféP are shown in the Supp. As with JRDB, frames are sampled at 2 fps from the original clips. In total, CaféV and CaféP have 60,135 and 78,871 training, 30,116 and 16,956 validation, and 30,078 and 24,502 test frames, respectively. Unlike JRDB, Café has no per-person occlusion annotations. In GA-CLIP training, therefore, each pair is labeled either “a group of people” or “individual people.”
Tasks. Static and dynamic group detections are evaluated.
Static group detection is tested to validate the generalizability (i.e., applicability to static detection) of our method. All previous methods [2], [20]–[23], [49], [50] in our experiments are static detection methods. While our method is designed for dynamic detection, its framewise results are merged for static detection as follows. The number of frames in which each detected group is observed is counted, and all groups are sorted in descending order of this count. The groups are then selected one by one in this order until all people are included in at least one selected group.
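A literal reading of this merging rule can be sketched as follows, assuming that a group is identified across frames by its (frozen) set of member IDs.

```python
from collections import Counter

def merge_dynamic_to_static(framewise_groups):
    """Convert dynamic (per-frame) detections into a single static set of groups.

    framewise_groups: list over frames; each frame is a list of groups and each group
    is a frozenset of person IDs. Groups are sorted by the number of frames in which
    they are observed and selected greedily until every person is covered.
    """
    counts = Counter(g for frame in framewise_groups for g in frame)
    all_people = set().union(*counts.keys())
    covered, selected = set(), []
    for group, _ in counts.most_common():  # descending frame count
        selected.append(group)
        covered |= group
        if covered == all_people:          # stop once all people are included
            break
    return selected
```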
Dynamic group detection is the main task. Three SoTA static group detection methods, S3R2 [21], P-HAR [51], and GroupTransformer [22], are modified for dynamic detection, following our method. That is, framewise graphs are not merged (e.g., averaged) but connected to construct a temporal groupness graph, which is divided by graph clustering. All these methods [21], [22] are run with the authors’ code. As graph clustering, label propagation (LP) [44], Clauset-Newman-Moore greedy modularity maximization (CNM) [52], and the Louvain algorithm [45] are tested.
Evaluation. Following [20]–[22], the annotated human bounding boxes and temporal person IDs are used at training and test time, and detected groups are evaluated by the following criteria. Let \(G_{det}\) and \(G_{gt}\) denote the sets of person IDs in a detected group and a ground-truth group, respectively. If \(\frac{ \left|G_{det}\cap G_{gt}\right|}{{\rm max}(\left|G_{det}\right|,\left|G_{gt}\right|)}>0.5\), \(G_{det}\) is regarded as a true-positive detection; otherwise, it is a false positive. With these criteria, precision, recall, and F1 scores are computed.
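The criterion and the resulting scores can be computed as below; since the text does not specify a matching protocol, the sketch simply checks whether each detected (or ground-truth) group has any counterpart satisfying the criterion, which is an assumption.

```python
def group_match(g_det, g_gt):
    """True-positive test: |G_det ∩ G_gt| / max(|G_det|, |G_gt|) > 0.5."""
    return len(g_det & g_gt) / max(len(g_det), len(g_gt)) > 0.5

def group_detection_scores(detected, ground_truth):
    """detected / ground_truth: lists of sets of person IDs."""
    tp_det = sum(any(group_match(d, g) for g in ground_truth) for d in detected)
    tp_gt = sum(any(group_match(d, g) for d in detected) for g in ground_truth)
    precision = tp_det / len(detected) if detected else 0.0
    recall = tp_gt / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```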
Architecture: Our method consists of 346M parameters (3091MiB), including 342M parameters of CLIP. Louvain needs 476 and 917 MiB on JRDB and Café, respectively.
Hyperparameters: The feature dimensions are \(D_{x} = 8\), \(D_{e} = 128\), \(D_{t} = 128 \times T\), and \(D_{a} = 768\). All frames are used in each video (i.e., between 28 and 115 frames on JRDB and around 60 frames on Café). As the CLIP image encoder, our method uses ViT-L/14@336px, which takes a bounding box of \(H \times W = 336 \times 336\) pixels.
Adam is used with \(\beta=(0.9, 0.99)\), \(\epsilon =\) 1e-6, and \({\rm weight}\_{\rm decay} =\) 0.05. The learning rate increases from 1e-8 to 2e-6 during the first 2 epochs, then decreases to 5e-8 with a cosine scheduler until 30 epochs.
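This schedule can be expressed, for example, as a LambdaLR multiplier on the peak learning rate of 2e-6; the per-step (rather than per-epoch) granularity is an assumption.

```python
import math
import torch

def make_optimizer_and_scheduler(params, steps_per_epoch, warmup_epochs=2, total_epochs=30,
                                 lr_start=1e-8, lr_peak=2e-6, lr_end=5e-8):
    """Adam with linear warmup (lr_start -> lr_peak) followed by cosine decay to lr_end."""
    opt = torch.optim.Adam(params, lr=lr_peak, betas=(0.9, 0.99), eps=1e-6, weight_decay=0.05)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch

    def lr_lambda(step):  # multiplier relative to lr_peak
        if step < warmup_steps:
            frac = step / max(1, warmup_steps)
            return (lr_start + frac * (lr_peak - lr_start)) / lr_peak
        frac = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        cos = 0.5 * (1.0 + math.cos(math.pi * frac))
        return (lr_end + cos * (lr_peak - lr_end)) / lr_peak

    return opt, torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```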
The weights of \(\rho\) in Eq. (4), \(\phi\) in Eq. (1), and \(\chi\) in Eq. (2) are trained for 30 epochs with the cross-entropy loss described in Sec. 3.1.4, using SGD. In addition, \({\rm weight}\_{\rm decay} =\) 5e-4 and momentum = 0.9 are used on PANDA. The learning rate is fixed to 1e-4.
Figure 6: (a) Success cases of our method in dynamic group detection on Café (upper) and JRDB (lower). In this figure, S3R2 and GroupTransformer are implemented with label propagation to contrast with our proposed method using the Louvain algorithm. (b) Success cases of our method in static group detection on JRDB. In Figs. 6 (a) and (b), people enclosed by boxes with the same color are detected as in-group members.
Static group detection: The results on JRDB and Café are shown in Tables [table:sota_comparison_static_JRDB] and [table:sota_comparison_static_cafe], respectively. The results of all previous methods in Table [table:sota_comparison_static_JRDB] come from [21], [22], except for P-HAR [51], whose result is obtained by the authors’ code. In Table [table:sota_comparison_static_cafe], the results are obtained by the authors’ code. Our method outperforms the previous methods in all metrics on both datasets. For example, compared with the second-best scores, our method improves F1 by \(0.157 = 0.790-0.633\) and \(0.172 = 0.819-0.647\) on JRDB and Café, respectively.
Four success cases of our method in which our GA-CLIP image features are beneficial for group detection are shown in Fig. 6 (b), which also shows the results of two SoTA methods, S3R2 [21] and GroupTransformer [22]. In Fig. 6 (b) (a), our method can detect a group in which in-group members are interacting. In Fig. 6 (b) (b), our method can discriminate between in-group people and others because the GA-CLIP image features can find the indirect interaction at the counter (i.e., background). In Fig. 6 (b) (c), the facial directions of people are useful in our method. Figure 6 (b) (d) is a difficult example in which people are just in a queue. While no clearly distinguishable interaction between these people is observed, their appearance attributes, such as fine-grained face directions, age, and gender, may be used in our method.
Dynamic group detection: The results of dynamic group detection on JRDB and Café are shown in Tables [table:sota_comparison_dynamic_jrdb] and [table:sota_comparison_dynamic_cafe]. Our method outperforms all others in all metrics by a large margin, so that all best and second-best scores are obtained by variants of our method. Even compared with the best F1 score among the other methods (i.e., “0.584 of S3R2 and LP on JRDB” and “0.631 of S3R2 and CNM on Café”), our method is better by \(0.185 = 0.769 - 0.584\) and \(0.145 = 0.776 - 0.631\) on JRDB and Café, respectively.
Two success cases of our method are shown in Fig. 6 (a). In the upper case, four sitting people are in the same group at \(t_{1}\), while S3R2 fails to include two people in this group. This group breaks up at \(t_{2}\). Our method detects this breakup immediately at \(t_{2}\), even though the four people are still close to each other. This success case validates that (1) our GA-CLIP features can discriminate between in-group people at the table and people breaking up and leaving the table, and (2) graph clustering can cut temporal edges in the temporal groupness graph for dynamic group detection. In the lower case, two people enclosed by blue boxes at \(t_{1}\) are mutually occluded at \(t_{2}\). This occlusion drops their groupness probability, \(P_{g}\), at \(t_{2}\) compared with those at \(t_{1}\) and \(t_{3}\). However, our CLIP finetuning with the occlusion class reduces the drop. Combining this effect with graph clustering on our temporal groupness graph enables continuous dynamic group detection.
Figure 7: Visual prompts for specifying a pair of people of interest.
Table 2: Comparison of visual prompts (Fig. 7). “I-Enc F1” is the classification F1 of the image encoder; Precision, Recall, and F1 are group detection scores.

| Dataset (Task) | Model | I-Enc F1 | Precision | Recall | F1 |
|---|---|---|---|---|---|
| JRDB (static) | No prompt | 0.442 | 0.719 | 0.751 | 0.735 |
| JRDB (static) | Mask | 0.569 | 0.653 | 0.823 | 0.728 |
| JRDB (static) | A circle | 0.578 | 0.700 | 0.757 | 0.728 |
| JRDB (static) | Two circles | 0.666 | 0.742 | 0.844 | 0.790 |
| JRDB (dynamic) | No prompt | 0.442 | 0.700 | 0.689 | 0.695 |
| JRDB (dynamic) | Mask | 0.569 | 0.584 | 0.750 | 0.657 |
| JRDB (dynamic) | A circle | 0.578 | 0.640 | 0.666 | 0.653 |
| JRDB (dynamic) | Two circles | 0.666 | 0.724 | 0.820 | 0.769 |
| Café (static) | No prompt | 0.580 | 0.591 | 0.774 | 0.670 |
| Café (static) | Mask | 0.824 | 0.734 | 0.871 | 0.796 |
| Café (static) | A circle | 0.761 | 0.656 | 0.802 | 0.722 |
| Café (static) | Two circles | 0.885 | 0.756 | 0.893 | 0.819 |
| Café (dynamic) | No prompt | 0.580 | 0.508 | 0.777 | 0.615 |
| Café (dynamic) | Mask | 0.824 | 0.677 | 0.870 | 0.761 |
| Café (dynamic) | A circle | 0.761 | 0.613 | 0.797 | 0.692 |
| Café (dynamic) | Two circles | 0.885 | 0.681 | 0.904 | 0.776 |
Ablation of the “occlusion” class on JRDB.

| Dataset (Task) | Model | Precision | Recall | F1 |
|---|---|---|---|---|
| JRDB (static) | Ours w/o Occ | 0.706 | 0.824 | 0.760 |
| JRDB (static) | Ours | 0.742 | 0.844 | 0.790 |
| JRDB (dynamic) | Ours w/o Occ | 0.637 | 0.743 | 0.686 |
| JRDB (dynamic) | Ours | 0.724 | 0.820 | 0.769 |
Table 2 shows the contributions of the visual prompts shown in Fig. 7 to group detection. We can see that our proposed visual prompt using (d) two circles is the best on both datasets and in all group detection metrics.
In addition to group detection, the results of three-class classification (i.e., “individual people,” “a group of people,” and “occlusion”) using the image encoder are also shown in the “I-Enc F1” column of Table 2. The superior performance of (d) two circles validates that the well-trained image encoder contributes to group detection.
The effectiveness of the “occlusion” class is verified by ablating it in our CLIP finetuning scheme. Since “occlusion” is not annotated on Café, Table [table:ablation_occlusion] shows the results on JRDB only. We can see that “occlusion” improves all the metrics. In particular, the improvement of F1 is larger in dynamic group detection (i.e., \(0.083 = 0.769 - 0.686\)). This is because occlusion makes group detection difficult in each frame, so the performance is significantly degraded in dynamic detection if the CLIP image encoder does not learn “occlusion.” However, merging the results of dynamic detection for static detection (as described in “Tasks” in Sec. 4.1) improves the static detection results, resulting in a smaller gap between “Ours w/o Occ” and “Ours.”
Large pretrained vision encoders, ResNet [53], ViT [54], and DINOv2 [25], are finetuned on each dataset and then used in place of the CLIP image encoder. In addition to group detection, three-class classification with each encoder is also evaluated, shown as “I-Enc.”
Table [table:ablation_clip] shows that, in terms of F1, our method using GA-CLIP outperforms the others in all cases by a large margin. This proves that higher-level spatial contexts useful for group detection are represented in GA-CLIP.
In Table [table:ablation_features], our method with both appearance and trajectory features is better than using either of them alone in all cases. While there is not much difference between ours w/o appearance features and ours w/o trajectory features on the whole, our method using both features outperforms both variants.
On an NVIDIA RTX 6000 Ada, the image-trajectory fusion features are computed in 0.018 secs/pair. Since the average number of all possible pairs in each frame is 958 on JRDB, the runtime is \(0.018 \times 958 \risingdotseq 17.2\) secs/frame. To reduce the runtime, (1) only the \(k\)-nearest-neighbor pairs of each person are chosen, reducing the average number of pairs to 47.6, as mentioned in the Supp, and (2) 300 pairs are handled in parallel, leading to \(0.018 \times \frac{47.6}{300} \risingdotseq 0.003\) secs/frame. Note that, while only \(k\)-NN pairs are chosen, temporal graph clustering allows us to detect groups with \(k\) or more people, as validated in the Supp. The Louvain algorithm needs 0.020 secs/frame. On JRDB, therefore, the total inference time is 0.020 + 0.003 = 0.023 secs/frame, which is faster than the 0.11 secs/frame of S3R2 [21]. On Café, the total time is 0.007 secs/frame, while it is 0.018 secs/frame for S3R2 [21].
Instead of annotated person bounding boxes and IDs between frames, inaccurate ones are obtained by a multi-object tracking algorithm [55], which would need to be used in real scenarios. Table [table:mot_group_detection] shows the superiority of our method. This may be because our GA-CLIP features play a crucial role in mitigating the negative effect of erroneous weights given to temporal edges (i.e., erroneous multi-object tracking probabilities). Even if these temporal edge weights are unreliable, our GA-CLIP features allow for correct framewise group detection.
Figure 8: Failure cases of our method on Café. Detected in-group members are marked by boxes with the same color.
Figure 8 shows failure cases of our method. In all false-positive group detections shown in the upper part, individual people are detected as in-group members because they stay nearby for a long time. The lower part shows three temporal frames. While three in-group people are detected correctly at \(t_{1}\) and \(t_{2}\), group detection fails at \(t_{3}\) because one of them, enclosed by the blue box, walks along a different path. These failure cases reveal that our method still depends more on trajectory features than on image features, although our proposed GA-CLIP features improve framewise group detection.
We proposed (1) the GA-CLIP image features for framewise groupness evaluation and (2) temporal groupness graph clustering for dynamic group evaluation. Our experiments demonstrated that (1) the GA-CLIP image features represent human-human and human-scene interactions for better group detection and (2) temporal groupness graph clustering enables dynamic group detection, which is not achieved even in SoTA methods [21], [22].