Sparse Semi-DETR: Sparse Learnable Queries for
Semi-Supervised Object Detection
April 02, 2024
In this paper, we address the limitations of the DETR-based semi-supervised object detection (SSOD) framework, particularly focusing on the challenges posed by the quality of object queries. In DETR-based SSOD, the one-to-one assignment strategy provides inaccurate pseudo-labels, while the one-to-many assignment strategy leads to overlapping predictions. These issues compromise training efficiency and degrade model performance, especially in detecting small or occluded objects. We introduce Sparse Semi-DETR, a novel transformer-based, end-to-end semi-supervised object detection solution to overcome these challenges. Sparse Semi-DETR incorporates a Query Refinement Module to enhance the quality of object queries, significantly improving detection capabilities for small and partially obscured objects. Additionally, we integrate a Reliable Pseudo-Label Filtering Module that selectively filters high-quality pseudo-labels, thereby enhancing detection accuracy and consistency. On the MS-COCO and Pascal VOC object detection benchmarks, Sparse Semi-DETR achieves a significant improvement over current state-of-the-art methods, highlighting its effectiveness in semi-supervised object detection, particularly in challenging scenarios involving small or partially obscured objects.
Semi-Supervised Object Detection (SSOD) aims to improve the effectiveness of fully supervised object detection through the integration of abundant unlabeled data [1]–[11]. It has applications in diverse fields, ranging from autonomous vehicles [12], [13] to healthcare [14], [15], where obtaining extensive labeled datasets is often impractical or cost-prohibitive [16].
Several SSOD methods [1]–[11] have been proposed. Two prevalent approaches in this domain are pseudo-labeling [1]–[5], [17]–[19] and consistency-based regularization [6]–[11]. STAC [1] introduced a simple multi-stage SSOD training method with pseudo-labeling and consistency training, later simplified by a Teacher-Student framework for generating pseudo-labels [17]. Based on this framework, considerable research efforts have been directed towards enhancing the quality of pseudo-labels [3], [5]. These traditional SSOD methods are built upon conventional one-stage [20], [21] and two-stage [22], [23] detectors, which involve various manually designed components such as anchor boxes and non-maximum suppression (NMS). Employing such object detectors in SSOD poses several challenges that must be carefully addressed to obtain reasonable performance. These include overfitting of the labeled data [24], pseudo-label noise [25], bias induced through label imbalance [26], [27], and poor detection performance on small objects [28]. Recently, DETR-based [29]–[35] SSOD methods [28], [36] remove the need for traditional components like NMS.
Even though DETR-based SSOD [28], [36] has progressed remarkably, state-of-the-art methods possess some limitations. (1) DETR-based SSOD methods perform poorly in the detection of small objects, as shown in Figure 3. This is because these methods do not use multi-scale features [37] such as Feature Pyramid Networks (FPN) [38], which play an important role in identifying smaller objects in CNN-based SSOD methods [3]–[11]. Although recent advancements in DETR-based object detection [29]–[35] have improved the detection of small objects, their SSOD adaptations are still unable to address this challenge effectively [28]. (2) SSOD approaches [3]–[5], [28] rely on handcrafted post-processing methods such as NMS [23]. This problem specifically appears in DETR-based SSOD when a large number of object queries and the one-to-many assignment strategy are used [28]. In DETR-based SSOD methods, this problem is partially solved using the one-to-one or hybrid (combination of one-to-one and one-to-many) assignment strategy. The hybrid assignment strategy is preferred because the one-to-one assignment strategy produces inaccurate pseudo-labels [36], resulting in inefficient learning. Although the number of duplicate bounding boxes is lower in the hybrid strategy [28], it is still high enough to adversely impact object detection performance, as depicted in Figure 4. (3) Pseudo-label generation produces both high- and low-quality labels. The DETR-based SSOD methods lack an effective refinement strategy for one-to-many assignments, which is crucial for filtering out low-quality proposals.
To address the above-mentioned issues, we propose enhancing the state-of-the-art DETR-based SSOD approach with ‘Sparse Semi-DETR’, presented in Figure [fig:intro_figure] (b). Our approach extends the architecture with two novel modules designed to mitigate the identified shortcomings. The key module among these is the Query Refinement Module, depicted in Figure 1 and explained in Figure 2. This module significantly improves the quality of the queries and reduces their number. The proposed module uses low-level features from the backbone and high-level features extracted directly from weakly augmented images using ROI alignment [39]. Fusing these features overcomes the first shortcoming, i.e., detecting small and obscured objects, as shown in Figure 3. An attention mechanism drives the aggregation of the features, producing refined, high-quality features to carry forward. To ensure the quality of the query features, the attention mechanism is accompanied by a query-matching strategy for filtering irrelevant queries. Thus, the Query Refinement Module not only improves the quality of the queries but also reduces their number, enabling more efficient processing. This module yields significantly fewer overlapping proposals, improving overall performance and thereby addressing the second limitation. In addition, we introduce a Reliable Pseudo-Label Filtering Module, illustrated in Figure 1 and inspired by Hybrid-DETR [34], to address the third limitation. Employing this module significantly reduces the number of low-quality pseudo-labels and therefore further reduces the duplicate predictions that may still occur after the second stage of the hybrid assignment strategy. Our approach provides better results than previous SSOD methods, as shown in Figure [fig:intro_figure] (c).
The key contributions of this work can be outlined as follows:
We present Sparse Semi-DETR, a novel approach to semi-supervised object detection with two novel contributions. To our knowledge, we are the first to examine and propose query refinement and low-quality proposal filtering for the one-to-many query assignment strategy.
We introduce a novel query refinement module designed to improve object query features, particularly in complex detection scenarios such as identifying small or partially obscured objects. This enhancement not only boosts performance but also aids in learning semantic feature invariance among object queries.
We introduce a Reliable Pseudo-Label Filtering Module specifically designed to reduce the effect of noisy pseudo-labels. This module is designed to efficiently identify and extract reliable pseudo boxes from unlabeled data using augmented ground truths, enhancing the consistency of the learning process.
Sparse Semi-DETR outperforms current state-of-the-art methods on MS-COCO and Pascal VOC benchmarks. With only 10% labeled data from MS-COCO using ResNet-50 backbone, it achieves a 44.3 mAP, exceeding prior baselines by 0.8 mAP. Additionally, when trained on the complete COCO set with extra unlabeled data, it further improves, rising from 49.2 to 51.3 mAP.
Object detection identifies and locates objects in images or videos. Deep learning-based object detection approaches are typically categorized into two primary groups: two-stage detectors [22], [23] and one-stage detectors [21], [40]–[42]. These methods depend on numerous heuristics, such as generating anchors and NMS. Recently, DEtection TRansformer (DETR) [29] considers object detection as a set prediction problem, using transformer [43] to adeptly transform sparse object candidates [44] into precise target objects. Our Sparse Semi-DETR detects small or partially obscured objects in the DETR-based SSOD setting. Notably, our framework is compatible with various DETR-based detectors [30], [45]–[52], offering flexibility in integration.
Most research in SSOD employs detectors categorized into three types: one-stage, two-stage, and DETR-based systems.
One-stage. STAC [1], an early SSOD method, introduced a simple training strategy combining pseudo-labeling and consistency training, later streamlined by a student-teacher framework for easier pseudo-label generation [17]. DSL [7] introduced novel techniques, including Adaptive Filtering, an Aggregated Teacher, and uncertainty-consistency regularization, for improved generalization. Dense Teacher [53] introduced Dense Pseudo-Labels (DPL) for richer information and a region selection method to reduce noise.
Two-stage. Humble Teacher [54] uses soft labels and a teacher ensemble to boost pseudo-label reliability, achieving competitive results. Instant-Teaching [5] creates pseudo annotations from weak augmentations, treating them as ground truth under strong augmentations with Mixup [55]. Unbiased Teacher [17] tackles class imbalance in pseudo-labeling with focal loss, focusing on underrepresented classes. Soft Teacher [3] minimizes incorrect foreground proposal classification by applying teacher-provided confidence scores to weight the classification loss. PseCo [10] enhances detector performance by combining pseudo-labeling with label and feature consistency methods, also using focal loss to address class imbalance.
DETR-based. Omni-DETR [36] is designed for omni-supervised detection and adapts to SSOD with a basic pseudo-label filtering method. It employs the one-to-one assignment strategy proposed in DETR [29] and encounters challenges when dealing with inaccurate pseudo bounding boxes produced by the teacher network. These inaccuracies result in reduced performance, highlighting its limitations. Semi-DETR [28] adopts a stage-wise strategy, employing a one-to-many matching strategy in the first stage and switching to a one-to-one matching strategy in the second stage. This approach provides NMS-free end-to-end detection but reduces performance compared to a one-to-many assignment strategy. Moreover, Omni-DETR and Semi-DETR struggle to detect small or occluded objects. Our work introduces an advanced query refinement module that significantly refines object queries, enhancing training efficiency and performance and enabling the detection of small or densely packed objects in the DETR-based SSOD framework.
In DETR-based SSOD, the one-to-one assignment strategy, denoted by \(\hat{\sigma}_{one2one}\), is obtained by applying the Hungarian algorithm between the predictions made by the student model and the pseudo-labels provided by the teacher model as follows: \[\hat{\sigma}_{one2one} = \underset{\sigma \in \mathcal{\xi }_N}{\arg\min} \sum_{j=1}^{N} \mathcal{L}_{\text{match}} \left( \hat{y}_j^{t}, \hat{y}_{\sigma(j)}^{s} \right)\] where \(\mathcal{L}_{\text{match}}\left(\hat{y}_j^{t}, \hat{y}_{\sigma(j)}^{s}\right)\) is the matching cost between the pseudo-label \(\hat{y}_j^{t}\) generated by the teacher network and the prediction of the student network with index \(\sigma(j)\), and \(\mathcal{\xi}_N\) is the set of permutations of \(N\) elements. Semi-DETR [28] addresses the issue of imprecise initial pseudo-labels by shifting from a one-to-one to a one-to-many assignment strategy, increasing the number of positive object queries to improve detection accuracy: \[\hat{\sigma}_{one2many} = \left\{ \underset{\sigma_j \in C_N^M}{\arg\min} \sum_{k=1}^{M} \mathcal{L}_{\text{match}} \left( \hat{y}_j^{t}, \hat{y}_{\sigma_j(k)}^{s} \right) \right\}_{j=1}^{|\hat{y}^t|}\] where \(C_N^M\) represents the \(M\)-combinations of \(N\) elements, denoting that a subset of \(M\) proposals is associated with each pseudo box \(\hat{y}^t_j\). Semi-DETR initially adopts a one-to-many assignment to improve label quality, then shifts to a one-to-one assignment for an NMS-free model. This approach boosts performance but is less effective with small or occluded objects.
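As a concrete illustration, the following is a minimal sketch of such a one-to-one Hungarian assignment between teacher pseudo-labels and student predictions; the cost terms and weights are illustrative assumptions rather than the exact matching cost \(\mathcal{L}_{\text{match}}\) used above.

```python
# Minimal sketch of one-to-one (Hungarian) assignment between teacher
# pseudo-labels and student predictions. Cost terms/weights are illustrative.
import torch
from scipy.optimize import linear_sum_assignment

def one2one_assign(pseudo_labels, student_preds, cls_weight=2.0, l1_weight=5.0):
    """pseudo_labels: dict with 'labels' (T,) and 'boxes' (T, 4).
    student_preds: dict with 'logits' (N, C) and 'boxes' (N, 4)."""
    prob = student_preds["logits"].sigmoid()                 # (N, C) class probabilities
    cls_cost = -prob[:, pseudo_labels["labels"]]             # (N, T) negative target-class prob
    box_cost = torch.cdist(student_preds["boxes"],
                           pseudo_labels["boxes"], p=1)      # (N, T) L1 box distance
    cost = cls_weight * cls_cost + l1_weight * box_cost      # combined matching cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx  # each pseudo-label is matched to exactly one query
```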
Figure 1: An overview of the Sparse Semi-DETR framework. It contains two networks: the student network and the teacher network. Labeled data is used for student network training, employing a supervised loss. Unlabeled data is fed to the teacher network with weak augmentation and the student network with strong augmentation. The teacher network takes unlabeled data to generate pseudo-labels. Here, the query refinement module provides refined queries to avoid incorrect bipartite matching with teacher-generated pseudo-labels. For a detailed overview of the query refinement module, see Figure 2. Furthermore, a Reliable Pseudo-label Filtering strategy is employed to filter low-quality pseudo-labels progressively during training.
In semi-supervised learning, a collection of labeled data denoted as \(D_l = {\{ x_{i}^l, y_{i}^l \}}_{i=1}^{N_l}\) is given, along with a set of unlabeled data represented as \(D_u = \{{x_{i}^u}\}_{i=1}^{N_u}\). Here, \(N_l\) and \(N_u\) correspond to the numbers of labeled and unlabeled images. The annotations \(y_{i}^l\) for the labeled data \(x_{i}^l\) contain object labels and bounding box information. The pipeline of the Sparse Semi-DETR framework is depicted in Figure 1. It introduces a Query Refinement Module that processes query features to enhance their semantic representation for complex detection scenarios, such as identifying small or partially obscured objects. Additionally, we integrate a Reliable Pseudo-Label Filtering Module that selectively filters high-quality pseudo-labels, thereby enhancing detection accuracy. For comparison purposes, we employ DINO [33] with a ResNet-50 backbone. This section gives a detailed overview of the modules of Sparse Semi-DETR. We briefly explain our semi-supervised approach in Appendix A1.1.
Inspired by recent advancements in vision-based networks [56], [57], we introduce an innovative approach to enhance object query features. For each unlabeled image \(I \in \mathbb{R}^{H \times W \times C}\), we extract query features \(F_{s1} \in \mathbb{R}^{b \times W_1 \times 256}\) from the strongly augmented image \(I\). Similarly, we extract query features \(F_{t1}\) of the same dimensions from the weakly augmented image \(I\). Subsequently, features are extracted from the image backbones, yielding \(F_{s2}\) for the student and \(F_{t2}\) for the teacher network, each in \(\mathbb{R}^{b \times W_2 \times 256}\). These features, encompassing both label and bounding box details, vary with the batch size \(b\). The widths \(W_1\) and \(W_2\) differ, with \(W_2\) being substantially larger than \(W_1\). We provide a brief overview of each component of the Query Refinement Module, as illustrated in Figure 2.
Query Refinement Module. In our approach, we handle the multi-scale features \(F_{t1}\) and \(F_{t2}\) with a focus on effective aggregation. The finer details are encapsulated within the features \(F_{t1}\), while the features \(F_{t2}\) encapsulate more abstract elements such as shapes and patterns. Simple aggregation of these features has been shown to degrade performance, as indicated in Table [tab:attention_queries_h]. To solve this issue, we implement dual strategies to extract local and global information from high- and low-resolution features. High-resolution features are crucial for detecting small objects. However, processing them with attentional operations is computationally demanding. To address this, we first convert the query label features \(F_{t2} \in \mathbb{R}^{b \times W_2 \times 256}\) into \(F'_{t2} \in \mathbb{R}^{b \times W_2 \times 16}\) by decreasing the channel dimension while retaining the original resolution \(b\times W_2\). Then, we apply an attention mechanism to \(F'_{t2}\) to calculate the attention weights \(W_{k+q}\) in the attention block as follows: \[\begin{align} W_{k+q} = F'_{k} \cdot F'_{q}, \end{align} \label{eq:kq_product}\tag{1}\] \[\begin{align} \mathbf{\bar{\vphantom{W}W}}_{k+q} = \frac{\exp(W_{k+q})}{\sum_{l=1}^{L}\exp(W_{k+q}^{l})}, \end{align} \label{eq:kq_softmax}\tag{2}\] where \(W_{k+q}\) denotes the attention weights between \(F'_k\) and \(F'_q\), and \(\mathbf{\bar{\vphantom{W}W}}_{k+q}\) is the normalized form of \(W_{k+q}\). Using the normalized attention weights, we compute the enhanced query representation \(Q\) as follows: \[Q = \mathbf{\bar{\vphantom{W}W}}_{k+q} \cdot F'_{v}\] Next, we compute the similarity between the attentional \(F_{t2}\) features and the \(F_{t1}\) features to obtain \(F'_{cs}\in \mathbb{R}^{{b}\times{W_1} \times 16}\) from \(Q \in \mathbb{R}^{{b}\times{W_2} \times 16}\): \[F'_{cs}= \frac{\sum_{l=1}^{n} P_l Q_l}{\sqrt{\sum_{l=1}^{n} P_l^2} \sqrt{\sum_{l=1}^{n} Q_l^2}}\] where \(P\) and \(Q\) are the \(F_{t1}\) and attentional \(F_{t2}\) features, respectively. Then, we concatenate \(F'_{cs}\) with \(P\) to obtain the refined query features. Interestingly, we observe a performance drop when our feature refinement strategy is applied to the strongly augmented image features for the teacher network, as detailed in Table [tab:ST_attention_b]. However, we achieve optimal results by concatenating the strongly augmented image features and applying our refinement strategy to the weakly augmented image features. Consequently, we proceed by concatenating the features \(F_{s1}\) with \(F_{s2}\), thereby obtaining the query features \(r_{s}\). Note that \(r_{t}\), despite having the same dimensional size as a simple concatenation of \(F'_{cs}\) and \(P\), encapsulates substantially more intricate representations. This improved performance is due to the integration of high-resolution and low-resolution features.
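For illustration, a minimal PyTorch sketch of this refinement is given below, assuming simple linear projections for the channel reduction and attention block and a softmax-weighted similarity aggregation; the layer names, the output projection, and the exact way \(F'_{cs}\) is formed from the similarities are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryRefinement(nn.Module):
    """Illustrative sketch of the query refinement fusion; the projection
    layers and the soft similarity-based aggregation are assumptions."""
    def __init__(self, dim=256, reduced=16):
        super().__init__()
        self.reduce = nn.Linear(dim, reduced)            # channel reduction 256 -> 16
        self.q_proj = nn.Linear(reduced, reduced)
        self.k_proj = nn.Linear(reduced, reduced)
        self.v_proj = nn.Linear(reduced, reduced)
        self.out_proj = nn.Linear(2 * reduced, dim)      # back to the decoder dimension

    def forward(self, f_t1, f_t2):
        # f_t1: (b, W1, 256) fine-grained query features from the weakly augmented view
        # f_t2: (b, W2, 256) backbone features, with W2 >> W1
        x = self.reduce(f_t2)                                          # (b, W2, 16)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(-1, -2), dim=-1)          # Eqs. (1)-(2)
        enhanced = attn @ v                                            # Q, (b, W2, 16)
        # cosine similarity between reduced F_t1 and the attentional F_t2 features
        p = self.reduce(f_t1)                                          # (b, W1, 16)
        sim = F.normalize(p, dim=-1) @ F.normalize(enhanced, dim=-1).transpose(-1, -2)
        f_cs = torch.softmax(sim, dim=-1) @ enhanced                   # (b, W1, 16)
        refined = torch.cat([f_cs, p], dim=-1)                         # concatenate with P
        return self.out_proj(refined)                                  # (b, W1, 256)
```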
Then, we form the decoder queries in the student-teacher network by merging the teacher’s original queries \(q_t\) with the refined queries \(r_s\), and the student’s original queries \(q_s\) with the refined queries \(r_t\), respectively. This integration forms the inputs for the decoder as follows: \[\begin{gather} \hat{o}_t, {o}_t = \text{Dec}_t ([r_s, q_t], E_t[A]) \\ \hat{o}_s, {o}_s = \text{Dec}_s ([r_t, q_s], E_s[A]) \end{gather}\] where \(E_s\) and \(E_t\) refer to the encoded image features, \(\hat{o}_s\) and \(\hat{o}_t\) indicate the decoded features of the refined queries, and \(o_s\) and \(o_t\) represent the decoded features of the original object queries. Here, the subscripts \(t\) and \(s\) denote the teacher and student networks, respectively. Following DN-DETR [32], we use the attention mask \(A\) to prevent information leakage, ensuring the integrity of the learning process.
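A small sketch of how such decoder inputs and a DN-DETR-style leakage-protection mask can be constructed is shown below; the exact mask layout used in the paper may differ.

```python
# Sketch: refined queries are concatenated with the original object queries,
# and a block attention mask keeps the two groups from attending to each other
# (information-leakage protection in the DN-DETR style).
import torch

def build_decoder_inputs(refined_q, original_q):
    # refined_q: (b, R, 256), original_q: (b, Q, 256)
    r, q = refined_q.shape[1], original_q.shape[1]
    queries = torch.cat([refined_q, original_q], dim=1)   # (b, R+Q, 256)
    attn_mask = torch.zeros(r + q, r + q, dtype=torch.bool)
    attn_mask[:r, r:] = True   # refined queries cannot attend to original queries
    attn_mask[r:, :r] = True   # original queries cannot attend to refined queries
    return queries, attn_mask  # entries set to True are blocked
```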
Figure 2: Overview of the Query Refinement Module. The query features from strong and weak augmented unlabeled images are refined through the Query Refinement Module. It amplifies the semantic representation of object queries and improves performance for small objects. For the best view, zoom in.
The one-to-many training strategy, while effective, causes duplicate predictions in the first stage. We introduce a pseudo-label filtering module to address this and to improve the selection of pseudo boxes rich in semantic content for refined query learning. This module is designed to efficiently identify and extract reliable pseudo boxes from unlabeled data using augmented ground truths. We employ \(m\) groups of augmented ground truths \(\hat{g} = \{\hat{g}_1, \hat{g}_2, \ldots, \hat{g}_{m}\}\) for the one-to-many assignment strategy and select the top-\(k\) predictions as follows: \[\hat{\sigma}_{one2many} = \left\{ \underset{\sigma_j \in C_N^M}{\arg\min} \sum_{k=1}^{M} \mathcal{L}_{\text{match}} \left( \hat{y}_j^{t}, \hat{g}_{\sigma_j(k)}^{s} \right) \right\}_{j=1}^{|\hat{y}^t|}\] where \(C_N^M\) represents the \(M\)-combinations of \(N\) elements, denoting that a subset of \(M\) proposals is associated with each pseudo box \(\hat{y}^t_j\). Here, \(m\) is set to 6. Furthermore, we use the remaining predictions to filter out duplicates among the top-\(k\) predictions in the one-to-one matching branch. Through this improved selection scheme, we achieve a performance improvement of 0.4 mAP when \(m\) is set to 6, as shown in Table [tab:effect_each_module_a]. However, we observe no significant benefits when increasing \(m\) beyond 6, as detailed in Table [tab:LR_m].
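The following hedged sketch illustrates this selection: each pseudo box keeps its top-\(k\) lowest-cost proposals, and the remaining proposals are used to suppress duplicates among the kept ones. The IoU-based duplicate check and its threshold are our assumptions.

```python
# Sketch of reliable pseudo-label filtering: keep the top-k lowest-cost
# proposals per pseudo box, then use the remaining proposals to suppress
# duplicates via IoU (the IoU criterion is an assumption, not the paper's rule).
import torch
from torchvision.ops import box_iou

def filter_pseudo_boxes(cost, proposals, k=4, dup_iou=0.7):
    """cost: (T, N) matching cost of T pseudo boxes vs N proposals.
    proposals: (N, 4) boxes in xyxy format."""
    _, topk_idx = cost.topk(k, dim=1, largest=False)     # (T, k) lowest-cost proposals
    keep = topk_idx.unique()
    rest = torch.tensor([i for i in range(cost.shape[1])
                         if i not in set(keep.tolist())])
    if rest.numel() > 0:
        iou = box_iou(proposals[keep], proposals[rest])  # (|keep|, |rest|)
        duplicated = (iou > dup_iou).any(dim=1)          # kept boxes overlapping the rest
        keep = keep[~duplicated]
    return keep
```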
Figure 3: Visual comparison of Sparse Semi-DETR with the two previous approaches on the COCO 10% label dataset. These results highlight Sparse Semi-DETR’s capabilities, particularly in identifying small objects and those obscured by obstacles (as indicated by white arrows) in the third-row images. For optimal clarity and detail, please zoom in.
We evaluate our approach on the MS-COCO [58] and Pascal VOC [59] datasets, benchmarking it against current SOTA SSOD methods. Following [3], [28], Sparse Semi-DETR is evaluated in three scenarios: COCO-Partial, where we use 1%, 5%, and 10% of train2017 as labeled data and the rest as unlabeled data; COCO-Full, where we take train2017 as labeled data and unlabeled2017 as unlabeled data; and VOC, where we take VOC2007 as labeled data and VOC2012 as unlabeled data. Evaluation metrics include \(AP_{50:95}\), \(AP_{50}\), and \(AP_{75}\) [3], [28].
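For reference, the reported \(AP_{50:95}\), \(AP_{50}\), and \(AP_{75}\) values can be computed on COCO-style predictions with pycocotools as sketched below; the file names are placeholders.

```python
# Hedged example of computing AP_{50:95}, AP_{50}, and AP_{75} with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")   # ground-truth annotations (placeholder path)
coco_dt = coco_gt.loadRes("detections.json")           # detector outputs in COCO result format
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
ap, ap50, ap75 = evaluator.stats[0], evaluator.stats[1], evaluator.stats[2]
print(f"AP50:95={ap:.3f}  AP50={ap50:.3f}  AP75={ap75:.3f}")
```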
We set the number of DINO original object queries to 900. For the hyperparameter settings, we follow [28]: (1) In the COCO-Partial setting, we set the training iterations to 120k with a labeled-to-unlabeled data ratio of 1:4. The first 60k iterations adopt a one-to-many assignment strategy. (2) In the COCO-Full setting, Sparse Semi-DETR is trained for 240k iterations with a labeled-to-unlabeled data ratio of 1:1. The first 120k iterations adopt a one-to-many assignment strategy. (3) In the VOC setting, we train the network for 60k iterations with a labeled-to-unlabeled data ratio of 1:4. The first 40k iterations adopt a one-to-many assignment strategy. For all experiments, the filtering threshold \(\sigma\) is set to 0.4. We set \(m\) to 6 and \(k\) to 4. We provide complete implementation details for each experiment in Appendix A1.2.
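For convenience, the COCO-Partial settings stated above can be collected as a plain configuration dictionary (the structure is illustrative, not the actual MMDetection config format):

```python
# COCO-Partial settings stated in this section, gathered for reference only.
SPARSE_SEMI_DETR_COCO_PARTIAL = dict(
    num_object_queries=900,           # DINO original object queries
    total_iterations=120_000,
    one_to_many_iterations=60_000,    # first stage; one-to-one afterwards
    labeled_to_unlabeled_ratio=(1, 4),
    pseudo_label_threshold=0.4,       # filtering threshold sigma
    num_gt_groups=6,                  # m
    topk_proposals=4,                 # k
)
```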
We evaluate Sparse Semi-DETR and compare it against current SOTA SSOD methods. Our results demonstrate the superior performance of Sparse Semi-DETR in these aspects: (1) its effectiveness compared to both one-stage and two-stage detectors, (2) its comparison with traditional DETR-based detectors, and (3) its exceptional proficiency in accurately detecting small and partially occluded objects. We provide more detailed results in Appendix A1.3.
Experimental results on the Pascal VOC protocol. Here, FCOS, Faster RCNN, and DINO are the supervised baselines.
Comparing Sparse Semi-DETR with other approaches on COCO-Full. Note that 1\(\times\) denotes 30k training iterations, while N\(\times\) signifies N times 30k iterations.
COCO-Partial benchmark. Sparse Semi-DETR outperforms current SSOD methods on COCO-Partial across all experimental settings, as demonstrated in Table [tab:results_coco_partial]. (1) We compare our method to both one-stage and two-stage SSOD methods. Sparse Semi-DETR surpasses Dense Teacher by 8.52, 7.79, and 7.17 mAP on 1%, 5%, and 10% labeled data. It also outperforms PseCo by 8.47, 8.30, and 8.24 mAP on 1%, 5%, and 10% labeled data. Sparse Semi-DETR’s superior performance as a semi-supervised object detector is achieved without the hand-crafted components commonly used in two-stage and one-stage detectors. (2) When compared to DETR-based detectors, Sparse Semi-DETR outperforms Omni-DETR by 3.30, 3.10, and 3.00 mAP and beats Semi-DETR by 0.40, 0.70, and 0.80 mAP on 1%, 5%, and 10% labeled data. (3) Sparse Semi-DETR’s exceptional proficiency in precisely detecting small and partially obscured objects is a standout feature. In Figure 3, we visually compare Sparse Semi-DETR with the two preceding approaches using the COCO 10% labeled dataset. These results demonstrate the impressive capabilities of Sparse Semi-DETR, particularly its ability to identify small objects and objects concealed by obstacles, as highlighted in the third-row images by the white arrows. Table [tab:results_small] exhibits a remarkable performance boost of Sparse Semi-DETR on small objects. It surpasses Semi-DETR by 1.20, 0.90, and 1.70 mAP on small objects using 1%, 5%, and 10% labeled data, respectively, highlighting the superior efficiency and accuracy of Sparse Semi-DETR in detecting smaller objects.
COCO-Full benchmark. In Table [tab:results_coco_full], when incorporating additional unlabeled2017 data, Sparse Semi-DETR demonstrates a substantial improvement, achieving an impressive 2.1 mAP gain and reaching a total of 51.3 mAP. It surpasses the performance of Dense Teacher, PseCo, and Semi-DETR by 5.2, 5.2, and 0.9 mAP, respectively, highlighting the effectiveness of Sparse Semi-DETR.
Pascal VOC benchmark. Sparse Semi-DETR exhibits a remarkable performance boost on the Pascal VOC benchmark, as shown in Table [tab:results_voc]. It surpasses the supervised baseline by 5.1 points on \(AP_{50}\) and by an impressive 5.91 points on \(AP_{50:95}\). Furthermore, it outperforms all previous single-stage, two-stage, and DETR-based SSOD methods by a significant margin.
Figure 4: The top-row figures display Semi-DETR’s detection results, and the bottom-row shows Sparse Semi-DETR’s outcomes. Both networks were trained for 120k iterations using one-to-many and one-to-one strategies. Sparse Semi-DETR eliminates redundant bounding boxes in the bottom-left image and detects small objects, like knives, in the top-right image, as indicated by white arrows.
This section ablates the key design choices of Sparse Semi-DETR. The experiments detailed in this section are executed on the MS COCO dataset with 10% labeled data, employing DINO as the primary detector.
Effect of Individual Components. We conduct three experiments to assess the efficacy of each module of Sparse Semi-DETR, as detailed in Table [tab:effect_each_module_a]. We employ Semi-DETR as our baseline model. Notably, integrating each component into Sparse Semi-DETR leads to consistent performance improvements. Specifically, adding the Query Refinement and Pseudo-Label Filtering modules yields a significant increase of 0.8 mAP. These results confirm that each component of Sparse Semi-DETR enhances our model’s performance.
Effect of Query Refinement Module. We examine the impact of the Query Refinement (QR) Module. In Table [tab:ST_attention_b], we explore various QR combinations to determine the most effective design. Applying QR selectively to the student, teacher, or both networks, we find that employing QR on the weakly augmented image features \(F_{t1}\) and \(F_{t2}\), and integrating them into the student network with the original decoder queries, yields the best results. Table [tab:ROI_f] shows that processing the \(F_{t1}\) features without the MLP (used in [28]) improves results. Table [tab:quries_variants_g] presents our study on the impact of different query variants, where we observe that QR consistently outperforms other methods. Finally, Table [tab:attention_queries_h] shows that using an attentional block in the QR module is more effective than simple concatenation or cosine similarity. We provide more analysis of the Query Refinement Module in the supplementary document. Figure 4 compares Sparse Semi-DETR and Semi-DETR. Sparse Semi-DETR, processing fewer but refined queries, exhibits lower duplication rates in this training approach.
Effect of Reliable Pseudo-Label Filtering Module. In our analysis of the Pseudo-Label Filtering Module, we examine the impact of various parameters. Table [tab:LR_m] shows that a smaller m results in lower performance due to the inclusion of poor-quality labels. Performance improves with a moderately larger m due to enhanced auxiliary loss. However, excessively large m values trigger NMS, negatively impacting performance. Additionally, in Table [tab:LR_k], we analyze the selection of the k value and find that setting k to 4 yields the best performance. In Table [tab:LR_a], the optimal performance for \(\sigma\) is achieved at 0.4; values lower than this may lead to the generation of noisy pseudo-labels, whereas higher values can decrease the number of effective pseudo-labels.
Limitation. Duplicate predictions remain when we apply the one-to-many training strategy in both stages. Table [tab:NMS] illustrates that a one-to-many strategy for 120k iterations with Sparse Semi-DETR achieves a 44.6 mAP using NMS. In comparison, 60k iterations in the first stage attain a comparable 44.3 mAP without NMS. Future work can explore this aspect.
In conclusion, we successfully address the inherent limitations of DETR-based semi-supervised object detection frameworks by introducing Sparse Semi-DETR. This novel solution effectively tackles overlapping predictions and the detection of small objects. Sparse Semi-DETR incorporates a Query Refinement Module to enhance object query quality, mainly benefiting the detection of small and partially obscured objects. Besides, it also introduces a Reliable Pseudo-Label Filtering Module to filter out low-quality pseudo-labels selectively, thereby enhancing overall detection accuracy and consistency with the remaining high-quality labels. Our method outperforms existing SSOD approaches, with extensive experiments demonstrating its effectiveness.
Ethical considerations. We study semi-supervised models, and agree that standard ethical considerations for visual recognition are applicable to our work.
This work was in parts supported by the EU Horizon Europe Framework under grant agreements 101135724 (LUMINOUS) and 101092312 (AIRISE) and by the Federal Ministry of Education and Research of Germany (BMBF) under grant 01QE2227C (HERON).
The supplementary document offers an extensive overview of our approach, detailed insights into our implementation details, and a comprehensive analysis of results.
The Sparse Semi-DETR framework is an extension of Semi-DETR, the first semi-supervised DETR-based framework. Labeled data is used for student network training, employing a supervised loss. The Sparse Semi-DETR framework processes unlabeled data through two distinct pathways: the teacher network, which receives weakly augmented data, and the student network, which is fed strongly augmented data. The teacher network uses the unlabeled data to produce pseudo-labels. Meanwhile, the student model’s parameters are refined via back-propagation, whereas the teacher model’s parameters are updated as an exponential moving average (EMA) of the student model’s parameters.
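A minimal sketch of this EMA update is given below; the decay value corresponds to the EMA rates listed in Table 1, and buffers are omitted for brevity.

```python
# Minimal sketch of the EMA teacher update (parameters only).
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # teacher <- decay * teacher + (1 - decay) * student
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
```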
| training setting | COCO-Partial | COCO-Full | VOC | Ablation |
|---|---|---|---|---|
| batch size | 5×8 | 8×8 | 5×8 | 5×8 |
| labeled-to-unlabeled data ratio | 1:4 | 1:1 | 1:4 | 1:4 |
| learning rate | 0.001 | 0.001 | 0.001 | 0.001 |
| first-stage iterations | 0–60k | 0–180k | 0–40k | 0–60k |
| second-stage iterations | 60k–120k | 180k–240k | 40k–60k | 60k–120k |
| total iterations | 120k | 240k | 60k | 120k |
| unsupervised loss weight \(\alpha\) | 4.0 | 2.0 | 4.0 | 4.0 |
| EMA rate | 0.996 | 0.999 | 0.999 | 0.999 |
| confidence threshold | 0.4 | 0.4 | 0.4 | 0.4 |
Additional Details of Semi-DETR. Semi-DETR is a DETR-based semi-supervised framework that introduces cross-view query consistency and stage-wise hybrid matching strategies. (1) In CNN-based semi-supervised object detection (SSOD) frameworks [3]–[11], consistency regularization is easily implemented by minimizing differences between teacher and student model outputs, given the same input but with different augmentations. However, this approach is not directly applicable in DETR-based SSOD frameworks due to the lack of clear correspondence between input object queries and output predictions. To address this, a novel cross-view query consistency module is proposed. It processes RoI features through MLPs and generates cross-view query embeddings. These embeddings are combined with the original object queries and fed into a decoder. (2) Semi-DETR initially uses a one-to-many assignment in early training, allowing multiple predictions per pseudo-label. This speeds up convergence and improves label quality but can cause redundant predictions. It then switches to a one-to-one assignment, reducing redundancy and aiming for an NMS-free final model. However, its effectiveness on small objects is limited. Our Sparse Semi-DETR refines object queries, enhancing small object detection and accuracy.
The implementation of the Sparse Semi-DETR approach is based on the MMDetection framework [60]. We integrate data pre-processing methodologies from Soft Teacher [3]. We train the network on 8 GPUs (RTX A6000), which takes roughly two days to complete 120k training iterations. Elaborating on the training hyperparameters for the different benchmarks: (1) COCO-Partial setup: We train the network using 8 GPUs for 120k iterations, with each GPU handling five images. It employs a one-to-many assignment strategy for the first 60k iterations and then a one-to-one assignment strategy for iterations 60k–120k. (2) COCO-Full setup: For this benchmark, we train for 240k iterations, employing a one-to-many assignment strategy for the first 180k iterations and then a one-to-one assignment strategy for iterations 180k–240k. We use 8 GPUs with eight images per GPU. (3) Pascal VOC setup: Here, the first 40k iterations adopt a one-to-many assignment strategy, followed by a one-to-one assignment strategy for iterations 40k–60k. Across all our experimental setups, we keep the confidence threshold constant at 0.4. We use the Adam optimizer and set the learning rate to 0.001. We avoid using learning rate decay for a fair comparison with Semi-DETR [28]. Complete implementation details are provided in Table 1.
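A hedged skeleton of the resulting COCO-Partial training step is sketched below: Adam with a learning rate of 0.001, a total loss combining supervised and unsupervised terms with weight \(\alpha\), and the stage switch from one-to-many to one-to-one matching; the model and loss interfaces are placeholders, not the actual MMDetection API.

```python
# Skeleton of the COCO-Partial schedule; `model` exposes placeholder loss methods.
import torch

def run_coco_partial_training(model, labeled_iter, unlabeled_iter,
                              total_iters=120_000, stage1_iters=60_000, alpha=4.0):
    # Adam with lr 0.001 and no learning-rate decay, as stated above.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for it in range(total_iters):
        # Stage switch: one-to-many matching first, one-to-one afterwards.
        matching = "one2many" if it < stage1_iters else "one2one"
        sup_loss = model.supervised_loss(next(labeled_iter))            # placeholder interface
        unsup_loss = model.unsupervised_loss(next(unlabeled_iter),
                                             matching=matching)         # placeholder interface
        loss = sup_loss + alpha * unsup_loss                            # alpha from Table 1
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```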
Data Augmentation. We adopt the same data augmentation scheme as in Semi-DETR, detailed in Table [tab:augmentation]. We employ weak augmentation on unlabeled data for generating pseudo labels, while strong augmentation is utilized for both labeled and unlabeled data during the model’s training.
Additional Details of Query Refinement Module. We perform additional experiments to assess the efficacy of our query refinement approach as follows:
Is the attention module crucial in query refinement? Could we apply attention to just low or high-resolution features exclusively, or should it be applied to both high and low-resolution features for optimal results?
Is the integration of a similarity module crucial in query refinement? How would training be impacted if we disregarded similarity features and considered all features comprehensively?
Figure 5: Overview of the Impact of Attention Module in Query Refinement. (a) Query refinement without an attention module, (b) Attention module applied to both low and high-resolution features, (c) Attention module applied to high-resolution features, and (d) Attention module applied to low-resolution features. The best results are achieved for refining queries by applying the attention module to low-resolution features and then combining these with high-resolution features.
Figure 6: Qualitative Comparison of positive proposals in One-to-Many assignment strategy: (a) Ground Truth (b) Semi-DETR (c) Sparse Semi-DETR. Our approach, compared to Semi-DETR, generates more refined positive proposals for each ground truth. Here, ground truths are outlined in red, while the positive proposals are highlighted in green. Sparse Semi-DETR performs better in identifying small or hidden objects, as indicated by positive proposals around such items. It employs an attention mechanism, focusing on finer image details, which enhances the detection of hidden objects. Additionally, its similarity module further refines the proposal quality, leading to a notably improved identification accuracy.
Impact of Attention Module: In our Query Refinement, we refine the queries by applying the attention module to \(F_{t2}\) features and combining them with \(F_{t1}\) features. In this experiment, we study the impact of the attention module in Query Refinement, as highlighted in Table [tab:attention_network]. Figure 5 (a) illustrates concatenating the high-resolution features after extracting similar features from the low-resolution features, without applying an attention network to either set. Secondly, as indicated in Figure 5 (b), we apply the attention network to both sets of features. In Figure 5 (c), we extract similar features from \(F_{t2}\) and apply the attention network to only the \(F_{t1}\) features. In Figure 5 (d), we apply the attention network to the \(F_{t2}\) features and compute similarity with the \(F_{t1}\) features, which gives the best results. By applying the attention module to the \(F_{t2}\) features, the model can focus on capturing essential information. When these enhanced \(F_{t2}\) features are compared for similarity with the \(F_{t1}\) features, the model obtains refined detail, enabling more accurate predictions and better overall performance.
Impact of Similarity Module: We reduce the number of queries in query refinement by filtering similar query features in the low-resolution features. As indicated in Table [tab:similarity_network], removing the similarity module results in a performance decline of 0.3 mAP and increases the number of queries in the one-to-many training strategy. This confirms the importance of the similarity module in our query refinement strategy. The refined queries are effective because, when the enhanced low-resolution features are compared for similarity with the high-resolution features, the model can correlate the relevant information from both levels of detail, improving performance.
Figure 7: Qualitative comparison on the COCO test set. The prediction results are in red, and the green boxes refer to the prediction difference in Semi-DETR and Sparse Semi-DETR. (a) Small Objects: Semi-DETR, on the left, has missed detections of bird objects, indicated with green bounding boxes as false negatives. On the right, red bounding boxes signify correctly identified birds, showcasing Sparse Semi-DETR’s more precise and reliable detection capabilities for smaller objects. (b) Obscured Objects: The green boxes indicate the regions where the Semi-DETR has either failed to detect an object (false negatives) as the chair or incorrectly estimated the region of the objects, like the person. Sparse Semi-DETR detects obscured objects more precisely, improving performance in complex visual environments.
Figure 8: Qualitative comparison on the COCO test data. The prediction results are in red, and the blue boxes highlight the prediction difference in Semi-DETR and Sparse Semi-DETR. (a) Inaccurate localization: Semi-DETR incorrectly places multiple bounding boxes around the individual cow and bird objects, indicating it misidentified them as several entities instead of one. Sparse Semi-DETR, however, shows reduced duplications in bounding boxes. (b) Inaccurate classification: Semi-DETR partially misidentified a cat wearing a tie as the tie itself and confused a ‘remote’ object with a ‘cell phone’ as represented with a blue bounding box. Sparse Semi-DETR reduces the misidentification issues present in Semi-DETR, such as not confusing a cat wearing a tie as the tie itself and correctly identifying a ‘remote’ without mistaking it as a ‘cell phone.’
Qualitative comparison with the baseline. We employ Semi-DETR as the baseline and analyze the impact of Query Refinement on the one-to-many assignment strategy, as indicated in Figure 6. Sparse Semi-DETR generates more accurate and refined positive proposals for detecting small or hidden objects. Furthermore, our method significantly reduces the input queries to the decoder compared to Semi-DETR in the one-to-many assignment strategy.
| Approach | Training time (min) |
|---|---|
| Semi-DETR | 38.56 |
| Sparse Semi-DETR | 34.38 (−4.18) |
As evidenced in Table 2, this refinement of queries results in a training time reduction of 4.18 minutes per 1k iterations, a relative decrease of 10.84%. To further compare our Sparse Semi-DETR with the baseline Semi-DETR, we visualize the predicted bounding boxes on test2017, trained on the COCO 10% labeled data. In Figure 7 and Figure 8, we plot the predicted bounding boxes in red, while green and blue boxes highlight the differences in the predictions of Semi-DETR and Sparse Semi-DETR. We observe four general properties in our demonstration.
Firstly, Sparse Semi-DETR significantly improves the detection of small objects compared to Semi-DETR, primarily due to its advanced query refinement mechanism. As shown in Figure 7 (a), Sparse Semi-DETR is particularly beneficial for identifying small subjects such as birds, where Semi-DETR often struggles because of its inadequate query feature representation. By capturing refined details, Sparse Semi-DETR ensures more precise and reliable detection of these smaller objects, enhancing overall performance in object detection tasks.
Secondly, for obscured objects, Sparse Semi-DETR provides a distinct advantage over Semi-DETR through its refined query mechanism as indicated in Figure 7 (b). It allows Sparse Semi-DETR to understand better details of partially hidden objects, which is often challenging for Semi-DETR due to its less robust query features. As a result, Sparse Semi-DETR achieves more precise detection of obscured objects, leading to improved performance in complex visual environments.
Thirdly, Sparse Semi-DETR exhibits a significant advantage in removing duplicate predictions after the second stage. This is due to the reliable pseudo-label filtering module, which filters out some duplications and selects more accurate pseudo-labels. A notable example is the detection of cow objects, as shown in Figure 8 (a). While Semi-DETR tends to provide two predictions for the same object, Sparse Semi-DETR demonstrates remarkable proficiency in duplicate removal.
Fourthly, in the semi-supervised setting, Semi-DETR often faces challenges in accurately categorizing objects, even when the location is correctly identified. For example, Semi-DETR labels a ‘remote’ object as a ‘cell phone’ despite accurately providing its location as indicated in Figure 8 (b). This misclassification often arises from a disparity between the features used for object detection (regression) and those used for classification. In contrast, Sparse Semi-DETR stands out by adeptly distinguishing between closely related categories. It leverages its innovative attention and similarity module, which dynamically selects the most relevant features for each task, ensuring a more unified and accurate performance in both classification and localization.