SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion
for Category-Level Pose Estimation

Yamei Chen\(^{1,*}\), Yan Di\(^{1,*}\), Guangyao Zhai\(^{1,2,\dagger}\), Fabian Manhardt\(^{3}\), Chenyangguang Zhang\(^{4}\),
Ruida Zhang\(^{4}\), Federico Tombari\(^{1,3}\), Nassir Navab\(^{1}\) and Benjamin Busam\(^{1,2}\)
1 Technical University of Munich 2 Munich Center for Machine Learning
3 Google 4 Tsinghua University
https://github.com/NOrangeeroli/SecondPose.git


Abstract

Category-level object pose estimation, aiming to predict the 6D pose and 3D size of objects from known categories, typically struggles with large intra-class shape variation. Existing works utilizing mean shapes often fall short of capturing this variation. To address this issue, we present SecondPose, a novel approach integrating object-specific geometric features with semantic category priors from DINOv2. Leveraging the advantage of DINOv2 in providing SE(3)-consistent semantic features, we hierarchically extract two types of SE(3)-invariant geometric features to further encapsulate local-to-global object-specific information. These geometric features are then point-aligned with DINOv2 features to establish a consistent object representation under SE(3) transformations, facilitating the mapping from camera space to the pre-defined canonical space, thus further enhancing pose estimation. Extensive experiments on NOCS-REAL275 demonstrate that SecondPose achieves a 12.4% leap forward over the state-of-the-art. Moreover, on the more complex HouseCat6D dataset, which provides photometrically challenging objects, SecondPose still surpasses other competitors by a large margin.

1 Introduction↩︎

Figure 1: Categorical SE(3)-consistent features. We visualize our fused features by PCA. Colored points highlight the most corresponding parts, where our proposed feature achieves consistent alignment across instances (left vs. middle) and maintains consistency on the same instance under different poses (middle vs. right).

Category-level pose estimation involves estimating the complete 9 degrees-of-freedom (DoF) object pose, encompassing 3D rotation, 3D translation, and 3D metric size, for arbitrary objects within a known set of categories. This task has garnered significant research interest due to its essential role in various applications, including the AR/VR industry [1][4], robotics [5][7], and scene understanding [8][10]. In contrast to traditional instance-level pose estimation methods [11], [12], which rely on specific 3D CAD models for each target object, the category-level approach necessitates greater adaptability to accommodate inherent shape diversity within each category. Effectively addressing intra-class shape variations has thus become a central focus, crucial for real-world applications where objects within a category may exhibit significant differences in shape while sharing the same general category label.

Mean Shape vs. Semantic Priors. One common approach to handle intra-class shape variation involves using explicit mean shapes as prior knowledge [13][15]. These methods typically consist of two functional modules: one for reconstructing the target object by slightly deforming the mean shape and another for regressing the 9D pose based on the reconstructed object [13], [14] or enhanced intermediate features [15]. These methods assume that the mean shape can perfectly encapsulate the structural information of objects within each category, thus achieving reconstruction of the target object with minimal deformation is feasible. However, this assumption does not hold in reality. Objects within the same category, such as chairs, may have fundamental structural differences, leading to the failure of such methods.

Recently, self-supervised learning with large vision models has experienced a significant leap forward, among which DINOv2 [16], due to its exceptional performance in providing semantically consistent patch-wise features, has gained considerable attention. In particular, various methods [17] utilize semantic features from DINOv2 as essential priors to understand the object. In the field of pose estimation, compared to category-specific mean shapes, DINOv2 demonstrates superior generalization capabilities in object representation across each category, thanks to its large-scale training data and advanced training strategy. ZSP [18] directly leverages DINOv2 features for zero-shot construction of semantic correspondences between objects under different camera viewpoints, and then estimates the pose with RANSAC. POPE [19] and CNOS [20] harness DINOv2 to refine the object detection, thus implicitly boosting the accuracy of pose estimation. However, to our knowledge, there currently exists no method that explores how to fuse DINOv2 features with object-specific features to directly enhance the performance of category-level pose estimation.

In this paper, we present SecondPose, a novel method that fuses SE(3)-Consistent Dual-stream features to enhance category-level Pose estimation. Leveraging DINOv2’s patch-wise SE(3)-consistent semantic features, we extract two types of SE(3)-invariant geometric features—pair-wise distance and pair-wise angles—to encapsulate object-specific cues. We hierarchically aggregate geometric features within support regions of increasing radius to encode local-to-global object structure information. These features are then point-aligned with DINOv2 features to establish a unified object representation that is consistent under SE(3) transformations. Specifically, given an RGB-D image capturing the target object, we first back-project the depth map to generate the respective point cloud, which is then fed into our Geometric and Semantic Streams (Fig. 2.A-B) to extract the corresponding features for our dual-stream fusion (Fig. 2.C). The fused features denoted as SECOND are finally fed into an off-the-shelf pose estimator [21] (Fig. 2.D) to regress the 9D pose.

SE(3)-Consistent Fusion vs. Direct Fusion. Alternatively, one could think of directly concatenating DINOv2 features with the back-projected points in a point-wise manner, without extracting SE(3)-invariant geometric features. However, our proposed SE(3)-consistent fusion holds two important advantages over such a straightforward approach. First, while DINOv2 is trained solely on RGB images, the incorporation of geometric features from the point cloud enriches the representation with valuable local-to-global 3D structural information. This enrichment proves particularly advantageous in handling diverse object shapes within a given category. Second, our SE(3)-consistent object representation modifies the underlying pose estimation process from {point cloud \(\longrightarrow\) canonical space} to {point cloud \(\longrightarrow\) SE(3)-consistent representation \(\longrightarrow\) canonical space}. In this optimized pipeline, the second stage, transitioning from our object representation to the human-defined canonical space, is consistent (approximately invariant, see Fig. 4) under SE(3) transformations. This consistency significantly simplifies the pose estimation process, as the pose estimator only needs to operate within the second stage. Further, this streamlined approach not only enhances the accuracy of pose estimation but also contributes to the efficiency of the overall method.

To summarize, our main contributions are threefold:

  1. We present SecondPose, the first method to directly fuse object-specific hierarchical geometric features with semantic DINOv2 features for category-level pose estimation.

  2. Our SE(3)-consistent dual-stream feature fusion strategy yields a unified object representation that is robust under SE(3) transformations and better suited for downstream pose estimation.

  3. Extensive evaluation proves that our SE(3)-consistent fusion strategy significantly boosts pose estimation performance even under severe occlusion and clutter, enabling real-world applications.

2 Related Works↩︎

2.0.0.1 Instance-Level Pose Estimation

Instance-level pose estimation focuses on determining the 3D rotation and 3D translation of known objects given their 3D CAD models. Recent methods can be mainly categorized into three types: direct pose regression [22], [23], methods that establish 2D-3D correspondences through keypoint detection or pixel-wise 3D coordinate estimation [24][26], and approaches that learn pose-sensitive embeddings for subsequent pose retrieval [27]. While most keypoint-based approaches rely on the PnP algorithm [25], [26], [28] to solve for pose, some methods instead employ neural networks to learn the optimization step [24]. As for RGB-D input, traditional methodologies often rely on hand-crafted features [29], [30]. Some more recent approaches [31][35] instead extract features independently from RGB images and point clouds, using dedicated CNNs and point cloud networks. These individual features are then fused for direct pose regression [31], [33] or keypoint detection [32], [34], [35]. Despite significant progress, practical applications of these methods remain limited due to their restriction to a few objects and the need for 3D CAD models.

Figure 2: Illustration of SecondPose. Semantic features are extracted using the DINOv2 model (A), and the HP-PPF feature is computed on the point cloud (B). These features, combined with RGB values, are fused into our SECOND feature \(F_f\) (C) using stream-specific modules \(L_s\), \(L_g\), \(L_c\), and a shared module \(L_f\) for concatenated features. The resulting fused features, in conjunction with the point cloud, are utilized for pose estimation (D).

2.0.0.2 Category-Level Pose Estimation

In the domain of category-level pose estimation, the objective is to predict the 9DoF pose of any object, whether previously seen or novel, from a predefined set of categories. This task is inherently more complex due to significant intra-class variations in shape and texture. To address these challenges, Wang et al. [36] developed the Normalized Object Coordinate Space (NOCS), offering a unified representation framework. This approach involves mapping the observed point cloud to the NOCS system, followed by pose recovery via the Umeyama algorithm [37]. Alternatively, CASS [38] introduces a learned canonical shape space, while FS-Net [39] advocates for a decoupled representation of rotation, focusing on direct pose regression. DualPoseNet [40] employs dual networks for both explicit and implicit pose prediction, ensuring consistency for refined pose estimation. GPV-Pose [41] and OPA-3D [42] leverage geometric insights in bounding box projection to augment the learning of pose-sensitive features specific to categories. HS-Pose [43] proposes the HS-layer, a simple network structure that extends 3D graph convolution to extract hybrid scope latent features from point cloud data. In contrast, 6-PACK [44] conducts pose tracking by means of semantic keypoints, and CAPTRA [45] combines coordinate prediction with direct regression for enhanced accuracy. SelfPose [46] utilizes optical flow to enhance pose estimation accuracy.

To address the issue of intra-class shape variations, several works have focused on the incorporation of additional shape priors. SPD [14] utilizes a PointNet autoencoder to derive a prior point cloud for each category, representing the average shape. This model is then adapted to fit specific observed instances, assigning the observed point cloud to the reconstructed shape model. SGPA [47] dynamically adjusts the shape prior based on structural similarities of the observed instances. SAR-Net [48], while also employing shape priors, further leverages geometric attributes of objects to enhance performance. ACR-Pose [49] instead utilizes a shape prior-guided reconstruction network paired with a discriminator to achieve high-quality canonical representations.

Furthermore, recent research has introduced prior-free methods that demonstrate performance comparable to approaches relying on priors. VI-Net [21] attains high precision in object pose estimation by separating rotation into viewpoint and in-plane rotations. Additionally, IST-Net [50] achieves state-of-the-art performance on the REAL275 benchmark by implicitly transforming camera-space features to world-space counterparts without depending on priors.

3 Method↩︎

The objective of SecondPose is to estimate the 9DoF object pose from a single RGB-D image. In particular, given an RGB-D image capturing the target object from a set of known categories, our goal is to recover its full 9DoF object pose, including the 3D rotation \(\boldsymbol{R} \in SO(3)\), the 3D translation \(t \in \mathbb{R}^3\), and the 3D metric size \(s \in \mathbb{R}^3\).

3.1 Overview.↩︎

As illustrated in Fig. 2, SecondPose consists of 3 main modules to predict the object pose from a single RGB-D input: i) the extraction of geometric features \(F_g\) and semantic features \(F_s\), ii) the dual-stream feature fusion that builds our SE(3)-consistent object representation \(F_f\), and iii) the final pose regression from the fused representation.

Figure 3: Hierarchical panel-based geometric features. The inner panel contains points that are close to the point of interest, and outer panels contain points far from the point of interest.

3.2 Semantic Category Prior From DINOv2↩︎

3.2.0.1 DINOv2 is an implicit rotation learner

We use DINOv2 [16] as our image feature extractor. As shown in [17], DINOv2 extracts semantic-aware information from RGB images that can be leveraged to establish zero-shot semantic correspondences, rendering it an excellent method for rich semantic information extraction.

As for estimating the 3D rotation, such extra semantic-aware information can provide a noticeable boost in performance. For example, suppose that the z-axis commonly points to the top side of the object in model space, the y-axis always points to the front side, and the x-axis always points to the left side. Harnessing the semantic information given by DINOv2, the model can more easily identify the top, front, and left sides of the object, thus turning rotation estimation into a much simpler task. Moreover, DINOv2 features additionally contain global information about the object, including the object category and pose. Such information can thus serve as a good global prior for our method.

3.2.0.2 Deeper DINOv2 features

We use the "token" facet from the last (11th.) layer as our extracted semantic feature. Essentially, [17] has demonstrated that the features of deeper layers exhibit optimal semantic matching performance, thus providing improved consistency in terms of semantic correspondence across different objects. In addition, features from deeper layers also possess more holistic semantic information. A visualization piece is shown in Fig. 2.A.

3.2.0.3 Direct pose estimation from DINOv2

As aforementioned, the ad-hoc fusion of DINOv2 features with the back-projected points exhibits several downsides. First, DINOv2 extracts information only from RGB images; hence, the contained geometric information is limited. Second, as we make use of deeper-layer features from DINOv2 for a more holistic representation, the local detailed information is blurred to some extent. To complement DINOv2 features in these aspects, we thus need to combine them with geometric features containing local information for better descriptive power.

3.3 Hierarchical Geometric Features↩︎

The stream pipeline is shown in Fig. 2.B. Our geometric embedding in this stream is based on the calculation of pair-wise SE(3)-invariant Point Pair Features (PPFs) [29]. We construct our SE(3)-invariant coordinate representation by aggregating the PPFs between the point of interest and its neighborhood points in multiple panels centered on it. We hierarchically concatenate the corresponding SE(3)-invariant coordinate representations of each panel to enrich the representational power of our geometric feature, HP-PPF. Fig. 3.c provides a visualization of HP-PPF.

3.3.0.1 Point Pair Features (PPFs)

A comprehensive example is shown in Fig. 3.a. Given an object point cloud denoted as \(\boldsymbol{P}\), we consider each pair of points \((p_i, p_j)\) with \(p_i, p_j \in \boldsymbol{P}\). Local normal vectors \(\boldsymbol{n_i}\) and \(\boldsymbol{n_j}\) are computed at \(p_i\) and \(p_j\), respectively. The final pairwise feature between \(p_i\) and \(p_j\) is defined as

\[f_{i,j} = [d_{i,j}, \alpha_{i,j}, \beta_{i,j}, \theta_{i,j}], \label{eq:ppf}\tag{1}\] where \(d_{i,j} = \lVert p_j - p_i \rVert\) describes the Euclidean distance between points \(p_i\) and \(p_j\). \(\alpha_{i,j} = \angle (\boldsymbol{n_i}, p_j-p_i)\) represents the angular deviation between the normal vector \(\boldsymbol{n_i}\) at point \(p_i\) and the vector extending from \(p_i\) to \(p_j\). \(\beta_{i,j} = \angle (\boldsymbol{n_j}, p_j-p_i)\) denotes the angle subtended by the normal vector \(\boldsymbol{n_j}\) at point \(p_j\) with the aforementioned vector from \(p_i\) to \(p_j\). \(\theta_{i,j} = \angle (\boldsymbol{n_j}, \boldsymbol{n_i})\) denotes the angular disparity between the normal vectors \(\boldsymbol{n_j}\) and \(\boldsymbol{n_i}\) at points \(p_j\) and \(p_i\), respectively. Notice that, since it relies only on relative distances and angles, this descriptor is invariant under \(SE(3)\).
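A minimal sketch of Eq. (1) for a single point pair is given below; the helper names `angle_between` and `ppf` are illustrative and not part of any released implementation.

```python
# Minimal sketch of the PPF in Eq. (1); helper names are illustrative.
import numpy as np

def angle_between(u, v, eps=1e-8):
    """Unsigned angle (radians) between two 3D vectors."""
    u = u / (np.linalg.norm(u) + eps)
    v = v / (np.linalg.norm(v) + eps)
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

def ppf(p_i, n_i, p_j, n_j):
    """f_{i,j} = [d, alpha, beta, theta] for points p_i, p_j with normals n_i, n_j."""
    d_vec = p_j - p_i
    d = np.linalg.norm(d_vec)            # Euclidean distance d_{i,j}
    alpha = angle_between(n_i, d_vec)    # angle(n_i, p_j - p_i)
    beta = angle_between(n_j, d_vec)     # angle(n_j, p_j - p_i)
    theta = angle_between(n_j, n_i)      # angle(n_j, n_i)
    return np.array([d, alpha, beta, theta])
```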

3.3.0.2 Geometric Feature Panel

Based on PPFs, we propose panel-based PPFs to construct our geometric representation, which increases the perception field while maintaining the merit of locality. For each point \(p_i\) in the point cloud \(\boldsymbol{P}\), we define a support panel \(\mathcal{S}^i \subseteq \boldsymbol{P}\) with cardinality \(s_i = |\mathcal{S}^i|\). For all points \(p_j \in \mathcal{S}^i\), we calculate the PPF \(f_{i, j}\) between \(p_i\) and \(p_j\); the local coordinate representation \(f^i_l\) of \(p_i\) is then obtained as the average \[f^i_l = \frac{1}{s_i}\left(\sum_j{d_{i,j}}, \sum_j{\alpha_{i,j}}, \sum_j{\beta_{i,j}},\sum_j{\theta_{i,j}}\right).\]

3.3.0.3 From Single to Hierarchical Panels

Even though the mean aggregation within a panel takes neighboring points into account, the inherently local representation limits its representational power, as the features derived from the normals \(\boldsymbol{n_i},\boldsymbol{n_j}\) are noisy when the perception field is constrained. Inspired by CNNs, which extract hierarchical features from local to global, we hierarchically sample multiple panels from local to global, as shown in Fig. 3.b. Specifically, given a point set \(\boldsymbol{P}\) with cardinality \(|\boldsymbol{P}|\) and integers \((k_0, k_1, k_2,..., k_l)\) satisfying \(0 = k_0<k_1<k_2<...<k_l=|\boldsymbol{P}|-1\), we first rank, for each point \(p_i\in \boldsymbol{P}\), its distances to all other points in \(\boldsymbol{P}\) from smallest to largest, \[r_{i,j} = \operatorname{rank}(d_{i,j}),\] and construct support panels \[\mathcal{S}^{i,m} = \{p_j \in \boldsymbol{P}\;|\; k_{m-1}< r_{i,j}\leq k_m \}, \quad 1 \leq m \leq l,\] with \(l\) being the number of employed panels. We then calculate the corresponding pose-invariant coordinate representations \(f^{i,m}_l\) for each panel \(\mathcal{S}^{i,m}\) and concatenate them to obtain the point-wise geometric feature \[f_{g}^i = f^{i,1}_l\oplus f^{i,2}_l\oplus ...\oplus f^{i,l}_l .\] Thereby, for smaller \(k\), the support panel is composed of points that are closer to the point of interest, whereas for larger \(k\), the support panel consists of points that are farther away. By concatenating features computed on panels of different scales, we harness geometric features in a way that balances details of the local geometric landscape and global instance-wise shape information. We experimentally show in Sec. 4 that this design performs better than the usual single-panel descriptor.
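The hierarchical construction can be summarized by the following sketch for a single point, assuming the `ppf` helper from the previous snippet; the panel bounds follow the parameters reported in the supplementary material, and the loop-based implementation is for clarity only.

```python
# Sketch of the HP-PPF feature f_g^i for one point i; assumes `ppf` from above.
import numpy as np

def hp_ppf(points, normals, i, ks=(0, 10, 20, 40, 80, 160, 299)):
    """Concatenate per-panel mean PPFs for point i (panels bounded by ranks ks)."""
    d = np.linalg.norm(points - points[i], axis=1)
    order = np.argsort(d)[1:]                     # other points ranked by distance to p_i
    feats = []
    for m in range(1, len(ks)):
        panel = order[ks[m - 1]:ks[m]]            # points whose rank lies in (k_{m-1}, k_m]
        panel_ppfs = np.stack([ppf(points[i], normals[i], points[j], normals[j])
                               for j in panel])
        feats.append(panel_ppfs.mean(axis=0))     # per-panel average f_l^{i,m}
    return np.concatenate(feats)                  # f_g^i = f_l^{i,1} + ... + f_l^{i,l} (concat)
```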

3.4 SE(3)-Consistent Feature Fusion↩︎

3.4.0.1 Fusion Strategy

We fuse the DINOv2 features, the geometric features, and the RGB values, as shown in Fig. 2.C. In particular, following VI-Net [21] as our pose estimator, we first project each feature stream \(\mathcal{F}\), together with the 3D point cloud \(P = \{p_i\}\), onto a spherical feature map \(F\). To this end, we divide the sphere uniformly into \(W \times H\) bins along the azimuth and elevation axes. Each bin is assigned the feature of the point with the largest distance; bins containing no point are set to 0. For each feature map \(F_i \in \{F_g, F_s, F_c \}\), representing the geometric features, the DINOv2 features, and the respective RGB values, we employ a separate ResNet model \(\mathcal{L}_i\) as feature extractor. The outputs of these individual feature extractors are then concatenated and fed into another ResNet for feature fusion, yielding \(F_{f}\), also denoted as SECOND, \[F_{f} = \mathcal{L}_f\left(\mathcal{L}_g(F_g) \oplus \mathcal{L}_s(F_s) \oplus \mathcal{L}_c(F_c)\right).\]
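The sketch below illustrates this fusion step; the spherical binning details, the small convolutional stacks standing in for the per-stream ResNets, and the channel sizes are simplifying assumptions rather than the exact architecture.

```python
# Hedged sketch of F_f = L_f(L_g(F_g) + L_s(F_s) + L_c(F_c)); channels/shapes are assumptions.
import torch
import torch.nn as nn

def spherical_project(points, feats, H=32, W=64):
    """Scatter point-wise features into an H x W (elevation x azimuth) map,
    keeping the feature of the farthest point per bin; empty bins stay zero."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = points.norm(dim=1).clamp(min=1e-8)
    az = (torch.atan2(y, x) + torch.pi) / (2 * torch.pi)               # in [0, 1]
    el = (torch.asin((z / r).clamp(-1, 1)) + torch.pi / 2) / torch.pi  # in [0, 1]
    u = (az * W).long().clamp(max=W - 1)
    v = (el * H).long().clamp(max=H - 1)
    fmap = torch.zeros(feats.shape[1], H, W)
    best = torch.full((H, W), -1.0)
    for k in range(points.shape[0]):
        if r[k] > best[v[k], u[k]]:
            best[v[k], u[k]] = r[k]
            fmap[:, v[k], u[k]] = feats[k]
    return fmap

def stream(c_in, c_out=64):        # stand-in for one per-stream ResNet
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

L_g, L_s, L_c = stream(24), stream(384), stream(3)   # HP-PPF (6 panels x 4), DINOv2 ViT-S, RGB
L_f = stream(3 * 64, 128)

def fuse(F_g, F_s, F_c):
    """F_g, F_s, F_c: batched spherical maps of shape (N, C_i, H, W)."""
    return L_f(torch.cat([L_g(F_g), L_s(F_s), L_c(F_c)], dim=1))
```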

3.4.0.2 Advantages of SE(3)-Consistent Fusion

The design of an SE(3)-consistent fusion is an integral part of the improved quality of our method. As for the 3D rotation, we are learning a mapping from the space of point clouds and their features \((P,F) \in \mathbb{R}^{n \times 3} \times \mathbb{R}^{n \times C}\) to the space of 3D rotations \(SO(3)\), \[\Phi: \mathbb{R}^{n \times 3} \times \mathbb{R}^{n \times C} \rightarrow SO(3).\] This mapping \(\Phi\) should ensure rotation-equivariance, meaning that \[\Phi(R_xP, \psi_{R_x} (F)) = R_x\Phi(P,F), \quad \forall R_x \in SO(3), \label{re}\tag{2}\] where \(\psi_{R_x}\) is the transformation applied to the features when rotating the point cloud by \(R_x\). This rotation-equivariance relation is essential for the learned model to generalize well to unseen data. Without such equivariance embedded in the model structure, this relation needs to be learned from large amounts of data and is thus limited by the available data scale. Our SE(3)-consistent features are designed to be approximately rotation-invariant, hence \[\psi_{R_x}(F) \approx F , \quad \forall R_x \in SO(3),\] eliminating the effect of \(\psi_{R_x}\) in Eq. 2 and thus making the rotation-equivariance relation easier to learn.
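As a sanity check, the small script below (an illustration, not part of the method) verifies that the exact geometric PPF of Sec. 3.3 is unchanged under a random rigid transformation, i.e. \(\psi_{R_x}(F) = F\) for this feature; it assumes the `ppf` sketch given earlier.

```python
# Numerical check that the PPF of Eq. (1) is SE(3)-invariant; assumes `ppf` from Sec. 3.3.
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
p_i, p_j = rng.normal(size=3), rng.normal(size=3)
n_i = rng.normal(size=3); n_i /= np.linalg.norm(n_i)
n_j = rng.normal(size=3); n_j /= np.linalg.norm(n_j)

R = Rotation.random(random_state=0).as_matrix()   # random rotation
t = rng.normal(size=3)                            # random translation

f_before = ppf(p_i, n_i, p_j, n_j)
f_after = ppf(R @ p_i + t, R @ n_i, R @ p_j + t, R @ n_j)   # normals rotate only
assert np.allclose(f_before, f_after)             # psi_R(F) = F for exact geometry
```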

3.5 SecondPose Training and Inference↩︎

Following [21], we leverage a lightweight PointNet++ [51] as the translation and size estimation heads. Given an RGB-D image, we first segment the object of interest using Mask-RCNN [52], similar to [21], [41]. We then randomly select \(N\) points from the back-projected 3D point clouds \(\boldsymbol{P} \in \mathbb{R}^{n \times 3}\) with RGB features \(F_c\) and use them to estimate the translation and size, as shown in Fig. 2.D.

The core of our method is thus developed to focus on the more challenging task of 3D rotation estimation. We essentially train a separate translation-size network and rotation network. For the translation-size network, we adopt the L1 loss for both size and translation, \[L_{ts} = \lambda_t |t_{pred} - t_{gt}| + \lambda_s |s_{pred} - s_{gt}|.\] For the 3D rotation, we instead directly predict the rotation matrix as 9 values, which we optimize via the L1 loss \[L_{R} = |R_{pred} - R_{gt}|.\] During training, the ground-truth translation and size are used to center and normalize the point cloud before rotation estimation, while during inference the predicted size and translation are used for normalization.
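A minimal sketch of the two losses is shown below; the loss weights and the mean reduction are illustrative assumptions, not the exact training configuration.

```python
# Sketch of the translation/size and rotation losses in Sec. 3.5.
import torch

def translation_size_loss(t_pred, t_gt, s_pred, s_gt, lambda_t=1.0, lambda_s=1.0):
    """L_ts = lambda_t * |t_pred - t_gt| + lambda_s * |s_pred - s_gt| (L1)."""
    return (lambda_t * (t_pred - t_gt).abs().mean()
            + lambda_s * (s_pred - s_gt).abs().mean())

def rotation_loss(R_pred, R_gt):
    """Element-wise L1 on the predicted 3x3 rotation matrix (9 values)."""
    return (R_pred - R_gt).abs().mean()
```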

Table 1: Quantitative comparisons of different methods for category-level 6D object pose estimation on REAL275 [36]. ‘*’ denotes the CATRE [53] IoU metric. The best results are in bold, and the second best results are underlined.

| Method | Shape Priors | IoU\(_{75}\)* | \(5^{\circ}\;2\mathrm{~cm}\) | \(5^{\circ}\;5\mathrm{~cm}\) | \(10^{\circ}\;2\mathrm{~cm}\) | \(10^{\circ}\;5\mathrm{~cm}\) |
|---|---|---|---|---|---|---|
| SPD [14] | ✔ | 27.0 | 19.3 | 21.4 | 43.2 | 54.1 |
| CR-Net [44] | ✔ | 33.2 | 27.8 | 34.3 | 47.2 | 60.8 |
| CenterSnap-R [54] | ✔ | - | - | 29.1 | - | 64.3 |
| ACR-Pose [49] | ✔ | - | 31.6 | 36.9 | 54.8 | 65.9 |
| SAR-Net [48] | ✔ | - | 31.6 | 42.3 | 50.3 | 68.3 |
| SSP-Pose [55] | ✔ | - | 34.7 | 44.6 | - | 77.8 |
| SGPA [47] | ✔ | 37.1 | 35.9 | 39.6 | 61.3 | 70.7 |
| RBP-Pose [15] | ✔ | - | 38.2 | 48.1 | 63.1 | 79.2 |
| SPD + CATRE [53] | ✔ | 43.6 | 45.8 | 54.4 | 61.4 | 73.1 |
| DPDN [56] | ✔ | - | 46.0 | 50.7 | 70.4 | 78.4 |
| FS-Net [39] | ✗ | - | - | 28.2 | - | 60.8 |
| DualPoseNet [40] | ✗ | 30.8 | 29.3 | 35.9 | 50.0 | 66.8 |
| GPV-Pose [41] | ✗ | - | 32.0 | 42.9 | - | 73.3 |
| SS-ConvNet [57] | ✗ | - | 36.6 | 43.4 | 52.6 | 63.5 |
| HS-Pose [43] | ✗ | - | 46.5 | 55.2 | 68.6 | \(\underline{82.7}\) |
| IST-Net [50] | ✗ | - | 47.5 | 53.4 | \(\underline{72.1}\) | 80.5 |
| VI-Net [21] | ✗ | \(\underline{48.3}\) | \(\underline{50.0}\) | \(\underline{57.6}\) | 70.8 | 82.1 |
| SecondPose (Ours) | Semantic Priors | \(\boldsymbol{49.7}\) | \(\boldsymbol{56.2}\) | \(\boldsymbol{63.6}\) | \(\boldsymbol{74.7}\) | \(\boldsymbol{86.0}\) |

4 Experiment↩︎

4.1 Experimental Setup.↩︎

4.1.0.1 Datasets

We conduct our experiments on the common 9D pose estimation benchmarks NOCS-REAL275 [36], NOCS-CAMERA25 [36] as well as HouseCat6D[58] datasets. NOCS-REAL275 is a real-world dataset with 13 scenes containing objects from 6 different categories; 4,300 images of 7 scenes are used as a training set, while the other 2,750 images of 6 scenes form the test set. NOCS-CAMERA25 is a synthetic dataset containing 300k images with objects from the same categories as NOCS-REAL275. HouseCat6D is a comprehensive multi-modal real-world dataset, featuring 194 high-fidelity 3D models of household items of 10 categories. The collection encompasses transparent and reflective objects situated in 41 scenes, presenting a wide range of viewpoints, challenging occlusions, and devoid of markers.

4.1.0.2 Evaluation Metrics

As for the NOCS-REAL275 dataset, we report the mean Average Precision (mAP) under the \(5^{\circ} 2 \mathrm{cm}, 5^{\circ} 5 \mathrm{cm}, 10^{\circ} 2 \mathrm{cm}\), and \(10^{\circ} 5 \mathrm{cm}\) metrics. \(n^{\circ} m \mathrm{cm}\) denotes the percentage of predictions with rotation error within \(n\) degrees and translation error within \(m\) centimeters. We also report the mAP of 3D Intersection over Union (IoU) at a threshold of \(75 \%\). For the HouseCat6D dataset, we additionally report the mAP of 3D IoU under thresholds of \(25 \%\) and \(50 \%\).
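For reference, the check below sketches how a single prediction is tested against the \(n^{\circ}\,m\mathrm{cm}\) criterion (symmetry handling, which the benchmark applies for symmetric categories, is omitted here).

```python
# Sketch of the n-degree m-cm criterion for one prediction (no symmetry handling).
import numpy as np

def pose_errors(R_pred, t_pred, R_gt, t_gt):
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # geodesic rotation error
    trans_err_cm = np.linalg.norm(t_pred - t_gt) * 100.0          # metres -> centimetres
    return rot_err_deg, trans_err_cm

def within_threshold(R_pred, t_pred, R_gt, t_gt, n_deg=5.0, m_cm=2.0):
    rot_err, trans_err = pose_errors(R_pred, t_pred, R_gt, t_gt)
    return rot_err <= n_deg and trans_err <= m_cm
```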

Figure 4: Qualitative comparison on REAL275 [36]. We compare our prediction with ground truth and the prediction of our baseline, VI-Net[21]. Our approach achieves significantly higher precision in rotation estimation.

4.1.0.3 Efficiency

Our method achieves an inference speed of 9 FPS. Excluding the running time of DINOv2, our inference speed increases to 10 FPS.

4.1.0.4 Implementation Details

We use Mask-RCNN [52] to segment the objects of interest from the input image. We then combine point-wise radial distances, RGB values, and semantic-aware features from DINOv2 together with our proposed local-to-global SE(3)-invariant geometric features as input for further processing. For the RGB values and the point-wise radial distances, we sample 2048 points from the point cloud. For DINOv2 features, we first crop the image by the bounding box around the object of interest and then resize the crop to a resolution of \(210 \times 210\). Finally, for our geometric features, we sample 300 points from the previously sampled 2048 points and estimate point-wise normal vectors using the 10 nearest neighbors. To train our model on the NOCS dataset, we use a mixture of 25% real-world images from the REAL275 training set and 75% synthetic images from the CAMERA25 training set, similar to [36]. For all experiments, we train our models with batch size 48 on a single NVIDIA 3090 GPU for 40 epochs.
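The normal estimation step can be realized, for instance, with a PCA fit over the 10 nearest neighbours as sketched below; the PCA-based estimator and the unresolved sign ambiguity are assumptions about the implementation.

```python
# Sketch of point-wise normal estimation from the 10 nearest neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_normals(points, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    _, idx = nn.kneighbors(points)             # neighbour indices (including the query point)
    normals = np.zeros_like(points)
    for i, nbrs in enumerate(idx):
        patch = points[nbrs] - points[nbrs].mean(axis=0)
        _, _, vt = np.linalg.svd(patch, full_matrices=False)
        normals[i] = vt[-1]                    # direction of smallest variance (sign ambiguous)
    return normals
```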

Figure 5: Qualitative comparison on HouseCat6D [59]. We compare our prediction with ground truth and the prediction of our baseline, VI-Net[21].

4.2 Comparison with State-of-the-Art Methods↩︎

In Tab. 1, we compare SecondPose with the state of the art on the NOCS-REAL275 dataset. As can be easily observed, our method outperforms all state-of-the-art approaches, including the recent VI-Net [21], by a large margin on all metrics. More specifically, our method exceeds VI-Net on \(5^{\circ} 2 \mathrm{~cm}\) and \(10^{\circ} 2 \mathrm{~cm}\) by 6.2% and 3.9%, respectively, demonstrating the effectiveness of our SE(3)-consistent feature fusion design. When comparing with DPDN [13], the best method using a mean shape prior, our improvements in the \(5^{\circ} 2 \mathrm{~cm}\) and \(5^{\circ} 5 \mathrm{~cm}\) metrics amount to 10.2% and 12.8%. We show qualitative results in Fig. 4. It can be observed that SecondPose is more robust when handling objects with large intra-class variations, such as cameras. In Tab. 2, we evaluate our method on the HouseCat6D dataset. Our method again exceeds current state-of-the-art methods by a large margin. On the \(\text{IoU}_{50}\) metric, our method outperforms the second-best method VI-Net by 9.7% on average. Additional qualitative results can be found in Fig. 5.

Table 2: Overall and class-wise evaluation of 3D IoU (at 25% and 50%) on the HouseCat6D dataset [59]. Each entry reads \(\text{IoU}_{25}\) / \(\text{IoU}_{50}\); the first entry of each row is the mean over all 10 categories, followed by the per-category results. The best results are in bold.
50.0 / 21.2 41.9 / 5.0 43.3 / 6.5 81.9 / 62.4 68.8 / 2.0 81.8 / 59.8 24.3 / 0.1 14.7 / 6.0 95.4 / 49.6 21.0 / 4.6 26.4 / 16.5
74.9 / 48.0 65.3 / 45.0 31.7 / 1.2 98.3 / 73.8 96.4 / 68.1 65.6 / 46.8 69.9 / 59.8 71.0 / 51.6 99.4 / 32.4 79.7 / 46.0 71.4 / 55.4
74.9 / 50.7 66.8 / 45.6 31.4 / 1.1 98.6 / 75.2 96.7 / 69.0 65.7 / 46.9 75.4 / 61.6 70.9 / 52.0 99.6 / 62.7 76.9 / 42.4 67.4 / 50.2
VI-Net [21] 80.7 / 56.4 90.6 / 79.6 44.8 / 12.7 \(\boldsymbol{99.0}\) / 67.0 96.7 / 72.1 54.9 / 17.1 52.6 / 47.3 89.2 / 76.4 99.1 / 93.7 94.9 / 36.0 85.2 / 62.4
SecondPose (Ours) \(\boldsymbol{83.7}\) / \(\boldsymbol{66.1}\) \(\boldsymbol{94.5}\) / \(\boldsymbol{79.8}\) \(\boldsymbol{54.5}\) / \(\boldsymbol{23.7}\) \(98.5\) / \(\boldsymbol{93.2}\) \(\boldsymbol{99.8}\) / \(\boldsymbol{82.9}\) 53.6 / \(\boldsymbol{35.4}\) \(\boldsymbol{81.0}\) / \(\boldsymbol{71.0}\) \(\boldsymbol{93.5}\) / 74.4 \(\boldsymbol{99.3}\) /92.5 75.6 / 35.6 \(\boldsymbol{86.9}\) / \(\boldsymbol{73.0}\)
Table 3: Ablation study on REAL275 [36]. ‘*’ denotes the CATRE [53] IoU metric. Rows (B0)-(G2) correspond to the ablation settings [AS-1]-[AS-6] in Sec. 4.4; (C0)-(C3) each remove one of the four PPF components \(d\), \(\alpha\), \(\beta\), \(\theta\).

| Setting | \(\mathrm{IoU}_{75}\)* | \(5^{\circ}\;2\mathrm{~cm}\) | \(5^{\circ}\;5\mathrm{~cm}\) | \(10^{\circ}\;2\mathrm{~cm}\) | \(10^{\circ}\;5\mathrm{~cm}\) |
|---|---|---|---|---|---|
| SecondPose (full model) | 49.7 | 56.2 | 63.6 | 74.7 | 86.0 |
| (B0) w/o semantic features | 48.0 | 51.1 | 58.9 | 71.6 | 82.4 |
| (B1) w/o geometric features | 49.5 | 55.1 | 62.3 | 73.7 | 84.8 |
| (B2) w/o both | 48.5 | 49.9 | 57.4 | 70.4 | 80.8 |
| (C0) | 49.1 | 55.1 | 63.1 | 73.7 | 85.0 |
| (C1) | 49.3 | 54.7 | 62.8 | 73.1 | 84.7 |
| (C2) | 49.6 | 54.8 | 62.7 | 74.6 | 86.7 |
| (C3) | 49.5 | 55.1 | 63.1 | 74.2 | 85.6 |
| (D0) KNN (10 nearest neighbors) instead of hierarchical panels | 49.4 | 55.4 | 63.1 | 73.7 | 85.5 |
| (E0) random rotation up to \(5^{\circ}\) | 49.7 | 56.1 | 63.4 | 74.6 | 85.9 |
| (E1) random rotation up to \(10^{\circ}\) | 49.4 | 55.8 | 63.5 | 74.4 | 85.8 |
| (E2) random rotation up to \(15^{\circ}\) | 48.5 | 55.4 | 63.0 | 73.9 | 85.4 |
| (E3) random rotation up to \(20^{\circ}\) | 47.9 | 54.5 | 62.4 | 73.2 | 85.1 |
| (F0) occlusion, n = 16 | 49.7 | 56.0 | 63.6 | 74.8 | 86.2 |
| (F1) occlusion, n = 8 | 49.5 | 55.7 | 63.2 | 74.3 | 85.6 |
| (F2) occlusion, n = 4 | 46.7 | 52.5 | 60.9 | 71.5 | 84.6 |
| (G0) point noise, s = 0.002 | 49.7 | 56.1 | 63.6 | 74.6 | 85.8 |
| (G1) point noise, s = 0.005 | 49.6 | 55.8 | 63.4 | 74.4 | 86.0 |
| (G2) point noise, s = 0.01 | 45.9 | 53.7 | 62.6 | 73.4 | 86.1 |

4.3 Limitations↩︎

Since we rely on DINOv2 features, our method’s efficacy is bounded by the limitations of DINOv2. When DINOv2 is unable to provide meaningful semantic information for specific images, our approach cannot overcome this limitation.

4.4 Ablation Studies↩︎

To confirm the efficacy of our design choices, we conduct several ablation studies on the NOCS-REAL275[36] dataset.

[AS-1] Efficacy of employing semantic and geometric features. To show the effectiveness of our semantic-geometric feature fusion, we train the proposed model in 3 different variations: i) without semantic features, ii) without geometric features, and iii) without both semantic and geometric features. The results are presented in Tab. 3 (B0) - (B2). When considering the strict \(5^{\circ} 2 \mathrm{cm}\) metric, removing semantic features, geometric features, or both always leads to a large decrease in performance; the performance respectively drops by 5.1%, 1.1%, and 6.3%.

[AS-2] Efficacy of individual geometric features. We further run ablations on the four geometric features \(d\), \(\alpha\), \(\beta\), \(\theta\). The corresponding results are presented in Tab. 3 under (C0) - (C3). As can be observed, removing any component from the geometric feature leads to a clear drop in performance. For example, for the \(5^{\circ} 2 \mathrm{cm}\) metric the performance drops by at least 1.1%. To summarize, each geometric feature contributes to the expressiveness of the geometric representation.

[AS-3] Efficacy of hierarchical panel construction. As shown in Tab. 3 (D0), when the hierarchical panel is substituted by KNN with 10 nearest neighbors, the \(5^{\circ} 2 \mathrm{cm}\) metric undergoes a decrease by 0.8%. This demonstrates the importance of our hierarchical panel construction, as it better captures finer-grained local and global information.

[AS-4] Robustness under random rotation. To show the robustness of our method under random rotations applied to the point cloud, we perform experiments on test images in which the entire point cloud is randomly rotated by an angle \(A\) drawn from \([0^{\circ}, n^{\circ}]\), with n = 5, 10, 15, 20; see Tab. 3 (E0) - (E3). The results show that our method performs well under these circumstances.

[AS-5] Robustness under manual occlusions. We also perform an additional experiment to show the robustness of our method under various levels of occlusion. We manually mask out the object with rectangular masks of different scales, whose length and width are set to 1/n of the length and width of the original object bounding box. We run tests with n = 16, 8, 4 in Tab. 3 (F0) - (F2). When undergoing only mild occlusion, i.e. \(n=16\), the performance is almost identical to the original result. Moreover, even when dealing with very large occlusions of 1/4 of the size of the object, the performance is still fairly strong, with only a small decrease of 3% for \(\text{IoU}_{75}\).

[AS-6] Robustness under perturbation of the point cloud. Next, we evaluate the robustness of our method under random perturbations applied to the point clouds. To this end, we add random noise sampled from a uniform distribution ranging from \(-0.5sr\) to \(0.5sr\), where \(s\) is the scale factor and \(r\) is the average distance of the point cloud to the object center. We test our model with s = 0.002, 0.005, 0.01 in Tab. 3 (G0) - (G2). We observe again that with a mild perturbation of s = 0.002, the performance is almost identical to the original result, while with a relatively large perturbation of s = 0.01, the performance is still fairly strong, with only a small decrease of 3.8% for \(\text{IoU}_{75}\).
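The perturbation used in this study can be reproduced with the short sketch below; the per-coordinate uniform sampling is our reading of the described noise model and an assumption about the exact implementation.

```python
# Sketch of the point cloud perturbation in [AS-6]: uniform noise in [-0.5*s*r, 0.5*s*r].
import numpy as np

def perturb(points, s=0.01, seed=0):
    rng = np.random.default_rng(seed)
    centre = points.mean(axis=0)
    r = np.linalg.norm(points - centre, axis=1).mean()   # mean distance to object centre
    noise = rng.uniform(-0.5 * s * r, 0.5 * s * r, size=points.shape)
    return points + noise
```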

5 Conclusion↩︎

In this paper, we propose SecondPose, which designs an SE(3)-consistent fusion of semantic and geometric features for category-level pose estimation. The two feature streams are shown to complement each other and jointly contribute to the improvement of our method. To confirm its efficacy, we apply our method to the challenging real-world category-level 6D object pose estimation datasets REAL275 and HouseCat6D and exceed the current SOTA by a large margin.

Supplementary Material↩︎

6 Implementation Details↩︎

Our network is implemented in PyTorch 1.13. The backbone is based on VI-Net [21]. To obtain DINOv2 features, we initially crop the object by its bounding box from the original image and subsequently resize the crop to a resolution of \(210 \times 210\). The DINOv2 model version employed is ‘dinov2_vits14’, with a stride of 14. Consequently, the resolution of the output DINOv2 feature map is \(15 \times 15\). We randomly select 100 points from the feature map as our sampled points with DINOv2 features.
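A hedged sketch of the feature extraction is given below; the torch.hub entry point and the `forward_features` output keys follow the public DINOv2 release and are used here under that assumption.

```python
# Sketch of extracting patch-wise DINOv2 features from the 210 x 210 crop.
import torch

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()

crop = torch.rand(1, 3, 210, 210)          # RGB crop, already normalised
with torch.no_grad():
    out = model.forward_features(crop)
tokens = out['x_norm_patchtokens']         # (1, 225, 384): 15 x 15 patches at stride 14
feat_map = tokens.reshape(1, 15, 15, 384)  # patch-aligned semantic feature map
```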

For extracting geometric features, we initially randomly sample 300 points from the entire point cloud. These points serve as the basis for estimating point-wise normal vectors. To create the hierarchical panels, we then choose range parameters \((k_0, k_1, k_2,k_3, k_4, k_5, k_6) = (0, 10, 20, 40, 80, 160, 299)\).

See Tab. 4 for an overview of the number of trainable and frozen parameters of our method and VI-Net.

Table 4: Parameter count.

| Method | Trainable | Frozen |
|---|---|---|
| VI-Net | 27,311,368 | 0 |
| Ours | 33,639,561 | 22,056,576 |

7 Further Explanations of the Pipeline↩︎

Invariance vs Equivariance. Following VI-Net [21], our backbone ensures that, when the input point-wise features are approximately SE(3)-invariant, the output feature map is approximately SE(3)-equivariant. We use the term "SE(3)-invariant" to emphasize that our input features are invariant. The feature fusion process is illustrated in Fig. 6 below. The RGB values \(F_c\), the DINOv2 features \(F_s\), and the HP-PPF features \(F_g\) are our invariant input features.
1) All input features are approximately invariant: the geometric features and RGB values are inherently invariant. The DINOv2 features are approximately invariant due to their training on a large-scale dataset, ensuring consistent semantic representation. This consistency in semantic meaning, regardless of rotation/translation, implies SE(3) consistency, thus leading to approximate SE(3) invariance.
2) The output is equivariant: similar to VI-Net, our backbone transforms point-wise features with the point cloud’s 3D coordinates into a 2D feature map. These maps are then processed by ResNets that approximately maintain SE(3) equivariance (see section 3.4 and [21] for details). Consequently, when the input features are approximately SE(3) invariant, the output 2D feature map is approximately SE(3) equivariant.

Figure 6: Feature Fusion. We illustrate the fusion process with annotations of approximately equivariant and approximately invariant features.

Visualization of Feature Maps. In Fig 7, we visualize the features of the same object in two frames, each with a different pose.

Figure 7: Feature Maps We fuse features of RGB, DINOv2 and HP-PPF into a 2D feature map that is approximately SE(3)-equivariant.

“RGB",”DINOv2", and “HP-PPF" depict our input features, which are roughly SE(3) invariant.”Second(2D)" represents our post-fusion feature map, utilized for pose estimation and approximately SE(3) equivariant; we observe slight shifts in the feature map pattern upon rotation and translation. For visualization, we also present “Second(PC)", a point-wise feature obtained by projecting”Second(2D)" back onto the point cloud, using the pixel-point correspondence, and it’s also approximately invariant.

Shape of All Feature Maps and Other Intermediate Representations. Fig. 6 illustrates our input features as \(F_c \in \mathbb{R}^{n\times 3}\), \(F_g \in \mathbb{R}^{n\times c_g}\), and \(F_s \in \mathbb{R}^{n\times c_s}\). After the stream-specific modules \(\mathcal{L}_c\), \(\mathcal{L}_g\), \(\mathcal{L}_s\), our features have shape \(\mathbb{R}^{H\times W \times c}\), and after the fusion module \(\mathcal{L}_f\) the feature is again of shape \(\mathbb{R}^{H\times W \times c}\).

8 More Experimental Results on HouseCat6D↩︎

We report more metrics on HouseCat6D [58] in Tab. 5. Our approach outperforms other methods by a significant margin across all metrics. Especially on the strict metrics \(\text{IoU}_{75}\) and \(5^{\circ}\;2 \mathrm{~cm}\), SecondPose outperforms VI-Net by 22.1% and 31.0% in relative terms, respectively.

Table 5: Quantitative comparisons of different methods for category-level 6D object pose estimation on HouseCat6D [58].

| Method | IoU\(_{75}\) | \(5^{\circ}\;2\mathrm{~cm}\) | \(5^{\circ}\;5\mathrm{~cm}\) | \(10^{\circ}\;2\mathrm{~cm}\) | \(10^{\circ}\;5\mathrm{~cm}\) |
|---|---|---|---|---|---|
| FS-Net [39] | 14.8 | 3.3 | 4.2 | 17.1 | 21.6 |
| GPV-Pose [41] | 15.2 | 3.5 | 4.6 | 17.8 | 22.7 |
| VI-Net [21] | 20.4 | 8.4 | 10.3 | 20.5 | 29.1 |
| SecondPose (Ours) | 24.9 | 11.0 | 13.4 | 25.3 | 35.7 |

We present the categorical results of our experiment on HouseCat6D in Fig. 10. Our method exhibits a substantial performance advantage over VI-Net in categories such as box, can, cup, remote, teapot, and shoe. However, in other categories, namely bottle, cutlery, glass, and tube, our method shows a slightly lower performance compared to VI-Net. We note a shared characteristic among these categories: items within them typically display either high reflectivity or high transparency. Under optical conditions of this nature, DINOv2 tends to encounter difficulties in extracting meaningful semantic information.

Figure 8: Failure cases in HouseCat6D. We illustrate common failure scenarios on HouseCat6D. (A) depicts instances of transparent items; (B) showcases items with pronounced self-occlusion; (C) the tube represents items with high reflectivity; (D) illustrates failures attributed to atypical shapes.

Figure 9: Failure cases in REAL275. We illustrate common failure scenarios on REAL275. (A) represents failures due to wrong instance segmentation; (B)-(D) illustrate failures due to wrong prediction of the y-axis.

Figure 10: Categorical results on HouseCat6D. We visualize the comparison of our IoU\(_{25}\) and IoU\(_{50}\) results on HouseCat6D with those of VI-Net.

9 Failure Cases and Limitations↩︎

We present typical failure cases in both REAL275 and HouseCat6D.

The failure cases of HouseCat6D are presented in Fig. 8. There are four common failure types. (A) highlights instances involving transparent items where DINOv2 struggles to extract meaningful semantic features, leading to poorer performance on transparent items. (B) illustrates a self-occlusion scenario, complicating pose prediction due to obscured essential features like the mug handle, which is crucial for object identification and orientation. In (C), the tube represents items with high reflectivity, a condition often associated with DINOv2 failures. (D) illustrates failures attributed to atypical shapes.

The failure cases of REAL275 are presented in Fig. 9. (A) signifies failures arising from false positive detection results. Meanwhile, (B)-(D) illustrate errors related to the wrong orientation prediction of the z-axis, where we observed that on REAL275, our model tends to predict the y-axis accurately.

In summary, there are four primary failure scenarios: first, instances where DINOv2 fails to extract meaningful semantic information under specific optical conditions such as high reflectivity or high transparency; second, severe occlusions; third, items with atypical shapes; and finally, errors caused by components outside our pipeline, such as the detection frontend.

References↩︎

[1]
E. Marchand, H. Uchiyama, and F. Spindler, “Pose estimation for augmented reality: A hands-on survey,” IEEE transactions on visualization and computer graphics, vol. 22, no. 12, pp. 2633–2651, 2015.
[2]
D. J. Tan, N. Navab, and F. Tombari, “6d object pose estimation with depth images: A seamless approach for robotic interaction and augmented reality,” arXiv preprint arXiv:1709.01459, 2017.
[3]
H. Tjaden, U. Schwanecke, and E. Schomer, “Real-time monocular pose estimation of 3D objects using temporally consistent local color histograms,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct. 2017.
[4]
C. Zhang et al., “DDF-HO: Hand-held object reconstruction via conditional directed distance field,” arXiv preprint arXiv:2308.08231, 2023.
[5]
G. Zhai et al., “MonoGraspNet: 6-DoF grasping with a single RGB image,” IEEE, 2023.
[6]
G. Zhai et al., “Sg-bot: Object rearrangement via coarse-to-fine robotic imagination on scene graphs,” arXiv preprint arXiv:2309.12188, 2023.
[7]
G. Zhai et al., “DA\(^2\) dataset: Toward dexterity-aware dual-arm grasping,” RA-L, vol. 7, no. 4, pp. 8941–8948, 2022, doi: 10.1109/LRA.2022.3189959.
[8]
B. Busam, M. Esposito, S. Che’Rose, N. Navab, and B. Frisch, “A stereo vision approach for cooperative robotic movement therapy,” in Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), 2015, pp. 127–135.
[9]
Y. Di et al., “U-RED: Unsupervised 3D shape retrieval and deformation for partial point clouds,” 2023, pp. 8884–8895.
[10]
G. Zhai et al., “CommonScenes: Generating commonsense 3D indoor scenes with scene graphs,” 2023.
[11]
Y. Di, F. Manhardt, G. Wang, X. Ji, N. Navab, and F. Tombari, “SO-Pose: Exploiting self-occlusion for direct 6D pose estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 12396–12405.
[12]
Y. Labbé, J. Carpentier, M. Aubry, and J. Sivic, “CosyPose: Consistent multi-view multi-object 6D pose estimation,” in European Conference on Computer Vision (ECCV), Springer, 2020, pp. 574–591.
[13]
J. Lin, Z. Wei, C. Ding, and K. Jia, “Category-level 6D object pose and size estimation using self-supervised deep prior deformation networks,” in European Conference on Computer Vision (ECCV), Springer, 2022, pp. 19–34.
[14]
M. Tian, M. H. Ang, and G. H. Lee, “Shape prior deformation for categorical 6D object pose and size estimation,” in European Conference on Computer Vision (ECCV), Springer, 2020, pp. 530–546.
[15]
R. Zhang, Y. Di, Z. Lou, F. Manhardt, F. Tombari, and X. Ji, “RBP-Pose: Residual bounding box projection for category-level pose estimation,” arXiv preprint arXiv:2208.00237, 2022.
[16]
M. Oquab et al., “DINOv2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
[17]
J. Zhang et al., “A tale of two features: Stable diffusion complements DINO for zero-shot semantic correspondence,” arXiv preprint arXiv:2305.15347, 2023.
[18]
W. Goodwin, S. Vaze, I. Havoutis, and I. Posner, “Zero-shot category-level object pose estimation,” in European Conference on Computer Vision (ECCV), Springer, 2022, pp. 516–532.
[19]
Z. Fan et al., “POPE: 6-DoF promptable pose estimation of any object, in any scene, with one reference,” arXiv preprint arXiv:2305.15727, 2023.
[20]
V. N. Nguyen, T. Groueix, G. Ponimatkin, V. Lepetit, and T. Hodan, “CNOS: A strong baseline for CAD-based novel object segmentation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023, pp. 2134–2140.
[21]
J. Lin, Z. Wei, Y. Zhang, and K. Jia, “VI-Net: Boosting category-level 6D object pose estimation via learning decoupled rotations on the spherical representations,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023, pp. 14001–14011.
[22]
Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes,” Robotics: Science and Systems, 2018.
[23]
W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab, “SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1521–1529.
[24]
G. Wang, F. Manhardt, F. Tombari, and X. Ji, “GDR-Net: Geometry-guided direct regression network for monocular 6D object pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16611–16621.
[25]
S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao, “PVNet: Pixel-wise voting network for 6DoF pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[26]
S. Zakharov, I. Shugurov, and S. Ilic, “DPOD: 6D pose object detector and refiner,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[27]
M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel, “Implicit 3D orientation learning for 6D object detection from RGB images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 699–715.
[28]
Y. Su et al., “Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation,” 2022, pp. 6738–6748.
[29]
B. Drost, M. Ulrich, N. Navab, and S. Ilic, “Model globally, match locally: Efficient and robust 3D object recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 998–1005.
[30]
S. Hinterstoisser et al., “Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes,” IEEE, 2011, pp. 858–865.
[31]
C. Wang et al., “DenseFusion: 6D object pose estimation by iterative dense fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[32]
Y. He, W. Sun, H. Huang, J. Liu, H. Fan, and J. Sun, “PVN3D: A deep point-wise 3D keypoints voting network for 6DoF pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[33]
W. Chen, X. Jia, H. J. Chang, J. Duan, and A. Leonardis, “G2L-Net: Global to local network for real-time 6D pose estimation with embedding vector features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[34]
Y. He, H. Huang, H. Fan, Q. Chen, and J. Sun, “FFB6D: A full flow bidirectional fusion network for 6D pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 3003–3013.
[35]
Y. Wu, M. Zand, A. Etemad, and M. Greenspan, “Vote from the center: 6 DoF pose estimation in RGB-D images by radial keypoint voting,” in European Conference on Computer Vision (ECCV), Springer, 2022, pp. 335–352.
[36]
H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, “Normalized object coordinate space for category-level 6D object pose and size estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2642–2651.
[37]
S. Umeyama, “Least-squares estimation of transformation parameters between two point patterns,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 13, no. 4, pp. 376–380, 1991.
[38]
D. Chen, J. Li, Z. Wang, and K. Xu, “Learning canonical shape space for category-level 6D object pose and size estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11973–11982.
[39]
W. Chen, X. Jia, H. J. Chang, J. Duan, L. Shen, and A. Leonardis, “FS-Net: Fast shape-based network for category-level 6D object pose estimation with decoupled rotation mechanism,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 1581–1590.
[40]
J. Lin, Z. Wei, Z. Li, S. Xu, K. Jia, and Y. Li, “DualPoseNet: Category-level 6D object pose and size estimation using dual pose network with refined learning of pose consistency,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021, pp. 3560–3569.
[41]
Y. Di et al., “Gpv-pose: Category-level object pose estimation via geometry-guided point-wise voting,” 2022, pp. 6781–6791.
[42]
Y. Su et al., “Opa-3d: Occlusion-aware pixel-wise aggregation for monocular 3d object detection,” IEEE Robotics and Automation Letters, vol. 8, no. 3, pp. 1327–1334, 2023.
[43]
L. Zheng et al., “HS-Pose: Hybrid scope feature extraction for category-level object pose estimation,” arXiv preprint arXiv:2303.15743, 2023.
[44]
J. Wang, K. Chen, and Q. Dou, “Category-level 6D object pose estimation via cascaded relation and recurrent reconstruction networks,” arXiv preprint arXiv:2108.08755, 2021.
[45]
Y. Weng et al., “Captra: Category-level pose tracking for rigid and articulated objects from point clouds,” 2021, pp. 13209–13218.
[46]
M. Zaccaria, F. Manhardt, Y. Di, F. Tombari, J. Aleotti, and M. Giorgini, “Self-supervised category-level 6D object pose estimation with optical flow consistency,” IEEE Robotics and Automation Letters, vol. 8, no. 5, pp. 2510–2517, 2023.
[47]
K. Chen and Q. Dou, “SGPA: Structure-guided prior adaptation for category-level 6D object pose estimation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021, pp. 2773–2782.
[48]
H. Lin, Z. Liu, C. Cheang, Y. Fu, G. Guo, and X. Xue, “SAR-Net: Shape alignment and recovery network for category-level 6D object pose and size estimation,” arXiv preprint arXiv:2106.14193, 2022.
[49]
Z. Fan et al., “ACR-Pose: Adversarial canonical representation reconstruction network for category level 6D object pose estimation,” arXiv preprint arXiv:2111.10524, 2021.
[50]
J. Liu, Y. Chen, X. Ye, and X. Qi, “IST-Net: Prior-free category-level pose estimation with implicit space transformation,” arXiv preprint arXiv:2303.13479, 2023.
[51]
C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in neural information processing systems, vol. 30, 2017.
[52]
K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2961–2969.
[53]
X. Liu, G. Wang, Y. Li, and X. Ji, “CATRE: Iterative point clouds alignment for category-level object pose refinement,” arXiv preprint arXiv:2207.08082, 2022.
[54]
M. Z. Irshad, T. Kollar, and M. Las, “CenterSnap: Single-shot multi-object 3D shape reconstruction and categorical 6D pose and size estimation,” arXiv preprint arXiv:2203.01929, 2022.
[55]
R. Zhang, Y. Di, F. Manhardt, F. Tombari, and X. Ji, “SSP-Pose: Symmetry-aware shape prior deformation for direct category-level object pose estimation,” arXiv preprint arXiv:2208.06661, 2022.
[56]
J. Lin, Z. Wei, C. Ding, and K. Jia, “Category-level 6D object pose and size estimation using self-supervised deep prior deformation networks,” arXiv preprint arXiv:2207.05444, 2022.
[57]
J. Lin, H. Li, K. Chen, J. Lu, and K. Jia, “Sparse steerable convolutions: An efficient learning of SE(3)-equivariant features for estimation and tracking of object poses in 3D space,” in Advances in Neural Information Processing Systems (NeurIPS), 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:244117550.
[58]
H. Jung et al., “HouseCat6D–a large-scale multi-modal category level 6D object pose dataset with household objects in realistic scenarios,” arXiv preprint arXiv:2212.10428, 2022.
[59]
H. Jung et al., “HouseCat6D – a large-scale multi-modal category level 6D object pose dataset with household objects in realistic scenarios,” arXiv preprint arXiv:2212.10428, 2023.

  1. \(^*\) Equal contributions.↩︎

  2. \(^\dagger\) Corresponding author (e-mail: guangyao.zhai@tum.de).↩︎