TiCoSS: Tightening the Coupling between
Semantic Segmentation and Stereo Matching
within A Joint Learning Framework

Guanfeng Tang, Zhiyuan Wu, and Rui Fan1 2


Abstract

Semantic segmentation and stereo matching are two essential visual environment perception tasks. Addressing these two tasks with separate networks is no longer the mainstream direction in developing computer vision algorithms, particularly with the recent advances in large vision models and embodied artificial intelligence. The trend is shifting towards combining them within a joint learning framework, with a particular emphasis on feature sharing between the two tasks. In this article, we introduce three novel techniques to tighten this coupling: a tightly-coupled gated feature fusion strategy, a hierarchical deep supervision strategy, and a coupling tightening loss. The combined use of these technical contributions results in TiCoSS, a state-of-the-art joint learning framework that simultaneously tackles semantic segmentation and stereo matching. Through extensive experiments on the KITTI and vKITTI2 datasets, along with qualitative and quantitative analyses, we validate the effectiveness of our developed strategies and loss function, and demonstrate its superior performance compared to prior arts, with a notable increase in mIoU by over 9%.

semantic segmentation, stereo matching, joint learning, computer vision, artificial intelligence.

1 Introduction

1.1 Background

VISUAL environment perception (VEP) serves as a fundamental and front-end module in robotic systems [1]. Semantic segmentation and stereo matching are two key VEP tasks [2]. The former, akin to the ventral stream in our brain, provides a pixel-level understanding of the scene, while the latter, akin to the dorsal stream in our brain, mimics human binocular vision to acquire accurate depth information, essential for 3D geometry reconstruction [3]. These two tasks collaborate to deliver both contextual and geometric information, resulting in a comprehensive scene understanding that significantly enhances the capabilities of robotic systems [4].

Nevertheless, previous studies [5]–[9] address these two tasks with separate networks, which limits their potential to share informative contextual and geometric information [10]. For instance, stereo matching networks can occasionally produce ambiguous disparity estimations, particularly in texture-less and occluded regions [11]. Semantic segmentation can provide pixel-level scene understanding to disambiguate these cases, ultimately leading to more reliable disparity estimations [12]. In addition, semantic segmentation networks often struggle to distinguish clear object boundaries, particularly in complex driving scenarios, due to the lack of spatial geometric information [13].

1.2 Existing Challenges and Motivation

Although several joint learning frameworks have been proposed, the coupling between the two tasks within these frameworks often remains loose. Taking our previous work S\(^3\)M-Net [14] as an example, its feature fusion strategy essentially performs an element-wise summation between contextual and geometric feature maps at each stage. The fused feature maps are then fed directly into subsequent layers without excluding irrelevant information, resulting in loose coupling between the semantic segmentation and stereo matching tasks within the encoder.

1.3 Contributions

Therefore, in this article, we introduce Tightly-Coupled Semantic Segmentation and Stereo Matching Network (TiCoSS), an end-to-end joint learning approach that focuses primarily on improving the coupling between stereo matching and semantic segmentation, which has not been emphasized in previous related studies. Building upon our previous work S\(^3\)M-Net, our proposed TiCoSS introduces three new techniques: (1) a Tightly-Coupled, Gated Feature Fusion (TGF) strategy, which utilizes a series of Selective Inheritance Gates (SIGs) to propagate useful contextual and geometric information from the preceding layer to the current layer; (2) a Hierarchical Deep Supervision (HDS) strategy that uses the fused features with the highest resolution to guide deep supervision throughout each branch, as these features contain the most abundant local spatial details; (3) a novel Coupling Tightening (CT) loss, consisting of a widely used stereo matching loss presented in [9], the Semantic Consistency-Guided (SCG) loss in [14], a Disparity Inconsistency-Aware (DIA) loss that utilizes disparity estimations to refine segmentation results, and a Deep Supervision Consistency Constraint (DSCC) loss which employs the Kullback-Leibler (KL) divergence to improve prediction consistency across outputs from all deep supervision branches. These contributions collectively advance S\(^3\)M-Net, resulting in TiCoSS, a new, powerful, and tightly-coupled joint learning framework that enables robust and accurate simultaneous semantic segmentation and stereo matching. Extensive experiments conducted on the vKITTI2 [15] and KITTI 2015 [16] datasets unequivocally demonstrate the effectiveness of the aforementioned contributions and the superior performance of TiCoSS over other state-of-the-art (SoTA) approaches.

In summary, the main contributions of this article include:

  • The TGF strategy, which propagates useful contextual and geometric information from the preceding layer to the current layer, enabling more effective feature fusion for semantic segmentation;

  • The HDS strategy, which uses the fused features with the richest local spatial details to guide deep supervision across each branch;

  • The DIA loss and the DSCC loss that tighten the coupling between the two tasks, thereby further improving semantic segmentation performance.

1.4 Article Structure

The remainder of this article is organized as follows: Sect. 2 reviews related prior arts. Sect. 3 introduces our proposed TiCoSS. Sect. 4 compares our network with other SoTA approaches and presents the ablation studies. Finally, Sect. 5 concludes the article and discusses possible future work.

2 Literature Review

2.1 Semantic Segmentation

Semantic segmentation has been a long-standing challenge in the fields of computer vision and robotics over the past decade [3], [17]. SoTA networks generally fall into two groups [14]: (1) single-modal networks (with a single encoder) and (2) feature-fusion networks (with multiple encoders) [18]–[20]. Early efforts primarily focused on encoder-decoder architectures for pixel-level classification. Representative examples include PSPNet [5], the DeepLab series [6], [21], as well as Transformer-based networks [7], [22], [23]. The encoder extracts hierarchical, contextual feature maps from input images, while the decoder generates segmentation maps by upsampling and combining feature maps from different encoder layers. Nonetheless, these networks are limited in effectively combining heterogeneous features extracted from different sources (or modalities) of visual information, which makes it challenging to produce accurate segmentation results in scenarios with poor lighting and illumination conditions [18]. This has led researchers to focus on feature-fusion networks that can effectively fuse heterogeneous features learned from multiple sources (or modalities) of visual information. This problem is often referred to as “RGB-X semantic segmentation”, where “X” represents the additional modality (or source) of visual information, in addition to the RGB images. The most representative feature-fusion networks include convolutional neural network (CNN)-based ones, such as MFNet [24], RTFNet [25], the SNE-RoadSeg series [18], [26], [27], and S\(^3\)M-Net [14], as well as Transformer-based ones, such as OFF-Net [28] and RoadFormer [29]. In this article, we design TiCoSS based on our previously proposed S\(^3\)M-Net, with a specific emphasis on exploring more effective solutions for tighter coupling between semantic segmentation and stereo matching.

2.2 Stereo Matching

Owing to recent advancements in deep learning techniques, end-to-end deep stereo networks [8], [9], [30]–[33] have dramatically outperformed traditional explicit programming-based stereo matching algorithms. PSMNet [8] uses spatial pyramid pooling to capture multi-scale information and multiple 3D convolutional layers to exploit both local and global contexts for cost computation. GwcNet [30] builds on PSMNet by constructing the cost volume through group-wise correlation, enhancing the 3D stacked hourglass network. To address the computational demands of 3D convolutions, researchers have sought ways to balance efficiency and accuracy in stereo matching. LEA-Stereo [32], for instance, introduces the first neural architecture search (NAS) [34] framework for stereo matching. RAFT-Stereo [9], a rectified stereo matching approach built upon RAFT [35], uses a series of gated recurrent units (GRUs) to refine correlation features and enhance disparity estimation accuracy. CRE-Stereo [33] leverages recurrent refinement to update disparities in a coarse-to-fine fashion and adaptive group correlation to minimize the impact of erroneous rectification, leading to more compelling disparity estimation results. In this article, we primarily focus on improving semantic segmentation performance, and thus adopt the stereo matching approach used in our previous work, S\(^3\)M-Net.

2.3 Simultaneous Semantic Segmentation and Stereo Matching

Existing joint learning frameworks that simultaneously address these two tasks mainly focus on improving disparity accuracy by leveraging semantic information [10], [12], [36]–[39]. However, discussions on utilizing disparity information to enhance semantic segmentation at the feature level for joint learning remain limited, except for the aforementioned “RGB-X semantic segmentation”. These prior arts often require extensive annotated training data or involve complex training strategies for joint learning. For example, SegStereo [36] and DispSegNet [37] necessitate an initial unsupervised training phase on the large-scale Cityscapes [40] dataset, followed by a subsequent supervised fine-tuning on the smaller KITTI [16], [41] datasets. Similarly, the approaches introduced in [12], [38], [39] require pre-training their spatial branches for stereo matching on the large-scale SceneFlow [42] dataset before fine-tuning both semantic and spatial branches on the KITTI [16], [41] datasets. DSNet [10] adopts a different joint learning strategy by alternating the training of the semantic segmentation and stereo matching networks, with each network being frozen during the training of the other. Nevertheless, achieving simultaneous convergence of the two networks can be challenging, as shared features are not learned in an end-to-end fashion. Recently, SSNet [43] uses a single encoder to extract shareable features for both tasks. However, as demonstrated in [44], such features may not be suitable for both dense prediction and geometric vision tasks. S\(^3\)M-Net uses two separate encoding branches to achieve these two tasks simultaneously, but the weak coupling between these branches limits the integration of contextual and geometric information. Additionally, the source code (in PyTorch or TensorFlow) for these previous approaches is not publicly available, and re-implementing them carries the risk of introducing errors. In contrast to the aforementioned approaches, our proposed TiCoSS uses a tightly-coupled joint learning framework that effectively leverages both contextual and geometric information. Moreover, TiCoSS is trained in an end-to-end fashion and is capable of learning semantic segmentation and stereo matching simultaneously, even with limited training data.

3 Methodology

3.1 Framework Overview

The architecture of our proposed TiCoSS is illustrated in Fig. 1, containing three major technical contributions:

  1. A novel duplex, tightly-coupled encoder designed to selectively extract and fuse heterogeneous features, namely contextual features from RGB images and geometric features from disparity maps.

  2. A novel HDS strategy that leverages fused features with the richest local spatial details to guide deep supervision across each branch (auxiliary classifier).

  3. A CT loss that supervises the entire joint learning process and further tightens the coupling between semantic segmentation and stereo matching.

Figure 1: The architecture of our proposed TiCoSS for end-to-end joint learning of semantic segmentation and stereo matching.

3.2 Tightly-coupled Gated Fusion Strategy

S\(^3\)M-Net [14] proposes an effective joint learning framework that simultaneously performs semantic segmentation and stereo matching. Despite its impressive results, the two tasks remain loosely coupled: it merely employs the feature fusion strategy proposed in SNE-RoadSeg [18], where the geometric features extracted from disparity maps are indiscriminately fused into the contextual features extracted from RGB images via simplistic element-wise summation. The fused heterogeneous features are then treated as preceding contextual features and fed into subsequent layers without selective processing, which can mislead the semantic segmentation task. This is primarily because the deeper geometric features contain irrelevant semantic information, and as the network goes deeper, the proportion of RGB features in the decoder’s input tends to diminish [27].

Our TGF strategy is, therefore, designed to overcome this limitation by selectively complementing contextual features with useful geometric features, resulting in a tightly-coupled duplex encoder. The core of our TGF strategy is the SIGs, developed based on Gated Fully Fusion (GFF) [45], which fuses features from multiple scales using gates that control the propagation of useful information. This enables the features at each scale to be enhanced by both deeper, semantically stronger features and shallower, spatially richer features, significantly reducing noise during feature fusion. Nonetheless, GFF is primarily regarded as a post-processing technique (the multi-scale features must be yielded before being processed by GFF) and is infeasible for heterogeneous features, which are extracted and fused progressively. In contrast, our developed SIGs (see Fig. 1) selectively inherit useful features in \(\mathbf{X}_{i-1}^{G,F}\) from the previous layer into \(\mathbf{X}_{i}^{G,F}\), the features at the current layer, where \(i\in[1,n]\cap\mathbb{Z}\) denotes the layer number, and the superscripts ‘\(G\)’ and ‘\(F\)’ represent ‘geometric’ and ‘fused’ features3, respectively. Each SIG outputs \(\mathbf{\tilde{X}}_{i}^{G,F}\), the selectively inherited feature maps at the \(i\)-th layer, using the following expression: \[\begin{align} \mathbf{\tilde{X}}_{i}^{G,F} &=\Omega_{i} \left( \mathbf{X}_{i-1}^{G,F}, \mathbf{X}_{i}^{G,F} \right) = \left(\mathbf{1}_{H_i}(\mathbf{1}_{W_i})^\top+\mathbf{G}_{i}\right) \odot \mathbf{X}_{i}^{G,F} \nonumber \\ &+ \left(\mathbf{1}_{H_i}(\mathbf{1}_{W_i})^\top-\mathbf{G}_{i}\right) \odot \left [ \mathbf{G}_{i-1} \odot \mathcal{R}(\mathbf{X}_{i-1}^{G,F}) \right], \end{align}\] where \(\Omega_i\) represents the SIG operation at the \(i\)-th layer, \(\mathbf{1}_k\) denotes a column vector of ones, \(\mathbf{G}_i \in [0,1]^{H_i \times W_i}\) represents a gate map that controls feature propagation, \(\odot\) denotes element-wise multiplication broadcast along the channel dimension, and \(\mathcal{R}\) represents the remapping operation, as detailed in [14].
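To make the gating mechanism concrete, a minimal PyTorch-style sketch of an SIG is given below. The gate construction and the remapping \(\mathcal{R}\) are not fully specified above, so this sketch assumes, purely for illustration, that each gate map is produced by a \(1\times1\) convolution followed by a sigmoid and that \(\mathcal{R}\) is a \(1\times1\) convolution with bilinear resizing; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveInheritanceGate(nn.Module):
    """Minimal sketch of an SIG realizing the expression above.

    Assumptions (not specified in the article): each gate map is produced by a
    1x1 convolution followed by a sigmoid, and the remapping R(.) is a 1x1
    convolution plus bilinear resizing that aligns the previous layer's
    channels and resolution with those of the current layer.
    """

    def __init__(self, prev_channels: int, cur_channels: int):
        super().__init__()
        self.gate_prev = nn.Sequential(nn.Conv2d(prev_channels, 1, 1), nn.Sigmoid())
        self.gate_cur = nn.Sequential(nn.Conv2d(cur_channels, 1, 1), nn.Sigmoid())
        self.remap = nn.Conv2d(prev_channels, cur_channels, 1)  # hypothetical R(.)

    def forward(self, x_prev: torch.Tensor, x_cur: torch.Tensor) -> torch.Tensor:
        size = x_cur.shape[-2:]
        g_cur = self.gate_cur(x_cur)                                      # G_i
        g_prev = F.interpolate(self.gate_prev(x_prev), size=size,
                               mode="bilinear", align_corners=False)      # G_{i-1}
        remapped = F.interpolate(self.remap(x_prev), size=size,
                                 mode="bilinear", align_corners=False)    # R(X_{i-1})
        # (1 + G_i) ⊙ X_i + (1 - G_i) ⊙ [G_{i-1} ⊙ R(X_{i-1})]
        return (1.0 + g_cur) * x_cur + (1.0 - g_cur) * (g_prev * remapped)
```

The single-channel gate maps are broadcast along the channel dimension, matching the broadcasting behaviour of \(\odot\) described above.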

Based on our proposed TGF strategy, the heterogeneous feature extraction and fusion process in our duplex encoder can be formulated as follows: \[\begin{align} &\mathbf{F}_{i}^{G}=\left\{ \begin{array}{ll} \mathcal{E}_{i}^{G}(\mathbf{D}^L), & i=1\\ & \\ \Omega_{i}^{G}(\mathbf{F}_{i-1}^{G}\,,\,\mathcal{E}_i^G(\mathbf{F}_{i-1}^G)), & 1<i\le n \end{array} \right. \\ \text{and} \notag \\ &\mathbf{F}_{i}^{F}=\left\{ \begin{array}{ll} \mathcal{E}_{i}^{F}(\mathbf{F}_i^C)\:\oplus \:\mathbf{F}_i^G, & i=1\\ & \\ \Omega_{i}^{F}(\mathbf{F}_{i-1}^{F}\,,\,\mathcal{R}(\mathbf{F}_{2i-1}^C))\:\oplus\: \mathbf{F}_i^G, & 1<i\le \frac{n+1}{2} \\ & \\ \Omega_{i}^{F}(\mathbf{F}_{i-1}^{F}\,,\,\mathcal{E}_i^F(\mathbf{F}_{i-1}^F))\:\oplus\: \mathbf{F}_i^G, & \frac{n+1}{2}<i\le n \end{array}, \right. \end{align}\] where \(\mathbf{D}^L\) denotes the estimated disparity map, \(\mathbf{F}^{C}_i\), \(\mathbf{F}^{G}_i\), and \(\mathbf{F}^{F}_i\) respectively represent the extracted contextual feature maps in the left view, geometric feature maps, and fused feature maps at the \(i\)-th layer, \(\mathcal{E}^G_i\) and \(\mathcal{E}^F_i\) denote the geometric and fused features encoding operations at the \(i\)-th layer, respectively, \(\oplus\) represents the element-wise summation. Considering that the shallower features between the two tasks have similar numbers of channels, we make the contextual feature maps of the first three layers share weights with the feature maps extracted from the stereo matching network in our practical implementation. Our proposed TGF strategy selectively propagates features to subsequent layers, reducing indiscriminate heterogeneous feature fusion which can severely mislead semantic segmentation as the network goes deeper, thus achieving tighter coupling between these two relevant environmental perception tasks.
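For clarity, the layer-wise propagation above can be sketched as a simple loop; the weight sharing of the first three layers and the index mapping \(\mathcal{R}(\mathbf{F}_{2i-1}^C)\) are omitted for brevity, and `enc_g`, `enc_f`, `sig_g`, and `sig_f` are hypothetical lists of encoder stages and SIG modules (see the sketch above).

```python
def duplex_encoder_forward(disparity, rgb, enc_g, enc_f, sig_g, sig_f):
    """Hedged sketch of the duplex encoder: geometric features from the left
    disparity map and fused features from the RGB branch are propagated layer
    by layer through SIGs and combined by element-wise summation."""
    feats_g, feats_f = [], []
    for i in range(len(enc_g)):
        if i == 0:
            f_g = enc_g[0](disparity)      # geometric stream, seeded by D^L
            f_f = enc_f[0](rgb) + f_g      # fused stream, ⊕ with the geometric stream
        else:
            f_g = sig_g[i](feats_g[-1], enc_g[i](feats_g[-1]))
            f_f = sig_f[i](feats_f[-1], enc_f[i](feats_f[-1])) + f_g
        feats_g.append(f_g)
        feats_f.append(f_f)
    return feats_f
```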

3.3 Hierarchical Deep Supervision

After tightening the coupling between semantic segmentation and stereo matching during the feature encoding stage, we turn our focus towards the feature decoding aspect. We first revisit the deep supervision strategies employed in SNE-RoadSeg+ [26] and UNet 3+ [46]. The former applies deep supervision to the decoded features with the highest resolution, whereas the latter achieves this on the deepest decoded features at each resolution. Despite the effectiveness of these two prior approaches in addressing challenges such as vanishing gradients and slow model convergence, the deep supervision strategies employed by them are not comprehensive (the former emphasizes enhancing local details, while the latter focuses on improving multi-scale feature consistency). Ideally, they should be used in conjunction to complement each other for improved results. Therefore, our proposed HDS strategy combines the strengths of both methods and demonstrates superior performance compared to each individually.

A straightforward way to combine the strengths of these two methods is to apply deep supervision simultaneously to both the main and side branches, enabling the network to leverage features from shallow layers (containing rich local details) and deep layers (being semantically strong). However, this strategy does not fully exploit the potential complementarity between the main and side auxiliary classifiers, since the decoded features are exclusively derived from adjacently-connected ones. To improve their interactions, we utilize the decoded, fused feature maps \(\mathbf{F}_{1}^{F}\) (containing rich, fine-grained local spatial details that are essential for semantic segmentation) at the highest resolution to guide the feature decoding process in the side branches. Specifically, for the \(l\)-th auxiliary classifier within the side branch, we utilize a Feature Dynamic Alignment (FDA) block, which is composed of \(l\) downsampling units to progressively align channel dimensions and spatial resolutions between a pair of features at different layers. Each downsampling unit comprises a \(3\times3\) convolutional layer with a stride of 2, followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) activation layer. Compared to the simple max-pooling operation which may disrupt the original feature representation, our FDA achieves a smoother feature alignment. The feature maps obtained by downsampling \(\mathbf{F}_{1}^{F}\) are then concatenated with the deepest decoded features at the corresponding layers. This downsampling output not only guides the decoding process but also serves as the input for the subsequent downsampling unit, thereby preserving fine-grained local details to the greatest extent possible. Since the outputs of the side branches are obtained by directly upsampling the deepest features at each layer, this strategy does not significantly impact training efficiency and memory usage. Moreover, the auxiliary classifiers of the main and side branches collaboratively provide additional pathways for gradients to propagate more efficiently from the output layers to their corresponding layers, thereby accelerating the convergence of our model.
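The FDA block described above admits a straightforward sketch, shown below under the assumption (not stated in the text) of freely chosen per-unit channel widths; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class FeatureDynamicAlignment(nn.Module):
    """Sketch of an FDA block: a chain of downsampling units, each a 3x3
    convolution with stride 2 followed by batch normalization and ReLU."""

    def __init__(self, in_channels, out_channels_per_unit):
        super().__init__()
        units, c_in = [], in_channels
        for c_out in out_channels_per_unit:
            units.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True)))
            c_in = c_out
        self.units = nn.ModuleList(units)

    def forward(self, f1_fused: torch.Tensor) -> list:
        # Each intermediate output guides the side branch at its resolution and
        # also feeds the next downsampling unit, preserving fine-grained details.
        outputs, x = [], f1_fused
        for unit in self.units:
            x = unit(x)
            outputs.append(x)
        return outputs
```

Each returned feature map would then be concatenated with the deepest decoded features at the corresponding layer, as described above.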

3.4 Coupling Tightening Loss for Multi-task Joint Learning

Unlike loss functions designed solely for either semantic segmentation or stereo matching, the loss function employed in our joint learning framework supervises both tasks simultaneously [14]. Our proposed CT loss is expressed as follows: \[\mathcal{L}_{CT} = \mathcal{L}_{DIA} + \mathcal{L}_{DSCC}+\mathcal{L}_{SCG}+\mathcal{L}_{SM}, \label{eq:l_ct}\tag{1}\] where the DIA loss \(\mathcal{L}_{DIA}\) penalizes disparity inconsistency, the DSCC loss \(\mathcal{L}_{DSCC}\) measures the segmentation consistency across the outputs of all deep supervision branches, \(\mathcal{L}_{SCG}\) denotes the SCG loss proposed in [14], and \(\mathcal{L}_{SM}\), a widely used loss function that supervises the training of the stereo matching network, is detailed in [14].

3.4.1 Disparity Inconsistency-Aware Loss

In our previous work [14], the developed SCG loss focuses mainly on regions with semantic inconsistencies caused by occlusion and reflection. Given that relying solely on semantic consistency cannot sufficiently tighten the coupling between these two tasks at the output level, we define a weight matrix \(\mathbf{W} \in \mathbb{R} ^{H\times W}\) drawing on the concept of left-right consistency in stereo matching [11], where \[\mathbf{W}(\mathbf{p})=\, \mathbf{D}^L(\mathbf{p})-\mathbf{D}^R\!\left(\mathbf{p}-\left[\mathbf{D}^L(\mathbf{p}), 0\right]^\top\right) \label{eq:weight_matrix}\tag{2}\] denotes the weight at a given pixel \(\mathbf{p}\), and \(\mathbf{D}^L\) as well as \(\mathbf{D}^R\) represent the left and right disparity maps, respectively. A normalized weight matrix \(\mathbf{W}^{N}\in \mathbb{R} ^{H\times W}\) is then obtained, where \[\mathbf{W}^{N}(\mathbf{p})= \frac{1}{1+e^{-\left | \mathbf{W}(\mathbf{p}) \right |}}\in[0,1]. \label{eq:w_n}\tag{3}\] A higher normalized weight corresponds to a greater inconsistency between a given pair of left and right disparities, indicating the need for more attention during the training of a semantic segmentation model. Our proposed DIA loss is, therefore, formulated as follows: \[\mathcal{L}_{DIA}=\alpha\sum_{i=1}^{n} \left \{ -\frac{1}{N}\sum_{\boldsymbol{p}}\sum_{k=1}^{C}\left [ \mathbf{W}^N(\boldsymbol{p})\,y_k(\boldsymbol{p})\,\mathrm{log}\left(\hat{y}_k^{\,i}(\boldsymbol{p})\right) \right ] \right \}, \label{eq:dia}\tag{4}\] where \(n\) denotes the number of deep supervision branches, \(N\) represents the total number of pixels, \(C\) denotes the total number of classes, \(\hat{y}_{k}^{\,i}(\boldsymbol{p})\) represents the probability of pixel \(\boldsymbol{p}\) belonging to class \(k\), as predicted by the \(i\)-th branch, \(y_{k}(\boldsymbol{p})\) denotes the ground-truth label for \(\boldsymbol{p}\) in class \(k\), and the weight factor \(\alpha\) is determined through the extensive ablation studies detailed in Sect. 4.
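A hedged PyTorch sketch of the DIA loss is given below; the warping of the right disparity map and the handling of invalid pixels are simplifications, and the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def dia_loss(logits_list, labels, disp_left, disp_right, alpha):
    """Sketch of Eqs. (2)-(4): weight the per-pixel cross-entropy of every deep
    supervision branch by the left-right disparity inconsistency."""
    B, _, H, W = disp_left.shape
    device, dtype = disp_left.device, disp_left.dtype
    xs = torch.arange(W, device=device, dtype=dtype).view(1, 1, 1, W).expand(B, 1, H, W)
    ys = torch.arange(H, device=device, dtype=dtype).view(1, 1, H, 1).expand(B, 1, H, W)
    # Sample D^R at p - (D^L(p), 0), i.e. shift horizontally by the left disparity.
    grid_x = (xs - disp_left) / (W - 1) * 2 - 1
    grid_y = ys / (H - 1) * 2 - 1
    grid = torch.stack([grid_x.squeeze(1), grid_y.squeeze(1)], dim=-1)  # B x H x W x 2
    disp_right_warped = F.grid_sample(disp_right, grid, mode="bilinear",
                                      align_corners=True)
    # Eqs. (2)-(3): left-right inconsistency squashed by a sigmoid.
    w = torch.sigmoid((disp_left - disp_right_warped).abs()).squeeze(1)  # B x H x W
    loss = 0.0
    for logits in logits_list:                                  # one term per branch
        ce = F.cross_entropy(logits, labels, reduction="none")  # pixel-wise CE
        loss = loss + (w * ce).mean()
    return alpha * loss
```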

3.4.2 Deep Supervision Consistency Constraint Loss

In prior research [26], [46], the relationship among prediction maps generated by different auxiliary classifiers was not considered, which may lead to semantic inconsistencies across scales. To address this issue, we draw inspiration from [47] to design the following DSCC loss, which utilizes the KL divergence to measure the prediction differences across scales: \[\mathcal{L}_{DSCC} = \beta \sum_{r=1}^{L} \sum_{\substack{s=1 \\ s \neq r}}^{L}\left [ \frac{1}{N}\sum_{\boldsymbol{p}}\sum_{k=1}^{C}\hat{y}_k^{\,r}(\boldsymbol{p}) \log\frac{\hat{y}_k^{\,r}(\boldsymbol{p})}{\hat{y}_k^{\,s}(\boldsymbol{p})} \right ] , \label{eq:dscc}\tag{5}\] where \(L\) denotes the number of auxiliary classifiers, \(r\) and \(s\) index them, \(\hat{y}_k^{\,r}(\boldsymbol{p})\) denotes the probability of pixel \(\boldsymbol{p}\) belonging to class \(k\), as predicted by the \(r\)-th classifier, and \(\beta\) is a scaling factor determined through the extensive ablation studies presented in Sect. 4. By minimizing the DSCC loss, we enhance the consistency of the predictions and the informativeness of the features at different resolutions, thereby significantly improving semantic segmentation performance across scales.
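A minimal sketch of the DSCC loss is shown below, assuming all auxiliary outputs have already been upsampled to a common resolution; the function name is hypothetical.

```python
import torch.nn.functional as F

def dscc_loss(logits_list, beta):
    """Sketch of Eq. (5): pairwise KL divergence between the softmax outputs of
    the L auxiliary classifiers, averaged over pixels."""
    probs = [F.softmax(x, dim=1) for x in logits_list]
    logs = [F.log_softmax(x, dim=1) for x in logits_list]
    loss = 0.0
    for r in range(len(probs)):
        for s in range(len(probs)):
            if r == s:
                continue
            # KL(p_r || p_s), summed over classes and averaged over pixels
            loss = loss + (probs[r] * (logs[r] - logs[s])).sum(dim=1).mean()
    return beta * loss
```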

Figure 2: Qualitative experimental results achieved by the SoTA semantic segmentation networks on the KITTI 2015 [16] dataset.

Figure 3: Qualitative experimental results achieved by the SoTA semantic segmentation networks on the vKITTI2 [15] dataset.

Figure 4: Qualitative experimental results obtained under adverse weather conditions on the vKITTI2 [15] dataset, achieved by the SoTA semantic segmentation networks.

Figure 5: Qualitative experimental results on the KITTI 2015 [16] dataset achieved by the SoTA stereo matching networks.

Figure 6: Qualitative experimental results on the vKITTI2 [15] dataset achieved by the SoTA stereo matching networks.

4 Experiments

In this article, we conduct extensive experiments to evaluate the performance of our introduced TiCoSS both quantitatively and qualitatively. The following subsections detail the datasets used, the experimental setup, the evaluation metrics, and a comprehensive evaluation of our proposed method.

4.1 Datasets

Since our network training requires both semantic and disparity annotations, we utilize the following two public datasets to conduct the experiments:

  • The vKITTI2 [15] dataset contains virtual replicas (providing 15 semantic classes) of five sequences from the KITTI dataset. Dense ground-truth disparity maps are obtained through depth rendering using a virtual engine. Following [14], we employ 700 stereo image pairs, along with their semantic and disparity annotations, to evaluate the effectiveness and robustness of our proposed TiCoSS, where 500 pairs are utilized for model training and the remaining 200 pairs are used for model validation.

  • The KITTI 2015 [16] dataset contains 400 stereo image pairs captured in real-world driving scenarios. Half of these pairs have ground truth annotations, while the remaining half do not. This dataset provides 19 semantic classes (consistent with those in the Cityscapes [40] dataset). Sparse ground-truth disparity maps are obtained using a Velodyne HDL-64E LiDAR. In our experiments, we split the data with a 7:3 ratio for training and testing, respectively.

4.2 Experimental Setup

Our experiments are conducted on an NVIDIA RTX 4090 GPU with a batch size of 1. We set the maximum disparity search range to 192 pixels. All images are cropped to 512 \(\times\) 256 pixels before being processed by the network. We utilize the AdamW [48] optimizer for model training, with epsilon and weight decay set to \(10^{-5}\) and \(10^{-8}\), respectively. The initial learning rate is set to \(2\times 10^{-4}\). Training lasts for 100,000 iterations on the vKITTI2 dataset and 20,000 iterations on the KITTI 2015 dataset. Standard data augmentation techniques are applied to improve model robustness.
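For reference, the optimizer configuration above can be written in PyTorch as follows; the model is a placeholder module, and the epsilon and weight decay values mirror those stated in this subsection.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, kernel_size=3, padding=1)  # placeholder for the actual network
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, eps=1e-5, weight_decay=1e-8)
```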

4.3 Evaluation Metrics

Following [14], we quantify the performance of semantic segmentation using seven metrics: (1) accuracy (Acc), (2) mean accuracy (mAcc), (3) precision (Pre), (4) recall (Rec), (5) mean F1-score (mFSc), (6) mean intersection over union (mIoU), and (7) frequency-weighted intersection over union (fwIoU) [49]. Moreover, we quantify the performance of stereo matching using two metrics: (1) average end-point error (EPE) and (2) percentage of error pixels (PEP) at tolerance levels of 1.0 and 3.0 pixels, respectively.
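The two stereo matching metrics can be computed as in the following sketch, where only pixels with valid ground-truth disparities are evaluated; the function name is hypothetical.

```python
import torch

def epe_and_pep(pred_disp, gt_disp, valid_mask, thresholds=(1.0, 3.0)):
    """Average end-point error (EPE) and percentage of error pixels (PEP) at the
    given tolerance levels, evaluated on valid ground-truth pixels only."""
    err = (pred_disp - gt_disp).abs()[valid_mask]
    epe = err.mean().item()
    pep = {t: (err > t).float().mean().item() * 100.0 for t in thresholds}
    return epe, pep
```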


Table 1: Quantitative comparisons of SoTA semantic segmentation networks on the KITTI 2015 [16] dataset. The best results are shown in bold font. The symbol \(\uparrow\) indicates that a higher value corresponds to better performance.
Networks Publication Type Acc (%) \(\uparrow\) mAcc (%) \(\uparrow\) mIoU (%) \(\uparrow\) fwIoU (%) \(\uparrow\) Pre (%) \(\uparrow\) Rec (%) \(\uparrow\) mFSc (%) \(\uparrow\)
PSPNet [5] CVPR’17 Single-Modal 80.03 44.97 38.15 68.62 79.29 82.66 79.59
DeepLabv3+ [21] ECCV’18 82.33 50.15 42.79 72.43 83.85 87.18 84.59
BiSeNet V2 [50] IJCV’21 73.68 41.66 32.71 60.55 68.35 81.79 72.37
Segmenter [22] ICCV’21 84.53 50.77 43.63 74.72 82.99 87.15 84.41
SegFormer [7] NeurIPS’21 88.28 59.23 51.39 80.53 87.15 90.85 88.46
Mask2Former [23] CVPR’22 84.35 54.33 45.87 75.56 84.74 89.12 85.92
MFNet [24] IROS’17 Feature-Fusion 81.02 48.13 40.70 70.42 82.85 85.73 82.36
RTFNet [25] RAL’19 71.61 39.26 30.35 57.98 69.52 85.16 74.28
SNE-RoadSeg [18] ECCV’20 79.46 51.91 41.56 69.22 81.45 87.05 82.91
OFF-Net [28] ICRA’22 75.84 40.13 33.13 64.02 77.48 72.19 70.62
RoadFormer [29] TIV’24 90.05 62.34 55.13 83.40 91.65 91.39 91.11
S\(^3\)M-Net [14] TIV’24 90.66 65.90 57.80 84.53 90.85 93.55 91.80
TiCoSS (Ours) / Feature-Fusion 91.95 72.28 63.91 86.82 92.35 94.24 92.90


Table 2: Quantitative comparisons of SoTA semantic segmentation networks on the vKITTI2 [15] dataset. The best results are shown in bold font. The symbol \(\uparrow\) indicates that a higher value corresponds to better performance.
Networks Publication Type Acc (%) \(\uparrow\) mAcc (%) \(\uparrow\) mIoU (%) \(\uparrow\) fwIoU (%) \(\uparrow\) Pre (%) \(\uparrow\) Rec (%) \(\uparrow\) mFSc (%) \(\uparrow\)
PSPNet [5] CVPR’17 Single-Modal 76.26 53.53 44.81 69.30 81.55 79.68 75.38
DeepLabv3+ [21] ECCV’18 92.19 63.15 56.90 87.15 89.00 92.71 90.16
BiSeNet V2 [50] IJCV’21 81.77 51.07 44.45 74.71 83.23 82.19 80.67
Segmenter [22] ICCV’21 90.39 60.33 52.99 83.47 88.05 87.89 87.70
SegFormer [7] NeurIPS’21 94.75 70.56 64.98 90.49 93.57 93.62 93.46
Mask2Former [23] CVPR’22 89.29 64.58 57.14 83.84 90.75 87.23 87.19
MFNet [24] IROS’17 Feature-Fusion 76.22 51.50 43.41 68.82 82.46 78.65 73.80
RTFNet [25] RAL’19 85.22 49.47 42.59 77.69 83.74 89.17 84.41
SNE-RoadSeg [18] ECCV’20 83.64 60.85 52.56 75.14 83.44 81.66 77.77
OFF-Net [28] ICRA’22 90.84 61.51 55.27 84.69 89.24 86.71 86.15
RoadFormer [29] TIV’24 97.54 86.58 80.83 95.34 96.99 96.86 96.91
S\(^3\)M-Net [14] TIV’24 98.32 88.24 84.18 96.98 98.37 98.28 98.31
TiCoSS (Ours) / Feature-Fusion 98.69 91.66 88.46 97.58 98.55 98.67 98.57


Table 3: Comparisons of SoTA stereo matching networks on the KITTI 2015 [16] and vKITTI2 [15] datasets. The symbol \(\downarrow\) indicates that a lower value corresponds to better performance. The best results are shown in bold font.
Networks Publications KITTI 2015 [16] vKITTI2 [15]
EPE (pixels) \(\downarrow\) PEP >1 pixel (%) \(\downarrow\) PEP >3 pixels (%) \(\downarrow\) EPE (pixels) \(\downarrow\) PEP >1 pixel (%) \(\downarrow\) PEP >3 pixels (%) \(\downarrow\)
PSMNet [8] CVPR’18 0.74 16.12 2.61 0.68 10.31 3.77
LEA-Stereo [32] NeurIPS’20 0.83 18.67 3.22 0.83 13.33 4.84
RAFT-Stereo [9] 3DV’21 0.60 10.78 1.96 0.40 5.88 2.67
CRE-Stereo [33] CVPR’22 0.92 19.68 3.35 0.63 10.35 3.90
ACVNet [51] CVPR’22 0.68 13.93 2.10 0.61 9.41 3.45
PCW-Net [52] ECCV’22 0.70 14.81 2.43 0.63 9.45 3.49
IGEV-Stereo [53] CVPR’23 0.62 12.15 1.99 0.47 7.15 3.09
S\(^3\)M-Net TIV’24 0.55 10.02 1.62 0.38 5.56 2.55
TiCoSS (Ours) / 0.53 10.09 1.58 0.36 5.49 2.58

4.4 Comparison with State-of-the-art Methods

4.4.1 Semantic Segmentation Performance

The qualitative experimental results on the KITTI and vKITTI2 datasets are presented in Figs. 2 and 3, respectively, while the quantitative experimental results on these two datasets are given in Tables 1 and 2, respectively.

These results suggest that TiCoSS outperforms all other SoTA single-modal and feature-fusion networks (including both CNN-based and Transformer-based methods) across all evaluation metrics on both datasets. Specifically, compared with S\(^3\)M-Net, the SoTA joint learning method, TiCoSS demonstrates substantial improvements on the KITTI dataset, achieving increases of \(9.68\%\) in mAcc, \(10.57\%\) in mIoU, \(2.71\%\) in fwIoU, and \(1.20\%\) in mFSc, respectively. Similarly, on the vKITTI2 dataset, it outperforms other networks across all evaluation metrics, with improvements of \(3.88\%\) in mAcc, \(5.08\%\) in mIoU, \(0.62\%\) in fwIoU, and \(0.26\%\) in mFSc, respectively. Particularly, as observed in Figs. 2 and 3, TiCoSS achieves more accurate predictions on distant regions as well as object boundaries and is capable of providing more fine-grained semantic segmentation details compared to S\(^3\)M-Net.

4.4.2 Stereo Matching Performance

The qualitative experimental results on the KITTI and vKITTI2 datasets are illustrated in Figs. 5 and 6, respectively, while the quantitative experimental results on these two datasets are provided in Table 3. Since the primary focus of this study is on improving semantic segmentation, the stereo matching performance of TiCoSS is only slightly better than that of S\(^3\)M-Net. Specifically, compared to S\(^3\)M-Net, TiCoSS demonstrates improvements of \(3.64\%\) in EPE and \(2.47\%\) in PEP 3.0 on the KITTI dataset. Additionally, on the vKITTI2 dataset, it achieves improvements of \(5.26\%\) in EPE and \(1.26\%\) in PEP 1.0. These improvements can be attributed to the tighter coupling between the two tasks, which yields more comprehensive geometric features enriched with informative contextual information compared to S\(^3\)M-Net. Additionally, by minimizing the CT loss, our model focuses more on areas with inconsistent disparities and achieves improved performance in occluded regions, as depicted in Fig. 6.


Table 4: Ablation study on our HDS strategy on the KITTI 2015 [16] dataset. “Baseline”: S\(^3\)M-Net [14] enhanced by our TGF strategy. The symbol \(\uparrow\) indicates that a higher value corresponds to better performance. The best results are shown in bold font.
Methods mIoU (%) \(\uparrow\) fwIoU (%) \(\uparrow\)
Baseline 59.06 84.72
Baseline + SDS [26] 60.01 84.79
Baseline + FDS [46] 60.62 85.20
Baseline + SDS + FDS 60.86 85.59
Baseline + HDS (Ours) 62.36 86.33

Figure 7: The selection of hyper-parameters \(\alpha\) and \(\beta\) within the CT loss on the KITTI 2015 [16] dataset.


Table 5: Ablation study on the selection of the guidance features employed in our HDS strategy on the KITTI 2015 [16] dataset. “Feature Layer”: the layer index of the guidance feature, with options of 1, 2, and 3. “GF”: geometric features; “CF”: contextual features; “FF”: fused features, obtained by performing an element-wise summation of the geometric and contextual features. The symbol \(\uparrow\) indicates that a higher value corresponds to better performance. The best results are shown in bold font.
Feature Layer Method mIoU (%) \(\uparrow\) fwIoU (%) \(\uparrow\)
1 2 3 GF CF FF
61.74 86.88
62.09 86.48
62.36 86.33
61.18 85.52
61.12 86.85
62.11 86.31
56.67 83.68
59.05 84.11
60.26 85.60


Table 6: Ablation study to validate the effectiveness of the three semantic segmentation losses within our CT loss on the KITTI 2015 [16] dataset. The symbol \(\uparrow\) indicates that a higher value corresponds to better performance. The best results are shown in bold font.
DIA DSCC SCG mIoU (%) \(\uparrow\) fwIoU (%) \(\uparrow\)
62.36 86.33
63.04 86.48
62.88 85.98
62.75 86.66
63.64 86.5
63.15 86.77
62.99 86.59
63.91 86.82


Table 7: Ablation study on the three contributions on the KITTI 2015 [16] dataset. The symbol \(\uparrow\) indicates that a higher value corresponds to better performance. The best results are shown in bold font.
TGF HDS CT loss mIoU (%) \(\uparrow\) mFSc (%) \(\uparrow\)
59.06 91.26
58.49 92.10
62.36 92.74
61.18 92.46
63.91 92.90

5 Conclusion and Future Work

This article introduced TiCoSS, a novel joint learning framework designed to tighten the coupling between semantic segmentation and stereo matching tasks. We have made three significant contributions in this work: (1) an effective feature fusion strategy that complements the contextual features with useful geometric features through a series of selective inheritance gates, enabling a tightly-coupled duplex encoder, (2) a novel hierarchical deep supervision strategy that improves the interactions among auxiliary classifiers, and (3) a coupling tightening joint learning loss that focuses on further tightening the coupling of these two tasks at the output level. The effectiveness of each contribution was validated through extensive experiments on both virtual and real-world datasets. Comprehensive comparisons with other state-of-the-art algorithms unequivocally demonstrate the superiority of TiCoSS.

References

[1] Ö. Ciftcioglu et al., “Studies on visual perception for perceptual robotics,” in International Conference on Robotics and Automation (ICRA), 2006, vol. 2, pp. 352–359.
[2] S. Hua et al., “Pseudo segmentation for semantic information-aware stereo matching,” IEEE Signal Processing Letters, vol. 29, pp. 837–841, 2022.
[3] R. Fan et al., Autonomous Driving Perception. Springer, 2023.
[4] S. Luo et al., “Delving into multi-modal multi-task foundation models for road scene understanding: From learning paradigm perspectives,” IEEE Transactions on Intelligent Vehicles, pp. 1–25, 2024, doi: 10.1109/TIV.2024.3406372.
[5] H. Zhao et al., “Pyramid scene parsing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2881–2890.
[6] L.-C. Chen et al., “Rethinking atrous convolution for semantic image segmentation,” CoRR, 2017.
[7] E. Xie et al., “SegFormer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems (NeurIPS), vol. 34, pp. 12077–12090, 2021.
[8] J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5410–5418.
[9] L. Lipson et al., “RAFT-Stereo: Multilevel recurrent field transforms for stereo matching,” in International Conference on 3D Vision (3DV), 2021, pp. 218–227.
[10] W. Zhan et al., “DSNet: Joint learning for scene segmentation and disparity estimation,” in International Conference on Robotics and Automation (ICRA), 2019, pp. 2946–2952.
[11] R. Fan et al., “Road surface 3D reconstruction based on dense subpixel disparity map estimation,” IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 3025–3035, 2018.
[12] Z. Wu et al., “Semantic stereo matching with pyramid cost volumes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7484–7493.
[13] V. Ramanishka et al., “Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7699–7707.
[14] Z. Wu et al., “S\(^3\)M-Net: Joint learning of semantic segmentation and stereo matching for autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 2, pp. 3940–3951, 2024, doi: 10.1109/TIV.2024.3357056.
[15] Y. Cabon et al., “Virtual KITTI 2,” CoRR, 2020.
[16] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3061–3070.
[17] U. Michieli et al., “Adversarial learning and self-teaching techniques for domain adaptation in semantic segmentation,” IEEE Transactions on Intelligent Vehicles, vol. 5, no. 3, pp. 508–518, 2020.
[18] R. Fan et al., “SNE-RoadSeg: Incorporating surface normal information into semantic segmentation for accurate freespace detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 340–356.
[19] Y. Yang et al., “On exploring shape and semantic enhancements for RGB-X semantic segmentation,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 2223–2235, 2024, doi: 10.1109/TIV.2023.3296219.
[20] J. Fan et al., “MLFNet: Multi-level fusion network for real-time semantic segmentation of autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 756–767, 2022.
[21] L.-C. Chen et al., “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[22] R. Strudel et al., “Segmenter: Transformer for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 7262–7272.
[23] B. Cheng et al., “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 1290–1299.
[24] Q. Ha et al., “MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[25] Y. Sun et al., “RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2576–2583, 2019.
[26] H. Wang et al., “SNE-RoadSeg+: Rethinking depth-normal translation and deep supervision for freespace detection,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 1140–1145.
[27] Y. Feng et al., “SNE-RoadSegV2: Advancing heterogeneous feature fusion and fallibility awareness for freespace detection,” CoRR, 2024.
[28] C. Min et al., “ORFD: A dataset and benchmark for off-road freespace detection,” in International Conference on Robotics and Automation (ICRA), 2022, pp. 2532–2538.
[29] J. Li et al., “RoadFormer: Duplex transformer for RGB-normal semantic road scene parsing,” IEEE Transactions on Intelligent Vehicles, pp. 1–10, 2024, doi: 10.1109/TIV.2024.3388726.
[30] X. Guo et al., “Group-wise correlation stereo network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3273–3282.
[31] H. Xu and J. Zhang, “AANet: Adaptive aggregation network for efficient stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1959–1968.
[32] X. Cheng et al., “Hierarchical neural architecture search for deep stereo matching,” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 22158–22169, 2020.
[33] J. Li et al., “Practical stereo matching via cascaded recurrent network with adaptive correlation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 16263–16272.
[34] T. Elsken et al., “Neural architecture search: A survey,” Journal of Machine Learning Research, vol. 20, no. 1, pp. 1997–2017, 2019.
[35] Z. Teed and J. Deng, “RAFT: Recurrent all-pairs field transforms for optical flow,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 402–419.
[36] G. Yang et al., “SegStereo: Exploiting semantic information for disparity estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 636–651.
[37] J. Zhang et al., “DispSegNet: Leveraging semantics for end-to-end learning of disparity estimation from stereo imagery,” IEEE Robotics and Automation Letters, vol. 4, pp. 1162–1169, 2019.
[38] P. L. Dovesi et al., “Real-time semantic stereo matching,” in International Conference on Robotics and Automation (ICRA), 2020, pp. 10780–10787.
[39] S. Chen et al., “SGNet: Semantics guided deep stereo matching,” in Proceedings of the Asian Conference on Computer Vision (ACCV), 2020, pp. 106–122.
[40] M. Cordts et al., “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3213–3223.
[41] A. Geiger et al., “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3354–3361.
[42] N. Mayer et al., “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4040–4048.
[43] D. Jia et al., “SSNet: A joint learning network for semantic segmentation and disparity estimation,” The Visual Computer, pp. 1–13, 2024.
[44] C.-W. Liu et al., “Playing to vision foundation model’s strengths in stereo matching,” CoRR, 2024.
[45] X. Li et al., “Gated fully fusion for semantic segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020, vol. 34, pp. 11418–11425.
[46] H. Huang et al., “UNet 3+: A full-scale connected UNet for medical image segmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1055–1059.
[47] D. Li and Q. Chen, “Dynamic hierarchical mimicking towards consistent optimization objectives,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7642–7651.
[48] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), 2019.
[49] H. Zhao et al., “Open vocabulary scene parsing,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2002–2010.
[50] C. Yu et al., “BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation,” International Journal of Computer Vision, vol. 129, pp. 3051–3068, 2021.
[51] G. Xu et al., “Attention concatenation volume for accurate and efficient stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12981–12990.
[52] Z. Shen et al., “PCW-Net: Pyramid combination and warping cost volume for stereo matching,” in Proceedings of the European Conference on Computer Vision (ECCV), 2022, pp. 280–297.
[53] G. Xu et al., “Iterative geometry encoding volume for stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 21919–21928.

  1. This research was supported by the National Natural Science Foundation of China under Grant 62233013, the Science and Technology Commission of Shanghai Municipal under Grant 22511104500, the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100), the Fundamental Research Funds for the Central Universities, and the Xiaomi Young Talents Program. (Corresponding author: Rui Fan.)

  2. Guanfeng Tang, Zhiyuan Wu, and Rui Fan are with the College of Electronics & Information Engineering, Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai Institute of Intelligent Science and Technology, Tongji University, the State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, Tongji University, Shanghai 201804, China (e-mails: {gftang, gwu, rfan}@tongji.edu.cn).

  3. It is worth noting here that we use \(\mathbf{X}_{i}^{F}\) instead of \(\mathbf{X}_{i}^{C}\), where the superscript ‘\(C\)’ denotes ‘contextual’ features, primarily because the branch processing the RGB images progressively fuses the geometric features extracted from the disparity maps to obtain the fused features.