November 17, 2023
Supervised learning algorithms based on Convolutional Neural Networks have become the benchmark for medical image segmentation tasks, but their effectiveness heavily relies on a large amount of labeled data. However, annotating medical image datasets is a laborious and time-consuming process. Inspired by semi-supervised algorithms that use both labeled and unlabeled data for training, we propose the PLGDF framework, which builds upon the mean teacher network to segment medical images with fewer annotations. We design a novel pseudo-label utilization scheme that combines labeled and unlabeled data to augment the dataset effectively. Additionally, we enforce consistency between different scales in the decoder module of the segmentation network and propose a loss function suitable for evaluating this consistency. Moreover, we incorporate a sharpening operation on the predicted results, further enhancing segmentation accuracy.
Extensive experiments on three publicly available datasets demonstrate that the PLGDF framework can largely improve performance by incorporating the unlabeled data. Meanwhile, our framework yields superior performance compared to six state-of-the-art semi-supervised learning methods. The codes of this study are available at https://github.com/ortonwang/PLGDF.
Medical image segmentation, semi-supervised learning, pseudo label
Segmentation is a fundamental task in the field of medical image processing and analysis [1]. Accurate image segmentation in clinical medicine provides valuable auxiliary information for clinicians, facilitating rapid, accurate, and efficient diagnostic decision-making [2]. However, manual annotation of regions of interest is time-consuming and relies on the clinical expertise of physicians, resulting in a significant workload and potential errors [3].
With the rapid development of deep learning, Convolutional Neural Networks (CNNs) and their variants have demonstrated powerful image processing capabilities in computer vision tasks. The introduction of Fully Convolutional Networks [4] and U-Net [5] has greatly propelled the development of automated image segmentation [6]. Building upon these foundations, numerous studies have emerged to further improve the performance of segmentation algorithms [7], [8], [9]. For instance, Ning et al. proposed SMU-Net [10], which utilizes salient background representation to assist foreground segmentation by considering the texture information present in the background. Pang et al. introduced a novel two-stage framework named SpineParseNet for automated spine parsing in volumetric magnetic resonance images [11]. Additionally, Chen et al. presented TransUNet [12], which combines CNN with Transformer [13] for medical image segmentation, demonstrating outstanding performance.
However, the success of these methods heavily relies on a large amount of pixel-level annotated data, which is only feasible through precise annotations by skilled medical professionals [14]. This process is time-consuming and costly, limiting the applicability of supervised learning methods.
To address this issue, researchers have proposed semi-supervised learning-based methods for medical image segmentation [15]. Compared to supervised learning, semi-supervised learning can fully utilize the information contained in unlabeled data, thus improving the generalization capability and accuracy of the segmentation model [16].
One common approach in semi-supervised learning is the use of pseudo-label strategies. This method typically employs labeled data to train an initial model, which is then applied to unlabeled data to generate pseudo-labels. These pseudo-labels serve as approximate labels for the unlabeled data, thereby expanding the labeled dataset. Subsequently, the model is retrained using the expanded dataset to improve its robustness [17], [18]. Qiu et al. introduced a Federated Semi-Supervised Learning [19] approach to learn from distributed medical image domains, incorporating a federated pseudo-labeling strategy for unlabeled clients to mitigate the deficiency of annotations in unlabeled data. Bai et al. proposed an iterative learning method based on pseudo-labeling for cardiac MR image segmentation [20]. In this approach, pseudo-labels are refined using a Conditional Random Field, and the updated pseudo-labels are utilized for model updating.
Another common approach in semi-supervised learning is the consistency-based method [21]. This method aims to enhance the model’s robustness by combining the consistency among unlabeled data. In the context of image segmentation tasks, consistency can be categorized into data-level consistency and model-level consistency. Data-level consistency requires the model to produce consistent predictions for different perturbations of the same image. For example, when introducing slight perturbations or applying different data augmentation techniques to the input image, the model should generate the same segmentation results. On the other hand, model-level consistency requires consistent segmentation results across different models for the same input.
Deep adversarial training [22], [23] is also a commonly used method that leverages unlabeled data by employing a discriminator to align the distributions of labeled and unlabeled data. Wu et al. introduced MC-Net [24], which comprises a shared encoder and multiple slightly different decoders. The model incorporates statistical differences among the decoders to represent the model’s uncertainty and enforce consistency constraints.
Furthermore, the mean-teacher model [15] and its extensions [25], [26], [27] have gained significant attention in semi-supervised medical image segmentation tasks. In the mean-teacher model, the student network is guided by the teacher network during training, and training minimizes the discrepancy between the outputs of the teacher and student models. Additionally, other algorithms have also demonstrated outstanding performance in this field.
Luo et al. proposed Uncertainty Rectified Pyramid Consistency (URPC) [28], a novel framework with uncertainty-rectified pyramid consistency regularization. This framework offers a straightforward and efficient method to enforce output consistency across various scales for unlabeled data.
While this framework is simple and efficient, there is still room for improving its performance, indicating that such methods can be optimized further. Therefore, this study explores integrated strategies to enhance the performance of semi-supervised learning algorithms. We propose a novel framework named Pseudo Label-Guided Data Fusion (PLGDF). The main contributions of this paper can be summarized as follows:
We introduce the PLGDF framework, a novel architecture built upon the mean teacher network, incorporating an innovative pseudo-label utilization scheme. This framework integrates consistency evaluation across various scales within the decoder module of the network.
The mixing module combines both labeled and unlabeled data, enhancing the dataset’s diversity. Additionally, the integrated sharpening operation further improves recognition accuracy.
Experimental results on three publicly available datasets demonstrate the superiority of the proposed approach in semi-supervised medical image segmentation compared to six state-of-the-art models, setting new performance benchmarks.
The advancement of deep learning has significantly enhanced the precision of semantic segmentation. Within medical image segmentation, U-Net and its extensions have become the benchmark methods for further research and practical applications. Building upon the foundation of U-Net, numerous high-performing algorithms have emerged, such as CE-Net [29], UNet++ [30], and V-Net [31], while the introduction of 3D U-Net [32] expanded the application of medical image segmentation into the realm of 3D medical images. Cao et al. proposed Swin-Unet [33], substituting convolutional blocks with Swin-Transformer [34] blocks for enhanced feature extraction, while Wang et al. introduced O-Net [35], a deeper integration of CNN and Transformer, further improving algorithm performance. Additionally, other algorithms such as UNeXt [36], SpineParseNet [11], and SegFormer [37] have contributed to further improvements. Although these methods have achieved success in medical image segmentation, they are predominantly constructed in a fully supervised manner, and their performance is notably constrained by the scarcity of labeled samples available for training.
To address the challenge of limited labeled data in medical image segmentation, researchers have proposed various semi-supervised learning methods. Currently, one of the most widely employed approaches involves extending the mean-teacher framework in different directions. For example, Li et al. introduced TCSM [25], which leverages various perturbations on unlabeled data to train the network by enforcing consistency in predictions through regularization. Yu et al. presented Uncertainty-Aware Mean Teacher (UA-MT) [38], which incorporates an uncertainty-aware scheme encouraging consistent predictions for the same input under different perturbations. Chen et al. proposed an enhancement of the mean-teacher framework combined with adversarial networks to distinguish labeled and unlabeled data, showcasing outstanding performance [23]. Furthermore, there is research based on task consistency: Luo et al. utilized a dual-task deep network for joint prediction of pixel-wise segmentation maps and hierarchical representations of geometric objects, along with the introduction of dual-task consistency regularization [18].
Pseudo-labeling [39] is another common semi-supervised learning framework, often involving the conversion of probability maps into pseudo-labels using sharpening functions or fixed thresholds. Li et al. proposed self-loop uncertainty, a pseudo-label application strategy in which recurrent optimization of the neural network with a self-supervised task generates ground-truth labels for unlabeled images, augmenting the training set and enhancing segmentation accuracy. Rizve et al. [40] unified probability and uncertainty thresholds to select the most accurate pseudo-labels. Luo et al. presented URPC, which leverages data pyramid consistency and uncertainty rectification within a single model, based on the consistency of outputs at different scales for the same input, achieving excellent performance in semi-supervised segmentation [28]. Building upon these previous attempts, we combine pseudo-labeling and pyramid consistency with the mean-teacher framework to further improve semi-supervised segmentation of medical images.
To provide a more comprehensive understanding of our proposed model, we begin by introducing the symbols utilized in our approach to address the problem of semi-supervised segmentation. Symbol definitions are as follows:
\(X_{l}\): the labeled data, where each sample in \(X_{l}\) is associated with its corresponding Ground Truth, denoted as \(GT\).
\(X_{u}\): the unlabeled data, which consists of samples in the dataset that do not have corresponding \(GT\).
\(X_{mix}\): the data mixed from \(X_{u}\) and \(X_{l}\) through Mix Module during the training process.
\(f_{\theta_{s}}(x)\): the generated probability map of input \(x\), where \(\theta_{s}\) denotes the parameters of the student network, and \(f_{\theta_{t}}(x)\) means the generated probability map by the teacher network.
Figure 1 illustrates the overall architecture of the PLGDF framework proposed in our study. In our methodology, we adopt the framework of the mean teacher model, where the teacher network updates its parameters from the student network through an exponential moving average (EMA). In Algorithm 2, we present the pseudocode illustrating the training procedure of the proposed framework. To begin, we apply two random noise augmentations to \(X_{u}\). The teacher network processes the augmented data, and the averaged results are binarized to generate the pseudo-label corresponding to \(X_{u}\). Next, we use the Mix Module to augment \(X_{u}\) with \(X_{l}\), resulting in \(X_{mix}\), which is displayed as \(Mixed\) \(img\) in Figure 1. We then concatenate \(X_{u}\), \(X_{l}\), and \(X_{mix}\) and process them with the student network to generate the corresponding predictions: \(f_{\theta_{s}}(X_{u})\), \(f_{\theta_{s}}(X_{l})\), and \(f_{\theta_{s}}(X_{mix})\). Finally, we refine \(f_{\theta_{s}}(X_{u})\) by applying a sharpening operation to obtain soft pseudo-labels.
To facilitate the training of the model, we evaluate \(L_{sup}\) based on \(f_{\theta_{s}}(X_{l})\) and \(GT\), \(L_{semi}\) based on \(f_{\theta_{s}}(X_{mix})\) and the pseudo-labels, and \(L_{sharp}\) based on \(f_{\theta_{s}}(X_{u})\) and the soft pseudo-labels. Additionally, we introduce multi-scale outputs for the backbone model, and a multi-scale consistency evaluation module is incorporated to assess the consistency among outputs from different scales, denoted as \(L_{consis}\). In the subsequent subsections, we provide a detailed explanation of each module.
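To make the procedure above concrete, the following PyTorch-style sketch outlines one training iteration. The helper names (`mix_module`, `sharpen`, `sup_loss`, `semi_loss`, `sharp_loss`, `consis_loss`), the noise magnitude, and the EMA decay are illustrative assumptions rather than the released implementation; `mix_module`, `sharpen`, `sharp_loss`, `consis_loss`, and the Dice/warm-up components are sketched in the following subsections.

```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, alpha=0.99):
    # EMA update: the teacher parameters track the student parameters.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def train_step(student, teacher, x_l, gt, x_u, optimizer, lambda_t):
    # Assumptions: the student returns a list [P1, P2, P3, P4] of full-resolution
    # probability maps during training; the teacher returns only its highest-scale map.
    # 1) Pseudo-labels: average the teacher predictions for two noise-perturbed
    #    views of X_u, then binarise.
    with torch.no_grad():
        noise1 = torch.clamp(torch.randn_like(x_u) * 0.1, -0.2, 0.2)
        noise2 = torch.clamp(torch.randn_like(x_u) * 0.1, -0.2, 0.2)
        p_t = 0.5 * (teacher(x_u + noise1) + teacher(x_u + noise2))
        pseudo_label = (p_t > 0.5).float()

    # 2) Mix Module: blend unlabeled and labeled patches (Eqs. 1-3).
    x_mix = mix_module(x_u, x_l)

    # 3) Student forward passes on X_u, X_l and X_mix.
    outs_u, outs_l, outs_mix = student(x_u), student(x_l), student(x_mix)
    p_u, p_l, p_mix = outs_u[0], outs_l[0], outs_mix[0]

    # 4) Composite objective: supervised, pseudo-label, sharpening and
    #    multi-scale consistency terms (here applied to the unlabeled branch).
    loss = (sup_loss(p_l, gt)
            + semi_loss(p_mix, pseudo_label)
            + lambda_t * (sharp_loss(p_u) + consis_loss(outs_u)))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_teacher(student, teacher)
    return loss.item()
```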
To enhance the model’s effectiveness and improve its robustness with limited data, we introduce a data augmentation technique inspired by the widely employed Mix-Up method in vision tasks [41]. We perform a data mixing process by combining \(X_{l}\) and \(X_{u}\), resulting in a more diverse training dataset. During the data mixing process, we randomly select two sets of samples and linearly interpolate their features by a certain proportion, generating new samples with blended characteristics for model training. For a pair of samples \(X_{u_{1}}\) and \(X_{l_{1}}\), this process can be mathematically represented as follows: \[\begin{align} \lambda = Random(Beta(\alpha ,\alpha )),\\ \lambda' = max(\lambda,1-\lambda),\\ X_{u_{1}}' = \lambda'X_{u_{1}} + (1 - \lambda')X_{l_{1}} \end{align}\]
where \(\alpha\) is the hyperparameter of the Beta distribution from which \(\lambda\) is sampled. The effect of image mixing is shown in Figure 3. The mix is predominantly based on an unlabeled image, with a labeled image providing additional mixing. After the blending operation, certain changes occur in the information of the unlabeled image, especially within the red rectangular box, where the effect is most pronounced. However, these changes do not significantly alter the overall semantic information of the image. Therefore, we use the pseudo-label corresponding to the unlabeled image as the label for the mixed data.
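A minimal sketch of the Mix Module in Eqs. (1)-(3), assuming batched PyTorch tensors of the same shape; the function name and the default Beta parameter value are illustrative.

```python
import torch
from torch.distributions import Beta

def mix_module(x_u, x_l, alpha=0.5):
    # Eqs. (1)-(3): sample lambda from Beta(alpha, alpha) and take
    # lambda' = max(lambda, 1 - lambda) so that the unlabeled image dominates;
    # its pseudo-label is therefore reused as the label of the mixed sample.
    lam = Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)
    return lam * x_u + (1.0 - lam) * x_l
```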
Considering the efficacy of consistency in utilizing unlabeled data, we employ a sharpening function [42] to transform \(f_{\theta_{s}}(X_{u})\), the prediction obtained from \(X_{u}\) by the student network, into soft pseudo-labels. The formula of the sharpening function is as follows: \[f^{*}_{\theta_{s}}(X_{u}) = \frac{f_{\theta_{s}}(X_{u})^{\frac{1}{T} } }{f_{\theta_{s}}(X_{u})^{\frac{1}{T} } + (1-f_{\theta_{s}}(X_{u}))^{\frac{1}{T} } }\] where \(T\) is a hyperparameter used to control the sharpening temperature. Figure 4 illustrates that the sharpening operation enhances the clarity and accuracy of segmentation boundaries, reducing blurriness and ambiguous edges, especially in the region indicated by the red arrow. This improvement enables a more effective capture of target boundaries and subtle structures. Subsequently, we compute the consistency loss based on \(f_{\theta_{s}}(X_{u})\) and \(f^{*}_{\theta_{s}}(X_{u})\). Under the supervision of soft pseudo-labels, the model learns to generate low-entropy results as a means of minimizing entropy. We denote this loss as \(L_{sharp}\), with its evaluation formula expressed as follows: \[L_{sharp} = \frac{1}{N}\sum_{i=1}^{N} ||f_{\theta_{s}}(X_{u})^{i} - f^{*}_{\theta_{s}}(X_{u})^{i}||^{2}\] where \(N\) is the total number of pixels.
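A sketch of the sharpening function in Eq. (4) and the loss in Eq. (5), assuming the network outputs foreground probabilities in [0, 1]; detaching the sharpened target is a common choice rather than a confirmed detail of the released code.

```python
import torch
import torch.nn.functional as F

def sharpen(p, T=0.1):
    # Eq. (4): temperature-controlled sharpening of a foreground probability map.
    p_t = p ** (1.0 / T)
    return p_t / (p_t + (1.0 - p) ** (1.0 / T))

def sharp_loss(p_u, T=0.1):
    # Eq. (5): mean squared error between the prediction and its sharpened
    # version; the soft pseudo-label is treated as a fixed target.
    with torch.no_grad():
        soft_pseudo = sharpen(p_u, T)
    return F.mse_loss(p_u, soft_pseudo)
```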
In the decoder part of the backbone network, convolution followed by upsampling operations is commonly utilized. Therefore, during the training phase, we can utilize the intermediate features at different scales within the decoder module and standardize their sizes by applying 2\(\times\), 4\(\times\), and 8\(\times\) upsampling along with convolutional operations. Our objective is to enforce consistency among the outputs of different scales. First, we calculate the average value \(\widehat{P}\) based on the multi-scale outputs \(P_{1}, P_{2}, P_{3}\), and \(P_{4}\) shown in Figure 1. Subsequently, we assess the consistency between each scale and the average value \(\widehat{P}\) using \(L_{consis}\). The evaluation formulas are presented in equations (6), (7), and (8). \[\widehat{P} =\frac{1}{4} \sum_{s=1}^{4}P_{s}\] \[D_{s}^{i} = \sum_{j=0}^{C-1}P_{s}^{ i,j } \cdot \log\frac{P_{s}^{i,j}}{\widehat{P}^{i,j }}\] \[L_{consis} =\frac{1}{n} \sum_{s=1}^{n} \left( \frac{ \sum_{i=1}^{N} ||P_{s}^{i} -\widehat{P}^{i}||_{2}\cdot e^{-D_{s}^{i}} }{\sum_{i=1}^{N} e^{-D_{s}^{i}}} + \sum_{i=1}^{N}D_{s}^{i} \right)\] where \(C\) is the number of classes in the segmentation task, \(n\) is the number of multi-scale outputs, and \(N\) is the total number of pixels.
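A sketch of Eqs. (6)-(8), assuming each \(P_{s}\) is a softmax probability map of shape (B, C, D, H, W) already upsampled to the full resolution; the small epsilon and the voxel-averaged KL penalty (rather than a raw sum) are assumptions made for numerical stability.

```python
import torch

def consis_loss(preds, eps=1e-8):
    # preds: list of n probability maps P_s, each of shape (B, C, D, H, W).
    p_hat = torch.stack(preds, dim=0).mean(dim=0)                     # Eq. (6)
    loss = 0.0
    for p_s in preds:
        # Eq. (7): voxel-wise KL divergence D_s between P_s and the average map.
        d_s = (p_s * torch.log((p_s + eps) / (p_hat + eps))).sum(dim=1)
        # Eq. (8): distance to the average, weighted by the certainty e^{-D_s},
        # plus the KL penalty (averaged over voxels here for scale stability).
        dist = torch.sqrt(((p_s - p_hat) ** 2).sum(dim=1))
        w = torch.exp(-d_s)
        loss = loss + (dist * w).sum() / (w.sum() + eps) + d_s.mean()
    return loss / len(preds)
```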
The proposed PLGDF framework aims to learn from both labeled and unlabeled data by minimizing the following composite objective function: \[L_{total} = L_{sup}+L_{semi}+ \lambda(t) \cdot ( L_{sharp} + L_{consis})\] The expressions for \(L_{sharp}\) and \(L_{consis}\) are given in equations (5) and (8), respectively. For \(L_{sup}\) and \(L_{semi}\), we adopt the combination of Cross-Entropy Loss and Dice Loss [31], which is commonly used in medical image segmentation. The formula for Dice Loss is presented below: \[Dice \;Loss=1-\frac{2 * \sum_{i=1}^{N} p_{i} *g_{i} }{\sum_{i=1}^{N}p_{i} ^{2} + \sum_{i=1}^{N}g_{i} ^{2} }\] where \(p_{i}\) is the value of pixel \(i\) predicted by the model, and \(g_{i}\) is the value of pixel \(i\) in the ground truth. We introduce \(\lambda(t)\), a widely used time-dependent Gaussian warm-up function [43], to control the proportion between the supervised and unsupervised losses at different training stages. It is defined as follows: \[\lambda(t) = w\cdot e^{-5(1-\frac{t}{t_{max} } )^{2} }\] where \(w\) represents the final regularization weight, \(t\) represents the current training step, and \(t_{max}\) denotes the maximum training step.
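Minimal sketches of the Dice loss and the Gaussian warm-up weight \(\lambda(t)\) defined above; the smoothing constant and the default final weight `w` are illustrative assumptions. `ramp_up_weight` supplies the `lambda_t` used in the earlier training-step sketch.

```python
import math
import torch

def dice_loss(pred, target, smooth=1e-5):
    # Soft Dice loss over a foreground probability map and its ground truth.
    pred, target = pred.flatten(), target.flatten()
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + smooth) / (
        (pred ** 2).sum() + (target ** 2).sum() + smooth)

def ramp_up_weight(step, max_step, w=0.1):
    # Time-dependent Gaussian warm-up controlling the unsupervised loss weight.
    return w * math.exp(-5.0 * (1.0 - step / max_step) ** 2)
```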
In this paper, we evaluated the proposed PLGDF method and compared it with six previous works on three publicly available datasets: the Pancreas-CT dataset, the LA dataset and the BraTS2019 dataset. All of the datasets are associated with 3D segmentation tasks.
The Pancreas-CT dataset [44] consists of 82 3D abdominal contrast-enhanced CT scans, acquired from a cohort of 53 male and 27 female subjects. The CT scans were obtained with resolutions of 512\(\times\)512 pixels and varying pixel sizes. The slice thickness ranged from 1.5 to 2.5 mm. For this study, we randomly selected 60 images for training and 20 images for testing, following a standard data splitting protocol commonly used in similar studies [24]. To ensure consistency and comparability of voxel values, we applied a clipping operation, limiting the values to the range of -125 to 275 Hounsfield Units (HU) [45]. Additionally, we performed data resampling to achieve an isotropic resolution of 1.0\(\times\)1.0\(\times\)1.0 mm.
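A sketch of the Pancreas-CT preprocessing described above (resampling to 1.0 mm isotropic spacing and clipping to the [-125, 275] HU window), assuming SimpleITK; the linear interpolator and the function name are assumptions.

```python
import numpy as np
import SimpleITK as sitk

def preprocess_pancreas_ct(path):
    img = sitk.ReadImage(path)
    # Resample to 1.0 x 1.0 x 1.0 mm isotropic spacing.
    new_spacing = (1.0, 1.0, 1.0)
    old_spacing, old_size = img.GetSpacing(), img.GetSize()
    new_size = [int(round(sz * sp / nsp))
                for sz, sp, nsp in zip(old_size, old_spacing, new_spacing)]
    resampler = sitk.ResampleImageFilter()
    resampler.SetOutputSpacing(new_spacing)
    resampler.SetSize(new_size)
    resampler.SetOutputOrigin(img.GetOrigin())
    resampler.SetOutputDirection(img.GetDirection())
    resampler.SetInterpolator(sitk.sitkLinear)
    img = resampler.Execute(img)
    # Clip intensities to the soft-tissue window of [-125, 275] HU.
    arr = sitk.GetArrayFromImage(img).astype(np.float32)
    return np.clip(arr, -125.0, 275.0)
```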
The LA dataset [46], which serves as the benchmark dataset for the 2018 Atrial Segmentation Challenge, comprises 100 gadolinium-enhanced MR imaging scans for training, with a resolution of 0.625\(\times\)0.625\(\times\)0.625 mm. As the testing set of LA lacks publicly available annotations, we allocated 80 samples for training and reserved the remaining 20 samples for validation following [24]. Subsequently, we evaluated the performance of our model and other methods on the same validation set to ensure fair comparisons.
The publicly available BraTS2019 dataset [47] comprises scans obtained from 335 patients diagnosed with glioma. This dataset encompasses T1, T2, T1 contrast-enhanced, and FLAIR sequences, along with corresponding tumor segmentations annotated by expert radiologists. In this study, we used the FLAIR modality for segmentation. We conducted a random split, allocating 250 scans for training, 25 scans for validation, and 60 scans for testing, following [28].
Table 1: Quantitative comparison with six semi-supervised methods on the Pancreas-CT dataset.

| Method | Labeled | Unlabeled | Dice(%)\(\uparrow\) | Jaccard(%)\(\uparrow\) | 95HD(voxel)\(\downarrow\) | ASD(voxel)\(\downarrow\) |
|---|---|---|---|---|---|---|
| V-Net | 3(5%) | 0 | 29.32 | 19.61 | 43.67 | 15.42 |
| V-Net | 6(10%) | 0 | 54.94 | 40.87 | 47.48 | 17.43 |
| V-Net | 12(20%) | 0 | 71.52 | 57.68 | 18.12 | 5.41 |
| V-Net | 62(100%) | 0 | 83.76 | 72.48 | 4.46 | 1.07 |
| UA-MT (MICCAI’19) | 3(5%) | 59(95%) | 43.15 | 29.07 | 51.96 | 20.00 |
| SASSNet (MICCAI’20) | | | 41.48 | 27.98 | 47.48 | 18.36 |
| DTC (AAAI’21) | | | 47.57 | 33.41 | 44.17 | 15.31 |
| URPC (MIA’2022) | | | 45.94 | 34.14 | 48.80 | 23.03 |
| SS-Net (MICCAI’22) | | | 41.39 | 27.65 | 52.12 | 19.37 |
| MC-Net+ (MIA’2022) | | | 32.45 | 21.22 | 58.57 | 24.84 |
| Ours | | | 74.69 | 60.00 | 8.19 | 1.74 |
| UA-MT (MICCAI’19) | 6(10%) | 56(90%) | 66.44 | 52.02 | 17.04 | 3.03 |
| SASSNet (MICCAI’20) | | | 68.97 | 54.29 | 18.83 | 1.96 |
| DTC (AAAI’21) | | | 66.58 | 51.79 | 15.46 | 4.16 |
| URPC (MIA’2022) | | | 73.53 | 59.44 | 22.57 | 7.85 |
| SS-Net (MICCAI’22) | | | 73.44 | 58.82 | 12.56 | 2.91 |
| MC-Net+ (MIA’2022) | | | 70.00 | 55.66 | 16.03 | 3.87 |
| Ours | | | 80.90 | 68.40 | 6.02 | 1.59 |
| UA-MT (MICCAI’19) | 12(20%) | 50(80%) | 76.10 | 62.62 | 10.84 | 2.43 |
| SASSNet (MICCAI’20) | | | 76.39 | 63.17 | 11.06 | 1.42 |
| DTC (AAAI’21) | | | 78.27 | 64.75 | 8.36 | 2.25 |
| URPC (MIA’2022) | | | 80.02 | 67.30 | 8.54 | 1.98 |
| SS-Net (MICCAI’22) | | | 78.68 | 65.96 | 9.74 | 1.91 |
| MC-Net+ (MIA’2022) | | | 79.37 | 66.83 | 8.52 | 1.72 |
| Ours | | | 82.76 | 70.89 | 4.85 | 1.33 |
During the training process, we randomly extracted 3D patches from the preprocessed data. For the LA dataset, the patch size was set to 112 \(\times\) 112 \(\times\) 80, while for the Pancreas-CT and BraTS2019 datasets, the patch size was 96 \(\times\) 96 \(\times\) 96. For all three datasets, we set the batch size to 4, where each batch consisted of two labeled patches and two unlabeled patches. The backbone network employed in our study is the V-Net [31]. Additionally, we modified the network to generate multi-scale outputs, and the number of scales \(n\) used for evaluating the multi-scale consistency is set to 4. We trained our PLGDF model for 15\(k\) iterations on the Pancreas-CT and LA datasets and 30\(k\) iterations on the BraTS2019 dataset, following the methodology described in [24] and [28].
During the testing phase, we employed a sliding window approach with a fixed stride to extract patches. Specifically, on the LA dataset, we utilized a sliding window of size 112 \(\times\) 112 \(\times\) 80 with a stride of 18 \(\times\) 18 \(\times\) 4. On the Pancreas-CT and BraTS2019 datasets, we used a sliding window with a size of 96 \(\times\) 96 \(\times\) 96 and a stride of 16 \(\times\) 16 \(\times\) 16. Subsequently, we reconstructed the patch-based predictions to obtain the final results for the entire volume.
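A simplified sketch of the patch-based inference described above, assuming a single-channel volume no smaller than the patch size in every dimension and a model that maps a patch to a foreground probability map of the same spatial size; averaging overlapping predictions before thresholding is an assumption rather than the exact reconstruction used in the released code.

```python
import numpy as np
import torch

@torch.no_grad()
def sliding_window_predict(model, volume, patch_size=(96, 96, 96),
                           stride=(16, 16, 16), device="cuda"):
    # volume: numpy array of shape (D, H, W).
    D, H, W = volume.shape
    pd, ph, pw = patch_size
    sd, sh, sw = stride
    prob = np.zeros_like(volume, dtype=np.float32)
    count = np.zeros_like(volume, dtype=np.float32)

    # Patch start positions along each axis; the last patch is shifted so that
    # it ends exactly at the volume border.
    def starts(dim, p, s):
        pos = list(range(0, dim - p + 1, s))
        if pos[-1] != dim - p:
            pos.append(dim - p)
        return pos

    for z in starts(D, pd, sd):
        for y in starts(H, ph, sh):
            for x in starts(W, pw, sw):
                patch = np.ascontiguousarray(volume[z:z + pd, y:y + ph, x:x + pw])
                inp = torch.from_numpy(patch).float()[None, None].to(device)
                pred = model(inp)[0, 0].cpu().numpy()
                prob[z:z + pd, y:y + ph, x:x + pw] += pred
                count[z:z + pd, y:y + ph, x:x + pw] += 1.0

    # Average overlapping predictions, then threshold to a binary mask.
    return (prob / np.maximum(count, 1.0)) > 0.5
```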
In our training process, we employed the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4. The learning rate was set to 1e-2 and the hyperparameter \(T\) was set to 1e-1. In this study, we trained the network using 10\(\%\) and 20\(\%\) of the data on three representative semi-supervised datasets, following the data partitioning methods described in [21], [48], [49]. Our framework was implemented in PyTorch 1.12.0, utilizing an Nvidia RTX 3090 GPU with 24GB of memory. For quantitative evaluation, we employed four metrics: Dice, Jaccard, the average surface distance (ASD), and the 95\(\%\) Hausdorff Distance (95HD). During the training phase, as our model incorporates multi-scale outputs, we utilized the four-scale outputs within the student model and exclusively employed the highest-scale output within the teacher model. Similarly, during the inference phase, we solely utilized the output from the highest scale, denoted as \(P_{1}\) in Figure 1; consequently, at inference time our backbone network is equivalent to the V-Net. The outputs \(P_{2}\), \(P_{3}\), and \(P_{4}\) were exclusively utilized within the student model during the training process.
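A sketch of the four evaluation metrics, assuming binary numpy masks and the medpy package, which is commonly used for these benchmarks and is an assumption here rather than the exact evaluation code; with no voxel spacing passed, 95HD and ASD are reported in voxels, matching the tables.

```python
from medpy.metric import binary

def evaluate_case(pred, gt):
    # pred, gt: binary numpy arrays of identical shape; hd95/asd require at
    # least one foreground voxel in each mask.
    dice = binary.dc(pred, gt) * 100.0      # Dice similarity coefficient (%)
    jaccard = binary.jc(pred, gt) * 100.0   # Jaccard index (%)
    hd95 = binary.hd95(pred, gt)            # 95% Hausdorff distance (voxels)
    asd = binary.asd(pred, gt)              # average surface distance (voxels)
    return dice, jaccard, hd95, asd
```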
We compared our proposed framework with six state-of-the-art semi-supervised segmentation methods, including UA-MT [38], the Shape-aware Adversarial Network (SASSNet) [48], the Dual-task Consistency framework (DTC) [49], Uncertainty Rectified Pyramid Consistency (URPC) [28], SS-Net [50], and the Mutual Consistency Network (MC-Net+) [24]. Note that we utilized the official codes and results of UA-MT, SASSNet, URPC, DTC, SS-Net, and MC-Net+, along with their publicly available data preprocessing schemes, and we used the results reported in MC-Net+ [24] as our benchmark.
Table 2: Quantitative comparison on the LA dataset.

| Method | Labeled | Unlabeled | Dice(%)\(\uparrow\) | Jaccard(%)\(\uparrow\) | 95HD(voxel)\(\downarrow\) | ASD(voxel)\(\downarrow\) |
|---|---|---|---|---|---|---|
| V-Net | 4(5%) | 0 | 52.55 | 39.60 | 47.05 | 9.87 |
| V-Net | 8(10%) | 0 | 78.57 | 66.96 | 21.20 | 6.07 |
| V-Net | 16(20%) | 0 | 86.96 | 77.31 | 11.85 | 3.22 |
| V-Net | 80(100%) | 0 | 91.62 | 84.60 | 5.40 | 1.64 |
| UA-MT (MICCAI’19) | 4(5%) | 76(95%) | 82.26 | 70.98 | 13.71 | 3.82 |
| SASSNet (MICCAI’20) | | | 81.60 | 69.63 | 16.60 | 3.58 |
| DTC (AAAI’21) | | | 81.25 | 69.33 | 14.90 | 3.99 |
| URPC (MIA’2022) | | | 82.48 | 71.35 | 14.65 | 3.65 |
| SS-Net (MICCAI’22) | | | 86.33 | 76.15 | 9.97 | 2.31 |
| MC-Net+ (MIA’2022) | | | 82.07 | 70.38 | 20.49 | 5.72 |
| Ours | | | 89.22 | 80.62 | 7.90 | 1.71 |
| UA-MT (MICCAI’19) | 8(10%) | 72(90%) | 86.28 | 76.11 | 18.71 | 4.63 |
| SASSNet (MICCAI’20) | | | 86.81 | 76.92 | 12.54 | 2.55 |
| DTC (AAAI’21) | | | 87.51 | 78.17 | 8.23 | 2.36 |
| URPC (MIA’2022) | | | 85.01 | 74.36 | 15.37 | 3.96 |
| SS-Net (MICCAI’22) | | | 88.43 | 79.43 | 7.95 | 2.55 |
| MC-Net+ (MIA’2022) | | | 88.96 | 80.25 | 7.93 | 1.86 |
| Ours | | | 89.80 | 81.58 | 7.14 | 1.74 |
| UA-MT (MICCAI’19) | 16(20%) | 64(80%) | 88.74 | 79.94 | 8.39 | 2.32 |
| SASSNet (MICCAI’20) | | | 89.27 | 80.82 | 8.83 | 3.13 |
| DTC (AAAI’21) | | | 89.42 | 80.98 | 7.32 | 2.10 |
| URPC (MIA’2022) | | | 88.74 | 79.93 | 12.73 | 3.66 |
| SS-Net (MICCAI’22) | | | 89.86 | 81.70 | 7.01 | 1.87 |
| MC-Net+ (MIA’2022) | | | 91.07 | 83.67 | 5.84 | 1.67 |
| Ours | | | 91.34 | 83.94 | 5.66 | 1.76 |
Table 1 shows the quantitative comparison of our model and six semi-supervised methods on the Pancreas-CT dataset, along with the results of the V-Net model trained with 5\(\%\), 10\(\%\), 20\(\%\), and 100\(\%\) labeled data for supervised learning. The experimental results indicate significant improvements of our proposed method over the six compared state-of-the-art (SOTA) models across four evaluation metrics: Dice, Jaccard, 95HD, and ASD. It is evident from the table that our result stands out prominently, particularly when only 5\(\%\) or 10\(\%\) of the data is labeled. With 5\(\%\) labeled data, we achieved a Dice score of 74.69\(\%\), along with a 95HD of 8.19 and an ASD of 1.74. These metrics not only surpass the accuracy obtained by the six compared SOTA models based on equivalent data annotation but also outperform the accuracy of these models at 10\(\%\) data annotation. Similarly, at 10\(\%\) data annotation, the accuracy of our method exceeds that of the compared models trained with 20\(\%\) labeled data.
Figure 5 displays the visual results of image segmentation on the Pancreas-CT dataset when trained with 5\(\%\) and 10\(\%\) labeled data. The visualizations are presented in both 2D and 3D perspectives. It is apparent that when utilizing only 5\(\%\) labeling, the other compared models fail to segment an approximate pancreas, while our model achieves segmentation results relatively closer to the Ground Truth (GT). Similarly, at 10\(\%\) data annotation, our algorithm’s predictions are more accurate. In the 2D visual representation, the compared methods exhibit higher rates of false negatives, whereas our model achieves more precise identification. This further substantiates the effectiveness of the proposed model.
Table 3: Quantitative comparison on the BraTS2019 dataset.

| Method | Labeled | Unlabeled | Dice(%)\(\uparrow\) | Jaccard(%)\(\uparrow\) | 95HD(voxel)\(\downarrow\) | ASD(voxel)\(\downarrow\) |
|---|---|---|---|---|---|---|
| V-Net | 12(5%) | 0 | 74.28 | 64.42 | 13.44 | 2.60 |
| V-Net | 25(10%) | 0 | 78.67 | 68.75 | 10.44 | 2.23 |
| V-Net | 50(20%) | 0 | 80.59 | 71.13 | 8.95 | 2.03 |
| V-Net | 250(All) | 0 | 88.58 | 80.34 | 6.19 | 1.36 |
| UA-MT (MICCAI’19) | 12(5%) | 238(95%) | 80.31 | 70.43 | 10.65 | 2.12 |
| SASSNet (MICCAI’20) | | | 76.17 | 66.43 | 13.09 | 3.32 |
| DTC (AAAI’21) | | | 74.21 | 64.89 | 13.54 | 3.16 |
| URPC (MIA’2022) | | | 78.74 | 68.20 | 17.43 | 4.51 |
| SS-Net (MICCAI’22) | | | 78.03 | 68.11 | 13.70 | 2.76 |
| MC-Net+ (MIA’2022) | | | 78.69 | 68.38 | 16.44 | 4.49 |
| Ours | | | 84.96 | 75.15 | 10.28 | 2.53 |
| UA-MT (MICCAI’19) | 25(10%) | 225(90%) | 80.93 | 71.31 | 17.71 | 5.43 |
| SASSNet (MICCAI’20) | | | 79.19 | 68.80 | 16.36 | 6.67 |
| DTC (AAAI’21) | | | 82.74 | 72.74 | 11.76 | 3.24 |
| URPC (MIA’2022) | | | 84.16 | 74.29 | 11.01 | 2.63 |
| SS-Net (MICCAI’22) | | | 82.00 | 71.82 | 10.68 | 1.82 |
| MC-Net+ (MIA’2022) | | | 79.63 | 70.10 | 12.28 | 2.45 |
| Ours | | | 85.47 | 75.97 | 9.57 | 2.07 |
| UA-MT (MICCAI’19) | 50(20%) | 200(80%) | 85.05 | 74.51 | 12.31 | 3.03 |
| SASSNet (MICCAI’20) | | | 82.34 | 72.72 | 12.45 | 3.24 |
| DTC (AAAI’21) | | | 83.47 | 72.93 | 14.48 | 3.59 |
| URPC (MIA’2022) | | | 85.49 | 74.86 | 8.14 | 2.04 |
| SS-Net (MICCAI’22) | | | 83.07 | 73.48 | 10.08 | 1.63 |
| MC-Net+ (MIA’2022) | | | 82.87 | 73.61 | 8.94 | 1.92 |
| Ours | | | 86.31 | 77.34 | 7.45 | 1.36 |
| Designs | Scans used | Metrics | ||||||||||
| \(L_{sup}\) | \(L_{semi}\) | Mix M. | \(L_{consis}\) | \(L_{sharp}\) | Labeled | Unlabeled | Dice(%)\(\uparrow\) | Jaccard(%)\(\uparrow\) | 95HD(voxel)\(\downarrow\) | ASD(voxel)\(\downarrow\) | ||
| ✔ | 3(5%) | 59(95%) | 29.32 | 19.61 | 43.67 | 15.42 | ||||||
| ✔ | ✔ | 48.09 | 34.59 | 57.57 | 25.64 | |||||||
| ✔ | ✔ | ✔ | 74.01 | 59.52 | 26.58 | 8.83 | ||||||
| ✔ | ✔ | ✔ | ✔ | 74.18 | 59.51 | 8.39 | 2.05 | |||||
| ✔ | ✔ | ✔ | ✔ | 76.20 | 62.16 | 12.14 | 3.49 | |||||
| ✔ | ✔ | ✔ | ✔ | ✔ | 74.69 | 60.00 | 8.19 | 1.74 | ||||
| ✔ | 6(10%) | 56(90%) | 54.94 | 40.87 | 47.48 | 17.43 | ||||||
| ✔ | ✔ | 76.40 | 63.06 | 16.55 | 4.75 | |||||||
| ✔ | ✔ | ✔ | 80.49 | 67.84 | 6.22 | 1.77 | ||||||
| ✔ | ✔ | ✔ | ✔ | 80.47 | 68.81 | 6.16 | 1.87 | |||||
| ✔ | ✔ | ✔ | ✔ | 80.22 | 67.51 | 5.92 | 2.13 | |||||
| ✔ | ✔ | ✔ | ✔ | ✔ | 80.9 | 68.4 | 6.02 | 1.59 | ||||
| ✔ | 12(20%) | 50(80%) | 71.52 | 57.68 | 18.12 | 5.41 | ||||||
| ✔ | ✔ | 80.26 | 67.45 | 15.99 | 3.99 | |||||||
| ✔ | ✔ | ✔ | 81.96 | 69.77 | 5.18 | 1.43 | ||||||
| ✔ | ✔ | ✔ | ✔ | 82.51 | 70.69 | 6.87 | 2.11 | |||||
| ✔ | ✔ | ✔ | ✔ | 81.61 | 69.35 | 7.50 | 2.02 | |||||
| ✔ | ✔ | ✔ | ✔ | ✔ | 82.76 | 70.89 | 4.85 | 1.33 | ||||
Table 2 presents the quantitative experimental results on the LA dataset. Compared to six SOTA methods, our model achieved optimal Dice scores across 5\(\%\), 10\(\%\), and 20\(\%\) data annotations. With 20\(\%\) labeled data, our model obtained a Dice score of 91.34\(\%\), closely approaching the Dice score of 91.62\(\%\) achieved through supervised learning using 100\(\%\) labeled data. It is worth emphasizing that with 5\(\%\) labeled data, we achieved a Dice score of 88.72\(\%\), surpassing the accuracy of other comparative algorithms, except for MC-Net, which was trained with 10\(\%\) labeled data. Similarly, except for MC-Net, at 10\(\%\) data annotation, we achieved accuracies comparable to those obtained by the other algorithms based on 20\(\%\) data annotation. Simultaneously, we achieved the lowest 95HD and outstanding ASD metric.
Figure 6 illustrates the visualization results of image segmentation for the LA dataset when trained with 5\(\%\) and 10\(\%\) labeled data. It is evident that our model generates a more complete and accurate left atrium than those compared models. Especially at 5\(\%\) data annotation, our model naturally eliminates most of the isolated regions and preserves more fine details, whereas other models tend to produce false positive noise predictions. In the 2D view, our prediction also exhibits closer proximity to the GT. This visual representation intuitively demonstrates the effectiveness of our method.
Table 3 presents the quantitative experimental results on the BraTS2019 dataset. In comparison to six SOTA models, our model achieved superior Dice and 95HD scores by efficiently leveraging unlabeled data across 5\(\%\), 10\(\%\), and 20\(\%\) labeled data. Particularly, using 5\(\%\) labeled data, we achieved a Dice score of 84.96\(\%\), signifying a noteworthy enhancement compared to the best-scoring counterpart at 80.31\(\%\) in Dice, surpassing the metric achieved by the comparative model with 10\(\%\) labeled data.
Figure 7 illustrates the visual segmentation results on the BraTS2019 dataset. Compared to six SOTA models, our model showcases a more comprehensive 3D volumetric segmentation, approaching the delineation of GT. In the 2D view, our model demonstrates closer proximity to the GT. In contrast, the compared models exhibit more false negatives relative to the somewhat ambiguous lesion features. This demonstrates the superiority of our model, showcasing its capability to identify challenging lesions with a small amount of labeled data.
In this study, we performed ablation studies on the Pancreas-CT dataset to analyse the contribution of each component.
In our proposed framework, besides the \(L_{sup}\) loss used for computing the labeled data loss, we also incorporate the Mix Module and three additional losses: \(L_{semi}\), \(L_{sharp}\), and \(L_{consis}\), each corresponding to a different module that we combine. To validate the effectiveness of these modules, we performed ablation experiments, and Table 4 presents the results obtained using different strategies. When using only \(L_{sup}\), which corresponds to utilizing a small amount of labeled data for fully supervised learning, the model’s accuracy was relatively low. However, after incorporating \(L_{semi}\), the model’s accuracy significantly improved. Moreover, as these strategies were integrated, the results improved steadily. This indicates that the strategies adopted in this paper are effective and contribute to enhancing the robustness of the model. It is worth noting that even with the combination of only \(L_{sup}\) and \(L_{semi}\), the model already achieves the best Dice and Jaccard compared to the SOTA models presented in Table 1. Since our model incorporates \(L_{consis}\), the strategy proposed in URPC [28], it is also worth noting from the table that our model performs well even without \(L_{consis}\).
To enhance the precision of segmentation boundaries, this study introduces a sharpening function designed to enforce entropy minimization constraints. We conducted ablation experiments on the Pancreas-CT dataset, employing 5\(\%\) and 10\(\%\) labeled data, to assess the impact of the temperature \(T\) on the model’s performance. The left side of Figure 8 delineates the application of the sharpening function to predictions at varying temperatures \(T\), whereas the right side exhibits the Dice scores of the PLGDF model trained with different \(T\) values on the Pancreas-CT dataset. The outcomes demonstrate uniformity in Dice values across different \(T\) values, underscoring the model’s robustness to temperature fluctuations. Elevated \(T\) values insufficiently enforce the entropy minimization constraint during training, while diminished \(T\) values could amplify pseudo-label noise, culminating in inaccuracies. As a result, a sharpening function with a calibrated temperature of 0.1 was adopted to generate soft pseudo-labels consistently across all datasets.
In this paper, we propose a novel semi-supervised medical image segmentation framework, named PLGDF. Building upon the mean-teacher network, we utilize the teacher model’s predictions as pseudo-labels for the unlabeled data to aid in the training of the student network. Additionally, we mix the unlabeled data with labeled data to enhance dataset diversity. Furthermore, we apply a sharpening operation to the predictions of the unlabeled data to improve the clarity and accuracy of segmentation boundaries. By ensuring consistency between different scales in the decoding part of the backbone network, we further enhance the model’s stability. Comprehensive experiments were performed on three publicly available medical image datasets: the LA and BraTS2019 datasets from MR scans, and the Pancreas-CT dataset from CT scans. Remarkably, even with just 5\(\%\) labeled data, our method achieves results comparable to, or even surpassing, those of state-of-the-art methods that use 10\(\%\) labeled data. These results demonstrate the effectiveness, robustness, and general applicability of our proposed framework.
Furthermore, our proposed framework is modular and can be combined with other components suited to semi-supervised learning, such as Generative Adversarial Networks (GANs) [51]. While we apply sharpening operations to the predicted results, it is important to acknowledge that the predictions in certain regions might already be erroneous, and sharpening these areas could potentially exacerbate misclassifications. Although the current metrics suggest that sharpening brings more benefits than drawbacks, future work on identifying correct regions and reducing interference in misjudged areas could further improve the model’s performance.
This work was supported in part by the National Natural Science Foundation of China under Grant 62171133, in part by the Artificial Intelligence and Economy Integration Platform of Fujian Province, and in part by the Fujian Health Commission under Grant 2022ZD01003.
Tao Wang, Yuanbin Chen, Xinlin Zhang, Yuanbo Zhou, Junlin Lan, Min Du, Qinquan Gao and Tong Tong are with the College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China (e-mail: ortonwangtao@gmail.com; binycn904363330@gmail.com; xinlin1219@gmail.com; webbozhou@gmail.com; smurfslan@qq.com; dm_dj90@163.com; gqinquan@imperial-vision.com; ttraveltong@imperial-vision.com).
Tao Tan is with the Faculty of Applied Science, Macao Polytechnic University, Macao 999078 (e-mail: taotanjs@gmail.com).
Bizhe Bai is with the University of Toronto, Ontario M5S2E8, Canada (e-mail: bizhe.bai@outlook.com).