Pseudo Label-Guided Data Fusion and Output Consistency for Semi-Supervised Medical Image Segmentation


Abstract

Supervised learning algorithms based on Convolutional Neural Networks have become the benchmark for medical image segmentation tasks, but their effectiveness heavily relies on a large amount of labeled data. However, annotating medical image datasets is a laborious and time-consuming process. Inspired by semi-supervised algorithms that use both labeled and unlabeled data for training, we propose the PLGDF framework, which builds upon the mean teacher network to segment medical images with fewer annotations. We design a novel pseudo-label utilization scheme, which combines labeled and unlabeled data to augment the dataset effectively. Additionally, we enforce consistency between different scales in the decoder module of the segmentation network and propose a loss function suitable for evaluating this consistency. Moreover, we incorporate a sharpening operation on the predicted results, further enhancing the accuracy of the segmentation.

Extensive experiments on three publicly available datasets demonstrate that the PLGDF framework can largely improve performance by incorporating the unlabeled data. Meanwhile, our framework yields superior performance compared to six state-of-the-art semi-supervised learning methods. The codes of this study are available at https://github.com/ortonwang/PLGDF.

Medical image segmentation, semi-supervised learning, pseudo label

1 Introduction↩︎

Segmentation is a fundamental task in the field of medical image processing and analysis [1]. Accurate image segmentation in clinical medicine provides valuable auxiliary information for clinicians, facilitating rapid, accurate, and efficient diagnostic decision-making [2]. However, manual annotation of regions of interest is time-consuming and relies on the clinical expertise of physicians, resulting in a significant workload and potential errors [3].

With the rapid development of deep learning, Convolutional Neural Networks (CNNs) and their variants have demonstrated powerful image processing capabilities in computer vision tasks. The introduction of Fully Convolutional Networks [4] and U-Net [5] has greatly propelled the development of automated image segmentation [6]. Building upon these foundations, numerous studies have emerged to further improve the performance of segmentation algorithms [7], [8], [9]. For instance, Ning et al. proposed SMU-Net [10], which utilizes salient background representation to assist foreground segmentation by considering the texture information present in the background. Pang et al. introduced a novel two-stage framework named SpineParseNet for automated spine parsing in volumetric magnetic resonance images [11]. Additionally, Chen et al. presented TransUNet [12], which combines CNN with Transformer [13] for medical image segmentation, demonstrating outstanding performance.

However, the success of these methods heavily relies on a large amount of pixel-level annotated data, which is only feasible through precise annotations by skilled medical professionals [14]. This process is time-consuming and costly, limiting the applicability of supervised learning methods.

To address this issue, researchers have proposed semi-supervised learning-based methods for medical image segmentation [15]. Compared to supervised learning, semi-supervised learning can fully utilize the information contained in unlabeled data, thus improving the generalization capability and accuracy of the segmentation model [16].

One common approach in semi-supervised learning is the use of pseudo-label strategies. This method typically employs labeled data to train an initial model, which is then applied to unlabeled data to generate pseudo-labels. These pseudo-labels serve as approximate labels for the unlabeled data, thereby expanding the labeled dataset. Subsequently, the model is retrained using the expanded dataset to improve its robustness [17], [18]. Qiu et al. introduced a Federated Semi-Supervised Learning [19] approach to learn from distributed medical image domains, incorporating a federated pseudo-labeling strategy for unlabeled clients to mitigate the deficiency of annotations in unlabeled data. Bai et al. proposed an iterative learning method based on pseudo-labeling for cardiac MR image segmentation [20]. In this approach, pseudo-labels are refined using a Conditional Random Field, and the updated pseudo-labels are utilized for model updating.

Another common approach in semi-supervised learning is the consistency-based method [21]. This method aims to enhance the model’s robustness by combining the consistency among unlabeled data. In the context of image segmentation tasks, consistency can be categorized into data-level consistency and model-level consistency. Data-level consistency requires the model to produce consistent predictions for different perturbations of the same image. For example, when introducing slight perturbations or applying different data augmentation techniques to the input image, the model should generate the same segmentation results. On the other hand, model-level consistency requires consistent segmentation results across different models for the same input.

Deep adversarial training [22], [23] is also a commonly used method that leverages unlabeled data by employing a discriminator to align the distributions of labeled and unlabeled data. Wu et al. introduced MC-Net [24], which comprises a shared encoder and multiple slightly different decoders. The model incorporates statistical differences among the decoders to represent the model’s uncertainty and enforce consistency constraints.

Furthermore, the mean-teacher model [15] and its extensions [25], [26], [27] have gained significant attention in semi-supervised medical image segmentation tasks. In the mean-teacher model, the student network is guided by the teacher network's predictions during training, while the teacher's parameters are an exponential moving average of the student's. Model training involves minimizing the discrepancy between the teacher and student predictions. Additionally, other algorithms have also demonstrated outstanding performance in this field.

Luo et al. proposed Uncertainty Rectified Pyramid Consistency (URPC) [28], a novel framework with uncertainty rectified pyramid consistency regularization. This framework offers a straightforward and efficient method to enforce output consistency across various scales for unlabeled data.

While the framework is simple and efficient, there is still room for further optimization in its performance, indicating the potential for further improvements in the performance of these methods. Therefore, this study also endeavors to explore some integrated strategies to enhance the performance of semi-supervised learning algorithms. We propose a novel framework named Pseudo Label-Guided Data Fusion (PLGDF). The main contributions of this paper can be summarized as follows:

  • We introduce the PLGDF framework, a novel architecture built upon the mean teacher network, incorporating an innovative pseudo-label utilization scheme. This framework integrates consistency evaluation across various scales within the decoder module of the network.

  • The mixing module combines both labeled and unlabeled data, enhancing the dataset’s diversity. Additionally, the integrated sharpening operation further improves recognition accuracy.

  • Experimental results on three publicly available datasets demonstrate the superiority of the proposed approach in semi-supervised medical image segmentation compared to six state-of-the-art models, setting new performance benchmarks.

2 Related Work↩︎

2.1 Medical image segmentation↩︎

The advancement of deep learning has significantly enhanced the precision of semantic segmentation. Within medical image segmentation, U-Net and its extensions have become the benchmark methods for further research and practical applications. Building upon the foundation of U-Net, numerous high-performing algorithms have emerged, such as CE-Net [29], UNet++ [30], and V-Net [31], while the introduction of 3D U-Net [32] expanded the application of medical image segmentation into the realm of 3D medical images. Cao et al. proposed Swin-Unet [33], substituting convolutional blocks with Swin-Transformer [34] blocks for enhanced feature extraction, while Wang et al. introduced O-Net [35], a deeper integration of CNN and Transformer, further improving algorithm performance. Additionally, other algorithms such as UNeXt [36], SpineParseNet [11], and SegFormer [37] have contributed to further improvements in segmentation performance. Although these methods have achieved success in medical image segmentation, they are predominantly constructed in a fully-supervised manner. Their performance is notably constrained by the scarcity of labeled samples available for training.

Figure 1: The schematic diagram of the proposed PLGDF framework. The figure in the bottom right provides a detailed depiction of the student network. In the diagram, D_{i} represents the decoder module of the V-Net backbone, and P_{i} refers to the predictions obtained from different scales of the backbone. These predictions are unified to the same size through upsampling and convolution processes at various scales. Additionally, the Exponential Moving Average (EMA) represents that the teacher network implements parameter updates from the student network through the exponential moving average.

2.2 Semi-Supervised Medical image segmentation↩︎

To address the challenge of limited labeled data in medical image segmentation, researchers have proposed various semi-supervised learning methods. Currently, one of the most widely employed approaches involves extending the mean-teacher framework to different aspects. For example, Li et al. introduced TCSM [25], which leverages various perturbations on unlabeled data to train the network by enforcing consistency in predictions through regularization. Yu et al. presented Uncertainty-Aware Mean Teacher (UA-MT) [38], which incorporates an uncertainty-aware scheme encouraging consistent predictions for the same input under different perturbations. Chen et al. proposed an enhancement of the mean-teacher framework combined with adversarial networks to distinguish labeled and unlabeled data, showcasing outstanding performance [23]. Furthermore, there is research based on task consistency. Luo et al. utilized a dual-task deep network for joint prediction of pixel segmentation maps and hierarchical representations of geometric objects, along with the introduction of dual-task consistency regularization [18].

Pseudo-labeling [39] is another common semi-supervised learning framework, often involving the conversion of probability maps into pseudo-labels using sharpening functions or fixed thresholds. Li et al. proposed self-loop uncertainty, a pseudo-label application strategy where recurrent optimization of the neural network with a self-supervised task generates ground-truth labels for unlabeled images, augmenting the training set and enhancing segmentation accuracy. Rizve et al. [40] unified probability and uncertainty thresholds to select the most accurate pseudo-labels. Luo et al. presented URPC, which leverages data pyramid consistency and uncertainty rectification within a single model, based on the consistency of outputs at different scales for the same input, achieving excellent performance in semi-supervised learning for segmentation [28]. Building upon these previous attempts, we combine pseudo-labels and pyramid consistency with the mean teacher framework to further improve semi-supervised segmentation of medical images.

3 Methods↩︎

To provide a more comprehensive understanding of our proposed model, we begin by introducing the symbols utilized in our approach to address the problem of semi-supervised segmentation. Symbol definitions are as follows:
\(X_{l}\): the labeled data, where each sample in \(X_{l}\) is associated with its corresponding Ground Truth, denoted as \(GT\).
\(X_{u}\): the unlabeled data, which consists of samples in the dataset that do not have corresponding \(GT\).
\(X_{mix}\): the data mixed from \(X_{u}\) and \(X_{l}\) through Mix Module during the training process.
\(f_{\theta_{s}}(x)\): the probability map generated for input \(x\), where \(\theta_{s}\) denotes the parameters of the student network; analogously, \(f_{\theta_{t}}(x)\) denotes the probability map generated by the teacher network.

Figure 2: Pseudocode of the training procedure of the proposed PLGDF framework (Algorithm 2).

3.1 Overall architecture design↩︎

Figure 1 illustrates the overall architecture of the PLGDF framework proposed in our study. In our methodology, we adopt the framework of the mean teacher model, where the teacher network implements parameter updates from the student network through an exponential moving average (EMA). In Algorithm 2, we present the pseudocode of the training procedure of the proposed framework. To begin, we apply two random noise augmentations to \(X_{u}\). The teacher network processes the augmented data, and the averaged results are subsequently binarized to generate the pseudo-labels corresponding to \(X_{u}\). Next, we utilize the Mix Module to augment \(X_{u}\) with \(X_{l}\), resulting in \(X_{mix}\), which is displayed as \(Mixed\) \(img\) in Figure 1. Finally, we concatenate \(X_{u}\), \(X_{l}\), and \(X_{mix}\), which are then processed by the student network to generate the corresponding predictions: \(f_{\theta_{s}}(X_{u})\), \(f_{\theta_{s}}(X_{l})\), and \(f_{\theta_{s}}(X_{mix})\). We then further refine \(f_{\theta_{s}}(X_{u})\) by applying a sharpening process to obtain soft pseudo-labels.

To facilitate the training of the model, we evaluate \(L_{sup}\) based on \(f_{\theta_{s}}(X_{l})\) and \(GT\), \(L_{semi}\) based on \(f_{\theta_{s}}(X_{mix})\) and the pseudo-labels, and \(L_{sharp}\) based on \(f_{\theta_{s}}(X_{u})\) and the soft pseudo-labels. Additionally, we introduce multi-scale outputs for the backbone model and incorporate a multi-scale consistency evaluation module to assess the consistency among outputs from different scales, denoted as \(L_{consis}\). In the subsequent subsections, we provide a detailed explanation of each module.
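As a minimal illustration of this procedure, the sketch below (in PyTorch) shows the pseudo-label generation and the EMA update; the noise magnitude, the EMA decay, and the assumption that the networks output a single-channel foreground probability are illustrative choices, not the exact settings of the released code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def make_pseudo_labels(teacher: nn.Module, x_u: torch.Tensor,
                       noise_scale: float = 0.1, threshold: float = 0.5) -> torch.Tensor:
    """Perturb X_u with two random noises, average the teacher's probability maps,
    and binarize the average into pseudo-labels."""
    probs = []
    for _ in range(2):
        noise = torch.clamp(torch.randn_like(x_u) * noise_scale, -0.2, 0.2)
        probs.append(torch.sigmoid(teacher(x_u + noise)))
    return ((probs[0] + probs[1]) / 2.0 > threshold).float()

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.99) -> None:
    """Teacher parameters track the student through an exponential moving average."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.data.mul_(decay).add_(s_p.data, alpha=1.0 - decay)
```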

3.2 Mix Module↩︎

To enhance the model’s effectiveness and improve its robustness with limited data, we introduce a data augmentation technique inspired by the widely employed Mix-Up method in vision tasks [41]. We perform a data mixing process by combining \(X_{l}\) and \(X_{u}\), resulting in a more diverse training dataset. During the data mixing process, we randomly select two sets of samples and linearly interpolate their features by a certain proportion, generating new samples with blended characteristics for model training. For a pair of samples \(X_{u_{1}}\) and \(X_{l_{1}}\), this process can be mathematically represented as follows: \[\begin{align} \lambda = Random(Beta(\alpha ,\alpha )),\\ \lambda' = max(\lambda,1-\lambda),\\ X_{u_{1}}' = \lambda'X_{u_{1}} + (1 - \lambda')X_{l_{1}} \end{align}\]

Figure 3: Renderings of the Mix-Up operation.

where \(\alpha\) is the hyperparameter of the Beta distribution from which \(\lambda\) is randomly sampled. The effect of image mixing is shown in Figure 3. The mix is predominantly based on an unlabeled image, with a labeled image providing additional mixing. After the blending operation, certain changes occur in the information of the unlabeled image; the effect is most pronounced within the red rectangular box. However, these changes do not significantly alter the overall semantic information of the image. Therefore, we use the pseudo-label corresponding to the unlabeled image as the label for the mixed data, as sketched below.
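A minimal sketch of the Mix Module, assuming one unlabeled patch is blended with one labeled patch per call; the function name mix_module and the default \(\alpha\) are illustrative.

```python
import numpy as np
import torch

def mix_module(x_u: torch.Tensor, x_l: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Blend an unlabeled patch with a labeled one; because lambda' = max(lambda, 1 - lambda)
    is at least 0.5, the mixed patch stays dominated by the unlabeled image."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return lam * x_u + (1.0 - lam) * x_l
```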

3.3 Pseudo label sharpening↩︎

Considering the efficacy of consistency in utilizing unlabeled data, we employ a sharpening function [42] to transform \(f_{\theta_{s}}(X_{u})\), the prediction obtained from \(X_{u}\) through the student network, into soft pseudo-labels. The formula of the sharpening function is as follows: \[f^{*}_{\theta_{s}}(X_{u}) = \frac{f_{\theta_{s}}(X_{u})^{\frac{1}{T}}}{f_{\theta_{s}}(X_{u})^{\frac{1}{T}} + (1-f_{\theta_{s}}(X_{u}))^{\frac{1}{T}}}\] where \(T\) is a hyperparameter used to control the sharpening temperature. Figure 4 illustrates that the sharpening operation enhances the clarity and accuracy of segmentation boundaries, reducing blurriness and ambiguous edges, especially in the region indicated by the red arrow. This improvement enables a more effective capture of target boundaries and subtle structures. Subsequently, we compute the consistency loss based on \(f_{\theta_{s}}(X_{u})\) and \(f^{*}_{\theta_{s}}(X_{u})\). Under the supervision of the soft pseudo-labels, the model is encouraged to produce low-entropy predictions, which amounts to entropy minimization. We denote this loss as \(L_{sharp}\), with its evaluation formula expressed as follows: \[L_{sharp} = \frac{1}{N}\sum_{i=1}^{N} ||f_{\theta_{s}}(X_{u})^{i} - f^{*}_{\theta_{s}}(X_{u})^{i}||^{2}\] where \(N\) is the total number of pixels.

Figure 4: Demonstration of the sharpening operation. Pre denotes the prediction for the image, and S-Pre denotes the result obtained after applying the sharpening operation to Pre.
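A minimal sketch of the sharpening function and \(L_{sharp}\), assuming the prediction is a foreground probability map; treating the sharpened map as a detached target is an implementation assumption.

```python
import torch

def sharpen(p: torch.Tensor, T: float = 0.1) -> torch.Tensor:
    """p^(1/T) / (p^(1/T) + (1 - p)^(1/T)): pushes probabilities toward 0 or 1."""
    p_t = p ** (1.0 / T)
    return p_t / (p_t + (1.0 - p) ** (1.0 / T))

def sharp_loss(p_u: torch.Tensor, T: float = 0.1) -> torch.Tensor:
    """L_sharp: mean squared difference between the prediction and its soft pseudo-label."""
    return torch.mean((p_u - sharpen(p_u, T).detach()) ** 2)
```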

3.4 Consistency Across Multi-Scale↩︎

In the decoder part of the backbone network, convolution followed by upsampling operations is commonly utilized. Therefore, during the training phase, we can utilize the intermediate features at different scales within the decoder module and standardize their sizes by applying 2\(\times\), 4\(\times\), and 8\(\times\) upsampling along with convolutional operations. Our objective is to enforce the consistency among outputs of different scales. First, we calculate the average value \(\widehat{P}\) based on the multi-scale \(P_{1}, P_{2}, P_{3}\), and \(P_{4}\) as shown in Figure 1. Subsequently, we assess the consistency between each scale and the average value \(\widehat{P}\) using \(L_{consis}\). The evaluation formula is presented in equation (6), (7), and (8). \[\widehat{P} =\frac{1}{4} \sum_{s=1}^{4}P_{s}\] \[D_{s}^{i} = \sum_{j=0}^{C-1}P_{s}^{ i,j } \cdot \log\frac{P_{s}^{i,j}}{\widehat{P}^{i,j }}\] \[L_{consis} =\frac{1}{n} \sum_{s=1}^{n} \frac{ {\textstyle \sum_{i=1}^{N}} {||P_{s}^{i} -\widehat{P} ^{i}||_{2}\cdot e^{-D_{s}^{i}} } }{\sum_{i=1}^{N} e^{-D_{s}^{i}}} + \sum_{i=1}^{N}D_{s} ^{i}\] where \(C\) means the class number for the segmentation task, the \(n\) represents the number of multi-scale outputs, and the \(N\) is the total number of pixels.
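The sketch below is one possible reading of Eqs. (6)-(8), assuming the per-scale probability maps \(P_{1}, \dots, P_{n}\) have already been upsampled to a common size; averaging the per-voxel KL term instead of summing it is an implementation assumption made to keep the two terms on a comparable scale.

```python
import torch

def consistency_loss(preds, eps: float = 1e-8) -> torch.Tensor:
    """Uncertainty-weighted consistency between each scale and the mean prediction.

    `preds` is a list of tensors of shape (B, C, D, H, W) holding class probabilities.
    """
    p_hat = torch.stack(preds, dim=0).mean(dim=0)                     # Eq. (6)
    loss = 0.0
    for p_s in preds:
        # Per-voxel KL divergence D_s between this scale and the average, Eq. (7).
        d_s = torch.sum(p_s * torch.log((p_s + eps) / (p_hat + eps)), dim=1)
        w = torch.exp(-d_s)                                           # small weight where uncertain
        dist = torch.norm(p_s - p_hat, p=2, dim=1)                    # voxel-wise L2 distance
        loss = loss + (dist * w).sum() / (w.sum() + eps) + d_s.mean()
    return loss / len(preds)
```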

3.5 Loss Function↩︎

The proposed PLGDF framework aims to learn from both labeled and unlabeled data by minimizing the following composite objective function: \[L_{total} = L_{sup}+L_{semi}+ \lambda \cdot ( L_{sharp} + L_{consis})\] The expressions for \(L_{sharp}\) and \(L_{consis}\) are already given in equations (5) and (8), respectively. For \(L_{sup}\) and \(L_{semi}\), we adopt the combination of Cross-Entropy Loss and Dice Loss [31], which is commonly used in medical image segmentation. The formula for Dice Loss is presented below: \[Dice \;Loss=1-\frac{2 * \sum_{i=1}^{N} p_{i} *g_{i} }{\sum_{i=1}^{N}p_{i} ^{2} + \sum_{i=1}^{N}g_{i} ^{2} }\] where \(p_{i}\) is the value of pixel \(i\) predicted by the model, and \(g_{i}\) is the value of pixel \(i\) of the ground truth. We introduce \(\lambda(t)\), a widely used time-dependent Gaussian warm-up function [43], to control the proportion between supervised and unsupervised losses at different training stages. Its specific formula is defined as follows: \[\lambda(t) = w\cdot e^{(-5(1-\frac{t}{t_{max} } )^{2} )}\] where \(w\) represents the final regulation weight, \(t\) represents the current training step, and \(t_{max}\) denotes the maximum training step.
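A minimal sketch of the Dice loss, the Dice + cross-entropy combination used for \(L_{sup}\) and \(L_{semi}\), and the Gaussian warm-up weight; the binary-probability formulation and the default value of \(w\) are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Soft Dice loss: 1 - 2*sum(p*g) / (sum(p^2) + sum(g^2))."""
    inter = torch.sum(pred * target)
    denom = torch.sum(pred ** 2) + torch.sum(target ** 2)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def seg_loss(prob: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy + Dice on a foreground probability map, used for L_sup and L_semi."""
    return F.binary_cross_entropy(prob, target) + dice_loss(prob, target)

def ramp_up_weight(step: int, max_step: int, w: float = 0.1) -> float:
    """lambda(t) = w * exp(-5 * (1 - t / t_max)^2), the Gaussian warm-up schedule."""
    return w * math.exp(-5.0 * (1.0 - step / max_step) ** 2)

# Total objective for one iteration (schematic):
# L_total = seg_loss(p_l, gt) + seg_loss(p_mix, pseudo) + lam * (sharp_loss(p_u) + consistency_loss(preds))
```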

4 Experiments and Results↩︎

4.1 Dataset↩︎

In this paper, we evaluated the proposed PLGDF method and compared it with six previous works on three publicly available datasets: the Pancreas-CT dataset, the LA dataset and the BraTS2019 dataset. All of the datasets are associated with 3D segmentation tasks.

4.1.1 Pancreas-CT↩︎

The Pancreas-CT dataset [44] consists of 82 3D abdominal contrast-enhanced CT scans, acquired from a cohort of 53 male and 27 female subjects. The CT scans were obtained with resolutions of 512\(\times\)512 pixels and varying pixel sizes. The slice thickness ranged from 1.5 to 2.5 mm. For this study, we randomly selected 62 images for training and 20 images for testing, following a standard data splitting protocol commonly used in similar studies [24]. To ensure consistency and comparability of voxel values, we applied a clipping operation, limiting the values to the range of -125 to 275 Hounsfield Units (HU) [45]. Additionally, we performed data resampling to achieve an isotropic resolution of 1.0\(\times\)1.0\(\times\)1.0 mm.
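A minimal sketch of this preprocessing, assuming SciPy is used for resampling; the interpolation order is an illustrative choice.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(volume: np.ndarray, spacing: tuple, target: float = 1.0) -> np.ndarray:
    """Clip intensities to [-125, 275] HU and resample to `target` mm isotropic spacing."""
    clipped = np.clip(volume, -125, 275)
    factors = [s / target for s in spacing]   # per-axis zoom factor from original voxel size
    return zoom(clipped, factors, order=3)    # cubic interpolation
```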

4.1.2 LA↩︎

The LA dataset [46], which serves as the benchmark dataset for the 2018 Atrial Segmentation Challenge, comprises 100 gadolinium-enhanced MR imaging scans for training, with a resolution of 0.625\(\times\)​0.625\(\times\)​0.625 mm. As the testing set of LA lacks publicly available annotations, we allocated 80 samples for training and reserved the remaining 20 samples for validation following [24]. Subsequently, we evaluated the performance of our model and other methods on the same validation set to ensure fair comparisons.

4.1.3 BraTS2019↩︎

The publicly available BraTS2019 dataset[47] comprises scans obtained from 335 patients diagnosed with glioma. This dataset encompasses T1, T2, T1 contrast-enhanced, and FLAIR sequences, along with corresponding tumor segmentations annotated by expert radiologists. In this study, we focused on using the FLAIR modality for segmentation on the dataset. We conducted a random split, allocating 250 scans for training, 25 scans for validation, and 60 scans for testing following [28].

Table 1: Quantitative comparison with six state-of-the-art methods on the Pancreas-CT dataset.
| Method | Labeled scans | Unlabeled scans | Dice(%)\(\uparrow\) | Jaccard(%)\(\uparrow\) | 95HD(voxel)\(\downarrow\) | ASD(voxel)\(\downarrow\) |
|---|---|---|---|---|---|---|
| V-Net | 3(5%) | 0 | 29.32 | 19.61 | 43.67 | 15.42 |
| V-Net | 6(10%) | 0 | 54.94 | 40.87 | 47.48 | 17.43 |
| V-Net | 12(20%) | 0 | 71.52 | 57.68 | 18.12 | 5.41 |
| V-Net | 62(100%) | 0 | 83.76 | 72.48 | 4.46 | 1.07 |
| UA-MT (MICCAI'19) | 3(5%) | 59(95%) | 43.15 | 29.07 | 51.96 | 20.00 |
| SASSNet (MICCAI'20) | 3(5%) | 59(95%) | 41.48 | 27.98 | 47.48 | 18.36 |
| DTC (AAAI'21) | 3(5%) | 59(95%) | 47.57 | 33.41 | 44.17 | 15.31 |
| URPC (MIA'22) | 3(5%) | 59(95%) | 45.94 | 34.14 | 48.80 | 23.03 |
| SS-Net (MICCAI'22) | 3(5%) | 59(95%) | 41.39 | 27.65 | 52.12 | 19.37 |
| MC-Net+ (MIA'22) | 3(5%) | 59(95%) | 32.45 | 21.22 | 58.57 | 24.84 |
| Ours | 3(5%) | 59(95%) | 74.69 | 60.00 | 8.19 | 1.74 |
| UA-MT (MICCAI'19) | 6(10%) | 56(90%) | 66.44 | 52.02 | 17.04 | 3.03 |
| SASSNet (MICCAI'20) | 6(10%) | 56(90%) | 68.97 | 54.29 | 18.83 | 1.96 |
| DTC (AAAI'21) | 6(10%) | 56(90%) | 66.58 | 51.79 | 15.46 | 4.16 |
| URPC (MIA'22) | 6(10%) | 56(90%) | 73.53 | 59.44 | 22.57 | 7.85 |
| SS-Net (MICCAI'22) | 6(10%) | 56(90%) | 73.44 | 58.82 | 12.56 | 2.91 |
| MC-Net+ (MIA'22) | 6(10%) | 56(90%) | 70.00 | 55.66 | 16.03 | 3.87 |
| Ours | 6(10%) | 56(90%) | 80.90 | 68.40 | 6.02 | 1.59 |
| UA-MT (MICCAI'19) | 12(20%) | 50(80%) | 76.10 | 62.62 | 10.84 | 2.43 |
| SASSNet (MICCAI'20) | 12(20%) | 50(80%) | 76.39 | 63.17 | 11.06 | 1.42 |
| DTC (AAAI'21) | 12(20%) | 50(80%) | 78.27 | 64.75 | 8.36 | 2.25 |
| URPC (MIA'22) | 12(20%) | 50(80%) | 80.02 | 67.30 | 8.54 | 1.98 |
| SS-Net (MICCAI'22) | 12(20%) | 50(80%) | 78.68 | 65.96 | 9.74 | 1.91 |
| MC-Net+ (MIA'22) | 12(20%) | 50(80%) | 79.37 | 66.83 | 8.52 | 1.72 |
| Ours | 12(20%) | 50(80%) | 82.76 | 70.89 | 4.85 | 1.33 |

4.2 Implementation Details↩︎

During the training process, we randomly extracted 3D patches from the preprocessed data. For the LA dataset, the patch size was set to 112 \(\times\) 112 \(\times\) 80, while for the Pancreas-CT and BraTS2019 datasets, the patch size was 96 \(\times\) 96 \(\times\) 96. For all three datasets, we set the batch size to 4, where each batch consisted of two labeled patches and two unlabeled patches. The backbone network employed in our study is the V-Net [31]. Additionally, we modified the network to generate multi-scale outputs, and the number of scales \(n\) used for evaluating the multi-scale consistency is set to 4. We trained our PLGDF model for 15\(k\) iterations on the Pancreas-CT and LA datasets and 30\(k\) iterations on the BraTS2019 dataset, following the methodology described in [24] and [28].

During the testing phase, we employed a sliding window approach with a fixed stride to extract patches. Specifically, on the LA dataset, we utilized a sliding window of size 112 \(\times\) 112 \(\times\) 80 with a stride of 18 \(\times\) 18 \(\times\) 4. On the Pancreas-CT and BraTS2019 datasets, we used a sliding window with a size of 96 \(\times\) 96 \(\times\) 96 and a stride of 16 \(\times\) 16 \(\times\) 16. Subsequently, we reconstructed the patch-based predictions to obtain the final results for the entire volume.
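A minimal sketch of this sliding-window inference, assuming the model accepts a single-channel volume at least as large as the patch in every axis and that overlapping predictions are averaged; these details are assumptions rather than the exact released implementation.

```python
import torch

@torch.no_grad()
def sliding_window_predict(model, volume, patch=(96, 96, 96), stride=(16, 16, 16), num_classes=2):
    """volume: (D, H, W) tensor; returns averaged per-class probabilities of shape (C, D, H, W)."""
    def starts(dim, p, s):
        pts = list(range(0, dim - p + 1, s))
        if pts[-1] != dim - p:
            pts.append(dim - p)          # make the last window touch the border
        return pts

    D, H, W = volume.shape
    out = torch.zeros((num_classes, D, H, W))
    cnt = torch.zeros((1, D, H, W))
    for z in starts(D, patch[0], stride[0]):
        for y in starts(H, patch[1], stride[1]):
            for x in starts(W, patch[2], stride[2]):
                crop = volume[z:z + patch[0], y:y + patch[1], x:x + patch[2]]
                prob = torch.softmax(model(crop[None, None]), dim=1)[0]
                out[:, z:z + patch[0], y:y + patch[1], x:x + patch[2]] += prob
                cnt[:, z:z + patch[0], y:y + patch[1], x:x + patch[2]] += 1
    return out / cnt                      # argmax over the class axis gives the final mask
```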

In our training process, we employed the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4. The learning rate was set to 1e-2 and the hyperparameter \(T\) was set to 1e-1. In this study, we trained the network using 10\(\%\) and 20\(\%\) of the data on three representative semi-supervised datasets, following the data partitioning methods described in [21], [48], [49]. Our framework was implemented in PyTorch 1.12.0, utilizing an Nvidia RTX 3090 GPU with 24GB of memory. For quantitative evaluation, we employed four metrics: Dice, Jaccard, the average surface distance (ASD), and the 95\(\%\) Hausdorff Distance (95HD). During the training phase, as our model incorporates multi-scale outputs, we utilized the four-scale output within the student model and exclusively employed the highest-scale output within the teacher model. Similarly, during the inference phase of the network, we solely utilized the output from the highest scale, denoted as \(P_{1}\) in Figure 1. Consequently, at inference time our backbone network is equivalent to the V-Net. The \(P_{2}\), \(P_{3}\), and \(P_{4}\) outputs were exclusively utilized within the student model during the training process.
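One common way to compute these four metrics is through the medpy package; the sketch below assumes binary NumPy masks of identical shape and an isotropic voxel spacing.

```python
from medpy.metric import binary

def evaluate(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """pred, gt: binary numpy volumes; distances are reported in voxels for unit spacing."""
    return {
        "Dice":    binary.dc(pred, gt) * 100,
        "Jaccard": binary.jc(pred, gt) * 100,
        "95HD":    binary.hd95(pred, gt, voxelspacing=spacing),
        "ASD":     binary.asd(pred, gt, voxelspacing=spacing),
    }
```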

4.3 Comparison with Other Semi-supervised Methods↩︎

We compared our proposed framework with six state-of-the-art semi-supervised segmentation methods, including UA-MT [38], the Shape-aware Adversarial Network (SASSNet) [48], the Dual-task Consistency framework (DTC) [49], Uncertainty Rectified Pyramid Consistency (URPC) [28], SS-Net [50], and the Mutual Consistency Network (MC-Net+) [24]. Note that we utilized the official codes and results of UA-MT, SASSNet, URPC, DTC, SS-Net, and MC-Net+, along with their publicly available data preprocessing schemes. We used the results reported in MC-Net+ as our benchmark.

4.3.1 Results on the Pancreas-CT dataset↩︎

Figure 5: 2D and 3D Visualization with other methods on the Pancreas-CT dataset under 5\% labeled data and 10\% labeled data. The red lines denote the boundary of ground truth and the green lines denote the boundary of predictions.
Table 2: Quantitative comparison with six state-of-the-art methods on the LA dataset.
| Method | Labeled scans | Unlabeled scans | Dice(%)\(\uparrow\) | Jaccard(%)\(\uparrow\) | 95HD(voxel)\(\downarrow\) | ASD(voxel)\(\downarrow\) |
|---|---|---|---|---|---|---|
| V-Net | 4(5%) | 0 | 52.55 | 39.60 | 47.05 | 9.87 |
| V-Net | 8(10%) | 0 | 78.57 | 66.96 | 21.20 | 6.07 |
| V-Net | 16(20%) | 0 | 86.96 | 77.31 | 11.85 | 3.22 |
| V-Net | 80(100%) | 0 | 91.62 | 84.60 | 5.40 | 1.64 |
| UA-MT (MICCAI'19) | 4(5%) | 76(95%) | 82.26 | 70.98 | 13.71 | 3.82 |
| SASSNet (MICCAI'20) | 4(5%) | 76(95%) | 81.60 | 69.63 | 16.60 | 3.58 |
| DTC (AAAI'21) | 4(5%) | 76(95%) | 81.25 | 69.33 | 14.90 | 3.99 |
| URPC (MIA'22) | 4(5%) | 76(95%) | 82.48 | 71.35 | 14.65 | 3.65 |
| SS-Net (MICCAI'22) | 4(5%) | 76(95%) | 86.33 | 76.15 | 9.97 | 2.31 |
| MC-Net+ (MIA'22) | 4(5%) | 76(95%) | 82.07 | 70.38 | 20.49 | 5.72 |
| Ours | 4(5%) | 76(95%) | 89.22 | 80.62 | 7.90 | 1.71 |
| UA-MT (MICCAI'19) | 8(10%) | 72(90%) | 86.28 | 76.11 | 18.71 | 4.63 |
| SASSNet (MICCAI'20) | 8(10%) | 72(90%) | 86.81 | 76.92 | 12.54 | 2.55 |
| DTC (AAAI'21) | 8(10%) | 72(90%) | 87.51 | 78.17 | 8.23 | 2.36 |
| URPC (MIA'22) | 8(10%) | 72(90%) | 85.01 | 74.36 | 15.37 | 3.96 |
| SS-Net (MICCAI'22) | 8(10%) | 72(90%) | 88.43 | 79.43 | 7.95 | 2.55 |
| MC-Net+ (MIA'22) | 8(10%) | 72(90%) | 88.96 | 80.25 | 7.93 | 1.86 |
| Ours | 8(10%) | 72(90%) | 89.80 | 81.58 | 7.14 | 1.74 |
| UA-MT (MICCAI'19) | 16(20%) | 64(80%) | 88.74 | 79.94 | 8.39 | 2.32 |
| SASSNet (MICCAI'20) | 16(20%) | 64(80%) | 89.27 | 80.82 | 8.83 | 3.13 |
| DTC (AAAI'21) | 16(20%) | 64(80%) | 89.42 | 80.98 | 7.32 | 2.10 |
| URPC (MIA'22) | 16(20%) | 64(80%) | 88.74 | 79.93 | 12.73 | 3.66 |
| SS-Net (MICCAI'22) | 16(20%) | 64(80%) | 89.86 | 81.70 | 7.01 | 1.87 |
| MC-Net+ (MIA'22) | 16(20%) | 64(80%) | 91.07 | 83.67 | 5.84 | 1.67 |
| Ours | 16(20%) | 64(80%) | 91.34 | 83.94 | 5.66 | 1.76 |
Figure 6: 2D and 3D Visualization with other methods on the LA dataset under 5\% labeled data and 10\% labeled data. The red lines denote the boundary of ground truth and the green lines denote the boundary of predictions.

Table 1 shows the quantitative comparison of our model and six semi-supervised methods on the Pancreas-CT dataset, along with the results of the V-Net model trained with 5\(\%\), 10\(\%\), 20\(\%\), and 100\(\%\) labeled data for supervised learning. The experimental results indicate significant improvements of our proposed method over the six compared state-of-the-art (SOTA) models across four evaluation metrics: Dice, Jaccard, 95HD, and ASD. It is evident from the table that our result stands out prominently, particularly when only 5\(\%\) or 10\(\%\) of the data is labeled. With 5\(\%\) labeled data, we achieved a Dice score of 74.69\(\%\), along with a 95HD of 8.19 and an ASD of 1.74. These metrics not only surpass the accuracy obtained by the six compared SOTA models based on equivalent data annotation but also outperform the accuracy of these models at 10\(\%\) data annotation. Similarly, at 10\(\%\) data annotation, the accuracy of our method exceeds that of the compared models trained with 20\(\%\) labeled data.

Figure 5 displays the visual results of image segmentation on the Pancreas-CT dataset when trained with 5\(\%\) and 10\(\%\) labeled data. The visualizations are presented in both 2D and 3D perspectives. It is apparent that when utilizing only 5\(\%\) labeled data, the other compared models fail to segment even an approximate pancreas, while our model achieves segmentation results relatively closer to the Ground Truth (GT). Similarly, at 10\(\%\) data annotation, our algorithm’s predictions are more accurate. In the 2D visual representation, the compared methods exhibit a higher rate of false negatives, whereas our model achieves more precise identification. This further substantiates the effectiveness of the proposed model.

Table 3: Quantitative comparison with six state-of-the-art methods on the BraTS2019 dataset.
| Method | Labeled scans | Unlabeled scans | Dice(%)\(\uparrow\) | Jaccard(%)\(\uparrow\) | 95HD(voxel)\(\downarrow\) | ASD(voxel)\(\downarrow\) |
|---|---|---|---|---|---|---|
| V-Net | 12(5%) | 0 | 74.28 | 64.42 | 13.44 | 2.60 |
| V-Net | 25(10%) | 0 | 78.67 | 68.75 | 10.44 | 2.23 |
| V-Net | 50(20%) | 0 | 80.59 | 71.13 | 8.95 | 2.03 |
| V-Net | 250(All) | 0 | 88.58 | 80.34 | 6.19 | 1.36 |
| UA-MT (MICCAI'19) | 12(5%) | 238(95%) | 80.31 | 70.43 | 10.65 | 2.12 |
| SASSNet (MICCAI'20) | 12(5%) | 238(95%) | 76.17 | 66.43 | 13.09 | 3.32 |
| DTC (AAAI'21) | 12(5%) | 238(95%) | 74.21 | 64.89 | 13.54 | 3.16 |
| URPC (MIA'22) | 12(5%) | 238(95%) | 78.74 | 68.20 | 17.43 | 4.51 |
| SS-Net (MICCAI'22) | 12(5%) | 238(95%) | 78.03 | 68.11 | 13.70 | 2.76 |
| MC-Net+ (MIA'22) | 12(5%) | 238(95%) | 78.69 | 68.38 | 16.44 | 4.49 |
| Ours | 12(5%) | 238(95%) | 84.96 | 75.15 | 10.28 | 2.53 |
| UA-MT (MICCAI'19) | 25(10%) | 225(90%) | 80.93 | 71.31 | 17.71 | 5.43 |
| SASSNet (MICCAI'20) | 25(10%) | 225(90%) | 79.19 | 68.80 | 16.36 | 6.67 |
| DTC (AAAI'21) | 25(10%) | 225(90%) | 82.74 | 72.74 | 11.76 | 3.24 |
| URPC (MIA'22) | 25(10%) | 225(90%) | 84.16 | 74.29 | 11.01 | 2.63 |
| SS-Net (MICCAI'22) | 25(10%) | 225(90%) | 82.00 | 71.82 | 10.68 | 1.82 |
| MC-Net+ (MIA'22) | 25(10%) | 225(90%) | 79.63 | 70.10 | 12.28 | 2.45 |
| Ours | 25(10%) | 225(90%) | 85.47 | 75.97 | 9.57 | 2.07 |
| UA-MT (MICCAI'19) | 50(20%) | 200(80%) | 85.05 | 74.51 | 12.31 | 3.03 |
| SASSNet (MICCAI'20) | 50(20%) | 200(80%) | 82.34 | 72.72 | 12.45 | 3.24 |
| DTC (AAAI'21) | 50(20%) | 200(80%) | 83.47 | 72.93 | 14.48 | 3.59 |
| URPC (MIA'22) | 50(20%) | 200(80%) | 85.49 | 74.86 | 8.14 | 2.04 |
| SS-Net (MICCAI'22) | 50(20%) | 200(80%) | 83.07 | 73.48 | 10.08 | 1.63 |
| MC-Net+ (MIA'22) | 50(20%) | 200(80%) | 82.87 | 73.61 | 8.94 | 1.92 |
| Ours | 50(20%) | 200(80%) | 86.31 | 77.34 | 7.45 | 1.36 |
Figure 7: 2D and 3D Visualization with other methods on the BraTS2019 dataset under 5\% labeled data and 10\% labeled data. The red lines denote the boundary of ground truth and the green lines denote the boundary of predictions.
Table 4: Ablation study about the combination of strategies on the Pancreas-CT dataset. Each Loss corresponds to a module and the Mix M. represents the Mix Module.
Designs Scans used Metrics
\(L_{sup}\) \(L_{semi}\) Mix M. \(L_{consis}\) \(L_{sharp}\) Labeled Unlabeled Dice(%)\(\uparrow\) Jaccard(%)\(\uparrow\) 95HD(voxel)\(\downarrow\) ASD(voxel)\(\downarrow\)
3(5%) 59(95%) 29.32 19.61 43.67 15.42
48.09 34.59 57.57 25.64
74.01 59.52 26.58 8.83
74.18 59.51 8.39 2.05
76.20 62.16 12.14 3.49
74.69 60.00 8.19 1.74
6(10%) 56(90%) 54.94 40.87 47.48 17.43
76.40 63.06 16.55 4.75
80.49 67.84 6.22 1.77
80.47 68.81 6.16 1.87
80.22 67.51 5.92 2.13
80.9 68.4 6.02 1.59
12(20%) 50(80%) 71.52 57.68 18.12 5.41
80.26 67.45 15.99 3.99
81.96 69.77 5.18 1.43
82.51 70.69 6.87 2.11
81.61 69.35 7.50 2.02
82.76 70.89 4.85 1.33

4.3.2 Results on the LA dataset↩︎

Table 2 presents the quantitative experimental results on the LA dataset. Compared to the six SOTA methods, our model achieved the best Dice scores across 5\(\%\), 10\(\%\), and 20\(\%\) data annotations. With 20\(\%\) labeled data, our model obtained a Dice score of 91.34\(\%\), closely approaching the Dice score of 91.62\(\%\) achieved through supervised learning using 100\(\%\) labeled data. It is worth emphasizing that with 5\(\%\) labeled data, we achieved a Dice score of 88.72\(\%\), surpassing the accuracy of the other comparative algorithms, except for MC-Net+ trained with 10\(\%\) labeled data. Similarly, at 10\(\%\) data annotation, we achieved accuracies comparable to those obtained by the other algorithms with 20\(\%\) data annotation, again with the exception of MC-Net+. Simultaneously, we achieved the lowest 95HD and an outstanding ASD.

Figure 6 illustrates the visualization results of image segmentation on the LA dataset when trained with 5\(\%\) and 10\(\%\) labeled data. It is evident that our model generates a more complete and accurate left atrium than the compared models. Especially at 5\(\%\) data annotation, our model naturally eliminates most of the isolated regions and preserves more fine details, whereas other models tend to produce false-positive noise predictions. In the 2D view, our prediction also lies closer to the GT. This visual representation intuitively demonstrates the effectiveness of our method.

4.3.3 Results on the BraTS2019 dataset↩︎

Table 3 presents the quantitative experimental results on the BraTS2019 dataset. In comparison to the six SOTA models, our model achieved superior Dice and 95HD scores by efficiently leveraging unlabeled data across 5\(\%\), 10\(\%\), and 20\(\%\) labeled data. In particular, using 5\(\%\) labeled data, we achieved a Dice score of 84.96\(\%\), a noteworthy improvement over the best-scoring counterpart at 80.31\(\%\) Dice, and surpassing even the metrics achieved by the compared models with 10\(\%\) labeled data.

Figure 7 illustrates the visual segmentation results on the BraTS2019 dataset. Compared to the six SOTA models, our model produces a more complete 3D volumetric segmentation, approaching the delineation of the GT. In the 2D view, our model's prediction also lies closer to the GT. In contrast, the compared models exhibit more false negatives on the somewhat ambiguous lesion regions. This demonstrates the superiority of our model, showcasing its capability to identify challenging lesions with a small amount of labeled data.

Figure 8: Illustrations of corresponding sharpening functions (left) and dice score (right) with different sharpening temperatures T on the Pancreas-CT dataset.

5 Ablation study↩︎

In this study, we performed ablation studies on the Pancreas-CT dataset to analyse the contribution of each component.

5.1 Effect of the Combination of Module.↩︎

In our proposed framework, besides the \(L_{sup}\) loss used for computing the labeled-data loss, we also incorporate the Mix Module and three additional losses: \(L_{semi}\), \(L_{sharp}\), and \(L_{consis}\), each corresponding to one of the modules we combine. To validate the effectiveness of these modules, we performed ablation experiments, and Table 4 presents the results obtained using different strategies. When using only \(L_{sup}\), which corresponds to utilizing a small amount of labeled data for fully supervised learning, the model’s accuracy was relatively low. However, after incorporating \(L_{semi}\), the model’s accuracy significantly improved. Moreover, as these strategies were integrated, the results improved steadily. This indicates that the strategies adopted in this paper are effective and contribute to enhancing the robustness of the model. It is worth noting that even with the combination of only \(L_{sup}\) and \(L_{semi}\), the model already achieves better Dice and Jaccard scores than the SOTA models presented in Table 1. Although our model incorporates \(L_{consis}\), the strategy proposed in URPC [28], the table shows that our model also performs well without \(L_{consis}\).

5.2 Effect of the temperature \(T\).↩︎

To enhance the precision of segmentation boundaries, this study introduces a sharpening function designed to enforce entropy minimization constraints. We conducted ablation experiments on the Pancreas-CT dataset, employing 5\(\%\) and 10\(\%\) labeled data, to assess the impact of the temperature \(T\) on the model’s performance. The left side of Figure 8 delineates the application of the sharpening function to predictions for varying temperatures \(T\), whereas the right side exhibits the Dice scores of the PLGDF model trained at different \(T\) values on the Pancreas-CT dataset. The outcomes demonstrate that the Dice scores remain stable across different \(T\) values, underscoring the model’s robustness to temperature fluctuations. Elevated \(T\) values are observed to insufficiently enforce the entropy minimization constraint during the training phase, while diminished \(T\) values could amplify pseudo-label noise, culminating in inaccuracies. As a result, a sharpening function with a calibrated temperature of 0.1 was adopted to generate soft pseudo-labels consistently across all datasets.


6 Discussion and Conclusion↩︎

In this paper, we propose a novel semi-supervised medical image segmentation framework, named PLGDF. Building upon the mean-teacher network, we utilize the teacher model’s predictions as pseudo-labels for the unlabeled data to aid the training of the student network. Additionally, we mix the unlabeled data with labeled data to enhance dataset diversity. Furthermore, we apply a sharpening operation to the predictions of the unlabeled data to improve the clarity and accuracy of segmentation boundaries. By ensuring consistency between different scales in the decoding part of the backbone network, we further enhance the model’s stability. Comprehensive experiments were performed on three publicly available medical image datasets: the LA and BraTS2019 datasets from MR scans, and the Pancreas-CT dataset from CT scans. Remarkably, even with just 5\(\%\) labeled data, our method achieves results comparable to or even surpassing the performance of state-of-the-art methods that use 10\(\%\) labeled data. These results demonstrate the effectiveness, robustness, and general applicability of our proposed framework.

Furthermore, our proposed framework is straightforward and can be combined with other modules suitable for semi-supervised learning, such as Generative Adversarial Networks (GAN) [51]. While we apply sharpening operations to the predicted results, it is important to acknowledge that the predictions in certain regions might already be erroneous, and sharpening these areas could potentially exacerbate misclassifications. Although the current metrics suggest that sharpening brings more benefits than drawbacks, future work could focus on identifying correctly predicted regions and reducing interference from misjudged areas to further improve the model’s performance.

References↩︎

[1]
G. Wang, M. A. Zuluaga, W. Li, R. Pratt, P. A. Patel, M. Aertsen, T. Doel, A. L. David, J. Deprest, S. Ourselin, and T. Vercauteren, DeepIGeoS: A Deep Interactive Geodesic Framework for Medical Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1559–1572, 2019.
[2]
S. M. Anwar, M. Majid, A. Qayyum, M. Awais, M. Alnowami, and M. K. Khan, Medical Image Analysis using Convolutional Neural Networks: A Review,” Journal of Medical Systems, vol. 42, pp. 1–13, 2018.
[3]
C. Sjöberg, M. Lundmark, C. Granberg, S. Johansson, A. Ahnesjö, and A. Montelius, Clinical evaluation of multi-atlas based segmentation of lymph node regions in head and neck and prostate cancer patients,” Radiation Oncology, vol. 8, no. 1, pp. 1–7, 2013.
[4]
J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Networks for Semantic Segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[5]
O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
[6]
J. Dolz, K. Gopinath, J. Yuan, H. Lombaert, C. Desrosiers, and I. Ben Ayed, HyperDense-Net: A Hyper-Densely Connected CNN for Multi-Modal Image Segmentation,” IEEE Transactions on Medical Imaging, vol. 38, no. 5, pp. 1116–1126, 2019.
[7]
A. Sinha and J. Dolz, Multi-Scale Self-Guided Attention for Medical Image Segmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 1, pp. 121–130, 2021.
[8]
L. Xie, W. Cai, and Y. Gao, DMCGNet: A Novel Network for Medical Image Segmentation With Dense Self-Mimic and Channel Grouping Mechanism,” IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 10, pp. 5013–5024, 2022.
[9]
X. Lin, L. Yu, K.-T. Cheng, and Z. Yan, “Batformer: Towards boundary-aware lightweight transformer for efficient medical image segmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 7, pp. 3501–3512, 2023.
[10]
Z. Ning, S. Zhong, Q. Feng, W. Chen, and Y. Zhang, SMU-Net: Saliency-Guided Morphology-Aware U-Net for Breast Lesion Segmentation in Ultrasound Image,” IEEE Transactions on Medical Imaging, vol. 41, no. 2, pp. 476–490, 2022.
[11]
S. Pang, C. Pang, L. Zhao, Y. Chen, Z. Su, Y. Zhou, M. Huang, W. Yang, H. Lu, and Q. Feng, SpineParseNet: Spine Parsing for Volumetric MR Image by a Two-Stage Segmentation Framework With Semantic Image Representation,” IEEE Transactions on Medical Imaging, vol. 40, no. 1, pp. 262–273, 2021.
[12]
J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation,” arXiv preprint arXiv:2102.04306, 2021.
[13]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is All you Need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
[14]
M. D. Kohli, R. M. Summers, and J. R. Geis, Medical Image Data and Datasets in the Era of Machine Learning—Whitepaper from the 2016 C-MIMI Meeting Dataset Session,” Journal of Digital Imaging, vol. 30, pp. 392–399, 2017.
[15]
A. Tarvainen and H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[16]
Y. Zhao, K. Lu, J. Xue, S. Wang, and J. Lu, “Semi-supervised medical image segmentation with voxel stability and reliability constraints,” IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 8, pp. 3912–3923, 2023.
[17]
H. Zheng, S. M. Motch Perrine, M. K. Pitirri, K. Kawasaki, C. Wang, J. T. Richtsmeier, and D. Z. Chen, Cartilage Segmentation in High-Resolution 3D Micro-CT Images via Uncertainty-Guided Self-training with Very Sparse Annotation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part I 23. Springer, 2020, pp. 802–812.
[18]
Z. Zheng, X. Wang, X. Zhang, Y. Zhong, X. Yao, Y. Zhang, and Y. Wang, Semi-supervised Segmentation with Self-training Based on Quality Estimation and Refinement,” in International Workshop on Machine Learning in Medical Imaging. Springer, 2020, pp. 30–39.
[19]
L. Qiu, J. Cheng, H. Gao, W. Xiong, and H. Ren, Federated Semi-Supervised Learning for Medical Image Segmentation via Pseudo-Label Denoising,” IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 10, pp. 4672–4683, 2023.
[20]
W. Bai, O. Oktay, M. Sinclair, H. Suzuki, M. Rajchl, G. Tarroni, B. Glocker, A. King, P. M. Matthews, and D. Rueckert, Semi-supervised Learning for Network-Based Cardiac MR Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part II 20. Springer, 2017, pp. 253–260.
[21]
L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, Uncertainty-Aware Self-ensembling Model for Semi-supervised 3D Left Atrium Segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part II 22. Springer, 2019, pp. 605–613.
[22]
T. Lei, D. Zhang, X. Du, X. Wang, Y. Wan, and A. K. Nandi, Semi-Supervised Medical Image Segmentation Using Adversarial Consistency Learning and Dynamic Convolution Network,” IEEE Transactions on Medical Imaging, vol. 42, no. 5, pp. 1265–1277, 2023.
[23]
G. Chen, J. Ru, Y. Zhou, I. Rekik, Z. Pan, X. Liu, Y. Lin, B. Lu, and J. Shi, MTANS: Multi-Scale Mean Teacher Combined Adversarial Network with Shape-Aware Embedding for Semi-Supervised Brain Lesion Segmentation,” NeuroImage, vol. 244, p. 118568, 2021.
[24]
Y. Wu, Z. Ge, D. Zhang, M. Xu, L. Zhang, Y. Xia, and J. Cai, Mutual consistency learning for semi-supervised medical image segmentation,” Medical Image Analysis, vol. 81, p. 102530, 2022.
[25]
X. Li, L. Yu, H. Chen, C.-W. Fu, L. Xing, and P.-A. Heng, Transformation-Consistent Self-Ensembling Model for Semisupervised Medical Image Segmentation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 523–534, 2021.
[26]
K. Wang, B. Zhan, C. Zu, X. Wu, J. Zhou, L. Zhou, and Y. Wang, Semi-supervised medical image segmentation via a tripled-uncertainty guided mean teacher model with contrastive learning,” Medical Image Analysis, vol. 79, p. 102447, 2022.
[27]
Z. Xu, Y. Wang, D. Lu, X. Luo, J. Yan, Y. Zheng, and R. K.-y. Tong, Ambiguity-selective consistency regularization for mean-teacher semi-supervised medical image segmentation,” Medical Image Analysis, vol. 88, p. 102880, 2023.
[28]
X. Luo, G. Wang, W. Liao, J. Chen, T. Song, Y. Chen, S. Zhang, D. N. Metaxas, and S. Zhang, Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency,” Medical Image Analysis, vol. 80, p. 102517, 2022.
[29]
Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, and J. Liu, CE-Net: Context Encoder Network for 2D Medical Image Segmentation,” IEEE Transactions on Medical Imaging, vol. 38, no. 10, pp. 2281–2292, 2019.
[30]
Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 1856–1867, 2020.
[31]
F. Milletari, N. Navab, and S.-A. Ahmadi, V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV), 2016, pp. 565–571.
[32]
Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, S. Ourselin, L. Joskowicz, M. R. Sabuncu, G. Unal, and W. Wells, Eds. Cham: Springer International Publishing, 2016, pp. 424–432.
[33]
H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation,” in Computer Vision – ECCV 2022 Workshops, L. Karlinsky, T. Michaeli, and K. Nishino, Eds. Cham: Springer Nature Switzerland, 2023, pp. 205–218.
[34]
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 10 012–10 022.
[35]
T. Wang, J. Lan, Z. Han, Z. Hu, Y. Huang, Y. Deng, H. Zhang, J. Wang, M. Chen, H. Jiang et al., O-Net: A Novel Framework With Deep Fusion of CNN and Transformer for Simultaneous Segmentation and Classification,” Frontiers in neuroscience, vol. 16, p. 876065, 2022.
[36]
J. M. J. Valanarasu and V. M. Patel, UNeXt: MLP-Based Rapid Medical Image Segmentation Network,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, L. Wang, Q. Dou, P. T. Fletcher, S. Speidel, and S. Li, Eds. Cham: Springer Nature Switzerland, 2022, pp. 23–33.
[37]
E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 12 077–12 090.
[38]
L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, Uncertainty-Aware Self-ensembling Model for Semi-supervised 3D Left Atrium Segmentation,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, and A. Khan, Eds. Cham: Springer International Publishing, 2019, pp. 605–613.
[39]
X. Chen, Y. Yuan, G. Zeng, and J. Wang, Semi-Supervised Semantic Segmentation With Cross Pseudo Supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 2613–2622.
[40]
M. N. Rizve, K. Duarte, Y. S. Rawat, and M. Shah, In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning,” arXiv preprint arXiv:2101.06329, 2021.
[41]
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, mixup: Beyond Empirical Risk Minimization,” arXiv preprint arXiv:1710.09412, 2017.
[42]
Q. Xie, Z. Dai, E. Hovy, T. Luong, and Q. Le, Unsupervised Data Augmentation for Consistency Training,” Advances in Neural Information Processing Systems, vol. 33, pp. 6256–6268, 2020.
[43]
S. Laine and T. Aila, Temporal Ensembling for Semi-Supervised Learning,” arXiv preprint arXiv:1610.02242, 2016.
[44]
K. Clark, B. Vendt, K. Smith, J. Freymann, J. Kirby, P. Koppel, S. Moore, S. Phillips, D. Maffitt, M. Pringle et al., The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository,” Journal of Digital Imaging, vol. 26, pp. 1045–1057, 2013.
[45]
Y. Zhou, Z. Li, S. Bai, C. Wang, X. Chen, M. Han, E. Fishman, and A. L. Yuille, Prior-Aware Neural Network for Partially-Supervised Multi-Organ Segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10 672–10 681.
[46]
Z. Xiong, Q. Xia, Z. Hu, N. Huang, C. Bian, Y. Zheng, S. Vesal, N. Ravikumar, A. Maier, X. Yang et al., A global benchmark of algorithms for segmenting the left atrium from late gadolinium-enhanced cardiac magnetic resonance imaging,” Medical Image Analysis, vol. 67, p. 101832, 2021.
[47]
B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS),” IEEE Transactions on Medical Imaging, vol. 34, no. 10, pp. 1993–2024, 2014.
[48]
S. Li, C. Zhang, and X. He, Shape-Aware Semi-supervised 3D Semantic Segmentation for Medical Images,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part I 23. Springer, 2020, pp. 552–561.
[49]
X. Luo, J. Chen, T. Song, and G. Wang, Semi-supervised Medical Image Segmentation through Dual-task Consistency,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 10, 2021, pp. 8801–8809.
[50]
Y. Wu, Z. Wu, Q. Wu, Z. Ge, and J. Cai, “Exploring smoothness and class-separation for semi-supervised medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, L. Wang, Q. Dou, P. T. Fletcher, S. Speidel, and S. Li, Eds. Cham: Springer Nature Switzerland, 2022, pp. 34–43.
[51]
H. Zheng, L. Lin, H. Hu, Q. Zhang, Q. Chen, Y. Iwamoto, X. Han, Y.-W. Chen, R. Tong, and J. Wu, Semi-supervised Segmentation of Liver Using Adversarial Learning with Deep Atlas Prior,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, and A. Khan, Eds. Cham: Springer International Publishing, 2019, pp. 148–156.

  1. This work was supported by National Natural Science Foundation of China under Grant 62171133, in part by the Artificial Intelligence and Economy Integration Platform of Fujian Province, and the Fujian Health Commission under Grant 2022ZD01003.↩︎

  2. Tao Wang, Yuanbin Chen, Xinlin Zhang, Yuanbo Zhou, Junlin Lan, Min Du, Qinquan Gao and Tong Tong are with the College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China (e-mail: ortonwangtao@gmail.com; binycn904363330@gmail.com; xinlin1219@gmail.com; webbozhou@gmail.com; smurfslan@qq.com; dm\(\_\)dj90@163.com; gqinquan@imperial-vision.com; ttraveltong@imperial-vision.com).↩︎

  3. Tao Tan is with the Faculty of Applied Science, Macao Polytechnic University, Macao 999078 (e-mail: taotanjs@gmail.com).↩︎

  4. Bizhe Bai is with the University of Toronto, Ontario M5S 2E8, Canada (e-mail: bizhe.bai@outlook.com).↩︎