September 18, 2025
This paper explores the transfer of knowledge from general vision models pretrained on 2D natural images to improve 3D medical image segmentation. We focus on the semi-supervised setting, where only a few labeled 3D medical images are available, along with a large set of unlabeled images. To tackle this, we propose a model-agnostic framework that progressively distills knowledge from a 2D pretrained model to a 3D segmentation model trained from scratch. Our approach, M&N, iteratively co-trains the two models using pseudo-masks generated by each other, together with our proposed learning rate guided sampling, which adaptively adjusts the proportion of labeled and unlabeled data in each training batch to match the models' prediction accuracy and stability, minimizing the adverse effect of inaccurate pseudo-masks. Extensive experiments on multiple publicly available datasets demonstrate that M&N achieves state-of-the-art performance, outperforming thirteen existing semi-supervised segmentation approaches across all settings. Importantly, ablation studies show that M&N remains model-agnostic, allowing seamless integration with different architectures and ensuring its adaptability as more advanced models emerge. The code is available at https://github.com/pakheiyeung/M-N.
The advent of deep learning has significantly boosted the performance of 3D medical image segmentation, unarguably one of the most important tasks in medical image analysis. However, training a deep learning model from scratch typically requires a large amount of labeled data, which can be a major bottleneck in the medical domain [1]. In contrast, labeled data is abundant in other domains, such as 2D natural images, where massive datasets [2], [3] have been curated and utilized to train powerful vision models [4]. These pretrained models have demonstrated remarkable capabilities in various computer vision tasks. Motivated by their success, this paper explores the possibility of leveraging their knowledge to facilitate 3D medical image segmentation, particularly in scenarios where only a few manual labels are available.
Specifically, we focus on the task of semi-supervised 3D medical image segmentation, where only a few labeled 3D medical images are available, accompanied by a large set of unlabeled images. Recent advances in this area have achieved remarkable performance through various strategies for utilizing unlabeled data, for example, teacher-student frameworks [5], [6], uncertainty-driven approaches [7], [8], unsupervised domain adaptation [9], and prototype- and contrastive-learning-based frameworks [10], [11]. Another line of research related to this paper has explored bridging the gap between 2D and 3D networks for 3D medical image analysis, primarily through network architecture design [12]–[14]. While remarkable performance has been achieved, most of these approaches are tailored to specific types of networks. In contrast, this work proposes a model-agnostic framework that enables the transfer of knowledge from any 2D pretrained network to any 3D segmentation network, aiming for a more flexible and generalizable solution.
This work is motivated by our initial findings, which show that a model pretrained on 2D natural images substantially outperforms the same network trained from scratch on a 3D medical segmentation task. This performance gap widens when the amount of labeled training data is limited. These findings are shown in the first three rows of Tables 3, [tab:la4], and [tab:ct6]. This suggests that pretraining on natural images acquires knowledge that is transferable to 3D medical segmentation, particularly in low-data regimes. Building on the success of 3D networks [15], [16], which have achieved state-of-the-art medical segmentation performance when trained on large labeled datasets, we pose a fundamental question: can we leverage the knowledge of a pretrained 2D model to improve the performance of a 3D segmentation model, even when training on limited labeled samples?
To address this question, we propose M&N, a model-agnostic framework that distills knowledge from a vision model pretrained on 2D natural images to a 3D model trained from scratch for semi-supervised medical segmentation. Our work makes the following contributions. First, we propose an iterative co-training strategy, where the 2D and 3D models are trained using pseudo-masks generated by each other. To mitigate the impact of inaccurate pseudo-masks, we further propose learning rate guided sampling, which adaptively adjusts the proportion of labeled and unlabeled data in a batch to align with the models' prediction accuracy and stability. Second, we benchmark M&N on various publicly available datasets with different limited-data settings. M&N outperforms 13 existing semi-supervised segmentation approaches, achieving state-of-the-art performance in all experiments. Third, our ablation studies show that M&N is agnostic to different models and architectures, suggesting its generalizability and potential for seamless integration with more advanced models to achieve even stronger results in the future.
We propose M&N for semi-supervised 3D medical image segmentation. We consider a dataset of 3D medical images, \(\mathcal{I} = \{{\mathbf{I}_i}\}_{i=1}^m\), where each image \(\mathbf{I}_i \in \mathbb{R}^{C_i \times H \times W \times D}\) has \(C_i\) channels, height \(H\), width \(W\), and depth \(D\). We assume a subset of images, \(\mathcal{I}_L\), has corresponding labeled masks, \(\mathcal{M}_L = \{{\mathbf{M}_i}\}_{i=1}^n, \mathbf{M}_i \in \mathbb{R}^{C_c \times H \times W \times D}\), with \(C_c\) classes, where \(n\ll m\). The remaining images, \(\mathcal{I}_U\), are unlabeled.
Using \(\mathcal{I}_L\), \(\mathcal{M}_L\) and \(\mathcal{I}_U\), our objective is to distill knowledge from a pretrained vision model, \(f(\cdot;\theta_{nat})\), parametrized by \(\theta_{nat}\) and pretrained on 2D natural images, to a 3D segmentation model, \(g(\cdot;\theta_{med})\), parametrized by \(\theta_{med}\).
We begin by fine-tuning the pretrained 2D model, \(f(\cdot;\theta_{nat})\), on the labeled dataset \(\{{\mathbf{I}_i}, \mathbf{M}_i\}_{i=1}^n\) by extracting 2D slices along the depth dimension \(D\). Simultaneously, we train the 3D segmentation model, \(g(\cdot;\theta_{med})\), from scratch on \(\mathcal{I}_L\) and \(\mathcal{M}_L\). During this stage, the two models are optimized independently using the labeled loss, \(\mathcal{L}_l\), defined as: \[\label{eq:loss_label}
\mathcal{L}_l = w_{ce}\cdot\mathcal{L}_{ce}\left(\hat{\mathbf{M}},\mathbf{M}\right) + w_{dice}\cdot\mathcal{L}_{dice}\left(\hat{\mathbf{M}},\mathbf{M}\right),\tag{1}\] where \(\hat{\mathbf{M}}\) is the predicted mask, \(\mathcal{L}_{ce}\) is the cross-entropy loss, \(\mathcal{L}_{dice}\) is the soft Dice loss, and \(w_{ce}\) and \(w_{dice}\) are their respective weights.
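To make the loss concrete, the following is a minimal PyTorch sketch of Eq. (1); the function names (`soft_dice`, `labeled_loss`) and the one-hot mask convention are our own assumptions rather than the released implementation.

```python
# Minimal sketch of the labeled loss in Eq. (1): weighted cross-entropy plus soft Dice.
# Assumes logits of shape (B, C_c, H, W, D) and one-hot masks of the same shape.
import torch
import torch.nn.functional as F

def soft_dice(logits, target_onehot, eps=1e-6):
    """Soft Dice loss averaged over classes and batch."""
    probs = torch.softmax(logits, dim=1)
    dims = tuple(range(2, probs.ndim))                  # spatial dimensions
    intersection = (probs * target_onehot).sum(dims)
    union = probs.sum(dims) + target_onehot.sum(dims)
    return 1 - ((2 * intersection + eps) / (union + eps)).mean()

def labeled_loss(logits, target_onehot, w_ce=1.0, w_dice=1.0):
    ce = F.cross_entropy(logits, target_onehot.argmax(dim=1))
    return w_ce * ce + w_dice * soft_dice(logits, target_onehot)
```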
Pretrained models. M&N is model-agnostic, allowing a wide range of 2D vision models, \(f(\cdot;\theta_{nat})\), pretrained with different learning objectives, to be used. For \(f(\cdot;\theta_{nat})\) with an encoder-decoder architecture, such as [4], fine-tuning can be done by simply replacing the last layer to match the number of classes \(C_c\). Alternatively, for models with only a pretrained encoder, such as [17], a decoder needs to be appended and fine-tuned. Both cases are evaluated in Section 3.3.
Fine-tuning strategies. The pretrained \(f(\cdot;\theta_{nat})\) can be fine-tuned with different strategies, ranging from updating all the weights, \(\theta_{nat}\), to
fine-tuning only a subset of layers while keeping the rest frozen. We adopt Low-Rank Adaptation (LoRA) [18] as the default fine-tuning strategy for M&N, but
also investigate alternative options in Section 3.3 to provide a comprehensive comparison.
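As a reference for the default strategy, below is a minimal sketch of the LoRA idea applied to a single linear layer: the pretrained weight is frozen and only a low-rank update is trained. The class name, rank, and scaling factor are illustrative assumptions; the actual adapter placement follows [18].

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained nn.Linear with a trainable low-rank update (LoRA-style sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```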
Both models, \(f(\cdot;\theta_{nat})\) and \(g(\cdot;\theta_{med})\), are then trained on both the labeled subset, \(\mathcal{I}_L\) with \(\mathcal{M}_L\), and the unlabeled subset, \(\mathcal{I}_U\), as shown in Fig. 1.
Odd-numbered epochs. 2D slices, \(\{{\mathbf{S}^d_{i}}\}_{i=1, d=1}^{n,D}\), are extracted along the depth dimension \(D\) from \(\mathcal{I}_U = \{{\mathbf{I}_i}\}_{i=1}^n\), where \(\mathbf{S}^d_i \in \mathbb{R}^{C_i \times H \times W}\). These slices are input to \(f(\cdot;\theta_{nat})\) to generate the pseudo-masks, \(\{{\mathbf{P}_{i}}\}_{i=1}^{n}, \mathbf{P}_i \in \mathbb{R}^{C_c \times H \times W \times D}\):
\[\label{eq:pseudo_2d}
\mathbf{P}_i = concat\left(f(\mathbf{S}^1_{i};\theta_{nat}),\;f(\mathbf{S}^2_{i};\theta_{nat}),\;...,\;f(\mathbf{S}^D_{i};\theta_{nat})\right),\tag{2}\] where \(concat(\cdot)\) concatenates the 2D predicted masks across the depth dimension \(D\). Together with the masks predicted by \(g(\cdot;\theta_{med})\): \[\label{eq:prediction_3d}
[\hat{\mathbf{M}}_1,\;\hat{\mathbf{M}}_2,\;...,\;\hat{\mathbf{M}}_n] = [g(\mathbf{I}_1;\theta_{med}),\;g(\mathbf{I}_2;\theta_{med}),\;...,\;g(\mathbf{I}_n;\theta_{med})],\tag{3}\] the unlabeled loss, \(\mathcal{L}_{u}\), can be computed as: \[\label{eq:loss_unlabel}
\mathcal{L}_u = w_{kl}\cdot\mathcal{L}_{kl}\left(\hat{\mathbf{M}},\mathbf{P}\right) + w_{dice}\cdot\mathcal{L}_{dice}\left(\hat{\mathbf{M}},\mathbf{P}\right),\tag{4}\] where \(\mathcal{L}_{kl}\) is the Kullback-Leibler divergence loss and \(w_{kl}\) is its weight.
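A possible implementation of Eqs. (2)–(4) is sketched below, reusing `soft_dice` from the earlier labeled-loss sketch; the slice-wise batching and function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pseudo_masks_from_2d(f_nat, volume):
    """Run the 2D model on each depth slice of a (B, C, H, W, D) volume and restack (Eq. 2)."""
    slices = volume.unbind(dim=-1)                      # D tensors of shape (B, C, H, W)
    preds = [f_nat(s) for s in slices]                  # each (B, C_c, H, W)
    return torch.stack(preds, dim=-1)                   # (B, C_c, H, W, D)

def unlabeled_loss(pred_logits, pseudo_logits, w_kl=1.0, w_dice=1.0):
    """KL divergence plus soft Dice between predictions and pseudo-masks (Eq. 4)."""
    log_p = F.log_softmax(pred_logits, dim=1)
    q = torch.softmax(pseudo_logits, dim=1)
    kl = F.kl_div(log_p, q, reduction="batchmean")
    return w_kl * kl + w_dice * soft_dice(pred_logits, q)   # soft_dice from the sketch above
```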
Supervising with only \(\mathcal{L}_u\) may result in collapsed solutions, for example, outputting the same prediction regardless of the input. Therefore, the labeled loss, \(\mathcal{L}_l\) (Eq. (1)), is also computed, leading to the final co-training loss, \(\mathcal{L}_c\): \[\label{eq:loss_co} \mathcal{L}_c = \frac{b_l}{b_l+b_u}\cdot\mathcal{L}_l + \frac{b_u}{b_l+b_u}\cdot\mathcal{L}_u,\tag{5}\] where \(b_l\) and \(b_u\) are the numbers of labeled and unlabeled data in a batch.
At each back-propagation step, a stochastic optimization step is performed to minimize \(\mathcal{L}_c\) with respect to \(\theta_{med}\), training \(g(\cdot;\theta_{med})\): \[\label{eq:optimize_3d}
\theta_{med} \leftarrow optim(\theta_{med}, \nabla_{\theta_{med}}\mathcal{L}_c, \eta_{med}),\tag{6}\] where \(optim(\cdot)\) denotes the optimizer and \(\eta_{med}\) is the learning rate for \(g(\cdot;\theta_{med})\).
Even-numbered epochs. Pseudo-masks, \(\{{\mathbf{P}_{i}}\}_{i=1}^{n}\), are generated by \(g(\cdot;\theta_{med})\): \[\label{eq:pseudo_3d}
[\mathbf{P}_1,\;\mathbf{P}_2,\;...,\;\mathbf{P}_n] = [g(\mathbf{I}_1;\theta_{med}),\;g(\mathbf{I}_2;\theta_{med}),\;...,\;g(\mathbf{I}_n;\theta_{med})].\tag{7}\] Meanwhile, \(\{{\hat{\mathbf{M}}_i}\}_{i=1}^{n}\) are output by \(f(\cdot;\theta_{nat})\) from the input 2D slices, \(\{{\mathbf{S}^d_{i}}\}_{i=1, d=1}^{n,D}\): \[\label{eq:prediction_2d}
\hat{\mathbf{M}}_i = concat\left(f(\mathbf{S}^1_{i};\theta_{nat}),\;f(\mathbf{S}^2_{i};\theta_{nat}),\;...,\;f(\mathbf{S}^D_{i};\theta_{nat})\right).\tag{8}\] By minimizing the co-training loss, \(\mathcal{L}_c\) (Eq. (5)), with respect to \(\theta_{nat}\) (or a subset of \(\theta_{nat}\), depending on the fine-tuning strategies described in Section 2.1), the training of \(f(\cdot;\theta_{nat})\) can be summarized as: \[\label{eq:optimize_2d}
\theta_{nat} \leftarrow optim(\theta_{nat}, \nabla_{\theta_{nat}}\mathcal{L}_c, \eta_{nat}).\tag{9}\] By iterating between Eqs. (6) and (9), the two models, \(f(\cdot;\theta_{nat})\) and \(g(\cdot;\theta_{med})\), improve their performance by continuously learning from each other.
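Putting Eqs. (5)–(9) together, the alternation can be sketched as follows; it reuses the helper sketches above, and the loader setup, optimizer objects, and naming are assumptions rather than the released training script.

```python
# Alternating co-training sketch: odd-numbered epochs update the 3D model g(.; theta_med)
# against pseudo-masks from the 2D model; even-numbered epochs do the reverse.
for epoch in range(1, num_cotrain_epochs + 1):
    odd = (epoch % 2 == 1)
    for (x_l, m_l), x_u in zip(labeled_loader, unlabeled_loader):
        if odd:
            with torch.no_grad():
                pseudo = pseudo_masks_from_2d(f_nat, x_u)   # 2D model supplies pseudo-masks
            pred_l, pred_u = g_med(x_l), g_med(x_u)
            opt = opt_med                                   # update theta_med (Eq. 6)
        else:
            with torch.no_grad():
                pseudo = g_med(x_u)                         # 3D model supplies pseudo-masks
            pred_l = pseudo_masks_from_2d(f_nat, x_l)
            pred_u = pseudo_masks_from_2d(f_nat, x_u)
            opt = opt_nat                                   # update theta_nat (Eq. 9)
        b_l, b_u = x_l.size(0), x_u.size(0)
        loss = (b_l * labeled_loss(pred_l, m_l)
                + b_u * unlabeled_loss(pred_u, pseudo)) / (b_l + b_u)   # Eq. (5)
        opt.zero_grad()
        loss.backward()
        opt.step()
```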
Due to the significant imbalance between the size of the labeled set \(\mathcal{I}_L\) and that of the unlabeled set \(\mathcal{I}_U\), uniform data sampling from \(\mathcal{I}\) during the co-training may undermine the training stability and final performance. Even with strategies like over-sampling, a fixed sampling approach may not be optimal.
Ideally, the proportion of labeled and unlabeled images within a batch should adapt to the current state of the models. At early stages, when the models' predictions are still unstable and inaccurate, relying heavily on their generated pseudo-masks may be suboptimal. As training progresses and the predictions become more stable and accurate, the number of unlabeled images in a batch, which are supervised by pseudo-masks, should increase accordingly to maximize the utilization of unlabeled data.
This pattern aligns with learning rate decay, a widely adopted technique for training deep models [19]. Therefore, we propose learning rate guided sampling (LRG-sampling), which adaptively adjusts \(b_l\) and \(b_u\) (Eq. (5)) according to the current learning rate, \(\eta_{current}\), at each epoch. This can be formulated as: \[\label{eq:bu} b_u = \left\lfloor\frac{\eta_{initial}-\eta_{current}}{\eta_{initial}-\eta_{final}}\cdot B\right\rfloor,\tag{10}\] \[\label{eq:bl} b_l = B-b_u,\tag{11}\] where \(\lfloor \cdot \rfloor\) denotes the floor function, \(B\) is the batch size, and \(\eta_{initial}\) and \(\eta_{final}\) are the initial and final learning rates of the schedule.
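The batch split of Eqs. (10)–(11) amounts to a few lines; the sketch below assumes the learning rate is read from the scheduler at the start of each epoch, and the function name is ours.

```python
import math

def lrg_batch_split(eta_current, eta_initial, eta_final, batch_size):
    """Learning-rate-guided split of a batch into labeled and unlabeled samples (Eqs. 10-11)."""
    b_u = math.floor((eta_initial - eta_current) / (eta_initial - eta_final) * batch_size)
    return batch_size - b_u, b_u                        # (b_l, b_u)

# Example: with eta_initial=1e-4, eta_final=0 and B=5, an epoch where eta_current=4e-5
# gives b_l=2 labeled and b_u=3 unlabeled samples per batch.
```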
We conducted a comprehensive evaluation of M&N, comparing it with 13 state-of-the-art methods under various settings (i.e., training with different numbers of labeled images), and investigated the effect of the different components of M&N through ablation studies.
Datasets. We benchmarked on the left atrial (LA) cavity dataset [20] and the Pancreas-CT dataset [21]. The LA dataset consists of 3D gadolinium-enhanced cardiac MR images with ground-truth masks of the LA cavity. The resolution is \(0.625 \times 0.625 \times 0.625\,\text{mm}^3\). We normalized the voxel values of each image to the range \([0, 1]\) and then standardized them. There are 100 images in the training set and 56 in the testing set. The Pancreas-CT dataset contains 82 (62 training and 20 testing) contrast-enhanced abdominal 3D CT images with ground-truth masks of the pancreas. We re-sampled the images to a unified resolution of \(0.85 \times 0.85 \times 0.75\,\text{mm}^3\). The voxel values were clipped to the range \([-175, 250]\) Hounsfield units and then normalized to \([0, 1]\).
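For reference, the preprocessing described above can be expressed as follows; the epsilon terms and function names are our additions.

```python
import numpy as np

def preprocess_la(volume: np.ndarray) -> np.ndarray:
    """LA MR volumes: scale to [0, 1], then standardize."""
    v = (volume - volume.min()) / (volume.max() - volume.min() + 1e-8)
    return (v - v.mean()) / (v.std() + 1e-8)

def preprocess_ct(volume: np.ndarray) -> np.ndarray:
    """Pancreas CT volumes: clip to [-175, 250] HU, then scale to [0, 1]."""
    v = np.clip(volume, -175, 250)
    return (v - v.min()) / (v.max() - v.min() + 1e-8)
```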
Implementation details. M&N used SegFormer-B2 [4], pretrained on ImageNet-1K [3] and ADE20K [2], as the 2D model \(f(\cdot;\theta_{nat})\), and a randomly initialized 3D UNet [22] as the 3D segmentation network \(g(\cdot;\theta_{med})\). Other 2D and 3D models, including a UNet with a pretrained ResNet-50 encoder [17], [22] and SwinUNETR [15], were also evaluated in the ablation studies. The hyperparameters were set to \(w_{ce}=w_{kl}=w_{dice}=1\) and \(B=5\). During training, we applied a range of data augmentations, including random resizing with a scale of \(0.9\)–\(1.1\), gamma contrast adjustment with \(\gamma\in[0.8,1.2]\), and random cropping to \(160 \times 160 \times 64\) with a \(50\%\) chance of foreground inclusion. Pseudo-masks predicted by the 3D model were used to determine the foreground regions of unlabeled images. The two models were first trained and fine-tuned for 500 epochs (with 50 warm-up epochs) using the labeled images and then co-trained for another 3500 epochs, using a cosine learning rate decay schedule with \(\eta_{initial}=10^{-3}\) for \(f(\cdot;\theta_{nat})\), \(\eta_{initial}=10^{-4}\) for \(g(\cdot;\theta_{med})\), and \(\eta_{final}=0\). AdamW [23] was used as the optimizer.
Inference. For each volume, we sampled patches of size \(160 \times 160 \times 64\), matching the dimensions used during training, to ensure full coverage of the volume. Adjacent patches were sampled with a
50% overlap, and the predictions within the overlapping regions were averaged to obtain the final output. No data augmentation was applied during inference.
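A minimal sketch of this sliding-window inference is given below, assuming a channel-first \((C, H, W, D)\) volume no smaller than the patch size; helper names are illustrative and not taken from the released code.

```python
import torch

def _starts(dim, patch, stride):
    starts = list(range(0, dim - patch + 1, stride))
    if starts[-1] != dim - patch:                       # keep the last patch flush with the boundary
        starts.append(dim - patch)
    return starts

def sliding_window_predict(model, volume, patch=(160, 160, 64)):
    """Tile the volume with 50%-overlapping patches and average logits in overlaps."""
    C, H, W, D = volume.shape
    sx, sy, sz = (p // 2 for p in patch)                # 50% overlap between adjacent patches
    out = count = None
    for x in _starts(H, patch[0], sx):
        for y in _starts(W, patch[1], sy):
            for z in _starts(D, patch[2], sz):
                crop = volume[:, x:x + patch[0], y:y + patch[1], z:z + patch[2]]
                with torch.no_grad():
                    logits = model(crop.unsqueeze(0))[0]          # (C_c, h, w, d)
                if out is None:
                    out = torch.zeros(logits.shape[0], H, W, D)
                    count = torch.zeros(H, W, D)
                out[:, x:x + patch[0], y:y + patch[1], z:z + patch[2]] += logits
                count[x:x + patch[0], y:y + patch[1], z:z + patch[2]] += 1
    return out / count                                  # average overlapping predictions
```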
Evaluation metrics. Four metrics, namely Dice, Jaccard, 95\(\%\) Hausdorff Distance (\(\text{HD}_{95}\)) and Average Surface Distance (ASD), were used for evaluation. For
Dice and Jaccard, higher values indicate better performance, whereas lower values are desirable for \(\text{HD}_{95}\) and ASD.
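The two overlap metrics can be computed directly from binary masks, as in the sketch below; the distance metrics (\(\text{HD}_{95}\) and ASD) are typically computed with a library such as MedPy and are omitted here.

```python
import numpy as np

def dice_and_jaccard(pred: np.ndarray, gt: np.ndarray):
    """Dice and Jaccard (in %) between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2 * inter / (pred.sum() + gt.sum() + 1e-8)
    jaccard = inter / (np.logical_or(pred, gt).sum() + 1e-8)
    return 100 * dice, 100 * jaccard
```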
| Method | Dice (%)\(\uparrow\) | Jaccard (%)\(\uparrow\) | \(\text{HD}_{95}\) (vox)\(\downarrow\) | ASD (vox)\(\downarrow\) |
| 3D UNet | 88.26 | 79.89 | 11.19 | 3.29 |
| *SegFormer | 83.05 | 73.25 | 15.79 | 5.40 |
| **SegFormer | 88.61 | 79.87 | 10.49 | 3.35 |
| SASSNet[24] | 87.32 | 77.72 | 9.62 | 2.55 |
| MC-Net[25] | 87.71 | 78.31 | 9.36 | 2.18 |
| SS-Net[11] | 88.55 | 79.62 | 7.49 | 1.9 |
| MC-Net+[26] | 88.96 | 80.25 | 7.93 | 1.86 |
| CAML[27] | 89.62 | 81.28 | 8.76 | 2.02 |
| DK-UXNet[28] | 90.41 | 82.69 | 7.32 | 1.71 |
| UA-MT[29] | 84.25 | 73.48 | 13.84 | 3.36 |
| LG-ER-MT[30] | 85.54 | 75.12 | 13.29 | 3.77 |
| DUWM[7] | 85.91 | 75.75 | 12.67 | 3.31 |
| AD-MT[5] | 90.55 | 82.79 | 5.81 | 1.7 |
| BCP[6] | 89.62 | 81.31 | 6.81 | 1.76 |
| Co-BioNet[8] | 89.2 | 80.68 | 6.44 | 1.9 |
| GraphCL[10] | 90.24 | 82.31 | 6.42 | 1.71 |
| M&N(Ours) | 91.56 | 84.47 | 4.59 | 1.40 |
Bold indicates the top performance. \(\uparrow\) means higher values are more accurate. *SegFormer trained from scratch. **SegFormer pretrained on ADE20K [2].
| Method | Dice (%)\(\uparrow\) | Jaccard (%)\(\uparrow\) | \(\text{HD}_{95}\) (vox)\(\downarrow\) | ASD (vox)\(\downarrow\) |
| 3D UNet | 82.01 | 72.42 | 29.95 | 9.91 |
| *SegFormer | 71.01 | 58.4 | 14.34 | 7.07 |
| **SegFormer | 84.23 | 74.49 | 8.24 | 3.57 |
| SS-Net[11] | 86.33 | 76.15 | 9.97 | 2.31 |
| CAML[27] | 87.34 | 77.65 | 9.76 | 2.49 |
| DK-UXNet[28] | 85.96 | 75.91 | 11.72 | 2.64 |
| AD-MT[5] | 89.63 | 81.28 | 6.56 | 1.85 |
| BCP[6] | 88.02 | 78.72 | 7.9 | 2.15 |
| GraphCL[10] | 88.8 | 80 | 7.16 | 2.1 |
| M&N | 90.47 | 82.66 | 4.72 | 1.59 |
| Method | Dice (%)\(\uparrow\) | Jaccard (%)\(\uparrow\) | \(\text{HD}_{95}\) (vox)\(\downarrow\) | ASD (vox)\(\downarrow\) |
| 3D UNet | 59.53 | 44.17 | 18.52 | 1.79 |
| *SegFormer | 43.76 | 28.32 | 19.67 | 6.86 |
| **SegFormer | 68.57 | 53.34 | 16.39 | 4.77 |
| MC-Net[25] | 68.94 | 54.74 | 16.28 | 3.16 |
| MC-Net+[26] | 74.01 | 60.02 | 12.59 | 3.34 |
| AD-MT[5] | 80.21 | 67.51 | 7.18 | 1.66 |
| Co-BioNet[8] | 77.89 | 64.79 | 8.81 | 1.39 |
| M&N | 81.67 | 69.53 | 6.56 | 1.67 |
We compared M&N with 13 state-of-the-art approaches on the LA dataset, trained with 4 and 8 labeled images, and the Pancreas-CT dataset, trained with 6 labeled images.
As shown in Tables 3, [tab:la4], and [tab:ct6], our proposed M&N consistently outperformed all existing approaches across different datasets and settings (i.e., results in bold). Evaluated on different modalities (i.e., MRI and CT) and target structures (i.e., LA and pancreas) with different numbers of labeled training images, M&N demonstrated robustness and generalizability.
The results in the first three rows of the tables, particularly [tab:ct6] and Fig. 2, which present the more challenging CT pancreas segmentation task, validate the motivation behind our work. Specifically, pretraining on 2D natural images significantly helps medical segmentation (i.e., *SegFormer vs. **SegFormer), whereas the 3D UNet trained with only a small amount of labeled data struggled to achieve satisfactory results. However, it still outperformed a 2D network trained from scratch (i.e., 3D UNet vs. *SegFormer), justifying M&N's choice of distilling knowledge from a pretrained 2D model to a 3D model.
Compared with the best existing method, AD-MT [5], M&N uses a 3D UNet, whereas AD-MT uses VNet [16] as the segmentation model. The two models share very similar architectures, further verifying that the better performance of M&N is primarily attributable to differences in the learning framework design rather than the network architecture.
We focused on the low-labeled data regime, where fewer than 10 labeled images are available (i.e. 4, 6, and 8). This threshold was deliberately chosen based on our previous experience collaborating with clinicians, whose incentive to annotate data significantly drops when the required number increases by an order of magnitude (e.g. 9 to 10 or 90 to 100). This phenomenon is consistent with psychological and marketing concepts, such as categorical perception of numbers and the left-digit effect [31]. By limiting our evaluation to fewer than 10 labeled images, our experiment setting aimed to reflect realistic scenarios where annotation resources are scarce.
We investigated the effect of different components of M&N on the LA dataset, trained with 8 labeled images, with the results presented in Table 4.
Model-agnostic analysis. While the default pretrained SegFormer has an encoder-decoder architecture, we replaced it with a pretrained ResNet-50 [17] encoder connected to a randomly initialized decoder through skip connections, termed ResUNet. As shown in Table 4 (row 1), its performance decreased slightly but still outperformed the best existing method, AD-MT [5] (Table 3). Additionally, we replaced the 3D UNet with SwinUNETR [15] and observed a similar slight drop in performance (Table 4, row 2), while still surpassing AD-MT [5] on most metrics. These experiments validate the model-agnostic nature of M&N, confirming its ability to adapt to various architectures while maintaining superior performance.
Fine-tuning strategies. We compared three fine-tuning strategies for the pretrained SegFormer, namely LoRA [18], decoder fine-tuning, and whole-network fine-tuning. While freezing the encoder and updating only the decoder resulted in a performance drop (row 5), the other two strategies performed comparably (row 6), with LoRA slightly ahead on three out of four metrics. Since LoRA is also more parameter-efficient, we adopted it as the default strategy for M&N.
Training and data sampling. We ablated the iterative co-training by using fixed pseudo-mask training, where the pretrained 2D model was fine-tuned with the labeled images (Section 2.1) and then used to generate pseudo-masks for training the 3D model without further updates. This led to a performance decrease (row 3). Additionally, replacing LRG-sampling with uniform training data sampling resulted in an even more significant drop in performance (row 4). These experiments validate the effectiveness of our proposed components in M&N.
| Pretrained 2D model | Fine-tuning strategy* | 3D network | Iterative co-training | LRG-sampling | Dice (%)\(\uparrow\) | Jaccard (%)\(\uparrow\) | \(\text{HD}_{95}\) (vox)\(\downarrow\) | ASD (vox)\(\downarrow\) |
| ResUNet** | Whole | 3D UNet | ✓ | ✓ | 91.11 | 83.74 | 5.38 | 1.41 |
| SegFormer | LoRA | SwinUNETR | ✓ | ✓ | 90.67 | 83.03 | 5.24 | 2.16 |
| SegFormer | LoRA | 3D UNet | ✗ | ✓ | 89.4 | 81.32 | 6.06 | 1.74 |
| SegFormer | LoRA | 3D UNet | ✓ | ✗ | 87.14 | 77.35 | 8.43 | 2.36 |
| SegFormer | Decoder | 3D UNet | ✓ | ✓ | 89.49 | 81.28 | 6.44 | 1.75 |
| SegFormer | Whole | 3D UNet | ✓ | ✓ | 91.39 | 84.20 | 5.02 | 1.38 |
| SegFormer | LoRA | 3D UNet | ✓ | ✓ | 91.56 | 84.47 | 4.59 | 1.40 |
* Fine-tuning strategy refers to which part of the pretrained 2D model is updated. ** UNet with a ResNet-50 encoder pretrained on ImageNet-1K [3].
In summary, we present M&N, a model-agnostic framework for transferring knowledge from general vision models pretrained on 2D natural images to enhance semi-supervised 3D medical image segmentation. By iteratively co-training the 2D and 3D models and adaptively adjusting the proportion of labeled and unlabeled images within a batch throughout training, M&N achieves state-of-the-art performance on various publicly available datasets under different limited-labeled-data settings, outperforming 13 existing methods. As future work, we plan to extend M&N to other tasks in medical image analysis, for example, image registration. Our ultimate goal is to utilize abundant cross-domain knowledge to facilitate developments in medical image analysis.
PH. Yeung is funded by the Presidential Postdoctoral Fellowship from Nanyang Technological University. We thank Dr Madeleine Wyburd and Mr Valentin Bacher for their valuable suggestions and comments about the work.