September 18, 2025
This paper explores the transfer of knowledge from general vision models pretrained on 2D natural images to improve 3D medical image segmentation. We focus on the semi-supervised setting, where only a few labeled 3D medical images are available, along with a large set of unlabeled images. To tackle this, we propose a model-agnostic framework that progressively distills knowledge from a 2D pretrained model to a 3D segmentation model trained from scratch. Our approach, M&N, iteratively co-trains the two models using pseudo-masks generated by each other, together with our proposed learning rate guided sampling, which adaptively adjusts the proportion of labeled and unlabeled data in each training batch to match the models' prediction accuracy and stability, minimizing the adverse effect of inaccurate pseudo-masks. Extensive experiments on multiple publicly available datasets demonstrate that M&N achieves state-of-the-art performance, outperforming thirteen existing semi-supervised segmentation approaches across all settings. Importantly, ablation studies show that M&N remains model-agnostic, allowing seamless integration with different architectures and ensuring its adaptability as more advanced models emerge. The code is available at https://github.com/pakheiyeung/M-N.
The advent of deep learning has significantly boosted the performance of 3D medical image segmentation, unarguably one of the most important tasks in medical image analysis. However, training a deep learning model from scratch typically requires a large amount of labeled data, which can be a major bottleneck in the medical domain [1]. In contrast, labeled data is abundant in other domains, such as 2D natural images, where massive datasets [2], [3] have been curated and utilized to train powerful vision models [4]. These pretrained models have demonstrated remarkable capabilities in various computer vision tasks. Motivated by their success, this paper explores the possibility of leveraging their knowledge to facilitate 3D medical image segmentation, particularly in scenarios where only a few manual labels are available.
Specifically, we focus on the task of semi-supervised 3D medical image segmentation, where only a few labeled 3D medical images are available, accompanied by a large set of unlabeled images. Recent advances in this area have achieved remarkable performance through various strategies for utilizing unlabeled data, for example, teacher-student frameworks [5], [6], uncertainty-driven approaches [7], [8], unsupervised domain adaptation [9], and prototype- and contrastive-learning-based frameworks [10], [11]. Another line of research related to this paper has explored bridging the gap between 2D and 3D networks for 3D medical image analysis, primarily through network architecture design [12]–[14]. While remarkable performance has been achieved, most of these approaches are tailored to specific types of networks. In contrast, this work proposes a model-agnostic framework that enables the transfer of knowledge from any 2D pretrained network to any 3D segmentation network, aiming for a more flexible and generalizable solution.
This work is motivated by our initial findings, which show that a model pretrained on 2D natural images substantially outperforms the same network trained from scratch on a 3D medical segmentation task. This performance gap widens when the amount of labeled training data is limited. These findings are shown in the first three rows of Tables 3, [tab:la4], and [tab:ct6]. This suggests that pretraining on natural images acquires knowledge that is transferable to 3D medical segmentation, particularly in low-data regimes. Building on the success of 3D networks [15], [16], which have achieved state-of-the-art medical segmentation performance when trained on large labeled datasets, we pose a fundamental question: can we leverage the knowledge of a pretrained 2D model to improve the performance of a 3D segmentation model, even when training on limited labeled samples?
To address this question, we propose M&N, a model-agnostic framework that distills knowledge from a vision model pretrained on 2D natural images to a 3D model trained from scratch for semi-supervised medical segmentation. Our work makes the following contributions. First, we propose an iterative co-training strategy, where the 2D and 3D models are trained using pseudo-masks generated by each other. To mitigate the impact of inaccurate pseudo-masks, we further propose learning rate guided sampling, which adaptively adjusts the proportion of labeled and unlabeled data in a batch to align with the models' prediction accuracy and stability. Second, we benchmark M&N on various publicly available datasets with different limited-data settings. M&N outperforms 13 existing semi-supervised segmentation approaches, achieving state-of-the-art performance in all experiments. Third, our ablation studies show that M&N is agnostic to different models and architectures, suggesting its generalizability and potential for seamless integration with more advanced models to achieve even stronger results in the future.
We propose M&N for semi-supervised 3D medical image segmentation. We consider a dataset of 3D medical images, \(\mathcal{I} = \{{\mathbf{I}_i}\}_{i=1}^m\), where each image \(\mathbf{I}_i \in \mathbb{R}^{C_i \times H \times W \times D}\) has \(C_i\) channels, height \(H\), width \(W\), and depth \(D\). We assume a subset of images, \(\mathcal{I}_L\), has corresponding labeled masks, \(\mathcal{M}_L = \{{\mathbf{M}_i}\}_{i=1}^n, \mathbf{M}_i \in \mathbb{R}^{C_c \times H \times W \times D}\), with \(C_c\) classes, where \(n\ll m\). The remaining images, \(\mathcal{I}_U\), are unlabeled.
Using \(\mathcal{I}_L\), \(\mathcal{M}_L\) and \(\mathcal{I}_U\), our objective is to distill knowledge from a pretrained vision model, \(f(\cdot;\theta_{nat})\), parametrized by \(\theta_{nat}\) and pretrained on 2D natural images, to a 3D segmentation model, \(g(\cdot;\theta_{med})\), parametrized by \(\theta_{med}\).
We begin by fine-tuning the pretrained 2D model, \(f(\cdot;\theta_{nat})\), on the labeled dataset \(\{{\mathbf{I}_i}, \mathbf{M}_i\}_{i=1}^n\) by extracting 2D slices along the depth dimension \(D\). Simultaneously, we train the 3D segmentation model, \(g(\cdot;\theta_{med})\), from scratch on \(\mathcal{I}_L\) and \(\mathcal{M}_L\). During this stage, the two models are optimized independently using the labeled loss, \(\mathcal{L}_l\), defined as: \[\label{eq:loss_label}
\mathcal{L}_l = w_{ce}\cdot\mathcal{L}_{ce}\left(\hat{\mathbf{M}},\mathbf{M}\right) + w_{dice}\cdot\mathcal{L}_{dice}\left(\hat{\mathbf{M}},\mathbf{M}\right),\tag{1}\] where \(\hat{\mathbf{M}}\) is the predicted mask, \(\mathcal{L}_{ce}\) is the cross-entropy loss, \(\mathcal{L}_{dice}\) is the soft Dice loss, and \(w_{ce}\) and \(w_{dice}\) are their respective weights.
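To make the loss concrete, the following is a minimal PyTorch sketch of Eq. (1); the function names (`soft_dice`, `labeled_loss`) and the one-hot mask convention are our own assumptions rather than the released implementation.

```python
# Minimal sketch of the labeled loss in Eq. (1): weighted cross-entropy plus soft Dice.
# Assumes logits of shape (B, C_c, H, W, D) and one-hot masks of the same shape.
import torch
import torch.nn.functional as F

def soft_dice(logits, target_onehot, eps=1e-6):
    """Soft Dice loss averaged over classes and batch."""
    probs = torch.softmax(logits, dim=1)
    dims = tuple(range(2, probs.ndim))                  # spatial dimensions
    intersection = (probs * target_onehot).sum(dims)
    union = probs.sum(dims) + target_onehot.sum(dims)
    return 1 - ((2 * intersection + eps) / (union + eps)).mean()

def labeled_loss(logits, target_onehot, w_ce=1.0, w_dice=1.0):
    ce = F.cross_entropy(logits, target_onehot.argmax(dim=1))
    return w_ce * ce + w_dice * soft_dice(logits, target_onehot)
```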
Pretrained models. M&N is model-agnostic, allowing a wide range of 2D vision models, \(f(\cdot;\theta_{nat})\), pretrained with different learning objectives, to be used. For \(f(\cdot;\theta_{nat})\) with an encoder-decoder architecture, such as [4], fine-tuning can be done by simply replacing the last layer to match the number of classes \(C_c\). Alternatively, for models with only a pretrained encoder, such as [17], a decoder needs to be appended and fine-tuned. Both cases are evaluated in Section 3.3.
Fine-tuning strategies. The pretrained \(f(\cdot;\theta_{nat})\) can be fine-tuned with different strategies, ranging from updating all the weights, \(\theta_{nat}\), to
fine-tuning only a subset of layers while keeping the rest frozen. We adopt Low-Rank Adaptation (LoRA) [18] as the default fine-tuning strategy for M&N, but
also investigate alternative options in Section 3.3 to provide a comprehensive comparison.
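As a reference for the default strategy, below is a minimal sketch of the LoRA idea applied to a single linear layer: the pretrained weight is frozen and only a low-rank update is trained. The class name, rank, and scaling factor are illustrative assumptions; the actual adapter placement follows [18].

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained nn.Linear with a trainable low-rank update (LoRA-style sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```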
Both models, \(f(\cdot;\theta_{nat})\) and \(g(\cdot;\theta_{med})\), are then trained on both the labeled subset, \(\mathcal{I}_L\) with \(\mathcal{M}_L\), and the unlabeled subset, \(\mathcal{I}_U\), as shown in Fig. 1.
Odd-numbered epochs. 2D slices, \(\{{\mathbf{S}^d_{i}}\}_{i=1, d=1}^{n,D}\), are extracted along the depth dimension \(D\) from \(\mathcal{I}_U = \{{\mathbf{I}_i}\}_{i=1}^n\), where \(\mathbf{S}^d_i \in \mathbb{R}^{C_i \times H \times W}\). These slices are input to \(f(\cdot;\theta_{nat})\) to generate the pseudo-masks, \(\{{\mathbf{P}_{i}}\}_{i=1}^{n}, \mathbf{P}_i \in \mathbb{R}^{C_c \times H \times W \times D}\):
\[\label{eq:pseudo_2d}
\mathbf{P}_i = concat\left(f(\mathbf{S}^1_{i};\theta_{nat}),\;f(\mathbf{S}^2_{i};\theta_{nat}),\;...,\;f(\mathbf{S}^D_{i};\theta_{nat})\right),\tag{2}\] where \(concat(\cdot)\) concatenates the 2D predicted masks across the depth dimension \(D\). Together with the masks predicted by \(g(\cdot;\theta_{med})\): \[\label{eq:prediction_3d}
[\hat{\mathbf{M}}_1,\;\hat{\mathbf{M}}_2,\;...,\;\hat{\mathbf{M}}_n] = [g(\mathbf{I}_1;\theta_{med}),\;g(\mathbf{I}_2;\theta_{med}),\;...,\;g(\mathbf{I}_n;\theta_{med})],\tag{3}\] the unlabeled loss, \(\mathcal{L}_{u}\), can be computed as: \[\label{eq:loss_unlabel}
\mathcal{L}_u = w_{kl}\cdot\mathcal{L}_{kl}\left(\hat{\mathbf{M}},\mathbf{P}\right) + w_{dice}\cdot\mathcal{L}_{dice}\left(\hat{\mathbf{M}},\mathbf{P}\right),\tag{4}\] where \(\mathcal{L}_{kl}\) is the Kullback-Leibler divergence loss and \(w_{kl}\) is its weight.
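A possible implementation of Eqs. (2)–(4) is sketched below, reusing `soft_dice` from the earlier labeled-loss sketch; the slice-wise batching and function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pseudo_masks_from_2d(f_nat, volume):
    """Run the 2D model on each depth slice of a (B, C, H, W, D) volume and restack (Eq. 2)."""
    slices = volume.unbind(dim=-1)                      # D tensors of shape (B, C, H, W)
    preds = [f_nat(s) for s in slices]                  # each (B, C_c, H, W)
    return torch.stack(preds, dim=-1)                   # (B, C_c, H, W, D)

def unlabeled_loss(pred_logits, pseudo_logits, w_kl=1.0, w_dice=1.0):
    """KL divergence plus soft Dice between predictions and pseudo-masks (Eq. 4)."""
    log_p = F.log_softmax(pred_logits, dim=1)
    q = torch.softmax(pseudo_logits, dim=1)
    kl = F.kl_div(log_p, q, reduction="batchmean")
    return w_kl * kl + w_dice * soft_dice(pred_logits, q)   # soft_dice from the sketch above
```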
Supervising with only \(\mathcal{L}_u\) may result in collapsed solutions, for example, outputting the same prediction regardless of the input. Therefore, the labeled loss, \(\mathcal{L}_l\) (Eq. (1)), is also computed, leading to the final co-training loss, \(\mathcal{L}_c\): \[\label{eq:loss_co} \mathcal{L}_c = \frac{b_l}{b_l+b_u}\cdot\mathcal{L}_l + \frac{b_u}{b_l+b_u}\cdot\mathcal{L}_u,\tag{5}\] where \(b_l\) and \(b_u\) are the numbers of labeled and unlabeled data in a batch.
At each back-propagation step, a stochastic optimization step is performed to minimize \(\mathcal{L}_c\) with respect to \(\theta_{med}\), training \(g(\cdot;\theta_{med})\): \[\label{eq:optimize_3d}
\theta_{med} \leftarrow optim(\theta_{med}, \nabla_{\theta_{med}}\mathcal{L}_c, \eta_{med}),\tag{6}\] where \(optim(\cdot)\) denotes the optimizer and \(\eta_{med}\) is the learning rate for \(g(\cdot;\theta_{med})\).
Even-numbered epochs. Pseudo-masks, \(\{{\mathbf{P}_{i}}\}_{i=1}^{n}\), are generated by \(g(\cdot;\theta_{med})\): \[\label{eq:pseudo_3d}
[\mathbf{P}_1,\;\mathbf{P}_2,\;...,\;\mathbf{P}_n] = [g(\mathbf{I}_1;\theta_{med}),\;g(\mathbf{I}_2;\theta_{med}),\;...,\;g(\mathbf{I}_n;\theta_{med})].\tag{7}\] Meanwhile, \(\{{\hat{\mathbf{M}}_i}\}_{i=1}^{n}\) are output by \(f(\cdot;\theta_{nat})\) from the input 2D slices, \(\{{\mathbf{S}^d_{i}}\}_{i=1, d=1}^{n,D}\): \[\label{eq:prediction_2d}
\hat{\mathbf{M}}_i = concat\left(f(\mathbf{S}^1_{i};\theta_{nat}),\;f(\mathbf{S}^2_{i};\theta_{nat}),\;...,\;f(\mathbf{S}^D_{i};\theta_{nat})\right).\tag{8}\] By minimizing the co-training loss, \(\mathcal{L}_c\) (Eq. (5)), with respect to \(\theta_{nat}\) (or a subset of \(\theta_{nat}\), depending on the fine-tuning strategies described in Section 2.1), the training of \(f(\cdot;\theta_{nat})\) can be summarized as: \[\label{eq:optimize_2d}
\theta_{nat} \leftarrow optim(\theta_{nat}, \nabla_{\theta_{nat}}\mathcal{L}_c, \eta_{nat}).\tag{9}\] By iterating between Eqs. (6) and (9), the two models, \(f(\cdot;\theta_{nat})\) and \(g(\cdot;\theta_{med})\), improve their performance by continuously learning from each other.
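Putting Eqs. (5)–(9) together, the alternation can be sketched as follows; it reuses the helper sketches above, and the loader setup, optimizer objects, and naming are assumptions rather than the released training script.

```python
# Alternating co-training sketch: odd-numbered epochs update the 3D model g(.; theta_med)
# against pseudo-masks from the 2D model; even-numbered epochs do the reverse.
for epoch in range(1, num_cotrain_epochs + 1):
    odd = (epoch % 2 == 1)
    for (x_l, m_l), x_u in zip(labeled_loader, unlabeled_loader):
        if odd:
            with torch.no_grad():
                pseudo = pseudo_masks_from_2d(f_nat, x_u)   # 2D model supplies pseudo-masks
            pred_l, pred_u = g_med(x_l), g_med(x_u)
            opt = opt_med                                   # update theta_med (Eq. 6)
        else:
            with torch.no_grad():
                pseudo = g_med(x_u)                         # 3D model supplies pseudo-masks
            pred_l = pseudo_masks_from_2d(f_nat, x_l)
            pred_u = pseudo_masks_from_2d(f_nat, x_u)
            opt = opt_nat                                   # update theta_nat (Eq. 9)
        b_l, b_u = x_l.size(0), x_u.size(0)
        loss = (b_l * labeled_loss(pred_l, m_l)
                + b_u * unlabeled_loss(pred_u, pseudo)) / (b_l + b_u)   # Eq. (5)
        opt.zero_grad()
        loss.backward()
        opt.step()
```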
Due to the significant imbalance between the size of the labeled set \(\mathcal{I}_L\) and that of the unlabeled set \(\mathcal{I}_U\), uniform data sampling from \(\mathcal{I}\) during the co-training may undermine the training stability and final performance. Even with strategies like over-sampling, a fixed sampling approach may not be optimal.
Ideally, the proportion of labeled and unlabeled images within a batch should adapt to the current state of the models. At early stages, when the models' predictions are still unstable and inaccurate, relying heavily on their generated pseudo-masks may be suboptimal. As training progresses and the predictions become more stable and accurate, the number of unlabeled images in a batch, which are supervised by pseudo-masks, should increase accordingly to maximize the utilization of unlabeled data.
This pattern aligns with learning rate decay, a widely adopted technique for training deep models [19]. Therefore, we propose learning rate guided sampling (LRG-sampling), which adaptively adjusts \(b_l\) and \(b_u\) (Eq. (5)) according to the current learning rate, \(\eta_{current}\), at each epoch. This can be formulated as: \[\label{eq:bu} b_u = \left\lfloor\frac{\eta_{initial}-\eta_{current}}{\eta_{initial}-\eta_{final}}\cdot B\right\rfloor,\tag{10}\] \[\label{eq:bl} b_l = B-b_u,\tag{11}\] where \(\lfloor \cdot \rfloor\) denotes the floor function, \(B\) is the batch size, and \(\eta_{initial}\) and \(\eta_{final}\) are the initial and final learning rates of the schedule.
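The batch split of Eqs. (10)–(11) amounts to a few lines; the sketch below assumes the learning rate is read from the scheduler at the start of each epoch, and the function name is ours.

```python
import math

def lrg_batch_split(eta_current, eta_initial, eta_final, batch_size):
    """Learning-rate-guided split of a batch into labeled and unlabeled samples (Eqs. 10-11)."""
    b_u = math.floor((eta_initial - eta_current) / (eta_initial - eta_final) * batch_size)
    return batch_size - b_u, b_u                        # (b_l, b_u)

# Example: with eta_initial=1e-4, eta_final=0 and B=5, an epoch where eta_current=4e-5
# gives b_l=2 labeled and b_u=3 unlabeled samples per batch.
```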
We conducted a comprehensive evaluation of M&N, comparing it with 13 state-of-the-art methods under various settings (i.e., training with different numbers of labeled images), and investigated the effect of the different components of M&N through ablation studies.
Datasets. We benchmarked on the left atrial (LA) cavity dataset [20] and the Pancreas-CT dataset [21]. The LA dataset consists of 3D gadolinium-enhanced cardiac MR images with ground-truth masks of the LA cavity. The resolution is \(0.625 \times 0.625 \times 0.625\,\text{mm}^3\). We normalized the voxel values of each image to the range \([0, 1]\) and then standardized them. There are 100 images in the training set and 56 in the testing set. The Pancreas-CT dataset contains 82 (62 training and 20 testing) contrast-enhanced abdominal 3D CT images with ground-truth masks of the pancreas. We re-sampled the images to a unified resolution of \(0.85 \times 0.85 \times 0.75\,\text{mm}^3\). The voxel values were clipped to the range \([-175, 250]\) Hounsfield units and then normalized to \([0, 1]\).
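For reference, the preprocessing described above can be expressed as follows; the epsilon terms and function names are our additions.

```python
import numpy as np

def preprocess_la(volume: np.ndarray) -> np.ndarray:
    """LA MR volumes: scale to [0, 1], then standardize."""
    v = (volume - volume.min()) / (volume.max() - volume.min() + 1e-8)
    return (v - v.mean()) / (v.std() + 1e-8)

def preprocess_ct(volume: np.ndarray) -> np.ndarray:
    """Pancreas CT volumes: clip to [-175, 250] HU, then scale to [0, 1]."""
    v = np.clip(volume, -175, 250)
    return (v - v.min()) / (v.max() - v.min() + 1e-8)
```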
Implementation details. M&N used SegFormer-B2 [4], pretrained on ImageNet-1K [3] and ADE20K [2], as the 2D model \(f(\cdot;\theta_{nat})\), and a randomly initialized 3D UNet [22] as the 3D segmentation network \(g(\cdot;\theta_{med})\). Other 2D and 3D models, including a UNet with a pretrained ResNet-50 encoder [17], [22] and SwinUNETR [15], were also evaluated in the ablation studies. The hyperparameters were set to \(w_{ce}=w_{kl}=w_{dice}=1\) and \(B=5\). During training, we applied a range of data augmentations, including random resizing with a scale of \(0.9\)–\(1.1\), gamma contrast adjustment with \(\gamma\in[0.8,1.2]\), and random cropping to \(160 \times 160 \times 64\) with a \(50\%\) chance of foreground inclusion. Pseudo-masks predicted by the 3D model were used to determine the foreground regions of unlabeled images. The two models were first trained and fine-tuned for 500 epochs (with 50 warm-up epochs) using the labeled images and then co-trained for another 3500 epochs, using a cosine learning rate decay schedule with \(\eta_{initial}=10^{-3}\) for \(f(\cdot;\theta_{nat})\), \(\eta_{initial}=10^{-4}\) for \(g(\cdot;\theta_{med})\), and \(\eta_{final}=0\). AdamW [23] was used as the optimizer.
Inference. For each volume, we sampled patches of size \(160 \times 160 \times 64\), matching the dimensions used during training, to ensure full coverage of the volume. Adjacent patches were sampled with a
50% overlap, and the predictions within the overlapping regions were averaged to obtain the final output. No data augmentation was applied during inference.
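A minimal sketch of this sliding-window inference is given below, assuming a channel-first \((C, H, W, D)\) volume no smaller than the patch size; helper names are illustrative and not taken from the released code.

```python
import torch

def _starts(dim, patch, stride):
    starts = list(range(0, dim - patch + 1, stride))
    if starts[-1] != dim - patch:                       # keep the last patch flush with the boundary
        starts.append(dim - patch)
    return starts

def sliding_window_predict(model, volume, patch=(160, 160, 64)):
    """Tile the volume with 50%-overlapping patches and average logits in overlaps."""
    C, H, W, D = volume.shape
    sx, sy, sz = (p // 2 for p in patch)                # 50% overlap between adjacent patches
    out = count = None
    for x in _starts(H, patch[0], sx):
        for y in _starts(W, patch[1], sy):
            for z in _starts(D, patch[2], sz):
                crop = volume[:, x:x + patch[0], y:y + patch[1], z:z + patch[2]]
                with torch.no_grad():
                    logits = model(crop.unsqueeze(0))[0]          # (C_c, h, w, d)
                if out is None:
                    out = torch.zeros(logits.shape[0], H, W, D)
                    count = torch.zeros(H, W, D)
                out[:, x:x + patch[0], y:y + patch[1], z:z + patch[2]] += logits
                count[x:x + patch[0], y:y + patch[1], z:z + patch[2]] += 1
    return out / count                                  # average overlapping predictions
```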
Evaluation metrics. Four metrics, namely Dice, Jaccard, 95\(\%\) Hausdorff Distance (\(\text{HD}_{95}\)) and Average Surface Distance (ASD), were used for evaluation. For
Dice and Jaccard, higher values indicate better performance, whereas lower values are desirable for \(\text{HD}_{95}\) and ASD.
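The two overlap metrics can be computed directly from binary masks, as in the sketch below; the distance metrics (\(\text{HD}_{95}\) and ASD) are typically computed with a library such as MedPy and are omitted here.

```python
import numpy as np

def dice_and_jaccard(pred: np.ndarray, gt: np.ndarray):
    """Dice and Jaccard (in %) between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2 * inter / (pred.sum() + gt.sum() + 1e-8)
    jaccard = inter / (np.logical_or(pred, gt).sum() + 1e-8)
    return 100 * dice, 100 * jaccard
```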
| Method | Dice (%)\(\uparrow\) | Jaccard (%)\(\uparrow\) | \(\text{HD}_{95}\) (vox)\(\downarrow\) | ASD (vox)\(\downarrow\) |
| 3D UNet | 88.26 | 79.89 | 11.19 | 3.29 |
| *SegFormer | 83.05 | 73.25 | 15.79 | 5.40 |
| **SegFormer | 88.61 | 79.87 | 10.49 | 3.35 |
| SASSNet[24] | 87.32 | 77.72 | 9.62 | 2.55 |
| MC-Net[25] | 87.71 | 78.31 | 9.36 | 2.18 |
| SS-Net[11] | 88.55 | 79.62 | 7.49 | 1.9 |
| MC-Net+[26] | 88.96 | 80.25 | 7.93 | 1.86 |
| CAML[27] | 89.62 | 81.28 | 8.76 | 2.02 |
| DK-UXNet[28] | 90.41 | 82.69 | 7.32 | 1.71 |
| UA-MT[29] | 84.25 | 73.48 | 13.84 | 3.36 |
| LG-ER-MT[30] | 85.54 | 75.12 | 13.29 | 3.77 |
| DUWM[7] | 85.91 | 75.75 | 12.67 | 3.31 |
| AD-MT[5] | 90.55 | 82.79 | 5.81 | 1.7 |
| BCP[6] | 89.62 | 81.31 | 6.81 | 1.76 |
| Co-BioNet[8] | 89.2 | 80.68 | 6.44 | 1.9 |
| GraphCL[10] | 90.24 | 82.31 | 6.42 | 1.71 |
| M&N(Ours) | 91.56 | 84.47 | 4.59 | 1.40 |
Bold indicates the top performance. \(\uparrow\) means higher values are more accurate. *SegFormer trained from scratch. **SegFormer pretrained on ADE20K [2].
| Method | Dice (%)\(\uparrow\) | Jaccard (%)\(\uparrow\) | \(\text{HD}_{95}\) (vox)\(\downarrow\) | ASD (vox)\(\downarrow\) |
| 3D UNet | 82.01 | 72.42 | 29.95 | 9.91 |
| *SegFormer | 71.01 | 58.4 | 14.34 | 7.07 |
| **SegFormer | 84.23 | 74.49 | 8.24 | 3.57 |
| SS-Net[11] | 86.33 | 76.15 | 9.97 | 2.31 |
| CAML[27] | 87.34 | 77.65 | 9.76 | 2.49 |
| DK-UXNet[28] | 85.96 | 75.91 | 11.72 | 2.64 |
| AD-MT[5] | 89.63 | 81.28 | 6.56 | 1.85 |
| BCP[6] | 88.02 | 78.72 | 7.9 | 2.15 |
| GraphCL[10] | 88.8 | 80 | 7.16 | 2.1 |
| M&N | 90.47 | 82.66 | 4.72 | 1.59 |
| Method | Dice (%)\(\uparrow\) | Jaccard (%)\(\uparrow\) | \(\text{HD}_{95}\) (vox)\(\downarrow\) | ASD (vox)\(\downarrow\) |
| 3D UNet | 59.53 | 44.17 | 18.52 | 1.79 |
| *SegFormer | 43.76 | 28.32 | 19.67 | 6.86 |
| **SegFormer | 68.57 | 53.34 | 16.39 | 4.77 |
| MC-Net[25] | 68.94 | 54.74 | 16.28 | 3.16 |
| MC-Net+[26] | 74.01 | 60.02 | 12.59 | 3.34 |
| AD-MT[5] | 80.21 | 67.51 | 7.18 | 1.66 |
| Co-BioNet[8] | 77.89 | 64.79 | 8.81 | 1.39 |
| M&N | 81.67 | 69.53 | 6.56 | 1.67 |
We compared M&N with 13 state-of-the-art approaches on the LA dataset, trained with 4 and 8 labeled images, and the Pancreas-CT dataset, trained with 6 labeled images.
As shown in Tables 3, [tab:la4], and [tab:ct6], our proposed M&N consistently outperformed all existing approaches across different datasets and settings (i.e., results in bold). Evaluated on different modalities (i.e., MRI and CT) and target structures (i.e., LA and pancreas) with different numbers of labeled training images, M&N demonstrated robustness and generalizability.
The results in the first three rows of the tables, particularly [tab:ct6] and Fig. 2, which present the more challenging CT pancreas segmentation task, validate the motivation behind our work. Specifically, pretraining on 2D natural images significantly helps medical segmentation (i.e., *SegFormer vs. **SegFormer), whereas the 3D UNet trained with only a small amount of labeled data struggled to achieve satisfactory results. However, it still outperformed a 2D network trained from scratch (i.e., 3D UNet vs. *SegFormer), justifying M&N's choice of distilling knowledge from a pretrained 2D model to a 3D model.
Compared with the best existing method, AD-MT [5], M&N uses a 3D UNet, whereas AD-MT uses VNet [16] as the segmentation model. The two models share very similar architectures, further verifying that the better performance of M&N is primarily attributable to differences in the learning framework design rather than the network architecture.
We focused on the low-labeled data regime, where fewer than 10 labeled images are available (i.e. 4, 6, and 8). This threshold was deliberately chosen based on our previous experience collaborating with clinicians, whose incentive to annotate data significantly drops when the required number increases by an order of magnitude (e.g. 9 to 10 or 90 to 100). This phenomenon is consistent with psychological and marketing concepts, such as categorical perception of numbers and the left-digit effect [31]. By limiting our evaluation to fewer than 10 labeled images, our experiment setting aimed to reflect realistic scenarios where annotation resources are scarce.
We investigated the effect of different components of M&N on the LA dataset, trained with 8 labeled images, with the results presented in Table 4.
Model-agnostic analysis. While the default pretrained SegFormer has an encoder-decoder architecture, we replaced it with a pretrained ResNet-50 [17] encoder connected to a randomly initialized decoder through skip connections, termed ResUNet. As shown in Table 4 (row 1), its performance decreased slightly but still outperformed the best existing method, AD-MT [5] (Table 3). Additionally, we replaced the 3D UNet with SwinUNETR [15] and observed a similar slight drop in performance (Table 4, row 2), while still surpassing AD-MT [5] on most metrics. These experiments validate the model-agnostic nature of M&N, confirming its ability to adapt to various architectures while maintaining superior performance.
Fine-tuning strategies. We compared three fine-tuning strategies for the pretrained SegFormer, namely LoRA [18], decoder fine-tuning, and whole-network fine-tuning. While freezing the encoder and updating only the decoder resulted in a performance drop (row 5), the other two strategies performed comparably (row 6), with LoRA slightly ahead on three out of four metrics. Since LoRA is also more parameter-efficient, we adopted it as the default strategy for M&N.
Training and data sampling. We ablated the iterative co-training by using fixed pseudo-mask training, where the pretrained 2D model was fine-tuned with the labeled images (Section 2.1) and then used to generate pseudo-masks for training the 3D model without further updates. This led to a performance decrease (row 3). Additionally, replacing LRG-sampling with uniform training data sampling resulted in an even more significant drop in performance (row 4). These experiments validate the effectiveness of our proposed components in M&N.
| Pretrained 2D model | Fine-tuning strategy* | 3D network | Iterative co-training | LRG-sampling | Dice (%)\(\uparrow\) | Jaccard (%)\(\uparrow\) | \(\text{HD}_{95}\) (vox)\(\downarrow\) | ASD (vox)\(\downarrow\) |
| ResUNet** | Whole | 3D UNet | ✓ | ✓ | 91.11 | 83.74 | 5.38 | 1.41 |
| SegFormer | LoRA | SwinUNETR | ✓ | ✓ | 90.67 | 83.03 | 5.24 | 2.16 |
| SegFormer | LoRA | 3D UNet | ✗ | ✓ | 89.4 | 81.32 | 6.06 | 1.74 |
| SegFormer | LoRA | 3D UNet | ✓ | ✗ | 87.14 | 77.35 | 8.43 | 2.36 |
| SegFormer | Decoder | 3D UNet | ✓ | ✓ | 89.49 | 81.28 | 6.44 | 1.75 |
| SegFormer | Whole | 3D UNet | ✓ | ✓ | 91.39 | 84.20 | 5.02 | 1.38 |
| SegFormer | LoRA | 3D UNet | ✓ | ✓ | 91.56 | 84.47 | 4.59 | 1.40 |
* Fine-tuning strategy refers to which part of the pretrained 2D model is updated. ** UNet with a ResNet-50 encoder pretrained on ImageNet-1K [3].
In summary, we present M&N, a model-agnostic framework for transferring knowledge from general vision models pretrained on 2D natural images to enhance semi-supervised 3D medical image segmentation. By iteratively co-training the 2D and 3D models and adaptively adjusting the proportion of labeled and unlabeled images within a batch throughout training, M&N achieves state-of-the-art performance on various publicly available datasets under different limited-labeled-data settings, outperforming 13 existing methods. As future work, we plan to extend M&N to other tasks in medical image analysis, for example, image registration. Our ultimate goal is to utilize abundant cross-domain knowledge to facilitate developments in medical image analysis.
PH. Yeung is funded by the Presidential Postdoctoral Fellowship from Nanyang Technological University. We thank Dr Madeleine Wyburd and Mr Valentin Bacher for their valuable suggestions and comments about the work.