Efficient Fine-Tuning of DINOv3 Pretrained on Natural Images for Atypical Mitotic Figure Classification in MIDOG 2025


Abstract

Atypical mitotic figures (AMFs) represent abnormal cell division associated with poor prognosis, yet their detection remains difficult due to low prevalence, subtle morphology, and inter-observer variability. The MIDOG 2025 challenge introduces a benchmark for AMF classification across multiple domains. In this work, we fine-tuned the recently published DINOv3-H+ vision transformer, pretrained on natural images, using low-rank adaptation (LoRA), training only \(\sim\)1.3M parameters, in combination with extensive augmentation and a domain-weighted Focal Loss to handle domain heterogeneity. Despite the domain gap, our fine-tuned DINOv3 transfers effectively to histopathology, reaching second place on the preliminary test set. These results highlight the advantages of DINOv3 pretraining and underline the efficiency and robustness of our fine-tuning strategy, yielding state-of-the-art results for the atypical mitosis classification task in MIDOG 2025.

Histopathology | Foundation Models | Atypical Mitotic Figures

guillaume.balezo@minesparis.psl.eu

Introduction

Mitotic activity is a central indicator of tumor proliferation and prognosis. Beyond simple counts, the distinction between normal mitotic figures (NMFs) and atypical mitotic figures (AMFs) is of particular interest, as AMFs reflect abnormal cell division processes and correlate with poor clinical outcomes. However, their identification is challenging due to low prevalence, subtle morphological differences, and low inter-rater agreement even among trained pathologists. Automated image analysis methods therefore have the potential to improve reproducibility and reduce observer bias in this task.

The Mitosis Domain Generalization Challenge 2025 (MIDOG25) [1] extends the scope of previous editions (MIDOG 2021 [2] and MIDOG 2022 [3]) with the goal of advancing robust AI-assisted cancer diagnosis. Task 2 introduces a dedicated benchmark for AMF classification, in which participants classify cropped cell patches (128×128 pixels) into NMF or AMF across multiple tumor types, species, scanners, and laboratories. The dataset comprises more than 12,000 annotated mitotic figures, with AMFs accounting for only \(\sim\)20% of cases. The challenge's evaluation metric is balanced accuracy, chosen to mitigate this strong class imbalance. Like the earlier MIDOG challenges, this benchmark addresses the crucial problem of robustness and generalization across domains, now extended to the clinically relevant task of atypical mitosis classification.

In this work, we tackle Task 2 by applying low-rank adaptation (LoRA) [4] to fine-tune DINOv3-H+-LVD1689M, a vision transformer (ViT) pretrained on natural images with the state-of-the-art DINOv3 self-supervised learning (SSL) method [5]. Recent progress suggests that such generic foundation models, though developed outside the biomedical domain, can be efficiently adapted to specialized medical imaging tasks. To enhance robustness across diverse domains and compensate for the limited number of atypical figures, we combine this strategy with extensive data augmentation and a domain-weighted Focal Loss. Our approach aims to test and leverage the representational power of generic DINOv3 SSL pretraining while ensuring efficient adaptation to the histology-specific and heterogeneous challenge dataset. In parallel, our team also explored a ConvNeXt-based solution, presented in [6], which achieved good performance as well, though slightly below DINOv3 on the preliminary test set, leading us to retain the DINOv3 approach as our final solution.

Material and Methods

Figure 1: Overview of our method during training: Input images are augmented (multi-Macenko, small translations, shear, coarse dropout, rotations, etc.) and normalized with ImageNet statistics. The classifier is a DINOv3-H+ pretrained on the LVD-1689M natural image dataset, fine-tuned with LoRA (rank 8, \(\alpha=16\), \(\sim\)1.3M trainable parameters) and followed by a linear head on the class token with sigmoid activation to output probabilities. Optimization is performed with a Domain-Weighted Focal Loss, which combines Focal Loss for class imbalance with domain reweighting to address dataset heterogeneity.

Dataset

The MIDOG 2025 [1] atypical mitosis training set is derived from 454 histopathology images spanning nine domains defined by different tumor types, species, scanners, and laboratories. Each mitotic figure was subtyped as normal or atypical by three expert pathologists in a blinded majority-vote setting.

In addition to the official MIDOG 2025 atypical training set, we incorporated three external resources. The AMi-Br [7] dataset provides mitotic figures from MIDOG 2021 [2] and TUPAC16 [8]; to avoid overlap, we only used the TUPAC16 cohort. The AtNorM-Br [9] dataset contains mitotic figures from the TCGA [10] breast cancer cohort, annotated by an expert pathologist. Finally, the OMG-Octo dataset [11] was created by screening large histopathology data with a model pretrained on AMi-Br and MIDOG25, followed by expert review of candidate mitoses.

After removing duplicate images, our training set comprised 11,939 mitotic figures from MIDOG 2025 (10,191 normal, 1,748 atypical), 1,999 mitotic figures from AMi-Br (1,571 normal, 428 atypical), 711 from AtNorM-Br (587 normal, 124 atypical), and 1,752 from OMG-Octo (378 normal, 1,374 atypical), resulting in a total of 16,401 figures (12,727 normal and 3,674 atypical). All datasets were provided as 128×128 pixel crops centered on the mitotic figure, except for OMG-Octo, which was originally 64×64 pixels and was resized to 128×128 for training, corresponding to a resolution of \(0.25\,\mu\text{m/pixel}\).

The preliminary test set provided for Task 2 consisted of mitotic figure crops from four tumor types not included in the final test data. It was made available on the challenge platform two weeks prior to submission for debugging purposes. The final test set consists of patches from 120 cases covering 12 distinct tumor types from both human and veterinary pathology, with 10 cases per tumor type. This set spans multiple laboratories and scanning systems and was used for the official evaluation. Performance was assessed using balanced accuracy, computed over all patches of the test set.
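For illustration, balanced accuracy is the unweighted mean of per-class recalls; the small scikit-learn snippet below (not part of the challenge code) shows why it is robust to the \(\sim\)20% AMF prevalence, unlike plain accuracy.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Balanced accuracy averages per-class recalls, so a trivial
# majority-class predictor does not score well under imbalance.
y_true = np.array([0, 0, 0, 0, 1])  # four NMF, one AMF
y_pred = np.array([0, 0, 0, 0, 0])  # predict everything as normal
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 (plain accuracy: 0.8)
```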

Methods

In this section, we detail the main components of our workflow (Figure 1): the network training setup, the proposed Domain-Weighted Focal Loss, the data augmentation strategy, and test-time augmentation.

Network Training

We trained our model on 128\(\times\)128 pixel image crops, matching the challenge’s original patch size. Our model is a DINOv3-H+ ViT pretrained on the LVD-1689M natural image dataset, fine-tuned for Task 2 with low-rank adaptation (LoRA; rank 8, \(\alpha_{\text{LoRA}}=16.0\), dropout 0.05, applied only to the query and value projections in the attention layers), resulting in only about 1.3M trainable parameters. A linear classification head with 0.2 dropout was added to produce logits from the class token. Training was run with a batch size of 16 and mixed precision (FP16) to speed up training and reduce memory consumption without affecting predictive performance. We used the AdamW optimizer (learning rate \(1 \times 10^{-4}\), weight decay 0.1, \(\epsilon=1 \times 10^{-7}\)) with a cosine schedule and linear warmup during the first 10% of training (from \(8.47 \times 10^{-7}\) to the base rate). Gradient norms were clipped at 1.0 for stability. Inputs were normalized with ImageNet statistics, consistent with DINOv3 pretraining. The final submitted model was trained for 60 epochs on the full combined dataset (AMi-Br TUPAC16, MIDOG25, AtNorM-Br, and OMG-Octo).
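To make the adaptation mechanism concrete, the sketch below shows a minimal pure-PyTorch LoRA wrapper for the query and value projections. This is an illustration, not our exact code: in practice libraries such as Hugging Face peft implement the same idea, and the attribute names in the commented wiring are placeholders rather than the actual DINOv3 module names.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze a pretrained linear layer and add a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0,
                 dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        # B is zero-initialized so training starts exactly at the pretrained model.
        self.lora_A = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / rank  # effective update: (alpha / rank) * B @ A

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.dropout(x) @ (self.lora_B @ self.lora_A).T
        return self.base(x) + self.scaling * delta

# Hypothetical wiring; `blocks`, `attn.q_proj`, and `attn.v_proj` are
# placeholder names that depend on the actual backbone implementation:
# for blk in backbone.blocks:
#     blk.attn.q_proj = LoRALinear(blk.attn.q_proj)
#     blk.attn.v_proj = LoRALinear(blk.attn.v_proj)
```

Only `lora_A` and `lora_B` receive gradients, which is how the trainable parameter count stays around 1.3M for a model of this size.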

Domain-Weighted Focal Loss

To address class imbalance (\(\sim\)20% atypical) and domain heterogeneity, we used the Focal Loss [12] (\(\alpha=0.25\), \(\gamma=2\)) extended with domain reweighting (DW-Focal Loss): domain weights were set to the inverse square root of domain size, the ratio between the largest and smallest weights was capped at 3 to avoid instability from large values, and the weights were normalized to sum to 1:

\[\mathcal{L}_{\text{DW-Focal}} = - \frac{1}{N} \sum_{i=1}^N w_{d(i)}\, \alpha\, (1 - p_{i,y_i})^{\gamma} \log p_{i,y_i},\]

where \(N\) is the number of samples, \(p_{i,y_i}\) is the predicted probability for the ground-truth class \(y_i\), and \(w_{d(i)}\) is the weight of sample \(i\)’s domain \(d(i)\), computed as:

\[u_d = n_d^{-1/2},\; v_d = \min(u_d,\, 3 \cdot \min_j u_j),\; w_d = \frac{v_d}{\sum_j v_j},\]

where \(n_d\) is the number of samples in domain \(d\).

The Focal Loss reduces the relative contribution of well-classified (easy) samples and forces the model to focus on uncertain or difficult ones, which in practice helps mitigate class imbalance by giving more weight to atypical figures that are harder to classify. It also implicitly emphasizes hard negatives and hard positives, improving the model’s learning ability; such examples proved crucial in tackling Task 1 of MIDOG 2025.
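A minimal PyTorch sketch of the loss defined above, under our assumption of a single-logit sigmoid head; shapes and helper names are illustrative:

```python
import torch

def domain_weights(domain_sizes: torch.Tensor, cap: float = 3.0) -> torch.Tensor:
    """u_d = n_d^(-1/2), ratio-capped at `cap`, normalized to sum to 1."""
    u = domain_sizes.float().pow(-0.5)
    v = torch.minimum(u, cap * u.min())
    return v / v.sum()

def dw_focal_loss(logits, targets, domain_ids, w, alpha=0.25, gamma=2.0):
    """Domain-weighted focal loss, following the equation above.

    logits: (N,) raw scores; targets: (N,) in {0, 1};
    domain_ids: (N,) integer domain of each sample; w: (D,) domain weights.
    """
    p = torch.sigmoid(logits)
    p_t = torch.where(targets.bool(), p, 1.0 - p)  # probability of the true class
    focal = -alpha * (1.0 - p_t).pow(gamma) * p_t.clamp_min(1e-8).log()
    return (w[domain_ids] * focal).mean()

# Illustrative only: weights if each of the four data sources were one domain,
# using the sizes from the Dataset section.
# w = domain_weights(torch.tensor([11939., 1999., 711., 1752.]))
```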

Data Augmentation

We also applied extensive online augmentations, including color jitter, JPEG compression, stain augmentation (multi-Macenko [13], [14] with random stain domain references), defocus blur, affine transforms, D4 symmetry, coarse dropout (up to two random boxes), and a custom black-border augmentation to mimic zero-padded regions in the training data.

We converted the multi-Macenko normalizer [13] into an augmentation by first extracting stain matrices from 10 randomly selected images per domain, using both the training set and additional cases from MITOS CMC [15], MITOS CCMCT [16], and TCGA COAD/BLCA [10]. This better addresses domain shift, especially from unseen domains in the final test set, by simulating a wider range of staining variations. During training, we sampled a domain and a random subset of these references, averaged their stain matrices as proposed in [13], and used the result as the target for Macenko normalization. To mimic staining variability, we further perturbed the normalized images in stain space with additive and multiplicative uniform noise (\(\sigma=0.2\)).
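The perturbation step can be sketched as follows. The reference-matrix extraction and the Macenko normalization itself happen beforehand (not shown), and the uniform-noise model with half-width \(\sigma\) is our reading of the procedure:

```python
import numpy as np

def perturb_in_stain_space(img_rgb: np.ndarray, stain_matrix: np.ndarray,
                           sigma: float = 0.2) -> np.ndarray:
    """Jitter stain concentrations with multiplicative and additive uniform noise.

    img_rgb: (H, W, 3) uint8 image, already Macenko-normalized to the sampled
    target; stain_matrix: (3, 2) H&E stain vectors (columns) of that target.
    """
    od = -np.log((img_rgb.reshape(-1, 3).astype(np.float64) + 1.0) / 256.0)
    # Least-squares stain concentrations: od^T ≈ stain_matrix @ conc
    conc, *_ = np.linalg.lstsq(stain_matrix, od.T, rcond=None)
    mult = np.random.uniform(1.0 - sigma, 1.0 + sigma, size=(2, 1))
    add = np.random.uniform(-sigma, sigma, size=(2, 1))
    conc = conc * mult + add
    od_new = (stain_matrix @ conc).T
    out = 256.0 * np.exp(-od_new) - 1.0
    return out.clip(0, 255).reshape(img_rgb.shape).astype(np.uint8)

# Hypothetical usage, with `target_matrix` the average of a random subset of
# one domain's precomputed reference stain matrices:
# img_aug = perturb_in_stain_space(img_norm, target_matrix, sigma=0.2)
```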

Test-Time Augmentation

At inference, we employed test-time augmentation by averaging logits over four rotated views (0°, 90°, 180°, and 270°) to improve robustness.
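A minimal sketch of this rotation-based TTA, assuming a model that maps an image batch to logits:

```python
import torch

@torch.no_grad()
def predict_tta(model, images: torch.Tensor) -> torch.Tensor:
    """Average logits over the four 90° rotations of an (N, C, H, W) batch."""
    logits = torch.stack(
        [model(torch.rot90(images, k, dims=(-2, -1))) for k in range(4)]
    )
    return logits.mean(dim=0)
```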

Evaluation and Results

Overall Performance

We evaluated our method with DINOv3-H+ using 4-fold cross-validation on the training data, holding out the AMi-Br TUPAC16 subset as an external test set. In each fold, we measured balanced accuracy (BA) on the held-out validation split and on AMi-Br TUPAC16, and report the mean \(\pm\) standard deviation across folds. We also report the performance on the preliminary test set of the submitted DINOv3-H+ model, trained on the full training set (Table 1). For comparison, we included DINOv3-ConvNeXt-Tiny, a distilled variant pretrained on LVD-1689M, as a strong lightweight baseline motivated by our parallel approach [6]. Unlike our LoRA-based approach, DINOv3-ConvNeXt-Tiny was fully fine-tuned with a dropout of 0.2, resulting in about 21× more trainable parameters despite fewer parameters overall. DINOv3-H+ with LoRA consistently outperformed the ConvNeXt baseline across all evaluations. Its gap between validation and the external AMi-Br TUPAC test set was also smaller, likely because full fine-tuning of ConvNeXt-Tiny led to stronger overfitting to the training domains.

Table 1: Performance on 4-fold cross-validation (mean \(\pm\) std), the external AMi-Br (TUPAC) test set, and the preliminary test set\(^*\). \(^*\)Evaluated with a model trained on the full training data (outside CV).

| Split | DINOv3-H+ | ConvNeXt-Tiny |
|---|---|---|
| Cross-Validation | 0.9485 \(\pm\) 0.0038 | 0.9352 \(\pm\) 0.0032 |
| AMi-Br TUPAC | 0.8475 \(\pm\) 0.0060 | 0.8085 \(\pm\) 0.0034 |
| Preliminary Test\(^*\) | 0.9045 | – |

Effect of Domain Reweighting Loss

Replacing the Focal Loss with the Domain-Weighted Focal Loss improved mean domain-balanced accuracy in cross-validation (0.9284 → 0.9341) and yielded a notable gain on the preliminary test set (0.8870 → 0.9045), demonstrating increased robustness across domains (Table 2). On the AMi-Br test set, however, performance remained almost unchanged (0.8467 → 0.8475), likely due to a stronger, unseen domain shift between the test and training domains, which reweighting cannot address since it only balances the training domains.

Table 2: Balanced accuracy (BA) with Focal vs. Domain-Weighted Focal (DW-Focal). CV reports mean BA across domains (\(\pm\) std); AMi-Br and Preliminary Test report overall BA. \(^*\)Model trained on full data.

| Split | Focal | DW-Focal |
|---|---|---|
| Cross-Validation (mean BA across domains) | 0.9284 \(\pm\) 0.0031 | 0.9341 \(\pm\) 0.0057 |
| AMi-Br TUPAC (overall BA) | 0.8467 \(\pm\) 0.0060 | 0.8475 \(\pm\) 0.0060 |
| Preliminary Test\(^*\) (overall BA) | 0.8870 | 0.9045 |

SSL Fine-Tuning on Mitosis Images

We also explored continuing the self-supervised training of the DINOv3-L-LVD1689M model on mitosis-like images. Using a mitosis detector trained on Task 1, we collected candidate patches from TCGA BLCA/COAD, MITOS CMC/CCMCT, and the Task 2 training data, resulting in a dataset of \(\sim\)260k crops (128×128). Using the official DINOv3 pipeline, we performed self-supervised fine-tuning of LoRA parameters, starting from the ImageNet-pretrained weights. Because of limited computational resources (4 GPUs with a global batch size of 64, compared to the original global batch size of 2048), this experiment was strongly constrained. Nevertheless, we evaluated the learned representations with linear probing on the full Task 2 dataset, following standard practice in SSL. This yielded an improvement of almost 10 percentage points in balanced accuracy, from 53.44% with DINOv3-L LVD1689M to 63.11%, suggesting that large-scale SSL on mitosis images could be beneficial. Due to time and budget constraints, however, we did not fully investigate DINOv3 pretraining on histology images, leaving it as a promising direction for future work.

Table 3: Balanced accuracy from linear probing on Task 2, before and after SSL fine-tuning on mitosis-like images.

| Model | LVD1689M | PEFT SSL fine-tuned |
|---|---|---|
| DINOv3-L | 0.5344 | 0.6311 |
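For concreteness, a linear-probing sketch under our assumptions: a frozen backbone returning class-token features and a scikit-learn logistic-regression probe (the exact protocol may differ).

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

@torch.no_grad()
def extract_features(backbone, loader, device="cuda"):
    """Frozen-backbone features for linear probing (assumes CLS-token output)."""
    backbone.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# Hypothetical usage with train/test loaders over the Task 2 crops:
# X_tr, y_tr = extract_features(backbone, train_loader)
# X_te, y_te = extract_features(backbone, test_loader)
# probe = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
# print(balanced_accuracy_score(y_te, probe.predict(X_te)))
```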

Discussion

In this work, we demonstrated that DINOv3-H+ with LoRA fine-tuning constitutes a leading approach for atypical mitosis classification in MIDOG 2025, training only \(\sim\)1.3M parameters. Despite the domain gap, the ImageNet-pretrained DINOv3 transferred well to histopathology thanks to the efficacy of our fine-tuning strategy, achieving second place on the preliminary test set. This adds to the evidence that generalist foundation models, even when trained on natural images, can capture meaningful patterns for biomedical imaging tasks and support specialized applications such as atypical mitosis classification.

DINOv3-H+ remains a large model, which can be costly at inference, especially since mitosis classification often follows object detection producing hundreds of thousands of candidate patches. Our parallel ConvNeXt approach showed that smaller networks can reach near state-of-the-art performance with fewer overall parameters. Future directions include knowledge distillation to develop more efficient models. Stronger augmentations, such as optical or grid distortion, and increased LoRA capacity (higher rank or more adapted layers) could also improve adaptation to histology images.

Another promising direction is tighter integration with Task 1 of the challenge. This could involve adding hard-negative mitoses to the training set, exploring DINOv3 as a backbone for mitosis detection given its strong performance on dense downstream tasks, or applying our classification approach on top of a mitosis detector (Task 1) to further improve performance. Future work may also explore DINOv3 SSL pretraining on histopathology images, especially mitosis-like patches. Such extensions could further improve generalization across domains and strengthen the role of foundation models for mitosis subtyping.

CRediT authorship contribution statement

Guillaume Balezo: Conceptualization; Methodology; Software; Investigation; Data curation; Validation; Visualization; Project administration; Writing – original draft; Writing – review & editing.

Hana Feki: Conceptualization; Methodology; Software; Investigation; Data curation; Validation; Writing – review & editing.

Raphaël Bourgade: Conceptualization; Methodology; Data curation; Project administration; Validation; Writing – review & editing.

Lily Monnier: Conceptualization; Data curation; Writing – review & editing.

Alice Blondel: Conceptualization; Project administration; Writing – review & editing.

Albert Pla Planas: Resources; Supervision; Project administration; Writing – review & editing.

Thomas Walter: Conceptualization; Resources; Supervision; Project administration; Writing – review & editing.

References

[1]
Jonas Ammeling, Marc Aubreville, Sweta Banerjee, Christof A. Bertram, Katharina Breininger, Dominik Hirling, Peter Horvath, Nikolas Stathonikos, and Mitko Veta. Mitosis domain generalization challenge 2025, March 2025.
[2]
Marc Aubreville, Nikolas Stathonikos, Christof A Bertram, Robert Klopfleisch, Natalie Ter Hoeve, Francesco Ciompi, Frauke Wilm, Christian Marzahl, Taryn A Donovan, Andreas Maier, et al. Mitosis domain generalization in histopathology images—the MIDOG challenge. Medical Image Analysis, 84: 102699, 2023.
[3]
Marc Aubreville, Christof Bertram, Katharina Breininger, Samir Jabari, Nikolas Stathonikos, and Mitko Veta. Mitosis domain generalization challenge 2022. In 25th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2022), 2022.
[4]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
[5]
Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
[6]
Hana Feki, Alice Blondel, and Thomas Walter. Convnext with histopathology-specific augmentations for mitotic figure classification, 2025.
[7]
Christof A. Bertram, Viktoria Weiss, Taryn A. Donovan, Sweta Banerjee, Thomas Conrad, Jonas Ammeling, Robert Klopfleisch, Christopher Kaltenecker, and Marc Aubreville. Histologic dataset of normal and atypical mitotic figures on human breast cancer (AMi-Br). In Christoph Palm, Katharina Breininger, Thomas Deserno, Heinz Handels, Andreas Maier, Klaus H. Maier-Hein, and Thomas M. Tolxdorff, editors, Bildverarbeitung für die Medizin 2025, pages 113–118, Wiesbaden, 2025. Springer Fachmedien Wiesbaden. ISBN 978-3-658-47422-5.
[8]
Mitko Veta, Yujing J. Heng, Nikolas Stathonikos, Babak Ehteshami Bejnordi, Francisco Beca, Thomas Wollmann, Karl Rohr, Manan A. Shah, Dayong Wang, Mikael Rousson, Martin Hedlund, David Tellez, Francesco Ciompi, Erwan Zerhouni, David Lanyi, Matheus Viana, Vassili Kovalev, Vitali Liauchuk, Hady Ahmady Phoulady, Talha Qaiser, Simon Graham, Nasir Rajpoot, Erik Sjöblom, Jesper Molin, Kyunghyun Paeng, Sangheum Hwang, Sunggyun Park, Zhipeng Jia, Eric I-Chao Chang, Yan Xu, Andrew H. Beck, Paul J. van Diest, and Josien P.W. Pluim. Predicting breast tumor proliferation from whole-slide images: the TUPAC16 challenge. Medical Image Analysis, 54: 111–121, 2019. ISSN 1361-8415.
[9]
Sweta Banerjee, Viktoria Weiss, Taryn A Donovan, Rutger HJ Fick, Thomas Conrad, Jonas Ammeling, Nils Porsche, Robert Klopfleisch, Christopher Kaltenecker, Katharina Breininger, et al. Benchmarking deep learning and vision foundation models for atypical vs. normal mitosis classification with cross-dataset evaluation. arXiv preprint arXiv:2506.21444, 2025.
[10]
John N. Weinstein, The Cancer Genome Atlas Research Network, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45 (10): 1113–1120, 2013.
[11]
Z. Shen, M. A. Hawkins, E. Baer, K. Bräutigam, and C.-A. Collins Fekete. OMG-Octo dataset, 2025.
[12]
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[13]
Desislav Ivanov, Carlo Alberto Barbano, and Marco Grangetto. Multi-target stain normalization for histology slides. In International Workshop on Medical Optical Imaging and Virtual Microscopy Image Analysis, pages 36–44. Springer, 2024.
[14]
Marc Macenko, Marc Niethammer, James S Marron, David Borland, John T Woosley, Xiaojun Guan, Charles Schmitt, and Nancy E Thomas. A method for normalizing histology slides for quantitative analysis. In 2009 IEEE international symposium on biomedical imaging: from nano to macro, pages 1107–1110. IEEE, 2009.
[15]
Marc Aubreville, Christof A Bertram, Taryn A Donovan, Christian Marzahl, Andreas Maier, and Robert Klopfleisch. A completely annotated whole slide image dataset of canine breast cancer to aid human breast cancer research. Scientific data, 7 (1): 417, 2020.
[16]
Christof A Bertram, Marc Aubreville, Christian Marzahl, Andreas Maier, and Robert Klopfleisch. A large-scale dataset for mitotic figure assessment on whole slide images of canine cutaneous mast cell tumor. Scientific data, 6 (1): 274, 2019.