August 28, 2025
Atypical mitotic figures (AMFs) represent abnormal cell division associated with poor prognosis, yet their detection remains difficult due to low prevalence, subtle morphology, and inter-observer variability. The MIDOG 2025 challenge introduces a benchmark for AMF classification across multiple domains. In this work, we fine-tuned the recently published DINOv3-H+ vision transformer, pretrained on natural images, using low-rank adaptation (LoRA), training only \(\sim\)1.3M parameters, in combination with extensive augmentation and a domain-weighted Focal Loss to handle domain heterogeneity. Despite the domain gap, our fine-tuned DINOv3 transfers effectively to histopathology, reaching second place on the preliminary test set. These results highlight the advantages of DINOv3 pretraining and underline the efficiency and robustness of our fine-tuning strategy, yielding state-of-the-art results for the atypical mitosis classification task of MIDOG 2025.
Histopathology | Foundation Models | Atypical Mitotic Figures
Mitotic activity is a central indicator of tumor proliferation and prognosis. Beyond simple counts, the distinction between normal mitotic figures (NMFs) and atypical mitotic figures (AMFs) is of particular interest, as AMFs reflect abnormal cell division processes and correlate with poor clinical outcomes. However, their identification is challenging due to low prevalence, subtle morphological differences, and low inter-rater agreement even among trained pathologists. Automated image analysis methods therefore have the potential to improve reproducibility and reduce observer bias in this task.
The Mitosis Domain Generalization Challenge 2025 (MIDOG25) [1] extends the scope of previous editions (MIDOG 2021 [2] and MIDOG 2022 [3]) with the goal of advancing robust AI-assisted cancer diagnosis. Task 2 introduces a dedicated benchmark for AMF classification, where participants are asked to classify cropped cell patches (128×128 pixels) into NMF or AMF across multiple tumor types, species, scanners, and laboratories. The dataset comprises more than 12,000 annotated mitotic figures, with AMFs accounting for only \(\sim\)20% of cases. The challenge's evaluation metric is balanced accuracy, chosen to mitigate this strong class imbalance. Like the earlier MIDOG challenges, this benchmark addresses the crucial problem of robustness and generalization across domains, now extended to the clinically relevant task of atypical mitosis classification.
In this work, we tackle Task 2 by applying low-rank adaptation (LoRA) [4] to fine-tune DINOv3-H+-LVD1689M, a vision transformer (ViT) pretrained on natural images using the state-of-the-art DINOv3 self-supervised learning (SSL) method [5]. Recent progress suggests that such generic foundation models, though developed outside the biomedical domain, can be efficiently adapted to specialized medical imaging tasks. To enhance robustness across diverse domains and compensate for the limited number of atypical figures, we combine this strategy with extensive data augmentation and a domain-weighted Focal Loss. Our approach aims to test and leverage the representational power of generic DINOv3 SSL pretraining, while ensuring efficient adaptation to the histology-specific and heterogeneous challenge dataset. In parallel, our team also explored a ConvNeXt-based solution, presented in [6], which achieved good performance but fell slightly below DINOv3 on the preliminary test set, leading us to retain the latter as our final solution.
The MIDOG 2025 [1] atypical mitosis training set is derived from 454 histopathology images spanning nine domains defined by different tumor types, species, scanners, and laboratories. Each mitotic figure was subtyped as normal or atypical by three expert pathologists in a blinded majority-vote setting.
In addition to the official MIDOG 2025 atypical training set, we incorporated three external resources. The AMi-Br [7] dataset provides mitotic figures from MIDOG 2021 [2] and TUPAC16 [8]; to avoid overlap, we only used the TUPAC16 cohort. The AtNorM-Br [9] dataset contains mitotic figures from the TCGA [10] breast cancer cohort, annotated by an expert pathologist. Finally, the OMG-Octo dataset [11] was created by screening large histopathology data with a model pretrained on AMi-Br and MIDOG25, followed by expert review of candidate mitoses.
After removing duplicate images, our training set comprised 11,939 mitotic figures from MIDOG 2025 (10,191 normal, 1,748 atypical), 1,999 mitotic figures from AMi-Br (1,571 normal, 428 atypical), 711 from AtNorM-Br (587 normal, 124 atypical), and 1,752 from OMG-Octo (378 normal, 1,374 atypical), resulting in a total of 16,398 figures (12,724 normal and 3,674 atypical). All datasets were provided as 128×128 pixel crops centered on the mitotic figure, except for OMG-Octo, which was originally 64×64 pixels and resized to 128×128 for training, corresponding to a resolution of \(0.25\,\mu\text{m/pixel}\).
The preliminary test set provided for Task 2 consisted of mitotic figure crops from four tumor types not included in the final test data. It was made available on the challenge platform two weeks prior to submission for debugging purposes. The final test set consists of patches from 120 cases covering 12 distinct tumor types from both human and veterinary pathology, with 10 cases per tumor type. This set spans multiple laboratories and scanning systems and was used for the official evaluation. Performance was assessed using balanced accuracy, computed over all patches of the test set.
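For reference, balanced accuracy is the unweighted mean of per-class recalls, so a trivial classifier that always predicts NMF scores 0.5 rather than \(\sim\)0.8. A minimal computation with scikit-learn, on illustrative labels:

```python
from sklearn.metrics import balanced_accuracy_score

# 0 = NMF, 1 = AMF; toy labels for illustration only
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

# Mean of per-class recalls: (2/3 + 1/2) / 2 ≈ 0.583
print(balanced_accuracy_score(y_true, y_pred))
```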
In this section, we detail the main components of our workflow (Figure 1): network training setup, the proposed Domain-Weighted Focal Loss, data augmentation strategy, and test-time augmentation.
We trained our model on 128\(\times\)128 pixel image crops, matching the challenge's original patch size. Our model is a DINOv3-H+ ViT pretrained on the LVD-1689M natural image dataset, fine-tuned for Task 2 with low-rank adaptation (LoRA; \(rank=8\), \(\alpha_{\text{LoRA}}=16.0\), dropout 0.05, applied only to the query and value projections in the attention layers), resulting in only about 1.3M trainable parameters. A linear classification head with 0.2 dropout was added to produce logits from the class token. Training was run with a batch size of 16 and mixed precision (FP16) to speed up training and reduce memory consumption without affecting predictive performance. We used the AdamW optimizer (learning rate \(1 \times 10^{-4}\), weight decay 0.1, \(\epsilon=1 \times 10^{-7}\)) with a cosine scheduler and linear warmup during the first 10% of training (from \(8.47 \times 10^{-7}\) to the base rate). Gradient norms were clipped at 1.0 for stability. Inputs were normalized with ImageNet statistics, consistent with DINOv3 pretraining. The final submitted model was trained for 60 epochs on the full combined dataset (AMi-Br TUPAC16, MIDOG25, AtNorM-Br, and OMG-Octo).
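Below is a minimal sketch of this setup with the PEFT library. The hub entrypoint name, the `q_proj`/`v_proj` module names, and the 1280-dimensional class token are assumptions to be matched against the actual DINOv3 release; this illustrates the configuration, not our exact training code.

```python
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Hypothetical hub entrypoint; the actual DINOv3-H+ loader/checkpoint may differ.
backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_vith16plus")

# LoRA restricted to the attention query/value projections, as in the paper.
# "q_proj"/"v_proj" are assumed module names; match them to the backbone's naming.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
backbone = get_peft_model(backbone, lora_cfg)
backbone.print_trainable_parameters()  # on the order of 1M trainable parameters

embed_dim = 1280  # assumed class-token dimension for ViT-H+
head = nn.Sequential(nn.Dropout(0.2), nn.Linear(embed_dim, 2))

trainable = [p for m in (backbone, head) for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.1, eps=1e-7)
```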
To address class imbalance (\(\sim\)20% atypical) and domain heterogeneity, we used the Focal Loss [12] (\(\alpha=0.25\), \(\gamma=2\)) extended with domain reweighting (DW-Focal Loss): domain weights were set to the inverse square root of domain size, the ratio between the largest and smallest weights was capped at 3 to avoid instability from large values, and the weights were normalized to sum to 1.
\[\mathcal{L}_{\text{DW-Focal}} = - \frac{1}{N} \sum_{i=1}^N w_{d(i)}\, \alpha\, (1 - p_{i,y_i})^{\gamma} \log p_{i,y_i},\]
where \(N\) is the number of samples, \(p_{i,y_i}\) is the predicted probability for the ground-truth class \(y_i\), and \(w_{d(i)}\) is the weight of sample \(i\)’s domain \(d(i)\), computed as:
\[u_d = n_d^{-1/2},\; v_d = \min(u_d,\, 3 \cdot \min_j u_j),\; w_d = \frac{v_d}{\sum_j v_j},\]
where \(n_d\) is the number of samples in domain \(d\).
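A direct PyTorch translation of these equations, as a sketch: `alpha` is the constant scaling factor from the loss above, and the usage example treats each source dataset as one domain, using the sizes from the data section.

```python
import torch
import torch.nn.functional as F

def domain_weights(domain_sizes, cap=3.0):
    """Compute w_d as above: inverse-sqrt of domain size, ratio capped, normalized."""
    u = domain_sizes.float() ** -0.5              # u_d = n_d^{-1/2}
    v = torch.clamp(u, max=cap * u.min().item())  # cap largest/smallest ratio at 3
    return v / v.sum()                            # normalize to sum to 1

def dw_focal_loss(logits, targets, domains, weights, alpha=0.25, gamma=2.0):
    """Domain-weighted Focal Loss, following the equation above (constant alpha)."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_{i, y_i}
    pt = log_pt.exp()
    focal = -alpha * (1.0 - pt) ** gamma * log_pt
    return (weights[domains] * focal).mean()

# Illustration: one weight per source dataset, with the sizes reported above
sizes = torch.tensor([11939, 1999, 711, 1752])  # MIDOG25, AMi-Br, AtNorM-Br, OMG-Octo
weights = domain_weights(sizes)
```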
The Focal Loss reduces the relative contribution of well-classified (easy) samples and forces the model to focus on uncertain or difficult ones, which in practice helps mitigate class imbalance by giving more weight to atypical figures that are harder to classify. It also implicitly emphasizes hard negatives and hard positives, improving the model's learning ability, since such examples have proven crucial in tackling Task 1 of MIDOG 2025.
We also applied extensive online augmentations, including color jitter, JPEG compression, stain augmentation (multi-Macenko [13], [14] with random stain domain references), defocus blur, affine transforms, D4 symmetry, coarse dropout (up to two random boxes), and a custom black-border augmentation to mimic zero-padded regions in the training data.
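For illustration, a minimal Albumentations pipeline covering the standard transforms might look as follows; all probabilities and magnitudes are placeholder defaults rather than our exact settings, and the multi-Macenko stain augmentation and custom black-border transform are omitted here (the stain perturbation is sketched after the next paragraph).

```python
import albumentations as A

# Sketch of the online augmentation pipeline; probabilities are illustrative.
train_aug = A.Compose([
    A.ColorJitter(p=0.5),
    A.ImageCompression(p=0.3),   # JPEG compression artifacts
    A.Defocus(p=0.2),            # defocus blur
    A.Affine(p=0.5),             # affine transforms
    A.HorizontalFlip(p=0.5),     # flips plus 90° rotations together
    A.VerticalFlip(p=0.5),       #   realize the D4 symmetry group
    A.RandomRotate90(p=0.5),
    A.CoarseDropout(p=0.2),      # random occluding boxes
])
```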
We converted the multi-Macenko normalizer [13] into an augmentation by first extracting stain matrices from 10 randomly selected images per domain, using both the training set and additional cases from MITOS CMC [15], MITOS CCMCT [16], and TCGA COAD/BLCA [10]. This was done to better address domain shift, especially from unseen domains in the final test set, by simulating a wider range of staining variations. During training, we sampled a domain and a random subset of these references, averaged their stain matrices as proposed in [13], and used the result as the target for Macenko normalization. To mimic staining variability, we further perturbed the normalized images in stain space with additive and multiplicative uniform noise (\(\sigma=0.2\)).
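The final perturbation step could look like the following sketch; applying one multiplicative and one additive uniform draw per stain channel is our reading of this noise model, with \(\sigma=0.2\) as stated.

```python
import numpy as np

def perturb_stains(concentrations, sigma=0.2, rng=None):
    """Perturb stain concentrations (n_stains x n_pixels) with one multiplicative
    and one additive uniform draw per stain channel (assumed noise model)."""
    rng = rng or np.random.default_rng()
    n = concentrations.shape[0]
    mult = rng.uniform(1.0 - sigma, 1.0 + sigma, size=(n, 1))
    add = rng.uniform(-sigma, sigma, size=(n, 1))
    return np.maximum(concentrations * mult + add, 0.0)  # keep non-negative
```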
At inference, we employed test-time augmentation by averaging logits over four rotated views (0°, 90°, 180°, and 270°) to improve robustness.
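This amounts to a few lines at inference time (a sketch, assuming `model` maps a batch of NCHW images to logits):

```python
import torch

@torch.no_grad()
def predict_tta(model, images):
    """Average logits over the four 90-degree rotations of each image batch."""
    logits = [model(torch.rot90(images, k=k, dims=(-2, -1))) for k in range(4)]
    return torch.stack(logits).mean(dim=0)
```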
We evaluated our method, with DINOv3-H+, using 4-fold cross-validation on the training data, holding out the AMi-Br TUPAC16 subset as an external test set. In each fold, we measured balanced accuracy (BA) on the held-out validation split and on AMi-Br TUPAC16, and report the mean \(\pm\) standard deviation across folds. We also report the preliminary-test-set performance of the submitted DINOv3-H+ model, trained on the full training set (Table 1). For comparison, we included DINOv3-ConvNeXt-Tiny, a distilled variant pretrained on LVD-1689M and used here as a strong lightweight baseline, following our second approach [6]. Unlike our LoRA-based approach, DINOv3-ConvNeXt-Tiny was fully fine-tuned with a dropout of 0.2, giving it about 21× more trainable parameters despite having fewer parameters overall. DINOv3-H+ with LoRA consistently outperformed the ConvNeXt baseline across all evaluations. Its gap between validation and the external AMi-Br TUPAC test set was also smaller, likely because full fine-tuning of ConvNeXt-Tiny overfits more strongly to the training domains.
Split | DINOv3-H+ | ConvNeXt-Tiny
---|---|---
Cross-Validation | 0.9485 \(\pm\) 0.0038 | 0.9352 \(\pm\) 0.0032
AMi-Br TUPAC | 0.8475 \(\pm\) 0.0060 | 0.8085 \(\pm\) 0.0034
Preliminary Test\(^*\) | 0.9045 | -
Replacing the Focal Loss with the Domain-Weighted Focal Loss improved mean domain-balanced accuracy in cross-validation (0.9284 → 0.9341) and yielded a notable gain on the preliminary test set (0.8870 → 0.9045), demonstrating increased robustness across domains. On the AMi-Br test set, however, performance remained almost unchanged (0.8467 → 0.8475). This is likely due to a stronger, unseen domain shift between the test and training domains, which reweighting cannot address since it only balances the training domains (Table 2).
Split | Focal | DW-Focal
---|---|---
*Mean BA across domains* | |
Cross-Validation | 0.9284 \(\pm\) 0.0031 | 0.9341 \(\pm\) 0.0057
*Overall BA* | |
AMi-Br TUPAC | 0.8467 \(\pm\) 0.0060 | 0.8475 \(\pm\) 0.0060
Preliminary Test\(^*\) | 0.8870 | 0.9045
We also explored continuing the self-supervised training of the DINOv3-L-LVD1689M model on mitosis-like images. Using a mitosis detector trained on Task 1, we collected candidate patches from TCGA BLCA/COAD, MITOS CMC/CCMCT, and the Task 2 training data, resulting in a dataset of \(\sim\)260k crops (128×128). Using the official DINOv3 pipeline, we performed self-supervised fine-tuning of LoRA parameters, starting from ImageNet-pretrained weights. Because of limited computational resources (4 GPUs with a global batch size of 64, compared to the original global batch size of 2048), this experiment was strongly constrained. Nevertheless, we used linear probing on the full Task 2 dataset to evaluate the learned representations, following standard practice in SSL. This yielded an improvement of roughly 10 percentage points in balanced accuracy, from 53.44% with DINOv3-L LVD1689M to 63.11% (Table 3). This suggests that large-scale SSL on mitosis images could be beneficial. However, due to time and budget constraints, we did not fully investigate DINOv3 pretraining on histology images, leaving it as a promising direction for future work.
Model | LVD-1689M | PEFT SSL fine-tuned
---|---|---
DINOv3-L | 0.5344 | 0.6311
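The probe itself follows the standard protocol: a linear classifier on frozen embeddings. A self-contained sketch, with random placeholders standing in for the backbone features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Random placeholders standing in for frozen class-token embeddings
# (1024-dimensional for ViT-L); in practice these come from the SSL backbone.
rng = np.random.default_rng(0)
feats_train, y_train = rng.normal(size=(1000, 1024)), rng.integers(0, 2, size=1000)
feats_val, y_val = rng.normal(size=(200, 1024)), rng.integers(0, 2, size=200)

probe = LogisticRegression(max_iter=1000, class_weight="balanced")
probe.fit(feats_train, y_train)
print(balanced_accuracy_score(y_val, probe.predict(feats_val)))
```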
In this work, we demonstrated that DINOv3-H+ with LoRA fine-tuning constitutes a leading approach for atypical mitosis classification in MIDOG 2025, training only \(\sim\)1.3M parameters. Despite the domain gap, the ImageNet-pretrained DINOv3 transferred well to histopathology thanks to the efficacy of our fine-tuning strategy, achieving second place on the preliminary test set. This adds to the evidence that generalist foundation models, even when trained on natural images, can capture meaningful patterns for biomedical imaging tasks and support specialized applications such as atypical mitosis classification.
DINOv3-H+ remains a large model, which can be costly at inference, especially since mitosis classification often follows object detection with hundreds of thousands of candidate patches. Our parallel ConvNeXt approach showed that smaller networks can reach near state-of-the-art performance while requiring fewer parameters overall. Future directions include leveraging knowledge distillation to develop more efficient models. Stronger augmentations, such as optical or grid distortion, and increasing LoRA capacity (higher rank or more layers) could also improve adaptation to histology images.
Another promising direction is tighter integration with Task 1 of the challenge. This could involve adding hard-negative mitoses to the training set, exploring DINOv3 as a backbone for mitosis detection given its strong performance on dense downstream tasks, or applying our classification approach on top of a mitosis detector (Task 1) to further improve performance. Future work may also explore DINOv3 SSL pretraining on histopathology images, especially mitosis-like patches. Such extensions could further improve generalization across domains and strengthen the role of foundation models in mitosis subtyping.
Guillaume Balezo: Conceptualization; Methodology; Software; Investigation; Data curation; Validation; Visualization; Project administration; Writing – original draft; Writing – review & editing.
Hana Feki: Conceptualization; Methodology; Software; Investigation; Data curation; Validation; Writing – review & editing.
Raphaël Bourgade: Conceptualization; Methodology; Data curation; Project administration; Validation; Writing – review & editing.
Lily Monnier: Conceptualization; Data curation; Writing – review & editing.
Alice Blondel: Conceptualization; Project administration; Writing – review & editing.
Albert Pla Planas: Resources; Supervision; Project administration; Writing – review & editing.
Thomas Walter: Conceptualization; Resources; Supervision; Project administration; Writing – review & editing.