Synthetic Data for Robust Stroke Segmentation


Deep learning-based semantic segmentation in neuroimaging currently requires high-resolution scans and extensive annotated datasets, posing significant barriers to clinical applicability. We present a novel synthetic framework for the task of lesion segmentation, extending the capabilities of the established SynthSeg approach to accommodate large heterogeneous pathologies with lesion-specific augmentation strategies. Our method trains deep learning models, demonstrated here with the UNet architecture, using label maps derived from healthy and stroke datasets, facilitating the segmentation of both healthy tissue and pathological lesions without sequence-specific training data. Evaluated against in-domain and out-of-domain (OOD) datasets, our framework demonstrates robust performance, rivaling current methods within the training domain and significantly outperforming them on OOD data. This contribution holds promise for advancing medical imaging analysis in clinical settings, especially for stroke pathology, by enabling reliable segmentation across varied imaging sequences with reduced dependency on large annotated corpora. Code and weights available at

1 Introduction

Dense classification of images or volumes, known as semantic segmentation, is a crucial step in many analysis pipelines in neuroimaging and more broadly in medical imaging. Properties such as volume and surface distance of healthy tissue structures may be used in longitudinal models to measure changes in anatomical structure, and labels of tumours are commonly used for radiotherapy planning. Extracting fine-grained labels has typically been limited to high-contrast, high-resolution structural scans acquired using sequences like MPRAGE that are rarely found in clinical settings. This naturally serves as a barrier for the application of such methods to clinical data.

To extend such methods to the open-ended domain of clinical data, models often need to perform on sequences for which no training data may be available. To this end, domain randomisation via synthetic data has been shown to give effective results for healthy brain parcellation in SynthSeg [1]. In this method, a set of ground-truth healthy tissue labels is used to generate synthetic images, under the assumption that the intensity distribution of each tissue class roughly follows a Gaussian. By assigning random Gaussian distributions to each class, a deep learning model can learn to extract shape information for parcellation in a way that is invariant to the relative tissue contrast of the input image, allowing the model to be used on any sequence at test time without training data or prior knowledge of the sequence.

Prior work on SynthSeg highlighted the potential of encoding anatomical priors in a neural network by training with unpaired parcellation labels [2]. Methods built on this idea are expected to face a significantly greater challenge in the heterogeneous shape space of lesions and pathology. In healthy parcellation, the relative position of every tissue is highly consistent from subject to subject (e.g., the brain stem will always appear in roughly the same region of the brain). This was the motivation for traditional atlas-based approaches to parcellation. Lesions, in contrast, typically have no inherent location where they are expected to occur. One possible exception is multiple sclerosis, which is characterised by focal lesions in the white matter [3]. Although the size and presence of these lesions may vary, their locations are relatively predictable. This type of lesion has been modelled successfully in both a synthetic deep learning framework [4] and a traditional probabilistic model [5].

Our proposed work provides the first public implementation to directly segment heterogeneous lesions in a SynthSeg-like framework, combining two freely available datasets to do so. Several lesion-specific augmentations are also suggested for better representing the variety of lesions and their labels.

2 Methods

A model was trained using a combination of synthetic data, based on the OASIS-3 dataset [6], and real data, based on the ATLAS dataset [7]. In the OASIS-3 dataset (N=2679 from 1379 subjects) [6], tissue labels were extracted using SPM’s MultiBrain [8]. Maps were then normalised to MNI space to allow lesions to be pasted into correct regions. The ATLAS dataset was used for pasting the lesion maps (N=655) [7], also normalised to MNI. During generation, a lesioned brain map was generated by randomly pasting an ATLAS lesion onto a healthy brain from OASIS. Although the OASIS dataset contains patients with pathologies, such as Alzheimer’s disease, it is expected that the healthy labels acquired from these subjects will not create confounding factors for stroke identification. Pasting lesions onto healthy labels was augmented in a two-step approach: first, varying grades of infarction were simulated by applying a smooth multiplicative field over the binary lesion map as an intensity re-weighting; second, the resulting re-weighted lesion was stitched into the surrounding tissue structure using the Soft-CP algorithm [9].
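The first step of the lesion augmentation, the intensity re-weighting, can be illustrated with a minimal NumPy sketch. It is not the released implementation: the construction of the smooth field (a coarse random grid, linearly upsampled), the grid size `coarse`, the value range and the function names are all assumptions made for illustration. The Soft-CP stitching step is not reproduced here.

```python
import numpy as np

def smooth_random_field(shape, coarse=4, low=0.0, high=1.0, rng=None):
    """Smooth field made by linearly upsampling a coarse random grid (assumed design)."""
    rng = np.random.default_rng() if rng is None else rng
    field = rng.uniform(low, high, size=(coarse,) * len(shape))
    # Upsample one axis at a time with 1D linear interpolation.
    for axis, size in enumerate(shape):
        src = np.linspace(0.0, 1.0, field.shape[axis])
        dst = np.linspace(0.0, 1.0, size)
        field = np.apply_along_axis(lambda v: np.interp(dst, src, v), axis, field)
    return field

def reweight_lesion(lesion_mask, rng=None):
    """Multiply a binary lesion map by a smooth field to mimic graded infarction."""
    field = smooth_random_field(lesion_mask.shape, rng=rng)
    return lesion_mask.astype(np.float32) * field

rng = np.random.default_rng(0)
lesion = np.zeros((16, 16, 16), dtype=bool)  # toy binary lesion map
lesion[4:10, 5:12, 3:9] = True
soft_lesion = reweight_lesion(lesion, rng=rng)
```

The re-weighted map stays zero outside the lesion and varies smoothly inside it, so different parts of the same lesion can be given different apparent severities.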

The generated maps were used to randomly assign Gaussians to each tissue class by sampling a mean \(\mu \sim \text{Uniform}(0,255)\), a standard deviation \(\sigma \sim \text{Uniform}(0,16)\) and the FWHM of the applied smoothing from \(\text{Uniform}(0,2)\). To generate augmented versions of these simulated images, the following transforms were applied: random affine and elastic deformations, random field inhomogeneity, random gamma contrast adjustment, random motion blurring, random anisotropy, random noise and random smoothing. Intensities were clipped to the 1st and 99th percentiles, random skull-stripping was applied to 30% of samples, random flipping was applied in all axes and images were z-normalised to zero mean and unit standard deviation. Training was performed at 1 mm isotropic resolution, with volumes randomly cropped to \(192\times192\times192\).
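The core sampling step can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the released code: the function name and the toy label map are hypothetical, and the smoothing and the spatial and artefact augmentations listed above are omitted.

```python
import numpy as np

def synthesise_image(labels, n_classes, rng):
    """Sample one synthetic image from an integer label map."""
    # One random Gaussian per class: mean ~ U(0, 255), std ~ U(0, 16).
    means = rng.uniform(0.0, 255.0, size=n_classes)
    stds = rng.uniform(0.0, 16.0, size=n_classes)
    # Draw every voxel from the Gaussian of its own class.
    image = rng.normal(means[labels], stds[labels])
    # Clip intensities to the 1st and 99th percentiles.
    lo, hi = np.percentile(image, [1, 99])
    image = np.clip(image, lo, hi)
    # z-normalise to zero mean and unit standard deviation.
    return (image - image.mean()) / (image.std() + 1e-8)

rng = np.random.default_rng(0)
labels = rng.integers(0, 6, size=(16, 16, 16))  # toy label map with 6 classes
image = synthesise_image(labels, n_classes=6, rng=rng)
```

Because the class means are redrawn for every sample, the relative contrast between tissues changes from one synthetic image to the next while the underlying shapes stay fixed.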

In the synthetic model (labelled ‘Synth’ in all experiments), a combination of real and synthetic data was used for training. Real data was sampled from the ATLAS dataset (using the same train split from which lesion labels were derived). Real MPRAGE images were augmented with the following transforms: random cropping to \(192\times192\times192\), random affine and elastic deformations, histogram normalisation, random histogram shift, random field inhomogeneity, random gamma contrast adjustment and random flipping in all axes. Images were then z-normalised to zero mean and unit standard deviation. The real and synthetic dataloaders were alternated in a 1:1 ratio. In the baseline model (labelled ‘Baseline’ in all experiments), only the real dataloader was used, with identical transforms, while all other aspects of training were kept constant.

A UNet architecture [10] was used for this task with symmetric encoder and decoder with channels of (32, 64, 128, 256, 320, 320), instance normalisation [11], PReLU activation [12], and one residual unit per block. In the decoder, the upsampling was performed with transposed convolutions. Six output channels were predicted from the network, covering background, the four healthy MultiBrain tissues (gray matter (GM), white matter (WM), gray/white matter partial volume (PV), cerebro-spinal fluid (CSF)) and the stroke lesion class. In the baseline model, only the background and stroke lesion classes were predicted.

Training minimised a combined Dice and Cross Entropy loss. An AdamW optimiser [13] was used with an initial learning rate of \(\eta_{0} = 10^{-4}\) and the schedule \(\eta_{n} = \eta_{0}(1 - \frac{n}{N})^{0.9}\), where \(\eta_{n}\) is the learning rate at step \(n\) of \(N\), adapted from nnUNet [14]. Gradients were clipped to a norm of 12. A dropout of 0.2 was used during training. Models were trained using a batch size of 1 for 1200 epochs, each of 500 iterations (total \(6\times10^5\) iterations). All code and weights are available freely at
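The learning rate schedule can be written as a one-line function. The sketch below assumes that \(N\) is the total number of iterations (\(6\times10^5\)); the text does not state whether the schedule was stepped per iteration or per epoch, and the function name is illustrative.

```python
def poly_lr(step, total_steps, base_lr=1e-4, power=0.9):
    """Polynomial decay: eta_n = eta_0 * (1 - n / N) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power

total_steps = 1200 * 500  # epochs * iterations per epoch
lr_start = poly_lr(0, total_steps)
lr_half = poly_lr(total_steps // 2, total_steps)
lr_end = poly_lr(total_steps, total_steps)
```

Under this assumption the rate starts at \(10^{-4}\), has fallen to a little over half that value at the midpoint, and reaches zero at the final step.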

3 Experiments

Models were validated on a number of datasets to assess performance. We first assessed the performance of the model on the hold-out test set for the ATLAS dataset, as a direct evaluation of in-domain performance. This dataset consisted of 131 subjects with a single 1mm isotropic MPRAGE. The other evaluations were concerned with out-of-domain robustness. The first of these was the ISLES 2015 dataset [15], containing 28 subjects with multi-modal MRI (T1, T2, FLAIR, DWI) skull-stripped and of varying slice thickness. The final dataset was from the ISLES 2022 challenge [16], containing 250 subjects with multi-modal MRI (DWI, ADC, FLAIR) of varying slice thickness. In addition to representing a shift in the domain of scanning apparatus used, this also represents a shift from chronic lesions (ATLAS) to acute lesions (ISLES).

At test time, images were loaded, reoriented to RAS, resliced to 1mm slice thickness, histogram normalised and z-scored to zero mean and unit standard deviation. Sliding-window inference was performed using a patch size of \(192\times192\times192\), with 50% overlap between patches and Gaussian weighting with \(\sigma=0.125\). Test-time augmentation was performed by averaging logits across all possible combinations of flips in the x, y and z axes. Predicted logits were then resliced to the space of the original input image and a softmax was applied to derive posterior probabilities. An argmax was then used to derive a binary lesion map.
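The flip-based test-time augmentation and the final softmax and argmax can be sketched as follows. This is an illustration only: `toy_model` stands in for the trained network, the channel-first layout `(C, X, Y, Z)` and the lesion class index are assumptions, and the sliding-window and reslicing steps are left out.

```python
import itertools
import numpy as np

def flip_tta_logits(model, image):
    """Average logits over all 8 combinations of flips along x, y and z."""
    total, count = None, 0
    for r in range(4):
        for axes in itertools.combinations((0, 1, 2), r):
            flipped = np.flip(image, axis=axes) if axes else image
            logits = model(flipped)  # shape (C, X, Y, Z)
            if axes:
                # Undo the flip on the logits (spatial axes shift by one).
                logits = np.flip(logits, axis=tuple(a + 1 for a in axes))
            total = logits if total is None else total + logits
            count += 1
    return total / count

def lesion_mask_from_logits(logits, lesion_class):
    """Softmax over the channel axis, then argmax, then select the lesion class."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=0, keepdims=True)
    return probs.argmax(axis=0) == lesion_class

def toy_model(x):
    # Hypothetical two-class stand-in: channel 1 wins wherever x > 0.
    return np.stack([-x, x])

image = np.random.default_rng(0).normal(size=(4, 4, 4))
logits = flip_tta_logits(toy_model, image)
mask = lesion_mask_from_logits(logits, lesion_class=1)
```

The toy model is voxel-wise and therefore flip-equivariant, so the averaged logits match a single forward pass. A real network is only approximately equivariant, which is why averaging over flips can change the prediction.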

Comparison to ground truth was done by reslicing both the prediction and ground truth to 1mm slice thickness and padding images to a size of \(256\times256\times256\). Segmentations were evaluated using the Dice coefficient, 95th-percentile Hausdorff distance (HD95 in tables and figures), absolute volume difference (AVD in tables and figures), absolute lesion difference (ALD in tables and figures) and lesion-wise F1 score (LF1 in tables and figures). The Dice coefficient is a measure of overlap, where a score of 1 indicates perfect overlap and 0 indicates no overlap. The Hausdorff distance is a measure of surface similarity: for each point on one surface, the distance to the nearest point on the other surface is computed, and the Hausdorff distance is the largest of these distances. The 95th-percentile value is used in place of the maximum to reduce sensitivity to anomalous values in imperfect masks.
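The two voxel-wise metrics can be made concrete with a short NumPy sketch. The function names, the convention of returning a Dice of 1 when both masks are empty, and the `voxel_volume` parameter are assumptions for illustration; HD95 is omitted because it needs a surface-distance computation.

```python
import numpy as np

def dice(pred, truth):
    """Dice coefficient between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = int(pred.sum()) + int(truth.sum())
    if denom == 0:
        return 1.0  # assumed convention when both masks are empty
    return 2.0 * int(np.logical_and(pred, truth).sum()) / denom

def absolute_volume_difference(pred, truth, voxel_volume=1.0):
    """Absolute difference in labelled volume (voxel count times voxel volume)."""
    n_pred = int(pred.astype(bool).sum())
    n_truth = int(truth.astype(bool).sum())
    return abs(n_pred - n_truth) * voxel_volume

truth = np.zeros((10, 10, 10), dtype=bool)
truth[2:6, 2:6, 2:6] = True   # 64 voxels
pred = np.zeros_like(truth)
pred[3:7, 2:6, 2:6] = True    # same size, shifted by one voxel along x
score = dice(pred, truth)     # 2 * 48 / (64 + 64) = 0.75
avd = absolute_volume_difference(pred, truth)
```

The example shows why the two metrics are reported together: a shifted prediction of the correct size has zero volume difference but an imperfect Dice.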

The absolute volume difference gives a direct comparison of the overall proportion of brain labelled as unhealthy. The absolute lesion difference \(\mathcal{L}\) is the difference in the number of unique lesions, defined as \(\mathcal{L}(x, y) = \left| \mathcal{C}(x) - \mathcal{C}(y) \right|\), where \(\mathcal{C}\) is an algorithm used to calculate the number of connected components. Lastly, the lesion-wise F1 score uses the same equation as Dice but on a per-lesion basis rather than per-voxel: for each unique lesion in the ground-truth \(y\), if a single voxel of the predicted mask overlaps with the lesion then this is considered as a success.
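The two lesion-wise metrics can be sketched with a small connected-component routine. Several details here are assumptions not stated in the text: 6-connectivity is used for \(\mathcal{C}\), a false positive is taken to be a predicted component that overlaps no ground-truth voxel, and all function names are illustrative.

```python
from collections import deque
import numpy as np

OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def label_components(mask):
    """Label 6-connected components; returns (label volume, component count)."""
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=np.int32)
    count = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue
        count += 1
        labels[seed] = count
        queue = deque([seed])
        while queue:  # breadth-first flood fill from the seed voxel
            x, y, z = queue.popleft()
            for dx, dy, dz in OFFSETS:
                n = (x + dx, y + dy, z + dz)
                inside = all(0 <= c < s for c, s in zip(n, mask.shape))
                if inside and mask[n] and not labels[n]:
                    labels[n] = count
                    queue.append(n)
    return labels, count

def absolute_lesion_difference(pred, truth):
    """|C(pred) - C(truth)|, with C the number of connected components."""
    return abs(label_components(pred)[1] - label_components(truth)[1])

def lesion_f1(pred, truth):
    """F1 over lesions: a true lesion is detected if any predicted voxel overlaps it."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    t_labels, n_truth = label_components(truth)
    p_labels, n_pred = label_components(pred)
    tp = sum(1 for i in range(1, n_truth + 1) if pred[t_labels == i].any())
    fn = n_truth - tp
    fp = sum(1 for j in range(1, n_pred + 1) if not truth[p_labels == j].any())
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2.0 * tp / denom

truth = np.zeros((8, 8, 8), dtype=bool)
truth[1:3, 1:3, 1:3] = True   # lesion A
truth[5:7, 5:7, 5:7] = True   # lesion B
pred = np.zeros_like(truth)
pred[2:4, 1:3, 1:3] = True    # overlaps lesion A
pred[0, 7, 7] = True          # spurious detection
ald = absolute_lesion_difference(pred, truth)
lf1 = lesion_f1(pred, truth)
```

In this toy case one lesion is found, one is missed and one prediction is spurious, so the lesion-wise F1 is 0.5. Both masks contain two components, so the absolute lesion difference is 0 even though the detections are not the same.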

4 Results

4.1 ATLAS

In the ATLAS dataset, the task is considered completely in-domain so the Baseline model was expected to outperform the Synth model. As shown in Table 1, the Baseline model performed as expected but the Synth model remained competitive, losing only in Dice and AVD. Figure 1 shows that no differences between the two models are statistically significant.

Table 1: Mean results on ATLAS hold-out set. Better score shown in bold.
Model     Dice   HD95  AVD    ALD   LF1
Baseline  0.503  31.1  7344   2.52  0.537
Synth     0.455  31.1  8513   1.76  0.559



Figure 1: Dice and lesion-wise F1 accuracy in ATLAS hold-out set. ns: \(p>0.05\), *: \(0.01<p\leq0.05\), **: \(10^{-3}<p\leq0.01\), ***: \(10^{-4}<p\leq10^{-3}\), ****: \(p\leq10^{-4}\).

4.1.1 Private test set:

Further to the above evaluations on the hold-out data, blind evaluation was performed on the ATLAS hidden test set of an additional 300 subjects. Evaluation was done by the dataset owners so no score for HD95 was provided. For this data, reported in Table 2, the Baseline model also outperformed the Synth model in all but LF1, with similar margins to those reported for the hold-out set.

Table 2: Mean results on the ATLAS private dataset. Better score shown in bold.
Model     Dice   AVD    ALD   LF1
Baseline  0.550  14567  4.56  0.440
Synth     0.498  14997  4.68  0.442

4.2 ISLES 2015

The ISLES 2015 dataset was the first of two benchmarks used to evaluate behaviour in challenging out-of-distribution (OOD) settings. The results in Table 3 and Figure 2 show that for all modalities the Synth model outperforms the Baseline model, with strong statistical significance in three of four modalities. In addition, it was found that the Baseline failed to achieve a Dice above \(0.6\) in any of the modalities, whilst Synth achieved a score above this threshold in roughly 17% of all cases.

Table 3: Mean results on the ISLES 2015 dataset. Better score shown in bold.
Model     Dice   HD95  AVD    ALD   LF1
T1
Baseline  0.091  45.5  40113  1.96  0.311
Synth     0.230  39.9  31405  2.18  0.349
Baseline  0.003  65.5  48969  6.89  0.095
Synth     0.245  74.6  50995  4.68  0.279
Baseline  0.000  65.3  41236  3.43  0.000
Synth     0.329  44.3  18366  3.50  0.275
Baseline  0.000  77.3  43833  2.43  0.000
Synth     0.204  81.1  44316  8.39  0.187
Baseline  0.000  Inf   43795  2.11  0.000
Synth     0.378  38.6  20012  2.75  0.348



Figure 2: Dice and lesion-wise F1 accuracy on the ISLES 2015 dataset. ns: \(p>0.05\), *: \(0.01<p\leq0.05\), **: \(10^{-3}<p\leq0.01\), ***: \(10^{-4}<p\leq10^{-3}\), ****: \(p\leq10^{-4}\).

4.2.1 Multi-modal ensembling:

In addition to per-modality evaluation, we provide results of ensembles of per-modality predictions to better represent how such a model may be used in practice. Logits were averaged first across all modalities before taking the softmax and post-processing as stated previously. In all instances the Baseline model failed and gave an empty prediction, whilst the Synth model’s ensembled prediction outperformed any of the single modalities across most metrics. A visualisation of the ensembled predictions can be seen in Figure 3. It is also noted that ensembling is expected to reduce the likelihood of models mis-labelling irrelevant hyperintensity in sequences such as FLAIR and DWI.
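The ensembling step amounts to averaging logits before the softmax, as in the minimal sketch below. The dictionary layout, the two-class toy logits and the function name are hypothetical; real inputs would be the per-modality logit volumes of shape `(C, X, Y, Z)` after reslicing to a common space.

```python
import numpy as np

def ensemble_prediction(logits_by_modality):
    """Average logits across modalities, then apply softmax and argmax."""
    stacked = np.stack(list(logits_by_modality.values()), axis=0)
    mean_logits = stacked.mean(axis=0)
    shifted = mean_logits - mean_logits.max(axis=0, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=0, keepdims=True)
    return probs.argmax(axis=0)

# Toy single-voxel example with two classes (0 = not lesion, 1 = lesion).
logits_by_modality = {
    "FLAIR": np.array([[[[0.0]]], [[[3.0]]]]),  # confident lesion
    "DWI": np.array([[[[1.0]]], [[[0.0]]]]),    # weakly not lesion
    "T1": np.array([[[[1.0]]], [[[0.0]]]]),     # weakly not lesion
}
labels = ensemble_prediction(logits_by_modality)
```

In this toy case the averaged logits are 2/3 and 1, so the voxel is labelled as lesion even though two of the three modalities weakly disagree. Averaging logits lets one confident modality outweigh several uncertain ones, which a majority vote over hard labels would not.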





Figure 3: Sample predictions with ensembling method in the ISLES 2015 dataset. Color code in predicted Synth labels: dark-blue=GM, light-blue=WM, green=PV, amber=CSF, red=stroke.

4.3 ISLES 2022

Lastly, the ISLES 2022 dataset was used to further evaluate OOD behaviour in a larger and more challenging dataset. Table 4 and Figure 4 show that Synth outperforms Baseline in Dice for all modalities with statistical significance, as well as in lesion-wise F1 score. HD95 values for this dataset should be interpreted with caution, because the averages exclude infinite (failure) cases, of which there were significantly more for the Baseline model. Baseline again failed to achieve a Dice of \(0.6\) or above in any subject across all modalities, and achieved a Dice in the range \(0.3\) to \(0.6\) in only 6 cases (all from the ADC modality). In contrast, Synth achieved the former category in 5% of all cases and the latter category in roughly 9% of cases.

Table 4: Mean results on the ISLES 2022 dataset. Better score shown in bold.
Model     Dice   HD95  AVD    ALD   LF1
DWI
Baseline  0.000  68.5  14276  5.22  0.008
Synth     0.169  70.2  19374  5.84  0.144
Baseline  0.001  64.2  15049  5.05  0.013
Synth     0.087  69.9  18311  5.35  0.119
Baseline  0.031  62.1  17554  4.53  0.090
Synth     0.068  60.4  14655  4.64  0.117



Figure 4: Dice and lesion-wise F1 accuracy on the ISLES 2022 dataset. ns: \(p>0.05\), *: \(0.01<p\leq0.05\), **: \(10^{-3}<p\leq 0.01\), ***: \(10^{-4}<p\leq10^{-3}\), ****: \(p\leq10^{-4}\).

5 Conclusions

The results in the previous section show that the proposed synthetic training framework can be competitive with the baseline on in-distribution data and can make significant improvements on data from unseen domains, such that a significant portion of predictions may be deemed of usable quality. These experiments show that it is possible to augment a SynthSeg-style training procedure to include large, heterogeneous lesions. Whilst the framework was only evaluated in stroke, such results are expected to translate directly to other similar domains, such as haemorrhage and glioblastoma. In future work, the impact of training with a mix of real and synthetic data will be compared to using only synthetic data. The utility of multi-modal data in a multi-channel model will also be compared with post-hoc averaging of individual modality predictions: lesions can take starkly different appearances between modalities (as seen in Figure 3), so a model trained with multi-channel inputs is expected to leverage these differences better.

6 Acknowledgement

LC is supported by the EPSRC-funded UCL Centre for Doctoral Training in Intelligent, Integrated Imaging in Healthcare (i4health) (EP/S021930/1), and the Wellcome Trust (203147/Z/16/Z and 205103/Z/16/Z). This research was supported by NVIDIA and utilized NVIDIA RTX A6000 48GB.


References

[1] Billot, B., Greve, D.N., Puonti, O., Thielscher, A., Van Leemput, K., Fischl, B., Dalca, A.V., Iglesias, J.E.: SynthSeg: Segmentation of brain MRI scans of any contrast and resolution without retraining. Medical Image Analysis 86, 102789 (May 2023).
[2] Dalca, A.V., Guttag, J., Sabuncu, M.R.: Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE (Jun 2018).
[3] Lassmann, H.: Multiple sclerosis pathology. Cold Spring Harbor Perspectives in Medicine 8(3), a028936 (Jan 2018).
[4] Billot, B., Cerri, S., Leemput, K.V., Dalca, A.V., Iglesias, J.E.: Joint segmentation of multiple sclerosis lesions and brain anatomy in MRI scans of any contrast and resolution with CNNs. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE (Apr 2021).
[5] Cerri, S., Puonti, O., Meier, D.S., Wuerfel, J., Mühlau, M., Siebner, H.R., Van Leemput, K.: A contrast-adaptive method for simultaneous whole-brain and lesion segmentation in multiple sclerosis. NeuroImage 225, 117471 (Jan 2021).
[6] LaMontagne, P.J., Benzinger, T.L., Morris, J.C., Keefe, S., Hornbeck, R., Xiong, C., Grant, E., Hassenstab, J., Moulder, K., Vlassenko, A.G., Raichle, M.E., Cruchaga, C., Marcus, D.: OASIS-3: Longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and Alzheimer disease. medRxiv (Dec 2019).
[7] Liew, S.L., Lo, B.P., Donnelly, M.R., Zavaliangos-Petropulu, A., Jeong, J.N., Barisano, G., Hutton, A., Simon, J.P., Juliano, J.M., Suri, A., Wang, Z., Abdullah, A., Kim, J., Ard, T., Banaj, N., Borich, M.R., Boyd, L.A., Brodtmann, A., Buetefisch, C.M., Cao, L., Cassidy, J.M., Ciullo, V., Conforto, A.B., Cramer, S.C., Dacosta-Aguayo, R., de la Rosa, E., Domin, M., Dula, A.N., Feng, W., Franco, A.R., Geranmayeh, F., Gramfort, A., Gregory, C.M., Hanlon, C.A., Hordacre, B.G., Kautz, S.A., Khlif, M.S., Kim, H., Kirschke, J.S., Liu, J., Lotze, M., MacIntosh, B.J., Mataró, M., Mohamed, F.B., Nordvik, J.E., Park, G., Pienta, A., Piras, F., Redman, S.M., Revill, K.P., Reyes, M., Robertson, A.D., Seo, N.J., Soekadar, S.R., Spalletta, G., Sweet, A., Telenczuk, M., Thielman, G., Westlye, L.T., Winstein, C.J., Wittenberg, G.F., Wong, K.A., Yu, C.: A large, curated, open-source stroke neuroimaging dataset to improve lesion segmentation algorithms. Scientific Data 9(1) (Jun 2022).
[8] Brudfors, M., Balbastre, Y., Flandin, G., Nachev, P., Ashburner, J.: Flexible Bayesian Modelling for Nonlinear Image Registration, p. 253–263. Springer International Publishing (2020).
[9] Dai, P., Dong, L., Zhang, R., Zhu, H., Wu, J., Yuan, K.: Soft-CP: A credible and effective data augmentation for semantic segmentation of medical lesions (2022).
[10] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation (2015).
[11] Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization (2017).
[12] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification (2015).
[13] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization (2019).
[14] Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (Dec 2020).
[15] Maier, O., Menze, B.H., von der Gablentz, J., Häni, L., Heinrich, M.P., Liebrand, M., Winzeck, S., Basit, A., Bentley, P., Chen, L., Christiaens, D., Dutil, F., Egger, K., Feng, C., Glocker, B., Götz, M., Haeck, T., Halme, H.L., Havaei, M., Iftekharuddin, K.M., Jodoin, P.M., Kamnitsas, K., Kellner, E., Korvenoja, A., Larochelle, H., Ledig, C., Lee, J.H., Maes, F., Mahmood, Q., Maier-Hein, K.H., McKinley, R., Muschelli, J., Pal, C., Pei, L., Rangarajan, J.R., Reza, S.M., Robben, D., Rueckert, D., Salli, E., Suetens, P., Wang, C.W., Wilms, M., Kirschke, J.S., Krämer, U.M., Münte, T.F., Schramm, P., Wiest, R., Handels, H., Reyes, M.: ISLES 2015 - a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI. Medical Image Analysis 35, 250–269 (Jan 2017).
[16] Hernandez Petzsche, M.R., de la Rosa, E., Hanning, U., Wiest, R., Valenzuela, W., Reyes, M., Meyer, M., Liew, S.L., Kofler, F., Ezhov, I., Robben, D., Hutton, A., Friedrich, T., Zarth, T., Bürkle, J., Baran, T.A., Menze, B., Broocks, G., Meyer, L., Zimmer, C., Boeckh-Behrens, T., Berndt, M., Ikenberg, B., Wiestler, B., Kirschke, J.S.: ISLES 2022: A multi-center magnetic resonance imaging stroke lesion segmentation dataset. Scientific Data 9(1) (Dec 2022).