October 25, 2025
Background: Application of machine learning (ML) in neurosurgery is often constrained by the difficulty of assembling, sharing, and utilizing large, high-quality imaging datasets. Synthetic data offers a novel solution to this challenge
by enabling the creation of privacy-preserving images, which can be generated at scale.
Objective: To evaluate the feasibility of using a denoising diffusion probabilistic model (DDPM) to generate realistic synthetic lateral cervical spine radiographs.
Methods: A DDPM was trained on the Cervical Spine X-ray Atlas (CSXA; 4,963 lateral radiographs), and monitored using training/validation loss and Fréchet inception distance (FID) to quantify model convergence and synthetic image realism.
Blinded expert validation (“clinical Turing test”) was conducted with six neuroradiologists and two spine fellowship-trained neurosurgeons. Each expert reviewed 50 image quartets containing one ground truth radiograph and three synthetic images derived
from separate training checkpoints. For each quartet, experts were tasked with identifying the real image and rating each image’s realism on a 4-point Likert scale (1 = unrealistic, 4 = fully realistic). Images were also evaluated for potential
training-data memorization via a nearest-neighbor search implemented with imagededup, using vision transformer embeddings and ranking real-synthetic pairs by cosine similarity.
Results: Experts correctly identified the real image in 29.0% of quartet trials, with low inter-rater agreement (Fleiss’ \(\kappa\) = 0.061). Mean realism scores were 3.323 for real images and 3.228, 3.258,
and 3.320 for synthetic images from the three checkpoints. Paired Wilcoxon signed-rank tests showed no significant differences between real and synthetic mean ratings (unadjusted p = 0.128, 0.236, 1.000; Holm-adjusted p = 0.383, 0.471,
1.000; two-sided). No visually explicit memorization was identified among the nearest-neighbor pairs. We also include a large-scale dataset of 20,063 synthetic radiographs.
Conclusions: We present the generation of synthetic cervical spine X-ray images that are statistically indistinguishable in realism and quality from real radiographs in blinded expert review. This novel application of DDPM highlights the
potential to generate large-scale neuroimaging datasets to support ML model training for landmarking, segmentation, and classification tasks.
Machine learning (ML) is increasingly integrated into neurosurgical and neuroimaging research; however, progress can be constrained by the limited availability of large-scale training data [1], [2]. Patient volumes are often limited [3], [4], institutional data are heterogeneous and frequently incomplete [5], and the labor required to curate, standardize, and de-identify raw data is substantial [6]. Privacy and
regulatory constraints further complicate data sharing [7]–[9], which might
otherwise enable the compilation of multi-institutional datasets that address the limitations of single-center studies and improve model performance and generalizability.
Synthetic data has emerged as a pragmatic complement to real-world data (RWD) under these constraints. Synthetic data that are not tied to real patient records but preserve the salient properties or statistical structure of RWD can be produced at scale and
shared without regulatory restrictions [10], [11]. Increasingly, synthetic tabular
data has demonstrated high fidelity to perioperative RWD [12]–[14]. For example, synthetic tabular neurosurgical data was used to amplify small RWD sample sizes, train ML models to predict patient outcomes, and conduct analyses with results comparable to ground truth clinical findings
[12]. Synthetic imaging is advancing along a similar trajectory [15], albeit at a slower pace, due to the added complexity of achieving visual fidelity. One method used to generate synthetic images is a generative adversarial network (GAN), which employs two competing networks to improve
image fidelity [16]. Early evaluations of GAN-based synthetic imaging data have demonstrated high fidelity in expert reviews [17], [18]. Much of this work, however, has been concentrated on chest radiography, where datasets
such as MIMIC-CXR (377,110 images) [19] and CheXpert (224,316 images) [20] enable large-scale training and evaluation. In contrast, neuroimaging datasets are typically much more limited in both size and availability [1], [2]. This ‘abundance irony’ underscores why neurosurgical and neuroimaging research stands to benefit disproportionately from synthetic
data.
Another method of synthetic image generation is the denoising diffusion probabilistic model (DDPM) [21]. The DDPM generates images by iteratively denoising
random noise, producing high-quality and diverse synthetic images. Emerging evidence suggests that DDPM-generated images exhibit superior image quality and fidelity to those produced by GANs [22]. Yet the application of DDPMs to generate synthetic neuroimaging data, together with rigorous assessment of image quality and realism, remains largely unexplored.
In this study, we investigate whether a DDPM trained on the open-access Cervical Spine X-ray Atlas (CSXA) of 4,963 radiographs [23], [24] can generate realistic synthetic lateral cervical spine X-rays. We conducted a multi-expert blinded validation (“clinical Turing test”) in which neuroradiologists and spine fellowship-trained
neurosurgeons were tasked with discriminating between real and synthetic images. We hypothesized that experts would be unable to distinguish real from synthetic images.
Training images were drawn from the CSXA, an open-access dataset of 4,963 lateral cervical spine radiographs [23], [24]. CSXA was selected for its scale, open availability, and accompanying metadata and annotations for downstream ML tasks (i.e., vertebral keypoints). All CSXA radiographs were included and resampled to 256x256 pixels. Negative scans (i.e., cortical bone rendered dark) were intensity-inverted to homogenize the coloration of training data.
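As an illustration of this preprocessing step, the sketch below resizes each radiograph to 256x256 pixels and inverts predominantly bright (negative) scans. It is not the authors' exact pipeline: the folder layout, file format, and the mean-intensity heuristic used to flag negative scans are assumptions.

```python
# Illustrative preprocessing sketch (not the authors' exact pipeline).
# Assumptions: radiographs are 8-bit grayscale PNG files in a "csxa" folder,
# and "negative" scans are flagged with a simple mean-intensity heuristic,
# which stands in for however polarity was actually determined.
from pathlib import Path

import numpy as np
from PIL import Image

TARGET_SIZE = (256, 256)  # resolution reported in the paper


def preprocess_radiograph(path: Path) -> np.ndarray:
    """Load, resample to 256x256, and invert negative (dark-bone) scans."""
    img = Image.open(path).convert("L").resize(TARGET_SIZE, Image.BILINEAR)
    arr = np.array(img, dtype=np.uint8)
    if arr.mean() > 200:          # illustrative threshold for a "negative" scan
        arr = 255 - arr           # invert so the set has homogeneous coloration
    return arr


if __name__ == "__main__":
    processed = [preprocess_radiograph(p) for p in sorted(Path("csxa").glob("*.png"))]
    print(f"Preprocessed {len(processed)} radiographs")
```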
We implemented a DDPM training and sampling setup with Python code adapted from the serag-ai “I-SynMed” repository [25], [26]. To optimize the hyperparameters of the model training session, we split the CSXA dataset into training and validation partitions comprising 4,219 (85%) and 744 (15%) images, respectively, by random sampling.
Hyperparameters included batch size (8), learning rate (5e-5), and the number of training steps (up to 160,000). Validation loss was monitored at ‘model checkpoints’, saved every 2,000 training steps.
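For concreteness, the following is a hedged sketch of a DDPM training loop using the hyperparameters reported above (batch size 8, learning rate 5e-5, up to 160,000 steps, checkpoints every 2,000 steps). It uses the Hugging Face diffusers API rather than the I-SynMed-derived code used in the study; the network configuration and the placeholder data loader are assumptions.

```python
# Hedged DDPM training sketch using Hugging Face `diffusers` (not the
# I-SynMed-derived code used in the study). Hyperparameters mirror the paper:
# batch size 8, learning rate 5e-5, up to 160,000 steps, checkpoint every 2,000.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DModel
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model = UNet2DModel(
    sample_size=256, in_channels=1, out_channels=1,     # grayscale radiographs
    block_out_channels=(128, 128, 256, 256, 512, 512),  # assumed architecture
    down_block_types=("DownBlock2D",) * 4 + ("AttnDownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "AttnUpBlock2D") + ("UpBlock2D",) * 4,
).to(device)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Placeholder data: in practice, the preprocessed CSXA training split goes here.
images = torch.rand(64, 1, 256, 256) * 2 - 1            # dummy tensors in [-1, 1]
train_loader = DataLoader(TensorDataset(images), batch_size=8, shuffle=True)

step, max_steps, ckpt_every = 0, 160_000, 2_000
while step < max_steps:
    for (clean,) in train_loader:
        clean = clean.to(device)
        noise = torch.randn_like(clean)
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (clean.shape[0],), device=device)
        noisy = scheduler.add_noise(clean, noise, t)     # forward diffusion
        pred = model(noisy, t).sample                    # predict the added noise
        loss = F.mse_loss(pred, noise)                   # standard DDPM objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step % ckpt_every == 0:                       # e.g., checkpoint 80 = step 160,000
            torch.save(model.state_dict(), f"checkpoint_{step // ckpt_every}.pt")
            # validation loss on the held-out 15% split would be logged here
        if step >= max_steps:
            break
```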
To generate the final model used for sampling, a separate instance of the DDPM was trained using 100% of the available images to maximize data utilization. During this final training session, distributional similarity between real and synthetic images was
evaluated using Fréchet inception distance (FID), which compares the statistical properties of feature representations (computed with a pretrained Inception network) between the two sets of images [26], [27]. FID scores were computed between the 4,963 real images and 7,250 synthetic images generated every 20,000 training steps.
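A minimal sketch of this FID comparison is shown below, using the torchmetrics implementation rather than whichever implementation was used in the study; the folder names and batching scheme are assumptions. Grayscale images are tiled to three channels because the pretrained Inception backbone expects RGB input.

```python
# Hedged FID sketch with torchmetrics (not necessarily the study's implementation).
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torchmetrics.image.fid import FrechetInceptionDistance


def update_from_folder(metric: FrechetInceptionDistance, folder: str, real: bool) -> None:
    """Stream a folder of grayscale PNGs into the FID metric in small batches."""
    paths = sorted(Path(folder).glob("*.png"))
    for i in range(0, len(paths), 64):
        batch = torch.stack([
            torch.from_numpy(np.array(Image.open(p).convert("L"), dtype=np.uint8))
                 .unsqueeze(0).repeat(3, 1, 1)           # grayscale -> 3 channels
            for p in paths[i:i + 64]
        ])
        metric.update(batch, real=real)


fid = FrechetInceptionDistance(feature=2048)             # pretrained InceptionV3 features
update_from_folder(fid, "real_csxa", real=True)          # 4,963 real radiographs
update_from_folder(fid, "synthetic_step_020000", real=False)  # e.g., 7,250 samples
print(f"FID: {fid.compute().item():.2f}")
```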
Training and sampling were performed on institutional NVIDIA A100 or V100 GPUs.
Synthetic image sets (1,000 images each) were generated using the DDPM at three specific model checkpoints, determined from the loss curve trajectories and FID scores. Following generation, synthetic images were standardized to left-facing orientation
to ensure consistent anatomical alignment. A manual review was then conducted to exclude any samples with implausible anatomy (e.g., pharynx and trachea posterior to the spinal column, or inverted anterior and posterior spinal elements). From each of the three synthetic sets, 50 images were randomly selected for expert validation, along with 50 randomly selected real scans from CSXA. For the validation task, images were grouped into quartets, each containing one real image and three synthetic images (one from each checkpoint).
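The quartet assembly can be illustrated with the short sketch below; the file lists, naming scheme, and random seed are placeholders, and the within-quartet display order is shuffled so the position of the real image carries no information.

```python
# Illustrative quartet assembly for the blinded review (placeholder file names).
import random

random.seed(0)  # any fixed seed, for reproducibility of the selection

real_paths = [f"csxa/real_{i:04d}.png" for i in range(4963)]                 # placeholders
synthetic_paths = {c: [f"synth_ckpt{c}/{i:04d}.png" for i in range(1000)]    # placeholders
                   for c in (34, 40, 80)}

real_sample = random.sample(real_paths, 50)                   # 50 real CSXA radiographs
synth_sample = {c: random.sample(p, 50) for c, p in synthetic_paths.items()}

quartets = []
for i in range(50):
    items = [("real", real_sample[i])] + [(f"ckpt_{c}", synth_sample[c][i])
                                          for c in (34, 40, 80)]
    random.shuffle(items)        # blind the position of the real image
    quartets.append(items)

print(quartets[0])               # one quartet: one real + three synthetic images
```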
In parallel, we generated larger synthetic image sets from each checkpoint and pooled them into a single, large-scale dataset intended for publication. Each large-scale sampling job was conducted up to a maximum institutional runtime limit of 24 hours. The
generated sets were manually reviewed using the same post-generation screen for gross abnormalities.
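A hedged sketch of the checkpoint sampling step is shown below, again using diffusers rather than the study's I-SynMed-derived code; the checkpoint filename, network configuration, batch size, and output folder are assumptions, and the wall-clock check mimics the 24-hour institutional runtime limit.

```python
# Hedged checkpoint-sampling sketch with `diffusers` (assumed file names/config).
import time
from pathlib import Path

import torch
from diffusers import DDPMPipeline, DDPMScheduler, UNet2DModel

device = "cuda" if torch.cuda.is_available() else "cpu"
unet = UNet2DModel(                                        # must match the training config
    sample_size=256, in_channels=1, out_channels=1,
    block_out_channels=(128, 128, 256, 256, 512, 512),
    down_block_types=("DownBlock2D",) * 4 + ("AttnDownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "AttnUpBlock2D") + ("UpBlock2D",) * 4,
)
unet.load_state_dict(torch.load("checkpoint_80.pt", map_location="cpu"))
pipeline = DDPMPipeline(unet=unet, scheduler=DDPMScheduler(num_train_timesteps=1000)).to(device)

out_dir = Path("synthetic_ckpt80")
out_dir.mkdir(exist_ok=True)

budget_s, start, count = 24 * 3600, time.time(), 0         # 24-hour runtime budget
while time.time() - start < budget_s:
    for img in pipeline(batch_size=16).images:             # list of PIL images
        img.save(out_dir / f"{count:05d}.png")
        count += 1
```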
To ensure that the model was not simply memorizing training data, we conducted a near-duplicate search of all generated images using the open-source imagededup library [28], a commonly used library for image similarity and duplicate detection [29], [30]. We computed vision transformer (ViT) embeddings for all real and synthetic images, ranked nearest neighbors by cosine similarity, and extracted the top 100 most similar real-synthetic pairs. Two authors (AAB and BSK) independently performed side-by-side visual review of these 100 pairs, adjudicating for explicit memorization. A priori, we defined explicit memorization as near-identical vertebral contours, soft-tissue silhouettes, or acquisition artifacts beyond plausible sampling variability.
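The embedding-and-ranking step of this audit can be sketched as follows. For illustration, the sketch uses a Hugging Face ViT and scikit-learn cosine similarity rather than the imagededup pipeline used in the study; the model checkpoint name and folder layout are assumptions.

```python
# Hedged nearest-neighbor memorization screen: ViT embeddings + cosine similarity.
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()


@torch.no_grad()
def embed_folder(folder: str):
    """Return (paths, CLS-token embeddings) for every PNG in a folder."""
    paths = sorted(Path(folder).glob("*.png"))
    feats = [vit(**processor(images=Image.open(p).convert("RGB"),
                             return_tensors="pt")).last_hidden_state[:, 0]
                 .squeeze(0).numpy()
             for p in paths]
    return paths, np.stack(feats)


real_paths, real_emb = embed_folder("real_csxa")
synth_paths, synth_emb = embed_folder("synthetic_pool")

sim = cosine_similarity(synth_emb, real_emb)               # (n_synth, n_real)
top = np.argsort(sim, axis=None)[::-1][:100]               # 100 most similar pairs
for k in top:                                              # pairs for side-by-side review
    i, j = np.unravel_index(k, sim.shape)
    print(f"{sim[i, j]:.3f}\t{synth_paths[i].name}\t{real_paths[j].name}")
```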
Eight experts (six neuroradiologists and two spine fellowship-trained neurosurgeons) from the University of Calgary and the University of Toronto completed a blinded validation task using the Qualtrics survey platform (Qualtrics, Provo, UT). Each expert reviewed 50 quartets (presented sequentially on separate pages), containing one real CSXA radiograph and three synthetic images sampled independently from the three checkpoints. For every quartet, raters identified the image they believed was real and assigned subjective realism ratings to each image on a four-point Likert scale (1 = unrealistic; 4 = fully realistic).
Identification accuracy was computed as the proportion of quartets in which the real image was correctly identified, aggregated across raters. Inter-rater agreement was quantified with Fleiss’ \(\kappa\) [31]. For realism ratings, each rater’s mean score per image group was computed. The mean score for the real images was then compared with the mean score for each of the three synthetic image groups using paired, two-sided Wilcoxon signed-rank tests. Holm’s procedure was applied to adjust for multiple comparisons. Statistical analyses were performed in R (version 4.4.2, The R Project for Statistical Computing) and rating distributions were visualized with violin plots created using the Python library Matplotlib.
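The analysis itself was performed in R; as a hedged illustration, the Python analogue below shows the same steps (Fleiss' kappa on the identification choices, paired two-sided Wilcoxon signed-rank tests on per-rater mean ratings, and Holm adjustment) on placeholder data with the study's dimensions.

```python
# Hedged Python analogue of the R analysis (placeholder data, study dimensions).
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Identification task: 50 quartets x 8 raters; each cell is the chosen image (0-3).
choices = rng.integers(0, 4, size=(50, 8))
counts, _ = aggregate_raters(choices)                # quartet-by-category counts
print("Fleiss' kappa:", fleiss_kappa(counts))

# Realism ratings: each rater's mean score for the real set and each checkpoint.
real_means = rng.uniform(3.0, 3.6, size=8)
synth_means = {c: rng.uniform(3.0, 3.6, size=8) for c in (34, 40, 80)}

p_raw = [wilcoxon(real_means, synth_means[c], alternative="two-sided").pvalue
         for c in (34, 40, 80)]
p_holm = multipletests(p_raw, method="holm")[1]      # Holm-adjusted p-values
print("unadjusted:", np.round(p_raw, 3), "Holm-adjusted:", np.round(p_holm, 3))
```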
The CSXA is an open-access dataset released under Creative Commons Attribution 4.0 International (CC BY 4.0) [23]. We have made our large-scale pooled dataset available for inclusion with this publication, consistent with principles of open science and the permissive terms of the CC BY 4.0 license. As members of the study team conducted a blinded review of images and we did not include human participants beyond co-authors, formal research ethics board approval was not required.
Training and validation losses converged between checkpoints 30 and 40, diverging thereafter (Figure 1). FID computed from 7,250-image synthetic batches declined over training (Figure 2),
indicating ongoing improvement in sample quality despite the loss divergence. On this basis, we selected checkpoints 34 and 40 to capture the period of loss convergence, and checkpoint 80 to assess potential benefits from extended training time not
reflected in the loss metrics.
Across 400 quartet judgements (eight raters with 50 quartets per rater), experts identified the real image in 29.0% of trials (chance level with four images per quartet: 25%). A sample quartet is provided in Figure 3. Inter-rater agreement for the identification task was
low (Fleiss’ \(\kappa\) = 0.061). Mean realism ratings were 3.323 for real images and 3.228, 3.258, and 3.320 for synthetic images sampled from checkpoints 34, 40, and 80, respectively (Figure 4). Paired Wilcoxon signed-rank tests showed no significant differences between real and synthetic mean ratings (unadjusted p = 0.128, 0.236, 1.000; Holm-adjusted p = 0.383, 0.471, 1.000; two-sided).
The memorization audit using ViT embeddings produced 100 real-synthetic pairs with the highest cosine similarity. Two independent reviewers found no visually explicit memorization among these top-similar pairs under the a priori criteria. The
top 100 most similar real-synthetic pairs are available at 3.
From the 27,000 generated images, scans with gross anatomical abnormalities were removed. The resulting dataset, comprising 20,063 images from the three checkpoints, is available at 2. A greater proportion of images were removed from the earlier checkpoints, reflecting a higher incidence of gross abnormalities in images sampled earlier in DDPM training. The specific counts of excluded images from each
checkpoint are detailed in Table 1.
In this study, we presented a “clinical Turing test” for synthetic lateral spine radiographs generated by a DDPM trained on an open-access cervical spine X-ray dataset. The DDPM generated synthetic neuroimaging data that neuroradiologists and spine
fellowship-trained neurosurgeons could not reliably distinguish from real clinical images. Identification accuracy was near chance, inter-rater agreement on real-image selection was low, and subjective realism ratings did not differ significantly between
real and synthetic images. These findings complement quantitative indicators such as FID by grounding evaluation in domain-expert perception of anatomical plausibility and realism.
Anatomically faithful synthetic neuroimaging data offer a practical solution to long-standing RWD limitations. Neurosurgical patient volumes are often small and unevenly distributed across sites, constraining the assembly of large, representative training
sets [1]–[4]. By contrast, once a DDPM is trained,
synthetic images can be produced at scale, enabling greater availability of open-access datasets for academic research and ML model development. Clinical data are also heterogeneous and frequently incomplete [5], [6], whereas synthetic images are generated in a standardized format and can be used to augment dataset size. The
labor necessary to de-identify RWD [6] and coordinate data-sharing agreements is significant [7]–[9]. Synthetic images, which are not tied to specific individuals, can reduce de-identification
burden and lower the privacy risks of sharing data between sites. Synthetic datasets may be generated from institutional data and shared in place of RWD to support large-scale, multi-institutional dataset curation. Synthetic datasets may also be generated
to preserve the informational value of clinical data beyond the confines of data retention periods. Beyond research and ML model development, shareable datasets may provide educational opportunities to create large-scale teaching banks for trainees to more
readily practice radiograph interpretation. In addition to trainee education, synthetic images can provide examples of pathological and non-pathological images for patient education (e.g., when demonstrating the difference between lateral cervical spine
X-rays of a degenerative and a normal spine). Our release of a pooled synthetic lateral cervical spine dataset aims to catalyze these use cases [32].
This work adds to a growing body of evidence evaluating the fidelity, utility, and privacy of synthetic data. Similar to past evaluations of GAN- and DDPM-based chest radiograph images [17], [18], our evaluation of DDPM-generated spine X-rays demonstrated high expert-perceived realism under blinded review. Our present work expands on
this literature in three ways. First, while chest X-ray pipelines have been trained on datasets numbering in the hundreds of thousands (e.g., MIMIC-CXR >370k; CheXpert >220k) [19], [20], neuroimaging datasets are typically several orders of magnitude smaller [1], [2]. Accordingly, we achieved comparable expert-perceived realism despite training on a much smaller corpus (<5k lateral radiographs).
Importantly, our top-k nearest neighbor audit found no visually explicit reproduction, mitigating concerns that the high realism merely reflects memorization or overfitting of the small training dataset. Second, we broadened our clinical Turing task
to include a multi-expert evaluation, beyond a radiologist-only evaluation. Third, we used a head-to-head design that allowed experts to directly compare one real image with three synthetic images, rather than presenting images in isolation. This
comparative format may have given subspecialists an additional opportunity to detect subtle inconsistencies across images, increasing sensitivity to differences that may not be readily apparent in isolation. Despite this more demanding setup, we observed
low accuracy and low inter-rater agreement on the identification task.
Beyond situating our results within the landscape of synthetic imaging data, it is instructive to note parallel progress in synthetic tabular neurosurgical data. Recent work has demonstrated that synthetic cohorts can preserve outcome-relevant signals for
real-world analyses and model training [12]. Building on both lines of evidence, the next step for our imaging-data pipeline is to adopt similar,
task-based utility benchmarks that demonstrate the concrete downstream value of our dataset. To that end, follow-up work will train models on our synthetic radiographs (alone and in mixed training sets with RWD) for relevant tasks (e.g., vertebral landmarking,
anatomical segmentation, disease classification) and compare performance with baseline models trained exclusively on RWD.
All CSXA radiographs were resampled to 256x256 pixels to simplify modeling and standardize training, which can distort anatomical proportions. Future work will scale training and synthesis to higher resolutions. FID, used for the initial evaluation of similarity between
the distributions of real and synthetic images, is not a radiology-native metric and may miss modality-specific properties. However, the FID findings were corroborated by the clinical Turing test.
Following FID analysis, scans with gross anatomical abnormalities were manually excluded as a pre-processing step prior to expert validation and public release. This reflects an important limitation of synthetic clinical image generation: the diversity of
outputs will inherently include anatomically implausible scans. Furthermore, given the manual nature of review, some scans featuring abnormal anatomy may still be included in our released dataset. Future work should improve DDPM performance to minimize the
generation of implausible scans and/or institute automated post-generation filters.
Memorization remains a central concern for synthetic medical images. In our top-k nearest neighbor audit using ViT embeddings, two independent reviewers found no visually explicit memorization among the 100 most-similar real-synthetic pairs. While a
negative finding in a nearest-neighbor screen indicates that no direct regurgitation of images occurred, it cannot guarantee the absence of memorization. Development of a standardized approach to fidelity, utility, and privacy benchmarking for radiographic
imaging data, similar to tabular data [33], is warranted. This may include stress tests that vary training set size and sampling parameters.
Our dataset and clinical Turing test focused on lateral cervical spine radiographs with a pathology distribution shaped by the source dataset. Generalization to other projections, imaging modalities, and richer pathology remains to be demonstrated. Future
experiments should quantify the extent to which DDPMs retain aberrant or pathologic anatomy found in training data and whether such findings are preserved at a prevalence similar to that in the RWD.
This study demonstrates the growing fidelity and value of synthetic neuroimaging data. Utilizing a DDPM trained on an open-access cervical spine radiographic atlas, we synthesized lateral X-rays that blinded subspecialists rated as indistinguishable from true clinical images. This novel application of DDPMs in neuroimaging highlights the potential to generate large-scale imaging datasets to support data-sharing and ML model training for landmarking, segmentation, and classification tasks.
| Sampling run | Images generated | Images retained after review |
|---|---|---|
| Checkpoint 34 (initial sampling) | 1,000 | 607 |
| Checkpoint 40 (initial sampling) | 1,000 | 654 |
| Checkpoint 80 (initial sampling) | 1,000 | 742 |
| Checkpoint 34 (large generation) | 8,000 | 5,071 |
| Checkpoint 40 (large generation) | 7,968 | 5,999 |
| Checkpoint 80 (large generation) | 8,032 | 6,990 |
| Total | 27,000 | 20,063 |