Normative Modelling in Neuroimaging:
A Practical Guide for Researchers
September 08, 2025
CNNP Lab (www.cnnp-lab.com), School of Computing, Newcastle University, Newcastle upon Tyne, United Kingdom
Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, United Kingdom
UCL Queen Square Institute of Neurology, Queen Square, London, United Kingdom
* Yujiang.Wang@newcastle.ac.uk
Normative modelling is an increasingly common statistical technique in neuroimaging that estimates population-level benchmarks in brain structure. It enables the quantification of individual deviations from expected distributions whilst accounting for biological and technical covariates, without requiring large, matched control groups. This makes it a powerful alternative to traditional case-control studies for identifying brain structural alterations associated with pathology. Despite the availability of numerous modelling approaches and several toolboxes with pretrained models, their distinct strengths and limitations make it difficult to determine how and when to implement them appropriately. This review offers practical guidance and outlines statistical considerations for clinical researchers using normative modelling in neuroimaging. We compare several open-source normative modelling tools through a worked example using clinical epilepsy data, outlining decision points, common pitfalls, and considerations for responsible implementation to support broader and more rigorous adoption of normative modelling in neuroimaging research.
The shape of the human brain can be quantified by structural neuroimaging. Deviations from typical morphology are often associated with neurological disorders, making accurate detection essential. Normative modelling is a statistical approach that establishes reference distributions of brain metrics such as cortical thickness, surface area, and volume based on the healthy population. Applying these models to new data yields deviation scores, for example z-scores or centiles, that quantify how an individual diverges from the healthy baseline [1]. A familiar analogy is paediatric growth charts: just as height centiles assess a child’s growth relative to peers, normative curves evaluate brain metrics against age- and sex-matched standards [1].
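The two forms of deviation score carry the same information: when z-scores are defined as quantile residuals, a z-score maps onto a centile through the standard normal cumulative distribution function. A minimal illustration in Python (values purely for demonstration):

```python
from scipy.stats import norm

z = -1.5                      # observation 1.5 SD below the normative prediction
centile = 100 * norm.cdf(z)   # ~6.7th centile of the reference distribution
print(f"z = {z} corresponds to the {centile:.1f}th centile")
```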
Normative modelling removes the need for large, demographically matched control groups, making it increasingly popular in clinical research across different neurological disorders. In epilepsy, it can help localise structural abnormalities and predict treatment outcomes [2]. Similarly, the approach can characterise patient heterogeneity in mental disorders [3]–[5] and traumatic brain injury [6], [7]. It has proved valuable in studies of mild cognitive impairment and dementia [8]–[12], developmental psychiatry [13], [14], schizophrenia [15], [16], ADHD [17], and autism spectrum disorder [18]–[20]. The recent growth of pretrained models and user-friendly online platforms offers a powerful alternative to traditional case–control analyses.
Normative models use large healthy cohorts to learn the distribution of neuroimaging features as a function of covariates such as age, sex, and scanner site. The training dataset generally includes data acquired across multiple sites, and covers a wide age range. Typical statistical approaches to fit these models include linear regression, Gaussian process regression, generalised additive models of location, scale, and shape, and Bayesian frameworks [21]–[23]. Unlike case-control studies with restricted control sample sizes, normative models can therefore infer complex, non-Gaussian distributions of morphological features, which may more accurately reflect the distribution in the healthy population. Once trained, a model predicts expected values for new participants. Comparing observed metrics to these predictions yields individual deviation scores that account for biological (e.g. age, sex) and technical (e.g. scanner site [24]) variability. Calibrating pretrained models to new scanners requires a relatively small reference set of healthy controls from that site to estimate and correct site-specific effects. Without calibration, scanner artefacts may confound true pathological deviations.
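As a concrete illustration of the calibration step, the sketch below estimates a site-specific offset and scale from a small local control set and applies them when computing deviation scores. This is a minimal Gaussian sketch with simulated numbers; the function and variable names are hypothetical, and real platforms implement more sophisticated schemes (e.g. hierarchical Bayesian adaptation).

```python
import numpy as np

def calibrate_site(pred_mean_ctrl, pred_sd_ctrl, observed_ctrl):
    """Estimate a site-specific offset and scale from local healthy controls.

    pred_mean_ctrl / pred_sd_ctrl: normative-model predictions for the controls
    observed_ctrl: measured values (e.g. regional cortical thickness)
    """
    # Residuals of local controls relative to the pretrained normative mean
    resid = (observed_ctrl - pred_mean_ctrl) / pred_sd_ctrl
    site_offset = resid.mean()        # systematic shift introduced by the scanner
    site_scale = resid.std(ddof=1)    # site-specific change in variability
    return site_offset, site_scale

def deviation_scores(pred_mean, pred_sd, observed, site_offset, site_scale):
    """Z-scores for new participants after removing the estimated site effect."""
    z_raw = (observed - pred_mean) / pred_sd
    return (z_raw - site_offset) / site_scale

# Illustrative example with simulated values
rng = np.random.default_rng(0)
pred_mean, pred_sd = 2.5, 0.12          # model-predicted CT mean and SD (mm)
controls = rng.normal(2.56, 0.12, 30)   # 30 local controls; scanner adds +0.06 mm
offset, scale = calibrate_site(pred_mean, pred_sd, controls)
z_patient = deviation_scores(pred_mean, pred_sd, observed=2.30,
                             site_offset=offset, site_scale=scale)
```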
In recent years, several open-access platforms of pretrained normative models for brain morphology have been developed, including Brain MoNoCle [25], BrainChart [26], PCN Toolkit [27], and CentileBrain [22]. These web-based platforms let users upload processed imaging data and obtain deviation scores without needing to assemble training data or build normative models themselves. Although valuable, the complexities and distinct limitations of normative modelling can make these platforms difficult to use appropriately for clinical researchers without extensive statistical training. Researchers may struggle to choose between models or platforms, to determine how many calibration controls they need, or to understand the consequences of using mismatched (or no) controls. Clear, practical guidelines on the proper and effective application of these methods to clinical data are lacking.
In this review, we provide practical guidance for applying normative modelling in clinical neuroimaging. We detail statistical considerations, highlight common pitfalls, and illustrate each point with a worked example. Specifically, we address:
How does the choice of normative model or platform influence results?
How many healthy controls are needed to calibrate a new scanner site?
What impact arises from demographic mismatches between cases and controls?
Can deviation scores be reliably computed without any site-matched controls for calibration?
It is our hope that this review will empower clinical researchers to adopt normative modelling techniques with confidence, enhancing the sensitivity and reproducibility of neuroimaging studies.
The growing availability of large-scale neuroimaging datasets, combined with advances in statistical techniques, has led to multiple pretrained normative models of brain morphology, including Brain MoNoCle, BrainChart, PCN Toolkit, and CentileBrain. These platforms differ in model type (e.g. Generalized Additive Models for Location, Scale and Shape (GAMLSS), Bayesian Linear Regression (BLR), Multivariate Fractional Polynomial Regression (MFPR), and Hierarchical Bayesian Regression (HBR)), underlying reference datasets, morphometric measures, and output formats. Despite their increasing adoption, the impact of model choice on downstream results is not fully established. On the model-training side, the choice of algorithm and parameters can affect training efficiency and performance [22]. Here, however, we test whether different pretrained models differ in terms of end-user performance. To illustrate potential differences between pretrained modelling platforms, we analysed a clinical dataset of patients with mesial temporal lobe epilepsy (mTLE) from the IDEAS study [28], using four normative tools: Brain MoNoCle, CentileBrain, PCN Toolkit, and BrainChart. Analyses focused on average cortical thickness (CT), the most commonly modelled morphometric measure. Regional CT values were extracted using the Desikan–Killiany (DK) atlas via the FreeSurfer recon-all pipeline. Pretrained normative models for the DK atlas were available in Brain MoNoCle, CentileBrain, and PCN Toolkit. Each model computed individual deviation scores (z-scores), adjusting for covariates including age, sex, and scanner site, and we assessed agreement between z-scores across modelling platforms. We then calculated effect sizes (Cohen’s d) for each DK region, comparing a cohort of patients with right-onset mTLE to controls, and compared these group-level effect sizes across models. Additional hemisphere-level analyses comparing BrainChart and Brain MoNoCle are reported in the Supplementary.
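For readers who wish to reproduce this comparison, the sketch below shows one way to compute regional Cohen's d from platform-exported z-scores and quantify agreement between two platforms. The variable names, and the assumption that each platform exports a subjects-by-regions table of z-scores, are illustrative and not part of any platform's API.

```python
import numpy as np
import pandas as pd

def cohens_d(patients, controls):
    """Cohen's d with a pooled standard deviation."""
    n1, n2 = len(patients), len(controls)
    pooled_sd = np.sqrt(((n1 - 1) * patients.std(ddof=1) ** 2 +
                         (n2 - 1) * controls.std(ddof=1) ** 2) / (n1 + n2 - 2))
    return (patients.mean() - controls.mean()) / pooled_sd

def regional_effect_sizes(z_scores, is_patient):
    """Cohen's d for every DK region (one column per region in the z-score table)."""
    return z_scores.apply(lambda col: cohens_d(col[is_patient], col[~is_patient]))

# z_monocle, z_pcn: DataFrames of per-region z-scores exported from two platforms,
# indexed by subject; `is_patient` is a boolean Series on the same index.
# d_monocle = regional_effect_sizes(z_monocle, is_patient)
# d_pcn = regional_effect_sizes(z_pcn, is_patient)
# agreement = np.corrcoef(d_monocle, d_pcn)[0, 1]   # cross-platform agreement
```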
Despite differences in algorithms, reference datasets, and modelling frameworks across the pretrained models, outputs were broadly consistent (Figure 2). Individual z-scores were in high agreement across models. Regional effect sizes also showed strong agreement, particularly between Brain MoNoCle and PCN Toolkit. Effect sizes from CentileBrain were systematically offset relative to Brain MoNoCle and PCN Toolkit, reflecting differences in scaling, but the relative pattern of abnormality across regions remained similar.
These findings suggest that normative modelling platforms provide reliable individual- and group-level outputs because they all capture the same underlying population distribution of morphometric features. Large, representative training datasets and well-regularized models contribute to this consistency. Nonetheless, systematic offsets, such as those observed with CentileBrain, highlight that absolute values may differ even when relative patterns are preserved. Therefore, when feasible, using multiple platforms to validate key findings is advisable.
In practice, choice of platform may depend on study goals and practical considerations. These include the morphometric metrics supported (e.g. cortical thickness vs. surface area), compatibility with specific atlases, type of outputs (e.g. z-scores vs. centiles), computational efficiency, and ease of integration into an existing analysis pipeline.
Key takeaway: Normative modelling platform outputs generally agree, but absolute values can differ. Validate findings across tools and choose the model that best fits your metrics, atlas, and workflow.
Normative models are trained on large healthy cohorts to establish expected brain morphology across age and other covariates (e.g. sex, scanner site). Empirical benchmarking on over 37,000 participants shows that model convergence and stability typically require training samples of roughly 3,000 subjects [22], [25]. When applied to a new dataset, these models are calibrated using a smaller, site-specific control cohort to infer site effects. Hierarchical Bayesian approaches further support this adaptability by using informative priors that preserve the normative baseline even with small adaptation samples [21]. This allows patient-level evaluation without large, matched control groups at every site, a major advantage in clinical studies where healthy control numbers are often limited. However, the practical question remains: how small can the site-matched control cohort be while still allowing accurate site adjustment?
To investigate this, we used Brain MoNoCle to compute right hemisphere cortical thickness decreases associated with right-hemisphere mTLE in the IDEAS dataset, adjusting the normative model with varying sizes of healthy control groups: small (n = 10), medium (n = 30), and full (n = 69). For each sample size, we performed 100 repetitions via sampling with replacement to simulate drawing controls from a larger population. Z-scores were calculated for all individuals, and Cohen’s d was used to compute effect sizes between controls and people with TLE.
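A simplified sketch of this resampling procedure is shown below. The normative-model step is abstracted into a `compute_z` callable (hypothetical), which would calibrate the pretrained model with the given control sample and return deviation scores; the resampling and effect-size steps follow the scheme described above.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.std(ddof=1) ** 2 +
                         (nb - 1) * b.std(ddof=1) ** 2) / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

def resampled_effects(compute_z, controls_ct, patients_ct, n_controls,
                      n_reps=100, seed=0):
    """Resample controls with replacement, recalibrate, and recompute Cohen's d.

    compute_z(values, calibration_controls) stands in for the normative-model
    step: calibrate with the control sample, then return z-scores for `values`.
    """
    rng = np.random.default_rng(seed)
    effects = np.empty(n_reps)
    for i in range(n_reps):
        sample = rng.choice(controls_ct, size=n_controls, replace=True)
        z_controls = compute_z(sample, calibration_controls=sample)
        z_patients = compute_z(patients_ct, calibration_controls=sample)
        effects[i] = cohens_d(z_controls, z_patients)
    return effects

# effects_n10 = resampled_effects(compute_z, controls_ct, patients_ct, n_controls=10)
# effects_n30 = resampled_effects(compute_z, controls_ct, patients_ct, n_controls=30)
```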
While mean effects were similar across control sample sizes, very small groups introduced high variance, often over- or underestimating effects (Figure 3). For example, with n = 10 controls, only 30% of estimates fell within one standard deviation of the effect derived from the full control group. With n = 30, consistency improved substantially, with 50% of estimates within one standard deviation.
This demonstrates the importance of using adequately sized control groups when adjusting normative models to new sites. Although fewer controls are needed than if sex and age effects had to be inferred directly from matched controls, too few controls can distort biomarker discovery in clinical populations. Our findings using the Brain MoNoCle app suggest that normative models may perform reasonably well with as few as 30 site-matched controls, producing robust estimates of the site-specific mean and standard deviation. Theoretical calculations support this: with n = 30, there is a 98% probability of estimating the standard deviation to within 30% of its true value [29].
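This figure can be checked from the sampling distribution of the sample standard deviation: for n independent Gaussian observations, \((n-1)s^2/\sigma^2\) follows a chi-squared distribution with \(n-1\) degrees of freedom, so the probability that s falls within 30% of \(\sigma\) is a difference of two chi-squared CDF values. A quick check (assuming normally distributed residuals):

```python
from scipy.stats import chi2

n = 30
df = n - 1
# P(0.7 < s/sigma < 1.3) = P(df * 0.7**2 < chi2_df < df * 1.3**2)
p = chi2.cdf(df * 1.3**2, df) - chi2.cdf(df * 0.7**2, df)
print(f"P(s within 30% of sigma) for n = {n}: {p:.3f}")  # ~0.98
```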
Key takeaway: For site-specific calibration of normative models, control cohorts as small as 30 subjects may provide robust estimates, though this number may vary between models. Very small cohorts (n \(\le\) 10) can produce unreliable deviation scores, and larger samples remain preferable when feasible.
In clinical research, the reliability of outputs depends on the quality and representativeness of the control cohort. Brain structure changes significantly with age and differs between males and females [30], [31], so traditional case-control studies often match participants by age and sex to avoid bias [32]. When control groups are demographically mismatched, results can be compromised. Normative modelling addresses this by statistically adjusting for covariates, providing accurate interpretations even under less-than-ideal conditions. To evaluate robustness under demographic imbalances, we simulated two scenarios using the IDEAS dataset and average cortical thickness (CT) as the morphometric measure.
In the first scenario, we tested age mismatches. Two control samples were prepared: one spanning the full age range (n = 34), and another including only older controls (age \(>\) 40; n = 34). Pretrained regional normative models (Brain MoNoCle) were calibrated to the new site, and used to derive regional effect sizes of cortical thickness for patients with right-onset TLE compared to healthy controls. We found strong agreement between model outputs using balanced versus older-only control samples across regions (Figure 4), with regional effects differing by only a small margin (average difference in regional effect size = 0.15).
In the second scenario, we assessed sex imbalances. Case-control effects between patients with right-onset TLE and controls were derived using either a sex-balanced control group (n = 50; 25 males, 25 females) or a sex-imbalanced group (n = 50; 10 males, 40 females) for calibrating pretrained regional normative models. Outputs from the sex-imbalanced group were highly correlated with those from the balanced group (Figure 5), with minimal differences in regional effects (average difference in regional effect size = 0.09).
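For reference, constructing the balanced and imbalanced calibration sets amounts to stratified subsampling of the control demographics table; a minimal sketch is given below (the 'sex' column name and coding are assumptions, not the dataset's actual field names).

```python
import pandas as pd

def sample_by_sex(controls, n_male, n_female, seed=0):
    """Draw a control subset with a fixed sex composition.

    `controls` is assumed to be a DataFrame with a 'sex' column coded 'M'/'F'.
    """
    males = controls[controls["sex"] == "M"].sample(n=n_male, random_state=seed)
    females = controls[controls["sex"] == "F"].sample(n=n_female, random_state=seed)
    return pd.concat([males, females])

# balanced = sample_by_sex(controls, n_male=25, n_female=25)
# imbalanced = sample_by_sex(controls, n_male=10, n_female=40)
# Each subset then calibrates the pretrained model before z-scores are computed.
```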
These results demonstrate that normative modelling is robust to common demographic imbalances. Strong agreement between outputs from balanced and biased control samples shows that covariate effects can be effectively adjusted, even when the control cohort is not perfectly representative. While using demographically representative samples remains best practice, these findings support the practical utility of normative models in clinical settings, where ideal control data are often difficult to obtain.
Key takeaway: Normative models are robust to demographic mismatches; even when control cohorts are age- or sex-biased, deviation scores remain reliable. Nonetheless, using representative controls is recommended when possible.
In quantitative neuroimaging, technical factors such as scanner hardware, site, and acquisition protocol introduce systematic variability into the data. Multi-site studies have repeatedly shown that these factors can confound analyses. For example, one study demonstrated substantial variation in cortical thickness measurements across 11 scanners [33], and another found that, even after advanced preprocessing, a classifier could accurately identify the scanner used, highlighting persistent site-specific bias [34]. Such effects can obscure pathology- or covariate-related morphological signals. Accurate matching of controls with patients at the same site and using the same protocols is therefore critical for reliable deviation scores and valid inferences.
A common challenge in clinical studies is the absence of local control data collected under the same scanning conditions as the patient cohort. Researchers may be tempted to use normative models adjusted with controls from a different site or scanner, assuming that statistical covariate adjustment (e.g. for age and sex) is sufficient. We tested this scenario using the IDEAS dataset.
We derived regional effect sizes of cortical thickness for patients with right TLE compared to healthy controls using Brain MoNoCle twice: once with controls acquired at the same site and with the same protocol as the patients, and once with controls from a different scanner and acquisition protocol. Comparison of the outputs revealed substantial discrepancies (Figure 6). With mismatched controls, there was little agreement with the findings obtained using correctly matched controls, and the average difference in effect size was 0.43.
This example underscores the critical importance of site-matching controls and patients when applying normative models. Scanner-specific effects must be estimated for each new site. Using controls acquired under different scanning conditions introduces systematic errors in deviation scores, producing unreliable estimates.
Key takeaway: Deviation scores cannot be reliably computed without local controls matched by site and acquisition protocol. To ensure valid inference, every new dataset should include site-matched controls for model calibration.
We have examined key methodological considerations for applying normative models to clinical neuroimaging data, focusing on the influence of model choice, control sample size, demographic composition, and scanner/protocol matching on results.
Different normative modelling tools generally produce highly consistent individual- and group-level outputs. Absolute effect sizes may differ slightly between platforms, but relative patterns of pathology are preserved. Platform selection should therefore be guided by practical considerations, including model flexibility, the structural measures of interest, atlas compatibility, output format, computational efficiency, and ease of integration.
Normative models reduce the need for large, site-matched control groups, but very small samples can introduce high variance and unreliable deviation scores. A minimum of \(\approx\) 30 controls per site provided stable estimates in our case study, though this may vary when using other models, and larger samples are generally preferred whenever feasible.
Normative models can adjust for covariates such as age and sex, making them robust to moderate demographic imbalances. While demographically representative controls are ideal, deviation scores remain reliable even when the control cohort is somewhat biased.
Accurate matching of controls and patients by scanner site and acquisition protocol is critical. Mismatched controls introduce systematic errors that can compromise deviation scores and lead to spurious findings. Site-matched controls should always be used when calibrating normative models to new datasets.
Normative modelling is a powerful tool for detecting brain deviations in clinical populations, offering flexibility and robustness across tools and demographic variations. Ensuring adequate control sample sizes and strict site/protocol matching maximizes the reliability and clinical utility of deviation scores.
Normative modelling is a valuable approach for individual-level brain assessment in both research and clinical settings. However, its effectiveness is highly dependent on careful methodological choices and an understanding of potential pitfalls. Normative modelling remains effective even for demographically imbalanced data, but we emphasise the importance of a minimum number of control subjects and of matching controls to the same scanner as the patient cohort to obtain reliable outputs. When these practical considerations are observed and usage pitfalls avoided, normative models are a highly reliable and powerful tool to support neuroimaging studies.
We thank members of the Computational Neurology, Neuroscience & Psychiatry Lab (www.cnnp-lab.com) for discussions on the analysis and manuscript; P.N.T. and Y.W. are both supported by UKRI Future Leaders Fellowships (MR/T04294X/1, MR/V026569/1).
Supplementary Text