Introduction

Neurodegenerative diseases cause accelerated neuronal death, leading to varying degrees of cerebral atrophy1. The spatial pattern of atrophy provides critical insights for differential diagnosis and disease staging2,3. To better quantify and interpret such patterns, normative modeling tools have emerged as valuable resources. These tools rely on reference curves that represent healthy age-related variation in brain structure, allowing individual deviations from typical trajectories to be identified4,5,6. By comparing an individual’s brain volumes to these normative models, deviations indicative of abnormal atrophy can be detected. In the context of neurodegenerative disease, atrophy is typically defined as regional volumes falling below the lower bounds of these normative distributions, providing a standardized and interpretable framework for assessing structural brain changes7.

However, the increasing use of large, diverse datasets has introduced additional challenges. Modern research relies on databases of unprecedented size, incorporating data from a wide variety of sources (e.g., Coupe et al., (2023)4 pooled 40,944 subjects across 24 databases). This diversity increases the prevalence of “machine effects”: differences in MRI acquisition parameters8, scanner hardware9, and processing pipelines10 may bias volumetric measurements. Such variability can mask subtle disease-related changes, as differences in MRI acquisition parameters alone can alter volume measures by up to 4.8%11, comparable to early disease-related brain volume changes12.

Tools for volume-based diagnostics enable clinicians to monitor disease progression and evaluate severity, with atrophy serving as a reliable biomarker14,15. However, the impact of machine effects underscores the need for robust algorithms. Commercial tools for volume-based diagnostics have emerged to meet increasing demand16,17,18, yet most lack thorough validation: of 17 identified products, only 4 underwent clinical validation in dementia populations19. Moreover, normative datasets vary widely in size (100-8,000 subjects), raising concerns about reliability and generalizability19. Further validation is essential before these tools can realize their full potential in neurodegenerative disease diagnosis and monitoring.

Several studies propose reference curves with divergent trajectories for cortical and subcortical structures, sometimes described as linear, U-shaped, or complex polynomial curves, highlighting a lack of consensus in the field20. This variability has led to inconsistent findings21 and, in some cases, may even alter the observed differences between control and pathological groups22. As explained by Coupe et al., (2017)20, several factors contribute to these inconsistencies: the use of data covering only restricted age ranges, which biases the construction of reference curves; limited scanner diversity within certain age groups19,23; non-harmonized acquisition protocols24; differences in curve modeling approaches and segmentation tools20,23; and the use of heterogeneous volumetric measures (e.g., absolute volumes, normalized volumes, or z-scores)25.

Selecting a model for normative curves is challenging, requiring a balance between flexibility and overfitting (where the model becomes too tailored to the training data and fails to generalize to new data)23,26. Options range from linear models20 to advanced methods like GAMLSS27. This study uses Generalized Additive Models (GAMs) to test segmentation reproducibility with a standard approach. GAMs extend generalized linear models, offering flexibility and controlling overfitting through constraints28,29.

Normative curve robustness is as crucial as model selection. Liu et al., (2024)24 show that identical models yield different curves across datasets. Sample size, age representation, and intracranial volume normalization further impact results20. Temporal dependencies and machine effects also challenge reproducibility, underscoring the need for robust methods.

Several methods exist for comparing reference curves. Summing signed pairwise distances is simple but prone to error cancellation (e.g., deviations of +5% and -5% at different ages would cancel out). Advanced time series metrics (please refer to Supplementary Section S1) provide a more robust assessment by capturing trajectory alignment. These metrics, available in classical, symmetrized, and sometimes weighted forms, were chosen for their interpretability, consistency, scale invariance, and balanced handling of over- and under-predictions.

The first objective of this study is to investigate the robustness of AssemblyNet, FastSurfer and FreeSurfer in generating reference curves and evaluating brain atrophy using GAMs. The second objective is to develop a methodology for comparing reference curves estimated from volumetric data of 3,730 healthy subjects. These curves, generated from brain volumes extracted by the three segmentation algorithms, are evaluated for robustness across magnetic field strengths. To our knowledge, no study has used GAMs to build reference curves for AssemblyNet, FastSurfer, and FreeSurfer, quantified the impact of the segmentation algorithm, or assessed robustness to machine effects. The third objective is to assess the sensitivity and stability of the segmentation algorithms by measuring the proportion of Alzheimer’s disease (AD) patients with hippocampal atrophy30,31 and abnormal HAVAs (Hippocampal-Amygdalo-Ventricular Atrophy) scores32, and comparing the results with the literature. The Scheltens scale33 is commonly used but observer-dependent34. While visual assessment detects cerebrospinal fluid enlargement, it only indirectly reflects gray or white matter loss. Automated segmentation directly targets gray matter, overcoming this limitation. To our knowledge, the impact of machine effects on atrophy assessment using GAMs remains unexplored. For an overview of this study, please refer to Fig. 1.

Fig. 1
figure 1

Study overview.

Materials and methods

Data

Data used to construct reference curves

To build our reference curves, we segmented 3730 T1-weighted MRI scans of healthy subjects from 11 open-access datasets: ABIDE35 (n=469), ICBM36 (n=294), IXI (https://brain-development.org/ixi-dataset/) (n=549), ADNI37 (n=373), OASIS138 (n=298), PPMI39 (n=166), UCLA40 (n=125), DLBS (https://sites.utdallas.edu/dlbs/) (n=315), SALD41 (n=494), NIFD (http://memory.ucsf.edu/research/studies/nifd) (n=114), and SLIM (https://fcon_1000.projects.nitrc.org/indi/retro/southwestuni_qiu_index.html) (n=580).

Figure 2 illustrates the distribution by age and study. Table 1 shows a summary of the key characteristics of the 11 datasets.

Part of the data for this work were sourced from the International Consortium for Brain Mapping (ICBM) dataset (https://www.loni.usc.edu/). Additional data were sourced from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (adni.loni.usc.edu). Established in 2003 as a public-private partnership and led by Principal Investigator Michael W. Weiner, MD, ADNI’s primary objective is to evaluate whether the combination of serial MRI, positron emission tomography (PET), various biological markers, and clinical and neuropsychological assessments can track the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). For the most recent updates, please visit www.adni-info.org. In addition, we used data from the Open Access Series of Imaging Studies (OASIS) OASIS1 release, which provides cross-sectional MRI data in young, middle-aged, nondemented, and demented older adults. Finally, NIFD is the abbreviation for the frontotemporal lobar degeneration neuroimaging initiative (FTLDNI).

Fig. 2
figure 2

Datasets used to construct our reference curves: distribution by age and study.

Table 1 Main characteristics of the datasets used to construct the reference curves. N_sub: number of subjects; min/max/mean/std: age statistics; F/M: number of females/males; 1.5T/3T: number of scans acquired at each field strength.
Table 2 Main characteristics of the datasets used for sensitivity and stability analysis. N_sub: number of subjects; min/max/mean/std: age statistics; F/M: number of females/males; 1.5T/3T: number of scans acquired at each field strength; CN: controls subjects; AD: Alzheimer’s patients.

Data for sensitivity and stability analysis

To evaluate the sensitivity of segmentation algorithms, we segmented images of 219 AD subjects and 255 control subjects from ADNI (https://adni.loni.usc.edu/). All subjects are between 65 and 80 years of age.

For the stability analysis, we used the data of 46 patients with “mild-moderate Alzheimer’s disease” and 22 age-matched healthy subjects from Malone et al., (2013)42 (MIRIAD). They were scanned eight times (at 2, 6, 14, 26, 38 and 52 weeks, and at 18 and 24 months from baseline) on the same 1.5T scanner. Additionally, stability was assessed using the data of 9 healthy traveling subjects from Tanaka et al., (2021)43 (SRPBS). They were scanned at 12 different imaging centers within a 30-day period, for a total of 156 exams. Three MRI manufacturers and seven MRI scanner types were used (3T only). Table 2 presents the datasets used for the sensitivity and stability analysis of segmentation algorithms.

Pipeline analysis

Segmentation algorithms

The data were segmented using AssemblyNet (version 1.0.0), a segmentation algorithm based on a large ensemble of Convolutional Neural Networks (CNNs)44, which performs segmentation with 250 deep learning models organized in a multiscale framework.

We compared our results with FreeSurfer, one of the most widely used segmentation software packages in neuroimaging45. All data were processed using FreeSurfer version 7.3.1. We segmented each subject using the automated “recon-all” pipeline with default parameters. Because of the time required for segmentation (around 15 h per subject), the FreeSurfer segmentations were computed on the VIP platform, which provides substantial computing resources46.

FastSurfer47 (version 2.4.2) builds upon FreeSurfer and incorporates deep learning technologies similar to those used in AssemblyNet, enabling segmentation in approximately ten minutes. FastSurfer utilizes three Fully Convolutional Neural Networks (F-CNNs), each responsible for segmenting 2D slices in the coronal, axial, and sagittal planes. The three segmentations are then aggregated.

GAM and constraints

Reference curves were estimated using Generalized Additive Models (GAMs)48 (pyGAM ExpectileGAM, version 0.9.1). A GAM fits a curve to a set of points using a flexible combination of smooth functions, capturing non-linear relationships in the data. GAMs extend generalized linear models, resulting in a highly flexible framework in which overfitting is easy to control.

To prevent overfitting and avoid “non-physical” behaviors, such as abrupt variations in the variable over short periods throughout life, we applied convex or concave smoothing techniques. Overfitting can result from excessive learning from data, leading to a strong influence from specific studies rather than generalizable patterns. By applying appropriate smoothing methods, we ensure a more physiologically plausible representation of changes over time. Supplementary Section S2, Fig. S1 illustrates smoothing effects.
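As an illustration, the sketch below fits constrained expectile curves with pyGAM’s ExpectileGAM (the package and version cited above). The data are synthetic and the spline settings are assumptions, not the exact configuration used in this study.

```python
import numpy as np
from pygam import ExpectileGAM, s

# Synthetic ages and ICV-normalized volumes for one region (illustrative only;
# the study fits real volumes from 3,730 healthy subjects).
rng = np.random.default_rng(0)
ages = rng.uniform(5, 90, 500)
volumes = 0.45 - 0.002 * np.clip(ages - 40, 0, None) + rng.normal(0, 0.02, 500)
X = ages.reshape(-1, 1)

# 5th, 50th and 95th expectile curves with a concave shape constraint,
# preventing abrupt, "non-physical" variations over short age spans.
grid = np.linspace(5, 90, 200).reshape(-1, 1)
bounds = {}
for q in (0.05, 0.50, 0.95):
    gam = ExpectileGAM(s(0, constraints='concave'), expectile=q).fit(X, volumes)
    bounds[q] = gam.predict(grid)  # lower, mean and upper reference curves
```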

Curve stability assessment

Methods for comparing reference curves across magnetic field strengths

Results in the “Curve stability assessment” section include visual and quantitative analyses: the former offers a graphical comparison, while the latter quantifies visual differences using several metrics. We assessed the robustness of segmentation algorithms to magnetic field strength variations by comparing reference curves derived from 1.5T data with those from 3T data. For this analysis, we compared the lower, mean, and upper bound curves using metrics designed to quantify differences between curve pairs.

Measure of errors between curves

We aimed to compare reference curves that depict volume evolution over a lifespan. To achieve this, we explored metrics for time series analysis and forecasting that calculate errors between these curves. Given the nature of our data, the selected metrics had to meet specific requirements: 1) Interpretability: the metrics should be easy to interpret, for example, by being on the same scale as the data or presented as a percentage. 2) Independence from error sign: positive and negative errors should not cancel out; the metric should not differentiate between the directions of errors. 3) Scale-independence/invariance: the metric should remain consistent regardless of data scaling, allowing comparisons between different algorithms. 4) Equal treatment of over-predictions and under-predictions.

Based on the literature, we selected 18 error metrics for evaluating the reference curves. An overview of these metrics is provided in Supplementary Section S1, where we define each metric, and discuss its strengths and limitations. We summarized the characteristics of the 18 metrics in relation to the selection criteria in Supplementary Section S1, Table S1. This table supports our selection of 5 key metrics for the comparison of reference curves: the symmetric Mean Squared Percentage Error (sMSPE49, sktime version 0.34.0), the Mean Absolute Scaled Error (MASE49, sklearn 1.4.1), the symmetric Mean Absolute Percentage Error (sMAPE50), the weighted Mean Absolute Percentage Error (wMAPE51) and the symmetric Median Absolute Percentage Error (sMdAPE49, sktime version 0.34.0). Since these are error metrics, the ideal value for all of them is 0, indicating no difference between reference curves.

Each metric has specific characteristics and relevance to our analysis, as described below. sMSPE amplifies larger errors by squaring the percentage differences, providing insight into the presence of significant deviations. sMAPE quantifies the percentage difference between predicted and reference values, taking into account both over- and under-predictions; a lower sMAPE indicates higher agreement between the predicted and reference values. wMAPE weights absolute percentage errors by the magnitude of the observed (reference) values, making it particularly useful for datasets spanning a wide range of volumes, where larger structures might otherwise dominate the error. sMdAPE focuses on the median of absolute percentage differences, making it less sensitive to outliers.
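For reference, the sketch below gives common textbook formulations of these metrics in plain NumPy. The study used the sktime and sklearn implementations cited above, which may differ in details such as whether values are expressed as fractions or percentages.

```python
import numpy as np

def smape(y, y_hat):
    """Symmetric MAPE (%): bounded and treats over- and under-prediction equally."""
    return 100 * np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def smdape(y, y_hat):
    """Symmetric median APE (%): like sMAPE but robust to outliers."""
    return 100 * np.median(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def smspe(y, y_hat):
    """Symmetric mean squared percentage error: squaring amplifies large deviations."""
    return np.mean((2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))) ** 2)

def wmape(y, y_hat):
    """Weighted MAPE (%): absolute errors weighted by the reference magnitudes."""
    return 100 * np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y))

def mase(y, y_hat):
    """MASE: mean absolute error scaled by the in-sample naive forecast error."""
    return np.mean(np.abs(y - y_hat)) / np.mean(np.abs(np.diff(y)))
```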

Method for identifying biases related to magnetic field strength: bootstrapping

To evaluate the presence of bias related to the magnetic field strength, we compared the metric values (sMSPE, MASE, sMAPE, wMAPE and sMdAPE) obtained using the true labels with the distributions generated through bootstrapping52. We applied bootstrapping to evaluate the variability of error metrics under different configurations of shuffled labels. This approach helps to identify biases in the results.

Initially, we constructed two sets of reference curves: 1) using data from 1.5T MRI scans and 2) using data from 3T MRI scans. For each curve boundary, we calculated the metrics to compare the 1.5T curves against the 3T curves. Then, we performed a bootstrapping procedure52 involving 10,000 iterations. In each iteration, we shuffled the labels of the data, effectively randomizing the assignment of MRI scan data to the “1.5T” and “3T” groups. Using these shuffled labels, we reconstructed the reference curves for both groups and recalculated the metrics (sMSPE, MASE, sMAPE, wMAPE and sMdAPE) between the new curves. This allowed us to assess the variability of the metrics under random label assignments. The 5th and 95th percentiles were computed to characterize the range of variability.

To assess magnetic field bias, we compared the true-label metric values to the bootstrapped distributions. A significant bias was considered present if the true-label metric values fell outside the 5th-95th percentile range of the bootstrapped distributions. This would indicate that the observed differences between 1.5T and 3T data are not due to random variability but are likely influenced by differences related to magnetic field strength.
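A minimal sketch of this procedure is given below; fit_curve and metric are hypothetical helpers standing in for the GAM fitting and the error computations described above.

```python
import numpy as np

def field_strength_bias_test(ages, volumes, labels, fit_curve, metric,
                             n_iter=10_000, seed=0):
    # fit_curve(ages, volumes) -> curve sampled on a fixed age grid (hypothetical)
    # metric(curve_a, curve_b) -> scalar error, e.g. sMAPE (hypothetical)
    rng = np.random.default_rng(seed)

    def compare(lbl):
        c15 = fit_curve(ages[lbl == "1.5T"], volumes[lbl == "1.5T"])
        c30 = fit_curve(ages[lbl == "3T"], volumes[lbl == "3T"])
        return metric(c15, c30)

    observed = compare(labels)  # error metric under the true field strength labels
    null = np.array([compare(rng.permutation(labels)) for _ in range(n_iter)])
    p5, p95 = np.percentile(null, [5, 95])  # variability expected by chance
    biased = not (p5 <= observed <= p95)    # flag a field strength-related bias
    return observed, (p5, p95), biased
```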

Atrophy assessment

While reproducibility is crucial for a segmentation algorithm, it is of little use if the algorithm cannot detect pathological variations. We therefore examined how segmentation algorithms affect atrophy assessment by studying the sensitivity and stability of results across different magnetic field strengths.

Sensitivity analysis

To assess sensitivity, two complementary approaches were used. First, hippocampal atrophy percentages were computed for both AD patients and cognitively normal (CN) subjects. Second, we implemented the HAVAs method from Coupe et al., (2022)32. In this method, hippocampal, amygdalar, and inferior lateral ventricle volumes were first normalized by intracranial volume53, then converted into z-scores using the mean and standard deviation from a reference set of 3730 CN subjects covering the full lifespan. This double normalization accounts for inter-individual and inter-structural variability32. The HAVAs score was then computed as the sum of the hippocampal and amygdalar z-scores, from which the z-score of the inferior lateral ventricle is subtracted, based on the observation that AD is associated with atrophy in the hippocampus and amygdala and enlargement of the ventricle. Left and right HAVAs scores were also z-normalized using the same reference population.
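The score computation can be summarized by the sketch below; variable and helper names are illustrative, not taken from the original HAVAs implementation.

```python
import numpy as np

def havas(hip, amy, ilv, icv, ref_mean, ref_std):
    # First normalization: divide each structure volume by intracranial volume,
    # then z-score against the 3,730-subject CN reference statistics.
    z = {name: (vol / icv - ref_mean[name]) / ref_std[name]
         for name, vol in {"hip": hip, "amy": amy, "ilv": ilv}.items()}
    # AD combines hippocampal/amygdalar atrophy with ventricular enlargement,
    # hence the subtraction of the inferior lateral ventricle z-score.
    score = z["hip"] + z["amy"] - z["ilv"]
    return score  # then z-normalized once more against the same reference
```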

Performance metrics

To evaluate the effectiveness of each segmentation method in distinguishing between AD and CN subjects, we computed standard classification metrics. These include sensitivity (or true positive rate), specificity (or true negative rate), balanced accuracy (the average of sensitivity and specificity), and the area under the ROC curve (AUC). A higher AUC indicates better discrimination between the two groups (AD vs. CN). Values closer to 1 reflect excellent performance, while values near 0.5 indicate performance close to chance. AUC values above 0.80 are typically considered clinically meaningful54. These metrics provide insight into the algorithm’s ability to correctly classify both groups.
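As an illustration, these metrics can be computed with scikit-learn as follows; the labels and scores below are synthetic placeholders for the atrophy calls and deviation scores produced by each algorithm.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)               # 1 = AD, 0 = CN (synthetic)
scores = y_true + rng.normal(0, 0.7, 200)      # higher score = more AD-like
y_pred = (scores > 0.5).astype(int)            # binary atrophy call

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                   # true positive rate
specificity = tn / (tn + fp)                   # true negative rate
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of the two rates
auc = roc_auc_score(y_true, scores)            # threshold-free discrimination
```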

Results

This study presents results with volumes normalized by the intracranial volume (ICV).

Curve stability assessment

The first row of Fig. 3 presents reference curves for the left hippocampus, with curves for other regions available on https://gitlab.com/geodaisics1/Estimation_of_reference_curves_for_brain_atrophy_and_analysis_of_robustness_to_machine_effects. The differences in FastSurfer’s reference curves compared to AssemblyNet and FreeSurfer suggest that using FastSurfer may yield varying results, potentially providing different information and influencing final diagnoses.

Fig. 3
figure 3

Reference curves for the left hippocampus (using concave smoothing), computed using data from different magnetic field strengths: 1.5T + 3T combined (first row: points colored by study; second row: points colored by field strength), only 1.5T (third row), and only 3T (fourth row). Each column corresponds to one of the three segmentation algorithms evaluated in the study. Each dot represents one subject. The 5th (red), 50th (green), and 95th (red) expectiles were computed to characterize the range of variability. AssemblyNet appears less sensitive to magnetic field strength variations compared to FastSurfer and FreeSurfer.

Evaluation of the consistency of reference curves across magnetic field strengths

Preliminary results We calculated the median percentage volume difference between 1.5T and 3T data for the 3730 subjects segmented by AssemblyNet, FastSurfer, and FreeSurfer, both with and without normalization by ICV (Supplementary Section S3, Fig. S2). For non-normalized volumes, the median percentage differences were 2.53% for AssemblyNet, 5.60% for FastSurfer, and 6.15% for FreeSurfer. The median percentage differences were 2.54% for AssemblyNet, 10.38% for FastSurfer, and 9.05% for FreeSurfer when normalized by the ICV.

Visual analysis Observing the points in the first row of Fig. 3, we noticed a greater dispersion of data for FastSurfer compared to the other two algorithms. To analyze the organization of these points, especially for FastSurfer, and to determine whether their positions are influenced by an acquisition parameter, we repeated the first row of Fig. 3 with the points color-coded according to the magnetic field strength of the MRI scanner on which the images were acquired; this is shown in the second row of Fig. 3. We then built additional reference curves using only the images acquired at 1.5T (third row of Fig. 3) and only the images acquired at 3T (fourth row), for all three segmentation algorithms in the study. Visually, a slight study effect can be observed for all three algorithms, indicating minor variations in segmentation results across studies. However, this effect is considerably weaker than the influence of magnetic field strength, which remains the primary source of variability in our figures. Moreover, assessing the impact of study-related differences is not the focus of this article, especially since the number of subjects from different studies within each age range is insufficient to robustly evaluate this effect. Additional sources of variability such as sex and scanner effects are also considered: please refer to Supplementary Section S4, Fig. S3 and the Discussion for the sex effect, and Supplementary Section S5, Figs. S4 and S5 and the Discussion for the scanner effect. Please refer to https://gitlab.com/geodaisics1/Estimation_of_reference_curves_for_brain_atrophy_and_analysis_of_robustness_to_machine_effects to view these curves for other regions. The visual differences in reference curves as a function of training data indicate that FastSurfer is not robust to magnetic field strength.

Measure of errors between curves

In this section, we assess the performance of segmentation algorithms by directly comparing reference curves calculated with data from 3T MRI scans to curves predicted using data from 1.5T MRI scans (Fig. 4). This approach enables us to evaluate how each algorithm performs when applied across datasets with differing MRI field strengths. The robustness of the segmentation algorithms is assessed based on their ability to maintain reference values despite changes in magnetic field strength.

Fig. 4
figure 4

Values of different error metrics: (a) sMSPE (symmetric Mean Squared Percentage Error), (b) sMAPE (Symmetric Mean Absolute Percentage Error), (c) wMAPE (Weighted Mean Absolute Percentage Error), and (d) sMdAPE (Symmetric Median Absolute Percentage Error); for the three algorithms (AssemblyNet, FastSurfer, and FreeSurfer). The errors were calculated between the reference curves constructed with data from 1.5T MRI scans and the reference curves predicted using data from 3T MRI scans. AssemblyNet consistently shows lower error metric values compared to FastSurfer and FreeSurfer for all bounds and metrics.

AssemblyNet achieves the lowest sMSPE for all bounds. Since sMSPE emphasizes larger errors, the smaller values for AssemblyNet indicate that its predictions are consistently closer to the reference curves, with fewer large errors. AssemblyNet also consistently outperforms FastSurfer and FreeSurfer on sMAPE across all bounds; the lower sMAPE indicates that AssemblyNet is more robust to magnetic field strength. The lower wMAPE for AssemblyNet indicates that it maintains high accuracy across different volume sizes, making it more reliable when the data cover a wide range of volumes. AssemblyNet also achieves the lowest sMdAPE for all bounds. The results for the MASE metric are available in Supplementary Section S6, Fig. S6, and are discussed in the Discussion section. In conclusion, AssemblyNet consistently outperforms FastSurfer and FreeSurfer in terms of robustness. Its superior performance across these metrics, especially sMAPE, sMSPE, and wMAPE, indicates that it is more capable of providing reliable volume predictions, even when faced with the variability introduced by different MRI field strengths.

Method for identifying biases related to magnetic field strength: bootstrapping

The evaluation of bias related to the magnetic field strength, comparing 1.5T and 3T MRI reference curves using bootstrapping, reveals distinct patterns across the three algorithms (Fig. 5 illustrates sMAPE values; for other metrics and views, please refer to Supplementary Section S7, Fig. S7 (sMdAPE), Fig. S8 (sMSPE) and Fig. S9 (wMAPE)).

Fig. 5
figure 5

Histograms showing the distribution of sMAPE (symmetric Mean Absolute Percentage Error) values across 10,000 bootstrap iterations with randomized field strength labels. Columns correspond to segmentation algorithms (AssemblyNet, FastSurfer and FreeSurfer), while rows represent boundary conditions (lower, mean and upper bounds). The red line indicates the observed/true label value (“Obs”), i.e., the error metric calculated between the 1.5T and 3T reference curves using the true field strength labels, and the black lines mark the 5th and 95th percentiles. The bootstrap distribution estimates the range of metric variability expected by chance. If the observed value (red line) falls outside the 5th-95th percentile of the bootstrap distribution, it indicates a statistically significant bias likely driven by magnetic field strength rather than random variability. AssemblyNet exhibits fewer true-label values outside the 5th-95th percentiles compared to FastSurfer and FreeSurfer, indicating it is less biased by differences related to magnetic field strength.

For AssemblyNet, the true-label values of sMAPE, sMdAPE and wMAPE fall within the bootstrapped bounds for the lower bound but are slightly outside the 95th percentile for the mean and upper bounds, indicating a slight magnetic field-related bias. Conversely, sMSPE values for AssemblyNet consistently fall within the bootstrapped bounds for all boundary conditions. In contrast, FastSurfer shows significant deviations for all metrics across configurations, indicating a pronounced bias likely driven by magnetic field effects; the substantial distances between true-label values and the bootstrapped distributions highlight that FastSurfer metrics are heavily influenced by magnetic field strength. FreeSurfer displays similar trends, with marked deviations across metrics, again reflecting strong magnetic field-related effects. In summary, while AssemblyNet exhibits minor deviations, FastSurfer and FreeSurfer demonstrate significant biases across all metrics. These findings suggest that FastSurfer and FreeSurfer are more sensitive to magnetic field differences, while AssemblyNet provides relatively stable results.

Atrophy assessment

Hippocampal atrophy sensitivity analysis

Table 3 shows the percentage of subjects with left/right hippocampal atrophy. For the control group, the percentages of subjects showing hippocampal atrophy according to AssemblyNet are notably consistent across all reference curves. In contrast, FastSurfer exhibits variability. FreeSurfer also shows some variability, though it is more stable than FastSurfer. For the pathological group, the percentage of Alzheimer’s patients with hippocampal atrophy is very stable with AssemblyNet, while FastSurfer shows more variability than FreeSurfer. There is a significant difference (McNemar test, p < 0.05; R version 4.5.0) in the atrophy percentage between AssemblyNet and FreeSurfer, and between FastSurfer and FreeSurfer, across all models.
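For reference, the paired comparison can be sketched in Python with statsmodels (the study performed the test in R 4.5.0); the atrophy calls below are illustrative.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Paired atrophy calls (1 = atrophied) for the same subjects from two algorithms.
calls_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
calls_b = np.array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0])

# 2x2 contingency table of agreements/disagreements between the two algorithms.
table = [[np.sum((calls_a == 1) & (calls_b == 1)), np.sum((calls_a == 1) & (calls_b == 0))],
         [np.sum((calls_a == 0) & (calls_b == 1)), np.sum((calls_a == 0) & (calls_b == 0))]]
print(mcnemar(table, exact=True).pvalue)
```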

Table 3 Percentage of control and AD subjects showing left/right hippocampal atrophy (for each cell: value1: left/value2: right), according to algorithms (columns) and reference curves used (rows) (concave smoothing). Controls: controls subjects; AD: Alzheimer’s patients. For both groups, AssemblyNet shows a stable percentage of subjects with hippocampal atrophy, while FastSurfer exhibits more variability than FreeSurfer.
Table 4 Classification scores related to the left/right hippocampus (for each cell: value1: left/value2: right), with rows corresponding to models and columns to segmentation algorithms.

Table 4 provides the classification performance scores for the detection of left hippocampal atrophy across segmentation algorithms and reference curves. Supplementary Section S8, Fig. S10 shows ROC curves for the left hippocampus. Although AssemblyNet shows slightly lower classification metrics than FastSurfer and FreeSurfer, it consistently demonstrates greater stability across all reference models. In contrast, FastSurfer and FreeSurfer exhibit more variability, with performance scores fluctuating significantly depending on magnetic field strength. For example, regarding specificity, AssemblyNet ranges between 0.87 and 0.91 (considering both left and right hippocampus and all models), while FastSurfer ranges from 0.76 to 0.93, and FreeSurfer from 0.86 to 0.93. This supports the earlier observation that AssemblyNet is more robust to acquisition differences compared to the other segmentation tools.

HAVAs results

Supplementary Section S9, Fig. S11 presents the reference curves for the hippocampus, amygdala, and inferior lateral ventricle, across segmentation algorithms. A sensitivity to magnetic field strength is observed for FastSurfer and FreeSurfer, particularly for the hippocampus and amygdala, as indicated by the distribution of colored points according to field strength. However, as shown in Supplementary Section S9, Fig. S12, the HAVAs score appears to reduce this sensitivity to magnetic field strength, likely due to the two-step z-score normalization applied during its computation.

Table 5 presents the percentages of HAVAs scores considered pathological (i.e., scores falling below the 5th percentile of the reference curve) across CN subjects and AD patients, for each model and segmentation algorithm. In CN subjects, AssemblyNet yields overall lower atrophy percentages compared to FastSurfer and FreeSurfer (with a maximum value of 21.96% for AssemblyNet, versus 40.39% for FastSurfer and 33.73% for FreeSurfer). Moreover, AssemblyNet shows a narrower range of atrophy percentages (5.10% to 14.12%), while FastSurfer and FreeSurfer exhibit wider ranges (16.08% to 40.39% and 14.51% to 33.73%, respectively). In AD patients, the percentage of pathological scores is higher, as expected. AssemblyNet again shows slightly lower values (ranging from 82.19% to 94.98%) compared to FastSurfer (87.21% to 97.71%) and FreeSurfer (84.02% to 96.35%).

Table 5 Percentages of abnormally low (i.e., scores falling below the 5th percentile of the reference curve) left/right HAVAs scores (for each cell: value1: left/value2: right) in cognitively normal subjects (CN) and Alzheimer’s disease (AD) patients, for each model considered.

Table 6 presents the classification scores between AD patients and CN subjects obtained using the HAVAs method, across the different models and algorithms. Supplementary Section S9, Fig. S13 shows the ROC curves for the left HAVAs. When comparing the models derived from 1.5T and 3T data (HAVAs left), FastSurfer and FreeSurfer stand out with higher sensitivity (0.95 and 0.93 respectively), compared to 0.89 for AssemblyNet. However, AssemblyNet achieves both the highest specificity (0.91 vs. 0.80 for FastSurfer and 0.83 for FreeSurfer) and the highest balanced accuracy (0.90 vs. 0.88 for FastSurfer and FreeSurfer). Expanding the analysis to all three models (1.5T-only, 3T-only, and combined) (HAVAs left), we observe that the specificity of FastSurfer and FreeSurfer drops significantly compared to AssemblyNet, with minimum values of 0.60 and 0.66, respectively, versus 0.86 for AssemblyNet. These results suggest that FastSurfer, due to its higher sensitivity, may be more effective at detecting Alzheimer’s cases. However, AssemblyNet, with its superior specificity and balanced accuracy, provides a better trade-off between correctly identifying both AD patients and healthy controls.

Table 6 Classification scores obtained with the left/right HAVAs scores (for each cell: value1: left/value2: right), organized by model (rows) and segmentation algorithm (columns).

Atrophy stability analysis

Longitudinal analysis

Supplementary Section S10, Figs. S14, S15 and S16 show left hippocampal volumes of control and AD subjects from the MIRIAD dataset, acquired during follow-ups, overlaid with reference curves from the three algorithms (using 1.5T and 3T data). This comparison enabled us to assess the consistency and reliability of each algorithm in detecting atrophy over time. Most healthy subjects showed no signs of atrophy across all algorithms, indicating that the algorithms generally agree when there is no pathological change. However, discrepancies were observed in some cases. For instance, subject 230 was consistently atrophied, with marked variations specifically in FreeSurfer’s results. Subject 231 showed no atrophy with AssemblyNet, consistent atrophy with FastSurfer, and fluctuating results with FreeSurfer. Among the AD subjects, the majority were consistently identified as atrophied across all algorithms, which aligns with the expected progression of AD. However, there were notable exceptions, such as subject 191, where FreeSurfer and FastSurfer detected atrophy consistently, but AssemblyNet did not until the penultimate session. Overall, these findings suggest that while all three algorithms can reliably detect atrophy, there are differences in sensitivity and stability, particularly in borderline cases.

Inter-sites analysis

In this study, we used MRI data from the SRPBS traveling subjects. Brain volumes extracted from scans acquired at different sites were projected onto the reference curves for evaluation. Details of this analysis are provided in the Supplementary Section S11, Fig. S17. Supplementary Section S11, Table S2 summarizes the atrophy assessments obtained for SRPBS subjects using the three segmentation algorithms. Overall, FastSurfer exhibits greater variability across sites compared to AssemblyNet and FreeSurfer, suggesting reduced robustness in multi-site settings.

Discussion

The objectives of this study were to evaluate the robustness and sensitivity of three MRI segmentation algorithms in constructing reference curves for brain atrophy assessment in neurodegenerative diseases, with a focus on hippocampal volumes. The study aimed to determine which algorithm produces the most stable reference curves under varying magnetic field conditions and which algorithm provides the most consistent detection of hippocampal atrophy, a key marker in Alzheimer’s disease. Evaluating the impact of machine effects and sensitivity is essential, as variability in MRI acquisition parameters and algorithm performance can compromise the reliability of reference curves and the conclusions of studies based on them.

Our results indicate that AssemblyNet is the most robust algorithm, consistently generating stable reference curves with minimal impact from variations in magnetic field strength. This robustness is particularly valuable in industrial settings, where data often come from heterogeneous sources, including varying MRI field strengths and different scanner models. In such contexts, covariates like magnetic field strength may not be correctable. Therefore, segmentation methods that are intrinsically robust to these variations are highly advantageous for large-scale deployment. In contrast, FastSurfer displays significant instability, with notable differences in reference curves depending on the MRI field strength. This lack of stability suggests that FastSurfer may not be the optimal choice when consistency across different imaging environments is required. AssemblyNet also provides more consistent estimates of hippocampal atrophy and HAVAs scores compared to the other algorithms, though the atrophy percentages were generally lower.

Reference modeling: addressing confounding factors including imaging variability

Accounting for confounders in normative modeling

Confounding variables such as age, sex, scanner effects and MRI field strength can influence brain volumetric measurements and impact the construction of normative models. Addressing these confounders is essential for building accurate and generalizable lifespan trajectories. A review5 found that, among 13 normative modeling studies, 7 included only age as a covariate and 4 included both age and sex. Only one study55 accounted for field strength, but without justifying this choice.

In our approach, age is directly modeled through GAMs, while sex-related differences are addressed via ICV normalization56,57,58,59. This approach is widely used to compensate for head size differences, a major source of inter-individual variability in brain volume measurements. Without ICV correction, differences in regional volumes may reflect cranial size rather than meaningful biological effects. For instance, part of the apparent sex difference in brain volumes is explained by ICV disparities and applying normalization substantially reduces or eliminates these differences59,60. ICV normalization is particularly important for assessing regional volume changes relative to maximum brain size, which is critical in studies of atrophy61. It also improves statistical sensitivity, enabling group comparisons with smaller sample sizes, especially in hippocampal studies62. Furthermore, ICV normalization contributes to reducing scanner-related variability, thereby enhancing comparability across MRI platforms9. Previous work using AssemblyNet has shown that, without normalization, males consistently exhibit larger brain volumes than females across most structures20. However, after ICV normalization, these differences largely disappear, suggesting that explicit inclusion of sex as a covariate may not be necessary. Our findings support this interpretation: without normalization, small sex effects were observed across all algorithms, with males generally showing higher hippocampal volumes. After normalization, these effects became very small for AssemblyNet, while FastSurfer and FreeSurfer exhibited persistent differences (please refer to Supplementary Section S4, Fig. S3).

In summary, our modeling framework accounts for age via GAMs and sex through ICV normalization. Field strength was not included as a formal covariate, but its influence was assessed post hoc. Scanner-related effects are discussed in more detail in Section 4.1.

Magnetic field strength and scanner model effects: considerations and limitations

Machine effects are complex and multifactorial, as reported in numerous studies63,9,64,65. The variability observed in data can indeed arise from multiple sources, including not only magnetic field strength66,67,63, but also vendor64, scanner model9, software version68, sequence parameters8,65, and gradient non-linearities69. Although some studies report minimal effects of magnetic field strength on volumetric estimates, especially for global measures66,63, others have shown regional differences, particularly in areas with low contrast or complex anatomy70. Moreover, Marchewka et al., (2014)70 did not observe significant differences in hippocampal or amygdalar volumes between field strengths in epilepsy cohorts, especially when using high-quality standardized protocols such as those from ADNI. However, Marchewka et al., (2014)70 also reported magnetic field strength-related differences in gray matter volume in the cerebellum, precentral cortex, and thalamus, regions known to be sensitive to scanner parameters and segmentation challenges. These regional effects are consistent with several previous studies71,72, and segmentation accuracy has been shown to improve at 3T in difficult regions due to better contrast-to-noise ratio70. In parallel, scanner model effects are also documented. For example, Yang et al., (2016)64 found significant variability across models in over 12% of brain regions using FreeSurfer.

In this work, we chose to focus specifically on the magnetic field strength (1.5T vs. 3T) as a proxy of machine-related variability for several reasons. First, as shown in the Supplementary Section S5, Fig. S5, within a given scanner model (Intera), volumetric differences between 1.5T and 3T scans were greater than those observed between different scanner models operating at the same field strength. This suggests that, in our dataset, magnetic field strength exerts a more dominant effect than vendor or model. Second, scanner models were not evenly distributed across age groups, limiting our ability to disentangle model and vendor effects independently of age (please refer to Supplementary Section S5, Fig. S4). Such an analysis would require well-balanced subgroups across all age bins and scanner types, which is not achievable with the current data. Given these considerations, magnetic field strength was selected as the most interpretable and impactful axis of variability for this robustness study. We acknowledge that more granular investigations of vendor, model, and sequence-related effects would be valuable.

The “gold standard” issue

Manual segmentation is traditionally used as the reference, often termed the “gold standard”, for evaluating automated brain segmentation methods. However, this designation is increasingly questioned, as manual labeling suffers from limited reproducibility, with intra-rater agreement frequently falling below 80%. For example, using the BrainCOLOR protocol, intra-rater Dice scores (i.e., between repeated manual segmentations) were estimated at 76.8%44. This level of agreement is comparable to the consistency observed between manual and automated segmentations: 75.8% for AssemblyNet using the same protocol, and approximately 80% for FastSurfer (80.19% subcortical, 80.65% cortical) on the Mindboggle-101 dataset47. Moreover, Coupe et al., (2020)44 reported intra-method Dice scores of 92.8% for AssemblyNet (i.e., between scan and rescan automatic segmentations), again surpassing the manual intra-rater reproducibility. These results suggest that automated methods can be as consistent as human experts. These findings challenge the conventional use of manual segmentation as a gold standard.

Although manual segmentations can serve as a reference for partial evaluation, this approach has important limitations. Both human raters and algorithms may exhibit similar biases, particularly in regions with ill-defined anatomical boundaries. This issue also applies to many automated methods: recent deep learning models such as FastSurfer, SynthSeg, and LOD-Brain are often trained on labels generated by FreeSurfer, thereby inheriting its structural biases, for example with hippocampal segmentation, as shown by Valabregue et al., (2024)73. These factors make it inherently difficult to determine how close a segmentation truly approximates the underlying anatomical ground truth.

Median percentage volume differences between 1.5T and 3T data

To compare our volume differences between 1.5T and 3T data, we refer to Lee et al., (2024)74, who assessed the reliability of FreeSurfer at 1.5T and 3T. They reported the following median volumes for the left hippocampus: 3.63 cm³ at 1.5T and 3.70 cm³ at 3T, corresponding to a 1.89% volume variation between the two field strengths (non-normalized analysis). For FreeSurfer, the higher variation observed in our study (6.15%) could be attributed to the larger and more diverse dataset of 3730 subjects (from 11 different datasets), as opposed to Lee et al., (2024)74’s dataset of 101 patients (from 3 datasets). The larger sample size and the inclusion of multiple datasets likely contributed to the increased variability in segmentation results, particularly for FreeSurfer and FastSurfer.

Comparing modeling strategies for reference curves construction

Mean Absolute Scaled Error (MASE) evaluation

The Mean Absolute Scaled Error (MASE) metric meets all inclusion criteria for our study. However, we observed that its value varied substantially with the number of points selected (please refer to Supplementary Section S6, Fig. S6). We hypothesize that this variation arises because our data follow very regular trends: the naive in-sample error used as the scaling denominator is then very small, so even small prediction errors can lead to elevated MASE values. This contrasts with the findings of Liu et al., (2024)24, who reported MASE values between 0 and 0.15 when comparing reference curves with different datasets. In our case, we observe much higher MASE values, even when selecting only one point out of eight from our reference curves.

Comparison of GAM-based curves with existing literature

We compared our GAM-based reference curves to those published in Coupe et al., (2017)20, focusing on hippocampal atrophy percentages in both AD and CN subjects (please refer to Supplementary Section S12, Fig. S18 and Fig. S19). The results (Supplementary Section S12, Table S3) show that AssemblyNet, when used with our GAM-based models, yields higher atrophy detection rates in AD patients (77.17% left/67.58% right) compared to the reference curves from Coupe et al., (2017)20 (63.01%/59.36%). However, for CN subjects, our GAM curves also detect a slightly higher percentage of atrophy (11.76%/8.24%) than those of Coupe et al., (2017)20 (6.27%/4.31%). This trend is consistent across FastSurfer and FreeSurfer, where GAM-based curves yield both higher detection in AD and slightly elevated atrophy rates in controls. These findings suggest that while the GAM framework may offer improved sensitivity for detecting pathological deviations, it may also slightly increase atrophy detection in CN individuals. This trade-off is further discussed in Section 4.3.

Limitations of reference curves and potential of manifold learning for complex analysis

Reference curves are valuable for region-by-region analysis but face limitations in capturing the complexity of high-dimensional MRI data, in which brain regions are often correlated because the brain functions as an integrated system. This challenge is further compounded by the lack of tools to effectively evaluate the robustness of these curves, particularly in the context of temporal data. In our case, the dataset includes brain volumes from approximately a hundred regions, adding to the analytical complexity. While reference curves are useful for localized assessments, they may fall short in capturing the global dynamics and intricate interdependencies among brain regions. Manifold learning algorithms offer a promising complementary approach by accounting for the interdependence between variables and revealing complex, non-linear relationships within high-dimensional datasets. These methods provide a more integrated and robust framework, enabling a deeper understanding of the global structure and intricate interactions within brain data75.

Atrophy detection across acquisition conditions

Early atrophy detection considerations

Our normative lifespan curves suggest that hippocampal volume decline begins at different ages depending on the segmentation algorithm: around 35 years for FastSurfer, 50 years for FreeSurfer, and approximately 55 years for AssemblyNet. However, prior studies indicate that hippocampal atrophy may begin earlier than these estimates suggest76,77. It is important to note that volume trajectories are influenced not only by segmentation accuracy but also by the modeling strategy used to estimate them. In our case, the use of GAMs likely contributes to delaying visible inflection points: earlier work using AssemblyNet on larger datasets has shown divergence between AD and normative hippocampal trajectories before the age of 4078, suggesting that AssemblyNet is indeed capable of detecting early pathological changes. Therefore, the later onset of decline observed in our curves likely reflects modeling limitations rather than segmentation constraints.

More fundamentally, it is important to recognize that hippocampal atrophy visible on structural MRI occurs relatively late in the pathological cascade of AD. Alzheimer’s disease is characterized by the abnormal accumulation of proteins, including amyloid plaques and neurofibrillary tangles composed of tau. These anomalies disrupt neuronal function and trigger a cascade of toxic mechanisms. Neurons are progressively damaged, lose their connections, and ultimately die79,14,80,81. The ATN framework82, recently revisited in 202483, offers a biological staging system based on three biomarker categories: A for amyloid accumulation, T for tau pathology, and N for neurodegeneration. CSF biomarkers, measurable through lumbar puncture, can detect amyloid and tau abnormalities up to 15 years before clinical symptom onset84,85. In contrast, neurodegeneration (N), as measured by structural MRI, reflects macroscopic neuronal loss and thus appears later in the disease process. In this context, although MRI remains a valuable tool for assessing neurodegeneration, it inherently captures late-stage processes. As such, efforts to improve early detection of AD may need to focus more on upstream biomarkers (A and T), while MRI-based approaches like ours are better suited for staging and differential diagnosis once neurodegeneration is established. Recent advances in blood-based diagnostics support this shift: improvements in mass spectrometry and antibody-based detection methods have enabled precise quantification of amyloid-\(\beta\) and various forms of tau in plasma, paving the way for faster, cheaper, and more accessible detection strategies86.

Reasons for robustness to magnetic field strength

MRI field strength (1.5T vs. 3T) impacts tissue contrast and signal-to-noise ratio, which can significantly influence segmentation accuracy, especially for methods relying heavily on voxel intensity. Previous studies have shown that higher field strength generally improves segmentation accuracy, particularly in challenging anatomical regions due to better contrast-to-noise ratios70. Our results consistently show that AssemblyNet is more robust to magnetic field strength variations compared to FastSurfer and FreeSurfer. This robustness can be attributed to several key aspects of its segmentation architecture and design.

First, AssemblyNet relies on a multiscale 3D deep learning framework composed of 250 U-Net models distributed across two levels of resolution. The first level performs coarse segmentation on overlapping 3D regions (2×2×2 mm³), each independently processed by a dedicated U-Net. A majority voting scheme aggregates the overlapping outputs, effectively introducing spatial redundancy that enhances robustness to contrast variability. The overlap of at least 50% between adjacent regions reinforces this effect by ensuring that the same anatomical region is processed by multiple models. Then, in a second stage, the coarse segmentation is refined using a finer resolution (1×1×1 mm³), improving anatomical boundary delineation. This multi-resolution approach substantially increases the number of learnable parameters (by nearly a factor of ten), allowing more precise modeling of tissue boundaries. According to Coupe et al., (2019)87, this two-stage cascade was specifically designed to improve robustness. In addition, the model incorporates a novel nearest-neighbor transfer learning scheme: the weights from a trained U-Net are propagated to adjacent U-Nets processing overlapping regions. This design enables local anatomical knowledge to be shared across the network, which likely further contributes to stable performance across diverse imaging conditions.

In contrast, FastSurfer uses a 2D neural network that segments each slice independently. While this architecture is computationally efficient, it does not exploit the 3D context across slices. As a result, it is more susceptible to slice-wise intensity variations. This likely contributes to its greater sensitivity to field strength variability.

FreeSurfer, on the other hand, employs atlas-based and intensity-driven methods. Subcortical segmentation is based on voxel-wise probabilistic atlases and intensity priors88,89, while cortical surfaces are reconstructed through sulcal and gyral pattern extraction using mesh smoothing and spherical registration90,91. This reliance on local contrast and intensity features makes FreeSurfer vulnerable to variability in MRI contrast and signal quality (contrast-to-noise ratio). This limitation also affects SPM, which, like FreeSurfer, relies on strong anatomical priors (SPM uses voxel intensities and tissue probability maps)92.

In summary, the superior robustness of AssemblyNet likely stems from its 3D patch-based architecture, redundant spatial encoding, and multi-resolution refinement, which provide enhanced tolerance to contrast variability. In contrast, FastSurfer’s 2D architecture and FreeSurfer’s reliance on contrast-dependent features make them more susceptible to changes in magnetic field strength.

Sensitivity of algorithms to hippocampal atrophy

We assessed the sensitivity of AssemblyNet, FastSurfer, and FreeSurfer in detecting pathological changes, such as hippocampal atrophy. All three segmentation algorithms of our study have demonstrated their capability to detect pathological variations in brain imaging. FreeSurfer has been validated by Fellhauer et al., (2015)93, showing its effectiveness in identifying increased brain atrophy in conditions such as AD and mild cognitive impairment. Both FastSurfer and FreeSurfer can detect differences in cortical areas associated with disease progression94,47. AssemblyNet further extends this ability, as demonstrated by Coupe et al., (2022)32, where AssemblyNet, combined with classification-based approaches using lifespan models, showed very accurate detection of AD (AUC \(\ge\) 94%) compared to control subjects. Additionally, it was able to accurately discriminate between progressive MCI and stable MCI (AUC = 78%) during a 3-year follow-up.

Our analysis shows that AssemblyNet not only provides stable reference curves but also offers consistent estimates of hippocampal atrophy. In contrast, FastSurfer and FreeSurfer showed greater variability in atrophy detection. This variability could lead to important differences in the interpretation of atrophy, highlighting the need for careful consideration when selecting segmentation algorithms for atrophy assessment.

Although AssemblyNet tends to report lower atrophy percentages, these results are in line with De et al., (1997)95, though only for the left hippocampus. That study used the Scheltens scale, a visual rating method, to determine whether a hippocampus is atrophied, and included both mild AD patients and patients with moderate to severe AD. It found frequencies of hippocampal atrophy ranging from 78% in the mild AD group to 96% in the advanced AD group. In comparison, the analysis using AssemblyNet (Table 3), with reference curves built from both 1.5T and 3T data, revealed an atrophy rate of 77% (left) and 68% (right) (FastSurfer: 84%/81%; FreeSurfer: 82%/76%).

Besides, for all algorithms, there is a disparity in percentages between the left and right hippocampus. This aligns with multiple studies96,97,98,99 demonstrating that AD is associated with greater atrophy in the left hippocampus compared to the right.

In addition, De et al., (1997)95 found hippocampal atrophy in 15% of the normal elderly group aged 60-75 years. In our case, the analysis revealed the following percentages of hippocampal atrophy in the control group: AssemblyNet detected 12% atrophy in the left hippocampus and 8% in the right. FastSurfer identified 13% atrophy in both the left and right hippocampus, while FreeSurfer showed 9% atrophy on both sides (MIRIAD: 22 controls; 69 ± 16 years).

We found two other interesting studies on this subject, but neither can be used for comparative analysis as they both use FreeSurfer as their segmentation algorithm. First, in the study by Byun et al.100, 77.9% of the 163 AD subjects from ADNI are considered to have hippocampal atrophy. Then, in the study by Lowe et al.101, 76% of the 92 AD subjects from ADNI are considered to have hippocampal atrophy.

Sensitivity with HAVAs method

As demonstrated by Coupe et al., (2022)32, the HAVAs method offers superior classification performance compared to individual brain regions taken independently, reinforcing its relevance for distinguishing AD patients from CN subjects. When comparing our results to those reported in Coupe et al., (2022)32, we observe higher balanced accuracy scores. Specifically, Coupe et al., (2022)32 reported a balanced accuracy of 0.81, whereas our current study achieved 0.90 with AssemblyNet, and 0.88 with both FastSurfer and FreeSurfer using reference curves built from combined 1.5T and 3T data. Although the HAVAs method and the segmentation algorithm are the same, several factors explain this discrepancy: the test datasets differ (AIBL in their study vs. ADNI in ours), the reference models are not the same (GAMs vs. hybrid models), and the training datasets also vary.

Trade-off between sensitivity and robustness: bias-variance balance

Our results highlight a classic bias-variance trade-off across segmentation algorithms when applied to AD detection. FastSurfer and FreeSurfer demonstrate higher sensitivity, making them more effective at detecting hippocampal atrophy. This is evidenced in both the direct hippocampal atrophy analysis and the HAVAs-based classification results. However, this increased sensitivity comes at the cost of greater variability and reduced robustness, particularly with respect to acquisition differences. In CN subjects, FastSurfer and FreeSurfer showed broader and less consistent ranges of atrophy percentages (up to 40.39% for FastSurfer), whereas AssemblyNet maintained lower and more stable values (maximum of 21.96%). Across all reference models and field strengths, classification metrics fluctuated more for FastSurfer and FreeSurfer, dropping as low as 0.60 and 0.66 (specificity, HAVAs method), while AssemblyNet consistently maintained higher specificity (0.86-0.95) and balanced accuracy (0.90-0.91) (left HAVAs). These results suggest that AssemblyNet favors a lower-variance strategy, yielding slightly lower sensitivity but offering more robust and stable performance across different magnetic field strengths. While it may miss a few pathological cases, it is more reliable for consistently distinguishing AD patients from healthy individuals without overfitting to magnetic field strength-specific noise.

From an industrial standpoint, robustness is often more desirable than maximal sensitivity. In a real-world setting, MRI data come from a wide range of machines, models, and sequences. Correction methods such as ComBat102 are not always applicable due to insufficient sample size. These harmonization techniques typically require large and balanced datasets, often 20-30 subjects per scanner and per sequence102,103, which are rarely available in industrial settings7. Moreover, such harmonization algorithms can introduce additional variability104 and may even degrade data quality105. Notably, none of the harmonization algorithms evaluated in Gebre et al., (2023)106 improved intraclass correlation coefficients in longitudinal designs69. Thus, a robust algorithm that handles acquisition heterogeneity gracefully is preferable, especially in large-scale screening or multi-site contexts where harmonization is not feasible.

Conclusion

This study highlights the critical role of algorithm selection in constructing reference curves and assessing brain atrophy in neurodegenerative diseases. We observed that AssemblyNet produces very stable reference curves with respect to magnetic field variations, unlike FastSurfer, which is not robust to this parameter. Significant differences in trends are noted between reference curves constructed with data acquired at 1.5T + 3T, only at 1.5T, and only at 3T. When considering the different reference curves calculated, the percentages of hippocampal atrophy and the HAVAs score in AD patients are more stable with AssemblyNet, though they are lower compared to FastSurfer and FreeSurfer.

In conclusion, AssemblyNet stands out as the most robust and reliable choice, offering stability across varying MRI conditions and consistent atrophy detection. In contrast, FastSurfer and FreeSurfer require further refinement to reduce their sensitivity to magnetic field strength and improve their consistency in atrophy assessment. These findings underscore the importance of segmentation algorithm selection, as the choice of segmentation method can significantly impact the consistency and accuracy of atrophy assessments in neurodegenerative diseases like Alzheimer’s.