Abstract
Neurodegenerative diseases like Alzheimer’s are difficult to diagnose due to brain complexity and imaging variability. However, volumetric analysis tools, using reference curves, help detect abnormal brain atrophy and support diagnosis and monitoring. This study evaluates the robustness of three segmentation algorithms, AssemblyNet, FastSurfer and FreeSurfer, in constructing brain volume reference curves and detecting hippocampal atrophy. Using data from 3,730 cognitively normal subjects, we built reference curves and assessed robustness to magnetic field strength (1.5T vs. 3T) using four error metrics (sMAPE, sMSPE, wMAPE, sMdAPE) with bootstrap validation. We evaluated classification performance using hippocampal atrophy rates and HAVAs scores (Hippocampal-Amygdalo-Ventricular Atrophy scores). AssemblyNet shows the lowest errors across all robustness metrics. In contrast, FastSurfer and FreeSurfer exhibit greater deviations, indicating higher sensitivity to field strength variability. AssemblyNet provides consistent hippocampal atrophy rates across all reference models, despite slightly lower sensitivity, while FastSurfer and FreeSurfer display greater variability. Specificity ranges from 0.87 to 0.91 for AssemblyNet, compared to 0.76-0.93 for FastSurfer and 0.86-0.93 for FreeSurfer. Using the HAVAs score, all methods detect high atrophy rates in Alzheimer’s patients. FastSurfer achieves the highest sensitivity (0.98), while AssemblyNet reaches the best specificity (0.95) and the highest balanced accuracy (0.91). This study underscores the importance of algorithm choice for reliable brain volumetric analysis in heterogeneous imaging environments. Among the methods tested, AssemblyNet stands out as both sensitive to Alzheimer’s-related atrophy and robust to acquisition variability, making it a strong candidate when analyzing hippocampal volumes in large, multi-site datasets.
Introduction
Neurodegenerative diseases cause accelerated neuronal death, leading to varying levels of cerebral atrophy1. The spatial pattern of atrophy provides critical insights for differential diagnosis and disease staging2,3. To better quantify and interpret such patterns, normative modeling tools have emerged as valuable resources. These tools rely on reference curves that represent healthy age-related variation in brain structure, allowing for the identification of individual deviations from typical trajectories4,5,6. By comparing an individual’s brain volumes to these normative models, deviations indicative of abnormal atrophy can be detected. In the context of neurodegenerative disease, atrophy is typically defined when regional volumes fall below the lower bounds of these normative distributions, providing a standardized and interpretable framework for assessing structural brain changes7.
However, the increasing use of large, diverse datasets has introduced additional challenges. Modern research relies on databases of unprecedented size, incorporating data from a wide variety of sources (Coupe et al., (2023)4: 40,944 subjects pooled across 24 databases). This diversity increases the prevalence of "machine effects", whereby differences in MRI acquisition parameters8, scanner hardware9, and processing pipelines10 may lead to biases in volumetric measurements. Such variability can mask subtle disease-related changes, as differences in MRI acquisition parameters alone can alter volume measures by up to 4.8%11, comparable to early disease-related brain volume changes12.
Tools for volume-based diagnostics enable clinicians to monitor disease progression and evaluate severity, with atrophy serving as a reliable biomarker14,15. However, the impact of machine effects underscores the need for robust algorithms. Commercial tools for volume-based diagnostics have emerged to meet increasing demand16,17,18, yet most lack thorough validation: of 17 identified products, only 4 underwent clinical validation in dementia populations19. Moreover, normative datasets vary widely (100-8,000 subjects), raising concerns about reliability and generalizability19. Further validation is therefore essential before these tools can reach their full potential in neurodegenerative disease diagnosis and monitoring.
Several studies propose reference curves with divergent trajectories for cortical and subcortical structures, sometimes described as linear, U-shaped, or complex polynomial curves, highlighting a lack of consensus in the field20. This variability has led to inconsistent findings21 and, in some cases, may even alter the observed differences between control and pathological groups22. As explained by Coupe et al., (2017)20, several factors contribute to these inconsistencies: the use of data covering only restricted age ranges, which biases the construction of reference curves, limited scanner diversity within certain age groups19,23, non-harmonized acquisition protocols24, differences in curve modeling approaches and segmentation tools20,23 and the use of heterogeneous volumetric measures (e.g., absolute volumes, normalized volumes, or z-scores)25.
Selecting a model for normative curves is challenging, requiring a balance between flexibility and overfitting (where the model becomes too tailored to the training data and fails to generalize to new data)23,26. Options range from linear models20 to advanced methods like GAMLSS27. This study uses Generalized Additive Models (GAMs) to test segmentation reproducibility with a standard approach. GAMs extend generalized linear models, offering flexibility and controlling overfitting through constraints28,29.
Normative curve robustness is as crucial as model selection. Liu et al., (2024)24 show that identical models yield different curves across datasets. Sample size, age representation, and intracranial volume normalization further impact results20. Temporal dependencies and machine effects also challenge reproducibility, underscoring the need for robust methods.
Several methods exist for comparing reference curves. Summing pairwise distances is simple but prone to error cancellation. Advanced time series metrics (please refer to Supplementary Section S1) provide a more robust assessment by capturing trajectory alignment. These metrics, available in classical, symmetrized, and sometimes weighted forms, were chosen for their interpretability, consistency, scale invariance, and balanced handling of over- and under-predictions.
The first objective of this study is to investigate the robustness of AssemblyNet, FastSurfer and FreeSurfer in generating reference curves and evaluating brain atrophy using GAMs. The second objective is to develop a methodology for comparing reference curves estimated from volumetric data of 3,730 healthy subjects. These curves, generated by the three segmentation algorithms that extract brain volumes, are evaluated for robustness across magnetic field strengths. To our knowledge, no study has used GAMs to build reference curves for AssemblyNet, FastSurfer, and FreeSurfer, quantified the impact of the segmentation algorithm, or assessed robustness to machine effects. The third objective is to assess the segmentation algorithms' sensitivity and stability by measuring the proportion of Alzheimer's disease (AD) patients with hippocampal atrophy30,31 and HAVAs scores32 (Hippocampal-Amygdalo-Ventricular Atrophy scores), and by comparing the results with the literature. The Scheltens scale33 is commonly used but observer-dependent34. While visual assessment detects cerebrospinal fluid enlargement, it only indirectly reflects gray or white matter loss. Automated segmentation directly targets gray matter, overcoming this limitation. To our knowledge, the impact of machine effects on atrophy assessment using GAMs remains unexplored. For an overview of this study, please refer to Fig. 1.
Materials and methods
Data
Data used to construct reference curves
To build our reference curves, we segmented 3730 T1-weighted MRI scans of healthy subjects from 11 open-access datasets: ABIDE35 (n=469), ICBM36 (n=294), IXI (https://brain-development.org/ixi-dataset/) (n=549), ADNI37 (n=373), OASIS138 (n=298), PPMI39 (n=166), UCLA40 (n=125), DLBS (https://sites.utdallas.edu/dlbs/) (n=315), SALD41 (n=494), NIFD (http://memory.ucsf.edu/research/studies/nifd) (n=114), and SLIM (https://fcon_1000.projects.nitrc.org/indi/retro/southwestuni_qiu_index.html) (n=580).
Figure 2 illustrates the distribution by age and study. Table 1 shows a summary of the key characteristics of the 11 datasets.
Part of the data for this work were sourced from the International Consortium for Brain Mapping (ICBM) dataset (https://www.loni.usc.edu/). Further data utilized in this study were sourced from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset (adni.loni.usc.edu). Established in 2003 as a public-private partnership and led by Principal Investigator Michael W. Weiner, MD, ADNI's primary objective is to evaluate whether the combination of serial MRI, positron emission tomography (PET), various biological markers, and clinical and neuropsychological assessments can track the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). For the most recent updates, please visit www.adni-info.org. In addition, we used data from the Open Access Series of Imaging Studies (OASIS) OASIS1, which provides cross-sectional MRI data in young, middle-aged, nondemented, and demented older adults. Finally, NIFD is the abbreviation for the frontotemporal lobar degeneration neuroimaging initiative (FTLDNI).
Data for sensitivity and stability analysis
To evaluate the sensitivity of the segmentation algorithms, we segmented images of 219 AD subjects and 255 control subjects from ADNI (https://adni.loni.usc.edu/). All subjects are between 65 and 80 years of age.
For the stability analysis, we used the data of 46 patients with "mild-moderate Alzheimer's disease" and 22 age-matched healthy subjects from Malone et al., (2013)42 (MIRIAD). Each participant was scanned 8 times (2, 6, 14, 26, 38 and 52 weeks, and 18 and 24 months from baseline) on the same 1.5T scanner. Additionally, stability was assessed using the data of 9 healthy traveling subjects from Tanaka et al., (2021)43 (SRPBS), who were scanned at 12 different imaging centers within a 30-day period, yielding a total of 156 exams. Three MRI manufacturers and seven MRI scanner types were used (3T only). Table 2 presents the datasets used for the sensitivity and stability analysis of the segmentation algorithms.
Pipeline analysis
Segmentation algorithms
The data were segmented using AssemblyNet (version 1.0.0), a segmentation algorithm based on a large ensemble of Convolutional Neural Networks (CNNs)44. Segmentation is performed by 250 deep learning models organized in a multiscale framework.
We compared our results with FreeSurfer, one of the most widely used segmentation software packages in neuroimaging45. All data were processed using FreeSurfer version 7.3.1, segmenting each subject with the automated "recon-all" pipeline and default parameters. Because of the time required for segmentation (around 15 h per subject), the FreeSurfer segmentations were computed on the VIP platform, which provides substantial computing resources46.
FastSurfer47 (version 2.4.2) builds upon FreeSurfer and incorporates technologies similar to those used in AssemblyNet (deep learning), enabling segmentation in approximately ten minutes. FastSurfer utilizes three Fully Convolutional Neural Networks (F-CNNs), each responsible for segmenting 2D slices in the coronal, axial, and sagittal planes. The three segmentations are then aggregated.
GAM and constraints
Reference curves were estimated using Generalized Additive Models (GAMs)48 (pyGAM (ExpectileGAM) version 0.9.1). A GAM fits a curve to a set of points using a flexible combination of smooth functions, capturing non-linear relationships in the data. GAMs extend generalized linear models, resulting in a highly flexible model in which overfitting is easy to control.
To prevent overfitting and avoid “non-physical” behaviors, such as abrupt variations in the variable over short periods throughout life, we applied convex or concave smoothing techniques. Overfitting can result from excessive learning from data, leading to a strong influence from specific studies rather than generalizable patterns. By applying appropriate smoothing methods, we ensure a more physiologically plausible representation of changes over time. Supplementary Section S2, Fig. S1 illustrates smoothing effects.
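As an illustration of this step, the sketch below fits shape-constrained expectile curves with pyGAM, the library used in this study; the data arrays and parameter values are placeholders, not those of the original pipeline:

```python
# Minimal sketch: expectile-based reference curves with a concave
# shape constraint, on placeholder age/volume arrays.
import numpy as np
from pygam import ExpectileGAM, s

rng = np.random.default_rng(0)
age = rng.uniform(5, 90, 500)                                    # ages in years (placeholder)
volume = 0.005 - 1e-5 * age + 5e-4 * rng.standard_normal(500)    # ICV-normalized volumes (placeholder)

X = age.reshape(-1, 1)
grid = np.linspace(age.min(), age.max(), 200).reshape(-1, 1)

curves = {}
for q in (0.05, 0.50, 0.95):                                     # lower, mean, upper bounds
    # constraints='concave' suppresses non-physical short-term wiggles
    gam = ExpectileGAM(s(0, constraints='concave'), expectile=q).fit(X, volume)
    curves[q] = gam.predict(grid)
```

In practice, the choice between convex and concave constraints depends on the expected lifespan trajectory of the structure being modeled (see Supplementary Section S2).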
Curve stability assessment
Methods for comparing reference curves across magnetic field strengths
Results in the "Curve stability assessment" section include visual and quantitative analyses: the former offers a graphical comparison, while the latter quantifies visual differences using different methods. We assessed the robustness of segmentation algorithms to magnetic field strength variations by comparing reference curves derived from 1.5T data with those from 3T data. For this analysis, we compared the lower, mean, and upper bound curves using metrics designed to quantify differences between curve pairs.
Measure of errors between curves
We aimed to compare reference curves that depict volume evolution over the lifespan. To achieve this, we explored metrics from time series analysis and forecasting that quantify errors between these curves. Given the nature of our data, the selected metrics had to meet specific requirements: 1) Interpretability: the metrics should be easy to interpret, for example by being on the same scale as the data or expressed as a percentage. 2) Independence from error sign: positive and negative errors should not cancel each other out. 3) Scale-independence/invariance: the metric should remain consistent regardless of data scaling, allowing comparisons between different algorithms. 4) Equal treatment of over-predictions and under-predictions.
Based on the literature, we selected 18 error metrics for evaluating the reference curves. An overview of these metrics is provided in Supplementary Section S1, where we define each metric, and discuss its strengths and limitations. We summarized the characteristics of the 18 metrics in relation to the selection criteria in Supplementary Section S1, Table S1. This table supports our selection of 5 key metrics for the comparison of reference curves: the symmetric Mean Squared Percentage Error (sMSPE49, sktime version 0.34.0), the Mean Absolute Scaled Error (MASE49, sklearn 1.4.1), the symmetric Mean Absolute Percentage Error (sMAPE50), the weighted Mean Absolute Percentage Error (wMAPE51) and the symmetric Median Absolute Percentage Error (sMdAPE49, sktime version 0.34.0). Since these are error metrics, the ideal value for all of them is 0, indicating no difference between reference curves.
Each metric has specific characteristics and relevance to our analysis. sMSPE amplifies larger errors by squaring the percentage differences, providing insight into the presence of significant deviations. sMAPE quantifies the percentage difference between predicted and reference values, taking into account both over- and under-predictions; a lower sMAPE indicates higher agreement between the predicted and reference values. wMAPE weights absolute percentage errors by the magnitude of the observed (reference) values, making it particularly useful for datasets with a wide range of volumes, where larger volumes might otherwise dominate the error. sMdAPE focuses on the median of absolute percentage differences, making it less sensitive to outliers.
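For concreteness, the following NumPy sketch implements these four metrics following their common definitions; the study itself relied on the sktime/sklearn implementations cited above, which may differ in minor scaling details:

```python
# Hedged NumPy sketch of the four retained metrics (common definitions).
# Results are fractions; multiply by 100 for percentages.
import numpy as np

def smape(ref, pred):
    """Symmetric Mean Absolute Percentage Error."""
    return np.mean(2 * np.abs(ref - pred) / (np.abs(ref) + np.abs(pred)))

def smdape(ref, pred):
    """Symmetric Median Absolute Percentage Error: robust to outliers."""
    return np.median(2 * np.abs(ref - pred) / (np.abs(ref) + np.abs(pred)))

def smspe(ref, pred):
    """Symmetric Mean Squared Percentage Error: amplifies large deviations."""
    return np.mean((2 * (ref - pred) / (np.abs(ref) + np.abs(pred))) ** 2)

def wmape(ref, pred):
    """Weighted MAPE: absolute errors weighted by reference magnitudes."""
    return np.sum(np.abs(ref - pred)) / np.sum(np.abs(ref))
```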
Method for identifying biases related to magnetic field strength: bootstrapping
To evaluate the presence of bias related to the magnetic field strength, we compared the metric values (sMSPE, MASE, sMAPE, wMAPE and sMdAPE) obtained using the true labels with the distributions generated through bootstrapping52. We applied bootstrapping to evaluate the variability of error metrics under different configurations of shuffled labels. This approach helps to identify biases in the results.
Initially, we constructed two types of reference curves: 1) using data from 1.5T MRI scans and 2) using data from 3T MRI scans. For each curve boundary, we calculated the metrics to compare the 1.5T curves against the 3T curves. Then, we performed a bootstrapping procedure52 involving 10,000 iterations. In each iteration, we shuffled the labels of the data, effectively randomizing the assignment of MRI scan data to the "1.5T" and "3T" groups. Using these shuffled labels, we reconstructed the reference curves for both groups and recalculated the metrics (sMSPE, MASE, sMAPE, wMAPE and sMdAPE) between the new curves. This allowed us to assess the variability of the metrics under random label assignments. The 5th and 95th percentiles were computed to characterize the range of variability.
To assess magnetic field bias, we compared the true-label metric values to the bootstrapped distributions. A significant bias was considered present if the true-label metric values fell outside the 5th-95th percentile range of the bootstrapped distributions. This would indicate that the observed differences between 1.5T and 3T data are not due to random variability but are likely influenced by differences related to magnetic field strength.
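A condensed sketch of this procedure for one metric (sMAPE) and one curve bound is given below; the data arrays are placeholders and the iteration count is reduced from the 10,000 used in the study:

```python
# Label-shuffling bootstrap sketch for field-strength bias detection.
import numpy as np
from pygam import ExpectileGAM, s

rng = np.random.default_rng(0)
n = 600
age = rng.uniform(5, 90, n)                                   # placeholder ages
vol = 0.005 - 1e-5 * age + 5e-4 * rng.standard_normal(n)      # placeholder volumes
labels = rng.choice(['1.5T', '3T'], n)                        # placeholder field labels
grid = np.linspace(5, 90, 100).reshape(-1, 1)

def curve(mask, q=0.5):
    """Fit one expectile reference curve and evaluate it on the age grid."""
    gam = ExpectileGAM(s(0), expectile=q).fit(age[mask].reshape(-1, 1), vol[mask])
    return gam.predict(grid)

def smape(a, b):
    return np.mean(2 * np.abs(a - b) / (np.abs(a) + np.abs(b)))

obs = smape(curve(labels == '1.5T'), curve(labels == '3T'))   # true-label value

null = []
for _ in range(200):                                          # study: 10,000 iterations
    shuffled = rng.permutation(labels)                        # randomize 1.5T/3T labels
    null.append(smape(curve(shuffled == '1.5T'), curve(shuffled == '3T')))

lo, hi = np.percentile(null, [5, 95])
field_bias = not (lo <= obs <= hi)   # outside the 5th-95th range suggests field-strength bias
```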
Atrophy assessment
While reproducibility is crucial for a segmentation algorithm, it is ineffective if it cannot detect pathological variations. So, we examined how segmentation algorithms affect atrophy assessment by studying the sensitivity and stability of results across different magnetic field strengths.
Sensitivity analysis
To assess sensitivity, two complementary approaches were used. First, hippocampal atrophy percentages were computed for both AD patients and cognitively normal (CN) subjects. Second, we implemented the HAVAs method from Coupe et al., (2022)32. In this method, hippocampal, amygdalar, and inferior lateral ventricle volumes were first normalized by intracranial volume53, then converted into z-scores using the mean and standard deviation from a reference set of 3730 CN subjects covering the full lifespan. This double normalization accounts for inter-individual and inter-structural variability32. The HAVAs score was then computed as the sum of the hippocampal and amygdalar z-scores, from which the z-score of the inferior lateral ventricle is subtracted, based on the observation that AD is associated with atrophy in the hippocampus and amygdala and enlargement of the ventricle. Left and right HAVAs scores were also z-normalized using the same reference population.
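The computation can be summarized by the following sketch, where `ref_stats` (the means and standard deviations derived from the 3,730 CN reference subjects) and all variable names are illustrative assumptions:

```python
# Hedged sketch of the HAVAs computation described above.
import numpy as np

def havas_score(hip, amy, ilv, icv, ref_stats):
    """hip/amy/ilv: structure volumes; icv: intracranial volume;
    ref_stats: {'hip': (mean, std), 'amy': ..., 'ilv': ..., 'havas': ...}."""
    z = {}
    for name, vol in (('hip', hip), ('amy', amy), ('ilv', ilv)):
        mu, sd = ref_stats[name]
        z[name] = (vol / icv - mu) / sd        # ICV normalization, then z-scoring
    # atrophy of hippocampus/amygdala, enlargement of the inferior lateral ventricle
    raw = z['hip'] + z['amy'] - z['ilv']
    mu_h, sd_h = ref_stats['havas']            # final z-normalization on the same reference
    return (raw - mu_h) / sd_h
```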
Performance metrics
To evaluate the effectiveness of each segmentation method in distinguishing between AD and CN subjects, we computed standard classification metrics: sensitivity (true positive rate), specificity (true negative rate), balanced accuracy (the average of sensitivity and specificity), and the area under the ROC curve (AUC). A higher AUC indicates better discrimination between the two groups (AD vs. CN): values closer to 1 reflect excellent performance, while values near 0.5 indicate performance close to chance. AUC values above 0.80 are typically considered clinically meaningful54. These metrics provide insight into each algorithm's ability to correctly classify both groups.
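These metrics can be computed as in the sketch below, using hypothetical label and score arrays (1 = AD, 0 = CN):

```python
# Sketch of the classification metrics on placeholder arrays;
# y_pred holds binary atrophy calls, y_score continuous scores (e.g., HAVAs).
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([1, 1, 1, 0, 0, 0])
y_pred  = np.array([1, 1, 0, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.1, 0.6])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                     # true positive rate
specificity = tn / (tn + fp)                     # true negative rate
balanced_accuracy = (sensitivity + specificity) / 2
auc = roc_auc_score(y_true, y_score)             # 1 = perfect, 0.5 = chance
```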
Results
This study presents results with volumes normalized by the intracranial volume (ICV).
Curve stability assessment
The first row of Fig. 3 presents reference curves for the left hippocampus, with curves for other regions available at https://gitlab.com/geodaisics1/Estimation_of_reference_curves_for_brain_atrophy_and_analysis_of_robustness_to_machine_effects. The differences in FastSurfer's reference curves compared to AssemblyNet and FreeSurfer suggest that using FastSurfer may yield varying results, potentially providing different information and influencing final diagnoses.
Reference curves for the left hippocampus (using concave smoothing), computed using data from different magnetic field strengths: 1.5T + 3T combined (first row: points colored by study; second row: points colored by field strength), only 1.5T (third row), and only 3T (fourth row). Each column corresponds to one of the three segmentation algorithms evaluated in the study. Each dot represents one subject. The 5th (red), 50th (green), and 95th (red) expectiles were computed to characterize the range of variability. AssemblyNet appears less sensitive to magnetic field strength variations compared to FastSurfer and FreeSurfer.
Evaluation of the consistency of reference curves across magnetic field strengths
Preliminary results. We calculated the median percentage volume difference of the 3730 subjects segmented by AssemblyNet, FastSurfer, and FreeSurfer, both with and without normalization by ICV (Supplementary Section S3, Fig. S2). For non-normalized volumes, the median percentage differences were 2.53% for AssemblyNet, 5.60% for FastSurfer, and 6.15% for FreeSurfer. When normalized by ICV, the median percentage differences were 2.54% for AssemblyNet, 10.38% for FastSurfer, and 9.05% for FreeSurfer.
Visual analysis. Observing the points in the first row of Fig. 3, we noticed a greater dispersion of data for FastSurfer compared to the other two algorithms. To analyze the organization of these dots, especially for FastSurfer, and to determine whether their positions are influenced by an acquisition parameter, we repeated the first row of Fig. 3 but color-coded the points according to the magnetic field strength of the MRI machine on which the images were acquired; this is shown in the second row of Fig. 3. We then built additional reference curves using only the data from images acquired at 1.5T, shown in the third row of Fig. 3, and only the images acquired at 3T, shown in the fourth row of the same figure, for all three segmentation algorithms in the study. Visually, a slight study effect can be observed for all three algorithms, indicating minor variations in segmentation results across different studies. However, this effect is considerably weaker than the influence of magnetic field strength, which remains the primary source of variability in our figures. Moreover, assessing the impact of study-related differences is not the focus of this article, especially since the number of subjects from different studies within each age range is insufficient to robustly evaluate this effect. Additional sources of variability such as sex and scanner effects are also considered: please refer to Supplementary Section S4, Fig. S3 and the Discussion for the sex effect, and Supplementary Section S5, Figs. S4 and S5 and the Discussion for the scanner effect. Please refer to https://gitlab.com/geodaisics1/Estimation_of_reference_curves_for_brain_atrophy_and_analysis_of_robustness_to_machine_effects to view these curves for other regions. The visual differences in reference curves as a function of training data indicate that FastSurfer is not robust to magnetic field strength.
Measure of errors between curves
In this section, we assess the performance of segmentation algorithms by directly comparing reference curves calculated with data from 3T MRI scans to curves predicted using data from 1.5T MRI scans (Fig. 4). This approach enables us to evaluate how each algorithm performs when applied across datasets with differing MRI field strengths. The robustness of the segmentation algorithms is assessed based on their ability to maintain reference values despite changes in magnetic field strength.
Values of different error metrics: (a) sMSPE (symmetric Mean Squared Percentage Error), (b) sMAPE (Symmetric Mean Absolute Percentage Error), (c) wMAPE (Weighted Mean Absolute Percentage Error), and (d) sMdAPE (Symmetric Median Absolute Percentage Error); for the three algorithms (AssemblyNet, FastSurfer, and FreeSurfer). The errors were calculated between the reference curves constructed with data from 1.5T MRI scans and the reference curves predicted using data from 3T MRI scans. AssemblyNet consistently shows lower error metric values compared to FastSurfer and FreeSurfer for all bounds and metrics.
AssemblyNet achieves the lowest sMSPE for all bounds. Since sMSPE emphasizes larger errors, the smaller values for AssemblyNet indicate that its predictions are consistently closer to the reference curves, with fewer large errors. AssemblyNet also consistently outperforms FastSurfer and FreeSurfer on sMAPE across all bounds; a lower sMAPE indicates that AssemblyNet is more robust to magnetic field strength. The lower wMAPE for AssemblyNet indicates that it maintains high accuracy across different volume sizes, making it more reliable when the data cover a wide range of volumes. AssemblyNet also achieves the lowest sMdAPE for all bounds. The results for the MASE metric are available in Supplementary Section S6, Fig. S6, and are discussed in the Discussion section. In conclusion, AssemblyNet consistently outperforms FastSurfer and FreeSurfer in terms of robustness. Its superior performance across these metrics, especially sMAPE, sMSPE, and wMAPE, indicates that it is more capable of providing reliable volume predictions, even when faced with the variability introduced by different MRI field strengths.
Method for identifying biases related to magnetic field strength: bootstrapping
The evaluation of bias related to the magnetic field strength, comparing 1.5T and 3T MRI reference curves using bootstrapping, reveals distinct patterns across the three algorithms (Fig. 5 illustrates sMAPE values; for other metrics and views, please refer to Supplementary Section S7, Fig. S7 (sMdAPE), Fig. S8 (sMSPE) and Fig. S9 (wMAPE)).
Histograms showing the distribution of sMAPE (symmetric Mean Absolute Percentage Error) values across 10,000 bootstrap iterations with randomized field strength labels. Columns correspond to segmentation algorithms (AssemblyNet, FastSurfer and FreeSurfer), while rows represent boundary conditions (lower, mean and upper bounds). The red line indicates the observed/true label value (“Obs”), i.e., the error metric calculated between the 1.5T and 3T reference curves using the true field strength labels, and the black lines mark the 5th and 95th percentiles. The histogram represents the distribution (y) of metric values (x) obtained from 10,000 bootstrap iterations with randomized field strength labels. This distribution estimates the range of metric variability expected by chance. If the observed value (red line, the true metric value) falls outside the 5th-95th percentile of the bootstrap distribution, it indicates a statistically significant bias likely driven by magnetic field strength rather than random variability. AssemblyNet exhibits fewer true-label values outside the 5th-95th percentiles compared to FastSurfer and FreeSurfer, indicating it is less biased by differences related to magnetic field strength.
For AssemblyNet, the true-label values of sMAPE, sMdAPE and wMAPE fall within the bootstrapped bounds for the lower bound but are slightly outside the 95th percentile for the mean and upper bounds, indicating a slight magnetic field-related bias. Conversely, sMSPE values for AssemblyNet consistently fall within the bootstrapped bounds across all bounds. In contrast, FastSurfer shows significant deviations for sMAPE across all configurations, with similar trends for sMdAPE and wMAPE; these substantial distances highlight that FastSurfer metrics are heavily influenced by magnetic field strength. FreeSurfer displays similar trends, with marked deviations across metrics, again reflecting strong magnetic field-related effects. In summary, while AssemblyNet exhibits minor deviations, FastSurfer and FreeSurfer demonstrate significant biases across all metrics. These findings suggest that FastSurfer and FreeSurfer are more sensitive to magnetic field differences, while AssemblyNet provides relatively stable results.
Atrophy assessment
Hippocampal atrophy sensitivity analysis
Table 3 shows the percentage of subjects with left/right hippocampal atrophy. For the control group, the percentages of subjects showing hippocampal atrophy according to AssemblyNet are notably consistent across all reference curves. In contrast, FastSurfer exhibits variability. FreeSurfer also shows some variability, though it is more stable than FastSurfer. For the pathological group, the percentage of Alzheimer's patients with hippocampal atrophy is very stable with AssemblyNet, while FastSurfer shows more variability than FreeSurfer. There is a significant difference (McNemar test, p < 0.05; R version 4.5.0) in the atrophy percentage between AssemblyNet and FreeSurfer, and between FastSurfer and FreeSurfer, across all models.
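The test was run in R; an equivalent sketch in Python using statsmodels is shown below, with a placeholder 2×2 table of paired atrophy decisions:

```python
# Equivalent McNemar test in Python via statsmodels (the study used
# R version 4.5.0); the counts below are placeholders.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# rows: algorithm A atrophy yes/no; columns: algorithm B atrophy yes/no
table = np.array([[150, 12],
                  [3, 54]])
result = mcnemar(table, exact=True)   # exact binomial test on discordant pairs
print(result.pvalue)                  # p < 0.05: the two algorithms' calls differ
```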
Table 4 provides the classification performance scores for the detection of left hippocampal atrophy across segmentation algorithms and reference curves. Supplementary Section S8, Fig. S10 shows ROC curves for the left hippocampus. Although AssemblyNet shows slightly lower classification metrics than FastSurfer and FreeSurfer, it consistently demonstrates greater stability across all reference models. In contrast, FastSurfer and FreeSurfer exhibit more variability, with performance scores fluctuating significantly depending on magnetic field strength. For example, regarding specificity, AssemblyNet ranges between 0.87 and 0.91 (considering both left and right hippocampus and all models), while FastSurfer ranges from 0.76 to 0.93, and FreeSurfer from 0.86 to 0.93. This supports the earlier observation that AssemblyNet is more robust to acquisition differences compared to the other segmentation tools.
HAVAs results
Supplementary Section S9, Fig. S11 presents the reference curves for the hippocampus, amygdala, and inferior lateral ventricle, across segmentation algorithms. A sensitivity to magnetic field strength is observed for FastSurfer and FreeSurfer, particularly for the hippocampus and amygdala, as indicated by the distribution of colored points according to field strength. However, as shown in Supplementary Section S9, Fig. S12, the HAVAs score appears to reduce this sensitivity to magnetic field strength, likely due to the two-step z-score normalization applied during its computation.
Table 5 presents the percentages of HAVAs scores considered pathological (i.e., scores falling below the 5th percentile of the reference curve) across CN subjects and AD patients, for each model and segmentation algorithm. In CN subjects, AssemblyNet yields overall lower atrophy percentages than FastSurfer and FreeSurfer (with a maximum value of 21.96% for AssemblyNet, versus 40.39% for FastSurfer and 33.73% for FreeSurfer). Moreover, AssemblyNet shows a narrower range of atrophy percentages (5.10% to 14.12%), while FastSurfer and FreeSurfer exhibit wider ranges (16.08% to 40.39% and 14.51% to 33.73%, respectively). In AD patients, the percentage of pathological scores is higher, as expected. AssemblyNet again shows slightly lower values (ranging from 82.19% to 94.98%) compared to FastSurfer (87.21% to 97.71%) and FreeSurfer (84.02% to 96.35%).
Table 6 presents the classification scores between AD patients and CN subjects obtained using the HAVAs method, across the different models and algorithms. Supplementary Section S9, Fig. S13 shows the ROC curves for the left HAVAs. When comparing the models derived from 1.5T and 3T data (HAVAs left), FastSurfer and FreeSurfer stand out with higher sensitivity (0.95 and 0.93, respectively), compared to 0.89 for AssemblyNet. However, AssemblyNet achieves both the highest specificity (0.91 vs. 0.80 for FastSurfer and 0.83 for FreeSurfer) and the highest balanced accuracy (0.90 vs. 0.88 for FastSurfer and FreeSurfer). Expanding the analysis to all three models (1.5T-only, 3T-only, and combined; HAVAs left), we observe that the specificity of FastSurfer and FreeSurfer drops significantly compared to AssemblyNet, with minimum values of 0.60 and 0.66, respectively, versus 0.86 for AssemblyNet. These results suggest that FastSurfer, due to its higher sensitivity, may be more effective at detecting Alzheimer's cases. However, AssemblyNet, with its superior specificity and balanced accuracy, provides a better trade-off between correctly identifying both AD patients and healthy controls.
Atrophy stability analysis
Longitudinal analysis
Supplementary Section S10, Figs. S14, S15 and S16 show left hippocampal volumes of control and AD subjects from the MIRIAD dataset, acquired during follow-ups, overlaid with reference curves from the three algorithms (using 1.5T and 3T data). This comparison enabled us to assess the consistency and reliability of each algorithm in detecting atrophy over time. Among the healthy subjects, most showed no signs of atrophy across all algorithms, indicating that the algorithms generally agree when there is no pathological change. However, discrepancies were observed in some cases. For instance, subject 230 was consistently atrophied, with substantial variations specifically in FreeSurfer's results. Subject 231 showed no atrophy with AssemblyNet, consistent atrophy with FastSurfer, and fluctuating results with FreeSurfer. Among the AD subjects, the majority were consistently identified as atrophied across all algorithms, which aligns with the expected progression of AD. However, there were notable exceptions, such as subject 191, where FreeSurfer and FastSurfer detected atrophy consistently, but AssemblyNet did not until the penultimate session. Overall, these findings suggest that while all three algorithms can reliably detect atrophy, there are differences in sensitivity and stability, particularly in borderline cases.
Inter-sites analysis
In this study, we used MRI data from the SRPBS traveling subjects. Brain volumes extracted from scans acquired at different sites were projected onto the reference curves for evaluation. Details of this analysis are provided in the Supplementary Section S11, Fig. S17. Supplementary Section S11, Table S2 summarizes the atrophy assessments obtained for SRPBS subjects using the three segmentation algorithms. Overall, FastSurfer exhibits greater variability across sites compared to AssemblyNet and FreeSurfer, suggesting reduced robustness in multi-site settings.
Discussion
The objectives of this study were to evaluate the robustness and sensitivity of three MRI segmentation algorithms in constructing reference curves for brain atrophy assessment in neurodegenerative diseases, with a focus on hippocampal volumes. The study aimed to determine which algorithm produces the most stable reference curves under varying magnetic field conditions and which algorithm provides the most consistent detection of hippocampal atrophy, a key marker in Alzheimer's disease. Evaluating the impact of machine effects and sensitivity is essential, as variability in MRI acquisition parameters and algorithm performance can compromise the reliability and conclusions of studies based on reference curves.
Our results indicate that AssemblyNet is the most robust algorithm, consistently generating stable reference curves with minimal impact from variations in magnetic field strength. This robustness is particularly valuable in industrial settings, where data often come from heterogeneous sources, including varying MRI field strengths and different scanner models. In such contexts, covariates like magnetic field strength may not be correctable. Therefore, segmentation methods that are intrinsically robust to these variations are highly advantageous for large-scale deployment. In contrast, FastSurfer displays significant instability, with notable differences in reference curves depending on the MRI field strength. This lack of stability suggests that FastSurfer may not be the optimal choice when consistency across different imaging environments is required. Additionally, AssemblyNet also provides more consistent estimates of hippocampal atrophy and HAVAs scores compared to the other algorithms, though the atrophy percentages were generally lower.
Reference modeling: addressing confounding factors including imaging variability
Accounting for confounders in normative modeling
Confounding variables such as age, sex, scanner effects and MRI field strength can influence brain volumetric measurements and impact the construction of normative models. Addressing these confounders is essential for building accurate and generalizable lifespan trajectories. A review5 found that, among 13 normative modeling studies, 7 included only age as a covariate and 4 included both age and sex. Only one study55 accounted for field strength, but without justifying this choice.
In our approach, age is directly modeled through GAMs, while sex-related differences are addressed via ICV normalization56,57,58,59. This approach is widely used to compensate for head size differences, a major source of inter-individual variability in brain volume measurements. Without ICV correction, differences in regional volumes may reflect cranial size rather than meaningful biological effects. For instance, part of the apparent sex difference in brain volumes is explained by ICV disparities and applying normalization substantially reduces or eliminates these differences59,60. ICV normalization is particularly important for assessing regional volume changes relative to maximum brain size, which is critical in studies of atrophy61. It also improves statistical sensitivity, enabling group comparisons with smaller sample sizes, especially in hippocampal studies62. Furthermore, ICV normalization contributes to reducing scanner-related variability, thereby enhancing comparability across MRI platforms9. Previous work using AssemblyNet has shown that, without normalization, males consistently exhibit larger brain volumes than females across most structures20. However, after ICV normalization, these differences largely disappear, suggesting that explicit inclusion of sex as a covariate may not be necessary. Our findings support this interpretation: without normalization, small sex effects were observed across all algorithms, with males generally showing higher hippocampal volumes. After normalization, these effects became very small for AssemblyNet, while FastSurfer and FreeSurfer exhibited persistent differences (please refer to Supplementary Section S4, Fig. S3).
In summary, our modeling framework accounts for age via GAMs and sex through ICV normalization. Field strength was not included as a formal covariate, but its influence was assessed post hoc. Scanner-related effects are discussed in more detail in the following subsection.
Magnetic field strength and scanner model effects: considerations and limitations
Machine effects are complex and multifactorial, as reported in numerous studies63,9,64,65. The variability observed in data can indeed arise from multiple sources, including not only magnetic field strength66,67,63, but also vendor64, model9, software version68, sequence parameters8,65 or gradient non-linearities69. Although some studies report minimal effects of magnetic field strength on volumetric estimates, especially for global measures66,63, others have shown regional differences, particularly in areas with low contrast or complex anatomy70. Moreover, Marchewka et al., (2014)70 did not observe significant differences in hippocampal or amygdalar volumes between field strengths in epilepsy cohorts, especially when using high-quality standardized protocols such as those from ADNI. However, Marchewka et al., (2014)70 also reported magnetic field strength-related differences in gray matter volume in the cerebellum, precentral cortex, and thalamus, regions known to be sensitive to scanner parameters and segmentation challenges. These regional effects are consistent with several previous studies71,72, and segmentation accuracy has been shown to improve at 3T in difficult regions due to better contrast-to-noise ratio70. In parallel, scanner model effects are also documented. For example, Yang et al., (2016)64 found significant variability across models in over 12% of brain regions using FreeSurfer.
In this work, we chose to focus specifically on the magnetic field strength (1.5T vs. 3T) as a proxy of machine-related variability for several reasons. First, as shown in the Supplementary Section S5, Fig. S5, within a given scanner model (Intera), volumetric differences between 1.5T and 3T scans were greater than those observed between different scanner models operating at the same field strength. This suggests that, in our dataset, magnetic field strength exerts a more dominant effect than vendor or model. Second, scanner models were not evenly distributed across age groups, limiting our ability to disentangle model and vendor effects independently of age (please refer to Supplementary Section S5, Fig. S4). Such an analysis would require well-balanced subgroups across all age bins and scanner types, which is not achievable with the current data. Given these considerations, magnetic field strength was selected as the most interpretable and impactful axis of variability for this robustness study. We acknowledge that more granular investigations of vendor, model, and sequence-related effects would be valuable.
The “gold standard” issue
Manual segmentation is traditionally used as the reference, often termed the “gold standard”, for evaluating automated brain segmentation methods. However, this designation is increasingly questioned, as manual labeling suffers from limited reproducibility, with intra-rater agreement frequently falling below 80%. For example, using the BrainCOLOR protocol, intra-rater Dice scores (i.e., between repeated manual segmentations) were estimated at 76.8%44. This level of agreement is comparable to the consistency observed between manual and automated segmentations: 75.8% for AssemblyNet using the same protocol, and approximately 80% for FastSurfer (80.19% subcortical, 80.65% cortical) on the Mindboggle-101 dataset47. Moreover, Coupe et al., (2020)44 reported intra-method Dice scores of 92.8% for AssemblyNet (i.e., between scan and rescan automatic segmentations), again surpassing the manual intra-rater reproducibility. These results suggest that automated methods can be as consistent as human experts. These findings challenge the conventional use of manual segmentation as a gold standard.
Although manual segmentations can serve as a reference for partial evaluation, this approach has important limitations. Both human raters and algorithms may exhibit similar biases, particularly in regions with ill-defined anatomical boundaries. This issue also applies to many automated methods: recent deep learning models such as FastSurfer, SynthSeg, and LOD-Brain are often trained on labels generated by FreeSurfer, thereby inheriting its structural biases, for example with hippocampal segmentation, as shown by Valabregue et al., (2024)73. These factors make it inherently difficult to determine how close a segmentation truly approximates the underlying anatomical ground truth.
Median percentage volume differences between 1.5T and 3T data
To compare our volumes between 1.5T and 3T data, we refer to Lee et al., (2024)74, who assessed the reliability of FreeSurfer at 1.5T and 3T. They reported the following median volumes for the left hippocampus: 3.63 cm³ at 1.5T and 3.70 cm³ at 3T, resulting in a 1.89% volume variation between the two field strengths (non-normalized analysis). For FreeSurfer, the higher variation observed in our study (6.15%) could be attributed to the larger and more diverse dataset of 3730 subjects (from 11 different datasets), as opposed to Lee et al., (2024)74's dataset of 101 patients (from 3 datasets). The larger sample size and the inclusion of multiple datasets likely contributed to the increased variability in segmentation results, particularly for FreeSurfer and FastSurfer.
Comparing modeling strategies for reference curves construction
Mean Absolute Scaled Error (MASE) evaluation
The Mean Absolute Scaled Error (MASE) metric meets all inclusion criteria for our study. However, we observed that its value varied substantially depending on the number of points selected (please refer to Supplementary Section S6, Fig. S6). We hypothesize that this variation arises because our data follow very regular trends, where small errors can lead to elevated MASE values. This contrasts with the findings of Liu et al., (2024)24, who reported MASE values between 0 and 0.15 when comparing reference curves across different datasets. In our case, we observe much higher MASE values, even when selecting only one point out of eight from our reference curves.
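This behavior is visible directly in the metric's definition, sketched below in NumPy under its common formulation (the study used the implementation cited earlier): MASE scales the mean absolute error by the mean absolute first difference of the reference series, and for very smooth, regular curves this denominator is close to zero, inflating the ratio.

```python
# NumPy sketch of MASE (common definition): the denominator is tiny
# for very smooth reference curves, which inflates the metric.
import numpy as np

def mase(ref, pred):
    scale = np.mean(np.abs(np.diff(ref)))    # naive one-step forecast error on the reference
    return np.mean(np.abs(ref - pred)) / scale
```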
Comparison of GAM-based curves with existing literature
We compared our GAM-based reference curves to those published in Coupe et al., (2017)20, focusing on hippocampal atrophy percentages in both AD and CN subjects (please refer to Supplementary Section S12, Figs. S18 and S19). The results (Supplementary Section S12, Table S3) show that AssemblyNet, when used with our GAM-based models, yields higher atrophy detection rates in AD patients (77.17% left/67.58% right) compared to the reference curves from Coupe et al., (2017)20 (63.01%/59.36%). However, for CN subjects, our GAM curves also detect a slightly higher percentage of atrophy (11.76%/8.24%) than those of Coupe et al., (2017)20 (6.27%/4.31%). This trend is consistent across FastSurfer and FreeSurfer, where GAM-based curves yield both higher detection in AD and slightly elevated atrophy rates in controls. These findings suggest that while the GAM framework may offer improved sensitivity for detecting pathological deviations, it may also slightly increase atrophy detection in CN individuals. This trade-off is further discussed in the "Trade-off between sensitivity and robustness" subsection.
Limitations of reference curves and potential of manifold learning for complex analysis
Reference curves are valuable for region-by-region analysis but face limitations in capturing the complexity of high-dimensional MRI data, which often exhibit correlations between brain regions due to the brain’s closed system. This challenge is further compounded by the lack of tools to effectively evaluate the robustness of these curves, particularly in the context of temporal data. In our case, the dataset includes brain volumes from approximately a hundred regions, adding to the analytical complexity. While reference curves are useful for localized assessments, they may fall short in capturing the global dynamics and intricate interdependencies among brain regions. Manifold learning algorithms offer a promising complementary approach by accounting for the interdependence between variables and revealing complex, non-linear relationships within high-dimensional datasets. These methods provide a more integrated and robust framework, enabling a deeper understanding of the global structure and intricate interactions within brain data75.
Atrophy detection across acquisition conditions
Early atrophy detection considerations
Our normative lifespan curves suggest that hippocampal volume decline begins at different ages depending on the segmentation algorithm: around 35 years for FastSurfer, 50 years for FreeSurfer, and approximately 55 years for AssemblyNet. However, prior studies indicate that hippocampal atrophy may begin earlier than these estimates suggest76,77. It is important to note that volume trajectories are influenced not only by segmentation accuracy but also by the modeling strategy used to estimate them. In our case, the use of GAMs likely contributes to delaying visible inflection points: earlier work using AssemblyNet on larger datasets has shown divergence between AD and normative hippocampal trajectories before the age of 4078, suggesting that AssemblyNet is indeed capable of detecting early pathological changes. Therefore, the later onset of decline observed in our curves likely reflects modeling limitations rather than segmentation constraints.
More fundamentally, it is important to recognize that hippocampal atrophy visible on structural MRI occurs relatively late in the pathological cascade of AD. Alzheimer’s disease is characterized by the abnormal accumulation of proteins, including amyloid plaques and neurofibrillary tangles composed of tau. These anomalies disrupt neuronal function and trigger a cascade of toxic mechanisms. Neurons are progressively damaged, lose their connections, and ultimately die79,14,80,81. The ATN framework82, recently revisited in 202483, offers a biological staging system based on three biomarker categories: A for amyloid accumulation, T for tau pathology, and N for neurodegeneration. CSF biomarkers, measurable through lumbar puncture, can detect amyloid and tau abnormalities up to 15 years before clinical symptom onset84,85. In contrast, neurodegeneration (N), as measured by structural MRI, reflects macroscopic neuronal loss and thus appears later in the disease process. In this context, although MRI remains a valuable tool for assessing neurodegeneration, it inherently captures late-stage processes. As such, efforts to improve early detection of AD may need to focus more on upstream biomarkers (A and T), while MRI-based approaches like ours are better suited for staging and differential diagnosis once neurodegeneration is established. Recent advances in blood-based diagnostics support this shift: improvements in mass spectrometry and antibody-based detection methods have enabled precise quantification of amyloid-\(\beta\) and various forms of tau in plasma, paving the way for faster, cheaper, and more accessible detection strategies86.
Reasons for robustness to magnetic field strength
MRI field strength (1.5T vs. 3T) impacts tissue contrast and signal-to-noise ratio, which can significantly influence segmentation accuracy, especially for methods relying heavily on voxel intensity. Previous studies have shown that higher field strength generally improves segmentation accuracy, particularly in challenging anatomical regions due to better contrast-to-noise ratios70. Our results consistently show that AssemblyNet is more robust to magnetic field strength variations compared to FastSurfer and FreeSurfer. This robustness can be attributed to several key aspects of its segmentation architecture and design.
First, AssemblyNet relies on a multiscale 3D deep learning framework composed of 250 U-Net models distributed across two levels of resolution. The first level performs coarse segmentation on overlapping 3D regions (2×2×2 mm³), each independently processed by a dedicated U-Net. A majority voting scheme aggregates the overlapping outputs, effectively introducing spatial redundancy that enhances robustness to contrast variability. The overlap of at least 50% between adjacent regions reinforces this effect by ensuring that the same anatomical region is processed by multiple models. In a second stage, the coarse segmentation is refined at a finer resolution (1×1×1 mm³), improving anatomical boundary delineation. This multi-resolution approach substantially increases the number of learnable parameters (by nearly a factor of ten), allowing more precise modeling of tissue boundaries. According to Coupe et al., (2019)87, this two-stage cascade was specifically designed to improve robustness. In addition, the model incorporates a novel nearest-neighbor transfer learning scheme: the weights from a trained U-Net are propagated to adjacent U-Nets processing overlapping regions. This design enables local anatomical knowledge to be shared across the network, which likely further contributes to stable performance across diverse imaging conditions.
In contrast, FastSurfer uses a 2D neural network that segments each slice independently. While this architecture is computationally efficient, it does not exploit the 3D context across slices. As a result, it is more susceptible to slice-wise intensity variations. This likely contributes to its greater sensitivity to field strength variability.
FreeSurfer, on the other hand, employs atlas-based and intensity-driven methods. Subcortical segmentation is based on voxel-wise probabilistic atlases and intensity priors88,89, while cortical surfaces are reconstructed through sulcal and gyral pattern extraction using mesh smoothing and spherical registration90,91. This reliance on local contrast and intensity features makes FreeSurfer vulnerable to variability in MRI contrast and signal quality (contrast-to-noise ratio). This limitation also affects SPM, which, like FreeSurfer, relies on strong anatomical priors (SPM uses voxel intensities and tissue probability maps)92.
In summary, the superior robustness of AssemblyNet likely stems from its 3D patch-based architecture, redundant spatial encoding, and multi-resolution refinement, which provide enhanced tolerance to contrast variability. In contrast, FastSurfer’s 2D architecture and FreeSurfer’s reliance on contrast-dependent features make them more susceptible to changes in magnetic field strength.
Sensitivity of algorithms to hippocampal atrophy
We assessed the sensitivity of AssemblyNet, FastSurfer, and FreeSurfer in detecting pathological changes such as hippocampal atrophy. All three segmentation algorithms in our study have demonstrated their capability to detect pathological variations in brain imaging. FreeSurfer has been validated by Fellhauer et al., (2015)93, showing its effectiveness in identifying increased brain atrophy in conditions such as AD and mild cognitive impairment. Both FastSurfer and FreeSurfer can detect differences in cortical areas associated with disease progression94,47. AssemblyNet further extends this ability, as demonstrated by Coupe et al., (2022)32, where AssemblyNet, combined with classification-based approaches using lifespan models, showed very accurate detection of AD (AUC \(\ge\) 94%) compared to control subjects. Additionally, it was able to accurately discriminate between progressive MCI and stable MCI (AUC = 78%) during a 3-year follow-up.
Our analysis shows that AssemblyNet not only provides stable reference curves but also offers consistent estimates of hippocampal atrophy. In contrast, FastSurfer and FreeSurfer showed greater variability in atrophy detection. This variability could lead to important differences in the interpretation of atrophy, highlighting the need for careful consideration when selecting segmentation algorithms for atrophy assessment.
Although AssemblyNet tends to report lower atrophy percentages, these results are in line with De et al., (1997)95, but only for the left hippocampus. That study used the Scheltens scale, a visual rating scale, to determine whether a hippocampus is atrophied, and included both mild AD patients and patients with moderate to severe AD. It found frequencies of hippocampal atrophy ranging from 78% in the mild AD group to 96% in the advanced AD group. In comparison, the analysis using AssemblyNet (Table 3), based on reference curves built with both 1.5T and 3T data, revealed an atrophy rate of 77% (left) and 68% (right) (FastSurfer: 84%/81%; FreeSurfer: 82%/76%).
Besides, for all algorithms, there is a disparity in percentages between the left and right hippocampus. This aligns with multiple studies96,97,98,99 demonstrating that AD is associated with greater atrophy in the left hippocampus compared to the right.
In addition, De et al., (1997)95 found hippocampal atrophy in 15% of the normal elderly group between 60-75 years of age. In our case, the analysis revealed the following percentages of hippocampal atrophy in the control group: AssemblyNet detected 12% atrophy in the left hippocampus and 8% in the right; FastSurfer identified 13% atrophy in both the left and right hippocampus, while FreeSurfer showed 9% atrophy on both sides (MIRIAD: 22 controls; 69 ± 16 years).
We found two other interesting studies on this subject, but neither can be used for comparative analysis as both use FreeSurfer as their segmentation algorithm. First, in the study by Byun et al.100, 77.9% of the 163 AD subjects from ADNI are considered to have hippocampal atrophy. Second, in the study by Lowe et al.101, 76% of the 92 AD subjects from ADNI are considered to have hippocampal atrophy.
Sensitivity with the HAVAs method
As demonstrated by Coupe et al. (2022)32, the HAVAs method offers superior classification performance compared to individual brain regions taken independently, reinforcing its relevance for distinguishing AD patients from CN subjects. When comparing our results to those reported in Coupe et al. (2022)32, we observe higher balanced accuracy scores: they reported a balanced accuracy of 0.81, whereas our current study achieved 0.90 with AssemblyNet, and 0.88 with both FastSurfer and FreeSurfer, using reference curves built from combined 1.5T and 3T data. Although the HAVAs method and the segmentation algorithm are the same, several factors explain this discrepancy: the test datasets differ (AIBL in their study vs. ADNI in ours), the reference models are not the same (GAMs vs. hybrid models), and the training datasets also vary.
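As a reminder, balanced accuracy is the mean of sensitivity and specificity, which keeps the comparison meaningful when AD and CN group sizes differ. A minimal sketch with hypothetical counts (not the study's actual confusion matrices):

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (true-positive rate) and specificity
    (true-negative rate)."""
    sensitivity = tp / (tp + fn)  # AD patients correctly flagged
    specificity = tn / (tn + fp)  # CN subjects correctly cleared
    return 0.5 * (sensitivity + specificity)

# Hypothetical counts: 87/100 AD detected, 93/100 CN correctly classified
print(balanced_accuracy(tp=87, fn=13, tn=93, fp=7))  # 0.90
```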
Trade-off between sensitivity and robustness: bias-variance balance
Our results highlight a classic bias-variance trade-off across segmentation algorithms when applied to AD detection. FastSurfer and FreeSurfer demonstrate higher sensitivity, making them more effective at detecting hippocampal atrophy, as evidenced by both the direct hippocampal atrophy analysis and the HAVAs-based classification results. However, this increased sensitivity comes at the cost of greater variability and reduced robustness, particularly with respect to acquisition differences. In CN subjects, FastSurfer and FreeSurfer showed broader and less consistent ranges of atrophy percentages (up to 40.39% for FastSurfer), whereas AssemblyNet maintained lower and more stable values (maximum of 21.96%). Across all reference models and field strengths, classification metrics fluctuated more for FastSurfer and FreeSurfer, dropping as low as 0.60 and 0.66 (specificity, HAVAs method), while AssemblyNet consistently maintained higher specificity (0.86-0.95) and balanced accuracy (0.90-0.91) (left HAVAs). These results suggest that AssemblyNet favors a lower-variance strategy, yielding slightly lower sensitivity but more robust and stable performance across magnetic field strengths. While it may miss a few pathological cases, it distinguishes AD patients from healthy individuals more consistently, without overfitting to field-strength-specific noise.
From an industrial standpoint, robustness is often more desirable than maximal sensitivity. In real-world settings, MRI data come from a wide range of machines, models, and sequences. Correction methods such as ComBat102 are not always applicable due to insufficient sample size: these harmonization techniques typically require large and balanced datasets, often 20-30 subjects per scanner and per sequence102,103, which are rarely available in industrial settings7. Moreover, such harmonization algorithms can introduce additional variability104 and may even degrade data quality105. Notably, none of the harmonization algorithms evaluated in Gebre et al. (2023)106 improved intraclass correlation coefficients in longitudinal designs69. Thus, a robust algorithm that handles acquisition heterogeneity gracefully is preferable, especially in large-scale screening or multi-site contexts where harmonization is not feasible.
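To make the sample-size constraint concrete, the sketch below illustrates the basic location-scale idea behind ComBat-style harmonization: aligning each scanner's mean and variance to a pooled reference. It is a deliberately simplified stand-in, not the empirical-Bayes ComBat algorithm itself, and the data are hypothetical. With only a few subjects per scanner, the per-scanner estimates it depends on become unreliable, which is why 20-30 subjects per scanner are typically recommended.

```python
import numpy as np

def location_scale_harmonize(volumes: np.ndarray, scanner: np.ndarray) -> np.ndarray:
    """Naive per-scanner location-scale alignment toward the pooled
    distribution. ComBat additionally shrinks per-scanner estimates via
    empirical Bayes, which this simplified sketch omits."""
    pooled_mean, pooled_std = volumes.mean(), volumes.std()
    harmonized = np.empty_like(volumes, dtype=float)
    for s in np.unique(scanner):
        mask = scanner == s
        m, sd = volumes[mask].mean(), volumes[mask].std()
        # With few subjects per scanner, m and sd are noisy estimates --
        # the core practical limitation noted above.
        harmonized[mask] = (volumes[mask] - m) / sd * pooled_std + pooled_mean
    return harmonized

# Hypothetical volumes (mL) from two scanners with a systematic offset
vols = np.array([2.4, 2.6, 2.5, 2.9, 3.1, 3.0])
scan = np.array(["A", "A", "A", "B", "B", "B"])
print(location_scale_harmonize(vols, scan))
```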
Conclusion
This study highlights the critical role of algorithm selection in constructing reference curves and assessing brain atrophy in neurodegenerative diseases. We observed that AssemblyNet produces very stable reference curves with respect to magnetic field strength, unlike FastSurfer, which is not robust to this parameter: significant differences in trends appear between reference curves constructed from combined 1.5T and 3T data, 1.5T-only data, and 3T-only data. Across the different reference curves, the hippocampal atrophy percentages and HAVAs scores of AD patients are more stable with AssemblyNet, though lower than those obtained with FastSurfer and FreeSurfer.
In conclusion, AssemblyNet stands out as the most robust and reliable choice, offering stability across varying MRI conditions and consistent atrophy detection. In contrast, FastSurfer and FreeSurfer require further refinement to reduce their sensitivity to magnetic field strength and improve their consistency in atrophy assessment. These findings underscore that the choice of segmentation method can significantly affect the consistency and accuracy of atrophy assessments in neurodegenerative diseases such as Alzheimer's.
Data availability
Reference curves associated with this study are available at https://gitlab.com/geodaisics1/Estimation_of_reference_curves_for_brain_atrophy_and_analysis_of_robustness_to_machine_effects. Reference curves, stored as Excel files, provide the lower, middle, and upper bounds for brain regions, computed for three segmentation algorithms (AssemblyNet, FastSurfer, FreeSurfer) using different MRI field strengths (1.5T, 3T, or both) and smoothing methods (none, concave, convex). Visualizations show the effects of MRI field strength, displaying reference curves with training points color-coded by field strength. Additional images provide general and consistent visualizations of reference curves, ensuring comparability across algorithms and regions. Concatenated figures summarize multi-algorithm and multi-region comparisons, showing the impact of smoothing and MRI datasets on reference curve generation. These data offer a comprehensive overview of the methods and analyses performed in the study. Links to open-access datasets are also provided. Access to the MRI datasets aggregated here is subject to application procedures individually managed at the discretion of each primary study.
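As an illustration of how these spreadsheets might be consumed, the sketch below loads one reference-curve file and flags whether a normalized volume falls below the lower bound at a given age. The file name and column names (age, lower) are assumptions and should be adapted to the actual spreadsheet layout.

```python
import pandas as pd

# Hypothetical file and column names -- adapt to the published spreadsheets
curves = pd.read_excel("assemblynet_hippocampus_left_1.5T+3T.xlsx")

def is_atrophied(normalized_volume: float, age: float) -> bool:
    """Flag atrophy when the volume falls below the lower reference
    bound at the nearest tabulated age."""
    row = curves.iloc[(curves["age"] - age).abs().idxmin()]
    return normalized_volume < row["lower"]

print(is_atrophied(normalized_volume=0.21, age=72.0))
```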
References
Risacher, S. L. & Saykin, A. J. Neuroimaging biomarkers of neurodegenerative diseases and dementia. In Seminars in neurology, vol. 33, 386–416 (Thieme Medical Publishers, 2013).
Besson, F. L. et al. Cognitive and brain profiles associated with current neuroimaging biomarkers of preclinical alzheimer’s disease. Journal of Neuroscience 35, 10402–10411 (2015).
Bartos, A., Gregus, D., Ibrahim, I. & Tintěra, J. Brain volumes and their ratios in alzheimer's disease on magnetic resonance imaging segmented using freesurfer 6.0. Psychiatry Research: Neuroimaging 287, 70–74 (2019).
Coupe, P. et al. Lifespan neurodegeneration of the human brain in multiple sclerosis. bioRxiv 2023–03 (2023).
Marquand, A. F. et al. Conceptualizing mental disorders as deviations from normative functioning. Molecular psychiatry 24, 1415–1424 (2019).
Wolfers, T. et al. Individual differences v. the average patient: mapping the heterogeneity in adhd using normative models. Psychological Medicine 50, 314–323 (2020).
Warrington, S. et al. A resource for development and comparison of multi-modal brain 3t mri harmonisation approaches. bioRxiv 2023–06 (2023).
Hedges, E. P. et al. Reliability of structural mri measurements: The effects of scan session, head tilt, inter-scan interval, acquisition sequence, freesurfer version and processing stream. Neuroimage 246, 118751 (2022).
Guo, C., Ferreira, D., Fink, K., Westman, E. & Granberg, T. Repeatability and reproducibility of freesurfer, fsl-sienax and spm brain volumetric measurements and the effect of lesion filling in multiple sclerosis. European radiology 29, 1355–1364 (2019).
Srinivasan, D. et al. A comparison of freesurfer and multi-atlas muse for brain anatomy segmentation: Findings about size and age bias, and inter-scanner stability in multi-site aging studies. Neuroimage 223, 117248 (2020).
Rebsamen, M. et al. Growing importance of brain morphometry analysis in the clinical routine: The hidden impact of mr sequence parameters. Journal of neuroradiology (2023).
O'Brien, J. Role of imaging techniques in the diagnosis of dementia. The British journal of radiology 80, S71–S77 (2007).
Klunk, W. E. et al. The centiloid project: standardizing quantitative amyloid plaque estimation by pet. Alzheimer’s & dementia 11, 1–15 (2015).
Engels-Domínguez, N. et al. State-of-the-art imaging of neuromodulatory subcortical systems in aging and alzheimer’s disease: challenges and opportunities. Neuroscience & Biobehavioral Reviews 104998 (2022).
Huynh, K. et al. Clinical and biological correlates of white matter hyperintensities in patients with behavioral-variant frontotemporal dementia and alzheimer disease. Neurology 96, e1743–e1754 (2021).
Calloni, S. F. et al. Combining semi-quantitative rating and automated brain volumetry in mri evaluation of patients with probable behavioural variant of fronto-temporal dementia: an added value for clinical practise?. Neuroradiology 65, 1025–1035 (2023).
Cavedo, E. et al. Validation of an automatic tool for the measurement of brain atrophy and white matter hyperintensity in clinical routine: Qyscore®: Neuroimaging/optimal neuroimaging measures for early detection. Alzheimer’s & Dementia 16, e040259 (2020).
Hänninen, K. et al. Thalamic atrophy predicts 5-year disability progression in multiple sclerosis. Frontiers in Neurology 11, 606 (2020).
Pemberton, H. G. et al. Technical and clinical validation of commercial automated volumetric mri tools for dementia diagnosis—a systematic review. Neuroradiology 63, 1773–1789 (2021).
Coupé, P., Catheline, G., Lanuza, E., Manjón, J. V. & Initiative, A. D. N. Towards a unified analysis of brain maturation and aging across the entire lifespan: A mri analysis. Human brain mapping 38, 5501–5518 (2017).
Rajagopalan, V., Yue, G. H. & Pioro, E. P. Do preprocessing algorithms and statistical models influence voxel-based morphometry (vbm) results in amyotrophic lateral sclerosis patients? a systematic comparison of popular vbm analytical methods. Journal of Magnetic Resonance Imaging 40, 662–667 (2014).
Katuwal, G. J. et al. Inter-method discrepancies in brain volume estimation may drive inconsistent findings in autism. Frontiers in neuroscience 10, 439 (2016).
Bozek, J., Griffanti, L., Lau, S. & Jenkinson, M. Normative models for neuroimaging markers: Impact of model selection, sample size and evaluation criteria. NeuroImage 268, 119864 (2023).
Liu, Y. et al. Quantifying individualized deviations of brain structure in patients with multiple neurological diseases from normative references. Under review on researchsquare.com (2024).
Mills, K. L. et al. Structural brain development between childhood and adulthood: Convergence across four longitudinal samples. Neuroimage 141, 273–281 (2016).
Borghi, E. et al. Construction of the world health organization child growth standards: selection of methods for attained growth curves. Statistics in medicine 25, 247–265 (2006).
Bethlehem, R. A. et al. Brain charts for the human lifespan. Nature 604, 525–533 (2022).
Pomponio, R. et al. Harmonization of large mri datasets for the analysis of brain imaging patterns throughout the lifespan. NeuroImage 208, 116450 (2020).
Satterthwaite, T. D. et al. Impact of puberty on the evolution of cerebral perfusion during adolescence. Proceedings of the National Academy of Sciences 111, 8643–8648 (2014).
Tanpitukpongse, T. P., Mazurowski, M. A., Ikhena, J. & Petrella, J. R. Predictive utility of marketed volumetric software tools in subjects at risk for alzheimer disease: do regions outside the hippocampus matter?. American Journal of Neuroradiology 38, 546–552 (2017).
Eckerström, C. et al. Small baseline volume of left hippocampus is associated with subsequent conversion of mci into dementia: the göteborg mci study. Journal of the neurological sciences 272, 48–59 (2008).
Coupé, P. et al. Hippocampal-amygdalo-ventricular atrophy score: Alzheimer disease detection using normative and pathological lifespan models. Human Brain Mapping 43, 3270–3282 (2022).
Scheltens, P., Fox, N., Barkhof, F. & De Carli, C. Structural magnetic resonance imaging in the practical assessment of dementia: beyond exclusion. The Lancet Neurology 1, 13–21 (2002).
Chapuis, P. et al. Morphologic and neuropsychological patterns in patients suffering from alzheimer’s disease. Neuroradiology 58, 459–466 (2016).
Craddock, C. et al. The neuro bureau preprocessing initiative: open sharing of preprocessed neuroimaging data and derivatives. Frontiers in Neuroinformatics 7, 5 (2013).
Mazziotta, J. C. et al. A probabilistic approach for mapping the human brain: the international consortium for brain mapping (icbm). In Brain mapping: the systems, 141–156 (Elsevier, 2000).
Petersen, R. C. et al. Alzheimer’s disease neuroimaging initiative (adni): clinical characterization. Neurology 74, 201–209 (2010).
Marcus, D. S. et al. Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. Journal of cognitive neuroscience 19, 1498–1507 (2007).
Marek, K. et al. The parkinson’s progression markers initiative (ppmi)-establishing a pd biomarker cohort. Annals of clinical and translational neurology 5, 1460–1477 (2018).
Poldrack, R. A. et al. A phenome-wide examination of neural and cognitive function. Scientific data 3, 1–12 (2016).
Wei, D. et al. Structural and functional mri from a cross-sectional southwest university adult lifespan dataset (sald). bioRxiv 177279 (2017).
Malone, I. B. et al. Miriad—public release of a multiple time point alzheimer’s mr imaging dataset. NeuroImage 70, 33–36 (2013).
Tanaka, S. C. et al. A multi-site, multi-disorder resting-state magnetic resonance image database. Scientific data 8, 227 (2021).
Coupé, P. et al. Assemblynet: A large ensemble of cnns for 3d whole brain mri segmentation. NeuroImage 219, 117026 (2020).
Fischl, B. Freesurfer. Neuroimage 62, 774–781 (2012).
Glatard, T. et al. A virtual imaging platform for multi-modality medical image simulation. IEEE transactions on medical imaging 32, 110–118 (2012).
Henschel, L. et al. Fastsurfer-a fast and accurate deep learning based neuroimaging pipeline. NeuroImage 219, 117012 (2020).
Larsen, K. Gam: the predictive modeling silver bullet. Multithreaded. Stitch Fix 30, 1–27 (2015).
Hyndman, R. J. & Koehler, A. B. Another look at measures of forecast accuracy. International journal of forecasting 22, 679–688 (2006).
Kreinovich, V., Nguyen, H. T. & Ouncharoen, R. How to estimate forecasting quality: A system-motivated derivation of symmetric mean absolute percentage error (smape) and other similar characteristics. Departmental Technical Reports (CS). 865 (2014).
Wang, X., Peng, Y. & Ma, W. An end-to-end smart predict-then-optimize framework for vehicle relocation problems in large-scale vehicle crowd sensing. arXiv preprint arXiv:2411.18432 (2024).
Tibshirani, R. J. & Efron, B. An introduction to the bootstrap. Monographs on statistics and applied probability 57, 1–436 (1993).
Manjón, J. V. et al. Nonlocal intracranial cavity extraction. International journal of biomedical imaging 2014, 820205 (2014).
Çorbacıoğlu, ŞK. & Aksel, G. Receiver operating characteristic curve analysis in diagnostic accuracy studies: A guide to interpreting the area under the curve value. Turkish journal of emergency medicine 23, 195–198 (2023).
Ziegler, G. et al. Individualized gaussian process-based prediction and detection of local and global gray matter abnormalities in elderly subjects. NeuroImage 97, 333–348 (2014).
Mathalon, D. H., Sullivan, E. V., Rawles, J. M. & Pfefferbaum, A. Correction for head size in brain-imaging measurements. Psychiatry Research: Neuroimaging 50, 121–139 (1993).
Yaakub, S. N. et al. On brain atlas choice and automatic segmentation methods: a comparison of maper & freesurfer using three atlas databases. Scientific Reports 10, 2837 (2020).
Kijonka, M. et al. Whole brain and cranial size adjustments in volumetric brain analyses of sex-and age-related trends. Frontiers in neuroscience 14, 278 (2020).
Malone, I. B. et al. Accurate automatic estimation of total intracranial volume: a nuisance variable with less nuisance. Neuroimage 104, 366–372 (2015).
Pintzka, C. W., Hansen, T. I., Evensmoen, H. R. & Håberg, A. K. Marked effects of intracranial volume correction methods on sex differences in neuroanatomical structures: a hunt mri study. Frontiers in neuroscience 9, 238 (2015).
O’Brien, L. M. et al. Statistical adjustments for brain size in volumetric neuroimaging studies: some practical implications in methods. Psychiatry Research: Neuroimaging 193, 113–122 (2011).
Klasson, N., Olsson, E., Eckerström, C., Malmgren, H. & Wallin, A. Estimated intracranial volume from freesurfer is biased by total brain volume. European Radiology Experimental 2, 1–6 (2018).
Jovicich, J. et al. Mri-derived measurements of human subcortical, ventricular and intracranial brain volumes: reliability effects of scan sessions, acquisition sequences, data analyses, scanner upgrade, scanner vendors and field strengths. Neuroimage 46, 177–192 (2009).
Yang, C.-Y. et al. Reproducibility of brain morphometry from short-term repeat clinical mri examinations: a retrospective study. PLoS One 11, e0146913 (2016).
Rebsamen, M. et al. Growing importance of brain morphometry analysis in the clinical routine: The hidden impact of mr sequence parameters. Journal of Neuroradiology 51, 5–9 (2024).
Heinen, R. et al. Robustness of automated methods for brain volume measurements across different mri field strengths. PloS one 11, e0165719 (2016).
Keihaninejad, S. et al. A robust method to estimate the intracranial volume across mri field strengths (1.5 t and 3t). Neuroimage 50, 1427–1437 (2010).
Potvin, O. et al. Measurement variability following mri system upgrade. Frontiers in neurology 10, 726 (2019).
Jovicich, J. et al. Reliability in multi-site structural mri studies: effects of gradient non-linearity correction on phantom and human data. Neuroimage 30, 436–443 (2006).
Marchewka, A. et al. Influence of magnetic field strength and image registration strategy on voxel-based morphometry in a study of alzheimer’s disease. Human brain mapping 35, 1865–1874 (2014).
Goodro, M., Sameti, M., Patenaude, B. & Fein, G. Age effect on subcortical structures in healthy adults. Psychiatry Research: Neuroimaging 203, 38–45 (2012).
Pfefferbaum, A., Rohlfing, T., Rosenbloom, M. J. & Sullivan, E. V. Combining atlas-based parcellation of regional brain data acquired across scanners at 1.5 t and 3.0 t field strengths. Neuroimage 60, 940–951 (2012).
Valabregue, R., Khemir, I., Auzias, G., Rousseau, F. & Ounissi, M. Unraveling systematic biases in brain segmentation: Insights from synthetic training. In MIDL 2024-Medical Imaging with Deep Learning (2024).
Lee, H. et al. Evaluating brain volume segmentation accuracy and reliability of freesurfer and neurophet aqua at variations in mri magnetic field strengths. Scientific Reports 14, 24513 (2024).
Attyé, A. et al. Data-driven normative values based on generative manifold learning for quantitative mri. Scientific Reports 14, 7563 (2024).
Kiely, M. et al. Insights into human cerebral white matter maturation and degeneration across the adult lifespan. Neuroimage 247, 118727 (2022).
Hedman, A. M., van Haren, N. E., Schnack, H. G., Kahn, R. S. & Hulshoff Pol, H. E. Human brain changes across the life span: a review of 56 longitudinal magnetic resonance imaging studies. Human brain mapping 33, 1987–2002 (2012).
Coupé, P., Manjón, J. V., Lanuza, E. & Catheline, G. Lifespan changes of the human brain in alzheimer’s disease. Scientific reports 9, 3998 (2019).
Breijyeh, Z. & Karaman, R. Comprehensive review on alzheimer’s disease: causes and treatment. Molecules 25, 5789 (2020).
Bhogal, P. et al. The common dementias: a pictorial review. European radiology 23, 3405–3417 (2013).
Bansode, P., Chivte, V. & Nikalje, A. P. A brief review on parkinson’s disease. EC Pharmacology & Toxicology–Review Article, Maharshtra, published: June 15 (2018).
Jack, C. R. Jr. et al. A/t/n: An unbiased descriptive classification scheme for alzheimer disease biomarkers. Neurology 87, 539–547 (2016).
Altmann, A. et al. Towards cascading genetic risk in alzheimer’s disease. Brain 147, 2680–2690 (2024).
Bateman, R. J. et al. Clinical and biomarker changes in dominantly inherited alzheimer’s disease. New England Journal of Medicine 367, 795–804 (2012).
Fagan, A. M. et al. Longitudinal change in csf biomarkers in autosomal-dominant alzheimer's disease. Science translational medicine 6, 226ra30 (2014).
Dolgin, E. Faster, cheaper, better: the rise of blood tests for alzheimer’s. Nature 640, S11–S13 (2025).
Coupé, P. et al. Assemblynet: a novel deep decision-making process for whole brain mri segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III 22, 466–474 (Springer, 2019).
Fischl, B. et al. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33, 341–355 (2002).
Fischl, B. et al. Sequence-independent segmentation of magnetic resonance images. Neuroimage 23, S69–S84 (2004).
Fischl, B., Sereno, M. I. & Dale, A. M. Cortical surface-based analysis: Ii: inflation, flattening, and a surface-based coordinate system. Neuroimage 9, 195–207 (1999).
Desikan, R. S. et al. An automated labeling system for subdividing the human cerebral cortex on mri scans into gyral based regions of interest. Neuroimage 31, 968–980 (2006).
Kazemi, K. & Noorizadeh, N. Quantitative comparison of spm, fsl, and brainsuite for brain mr image segmentation. Journal of biomedical physics & engineering 4, 13 (2014).
Fellhauer, I. et al. Comparison of automated brain segmentation using a brain phantom and patients with early alzheimer’s dementia or mild cognitive impairment. Psychiatry Research: Neuroimaging 233, 299–305 (2015).
Poulin, S. P. et al. Amygdala atrophy is prominent in early alzheimer’s disease and relates to symptom severity. Psychiatry Research: Neuroimaging 194, 7–13 (2011).
De Leon, M. et al. Frequency of hippocampal formation atrophy in normal aging and alzheimer’s disease. Neurobiology of aging 18, 1–11 (1997).
Shi, F., Liu, B., Zhou, Y., Yu, C. & Jiang, T. Hippocampal volume and asymmetry in mild cognitive impairment and alzheimer’s disease: Meta-analyses of mri studies (2009).
Dhikav, V., Duraisamy, S., Anand, K. S. & Garga, U. C. Hippocampal volumes among older indian adults: Comparison with alzheimer’s disease and mild cognitive impairment. Annals of Indian Academy of Neurology 19, 195–200 (2016).
Ezzati, A. et al. Differential association of left and right hippocampal volumes with verbal episodic and spatial memory in older adults. Neuropsychologia 93, 380–385 (2016).
Li, X. et al. Hippocampal subfield volumetry in patients with subcortical vascular mild cognitive impairment. Scientific reports 6, 20873 (2016).
Byun, M. S. et al. Heterogeneity of regional brain atrophy patterns associated with distinct progression rates in alzheimer’s disease. PLoS One 10, e0142756 (2015).
Lowe, V. J. et al. Application of the national institute on aging-alzheimer’s association ad criteria to adni. Neurology 80, 2130–2137 (2013).
Orlhac, F. et al. A guide to combat harmonization of imaging biomarkers in multicenter studies. Journal of Nuclear Medicine 63, 172–179 (2022).
Pinto, M. S. et al. Harmonization of brain diffusion mri: Concepts and methods. Frontiers in Neuroscience 14, 396 (2020).
Orlhac, F. et al. A postreconstruction harmonization method for multicenter radiomic studies in pet. Journal of Nuclear Medicine 59, 1321–1328 (2018).
Bayer, J. M. et al. Accommodating site variation in neuroimaging data using normative and hierarchical bayesian models. Neuroimage 264, 119699 (2022).
Gebre, R. K. et al. Cross-scanner harmonization methods for structural mri may need further work: A comparison study. NeuroImage 269, 119912 (2023).
Acknowledgements
Virtual Imaging Platform: VIP Part of the results presented in this work were achieved using the Neuroimaging (FreeSurfer-Recon-all) application through the Virtual Imaging Platform46, which uses the resources provided by the biomed virtual organisation of the EGI infrastructure.
Autism Brain Imaging Data Exchange: ABIDE The ABIDE35 data used in the preparation of this article were supported by ABIDE funding resources listed at https://fcon_1000.projects.nitrc.org/indi/abide/.
International Consortium for Brain Mapping: ICBM The International Consortium for Brain Mapping (ICBM), under the leadership of Principal Investigator John Mazziotta, MD, PhD, provided the data collection and sharing for this project. The National Institute of Biomedical Imaging and BioEngineering funded the ICBM. The Laboratory of Neuro Imaging at the University of Southern California distributes the ICBM data36.
Information eXtraction from Images: IXI Data collected as part of the project EPSRC GR/S21533/02 were obtained from https://brain-development.org/ixi-dataset/.
Alzheimer's Disease Neuroimaging Initiative: ADNI The Alzheimer's Disease Neuroimaging Initiative (ADNI) data collection and sharing are funded by the National Institute on Aging (National Institutes of Health Grant U19 AG024904), with the Northern California Institute for Research and Education serving as the grantee organization. ADNI has also received past funding from the National Institute of Biomedical Imaging and Bioengineering, the Canadian Institutes of Health Research, and private sector contributions through the Foundation for the National Institutes of Health (FNIH). Generous contributions have been made by numerous organizations, including AbbVie, the Alzheimer's Association, the Alzheimer's Drug Discovery Foundation, Araclon Biotech, BioClinica, Inc., Biogen, Bristol-Myers Squibb Company, CereSpir, Inc., Cogstate, Eisai Inc., Elan Pharmaceuticals, Inc., Eli Lilly and Company, EuroImmun, F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc., Fujirebio, GE Healthcare, IXICO Ltd., Janssen Alzheimer Immunotherapy Research and Development, LLC, Johnson and Johnson Pharmaceutical Research and Development LLC, Lumosity, Lundbeck, Merck and Co., Inc., Meso Scale Diagnostics, LLC, NeuroRx Research, Neurotrack Technologies, Novartis Pharmaceuticals Corporation, Pfizer Inc., Piramal Imaging, Servier, Takeda Pharmaceutical Company, and Transition Therapeutics37.
Open Access Series of Imaging Studies: OASIS The data were provided by OASIS-1: Cross-Sectional, with Principal Investigators D. Marcus, R. Buckner, J. Csernansky, and J. Morris. Funding support includes grants P50 AG05681, P01 AG03991, P01 AG026276, R01 AG021910, P20 MH071616, and U24 RR02138238.
Parkinson's Progression Markers Initiative: PPMI Data used in the preparation of this article was obtained on 2023-10-23 from the Parkinson's Progression Markers Initiative (PPMI) database (https://www.ppmi-info.org/access-data-specimens/download-data), RRID:SCR_006431. For up-to-date information on the study, visit www.ppmi-info.org. PPMI - a public-private partnership - is funded by the Michael J. Fox Foundation for Parkinson's Research and funding partners, including 4D Pharma, Abbvie, AcureX, Allergan, Amathus Therapeutics, Aligning Science Across Parkinson's, AskBio, Avid Radiopharmaceuticals, BIAL, BioArctic, Biogen, Biohaven, BioLegend, BlueRock Therapeutics, Bristol-Myers Squibb, Calico Labs, Capsida Biotherapeutics, Celgene, Cerevel Therapeutics, Coave Therapeutics, DaCapo Brainscience, Denali, Edmond J. Safra Foundation, Eli Lilly, Gain Therapeutics, GE HealthCare, Genentech, GSK, Golub Capital, Handl Therapeutics, Insitro, Jazz Pharmaceuticals, Johnson and Johnson Innovative Medicine, Lundbeck, Merck, Meso Scale Discovery, Mission Therapeutics, Neurocrine Biosciences, Neuron23, Neuropore, Pfizer, Piramal, Prevail Therapeutics, Roche, Sanofi, Servier, Sun Pharma Advanced Research Company, Takeda, Teva, UCB, Vanqua Bio, Verily, Voyager Therapeutics, the Weston Family Foundation and Yumanity Therapeutics.
UCLA Consortium for Neuropsychiatric Phenomics LA5c Study: UCLA Data used in the preparation of this article were obtained from https://openneuro.org/datasets/ds000030/versions/00016. Its accession number is ds00003040.
Dallas Lifespan Brain Study: DLBS Data used in the preparation of this article were obtained from https://www.nitrc.org/ir/app/template/XDATScreen_report_xnat_projectData.vm/search_element/xnat:projectData/search_field/xnat:projectData.ID/search_value/dlbs (https://sites.utdallas.edu/dlbs/).
Southwest University Adult Lifespan Dataset: SALD41 Data used in the preparation of this article were obtained from https://fcon_1000.projects.nitrc.org/indi/retro/sald.html.
Frontotemporal lobar degeneration neuroimaging initiative: NIFD NIFD is the abbreviation for the frontotemporal lobar degeneration neuroimaging initiative (FTLDNI, AG032306) (http://memory.ucsf.edu/research/studies/nifd). This project's data collection and sharing were funded by the Frontotemporal Lobar Degeneration Neuroimaging Initiative (National Institutes of Health Grant R01 AG032306). The study is coordinated by the Memory and Aging Center at the University of California, San Francisco, and the data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Southwest University Longitudinal Imaging Multimodal: SLIM Data used in the preparation of this article were obtained from https://fcon_1000.projects.nitrc.org/indi/retro/southwestuni_qiu_index.html.
Minimal Interval Resonance Imaging in Alzheimer's Disease (MIRIAD): Data used in the preparation of this article were obtained from the MIRIAD dataset42. The MIRIAD investigators did not participate in analysis or writing of this report. The MIRIAD dataset is made available through the support of the UK Alzheimer's Society (Grant RF116). The original data collection was funded through an unrestricted educational grant from GlaxoSmithKline (Grant 6GKC).
SRPBS Traveling Subject MRI Dataset:43 Data used in the preparation of this article were obtained from https://www.synapse.org/#!Synapse:syn22317082.
Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf. A comprehensive list of consortium members appears at the end of the paper.
Data used in preparation of this article were obtained from the Frontotemporal Lobar Degeneration Neuroimaging Initiative (FTLDNI) dataset. The investigators at NIFD/FTLDNI contributed to the design and implementation of FTLDNI and/or provided data, but did not participate in analysis or writing of this report (unless otherwise listed). A comprehensive list of consortium members appears at the end of the paper.
Funding
This work was supported by the Association Nationale de la Recherche et de la Technologie (CIFRE No. 2022/0918).
Author information
Authors and Affiliations
Consortia
Contributions
F. R., A. K. and A. A. designed the study; E. P. performed the analysis; all authors discussed the results; E. P. wrote the main manuscript text; all authors reviewed the manuscript; the "For Alzheimer's Disease Neuroimaging Initiative" and "For the Frontotemporal Lobar Degeneration Neuroimaging Initiative" consortia generated part of the data used.
Corresponding author
Ethics declarations
Competing interests
Four authors have a potential conflict of interest: Arnaud Attyé and Félix Renard are co-founders of GeodAIsics, A. Krainik is a consultant at GeodAIsics and E. Piot is a PhD student at GeodAIsics. “For Alzheimer’s Disease Neuroimaging Initiative” and “For the Frontotemporal Lobar Degeneration Neuroimaging Initiative” have no conflict of interest as they only generate part of the data used.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Piot, E., Renard, F., Attyé, A. et al. Estimation of reference curves for brain atrophy and analysis of robustness to machine effects. Sci Rep 15, 34585 (2025). https://doi.org/10.1038/s41598-025-18073-z