Abstract
Purpose
Shortcut learning has been identified as a source of algorithmic unfairness in medical imaging artificial intelligence (AI), but its impact on photoacoustic tomography (PAT), particularly concerning sex bias, remains underexplored. This study investigates this issue using peripheral artery disease (PAD) diagnosis as a specific clinical application.
Methods
To examine the potential for sex bias due to shortcut learning in convolutional neural networks (CNNs) and to assess how such biases might affect diagnostic predictions, we created training and test datasets with varying PAD prevalence between sexes. Using these datasets, we explored (1) whether CNNs can classify the sex from imaging data, (2) how sex-specific prevalence shifts impact PAD diagnosis performance and underdiagnosis disparity between sexes, and (3) how similarly CNNs encode sex and PAD features.
Results
Our study with 147 individuals demonstrates that CNNs can classify the sex from calf muscle PAT images, achieving an AUROC of 0.75. For PAD diagnosis, models trained on data with imbalanced sex-specific disease prevalence experienced significant performance drops (up to 0.21 AUROC) when applied to balanced test sets. Additionally, greater imbalances in sex-specific prevalence within the training data exacerbated underdiagnosis disparities between sexes. Finally, we identify evidence of shortcut learning by demonstrating the effective reuse of learned feature representations between PAD diagnosis and sex classification tasks.
Conclusion
CNN-based models trained on PAT data may engage in shortcut learning by leveraging sex-related features, leading to biased and unreliable diagnostic predictions. Addressing demographic-specific prevalence imbalances and preventing shortcut learning is critical for developing models in the medical field that are both accurate and equitable across diverse patient populations.
Introduction
Convolutional neural networks (CNNs) are widely used for medical image analysis but can exhibit demographic bias in their predictions, leading to performance disparities across demographic subgroups [1]. One potential cause of these disparities is shortcut learning, where models learn spurious correlations, or shortcuts, resulting in unreliable predictions.
While shortcut learning has been studied in common medical imaging domains, such as X-ray imaging, computed tomography (CT), and magnetic resonance (MR) imaging [1,2,3,4,5], it remains largely unexplored in emerging modalities such as PAT. PAT is a non-ionizing interventional imaging modality that combines the high contrast of optical imaging with the high resolution of ultrasound (US) imaging [6]. Whereas US imaging uses a sound-in sound-out principle, PAT is based on a light-in sound-out principle. Using multiple wavelengths, PAT can resolve functional tissue properties such as oxygen saturation in real time [7]. Existing PAT systems, such as the CE-certified MSOT Acuity Echo used in this study, are often hybrid imaging systems that allow for joint acquisition of US and PAT images in real time, essentially enabling combined structural and functional interventional imaging several centimeters deep in tissue. While PAT is still emerging as an interventional imaging modality, it has already proven to be an asset in various interventional settings, including photoacoustic-guided hysterectomy [8], needle tracking [9], interventional guidance in cardiovascular medicine [10], and surgery [11], as well as first applications in the context of da Vinci robotic interventions [12]. The use of deep learning in PAT, in particular, has been increasingly studied [13]. However, shortcut learning leading to sex bias in PAT has received no attention to date, despite the awareness of sex differences in the field of PAT [14]. So far, the only source of bias that has received substantial attention in the PAT literature is skin tone [15], as different skin tones interact differently with light [16]. Previous studies with other medical imaging modalities, such as X-ray imaging, have shown that sex-specific prevalence imbalances can lead to subgroup performance disparities [17].
The severity of the impact that subgroup separability can have, however, varies heavily between medical imaging modalities [3], highlighting the importance of investigating this issue for each modality separately. To our knowledge, there is no literature to date on the impact of sex bias in deep learning models for PAT. The purpose of this work was therefore to shed light on this important issue, using PAD diagnosis as a representative clinical application. PAD is particularly suitable for this investigation due to the strong causal influence of sex on its expression [18].
PAD is a prevalent circulatory condition where narrowed arteries reduce blood flow to the limbs. Early and accurate diagnosis of PAD is essential to prevent serious complications like limb amputation. CNN-based support in the current clinical workflow for PAD diagnosis with PAT can help automate and accelerate initial examinations, facilitating earlier diagnosis.
Given the gap in the scientific literature concerning biases in PAT in general and in PAD diagnosis in particular, our main contributions are threefold: (1) We are the first to show that neural networks can predict sex from PAT images, indicating that sex-specific features are present in PAT data. (2) Using PAD diagnosis as an example, we demonstrate that models trained on datasets with imbalanced sex-specific prevalence ratios (PRs) exhibit significant performance degradation when tested on balanced datasets and display severe underdiagnosis disparity between sexes, particularly affecting the underrepresented sex. (3) We provide evidence that neural networks trained for PAD diagnosis encode sex-related features, as demonstrated by effective reuse of learned feature representations between PAD diagnosis and sex classification tasks.
Materials and methods
This work is based on the hypothesis that CNN-based models trained on PAT data can exhibit sex bias due to shortcut learning, impacting the reliability and fairness of neural networks. To investigate this hypothesis, this work addresses the research questions (RQs) depicted in Fig. 1.
Summary of contribution a Illustration of shortcut learning in the context of peripheral artery disease (PAD). b Specific research questions (RQs). PAD induces vascular changes that causally influence photoacoustic tomography (PAT) images, allowing for automatic diagnosis. However, sex also affects PAT images, influenced by factors such as differences in skin and fat layer thickness. While sex might have a causal link to the risk and presentation of PAD, some datasets might exhibit an overemphasized spurious correlation between PAD and sex. In this case, neural networks may over-rely on sex, i.e., inadvertently use it as a shortcut for PAD prediction, leading to biased models
Terminology

Sex: The biological attributes distinguishing males and females.

Sex bias: The improper influence of sex-related features in a model’s predictions, where these features serve as non-causal proxies for the factors directly related to the target prediction.

Shortcut learning: The phenomenon where a machine learning model relies on spurious correlations or easily learned features not directly causal to the target variable [2]. In our context, models may exploit sex-related features as shortcuts for disease prediction (depicted in Fig. 1).

Sex-specific PR: The imbalance of disease prevalence between sexes in a dataset, defined as the proportion of diseased males over the proportion of diseased females. Furthermore, sex-specific prevalence shifts refer to changes in the PR between datasets (e.g., between training and test sets).
Dataset and data sampling strategy
The CE-certified MSOT Acuity Echo (iThera Medical GmbH, Munich, Germany) was used to capture 2D photoacoustic images of the calf muscle from 147 individuals at wavelengths 760, 800, and 850 nm. Data was acquired within two clinical studies at the Department of Vascular Surgery, University Hospital Erlangen, Germany under study IDs NCT05373927 and NCT05773534. The health status was diagnosed by PAD experts using established methods, including angiographic imaging, and was characterized by intermittent claudication of mild to moderate severity (Fontaine IIa/IIb) [19].
The data was divided into three pools: a training pool (\(\text {N}_\text {train}\) = 86), a validation pool (\(\text {N}_\text {val}\) = 21), and a test pool (\(\text {N}_\text {test}\) = 40). These splits remained fixed throughout all experiments and were created using confounder matching, specifically controlling for sex and disease status, to ensure that the sex-disease strata were evenly distributed across all data splits. The data pools were subsequently used to randomly sample the desired sex-specific PR. Importantly, across all experiments, the PR in the validation sets was kept consistent with that of the training sets. Table 1 presents the sampling counts for each sex-disease stratum at each PR. In all of our experiments, the total number of male and female patients, as well as the total number of healthy and diseased patients, were consistently balanced. The key variable manipulated across experiments was the distribution of diseased patients between sexes to achieve the desired sex-specific PR.
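As a minimal illustration of this sampling logic (our own helper, not the study's code; the actual per-stratum counts are those reported in Table 1), the split of a fixed number of diseased patients between the sexes for a target PR can be computed as:

```python
import numpy as np

def stratum_counts(n_diseased_total, pr):
    """Split a fixed number of diseased patients between the sexes so that
    PR = diseased males / diseased females (valid under equal male/female
    totals, as in the balanced pools). pr = inf puts all diseased in males."""
    if np.isinf(pr):
        return n_diseased_total, 0
    n_females = int(round(n_diseased_total / (1 + pr)))
    return n_diseased_total - n_females, n_females

# e.g. 18 diseased patients at the four PRs used in the experiments
for pr in (1, 2, 5, np.inf):
    males, females = stratum_counts(18, pr)
    print(f"PR={pr}: {males} diseased males, {females} diseased females")
```

The sampler would then draw that many diseased patients per sex (plus the matching healthy counts) with the chosen randomization from the respective data pool.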
Classification model
A version of EfficientNetV2_B0 [20], pre-trained on ImageNet [21], was used for the classification tasks in all experiments. Cross-entropy loss was employed as the loss function. Hyperparameter optimization was conducted manually only once for the PAD classifier trained on the balanced dataset (PR = 1) using the corresponding validation set. The optimized hyperparameters were then applied unchanged to all other models, including the one for sex classification and those trained on datasets with different sex-specific PRs. Details on data processing and hyperparameters are provided in the supplementary material.
Experimental design
In order to answer the RQs posed in Fig. 1, four corresponding experiments were designed. For each classifier in these experiments, an ensemble of 10 models was trained. Each model in the ensemble was trained using a unique randomized sampling from the training and validation pools, maintaining the predefined sex-specific PR. Sample counts per sex-disease group for each PR are provided in Table 1. For performance evaluation, the AUROC was used following the recommendations of [22]. Each ensemble for RQs 1, 2, and 4 was evaluated using stratified bootstrapping (\(\text {n}_\text {iter}\) = 1000) to generate 95 % confidence intervals (CIs) for the reported AUROC scores (bootstrapping was not used for RQ3). Reporting confidence intervals is essential, as recent studies have shown that performance variability in medical image analysis models can be substantial: Christodoulou et al. [23] found that the median width of confidence intervals for MICCAI 2023 segmentation models was three times larger than the median performance gain over previous methods, highlighting the importance of variability analysis in model evaluation. In each bootstrap iteration, the test pool was sampled with replacement according to the predefined PR; the sample counts per stratum for each PR are provided in Table 1.
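The stratified bootstrap can be sketched as follows (a minimal illustration with a hand-rolled, tie-free rank-based AUROC; the function names and the integer strata encoding are ours, not the paper's):

```python
import numpy as np

def auroc(y_true, y_score):
    """Rank-based AUROC (probability that a diseased case outscores a
    healthy one); assumes no tied scores for simplicity."""
    ranks = np.empty(len(y_score))
    ranks[np.argsort(y_score)] = np.arange(1, len(y_score) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_ci(y_true, y_score, strata, n_iter=1000, alpha=0.05, seed=0):
    """Percentile CI from a stratified bootstrap: resample with replacement
    within each stratum (e.g. sex x disease) so the test-set PR stays fixed."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_iter):
        idx = np.concatenate([
            rng.choice(np.flatnonzero(strata == s), size=int(np.sum(strata == s)))
            for s in np.unique(strata)
        ])
        scores.append(auroc(y_true[idx], y_score[idx]))
    return np.quantile(scores, [alpha / 2, 1 - alpha / 2])

y = np.array([0, 0, 1, 1])
s = np.array([0.2, 0.6, 0.5, 0.9])
print(auroc(y, s), bootstrap_ci(y, s, strata=y, n_iter=200))
```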
RQ1, Sex classification in PAT The experiments are divided into:

1. Sex separability in PAT: A sex classifier was trained and tested using the balanced sex-disease distribution (PR = 1). The result was compared to a PAD classifier that was trained and tested under the same conditions.

2. Generalizability to other datasets: The sex classifier was applied as-is to a dataset of 525 PAT images acquired at three body sites of 30 healthy volunteers: calf, forearm, and neck (see S2.2 for further details).
RQ2, Impact of sex-specific PR shifts
PAD classifiers were trained on four different distributions with sex-specific prevalence ratios (PRs) of 1, 2, 5, and \(\infty \) (where all diseased individuals were male). Each ensemble was evaluated on test sets with each of the four different PRs.
RQ3, Underdiagnosis disparity
PAD classifiers were trained on four different distributions with sex-specific PRs of 1, 2, 5, and \(\infty \). We report the mean and median underdiagnosis disparity [24] between females and males across 10 separate runs. Underdiagnosis disparity is defined as the difference in the ratio of false negatives to false positives between females and males; a positive value indicates an underdiagnosis bias against females. For each run, a new ensemble was trained, and the whole test pool (\(\text {N}_\text {test}\) = 40) was used to calculate the underdiagnosis disparity.
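The disparity metric, as defined above, can be computed as follows (a minimal sketch following the stated definition; guards for strata without false positives are omitted, and the function names are ours):

```python
import numpy as np

def fn_fp_ratio(y_true, y_pred):
    """Ratio of false negatives to false positives (label 1 = diseased)."""
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return fn / fp  # assumes fp > 0; small test sets would need a guard

def underdiagnosis_disparity(y_true, y_pred, is_female):
    """FN/FP ratio of females minus that of males; positive values
    indicate an underdiagnosis bias against females."""
    f = is_female.astype(bool)
    return fn_fp_ratio(y_true[f], y_pred[f]) - fn_fp_ratio(y_true[~f], y_pred[~f])
```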
RQ4, Feature representation similarity The experiments are divided into:

1. Transfer learning for sex and PAD classification: This experiment investigated whether CNNs reuse learned representations between PAD diagnosis and sex classification. To this end, the feature extractors of the sex and PAD classifiers from RQ1 were frozen, and the classification head was retrained on the alternate task (PR = 1 for all datasets).
2. Principal component analysis (PCA) projections: This experiment aimed to analyze and visualize the feature representations of PAD classifiers trained on balanced (PR = 1) and extremely imbalanced (PR = \(\infty \)) datasets using PCA. By assessing the distribution differences of the feature representations with respect to sex and disease status using 2-dimensional Wasserstein distances calculated on the first two principal components, additional insights can be drawn as to whether models trained on imbalanced data encode sex-related features more prominently. For PR = 1 and PR = \(\infty \), a representative model was chosen from the corresponding ensemble, namely the one with median AUROC performance on the in-distribution test set. We randomly drew 7 samples for each sex-disease subgroup from the test pool, ensuring a test set with PR = 1 and 28 samples in total. To avoid sampling bias, we performed 1000 sampling runs and calculated descriptive statistics of the Wasserstein distance over all runs.
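The PCA projection and the 2D Wasserstein distance between equal-sized subgroups (here 14 vs. 14 samples) can be sketched as follows. Whether the study used the W1 or W2 variant is not specified; this sketch computes the exact W1 distance between uniformly weighted point sets by solving the corresponding assignment problem, and the synthetic features stand in for pooled CNN representations:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def pca_2d(features):
    """Project feature vectors onto their first two principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def wasserstein_2d(a, b):
    """Exact W1 distance between two equal-sized, uniformly weighted point
    sets in the plane, solved as an assignment problem."""
    cost = cdist(a, b)                       # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

# e.g. 28 feature vectors, 14 per sex (synthetic stand-ins here)
rng = np.random.default_rng(0)
feats = rng.normal(size=(28, 16))
sex = np.repeat([0, 1], 14)
proj = pca_2d(feats)
print(wasserstein_2d(proj[sex == 0], proj[sex == 1]))
```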
Results
This section presents the results of the experiments described in the previous section corresponding to the four driving RQs of this study.
RQ1, Sex classification in PAT

1. Sex separability in PAT: As shown in Fig. 2, the sex classifier was able to classify the subject’s sex from calf muscle PAT images, achieving an AUROC of 0.75 (95 % CI: 0.52–0.94). This is comparable to the performance of the PAD classifier (0.79, 95 % CI: 0.60–0.94).
2. Generalizability to other datasets: As can be seen in Fig. 3, the sex classifier trained on the PAD dataset and tested on a healthy volunteer dataset scored mean AUROC results of 0.81 (95 % CI: 0.74–0.88) for the calf, 0.70 (95 % CI: 0.62–0.79) for the forearm, and 0.68 (95 % CI: 0.59–0.77) for the neck.
Sex separability a and peripheral artery disease (PAD) separability b are possible based on photoacoustic tomography images of the calf muscle. The mean area under the receiver operating characteristic curve (AUROC) is shown for sex a and PAD b classification models that were trained and tested on balanced data (sex-specific prevalence ratio PR = 1). Whiskers indicate the 95 % confidence intervals, and the dashed black lines at AUROC = 0.5 mark the performance expected from random guessing
A sex classifier trained on a peripheral artery disease (PAD) dataset (PR = 1) generalizes well to out-of-distribution (OOD) datasets of healthy volunteers across three body sites (calf, forearm, and neck). The sex classifier was tested on both the in-distribution (ID) test set (PR = 1) and OOD test sets. The mean area under the receiver operating characteristic curve (AUROC) is shown as a dot, and whiskers represent 95 % confidence intervals
RQ2, Impact of sex-specific PR shifts
The model trained with a balanced distribution (PR = 1) maintained consistent performance across all test domains (cf. Fig. 4a), demonstrating robustness to shifts in sex-health distributions. The model trained with PR = \(\infty \) (all diseased individuals male) showed a 0.21 drop in AUROC when tested on a balanced test domain (PR = 1) (cf. Fig. 4b), revealing a strong performance degradation when facing domain shifts. Increased sex-specific prevalence bias during training led not only to less stable outcomes but also to an overestimation of performance when models were tested on datasets with similarly high PRs.
Impact of sex-specific prevalence shifts from training set to test set. Peripheral artery disease classifiers trained on distributions with increasing sex-specific prevalence ratios (top to bottom PR = \(\infty \), 5, 2, 1) show significant performance drops and instability when tested on balanced data (PR = 1). Performance is measured with the mean area under the receiver operating characteristic curve (AUROC) shown in bold a and as a dot b. The model trained with PR = \(\infty \) experienced a 0.21 AUROC drop. Brackets a and whiskers b represent 95 % confidence intervals
RQ3, Underdiagnosis disparity
Both the mean and median underdiagnosis disparity between sexes generally increase as the sex-specific PR in the training data increases (cf. Fig. 5). Although the median underdiagnosis disparity increases from PR = 5 to PR = \(\infty \), the mean slightly decreases. However, both values fall within each other’s interquartile range (IQR).
RQ4, Feature representation similarity

1. Transfer learning for sex and PAD classification: The sex classifier retrained for PAD classification experienced a drop in AUROC of 0.05, from 0.75 (95 % CI: 0.52–0.94) to 0.70 (95 % CI: 0.48–0.90) (cf. Fig. 6). The PAD classifier retrained for sex classification experienced a similar drop in AUROC of 0.05, from 0.79 (95 % CI: 0.60–0.94) to 0.74 (95 % CI: 0.54–0.90) (cf. Fig. 6). Both retrained classifiers, however, still perform substantially better than random guessing.
2. PCA projections: The PCA results obtained from a representative subset, selected as the subset with Wasserstein distances closest to the geometric median calculated across 1000 sampling runs, are shown in Fig. 7. The first two principal components of the PR = \(\infty \) model show larger differences in the distributions between the sex subgroups (W = 6.2) than the PR = 1 model (W = 2.1). Distribution differences across the disease subgroups remain fairly unchanged (W = 3.8 and W = 4.1). Across the 1000 sampling runs, the medians and interquartile ranges of the Wasserstein distances were 2.1 [1.9, 2.3] for the sex subgroups and 4.1 [3.6, 4.5] for the disease subgroups in the model trained on balanced data, and 6.2 [5.7, 6.7] and 3.8 [3.4, 4.2], respectively, in the model trained on imbalanced data.
The underdiagnosis disparity between sexes increases with higher sex-specific prevalence ratio (PR) in the training data. The solid lines represent the mean and the dashed lines represent the median underdiagnosis disparity across ten runs of trained ensembles. The boxes show the interquartile range (IQR), while the whiskers extend to 1.5 times the IQR from the box
Discussion
In this study, we addressed a critical gap in the existing literature by investigating the presence of sex bias in deep learning models for PAT imaging, specifically within the context of PAD diagnosis. Our results showed that sex can be predicted solely from PAT images, suggesting that neural networks may engage in shortcut learning, which could lead to performance disparities in diagnoses between sexes. Additionally, we are the first to explore shared feature representations of sex and PAD as a potential reason for shortcut learning in PAT.
We demonstrated that CNNs can effectively classify sex from PAT images of calf muscles, which reinforces that PAT images contain sufficient information for neural networks to distinguish between sexes. This raises awareness that models trained on PAT data may inadvertently learn and utilize sex-related characteristics, potentially leading to biased predictions. We also showed that, despite the limited sample size, the sex classifier generalizes to an out-of-distribution (OOD) dataset partly consisting of body sites that were not in the training dataset. Notably, classification on calf images surpassed in-distribution performance, possibly due to a younger cohort or acquisition protocol variations leading to higher image quality.
Identification of sex bias and its impact on PAD diagnosis models
We demonstrated that models trained on datasets with imbalanced sex-specific PR show significant performance degradation (up to 0.21 AUROC drop) when tested on balanced datasets. This indicates that these models are sensitive to shifts in sex-disease prevalence between training and deployment environments. As a result, the models do not generalize well to populations with different sex distributions, leading to decreased performance in real-world settings. In contrast, models trained on datasets with balanced PR provide robust performances independent of sex-disease prevalence shifts, highlighting the importance of dataset composition in mitigating shortcut behavior.
Transfer Learning is possible between sex and peripheral artery disease (PAD) classification encodings. The last layers (heads) of the sex and PAD classifiers from RQ1 (cf. Fig. 2) indicated by a green gear symbol, were retrained for the opposite task. Retraining and testing were conducted on balanced data (sex-specific prevalence ratio PR = 1). Snowflakes indicate frozen layers. The mean area under the receiver operating characteristic curve (AUROC) is shown for sex a and PAD b classification models. Whiskers indicate the 95 % confidence intervals
Learned representations of peripheral artery disease (PAD) classifiers exhibit stronger sex separation in the principal component analysis (PCA) projections when trained with extreme sex-specific prevalence ratios (PR = \(\infty \)) compared to those trained on balanced PR data (PR = 1). The first two PCA components of the learned representations of a balanced subset of the test pool and the distributions of the marginals are displayed for models trained with PR = 1 a and PR = \(\infty \) b, with the variance explained indicated in brackets in the axis labels. W indicates the 2D-Wasserstein distance between the subgroups
We could also show that the underdiagnosis disparity between sexes increases as the sex-specific PR in the training data increases. Models trained on data with increasing sex-specific PR were more likely to underdiagnose the sex that was underrepresented among the diseased individuals in the training data. This might occur because the model predominantly learns disease features from the overrepresented sex (males in this case), leading to a lack of generalization to the underrepresented sex (females). Note that the higher underdiagnosis disparity for PR = 5 compared to PR = \(\infty \) is most likely attributable to our rather small test set with a sample size of 40.
Evidence of shortcut learning through shared feature representations
We showed that there is a considerable similarity between the neural network representations for sex and PAD features, enabling effective transfer learning between these tasks. The first two principal components of the model trained on a dataset with PR = \(\infty \) exhibited a more pronounced disparity across the sex subgroups compared to the model trained on a dataset with PR = 1 (cf. Fig. 7), indicating a stronger sex-biased encoding for models trained on data with high PR. In Fig. 6, performance drops from top right to bottom left (and vice versa), indicating that the encoders do not extract purely task-agnostic features. If they were entirely task-agnostic, we would expect no performance loss when retraining the heads for the alternate task. Nonetheless, the positive transfer learning result confirms that some features remain useful. The precise ratio of task-agnostic versus task-specific but transferable features remains challenging to determine.
The sex of a patient is routinely considered in certain medical diagnoses [25]. While sex is an important and legitimate factor in most diagnostic contexts, the concern arises when models develop an overreliance on sex-derived features instead of learning direct markers of disease. In this study, we simulated scenarios where training data exhibited unrepresentative sex-specific prevalence ratios (PR > 1) and provided evidence that such data distributions promote shortcut learning. It is worth noting that this does not challenge the diagnostic relevance of sex in PAD, but rather highlights how classifiers may prioritize sex-related features over physiological disease markers, leading to biased and non-generalizable predictions. Such biases can compromise model robustness and fairness, particularly when sex-specific prevalence shifts between training and deployment populations. Future research should explore the effectiveness of explicitly incorporating sex as an auxiliary feature in a controlled and interpretable manner, rather than allowing models to infer sex implicitly in a way that may exacerbate shortcut learning.
Compared to other studies in the field of PAT, as outlined in a recently published review paper [26], our sample size lies within the 90th percentile regarding the number of subjects. Investigations on even larger, preferably multi-center, datasets should nevertheless be the subject of future work. Furthermore, while we demonstrate sex bias, other confounding factors (e.g., age, comorbidities) were not explicitly controlled for and may also influence model performance. In addition, this study focuses on PAD diagnosis; while the results are likely relevant to other applications, further research is needed to assess the impact of sex bias in different clinical settings. To improve the performance of the PAD classifier (AUROC: 0.79, 95 % CI: 0.60–0.94), we tested several optimization strategies, including gradual unfreezing of the EfficientNetV2 backbone; however, none yielded meaningful gains. This suggests that dataset limitations, rather than model architecture, constrain performance. While a larger dataset could enhance overall performance, it remains unclear whether a stronger model would still rely on sex-based shortcuts. Finally, although we identified sex bias in the models, we did not explore mitigation techniques such as a weighted loss, which could improve performance in settings with PR \(\ne \) 1 by reducing the reliance on sex-based shortcuts. We deliberately refrained from applying such techniques in this work in order to provide an unfiltered view of the risks posed by sex-based shortcuts. Future work should investigate methods for bias reduction, such as data augmentation or fairness-aware learning algorithms.
In conclusion, this study highlights the critical issue of sex bias in deep learning models for PAT-based PAD diagnosis, arising from models’ unintended and spurious reliance on sex-related features due to shortcut learning. Our findings underscore the importance of carefully designing training datasets, particularly with regard to imbalances in sex-specific PR, to prevent shortcut learning and ensure the development of fair and reliable artificial intelligence models for medical imaging.
Supplementary information
Additional details on the data and model training can be found in the supplementary material.
References
Zong Y, Yang Y, Hospedales TM (2023) MEDFAIR: benchmarking fairness for medical imaging. In: The eleventh international conference on learning representations: ICLR. https://doi.org/10.48550/arXiv.2210.01725 . https://openreview.net/forum?id=6ve2CkeQe5S
Brown A, Tomasev N, Freyberg J, Liu Y, Karthikesalingam A, Schrouff J (2023) Detecting shortcut learning for fair medical AI using shortcut testing. Nat Commun 14(1):4314. https://doi.org/10.1038/s41467-023-39902-7
Jones C, Roschewitz M, Glocker B (2023) The role of subgroup separability in group-fair medical image classification. In: Medical image computing and computer assisted intervention - MICCAI, pp 179–188. https://doi.org/10.1007/978-3-031-43898-1_18
Jiménez-Sànchez A, Juodelyte D, Chamberlain B, Cheplygina V (2023) Detecting shortcuts in medical images—a case study in chest X-Rays. In: 2023 IEEE 20th international symposium on biomedical imaging (ISBI). https://doi.org/10.1109/ISBI53787.2023.10230572
Glocker B, Jones C, Roschewitz M, Winzeck S (2023) Risk of bias in chest radiography deep learning foundation models. Radiol Artif Intell 5(6):230060. https://doi.org/10.1148/ryai.230060
Beard P (2011) Biomedical photoacoustic imaging. Interface Focus 1(4):602–631. https://doi.org/10.1098/rsfs.2011.0028
Li M, Tang Y, Yao J (2018) Photoacoustic tomography of blood oxygenation: a mini review. Photoacoustics 10:65–73. https://doi.org/10.1016/j.pacs.2018.05.001
Wiacek A, Wang KC, Wu H, Bell MAL (2021) Photoacoustic-guided laparoscopic and open hysterectomy procedures demonstrated with human cadavers. IEEE Trans Med Imaging 40(12):3279–3292. https://doi.org/10.1109/TMI.2021.3082555
Lediju Bell MA, Shubert J (2018) Photoacoustic-based visual servoing of a needle tip. Sci Rep 8(1):15519. https://doi.org/10.1038/s41598-018-33931-9
Iskander-Rizk S, Steen AFW, Soest G (2019) Photoacoustic imaging for guidance of interventions in cardiovascular medicine. Phys Med Biol 64(16):16. https://doi.org/10.1088/1361-6560/ab1ede
Lediju Bell MA (2020) Photoacoustic imaging for surgical guidance: principles, applications, and outlook. J Appl Phys 128(6):060904. https://doi.org/10.1063/5.0018190
Gandhi N, Allard M, Kim S, Kazanzides P, Bell MAL (2017) Photoacoustic-based approach to surgical guidance performed with and without a da Vinci robot. J Biomed Opt 22(12):121606. https://doi.org/10.1117/1.JBO.22.12.121606
Gröhl J, Schellenberg M, Dreher K, Maier-Hein L (2021) Deep learning for biomedical photoacoustic imaging: a review. Photoacoustics 22:100241. https://doi.org/10.1016/j.pacs.2021.100241
Wagner AL, Danko V, Federle A, Klett D, Simon D, Heiss R, Jüngert J, Uder M, Schett G, Neurath MF, Woelfle J, Waldner MJ, Trollmann R, Regensburger AP, Knieling F (2021) Precision of handheld multispectral optoacoustic tomography for muscle imaging. Photoacoustics 21:100220. https://doi.org/10.1016/j.pacs.2020.100220
Else TR, Hacker L, Gröhl J, Bunce EV, Tao R, Bohndiek SE (2023) Effects of skin tone on photoacoustic imaging and oximetry. J Biomed Opt 29(S1):11506. https://doi.org/10.1117/1.JBO.29.S1.S11506
Karsten AE, Singh A, Karsten PA, Braun MWH (2013) Diffuse reflectance spectroscopy as a tool to measure the absorption coefficient in skin: South African skin phototypes. Photochem Photobiol 89(1):227–233. https://doi.org/10.1111/j.1751-1097.2012.01220.x
Larrazabal AJ, Nieto N, Peterson V, Milone DH, Ferrante E (2020) Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc Natl Acad Sci 117(23):12592–12594. https://doi.org/10.1073/pnas.1919012117
Schramm K, Rochon PJ (2018) Gender differences in peripheral vascular disease. Semin Interv Radiol 35(1):9. https://doi.org/10.1055/s-0038-1636515
Caranovic M, Kempf J, Li Y, Regensburger AP, Günther JS, Träger AP, Lang W, Meyer A, Wagner AL, Woelfle J, Raming R, Paulus L-P, Buehler A, Uter W, Uder M, Behrendt C-A, Neurath MF, Waldner MJ, Knieling F, Rother U (2023) Derivation and validation of a non-invasive optoacoustic imaging biomarker for patients with intermittent claudication. medRxiv. https://doi.org/10.1101/2023.10.19.23297246
Tan M, Le Q (2021) EfficientNetV2: smaller models and faster training. In: Proceedings of the 38th international conference on machine learning, pp 10096–10106. https://doi.org/10.48550/arXiv.2104.00298. https://proceedings.mlr.press/v139/tan21a.html
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, pp 248–255. https://doi.org/10.1109/CVPR.2009.5206848. ISSN: 1063-6919
Maier-Hein L, Reinke A, Godau P, Tizabi MD, Buettner F, Christodoulou E, Glocker B, Isensee F, Kleesiek J, Kozubek M, Reyes M, Riegler MA, Wiesenfarth M, Kavur AE, Sudre CH, Baumgartner M, Eisenmann M, Heckmann-Nötzel D, Rädsch T, Acion L, Antonelli M, Arbel T, Bakas S, Benis A, Blaschko MB, Cardoso MJ, Cheplygina V, Cimini BA, Collins GS, Farahani K, Ferrer L, Galdran A, Ginneken B, Haase R, Hashimoto DA, Hoffman MM, Huisman M, Jannin P, Kahn CE, Kainmueller D, Kainz B, Karargyris A, Karthikesalingam A, Kofler F, Kopp-Schneider A, Kreshuk A, Kurc T, Landman BA, Litjens G, Madani A, Maier-Hein K, Martel AL, Mattson P, Meijering E, Menze B, Moons KGM, Müller H, Nichyporuk B, Nickel F, Petersen J, Rajpoot N, Rieke N, Saez-Rodriguez J, Sànchez CI, Shetty S, Smeden M, Summers RM, Taha AA, Tiulpin A, Tsaftaris SA, Van Calster B, Varoquaux G, Jäger PF (2024) Metrics reloaded: recommendations for image analysis validation. Nature Methods 21(2):195–212. https://doi.org/10.1038/s41592-023-02151-z
Christodoulou E, Reinke A, Houhou R, Kalinowski P, Erkan S, Sudre CH, Burgos N, Boutaj S, Loizillon S, Solal M, Rieke N, Cheplygina V, Antonelli M, Mayer LD, Tizabi MD, Cardoso MJ, Simpson A, Jäger PF, Kopp-Schneider A, Varoquaux G, Colliot O, Maier-Hein L (2024) Confidence intervals uncovered: are we ready for real-world medical imaging AI? In: Linguraru MG, Dou Q, Feragen A, Giannarou S, Glocker B, Lekadir K, Schnabel JA (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. Springer, Cham, pp 124–132. https://doi.org/10.1007/978-3-031-72117-5_12
Yang Y, Liu Y, Liu X, Gulhane A, Mastrodicasa D, Wu W, Wang EJ, Sahani DW, Patel S (2024) Demographic bias of expert-level vision-language foundation models in medical imaging. arXiv. https://doi.org/10.48550/arXiv.2402.14815
Clayton JA (2015) Studying both sexes: a guiding principle for biomedicine. FASEB J 30(2):519. https://doi.org/10.1096/fj.15-279554
Park J, Choi S, Knieling F, Clingman B, Bohndiek S, Wang LV, Kim C (2024) Clinical translation of photoacoustic imaging. Nat Rev Bioeng 2024:1–20. https://doi.org/10.1038/s44222-024-00240-y
Funding
Open Access funding enabled and organized by Projekt DEAL. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 101002198).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
F.K. is a co-inventor together with iThera Medical GmbH, Munich, Germany, on an EU patent application (EP 19 163 304.9). F.K. and U.R. are members of the advisory board of iThera Medical GmbH, Munich, Germany. F.K. received travel support from iThera Medical GmbH, Munich, Germany.
Ethical approval
The studies performed for data collection were approved by the ethics committee of the FAU Erlangen-Nuremberg and performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments. Informed consent was obtained from all individual participants.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Ulrich Rother, Alexander Seitel, Lena Maier-Hein, and Kris K. Dreher shared equal leadership in this work.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Knopp, M., Bender, C.J., Holzwarth, N. et al. Shortcut learning leads to sex bias in deep learning models for photoacoustic tomography. Int J CARS 20, 1325–1333 (2025). https://doi.org/10.1007/s11548-025-03370-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11548-025-03370-9