

A data-efficient strategy for building high-performing medical foundation models

Abstract

Foundation models are pretrained on massive datasets. However, collecting medical datasets is expensive and time-consuming, and raises privacy concerns. Here we show that synthetic data generated via conditioning with disease labels can be leveraged to build high-performing medical foundation models. We pretrained a retinal foundation model, first with approximately one million synthetic retinal images whose physiological structures and feature distributions are consistent with those of real images, and then with only 16.7% of the 904,170 real-world colour fundus photography images required by a recently reported retinal foundation model (RETFound). The data-efficient model performed as well as or better than RETFound across nine public datasets and four diagnostic tasks, and for diabetic-retinopathy grading it used only 40% of the expert-annotated training data used by RETFound. We also demonstrate the generalizability of the data-efficient strategy by building a classifier for the detection of tuberculosis on chest X-ray images. The text-conditioned generation of synthetic data may enhance the performance and generalization of medical foundation models.
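The code availability statement below notes that image generation was built on Stable Diffusion as implemented in the diffusers library. The following is a minimal sketch of what disease-label-conditioned synthesis of fundus images could look like with that library; the checkpoint path and prompt wording are hypothetical stand-ins, not the authors' pipeline.

```python
# Sketch: text-conditioned synthesis of labelled retinal images with diffusers.
# "path/to/retinal-stable-diffusion" is a hypothetical checkpoint fine-tuned on
# colour fundus photographs; the prompts embed the disease label as text.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/retinal-stable-diffusion",  # assumption: locally fine-tuned weights
    torch_dtype=torch.float16,
).to("cuda")

labels = ["no diabetic retinopathy", "moderate diabetic retinopathy", "glaucoma"]
for label in labels:
    images = pipe(
        prompt=f"a colour fundus photograph, {label}",  # label-conditioned prompt
        num_images_per_prompt=4,
        guidance_scale=7.5,        # classifier-free guidance strength
        num_inference_steps=50,
    ).images                       # list of PIL images
    for i, img in enumerate(images):
        img.save(f"synthetic_{label.replace(' ', '_')}_{i}.png")
```

Conditioning on the disease label in the prompt is what lets the synthetic dataset cover a controlled distribution of pathologies rather than reproducing the imbalance of the real data.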


Fig. 1: Overview of data-efficient strategy for building medical foundation models with generative AI.
Fig. 2: Schematic of RETFound-DE and consistency between real and generated retinal images in image and feature spaces.
Fig. 3: Performance in downstream tasks.
Fig. 4: Labelling and fine-tuning time efficiency.
Fig. 5: Cross-centre external evaluation for models pretrained with and without synthetic data.
Fig. 6: Data and pretraining time efficiency.


Data availability

The main data supporting the results in this study are available within the paper and its Supplementary Information. Data for pretraining can be accessed through the following weblinks: AIROGS (https://airogs.grand-challenge.org/data-and-challenge), Kaggle EyePACS (https://www.kaggle.com/c/diabetic-retinopathy-detection), DDR (https://github.com/nkicsl/DDR-dataset), ODIR-2019 (https://odir2019.grand-challenge.org). Data for fine-tuning can be accessed through the following weblinks: IDRID (https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid), MESSIDOR-2 (https://www.adcis.net/en/third-party/messidor2), APTOS-2019 (https://www.kaggle.com/competitions/aptos2019-blindness-detection/data), PAPILA (https://figshare.com/articles/dataset/PAPILA/14798004/1), Glaucoma Fundus (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1YRRAC), ORIGA (https://www.kaggle.com/datasets/arnavjain1/glaucoma-datasets), AREDS (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000001.v3.p1), JSIEC (https://zenodo.org/record/3477553), Retina (https://www.kaggle.com/datasets/jr2ngb/cataractdataset), REFUGE (https://ieee-dataport.org/documents/refuge-retinal-fundus-glaucoma-challenge), RIM-ONE-DL (https://github.com/miag-ull/rim-one-dl?tab=readme-ov-file), CheXpert (https://stanfordmlgroup.github.io/competitions/chexpert/), Shenzhen Hospital CXR Set (https://data.lhncbc.nlm.nih.gov/public/Tuberculosis-Chest-X-ray-Datasets/Shenzhen-Hospital-CXR-Set/index.html), TB Chest X-ray database (https://www.kaggle.com/datasets/tawsifurrahman/tuberculosis-tb-chest-xray-dataset).

Code availability

The code of RETFound-DE is available at https://github.com/Jonlysun/DERETFound (ref. 64), and an online interactive platform is available at http://fdudml.cn:12001. We used Stable Diffusion, as implemented in diffusers (https://github.com/huggingface/diffusers), for the backbone and image generation. The heat maps were generated with Grad-CAM (https://github.com/jacobgil/pytorch-grad-cam) and the t-SNE visualizations were generated with tsne-pytorch (https://github.com/mxl1990/tsne-pytorch).
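As a concrete illustration of the heat-map step, a minimal Grad-CAM sketch with pytorch-grad-cam follows. A torchvision ResNet-50 and a random input image stand in for the actual fine-tuned model and data; applying Grad-CAM to a ViT backbone such as RETFound-DE additionally requires a reshape transform for the token grid (see the pytorch-grad-cam documentation).

```python
import numpy as np
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

model = resnet50(weights="IMAGENET1K_V2").eval()
target_layers = [model.layer4[-1]]  # last convolutional block

# Placeholder for a preprocessed fundus image scaled to [0, 1].
rgb_img = np.random.rand(224, 224, 3).astype(np.float32)
input_tensor = torch.from_numpy(rgb_img).permute(2, 0, 1).unsqueeze(0)

cam = GradCAM(model=model, target_layers=target_layers)
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(0)])[0]  # CAM for class index 0
overlay = show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)  # uint8 overlay
```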

References

  1. Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).

  2. Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).

  3. Huang, Z. et al. A visual-language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).

  4. Zhang, X. et al. Knowledge-enhanced visual-language pre-training on chest radiology images. Nat. Commun. 14, 4542 (2023).

  5. Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).

  6. Krishnan, R., Rajpurkar, P. & Topol, E. J. Self-supervised learning in medicine and healthcare. Nat. Biomed. Eng. 6, 1346–1352 (2022).

  7. Mitchell, M., Jain, R. & Langer, R. Engineering and physical sciences in oncology: challenges and opportunities. Nat. Rev. Cancer 17, 659–675 (2017).

  8. Villoslada, P., Baeza-Yates, R. & Masdeu, J. C. Reclassifying neurodegenerative diseases. Nat. Biomed. Eng. 4, 759–760 (2020).

  9. Rajpurkar, P. et al. AI in health and medicine. Nat. Med. 28, 31–38 (2022).

  10. Ribaric, S., Ariyaeeinia, A. & Pavesic, N. De-identification for privacy protection in multimedia content: a survey. Signal Process. Image Commun. 47, 131–151 (2016).

  11. Chang, Q. et al. Mining multi-center heterogeneous medical data with distributed synthetic learning. Nat. Commun. 14, 5510 (2023).

  12. Bond-Taylor, S., Leach, A., Long, Y. & Willcocks, C. G. Deep generative modelling: a comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7327–7347 (2021).

  13. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).

  14. Goodfellow, I. et al. Generative adversarial networks. Commun. ACM 63, 139–144 (2020).

  15. Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).

  16. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10684–10695 (IEEE, 2022).

  17. Kazerouni, A. et al. Diffusion models in medical imaging: a comprehensive survey. Med. Image Anal. 88, 102846 (2023).

  18. Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324–333 (2021).

  19. Shin, J. E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 2403 (2021).

  20. Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).

  21. Schmitt, L. T. et al. Prediction of designer-recombinases for DNA editing with generative deep learning. Nat. Commun. 13, 7966 (2022).

  22. Godinez, W. J. et al. Design of potent antimalarials with generative chemistry. Nat. Mach. Intell. 4, 180–186 (2022).

  23. Huang, X. et al. The landscape of mRNA nanomedicine. Nat. Med. 28, 2273–2287 (2022).

  24. Chen, Z. et al. A deep generative model for molecule optimization via one fragment modification. Nat. Mach. Intell. 3, 1040–1049 (2021).

  25. Zhong, W., Yang, Z. & Chen, C. Y. C. Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing. Nat. Commun. 14, 3009 (2023).

  26. Das, P. et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng. 5, 613–623 (2021).

  27. Kanakasabapathy, M. K. et al. Adaptive adversarial neural networks for the analysis of lossy and domain-shifted datasets of medical images. Nat. Biomed. Eng. 5, 571–585 (2021).

  28. Ozyoruk, K. B. et al. A deep-learning model for transforming the style of tissue images from cryosectioned to formalin-fixed and paraffin-embedded. Nat. Biomed. Eng. 6, 1407–1419 (2022).

  29. DeGrave, A. J. et al. Auditing the inference processes of medical-image classifiers by leveraging generative AI and the expertise of physicians. Nat. Biomed. Eng. https://doi.org/10.1038/s41551-023-01160-9 (2023).

  30. Cao, R. et al. Label-free intraoperative histology of bone tissue via deep-learning-assisted ultraviolet photoacoustic microscopy. Nat. Biomed. Eng. 7, 124–134 (2023).

  31. Nichol, A. et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. Preprint at https://arxiv.org/abs/2112.10741 (2021).

  32. Ramesh, A. et al. Zero-shot text-to-image generation. In Proc. International Conference on Machine Learning 8821–8831 (PMLR, 2021).

  33. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. International Conference on Machine Learning 8748–8763 (PMLR, 2021).

  34. Kather, J. N. et al. Medical domain knowledge in domain-agnostic generative AI. npj Digit. Med. 5, 90 (2022).

  35. Burlina, P. M. et al. Assessment of deep generative models for high-resolution synthetic retinal image generation of age-related macular degeneration. JAMA Ophthalmol. 137, 258–264 (2019).

  36. Yoon, J. et al. EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records. npj Digit. Med. 6, 141 (2023).

  37. Trabucco, B., Doherty, K., Gurinas, M. & Salakhutdinov, R. Effective data augmentation with diffusion models. In Proc. International Conference on Learning Representations (ICLR, 2024).

  38. Zhang, A. et al. Shifting machine learning for healthcare from development to deployment and from models to data. Nat. Biomed. Eng. 6, 1330–1345 (2022).

  39. Chen, R. J. et al. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).

  40. DuMont Schütte, A. et al. Overcoming barriers to data sharing with medical image generation: a comprehensive evaluation. npj Digit. Med. 4, 141 (2021).

  41. World Report on Vision (World Health Organization, 2019).

  42. Cen, L. P. et al. Automatic detection of 39 fundus diseases and conditions in retinal photographs using deep neural networks. Nat. Commun. 12, 4828 (2021).

  43. Alimanov, A. & Islam, M. B. Denoising diffusion probabilistic model for retinal image generation and segmentation. In Proc. IEEE International Conference on Computational Photography 1–12 (IEEE, 2023).

  44. Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In Proc. International Conference on Learning Representations (ICLR, 2021).

  45. He, K. et al. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16000–16009 (IEEE, 2022).

  46. Karthik, M. & Sohier, D. APTOS 2019 Blindness Detection (Kaggle, 2019).

  47. Porwal, P. et al. IDRiD: diabetic retinopathy–segmentation and grading challenge. Med. Image Anal. 59, 101561 (2020).

  48. Decencière, E. et al. Feedback on a publicly distributed image database: the Messidor database. Image Anal. Stereol. 33, 231–234 (2014).

  49. Kovalyk, O. et al. PAPILA: dataset with fundus images and clinical data of both eyes of the same patient for glaucoma assessment. Sci. Data 9, 291 (2022).

  50. Zhang, Z. et al. ORIGA-light: an online retinal fundus image database for glaucoma analysis and research. In Proc. Annual International Conference of the IEEE Engineering in Medicine and Biology 3065–3068 (IEEE, 2010).

  51. Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proc. AAAI Conference on Artificial Intelligence Vol. 33, 590–597 (AAAI, 2019).

  52. Jaeger, S. et al. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Med. Surg. 4, 475–477 (2014).

  53. Rahman, T. et al. Reliable tuberculosis detection using chest X-ray with deep learning, segmentation and visualization. IEEE Access 8, 191586–191601 (2020).

  54. Peng, W., Adeli, E., Zhao, Q. & Pohl, K. M. in Medical Image Computing and Computer Assisted Intervention 14–24 (MICCAI, 2023).

  55. Eschweiler, D. et al. Denoising diffusion probabilistic models for generation of realistic fully-annotated microscopy image datasets. PLoS Comput. Biol. 20, e1011890 (2024).

  56. Ktena, I. et al. Generative models improve fairness of medical classifiers under distribution shifts. Nat. Med. 30, 1166–1173 (2024).

  57. Bachmann, R., Mizrahi, D., Atanov, A. & Zamir, A. MultiMAE: multi-modal multi-task masked autoencoders. In Proc. European Conference on Computer Vision 348–367 (Springer, 2022).

  58. Shumailov, I. et al. AI models collapse when trained on recursively generated data. Nature 631, 755–759 (2024).

  59. Yang, Y. et al. The limits of fair medical imaging AI in real-world generalization. Nat. Med. 30, 2838–2848 (2024).

  60. de Vente, C. et al. AIROGS: artificial intelligence for robust glaucoma screening challenge. IEEE Trans. Med. Imaging 43, 542–557 (2024).

  61. van den Oord, A., Vinyals, O. & Kavukcuoglu, K. Neural discrete representation learning. In Proc. 31st Conference on Neural Information Processing Systems 6309–6318 (NIPS, 2017).

  62. Song, Y. et al. Score-based generative modeling through stochastic differential equations. In Proc. International Conference on Learning Representations (ICLR, 2021).

  63. Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. IEEE International Conference on Computer Vision 618–626 (IEEE, 2017).

  64. Yuqi. Controllable generative model enables high data efficiency for building medical foundation model. GitHub https://github.com/Jonlysun/DERETFound (2024).


Acknowledgements

We acknowledge support for this work provided by the National Natural Science Foundation of China (grant numbers U2001209, 62372117 and 62472102) and the Natural Science Foundation of Shanghai (grant number 21ZR1406600). The computations in this research were performed using the CFFF platform of Fudan University. The AREDS dataset used for the analyses described in this paper was obtained from the Age-Related Eye Disease Study (AREDS) Database found at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000001.v3.p1 through dbGaP accession number phs000001.v3.p1. Funding support for AREDS was provided by the National Eye Institute (N01-EY-0-2127). We thank the AREDS participants and the AREDS Research Group for their valuable contribution to this research.

Author information


Contributions

B.Y. and W.T. supervised the research. B.Y. conceived the technique. Y.S. implemented the algorithm. Y.S. and W.T. designed the validation experiments. Y.S. trained the network and performed the validation experiments. Y.S. and W.T. analysed the validation results. Z.G. verified the code. R.H. provided technical support on the implementation of the web page. Z.G., M.P. and S.C. collected the public datasets. Y.S., W.T. and B.Y. wrote the paper.

Corresponding author

Correspondence to Bo Yan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Pearse Keane and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance (AUPR) in downstream tasks.

a, Internal evaluation. We fine-tuned the pretrained models on nine public datasets across four downstream tasks: diabetic retinopathy grading, glaucoma diagnosis, age-related macular degeneration (AMD) grading and multi-disease classification. Compared to RETFound, RETFound-DE achieves superior performance on six datasets (P < 0.05) and comparable performance on the other three datasets (P > 0.05). b, External evaluation. Models are fine-tuned on one diabetic retinopathy grading dataset and evaluated on the others. RETFound-DE outperforms RETFound when fine-tuned on APTOS-2019 and evaluated on IDRID, or when fine-tuned on IDRID and evaluated on MESSIDOR-2. Each bar shows the mean AUPR, and the error bars show 95% confidence intervals. P values were calculated with two-sided t-tests and are listed in the figure.
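For reference, the two quantities reported in this caption can be computed as in the minimal sketch below, assuming scikit-learn and SciPy; all arrays here are hypothetical placeholders, not the paper's results.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import average_precision_score

# AUPR (area under the precision-recall curve) via average precision.
y_true = np.array([0, 1, 1, 0, 1])              # ground-truth binary labels
y_score = np.array([0.2, 0.9, 0.6, 0.3, 0.8])   # predicted probabilities
aupr = average_precision_score(y_true, y_score)

# Two-sided t-test comparing per-run AUPR of two models (hypothetical values).
aupr_retfound_de = np.array([0.83, 0.82, 0.84, 0.82, 0.83])
aupr_retfound = np.array([0.80, 0.79, 0.81, 0.78, 0.80])
t_stat, p_value = stats.ttest_ind(aupr_retfound_de, aupr_retfound)  # two-sided by default
print(f"AUPR = {aupr:.3f}, P = {p_value:.4f}")
```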

Extended Data Fig. 2 Confusion matrices comparison in downstream tasks.

The confusion matrices compare the classes predicted by the models with the actual labels, with each element of the matrix representing the prediction distribution for a specific class. RETFound-DE's diagonal prediction scores are generally higher than those of SSL-ImageNet, and often higher than those of RETFound, indicating that it classifies each class more accurately in every task and has a lower misclassification rate.

Extended Data Fig. 3 Label and training time efficiency (AUPR).

a, Label efficiency. Label efficiency refers to the amount of training data and labels required for a deep-learning network to achieve a target performance. RETFound-DE and RETFound show better label efficiency than SSL-ImageNet, and RETFound-DE even outperforms RETFound on the Retina dataset. b, Training time efficiency. Training time efficiency refers to the fine-tuning time required for a foundation model to achieve a target performance when adapted to downstream datasets. RETFound-DE achieves the same performance as RETFound in less training time on several datasets, such as IDRID and Retina. The dashed grey lines highlight the significant difference between RETFound and RETFound-DE.

Extended Data Fig. 4 Performance and pretraining time comparisons of SSL-ImageNet-Retinal and RETFound-DE (AUPR).

a, The effect of synthetic data when different amounts of real data are used for pretraining. The efficacy of SSL-ImageNet-Retinal progressively improves on four downstream tasks as the quantity of real retinal images used for pretraining increases. By pretraining on generated images, RETFound-DE shows a significant performance improvement over SSL-ImageNet-Retinal. On the IDRID, MESSIDOR-2 and Retina datasets, RETFound-DE outperforms RETFound when pretrained on only 40k real retinal images. b, The performance of RETFound-DE and SSL-ImageNet-Retinal (150k) over a matched computational period from 5 to 6 8-A100 days. We use one 8-A100 day as a unit to denote the pretraining time of using eight NVIDIA A100 GPUs for one day. For both models, the pretraining dataset in this period is the 150k real retinal image dataset. RETFound-DE consistently outperforms SSL-ImageNet-Retinal (150k) on all four downstream datasets within the same pretraining time.
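Self-supervised pretraining in this line of work follows the masked-autoencoder recipe of He et al. (ref. 45), in which the encoder sees only a small random subset of image patches and the decoder reconstructs the rest. The following is a minimal sketch of that random masking step under common ViT-base assumptions (224×224 images, 16×16 patches, 75% masking ratio); it is an illustration, not the authors' training code.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim). Returns visible patches and a binary mask."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=patches.device)  # per-patch random score
    ids_shuffle = torch.argsort(noise, dim=1)        # patches with low scores are kept
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=patches.device)   # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask

tokens = torch.randn(8, 196, 768)       # 8 images, 196 patch tokens of dim 768
visible, mask = random_masking(tokens)  # visible: (8, 49, 768); the rest is reconstructed
```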

Extended Data Fig. 5 Internal and external evaluation on Chest X-ray images.

a, Internal evaluation. CXRFM and CXRFM-DE perform comparably on the ShenzhenCXR dataset and significantly outperform SSL-ImageNet. On the TBChest dataset, all three models achieve similar performance, with an AUROC exceeding 0.99. b, External evaluation. When fine-tuned on TBChest and evaluated on ShenzhenCXR, CXRFM and CXRFM-DE perform similarly, both substantially surpassing SSL-ImageNet. When fine-tuned on ShenzhenCXR and evaluated on TBChest, CXRFM outperforms CXRFM-DE. c, CXRFM-DE (denoted 'With synthetic data') significantly outperforms SSL-ImageNet-Chest (20k) (denoted 'Without synthetic data') under both conditions, demonstrating the improvement in generalization brought about by synthetic data.

Extended Data Fig. 6 Feature distribution of real and synthetic datasets in terms of age and gender.

We use histograms and cumulative distribution functions to illustrate the feature distributions of the real and synthetic data. a and b show the feature distributions of real and synthetic data for the 0 < age < 60 and age > 60 groups, and c and d show the feature distributions for the female and male groups. The results show the consistency of the feature distributions between the real and synthetic datasets in terms of age and gender.
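A minimal sketch of this histogram-plus-CDF comparison follows, assuming NumPy and matplotlib; the two feature arrays are random placeholders standing in for features extracted from real and synthetic images.

```python
import numpy as np
import matplotlib.pyplot as plt

real = np.random.normal(0.0, 1.0, 2000)        # placeholder: features of real images
synthetic = np.random.normal(0.05, 1.0, 2000)  # placeholder: features of synthetic images

fig, (ax_hist, ax_cdf) = plt.subplots(1, 2, figsize=(9, 3.5))
bins = np.linspace(-4, 4, 50)
ax_hist.hist(real, bins=bins, alpha=0.5, density=True, label="real")
ax_hist.hist(synthetic, bins=bins, alpha=0.5, density=True, label="synthetic")
ax_hist.set(title="Histogram", xlabel="feature value", ylabel="density")
ax_hist.legend()

# Empirical CDF: sorted values against cumulative fractions.
for data, label in [(real, "real"), (synthetic, "synthetic")]:
    x = np.sort(data)
    ax_cdf.plot(x, np.arange(1, x.size + 1) / x.size, label=label)
ax_cdf.set(title="Cumulative distribution", xlabel="feature value", ylabel="fraction")
ax_cdf.legend()
fig.tight_layout()
fig.savefig("feature_distributions.png", dpi=200)
```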

Supplementary information

Supplementary Information

Supplementary notes, figures and captions for the supplementary tables.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–9.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sun, Y., Tan, W., Gu, Z. et al. A data-efficient strategy for building high-performing medical foundation models. Nat. Biomed. Eng. 9, 539–551 (2025). https://doi.org/10.1038/s41551-025-01365-0


