Abstract
General-purpose artificial intelligence models face unique challenges in clinical practice when applied to diverse modalities and complex clinical tasks. Here we present MedMPT, a versatile, clinically oriented pretrained model tailored for respiratory healthcare, trained on 154,274 pairs of chest computed-tomography scans and radiology reports. MedMPT uses self-supervised learning to acquire medical knowledge, and it can handle multimodal clinical data and support a range of clinical tasks aligned with clinical workflows. We evaluate the performance of MedMPT on a broad spectrum of chest-related pathological conditions, involving common medical modalities such as computed-tomography images, radiology reports, laboratory tests and drug-relationship graphs. MedMPT consistently outperforms state-of-the-art multimodal pretrained models in the medical domain, achieving significant improvements across diverse clinical tasks. Extensive analysis indicates that MedMPT effectively harnesses the potential of medical data, showing both data and parameter efficiency and offering explainable insights for decision-making. MedMPT highlights the potential of multimodal pretrained models for general-purpose artificial intelligence in clinical practice.
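For orientation, the sketch below illustrates the kind of image–text contrastive objective that underlies vision–language pretraining of this sort (CLIP-style, refs. 22 and 58). The function name, tensor shapes and temperature value are illustrative assumptions, not the MedMPT implementation (the released code is at ref. 68).

```python
# Minimal sketch of a CLIP-style image-text contrastive objective (refs. 22, 58).
# Hypothetical names and shapes; NOT the authors' MedMPT implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(scan_emb: torch.Tensor, report_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired scan/report embeddings.

    scan_emb, report_emb: (batch, dim) outputs of the vision and text
    encoders; matched pairs share the same row index.
    """
    scan_emb = F.normalize(scan_emb, dim=-1)
    report_emb = F.normalize(report_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = scan_emb @ report_emb.t() / temperature
    targets = torch.arange(scan_emb.size(0), device=scan_emb.device)
    # Each scan should match its own report and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```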
Data availability
Data from NLST are available at https://cdas.cancer.gov/datasets/nlst/. Data from MosMedData are available at https://github.com/michaelwfry/MosMedData-Chest-CT-Scans-with-COVID-19-Related-Findings/. The remaining data collected from The First Affiliated Hospital of Guangzhou Medical University are not publicly available due to privacy requirements. A de-identified validation set can be made available for research purposes upon reasonable request to the corresponding authors.
Code availability
The code for pretraining and fine-tuning of MedMPT, along with the pretrained model weights, can be found on GitHub at https://github.com/maliangdi/MedMPT (ref. 68).
References
Agusti, A., Vogelmeier, C. F. & Halpin, D. M. G. Tackling the global burden of lung disease through prevention and early diagnosis. Lancet Respir. Med. 10, 1013–1015 (2022).
Ardila, D. et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 25, 954–961 (2019).
Chen, Z., Song, Y., Chang, T.-H. & Wan, X. Generating radiology reports via memory-driven transformer. In Proc. Conference on Empirical Methods in Natural Language Processing (eds Webber, B. et al.) 1439–1449 (ACL, 2020); https://doi.org/10.18653/v1/2020.emnlp-main.112
OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Kirillov, A. et al. Segment anything. In Proc. IEEE/CVF International Conference on Computer Vision 4015–4026 (IEEE, 2023).
Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).
Azizi, S. et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat. Biomed. Eng. 7, 756–779 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Ma, J. et al. Segment anything in medical images. Nat. Commun. 15, 654 (2024).
Lei, W., Wei, X., Zhang, X., Li, K. & Zhang, S. MedLSAM: localize and segment anything model for 3D medical images. Med. Image Anal. 99, 103370 (2025).
Zhang, X., Wu, C., Zhang, Y., Xie, W. & Wang, Y. Knowledge-enhanced visual-language pre-training on chest radiology images. Nat. Commun. 14, 4542 (2023).
Zhang, S. et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI 2, AIoa2400640 (2025).
Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).
Topol, E. J. As artificial intelligence goes multimodal, medical applications multiply. Science 381, adk6139 (2023).
National Academies of Sciences, Engineering, and Medicine. Improving Diagnosis in Health Care (The National Academies Press, 2015).
Azizi, S. et al. Big self-supervised models advance medical image classification. In Proc. IEEE/CVF International Conference on Computer Vision 3458–3468 (IEEE, 2021); https://doi.org/10.1109/ICCV48922.2021.00346
Hosseinzadeh Taher, M. R., Haghighi, F., Gotway, M. B. & Liang, J. CAiD: context-aware instance discrimination for self-supervised learning in medical imaging. Proc. Mach. Learn. Res. 172, 535–551 (2022).
Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 6000–6010 (Curran Associates, 2017).
The National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med. 365, 395–409 (2011).
Morozov, S. P. et al. MosMedData: chest CT scans with COVID-19 related findings dataset. Preprint at https://arxiv.org/abs/2005.06465 (2020).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).
Eslami, S., Meinel, C. & De Melo, G. PubMedCLIP: how much does CLIP benefit visual question answering in the medical domain? In Proc. Findings of the Association for Computational Linguistics (eds Vlachos, A. & Augenstein, I.) 1151–1163 (ACL, 2023).
Moor, M. et al. Med-Flamingo: a multimodal medical few-shot learner. In Proc. 3rd Machine Learning for Health Symposium (eds Hegselmann, S. et al.) 353–367 (PMLR, 2023).
Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In Proc. 37th Conference on Neural Information Processing Systems 28541–28564 (Curran Associates, 2023).
Dosovitskiy, A. et al. An image is worth 16×16 words: transformers for image recognition at scale. In Proc. International Conference on Learning Representations (OpenReview.net, 2021).
Deng, J. et al. ImageNet: a large-scale hierarchical image database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics (eds Isabelle, P. et al.) 311–318 (ACL, 2002); https://doi.org/10.3115/1073083.1073135
Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Proc. Text Summarization Branches Out 74–81 (ACL, 2004).
Banerjee, S. & Lavie, A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (eds Goldstein, J. et al.) 65–72 (ACL, 2005).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds Burstein, J. et al.) 4171–4186 (ACL, 2019); https://doi.org/10.18653/v1/N19-1423
Fedus, W., Zoph, B. & Shazeer, N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 23, 1–39 (2022).
Houlsby, N. et al. Parameter-efficient transfer learning for NLP. In Proc. 36th International Conference on Machine Learning (eds Chaudhuri, K. & Salakhutdinov, R.) 2790–2799 (PMLR, 2019).
Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409 (2022).
Yang, X. et al. A large language model for electronic health records. npj Digit. Med. 5, 194 (2022).
Rasmy, L., Xiang, Y., Xie, Z., Tao, C. & Zhi, D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digit. Med. 4, 86 (2021).
Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Liu, S. et al. Multimodal data matters: language model pre-training over structured and unstructured electronic health records. IEEE J. Biomed. Health Inform. 27, 504–514 (2023).
Peiris, H., Hayat, M., Chen, Z., Egan, G. & Harandi, M. Uncertainty-guided dual-views for semi-supervised volumetric medical image segmentation. Nat. Mach. Intell. 5, 724–738 (2023).
Zhou, L. et al. Self pre-training with masked autoencoders for medical image classification and segmentation. In Proc. IEEE 20th International Symposium on Biomedical Imaging 1–6 (IEEE, 2023).
Hu, X., Xu, X. & Shi, Y. How to efficiently adapt large segmentation model (SAM) to medical images. Preprint at https://arxiv.org/abs/2306.13731 (2023).
Qiu, Z., Hu, Y., Li, H. & Liu, J. Learnable ophthalmology SAM. Preprint at https://arxiv.org/abs/2304.13425 (2023).
Cao, H. et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Proc. Computer Vision–ECCV 2022 Workshops (eds Karlinsky, L. et al.) 205–218 (Springer, 2023).
Schäfer, R. et al. Overcoming data scarcity in biomedical imaging with a foundational multi-task model. Nat. Comput. Sci. 4, 495–509 (2024).
Pai, S. et al. Foundation model for cancer imaging biomarkers. Nat. Mach. Intell. 6, 354–367 (2024).
Tu, T. et al. Towards generalist biomedical AI. NEJM AI 1, AIoa2300138 (2024).
Zhou, H.-Y. et al. Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nat. Mach. Intell. 4, 32–40 (2022).
Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual–language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).
Zhang, K. et al. A generalist vision–language foundation model for diverse biomedical tasks. Nat. Med. https://doi.org/10.1038/s41591-024-03185-2 (2024).
Zhou, H.-Y., Adithan, S., Acosta, J. N., Topol, E. J. & Rajpurkar, P. MedVersa: a generalist foundation model for medical image interpretation. Preprint at https://arxiv.org/abs/2405.07988 (2025).
Yang, J. et al. Poisoning medical knowledge using large language models. Nat. Mach. Intell. 6, 1156–1168 (2024).
Jin, C. et al. Development and evaluation of an artificial intelligence system for COVID-19 diagnosis. Nat. Commun. 11, 5088 (2020).
Chen, X., Fan, H., Girshick, R. & He, K. Improved baselines with momentum contrastive learning. Preprint at https://arxiv.org/abs/2003.04297 (2020).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proc. 37th International Conference on Machine Learning (eds Daumé III, H. & Singh, A.) 1597–1607 (PMLR, 2020).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 9726–9735 (IEEE, 2020).
van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2019).
He, K. et al. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 15979–15988 (IEEE, 2022); https://doi.org/10.1109/CVPR52688.2022.01553
Brody, S., Alon, U. & Yahav, E. How attentive are graph attention networks? In Proc. International Conference on Learning Representations (OpenReview.net, 2022).
Pelka, O., Koitka, S., Rückert, J., Nensa, F. & Friedrich, C. M. Radiology objects in context (ROCO): a multimodal image dataset. In Proc. Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (eds Stoyanov, D. et al.) 180–189 (Springer, 2018).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. In Proc. 37th Conference on Neural Information Processing Systems 34892–34916 (Curran Associates, 2023).
Abnar, S. & Zuidema, W. Quantifying attention flow in transformers. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 4190–4197 (ACL, 2020); https://doi.org/10.18653/v1/2020.acl-main.385
Chefer, H., Gur, S. & Wolf, L. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proc. IEEE/CVF International Conference on Computer Vision 387–396 (IEEE, 2021); https://doi.org/10.1109/ICCV48922.2021.00045
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In Proc. International Conference on Learning Representations (OpenReview.net, 2019).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 8024–8035 (Curran Associates, 2019).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Ma, L. D. et al. MedMPT. GitHub https://github.com/maliangdi/MedMPT (2025).
Acknowledgements
This study is supported by the National Key R&D Program of China (number 2023YFC3305600 to F.X.); National Natural Science Foundation of China (numbers 61822111 and 62021002 to F.X., 82441013 to Y.G., and 82330057 to Y. Liu); the Zhejiang Provincial Natural Science Foundation (number LDT23F02024F02 to F.X.); National Science and Technology Major Project (number 2023ZD0506304 to Y.G.); R&D Program of Guangzhou National Laboratory (number SRPG22-017 to J.H.); a grant (number CX23YZ01 to Y. Liu) from the Chinese Institutes for Medical Research, Beijing; and Beijing Hospital Management Center-Climb Plan (number DFL20220503 to Y. Liu). This study is also supported by THUIBCS, Tsinghua University, and BLBCI, Beijing Municipal Education Commission.
Author information
Contributions
F.X., J.H., Y.G. and H.L. are the co-corresponding authors. L.M., H.L., Y.H., J.H., Y.G. and F.X. contributed to the conception and design of the research. F.X., J.H. and Y.G. coordinated and organized the research team to complete this work. H.L. and W.W. contributed to the raw data acquisition. L.M., Y.H. and Z.Y. contributed to data organization and verification. L.M., Y.G. and F.X. contributed to the methodology, technical implementation and results analysis. H.L., Y. Liu and J.H. organized the clinical team for the validation experiment. H.L., W.W., Y. Li, W.L., R.W., Y. Lizhu, Y. Liu and J.H. contributed to the clinical validation experiment and results evaluation. L.M., Y.G. and F.X. contributed to the original draft preparation and revision of the paper. L.M., H.L., Y.G., J.H. and F.X. discussed the results and commented on the paper.
Ethics declarations
Competing interests
Tsinghua University has filed for patent protection on behalf of F.X., Y.G. and L.M. for the work related to the multimodal pretraining method in chest CT. The other authors declare no competing interests.
Peer review
Peer review information
Nature Biomedical Engineering thanks Pranav Rajpurkar and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Qualitative comparison of AI-generated radiology reports.
Two cases illustrating the ground-truth (human-written) radiology reports and the corresponding reports generated by MedMPT, Med-Flamingo and LLaVA-Med. MedMPT generated accurate and clinically appropriate descriptions, whereas the comparison models produced reports with irrelevant or inconsistent content.
Extended Data Fig. 2 Case study of human–AI collaboration in report generation.
The figure presents the reference report, representative CT slice(s) highlighting key findings, and reports generated by MedMPT (AI-written), a radiologist alone (human-written), and a radiologist assisted by AI (AI-assisted human-written). Radiologist revisions are marked with colour-coded annotations. The AI-assisted report combined the detailed findings from MedMPT with the radiologist’s refinements, achieving high quality and fluency while substantially reducing reporting time.
Extended Data Fig. 3 Case study of human–AI collaboration in report generation.
The figure presents the reference report, representative CT slice(s) highlighting key findings, and reports generated by MedMPT (AI-written), a radiologist alone (human-written), and a radiologist assisted by AI (AI-assisted human-written). Radiologist revisions are marked with colour-coded annotations. In this case, the radiologist revised the AI-generated report to align with their personal writing style, even though it was already clinically accurate, resulting in only a modest improvement in reporting time.
Extended Data Fig. 4 Overview of the MedMPT pretraining framework and fine-tuning strategies.
a, The pretraining framework of MedMPT, in which paired CT scans and reports are used to train the vision encoder, vision decoder, text encoder and text decoder in a multi-task manner, extracting multi-scale representations of multimodal medical data. b, The framework for transferring MedMPT to downstream tasks. The pretrained modules, together with additional task-specific modules, support multimodal downstream tasks; the pretrained parameters can be fine-tuned or frozen for evaluation.
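To make the multi-task pattern concrete, the sketch below shows a hypothetical single training step combining the three objectives the framework implies: masked-image reconstruction (MAE-style, ref. 59), image–text contrastive alignment (refs. 22 and 58) and report generation. All module interfaces (vision_enc, vision_dec.reconstruction_loss, text_dec.generation_loss) and loss weights are illustrative assumptions, not the MedMPT API.

```python
# Hypothetical multi-task pretraining step; module interfaces are placeholders.
import torch
import torch.nn.functional as F

def pretraining_step(scans, reports, vision_enc, vision_dec, text_enc, text_dec,
                     w_rec=1.0, w_con=1.0, w_gen=1.0, temperature=0.07):
    # Vision encoder returns a global scan embedding plus (masked) patch embeddings.
    scan_emb, patch_emb, mask = vision_enc(scans, mask_ratio=0.75)
    report_emb = text_enc(reports)

    # 1) Reconstruct the masked image patches from the visible ones (MAE-style).
    loss_rec = vision_dec.reconstruction_loss(patch_emb, scans, mask)

    # 2) Align paired scan and report embeddings with a symmetric InfoNCE loss.
    s = F.normalize(scan_emb, dim=-1)
    r = F.normalize(report_emb, dim=-1)
    logits = s @ r.t() / temperature
    targets = torch.arange(s.size(0), device=s.device)
    loss_con = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # 3) Autoregressively generate the report conditioned on the scan embedding.
    loss_gen = text_dec.generation_loss(reports, context=scan_emb)

    return w_rec * loss_rec + w_con * loss_con + w_gen * loss_gen
```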
Extended Data Fig. 5 Overview of the MedMPT transformer architecture.
a, Vision encoder, consisting of a slice encoder, which encodes the image patches within a slice into a slice embedding and a sequence of patch embeddings, and a slice fusion encoder, which encodes the slice embeddings of a scan into a global scan embedding and a sequence of contextualized slice embeddings. b, Vision decoder, which reconstructs a sequence of image patches from the patch embeddings of the slice encoder (with mask token embeddings added if masks were applied to the input). c, Text encoder, which encodes a radiology report into a sequence of token embeddings. d, Text decoder, which predicts the following text conditioned on the prefix contents and the slice embeddings from the vision encoder.
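A compact PyTorch sketch of the two-level encoding scheme described in panel a follows. The dimensions, depths and use of learnable [CLS]-style tokens are assumptions for illustration, not the published architecture.

```python
# Illustrative two-level (slice -> scan) transformer encoder for CT volumes.
import torch
import torch.nn as nn

class HierarchicalScanEncoder(nn.Module):
    """Sketch: a slice encoder summarizes patches within each CT slice, and a
    slice fusion encoder aggregates slice embeddings into a scan embedding."""

    def __init__(self, dim: int = 512, heads: int = 8, depth: int = 4):
        super().__init__()
        slice_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                 batch_first=True)
        fusion_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                  batch_first=True)
        self.slice_encoder = nn.TransformerEncoder(slice_layer, num_layers=depth)
        self.slice_fusion_encoder = nn.TransformerEncoder(fusion_layer,
                                                          num_layers=depth)
        self.slice_token = nn.Parameter(torch.zeros(1, 1, dim))  # per-slice [CLS]
        self.scan_token = nn.Parameter(torch.zeros(1, 1, dim))   # per-scan [CLS]

    def forward(self, patch_emb: torch.Tensor):
        # patch_emb: (batch, n_slices, n_patches, dim) pre-embedded image patches.
        b, s, p, d = patch_emb.shape
        x = patch_emb.reshape(b * s, p, d)
        cls = self.slice_token.expand(b * s, -1, -1)
        x = self.slice_encoder(torch.cat([cls, x], dim=1))
        slice_emb = x[:, 0].reshape(b, s, d)       # one embedding per slice
        patch_out = x[:, 1:].reshape(b, s, p, d)   # contextualized patch embeddings
        cls2 = self.scan_token.expand(b, -1, -1)
        y = self.slice_fusion_encoder(torch.cat([cls2, slice_emb], dim=1))
        # Global scan embedding, contextualized slice embeddings, patch embeddings.
        return y[:, 0], y[:, 1:], patch_out
```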
Supplementary information
Supplementary Methods, Figs. 1–4 and Tables 1–19.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ma, L., Liang, H., He, Y. et al. A vision–language pretrained transformer for versatile clinical respiratory disease applications. Nat. Biomed. Eng. (2025). https://doi.org/10.1038/s41551-025-01544-z