
  • Perspective

Innovating robot-assisted surgery through large vision models

Abstract

The rapid development of generative artificial intelligence and large models, including large vision models (LVMs), has accelerated their wide application in medicine. Robot-assisted surgery (RAS), or surgical robotics, in which vision has a vital role, typically combines medical images for diagnostic or navigation abilities with robots that have precise operative capabilities. In this context, LVMs could serve as a revolutionary paradigm towards surgical autonomy, accomplishing surgical representations with high fidelity and physical intelligence and enabling high-quality data use and long-term learning. In this Perspective, vision-related tasks in RAS are divided into fundamental upstream tasks and advanced downstream counterparts, elucidating their shared technical foundations with state-of-the-art research that could catalyse a paradigm shift in surgical robotics research for the next decade. LVMs have already been extensively explored to tackle upstream tasks in RAS, exhibiting promising performance. Developing vision foundation models for downstream RAS tasks, which builds on their upstream counterparts but necessitates further investigation, will directly enhance surgical autonomy. Here, we outline research trends that could accelerate this paradigm shift and highlight major challenges that could impede progress on the way to the ultimate transformation from ‘surgical robots’ to ‘robotic surgeons’.

Key points

  • Large vision model-driven surgical robots could become a new paradigm for achieving a higher level of surgical autonomy.

  • In robot-assisted surgery (RAS), vision-related tasks are broadly classified into upstream tasks (classification, detection, segmentation and registration) and downstream counterparts (cognition, simulation, diagnosis and robot control).

  • Large vision models have been extensively explored to tackle upstream tasks in RAS, demonstrating great effectiveness and promising performance.

  • Incorporating vision foundation models into downstream RAS tasks, which typically involve multidimensional and multimodal data, directly enhances surgical autonomy and intelligence; this can partly be achieved by leveraging upstream image-processing advances but necessitates further investigation.

  • Future research trends towards developing large models for downstream tasks (and beyond) and increasing the autonomy level in RAS encompass enhancing data collection, building physics-aware artificial intelligence models, developing surgical large multimodal models, boosting model explainability and strengthening cross-disciplinary collaborations.

  • Looking forward, with technical, application, ethical and regulatory challenges to be tackled along the way, a pathway to developing multimodal, downstream-task-oriented, high-dimensional and physics-aware large models for achieving a higher level of RAS autonomy is on the horizon.


Fig. 1: Applications of large vision models to upstream and downstream robot-assisted surgery tasks.
Fig. 2: Number of publications in upstream and downstream robot-assisted surgery tasks in the years 2018–2024.
Fig. 3: The role of large vision models in enhancing surgical robot autonomy.



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under grants 62303275, 62403402 and 22IAA01849; in part by Jinan Municipal Bureau of Science and Technology under Grant 202333011; in part by the Hong Kong Research Grants Council (RGC) under grants CRF C4026-21G, RIF R4020-22, GRF 14203323 and 14216022; in part by the NSFC/RGC Joint Research Scheme under grant N_CUHK420/22; and in part by the CUHK Direct Grant for Research under grant 4055213.

Author information

Contributions

H.R. conceived and initiated the project. Z.M. and J.L. researched data for the article and wrote the manuscript. All the authors contributed to the discussion of the content and revised/edited the manuscript.

Corresponding author

Correspondence to Hongliang Ren  (任洪亮).

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Peer review information

Nature Reviews Electrical Engineering thanks Adam Schmidt and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

EU AI Act: https://artificialintelligenceact.eu/

Grand Challenge: https://grand-challenge.org/

Imagen: https://deepmind.google/technologies/imagen-2/

Medical Open Network for Artificial Intelligence (MONAI): https://monai.io/

MIDRC Data: https://www.midrc.org/midrc-data

Sora: https://openai.com/index/sora/

Stable Diffusion: https://stability.ai/

Glossary

Area under the receiver operating characteristic curve

A single scalar that quantitatively summarizes the performance of a classification model across all classification thresholds.
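As a minimal illustration (not taken from the article), the metric can be computed with scikit-learn's roc_auc_score; the labels and scores below are made up.

```python
# Minimal illustration of the area under the receiver operating characteristic
# curve (AUROC) for a binary classifier; values are made up for demonstration.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # ground-truth labels (e.g. lesion present/absent)
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3]   # predicted probabilities

auroc = roc_auc_score(y_true, y_score)                # single scalar summarizing all thresholds
print(f"AUROC = {auroc:.3f}")
```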

Contrastive learning

A self-supervised learning technique that maximizes agreement between similar pairs and minimizes agreement between dissimilar pairs in the latent space.
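A minimal sketch of such an objective, assuming a CLIP-style symmetric loss over a batch of paired image and text embeddings (names, shapes and the temperature value are illustrative, not the article's implementation):

```python
# Sketch of a CLIP-style contrastive objective: embeddings of matched pairs
# (image-text pairs, or two augmented views of the same image) are pulled
# together, while mismatched pairs within the batch are pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    # L2-normalise so the dot product is a cosine similarity
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(img_emb.size(0))        # matching pairs lie on the diagonal
    # maximise agreement for matched pairs, minimise it for all others
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# toy usage with random embeddings standing in for encoder outputs
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```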

Dice similarity coefficient

Twice the overlap between the predicted and ground-truth segmented regions divided by the total size of the two regions; a standard measure of segmentation accuracy.
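A short illustrative computation for binary masks (the toy arrays are made up; the formula in the comment is the standard definition):

```python
# Dice similarity coefficient for binary masks:
#   DSC = 2 * |A ∩ B| / (|A| + |B|)
# i.e. twice the overlap divided by the total size of the two regions.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return float(2.0 * intersection / (pred.sum() + gt.sum() + eps))

# toy 2D masks; a real use case would pass predicted and ground-truth segmentations
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
print(dice(pred, gt))  # 2*2 / (3 + 3) ≈ 0.667
```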

Foundation models

Large-scale artificial intelligence models that are pretrained on massive, diverse data sets and can be adapted to various downstream tasks.
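A minimal sketch of the adaptation idea, using a generic ImageNet-pretrained backbone as a stand-in for a foundation model and a made-up downstream task (the backbone choice, class count and "linear probe" strategy are illustrative assumptions, not the article's method):

```python
# Adapting a pretrained backbone to a downstream task by freezing its features
# and training only a new task head (a simple "linear probe").
import torch
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
for p in backbone.parameters():
    p.requires_grad = False                                   # keep pretrained features fixed

backbone.fc = torch.nn.Linear(backbone.fc.in_features, 7)     # e.g. 7 hypothetical surgical phases
optimizer = torch.optim.AdamW(backbone.fc.parameters(), lr=1e-3)

logits = backbone(torch.randn(2, 3, 224, 224))                # toy batch, shape (2, 7)
```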

Knowledge graphs

A graph-based representation of knowledge that describes entities and their relationships.
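A toy illustration of the idea as subject-relation-object triples; the surgical entities and relations below are made up for demonstration:

```python
# Knowledge graph represented as (subject, relation, object) triples.
triples = [
    ("grasper", "is_a", "surgical_instrument"),
    ("grasper", "acts_on", "gallbladder"),
    ("gallbladder", "part_of", "biliary_system"),
]

# entities directly related to "grasper"
neighbours = {obj for subj, rel, obj in triples if subj == "grasper"}
print(neighbours)  # e.g. {'surgical_instrument', 'gallbladder'}
```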

Large language models

(LLMs). Large artificial intelligence models that are pretrained on massive amounts of language data (for example, text and radiology reports) and can be applied to numerous downstream language tasks such as question answering and dialogue.

Large vision models

(LVMs). Large artificial intelligence models that are pretrained on massive amounts of vision data (for example, medical images and surgical videos) and can be applied to numerous downstream vision tasks such as medical image classification.

Segment anything model

(SAM). A segmentation foundation model that exhibits impressive zero-shot performance.
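A hedged sketch of prompt-based inference, assuming the open-source segment_anything package, a downloaded ViT-B checkpoint, and placeholder image path and click coordinates (none of these specifics come from the article):

```python
# Prompt-based segmentation with the open-source segment-anything package;
# checkpoint path, image path and click coordinates are placeholders.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_checkpoint.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("endoscopic_frame.png").convert("RGB"))
predictor.set_image(image)                      # computes the image embedding once

# a single foreground click on the instrument serves as the prompt
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),                 # 1 = foreground, 0 = background
    multimask_output=True,                      # return several candidate masks
)
best_mask = masks[np.argmax(scores)]            # boolean array of shape (H, W)
```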

Self-supervised learning

A learning paradigm in which models derive supervisory signals directly from unlabelled data.
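A toy pretext task illustrating the idea (the rotation-prediction task and tensor shapes are illustrative assumptions, not the article's approach):

```python
# Self-supervision from unlabelled images: the "label" (rotation index) is
# derived from the image itself rather than from human annotation.
import torch

def rotation_pretext_batch(images: torch.Tensor):
    # images: (N, C, H, W); rotate each image by 0/90/180/270 degrees
    k = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack(
        [torch.rot90(img, int(ki), dims=(1, 2)) for img, ki in zip(images, k)]
    )
    return rotated, k   # a classifier is then trained to predict k

x, y = rotation_pretext_batch(torch.randn(4, 3, 224, 224))
```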

Vision-language-action (VLA) models

Large multimodal models that process vision and language information and generate robot actions.

Vision-language models

Large multimodal models that are pretrained on image–text pairs and can be applied to various downstream vision-language tasks (for example, image captioning and visual question answering).
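A brief illustrative example of zero-shot classification with such a model, using the publicly available CLIP checkpoint on Hugging Face as a stand-in; the surgical prompts and image path are placeholders, not from the article:

```python
# Zero-shot image classification with a vision-language model: each text prompt
# is scored against the image, and the probabilities rank the candidate labels.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("laparoscopic_frame.png")
prompts = ["a photo of a grasper", "a photo of scissors", "a photo of a needle driver"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)   # one score per prompt
print(dict(zip(prompts, probs[0].tolist())))
```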

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Min, Z., Lai, J. & Ren, H. Innovating robot-assisted surgery through large vision models. Nat Rev Electr Eng 2, 350–363 (2025). https://doi.org/10.1038/s44287-025-00166-6
