Abstract
The rapid development of generative artificial intelligence and large models, including large vision models (LVMs), has accelerated their wide application in medicine. Robot-assisted surgery (RAS), or surgical robotics, in which vision has a vital role, typically combines medical imaging, which provides diagnostic and navigation abilities, with robots that offer precise operative capabilities. In this context, LVMs could serve as a revolutionary paradigm towards surgical autonomy, achieving high-fidelity surgical representations with physical intelligence and enabling high-quality data use and long-term learning. In this Perspective, vision-related tasks in RAS are divided into fundamental upstream tasks and advanced downstream counterparts, elucidating their shared technical foundations alongside state-of-the-art research that could catalyse a paradigm shift in surgical robotics research for the next decade. LVMs have already been extensively explored to tackle upstream tasks in RAS, exhibiting promising performance. Developing vision foundation models for downstream RAS tasks, which builds on the upstream counterparts but necessitates further investigation, will directly enhance surgical autonomy. Here, we outline research trends that could accelerate this paradigm shift and highlight major challenges that could impede progress on the way to the ultimate transformation from ‘surgical robots’ to ‘robotic surgeons’.
Key points
- Large vision model-driven surgical robots could become a new paradigm for achieving a higher level of surgical autonomy.
- In robot-assisted surgery (RAS), vision-related tasks are broadly classified into upstream tasks, including classification, detection, segmentation and registration, and downstream counterparts, encompassing cognition, simulation, diagnosis and robot control.
- Large vision models have been extensively explored to tackle upstream tasks in RAS, demonstrating great effectiveness and promising performance.
- Incorporating vision foundation models into downstream RAS tasks, which typically involve multidimensional and multimodal data, directly enhances surgical autonomy and intelligence; this can to some extent be achieved by leveraging upstream image-processing advances (see the sketch after these key points) but necessitates further investigation.
- Future research trends towards developing large models for downstream tasks (and beyond) and increasing the autonomy level in RAS encompass enhancing data collection, achieving physics-aware artificial intelligence models, developing surgical large multimodal models, boosting models’ explainability and strengthening cross-disciplinary collaborations.
- Looking forward, with technical, application, ethical and regulatory challenges to be tackled along the road, a pathway to developing multimodal, downstream-task-oriented, high-dimensional and physics-aware large models for achieving a higher RAS autonomy level is on the horizon.
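To make the reuse of upstream image-processing advances for a downstream RAS task concrete, the minimal sketch below freezes a generic pretrained vision backbone and trains only a lightweight decoder head for instrument segmentation on endoscopic frames. The backbone (a torchvision ResNet-50 standing in for a surgical vision foundation model), the head design and the tensor shapes are illustrative assumptions, not the pipeline of any specific system discussed in this Perspective.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# Hypothetical sketch: a frozen upstream encoder supplies generic visual
# features; only a small downstream head is trained for the surgical task.
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
for p in encoder.parameters():
    p.requires_grad = False  # keep the upstream representation fixed

head = nn.Sequential(  # trainable task-specific decoder
    nn.Conv2d(2048, 256, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 1, kernel_size=1),
    nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
)

frames = torch.randn(2, 3, 224, 224)      # stand-in endoscopic frames
masks = head(encoder(frames)).sigmoid()   # (2, 1, 224, 224) soft instrument masks
print(masks.shape)
```

In practice, parameter-efficient adaptation (for example, low-rank adapters inserted into the frozen encoder) is often preferred over training a separate head alone, but the division of labour is the same: upstream models supply the representation, downstream modules supply the task.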
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC) under grants 62303275, 62403402 and 22IAA01849; in part by the Jinan Municipal Bureau of Science and Technology under grant 202333011; in part by the Hong Kong Research Grants Council (RGC) under grants CRF C4026-21G, RIF R4020-22, GRF 14203323 and 14216022; in part by the NSFC/RGC Joint Research Scheme under grant N_CUHK420/22; and in part by the CUHK Direct Grant for Research under grant 4055213.
Author information
Contributions
H.R. conceived and initiated the project. Z.M. and J.L. researched data for the article and wrote the manuscript. All the authors contributed to the discussion of the content and revised/edited the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Peer review
Peer review information
Nature Reviews Electrical Engineering thanks Adam Schmidt and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Related links
EU AI Act: https://artificialintelligenceact.eu/
Grand Challenge: https://grand-challenge.org/
Imagen: https://deepmind.google/technologies/imagen-2/
Medical Open Network for Artificial Intelligence (MONAI): https://monai.io/
MIDRC Data: https://www.midrc.org/midrc-data
Sora: https://openai.com/index/sora/
Stable Diffusion: https://stability.ai/
Glossary
- Area under the receiver-operating characteristic curve: A single scalar that quantitatively summarizes the performance of a classification model across all classification thresholds.
- Contrastive learning: A self-supervised learning technique that maximizes agreement between similar pairs and minimizes agreement between dissimilar pairs in the latent space.
- Dice similarity coefficient: A measure of overlap between predicted and ground-truth segmented regions, computed as twice the size of their intersection divided by the sum of their sizes (see the sketch after this glossary).
- Foundation models: Large-scale artificial intelligence models that are pretrained on massive amounts of diverse data and can be adapted to various downstream tasks.
- Knowledge graphs: A graph-based representation of knowledge that describes entities and their relationships.
- Large language models (LLMs): Large artificial intelligence models that are pretrained on massive amounts of language data (for example, text and radiology reports) and can be applied to numerous downstream language tasks such as question answering and dialogue.
- Large vision models (LVMs): Large artificial intelligence models that are pretrained on massive amounts of vision data (for example, medical images and surgical videos) and can be applied to numerous downstream vision tasks such as medical image classification.
- Segment anything model (SAM): A segmentation foundation model that exhibits impressive zero-shot performance.
- Self-supervised learning: A learning paradigm in which models derive supervisory signals directly from unlabelled data.
- Vision-language-action (VLA) models: Large multimodal models that process vision and language information and generate robot actions.
- Vision-language models: Large multimodal models that are pretrained with image–text pairs and can be applied to various downstream vision-language tasks (for example, image captioning and visual question answering).
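As a concrete illustration of two quantitative terms defined above, the short sketch below computes the Dice similarity coefficient for a toy pair of binary masks and the area under the receiver-operating characteristic curve for a toy classifier. The arrays are invented for illustration, and scikit-learn's roc_auc_score is used only as a convenience.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Dice similarity coefficient between predicted and ground-truth binary masks:
# 2 * |pred ∩ gt| / (|pred| + |gt|); a value of 1.0 means perfect overlap.
pred = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]], dtype=bool)
gt = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 0]], dtype=bool)
dice = 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())
print(f"Dice = {dice:.3f}")  # 2*3 / (3+4) ≈ 0.857

# Area under the receiver-operating characteristic curve: a single scalar
# summarizing a classifier's ranking quality across all decision thresholds.
labels = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.20])
print(f"AUC = {roc_auc_score(labels, scores):.3f}")
```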
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Min, Z., Lai, J. & Ren, H. Innovating robot-assisted surgery through large vision models. Nat Rev Electr Eng 2, 350–363 (2025). https://doi.org/10.1038/s44287-025-00166-6