Abstract
The rapid development of generative artificial intelligence and large models, including large vision models (LVMs), has accelerated their wide application in medicine. Robot-assisted surgery (RAS), or surgical robotics, in which vision has a vital role, typically combines medical imaging, which provides diagnostic and navigation abilities, with robots that offer precise operative capabilities. In this context, LVMs could serve as a revolutionary paradigm towards surgical autonomy, achieving high-fidelity surgical representations with physical intelligence and enabling high-quality data use and long-term learning. In this Perspective, vision-related tasks in RAS are divided into fundamental upstream tasks and advanced downstream counterparts, elucidating their shared technical foundations alongside state-of-the-art research that could catalyse a paradigm shift in surgical robotics research for the next decade. LVMs have already been extensively explored to tackle upstream tasks in RAS, exhibiting promising performance. Developing vision foundation models for downstream RAS tasks, which builds on the upstream counterparts but necessitates further investigation, will directly enhance surgical autonomy. Here, we outline research trends that could accelerate this paradigm shift and highlight major challenges that could impede progress on the way to the ultimate transformation from ‘surgical robots’ to ‘robotic surgeons’.
Key points
- Large vision model-driven surgical robots could become a new paradigm for achieving a higher level of surgical autonomy.
- In robot-assisted surgery (RAS), vision-related tasks are broadly classified into upstream tasks, including classification, detection, segmentation and registration, and downstream counterparts, encompassing cognition, simulation, diagnosis and robot control.
- Large vision models have been extensively explored to tackle upstream tasks in RAS, demonstrating great effectiveness and promising performance.
- Incorporating vision foundation models into downstream RAS tasks, which typically involve multidimensional and multimodal data, directly enhances surgical autonomy and intelligence; this can to some extent be achieved by leveraging upstream image-processing advances (see the sketch after these key points) but necessitates further investigation.
- Future research trends towards developing large models for downstream tasks (and beyond) and increasing the autonomy level in RAS encompass enhancing data collection, achieving physics-aware artificial intelligence models, developing surgical large multimodal models, boosting models’ explainability and strengthening cross-disciplinary collaborations.
- Looking forward, with technical, application, ethical and regulatory challenges to be tackled along the road, a pathway to developing multimodal, downstream-task-oriented, high-dimensional and physics-aware large models for achieving a higher RAS autonomy level is on the horizon.
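To make the reuse of upstream image-processing advances for a downstream RAS task concrete, the minimal sketch below freezes a generic pretrained vision backbone and trains only a lightweight decoder head for instrument segmentation on endoscopic frames. The backbone (a torchvision ResNet-50 standing in for a surgical vision foundation model), the head design and the tensor shapes are illustrative assumptions, not the pipeline of any specific system discussed in this Perspective.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# Hypothetical sketch: a frozen upstream encoder supplies generic visual
# features; only a small downstream head is trained for the surgical task.
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
for p in encoder.parameters():
    p.requires_grad = False  # keep the upstream representation fixed

head = nn.Sequential(  # trainable task-specific decoder
    nn.Conv2d(2048, 256, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 1, kernel_size=1),
    nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
)

frames = torch.randn(2, 3, 224, 224)      # stand-in endoscopic frames
masks = head(encoder(frames)).sigmoid()   # (2, 1, 224, 224) soft instrument masks
print(masks.shape)
```

In practice, parameter-efficient adaptation (for example, low-rank adapters inserted into the frozen encoder) is often preferred over training a separate head alone, but the division of labour is the same: upstream models supply the representation, downstream modules supply the task.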
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC) under grants 62303275, 62403402 and 22IAA01849; in part by the Jinan Municipal Bureau of Science and Technology under grant 202333011; in part by the Hong Kong Research Grants Council (RGC) under grants CRF C4026-21G, RIF R4020-22, GRF 14203323 and 14216022; in part by the NSFC/RGC Joint Research Scheme under grant N_CUHK420/22; and in part by the CUHK Direct Grant for Research under grant 4055213.
Author information
Contributions
H.R. conceived and initiated the project. Z.M. and J.L. researched data for the article and wrote the manuscript. All the authors contributed to the discussion of the content and revised/edited the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Peer review
Peer review information
Nature Reviews Electrical Engineering thanks Adam Schmidt and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Related links
EU AI Act: https://artificialintelligenceact.eu/
Grand Challenge: https://grand-challenge.org/
Imagen: https://deepmind.google/technologies/imagen-2/
Medical Open Network for Artificial Intelligence (MONAI): https://monai.io/
MIDRC Data: https://www.midrc.org/midrc-data
Sora: https://openai.com/index/sora/
Stable Diffusion: https://stability.ai/
Glossary
- Area under the receiver-operating characteristic curve: A single scalar that quantitatively summarizes the performance of a classification model across all classification thresholds.
- Contrastive learning: A self-supervised learning technique that maximizes agreement between similar pairs and minimizes agreement between dissimilar pairs in the latent space.
- Dice similarity coefficient: A measure of overlap between predicted and ground-truth segmented regions, computed as twice the size of their intersection divided by the sum of their sizes (see the sketch after this glossary).
- Foundation models: Large-scale artificial intelligence models that are pretrained on massive amounts of diverse data and can be adapted to various downstream tasks.
- Knowledge graphs: A graph-based representation of knowledge that describes entities and their relationships.
- Large language models (LLMs): Large artificial intelligence models that are pretrained on massive amounts of language data (for example, text and radiology reports) and can be applied to numerous downstream language tasks such as question answering and dialogue.
- Large vision models (LVMs): Large artificial intelligence models that are pretrained on massive amounts of vision data (for example, medical images and surgical videos) and can be applied to numerous downstream vision tasks such as medical image classification.
- Segment anything model (SAM): A segmentation foundation model that exhibits impressive zero-shot performance.
- Self-supervised learning: A learning paradigm in which models derive supervisory signals directly from unlabelled data.
- Vision-language-action (VLA) models: Large multimodal models that process vision and language information and generate robot actions.
- Vision-language models: Large multimodal models that are pretrained with image–text pairs and can be applied to various downstream vision-language tasks (for example, image captioning and visual question answering).
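As a concrete illustration of two quantitative terms defined above, the short sketch below computes the Dice similarity coefficient for a toy pair of binary masks and the area under the receiver-operating characteristic curve for a toy classifier. The arrays are invented for illustration, and scikit-learn's roc_auc_score is used only as a convenience.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Dice similarity coefficient between predicted and ground-truth binary masks:
# 2 * |pred ∩ gt| / (|pred| + |gt|); a value of 1.0 means perfect overlap.
pred = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]], dtype=bool)
gt = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 0]], dtype=bool)
dice = 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())
print(f"Dice = {dice:.3f}")  # 2*3 / (3+4) ≈ 0.857

# Area under the receiver-operating characteristic curve: a single scalar
# summarizing a classifier's ranking quality across all decision thresholds.
labels = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.20])
print(f"AUC = {roc_auc_score(labels, scores):.3f}")
```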
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Min, Z., Lai, J. & Ren, H. Innovating robot-assisted surgery through large vision models. Nat Rev Electr Eng 2, 350–363 (2025). https://doi.org/10.1038/s44287-025-00166-6