On Efficient Variants of Segment Anything Model: A Survey

Published in International Journal of Computer Vision

Abstract

The Segment Anything Model (SAM) is a foundational model for image segmentation tasks, known for its strong generalization across diverse applications. However, its impressive performance comes with significant computational and resource demands, making it challenging to deploy in resource-limited environments such as edge devices. To address this, a variety of SAM variants have been proposed to enhance efficiency while maintaining accuracy. This survey provides the first comprehensive review of these efficient SAM variants. We begin by exploring the motivations driving this research. We then present the core techniques used in SAM and in model acceleration. This is followed by a detailed exploration of SAM acceleration strategies, categorized by approach, and a discussion of several future research directions. Finally, we offer a unified and extensive evaluation of these methods across various hardware, assessing their efficiency and accuracy on representative benchmarks and providing a clear comparison of their overall performance. To complement this survey, we summarize the papers and code related to efficient SAM variants at https://github.com/Image-and-Video-Computing-Group/On-Efficient-Variants-of-Segment-Anything-Model.
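
A central theme of the survey's comparison is measuring the inference cost of SAM and its variants on concrete hardware. As a hedged illustration of the kind of measurement involved (a minimal sketch, not the evaluation protocol used in this article), the snippet below times the image encoder of the original SAM with plain PyTorch; the `segment_anything` package, the ViT-B checkpoint filename, and the warm-up/iteration counts are assumptions made for the example.

```python
# Minimal latency benchmark sketch for a SAM-style image encoder.
# Assumes the official `segment_anything` package and a locally downloaded
# ViT-B checkpoint ("sam_vit_b_01ec64.pth"); both are illustrative choices.
import time
import torch
from segment_anything import sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").to(device).eval()

# SAM's image encoder expects 1024x1024 inputs after preprocessing.
dummy = torch.randn(1, 3, 1024, 1024, device=device)

with torch.no_grad():
    # Warm-up runs so one-time initialization does not skew the timing.
    for _ in range(5):
        sam.image_encoder(dummy)
    if device == "cuda":
        torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        sam.image_encoder(dummy)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = (time.perf_counter() - start) / iters

params = sum(p.numel() for p in sam.image_encoder.parameters())
print(f"image encoder: {params / 1e6:.1f} M params, {elapsed * 1000:.1f} ms / image on {device}")
```

On a GPU, synchronizing before reading the clock matters because CUDA kernels are launched asynchronously; omitting it would largely measure launch overhead rather than encoder latency.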



Data Availability

The data supporting the experiments of this study are openly available, as stated in Section 4.1.

Notes

  1. https://eval.ai/web/challenges/challenge-page/1931/overview

  2. https://github.com/facebookresearch/segment-anything/blob/main/notebooks/images/groceries.jpg

  3. https://github.com/MrYxJ/calculate-flops.pytorch
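
Note 3 above refers to the calculate-flops.pytorch tool used for complexity statistics. As a rough, hedged sketch of how such a count might be obtained (the model choice, checkpoint filename, and input shape are assumptions, and the `calculate_flops` call follows that package's documented usage rather than anything prescribed by this article):

```python
# Hedged sketch of FLOPs/MACs/parameter counting with the calflops package
# (the tool referenced in Note 3); all concrete choices here are illustrative.
from calflops import calculate_flops
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").eval()

# Count FLOPs, MACs, and parameters of the image encoder for one 1024x1024 image.
flops, macs, params = calculate_flops(
    model=sam.image_encoder,
    input_shape=(1, 3, 1024, 1024),
    output_as_string=True,
    output_precision=2,
)
print(f"FLOPs: {flops}  MACs: {macs}  Params: {params}")
```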


Acknowledgements

This work was supported by the National Key Research and Development Program of China under Grant 2022YFA1004100 and the National Natural Science Foundation of China under Grant 62476048.

Author information

Corresponding author

Correspondence to Ping Hu.

Additional information

Communicated by Yen-Yu Lin.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Sun, X., Liu, J., Shen, H. et al. On Efficient Variants of Segment Anything Model: A Survey. Int J Comput Vis 133, 7406–7436 (2025). https://doi.org/10.1007/s11263-025-02539-8

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • Issue date:

  • DOI: https://doi.org/10.1007/s11263-025-02539-8

Keywords