
SeaFormer++: Squeeze-Enhanced Axial Transformer for Mobile Visual Recognition

Published in: International Journal of Computer Vision

Abstract

Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which had been overwhelmingly dominated by CNNs, has recently been significantly revolutionized. However, the computational cost and memory requirements render these methods unsuitable for mobile devices. In this paper, we introduce a new method, the squeeze-enhanced Axial Transformer (SeaFormer), for mobile visual recognition. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial attention and detail enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on ARM-based mobile devices on the ADE20K, Cityscapes, Pascal Context and COCO-Stuff datasets. Critically, we beat both mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency, without bells and whistles. Furthermore, we incorporate a feature upsampling-based multi-resolution distillation technique, further reducing the inference latency of the proposed framework. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to image classification and object detection problems, demonstrating its potential to serve as a versatile, mobile-friendly backbone. Our code and models are made publicly available at https://github.com/fudan-zvg/SeaFormer.
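
The authors' reference implementation lives in the linked repository. As a rough illustration of the mechanism the abstract describes, the PyTorch sketch below mean-squeezes the feature map along each spatial axis before attending, cutting the cost of self-attention from O((HW)^2) to roughly O(H^2 + W^2), and pairs it with a depthwise-convolution detail branch that restores local detail lost by the squeeze. All names and design details here (the `SqueezeAxialAttention` module, the mean-pooling squeeze, single-head attention) are our simplifications for exposition, not the authors' code.

```python
import torch
import torch.nn as nn

class SqueezeAxialAttention(nn.Module):
    """Hypothetical sketch of squeeze Axial attention with detail
    enhancement, simplified from the description in the abstract."""

    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        # Detail enhancement: a cheap depthwise conv branch that keeps
        # the local spatial detail the squeeze operation discards.
        self.detail = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def _axial_attn(self, q, k, v):
        # q, k, v: (B, L, C) 1-D sequences obtained by squeezing one axis
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, L, L)
        return attn.softmax(dim=-1) @ v                 # (B, L, C)

    def forward(self, x):                               # x: (B, C, H, W)
        q, k, v = self.to_qkv(x).chunk(3, dim=1)

        # Squeeze the width axis -> per-row tokens, attend along H
        qh, kh, vh = (t.mean(dim=3).transpose(1, 2) for t in (q, k, v))
        out_h = self._axial_attn(qh, kh, vh)            # (B, H, C)

        # Squeeze the height axis -> per-column tokens, attend along W
        qw, kw, vw = (t.mean(dim=2).transpose(1, 2) for t in (q, k, v))
        out_w = self._axial_attn(qw, kw, vw)            # (B, W, C)

        # Broadcast the two axial results back over the (H, W) grid
        attn_map = (out_h.transpose(1, 2).unsqueeze(-1) +
                    out_w.transpose(1, 2).unsqueeze(-2))  # (B, C, H, W)
        return self.proj(attn_map + self.detail(x))

# Usage: a 64-channel, 56x56 feature map passes through unchanged in shape.
block = SqueezeAxialAttention(dim=64)
y = block(torch.randn(2, 64, 56, 56))   # -> (2, 64, 56, 56)
```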



Data Availability

The datasets generated and/or analysed during the current study are available in the ImageNet (Deng et al., 2009) (https://www.image-net.org/), COCO (Caesar et al., 2018) (https://cocodataset.org), ADE20K (Zhou et al., 2017) (https://groups.csail.mit.edu/vision/datasets/ADE20K/), Cityscapes (Cordts et al., 2016) (https://www.cityscapes-dataset.com), Pascal Context (Mottaghi et al., 2014) (https://cs.stanford.edu/~roozbeh/pascal-context/) and COCO-Stuff (Caesar et al., 2018) (https://github.com/nightrome/cocostuff?tab=readme-ov-file) repositories.

References

  • Caesar, H., Uijlings, J. & Ferrari, V. (2018). Coco-stuff: Thing and stuff classes in context. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Cao, Y., Xu, J., Lin, S., Wei, F. & Hu, H. (2019). Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In IEEE International Conference on Computer Vision Workshops.

  • Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L. & Liu, Z. (2022). Mobile-former: Bridging mobilenet and transformer. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Chen, Y., Kalantidis, Y., Li, J., Yan, S. & Feng, J. (2018). \({\rm A}^2\)-nets: Double attention networks. In Advances in Neural Information Processing Systems.

  • Chen, L.-C., Papandreou, G., Schroff, F. & Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint.

  • Chen, Q., Wu, Q., Wang, J., Hu, Q., Hu, T., Ding, E., Cheng, J. & Wang, J. (2022). Mixformer: Mixing features across windows and dimensions. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F. & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision.

  • Cho, J. H. & Hariharan, B. (2019). On the efficacy of knowledge distillation. In IEEE International Conference on Computer Vision.

  • Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. (2021). Rethinking attention with performers. In International Conference on Learning Representations.

  • TNN Contributors. (2019). TNN: A high-performance, lightweight neural network inference framework.

  • MMSegmentation Contributors. (2020). MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S. & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.

  • Gao, H., Wang, Z. & Ji, S. (2020). Kronecker attention networks. In ACM SIGKDD.

  • He, T., Shen, C., Tian, Z., Gong, D., Sun, C. & Yan, Y. (2019). Knowledge adaptation for efficient semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Heo, B., Kim, J., Yun, S., Park, H., Kwak, N. & Choi, J. Y. (2019). A comprehensive overhaul of feature distillation. In IEEE International Conference on Computer Vision.

  • Hinton, G., Vinyals, O. & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint.

  • Ho, J., Kalchbrenner, N., Weissenborn, D. & Salimans, T. (2019). Axial attention in multidimensional transformers. arXiv preprint.

  • Hong, Y., Pan, H., Sun, W. & Jia, Y. (2021). Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv preprint.

  • Hou, Q., Zhang, L., Cheng, M.-M. & Feng, J. (2020). Strip pooling: Rethinking spatial pooling for scene parsing. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Hou, Q., Zhou, D. & Feng, J. (2021). Coordinate attention for efficient mobile network design. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al. (2019). Searching for mobilenetv3. In IEEE International Conference on Computer Vision.

  • Hu, B., Zhou, S., Xiong, Z. & Wu, F. (2022). Cross-resolution distillation for efficient 3d medical image registration. IEEE Transactions on Circuits and Systems for Video Technology.

  • Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G. & Fu, B. (2021). Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv preprint.

  • Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y. & Liu, W. (2019). Ccnet: Criss-cross attention for semantic segmentation. In IEEE International Conference on Computer Vision.

  • Huang, Z., Wei, Y., Wang, X., Liu, W., Huang, T. S., Shi, H. (2021). Alignseg: Feature-aligned segmentation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Huang, L., Yuan, Y., Guo, J., Zhang, C., Chen, X. & Wang, J. (2019). Interlaced sparse self-attention for semantic segmentation. arXiv preprint.

  • Ioffe, S. & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning.

  • Kim, J., Park, S. & Kwak, N. (2018). Paraphrasing complex network: Network compression via factor transfer. Advances in Neural Information Processing Systems.

  • Kingma, D. P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint.

  • Kirillov, A., Girshick, R., He, K. & Dollár, P. (2019). Panoptic feature pyramid networks. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Li, J., Hassani, A., Walton, S. & Shi, H. (2021). Convmlp: Hierarchical convolutional mlps for vision. arXiv preprint.

  • Li, Y., Hu, J., Wen, Y., Evangelidis, G., Salahi, K., Wang, Y., Tulyakov, S. & Ren, J. (2023). Rethinking vision transformers for mobilenet size and speed. In IEEE International Conference on Computer Vision.

  • Li, Z., Li, X., Yang, L., Zhao, B., Song, R., Luo, L., Li, J. & Yang, J. (2023). Curriculum temperature for knowledge distillation. In AAAI Conference on Artificial Intelligence.

  • Li, X., Li, X., You, A., Zhang, L., Cheng, G., Yang, K., Tong, Y. & Lin, Z. (2021). Towards efficient scene understanding via squeeze reasoning. IEEE Transactions on Image Processing.

  • Li, X., Li, X., Zhang, L., Cheng, G., Shi, J., Lin, Z., Tan, S. & Tong, Y. (2020). Improving semantic segmentation via decoupled body and edge supervision. In European Conference on Computer Vision.

  • Li, H., Xiong, P., Fan, H. & Sun, J. (2019). Dfanet: Deep feature aggregation for real-time semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Li, Z., Ye, J., Song, M., Huang, Y. & Pan, Z. (2021). Online knowledge distillation for efficient pose estimation. In IEEE International Conference on Computer Vision.

  • Li, Y., Yuan, G., Wen, Y., Hu, E., Evangelidis, G., Tulyakov, S., Wang, Y. & Ren, J. (2022). Efficientformer: Vision transformers at mobilenet speed. arXiv preprint.

  • Li, X., Zhang, L., Cheng, G., Yang, K., Tong, Y., Zhu, X. & Xiang, T. (2021). Global aggregation then local distribution for scene parsing. IEEE Transactions on Image Processing.

  • Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). Focal loss for dense object detection. In IEEE International Conference on Computer Vision.

  • Liu, Y., Chen, K., Liu, C., Qin, Z., Luo, Z. & Wang, J. (2019). Structured knowledge distillation for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE International Conference on Computer Vision.

  • Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L. & Shazeer, N. (2018). Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations.

  • Liu, R., Yang, K., Roitberg, A., Zhang, J., Peng, K., Liu, H. & Stiefelhagen, R. (2022). Transkd: Transformer knowledge distillation for efficient semantic segmentation. arXiv preprint.

  • Long, J., Shelhamer, E. & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Luong, M.-T., Pham, H. & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint.

  • Ma, N., Zhang, X., Zheng, H.-T. & Sun, J. (2018). Shufflenet v2: Practical guidelines for efficient cnn architecture design. In European Conference on Computer Vision.

  • Mehta, S. & Rastegari, M. (2022). Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. In International Conference on Learning Representations.

  • Mehta, S. & Rastegari, M. (2022). Separable self-attention for mobile vision transformers. arXiv preprint.

  • Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A. & Ghasemzadeh, H. (2020). Improved knowledge distillation via teacher assistant. In AAAI Conference on Artificial Intelligence.

  • Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W., Fidler, S., Urtasun, R. & Yuille, A. (2014). The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Pan, J., Bulat, A., Tan, F., Zhu, X., Dudziak, L., Li, H., Tzimiropoulos, G. & Martinez, B. (2022). Edgevits: Competing light-weight cnns on mobile devices with vision transformers. In European Conference on Computer Vision.

  • Pan, X., Ge, C., Lu, R., Song, S., Chen, G., Huang, Z. & Huang, G. (2022). On the integration of self-attention and convolution. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Park, W., Kim, D., Lu, Y. & Cho, M. (2019). Relational knowledge distillation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Poudel, R. P., Liwicki, S. & Cipolla, R. (2019). Fast-scnn: Fast semantic segmentation network. In British Machine Vision Conference.

  • Qi, L., Kuen, J., Gu, J., Lin, Z., Wang, Y., Chen, Y., Li, Y. & Jia, J. (2021). Multi-scale aligned distillation for low-resolution detection. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C. & Bengio, Y. (2014). Fitnets: Hints for thin deep nets. arXiv preprint.

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L.-C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Shen, Z., Zhang, M., Zhao, H., Yi, S. & Li, H. (2021). Efficient attention: Attention with linear complexities. In IEEE Winter Conference on Applications of Computer Vision.

  • Tan, M. & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning.

  • Tang, S., Sun, T., Peng, J., Chen, G., Hao, Y., Lin, M., Xiao, Z., You, J. & Liu, Y. (2023). Pp-mobileseg: Explore the fast and accurate semantic segmentation model on mobile devices. arXiv preprint.

  • Tian, Y., Krishnan, D. & Isola, P. (2019). Contrastive representation distillation. arXiv preprint.

  • Vasu, P. K. A., Gabriel, J., Zhu, J., Tuzel, O. & Ranjan, A. (2022). An improved one millisecond mobile backbone. arXiv preprint.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems.

  • Wan, Q., Huang, Z., Lu, J., Yu, G. & Zhang, L. (2023). Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In International Conference on Learning Representations.

  • Wang, J., Chen, Y., Zheng, Z., Li, X., Cheng, M.-M. & Hou, Q. (2024). Crosskd: Cross-head knowledge distillation for object detection. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Wang, S., Li, B. Z., Khabsa, M., Fang, H. & Ma, H. (2020). Linformer: Self-attention with linear complexity. arXiv preprint.

  • Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P. & Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In IEEE International Conference on Computer Vision.

  • Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A. & Chen, L.-C. (2020). Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In European Conference on Computer Vision.

  • Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. (2018). Cbam: Convolutional block attention module. In European Conference on Computer Vision.

  • Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M. & Luo, P. (2021). Segformer: Simple and efficient design for semantic segmentation with transformers. In Advances in Neural Information Processing Systems.

  • Xu, W., Xu, Y., Chang, T. & Tu, Z. (2021). Co-scale conv-attentional image transformers. In IEEE International Conference on Computer Vision.

  • Yan, H., Li, Z., Li, W., Wang, C., Wu, M. & Zhang, C. (2021). Contnet: Why not use convolution and transformer at the same time? arXiv preprint.

  • Yang, C., Wang, Y., Zhang, J., Zhang, H., Wei, Z., Lin, Z. & Yuille, A. (2022). Lite vision transformer with enhanced self-attention. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Yang, C., Xie, L., Su, C. & Yuille, A. L. (2019). Snapshot distillation: Teacher-student optimization in one generation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Yang, C., Zhou, H., An, Z., Jiang, X., Xu, Y. & Zhang, Q. (2022). Cross-image relational knowledge distillation for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Yim, J., Joo, D., Bae, J. & Kim, J. (2017). A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Yu, C., Gao, C., Wang, J., Yu, G., Shen, C. & Sang, N. (2021). Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision.

  • Yu, C., Wang, J., Peng, C., Gao, C., Yu, G. & Sang, N. (2018). Bisenet: Bilateral segmentation network for real-time semantic segmentation. In European Conference on Computer Vision.

  • Yuan, Y., Chen, X. & Wang, J. (2020). Object-contextual representations for semantic segmentation. In European Conference on Computer Vision.

  • Yuan, Y., Fu, R., Huang, L., Lin, W., Zhang, C., Chen, X. & Wang, J. (2021). Hrformer: High-resolution transformer for dense prediction. arXiv preprint.

  • Zhang, L., Chen, M., Arnab, A., Xue, X. & Torr, P. H. (2022). Dynamic graph message passing networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Zhang, H., Hu, W. & Wang, X. (2022). Edgeformer: Improving light-weight convnets by learning from vision transformers. arXiv preprint.

  • Zhang, W., Huang, Z., Luo, G., Chen, T., Wang, X., Liu, W., Yu, G. & Shen, C. (2022). Topformer: Token pyramid transformer for mobile semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Zhang, Y., Xiang, T., Hospedales, T. M. & Lu, H. (2018). Deep mutual learning. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Zhang, L., Xu, D., Arnab, A. & Torr, P. H. (2020). Dynamic graph message passing networks. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Zhao, B., Cui, Q., Song, R., Qiu, Y. & Liang, J. (2022). Decoupled knowledge distillation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Zhao, H., Qi, X., Shen, X., Shi, J. & Jia, J. (2018). Icnet for real-time semantic segmentation on high-resolution images. In European Conference on Computer Vision.

  • Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H., et al. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A. & Torralba, A. (2017). Scene parsing through ade20k dataset. In IEEE Conference on Computer Vision and Pattern Recognition.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62376060).

Author information


Corresponding author

Correspondence to Li Zhang.

Additional information

Communicated by Kaiyang Zhou.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wan, Q., Huang, Z., Lu, J. et al. SeaFormer++: Squeeze-Enhanced Axial Transformer for Mobile Visual Recognition. Int J Comput Vis 133, 3645–3666 (2025). https://doi.org/10.1007/s11263-025-02345-2
