
Vision Transformers: From Semantic Segmentation to Dense Prediction

International Journal of Computer Vision

Abstract

The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representations with a full receptive field at every layer, attending across all image patches, in contrast to the gradually increasing receptive fields of CNNs across layers and of other alternatives (e.g., large kernels and atrous convolution). In this work, we explore for the first time the global context learning potential of ViTs for dense visual prediction (e.g., semantic segmentation). Our motivation is that by learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependencies, which are critical for dense prediction tasks. We first demonstrate that, by encoding an image as a sequence of patches, a vanilla ViT without local convolution or resolution reduction can yield stronger visual representations for semantic segmentation. For example, our model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, first on the test leaderboard on the day of submission) and performs competitively on Cityscapes. However, the basic ViT architecture falls short in broader dense prediction applications, such as object detection and instance segmentation, due to its lack of a pyramidal structure, high computational cost, and insufficient local context. To tackle general dense visual prediction tasks in a cost-effective manner, we further formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture. Extensive experiments show that our methods achieve appealing performance on a variety of dense prediction tasks (e.g., object detection, instance segmentation, and semantic segmentation) as well as image classification. Our code and models are available at https://github.com/fudan-zvg/SETR.
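As a concrete illustration of the idea above, the following is a minimal sketch of encoding an image as a patch sequence and applying Transformer layers whose self-attention spans all patches at every layer, followed by a simple bilinear upsampling head for per-pixel prediction. It is not the released SETR code; the dimensions, module names, class count, and the toy decoder are illustrative assumptions.

```python
# Minimal sketch of the sequence-to-sequence segmentation idea: patches -> tokens ->
# full-receptive-field self-attention at every layer -> per-patch class scores.
# NOT the released SETR implementation; sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySETR(nn.Module):
    def __init__(self, img_size=512, patch_size=16, dim=256, depth=4, heads=8, num_classes=150):
        super().__init__()
        self.grid = img_size // patch_size          # patches per spatial side
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classifier = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x):                           # x: (B, 3, H, W)
        B, _, H, W = x.shape
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = tokens + self.pos_embed
        tokens = self.encoder(tokens)               # full receptive field per layer
        feat = tokens.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        logits = self.classifier(feat)              # per-patch class scores
        return F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)

if __name__ == "__main__":
    out = ToySETR()(torch.randn(1, 3, 512, 512))
    print(out.shape)                                # torch.Size([1, 150, 512, 512])
```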


Data Availability Statement

The datasets generated and/or analysed during the current study are available in the ImageNet (Russakovsky et al., 2015) (https://www.image-net.org/), ImageNet-v2 (Recht et al., 2019) (https://github.com/modestyachts/ImageNetV2), COCO (Lin et al., 2014) (https://cocodataset.org), ADE20K (Zhou et al., 2019) (https://groups.csail.mit.edu/vision/datasets/ADE20K/), Cityscapes (Cordts et al., 2016) (https://www.cityscapes-dataset.com), and Pascal Context (Mottaghi et al., 2014) (https://cs.stanford.edu/~roozbeh/pascal-context/) repositories.

References

  • Abnar, S., & Zuidema, W. (2020). Quantifying attention flow in transformers. In Annual meeting of the association for computational linguistics, (pp. 4190–4197).

  • Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495.

  • Bai, S., Torr, P., et al. (2021). Visual parser: Representing part-whole hierarchies with transformers. arXiv preprint arXiv:2107.05790.

  • Bao, H., Dong, L., Piao, S., & Wei, F. (2022) Beit: Bert pre-training of image transformers. In International conference on learning representations.

  • Bello, I., Zoph, B., Vaswani, A., Shlens, J., & Le, Q.V. (2019). Attention augmented convolutional networks. In IEEE international conference on computer vision, (pp. 10076–10085).

  • Cao, Y., Xu, J., Lin, S., Wei, F., & Hu, H. (2019). Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In IEEE international conference on computer vision workshops.

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European conference on computer vision, (pp. 213–229).

  • Chen, C.-F. R., Fan, Q., & Panda, R. (2021). Crossvit: Cross-attention multi-scale vision transformer for image classification. In IEEE international conference on computer vision, (pp. 357–366).

  • Chen, C. -F., Panda, R., & Fan, Q. (2022). Regionvit: Regional-to-local attention for vision transformers. In International conference on learning representations.

  • Chen, L. -C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2015). Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International conference on learning representations.

  • Chen, L. -C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In European conference on computer vision, (pp. 801–818).

  • Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.

  • Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In IEEE conference on computer vision and pattern recognition, (pp. 1251–1258).

  • Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., & Shen, C. (2021). Twins: Revisiting the design of spatial attention in vision transformers. In Advances in Neural Information Processing Systems, (pp. 9355–9366).

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In IEEE conference on computer vision and pattern recognition, (pp. 3213–3223).

  • Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021) Coatnet: Marrying convolution and attention for all data sizes. In Advances in Neural Information Processing Systems, (pp. 3965–3977).

  • Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-xl: Attentive language models beyond a fixed-length context. In Annual Meeting of the Association for Computational Linguistics, (pp. 2978–2988). Association for Computational Linguistics.

  • Dao, T., Fu, D., Ermon, S., Rudra, A., & Ré, C. (2022). Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35, 16344–16359.

  • d’Ascoli, S., Touvron, H., Leavitt, M. L., Morcos, A. S., Biroli, G., & Sagun, L. (2021). Convit: Improving vision transformers with soft convolutional inductive biases. In International conference on machine learning, (pp. 2286–2296). PMLR.

  • Devlin, J., Chang, M. -W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (pp. 4171–4186). Association for Computational Linguistics.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations.

  • Fu, J., Liu, J., Tian, H., Fang, Z., & Lu, H. (2019). Dual attention network for scene segmentation. In IEEE conference on computer vision and pattern recognition, (pp. 3146–3154).

  • Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in transformer. Advances in Neural Information Processing Systems, 34, 15908–15919.

  • Hassani, A., Walton, S., Li, J., Li, S., & Shi, H. (2023). Neighborhood attention transformer. In IEEE conference on computer vision and pattern recognition, (pp. 6185–6194).

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017) Mask r-cnn. In IEEE international conference on computer vision, (pp. 2961–2969).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition, (pp. 770–778).

  • Heo, B., Yun, S., Han, D., Chun, S., Choe, J., & Oh, S. J. (2021) Rethinking spatial dimensions of vision transformers. In IEEE international conference on computer vision, (pp. 11936–11945).

  • Ho, J., Kalchbrenner, N., Weissenborn, D., & Salimans, T. (2019). Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180.

  • Holschneider, M., Kronland-Martinet, R., Morlet, J., & Tchamitchian, P. (1990). A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space Proceedings of the International Conference, Marseille, France, December 14–18, 1987, (pp. 286–297). Springer.

  • Hu, J., Shen, L., & Sun, G. (2018) Squeeze-and-excitation networks. In IEEE conference on computer vision and pattern recognition, (pp. 7132–7141).

  • Hu, H., Zhang, Z., Xie, Z., & Lin, S. (2019). Local relation networks for image recognition. In IEEE international conference on computer vision, (pp. 3464–3473).

  • Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In IEEE conference on computer vision and pattern recognition, (pp. 4700–4708).

  • Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., & Liu, W. (2019). Ccnet: Criss-cross attention for semantic segmentation. In IEEE international conference on computer vision, (pp. 603–612).

  • Kirillov, A., Girshick, R., He, K., & Dollár, P. (2019). Panoptic feature pyramid networks. In IEEE conference on computer vision and pattern recognition, (pp. 6399–6408).

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.

  • Larsson, G., Maire, M., & Shakhnarovich, G. (2016) Fractalnet: Ultra-deep neural networks without residuals. In International conference on learning representations.

  • Li, Z., Liu, X., Drenkow, N., Ding, A., Creighton, F. X., Taylor, R. H., & Unberath, M. (2021). Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In IEEE conference on computer vision and pattern recognition, (pp. 6197–6206).

  • Li, J., Yan, Y., Liao, S., Yang, X., & Shao, L. (2021). Local-to-global self-attention in vision transformers. arXiv preprint arXiv:2107.04735.

  • Li, Y., Zhang, K., Cao, J., Timofte, R., & Van Gool, L. (2021). Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707.

  • Li, X., Zhang, L., You, A., Yang, M., Yang, K., & Tong, Y. (2019). Global aggregation then local distribution in fully convolutional networks. In British machine vision conference, (p. 244).

  • Lin, T. -Y., Dollár, P., Girshick, R. B., He, K., Hariharan, B., & Belongie, S. J. (2017) Feature pyramid networks for object detection. In IEEE conference on computer vision and pattern recognition, (pp. 2117–2125).

  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal loss for dense object detection. In IEEE international conference on computer vision, (pp. 2980–2988).

  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision, (pp. 740–755). Springer.

  • Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., & Dong, L., et al. (2022). Swin transformer v2: Scaling up capacity and resolution. In IEEE conference on computer vision and pattern recognition, (pp. 12009–12019).

  • Liu, Z., Li, X., Luo, P., Loy, C. C., & Tang, X.(2015). Semantic image segmentation via deep parsing network. In IEEE international conference on computer vision, (pp. 1377–1385).

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE international conference on computer vision, (pp. 10012–10022).

  • Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022) A convnet for the 2020s. In IEEE conference on computer vision and pattern recognition, (pp. 11976–11986).

  • Liu, R., Yuan, Z., Liu, T., & Xiong, Z. (2021). End-to-end lane shape prediction with transformers. In IEEE winter conference on applications of computer vision, (pp. 3694–3702).

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In IEEE conference on computer vision and pattern recognition, (pp. 3431–3440).

  • Mottaghi, R., Chen, X., Liu, X., Cho, N. -G., Lee, S. -W., Fidler, S., Urtasun, R., & Yuille, A. (2014) The role of context for object detection and semantic segmentation in the wild. In IEEE conference on computer vision and pattern recognition, (pp. 891–898).

  • Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. In IEEE international conference on computer vision, (pp. 1520–1528).

  • OpenMMLab (2020). MMSegmentation. https://github.com/open-mmlab/mmsegmentation.

  • Peng, C., Zhang, X., Yu, G., Luo, G., & Sun, J. (2017). Large kernel matters–improve semantic segmentation by global convolutional network. In IEEE conference on computer vision and pattern recognition, (pp. 4353–4361).

  • Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., & Shlens, J. (2019). Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems, (pp. 68–80).

  • Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do imagenet classifiers generalize to imagenet? In International conference on machine learning, (pp. 5389–5400). PMLR.

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International conference on medical image computing and computer assisted intervention, (pp. 234–241). Springer.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211–252.

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE conference on computer vision and pattern recognition, (pp. 4510–4520).

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations.

  • Song, C.H., Han, H.J., & Avrithis, Y. (2022). All the attention you need: Global-local, spatial-channel attention for image retrieval. In IEEE winter conference on applications of computer vision, (pp. 2754–2763).

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition, (pp. 1–9).

  • Touvron, H., Cord, M., & Jégou, H. (2022) Deit iii: Revenge of the vit. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, (pp. 516–533). Springer.

  • Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In International conference on machine learning, (pp. 10347–10357). PMLR.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, (pp. 5998–6008).

  • Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). Graph attention networks. In International conference on learning representations.

  • Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In IEEE conference on computer vision and pattern recognition, (pp. 7794–7803).

  • Wang, W., Xie, E., Li, X., Fan, D. -P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In IEEE international conference on computer vision, (pp. 568–578).

  • Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., & Chen, L. -C. (2020). Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In European conference on computer vision, (pp. 108–126). Springer.

  • Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2022). Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3), 415–424.

  • Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., & Zhang, L. (2021). Cvt: Introducing convolutions to vision transformers. In IEEE International Conference on Computer Vision, (pp. 22–31).

  • Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In European conference on computer vision, (pp. 418–434).

  • Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017) Aggregated residual transformations for deep neural networks. In IEEE conference on computer vision and pattern recognition, (pp. 1492–1500).

  • Yan, H., Li, Z., Li, W., Wang, C., Wu, M., & Zhang, C. (2021) Contnet: Why not use convolution and transformer at the same time? arXiv preprint arXiv:2104.13497.

  • Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, (pp. 5754–5764).

  • Yang, C., Qiao, S., Yu, Q., Yuan, X., Zhu, Y., Yuille, A., Adam, H., & Chen, L.-C. (2022) Moat: Alternating mobile convolution and attention brings strong vision models. In International conference on learning representations.

  • Yang, M., Yu, K., Zhang, C., Li, Z., & Yang, K. (2018). Denseaspp for semantic segmentation in street scenes. In IEEE conference on computer vision and pattern recognition, (pp. 3684–3692).

  • Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In International conference on learning representations.

  • Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., & Sang, N. (2018) Bisenet: Bilateral segmentation network for real-time semantic segmentation. In European conference on computer vision, (pp. 325–341).

  • Yuan, Y., & Wang, J. (2018). Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916.

  • Yuan, Y., Chen, X., & Wang, J. (2020). Object-contextual representations for semantic segmentation. In European conference on computer vision, (pp. 173–190). Springer.

  • Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017) mixup: Beyond empirical risk minimization. In International conference on learning representations.

  • Zhang, P., Dai, X., Yang, J., Xiao, B., Yuan, L., Zhang, L., & Gao, J. (2021) Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In IEEE International conference on computer vision, (pp. 2998–3008).

  • Zhang, L., Li, X., Arnab, A., Yang, K., Tong, Y., & Torr, P.H. (2019). Dual graph convolutional network for semantic segmentation. In British Machine Vision Conference, (p. 254).

  • Zhang, L., Xu, D., Arnab, A., & Torr, P. H. (2020). Dynamic graph message passing networks. In IEEE conference on computer vision and pattern recognition.

  • Zhang, Z., Zhang, H., Zhao, L., Chen, T., & Pfister, T. (2021). Aggregating nested transformers. arXiv preprint arXiv:2105.12723.

  • Zhao, H., Jia, J., & Koltun, V. (2020). Exploring self-attention for image recognition. In IEEE conference on computer vision and pattern recognition.

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In IEEE conference on computer vision and pattern recognition, (pp. 2881–2890).

  • Zhao, H., Zhang, Y., Liu, S., Shi, J., Change Loy, C., Lin, D., & Jia, J. (2018). Psanet: Point-wise spatial attention network for scene parsing. In European conference on computer vision, (pp. 267–283).

  • Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., & Torr, P. H. S. (2015). Conditional random fields as recurrent neural networks. In IEEE international conference on computer vision, (pp. 1529–1537).

  • Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H., & Zhang, L. (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In IEEE conference on computer vision and pattern recognition, (pp. 6881–6890).

  • Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., & Torralba, A. (2019). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127, 302–321.


Acknowledgements

We thank Hengshuang Zhao, Zekun Luo and Yabiao Wang for valuable discussions. This work was supported in part by STI2030-Major Projects (Grant No. 2021ZD0200204), National Natural Science Foundation of China (Grant No. 62106050 and 62376060), Natural Science Foundation of Shanghai (Grant No. 22ZR1407500), USyd-Fudan BISA Flagship Research Program and Lingang Laboratory (Grant No. LG-QS-202202-07).

Author information

Corresponding author

Correspondence to Li Zhang.

Additional information

Communicated by Zaid Harchaoui.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Visualizations

1.1 Position Embedding

The visualization of the learned position embeddings in Fig. 10 shows that the model learns to encode spatial distance within the image through the similarity of the position embeddings (Fig. 11).
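Such a similarity visualization can be reproduced along the following lines; this is a hedged sketch rather than the paper's plotting code, and the embedding tensor below is a random stand-in with an assumed 32x32 patch grid.

```python
# Sketch: for one query patch, plot the cosine similarity of its learned position
# embedding to every other patch's embedding as a heat map on the patch grid.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

grid = 32                                   # assumed 32x32 patch grid (e.g., 512/16)
pos_embed = torch.randn(grid * grid, 256)   # stand-in for the learned embeddings

query_idx = 17 * grid + 17                  # an arbitrary patch near the image centre
sim = F.cosine_similarity(pos_embed[query_idx:query_idx + 1], pos_embed, dim=-1)

plt.imshow(sim.reshape(grid, grid).numpy(), cmap="viridis")
plt.title("Cosine similarity of one position embedding to all others")
plt.colorbar()
plt.savefig("pos_embed_similarity.png")
```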

1.2 Features

Figure 12 shows the feature visualization of our SETR-PUP. For the encoder, the 24 output features of the 24 Transformer layers, namely \(Z^1-Z^{24}\), are collected. In addition, the 5 decoder features (\(U^1-U^5\)) taken right after each bilinear interpolation in the decoder head are visualized.
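One simple way to collect such per-layer outputs is via forward hooks. The sketch below uses a toy PyTorch encoder as a stand-in for the 24-layer SETR encoder; the actual module paths in SETR-PUP will differ.

```python
# Hedged sketch: collect per-layer outputs (Z^i) with forward hooks on a toy encoder.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)   # stand-in for 24 SETR layers

features = {}

def save(name):
    def hook(module, inputs, output):
        features[name] = output.detach()               # Z^i for layer i
    return hook

for i, block in enumerate(encoder.layers, start=1):
    block.register_forward_hook(save(f"Z{i}"))

with torch.no_grad():
    encoder(torch.randn(1, 1024, 64))                  # (batch, tokens, dim)

print(sorted(features))                                # ['Z1', 'Z2', 'Z3', 'Z4']
```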

1.3 Attention Maps

The attention maps of each Transformer layer (Figs. 13, 14) are also of interest. T-Large has 24 layers with 16 heads each. Similar to Abnar and Zuidema (2020), we take a recursive perspective, aggregating attention across layers. Figure 11 shows the attention maps of several selected spatial points (marked in red).
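The recursive aggregation can be sketched as attention rollout in the style of Abnar and Zuidema (2020): per-layer attention matrices, averaged over heads, combined with an identity term for the residual connection, and row-normalized, are multiplied across layers. The tensors below are random stand-ins for the real attention maps, so the sizes are illustrative only.

```python
# Hedged sketch of attention rollout (Abnar & Zuidema, 2020) over toy attention maps.
import torch

def attention_rollout(attn_per_layer):
    # attn_per_layer: list of (heads, N, N) attention matrices, one per layer
    n = attn_per_layer[0].shape[-1]
    rollout = torch.eye(n)
    for attn in attn_per_layer:
        a = attn.mean(dim=0)                     # average over heads
        a = a + torch.eye(n)                     # account for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)      # re-normalize rows
        rollout = a @ rollout                    # propagate through the layer
    return rollout                               # (N, N): token-to-token influence

layers, heads, tokens = 24, 16, 64               # toy sizes for illustration
attn = [torch.rand(heads, tokens, tokens).softmax(dim=-1) for _ in range(layers)]
influence_of_point = attention_rollout(attn)[5]  # influence map for one query token
print(influence_of_point.shape)                  # torch.Size([64])
```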

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, L., Lu, J., Zheng, S. et al. Vision Transformers: From Semantic Segmentation to Dense Prediction. Int J Comput Vis 132, 6142–6162 (2024). https://doi.org/10.1007/s11263-024-02173-w

