
SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting

Published in: International Journal of Computer Vision

Abstract

End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, recent state-of-the-art methods typically couple detection and recognition only through a shared backbone, which does not directly exploit the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2, which seeks a better synergy between text detection and recognition. Specifically, we strengthen the relationship between the two tasks using novel Recognition Conversion and Recognition Alignment modules. Recognition Conversion explicitly guides text localization through the recognition loss, while Recognition Alignment dynamically extracts text features for recognition based on the detection predictions. This simple yet effective design yields a concise framework that requires neither an additional rectification module nor character-level annotations for arbitrarily shaped text. Furthermore, the parameters of the detector are greatly reduced without performance degradation by introducing a Box Selection Schedule. Qualitative and quantitative experiments demonstrate that SwinTextSpotter v2 achieves state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks. The code will be available at https://github.com/mxin262/SwinTextSpotterv2.
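The Recognition Alignment idea described above — extracting per-instance recognition features directly from detection predictions — can be illustrated with a toy sketch. This is a hypothetical NumPy illustration of the concept, not the paper's actual implementation; the function name `recognition_alignment`, the tensor shapes, and the use of mask-weighted average pooling are all assumptions made for the example.

```python
import numpy as np

def recognition_alignment(features: np.ndarray, det_masks: np.ndarray) -> np.ndarray:
    """Conceptual sketch (hypothetical): pool backbone features for each
    detected text instance using its soft detection mask.

    features:  (C, H, W) backbone feature map
    det_masks: (N, H, W) soft instance masks predicted by the detector
    returns:   (N, C) per-instance descriptors fed to the recognizer

    Because the pooling weights come from the detection masks, the
    recognition branch is conditioned on detection output — and in a
    differentiable framework, the recognition loss would flow back
    through the masks to the detector.
    """
    channels, height, width = features.shape
    flat = features.reshape(channels, -1)                    # (C, H*W)
    masks = det_masks.reshape(det_masks.shape[0], -1)        # (N, H*W)
    # Normalize each mask so it acts as an averaging weight over pixels.
    weights = masks / (masks.sum(axis=1, keepdims=True) + 1e-6)
    return weights @ flat.T                                  # (N, C)
```

For instance, with a constant feature map and two disjoint binary masks, each instance descriptor is simply the (mask-averaged) feature value of its region.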




Acknowledgements

This research is supported in part by the National Natural Science Foundation of China (Grant Nos. 62206104, 62476093, and 62225603) and the National Key R&D Program of China (Grant No. 2022YFC2305102).

Author information


Corresponding author

Correspondence to Yuliang Liu.

Additional information

Communicated by Svetlana Lazebnik.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Huang, M., Peng, D., Li, H. et al. SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting. Int J Comput Vis 133, 5281–5301 (2025). https://doi.org/10.1007/s11263-025-02428-0
