
SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting

Published in: International Journal of Computer Vision

Abstract

End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, recent state-of-the-art methods typically couple detection and recognition only through a shared backbone, which does not directly exploit the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2, which seeks a better synergy between text detection and recognition. Specifically, we strengthen the relationship between the two tasks using novel Recognition Conversion and Recognition Alignment modules. Recognition Conversion explicitly guides text localization through the recognition loss, while Recognition Alignment dynamically extracts text features for recognition based on the detection predictions. This simple yet effective design yields a concise framework that requires neither an additional rectification module nor character-level annotations for arbitrarily shaped text. Furthermore, the parameters of the detector are greatly reduced without performance degradation by introducing a Box Selection Schedule. Qualitative and quantitative experiments demonstrate that SwinTextSpotter v2 achieves state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks. The code will be available at https://github.com/mxin262/SwinTextSpotterv2.
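The Recognition Alignment idea described above — extracting per-instance recognition features directly from detection predictions — can be illustrated with a toy sketch. This is a hypothetical NumPy illustration of the concept, not the paper's actual implementation; the function name `recognition_alignment`, the tensor shapes, and the use of mask-weighted average pooling are all assumptions made for the example.

```python
import numpy as np

def recognition_alignment(features: np.ndarray, det_masks: np.ndarray) -> np.ndarray:
    """Conceptual sketch (hypothetical): pool backbone features for each
    detected text instance using its soft detection mask.

    features:  (C, H, W) backbone feature map
    det_masks: (N, H, W) soft instance masks predicted by the detector
    returns:   (N, C) per-instance descriptors fed to the recognizer

    Because the pooling weights come from the detection masks, the
    recognition branch is conditioned on detection output — and in a
    differentiable framework, the recognition loss would flow back
    through the masks to the detector.
    """
    channels, height, width = features.shape
    flat = features.reshape(channels, -1)                    # (C, H*W)
    masks = det_masks.reshape(det_masks.shape[0], -1)        # (N, H*W)
    # Normalize each mask so it acts as an averaging weight over pixels.
    weights = masks / (masks.sum(axis=1, keepdims=True) + 1e-6)
    return weights @ flat.T                                  # (N, C)
```

For instance, with a constant feature map and two disjoint binary masks, each instance descriptor is simply the (mask-averaged) feature value of its region.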




Acknowledgements

This research is supported in part by the National Natural Science Foundation of China (Grant Nos. 62206104, 62476093, and 62225603) and the National Key R&D Program of China (Grant No. 2022YFC2305102).

Author information


Corresponding author

Correspondence to Yuliang Liu.

Additional information

Communicated by Svetlana Lazebnik.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Huang, M., Peng, D., Li, H. et al. SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting. Int J Comput Vis 133, 5281–5301 (2025). https://doi.org/10.1007/s11263-025-02428-0
