Abstract
End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, recent state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2, which seeks to find a better synergy between text detection and recognition. Specifically, we enhance the relationship between two tasks using novel Recognition Conversion and Recognition Alignment modules. Recognition Conversion explicitly guides text localization through recognition loss, while Recognition Alignment dynamically extracts text features for recognition through the detection predictions. This simple yet effective design results in a concise framework that requires neither an additional rectification module nor character-level annotations for the arbitrarily-shaped text. Furthermore, the parameters of the detector are greatly reduced without performance degradation by introducing a Box Selection Schedule. Qualitative and quantitative experiments demonstrate that SwinTextSpotter v2 achieves state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks. The code will be available at https://github.com/mxin262/SwinTextSpotterv2.
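The abstract only names the two coupling modules, so the following is a minimal, illustrative PyTorch-style sketch — not the authors' implementation — of the core idea: a predicted text mask gates the shared RoI features before they reach the recognizer (a crude stand-in for Recognition Alignment), which means the recognition loss back-propagates into the detection branch (the spirit of Recognition Conversion). All module names, feature shapes, and the column-pooling step are assumptions made for the demo.

    import torch
    import torch.nn as nn

    class ToySpottingHead(nn.Module):
        """Illustrative only: couples a detection mask with recognition."""
        def __init__(self, channels=256, num_classes=97):
            super().__init__()
            self.mask_head = nn.Conv2d(channels, 1, kernel_size=1)  # detection branch
            self.rec_head = nn.Linear(channels, num_classes)        # recognition branch

        def forward(self, feats):
            # feats: (B, C, H, W) RoI features shared by both tasks.
            mask_logits = self.mask_head(feats)        # (B, 1, H, W) text mask
            mask = torch.sigmoid(mask_logits)
            # "Alignment" (simplified): suppress background with the predicted
            # mask before handing the features to the recognizer.
            gated = feats * mask                       # gradients reach mask_head
            seq = gated.mean(dim=2).permute(0, 2, 1)   # (B, W, C) column sequence
            char_logits = self.rec_head(seq)           # (B, W, num_classes)
            return mask_logits, char_logits

    # Because `gated` depends on the predicted mask, a recognition loss on
    # `char_logits` also updates the detection branch:
    head = ToySpottingHead()
    feats = torch.randn(2, 256, 32, 128)
    mask_logits, char_logits = head(feats)
    rec_loss = char_logits.logsumexp(-1).mean()  # placeholder loss for the demo
    rec_loss.backward()                          # gradients flow into mask_head too

The paper's actual modules are considerably richer than this toy head, but the gradient path from the recognition loss back into the detection predictions is the coupling the abstract describes.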
Acknowledgements
This research is supported in part by the National Natural Science Foundation of China (Grant Nos. 62206104, 62476093, and 62225603) and the National Key R&D Program of China (Grant No. 2022YFC2305102).
Communicated by Svetlana Lazebnik.
Cite this article
Huang, M., Peng, D., Li, H. et al. SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting. Int J Comput Vis 133, 5281–5301 (2025). https://doi.org/10.1007/s11263-025-02428-0