
RealDTT: Towards A Comprehensive Real-World Dataset for Tampered Text Detection

Published in: International Journal of Computer Vision

Abstract

The swift advancement of text manipulation in AI-generated images and the rise of fabricated documents underscore the need for detection methods that remain effective in real-world settings. While current forensics research primarily addresses tampered text in natural images, text manipulation in documents poses a more realistic and challenging problem. To improve the robustness of current detection methods and benchmarks, we develop a real-world, large-scale dataset that combines manually tampered documents with diverse automatic tampering techniques. Our work distinguishes itself from existing benchmarks through three key features. Manual tampering: human edits simulate realistic, cognitively plausible forgeries, and are often subtle and contextually coherent. Diverse generators: a rich set of manipulation types ensures coverage of both traditional and advanced tampering techniques. Multilingual and multi-scene coverage: the data spans English and Chinese text across natural scenes and documents, at varied resolutions. The resulting dataset, RealDTT, is designed to evaluate the open-set generalization of tampered-text detection models. It comprises approximately 300,000 diverse synthetic samples produced by nine distinct generative models, which, to our knowledge, is the most extensive collection of deepfake model types currently available, complemented by 4,012 meticulously hand-tampered images. Building on RealDTT, we further propose a robust tampered-text detection model, TTDMamba, which harnesses the strengths of the Mamba architecture and integrates selective scanning, high-frequency feature aggregation, and disentangled semantic axial attention to capture global information while maintaining linear complexity. Extensive experiments demonstrate the remarkable efficacy of the proposed TTDMamba.
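TTDMamba's implementation is not described in this preview, so the following is only a minimal sketch of how the three ingredients named in the abstract — selective scanning, high-frequency feature aggregation, and disentangled semantic axial attention — could plausibly be wired together. Every name below (SelectiveScan2D, HighFreqAggregation, AxialAttention, TTDBlock), the toy gated recurrence standing in for a real Mamba scan kernel, the fixed Laplacian high-pass filter, and the residual fusion order are assumptions made for illustration, not the authors' code.

```python
# Illustrative sketch only: all module names, shapes, and fusion choices
# below are assumptions, not the authors' TTDMamba implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveScan2D(nn.Module):
    """Toy stand-in for a Mamba-style selective scan: a gated cumulative
    recurrence over the flattened spatial sequence, linear in H*W.
    A real Mamba/VMamba kernel is considerably more elaborate."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        g = torch.sigmoid(self.gate(x)).flatten(2)  # per-position decay gates
        seq = x.flatten(2)                           # (B, C, H*W)
        outs, state = [], x.new_zeros(b, c)
        for t in range(seq.shape[-1]):               # input-dependent recurrence
            state = g[..., t] * state + (1 - g[..., t]) * seq[..., t]
            outs.append(state)
        return torch.stack(outs, dim=-1).view(b, c, h, w)


class HighFreqAggregation(nn.Module):
    """Extracts high-frequency residuals (assumed here: a fixed Laplacian
    high-pass filter), where tampering traces tend to concentrate, and
    fuses them back into the backbone features."""

    def __init__(self, channels: int):
        super().__init__()
        hp = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
        self.register_buffer("hp", hp.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        hf = F.conv2d(x, self.hp, padding=1, groups=x.shape[1])
        return self.fuse(torch.cat([x, hf], dim=1))


class AxialAttention(nn.Module):
    """Disentangled axial attention: attention along rows and columns
    separately, so cost stays linear per axis rather than quadratic
    in the number of pixels."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.row = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.col = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        r = x.permute(0, 2, 3, 1).reshape(b * h, w, c)   # attend along width
        r = self.row(r, r, r)[0].reshape(b, h, w, c).permute(0, 3, 1, 2)
        s = r.permute(0, 3, 2, 1).reshape(b * w, h, c)   # attend along height
        return self.col(s, s, s)[0].reshape(b, w, h, c).permute(0, 3, 2, 1)


class TTDBlock(nn.Module):
    """One hypothetical block chaining the three ingredients with
    residual connections."""

    def __init__(self, channels: int):
        super().__init__()
        self.scan = SelectiveScan2D(channels)
        self.hfa = HighFreqAggregation(channels)
        self.axial = AxialAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.scan(x)      # global context at linear cost
        x = x + self.hfa(x)       # high-frequency tampering traces
        return x + self.axial(x)  # disentangled semantic axial attention


block = TTDBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

The point of the sketch is the complexity argument: each component touches every spatial position only a constant number of times (the scan) or a per-axis linear number of times (axial attention), which is how global context can be aggregated without the quadratic cost of full self-attention.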


Data Availability

The construction of the new dataset RealDTT is detailed in Section 3. The dataset and the full implementation of TTDMamba, including pretrained models and training/inference code, will be publicly available at https://github.com/edmundhaohao/RealDTT/tree/main.


Acknowledgements

This work is partially funded by the Beijing Natural Science Foundation (4252054), the Youth Innovation Promotion Association CAS (Grant No. 2022132), and the Beijing Nova Program (20230484276).

Author information

Corresponding author

Correspondence to Junxian Duan.

Additional information

Communicated by Xin Zhao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Duan, J., Sun, H., Ji, F. et al. RealDTT: Towards A Comprehensive Real-World Dataset for Tampered Text Detection. Int J Comput Vis 133, 6993–7011 (2025). https://doi.org/10.1007/s11263-025-02515-2

