
DocScanner: Robust Document Image Rectification with Progressive Learning

  • Published in: International Journal of Computer Vision

Abstract

Compared with flatbed scanners, portable smartphones provide more convenience for physical document digitization. However, such digitized documents are often distorted due to uncontrolled physical deformations, camera positions, and illumination variations. To address this, we present DocScanner, a novel framework for document image rectification. Different from existing solutions, DocScanner tackles the problem with a progressive learning mechanism: it maintains a single estimate of the rectified image, which is progressively corrected by a recurrent architecture. The iterative refinements allow DocScanner to converge to robust and superior rectification performance, while the lightweight recurrent architecture ensures running efficiency. To further improve rectification quality, a geometric constraint based on the geometric prior between the distorted and rectified images is introduced during training. Extensive experiments are conducted on the Doc3D dataset and the DocUNet Benchmark, and the quantitative and qualitative evaluation results verify the effectiveness of DocScanner, which outperforms previous methods on OCR accuracy, image similarity, and our proposed distortion metric by a considerable margin. Furthermore, DocScanner shows superior efficiency in runtime latency and model size. The code and pre-trained models are available at https://github.com/fh2019ustc/DocScanner.
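The progressive mechanism described in the abstract — maintaining a single rectification estimate that a recurrent module repeatedly corrects — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `update_fn` is a hypothetical stand-in for the learned recurrent update network, and the hand-rolled bilinear sampler stands in for the differentiable warping layer used in practice.

```python
import numpy as np

def bilinear_sample(img, coords):
    """Backward-warp a 2D image: coords[y, x] = (src_y, src_x) to sample from."""
    H, W = img.shape
    y = np.clip(coords[..., 0], 0, H - 1)
    x = np.clip(coords[..., 1], 0, W - 1)
    y0 = np.floor(y).astype(int); x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = y - y0; wx = x - x0
    return (img[y0, x0] * (1 - wy) * (1 - wx)
            + img[y0, x1] * (1 - wy) * wx
            + img[y1, x0] * wy * (1 - wx)
            + img[y1, x1] * wy * wx)

def progressive_rectify(distorted, update_fn, iters=8):
    """Keep one flow estimate and refine it recurrently with residual updates."""
    H, W = distorted.shape
    grid = np.stack(
        np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), axis=-1
    ).astype(float)
    flow = np.zeros((H, W, 2))   # start from the identity mapping
    hidden = np.zeros((H, W, 2))  # toy recurrent state carried across iterations
    for _ in range(iters):
        # current rectification under the running flow estimate
        rectified = bilinear_sample(distorted, grid + flow)
        # predict a residual correction from the current estimate
        delta, hidden = update_fn(rectified, hidden)
        flow = flow + delta       # progressively correct the single estimate
    return bilinear_sample(distorted, grid + flow), flow
```

With a toy `update_fn` that keeps nudging the flow toward a known horizontal shift, the accumulated flow converges to that shift and the warped output matches the flat target, which mirrors how the iterative refinements converge to a stable rectification.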



Notes

  1. https://www.blender.org/.

  2. https://www.camscanner.com/.


Author information

Corresponding author

Correspondence to Houqiang Li.

Additional information

Communicated by Dimosthenis Karatzas.


Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 290 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Feng, H., Zhou, W., Deng, J. et al. DocScanner: Robust Document Image Rectification with Progressive Learning. Int J Comput Vis 133, 5343–5362 (2025). https://doi.org/10.1007/s11263-025-02431-5

