
DocScanner: Robust Document Image Rectification with Progressive Learning

  • Published in: International Journal of Computer Vision

Abstract

Compared with flatbed scanners, portable smartphones provide more convenience for physical document digitization. However, such digitized documents are often distorted due to uncontrolled physical deformations, camera positions, and illumination variations. To address this, we present DocScanner, a novel framework for document image rectification. Different from existing solutions, DocScanner tackles the problem with a progressive learning mechanism: it maintains a single estimate of the rectified image, which is progressively corrected by a recurrent architecture. The iterative refinements allow DocScanner to converge to robust and superior rectification performance, while the lightweight recurrent architecture ensures running efficiency. To further improve rectification quality, a geometric constraint based on the geometric prior between the distorted and rectified images is introduced during training. Extensive experiments are conducted on the Doc3D dataset and the DocUNet Benchmark, and the quantitative and qualitative evaluation results verify the effectiveness of DocScanner, which outperforms previous methods on OCR accuracy, image similarity, and our proposed distortion metric by a considerable margin. Furthermore, DocScanner shows superior efficiency in runtime latency and model size. The code and pre-trained models are available at https://github.com/fh2019ustc/DocScanner.
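The progressive mechanism described in the abstract — maintaining a single rectification estimate that a recurrent module repeatedly corrects — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `update_fn` is a hypothetical stand-in for the learned recurrent update network, and the hand-rolled bilinear sampler stands in for the differentiable warping layer used in practice.

```python
import numpy as np

def bilinear_sample(img, coords):
    """Backward-warp a 2D image: coords[y, x] = (src_y, src_x) to sample from."""
    H, W = img.shape
    y = np.clip(coords[..., 0], 0, H - 1)
    x = np.clip(coords[..., 1], 0, W - 1)
    y0 = np.floor(y).astype(int); x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = y - y0; wx = x - x0
    return (img[y0, x0] * (1 - wy) * (1 - wx)
            + img[y0, x1] * (1 - wy) * wx
            + img[y1, x0] * wy * (1 - wx)
            + img[y1, x1] * wy * wx)

def progressive_rectify(distorted, update_fn, iters=8):
    """Keep one flow estimate and refine it recurrently with residual updates."""
    H, W = distorted.shape
    grid = np.stack(
        np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), axis=-1
    ).astype(float)
    flow = np.zeros((H, W, 2))   # start from the identity mapping
    hidden = np.zeros((H, W, 2))  # toy recurrent state carried across iterations
    for _ in range(iters):
        # current rectification under the running flow estimate
        rectified = bilinear_sample(distorted, grid + flow)
        # predict a residual correction from the current estimate
        delta, hidden = update_fn(rectified, hidden)
        flow = flow + delta       # progressively correct the single estimate
    return bilinear_sample(distorted, grid + flow), flow
```

With a toy `update_fn` that keeps nudging the flow toward a known horizontal shift, the accumulated flow converges to that shift and the warped output matches the flat target, which mirrors how the iterative refinements converge to a stable rectification.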



Notes

  1. https://www.blender.org/.

  2. https://www.camscanner.com/.


Author information

Corresponding author

Correspondence to Houqiang Li.

Additional information

Communicated by Dimosthenis Karatzas.


Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 290 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Feng, H., Zhou, W., Deng, J. et al. DocScanner: Robust Document Image Rectification with Progressive Learning. Int J Comput Vis 133, 5343–5362 (2025). https://doi.org/10.1007/s11263-025-02431-5

