Vgarcan/pdf-reader

Document Processor

This project is a PDF document processing application designed to extract structured information from scanned documents. It leverages Optical Character Recognition (OCR) to extract text and provides a graphical user interface (GUI) to fine-tune image preprocessing parameters. However, the program is still a work in progress and has known limitations.


Features

  • Converts PDF pages into images for OCR processing.
  • Interactive GUI allows users to adjust preprocessing parameters:
    • Gaussian Blur
    • Threshold
    • Colour Inversion
  • Attempts to extract key information such as:
    • Company Name
    • Company Identifier
    • Document Purpose
  • Saves the structured data into a JSON file.
  • Supports multilingual text recognition (English, French, German, Dutch).
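
As a rough sketch of how the multilingual support might be wired up: Tesseract accepts several languages joined with `+` in its `lang` argument. The helper names `tesseract_lang` and `ocr_image` below are hypothetical and not taken from this repository; the language codes are Tesseract's standard ones, and each language needs its trained data installed alongside Tesseract.

```python
# Hypothetical helper names; only the language codes and the
# pytesseract.image_to_string call come from Tesseract/pytesseract.
LANG_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Dutch": "nld",
}


def tesseract_lang(languages):
    """Join language names into Tesseract's 'eng+fra' style string."""
    return "+".join(LANG_CODES[name] for name in languages)


def ocr_image(image, languages=("English", "French", "German", "Dutch")):
    """Run Tesseract on a PIL image or NumPy array with the given languages."""
    import pytesseract  # imported lazily so tesseract_lang works without it

    return pytesseract.image_to_string(image, lang=tesseract_lang(languages))


print(tesseract_lang(["English", "French", "German", "Dutch"]))  # eng+fra+deu+nld
```

Restricting the list to the languages actually present in a document may reduce misreads, although that has not been measured for this project.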

Current Status and Limitations

Achievements:

  • The program can process PDFs and allow interactive adjustments to improve OCR results.
  • Partial data extraction is possible (e.g., Company Identifier or Document Purpose).

Limitations:

  1. Inconsistent Data Extraction:
    • Extraction of fields like Company Name and Document Purpose is not always reliable.
    • Changing preprocessing parameters might improve one field's accuracy but can degrade another.
  2. Noise in OCR Results:
    • Extracted text may contain artefacts (e.g., /78e8) caused by noisy scans or low-quality documents. These can usually be reduced by adjusting the Threshold setting.
  3. Incomplete Regex Matching:
    • Regular expressions used for text extraction need improvement to handle diverse document formats.
  4. Manual Interaction Required:
    • Each document requires manual adjustments of parameters for optimal results, as no universal settings work for all.
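
To illustrate the kind of regex hardening the third limitation calls for, here is a minimal sketch for one field. It assumes the Company Identifier is ten digits grouped 4-3-3 and that OCR may render the separators as spaces, dots, or nothing. The pattern, the function name `find_identifier`, and the sample strings are illustrative assumptions, not the project's actual code.

```python
import re

# Assumption: identifiers are ten digits grouped 4-3-3, optionally separated
# by spaces or dots (for example "0795 785 723" or "0795.785.723").
IDENTIFIER_RE = re.compile(r"(?<!\d)(\d{4})[ .]?(\d{3})[ .]?(\d{3})(?!\d)")


def find_identifier(text):
    """Return the first identifier normalised to 'NNNN NNN NNN', or None."""
    match = IDENTIFIER_RE.search(text)
    if match is None:
        return None
    return " ".join(match.groups())


# Made-up sample strings, not real OCR output.
print(find_identifier("N° d'entreprise : 0795.785.723"))  # 0795 785 723
print(find_identifier("no identifier here"))              # None
```

The lookarounds `(?<!\d)` and `(?!\d)` stop the pattern from matching inside a longer run of digits, which is one way to avoid false positives from noisy scans.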

Project Structure

project/
├── src/
│   ├── main.py              # Main application entry point
│   ├── ocr_utils.py         # Functions for OCR processing
│   ├── text_processing.py   # Functions for extracting and saving structured data
│   └── preprocess.py        # Preprocessing for image enhancement
├── requirements.txt         # Python dependencies
└── README.md                # Documentation

Requirements

  • Python 3.7 or higher
  • Tesseract OCR
  • Poppler for PDF conversion
  • Python libraries:
    • pytesseract
    • pdf2image
    • Pillow
    • opencv-python

Installation

1. Install Python Libraries

Run the following command to install the required Python dependencies:

pip install -r requirements.txt

2. Install Tesseract OCR

Windows

  • Download Tesseract OCR from the official website.
  • During installation, ensure the "Add to PATH" option is selected.
  • If not added automatically, add the installation directory (e.g., C:\Program Files\Tesseract-OCR) to your system PATH manually:
    • Open "System Properties" > "Advanced" > "Environment Variables".
    • Under "System variables", find Path, click "Edit", and add the Tesseract directory.

Linux

  • Use your package manager to install Tesseract:
    sudo apt-get install tesseract-ocr
  • Ensure the tesseract executable is on your PATH (usually automatic).

macOS

  • Install Tesseract using Homebrew:
    brew install tesseract
  • Verify installation with:
    tesseract --version

3. Install Poppler

Windows

  • Download a Windows build of Poppler (for example, from the Poppler for Windows project).
  • Extract the ZIP file and note the directory path (e.g., C:\poppler\Library\bin).
  • Add the bin folder to your system PATH:
    • Open "System Properties" > "Advanced" > "Environment Variables".
    • Under "System variables", find Path, click "Edit", and add the Poppler bin directory.

Linux

  • Install Poppler via your package manager:
    sudo apt-get install poppler-utils

macOS

  • Install Poppler using Homebrew:
    brew install poppler

Verifying Poppler Installation

To check if Poppler is installed correctly, run the following command:

pdfinfo

Run with no arguments, pdfinfo prints its usage message, which includes the version number. If the command is not found instead, ensure that Poppler's bin directory is correctly added to your PATH.

Note: Avoid using pdfinfo --version, as it may not work on some systems. Simply running pdfinfo will suffice.
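
The same check can be scripted. The sketch below uses only the standard library to report whether the `tesseract` and `pdfinfo` executables can be found on PATH; the function name `find_tools` is a hypothetical helper, not part of this project.

```python
import shutil


def find_tools(names=("tesseract", "pdfinfo")):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

A `None` entry means the corresponding installation step above needs revisiting, most often the PATH configuration.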

Common Issues and Solutions

Issue: Tesseract not recognised on the command line

  • Cause: Tesseract is not in your PATH.
  • Solution:
    • Verify the installation directory.
    • Add the directory to your PATH manually as described above.
    • Restart your terminal or system.

Issue: Poppler not working with pdf2image

  • Cause: Poppler bin directory is not in your PATH.
  • Solution:
    • Ensure the correct Poppler bin path is added to your PATH.
    • Test with:
      pdfinfo

Issue: Missing Libraries

  • Solution:
    • Verify all Python dependencies are installed using:
      pip install -r requirements.txt
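
To see which Python dependencies are missing before reinstalling, a short standard-library check along these lines may help. The `MODULES` mapping pairs each pip package from the Requirements list with the module name it installs; the function name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# pip package name -> top-level module name it installs
MODULES = {
    "pytesseract": "pytesseract",
    "pdf2image": "pdf2image",
    "Pillow": "PIL",
    "opencv-python": "cv2",
}


def missing_packages(modules=MODULES):
    """Return the pip package names whose module cannot be found."""
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


print("Missing:", missing_packages() or "none")
```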

Usage

1. Run the Application

Launch the GUI:

python src/main.py

2. Select a PDF File

  • Use the Select PDF button to load a PDF.
  • The program converts the PDF pages into images.
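
The conversion step could look roughly like the sketch below, which uses pdf2image's `convert_from_path`. The wrapper name `pdf_to_images` and the default of 300 DPI are assumptions chosen for illustration; the repository's own code may differ.

```python
from pathlib import Path


def pdf_to_images(pdf_path, dpi=300):
    """Convert each page of a PDF into a PIL image via pdf2image/Poppler."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")

    # Imported here so the path check above works even without pdf2image.
    from pdf2image import convert_from_path

    # convert_from_path returns a list with one PIL image per page.
    return convert_from_path(str(path), dpi=dpi)
```

Calling `pdf_to_images("scan.pdf")` (a placeholder file name) would return one image per page. A higher DPI generally gives Tesseract more detail to work with, at the cost of memory and processing time.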

3. Preprocess Images

  • Adjust sliders in the GUI for parameters like blur, threshold, and inversion to optimise the image for OCR.

4. Apply OCR

  • Click the Apply OCR button to extract text from the processed image.

5. Save Results

  • Extracted data is saved as output.json in the project directory.
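
A minimal sketch of the save step, assuming the extracted fields are held in a plain dictionary. The function name `save_results` is hypothetical. Passing `ensure_ascii=False` keeps accented characters such as "É" readable in the file instead of escaping them.

```python
import json
from pathlib import Path


def save_results(data, path="output.json"):
    """Write extracted fields as UTF-8 JSON, keeping accents readable."""
    Path(path).write_text(
        json.dumps(data, ensure_ascii=False, indent=4),
        encoding="utf-8",
    )
```

For example, `save_results({"Company Name": None, "Company Identifier": "0795 785 723"})` writes `null` for the missing field, matching the shape of the output shown later in this README.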

Process Overview

  1. PDF Conversion:

    • Each page of the PDF is converted into an image using the pdf2image library.
  2. Image Preprocessing:

    • Images are processed with OpenCV:
      • Greyscale Conversion: Simplifies processing by removing colour.
      • Gaussian Blur: Reduces noise.
      • Thresholding: Enhances text visibility by binarising the image.
      • Colour Inversion: Helps OCR in cases of reversed colour schemes.
  3. OCR:

    • Tesseract OCR extracts text from the preprocessed image.
    • The extracted text is cleaned and structured into JSON format.
  4. Manual Parameter Adjustment:

    • Users manually adjust preprocessing parameters for each document to optimise results.

Example Workflow

Input

A PDF document containing:

  • Company Name: QUICK ENTREPRISE
  • Identifier: 0795 785 723
  • Purpose: NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR

Expected JSON Output

{
    "Company Name": "QUICK ENTREPRISE",
    "Company Identifier": "0795 785 723",
    "Document Purpose": "NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR",
    "Details": {}
}

Actual JSON Output

{
    "Company Name": null,
    "Company Identifier": "0795 785 723",
    "Document Purpose": "NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR",
    "Details": {}
}

Notes:

  • Results depend heavily on document quality and parameter adjustments.
  • Fine-tuning sliders can improve text extraction, but consistency remains an issue.
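
When tuning parameters, it can help to compare a run against the expected result field by field. The sketch below does this for the two JSON objects shown above; the function name `differing_fields` is a hypothetical helper.

```python
def differing_fields(expected, actual):
    """Return the keys whose values differ between the two results."""
    keys = sorted(set(expected) | set(actual))
    return [key for key in keys if expected.get(key) != actual.get(key)]


expected = {
    "Company Name": "QUICK ENTREPRISE",
    "Company Identifier": "0795 785 723",
    "Document Purpose": "NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR",
    "Details": {},
}
actual = {
    "Company Name": None,
    "Company Identifier": "0795 785 723",
    "Document Purpose": "NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR",
    "Details": {},
}

print(differing_fields(expected, actual))  # ['Company Name']
```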

Troubleshooting

  • OCR Artefacts:

    • Adjust the blur and threshold sliders.
    • Use colour inversion if text is hard to read.
  • Data Missing:

    • Experiment with preprocessing settings to optimise text clarity.
  • Dependencies Missing:

    • Ensure all Python dependencies are installed:
      pip install -r requirements.txt

Future Improvements

  1. Regex Enhancement:
    • Improve patterns to handle diverse document layouts.
  2. Noise Reduction:
    • Add advanced OpenCV techniques to clean images before OCR.
  3. Batch Processing:
    • Allow multiple documents to be processed in a single session.
  4. Field Matching:
    • Incorporate machine learning models for better field extraction.

Author

Víctor G.C.


License

This project is licensed under the MIT License.

Changes and Additions

  1. Explicit Acknowledgment of Limitations:

    • Clearly states the program's inconsistent performance and dependency on manual intervention.
  2. Detailed Process Description:

    • Outlines each step from PDF conversion to OCR, emphasising the use of OpenCV.
  3. Realistic Examples:

    • Demonstrates expected vs. actual output to set accurate expectations.
  4. Future Directions:

    • Proposes solutions to improve accuracy and automate manual adjustments.

Acknowledgement

We would like to express our gratitude to the various sources of information that have greatly assisted us throughout the development of this project. Although we have not yet achieved a fully functional and reliable solution, the resources we consulted provided invaluable insights into the complexities of working with OCR, image preprocessing, and PDF document analysis.

Online Resources

The following tutorials, articles, and videos were instrumental in shaping our understanding of the project requirements and the technologies involved:

Special Thanks to ChatGPT

We would also like to extend our sincere thanks to ChatGPT and its creators for providing invaluable assistance throughout this project. ChatGPT played a significant role in:

  • Debugging complex issues in our implementation.
  • Offering guidance on constructing effective Regular Expressions (RegEx) for data extraction.
  • Helping streamline our approach to image preprocessing and text recognition.

Although the project remains incomplete due to time and resource constraints, ChatGPT enabled us to work more efficiently, saving countless hours. We also acknowledge the wider ChatGPT community, whose insights and contributions have been a source of inspiration and support.

Acknowledgement of Challenges

While these resources provided substantial help, the project remains incomplete due to:

  • The variability in scanned document quality.
  • The inherent limitations of the tools and techniques we used.
  • The challenges of applying universal preprocessing settings to documents with diverse layouts and noise levels.

Nevertheless, we are deeply grateful for the wealth of knowledge shared by the online community and the creators of tools like ChatGPT, which have guided us in building a solid foundation for this project.
