Document Processor

This project is a PDF document processing application designed to extract structured information from scanned documents. It leverages Optical Character Recognition (OCR) to extract text and provides a graphical user interface (GUI) to fine-tune image preprocessing parameters. However, the program is still a work in progress and has known limitations.

Features

Converts PDF pages into images for OCR processing.
Interactive GUI allows users to adjust preprocessing parameters:
- Gaussian Blur
- Threshold
- Colour Inversion
Attempts to extract key information such as:
- Company Name
- Company Identifier
- Document Purpose
Saves the structured data into a JSON file.
Supports multilingual text recognition (English, French, German, Dutch).

Current Status and Limitations

Achievements:

The program can process PDFs and allow interactive adjustments to improve OCR results.
Partial data extraction is possible (e.g., Company Identifier or Document Purpose).

Limitations:

Inconsistent Data Extraction:
- Extraction of fields like Company Name and Document Purpose is not always reliable.
- Changing preprocessing parameters might improve one field's accuracy but can degrade another.
Noise in OCR Results:
- Extracted text may contain artefacts (e.g., /78e8) caused by noisy scans or low-quality documents. Adjusting the Threshold slider can often reduce them.
Incomplete Regex Matching:
- Regular expressions used for text extraction need improvement to handle diverse document formats.
Manual Interaction Required:
- Each document requires manual parameter adjustments for optimal results, as no single set of values works for every document.
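
As a rough illustration of the regex limitation, the sketch below matches an identifier written as groups of 4-3-3 digits, as in the example `0795 785 723` later in this README, and tolerates spaces or dots between the groups. The pattern and the function name `find_company_identifier` are assumptions for illustration, not the patterns used in `text_processing.py`.

```python
import re

# Hypothetical pattern: 4 digits, then two groups of 3 digits, separated by
# an optional space or dot (e.g. "0795 785 723" or "0795.785.723").
IDENTIFIER_RE = re.compile(r"\b(\d{4})[ .]?(\d{3})[ .]?(\d{3})\b")


def find_company_identifier(text):
    """Return the first identifier normalised to 'NNNN NNN NNN', or None."""
    match = IDENTIFIER_RE.search(text)
    if match is None:
        return None
    return " ".join(match.groups())
```

Normalising the separators in one place means later steps do not need to care how a given scan happened to render the number.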

Project Structure

project/
├── src/
│   ├── main.py              # Main application entry point
│   ├── ocr_utils.py         # Functions for OCR processing
│   ├── text_processing.py   # Functions for extracting and saving structured data
│   └── preprocess.py        # Preprocessing for image enhancement
├── requirements.txt         # Python dependencies
└── README.md                # Documentation

Requirements

Python 3.7 or higher
Tesseract OCR
Poppler for PDF conversion
Python libraries:
- pytesseract
- pdf2image
- Pillow
- opencv-python

Installation

1. Install Python Libraries

Run the following command to install the required Python dependencies:

pip install -r requirements.txt

2. Install Tesseract OCR

Windows

Download Tesseract OCR from the official website.
During installation, ensure the "Add to PATH" option is selected.
If not added automatically, add the installation directory (e.g., C:\Program Files\Tesseract-OCR) to your system PATH manually:
- Open "System Properties" > "Advanced" > "Environment Variables".
- Under "System variables", find Path, click "Edit", and add the Tesseract directory.

Linux

Use your package manager to install Tesseract:
```
sudo apt-get install tesseract-ocr
```
Ensure the tesseract executable is on your PATH (this usually happens automatically).

macOS

Install Tesseract using Homebrew:
```
brew install tesseract
```
Verify installation with:
```
tesseract --version
```

3. Install Poppler

Windows

Download the latest Poppler for Windows release (a ZIP archive).
Extract the ZIP file and note the directory path (e.g., C:\poppler\Library\bin).
Add the bin folder to your system PATH:
- Open "System Properties" > "Advanced" > "Environment Variables".
- Under "System variables", find Path, click "Edit", and add the Poppler bin directory.

Linux

Install Poppler via your package manager:
```
sudo apt-get install poppler-utils
```

macOS

Install Poppler using Homebrew:
```
brew install poppler
```

Verifying Poppler Installation

To check if Poppler is installed correctly, run the following command:

pdfinfo

Run without arguments, pdfinfo prints its version and usage information. If you see an error instead, check that Poppler's bin directory has been added to your PATH.

Note: Avoid pdfinfo --version, as it may not work on some systems. Running pdfinfo on its own is enough.
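
The same check can be done from Python before launching the GUI. This is a minimal sketch using only the standard library. The helper name `missing_tools` is an assumption and is not part of the project.

```python
import shutil


def missing_tools(tools=("tesseract", "pdfinfo")):
    """Return the names of command-line tools that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler were found on PATH.")
```

If either name is reported, revisit the PATH steps for that tool above.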

Common Issues and Solutions

Issue: Tesseract not recognised in command line

Cause: Tesseract is not in your PATH.
Solution:
- Verify the installation directory.
- Add the directory to your PATH manually as described above.
- Restart your terminal or system.

Issue: Poppler not working with `pdf2image`

Cause: Poppler bin directory is not in your PATH.
Solution:
- Ensure the correct Poppler bin path is added to your PATH.
- Test with:
```
pdfinfo
```

Issue: Missing Libraries

Solution:
- Verify all Python dependencies are installed using:
```
pip install -r requirements.txt
```

Usage

1. Run the Application

Launch the GUI:

python src/main.py

2. Select a PDF File

Use the Select PDF button to load a PDF.
The program converts the PDF pages into images.
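
For reference, the conversion step might look roughly like the sketch below, which uses `convert_from_path` from pdf2image. The function name `pdf_to_images` and the `dpi=300` default are assumptions, not values taken from the project's source.

```python
from pathlib import Path


def pdf_to_images(pdf_path, dpi=300):
    """Convert each page of a PDF into a PIL image using pdf2image."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")
    # Imported here so the path check above works even if pdf2image is missing.
    from pdf2image import convert_from_path

    # Needs Poppler on PATH. dpi=300 is an assumed value for illustration.
    return convert_from_path(str(path), dpi=dpi)
```

A higher DPI generally gives Tesseract more detail to work with, at the cost of slower conversion and more memory per page.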

3. Preprocess Images

Adjust sliders in the GUI for parameters like blur, threshold, and inversion to optimise the image for OCR.
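
What the sliders control can be sketched with OpenCV as follows. The function names, default values and the assumption that the input is a BGR NumPy array are illustrative and may differ from `preprocess.py`.

```python
def to_odd_kernel(value):
    """Map a slider value to a valid Gaussian kernel size (odd and >= 1)."""
    value = max(1, int(value))
    return value if value % 2 == 1 else value + 1


def preprocess(image_bgr, blur=5, threshold=127, invert=False):
    """Greyscale, blur, binarise and optionally invert a BGR image array."""
    import cv2  # opencv-python, listed in the requirements

    grey = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    k = to_odd_kernel(blur)  # GaussianBlur needs an odd, positive kernel size
    blurred = cv2.GaussianBlur(grey, (k, k), 0)
    _, binary = cv2.threshold(blurred, threshold, 255, cv2.THRESH_BINARY)
    return cv2.bitwise_not(binary) if invert else binary
```

Pages returned by pdf2image are PIL images in RGB order, so they would first need converting to a NumPy array (for example with `numpy.array(page)`), with the colour constant changed to `cv2.COLOR_RGB2GRAY`.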

4. Apply OCR

Click the Apply OCR button to extract text from the processed image.
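
Behind the button, the OCR call could be as small as the sketch below. It uses `pytesseract.image_to_string` with Tesseract's language codes for the four languages listed under Features. The names `LANGUAGE_CODES`, `build_lang` and `run_ocr` are assumptions for illustration.

```python
# Tesseract language codes for the languages listed under Features.
LANGUAGE_CODES = {"English": "eng", "French": "fra", "German": "deu", "Dutch": "nld"}


def build_lang(languages):
    """Join language names into Tesseract's 'eng+fra' style lang string."""
    return "+".join(LANGUAGE_CODES[name] for name in languages)


def run_ocr(image, languages=("English", "French", "German", "Dutch")):
    """Run Tesseract on an image and return the recognised text."""
    import pytesseract  # needs the Tesseract executable on PATH

    return pytesseract.image_to_string(image, lang=build_lang(languages))
```

Each language also needs its Tesseract language data installed. Otherwise Tesseract reports an error for the missing language.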

5. Save Results

Extracted data is saved as output.json in the project directory.
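
Writing the file is a plain `json.dump`. The sketch below assumes a helper named `save_results`, which is hypothetical. It sets `ensure_ascii=False` so that accented characters in French or German text stay readable in the file.

```python
import json
from pathlib import Path


def save_results(data, output_path="output.json"):
    """Write the extracted fields to a UTF-8 JSON file and return its path."""
    path = Path(output_path)
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps characters such as "É" instead of \u escapes.
        json.dump(data, handle, ensure_ascii=False, indent=4)
    return path
```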

Process Overview

PDF Conversion:
- Each page of the PDF is converted into an image using the pdf2image library.
Image Preprocessing:
- Images are processed with OpenCV:
  - Greyscale Conversion: Simplifies processing by removing colour.
  - Gaussian Blur: Reduces noise.
  - Thresholding: Enhances text visibility by binarising the image.
  - Colour Inversion: Helps OCR in cases of reversed colour schemes.
OCR:
- Tesseract OCR extracts text from the preprocessed image.
- The extracted text is cleaned and structured into JSON format.
Manual Parameter Adjustment:
- Users manually adjust preprocessing parameters for each document to optimise results.
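
The cleaning and structuring step can be sketched as below. The artefact heuristic (dropping standalone tokens such as `/78e8`) and the function names are assumptions made for illustration. The project's own cleaning rules may differ.

```python
import re

# Assumed heuristic: a standalone token made of "/" followed by two or more
# hex-like characters (e.g. "/78e8") is treated as OCR noise.
ARTEFACT_RE = re.compile(r"(?<!\S)/[0-9A-Fa-f]{2,}(?!\S)")


def clean_text(raw):
    """Remove artefact tokens and collapse runs of whitespace."""
    without_artefacts = ARTEFACT_RE.sub(" ", raw)
    return re.sub(r"\s+", " ", without_artefacts).strip()


def to_record(name=None, identifier=None, purpose=None):
    """Build the JSON-ready dict with the keys shown in the example output."""
    return {
        "Company Name": name,
        "Company Identifier": identifier,
        "Document Purpose": purpose,
        "Details": {},
    }
```

A lone slash with a space after it, as in "NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR", is left untouched by this pattern.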

Example Workflow

Input

A PDF document containing:

Company Name: QUICK ENTREPRISE
Identifier: 0795 785 723
Purpose: NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR

Expected JSON Output

{
    "Company Name": "QUICK ENTREPRISE",
    "Company Identifier": "0795 785 723",
    "Document Purpose": "NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR",
    "Details": {}
}

Actual JSON Output

{
    "Company Name": null,
    "Company Identifier": "0795 785 723",
    "Document Purpose": "NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR",
    "Details": {}
}

Notes:

Results depend heavily on document quality and parameter adjustments.
Fine-tuning sliders can improve text extraction, but consistency remains an issue.
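
One way to spot gaps such as the missing Company Name is to list the fields that came back as null. The helper `missing_fields` below is a hypothetical addition, not part of the project.

```python
import json


def missing_fields(record):
    """Return the keys whose value is null (None) in an extracted record."""
    return [key for key, value in record.items() if value is None]


actual = json.loads("""
{
    "Company Name": null,
    "Company Identifier": "0795 785 723",
    "Document Purpose": "NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR",
    "Details": {}
}
""")
print(missing_fields(actual))  # ['Company Name']
```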

Troubleshooting

OCR Artefacts:
- Adjust the blur and threshold sliders.
- Use colour inversion if text is hard to read.
Data Missing:
- Experiment with preprocessing settings to optimise text clarity.
Dependencies Missing:
- Ensure all Python dependencies are installed:
```
pip install -r requirements.txt
```

Future Improvements

Regex Enhancement:
- Improve patterns to handle diverse document layouts.
Noise Reduction:
- Add advanced OpenCV techniques to clean images before OCR.
Batch Processing:
- Allow multiple documents to be processed in a single session.
Field Matching:
- Incorporate machine learning models for better field extraction.
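
Batch processing could start from something like the sketch below. The `process_pdf` callable is an assumed stand-in for the existing single-document pipeline (convert, preprocess, OCR, extract), and `process_folder` is a hypothetical name.

```python
from pathlib import Path


def process_folder(folder, process_pdf):
    """Call process_pdf on every PDF in folder and return {filename: result}."""
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf_path.name] = process_pdf(pdf_path)
        except Exception as error:  # keep going if one document fails
            results[pdf_path.name] = {"error": str(error)}
    return results
```

Because each document currently needs its own slider settings, a batch mode like this would probably only be useful once the preprocessing parameters can be chosen automatically.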

Author

Víctor G.C.

License

This project is licensed under the MIT License.

Changes and Additions

Explicit Acknowledgment of Limitations:
- Clearly states the program's inconsistent performance and dependency on manual intervention.
Detailed Process Description:
- Outlines each step from PDF conversion to OCR, emphasising the use of OpenCV.
Realistic Examples:
- Demonstrates expected vs. actual output to set accurate expectations.
Future Directions:
- Proposes solutions to improve accuracy and automate manual adjustments.

Acknowledgement

We would like to express our gratitude to the various sources of information that have greatly assisted us throughout the development of this project. Although we have not yet achieved a fully functional and reliable solution, the resources we consulted provided invaluable insights into the complexities of working with OCR, image preprocessing, and PDF document analysis.

Online Resources

The following tutorials, articles, and videos were instrumental in shaping our understanding of the project requirements and the technologies involved:

Special Thanks to ChatGPT

We would also like to extend our sincere thanks to ChatGPT and its creators for providing invaluable assistance throughout this project. ChatGPT played a significant role in:

Debugging complex issues in our implementation.
Offering guidance on constructing effective Regular Expressions (RegEx) for data extraction.
Helping streamline our approach to image preprocessing and text recognition.

Although the project remains incomplete due to time and resource constraints, ChatGPT enabled us to work more efficiently, saving countless hours. We also acknowledge the wider ChatGPT community, whose insights and contributions have been a source of inspiration and support.

Acknowledgement of Challenges

While these resources provided substantial help, the project remains incomplete due to:

The variability in scanned document quality.
The inherent limitations of the tools and techniques we used.
The challenges of applying universal preprocessing settings to documents with diverse layouts and noise levels.

Nevertheless, we are deeply grateful for the wealth of knowledge shared by the online community and the creators of tools like ChatGPT, which have guided us in building a solid foundation for this project.
