This project is a PDF document processing application designed to extract structured information from scanned documents. It leverages Optical Character Recognition (OCR) to extract text and provides a graphical user interface (GUI) to fine-tune image preprocessing parameters. However, the program is still a work in progress and has known limitations.
- Converts PDF pages into images for OCR processing.
- Interactive GUI allows users to adjust preprocessing parameters:
  - Gaussian Blur
  - Threshold
  - Colour Inversion
- Attempts to extract key information such as:
  - Company Name
  - Company Identifier
  - Document Purpose
- Saves the structured data into a JSON file.
- Supports multilingual text recognition (English, French, German, Dutch).
- The program can process PDFs and allow interactive adjustments to improve OCR results.
- Partial data extraction is possible (e.g., Company Identifier or Document Purpose).
- Inconsistent Data Extraction:
  - Extraction of fields like Company Name and Document Purpose is not always reliable.
  - Changing preprocessing parameters might improve one field's accuracy but can degrade another.
- Noise in OCR Results:
  - Extracted text may contain artefacts (e.g., /78e8) caused by noisy scans or low-quality documents. This can usually be fixed by adjusting the Threshold.
- Incomplete Regex Matching:
  - Regular expressions used for text extraction need improvement to handle diverse document formats.
- Manual Interaction Required:
  - Each document requires manual adjustment of parameters for optimal results, as no universal settings work for all documents.
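To make the regex limitation concrete, here is a minimal sketch of one tolerant pattern. It is not taken from text_processing.py: the pattern and the function name `find_company_identifier` are hypothetical, and it assumes identifiers look like the sample in this README (ten digits grouped 4-3-3, separated by spaces or dots).

```python
import re
from typing import Optional

# Hypothetical pattern: ten digits grouped 4-3-3, with an optional space or dot
# between groups (e.g. "0795 785 723" or "0795.785.723"). The lookarounds stop
# the pattern from matching inside a longer run of digits.
IDENTIFIER_PATTERN = re.compile(r"(?<!\d)(\d{4})[ .]?(\d{3})[ .]?(\d{3})(?!\d)")


def find_company_identifier(ocr_text: str) -> Optional[str]:
    """Return the first identifier found, normalised to 'dddd ddd ddd', or None."""
    match = IDENTIFIER_PATTERN.search(ocr_text)
    if match is None:
        return None
    return " ".join(match.groups())
```

Normalising the separators means the same identifier is reported consistently even when OCR reads a dot as a space, or drops it entirely. Other document layouts would likely need additional patterns.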
project/
├── src/
│   ├── main.py             # Main application entry point
│   ├── ocr_utils.py        # Functions for OCR processing
│   ├── text_processing.py  # Functions for extracting and saving structured data
│   └── preprocess.py       # Preprocessing for image enhancement
├── requirements.txt        # Python dependencies
└── README.md               # Documentation
- Python 3.7 or higher
- Tesseract OCR
- Poppler for PDF conversion
- Python libraries:
  - pytesseract
  - pdf2image
  - Pillow
  - opencv-python
Run the following command to install the required Python dependencies:
pip install -r requirements.txt
- Download Tesseract OCR from the official website.
- During installation, ensure the "Add to PATH" option is selected.
- If not added automatically, add the installation directory (e.g., C:\Program Files\Tesseract-OCR) to your system PATH manually:
  - Open "System Properties" > "Advanced" > "Environment Variables".
  - Under "System variables", find Path, click "Edit", and add the Tesseract directory.
- On Debian- or Ubuntu-based Linux, use your package manager to install Tesseract:
sudo apt-get install tesseract-ocr
- Ensure the package is in your PATH (usually automatic).
- On macOS, install Tesseract using Homebrew:
brew install tesseract
- Verify installation with:
tesseract --version
- Download Poppler for Windows from Poppler for Windows.
- Extract the ZIP file and note the directory path (e.g., C:\poppler\Library\bin).
- Add the bin folder to your system PATH:
  - Open "System Properties" > "Advanced" > "Environment Variables".
  - Under "System variables", find Path, click "Edit", and add the Poppler bin directory.
- On Debian- or Ubuntu-based Linux, install Poppler via your package manager:
sudo apt-get install poppler-utils
- On macOS, install Poppler using Homebrew:
brew install poppler
To check if Poppler is installed correctly, run the following command:
pdfinfo
This will display general information about pdfinfo, including the version. If you see an error, ensure the bin directory of Poppler is correctly added to your PATH.
Note: Avoid using pdfinfo --version, as it may not work on some systems. Simply running pdfinfo will suffice.
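The same PATH checks can also be done from Python before launching the GUI. The sketch below is an illustration only and is not part of the project's code: the helper name `find_executables` is hypothetical, and it relies solely on the standard library's `shutil.which`.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_executables(
    names: Iterable[str] = ("tesseract", "pdfinfo"),
) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_executables().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

A `None` entry for tesseract or pdfinfo points to the PATH problems covered in the installation steps above.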
- Cause: Tesseract is not in your PATH.
- Solution:
  - Verify the installation directory.
  - Add the directory to your PATH manually as described above.
  - Restart your terminal or system.
- Cause: Poppler bin directory is not in your PATH.
- Solution:
  - Ensure the correct Poppler bin path is added to your PATH.
  - Test with:
    pdfinfo
- Solution:
  - Verify all Python dependencies are installed using:
    pip install -r requirements.txt
Launch the GUI:
python src/main.py
- Use the Select PDF button to load a PDF.
- The program converts the PDF pages into images.
- Adjust sliders in the GUI for parameters like blur, threshold, and inversion to optimise the image for OCR.
- Click the Apply OCR button to extract text from the processed image.
- Extracted data is saved as output.json in the project directory.
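As an illustration of the final step, the sketch below writes a record with the field names shown in this README's sample output. The function name `save_structured_data` and its signature are assumptions and may differ from what text_processing.py actually does.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def save_structured_data(
    company_name: Optional[str],
    company_identifier: Optional[str],
    document_purpose: Optional[str],
    output_path: str = "output.json",
) -> Dict[str, Any]:
    """Write the extracted fields to a JSON file and return the saved record."""
    record = {
        "Company Name": company_name,
        "Company Identifier": company_identifier,
        "Document Purpose": document_purpose,
        "Details": {},
    }
    # ensure_ascii=False keeps accented characters (e.g. "É") human-readable.
    Path(output_path).write_text(
        json.dumps(record, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return record
```

Fields that could not be extracted are passed as `None` and end up as `null` in the file, which keeps the output structure stable across documents.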
- PDF Conversion:
  - Each page of the PDF is converted into an image using the pdf2image library.
- Image Preprocessing:
  - Images are processed with OpenCV:
    - Greyscale Conversion: Simplifies processing by removing colour.
    - Gaussian Blur: Reduces noise.
    - Thresholding: Enhances text visibility by binarising the image.
    - Colour Inversion: Helps OCR in cases of reversed colour schemes.
- OCR:
  - Tesseract OCR extracts text from the preprocessed image.
  - The extracted text is cleaned and structured into JSON format.
- Manual Parameter Adjustment:
  - Users manually adjust preprocessing parameters for each document to optimise results.
A PDF document containing:
- Company Name: QUICK ENTREPRISE
- Identifier: 0795 785 723
- Purpose: NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR
Expected output:
{
  "Company Name": "QUICK ENTREPRISE",
  "Company Identifier": "0795 785 723",
  "Document Purpose": "NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR",
  "Details": {}
}
Actual output (example):
{
  "Company Name": null,
  "Company Identifier": "0795 785 723",
  "Document Purpose": "NOMINATION ADMINISTRATEUR / DÉMISSION ADMINISTRATEUR",
  "Details": {}
}
- Results depend heavily on document quality and parameter adjustments.
- Fine-tuning sliders can improve text extraction, but consistency remains an issue.
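Because a partial result such as a null Company Name is common, it can help to flag incomplete records automatically. The sketch below is a hypothetical helper, not part of the project, and assumes the three field names used in the sample output.

```python
from typing import Any, Dict, List

# Field names as they appear in the sample output of this README.
REQUIRED_FIELDS = ("Company Name", "Company Identifier", "Document Purpose")


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent, null, or blank in the record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing
```

A non-empty result could be used to prompt the user to adjust the sliders and run OCR again before the record is saved.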
- OCR Artefacts:
  - Adjust the blur and threshold sliders.
  - Use colour inversion if text is hard to read.
- Data Missing:
  - Experiment with preprocessing settings to optimise text clarity.
- Dependencies Missing:
  - Ensure all Python dependencies are installed:
    pip install -r requirements.txt
- Regex Enhancement:
  - Improve patterns to handle diverse document layouts.
- Noise Reduction:
  - Add advanced OpenCV techniques to clean images before OCR.
- Batch Processing:
  - Allow multiple documents to be processed in a single session.
- Field Matching:
  - Incorporate machine learning models for better field extraction.
Víctor G.C.
This project is licensed under the MIT License.
- Explicit Acknowledgment of Limitations:
  - Clearly states the program's inconsistent performance and dependency on manual intervention.
- Detailed Process Description:
  - Outlines each step from PDF conversion to OCR, emphasising the use of OpenCV.
- Realistic Examples:
  - Demonstrates expected vs. actual output to set accurate expectations.
- Future Directions:
  - Proposes solutions to improve accuracy and automate manual adjustments.
We would like to express our gratitude to the various sources of information that have greatly assisted us throughout the development of this project. Although we have not yet achieved a fully functional and reliable solution, the resources we consulted provided invaluable insights into the complexities of working with OCR, image preprocessing, and PDF document analysis.
The following tutorials, articles, and videos were instrumental in shaping our understanding of the project requirements and the technologies involved:
- OCR in Python Tutorials
- Optical Character Recognition (OCR)
- Convert PDF to JSON - PDF to JSON Python & Javascript
- Simple OCR in Python with easyocr
We would also like to extend our sincere thanks to ChatGPT and its creators for providing invaluable assistance throughout this project. ChatGPT played a significant role in:
- Debugging complex issues in our implementation.
- Offering guidance on constructing effective Regular Expressions (RegEx) for data extraction.
- Helping streamline our approach to image preprocessing and text recognition.
ChatGPT helped us work more efficiently and saved us many hours, although time and resource constraints left the project unfinished. We also acknowledge the wider ChatGPT community, whose insights and contributions have been a source of inspiration and support.
While these resources provided substantial help, the project remains incomplete due to:
- The variability in scanned document quality.
- The inherent limitations of the tools and techniques we used.
- The challenges of applying universal preprocessing settings to documents with diverse layouts and noise levels.
Nevertheless, we are deeply grateful for the wealth of knowledge shared by the online community and the creators of tools like ChatGPT, which have guided us in building a solid foundation for this project.