A Python tool for cleaning and processing PDF bank reports with AI analysis using Google Gemini.
- PDF Text Extraction: Extract text from PDF files using advanced parsing
- Pattern-based Cleaning: Remove unwanted content using configurable regex patterns
- AI Analysis: Process cleaned text with Google Gemini for financial insights
- Batch Processing: Process entire directories of PDF files
- Concurrent Processing: Multi-threaded processing for better performance
- Flexible Output: Save cleaned PDFs, text files, and JSON analysis
- Clone or download the project
- Install dependencies:
pip install -r requirements.txt
- Set up your Gemini API key:
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY
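As a rough illustration of how the key could be resolved at startup, the sketch below checks a command-line value first, then the GEMINI_API_KEY environment variable, then a .env file. The helper names `load_env_file` and `resolve_api_key` and the precedence order are assumptions made for this example; the tool itself may resolve the key differently.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(cli_value=None, env_path=".env"):
    """Assumed precedence: --api-key value, then environment, then .env file."""
    if cli_value:
        return cli_value
    from_env = os.environ.get("GEMINI_API_KEY")
    return from_env or load_env_file(env_path).get("GEMINI_API_KEY")
```

If `resolve_api_key()` returns `None`, a caller could skip the AI analysis step instead of failing.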
# Basic usage
python main.py bank_report.pdf
# With pattern cleaning
python main.py bank_report.pdf -f config/default_patterns.txt
# Save all outputs
python main.py bank_report.pdf --save-text --save-cleaned-pdf
# Custom analysis prompt
python main.py bank_report.pdf --custom-prompt 'Extract all transaction amounts greater than $1000'
# Process all PDFs in a directory
python main.py -d /path/to/pdf/directory -o /path/to/output
# Process recursively with custom patterns
python main.py -d /path/to/pdf/directory -o /path/to/output --recursive -f my_patterns.txt
# Batch processing with all outputs
python main.py -d /path/to/pdf/directory -o /path/to/output --save-text --save-cleaned-pdf --max-workers 8
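The directory and --recursive options above imply a file discovery step. A minimal sketch of that step, assuming PDFs are identified by a case-insensitive .pdf extension, might look like this. The function name `find_pdfs` is hypothetical and not part of the tool.

```python
from pathlib import Path


def find_pdfs(directory, recursive=False):
    """Return PDF files in a directory, optionally descending into subdirectories."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    # Compare the suffix case-insensitively so REPORT.PDF is also picked up.
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")
```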
Use the simplified batch script for common operations:
# Basic batch processing
python batch_process.py input_directory output_directory
# With all options
python batch_process.py input_directory output_directory --recursive --save-all --workers 6 --verbose
Input/Output:
input_pdf
: Path to single PDF file (or use -d for directory)
-d, --directory
: Process all PDFs in this directory
-o, --output
: Output directory (default: 'output')
Processing:
-p, --patterns
: Patterns to remove (space-separated)
-f, --pattern-file
: File containing patterns (one per line)
--api-key
: Gemini API key (or set GEMINI_API_KEY env var)
--custom-prompt
: Custom prompt for AI analysis
Output Options:
--save-cleaned-pdf
: Save cleaned PDF file
--save-text
: Save extracted text file
Batch Options:
--max-workers
: Maximum concurrent workers (default: 4)
--recursive
: Process subdirectories recursively
General:
-v, --verbose
: Enable verbose logging
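For readers who want to see how the main.py options above fit together, here is a sketch of an `argparse` parser that mirrors the documented flags and defaults. It is a reconstruction from this option list only; the actual parser in main.py may differ, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented main.py options (a sketch)."""
    parser = argparse.ArgumentParser(description="Clean and analyze PDF bank reports")
    # Input/Output
    parser.add_argument("input_pdf", nargs="?", help="Path to a single PDF file")
    parser.add_argument("-d", "--directory", help="Process all PDFs in this directory")
    parser.add_argument("-o", "--output", default="output", help="Output directory")
    # Processing
    parser.add_argument("-p", "--patterns", nargs="*", default=[], help="Patterns to remove")
    parser.add_argument("-f", "--pattern-file", help="File containing patterns, one per line")
    parser.add_argument("--api-key", help="Gemini API key (or set GEMINI_API_KEY)")
    parser.add_argument("--custom-prompt", help="Custom prompt for AI analysis")
    # Output options
    parser.add_argument("--save-cleaned-pdf", action="store_true", help="Save cleaned PDF file")
    parser.add_argument("--save-text", action="store_true", help="Save extracted text file")
    # Batch options
    parser.add_argument("--max-workers", type=int, default=4, help="Maximum concurrent workers")
    parser.add_argument("--recursive", action="store_true", help="Process subdirectories")
    # General
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable verbose logging")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```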
batch_process.py arguments:
input_directory
: Directory with PDF files
output_directory
: Where to save results
--pattern-file
: Pattern file to use
-r, --recursive
: Process subdirectories
--save-all
: Save both PDF and text files
--workers
: Number of parallel workers
-v, --verbose
: Verbose output
Create custom pattern files to remove unwanted content from PDFs. Each line is a regex pattern:
# Remove page numbers
^Page \d+ of \d+$
# Remove bank logos
^BANCO.*LOGO$
# Remove confidential markers
CONFIDENTIAL
INTERNAL USE ONLY
# Remove blank lines
^\s*$
The default patterns are in config/default_patterns.txt.
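To make the pattern-file format concrete, the sketch below shows one way such a file could be applied. It assumes that lines starting with # are comments, that each remaining line is compiled as a Python regex, that patterns are applied line by line, and that a line is dropped when a pattern matches it and nothing but whitespace is left. These semantics and the names `load_patterns` and `clean_text` are assumptions for illustration; the tool's own cleaning logic may behave differently.

```python
import re


def load_patterns(lines):
    """Compile one regex per non-empty line, skipping '#' comment lines."""
    patterns = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        patterns.append(re.compile(line))
    return patterns


def clean_text(text, patterns):
    """Remove pattern matches from each line; drop lines emptied by a match."""
    kept = []
    for line in text.splitlines():
        matched = False
        for pattern in patterns:
            if pattern.search(line):
                matched = True
                line = pattern.sub("", line)
        if matched and not line.strip():
            continue  # the whole line was unwanted content
        kept.append(line)
    return "\n".join(kept)
```

With a pattern file on disk, usage would be along the lines of `clean_text(text, load_patterns(open("config/default_patterns.txt", encoding="utf-8")))`.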
For each processed PDF, the tool generates:
{filename}_cleaned.txt
: Extracted and cleaned text (if --save-text)
{filename}_cleaned.pdf
: Cleaned PDF file (if --save-cleaned-pdf)
{filename}_analysis.json
: AI analysis results
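The naming scheme above can be expressed as a small helper. This sketch assumes that {filename} is the input file's name without its extension and that all outputs land directly in the output directory; `output_paths` is a hypothetical function, not part of the tool.

```python
from pathlib import Path


def output_paths(pdf_path, output_dir="output", save_text=False, save_cleaned_pdf=False):
    """Map an input PDF to the output files described above."""
    stem = Path(pdf_path).stem  # assumed meaning of {filename}
    out = Path(output_dir)
    # The analysis file is listed unconditionally in the documentation; it may
    # still be absent if the AI analysis step fails or is skipped.
    paths = {"analysis": out / f"{stem}_analysis.json"}
    if save_text:
        paths["text"] = out / f"{stem}_cleaned.txt"
    if save_cleaned_pdf:
        paths["pdf"] = out / f"{stem}_cleaned.pdf"
    return paths
```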
python main.py statement.pdf -f config/default_patterns.txt --save-text
python main.py -d monthly_reports/ -o processed_reports/ --save-cleaned-pdf --max-workers 6
python main.py report.pdf --custom-prompt 'List all transactions over $5000 with dates and descriptions'
python batch_process.py input_pdfs/ output_results/ --recursive --save-all
- Files that fail to process are logged and reported in batch summary
- Processing continues even if some files fail
- Gemini API errors do not abort a run: processing continues without AI analysis
- Detailed error messages help identify issues
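The behavior described in this list, where a batch keeps going when individual files fail and reports the failures at the end, can be sketched with the standard library's `ThreadPoolExecutor`. The names `run_batch` and `process_one` are hypothetical, and the tool's real batch loop may be organized differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(paths, process_one, max_workers=4):
    """Process every path concurrently; collect failures instead of aborting."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(path)
            except Exception as exc:
                # Record the error and move on to the remaining files.
                failed[path] = str(exc)
    return sorted(succeeded), failed
```

The returned `failed` mapping of path to error message is the kind of data a batch summary could be built from.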
- Multi-threaded processing for batch operations
- Configurable worker count for optimal performance
- Memory-efficient text processing
- Progress tracking for batch operations
- Python 3.7+
- Google Gemini API key for AI analysis
- See requirements.txt for package dependencies
This project is for educational and defensive security purposes only.