EconomIA - PDF Bank Report Processor

A Python tool for cleaning and processing PDF bank reports with AI analysis using Google Gemini.

Features

  • PDF Text Extraction: Extract text from PDF files using advanced parsing
  • Pattern-based Cleaning: Remove unwanted content using configurable regex patterns
  • AI Analysis: Process cleaned text with Google Gemini for financial insights
  • Batch Processing: Process entire directories of PDF files
  • Concurrent Processing: Multi-threaded processing for better performance
  • Flexible Output: Save cleaned PDFs, text files, and JSON analysis

Installation

  1. Clone or download the project
  2. Install dependencies:
    pip install -r requirements.txt
  3. Set up your Gemini API key:
    cp .env.example .env
    # Edit .env and add your GEMINI_API_KEY
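
As an illustration of how the key might be resolved at runtime, here is a minimal sketch. It assumes an explicit --api-key value takes precedence over the GEMINI_API_KEY environment variable; the function name resolve_api_key is hypothetical and not taken from the project, and loading the .env file itself is left out.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_value: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the Gemini API key, or None if no key is configured.

    Assumed precedence (not confirmed by the project): the --api-key
    value first, then the GEMINI_API_KEY environment variable.
    """
    if cli_value:
        return cli_value
    value = env.get("GEMINI_API_KEY", "").strip()
    return value or None
```

Returning None rather than raising leaves the caller free to skip the AI analysis step when no key is available.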

Usage

Single File Processing

# Basic usage
python main.py bank_report.pdf

# With pattern cleaning
python main.py bank_report.pdf -f config/default_patterns.txt

# Save all outputs
python main.py bank_report.pdf --save-text --save-cleaned-pdf

# Custom analysis prompt (single quotes stop the shell from expanding $1 inside "$1000")
python main.py bank_report.pdf --custom-prompt 'Extract all transaction amounts greater than $1000'

Batch Directory Processing

# Process all PDFs in a directory
python main.py -d /path/to/pdf/directory -o /path/to/output

# Process recursively with custom patterns
python main.py -d /path/to/pdf/directory -o /path/to/output --recursive -f my_patterns.txt

# Batch processing with all outputs
python main.py -d /path/to/pdf/directory -o /path/to/output --save-text --save-cleaned-pdf --max-workers 8
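
The directory scan behind -d and --recursive could look roughly like the sketch below. The helper name find_pdfs and the case-insensitive ".pdf" match are assumptions for illustration, not the project's actual code.

```python
from pathlib import Path
from typing import List


def find_pdfs(directory: str, recursive: bool = False) -> List[Path]:
    """Return PDF files under `directory`, sorted for a stable order.

    With recursive=True, subdirectories are searched as well.
    """
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    # Match the extension case-insensitively so "REPORT.PDF" is included.
    return sorted(p for p in candidates
                  if p.is_file() and p.suffix.lower() == ".pdf")
```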

Simplified Batch Processing

Use the simplified batch script for common operations:

# Basic batch processing
python batch_process.py input_directory output_directory

# With all options
python batch_process.py input_directory output_directory --recursive --save-all --workers 6 --verbose

Command Line Options

Main Script (main.py)

Input/Output:

  • input_pdf: Path to a single PDF file (or use -d to process a directory)
  • -d, --directory: Process all PDFs in this directory
  • -o, --output: Output directory (default: 'output')

Processing:

  • -p, --patterns: Patterns to remove (space-separated)
  • -f, --pattern-file: File containing patterns (one per line)
  • --api-key: Gemini API key (or set GEMINI_API_KEY env var)
  • --custom-prompt: Custom prompt for AI analysis

Output Options:

  • --save-cleaned-pdf: Save cleaned PDF file
  • --save-text: Save extracted text file

Batch Options:

  • --max-workers: Maximum concurrent workers (default: 4)
  • --recursive: Process subdirectories recursively

General:

  • -v, --verbose: Enable verbose logging
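
For readers who want to see how these options fit together, the following argparse sketch mirrors the flags listed above. It is a reconstruction from this list, not the project's actual parser; the function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented main.py options (a sketch)."""
    parser = argparse.ArgumentParser(
        description="Clean and analyze PDF bank reports")
    # Input/Output
    parser.add_argument("input_pdf", nargs="?",
                        help="path to a single PDF file")
    parser.add_argument("-d", "--directory",
                        help="process all PDFs in this directory")
    parser.add_argument("-o", "--output", default="output",
                        help="output directory")
    # Processing
    parser.add_argument("-p", "--patterns", nargs="+", default=[],
                        help="patterns to remove (space-separated)")
    parser.add_argument("-f", "--pattern-file",
                        help="file containing patterns, one per line")
    parser.add_argument("--api-key", help="Gemini API key")
    parser.add_argument("--custom-prompt",
                        help="custom prompt for AI analysis")
    # Output options
    parser.add_argument("--save-cleaned-pdf", action="store_true")
    parser.add_argument("--save-text", action="store_true")
    # Batch options
    parser.add_argument("--max-workers", type=int, default=4)
    parser.add_argument("--recursive", action="store_true")
    # General
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser
```

For example, build_parser().parse_args(["-d", "reports", "--max-workers", "8"]) leaves input_pdf as None and output at its default of 'output'.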

Batch Script (batch_process.py)

  • input_directory: Directory with PDF files
  • output_directory: Where to save results
  • --pattern-file: Pattern file to use
  • -r, --recursive: Process subdirectories
  • --save-all: Save both PDF and text files
  • --workers: Number of parallel workers
  • -v, --verbose: Verbose output

Pattern Files

Create custom pattern files to remove unwanted content from PDFs. Each line is a regex pattern:

# Remove page numbers
^Page \d+ of \d+$

# Remove bank logos
^BANCO.*LOGO$

# Remove confidential markers
CONFIDENTIAL
INTERNAL USE ONLY

# Remove blank lines
^\s*$

The default patterns are in config/default_patterns.txt.
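
A minimal sketch of how such a file could be loaded and applied with Python's re module is shown below. It rests on three assumptions that the project does not confirm: lines starting with "#" are comments, each pattern is tested against one line of text at a time, and a line that matches any pattern is dropped entirely. The function names are hypothetical.

```python
import re
from typing import Iterable, List


def load_patterns(lines: Iterable[str]) -> list:
    """Compile one regex per line, skipping blank lines and '#' comments."""
    patterns = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        patterns.append(re.compile(line))
    return patterns


def clean_text(text: str, patterns: list) -> str:
    """Drop every line of `text` that matches at least one pattern."""
    kept: List[str] = []
    for line in text.splitlines():
        if any(p.search(line) for p in patterns):
            continue
        kept.append(line)
    return "\n".join(kept)
```

With a real file, the patterns could be loaded by passing an open file handle, for example load_patterns(open("config/default_patterns.txt", encoding="utf-8")).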

Output Files

For each processed PDF, the tool generates:

  • {filename}_cleaned.txt: Extracted and cleaned text (if --save-text)
  • {filename}_cleaned.pdf: Cleaned PDF file (if --save-cleaned-pdf)
  • {filename}_analysis.json: AI analysis results
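
The naming scheme above can be expressed with pathlib as in the sketch below. The helper output_paths and its dictionary keys are hypothetical; only the file name suffixes come from the list above.

```python
from pathlib import Path
from typing import Dict


def output_paths(input_pdf: str, output_dir: str = "output",
                 save_text: bool = False,
                 save_cleaned_pdf: bool = False) -> Dict[str, Path]:
    """Map each output kind to the path it would be written to."""
    stem = Path(input_pdf).stem  # "reports/statement.pdf" -> "statement"
    out = Path(output_dir)
    # The analysis JSON is listed without a flag, so it is always included.
    paths = {"analysis": out / f"{stem}_analysis.json"}
    if save_text:
        paths["text"] = out / f"{stem}_cleaned.txt"
    if save_cleaned_pdf:
        paths["cleaned_pdf"] = out / f"{stem}_cleaned.pdf"
    return paths
```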

Examples

Process Single Bank Statement

python main.py statement.pdf -f config/default_patterns.txt --save-text

Batch Process Monthly Reports

python main.py -d monthly_reports/ -o processed_reports/ --save-cleaned-pdf --max-workers 6

Custom Analysis for Specific Transactions

python main.py report.pdf --custom-prompt 'List all transactions over $5000 with dates and descriptions'

Use Simplified Batch Script

python batch_process.py input_pdfs/ output_results/ --recursive --save-all

Error Handling

  • Files that fail to process are logged and reported in batch summary
  • Processing continues even if some files fail
  • Gemini API errors are handled gracefully: processing continues without AI analysis
  • Detailed error messages help identify issues
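
The "continue on failure" behavior described above can be sketched with concurrent.futures from the standard library. This is an illustration under stated assumptions, not the project's implementation: process_batch and the process_one callback are hypothetical names, and the real tool may report failures differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def process_batch(paths: Iterable[str],
                  process_one: Callable[[str], object],
                  max_workers: int = 4) -> Tuple[List[str], Dict[str, str]]:
    """Run process_one on every path in a thread pool.

    A failure in one file is recorded and does not stop the others.
    Returns (sorted list of successes, {failed path: error message}).
    """
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(path)
            except Exception as exc:
                failed[path] = str(exc)
    return sorted(succeeded), failed
```

The returned dictionary of failures is enough to print a batch summary once all workers have finished.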

Performance

  • Multi-threaded processing for batch operations
  • Configurable worker count for optimal performance
  • Memory-efficient text processing
  • Progress tracking for batch operations

Requirements

  • Python 3.7+
  • Google Gemini API key for AI analysis
  • See requirements.txt for package dependencies

License

This project is for educational and defensive security purposes only.
