Docler

A unified Python library for document conversion and OCR that provides a consistent interface to multiple document processing providers. Extract text, images, and metadata from PDFs, images, and office documents using state-of-the-art OCR and document AI services.

Features

Unified Interface: Single API for multiple document processing providers
Multiple Providers: Support for 10+ OCR and document AI services
Rich Output: Extract text, images, tables, and metadata
Async Support: Built-in async/await support
Flexible Configuration: Provider-specific settings and preferences
Page Range Support: Process specific pages from documents
Multi-language OCR: Support for 100+ languages across providers
Structured Output: Standardized markdown with embedded metadata

Quick Start

import asyncio
from docler import MistralConverter

async def main():
    # Use the aggregated converter for automatic provider selection
    converter = MistralConverter()

    # Convert a document
    result = await converter.convert_file("document.pdf")

    print(f"Title: {result.title}")
    print(f"Content: {result.content[:500]}...")
    print(f"Images: {len(result.images)} extracted")
    print(f"Pages: {result.page_count}")

asyncio.run(main())

Available OCR Converters

Cloud API Providers

Azure Document Intelligence

from docler import AzureConverter

converter = AzureConverter(
    endpoint="your-endpoint",
    api_key="your-key",
    model="prebuilt-layout"
)

Mistral OCR

from docler import MistralConverter

converter = MistralConverter(
    api_key="your-key",
    languages=["en", "fr", "de"]
)

LlamaParse

from docler import LlamaParseConverter

converter = LlamaParseConverter(
    api_key="your-key",
    adaptive_long_table=True
)

Upstage Document AI

from docler import UpstageConverter

converter = UpstageConverter(
    api_key="your-key",
    chart_recognition=True
)

DataLab

from docler import DataLabConverter

converter = DataLabConverter(
    api_key="your-key",
    use_llm=False  # Enable for higher accuracy
)

Local/Self-Hosted Providers

Marker

from docler import MarkerConverter

converter = MarkerConverter(
    dpi=192,
    use_llm=True,  # Requires local LLM setup
    llm_provider="ollama"
)

Docling

from docler import DoclingConverter

converter = DoclingConverter(
    ocr_engine="easy_ocr",
    image_scale=2.0
)

Docling Remote

from docler import DoclingRemoteConverter

converter = DoclingRemoteConverter(
    endpoint="http://localhost:5001",
    pdf_backend="dlparse_v4"
)

MarkItDown (Microsoft)

from docler import MarkItDownConverter

converter = MarkItDownConverter()

LLM-Based Providers

LLM Converter

from docler import LLMConverter

converter = LLMConverter(
    model="gpt-4o",  # or claude-3-5-sonnet, etc.
    system_prompt="Extract text preserving formatting..."
)

Provider Comparison

Provider	Cost/Page	Local	API Required	Best For
Azure	$0.0096	❌	✅	Enterprise forms, invoices
Mistral	Variable	❌	✅	High-quality text extraction
LlamaParse	$0.0045	❌	✅	Complex layouts, academic papers
Upstage	$0.01	❌	✅	Charts, presentations
DataLab	$0.0015	❌	✅	Cost-effective processing
Marker	Free	✅	❌	Privacy-sensitive documents
Docling	Free	✅	❌	Open-source processing
MarkItDown	Free	✅	❌	Office documents
LLM	Variable	❌	✅	Latest AI capabilities

Advanced Usage

Directory Processing

Process entire directories with progress tracking:

from docler import DirectoryConverter, MarkerConverter

base_converter = MarkerConverter()
dir_converter = DirectoryConverter(base_converter, chunk_size=10)

# Convert all supported files
results = await dir_converter.convert("./documents/")

# Or with progress tracking
async for state in dir_converter.convert_with_progress("./documents/"):
    print(f"Progress: {state.processed_files}/{state.total_files}")
    print(f"Current: {state.current_file}")
    if state.errors:
        print(f"Errors: {len(state.errors)}")

Page Range Processing

Extract specific pages from documents:

# Extract pages 1-5 and 10-15
converter = MistralConverter(page_range="1-5,10-15")
result = await converter.convert_file("large_document.pdf")

Batch Processing

Process multiple files efficiently:

files = ["doc1.pdf", "doc2.png", "doc3.docx"]
results = await converter.convert_files(files)

for file, result in zip(files, results):
    print(f"{file}: {len(result.content)} characters extracted")

Output Format

All converters return a standardized Document object with:

class Document:
    content: str           # Extracted text in markdown format
    images: list[Image]    # Extracted images with metadata
    title: str            # Document title
    source_path: str      # Original file path
    mime_type: str        # File MIME type
    metadata: dict        # Provider-specific metadata
    page_count: int       # Number of pages processed

The markdown content includes standardized metadata for page breaks and structure:

<!-- docler:page_break {"next_page":1} -->
# Document Title

Content from page 1...

<!-- docler:page_break {"next_page":2} -->
More content from page 2...

Installation

# Basic installation
pip install docler

# With specific provider dependencies
pip install docler[azure]      # Azure Document Intelligence
pip install docler[mistral]    # Mistral OCR
pip install docler[marker]     # Marker PDF processing
pip install docler[all]        # All providers

Environment Variables

Configure API keys via environment variables:

export AZURE_DOC_INTELLIGENCE_ENDPOINT="your-endpoint"
export AZURE_DOC_INTELLIGENCE_KEY="your-key"
export MISTRAL_API_KEY="your-key"
export LLAMAPARSE_API_KEY="your-key"
export UPSTAGE_API_KEY="your-key"
export DATALAB_API_KEY="your-key"

Contributing

We welcome contributions! See our contributing guidelines for details.

License

MIT License - see LICENSE for details.

Links

Documentation: https://phil65.github.io/docler/
PyPI: https://pypi.org/project/docler/
GitHub: https://github.com/phil65/docler/
Issues: https://github.com/phil65/docler/issues
Discussions: https://github.com/phil65/docler/discussions

Coming Soon: FastAPI demo with bring-your-own-keys on https://contexter.net

Name		Name	Last commit message	Last commit date
Latest commit History 555 Commits
.github		.github
docs		docs
overrides		overrides
src		src
tests		tests
.copier-answers.yml		.copier-answers.yml
.dockerignore		.dockerignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
compose.yml		compose.yml
duties.py		duties.py
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml

Uh oh!

License

phil65/docler

Folders and files

Latest commit

History

Repository files navigation

Docler

Features

Quick Start

Available OCR Converters

Cloud API Providers

Azure Document Intelligence

Mistral OCR

LlamaParse

Upstage Document AI

DataLab

Local/Self-Hosted Providers

Marker

Docling

Docling Remote

MarkItDown (Microsoft)

LLM-Based Providers

LLM Converter

Provider Comparison

Advanced Usage

Directory Processing

Page Range Processing

Batch Processing

Output Format

Installation

Environment Variables

Contributing

License

Links

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 29

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Uh oh!

Contributors 2

Uh oh!

Languages

Packages