Get your documents ready for gen AI
Updated Jul 26, 2025 - Python
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
Awesome multilingual OCR and document parsing toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices)
Eden AI simplifies the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines
Knowledge Agents and Management in the Cloud
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)
Transform your documents into intelligent conversations. This open-source RAG chatbot combines semantic search with fine-tuned language models (LLaMA, Qwen2.5VL-3B) to deliver accurate, context-aware responses from your own knowledge base. Join our community!
The metadata and text content extractor for almost every file type.
A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.
Extract and analyze content from various file formats, including PDFs, text files, and images.
Docparser OCR Package for PHP Laravel
An intelligent resume parsing engine built with Python and NLP, aimed at automating the tedious task of sifting through resumes. It accurately extracts vital candidate information such as contact details, employment history, educational qualifications, and technical skills, making it an invaluable asset for recruitment and HR professionals.
LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.
ZubairHub is a Streamlit-based application that integrates various functionalities, including social graph visualization, object detection, document parsing, text extraction, generative AI interaction, and personal data transformation.
Convert PDFs, DOCX, TXT & more into structured JSON trees using Python. Built for legal, institutional and NLP applications.
Combining OCR for text extraction with LLMs for accurate, efficient document structuring.
Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.
This repository showcases a practical and easy-to-follow implementation of text extraction using Large Language Models (LLMs). Designed for developers, data scientists, and AI enthusiasts, it walks you through everything from setup to evaluation, making it a great resource for real-world NLP applications.
Opinionated and Sophisticated Document Region Analyzer.
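Several of the projects listed above turn documents into structured output, such as JSON trees. As a rough illustration of that general idea, and not the method of any listed repository, the sketch below nests plain text under its headings using only the Python standard library. The helper name `text_to_tree`, the `#`-style heading markers, and the node fields are all assumptions made for this example.

```python
import json


def text_to_tree(text: str) -> dict:
    """Hypothetical helper: nest heading-marked plain text into a dict tree."""
    root = {"title": "document", "level": 0, "text": [], "children": []}
    stack = [root]  # path from the root to the section currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            # Assumed convention: the number of leading '#' gives the depth.
            level = len(line) - len(line.lstrip("#"))
            node = {
                "title": line[level:].strip(),
                "level": level,
                "text": [],
                "children": [],
            }
            # Close any open sections at the same depth or deeper.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Body text belongs to the most recently opened section.
            stack[-1]["text"].append(line)
    return root


sample = """# Contract
Preamble text.
## Parties
Alice and Bob.
## Terms
Net 30.
"""

tree = text_to_tree(sample)
print(json.dumps(tree, indent=2))
```

Real parsers have to recover headings from layout, fonts, or OCR output rather than from explicit markers, so this only shows the shape of the target structure.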