Learn to build modern AI systems from the ground up through hands-on implementation
Master one of the most in-demand AI engineering skills: RAG (Retrieval-Augmented Generation)
This is a learner-focused project where you'll build a complete research assistant system that automatically fetches academic papers, understands their content, and answers your research questions using advanced RAG techniques.
The arXiv Paper Curator will teach you to build a production-grade RAG system using industry best practices. You'll master the architecture, implementation, and deployment of AI systems that professionals use in the real world.
By the end of this course, you'll have your own AI research assistant and the skills to build similar systems for any domain.
- Docker Desktop (with Docker Compose)
- Python 3.12+
- UV Package Manager (Install Guide)
- 8GB+ RAM and 20GB+ free disk space
```bash
# 1. Clone and setup
git clone <repository-url>
cd arxiv-paper-curator
uv sync

# 2. Start all services
docker compose up --build -d

# 3. Verify everything works
curl http://localhost:8000/health
```

| Week | Topic | Blog Post | Code Release |
|---|---|---|---|
| Week 0 | The Mother of AI project - 6 phases | The Mother of AI project | - |
| Week 1 | Infrastructure Foundation | The Infrastructure That Powers RAG Systems | week1.0 |
| Week 2 | Data Ingestion Pipeline | Building Data Ingestion Pipelines for RAG | week2.0 |
| Week 3 | Hybrid Search Implementation | Coming Soon | Coming Soon |
| Week 4 | Advanced Chunking & Retrieval | Coming Soon | Coming Soon |
| Week 5 | Full RAG Pipeline | Coming Soon | Coming Soon |
| Week 6 | Production Deployment | Coming Soon | Coming Soon |
📥 Clone a specific week's release:
```bash
# Clone a specific week's code
git clone --branch <WEEK_TAG> https://github.com/jamwithai/arxiv-paper-curator
cd arxiv-paper-curator
uv sync
docker compose down -v
docker compose up --build -d

# Replace <WEEK_TAG> with: week1.0, week2.0, etc.
```

| Service | URL | Purpose |
|---|---|---|
| API Documentation | http://localhost:8000/docs | Interactive API testing |
| Airflow Dashboard | http://localhost:8080 | Workflow management |
| OpenSearch Dashboards | http://localhost:5601 | Hybrid search engine UI |
Start here! Master the infrastructure that powers modern RAG systems.
- Complete infrastructure setup with Docker Compose
- FastAPI development with automatic documentation and health checks
- PostgreSQL database configuration and management
- OpenSearch hybrid search engine setup
- Ollama local LLM service configuration
- Service orchestration and health monitoring
- Professional development environment with code quality tools
Infrastructure Components:
- FastAPI: REST endpoints with async support (Port 8000)
- PostgreSQL 16: Paper metadata storage (Port 5432)
- OpenSearch 2.19: Search engine with dashboards (Ports 9200, 5601)
- Apache Airflow 3.0: Workflow orchestration (Port 8080)
- Ollama: Local LLM server (Port 11434)
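Before opening the notebook, you can sanity-check that each of the services above is actually listening on its expected port. Below is a minimal sketch using only the Python standard library; the ports come from the component list above, but the script itself is not part of the repository:

```python
# Minimal port-check sketch (not part of the repo): verifies each
# infrastructure service listed above is accepting TCP connections.
import socket

SERVICES = {
    "FastAPI": 8000,
    "PostgreSQL": 5432,
    "OpenSearch": 9200,
    "OpenSearch Dashboards": 5601,
    "Airflow": 8080,
    "Ollama": 11434,
}

for name, port in SERVICES.items():
    try:
        with socket.create_connection(("localhost", port), timeout=2):
            print(f"{name:22} port {port}: up")
    except OSError:
        print(f"{name:22} port {port}: not reachable")
```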
```bash
# Launch the Week 1 notebook
uv run jupyter notebook notebooks/week1/week1_setup.ipynb
```

Complete when you can:
- Start all services with `docker compose up -d`
- Access API docs at http://localhost:8000/docs
- Login to Airflow at http://localhost:8080
- Browse OpenSearch at http://localhost:5601
- All tests pass: `uv run pytest`
Blog Post: The Infrastructure That Powers RAG Systems - Detailed walkthrough and production insights
Building on Week 1 infrastructure: Learn to fetch, process, and store academic papers automatically.
- arXiv API integration with rate limiting and retry logic
- Scientific PDF parsing using Docling
- Automated data ingestion pipelines with Apache Airflow
- Metadata extraction and storage workflows
- Complete paper processing from API to database
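To make the rate-limiting and retry idea from the first bullet concrete, here is a generic sketch against the public arXiv API using only the standard library. It is independent of the course's `ArxivClient`; the retry count and backoff values are illustrative choices, not the repo's actual settings (arXiv's usage guidelines ask clients to pause roughly 3 seconds between requests):

```python
# Generic rate-limited fetch with retry/backoff (illustrative only;
# the course's ArxivClient implements its own version of this pattern).
import time
import urllib.request

ARXIV_API = "http://export.arxiv.org/api/query?search_query=cat:cs.AI&max_results=5"

def fetch_with_retry(url: str, retries: int = 3, backoff: float = 3.0) -> str:
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read().decode("utf-8")  # Atom XML feed
        except OSError:
            if attempt == retries - 1:
                raise
            # Exponential backoff between failed attempts
            time.sleep(backoff * 2 ** attempt)
    raise RuntimeError("unreachable")

feed = fetch_with_retry(ARXIV_API)
time.sleep(3)  # arXiv guideline: ~3 seconds between successive requests
print(feed[:200])
```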
Data Pipeline Components:
- MetadataFetcher: 🎯 Main orchestrator coordinating the entire pipeline
- ArxivClient: Rate-limited paper fetching with retry logic
- PDFParserService: Docling-powered scientific document processing
- Airflow DAGs: Automated daily paper ingestion workflows
- PostgreSQL Storage: Structured paper metadata and content
```bash
# Launch the Week 2 notebook
uv run jupyter notebook notebooks/week2/week2_arxiv_integration.ipynb
```

arXiv API Integration:
```python
# Example: Fetch papers with rate limiting
from src.services.arxiv.factory import make_arxiv_client

async def fetch_recent_papers():
    client = make_arxiv_client()
    papers = await client.search_papers(
        query="cat:cs.AI",
        max_results=10,
        from_date="20240801",
        to_date="20240807",
    )
    return papers
```

PDF Processing Pipeline:
```python
# Example: Parse PDF with Docling
from src.services.pdf_parser.factory import make_pdf_parser_service

async def process_paper_pdf(pdf_url: str):
    parser = make_pdf_parser_service()
    parsed_content = await parser.parse_pdf_from_url(pdf_url)
    return parsed_content  # Structured content with text, tables, figures
```

Complete Ingestion Workflow:
```python
# Example: Full paper ingestion pipeline
from src.services.metadata_fetcher import make_metadata_fetcher

async def ingest_papers():
    fetcher = make_metadata_fetcher()
    results = await fetcher.fetch_and_store_papers(
        query="cat:cs.AI",
        max_results=5,
        from_date="20240807",
    )
    return results  # Papers stored in database with full content
```

Complete when you can:
- Fetch papers from arXiv API: Test in the Week 2 notebook
- Parse PDF content with Docling: View extracted structured content
- Run Airflow DAG: `arxiv_paper_ingestion` executes successfully
- Verify database storage: Papers appear in PostgreSQL with full content
- API endpoints work: `/papers` returns stored papers with metadata
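To verify the last item from outside the docs UI, you can hit the endpoint directly. The sketch below uses only the standard library; the `/papers` path comes from the checklist above, but the exact query parameters and response shape are assumptions to confirm against http://localhost:8000/docs:

```python
# Quick check of the /papers endpoint (path from the checklist above;
# the response schema is an assumption -- confirm it in the API docs).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/papers", timeout=10) as resp:
    papers = json.load(resp)

print(json.dumps(papers, indent=2)[:500])  # Peek at the start of the payload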
Blog Post: Building Data Ingestion Pipelines for RAG - arXiv API integration and PDF processing
Building on Weeks 1-2 foundation: Advanced RAG techniques and production deployment.
- Week 3: OpenSearch hybrid search implementation with BM25 + semantic vectors
- Week 4: Context-aware chunking and retrieval evaluation with nDCG metrics
- Week 5: Full RAG pipeline with LLM integration and prompt optimization
- Week 6: Observability with Langfuse, A/B testing, and production deployment
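As a preview of the Week 4 evaluation work, nDCG is simple to compute by hand. The sketch below is a self-contained reference implementation of the standard formula, unrelated to the course codebase:

```python
# Reference nDCG@k implementation (standard formula; not course code).
import math

def dcg(relevances: list[float]) -> float:
    # Sum of rel_i / log2(i + 1) over 1-indexed ranks i
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances: list[float], k: int) -> float:
    # Normalize by the DCG of an ideally sorted ranking
    ideal = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal if ideal > 0 else 0.0

# Example: graded relevance judgments for the top 5 retrieved papers
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 3))
```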
| Service | Purpose | Status |
|---|---|---|
| FastAPI | REST API with automatic docs | ✅ Ready |
| PostgreSQL 16 | Paper metadata and content storage | ✅ Ready |
| OpenSearch 2.19 | Hybrid search engine | ✅ Ready |
| Apache Airflow 3.0 | Workflow automation | ✅ Ready |
| Ollama | Local LLM serving | ✅ Ready |
Development Tools: UV, Ruff, MyPy, Pytest, Docker Compose
```
arxiv-paper-curator/
├── src/ # Main application code
│ ├── main.py # FastAPI application
│ ├── routers/ # API endpoints
│ ├── models/ # Database models (SQLAlchemy)
│ ├── repositories/ # Data access layer
│ ├── schemas/ # Pydantic validation schemas
│ ├── services/ # Business logic
│ │ ├── arxiv/ # ✨ NEW: arXiv API client
│ │ ├── pdf_parser/ # ✨ NEW: Docling PDF processing
│ │ ├── metadata_fetcher.py # ✨ NEW: Complete ingestion pipeline
│ │ └── ollama/ # Ollama LLM service
│ ├── db/ # Database configuration
│ ├── config.py # Environment configuration
│ └── dependencies.py # Dependency injection
│
├── notebooks/ # Learning materials
│ ├── week1/ # Week 1: Infrastructure setup
│ │ └── week1_setup.ipynb # Complete setup guide
│ └── week2/ # ✨ NEW: Week 2 materials
│ └── week2_data_ingestion.ipynb # Data pipeline guide
│
├── airflow/ # Workflow orchestration
│ ├── dags/ # Workflow definitions
│ │ ├── arxiv_ingestion/ # ✨ NEW: arXiv ingestion modules
│ │ └── arxiv_paper_ingestion.py # ✨ NEW: Main ingestion DAG
│ └── requirements-airflow.txt # ✨ NEW: Airflow dependencies
│
├── tests/ # Comprehensive test suite
├── static/ # Assets (images, GIFs)
└── compose.yml              # Service orchestration
```
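The `routers/` + `dependencies.py` split in the tree above reflects FastAPI's dependency-injection pattern: endpoints declare what they need, and FastAPI wires it in per request. Here is an illustrative sketch of that pattern; all names below are hypothetical, not the repo's actual symbols:

```python
# Illustrative routers/ + dependencies.py pattern (hypothetical names,
# not the repo's actual code). Run with: uvicorn sketch:app
from fastapi import Depends, FastAPI

def get_settings() -> dict:
    # Stand-in for a real dependency (e.g., config or a DB session factory)
    return {"app_name": "arxiv-paper-curator"}

app = FastAPI()

@app.get("/health")
def health(settings: dict = Depends(get_settings)) -> dict:
    # The dependency is injected per-request instead of imported globally
    return {"status": "ok", "app": settings["app_name"]}
```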
```bash
# View all available commands
make help

# Quick workflow
make start    # Start all services
make health   # Check all services health
make test     # Run tests
make stop     # Stop services
```

| Command | Description |
|---|---|
| `make start` | Start all services |
| `make stop` | Stop all services |
| `make restart` | Restart all services |
| `make status` | Show service status |
| `make logs` | Show service logs |
| `make health` | Check all services health |
| `make setup` | Install Python dependencies |
| `make format` | Format code |
| `make lint` | Lint and type check |
| `make test` | Run tests |
| `make test-cov` | Run tests with coverage |
| `make clean` | Clean up everything |
```bash
# If you prefer using commands directly
docker compose up --build -d   # Start services
docker compose ps              # Check status
docker compose logs            # View logs
uv run pytest                  # Run tests
```

| Who | Why |
|---|---|
| AI/ML Engineers | Learn production RAG architecture beyond tutorials |
| Software Engineers | Build end-to-end AI applications with best practices |
| Data Scientists | Implement production AI systems using modern tools |
Common Issues:
- Services not starting? Wait 2-3 minutes, check `docker compose logs`
- Port conflicts? Stop other services using ports 8000, 8080, 5432, 9200
- Memory issues? Increase Docker Desktop memory allocation
Get Help:
- Check the comprehensive Week 1 notebook troubleshooting section
- Review service logs: `docker compose logs [service-name]`
- Complete reset: `docker compose down --volumes && docker compose up --build -d`
This course is completely free! The only potential costs are for optional services:
- Local Development: $0 (everything runs locally)
- Optional Cloud APIs: ~$2-5 for external LLM services (if chosen)
Begin with the Week 1 setup notebook and build your first production RAG system!
For learners who want to master modern AI engineering
Built with love by Jam With AI
MIT License - see LICENSE file for details.