A comprehensive Medical Retrieval-Augmented Generation (RAG) application that allows users to upload medical PDF documents and query them using natural language through an AI-powered interface.
- PDF Processing: Extract and process medical documents with intelligent text cleaning
- Vector Database: Store document embeddings using ChromaDB for efficient similarity search
- Medical Context Awareness: Specialized prompt engineering for medical document queries
- Interactive Web Interface: User-friendly Streamlit interface for document upload and querying
- Source Attribution: Always cite the source documents when providing answers
- Medical Term Normalization: Automatically expand common medical abbreviations (a sketch of this idea follows the list)
- Advanced Search: Configurable similarity thresholds and retrieval parameters
- Batch Processing: Support for multiple document uploads
- System Monitoring: Comprehensive logging and system statistics
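Conceptually, the abbreviation expansion mentioned in the Medical Term Normalization item is a lookup table applied to query text. The sketch below is illustrative only: the MEDICAL_ABBREVIATIONS table and the normalize_medical_terms function are hypothetical names, not the project's actual code.

```python
import re

# Hypothetical abbreviation map; the project's real table may differ.
MEDICAL_ABBREVIATIONS = {
    "htn": "hypertension",
    "dm": "diabetes mellitus",
    "mi": "myocardial infarction",
    "copd": "chronic obstructive pulmonary disease",
    "bp": "blood pressure",
}

def normalize_medical_terms(text: str) -> str:
    """Expand known medical abbreviations, leaving the rest of the text untouched."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        return MEDICAL_ABBREVIATIONS.get(word.lower(), word)
    return re.sub(r"\b\w+\b", replace, text)

print(normalize_medical_terms("Patient has HTN and DM."))
# -> "Patient has hypertension and diabetes mellitus."
```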
- Python 3.12 or higher
- OpenAI API key
- Medical PDF documents to query
- Clone and navigate to the project:
  cd /path/to/your/project
- Install dependencies:
  pip install -e .
- Set up environment variables: create a .env file in the project root with the following values (a quick check of these values is sketched after this list):
  OPENAI_API_KEY=your_openai_api_key_here
  OPENAI_MODEL=gpt-3.5-turbo
  OPENAI_MAX_TOKENS=1000
  OPENAI_TEMPERATURE=0.7
- Run the application:
  streamlit run main.py
- Open your browser and navigate to http://localhost:8501
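As noted in the environment setup step, you can confirm that the values in .env are visible to Python before launching the app. This check assumes the python-dotenv package is installed; how MEDRAG itself loads its configuration is not shown here, so treat it as a standalone sanity check.

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Read key/value pairs from .env into the process environment.
load_dotenv()

print("OPENAI_API_KEY set:", bool(os.getenv("OPENAI_API_KEY")))
print("Model:", os.getenv("OPENAI_MODEL", "gpt-3.5-turbo"))
```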
- Click "Browse files" to select one or more PDF documents
- Supported format: PDF files
- Documents are automatically processed and added to the vector database
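Behind that automatic processing, the documented CHUNK_SIZE and CHUNK_OVERLAP settings suggest a sliding-window split of the extracted text before embedding. The sketch below illustrates the idea with pypdf; the function name chunk_pdf is hypothetical, and pdf_processor.py may use a different library and cleaning rules.

```python
from pypdf import PdfReader  # pip install pypdf; the project may use another library

def chunk_pdf(path: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Extract text from a PDF and split it into overlapping fixed-size chunks."""
    reader = PdfReader(path)
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Example: chunks = chunk_pdf("cardiology_guidelines.pdf")
```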
- Type your medical question in the text area
- Examples:
- "What are the symptoms of diabetes?"
- "What medications are mentioned in the documents?"
- "What are the treatment options for hypertension?"
- AI Response: The main answer based on your documents
- Sources: Expandable sections showing the source documents and relevance scores
- Advanced Options: Adjust retrieval parameters for better results
- Sidebar: Shows document count and processed files
- Settings: Clear all documents or view system statistics
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | Required | Your OpenAI API key |
| OPENAI_MODEL | gpt-3.5-turbo | OpenAI model to use |
| OPENAI_MAX_TOKENS | 1000 | Maximum tokens for responses |
| OPENAI_TEMPERATURE | 0.7 | Response creativity (0-1) |
| VECTOR_STORE_PATH | ./vector_store | Path for ChromaDB storage |
| EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | Embedding model |
| CHUNK_SIZE | 1000 | Text chunk size for processing |
| CHUNK_OVERLAP | 200 | Overlap between chunks |
| TOP_K_RETRIEVAL | 5 | Number of documents to retrieve |
| SIMILARITY_THRESHOLD | 0.7 | Minimum similarity score |
You can modify the configuration in medrag/utils/config.py
or set environment variables for runtime configuration.
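The configuration module is expected to read these variables from the environment and fall back to the defaults in the table above. Here is a minimal sketch of that pattern; the field names mirror the table, but the actual medrag/utils/config.py may be structured differently.

```python
import os
from dataclasses import dataclass

@dataclass
class Config:
    """Runtime configuration with environment-variable overrides (illustrative)."""
    openai_api_key: str = os.getenv("OPENAI_API_KEY", "")  # required, no default
    openai_model: str = os.getenv("OPENAI_MODEL", "gpt-3.5-turbo")
    openai_max_tokens: int = int(os.getenv("OPENAI_MAX_TOKENS", "1000"))
    openai_temperature: float = float(os.getenv("OPENAI_TEMPERATURE", "0.7"))
    vector_store_path: str = os.getenv("VECTOR_STORE_PATH", "./vector_store")
    embedding_model: str = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
    chunk_size: int = int(os.getenv("CHUNK_SIZE", "1000"))
    chunk_overlap: int = int(os.getenv("CHUNK_OVERLAP", "200"))
    top_k_retrieval: int = int(os.getenv("TOP_K_RETRIEVAL", "5"))
    similarity_threshold: float = float(os.getenv("SIMILARITY_THRESHOLD", "0.7"))

config = Config()
```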
MEDRAG Application
├── main.py # Main Streamlit application
├── medrag/
│ ├── core/
│ │ ├── pdf_processor.py # PDF text extraction and chunking
│ │ ├── vector_store.py # ChromaDB vector database management
│ │ ├── llm_client.py # OpenAI API integration
│ │ └── rag_system.py # RAG pipeline orchestration
│ └── utils/
│ └── config.py # Configuration management
└── vector_store/ # ChromaDB data storage
- PDFProcessor: Handles PDF text extraction, cleaning, and chunking
- VectorStore: Manages document embeddings and similarity search
- LLMClient: Communicates with OpenAI for response generation
- RAGSystem: Orchestrates the complete RAG pipeline
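For orientation, here is a minimal retrieve-then-generate sketch built directly on ChromaDB and the OpenAI client. It mirrors what the components above are described as doing, but it does not reproduce the actual interfaces in medrag/core, so the collection name, prompt, and answer function are illustrative only.

```python
import chromadb
from openai import OpenAI

# Local persistent vector store (mirrors the VECTOR_STORE_PATH default).
client = chromadb.PersistentClient(path="./vector_store")
collection = client.get_or_create_collection(name="medical_documents")  # name is illustrative

# Chunks must already be stored, e.g.:
# collection.add(documents=chunks, ids=[f"doc-{i}" for i in range(len(chunks))])

def answer(question: str, top_k: int = 5) -> str:
    """Retrieve the most similar chunks, then ask the LLM to answer from them."""
    results = collection.query(query_texts=[question], n_results=top_k)
    context = "\n\n".join(results["documents"][0])
    prompt = (
        "Answer the medical question using only the context below and cite the sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    llm = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1000,
        temperature=0.7,
    )
    return response.choices[0].message.content
```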
The application includes built-in system testing:
from medrag.core.rag_system import RAGSystem

# Initialize the pipeline (assuming a no-argument constructor), then test it
rag_system = RAGSystem()
test_results = rag_system.test_system()
print(test_results)
- Upload a medical PDF document
- Ask a question about the document content
- Verify the response cites the source document
- Check that the answer is based on the uploaded content
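If you prefer to script this check, something like the following could work; add_document and query are hypothetical method names standing in for whatever ingestion and query entry points rag_system.py actually exposes.

```python
from medrag.core.rag_system import RAGSystem

rag_system = RAGSystem()  # assuming a no-argument constructor

# Hypothetical method names; substitute the real ingestion/query entry points.
rag_system.add_document("diabetes_overview.pdf")
result = rag_system.query("What are the symptoms of diabetes?")

print(result)  # the answer should cite diabetes_overview.pdf as a source
```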
- Local Processing: All document processing happens locally
- No Data Sharing: Full documents are never uploaded to external services; only the retrieved text chunks included in a query prompt are sent to OpenAI
- Secure Storage: Vector database is stored locally
- API Key Protection: Use environment variables for sensitive configuration
This application is for educational and research purposes only. It should not be used for medical diagnosis or treatment decisions. Always consult qualified healthcare professionals for medical advice.
- The quality of responses depends on the uploaded documents
- Complex medical queries may require multiple documents for comprehensive answers
- The system works best with well-structured medical documents
- Large PDF files may take longer to process
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
If you encounter any issues:
- Check the console logs for error messages
- Verify your OpenAI API key is valid (a quick standalone check is sketched after this list)
- Ensure your PDF documents are not password-protected
- Check that you have sufficient disk space for the vector database
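For the API key item above, a standalone check with the official openai package can quickly rule the key in or out before digging into the app itself:

```python
from openai import OpenAI

# Raises an authentication error if OPENAI_API_KEY is missing or invalid.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
models = client.models.list()
print(f"API key OK ({len(models.data)} models visible)")
```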
- v0.1.0: Initial release with basic RAG functionality
- Future versions will include:
- Support for more document formats
- Advanced medical term recognition
- Multi-language support
- Enhanced UI features