Understand messy datasets fast, with privacy-first analysis, clear visualisations, and optional AI-generated summaries.
- Privacy-first: your data stays on your machine by default (local model).
- Actionable insights: concise, explainable summaries of quality issues.
- Two ways to use: a friendly web UI and a simple CLI.
- Batteries included: distributions, correlations, missingness patterns, and outlier detection with professional-looking charts.
- Optional AI: plug in an API key to upgrade summaries; otherwise DataSage still produces useful, deterministic reports.
Given a CSV (or a pandas DataFrame), DataSage:
- Profiles the dataset (shape, memory, types, uniques, missingness, outliers, correlations).
- Builds a short, plain-English report (local LLM by default; optionally OpenAI).
- Lets you explore interactively (histograms, heatmaps, missingness, outliers) or script it from the CLI.
- Exports a clean Markdown report you can paste into a notebook, issue, or slide.
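To make the profiling step above concrete, here is a minimal sketch using plain pandas. It is not DataSage's actual profiler: the function name `profile_dataframe` and the keys of the returned dictionary are hypothetical, chosen only for illustration.

```python
import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> dict:
    """Collect a few dataset-level and per-column statistics (illustrative only)."""
    per_column = {}
    for name in df.columns:
        col = df[name]
        per_column[name] = {
            "dtype": str(col.dtype),
            # Share of non-missing values, as a percentage.
            "non_null_pct": round(100.0 * col.notna().mean(), 2),
            # Distinct values, ignoring missing entries.
            "unique": int(col.nunique(dropna=True)),
        }
    return {
        "rows": int(df.shape[0]),
        "columns": int(df.shape[1]),
        # deep=True also counts the memory held by object (string) columns.
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "per_column": per_column,
    }


if __name__ == "__main__":
    demo = pd.DataFrame(
        {"total_bill": [16.99, 10.34, None], "day": ["Sun", "Sun", "Sat"]}
    )
    print(profile_dataframe(demo))
```

A dictionary like this is easy to turn into a Markdown table or to pass on to a prompt builder, which is why the sketch returns plain Python types rather than pandas objects.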
git clone https://github.com/Marini97/datasage.git
cd datasage
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python -m datasage.cli web
# Then open http://localhost:8501
# Basic profiling (local, no API keys needed)
python -m datasage.cli profile path/to/data.csv
# Save a Markdown report
python -m datasage.cli profile path/to/data.csv -o report.md
If you want higher-quality narrative summaries (on top of the local model / statistical fallback), export your key:
export OPENAI_API_KEY="sk-..."
# (Optional) custom base: export OPENAI_API_BASE="https://api.openai.com/v1"
Then re-run the same commands; DataSage will automatically use the remote model when available.
Prefer local only? Do nothing — DataSage runs with a local model or a purely statistical fallback.
Environment variables (all optional):
# Local model name (Hugging Face)
export DATASAGE_MODEL="google/flan-t5-base"
# Generation controls (keep low for determinism)
export DATASAGE_MAX_TOKENS=300
export DATASAGE_TEMPERATURE=0.0
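As an illustration of how these variables might be consumed, the sketch below reads them into a small config object. The variable names come from this README; the `GenerationConfig` class, the `load_config` function, and the use of the example values above as defaults are assumptions, not DataSage's confirmed behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical container for the generation settings."""
    model: str
    max_tokens: int
    temperature: float


def load_config(env: Mapping[str, str] = os.environ) -> GenerationConfig:
    """Read settings from the environment, falling back to assumed defaults."""
    return GenerationConfig(
        # Defaults mirror the example values shown above (an assumption).
        model=env.get("DATASAGE_MODEL", "google/flan-t5-base"),
        max_tokens=int(env.get("DATASAGE_MAX_TOKENS", "300")),
        temperature=float(env.get("DATASAGE_TEMPERATURE", "0.0")),
    )


if __name__ == "__main__":
    print(load_config())
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.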
CLI flags (selected):
python -m datasage.cli profile data.csv -o report.md --summary-only --debug
- Dataset overview: rows/columns, memory footprint, per-type breakdown.
- Per-column stats: type, non-null %, uniques, sample values.
- Numerical: min/max/mean/std, quartiles & IQR, zero/negative counts, outliers.
- Categorical: top-k values with frequencies, rare categories.
- Datetime: min/max, coverage period.
- Text: average length, empty-string rate.
- Quality flags: high missingness, likely IDs, skew, duplicates, many rares.
- Visuals: histograms/KDE, correlation heatmaps, missingness patterns, boxplots.
- Reports: Markdown with Overview, Quality Summary, Column Profiles.
- Chat (web app): ask questions about your data; responses are grounded in the profile.
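A few of the quality flags listed above can be sketched in pandas as follows. This is an illustration under stated assumptions, not DataSage's implementation: the README does not say how outliers are detected, so the common 1.5×IQR rule is assumed here, and the function names, flag names, and the 50% missingness threshold are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def quality_flags(df: pd.DataFrame, missing_threshold: float = 0.5) -> dict:
    """Return {column: [flag, ...]} for columns that raise at least one flag."""
    flags = {}
    for name in df.columns:
        col = df[name]
        col_flags = []
        # High missingness: more than the threshold share of values is null.
        if col.isna().mean() > missing_threshold:
            col_flags.append("high_missingness")
        # Likely ID: no missing values and every value is distinct.
        if len(col) > 1 and col.notna().all() and col.nunique() == len(col):
            col_flags.append("likely_id")
        # Outliers: only meaningful for non-boolean numeric columns.
        if is_numeric_dtype(col) and not is_bool_dtype(col):
            if iqr_outlier_mask(col).any():
                col_flags.append("outliers")
        if col_flags:
            flags[name] = col_flags
    return flags
```

Note that the "likely ID" heuristic will also flag any fully unique numeric measurement column, so a real profiler would probably combine it with type or name checks.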
# Data Quality Report
## Dataset Overview
- 244 rows × 7 columns (≈0.13 MB)
- Mixed numeric and categorical types
- 6 outliers across 2 columns
- High correlation (0.89) between total_bill and tip
## Key Insights
- Tipping patterns scale with bill amount
- No missing values detected
- 3 potential data entry errors in “tip”
(Numbers above are illustrative; your dataset will differ.)
.
├─ datasage/ # app + CLI + profiling + prompts + reporting
├─ examples/ # sample CSVs / demo assets
├─ tests/ # unit tests
├─ .github/workflows/ # CI configuration
├─ README.md
├─ requirements.txt
├─ pyproject.toml
└─ LICENSE # MIT
Core modules (in datasage/):
- cli.py: entry point for CLI & web app
- profiler.py: dataframe → stats & quality signals
- prompt_builder.py: compact prompts from stats
- model.py: local LLM loader + generation helpers
- enhanced_generator.py: optional OpenAI + fallbacks
- report_formatter.py: stitch chunks into Markdown
- web_app.py: Streamlit interface
CSV → pandas → profiler ─┐
                         ├─→ prompt_builder → { local LLM | OpenAI | statistical fallback } → report_formatter → Markdown
Profiles & visuals ──────┘
│
└─────────────────────────────→ Streamlit UI (explore + chat)
- Deterministic by default: temperature 0.0, bounded tokens.
- Graceful degradation: if remote LLMs are absent/unavailable, DataSage still produces a clear, useful report.
- No data leaves the machine unless you opt into a remote API.
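The graceful-degradation idea can be sketched as an ordered chain of generators that is tried until one succeeds. Everything below is hypothetical (function names, the chain order, the placeholder statistical summary); it only illustrates the pattern and makes no claim about how DataSage wires its backends internally. The one detail taken from this README is that the remote model is opt-in via `OPENAI_API_KEY`.

```python
from typing import Callable, List, Mapping

Generator = Callable[[str], str]


def statistical_summary(prompt: str) -> str:
    """Deterministic last resort that involves no model at all (placeholder)."""
    return f"Statistical summary for: {prompt}"


def build_chain(env: Mapping[str, str], remote: Generator, local: Generator) -> List[Generator]:
    """Order the backends; the remote one is included only when a key is set."""
    chain: List[Generator] = []
    if env.get("OPENAI_API_KEY"):
        chain.append(remote)
    chain.append(local)
    chain.append(statistical_summary)
    return chain


def generate_with_fallback(prompt: str, chain: List[Generator]) -> str:
    """Return the first successful result, skipping backends that raise."""
    for generator in chain:
        try:
            return generator(prompt)
        except Exception:
            # Backend missing or unavailable: fall through to the next one.
            continue
    raise RuntimeError("no generator produced a summary")
```

Because the remote backend is only added when the key is present, a chain built without `OPENAI_API_KEY` never sends data off the machine.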
# Dev install
pip install -r requirements.txt
pip install -e .
# Tests
pytest tests/
# Lint/format (if configured in pyproject)
ruff check datasage/
black datasage/
MIT (see LICENSE).
Built with 🧙‍♂️ magic and ☕ coffee. Your data stays private, your insights go far!