DataSage — LLM Data Quality Profiler

Understand messy datasets fast, with privacy-first analysis, clear visualisations, and optional AI-generated summaries.


Why DataSage?

  • Privacy-first: your data stays on your machine by default (local model).
  • Actionable insights: concise, explainable summaries of quality issues.
  • Two ways to use: a friendly web UI and a simple CLI.
  • Batteries included: distributions, correlations, missingness patterns, and outlier detection with professional-looking charts.
  • Optional AI: plug in an API key to upgrade summaries; otherwise DataSage still produces useful, deterministic reports.

What it does

Given a CSV (or a pandas DataFrame), DataSage:

  1. Profiles the dataset (shape, memory, types, uniques, missingness, outliers, correlations).
  2. Builds a short, plain-English report (local LLM by default; optionally OpenAI).
  3. Lets you explore interactively (histograms, heatmaps, missingness, outliers) or script it from the CLI.
  4. Exports a clean Markdown report you can paste into a notebook, issue, or slide.
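
As a rough illustration of step 1, the sketch below gathers the same kinds of statistics with plain pandas. It is not DataSage's actual profiler: the function name `profile_dataframe`, the keys of the returned dictionary, and the toy data are all assumptions made for this example.

```python
import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> dict:
    """Collect basic dataset-level statistics (hypothetical helper, not the DataSage API)."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,                                            # (rows, columns)
        "memory_mb": df.memory_usage(deep=True).sum() / 1024**2,      # in-memory size
        "dtypes": df.dtypes.astype(str).to_dict(),                    # column -> dtype name
        "unique_counts": df.nunique().to_dict(),                      # NaN is not counted
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),   # % missing per column
        "correlations": numeric.corr().round(2).to_dict(),            # Pearson, numeric only
    }


# Toy data invented for the example.
df = pd.DataFrame(
    {
        "total_bill": [10.0, 20.0, 30.0, 40.0],
        "tip": [1.0, 2.0, 3.0, None],
        "day": ["Sat", "Sun", "Sun", "Sat"],
    }
)
stats = profile_dataframe(df)
print(stats["shape"], stats["missing_pct"])
```

A dictionary like this is small enough to be turned into a compact prompt or rendered directly, which is presumably why profiling comes before report generation.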

Quick start

1) Install

git clone https://github.com/Marini97/datasage.git
cd datasage
python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt

2) Run the web app (recommended)

python -m datasage.cli web
# Then open http://localhost:8501

3) Or run from the CLI

# Basic profiling (local, no API keys needed)
python -m datasage.cli profile path/to/data.csv

# Save a Markdown report
python -m datasage.cli profile path/to/data.csv -o report.md

Optional: enhanced AI summaries

If you want higher-quality narrative summaries (on top of the local model / statistical fallback), export your key:

export OPENAI_API_KEY="sk-..."
# (Optional) custom base: export OPENAI_API_BASE="https://api.openai.com/v1"

Then re-run the same commands; DataSage will automatically use the remote model when available.

Prefer local only? Do nothing — DataSage runs with a local model or a purely statistical fallback.


Configuration

Environment variables (all optional):

# Local model name (Hugging Face)
export DATASAGE_MODEL="google/flan-t5-base"

# Generation controls (keep low for determinism)
export DATASAGE_MAX_TOKENS=300
export DATASAGE_TEMPERATURE=0.0
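
For readers scripting around these settings, here is a minimal sketch of how such variables could be read in Python. The `GenerationConfig` class and `load_config` function are hypothetical names, and the fallback values simply mirror the examples above; how DataSage itself parses them is not shown in this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GenerationConfig:
    """Settings taken from the DATASAGE_* variables (hypothetical container)."""
    model: str
    max_tokens: int
    temperature: float


def load_config(env: Optional[Mapping[str, str]] = None) -> GenerationConfig:
    """Read the variables, falling back to the example values when unset (assumed defaults)."""
    env = os.environ if env is None else env
    return GenerationConfig(
        model=env.get("DATASAGE_MODEL", "google/flan-t5-base"),
        max_tokens=int(env.get("DATASAGE_MAX_TOKENS", "300")),
        temperature=float(env.get("DATASAGE_TEMPERATURE", "0.0")),
    )


# Passing a mapping keeps the example independent of the real environment.
config = load_config({"DATASAGE_MAX_TOKENS": "128"})
print(config)
```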

CLI flags (selected):

python -m datasage.cli profile data.csv -o report.md --summary-only --debug

Features at a glance

  • Dataset overview: rows/columns, memory footprint, per-type breakdown.
  • Per-column stats: type, non-null %, uniques, sample values.
  • Numerical: min/max/mean/std, quartiles & IQR, zero/negative counts, outliers.
  • Categorical: top-k values with frequencies, rare categories.
  • Datetime: min/max, coverage period.
  • Text: average length, empty-string rate.
  • Quality flags: high missingness, likely IDs, skew, duplicates, many rare categories.
  • Visuals: histograms/KDE, correlation heatmaps, missingness patterns, boxplots.
  • Reports: Markdown with Overview, Quality Summary, Column Profiles.
  • Chat (web app): ask questions about your data; responses are grounded in the profile.
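
The numerical bullet above mentions quartiles, IQR, and outliers. A common way to connect them is Tukey's 1.5 × IQR rule, sketched below with pandas. Whether DataSage uses this exact rule or multiplier is an assumption; `iqr_outliers` is a hypothetical helper and the sample values are invented.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


# Invented sample: five ordinary tips and one suspicious entry.
tips = pd.Series([2.0, 2.5, 3.0, 3.5, 4.0, 25.0], name="tip")
outliers = iqr_outliers(tips)
print(outliers.tolist())  # [25.0]
```

Raising `k` makes the rule more tolerant of long tails, which can be useful for heavily skewed columns.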

Example

# Data Quality Report

## Dataset Overview
- 244 rows × 7 columns (≈0.13 MB)
- Mixed numeric and categorical types
- 6 outliers across 2 columns
- High correlation (0.89) between total_bill and tip

## Key Insights
- Tipping patterns scale with bill amount
- No missing values detected
- 3 potential data entry errors in “tip”

(Numbers above are illustrative; your dataset will differ.)
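
To show how a section like the overview above could be assembled from computed statistics, here is a small sketch using only string formatting. The function `render_overview` and the dictionary keys are hypothetical, and the numbers are the same illustrative ones as in the example report.

```python
def render_overview(stats: dict) -> str:
    """Render a Markdown overview section from a stats dictionary (hypothetical helper)."""
    lines = [
        "# Data Quality Report",
        "",
        "## Dataset Overview",
        f"- {stats['rows']} rows × {stats['columns']} columns (≈{stats['memory_mb']:.2f} MB)",
        f"- {stats['outliers']} outliers across {stats['outlier_columns']} columns",
    ]
    return "\n".join(lines)


# Illustrative values only, copied from the example report.
report = render_overview(
    {"rows": 244, "columns": 7, "memory_mb": 0.13, "outliers": 6, "outlier_columns": 2}
)
print(report)
```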


Project structure

.
├─ datasage/                # app + CLI + profiling + prompts + reporting
├─ examples/                # sample CSVs / demo assets
├─ tests/                   # unit tests
├─ .github/workflows/       # CI configuration
├─ README.md
├─ requirements.txt
├─ pyproject.toml
└─ LICENSE                  # MIT

Core modules (in datasage/):

  • cli.py — entry point for CLI & web app
  • profiler.py — dataframe → stats & quality signals
  • prompt_builder.py — compact prompts from stats
  • model.py — local LLM loader + generation helpers
  • enhanced_generator.py — optional OpenAI + fallbacks
  • report_formatter.py — stitch chunks into Markdown
  • web_app.py — Streamlit interface

Architecture

CSV → pandas → profiler ─┐
                         ├─→ prompt_builder → { local LLM | OpenAI | statistical fallback } → report_formatter → Markdown
Profiles & visuals ──────┘
                                 │
                                 └─────────────────────────────→ Streamlit UI (explore + chat)

  • Deterministic by default: temperature 0.0, bounded tokens.
  • Graceful degradation: if remote LLMs are absent/unavailable, DataSage still produces a clear, useful report.
  • No data leaves the machine unless you opt into a remote API.
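
The graceful-degradation idea can be sketched as a chain of generators tried in order, ending in a purely statistical summary that cannot fail for lack of a model. This is only an illustration of the pattern: `summarise_with_fallback`, `statistical_summary`, and the simulated failing backend are hypothetical and do not reflect the actual code in enhanced_generator.py.

```python
from typing import Callable, Dict, List, Tuple

Generator = Callable[[Dict[str, float]], str]


def statistical_summary(stats: Dict[str, float]) -> str:
    """Deterministic last resort: format the numbers without any model."""
    return (
        f"{stats['rows']} rows, {stats['columns']} columns, "
        f"{stats['missing_pct']:.1f}% missing."
    )


def summarise_with_fallback(
    stats: Dict[str, float], backends: List[Tuple[str, Generator]]
) -> Tuple[str, str]:
    """Try each backend in order; fall back to the statistical summary if all fail."""
    for name, generate in backends:
        try:
            return name, generate(stats)
        except Exception:
            continue  # backend absent or unavailable, try the next one
    return "statistical", statistical_summary(stats)


def unavailable_remote(stats: Dict[str, float]) -> str:
    """Stand-in for a remote model that is not configured (simulated failure)."""
    raise ConnectionError("remote API not configured")


stats = {"rows": 244, "columns": 7, "missing_pct": 0.0}
used, text = summarise_with_fallback(stats, [("openai", unavailable_remote)])
print(used, text)
```

Because the final fallback only formats numbers already computed locally, a report is still produced when every model backend is missing.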

Development

# Dev install
pip install -r requirements.txt
pip install -e .

# Tests
pytest tests/

# Lint/format (if configured in pyproject)
ruff check datasage/
black datasage/

Licence

MIT — see LICENSE.


Built with 🧙‍♂️ magic and ☕ coffee. Your data stays private, your insights go far!
