
SFA-HDF5: Single-File HDF5 Agent

License: MIT | Version: 0.5.0

A self-contained, single-file Python agent for exploring and analyzing HDF5 files using natural language queries. The agent leverages either local language models via Ollama or Google's Gemini API to provide an intuitive, conversational interface for interacting with HDF5 data.

Concept: The "Smart Folder" - Bringing AI to Your Data

The core idea behind SFA-HDF5 is the "smart folder": place the single-file agent (sfa_hdf5_ollama.py or sfa_hdf5_gemini.py) in any directory with your HDF5 data to create a self-contained, portable data exploration unit. This fundamentally inverts the traditional model:

Traditional approach: Bring your data to AI (uploading files to ChatGPT or Gemini)
SFA approach: Bring AI to your data - securely, privately, and portably

The agent's design enables users to explore complex data locally, keeping sensitive information secure while maintaining full control over which AI service processes queries.

LLM Workflows vs. Agentic Approach

SFA-HDF5 implements an agentic approach rather than a simple LLM workflow:

LLM Workflow: Sequential calls to an LLM with predefined steps and limited adaptability
Agentic Approach: LLM has agency to choose tools and drive progress through reasoning

By using a tool-based architecture (a minimal loop is sketched after this list), SFA-HDF5 enables:

  • More complex exploration paths determined by the agent
  • Better error handling with specific recovery strategies
  • Extensibility through the addition of new tools without rewriting core logic
  • The ability to chain multiple SFA agents together for cross-domain analysis pipelines
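
To make the agentic loop concrete, here is a minimal sketch in Python (names like llm.chat and run_agent are illustrative, not the agent's actual API): the model picks the next tool call, Python executes it, and the result is fed back until the model produces a final answer.

# Sketch of an agentic tool loop (illustrative, not the agent's real code).
def run_agent(llm, query, tools, max_steps=10):
    messages = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        reply = llm.chat(messages, tools=list(tools))  # model picks a tool or answers
        if reply.tool_call is None:
            return reply.content                       # final answer reached
        fn = tools[reply.tool_call.name]
        try:
            result = fn(**reply.tool_call.arguments)
        except Exception as exc:                       # feed errors back for recovery
            result = f"Tool error: {exc}"
        messages.append({"role": "tool", "content": str(result)})
    return "Step limit reached without a final answer."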

Features

  • Self-contained: Entire agent resides in a single Python file.
  • Multiple LLM options:
    • sfa_hdf5_ollama.py for local LLM execution via Ollama
    • sfa_hdf5_gemini.py for cloud-based execution via Google's Gemini API
  • HDF5 File Exploration: Navigate files, groups, datasets, and their metadata.
  • Metadata Retrieval: Access group and file attributes, plus dataset details (shape, dtype, attributes).
  • Dataset Summarization: Generate statistical summaries for numerical data or value counts for strings.
  • Dataset Analysis: Comprehensive analysis with adaptive sampling for large datasets.
  • Performance Optimization: Caches metadata and tool results with a configurable LRU cache.
  • Dependency Management: Employs uv for automatic dependency installation via the # /// script block.
  • Interactive and CLI Modes: Supports command-line queries and interactive chat.
  • Multi-Step Queries: Handles complex, multi-step explorations.
  • Robust Error Handling: Gracefully manages invalid files, paths, and query errors with descriptive feedback.

Advanced Feature: Context-Aware Exploration

The agent maintains awareness of your exploration context, prioritizing recently mentioned files and tracking your intent. This allows for more natural conversations like:

HDF5> What groups are in test_data.h5?
# Agent shows groups

HDF5> What datasets are in the /timeseries group?
# Agent understands you're still working with test_data.h5
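
One simple way to implement this behavior (a sketch; the agent's internals may differ) is to remember the last file a query mentioned and fall back to it when a follow-up names only a group or dataset:

# Sketch of per-session context carry-over (hypothetical helper).
import re
from typing import Optional

class ExplorationContext:
    def __init__(self):
        self.current_file: Optional[str] = None

    def resolve_file(self, query: str) -> Optional[str]:
        match = re.search(r"[\w./-]+\.(?:h5|hdf5)\b", query)
        if match:
            self.current_file = match.group(0)  # query names a file explicitly
        return self.current_file                # otherwise reuse the last one

ctx = ExplorationContext()
ctx.resolve_file("What groups are in test_data.h5?")   # -> 'test_data.h5'
ctx.resolve_file("What datasets are in /timeseries?")  # -> 'test_data.h5' (remembered)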

Available Agent Tools

SFA-HDF5 uses a tool-based architecture with these core functions (one is sketched after the list):

  • list_files: Lists all HDF5 files in the directory
  • list_groups: Lists all groups within an HDF5 file
  • list_datasets: Lists datasets within a specified group
  • get_group_attribute: Retrieves a specific attribute from a group
  • get_file_attribute: Retrieves a specific attribute from the file
  • get_dataset_info: Retrieves dataset metadata (shape, dtype, attributes)
  • summarize_dataset: Provides statistical summaries of dataset contents
  • analyze_dataset: Performs in-depth analysis of dataset contents with sampling for large datasets
  • list_all_datasets: Lists all datasets in a file, grouped by parent groups
  • get_file_metadata: Retrieves file metadata (size, creation/modification times)
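
As an illustration, here is what one such tool might look like, assuming the standard h5py library (the agent's actual signatures may differ):

# Sketch of a list_groups-style tool using h5py.
import h5py

def list_groups(file_path):
    groups = []

    def visit(name, obj):
        if isinstance(obj, h5py.Group):
            groups.append("/" + name)  # h5py reports paths without a leading slash

    with h5py.File(file_path, "r") as f:  # read-only, matching the agent's design
        f.visititems(visit)
    return groups

print(list_groups("data/test_data.h5"))  # e.g. ['/timeseries', ...]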

Read-Only Design Philosophy

SFA-HDF5 v0.5.0 is intentionally designed as a read-only agent, specifically optimized for the data exploration phase where:

  • The working directory remains static during exploration
  • Files are not expected to be modified during agent operation
  • The focus is purely on exploring and understanding existing data structures

This design choice offers several advantages:

  • Maximized Caching: Results can be aggressively cached without worrying about data staleness
  • Simplified Architecture: No need for file monitoring or modification detection
  • Reliable Exploration: Consistent results throughout an exploration session
  • Performance Optimization: Reduced I/O operations through safe result reuse

Note: If you need to explore new or modified files, simply restart the agent to refresh its cache and begin a new exploration session.
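
Because files are treated as immutable for the session, even Python's built-in LRU cache is safe to apply aggressively. A sketch (the agent's configurable cache may be more elaborate):

# Sketch of read-only-safe caching; restart the agent to refresh.
from functools import lru_cache
import h5py

@lru_cache(maxsize=128)  # bounded LRU, in the spirit of the agent's cache
def get_dataset_info(file_path, dataset_path):
    with h5py.File(file_path, "r") as f:
        ds = f[dataset_path]
        return (ds.shape, str(ds.dtype))  # hashable result, safe to memoize

# The first call reads the file; identical follow-ups are served from memory.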

Requirements

  • Python 3.8+
  • uv (the fast Python package manager from Astral)
  • For Ollama implementation: Ollama
  • For Gemini implementation: A Google API key with access to Gemini models

Installation

  1. Install uv:
     curl -LsSf https://astral.sh/uv/install.sh | sh
  2. For the Ollama implementation:
     • Install and start Ollama following the instructions at https://ollama.com
     • Pull the default model:
       ollama pull phi4:latest
  3. For the Gemini implementation:
     • Create a .env file in the same directory with your Google API key:
       GOOGLE_API_KEY=your_api_key_here
  4. Place the appropriate SFA file in your data directory:
     • Copy either sfa_hdf5_ollama.py or sfa_hdf5_gemini.py into the folder containing your HDF5 files.

Usage

Run the script using uv run, which automatically installs dependencies from the # /// script block.
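
For reference, the # /// script block is uv's inline script metadata (PEP 723) at the top of the agent file; it looks roughly like this (the exact dependency list shown here is illustrative):

# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "h5py",
#     "numpy",
# ]
# ///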

Command-line Mode

# Using Ollama (local LLM)
uv run sfa_hdf5_ollama.py <directory_path> "<your_query>"

# Using Gemini (cloud API)
uv run sfa_hdf5_gemini.py <directory_path> "<your_query>"

Parameters:

  • <directory_path>: Path to the directory with HDF5 files and the script
  • <your_query>: Natural language query (optional)

Examples:

uv run sfa_hdf5_ollama.py data "List all HDF5 files in this directory"
uv run sfa_hdf5_gemini.py data "What groups are in test_data.h5?"

Model Configuration

Ollama Version

# Select model family and size
uv run sfa_hdf5_ollama.py -m mistral data "What datasets exist in test_data.h5?"
uv run sfa_hdf5_ollama.py -f granite -s small data "List all HDF5 files"
uv run sfa_hdf5_ollama.py -f phi4 -s large data "Show me groups in sample.h5"

Available model configurations:

Family   Size   Model          Description
granite  small  granite3.2:2b  Faster, less memory
granite  large  granite3.2:8b  More capable (default)
phi4     small  phi4-mini      Faster, less memory
phi4     large  phi4           More capable

Note: Choose smaller models for faster responses or when running on systems with limited resources.
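
The -f/-s flags presumably resolve to Ollama model tags through a simple lookup mirroring the table above; a sketch (the script's actual flag handling may differ):

# Sketch of the family/size-to-model mapping implied by the table.
MODEL_MAP = {
    ("granite", "small"): "granite3.2:2b",
    ("granite", "large"): "granite3.2:8b",  # default
    ("phi4", "small"): "phi4-mini",
    ("phi4", "large"): "phi4",
}

def resolve_model(family="granite", size="large"):
    return MODEL_MAP[(family, size)]

print(resolve_model("phi4", "large"))  # -> 'phi4'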

Gemini Version

# Options: flash (default), think, pro
uv run sfa_hdf5_gemini.py -m pro data "Summarize dataset 'timeseries/temperature' in test_data.h5"

Interactive Mode

Run without a query to enter interactive mode:

uv run sfa_hdf5_ollama.py data
# OR
uv run sfa_hdf5_gemini.py data

Type your query at the HDF5> prompt. Use exit to quit, history to view past queries, or a number to rerun a previous query.

Example Interactive Session:

uv run sfa_hdf5_ollama.py data
Initializing Ollama...
✓ Ollama ready :)
Interactive mode: 'exit' to quit, 'history' for past queries, or number to rerun
HDF5> how many groups are there in test_data.h5 file?         
Processing query...

─ You asked: how many groups are there in test_data.h5 file? ─

In the test_data.h5 file located at /home/akougkas/NFS/dev/single-file-agents/hdf5-agent/data, there are a total of 9 groups.

If you need any more details about these groups or anything else regarding this HDF5 file, feel free to ask!

HDF5> exit
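
Behind that session sits a simple prompt loop; a sketch of the exit/history/rerun handling (illustrative, not the script's exact code):

# Sketch of the interactive loop: 'exit' quits, 'history' lists past
# queries, and a bare number reruns the corresponding query.
def interactive_loop(process_query):
    history = []
    while True:
        line = input("HDF5> ").strip()
        if line == "exit":
            break
        if line == "history":
            for i, q in enumerate(history, 1):
                print(f"{i}. {q}")
            continue
        if line.isdigit() and 1 <= int(line) <= len(history):
            line = history[int(line) - 1]  # rerun a previous query
        history.append(line)
        print(process_query(line))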

Testing HDF5 Agent

The test suite for SFA-HDF5 is in development. v0.5.0 establishes a pytest-based testing framework covering the utility and tool functions.

To run the tests (shipping with the next release):

cd hdf5-agent
pytest -xvs tests/
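
A tool-function test under that framework might look like this sketch, which builds a throwaway HDF5 file with pytest's tmp_path fixture (the test name and structure are illustrative):

# Sketch of a pytest tool test using a temporary HDF5 file.
import h5py

def test_group_is_discoverable(tmp_path):
    path = tmp_path / "sample.h5"
    with h5py.File(path, "w") as f:
        f.create_group("timeseries")  # build a minimal fixture file

    with h5py.File(path, "r") as f:   # exercise the read-only path the tools use
        assert "timeseries" in f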

Contributing

Contributions are welcome! Please submit pull requests or issues to the GitHub repository: https://github.com/akougkas/hdf5-agent.

Join the Community

This project aims to bridge the gap between complex scientific data and intuitive exploration. There are multiple ways to get involved:

For Researchers & Data Scientists

  • Share Use Cases: How have you used SFA-HDF5 in your research? What datasets have you explored?
  • Extension Ideas: What additional analysis capabilities would help your workflow?

For Students

  • Learning Projects: SFA-HDF5 is an excellent way to learn about both agent architectures and scientific data formats
  • Course Integration: Professors can use this tool to teach data exploration concepts without requiring complex coding
  • Contribute Examples: Create example datasets that demonstrate interesting HDF5 structures for others to learn from

For Developers

  • Tool Contributions: Add new specialized tools for domain-specific analyses
  • Performance Improvements: Help optimize large dataset handling
  • Integration Components: Build connectors to other data visualization or analysis systems

Share your experiences, questions, and contributions through issues and pull requests on the GitHub repository.

License

This project is licensed under the MIT License.

Author

Developed by Anthony Kougkas | akougkas.io
