A self-contained, single-file Python agent for exploring and analyzing HDF5 files using natural language queries. This agent leverages either local Language Models via Ollama or Google's Gemini API to provide an intuitive, conversational interface for interacting with HDF5 data.
The core idea behind SFA-HDF5 is the "smart folder": place the single-file agent (`sfa_hdf5_ollama.py` or `sfa_hdf5_gemini.py`) in any directory with your HDF5 data to create a self-contained, portable data exploration unit. This fundamentally inverts the traditional model:
- Traditional approach: Bring your data to the AI (uploading files to ChatGPT or Gemini)
- SFA approach: Bring the AI to your data, securely, privately, and portably
The agent's design enables users to explore complex data locally, keeping sensitive information secure while maintaining full control over which AI service processes queries.
SFA-HDF5 implements an agentic approach rather than a simple LLM workflow:
- LLM Workflow: Sequential calls to an LLM with predefined steps and limited adaptability
- Agentic Approach: The LLM has agency to choose tools and drive progress through reasoning
By using a tool-based architecture, SFA-HDF5 enables:
- More complex exploration paths determined by the agent
- Better error handling with specific recovery strategies
- Extensibility through the addition of new tools without rewriting core logic
- The ability to chain multiple SFA agents together for cross-domain analysis pipelines
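To make the tool-based loop concrete, here is a minimal sketch of how such an agent can alternate between letting the LLM pick a tool and executing it. The function names, message format, and `llm.chat` interface are placeholders for illustration, not the agent's actual API:

```python
# Minimal sketch of an agentic tool loop (illustrative names, not the agent's real API).
import json

def run_agent(llm, tools, query, max_steps=10):
    """Let the LLM pick tools until it produces a final answer."""
    messages = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        reply = llm.chat(messages, tools=list(tools))  # model decides: call a tool or answer
        if reply.get("tool_call"):
            name = reply["tool_call"]["name"]
            args = reply["tool_call"]["arguments"]
            result = tools[name](**args)               # execute the chosen tool
            messages.append({"role": "tool", "name": name,
                             "content": json.dumps(result)})
        else:
            return reply["content"]                    # model chose to answer directly
    return "Stopped after reaching the step limit."
```

The loop is what distinguishes the agentic approach: the exploration path is decided at run time by the model's tool choices rather than by a fixed sequence of calls.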
- Self-contained: Entire agent resides in a single Python file.
- Multiple LLM Options: `sfa_hdf5_ollama.py` for local LLM execution via Ollama, or `sfa_hdf5_gemini.py` for cloud-based execution via Google's Gemini API.
- HDF5 File Exploration: Navigate files, groups, datasets, and their metadata.
- Metadata Retrieval: Access group and file attributes, plus dataset details (shape, dtype, attributes).
- Dataset Summarization: Generate statistical summaries for numerical data or value counts for strings.
- Dataset Analysis: Comprehensive analysis with adaptive sampling for large datasets (a sketch of the sampling idea follows this list).
- Performance Optimization: Caches metadata and tool results with a configurable LRU cache.
- Dependency Management: Employs `uv` for automatic dependency installation via the `# /// script` block.
- Interactive and CLI Modes: Supports command-line queries and interactive chat.
- Multi-Step Queries: Handles complex, multi-step explorations.
- Robust Error Handling: Gracefully manages invalid files, paths, and query errors with descriptive feedback.
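The summarization and adaptive-sampling features can be pictured with a short sketch using h5py and NumPy. The sampling threshold, strided-read strategy, and function name below are illustrative, not the agent's actual implementation:

```python
# Illustrative sketch (not the agent's actual code): summarize a dataset,
# sampling it first when it is too large to read comfortably.
import h5py
import numpy as np

def summarize(file_path, dataset_path, max_elements=1_000_000):
    with h5py.File(file_path, "r") as f:
        ds = f[dataset_path]
        if ds.size > max_elements:
            stride = -(-ds.size // max_elements)   # ceiling division
            data = np.asarray(ds[::stride])        # strided read along the first axis
        else:
            data = ds[...]
        if np.issubdtype(data.dtype, np.number):
            return {"shape": ds.shape, "dtype": str(ds.dtype),
                    "min": float(data.min()), "max": float(data.max()),
                    "mean": float(data.mean()), "std": float(data.std())}
        # Non-numeric data: fall back to value counts.
        values, counts = np.unique(data.astype(str).ravel(), return_counts=True)
        return dict(zip(values.tolist(), counts.tolist()))
```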
The agent maintains awareness of your exploration context, prioritizing recently mentioned files and tracking your intent. This allows for more natural conversations like:
HDF5> What groups are in test_data.h5?
# Agent shows groups
HDF5> What datasets are in the /timeseries group?
# Agent understands you're still working with test_data.h5
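One simple way to picture this context tracking (purely illustrative; the agent's internal state may differ) is a small piece of session state that remembers the last file the user named and reuses it when a follow-up query omits the filename:

```python
# Hypothetical sketch of follow-up context: remember the last file the user named.
import re

class SessionContext:
    def __init__(self):
        self.current_file = None

    def resolve_file(self, query):
        match = re.search(r"[\w./-]+\.(?:h5|hdf5)", query)
        if match:
            self.current_file = match.group(0)   # user named a file explicitly
        return self.current_file                 # otherwise reuse the previous one

ctx = SessionContext()
ctx.resolve_file("What groups are in test_data.h5?")    # -> "test_data.h5"
ctx.resolve_file("What datasets are in /timeseries?")   # -> "test_data.h5" (carried over)
```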
SFA-HDF5 uses a tool-based architecture with these core functions:
- `list_files`: Lists all HDF5 files in the directory
- `list_groups`: Lists all groups within an HDF5 file
- `list_datasets`: Lists datasets within a specified group
- `get_group_attribute`: Retrieves a specific attribute from a group
- `get_file_attribute`: Retrieves a specific attribute from the file
- `get_dataset_info`: Retrieves dataset metadata (shape, dtype, attributes)
- `summarize_dataset`: Provides statistical summaries of dataset contents
- `analyze_dataset`: Performs in-depth analysis of dataset contents with sampling for large datasets
- `list_all_datasets`: Lists all datasets in a file, grouped by parent group
- `get_file_metadata`: Retrieves file metadata (size, creation/modification times)
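As an example of what one of these tools might look like internally, here is a minimal sketch of a dataset-listing function built on h5py. The signature and return format are illustrative; the agent's real tool definitions may differ:

```python
# Minimal sketch of a tool function; actual signatures in the agent may differ.
from pathlib import Path
import h5py

def list_datasets(directory, file_name, group_path="/"):
    """Return the names of datasets directly under a group in an HDF5 file."""
    file_path = Path(directory) / file_name
    if not file_path.exists():
        return {"error": f"File not found: {file_name}"}
    try:
        with h5py.File(file_path, "r") as f:
            group = f[group_path]
            return {"datasets": [name for name, obj in group.items()
                                 if isinstance(obj, h5py.Dataset)]}
    except KeyError:
        return {"error": f"Group not found: {group_path}"}
```

Returning structured error messages instead of raising lets the LLM read the failure and choose a recovery step, which is what the error-handling feature above relies on.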
SFA-HDF5 v0.5.0 is intentionally designed as a read-only agent, specifically optimized for the data exploration phase where:
- The working directory remains static during exploration
- Files are not expected to be modified during agent operation
- The focus is purely on exploring and understanding existing data structures
This design choice offers several advantages:
- Maximized Caching: Results can be aggressively cached without worrying about data staleness
- Simplified Architecture: No need for file monitoring or modification detection
- Reliable Exploration: Consistent results throughout an exploration session
- Performance Optimization: Reduced I/O operations through safe result reuse
Note: If you need to explore new or modified files, simply restart the agent to refresh its cache and begin a new exploration session.
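Because the working directory is treated as static, results can be memoized safely for the whole session. A minimal sketch of this kind of caching (illustrative; the agent's configurable LRU cache may be implemented differently) uses `functools.lru_cache`:

```python
# Illustrative caching sketch: safe only because files are assumed not to change
# during a session (restart the agent to refresh).
from functools import lru_cache
import h5py

@lru_cache(maxsize=128)
def get_dataset_info(file_path, dataset_path):
    """Read dataset metadata once; repeated queries hit the cache, not the disk."""
    with h5py.File(file_path, "r") as f:
        ds = f[dataset_path]
        return {"shape": ds.shape, "dtype": str(ds.dtype),
                "attributes": dict(ds.attrs)}
```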
- Python 3.8+
- uv (fast Python package and project manager)
- For Ollama implementation: Ollama
- For Gemini implementation: A Google API key with access to Gemini models
- Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
- For Ollama implementation:
  - Install and start Ollama following the instructions on the Ollama website
  - Pull the default model: `ollama pull phi4:latest`
- For Gemini implementation:
  - Create a `.env` file in the same directory with your Google API key:
    `GOOGLE_API_KEY=your_api_key_here`
- Place the appropriate SFA file in your data directory:
  - Copy either `sfa_hdf5_ollama.py` or `sfa_hdf5_gemini.py` into the folder containing your HDF5 files.
Run the script using `uv run`, which automatically installs dependencies declared in the `# /// script` block.
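For reference, the `# /// script` header at the top of the file follows the standard inline script metadata format that `uv` reads. The dependency list shown here is illustrative rather than a verbatim copy of the agent's header:

```python
# /// script
# requires-python = ">=3.8"
# dependencies = [
#   "h5py",
#   "numpy",
# ]
# ///
```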
# Using Ollama (local LLM)
uv run sfa_hdf5_ollama.py <directory_path> "<your_query>"
# Using Gemini (cloud API)
uv run sfa_hdf5_gemini.py <directory_path> "<your_query>"
Parameters:
- `<directory_path>`: Path to the directory with HDF5 files and the script
- `<your_query>`: Natural language query (optional)
Examples:
uv run sfa_hdf5_ollama.py data "List all HDF5 files in this directory"
uv run sfa_hdf5_gemini.py data "What groups are in test_data.h5?"
# Select model family and size
uv run sfa_hdf5_ollama.py -m mistral data "What datasets exist in test_data.h5?"
uv run sfa_hdf5_ollama.py -f granite -s small data "List all HDF5 files"
uv run sfa_hdf5_ollama.py -f phi4 -s large data "Show me groups in sample.h5"
Available model configurations:
| Family  | Size  | Model         | Description             |
|---------|-------|---------------|-------------------------|
| granite | small | granite3.2:2b | Faster, less memory     |
| granite | large | granite3.2:8b | More capable (default)  |
| phi4    | small | phi4-mini     | Faster, less memory     |
| phi4    | large | phi4          | More capable            |
Note: Choose smaller models for faster responses or when running on systems with limited resources.
# Options: flash (default), think, pro
uv run sfa_hdf5_gemini.py -m pro data "Summarize dataset 'timeseries/temperature' in test_data.h5"
Run without a query to enter interactive mode:
uv run sfa_hdf5_ollama.py data
# OR
uv run sfa_hdf5_gemini.py data
Type your query at the `HDF5>` prompt. Use `exit` to quit, `history` to view past queries, or a number to rerun a previous query.
Example Interactive Session:
uv run sfa_hdf5_ollama.py data
Initializing Ollama...
✓ Ollama ready :)
Interactive mode: 'exit' to quit, 'history' for past queries, or number to rerun
HDF5> how many groups are there in test_data.h5 file?
Processing query...
─ You asked: how many groups are there in test_data.h5 file? ─
In the test_data.h5 file located at /home/akougkas/NFS/dev/single-file-agents/hdf5-agent/data, there are a total of 9 groups.
If you need any more details about these groups or anything else regarding this HDF5 file, feel free to ask!
HDF5> exit
The test suite for SFA-HDF5 is currently in development. In v0.5.0, we've established the testing framework using pytest with comprehensive coverage of utility functions and tool functions.
To run the tests (coming in the next release):
cd hdf5-agent
pytest -xvs tests/
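As an illustration of the style of test this suite targets (a hypothetical example, not taken from the repository), a dataset-listing helper can be exercised against a temporary HDF5 file built with pytest's `tmp_path` fixture:

```python
# Hypothetical test sketch (not from the repository): create a temporary HDF5
# file and check that a dataset-listing helper finds what was written.
import h5py

def list_dataset_names(file_path, group_path="/"):
    """Stand-in for the agent's dataset-listing tool."""
    with h5py.File(file_path, "r") as f:
        return [name for name, obj in f[group_path].items()
                if isinstance(obj, h5py.Dataset)]

def test_list_dataset_names(tmp_path):
    file_path = tmp_path / "sample.h5"
    with h5py.File(file_path, "w") as f:
        f.create_group("timeseries").create_dataset("temperature", data=[1.0, 2.0, 3.0])
    assert list_dataset_names(file_path, "/timeseries") == ["temperature"]
```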
Contributions are welcome! Please submit pull requests or issues to the GitHub repository: https://github.com/akougkas/hdf5-agent.
This project aims to bridge the gap between complex scientific data and intuitive exploration. There are multiple ways to get involved:
- Share Use Cases: How have you used SFA-HDF5 in your research? What datasets have you explored?
- Extension Ideas: What additional analysis capabilities would help your workflow?
- Learning Projects: SFA-HDF5 is an excellent way to learn about both agent architectures and scientific data formats
- Course Integration: Professors can use this tool to teach data exploration concepts without requiring complex coding
- Contribute Examples: Create example datasets that demonstrate interesting HDF5 structures for others to learn from
- Tool Contributions: Add new specialized tools for domain-specific analyses
- Performance Improvements: Help optimize large dataset handling
- Integration Components: Build connectors to other data visualization or analysis systems
Share your experiences, questions, and contributions through:
- GitHub Issues: https://github.com/akougkas/hdf5-agent/issues
- Direct email: a.kougkas@gmail.com
- Academic collaboration inquiries welcome!
This project is licensed under the MIT License.
Developed by Anthony Kougkas | akougkas.io