A self-contained, single-file Python agent for exploring and analyzing HDF5 files using natural language queries. This agent leverages either local Language Models via Ollama or Google's Gemini API to provide an intuitive, conversational interface for interacting with HDF5 data.
The core idea behind SFA-HDF5 is the "smart folder": place the single-file agent (`sfa_hdf5_ollama.py` or `sfa_hdf5_gemini.py`) in any directory with your HDF5 data to create a self-contained, portable data exploration unit. This fundamentally inverts the traditional model:
- Traditional approach: Bring your data to the AI (uploading files to ChatGPT or Gemini)
- SFA approach: Bring the AI to your data, securely, privately, and portably
The agent's design enables users to explore complex data locally, keeping sensitive information secure while maintaining full control over which AI service processes queries.
SFA-HDF5 implements an agentic approach rather than a simple LLM workflow:
- LLM Workflow: Sequential calls to an LLM with predefined steps and limited adaptability
- Agentic Approach: The LLM has agency to choose tools and drive progress through reasoning
By using a tool-based architecture, SFA-HDF5 enables:
- More complex exploration paths determined by the agent
- Better error handling with specific recovery strategies
- Extensibility through the addition of new tools without rewriting core logic
- The ability to chain multiple SFA agents together for cross-domain analysis pipelines
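To make the tool-based loop concrete, here is a minimal sketch of how such an agent can alternate between letting the LLM pick a tool and executing it. The function names, message format, and `llm.chat` interface are placeholders for illustration, not the agent's actual API:

```python
# Minimal sketch of an agentic tool loop (illustrative names, not the agent's real API).
import json

def run_agent(llm, tools, query, max_steps=10):
    """Let the LLM pick tools until it produces a final answer."""
    messages = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        reply = llm.chat(messages, tools=list(tools))  # model decides: call a tool or answer
        if reply.get("tool_call"):
            name = reply["tool_call"]["name"]
            args = reply["tool_call"]["arguments"]
            result = tools[name](**args)               # execute the chosen tool
            messages.append({"role": "tool", "name": name,
                             "content": json.dumps(result)})
        else:
            return reply["content"]                    # model chose to answer directly
    return "Stopped after reaching the step limit."
```

The loop is what distinguishes the agentic approach: the exploration path is decided at run time by the model's tool choices rather than by a fixed sequence of calls.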
- Self-contained: Entire agent resides in a single Python file.
- Multiple LLM Options: `sfa_hdf5_ollama.py` for local LLM execution via Ollama, or `sfa_hdf5_gemini.py` for cloud-based execution via Google's Gemini API.
- HDF5 File Exploration: Navigate files, groups, datasets, and their metadata.
- Metadata Retrieval: Access group and file attributes, plus dataset details (shape, dtype, attributes).
- Dataset Summarization: Generate statistical summaries for numerical data or value counts for strings.
- Dataset Analysis: Comprehensive analysis with adaptive sampling for large datasets (a sketch of the sampling idea follows this list).
- Performance Optimization: Caches metadata and tool results with a configurable LRU cache.
- Dependency Management: Employs `uv` for automatic dependency installation via the `# /// script` block.
- Interactive and CLI Modes: Supports command-line queries and interactive chat.
- Multi-Step Queries: Handles complex, multi-step explorations.
- Robust Error Handling: Gracefully manages invalid files, paths, and query errors with descriptive feedback.
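The summarization and adaptive-sampling features can be pictured with a short sketch using h5py and NumPy. The sampling threshold, strided-read strategy, and function name below are illustrative, not the agent's actual implementation:

```python
# Illustrative sketch (not the agent's actual code): summarize a dataset,
# sampling it first when it is too large to read comfortably.
import h5py
import numpy as np

def summarize(file_path, dataset_path, max_elements=1_000_000):
    with h5py.File(file_path, "r") as f:
        ds = f[dataset_path]
        if ds.size > max_elements:
            stride = -(-ds.size // max_elements)   # ceiling division
            data = np.asarray(ds[::stride])        # strided read along the first axis
        else:
            data = ds[...]
        if np.issubdtype(data.dtype, np.number):
            return {"shape": ds.shape, "dtype": str(ds.dtype),
                    "min": float(data.min()), "max": float(data.max()),
                    "mean": float(data.mean()), "std": float(data.std())}
        # Non-numeric data: fall back to value counts.
        values, counts = np.unique(data.astype(str).ravel(), return_counts=True)
        return dict(zip(values.tolist(), counts.tolist()))
```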
The agent maintains awareness of your exploration context, prioritizing recently mentioned files and tracking your intent. This allows for more natural conversations like:
HDF5> What groups are in test_data.h5?
# Agent shows groups
HDF5> What datasets are in the /timeseries group?
# Agent understands you're still working with test_data.h5
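One simple way to picture this context tracking (purely illustrative; the agent's internal state may differ) is a small piece of session state that remembers the last file the user named and reuses it when a follow-up query omits the filename:

```python
# Hypothetical sketch of follow-up context: remember the last file the user named.
import re

class SessionContext:
    def __init__(self):
        self.current_file = None

    def resolve_file(self, query):
        match = re.search(r"[\w./-]+\.(?:h5|hdf5)", query)
        if match:
            self.current_file = match.group(0)   # user named a file explicitly
        return self.current_file                 # otherwise reuse the previous one

ctx = SessionContext()
ctx.resolve_file("What groups are in test_data.h5?")    # -> "test_data.h5"
ctx.resolve_file("What datasets are in /timeseries?")   # -> "test_data.h5" (carried over)
```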
SFA-HDF5 uses a tool-based architecture with these core functions:
- `list_files`: Lists all HDF5 files in the directory
- `list_groups`: Lists all groups within an HDF5 file
- `list_datasets`: Lists datasets within a specified group
- `get_group_attribute`: Retrieves a specific attribute from a group
- `get_file_attribute`: Retrieves a specific attribute from the file
- `get_dataset_info`: Retrieves dataset metadata (shape, dtype, attributes)
- `summarize_dataset`: Provides statistical summaries of dataset contents
- `analyze_dataset`: Performs in-depth analysis of dataset contents with sampling for large datasets
- `list_all_datasets`: Lists all datasets in a file, grouped by parent group
- `get_file_metadata`: Retrieves file metadata (size, creation/modification times)
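As an example of what one of these tools might look like internally, here is a minimal sketch of a dataset-listing function built on h5py. The signature and return format are illustrative; the agent's real tool definitions may differ:

```python
# Minimal sketch of a tool function; actual signatures in the agent may differ.
from pathlib import Path
import h5py

def list_datasets(directory, file_name, group_path="/"):
    """Return the names of datasets directly under a group in an HDF5 file."""
    file_path = Path(directory) / file_name
    if not file_path.exists():
        return {"error": f"File not found: {file_name}"}
    try:
        with h5py.File(file_path, "r") as f:
            group = f[group_path]
            return {"datasets": [name for name, obj in group.items()
                                 if isinstance(obj, h5py.Dataset)]}
    except KeyError:
        return {"error": f"Group not found: {group_path}"}
```

Returning structured error messages instead of raising lets the LLM read the failure and choose a recovery step, which is what the error-handling feature above relies on.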
SFA-HDF5 v0.5.0 is intentionally designed as a read-only agent, specifically optimized for the data exploration phase where:
- The working directory remains static during exploration
- Files are not expected to be modified during agent operation
- The focus is purely on exploring and understanding existing data structures
This design choice offers several advantages:
- Maximized Caching: Results can be aggressively cached without worrying about data staleness
- Simplified Architecture: No need for file monitoring or modification detection
- Reliable Exploration: Consistent results throughout an exploration session
- Performance Optimization: Reduced I/O operations through safe result reuse
Note: If you need to explore new or modified files, simply restart the agent to refresh its cache and begin a new exploration session.
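Because the working directory is treated as static, results can be memoized safely for the whole session. A minimal sketch of this kind of caching (illustrative; the agent's configurable LRU cache may be implemented differently) uses `functools.lru_cache`:

```python
# Illustrative caching sketch: safe only because files are assumed not to change
# during a session (restart the agent to refresh).
from functools import lru_cache
import h5py

@lru_cache(maxsize=128)
def get_dataset_info(file_path, dataset_path):
    """Read dataset metadata once; repeated queries hit the cache, not the disk."""
    with h5py.File(file_path, "r") as f:
        ds = f[dataset_path]
        return {"shape": ds.shape, "dtype": str(ds.dtype),
                "attributes": dict(ds.attrs)}
```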
- Python 3.8+
- uv (fast Python package and project manager)
- For Ollama implementation: Ollama
- For Gemini implementation: A Google API key with access to Gemini models
- Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
- For Ollama implementation:
  - Install and start Ollama following the instructions on the Ollama website
  - Pull the default model: `ollama pull phi4:latest`
- For Gemini implementation:
  - Create a `.env` file in the same directory with your Google API key:
    `GOOGLE_API_KEY=your_api_key_here`
- Place the appropriate SFA file in your data directory:
  - Copy either `sfa_hdf5_ollama.py` or `sfa_hdf5_gemini.py` into the folder containing your HDF5 files.
Run the script using `uv run`, which automatically installs dependencies declared in the `# /// script` block.
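For reference, the `# /// script` header at the top of the file follows the standard inline script metadata format that `uv` reads. The dependency list shown here is illustrative rather than a verbatim copy of the agent's header:

```python
# /// script
# requires-python = ">=3.8"
# dependencies = [
#   "h5py",
#   "numpy",
# ]
# ///
```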
# Using Ollama (local LLM)
uv run sfa_hdf5_ollama.py <directory_path> "<your_query>"
# Using Gemini (cloud API)
uv run sfa_hdf5_gemini.py <directory_path> "<your_query>"
Parameters:
- `<directory_path>`: Path to the directory with HDF5 files and the script
- `<your_query>`: Natural language query (optional)
Examples:
uv run sfa_hdf5_ollama.py data "List all HDF5 files in this directory"
uv run sfa_hdf5_gemini.py data "What groups are in test_data.h5?"
# Select model family and size
uv run sfa_hdf5_ollama.py -m mistral data "What datasets exist in test_data.h5?"
uv run sfa_hdf5_ollama.py -f granite -s small data "List all HDF5 files"
uv run sfa_hdf5_ollama.py -f phi4 -s large data "Show me groups in sample.h5"
Available model configurations:
| Family  | Size  | Model         | Description             |
|---------|-------|---------------|-------------------------|
| granite | small | granite3.2:2b | Faster, less memory     |
| granite | large | granite3.2:8b | More capable (default)  |
| phi4    | small | phi4-mini     | Faster, less memory     |
| phi4    | large | phi4          | More capable            |
Note: Choose smaller models for faster responses or when running on systems with limited resources.
# Options: flash (default), think, pro
uv run sfa_hdf5_gemini.py -m pro data "Summarize dataset 'timeseries/temperature' in test_data.h5"
Run without a query to enter interactive mode:
uv run sfa_hdf5_ollama.py data
# OR
uv run sfa_hdf5_gemini.py data
Type your query at the `HDF5>` prompt. Use `exit` to quit, `history` to view past queries, or a number to rerun a previous query.
Example Interactive Session:
uv run sfa_hdf5_ollama.py data
Initializing Ollama...
✓ Ollama ready :)
Interactive mode: 'exit' to quit, 'history' for past queries, or number to rerun
HDF5> how many groups are there in test_data.h5 file?
Processing query...
─ You asked: how many groups are there in test_data.h5 file? ─
In the test_data.h5 file located at /home/akougkas/NFS/dev/single-file-agents/hdf5-agent/data, there are a total of 9 groups.
If you need any more details about these groups or anything else regarding this HDF5 file, feel free to ask!
HDF5> exit
The test suite for SFA-HDF5 is currently in development. In v0.5.0, we've established the testing framework using pytest with comprehensive coverage of utility functions and tool functions.
To run the tests (coming in the next release):
cd hdf5-agent
pytest -xvs tests/
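As an illustration of the style of test this suite targets (a hypothetical example, not taken from the repository), a dataset-listing helper can be exercised against a temporary HDF5 file built with pytest's `tmp_path` fixture:

```python
# Hypothetical test sketch (not from the repository): create a temporary HDF5
# file and check that a dataset-listing helper finds what was written.
import h5py

def list_dataset_names(file_path, group_path="/"):
    """Stand-in for the agent's dataset-listing tool."""
    with h5py.File(file_path, "r") as f:
        return [name for name, obj in f[group_path].items()
                if isinstance(obj, h5py.Dataset)]

def test_list_dataset_names(tmp_path):
    file_path = tmp_path / "sample.h5"
    with h5py.File(file_path, "w") as f:
        f.create_group("timeseries").create_dataset("temperature", data=[1.0, 2.0, 3.0])
    assert list_dataset_names(file_path, "/timeseries") == ["temperature"]
```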
Contributions are welcome! Please submit pull requests or issues to the GitHub repository: https://github.com/akougkas/hdf5-agent.
This project aims to bridge the gap between complex scientific data and intuitive exploration. There are multiple ways to get involved:
- Share Use Cases: How have you used SFA-HDF5 in your research? What datasets have you explored?
- Extension Ideas: What additional analysis capabilities would help your workflow?
- Learning Projects: SFA-HDF5 is an excellent way to learn about both agent architectures and scientific data formats
- Course Integration: Professors can use this tool to teach data exploration concepts without requiring complex coding
- Contribute Examples: Create example datasets that demonstrate interesting HDF5 structures for others to learn from
- Tool Contributions: Add new specialized tools for domain-specific analyses
- Performance Improvements: Help optimize large dataset handling
- Integration Components: Build connectors to other data visualization or analysis systems
Share your experiences, questions, and contributions through:
- GitHub Issues: https://github.com/akougkas/hdf5-agent/issues
- Direct email: a.kougkas@gmail.com
- Academic collaboration inquiries welcome!
This project is licensed under the MIT License.
Developed by Anthony Kougkas | akougkas.io