A tool for testing and understanding the reliability of LLM agents. It evaluates agents on two key dimensions:
- Visibility: How well the agent explains what it's doing
- Repeatability: How consistent the agent's responses are
Quick start:

```bash
# Clone the repository
git clone https://github.com/hasura/agent-reliability-tool.git
cd agent-reliability-tool

# Set up configuration
cp config.yaml.example config.yaml
cp .env.example .env

# Edit .env to add your API keys
nano .env  # Add your OpenAI or Anthropic API key

# Edit config.yaml to configure settings
nano config.yaml  # Choose LLM provider and model

# Install dependencies
poetry install

# Run the tool
poetry run agent-reliability examples/test_prompts.yaml
```

Prerequisites:

- Python 3.8 or higher
- Poetry (for dependency management)
To install and configure the tool step by step:

- Clone the repository:

  ```bash
  git clone https://github.com/hasura/agent-reliability-tool.git
  cd agent-reliability-tool
  ```

- Install dependencies using Poetry:

  ```bash
  poetry install
  ```

- Copy the example configuration files:

  ```bash
  cp config.yaml.example config.yaml
  cp .env.example .env
  ```

- Edit the `.env` file to add your API keys:

  ```bash
  # For OpenAI
  OPENAI_API_KEY=your_openai_api_key_here

  # For Anthropic
  ANTHROPIC_API_KEY=your_anthropic_api_key_here
  ```

- Edit the `config.yaml` file to configure your LLM provider choice and other settings:

  ```yaml
  # Choose your LLM provider
  llm_provider: "anthropic"  # or "openai"

  # Choose the specific model
  llm_model: "claude-3-sonnet-20240229"  # or appropriate OpenAI model
  ```
To connect your own agent:

- Open `agent_reliability/agent_wrapper.py`
- Replace the implementation of the `execute_query` method with your agent's API call
- Configure any necessary parameters in the `config.yaml` file under the `agent_config` section
Example agent implementation:
```python
def execute_query(self, query: str) -> str:
    """Execute a query on your agent and return the response as a string."""
    # Assumes `import requests` and `import json` at the top of the module.
    response = requests.post(
        self.config.get("api_url", "https://your-agent-api.com/query"),
        headers={
            "Authorization": f"Bearer {self.config.get('api_key')}",
            "Content-Type": "application/json"
        },
        json={"query": query, "parameters": self.config.get("parameters", {})}
    )

    if response.status_code != 200:
        return f"Error: Agent API returned status code {response.status_code}"

    result = response.json()

    # Combine multiple response components if present
    final_response = ""
    if "answer" in result:
        final_response += f"ANSWER:\n{result['answer']}\n\n"
    if "reasoning" in result:
        final_response += f"REASONING:\n{result['reasoning']}\n\n"
    if "code" in result:
        final_response += f"CODE:\n{result['code']}\n\n"
    if "sources" in result:
        final_response += f"SOURCES:\n{json.dumps(result['sources'], indent=2)}\n\n"

    return final_response.strip()
```

Create a YAML file containing the prompts you want to test. Only the first 5 prompts will be used.
Example:
```yaml
prompts:
  - id: "factual_query"
    text: "What is the capital of France?"
    description: "Basic factual query"

  - id: "data_analysis"
    text: "I have sales data for Q1: Jan=$10,000, Feb=$12,500, Mar=$9,800. What's the trend and Q1 average?"
    description: "Simple data analysis and calculation"

  # More prompts...
```

Run the tool against your prompts file:

```bash
poetry run agent-reliability your_prompts.yaml
```

Or if you prefer to run it directly:

```bash
python -m agent_reliability.cli your_prompts.yaml
```

The reliability report includes:
- Executive Summary: Overall reliability score and high-level assessment
- Visibility Analysis: How well the agent explains its process
- Repeatability Analysis: How consistent the agent's responses are
- Key Reliability Issues: Most important problems affecting reliability
- Recommendations: Suggestions for improving reliability
- Conclusion: Overall trustworthiness assessment
The report is saved as a Markdown file (default: `agent_reliability_report.md`) and also printed to the console.
Visibility is evaluated on:

- Clear indication of data sources
- Explanation of reasoning steps
- Explicit assumptions
- Understandable conclusions
- Sufficient context
- Clear indication of tools/functions used
- Explanations of calculations
- Acknowledgment of limitations
- Verifiability by domain experts
- Evidence of information retrieval process
Repeatability is evaluated on:

- Consistency of core answers
- Absence of contradictions
- Explainable variations
- Consistency of reasoning approaches
- Consistency of data sources
- Consistency of calculation results
- Consistency of response structure
- Consistency of steps taken
- Impact of variations on user decisions
- Consistency of detail level
The overall reliability score is the average of the visibility and repeatability scores.
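The report's exact scoring scale isn't specified here, so the sketch below simply assumes comparable numeric scores for both dimensions:

```python
def overall_reliability(visibility_score: float, repeatability_score: float) -> float:
    """Overall reliability is the simple mean of the two dimension scores."""
    return (visibility_score + repeatability_score) / 2

# Example: visibility 8.0 and repeatability 6.0 give an overall score of 7.0
print(overall_reliability(8.0, 6.0))
```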
In `config.yaml`:

- `llm_provider`: LLM provider for evaluation (`"openai"` or `"anthropic"`)
- `llm_model`: Specific model to use
- `agent_config`: Configuration for your agent
- `report_path`: Path to save the reliability report
- `advanced.repeat_count`: Number of repetitions for repeatability testing
- `advanced.max_tokens_per_call`: Maximum tokens per LLM call
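For reference, a complete `config.yaml` might look like the sketch below. The keys under `agent_config` mirror the example `execute_query` implementation above, and the specific values are placeholders that illustrate the layout rather than required settings:

```yaml
# LLM used to evaluate your agent's responses
llm_provider: "anthropic"            # or "openai"
llm_model: "claude-3-sonnet-20240229"

# Passed through to your execute_query implementation; keys depend on your agent
agent_config:
  api_url: "https://your-agent-api.com/query"
  api_key: "your_agent_api_key_here"
  parameters: {}

# Where the Markdown report is written
report_path: "agent_reliability_report.md"

advanced:
  repeat_count: 3                    # repetitions for repeatability testing
  max_tokens_per_call: 4000          # cap on tokens per evaluation LLM call
```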
- This tool uses direct HTTP API calls to OpenAI and Anthropic instead of their client libraries to avoid potential proxy-related issues (a sketch of this style of call follows these notes).
- The report generation is designed to handle large responses through summarization when necessary.
- The tool is designed to be extensible: you can modify the evaluation prompts in the `reliability_tester.py` file to adjust the criteria.
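For context, a direct HTTP call of the kind described in the first note looks roughly like the following sketch. This illustrates the approach rather than the tool's exact code; the model name and token limit are placeholders:

```python
import os
import requests

def call_anthropic(prompt: str) -> str:
    """Minimal direct HTTP call to Anthropic's Messages API, with no client library."""
    response = requests.post(
        "https://api.anthropic.com/v1/messages",
        headers={
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        json={
            "model": "claude-3-sonnet-20240229",  # placeholder; use the model from config.yaml
            "max_tokens": 1024,                   # placeholder limit
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    # The Messages API returns a list of content blocks; take the text of the first one.
    return response.json()["content"][0]["text"]
```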
The repository includes example files to help you get started:
- `examples/test_prompts.yaml`: Example test prompts
- `examples/agent_example.py`: Example agent implementation
The following guides provide specific instructions for testing different LLM agents:
- Using the Reliability Tool with PromptQL - Guide for testing Hasura PromptQL implementations