A tool for testing and understanding the reliability of LLM agents. It evaluates agents on two key dimensions:
- Visibility: How well the agent explains what it's doing
- Repeatability: How consistent the agent's responses are
Quick start:

```bash
# Clone the repository
git clone https://github.com/hasura/agent-reliability-tool.git
cd agent-reliability-tool

# Set up configuration
cp config.yaml.example config.yaml
cp .env.example .env

# Edit .env to add your API keys
nano .env  # Add your OpenAI or Anthropic API key

# Edit config.yaml to configure settings
nano config.yaml  # Choose LLM provider and model

# Install dependencies
poetry install

# Run the tool
poetry run agent-reliability examples/test_prompts.yaml
```

Prerequisites:

- Python 3.8 or higher
- Poetry (for dependency management)
To install and configure the tool step by step:

- Clone the repository:

  ```bash
  git clone https://github.com/hasura/agent-reliability-tool.git
  cd agent-reliability-tool
  ```

- Install dependencies using Poetry:

  ```bash
  poetry install
  ```

- Copy the example configuration files:

  ```bash
  cp config.yaml.example config.yaml
  cp .env.example .env
  ```

- Edit the `.env` file to add your API keys:

  ```bash
  # For OpenAI
  OPENAI_API_KEY=your_openai_api_key_here

  # For Anthropic
  ANTHROPIC_API_KEY=your_anthropic_api_key_here
  ```

- Edit the `config.yaml` file to configure your LLM provider choice and other settings:

  ```yaml
  # Choose your LLM provider
  llm_provider: "anthropic"  # or "openai"

  # Choose the specific model
  llm_model: "claude-3-sonnet-20240229"  # or appropriate OpenAI model
  ```
To connect your own agent:

- Open `agent_reliability/agent_wrapper.py`
- Replace the implementation of the `execute_query` method with your agent's API call
- Configure any necessary parameters in the `config.yaml` file under the `agent_config` section
Example agent implementation:
```python
def execute_query(self, query: str) -> str:
    """Execute a query on your agent and return the response as a string."""
    # Assumes `import requests` and `import json` at the top of the module.
    response = requests.post(
        self.config.get("api_url", "https://your-agent-api.com/query"),
        headers={
            "Authorization": f"Bearer {self.config.get('api_key')}",
            "Content-Type": "application/json"
        },
        json={"query": query, "parameters": self.config.get("parameters", {})}
    )

    if response.status_code != 200:
        return f"Error: Agent API returned status code {response.status_code}"

    result = response.json()

    # Combine multiple response components if present
    final_response = ""
    if "answer" in result:
        final_response += f"ANSWER:\n{result['answer']}\n\n"
    if "reasoning" in result:
        final_response += f"REASONING:\n{result['reasoning']}\n\n"
    if "code" in result:
        final_response += f"CODE:\n{result['code']}\n\n"
    if "sources" in result:
        final_response += f"SOURCES:\n{json.dumps(result['sources'], indent=2)}\n\n"

    return final_response.strip()
```

Create a YAML file containing the prompts you want to test. Only the first 5 prompts will be used.
Example:
```yaml
prompts:
  - id: "factual_query"
    text: "What is the capital of France?"
    description: "Basic factual query"

  - id: "data_analysis"
    text: "I have sales data for Q1: Jan=$10,000, Feb=$12,500, Mar=$9,800. What's the trend and Q1 average?"
    description: "Simple data analysis and calculation"

  # More prompts...
```

Run the tool against your prompts file:

```bash
poetry run agent-reliability your_prompts.yaml
```

Or if you prefer to run it directly:

```bash
python -m agent_reliability.cli your_prompts.yaml
```

The reliability report includes:
- Executive Summary: Overall reliability score and high-level assessment
- Visibility Analysis: How well the agent explains its process
- Repeatability Analysis: How consistent the agent's responses are
- Key Reliability Issues: Most important problems affecting reliability
- Recommendations: Suggestions for improving reliability
- Conclusion: Overall trustworthiness assessment
The report is saved as a Markdown file (default: `agent_reliability_report.md`) and also printed to the console.
Visibility is evaluated on:

- Clear indication of data sources
- Explanation of reasoning steps
- Explicit assumptions
- Understandable conclusions
- Sufficient context
- Clear indication of tools/functions used
- Explanations of calculations
- Acknowledgment of limitations
- Verifiability by domain experts
- Evidence of information retrieval process
Repeatability is evaluated on:

- Consistency of core answers
- Absence of contradictions
- Explainable variations
- Consistency of reasoning approaches
- Consistency of data sources
- Consistency of calculation results
- Consistency of response structure
- Consistency of steps taken
- Impact of variations on user decisions
- Consistency of detail level
The overall reliability score is the average of the visibility and repeatability scores.
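The report's exact scoring scale isn't specified here, so the sketch below simply assumes comparable numeric scores for both dimensions:

```python
def overall_reliability(visibility_score: float, repeatability_score: float) -> float:
    """Overall reliability is the simple mean of the two dimension scores."""
    return (visibility_score + repeatability_score) / 2

# Example: visibility 8.0 and repeatability 6.0 give an overall score of 7.0
print(overall_reliability(8.0, 6.0))
```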
In `config.yaml`:

- `llm_provider`: LLM provider for evaluation (`"openai"` or `"anthropic"`)
- `llm_model`: Specific model to use
- `agent_config`: Configuration for your agent
- `report_path`: Path to save the reliability report
- `advanced.repeat_count`: Number of repetitions for repeatability testing
- `advanced.max_tokens_per_call`: Maximum tokens per LLM call
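For reference, a complete `config.yaml` might look like the sketch below. The keys under `agent_config` mirror the example `execute_query` implementation above, and the specific values are placeholders that illustrate the layout rather than required settings:

```yaml
# LLM used to evaluate your agent's responses
llm_provider: "anthropic"            # or "openai"
llm_model: "claude-3-sonnet-20240229"

# Passed through to your execute_query implementation; keys depend on your agent
agent_config:
  api_url: "https://your-agent-api.com/query"
  api_key: "your_agent_api_key_here"
  parameters: {}

# Where the Markdown report is written
report_path: "agent_reliability_report.md"

advanced:
  repeat_count: 3                    # repetitions for repeatability testing
  max_tokens_per_call: 4000          # cap on tokens per evaluation LLM call
```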
- This tool uses direct HTTP API calls to OpenAI and Anthropic instead of their client libraries to avoid potential proxy-related issues (a sketch of this style of call follows these notes).
- The report generation is designed to handle large responses through summarization when necessary.
- The tool is designed to be extensible: you can modify the evaluation prompts in the `reliability_tester.py` file to adjust the criteria.
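For context, a direct HTTP call of the kind described in the first note looks roughly like the following sketch. This illustrates the approach rather than the tool's exact code; the model name and token limit are placeholders:

```python
import os
import requests

def call_anthropic(prompt: str) -> str:
    """Minimal direct HTTP call to Anthropic's Messages API, with no client library."""
    response = requests.post(
        "https://api.anthropic.com/v1/messages",
        headers={
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        json={
            "model": "claude-3-sonnet-20240229",  # placeholder; use the model from config.yaml
            "max_tokens": 1024,                   # placeholder limit
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    # The Messages API returns a list of content blocks; take the text of the first one.
    return response.json()["content"][0]["text"]
```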
The repository includes example files to help you get started:
- `examples/test_prompts.yaml`: Example test prompts
- `examples/agent_example.py`: Example agent implementation
The following guides provide specific instructions for testing different LLM agents:
- Using the Reliability Tool with PromptQL - Guide for testing Hasura PromptQL implementations