llm-eval is a simple, yet flexible, framework designed for assessing the quality of responses generated by Large Language Models (LLMs). The framework lets you easily create test cases, ranging from simple matching/regex tests to tests that extract and execute the Python code blocks generated in a response and report the percentage of code blocks that execute successfully. You can also create your own custom tests.
Get started with examples found in the examples folder.
In this framework, there are two fundamental concepts:

- Eval: An Eval represents a single test scenario. Each Eval defines an input to an LLM or agent, along with "checks" that evaluate the agent's response against the criteria specified in the check. Users can also create custom checks.
- Candidate: A Candidate is a lightweight wrapper around an LLM or agent that is used to standardize the inputs and outputs of the agent with the inputs and outputs associated with the Eval. In other words, different models may expect inputs (and return responses) in different formats; a Candidate is an adaptor for those models, so that Evals can be defined in one format regardless of the formats expected by the various models.
You can define Candidates and Evals using YAML files. Here's an example YAML file for a ChatGPT Candidate:
model: gpt-4o-mini
candidate_type: OPENAI
metadata:
  name: OpenAI GPT-4o-mini
parameters:
  temperature: 0.01
  max_tokens: 4096
  seed: 42
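Conceptually, an OPENAI Candidate adapts the Eval's message format to the OpenAI chat API using the parameters above. The following is a minimal sketch of that adaptor idea using the openai SDK directly; it is an illustration, not the llm-eval Candidate class, and the `candidate` function name is made up for the example.

```python
# Illustrative sketch of what an OPENAI Candidate does conceptually -- not the
# llm-eval implementation. It forwards the Eval's `input` messages to the model
# using the parameters from the YAML above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def candidate(messages: list[dict]) -> str:
    """Send the Eval's messages to the model and return the response text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.01,
        max_tokens=4096,
        seed=42,
    )
    return response.choices[0].message.content
```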
Here's an example of a YAML file that defines an Eval, focusing on generating a Fibonacci sequence function and corresponding assertion statements:
metadata:
  name: Fibonacci Sequence
input:
  - role: user
    content: Create a Python function called `fib` that takes an integer `n` and returns the `n`th number in the Fibonacci sequence. Use type hints and docstrings.
checks:
  - check_type: REGEX
    pattern: "def fib\\([a-zA-Z_]+\\: int\\) -> int\\:"
  - check_type: PYTHON_CODE_BLOCK_TESTS
    code_tests:
      - |
        def verify_fib_returns_10th_fibonacci_number(code_blocks: list[str]) -> bool:
            return fib(10) == 55
The Eval above defines various types of checks, including a `PYTHON_CODE_BLOCK_TESTS` check, which executes the code blocks generated by the LLM in an isolated environment and tracks the number of code blocks that execute successfully. This check also lets you define custom tests that directly exercise the variables or functions created by the code blocks (in the same isolated environment).
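To make the check mechanics concrete, here is a rough, self-contained sketch of what the two checks above do conceptually. It is not the llm-eval implementation: the sample response and the scratch namespace are assumptions for illustration, and the real check runs the code blocks and test functions in an isolated environment where `fib` can be called directly.

```python
import re

# A hand-written stand-in for an LLM response containing one Python code block.
fence = "`" * 3  # avoid writing a literal markdown fence inside this example
sample_response = (
    "Here is the function:\n\n"
    f"{fence}python\n"
    "def fib(n: int) -> int:\n"
    '    """Return the `n`th number in the Fibonacci sequence."""\n'
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
    f"{fence}\n"
)

# REGEX check: does the response contain the expected function signature?
assert re.search(r"def fib\([a-zA-Z_]+\: int\) -> int\:", sample_response)

# PYTHON_CODE_BLOCK_TESTS check (conceptually): extract the code blocks,
# execute them in a scratch namespace, then run the test functions.
code_blocks = re.findall(rf"{fence}python\n(.*?){fence}", sample_response, flags=re.DOTALL)
namespace: dict = {}
for block in code_blocks:
    exec(block, namespace)

def verify_fib_returns_10th_fibonacci_number(code_blocks: list[str]) -> bool:
    return namespace["fib"](10) == 55

assert verify_fib_returns_10th_fibonacci_number(code_blocks)
```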
The following code loads the Eval and Candidate from above (with the addition of a ChatGPT 4.0 Candidate and an Eval testing the creation of a `mask_emails` function). You can then run these Evals against the Candidates with an `EvalHarness`:
from llm_eval.eval import EvalHarness
eval_harness = EvalHarness()
eval_harness.add_eval_from_yaml('examples/evals/simple_example.yaml')
eval_harness.add_eval_from_yaml('examples/evals/mask_emails.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4o-mini.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4.0.yaml')
results = eval_harness()
result = results[0][0]  # EvalResult for the first Candidate and the first Eval
print(f"Num Checks: {result.num_checks}")
print(f"Num Passed: {result.num_successful_checks}")
print(f"Percent Passed: {result.perc_successful_checks:.1%}")
print(result.response)
`results` contains a list of lists of `EvalResult` objects. Each item in the outer list corresponds to a single Candidate and contains a list of `EvalResult`s for all Evals run against that Candidate. In our example, `results` is `[[EvalResult, EvalResult], [EvalResult, EvalResult]]`, where the first inner list corresponds to the results of the Evals run against the first Candidate (ChatGPT 4o-mini) and the second inner list to the results for the second Candidate (ChatGPT 4.0).
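Because the results are nested by Candidate and then by Eval, summarizing them is just a nested loop, as in this small sketch (it uses only the `EvalResult` attributes shown above):

```python
for candidate_index, candidate_results in enumerate(results):
    for eval_index, result in enumerate(candidate_results):
        print(
            f"candidate {candidate_index}, eval {eval_index}: "
            f"{result.num_successful_checks}/{result.num_checks} checks passed "
            f"({result.perc_successful_checks:.1%})"
        )
```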
Note that you can load multiple YAML files from a directory using `add_evals_from_yamls` and `add_candidates_from_yamls`:

...
eval_harness.add_evals_from_yamls('examples/evals/*.yaml')
eval_harness.add_candidates_from_yamls('examples/candidates/*.yaml')
...
The easiest way to install the llm-eval package during this beta period is by cloning the repo and installing locally.
cd <to the directory you want to clone the repo>
git clone https://github.com/anaconda/llm-eval.git
cd llm-eval
<activate conda or virtual environment if necessary>
pip install -e .
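To confirm the editable install worked, a quick smoke test is to import the package from a Python session (the import path below matches the examples earlier in this README):

```python
# Should succeed without errors after `pip install -e .`
from llm_eval.eval import EvalHarness
```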
Important
You must have WARP enabled to access the internal PSM.
To install the llm-eval package using Anaconda's internal PSM, do the following:

- Ensure you have set up your CLI to access the internal PSM repo. Use the following link for up-to-date instructions: Setup Instructions

  conda install conda-repo-cli
  conda repo config --set-site pkgs.as.anacondaconnect.com
  conda repo config --set oauth2 true
  conda repo login  # this opens a browser window to complete the OAuth login process

- Run:

  conda install -c https://pkgs.as.anacondaconnect.com/api/repo/ai-aug llm-eval
If you wish to include the package as a dependency in an `environment.yml` file, you must add the complete channel URL:
channels:
- https://pkgs.as.anacondaconnect.com/api/repo/ai-aug
- defaults
- conda-forge
dependencies:
- llm-eval
- `OPENAI_API_KEY`: This environment variable and API key are required for using OpenAIChat and OpenAICandidate.
- `HUGGING_FACE_API_KEY`: This environment variable and API key are required for using HuggingFaceEndpointChat and HuggingFaceEndpointCandidate.
The easiest way to set up the development environment is to create a Docker container and then connect VS Code to it.

You can create and start the Docker container by running `make docker_run`.

Once the container is running, select the Attach to Running Container option within VS Code and choose the appropriate container.
To set the required environment variables for the corresponding services, follow these steps:

- `HUGGING_FACE_API_KEY`: Set this environment variable to your Hugging Face API key. It is necessary for using HuggingFaceEndpointChat. You can set it in an `.env` file or in your environment.
- `OPENAI_API_KEY`: Set this environment variable to your OpenAI API key. It is required for using OpenAIChat. As with the Hugging Face API key, you can set it in an `.env` file or in your environment (see the sketch below).
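As a sketch of the `.env` approach: the python-dotenv package (an assumption here, not necessarily a project dependency) can load the file into the environment before the keys are read:

```python
import os

from dotenv import load_dotenv  # assumes the python-dotenv package is installed

load_dotenv()  # reads KEY=value pairs from a local .env file into os.environ

missing = [
    key
    for key in ("OPENAI_API_KEY", "HUGGING_FACE_API_KEY")
    if not os.getenv(key)
]
if missing:
    raise RuntimeError(f"Missing required environment variables: {missing}")
```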
For testing the HuggingFaceEndpointChat in llm_eval/llms/hugging_face.py (via tests/test_hugging_face.py), you can set the `HUGGING_FACE_ENDPOINT_UNIT_TESTS` environment variable to a deployed model on Hugging Face Endpoints.
With these environment variables correctly configured, you can use the corresponding services seamlessly within the llm-eval framework.
Install the project dependencies using either `pip` or `conda`. For a reproducible environment, `conda` is preferred, as it sets up a local environment via the `make` command.

To install dependencies with `pip`, run:
pip install -r requirements.txt
To set up a local environment with conda, run:
make setup_env
To activate the environment, run conda activate ./env
There are two source files used to automate the management of multiple build files:

- `dev_requirements.txt`: Dependencies required for development.
- `run_requirements.txt`: Dependencies required for running the project.

When adding new dependencies:

- Include them in either `dev_requirements.txt` or `run_requirements.txt`, as appropriate, with versions explicitly specified to ensure stability.
- Verify that each dependency is available in a conda channel to prevent compatibility issues.
Once added, run:
make gen_build_files
For dependencies with different names in `pip` and `conda`:

- When adding a new package to either `dev_requirements.txt` or `run_requirements.txt`, be sure to use the package name used when installing with `pip`.
- If the package name is different when installing with `conda`, be sure to update the `generate_build_files.py` script so that the proper package name is used in the `conda`-related files.
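The actual mapping lives in `generate_build_files.py`; conceptually it is just a lookup from pip names to conda names, along the lines of this hypothetical sketch (the entry shown is a common example, not necessarily one this project uses):

```python
# Hypothetical illustration only -- the real mapping is maintained in
# generate_build_files.py.
PIP_TO_CONDA = {
    "torch": "pytorch",  # pip name differs from the conda package name
}

def to_conda_name(pip_name: str) -> str:
    """Return the conda package name for a pip requirement name."""
    return PIP_TO_CONDA.get(pip_name, pip_name)
```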
The project supports packaging as either a `conda` package or a Python wheel file. The preferred method is the `conda` package, to ensure compatibility with other `conda` environments.

To create a `conda` package, run:

make gen_build_files  # if new dependencies have been added
make build_package    # to create the conda package

The resulting `conda` package will be located in the `dist/conda/` directory.