llm-eval is a simple, yet flexible, framework designed for assessing the quality of responses generated by Large Language Models (LLMs). The framework lets you easily create test cases, ranging from simple matching/regex tests to tests that extract and execute the Python code blocks generated in a response and report the percentage of code blocks that execute successfully. You can also create your own custom tests.
Get started with examples found in the examples folder.
In this framework, there are two fundamental concepts:

- Eval: An Eval represents a single test scenario. Each Eval defines an input to an LLM or agent, along with "checks" that evaluate the agent's response against the criteria specified in the check. Users can also create custom checks.
- Candidate: A Candidate is a lightweight wrapper around an LLM or agent that is used to standardize the inputs and outputs of the agent with the inputs and outputs associated with the Eval. In other words, different models may expect inputs (and return responses) in different formats; a Candidate is an adaptor for those models, so that Evals can be defined in one format regardless of the formats expected by the various models.
You can define Candidates and Evals using YAML files. Here's an example YAML file for a ChatGPT Candidate:
model: gpt-4o-mini
candidate_type: OPENAI
metadata:
  name: OpenAI GPT-4o-mini
parameters:
  temperature: 0.01
  max_tokens: 4096
  seed: 42
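Conceptually, an OPENAI Candidate adapts the Eval's message format to the OpenAI chat API using the parameters above. The following is a minimal sketch of that adaptor idea using the openai SDK directly; it is an illustration, not the llm-eval Candidate class, and the `candidate` function name is made up for the example.

```python
# Illustrative sketch of what an OPENAI Candidate does conceptually -- not the
# llm-eval implementation. It forwards the Eval's `input` messages to the model
# using the parameters from the YAML above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def candidate(messages: list[dict]) -> str:
    """Send the Eval's messages to the model and return the response text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.01,
        max_tokens=4096,
        seed=42,
    )
    return response.choices[0].message.content
```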
Here's an example of a YAML file that defines an Eval, focusing on generating a Fibonacci sequence function and corresponding assertion statements:
metadata:
  name: Fibonacci Sequence
input:
  - role: user
    content: Create a Python function called `fib` that takes an integer `n` and returns the `n`th number in the Fibonacci sequence. Use type hints and docstrings.
checks:
  - check_type: REGEX
    pattern: "def fib\\([a-zA-Z_]+\\: int\\) -> int\\:"
  - check_type: PYTHON_CODE_BLOCK_TESTS
    code_tests:
      - |
        def verify_fib_returns_10th_fibonacci_number(code_blocks: list[str]) -> bool:
            return fib(10) == 55
The Eval above defines various types of checks, including a `PYTHON_CODE_BLOCK_TESTS` check, which executes the code blocks generated by the LLM in an isolated environment and tracks the number of code blocks that execute successfully. This check also lets you define custom tests that directly exercise the variables or functions created by the code blocks (in the same isolated environment).
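To make the check mechanics concrete, here is a rough, self-contained sketch of what the two checks above do conceptually. It is not the llm-eval implementation: the sample response and the scratch namespace are assumptions for illustration, and the real check runs the code blocks and test functions in an isolated environment where `fib` can be called directly.

```python
import re

# A hand-written stand-in for an LLM response containing one Python code block.
fence = "`" * 3  # avoid writing a literal markdown fence inside this example
sample_response = (
    "Here is the function:\n\n"
    f"{fence}python\n"
    "def fib(n: int) -> int:\n"
    '    """Return the `n`th number in the Fibonacci sequence."""\n'
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
    f"{fence}\n"
)

# REGEX check: does the response contain the expected function signature?
assert re.search(r"def fib\([a-zA-Z_]+\: int\) -> int\:", sample_response)

# PYTHON_CODE_BLOCK_TESTS check (conceptually): extract the code blocks,
# execute them in a scratch namespace, then run the test functions.
code_blocks = re.findall(rf"{fence}python\n(.*?){fence}", sample_response, flags=re.DOTALL)
namespace: dict = {}
for block in code_blocks:
    exec(block, namespace)

def verify_fib_returns_10th_fibonacci_number(code_blocks: list[str]) -> bool:
    return namespace["fib"](10) == 55

assert verify_fib_returns_10th_fibonacci_number(code_blocks)
```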
The following code loads the Eval and Candidate from above (with the addition of a ChatGPT 4.0 Candidate and an Eval testing the creation of a `mask_emails` function). You can then run these Evals against the Candidates with an `EvalHarness`:
from llm_eval.eval import EvalHarness
eval_harness = EvalHarness()
eval_harness.add_eval_from_yaml('examples/evals/simple_example.yaml')
eval_harness.add_eval_from_yaml('examples/evals/mask_emails.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4o-mini.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4.0.yaml')
results = eval_harness()
result = results[0][0]  # EvalResult for the first Candidate and the first Eval
print(f"Num Checks: {result.num_checks}")
print(f"Num Passed: {result.num_successful_checks}")
print(f"Percent Passed: {result.perc_successful_checks:.1%}")
print(result.response)
`results` contains a list of lists of `EvalResult` objects. Each item in the outer list corresponds to a single Candidate and contains a list of `EvalResult`s for all Evals run against that Candidate. In our example, `results` is `[[EvalResult, EvalResult], [EvalResult, EvalResult]]`, where the first inner list corresponds to the results of the Evals run against the first Candidate (ChatGPT 4o-mini) and the second inner list to the results for the second Candidate (ChatGPT 4.0).
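Because the results are nested by Candidate and then by Eval, summarizing them is just a nested loop, as in this small sketch (it uses only the `EvalResult` attributes shown above):

```python
for candidate_index, candidate_results in enumerate(results):
    for eval_index, result in enumerate(candidate_results):
        print(
            f"candidate {candidate_index}, eval {eval_index}: "
            f"{result.num_successful_checks}/{result.num_checks} checks passed "
            f"({result.perc_successful_checks:.1%})"
        )
```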
Note that you can load multiple YAML files from a directory using `add_evals_from_yamls` and `add_candidates_from_yamls`:

...
eval_harness.add_evals_from_yamls('examples/evals/*.yaml')
eval_harness.add_candidates_from_yamls('examples/candidates/*.yaml')
...
The easiest way to install the llm-eval package during this beta period is by cloning the repo and installing locally.
cd <to the directory you want to clone the repo>
git clone https://github.com/anaconda/llm-eval.git
cd llm-eval
<activate conda or virtual environment if necessary>
pip install -e .
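To confirm the editable install worked, a quick smoke test is to import the package from a Python session (the import path below matches the examples earlier in this README):

```python
# Should succeed without errors after `pip install -e .`
from llm_eval.eval import EvalHarness
```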
Important
You must have WARP enabled to access the internal PSM.
To install the llm-eval package using Anaconda's internal PSM, do the following:

- Ensure you have set up your CLI to access the internal PSM repo. Use the following link for up-to-date instructions: Setup Instructions

  conda install conda-repo-cli
  conda repo config --set-site pkgs.as.anacondaconnect.com
  conda repo config --set oauth2 true
  conda repo login  # this opens a browser window to complete the OAuth login process

- Run:

  conda install -c https://pkgs.as.anacondaconnect.com/api/repo/ai-aug llm-eval
If you wish to include the package as a dependency in an `environment.yml` file, you must add the complete channel URL:
channels:
- https://pkgs.as.anacondaconnect.com/api/repo/ai-aug
- defaults
- conda-forge
dependencies:
- llm-eval
- `OPENAI_API_KEY`: This environment variable and API key are required for using OpenAIChat and OpenAICandidate.
- `HUGGING_FACE_API_KEY`: This environment variable and API key are required for using HuggingFaceEndpointChat and HuggingFaceEndpointCandidate.
The easiest way to set up the development environment is to create a Docker container and then connect VS Code to it.

You can create and start the Docker container by running `make docker_run`.

Once the container is running, select the Attach to Running Container option within VS Code and choose the appropriate container.
To set the required environment variables for the corresponding services, follow these steps:

- `HUGGING_FACE_API_KEY`: Set this environment variable to your Hugging Face API key. It is necessary for using HuggingFaceEndpointChat. You can set it in an `.env` file or in your environment.
- `OPENAI_API_KEY`: Set this environment variable to your OpenAI API key. It is required for using OpenAIChat. As with the Hugging Face API key, you can set it in an `.env` file or in your environment (see the sketch below).
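As a sketch of the `.env` approach: the python-dotenv package (an assumption here, not necessarily a project dependency) can load the file into the environment before the keys are read:

```python
import os

from dotenv import load_dotenv  # assumes the python-dotenv package is installed

load_dotenv()  # reads KEY=value pairs from a local .env file into os.environ

missing = [
    key
    for key in ("OPENAI_API_KEY", "HUGGING_FACE_API_KEY")
    if not os.getenv(key)
]
if missing:
    raise RuntimeError(f"Missing required environment variables: {missing}")
```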
For testing the HuggingFaceEndpointChat in llm_eval/llms/hugging_face.py (via tests/test_hugging_face.py), you can set the `HUGGING_FACE_ENDPOINT_UNIT_TESTS` environment variable to a deployed model on Hugging Face Endpoints.
With these environment variables correctly configured, you can use the corresponding services seamlessly within the llm-eval framework.
Install the project dependencies using either `pip` or `conda`. For a reproducible environment, `conda` is preferred, as it sets up a local environment via the `make` command.

To install dependencies with `pip`, run:
pip install -r requirements.txt
To set up a local environment with conda, run:
make setup_env
To activate the environment, run conda activate ./env
There are two source files used to automate the management of multiple build files:

- `dev_requirements.txt`: Dependencies required for development.
- `run_requirements.txt`: Dependencies required for running the project.

When adding new dependencies:

- Include them in either `dev_requirements.txt` or `run_requirements.txt`, as appropriate, with versions explicitly specified to ensure stability.
- Verify that each dependency is available in a conda channel to prevent compatibility issues.
Once added, run:
make gen_build_files
For dependencies with different names in `pip` and `conda`:

- When adding a new package to either `dev_requirements.txt` or `run_requirements.txt`, be sure to use the package name used when installing with `pip`.
- If the package name is different when installing with `conda`, be sure to update the `generate_build_files.py` script so that the proper package name is used in the `conda`-related files.
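The actual mapping lives in `generate_build_files.py`; conceptually it is just a lookup from pip names to conda names, along the lines of this hypothetical sketch (the entry shown is a common example, not necessarily one this project uses):

```python
# Hypothetical illustration only -- the real mapping is maintained in
# generate_build_files.py.
PIP_TO_CONDA = {
    "torch": "pytorch",  # pip name differs from the conda package name
}

def to_conda_name(pip_name: str) -> str:
    """Return the conda package name for a pip requirement name."""
    return PIP_TO_CONDA.get(pip_name, pip_name)
```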
The project supports packaging as either a `conda` package or a Python wheel file. The preferred method is the `conda` package, to ensure compatibility with other `conda` environments.

To create a `conda` package, run:

make gen_build_files  # if new dependencies have been added
make build_package    # to create the conda package

The resulting `conda` package will be located in the `dist/conda/` directory.