A project to enable easy testing of prompts against various provider LLMs.
You will need a few API keys. See .env.example:
cp .env.example .env
Run the eval suite (it might take about 50s):
bun i
bun start
Example scores:
[
  {
    "model": "gpt-5-chat-latest",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.6666666666666666,
    "updatedAt": "2025-10-15T17:51:27.901Z"
  },
  {
    "model": "gpt-4o",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.3333333333333333,
    "updatedAt": "2025-10-15T17:51:30.871Z"
  },
  {
    "model": "claude-sonnet-4-0",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.5,
    "updatedAt": "2025-10-15T17:51:56.370Z"
  },
  {
    "model": "claude-sonnet-4-5",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.8333333333333334,
    "updatedAt": "2025-10-15T17:52:03.349Z"
  },
  {
    "model": "v0-1.5-md",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 1,
    "updatedAt": "2025-10-15T17:52:06.700Z"
  },
  {
    "model": "claude-opus-4-0",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.5,
    "updatedAt": "2025-10-15T17:52:06.898Z"
  },
  {
    "model": "gpt-5",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.5,
    "updatedAt": "2025-10-15T17:52:07.038Z"
  }
]
This project is broken up into a few core pieces:
src/index.ts: The main entrypoint of the project. Evaluations, models, reporters, and the runner are registered and executed here.
/evals: Folders that contain a prompt and grading expectations. Runners currently assume that each eval folder contains two files: graders.ts and PROMPT.md.
/runners: The primary logic responsible for loading evaluations, calling provider LLMs, and outputting scores.
/reporters: The primary logic responsible for sending scores somewhere (stdout, a file, etc.).
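To make that layout concrete, here is a rough sketch of how src/index.ts might wire these pieces together. The function names, argument shape, and eval name below are assumptions for illustration, not the project's actual API.

```ts
// Hypothetical sketch of src/index.ts wiring; names and shapes are assumed.
import { runEvals } from "./runners";
import { stdoutReporter, fileReporter } from "./reporters";

// Register models and evaluations, run everything, then hand the scores to reporters.
const scores = await runEvals({
  models: ["gpt-5", "gpt-4o", "claude-sonnet-4-5", "v0-1.5-md"],
  evals: ["nextjs-fundamentals"], // folder names under /evals (assumed)
});

for (const report of [stdoutReporter, fileReporter]) {
  await report(scores);
}
```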
A runner takes a simple object as an argument.
It will resolve the provider and model to the respective SDK.
It will load the designated evaluation, generate LLM text from the prompt, and pass the result to graders.
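A minimal sketch of that flow, assuming the argument carries the provider, model, and evaluation name, and assuming the Vercel AI SDK for text generation (RunnerArgs, runEval, and the file layout below are illustrative, not the actual interfaces):

```ts
// Hypothetical runner sketch; the argument shape and helper names are assumed.
import { generateText } from "ai"; // Vercel AI SDK (assumed)
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

interface RunnerArgs {
  provider: "openai" | "anthropic";
  model: string;      // e.g. "gpt-5"
  evaluation: string; // folder name under /evals
}

// Resolve the provider name to the corresponding SDK factory.
function resolveModel({ provider, model }: RunnerArgs) {
  return provider === "openai" ? openai(model) : anthropic(model);
}

async function runEval(args: RunnerArgs): Promise<number> {
  // Load the designated evaluation's prompt and graders.
  const prompt = await Bun.file(`evals/${args.evaluation}/PROMPT.md`).text();
  const { graders } = await import(`./evals/${args.evaluation}/graders`);

  // Generate text from the prompt with the resolved model.
  const { text } = await generateText({ model: resolveModel(args), prompt });

  // The score is the fraction of grader functions that pass.
  const results = graders.map((grade: (output: string) => boolean) => grade(text));
  return results.filter(Boolean).length / results.length;
}
```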
At the moment, evaluations are simply folders that contain:
PROMPT.md: the instruction we're evaluating the model's output against
graders.ts: a module containing grader functions that return true/false, signalling whether the model's output passed or failed. This is essentially our acceptance criteria.
For a given model and evaluation, we'll retrieve a score from 0..1, which is the fraction of grader functions that passed.
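For example, a hypothetical graders.ts might look like the following. The specific checks and the exported array shape are assumptions made for illustration, not the project's actual acceptance criteria.

```ts
// Hypothetical evals/<some-eval>/graders.ts; the export shape and checks are assumed.
// Each grader inspects the model's output and returns true (pass) or false (fail).
export const graders: Array<(output: string) => boolean> = [
  // Passes if the output mentions the App Router.
  (output) => /app router/i.test(output),
  // Passes if the output avoids the Pages Router data-fetching API.
  (output) => !output.includes("getServerSideProps"),
];
```

With two graders like these, an output that satisfies only one of them would score 0.5, matching the values in the example scores above.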
At the moment, we employ two minimal reporters.
For the notable interfaces, see /interfaces.
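As a rough idea of what those interfaces might look like, here is a guess at the shapes based on the example scores above; the real definitions live in /interfaces, and the two reporters shown (stdout and file) are assumptions.

```ts
// Hypothetical shapes; the real definitions live in /interfaces.
interface Score {
  model: string;
  framework: string;
  category: string;
  value: number; // 0..1, the fraction of graders that passed
  updatedAt: string;
}

type Reporter = (scores: Score[]) => Promise<void>;

// A minimal stdout reporter.
export const stdoutReporter: Reporter = async (scores) => {
  console.log(JSON.stringify(scores, null, 2));
};

// A minimal file reporter, assuming Bun's file-writing API.
export const fileReporter: Reporter = async (scores) => {
  await Bun.write("scores.json", JSON.stringify(scores, null, 2));
};
```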