
COMPL-AI

COMPL-AI is a compliance-centered evaluation framework for LLMs created and maintained by ETH Zurich, INSAIT, and LatticeFlow AI.

Overview

The COMPL-AI framework includes a technical interpretation of the EU AI Act and an open-source benchmarking suite (this repo). The key features are:

  • Built on the Inspect evaluation framework
  • A tailored set of benchmarks covering the technical parts of the EU AI Act (23 and growing)
  • A public Hugging Face leaderboard of our latest evaluation results
  • An extensive set of supported providers (API, cloud, local)
  • A custom eval CLI (run complai --help for usage)

Community contributions of benchmarks and new mappings are welcome! We are actively looking to expand our EU AI Act and Code of Practice technical interpretation and benchmark coverage. See the contributing section below.

⏩ Quickstart

To run an evaluation yourself, please follow the instructions below (or contact us through compl-ai.org).

# Clone and create a virtual environment
git clone https://github.com/compl-ai/compl-ai.git
cd compl-ai
uv sync
source .venv/bin/activate

# Set your API key
export OPENAI_API_KEY=your_key

# Run 5 samples on a single benchmark
complai eval openai/gpt-5-nano --tasks mmlu_pro --limit 5

# Or run the full framework
complai eval openai/gpt-5-nano

You can then view a detailed sample-level log of your results with the Inspect AI VS Code extension, or in your browser with:

inspect view
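
If your results were written to a custom location (see the --log-dir option below), you can point the viewer at that directory; the flag here assumes Inspect's standard CLI:

inspect view --log-dir path/to/logdir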

💻 CLI

Get help with any command

complai COMMAND --help

List Tasks

complai list

Run evals with the following syntax

complai eval <provider>/<model> -t <task_name> -l <n_samples>

Command Examples

# Remote API
complai eval openai/gpt-4o-mini
complai eval anthropic/claude-sonnet-4-0

# Locally with the HF backend, set the CUDA device (use mps on macOS)
complai eval hf/Qwen/Qwen3-8B -t mmlu_pro -M device=cuda:0

# Using the vLLM backend, evaluate a specific sample and cap the number of sandboxes for agentic benchmarks
complai eval vllm/Qwen/Qwen3-8B -t swe_bench_verified --sample-id django__django-11848 --max-sandboxes 1 

# Use task configuration file or CLI task args (CLI args take precedence)
complai eval openai/gpt-5-nano --task-config default_config.yaml -T mmlu_pro:num_fewshot=5

# Retry a failed eval with an existing log directory, or specify a custom log directory (supports S3 URLs)
complai eval openai/gpt-5-nano --log-dir path/to/logdir

Task Configuration

COMPL-AI supports task-specific configuration via YAML or JSON files. See default_config.yaml for a reference of all configurable parameters. You can:

  • Use the default config as a template: cp default_config.yaml my_config.yaml
  • Modify the tasks and parameters you want
  • Pass it to any eval: complai eval <model> --task-config my_config.yaml
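
As an illustration, a trimmed-down config might look like the sketch below. The exact schema and nesting are defined by default_config.yaml and the structure shown here is an assumption; num_fewshot is borrowed from the -T mmlu_pro:num_fewshot=5 example above:

# my_config.yaml (hypothetical sketch; see default_config.yaml for the real schema)
mmlu_pro:
  num_fewshot: 5  # same parameter as -T mmlu_pro:num_fewshot=5 on the CLI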

Providers

See the Providers section for more information on different providers.

Environment Variables

COMPL-AI can auto-load the model (COMPLAI_MODEL), API keys (e.g. OPENAI_API_KEY), and many other settings (e.g. COMPLAI_LOG_DIR) from your local .env file. Values provided on the CLI take precedence over .env variables.
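
For example, a local .env file might look like this (variable names as above; the values are placeholders):

# .env (picked up automatically; CLI values take precedence)
COMPLAI_MODEL=openai/gpt-5-nano
OPENAI_API_KEY=your_key
COMPLAI_LOG_DIR=./logs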

🧪 Framework

🚧 Update in Progress

The current version of the framework is published here.

We are currently in the process of renewing our coverage of the EU AI Act by updating the set of benchmarks, so the supported set of benchmarks may differ from the original mapping. The goals of this update are to:

  • Increase coverage of the EU AI Act principles
  • Increase coverage of technical requirements
  • Add support for the Code of Practice, namely the Safety and Security chapter
  • Add the notion of risk alongside technical requirements
  • Refresh the supported benchmarks to ensure they remain challenging for frontier models (addressing saturation, contamination, and other benchmark quality issues)

As part of this update, the renewed benchmarking suite is built on the UK AI Security Institute's Inspect framework, which offers improved ease of use and greater overall consistency. This means that several benchmarks are now evaluated differently (e.g. full-text answers instead of logits for multiple-choice questions), reflecting more modern and opinionated approaches to LLM evaluation. Benchmark scores from the v1 and v2 suites, even for the same benchmark, should therefore not be considered directly comparable.

Principles

COMPL-AI is primarily structured to provide coverage of six core EU AI Act principles:

  • Human Agency and Oversight: AI systems should be supervised by people, not by automation alone, to prevent harmful outcomes and allow for human intervention.
  • Technical Robustness and Safety: AI systems must be safe and secure, implementing risk management, data quality, and cybersecurity measures to prevent undue risks.
  • Privacy and Data Governance: The Act sets rules for the quality and governance of data used in AI, emphasizing the protection of personal and sensitive information.
  • Transparency: Users should understand when they are interacting with an AI system and how it functions, fostering trust and enabling accountability.
  • Diversity, Non-Discrimination, and Fairness: AI systems should be designed and used to uphold human rights, including fairness and equality, and avoid biases that could lead to discrimination.
  • Societal and Environmental Well-being: AI systems should be developed in a way that benefits society and the environment, avoiding negative impacts on fundamental rights and democratic values.

Technical Requirements and Benchmarks

You can see a list of all technical requirements and their respective benchmarks using complai list:

  • Capabilities, Performance, and Limitations
    • aime_2025, arc_challenge, hellaswag, humaneval, livebench_coding, mmlu_pro, swe_bench_verified, truthfulqa
  • Representation — Absence of Bias
    • bbq, bold
  • Interpretability
    • bigbench_calibration, triviaqa_calibration
  • Robustness and Predictability
    • boolq_contrast, forecast_consistency, imdb_contrast, self_check_consistency
  • Fairness — Absence of Discrimination
    • decoding_trust, fairllm
  • Disclosure of AI
    • human_deception
  • Cyberattack Resilience
    • instruction_goal_hijacking, llm_rules
  • Harmful Content and Toxicity
    • realtoxicityprompts
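
For example, to run a single benchmark from the bias requirement above, use the eval syntax shown earlier:

complai eval openai/gpt-4o-mini -t bbq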

🤝 Contributing

We welcome contributions! When contributing, please enable the pre-commit hooks to ensure code quality and consistency. You can install them with:

pip install pre-commit
pre-commit install

To run tests, run the following command:

make test

📄 License

This project is licensed under the Apache 2.0 License - see LICENSE for details.

📝 Citation

Please cite our work as follows:

@article{complai24,
      title={COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act},
      author={Philipp Guldimann and Alexander Spiridonov and Robin Staab and Nikola Jovanovi\'{c} and Mark Vero and Velko Vechev and Anna Gueorguieva and Mislav Balunovi\'{c} and Nikola Konstantinov and Pavol Bielik and Petar Tsankov and Martin Vechev},
      year={2024},
      eprint={2410.07959},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.07959},
}
