
COMPL-AI

COMPL-AI is a compliance-centered evaluation framework for LLMs created and maintained by ETH Zurich, INSAIT, and LatticeFlow AI.

Overview

The COMPL-AI framework includes a technical interpretation of the EU AI Act and an open-source benchmarking suite (this repo). The key features are:

  • Built on the Inspect evaluation framework
  • A tailored set of benchmarks covering the technical parts of the EU AI Act (23 and growing)
  • A public Hugging Face leaderboard of our latest evaluation results
  • An extensive set of supported providers (API, cloud, local)
  • A custom eval CLI (run complai --help for usage)

Community contributions of benchmarks and new mappings are welcome! We are actively looking to expand our EU AI Act and Code of Practice technical interpretation and benchmark coverage. See the contributing section below.

⏩ Quickstart

To run an evaluation yourself, please follow the instructions below (or contact us through compl-ai.org).

# Clone and create a virtual environment
git clone https://github.com/compl-ai/compl-ai.git
cd compl-ai
uv sync
source .venv/bin/activate

# Set your API key
export OPENAI_API_KEY=your_key

# Run 5 samples on a single benchmark
complai eval openai/gpt-5-nano --tasks mmlu_pro --limit 5

# Or run the full framework
complai eval openai/gpt-5-nano

You can then view a detailed sample-level log of your results with the Inspect AI VS Code extension, or in your browser with:

inspect view
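
If your results were written to a custom location (see the --log-dir option below), you can point the viewer at that directory; the flag here assumes Inspect's standard CLI:

inspect view --log-dir path/to/logdir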

💻 CLI

Get help with any command

complai COMMAND --help

List Tasks

complai list

Run evals with the following syntax

complai eval <provider>/<model> -t <task_name> -l <n_samples>

Command Examples

# Remote API
complai eval openai/gpt-4o-mini
complai eval anthropic/claude-sonnet-4-0

# Locally with the HF backend, set the CUDA device (use mps on macOS)
complai eval hf/Qwen/Qwen3-8B -t mmlu_pro -M device=cuda:0

# Using the vLLM backend, evaluate a specific sample and cap the number of sandboxes for agentic benchmarks
complai eval vllm/Qwen/Qwen3-8B -t swe_bench_verified --sample-id django__django-11848 --max-sandboxes 1 

# Use task configuration file or CLI task args (CLI args take precedence)
complai eval openai/gpt-5-nano --task-config default_config.yaml -T mmlu_pro:num_fewshot=5

# Retry a failed eval with an existing log directory, or specify a custom log directory (supports S3 URLs)
complai eval openai/gpt-5-nano --log-dir path/to/logdir

Task Configuration

COMPL-AI supports task-specific configuration via YAML or JSON files. See default_config.yaml for a reference of all configurable parameters. You can:

  • Use the default config as a template: cp default_config.yaml my_config.yaml
  • Modify the tasks and parameters you want
  • Pass it to any eval: complai eval <model> --task-config my_config.yaml
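
As an illustration, a trimmed-down config might look like the sketch below. The exact schema and nesting are defined by default_config.yaml and the structure shown here is an assumption; num_fewshot is borrowed from the -T mmlu_pro:num_fewshot=5 example above:

# my_config.yaml (hypothetical sketch; see default_config.yaml for the real schema)
mmlu_pro:
  num_fewshot: 5  # same parameter as -T mmlu_pro:num_fewshot=5 on the CLI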

Providers

See the Providers section for more information on different providers.

Environment Variables

COMPL-AI can auto-load the model (COMPLAI_MODEL), API keys (e.g. OPENAI_API_KEY), and many other settings (e.g. COMPLAI_LOG_DIR) from your local .env file. Values provided on the CLI take precedence over .env variables.
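
For example, a local .env file might look like this (variable names as above; the values are placeholders):

# .env (picked up automatically; CLI values take precedence)
COMPLAI_MODEL=openai/gpt-5-nano
OPENAI_API_KEY=your_key
COMPLAI_LOG_DIR=./logs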

🧪 Framework

🚧 Update in Progress

The current version of the framework is published here.

We are currently in the process of renewing our coverage of the EU AI Act by updating the set of benchmarks, so the supported set of benchmarks may differ from the original mapping. The goals of this update are to:

  • Increase coverage of the EU AI Act principles
  • Increase coverage of technical requirements
  • Add support for the Code of Practice, namely the Safety and Security chapter
  • Add the notion of risk alongside technical requirements
  • Refresh the supported benchmarks to ensure they remain challenging for frontier models (addressing saturation, contamination, and other benchmark quality issues)

As part of this update, the renewed benchmarking suite is built on the UK AI Security Institute's Inspect framework, which offers improved ease of use and greater overall consistency. This means that several benchmarks are now evaluated differently (e.g. full-text answers instead of logits for multiple-choice questions), reflecting more modern and opinionated approaches to LLM evaluation. Benchmark scores from the v1 and v2 suites, even for the same benchmark, should therefore not be considered directly comparable.

Principles

COMPL-AI is primarily structured to provide coverage of six core EU AI Act principles:

  • Human Agency and Oversight: AI systems should be supervised by people, not by automation alone, to prevent harmful outcomes and allow for human intervention.
  • Technical Robustness and Safety: AI systems must be safe and secure, implementing risk management, data quality, and cybersecurity measures to prevent undue risks.
  • Privacy and Data Governance: The Act sets rules for the quality and governance of data used in AI, emphasizing the protection of personal and sensitive information.
  • Transparency: Users should understand when they are interacting with an AI system and how it functions, fostering trust and enabling accountability.
  • Diversity, Non-Discrimination, and Fairness: AI systems should be designed and used to uphold human rights, including fairness and equality, and avoid biases that could lead to discrimination.
  • Societal and Environmental Well-being: AI systems should be developed in a way that benefits society and the environment, avoiding negative impacts on fundamental rights and democratic values.

Technical Requirements and Benchmarks

You can see a list of all technical requirements and their respective benchmarks using complai list:

  • Capabilities, Performance, and Limitations
    • aime_2025, arc_challenge, hellaswag, humaneval, livebench_coding, mmlu_pro, swe_bench_verified, truthfulqa
  • Representation — Absence of Bias
    • bbq, bold
  • Interpretability
    • bigbench_calibration, triviaqa_calibration
  • Robustness and Predictability
    • boolq_contrast, forecast_consistency, imdb_contrast, self_check_consistency
  • Fairness — Absence of Discrimination
    • decoding_trust, fairllm
  • Disclosure of AI
    • human_deception
  • Cyberattack Resilience
    • instruction_goal_hijacking, llm_rules
  • Harmful Content and Toxicity
    • realtoxicityprompts
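
For example, to run a single benchmark from the bias requirement above, use the eval syntax shown earlier:

complai eval openai/gpt-4o-mini -t bbq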

🤝 Contributing

We welcome contributions! When contributing, please enable the pre-commit hooks to ensure code quality and consistency. You can install them with:

pip install pre-commit
pre-commit install

To run tests, run the following command:

make test

📄 License

This project is licensed under the Apache 2.0 License - see LICENSE for details.

📝 Citation

Please cite our work as follows:

@article{complai24,
      title={COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act},
      author={Philipp Guldimann and Alexander Spiridonov and Robin Staab and Nikola Jovanovi\'{c} and Mark Vero and Velko Vechev and Anna Gueorguieva and Mislav Balunovi\'{c} and Nikola Konstantinov and Pavol Bielik and Petar Tsankov and Martin Vechev},
      year={2024},
      eprint={2410.07959},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.07959},
}
