LLM Evaluation Toolkit for RLinf

Acknowledgement

This evaluation toolkit is adapted from Qwen2.5-Math.

Introduction

We provide an integrated evaluation toolkit for long chain-of-thought (CoT) mathematical reasoning.

This toolkit includes both code and datasets, allowing researchers to easily benchmark trained LLMs on math-related reasoning tasks.

Environment Setup

To use the package, install the required dependencies:

pip install -r requirements.txt 

If you are using our Docker image, only the following additional packages need to be installed:

pip install Pebble
pip install timeout-decorator

Quick Start

To run evaluation on a single dataset:

MODEL_NAME_OR_PATH=/model/path # TODO: change to your model path
OUTPUT_DIR=${MODEL_NAME_OR_PATH}/math_eval
SPLIT="test"
NUM_TEST_SAMPLE=-1
export CUDA_VISIBLE_DEVICES="0"

DATA_NAME="aime24" # aime24, aime25, gpqa_diamond
PROMPT_TYPE="r1-distilled-qwen"  
# NOTE:
# for aime24 and aime25, set PROMPT_TYPE="r1-distilled-qwen";
# for gpqa_diamond, set PROMPT_TYPE="r1-distilled-qwen-gpqa".

TOKENIZERS_PARALLELISM=false \
python3 -u math_eval.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --data_name ${DATA_NAME} \
    --output_dir ${OUTPUT_DIR} \
    --split ${SPLIT} \
    --prompt_type ${PROMPT_TYPE} \
    --num_test_sample ${NUM_TEST_SAMPLE} \
    --use_vllm \
    --save_outputs

For batch evaluation across all supported datasets, run:

bash main_eval.sh

Set MODEL_NAME_OR_PATH and CUDA_VISIBLE_DEVICES in the script before running; it will then evaluate the model sequentially on AIME24, AIME25, and GPQA-diamond.

Outputs

The results are printed to the console and stored in OUTPUT_DIR. Stored outputs include:

  • Metadata (xx_metrics.json): summary statistics.
  • Full model outputs (xx.jsonl): complete reasoning traces and predictions.
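
For programmatic inspection, the metrics files can be read directly. The sketch below is a minimal example; the exact directory layout and file names under OUTPUT_DIR depend on the run settings (split, prompt type, sampling parameters), so the path and glob pattern are only illustrative.

import json
from pathlib import Path

# Minimal sketch: collect every *_metrics.json written to OUTPUT_DIR.
# The directory layout and file names depend on the run settings,
# so adjust the path below to match your own OUTPUT_DIR.
output_dir = Path("/model/path/math_eval")  # illustrative path
for metrics_file in sorted(output_dir.rglob("*_metrics.json")):
    metrics = json.loads(metrics_file.read_text())
    print(metrics_file.relative_to(output_dir), "acc:", metrics["acc"])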

Example Metadata

{
    "num_samples": 30,
    "num_scores": 960,
    "timeout_samples": 0,
    "empty_samples": 0,
    "acc": 42.39375,
    "time_use_in_second": 3726.008672475815,
    "time_use_in_minite": "62:06"
}

acc reports the average accuracy (as a percentage) across all sampled responses and serves as the main evaluation metric. In the example above, 30 questions (num_samples) with 32 sampled responses each (the default n_sampling) give 30 × 32 = 960 scored responses (num_scores).

Example Model Output

{
    "idx": 0, 
    "question": "Find the number of...", 
    "gt_cot": "None", 
    "gt": "204", # ground truth answer
    "solution": "... . Thus, we have the equation $(240-t)(s) = 540$ ..., ", # standard solution
    "answer": "204", # ground truth answer
    "code": ["Alright, so I need to figure out ... . Thus, the number of ... is \\(\\boxed{204}\\)."], # generated reasoning chains
    "pred": ["204"], # extracted answers from reasoning chains
    "report": [null], 
    "score": [true] # whether the extracted answers are correct
}
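
The per-question score lists in the .jsonl file can be aggregated to reproduce the acc value reported in the metrics file. The sketch below is a minimal example; the file name "aime24_outputs.jsonl" is a placeholder and should be replaced with the .jsonl actually written to your OUTPUT_DIR.

import json

# Minimal sketch: recompute the average accuracy from saved model outputs.
# "aime24_outputs.jsonl" is a placeholder; use the .jsonl file in your OUTPUT_DIR.
scores = []
with open("aime24_outputs.jsonl") as f:
    for line in f:
        record = json.loads(line)
        scores.extend(record["score"])  # one True/False per sampled response

acc = 100.0 * sum(scores) / len(scores)
print(f"{len(scores)} scored responses, average accuracy = {acc:.2f}")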

Configuration

  • data_names: Dataset to evaluate. Supported: aime24, aime25, gpqa_diamond.
  • prompt_type: Prompt template. Use r1-distilled-qwen for the AIME datasets and r1-distilled-qwen-gpqa for GPQA.
  • temperature: Sampling temperature. Recommended: 0.6 for 1.5B models, 1.0 for 7B models.
  • top_p: Nucleus sampling parameter. Default: 0.95.
  • n_sampling: Number of responses sampled per question, used to compute the average accuracy. Default: 32.
  • max_tokens_per_call: Maximum number of tokens generated per call. Default: 32768.
  • output_dir: Output directory for results. Default: ./outputs.
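
The Quick Start command sets only some of these options. Assuming the remaining configuration names map to command-line flags of math_eval.py the same way --data_name and --prompt_type do (an assumption worth checking against python3 math_eval.py --help), a fully specified run could be launched from Python like this:

import os
import subprocess

# Sketch only: the sampling-related flag names (--temperature, --top_p,
# --n_sampling, --max_tokens_per_call) are inferred from the configuration
# table above; verify them with `python3 math_eval.py --help`.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0", TOKENIZERS_PARALLELISM="false")
cmd = [
    "python3", "-u", "math_eval.py",
    "--model_name_or_path", "/model/path",  # change to your model path
    "--data_name", "aime24",
    "--prompt_type", "r1-distilled-qwen",
    "--split", "test",
    "--num_test_sample", "-1",
    "--output_dir", "./outputs",
    "--temperature", "0.6",                 # 0.6 for 1.5B models, 1.0 for 7B models
    "--top_p", "0.95",
    "--n_sampling", "32",
    "--max_tokens_per_call", "32768",
    "--use_vllm",
    "--save_outputs",
]
subprocess.run(cmd, env=env, check=True)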
