I used lm-evaluation-harness to compute the IFEval score with this script:
accelerate launch -m lm_eval --model hf \
--model_args "pretrained=Qwen/Qwen3-32B" \
--tasks ifeval \
--batch_size auto \
--write_out \
--show_config \
--seed 42 \
--log_samples \
--output_path out/hf \
--cache_requests refresh \
--gen_kwargs '{"temperature":0.7,"top_p":0.8,"top_k":20,"min_p":0.0}' \
--apply_chat_template \
--fewshot_as_multiturn \
--system_instruction "You are a helpful assistant. /no_think."
I got a prompt_strict score of 63.40, which is much lower than the paper's 83.2 (Table 14), even without thinking mode.
I already opened an issue on the repo, but no luck there.
Has anyone faced the same issue, or is there an eval library or script I should follow instead?
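
In case it helps with reproduction, here is a minimal sketch for checking what the Qwen3 chat template actually renders around the /no_think switch before any scoring happens. The enable_thinking kwarg is an assumption taken from the Qwen3 model card, not something lm-evaluation-harness necessarily sets:

# Minimal sketch: render the prompt the Qwen3 chat template would produce,
# to verify that thinking is actually disabled before the IFEval run.
# Assumes the tokenizer's template honors the enable_thinking kwarg
# described on the Qwen3 model card; adjust for your transformers version.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

messages = [
    {"role": "system", "content": "You are a helpful assistant. /no_think."},
    {"role": "user", "content": "Write your answer in all lowercase letters."},
]

# tokenize=False returns the raw prompt string instead of token ids,
# so the rendered template can be inspected directly.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # assumed hard switch; /no_think is the soft, per-turn switch
)
print(prompt)

If the rendered prompt doesn't match what the harness logs in its samples (via --log_samples), that would point to a templating mismatch rather than a generation-settings issue.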