Low IFEval in Qwen3 #1563

@dipta007

I used lm-evaluation-harness to compute the IFEval score with this command:

accelerate launch -m lm_eval --model hf \
    --model_args "pretrained=Qwen/Qwen3-32B" \
    --tasks ifeval \
    --batch_size auto \
    --write_out \
    --show_config \
    --seed 42 \
    --log_samples \
    --output_path out/hf \
    --cache_requests refresh \
    --gen_kwargs '{"temperature":0.7,"top_p":0.8,"top_k":20,"min_p":0.0}' \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --system_instruction "You are a helpful assistant. /no_think."

I got a prompt_strict score of 63.40, which is much lower than the paper's 83.2 (Table 14), even without thinking mode.

I already opened an issue on the repo, but had no luck there.

Has anyone faced the same issue? Or is there an eval library or script I should follow?
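
For reference, the prompt_strict number above is IFEval's prompt-level strict accuracy: a prompt counts as correct only if every verifiable instruction in it is followed, so one missed instruction fails the whole prompt. The sketch below illustrates that aggregation next to the instruction-level variant. The function names and the `results` data are hypothetical and made up for illustration; this is not the harness's own code or its log format.

```python
from typing import List


def prompt_level_accuracy(followed: List[List[bool]]) -> float:
    """Fraction of prompts in which ALL instructions were followed."""
    if not followed:
        raise ValueError("need at least one prompt")
    # Note: all([]) is True, so a prompt with no instructions would count as a pass.
    return sum(all(flags) for flags in followed) / len(followed)


def instruction_level_accuracy(followed: List[List[bool]]) -> float:
    """Fraction of individual instructions followed, pooled over all prompts."""
    flat = [flag for flags in followed for flag in flags]
    if not flat:
        raise ValueError("need at least one instruction")
    return sum(flat) / len(flat)


# Hypothetical per-prompt results: one inner list per prompt,
# one boolean per instruction attached to that prompt.
results = [
    [True, True],          # both instructions followed -> prompt passes
    [True, False],         # one miss -> prompt fails
    [True],                # single instruction followed -> prompt passes
    [False, False, True],  # two misses -> prompt fails
]

prompt_acc = prompt_level_accuracy(results)     # 2 of 4 prompts pass
inst_acc = instruction_level_accuracy(results)  # 5 of 8 instructions followed

print(f"prompt-level: {prompt_acc:.3f}, instruction-level: {inst_acc:.3f}")
```

Because the prompt-level score is all-or-nothing per prompt, it is always at or below the instruction-level score on the same outputs, so it is worth comparing like with like when checking against a reported number.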
