I used lm-evaluation-harness to compute the IFEval score with this script:
accelerate launch -m lm_eval --model hf \
--model_args "pretrained=Qwen/Qwen3-32B" \
--tasks ifeval \
--batch_size auto \
--write_out \
--show_config \
--seed 42 \
--log_samples \
--output_path out/hf \
--cache_requests refresh \
--gen_kwargs '{"temperature":0.7,"top_p":0.8,"top_k":20,"min_p":0.0}' \
--apply_chat_template \
--fewshot_as_multiturn \
--system_instruction "You are a helpful assistant. /no_think."
I got a prompt_strict score of 63.40, which is much lower than the paper's 83.2 (Table 14), even without thinking mode.
I already opened an issue on the repo, but no luck there.
Has anyone faced the same issue, or is there an eval library or script I should follow instead?
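
In case it helps with reproduction, here is a minimal sketch for checking what the Qwen3 chat template actually renders around the /no_think switch before any scoring happens. The enable_thinking kwarg is an assumption taken from the Qwen3 model card, not something lm-evaluation-harness necessarily sets:

# Minimal sketch: render the prompt the Qwen3 chat template would produce,
# to verify that thinking is actually disabled before the IFEval run.
# Assumes the tokenizer's template honors the enable_thinking kwarg
# described on the Qwen3 model card; adjust for your transformers version.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

messages = [
    {"role": "system", "content": "You are a helpful assistant. /no_think."},
    {"role": "user", "content": "Write your answer in all lowercase letters."},
]

# tokenize=False returns the raw prompt string instead of token ids,
# so the rendered template can be inspected directly.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # assumed hard switch; /no_think is the soft, per-turn switch
)
print(prompt)

If the rendered prompt doesn't match what the harness logs in its samples (via --log_samples), that would point to a templating mismatch rather than a generation-settings issue.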