Model Series
Qwen3
What are the models used?
Qwen3-32B
What is the scenario where the problem happened?
Qwen3-32B cannot reproduce the GPQA result reported in the technical report when evaluated with vLLM.
Is this badcase known and can it be solved using available techniques?
- I have followed the GitHub README.
- I have checked the Qwen documentation and cannot find a solution there.
- I have checked the documentation of the related framework and cannot find useful information.
- I have searched the issues and there is not a similar one.
Information about environment
torch 2.6.0
transformers 4.52.4
vllm 0.8.5
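For reference, the installed versions can be confirmed with a short snippet like the one below; it is a minimal sketch that only prints the versions of the packages listed above.

import torch
import transformers
import vllm

# Confirm the environment matches the versions listed above.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)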
Description
Steps to reproduce
The badcase can be reproduced with the following steps:
- Copy the GPQA Diamond test set into the data directory used by the Qwen2.5-Math evaluation scripts.
- Run the evaluation with the Qwen2.5-Math evaluation scripts using the following command:
export CUDA_VISIBLE_DEVICES=0,1,2,3
DATA_NAME="gpqa"
TOKENIZERS_PARALLELISM=false \
python3 -u ${WORK_PATH}/math_eval.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--data_name ${DATA_NAME} \
--data_dir ${WORK_PATH}/data \
--output_dir ${OUTPUT_DIR} \
--split ${SPLIT} \
--prompt_type ${PROMPT_TYPE} \
--num_test_sample ${NUM_TEST_SAMPLE} \
--max_tokens_per_call 32768 \
--seed 0 \
--temperature 0.6 \
--n_sampling ${N} \
--top_p 0.95 \
--top_k 20 \
--start 0 \
--end -1 \
--save_outputs \
--overwrite \
--use_vllm \
--apply_chat_template
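For clarity, the sampling flags above correspond roughly to the following direct use of the vLLM Python API. This is only an illustrative sketch, not the actual math_eval.py code; the model path and question are placeholders, and the real script builds its own prompts.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Placeholder for ${MODEL_NAME_OR_PATH}; the real run uses the local checkpoint path.
model_path = "Qwen/Qwen3-32B"

tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model=model_path, tensor_parallel_size=4, seed=0)

# Same sampling settings as the flags passed to math_eval.py above.
sampling_params = SamplingParams(
    temperature=0.6,   # --temperature 0.6
    top_p=0.95,        # --top_p 0.95
    top_k=20,          # --top_k 20
    max_tokens=32768,  # --max_tokens_per_call 32768
)

# A single GPQA question formatted as a multiple-choice prompt (placeholder).
question = "..."
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,  # --apply_chat_template
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)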
- Add evaluation code for answer extraction on GPQA, for example:
import re

def extract_answer(pred_str, data_name, use_last_number=True):
    pred_str = pred_str.replace("\u043a\u0438", "")
    if data_name in ["mmlu_stem", "sat_math", "aqua", "gaokao2023"]:
        # TODO check multiple choice
        return choice_answer_clean(pred_str)
    if data_name in ["gpqa"]:
        # Extract the final "Answer: X" choice, where X is one of A-D.
        ANSWER_PATTERN_MULTICHOICE = r"(?i)Answer[ \t]*:[ \t]*\$?([A-D])\$?"
        match = re.search(ANSWER_PATTERN_MULTICHOICE, pred_str)
        pred = match.group(1) if match else ""
        if pred in ["A", "B", "C", "D"]:
            return pred
    ...
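As a quick sanity check of the pattern above, a final answer line in the style Qwen3 typically produces is parsed as expected. The model output below is a made-up example, used only to exercise the regex.

# Made-up model output used only to exercise extract_answer() above.
sample_output = (
    "...so after eliminating the other options, the correct choice is C.\n"
    "Answer: C"
)
print(extract_answer(sample_output, "gpqa"))  # prints "C"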
The following example input & output can be used:
Reproduced result on GPQA Diamond:
66.1 (65.x or 66.x across multiple runs)
The result reported in the technical report:
68.4
...
Expected results
The result reported in the technical report:
68.4
Attempts to fix
I have tried several ways to fix this, including:
- Trying different prompts.
- Adding rules to extract more answers from the model-generated text (see the fallback sketch after this list).
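For reference, the kind of extra extraction rule mentioned in the last bullet looks roughly like the sketch below. The patterns are illustrative fallbacks (boxed letters, "the answer is X", last standalone option letter), not the exact code that was tried.

import re

# Illustrative fallback rules for when the strict "Answer: X" pattern fails.
FALLBACK_PATTERNS = [
    r"\\boxed\{\s*([A-D])\s*\}",           # \boxed{C}
    r"(?i)\banswer\s+is\s+\(?([A-D])\)?",  # "the answer is (C)"
    r"\b([A-D])\b(?!.*\b[A-D]\b)",         # last standalone A-D in the text
]

def extract_choice_fallback(pred_str: str) -> str:
    for pattern in FALLBACK_PATTERNS:
        match = re.search(pattern, pred_str, flags=re.DOTALL)
        if match:
            return match.group(1).upper()
    return ""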