
[Badcase]: Fail to reproduce the results of Qwen3-32B on GPQA diamond. #1503

@SefaZeng

Description


Model Series

Qwen3

What are the models used?

Qwen3-32B

What is the scenario where the problem happened?

Qwen3-32B cannot reproduce the GPQA diamond result reported in the technical report when evaluated with vLLM.

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

torch 2.6.0
transformers 4.52.4
vllm 0.8.5

Description

Steps to reproduce

The badcase can be reproduced with the following steps:

  1. Copy the GPQA diamond test set into the data path of the Qwen2.5-Math evaluation scripts.
  2. Run the evaluation with the Qwen2.5-Math harness using the following script (the sampling flags map onto vLLM's SamplingParams as sketched after this list):
export CUDA_VISIBLE_DEVICES=0,1,2,3

DATA_NAME="gpqa"

TOKENIZERS_PARALLELISM=false \
python3 -u ${WORK_PATH}/math_eval.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --data_name ${DATA_NAME} \
    --data_dir ${WORK_PATH}/data \
    --output_dir ${OUTPUT_DIR} \
    --split ${SPLIT} \
    --prompt_type ${PROMPT_TYPE} \
    --num_test_sample ${NUM_TEST_SAMPLE} \
    --max_tokens_per_call 32768 \
    --seed 0 \
    --temperature 0.6 \
    --n_sampling ${N} \
    --top_p 0.95 \
    --top_k 20 \
    --start 0 \
    --end -1 \
    --save_outputs \
    --overwrite \
    --use_vllm \
    --apply_chat_template
  3. Add evaluation code for GPQA answer extraction, for example:
import re

def extract_answer(pred_str, data_name, use_last_number=True):
    # Strip stray Cyrillic artifacts occasionally emitted by the model.
    pred_str = pred_str.replace("\u043a\u0438", "")
    if data_name in ["mmlu_stem", "sat_math", "aqua", "gaokao2023"]:
        # TODO check multiple choice
        return choice_answer_clean(pred_str)
    if data_name in ["gpqa"]:
        # Match "Answer: X" (case-insensitive), optionally wrapped in $...$.
        ANSWER_PATTERN_MULTICHOICE = r"(?i)Answer[ \t]*:[ \t]*\$?([A-D])\$?"
        match = re.search(ANSWER_PATTERN_MULTICHOICE, pred_str)
        pred = match.group(1) if match else ""
        if pred in ["A", "B", "C", "D"]:
            return pred
...
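For reference, a minimal sketch of how the sampling flags in step 2 map onto vLLM's SamplingParams (assuming vllm 0.8.5; the model path and tensor-parallel size here are placeholders, not the exact values used):

from vllm import LLM, SamplingParams

# Mirrors the flags passed to math_eval.py: temperature 0.6, top-p 0.95,
# top-k 20, 32768 max new tokens, fixed seed 0.
llm = LLM(model="Qwen/Qwen3-32B", tensor_parallel_size=4)
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    max_tokens=32768,
    seed=0,
)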

The following example input & output can be used:
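A hypothetical completion illustrating the "Answer: X" format that the extraction pattern above matches (the text is illustrative, not an actual model output):

sample = (
    "Let me reason through the options step by step.\n"
    "...\n"
    "Answer: C"
)
print(extract_answer(sample, "gpqa"))  # -> "C"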

Reproduced result on GPQA diamond:
66.1 (65.x or 66.x across multiple runs)

The reported result:
68.4
...

Expected results

The reported result:
68.4

Attempts to fix

I have tried several ways to fix this, including:

  1. Trying different prompts.
  2. Adding rules to extract more answers from the model-generated text (a sketch follows this list).
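A sketch of the kind of extraction rule tried in (2): fall back to a boxed or trailing option letter when the strict "Answer: X" pattern finds nothing. The patterns below are illustrative, not the exact rules that were added:

import re

def fallback_choice(pred_str):
    # Prefer an explicitly boxed answer, e.g. \boxed{C}.
    boxed = re.findall(r"\\boxed\{([A-D])\}", pred_str)
    if boxed:
        return boxed[-1]
    # Otherwise take the last standalone option letter, e.g. "(B)" or "B.".
    loose = re.findall(r"\(?\b([A-D])\b\)?", pred_str)
    return loose[-1] if loose else ""

Note that loosening extraction only recovers parsing misses; it cannot close the gap if the model's underlying accuracy is lower.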
