Model Series
Qwen3
What are the models used?
Qwen3-32B
What is the scenario where the problem happened?
Qwen3-32B cannot reproduce the GPQA result reported in the technical report when evaluated with vLLM.
Is this badcase known and can it be solved using available techniques?
- I have followed the GitHub README.
- I have checked the Qwen documentation and cannot find a solution there.
- I have checked the documentation of the related framework and cannot find useful information.
- I have searched the issues and there is not a similar one.
Information about environment
torch 2.6.0
transformers 4.52.4
vllm 0.8.5
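For reference, the installed versions can be confirmed with a short snippet like the one below; it is a minimal sketch that only prints the versions of the packages listed above.

import torch
import transformers
import vllm

# Confirm the environment matches the versions listed above.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)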
Description
Steps to reproduce
The badcase can be reproduced with the following steps:
- Copy the GPQA Diamond test set into the data directory used by the Qwen2.5-Math evaluation scripts.
- Run the evaluation with the Qwen2.5-Math evaluation scripts using the following command:
export CUDA_VISIBLE_DEVICES=0,1,2,3
DATA_NAME="gpqa"
TOKENIZERS_PARALLELISM=false \
python3 -u ${WORK_PATH}/math_eval.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--data_name ${DATA_NAME} \
--data_dir ${WORK_PATH}/data \
--output_dir ${OUTPUT_DIR} \
--split ${SPLIT} \
--prompt_type ${PROMPT_TYPE} \
--num_test_sample ${NUM_TEST_SAMPLE} \
--max_tokens_per_call 32768 \
--seed 0 \
--temperature 0.6 \
--n_sampling ${N} \
--top_p 0.95 \
--top_k 20 \
--start 0 \
--end -1 \
--save_outputs \
--overwrite \
--use_vllm \
--apply_chat_template
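For clarity, the sampling flags above correspond roughly to the following direct use of the vLLM Python API. This is only an illustrative sketch, not the actual math_eval.py code; the model path and question are placeholders, and the real script builds its own prompts.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Placeholder for ${MODEL_NAME_OR_PATH}; the real run uses the local checkpoint path.
model_path = "Qwen/Qwen3-32B"

tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model=model_path, tensor_parallel_size=4, seed=0)

# Same sampling settings as the flags passed to math_eval.py above.
sampling_params = SamplingParams(
    temperature=0.6,   # --temperature 0.6
    top_p=0.95,        # --top_p 0.95
    top_k=20,          # --top_k 20
    max_tokens=32768,  # --max_tokens_per_call 32768
)

# A single GPQA question formatted as a multiple-choice prompt (placeholder).
question = "..."
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,  # --apply_chat_template
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)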
- Add evaluation code for answer extraction on GPQA, for example:
import re

def extract_answer(pred_str, data_name, use_last_number=True):
    pred_str = pred_str.replace("\u043a\u0438", "")
    if data_name in ["mmlu_stem", "sat_math", "aqua", "gaokao2023"]:
        # TODO check multiple choice
        return choice_answer_clean(pred_str)
    if data_name in ["gpqa"]:
        # Extract the final "Answer: X" choice, where X is one of A-D.
        ANSWER_PATTERN_MULTICHOICE = r"(?i)Answer[ \t]*:[ \t]*\$?([A-D])\$?"
        match = re.search(ANSWER_PATTERN_MULTICHOICE, pred_str)
        pred = match.group(1) if match else ""
        if pred in ["A", "B", "C", "D"]:
            return pred
    ...
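As a quick sanity check of the pattern above, a final answer line in the style Qwen3 typically produces is parsed as expected. The model output below is a made-up example, used only to exercise the regex.

# Made-up model output used only to exercise extract_answer() above.
sample_output = (
    "...so after eliminating the other options, the correct choice is C.\n"
    "Answer: C"
)
print(extract_answer(sample_output, "gpqa"))  # prints "C"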
The following example input & output can be used:
Reproduced result on GPQA Diamond:
66.1 (65.x or 66.x across multiple runs)
The result reported in the technical report:
68.4
...
Expected results
The result reported in the technical report:
68.4
Attempts to fix
I have tried several ways to fix this, including:
- Trying different prompts.
- Adding rules to extract more answers from the model-generated text (see the fallback sketch after this list).
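For reference, the kind of extra extraction rule mentioned in the last bullet looks roughly like the sketch below. The patterns are illustrative fallbacks (boxed letters, "the answer is X", last standalone option letter), not the exact code that was tried.

import re

# Illustrative fallback rules for when the strict "Answer: X" pattern fails.
FALLBACK_PATTERNS = [
    r"\\boxed\{\s*([A-D])\s*\}",           # \boxed{C}
    r"(?i)\banswer\s+is\s+\(?([A-D])\)?",  # "the answer is (C)"
    r"\b([A-D])\b(?!.*\b[A-D]\b)",         # last standalone A-D in the text
]

def extract_choice_fallback(pred_str: str) -> str:
    for pattern in FALLBACK_PATTERNS:
        match = re.search(pattern, pred_str, flags=re.DOTALL)
        if match:
            return match.group(1).upper()
    return ""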