Model Series
Qwen3
What are the models used?
Qwen3-30B-A3B
What is the scenario where the problem happened?
The score we reproduce for Qwen3-30B-A3B on the multi_turn base subset of BFCL v3 does not match the leaderboard. Could our parameter configuration be the problem?
Is this badcase known and can it be solved using available techniques?
- I have followed the GitHub README.
- I have checked the Qwen documentation and cannot find a solution there.
- I have checked the documentation of the related framework and cannot find useful information.
- I have searched the issues and there is not a similar one.
Information about environment
Python: Python 3.10
GPUs: 4x NVIDIA A800
PyTorch: 2.7.0+cu126 (from `python -c "import torch; print(torch.__version__)"`)
sglang: 0.4.7.post1
Evaluation: official BFCL evaluation pipeline
Description
Hello Qwen team,

While reproducing the performance of Qwen3-30B-A3B on the BFCL v3 benchmark, we noticed that its score on the multi_turn base subset falls well short of the official leaderboard, and we would like to know whether our evaluation parameters are at fault and how we can reproduce the leaderboard result. We used the open-source Qwen3-30B-A3B model from Hugging Face served with sglang, and tested both the 0518 chat template and the 0624 updated chat template, but still could not match the leaderboard. The leaderboard reports 0.34, while our results consistently fall between 0.21 and 0.25. The detailed results are as follows:
| Model and template configuration | temperature | precision | multi_turn_base score |
|---|---|---|---|
| Qwen3-30B-A3B-FC + 0518_chat_template | 0.001 | bf16 | 0.215 |
| Qwen3-30B-A3B-FC + 0518_chat_template | 0.7 | bf16 | 0.250 |
| Qwen3-30B-A3B-FC + 0624_chat_template | 0.7 | bf16 | 0.240 |
| Qwen3-30B-A3B-FC + 0624_chat_template | 0.7 | fp16 | 0.245 |
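For context, the serving and evaluation setup above can be sketched roughly as below. This is only an illustration of our configuration, not the exact commands we ran: the chat-template file path is a placeholder, and the BFCL CLI flags may differ depending on the gorilla/BFCL version installed.

```shell
# Serve Qwen3-30B-A3B with sglang, tensor-parallel across the 4 A800s.
# The .jinja path is a placeholder for whichever chat template (0518 or 0624) is under test.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-30B-A3B \
  --tp 4 \
  --chat-template ./qwen3_chat_template.jinja \
  --port 30000

# Run the BFCL multi_turn_base subset against the local endpoint, then score it.
# Flag names follow the BFCL README; exact options may vary by version.
bfcl generate --model Qwen/Qwen3-30B-A3B-FC --test-category multi_turn_base --temperature 0.001
bfcl evaluate --model Qwen/Qwen3-30B-A3B-FC --test-category multi_turn_base
```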