
[Badcase]: Qwen3-30B-A3B's reproduced score on the BFCL v3 multi_turn base subset does not match the leaderboard; could you share the inference parameters used for testing? #1518

@momo-707


Model Series

Qwen3

What are the models used?

Qwen3-30B-A3B

What is the scenario where the problem happened?

The reproduced score of Qwen3-30B-A3B on the BFCL v3 multi_turn base subset does not match the leaderboard. Could there be a problem with our parameter configuration?

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

Python: Python 3.10
GPUs: 4× NVIDIA A800
PyTorch: 2.7.0+cu126 (from `python -c "import torch; print(torch.__version__)"`)
sglang: 0.4.7.post1
Evaluation follows the official BFCL procedure.

Description

Hello, Qwen team:
While reproducing the performance of Qwen3-30B-A3B on the BFCL v3 benchmark, we noticed a large gap between our score on the multi_turn base subset and the official leaderboard. Could there be a problem with our evaluation parameters, and how can we reproduce the leaderboard result? We are using the open-source Qwen3-30B-A3B model from Hugging Face served with sglang. We tested both the chat template from 0518 and the updated chat template from 0624, but still cannot reproduce the leaderboard result: the leaderboard shows a score of 0.34, while our results stay in the 0.21–0.25 range. Details are as follows:

| Model + chat template | temperature | precision | multi_turn_base score |
|---|---|---|---|
| Qwen3-30B-A3B-FC + 0518_chat_template | 0.001 | bf16 | 0.215 |
| Qwen3-30B-A3B-FC + 0518_chat_template | 0.7 | bf16 | 0.250 |
| Qwen3-30B-A3B-FC + 0624_chat_template | 0.7 | bf16 | 0.240 |
| Qwen3-30B-A3B-FC + 0624_chat_template | 0.7 | fp16 | 0.245 |
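For reference, our serving and evaluation setup can be sketched roughly as below. The sglang launch flags are the standard ones; the `bfcl` command-line flags are assumptions based on the BFCL CLI and may differ between versions, so please check `bfcl generate --help` for the exact interface:

```shell
# Serve Qwen3-30B-A3B with sglang across 4 GPUs (tensor parallel).
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-30B-A3B \
  --tp 4 \
  --port 30000

# Generate responses for the multi_turn base subset, then score them.
# (Flag names below are assumptions, not verified against a specific BFCL release.)
bfcl generate \
  --model Qwen/Qwen3-30B-A3B-FC \
  --test-category multi_turn_base \
  --temperature 0.7

bfcl evaluate \
  --model Qwen/Qwen3-30B-A3B-FC \
  --test-category multi_turn_base
```

If the leaderboard run used different sampling parameters (e.g. a specific temperature, top_p, or top_k) or a specific chat template revision, knowing those values would let us rule out configuration differences.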
