
[Badcase]: Qwen3-30B-A3B's reproduced score on the BFCL v3 multi_turn base subset does not match the leaderboard; could you share the inference parameters used for testing? #1518

@momo-707


Model Series

Qwen3

What are the models used?

Qwen3-30B-A3B

What is the scenario where the problem happened?

The reproduced score of Qwen3-30B-A3B on the BFCL v3 multi_turn base subset does not match the leaderboard. Could there be a problem with our parameter configuration?

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

Python: Python 3.10
GPUs: 4× NVIDIA A800
PyTorch: 2.7.0+cu126 (from `python -c "import torch; print(torch.__version__)"`)
sglang: 0.4.7.post1
Evaluation follows the official BFCL procedure.

Description

Hello, Qwen team:
While reproducing the performance of Qwen3-30B-A3B on the BFCL v3 benchmark, we noticed a large gap between our score on the multi_turn base subset and the official leaderboard. Could there be a problem with our evaluation parameters, and how can we reproduce the leaderboard result? We are using the open-source Qwen3-30B-A3B model from Hugging Face served with sglang. We tested both the chat template from 0518 and the updated chat template from 0624, but still cannot reproduce the leaderboard result: the leaderboard shows a score of 0.34, while our results stay in the 0.21–0.25 range. Details are as follows:

| Model + chat template | temperature | precision | multi_turn_base score |
|---|---|---|---|
| Qwen3-30B-A3B-FC + 0518_chat_template | 0.001 | bf16 | 0.215 |
| Qwen3-30B-A3B-FC + 0518_chat_template | 0.7 | bf16 | 0.250 |
| Qwen3-30B-A3B-FC + 0624_chat_template | 0.7 | bf16 | 0.240 |
| Qwen3-30B-A3B-FC + 0624_chat_template | 0.7 | fp16 | 0.245 |
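For reference, our serving and evaluation setup can be sketched roughly as below. The sglang launch flags are the standard ones; the `bfcl` command-line flags are assumptions based on the BFCL CLI and may differ between versions, so please check `bfcl generate --help` for the exact interface:

```shell
# Serve Qwen3-30B-A3B with sglang across 4 GPUs (tensor parallel).
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-30B-A3B \
  --tp 4 \
  --port 30000

# Generate responses for the multi_turn base subset, then score them.
# (Flag names below are assumptions, not verified against a specific BFCL release.)
bfcl generate \
  --model Qwen/Qwen3-30B-A3B-FC \
  --test-category multi_turn_base \
  --temperature 0.7

bfcl evaluate \
  --model Qwen/Qwen3-30B-A3B-FC \
  --test-category multi_turn_base
```

If the leaderboard run used different sampling parameters (e.g. a specific temperature, top_p, or top_k) or a specific chat template revision, knowing those values would let us rule out configuration differences.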
