Model Series
Qwen3
What are the models used?
Qwen3-8B
What is the scenario where the problem happened?
Applying RoPE scaling in the config.json file causes performance degradation in vLLM.
Is this badcase known and can it be solved using available techniques?
- I have followed the GitHub README.
- I have checked the Qwen documentation and cannot find a solution there.
- I have checked the documentation of the related framework and cannot find useful information.
- I have searched the issues and there is not a similar one.
Information about environment
vllm 0.9.2rc2.dev39+gc18b3b8e
transformers 4.52.4
Description
Hi Qwen team,
I observed a performance difference on the Arena-Hard-v2.0 benchmark when running Qwen3-8B on vLLM with and without RoPE scaling. Specifically, the model performs worse when RoPE scaling is applied: the score drops from 39 to 26.5.
I added the following to config.json for RoPE scaling, as per the instructions:

```json
"rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
}
```
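For reference, the Qwen documentation also describes passing the same YaRN settings as vLLM command-line overrides instead of editing config.json, which makes it easy to A/B the two setups without touching the checkpoint. A minimal sketch, assuming the model is served with `vllm serve`:

```bash
# Serve with YaRN RoPE scaling enabled via command-line overrides
# (equivalent to the config.json edit above, per the Qwen3 usage notes)
vllm serve Qwen/Qwen3-8B \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 131072

# Baseline for comparison: the same model without any RoPE scaling override
vllm serve Qwen/Qwen3-8B
```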
According to the logs, vLLM does seem to apply the scaling correctly, unlike in the similar issue #1424 (comment):
```
INFO 07-24 03:32:48 [config.py:1472] Using max model len 131072
```
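One way to double-check that the edited config.json is the one actually being picked up is to print the parsed configuration. A minimal sketch using the `transformers` `AutoConfig` API; the checkpoint path is a placeholder:

```bash
python - <<'EOF'
from transformers import AutoConfig

# Placeholder path: point this at the local Qwen3-8B checkpoint whose
# config.json was edited.
cfg = AutoConfig.from_pretrained("/path/to/Qwen3-8B")
print(cfg.rope_scaling)  # should show the yarn dict if the edit is picked up
EOF
```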
The performance drop is unexpected because the benchmark uses only short-context prompts, so I assumed RoPE scaling wouldn't have any impact.
Is there something I might be overlooking? I'd really appreciate any guidance or help on this!
Best,