Model Series
Qwen3
What are the models used?
Qwen3-8B
What is the scenario where the problem happened?
Applying RoPE scaling in the config.json file causes performance degradation in vLLM.
Is this badcase known and can it be solved using available techniques?
- I have followed the GitHub README.
- I have checked the Qwen documentation and cannot find a solution there.
- I have checked the documentation of the related framework and cannot find useful information.
- I have searched the issues and there is not a similar one.
Information about environment
vllm 0.9.2rc2.dev39+gc18b3b8e
transformers 4.52.4
Description
Hi Qwen team,
I observed a performance difference on the Arena-Hard-v2.0 benchmark when running Qwen3-8B on vLLM with and without RoPE scaling. Specifically, the model performs worse when RoPE scaling is applied: the score drops from 39 to 26.5.
I added the following to config.json for RoPE scaling, as per the instructions:

```json
"rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
}
```
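For reference, the Qwen documentation also describes passing the same YaRN settings as vLLM command-line overrides instead of editing config.json, which makes it easy to A/B the two setups without touching the checkpoint. A minimal sketch, assuming the model is served with `vllm serve`:

```bash
# Serve with YaRN RoPE scaling enabled via command-line overrides
# (equivalent to the config.json edit above, per the Qwen3 usage notes)
vllm serve Qwen/Qwen3-8B \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 131072

# Baseline for comparison: the same model without any RoPE scaling override
vllm serve Qwen/Qwen3-8B
```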
According to the logs, vLLM does seem to apply the scaling correctly, unlike in the similar issue #1424 (comment):
```
INFO 07-24 03:32:48 [config.py:1472] Using max model len 131072
```
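One way to double-check that the edited config.json is the one actually being picked up is to print the parsed configuration. A minimal sketch using the `transformers` `AutoConfig` API; the checkpoint path is a placeholder:

```bash
python - <<'EOF'
from transformers import AutoConfig

# Placeholder path: point this at the local Qwen3-8B checkpoint whose
# config.json was edited.
cfg = AutoConfig.from_pretrained("/path/to/Qwen3-8B")
print(cfg.rope_scaling)  # should show the yarn dict if the edit is picked up
EOF
```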
The performance drop is unexpected because the benchmark uses only short-context prompts, so I assumed RoPE scaling wouldn't have any impact.
Is there something I might be overlooking? I'd really appreciate any guidance or help on this!
Best,