Multi-turn SFT issue Qwen3 #1398
friendshipity started this conversation in General
Replies: 1 comment 3 replies
-
For multi-turn conversations, the thinking content of previous turns should be removed, except for multi-step tool calls. The official chat template can do that automatically for you.
It would be better to reorganize that one multi-turn example into multiple examples and keep only the thinking block of the final turn. For example, [Q1, T1, A1, Q2, T2, A2, Q3, T3, A3] should be split into [Q1, T1, A1], [Q1, A1, Q2, T2, A2], and [Q1, A1, Q2, A2, Q3, T3, A3]. A sketch of this split is shown below.
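Here is a minimal sketch of that split, assuming assistant messages carry Qwen3-style reasoning wrapped in `<think>...</think>` tags inside their content; `strip_think` and `split_multi_turn` are hypothetical helper names, and the tool-call exception mentioned above is left out for brevity:

```python
import re
from typing import Dict, List

# Matches a <think>...</think> block (including newlines) plus trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    """Remove the reasoning block from an assistant message, keeping the answer."""
    return THINK_RE.sub("", text).lstrip()

def split_multi_turn(messages: List[Dict[str, str]]) -> List[List[Dict[str, str]]]:
    """Split one multi-turn conversation into one training example per assistant turn.

    History assistant turns keep only their final answers (thinking removed),
    while the target turn keeps its full <think> block.
    """
    examples = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        history = [
            {**m, "content": strip_think(m["content"])} if m["role"] == "assistant" else m
            for m in messages[:i]
        ]
        examples.append(history + [msg])  # thinking survives only in the last turn
    return examples

# [Q1, T1, A1, Q2, T2, A2] -> two examples:
conversation = [
    {"role": "user", "content": "Q1"},
    {"role": "assistant", "content": "<think>T1</think>A1"},
    {"role": "user", "content": "Q2"},
    {"role": "assistant", "content": "<think>T2</think>A2"},
]
for example in split_multi_turn(conversation):
    print([m["content"] for m in example])
# ['Q1', '<think>T1</think>A1']
# ['Q1', 'A1', 'Q2', '<think>T2</think>A2']
```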
-
In multi-turn SFT with Qwen3, there are multiple "think blocks" across turns, and the loss is computed only on the assistant's replies. This means that all "think blocks" from previous turns are included in the context when the loss for each assistant reply is computed. Given that these think blocks are quite long in my dataset, does it make sense to split the multi-turn text and compute the loss for each turn separately (with no history think blocks in the context), in order to avoid an excessively long token context during training?
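For what it's worth, here is a minimal sketch of that per-turn variant, assuming a Hugging Face tokenizer with Qwen3's chat template; `build_single_turn_example` is a hypothetical helper, and it relies on the approximation that the rendered history plus generation prompt is an exact token prefix of the rendered full conversation, which should be verified for the template in use:

```python
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # tokens with this label are ignored by the cross-entropy loss

def build_single_turn_example(tokenizer, messages):
    """Tokenize one split-out example and train only on its final assistant turn.

    `messages` is expected to end with the assistant turn that keeps its
    <think> block; all earlier turns serve purely as context.
    """
    full_ids = tokenizer.apply_chat_template(messages, tokenize=True)
    prefix_ids = tokenizer.apply_chat_template(
        messages[:-1], tokenize=True, add_generation_prompt=True
    )
    # Mask the context so only the final assistant reply contributes to the loss.
    labels = [IGNORE_INDEX] * len(prefix_ids) + full_ids[len(prefix_ids):]
    return {"input_ids": full_ids, "labels": labels}

# Usage with the split produced above:
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
# batch = [build_single_turn_example(tokenizer, ex) for ex in split_multi_turn(conversation)]
```

Since each split-out example then carries only one think block, the context length grows with the number of previous answers rather than with all accumulated reasoning.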