Added post-processing (for reasoning tokens) to pipeline #882
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
HELP_PANNEL_NAME_2 = "Logging Parameters"
HELP_PANNEL_NAME_3 = "Debug Parameters"
HELP_PANNEL_NAME_4 = "Modeling Parameters"
HELP_PANEL_NAME_1 = "Common Parameters"
```
nit, unrelated to the PR
Thanks a lot for adding this feature so quickly! Overall the logic looks sound, but to be sure could we:

- Evaluate a few popular reasoning models before/after the change on e.g. AIME24 & IFEval?
- Add some unit tests to check that if `--remove-reasoning-tags` is set to `True`/`False` then the desired post-processing is applied? (A sketch of such a test follows this list.)
- Add some small docs / example somewhere which explains this flag? If you want an example of a reasoning model with different think tags, check out Magistral.

If possible, it would also be great if we could store both the raw and post-processed predictions in the details. This would be helpful for debugging / understanding if a poor score is due to an unterminated reasoning block.
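A minimal sketch of what such a unit test could look like, assuming the post-processing strips `<think>…</think>` spans with a regex. The helper name `remove_reasoning_tags` and the default tag pair are assumptions for illustration, not the PR's actual API:

```python
import re
import unittest

# Hypothetical helper mirroring what the pipeline might do when
# --remove-reasoning-tags is set; the real implementation may differ.
def remove_reasoning_tags(text: str, tags: tuple[str, str] = ("<think>", "</think>")) -> str:
    open_tag, close_tag = (re.escape(t) for t in tags)
    # Drop complete reasoning blocks, tags included; unterminated blocks
    # (no closing tag) are deliberately left untouched.
    return re.sub(f"{open_tag}.*?{close_tag}", "", text, flags=re.DOTALL).strip()

class TestRemoveReasoningTags(unittest.TestCase):
    def test_tags_removed(self):
        raw = "<think>step 1... step 2...</think>The answer is 42."
        self.assertEqual(remove_reasoning_tags(raw), "The answer is 42.")

    def test_text_without_tags_is_unchanged(self):
        self.assertEqual(remove_reasoning_tags("The answer is 42."), "The answer is 42.")

    def test_unterminated_block_is_kept(self):
        # No closing tag, so nothing is stripped; storing both raw and
        # post-processed predictions in the details makes this visible.
        self.assertEqual(remove_reasoning_tags("<think>still thinking"), "<think>still thinking")

if __name__ == "__main__":
    unittest.main()
```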
```python
None  # Log probabilities of the unconditioned model (if applicable)
)

@property
def final_text(self) -> list[str]:
```
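For context, a plausible shape for this property; a sketch only — the `final_text` name comes from the diff above, but the body and the `_post_processed_text` attribute are assumptions:

```python
@property
def final_text(self) -> list[str]:
    """Post-processed predictions if reasoning-tag removal ran,
    otherwise the raw generations."""
    # _post_processed_text is an assumed attribute set by the pipeline's
    # post-processing step; the actual field name may differ.
    if self._post_processed_text is not None:
        return self._post_processed_text
    return self.text
```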
Will this give predictions in the details that have their reasoning tags removed?

For debugging purposes, it could be handy to have both the raw predictions and the post-processed ones in the details, so that users can see if e.g. a bad score is due to unterminated reasoning traces.
Yep, good idea to store both in the details.
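A minimal sketch of what storing both views in the details could look like (the dict layout and key names are illustrative assumptions, not the repository's actual schema):

```python
# Hypothetical detail record keeping both views of a prediction.
detail = {
    "raw_prediction": model_response.text,          # untouched generations
    "final_prediction": model_response.final_text,  # reasoning tags stripped
}
# If the two are identical even though --remove-reasoning-tags is set,
# the reasoning block was probably never terminated.
```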
```diff
@@ -236,7 +236,7 @@ def add_to_specifics_with_timeout(
 def sample_level_fn(doc: Doc, model_response: ModelResponse) -> float:
     golds = doc.get_golds()
-    predictions = model_response.text
+    predictions = model_response.final_text
```
Same question here as below on whether this leads to the predictions in the details having no reasoning tags or not.
Ok, thanks for the feature list (and yep, definitely for tests at the main level!)
Fix #869
Tested on Qwen3-0.6B and AIME25: when you select `--remove-reasoning-tags`, it indeed removes them before sending the answer to the metric (but the model is quite verbose, so a lot of answers are unchanged because the closing token never appears).

Edit: still need to add correct parsing for reasoning tags provided by the user; a sketch of one possible format follows.
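A minimal sketch of how user-provided tag pairs could be parsed. The `"open,close;open,close"` format is an assumption for illustration, not necessarily what the PR will implement:

```python
# Hypothetical parser for a CLI value such as
# "<think>,</think>;[THINK],[/THINK]" (the format is an assumption).
def parse_reasoning_tags(raw: str) -> list[tuple[str, str]]:
    pairs = []
    for pair in raw.split(";"):
        open_tag, close_tag = (part.strip() for part in pair.split(","))
        if not open_tag or not close_tag:
            raise ValueError(f"Expected 'open,close', got {pair!r}")
        pairs.append((open_tag, close_tag))
    return pairs

# e.g. Magistral-style tags differ from the usual <think> pair
assert parse_reasoning_tags("[THINK],[/THINK]") == [("[THINK]", "[/THINK]")]
```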