Fix number of quantisation buckets #182
Conversation
@alvaropp thanks for spotting this! I think that changing the values associated with tokens may not be the right way to go, since this would make the code diverge from how the model was originally trained. It's true that the additional bucket results in an invalid token ID (the tokenizer might have been simpler in these regards), but I think maybe the best fix is to clip the token IDs. Let me know what you think!
Hi, I think you're right, I hadn't thought of that problem 😄 I guess the clipping approach you propose would flatten big spikes by effectively putting normalised values around +15 and values above 1e20 into the same bucket (using the tokenizer's default parameters), but I reckon this should only happen for time series with extreme values, so the impact should be quite minor overall. Happy to give this approach a go.
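For reference, a minimal sketch of the clipping idea being discussed. The function name, arguments, and comments below are illustrative rather than the actual Chronos code; they only show how clamping keeps extreme values inside the last valid numerical bucket:

```python
import torch

def bucketize_and_clip(scaled, boundaries, n_special_tokens, n_tokens):
    # torch.bucketize maps each value to an index in [0, len(boundaries)],
    # i.e. one more index than there are buckets between consecutive boundaries
    token_ids = torch.bucketize(scaled, boundaries) + n_special_tokens
    # clamping pushes any out-of-range ID back into the last valid bucket,
    # so a value near +15 and one above 1e20 end up with the same token
    return token_ids.clamp_(0, n_tokens - 1)
```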
@lostella ready for review now
lostella
left a comment
@alvaropp just suggesting to remove the comments, especially since they cross-reference each other. I think the explanation in the tests is sufficient.
Co-authored-by: Lorenzo Stella <lorenzostella@gmail.com>
@lostella sure, removed.
lostella
left a comment
Thanks @alvaropp!
Fixes #181.
Chronos' tokenizer has a vocabulary size of `n_tokens`. Among these, there are `n_special_tokens` reserved for EOS, PAD, etc., and `n_tokens - n_special_tokens` allocated to numerical values. However, the provided `MeanScaleUniformBins` tokenizer creates `n_tokens - n_special_tokens + 1` different buckets, resulting in a total of `n_tokens + 1` possible tokens. This causes training and inference errors when one of the data points gets allocated to the largest bucket, as the model requires `0 <= token_id < n_tokens`.
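To make the off-by-one concrete, here is a toy example with made-up sizes (not the actual Chronos defaults or boundary construction): with `n_tokens - n_special_tokens` boundary points, `torch.bucketize` produces `n_tokens - n_special_tokens + 1` possible buckets, and an extreme value then maps to an ID equal to `n_tokens`:

```python
import torch

n_tokens, n_special_tokens = 10, 2   # hypothetical sizes: 8 IDs meant for numerical buckets

# 8 boundary points -> torch.bucketize returns indices 0..8, i.e. 9 buckets
boundaries = torch.linspace(-15.0, 15.0, n_tokens - n_special_tokens)

x = torch.tensor([-1e9, 0.0, 1e9])
token_ids = torch.bucketize(x, boundaries) + n_special_tokens
print(token_ids.max().item())        # 10 == n_tokens, violating token_id < n_tokens
```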
This PR modifies the `MeanScaleUniformBins` tokenizer so that it creates one less bucket for numerical values.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.