
Conversation

@alvaropp
Contributor

@alvaropp alvaropp commented Sep 30, 2024

Fixes #181.

Chronos' tokenizer has a vocabulary size of n_tokens. Among these, there are n_special_tokens reserved for EOS, PAD, etc. and n_tokens - n_special_tokens allocated to numerical values. However, the provided MeanScaleUniformBins tokenizer creates n_tokens - n_special_tokens + 1 different buckets, resulting in a total of n_tokens + 1 possible tokens. This causes training and inference errors when one of the data points gets allocated to the largest bucket, as the model requires 0 <= token_id < n_tokens.

This PR modifies the MeanScaleUniformBins tokenizer so that it creates one fewer bucket for numerical values.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
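[Editor's note] The off-by-one described above can be illustrated with a minimal sketch. This is not the actual Chronos code: the parameter values, the uniform boundaries built with torch.linspace, and the offset by n_special_tokens are all hypothetical stand-ins for the behaviour the PR describes. The key fact is that torch.bucketize against B boundaries yields B + 1 possible bucket indices:

```python
import torch

n_tokens = 10          # hypothetical total vocabulary size
n_special_tokens = 2   # hypothetical count reserved for EOS, PAD, etc.

# Bucketizing against B boundaries produces B + 1 possible bucket indices.
# With B = n_tokens - n_special_tokens boundaries, an extreme value lands in
# bucket index B, so the resulting token id reaches n_tokens, one past the
# valid range 0 <= token_id < n_tokens.
boundaries = torch.linspace(-15.0, 15.0, n_tokens - n_special_tokens)
values = torch.tensor([-1e20, 0.0, 1e20])
token_ids = torch.bucketize(values, boundaries) + n_special_tokens

assert token_ids.max().item() == n_tokens  # out of range by one
```

With default-style parameters, only values beyond the outermost boundary trigger this, which is why the bug surfaces on data points with extreme (normalised) values.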

@lostella
Contributor

lostella commented Oct 1, 2024

@alvaropp thanks for spotting this! I think that changing the values associated with tokens may not be the right way to go, since this will make the code diverge from how the model was originally trained. It's true that the additional bucket results in an invalid token id (the tokenizer could have been simpler in this regard), but I think maybe the best fix is to clip the token ids in _input_transform before returning.

token_ids.clamp_(0, self.config.n_tokens - 1)

Let me know what you think!
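[Editor's note] The clipping fix suggested above can be sketched in isolation. The tensor values here are hypothetical; the point is that an in-place clamp folds any out-of-range id (equal to n_tokens, produced by the extra bucket) back into the largest valid numerical bucket:

```python
import torch

n_tokens = 10  # hypothetical vocabulary size

# Simulated token ids where the last entry (== n_tokens) came from the
# extra bucket. clamp_ clips in place to the valid range [0, n_tokens - 1].
token_ids = torch.tensor([0, 5, 9, 10])
token_ids.clamp_(0, n_tokens - 1)

assert token_ids.tolist() == [0, 5, 9, 9]
```

This keeps the bucket-to-value mapping identical to how the model was trained, at the cost of merging the extra (overflow) bucket into the topmost valid one.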

@alvaropp
Contributor Author

alvaropp commented Oct 1, 2024

Hi, I think you're right, didn't think of that problem 😄

I guess the clipping approach you propose would flatten big spikes by effectively putting normalised values around +15 and values above 1e20 in the same bucket (using the tokenizer's default parameters), but I reckon this should only affect time series samples with extreme values, and be quite minor overall.

Happy to give this approach a go.

@alvaropp
Contributor Author

alvaropp commented Oct 1, 2024

@lostella ready for review now

Contributor

@lostella lostella left a comment


@alvaropp just suggesting to remove the comments, especially since they cross-reference each other. I think the explanation in the tests is sufficient.

alvaropp and others added 2 commits October 2, 2024 15:44
Co-authored-by: Lorenzo Stella <lorenzostella@gmail.com>
Co-authored-by: Lorenzo Stella <lorenzostella@gmail.com>
@alvaropp
Contributor Author

alvaropp commented Oct 4, 2024

@lostella sure, removed.

@lostella lostella added the bugfix Contains a bug fix label Oct 4, 2024
Contributor

@lostella lostella left a comment


Thanks @alvaropp!

@lostella lostella merged commit ac6ee36 into amazon-science:main Oct 4, 2024
2 checks passed


Development

Successfully merging this pull request may close these issues.

[BUG] Incorrect number of quantisation buckets
