Fix number of quantisation buckets #182
Conversation
@alvaropp thanks for spotting this! I think that changing the values associated with tokens may not be the right way to go, since this would make the code diverge from how the model was originally trained. It's true that the additional bucket results in an invalid token ID (the tokenizer might have been simpler in these regards), but I think maybe the best fix is to clip the token IDs. Let me know what you think!
Hi, I think you're right, I hadn't thought of that problem 😄 I guess the clipping approach you propose would flatten big spikes by effectively putting normalised values around +15 and values above 1e20 into the same bucket (using the tokenizer's default parameters), but I reckon this should only happen for time series with extreme values, so the impact should be quite minor overall. Happy to give this approach a go.
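For reference, a minimal sketch of the clipping idea being discussed. The function name, arguments, and comments below are illustrative rather than the actual Chronos code; they only show how clamping keeps extreme values inside the last valid numerical bucket:

```python
import torch

def bucketize_and_clip(scaled, boundaries, n_special_tokens, n_tokens):
    # torch.bucketize maps each value to an index in [0, len(boundaries)],
    # i.e. one more index than there are buckets between consecutive boundaries
    token_ids = torch.bucketize(scaled, boundaries) + n_special_tokens
    # clamping pushes any out-of-range ID back into the last valid bucket,
    # so a value near +15 and one above 1e20 end up with the same token
    return token_ids.clamp_(0, n_tokens - 1)
```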
@lostella ready for review now
lostella
left a comment
@alvaropp just suggesting to remove the comments, especially since they cross-reference each other. I think the explanation in the tests is sufficient.
Co-authored-by: Lorenzo Stella <lorenzostella@gmail.com>
@lostella sure, removed.
lostella
left a comment
Thanks @alvaropp!
Fixes #181.
Chronos' tokenizer has a vocabulary size of `n_tokens`. Among these, there are `n_special_tokens` reserved for EOS, PAD, etc., and `n_tokens - n_special_tokens` allocated to numerical values. However, the provided `MeanScaleUniformBins` tokenizer creates `n_tokens - n_special_tokens + 1` different buckets, resulting in a total of `n_tokens + 1` possible tokens. This causes training and inference errors when one of the data points gets allocated to the largest bucket, as the model requires `0 <= token_id < n_tokens`.
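To make the off-by-one concrete, here is a toy example with made-up sizes (not the actual Chronos defaults or boundary construction): with `n_tokens - n_special_tokens` boundary points, `torch.bucketize` produces `n_tokens - n_special_tokens + 1` possible buckets, and an extreme value then maps to an ID equal to `n_tokens`:

```python
import torch

n_tokens, n_special_tokens = 10, 2   # hypothetical sizes: 8 IDs meant for numerical buckets

# 8 boundary points -> torch.bucketize returns indices 0..8, i.e. 9 buckets
boundaries = torch.linspace(-15.0, 15.0, n_tokens - n_special_tokens)

x = torch.tensor([-1e9, 0.0, 1e9])
token_ids = torch.bucketize(x, boundaries) + n_special_tokens
print(token_ids.max().item())        # 10 == n_tokens, violating token_id < n_tokens
```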
This PR modifies the `MeanScaleUniformBins` tokenizer so that it creates one less bucket for numerical values.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.