Fine-tune tokenizers #80

Open
@rth

Tokenization results can be unsatisfactory in some way, and the question is what the mechanism to customize or improve them should be. It could be either,
a) adding options to the tokenizer that enable these improvements. The issue is that some of these options might be relevant to multiple tokenizers.
b) adding a new step later in the pipeline. That's probably the best way to allow arbitrary customization. The issue is that some steps might be specific to the previous step, and adding them to the library might be confusing.

There is probably a balance that needs to be found between the two.
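
To make option (b) concrete, here is a minimal sketch of a post-processing step chained after a tokenizer. Everything in it is an assumption for illustration: `make_pipeline` and `lowercase` are hypothetical names that do not exist in the library, and `str.split` stands in for a real tokenizer.

```python
from typing import Callable, List

# A post-processing step maps a list of tokens to a new list of tokens.
TokenStep = Callable[[List[str]], List[str]]


def make_pipeline(tokenize: Callable[[str], List[str]], *steps: TokenStep):
    """Chain a tokenizer with any number of post-processing steps (hypothetical helper)."""

    def run(text: str) -> List[str]:
        tokens = tokenize(text)
        for step in steps:
            tokens = step(tokens)
        return tokens

    return run


def lowercase(tokens: List[str]) -> List[str]:
    """Example step that is independent of the tokenizer used before it."""
    return [token.lower() for token in tokens]


# str.split is only a stand-in for a real tokenizer.
pipeline = make_pipeline(str.split, lowercase)
print(pipeline("Hello World"))
```

A step like `lowercase` works after any tokenizer, which is the argument for (b). A step that only makes sense after one specific tokenizer is where the confusion mentioned above would come from.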

For instance,

  1. PunctuationTokenizer,
    • currently doesn't take into account repeated punctuation
      >>> PunctuationTokenizer().tokenize("test!!!")
      ['test!', '!', '!']
    • will tokenize abbreviations separated by . as separate sentences
      >>> PunctuationTokenizer().tokenize("W.T.O.")
      ['W.', 'T.', 'O.']
    Both could probably be addressed by adding an option that forces sentences to be longer than some minimal length (and otherwise appends them to the previous token).
  2. UnicodeSentenceTokenizer,
    will not split sentences separated by punctuation without a following space, e.g.,
    >>> UnicodeSentenceTokenizer().tokenize('One sentence.Another sentence.')
    ['One sentence.Another sentence.']
    That's a very common occurrence in actual text, and I think a workaround should be found (e.g. using an additional tokenization pass with a regex/punctuation tokenizer).
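
The two workarounds above could look roughly like this as post-processing functions. This is a sketch under assumptions, not the library's API: `merge_short_tokens` and `split_unspaced_sentences` are hypothetical names, the inputs are the tokenizer outputs quoted above, and the regex is an ASCII-only heuristic (lowercase letter, then `.`, `!` or `?`, then an uppercase letter). It also assumes the tokens concatenate back to the original text, so merging needs no separator.

```python
import re
from typing import List


def merge_short_tokens(tokens: List[str], min_length: int) -> List[str]:
    """Append every token shorter than min_length to the previous token."""
    merged: List[str] = []
    for token in tokens:
        if merged and len(token) < min_length:
            merged[-1] += token  # assumes tokens keep their original whitespace
        else:
            merged.append(token)
    return merged


# Split point: after "<lowercase letter><.!?>" when an uppercase letter follows.
# Requiring a lowercase letter before the punctuation leaves "W.T.O." intact.
_UNSPACED_BOUNDARY = re.compile(r"(?<=[a-z][.!?])(?=[A-Z])")


def split_unspaced_sentences(sentences: List[str]) -> List[str]:
    """Second pass that splits sentences joined by punctuation without a space."""
    result: List[str] = []
    for sentence in sentences:
        result.extend(_UNSPACED_BOUNDARY.split(sentence))
    return result


print(merge_short_tokens(["test!", "!", "!"], min_length=2))   # ['test!!!']
print(merge_short_tokens(["W.", "T.", "O."], min_length=3))    # ['W.T.O.']
print(split_unspaced_sentences(["One sentence.Another sentence."]))
# ['One sentence.', 'Another sentence.']
```

Note that a short first token has nothing to be appended to, so it is kept and later short tokens are merged into it. The regex pass would still miss cases such as a sentence ending in a digit or a non-ASCII letter.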

Generally it would be good to add some evaluation benchmarks for sentence tokenization to the evaluation/ folder.
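
One possible shape for such a benchmark is to score predicted sentence boundaries against a reference segmentation. The sketch below is an assumption about how this could be done, not an existing script in evaluation/: `boundary_offsets` and `boundary_scores` are hypothetical names, and both segmentations are assumed to concatenate back to the same text.

```python
from itertools import accumulate
from typing import List, Set, Tuple


def boundary_offsets(sentences: List[str]) -> Set[int]:
    """Character offsets where one sentence ends and the next begins."""
    ends = list(accumulate(len(sentence) for sentence in sentences))
    # The final offset is the end of the text, which every segmentation shares.
    return set(ends[:-1])


def boundary_scores(predicted: List[str], expected: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of predicted boundaries against expected ones."""
    pred = boundary_offsets(predicted)
    gold = boundary_offsets(expected)
    true_positives = len(pred & gold)
    # Convention chosen here: an empty set scores 1.0 (nothing to get wrong).
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


expected = ["One sentence.", "Another sentence."]
predicted = ["One sentence.Another sentence."]
print(boundary_scores(predicted, expected))  # (1.0, 0.0, 0.0)
```

With the UnicodeSentenceTokenizer output quoted above, the missed boundary shows up as zero recall, which is the kind of regression a benchmark would make visible.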

  1. UnicodeTokenizer is currently extended in VTextTokenizer (for lack of a better name), with a few additional rules. Maybe this could have been a separate token-processing step, particularly if one imagines that more rules could be added (or potentially even an ML model used).
