Fine-tune tokenizers

Tokenization results can be unsatisfactory in some way, and the question is what the mechanism to customize/improve them should be. It could be done by either,
 a) adding options to the tokenizer that enable these improvements. The issue is that some of these options might be relevant to multiple tokenizers.
 b) adding a new step later in the pipeline. That's probably the best way to allow arbitrary customization. The issue is that some steps might be specific to the previous step, and adding them to the library might be confusing.

There is probably a balance that needs to be found between the two.

For instance,
1. `PunctuationTokenizer`, 
   - currently doesn't take into account repeated punctuation
     ```py
     >>> PunctuationTokenizer().tokenize("test!!!")
     ['test!', '!', '!']
     ```
   - will tokenize abbreviations separated by `.` as separate sentences
     ```py
     >>> PunctuationTokenizer().tokenize("W.T.O.")
     ['W.', 'T.', 'O.']
     ```
   both could probably be addressed by adding an option to force sentences to be longer than some minimal length (and otherwise append them to the previous token). 
2. `UnicodeSentenceTokenizer`,
    will not split sentences separated by punctuation without a following space, e.g.,
    ```py
    >>> UnicodeSentenceTokenizer().tokenize('One sentence.Another sentence.')
    ['One sentence.Another sentence.']
    ```
    That's a very common occurrence in actual text, and I think a workaround should be found (e.g. using an additional tokenization pass with a regex/punctuation tokenizer).
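
The two workarounds suggested above (a minimal token length, and an additional regex pass) could be sketched as post-processing steps on a list of strings. This is only an illustration: `merge_short_tokens`, `split_unspaced_sentences` and the `min_length` parameter are hypothetical names, not part of the library, and the functions operate on the outputs shown above rather than calling the tokenizers.

```python
import re
from typing import List


def merge_short_tokens(tokens: List[str], min_length: int = 3) -> List[str]:
    """Append tokens shorter than `min_length` to the previous token."""
    merged: List[str] = []
    for token in tokens:
        if merged and len(token) < min_length:
            # Too short to stand alone: glue it onto the previous token.
            merged[-1] += token
        else:
            merged.append(token)
    return merged


# Zero-width position between sentence-ending punctuation and an
# uppercase ASCII letter (an assumption about what starts a sentence).
_UNSPACED_BOUNDARY = re.compile(r"(?<=[.!?])(?=[A-Z])")


def split_unspaced_sentences(sentences: List[str]) -> List[str]:
    """Second pass: split sentences joined by punctuation without a space."""
    out: List[str] = []
    for sentence in sentences:
        # Splitting on an empty match requires Python 3.7+.
        out.extend(part for part in _UNSPACED_BOUNDARY.split(sentence) if part)
    return out


print(merge_short_tokens(["test!", "!", "!"]))
print(merge_short_tokens(["W.", "T.", "O."]))
print(split_unspaced_sentences(["One sentence.Another sentence."]))
```

Both heuristics have obvious failure modes: the length threshold would also merge a legitimately short sentence such as "No." into the previous one, and the regex pass would split "W.T.O." again, so the order in which such steps run matters.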

Generally it would be good to add some evaluation benchmarks for sentence tokenization to the `evaluation/` folder.
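
As a rough idea of what such a benchmark could measure, here is a sketch that scores predicted sentence boundaries against a reference segmentation. The function names are hypothetical, and it assumes that both segmentations concatenate back to the same original text (i.e. whitespace is kept inside the sentences), which may not hold for every tokenizer.

```python
from typing import List, Set, Tuple


def boundary_offsets(sentences: List[str]) -> Set[int]:
    """Character offsets at which each sentence ends."""
    offsets: Set[int] = set()
    position = 0
    for sentence in sentences:
        position += len(sentence)
        offsets.add(position)
    return offsets


def boundary_scores(predicted: List[str], gold: List[str]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted sentence boundaries."""
    pred = boundary_offsets(predicted)
    ref = boundary_offsets(gold)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# The unspaced-punctuation case from above: one boundary is missed.
precision, recall = boundary_scores(
    ["One sentence.Another sentence."],
    ["One sentence.", "Another sentence."],
)
print(precision, recall)
```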

3. `UnicodeTokenizer` is currently extended in `VTextTokenizer` (for lack of a better name), with a few additional rules. Maybe this could have been a separate token-processing step, particularly if one imagines that more rules could be added (or potentially even an ML model used).
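
To make the "separate token-processing step" idea concrete, here is a minimal sketch where each step is a plain function from a token list to a token list, applied after the base tokenizer. The `TokenStep` alias, `apply_steps`, and the `merge_hyphenated` rule are made up for illustration; they are not the rules `VTextTokenizer` actually implements.

```python
from typing import Callable, Iterable, List

# A processing step takes a list of tokens and returns a new list.
TokenStep = Callable[[List[str]], List[str]]


def merge_hyphenated(tokens: List[str]) -> List[str]:
    """Hypothetical rule: rejoin ["well", "-", "known"] into "well-known"."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if (
            tokens[i] == "-"
            and out
            and i + 1 < len(tokens)
            and out[-1].isalnum()
            and tokens[i + 1].isalnum()
        ):
            # Fold the hyphen and the following word into the previous token.
            out[-1] = out[-1] + "-" + tokens[i + 1]
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def apply_steps(tokens: List[str], steps: Iterable[TokenStep]) -> List[str]:
    """Run each processing step in order on the output of the previous one."""
    for step in steps:
        tokens = step(tokens)
    return tokens


print(apply_steps(["a", "well", "-", "known", "rule"], [merge_hyphenated]))
```

With this shape, a rule-based step and a model-based step would share the same interface, at the cost of each step re-scanning the token list.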

Fine-tune tokenizers #80
