Better unicode support in tokenization rules

Currently, the `VTextTokenizer` first computes Unicode segmentation (which should handle Unicode well by definition), then applies a few simple rules on top to produce a tokenization that is more standard in NLP (and possibly language-dependent).

These rules might need to be generalized a bit to handle Unicode better. For instance, we currently merge tokens linked by `-`, but only for the ASCII hyphen (U+002D), not for other Unicode hyphen variants.
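
As a rough illustration of what a generalized rule could look like, here is a minimal Python sketch. It is not the `VTextTokenizer` implementation: the `HYPHENS` set and the `merge_hyphenated` helper are hypothetical names, and the choice of which code points count as word-joining hyphens (here U+002D, U+2010, U+2011) is an assumption that would need discussion. It also assumes that the segmentation step emits the hyphen as its own token.

```python
# Hypothetical set of characters treated as word-joining hyphens.
# Which code points belong here is an assumption, not a settled rule.
HYPHENS = frozenset({
    "\u002D",  # HYPHEN-MINUS (the ASCII hyphen)
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
})


def merge_hyphenated(tokens):
    """Merge `word, hyphen, word` runs into a single token.

    `tokens` is assumed to be the list of strings produced by the
    Unicode segmentation step, with each hyphen as a separate token.
    """
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if (
            tok in HYPHENS
            and merged
            and i + 1 < len(tokens)
            # Only merge when the hyphen sits between alphanumeric characters.
            and merged[-1][-1:].isalnum()
            and tokens[i + 1][:1].isalnum()
        ):
            merged[-1] = merged[-1] + tok + tokens[i + 1]
            i += 2
        else:
            merged.append(tok)
            i += 1
    return merged
```

With this sketch, `["well", "\u2010", "known"]` is merged the same way as `["well", "-", "known"]`, while dashes outside the set (for example U+2014 EM DASH) are left as separate tokens. An explicit set is used rather than the whole Unicode `Pd` (dash punctuation) category, because that category also contains en and em dashes, which usually separate words rather than join them.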

Better unicode support in tokenization rules #31

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Better unicode support in tokenization rules #31

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions