Better unicode support in tokenization rules #31

@rth

Currently, the VTextTokenizer first computes Unicode segmentation (which should handle Unicode well by definition), then applies a few simple rules on top to produce a tokenization that is more standard in NLP (and possibly language-dependent).

These rules might need to be generalized a bit to handle Unicode better. For instance, we currently merge tokens linked by a hyphen, but only the ASCII hyphen-minus (-, U+002D), not other Unicode hyphen variants such as U+2010.
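
To make the hyphen case concrete, here is a rough Python sketch of what a generalized merge rule could look like. It is not the VTextTokenizer implementation: the function names `is_hyphen` and `merge_hyphen_linked` and the `HYPHENS` set are hypothetical, and which code points should count as hyphens is an assumption that would need discussion (for example, the em dash is deliberately left out here).

```python
# Hypothetical set of hyphen-like code points; the exact membership is an assumption.
HYPHENS = {
    "\u002D",  # HYPHEN-MINUS (the ASCII one)
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
}


def is_hyphen(token):
    """Return True if the token is a single hyphen-like character."""
    return token in HYPHENS


def _is_wordlike(token):
    """Return True if the token starts and ends with an alphanumeric character."""
    return bool(token) and token[0].isalnum() and token[-1].isalnum()


def merge_hyphen_linked(tokens):
    """Merge `word, hyphen, word` runs from a segmenter into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        can_merge = (
            is_hyphen(tokens[i])
            and merged
            and i + 1 < len(tokens)
            and _is_wordlike(merged[-1])
            and _is_wordlike(tokens[i + 1])
        )
        if can_merge:
            # Glue the previous token, the hyphen, and the next token together.
            merged[-1] = merged[-1] + tokens[i] + tokens[i + 1]
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


ascii_case = merge_hyphen_linked(["well", "-", "known"])
unicode_case = merge_hyphen_linked(["well", "\u2010", "known"])
```

With this rule both the ASCII and the U+2010 inputs collapse to a single token, while a leading or trailing hyphen is left as its own token. An alternative would be to match on the Unicode general category `Pd` (dash punctuation), at the cost of also merging across en and em dashes.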
