
Proposal: blingfire and/or pysbd for better multilingual sentence tokenization #1811

@ChenghaoMou


Inspired by #1697

Currently, the basic sentence tokenizer handles normal English text well, but it does not work well with other languages (e.g. Chinese) or with rarer cases like repeated punctuation (e.g. ??, which is actually recommended by Cartesia best practices).
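To make the two failure modes concrete, here is a minimal standard-library sketch. It is not the library's basic tokenizer and not what blingfire or pysbd do internally. It only illustrates the idea of treating a run of terminators as a single boundary and recognizing full-width CJK terminators. The `Chunk` dataclass and `split_on_terminator_runs` are hypothetical names for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    """Illustrative stand-in for a sentence span (hypothetical, not the library's class)."""
    text: str
    start: int
    end: int


# A run of ASCII terminators counts as one boundary, but only when followed by
# whitespace or end of text (so "3.14" is not split). CJK full-width terminators
# need no following whitespace.
_BOUNDARY = re.compile(r"[.!?]+(?=\s|$)|[。！？]+")


def _skip_whitespace(text: str, pos: int, limit: int) -> int:
    """Advance pos past whitespace, never beyond limit."""
    while pos < limit and text[pos].isspace():
        pos += 1
    return pos


def split_on_terminator_runs(text: str) -> list[Chunk]:
    chunks: list[Chunk] = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        end = match.end()
        start = _skip_whitespace(text, start, end)
        if start < end:
            chunks.append(Chunk(text[start:end], start, end))
        start = end
    # Keep any trailing text that has no terminator.
    start = _skip_whitespace(text, start, len(text))
    if start < len(text):
        chunks.append(Chunk(text[start:], start, len(text)))
    return chunks


if __name__ == "__main__":
    for chunk in split_on_terminator_runs("This is a third test?? How about this? I don't know!!!!"):
        print(chunk)
```

A rule this small will still mis-split abbreviations such as "Dr. Smith", which is exactly the long tail that dedicated tokenizers are built to handle.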

Using the two quick benchmarks provided in PySBD, I compared the basic sentence tokenizer against several other tokenizers for reference:

| Library | Benchmark | Performance | Time |
| --- | --- | --- | --- |
| blingfire | Golden Rule Set (↑) | 75.00% | 0.19 ms |
| nltk | Golden Rule Set (↑) | 56.25% | 0.72 ms |
| pysbd | Golden Rule Set (↑) | 97.92% | 7.18 ms |
| spacy | Golden Rule Set (↑) | 52.08% | 1.24 ms |
| spacy (dep) | Golden Rule Set (↑) | 60.42% | 105.21 ms |
| stanza | Golden Rule Set (↑) | 75.00% | 170.21 ms |
| syntok | Golden Rule Set (↑) | 70.83% | 1.85 ms |
| basic* | Golden Rule Set (↑) | 33.33% | 0.53 ms |
| sat-1l-sm | Golden Rule Set (↑) | 25.00% | 1479.20 ms |
| blingfire | Big Text (↓) | - | 21.69 ms |
| nltk | Big Text (↓) | - | 48.77 ms |
| pysbd | Big Text (↓) | - | 5140.17 ms |
| spacy | Big Text (↓) | - | 896.53 ms |
| spacy (dep) | Big Text (↓) | - | 14152.61 ms |
| stanza | Big Text (↓) | - | 7637.68 ms |
| syntok | Big Text (↓) | - | 337.19 ms |
| basic* | Big Text (↓) | - | 51.95 ms |
| sat-1l-sm | Big Text (↓) | - | 25171.41 ms |

*: measured with the retain_format=False, min_sentence_len=1 settings, which gave the higher score.
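For anyone who wants to reproduce numbers of this shape, here is a rough sketch of how such a score and timing could be computed. It assumes the score is the share of test cases whose output matches the expected segmentation exactly. That is my reading of the benchmark, not a copy of the PySBD script. The three cases and the `naive_split` function are hypothetical examples and are not taken from the Golden Rule Set data.

```python
import re
import time
from typing import Callable, List, Sequence, Tuple

Case = Tuple[str, List[str]]  # (input text, expected sentences)


def rule_accuracy(split: Callable[[str], List[str]], cases: Sequence[Case]) -> float:
    """Percentage of cases where the splitter output equals the expected list."""
    passed = sum(
        1 for text, expected in cases
        if [s.strip() for s in split(text)] == expected
    )
    return 100.0 * passed / len(cases)


def mean_latency_ms(split: Callable[[str], List[str]], text: str, repeats: int = 5) -> float:
    """Average wall-clock time per call, in milliseconds."""
    started = time.perf_counter()
    for _ in range(repeats):
        split(text)
    return (time.perf_counter() - started) * 1000.0 / repeats


def naive_split(text: str) -> List[str]:
    """Hypothetical baseline: split on whitespace that follows ., ! or ?"""
    return re.split(r"(?<=[.!?])\s+", text)


# Hypothetical illustrative cases, not the real benchmark data.
CASES: List[Case] = [
    ("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."]),
    ("I don't know!!!! Really.", ["I don't know!!!!", "Really."]),
    ("Dr. Smith arrived.", ["Dr. Smith arrived."]),
]

if __name__ == "__main__":
    print(f"accuracy: {rule_accuracy(naive_split, CASES):.2f}%")
    print(f"latency:  {mean_latency_ms(naive_split, 'A test. ' * 1000):.2f} ms")
```

Exact-match scoring is strict: one wrong boundary fails the whole case, so small rule differences can move the percentage noticeably.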

Quick illustration:

Method: basic
Doc: Hello, world! This is a test. This is another test. This is a third test?? How about this? I don't know!!!! This is a sentence with something...
Sentences: [Chunk(text='Hello, world!', start=0, end=13), Chunk(text=' This is a test.', start=13, end=29), Chunk(text=' This is another test.', start=29, end=51), Chunk(text=' This is a third test?', start=51, end=73), Chunk(text='? How about this?', start=73, end=90), Chunk(text=" I don't know!", start=90, end=104), Chunk(text='!!', start=104, end=106), Chunk(text='! This is a sentence with something...', start=106, end=144)]
Doc: 这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。
Sentences: [Chunk(text='这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。', start=0, end=32)]
Time: 0.69ms

Method: blingfire
Doc: Hello, world! This is a test. This is another test. This is a third test?? How about this? I don't know!!!! This is a sentence with something...
Sentences: [Chunk(text='Hello, world!', start=0, end=13), Chunk(text='This is a test.', start=14, end=29), Chunk(text='This is another test.', start=30, end=51), Chunk(text='This is a third test??', start=52, end=74), Chunk(text='How about this?', start=75, end=90), Chunk(text="I don't know!!!!", start=91, end=107), Chunk(text='This is a sentence with something...', start=108, end=144)]
Doc: 这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。
Sentences: [Chunk(text='这是一句比较短的中文。', start=0, end=11), Chunk(text='这是另一句, 但是比较长的, 复杂的中文。', start=11, end=32)]
Time: 1.10ms

Method: pysbd
Doc: Hello, world! This is a test. This is another test. This is a third test?? How about this? I don't know!!!! This is a sentence with something...
Sentences: [Chunk(text='Hello, world! ', start=0, end=14), Chunk(text='This is a test. ', start=14, end=30), Chunk(text='This is another test. ', start=30, end=52), Chunk(text='This is a third test?? ', start=52, end=75), Chunk(text='How about this? ', start=75, end=91), Chunk(text="I don't know!!!! This is a sentence with something...", start=91, end=144)]
Doc: 这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。
Sentences: [Chunk(text='这是一句比较短的中文。', start=0, end=11), Chunk(text='这是另一句, 但是比较长的, 复杂的中文。', start=11, end=32)]
Time: 4.68ms

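Blingfire is fast enough to replace the basic tokenizer outright while improving quality (75.00% vs. 33.33% on the Golden Rule Set, 21.69 ms vs. 51.95 ms on Big Text), whereas PySBD mainly offers a quality boost (97.92%) at the cost of latency (5140.17 ms on Big Text).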

Do you think an option to use blingfire and/or pysbd for better multilingual and long-tail support would merit adding those dependencies? If so, I'm happy to work on a PR for this.
