Description
Inspired by #1697
Currently, the basic sentence tokenizer handles normal English text well, but it does not work well with other languages (e.g. Chinese) or rarer cases like repeated punctuation (e.g. ??, which is actually recommended by Cartesia best practices).
Using the two quick benchmarks provided in PySBD, I tested the basic sentence tokenizer against several other tokenizers for reference:
lib | benchmark | performance | time |
---|---|---|---|
blingfire | Golden Rule Set (↑) | 75.00% | 0.19 ms |
nltk | Golden Rule Set (↑) | 56.25% | 0.72 ms |
pysbd | Golden Rule Set (↑) | 97.92% | 7.18 ms |
spacy | Golden Rule Set (↑) | 52.08% | 1.24 ms |
spacy (dep) | Golden Rule Set (↑) | 60.42% | 105.21 ms |
stanza | Golden Rule Set (↑) | 75.00% | 170.21 ms |
syntok | Golden Rule Set (↑) | 70.83% | 1.85 ms |
basic* | Golden Rule Set (↑) | 33.33% | 0.53 ms |
sat-1l-sm | Golden Rule Set (↑) | 25.00% | 1479.20 ms |
blingfire | Big Text (↓) | - | 21.69 ms |
nltk | Big Text (↓) | - | 48.77 ms |
pysbd | Big Text (↓) | - | 5140.17 ms |
spacy | Big Text (↓) | - | 896.53 ms |
spacy (dep) | Big Text (↓) | - | 14152.61 ms |
stanza | Big Text (↓) | - | 7637.68 ms |
syntok | Big Text (↓) | - | 337.19 ms |
basic* | Big Text (↓) | - | 51.95 ms |
sat-1l-sm | Big Text (↓) | - | 25171.41 ms |
*: with retain_format=False, min_sentence_len=1 settings for the higher score.
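Roughly, the comparison amounts to wrapping each library behind a common segment(text) -> list[str] interface, scoring it against the Golden Rule Set cases, and timing a single pass over the big text. A minimal sketch of that setup (the wrapper functions and GRS_CASES below are illustrative stand-ins, not the actual PySBD benchmark script):

```python
import time
from typing import Callable

# Illustrative stand-in for the PySBD Golden Rule Set: (input text, expected sentences).
# The real benchmark data ships with the PySBD repository.
GRS_CASES: list[tuple[str, list[str]]] = [
    ("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."]),
    ("What is your name? My name is Jonas.", ["What is your name?", "My name is Jonas."]),
]


def grs_accuracy(segment: Callable[[str], list[str]]) -> float:
    """Fraction of GRS cases whose segmentation matches the expected split exactly."""
    hits = sum(
        1
        for text, expected in GRS_CASES
        if [s.strip() for s in segment(text)] == expected
    )
    return hits / len(GRS_CASES)


def big_text_ms(segment: Callable[[str], list[str]], text: str) -> float:
    """Wall-clock milliseconds for a single segmentation pass over a large document."""
    start = time.perf_counter()
    segment(text)
    return (time.perf_counter() - start) * 1000


# Wrappers for two of the benchmarked libraries.
def blingfire_segment(text: str) -> list[str]:
    from blingfire import text_to_sentences

    return text_to_sentences(text).split("\n")


def pysbd_segment(text: str) -> list[str]:
    import pysbd

    return pysbd.Segmenter(language="en", clean=False).segment(text)


if __name__ == "__main__":
    big_text = "Hello World. My name is Jonas. " * 10_000
    for name, seg in [("blingfire", blingfire_segment), ("pysbd", pysbd_segment)]:
        print(f"{name}: GRS {grs_accuracy(seg):.2%}, big text {big_text_ms(seg, big_text):.2f} ms")
```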
Quick illustration:
Method: basic
Doc: Hello, world! This is a test. This is another test. This is a third test?? How about this? I don't know!!!! This is a sentence with something...
Sentences: [Chunk(text='Hello, world!', start=0, end=13), Chunk(text=' This is a test.', start=13, end=29), Chunk(text=' This is another test.', start=29, end=51), Chunk(text=' This is a third test?', start=51, end=73), Chunk(text='? How about this?', start=73, end=90), Chunk(text=" I don't know!", start=90, end=104), Chunk(text='!!', start=104, end=106), Chunk(text='! This is a sentence with something...', start=106, end=144)]
Doc: 这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。
Sentences: [Chunk(text='这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。', start=0, end=32)]
Time: 0.69ms
Method: blingfire
Doc: Hello, world! This is a test. This is another test. This is a third test?? How about this? I don't know!!!! This is a sentence with something...
Sentences: [Chunk(text='Hello, world!', start=0, end=13), Chunk(text='This is a test.', start=14, end=29), Chunk(text='This is another test.', start=30, end=51), Chunk(text='This is a third test??', start=52, end=74), Chunk(text='How about this?', start=75, end=90), Chunk(text="I don't know!!!!", start=91, end=107), Chunk(text='This is a sentence with something...', start=108, end=144)]
Doc: 这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。
Sentences: [Chunk(text='这是一句比较短的中文。', start=0, end=11), Chunk(text='这是另一句, 但是比较长的, 复杂的中文。', start=11, end=32)]
Time: 1.10ms
Method: pysbd
Doc: Hello, world! This is a test. This is another test. This is a third test?? How about this? I don't know!!!! This is a sentence with something...
Sentences: [Chunk(text='Hello, world! ', start=0, end=14), Chunk(text='This is a test. ', start=14, end=30), Chunk(text='This is another test. ', start=30, end=52), Chunk(text='This is a third test?? ', start=52, end=75), Chunk(text='How about this? ', start=75, end=91), Chunk(text="I don't know!!!! This is a sentence with something...", start=91, end=144)]
Doc: 这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。
Sentences: [Chunk(text='这是一句比较短的中文。', start=0, end=11), Chunk(text='这是另一句, 但是比较长的, 复杂的中文。', start=11, end=32)]
Time: 4.68ms
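For context on how the spans above can be produced: blingfire returns plain sentence strings, so start/end offsets have to be recovered by walking the original document, while pysbd can return character spans directly. A minimal sketch (the Chunk dataclass here mirrors the output format above and is an assumption, not part of either library):

```python
from dataclasses import dataclass

import pysbd
from blingfire import text_to_sentences


@dataclass
class Chunk:
    # Assumed container mirroring the illustration output; not part of either library.
    text: str
    start: int
    end: int


def blingfire_chunks(doc: str) -> list[Chunk]:
    """blingfire returns newline-joined sentences; recover offsets by scanning the doc."""
    chunks, cursor = [], 0
    for sent in text_to_sentences(doc).split("\n"):
        # Assumes blingfire returns each sentence verbatim, so it can be located in doc.
        start = doc.index(sent, cursor)
        end = start + len(sent)
        chunks.append(Chunk(sent, start, end))
        cursor = end
    return chunks


def pysbd_chunks(doc: str, language: str = "en") -> list[Chunk]:
    """pysbd can return character spans directly via char_span=True."""
    seg = pysbd.Segmenter(language=language, clean=False, char_span=True)
    return [Chunk(span.sent, span.start, span.end) for span in seg.segment(doc)]


doc = "Hello, world! This is a third test?? How about this?"
print(blingfire_chunks(doc))
print(pysbd_chunks(doc))
```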
Blingfire is fast enough to be a drop-in replacement while still improving quality, whereas PySBD mostly brings a quality boost at the price of latency.
Do you think an option to use blingfire and/or pysbd for better multilingual/long-tail support merits the addition of those dependencies? If so, I'm happy to work on a PR for this.
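If there is interest, one possible shape (purely hypothetical; the function name, backend argument, and fallback below are illustrative, not the project's actual API) is an opt-in backend selector that keeps the basic tokenizer as the default, so neither library becomes a hard dependency:

```python
from typing import Callable


def make_sentence_splitter(backend: str = "basic") -> Callable[[str], list[str]]:
    # Hypothetical opt-in selector: "basic" stays the default, so blingfire/pysbd
    # are only imported (and required) when explicitly requested.
    if backend == "blingfire":
        try:
            from blingfire import text_to_sentences
        except ImportError as exc:
            raise ImportError("backend='blingfire' requires the blingfire package") from exc
        return lambda text: text_to_sentences(text).split("\n")

    if backend == "pysbd":
        try:
            import pysbd
        except ImportError as exc:
            raise ImportError("backend='pysbd' requires the pysbd package") from exc
        segmenter = pysbd.Segmenter(language="en", clean=False)
        return segmenter.segment

    # Placeholder for the existing basic tokenizer.
    return lambda text: [text]
```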