Description
Inspired by #1697
Currently, the basic sentence tokenizer handles normal English text well, but it does not work well with other languages (e.g. Chinese) or rarer cases like repeated punctuation (e.g. ??, which is actually recommended by Cartesia best practices).
Using the two quick benchmarks provided in PySBD, I tested the basic sentence tokenizer against several other tokenizers for reference:
lib | benchmark | performance | time |
---|---|---|---|
blingfire | Golden Rule Set (↑) | 75.00% | 0.19 ms |
nltk | Golden Rule Set (↑) | 56.25% | 0.72 ms |
pysbd | Golden Rule Set (↑) | 97.92% | 7.18 ms |
spacy | Golden Rule Set (↑) | 52.08% | 1.24 ms |
spacy (dep) | Golden Rule Set (↑) | 60.42% | 105.21 ms |
stanza | Golden Rule Set (↑) | 75.00% | 170.21 ms |
syntok | Golden Rule Set (↑) | 70.83% | 1.85 ms |
basic* | Golden Rule Set (↑) | 33.33% | 0.53 ms |
sat-1l-sm | Golden Rule Set (↑) | 25.00% | 1479.20 ms |
blingfire | Big Text (↓) | - | 21.69 ms |
nltk | Big Text (↓) | - | 48.77 ms |
pysbd | Big Text (↓) | - | 5140.17 ms |
spacy | Big Text (↓) | - | 896.53 ms |
spacy (dep) | Big Text (↓) | - | 14152.61 ms |
stanza | Big Text (↓) | - | 7637.68 ms |
syntok | Big Text (↓) | - | 337.19 ms |
basic* | Big Text (↓) | - | 51.95 ms |
sat-1l-sm | Big Text (↓) | - | 25171.41 ms |
*: with retain_format=False, min_sentence_len=1 settings for the higher score.
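Roughly, the comparison amounts to wrapping each library behind a common segment(text) -> list[str] interface, scoring it against the Golden Rule Set cases, and timing a single pass over the big text. A minimal sketch of that setup (the wrapper functions and GRS_CASES below are illustrative stand-ins, not the actual PySBD benchmark script):

```python
import time
from typing import Callable

# Illustrative stand-in for the PySBD Golden Rule Set: (input text, expected sentences).
# The real benchmark data ships with the PySBD repository.
GRS_CASES: list[tuple[str, list[str]]] = [
    ("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."]),
    ("What is your name? My name is Jonas.", ["What is your name?", "My name is Jonas."]),
]


def grs_accuracy(segment: Callable[[str], list[str]]) -> float:
    """Fraction of GRS cases whose segmentation matches the expected split exactly."""
    hits = sum(
        1
        for text, expected in GRS_CASES
        if [s.strip() for s in segment(text)] == expected
    )
    return hits / len(GRS_CASES)


def big_text_ms(segment: Callable[[str], list[str]], text: str) -> float:
    """Wall-clock milliseconds for a single segmentation pass over a large document."""
    start = time.perf_counter()
    segment(text)
    return (time.perf_counter() - start) * 1000


# Wrappers for two of the benchmarked libraries.
def blingfire_segment(text: str) -> list[str]:
    from blingfire import text_to_sentences

    return text_to_sentences(text).split("\n")


def pysbd_segment(text: str) -> list[str]:
    import pysbd

    return pysbd.Segmenter(language="en", clean=False).segment(text)


if __name__ == "__main__":
    big_text = "Hello World. My name is Jonas. " * 10_000
    for name, seg in [("blingfire", blingfire_segment), ("pysbd", pysbd_segment)]:
        print(f"{name}: GRS {grs_accuracy(seg):.2%}, big text {big_text_ms(seg, big_text):.2f} ms")
```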
Quick illustration:
Method: basic
Doc: Hello, world! This is a test. This is another test. This is a third test?? How about this? I don't know!!!! This is a sentence with something...
Sentences: [Chunk(text='Hello, world!', start=0, end=13), Chunk(text=' This is a test.', start=13, end=29), Chunk(text=' This is another test.', start=29, end=51), Chunk(text=' This is a third test?', start=51, end=73), Chunk(text='? How about this?', start=73, end=90), Chunk(text=" I don't know!", start=90, end=104), Chunk(text='!!', start=104, end=106), Chunk(text='! This is a sentence with something...', start=106, end=144)]
Doc: 这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。
Sentences: [Chunk(text='这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。', start=0, end=32)]
Time: 0.69ms
Method: blingfire
Doc: Hello, world! This is a test. This is another test. This is a third test?? How about this? I don't know!!!! This is a sentence with something...
Sentences: [Chunk(text='Hello, world!', start=0, end=13), Chunk(text='This is a test.', start=14, end=29), Chunk(text='This is another test.', start=30, end=51), Chunk(text='This is a third test??', start=52, end=74), Chunk(text='How about this?', start=75, end=90), Chunk(text="I don't know!!!!", start=91, end=107), Chunk(text='This is a sentence with something...', start=108, end=144)]
Doc: 这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。
Sentences: [Chunk(text='这是一句比较短的中文。', start=0, end=11), Chunk(text='这是另一句, 但是比较长的, 复杂的中文。', start=11, end=32)]
Time: 1.10ms
Method: pysbd
Doc: Hello, world! This is a test. This is another test. This is a third test?? How about this? I don't know!!!! This is a sentence with something...
Sentences: [Chunk(text='Hello, world! ', start=0, end=14), Chunk(text='This is a test. ', start=14, end=30), Chunk(text='This is another test. ', start=30, end=52), Chunk(text='This is a third test?? ', start=52, end=75), Chunk(text='How about this? ', start=75, end=91), Chunk(text="I don't know!!!! This is a sentence with something...", start=91, end=144)]
Doc: 这是一句比较短的中文。这是另一句, 但是比较长的, 复杂的中文。
Sentences: [Chunk(text='这是一句比较短的中文。', start=0, end=11), Chunk(text='这是另一句, 但是比较长的, 复杂的中文。', start=11, end=32)]
Time: 4.68ms
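For context on how the spans above can be produced: blingfire returns plain sentence strings, so start/end offsets have to be recovered by walking the original document, while pysbd can return character spans directly. A minimal sketch (the Chunk dataclass here mirrors the output format above and is an assumption, not part of either library):

```python
from dataclasses import dataclass

import pysbd
from blingfire import text_to_sentences


@dataclass
class Chunk:
    # Assumed container mirroring the illustration output; not part of either library.
    text: str
    start: int
    end: int


def blingfire_chunks(doc: str) -> list[Chunk]:
    """blingfire returns newline-joined sentences; recover offsets by scanning the doc."""
    chunks, cursor = [], 0
    for sent in text_to_sentences(doc).split("\n"):
        # Assumes blingfire returns each sentence verbatim, so it can be located in doc.
        start = doc.index(sent, cursor)
        end = start + len(sent)
        chunks.append(Chunk(sent, start, end))
        cursor = end
    return chunks


def pysbd_chunks(doc: str, language: str = "en") -> list[Chunk]:
    """pysbd can return character spans directly via char_span=True."""
    seg = pysbd.Segmenter(language=language, clean=False, char_span=True)
    return [Chunk(span.sent, span.start, span.end) for span in seg.segment(doc)]


doc = "Hello, world! This is a third test?? How about this?"
print(blingfire_chunks(doc))
print(pysbd_chunks(doc))
```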
Blingfire is fast enough to be a drop-in replacement while still improving quality, whereas PySBD mostly brings a quality boost at the price of latency.
Do you think an option to use blingfire and/or pysbd for better multilingual/long-tail support merits the addition of those dependencies? If so, I'm happy to work on a PR for this.
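If there is interest, one possible shape (purely hypothetical; the function name, backend argument, and fallback below are illustrative, not the project's actual API) is an opt-in backend selector that keeps the basic tokenizer as the default, so neither library becomes a hard dependency:

```python
from typing import Callable


def make_sentence_splitter(backend: str = "basic") -> Callable[[str], list[str]]:
    # Hypothetical opt-in selector: "basic" stays the default, so blingfire/pysbd
    # are only imported (and required) when explicitly requested.
    if backend == "blingfire":
        try:
            from blingfire import text_to_sentences
        except ImportError as exc:
            raise ImportError("backend='blingfire' requires the blingfire package") from exc
        return lambda text: text_to_sentences(text).split("\n")

    if backend == "pysbd":
        try:
            import pysbd
        except ImportError as exc:
            raise ImportError("backend='pysbd' requires the pysbd package") from exc
        segmenter = pysbd.Segmenter(language="en", clean=False)
        return segmenter.segment

    # Placeholder for the existing basic tokenizer.
    return lambda text: [text]
```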