tiny_segmenter_dart 1.0.1
TinySegmenter for Dart #
A compact Japanese text tokenizer for Dart. TinySegmenter is a Japanese word segmentation library based on the original JavaScript implementation by Taku Kudo.
Features #
- 🚀 Lightweight: No external dependencies, compact implementation
- 🎯 Dictionary-free: Uses statistical modeling instead of requiring large dictionary files
- 🔧 Simple API: Easy to use with just one method call
- 📦 Pure Dart: Works on all Dart platforms (Flutter, Web, Server, etc.)
- 🇯🇵 Japanese text support: Handles Hiragana, Katakana, Kanji, and mixed text
Installation #
Add this to your package's pubspec.yaml file:
dependencies:
  tiny_segmenter_dart: ^1.0.0
Then run:
dart pub get
Usage #
Basic Usage #
import 'package:tiny_segmenter_dart/tiny_segmenter_dart.dart';

void main() {
  final segmenter = TinySegmenter();

  // Segment Japanese text
  final words = segmenter.segment('私は日本人です');
  print(words); // [私, は, 日本人, です]

  // Works with mixed text types
  final mixed = segmenter.segment('今日はいい天気です');
  print(mixed); // [今日, は, いい, 天気, です]
}
More Examples #
import 'package:tiny_segmenter_dart/tiny_segmenter_dart.dart';

void main() {
  final segmenter = TinySegmenter();

  // Katakana text
  final katakana = segmenter.segment('コンピューターを使います');
  print(katakana); // [コンピューター, を, 使い, ます]

  // Text with numbers
  final withNumbers = segmenter.segment('今日は2023年12月です');
  print(withNumbers); // [今日, は, 2, 0, 2, 3, 年, 1, 2, 月, です]

  // Complex sentence
  final complex = segmenter.segment('私は東京大学で日本語を勉強しています。');
  print(complex); // [私, は, 東京, 大学, で, 日本語, を, 勉強, し, て, い, ます, 。]

  // Empty string handling
  final empty = segmenter.segment('');
  print(empty); // []
}
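Because segment returns a plain list of strings, the result can be passed straight to ordinary Dart collection code. The sketch below counts token frequencies. The helper name tokenFrequencies is hypothetical and not part of the package, and the token lists are hard-coded copies of the Basic Usage outputs so the example runs without the package installed.

```dart
/// Hypothetical helper (not part of tiny_segmenter_dart):
/// counts how often each token occurs.
Map<String, int> tokenFrequencies(Iterable<String> tokens) {
  final counts = <String, int>{};
  for (final token in tokens) {
    counts[token] = (counts[token] ?? 0) + 1;
  }
  return counts;
}

void main() {
  // In real use these tokens would come from TinySegmenter().segment(...).
  // They are hard-coded here (taken from the Basic Usage outputs) so the
  // sketch is self-contained.
  final tokens = [
    '私', 'は', '日本人', 'です',
    '今日', 'は', 'いい', '天気', 'です',
  ];

  print(tokenFrequencies(tokens));
  // {私: 1, は: 2, 日本人: 1, です: 2, 今日: 1, いい: 1, 天気: 1}
}
```

Taking an Iterable of tokens rather than a segmenter instance keeps the helper independent of the package, so it can be tested with fixed input.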
How it Works #
TinySegmenter uses a statistical approach to segment Japanese text:
- Character Classification: Characters are classified into types (Hiragana, Katakana, Kanji, Alphabet, Numbers, etc.)
- Statistical Modeling: Uses pre-trained statistical models to determine word boundaries
- Context Analysis: Considers surrounding characters and their types to make segmentation decisions
- No Dictionary Required: Unlike dictionary-based approaches, it doesn't need large dictionary files
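To make the idea of scoring a boundary from context concrete, here is a deliberately tiny sketch. It is not the package's model: the weights, the bias, and the function name toySegment are all invented for illustration, and the only feature used is the pair of character types on either side of a candidate boundary (H for Kanji, I for Hiragana, K for Katakana). A boundary is placed wherever the summed score is positive.

```dart
// Illustrative values only. These are NOT the trained weights of any
// TinySegmenter implementation.
const int toyBias = -300;
const Map<String, int> toyTypeBigramWeights = {
  'HI': 500, // Kanji followed by Hiragana
  'IH': 500, // Hiragana followed by Kanji
  'KI': 500, // Katakana followed by Hiragana
  'IK': 500, // Hiragana followed by Katakana
};

/// Hypothetical toy segmenter. [types] holds one type label per character
/// of [text], for example 'HIHH' for '私は学生'.
List<String> toySegment(String text, String types) {
  final chars = text.runes.map(String.fromCharCode).toList();
  if (chars.length != types.length) {
    throw ArgumentError('types must contain one label per character');
  }
  if (chars.isEmpty) return [];

  final words = <String>[];
  final current = StringBuffer(chars.first);
  for (var i = 1; i < chars.length; i++) {
    // Score the boundary between character i-1 and character i.
    final feature = types[i - 1] + types[i];
    final score = toyBias + (toyTypeBigramWeights[feature] ?? 0);
    if (score > 0) {
      words.add(current.toString());
      current.clear();
    }
    current.write(chars[i]);
  }
  words.add(current.toString());
  return words;
}

void main() {
  print(toySegment('私は学生です', 'HIHHII')); // [私, は, 学生, です]
}
```

A trained model combines many more features than this single type bigram, which is why real output can differ sharply from what such a toy produces.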
Character Types #
The segmenter recognizes these character types:
- M: Japanese numbers (一二三四五六七八九十百千万億兆)
- H: Kanji characters (一-龠々〆ヵヶ)
- I: Hiragana (ぁ-ん)
- K: Katakana (ァ-ヴー, plus half-width ｱ-ﾝﾞｰ)
- A: Alphabet (a-zA-Z, plus full-width ａ-ｚＡ-Ｚ)
- N: Arabic numbers (0-9, plus full-width ０-９)
- O: Other characters
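The classification in the list above can be sketched with ordinary regular expressions. This is an assumption about how such a classifier might look, not the package's actual source: the names typePatterns, charType, and typeSequence are hypothetical, and the half-width katakana and full-width letter and digit ranges are assumed to match the original JavaScript TinySegmenter.

```dart
// Checked in insertion order: kanji numerals (M) come before general
// kanji (H) because characters such as 一 fall inside both ranges.
final Map<String, RegExp> typePatterns = {
  'M': RegExp('[一二三四五六七八九十百千万億兆]'),
  'H': RegExp('[一-龠々〆ヵヶ]'),
  'I': RegExp('[ぁ-ん]'),
  'K': RegExp('[ァ-ヴーｱ-ﾝﾞｰ]'),
  'A': RegExp('[a-zA-Zａ-ｚＡ-Ｚ]'),
  'N': RegExp('[0-9０-９]'),
};

/// Returns the type label of a single character, or 'O' if none matches.
String charType(String ch) {
  for (final entry in typePatterns.entries) {
    if (entry.value.hasMatch(ch)) return entry.key;
  }
  return 'O';
}

/// Maps every character of [text] to its type label.
String typeSequence(String text) =>
    text.runes.map((r) => charType(String.fromCharCode(r))).join();

void main() {
  print(charType('一')); // M
  print(charType('漢')); // H
  print(charType('。')); // O
  print(typeSequence('今日は2023年')); // HHINNNNH
}
```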
Performance #
TinySegmenter is designed to be fast and memory-efficient:
- No external dependencies
- Minimal memory footprint
- Fast segmentation speed
- Suitable for real-time applications
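Speed depends on input size and platform, so it is worth measuring on your own data. The sketch below uses Stopwatch from dart:core. The helper averageTime is hypothetical, and the timed closure is a stand-in workload so the example runs without the package; in practice you would time a call to segmenter.segment(text) instead.

```dart
/// Hypothetical helper: average wall-clock time of [body] over [runs] calls.
Duration averageTime(void Function() body, {int runs = 1000}) {
  final stopwatch = Stopwatch()..start();
  for (var i = 0; i < runs; i++) {
    body();
  }
  stopwatch.stop();
  return Duration(microseconds: stopwatch.elapsedMicroseconds ~/ runs);
}

void main() {
  // Repeat a sample sentence to get a longer input.
  final text = '私は東京大学で日本語を勉強しています。' * 100;

  // Stand-in workload. With the package installed, replace the closure
  // with: () => segmenter.segment(text)
  final perCall = averageTime(() => text.runes.toList(), runs: 200);

  print('average per call: ${perCall.inMicroseconds} microseconds');
}
```

Averaging over many runs smooths out timer resolution and warm-up effects, but the figures are still only indicative for one machine and one input.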
Limitations #
- Optimized for Japanese text; may not work well with other languages
- Segmentation accuracy depends on the statistical model and may not be perfect for all text types
- Numbers are often segmented character by character
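If digit-by-digit output is inconvenient, adjacent digit tokens can be merged in a post-processing step. The following is a minimal sketch, assuming tokens arrive as a list of strings as in the examples earlier in this README. The helper mergeDigitRuns is hypothetical and not part of the package.

```dart
// Matches tokens made up only of ASCII or full-width digits.
final RegExp digitsOnly = RegExp(r'^[0-9０-９]+$');

/// Hypothetical post-processing helper: joins runs of adjacent
/// digit-only tokens into a single token.
List<String> mergeDigitRuns(List<String> tokens) {
  final merged = <String>[];
  for (final token in tokens) {
    final extendsRun = merged.isNotEmpty &&
        digitsOnly.hasMatch(token) &&
        digitsOnly.hasMatch(merged.last);
    if (extendsRun) {
      merged[merged.length - 1] = merged.last + token;
    } else {
      merged.add(token);
    }
  }
  return merged;
}

void main() {
  // Token list copied from the "Text with numbers" example output.
  final tokens = ['今日', 'は', '2', '0', '2', '3', '年', '1', '2', '月', 'です'];
  print(mergeDigitRuns(tokens)); // [今日, は, 2023, 年, 12, 月, です]
}
```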
Credits #
- Original JavaScript implementation by Taku Kudo
- Dart port implementation
License #
This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.
Contributing #
Contributions are welcome! Please feel free to submit a Pull Request.
Issues #
If you encounter any issues or have suggestions, please open an issue on GitHub.