TinySegmenter for Dart #

A compact Japanese tokenizer for Dart. This package is a Dart port of TinySegmenter, the Japanese word segmentation library originally written in JavaScript by Taku Kudo.

Features #

🚀 Lightweight: No external dependencies, compact implementation
🎯 Dictionary-free: Uses statistical modeling instead of requiring large dictionary files
🔧 Simple API: Easy to use with just one method call
📦 Pure Dart: Works on all Dart platforms (Flutter, Web, Server, etc.)
🇯🇵 Japanese text support: Handles Hiragana, Katakana, Kanji, and mixed text

Installation #

Add this to your package's pubspec.yaml file:

dependencies:
  tiny_segmenter_dart: ^1.0.0

Then run:

dart pub get

Usage #

Basic Usage #

import 'package:tiny_segmenter_dart/tiny_segmenter_dart.dart';

void main() {
  final segmenter = TinySegmenter();
  
  // Segment Japanese text
  final words = segmenter.segment('私は日本人です');
  print(words); // ['私', 'は', '日本人', 'です']
  
  // Works with text that mixes kanji and hiragana
  final mixed = segmenter.segment('今日はいい天気です');
  print(mixed); // ['今日', 'は', 'いい', '天気', 'です']
}

More Examples #

import 'package:tiny_segmenter_dart/tiny_segmenter_dart.dart';

void main() {
  final segmenter = TinySegmenter();
  
  // Katakana text
  final katakana = segmenter.segment('コンピューターを使います');
  print(katakana); // ['コンピューター', 'を', '使い', 'ます']
  
  // Text with numbers
  final withNumbers = segmenter.segment('今日は2023年12月です');
  print(withNumbers); // ['今日', 'は', '2', '0', '2', '3', '年', '1', '2', '月', 'です']
  
  // Complex sentence
  final complex = segmenter.segment('私は東京大学で日本語を勉強しています。');
  print(complex); // ['私', 'は', '東京', '大学', 'で', '日本語', 'を', '勉強', 'し', 'て', 'い', 'ます', '。']
  
  // Empty string handling
  final empty = segmenter.segment('');
  print(empty); // []
}
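
The token list returned by segment can be fed into ordinary Dart collection code. The sketch below counts token frequencies; tokenFrequencies is a hypothetical helper written for this example and is not part of the package. To keep the snippet runnable on its own, it uses token lists copied from the Basic Usage comments instead of calling the segmenter, and it assumes segment returns a List<String>.

```dart
/// Counts how often each token occurs.
/// Hypothetical helper for illustration; not part of tiny_segmenter_dart.
Map<String, int> tokenFrequencies(Iterable<String> tokens) {
  final counts = <String, int>{};
  for (final token in tokens) {
    counts[token] = (counts[token] ?? 0) + 1;
  }
  return counts;
}

void main() {
  // Tokens copied from the two Basic Usage examples.
  // In real code these would come from TinySegmenter().segment(...).
  final tokens = [
    '私', 'は', '日本人', 'です',
    '今日', 'は', 'いい', '天気', 'です',
  ];

  final counts = tokenFrequencies(tokens);
  print(counts['は']); // 2
  print(counts['です']); // 2
  print(counts['天気']); // 1
}
```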

How it Works #

TinySegmenter uses a statistical approach to segment Japanese text:

Character Classification: Characters are classified into types (Hiragana, Katakana, Kanji, Alphabet, Numbers, etc.)
Statistical Modeling: A pre-trained statistical model scores each position between two characters to decide whether it is a word boundary
Context Analysis: The decision takes the surrounding characters and their types into account
No Dictionary Required: Unlike dictionary-based tokenizers, it does not need large dictionary files
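
As a rough illustration of the idea, here is a toy boundary scorer. It is a minimal sketch under strong assumptions, not the package's implementation: the names bias, transitionWeights, and toySegment are hypothetical, the weights are invented for this example rather than trained, and the only feature used is the pair of character types on either side of a candidate boundary. A trained model combines many more features, such as the surrounding characters themselves.

```dart
// Toy illustration only. The weights below are invented for this sketch
// and are NOT the trained weights used by TinySegmenter.
const int bias = -1;
const Map<String, int> transitionWeights = {
  'HI': 2, // kanji followed by hiragana
  'IH': 2, // hiragana followed by kanji
  'KI': 2, // katakana followed by hiragana
  'IK': 2, // hiragana followed by katakana
};

final _kanji = RegExp(r'[一-龠々]');
final _hiragana = RegExp(r'[ぁ-ん]');
final _katakana = RegExp(r'[ァ-ヴー]');

/// Simplified classifier: H = kanji, I = hiragana, K = katakana, O = other.
String charType(String ch) {
  if (_kanji.hasMatch(ch)) return 'H';
  if (_hiragana.hasMatch(ch)) return 'I';
  if (_katakana.hasMatch(ch)) return 'K';
  return 'O';
}

/// Splits [text] wherever the score of a candidate boundary is positive.
List<String> toySegment(String text) {
  final chars = text.runes.map((r) => String.fromCharCode(r)).toList();
  if (chars.isEmpty) return <String>[];

  final words = <String>[];
  var word = chars.first;
  for (var i = 1; i < chars.length; i++) {
    // Feature: character types to the left and right of the boundary.
    final feature = charType(chars[i - 1]) + charType(chars[i]);
    final score = bias + (transitionWeights[feature] ?? 0);
    if (score > 0) {
      words.add(word);
      word = '';
    }
    word += chars[i];
  }
  words.add(word);
  return words;
}

void main() {
  print(toySegment('私はカメラ')); // [私, は, カメラ]
}
```

With type pairs alone, a word such as 買う would be split between the kanji and the hiragana, which is why a real model needs richer context than this sketch uses.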

Character Types #

The segmenter recognizes these character types:

M: Kanji numerals (一二三四五六七八九十百千万億兆)
H: Kanji characters (一-龠々〆ヵヶ)
I: Hiragana (ぁ-ん)
K: Katakana (ァ-ヴーｱ-ﾝﾞｰ)
A: Alphabet (a-zA-Zａ-ｚＡ-Ｚ)
N: Arabic numbers (0-9０-９)
O: Other characters
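
The table above can be expressed as an ordered list of regular expressions. The sketch below is one possible way to write such a classifier, not necessarily how the package does it: charTypePatterns, charType, and typeString are hypothetical names. It assumes the patterns are tried in the order listed and that the first match wins, which matters because the kanji numerals (M) also fall inside the kanji range (H).

```dart
/// Character-type patterns, tried in insertion order; the first match wins.
/// Hypothetical names for illustration; ranges taken from the list above.
final Map<String, RegExp> charTypePatterns = {
  'M': RegExp(r'[一二三四五六七八九十百千万億兆]'),
  'H': RegExp(r'[一-龠々〆ヵヶ]'),
  'I': RegExp(r'[ぁ-ん]'),
  'K': RegExp(r'[ァ-ヴーｱ-ﾝﾞｰ]'),
  'A': RegExp(r'[a-zA-Zａ-ｚＡ-Ｚ]'),
  'N': RegExp(r'[0-9０-９]'),
};

/// Returns the type label for a single character, or 'O' if none matches.
String charType(String ch) {
  for (final entry in charTypePatterns.entries) {
    if (entry.value.hasMatch(ch)) return entry.key;
  }
  return 'O';
}

/// Maps every character of [text] to its type label.
String typeString(String text) =>
    text.runes.map((r) => charType(String.fromCharCode(r))).join();

void main() {
  print(charType('一')); // M (a kanji numeral, checked before H)
  print(charType('日')); // H
  print(charType('ー')); // K
  print(charType('。')); // O
  print(typeString('今日は2023年')); // HHINNNNH
}
```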

Performance #

TinySegmenter is designed to be fast and memory-efficient:

No external dependencies
Small memory footprint: no dictionary files to load
Fast segmentation speed
Suitable for real-time applications
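
No timings are given here, so it is worth measuring on your own input before relying on these properties. The sketch below is one rough way to do that with Dart's Stopwatch. averageMicros is a hypothetical helper, and splitPerCharacter is only a stand-in tokenizer so that the snippet runs without the package. To measure the real segmenter, pass TinySegmenter().segment instead, assuming it returns a List<String>.

```dart
/// Average time per call in microseconds over [iterations] runs.
/// Hypothetical helper for illustration.
double averageMicros(
  List<String> Function(String) tokenize,
  String text, {
  int iterations = 1000,
}) {
  final stopwatch = Stopwatch()..start();
  for (var i = 0; i < iterations; i++) {
    tokenize(text);
  }
  stopwatch.stop();
  return stopwatch.elapsedMicroseconds / iterations;
}

/// Stand-in tokenizer: one token per character. Replace with the real
/// segmenter when measuring the package.
List<String> splitPerCharacter(String text) =>
    text.runes.map((r) => String.fromCharCode(r)).toList();

void main() {
  final micros = averageMicros(
    splitPerCharacter,
    '私は東京大学で日本語を勉強しています。',
  );
  print('average: ${micros.toStringAsFixed(2)} µs per call');
}
```

Results vary with the platform (VM, AOT, or web) and with text length, so treat a single number as a rough guide only.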

Limitations #

Optimized for Japanese text; may not work well with other languages
Segmentation accuracy depends on the statistical model and may not be perfect for all text types
Numbers are often segmented character by character (for example, 2023 becomes '2', '0', '2', '3')
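
If digit-by-digit output is a problem, adjacent digit tokens can be rejoined after segmentation. The following is a minimal sketch of such a post-processing step. mergeDigitRuns is a hypothetical helper, not part of the package, and the input is the token list shown for the "Text with numbers" example.

```dart
/// Matches tokens made up entirely of ASCII or full-width digits.
final _digitsOnly = RegExp(r'^[0-9０-９]+$');

/// Joins runs of adjacent digit tokens into a single token.
/// Hypothetical helper for illustration; not part of tiny_segmenter_dart.
List<String> mergeDigitRuns(List<String> tokens) {
  final merged = <String>[];
  var previousWasDigits = false;
  for (final token in tokens) {
    final isDigits = _digitsOnly.hasMatch(token);
    if (isDigits && previousWasDigits) {
      // Extend the digit run that is already at the end of the list.
      merged[merged.length - 1] += token;
    } else {
      merged.add(token);
    }
    previousWasDigits = isDigits;
  }
  return merged;
}

void main() {
  final tokens = [
    '今日', 'は', '2', '0', '2', '3', '年', '1', '2', '月', 'です',
  ];
  print(mergeDigitRuns(tokens)); // [今日, は, 2023, 年, 12, 月, です]
}
```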

Credits #

Original JavaScript implementation by Taku Kudo
Dart port implementation

License #

This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.

Contributing #

Contributions are welcome! Please feel free to submit a Pull Request.

Issues #

If you encounter any issues or have suggestions, please open an issue on GitHub.
