tiny_segmenter_dart 1.0.1

A compact Japanese text tokenizer for Dart. TinySegmenter is a Japanese word segmentation library based on the original JavaScript implementation by Taku Kudo.

TinySegmenter for Dart #

Features #

  • 🚀 Lightweight: No external dependencies, compact implementation
  • 🎯 Dictionary-free: Uses statistical modeling instead of requiring large dictionary files
  • 🔧 Simple API: Easy to use with just one method call
  • 📦 Pure Dart: Works on all Dart platforms (Flutter, Web, Server, etc.)
  • 🇯🇵 Japanese text support: Handles Hiragana, Katakana, Kanji, and mixed text

Installation #

Add this to your package's pubspec.yaml file:

dependencies:
  tiny_segmenter_dart: ^1.0.1

Then run:

dart pub get

Usage #

Basic Usage #

import 'package:tiny_segmenter_dart/tiny_segmenter_dart.dart';

void main() {
  final segmenter = TinySegmenter();
  
  // Segment Japanese text
  final words = segmenter.segment('私は日本人です');
  print(words); // ['私', 'は', '日本人', 'です']
  
  // Works with mixed text types
  final mixed = segmenter.segment('今日はいい天気です');
  print(mixed); // ['今日', 'は', 'いい', '天気', 'です']
}

More Examples #

import 'package:tiny_segmenter_dart/tiny_segmenter_dart.dart';

void main() {
  final segmenter = TinySegmenter();
  
  // Katakana text
  final katakana = segmenter.segment('コンピューターを使います');
  print(katakana); // ['コンピューター', 'を', '使い', 'ます']
  
  // Text with numbers
  final withNumbers = segmenter.segment('今日は2023年12月です');
  print(withNumbers); // ['今日', 'は', '2', '0', '2', '3', '年', '1', '2', '月', 'です']
  
  // Complex sentence
  final complex = segmenter.segment('私は東京大学で日本語を勉強しています。');
  print(complex); // ['私', 'は', '東京', '大学', 'で', '日本語', 'を', '勉強', 'し', 'て', 'い', 'ます', '。']
  
  // Empty string handling
  final empty = segmenter.segment('');
  print(empty); // []
}
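
Because segment returns a plain list of strings in the examples above, its result can be passed straight to ordinary Dart collection code. The following is a minimal sketch of a token frequency count. The helper name countTokens is hypothetical and not part of the package, and the input list is copied from the example comments above so the snippet runs without the package installed.

```dart
/// Counts how often each token occurs.
/// `countTokens` is a hypothetical helper, not part of tiny_segmenter_dart.
Map<String, int> countTokens(Iterable<String> tokens) {
  final counts = <String, int>{};
  for (final token in tokens) {
    counts[token] = (counts[token] ?? 0) + 1;
  }
  return counts;
}

void main() {
  // Tokens copied from the two Basic Usage example comments. In real code
  // they would come from TinySegmenter().segment(text).
  final tokens = [
    '私', 'は', '日本人', 'です',
    '今日', 'は', 'いい', '天気', 'です',
  ];

  final counts = countTokens(tokens);
  print(counts['は']); // 2
  print(counts['天気']); // 1
}
```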

How it Works #

TinySegmenter uses a statistical approach to segment Japanese text:

  1. Character Classification: Characters are classified into types (Hiragana, Katakana, Kanji, Alphabet, Numbers, etc.)
  2. Statistical Modeling: Uses pre-trained statistical models to determine word boundaries
  3. Context Analysis: Considers surrounding characters and their types to make segmentation decisions
  4. No Dictionary Required: Unlike traditional approaches, it doesn't need large dictionary files
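
The steps above can be illustrated with a deliberately tiny linear model. This is a sketch of the general idea only: the bias, the weight table, and the names coarseType and toySegment are invented for illustration and are not the package's trained model or API. Each position between two characters gets a score built from the types of the neighbouring characters, and a positive score marks a word boundary.

```dart
// Toy illustration only. The weights below are made up for this sketch and
// are NOT the trained model shipped with the package.

/// Coarse character type: 'H' kanji, 'I' hiragana, 'K' katakana, 'O' other.
String coarseType(String ch) {
  final c = ch.codeUnitAt(0);
  if (c >= 0x4E00 && c <= 0x9FA0) return 'H';
  if (c >= 0x3041 && c <= 0x3093) return 'I';
  if ((c >= 0x30A1 && c <= 0x30F4) || c == 0x30FC) return 'K';
  return 'O';
}

/// Hypothetical bias: with no supporting evidence, do not split.
const int bias = -100;

/// Hypothetical weights keyed by (previous type + current type).
const Map<String, int> typeBigramWeights = {
  'HI': 150, // kanji followed by hiragana
  'IH': 250, // hiragana followed by kanji
  'IK': 200, // hiragana followed by katakana
  'KI': 200, // katakana followed by hiragana
};

/// Splits [text] wherever bias + weight is positive.
List<String> toySegment(String text) {
  if (text.isEmpty) return [];
  final chars = text.split(''); // assumes BMP characters only
  final words = <String>[];
  final current = StringBuffer(chars.first);
  for (var i = 1; i < chars.length; i++) {
    final key = coarseType(chars[i - 1]) + coarseType(chars[i]);
    final score = bias + (typeBigramWeights[key] ?? 0);
    if (score > 0) {
      words.add(current.toString());
      current.clear();
    }
    current.write(chars[i]);
  }
  words.add(current.toString());
  return words;
}

void main() {
  print(toySegment('私はカメラ').join(' | ')); // 私 | は | カメラ
}
```

This toy splits at every kanji-to-hiragana transition, so it would wrongly break a word such as 使い. A trained model avoids that by combining many more features than a single type pair.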

Character Types #

The segmenter recognizes these character types:

  • M: Japanese numbers (一二三四五六七八九十百千万億兆)
  • H: Kanji characters (一-龠々〆ヵヶ)
  • I: Hiragana (ぁ-ん)
  • K: Katakana (ァ-ヴー, plus half-width ｱ-ﾝﾞｰ)
  • A: Alphabet (a-zA-Z, plus full-width ａ-ｚＡ-Ｚ)
  • N: Arabic numbers (0-9, plus full-width ０-９)
  • O: Other characters
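
The table above can be expressed as an ordered list of patterns in which the first match wins. The sketch below shows one way to do that and is not taken from the package source: the names charTypePatterns and charType are hypothetical, and the half-width and full-width ranges are written as \u escapes on the assumption that the package follows the original TinySegmenter character classes.

```dart
// Ordered (pattern, label) pairs. The first matching pattern wins, so 'M'
// must come before 'H': the kanji numerals also fall inside the kanji range.
final List<MapEntry<RegExp, String>> charTypePatterns = [
  MapEntry(RegExp(r'[一二三四五六七八九十百千万億兆]'), 'M'),
  MapEntry(RegExp(r'[一-龠々〆ヵヶ]'), 'H'),
  MapEntry(RegExp(r'[ぁ-ん]'), 'I'),
  // Katakana plus the half-width forms U+FF70 to U+FF9E.
  MapEntry(RegExp(r'[ァ-ヴー\uFF70-\uFF9E]'), 'K'),
  // ASCII letters plus full-width a-z (U+FF41..) and A-Z (U+FF21..).
  MapEntry(RegExp(r'[a-zA-Z\uFF41-\uFF5A\uFF21-\uFF3A]'), 'A'),
  // ASCII digits plus full-width digits U+FF10 to U+FF19.
  MapEntry(RegExp(r'[0-9\uFF10-\uFF19]'), 'N'),
];

/// Returns the type label for a single-character string, or 'O' if no
/// pattern matches.
String charType(String ch) {
  for (final entry in charTypePatterns) {
    if (entry.key.hasMatch(ch)) return entry.value;
  }
  return 'O';
}

void main() {
  for (final ch in ['一', '日', 'あ', 'カ', 'a', '5', '。']) {
    print('$ch -> ${charType(ch)}');
  }
  // Prints the labels M, H, I, K, A, N, O in that order.
}
```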

Performance #

TinySegmenter is designed to be fast and memory-efficient:

  • No external dependencies
  • Minimal memory footprint
  • Fast segmentation speed
  • Suitable for real-time applications

Limitations #

  • Optimized for Japanese text; may not work well with other languages
  • Segmentation accuracy depends on the statistical model and may not be perfect for all text types
  • Numbers are often segmented character by character
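
If digit-by-digit output is inconvenient, adjacent digit tokens can be joined after segmentation. The following is a minimal post-processing sketch. mergeDigitRuns is a hypothetical helper, not part of the package, and the input list is copied from the "Text with numbers" example comment above.

```dart
// Matches tokens made up entirely of ASCII or full-width digits.
final RegExp _digitsOnly = RegExp(r'^[0-9\uFF10-\uFF19]+$');

/// Joins consecutive digit-only tokens into a single token.
/// Hypothetical helper, not part of tiny_segmenter_dart.
List<String> mergeDigitRuns(List<String> tokens) {
  final merged = <String>[];
  for (final token in tokens) {
    final extendsRun = _digitsOnly.hasMatch(token) &&
        merged.isNotEmpty &&
        _digitsOnly.hasMatch(merged.last);
    if (extendsRun) {
      merged[merged.length - 1] = merged.last + token;
    } else {
      merged.add(token);
    }
  }
  return merged;
}

void main() {
  // Token list as shown in the "Text with numbers" example.
  final tokens = ['今日', 'は', '2', '0', '2', '3', '年', '1', '2', '月', 'です'];
  print(mergeDigitRuns(tokens).join(' | '));
  // 今日 | は | 2023 | 年 | 12 | 月 | です
}
```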

Credits #

  • Original JavaScript implementation by Taku Kudo
  • Dart port implementation

License #

This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.

Contributing #

Contributions are welcome! Please feel free to submit a Pull Request.

Issues #

If you encounter any issues or have suggestions, please open an issue on GitHub.

Publisher

iori.dev (verified publisher)

Topics

#japanese #text-processing #tokenizer #nlp #segmentation
