#nlp #japanese #tokenizer #ngrams

tiniestsegmenter

Compact Japanese segmenter

3 releases (breaking)

0.3.0 Sep 24, 2024
0.2.0 Jun 24, 2024
0.1.1 May 11, 2024
0.1.0 May 11, 2024


Custom license

58KB
2K SLoC

TiniestSegmenter

A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.

TinySegmenter is an n-gram word tokenizer for Japanese text, originally built by Taku Kudo (2008). It segments text using a compact pre-trained model that scores potential word boundaries from character and character-type n-gram features, so it needs no dictionary.
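To illustrate the character-type features such a model relies on, here is a minimal, self-contained sketch of classifying Japanese characters by Unicode range. The category names and ranges are illustrative only; the actual TinySegmenter model uses its own trained feature classes, not this exact mapping.

```rust
// Sketch: TinySegmenter-style character-type classification.
// Ranges are approximate and for illustration; the real model's
// classes come from Taku Kudo's trained feature set.
fn char_class(c: char) -> &'static str {
    match c {
        '一'..='龠' => "kanji",            // common CJK ideograph range
        'ぁ'..='ん' => "hiragana",
        'ァ'..='ヴ' | 'ー' => "katakana",  // includes the long-vowel mark
        '0'..='9' | '０'..='９' => "digit", // ASCII and full-width digits
        'a'..='z' | 'A'..='Z' => "latin",
        _ => "other",
    }
}

fn main() {
    // Each character's class becomes a feature when scoring boundaries.
    for c in "ジャガイモが好きです。".chars() {
        println!("{c}: {}", char_class(c));
    }
}
```

A boundary between two characters of different classes (e.g. katakana followed by hiragana) is a strong segmentation signal in this style of model.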

Usage

Add the crate to your project: cargo add tiniestsegmenter.

use tiniestsegmenter as ts;

fn main() {
    // Segment the input; tokens are &str slices borrowing from the input.
    let tokens: Vec<&str> = ts::tokenize("ジャガイモが好きです。");
    println!("{tokens:?}");
}
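Because `tokenize` returns `Vec<&str>`, the tokens borrow from the input rather than allocating new strings. A toy analogue using a whitespace tokenizer (not the crate's API, just a sketch of the same zero-copy pattern):

```rust
// Toy analogue of a Vec<&str>-returning tokenizer: the returned
// slices borrow from the caller's string, so no token is copied.
fn toy_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    let input = String::from("jagaimo ga suki desu");
    let tokens: Vec<&str> = toy_tokenize(&input);
    println!("{tokens:?}");
}
```

The borrow means the input string must outlive the token vector; clone tokens to owned `String`s if they need to escape that scope.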

No runtime deps