#chinese #hanzi #tokenize #segment #localization

chinese_segmenter

Tokenize Chinese sentences using a dictionary-driven, largest-first matching approach

5 releases (2 stable)

1.0.1 Aug 2, 2022
1.0.0 May 8, 2022
0.1.2 May 13, 2020
0.1.1 May 5, 2020
0.1.0 May 5, 2020

#1535 in Text processing


133 downloads per month

MIT license

4KB

segmenter v1.0.0

About

Segment Chinese sentences into component words using a dictionary-driven, largest-first matching approach.

Usage

extern crate chinese_segmenter;

use chinese_segmenter::{initialize, tokenize};

fn main() {
    let sentence = "今天晚上想吃羊肉吗?";
    initialize(); // Optional initialization to load data
    let result: Vec<&str> = tokenize(sentence);
    println!("{:?}", result); // --> ["今天", "晚上", "想", "吃", "羊肉", "吗"]
}
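
For reference, "largest-first" matching (greedy maximum matching) scans the sentence left to right and at each position takes the longest substring found in the dictionary, shrinking the candidate until a match is found and skipping characters with no entry (such as punctuation). The standalone sketch below illustrates the idea with a hard-coded toy dictionary; it is not the crate's internal implementation, and the segment helper is hypothetical.

use std::collections::HashSet;

// Greedy largest-first segmentation over a toy dictionary (illustration only).
fn segment<'a>(sentence: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    let chars: Vec<(usize, char)> = sentence.char_indices().collect();
    let mut words = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let mut matched = None;
        // Try the longest remaining span first, then shrink until a dictionary hit.
        for end in (i + 1..=chars.len()).rev() {
            let start_byte = chars[i].0;
            let end_byte = if end == chars.len() { sentence.len() } else { chars[end].0 };
            let candidate = &sentence[start_byte..end_byte];
            if dict.contains(candidate) {
                matched = Some((candidate, end));
                break;
            }
        }
        match matched {
            Some((word, end)) => { words.push(word); i = end; }
            None => i += 1, // no dictionary entry (e.g. punctuation): skip the character
        }
    }
    words
}

fn main() {
    let dict: HashSet<&str> = ["今天", "晚上", "想", "吃", "羊肉", "吗"].into_iter().collect();
    let words = segment("今天晚上想吃羊肉吗?", &dict);
    println!("{:?}", words); // ["今天", "晚上", "想", "吃", "羊肉", "吗"]
}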

Contributors

License

MIT



Dependencies

~4.5MB
~19K SLoC