11 releases (4 breaking)

0.5.1 May 12, 2023
0.5.0 Feb 22, 2023
0.4.0 Feb 3, 2023
0.3.3 Dec 14, 2022
0.1.1 Aug 23, 2022

#1337 in Text processing

1,235 downloads per month
Used in tantivy-vibrato

MIT/Apache

285KB
7K SLoC

vibrato

Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm.

API documentation

https://docs.rs/vibrato

License

Licensed under either of the Apache License, Version 2.0 or the MIT license, at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.


lib.rs:

Vibrato

Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm.
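To make "tokenization based on the Viterbi algorithm" concrete, here is a minimal sketch of the general technique, not Vibrato's implementation. It assumes a toy lexicon that maps each surface form to a word cost and picks the segmentation with the lowest total cost. The names `ToyToken` and `toy_viterbi` are hypothetical and exist only for this illustration. The sketch also ignores connection costs between adjacent tokens and unknown-word handling, which the `matrix.def`, `char.def`, and `unk.def` files in the example below suggest a real dictionary provides.

```rust
use std::collections::HashMap;

/// A token on the best path: its surface string and byte range in the input.
/// (Hypothetical type for this sketch; not part of the vibrato API.)
#[derive(Debug, PartialEq)]
pub struct ToyToken {
    pub surface: String,
    pub start: usize,
    pub end: usize,
}

/// Finds the minimum-cost segmentation of `text` using word costs only.
/// Returns `None` when no sequence of lexicon entries covers the whole text.
pub fn toy_viterbi(text: &str, lexicon: &HashMap<&str, i32>) -> Option<Vec<ToyToken>> {
    // Byte offsets of every character boundary, including the end of the text.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let n = bounds.len();

    // best[j] = (minimum cost to reach boundary j, boundary the last word started at).
    let mut best: Vec<Option<(i32, usize)>> = vec![None; n];
    best[0] = Some((0, 0));

    // Forward pass: extend every reachable boundary with every matching word.
    for i in 0..n {
        let Some((cost_i, _)) = best[i] else { continue };
        for j in (i + 1)..n {
            let word = &text[bounds[i]..bounds[j]];
            if let Some(&word_cost) = lexicon.get(word) {
                let candidate = cost_i + word_cost;
                if best[j].map_or(true, |(c, _)| candidate < c) {
                    best[j] = Some((candidate, i));
                }
            }
        }
    }

    // Backward pass: follow the stored back-pointers from the end of the text.
    let mut tokens = Vec::new();
    let mut j = n - 1;
    while j > 0 {
        let (_, i) = best[j]?;
        tokens.push(ToyToken {
            surface: text[bounds[i]..bounds[j]].to_string(),
            start: bounds[i],
            end: bounds[j],
        });
        j = i;
    }
    tokens.reverse();
    Some(tokens)
}
```

With a lexicon in which "京都" and "東京都" are cheap and the single characters are expensive, `toy_viterbi("京都東京都", &lexicon)` selects the two-token segmentation, because the forward pass keeps only the cheapest way to reach each character boundary.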

Examples

use std::fs::File;

use vibrato::{SystemDictionaryBuilder, Tokenizer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Loads a set of raw dictionary files
    let dict = SystemDictionaryBuilder::from_readers(
        File::open("src/tests/resources/lex.csv")?,
        File::open("src/tests/resources/matrix.def")?,
        File::open("src/tests/resources/char.def")?,
        File::open("src/tests/resources/unk.def")?,
    )?;
    // or loads a compiled dictionary
    // let reader = File::open("path/to/system.dic")?;
    // let dict = vibrato::Dictionary::read(reader)?;

    let tokenizer = Tokenizer::new(dict);
    let mut worker = tokenizer.new_worker();

    // Sets the input sentence, then runs tokenization on it
    worker.reset_sentence("京都東京都");
    worker.tokenize();
    assert_eq!(worker.num_tokens(), 2);

    // First token: surface form, character range, byte range, and feature string
    let t0 = worker.token(0);
    assert_eq!(t0.surface(), "京都");
    assert_eq!(t0.range_char(), 0..2);
    assert_eq!(t0.range_byte(), 0..6);
    assert_eq!(t0.feature(), "京都,名詞,固有名詞,地名,一般,*,*,キョウト,京都,*,A,*,*,*,1/5");

    // Second token
    let t1 = worker.token(1);
    assert_eq!(t1.surface(), "東京都");
    assert_eq!(t1.range_char(), 2..5);
    assert_eq!(t1.range_byte(), 6..15);
    assert_eq!(t1.feature(), "東京都,名詞,固有名詞,地名,一般,*,*,トウキョウト,東京都,*,B,5/9,*,5/9,*");

    Ok(())
}
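
The example reports each token's position twice: `range_char()` counts characters, while `range_byte()` counts UTF-8 bytes. Each character in "京都東京都" occupies three bytes, which is why the character range 2..5 corresponds to the byte range 6..15. The sketch below shows that relationship using only the standard library. The helper `char_range_to_byte_range` is a hypothetical name introduced for illustration and is not part of vibrato, which already returns both ranges directly.

```rust
use std::ops::Range;

/// Converts a range of character positions into the matching byte range of `text`.
/// Returns `None` if the range is reversed or reaches past the end of the text.
/// (Hypothetical helper for illustration; not part of the vibrato API.)
pub fn char_range_to_byte_range(text: &str, chars: Range<usize>) -> Option<Range<usize>> {
    if chars.start > chars.end {
        return None;
    }
    // Byte offset of every character boundary, including the end of the text.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let start = *bounds.get(chars.start)?;
    let end = *bounds.get(chars.end)?;
    Some(start..end)
}
```

Byte ranges are the ones to use when slicing a Rust `&str`, since string indexing works on byte offsets. Character ranges are more convenient when positions have to be reported in terms of visible characters.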

Dependencies

~3.5–5.5MB
~93K SLoC