11 releases (4 breaking)
Version | Released |
---|---|
0.5.1 | May 12, 2023 |
0.5.0 | Feb 22, 2023 |
0.4.0 | Feb 3, 2023 |
0.3.3 | Dec 14, 2022 |
0.1.1 | Aug 23, 2022 |
#1337 in Text processing
1,235 downloads per month
Used in tantivy-vibrato
285KB, 7K SLoC
vibrato
Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm.
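To make the Viterbi-based approach concrete, here is a minimal, std-only sketch of a minimum-cost path search over a word lattice. It is not Vibrato's implementation: the `Entry` and `Node` types, the `viterbi_tokenize`, `toy_dictionary`, and `toy_connections` functions, and every cost value are assumptions made up for illustration.

```rust
/// A toy dictionary entry. All costs below are invented for illustration.
struct Entry {
    surface: &'static str,
    class_id: usize, // row/column index into the connection matrix
    cost: i32,       // cost of using this word
}

/// A lattice node: one dictionary word placed at one position in the text.
#[derive(Clone)]
struct Node {
    start: usize,                 // byte offset where the word starts
    class_id: usize,
    total: i32,                   // cheapest cost of any path ending in this node
    prev: Option<(usize, usize)>, // (end offset, index) of the best predecessor
}

/// Returns the cheapest segmentation of `text` and its total cost, or `None`
/// if the dictionary cannot cover the text. `conn[left][right]` is the cost of
/// placing class `right` after class `left`; class 0 marks sentence start/end.
fn viterbi_tokenize(text: &str, dict: &[Entry], conn: &[Vec<i32>]) -> Option<(Vec<String>, i32)> {
    let n = text.len();
    // ends[p] holds every node whose word ends at byte offset p.
    let mut ends: Vec<Vec<Node>> = vec![Vec::new(); n + 1];
    ends[0].push(Node { start: 0, class_id: 0, total: 0, prev: None });

    for pos in 0..n {
        // Offsets that no word reaches are skipped (this also keeps slicing on char boundaries).
        if ends[pos].is_empty() {
            continue;
        }
        for e in dict {
            if e.surface.is_empty() || !text[pos..].starts_with(e.surface) {
                continue;
            }
            // Forward step: pick the cheapest predecessor among nodes ending at `pos`.
            let best = ends[pos]
                .iter()
                .enumerate()
                .map(|(i, p)| (p.total + conn[p.class_id][e.class_id], i))
                .min();
            if let Some((cost, i)) = best {
                ends[pos + e.surface.len()].push(Node {
                    start: pos,
                    class_id: e.class_id,
                    total: cost + e.cost,
                    prev: Some((pos, i)),
                });
            }
        }
    }

    // Connect to the end-of-sentence marker and choose the cheapest final node.
    let (total, mut idx) = ends[n]
        .iter()
        .enumerate()
        .map(|(i, p)| (p.total + conn[p.class_id][0], i))
        .min()?;

    // Backward step: follow the back pointers until the start-of-sentence node.
    let mut end = n;
    let mut tokens = Vec::new();
    while let Some((prev_end, prev_idx)) = ends[end][idx].prev {
        tokens.push(text[ends[end][idx].start..end].to_string());
        end = prev_end;
        idx = prev_idx;
    }
    tokens.reverse();
    Some((tokens, total))
}

/// Hypothetical dictionary: class 1 = multi-character nouns, class 2 = single characters.
fn toy_dictionary() -> Vec<Entry> {
    vec![
        Entry { surface: "京都", class_id: 1, cost: 3 },
        Entry { surface: "東京", class_id: 1, cost: 3 },
        Entry { surface: "東京都", class_id: 1, cost: 4 },
        Entry { surface: "京", class_id: 2, cost: 6 },
        Entry { surface: "都", class_id: 2, cost: 6 },
        Entry { surface: "東", class_id: 2, cost: 6 },
    ]
}

/// Hypothetical connection costs, indexed as [left class][right class].
fn toy_connections() -> Vec<Vec<i32>> {
    vec![vec![0, 1, 2], vec![1, 2, 1], vec![2, 1, 3]]
}
```

With these made-up costs, `viterbi_tokenize("京都東京都", &toy_dictionary(), &toy_connections())` selects 京都 followed by 東京都 at a total cost of 11. A real tokenizer additionally handles unknown words and uses far more compact data structures.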
API documentation
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
lib.rs:
Vibrato
Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm.
Examples
use std::fs::File;
use vibrato::{SystemDictionaryBuilder, Tokenizer};
// Loads a set of raw dictionary files
let dict = SystemDictionaryBuilder::from_readers(
    File::open("src/tests/resources/lex.csv")?,
    File::open("src/tests/resources/matrix.def")?,
    File::open("src/tests/resources/char.def")?,
    File::open("src/tests/resources/unk.def")?,
)?;
// Alternatively, load a compiled dictionary:
// let reader = File::open("path/to/system.dic")?;
// let dict = vibrato::Dictionary::read(reader)?;
// Build the tokenizer; each worker holds the per-sentence state
let tokenizer = Tokenizer::new(dict);
let mut worker = tokenizer.new_worker();
worker.reset_sentence("京都東京都");
worker.tokenize();
assert_eq!(worker.num_tokens(), 2);
let t0 = worker.token(0);
assert_eq!(t0.surface(), "京都");
assert_eq!(t0.range_char(), 0..2);
assert_eq!(t0.range_byte(), 0..6);
assert_eq!(t0.feature(), "京都,名詞,固有名詞,地名,一般,*,*,キョウト,京都,*,A,*,*,*,1/5");
let t1 = worker.token(1);
assert_eq!(t1.surface(), "東京都");
assert_eq!(t1.range_char(), 2..5);
assert_eq!(t1.range_byte(), 6..15);
assert_eq!(t1.feature(), "東京都,名詞,固有名詞,地名,一般,*,*,トウキョウト,東京都,*,B,5/9,*,5/9,*");
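The assertions above report each token's position twice, once in characters and once in UTF-8 bytes, and the two differ because each of these kanji occupies three bytes. The std-only sketch below shows how the two ranges relate. The helper `char_range_to_byte_range` is hypothetical and not part of Vibrato's API.

```rust
use std::ops::Range;

/// Converts a range counted in characters into the matching UTF-8 byte range.
/// Returns `None` if the range is reversed or extends past the end of `text`.
fn char_range_to_byte_range(text: &str, chars: Range<usize>) -> Option<Range<usize>> {
    if chars.start > chars.end {
        return None;
    }
    // Byte offset of every character boundary, including the end of the string.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let start = *bounds.get(chars.start)?;
    let end = *bounds.get(chars.end)?;
    Some(start..end)
}
```

For the sentence used above, the character range `2..5` maps to the byte range `6..15`, which can be used to slice the original `&str` directly and yields 東京都.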
Dependencies
~3.5–5.5MB
~93K SLoC