9 unstable releases (4 breaking)

0.6.2 Mar 7, 2023
0.6.1 Oct 27, 2022
0.6.0 Sep 30, 2022
0.5.1 Jun 20, 2022
0.1.3 Aug 30, 2021

#288 in Text processing


515 downloads per month
Used in vaporetto_tantivy

MIT/Apache

290KB
7K SLoC

vaporetto_rules

Vaporetto is a fast and lightweight tokenizer based on pointwise prediction. vaporetto_rules provides rule-based filters for Vaporetto.
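
To make the idea of a rule-based string filter concrete, here is a minimal, std-only sketch of a full-width normalizer. The function `to_fullwidth` is a hypothetical illustration and is not part of vaporetto_rules; it assumes that such a filter maps printable ASCII to the Unicode full-width forms, which may differ from what the crate's own filters do.

```rust
/// Hypothetical stand-in for a string pre-filter (not the crate's code).
/// Maps printable ASCII (U+0021..=U+007E) to the corresponding
/// full-width forms (U+FF01..=U+FF5E); other characters pass through.
fn to_fullwidth(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            // The full-width block sits at a fixed offset of 0xFEE0.
            '!'..='~' => char::from_u32(c as u32 + 0xFEE0).unwrap_or(c),
            _ => c,
        })
        .collect()
}

fn main() {
    // Chain filters by folding over a list, one filter after another.
    let pre_filters: Vec<Box<dyn Fn(String) -> String>> =
        vec![Box::new(|s: String| to_fullwidth(&s))];

    let input = "Vaporetto1".to_string();
    let normalized = pre_filters.iter().fold(input, |s, f| f(s));
    println!("{}", normalized);
}
```

A filter of this kind is a pure function from string to string, so several of them can be chained with a fold before the text reaches the tokenizer.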

Examples

use std::fs::File;
use std::io::BufReader;

use vaporetto::{CharacterType, Model, Predictor, Sentence};
use vaporetto_rules::{
    SentenceFilter, StringFilter,
    sentence_filters::{ConcatGraphemeClustersFilter, KyteaWsConstFilter},
    string_filters::KyteaFullwidthFilter,
};

// Load a trained model from disk and build a predictor from it.
let mut f = BufReader::new(File::open("model.bin").unwrap());
let model = Model::read(&mut f).unwrap();
let mut predictor = Predictor::new(model, false).unwrap();

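// String filters run on the raw text before prediction.
let pre_filters: Vec<Box<dyn StringFilter<String>>> = vec![
    Box::new(KyteaFullwidthFilter),
];
// Sentence filters adjust the predicted sentence after prediction.
let post_filters: Vec<Box<dyn SentenceFilter>> = vec![
    Box::new(ConcatGraphemeClustersFilter),
    Box::new(KyteaWsConstFilter::new(CharacterType::Digit)),
];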

let input = "Vaporettoは仲良し家族👨‍👨‍👧‍👦を離れ離れにさせません。"
    .to_string();

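// Apply each string filter in order to the input text.
let preproc_input = pre_filters.iter().fold(input, |s, filter| filter.filter(s));

// Tokenize the preprocessed text.
let mut sentence = Sentence::from_raw(preproc_input).unwrap();
predictor.predict(&mut sentence);

// Apply each sentence filter in place to the predicted result.
post_filters.iter().for_each(|filter| filter.filter(&mut sentence));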

let mut buf = String::new();
sentence.write_tokenized_text(&mut buf);
assert_eq!(
    "Vaporetto は 仲良 し 家族 👨‍👨‍👧‍👦 を 離れ離れ に さ せ ま せ ん 。",
    buf,
);
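
The post-filtering step in the example above can be pictured with a simplified, std-only model of a tokenized sentence. Everything in this sketch (`Segmented`, `merge_digit_runs`, `render`) is hypothetical and is not the vaporetto API; it assumes that a sentence filter for a character type removes token boundaries inside runs of that type, here ASCII digits.

```rust
/// Hypothetical model of a tokenized sentence (not vaporetto's `Sentence`).
/// `boundaries[i]` is true when a token boundary lies between
/// `chars[i]` and `chars[i + 1]`, so it holds `chars.len() - 1` entries.
struct Segmented {
    chars: Vec<char>,
    boundaries: Vec<bool>,
}

/// Clears every boundary that falls between two ASCII digits,
/// so that a run of digits always ends up in a single token.
fn merge_digit_runs(s: &mut Segmented) {
    for i in 0..s.boundaries.len() {
        if s.chars[i].is_ascii_digit() && s.chars[i + 1].is_ascii_digit() {
            s.boundaries[i] = false;
        }
    }
}

/// Writes the tokens separated by single spaces.
fn render(s: &Segmented) -> String {
    let mut out = String::new();
    for (i, &c) in s.chars.iter().enumerate() {
        out.push(c);
        if s.boundaries.get(i).copied().unwrap_or(false) {
            out.push(' ');
        }
    }
    out
}

fn main() {
    // Pretend a predictor split the digit run "123" into "1" and "23".
    let mut sentence = Segmented {
        chars: "第123回".chars().collect(),
        boundaries: vec![true, true, false, true],
    };
    println!("{}", render(&sentence));

    merge_digit_runs(&mut sentence);
    println!("{}", render(&sentence));
}
```

Because the rule only edits boundaries and never the characters themselves, it can be applied after prediction without re-running the model.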

License

Licensed under either of the Apache License, Version 2.0, or the MIT license, at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
