8 releases (4 breaking)

0.19.1 Mar 7, 2023
0.19.0 Jan 27, 2023
0.6.1 Oct 27, 2022
0.6.0 Sep 30, 2022
0.3.0 Feb 14, 2022

#805 in Text processing

72 downloads per month

MIT/Apache

310KB
7.5K SLoC

vaporetto_tantivy

Vaporetto is a fast and lightweight tokenizer based on pointwise prediction. vaporetto_tantivy is a crate that lets you use Vaporetto as a tokenizer in Tantivy.
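
To make "pointwise prediction" concrete: each boundary between two adjacent characters is scored on its own from local features, and the text is split wherever the score is positive. The block below is only a toy sketch of that idea using the standard library. It is not Vaporetto's model format or API. `ToyModel`, its single character-pair feature, and the weights are hypothetical and chosen for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical toy model: one weight per (left char, right char) pair.
/// Real pointwise models use richer features around each boundary;
/// this sketch keeps a single feature to show the decision rule.
pub struct ToyModel {
    weights: HashMap<(char, char), i32>,
    bias: i32,
}

impl ToyModel {
    /// Builds a model from a bias and a list of pair weights.
    pub fn new(bias: i32, pairs: &[((char, char), i32)]) -> Self {
        Self {
            weights: pairs.iter().copied().collect(),
            bias,
        }
    }

    /// Scores the boundary between `left` and `right` independently
    /// of every other boundary in the text.
    fn score(&self, left: char, right: char) -> i32 {
        self.bias + self.weights.get(&(left, right)).copied().unwrap_or(0)
    }

    /// Splits `text` at every boundary whose score is positive.
    pub fn tokenize(&self, text: &str) -> Vec<String> {
        let chars: Vec<char> = text.chars().collect();
        let mut tokens = Vec::new();
        let mut current = String::new();
        for (i, &c) in chars.iter().enumerate() {
            current.push(c);
            let is_last = i + 1 == chars.len();
            // The end of the text always closes the current token.
            if is_last || self.score(c, chars[i + 1]) > 0 {
                tokens.push(std::mem::take(&mut current));
            }
        }
        tokens
    }
}
```

With a bias of -1 and a weight of 2 on the pairs ('京', '都'), ('都', 'に'), and ('に', '行'), calling `tokenize("東京都に行く")` splits only at those three boundaries. Because each decision depends only on local context, the boundaries can be evaluated in a single pass over the text.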

Example

use std::fs::File;
use std::io::{Read, BufReader};

use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::Index;
use vaporetto::Model;
use vaporetto_tantivy::VaporettoTokenizer;

// Build a schema whose "title" field is indexed with the tokenizer named "ja_vaporetto".
let mut schema_builder = Schema::builder();
let text_field_indexing = TextFieldIndexing::default()
    .set_tokenizer("ja_vaporetto")
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
    .set_indexing_options(text_field_indexing)
    .set_stored();
schema_builder.add_text_field("title", text_options);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);

// Load the zstd-compressed model file and decompress it into memory.
let mut f = BufReader::new(File::open("bccwj-suw+unidic.model.zst").unwrap());
let mut decoder = ruzstd::StreamingDecoder::new(&mut f).unwrap();
let mut buff = vec![];
decoder.read_to_end(&mut buff).unwrap();
let model = Model::read(&mut buff.as_slice()).unwrap();

// Create a VaporettoTokenizer with wsconst=DGR, i.e. without splitting inside
// digits (D), grapheme clusters (G), or Roman letters (R).
let tokenizer = VaporettoTokenizer::new(model, "DGR").unwrap();
// Register the tokenizer under the same name that the schema refers to.
index
    .tokenizers()
    .register("ja_vaporetto", tokenizer);

Dependencies

~16–45MB
~743K SLoC