#japanese #analyzer #tokenizer #morphological

no-std vaporetto

Vaporetto: a pointwise prediction based tokenizer

16 releases

0.6.3 Apr 1, 2023
0.6.2 Mar 7, 2023
0.6.1 Oct 27, 2022
0.5.1 Jun 20, 2022
0.2.0 Nov 1, 2021

#240 in Text processing


2,897 downloads per month
Used in 2 crates


6.5K SLoC


Vaporetto is a fast and lightweight pointwise prediction based tokenizer.
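Pointwise prediction treats tokenization as an independent binary classification at each gap between characters: a linear model scores features drawn from the surrounding characters, and a positive score marks a word boundary. A minimal, self-contained sketch of the idea (this is not Vaporetto's actual implementation; the single-bigram feature and the hand-set weights are purely illustrative):

```rust
use std::collections::HashMap;

/// Splits `text` by scoring each character boundary with a linear model.
/// Here the only feature is the character bigram spanning the boundary;
/// an unseen bigram gets a default negative score (no split).
fn tokenize(text: &str, weights: &HashMap<String, i32>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = Vec::new();
    let mut start = 0;
    for i in 1..chars.len() {
        // Score the boundary between chars[i - 1] and chars[i].
        let bigram: String = chars[i - 1..=i].iter().collect();
        let score = *weights.get(&bigram).unwrap_or(&-1);
        if score > 0 {
            // Positive score: close the current token at this boundary.
            tokens.push(chars[start..i].iter().collect());
            start = i;
        }
    }
    tokens.push(chars[start..].iter().collect());
    tokens
}

fn main() {
    // Hypothetical trained weights: a positive weight means "split here".
    let mut weights = HashMap::new();
    weights.insert("ab".to_string(), 2);
    let tokens = tokenize("aab", &weights);
    assert_eq!(tokens, vec!["aa", "b"]);
    println!("{:?}", tokens);
}
```

Because each boundary is scored independently, prediction is embarrassingly parallel and needs no dynamic programming over whole-sentence label sequences, which is what makes the pointwise approach fast.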


use std::fs::File;

use vaporetto::{Model, Predictor, Sentence};

let f = File::open("../resources/model.bin")?;
let model = Model::read(f)?;
let predictor = Predictor::new(model, true)?;

let mut buf = String::new();

let mut s = Sentence::default();

s.update_raw("まぁ社長は火星猫だ")?;
predictor.predict(&mut s);
s.fill_tags();
s.write_tokenized_text(&mut buf);
assert_eq!(
    "まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ",
    buf.as_str(),
);

s.update_raw("まぁ良いだろう")?;
predictor.predict(&mut s);
s.fill_tags();
s.write_tokenized_text(&mut buf);
assert_eq!(
    "まぁ/副詞/マー 良い/形容詞/ヨイ だろう/助動詞/ダロー",
    buf.as_str(),
);

Feature flags

The following features are disabled by default:

  • kytea - Enables the reader for models generated by KyTea.
  • train - Enables the trainer.
  • portable-simd - Uses the portable SIMD API instead of our SIMD-conscious data layout. (Nightly Rust is required.)

The following features are enabled by default:

  • std - Uses the standard library. If disabled, it uses the core library instead.
  • cache-type-score - Enables caching type scores for faster processing. If disabled, type scores are calculated in a straightforward manner.
  • fix-weight-length - Uses fixed-size arrays for storing scores to facilitate optimization. If disabled, vectors are used instead.
  • tag-prediction - Enables tag prediction.
  • charwise-pma - Uses the Charwise Daachorse instead of the standard version for faster prediction, although it can slow down model loading.
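
Features are selected at build time in Cargo.toml. A sketch (the version string is illustrative) enabling the KyTea reader on top of the defaults, with a commented-out variant for no-std builds:

```toml
[dependencies]
# Enable the KyTea model reader in addition to the default features:
vaporetto = { version = "0.6", features = ["kytea"] }

# Or, for no-std builds, drop the default `std` feature:
# vaporetto = { version = "0.6", default-features = false }
```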

Notes for distributed models

The distributed models are compressed in the zstd format. If you want to load these compressed models, you must decompress them outside of the API.

// Requires zstd crate or ruzstd crate
let reader = zstd::Decoder::new(File::open("path/to/model.bin.zst")?)?;
let model = Model::read(reader)?;

You can also decompress the file beforehand with the unzstd command, which ships with the zstd package on most Linux distributions.


Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.


Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.


~40K SLoC