14 releases
Uses new Rust 2024
| new 0.7.1 | Nov 9, 2025 |
|---|---|
| 0.7.0 | Nov 6, 2025 |
| 0.6.13 | Nov 3, 2025 |
| 0.6.7 | Oct 26, 2025 |
#258 in Text processing
918 downloads per month
455KB
10K
SLoC
🎤 vibrato-rkyv: VIterbi-Based acceleRAted TOkenizer with rkyv
Note: This is a fork of the original daac-tools/vibrato modified to use the rkyv serialization framework for significantly faster dictionary loading.
Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm.
Significantly Faster Dictionary Loading with rkyv
vibrato-rkyv utilizes the rkyv zero-copy deserialization framework to achieve a significant speedup in dictionary loading. By memory-mapping the dictionary file, it can be made available for use almost instantaneously.
The benchmark results below compare loading from both uncompressed and zstd-compressed files, demonstrating the performance difference.
CPU: Intel Core i7-14700
OS: WSL2 (Ubuntu 24.04)
Dictionary: UniDic-cwj v3.1.1 (approx. 700MB uncompressed dictionary binary)
Source: The benchmark code is available in the benches directory.
From Uncompressed File (.dic)
The table below compares the performance of loading a dictionary from a pre-decompressed .dic file. The fastest possible speed is achieved with from_path_unchecked, while from_path with LoadMode::TrustCache provides a safe, near-instant alternative.
| Condition | Original vibrato (Read from stream) |
vibrato-rkyv (Memory-mapped) |
Speedup |
|---|---|---|---|
| Cold Start (Safe Cache)¹ | ~42 s | ~1.1 ms | ~38,000x |
| Warm Start (Unchecked)² | ~34 s | ~2.9 µs | ~11,700,000x |
| Warm Start (Safe Cache)³ | ~34 s | ~4.1 µs | ~8,300,000x |
This shows that the safety mechanism of the cache (metadata hashing and file check) adds a minimal overhead of just ~1.2 µs compared to the unsafe version.
¹ Cold Start (Safe Cache): The file is not in the OS page cache, but the application cache (proof file) is valid. This measures the cost of disk I/O.
² Warm Start (Unchecked): The fastest possible scenario using from_path_unchecked. The file is in the OS page cache, and bytechecks are bypassed.
³ Warm Start (Safe Cache): A typical fast reload scenario using LoadMode::TrustCache. The file is in the OS page cache, and minimal validation is performed.
From Zstd-Compressed File (.dic.zst)
| Condition | Original vibrato (Read from stream) |
vibrato-rkyv (with caching) |
Speedup |
|---|---|---|---|
| 1st Run (Cold) | ~4.6 s | ~1.3 s | ~3.5x |
| Subsequent Runs (Cache-hit) | ~4.5 s | ~6.5 μs | ~700,000x |
vibrato-rkyv automatically decompresses and caches the dictionary on the first run, using the memory-mapped cache for subsequent loads.
To take advantage of this performance, use the Dictionary::from_path or Dictionary::from_zstd methods:
use vibrato_rkyv::{Dictionary, LoadMode};
// Recommended for uncompressed dictionaries:
// Almost instantaneous loading via memory-mapping.
let dict_mmap = Dictionary::from_path("path/to/system.dic")?;
// Recommended for zstd-compressed dictionaries:
// Decompresses and caches on the first run, then uses memory-mapping.
let dict_zstd = Dictionary::from_zstd("path/to/system.dic.zst", LoadMode::TrustCache)?;
Differences
The following summarizes key differences from the original implementation.
Differences from Original vibrato
If you are migrating from the original daac-tools/vibrato, please note the following key changes:
- Legacy Dictionary Support (with legacy feature):
vibrato-rkyvis designed for performance with its nativerkyv-based dictionary format. However, to provide flexibility and allow users to leverage a wide range of dictionary assets, it also offers support for thebincode-based format used by the originalvibratowhen thelegacyfeature is enabled.
This enables the use of valuable, pre-existing dictionaries that may only be available in thebincodeformat, such as those trained on proprietary corpora (e.g., BCCWJ).
The library handles different formats:Dictionary::from_path(): Transparently loads both uncompressedrkyvandbincodeformat dictionaries. It automatically detects the format based on the file's content.Dictionary::from_zstd(): When given a Zstandard-compressed dictionary, it provides sophisticated, format-aware caching:- If the dictionary is in the
rkyvformat, it is decompressed and cached for near-instant, memory-mapped access on subsequent loads. - If the dictionary is in the
bincodeformat, it is loaded directly into memory for immediate use. In the background, a process is started to convert it to therkyvformat and create a separate cache. This ensures that while the first load is operational, all future loads benefit from the high-speedrkyvcache.
- If the dictionary is in the
This eliminates the need for manual conversion for most use cases. For users who prefer to convert dictionaries, the compiler transmute command is also available (see Toolchain below).
-
User Dictionaries Must Be Pre-compiled: The
--user-dicruntime option has been removed. User dictionaries must now be compiled into the system dictionary beforehand. This design choice supports the zero-copy, immutable model ofrkyv.
However, this does not mean dictionaries are purely static. While you cannot modify a dictionary after it has been loaded, you can dynamically construct a dictionary in memory (e.g., usingSystemDictionaryBuilder) and create aTokenizerfrom it usingDictionary::from_inner(). This is useful for scenarios where dictionary contents are generated at runtime before tokenization begins. -
New Recommended Loading APIs: For maximum performance, use
Dictionary::from_path()for uncompressed files andDictionary::from_zstd()forzstd-compressed files. These methods leverage memory-mapping and caching for near-instantaneous loading. WhileDictionary::read()is still available for generic readers, it is less efficient.
use vibrato_rkyv::{dictionary::LoadMode, Dictionary};
// Recommended: Zero-copy loading via memory-mapping.
let dict = Dictionary::from_path("path/to/system.dic", LoadMode::TrustCache)?;
Additional Improvements
Beyond the core change to rkyv for faster loading, vibrato-rkyv includes several other significant enhancements over the original implementation:
-
Unified and Enhanced Toolchain (
compiler)
Thetrain,dictgen, andcompileexecutables have been consolidated into a single, more powerfulcompilertool. This simplifies the dictionary creation workflow with a clear subcommand structure (train,dictgen,build). It also adds:full-build: A convenient command to run the entire train-generate-build process in one go.transmute: A utility to convert legacybincode-formatted dictionaries from the originalvibratoto the newrkyvformat.
-
Flexible
Tokenizer
TheTokenizerAPI has been redesigned for better flexibility, resolving a long-standing design limitation (upstream issue #99).- It is now cheaply
Clone-able (internally usingArc<Dictionary>). - New constructors like
Tokenizer::from_inner(DictionaryInner)allow for creating a tokenizer directly from a dynamically built dictionary instance, enhancing flexibility for testing and applications that generate dictionaries on-the-fly.
- It is now cheaply
-
Owned Token Type (
TokenBuf)
A new owned token type,TokenBuf, has been introduced alongside the existing borrowedToken<'a>. Following the familiarPath/PathBufpattern in Rust's standard library. This makes it easy to store tokenization results, modify them, or send them across threads without lifetime complications. -
Built-in Dictionary Downloader and Manager
Getting started is now easier than ever. You can download and set up pre-compiled preset dictionaries (e.g., IPADIC, UNIDIC) with a single function call.Dictionary::from_preset_with_download(): Handles downloading, checksum verification, and caching automatically.Dictionary::from_zstd(): Intelligently manageszstd-compressed dictionaries by decompressing them to a local cache on the first run. It also automatically detects and converts legacybincode-formatted dictionaries (when the legacy feature is enabled), caching them in the modern format in the background for future fast loads.
-
N-best Tokenization (Experimental) An experimental feature for retrieving multiple tokenization candidates, sorted by cost, has been added in response to an upstream feature request (upstream issue #151). The implementation employs an A* search algorithm, which helps handle ambiguity in downstream NLP tasks.
Features
Fast tokenization
Vibrato is a Rust reimplementation of the fast tokenizer MeCab,
although its implementation has been simplified and optimized for even faster tokenization.
Especially for language resources with a large matrix
(e.g., unidic-cwj-3.1.1 with a matrix of 459 MiB),
Vibrato will run faster thanks to cache-efficient id mappings.
For example, the following figure shows an experimental result of tokenization time with MeCab and its reimplementations. The detailed experimental settings and other results are available on Wiki.
MeCab compatibility
Vibrato supports options for outputting tokenized results identical to MeCab, such as ignoring whitespace.
Training parameters
Vibrato also supports training parameters (or costs) in dictionaries from your corpora. The detailed description can be found here.
Basic usage
This software is implemented in Rust.
First of all, install rustc and cargo following the official instructions.
As a Rust Library (Recommended)
The easiest way to get started is by using vibrato-rkyv as a library and downloading a pre-compiled preset dictionary.
1. Add vibrato-rkyv to your dependencies
Add the following to your Cargo.toml. The dictionary download feature is enabled by default.
[dependencies]
vibrato-rkyv = "x.y.z"
2. Download a dictionary and tokenize text
The Dictionary::from_preset_with_download() function handles everything: downloading, verifying the checksum, and caching the dictionary in a specified directory for future runs.
use std::path::PathBuf;
use vibrato_rkyv::{dictionary::PresetDictionaryKind, Dictionary, Tokenizer};
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Specify a directory to cache the dictionary.
let cache_dir = dirs::cache_dir()
.unwrap_or_else(|| PathBuf::from(".cache"))
.join("vibrato-rkyv");
// Downloads and loads a preset dictionary (e.g., IPADIC).
// The dictionary is cached in the specified directory, so subsequent runs are instantaneous.
let dict = Dictionary::from_preset_with_download(
PresetDictionaryKind::Ipadic,
&cache_dir,
)?;
// Create a tokenizer with the loaded dictionary.
let tokenizer = Tokenizer::new(dict);
// A worker holds internal states for tokenization and can be reused.
let mut worker = tokenizer.new_worker();
worker.set_text("本とカレーの街神保町へようこそ。");
worker.tokenize();
// Iterate over tokens.
for token in worker.token_iter() {
println!("{}\t{}", token.surface(), token.feature());
}
Ok(())
}
As a Command-Line Tool
1. Prepare a Dictionary
You need a dictionary file (.dic) compatible with vibrato-rkyv. Use the compiler tool to build a dictionary from your source CSV files.
# Example of compiling a dictionary
$ cargo run --release -p compiler -- build \
--lexicon-in path/to/lex.csv \
--matrix-in path/to/matrix.def \
--char-in path/to/char.def \
--unk-in path/to/unk.def \
--sysdic-out system.dic
2. Tokenize Sentences
Pipe your text to the tokenize command and specify the dictionary path with -i.
$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i path/to/system.dic
The result will be printed in MeCab format. To output tokens separated by spaces, use the -O wakati option.
本 名詞,一般,*,*,*,*,本,ホン,ホン
と 助詞,並立助詞,*,*,*,*,と,ト,ト
カレー 名詞,固有名詞,地域,一般,*,*,カレー,カレー,カレー
の 助詞,連体化,*,*,*,*,の,ノ,ノ
街 名詞,一般,*,*,*,*,街,マチ,マチ
神保 名詞,固有名詞,地域,一般,*,*,神保,ジンボウ,ジンボー
町 名詞,接尾,地域,*,*,*,町,マチ,マチ
へ 助詞,格助詞,一般,*,*,*,へ,ヘ,エ
ようこそ 感動詞,*,*,*,*,*,ようこそ,ヨウコソ,ヨーコソ
。 記号,句点,*,*,*,*,。,。,。
EOS
Advanced Usage
MeCab-compatible Options
Vibrato is a reimplementation of the MeCab algorithm, but its default tokenization results may differ. For example, Vibrato treats spaces as tokens by default, whereas MeCab ignores them.
To get results identical to MeCab, use the -S (ignore spaces) and -M (maximum unknown word length) flags.
# Get MeCab-compatible output
$ echo 'mens second bag' | cargo run --release -p tokenize -- -i path/to/system.dic -S -M 24
mens 名詞,固有名詞,組織,*,*,*,*
second 名詞,一般,*,*,*,*,*
bag 名詞,固有名詞,組織,*,*,*,*
EOS
Note: In rare cases, results may still differ due to tie-breaking in cost calculation.
Using a User Dictionary
IMPORTANT: In vibrato-rkyv, user dictionaries can no longer be specified as a runtime option. They must be compiled into the system dictionary beforehand.
Option: With the compiler full-build command
If you are training a new dictionary, the full-build command is the recommended way to include a user dictionary. It handles the entire pipeline: training, generating source files (including the user lexicon), and building the final binary. Use the --user-lexicon-in option.
$ cargo run --release -p compiler -- full-build \
-t path/to/corpus.txt \
-l path/to/seed_lex.csv \
--user-lexicon-in path/to/my_user_dic.csv \
... # other required arguments
-o ./my_dictionary
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
References
Technical details of Vibrato are available in the following resources:
- 神田峻介, 赤部晃一, 後藤啓介, 小田悠介. 最小コスト法に基づく形態素解析におけるCPUキャッシュの効率化, 言語処理学会第29回年次大会 (NLP2023).
- 赤部晃一, 神田峻介, 小田悠介. CRFに基づく形態素解析器のスコア計算の分割によるモデルサイズと解析速度の調整, 言語処理学会第29回年次大会 (NLP2023).
- MeCab互換な形態素解析器Vibratoの高速化技法, LegalOn Technologies Engineering Blog (2022-09-20).
Dependencies
~10–29MB
~409K SLoC