bytepiece_rs

The Bytepiece Tokenizer Implemented in Rust

7 releases

0.2.2 Nov 12, 2023
0.2.1 Oct 17, 2023
0.1.0 Sep 20, 2023
0.0.3 Sep 20, 2023

Used in bytepiece

MIT license

1MB
335 lines

bytepiece-rs

Usage

use bytepiece_rs::Tokenizer;

let tokenizer = Tokenizer::new();
// or load a custom model
let tokenizer = Tokenizer::load_from("/path/to/model");
let text = "今天天气不错";
let ids = tokenizer.encode(text, false, false, 0.0); // last argument is alpha
assert_eq!(ids, vec![40496, 45268, 39432]);
let text2 = tokenizer.decode(ids);
assert_eq!(text2, text);
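
The encode/decode round trip above can be illustrated with a toy model. The following is a minimal sketch, not the bytepiece_rs implementation: it assumes a unigram-style vocabulary over raw bytes and picks the highest-scoring segmentation with a Viterbi pass. `ToyTokenizer`, its scores, and its ids are hypothetical and unrelated to the crate's real model or API.

```rust
use std::collections::HashMap;

/// Toy byte-level unigram tokenizer (hypothetical; NOT the bytepiece_rs implementation).
struct ToyTokenizer {
    pieces: Vec<Vec<u8>>,           // id -> piece bytes
    scores: Vec<f64>,               // id -> log-probability-like score
    index: HashMap<Vec<u8>, usize>, // piece bytes -> id
    max_len: usize,                 // longest piece, in bytes
}

impl ToyTokenizer {
    /// Ids 0..=255 are the single bytes, so every input can be encoded.
    /// Extra multi-byte pieces get ids 256, 257, ... in the order given.
    fn new(extra: &[(&str, f64)]) -> Self {
        let mut pieces: Vec<Vec<u8>> = (0..=255u8).map(|b| vec![b]).collect();
        let mut scores = vec![-10.0; 256];
        for (piece, score) in extra {
            pieces.push(piece.as_bytes().to_vec());
            scores.push(*score);
        }
        let index = pieces.iter().cloned().enumerate().map(|(i, p)| (p, i)).collect();
        let max_len = pieces.iter().map(|p| p.len()).max().unwrap_or(1);
        Self { pieces, scores, index, max_len }
    }

    /// Viterbi search for the segmentation with the highest total score.
    fn encode(&self, text: &str) -> Vec<usize> {
        let bytes = text.as_bytes();
        let n = bytes.len();
        // best[i] = (best score for bytes[..i], id of the last piece used)
        let mut best = vec![(f64::NEG_INFINITY, 0usize); n + 1];
        best[0] = (0.0, 0);
        for end in 1..=n {
            for start in end.saturating_sub(self.max_len)..end {
                if let Some(&id) = self.index.get(&bytes[start..end]) {
                    let score = best[start].0 + self.scores[id];
                    if score > best[end].0 {
                        best[end] = (score, id);
                    }
                }
            }
        }
        // Walk backwards from the end to recover the chosen pieces.
        let mut ids = Vec::new();
        let mut pos = n;
        while pos > 0 {
            let id = best[pos].1;
            ids.push(id);
            pos -= self.pieces[id].len();
        }
        ids.reverse();
        ids
    }

    /// Concatenate the piece bytes and interpret them as UTF-8.
    fn decode(&self, ids: &[usize]) -> String {
        let bytes: Vec<u8> = ids
            .iter()
            .flat_map(|&id| self.pieces[id].iter().copied())
            .collect();
        String::from_utf8_lossy(&bytes).into_owned()
    }
}

fn main() {
    let tok = ToyTokenizer::new(&[("今天", -2.0), ("天气", -2.0), ("不错", -2.0)]);
    let text = "今天天气不错";
    let ids = tok.encode(text);
    assert_eq!(ids, vec![256, 257, 258]);
    assert_eq!(tok.decode(&ids), text);
}
```

Because every single byte is in the toy vocabulary, any input can be encoded, and decoding the ids returns the original text.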

Benchmark & Test

cargo test
cargo bench -- --plotting-backend gnuplot

Dependencies

~6–13MB
~148K SLoC