22 releases

0.11.2 Jul 22, 2023
0.11.1 Mar 19, 2023
0.10.0 Oct 11, 2022
0.8.2 Jul 30, 2022
0.1.3 Feb 7, 2020

#117 in Machine learning

9,412 downloads per month
Used in 6 crates (3 directly)

MIT/Apache

2MB
26K SLoC

C++ 24K SLoC // 0.1% comments
Rust 1K SLoC // 0.0% comments
Bitbake 370 SLoC // 0.5% comments
Shell 4 SLoC

This crate provides Rust bindings to the sentencepiece library, an unsupervised text tokenizer.

The main data structure of this crate is SentencePieceProcessor, which is used to tokenize sentences:

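use sentencepiece::SentencePieceProcessor;

// Load a trained sentencepiece model from disk.
let spp = SentencePieceProcessor::open("testdata/toy.model").unwrap();
// Encode a sentence and keep only the string form of each piece.
let pieces = spp.encode("I saw a girl with a telescope.").unwrap()
  .into_iter().map(|p| p.piece).collect::<Vec<_>>();
// "▁" (U+2581) marks where a space preceded the piece in the input.
assert_eq!(pieces, vec!["▁I", "▁saw", "▁a", "▁girl", "▁with",
  "▁a", "▁t", "el", "es", "c", "o", "pe", "."]);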
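
The pieces in the assertion above carry their own whitespace: sentencepiece escapes each space as "▁" (U+2581), so the original text can be recovered by concatenating the pieces and turning the marker back into a space. The following is a minimal sketch of that idea using only the standard library. The detokenize function is a hypothetical helper written for illustration, not part of this crate's API, and it assumes the default behavior in which a marker is also prepended to the start of the sentence.

```rust
/// Hypothetical helper (not part of the sentencepiece crate):
/// rebuilds text from pieces by undoing the "▁" whitespace escaping.
fn detokenize(pieces: &[&str]) -> String {
    // Concatenate all pieces without any separator.
    let joined: String = pieces.concat();
    // Turn every U+2581 marker back into a space, then drop the
    // leading space produced by the marker on the first piece.
    joined.replace('\u{2581}', " ").trim_start().to_string()
}

fn main() {
    // Pieces copied from the example above.
    let pieces = ["▁I", "▁saw", "▁a", "▁girl", "▁with",
        "▁a", "▁t", "el", "es", "c", "o", "pe", "."];
    let text = detokenize(&pieces);
    assert_eq!(text, "I saw a girl with a telescope.");
    println!("{}", text);
}
```

This is why "▁t", "el", "es", "c", "o", "pe" join into the single word "telescope": only the first piece of a word carries the marker.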

Dependencies