sentencepiece

Binding for the sentencepiece tokenizer

22 releases

0.11.2 Jul 22, 2023
0.11.1 Mar 19, 2023
0.10.0 Oct 11, 2022
0.8.2 Jul 30, 2022
0.1.3 Feb 7, 2020

1,998 downloads per month
Used in 9 crates (5 directly)

MIT/Apache

2MB
26K SLoC

C++ 24K SLoC // 0.1% comments
Rust 1K SLoC // 0.0% comments
Bitbake 370 SLoC // 0.5% comments
Shell 4 SLoC

This crate provides Rust bindings for the sentencepiece library, an unsupervised text tokenizer.

The main data structure of this crate is SentencePieceProcessor, which is used to tokenize sentences:

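use sentencepiece::SentencePieceProcessor;

// Load a trained sentencepiece model from disk.
let spp = SentencePieceProcessor::open("testdata/toy.model").unwrap();
// Encode the sentence and keep only the piece string of each result.
let pieces = spp.encode("I saw a girl with a telescope.").unwrap()
  .into_iter().map(|p| p.piece).collect::<Vec<_>>();
// A leading "▁" (U+2581) marks a piece that starts a new word.
assert_eq!(pieces, vec!["▁I", "▁saw", "▁a", "▁girl", "▁with",
  "▁a", "▁t", "el", "es", "c", "o", "pe", "."]);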
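
The pieces in the example above use "▁" (U+2581) in place of a space, so the original sentence can be recovered by concatenating the pieces and turning each marker back into a space. The following is a minimal sketch of that idea using only the standard library. The helper name join_pieces is hypothetical and is not part of this crate. It assumes the default sentencepiece convention, in which a marker is also added before the first word, and real code should prefer the decoding methods of the processor itself.

```rust
/// Hypothetical helper (not part of the sentencepiece crate): rebuilds text
/// from pieces by concatenating them and mapping the U+2581 marker to a space.
fn join_pieces(pieces: &[&str]) -> String {
    // Concatenate all pieces without any separator.
    let joined: String = pieces.concat();
    // Replace each word-boundary marker with a space, then drop the
    // leading space produced by the marker on the first piece.
    joined.replace('\u{2581}', " ").trim_start().to_string()
}

fn main() {
    // Pieces taken from the encoding example above.
    let pieces = ["▁I", "▁saw", "▁a", "▁girl", "▁with",
        "▁a", "▁t", "el", "es", "c", "o", "pe", "."];
    let text = join_pieces(&pieces);
    println!("{}", text);
}
```

The sketch ignores normalization and special tokens, so it illustrates the piece format only and does not replace the library's decoder.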

Dependencies