
sentencepiece

Binding for the sentencepiece tokenizer

18 releases

0.9.0 Aug 11, 2022
0.8.2 Jul 30, 2022
0.8.1 Jan 4, 2022
0.8.0 Jul 11, 2021
0.1.3 Feb 7, 2020

481 downloads per month
Used in 5 crates (2 directly)

MIT/Apache

2MB
26K SLoC

C++ 24K SLoC // 0.1% comments
Rust 1.5K SLoC // 0.0% comments
Shell 4 SLoC

sentencepiece

This Rust crate is a binding for the sentencepiece unsupervised text tokenizer. The crate documentation is available online.

libsentencepiece dependency

This crate depends on the sentencepiece C++ library. By default, this dependency is treated as follows:

  • If sentencepiece can be found with pkg-config, the crate links against the library found through pkg-config. Warning: dynamic linking only works correctly with sentencepiece 0.1.95 or later, due to a bug in earlier versions.
  • Otherwise, the crate's build script will do a static build of the sentencepiece library. This requires that cmake is available.

If you wish to override this behavior, the sentencepiece-sys crate offers two features:

  • system: always attempt to link to the sentencepiece library found with pkg-config.
  • static: always do a static build of the sentencepiece library and link against that.
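
The selection rules above can be summarized as a small decision function. The following is only an illustrative sketch, not the actual build script of sentencepiece-sys: the names FeatureOverride, LinkStrategy, and choose_strategy are hypothetical. It also assumes that the two features are mutually exclusive, and that enabling system when pkg-config cannot find the library is treated as an error. The text above does not state either point.

```rust
/// Hypothetical model of which feature, if any, is enabled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FeatureOverride {
    /// Neither feature is enabled: use the default behavior.
    Default,
    /// The `system` feature is enabled.
    System,
    /// The `static` feature is enabled.
    Static,
}

/// Hypothetical model of how the sentencepiece library ends up linked.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LinkStrategy {
    /// Link against the library found through pkg-config.
    PkgConfig,
    /// Build sentencepiece from source (requires cmake) and link statically.
    StaticBuild,
}

/// Pick a link strategy from the enabled feature and the pkg-config result.
pub fn choose_strategy(
    feature: FeatureOverride,
    pkg_config_found: bool,
) -> Result<LinkStrategy, String> {
    match (feature, pkg_config_found) {
        // `static`: always build and link statically, ignoring pkg-config.
        (FeatureOverride::Static, _) => Ok(LinkStrategy::StaticBuild),
        // `system`: always go through pkg-config.
        (FeatureOverride::System, true) => Ok(LinkStrategy::PkgConfig),
        // Assumption: a failed lookup with `system` enabled is an error.
        (FeatureOverride::System, false) => {
            Err("system feature enabled, but pkg-config found no sentencepiece".to_string())
        }
        // Default: prefer pkg-config, otherwise fall back to a static build.
        (FeatureOverride::Default, true) => Ok(LinkStrategy::PkgConfig),
        (FeatureOverride::Default, false) => Ok(LinkStrategy::StaticBuild),
    }
}
```

In this model, calling choose_strategy(FeatureOverride::Default, false) yields LinkStrategy::StaticBuild, which corresponds to the cmake fallback described in the default behavior.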
