22 releases

Version | Released |
---|---|
0.11.2 | Jul 22, 2023 |
0.11.1 | Mar 19, 2023 |
0.10.0 | Oct 11, 2022 |
0.8.2 | Jul 30, 2022 |
0.1.3 | Feb 7, 2020 |
sentencepiece
This Rust crate is a binding for the sentencepiece unsupervised text tokenizer. The crate documentation is available online.
libsentencepiece dependency
This crate depends on the sentencepiece C++ library. By default, this dependency is treated as follows:
- If sentencepiece can be found with pkg-config, the crate links against the library found through pkg-config. Warning: dynamic linking only works correctly with sentencepiece 0.1.95 or later, due to a bug in earlier versions.
- Otherwise, the crate's build script does a static build of the sentencepiece library. This requires that cmake is available.
If you wish to override this behavior, the sentencepiece-sys crate offers two features:
- system: always attempt to link to the sentencepiece library found with pkg-config.
- static: always do a static build of the sentencepiece library and link against that.
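The selection rules above can be summarized as a small decision function. This is only an illustrative sketch, not the build script of sentencepiece-sys: the names `FeatureOverride`, `LinkPlan`, and `choose_link_plan` are hypothetical, and treating a missing library under the system feature as an error is an assumption.

```rust
/// Hypothetical model of the feature override; not a type from sentencepiece-sys.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FeatureOverride {
    Unset,  // neither feature enabled: default behavior
    System, // the `system` feature
    Static, // the `static` feature
}

/// Hypothetical outcome of the decision.
#[derive(Debug, PartialEq)]
enum LinkPlan {
    SystemLibrary, // link against the library found through pkg-config
    StaticBuild,   // build sentencepiece from source (needs cmake) and link statically
}

/// Picks a link strategy from the override and the pkg-config lookup result.
fn choose_link_plan(
    feature: FeatureOverride,
    pkg_config_found: bool,
) -> Result<LinkPlan, String> {
    match (feature, pkg_config_found) {
        // `static` always builds from source, whatever pkg-config reports.
        (FeatureOverride::Static, _) => Ok(LinkPlan::StaticBuild),
        // `system` always tries pkg-config.
        (FeatureOverride::System, true) => Ok(LinkPlan::SystemLibrary),
        // Assumption: with `system`, a missing library is a hard error.
        (FeatureOverride::System, false) => {
            Err("sentencepiece not found through pkg-config".to_string())
        }
        // Default: prefer the system library, otherwise fall back to a static build.
        (FeatureOverride::Unset, true) => Ok(LinkPlan::SystemLibrary),
        (FeatureOverride::Unset, false) => Ok(LinkPlan::StaticBuild),
    }
}
```

Under these assumptions, the default configuration never fails for lack of a system library; it only changes how the library is obtained.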
lib.rs:
This crate binds the sentencepiece library. sentencepiece is an unsupervised text tokenizer.
The main data structure of this crate is SentencePieceProcessor, which is used to tokenize sentences:

    use sentencepiece::SentencePieceProcessor;

    fn main() {
        // Load a trained sentencepiece model from disk.
        let spp = SentencePieceProcessor::open("testdata/toy.model").unwrap();
        // Encode a sentence and keep only the piece strings.
        let pieces = spp.encode("I saw a girl with a telescope.").unwrap()
            .into_iter().map(|p| p.piece).collect::<Vec<_>>();
        assert_eq!(pieces, vec!["▁I", "▁saw", "▁a", "▁girl", "▁with",
            "▁a", "▁t", "el", "es", "c", "o", "pe", "."]);
    }
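In the pieces above, the "▁" character (U+2581) marks a position where the original text had a space. The following sketch shows how such pieces map back to the sentence. `join_pieces` is a hypothetical hand-written helper, not part of the crate's API, and it assumes the model adds a single leading "▁" to the input, as the output above suggests.

```rust
/// Hypothetical helper: rebuilds text from sentencepiece-style piece strings.
/// Not part of the sentencepiece crate.
fn join_pieces(pieces: &[&str]) -> String {
    // Concatenate the pieces, then turn each U+2581 marker back into a space.
    let text = pieces.concat().replace('\u{2581}', " ");
    // Assumption: the first marker is a prefix added by the tokenizer, so drop one space.
    text.strip_prefix(' ').unwrap_or(&text).to_string()
}
```

Applied to the pieces from the example, this returns the original sentence, with "▁t", "el", "es", "c", "o", "pe" merging back into "telescope".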