6 releases (breaking)
Version | Released |
---|---|
0.11.0 | Jul 5, 2024 |
0.10.0 | May 25, 2024 |
0.9.0 | May 16, 2024 |
0.7.0 | Apr 12, 2024 |
0.1.0 | Dec 31, 2023 |
#688 in Machine learning
115 downloads per month
Used in rten-generate
89KB, 2K SLoC
rten-text
Library containing text tokenization and related functionality, for preparing inputs and decoding outputs for text models (e.g. BERT).
The functionality is a subset of that found in Hugging Face Tokenizers. In exchange, the crate has fewer dependencies, and none that require C/C++.
lib.rs:
This crate provides text tokenizers for preparing inputs for inference with machine-learning models. It implements popular tokenization methods such as WordPiece (used by BERT) and Byte Pair Encoding (used by GPT-2).
It does not support training new vocabularies and isn't optimized for processing very large volumes of text. If you need a tokenization crate with more complete functionality, see Hugging Face Tokenizers.
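To make the WordPiece method mentioned above concrete, here is a minimal sketch of its greedy longest-match-first lookup, written with the standard library only. It does not use or reproduce the rten-text API: the `wordpiece` function, the toy vocabulary, and the token IDs are all hypothetical, and a real tokenizer would also normalize the text, split it into words, and emit an unknown token instead of returning `None`.

```rust
use std::collections::HashMap;

/// Split one word into WordPiece token IDs using greedy longest-match-first.
/// Pieces that do not start the word carry the conventional "##" prefix.
/// Returns `None` if some part of the word has no matching vocabulary entry.
/// (Hypothetical helper for illustration; not part of rten-text.)
fn wordpiece(word: &str, vocab: &HashMap<&str, u32>) -> Option<Vec<u32>> {
    let chars: Vec<char> = word.chars().collect();
    let mut ids = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;

        // Try the longest remaining substring first, then shrink it.
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece.insert_str(0, "##");
            }
            if let Some(&id) = vocab.get(piece.as_str()) {
                found = Some(id);
                break;
            }
            end -= 1;
        }

        // No vocabulary entry matched at this position: give up on the word.
        ids.push(found?);
        start = end;
    }

    Some(ids)
}

fn main() {
    // Toy vocabulary, assumed for this example only.
    let vocab: HashMap<&str, u32> = [("un", 0), ("##aff", 1), ("##able", 2)]
        .into_iter()
        .collect();

    // "unaffable" is segmented as "un", "##aff", "##able".
    println!("{:?}", wordpiece("unaffable", &vocab)); // prints Some([0, 1, 2])
}
```

Because the lookup only needs a fixed vocabulary, this kind of tokenizer can run at inference time without any training machinery, which matches the scope described above.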
Dependencies
~5–7MB
~150K SLoC