9 breaking releases

| Version | Date |
|---|---|
| 0.15.0 | Dec 28, 2024 |
| 0.14.0 | Oct 27, 2024 |
| 0.13.0 | Aug 24, 2024 |
| 0.12.0 | Jul 30, 2024 |
| 0.1.0 | Dec 31, 2023 |
rten-text
Library providing text tokenization and related functionality for preparing inputs and decoding outputs for text models (e.g. BERT).
The functionality is a subset of that found in Hugging Face Tokenizers, but with fewer dependencies and none that require C/C++.
This crate provides tokenizers for encoding text into token IDs for model inputs and decoding output token IDs back into text. The tokenization process follows the pipeline used by the Hugging Face Tokenizers library. Tokenizers can either be constructed manually or loaded from Hugging Face tokenizer.json files.
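To make the encoding stage of the pipeline concrete, here is a toy greedy longest-match tokenizer in plain Rust. The vocabulary, token IDs, and matching rule are invented for illustration only; they do not reflect rten-text's actual API or the BPE/WordPiece models that real tokenizers use.

```rust
// Toy vocabulary mapping token strings to token IDs. Real tokenizers
// load this mapping from a tokenizer.json file; these entries are
// invented for illustration.
const VOCAB: &[(&str, u32)] = &[
    ("token", 0),
    ("ize", 1),
    ("some", 2),
    ("text", 3),
    (" ", 4),
];

// Encode text by repeatedly taking the longest vocabulary entry that
// prefixes the remaining input. Returns None if some part of the input
// cannot be covered by the vocabulary.
fn encode(text: &str) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        let &(tok, id) = VOCAB
            .iter()
            .filter(|entry| rest.starts_with(entry.0))
            .max_by_key(|entry| entry.0.len())?;
        ids.push(id);
        rest = &rest[tok.len()..];
    }
    Some(ids)
}

fn main() {
    // "some text" splits into "some", " ", "text".
    println!("{:?}", encode("some text")); // Some([2, 4, 3])
}
```

Real pipelines add normalization and pre-tokenization stages before this matching step, but the core idea — mapping substrings to integer IDs via a fixed vocabulary — is the same.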
Comparison to tokenizers crate
The canonical implementation of this tokenization pipeline is the tokenizers crate. The main differences compared to that crate are:
- rten-text focuses on inference only and does not support training tokenizers.
- rten-text is a pure Rust library with no dependencies written in C/C++. This means it is easy to build for WebAssembly and other targets where non-Rust dependencies may cause difficulties.
- rten-text is integrated with the rten-generate library which handles running the complete inference loop for auto-regressive transformer models. Note that you can use rten-generate's outputs with other tokenizer libraries if rten-text is not suitable.
- Not all tokenizer features are currently implemented in rten-text. Please file an issue if you find that rten-text is missing a feature needed for a particular model's tokenizer.
Loading a pre-trained tokenizer
The main entry point is the Tokenizer type. Use Tokenizer::from_file or Tokenizer::from_json to construct a tokenizer from a tokenizer.json file.
Encoding text
The Tokenizer::encode method encodes text into token IDs. This can be used, for example, to encode a model's prompt:
```rust
use rten_text::Tokenizer;

let tokenizer = Tokenizer::from_file("gpt2/tokenizer.json")?;
let encoded = tokenizer.encode("some text to tokenize", None)?;
let token_ids = encoded.token_ids(); // Sequence of token IDs
```
Decoding text
Given token IDs generated by a model, you can decode them back into text using the Tokenizer::decode method:
```rust
use rten_text::Tokenizer;

let tokenizer = Tokenizer::from_file("gpt2/tokenizer.json")?;
// Run model and get token IDs from outputs...
let token_ids = [101, 4256, 300];
let text = tokenizer.decode(&token_ids)?;
```
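Conceptually, decoding is a reverse vocabulary lookup followed by concatenation. The sketch below illustrates this with an invented ID-to-token table (the token strings are made up and do not correspond to any real model's vocabulary); real decoders additionally undo byte-level encodings and handle special tokens.

```rust
// Toy ID-to-token table. In practice this mapping comes from the
// model's tokenizer.json file; these entries are invented.
const TOKENS: &[(u32, &str)] = &[(101, "hello"), (4256, " "), (300, "world")];

// Look each ID up in the table and concatenate the token strings.
// Returns None if any ID is not in the vocabulary.
fn decode(ids: &[u32]) -> Option<String> {
    ids.iter()
        .map(|id| TOKENS.iter().find(|&&(i, _)| i == *id).map(|&(_, t)| t))
        .collect::<Option<Vec<&str>>>()
        .map(|toks| toks.concat())
}

fn main() {
    println!("{}", decode(&[101, 4256, 300]).unwrap()); // prints "hello world"
}
```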
More examples
See the rten-examples crate for various examples showing how to use this crate as part of an end-to-end pipeline.
Dependencies: ~4.5–6.5MB, ~140K SLoC