
rten-text

A library providing text tokenization and related functionality for preparing inputs and decoding outputs of text models (e.g. BERT).

The functionality is a subset of that found in Hugging Face Tokenizers: fewer features, but also fewer dependencies, and none that require C/C++.


lib.rs:

This crate provides tokenizers for encoding text into token IDs for model inputs and decoding output token IDs back into text.

The tokenization process follows the pipeline used by the Hugging Face Tokenizers library. Tokenizers can either be constructed manually or loaded from Hugging Face tokenizer.json files.

Comparison to tokenizers crate

The canonical implementation of this tokenization pipeline is the tokenizers crate. The main differences compared to that crate are:

  • rten-text focuses on inference only and does not support training tokenizers.
  • rten-text is a pure Rust library with no dependencies written in C/C++. This means it is easy to build for WebAssembly and other targets where non-Rust dependencies may cause difficulties.
  • rten-text is integrated with the rten-generate library which handles running the complete inference loop for auto-regressive transformer models. Note that you can use rten-generate's outputs with other tokenizer libraries if rten-text is not suitable.
  • Not all tokenizer features are currently implemented in rten-text. Please file an issue if you find that rten-text is missing a feature needed for a particular model's tokenizer.

Loading a pre-trained tokenizer

The main entry point is the Tokenizer type. Use Tokenizer::from_file or Tokenizer::from_json to construct a tokenizer from a tokenizer.json file.
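
Tokenizer::from_json is useful when the tokenizer.json content is already in memory, e.g. fetched over the network or embedded in the binary. A minimal sketch, under the assumption that from_json accepts the JSON document as a string and that from_file is a convenience wrapper which reads the file first:

use std::fs;
use rten_text::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumption: `from_json` takes the tokenizer.json content as a string.
    let json = fs::read_to_string("gpt2/tokenizer.json")?;
    let _tokenizer = Tokenizer::from_json(&json)?;
    Ok(())
}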

Encoding text

The Tokenizer::encode method encodes text into token IDs. This can be used, for example, to encode a model's prompt:

use rten_text::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_file("gpt2/tokenizer.json")?;
    let encoded = tokenizer.encode("some text to tokenize", None)?;
    let token_ids = encoded.token_ids(); // Sequence of token IDs
    println!("{:?}", token_ids);
    Ok(())
}
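
Some models, such as BERT in question-answering setups, consume a pair of sequences at once. A minimal sketch of pair encoding, assuming encode's input accepts a two-item tuple (mirroring the Hugging Face pipeline) and that the second argument, None here, selects default encoding options; check the Tokenizer docs for the exact input types supported:

use rten_text::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_file("bert/tokenizer.json")?;
    // Assumption: `encode` accepts a (first, second) sequence pair and adds
    // the model's separator tokens between the two sequences.
    let encoded = tokenizer.encode(("a question", "a context passage"), None)?;
    println!("{:?}", encoded.token_ids());
    Ok(())
}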

Decoding text

Given token IDs generated by a model, you can decode them back into text using the Tokenizer::decode method:

use rten_text::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_file("gpt2/tokenizer.json")?;
    // Run model and get token IDs from outputs...
    let token_ids = [101, 4256, 300];
    let text = tokenizer.decode(&token_ids)?;
    println!("{}", text);
    Ok(())
}

More examples

See the rten-examples crate for various examples showing how to use this crate as part of an end-to-end pipeline.
