1 unstable release: 0.1.0 (Mar 17, 2023)
GPT-Tokenizer
An implementation of the GPT-3 tokenizer created by converting the GPT-3-Encoder
JavaScript package to Rust (with the help of ChatGPT-4). You can use it to estimate the number of
tokens that your prompt would approximately consume. You can also create your own custom encoding
and decoding functions by providing your own encoder.json and vocab.bpe files.
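To give a feel for what a custom vocab.bpe file contributes, here is a minimal sketch that turns the text of such a file into a table of merge ranks. It assumes the GPT-2 style layout (an optional "#version" header line followed by one space-separated pair per line, earliest lines having the highest merge priority). The function name parse_merges is hypothetical and is not part of this crate's API; the crate's own loading code may differ.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of this crate): parse the text of a
/// GPT-2 style `vocab.bpe` file into a map from merge pair to rank.
/// A lower rank means the pair is merged earlier.
fn parse_merges(bpe_text: &str) -> HashMap<(String, String), usize> {
    bpe_text
        .lines()
        // Skip the "#version: ..." header and blank lines.
        .filter(|line| !line.starts_with("#version") && !line.trim().is_empty())
        // Keep only lines that contain at least two whitespace-separated parts.
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            Some((parts.next()?.to_string(), parts.next()?.to_string()))
        })
        // Rank pairs by their order of appearance in the file.
        .enumerate()
        .map(|(rank, pair)| (pair, rank))
        .collect()
}

fn main() {
    // A tiny made-up sample in the assumed format, not real vocabulary data.
    let sample = "#version: 0.2\nt h\nth e\ni n\n";
    let ranks = parse_merges(sample);
    println!("{} merge rules loaded", ranks.len());
}
```

A byte-pair encoder would then repeatedly merge the adjacent pair with the lowest rank, and encoder.json would map each resulting piece to its integer token id.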
As a rule of thumb, OpenAI suggests that 100 tokens correspond to roughly 75 words, i.e. about 4 tokens for every 3 words.
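The rule of thumb can be applied without running the tokenizer at all. The sketch below uses only the standard library; estimate_tokens is a hypothetical helper name, and rounding up is a choice made here so that short prompts are not underestimated.

```rust
/// Hypothetical helper: estimate the token count of `text` from its word
/// count, using the rule of thumb that 100 tokens are about 75 words.
fn estimate_tokens(text: &str) -> usize {
    // Count words separated by any whitespace (spaces, tabs, newlines).
    let words = text.split_whitespace().count();
    // tokens ≈ words * 100 / 75 = words * 4 / 3, rounded up.
    (words * 4 + 2) / 3
}

fn main() {
    let prompt = "The quick brown fox jumps over the lazy dog";
    println!("Estimated tokens: {}", estimate_tokens(prompt));
}
```

This is only an approximation for English prose; code, numbers, emoji, and non-English text usually need more tokens per word, so use the real encoder when the count matters.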
See how it works against the tokenizer published by OpenAI:
https://platform.openai.com/tokenizer
use tokenizer::DefaultTokenizer;

fn main() {
    let tokenizer = DefaultTokenizer::new();

    // Sample text covering common words, emoji, and digit sequences.
    let text = r#"Many words map to one token, but some don't: indivisible.
Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾
Sequences of characters commonly found next to each other may be grouped together: 1234567890"#;

    // Encode the text into token ids, then decode them back into text.
    let encoded = &tokenizer.encode(text);
    let decoded = &tokenizer.decode(encoded);

    // Count words on any whitespace so that line breaks also separate words.
    let words = text.split_whitespace().count();

    println!("Original text: {}", text);
    println!("Encoded text: {:#?}", encoded);
    println!("Decoded text: {}", decoded);
    println!("Text size (bytes): {}", text.len());
    println!("Words: {}", words);
    // Rule of thumb: 100 tokens per 75 words, i.e. words * 4 / 3.
    println!("Rule of Thumb: {}", words * 4 / 3);
    println!("Tokens: {}", encoded.len());
}
See the ./examples directory for more usage examples.