gpt_tokenizer

Rust BPE Encoder Decoder (Tokenizer) for GPT-2 / GPT-3

1 unstable release

0.1.0 Mar 17, 2023

MIT license

GPT-Tokenizer

An implementation of the GPT-3 tokenizer, created by porting the GPT-3-Encoder JavaScript package to Rust (with the help of ChatGPT-4). Use it to estimate how many tokens a prompt will consume. You can also build custom encoding and decoding functions by supplying your own encoder.json and vocab.bpe files.
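For readers unfamiliar with what a vocab.bpe file contributes, the sketch below shows the core byte-pair-encoding merge loop on a toy example. It is an illustration under stated assumptions, not this crate's code: the `bpe` and `toy_ranks` functions and the three merge rules are hypothetical, and the sketch starts from characters rather than from the byte-level alphabet that GPT-2 style tokenizers use.

```rust
use std::collections::HashMap;

/// Hypothetical merge table: a lower rank means the pair is merged earlier.
/// A real vocab.bpe would supply many thousands of such rules.
fn toy_ranks() -> HashMap<(String, String), usize> {
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);
    ranks.insert(("e".to_string(), "r".to_string()), 2);
    ranks
}

/// Split `word` into characters, then repeatedly merge the adjacent pair
/// with the lowest rank until no adjacent pair appears in the table.
fn bpe(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the best-ranked adjacent pair as (rank, index).
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| {
                ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i))
            })
            .min();
        let Some((_, i)) = best else { break };
        // Replace the pair at positions i and i + 1 with the merged piece.
        let merged = format!("{}{}", parts[i], parts[i + 1]);
        parts[i] = merged;
        parts.remove(i + 1);
    }
    parts
}

fn main() {
    let pieces = bpe("lower", &toy_ranks());
    println!("{:?}", pieces); // prints ["low", "er"]
}
```

In a full tokenizer, each resulting piece would then be looked up in a table such as encoder.json to obtain its integer token id.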

As a rule of thumb, OpenAI suggests that 100 tokens correspond to about 75 words.
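The rule of thumb can be turned into a quick estimate without running the tokenizer at all. The following is a minimal sketch using only the standard library; `estimate_tokens` is a hypothetical helper, not part of this crate, and rounding up is a choice made here for illustration.

```rust
/// Estimate a token count from the word count, using the
/// "100 tokens per 75 words" rule of thumb (tokens ≈ words * 4 / 3).
/// Hypothetical helper; the result is rounded up.
fn estimate_tokens(text: &str) -> usize {
    // Count whitespace-separated words, so newlines and repeated
    // spaces do not inflate the count.
    let words = text.split_whitespace().count();
    (words * 4 + 2) / 3
}

fn main() {
    let text = "Many words map to one token, but some don't: indivisible.";
    println!("Words: {}", text.split_whitespace().count());
    println!("Estimated tokens: {}", estimate_tokens(text));
}
```

The estimate is only a heuristic for English prose; code, numbers, and non-English text can tokenize very differently, so the encoder's actual output remains the reference.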

Compare its output with the tokenizer published by OpenAI:

https://platform.openai.com/tokenizer

use tokenizer::DefaultTokenizer;

fn main() {
    let tokenizer = DefaultTokenizer::new();

    let text = r#"Many words map to one token, but some don't: indivisible.

Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾

Sequences of characters commonly found next to each other may be grouped together: 1234567890"#;

    let encoded = &tokenizer.encode(text);
    let decoded = &tokenizer.decode(encoded);

    println!("Original text: {}", text);
    println!("Encoded text: {:#?}", encoded);
    println!("Decoded text: {}", decoded);

    println!("Text size: {}", text.len());
    // Count whitespace-separated words so the newlines in the text are handled too.
    println!("Words: {}", text.split_whitespace().count());
    println!("Rule of Thumb: {}", text.split_whitespace().count() * 4 / 3);
    println!("Tokens: {}", encoded.len());
}
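The sample text above notes that emojis may be split into several tokens containing the underlying bytes. The sketch below, which uses only the standard library and a hypothetical `utf8_bytes` helper, shows why: a byte-level tokenizer starts from the UTF-8 bytes of the text, and the emoji in the example occupies far more bytes than it has characters.

```rust
/// Return the UTF-8 bytes of `text`, the raw material a byte-level
/// tokenizer starts from. Hypothetical helper for illustration only.
fn utf8_bytes(text: &str) -> Vec<u8> {
    text.as_bytes().to_vec()
}

fn main() {
    // A hand emoji followed by a skin-tone modifier: two code points.
    let emoji = "🤚🏾";
    println!("Code points: {}", emoji.chars().count());
    println!("UTF-8 bytes: {}", utf8_bytes(emoji).len());
    println!("Bytes (hex): {:02X?}", utf8_bytes(emoji));
}
```

This is also why `text.len()` in the example reports a size in bytes rather than in characters.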

See the ./examples directory for more usage examples.