Instant CLIP Tokenizer: a fast tokenizer for the CLIP neural network

Instant CLIP Tokenizer is a fast, pure-Rust text tokenizer for OpenAI's CLIP model. It is intended as a replacement for the original Python-based tokenizer included in the CLIP repository and aims for 100% compatibility with that implementation. It can also be used with OpenCLIP and other implementations that use the same tokenizer.

In addition to being usable as a Rust crate, the library includes Python bindings built with PyO3, so it can also be used as a native Python module.

For the microbenchmarks included in this repository, Instant CLIP Tokenizer is ~70x faster than the Python implementation (with preprocessing and caching disabled to ensure a fair comparison).
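As a rough illustration (a minimal timing sketch, not the repository's benchmark harness; the sample text and iteration count here are arbitrary):

use instant_clip_tokenizer::Tokenizer;
use std::time::Instant;

fn main() {
    let tokenizer = Tokenizer::new();
    let mut tokens = Vec::new();
    let start = Instant::now();
    for _ in 0..10_000 {
        // Reuse the output buffer across iterations to measure encoding alone.
        tokens.clear();
        tokenizer.encode("A person riding a motorcycle", &mut tokens);
    }
    println!("{:?} per encode call", start.elapsed() / 10_000);
}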

Using the library

Rust

[dependencies]
instant-clip-tokenizer = "0.1.0"
# To enable additional functionality that depends on the `ndarray` crate:
# instant-clip-tokenizer = { version = "0.1.0", features = ["ndarray"] }
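
With the ndarray feature enabled, texts can be tokenized in batches into a fixed-shape matrix. A minimal sketch, assuming the Rust tokenize_batch mirrors the Python binding shown under Examples (an iterator of texts plus a context length, returning a 2-D ndarray array):

use instant_clip_tokenizer::Tokenizer;

fn main() {
    let tokenizer = Tokenizer::new();
    // Assumed signature (requires the `ndarray` feature): one row per input
    // text, with start/end markers added and each row padded or truncated
    // to the given context length.
    let batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], 5);
    println!("{}", batch);
}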

Python (>= 3.9)

pip install instant-clip-tokenizer

Using the library requires numpy >= 1.16.0 to be installed in your Python environment (e.g., via pip install numpy).

Examples

Rust:

use instant_clip_tokenizer::{Token, Tokenizer};

let tokenizer = Tokenizer::new();

let mut tokens = Vec::new();
tokenizer.encode("A person riding a motorcycle", &mut tokens);
let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
println!("{:?}", tokens);
// -> [320, 2533, 6765, 320, 10297]

Python:

import instant_clip_tokenizer

tokenizer = instant_clip_tokenizer.Tokenizer()

tokens = tokenizer.encode("A person riding a motorcycle")
print(tokens)
# -> [320, 2533, 6765, 320, 10297]

batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], context_length=5)
print(batch)
# -> [[49406   320  2533  6765 49407]
#     [49406  1883   997 49407     0]]
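
Each row is exactly context_length entries long: 49406 and 49407 are CLIP's start-of-text and end-of-text markers, longer inputs are truncated to fit (first row), and shorter ones are padded with zeros (second row).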

Testing

To run the tests, run:

cargo test --all-features

You can also test the Python bindings with:

make test-python

Acknowledgements

The vocabulary file and the original Python tokenizer code included in this repository are copyright (c) 2021 OpenAI (MIT License).
