32 stable releases (7 major)

8.0.0 Jan 29, 2023
7.0.2 Apr 1, 2022
7.0.1 Dec 23, 2021
7.0.0 Nov 7, 2021
1.0.0 Feb 15, 2020

#28 in Machine learning

Download history: weekly downloads from 2022-11-28 through 2023-03-13 (roughly 330–1,600 per week)

4,770 downloads per month
Used in 5 crates (3 directly)

Apache-2.0

1MB
17K SLoC

rust-tokenizers

Rust-tokenizer is a drop-in replacement for the tokenization methods from the Transformers library. It includes a broad range of tokenizers for state-of-the-art transformer architectures, including:

  • Sentence Piece (unigram model)
  • Sentence Piece (BPE model)
  • BERT
  • ALBERT
  • DistilBERT
  • RoBERTa
  • GPT
  • GPT2
  • ProphetNet
  • CTRL
  • Pegasus
  • MBart50
  • M2M100
  • NLLB
  • DeBERTa
  • DeBERTa (v2)

The WordPiece-based tokenizers offer both single-threaded and multi-threaded processing (a batch-encoding sketch follows the usage example below). The Byte-Pair-Encoding tokenizers favor the use of a shared cache and are only available as single-threaded tokenizers. Using the tokenizers requires manually downloading the required files (vocabulary or merge files); these can be found in the Transformers library.

Usage example

use std::sync::Arc;
// Import paths are an assumption for the API version shown here and may
// differ between releases of the crate.
use rust_tokenizers::{BertTokenizer, Tokenizer, TruncationStrategy};
use rust_tokenizers::preprocessing::adapters::Example;

// Path to a manually downloaded vocabulary file (placeholder).
let vocab_path = "path/to/vocab.txt";
let vocab = Arc::new(rust_tokenizers::BertVocab::from_file(&vocab_path));

let test_sentence = Example::new_from_string("This is a sample sentence to be tokenized");
let bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab.clone());

// Encode a single sentence (no paired second sentence), truncating to at most 128 tokens.
println!("{:?}", bert_tokenizer.encode(&test_sentence.sentence_1,
                                       None,
                                       128,
                                       &TruncationStrategy::LongestFirst,
                                       0));
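
For the WordPiece tokenizers, the multi-threaded processing mentioned above is exposed through a batch-encoding method. The snippet below is a minimal sketch, assuming the MultiThreadedTokenizer trait and its encode_list method (import paths and exact signatures vary between releases) and following the same API version as the example above:

use std::sync::Arc;
// MultiThreadedTokenizer and the import paths below are assumptions for this
// API version; check the crate documentation for the release you are using.
use rust_tokenizers::{BertTokenizer, MultiThreadedTokenizer, TruncationStrategy};

// Vocabulary file downloaded manually from the Transformers library (placeholder path).
let vocab = Arc::new(rust_tokenizers::BertVocab::from_file("path/to/vocab.txt"));
let bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab);

// A batch of sentences to encode.
let sentences = [
    "The first sentence of the batch",
    "A second, slightly longer sentence of the batch",
];

// Calling through the multi-threaded trait splits the batch across worker threads.
let encodings = MultiThreadedTokenizer::encode_list(&bert_tokenizer,
                                                    &sentences,
                                                    128,
                                                    &TruncationStrategy::LongestFirst,
                                                    0);

for encoding in encodings {
    println!("{:?}", encoding);
}

The second argument of encode (None in the single-sentence example above) takes an optional second sentence, so sentence pairs such as question/context inputs can be encoded with the same call.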

Dependencies

~9.5MB
~228K SLoC