26 stable releases (5 major)

6.2.3 Jun 7, 2021
6.2.1 Feb 19, 2021
6.1.0 Nov 16, 2020
5.0.1 Oct 1, 2020
1.0.0 Feb 15, 2020

#38 in Machine learning

Download history: roughly 68–230 downloads/week (Feb–Jun 2021)

736 downloads per month
Used in 3 crates (2 directly)

Apache-2.0

735KB
14K SLoC

rust-tokenizers

Rust-tokenizers is a drop-in replacement for the tokenization methods of the Transformers library. It includes a broad range of tokenizers for state-of-the-art transformer architectures, including:

  • Sentence Piece (unigram model)
  • BERT
  • ALBERT
  • DistilBERT
  • RoBERTa
  • GPT
  • GPT2
  • ProphetNet
  • CTRL
  • Pegasus
  • MBart50

The WordPiece-based tokenizers support both single-threaded and multi-threaded processing. The Byte-Pair-Encoding tokenizers favor the use of a shared cache and are only available as single-threaded tokenizers. Using the tokenizers requires manually downloading the files they rely on (vocabulary or merge files); these can be found in the Transformers library.
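
Because the BPE tokenizers need both a vocabulary and a merges file, constructing one looks slightly different from the WordPiece example below. The following is a minimal sketch for GPT-2; the constructor name, its lower-case flag, and the local file paths are assumptions that may differ between releases of the crate:

// Sketch only: the `from_file` constructor and its signature are assumed here;
// check the API docs for the crate version you are using.
use rust_tokenizers::{Gpt2Tokenizer, Tokenizer, TruncationStrategy};

// Hypothetical local paths; both files must be downloaded manually beforehand.
let vocab_path = "gpt2-vocab.json";
let merges_path = "gpt2-merges.txt";

let gpt2_tokenizer = Gpt2Tokenizer::from_file(vocab_path, merges_path, false)
    .expect("missing or malformed vocabulary/merges file");

// The encode interface is shared across tokenizers.
println!("{:?}", gpt2_tokenizer.encode("This is a sample sentence to be tokenized",
                                       None,
                                       128,
                                       &TruncationStrategy::LongestFirst,
                                       0));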

Usage example

// Note: the import paths below are an assumption based on the crate's top-level
// re-exports; the module layout may differ between versions.
use std::sync::Arc;

use rust_tokenizers::{BertTokenizer, BertVocab, Tokenizer, TruncationStrategy};
use rust_tokenizers::adapters::Example;

// Load a BERT vocabulary from a manually downloaded vocab file and share it via Arc.
let vocab = Arc::new(BertVocab::from_file(&vocab_path));

// Wrap the raw text in an Example; `sentence_1` holds the text to tokenize.
let test_sentence = Example::new_from_string("This is a sample sentence to be tokenized");
let bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab.clone());

// Encode the sentence: truncate to at most 128 tokens (LongestFirst strategy), stride 0.
println!("{:?}", bert_tokenizer.encode(&test_sentence.sentence_1,
                                       None,
                                       128,
                                       &TruncationStrategy::LongestFirst,
                                       0));
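
The encode call returns the model-ready representation of the sentence; in recent releases this is a TokenizedInput struct whose fields include, among others, the numeric token_ids and the segment_ids that BERT-style models consume. Exact field names are version-dependent, so consult the API docs for the release you are using.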

Dependencies

~9MB
~219K SLoC