34 stable releases (7 major)

8.1.1 Oct 1, 2023
8.1.0 Apr 15, 2023
8.0.0 Jan 29, 2023
7.0.2 Apr 1, 2022
1.0.0 Feb 15, 2020

#40 in Machine learning


5,908 downloads per month
Used in 19 crates (8 directly)

Apache-2.0

1MB
18K SLoC

rust-tokenizers

Rust-tokenizer is a drop-in replacement for the tokenization methods from the Transformers library. It includes a broad range of tokenizers for state-of-the-art transformer architectures, including:

  • Sentence Piece (unigram model)
  • Sentence Piece (BPE model)
  • BERT
  • ALBERT
  • DistilBERT
  • RoBERTa
  • GPT
  • GPT2
  • ProphetNet
  • CTRL
  • Pegasus
  • MBart50
  • M2M100
  • NLLB
  • DeBERTa
  • DeBERTa (v2)

The WordPiece-based tokenizers support both single-threaded and multi-threaded processing. The Byte-Pair-Encoding (BPE) tokenizers favor the use of a shared cache and are only available as single-threaded tokenizers. Using the tokenizers requires manually downloading the required files (vocabulary or merge files). These can be found in the Transformers library.
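To illustrate what the WordPiece-based tokenizers do internally, here is a minimal, self-contained sketch of the greedy longest-match-first algorithm. The function name, the `##` continuation-marker handling, and the toy vocabulary are illustrative assumptions, not part of this crate's API:

```rust
use std::collections::HashSet;

// Greedy longest-match-first WordPiece split (illustrative sketch).
// At each position, take the longest vocabulary entry that matches;
// pieces after the first carry a "##" continuation marker.
fn wordpiece_tokenize(word: &str, vocab: &HashSet<&str>, max_chars: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    if chars.len() > max_chars {
        return vec!["[UNK]".to_string()];
    }
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found: Option<String> = None;
        // Shrink the window from the right until a vocabulary entry matches.
        while start < end {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}"); // continuation of a word
            }
            if vocab.contains(piece.as_str()) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(p) => {
                tokens.push(p);
                start = end;
            }
            // No sub-piece matches: the whole word becomes unknown.
            None => return vec!["[UNK]".to_string()],
        }
    }
    tokens
}

fn main() {
    let vocab: HashSet<&str> = ["un", "##aff", "##able"].into();
    println!("{:?}", wordpiece_tokenize("unaffable", &vocab, 100));
    // → ["un", "##aff", "##able"]
}
```

Because each word is split independently, batches of words can be distributed across threads, which is what makes multi-threaded WordPiece processing straightforward.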

Usage example

use std::path::PathBuf;

use rust_tokenizers::adapters::Example;
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};
use rust_tokenizers::vocab::{BertVocab, Vocab};

let lowercase: bool = true;
let strip_accents: bool = true;
let vocab_path: PathBuf = PathBuf::from("path/to/vocab");
let vocab: BertVocab = BertVocab::from_file(&vocab_path)?;
let test_sentence: Example = Example::new_from_string("This is a sample sentence to be tokenized");
let bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab, lowercase, strip_accents);

println!("{:?}", bert_tokenizer.encode(&test_sentence.sentence_1,
                                       None,
                                       128,
                                       &TruncationStrategy::LongestFirst,
                                       0));
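The shared-cache design noted above for the Byte-Pair-Encoding tokenizers can be sketched in a few lines. This is an illustrative simplification, not this crate's API: since a given word always splits into the same subwords, the split can be memoized per word; a plain `HashMap` cache like this needs `&mut` access, which is one reason such a tokenizer stays single-threaded unless the cache is put behind a lock.

```rust
use std::collections::HashMap;

// Simplified BPE splitter with a per-word memoization cache
// (hypothetical types; applies merge rules in priority order).
struct BpeSketch {
    merges: Vec<(String, String)>,       // merge rules, highest priority first
    cache: HashMap<String, Vec<String>>, // word -> cached subword split
}

impl BpeSketch {
    fn tokenize_word(&mut self, word: &str) -> Vec<String> {
        if let Some(hit) = self.cache.get(word) {
            return hit.clone(); // cache hit: skip the merge loop entirely
        }
        // Start from single characters, then greedily apply each merge rule.
        let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
        for (a, b) in &self.merges {
            let mut i = 0;
            while i + 1 < parts.len() {
                if &parts[i] == a && &parts[i + 1] == b {
                    parts[i] = format!("{a}{b}");
                    parts.remove(i + 1);
                } else {
                    i += 1;
                }
            }
        }
        self.cache.insert(word.to_string(), parts.clone());
        parts
    }
}

fn main() {
    let mut tok = BpeSketch {
        merges: vec![("l".into(), "o".into()), ("lo".into(), "w".into())],
        cache: HashMap::new(),
    };
    println!("{:?}", tok.tokenize_word("low")); // → ["low"]
    println!("{:?}", tok.tokenize_word("low")); // second call is served from the cache
}
```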

Dependencies

~12MB
~247K SLoC