5 releases

0.1.4 Sep 26, 2024
0.1.3 Sep 24, 2024
0.1.2 Sep 24, 2024
0.1.1 Sep 24, 2024
0.1.0 Sep 24, 2024

Used in bpetok

Custom license

13MB
694 lines

bpe-tokenizer

A Rust implementation of Byte Pair Encoding (BPE) tokenization. This crate tokenizes text into subword units using pre-trained vocabularies. BPE is widely used in natural language processing (NLP): it splits words into subword tokens drawn from a vocabulary built by repeatedly merging the most frequent adjacent token pairs.

It supports Unicode-aware text segmentation for sentence and word splitting, making it suitable for processing a variety of languages and scripts.
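
To make the subword splitting described above concrete, here is a minimal sketch of greedy BPE encoding for a single word. It uses only the standard library and does not call this crate. The function name `bpe_encode` and the `ranks` map (lower rank means the pair is merged earlier) are assumptions made for illustration; the crate's internal algorithm and data layout may differ.

```rust
use std::collections::HashMap;

/// Greedy BPE encoding of one word, given merge ranks.
/// `ranks` maps a candidate token to its priority (lower = merged earlier).
/// Hypothetical helper for illustration; not part of bpe-tokenizer's API.
fn bpe_encode(word: &str, ranks: &HashMap<&str, usize>) -> Vec<String> {
    // Start from individual characters.
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();

    loop {
        // Find the adjacent pair whose concatenation has the lowest rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..parts.len().saturating_sub(1) {
            let merged = format!("{}{}", parts[i], parts[i + 1]);
            if let Some(&rank) = ranks.get(merged.as_str()) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            // Merge the winning pair in place and look again.
            Some((_, i)) => {
                let right = parts.remove(i + 1);
                parts[i].push_str(&right);
            }
            // No known pair left: the word is fully encoded.
            None => return parts,
        }
    }
}
```

With ranks `lo` = 0, `low` = 1 and `er` = 2, the word `lower` is merged step by step into the two tokens `low` and `er`.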

Features

  • Bring your own BPE token vocabularies, or use ...
  • Pre-trained multilingual vocabularies sourced from the BPEmb project, with support for tokenizing text in 275 languages.
  • Unicode-aware sentence and word segmentation: uses the unicode-segmentation crate to split text correctly.

Installation

To add this crate to your project, run:

cargo add bpe-tokenizer

Or manually include it in your Cargo.toml:

[dependencies]
bpe-tokenizer = "<version>"

Full Example

Here is an example of how to create a BytePairEncoder from a string and use it to tokenize text:

use bpe_tokenizer::{BytePairEncoder, BytePairEncoderError};

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let tokenized = vocab.tokenize("Hello, world!");
println!("{:?}", tokenized);

The output will be a vector of tokens:

["<s>", "▁hello", "▁world", "</s>"]

Or load a vocabulary from a file:

use bpe_tokenizer::{BytePairEncoder, BytePairEncoderError};
let vocab = BytePairEncoder::new_from_file("path/to/file.vocab").unwrap();
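
The inline vocabulary passed to new_from_str above appears to hold one tab-separated entry per line. As a rough sketch of how such text could be parsed with the standard library alone, the hypothetical helper `parse_vocab` below assumes the second column is an integer score. Neither the helper nor that interpretation comes from the crate, whose own parser may behave differently.

```rust
use std::collections::HashMap;

/// Parses lines of the form `token<TAB>score` into a map.
/// Hypothetical helper: the second column is assumed to be an integer
/// score; bpe-tokenizer's own parser may differ.
fn parse_vocab(text: &str) -> Result<HashMap<String, i64>, String> {
    let mut vocab = HashMap::new();
    for (idx, line) in text.lines().enumerate() {
        if line.trim().is_empty() {
            continue; // skip blank lines
        }
        // Split at the first tab into token and score.
        let (token, score) = line
            .split_once('\t')
            .ok_or_else(|| format!("line {}: missing tab separator", idx + 1))?;
        let score: i64 = score
            .trim()
            .parse()
            .map_err(|e| format!("line {}: invalid score: {}", idx + 1, e))?;
        vocab.insert(token.to_string(), score);
    }
    Ok(vocab)
}
```

Returning a Result instead of panicking lets the caller decide how to report a malformed vocabulary file.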

Cargo Features

The crate also includes default pre-trained vocabularies in several sizes, which are optional and can be enabled via Cargo features. They were trained on Wikipedia data as part of the BPEmb project. These MIT-licensed vocabularies support 275 languages and come in different sizes to suit different needs:

Available Optional Features

  • default-small (100,000 tokens): Suitable for memory-constrained environments.
  • default-medium (320,000 tokens): Balances token coverage and memory efficiency.
  • default-large (1,000,000 tokens): Provides the most detailed token representations for high-granularity tasks.

Enabling Optional Features

To use these default vocabularies, specify the feature in your Cargo.toml:

[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-medium"] }

Example with default-medium Vocabulary

An example of using the medium vocabulary (320,000 tokens):

use bpe_tokenizer::{BytePairEncoder, BytePairEncoderError};

let encoder = BytePairEncoder::new_default_medium().unwrap();
let tokenized = encoder.tokenize("This is a test sentence.");
println!("{:?}", tokenized);
// Output: ["<s>", "▁this", "▁is", "▁a", "▁test", "▁sentence", "</s>"]

Tokenization Functions

The crate provides various ways to interact with the tokenizer:

  • Tokenize into a flat Vec<String>:

    • BytePairEncoder::tokenize

    Splits and flattens the text into tokens.

    let tokenized = vocab.tokenize("Example sentence.");
    // Output: ["<s>", "▁example", "▁sentence", "</s>"]
    
  • Tokenize into nested sentence vectors Vec<Vec<String>>:

    • BytePairEncoder::tokenize_sentences

    Useful for processing multiple sentences separately.

    let tokenized = vocab.tokenize_sentences("This is sentence one. And this is sentence two.");
    // Output: [["<s>", "▁this", "▁is", "▁sentence", "▁one", "</s>"], ["<s>", "▁and", "▁this", "▁is", "▁sentence", "▁two", "</s>"]]
    
  • Iterative tokenization:

    • BytePairEncoder::tokenize_iter and BytePairEncoder::tokenize_sentences_iter

    Provide iterators over the generated tokens, which is more memory-efficient for large texts.

    let tokens: Vec<String> = vocab.tokenize_iter("Example sentence").collect();
    // Output: ["<s>", "▁example", "▁sentence", "</s>"]
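
The flat and nested outputs shown above appear to be related by simple flattening. The sketch below illustrates that relationship on plain vectors using only the standard library. The helpers `flatten_sentences` and `strip_markers` are hypothetical and do not call the crate; whether tokenize is implemented exactly this way is an assumption.

```rust
/// Flattens per-sentence token lists into one flat token list.
/// Hypothetical helper on plain vectors; it does not call bpe-tokenizer.
fn flatten_sentences(sentences: Vec<Vec<String>>) -> Vec<String> {
    sentences.into_iter().flatten().collect()
}

/// Drops the "<s>" and "</s>" sentence markers seen in the outputs above.
/// Hypothetical post-processing step, useful when only subword tokens matter.
fn strip_markers(tokens: Vec<String>) -> Vec<String> {
    tokens
        .into_iter()
        .filter(|t| t.as_str() != "<s>" && t.as_str() != "</s>")
        .collect()
}
```

Keeping the nested form until the last step preserves sentence boundaries, which the flat form only records through the marker tokens.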
    

Licensing

This crate is licensed under the MIT License.

Contributing

Contributions are welcome! Please open an issue, submit a pull request, or reach out if you'd like to contribute awesome new features or fixes to this crate.
