#bpe #tokenization #text-input #tokenizer #cli-applications #command-line-interface

app bpetok

A simple CLI for tokenizing text input using Byte Pair Encoding (BPE)

3 releases

0.1.2 Sep 26, 2024
0.1.1 Sep 26, 2024
0.1.0 Sep 26, 2024

#104 in Text processing


394 downloads per month

Custom license

11KB

bpetok

bpetok is a simple command-line interface (CLI) application written in Rust for tokenizing text input using Byte Pair Encoding (BPE). The primary goal of the tool is to provide efficient and flexible tokenization for various applications that rely on text processing, natural language processing (NLP), or any pipeline where tokenized input is necessary.

bpetok reads text from stdin and writes the tokenized text to stdout. It ships with three built-in vocabulary sizes (small, medium, large) and can also load a custom vocabulary.
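The stdin-to-stdout flow can be sketched in a few lines of Rust. This is only an illustration under assumptions, not bpetok's source: the function name tokenize_stream is hypothetical, the tokenizer is passed in as a closure, and joining tokens with a single space is an assumed output format.

```rust
use std::io::{self, BufRead, Write};

/// Read `input` line by line, tokenize each line, and write one line of
/// tokens to `output`. Hypothetical helper, not part of bpetok's API.
pub fn tokenize_stream<R, W, F>(input: R, mut output: W, tokenize: F) -> io::Result<()>
where
    R: BufRead,
    W: Write,
    F: Fn(&str) -> Vec<String>,
{
    for line in input.lines() {
        // Propagate I/O errors (e.g. invalid UTF-8) to the caller.
        let line = line?;
        let tokens = tokenize(line.as_str());
        // Assumption: tokens are emitted space-separated, one input line per output line.
        writeln!(output, "{}", tokens.join(" "))?;
    }
    output.flush()
}
```

A binary built around this sketch would call it with io::stdin().lock() and io::stdout().lock(); taking generic readers and writers keeps the loop testable with in-memory buffers.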

Features

  • Tokenization using Byte Pair Encoding (BPE): Tokenizes input text using the BPE algorithm.
  • Multiple built-in vocabularies:
    • Small (100k-token vocabulary)
    • Medium (320k-token vocabulary) [default]
    • Large (1M-token vocabulary)
  • Custom vocabulary support: Load your own BPE vocabulary from a file.
  • CLI-friendly: Simple and intuitive command-line arguments.
  • Stream-based: Tokenizes text from standard input line-by-line, emitting tokens to standard output.
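For readers unfamiliar with BPE, the following is a minimal sketch of score-driven merging for a single word. It assumes a vocabulary that maps each token to an integer score where a higher score means the merge is applied earlier. The function bpe_encode is hypothetical and is not claimed to match bpetok's internal implementation.

```rust
use std::collections::HashMap;

/// Greedy BPE encoding of one word (hypothetical helper, not bpetok's API).
/// `vocab` maps a token to its score; higher scores are merged first.
pub fn bpe_encode(word: &str, vocab: &HashMap<String, i64>) -> Vec<String> {
    // Start from individual characters.
    let mut pieces: Vec<String> = word.chars().map(|c| c.to_string()).collect();

    loop {
        // Find the adjacent pair whose concatenation is in the vocabulary
        // with the highest score; ties go to the leftmost pair.
        let mut best: Option<(usize, i64)> = None;
        for i in 0..pieces.len().saturating_sub(1) {
            let merged = format!("{}{}", pieces[i], pieces[i + 1]);
            if let Some(&score) = vocab.get(&merged) {
                if best.map_or(true, |(_, s)| score > s) {
                    best = Some((i, score));
                }
            }
        }
        match best {
            // Merge the winning pair in place and look for the next merge.
            Some((i, _)) => {
                let right = pieces.remove(i + 1);
                pieces[i].push_str(&right);
            }
            // No mergeable pair is left: the pieces are the final tokens.
            None => return pieces,
        }
    }
}
```

The loop rescans all pairs after every merge, which is quadratic in the word length. That is acceptable for a sketch; production tokenizers typically use a priority queue instead.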

Installation

You can install bpetok directly from crates.io using Cargo:

cargo install bpetok

Once installed, the bpetok binary is placed in Cargo's bin directory (typically ~/.cargo/bin), so it can be run from anywhere as long as that directory is on your PATH.

Usage

bpetok [OPTIONS]

Flags and Options

  • Default (no flag): Uses the medium vocabulary (320k tokens).

  • -s, --small: Use the smaller vocabulary (100k tokens).

  • -l, --large: Use the larger vocabulary (1M tokens).

  • -v, --vocab FILE: Path to custom BPE vocabulary file. When this flag is set, the built-in vocabularies are ignored.

A BPE vocabulary file is expected to follow this format:

<token>\t<score>\n

Each line consists of:

  • A token (a string), followed by a tab character (\t)
  • A score (an integer, which may be positive, negative, or zero), followed by a newline (\n)

Example lines from the file (the two columns are separated by a tab):

<unk> 0
<s>   0
</s>  0
00    -0
an    -1
▁d    -2
en    -3
er    -4
▁s    -5
in    -6
▁p    -7
ar    -8
▁a    -9
▁00   -10
▁m    -11
▁t    -12
es    -13
on    -14
▁k    -15
or    -16
▁n    -17
la    -18
▁b    -19
is    -20
▁c    -21
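A file in this format can be read with a short parser. The sketch below is illustrative only: parse_vocab is a hypothetical name, skipping blank lines is an assumption, and the error messages are invented for the example rather than taken from bpetok.

```rust
use std::io::{self, BufRead};

/// Parse `<token>\t<score>` lines into (token, score) pairs.
/// Hypothetical helper; bpetok's own loader may differ.
pub fn parse_vocab<R: BufRead>(reader: R) -> io::Result<Vec<(String, i64)>> {
    let mut entries = Vec::new();
    for (idx, line) in reader.lines().enumerate() {
        let line = line?;
        if line.is_empty() {
            continue; // assumption: blank lines are ignored
        }
        // Split at the first tab: token on the left, score on the right.
        let (token, score) = line.split_once('\t').ok_or_else(|| {
            io::Error::new(
                io::ErrorKind::InvalidData,
                format!("line {}: missing tab separator", idx + 1),
            )
        })?;
        // Scores are signed integers; "-0" parses to 0.
        let score = score.trim().parse::<i64>().map_err(|e| {
            io::Error::new(
                io::ErrorKind::InvalidData,
                format!("line {}: invalid score: {}", idx + 1, e),
            )
        })?;
        entries.push((token.to_string(), score));
    }
    Ok(entries)
}
```

Reporting the 1-based line number in each error makes a malformed vocabulary file easier to fix, for instance one where a tab was replaced by spaces.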

Examples

Tokenizing with the default (medium) vocabulary:

echo "Hello world" | bpetok

Tokenizing using the small vocabulary:

echo "Hello world" | bpetok --small

Tokenizing using the large vocabulary:

echo "Hello world" | bpetok --large

Tokenizing using a custom vocabulary file:

echo "Hello world" | bpetok --vocab path/to/vocabulary.bpe

Error Handling

  • If the file passed to --vocab is invalid, the program fails gracefully and prints an error message to stderr.

  • Likewise, if one of the built-in vocabularies (small, medium, or large) cannot be initialized, the program reports an error with a descriptive message printed to the terminal.
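The general pattern of failing gracefully can be sketched as follows. Everything here is an assumption made for illustration: the helper names load_custom_vocab and report, the wording of the message, and the choice of 1 as the failure exit status are not taken from bpetok.

```rust
use std::fs;

/// Read a vocabulary file, attaching the path to any I/O error.
/// Hypothetical helper, not bpetok's API.
pub fn load_custom_vocab(path: &str) -> Result<String, String> {
    fs::read_to_string(path)
        .map_err(|e| format!("cannot read vocabulary file '{}': {}", path, e))
}

/// Turn a result into an exit status, printing failures to stderr.
/// Assumption: 0 means success and 1 means failure.
pub fn report(result: Result<(), String>) -> i32 {
    match result {
        Ok(()) => 0,
        Err(msg) => {
            // Errors go to stderr so that stdout carries only tokens.
            eprintln!("error: {}", msg);
            1
        }
    }
}
```

Keeping diagnostics on stderr means a downstream command in a shell pipeline never receives an error message mixed into the token stream.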

Development

To contribute or modify bpetok, you'll need a working installation of the Rust toolchain. Once set up, feel free to modify or extend the functionality of the tool and submit a PR.

Acknowledgements

The default pre-built small, medium, and large vocabularies are 275-language multilingual vocabularies that were originally trained on Wikipedia by the BPEmb project. We thank the contributors from BPEmb for making these vocabularies open and accessible to the community.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Dependencies

~15MB
~54K SLoC