3 releases

Version | Released |
---|---|
0.1.2 | Sep 26, 2024 |
0.1.1 | Sep 26, 2024 |
0.1.0 | Sep 26, 2024 |
bpetok
bpetok is a simple command-line interface (CLI) application written in Rust for tokenizing text input using Byte Pair Encoding (BPE). The primary goal of the tool is to provide efficient and flexible tokenization for applications that rely on text processing, natural language processing (NLP), or any pipeline where tokenized input is necessary.
Given an input text stream from stdin, bpetok writes tokenized sentences to stdout. It supports multiple built-in vocabulary sizes (small, medium, large) and also allows loading custom vocabularies.
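The stdin-to-stdout flow described above can be pictured with a minimal sketch. This is an illustration, not bpetok's actual source: the function name `tokenize_stream`, the `tokenize` callback, and the choice to join tokens with a single space are all assumptions made for the example.

```rust
use std::io::{self, BufRead, Write};

/// Reads `reader` line by line, tokenizes each line with the supplied
/// callback, and writes one line of tokens per input line to `writer`.
/// Joining tokens with a space is an assumption of this sketch.
fn tokenize_stream<R: BufRead, W: Write>(
    reader: R,
    mut writer: W,
    tokenize: impl Fn(&str) -> Vec<String>,
) -> io::Result<()> {
    for line in reader.lines() {
        // Propagate I/O errors (for example invalid UTF-8) to the caller.
        let tokens = tokenize(&line?);
        writeln!(writer, "{}", tokens.join(" "))?;
    }
    Ok(())
}
```

In a real binary the reader would typically be `io::stdin().lock()` and the writer `io::stdout().lock()`; keeping both generic makes the loop easy to test with in-memory buffers.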
Features
- Tokenization using Byte Pair Encoding (BPE): Tokenizes input text using the BPE algorithm.
- Multiple built-in vocabularies:
  - Small (100k-token vocabulary)
  - Medium (320k-token vocabulary) [default]
  - Large (1M-token vocabulary)
- Custom vocabulary support: Load your own BPE vocabulary from a file.
- CLI-friendly: Simple and intuitive command-line arguments.
- Stream-based: Tokenizes text from standard input line-by-line, emitting tokens to standard output.
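For readers unfamiliar with BPE, the following is a rough sketch of how a greedy, score-driven merge can segment a single word. It is not taken from bpetok's source. The function name `bpe_encode_word` is hypothetical, and treating a higher score as a more preferred merge is an assumption based on how SentencePiece-style vocabularies are commonly interpreted.

```rust
use std::collections::HashMap;

/// Segments one whitespace-free word into BPE tokens.
/// `scores` maps token -> score; this sketch assumes that a higher
/// score means the merge should be applied earlier.
fn bpe_encode_word(word: &str, scores: &HashMap<String, i64>) -> Vec<String> {
    // Start from single characters.
    let mut pieces: Vec<String> = word.chars().map(|c| c.to_string()).collect();

    loop {
        // Find the adjacent pair whose concatenation is in the vocabulary
        // with the best score; on ties the leftmost pair wins.
        let mut best: Option<(usize, i64)> = None;
        for i in 0..pieces.len().saturating_sub(1) {
            let merged = format!("{}{}", pieces[i], pieces[i + 1]);
            if let Some(&score) = scores.get(&merged) {
                if best.map_or(true, |(_, s)| score > s) {
                    best = Some((i, score));
                }
            }
        }
        match best {
            // Merge the chosen pair in place and look for the next merge.
            Some((i, _)) => {
                let right = pieces.remove(i + 1);
                pieces[i].push_str(&right);
            }
            // No adjacent pair is in the vocabulary: segmentation is done.
            None => break,
        }
    }
    pieces
}
```

With a vocabulary containing only `an` (score -1), `en` (score -3), and `er` (score -4), the word `aner` is first merged into `an`, `e`, `r` and then into `an`, `er`. A production tokenizer would also handle word-boundary markers and unknown characters, which this sketch leaves out.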
Installation
You can install bpetok directly from crates.io using Cargo:
cargo install bpetok
Once installed, the bpetok binary can be run from any directory, provided Cargo's bin directory (~/.cargo/bin by default) is on your PATH.
Usage
bpetok [OPTIONS]
Flags and Options
- Default vocabulary size (medium): With no size flag, bpetok uses the medium vocabulary (320k tokens).
- -s, --small: Use the smaller vocabulary (100k tokens).
- -l, --large: Use the larger vocabulary (1M tokens).
- -v, --vocab FILE: Path to a custom BPE vocabulary file. When this flag is set, the built-in vocabularies are ignored.
A BPE vocabulary file is expected to follow this format:
<token>\t<score>\n
Each line should consist of:
- A token (a string) followed by a tab character (\t)
- A score (an integer), which may be positive or negative
Example lines from the file:
<unk> 0
<s> 0
</s> 0
00 -0
an -1
▁d -2
en -3
er -4
▁s -5
in -6
▁p -7
ar -8
▁a -9
▁00 -10
▁m -11
▁t -12
es -13
on -14
▁k -15
or -16
▁n -17
la -18
▁b -19
is -20
▁c -21
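To make the file format concrete, here is a minimal sketch of a parser for `<token>\t<score>` lines. It is an illustration only, not bpetok's actual loader: the function name `parse_vocab`, the use of `i64` for scores, and the decision to skip empty lines are assumptions.

```rust
use std::collections::HashMap;

/// Parses vocabulary text in which every line is `<token>\t<score>`.
/// Returns a map from token to score, or a message naming the bad line.
fn parse_vocab(text: &str) -> Result<HashMap<String, i64>, String> {
    let mut vocab = HashMap::new();
    for (idx, line) in text.lines().enumerate() {
        // Assumption: empty lines are tolerated and skipped.
        if line.is_empty() {
            continue;
        }
        // Split at the first tab: token on the left, score on the right.
        let (token, score) = line
            .split_once('\t')
            .ok_or_else(|| format!("line {}: missing tab separator", idx + 1))?;
        let score = score
            .trim()
            .parse::<i64>()
            .map_err(|e| format!("line {}: invalid score {:?}: {}", idx + 1, score, e))?;
        vocab.insert(token.to_string(), score);
    }
    Ok(vocab)
}
```

Note that the separator must be a real tab character; a line such as `an -1` written with a space would be rejected by this sketch.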
Examples
Tokenizing with the default (medium) vocabulary
echo "Hello world" | bpetok
Tokenizing using the small vocabulary
echo "Hello world" | bpetok --small
Tokenizing using the large vocabulary
echo "Hello world" | bpetok --large
Tokenizing using a custom vocabulary file
echo "Hello world" | bpetok --vocab path/to/vocabulary.bpe
Error Handling
- If an invalid vocabulary file is provided using the --vocab option, the program fails gracefully and prints an error message to stderr.
- Similarly, any issue with initializing the built-in vocabularies (small, medium, or large) results in an error, with appropriate feedback printed to the terminal.
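As a rough picture of the pattern described above, the sketch below turns a failed file read into a readable message on stderr and a non-zero exit code. The function names `read_vocab_file` and `exit_code_for`, the message wording, and the exit code value 1 are assumptions for illustration; the actual messages and codes bpetok uses are not documented here.

```rust
use std::fs;
use std::path::Path;

/// Reads a vocabulary file, converting any I/O error into a message
/// that names the offending path.
fn read_vocab_file(path: &Path) -> Result<String, String> {
    fs::read_to_string(path)
        .map_err(|e| format!("cannot read vocabulary file {}: {}", path.display(), e))
}

/// Maps a result to a process exit code, printing errors to stderr.
/// The value 1 for failure is an assumption of this sketch.
fn exit_code_for(result: &Result<String, String>) -> i32 {
    match result {
        Ok(_) => 0,
        Err(msg) => {
            eprintln!("error: {msg}");
            1
        }
    }
}
```

A `main` function could pass the returned code to `std::process::exit`, so that shell pipelines can detect the failure.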
Development
To contribute to or modify bpetok, you'll need a working installation of the Rust toolchain. Once set up, feel free to modify or extend the functionality of the tool and submit a PR.
Acknowledgements
The default pre-built small, medium, and large vocabularies are 275-language multilingual vocabularies that were originally trained on Wikipedia by the BPEmb project. We thank the contributors from BPEmb for making these vocabularies open and accessible to the community.
License
This project is licensed under the MIT License. See the LICENSE file for more details.