9 releases (1 stable)
Version | Released |
---|---|
1.0.0 | Apr 28, 2024 |
0.7.1 | Apr 4, 2024 |
0.7.0 | Mar 27, 2024 |
0.6.2 | Mar 22, 2024 |
0.1.0 | Feb 18, 2024 |
TokenGeeX - Efficient Tokenizer for CodeGeeX
This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX aimed at code and Chinese. It is based on UnigramLM (Taku Kudo 2018).
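To make the UnigramLM idea concrete, the following is a minimal sketch of how a unigram model picks a segmentation: each vocabulary token has a probability, and the tokenizer chooses the split whose tokens have the highest total log-probability (Viterbi decoding). This is not TokenGeeX's implementation or API; the function name `viterbi_segment`, the toy vocabulary, and its probabilities are all made up for illustration.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    `log_probs` maps each vocabulary token to its log-probability.
    """
    n = len(text)
    max_len = max(len(tok) for tok in log_probs)
    # best[i] = (score of the best segmentation of text[:i], start of its last token)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return tokens[::-1]


# Toy vocabulary with made-up probabilities (illustration only).
vocab = {
    "def": 0.05, "de": 0.01, "main": 0.03, "ma": 0.005, "in": 0.02, " ": 0.1,
    "d": 0.01, "e": 0.01, "f": 0.02, "m": 0.01, "a": 0.01, "i": 0.01, "n": 0.01,
}
log_probs = {tok: math.log(p) for tok, p in vocab.items()}

print(viterbi_segment("def main", log_probs))  # ['def', ' ', 'main']
```

Longer tokens win here because one moderately likely token scores higher than several unlikely pieces multiplied together.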
CLI
Exact
The most restrictive pattern. It does not allow punctuation to mix with words, strictly adheres to code structure, and does not allow words that mix casing. Each digit is encoded as its own token.
RUST_LOG=debug tokengeex regex --output data/exact.regex \
$(for idiom in any-char lowercase-word uppercase-word capitalized-word english-contraction chinese-word indent few-repeated-punct-space; do echo "-i ${idiom} "; done)
Exact+
The pattern used for the merge step of exact vocabularies.
RUST_LOG=debug tokengeex regex --output data/exact-plus.regex \
$(for idiom in any-char word english-word french-word chinese-word english-contraction punct-word newline-indent repeated-punct-space; do echo "-i ${idiom} "; done)
General
General-purpose pattern which is loosely analogous to GPT-4's pattern. Numbers of up to three digits are allowed.
RUST_LOG=debug tokengeex regex --output data/general.regex \
$(for idiom in any-char word english-word french-word chinese-word english-contraction short-number punct-word newline-indent repeated-punct-space; do echo "-i ${idiom} "; done)
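The three-digit limit can be sketched with a small Python regex. This is an assumed stand-in for the `short-number` idiom, not the regex the CLI generates; in particular, grouping digits greedily from the left is an assumption, and `split_numbers` is a hypothetical helper.

```python
import re

# Hypothetical stand-in for a "short number" rule: runs of one to three
# ASCII digits, with every other character matched on its own.
SHORT_NUMBER = re.compile(r"[0-9]{1,3}|.", re.DOTALL)


def split_numbers(text):
    """Split `text` so that no numeric chunk is longer than three digits."""
    return SHORT_NUMBER.findall(text)


print(split_numbers("x=1234567"))  # ['x', '=', '123', '456', '7']
```

Capping number length keeps the vocabulary from filling up with long, rarely repeated digit strings while still letting common short numbers become single tokens.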
General+
The pattern used for the merge step of general vocabularies.
TODO!
Idiomatic
Permissive pattern that allows common idioms and multi-word tokens to form.
TODO!
Idiomatic+
The pattern used for the merge step of idiomatic vocabularies.
TODO!
Loose
Permits a wide range of patterns and idioms. Highest compression.
TODO!