#nlp #tokenizer #llm #python-packages #codegeex

bin+lib tokengeex

TokenGeeX is an efficient tokenizer for code based on UnigramLM and TokenMonster

9 releases (1 stable)

1.0.0 Apr 28, 2024
0.7.1 Apr 4, 2024
0.7.0 Mar 27, 2024
0.6.2 Mar 22, 2024
0.1.0 Feb 18, 2024

#239 in Machine learning


262 downloads per month

Apache-2.0

165KB
4K SLoC

Rust: 3K SLoC (0.1% comments)
Python: 817 SLoC (0.1% comments)
Shell: 1 SLoC

TokenGeeX - Efficient Tokenizer for CodeGeeX

This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX aimed at code and Chinese text. It is based on UnigramLM (Kudo, 2018).
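As a rough sketch of the underlying UnigramLM idea, and not of the tokengeex API, the Rust snippet below segments a string into the token sequence with the highest total log-probability under a toy unigram vocabulary, using Viterbi-style dynamic programming. The vocabulary and log-probabilities are invented for the example.

use std::collections::HashMap;

// Minimal UnigramLM-style segmentation sketch (not the tokengeex API): pick
// the token sequence with the highest total log-probability via Viterbi DP.
// Assumes the text is segmentable with the given vocabulary.
fn unigram_segment(text: &str, vocab: &HashMap<&str, f64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    let mut best = vec![f64::NEG_INFINITY; n + 1]; // best[i]: best score for the prefix of length i
    let mut back = vec![0usize; n + 1];            // back[i]: start of the last token ending at i
    best[0] = 0.0;
    for end in 1..=n {
        for start in 0..end {
            let piece: String = chars[start..end].iter().collect();
            if let Some(logp) = vocab.get(piece.as_str()) {
                let score = best[start] + logp;
                if score > best[end] {
                    best[end] = score;
                    back[end] = start;
                }
            }
        }
    }
    // Walk the back-pointers to recover the segmentation.
    let mut tokens: Vec<String> = Vec::new();
    let mut end = n;
    while end > 0 {
        let start = back[end];
        tokens.push(chars[start..end].iter().collect());
        end = start;
    }
    tokens.reverse();
    tokens
}

fn main() {
    // Toy vocabulary with invented log-probabilities.
    let vocab = HashMap::from([
        ("fn", -2.0), ("main", -3.0), ("()", -2.5), (" ", -1.0),
        ("f", -5.0), ("n", -5.0), ("m", -5.0), ("a", -5.0),
        ("i", -5.0), ("(", -4.0), (")", -4.0),
    ]);
    println!("{:?}", unigram_segment("fn main()", &vocab)); // ["fn", " ", "main", "()"]
}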

CLI

Exact

The most restrictive pattern. It does not allow punctuation to be mixed into words, strictly adheres to code structure, and does not allow words that mix casing. Each digit is encoded as a single token.

RUST_LOG=debug tokengeex regex --output data/exact.regex \
    $(for idiom in any-char lowercase-word uppercase-word capitalized-word english-contraction chinese-word indent few-repeated-punct-space; do echo "-i ${idiom} "; done)
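The idiom names are expanded by the CLI into regular expressions and composed into the output pattern. The snippet below, built on the regex crate, is only an approximation of how such a pattern could screen candidate tokens; the lowercase-word, uppercase-word, capitalized-word, and any-char regexes here are stand-ins invented for illustration, not the patterns that end up in data/exact.regex.

use regex::Regex;

fn main() {
    // A candidate token is kept only if one of the idiom regexes matches it
    // in full; otherwise it falls back to single characters.
    let idioms = [
        ("lowercase-word", r"^[a-z]+$"),
        ("uppercase-word", r"^[A-Z]+$"),
        ("capitalized-word", r"^[A-Z][a-z]+$"),
        ("any-char", r"^.$"),
    ];
    for candidate in ["parse", "HTML", "Parser", "parseHTML", "foo(", "7"] {
        let matched = idioms
            .iter()
            .find(|(_, pat)| Regex::new(pat).unwrap().is_match(candidate));
        match matched {
            Some((name, _)) => println!("{candidate:>10} -> allowed by {name}"),
            None => println!("{candidate:>10} -> rejected (mixed casing or punctuation)"),
        }
    }
}

Under these stand-in rules, "parse", "HTML", "Parser", and "7" pass as single tokens, while "parseHTML" and "foo(" are rejected and would have to be split.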

Exact+

The pattern used for the merge step of exact vocabularies.

RUST_LOG=debug tokengeex regex --output data/exact-plus.regex \
    $(for idiom in any-char word english-word french-word chinese-word english-contraction punct-word newline-indent repeated-punct-space; do echo "-i ${idiom} "; done)

General

A general-purpose pattern, loosely analogous to GPT-4's pattern. Numbers of up to three digits are allowed.

RUST_LOG=debug tokengeex regex --output data/general.regex \
    $(for idiom in any-char word english-word french-word chinese-word english-contraction short-number punct-word newline-indent repeated-punct-space; do echo "-i ${idiom} "; done)
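Assuming the short-number idiom behaves roughly like \d{1,3} (an assumption for illustration, not the generated pattern), a longer run of digits splits into chunks of at most three digits:

use regex::Regex;

fn main() {
    // Stand-in for the short-number idiom: at most three consecutive digits.
    let short_number = Regex::new(r"\d{1,3}").unwrap();
    let pieces: Vec<&str> = short_number
        .find_iter("1234567")
        .map(|m| m.as_str())
        .collect();
    // Greedy left-to-right matching yields ["123", "456", "7"].
    println!("{pieces:?}");
}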

General+

The pattern used for the merge step of general vocabularies.

TODO!

Idiomatic

Permissive pattern which allows some common idioms to form, including multi-word tokens.

TODO!

Idiomatic+

The pattern used for the merge step of idiomatic vocabularies.

TODO!

Loose

Permits a wide range of patterns and idioms. Highest compression.

TODO!

Dependencies

~9–19MB
~271K SLoC