#nlp #tokenizer #llm #python-packages #codegeex

bin+lib tokengeex

TokenGeeX is an efficient tokenizer for code based on UnigramLM and TokenMonster

11 releases (3 stable)

1.1.0 Jun 3, 2024
1.0.0 Apr 28, 2024
0.7.1 Apr 4, 2024
0.7.0 Mar 27, 2024

#478 in Machine learning

Apache-2.0

160KB
4K SLoC

Rust 3K SLoC // 0.1% comments
Python 802 SLoC // 0.1% comments
Shell 10 SLoC

TokenGeeX - Efficient Tokenizer for CodeGeeX

This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX, designed for code and Chinese text. It is based on UnigramLM (Taku Kudo, 2018).
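
To illustrate what "based on UnigramLM" means in practice, here is a minimal sketch of how a unigram language model picks a segmentation: each vocabulary entry has a log-probability, and a Viterbi pass picks the split of the input that maximizes the total. This is not the TokenGeeX API. The function name `unigram_segment`, the `TOY_VOCAB` table, its made-up probabilities, and the `unk_log_prob` fallback are all assumptions chosen for illustration.

```python
import math


def unigram_segment(text, log_probs, unk_log_prob=-20.0):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    `log_probs` maps token strings to log-probabilities (hypothetical values).
    Single characters missing from the vocabulary fall back to `unk_log_prob`
    so that every input can still be segmented.
    """
    n = len(text)
    max_len = max(map(len, log_probs), default=1)

    # best[i] is the best total log-probability for text[:i];
    # back[i] is the start index of the last token in that best split.
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif end - start == 1:
                score = unk_log_prob  # unknown single character
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start

    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = back[end]
        tokens.append(text[start:end])
        end = start
    tokens.reverse()
    return tokens


# Toy vocabulary with invented probabilities (they need not sum to 1 here).
TOY_VOCAB = {
    "def": math.log(0.2),
    " ": math.log(0.2),
    "foo": math.log(0.1),
    "():": math.log(0.05),
    "(": math.log(0.1),
    ")": math.log(0.1),
    ":": math.log(0.1),
    "f": math.log(0.05),
    "o": math.log(0.05),
}

print(unigram_segment("def foo():", TOY_VOCAB))
```

In this toy table the multi-character entry "():" outscores the three separate tokens "(", ")" and ":", so the model prefers the longer token. That preference for frequent multi-character units is what makes a unigram vocabulary compact on source code.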

Dependencies

~9–19MB
~256K SLoC