11 releases (3 stable)

Version | Released |
---|---|
1.1.0 | Jun 3, 2024 |
1.0.0 | Apr 28, 2024 |
0.7.1 | Apr 4, 2024 |
0.7.0 | Mar 27, 2024 |
TokenGeeX - Efficient Tokenizer for CodeGeeX
This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX, designed for source code and Chinese text. It is based on the Unigram language model (UnigramLM; Taku Kudo, 2018).
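
To illustrate the general idea behind a Unigram language model tokenizer, here is a minimal sketch of Viterbi segmentation: given a vocabulary with a log-probability per token, it picks the split of the input whose tokens have the highest total log-probability. This is not the TokenGeeX API or its implementation. The function name `viterbi_segment`, the `TOY_VOCAB` table, and all probabilities in it are hypothetical and chosen only for demonstration.

```python
import math


def viterbi_segment(text, log_probs):
    """Return (tokens, score) for the best segmentation of `text`.

    `log_probs` maps each vocabulary token to its log-probability.
    Hypothetical helper for illustration; not part of TokenGeeX.
    """
    n = len(text)
    max_len = max(len(tok) for tok in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best[start] + log_probs[piece]
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1], best[n]


# Made-up probabilities, for demonstration only.
TOY_VOCAB = {
    "def": math.log(0.20),
    "foo": math.log(0.10),
    "fo": math.log(0.05),
    " ": math.log(0.10),
    "d": math.log(0.05),
    "e": math.log(0.05),
    "f": math.log(0.05),
    "o": math.log(0.05),
}

tokens, score = viterbi_segment("def foo", TOY_VOCAB)
print(tokens, score)
```

With this toy vocabulary the multi-character tokens win over character-by-character splits, because each extra token multiplies in another probability below one. A production tokenizer would also need a fallback for characters outside the vocabulary, which this sketch handles only by raising an error.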