#nlp #tokenizer #llm #python-packages #codegeex

bin+lib tokengeex

TokenGeeX is an efficient tokenizer for code based on UnigramLM and TokenMonster

10 releases (2 stable)

new 1.0.1 May 17, 2024
1.0.0 Apr 28, 2024
0.7.1 Apr 4, 2024
0.7.0 Mar 27, 2024
0.1.0 Feb 18, 2024

#399 in Machine learning

Download history (downloads per week): 245 @ 2024-02-17, 124 @ 2024-02-24, 9 @ 2024-03-02, 5 @ 2024-03-09, 177 @ 2024-03-16, 147 @ 2024-03-23, 140 @ 2024-03-30, 31 @ 2024-04-06, 84 @ 2024-04-13, 152 @ 2024-04-27, 4 @ 2024-05-04

243 downloads per month

Apache-2.0

155KB
4K SLoC

Rust 3K SLoC // 0.1% comments
Python 1K SLoC // 0.1% comments
Shell 10 SLoC

TokenGeeX - Efficient Tokenizer for CodeGeeX

This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX designed for code and Chinese text. It is based on UnigramLM (Taku Kudo, 2018).
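To give a sense of what a UnigramLM-style tokenizer does at encoding time, here is a minimal sketch of Viterbi segmentation under a unigram model: each vocabulary token has a log-probability, and the encoder picks the split of the input whose token log-probabilities sum to the highest score. This is not the TokenGeeX API or its implementation. The function name `viterbi_segment`, the `TOY_VOCAB` table, and its probabilities are all hypothetical and chosen only for illustration.

```python
import math


def viterbi_segment(text, log_probs):
    """Return (tokens, score) for the best split of `text` under a unigram model.

    `log_probs` maps each vocabulary token to its log-probability.
    Illustrative sketch only; not how TokenGeeX itself is implemented.
    """
    n = len(text)
    max_len = max(len(tok) for tok in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token ending at i
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best[start] + log_probs[piece]
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    tokens.reverse()
    return tokens, best[n]


# Hypothetical toy vocabulary; the probabilities are made up for this example.
TOY_VOCAB = {
    "def": math.log(0.2),
    " ": math.log(0.2),
    "foo": math.log(0.1),
    "fo": math.log(0.05),
    "d": math.log(0.05),
    "e": math.log(0.05),
    "f": math.log(0.05),
    "o": math.log(0.05),
}

tokens, score = viterbi_segment("def foo", TOY_VOCAB)
print(tokens)
```

With this toy table, whole-word tokens win over character-level splits because multiplying fewer, larger probabilities gives a higher total score. A real vocabulary would be learned from a corpus rather than written by hand.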

Dependencies

~9–20MB
~289K SLoC