4 releases

0.1.3 Nov 26, 2024
0.1.2 Nov 26, 2024
0.1.1 Nov 26, 2024
0.1.0 Nov 26, 2024

#227 in Compression

Download history: 379/week @ 2024-11-25, 20/week @ 2024-12-09

399 downloads per month

MIT license

8KB
132 lines

Memory-efficient English language tokenizer

Applying Dearborn orthography to make English easier for machines to understand.

Dearborn orthography allows for lossless compression of English. This reduces the number of tokens required to encode meaning and removes tokens that are informationally "distracting". It also removes confusing inconsistencies of standard English while retaining its structure, and the text can be converted back to its standard English equivalent at any stage. This compression and standardization of language down to meaning-carrying tokens is ideal for training large language models.
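
The "lossless" and "convertible back" properties can be illustrated with a minimal round-trip sketch. Everything here is an assumption for illustration: the `TABLE` entries, `encode`, and `decode` are hypothetical and are not this crate's API or the actual Dearborn rules, which this page does not spell out. The sketch shows only that a one-to-one whole-word substitution can be reversed exactly, provided the input does not already contain one of the short forms.

```rust
// Hypothetical substitution table (placeholder spellings only).
// These are NOT the crate's actual Dearborn orthography rules.
const TABLE: [(&str, &str); 3] = [
    ("through", "thru"),
    ("though", "tho"),
    ("enough", "enuf"),
];

/// Replace each whole word that has an entry in TABLE with its short form.
/// Returns None if the input already contains a short form, because the
/// round trip would then be ambiguous and no longer lossless.
fn encode(text: &str) -> Option<String> {
    let mut out = Vec::new();
    // Splitting on a single space and re-joining preserves spacing exactly.
    for word in text.split(' ') {
        if TABLE.iter().any(|(_, short)| *short == word) {
            return None;
        }
        let mapped = TABLE
            .iter()
            .find(|(long, _)| *long == word)
            .map(|(_, short)| *short)
            .unwrap_or(word);
        out.push(mapped);
    }
    Some(out.join(" "))
}

/// Reverse the substitution by mapping each short form back to its long form.
fn decode(text: &str) -> String {
    text.split(' ')
        .map(|word| {
            TABLE
                .iter()
                .find(|(_, short)| *short == word)
                .map(|(long, _)| *long)
                .unwrap_or(word)
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

In this sketch, calling `decode` on the output of a successful `encode` returns the original string unchanged. Rejecting inputs that already contain a short form is one simple way to keep the mapping reversible; a real scheme would need its own rule for such collisions.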

Dependencies

~3.5–5.5MB
~91K SLoC