4 releases
Version | Released |
---|---|
0.1.3 | Nov 26, 2024 |
0.1.2 | Nov 26, 2024 |
0.1.1 | Nov 26, 2024 |
0.1.0 | Nov 26, 2024 |
#227 in Compression
399 downloads per month
8KB
132 lines
Memory-efficient English language tokenizer
Applying Dearborn orthography to make English easier for machines to understand.
Dearborn orthography allows for lossless compression of English. This reduces the number of tokens required to encode meaning and removes tokens that are informationally "distracting". It also removes confusing inconsistencies of standard English while retaining its structure, and text can be converted back to its standard English equivalent at any stage. This compression and standardization of language down to meaning-carrying tokens is ideal for training large language models.
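To make the "lossless" property concrete, here is a minimal sketch of a reversible respelling in Rust. It is not this crate's API, and the word pairs in `RESPELLINGS` are hypothetical placeholders rather than actual Dearborn orthography rules. The names `encode`, `decode`, and `ESCAPE` are likewise assumptions made for illustration. The sketch only shows one way a shorter spelling can be made to convert back to standard English without ambiguity.

```rust
// Hypothetical respelling table: (standard spelling, shorter respelling).
// These pairs are illustrative only and are NOT taken from Dearborn orthography.
const RESPELLINGS: [(&str, &str); 4] = [
    ("through", "thru"),
    ("though", "tho"),
    ("enough", "enuf"),
    ("night", "nite"),
];

// Assumed escape marker: protects input words that already look like a
// respelling (or already start with the marker) so decoding stays unambiguous.
const ESCAPE: char = '~';

fn encode_word(word: &str) -> String {
    // A standard spelling is replaced by its shorter respelling.
    if let Some((_, short)) = RESPELLINGS.iter().find(|(long, _)| *long == word) {
        return short.to_string();
    }
    // A word that collides with a respelling, or starts with the marker,
    // is escaped so that decode_word returns it untouched.
    let collides = RESPELLINGS.iter().any(|(_, short)| *short == word);
    if collides || word.starts_with(ESCAPE) {
        return format!("{}{}", ESCAPE, word);
    }
    word.to_string()
}

fn decode_word(word: &str) -> String {
    // An escaped word loses exactly one marker and is otherwise unchanged.
    if let Some(rest) = word.strip_prefix(ESCAPE) {
        return rest.to_string();
    }
    // A respelling maps back to its standard spelling.
    match RESPELLINGS.iter().find(|(_, short)| *short == word) {
        Some((long, _)) => long.to_string(),
        None => word.to_string(),
    }
}

/// Respell every space-separated word of `text`.
pub fn encode(text: &str) -> String {
    text.split(' ').map(encode_word).collect::<Vec<_>>().join(" ")
}

/// Invert `encode`, recovering the original text exactly.
pub fn decode(text: &str) -> String {
    text.split(' ').map(decode_word).collect::<Vec<_>>().join(" ")
}
```

Under these assumptions, `decode(&encode(text))` returns `text` for any input, including text that already contains a respelled form such as "thru". The escape marker is the design choice that keeps the mapping invertible. This sketch works on whole lowercase words separated by spaces and ignores punctuation and capitalization.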
Dependencies
~3.5–5.5MB
~91K SLoC