#machine-learning #tokenizers #input #tokenization #bert #cc

rten-text

Text tokenization and other ML pre/post-processing functions

3 releases (breaking)

new 0.7.0 Apr 12, 2024
0.4.0 Feb 8, 2024
0.1.0 Dec 31, 2023


MIT/Apache

59KB
1.5K SLoC


Library containing text tokenization and related functionality, for preparing inputs and decoding outputs of text models (e.g. BERT).

It provides a subset of the functionality found in Hugging Face Tokenizers. In exchange, it has fewer dependencies, none of which require C/C++.


lib.rs:

This crate provides tools for pre- and post-processing the text inputs and outputs of models. This primarily means tokenizing text into token IDs and de-tokenizing token IDs back into text.

If you need a more featureful set of tokenizers, see the Hugging Face tokenizers project.
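To make "tokenizing and de-tokenizing" concrete, here is a minimal, self-contained sketch of the greedy longest-match WordPiece scheme used by BERT-style models. It does not use or reproduce the rten-text API: the function names (`toy_vocab`, `wordpiece`, `encode`, `decode`) and the six-entry vocabulary are hypothetical and exist only for illustration. It also assumes whitespace splitting and lowercasing as the only normalization steps.

```rust
use std::collections::HashMap;

/// Hypothetical toy vocabulary. Real models ship vocabularies with
/// tens of thousands of entries. The ID is the position in the list.
fn toy_vocab() -> HashMap<&'static str, u32> {
    let tokens = ["[UNK]", "un", "##aff", "##able", "hello", "world"];
    tokens
        .iter()
        .enumerate()
        .map(|(id, tok)| (*tok, id as u32))
        .collect()
}

/// Split one word into sub-word IDs, always taking the longest
/// vocabulary entry that matches at the current position.
/// Pieces that continue a word carry a "##" prefix.
fn wordpiece(word: &str, vocab: &HashMap<&str, u32>) -> Vec<u32> {
    let unk = vocab["[UNK]"];
    let chars: Vec<char> = word.chars().collect();
    let mut ids = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece.insert_str(0, "##");
            }
            if let Some(&id) = vocab.get(piece.as_str()) {
                found = Some(id);
                break;
            }
            end -= 1;
        }
        match found {
            Some(id) => {
                ids.push(id);
                start = end;
            }
            // No piece matched: the whole word becomes one unknown token.
            None => return vec![unk],
        }
    }
    ids
}

/// Tokenize: lowercase, split on whitespace, then apply WordPiece per word.
fn encode(text: &str, vocab: &HashMap<&str, u32>) -> Vec<u32> {
    text.split_whitespace()
        .flat_map(|w| wordpiece(&w.to_lowercase(), vocab))
        .collect()
}

/// De-tokenize: map IDs back to pieces and glue "##" pieces onto the
/// preceding piece. Other pieces are separated by a single space.
fn decode(ids: &[u32], vocab: &HashMap<&str, u32>) -> String {
    let by_id: HashMap<u32, &str> = vocab.iter().map(|(tok, id)| (*id, *tok)).collect();
    let mut out = String::new();
    for id in ids {
        let tok = by_id.get(id).copied().unwrap_or("[UNK]");
        if let Some(rest) = tok.strip_prefix("##") {
            out.push_str(rest);
        } else {
            if !out.is_empty() {
                out.push(' ');
            }
            out.push_str(tok);
        }
    }
    out
}
```

With this toy vocabulary, `encode("Hello unaffable world", &toy_vocab())` yields `[4, 1, 2, 3, 5]`, and decoding those IDs gives back `"hello unaffable world"`. The round trip is lossy with respect to case and original whitespace, which is typical of this style of tokenizer.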

Dependencies

~1.7–2.6MB
~76K SLoC