6 releases (breaking)

0.11.0 Jul 5, 2024
0.10.0 May 25, 2024
0.9.0 May 16, 2024
0.7.0 Apr 12, 2024
0.1.0 Dec 31, 2023

#688 in Machine learning

Download history 10/week @ 2024-03-28 3/week @ 2024-04-04 136/week @ 2024-04-11 167/week @ 2024-05-16 164/week @ 2024-05-23 18/week @ 2024-05-30 2/week @ 2024-06-06 2/week @ 2024-06-13 109/week @ 2024-07-04 6/week @ 2024-07-11

115 downloads per month
Used in rten-generate

MIT/Apache

89KB
2K SLoC

rten-text

Library containing text tokenization and related functionality, for preparing inputs and decoding outputs for text models (e.g. BERT).

The functionality is a subset of that found in Hugging Face Tokenizers. In exchange for the smaller feature set, it has fewer dependencies, and none that require C/C++.


lib.rs:

This crate provides text tokenizers for preparing inputs to machine-learning models for inference. It implements popular tokenization methods such as WordPiece (used by BERT) and Byte Pair Encoding (used by GPT-2).
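
To give a feel for what WordPiece tokenization involves, here is a minimal standalone sketch of the greedy longest-match-first lookup commonly used for BERT-style vocabularies. It does not use or mirror the rten-text API: the function name `wordpiece`, the toy vocabulary, the `##` continuation prefix and the `[UNK]` token are all assumptions chosen for illustration, and real tokenizers also perform normalization and pre-tokenization steps that are omitted here.

```rust
use std::collections::HashSet;

/// Illustrative only: split one word into WordPiece tokens by repeatedly
/// taking the longest vocabulary entry that matches at the current position.
/// Pieces after the first are looked up with a "##" prefix (BERT convention).
/// If no entry matches at some position, the whole word maps to `unk`.
fn wordpiece(word: &str, vocab: &HashSet<&str>, unk: &str) -> Vec<String> {
    // Work on chars so that slicing never splits a multi-byte character.
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;

        // Shrink the candidate from the right until it is in the vocabulary.
        while end > start {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                candidate.insert_str(0, "##");
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }

        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            // No prefix of the remaining text is known: give up on the word.
            None => return vec![unk.to_string()],
        }
    }
    pieces
}
```

With a toy vocabulary containing `un`, `##aff` and `##able`, calling `wordpiece("unaffable", &vocab, "[UNK]")` splits the word into those three pieces, while a word with no matching entries collapses to the single unknown token.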

It does not support training new vocabularies and isn't optimized for processing very large volumes of text. If you need a tokenization crate with more complete functionality, see Hugging Face Tokenizers.

Dependencies

~5–7MB
~150K SLoC