axonml-text
Overview
axonml-text provides natural language processing utilities for the AxonML machine learning framework: a serializable `Vocab`, six tokenizers, and three dataset families (text classification, language modeling, and synthetic sentiment/seq2seq). Labels are emitted as class-index tensors of shape `[1]`, directly compatible with AxonML's `CrossEntropyLoss`.
Features
- **Vocabulary management** — `Vocab` with token/index maps, special-token indices (PAD, UNK, BOS, EOS, MASK), frequency-threshold construction via `Vocab::from_text`, auto-inserted UNK/PAD in `from_tokens`, and JSON save/load (serde).
- **Six tokenizers** implementing a common `Tokenizer` trait: `WhitespaceTokenizer` (with optional lowercasing), `CharTokenizer` (optional whitespace filtering), `WordPunctTokenizer` (separates words and punctuation), `NGramTokenizer` (word- or character-level n-grams), `BasicBPETokenizer` (trainable byte-pair encoding with priority-ordered merges and `</w>` end markers), and `UnigramTokenizer` (Viterbi-optimal segmentation from a scored vocabulary).
- **Text classification dataset** — `TextDataset` stores a tokenizer and pads/truncates to `max_length`; `from_samples` builds the vocab from tokenized text with a `min_freq` threshold.
- **Language modeling dataset** — `LanguageModelDataset` produces next-token (input, target) pairs of shape `[seq_length]`.
- **Synthetic datasets** — `SyntheticSentimentDataset` (small/train/test presets, deterministic per-index generation) and `SyntheticSeq2SeqDataset` (reverse/copy task).
- **Prelude module** for concise imports.
Modules
| Module | Description |
|---|---|
| `vocab` | `Vocab` struct, special-token constants (`PAD_TOKEN`, `UNK_TOKEN`, `BOS_TOKEN`, `EOS_TOKEN`, `MASK_TOKEN`), JSON save/load |
| `tokenizer` | `Tokenizer` trait plus Whitespace, Char, WordPunct, NGram, BasicBPE, and Unigram implementations |
| `datasets` | `TextDataset`, `LanguageModelDataset`, `SyntheticSentimentDataset`, `SyntheticSeq2SeqDataset` |
Usage
Add the dependency to your `Cargo.toml`:

```toml
[dependencies]
axonml-text = "0.6.1"
```
Building a Vocabulary
```rust
use axonml_text::prelude::*;

// Frequency-threshold construction (adds special tokens automatically)
let text = "the quick brown fox jumps over the lazy dog";
let vocab = Vocab::from_text(text, /*min_freq=*/ 1);

// Or build manually
let mut vocab = Vocab::with_special_tokens();
vocab.add_token("hello");
vocab.add_token("world");

// Encode and decode
let indices = vocab.encode(&["hello", "world"]);
let tokens = vocab.decode(&indices);

// Unknown tokens resolve to the UNK index
assert_eq!(vocab.token_to_index("foo"), vocab.unk_index().unwrap());

// Persistence
vocab.save(std::path::Path::new("vocab.json")).unwrap();
let loaded = Vocab::load(std::path::Path::new("vocab.json")).unwrap();
```
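The `min_freq` argument controls which tokens make it into the vocabulary; rarer tokens resolve to UNK at lookup time. A minimal sketch of that behavior (the exact threshold semantics are an assumption based on the feature description above):

```rust
use axonml_text::prelude::*;

// "the" appears twice; every other word appears exactly once.
let text = "the quick brown fox jumps over the lazy dog";
let vocab = Vocab::from_text(text, /*min_freq=*/ 2);

// Assumption: tokens below the frequency threshold are omitted from the
// vocabulary and therefore map to the UNK index.
assert_eq!(vocab.token_to_index("quick"), vocab.unk_index().unwrap());
assert_ne!(vocab.token_to_index("the"), vocab.unk_index().unwrap());
```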
Tokenization
```rust
use axonml_text::prelude::*;

let ws = WhitespaceTokenizer::new();
let tokens = ws.tokenize("Hello World"); // ["Hello", "World"]

let chars = CharTokenizer::new();
let t = chars.tokenize("Hi!"); // ["H", "i", "!"]

let wp = WordPunctTokenizer::lowercase();
let t = wp.tokenize("Hello, World!"); // ["hello", ",", "world", "!"]

let bigrams = NGramTokenizer::word_ngrams(2);
let t = bigrams.tokenize("one two three"); // ["one two", "two three"]

let trigrams = NGramTokenizer::char_ngrams(3);
let t = trigrams.tokenize("hello"); // ["hel", "ell", "llo"]

// Trainable BPE
let mut bpe = BasicBPETokenizer::new();
bpe.train("low lower lowest newer newest", /*num_merges=*/ 10);
let tokens = bpe.tokenize("lowest"); // uses priority-ordered merges

// Unigram (Viterbi-optimal segmentation)
let unigram = UnigramTokenizer::from_tokens(&["hel", "lo", "wor", "ld"]);
let tokens = unigram.tokenize("hello world");
```
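Because all six implementations share one trait, a custom tokenizer can be swapped in anywhere the built-ins are accepted. A hypothetical sketch, assuming the trait's required method matches the `tokenize(&self, text: &str) -> Vec<String>` signature used above (the real trait may require more):

```rust
use axonml_text::prelude::*;

// Hypothetical comma-splitting tokenizer (not part of the crate).
struct CommaTokenizer;

impl Tokenizer for CommaTokenizer {
    // Signature inferred from the calls above.
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split(',').map(|s| s.trim().to_string()).collect()
    }
}

let t = CommaTokenizer.tokenize("a, b, c"); // ["a", "b", "c"]
```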
Text Classification Dataset
```rust
use axonml_text::prelude::*;

let samples = vec![
    ("good movie".to_string(), 1),
    ("bad movie".to_string(), 0),
    ("great film".to_string(), 1),
    ("terrible movie".to_string(), 0),
];

let tokenizer = WhitespaceTokenizer::new();

// Builds vocab from tokenized samples with min_freq=1, pads/truncates to 10.
let dataset = TextDataset::from_samples(&samples, &tokenizer, 1, 10);
assert_eq!(dataset.num_classes(), 2);

let loader = DataLoader::new(dataset, 16);
for batch in loader.iter() {
    // batch.data    : [batch_size, max_length] (float token indices)
    // batch.targets : [batch_size, 1] (float class index)
}
```
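Single samples can also be read without a loader. A sketch assuming `TextDataset` exposes the same `get` accessor that `LanguageModelDataset` uses below:

```rust
use axonml_text::prelude::*;

let samples = vec![("good movie".to_string(), 1), ("bad movie".to_string(), 0)];
let tokenizer = WhitespaceTokenizer::new();
let dataset = TextDataset::from_samples(&samples, &tokenizer, 1, 10);

// Assumed accessor, mirroring LanguageModelDataset::get shown below.
let (input, target) = dataset.get(0).unwrap();
// input  : [max_length] padded/truncated token indices
// target : [1] class index, the shape CrossEntropyLoss consumes
```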
Language Modeling Dataset
```rust
use axonml_text::prelude::*;

let text = "one two three four five six seven eight nine ten";
let dataset = LanguageModelDataset::from_text(text, /*seq_len=*/ 3, /*min_freq=*/ 1);

let (input, target) = dataset.get(0).unwrap();
// input  : [seq_length] — tokens at positions i..i+seq_length
// target : [seq_length] — tokens at positions i+1..i+seq_length+1
```
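The pairs batch the same way as the classification samples. A sketch assuming `LanguageModelDataset` plugs into `DataLoader` exactly like `TextDataset` above:

```rust
use axonml_text::prelude::*;

let text = "one two three four five six seven eight nine ten";
let dataset = LanguageModelDataset::from_text(text, /*seq_len=*/ 3, /*min_freq=*/ 1);

// Assumption: LanguageModelDataset implements the Dataset interface that
// DataLoader consumes.
let loader = DataLoader::new(dataset, 8);
for batch in loader.iter() {
    // batch.data    : [batch_size, seq_length] input token indices
    // batch.targets : [batch_size, seq_length] shifted next-token targets
}
```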
Synthetic Datasets
```rust
use axonml_text::prelude::*;

// Deterministic sentiment dataset (binary, reproducible per-index)
let sentiment = SyntheticSentimentDataset::small(); // 100 samples, max_len=32, vocab=1000
let sentiment = SyntheticSentimentDataset::train(); // 10000 samples, max_len=64, vocab=10000
let sentiment = SyntheticSentimentDataset::test();  // 2000 samples

// Seq2seq reverse (copy_task makes src_len == tgt_len)
let seq2seq = SyntheticSeq2SeqDataset::copy_task(/*size=*/ 100, /*length=*/ 5, /*vocab_size=*/ 50);
// For each sample: tgt is src reversed.
```
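The presets slot into the same pipeline, which makes them handy for smoke-testing a model before real data is wired up. A sketch assuming the synthetic datasets implement the same Dataset interface as the others:

```rust
use axonml_text::prelude::*;

// Assumption: SyntheticSentimentDataset batches through DataLoader exactly
// like TextDataset above.
let loader = DataLoader::new(SyntheticSentimentDataset::small(), 16);
for batch in loader.iter() {
    // batch.data    : [batch_size, 32] token indices (small preset: max_len = 32)
    // batch.targets : [batch_size, 1] binary sentiment labels
}
```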
Tests
```bash
cargo test -p axonml-text
```
License
Licensed under either of:
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
at your option.
Last updated: 2026-04-16 (v0.6.1)