Version 0.1.0, unstable (released Sep 24, 2025)
Used in 2 crates (via nvs-core)
TokenMonster: greedy tiktoken-like tokenizer (cl100k_base approximator)
- Greedy longest-match over an embedded vocabulary (base64-encoded tokens → ids).
- Falls back to raw bytes (ids 0..255) when no token matches.
- Fast counting suitable for chunking and cost estimates (approximate, not exact tiktoken fidelity).
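The greedy longest-match loop with byte fallback can be sketched as follows. This is a minimal, dependency-free illustration of the technique, not the crate's actual implementation; `encode_greedy` and its toy vocabulary are hypothetical names.

```rust
use std::collections::HashMap;

// Greedy longest-match: at each position, try the longest vocabulary
// entry that matches, shrinking toward length 1; if nothing matches,
// emit the raw byte (ids 0..=255 are reserved for single bytes here).
fn encode_greedy(input: &str, vocab: &HashMap<&str, u32>, max_token_len: usize) -> Vec<u32> {
    let bytes = input.as_bytes();
    let mut ids = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let mut matched = false;
        let end = (i + max_token_len).min(bytes.len());
        for j in (i + 1..=end).rev() {
            // Candidate must be valid UTF-8 to look up as a &str key.
            if let Ok(s) = std::str::from_utf8(&bytes[i..j]) {
                if let Some(&id) = vocab.get(s) {
                    ids.push(id);
                    i = j;
                    matched = true;
                    break;
                }
            }
        }
        if !matched {
            // Byte fallback: the raw byte value doubles as the token id.
            ids.push(bytes[i] as u32);
            i += 1;
        }
    }
    ids
}

fn main() {
    let mut vocab = HashMap::new();
    vocab.insert("hello", 1000);
    vocab.insert("he", 1001);
    vocab.insert(" world", 1002);
    let ids = encode_greedy("hello world!", &vocab, 6);
    // "hello" wins over "he" (longest match first); "!" falls back
    // to its byte value 33.
    assert_eq!(ids, vec![1000, 1002, 33]);
    println!("{:?}", ids);
}
```

Because every position either matches a token or falls back to one byte, the loop always terminates and can encode arbitrary input.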
Design
- Lazy vocabulary load with once_cell.
- Hash maps (ahash) for encoder/decoder.
- Small inline vocab under the tiny_vocab feature for tests/examples.
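The lazy-load pattern described above can be sketched like this. The crate itself uses once_cell and ahash; std's OnceLock and HashMap stand in here so the example is dependency-free, and the toy vocabulary replaces the real embedded base64 token list.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// The encoder map is built exactly once, on first access, and shared
// by reference afterward (no per-call decoding or allocation).
static ENCODER: OnceLock<HashMap<String, u32>> = OnceLock::new();

fn encoder() -> &'static HashMap<String, u32> {
    ENCODER.get_or_init(|| {
        // In the real crate this step would decode the embedded
        // base64-encoded token list into (token, id) pairs.
        let mut m = HashMap::new();
        m.insert("hello".to_string(), 1000);
        m.insert(" world".to_string(), 1002);
        m
    })
}

fn main() {
    // First call initializes; later calls reuse the same map.
    assert_eq!(encoder().get("hello"), Some(&1000));
    assert_eq!(encoder().len(), 2);
    println!("vocab size: {}", encoder().len());
}
```

Swapping std's HashMap for an ahash-backed map changes only the hasher, not this structure; ahash trades cryptographic strength for lookup speed, which suits a hot tokenization path.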
tokenmonster
Greedy tiktoken-like tokenizer with an embedded vocabulary, intended for fast, allocation-light tokenization.
Features
- Greedy tokenization compatible with common LLM vocabularies
- Zero-copy where possible; minimal allocations
- Optional tiny test vocabulary via the tiny_vocab feature
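The main use case for fast counting is chunking text under a token budget. A minimal sketch of that pattern, with `count_tokens` as a hypothetical stand-in for the crate's real counting API (here it simply counts whitespace-separated words):

```rust
// Stand-in for a real tokenizer's count; one "token" per word here.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

// Split text into chunks of at most `max_tokens` tokens each.
fn chunk_by_tokens(text: &str, max_tokens: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for word in text.split_whitespace() {
        if current.len() == max_tokens {
            chunks.push(current.join(" "));
            current.clear();
        }
        current.push(word);
    }
    if !current.is_empty() {
        chunks.push(current.join(" "));
    }
    chunks
}

fn main() {
    let chunks = chunk_by_tokens("one two three four five", 2);
    assert_eq!(chunks, vec!["one two", "three four", "five"]);
    // Every chunk respects the budget.
    assert!(chunks.iter().all(|c| count_tokens(c) <= 2));
}
```

With an approximate tokenizer the budget should be treated as an estimate: leave headroom when the downstream model enforces a hard context limit.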
License: MIT