#tokenize #tiktoken #nlp

tokenmonster

Greedy tiktoken-like tokenizer with embedded vocabulary (cl100k_base approximator)

1 unstable release

0.1.0 Sep 24, 2025

#1388 in Algorithms


Used in 2 crates (via nvs-core)

MIT license

705KB
309 lines

TokenMonster: greedy tiktoken-like tokenizer (cl100k_base approximator)

  • Greedy longest-match over an embedded vocabulary (base64-encoded tokens → ids).
  • Falls back to raw single bytes (ids 0..255) when no vocabulary entry matches.
  • Fast counting suitable for chunking and cost estimates (approximate counts, not exact tiktoken fidelity).
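The greedy longest-match with byte fallback can be sketched as below. This is a minimal illustration of the approach described above, not the crate's actual API: the vocabulary, token ids, and `max_token_len` parameter are hypothetical stand-ins for the embedded table.

```rust
use std::collections::HashMap;

/// Greedy longest-match encoding with raw-byte fallback (sketch).
fn encode(text: &str, vocab: &HashMap<&str, u32>, max_token_len: usize) -> Vec<u32> {
    let bytes = text.as_bytes();
    let mut ids = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let mut matched = false;
        // Try the longest slice starting at `i` first, shrinking until
        // a vocabulary entry matches.
        let end = (i + max_token_len).min(bytes.len());
        for j in (i + 1..=end).rev() {
            if let Ok(s) = std::str::from_utf8(&bytes[i..j]) {
                if let Some(&id) = vocab.get(s) {
                    ids.push(id);
                    i = j;
                    matched = true;
                    break;
                }
            }
        }
        if !matched {
            // No vocabulary entry covers this position: emit the raw
            // byte, mirroring the 0..255 byte fallback.
            ids.push(bytes[i] as u32);
            i += 1;
        }
    }
    ids
}
```

Because every position either matches a token or falls back to one byte, encoding never fails on arbitrary input, which is what makes it safe for quick cost estimates.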

Design

  • Lazy vocabulary load with once_cell.
  • Hash maps (ahash) for encoder/decoder.
  • Small inline vocab under tiny_vocab feature for tests/examples.
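The lazy-load pattern can be sketched as follows. The crate states it uses `once_cell` and `ahash`; this sketch substitutes the standard library's `OnceLock` and `HashMap` so it is self-contained, and the two-entry table is a toy stand-in for the embedded base64 vocabulary.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Built once, on first access; later calls return the same reference.
static ENCODER: OnceLock<HashMap<&'static str, u32>> = OnceLock::new();

fn encoder() -> &'static HashMap<&'static str, u32> {
    ENCODER.get_or_init(|| {
        // The real crate would decode its embedded base64 token table
        // here; a toy two-entry table stands in.
        HashMap::from([("hello", 1000), (" world", 1001)])
    })
}
```

Lazy initialization keeps startup cheap: the (potentially large) embedded vocabulary is only decoded when the tokenizer is actually used.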

tokenmonster

Greedy tiktoken-like tokenizer with an embedded vocabulary, intended for fast, allocation-light tokenization.

Features

  • Greedy tokenization compatible with common LLM vocabularies
  • Zero-copy where possible; minimal allocations
  • Optional tiny test vocabulary via the tiny_vocab feature
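One concrete use of fast approximate counting is packing text into chunks under a token budget. The sketch below assumes some `count_tokens` function is available; the whitespace-word counter used in the example is a stand-in, not the crate's API.

```rust
/// Pack paragraphs into chunks whose estimated token count stays
/// under `budget` (hypothetical helper, not part of the crate).
fn chunk_by_budget(
    paragraphs: &[&str],
    budget: usize,
    count_tokens: impl Fn(&str) -> usize,
) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut used = 0;
    for p in paragraphs {
        let cost = count_tokens(p);
        // Start a new chunk when adding this paragraph would overflow.
        if used + cost > budget && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
            used = 0;
        }
        if !current.is_empty() {
            current.push('\n');
        }
        current.push_str(p);
        used += cost;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Since the counts are approximate relative to real tiktoken, budgets should leave some headroom when the downstream consumer enforces a hard token limit.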

License: MIT