#gguf #llama #sentence-piece #llm

shimmytok

Pure Rust tokenizer for GGUF models with llama.cpp compatibility (SentencePiece + BPE + WPM + UGM + RWKV)

6 releases (breaking)

0.7.0 Jan 13, 2026
0.5.0 Oct 23, 2025
0.4.0 Oct 22, 2025
0.3.0 Oct 22, 2025
0.1.0 Oct 22, 2025

Used in oxide-rs

MIT license

180KB
2.5K SLoC

shimmytok

Pure Rust tokenizer for GGUF models

100% llama.cpp compatible • zero C++ • just works



shimmytok is free forever. MIT licensed, no strings attached.

💝 If shimmytok helps you, consider sponsoring.


Features

  • 🦀 Pure Rust - No C++ dependencies
  • 📦 Load from GGUF - Read tokenizers directly from model files
  • ✅ Validated - 10/10 llama.cpp vocab models passing
  • 🎯 Complete - All llama.cpp tokenizer types: SPM, BPE, WPM, UGM, RWKV

Installation

```toml
[dependencies]
shimmytok = "0.7"
```

Usage

```rust
use shimmytok::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load tokenizer from GGUF file
    let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;

    // Encode text to token IDs
    let tokens = tokenizer.encode("Hello world", true)?;

    // Decode token IDs back to text
    let text = tokenizer.decode(&tokens, true)?;
    println!("{text}");
    Ok(())
}
```

Validated Models

All models validated against llama-tokenize with exact token match:

| Model          | Type | Status |
|----------------|------|--------|
| bert-bge       | WPM  | ✅     |
| command-r      | BPE  | ✅     |
| deepseek-coder | BPE  | ✅     |
| deepseek-llm   | BPE  | ✅     |
| falcon         | BPE  | ✅     |
| gpt-2          | BPE  | ✅     |
| llama-spm      | SPM  | ✅     |
| qwen2          | BPE  | ✅     |
| refact         | BPE  | ✅     |
| starcoder      | BPE  | ✅     |

Tokenizer Coverage

| Type    | Algorithm                                          | Status |
|---------|----------------------------------------------------|--------|
| SPM     | SentencePiece resegment                            | ✅     |
| BPE     | Priority-queue merge + 41 pre-tokenizer patterns   | ✅     |
| WPM     | WordPiece greedy longest match                     | ✅     |
| UGM     | Unigram Viterbi DP                                 | ✅     |
| RWKV    | Trie-based greedy                                  | ✅     |
| PLaMo-2 | Table-driven reverse DP                            | ✅     |
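To illustrate the WPM strategy above, here is a minimal greedy longest-match WordPiece sketch in plain Rust. The toy vocab and the `wpm_tokenize` helper are hypothetical illustrations, not shimmytok's internals (the real path also handles pre-tokenization and unknown-token fallback):

```rust
use std::collections::HashSet;

/// Greedy longest-match WordPiece: at each position, take the longest
/// vocab entry that matches, prefixing non-initial pieces with "##".
fn wpm_tokenize(word: &str, vocab: &HashSet<&str>) -> Option<Vec<String>> {
    let chars: Vec<char> = word.chars().collect();
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Try the longest candidate first, shrinking until a vocab hit.
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}"); // continuation marker
            }
            if vocab.contains(piece.as_str()) {
                found = Some((piece, end));
                break;
            }
            end -= 1;
        }
        match found {
            Some((piece, next)) => {
                tokens.push(piece);
                start = next;
            }
            None => return None, // a full implementation would emit <unk>
        }
    }
    Some(tokens)
}

fn main() {
    let vocab: HashSet<&str> = ["un", "##aff", "##able"].into_iter().collect();
    let tokens = wpm_tokenize("unaffable", &vocab).unwrap();
    println!("{tokens:?}"); // ["un", "##aff", "##able"]
}
```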

API

```rust
// Core
Tokenizer::from_gguf_file(path) -> Result<Tokenizer>
tokenizer.encode(text, add_special_tokens) -> Result<Vec<TokenId>>
tokenizer.decode(&tokens, skip_special_tokens) -> Result<String>
tokenizer.decode_single(token_id) -> Result<String>

// Metadata
tokenizer.vocab_size() -> usize
tokenizer.bos_token() -> Option<TokenId>
tokenizer.eos_token() -> Option<TokenId>
tokenizer.model_type() -> &str
tokenizer.pre_type() -> &str

// Batch
tokenizer.encode_batch(texts, add_special) -> Result<Vec<Vec<TokenId>>>
```

Why shimmytok?

  • No C++: Works anywhere Rust works (WASM, embedded, etc.)
  • No separate files: Loads tokenizer directly from GGUF
  • Correctness first: Every tokenizer validated against llama.cpp
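The "no separate files" point works because GGUF stores the tokenizer in its metadata section (keys such as `tokenizer.ggml.model` and `tokenizer.ggml.tokens`). A minimal sketch of parsing just the fixed GGUF header, assuming the standard little-endian v3 layout; this is an illustration, not shimmytok's actual loader:

```rust
// Parse the fixed GGUF header: magic "GGUF", version (u32), tensor count
// (u64), and metadata key-value count (u64), all little-endian. A real
// loader would go on to walk the KV pairs to find the tokenizer entries.
fn parse_gguf_header(bytes: &[u8]) -> Option<(u32, u64, u64)> {
    if bytes.len() < 24 || &bytes[0..4] != b"GGUF" {
        return None;
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().ok()?);
    let tensor_count = u64::from_le_bytes(bytes[8..16].try_into().ok()?);
    let kv_count = u64::from_le_bytes(bytes[16..24].try_into().ok()?);
    Some((version, tensor_count, kv_count))
}

fn main() {
    // Hand-built header: magic "GGUF", version 3, 0 tensors, 2 metadata KVs.
    let mut buf = Vec::new();
    buf.extend_from_slice(b"GGUF");
    buf.extend_from_slice(&3u32.to_le_bytes());
    buf.extend_from_slice(&0u64.to_le_bytes());
    buf.extend_from_slice(&2u64.to_le_bytes());
    assert_eq!(parse_gguf_header(&buf), Some((3, 0, 2)));
    println!("parsed GGUF v3 header");
}
```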

License

MIT License - forever.


Maintainer: Michael A. Kuykendall


Dependencies

~3.5–5MB
~96K SLoC