shimmytok is free forever. MIT licensed, no strings attached.
💝 If shimmytok helps you, consider sponsoring.
Features
- 🦀 Pure Rust - No C++ dependencies
- 📦 Load from GGUF - Read tokenizers directly from model files
- ✅ Validated - 10/10 llama.cpp vocab models passing
- 🎯 Complete - All llama.cpp tokenizer types: SPM, BPE, WPM, UGM, RWKV, PLaMo-2
Installation
```toml
[dependencies]
shimmytok = "0.7"
```
Usage
```rust
use shimmytok::Tokenizer;

// Load tokenizer directly from a GGUF model file
let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;

// Encode text to token IDs (true = add special tokens)
let tokens = tokenizer.encode("Hello world", true)?;

// Decode token IDs back to text
let text = tokenizer.decode(&tokens, true)?;
```
Validated Models
All models validated against llama-tokenize with exact token match:
| Model | Type | Status |
|---|---|---|
| bert-bge | WPM | ✅ |
| command-r | BPE | ✅ |
| deepseek-coder | BPE | ✅ |
| deepseek-llm | BPE | ✅ |
| falcon | BPE | ✅ |
| gpt-2 | BPE | ✅ |
| llama-spm | SPM | ✅ |
| qwen2 | BPE | ✅ |
| refact | BPE | ✅ |
| starcoder | BPE | ✅ |
Tokenizer Coverage
| Type | Algorithm | Status |
|---|---|---|
| SPM | SentencePiece resegment | ✅ |
| BPE | Priority queue merge + 41 pre-tokenizer patterns | ✅ |
| WPM | Word-Piece greedy longest match | ✅ |
| UGM | Unigram Viterbi DP | ✅ |
| RWKV | Trie-based greedy | ✅ |
| PLaMo-2 | Table-driven reverse DP | ✅ |
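To illustrate the simplest of these strategies, here is a self-contained sketch of WordPiece-style greedy longest match (the WPM row). This is a generic textbook version for illustration only, not shimmytok's actual implementation; the `##` continuation prefix and `[UNK]` fallback are conventional assumptions borrowed from BERT-family vocabularies:

```rust
use std::collections::HashSet;

/// WordPiece-style greedy longest match: at each position, take the longest
/// vocabulary entry starting there; non-initial pieces carry a "##" prefix.
/// If no piece matches, the whole word falls back to the unknown token.
fn wpm_encode(word: &str, vocab: &HashSet<&str>, unk: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut out = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut piece: Option<String> = None;
        // Shrink the candidate from the right until it is in the vocabulary.
        while start < end {
            let mut cand: String = chars[start..end].iter().collect();
            if start > 0 {
                cand = format!("##{}", cand);
            }
            if vocab.contains(cand.as_str()) {
                piece = Some(cand);
                break;
            }
            end -= 1;
        }
        match piece {
            Some(p) => {
                out.push(p);
                start = end;
            }
            None => return vec![unk.to_string()],
        }
    }
    out
}
```

With a vocabulary of `{"un", "##aff", "##able"}`, `wpm_encode("unaffable", …)` yields `["un", "##aff", "##able"]`; greedy matching is what keeps this linear-ish and simple compared to the Viterbi DP the UGM row requires.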
API
```rust
// Core
Tokenizer::from_gguf_file(path) -> Result<Tokenizer>
tokenizer.encode(text, add_special_tokens) -> Result<Vec<TokenId>>
tokenizer.decode(&tokens, skip_special_tokens) -> Result<String>
tokenizer.decode_single(token_id) -> Result<String>

// Metadata
tokenizer.vocab_size() -> usize
tokenizer.bos_token() -> Option<TokenId>
tokenizer.eos_token() -> Option<TokenId>
tokenizer.model_type() -> &str
tokenizer.pre_type() -> &str

// Batch
tokenizer.encode_batch(texts, add_special_tokens) -> Result<Vec<Vec<TokenId>>>
```
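When wiring any tokenizer with this encode/decode shape into an application, a cheap sanity check is that `decode(encode(text))` round-trips. The sketch below shows the pattern against a deliberately trivial whitespace stand-in (a hypothetical toy, not shimmytok's types), so the check itself is self-contained:

```rust
/// A toy stand-in with the same encode/decode shape as the API above.
/// Token id = index into `vocab`; real tokenizers are far richer.
struct ToyTokenizer {
    vocab: Vec<String>,
}

impl ToyTokenizer {
    /// Map each whitespace-separated word to its vocabulary index,
    /// silently dropping out-of-vocabulary words.
    fn encode(&self, text: &str) -> Vec<usize> {
        text.split_whitespace()
            .filter_map(|w| self.vocab.iter().position(|v| v.as_str() == w))
            .collect()
    }

    /// Map token ids back to words, rejoined with single spaces.
    fn decode(&self, ids: &[usize]) -> String {
        ids.iter()
            .map(|&i| self.vocab[i].as_str())
            .collect::<Vec<_>>()
            .join(" ")
    }
}

/// Round-trip check: in-vocabulary text should survive encode -> decode.
fn round_trips(tok: &ToyTokenizer, text: &str) -> bool {
    tok.decode(&tok.encode(text)) == text
}
```

With the real crate the same pattern applies: encode a corpus, decode it back, and compare against `llama-tokenize` output, which is exactly the validation the table above reports.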
Why shimmytok?
- No C++: Works anywhere Rust works (WASM, embedded, etc.)
- No separate files: Loads tokenizer directly from GGUF
- Correctness first: Every tokenizer validated against llama.cpp
Links
- 📖 CHANGELOG - Version history
- 🗺️ ROADMAP - Future plans
- 🤝 CONTRIBUTING - How to contribute
- 🔒 SECURITY - Vulnerability reporting
License
MIT License - forever.
Maintainer: Michael A. Kuykendall
See Also
- libshimmy - Pure Rust LLM inference engine that uses shimmytok
- llama.cpp - Reference C++ implementation
- GGUF format spec