#tokenize #artificial-intelligence #mistral #nlp

tekken-rs

Rust implementation of Mistral Tekken tokenizer with audio support

2 releases

Uses new Rust 2024

0.1.1 Jul 28, 2025
0.1.0 Jul 25, 2025

#343 in Audio


85 downloads per month
Used in kitsune-stt

Apache-2.0

85KB
1K SLoC


A Rust implementation of the Mistral Tekken tokenizer with audio support. This library tokenizes text and audio data and is designed to be compatible with Mistral AI's tokenizer.

Features

  • Text Tokenization: Full compatibility with Mistral's Tekken tokenizer
  • Audio Support: Encode and decode audio data with mel-scale spectrogram processing
  • Multiple Versions: Support for various tokenizer versions (V7, etc.)
  • Special Tokens: Complete handling of special tokens (BOS, EOS, audio tokens, etc.)

Installation

Add this to your Cargo.toml:

[dependencies]
tekken = "0.1.0"

Or use the Git repository directly:

[dependencies]
tekken = { git = "https://github.com/jorge-menjivar/tekken-rs" }

Quick Start

Basic Text Tokenization

use tekken::tekkenizer::Tekkenizer;
use tekken::special_tokens::SpecialTokenPolicy;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load tokenizer
    let tokenizer = Tekkenizer::from_file("tekken.json")?;

    // Encode text
    let text = "Hello, world!";
    let tokens = tokenizer.encode(text, true, true)?; // add_bos=true, add_eos=true

    // Decode tokens
    let decoded = tokenizer.decode(&tokens, SpecialTokenPolicy::Keep)?;
    println!("Original: {}", text);
    println!("Tokens: {:?}", tokens);
    println!("Decoded: {}", decoded);

    Ok(())
}
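To make the role of the policy argument in `decode` more concrete, here is a toy model of how a special-token policy can change the decoded output. This is a sketch under assumptions, not the crate's code: `ToyPolicy`, `toy_decode`, and the tiny vocabularies are hypothetical, only `Keep` appears in the example above (the `Ignore` and `Raise` names are borrowed from the policy names in Mistral's Python tokenizer), and the layout in which special tokens occupy the lowest IDs is assumed for illustration.

```rust
/// Hypothetical stand-in for a special-token policy; not the crate's enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyPolicy {
    /// Drop special tokens from the decoded text.
    Ignore,
    /// Render special tokens as their string form.
    Keep,
    /// Fail if any special token is present.
    Raise,
}

/// Decode `ids` against a toy vocabulary.
/// Assumption: IDs below `special.len()` are special tokens, and regular
/// vocabulary entries are offset by that amount.
pub fn toy_decode(
    ids: &[u32],
    special: &[&str],
    vocab: &[&str],
    policy: ToyPolicy,
) -> Result<String, String> {
    let n_special = special.len() as u32;
    let mut out = String::new();
    for &id in ids {
        if id < n_special {
            match policy {
                ToyPolicy::Ignore => {}
                ToyPolicy::Keep => out.push_str(special[id as usize]),
                ToyPolicy::Raise => {
                    return Err(format!("unexpected special token id {id}"));
                }
            }
        } else {
            // Regular token: shift the ID back into the vocabulary range.
            let piece = vocab
                .get((id - n_special) as usize)
                .ok_or_else(|| format!("unknown token id {id}"))?;
            out.push_str(piece);
        }
    }
    Ok(out)
}
```

With `special = ["<s>", "</s>"]`, `vocab = ["Hello", ", world!"]`, and `ids = [0, 2, 3, 1]`, the `Keep` policy yields `<s>Hello, world!</s>`, `Ignore` yields `Hello, world!`, and `Raise` returns an error.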

Audio Processing

use tekken::audio::{Audio, AudioConfig, AudioSpectrogramConfig, AudioEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load audio
    let audio = Audio::from_file("audio.wav")?;

    // Create audio configuration
    let spectrogram_config = AudioSpectrogramConfig::new(80, 160, 400)?;
    let audio_config = AudioConfig::new(16000, 12.5, spectrogram_config, None)?;

    // Encode audio to tokens
    let encoder = AudioEncoder::new(audio_config, 1000, 1001); // audio_token_id, begin_audio_token_id
    let encoding = encoder.encode(audio)?;

    println!("Audio encoded to {} tokens", encoding.tokens.len());

    Ok(())
}
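The numbers in the configuration above are easier to read with the arithmetic spelled out. The sketch below assumes that the three arguments to `AudioSpectrogramConfig::new` are the mel bin count, hop length, and window size, and that the first two arguments to `AudioConfig::new` are the sampling rate in Hz and the token frame rate in Hz, as in Mistral's Python configuration. `FrameParams` and its methods are hypothetical helpers, not part of the crate, and rounding a partial chunk up to a whole token is also an assumption.

```rust
/// Hypothetical mirror of the values used in the example above.
/// Field names are assumptions; the crate's own types may differ.
#[derive(Debug, Clone, Copy)]
pub struct FrameParams {
    /// Samples per second (16000 in the example).
    pub sampling_rate: u32,
    /// Audio tokens per second (12.5 in the example).
    pub frame_rate: f64,
    /// Samples between consecutive spectrogram frames (160 in the example).
    pub hop_length: u32,
}

impl FrameParams {
    /// Number of raw samples covered by one audio token.
    pub fn samples_per_token(&self) -> u32 {
        (self.sampling_rate as f64 / self.frame_rate).round() as u32
    }

    /// Number of spectrogram frames covered by one audio token.
    pub fn spectrogram_frames_per_token(&self) -> u32 {
        self.samples_per_token() / self.hop_length
    }

    /// Audio tokens needed for `num_samples`, rounding a partial chunk up.
    pub fn tokens_for(&self, num_samples: u64) -> u64 {
        let per_token = self.samples_per_token() as u64;
        (num_samples + per_token - 1) / per_token
    }
}
```

Under these assumptions, 16000 Hz at 12.5 tokens per second gives 1280 samples per token, which is 8 spectrogram frames at a hop length of 160, and one second of audio needs 13 tokens once the final partial chunk is rounded up.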

Examples

Run the examples to see the tokenizer in action:

# Basic tokenizer test
cargo run --example basic_tokenizer_test

# Audio processing test
cargo run --bin test_audio

Testing

Run the test suite:

cargo test

Architecture

The tokenizer consists of several key components:

  • tokenizer.rs: Main tokenizer implementation
  • audio.rs: Audio processing and encoding functionality
  • special_tokens.rs: Special token definitions and handling
  • config.rs: Configuration structures
  • errors.rs: Error handling

Audio Support

The audio implementation includes:

  • WAV file loading and processing
  • Mel-scale spectrogram computation
  • Audio chunk encoding to tokens
  • Compatibility with the Python implementation
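As background for the mel-scale spectrogram step, the sketch below shows how filter center frequencies can be spaced evenly on a mel scale. It uses the common HTK-style formula purely as an illustration; the crate may use a different mel variant (for example the Slaney scale) to match the Python implementation, and the function names here are hypothetical.

```rust
/// Convert a frequency in Hz to mels using the HTK-style formula.
/// Assumption: illustration only; the crate may use another mel variant.
pub fn hz_to_mel(hz: f64) -> f64 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

/// Inverse of `hz_to_mel`.
pub fn mel_to_hz(mel: f64) -> f64 {
    700.0 * (10f64.powf(mel / 2595.0) - 1.0)
}

/// Center frequencies (Hz) of `n_mels` filters spaced evenly in mel
/// between `f_min` and `f_max`.
pub fn mel_center_frequencies(n_mels: usize, f_min: f64, f_max: f64) -> Vec<f64> {
    let mel_min = hz_to_mel(f_min);
    let mel_max = hz_to_mel(f_max);
    // n_mels triangular filters need n_mels + 2 edge points;
    // the centers are the interior points.
    let step = (mel_max - mel_min) / (n_mels as f64 + 1.0);
    (1..=n_mels)
        .map(|i| mel_to_hz(mel_min + step * i as f64))
        .collect()
}
```

For example, `mel_center_frequencies(80, 0.0, 8000.0)` returns 80 increasing frequencies between 0 Hz and the 8 kHz Nyquist limit of 16 kHz audio, packed more densely at low frequencies than at high ones.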

Audio Token Flow

  1. Load Audio: Load WAV files or audio data
  2. Resample: Convert to target sampling rate (16kHz)
  3. Pad: Ensure minimum length for processing
  4. Tokenize: Convert to token sequence with special audio markers
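The steps above can be sketched in plain Rust without the crate. This is a minimal illustration under stated assumptions, not the crate's implementation: all function names are hypothetical, linear interpolation stands in for whatever resampling method the crate actually uses, padding to a whole number of token-sized chunks is assumed, and the token layout (one begin-audio marker followed by one audio token per chunk) is an assumption based on the `audio_token_id` and `begin_audio_token_id` arguments shown earlier.

```rust
/// Resample by linear interpolation (a simple stand-in, not necessarily
/// the method the crate uses).
pub fn resample_linear(samples: &[f32], from_rate: u32, to_rate: u32) -> Vec<f32> {
    if samples.is_empty() || from_rate == to_rate {
        return samples.to_vec();
    }
    let out_len = (samples.len() as u64 * to_rate as u64 / from_rate as u64) as usize;
    let ratio = from_rate as f64 / to_rate as f64;
    let last = samples.len() - 1;
    (0..out_len)
        .map(|i| {
            // Position of output sample i in the input signal.
            let pos = i as f64 * ratio;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let a = samples[idx.min(last)];
            let b = samples[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}

/// Zero-pad so the length is a positive multiple of `multiple`.
pub fn pad_to_multiple(mut samples: Vec<f32>, multiple: usize) -> Vec<f32> {
    assert!(multiple > 0, "multiple must be positive");
    let chunks = (samples.len().max(1) + multiple - 1) / multiple;
    samples.resize(chunks * multiple, 0.0);
    samples
}

/// Build the token sequence: one begin-audio marker, then one audio
/// token per full chunk of `samples_per_token` samples.
pub fn audio_token_sequence(
    num_samples: usize,
    samples_per_token: usize,
    begin_audio_id: u32,
    audio_id: u32,
) -> Vec<u32> {
    let n_tokens = num_samples / samples_per_token;
    let mut tokens = Vec::with_capacity(n_tokens + 1);
    tokens.push(begin_audio_id);
    tokens.extend(std::iter::repeat(audio_id).take(n_tokens));
    tokens
}
```

As a usage example, one second of 32 kHz audio resampled to 16 kHz has 16000 samples, which pad to 16640 (13 chunks of 1280 samples), so `audio_token_sequence(16640, 1280, 1001, 1000)` produces 14 IDs: the begin-audio marker followed by 13 audio tokens.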

Compatibility

This Rust implementation is designed to be fully compatible with the Python version:

  • Same tokenization results
  • Identical audio processing
  • Compatible special token handling
  • Same mel filter bank computations

Requirements

  • Rust 1.85 or higher (the crate uses the Rust 2024 edition, which was stabilized in Rust 1.85)
  • For audio support: audio files in WAV format

Project Structure

tekken-rs/
├── src/
│   ├── lib.rs          # Library entry point
│   ├── tokenizer.rs    # Main tokenizer implementation
│   ├── audio.rs        # Audio processing functionality
│   ├── special_tokens.rs # Special token definitions
│   ├── config.rs       # Configuration structures
│   └── errors.rs       # Error types
├── examples/           # Example usage
├── tests/             # Integration tests
└── benches/           # Performance benchmarks

Performance

The Rust implementation is designed to be faster than the Python version:

  • Fast tokenization using efficient data structures
  • Zero-copy string handling where possible
  • Optimized audio processing with SIMD operations

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to:

  • Update tests as appropriate
  • Follow Rust coding conventions
  • Run cargo fmt and cargo clippy before submitting

See CONTRIBUTING.md for detailed guidelines.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

This is an original Rust implementation designed to be compatible with Mistral AI's Tekken tokenizer format.

See NOTICE file for detailed attribution.

Dependencies

~20MB
~197K SLoC