lmonade-models

Core model architectures and serving components for the Lmonade inference engine.

Overview

This crate provides:

  • Model architectures (currently TinyLlama)
  • Tensor operations and components (attention, feedforward, normalization)
  • Serving infrastructure (paged KV cache, block management)
  • Weight loading from SafeTensors and GGUF formats (see the SafeTensors sketch after this list)
  • Batching strategies for inference
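
As a rough illustration of the SafeTensors side of weight loading, the sketch below reads a checkpoint with the upstream safetensors crate and prints each tensor's name, dtype, and shape. This is not lmonade-models' own loader API, and the file path is a placeholder.

use std::fs;
use safetensors::SafeTensors;

fn main() {
    // Read the raw bytes of a .safetensors checkpoint (placeholder path).
    let buffer = fs::read("path/to/model.safetensors").expect("failed to read checkpoint");

    // Parse the header and obtain zero-copy views over every stored tensor.
    let tensors = SafeTensors::deserialize(&buffer).expect("invalid safetensors file");
    for name in tensors.names() {
        let view = tensors.tensor(name).expect("tensor listed in header");
        println!("{name}: {:?} {:?}", view.dtype(), view.shape());
    }
}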

Key Components

  • Models: Architecture implementations (src/models/)
  • Components: Building blocks like attention and feedforward layers (src/components/)
  • Formats: Weight loading and model configuration (src/formats/)
  • Serving: Production serving infrastructure (src/serving/)
    • Paged attention and KV cache management
    • Continuous batching for throughput optimization
    • Memory block management (see the block-table sketch after this list)
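
The simplified, self-contained sketch below illustrates the block-management idea behind a paged KV cache: the cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping its logical positions to physical blocks acquired on demand. All names here (BLOCK_SIZE, BlockAllocator, BlockTable) are hypothetical and do not correspond to lmonade-models' actual serving types.

/// Fixed number of token slots stored per physical cache block.
const BLOCK_SIZE: usize = 16;

/// Hands out and reclaims physical block indices from a fixed pool.
struct BlockAllocator {
    free: Vec<usize>,
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).rev().collect() }
    }

    /// Returns a free physical block index, or None if the pool is exhausted.
    fn allocate(&mut self) -> Option<usize> {
        self.free.pop()
    }

    /// Returns a block to the pool once a sequence finishes.
    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
}

/// Per-sequence mapping from logical block number to physical block index.
struct BlockTable {
    blocks: Vec<usize>,
    tokens: usize,
}

impl BlockTable {
    fn new() -> Self {
        Self { blocks: Vec::new(), tokens: 0 }
    }

    /// Reserves another physical block whenever the sequence grows past a
    /// block boundary, so memory is committed on demand rather than up front.
    fn append_token(&mut self, alloc: &mut BlockAllocator) -> Option<()> {
        if self.tokens % BLOCK_SIZE == 0 {
            self.blocks.push(alloc.allocate()?);
        }
        self.tokens += 1;
        Some(())
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(64);
    let mut seq = BlockTable::new();
    for _ in 0..40 {
        seq.append_token(&mut alloc).expect("cache is full");
    }
    // 40 tokens at 16 slots per block -> 3 physical blocks in use.
    println!("blocks used: {:?}", seq.blocks);
}

Allocating blocks lazily as sequences grow is what lets continuous batching pack many requests of different lengths into one cache without reserving worst-case memory for each request.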

Usage

use lmonade_models::models::tinyllama::TinyLlamaModel;
use lmonade_models::formats::config::ModelConfig;

// Load model configuration
let config = ModelConfig::from_file("path/to/config.json")?;

// Initialize model
let model = TinyLlamaModel::new(&config)?;

Documentation

For full API documentation and architectural details, see:

Status

This crate is under active development. TinyLlama inference is partially working, and optimizations for performance and accuracy are ongoing.
