MiniLLM 🤖
A lightweight, efficient transformer inference engine written in Rust. MiniLLM provides a clean, well-documented implementation of GPT-2 style transformer models with support for text generation.
✨ Features
- 🚀 Fast Inference: Efficient tensor operations using ndarray
- 🔒 Memory Safe: Written in Rust with zero-copy operations where possible
- 📦 Easy to Use: High-level API for quick integration
- 🎯 Well Tested: Comprehensive examples and documentation
- 🔧 Extensible: Modular architecture for easy customization
- 🤖 GPT-2 Compatible: Load and run GPT-2 models from HuggingFace
- 🔐 SafeTensors Support: Fast and secure model weight loading
🏗️ Architecture
src/
├── lib.rs          # Library entry point and public API
├── main.rs         # Simple CLI example (27 lines)
├── inference.rs    # High-level inference engine
├── gpt.rs          # GPT model implementation
├── transformer.rs  # Transformer block components
├── attention.rs    # Multi-head attention mechanism
├── mlp.rs          # Feed-forward network layers
├── tensor.rs       # Tensor operations and math
├── weights.rs      # Model weight loading (SafeTensors)
└── config.rs       # Model configuration handling
examples/
├── basic_generation.rs  # Simple text generation
├── interactive_chat.rs  # Interactive chat interface
└── tokenization.rs      # Tokenization examples
🚀 Quick Start
Library Usage
use minillm::inference::InferenceEngine;
fn main() -> minillm::Result<()> {
// Load a GPT-2 model
let engine = InferenceEngine::new("openai-community/gpt2")?;
// Generate text
let prompt = "The future of AI is";
let generated = engine.generate(prompt, 20)?;
println!("Generated: {}", generated);
Ok(())
}
Command Line
# Run the main example
cargo run
# Run specific examples
cargo run --example basic_generation
cargo run --example interactive_chat
cargo run --example tokenization
📋 Requirements
- Rust 1.70+
- HuggingFace token (optional, for private models)
Set your HuggingFace token:
echo "HF_TOKEN=your_token_here" > .env
🔧 Dependencies
- ndarray - Tensor operations
- safetensors - Model weight loading
- tokenizers - Text tokenization
- hf-hub - HuggingFace model downloading
- serde - Configuration parsing
📚 API Documentation
InferenceEngine
The main high-level interface:
// Create engine
let engine = InferenceEngine::new("openai-community/gpt2")?;
// Generate text
let result = engine.generate("prompt", max_tokens)?;
// Tokenization
let tokens = engine.tokenize("text")?;
let text = engine.decode(&tokens)?;
// Get model info
let config = engine.config();
Low-Level Components
For custom implementations, you can use the individual components; a conceptual sketch of how they compose follows the list:
- GPTModel - Complete transformer model
- TransformerBlock - Individual transformer layers
- MultiHeadAttention - Attention mechanism
- MLP - Feed-forward networks
- Tensor - Mathematical operations
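The placeholder types below are illustrative stand-ins, not MiniLLM's confirmed signatures; they only show the pre-norm residual layout GPT-2 uses:
// Illustrative placeholders for minillm's low-level components.
struct MultiHeadAttentionSketch;
struct MlpSketch;

struct TransformerBlockSketch {
    _attn: MultiHeadAttentionSketch,
    _mlp: MlpSketch,
}

impl TransformerBlockSketch {
    // GPT-2 blocks are pre-norm residual: x + attn(ln1(x)), then x + mlp(ln2(x)).
    fn forward(&self, hidden: Vec<f32>) -> Vec<f32> {
        hidden // identity stand-in for the real math
    }
}

fn main() {
    // Token + position embeddings -> 12 blocks -> final layer norm -> logits
    // (GPT-2 small: 12 layers, hidden width 768).
    let blocks: Vec<TransformerBlockSketch> = (0..12)
        .map(|_| TransformerBlockSketch { _attn: MultiHeadAttentionSketch, _mlp: MlpSketch })
        .collect();
    let mut hidden = vec![0.0_f32; 768];
    for block in &blocks {
        hidden = block.forward(hidden);
    }
    println!("final hidden width: {}", hidden.len());
}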
🎯 Examples
Basic Generation
cargo run --example basic_generation
Demonstrates simple text generation with model configuration display.
Interactive Chat
cargo run --example interactive_chat
Interactive command-line chat interface with the model.
Tokenization
cargo run --example tokenization
Shows tokenization, encoding/decoding, and round-trip verification.
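A minimal round-trip check built only on the InferenceEngine methods shown earlier (assuming decode returns the original string for plain ASCII input):
use minillm::inference::InferenceEngine;

fn main() -> minillm::Result<()> {
    let engine = InferenceEngine::new("openai-community/gpt2")?;
    let text = "Hello, world!";
    // Encode to token ids, then decode back.
    let tokens = engine.tokenize(text)?;
    let round_trip = engine.decode(&tokens)?;
    // GPT-2's BPE is lossless on plain text, so the strings should match.
    assert_eq!(text, round_trip);
    println!("{} tokens survive the round trip", tokens.len());
    Ok(())
}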
📊 Performance
MiniLLM is designed for inference efficiency:
- Memory: ~1GB RAM for GPT-2 (117M parameters)
- Speed: ~10-50 tokens/second (CPU, varies by hardware)
- Accuracy: Identical outputs to reference implementations
- Models: Currently supports GPT-2 architecture
🛠️ Development
# Clone and build
git clone https://github.com/bmqube/minillm
cd minillm
cargo build --release
# Run tests
cargo test
# Check examples
cargo check --examples
# Generate documentation
cargo doc --open
📐 Architecture Details
Transformer Implementation
- Multi-head attention with causal masking (sketched below)
- Feed-forward networks with GELU activation
- Layer normalization and residual connections
- Position and token embeddings
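As a concrete illustration of the causal masking and GELU activation above, here is a standalone ndarray sketch (not MiniLLM's internal code):
use ndarray::Array2;

// Causal mask: position i may only attend to positions j <= i. Setting the
// upper triangle to -inf makes those weights vanish after softmax.
fn apply_causal_mask(scores: &mut Array2<f32>) {
    let n = scores.nrows();
    for i in 0..n {
        for j in (i + 1)..n {
            scores[[i, j]] = f32::NEG_INFINITY;
        }
    }
}

// GELU activation, tanh approximation (the variant GPT-2 uses).
fn gelu(x: f32) -> f32 {
    let c = (2.0 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x.powi(3))).tanh())
}

fn main() {
    let mut scores = Array2::<f32>::ones((4, 4));
    apply_causal_mask(&mut scores);
    println!("masked scores:\n{scores:?}");
    println!("gelu(1.0) = {}", gelu(1.0));
}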
Tensor Operations
- Dynamic 1D-4D tensor support
- Optimized matrix multiplication
- Element-wise operations (add, softmax, layer_norm; sketched below)
- Memory-efficient implementations
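A sketch of two of these operations with ndarray, independent of MiniLLM's own Tensor type:
use ndarray::{Array2, Axis};

// Row-wise softmax with max subtraction for numerical stability.
fn softmax_rows(x: &Array2<f32>) -> Array2<f32> {
    let mut out = x.clone();
    for mut row in out.axis_iter_mut(Axis(0)) {
        let max = row.fold(f32::NEG_INFINITY, |a, &b| a.max(b));
        row.mapv_inplace(|v| (v - max).exp());
        let sum = row.sum();
        row.mapv_inplace(|v| v / sum);
    }
    out
}

// Row-wise layer norm: normalize to zero mean / unit variance, then scale and shift.
fn layer_norm_rows(x: &Array2<f32>, gamma: f32, beta: f32, eps: f32) -> Array2<f32> {
    let mut out = x.clone();
    for mut row in out.axis_iter_mut(Axis(0)) {
        let mean = row.mean().unwrap();
        let var = row.mapv(|v| (v - mean).powi(2)).mean().unwrap();
        row.mapv_inplace(|v| gamma * (v - mean) / (var + eps).sqrt() + beta);
    }
    out
}

fn main() {
    let x = Array2::from_shape_vec((2, 3), vec![1., 2., 3., 4., 5., 6.]).unwrap();
    println!("softmax:\n{:?}", softmax_rows(&x));
    println!("layer_norm:\n{:?}", layer_norm_rows(&x, 1.0, 0.0, 1e-5));
}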
Model Loading
- SafeTensors format support (loading sketch below)
- Automatic model downloading from HuggingFace
- Configuration parsing and validation
- Error handling with detailed messages
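The standalone sketch below mirrors that pipeline using the safetensors and hf-hub crates directly; it is an illustration of the mechanism, not MiniLLM's internal loader:
use hf_hub::api::sync::Api;
use safetensors::SafeTensors;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Download the GPT-2 weights (or reuse the local HuggingFace cache).
    let repo = Api::new()?.model("openai-community/gpt2".to_string());
    let path = repo.get("model.safetensors")?;

    // SafeTensors validates the header and exposes zero-copy tensor views.
    let bytes = std::fs::read(&path)?;
    let tensors = SafeTensors::deserialize(&bytes)?;

    // Print the first few tensor names and shapes.
    for (name, view) in tensors.tensors().iter().take(5) {
        println!("{name}: {:?}", view.shape());
    }
    Ok(())
}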
✅ Current Status
- ✅ Core Architecture: Complete GPT-2 implementation
- ✅ Inference Engine: High-level API ready
- ✅ Examples: Comprehensive usage examples
- ✅ Documentation: Well-documented codebase
- ✅ Testing: All components tested and working
🗺️ Roadmap
- Performance: GPU acceleration support
- Models: Support for larger GPT variants
- Features: Beam search and sampling options
- Optimization: Quantization and pruning
- Integration: Python bindings
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
🙏 Acknowledgments
- Inspired by Andrej Karpathy's educational implementations
- Built on the excellent Rust ecosystem (ndarray, tokenizers, etc.)
- Model weights from HuggingFace transformers library
👨‍💻 Author
BM Monjur Morshed