# metal-candle

Production-quality Rust ML library for Apple Silicon: LoRA training, text generation, and semantic embeddings.
## Overview
Pure Rust machine learning library optimized for Apple Silicon:

- **LoRA Training**: Fine-tune transformer models efficiently
- **Text Generation**: Streaming, multiple sampling strategies, repetition penalty
- **Semantic Embeddings**: E5, MiniLM, MPNet models for RAG and search
- **Metal Acceleration**: Native GPU acceleration on M-series chips
**Why metal-candle?** 25.9x faster than MLX for embeddings, single-binary deployment, type-safe ML, and production-ready quality (407 tests, 81.6% coverage).
## Performance
metal-candle demonstrates exceptional performance on Apple Silicon:
| Task | Batch Size | metal-candle | MLX | Speedup |
|---|---|---|---|---|
| Embeddings | 100 docs | 4.4ms | 113.5ms | 25.9x |
| Embeddings | Single query | 3.9ms | 7.7ms | 2.0x |
| Throughput | - | 22,831 docs/sec | 881 docs/sec | 25.9x |
**Near constant-time performance:** going from batch 1 to batch 100 increases latency by only 13% (3.9ms → 4.4ms).
See BENCHMARKS.md for detailed performance analysis and methodology.
## Installation

```toml
[dependencies]
metal-candle = "1.2"  # or the latest version from crates.io
```

**Requirements:** Rust 1.75+, Apple Silicon (M1/M2/M3/M4), macOS 12.0+
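The async streaming API and the embedding models shown below are gated behind Cargo features. The snippet that follows is a plausible way to enable them; the feature names `streaming` and `embeddings` are taken from the examples in this README, so verify the authoritative list against the crate's docs.rs page or `Cargo.toml`:

```toml
[dependencies]
# "streaming" gates the async token-stream API; "embeddings" gates the
# embedding models. Feature names inferred from this README's examples --
# check docs.rs before relying on them.
metal-candle = { version = "1.2", features = ["streaming", "embeddings"] }
```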
## Quick Start

### Text Generation
```rust
use metal_candle::inference::{Generator, GeneratorConfig, SamplingStrategy};
use metal_candle::models::Qwen;

// Load model
let model = Qwen::new(&config, vb)?;

// Configure generation
let gen_config = GeneratorConfig {
    max_tokens: 128,
    sampling: SamplingStrategy::TopP { p: 0.95 },
    temperature: 0.7,
    repetition_penalty: 1.1, // Reduce repetition
    stop_on_eos: true,
    eos_token_id: Some(151643), // Qwen EOS token
    ..Default::default()
};

// Generate tokens
let mut generator = Generator::new(Box::new(model), gen_config)?;
let output_ids = generator.generate(&input_ids)?;

// Or use streaming for real-time generation (v1.3.0+)
generator.generate_stream(&input_ids, |token| {
    println!("Token {}: prob={:.2}%", token.token_id, token.probability * 100.0);
    true // Continue generation
})?;

// Async streaming (requires the `streaming` feature)
#[cfg(feature = "streaming")]
{
    use futures::stream::StreamExt;
    use futures::pin_mut;

    let stream = generator.generate_stream_async(&input_ids);
    pin_mut!(stream);
    while let Some(result) = stream.next().await {
        let token = result?;
        println!("Token: {}", token.token_id);
    }
}
```
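If you are curious what `SamplingStrategy::TopP` refers to: nucleus (top-p) sampling keeps only the smallest set of highest-probability tokens whose cumulative probability reaches `p`, then samples from that set. Here is a minimal, self-contained sketch of the selection step; this is the standard algorithm the name refers to, not metal-candle's internal implementation:

```rust
/// Return the indices that form the top-p "nucleus": the smallest set of
/// highest-probability tokens whose cumulative probability reaches `p`.
fn top_p_nucleus(probs: &[f32], p: f32) -> Vec<usize> {
    // Sort token indices by descending probability.
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    let mut nucleus = Vec::new();
    let mut cumulative = 0.0f32;
    for &i in &idx {
        nucleus.push(i);
        cumulative += probs[i];
        if cumulative >= p {
            break; // nucleus is complete
        }
    }
    nucleus
}

fn main() {
    // Toy distribution over 5 tokens.
    let probs = [0.5, 0.25, 0.15, 0.07, 0.03];
    // With p = 0.9 the nucleus is {0, 1, 2}: 0.5 + 0.25 + 0.15 = 0.90.
    println!("nucleus = {:?}", top_p_nucleus(&probs, 0.9));
}
```

A real sampler then renormalizes the probabilities over the nucleus and draws from it randomly; selecting the set is the part that `p: 0.95` controls.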
### Semantic Embeddings (RAG & Search)
```rust
use metal_candle::embeddings::{EmbeddingModel, EmbeddingModelType};
use metal_candle::Device;

// Load embedding model with Metal acceleration (25.9x faster than MLX)
let device = Device::new_metal(0)?;
let model = EmbeddingModel::from_pretrained(EmbeddingModelType::E5SmallV2, device)?;

// Generate embeddings for semantic search
let texts = vec![
    "Rust is a systems programming language",
    "Python is a high-level language",
];
let embeddings = model.encode(&texts)?; // [batch, 384] in 3.9ms

// Batch processing: 100 docs in 4.4ms (22,831 docs/sec throughput)
let large_corpus = load_documents()?;
let batch_embeddings = model.encode(&large_corpus)?;
```
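Downstream of `encode`, semantic search reduces to ranking documents by cosine similarity to the query embedding. A dependency-free sketch on plain `Vec<f32>` vectors (metal-candle returns tensors, so in practice you would copy the values out first; the helper names here are illustrative, not part of the crate's API):

```rust
/// Cosine similarity between two equal-length embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Rank corpus embeddings by similarity to a query embedding, best first.
fn rank_by_similarity(query: &[f32], corpus: &[Vec<f32>]) -> Vec<usize> {
    let mut ranked: Vec<usize> = (0..corpus.len()).collect();
    ranked.sort_by(|&a, &b| {
        cosine_similarity(query, &corpus[b])
            .partial_cmp(&cosine_similarity(query, &corpus[a]))
            .unwrap()
    });
    ranked
}

fn main() {
    // Toy 3-dimensional "embeddings" standing in for real 384-dim vectors.
    let query = vec![1.0, 0.0, 0.0];
    let corpus = vec![
        vec![0.0, 1.0, 0.0], // orthogonal to the query
        vec![0.9, 0.1, 0.0], // nearly parallel to the query
    ];
    // Document 1 ranks first, document 0 last.
    println!("ranking = {:?}", rank_by_similarity(&query, &corpus));
}
```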
### LoRA Training
```rust
use metal_candle::training::{
    LoRAAdapter, LoRAAdapterConfig, TargetModule,
    Trainer, TrainingConfig, LRScheduler,
};

// Create LoRA adapter
let lora_config = LoRAAdapterConfig {
    rank: 8,
    alpha: 16.0,
    dropout: 0.0,
    target_modules: vec![TargetModule::QProj, TargetModule::VProj],
};
let adapter = LoRAAdapter::new(&model, lora_config, &device)?;

// Configure and train
let training_config = TrainingConfig {
    num_epochs: 3,
    lr_scheduler: LRScheduler::warmup_cosine(100, 1000, 1e-4, 1e-6),
    ..Default::default()
};
let trainer = Trainer::new(adapter, training_config)?;
let metrics = trainer.train(&dataset)?;
```
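The `warmup_cosine(100, 1000, 1e-4, 1e-6)` call above reads naturally as (warmup steps, total steps, peak LR, minimum LR); that argument order is my inference, so check the `LRScheduler` docs before relying on it. The standard warmup-plus-cosine-decay schedule the name refers to looks like this:

```rust
use std::f64::consts::PI;

/// Linear warmup from 0 to `peak_lr` over `warmup` steps, then cosine
/// decay down to `min_lr` by step `total`.
fn warmup_cosine_lr(step: usize, warmup: usize, total: usize, peak_lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        // Linear ramp: 0 -> peak_lr
        peak_lr * step as f64 / warmup as f64
    } else {
        // Cosine decay: peak_lr -> min_lr
        let progress = (step - warmup) as f64 / (total - warmup) as f64;
        min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + (PI * progress).cos())
    }
}

fn main() {
    // LR peaks at 1e-4 at step 100, then decays to 1e-6 by step 1000.
    for step in [0, 50, 100, 550, 1000] {
        let lr = warmup_cosine_lr(step, 100, 1000, 1e-4, 1e-6);
        println!("step {step:4}: lr = {lr:.2e}");
    }
}
```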
### LoRA Adapter Management (v1.3.0+)
```rust
use metal_candle::training::{AdapterRegistry, LoRAAdapter, LoRAAdapterConfig};

// Create registry for managing multiple adapters
let mut registry = AdapterRegistry::new();

// Load task-specific adapters
let code_adapter = LoRAAdapter::new(768, 3072, 12, &config, &device)?;
let chat_adapter = LoRAAdapter::new(768, 3072, 12, &config, &device)?;
registry.add_adapter("code-assistant".to_string(), code_adapter)?;
registry.add_adapter("chat".to_string(), chat_adapter)?;

// Switch between adapters without reloading base model
registry.activate("code-assistant")?;
// ... use model for code generation ...
registry.activate("chat")?;
// ... use model for chat ...

// Memory efficient: adapters are ~0.03% of base model size
println!("Active adapter: {:?}", registry.active_adapter());
```
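The "~0.03% of base model size" claim is easy to sanity-check with back-of-the-envelope arithmetic: a rank-r LoRA adapter on a d_in × d_out projection adds r·(d_in + d_out) parameters. The sketch below reuses the 768/3072/12-layer shapes from the example above; the 1.5B base-model size is a hypothetical stand-in, and the true ratio depends on which modules are targeted:

```rust
/// LoRA adds two low-rank factors, A (d_in x r) and B (r x d_out),
/// per adapted weight matrix.
fn lora_params(d_in: usize, d_out: usize, rank: usize) -> usize {
    rank * (d_in + d_out)
}

fn main() {
    let (d_in, d_out, layers, rank) = (768, 3072, 12, 8);

    // Adapter parameters across all layers for one target module.
    let adapter = layers * lora_params(d_in, d_out, rank);
    println!("adapter params: {adapter}"); // 12 * 8 * (768 + 3072) = 368640

    // Against a hypothetical 1.5B-parameter base model this is ~0.025%,
    // the same order as the ~0.03% figure quoted above.
    let base = 1.5e9;
    println!("fraction of base: {:.3}%", 100.0 * adapter as f64 / base);
}
```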
## Features
- **Training**: LoRA with dropout, AdamW optimizer, learning rate schedulers, checkpoint management, adapter registry (v1.3.0+)
- **Inference**: KV-cache, multiple sampling strategies, streaming generation (sync & async), repetition penalty, rich token metadata (v1.3.0+)
- **Models**: Qwen2.5-Coder, safetensors format, transformer components (RoPE, GQA, MLP)
- **Embeddings**: E5, MiniLM, MPNet with HuggingFace Hub integration
- **Quality**: 407 tests, 81.6% coverage, strict clippy linting, 100% API documentation
## Architecture
Built on Candle with Metal backend:
```text
┌─────────────────────────────────────────────────────────────┐
│                  metal-candle (Public API)                  │
├─────────────────────────────────────────────────────────────┤
│  Training        │  Inference         │  Models             │
│  • LoRAAdapter   │  • KVCache         │  • ModelLoader      │
│  • Trainer       │  • Sampling        │  • Qwen             │
│  • AdamW         │  • Generator       │  • Config           │
│  • Schedulers    │                    │                     │
│  • Checkpoint    │  Embeddings        │                     │
│                  │  • EmbeddingModel  │                     │
│                  │  • E5/MiniLM/MPNet │                     │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                      Candle Framework                       │
│  • Tensor operations   • Metal backend   • Autograd         │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                       Apple Metal API                       │
│           (GPU acceleration on Apple Silicon)               │
└─────────────────────────────────────────────────────────────┘
```
See ARCHITECTURE.md for detailed architecture documentation.
## Documentation
- API Reference - Complete API documentation
- Architecture Guide - System design and implementation details
- Contributing Guide - Development standards and guidelines
- Benchmarks - Performance analysis and methodology
- Project Plan - Development roadmap and future plans
## Examples

| Example | Description |
|---|---|
| `generate_text.rs` | Text generation with streaming and sampling |
| `train_lora.rs` | End-to-end LoRA training |
| `embeddings_demo.rs` | Semantic search with embeddings |
| `inference_demo.rs` | KV-cache and sampling demo |
| `load_model.rs` | Model loading and inspection |
Run examples:

```sh
cargo run --example generate_text
cargo run --example train_lora
cargo run --example embeddings_demo --features embeddings
```
## Development

```sh
git clone https://github.com/GarthDB/metal-candle.git
cd metal-candle
cargo build && cargo test
```

See CONTRIBUTING.md for full guidelines. Quality standards: zero clippy warnings (pedantic), ≥80% coverage, 100% API docs.
## Roadmap
See ROADMAP.md for detailed release plans and NEXT_STEPS.md for immediate priorities.
### Upcoming Releases
- v1.3.1 (Jan 2025): ApplyAdapter implementation, streaming benchmarks
- v1.4.0 (Feb 2025): GGUF format support
- v1.5.0 (Mar 2025): LLaMA/Mistral architectures
- v1.6.0 (Apr 2025): 4-bit/8-bit quantization
- v1.7.0 (May 2025): Flash Attention
- v2.0.0 (Q3 2025): Multi-GPU support
Track progress on the v1.3+ Feature Roadmap project board, and upvote the issues you'd like to see prioritized!
## Contributing
Contributions welcome! See CONTRIBUTING.md for development standards and testing requirements.
## License

Licensed under Apache-2.0 (see LICENSE). Apache-2.0 includes an explicit patent grant, which matters for production ML deployments.
## Acknowledgments
- Built on the excellent Candle framework by Hugging Face
- Inspired by MLX and llama.cpp
- LoRA implementation based on the LoRA paper (Hu et al., 2021)
## Known Advisories

Two transitive dependencies, `number_prefix` and `paste`, are flagged as unmaintained (no known security impact). Both come from trusted upstreams (Candle, Hugging Face); see `deny.toml` for details.
## Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: ARCHITECTURE.md | CONTRIBUTING.md
Maintained by: @GarthDB