6 releases (stable)
| 2.0.4 | Feb 21, 2026 |
|---|---|
| 2.0.2 | Feb 8, 2026 |
| 2.0.1 | Jan 21, 2026 |
| 0.2.3 | Dec 12, 2025 |
#636 in Algorithms
1,584 downloads per month
Used in prime-radiant
5.5MB
111K
SLoC
RuvLLM v2.3 - High-Performance LLM Inference for Rust
RuvLLM is a production-ready Rust LLM inference engine optimized for Apple Silicon (M1-M4), featuring real-time fine-tuning, NEON SIMD acceleration, Apple Neural Engine integration, and the SONA self-optimizing neural architecture.
What's New in v2.3
Major Features
| Feature | Description | Benefit |
|---|---|---|
| RuvLTRA-Medium 3B | Purpose-built 3B model for Claude Flow | 42 layers, 256K context, speculative decode |
| HuggingFace Hub | Full Hub integration (download/upload) | Easy model sharing & distribution |
| Task-Specific LoRA | 5 pre-trained adapters for agent types | Optimized for coder/researcher/security/architect/reviewer |
| Adapter Merging | TIES, DARE, SLERP, Task Arithmetic | Combine adapters for multi-task models |
| Hot-Swap Adapters | Zero-downtime adapter switching | Runtime task specialization |
| Claude Dataset | 2,700+ Claude-style training examples | Optimized for Claude Flow integration |
| HNSW Routing | 150x faster semantic pattern matching | <25µs pattern retrieval |
| Evaluation Harness | Real model evaluation with SWE-Bench | 5 ablation modes, quality metrics |
| HNSW Auto-Dimension | Automatic embedding dimension detection | No manual config needed |
| mistral-rs Backend | Production-scale serving with PagedAttention | 5-10x concurrent users, X-LoRA, ISQ |
Previous v2.0-2.2 Features
| Feature | Description | Benefit |
|---|---|---|
| Apple Neural Engine | Core ML backend with ANE routing | 38 TOPS, 3-4x power efficiency |
| Hybrid GPU+ANE Pipeline | Intelligent operation routing | Best of both accelerators |
| Multi-threaded GEMM | Rayon parallelization | 4-12x speedup on M4 Pro |
| Flash Attention 2 | Auto block sizing, online softmax | O(N) memory, +10% throughput |
| Quantized Inference | INT8/INT4/Q4_K/Q8_K kernels | 4-8x memory reduction |
| Metal GPU Shaders | simdgroup_matrix operations | 3x speedup on Apple Silicon |
| GGUF Support | Memory-mapped model loading | Fast loading, reduced RAM |
| Continuous Batching | Dynamic batch scheduling | 2-3x throughput improvement |
| Speculative Decoding | Draft model acceleration | 2-3x faster generation |
| Gemma-2 & Phi-3 | New model architectures | Extended model support |
Features
Multiple Backends
- Candle Backend: HuggingFace's Candle framework with Metal/CUDA GPU acceleration
- Core ML Backend: Apple Neural Engine for maximum efficiency on Apple Silicon
- Hybrid Pipeline: Automatic routing between GPU and ANE based on operation type
Optimized Kernels
- NEON SIMD: ARM64-optimized kernels with 4x loop unrolling and FMA instructions
- Flash Attention 2: Memory-efficient attention with O(N) complexity and online softmax
- Paged Attention: Efficient KV cache management for long-context inference
- ANE Operations: GELU, SiLU, softmax, layer norm optimized for Neural Engine
Real-Time Learning (SONA)
- MicroLoRA: Per-request fine-tuning with rank 1-2 adapters (<1ms latency)
- EWC++: Elastic Weight Consolidation to prevent catastrophic forgetting
- Three-Tier Learning: Instant (<1ms), Background (~100ms), Deep (minutes)
Memory Efficiency
- Two-Tier KV Cache: FP16 tail + Q4/Q8 quantized store
- Grouped-Query Attention (GQA): 4-8x KV memory reduction
- Memory Pool: Arena allocator for zero-allocation inference
- GGUF Memory Mapping: Efficient large model loading
Quick Start
use ruvllm::prelude::*;
// Initialize backend with Metal GPU + ANE hybrid
let mut backend = CandleBackend::with_device(DeviceType::Metal)?;
// Load a GGUF model
backend.load_gguf("models/qwen2.5-7b-q4_k.gguf", ModelConfig::default())?;
// Or load from HuggingFace
backend.load_model("Qwen/Qwen2.5-7B-Instruct", ModelConfig {
quantization: Quantization::Q4K,
use_flash_attention: true,
..Default::default()
})?;
// Generate text
let response = backend.generate("Explain quantum computing in simple terms.",
GenerateParams {
max_tokens: 256,
temperature: 0.7,
top_p: 0.9,
..Default::default()
}
)?;
println!("{}", response);
// Check SONA learning stats
if let Some(stats) = backend.sona_stats() {
println!("Patterns learned: {}", stats.patterns_learned);
println!("Quality improvement: {:.1}%", stats.quality_improvement * 100.0);
}
Installation
Add to your Cargo.toml:
[dependencies]
# Recommended for Apple Silicon Mac
ruvllm = { version = "2.0", features = ["inference-metal", "coreml", "parallel"] }
# For NVIDIA GPUs
ruvllm = { version = "2.0", features = ["inference-cuda", "parallel"] }
# Minimal (CPU only)
ruvllm = { version = "2.0" }
Feature Flags
| Feature | Description |
|---|---|
candle |
Enable Candle backend (HuggingFace) |
metal |
Apple Silicon GPU acceleration via Candle |
metal-compute |
Native Metal compute shaders (M4 Pro optimized) |
cuda |
NVIDIA GPU acceleration |
coreml |
Apple Neural Engine via Core ML |
hybrid-ane |
GPU+ANE hybrid pipeline (recommended for Mac) |
inference-metal |
Full Metal inference stack |
inference-metal-native |
Metal + native shaders (best M4 Pro perf) |
inference-cuda |
Full CUDA inference stack |
parallel |
Multi-threaded GEMM/GEMV with Rayon |
accelerate |
Apple Accelerate BLAS (~2x GEMV speedup) |
gguf-mmap |
Memory-mapped GGUF loading |
async-runtime |
Tokio async support |
wasm |
WebAssembly support |
mistral-rs |
mistral-rs backend (PagedAttention, X-LoRA, ISQ) |
mistral-rs-metal |
mistral-rs with Apple Silicon acceleration |
mistral-rs-cuda |
mistral-rs with NVIDIA CUDA acceleration |
Architecture
+----------------------------------+
| Application |
+----------------------------------+
|
+----------------------------------+
| RuvLLM Backend |
| +----------------------------+ |
| | Hybrid Pipeline Router | |
| | ┌─────────┐ ┌──────────┐ | |
| | │ Metal │ │ ANE │ | |
| | │ GPU │ │ Core ML │ | |
| | └────┬────┘ └────┬─────┘ | |
| | │ ↕ │ | |
| | Attention MLP/FFN | |
| | RoPE Activations | |
| | Softmax LayerNorm | |
| +----------------------------+ |
| | |
| +----------------------------+ |
| | SONA Learning | |
| | - Instant (<1ms) | |
| | - Background (~100ms) | |
| | - Deep (minutes) | |
| +----------------------------+ |
| | |
| +----------------------------+ |
| | NEON/SIMD Kernels | |
| | - Flash Attention 2 | |
| | - Paged KV Cache | |
| | - Quantized MatMul | |
| +----------------------------+ |
+----------------------------------+
Supported Models
| Model Family | Sizes | Quantization | Backend |
|---|---|---|---|
| RuvLTRA-Small | 0.5B | Q4K, Q5K, Q8, FP16 | Candle/Metal/ANE |
| RuvLTRA-Medium | 3B | Q4K, Q5K, Q8, FP16 | Candle/Metal |
| Qwen 2.5 | 0.5B-72B | Q4K, Q8, FP16 | Candle/Metal |
| Llama 3.x | 8B-70B | Q4K, Q8, FP16 | Candle/Metal |
| Mistral | 7B-22B | Q4K, Q8, FP16 | Candle/Metal |
| Phi-3 | 3.8B-14B | Q4K, Q8, FP16 | Candle/Metal |
| Gemma-2 | 2B-27B | Q4K, Q8, FP16 | Candle/Metal |
RuvLTRA Models (Claude Flow Optimized)
| Model | Parameters | Hidden | Layers | Context | Features |
|---|---|---|---|---|---|
| RuvLTRA-Small | 494M | 896 | 24 | 32K | GQA 7:1, SONA hooks |
| RuvLTRA-Medium | 3.0B | 2560 | 42 | 256K | Flash Attention 2, Speculative Decode |
Performance (M4 Pro 14-core)
Inference Benchmarks
| Model | Quant | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|---|
| Qwen2.5-7B | Q4K | 2,800 | 95 | 4.2 GB |
| Qwen2.5-7B | Q8 | 2,100 | 72 | 7.8 GB |
| Llama3-8B | Q4K | 2,600 | 88 | 4.8 GB |
| Mistral-7B | Q4K | 2,500 | 85 | 4.1 GB |
| Phi-3-3.8B | Q4K | 3,500 | 135 | 2.3 GB |
| Gemma2-9B | Q4K | 2,200 | 75 | 5.2 GB |
ANE vs GPU Performance (M4 Pro)
| Dimension | ANE | GPU | Winner |
|---|---|---|---|
| < 512 | +30-50% | - | ANE |
| 512-1024 | +10-30% | - | ANE |
| 1024-1536 | ~Similar | ~Similar | Either |
| 1536-2048 | - | +10-20% | GPU |
| > 2048 | - | +30-50% | GPU |
Kernel Benchmarks
| Kernel | Single-thread | Multi-thread (10-core) |
|---|---|---|
| GEMM 4096x4096 | 1.2 GFLOPS | 12.7 GFLOPS |
| GEMV 4096x4096 | 0.8 GFLOPS | 6.4 GFLOPS |
| Flash Attention (seq=2048) | 850μs | 320μs |
| RMS Norm (4096) | 2.1μs | 0.8μs |
| RoPE (4096, 128) | 4.3μs | 1.6μs |
Apple Neural Engine (ANE) Integration
RuvLLM v2.0 includes full ANE support via Core ML:
use ruvllm::backends::coreml::{CoreMLBackend, AneStrategy};
// Create ANE-optimized backend
let backend = CoreMLBackend::new(AneStrategy::PreferAneForMlp)?;
// Or use hybrid pipeline for best performance
use ruvllm::backends::HybridPipeline;
let pipeline = HybridPipeline::new(HybridConfig {
ane_strategy: AneStrategy::Adaptive,
gpu_for_attention: true, // Attention on GPU
ane_for_mlp: true, // MLP/FFN on ANE
..Default::default()
})?;
ANE Routing Recommendations
| Operation | Recommended | Reason |
|---|---|---|
| Attention | GPU | Better for variable sequence lengths |
| Flash Attention | GPU | GPU memory bandwidth advantage |
| MLP/FFN | ANE | Optimal for fixed-size matmuls |
| GELU/SiLU | ANE | Dedicated activation units |
| LayerNorm/RMSNorm | ANE | Good for small dimensions |
| Embedding | GPU | Sparse operations |
MicroLoRA Real-Time Adaptation
RuvLLM supports per-request fine-tuning using MicroLoRA:
use ruvllm::lora::{MicroLoRA, MicroLoraConfig, AdaptFeedback};
// Create MicroLoRA adapter
let config = MicroLoraConfig::for_hidden_dim(4096);
let lora = MicroLoRA::new(config);
// Adapt on user feedback
let feedback = AdaptFeedback::from_quality(0.9);
lora.adapt(&input_embedding, feedback)?;
// Apply learned updates
lora.apply_updates(0.01); // learning rate
// Get adaptation stats
let stats = lora.stats();
println!("Samples: {}, Avg quality: {:.2}", stats.samples, stats.avg_quality);
SONA Three-Tier Learning
Continuous improvement with three learning loops:
use ruvllm::optimization::{SonaLlm, SonaLlmConfig, ConsolidationStrategy};
let config = SonaLlmConfig {
instant_lr: 0.01,
background_interval_ms: 100,
deep_trigger_threshold: 100.0,
consolidation_strategy: ConsolidationStrategy::EwcMerge,
..Default::default()
};
let sona = SonaLlm::new(config);
// 1. Instant Loop (<1ms): Per-request MicroLoRA
let result = sona.instant_adapt("user query", "model response", 0.85);
println!("Instant adapt: {}μs", result.latency_us);
// 2. Background Loop (~100ms): Pattern consolidation
if let result = sona.maybe_background() {
if result.applied {
println!("Consolidated {} samples", result.samples_used);
}
}
// 3. Deep Loop (minutes): Full optimization
if sona.should_trigger_deep() {
let result = sona.deep_optimize(OptimizationTrigger::QualityThreshold(100.0));
println!("Deep optimization: {:.1}s", result.latency_us as f64 / 1_000_000.0);
}
// Check learning stats
let stats = sona.stats();
println!("Total samples: {}", stats.total_samples);
println!("Accumulated quality: {:.2}", stats.accumulated_quality);
Two-Tier KV Cache
Memory-efficient caching with automatic tiering:
use ruvllm::kv_cache::{TwoTierKvCache, KvCacheConfig};
let config = KvCacheConfig {
tail_length: 256, // Recent tokens in FP16
tail_precision: Precision::FP16,
store_precision: Precision::Q4, // Older tokens in Q4
max_tokens: 8192,
num_layers: 32,
num_kv_heads: 8,
head_dim: 128,
};
let cache = TwoTierKvCache::new(config);
cache.append(&keys, &values)?;
// Automatic migration from tail to quantized store
let stats = cache.stats();
println!("Tail: {} tokens, Store: {} tokens", stats.tail_tokens, stats.store_tokens);
println!("Compression ratio: {:.2}x", stats.compression_ratio);
println!("Memory saved: {:.1} MB", stats.memory_saved_mb);
Continuous Batching
High-throughput serving with dynamic batching:
use ruvllm::serving::{ContinuousBatchScheduler, SchedulerConfig, InferenceRequest};
let scheduler = ContinuousBatchScheduler::new(SchedulerConfig {
max_batch_size: 32,
max_batch_tokens: 4096,
max_waiting_time_ms: 50,
preemption_mode: PreemptionMode::Recompute,
..Default::default()
});
// Add requests
scheduler.add_request(InferenceRequest::new(tokens, params))?;
// Process batch
while let Some(batch) = scheduler.get_next_batch() {
let outputs = backend.forward_batch(&batch)?;
scheduler.process_outputs(outputs)?;
}
// Get throughput stats
let stats = scheduler.stats();
println!("Throughput: {:.1} tok/s", stats.tokens_per_second);
println!("Batch utilization: {:.1}%", stats.avg_batch_utilization * 100.0);
Speculative Decoding
Accelerate generation with draft models:
use ruvllm::speculative::{SpeculativeDecoder, SpeculativeConfig};
let config = SpeculativeConfig {
draft_tokens: 4, // Tokens to draft per step
acceptance_threshold: 0.8, // Min probability for acceptance
..Default::default()
};
let decoder = SpeculativeDecoder::new(
target_model,
draft_model,
config,
)?;
// Generate with speculation
let output = decoder.generate(prompt, GenerateParams {
max_tokens: 256,
..Default::default()
})?;
println!("Acceptance rate: {:.1}%", output.stats.acceptance_rate * 100.0);
println!("Speedup: {:.2}x", output.stats.speedup);
GGUF Model Loading
Efficient loading with memory mapping:
use ruvllm::gguf::{GgufLoader, GgufConfig};
let loader = GgufLoader::new(GgufConfig {
mmap_enabled: true, // Memory-map for fast loading
validate_checksum: true, // Verify file integrity
..Default::default()
});
// Load model metadata
let metadata = loader.read_metadata("model.gguf")?;
println!("Model: {}", metadata.name);
println!("Parameters: {}B", metadata.parameters / 1_000_000_000);
println!("Quantization: {:?}", metadata.quantization);
// Load into backend
let tensors = loader.load_tensors("model.gguf")?;
backend.load_tensors(tensors)?;
mistral-rs Backend (Production Serving)
RuvLLM v2.3 includes integration with mistral-rs for production-scale LLM serving with advanced memory management.
Note: The mistral-rs crate is not yet published to crates.io. The integration is designed and ready—enable it when mistral-rs becomes available.
Key Features
| Feature | Description | Benefit |
|---|---|---|
| PagedAttention | vLLM-style KV cache management | 5-10x concurrent users, 85-95% memory utilization |
| X-LoRA | Per-token adapter routing | <1ms routing overhead, multi-task inference |
| ISQ | In-Situ Quantization (AWQ, GPTQ, RTN) | Runtime quantization without re-export |
Usage Example
use ruvllm::backends::mistral::{
MistralBackend, MistralBackendConfig,
PagedAttentionConfig, XLoraConfig, IsqConfig
};
// Configure mistral-rs backend for production serving
let config = MistralBackendConfig::builder()
// PagedAttention: Enable 50+ concurrent users
.paged_attention(PagedAttentionConfig {
block_size: 16,
max_blocks: 4096,
gpu_memory_fraction: 0.9,
enable_prefix_caching: true,
})
// X-LoRA: Per-token adapter routing
.xlora(XLoraConfig {
adapters: vec![
"adapters/coder".into(),
"adapters/researcher".into(),
],
top_k: 2,
temperature: 0.3,
})
// ISQ: Runtime quantization
.isq(IsqConfig {
bits: 4,
method: IsqMethod::AWQ,
calibration_samples: 128,
})
.build();
let mut backend = MistralBackend::new(config)?;
backend.load_model("mistralai/Mistral-7B-Instruct-v0.2", ModelConfig::default())?;
// Generate with PagedAttention + X-LoRA
let response = backend.generate("Write secure authentication code", GenerateParams {
max_tokens: 512,
temperature: 0.7,
..Default::default()
})?;
When to Use mistral-rs vs Candle
| Scenario | Recommended Backend | Reason |
|---|---|---|
| Single user / Edge | Candle | Simpler, smaller binary |
| 10-100 concurrent users | mistral-rs | PagedAttention memory efficiency |
| Multi-task models | mistral-rs | X-LoRA per-token routing |
| Runtime quantization | mistral-rs | ISQ without model re-export |
| WASM / Browser | Candle | mistral-rs doesn't support WASM |
Feature Flags
# Enable mistral-rs (when available on crates.io)
ruvllm = { version = "2.3", features = ["mistral-rs"] }
# With Metal acceleration (Apple Silicon)
ruvllm = { version = "2.3", features = ["mistral-rs-metal"] }
# With CUDA acceleration (NVIDIA)
ruvllm = { version = "2.3", features = ["mistral-rs-cuda"] }
See ADR-008: mistral-rs Integration for detailed architecture decisions.
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
RUVLLM_CACHE_DIR |
Model cache directory | ~/.cache/ruvllm |
RUVLLM_LOG_LEVEL |
Logging level | info |
RUVLLM_METAL_DEVICE |
Metal device index | 0 |
RUVLLM_ANE_ENABLED |
Enable ANE routing | true |
RUVLLM_SONA_ENABLED |
Enable SONA learning | true |
Model Configuration
let config = ModelConfig {
max_context: 8192,
use_flash_attention: true,
quantization: Quantization::Q4K,
kv_cache_config: KvCacheConfig::default(),
rope_scaling: Some(RopeScaling::Linear { factor: 2.0 }),
sliding_window: Some(4096),
..Default::default()
};
Benchmarks
Run benchmarks with:
# Attention benchmarks
cargo bench --bench attention_bench --features inference-metal
# ANE benchmarks (Mac only)
cargo bench --bench ane_bench --features coreml
# LoRA benchmarks
cargo bench --bench lora_bench
# End-to-end inference
cargo bench --bench e2e_bench --features inference-metal
# Metal shader benchmarks
cargo bench --bench metal_bench --features metal-compute
# Serving benchmarks
cargo bench --bench serving_bench --features inference-metal
HuggingFace Hub Integration (v2.3)
Download and upload models to HuggingFace Hub:
use ruvllm::hub::{ModelDownloader, ModelUploader, RuvLtraRegistry, DownloadConfig};
// Download from Hub
let downloader = ModelDownloader::new(DownloadConfig::default());
let model_path = downloader.download(
"ruvector/ruvltra-small-q4km",
Some("./models"),
)?;
// Or use the registry for RuvLTRA models
let registry = RuvLtraRegistry::new();
let model = registry.get("ruvltra-medium", "Q4_K_M")?;
// Upload to Hub (requires HF_TOKEN)
let uploader = ModelUploader::new("hf_your_token");
let url = uploader.upload(
"./my-model.gguf",
"username/my-ruvltra-model",
Some(metadata),
)?;
println!("Uploaded to: {}", url);
Task-Specific LoRA Adapters (v2.3)
Pre-trained adapters optimized for Claude Flow agent types:
use ruvllm::lora::{RuvLtraAdapters, AdapterTrainer, AdapterMerger, HotSwapManager};
// Create adapter for specific task
let adapters = RuvLtraAdapters::new();
let coder = adapters.create_lora("coder", 768)?; // Rank 16, code generation
let security = adapters.create_lora("security", 768)?; // Rank 16, vulnerability detection
// Available adapters:
// - coder: Rank 16, Alpha 32.0, targets attention (Q,K,V,O)
// - researcher: Rank 8, Alpha 16.0, targets Q,K,V
// - security: Rank 16, Alpha 32.0, targets attention + MLP
// - architect: Rank 12, Alpha 24.0, targets Q,V + Gate,Up
// - reviewer: Rank 8, Alpha 16.0, targets Q,V
// Merge adapters for multi-task models
let merger = AdapterMerger::new(MergeConfig::weighted(weights));
let multi_task = merger.merge(&[coder, security], &output_config, 768)?;
// Hot-swap adapters at runtime
let mut manager = HotSwapManager::new();
manager.set_active(coder);
manager.prepare_standby(security);
manager.swap()?; // Zero-downtime switch
Adapter Merging Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Average | Equal-weight averaging | Simple multi-task |
| WeightedSum | User-defined weights | Task importance weighting |
| SLERP | Spherical interpolation | Smooth transitions |
| TIES | Trim, Elect, Merge | Robust multi-adapter |
| DARE | Drop And REscale | Sparse merging |
| TaskArithmetic | Add/subtract vectors | Task composition |
Evaluation Harness (v2.3)
RuvLLM includes a comprehensive evaluation harness for benchmarking model quality:
use ruvllm::evaluation::{RealEvaluationHarness, EvalConfig, AblationMode};
// Create harness with GGUF model
let harness = RealEvaluationHarness::with_gguf(
"./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
EvalConfig::default(),
)?;
// Run single evaluation
let result = harness.evaluate(
"Fix the null pointer exception in this code",
"def process(data):\n return data.split()",
AblationMode::Full,
)?;
println!("Success: {}, Quality: {:.2}", result.success, result.quality_score);
// Run full ablation study (5 modes)
let report = harness.run_ablation_study(&tasks)?;
for (mode, metrics) in &report.mode_metrics {
println!("{:?}: {:.1}% success, {:.2} quality",
mode, metrics.success_rate * 100.0, metrics.avg_quality);
}
Ablation Modes
| Mode | Description | Use Case |
|---|---|---|
| Baseline | No enhancements | Control baseline |
| RetrievalOnly | HNSW pattern retrieval | Measure retrieval impact |
| AdaptersOnly | LoRA adapters | Measure adaptation impact |
| RetrievalPlusAdapters | HNSW + LoRA | Combined without SONA |
| Full | All systems (SONA + HNSW + LoRA) | Production mode |
SWE-Bench Task Loader
use ruvllm::evaluation::swe_bench::SweBenchLoader;
// Load SWE-Bench tasks
let loader = SweBenchLoader::new();
let tasks = loader.load_subset("lite", 50)?; // 50 tasks from lite subset
for task in &tasks {
println!("Instance: {}", task.instance_id);
println!("Problem: {}", task.problem_statement);
}
CLI Evaluation
# Run evaluation with default settings
cargo run --example run_eval --features async-runtime -- \
--model ./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# Run SWE-Bench subset
cargo run --example run_eval --features async-runtime -- \
--model ./models/model.gguf \
--swe-bench-path ./data/swe-bench \
--subset lite \
--max-tasks 100
# Output report
cargo run --example run_eval --features async-runtime -- \
--model ./models/model.gguf \
--output ./reports/eval-report.json
HNSW Auto-Dimension Detection
The evaluation harness automatically detects model embedding dimensions:
// HNSW router automatically uses model's hidden_size
// TinyLlama 1.1B → 2048 dimensions
// Qwen2 0.5B → 896 dimensions
// RuvLTRA-Small → 896 dimensions
// RuvLTRA-Medium → 2560 dimensions
let harness = RealEvaluationHarness::with_config(
EvalConfig::default(),
RealInferenceConfig {
enable_hnsw: true,
hnsw_config: None, // Auto-detect from model
..Default::default()
},
)?;
Examples
See the /examples directory for:
download_test_model.rs- Download and validate modelsbenchmark_model.rs- Full inference benchmarkingrun_eval.rs- Run evaluation harness with SWE-Bench- Basic inference
- Streaming generation
- MicroLoRA adaptation
- Multi-turn chat
- Speculative decoding
- Continuous batching
- ANE hybrid inference
Error Handling
use ruvllm::error::{Result, RuvLLMError};
match backend.generate(prompt, params) {
Ok(response) => println!("{}", response),
Err(RuvLLMError::Model(e)) => eprintln!("Model error: {}", e),
Err(RuvLLMError::OutOfMemory(e)) => eprintln!("OOM: {}", e),
Err(RuvLLMError::Generation(e)) => eprintln!("Generation failed: {}", e),
Err(RuvLLMError::Ane(e)) => eprintln!("ANE error: {}", e),
Err(RuvLLMError::Gguf(e)) => eprintln!("GGUF loading error: {}", e),
Err(e) => eprintln!("Error: {}", e),
}
npm Package
RuvLLM is also available as an npm package with native bindings:
npm install @ruvector/ruvllm
import { RuvLLM } from '@ruvector/ruvllm';
const llm = new RuvLLM();
const response = llm.query('Explain quantum computing');
console.log(response.text);
See @ruvector/ruvllm on npm for full documentation.
License
Apache-2.0 / MIT dual license.
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
Links
Dependencies
~20–64MB
~1M SLoC