Dynamo Model Express
Model Express is a Rust-based model cache management service designed to be deployed as a sidecar alongside existing inference solutions such as NVIDIA Dynamo. Model Express improves overall inference performance by reducing the latency of artifact downloads and writes.
Project Overview
Although Model Express is a component of the Dynamo inference stack, it can be deployed standalone, independent of Dynamo, to accelerate other inference solutions such as vLLM and SGLang.
The current version of Model Express acts as a cache for HuggingFace models, providing fast access to pre-trained models and reducing the need for repeated downloads across multiple servers. Additionally, the service improves fault tolerance for inference solutions by providing managed model persistence, ensuring that models remain available even in the event of node failures or restarts.
Model Express also shines in multi-node / multi-worker environments, where inference solutions may spawn multiple replicas that require model artifacts to be shared efficiently.
Future versions will expand support to additional model providers (AWS, Azure, NFS, etc.) and include features like model versioning, advanced caching strategies, advanced networking using NIXL, checkpoint storage, as well as a peer-to-peer model sharing system.
Architecture
The project is organized as a Rust workspace with the following components:
- modelexpress_server: The main gRPC server that provides model services
- modelexpress_client: Client library for interacting with the server
- modelexpress_common: Shared code and constants between client and server
The current diagram represents a high-level overview of the Model Express architecture. It will evolve with time as we add new features and components.
architecture-beta
group MXS(cloud)[Model Express]
service db(database)[Database] in MXS
service disk(disk)[Persistent Volume Storage] in MXS
service server(server)[Server] in MXS
db:L -- R:server
disk:T -- B:server
group MXC(cloud)[Inference Server]
service client(server)[Client] in MXC
disk:T -- B:client
The client is either a library embedded in the inference server of your choice, or a CLI tool which can be used beforehand to hydrate the model cache.
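As a rough sketch of the embedded-library path, the example below shows how an inference server might ask Model Express to hydrate a model before loading it. The names used here (ModelExpressClient::connect, ensure_model) are hypothetical placeholders rather than the crate's actual API; consult the client library documentation for the real interface.

// Hypothetical client usage, for illustration only.
use modelexpress_client::ModelExpressClient; // placeholder type name

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the Model Express sidecar at the default client endpoint.
    let mut client = ModelExpressClient::connect("http://localhost:8001").await?;

    // Ask the cache to make the model available locally; if it is already
    // cached, this should return quickly without re-downloading.
    let local_path = client.ensure_model("google-t5/t5-small").await?;

    // Hand the local path to the inference engine of your choice.
    println!("model available at {}", local_path.display());
    Ok(())
}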
CLI Tool
The client library includes a command-line interface meant to facilitate interaction with the Model Express server and to act as a HuggingFace CLI replacement. In the future, it will also abstract other model providers, making it a one-stop shop for interacting with various model APIs.
See docs/CLI.md for detailed CLI documentation.
Prerequisites
- Rust: Latest stable version (recommended: 1.90)
- Cargo: Rust's package manager (included with Rust)
- protoc: The Protocol Buffers compiler is expected to be installed and usable
- Docker (optional): For containerized deployment
Quick Start
1. Clone the Repository
git clone <repository-url>
cd modelexpress
2. Build the Project
cargo build
3. Run the Server
cargo run --bin modelexpress-server
The server will start on 0.0.0.0:8001 by default.
Running Options
Option 1: Local Development
# Start the gRPC server
cargo run --bin modelexpress-server
# In another terminal, run tests
cargo test
# Run integration tests
./run_integration_tests.sh
Option 2: Docker Deployment
# Build and run with docker-compose
docker-compose up --build
# Or build and run manually
docker build -t model-express .
docker run -p 8000:8000 model-express
Option 3: Kubernetes Deployment
Prerequisites:
- Kubernetes Cluster: With GPU support and kubectl configured to access your cluster
- HuggingFace Token: Required for accessing HuggingFace models within your cluster via a k8s secret as shown here:
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
- Docker Registry: Container registry accessible from your cluster (Docker Hub, private registry, or local registry)
- Model Express Image: Built and pushed to your registry by building from the root directory of the repository:
# Build the Model Express image
docker build -t model-express:latest .
# Tag for your registry
docker tag model-express:latest your-registry/model-express:latest
# Push to your registry
docker push your-registry/model-express:latest
- Update Image Reference: Update the image reference in your deployment files to match your registry:
# In k8s-deployment.yaml or agg.yaml, update:
image: your-registry/model-express:latest
Now, to deploy Model Express in your cluster, run:
kubectl apply -f k8s-deployment.yaml
Please follow the guide here to learn more about launching Model Express with Dynamo on Kubernetes.
Configuration
ModelExpress uses a layered configuration system that supports multiple sources, in order of precedence (see the sketch after this list):
- Command line arguments (highest priority)
- Environment variables
- Configuration files (YAML)
- Default values (lowest priority)
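A minimal sketch of what this precedence means in practice, using a single setting (the server port) resolved from all four sources; this is illustrative only, not the server's actual configuration code:

// Illustrative only: resolve one setting by precedence
// (CLI argument > environment variable > config file > default).
fn resolve_port(cli: Option<u16>, env: Option<u16>, file: Option<u16>) -> u16 {
    cli.or(env).or(file).unwrap_or(8001)
}

fn main() {
    // A CLI argument wins over everything else when present.
    assert_eq!(resolve_port(Some(8080), Some(9000), Some(9001)), 8080);
    // With no CLI or environment value, the config file is used.
    assert_eq!(resolve_port(None, None, Some(9001)), 9001);
    // With nothing set, the default (8001) applies.
    assert_eq!(resolve_port(None, None, None), 8001);
}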
Configuration File
Create a configuration file (supports YAML):
# Generate a sample configuration file
cargo run --bin config_gen -- --output model-express.yaml
Start the server with a configuration file:
cargo run --bin modelexpress-server -- --config model-express.yaml
Example Configuration Files
Basic Configuration (model-express.yaml):
server:
host: 0.0.0.0
port: 8001
database:
path: ./models.db
cache:
eviction:
enabled: true
policy:
type: lru
unused_threshold: 60
max_models: null
min_free_space_bytes: null
check_interval: 360
directory: ./cache
max_size_bytes: null
logging:
level: Info
format: Pretty
file: null
structured: false
Running Commands:
cargo run --bin modelexpress-server -- --config model-express.yaml
Environment Variables
You can use structured environment variables with the MODEL_EXPRESS_ prefix:
# Server settings
export MODEL_EXPRESS_SERVER_HOST="127.0.0.1"
export MODEL_EXPRESS_SERVER_PORT=8080
# Database settings
export MODEL_EXPRESS_DATABASE_PATH="/path/to/models.db"
# Cache settings
export MODEL_EXPRESS_CACHE_DIRECTORY="/path/to/cache"
export MODEL_EXPRESS_CACHE_EVICTION_ENABLED=true
# Logging settings
export MODEL_EXPRESS_LOGGING_LEVEL=debug
export MODEL_EXPRESS_LOGGING_FORMAT=json
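For illustration only (this is not the server's actual configuration loader), the MODEL_EXPRESS_ prefix convention can be surfaced with a few lines of Rust:

use std::env;

fn main() {
    // List every structured setting carrying the MODEL_EXPRESS_ prefix, turning
    // e.g. MODEL_EXPRESS_SERVER_PORT into the dotted key "server.port".
    // Note: this naive underscore-to-dot mapping is illustrative; real keys such
    // as max_size_bytes contain underscores of their own.
    for (key, value) in env::vars() {
        if let Some(rest) = key.strip_prefix("MODEL_EXPRESS_") {
            let dotted = rest.to_lowercase().replace('_', ".");
            println!("{dotted} = {value}");
        }
    }
}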
Command Line Arguments
# Basic usage
cargo run --bin modelexpress-server -- --port 8080 --log-level debug
# With configuration file
cargo run --bin modelexpress-server -- --config model-express.yaml --port 8080
# Validate configuration
cargo run --bin modelexpress-server -- --config model-express.yaml --validate-config
Configuration Options
Server Settings
- host: Server host address (default: "0.0.0.0")
- port: Server port (default: 8001)
Database Settings
- path: SQLite database file path (default: "./models.db"). Note that in a multi-node Kubernetes deployment, the database should be shared among all nodes using a persistent volume.
Cache Settings
- directory: Cache directory path (default: "./cache")
- max_size_bytes: Maximum cache size in bytes (default: null/unlimited)
- eviction.enabled: Enable cache eviction (default: true)
- eviction.check_interval_seconds: Eviction check interval in seconds (default: 3600)
- eviction.policy.unused_threshold_seconds: Unused threshold in seconds (default: 604800, i.e. 7 days)
- eviction.policy.max_models: Maximum number of models to keep (default: null/unlimited)
- eviction.policy.min_free_space_bytes: Minimum free space in bytes (default: null/unlimited)
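To make the eviction options above concrete, here is a small, purely illustrative sketch of how an LRU-style policy could combine unused_threshold_seconds and max_models to pick eviction candidates; it is not the server's actual implementation.

use std::time::{Duration, SystemTime};

// Illustrative only: a cache entry as an LRU eviction pass might see it.
struct CachedModel {
    name: String,
    last_used: SystemTime,
}

// Pick eviction candidates under a simple LRU-style policy: evict anything idle
// longer than `unused_threshold`, then keep evicting the least recently used
// entries until at most `max_models` remain.
fn eviction_candidates(
    mut models: Vec<CachedModel>,
    unused_threshold: Duration,
    max_models: Option<usize>,
) -> Vec<String> {
    let now = SystemTime::now();
    models.sort_by_key(|m| m.last_used); // least recently used first
    let mut remaining = models.len();
    let mut evict = Vec::new();
    for m in &models {
        let idle = now.duration_since(m.last_used).unwrap_or_default();
        let over_limit = max_models.map_or(false, |max| remaining > max);
        if idle > unused_threshold || over_limit {
            evict.push(m.name.clone());
            remaining -= 1;
        } else {
            break; // everything after this entry is newer and within limits
        }
    }
    evict
}

fn main() {
    let now = SystemTime::now();
    let models = vec![
        CachedModel { name: "stale-model".into(), last_used: now - Duration::from_secs(700_000) },
        CachedModel { name: "fresh-model".into(), last_used: now },
    ];
    // With the default 7-day threshold (604800 s), only "stale-model" is evicted.
    println!("{:?}", eviction_candidates(models, Duration::from_secs(604_800), None));
}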
Logging Settings
- level: Log level - trace, debug, info, warn, error (default: "info")
- format: Log format - json, pretty, compact (default: "pretty")
- file: Log file path (default: null/stdout)
- structured: Enable structured logging (default: false)
Default Settings
- gRPC Port: 8001
- Server Address: 0.0.0.0:8001 (listens on all interfaces)
- Client Endpoint: http://localhost:8001
API Services
The server provides the following gRPC services:
- HealthService: Health check endpoints
- ApiService: General API endpoints
- ModelService: Model management and serving
Testing
Run All Tests
cargo test
Run Specific Tests
# Integration tests
cargo test --test integration_tests
# Client tests with specific model
cargo run --bin test_client -- --test-model "google-t5/t5-small"
# Fallback tests
cargo run --bin fallback_test
Test Coverage
# Run tests with coverage (requires cargo-tarpaulin)
cargo tarpaulin --out Html
Development
Project Structure
ModelExpress/
├── modelexpress_server/ # Main gRPC server
├── modelexpress_client/ # Client library
├── modelexpress_common/ # Shared code
├── examples/ # Example deployment with dynamo
├── helm/ # Helm chart for Kubernetes deployment
├── docs/ # Documentation and guides
├── workspace-tests/ # Integration tests
├── docker-compose.yml # Docker configuration
├── Dockerfile # Docker build file
├── k8s-deployment.yaml # Kubernetes deployment
└── run_integration_tests.sh # Test runner script
Adding New Features
- Server Features: Add to modelexpress_server/src/
- Client Features: Add to modelexpress_client/src/
- Shared Code: Add to modelexpress_common/src/
- Tests: Add to the appropriate directory under workspace-tests/
Dependencies
Key dependencies include:
- tokio: Async runtime
- tonic: gRPC framework
- axum: Web framework (if needed)
- serde: Serialization
- hf-hub: Hugging Face Hub integration
- rusqlite: SQLite database
Pre-commit Hooks
This repository uses pre-commit hooks to maintain code quality. In order to contribute effectively, please set up the pre-commit hooks:
pip install pre-commit
pre-commit install
Performance
The project includes benchmarking capabilities:
# Run benchmarks
cargo bench
Monitoring and Logging
The server uses structured logging with tracing:
# Set log level
RUST_LOG=debug cargo run --bin modelexpress-server
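For reference, here is a minimal sketch of how a binary typically wires RUST_LOG into tracing; it assumes the tracing-subscriber crate with its env-filter feature, and the server's actual initialization may differ.

use tracing_subscriber::EnvFilter;

fn main() {
    // Honor RUST_LOG (e.g. RUST_LOG=debug) and fall back to "info" when unset.
    tracing_subscriber::fmt()
        .with_env_filter(
            EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
        )
        .init();

    tracing::info!("server starting");
}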
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
Support
For issues and questions:
- Create an issue in the repository
- Check the integration tests for usage examples
- Review the client library documentation
ModelExpress 0.1.0 Release
Includes:
- Model Express released as a CLI tool.
- Model weight caching within Kubernetes clusters using PVC.
- Database tracking of which models are stored on which nodes.
- Basic model download and storage management.
- Documentation for Kubernetes deployment and CLI usage.
Known Issues
- Occasionally the gRPC stream will not close automatically for larger models requested from HuggingFace. It is suggested to call Model Express asynchronously and implement a check on the calling client side (either with the Model Express client or a file check) to verify when a model has completed downloading. Alternatively, a timeout can be used; inference backends like vLLM or SGLang will typically pick up the model if it was downloaded into the cache.
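One way to implement the suggested client-side check is to poll for the model's files with a timeout. The sketch below is illustrative only: the marker path shown is an assumption (the actual layout depends on your cache.directory setting and the HuggingFace cache layout), and a robust check should verify that all expected files are present rather than a single marker.

use std::path::Path;
use std::time::{Duration, Instant};

// Illustrative only: wait until a marker file appears in the local cache,
// or give up after `timeout`. Adjust the path to your cache layout.
fn wait_for_model(marker: &Path, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if marker.exists() {
            return true;
        }
        std::thread::sleep(Duration::from_secs(5));
    }
    false
}

fn main() {
    // Hypothetical marker path, for illustration only.
    let marker = Path::new("./cache/models--google-t5--t5-small/config.json");
    if wait_for_model(marker, Duration::from_secs(1800)) {
        println!("model download appears complete");
    } else {
        println!("timed out waiting for the model; check the server logs");
    }
}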