lmonade-server
OpenAI-compatible HTTP API server for the Lmonade LLM inference engine.
Overview
lmonade-server provides a production-ready HTTP server with endpoints compatible with OpenAI's API, enabling drop-in replacement for existing OpenAI integrations.
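For example, any HTTP client can issue an OpenAI-style request against the local server. The sketch below is illustrative only (not part of the crate) and assumes the `reqwest` (with its `json` feature), `tokio`, and `serde_json` crates, plus the default address from the Configuration section:

```rust
use serde_json::json;

// Illustrative client call against a locally running lmonade-server.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let request = json!({
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "stream": false
    });

    let response: serde_json::Value = reqwest::Client::new()
        .post("http://127.0.0.1:8080/v1/chat/completions")
        .json(&request)
        .send()
        .await?
        .json()
        .await?;

    println!("{response}");
    Ok(())
}
```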
Features
- OpenAI-compatible REST API
- Real-time token streaming via Server-Sent Events (SSE)
- Concurrent request handling
- Health monitoring endpoints
- CORS support for web applications
- Comprehensive error handling
API Endpoints
Core Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check endpoint |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/completions` | POST | Text completion (OpenAI-compatible) |
| `/v1/embeddings` | POST | Generate embeddings (placeholder) |
Request Examples
Chat Completion
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama-1.1B-Chat-v1.0",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
],
"stream": false
}'
Streaming Chat Completion
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama-1.1B-Chat-v1.0",
"messages": [
{"role": "user", "content": "Tell me a story"}
],
"stream": true
}'
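With `"stream": true` the response arrives as Server-Sent Events. As an illustration only (not part of the crate), a Rust client using `reqwest` with its `stream` feature and the `futures` crate could consume the events roughly like this:

```rust
use futures::StreamExt;

// Illustrative SSE consumer; assumes whole "data:" lines arrive per chunk.
// A robust client would buffer partial lines across chunks.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let request = serde_json::json!({
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": true
    });

    let response = reqwest::Client::new()
        .post("http://127.0.0.1:8080/v1/chat/completions")
        .json(&request)
        .send()
        .await?;

    let mut chunks = response.bytes_stream();
    while let Some(chunk) = chunks.next().await {
        // Each SSE event is delivered as a "data: {...}" line.
        for line in String::from_utf8_lossy(&chunk?).lines() {
            if let Some(data) = line.strip_prefix("data: ") {
                println!("{data}");
            }
        }
    }
    Ok(())
}
```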
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `HOST` | Server bind address | `127.0.0.1` |
| `PORT` | Server port | `8080` |
| `MODEL_NAME` | Default model to load | `TinyLlama-1.1B-Chat-v1.0` |
| `MAX_CONCURRENT_REQUESTS` | Maximum concurrent requests | `100` |
| `REQUEST_TIMEOUT_SECS` | Request timeout in seconds | `300` |
| `RUST_LOG` | Logging level | `info` |
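As a rough sketch of how these variables can be resolved (names and defaults taken from the table above; the crate's actual loading logic lives in `config.rs` and may differ):

```rust
use std::env;

// Illustrative only: read a variable, falling back to the documented default.
fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

fn main() {
    let host = env_or("HOST", "127.0.0.1");
    let port: u16 = env_or("PORT", "8080").parse().expect("PORT must be a number");
    let model = env_or("MODEL_NAME", "TinyLlama-1.1B-Chat-v1.0");
    let max_requests: usize = env_or("MAX_CONCURRENT_REQUESTS", "100")
        .parse()
        .expect("MAX_CONCURRENT_REQUESTS must be a number");

    println!("binding {host}:{port}, model {model}, {max_requests} concurrent requests");
}
```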
Configuration File
Create a config.toml file:
[server]
host = "0.0.0.0"
port = 8080
[model]
name = "TinyLlama-1.1B-Chat-v1.0"
max_concurrent_requests = 100
[logging]
level = "info"
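A `serde`-based sketch of deserializing this file (the structs below simply mirror the TOML tables above; the crate's actual `config.rs` types may differ), using the `toml` and `serde` crates:

```rust
use serde::Deserialize;

// Hypothetical structs mirroring the [server], [model], and [logging] tables.
#[derive(Deserialize)]
struct Config {
    server: Server,
    model: Model,
    logging: Logging,
}

#[derive(Deserialize)]
struct Server { host: String, port: u16 }

#[derive(Deserialize)]
struct Model { name: String, max_concurrent_requests: usize }

#[derive(Deserialize)]
struct Logging { level: String }

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config: Config = toml::from_str(&std::fs::read_to_string("config.toml")?)?;
    println!(
        "serving {} on {}:{} (max {} requests, log level {})",
        config.model.name,
        config.server.host,
        config.server.port,
        config.model.max_concurrent_requests,
        config.logging.level
    );
    Ok(())
}
```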
Running the Server
Using Cargo
# Development
cargo run -- stand
# Release (optimized)
cargo run --release -- stand
Using Pre-built Binary
./lmonade stand
Docker
docker run -p 8080:8080 lmonade/server:latest
Architecture
The server is built on top of:
- Axum: High-performance async web framework
- Tokio: Async runtime
- Tower: Middleware and service composition
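To make the stack concrete, a stripped-down router covering a few of the endpoints above might look like the following. This is a sketch assuming axum 0.7; the handler names are placeholders, and the real wiring lives in `routes.rs`:

```rust
use axum::{routing::{get, post}, Router};

// Placeholder handlers; the real ones live in api_handlers.rs.
async fn health() -> &'static str { "ok" }
async fn list_models() -> &'static str { "[]" }
async fn chat_completions() -> &'static str { "{}" }

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let app = Router::new()
        .route("/health", get(health))
        .route("/v1/models", get(list_models))
        .route("/v1/chat/completions", post(chat_completions));

    // Bind and serve on the default address from the Configuration section.
    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```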
Request Flow
- HTTP request received by the Axum router
- Request validated and parsed
- Forwarded to `LLMService`
- `LLMService` communicates with `ModelHub` (actor system)
- Response streamed back to the client
Streaming Implementation
The server implements true token-by-token streaming:
// SSE streaming for real-time generation
use axum::response::sse::Event;
use futures::StreamExt;
let stream = hub.generate_stream(model, prompt, config).await?;
// Serialize each generated token to JSON; serialization failures are
// propagated as stream errors instead of panicking.
let sse_stream = stream.map(|token| {
    serde_json::to_string(&token)
        .map(|json| Event::default().data(json))
        .map_err(axum::Error::new)
});
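The resulting stream can then be wrapped in an SSE response. A minimal sketch using axum's `sse` types (the `response` variable name is illustrative):

```rust
use axum::response::sse::{KeepAlive, Sse};

// Wrap the token stream in an SSE response; keep-alive comments prevent
// idle connections from being closed by intermediaries.
let response = Sse::new(sse_stream).keep_alive(KeepAlive::default());
```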
Error Handling
The server provides detailed error responses:
{
"error": {
"message": "Model not found: gpt-4",
"type": "invalid_request_error",
"code": "model_not_found"
}
}
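One way such a response can be produced in axum is an error type that implements `IntoResponse`. The sketch below is illustrative and not the crate's actual `error.rs`:

```rust
use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
use serde_json::json;

// Hypothetical error type mapped to the JSON error envelope shown above.
struct ApiError {
    status: StatusCode,
    message: String,
    kind: &'static str,
    code: &'static str,
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        let body = json!({
            "error": {
                "message": self.message,
                "type": self.kind,
                "code": self.code,
            }
        });
        (self.status, Json(body)).into_response()
    }
}
```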
Monitoring
Health Check
curl http://localhost:8080/health
Response:
{
"status": "healthy",
"model": "TinyLlama-1.1B-Chat-v1.0",
"uptime_seconds": 3600,
"requests_processed": 1234
}
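A handler returning this shape could be as simple as a serializable struct whose fields match the example response. This is a sketch; the real handler lives in `api_handlers.rs` and would pull these values from shared server state:

```rust
use axum::Json;
use serde::Serialize;

// Hypothetical response struct matching the JSON above.
#[derive(Serialize)]
struct HealthResponse {
    status: &'static str,
    model: String,
    uptime_seconds: u64,
    requests_processed: u64,
}

async fn health() -> Json<HealthResponse> {
    // Placeholder values; a real handler reads these from server state.
    Json(HealthResponse {
        status: "healthy",
        model: "TinyLlama-1.1B-Chat-v1.0".to_string(),
        uptime_seconds: 3600,
        requests_processed: 1234,
    })
}
```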
Metrics
The server exposes Prometheus-compatible metrics at /metrics (when enabled).
Development
Project Structure
lmonade-server/
├── src/
│ ├── api_handlers.rs # HTTP request handlers
│ ├── llm_service.rs # Core service logic
│ ├── routes.rs # API route definitions
│ ├── config.rs # Configuration management
│ ├── error.rs # Error types
│ └── lib.rs # Library exports
├── bin/
│ └── serve.rs # Server binary
└── tests/
└── integration.rs # Integration tests
Adding New Endpoints
- Define the handler in `api_handlers.rs`
- Add the route in `routes.rs`
- Update the OpenAPI spec if applicable
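For example, a hypothetical `/v1/ping` endpoint (name invented for illustration) would need a handler and a route registration:

```rust
// In api_handlers.rs (hypothetical handler):
pub async fn ping() -> &'static str {
    "pong"
}

// In routes.rs (hypothetical registration on the existing router):
// let router = router.route("/v1/ping", axum::routing::get(api_handlers::ping));
```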
Testing
# Run tests
cargo test
# Integration tests
cargo test --test integration
# With logging
RUST_LOG=debug cargo test
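Integration tests live in `tests/integration.rs`. A new test against a running server could look roughly like this sketch, which assumes `reqwest`, `serde_json`, and `tokio` as dev-dependencies and a server already listening on the default address:

```rust
// tests/integration.rs (illustrative sketch)
#[tokio::test]
async fn health_endpoint_reports_healthy() {
    let body: serde_json::Value = reqwest::get("http://127.0.0.1:8080/health")
        .await
        .expect("request failed")
        .json()
        .await
        .expect("invalid JSON");

    assert_eq!(body["status"], "healthy");
}
```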
Performance
Benchmarks
Indicative figures on a typical setup (actual numbers depend on hardware, model size, and batch configuration):
- Throughput: ~1000 tokens/second
- Latency: <50ms first token
- Concurrent requests: 100+
Optimization Tips
- Use release builds for production
- Enable GPU acceleration if available
- Adjust batch sizes based on hardware
- Use connection pooling for clients
Troubleshooting
Common Issues
Port already in use:
# Change port
PORT=8081 ./lmonade stand
Model not loading:
# Check model path
ls ~/.lmonade/models/
Out of memory:
# Reduce batch size
MAX_BATCH_SIZE=8 ./lmonade stand
License
See LICENSE in the root directory.
Contributing
See CONTRIBUTING.md for guidelines.