3 unstable releases

| Version | Date |
|---|---|
| 0.2.0-rc.2 | Feb 10, 2026 |
| 0.1.0 | Dec 3, 2025 |
# infernum-server

HTTP API server for the Infernum LLM inference framework.

## Overview

`infernum-server` provides a production-ready HTTP server that exposes Infernum's LLM capabilities through industry-standard `/v1/*` API endpoints. It works with any client that supports standard chat completion APIs.
## Features

- Standard API: Industry-standard `/v1/*` routes compatible with existing clients
- Streaming Responses: Real-time token-by-token output via SSE
- Model Cache Management: Download, convert, and manage local models
- HoloTensor Compression: Convert models to compressed HoloTensor format
- Agent Framework: ReAct-style agents with tool execution
- RAG System: Knowledge retrieval with vector embeddings
- Health & Metrics: Built-in health checks and Prometheus metrics
- CORS Support: Configurable cross-origin resource sharing
## Usage

```rust
use infernum_server::Server;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let server = Server::new("0.0.0.0:8080").await?;
    server.run().await
}
```

Or use the CLI:

```sh
infernum server --port 8080
```
## API Endpoints

### Chat & Inference Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completions (streaming/non-streaming) |
| `/v1/completions` | POST | Text completions |
| `/v1/models` | GET | List available models |
| `/v1/embeddings` | POST | Generate embeddings |
### Model Management

| Endpoint | Method | Description |
|---|---|---|
| `/api/models/load` | POST | Load a model into memory |
| `/api/models/unload` | POST | Unload current model |
| `/api/status` | GET | Server and model status |
### Model Cache Management

| Endpoint | Method | Description |
|---|---|---|
| `/api/cache/models` | GET | List cached models |
| `/api/cache/models/delete` | POST | Delete a cached model |
| `/api/cache/models/convert` | POST | Convert model to HoloTensor (SSE streaming) |
| `/api/models/download` | POST | Download from HuggingFace (SSE streaming) |
### Agent Framework

| Endpoint | Method | Description |
|---|---|---|
| `/api/agent/tools` | GET | List available tools |
| `/api/agent/run` | POST | Execute agent with objective (SSE streaming) |
| `/api/sessions` | GET | List active agent sessions |
| `/api/sessions/{id}` | GET | Get session details |
| `/api/sessions/{id}/events` | GET | Stream session events (SSE) |
### RAG (Retrieval-Augmented Generation)

| Endpoint | Method | Description |
|---|---|---|
| `/api/rag/health` | GET | RAG system health |
| `/api/rag/index` | POST | Index documents |
| `/api/rag/search` | POST | Search indexed documents |
### Health & Metrics

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Basic health check |
| `/health/deep` | GET | Deep health check with component status |
| `/ready` | GET | Readiness probe |
| `/metrics` | GET | Prometheus metrics |
## Streaming Endpoints

Several endpoints use Server-Sent Events (SSE) for real-time progress updates.

### Model Download (POST `/api/models/download`)

```sh
curl -X POST http://localhost:8080/api/models/download \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"model": "meta-llama/Llama-3.2-3B-Instruct"}'
```

SSE events include:

- `progress`: Download progress with `percent`, `file`, `files_done`, `files_total`
- `complete`: Download finished with `bytes_total`
- `error`: Error occurred with `message`
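The events above arrive over the standard SSE wire format (`event:` and `data:` lines, with a blank line between events). As a minimal client-side sketch, the following parser splits such a stream into `(event, payload)` pairs; the sample stream is hypothetical but uses the field names documented above:

```python
import json

def parse_sse(raw: str):
    """Split a raw SSE stream into (event, data) pairs.

    Assumes the standard SSE wire format: `event:` names the event,
    `data:` carries a JSON payload, and a blank line ends each event.
    """
    events = []
    name, data_lines = "message", []
    for line in raw.splitlines() + [""]:
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            events.append((name, json.loads("\n".join(data_lines))))
            name, data_lines = "message", []
    return events

# Hypothetical stream shaped like the documented download events:
stream = (
    "event: progress\n"
    'data: {"percent": 50, "file": "model-00001.safetensors", '
    '"files_done": 1, "files_total": 2}\n'
    "\n"
    "event: complete\n"
    'data: {"bytes_total": 6438985728}\n'
    "\n"
)
for name, payload in parse_sse(stream):
    print(name, payload)
```

A real client would feed response lines into the same state machine incrementally instead of buffering the whole stream.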
Supports:

- Sharded models: Automatically detects and downloads 70B+ models with multiple weight files
- Single models: Downloads single safetensors/pytorch files
- HoloTensor conversion: Optional post-download conversion via `convert_to_holo: true`
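Putting these together, a request body that downloads a model and then converts it in one step might look like the following (the model name is illustrative; `convert_to_holo` is the flag named above):

```json
{
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "convert_to_holo": true
}
```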
### Model Convert (POST `/api/cache/models/convert`)

```sh
curl -X POST http://localhost:8080/api/cache/models/convert \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "num_fragments": 64,
    "max_rank": 256,
    "min_quality": 0.85,
    "verify": true
  }'
```

SSE events include:

- `progress`: Conversion progress with `operation`, `tensor`, `compression_ratio`
- `complete`: Conversion finished with `metadata` (compression ratio, quality score)
- `error`: Error occurred
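The docs don't specify how LRDF computes `compression_ratio`, but as intuition for why `max_rank` bounds it, here is the generic low-rank arithmetic: storing an m x n matrix as a rank-r factorization takes r(m + n) values instead of mn. This is an illustrative assumption, not the documented HoloTensor formula:

```python
def lowrank_compression_ratio(m: int, n: int, rank: int) -> float:
    """Ratio of dense storage (m*n values) to a rank-`rank`
    factorization (rank*(m+n) values). Generic low-rank math only;
    NOT the documented LRDF encoding."""
    return (m * n) / (rank * (m + n))

# A square 4096 x 4096 weight matrix truncated at max_rank = 256:
print(lowrank_compression_ratio(4096, 4096, 256))  # → 8.0
```

Lower `max_rank` raises the ratio but risks falling below `min_quality`, which is why the convert request carries both knobs.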
## Configuration

Environment variables:

| Variable | Default | Description |
|---|---|---|
| `INFERNUM_PORT` | `8080` | Server port |
| `INFERNUM_HOST` | `0.0.0.0` | Bind address |
| `INFERNUM_MODELS_DIR` | `~/.cache/infernum` | Model cache directory |
| `HF_HOME` | `~/.cache/huggingface` | HuggingFace cache directory |
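For reference, resolving these variables with the table's defaults looks like this (the variable names and defaults come from the table; the helper itself is illustrative, not part of the crate):

```python
import os

def server_config(env=None) -> dict:
    """Resolve server settings from the documented environment
    variables, falling back to the defaults in the table above."""
    env = os.environ if env is None else env
    return {
        "host": env.get("INFERNUM_HOST", "0.0.0.0"),
        "port": int(env.get("INFERNUM_PORT", "8080")),
        "models_dir": env.get("INFERNUM_MODELS_DIR",
                              os.path.expanduser("~/.cache/infernum")),
        "hf_home": env.get("HF_HOME",
                           os.path.expanduser("~/.cache/huggingface")),
    }

cfg = server_config({"INFERNUM_PORT": "9090"})
print(cfg["port"], cfg["host"])  # → 9090 0.0.0.0
```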
## Examples

### Chat Completion

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
### Stream Chat

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```
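With `"stream": true`, standard-compatible `/v1` servers conventionally emit `data:` lines carrying JSON chunks with a `choices[0].delta.content` field, terminated by `data: [DONE]`. Assuming infernum-server follows that convention (the doc only states the endpoint streams via SSE), a client can assemble the reply like this; the sample chunks are hypothetical:

```python
import json

def collect_stream_text(lines) -> str:
    """Assemble assistant text from `data:` lines of a streamed chat
    completion, assuming the conventional /v1 chunk shape
    (`choices[0].delta.content`) and the `data: [DONE]` terminator."""
    out = []
    for line in lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        out.append(delta.get("content", ""))
    return "".join(out)

# Hypothetical stream chunks:
lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Once"}}]}',
    'data: {"choices": [{"delta": {"content": " upon"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(lines))  # → Once upon
```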
### Run Agent

```sh
curl http://localhost:8080/api/agent/run \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "objective": "List the files in the current directory",
    "tools": ["bash", "file_read"],
    "max_iterations": 10
  }'
```

### List Cached Models

```sh
curl http://localhost:8080/api/cache/models
```
Response:

```json
{
  "models": [
    {
      "id": "meta-llama/Llama-3.2-3B-Instruct",
      "name": "Llama-3.2-3B-Instruct",
      "source": "huggingface",
      "size_bytes": 6438985728,
      "size_str": "6.00 GB",
      "is_holotensor": false,
      "architecture": "llama"
    }
  ],
  "total_size_bytes": 6438985728,
  "total_size_str": "6.00 GB",
  "cache_dir": "/home/user/.cache/huggingface/hub"
}
```
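A client might reduce that response to a one-line summary; the field names below come from the example response, while the helper itself is illustrative:

```python
def summarize_cache(resp: dict) -> str:
    """One-line summary of an /api/cache/models response, using the
    `models`, `is_holotensor`, and `total_size_str` fields shown in
    the example response above."""
    holo = sum(1 for m in resp["models"] if m["is_holotensor"])
    return (f"{len(resp['models'])} model(s), {resp['total_size_str']} "
            f"total, {holo} in HoloTensor format")

resp = {
    "models": [
        {
            "id": "meta-llama/Llama-3.2-3B-Instruct",
            "name": "Llama-3.2-3B-Instruct",
            "source": "huggingface",
            "size_bytes": 6438985728,
            "size_str": "6.00 GB",
            "is_holotensor": False,
            "architecture": "llama",
        }
    ],
    "total_size_bytes": 6438985728,
    "total_size_str": "6.00 GB",
    "cache_dir": "/home/user/.cache/huggingface/hub",
}
print(summarize_cache(resp))  # → 1 model(s), 6.00 GB total, 0 in HoloTensor format
```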
## Part of Infernum Framework

This crate is part of the Infernum ecosystem:
- infernum-core: Shared types and traits
- abaddon: Inference engine with Flash Attention
- malphas: Model orchestration and scheduling
- stolas: Knowledge retrieval (RAG) with BM25 and vector search
- beleth: Agent framework (ReAct, Tree of Thought)
- dantalion: Observability (Prometheus, Jaeger)
- haagenti: HoloTensor compression (LRDF holographic encoding)
## License

Licensed under either the Apache License, Version 2.0 or the MIT license, at your option.