# Vllora LLM crate (vllora_llm)
This crate powers the Vllora AI Gateway’s LLM layer. It provides:
- Unified chat-completions client over multiple providers (OpenAI-compatible, Anthropic, Gemini, Bedrock, …)
- Gateway-native types (`ChatCompletionRequest`, `ChatCompletionMessage`, routing & tools support)
- Streaming responses and telemetry hooks via a common `ModelInstance` trait
- Tracing integration: out-of-the-box `tracing` support, with a console example in `llm/examples/tracing` (spans/events to stdout) and an OTLP example in `llm/examples/tracing_otlp` (send spans to external collectors such as New Relic)
- Supported parameters: see the Supported parameters section below for a detailed table of which parameters are honored by each provider
Use it when you want to talk to the gateway’s LLM engine from Rust code, without worrying about provider-specific SDKs.
## Installation
Run `cargo add vllora_llm` or add this to your `Cargo.toml`:

```toml
[dependencies]
vllora_llm = "0.1"
```
## Quick start
Here's a minimal example to get started:
```rust
use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    // 1) Build a chat completion request using gateway-native types
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![
            ChatCompletionMessage::new_text(
                "system".to_string(),
                "You are a helpful assistant.".to_string(),
            ),
            ChatCompletionMessage::new_text(
                "user".to_string(),
                "Stream numbers 1 to 20 in separate lines.".to_string(),
            ),
        ],
        ..Default::default()
    };

    // 2) Construct a VlloraLLMClient
    let client = VlloraLLMClient::new();

    // 3) Non-streaming: send the request and handle the final reply
    let response = client
        .completions()
        .create(request)
        .await?;
    // ... handle response

    Ok(())
}
```
Note: By default, `VlloraLLMClient::new()` fetches API keys from environment variables following the pattern `VLLORA_{PROVIDER_NAME}_API_KEY`. For example, for OpenAI it will look for `VLLORA_OPENAI_API_KEY`.
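For illustration, the lookup pattern expands like this (a hypothetical helper shown only to make the convention concrete; it is not part of the crate's API):

```rust
// Hypothetical helper mirroring the documented naming pattern:
// VLLORA_{PROVIDER_NAME}_API_KEY, with the provider name upper-cased.
fn vllora_api_key_var(provider: &str) -> String {
    format!("VLLORA_{}_API_KEY", provider.to_uppercase())
}

fn main() {
    assert_eq!(vllora_api_key_var("openai"), "VLLORA_OPENAI_API_KEY");
    assert_eq!(vllora_api_key_var("anthropic"), "VLLORA_ANTHROPIC_API_KEY");
}
```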
## Quick start with async-openai-compatible types
If you already build OpenAI-compatible requests (e.g. via `async-openai-compat`), you can send both non-streaming and streaming completions through `VlloraLLMClient`.
```rust
use async_openai::types::{
    ChatCompletionRequestMessage,
    ChatCompletionRequestSystemMessageArgs,
    ChatCompletionRequestUserMessageArgs,
    CreateChatCompletionRequestArgs,
};
use tokio_stream::StreamExt;
use vllora_llm::client::VlloraLLMClient;
use vllora_llm::error::LLMResult;
use vllora_llm::types::credentials::{ApiKeyCredentials, Credentials};

#[tokio::main]
async fn main() -> LLMResult<()> {
    // 1) Build an OpenAI-style request using async-openai-compatible types
    let openai_req = CreateChatCompletionRequestArgs::default()
        .model("gpt-4.1-mini")
        .messages([
            ChatCompletionRequestMessage::System(
                ChatCompletionRequestSystemMessageArgs::default()
                    .content("You are a helpful assistant.")
                    .build()?,
            ),
            ChatCompletionRequestMessage::User(
                ChatCompletionRequestUserMessageArgs::default()
                    .content("Stream numbers 1 to 20 in separate lines.")
                    .build()?,
            ),
        ])
        .build()?;

    // 2) Construct a VlloraLLMClient (here: direct OpenAI key)
    let client = VlloraLLMClient::new().with_credentials(Credentials::ApiKey(
        ApiKeyCredentials {
            api_key: std::env::var("VLLORA_OPENAI_API_KEY")
                .expect("VLLORA_OPENAI_API_KEY must be set"),
        },
    ));

    // 3) Non-streaming: send the request and print the final reply
    let response = client
        .completions()
        .create(openai_req.clone())
        .await?;
    if let Some(content) = &response.message().content {
        if let Some(text) = content.as_string() {
            println!("Non-streaming reply:\n{text}");
        }
    }

    // 4) Streaming: send the same request and print chunks as they arrive
    let mut stream = client
        .completions()
        .create_stream(openai_req)
        .await?;
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                print!("{delta}");
            }
        }
    }

    Ok(())
}
```
## Basic usage: completions client (gateway-native)
The main entrypoint is `VlloraLLMClient`, which gives you a `CompletionsClient` for chat completions using the gateway-native request/response types.
```rust
use std::sync::Arc;

use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    // In production you would pass a real ModelInstance implementation
    // that knows how to call your configured providers / router.
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));

    // Build the high-level client
    let client = VlloraLLMClient::new_with_instance(instance);

    // Build a simple chat completion request
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(), // or any gateway-configured model id
        messages: vec![
            ChatCompletionMessage::new_text(
                "system".to_string(),
                "You are a helpful assistant.".to_string(),
            ),
            ChatCompletionMessage::new_text(
                "user".to_string(),
                "Say hello in one short sentence.".to_string(),
            ),
        ],
        ..Default::default()
    };

    // Send the request and get a single response message
    let response = client.completions().create(request).await?;
    let message = response.message();
    if let Some(content) = &message.content {
        if let Some(text) = content.as_string() {
            println!("Model reply: {text}");
        }
    }

    Ok(())
}
```
Key pieces:
- `VlloraLLMClient`: wraps a `ModelInstance` and exposes `.completions()`.
- `CompletionsClient::create`: sends a one-shot completion request and returns a `ChatCompletionMessageWithFinishReason`.
- Gateway types (`ChatCompletionRequest`, `ChatCompletionMessage`) abstract over provider-specific formats; a small multi-turn sketch follows below.
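As a small illustration of the gateway-native types, here is a sketch of a multi-turn request. It only uses constructors shown above; the `"assistant"` role string is an assumption, by analogy with the `"system"` and `"user"` roles in the examples:

```rust
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};

fn multi_turn_request() -> ChatCompletionRequest {
    ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![
            ChatCompletionMessage::new_text("system".to_string(), "You are terse.".to_string()),
            ChatCompletionMessage::new_text("user".to_string(), "What is 2 + 2?".to_string()),
            // Prior assistant turn carried back as context ("assistant" role
            // is assumed here, mirroring the "system"/"user" roles above).
            ChatCompletionMessage::new_text("assistant".to_string(), "4".to_string()),
            ChatCompletionMessage::new_text("user".to_string(), "Now double it.".to_string()),
        ],
        ..Default::default()
    }
}

fn main() {
    let request = multi_turn_request();
    println!("model: {}", request.model);
}
```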
## Streaming completions
`CompletionsClient::create_stream` returns a `ResultStream` that yields streaming chunks:
```rust
use std::sync::Arc;

use tokio_stream::StreamExt;
use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));
    let client = VlloraLLMClient::new_with_instance(instance);

    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            "Stream the alphabet, one chunk at a time.".to_string(),
        )],
        ..Default::default()
    };

    // StreamExt::next (imported above) drives the stream chunk by chunk
    let mut stream = client.completions().create_stream(request).await?;
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                print!("{delta}");
            }
        }
    }

    Ok(())
}
```
The stream API mirrors OpenAI-style streaming but uses gateway-native `ChatCompletionChunk` types.
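If you need the full text as well as incremental output, here is a sketch that accumulates the streamed deltas into a single `String`, assuming the same chunk shape as in the examples above and that the client methods take `&self`:

```rust
use tokio_stream::StreamExt;
use vllora_llm::client::VlloraLLMClient;
use vllora_llm::error::LLMResult;
use vllora_llm::types::gateway::ChatCompletionRequest;

// Collect every streamed delta from a completion into one String.
async fn collect_stream(
    client: &VlloraLLMClient,
    request: ChatCompletionRequest,
) -> LLMResult<String> {
    let mut full_text = String::new();
    let mut stream = client.completions().create_stream(request).await?;
    while let Some(chunk) = stream.next().await {
        for choice in chunk?.choices {
            if let Some(delta) = choice.delta.content {
                full_text.push_str(&delta);
            }
        }
    }
    Ok(full_text)
}
```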
## Supported parameters
The table below lists which `ChatCompletionRequest` (and provider-specific) parameters are honored by each provider when using `VlloraLLMClient`:
| Parameter | OpenAI / Proxy | Anthropic | Gemini | Bedrock | Notes |
|---|---|---|---|---|---|
| `model` | yes | yes | yes | yes | Taken from `ChatCompletionRequest.model` or engine config. |
| `max_tokens` | yes | yes | yes | yes | Mapped to provider-specific `max_tokens` / `max_output_tokens`. |
| `temperature` | yes | yes | yes | yes | Sampling temperature. |
| `top_p` | yes | yes | yes | yes | Nucleus sampling. |
| `n` | no | no | yes | no | For Gemini, mapped to `candidate_count`; other providers always use `n = 1`. |
| `stop` / `stop_sequences` | yes | yes | yes | yes | Converted to each provider’s stop / stop-sequences field. |
| `presence_penalty` | yes | no | yes | no | OpenAI / Gemini only. |
| `frequency_penalty` | yes | no | yes | no | OpenAI / Gemini only. |
| `logit_bias` | yes | no | no | no | OpenAI-only token bias map. |
| `user` | yes | no | no | no | OpenAI “end-user id” field. |
| `seed` | yes | no | yes | no | Deterministic sampling where supported. |
| `response_format` (JSON schema, etc.) | yes | no | yes | no | Gemini additionally normalizes JSON schema for its API. |
| `prompt_cache_key` | yes | no | no | no | OpenAI-only prompt caching hint. |
| `provider_specific.top_k` | no | yes | no | no | Anthropic-only: maps to Claude `top_k`. |
| `provider_specific.thinking` | no | yes | no | no | Anthropic “thinking” options (e.g. budget tokens). |
| Bedrock `additional_parameters` map | no | no | no | yes | Free-form JSON, passed through to Bedrock model params. |
Additionally, for Anthropic, the first system message in the conversation is mapped into a `SystemPrompt` (either as a single text string or as multiple `TextContentBlock`s), and any `cache_control` options on those blocks are translated into Anthropic’s ephemeral cache-control settings.
All other fields on `ChatCompletionRequest` (such as `stream`, `tools`, `tool_choice`, `functions`, `function_call`) are handled at the gateway layer and/or per-provider tool integration, but are not mapped 1:1 into provider primitive parameters.
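For example, a request that sets several of the portable parameters from the table. This is a minimal sketch: the `Option<...>` shapes of the sampling fields are assumptions for illustration, not confirmed signatures; check the crate docs for the exact field types.

```rust
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};

fn tuned_request() -> ChatCompletionRequest {
    ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            "Summarize the Rust borrow checker in two sentences.".to_string(),
        )],
        // Field types below are assumed (Option<...>) for illustration.
        max_tokens: Some(256),  // mapped to max_tokens / max_output_tokens
        temperature: Some(0.2), // honored by all listed providers
        top_p: Some(0.9),       // nucleus sampling
        ..Default::default()
    }
}

fn main() {
    let request = tuned_request();
    println!("model: {}", request.model);
}
```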
## Provider-specific examples
There are runnable examples under `llm/examples/` that mirror the patterns above:
- `openai`: Direct OpenAI chat completions using `VlloraLLMClient` (non-streaming + streaming).
- `anthropic`: Anthropic (Claude) chat completions via the unified client.
- `gemini`: Gemini chat completions via the unified client.
- `bedrock`: AWS Bedrock chat completions (Nova etc.) via the unified client.
- `proxy`: Using `InferenceModelProvider::Proxy("proxy_name")` to call an OpenAI-completions-compatible endpoint.
- `tracing`: Same OpenAI-style flow as `openai`, but with `tracing_subscriber::fmt()` configured to emit spans and events to the console (stdout).
- `tracing_otlp`: Shows how to wire `vllora_telemetry::events::layer` to an OTLP HTTP exporter (e.g. New Relic or any OTLP collector) and emit spans from `VlloraLLMClient` calls to a remote telemetry backend.
Each example is a standalone Cargo binary; you can `cd` into a directory and run:

```sh
cargo run
```

after setting the provider-specific environment variables noted in the example’s `main.rs`.
## Notes
- Real usage: In the full LangDB / Vllora gateway, concrete `ModelInstance` implementations are created by the core executor based on your `models.yaml` and routing rules; the examples above use `DummyModelInstance` only to illustrate the public API of the `CompletionsClient`.
- Error handling: All client methods return `LLMResult<T>`, which wraps rich `LLMError` variants (network, mapping, provider errors, etc.); see the sketch after this list.
- More features: The same types in `vllora_llm::types::gateway` are used for tools, MCP, routing, embeddings, and image generation; see the main repository docs at https://vllora.dev/docs for higher-level gateway features.
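As an example of the error-handling surface, here is a minimal sketch that matches on the `Result` instead of propagating it with `?`; it only assumes the error type implements `Display`:

```rust
use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};

#[tokio::main]
async fn main() {
    let client = VlloraLLMClient::new();
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            "Say hello.".to_string(),
        )],
        ..Default::default()
    };

    // LLMResult<T> is an ordinary Result, so match and `?` both work.
    match client.completions().create(request).await {
        Ok(response) => {
            if let Some(content) = &response.message().content {
                if let Some(text) = content.as_string() {
                    println!("Model reply: {text}");
                }
            }
        }
        // Network, mapping, and provider failures all surface as LLMError;
        // here we just log it via its Display impl.
        Err(e) => eprintln!("LLM call failed: {e}"),
    }
}
```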
## Roadmap and issues
- GitHub issues / roadmap: See the open LLM crate issues for planned and outstanding work.
- Planned enhancements:
  - Integrate the responses API
  - Support built-in MCP tool calls
  - Gemini prompt caching support
  - Full support for thinking messages
## License
Licensed under the Apache License, Version 2.0. Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be licensed as above, without any additional terms or conditions.