#artificial-intelligence #load-balancing #openai #anthropic #llm #task

flyllm

A Rust library that unifies multiple LLM backends behind a single abstraction layer with load balancing

4 releases

new 0.2.0 Apr 30, 2025
0.1.2 Apr 27, 2025
0.1.1 Apr 27, 2025
0.1.0 Apr 27, 2025

#146 in Machine learning


282 downloads per month

MIT license

84KB
1.5K SLoC

FlyLLM

FlyLLM is a Rust library that provides a load-balanced, multi-provider client for Large Language Models. It enables developers to seamlessly work with multiple LLM providers (OpenAI, Anthropic, Google, Mistral...) through a unified API with request routing, load balancing, and failure handling.

FlyLLM Logo

Features

  • Multiple Provider Support 🌐: Unified interface for OpenAI, Anthropic, Google, and Mistral APIs
  • Task-Based Routing 🧭: Route requests to the most appropriate provider based on predefined tasks
  • Load Balancing ⚖️: Automatically distribute load across multiple provider instances
  • Failure Handling 🛡️: Retry mechanisms and automatic failover between providers
  • Parallel Processing ⚡: Process multiple requests concurrently for improved throughput
  • Custom Parameters 🔧: Set provider-specific parameters per task or request
  • Usage Tracking 📊: Monitor token consumption for cost management

Installation

Add FlyLLM to your Cargo.toml:

[dependencies]
flyllm = "0.2.0"
tokio = { version = "1", features = ["full"] } # For async runtime

Architecture

(Architecture diagram)

The LLM Manager (LlmManager) serves as the core component for orchestrating language model interactions in your application. It manages multiple LLM instances (LLMInstance), each defined by a model, an API key, and its supported tasks (TaskDefinition).

When your application sends a generation request (GenerationRequest), the manager automatically selects an appropriate instance based on a configurable strategy (Least Recently Used, Quickest Response Time, etc.) and returns the LLM's generated response (LLMResponse). This design helps prevent rate limiting by distributing requests across multiple instances (even of the same model) with different API keys.

The manager handles failures gracefully by re-routing requests to alternative instances. It also supports parallel execution for significant performance improvements when processing multiple requests simultaneously!

You can define default parameters (temperature, max_tokens) for each task while retaining the ability to override these settings in specific requests. The system also tracks token usage across all instances:

--- Token Usage Statistics ---
ID    Provider        Model                          Prompt Tokens   Completion Tokens Total Tokens
-----------------------------------------------------------------------------------------------
0     mistral         mistral-small-latest           109             897             1006
1     anthropic       claude-3-sonnet-20240229       133             1914            2047
2     anthropic       claude-3-opus-20240229         51              529             580
3     google          gemini-2.0-flash               0               0               0
4     openai          gpt-3.5-turbo                  312             1003            1315
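
For example, a single request can override the defaults of its task. The sketch below assumes the request's params field takes the same HashMap<String, serde_json::Value> map used by TaskDefinition::parameters; check the crate documentation if the actual type differs.

use flyllm::GenerationRequest;
use serde_json::json;
use std::collections::HashMap;

// A "summary" request that overrides the task defaults for this call only.
let long_summary_request = GenerationRequest {
    prompt: "Summarize the following report in detail: ...".to_string(),
    task: Some("summary".to_string()),
    // Assumed: per-request overrides use the same key/value map as task defaults.
    params: Some(HashMap::from([
        ("max_tokens".to_string(), json!(1000)),  // task default is 500
        ("temperature".to_string(), json!(0.2)),
    ])),
};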

Usage Examples

You can enable FlyLLM's debug messages by setting the RUST_LOG environment variable to "debug" (for example, RUST_LOG=debug cargo run).

Quick Start

use flyllm::{
    ProviderType, LlmManager, GenerationRequest, TaskDefinition,
    use_logging
};
use std::collections::HashMap;
use serde_json::json;

#[tokio::main]
async fn main() {
    // Initialize logging
    use_logging();

    // Create an LLM manager
    let mut manager = LlmManager::new();

    // Define a task with specific parameters
    let summary_task = TaskDefinition {
        name: "summary".to_string(),
        parameters: HashMap::from([
            ("max_tokens".to_string(), json!(500)),
            ("temperature".to_string(), json!(0.3)),
        ]),
    };

    // Add a provider with its supported tasks
    manager.add_provider(
        ProviderType::OpenAI,
        std::env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY not set"),
        "gpt-3.5-turbo".to_string(),
        vec![summary_task.clone()],
        true,
    );

    // Create a request
    let request = GenerationRequest {
        prompt: "Summarize the following text: Climate change refers to long-term shifts in temperatures...".to_string(),
        task: Some("summary".to_string()),
        params: None,
    };

    // Generate response 
    // The Manager will automatically choose a provider fit for the task according to the selected strategy
    let responses = manager.generate_sequentially(vec![request]).await;
    
    // Handle the response
    if let Some(response) = responses.first() {
        if response.success {
            println!("Response: {}", response.content);
        } else {
            println!("Error: {}", response.error.as_ref().unwrap_or(&"Unknown error".to_string()));
        }
    }
}

Adding Multiple Providers

// Add OpenAI
manager.add_provider(
    ProviderType::OpenAI,
    openai_api_key,
    "gpt-4-turbo".to_string(),
    vec![summary_task.clone(), qa_task.clone()],
    true,
);

// Add Anthropic
manager.add_provider(
    ProviderType::Anthropic,
    anthropic_api_key,
    "claude-3-sonnet-20240229".to_string(),
    vec![creative_writing_task.clone(), code_generation_task.clone()],
    true,
);

// Add Mistral
manager.add_provider(
    ProviderType::Mistral,
    mistral_api_key,
    "mistral-large-latest".to_string(),
    vec![summary_task.clone(), translation_task.clone()],
    true,
);

// Add Google (Gemini)
manager.add_provider(
    ProviderType::Google,
    google_api_key,
    "gemini-1.5-pro".to_string(),
    vec![qa_task.clone(), creative_writing_task.clone()],
    true,
);

Task-Based Routing

// Define tasks with specific parameters
let summary_task = TaskDefinition {
    name: "summary".to_string(),
    parameters: HashMap::from([
        ("max_tokens".to_string(), json!(500)),
        ("temperature".to_string(), json!(0.3)),
    ]),
};

let creative_writing_task = TaskDefinition {
    name: "creative_writing".to_string(),
    parameters: HashMap::from([
        ("max_tokens".to_string(), json!(1500)),
        ("temperature".to_string(), json!(0.9)),
    ]),
};

// Create requests with different tasks
let summary_request = GenerationRequest {
    prompt: "Summarize the following article: ...".to_string(),
    task: Some("summary".to_string()),
    params: None,
};

let story_request = GenerationRequest {
    prompt: "Write a short story about a robot discovering emotions.".to_string(),
    task: Some("creative_writing".to_string()),
    params: None,
};
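
Both requests can then be handed to the manager just as in the Quick Start; each one is routed to a provider whose task list includes the requested task:

// Dispatch the requests; routing happens per request based on its task.
let responses = manager.generate_sequentially(vec![summary_request, story_request]).await;

for response in &responses {
    if response.success {
        println!("Response: {}", response.content);
    } else {
        println!("Error: {}", response.error.as_ref().unwrap_or(&"Unknown error".to_string()));
    }
}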

Parallel Processing

// Create multiple requests
let requests = vec![
    GenerationRequest { /* ... */ },
    GenerationRequest { /* ... */ },
    GenerationRequest { /* ... */ },
];

// Process in parallel
let parallel_results = manager.batch_generate(requests).await;

// Process each result
for result in parallel_results {
    if result.success {
        println!("Success: {}", result.content);
    } else {
        println!("Error: {}", result.error.as_ref().unwrap_or(&"Unknown error".to_string()));
    }
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are always welcome! If you'd like to contribute to FlyLLM, fork the repository and create a new branch for your changes. When you're done, submit a pull request to merge your changes into the main branch.

Supporting FlyLLM

If you want to support FlyLLM, you can:

  • Star ⭐ the project on GitHub!
  • Donate 🪙 to my Ko-fi page!
  • Share ❤️ the project with your friends!

To-Do List

Next features planned:

  • Usage tracking
  • Log redirection to file
  • Streaming responses
  • Internal documentation
  • Unit & Integration tests
  • Builder Pattern
  • Aggregation of more strategies

Dependencies

~10–22MB
~308K SLoC