nightly bin+lib llm_client

llm_client: An Interface for Deterministic Signals from Probabilistic LLM Vibes

2 releases

0.0.4 Aug 28, 2024
0.0.2 May 28, 2024


MIT license

235KB
5.5K SLoC


// Load Local LLMs
let llm_client = LlmClient::llama_cpp().available_vram(48).mistral7b_instruct_v0_3().init().await?;

// Build requests
let response: u32 = llm_client.reason().integer()
    .instructions()
    .set_content("Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?")
    .return_primitive().await?;

// Receive 'primitive' outputs
assert_eq!(response, 1);

This runs the reason one round cascading prompt workflow with an integer output.

An example run of this workflow with these instructions.

From AI Vibes to the Deterministic Real World

Large Language Models (LLMs) are somewhere between conventional programming and databases: they process information like rule-based systems while storing and retrieving vast amounts of data like databases. But unlike the deterministic outputs of if statements and database queries, raw LLMs produce outputs that can be described as 'vibes' — probabilistic, context-dependent results that may vary between runs. Building on vibes may work for creative applications, but in real-world applications, a vibey output is a step back from the reliability and consistency of traditional programming paradigms.

llm_client is an interface for building cascading prompt workflows from dynamic and novel inputs, running the steps of the workflows in a linear or reactive manner, and constraining and interpreting LLM inference as actionable signals. This allows the integration of LLMs into traditional software systems with the level of consistency and reliability that users expect.

llm_client Implementation Goals

  • As Friendly as Possible: the most intuitive and ergonomic Rust interface possible
  • Local and Embedded:
    • Designed for native, in-process integration—just like a standard Rust crate
    • No stand-alone servers, containers, or services
    • Minimal cloud dependencies, and fully local, fully supported
  • Works: Accurate, reliable, and observable results from available workflows

Traditional LLM Constraints Impact LLM Performance

Using an LLM as part of your business logic requires extracting usable signals from LLM responses. This means interpreting the text generated by an LLM using something like Regex or LLM constraints. Controlling the output of an LLM is commonly achieved with constraints like logit bias, stop words, or grammars.

However, from practical experience, as well as studies, constraints such as logit bias and grammars negatively impact the quality of LLMs. Furthermore, these merely constrain the inference at the token level, when we may wish to shape the structure of the entire generation.

Controlled Generation with Step Based Cascade Workflows

llm_client's cascade prompting system runs pre-defined workflows that control and constrain both the overall structure of generation and individual tokens during inference. This allows the implementation of specialized workflows for specific tasks, shaping LLM outputs towards intended, reproducible outcomes.

This method significantly improves the reliability of LLM use cases. For example, there are test cases in this repo that can be used to benchmark an LLM. Accuracy increases substantially when moving from basic inference with a constrained outcome to a CoT style cascading prompt workflow. The decision workflow, which runs N CoT workflows across a temperature gradient, approaches 100% accuracy on the test cases.

Cascade Prompt Elements

  • Workflow: A workflow, or 'flow', is a high level object that runs the individual elements.
  • Rounds: Each round is a pair of a user turn and an assistant turn. Turns are sometimes referred to as 'messages'.
    • Both the user turn and the assistant turn can be pre-generated, or dynamically generated.
  • Tasks: The 'user message' in the user turn of a round. Generically referred to as a 'task' for the sake of brevity.
  • Steps: Each assistant turn consists of multiple steps.
    • Inference steps generate text via LLM inference.
    • Guidance steps generate text from pre-defined static inputs or dynamic inputs from the program.
  • Generation prefixes: Assistant steps can be prefixed with content.
  • Dynamic suffixes: Assistant steps can also be suffixed with additional content after generation.
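
The elements above can be pictured as a small set of nested types. The following is a minimal sketch under stated assumptions: `Step`, `Round`, `Workflow`, and `demo_cascade` are hypothetical names chosen for illustration and are not llm_client's actual types, and the model is replaced by a closure that returns canned text.

```rust
// Hypothetical types for illustration only; these are not llm_client's real API.

/// One step inside an assistant turn.
pub enum Step {
    /// Text produced by LLM inference, with an optional prefix placed before
    /// the generation and an optional suffix appended after it.
    Inference {
        prefix: Option<String>,
        suffix: Option<String>,
    },
    /// Text supplied by the program (static or computed); no inference is run.
    Guidance(String),
}

/// A round pairs a user turn (the task) with an assistant turn (the steps).
pub struct Round {
    pub task: String,
    pub steps: Vec<Step>,
}

/// A workflow runs its rounds in order.
pub struct Workflow {
    pub rounds: Vec<Round>,
}

impl Workflow {
    /// Runs every round linearly. `infer` stands in for the LLM: it receives
    /// the transcript so far and returns the generated text.
    pub fn run(&self, infer: &mut dyn FnMut(&str) -> String) -> String {
        let mut transcript = String::new();
        for round in &self.rounds {
            transcript.push_str(&format!("user: {}\nassistant:", round.task));
            for step in &round.steps {
                match step {
                    Step::Guidance(text) => {
                        transcript.push(' ');
                        transcript.push_str(text);
                    }
                    Step::Inference { prefix, suffix } => {
                        if let Some(p) = prefix {
                            transcript.push(' ');
                            transcript.push_str(p);
                        }
                        let generated = infer(&transcript);
                        transcript.push(' ');
                        transcript.push_str(&generated);
                        if let Some(s) = suffix {
                            transcript.push_str(s);
                        }
                    }
                }
            }
            transcript.push('\n');
        }
        transcript
    }
}

/// Builds a one-round flow and runs it with a canned stand-in for the model.
pub fn demo_cascade() -> String {
    let flow = Workflow {
        rounds: vec![Round {
            task: "How many sisters does Sally have?".to_string(),
            steps: vec![
                Step::Guidance("Let's think step by step.".to_string()),
                Step::Inference {
                    prefix: Some("Therefore, the answer is:".to_string()),
                    suffix: Some(".".to_string()),
                },
            ],
        }],
    };
    let mut fake_llm = |_transcript: &str| "1".to_string();
    flow.run(&mut fake_llm)
}
```

Calling `demo_cascade()` from a `main` function returns the assembled transcript. Passing the model in as a closure keeps the sketch free of any backend, so the structure of the flow can be inspected on its own.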

An Example Cascade: CoT Reasoning

An example of a cascade workflow is the one round reasoning workflow.

In this example the workflow is run linearly as built, but it's also possible to run dynamic workflows where each step is run one at a time and the behavior of the workflow can change based on the outcome of that step. See extract_urls for an example of this.

Reasoning with Primitive Outcomes

A constraint enforced CoT process for reasoning. First, we get the LLM to 'justify' an answer in plain English. This allows the LLM to 'think' by outputting the stream of tokens required to come to an answer. Then we take that 'justification', and prompt the LLM to parse it for the answer. See the workflow for implementation details.

  • Currently supporting returning booleans, u32s, and strings from a list of options

  • Can be 'None' when run with return_optional_primitive()

    // boolean outcome
    let mut reason_request = llm_client.reason().boolean();
    reason_request
        .instructions()
        .set_content("Does this email subject indicate that the email is spam?");
    reason_request
        .supporting_material()
        .set_content("You'll never believe these low, low prices 💲💲💲!!!");
    let res: bool = reason_request.return_primitive().await.unwrap();
    assert_eq!(res, true);

    // u32 outcome
    let mut reason_request = llm_client.reason().integer();
    reason_request.primitive.lower_bound(0).upper_bound(10000);
    reason_request
        .instructions()
        .set_content("How many times is the word 'llm' mentioned in these comments?");
    reason_request
        .supporting_material()
        .set_content(hacker_news_comment_section);
    // Can be None
    let response: Option<u32> = reason_request.return_optional_primitive().await.unwrap();
    assert!(response > Some(9000));

    // string from a list of options outcome
    let mut reason_request = llm_client.reason().exact_string();
    reason_request
        .instructions()
        .set_content("Based on this readme, what is the name of the creator of this project?");
    reason_request
        .supporting_material()
        .set_content(llm_client_readme);
    reason_request
        .primitive
        .add_strings_to_allowed(&["shelby", "jack", "camacho", "john"]);
    let response: String = reason_request.return_primitive().await.unwrap();
    assert_eq!(response, "shelby");

See the reason example for more
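
The justify-then-parse idea can be sketched without any model at all. This is only an illustration under assumptions: `justify_then_parse`, `parse_bounded_u32`, and `demo_reason` are hypothetical names, the prompts are invented, and plain digit scanning is swapped in for the constraints the crate applies during inference. The model is replaced by a scripted closure.

```rust
// Hypothetical sketch; `justify_then_parse` and `parse_bounded_u32` are
// illustrative names, not functions from llm_client.

/// Extracts the first run of ASCII digits from `text` and accepts it only if
/// it falls inside `lower..=upper`.
pub fn parse_bounded_u32(text: &str, lower: u32, upper: u32) -> Option<u32> {
    let digits: String = text
        .chars()
        .skip_while(|c| !c.is_ascii_digit())
        .take_while(|c| c.is_ascii_digit())
        .collect();
    // An empty or overflowing digit run fails to parse and yields None.
    let value: u32 = digits.parse().ok()?;
    if (lower..=upper).contains(&value) {
        Some(value)
    } else {
        None
    }
}

/// Step one asks for a free-form justification; step two asks the model to
/// restate only the answer, which is then parsed under the bounds.
pub fn justify_then_parse(
    llm: &mut dyn FnMut(&str) -> String,
    instructions: &str,
    lower: u32,
    upper: u32,
) -> Option<u32> {
    let justification = llm(&format!(
        "{instructions}\nExplain your reasoning in plain English."
    ));
    let answer = llm(&format!(
        "{justification}\nState the final answer as a single integer between {lower} and {upper}."
    ));
    parse_bounded_u32(&answer, lower, upper)
}

/// Runs the two-step flow against a scripted stand-in for the model.
pub fn demo_reason() -> Option<u32> {
    let mut scripted = vec![
        "The answer is: 1".to_string(),
        "Sally is one of the 2 sisters each brother has, so she has 1 sister.".to_string(),
    ];
    // `pop` takes from the end, so the justification is returned first.
    let mut fake_llm = |_prompt: &str| scripted.pop().unwrap_or_default();
    justify_then_parse(&mut fake_llm, "How many sisters does Sally have?", 0, 10)
}
```

Returning `Option<u32>` mirrors the optional-primitive case: an answer that is missing or out of bounds becomes `None` instead of a guess.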

Decisions with N number of Votes Across a Temperature Gradient

Uses the same process as above N number of times where N is the number of times the process must be repeated to reach a consensus. We dynamically alter the temperature to ensure an accurate consensus. See the workflow for implementation details.

  • Supports primitives that implement the reasoning trait

  • The consensus vote count can be set with best_of_n_votes()

  • By default dynamic_temperture is enabled, and each 'vote' increases across a gradient

    // An integer decision request
    let mut decision_request = llm_client.reason().integer().decision();
    decision_request.best_of_n_votes(5);
    decision_request
        .instructions()
        .set_content("How many fingers do you have?");
    let response = decision_request.return_primitive().await.unwrap();
    assert_eq!(response, 5);

See the decision example for more
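
A rough picture of voting across a temperature gradient, assuming a strict majority of the requested vote count decides the outcome and that temperature rises linearly with each vote. Both assumptions, along with the names `decide` and `demo_decision`, are hypothetical and are not taken from the crate's workflow; the reasoning workflow is replaced by a closure.

```rust
use std::collections::HashMap;

// Hypothetical sketch; the names and the temperature schedule are assumptions.

/// Calls `vote_at` with rising temperatures until one answer has collected a
/// strict majority of `best_of_n` votes, or `max_attempts` is exhausted.
pub fn decide(
    vote_at: &mut dyn FnMut(f32) -> Option<u32>,
    best_of_n: usize,
    max_attempts: usize,
) -> Option<u32> {
    let votes_needed = best_of_n / 2 + 1;
    let mut tally: HashMap<u32, usize> = HashMap::new();
    for attempt in 0..max_attempts {
        // Assumed linear gradient: 0.0 for the first vote, rising by 0.2 each time.
        let temperature = 0.2 * attempt as f32;
        // A vote that produced no usable answer is skipped rather than counted.
        if let Some(answer) = vote_at(temperature) {
            let count = tally.entry(answer).or_insert(0);
            *count += 1;
            if *count >= votes_needed {
                return Some(answer);
            }
        }
    }
    None
}

/// Stand-in for the reasoning workflow: answers 5 except for one stray sample.
pub fn demo_decision() -> Option<u32> {
    let mut fake_vote = |temperature: f32| -> Option<u32> {
        if (temperature - 0.4).abs() < 1e-6 {
            Some(4)
        } else {
            Some(5)
        }
    };
    decide(&mut fake_vote, 5, 10)
}
```

With five votes requested, three matching answers are enough to stop early, so a single stray vote does not change the result.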

Structured Outputs and NLP

  • Data extraction, summarization, and semantic splitting on text.

  • 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.' Using Regex to parse and structure the output of LLMs puts an exponent over this old joke. llm_client uses constraints to conform the outputs of LLMs rather than trying to extract information from an unconstrained LLM generation.

  • The only NLP workflow currently implemented is URL extraction.

See the extract_urls example

Basic Primitives

A generation where the output is constrained to one of the defined primitive types. See the currently implemented primitive types. These are used in other workflows, but only some are used as the output for specific workflows like reason and decision.

  • These are fairly easy to add, so feel free to open an issue if you'd like one added.

See the basic_primitive example
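
One way to think about a primitive is as a type that can both describe its constraint and validate text against it. The sketch below is an assumption about the general shape only: the `Primitive` trait, `BooleanPrimitive`, and `ExactStringPrimitive` are hypothetical and are not the types the crate defines, and validation is done on finished text rather than during inference.

```rust
// Hypothetical sketch; `Primitive` here is not the trait llm_client defines.

/// A primitive knows how to describe its constraint and how to parse output.
pub trait Primitive {
    type Output;
    /// A human-readable description that could be placed in a prompt.
    fn describe(&self) -> String;
    /// Returns `None` when the text does not satisfy the constraint.
    fn parse(&self, text: &str) -> Option<Self::Output>;
}

pub struct BooleanPrimitive;

impl Primitive for BooleanPrimitive {
    type Output = bool;

    fn describe(&self) -> String {
        "Answer with 'true' or 'false'.".to_string()
    }

    fn parse(&self, text: &str) -> Option<bool> {
        // Surrounding whitespace and letter case are ignored.
        match text.trim().to_ascii_lowercase().as_str() {
            "true" => Some(true),
            "false" => Some(false),
            _ => None,
        }
    }
}

pub struct ExactStringPrimitive {
    pub allowed: Vec<String>,
}

impl Primitive for ExactStringPrimitive {
    type Output = String;

    fn describe(&self) -> String {
        format!("Answer with one of: {}.", self.allowed.join(", "))
    }

    fn parse(&self, text: &str) -> Option<String> {
        // Only an exact match against the allowed list is accepted.
        let candidate = text.trim();
        self.allowed
            .iter()
            .find(|option| option.as_str() == candidate)
            .cloned()
    }
}
```

Adding a new primitive in this shape means supplying one more `impl Primitive` block, which matches the idea that primitives are fairly easy to add.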

LLM -> LLMs

  • Basic support for API based LLMs. Currently Anthropic, OpenAI, and Perplexity are supported.

  • Perplexity does not currently return documents, but it does create its responses from live data

    let llm_client = LlmClient::perplexity().sonar_large().init();
    let mut basic_completion = llm_client.basic_completion();
    basic_completion
        .prompt()
        .add_user_message()
        .set_content("Can you help me use the llm_client rust crate? I'm having trouble getting cuda to work.");
    let response = basic_completion.run().await?;

See the basic_completion example

Loading Custom Models from Local Storage

    // From local storage
    let llm_client = LlmClient::llama_cpp().local_quant_file_path(local_llm_path).init().await?;

    // From hugging face
    let llm_client = LlmClient::llama_cpp().hf_quant_file_url("https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q6_K.gguf").init().await?;

Configuring Requests

  • All requests and workflows implement the BaseRequestConfigTrait which gives access to the parameters sent to the LLM

  • These settings are normalized across both local and API requests

    let llm_client = LlmClient::llama_cpp()
        .available_vram(48)
        .mistral7b_instruct_v0_3()
        .init()
        .await?;

    let mut basic_completion = llm_client.basic_completion();

    basic_completion
        .temperature(1.5)
        .frequency_penalty(0.9)
        .max_tokens(200);

See all the settings here
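
A shared configuration trait with chainable setters could look roughly like the following. This is a sketch under assumptions, not the crate's BaseRequestConfigTrait: `RequestConfig`, `ConfigAccess`, `CompletionSketch`, and the clamp ranges are all hypothetical and chosen only to show how one trait can give every request type the same setters.

```rust
// Hypothetical sketch: `RequestConfig`, `ConfigAccess`, and the clamp ranges
// are assumptions for illustration, not llm_client's BaseRequestConfigTrait.

#[derive(Debug, Default)]
pub struct RequestConfig {
    pub temperature: f32,
    pub frequency_penalty: f32,
    pub max_tokens: u32,
}

/// Any request type that exposes its config gets the setters for free.
pub trait ConfigAccess {
    fn config_mut(&mut self) -> &mut RequestConfig;

    /// Stores the temperature, clamped to an assumed 0.0..=2.0 range.
    fn temperature(&mut self, value: f32) -> &mut Self {
        self.config_mut().temperature = value.clamp(0.0, 2.0);
        self
    }

    /// Stores the penalty, clamped to an assumed -2.0..=2.0 range.
    fn frequency_penalty(&mut self, value: f32) -> &mut Self {
        self.config_mut().frequency_penalty = value.clamp(-2.0, 2.0);
        self
    }

    fn max_tokens(&mut self, value: u32) -> &mut Self {
        self.config_mut().max_tokens = value;
        self
    }
}

/// A stand-in request type that only carries a config.
#[derive(Debug, Default)]
pub struct CompletionSketch {
    pub config: RequestConfig,
}

impl ConfigAccess for CompletionSketch {
    fn config_mut(&mut self) -> &mut RequestConfig {
        &mut self.config
    }
}
```

Because each setter takes `&mut self` and returns `&mut Self`, calls can be chained on a `let mut` binding, as in `request.temperature(1.5).frequency_penalty(0.9).max_tokens(200);`.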

Guides

Installation

llm_client currently relies on llama.cpp. As it's a C++ project, it's not bundled in the crate. In the near future, llm_client will support mistral-rs, an inference backend built in Candle and supporting great features like ISQ. Once integration is complete, llm_client will be pure Rust and can be installed as just a crate.

  • Clone repo:
    git clone --recursive https://github.com/ShelbyJenkins/llm_client.git
    cd llm_client
  • Add to Cargo.toml:
    [dependencies]
    llm_client = {path="../llm_client"}

Roadmap

  • Migrate from llama.cpp to mistral-rs. This would greatly simplify consuming as an embedded crate. It's currently a WIP. It may also end up that llama.cpp is behind a feature flag as a fallback.
    • Current blockers are grammar migration
    • and multi-gpu support
  • Reasoning where the output can be multiple answers
  • Expanding the NLP functionality to include semantic splitting and further data extraction.
  • Refactor the benchmarks module

Dependencies

async-openai is used to interact with the OpenAI API. A modified version of the async-openai crate is used for the Llama.cpp server. If you just need an OpenAI API interface, I suggest using the async-openai crate.

clust is used to interact with the Anthropic API. If you just need an Anthropic API interface, I suggest using the clust crate.

llm_utils is a sibling crate that was split from llm_client. If you just need prompting, tokenization, model loading, etc, I suggest using the llm_utils crate on its own.

Contributing

Yes.

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Shelby Jenkins - Here or LinkedIn
