2 unstable releases

0.2.0 Jul 4, 2024
0.1.0 Jun 20, 2024

#601 in Machine learning

MIT license

68KB
1.5K SLoC

callm

Latest version on crates.io Documentation on docs.rs License

About

callm enables you to run Generative AI models (such as Large Language Models) directly on your hardware, offline.
Under the hood, callm heavily relies on the candle crate and is written in pure Rust 🦀.

Supported models

Model Safetensors GGUF (quantized)
Llama
Mistral
Phi3
Qwen2

Thread safety

While pipelines are safe to send between threads, callm has not undergone extensive testing for thread-safety.
Caution is advised.

Portability

callm is known to run on Linux and macOS, and has been tested on these platforms. While Windows has not been extensively tested, it is expected to work out-of-the-box without issues.

callm is still in an early development stage and is not production-ready yet.

Installation

Add callm to your dependencies:

$ cargo add callm

Enabling GPU Support

callm uses features to selectively enable support for GPU acceleration.

NVIDIA (CUDA)

Enable the cuda feature to include support for CUDA devices.

$ cargo add callm --features cuda

Apple (Metal)

Enable the metal feature to include support for Metal devices.

$ cargo add callm --features metal

Usage

callm uses builder pattern to create inference pipelines.

use callm::pipelines::PipelineText;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build pipeline
    let mut pipeline = PipelineText::builder()
        .with_location("/path/to/model")
        .build()?;

    // Run inference
    let text_completion = pipeline.run("Tell me a joke about Rust borrow checker")?;
    println!("{text_completion}");

    Ok(())
}

Customizing sampling parameters

Override default sampling parameters during pipeline build or afterwards.

use callm::pipelines::PipelineText;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build pipeline with custom sampling parameters
    let mut pipeline = PipelineText::builder()
        .with_location("/path/to/model")
        .with_temperature(0.65)
        .with_top_k(25)
        .build()?;

    // Adjust sampling parameters later on
    pipeline.set_seed(42);
    pipeline.set_top_p(0.3);

    // Run inference
    let text_completion = pipeline.run("Write an article about Pentium F00F bug")?;
    println!("{text_completion}");

    Ok(())
}

Instruction-following and Chat models

If the model you are loading includes a chat template, you can use conversation-style inference via run_chat(). It accepts a slice of tuples in the form: (MessageRole, String).

use callm::pipelines::PipelineText;
use callm::templates::MessageRole;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build pipeline
    let mut pipeline = PipelineText::builder()
        .with_location("/path/to/model")
        .with_temperature(0.1)
        .build()?;

    // Prepare conversation messages
    let messages = vec![
        (
            MessageRole::System,
            "You are impersonating Linus Torvalds.".to_string(),
        ),
        (
            MessageRole::User,
            "What is your opinion on Rust for Linux kernel development?".to_string(),
        ),
    ];

    // Run chat-style inference
    let assistant_response = pipeline.run_chat(&messages)?;
    println!("{assistant_response}");

    Ok(())
}

Documentation

Consult the documentation for a full API reference.
Several examples and tools can be found in a separate callm-demos repo.

Contributing

Thank you for your interest in contributing to callm!

As this project is still in its early stages, your help is invaluable. Here are some ways you can get involved:

  • Report issues: If you encounter any bugs or unexpected behavior, please file an issue on GitHub. This will help us track and fix problems.
  • Submit a pull request: If you'd like to contribute code, please fork the repository, make your changes, and submit a pull request. We'll review and merge your changes as soon as possible.
  • Help with documentation: If you have expertise in a particular area, please help us improve our documentation.

Thank you for your contributions! 💪

Dependencies

~23–35MB
~668K SLoC