sys llama-cpp-sys-4

Low-level bindings to llama.cpp

64 releases

new 0.2.60 May 19, 2026
0.2.50 Apr 23, 2026
0.2.18 Mar 30, 2026
0.1.94 Feb 11, 2025
0.1.87 Nov 26, 2024

Used in 6 crates (via llama-cpp-4)

MIT/Apache

llama-cpp-sys-4

Raw bindgen-generated bindings to llama.cpp, plus the C/C++ build logic that compiles the library.

llama.cpp version: b8533 · Crate version: 0.2.13

Unless you need access to a symbol not yet exposed by llama-cpp-4, use that crate instead — it provides a safe API over these raw bindings.


What's included

  • llama_* functions and types from llama.h
  • ggml_* functions and types from ggml/include/ggml.h
  • LLAMA_* constants
  • common_tokenize and common_token_to_piece from common/common.h
  • The entire llama.cpp static library (or shared, with dynamic-link)

Feature flags

Feature | Description
--- | ---
openmp | OpenMP multi-threading (on by default; auto-detected on ARM platforms)
cuda | NVIDIA GPU (requires the CUDA toolkit)
metal | Apple GPU (macOS/iOS only)
vulkan | Vulkan GPU backend
native | -march=native: tune for the build machine's CPU
rpc | Remote compute backend
dynamic-link | Link against a pre-installed shared libllama instead of building from source
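
To make the feature table concrete, here is a minimal sketch of how a build script could turn enabled Cargo features into CMake-style switches. Cargo does expose each enabled feature to build scripts as a `CARGO_FEATURE_<NAME>` environment variable, and the `GGML_*` names are upstream ggml build options. The pairing of features to options, and the helper names `feature_env_var` and `backend_defines`, are assumptions made for illustration and are not taken from this crate's build.rs.

```rust
use std::env;

/// Hypothetical pairing of Cargo features (from the table above) with
/// ggml build options. This is an assumption, not the crate's real table.
const FEATURE_TO_OPTION: &[(&str, &str)] = &[
    ("openmp", "GGML_OPENMP"),
    ("cuda", "GGML_CUDA"),
    ("metal", "GGML_METAL"),
    ("vulkan", "GGML_VULKAN"),
    ("native", "GGML_NATIVE"),
    ("rpc", "GGML_RPC"),
];

/// Cargo exposes an enabled feature `foo-bar` to build scripts as the
/// environment variable `CARGO_FEATURE_FOO_BAR`.
fn feature_env_var(feature: &str) -> String {
    format!("CARGO_FEATURE_{}", feature.to_uppercase().replace('-', "_"))
}

/// Produces `-DOPTION=ON` or `-DOPTION=OFF` for every known feature.
/// `is_set` reports whether a given environment variable is present.
fn backend_defines(is_set: impl Fn(&str) -> bool) -> Vec<String> {
    FEATURE_TO_OPTION
        .iter()
        .map(|&(feature, option)| {
            let state = if is_set(&feature_env_var(feature)) { "ON" } else { "OFF" };
            format!("-D{option}={state}")
        })
        .collect()
}

fn main() {
    // Inside a real build script the predicate reads the process environment.
    for define in backend_defines(|var| env::var_os(var).is_some()) {
        println!("{define}");
    }
}
```

Passing the environment lookup in as a closure keeps the mapping testable without mutating process-wide state.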

Building

The crate compiles llama.cpp from the vendored submodule at build time using cc + cmake-style flags. No external llama.cpp installation is required.

# CPU only (default)
cargo build -p llama-cpp-sys-4

# Metal (macOS)
cargo build -p llama-cpp-sys-4 --features metal

# CUDA
cargo build -p llama-cpp-sys-4 --features cuda

# OpenMPI (distributed inference)
brew install open-mpi   # or apt install libopenmpi-dev
cargo build -p llama-cpp-sys-4 --features mpi

Build dependencies

  • clang — required by bindgen to parse the C++ headers
  • A C++17 compiler (GCC 9+, Clang 10+, MSVC 2019+)
  • cmake is not required — the build is driven entirely by build.rs

Regenerating bindings

Bindings are regenerated automatically whenever build.rs or wrapper.h changes. The allowlist covers llama_*, ggml_*, LLAMA_*, and the two common_* functions.

# Force a full rebuild including binding regeneration
touch llama-cpp-sys-4/wrapper.h
cargo build -p llama-cpp-sys-4
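
The allowlist described above can be pictured as a simple name filter. The sketch below is a simplified stand-in, assuming prefix matching for `llama_`, `ggml_`, and `LLAMA_` plus exact matches for the two `common_*` helpers. The real allowlist is expressed through bindgen's builder in build.rs and may use different patterns; `is_allowlisted` is a hypothetical helper.

```rust
/// Prefixes named in the allowlist description.
const ALLOWED_PREFIXES: &[&str] = &["llama_", "ggml_", "LLAMA_"];

/// The two helpers from common/common.h listed under "What's included".
const ALLOWED_EXACT: &[&str] = &["common_tokenize", "common_token_to_piece"];

/// Returns true if a symbol name passes this simplified allowlist.
fn is_allowlisted(symbol: &str) -> bool {
    ALLOWED_EXACT.contains(&symbol)
        || ALLOWED_PREFIXES
            .iter()
            .any(|prefix| symbol.starts_with(*prefix))
}

fn main() {
    let symbols = [
        "llama_decode",
        "ggml_init",
        "LLAMA_DEFAULT_SEED",
        "common_tokenize",
        "common_init", // not one of the two allowed common_* helpers
    ];
    for symbol in symbols {
        println!("{symbol}: {}", is_allowlisted(symbol));
    }
}
```

A symbol that fails this kind of filter is the case where you would need to extend the allowlist and regenerate the bindings.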

Notable API changes (b4689 → b8249)

These are the upstream llama.cpp breaks handled in this crate:

Removed / renamed | Replacement
--- | ---
llama_kv_cache_* functions | llama_memory_* via llama_get_memory(ctx)
llama_set_adapter_lora + llama_rm_adapter_lora | llama_set_adapters_lora (batch API)
context_params.flash_attn: bool | context_params.flash_attn_type: llama_flash_attn_type
llama-sampling.h | llama-sampler.h
C++11 build flag | C++17 required by new common.h (std::string_view)
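
As an illustration of the `flash_attn` change, the sketch below maps the old boolean onto a three-state value. `FlashAttnType` is a local stand-in defined only for this example; its variants and numeric values follow my reading of the upstream `llama_flash_attn_type` enum (auto, disabled, enabled) and should be checked against the generated bindings before relying on them. `from_legacy_flag` is a hypothetical helper, not part of the crate.

```rust
/// Local stand-in for the upstream `llama_flash_attn_type` enum.
/// The discriminant values are an assumption; verify them in the bindings.
#[repr(i32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FlashAttnType {
    Auto = -1,
    Disabled = 0,
    Enabled = 1,
}

/// Converts an old-style optional boolean setting into the enum.
/// `None` means the caller never set the flag, so the runtime may decide.
fn from_legacy_flag(flash_attn: Option<bool>) -> FlashAttnType {
    match flash_attn {
        Some(true) => FlashAttnType::Enabled,
        Some(false) => FlashAttnType::Disabled,
        None => FlashAttnType::Auto,
    }
}

fn main() {
    for legacy in [Some(true), Some(false), None] {
        let new = from_legacy_flag(legacy);
        println!("{legacy:?} -> {new:?} ({})", new as i32);
    }
}
```

The practical difference from the old field is the third state: code that previously had to pick `true` or `false` can now leave the choice to the library.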

Bindgen configuration

Key decisions in build.rs:

  • derive_partialeq(true) with no_partialeq(...) overrides for structs containing function-pointer fields (avoids the unpredictable_function_pointer_comparisons lint).
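  • opaque_type("std::.*"), so C++ standard-library types are emitted as opaque blobs (correct size and alignment, no fields) rather than translated member by member.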
  • OpenMP auto-detection — reads GGML_OPENMP_ENABLED from the CMake cache rather than relying solely on the openmp feature flag, because some ARM toolchains enable OpenMP unconditionally.
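
The OpenMP auto-detection bullet can be sketched as a small cache lookup. This assumes a CMakeCache.txt is available to the build script and that entries use CMake's usual `NAME:TYPE=VALUE` form. The function names and the set of values treated as true are illustrative choices, not the crate's actual code.

```rust
/// Looks up `key` in the text of a CMakeCache.txt file.
/// Entries look like `NAME:TYPE=VALUE`; lines starting with `#` or `//`
/// are comments and are skipped.
fn cmake_cache_value<'a>(cache: &'a str, key: &str) -> Option<&'a str> {
    cache.lines().find_map(|line| {
        let line = line.trim();
        if line.starts_with('#') || line.starts_with("//") {
            return None;
        }
        // Split at the first '=' so values containing '=' stay intact.
        let (name_and_type, value) = line.split_once('=')?;
        let name = name_and_type.split(':').next()?;
        (name == key).then_some(value)
    })
}

/// Treats common CMake "true" spellings as enabled (case-insensitive).
fn is_cmake_truthy(value: &str) -> bool {
    matches!(
        value.to_ascii_uppercase().as_str(),
        "ON" | "1" | "TRUE" | "YES" | "Y"
    )
}

/// Reports whether the cache says OpenMP ended up enabled.
fn openmp_enabled(cache: &str) -> bool {
    cmake_cache_value(cache, "GGML_OPENMP_ENABLED").is_some_and(is_cmake_truthy)
}

fn main() {
    // Sample cache contents, made up for this example.
    let cache = "# generated by CMake\n\
                 GGML_OPENMP:BOOL=ON\n\
                 GGML_OPENMP_ENABLED:INTERNAL=ON\n";
    println!("OpenMP enabled: {}", openmp_enabled(cache));
}
```

Reading the resolved value rather than the requested one is the point of the design: the feature flag says what was asked for, while the cache records what the toolchain actually enabled.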
