Pie is a high-performance, programmable LLM serving system that empowers you to design and deploy custom inference logic and optimization strategies.
> **Note** 🧪
> This software is in a pre-release stage and under active development. It is recommended for testing and research purposes only.
## Getting Started

### Docker Installation

The easiest way to run Pie with CUDA support is using our pre-built Docker image.
**Prerequisites:**
- An NVIDIA GPU (SM 8.0+), an NVIDIA driver, and Docker
- Tested on Ubuntu 24.04 with CUDA 12.7
- Install the NVIDIA Container Toolkit
#### Step 1: Pull Image and Download Model

```bash
docker pull pieproject/pie:latest
mkdir -p ~/.cache
docker run --rm --gpus all -it -v ~/.cache:/root/.cache pieproject/pie:latest \
    pie model add "llama-3.2-1b-instruct"
```
- Models are downloaded into `~/.cache/pie/models/` and persist across container runs.
- FlashInfer's JIT-compiled kernels are cached in `~/.cache/flashinfer/` to avoid recompilation.
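Because both caches live under the single `~/.cache` bind mount, the host-side directories can be inspected (or pre-created) directly; a small sketch of the layout described above:

```shell
# The one bind mount (-v ~/.cache:/root/.cache) maps these host paths to
# /root/.cache/pie/models and /root/.cache/flashinfer inside the container,
# which is why downloads and compiled kernels survive container restarts.
mkdir -p ~/.cache/pie/models ~/.cache/flashinfer
ls -d ~/.cache/pie/models ~/.cache/flashinfer
```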
#### Step 2: Run an Inferlet

```bash
docker run --gpus all --rm -it -v ~/.cache:/root/.cache pieproject/pie:latest \
    pie run --config /workspace/pie/docker_config.toml \
    /workspace/example-apps/text_completion.wasm -- --prompt "What is the capital of France?"
```
Note that the very first inferlet response may take a few minutes due to the JIT compilation of FlashInfer.
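The `--` in the command above separates `pie`'s own options from arguments forwarded verbatim to the inferlet. The convention can be illustrated with a small shell function (`split_args` is a hypothetical stand-in for the CLI, not part of Pie itself):

```shell
# Hypothetical illustration of the `--` convention: everything before `--`
# belongs to the tool; everything after is forwarded to the inferlet.
split_args() {
  tool_args=""
  inferlet_args=""
  seen_sep=0
  for a in "$@"; do
    if [ "$a" = "--" ]; then seen_sep=1; continue; fi
    if [ "$seen_sep" -eq 0 ]; then tool_args="$tool_args $a"
    else inferlet_args="$inferlet_args $a"; fi
  done
  echo "tool:$tool_args"
  echo "inferlet:$inferlet_args"
}
split_args app.wasm -- --prompt "hi"
# prints:
#   tool: app.wasm
#   inferlet: --prompt hi
```

The `pie-cli submit` command in the manual walkthrough below uses the same separator convention.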
### Manual Installation

#### Prerequisites

- **Configure a Backend:** Navigate to a backend directory and follow its `README.md` for setup.
- **Add the Wasm Target:** Install the WebAssembly target for Rust:

  ```bash
  rustup target add wasm32-wasip2
  ```

  This is required to compile the Rust-based inferlets in the `example-apps` directory.
#### Step 1: Build

Build the CLIs and the example inferlets.

- **Build the engine `pie` and the client CLI `pie-cli`:** From the repository root, run

  ```bash
  cd pie && cargo install --path .
  ```

  Also, from the repository root, run

  ```bash
  cd client/cli && cargo install --path .
  ```

- **Build the examples:** From the repository root, run

  ```bash
  cd example-apps && cargo build --target wasm32-wasip2 --release
  ```
#### Step 2: Configure Engine and Backend

- **Create the default configuration file:** Substitute `$REPO` with the actual repository root and run

  ```bash
  pie config init python $REPO/backend/backend-python/server.py
  ```

- **Download the model:** The default config file specifies the expected model. Run the following command to download it.

  ```bash
  pie model add qwen-3-0.6b
  ```

- **Test the engine:** Run an inferlet directly with the engine. Due to JIT compilation of the FlashInfer kernels, the first run will have very high latency.

  ```bash
  pie run \
      $REPO/example-apps/target/wasm32-wasip2/release/text_completion.wasm \
      -- \
      --prompt "Where is the capital of France?"
  ```
#### Step 3: Run an Inferlet from a User Client

- **Create a user public key:** If you don't already have a key pair in `~/.ssh`, generate one with the following command. By default, the private key is written to `~/.ssh/id_ed25519` and the public key to `~/.ssh/id_ed25519.pub`. Please make sure the passphrase is empty.

  ```bash
  ssh-keygen
  ```

  In addition to ED25519, you can also use RSA or ECDSA keys.
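Since the walkthrough requires an empty passphrase, the key can also be generated non-interactively; a minimal sketch, assuming OpenSSH's `ssh-keygen` (`KEYFILE` is a placeholder for the expected `~/.ssh/id_ed25519` path):

```shell
# Sketch: generate an ED25519 key pair with an empty passphrase and no
# interactive prompts. KEYFILE is a placeholder; the walkthrough expects
# the key at ~/.ssh/id_ed25519.
KEYFILE="${KEYFILE:-$(mktemp -d)/id_ed25519}"
ssh-keygen -t ed25519 -N "" -f "$KEYFILE" -q   # -N "" sets an empty passphrase
ls "$KEYFILE" "$KEYFILE.pub"                   # private and public key files
```

Swap `-t ed25519` for `-t rsa` or `-t ecdsa` if you prefer one of the other supported key types.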
- **Create the default user client configuration file:** The following command creates a default user client configuration file using the current Unix username and the private key in `~/.ssh`.

  ```bash
  pie-cli config init
  ```
- **Register the user on the engine:** Run the following command to register the current user on the engine. `my-first-key` is the name of the key and can be any string; `cat` reads the public key from `~/.ssh/id_ed25519.pub` and pipes it to `pie auth add`.

  ```bash
  cat ~/.ssh/id_ed25519.pub | pie auth add $(whoami) my-first-key
  ```
- **Start the Engine:** Launch the Pie engine with the default configuration.

  ```bash
  pie serve
  ```
- **Run an Inferlet:** From another terminal window, run

  ```bash
  pie-cli submit \
      $REPO/example-apps/target/wasm32-wasip2/release/text_completion.wasm \
      -- \
      --prompt "Where is the capital of France?"
  ```