#encryption #ml #pipeline #tee #confidential

confidential-ml-pipeline

Multi-enclave pipeline orchestration for confidential ML inference

6 releases (3 breaking)

new 0.5.0 Apr 3, 2026
0.4.0 Mar 15, 2026
0.3.1 Feb 13, 2026
0.2.2 Feb 12, 2026


confidential-ml-pipeline

Crates.io docs.rs CI License: MIT/Apache-2.0

Multi-enclave pipeline orchestration for confidential ML inference.

Disclaimer: This project is under active development; the source code and features are not yet production-ready.

Overview

confidential-ml-pipeline builds on top of confidential-ml-transport to orchestrate model-parallel inference across multiple TEE enclaves. A model is sharded across N stages, each running inside a separate enclave, and the orchestrator drives input tensors through the pipeline while maintaining encrypted, attestation-bound channels between all participants.

Key properties:

  • Pipeline parallelism -- 1F1B (one forward, one backward) fill-drain scheduling with configurable micro-batching to minimize pipeline bubbles
  • Shard manifest -- JSON-based model sharding specification with layer ranges, weight hashes, and expected attestation measurements per stage
  • Two-phase APIs -- StageRuntime and Orchestrator expose split control/data phases for TCP deployment where connections arrive at different times
  • Configurable timeouts -- per-operation timeouts for health checks (default 10s) and inference requests (default 60s), surfaced as PipelineError::Timeout
  • Retry policy -- TCP connection retries use the transport crate's RetryPolicy with exponential backoff and jitter, configurable on both OrchestratorConfig and StageConfig
  • TCP deployment helpers -- tcp module with retry-connect, listener binding, and full stage/orchestrator lifecycle over real TCP
  • Pluggable transports -- TCP and VSock backends via feature flags, with tokio::io::duplex for in-process testing
  • Pluggable attestation -- trait-based attestation, mock for development, Nitro for production
  • Relay mesh -- transparent bidirectional byte relay for inter-stage data channels through the host
  • Error propagation -- stage failures send error sentinels on data channels to unblock the pipeline, with detailed error reporting on control channels
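A shard manifest of the kind described above might look like the following. This is an illustrative sketch only; the field names and layout are assumptions, not the crate's actual schema (truncated hash values are placeholders):

```json
{
  "model": "example-model",
  "stages": [
    {
      "stage_idx": 0,
      "layers": [0, 3],
      "weights_sha256": "3b0c44…",
      "expected_measurement": "a1b2c3…"
    },
    {
      "stage_idx": 1,
      "layers": [4, 7],
      "weights_sha256": "9f86d0…",
      "expected_measurement": "d4e5f6…"
    }
  ]
}
```

Each stage entry binds a layer range to both a weight hash and the attestation measurement the orchestrator expects from that enclave.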

Architecture

                    control channels (SecureChannel)
                    ┌──────────┬──────────┐
                    │          │          │
              ┌─────▼───┐ ┌───▼─────┐ ┌──▼──────┐
  input ────► │ Stage 0  │ │ Stage 1 │ │ Stage 2 │ ────► output
  tensors     │ layers   │ │ layers  │ │ layers  │       tensors
              │  0..3    │ │  4..7   │ │  8..11  │
              └────┬─────┘ └──┬──────┘ └─────────┘
                   │          │
                   └──────────┘
              data channels (SecureChannel)
              activation tensors flow L→R

The orchestrator runs on the host and:

  1. Connects control channels to each stage, sends Init with shard specs, waits for Ready
  2. Sends EstablishDataChannels, then connects/accepts data channels
  3. Dispatches StartRequest with micro-batch scheduling, sends input tensors to stage 0, receives output tensors from the last stage
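The control-channel exchange above can be sketched as a message enum. The message names come from the steps above; the enum shape and payloads are assumptions, not the crate's real protocol types:

```rust
/// Illustrative control-plane messages; the crate's actual protocol
/// types and payloads may differ.
#[derive(Debug, PartialEq)]
enum ControlMsg {
    Init { stage_idx: usize, layer_start: usize, layer_end: usize },
    Ready,
    EstablishDataChannels,
    StartRequest { micro_batches: usize },
}

/// Messages the orchestrator sends to one stage during setup, in order.
/// (The stage replies `Ready` after `Init`; replies are omitted here.)
fn setup_sequence(
    stage_idx: usize,
    layers: (usize, usize),
    micro_batches: usize,
) -> Vec<ControlMsg> {
    vec![
        ControlMsg::Init { stage_idx, layer_start: layers.0, layer_end: layers.1 },
        ControlMsg::EstablishDataChannels,
        ControlMsg::StartRequest { micro_batches },
    ]
}
```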

Each stage runs inside an enclave and:

  1. Accepts a control channel, receives its StageSpec and ActivationSpec
  2. Accepts a data-in channel, connects a data-out channel
  3. Executes forward passes per the 1F1B schedule, streaming activation tensors to the next stage
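A fill-drain schedule of the kind the steps above follow can be sketched as: stage s runs micro-batch m at step s + m, so the pipeline fills over the first stages - 1 steps and drains symmetrically at the end. This is a minimal sketch, not the crate's scheduler:

```rust
/// Generate a fill-drain pipeline schedule as (step, stage, micro_batch)
/// triples: stage `s` runs micro-batch `m` at step `s + m`.
/// Illustrative only; the crate's 1F1B scheduler may differ.
fn fill_drain_schedule(stages: usize, micro_batches: usize) -> Vec<(usize, usize, usize)> {
    let mut ops = Vec::with_capacity(stages * micro_batches);
    // Total steps: micro_batches + (stages - 1) for fill and drain.
    for step in 0..(stages + micro_batches - 1) {
        for stage in 0..stages {
            // Stage runs micro-batch (step - stage) if it is in range.
            if step >= stage && step - stage < micro_batches {
                ops.push((step, stage, step - stage));
            }
        }
    }
    ops
}
```

For 4 stages and 16 micro-batches this yields 64 forward passes over 19 steps, matching the "schedule generation" benchmark configuration below.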

Features

| Feature | Default | Description |
|---------|---------|-------------|
| mock    | Yes     | Mock attestation provider/verifier for local development |
| tcp     | Yes     | TCP transport backend + TCP deployment helpers |
| vsock   | No      | VSock transport backend for Nitro Enclaves |
| nitro   | No      | AWS Nitro attestation provider/verifier |
| sev-snp | No      | AMD SEV-SNP attestation provider/verifier |
| tdx     | No      | Intel TDX attestation provider/verifier |

Quick Start

In-process (duplex transport)

use confidential_ml_pipeline::*;
use confidential_ml_transport::{MockProvider, MockVerifier};

// Create duplex pairs for control and data channels
let (orch_ctrl, stage_ctrl) = tokio::io::duplex(65536);
let (orch_data_in, stage_data_in) = tokio::io::duplex(65536);
let (stage_data_out, orch_data_out) = tokio::io::duplex(65536);

// Spawn the stage. `MyExecutor` is your application's forward-pass executor.
tokio::spawn(async move {
    let mut runtime = StageRuntime::new(MyExecutor::new(), StageConfig::default());
    runtime.run(stage_ctrl, stage_data_in, stage_data_out,
                &MockProvider::new(), &MockVerifier::new()).await.unwrap();
});

// Run the orchestrator. `manifest`, `input_tensors`, and `seq_len` are
// application-provided (shard manifest, input batch, sequence length).
let mut orch = Orchestrator::new(OrchestratorConfig::default(), manifest)?;
orch.init(vec![orch_ctrl], &MockProvider::new(), &MockVerifier::new()).await?;
orch.establish_data_channels(orch_data_in, orch_data_out, vec![],
                              &MockProvider::new(), &MockVerifier::new()).await?;
let result = orch.infer(input_tensors, seq_len).await?;

TCP multi-process

use confidential_ml_pipeline::tcp;

// Stage worker (each process)
let (ctrl_lis, _, din_lis, _) = tcp::bind_stage_listeners(ctrl_addr, din_addr).await?;
tcp::run_stage_with_listeners(executor, config, ctrl_lis, din_lis,
                               data_out_target, &provider, &verifier).await?;

// Orchestrator
let dout_listener = TcpListener::bind(dout_addr).await?;
let mut orch = tcp::init_orchestrator_tcp(config, manifest, dout_listener,
                                           &verifier, &provider).await?;
let result = orch.infer(input_tensors, seq_len).await?;

See examples/tcp-pipeline/ for a complete multi-binary example.
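Connection retries over TCP use the transport crate's RetryPolicy with exponential backoff and jitter (see Key properties). A minimal std-only sketch of such a delay schedule, with a caller-supplied jitter fraction so it stays deterministic here; the real RetryPolicy's fields and randomness source may differ:

```rust
use std::time::Duration;

/// Exponential backoff: base * 2^attempt, capped at `max`, plus a
/// caller-supplied jitter fraction of the capped delay.
/// Illustrative only; not the transport crate's actual RetryPolicy.
fn backoff_delay(base: Duration, attempt: u32, max: Duration, jitter_frac: f64) -> Duration {
    // Saturating arithmetic keeps very large attempt counts from overflowing.
    let exp = base.saturating_mul(2u32.saturating_pow(attempt));
    let capped = exp.min(max);
    capped + Duration::from_secs_f64(capped.as_secs_f64() * jitter_frac)
}
```

In a real policy the jitter fraction would be drawn randomly per attempt so that simultaneously restarting stages do not reconnect in lockstep.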

Examples

Mock pipeline (in-process)

cargo run --example mock-pipeline --manifest-path examples/mock-pipeline/Cargo.toml

TCP pipeline (multi-process)

# Terminal 1: Stage 0
cargo run --bin stage-worker --manifest-path examples/tcp-pipeline/Cargo.toml -- \
  --manifest manifest.json --stage-idx 0 --data-out-target 127.0.0.1:9011

# Terminal 2: Stage 1
cargo run --bin stage-worker --manifest-path examples/tcp-pipeline/Cargo.toml -- \
  --manifest manifest.json --stage-idx 1 --data-out-target 127.0.0.1:9020

# Terminal 3: Orchestrator
cargo run --bin pipeline-orch --manifest-path examples/tcp-pipeline/Cargo.toml -- \
  --manifest manifest.json --data-out-listen 127.0.0.1:9020

Testing

Test counts depend on features:

| Command | Tests | Notes |
|---------|-------|-------|
| cargo test | ~49 | Default features (tcp only) |
| cargo test --features mock | ~88 | Full suite including mock integration, stress, and timeout tests |

Note: mock cannot be combined with production attestation features (tdx, nitro, etc.) due to a compile-time guard. Some integration tests bind TCP ports and may need socket permissions.

# Full test suite (recommended)
cargo test --features mock

# Default features only
cargo test

# Stress tests only (100 sequential requests, 1.5 MiB tensors, 16 micro-batches, 3-stage multi-request)
cargo test --test stress_test

# TCP integration tests only
cargo test --test tcp_pipeline

# With logging
RUST_LOG=debug cargo test --test tcp_pipeline -- --nocapture

Benchmarks

Synthetic (in-process, mock transport)

cargo bench --bench pipeline_bench

| Benchmark | Result |
|-----------|--------|
| Pipeline throughput (2-stage, 128KB tensor) | 923 µs, 135 MiB/s |
| Latency per stage (1KB tensor) | ~43 µs/stage |
| Schedule generation (4 stages, 16 micro-batches) | 4.0 µs |
| Relay forwarding (128KB) | 57 µs, 2.1 GiB/s |
| Protocol serde (StartRequest roundtrip) | 205 ns serialize, 419 ns deserialize |
| Health check (2 stages) | 39 µs |
| Multi micro-batch (2-stage, 16 micro-batches) | 598 µs |

Nitro Enclave (GPT-2 124M, m6i.2xlarge, N=5)

End-to-end GPT-2 inference across real Nitro Enclaves with encrypted VSock transport (ChaCha20-Poly1305). 5 independent cold-boot runs per configuration. Mean +/- 95% CI (t-distribution, df=4).

| Metric | 1-Stage (12 layers) | 2-Stage (6+6) | 3-Stage (4+4+4) |
|--------|---------------------|---------------|-----------------|
| TTFT | 92.5 +/- 1.8ms | 97.5 +/- 5.4ms | 107.1 +/- 13.7ms |
| Gen avg | 41.9 +/- 1.8ms/tok | 44.1 +/- 3.3ms/tok | 50.0 +/- 8.3ms/tok |
| Gen p50 | 41.9 +/- 1.8ms/tok | 44.1 +/- 3.4ms/tok | 49.9 +/- 8.2ms/tok |
| Gen p95 | 42.9 +/- 1.9ms/tok | 45.4 +/- 3.6ms/tok | 51.9 +/- 11.0ms/tok |
| Throughput | 23.9 +/- 1.0 tok/s | 22.7 +/- 1.6 tok/s | 20.3 +/- 2.9 tok/s |
| Overhead vs 1-stage | -- | +5.2% | +19.2% |

GCP Cross-VM (GPT-2 124M, c3-standard-4, N=5)

End-to-end GPT-2 inference across separate GCP VMs with encrypted TCP transport (ChaCha20-Poly1305, mock attestation). Compute-normalized: 4 vCPU total budget (1-stage = 4 cores, 2-stage = 2+2 cores via taskset). 5 independent runs per configuration. Mean +/- 95% CI (t-distribution, df=4).

| Metric | Standard 1-Stage | Standard 2-Stage | TDX 1-Stage | TDX 2-Stage |
|--------|------------------|------------------|-------------|-------------|
| TTFT | 70.1 +/- 0.8ms | 75.5 +/- 1.2ms | 97.7 +/- 2.5ms | 89.4 +/- 0.7ms |
| Gen avg | 42.3 +/- 0.5ms/tok | 41.3 +/- 0.4ms/tok | 48.6 +/- 0.5ms/tok | 47.8 +/- 1.5ms/tok |
| Gen p50 | 42.3 +/- 0.5ms/tok | 41.3 +/- 0.4ms/tok | 48.6 +/- 0.5ms/tok | 47.7 +/- 1.4ms/tok |
| Gen p95 | 43.4 +/- 1.1ms/tok | 42.0 +/- 0.3ms/tok | 49.6 +/- 0.4ms/tok | 48.9 +/- 1.7ms/tok |
| Throughput | 23.6 +/- 0.3 tok/s | 24.2 +/- 0.2 tok/s | 20.6 +/- 0.2 tok/s | 20.9 +/- 0.6 tok/s |
| Cost/1M tokens | $3.25 | $5.58 | $3.73 | $6.46 |
| Transport overhead | -- | within CI | -- | within CI |
| TDX overhead | -- | -- | +14.9% | +15.7% |
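The cost rows are simple arithmetic: hourly instance price times the hours needed to generate one million tokens at the measured throughput. A sketch of that calculation; the $0.20/h price used below is a placeholder, not an actual GCP rate:

```rust
/// Dollars to generate one million tokens at a sustained throughput,
/// for a given hourly instance price (illustrative helper).
fn cost_per_million_tokens(price_per_hour: f64, tok_per_s: f64) -> f64 {
    let seconds = 1_000_000.0 / tok_per_s;
    price_per_hour * seconds / 3600.0
}
```

At 23.6 tok/s, a hypothetical $0.20/h instance works out to roughly $2.35 per million tokens; the table's figures reflect the actual instance prices used in the benchmark.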

Real TDX Attestation (GPT-2 124M, c3-standard-4 TDX, N=5)

Same hardware as TDX columns above, but using real TDX quotes (configfs-tsm) instead of mock attestation. Isolates the one-time attestation handshake cost.

| Metric | TDX Mock (Phase 2) | TDX Real Attestation (Phase 3) | Delta |
|--------|--------------------|--------------------------------|-------|
| TTFT | 97.7 +/- 2.5ms | 101.6 +/- 3.8ms | +3.9ms (+4.0%) |
| Gen avg | 48.6 +/- 0.5ms/tok | 49.7 +/- 0.2ms/tok | +1.1ms (within CI) |
| Gen p50 | 48.6 +/- 0.5ms/tok | 49.7 +/- 0.3ms/tok | within CI |
| Gen p95 | 49.6 +/- 0.4ms/tok | 50.7 +/- 0.6ms/tok | within CI |
| Throughput | 20.6 +/- 0.2 tok/s | 20.1 +/- 0.1 tok/s | within CI |

Key findings:

  • TDX compute overhead: ~15% vs standard VMs (memory encryption, not transport).
  • Real TDX attestation adds ~4ms one-time handshake cost, amortized over the session.
  • Per-token generation latency is unchanged by attestation (within CI).
  • No measurable transport bottleneck from cross-VM encrypted relay.

See examples/gpt2-pipeline/ for the full example and benchmark_results/ for raw data and detailed analysis.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT License

at your option.

Contributing

See CONTRIBUTING.md for guidelines.
