#machine-learning #triton #api-bindings

inference

A crate for managing the machine learning inference process

1 unstable release

0.3.0 Feb 18, 2023

#613 in Machine learning

Apache-2.0

225KB
2.5K SLoC

Inference

A Rust crate for managing the inference process for machine learning (ML) models. Currently, we support interacting with a Triton Inference Server, loading models from a MinIO Model Store.

Requirements

  • rust: Minimum Supported Rust Version 1.58
  • lld linker: for faster Rust builds.
  • docker: Container engine
  • NVIDIA container toolkit for Docker and a GPU supported by the container toolkit.
    • It may be possible to avoid the NVIDIA GPU requirement by changing the resource reservations for the triton service in docker compose so that they do not require a GPU. Whether this works depends on whether the model you're trying to serve was already compiled or optimized for GPU-only inference.
  • docker compose: Multi-container orchestration. NOTE: the standalone docker-compose command is deprecated; the compose functionality is now integrated into the docker compose command. To install it alongside an existing docker installation, run sudo apt-get install docker-compose-plugin. ref.
  • protoc: Google Protocol Buffer compiler. Needed to build protobufs to bind to the Triton gRPC server.

For Debian-based Linux distros, you can install inference's dependencies (except Docker and the NVIDIA container toolkit, which require the special repository configuration documented above) with the following command:

apt-get install clang build-essential lld protobuf-compiler libprotobuf-dev zstd libzstd-dev make cmake pkg-config libssl-dev

inference is tested on Ubuntu 22.04 LTS. Pull requests that fix Windows or macOS issues are welcome.

Quick Start

  1. Clone repo: git clone https://github.com/opensensordotdev/inference.git
    • Ensure all requirements have been installed, especially the lld linker and protoc! Otherwise inference won't build!
  2. make: Download the latest versions of the Triton Inference Server Protocol Buffer files & Triton sample ML models
  3. docker compose up: Start the MinIO and Triton containers + monitoring infrastructure
    • If you have a GPU available, uncomment the section below in the inference.triton service in docker-compose.yaml. For your GPU to work with Triton, the CUDA version on your host OS must be compatible with the CUDA version Triton expects.
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                capabilities: [gpu]

    • If you don't have a GPU, comment out that section
  4. Upload the contents of the sample_models directory to the models bucket via the MinIO web UI at localhost:9001
  5. cargo test: Verify all cargo tests pass

Model Inspection

http://localhost:8000/v2/models/simple

A GET request to this endpoint returns the model's name and the parameters required to set up its inputs and outputs.
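The same lookup can be scripted. Below is a minimal sketch that uses only the Rust standard library and assumes Triton is reachable on localhost:8000 and serves a model named simple, as in the URL above. The helper names (metadata_request, fetch_model_metadata, body_of) are made up for this example and are not part of the inference crate. The sketch does not handle chunked transfer encoding; a real project would more likely use an HTTP client crate.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Build a minimal HTTP/1.1 GET request for a model's metadata endpoint.
/// (Hypothetical helper, not part of the inference crate.)
fn metadata_request(host: &str, model: &str) -> String {
    format!(
        "GET /v2/models/{} HTTP/1.1\r\nHost: {}\r\nAccept: application/json\r\nConnection: close\r\n\r\n",
        model, host
    )
}

/// Send the request and return the raw response (status line, headers, body).
fn fetch_model_metadata(addr: &str, model: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.set_read_timeout(Some(Duration::from_secs(5)))?;
    stream.write_all(metadata_request(addr, model).as_bytes())?;
    let mut response = String::new();
    // `Connection: close` makes the server close the socket after replying,
    // so reading to end-of-stream yields the whole response.
    stream.read_to_string(&mut response)?;
    Ok(response)
}

/// Return everything after the blank line that ends the HTTP headers.
fn body_of(response: &str) -> Option<&str> {
    response.split_once("\r\n\r\n").map(|(_, body)| body)
}

fn main() {
    // Assumed address and model name, taken from the URL above.
    match fetch_model_metadata("localhost:8000", "simple") {
        Ok(raw) => println!("{}", body_of(&raw).unwrap_or(&raw)),
        Err(e) => eprintln!("could not reach Triton on localhost:8000: {}", e),
    }
}
```

If the containers are not running, the program prints the connection error instead of panicking.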

Errata

gRPC Setup

After running make, the proto folder contains the protocol buffer files. Only grpc_service.proto is referenced in build.rs because grpc_service.proto itself includes model_config.proto. The code generated by tonic is in inference.rs.

  • JSON is served on port 8000
  • gRPC calls are submitted on port 8001
  • Prometheus metrics are on port 8002
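To check which of these ports are accepting connections after docker compose up, a small standard-library probe like the sketch below may help. It assumes the services are published on 127.0.0.1, as in a local docker compose setup. TRITON_PORTS and port_is_open are names invented for this example. An open port only shows that something is listening, not that Triton is healthy.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Ports from the list above. The labels are only for display.
const TRITON_PORTS: [(&str, u16); 3] = [
    ("HTTP/JSON", 8000),
    ("gRPC", 8001),
    ("Prometheus metrics", 8002),
];

/// True if a TCP connection to 127.0.0.1:<port> succeeds within `timeout`.
/// (Assumes the services are published on the local host.)
fn port_is_open(port: u16, timeout: Duration) -> bool {
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

fn main() {
    for (name, port) in TRITON_PORTS.iter() {
        let state = if port_is_open(*port, Duration::from_millis(500)) {
            "open"
        } else {
            "closed"
        };
        println!("{:<20} {:>5}  {}", name, port, state);
    }
}
```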

Multiplexing Tonic Channels

Submitting requests to a gRPC service requires a mutable reference to a Client. This prohibits you from passing a single Client around to multiple Tasks and creates a bottleneck for async code.

Trying to hide this from users by wrapping what amounts to a synchronous resource in a struct and using async message passing to access it might help somewhat, but it still doesn't fix the core problem.

While it would be possible to make a connection pool of multiple Client<Channel>s and hide this pool in a struct accessed with async message passing, this is complicated.

Storing a tonic::transport::Channel in the TritonClient struct doesn't work either: it requires the struct to implement some obscure internal tonic traits. See tonic::transport::Channel.

The idiomatic way appears to be storing a single master Client in a struct and providing a function that returns a clone of it, since cloning clients is cheap.
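The shape of that pattern can be sketched without tonic. In the example below, FakeClient is a stand-in invented for illustration, not a tonic type: like a generated gRPC client, its send method takes &mut self, and cloning it only copies a cheap handle to shared machinery (here a std mpsc sender). OS threads stand in for async tasks, and this TritonClient is a simplified assumption, not the crate's actual struct.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Stand-in for a generated gRPC client (NOT a tonic type).
/// Sending needs `&mut self`; cloning only copies a cheap handle.
#[derive(Clone)]
struct FakeClient {
    tx: Sender<String>,
}

impl FakeClient {
    fn infer(&mut self, request: String) {
        // A real client would await a response; this sketch only enqueues.
        self.tx.send(request).expect("worker hung up");
    }
}

/// Holds one master client and hands out clones on demand.
struct TritonClient {
    master: FakeClient,
}

impl TritonClient {
    fn client(&self) -> FakeClient {
        self.master.clone()
    }
}

/// Spawn `tasks` threads that each send `per_task` requests through their
/// own clone, and return how many requests the shared receiver saw.
fn run(tasks: usize, per_task: usize) -> usize {
    let (tx, rx) = mpsc::channel();
    let triton = TritonClient { master: FakeClient { tx } };

    let handles: Vec<_> = (0..tasks)
        .map(|t| {
            let mut client = triton.client(); // each task owns its own clone
            thread::spawn(move || {
                for i in 0..per_task {
                    client.infer(format!("task {} request {}", t, i));
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("task panicked");
    }
    drop(triton); // drop the master sender so the receiver's iterator ends
    rx.iter().count()
}

fn main() {
    println!("worker received {} requests", run(4, 3));
}
```

No task ever needs a shared mutable reference: each one mutates only its own clone, while all clones feed the same underlying channel.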

A limitation of this approach could be that gRPC servers usually cap the number of concurrent streams they will multiplex over a single connection (100 is the figure most often quoted). See gRPC performance best practices.

tonic seems to have a default buffer size of 1024. Source: DEFAULT_BUFFER_SIZE channel/mod.rs

This might be useful eventually if you have multiple Triton pods and want to discover which ones are live and update the endpoint list: gRPC load balancing, GitHub.

It's not clear whether there's a connection pool under the hood there, or how they're able to connect to multiple servers.

Alternate Implementations/Inspiration

Triton-client-rs

Integration with DALI

DALI (Data Loading Library), Triton DALI backend

Dependencies

~18–28MB
~374K SLoC