1 unstable release

0.1.0	Oct 12, 2024

Used in 3 crates (via cudarse-video)

MIT license

TurboMetrics

A collection of video-related libraries and tools oriented toward performance and hardware acceleration. It includes :

cudarse : General purpose (no ML) CUDA bindings for the Driver API, NPP and the Video Codec SDK.
A workflow and its tools to develop CUDA kernels in Rust just as another crate in the workspace.
A working ssimulacra2 implementation with CUDA.
Utilities and foundational libraries for codec bitstream demuxing and statistics.
Kernels for colorspace conversion and linearization.

Goal

This project started when I noticed my GPU sitting at 0% usage while my CPU was overloaded during video processing.

The strategy is to offload as much work as possible onto the GPU :

Demux a video file on the CPU
Decode the bitstream on hardware and keep the frame in CUDA memory
Do any costly processing on the frames (IQA, postprocessing ...) using the GPU
Get the results back to the CPU
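
The four steps above can be sketched as a small generic driver. This is only an illustration of the data flow, not the API of any crate in this repository: Packet, DeviceFrame and run_pipeline are hypothetical names, and the decoder and the processing step are passed in as plain closures.

```rust
/// Compressed data for one access unit, produced on the CPU by the demuxer.
/// (Hypothetical type, for illustration only.)
pub struct Packet(pub Vec<u8>);

/// Stand-in for a decoded frame. In a real pipeline this would wrap a
/// device allocation so the pixels never leave GPU memory.
/// (Hypothetical type, for illustration only.)
pub struct DeviceFrame {
    pub bytes: usize,
}

/// Runs the four steps: take demuxed packets, decode them, process each
/// frame, and collect one result per frame on the CPU.
pub fn run_pipeline<I, D, P>(packets: I, mut decode: D, mut process: P) -> Vec<f64>
where
    I: IntoIterator<Item = Packet>,
    D: FnMut(&Packet) -> Option<DeviceFrame>,
    P: FnMut(&DeviceFrame) -> f64,
{
    packets
        .into_iter()
        // A decoder may consume a packet without emitting a frame.
        .filter_map(|packet| decode(&packet))
        // Only one small number per frame travels back to the CPU.
        .map(|frame| process(&frame))
        .collect()
}
```

The point of the shape is that frames only exist between the decode and process steps, so a real implementation can keep them in device memory and copy back nothing but the per-frame results.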

In some cases it is impossible to decode the frames on the GPU (e.g. image formats), so decoded frames have to be streamed from the CPU. This reduces performance, but should still be faster than full CPU processing if the frames can stay in GPU memory long enough.

Subprojects

turbo-metrics

CLI to process a pair of videos or images and compute various metrics and statistics. Available here.

cudarse

Here

codec-bitstream

Transforms codec bitstreams for feeding into GPU decoders. Also provides parsing for metadata like color information.
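
As an example of the kind of transformation meant here, containers such as MP4 and MKV usually store H.264/HEVC NAL units with a length prefix, while hardware decoders commonly expect Annex B start codes. The sketch below is an assumption about what such a step can look like, not this crate's actual API: the function name is hypothetical and the length prefix is assumed to be 4 bytes (real streams declare the prefix size in their codec configuration).

```rust
/// Converts length-prefixed NAL units (assumed 4-byte big-endian sizes)
/// into an Annex B byte stream with 00 00 00 01 start codes.
/// Returns `None` if a length field is truncated or points past the input.
/// (Hypothetical helper, for illustration only.)
pub fn length_prefixed_to_annex_b(mut data: &[u8]) -> Option<Vec<u8>> {
    const START_CODE: [u8; 4] = [0, 0, 0, 1];
    let mut out = Vec::with_capacity(data.len());

    while !data.is_empty() {
        if data.len() < 4 {
            return None; // truncated length field
        }
        let (len_bytes, rest) = data.split_at(4);
        let len =
            u32::from_be_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]]) as usize;
        if rest.len() < len {
            return None; // NAL unit runs past the end of the buffer
        }
        let (nal, tail) = rest.split_at(len);
        // Replace the length prefix with a start code, keep the payload as-is.
        out.extend_from_slice(&START_CODE);
        out.extend_from_slice(nal);
        data = tail;
    }
    Some(out)
}
```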

nvptx-core

Nightly-only helper library to write CUDA kernels in Rust. Acts as a kind of libstd for the nvptx64-nvidia-cuda target. Provides a math extension trait, based on bindings to libdevice, to replace the math functions from std.

nvptx-builder

Allows a crate to define a dependency on a nvptx crate and have it built with a single cargo build.

cuda-colorspace

Colorspace conversion CUDA kernels used in other crates.
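
To illustrate what linearization means for these kernels, here is a CPU-side sketch of the standard sRGB transfer function. It is not taken from this crate: the function names are hypothetical, the real kernels run on the GPU, and they may cover other transfer functions as well.

```rust
/// Converts one gamma-encoded sRGB channel value in [0, 1] to linear light,
/// using the piecewise sRGB transfer function.
/// (Hypothetical helper, for illustration only.)
pub fn srgb_to_linear(encoded: f32) -> f32 {
    if encoded <= 0.04045 {
        // Linear segment near black.
        encoded / 12.92
    } else {
        // Power-law segment for the rest of the range.
        ((encoded + 0.055) / 1.055).powf(2.4)
    }
}

/// Linearizes an 8-bit sRGB pixel, channel by channel.
pub fn linearize_rgb8(pixel: [u8; 3]) -> [f32; 3] {
    pixel.map(|c| srgb_to_linear(f32::from(c) / 255.0))
}
```

On the GPU the same per-channel function would be applied by one thread per pixel, which is why this kind of work maps well to a kernel.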

ssimulacra2-cuda

An attempt at computing the ssimulacra2 metric with GPU acceleration, leveraging NPP and custom kernels written in Rust. Preliminary profiling shows that I'm terrible at writing GPU code that runs fast.

Reference implementation : https://github.com/cloudinary/ssimulacra2

vmaf

Bindings to libvmaf.

Prerequisites

This repository is particularly difficult to set up for a Rust project due to the dependencies on various vendor SDKs. You need to be patient and be able to read error messages from builds.

Also, it uses a novel approach enabled by recent rustc developments to colocate CUDA kernels written in Rust within the same cargo workspace. This is very much bleeding edge and the way the crates are linked together prevents publishing to crates.io. The only supported way to build any crate in this repo is by cloning the git repo.

Common

64-bit system.
Nvidia GPU, exact requirement is unknown, but you should be safe with a GTX 10xx or later (codec support will vary with your GPU generation).
CUDA 12.x (tested with 12.5 and 12.6, it might work with previous versions, I don't know)
CUDA NPP (normally packaged with CUDA by default, but it's an optional component on Windows)
Rust stable
Rust nightly for the CUDA kernels (it should work with only a nightly toolchain and no stable)

Various rustup components for the nightly channel :

rustup +nightly target add nvptx64-nvidia-cuda
rustup +nightly component add llvm-bitcode-linker
rustup +nightly component add llvm-tools

NVIDIA Video Codec SDK (headers only on Linux, full SDK on Windows), located through the NV_VIDEO_CODEC_SDK env var
For the AMF backend : AMD AMF SDK headers
(in progress) For the libvmaf bindings : libvmaf
clang toolchain (that's for bindgen)

Windows

Tested on Windows 10, but should work elsewhere.
CUDA_PATH env var pointing to your CUDA install
AMF_SDK_PATH env var pointing to your AMF SDK install
NPP DLLs can't be linked statically into the resulting binary and must be redistributed with any binary that depends on them.

Linux

Tested on Fedora 41 and CachyOS with proprietary Nvidia drivers (I do not think CUDA works with Nouveau ?)
CUDA_PATH env var is optional, by default it will look in /usr/local/cuda.
AMF_SDK_PATH env var is optional, by default it will look in /usr/include/AMF as AMF headers were present in my system packages.
I need to link to libstdc++ for NPP libraries, but it should be possible to use libc++ instead.
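
The optional env vars with a default location can be pictured as the lookup below. This is a sketch of the behavior described above, not the actual build script code: the function names are hypothetical, and treating an empty variable as unset is an assumption.

```rust
use std::env;
use std::path::PathBuf;

/// Returns the directory named by the env var `var`, or `default` when the
/// variable is unset. An empty value is treated as unset (an assumption).
/// (Hypothetical helper, for illustration only.)
pub fn sdk_path(var: &str, default: &str) -> PathBuf {
    match env::var_os(var) {
        Some(value) if !value.is_empty() => PathBuf::from(value),
        _ => PathBuf::from(default),
    }
}

/// CUDA install directory on Linux, using the default documented above.
pub fn cuda_path() -> PathBuf {
    sdk_path("CUDA_PATH", "/usr/local/cuda")
}

/// AMF headers directory on Linux, using the default documented above.
pub fn amf_sdk_path() -> PathBuf {
    sdk_path("AMF_SDK_PATH", "/usr/include/AMF")
}
```

If your CUDA or AMF install lives elsewhere, exporting the corresponding variable before running cargo build is enough to override the default.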

Support this project

There are various ways you can support development.

File a detailed issue when you encounter a problem
Support me through ko-fi or GitHub Sponsors

TODO ideas

The core libraries are getting solid. I plan to implement various tools to help the process of making encodes (except encoding itself) from pre-filtering to validation. In no particular order or priority :

Tools & workflows

GUI with plots and interactive usage
GUI for interactive inspection of error maps
Hull generation, by running a command automatically (e.g. turbo-metrics --ssimulacra2 --hull --ref ref.png -- avifenc ref.png --crf @)
Dynamic loading at runtime (e.g. optionally load ffmpeg libraries if present), this might be necessary to support many platform APIs with a single portable build.

Algorithm implementations

XPSNR
Butteraugli
VMAF (using both libvmaf and a custom CUDA impl)
CAMBI (banding detector present in libvmaf)
Scene detection (histogram based should be easy)
Scene detection like the one used in rav1e (not even sure that's possible on a GPU)
Denoising algorithms (the usual ones in vapoursynth are fucking slow, maybe putting the whole processing chain on the GPU can help, needs more research)
New ssimulacra2 implementation, without relying on NPP and with separate per-plane computations.
NVflip
Audio metrics ? I don't know much about those
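
For the histogram-based scene detection idea in the list above, a minimal CPU-side sketch could look like this. Nothing here exists in the repository yet: the function names, the 256-bin luma histogram and the threshold are all assumptions chosen for illustration.

```rust
/// Builds a normalized 256-bin histogram of an 8-bit luma plane.
/// (Hypothetical helper, for illustration only.)
pub fn luma_histogram(luma: &[u8]) -> [f32; 256] {
    let mut counts = [0u32; 256];
    for &value in luma {
        counts[value as usize] += 1;
    }
    // Avoid dividing by zero on an empty plane.
    let total = luma.len().max(1) as f32;
    counts.map(|count| count as f32 / total)
}

/// Half the L1 distance between two normalized histograms:
/// 0.0 for identical distributions, 1.0 for fully disjoint ones.
pub fn histogram_distance(a: &[f32; 256], b: &[f32; 256]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum::<f32>() / 2.0
}

/// Returns the indices of frames whose histogram differs from the previous
/// frame by more than `threshold`, i.e. the first frame of each new scene.
pub fn detect_scene_cuts(frames: &[Vec<u8>], threshold: f32) -> Vec<usize> {
    let histograms: Vec<[f32; 256]> = frames.iter().map(|f| luma_histogram(f)).collect();
    histograms
        .windows(2)
        .enumerate()
        .filter(|(_, pair)| histogram_distance(&pair[0], &pair[1]) > threshold)
        .map(|(index, _)| index + 1)
        .collect()
}
```

A raw per-bin comparison like this reacts strongly to small brightness shifts, so a real implementation would likely use coarser bins or smoothing. On the GPU, the histogram itself is the costly part and the comparison stays trivial.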

Inputs

Many distorted media (the reference is only decoded once)
Region selection
More video containers (mp4)
Raw bitstreams
More codecs (HEVC, VP8, VP9, VC1)
Finish implementing useful colorspaces
HDR support
libavcodec input so everything is supported
CPU decoder fallback
Integrations with other tools ?

Outputs

Plot output

Platform support

Currently, we're locked to Nvidia hardware. However, the problem at hand does not require CUDA or NVDEC specifically.

Other hardware video decoding APIs.
Other accelerated compute platforms (krnl, cubecl, Vulkan).
libavcodec input might help a lot since everything is already implemented.

About video hardware acceleration

Processing videos efficiently is a two-part problem :

Video decoding

So you want a cross-platform way to decode videos on every possible platform ? Sadge. This is a mess: there are nearly as many different APIs as there are hardware vendors, operating systems and GPU APIs.

Recap table :

API	Windows	Linux	Nvidia	Intel	AMD	AV1	HEVC	AVC	MPEG2	VC1
NVDEC	✅	✅	✅	❌	❌	✅	✅	✅	✅	✅
VPL	✅	✅	❌	✅	❌	✅	✅	✅
AMF	✅	✅	❌	❌	✅	✅	✅	✅
DXVA	✅		✅	✅	✅			✅	✅
Vulkan Video	✅	✅	✅	✅	✅	✅	✅	✅
VAAPI	🟦vaon12	✅	🟦Nouveau	✅		✅	✅	✅
VDPAU		✅	✅

There is still the option to decode video on the CPU and stream frames to the GPU for computations. This is still faster than doing all processing on the CPU alone.

Compute

Your GPU will blow your CPU away on any image processing task. Processing frames on the GPU is the best thing that can be done for speed.

Recap table :

API	Windows	Linux	Intel	AMD	Nvidia	NVDEC	AMF	Vulkan Video	CPU-side Rust	GPU-side Rust
CUDA	✅	✅		🟦ZLUDA	✅	✅			✅	✅llvm ptx
Vulkan	✅	✅	✅	✅	✅	✅	✅	✅	✅	✅Spir-V
OpenCL	✅	✅	✅	✅	✅				✅	✅Spir-V
ROCm/HIP	✅	✅		✅
WGPU	✅	✅	✅	✅	✅			✅	✅	✅Spir-V

From both those tables, it seems Vulkan and Vulkan Video are the way forward but well, it's Vulkan.
