1 unstable release: 0.1.0 (Oct 12, 2024)
TurboMetrics
A collection of video-related libraries and tools oriented toward performance and hardware acceleration, including :
- cudarse : General purpose (no ML) CUDA bindings for the Driver API, NPP and the Video Codec SDK.
- A workflow and its tools to develop CUDA kernels in Rust just as another crate in the workspace.
- A working ssimulacra2 implementation with CUDA.
- Utilities and foundational libraries for codec bitstream demuxing and statistics.
- Kernels for colorspace conversion and linearization.
Goal
This project started when I noticed my GPU sitting at 0% usage while my CPU was overloaded during video processing.
The strategy is to offload as much work as possible onto the GPU :
- Demux a video file on the CPU
- Decode the bitstream on hardware and keep the frame in CUDA memory
- Do any costly processing on the frames (IQA, postprocessing ...) using the GPU
- Get the results back to the CPU
In some instances, decoding the frame on the GPU is impossible (e.g. image formats), so decoded frames have to be streamed from the CPU. This reduces performance but should still be faster than full CPU processing if the frames can stay in GPU memory long enough.
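The four stages above can be sketched as a small generic driver. This is only an illustration of the data flow, not code from this repository: `Packet`, `DeviceFrame` and `run_pipeline` are hypothetical names, and the "device" frame is a plain `Vec<u8>` standing in for GPU memory so the sketch runs on any machine.

```rust
/// Compressed data for one frame, produced by the demuxer on the CPU.
/// (Hypothetical type, for illustration only.)
pub struct Packet(pub Vec<u8>);

/// Stand-in for a frame resident in GPU memory (hypothetical type).
/// A real pipeline would wrap a device pointer here, not a Vec.
pub struct DeviceFrame(pub Vec<u8>);

/// Drives the stages: demuxed packets -> decode -> process -> results.
/// Only the small per-frame result crosses back to the caller,
/// the frames themselves never leave the "device" side.
pub fn run_pipeline<I, D, P>(demuxed: I, mut decode: D, mut process: P) -> Vec<f64>
where
    I: IntoIterator<Item = Packet>,
    D: FnMut(Packet) -> DeviceFrame,
    P: FnMut(&DeviceFrame) -> f64,
{
    demuxed
        .into_iter()
        .map(|pkt| {
            // Decode and keep the frame in device memory.
            let frame = decode(pkt);
            // Run the costly processing and return only the score.
            process(&frame)
        })
        .collect()
}
```

The CPU fallback described above would only swap the `decode` closure for one that decodes on the CPU and uploads the frame; the rest of the driver stays the same.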
Subprojects
turbo-metrics
CLI to process a pair of videos or images and compute various metrics and statistics. Available here.
cudarse
codec-bitstream
Transform codec bitstream for feeding into GPU decoders. Also provides parsing for metadata like color information.
nvptx-core
Nightly-only helper library to write CUDA kernels in Rust. Acts as a kind of libstd for the nvptx64-nvidia-cuda target. Provides a math extension trait, based on bindings to libdevice, to replace the float math normally provided by std.
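To give an idea of what a math extension trait backed by libdevice can look like, here is a minimal sketch. The trait name `FloatMathExt` and its two methods are assumptions made for this example and are not taken from nvptx-core; `__nv_sqrtf` and `__nv_expf` are libdevice entry points. On any target other than nvptx64 the sketch falls back to std so it can be compiled and tried on the host.

```rust
/// Hypothetical extension trait; the real nvptx-core trait may differ
/// in name, scope and method set.
pub trait FloatMathExt {
    fn sqrt(self) -> Self;
    fn exp(self) -> Self;
}

// libdevice entry points, only declared when compiling for the GPU target.
#[cfg(target_arch = "nvptx64")]
unsafe extern "C" {
    fn __nv_sqrtf(x: f32) -> f32;
    fn __nv_expf(x: f32) -> f32;
}

impl FloatMathExt for f32 {
    // GPU build: forward to libdevice.
    #[cfg(target_arch = "nvptx64")]
    fn sqrt(self) -> f32 {
        unsafe { __nv_sqrtf(self) }
    }

    // Host build: use the inherent std method (sketch-only fallback).
    #[cfg(not(target_arch = "nvptx64"))]
    fn sqrt(self) -> f32 {
        f32::sqrt(self)
    }

    #[cfg(target_arch = "nvptx64")]
    fn exp(self) -> f32 {
        unsafe { __nv_expf(self) }
    }

    #[cfg(not(target_arch = "nvptx64"))]
    fn exp(self) -> f32 {
        f32::exp(self)
    }
}
```

Kernel code can then import the trait and call `x.sqrt()` as it would with std, which keeps the same source usable on both sides.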
nvptx-builder
Allows a crate to define a dependency on a nvptx crate and have it built with a single cargo build.
cuda-colorspace
Colorspace conversion CUDA kernels used in other crates.
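As a CPU-side reference for what a linearization kernel computes per sample, here is a sketch of the sRGB transfer function (IEC 61966-2-1). Using sRGB is an assumption for the example; the README does not list which transfer functions the kernels implement, and the function names below are hypothetical.

```rust
/// sRGB decoding: encoded value in [0, 1] -> linear light in [0, 1].
pub fn srgb_to_linear(v: f32) -> f32 {
    if v <= 0.04045 {
        v / 12.92
    } else {
        ((v + 0.055) / 1.055).powf(2.4)
    }
}

/// sRGB encoding: linear light in [0, 1] -> encoded value in [0, 1].
pub fn linear_to_srgb(l: f32) -> f32 {
    if l <= 0.0031308 {
        l * 12.92
    } else {
        1.055 * l.powf(1.0 / 2.4) - 0.055
    }
}

/// Linearize one 8-bit RGB pixel, channel by channel.
pub fn linearize_rgb8(pixel: [u8; 3]) -> [f32; 3] {
    pixel.map(|c| srgb_to_linear(c as f32 / 255.0))
}
```

A GPU kernel would apply the same per-sample function with one thread per pixel. A reference like this is mostly useful for checking kernel output against known values.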
ssimulacra2-cuda
An attempt at computing the ssimulacra2 metric with GPU acceleration, leveraging NPP and custom kernels written in Rust. Preliminary profiling shows that I'm terrible at writing GPU code that runs fast.
Reference implementation : https://github.com/cloudinary/ssimulacra2
vmaf
Bindings to libvmaf.
Prerequisites
This repository is particularly difficult to set up for a Rust project due to the dependencies on various vendor SDKs. You need to be patient and able to read error messages from builds.
Also, it uses a novel approach enabled by recent rustc developments to colocate CUDA kernels written in Rust within the same cargo workspace. This is very much bleeding edge, and the way the crates are linked together prevents publishing to crates.io. The only supported way to build any crate in this repo is by cloning the git repo.
Common
- 64-bit system.
- Nvidia GPU, exact requirement is unknown, but you should be safe with a GTX 10xx or later (codec support will vary with your GPU generation).
- CUDA 12.x (tested with 12.5 and 12.6, it might work with previous versions, I don't know)
- CUDA NPP (normally packaged with CUDA by default, but it's an optional component on Windows)
- Rust stable
- Rust nightly for the CUDA kernels (it should work with only a nightly toolchain and no stable)
- Various rustup components for the nightly channel :
  rustup +nightly target add nvptx64-nvidia-cuda
  rustup +nightly component add llvm-bitcode-linker
  rustup +nightly component add llvm-tools
- NVIDIA Video Codec SDK (need headers only on Linux, full sdk on Windows) with the NV_VIDEO_CODEC_SDK env var
- For the AMF backend : AMD AMF SDK headers
- (in progress) For the libvmaf bindings : libvmaf
- clang toolchain (that's for bindgen)
Windows
- Tested on Windows 10, but should work elsewhere.
- CUDA_PATH env var pointing to your CUDA install
- AMF_SDK_PATH env var pointing to your AMF SDK install
- NPP dlls can't be linked statically into the resulting binary and must be redistributed with any binary that depends on them.
Linux
- Tested on Fedora 41 and CachyOS with proprietary Nvidia drivers (I do not think CUDA works with Nouveau ?)
- CUDA_PATH env var is optional, by default it will look in /usr/local/cuda.
- AMF_SDK_PATH env var is optional, by default it will look in /usr/include/AMF as AMF headers were present in my system packages.
- I need to link to libstdc++ for NPP libraries, but it should be possible to use libc++ instead.
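The env var lookups described for Windows and Linux follow a simple "env var first, default path second" rule. A build script could resolve them roughly like this. It is a sketch and not the actual build code of this repo: `sdk_root`, `cuda_root` and `amf_root` are hypothetical helper names, while the env var names and Linux defaults are the ones listed above.

```rust
use std::env;
use std::path::PathBuf;

/// Resolve an SDK root from an env var, falling back to an optional default.
/// An env var that is set but empty is treated as unset.
pub fn sdk_root(var: &str, default: Option<&str>) -> Option<PathBuf> {
    env::var_os(var)
        .filter(|v| !v.is_empty())
        .map(PathBuf::from)
        .or_else(|| default.map(PathBuf::from))
}

/// CUDA install: CUDA_PATH, with /usr/local/cuda as the Linux default.
/// On other systems there is no default, so the env var is required.
pub fn cuda_root() -> Option<PathBuf> {
    let default = if cfg!(target_os = "linux") {
        Some("/usr/local/cuda")
    } else {
        None
    };
    sdk_root("CUDA_PATH", default)
}

/// AMF headers: AMF_SDK_PATH, with /usr/include/AMF as the Linux default.
pub fn amf_root() -> Option<PathBuf> {
    let default = if cfg!(target_os = "linux") {
        Some("/usr/include/AMF")
    } else {
        None
    };
    sdk_root("AMF_SDK_PATH", default)
}
```

A `None` result is the point where a build script would stop with a clear message naming the missing env var, rather than failing later inside bindgen or the linker.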
Support this project
There are various ways you can support development.
- File a detailed issue when you encounter a problem
- Support me through ko-fi or GitHub Sponsors
TODO ideas
The core libraries are getting solid. I plan to implement various tools to help the process of making encodes (except encoding itself) from pre-filtering to validation. In no particular order or priority :
Tools & workflows
- GUI with plots and interactive usage
- GUI for interactive inspection of error maps
- Hull generation, by running a command automatically (e.g. turbo-metrics --ssimulacra2 --hull --ref ref.png -- avifenc ref.png --crf @)
- Dynamic loading at runtime (e.g. optionally load ffmpeg libraries if present), this might be necessary to support many platform APIs with a single portable build.
Algorithms implementations
- XPSNR
- Butteraugli
- VMAF (using both libvmaf and a custom CUDA impl)
- CAMBI (banding detector present in libvmaf)
- Scene detection (histogram based should be easy)
- Scene detection like the one used in rav1e (not even sure that's possible on a GPU)
- Denoising algorithms (the usual ones in vapoursynth are fucking slow, maybe putting the whole processing chain on the GPU can help, needs more research)
- New ssimulacra2 implementation, without relying on NPP and with separate planes computations.
- NVflip
- Audio metrics ? I don't know much about those
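For the histogram-based scene detection idea in the list above, a CPU-side sketch could look like the following. It is one simple variant and not something implemented in this repo: it compares normalized 8-bit luma histograms of consecutive frames and flags a cut when their distance exceeds a threshold. All function names and the threshold value are assumptions made for this example.

```rust
/// 256-bin histogram of 8-bit luma samples.
pub fn luma_histogram(luma: &[u8]) -> [u32; 256] {
    let mut hist = [0u32; 256];
    for &y in luma {
        hist[y as usize] += 1;
    }
    hist
}

/// Half the L1 distance between the two normalized histograms.
/// 0.0 means identical distributions, 1.0 means no overlap at all.
pub fn histogram_distance(a: &[u32; 256], b: &[u32; 256]) -> f64 {
    let total_a: u64 = a.iter().map(|&v| v as u64).sum();
    let total_b: u64 = b.iter().map(|&v| v as u64).sum();
    if total_a == 0 || total_b == 0 {
        return 0.0;
    }
    let mut diff = 0.0;
    for i in 0..256 {
        let pa = a[i] as f64 / total_a as f64;
        let pb = b[i] as f64 / total_b as f64;
        diff += (pa - pb).abs();
    }
    diff / 2.0
}

/// Returns the indices of frames that start a new scene.
/// Each frame is given as its luma plane.
pub fn detect_scene_cuts(frames: &[Vec<u8>], threshold: f64) -> Vec<usize> {
    let hists: Vec<[u32; 256]> = frames.iter().map(|f| luma_histogram(f)).collect();
    hists
        .windows(2)
        .enumerate()
        .filter(|(_, w)| histogram_distance(&w[0], &w[1]) > threshold)
        .map(|(i, _)| i + 1)
        .collect()
}
```

On the GPU, the histogram is the only part that touches every pixel, so it is the natural candidate for a kernel. The distance and thresholding work on 256 values per frame and can stay on the CPU.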
Inputs
- Many distorted media (the reference is only decoded once)
- Region selection
- More video containers (mp4)
- Raw bitstreams
- More codecs (HEVC, VP8, VP9, VC1)
- Finish implementing useful colorspaces
- HDR support
- libavcodec input so everything is supported
- CPU decoder fallback
- Integrations with other tools ?
Outputs
- Plot output
Platform support
Currently, we're locked to Nvidia hardware. However, the problem at hand does not require CUDA or NVDEC specifically.
- Other hardware video decoding API.
- Other accelerated compute platforms (krnl, cubecl, Vulkan).
- libavcodec input might help a lot since everything is already implemented.
About video hardware acceleration
Processing videos efficiently is a two-part problem :
Video decoding
So you want a cross-platform way to decode videos on every possible platform ? Sadge. This is a mess: there are nearly as many different APIs as there are hardware vendors, OSes and GPU APIs.
Recap table :
API | Windows | Linux | Nvidia | Intel | AMD | AV1 | HEVC | AVC | MPEG2 | VC1 |
---|---|---|---|---|---|---|---|---|---|---|
NVDEC | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
VPL | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ||
AMF | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ||
DXVA | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ||||
Vulkan Video | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ||
VAAPI | 🟦vaon12 | ✅ | 🟦Nouveau | ✅ | ✅ | ✅ | ✅ | |||
VDPAU | ✅ | ✅ |
There is still the option to decode video on the CPU and stream frames to the GPU for computations. This is still faster than doing all processing on the CPU alone.
Compute
Your GPU will blow your CPU away on most image processing tasks. Processing frames on the GPU is the best thing that can be done for speed.
Recap table :
API | Windows | Linux | Intel | AMD | Nvidia | NVDEC | VPL | AMF | Vulkan Video | CPU-side Rust | GPU-side Rust |
---|---|---|---|---|---|---|---|---|---|---|---|
CUDA | ✅ | ✅ | 🟦ZLUDA | ✅ | ✅ | ✅ | ✅llvm ptx | ||||
Vulkan | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅Spir-V | |
OpenCL | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅Spir-V | ||||
ROCm/HIP | ✅ | ✅ | ✅ | ||||||||
WGPU | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅Spir-V |
From both those tables, it seems Vulkan and Vulkan Video are the way forward but well, it's Vulkan.