1 unstable release

0.1.0 Feb 28, 2024

#166 in Rendering

BSD-2-Clause

57KB
901 lines

WebGPU Radix Key-Value Sort

This package implements a GPU version of radix sort. A good introduction to general purpose radix sort can be found here: http://www.codercorner.com/RadixSortRevisited.htm

The GPU radix sort implemented here is a re-implementation of the Vulkan radix sort found in the fuchsia repos: https://fuchsia.googlesource.com/fuchsia/+/refs/heads/main/src/graphics/lib/compute/radix_sort/.

Currently only the sorting for 32-bit key-value pairs is implemented. It can be used to sort unsigned integers and non negative float numbers. See Limitations for more details. The keys are sorted in ascending order.

It was originally implemented for our 3D Gaussian Splatting Renderer to sort splats according to their depth in real time. It can be seen in action in this web demo.

Example

// find best subgroup size
let subgroup_size = guess_workgroup_size(&device, &queue).await.unwrap();
let sorter = GPUSorter::new(&device, subgroup_size);

// setup buffers to sort 100 key-value pairs
let n = 100;
let sort_buffers = sorter.create_sort_buffers(&device, NonZeroU32::new(n).unwrap());

let keys_scrambled: Vec<u32> = (0..n).rev().collect();
let values_scrambled:Vec<u32> = keys_scrambled.clone();

let mut encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor {label: None});

upload_to_buffer(
    &mut encoder,
    &sort_buffers.keys(),
    &device,
    keys_scrambled.as_slice(),
);
upload_to_buffer(
    &mut encoder,
    &sort_buffers.values(),
    &device,
    values_scrambled.as_slice(),
);

// sorter.sort(&mut encoder, &sort_buffers);
sorter.sort(&mut encoder,&queue,&sort_buffers,None);
queue.submit([encoder.finish()]);

// key and value buffer is now sorted.

Indirect dispatching is also supported. See examples/sort_indirect.rs;

Benchmarks

To measure the performance we sort the key-value pairs 1000 times and report the average duration per run. Measurements were performed for different number of pairs. Take a look at benches/sort.rs for more details.

Device 10k 100k 1 Million 8 Million 20 Million
NVIDIA RTX A5000 108.277µs 110.179µs 317.191µs 1.641699ms 3.980834ms
AMD Radeon R9 380 803.527µs 829.003µs 2.76469ms 18.81558ms 46.12854ms
Intel HD Graphics 4600 790.382µs 4.12287ms 38.7421ms 295.2937ms 732.3900ms

Limitations

This sorter comes with a number of limitations that are explained in the following.

Subgroups

This renderer makes use of subgroups to reduce synchronization and increase performance. Unfortunately subgroup operations are not supported bei WebGPU/wgpu right now.

To overcome this issue we "guess" the subgroup size by trying out different subgroups and pick the largest one that works (see utils::guess_workgroup_size). This works in almost all cases but can fail because the subgroup size can change over time. Once subgroups are support this will be fixed. Status can be found here.

Floating Point Numbers

The sorting algorithm interprets the values as integers and sorts the keys in ascending order. Non-negative float values can be interpreted as unsigned integers without affecting the ordering. Therefore this sorter can be used to sort 32-bit float keys. Note that NaN and Inf values lead to unexpected results as theses are interpreted as integers as well. An example for sorting float values can be found here.

Dependencies

~3–34MB
~524K SLoC