#thread-pool #spmc #channel

burst-pool

An SPMC channel optimised for bursts of activity

6 releases (3 breaking)

Uses old Rust 2015

0.5.1 Nov 20, 2017
0.5.0 Nov 17, 2017
0.4.0 Nov 2, 2017
0.3.1 May 16, 2017
0.2.2 May 4, 2017

#15 in #spmc

MIT/Apache

17KB
127 lines

A thread pool optimised for bursts of activity.

Designed for the following use-case: A single thread produces work which must then be performed by a pool of worker threads. The work is produced infrequently, but in bursts. Under normal operation, therefore, the threads in the pool sleep until some event requires many of them to be suddenly woken at once. Those threads perform some work before going back to sleep again.

See the documentation for details.


lib.rs:

An SPMC channel with low, consistent latency, designed for cases where values arrive in a batch and must be processed now-or-never.

burst-pool differs from a regular spmc channel in one major way: sending fails when all workers are currently busy. This lets the sender distribute as much work as possible, and then handle overflow work differently (probably by throwing it away).
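For illustration, overflow handling might look like the sketch below. The `send_burst` helper is hypothetical, and it assumes enqueue hands the rejected payload back as an Option (per "you get your message back" in the Usage section):

use burst_pool::Sender;

// Sketch: distribute a burst, discarding whatever doesn't fit.
// Assumes `enqueue` returns the boxed payload when no mailbox is free.
fn send_burst(sender: &mut Sender<String>, batch: Vec<String>) {
    for msg in batch {
        if let Some(overflow) = sender.enqueue(Box::new(msg)) {
            // All workers are busy: throw the overflow work away.
            drop(overflow);
        }
    }
    // A single wakeup notifies every worker whose mailbox was filled.
    sender.wake_all();
}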

Performance

tl;dr: Same ballpark as spmc. Best when the number of workers is less than the number of cores.

I'm using the excellent spmc crate as a benchmark. spmc provides a reliable unbounded channel, and when the pool is overloaded it has the normal "queueing" semantics (as opposed to burst-pool's "return-to-sender" semantics). The metric we care about is enqueue→recv latency. Below are some kernel density estimates. The "n/m" notation in the labels means "n messages sent to m workers".

Here's my interpretation of the numbers:

  • Keeping the number of workers fixed at 3 and varying the number of messages sent appears to have no effect on burst-pool's performance. spmc, on the other hand, compares favourably when pool utilisation is low (1/3), becoming gradually worse as the number of messages increases (2/3), until it is roughly the same as burst-pool in the saturating case (3/3).
  • When the pool is overloaded (4/3, 5/3) we see the difference in the semantics of the two crates. In the burst-pool benchmark, only 3 messages are sent and 2 are discarded. In the spmc benchmark, all 5 messages are sent, giving a bimodal distribution with the second peak around 1000 μs. (This seems to mess up gnuplot.)
  • Comparing the two saturating benchmarks (3/3, 6/6) we see another difference: performance is roughly the same while the number of workers stays below the number of available cores (3/3), but burst-pool degrades more badly once the workers outnumber the cores (6/6).
  • The response time is generally a bit more reliable for burst-pool than spmc (i.e. the variance is lower), with the exception of the 1/3 benchmark.

These observations are consistent with the expected performance characteristics. (See Design, below.)

Run cargo bench to make some measurements on your own machine; there's also a gnuplot script which you can use to visualise the results. (Pro tip: If your results are significantly worse than those above, your kernel might be powering down CPU cores too eagerly. If you care about latency more than battery life, consider setting max_cstate = 0.)

Usage

The API requires two calls to actually send a message to a receiver: first you place your message in the mailbox of one of your workers (enqueue); then you wake the worker up, notifying it that it should check its mailbox (wake_all). A call to enqueue will only succeed if the sender can find an empty mailbox to put the message in (i.e., at least one of your workers must currently be blocking on a call to recv). If enqueue fails, you get your message back. If it succeeds, the message will be received by exactly one receiver the next time you call wake_all.

use burst_pool::{Sender, Receiver};
use std::time::Duration;

// std::thread::sleep_ms is deprecated; a small stand-in:
fn sleep_ms(ms: u64) { std::thread::sleep(Duration::from_millis(ms)); }
// Create a handle for sending strings
let mut sender: Sender<String> = Sender::new();

// Create a handle for receiving strings, and pass it off to a worker thread.
// Repeat this step as necessary.
let mut receiver: Receiver<String> = sender.mk_receiver();
let th = std::thread::spawn(move || {
    loop {
        let x = receiver.recv().unwrap();
        println!("{}", x);
    }
});

// Give the worker thread some time to spawn
sleep_ms(10);

// Send a string to the worker and unblock it
sender.enqueue(Box::new(String::from("hello")));
sender.wake_all();
sleep_ms(10);       // wait for it to process the first string

// Send another string
sender.enqueue(Box::new(String::from("world!")));
sender.wake_all();
sleep_ms(10);

// Drop the send handle, signalling the worker to shutdown
std::mem::drop(sender);
th.join().unwrap_err();  // the worker panicked on RecvError::Orphaned

Design

Each receiver has a "slot" which is either empty, blocked, or contains a pointer to some work. Whenever a receiver's slot is empty, it goes to sleep by polling an eventfd. When issuing work, the sender goes through the slots round-robin, placing work in the empty ones. It then signals the eventfd, waking up all sleeping receivers. If a receiver wakes up and finds work in its slot, it takes the work and blocks its slot. If a receiver wakes up and finds its slot is still empty, it goes back to sleep. When a receiver has finished processing the work, it unblocks its slot.
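As a rough sketch (illustrative only; the crate's real slots are pointer-based, and the eventfd wakeup is not shown), the per-receiver slot behaves like a three-state machine:

// Illustrative model of the slot state machine described above;
// not the crate's actual representation.
enum Slot<T> {
    Empty,        // receiver is asleep; mailbox free
    Full(Box<T>), // sender placed work here
    Blocked,      // receiver took the work and is processing it
}

impl<T> Slot<T> {
    // Sender side: fill an empty slot, or hand the work back.
    fn try_fill(&mut self, work: Box<T>) -> Option<Box<T>> {
        match self {
            Slot::Empty => {
                *self = Slot::Full(work);
                None
            }
            _ => Some(work),
        }
    }

    // Receiver side, on wakeup: take the work and block the slot,
    // or restore the previous state and go back to sleep.
    fn try_take(&mut self) -> Option<Box<T>> {
        match std::mem::replace(self, Slot::Blocked) {
            Slot::Full(work) => Some(work),
            other => {
                *self = other;
                None
            }
        }
    }
}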

This design means that we expect enqueue→recv latencies to be independent of the number of payloads sent. However, we expect latency to become much worse as soon as there are more worker threads than cores available. The benchmark results are consistent with these expectations.

Portability

This crate is Linux-only.

Dependencies

~2MB
~38K SLoC