2 unstable releases

0.2.0 Jan 23, 2024
0.1.0 May 27, 2023

#772 in Algorithms

33 downloads per month
Used in nblast

Apache-2.0 OR MIT

86KB
1.5K SLoC

Bosque

"The clearest way into the Universe is through a forest wilderness." – John Muir

What is it?

Bosque (bohs-keh, spanish for "forest") is a fast, parallel in-place kDTree Library available for Rust, C, Python, and Julia. This library is intended to be an improvement over FNNTW. It achieves its high performance through similar avenues such as parallel builds and queries, but achieves superior performance by mutating the original data (hence, inplace kDTree). By using a select algorithm to partition the original dataset into buckets that are contiguous within its original allocated buffer, bosque acheives incredible cache locality. This same algorithmic choice, along with a round robin split dimension, results in a kDTree that can be stored with no additional metadata; after building a tree, you can always get the left and right side of a tree or subtree by spliting at the median with the split dimension being how many times you've split the tree in half so far. Furthermore, this implies that pre-built trees can be zero-copy deserialized (on and across systems with the same endianness) in a matter of nanoseconds. If only a few queries need to be done, memory-mapped I/O can be used to limit the total number of bytes read from disk, resulting in a further speedup.

At present, since cosmology is the primary use case, the library only supports 3D data. However, it can easily be generalized. The author intends to make the code generic over all dimensions, but feel free to submit an issue + pull request if you want this ASAP!


Getting started

Because this is a Rust library, we show an example in Rust here. This example can be found at examples/rust/simple.rs. In the examples directory, there are examples for C, Python, and Julia.

// uses the `rand` crate.

fn main() {
    // Initialize some data in [0, 1]^3
    let mut pos: Vec<[f64; 3]> = (0..100_000)
        .map(|_| [(); 3].map(|_| rand::random::<f64>()))
        .collect();

    // Build tree in-place!
    bosque::tree::build_tree(&mut pos);

    // Query the tree
    let query = [0.5, 0.5, 0.5];
    let (dist_metric, id) = bosque::tree::nearest_one(&pos, &query);

    // Note that there is a 'sqrt-dist' feature for
    // returning square euclidean or euclidean distance!
    let dist = if cfg!(feature = "sqrt-dist") {
        dist_metric
    } else {
        dist_metric.sqrt()
    };

    println!("closest point is {dist:.2e} units away, corresponding to data point #{id}");
}

Performance

Using the same benchmark as in FNNTW, we extend the list. We exclude the original kiddo benchmark, as there is now a much faster version 2 of kiddo -- which we highly recommend for use cases that cannot be in-place or that require other query functionality e.g. nearest_within which bosque does not offer.

From FNNTW with minor modifications:

We use

  • A mock dataset of 100,000 uniform random points in the unit cube.
  • A query set of 1,000,000 uniform random points in the unit cube.

Over 100 realizations of the datasets and query points, the following results are obtained for the average build and 1NN query times on an AMD EPYC 7502P using 48 threads. The results are sorted by the combined build and query time.

Code Build (ms) Query (ms) Total (ms)
Bosque 3.5 10 13.5
FNNTW 12 22 34
pykdtree (python) 12 35 47
nabo-rs (rust) 25 30 55
Scipy's cKDTree (python) 31 38 69

Dependencies

~0–1.9MB
~25K SLoC