9 releases (breaking)
Version | Released |
---|---|
0.12.0 | Oct 21, 2024 |
0.11.1 | Aug 12, 2024 |
0.5.2 | Jan 29, 2021 |
0.4.0 | Jan 18, 2021 |
0.1.1 | Aug 20, 2020 |
HIDEFIX
This Rust and Python library provides an alternative reader for HDF5 files, and for NetCDF4 files (which use HDF5), that supports concurrent access to data. This is achieved by building an index of the chunks, which allows a thread to use many file handles to read the file. The original (native) HDF5 library is used to build the index, but once the index has been created the native library is no longer needed. The index can be serialized to disk so that the file does not have to be indexed again.
In Rust:
use hidefix::prelude::*;
let idx = Index::index("tests/data/coads_climatology.nc4").unwrap();
let mut r = idx.reader("SST").unwrap();
let values = r.values::<f32>(None, None).unwrap();
println!("SST: {:?}", values);
or with Python using Xarray:
import xarray as xr
import hidefix
ds = xr.open_dataset('file.nc', engine='hidefix')
print(ds)
See the example for how to use hidefix for regular, parallel or concurrent reads.
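The indexing idea described above can be sketched with the standard library alone. This is a minimal illustration and not the hidefix implementation: `ChunkEntry`, `write_and_index`, `read_chunk` and `read_all_concurrently` are hypothetical names, the chunks are stored uncompressed and back to back, and the index is built while writing the file rather than by the native HDF5 library.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;
use std::thread;

/// Hypothetical index entry: where one chunk lives in the file.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ChunkEntry {
    /// Byte offset of the chunk from the start of the file.
    pub addr: u64,
    /// Stored size of the chunk in bytes.
    pub size: u64,
}

/// Write `chunks` back to back into `path` and return an index over them.
/// This stands in for the indexing step; no HDF5 structures are involved.
pub fn write_and_index(path: &Path, chunks: &[Vec<u8>]) -> io::Result<Vec<ChunkEntry>> {
    let mut file = File::create(path)?;
    let mut index = Vec::with_capacity(chunks.len());
    let mut addr = 0u64;
    for chunk in chunks {
        file.write_all(chunk)?;
        index.push(ChunkEntry { addr, size: chunk.len() as u64 });
        addr += chunk.len() as u64;
    }
    file.flush()?;
    Ok(index)
}

/// Read one chunk through a file handle that belongs to the caller only,
/// so no state is shared between readers.
pub fn read_chunk(path: &Path, entry: ChunkEntry) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(entry.addr))?;
    let mut buf = vec![0u8; entry.size as usize];
    file.read_exact(&mut buf)?;
    Ok(buf)
}

/// Read every chunk concurrently, one scoped thread (and one file handle) per chunk.
/// The result preserves the order of `index`.
pub fn read_all_concurrently(path: &Path, index: &[ChunkEntry]) -> io::Result<Vec<Vec<u8>>> {
    thread::scope(|s| {
        let handles: Vec<_> = index
            .iter()
            .map(|&entry| s.spawn(move || read_chunk(path, entry)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("reader thread panicked"))
            .collect()
    })
}
```

Because the index already holds the address and size of every chunk, each reader only needs to open, seek and read. No lock is shared between the threads in this sketch.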
Motivation
The HDF5 library requires internal locks to be thread-safe, since it relies on internal buffers which cannot be safely read or written from multiple threads. In effect, multi-threaded applications end up performing sequential reads while competing for the locks. The threads also appear to cause each other trouble, perhaps by dropping cached chunks which other threads still need. The library can be used safely from different processes, but that potentially requires much more overhead than multi-threaded or asynchronous code.
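The serializing effect of a single library-wide lock can be shown with a generic sketch. It does not model HDF5 internals: the `Mutex` stands in for the library lock, `thread::yield_now()` stands in for the actual read, and `max_in_flight_with_global_lock` is a hypothetical helper that reports how many reads were ever in progress at the same time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Run `threads` workers that each perform `reads` simulated reads while
/// holding one shared lock. Returns the highest number of reads that were
/// in flight at the same moment.
pub fn max_in_flight_with_global_lock(threads: usize, reads: usize) -> usize {
    let lock = Mutex::new(()); // stand-in for a library-wide lock
    let in_flight = AtomicUsize::new(0);
    let max_seen = AtomicUsize::new(0);

    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..reads {
                    // Every read must take the same lock first.
                    let _guard = lock.lock().unwrap();
                    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                    max_seen.fetch_max(now, Ordering::SeqCst);
                    thread::yield_now(); // stand-in for the actual read
                    in_flight.fetch_sub(1, Ordering::SeqCst);
                }
            });
        }
    });

    max_seen.load(Ordering::SeqCst)
}
```

However many threads are started, the function never reports more than one read in flight, since the counter is only touched while the lock is held. The threads are concurrent in name only and additionally pay for contending on the lock.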
Some basic benchmarks
hidefix is intended to perform better when concurrent reads are made either to the same dataset, to the same file or to different files from a single process. In basic benchmarks, standard sequential reads perform on par with or slightly better than the native HDF5 library (through its Rust bindings). Where hidefix shines is when multiple threads in the same process try to read, in any way, from an HDF5 file simultaneously.
This simple benchmark tries to read a small dataset sequentially or concurrently using the cached reader from hidefix and the native reader from HDF5. The dataset is chunked, shuffled and compressed (using gzip):
$ cargo bench --bench concurrency -- --ignored
test shuffled_compressed::cache_concurrent_reads ... bench: 15,903,406 ns/iter (+/- 220,824)
test shuffled_compressed::cache_sequential ... bench: 59,778,761 ns/iter (+/- 602,316)
test shuffled_compressed::native_concurrent_reads ... bench: 411,605,868 ns/iter (+/- 35,346,233)
test shuffled_compressed::native_sequential ... bench: 103,457,237 ns/iter (+/- 7,703,936)
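To make the numbers above easier to compare, the ratios between them can be computed directly. The constants below are copied from the benchmark output; the `speedup` helper and the constant names are only illustrative.

```rust
/// Timings in ns/iter, copied from the benchmark output above.
pub const CACHE_CONCURRENT_NS: u64 = 15_903_406;
pub const CACHE_SEQUENTIAL_NS: u64 = 59_778_761;
pub const NATIVE_CONCURRENT_NS: u64 = 411_605_868;
pub const NATIVE_SEQUENTIAL_NS: u64 = 103_457_237;

/// How many times slower `baseline_ns` is than `candidate_ns`.
/// Values above 1.0 mean the candidate is faster.
pub fn speedup(baseline_ns: u64, candidate_ns: u64) -> f64 {
    baseline_ns as f64 / candidate_ns as f64
}

/// Print the ratios discussed in the text.
pub fn print_summary() {
    println!(
        "cache: concurrent vs sequential    {:.2}x",
        speedup(CACHE_SEQUENTIAL_NS, CACHE_CONCURRENT_NS)
    );
    println!(
        "sequential: cache vs native        {:.2}x",
        speedup(NATIVE_SEQUENTIAL_NS, CACHE_SEQUENTIAL_NS)
    );
    println!(
        "concurrent: cache vs native        {:.2}x",
        speedup(NATIVE_CONCURRENT_NS, CACHE_CONCURRENT_NS)
    );
    println!(
        "native: sequential vs concurrent   {:.2}x",
        speedup(NATIVE_CONCURRENT_NS, NATIVE_SEQUENTIAL_NS)
    );
}
```

In this run the cached reader is roughly 3.8 times faster when reading concurrently than sequentially, and roughly 26 times faster than the native reader under concurrent reads. The native reader is about 4 times slower when used concurrently than sequentially. These figures come from a single benchmark on a small dataset and should not be read as general results.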
Inspiration and other projects
This work is based in part on the DMR++ module of the OPeNDAP Hyrax server. The zarr format does something similar, and the same approach has been tested out on HDF5 as well.