1 unstable release

0.1.1	Jan 16, 2024

#1075 in Data structures

Apache-2.0 OR MIT

210KB
5.5K SLoC

`bstree-file-readonly`

Make and Query read-only binary-search tree file supporting billions of entries in files of tens of GB.

About

Immutable implicit naive Binary Search Tree structure stored in a file.

The tree structure (possibly larger than the available RAM) is created at once using bulk-loading. It is then possible to perform queries on it (nn query, knn query, range query, ...), but not to update it.

It has been developed for (with a more general usage than) static astronomical catalogues.

The datastructure is implicit: it is basically a flat array of entries ordered in a pre-defined way depending on a few parameters like the number of elements in the tree, the size of both the L1 and the disk caches.

The simple design inputs are:

a metadata part followed by the data part
data part as compact as possible, but without compression
=> hence the choice of an implicit structure with an unbalanced rightmost part of the tree

Remark: I do not claim this is the best possible structure, it is a quite naive implementation by a non-expert, rushing to have something working among other duties; any feedback welcome.

CLI

Contains 3 CLI tools:

mkbst: to MaKe a Binary Search Tree file
qbst: to Query a Binary Search Tree file
genfile: to generate files with random values for testing purposes

Purpose

Perform fast queries on a single catalogue column. The binary-search tree basically stores both values and OIDs (row indices).

In input, the tools takes an identifiers (which can be an implicit sequential number) and a value. The indexation is made on the values. Queries return entries which basically are tuples containing an identifier and value couple.

Creation algorithm

Although the first step is an external k-way merge sort, the final file is not ordered sequentially. It consists in a sequence of binary search tree blocks. There is two levels of blocks:

blocks fitting in the L1 cache
groups of blocks fitting in the disk cache The full tree is not balanced:
it is made of a main balance tree
plus a rightmost sub-tree recursively consisting in
- a main balanced tree +plus a rightmost sub-tree... The tree has 0 unused byte.

Disclaimer

Main functionalities are complete (building and querying), but all parts may not necessarily be production ready: more testing is needed (please report any bug).

However, this code is already in production in VizieR, mainly to index large catalogues identifiers.

For performances purposes, the code makes a large use of monomorphization (no dynamic dispatch at all!). It leads to:

very long compilation time (1min/10min in debug/release mode)
large binaries:
- mkbst (tree creation) is about 9/65 MB in release/debug mode
- qbst (tree query) is about 29/116 MB in release/debug mode

Other tools

For a larger project that may fulfill the need (and more), see:

sled and this paper

Info

In the while project, we talk about key/value pairs. Although it may be counter-intuitive, we actually index the value (e.g. a magnitude, which is not necessarily unique) and we return the associated key (e.g. a unique record number in a table).

A few ideas:

common use cases:
- index a unique identifier (value) and return a recno (key)
- index magnitudes (value) and return recnos (key)
to get the position of a Gaia source from its identifier:
index the unique identifier (value) and return formatted JNAMEs (string key).
for fast magnitude based density maps: index a magnitude (value) and return the order 12 healpix index (key).
index CSV rows using as key row starting byte offset

Install

The standard way to install the mkbst, qbst and genfile binaries is:

install rust see here, possibly removing --tlsv1.2 in the command line
fork and download this repository
type cargo install --path . from the downloaded directory (can take ~1-2min)
WARNING: by default only a subset of (key, value) pair is available. For all possibilities, use cargo install --path . --features "all" See Cargo.toml for the list of features.

Fast compilation to test for a two key/value datatype couples, e.g. (unsigned integer/unsigned integer) and (unsigned integer/float):

cargo build --release --no-default-features --features u32_u32,u32_f32 --bin qbst

Full compilation (can tae 10min!):

cargo install --all-features --path .

Example

Generate data (deprecated)

Bash script mkseq.bash to generate a simple sequence of value from 0 to n:

#!/bin/bash

declare -rx END=$1

echo "val"

COUNTER=0
until [ $COUNTER -ge ${END} ]; do
  echo $COUNTER
  let COUNTER+=1
done

Remark: one can use the shuf command (first removing and then adding val) to shuffle the input lines.

Generate data

The genfile tool generates simple files to test the BSTree code. Example:

genfile 10000000000 randf64 | mkbst -h test.10b.randf64 --id-type u5 --val-type f4

Build a Binary Search Tree on 10 billion single precision floats having values in [0.0, 1.0).

Generate a very simple sequential file containing integers with both the id and value columns:

genfile -o 10000000000 out.csv seqint

Remarks:

the purpose of a same sequential number for both the identifier and the value is to test that a value is associated with the correct identifier.
one can use the shuf command (first removing and then adding val) to shuffle the input lines.

Create a tree

Create a simple tree containing 2 billion entries consisting of tuples having the same identifiers (a row number) and value (the row number):

./mkseq.bash 2000000000 | ./mkbst -h bigtest --id-type u4 --val-type u4

Both identifiers and values are stored on regular 32bit unsigned integers.

Perform queries

Perform queries:

time qbst bigtest.bstree get --value 256984
time qbst bigtest.bstree knn --value 69853145 -k 10
time qbst bigtest.bstree range --limit 10000 --from 25639 --to 250000
time qbst bigtest.bstree range --cout --from 25639 --to 250000

Test on 10 million random f32 values

Generate 10 million random f64 and create a bstree storing id on 32 bit integers and value on 32bit floats

genfile 10000000 randf64 | mkbst -h  test_10m --id-type u4 --val-type f4

Look at the nearest value from 0.5

time qbst test_10m.bstree nn value 0.5

Look at the 10 nearest values from 0.2 (the result is ordered by distance to 0.2)

time qbst test_10m.bstree knn -v 0.2 -k 10

Count the number of entries havig value in 0.4 and 0.6

time qbst test_10m.bstree range -f 0.4 -t 0.6 -c

Priny the value in the range 0.49999 and 0.50001 (the result is ordered by increasing values)

time qbst test_10m.bstree range -f 0.49999 -t 0.50001

Benchmark

Test on my personal desktop (MVNe SSD, 16 GB RAM, AMD Ryzen) on a 20 GB file containing sequential numbers (from 0 to 1999999999).

> time mkbst -h --input test.2billion.csv test2b --id-type u4 --val-type u4

real	9m49,775s
user	7m37,361s
sys	0m50,522s

It took less than 10min to build the 15 GB ouput file (ok, the input file is already sorted).

> qbst test2b.bstree info

{
  "types": [
    "U32",
    "U32"
  ],
  "constants": {
    "n_entries": 2000000000,
    "entry_byte_size": 8,
    "n_entries_per_l1page": 4096,
    "n_l1page_per_ldpage": 255
  },
  "layout": {
    "depth": 2,
    "n_entries_root": 1914,
    "n_entries_main": 1999622790,
    "rigthmost_subtree": {
      "depth": 1,
      "n_entries_root": 92,
      "n_entries_main": 376924,
      "rigthmost_subtree": {
        "depth": 0,
        "n_entries_root": 286,
        "n_entries_main": 286,
        "rigthmost_subtree": null
      }
    }
  }
}

Simple exact value query:

> time qbst test2b.bstree get value 1569874529

id,val
1569874529,1569874529

real	0m0,002s
user	0m0,000s
sys	0m0,002s

Nearest neighbour query

> time qbst test2b.bstree nn -v 3000000000

distance,id,val
1000000001,1999999999,1999999999

real	0m0,002s (0m0,009s at the first execution)
user	0m0,002s
sys	0m0,000s

K nearest neighbours query

> time qbst test2b.bstree knn -v 25684 -k 10

distance,id,val
0,25684,25684
1,25685,25685
1,25683,25683
2,25686,25686
2,25682,25682
3,25681,25681
3,25687,25687
4,25688,25688
4,25680,25680
5,25679,25679

real	0m0,002s (0m0,005s at the first execution)
user	0m0,002s
sys	0m0,000s

Range query

> time qbst test2b.bstree range -l 10 -f 25698470 -t 25698570

id,val
25698470,25698470
25698471,25698471
25698472,25698472
25698473,25698473
25698474,25698474
25698475,25698475
25698476,25698476
25698477,25698477
25698478,25698478
25698479,25698479

real	0m0,002s
user	0m0,000s
sys	0m0,002s

At first execution, the limiting factor is the disk access. At the second execution, the limiting factor is the time required by the OS to handle the process. The 2ms includes the time needed to read the tree metadata.

Generate 100,000 random points in 0 - 2billion:

./genfile 2000000000 randint | head -100001 | tail -n +2 > toto.list

In release mode (third execution):

time qbst test2b.bstree get list toto.list > toto.res.txt

real	0m0,732s
user	0m0,245s
sys	0m0,481s

i.e. a mean of less than 8 micro second per query (including parsing, conversion to string and writing the result in a file)! (The mean is ~0.14 ms/query at the first execution)

> time qbst test2b.bstree nn list toto.list > toto.res.txt

real	0m16,224s
user	0m0,482s
sys	0m3,222s

> time qbst test2b.bstree nn list toto.list > toto.res.txt

real	0m1,651s
user	0m0,256s
sys	0m0,804s

> time qbst test2b.bstree nn list toto.list > toto.res.txt

real	0m0,760s
user	0m0,251s
sys	0m0,509s

First execution: 0.16 ms/query
Third execution: 7.60 us/query

We redo those query sorting the random number:

./genfile 2000000000 randint | head -100001 | tail -n +2 | sort -n > toto.list

The results are similar.

We recall that the index file is 15 GB large, so 2nd execution is faster since the data is in the disk cache.

Bench on 10 billion f32 random values in `[0, 1)`

Generate the value and the index at once:

nohup sh -c "genfile 10000000000 randf64 | mkbst -h test.10b.randf64 --id-type u5 --val-type f4" &

Or in two steps

genfile 10000000000 randf64 > test_10b_randf64.csv
mkbst -h --input test_10b_randf64.csv test.10b.randf64 --id-type u5 --val-type f4

The size of the index is ~85 GB.

The queries have been tested on 2 distinct hardware:

Desktop computer
- 1 TB, 7200 RPM, 32 MB cache HDD (HGST HTS721010A9E630)
- Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz (6 MB "smart cache")
- 16 GB DDR4 2133 MHz
Server
- RAID of 5 SSDs (Samsung SATA SSDs)
- 2x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (25 MB "smart cache")
- 64 GB DDR4 2133 MHz

The table return the "real" time provided by the "time" command.
Each command starts by time qbst test.10b.bstree plus the Query params.

Here the command to generate the list input file in the queries using list:

genfile 10000 randf64 | egrep -v "val" | sort -n > randf64_4q.csv

Query params	1st or 2nd	Result	Desktop	Server
nn value 0.5	1		0m0,071s	0m0.013s
nn value 0.5	2		0m0,004s	0m0.003s
knn -v 0.8 -k 10	1		0m0,034s	0m0.007s
knn -v 0.8 -k 10	2		0m0,004s	0m0.003s
all -v 0.8 -c	1/2	588	0m0,004s	0m0.005s
all -v 0.2 -c	1	148	0m0,026s	0m0.011s
all -v 0.2 -c	2	148	0m0,004s	0m0.003s
range -f 0.4 -t 0.5 -c	1/2	1000028688	1m5,450s	1m57.212s
range -f 0.150 -t 0.149 -c	1	157	0m0,028s	0m0.025s
range -f 0.150 -t 0.149 -c	2	157	0m0,004s	0m0.003s
get list 1000.csv > res.csv	1		0m9,392s	0m0.251s
get list 1000.csv > res.csv	2		0m0,026s	0m0.034s
get list 10000.csv > res.csv	1		1m40,354s	0m3.807s
get list 10000.csv > res.csv	2		0m0,181s	0m0.207s
get list 100000.csv > res.csv	1			0m24.548s

In the last case (100000 queries, no data in the disk cache), the mean query time is about 0.24 ms (which is probably not far from the disk access time).

In the previous results, we clearly see the effect of the spinning vs SSD disk at the first execution. On the query range -f 0.4 -t 0.5 -c we see that the server has a slower CPU.

The query get list 10000.csv is very interesting (a factor more than x20 in performances!):

the mean time on a spinning disk is about 10 ms
the mean time on a SSD disk is about 0.4 ms

For the query get list 100000.csv > res.csv, the result time is about the same the input being sorted or not.

Bench with Gaia DR2 data (1.6 Billion entries)

With this BSTree index

The use case is simple: from the Gaia DR2 Source unique identifier, I want to retrieve the associated position (I am using the formatted position %015.9%+015.9f as a string made of 30 ASCII chars).
Thus here the value I want to index is Source and the associated identifier is the position (yes it is different from a HashMap in which Source would be the (unique) key and the position the associated value because in a bs-tree the indexed value is not necessarily unique).

In input, the files looks like:

ra,dec,source
45.00431616421,0.02104503269,34361129088
44.99615368416,0.00561580621,4295806720
45.00497424498,0.01987700037,38655544960
44.96389532530,0.04359518482,343597448960
...

and contains 1,69,2919,136 rows (including the header line).

I build and exec the following script

#!/bin/bash

LC_NUMERIC=C LC_COLLATE=C; \
tail -n +2 gaia_dr2.idradec.csv | tr ',' ' ' |\
while read line; do printf "%015.11f%+015.11f,%d\n" $line; done |\
mkbst gaia_dr2_source --val 1 --id 0 --id-type t30 --val-type u8

(the process is quite slow due to the bash while and printf).

I then have a 2.5 GB file Gaia_source.txt containing more 132,739,322 Source, looking like:

2448780173659609728
2448781208748235648
2448689605685695488
2448689777484387072
2448783991887042176
...

Two consecutive executions (slow 7200 RPM HDD) gives:

time qbst ../gaia_dr2_source.bstree get list Gaia_source.txt > Gaia.test.csv

real	24m4,323s
user	5m8,834s
sys	4m52,898s

and

time qbst ../gaia_dr2_source.bstree get list Gaia_source.txt > Gaia.test.csv

real	13m58,572s
user	4m2,162s
sys	3m32,534s

I guess that the second execution benefits from HDD cache.

Now sorting the input and querying again leads to:

time qbst ../gaia_dr2_source.bstree get list Gaia_source.sorted.txt > Gaia.test.sorted.csv

real	5m53,569s
user	2m43,339s
sys	2m28,872s

The output file is 6.3 GB large.

With PSQL10

I install PSQL10 on Ubuntu via apt, create a user and move the database out of the system disk:

sudo apt-get install postgresql-10
sudo -u postgres createuser --interactive
sudo -u postgres createdb fxtests
sudo systemctrl stop postgresql
sudo rsync -av /var/lib/postgresql /data-local/psql/ 
sudo vim /etc/postgresql/10/main/postgresql.conf
sudo systemctrl start postgresql

I create two tables (the index and the data tables) and copy data:

CREATE TABLE gaia2_idpos (
  source BIGINT PRIMARY KEY,
  pos CHAR(30) NOT NULL
);

COPY gaia2_idpos(pos, source)
FROM '/data-local/org/gaia_dr2.idradec.4index.csv'
DELIMITER ',';

and

CREATE TABLE gaia2_id (
  source BIGINT PRIMARY KEY
)

COPY gaia2_id(source)
FROM '/data-local/org/aas/Gaia_source.txt'
DELIMITER ','

And perform the query:

time psql -d fxtests -t -A -F"," -c "SELECT b.* FROM gaia2_id as a NATURAL JOIN gaia2_idpos as b" > output.csv

real	53m17,051s
user	1m4,109s
sys	0m50,723s

Remarks:

I tested PSQL out of the box, without modifying any parameters
In the PSQL test, using a PRIMARY KEY for both tables, the result is to be compared with the sorted input in the BSTree test.

TODO list

add the possibility to query by a list of target
make a simple test with PSQL
replace memory map by pread/pwrite? (see e.g. positioned-io or scroll)
remove the code which is now obsolete (e.g. get overwritten by get exact visitor)
add much more tests
add benchmarks
add CI
try to reduce the code redundancy (particularly in SubTreeW and SubTreeR)
add support for NULL values (storing them separately, out of the tree structure)
- in the meanwhile, we can use a pre-defined value coding null
perform tests with SQLx and PostgreSQL to have a reference time (would be nice if we are at least as fast)

Acknowledgements

If you use this code and work in a scientific public domain (especially astronomy), please acknowledge its usage and the CDS who developed it. It may help us in promoting our work to our financiers.

Warning

If the compilation fails with a message like

Caused by:
  process didn't exit successfully: `rustc --crate-name bstree_file [...]
  (signal: 9, SIGKILL: kill)

, try

dmesg | egrep -i 'killed process'

If the result looks like

[...] Out of memory: Killed process xxxxx (rustc)

it means that your machine was out of memory.

License

Like most projects in Rust, this project is licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Dependencies

~7MB
~109K SLoC