2 unstable releases

| Version | Released     |
|---------|--------------|
| 0.2.0   | Mar 13, 2021 |
| 0.1.0   | Dec 24, 2020 |
Index a `{key,value}` dataset using a Read Only BTree. Refer to the rustdoc link (above) for details.
Contribution
- Simple workflow. Fork - Modify - Pull request.
- Before creating a PR,
  - Run `make build` to confirm that all build variants pass with 0 warnings and 0 errors.
  - Run `check.sh` with 0 warnings, 0 errors and all test cases passing.
  - Run `perf.sh` with 0 warnings, 0 errors and all test cases passing.
  - Install and run `cargo spellcheck` to remove common spelling mistakes.
- Developer certificate of origin is preferred.
lib.rs:
Package implements an immutable, read-only BTree index.
Use the [Builder] type to build a new index, and subsequently load the
index using the [Index] type. An index can be accessed concurrently by
cloning the `Index` instance. Note that a single `Index` instance cannot be
shared among threads. Once an index is built using the `Builder` type,
it cannot be modified. While strict immutability might
seem like an inconvenience, it has certain advantages:
- Nodes are fully packed, hence less overhead and lower tree depth.
- Easy and efficient caching of btree-blocks.
- Can be easily paired with immutable read-only bloom filters.
- Friendly for concurrent access.
Inventory of features
- Index can be parametrized over Key-type and Value-type.
- Uses CBOR for serialization and deserialization.
- Key and Value types can be made `robt` compliant by `derive(Cborize)` (see the sketch after this list).
- Value can either be stored in the leaf-node or in a separate log-file.
- Additionally, the incoming iterator used to build the index can supply older versions of a value using the [Diff] mechanics.
- Bloom filter can help optimize false lookups.
- API `get()` operation, with bloom-filter support.
- API `iter()` and `reverse()` operations for forward and reverse iteration.
- API `iter_version()` and `reverse_version()` operations, similar to iter/reverse but also fetching older versions of an entry. Note that iter/reverse do not fetch the older versions.
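For illustration, below is a minimal sketch of deriving CBOR serialization for application key/value types. The type names are hypothetical, and the exact derive path and any additional requirements (for example an associated `ID` constant) depend on the `mkit` version in use; consult the rustdoc before relying on it.

```rust
// Hypothetical sketch only: type names are illustrative, and the derive
// path / extra requirements depend on the mkit version in use.
use mkit::Cborize;

#[derive(Clone, Cborize)]
struct MovieKey {
    title: String,
}

#[derive(Clone, Cborize)]
struct MovieValue {
    year: u16,
    rating: u8,
}
```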
Value-log file
Values and their deltas (older versions) can be stored in a separate log file. This has the following advantages:
- Keeps the leaf-node extremely compact, which helps caching.
- Efficient when building a multi-level index.
- Applications typically deal with older versions as archives.
While storing values in the value-log file is optional, deltas are always stored in a separate value-log file.
Building an index
Unlike mutable data-structures that support `set()`, `write()`,
`update()` etc., `robt` indexes are built from pre-sorted iterators.
In a way, each btree index can be seen as an immutable snapshot of a
sorted `{key,value}` dataset.
Typical workflow is:
```rust
use mkit::traits::BuildIndex;

let config = Config::new("/opt/data/", "movies");
// use one or more set_ methods to configure the btree parameters.
let builder = Builder::initial(config, app_meta);
builder.from_iter(iter, mkit::nobitmap::NoBitmap);

// Subsequently open the index as,
let reader1 = Index::open("/opt/data", "movies").expect("fail");
// create another concurrent reader
let reader2 = reader1.clone();
let handle = thread::spawn(move || reader2);
```
Let us look at the steps one by one:
- First create a configuration. More configuration is available via the `set_` methods.
- By supplying `app_meta`, the caller can also persist snapshot-specific meta-data.
- After creating a builder, use the `BuildIndex` trait's `from_iter()` to build a btree-index from an iterator. Iterated entries are expected to be pre-sorted (see the sketch after this list).
- The caller can optionally pass a bitmap instance that shall be used for implementing a bloom filter.
- The bitmap type is parametrized via the `BuildIndex` trait. If a probabilistic bitmap table is not required, pass the `NoBitmap` value to the `from_iter()` method.
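Since `from_iter()` expects pre-sorted entries, the application prepares its dataset in ascending key order before building. A small self-contained sketch of that preparation step (the data and types are illustrative and independent of robt's actual entry type):

```rust
// Preparing a pre-sorted {key,value} dataset before handing its iterator
// to the builder. Data and types here are illustrative only.
fn main() {
    let mut dataset = vec![
        ("inception", 2010_u16),
        ("alien", 1979),
        ("blade-runner", 1982),
    ];

    // robt builders expect entries in ascending key order, so sort first.
    dataset.sort(); // tuples sort by key first, then by value

    // this iterator is what would be handed to BuildIndex::from_iter()
    for (key, value) in dataset.into_iter() {
        println!("{} -> {}", key, value);
    }
}
```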
In the example above, we are using the `initial()` constructor to create
a builder instance; it is also possible to incrementally build an
index via the `incremental()` constructor. To understand the difference,
we shall dig deeper into how a dataset is indexed with `robt`.
`robt` is a simple btree-index, made up of a root-node,
intermediate-nodes (called m-blocks) and leaf-nodes (called z-blocks).
The entire dataset is maintained in the leaf-nodes, and the intermediate
nodes are constructed in bottom-up fashion using the first-key of each
leaf-node, all the way up to the root-node. The shape and behavior of the
root-node are exactly the same as the intermediate-node.
The dataset is made up of entries, and each entry is made up of a key, a value, a seqno, and a flag denoting whether the entry was deleted or upserted. The reason for maintaining the seqno and the deleted-flag is to support database features like vector-timestamping, log-structured-merge etc.
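As a purely illustrative sketch, and not the actual entry layout (which is intentionally not exposed), an index entry conceptually carries the following pieces:

```rust
// Illustrative only: mirrors the conceptual shape described above, not the
// real (unexposed) entry layout used by robt/mkit.
#[allow(dead_code)]
struct ConceptualEntry<K, V> {
    key: K,
    value: V,
    seqno: u64,    // sequence number of the mutation that produced this entry
    deleted: bool, // whether the mutation was a delete (as opposed to an upsert)
}
```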
Version control your values. An additional feature of the `robt`
index is that applications can version-control their values. That is,
each entry, along with key, value, seqno, etc., also maintains previous
versions of the value along with their modification seqnos. And instead of
persisting the entire value for older versions, their deltas are computed
in relation to the newer version and persisted as deltas. This is
achieved using the [Diff] mechanics. Also note that `robt` itself
doesn't compute the version deltas; they are treated as part of an
entry and persisted.
Each entry in the index is defined by the `Entry<K, V, D>` type, defined
in a common crate. Note that an index entry is parametrized over
key-type, value-type, and delta-type. Here the delta-type `D` can be
`NoDiff` if the application is not interested in preserving older versions,
or it should be the same as `<V as Diff>::D`. Refer to the [Diff] mechanics for
more detail.
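To make the [Diff] mechanics concrete, here is a self-contained, illustrative sketch of the shape such a trait takes for an application value type. The trait is defined locally for illustration; the real trait is provided by the `mkit` crate and its exact signatures may differ.

```rust
// Illustrative sketch of Diff-style mechanics: a locally defined trait with
// the shape the documentation refers to (`<V as Diff>::D`). The real trait
// is provided by the mkit crate and may differ in detail.
trait Diff: Sized {
    type D;

    /// Compute the delta from which the older version `old` can later be
    /// reconstructed relative to `self` (the newer version).
    fn diff(&self, old: &Self) -> Self::D;

    /// Re-create the older version by applying `delta` to `self`.
    fn merge(&self, delta: &Self::D) -> Self;
}

// A hypothetical application value type storing a counter.
#[derive(Clone, Debug, PartialEq)]
struct Counter(i64);

impl Diff for Counter {
    type D = i64; // the delta is just the numeric difference

    fn diff(&self, old: &Self) -> i64 {
        old.0 - self.0
    }

    fn merge(&self, delta: &i64) -> Self {
        Counter(self.0 + delta)
    }
}

fn main() {
    let old = Counter(10);
    let new = Counter(25);
    let delta = new.diff(&old);         // persist only the delta for the old version
    assert_eq!(new.merge(&delta), old); // the old version can be reconstructed
    println!("delta = {}", delta);
}
```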
Now coming back to the leaf-node: all entries are stored in the
leaf-nodes. To facilitate archival of older versions, deltas
are persisted in a separate value-log file. And optionally, to
facilitate incremental build, values can also be persisted in the
value-log file. When both values and deltas are persisted in a
separate value-log file, leaf nodes become very compact and end
up well suited for caching, compaction, incremental-build, optimized
IOPS and delta-archival.
Reading from index
All read operations are done via the [Index] type. Use the same arguments
passed to the `initial()` or `incremental()` constructors to `open()` an
existing index for reading.
Cloning an index for concurrency. Though applications can use the
`open()` call to create as many instances of an Index as needed, the
recommended approach is to call `try_clone()` on the Index. This shares
the underlying data-structure to avoid memory bloat across several
instances of the same Index. Only meta-data is shared across index instances
(when cloned); every index instance will keep an open
file-descriptor for the underlying file(s).
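A minimal sketch of this pattern, reusing the calls shown earlier (`Index::open()` and `try_clone()`); the directory, name and error handling are placeholders:

```rust
// Sketch: one Index instance per thread, obtained via try_clone() so the
// underlying data-structure is shared instead of being re-loaded from disk.
use std::thread;

let reader = Index::open("/opt/data", "movies").expect("fail");

let mut handles = vec![];
for _ in 0..4 {
    let local = reader.try_clone().expect("fail");
    handles.push(thread::spawn(move || {
        // use `local` for get()/iter() calls within this thread
        drop(local);
    }));
}
for handle in handles {
    handle.join().expect("fail");
}
```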
Simple Key-Value index
`robt` indexes are parametrized over key-type, value-type, delta-type,
and bitmap-type. The delta-type implements the older versions of the value-type
as delta-differences. The bitmap-type implements a bloom filter to optimize
away missing lookups.
In most cases, delta-type and bitmap-type are not needed. To build
and use a simple `{key,value}` index, the [Builder] and [Index] types in the
crate-root can be used. To use the fully parameterized variant, use the
db::Builder and db::Index types.
Index Entry
For simple indexing, `key` and `value` are enough. But to implement
database features like compaction and log-structured-merge, we need to
preserve more information about each entry. While the internal shape of the
entry is not exposed (for future compatibility), `robt` uses
mkit::db::Entry as the default index-entry.
Compaction
Compaction is the process of de-duplicating/removing entries
and/or older versions from an index snapshot (aka instance). In `robt`, the
compaction operation consumes the `Index` instance and creates a new
`Index` instance with its entries compacted to the desired level
of `cutoff`. There are three types of compaction:
deduplication
This is basically applicable for snapshots that don't have to preserve older versions or deleted entries.
When the same value-log file is used to incrementally build newer batches of mutations, older values get duplicated. This requires a periodic clean-up of garbage values to reduce the disk footprint.
This type of compaction is also applicable for index instances that do not need distributed LSM. In such cases, the oldest snapshot can compact away older versions of each entry and purge entries that are marked deleted.
lsm-compaction
This is applicable for databases that store their index as multi-level
snapshots, similar to leveldb. Each snapshot can be built as a
`robt` [Index]. Most lsm-based storage will have its root snapshot
as the oldest and only source of truth, but this is not possible for a
distributed index that ends up with multiple truths across different nodes.
To facilitate such designs, in lsm mode, even the root level at any given
node can retain older versions up to a specified `seqno`, where that `seqno` is
computed through eventual consistency.
Another use case of lsm-compaction is to maintain older versions of a value.
tombstone-compaction
Tombstone compaction is similar to `lsm-compaction` with one main
difference: when application logic issues a `tombstone-compaction`, only
deleted entries that are older than the specified seqno will be purged.
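The three modes above can be summarized with an illustrative enum; this is only a sketch of the idea, not the crate's actual cutoff type, whose definition should be taken from the rustdoc:

```rust
// Illustrative sketch only: a hypothetical cutoff specification mirroring
// the three compaction modes described above; not the crate's actual type.
use std::ops::Bound;

#[allow(dead_code)]
enum CompactionCutoff {
    /// deduplication: drop duplicated/older values, keeping only the latest.
    Dedup,
    /// lsm-compaction: retain older versions up to the given seqno bound.
    Lsm(Bound<u64>),
    /// tombstone-compaction: purge only deleted entries older than the bound.
    Tombstone(Bound<u64>),
}
```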