#btree #database

nebari

ACID-compliant database storage implementation using an append-only file format

7 releases

0.1.0-rc.6 Nov 21, 2021
0.1.0-rc.5 Nov 10, 2021
0.1.0-rc.4 Oct 28, 2021
0.0.0-reserve.0 Sep 25, 2021

#20 in Database implementations

Download history 11/week @ 2021-09-20 5/week @ 2021-09-27 2/week @ 2021-10-04 98/week @ 2021-10-11 458/week @ 2021-10-18 224/week @ 2021-10-25 81/week @ 2021-11-01 208/week @ 2021-11-08 744/week @ 2021-11-15 135/week @ 2021-11-22

581 downloads per month

MIT/Apache

330KB
8K SLoC

nebari

nebari forbids unsafe code nebari is considered experimental and unsupported crate version Live Build Status HTML Coverage Report for main branch Documentation for main branch

nebari - noun - the surface roots that flare out from the base of a bonsai tree

Warning: This crate is early in development. The format of the file is not considered stable yet. Do not use in production.

This crate provides the Roots type, which is the transactional storage layer for BonsaiDb. It is loosely inspired by Couchstore.

Examples

Inserting a key-value pair in an on-disk tree with full revision history:

use nebari::{
    io::fs::StdFile,
    tree::{Root, Versioned},
    Config,
};

let database_folder = tempfile::tempdir().unwrap();
let roots = Config::<StdFile>::default_for(database_folder.path())
    .open()
    .unwrap();
let tree = roots.tree(Versioned::tree("a-tree")).unwrap();
tree.set("hello", "world").unwrap();

For more examples, check out nebari/examples/.

Features

Nebari exposes multiple levels of functionality. The lowest level functionality is the TreeFile. A TreeFile is a key-value store that uses an append-only file format for its implementation.

Using TreeFiles and a transaction log, Roots enables ACID-compliant, multi-tree transactions.

Each tree supports:

  • Key-value storage: Keys can be any arbitrary byte sequence up to 65,535 bytes long. For efficiency, keys should be kept to smaller lengths. Values can be up to 4 gigabytes (2^32 bytes) in size.
  • Flexible query options: Fetch records one key at a time, multiple keys at once, or ranges of keys.
  • Powerful multi-key operations: Internally, all functions that alter the data in a tree use TreeFile::modify() which allows operating on one or more keys and performing various operations.
  • Pluggable low-level modifications: The Vault trait allows you to bring your own encryption, compression, or other functionality to this format. Each independently-addressible chunk of data that is written to the file passes through the vault.
  • Optional full revision history. If you don't want to lose old revisions of data, you can use a VersionedTreeRoot to store information that allows scanning old revision information. Or, if you want to avoid the extra IO, use the UnversionedTreeRoot which only stores the information needed to retrieve the latest data in the file.
  • ACID-compliance:
    • Atomicity: Every operation on a TreeFile is done atomically. Operation::CompareSwap can be used to perform atomic operations that require evaluating the currently stored value.

    • Consistency: Atomic locking operations are used when publishing a new transaction state. This ensures that readers can never operate on a partially updated state.

    • Isolation: Currently, each tree can only be accessed exclusively within a transaction. This means that if two transactions request the same tree, one will execute and complete before the second is allowed access to the tree. This strategy could be modified in the future to allow for more flexibility.

    • Durability: The append-only file format is designed to only allow reading data that has been fully flushed to disk. Any writes that were interrupted will be truncated from the end of the file.

      Transaction IDs are recorded in the tree headers. When restoring from disk, the transaction IDs are verified with the transaction log. Because of the append-only format, if we encounter a transaction that wasn't recorded, we can continue scanning the file to recover the previous state. We do this until we find a successfluly commited transaction.

      This process is much simpler than most database implementations due to the simple guarantees that append-only formats provide.

Why use an append-only file format?

@ecton wasn't a database engineer before starting this project, and depending on your viewpoint may still not be considered a database engineer. Implementing ACID-compliance is not something that should be attempted lightly.

Creating ACID-compliance with append-only formats is much easier to achieve, however, as long as you can guarantee two things:

  • When opening a previously existing file, can you identify where the last valid write occurred?
  • When writing the file, do not report that a transaction has succeeded until the file is fully flushed to disk.

The B-Tree implementation in Nebari is designed to offer those exact guarantees.

The major downside of append-only formats is that deleted data isn't cleaned up until a maintenance process occurs: compaction. This process rewrites the files contents, skipping over entries that are no longer alive. This process can happen without blocking the file from being operated on, but it does introduce IO overhead during the operation.

Nebari provides APIs that perform compaction, but currently delegates scheduling and automation to consumers of this library.

Open-source Licenses

This project, like all projects from Khonsu Labs, are open-source. This repository is available under the MIT License or the Apache License 2.0.

Dependencies

~6MB
~122K SLoC