10 breaking releases

0.12.0 Dec 25, 2024
0.10.0 Nov 7, 2024
0.3.0 Feb 19, 2024
0.2.0 Dec 27, 2023
0.1.0 Sep 21, 2023

#88 in Science

Download history 106/week @ 2024-09-11 72/week @ 2024-09-18 11/week @ 2024-09-25 153/week @ 2024-10-02 145/week @ 2024-10-09 16/week @ 2024-10-16 119/week @ 2024-11-06 5/week @ 2024-11-13 5/week @ 2024-11-20 128/week @ 2024-12-04 19/week @ 2024-12-11 147/week @ 2024-12-25

294 downloads per month
Used in keyvalint_bench

Apache-2.0

1MB
20K SLoC

lsmtk

This library provides an implementation of an log-structured merge tree; this merge tree uses a new compaction algorithm called triangular compaction that achieves a factor of 6.5x write amplification both in theory and in practice.

Triangular Compaction

Triangular compaction takes note of a single observation: moving from one level to the next is inefficent; instead, the compaction should spread across levels to move data across as many levels as is possible.

The core of the compaction algorithm comes from the intuition embedded in this Python snippet:

sums = 0
count = 0
ingest = 0
for i in range(2**ceil(LSM_NUM_LEVELS)):
    bits = 0
    while i and (1 << bits) & i:
        bits += 1
    sums += (1 << bits) * TARGET_FILE_SIZE * LSM_LEVEL_0_NUM_FILES
    ingest += TARGET_FILE_SIZE * LSM_LEVEL_0_NUM_FILES
    count += 1
COMPACTION_AVERAGE_SIZE = sums / count + LSM_LEVEL_0_NUM_FILES * TARGET_FILE_SIZE # bytes
WRITE_AMPLIFICATION = sums / ingest

In this compaction we select the lowest N levels that are full, stopping at the first level that's empty. In this way we amortize the cost of a compaction similar to how a Vec works in Rust or vector in C++.

The triangular compaction algorithm generalizes this intuition to select triangles from the LSM tree such that the transitive closure of files under that level will be included in the compaction.

Check src/tree/mod.rs for the compaction algorithm.

Trade-Offs

Write amplification is bounded above by 6.5x.

  • Read amplification is unbounded, but stays low empirically.
  • Space amplification is bounded by 2x.

In an empirical benchmark where we ingested 44GiB of data we saw the write amplification stayed under a factor of 3x (trash is the ssts that were compacted away; it will be emptied periodically in a real system):

4.0K    db/compaction
71M     db/ingest
8.8M    db/mani
44G     db/sst
127G    db/trash

In process counters confirms a 2.9x write amplification.

lsmtk.bytes_ingested = 46552563486
lsmtk.compaction.bytes_written = 135401535775

Status

Active development.

Scope

This library provides pieces of an lsm graph. It will eventually grow to support everything necessary to create an embedded lsm graph analogous to LevelDB and RocksDB.

Warts

  • This library is under-tested and will see active development in the future.
  • Tricks used in LevelDB (grandfather overlap) and PebblesDB (guard pages) are not used. These are 100% compatible and would only improve the compaction algorithm's performance.
  • There is no back pressure against excessive ingest.
  • There's a concurrency bug that shows up around the point of 40GiB where compaction will stall. This is just at the prototype phase and I've run out of funds to continue developing it.

Documentation

The latest documentation is always available at docs.rs.

Dependencies

~5.5MB
~95K SLoC