#embedded-database #lsm-tree #database #acid #vlog

bin+lib vlog_db

High-performance embedded key-value database with LSM-tree architecture, ACID transactions, and value-log separation

1 unstable release

Uses new Rust 2024

0.1.0 Sep 19, 2025

#59 in Database implementations

37 downloads per month

Apache-2.0

220KB
5K SLoC

VlogDB: An Embedded Database for Modern Applications

⚠️ EARLY STAGES: VlogDB is an experimental database implementation designed for research. While functionally complete with ACID guarantees, it is not intended for production use due to performance limitations and lack of real-world testing.

VlogDB is an embedded key-value database built entirely in Rust, implementing the LSM-tree architecture with value-log separation. Drawing inspiration from WiscKey and modern storage research, it demonstrates how thoughtful architectural decisions can optimize for specific workloads while maintaining strong consistency guarantees.

Introduction

Traditional B-tree databases like LMDB excel at read-heavy workloads through memory mapping and cache-friendly data structures. However, write-heavy applications often struggle with the inherent constraints of in-place updates and page-level locking. VlogDB addresses this challenge by implementing an LSM-tree (Log-Structured Merge-tree) architecture that treats writes as immutable log entries, enabling significantly faster write operations while maintaining full ACID properties.

While production LSM-tree implementations like RocksDB have proven themselves in large-scale deployments, this project explores value-log separation as an alternative approach to reducing write amplification.

The database separates large values from the main data structure through a technique called value-log separation, originally proposed in the WiscKey paper. This approach reduces write amplification during compaction operations, as only keys and small metadata need to be moved between levels, while large values remain in a separate append-only log.

When VlogDB Makes Sense

The database performs exceptionally well in scenarios involving frequent writes, large value storage, and sequential access patterns. Applications that append log entries, store document content, or maintain time-series data will find the sequential write optimizations beneficial. The value separation particularly shines when storing values larger than 4KB, where traditional approaches suffer from excessive data movement during compaction.

For read-heavy workloads, LMDB's memory mapping provides superior point lookup performance. For production LSM-tree needs, RocksDB offers battle-tested reliability and extensive configuration options.

Getting Started

Adding VlogDB to your project requires only a single dependency, as it's designed as a pure Rust embedded database with no external systems or daemons.
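
The crate name and version below come from the release information above:

# Cargo.toml
[dependencies]
vlog_db = "0.1"

With the dependency in place, basic usage looks like this: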

use vlog_db::{Environment, DatabaseConfig};

fn main() -> vlog_db::Result<()> {
    // Initialize the database environment
    let env = Environment::new()
        .set_max_dbs(4)
        .set_map_size(512 * 1024 * 1024)
        .open("./data")?;

    // Open the default database
    let db = env.open_db(None)?;

    // Perform basic operations
    db.put(b"user:1001", b"{\"name\": \"Alice\", \"age\": 30}")?;
    
    if let Some(value) = db.get(b"user:1001")? {
        println!("Retrieved: {}", String::from_utf8_lossy(&value));
    }

    // Use transactions for atomic operations
    let mut txn = db.begin_transaction()?;
    txn.put(b"user:1002", b"{\"name\": \"Bob\", \"age\": 25}")?;
    txn.put(b"user:1003", b"{\"name\": \"Carol\", \"age\": 35}")?;
    txn.commit()?;

    Ok(())
}

Running Tests and Benchmarks

The project includes comprehensive tests covering ACID compliance, crash recovery scenarios, and concurrent access patterns:

# Run the full test suite
cargo test

# Execute performance benchmarks
cargo bench

# Compare against LMDB
cargo run --bin nova_vs_lmdb_shootout --release

Architecture

The system implements a multi-level storage hierarchy designed to optimize write throughput while maintaining read performance through strategic caching and indexing.

Storage Hierarchy

At the core lies an active memtable implemented as a concurrent skiplist, where all writes initially land. This in-memory structure provides fast insertion and maintains sorted order, crucial for efficient range queries. When the memtable reaches its configured size threshold, it becomes immutable and a new active memtable takes its place.
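
A minimal sketch of this rotation logic, with a single-threaded BTreeMap standing in for the concurrent skiplist (the types and names here are illustrative, not VlogDB's actual internals):

use std::collections::BTreeMap;
use std::mem;

// Stand-in for the concurrent skiplist; a BTreeMap preserves the same
// sorted-order property in a single-threaded sketch.
type Memtable = BTreeMap<Vec<u8>, Vec<u8>>;

struct MemtableSet {
    active: Memtable,
    immutable: Vec<Memtable>, // frozen tables queued for background flushing
    size_limit: usize,        // configured memtable size threshold
    active_bytes: usize,
}

impl MemtableSet {
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.active_bytes += key.len() + value.len();
        self.active.insert(key.to_vec(), value.to_vec());
        if self.active_bytes >= self.size_limit {
            // Freeze the full memtable and start a fresh one; a background
            // thread would flush the frozen table to an SSTable on disk.
            let frozen = mem::take(&mut self.active);
            self.immutable.push(frozen);
            self.active_bytes = 0;
        }
    }
}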

Background threads continuously flush immutable memtables to disk as SSTable files (Sorted String Tables). These files form the persistent storage layer, organized into levels where each level can hold exponentially more data than the previous one. Level 0 contains recently flushed SSTables that may have overlapping key ranges, while higher levels maintain strict non-overlapping invariants.

The compaction process merges SSTables between levels, eliminating duplicate keys and maintaining the size ratios between levels. This approach amortizes the cost of maintaining sorted order across the entire dataset, enabling sustained write performance even as the database grows.
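
The exponential sizing reduces to a simple capacity rule. A hedged sketch of the trigger logic follows; the 10x growth factor is a common LSM default, not a documented VlogDB constant:

/// Illustrative capacity rule: each level holds `growth_factor` times
/// more data than the one before it.
fn level_capacity_bytes(level: u32, l1_capacity: u64, growth_factor: u64) -> u64 {
    l1_capacity * growth_factor.pow(level.saturating_sub(1))
}

/// Size-based trigger for levels >= 1; level 0 is usually triggered by
/// SSTable count instead, since its files may have overlapping key ranges.
fn needs_compaction(level: u32, level_bytes: u64) -> bool {
    level >= 1 && level_bytes > level_capacity_bytes(level, 64 * 1024 * 1024, 10)
}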

Value Separation

Large values present a unique challenge in LSM-tree implementations. During compaction, entire SSTables must be rewritten even if only a small percentage of keys need merging. When values are large, this creates substantial write amplification that can saturate storage bandwidth.

The system addresses this through value-log separation. Values exceeding 128 bytes are stored in a separate append-only log file, while the main LSM-tree stores only the key and a pointer to the value's location. This dramatically reduces the amount of data moved during compaction, as pointers are much smaller than the original values.
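
A pointer needs only enough information to locate the value in the log. The sketch below assumes a plausible field layout rather than VlogDB's actual on-disk format:

/// Illustrative pointer stored in the tree in place of a large value.
struct ValuePointer {
    file_id: u32, // which value-log segment holds the value
    offset: u64,  // byte offset of the value within that segment
    len: u32,     // length of the value in bytes
}

/// Small values stay inline; large ones are replaced by a pointer.
enum StoredValue {
    Inline(Vec<u8>),
    Pointer(ValuePointer),
}

const SEPARATION_THRESHOLD: usize = 128; // threshold described above

fn should_separate(value: &[u8]) -> bool {
    value.len() > SEPARATION_THRESHOLD
}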

The value log itself is organized with group commits to balance durability with write performance. Multiple transactions can batch their value writes together, reducing the number of individual fsync operations required for crash safety.
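
A minimal sketch of the group-commit idea, assuming a channel feeding a dedicated writer thread (the design and the 100-microsecond batching window are illustrative):

use std::fs::File;
use std::io::Write;
use std::sync::mpsc::Receiver;
use std::time::Duration;

/// Illustrative group-commit loop: drain whatever writes are pending,
/// append them all, then pay for a single fsync.
fn group_commit_loop(rx: Receiver<Vec<u8>>, mut vlog: File) -> std::io::Result<()> {
    loop {
        // Block until at least one write arrives.
        let first = match rx.recv() {
            Ok(buf) => buf,
            Err(_) => return Ok(()), // all senders dropped; shut down
        };
        vlog.write_all(&first)?;
        // Opportunistically batch anything else queued within a short window.
        while let Ok(buf) = rx.recv_timeout(Duration::from_micros(100)) {
            vlog.write_all(&buf)?;
        }
        vlog.sync_data()?; // one fsync covers the whole batch
    }
}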

Transaction System

The database provides full ACID guarantees through an optimistic concurrency control (OCC) system combined with multi-version concurrency control (MVCC). Each transaction operates on a consistent snapshot of the database, identified by a global sequence number allocated at transaction start.

During execution, transactions accumulate reads and writes in private buffers without immediately affecting the global state. At commit time, the system validates that no conflicting concurrent transactions have committed, ensuring serializability. Successful commits atomically apply all changes and advance the global sequence number.

This approach enables high concurrency for non-conflicting transactions while providing strong isolation guarantees. The sequence-based snapshots ensure that reads within a transaction always see a consistent view of the data, even as concurrent transactions commit changes.
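
A minimal sketch of the commit-time validation step, assuming the engine tracks the last commit sequence per key (names and structures are illustrative):

use std::collections::{HashMap, HashSet};

/// Illustrative commit-time validation: the transaction read its keys at
/// `snapshot_seq`; it must abort if any of those keys was rewritten by a
/// transaction that committed after that snapshot.
struct Txn {
    snapshot_seq: u64,
    read_set: HashSet<Vec<u8>>,
    write_set: HashMap<Vec<u8>, Vec<u8>>, // applied atomically if validation passes
}

fn validate(txn: &Txn, last_commit_seq: &HashMap<Vec<u8>, u64>) -> bool {
    txn.read_set.iter().all(|key| match last_commit_seq.get(key) {
        Some(&seq) => seq <= txn.snapshot_seq, // no newer committed version
        None => true,                          // never written since tracking began
    })
}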

Recovery and Consistency

The value log serves as the primary source of truth during crash recovery. Each committed transaction writes its changes to the value log with sufficient metadata to reconstruct the exact database state. If the system crashes before background processes complete SSTable generation, the recovery process replays the value log to rebuild the in-memory state.

This design choice simplifies the recovery logic significantly compared to traditional write-ahead logging approaches. The value log contains the complete serialized form of each committed transaction, eliminating complex coordination between multiple log files and data structures.
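
A sketch of the replay loop under an assumed record layout of (commit sequence, key, optional value):

use std::collections::BTreeMap;

/// Illustrative recovery record; None encodes a delete.
struct LogRecord {
    commit_seq: u64,
    key: Vec<u8>,
    value: Option<Vec<u8>>,
}

fn replay(records: impl IntoIterator<Item = LogRecord>) -> (BTreeMap<Vec<u8>, Vec<u8>>, u64) {
    let mut memtable = BTreeMap::new();
    let mut max_seq = 0;
    for rec in records {
        max_seq = max_seq.max(rec.commit_seq);
        match rec.value {
            Some(v) => { memtable.insert(rec.key, v); }
            None => { memtable.remove(&rec.key); }
        }
    }
    // The global sequence counter resumes from the highest replayed commit.
    (memtable, max_seq)
}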

Performance Characteristics

The performance profile follows directly from the architectural decisions: clear strengths in write-heavy scenarios, and equally clear areas where established databases maintain the advantage.

Benchmark Results

Performance testing against LMDB reveals the optimization focus. In write-heavy scenarios, the system demonstrates a 24% improvement in PUT operations and a remarkable 69% advantage in DELETE operations. The sequential nature of LSM-tree writes enables this consistent performance even under sustained load.

Large value storage showcases the benefits of value separation most clearly. When storing 4KB values, VlogDB outperforms LMDB by 90%, as the reduced write amplification during compaction becomes highly beneficial. This advantage grows with value size, making it particularly suitable for document storage and content management applications.

Range operations benefit from the naturally sorted structure of SSTables. Prefix scanning operations run approximately 50% faster than comparable LMDB operations, as sequential reads through sorted files prove more efficient than traversing B-tree structures.

However, point reads favor LMDB's memory-mapped approach by approximately 61%. The direct memory access provided by memory mapping eliminates the overhead of cache management and file I/O that LSM-tree implementations inherently require.

Benchmarks against RocksDB have not been conducted, as a fair comparison would require extensive tuning of its configuration. Given its maturity and optimization, RocksDB would likely outperform VlogDB in most scenarios.

Current Limitations

The current implementation suffers from a critical performance bottleneck in filesystem I/O patterns. Individual writes trigger multiple separate filesystem calls, preventing the effective amortization of I/O costs. Production-quality implementations require batched filesystem operations to achieve optimal throughput.

Durable writes particularly highlight this limitation, with LMDB demonstrating 98% better performance when fsync is enabled. The overhead of ensuring crash safety through individual synchronous operations significantly impacts sustained write performance.

API Reference

The database provides a clean, Rust-idiomatic interface that prioritizes safety and ergonomics while exposing the full capabilities of the underlying storage engine.

Environment Management

The Environment represents a collection of databases sharing storage resources and configuration. Multiple databases within an environment benefit from shared caching and background processes:

let env = Environment::new()
    .set_max_dbs(10)
    .set_map_size(1024 * 1024 * 1024)  // 1GB
    .open("./database_directory")?;

let users_db = env.open_db(Some("users"))?;
let products_db = env.open_db_with_config(
    Some("products"),
    DatabaseConfig::default().with_memtable_size(64 * 1024 * 1024)
)?;

Basic Operations

The core database interface provides immediate consistency for individual operations:

// Simple key-value operations
db.put(b"configuration:timeout", b"30000")?;
let timeout = db.get(b"configuration:timeout")?;
db.delete(b"configuration:timeout")?;

// Batch operations for improved performance
let keys = vec![b"user:1001".to_vec(), b"user:1002".to_vec()];
let values = db.get_batch(keys)?;

let updates = vec![
    (b"user:1001".to_vec(), b"updated_data".to_vec()),
    (b"user:1002".to_vec(), b"more_data".to_vec()),
];
db.put_batch(updates)?;

Range Operations

Sequential access patterns leverage the sorted nature of the underlying storage:

// Scan with a prefix filter
db.scan_prefix(b"user:", 100, |chunk| {
    for (key, _value) in chunk {
        println!("Found user: {:?}", String::from_utf8_lossy(key));
    }
})?;

// Point-in-time queries using sequence numbers
let sequence = db.current_sequence();
let historical_value = db.get_at(b"user:1001", sequence)?;

Transaction Interface

Transactions provide atomic, isolated access to multiple operations:

let mut txn = db.begin_transaction()?;

// All operations see a consistent snapshot
let current_balance = txn
    .get(b"account:balance")?
    .map(|bytes| u64::from_le_bytes(bytes.as_slice().try_into().unwrap()))
    .unwrap_or(0);
txn.put(b"account:balance", &(current_balance + 100).to_le_bytes())?;

let timestamp_bytes = std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)
    .unwrap()
    .as_secs()
    .to_le_bytes();
txn.put(b"account:last_transaction", &timestamp_bytes)?;

// Commit atomically applies all changes
txn.commit()?;

// Alternatively, rollback discards all changes
// txn.rollback()?;

Understanding the Design

The design decisions reflect a careful balance between theoretical ideals and practical implementation constraints. Understanding these trade-offs helps in evaluating whether the system fits specific use cases.

LSM-Trees vs. B-Trees

The fundamental choice between LSM-tree and B-tree architectures determines most performance characteristics. B-trees optimize for in-place updates and fast point lookups through memory mapping, making them ideal for read-heavy workloads. The ability to directly map file pages into virtual memory eliminates much of the overhead associated with cache management and buffer pools.

LSM-trees, conversely, optimize for write throughput by treating all writes as immutable log entries. This approach eliminates the random I/O patterns that plague B-tree updates under write-heavy loads. The trade-off manifests as increased complexity in read operations, which must potentially examine multiple SSTables to locate the most recent version of a key.

Value Separation Trade-offs

Value separation addresses LSM-tree write amplification at the cost of increased complexity in garbage collection and space management. Traditional LSM-trees must rewrite entire values during compaction, leading to bandwidth waste when values are large. Separating values into an independent log eliminates this waste but introduces new challenges.

The value log grows continuously and requires periodic garbage collection to reclaim space from deleted or updated values. This process must coordinate with the main LSM-tree to avoid removing values that are still referenced. Additionally, range scans may require random access to the value log, potentially reducing cache effectiveness compared to co-located approaches.
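
A sketch of the liveness check that a garbage-collection pass must perform; the `live_offset` probe is an assumed helper, not VlogDB's API:

/// Illustrative GC pass over one value-log segment: an entry is live only
/// if the LSM-tree still points at this exact log location.
struct VlogEntry {
    key: Vec<u8>,
    value: Vec<u8>,
    offset: u64, // where this entry sits in the segment
}

fn sweep_segment(
    entries: Vec<VlogEntry>,
    live_offset: impl Fn(&[u8]) -> Option<u64>,
) -> Vec<VlogEntry> {
    entries
        .into_iter()
        // Keep only entries the tree still references; survivors are
        // appended to a fresh segment, and the old file is deleted.
        .filter(|e| live_offset(&e.key) == Some(e.offset))
        .collect()
}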

Concurrency Model

The optimistic concurrency control approach, inspired by systems like DynamoDB, prioritizes throughput over latency predictability. Transactions execute speculatively without acquiring locks, maximizing parallelism when conflicts are rare. However, high-conflict workloads may experience significant retry overhead as transactions repeatedly abort and restart.

The sequence-based snapshot mechanism provides strong isolation guarantees while enabling lock-free read operations. Readers never block writers, and writers only briefly coordinate during the commit process. This design scales well with the number of concurrent readers but requires careful memory management to prevent unbounded growth of version history.

Development Status

VlogDB represents a functionally complete implementation of LSM-tree concepts with full ACID guarantees and comprehensive test coverage. The codebase includes tests for crash recovery, concurrent access patterns, and ACID compliance, demonstrating the correctness of core algorithms.

Several factors prevent production deployment, however. The filesystem I/O patterns require optimization to achieve competitive write throughput in realistic scenarios. The lack of operational tooling, monitoring integration, and migration utilities limits practical deployment capabilities.

Dependencies

~8–16MB
~322K SLoC