2 unstable releases

0.2.0 Dec 30, 2020
0.1.0 Dec 9, 2020

#393 in Database interfaces

MIT/Apache

115KB
2K SLoC

clepsydra

Overview

This is a work-in-progress implementation of a core protocol for a minimalist distributed database. It strives to be as small and simple as possible while attempting to provide relatively challenging features:

  • Strict Serializability

  • Online Reconfiguration

  • Fault tolerance

  • High throughput

The implementation is based on a simplified version of the "Ocean Vista" (OV) protocol, and uses its terminology wherever possible. OV combines replication, transaction commitment and concurrency control into a single protocol.

Summary

The short version of the protocol is:

  • Transactions are represented as deterministic thunks over snapshots.

  • Each transaction is assigned a globally-unique timestamp.

  • Transactions are separated into two phases: S-phase and E-phase.

  • S-phase (storage) consists of coordination-free "blind quorum-writes" replicating the thunks into their MVCC order on each replica.

  • A watermark tracking minimum transaction timestamps-being-written is gossiped between peers, increasing as quorum-writes complete.

  • A transaction only enters E-phase after the watermark advances past it.

  • E-phase (evaluation) quorum-reads and evaluates thunks from consistent snapshots below the watermark, lazily resolving any earlier thunks. Everything below the watermark is coordination-free and deterministic.

Caveats

Nothing's perfect, and this crate is anything but:

  • This crate is very incomplete and does not work yet. Don't use it for anything other than experiments and toys. Recovery, reconfiguration, timeouts and nontrivial fault tolerance paths definitely don't work.

  • It also (somewhat recklessly) attempts to combine OV's reconfiguration and gossip protocols into an instance of the [concorde] reconfigurable lattice agreement protocol. This might not even be theoretically safe.

  • It is much more minimal than the full OV protocol: there's no support for sharding, nor the two-level peer-vs-datacenter locality organization. This crate treats its whole peer group as a single symmetric shard.

  • As a result, performance won't be "webscale" or anything. It will scale vertically if you throw cores at it, but no better, and its latency will always have speed-of-light WAN RTT factors in it. It's distributed for fault tolerance, not horizontal scaling.

  • As with OV, this crate does require partial clock synchronization. It doesn't need to be very tight: clock drift only causes increased latency as the watermarks progress as the minimum of all times; it doesn't affect correctness. Normal weak-NTP-level sync should be ok.

  • As with OV, Calvin, and all deterministic databases: your txns have to be deterministic and must have deterministic read and write sets. If they cannot have their read and write sets statically computed (eg. if they rely on the data to decide read and write set) you have to build slightly awkward multi-phase txns. The term in the literature is "reconnaisance queries".

Reference

Hua Fan and Wojciech Golab. Ocean Vista: Gossip-Based Visibility Control for Speedy Geo-Distributed Transactions. PVLDB, 12(11): 1471-1484, 2019.

DOI: https://doi.org/10.14778/3342263.3342627

http://www.vldb.org/pvldb/vol12/p1471-fan.pdf

Name

Wikipedia:

A water clock or clepsydra (Greek κλεψύδρα from κλέπτειν kleptein, 'to steal'; ὕδωρ hydor, 'water') is any timepiece by which time is measured by the regulated flow of liquid into (inflow type) or out from (outflow type) a vessel, and where the amount is then measured.

License: MIT OR Apache-2.0

Dependencies

~7–10MB
~213K SLoC

_