#s3 #sqlite #vfs #blob

verneuil

A sqlite VFS for asynchronous replication to S3-compatible blob stores

15 releases (4 breaking)

0.6.4 Feb 23, 2022
0.5.0 Feb 15, 2022
0.1.3 Dec 7, 2021

#158 in Database implementations

Download history 461/week @ 2022-01-25 419/week @ 2022-02-01 526/week @ 2022-02-08 335/week @ 2022-02-15 196/week @ 2022-02-22 158/week @ 2022-03-01 120/week @ 2022-03-08 223/week @ 2022-03-15 177/week @ 2022-03-22 165/week @ 2022-03-29 205/week @ 2022-04-05 71/week @ 2022-04-12 118/week @ 2022-04-19 254/week @ 2022-04-26 249/week @ 2022-05-03 261/week @ 2022-05-10

891 downloads per month

MIT license

9MB
165K SLoC

C 158K SLoC // 0.3% comments Rust 7K SLoC // 0.1% comments Shell 44 SLoC // 0.5% comments

Verneuil: streaming replication for SQLite

asciicast

Verneuil[^verneuil-process] [vɛʁnœj] is a VFS (OS abstraction layer) for SQLite that accesses local database files like the default unix VFS while asynchronously replicating snapshots to S3-compatible blob stores. We wrote it to improve the scalability and availability of pre-existing services for which SQLite is a good fit, at least for single-node deployments.

Backtrace relies on Verneuil to backup and replicate thousands of SQLite databases that range in size from 100KB to a few gigabytes, some of which see updates every second... for less than $40/day in S3 costs.

[^verneuil-process]: The Verneuil process was the first commercial method of manufacturing synthetic gemstones... and DRH insists on pronouncing SQLite like a mineral, surely a precious one (:

The primary design goal of Verneuil is to add asynchronous read replication to working single-node systems without introducing new catastrophic failure modes. Avoiding new failure modes takes precedence over all other considerations, including replication lag: there is no attempt to bound or minimise the staleness of read replicas. Verneuil read replicas should only be used when stale data is acceptable.

In keeping with this conservative approach to replication, the local database file on disk remains the source of truth, and the VFS is fully compatible with SQLite's default unix VFS, even for concurrent (with file locking) accesses. Verneuil stores all state that must persist across SQLite transactions on disk, so multiple processes can still access and replicate the same database with Verneuil.

Verneuil also paces all API calls (with a [currently hardcoded] limit of 30 call/second/process) to avoid "surprising" cloud bills, and decouples the SQLite VFS from the replication worker threads that upload data to a remote blob store with a crash-safe buffer directory that bounds its worst-case disk footprint to roughly four times the size of the source database file. It's thus always safe to disable access to the blob store: buffered replication data may grow over time, but always within bounds.

Replacing the default unix VFS with Verneuil impacts local SQLite operations, of course: writes must be slower, in order to queue updates for replication. However, this slowdown is usually proportional to the time it took to perform the write itself, and often dominated by the two fsyncs incurred by SQLite transaction commits in rollback mode. In addition, the additional replication logic runs with the write lock downgraded to a read lock, so subsequent transactions only block on the new replication step once they're ready to commit.

This effort is incomparable with litestream: Verneuil is meant for asynchronous read replication, with streaming backups as a nice side effect. The replication approach is thus completely different. In particular, while litestream only works with SQLite databases in WAL mode, Verneuil only supports rollback journaling. See doc/DESIGN.md for details.

What's in this repo

  1. A "Linux" VFS (c/linuxvfs.c) that implements everything that SQLite needs for a non-WAL DB, without all the backward compatibility cruft in SQLite's Unix VFS. The new VFS's behaviour is fully compatible with upstream's Unix VFS! It's a simpler starting point for new (Linux-only) SQLite VFSes.

  2. A Rust crate with a C interface (see include/verneuil.h) to configure and register:

    • The verneuil VFS, which hooks into the Linux VFS to track changes, generate snapshots in spooling directories, and asynchronously upload spooled data to a remote blob store like S3. This VFS is only compatible with SQLite's rollback journal mode. It can be called directly as a Rust program, or via its C interface.

    • The verneuil_snapshot VFS that lets SQLite access snapshots stored in S3-compatible blob stores.

  3. A runtime-loadable SQLite extension, libverneuil_vfs, that lets SQLite open databases with the verneuil VFS (to replicate the database to remote storage), or with the verneuil_snapshot VFS (to access a replicated snapshot).

  4. The verneuilctl command-line tool to restore snapshots, forcibly upload spooled data, synchronise a database file to remote storage, and perform other ad hoc administrative tasks.

Quick start

There is more detailed setup information, including how to directly link against the verneuil crate instead of loading it as a SQLite extension, in doc/VFS.md and doc/SNAPSHOT_VFS.md. The rusqlite_integration example shows how that works for a Rust crate.

For quick hacks and test drives, the easiest way to use Verneuil is to build it as a runtime loadable extension for SQLite (libverneuil_vfs).

cargo build --release --examples --features='dynamic_vfs'

The verneuilctl tool will also be useful.

cargo build --release --examples --features='vendor_sqlite'

Verneuil needs additional configuration to know where to spool replication data, and where to upload or fetch data from remote storage. That configuration data must be encoded in JSON, and will be deserialised into a verneuil::Options struct (in src/lib.rs).

A minimal configuration string looks as follows. See doc/VFS.md and doc/SNAPSHOT_VFS.md for more details.

{
  // "make_default": true, to use the replicating VFS by default
  // "tempdir": "/my/tmp/", to override the location of temporary files
  "replication_spooling_dir": "/tmp/verneuil/",
  "replication_targets": [
    {
      "s3": {
        "region": "us-east-1",
        // "endpoint": "http://127.0.0.1:9000", //for non-standard regions
        "chunk_bucket": "verneuil_chunks",
        "manifest_bucket": "verneuil_manifests",
        "domain_addressing": true  // or false for the legacy bucket-as-path interface
        // "create_buckets_on_demand": true // to create private buckets as needed
      }
    }
  ]
}

That's a mouthful to pass as query string parameters to sqlite3_open_v2, so Verneuil currently looks for that configuration string in the VERNEUIL_CONFIG environment variable. If that variable's value starts with an at sign, like "@/path/to/config.json", Verneuil looks for the configuration JSON in that file.

The configuration file does not include any credential: Verneuil gets those from the environment, either by hitting the local EC2 credentials daemon, or by reading the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

Now that the environment is set up, we can load the extension in SQLite, and start replicating our writes to S3, or any other compatible blob server (we use minio for testing).

$ RUST_LOG=warn VERNEUIL_CONFIG=@verneuil.json sqlite3
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load ./libverneuil_vfs  -- Load the Verneuil VFS extension.
sqlite> .open file:source.db?vfs=verneuil
-- The contents of source.db will now be spooled for replication before
-- letting each transaction close.
sqlite> .open file:verneuil://source.host.name/path/to/replicated.db?vfs=verneuil_snapshot
-- opens a read replica for the most current snapshot replicated to s3 by `source.host.name`
-- for the database at `/path/to/replicated.db`.

Outside the SQLite shell, extensions loading must be enabled in order to allow access to the load_extension SQL function.

URI filenames must also be enabled in order to specify the VFS in the connection string; it's also possible to pass a VFS argument to sqlite3_open_v2.

Replication data is buffered to the replication_spooling_dir synchronously, before the end of each SQLite transaction. Actually uploading the data to remote storage happens asynchronously: we wouldn't want to block transaction commit on network calls.

After exiting the shell or closing an application, we can make sure that all spooled data is flushed to remote storage with verneuilctl flush $REPLICATION_SPOOLING_DIR: this command will attempt to synchronously upload all pending spooled data in the spooling directory, and log noisily / error out on failure.

Find documentation for other verneuilctl subcommands with verneuilctl help:

$ ./verneuilctl --help
verneuilctl 0.1.0
utilities to interact with Verneuil snapshots

USAGE:
    verneuilctl [OPTIONS] <SUBCOMMAND>

FLAGS:
    -h, --help
            Prints help information

    -V, --version
            Prints version information


OPTIONS:
    -c, --config <config>
            The Verneuil JSON configuration used when originally copying the database to remote storage.

            A value of the form "@/path/to/json.file" refers to the contents of that file; otherwise, the argument
            itself is the configuration string.

            This parameter is optional, and defaults to the value of the `VERNEUIL_CONFIG` environment variable.
    -l, --log <log>
            Log level, in the same format as `RUST_LOG`.  Defaults to only logging errors to stderr; `--log=info`
            increases the verbosity to also log info and warning to stderr.

            To fully disable logging, pass `--log=off`.

SUBCOMMANDS:
    flush            The verneuilctl flush utility accepts the path to a spooling directory, (i.e., a value for
                     `verneuil::Options::replication_spooling_dir`), and attempts to upload all the files pending
                     replication in that directory
    help             Prints this message or the help of the given subcommand(s)
    manifest         The verneuilctl manifest utility accepts the path to a source replicated file and an optional
                     hostname, and outputs the contents of the corresponding manifest file to `--out`, or stdout by
                     default
    manifest-name    The verneuilctl manifest-name utility accepts the path to a source replicated file and an
                     optional hostname, and prints the name of the corresponding manifest file to stdout
    restore          The verneuilctl restore utility accepts the path to a verneuil manifest file, and reconstructs
                     its contents to the `--out` argument (or stdout by default)
    sync             The verneuilctl sync utility accepts the path to a sqlite db, and uploads a fresh snapshot to
                     the configured replication targets
$ ./verneuilctl restore --help
verneuilctl-restore 0.1.0
The verneuilctl restore utility accepts the path to a verneuil manifest file, and reconstructs its contents to the
`--out` argument (or stdout by default)

USAGE:
    verneuilctl restore [OPTIONS]

FLAGS:
        --help
            Prints help information

    -V, --version
            Prints version information


OPTIONS:
    -h, --hostname <hostname>
            The hostname of the machine that generated the snapshot.

            Defaults to the current machine's hostname.
    -m, --manifest <manifest>
            The manifest file that describes the snapshot to restore.

            These are typically stored as objects in versioned buckets; it is up to the invoker to fish out the relevant
            version.

            If missing, verneuilctl restore will attempt to download it from remote storage, based on `--hostname` and
            `--source_path`.

            As special cases, an `http://` or `https://` prefix will be downloaded over HTTP(S), an
            `s3://bucket.region[.endpoint]/path/to/blob` URI will be loaded via HTTPS domain-addressed S3,
            `verneuil://machine-host-name/path/to/sqlite.db` will be loaded based on that hostname (or the current
            machine's hostname if empty) and source path, and a `file://` prefix will always be read as a local path.
    -o, --out <out>
            The path to the reconstructed output file.

            Defaults to stdout.
    -s, --source-path <source-path>
            The path to the source file that was replicated by Verneuil, when it ran on `--hostname`

But why?

Backtraces shards most of its backend metadata in thousands of small (1-2 MB) to medium size (up to 1-2 GB) SQLite databases, with an average aggregate write rate of a few dozen write transactions per second (with a few hot databases and many cold ones). Before Verneuil, this approach offered adequate performance and availability. However, things could be better, and that's why we wrote Verneuil: to distribute logic that can work with slightly stale read replicas and to simplify our disaster recovery playbooks, without introducing new failure modes in single-node code that already works well enough.

In fact, making sure replicas are up to date is explicitly not a goal. Nevertheless, we find that once our backend reaches its steady state, less than 0.1% of write transactions take more than 5 seconds to replicate, and detect a replication lag of more than one minute for more rarely than once every million write. Of course, this all depends on the write load and the number of replicated databases on a machine or process. For example, we experience temporary spikes in replication lag whenever a service restarts and writes to a few hundred databases in rapid succession.

Data freshness is not a goal because Verneuil prioritises disaster avoidance over everything else. That's why we interpose a wait-free crash-safe replication buffer (implemented as files on disk) between the snapshot update logic, which must run synchronously with SQLite transaction commits, and the copier worker threads that upload snapshot data to remote blob stores. We trust this buffer to act as a "data diode" that architecturally shuts off feedback loops from the copier workers back to the SQLite VFS (i.e., back to the application). Crucially, the amount of buffered data for a given SQLite data base is bounded to a multiple of that database file's size, even when copiers are completely stuck. Even when the blob store is inaccessible or a target bucket misconfigured, local operations will not be interrupted by an ever-growing replication queue. The buffer is also updated without fsync calls that could easily impact the whole storage subsystem; Verneuil instead achieves crash safety by discarding all replication state after a reboot.

All too often, distributed solutions for scalability and availability end up introducing new catastrophic failure modes, and the result is a system that might offer resilience to rare (once a year or less) events like hardware failure or power loss, but does so by increasing complexity to a level such that unforeseen interactions between correctly functioning pieces of code regularly cause protracted customer impacting issues. Verneuil's conservative approach gives us some confidence that we can use it to improve the scalability and availability of our preexisting centralised systems without worsening the reliablity of everything that already works well enough.

Disaster avoidance includes bounding cloud costs. Verneuil can guarantee cost effectiveness for a wide range of update rate because it's always able to throttle the API calls that update data: the replication buffer will simply squash snapshots and always bound the replication data's footprint to four times the size of the source database file.

Regardless of the update pattern (frequency and number of databases), we can count on Verneuil to remain within our budget for replication: it will never average more than 30[^size-limit] API calls/replication target/second/process. Each call uploads either a chunk (64 KB for incompressible data, less if zstd is useful), or a manifest (16 bytes per 64 KB chunk in the source database file, so 512 KB for a 2 GB file).

[^size-limit]: This hardcoded limit, coupled with the patrol logic that "touches" every extant chunk once a day, limits the total size of replicated databases for a single process: the replication logic may break down around 20-30 GB, but local operations should not be affected, except for the bounded growth in buffered replication data. That's not an issue for us because we only store metadata in SQLite, metadata that tends to be orders of magnitude smaller than the data.

Chunks can be reaped by a retention rule that deletes them after a week of inactivity (Verneuil attempts to "touch" useful chunks once a day), so, even when there's a lot of churn, a chunk upload to a standard bucket in US East costs at most $5e-6 + 64 K/1 GB * $0.023 / 4 (weeks per month) < $6e-6.

Manifests for multi-GB databases can be much larger, but manifest updates are throttled to less than one per second per database, and manifest blobs can be deleted more aggressively (e.g., as soon as a version becomes stale). With a 24h retention rule, uploading the manifest for a 2 GB database adds up to less than $6e-6 for the API call and churned storage.

We could also take into account storage costs for the permanent footprint of the replicated databases ($0.023/GB/month for standard buckets in US East) to this upper bound, but that's usually dominated by API costs.

At an average rate of 30 upload/replication target/second/process, the cost of churned data thus adds up to less than $15.55/replication target/day/process. There is usually only one replication target and one replicating process per machine, so this translates into a maximum of $15.55/day/machine (comparable to a c5.4xlarge). In practice, the average daily cost for Backtrace's backend fleet (millions of writes a day scattered across a few thousand databases) is on the order of $40/day.

How is it tested?

In addition to simply running this in production to confirm that regular single-node operations still work and that the code correctly paces its API calls, we use SQLite's open source regression test suite, after replacing the default Unix VFS with Verneuil. Unfortunately, some tests assume WAL DBs are supported, so we have to disable them; some others inject failures to exercise SQLite's failure handling logic, those too must be disabled. The resulting test suite lives at https://github.com/pkhuong/sqlite/tree/pkhuong/no-wal-baseline

Configure a SQLite build directory from the mutilated test suite, then run verneuil/t/test.sh to build test executables that load the Verneuil VFS and make it the new default. The test script also spins up a local minio container for the Verneuil VFS.

In test mode, the VFS executes internal consistency checking code, and panics whenever it notices a spooling or replication failure.

The logic for read replicas can't piggyback on top of the SQLite test suite as easily. It is instead subjected to classic unit testing and manual smoke testing.

What's missing for 1.0

  • Configurability: most of the plumbing is there to configure individual SQLite connections, but the current implementation is geared towards a program linking directly against libverneuil and configuring it with C calls. We can already load Verneuil in SQLite by configuring it with an environment variable (which matches the current global configuration structure), but we should add support for reading configuration data from the connection string (SQLite query string parameters).

Things we should do after 1.0

  1. We currently always create the journal file in 0644. Umask applies, but it would make sense to implement the same logic as SQLite's Unix VFS and inherit the main db file's permissions.

  2. Many filesystems now support copy-on-write; we should think about using that for the commit step, instead of a journal file!

  3. The S3 client library is really naive. We should reuse HTTPS connections.

  4. Consider some way to get chunks without indirecting through S3. Could gossip promising chunks ahead of time, or simply serve them on demand over request-response like HTTP.

Dependencies

~24–35MB
~740K SLoC