7 releases

0.1.0 Dec 9, 2023
0.0.6 Nov 28, 2023

MPL-2.0 license

135KB
3.5K SLoC

sqlite-collections

Rust collection types backed by sqlite database files.

This crate provides standard-library-like collections whose contents may be serialized and deserialized with arbitrary serializers. The interface is very similar to that of std::collections, with these characteristics:

  • You can persist your collections to disk without serializing and deserializing the whole set. Opening a very large collection and making a small change is very efficient compared to just using serde to load and dump the whole thing.
  • You can store very large structures, as big as your hard drive can handle, instead of just your memory. This handles many hundreds of gigabytes in the same way that plain SQLite does.
  • You can use transactions and savepoints to roll back changes to any of your collections, or even many of them together.
  • You can keep your collections across multiple files, with transactional integrity of them all together.

Portability and stability

ds::set::DSSet and ds::map::DSMap types depend on deterministic serialization. You aren't prevented from storing whatever is serializable in a Set, for instance, but keep in mind:

  • If you store something like a HashMap, its order can change between runs.
  • Some formats have multiple representations for the same data.
    • Ciborium (the cbor feature) does not support half-precision floating point yet. If support is added later, the same values could serialize differently, which would break determinism.
      • In this way, ciborium doesn't implement CBOR's deterministic encoding.
    • serde_json may escape or not escape some Unicode values in strings depending on its version, which can also break determinism.
    • Accessing the containers from another programming language that doesn't serialize the same way will cause issues.
  • Some serializers will serialize differently on different architectures. If your serializer doesn't behave consistently on different endianness, it will not be portable across these different architectures.
  • Some datatypes are different on different architectures, primarily usize and isize. Some serialization formats will encode these differently.
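
The ordering and width concerns above can be illustrated with a minimal, standard-library-only sketch. The byte layout and the function names (encode_entries, encode_hash_map_sorted, encode_hash_map_unsorted) are hypothetical and are not part of sqlite-collections or of any serde format. The sketch only shows why iterating a HashMap directly is unsafe for a deterministic encoding, and how sorting the entries and using fixed-width, fixed-endianness integers avoids the problem.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical hand-rolled encoding, for illustration only. Each entry is
/// written as: key length (u32, big-endian), key bytes, value (u64, big-endian).
/// Fixed widths and a fixed endianness keep the bytes identical across
/// architectures, unlike usize or native-endian encodings.
fn encode_entries<'a>(entries: impl Iterator<Item = (&'a String, &'a u64)>) -> Vec<u8> {
    let mut out = Vec::new();
    for (key, value) in entries {
        out.extend_from_slice(&(key.len() as u32).to_be_bytes());
        out.extend_from_slice(key.as_bytes());
        out.extend_from_slice(&value.to_be_bytes());
    }
    out
}

/// Deterministic: entries are sorted by key before they are encoded, so the
/// output depends only on the map's contents.
fn encode_hash_map_sorted(map: &HashMap<String, u64>) -> Vec<u8> {
    let sorted: BTreeMap<&String, &u64> = map.iter().collect();
    encode_entries(sorted.into_iter())
}

/// Not deterministic: HashMap iteration order can differ between runs, so two
/// equal maps may produce different bytes.
fn encode_hash_map_unsorted(map: &HashMap<String, u64>) -> Vec<u8> {
    encode_entries(map.iter())
}
```

Two maps with the same contents always produce the same bytes through encode_hash_map_sorted, whatever order the entries were inserted in. No such guarantee holds for encode_hash_map_unsorted.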

To be safe, do not update your serializers without thorough testing. Direct is always safe. postcard is probably the most reliable serde format you can use, as long as you do not depend on types that use the Hash trait to determine ordering, unless you can ensure a consistent order separately. cbor and json are useful for inter-language use, but keep the caveats in mind, and make sure that other languages serialize in exactly the same way.

In the future, a BTreeSet and BTreeMap will be added to support non-deterministic encodings. These will be slower, but will be guaranteed to match just based on Ord and Eq alone. This will not solve platform portability problems, but will solve deterministic serialization concerns.

Concurrency safety

Your collections might be open in a different thread or process through another connection. SAVEPOINT is used internally to prevent inconsistent states.

Performance

This library uses internal SAVEPOINTs to prevent inconsistent states of the database. To get the most performance without sacrificing reliability:

  • Use as large a transaction as you reasonably can around all operations.
  • Use an IMMEDIATE transaction when you know you will modify the database (upgrading transactions may deadlock and cause errors).
  • Switching the database into journal_mode = WAL with synchronous = NORMAL can give some performance gains when there is a lot of writing.
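
The advice above can be sketched without committing to a particular SQLite binding. In the block below, with_immediate_transaction and WAL_SETUP are hypothetical names, not part of sqlite-collections. The exec parameter is an assumed stand-in for whatever runs a single SQL statement on your connection (for example, a closure around rusqlite's Connection::execute_batch). Only the SQL text itself (BEGIN IMMEDIATE, COMMIT, ROLLBACK, and the two PRAGMAs) is standard SQLite.

```rust
/// Statements to run once after opening a connection, per the advice above.
const WAL_SETUP: [&str; 2] = [
    "PRAGMA journal_mode = WAL",
    "PRAGMA synchronous = NORMAL",
];

/// Runs `body` inside a single IMMEDIATE transaction.
///
/// `exec` is an assumed callback that executes one SQL statement. All work
/// done by `body` shares one transaction, so the library's internal
/// savepoints nest inside it instead of each committing on its own.
fn with_immediate_transaction<F, B, T, E>(mut exec: F, body: B) -> Result<T, E>
where
    F: FnMut(&str) -> Result<(), E>,
    B: FnOnce(&mut F) -> Result<T, E>,
{
    // Take the write lock up front rather than upgrading a read
    // transaction later.
    exec("BEGIN IMMEDIATE")?;
    match body(&mut exec) {
        Ok(value) => {
            exec("COMMIT")?;
            Ok(value)
        }
        Err(err) => {
            // Best-effort rollback; the original error is the one reported.
            let _ = exec("ROLLBACK");
            Err(err)
        }
    }
}
```

Grouping many collection operations inside one call to a helper like this gives a single large transaction. On failure, the whole group is rolled back together.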

Collections

Currently implemented

  • DSSet
    • A regular set, sorted lexicographically by the stored representation.
    • Requires deterministic serialization.
    • Not complete yet: most of the necessary functionality is present, but not all of it.
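
Because the set is sorted by the stored bytes rather than by Ord, iteration order depends on the serialization format. The sketch below is standard-library-only and illustrative: a BTreeSet of byte arrays stands in for the stored order, on the assumption that stored values are compared bytewise, as SQLite does for BLOBs. The function names are hypothetical. It shows that a big-endian integer encoding sorts numerically while a little-endian one does not.

```rust
use std::collections::BTreeSet;

/// Big-endian encoding: most significant byte first.
fn big_endian(value: u32) -> [u8; 4] {
    value.to_be_bytes()
}

/// Little-endian encoding: least significant byte first.
fn little_endian(value: u32) -> [u8; 4] {
    value.to_le_bytes()
}

/// Returns `values` in the order their encoded bytes sort lexicographically.
/// Byte arrays compare element by element, so the BTreeSet models a
/// collection ordered by stored representation.
fn order_by_encoding(values: &[u32], encode: fn(u32) -> [u8; 4]) -> Vec<u32> {
    let by_bytes: BTreeSet<([u8; 4], u32)> =
        values.iter().map(|&v| (encode(v), v)).collect();
    by_bytes.into_iter().map(|(_, v)| v).collect()
}
```

With these definitions, order_by_encoding(&[1, 256, 2], big_endian) yields [1, 2, 256], while the same call with little_endian yields [256, 1, 2].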

To be implemented

  • DSMap
    • A regular map, sorted lexicographically by the stored representation of the key.
    • Requires deterministic serialization.
  • BTreeSet
    • A set allowing non-deterministic serialization, as long as Ord is implemented. Should be quite a lot slower than a normal Set, as every comparison requires a full deserialization.
  • BTreeMap
    • A map allowing non-deterministic serialization.
  • List
    • A sequence ordered by an integer index.
    • This is not called Vec because it's not really an array and is not accessible as a contiguous slice.
  • Deque
    • A sequence ordered by an integer index, with efficient insertion and removal at both ends.

Can't I just use SQLite directly?

Yes. This is mostly to make it easy to efficiently interact with a SQLite-backed collection without having to think too hard about the SQL details, as well as making it not-too-painful to swap out an existing std::collections struct and keep the same functionality, when you need persistence and/or huge data without filling your RAM.

If you really just want a large, persistent, and/or transaction-safe set of collections and don't need any other RDBMS functionality, this library is a good choice.

Dependencies

~22MB
~415K SLoC