6 releases

0.2.3 Feb 27, 2024
0.2.2 Oct 27, 2023
0.2.1 Sep 25, 2023
0.2.0 Apr 27, 2023
0.1.1 Mar 4, 2023

#13 in #capacity

Download history 128/week @ 2024-02-22 44/week @ 2024-02-29 15/week @ 2024-03-07 14/week @ 2024-03-14 14/week @ 2024-03-28 6/week @ 2024-04-04 42/week @ 2024-04-18

62 downloads per month
Used in 2 crates

Apache-2.0

205KB
3K SLoC

Store abstraction for flash storage

Specification

The store provides a partial function from keys to values on top of a storage interface. The store total capacity depends on the size of the storage. Store updates may be bundled in transactions. Mutable operations are atomic, including when interrupted.

The store is flash-efficient in the sense that it uses the storage lifetime efficiently. For each page, all words are written at least once between erase cycles and all erase cycles are used. However, not all written words are user content: Lifetime is also consumed with metadata and compaction.

The store is extendable with other entries than key-values. It is essentially a framework providing access to the storage lifetime. The partial function is simply the most common usage and can be used to encode other usages.

Definitions

An entry is a pair of a key and a value. A key is a number between 0 and 4095. A value is a byte slice with a length between 0 and 1023 bytes (for large enough pages).

The store provides the following updates:

  • Given a key and a value, StoreUpdate::Insert updates the store such that the value is associated with the key. The values for other keys are left unchanged.
  • Given a key, StoreUpdate::Remove updates the store such that no value is associated with the key. The values for other keys are left unchanged. Additionally, if there was a value associated with the key, the value is wiped from the storage (all its bits are set to 0).

The store provides the following read-only operations:

  • Store::iter iterates through the store returning all entries exactly once. The iteration order is not specified but stable between mutable operations.
  • Store::capacity returns how many words can be stored before the store is full.
  • Store::lifetime returns how many words can be written before the storage lifetime is consumed.

The store provides the following mutable operations:

  • Given a set of independent updates, Store::transaction applies the sequence of updates.
  • Given a threshold, Store::clear removes all entries with a key greater or equal to the threshold.
  • Given a length in words, Store::prepare makes one step of compaction unless that many words can be written without compaction. This operation has no effect on the store but may still mutate its storage. In particular, the store has the same capacity but a possibly reduced lifetime.

A mutable operation is atomic if, when power is lost during the operation, the store is either updated (as if the operation succeeded) or left unchanged (as if the operation did not occur). If the store is left unchanged, lifetime may still be consumed.

The store relies on the following storage interface:

  • It is possible to read a byte slice. The slice won't span multiple pages.
  • It is possible to write a word slice. The slice won't span multiple pages.
  • It is possible to erase a page.
  • The pages are sequentially indexed from 0. If the actual underlying storage is segmented, then the storage layer should translate those indices to actual page addresses.

The store has a total capacity of C = (N - 1) × (P - 4) - M - 1 words, where:

  • P is the number of words per page
  • N is the number of pages
  • M is the maximum length in words of a value (256 for large enough pages)

The capacity used by each mutable operation is given below (a transient word only uses capacity during the operation):

Operation/Update Used capacity Freed capacity Transient capacity
StoreUpdate::Insert 1 + value length overwritten entry 0
StoreUpdate::Remove 0 deleted entry see below*
Store::transaction 0 + updates 0 + updates 1
Store::clear 0 deleted entries 0
Store::prepare 0 0 0

*0 if the update is alone in the transaction, otherwise 1.

The total lifetime of the store is below L = ((E + 1) × N - 1) × (P - 2) and above L - M words, where E is the maximum number of erase cycles. The lifetime is used when capacity is used, including transiently, as well as when compaction occurs. Compaction frequency and lifetime consumption are positively correlated to the store load factor (the ratio of used capacity to total capacity).

It is possible to approximate the cost of transient words in terms of capacity: L transient words are equivalent to C - x words of capacity where x is the average capacity (including transient) of operations.

Preconditions

The following assumptions need to hold, or the store may behave in unexpected ways:

  • A word can be written twice between erase cycles.
  • A page can be erased E times after the first boot of the store.
  • When power is lost while writing a slice or erasing a page, the next read returns a slice where a subset (possibly none or all) of the bits that should have been modified have been modified.
  • Reading a slice is deterministic. When power is lost while writing a slice or erasing a slice (erasing a page containing that slice), reading that slice repeatedly returns the same result (until it is overwritten or its page is erased).
  • To decide whether a page has been erased, it is enough to test if all its bits are equal to 1.
  • When power is lost while writing a slice or erasing a page, that operation does not count towards the limits. However, completing that write or erase operation would count towards the limits, as if the number of writes per word and number of erase cycles could be fractional.
  • The storage is only modified by the store. Note that completely erasing the storage is supported, essentially losing all content and lifetime tracking. It is preferred to use Store::clear with a threshold of 0 to keep the lifetime tracking.

The store properties may still hold outside some of those assumptions, but with an increasing chance of failure.

Implementation

We define the following constants:

  • E65535 the number of times a page can be erased.
  • 3 ≤ N < 64 the number of pages in the storage.
  • 8 ≤ P ≤ 1024 the number of words in a page.
  • Q = P - 2 the number of words in a virtual page.
  • M = min(Q - 1, 256) the maximum length in words of a value.
  • W = (N - 1) × Q - M the window size.
  • V = (N - 1) × (Q - 1) - M the virtual capacity.
  • C = V - N the user capacity.

We build a virtual storage from the physical storage using the first 2 words of each page:

  • The first word contains the number of times the page has been erased.
  • The second word contains the starting word to which this page is being moved during compaction.

The virtual storage has a length of (E + 1) × N × Q words and represents the lifetime of the store. (We reserve the last Q + M words to support adding emergency lifetime.) This virtual storage has a linear address space.

We define a set of overlapping windows of N × Q words at each Q-aligned boundary. We call i the window spanning from i × Q to (i + N) × Q. Only those windows actually exist in the underlying storage. We use compaction to shift the current window from i to i + 1, preserving the content of the store.

For a given state of the virtual storage, we define h_i as the position of the first entry of the window i. We call it the head of the window i. Because entries are at most M + 1 words, they can overlap on the next page only by M words. So we have i × Q ≤ h_i ≤ i × Q + M . Since there are no entries before the first page, we have h_0 = 0.

We define t_i as one past the last entry of the window i. If there are no entries in that window, we have t_i = h_i. We call t_i the tail of the window i. We define the compaction invariant as t_i - h_i ≤ V and the window invariant as t_i - h_i ≤ W. The compaction invariant may temporarily be broken during a sequence of (at most N - 1) compactions.

We define |x| as the capacity used before position x. We have |x| ≤ x. We define the capacity invariant as |t_i| - |h_i| ≤ C.

Using this virtual storage, entries are appended to the tail as long as there is both virtual capacity to preserve the compaction invariant and capacity to preserve the capacity invariant. When virtual capacity runs out, the first page of the window is compacted and the window is shifted.

Entries are identified by a prefix of bits. The prefix has to contain at least one bit set to zero to differentiate from the tail. Entries can be one of:

  • Padding: A word whose first bit is set to zero. The rest is arbitrary. This entry is used to mark words partially written after an interrupted operation as padding such that they are ignored by future operations.
  • Header: A word whose second bit is set to zero. It contains the following fields:
    • A bit indicating whether the entry is deleted.
    • A bit indicating whether the value is word-aligned and has all bits set to 1 in its last word. The last word of an entry is used to detect that an entry has been fully written. As such it must contain at least one bit equal to zero.
    • The key of the entry.
    • The length in bytes of the value. The value follows the header. The entry is word-aligned if the value is not.
    • The checksum of the first and last word of the entry.
  • Erase: A word used during compaction. It contains the page to be erased and a checksum.
  • Clear: A word used during the clear operation. It contains the threshold and a checksum.
  • Marker: A word used during a transaction. It contains the number of updates following the marker and a checksum.
  • Remove: A word used inside a transaction. It contains the key of the entry to be removed and a checksum.

Checksums are the number of bits equal to 0.

Proofs

Compaction

Let I be a window at which all invariants hold. We will show that the next N - 1 compactions will preserve the window invariant (the capacity invariant is trivially preserved) after each compaction. We will also show that after N - 1 compactions, the compaction invariant is restored.

We consider all notations on the virtual storage after the full compaction. We will use the |x| notation although we update the state of the virtual storage. This is fine because compaction doesn't change the status of an existing word.

We first show that after each compaction, the window invariant is preserved.

∀(1 ≤ i ≤ N - 1)   t_{I + i} - h_{I + i}  ≤  W

We assume i between 1 and N - 1.

One step of compaction advances the tail by how many words were used in the first page of the window with the last entry possibly overlapping on the next page.

∀j   t_{j + 1}  =  t_j + |h_{j + 1}| - |h_j| + 1

By induction, we have:

t_{I + i}  =  t_I + |h_{I + i}| - |h_I| + i

We have the following properties:

t_I  ≤  h_I + V
|h_{I + i}| - |h_I|  ≤  h_{I + i} - h_I

Replacing into our previous equality, we can conclude:

t_{I + i}  =  t_I + |h_{I + i}| - |h_I| + i
           ≤  h_I + V + h_{I + 1} - h_I + i
iff
t_{I + i} - h_{I + 1}  ≤  V + i
                       ≤  V + N - 1
                       =  W

An important corollary is that the tail stays within the window:

t_{I + i}  ≤  (I + i + N - 1) × Q

We have the following property:

h_{I + i}  ≤  (I + i) × Q + M

From which we conclude with the definition of W:

t_{I + i}  ≤  h_{I + i} + W
           ≤  (I + i) × Q + M + (N - 1) × Q - M
           =  (I + i + N - 1) × Q

We finally show that after N - 1 compactions, the compaction invariant is restored. In particular, the remaining capacity is available without compaction.

V - (t_{I + N - 1} - h_{I + N - 1})  ≥  C - (|t_{I + N - 1}| - |h_{I + N - 1}|) + 1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~   ~
        immediate capacity                        remaining capacity              |
                                                                        reserved for clear

We can replace the definition of C and simplify:

V - (t_{I + N - 1} - h_{I + N - 1})  ≥  V - N - (|t_{I + N - 1}| - |h_{I + N - 1}|) + 1
iff  t_{I + N - 1} - h_{I + N - 1}  ≤  |t_{I + N - 1}| - |h_{I + N - 1}| + N - 1

We have the following properties:

t_{I + N - 1}  =  t_I + |h_{I + N - 1}| - |h_I| + N - 1
|t_{I + N - 1}| - |h_{I + N - 1}|  =  |t_I| - |h_I|
|h_{I + N - 1}| - |t_I|  ≤  h_{I + N - 1} - t_I

From which we conclude:

     t_{I + N - 1} - h_{I + N - 1}  ≤  |t_{I + N - 1}| - |h_{I + N - 1}| + N - 1
iff  t_I + |h_{I + N - 1}| - |h_I| + N - 1 - h_{I + N - 1}  ≤  |t_I| - |h_I| + N - 1
iff  t_I + |h_{I + N - 1}| - h_{I + N - 1}  ≤  |t_I|
iff  |h_{I + N - 1}| - |t_I|  ≤  h_{I + N - 1} - t_I

Checksum

The main property we want is that all partially written/erased words are either the initial word, the final word, or invalid.

We say that a bit sequence TARGET is reachable from a bit sequence SOURCE if both have the same length and SOURCE & TARGET == TARGET where & is the bitwise AND operation on bit sequences of that length. In other words, when SOURCE has a bit equal to 0 then TARGET also has that bit equal to 0.

The only written entries start with 101 or 110 and are written from an erased word. Marking an entry as padding or deleted is a single bit operation, so the property trivially holds. For those cases, the proof relies on the fact that there is exactly one bit equal to 0 in the 3 first bits. Either the 3 first bits are still 111 in which case we expect the remaining bits to be equal to 1. Otherwise we can use the checksum of the given type of entry because those 2 types of entries are not reachable from each other. Here is a visualization of the partitioning based on the first 3 bits:

First 3 bits Description How to check
111 Erased word All bits set to 1
101 User entry Contains a checksum
110 Internal entry Contains a checksum
100 Deleted user entry No check, atomically written
0?? Padding entry No check, atomically written

To show that valid entries of a given type are not reachable from each other, we show 3 lemmas:

  1. A bit sequence is not reachable from another if its number of bits equal to 0 is smaller.
  2. A bit sequence is not reachable from another if they have the same number of bits equals to 0 and are different.
  3. A bit sequence is not reachable from another if it is bigger when they are interpreted as numbers in binary representation.

From those lemmas we consider the 2 cases. If both entries have the same number of bits equal to 0, they are either equal or not reachable from each other because of the second lemma. If they don't have the same number of bits equal to 0, then the one with less bits equal to 0 is not reachable from the other because of the first lemma and the one with more bits equal to 0 is not reachable from the other because of the third lemma and the definition of the checksum.

Fuzzing

For any sequence of operations and interruptions starting from an erased storage, the store is checked against its model and some internal invariant at each step.

For any sequence of operations and interruptions starting from an arbitrary storage, the store is checked not to crash.

No runtime deps