1 release (0 unstable)
1.0.0-beta.5 | Oct 22, 2024 |
---|
#1710 in Database interfaces
159 downloads per month
200KB
5K
SLoC
MemKvStore Documentation
MemKvStore use SSTable as backend. The SSTable (Sorted String Table) is a persistent data structure used for storing key-value pairs in a sorted manner. This document describes the binary format of the SSTable.
Overall Structure
The SSTable consists of the following sections:
┌─────────────────────────────────────────────────────────────────────────────────────────────────┐ │ MemKVStore │ │┌ ─ ─ ─ ─ ─ ─ ─┌ ─ ─ ─ ─ ─ ─ ─ ─┌ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─┌ ─ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ─┌ ─ ─ ─ ─ ─ ─ ┐│ │ Magic Number │ Schema Version │ Block Chunk ... │ Block Chunk Block Meta │ Meta Offset │ ││ u32 │ u8 │ bytes │ │ bytes │ bytes │ u32 ││ │ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ─ ─ │ └─────────────────────────────────────────────────────────────────────────────────────────────────┘
- Magic Number (4 bytes): A fixed value "LORO" to identify the file format.
- Schema Version (1 byte): The version of the MemKVStore schema.
- Block Chunks: A series of data blocks containing key-value pairs.
- Block Meta: Metadata for all blocks, including block offset, the first key of the block,
is_large
flag, and last key if not large. - Meta Offset (4 bytes): The offset of the Block Meta section from the beginning of the file.
Block Types
There are two types of blocks: Normal Blocks and Large Value Blocks.
Normal Block
Normal blocks store multiple key-value pairs with compressed keys.
┌────────────────────────────────────────────────────────────────────────────────────────────┐ │Block │ │┌ ─ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─┌ ─ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─ ─┌ ─ ─ ─┌ ─ ─ ─ ┬ ─ ─ ─ ─┌ ─ ─ ─ ─ ─ ─ ─ │ │ Key Value Chunk ... │Key Value Chunk offset │ ... │ offset kv len │Block Checksum│ │ ││ bytes │ │ bytes │ u16 │ │ u16 │ u16 │ u32 │ │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ┘─ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ─ ─ ┘ │ └────────────────────────────────────────────────────────────────────────────────────────────┘
Each Key Value Chunk is encoded as follows:
┌─────────────────────────────────────────────────────┐ │ Key Value Chunk │ │┌ ─ ─ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ─ ─┌ ─ ─ ─ ─ ─┬ ─ ─ ─ ┐│ │ common prefix len key suffix len│key suffix│ value ││ ││ u8 │ u16 │ bytes │ bytes ││ │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ┘─ ─ ─ ─┘│ └─────────────────────────────────────────────────────┘
Because the key of the first chunk is equal to the first key of the block, the first chunk can be simplified as: ┌────────────────────┐ │ Key Value Chunk │ │┌ ─ ─ ─ ─ ─┬ ─ ─ ─ ┐│ ││key suffix│ value ││ ││ bytes │ bytes ││ │ ─ ─ ─ ─ ─ ┘─ ─ ─ ─┘│ └────────────────────┘
Encoding:
- Compress key-value pairs data as Key Value Chunk.
- Write offsets for each key-value pair.
- Write the number of key-value pairs.
- By default, Compress the entire block using LZ4. If you set
compression_type
toNone
, it will not compress the block.- For now, there are two compression type:
None
andLZ4
.
- For now, there are two compression type:
- Calculate and append xxhash_32 checksum.
Decoding:
- Verify the xxhash_32 checksum.
- By default, Decompress the block using LZ4. If you set
compression_type
toNone
, it will not decompress the block. - Read the number of key-value pairs.
- Read offsets for each key-value pair.
- Parse individual key-value chunks.
Large Value Block
Large Value Blocks store a single key-value pair with a large value.
┌──────────────────────────┐ │Large Block │ │┌ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ─ ─ ─ │ │ value Block Checksum ││ ││ bytes │ u32 │ │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘│ └──────────────────────────┘
Encoding:
- Write the value bytes.
- Calculate and append xxhash_32 checksum.
Decoding:
- Verify the xxhash_32 checksum.
- Read the value bytes.
We need not encode the length of value, because we can get the whole Block by offset in meta.
Block Meta
The Block Meta section contains metadata for all blocks in the SSTable.
┌────────────────────────────────────────────────────────────┐ │ All Block Meta │ │┌ ─ ─ ─ ─ ─ ─ ─┌ ─ ─ ─ ─ ─ ─┌ ─ ─ ─┌ ─ ─ ─ ─ ─ ─┌ ─ ─ ─ ─ ─ │ │ block number │ Block Meta │ ... │ Block Meta │ checksum ││ ││ u32 │ bytes │ │ bytes │ u32 │ │ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ─ ┘─ ─ ─ ┘─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ┘│ └────────────────────────────────────────────────────────────┘
Each Block Meta entry is encoded as follows:
┌──────────────────────────────────────────────────────────────────────────────────────────┐ │ Block Meta │ │┌ ─ ─ ─ ─ ─ ─ ─┌ ─ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ─ ─┌ ─ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ─ ┐ │ │ block offset │ first key len first key block type │ last key len last key │ ││ u32 │ u16 │ bytes │ u8 │ u16(option) │bytes(option)│ │ │ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ └──────────────────────────────────────────────────────────────────────────────────────────┘
Encoding:
- Write the number of blocks.
- For each block, write its metadata (offset, first key, block type, and last key if not large).
- block type: the first bit for is_large, the next 7 bits for compression_type.
- Calculate and append xxhash_32 checksum.
Decoding:
- Read the number of blocks.
- For each block, read its metadata.
- Verify the xxhash_32 checksum.
Note: In this crate, the empty value is regarded as deleted. only [MemStoreIterator] will filter empty value. Other iterators will still return empty value.
Dependencies
~5–11MB
~111K SLoC