Version 0.1.5, released Apr 8, 2025 (1 unstable release)
VBINSEQ
VBINSEQ is a high-performance binary file format for nucleotides.
It is a variant of the BINSEQ file format with support for variable-length records and quality scores.
It is a block-based file format that supports parallel compression and decompression, with random access to record blocks.
This is a Rust library for reading and writing VBINSEQ files. For a command-line interface, see bqtools.
Overview
At a high level, VBINSEQ is a variant of BINSEQ with fixed-size record blocks instead of fixed-size records.
Each record block is composed of repeating records, each of which contains at minimum a single nucleotide sequence. Each record can optionally have an extended sequence (a paired sequence) and associated quality scores. Importantly, records cannot span block boundaries, so all blocks are independent.
All blocks have the same (virtual) size and are independent, so they can be compressed and decompressed independently. VBINSEQ tracks both the compressed and uncompressed size of each block in block headers, which can then be indexed for random block access.
Structure
The file begins with a FILE HEADER which provides a description of the configuration. The remaining bytes of the file are repeated RECORD BLOCKS.
Each RECORD BLOCK is composed of three parts:
- BLOCK HEADER: Provides metadata on the associated block (is always uncompressed)
- BLOCK DATA: Repeating complete VBINSEQ RECORDs (optionally ZSTD compressed).
- BLOCK PADDING: Repeated null bytes to keep the virtual (uncompressed) memory of each block equivalent.
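The padding rule can be illustrated with a minimal sketch. It assumes records are packed greedily in input order and that `block_size` is the capacity available to BLOCK DATA plus BLOCK PADDING. Both are assumptions made for illustration, and `pack_records` is a hypothetical helper, not part of this library's API.

```rust
/// Hypothetical helper: greedily packs records (given by their byte sizes)
/// into fixed-size blocks and reports, per block, the number of records
/// and the number of null padding bytes needed to fill the block.
///
/// ASSUMPTION: `block_size` is the space available for BLOCK DATA plus
/// BLOCK PADDING, and records are written in input order.
/// Returns `None` if any single record is larger than a block, since
/// records cannot span block boundaries.
fn pack_records(record_sizes: &[u64], block_size: u64) -> Option<Vec<(u32, u64)>> {
    let mut blocks = Vec::new();
    let mut count = 0u32; // records in the block being filled
    let mut used = 0u64; // bytes used in the block being filled

    for &size in record_sizes {
        if size > block_size {
            return None;
        }
        // The record does not fit: close the current block with padding.
        if used + size > block_size {
            blocks.push((count, block_size - used));
            count = 0;
            used = 0;
        }
        count += 1;
        used += size;
    }
    // Close the final, partially filled block.
    if count > 0 {
        blocks.push((count, block_size - used));
    }
    Some(blocks)
}
```

For example, three 56-byte records in 128-byte blocks give two records and 16 padding bytes in the first block, then one record and 72 padding bytes in the second.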
Each VBINSEQ RECORD is composed of two parts, a RECORD PREAMBLE and RECORD DATA:
- RECORD PREAMBLE: Contains record metadata.
- RECORD DATA: Contains the encoded primary and extended sequences as well as the quality scores.
Description
All binary encoding is little-endian unless specifically noted otherwise.
FILE HEADER
Field | Type | Size (bytes) | Position (bytes) | Description |
---|---|---|---|---|
magic | u32 | 4 | 0 | A magic number to specify the file format (VSEQ) |
format | u8 | 1 | 4 | Version of the file format |
block | u64 | 8 | 5 | Size of all blocks in bytes (virtual memory) |
qual | bool | 1 | 13 | Whether quality scores are included on each sequence |
compressed | bool | 1 | 14 | Whether blocks are ZSTD compressed |
paired | bool | 1 | 15 | Whether records are paired sequences |
reserved | u8 | 16 | 16 | Reserved bytes in case of future extensions |
Total size: 32 bytes
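As a rough illustration of the layout in the table above, the following sketch decodes the 32-byte FILE HEADER from a byte slice. The `FileHeader` struct and `parse_file_header` function are hypothetical names rather than this library's API, and the sketch assumes the magic number is the ASCII bytes `VSEQ` read as a little-endian `u32`.

```rust
/// Size of the FILE HEADER in bytes, per the table above.
const FILE_HEADER_SIZE: usize = 32;

/// ASSUMPTION: the magic number is the ASCII bytes "VSEQ" read as a
/// little-endian u32. The table gives only the name, not the encoding.
const FILE_MAGIC: u32 = u32::from_le_bytes(*b"VSEQ");

/// Hypothetical in-memory view of the FILE HEADER (reserved bytes omitted).
#[derive(Debug, PartialEq)]
struct FileHeader {
    format: u8,
    block: u64,
    qual: bool,
    compressed: bool,
    paired: bool,
}

/// Reads a little-endian u64 starting at `pos`.
fn read_u64_le(buf: &[u8], pos: usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&buf[pos..pos + 8]);
    u64::from_le_bytes(bytes)
}

/// Decodes the header fields at the positions listed in the table.
fn parse_file_header(buf: &[u8]) -> Result<FileHeader, String> {
    if buf.len() < FILE_HEADER_SIZE {
        return Err(format!("need {} bytes, got {}", FILE_HEADER_SIZE, buf.len()));
    }
    let mut magic = [0u8; 4];
    magic.copy_from_slice(&buf[0..4]);
    let magic = u32::from_le_bytes(magic);
    if magic != FILE_MAGIC {
        return Err(format!("unexpected magic number: {:#x}", magic));
    }
    Ok(FileHeader {
        format: buf[4],
        block: read_u64_le(buf, 5),
        qual: buf[13] != 0,
        compressed: buf[14] != 0,
        paired: buf[15] != 0,
        // bytes 16..32 are reserved and ignored here
    })
}
```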
BLOCK HEADER
Field | Type | Size (bytes) | Position (bytes) | Description |
---|---|---|---|---|
magic | u64 | 8 | 0 | A magic number to validate format (BLOCKSEQ) |
size | u64 | 8 | 8 | Actual size of the block in bytes (can differ from the block size configured in the file header, depending on compression status) |
records | u32 | 4 | 16 | Number of records in block |
reserved | u8 | 12 | 20 | Reserved bytes in case of future extensions |
Total size: 32 bytes
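Because every block header records the actual size of its block, a reader can hop from header to header and build an offset index for random access without decompressing anything. The sketch below shows one way this might look over an in-memory buffer. `BlockHeader`, `parse_block_header` and `index_blocks` are hypothetical names. It also assumes that the magic number is the ASCII bytes `BLOCKSEQ` read as a little-endian `u64`, that `size` counts the bytes stored directly after the block header, and that the first block starts right after the 32-byte FILE HEADER.

```rust
const FILE_HEADER_SIZE: usize = 32;
const BLOCK_HEADER_SIZE: usize = 32;

/// ASSUMPTION: the magic number is the ASCII bytes "BLOCKSEQ" read as a
/// little-endian u64.
const BLOCK_MAGIC: u64 = u64::from_le_bytes(*b"BLOCKSEQ");

/// Hypothetical in-memory view of a BLOCK HEADER (reserved bytes omitted).
#[derive(Debug, PartialEq)]
struct BlockHeader {
    size: u64,
    records: u32,
}

/// Decodes a block header from the start of `buf`, or returns `None`
/// if the buffer is too short or the magic number does not match.
fn parse_block_header(buf: &[u8]) -> Option<BlockHeader> {
    if buf.len() < BLOCK_HEADER_SIZE {
        return None;
    }
    let mut word = [0u8; 8];
    word.copy_from_slice(&buf[0..8]);
    if u64::from_le_bytes(word) != BLOCK_MAGIC {
        return None;
    }
    word.copy_from_slice(&buf[8..16]);
    let size = u64::from_le_bytes(word);
    let mut half = [0u8; 4];
    half.copy_from_slice(&buf[16..20]);
    let records = u32::from_le_bytes(half);
    Some(BlockHeader { size, records })
}

/// Walks the file and returns (byte offset of block header, header) pairs.
/// ASSUMPTION: `size` is the number of bytes stored after each header.
fn index_blocks(file: &[u8]) -> Vec<(usize, BlockHeader)> {
    let mut index = Vec::new();
    let mut pos = FILE_HEADER_SIZE; // skip the FILE HEADER
    while pos < file.len() {
        let header = match parse_block_header(&file[pos..]) {
            Some(h) => h,
            None => break, // truncated or corrupt block: stop indexing
        };
        let next = pos
            .saturating_add(BLOCK_HEADER_SIZE)
            .saturating_add(header.size as usize);
        index.push((pos, header));
        pos = next;
    }
    index
}
```

With such an index, a reader could seek directly to the n-th block and decode it independently of all others.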
VBINSEQ RECORD
Field | Type | Size (bytes) | Description |
---|---|---|---|
flag | u64 | 8 | A binary flag for the record |
slen | u64 | 8 | The length of the primary sequence in record (basepairs) |
xlen | u64 | 8 | The length of the extended sequence in record (0 if not paired) |
sbuf | [u64] | ceil(slen / 32) | Encoded primary sequence |
squal | [u8] | qual ? slen : 0 | Associated quality scores of primary sequence (no bytes if not tracking quality) |
xbuf | [u64] | paired ? ceil(xlen / 32) : 0 | Encoded extended sequence (no bytes if not paired) |
xqual | [u8] | qual & paired ? xlen : 0 | Associated quality scores of extended sequence (no bytes if not paired or not tracking quality) |
Total size: 24 + x bytes
x = 8 * (sbuf + xbuf) + (squal + xqual), where sbuf and xbuf are counted in u64 words (32 bases each) and squal and xqual in bytes
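The size formula can be checked with a small worked example. The function below is a hypothetical helper (not part of this library's API) that applies the table row by row: each `u64` word holds 32 bases, and quality scores take one byte per base.

```rust
/// Hypothetical helper: total size in bytes of one VBINSEQ RECORD,
/// following the table above. `qual` and `paired` mirror the flags
/// in the FILE HEADER.
fn record_size(slen: u64, xlen: u64, qual: bool, paired: bool) -> u64 {
    // Number of u64 words needed for `n` bases: ceil(n / 32).
    let words = |n: u64| (n + 31) / 32;

    let sbuf = words(slen);
    let xbuf = if paired { words(xlen) } else { 0 };
    let squal = if qual { slen } else { 0 };
    let xqual = if qual && paired { xlen } else { 0 };

    // 24 bytes of preamble (flag, slen, xlen), 8 bytes per encoded word,
    // 1 byte per quality score.
    24 + 8 * (sbuf + xbuf) + squal + xqual
}
```

For a paired record with a 100-base primary sequence, a 50-base extended sequence, and quality scores, this gives 24 + 8 * (4 + 2) + (100 + 50) = 222 bytes.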