3 releases

new 0.1.7	Apr 23, 2025
0.1.6	Apr 21, 2025
0.1.5	Apr 8, 2025

MIT license

VBINSEQ

VBINSEQ is a high-performance binary file format for nucleotides.

It is a variant of the BINSEQ file format with support for variable length records and quality scores.

It is a block-based format that supports parallel compression and decompression, with random access to record blocks.

This is a Rust library for reading and writing VBINSEQ files. For a command-line interface, see bqtools.

Notice

This project is no longer under development in this repository. It has been pulled into the main binseq repository. It is archived here for the time being, but will be removed in the future.

Overview

At a high level, VBINSEQ is a variant of BINSEQ with fixed-size record blocks instead of fixed-size records.

Each record block consists of repeated records, each of which holds at least one nucleotide sequence. Each record can optionally have an extended sequence (a paired sequence) and associated quality scores. Importantly, records cannot span block boundaries, so all blocks are independent.

All blocks have the same virtual (uncompressed) size and are independent of one another, so they can be compressed and decompressed independently. VBINSEQ tracks both the compressed and uncompressed size of each block in block headers, which can then be indexed for random block access.

Structure

The file begins with a FILE HEADER which provides a description of the configuration. The remaining bytes of the file are repeated RECORD BLOCKS.

Each RECORD BLOCK is composed of three parts:

BLOCK HEADER: Provides metadata on the associated block (is always uncompressed)
BLOCK DATA: Repeating complete VBINSEQ RECORDs (optionally ZSTD compressed).
BLOCK PADDING: Repeated null bytes that fill the remainder of the block so that every block has the same virtual (uncompressed) size.
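
The interaction between fixed-size blocks, whole records, and padding can be sketched as follows. This is only an illustration, not the library's writer: the function name `plan_blocks` is hypothetical, it assumes records are placed greedily in order, and it assumes the configured block size covers block data plus padding but not the block header.

```rust
/// Greedily assign records (given by their uncompressed byte sizes) to
/// fixed-size blocks so that no record spans a block boundary.
///
/// Returns one `(record_count, padding_bytes)` pair per block, or `None`
/// if a single record is larger than a whole block.
///
/// Assumption: `block_size` covers block data plus padding only.
fn plan_blocks(block_size: u64, record_sizes: &[u64]) -> Option<Vec<(u32, u64)>> {
    let mut blocks = Vec::new();
    let mut used = 0u64; // bytes of record data placed in the current block
    let mut count = 0u32; // records placed in the current block

    for &size in record_sizes {
        if size > block_size {
            return None; // this record can never fit in any block
        }
        if used + size > block_size {
            // Close the current block; the remainder becomes null padding.
            blocks.push((count, block_size - used));
            used = 0;
            count = 0;
        }
        used += size;
        count += 1;
    }
    if count > 0 {
        blocks.push((count, block_size - used));
    }
    Some(blocks)
}
```

For example, with a block size of 100 bytes and records of 40, 40, and 40 bytes, the first block would hold two records plus 20 bytes of padding, and the third record would start a new block.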

Each VBINSEQ RECORD is composed of two parts: RECORD PREAMBLE, RECORD DATA

RECORD PREAMBLE: Contains record metadata.
RECORD DATA: Contains the encoded primary and extended sequences as well as the quality scores.

Description

All binary encoding is little-endian unless specifically noted otherwise.

FILE HEADER

Field	Type	Size (bytes)	Position (bytes)	Description
magic	u32	4	0	A magic number to specify the file format (VSEQ)
format	u8	1	4	Version of the file format
block	u64	8	5	Size of all blocks in bytes (virtual memory)
qual	bool	1	13	Whether quality scores are included on each sequence
compressed	bool	1	14	Whether blocks are ZSTD compressed
paired	bool	1	15	Whether records are paired sequences
reserved	u8	16	16	Reserved bytes in case of future extensions

Total size: 32 bytes
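
As a rough illustration of the layout above, the 32-byte header could be decoded like this. The struct and function names are hypothetical and are not taken from the crate's API. The sketch assumes that a `bool` field is stored as a single byte where any non-zero value means true, and it returns the magic number as a raw `u32` without validating it.

```rust
use std::convert::TryInto;

/// Size of the FILE HEADER in bytes, per the table above.
const FILE_HEADER_SIZE: usize = 32;

/// Illustrative in-memory view of the FILE HEADER (names are hypothetical).
#[derive(Debug, PartialEq)]
struct FileHeader {
    magic: u32,
    format: u8,
    block: u64,
    qual: bool,
    compressed: bool,
    paired: bool,
    reserved: [u8; 16],
}

/// Decode the header fields at the byte positions listed in the table.
/// All multi-byte integers are read as little-endian.
fn parse_file_header(buf: &[u8; FILE_HEADER_SIZE]) -> FileHeader {
    let mut reserved = [0u8; 16];
    reserved.copy_from_slice(&buf[16..32]);
    FileHeader {
        magic: u32::from_le_bytes(buf[0..4].try_into().unwrap()),
        format: buf[4],
        block: u64::from_le_bytes(buf[5..13].try_into().unwrap()),
        // Assumption: any non-zero byte is treated as `true`.
        qual: buf[13] != 0,
        compressed: buf[14] != 0,
        paired: buf[15] != 0,
        reserved,
    }
}
```

A real reader would also compare `magic` and `format` against the expected values before trusting the remaining fields.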

BLOCK HEADER

Field	Type	Size (bytes)	Position (bytes)	Description
magic	u64	8	0	A magic number to validate format (BLOCKSEQ)
size	u64	8	8	Actual size of the block in bytes (may differ from the block size configured in the file header, depending on compression)
records	u32	4	16	Number of records in block
reserved	u8	12	20	Reserved bytes in case of future extensions

Total size: 32 bytes
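
Because each block header records the actual size of its block, a reader can hop from header to header and build an index for random access without decompressing anything. The sketch below shows one way this might look. The names are hypothetical, and it assumes that `size` counts the bytes stored after the 32-byte header; if the format instead counts the header in `size`, the offset arithmetic would need adjusting.

```rust
use std::convert::TryInto;

/// Size of the BLOCK HEADER in bytes, per the table above.
const BLOCK_HEADER_SIZE: usize = 32;

/// Illustrative view of the BLOCK HEADER (reserved bytes are skipped).
#[derive(Debug, PartialEq)]
struct BlockHeader {
    magic: u64,
    size: u64,
    records: u32,
}

/// Decode the little-endian header fields at their documented positions.
fn parse_block_header(buf: &[u8; BLOCK_HEADER_SIZE]) -> BlockHeader {
    BlockHeader {
        magic: u64::from_le_bytes(buf[0..8].try_into().unwrap()),
        size: u64::from_le_bytes(buf[8..16].try_into().unwrap()),
        records: u32::from_le_bytes(buf[16..20].try_into().unwrap()),
    }
}

/// Walk the bytes that follow the file header and collect
/// `(header_offset, header)` for every complete block.
///
/// Assumption: `size` is the number of bytes stored after the header.
fn index_blocks(body: &[u8]) -> Vec<(usize, BlockHeader)> {
    let mut index = Vec::new();
    let mut offset = 0usize;
    while let Some(raw) = body.get(offset..offset + BLOCK_HEADER_SIZE) {
        let header = parse_block_header(raw.try_into().unwrap());
        let next = offset
            .saturating_add(BLOCK_HEADER_SIZE)
            .saturating_add(header.size as usize);
        if next > body.len() {
            break; // truncated block: stop indexing here
        }
        index.push((offset, header));
        offset = next;
    }
    index
}
```

With such an index, a block can be read by seeking to its offset, skipping the header, and reading `size` bytes, which is what makes independent, parallel decompression possible.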

VBINSEQ RECORD

Field	Type	Size (bytes)	Description
flag	u64	8	A binary flag for the record
slen	u64	8	The length of the primary sequence in record (basepairs)
xlen	u64	8	The length of the extended sequence in record (0 if not paired)
sbuf	[u64]	ceil(slen / 32)	Encoded primary sequence
squal	[u8]	qual ? slen : 0	Associated quality scores of primary sequence (no bytes if not tracking quality)
xbuf	[u64]	paired ? ceil(xlen / 32) : 0	Encoded extended sequence (no bytes if not paired)
xqual	[u8]	qual & paired ? xlen : 0	Associated quality scores of extended sequence (no bytes if not paired or not tracking quality)

Total size: 24 + x bytes

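x = 8 * (sbuf + xbuf) + (squal + xqual), where sbuf and xbuf are the number of u64 words in each encoded sequence (32 bases per word) and squal and xqual are byte counts.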
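
The size formula can be written out as a small function. This is a sketch based on the table above rather than code from the crate; the function names are hypothetical, and `qual` and `paired` stand for the corresponding flags in the file header.

```rust
/// Bytes taken by the fixed fields `flag`, `slen`, and `xlen` (3 x u64).
const PREAMBLE_BYTES: u64 = 24;

/// Number of u64 words needed for `len` bases at 32 bases per word,
/// i.e. ceil(len / 32), computed without risk of overflow.
fn encoded_words(len: u64) -> u64 {
    len / 32 + u64::from(len % 32 != 0)
}

/// Total size in bytes of one VBINSEQ record: 24 + x, where
/// x = 8 * (sbuf + xbuf) + (squal + xqual).
fn record_size(slen: u64, xlen: u64, qual: bool, paired: bool) -> u64 {
    let sbuf = encoded_words(slen);
    let xbuf = if paired { encoded_words(xlen) } else { 0 };
    let squal = if qual { slen } else { 0 };
    let xqual = if qual && paired { xlen } else { 0 };
    PREAMBLE_BYTES + 8 * (sbuf + xbuf) + (squal + xqual)
}
```

For a single 150-base read without quality scores, the sequence needs ceil(150 / 32) = 5 words, so the record takes 24 + 8 * 5 = 64 bytes. A 150 + 150 paired read with quality scores takes 24 + 8 * 10 + 300 = 404 bytes.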
