#file-format #sequence #genome #gene #file-reader #2bit

twobit

Pure Rust implementation of the TwoBit sequence file format

4 releases

0.2.1 Jun 26, 2022
0.2.0 Nov 25, 2021
0.1.1 Sep 28, 2020
0.1.0 Aug 4, 2020

#762 in Parser implementations


Used in 3 crates

MIT license

69KB
1.5K SLoC

twobit

Efficient 2bit file reader, implemented in pure Rust.

Build Latest Version Documentation twobit: rustc 1.51+ MIT

The 2bit file format is used to store genomic sequences on disk. It allows for fast access to specific parts of the genome.

This crate is inspired by py2bit and tries to offer somewhat similar functionality with no C-dependency, no external crate dependencies, and great performance. It follows 2 bit specification version 0.

Examples

use twobit::TwoBitFile;

let mut tb = TwoBitFile::open("assets/foo.2bit")?;
assert_eq!(tb.chrom_names(), &["chr1", "chr2"]);
assert_eq!(tb.chrom_sizes(), &[150, 100]);
let expected_seq = "NNACGTACGTACGTAGCTAGCTGATC";
assert_eq!(tb.read_sequence("chr1", 48..74)?, expected_seq);

All sequence-related methods expect range argument; one can pass .. (unbounded range) in order to query the entire sequence:

assert_eq!(tb.read_sequence("chr1", ..)?.len(), 150);

Files can be fully cached in memory in order to provide fast random access and avoid any IO operations when decoding:

let mut tb_mem = TwoBitFile::open_and_read("assets/foo.2bit")?;
let expected_seq = tb.read_sequence("chr1", ..)?;
assert_eq!(tb_mem.read_sequence("chr1", ..)?, expected_seq);

2bit files offer two types of masks: N masks (aka hard masks) for unknown or arbitrary nucleotides, and soft masks for lower-case nucleotides (e.g. "t" instead of "T").

Hard masks are always enabled; soft masks are disabled by default, but can be enabled manually:

let mut tb_soft = tb.enable_softmask(true);
let expected_seq = "NNACGTACGTACGTagctagctGATC";
assert_eq!(tb_soft.read_sequence("chr1", 48..74)?, expected_seq);

No runtime deps