jseqio

Reading and writing biological sequences in FASTA or FASTQ format

3 releases

new 0.1.2 Apr 26, 2024
0.1.1 Apr 26, 2024
0.1.0 Apr 26, 2024

#973 in Parser implementations


Used in unitig_flipper

MIT license

33KB
642 lines

This crate provides parsers for sequences in FASTA and FASTQ format.

Library design

In bioinformatics, sequences are usually stored in files in FASTA or FASTQ format, which are often compressed with gzip. This makes a total of four formats: FASTA with and without gzip, and FASTQ with and without gzip. The purpose of the crate is to provide a parser that can automatically detect the format of the file and parse it without the user having to know beforehand which format is being used. The file format is detected from the first two bytes of the file, and does not depend on the file extension.
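As a rough illustration of detection by leading bytes, the sketch below classifies a byte prefix. It is not the crate's actual code: the `Sniffed` enum and the `sniff` function are hypothetical names. It relies on two known facts: a gzip stream begins with the magic bytes 0x1f 0x8b, and FASTA and FASTQ headers begin with '>' and '@' respectively.

```rust
/// Hypothetical classification of the first bytes of an input stream.
#[derive(Debug, PartialEq, Eq)]
enum Sniffed {
    Gzip,
    Fasta,
    Fastq,
    Unknown,
}

/// Classify a stream from its leading bytes (illustrative only, not jseqio's code).
fn sniff(prefix: &[u8]) -> Sniffed {
    match prefix {
        // Every gzip stream starts with the two magic bytes 0x1f 0x8b.
        [0x1f, 0x8b, ..] => Sniffed::Gzip,
        // FASTA header lines start with '>'.
        [b'>', ..] => Sniffed::Fasta,
        // FASTQ header lines start with '@'.
        [b'@', ..] => Sniffed::Fastq,
        // Empty input or anything else.
        _ => Sniffed::Unknown,
    }
}

fn main() {
    println!("{:?}", sniff(&[0x1f, 0x8b, 0x08]));
    println!("{:?}", sniff(b">seq1\nACGT\n"));
    println!("{:?}", sniff(b"@read1\nACGT\n+\nIIII\n"));
}
```

For gzip input, the FASTA/FASTQ distinction can only be made on the decompressed bytes, so a real reader would presumably apply the same kind of check again after wrapping the stream in a decompressor.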

We use dynamic dispatch to hide the details of the file format from the user. This introduces an overhead of one dynamic dispatch per sequence, which is likely negligible unless the sequences are extremely short. This also allows us to support reading from any byte stream, such as the standard input, without having to attach generic parameters onto the parser. The interface is implemented for the struct reader::DynamicFastXReader. There is also reader::StaticFastXReader that takes the input stream as a generic parameter.
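The trade-off between the two reader styles can be sketched with the standard library alone. Everything in this example is hypothetical (the `HeaderSource` trait, `StaticReader`, `DynamicReader`, and `next_header`) and only mirrors the idea described above: the static variant carries the input type as a generic parameter, while the dynamic variant erases it behind a trait object and pays one virtual call per record.

```rust
use std::io::{BufRead, BufReader, Cursor, Read};

/// Hypothetical trait: one method call yields one record header.
trait HeaderSource {
    /// Return the next FASTA header without the leading '>', or None at end of input.
    fn next_header(&mut self) -> std::io::Result<Option<String>>;
}

/// Statically dispatched reader: the input type `R` is a generic parameter.
struct StaticReader<R: BufRead> {
    input: R,
}

impl<R: BufRead> HeaderSource for StaticReader<R> {
    fn next_header(&mut self) -> std::io::Result<Option<String>> {
        let mut line = String::new();
        loop {
            line.clear();
            // read_line returns 0 bytes at end of input.
            if self.input.read_line(&mut line)? == 0 {
                return Ok(None);
            }
            // Skip sequence lines; return only header lines.
            if let Some(header) = line.strip_prefix('>') {
                return Ok(Some(header.trim_end().to_string()));
            }
        }
    }
}

/// Dynamically dispatched reader: no generic parameter in the type.
struct DynamicReader {
    inner: Box<dyn HeaderSource>,
}

impl DynamicReader {
    /// Accept any owned byte stream and erase its concrete type.
    fn new<R: Read + 'static>(input: R) -> Self {
        let inner = StaticReader { input: BufReader::new(input) };
        DynamicReader { inner: Box::new(inner) }
    }

    fn next_header(&mut self) -> std::io::Result<Option<String>> {
        // One virtual call per record, regardless of the input type.
        self.inner.next_header()
    }
}

fn main() -> std::io::Result<()> {
    let data: &'static [u8] = b">a\nACGT\n>b desc\nGG\n";
    let mut reader = DynamicReader::new(Cursor::new(data));
    while let Some(header) = reader.next_header()? {
        println!("{}", header);
    }
    Ok(())
}
```

Because `DynamicReader` has no type parameters, the same type could be constructed from `std::io::stdin()` or from a file, and stored in a struct field or passed between functions without propagating generics.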

A sequence is represented with a record::RefRecord struct that points to slices in the internal buffers of the reader. This is to avoid allocating new memory for each sequence. There also exists record::OwnedRecord which owns the memory.
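The difference between a borrowed and an owning record can be sketched as follows. The struct names `BorrowedRecord` and `OwningRecord` and the method `to_owning` are hypothetical stand-ins, not the crate's definitions; the field names `head`, `seq` and `qual` follow the examples in this README.

```rust
/// Hypothetical borrowed record: slices into a buffer owned elsewhere.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BorrowedRecord<'a> {
    head: &'a [u8],
    seq: &'a [u8],
    qual: Option<&'a [u8]>, // None for FASTA
}

/// Hypothetical owning record: holds its own copies of the data.
#[derive(Debug, Clone, PartialEq, Eq)]
struct OwningRecord {
    head: Vec<u8>,
    seq: Vec<u8>,
    qual: Option<Vec<u8>>,
}

impl<'a> BorrowedRecord<'a> {
    /// Copy the borrowed slices into freshly allocated vectors.
    fn to_owning(&self) -> OwningRecord {
        OwningRecord {
            head: self.head.to_vec(),
            seq: self.seq.to_vec(),
            qual: self.qual.map(|q| q.to_vec()),
        }
    }
}

fn main() {
    // Stand-in for a reader's internal buffer.
    let buffer = b"read1ACGTIIII".to_vec();
    let rec = BorrowedRecord {
        head: &buffer[0..5],
        seq: &buffer[5..9],
        qual: Some(&buffer[9..13]),
    };
    // The owning copy stays valid after the buffer is gone.
    let owned = rec.to_owning();
    drop(buffer);
    println!("{}", String::from_utf8_lossy(&owned.seq));
}
```

The borrowed form costs no allocation per record but is tied to the lifetime of the buffer; the owning form allocates but can be stored or sent elsewhere freely.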

Since the readers stream over the data and each record borrows from the reader's internal buffers, we cannot implement the Rust Iterator trait: the items of an Iterator may not borrow from the iterator itself, so every item would have to remain valid after the next one has been read. To support iterators, we provide the seq_db::SeqDB struct, which concatenates all sequences, headers and quality values in memory and provides an iterator over them.
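A minimal sketch of both ideas, using only the standard library, is given here. `StreamingReader` and `ConcatDb` are hypothetical types, and the input is simplified to one sequence per line rather than real FASTA. The streaming reader reuses one buffer, so each record must be dropped before the next call; the in-memory store concatenates everything and can therefore hand out an ordinary `Iterator`.

```rust
/// Hypothetical streaming reader that reuses one buffer for every record.
struct StreamingReader<'d> {
    lines: std::str::Lines<'d>,
    buf: Vec<u8>,
}

impl<'d> StreamingReader<'d> {
    fn new(data: &'d str) -> Self {
        StreamingReader { lines: data.lines(), buf: Vec::new() }
    }

    /// The returned slice borrows from `self`, so it cannot outlive the
    /// next call. This is why `Iterator` cannot be implemented here.
    fn read_next(&mut self) -> Option<&[u8]> {
        let line = self.lines.next()?;
        self.buf.clear();
        self.buf.extend_from_slice(line.as_bytes());
        Some(self.buf.as_slice())
    }
}

/// Hypothetical in-memory store: all sequences concatenated, plus offsets.
struct ConcatDb {
    data: Vec<u8>,
    starts: Vec<usize>, // sequence i occupies data[starts[i]..starts[i + 1]]
}

impl ConcatDb {
    /// Drain the reader, copying every record into one contiguous buffer.
    fn from_reader(mut reader: StreamingReader<'_>) -> Self {
        let mut db = ConcatDb { data: Vec::new(), starts: vec![0] };
        while let Some(seq) = reader.read_next() {
            db.data.extend_from_slice(seq);
            db.starts.push(db.data.len());
        }
        db
    }

    /// Items borrow from the store, not from the iterator, so this is a
    /// regular `Iterator`.
    fn iter(&self) -> impl Iterator<Item = &[u8]> + '_ {
        self.starts.windows(2).map(move |w| &self.data[w[0]..w[1]])
    }
}

fn main() {
    let reader = StreamingReader::new("ACGT\nGG\nTTTAA\n");
    let db = ConcatDb::from_reader(reader);
    let total: usize = db.iter().map(|seq| seq.len()).sum();
    println!("Total sequence length: {}", total);
}
```

Storing offsets into one large buffer, rather than one `Vec` per sequence, keeps the number of allocations small even for millions of short sequences.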

Examples

Streaming all sequences in a file and printing them to the standard error stream.

use jseqio::reader::*;
fn main() -> Result<(), Box<dyn std::error::Error>>{
    // Reading from a FASTQ file. Also works for FASTA,
    // and seamlessly with/without gzip compression.
    let mut reader = DynamicFastXReader::from_file(&"tests/data/reads.fastq.gz")?;
    while let Some(rec) = reader.read_next().unwrap() {
        // Headers do not include the leading '>' in FASTA or '@' in FASTQ.
        eprintln!("Header: {}", std::str::from_utf8(rec.head)?);
        eprintln!("Sequence: {}", std::str::from_utf8(rec.seq)?);
        if let Some(qual) = rec.qual{
            // Quality values are present only in FASTQ files.
            eprintln!("Quality values: {}", std::str::from_utf8(qual)?);
        }
    }
    Ok(())
}

Loading sequences into memory and computing the total length using an iterator.

use jseqio::reader::DynamicFastXReader;
fn main() -> Result<(), Box<dyn std::error::Error>>{
    let reader = DynamicFastXReader::from_file(&"tests/data/reads.fna")?;
    let db = reader.into_db()?;
    let total_length = db.iter().fold(0_usize, |sum, rec| sum + rec.seq.len());
    eprintln!("Total sequence length: {}", total_length);
    Ok(())
}

Dependencies

~385KB