#bio #fasta #fastq #parser

seq_io

Fast FASTA, FASTQ and FASTX parsing

11 releases

0.4.0-alpha.0 Oct 19, 2020
0.3.0 Aug 21, 2018
0.2.4 Apr 16, 2018
0.2.3 Jan 13, 2018
0.1.0 Jul 31, 2017

#1 in #bio

164 downloads per month
Used in fewer than 6 crates

MIT license

250KB
5K SLoC

FASTA and FASTQ parsing and writing in Rust.

Note: the master branch contains the development for version 0.4.0. The currently stable version is on a separate branch.

This library provides readers for the following sequence formats:

  • FASTA
  • FASTQ (including multi-line FASTQ)
  • "FASTX": Automatic recognition of the sequence format (either FASTA or FASTQ)
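One common way to tell the two formats apart is to look at the first non-whitespace byte: FASTA records start with `>` and FASTQ records start with `@`. The sketch below illustrates that idea using only the standard library. `SeqFormat` and `guess_format` are hypothetical names made up for this example; they are not part of seq_io, and the crate's own detection logic may differ.

```rust
/// Formats distinguished by this sketch (hypothetical type, not from seq_io).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SeqFormat {
    Fasta,
    Fastq,
}

/// Guess the format from the first non-whitespace byte of the input.
/// Returns `None` if the input is empty or starts with any other byte.
fn guess_format(input: &[u8]) -> Option<SeqFormat> {
    match input.iter().copied().find(|b| !b.is_ascii_whitespace()) {
        Some(b'>') => Some(SeqFormat::Fasta),
        Some(b'@') => Some(SeqFormat::Fastq),
        _ => None,
    }
}

fn main() {
    println!("{:?}", guess_format(b">id1\nACGT\n"));
    println!("{:?}", guess_format(b"@id1\nACGT\n+\nIIII\n"));
    println!("{:?}", guess_format(b""));
}
```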

View documentation (for 0.4.0-alpha.x version)

Features

  • Fast readers that minimize the use of allocations and copying of memory
  • Flexible methods for writing FASTA and FASTQ
  • Informative errors with exact positional information
  • Support for recording the position and seeking back
  • Serde support (for owned data structures)
  • Functions for parallel processing
  • Thoroughly tested using fuzzing techniques
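To make the idea of recording a position and seeking back concrete, here is a minimal sketch that uses only the standard library. It notes the byte offset of every FASTA header line and later jumps back to one of them. `header_offsets` and `header_at` are hypothetical helpers written for this illustration; seq_io provides its own position API, which is not shown here.

```rust
use std::io::{self, BufRead, Cursor, Seek, SeekFrom};

/// Collect the byte offset of every line that starts with '>'.
/// Hypothetical helper, not part of seq_io.
fn header_offsets<R: BufRead>(mut reader: R) -> io::Result<Vec<u64>> {
    let mut offsets = Vec::new();
    let mut pos = 0u64;
    let mut line = Vec::new();
    loop {
        line.clear();
        let n = reader.read_until(b'\n', &mut line)?;
        if n == 0 {
            break; // end of input
        }
        if line.first() == Some(&b'>') {
            offsets.push(pos);
        }
        pos += n as u64;
    }
    Ok(offsets)
}

/// Seek to a previously recorded offset and return the line found there.
fn header_at<R: BufRead + Seek>(reader: &mut R, offset: u64) -> io::Result<String> {
    reader.seek(SeekFrom::Start(offset))?;
    let mut line = String::new();
    reader.read_line(&mut line)?;
    Ok(line.trim_end().to_string())
}

fn main() -> io::Result<()> {
    let data = b">seq1\nACGT\nACGT\n>seq2\nTTTT\n";
    let mut cursor = Cursor::new(&data[..]);
    let offsets = header_offsets(&mut cursor)?;
    // Jump back to the second record using its stored offset.
    let second = header_at(&mut cursor, offsets[1])?;
    println!("offsets = {:?}, second header = {}", offsets, second);
    Ok(())
}
```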

Simple example

The following program reads FASTA records from standard input and writes each one to standard output if its sequence is longer than 100. Otherwise, it prints a message to standard error.

use seq_io::fasta::{Reader, Record};
use std::io;

let mut reader = Reader::new(io::stdin());
let mut stdout = io::stdout();

while let Some(result) = reader.next() {
    let record = result.expect("reading error");
    // determine sequence length
    let seqlen = record.seq_lines()
                       .fold(0, |l, seq| l + seq.len());
    if seqlen > 100 {
        record.write_wrap(&mut stdout, 80).expect("writing error");
    } else {
        eprintln!("{} is only {} long", record.id().expect("not UTF-8"), seqlen);
    }
}

Records borrow their data directly from the internal buffered reader; no further allocation or copying takes place. As a consequence, a while let loop has to be used instead of a for loop.
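The reason a for loop does not work can be shown with a toy reader. Its `next` method returns a slice of the reader's own buffer, so the returned item keeps the reader mutably borrowed. `Iterator::next` cannot express that relationship, and `for` requires an `Iterator`. The `LineReader` type below is a hypothetical stand-in written for this illustration, not seq_io's reader.

```rust
/// A toy reader whose items borrow from its internal buffer.
/// Hypothetical type for illustration only.
struct LineReader {
    buf: Vec<u8>,
    pos: usize,
}

impl LineReader {
    fn new(data: &[u8]) -> Self {
        LineReader { buf: data.to_vec(), pos: 0 }
    }

    /// Return the next line (without its newline) as a slice of `self.buf`.
    /// The slice's lifetime is tied to `&mut self`, which is why this
    /// cannot be the `next` method of a standard `Iterator`.
    fn next(&mut self) -> Option<&[u8]> {
        if self.pos >= self.buf.len() {
            return None;
        }
        let start = self.pos;
        let end = self.buf[start..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(self.buf.len(), |i| start + i);
        self.pos = end + 1;
        Some(&self.buf[start..end])
    }
}

fn main() {
    let mut reader = LineReader::new(b">id\nACGT\n");
    let mut total = 0;
    // `while let` re-borrows the reader on every iteration.
    while let Some(line) = reader.next() {
        total += line.len();
    }
    println!("{} bytes in all lines", total);
}
```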

seq_lines() iterates directly over the sequence lines, whose positions are remembered by the record, again without further copying.
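As a rough picture of what "remembering the position" of each line can look like, the sketch below stores start and end offsets into a shared buffer and hands out slices on demand. `ToyRecord` and its fields are hypothetical and only illustrate the technique; the real record type in seq_io is laid out differently.

```rust
/// Toy record that stores where each sequence line lies inside a buffer.
/// Hypothetical type for illustration, not seq_io's record.
struct ToyRecord<'a> {
    buffer: &'a [u8],
    /// (start, end) byte offsets of each line, newline excluded
    line_bounds: Vec<(usize, usize)>,
}

impl<'a> ToyRecord<'a> {
    /// Scan the buffer once and remember the bounds of every line.
    fn new(buffer: &'a [u8]) -> Self {
        let mut line_bounds = Vec::new();
        let mut start = 0;
        for (i, &b) in buffer.iter().enumerate() {
            if b == b'\n' {
                line_bounds.push((start, i));
                start = i + 1;
            }
        }
        if start < buffer.len() {
            line_bounds.push((start, buffer.len()));
        }
        ToyRecord { buffer, line_bounds }
    }

    /// Iterate over the lines as slices of the buffer; nothing is copied.
    fn seq_lines(&self) -> impl Iterator<Item = &'a [u8]> + '_ {
        let buffer = self.buffer;
        self.line_bounds.iter().map(move |&(s, e)| &buffer[s..e])
    }

    /// Total sequence length, summed over all lines.
    fn seq_len(&self) -> usize {
        self.seq_lines().map(|line| line.len()).sum()
    }
}

fn main() {
    // A sequence wrapped over three lines.
    let record = ToyRecord::new(b"ACGT\nACGT\nAC\n");
    println!("{} lines, {} bases", record.seq_lines().count(), record.seq_len());
}
```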

Note: Make sure to add lto = true to the release profile in Cargo.toml because calls to functions of the underlying buffered reader (buf_redux) are not inlined otherwise.
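In Cargo.toml, that setting lives in the release profile section. A minimal fragment looks like this (any other keys in your profile stay as they are):

```toml
[profile.release]
lto = true
```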

Similar projects in Rust

  • Rust-Bio: Bioinformatics library that provides simple and easy-to-use FASTA and FASTQ readers.
  • fastq-rs: Very fast FASTQ parser. seq_io was inspired by fastq-rs in many ways.
  • Needletail: FASTA, FASTQ, FASTX
  • fasten: implements its own FASTQ reader

Performance comparisons

The following bar chart shows the results of a few benchmarks on random sequences generated in memory (FASTA sequences either on a single line or wrapped to a width of 80).

The readers from this crate are also compared with the fastq-rs and Rust-Bio parsers. Rust-Bio appears only in the "owned" section because it offers no way to iterate without allocating records.

More benchmarks can be found on a separate page.

[Figure: bar chart of benchmark results]

Dependencies

~0.8–1.4MB
~32K SLoC