#fasta #statistics #sequence #biological #data #compute #descriptive

app fasta-stats

Simple descriptive statistics on FASTA (biological sequence) data

4 releases (2 breaking)

0.3.1 Jul 12, 2024
0.3.0 Jul 12, 2024
0.2.0 Jul 9, 2024
0.1.0 Jul 8, 2024

#55 in Biology

MIT license

13KB
153 lines

fasta-stats

Compute simple descriptive statistics on a FASTA file

Usage

Simple descriptive statistics on FASTA (biological sequence) data

Usage: fasta-stats [OPTIONS] [FILE]

Arguments:
  [FILE]

Options:
  -m, --median
  -d, --stddev
  -s, --sample <SAMPLE>
      --hint <SIZE_HINT>
  -h, --help              Print help
  -V, --version           Print version

By default, this uses a streaming approach to compute mean, min, max, and count. Minimal memory should be required.

If the median or stddev flags are present, more memory will be required as streaming isn't possible. In order to minimize memory usage, the sample argument can be specified; it is interpreted as "1 in n", as in, if --sample 100 is provided, then an expected 1 in 100 samples will be stored in a vector for purposes of these calculations. Larger values of sample will result in lower memory usage but less-accurate computations.

This simple program expects to read FASTA data either on STDIN or from a named file, and will output the total number of sequences, as well as the min, max, mean, and optionally median and standard deviation, of the sequence lengths. If you have a compressed FASTA file, you can pipe it through zcat or gunzip to decompress it on the fly.

Dependencies

~20MB
~356K SLoC