4 releases (2 breaking)
0.3.1 | Jul 12, 2024 |
---|---|
0.3.0 | Jul 12, 2024 |
0.2.0 | Jul 9, 2024 |
0.1.0 | Jul 8, 2024 |
#52 in Biology
94 downloads per month
13KB
153 lines
fasta-stats
Compute simple descriptive statistics on a FASTA file
Usage
Simple descriptive statistics on FASTA (biological sequence) data
Usage: fasta-stats [OPTIONS] [FILE]
Arguments:
[FILE]
Options:
-m, --median
-d, --stddev
-s, --sample <SAMPLE>
--hint <SIZE_HINT>
-h, --help Print help
-V, --version Print version
By default, this uses a streaming approach to compute mean, min, max, and count. Minimal memory should be required.
If the median
or stddev
flags are present, more memory will be required as streaming isn't possible. In order to minimize memory usage, the sample
argument can be specified; it is interpreted as "1 in n", as in, if --sample 100
is provided, then an expected 1 in 100 samples will be stored in a vector for purposes of these calculations. Larger values of sample
will result in lower memory usage but less-accurate computations.
This simple program expects to read FASTA data either on STDIN or from a named file, and will output the total number of sequences, as well as the min, max, mean, and optionally median and standard deviation, of the sequence lengths. If you have a compressed FASTA file, you can pipe it through zcat
or gunzip
to decompress it on the fly.
Dependencies
~21MB
~358K SLoC