14 releases (5 breaking)
new 0.6.1 | Jan 6, 2025 |
---|---|
0.6.0 | Jan 6, 2025 |
0.5.1 | Nov 26, 2024 |
0.4.2 | Nov 19, 2024 |
0.1.2 | Sep 26, 2024 |
#28 in Biology
320 downloads per month
185KB
3.5K
SLoC
Parallel Construction of Suffix Arrays in Rust
This code is inspired by Cache-friendly, Parallel, and Samplesort-based Constructor for Suffix Arrays and LCP Arrays. We copied many ideas from the original C++ implementation CaPS-SA, most notably the mergesort that constructs the longest common prefix (LCP).
Setup
- Install Rust (this is easy)
- Run
cargo install sufr
to install the CLI - Alternately, execute
cargo run
in the source code directory to build a debug version of the program from source
$ cargo run
Parallel Construction of Suffix Arrays in Rust
Usage: sufr [OPTIONS] [COMMAND]
Commands:
create Create sufr file
extract Extract suffixes from a sufr file
list List the suffix array from a sufr file
count Count occurrences of sequences in a sufr file
locate Locate sequences in a sufr file
summarize Summarize sufr file
help Print this message or the help of the given subcommand(s)
Options:
-t, --threads <THREADS> Number of threads
-l, --log <LOG> Log level [possible values: info, debug]
--log-file <LOG_FILE> Log file
-h, --help Print help
-V, --version Print version
- Execute
cargo build --release
to build an optimized binary in ./target/release/sufr, which you can execute directly and copy into your$PATH
:
$ cargo build --release
$ ./target/release/sufr -h
Parallel Construction of Suffix Arrays in Rust
Usage: sufr [OPTIONS] [COMMAND]
Commands:
create Create sufr file
extract Extract suffixes from a sufr file
list List the suffix array from a sufr file
count Count occurrences of sequences in a sufr file
locate Locate sequences in a sufr file
summarize Summarize sufr file
help Print this message or the help of the given subcommand(s)
Options:
-t, --threads <THREADS> Number of threads
-l, --log <LOG> Log level [possible values: info, debug]
--log-file <LOG_FILE> Log file
-h, --help Print help
-V, --version Print version
Some of the commands may produce debug/info messages.
Use the -l|--log
option to view these on STDOUT or optionally write to a given --log-file
.
Create a suffix array
To begin, you must create a .sufr using the create
(cr
) action:
$ sufr create -h
Create sufr file
Usage: sufr create [OPTIONS] <INPUT>
Arguments:
<INPUT> Input file
Options:
-n, --num-partitions <NUM_PARTS> Subproblem count [default: 16]
-m, --max-query-len <CONTEXT> Max context
-o, --output <OUTPUT> Output file
-d, --dna Input is DNA
-a, --allow-ambiguity Allow suffixes starting with ambiguity codes
-i, --ignore-softmask Ignore suffixes in soft-mask/lowercase regions
-D, --sequence-delimiter <DELIM> Character to separate sequences [default: %]
-s, --seed-mask <MASK> Spaced seeds mask
-r, --random-seed <RANDSEED> Random seed [default: 42]
-h, --help Print help
The resulting binary-encoded output file will contain:
- metadata about the input
- the entire input text encoded as
u8
(bytes) - a fully sorted suffix array (SA)
- an array of the LCP (longest common prefix) for the SA
The input file is a required positional argument and should be a FASTA/Q-formatted file with one or more sequences.
$ cat data/inputs/2.fa
>ABC
acgtacgt
>DEF
acgtacgt
First, read the input sequences into a string of u8
bytes and uppercase the characters.
Each sequence is separated by a specified character (%
by default).
A final dollar sign ($
) is appended to the end of the input text.
If the --dna
flag is present, suffixes are skipped if they begin with any character other than A, C, G, or T unless the --allow-ambiguity
flag is present.
Additionally, soft-masked (lowercase) input is converted to uppercase unless the --ignore-softmask
flag is present, in which case it is converted to N
and ignored.
Next, we partition the suffixes into some number N
partitions by randomly selecting --num-partitions
- 1 pivot suffixes, sorting them, and using the pivots to place each suffix into the highest bounded partition.
The partitions are sorted using a merge sort algorithm that also generates an LCP (longest common prefix) array.
The sorted suffix/LCP arrays are then concatenated to produce the final output.
The sufr
CLI will create an output file containing a binary-encoded representation of the sorted suffix/LCP arrays along with the original sequence data and other metadata used to generate the arrays.
For instance, with the 1.fa file, the default output file will be 1.sufr:
$ sufr --log debug create data/inputs/1.fa -n 2 --dna
[2025-01-06T20:28:54Z INFO sufr] Using 8 threads
[2025-01-06T20:28:54Z INFO sufr] Read input of len 11 in 982.625µs
[2025-01-06T20:28:54Z INFO libsufr::sufr_builder] Selected 1 pivot in 74.291µs
[2025-01-06T20:28:54Z INFO libsufr::sufr_builder] Wrote 9 unsorted suffixes to partition in 469.833µs
[2025-01-06T20:28:54Z INFO libsufr::sufr_builder] Sorted 9 suffixes in 2 partitions (avg 4) in 356.833µs
[2025-01-06T20:28:54Z INFO sufr] Wrote 172 bytes to '1.sufr' in 422.042µs
[2025-01-06T20:28:54Z INFO sufr] Total time: 2.836917ms
Summarize a sufr file
Use the summarize
(su
) action to view metadata about a .sufr file:
$ sufr su -h
Summarize sufr file
Usage: sufr summarize <SUFR>
Arguments:
<SUFR> Sufr file
Options:
-h, --help Print help
For instance:
$ sufr su 1.sufr
+-----------------+------------------+
| Filename | 1.sufr |
+-----------------+------------------+
| Modified | 2025-01-06 13:28 |
+-----------------+------------------+
| File Size | 172 bytes |
+-----------------+------------------+
| File Version | 6 |
+-----------------+------------------+
| DNA | true |
+-----------------+------------------+
| Allow Ambiguity | false |
+-----------------+------------------+
| Ignore Softmask | false |
+-----------------+------------------+
| Text Length | 11 |
+-----------------+------------------+
| Len Suffixes | 9 |
+-----------------+------------------+
| Max query len | 0 |
+-----------------+------------------+
| Num sequences | 1 |
+-----------------+------------------+
| Sequence starts | 0 |
+-----------------+------------------+
| Headers | 1 |
+-----------------+------------------+
Listing suffixes in a sufr file
You can use the list
(ls
) action to view the sorted arrays by their rank:
$ sufr ls -h
List the suffix array from a sufr file
Usage: sufr list [OPTIONS] <SUFR> [RANK]...
Arguments:
<SUFR> Sufr file
[RANK]... Ranks of suffixes to show
Options:
-r, --show-rank Show rank column
-s, --show-suffix Show suffix position column
-p, --show-lcp Show LCP column
-v, --very-low-memory Very low memory
-n, --len <LEN> Length of suffixes to show
-o, --output <OUT> Output
-h, --help Print help
For example, to view the 1.sufr file including rank, suffix, and LCP:
$ sufr ls -rsp 1.sufr
0 10 0 $
1 6 0 ACGT$
2 0 4 ACGTNNACGT$
3 7 0 CGT$
4 1 3 CGTNNACGT$
5 8 0 GT$
6 2 2 GTNNACGT$
7 9 0 T$
8 3 1 TNNACGT$
An optional positional argument for the ranks allows you to select only a portion:
$ sufr ls 1.sufr 0-5
$
ACGT$
ACGTNNACGT$
CGT$
CGTNNACGT$
GT$
As suffixes can get quite long, use the -l|--len
option to restrict the length of the shown suffix:
$ sufr ls 1.sufr 2-3 -n 3
ACG
CGT
Count occurrences of suffixes
Use the count
(co
) command to find the number of occurrences of suffixes:
$ sufr co -h
Count occurrences of sequences in a sufr file
Usage: sufr count [OPTIONS] <SUFR> <QUERY>...
Arguments:
<SUFR> Sufr file
<QUERY>... Query
Options:
-m, --max-query-len <LEN> Maximum query length
-o, --output <OUT> Output
-l, --low-memory Low-memory
-v, --very-low-memory Very low memory
-h, --help Print help
The -l|--low-memory
option will force the suffixes to be read from disk rather than loaded into memory.
The -v|--very-low-memory
option will also force the text to be read from disk rather than loaded into memory.
The time to load a large index (e.g., human genome) into memory may take longer than the actual search, so you may find this is faster for only a few queries but slower for a large number.
For example:
$ sufr co 1.sufr AC GT X
AC 2
GT 2
X not found
Suffixes that do not occur are printed to STDERR
such as the "X" in the preceding output.
Locate suffixes
Use the locate
(lo
) command to find the positions of a given suffix:
$ sufr lo -h
Locate sequences in a sufr file
Usage: sufr locate [OPTIONS] <SUFR> <QUERY>...
Arguments:
<SUFR> Sufr file
<QUERY>... Query
Options:
-o, --output <OUT> Output
-m, --max-query-len <LEN> Maximum query length
-l, --low-memory Low memory
-v, --very-low-memory Very low memory
-a, --abs Show absolute position in text
-h, --help Print help
For instance, given the 3.fa input:
$ cat data/inputs/3.fa
>1
AGCTTTTCATTCTGACTGCAACGG
GCAATANNNNNNNNNNTGTCTC
>2
TGTGTGGATTAAAAAAAGAGTGTC
>3
TGATAGCAGCTTCTGAACTGGTTA
CCTGCCGTGAGTAAAT
The suffix "ACT" is found at position 14 in sequence 1 and 16 in sequence 3, "GT" is found at position 20 in sequence 3, and "X" is not found and is printed to STDERR
:
$ sufr lo data/expected/3.sufr ACT GTT X
ACT
1 14
3 16
//
GTT
3 20
//
X not found
Use the -a|--absolute
flag if you prefer to see the raw suffix position:
$ sufr lo 3.sufr ACT GTT -a
ACT 14 88
GTT 92
By default, the entire suffix array will be loaded into memory.
If your machine lacks suffixient resources, you can use the -m|--max-query-len
option to create a down-sampled suffix array.
Or use the -l|--low-memory
option to search the suffix array on disk, which is slower but requires only enough memory to hold the original text/sequences.
Extract suffixes
Use the extract
(ex
) to print suffixes in FASTA format:
$ sufr ex -h
Extract suffixes from a sufr file
Usage: sufr extract [OPTIONS] <SUFR> <QUERY>...
Arguments:
<SUFR> Sufr file
<QUERY>... Query
Options:
-m, --max-query-len <LEN> Maximum query length
-l, --low-memory Low memory
-v, --very-low-memory Very low memory
-p, --prefix-len <PREFIX_LEN> Prefix length
-s, --suffix-len <SUFFIX_LEN> Suffix length
-o, --output <OUT> Output
-h, --help Print help
Using the preceding locate
results, I can extract the suffixes at those positions.
As the suffixes can get quite long, I will use the -s|--suffix-len
option to limit them to 15bp:
$ sufr ex data/expected/3.sufr ACT GTT -s 15
>1:14-29 ACT 0
ACTGCAACGGGCAAT
>3:16-31 ACT 0
ACTGGTTACCTGCCG
>3:20-35 GTT 0
GTTACCTGCCGTGAG
Combine this with the -p|--prefix-len
to control the amount of preceding text, which might be useful when identifying alignment seeds:
$ sufr ex data/expected/3.sufr ACT GTT -s 15 -p 3
>1:11-29 ACT 3
CTGACTGCAACGGGCAAT
>3:13-31 ACT 3
TGAACTGGTTACCTGCCG
>3:17-35 GTT 3
CTGGTTACCTGCCGTGAG
The FASTA header contains the matching sequence ID, a colon, the start/stop position of the extracted sequence, followed by the query, and finally the location of the query in the extracted sequence, which is only relevant if you included a prefix.
Testing
Run cargo test
.
Authors
- Ken Youens-Clark kyclark@arizona.edu
- Jack Roddy jroddy@arizona.edu
- Travis Wheeler twheeler@arizona.edu
Dependencies
~12–23MB
~352K SLoC