12 releases (4 breaking)
0.5.1 | Nov 26, 2024 |
---|---|
0.5.0 | Nov 22, 2024 |
0.4.2 | Nov 19, 2024 |
0.3.1 | Nov 6, 2024 |
0.1.2 | Sep 26, 2024 |
#33 in Biology
1,029 downloads per month
Used in awry
140KB
3K
SLoC
Parallel Construction of Suffix Arrays in Rust
This code is inspired by Cache-friendly, Parallel, and Samplesort-based Constructor for Suffix Arrays and LCP Arrays. We copied many ideas from the original C++ implementation CaPS-SA, most notably the mergesort that constructs the longest common prefix (LCP).
Setup
- Install Rust
- Run
cargo install sufr
to install the CLI - Alternately, execute
cargo run
in the source code directory to build a debug version of the program from source
$ cargo run
Parallel Construction of Suffix Arrays in Rust
Usage: sufr [OPTIONS] [COMMAND]
Commands:
create Create sufr file
check Check sufr file for correctness
extract Extract suffixes from a sufr file
list List the suffix array from a sufr file
count Count occurrences of sequences in a sufr file
locate Locate sequences in a sufr file
summarize Summarize sufr file
help Print this message or the help of the given subcommand(s)
Options:
-t, --threads <THREADS> Number of threads
-l, --log <LOG> Log level [possible values: info, debug]
--log-file <LOG_FILE> Log file
-h, --help Print help
-V, --version Print version
- Execute
cargo build --release
to build an optimized binary in ./target/release/sufr, which you can execute directly and copy into your$PATH
:
$ cargo build --release
$ ./target/release/sufr -h
Parallel Construction of Suffix Arrays in Rust
Usage: sufr [OPTIONS] [COMMAND]
Commands:
create Create sufr file
check Check sufr file for correctness
extract Extract suffixes from a sufr file
list List the suffix array from a sufr file
count Count occurrences of sequences in a sufr file
locate Locate sequences in a sufr file
summarize Summarize sufr file
help Print this message or the help of the given subcommand(s)
Options:
-t, --threads <THREADS> Number of threads
-l, --log <LOG> Log level [possible values: info, debug]
--log-file <LOG_FILE> Log file
-h, --help Print help
-V, --version Print version
Some of the commands may produce debug/info messages.
Use the -l|--log
option to view these on STDOUT or optionally write to a given --log-file
.
Create a suffix array
To begin, you must create a .sufr using the create
(cr
) action:
$ sufr create -h
Create sufr file
Usage: sufr create [OPTIONS] <INPUT>
Arguments:
<INPUT> Input file
Options:
-n, --num-partitions <NUM_PARTS> Subproblem count [default: 16]
-m, --max-query-len <CONTEXT> Max context
-o, --output <OUTPUT> Output file
-d, --dna Input is DNA
-a, --allow-ambiguity Allow suffixes starting with ambiguity codes
-i, --ignore-softmask Ignore suffixes in soft-mask/lowercase regions
-D, --sequence-delimiter <DELIM> Character to separate sequences [default: %]
-h, --help Print help
The resulting output file will contain:
- metadata about the input
- the entire input text encoded as
u8
(bytes) - a fully sorted suffix array (SA)
- an array of the LCP (longest common prefix) for the SA
The input file is a required positional argument and should be a FASTA/Q-formatted file with one or more sequences.
$ cat sufr/tests/inputs/2.fa
>ABC
acgtacgt
>DEF
acgtacgt
First, read the input sequences into a string of u8
bytes and uppercase the characters.
Each sequence is separated by a specified character (%
by default).
A final dollar sign ($
) is appended to the end of the input text.
If the --dna
flag is present, suffixes are skipped if they begin with any character other than A, C, G, or T unless the --allow-ambiguity
flag is present.
Additionally, soft-masked (lowercase) input is converted to uppercase unless the --ignore-softmask
flag is present, in which case it is converted to N
and ignored.
Next, we partition the suffixes into some number N
partitions by randomly selecting --num-partitions
- 1 pivot suffixes, sorting them, and using the pivots to place each suffix into the highest bounded partition.
The partitions are sorted using a merge sort algorithm that also generates an LCP (longest common prefix) array.
The sorted suffix/LCP arrays are then concatenated to produce the final output.
The sufr
CLI will create an output file containing a binary-encoded representation of the sorted suffix/LCP arrays along with the original sequence data and other metadata used to generate the arrays.
For instance, with the 1.fa file, the default output file will be 1.sufr:
$ sufr --log debug create sufr/tests/inputs/1.fa -n 2 --dna
[2024-11-22T22:28:45Z INFO sufr] Using 8 threads
[2024-11-22T22:28:45Z INFO sufr] Read input of len 11 in 1.0535ms
[2024-11-22T22:28:45Z INFO libsufr::sufr_builder] Selected 1 pivot in 59.25µs
[2024-11-22T22:28:45Z INFO libsufr::sufr_builder] Wrote 9 unsorted suffixes to partition in 356.833µs
[2024-11-22T22:28:45Z INFO libsufr::sufr_builder] Sorted 9 suffixes in 2 partitions (avg 4) in 834.542µs
[2024-11-22T22:28:45Z INFO sufr] Wrote 164 bytes to '1.sufr' in 593.875µs
[2024-11-22T22:28:45Z INFO sufr] Total time: 4.518709ms
Summarize a sufr file
Use the summarize
(su
) action to view metadata about a .sufr file:
$ sufr su -h
Summarize sufr file
Usage: sufr summarize <SUFR>
Arguments:
<SUFR> Sufr file
Options:
-h, --help Print help
For instance:
$ sufr su 1.sufr
+-----------------+------------------+
| Filename | 1.sufr |
+-----------------+------------------+
| Modified | 2024-11-15 09:21 |
+-----------------+------------------+
| File Size | 164 bytes |
+-----------------+------------------+
| File Version | 4 |
+-----------------+------------------+
| DNA | true |
+-----------------+------------------+
| Allow Ambiguity | false |
+-----------------+------------------+
| Ignore Softmask | false |
+-----------------+------------------+
| Text Length | 11 |
+-----------------+------------------+
| Num Suffixes | 9 |
+-----------------+------------------+
| Max query len | 0 |
+-----------------+------------------+
| Num sequences | 1 |
+-----------------+------------------+
| Sequence starts | 0 |
+-----------------+------------------+
| Headers | 1 |
+-----------------+------------------+
Check a suffix array
Use the check
(ch
) action to verify the order of the suffix array:
$ sufr check -h
Check sufr file for correctness
Usage: sufr check [OPTIONS] <SUFR>
Arguments:
<SUFR> Sufr file
Options:
-v, --verbose List errors
-h, --help Print help
For instance:
$ sufr check 1.sufr
Checked 9 suffixes, found 0 errors in suffix array.
Finished checking in 1.112417ms.
Listing suffixes in a sufr file
You can use the list
(ls
) action to view the sorted arrays by their rank:
$ sufr ls -h
List the suffix array from a sufr file
Usage: sufr list [OPTIONS] <SUFR> [RANK]...
Arguments:
<SUFR> Sufr file
[RANK]... Ranks of suffixes to show
Options:
-l, --len <LEN> Length of suffixes to show
-o, --output <OUT> Output
-h, --help Print help
For example, to view the 1.sufr file:
$ sufr ls 1.sufr
R S L
0 10 0: $
1 6 0: ACGT$
2 0 4: ACGTNNACGT$
3 7 0: CGT$
4 1 3: CGTNNACGT$
5 8 0: GT$
6 2 2: GTNNACGT$
7 9 0: T$
8 3 1: TNNACGT$
An optional positional argument for the ranks allows you to select only a portion:
$ sufr ls 1.sufr 0-5
R S L
0 10 0: $
1 6 0: ACGT$
2 0 4: ACGTNNACGT$
3 7 0: CGT$
4 1 3: CGTNNACGT$
5 8 0: GT$
As suffixes can get quite long, use the -l|--len
option to restrict the length of the shown suffix:
$ sufr ls 1.sufr 2-3 -l 3
R S L
2 0 4: ACG
3 7 0: CGT
Count occurrences of suffixes
Use the count
(co
) command to find the number of occurrences of suffixes:
$ sufr co -h
Count occurrences of sequences in a sufr file
Usage: sufr count [OPTIONS] <SUFR> <QUERY>...
Arguments:
<SUFR> Sufr file
<QUERY>... Query
Options:
-m, --max-query-len <LEN> Maximum query length
-o, --output <OUT> Output
-l, --low-memory Low-memory
-h, --help Print help
The -l|--low-memory
option will force the suffixes to be read from disk rather than loaded into memory.
The time to load a large index (e.g., human genome) into memory may take longer than the actual search, so you may find this is faster for only a few queries but slower for a large number.
For example:
$ sufr co 1.sufr AC GT X
AC 2
GT 2
X not found
Suffixes that do not occur are printed to STDERR
such as the "X" in the preceding output.
Locate suffixes
Use the locate
(lo
) command to find the positions of a given suffix:
$ sufr lo -h
Locate sequences in a sufr file
Usage: sufr locate [OPTIONS] <SUFR> <QUERY>...
Arguments:
<SUFR> Sufr file
<QUERY>... Query
Options:
-m, --max-query-len <LEN> Maximum query length
-o, --output <OUT> Output
-l, --low-memory Low-memory
-a, --abs Show absolute position in text
-h, --help Print help
For instance, given the 3.fa input:
$ cat sufr/tests/inputs/3.fa
>1
AGCTTTTCATTCTGACTGCAACGG
GCAATANNNNNNNNNNTGTCTC
>2
TGTGTGGATTAAAAAAAGAGTGTC
>3
TGATAGCAGCTTCTGAACTGGTTA
CCTGCCGTGAGTAAAT
The suffix "ACT" is found at position 14 in sequence 1 and 16 in sequence 3, "GT" is found at position 20 in sequence 3, and "X" is not found and is printed to STDERR
:
$ sufr lo 3.sufr ACT GTT
ACT
1 14
3 16
//
GTT
3 20
//
X not found
Use the -a|--absolute
flag if you prefer to see the raw suffix position:
$ sufr lo 3.sufr ACT GTT -a
ACT 14 88
GTT 92
By default, the entire suffix array will be loaded into memory.
If your machine lacks suffixient resources, you can use the -m|--max-query-len
option to create a down-sampled suffix array.
Or use the -l|--low-memory
option to search the suffix array on disk, which is slower but requires only enough memory to hold the original text/sequences.
Extract suffixes
Use the extract
(ex
) to get suffixes by their position in the text:
$ sufr ex -h
Extract suffixes from a sufr file
Usage: sufr extract [OPTIONS] <SUFR> [SUFFIX]...
Arguments:
<SUFR> Sufr file
[SUFFIX]... Suffixes to extract
Options:
-p, --prefix-len <PREFIX_LEN> Prefix length
-s, --suffix-len <SUFFIX_LEN> Suffix length
-l, --lcp Show LCP
-o, --output <OUT> Output
-h, --help Print help
Using the preceding locate
results, I can extract the suffixes at those positions.
As the suffixes can get quite long, I will use the -s|--suffix-len
option to limit them to 15bp:
$ sufr ex 3.sufr 14 88 92 -s 15
ACTGCAACGGGCAAT
ACTGGTTACCTGCCG
GTTACCTGCCGTGAG
Combine this with the -p|--prefix-len
to control the amount of preceding text, which might be useful when identifying alignment seeds:
$ sufr ex 3.sufr 14 88 92 -s 15 -p 5
TTCTGACTGCAACGGGCAAT
TCTGAACTGGTTACCTGCCG
AACTGGTTACCTGCCGTGAG
Testing
Run cargo test
.
Authors
- Ken Youens-Clark kyclark@arizona.edu
- Jack Roddy jroddy@pharmacy.arizona.edu
- Travis Wheeler twheeler@arizona.edu
Dependencies
~12–22MB
~325K SLoC