#suffix-array #search #fm-index #bioinformatics #search-query #fasta-fastq #exact-match-search

awry

Library for creating FM-indexes from FASTA/FASTQ files. AWRY is able to search at lightning speed by leveraging SIMD vectorization and multithreading over collections of queries.

4 releases

new 0.2.0 Jan 9, 2025
0.1.3 Dec 16, 2024
0.1.2 Dec 13, 2024
0.1.1 Dec 12, 2024
0.1.0 Dec 10, 2024

#77 in Biology

Download history 149/week @ 2024-12-06 339/week @ 2024-12-13 14/week @ 2024-12-20 12/week @ 2025-01-03

431 downloads per month

BSD-3-Clause

130KB
2.5K SLoC

AWRY

Avx Windowed fm-index in Rust? Yes!

Generates an Fm-Index of a given biological sequence text (Fasta or Fastq file), and implements Locate() and Search() functionalities.

AWRY is a port of a state-of-the-art, fastest in its class FM-index implementation (https://doi.org/10.1186/s13015-021-00204-6). AWRY supports parallelized searching, with parallel_count() and parallel_locate() functions.

AWRY supports DNA, RNA, and protein alphabets, and is able to search at lightning speed by leveraging SIMD vectorization and multithreading over collections of queries. These alphabets are case-insensitive- A and a are considered the same symbol under the hood.

Building an FM-index

to build an fm-index, create an FmBuildArgs struct, and call FmIndex::new()

let buildArgs =  FmBuildArgs {
    input_file_src: "my_input.fa",              //sets what the input file for the database text will be
    suffix_array_output_src: None,              //will build to a default location
    suffix_array_compression_ratio: None,       // ratio of suffix array compression, 8 by default
    lookup_table_kmer_len: None,                //by default, chooses reasonable table sizes (Dna=13, Amino=5)
    alphabet: SymbolAlphabet::Nucleotide,       //alphabet to build
    max_query_len: None,                        //if set, only sort suffix array up to n positions
    remove_intermediate_suffix_array_file: true,//deletes the suffix array file if true
}

let fm_index = FmIndex::new(&buildArgs);

If you only intend to use the count function, you can set the suffix array compression to a high value like 255 to reduce memory usage.

Searching for a query

To search for a query, use to count_string and locate_string functions.

pub fn count_string(&self, query: &String) -> u64 {
    ...
}

/// Finds the locations in the original text of all isntances of the given query.
pub fn locate_string(&self, query: &String) -> Vec<u64> {
    ...
}

Searching for queries in parallel

To find a large number of queries, searching can be parallelized easily with the parallel_count and parallel_locate functions

pub fn parallel_count(&self, queries: &Vec<String>) -> Vec<u64> {
    ...
}

// Finds the locations for each query in the query list. This function uses rayon's into_par_iter() for parallelism.
pub fn parallel_locate(&self, queries: &Vec<String>) -> Vec<Vec<u64>> {
    ...
}

Dependencies

~7–18MB
~258K SLoC