3 releases (1 stable)

1.0.0	May 15, 2022
0.2.0	Apr 23, 2022
0.1.0	Apr 18, 2022

#1170 in Algorithms

MIT license

73KB
1.5K SLoC

rust-wfa

Introduction

This project is an implementation of the wavefront alignment algorithm. This algorithm was first described in an article published in 2020: Fast gap-affine pairwise alignment using the wavefront algorithm.

I have pasted part of the abstract below:

In this article, we present the wavefront alignment algorithm (WFA), an exact gap-affine algorithm that takes advantage of homologous regions between the sequences to accelerate the alignment process. As opposed to traditional dynamic programming algorithms that run in quadratic time, the WFA runs in time O(ns), proportional to the read length n and the alignment score s, using O(s2) memory. Furthermore, our algorithm exhibits simple data dependencies that can be easily vectorized, even by the automatic features of modern compilers, for different architectures, without the need to adapt the code. We evaluate the performance of our algorithm, together with other state-of-the-art implementations. As a result, we demonstrate that the WFA runs 20–300× faster than other methods aligning short Illumina-like sequences, and 10–100× faster using long noisy reads like those produced by Oxford Nanopore Technologies."

The reference implementation of the algorithm, written in C by the authors of the article, is available at github.com/smarco/WFA2-lib.

Wavefronts explained

Wavefronts basics

This explanation assumes that the reader is familiar with the Smith-Waterman-Gotoh sequence alignment algorithm, and its gap-affine version.

The wavefront alignment algorithm computes the optimal alignment between a pair of strings Query and Text using three 2-dimensional matrices: Insert_{i, j}, Matches_{i, j} and Deletes_{i, j}.

Unlike SWG alignment, i and j do not correspond to a position in Query or Text.

i corresponds to a score.
j corresponds to a diagonal in the dynamic programming matrix that would be used to align Query and Text. The following figure shows how these diagonals are laid out.

	" "	C	A	T
" "	0	1	2	3
C	-1	0	1	2
A	-2	-1	0	1
C	-3	-2	-1	0

The value that is stored in each of the wavefront matrices cell is the number of characters of Text that can be aligned, for the score i, at the diagonal j.

If we align n character of Text, the number of characters of Query that are aligned is equal to n + the number of the current diagonal. For instance, for the alignment of "CAT" and "CAC" Matches_{i, j} = 2, since on diagonal 0, if the matching penalty/bonus is set to 0, we can match up to 2 characters ("CA").

We'll define the direction of insertions as aligning an extra character in Query (moving horizontally in the alignment matrix) and deletions as the opposite Text (moving vertically in the alignment matrix).

At the highest level, the wavefront alignment algorithm can be defined as such (with pens being a struct that holds the mismatch/gap penalties):

    let mut current_front = new_wavefront_state(query, text, pens);
    loop {
        current_front.extend();
        if current_front.is_finished() {
            break;
        }
        current_front.increment_score();
        current_front.next();
    }

Matches extension

One key aspect of wavefront alignment is that the matching penalty is set to 0. Recall the definition of the value stored in each matrix:

The value that is stored in each of the matrices cell is the number of characters of Text that can be aligned, for the score i, at the diagonal j.

For a given position the score can be extended as long as the character of Text and Query, in the diagonal of the SWG matrix match. We can thus obtain the wavefront extension function:

for diag in current_diagonals {
	let mut x = matches[current_score][diag];
	while text[x] == query[x + diag] {
		x += 1;
		matches[current_score][diag] = x;
	}
}

The wavefront recurrence relation

Smith-Waterman-Gotoh defines the following recurrence relation to build the alignment matrices:

Deletes_{i, j} = min( Deletes_{i - 1, j} + gap extension penalty, Matches_{i - 1, j} + gap extension penalty + gap opening penalty )
Inserts_{i, j} = min( Inserts_{i, j - 1} + gap extension penalty, Matches_{i, j - 1} + gap extension penalty + gap opening penalty )
Matches_{i, j} = min( Matches_{i, j} + the penalty of matching/mismatching Text_i to Query_j, Inserts_{i, j}, Deletes_{i, j} )

The WFA_next function is based on these relations. The WFA_extend function is equivalent to going down a diagonal of the DP matrix while Text_i== Query_j. Once a cell where this equation doens't hold is reached, WFA_next is equivalent to computing the DP cells to the left and right of that cell.

Rephrasing the SWG recurrence relations for wavefronts gives these relations:

For the Deletes matrix, the number of chars that can be matched at a score i and a diagonal j is the maximum of Deletes_{i - gap extension penalty, j + 1} and Matches_{i - gap extension penalty - gap opening penalty, j + 1} + 1. The '+ 1' is due to the definition of the values stored in each cell.
It's the same from the Inserts matrix, except that the values will come from the diagonal j - 1 (since insertions are a rightward movement in the SWG matrix). This time, we don't add 1 because the number stored in each cell is the number of characters of Text matched, and insertions are an extra character in Query.
Matches is the maximum of Deletes_{i, j}, Inserts_{i, j} and Matches_{i - mismatch pen, j}.

Checking if the algorithm is over.

The final cell is at a specific diagonal: the diagonal whose number is the length of Query minus the length of Text. At every cycle, we'll check if the number of characters matched at a diagonal is equal to the length of Text.

Backtracking wavefronts to build an alignment

The backtracking algorithm isn't specified in the original article and I derived it myself since it's not very complicated. One important detail is not to forget that if we're on a match, we need to un-extend the wavefront.

Rust implementation

Validation of my implementation

Verifying that the WFA algorithm gives the same score as SWG alignment.

I have implemented a naive, unoptimized version of the SWG alignment in reference.rs. I have written validate.rs (a binary target that gets compiled to a standalone executable binary file).

./target/release/validate -h
rust_wfa 0.1.0

USAGE:
    validate [OPTIONS] --min-length <MIN_LENGTH> --max-length <MAX_LENGTH> --min-error <MIN_ERROR> --max-error <MAX_ERROR>

OPTIONS:
    -h, --help                       Print help information
        --max-error <MAX_ERROR>      
        --max-length <MAX_LENGTH>    
        --min-error <MIN_ERROR>      
        --min-length <MIN_LENGTH>    
    -p, --parallel                   
    -V, --version

That program generates random strings of a length in the interval specified by the user and a second, mutated version of that string that differs by the error rate (in percent) interval given. It then aligns the 2 strings using both the WFA and SWG algorithm, and checks that their score is the same (the alignment itself is not compared since there can be multiple alignment for an optimal alignment score). It can also run in parallel, doing this process concurrently, with a different text/query pair of strings over each detected cpu core.

After using this executable to fix the remaining bugs in my algorithm, I have now been able to compare the alignments of hundred thousands of strings without a difference in the alignment score between both algorithms, which has convinced me of the soundness of my implementation.

Verifying that the alignment and its score matches.

In a similar manner, I wrote validate_score_matches_alignment. This program gets compiled to a binary with the same name, that can be ran to generate pairs of strings, align them, and verify that the alignment matches its score: the score is recomputed from the alignment and then compared.

./target/release/validate_score_matches_alignment -h
rust_wfa 0.1.0
USAGE:
    validate_score_matches_alignment [OPTIONS] --min-length <MIN_LENGTH> --max-length <MAX_LENGTH> --min-error <MIN_ERROR> --max-error <MAX_ERROR>

OPTIONS:
    -h, --help                       Print help information
        --max-error <MAX_ERROR>      
        --max-length <MAX_LENGTH>    
        --min-error <MIN_ERROR>      
        --min-length <MIN_LENGTH>    
    -p, --parallel                   
    -V, --version                    Print version information

This program allowed me to identify another bug (a single incorrect variable) that produced incorrect alignments, with a correct score (and thus weren't detected by the previous validation program). I can now validate thousands of alignments, without producing bugs.

Benchmarks

I first ran the benchmarks on my naive implementation. The results are in the table below. n = length of the sequences, d = rate at which the elements differ between each sequence.

	n = 100, d = 1%	n = 100, d = 10%	n = 100, d = 30%	n = 1k, d = 1%	n = 1k, d = 10%	n = 1k, d = 30%	n = 10k, d = 1%	n = 10k, d = 10%	n = 10k, d = 30%
rust-wfa	41 µs	131 µs	291 µs	1.77 ms	16 ms	33 ms	202.6 ms	1.59 s	3.3 s
WFA2	5 µs	25 µs	53 µs	42 µs	665 µs	2 ms	1 ms	37 ms	140 ms
WFA2 SWG	87 µs	90 µs	95 µs	11 ms	11 ms	11 ms	1 s	1 s	1 s

As expected, SWG alignment doesn't depend on the error rate. My naive implementation is much slower than the original implementation, and it is only faster than SWG alignment for highly similar sequences.

This was an initial, naive implementation of the WFA algorithm. For instance, I stored the wavefronts as Vec<Vec>, which isn't very efficient.

In the next version, I rewrote my implementation to use a more efficient 1D Vec.

	n = 100, d = 1%	n = 100, d = 10%	n = 100, d = 30%	n = 1k, d = 1%	n = 1k, d = 10%	n = 1k, d = 30%	n = 10k, d = 1%	n = 10k, d = 10%	n = 10k, d = 30%
rust-wfa	3.18 µs	12 µs	36 µs	31 µs	779 µs	3.4 ms	1.3 ms	89 ms	377 ms
WFA2	5 µs	25 µs	53 µs	42 µs	665 µs	2 ms	1 ms	37 ms	140 ms
WFA2 SWG	87 µs	90 µs	95 µs	11 ms	11 ms	11 ms	1 s	1 s	1 s

The runtime was reduced by ~90%. My implementation is now competitive with the reference one, and efficient SWG.

Dependencies

~3.5MB
~71K SLoC