3 releases

0.1.2 Mar 4, 2021
0.1.1 Dec 7, 2020
0.1.0 Sep 7, 2020


MIT/Apache


edlib_rs

This crate provides a Rust interface to the Edlib C++ library by Martin Šošić. See Martinsos-edlib

The reference paper is:

Martin Šošić, Mile Šikić; Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics 2017, btw753. doi: https://doi.org/10.1093/bioinformatics/btw753

The crate offers two interfaces to Edlib.
The first, accessed via the module bindings, is the raw interface generated by the bindgen crate.
The second, accessed via the module edlibrs, provides a more idiomatic Rust interface. It comes at the cost of cloning the data behind the startLocations and endLocations pointers of the C struct EdlibAlignResult into a Rust struct EdlibAlignResultRs, which has Option<Vec<u8>> fields instead of pointers. The CIGAR string representation is also cloned when it is computed.
As a consequence, memory management is fully transferred to Rust.
Structures and functions have the same names as in Edlib, with "Rs" appended.
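
The cloning step described above can be illustrated with a minimal sketch. It is not the crate's actual code: the helper name clone_c_array is hypothetical, and the array in main only stands in for memory that the C library would own. It shows the general pattern of copying a pointer-plus-length pair into an owned Option<Vec<T>>, after which Rust no longer depends on the C allocation.

```rust
use std::slice;

/// Hypothetical helper (not part of edlib_rs): copies `len` values from a
/// C-owned array into a Rust-owned vector. Returns None for a null pointer
/// or an empty array.
///
/// # Safety
/// `ptr` must be null or point to at least `len` readable, initialized values.
unsafe fn clone_c_array<T: Clone>(ptr: *const T, len: usize) -> Option<Vec<T>> {
    if ptr.is_null() || len == 0 {
        return None;
    }
    // Borrow the foreign memory as a slice just long enough to copy it.
    Some(unsafe { slice::from_raw_parts(ptr, len) }.to_vec())
}

fn main() {
    // Stand-in for an array that the C side would have allocated.
    let c_side = [3_i32, 7, 11];
    let owned = unsafe { clone_c_array(c_side.as_ptr(), c_side.len()) };
    assert_eq!(owned, Some(vec![3, 7, 11]));

    // A null pointer maps to None instead of a dangling reference.
    let missing = unsafe { clone_c_array::<i32>(std::ptr::null(), 0) };
    assert_eq!(missing, None);
}
```

Once the copy is made, the C result can be freed immediately, which is what allows the Rust struct to be dropped like any other value.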

Example

The following examples use the edlibrs interface.

In normal (NW) mode:

    use edlib_rs::edlibrs::*;
    ...
    let query = "ACCTCTG";
    let target = "ACTCTGAAA";
    let align_res = edlibAlignRs(query.as_bytes(), target.as_bytes(), &EdlibAlignConfigRs::default());
    assert_eq!(align_res.status, EDLIB_STATUS_OK);
    assert_eq!(align_res.editDistance, 4);

In infix (HW) mode:

    use edlib_rs::edlibrs::*;
    ...
    let query = "ACCTCTG";
    let target = "TTTTTTTTTTTTTTTTTTTTTACTCTGAAA";
    //
    let mut config = EdlibAlignConfigRs::default();
    config.mode = EdlibAlignModeRs::EDLIB_MODE_HW;
    let align_res = edlibAlignRs(query.as_bytes(), target.as_bytes(), &config);
    assert_eq!(align_res.editDistance, 1);
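
To see why the two examples above give distances of 4 and 1, the modes can be reproduced with a plain dynamic-programming reference implementation. This is only a sketch for checking small inputs: it is not the algorithm Edlib uses and does not call the crate, and the names Mode and edit_distance are hypothetical. It assumes the usual meaning of the modes: NW is global, SHW leaves the end of the target free, and HW leaves both the start and the end of the target free.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Mode {
    Nw,  // global: whole query against whole target
    Shw, // prefix: trailing part of the target is free
    Hw,  // infix: leading and trailing parts of the target are free
}

/// Reference O(n*m) edit distance, for illustration only.
fn edit_distance(query: &[u8], target: &[u8], mode: Mode) -> usize {
    // prev[j] is the cost of aligning the query prefix processed so far with
    // a stretch of the target ending at position j. In HW mode, skipping the
    // start of the target is free, so the first row is all zeros.
    let mut prev: Vec<usize> = (0..=target.len())
        .map(|j| if mode == Mode::Hw { 0 } else { j })
        .collect();
    for (i, &q) in query.iter().enumerate() {
        let mut cur = Vec::with_capacity(target.len() + 1);
        cur.push(i + 1); // query prefix against an empty target stretch
        for (j, &t) in target.iter().enumerate() {
            let substitution = prev[j] + usize::from(q != t);
            let deletion = prev[j + 1] + 1; // query character left unmatched
            let insertion = cur[j] + 1; // target character left unmatched
            cur.push(substitution.min(deletion).min(insertion));
        }
        prev = cur;
    }
    match mode {
        // The alignment must consume the whole target.
        Mode::Nw => prev[target.len()],
        // The alignment may end anywhere in the target.
        Mode::Shw | Mode::Hw => *prev.iter().min().unwrap(),
    }
}

fn main() {
    let query = b"ACCTCTG";
    // Same inputs as the normal-mode example.
    assert_eq!(edit_distance(query, b"ACTCTGAAA", Mode::Nw), 4);
    // Same inputs as the infix-mode example.
    let target = b"TTTTTTTTTTTTTTTTTTTTTACTCTGAAA";
    assert_eq!(edit_distance(query, target, Mode::Hw), 1);
}
```

In normal mode the three trailing A characters of the target must be paid for, while in infix mode only the extra C in the query counts.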

Installation

The package embeds the original Edlib library sources in its source tree (see directory edlib-c, corresponding to the sources as of December 2020), minus the original test_data directory to limit the size of the crate. The standard "cargo build" command runs Edlib's CMake build.

The crate enables a logger to monitor calls to the C interface. Its level is set by default in Cargo.toml to info in release mode and trace in debug mode, but it can be changed by setting the RUST_LOG environment variable (see the env_logger documentation).

Tests

Some tests in module edlib.rs can serve as basic examples.
The examples directory also contains a small version of the Edlib edaligner module (see apps/aligner in the Edlib installation directory). It runs on Fasta files containing a single sequence, such as those in the original Edlib test_data directory.
As the embedded sources do not include the original test_data sub-directory, it must be downloaded separately to run the edaligner example.
Unlike the Edlib version, the example runs all 3 modes (normal/NW, prefix/SHW and infix/HW) in one pass for a given query and target sequence.

Running

    RUST_LOG=info ./target/release/examples/edaligner --dirdata "$edlibpath/test_data/Enterobacteria_Phage_1" --tf "Enterobacteria_phage_1.fasta" --qf "mutated_90_perc.fasta"

gives the following timings in release mode, with Enterobacteria_phage_1.fasta as the target sequence and mutated_90_perc.fasta as the query sequence:

| mode | edlibrs time (s) | edlib time (s) | distance |
|------|------------------|----------------|----------|
| NW   | 0.106            | 0.106          | 9506     |
| SHW  | 0.184            | 0.191          | 9502     |
| HW   | 0.682            | 0.695          | 9502     |

With Enterobacteria_phage_1.fasta as the target sequence and mutated_60_perc.fasta as the query sequence, the timings in release mode are:

| mode | edlibrs time (s) | edlib time (s) | distance |
|------|------------------|----------------|----------|
| NW   | 0.398            | 0.398          | 39829    |
| SHW  | 0.670            | 0.684          | 39828    |
| HW   | 1.182            | 1.206          | 39828    |

Apart from negligible variations in CPU time measurement, the two interfaces have the same computation times.

License

Licensed under either the MIT license or the Apache license, at your option.

This software was written on my own while working at CEA, CEA-LIST
