6 releases (breaking)
|0.5.1||Aug 31, 2023|
|0.4.0||Aug 10, 2023|
|0.3.0||Jun 25, 2023|
|0.1.1||May 4, 2023|
#16 in Biology
90 downloads per month
Used in 3 crates
Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. They can be used to identify transcription factor binding sites in DNA, or protease cleavage site in polypeptides. Position weight matrices are often viewed as sequence logos:
lightmotif library provides a Rust crate to run very efficient
searches for a motif encoded in a position weight matrix. The position
scanning combines several techniques to allow high-throughput processing
- Compile-time definition of alphabets and matrix dimensions.
- Sequence symbol encoding for fast table look-ups, as implemented in HMMER or MEME
- Striped sequence matrices to process several positions in parallel, inspired by Michael Farrar.
- Vectorized matrix row look-up using
permuteinstructions of AVX2.
Other crates from the ecosystem provide additional features if needed:
lightmotif-tfmpvalueis an exact reimplementation of the TFMPvalue algorithm for converting between a score and a P-value for a given scoring matrix.
lightmotif-transfacis a parser for position-specific scoring matrices in the TRANSFAC format.
This is the Rust version, there is a Python package available as well.
use lightmotif::*; use lightmotif::abc::Nucleotide; use typenum::U32; // Create a count matrix from an iterable of motif sequences let counts = CountMatrix::<Dna>::from_sequences(&[ EncodedSequence::encode("GTTGACCTTATCAAC").unwrap(), EncodedSequence::encode("GTTGATCCAGTCAAC").unwrap(), ]).unwrap(); // Create a PSSM with 0.1 pseudocounts and uniform background frequencies. let pssm = counts.to_freq(0.1).to_scoring(None); /// Create a pipeline to run tasks with platform acceleration let pli = Pipeline::dispatch(); // Use the pipeline to encode the target sequence into a striped matrix let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG"; let encoded = pli.encode(seq).unwrap(); let mut striped = pli.stripe(encoded); // Use the pipeline to compute scores for every position of the matrix. striped.configure(&pssm); let scores = pli.score(&striped, &pssm); // Scores can be extracted into a Vec<f32>, or indexed directly. let v = scores.to_vec(); assert_eq!(scores, -23.07094); assert_eq!(v, -23.07094); // The highest scoring position can be searched with a pipeline as well. let best = pli.argmax(&scores).unwrap(); assert_eq!(best, 18);
This example uses a dynamic dispatch pipeline, which selects the best available backend (AVX2, SSE2, NEON, or a generic implementation) depending on the local platform.
Both benchmarks use the MX000001
motif from PRODORIC, and the
complete genome of an
Escherichia coli K12 strain.
Benchmarks were run on a i7-10710U CPU running @1.10GHz, compiled with
Score every position of the genome with the motif weight matrix:
test bench_avx2 ... bench: 4,510,794 ns/iter (+/- 9,570) = 1029 MB/s test bench_sse2 ... bench: 26,773,537 ns/iter (+/- 57,891) = 173 MB/s test bench_generic ... bench: 317,731,004 ns/iter (+/- 2,567,370) = 14 MB/s
Find the highest-scoring position for a motif in a 10kb sequence (compared to the PSSM algorithm implemented in
test bench_avx2 ... bench: 12,797 ns/iter (+/- 380) = 781 MB/s test bench_sse2 ... bench: 62,597 ns/iter (+/- 43) = 159 MB/s test bench_generic ... bench: 671,900 ns/iter (+/- 1,150) = 14 MB/s test bench_bio ... bench: 1,193,911 ns/iter (+/- 2,519) = 8 MB/s
⚠️ Issue Tracker
Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing in on a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
This library is provided under the open-source MIT license.
- Eddy, Sean R. ‘Accelerated Profile HMM Searches’. PLOS Computational Biology 7, no. 10 (20 October 2011): e1002195. doi:10.1371/journal.pcbi.1002195.
- Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. ‘FIMO: Scanning for Occurrences of a given Motif’. Bioinformatics 27, no. 7 (1 April 2011): 1017–18. doi:10.1093/bioinformatics/btr064.
- Farrar, Michael. ‘Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations’. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
- Touzet, Hélène, and Jean-Stéphane Varré. ‘Efficient and Accurate P-Value Computation for Position Weight Matrices’. Algorithms for Molecular Biology 2, no. 1 (2007): 1–12. doi:10.1186/1748-7188-2-15.
- Dudek, Christian-Alexander, and Dieter Jahn. ‘PRODORIC: State-of-the-Art Database of Prokaryotic Gene Regulation’. Nucleic Acids Research 50, no. D1 (7 January 2022): D295–302. doi:10.1093/nar/gkab1110.