13 releases (8 breaking)

0.9.1 Sep 2, 2024
0.8.0 Jun 28, 2024
0.6.0 Dec 13, 2023
0.5.1 Aug 31, 2023
0.3.0 Jun 25, 2023


🎼🧬 lightmotif

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

🗺️ Overview

Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. These matrices can be used to identify transcription factor binding sites in DNA, or protease cleavage sites in polypeptides. Position weight matrices are often viewed as sequence logos:

[Figure: sequence logo of the MX000274 motif (MX000274.svg)]
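As a point of reference, scanning with a position weight matrix amounts to sliding a window over the sequence and summing one matrix entry per column. The sketch below is a naive pure-Python version of that idea, not the algorithm used by lightmotif; the matrix values and the `scan` helper are hypothetical and only serve as an illustration.

```python
# Hypothetical 3-column log-odds matrix (one dict per motif position).
# The values are made up for illustration.
PSSM = [
    {"A": 1.2, "C": -0.8, "G": -0.8, "T": -0.5},
    {"A": -0.9, "C": 1.5, "G": -0.7, "T": -0.9},
    {"A": -0.6, "C": -0.6, "G": 1.4, "T": -1.0},
]


def scan(sequence, pssm):
    """Return the score of every window of len(pssm) symbols in `sequence`."""
    width = len(pssm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the matrix entry selected by each symbol of the window.
        scores.append(sum(column[symbol] for column, symbol in zip(pssm, window)))
    return scores


scores = scan("TTACGAA", PSSM)
# Index of the highest-scoring window.
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores[best])
```

A sequence of length n and a motif of width w give n - w + 1 scores, one per possible start position.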

The lightmotif library provides a Python module to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:

  • Compile-time definition of alphabets and matrix dimensions.
  • Sequence symbol encoding for fast table look-ups, as implemented in HMMER[1] or MEME[2].
  • Striped sequence matrices to process several positions in parallel, inspired by Michael Farrar[3].
  • Vectorized matrix row look-up using permute instructions of AVX2.
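The symbol encoding and striped layout listed above can be pictured with a small pure-Python sketch. It assumes a Farrar-style layout in which the sequence is cut into contiguous segments, one per lane, so that each row of the matrix holds positions that are a fixed stride apart. The alphabet order, the lane count, the padding symbol and the `encode` and `stripe` helpers are all hypothetical and do not reflect the internal representation used by lightmotif.

```python
ALPHABET = "ACGTN"  # assumed symbol order, chosen for illustration


def encode(sequence):
    """Replace each symbol by its index so it can be used for table look-ups."""
    return [ALPHABET.index(symbol) for symbol in sequence]


def stripe(encoded, lanes, pad):
    """Lay out an encoded sequence as a rows x lanes matrix.

    Lane `c` holds the contiguous segment starting at position `c * rows`,
    so a single row gathers `lanes` positions that can be processed together.
    """
    rows = -(-len(encoded) // lanes)  # ceiling division
    matrix = [[pad] * lanes for _ in range(rows)]
    for position, code in enumerate(encoded):
        matrix[position % rows][position // rows] = code
    return matrix


encoded = encode("ACGTACGTAC")
matrix = stripe(encoded, lanes=4, pad=ALPHABET.index("N"))
for row in matrix:
    print(row)
```

With this layout, advancing by one position in every segment means moving down one row, which is what lets a vector instruction handle all lanes in one step.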

This is the Python version; a Rust crate is available as well.

🔧 Installing

lightmotif can be installed directly from PyPI, which hosts some pre-built wheels for most mainstream platforms, as well as the code required to compile from source with Rust:

$ pip install lightmotif

In the event you have to compile the package from source, all the required Rust libraries are vendored in the source distribution, and a Rust compiler will be set up automatically if there is none on the host machine.

💡 Example

The motif interface should be mostly compatible with the Bio.motifs module from Biopython. The notable difference is that the calculate method of PSSM objects expects a striped sequence instead.

import lightmotif

# Create a count matrix from an iterable of sequences
motif = lightmotif.create(["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"])

# Create a PSSM with 0.1 pseudocounts and uniform background frequencies
pwm = motif.counts.normalize(0.1)
pssm = pwm.log_odds()

# Encode the target sequence into a striped matrix
seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG"
striped = lightmotif.stripe(seq)

# Compute scores using the fastest backend implementation for the host machine
scores = pssm.calculate(striped)
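To make the pseudocount and log-odds steps of the example more concrete, here is a plain-Python sketch of the arithmetic. It assumes the Biopython-style convention of adding the pseudocount to every cell before normalizing each column, then taking the base-2 logarithm of the ratio to the background frequency. The `count_matrix` and `log_odds` helpers are hypothetical and are not part of the lightmotif API.

```python
import math


def count_matrix(sequences, alphabet="ACGT"):
    """Count the occurrences of each symbol at each motif position."""
    width = len(sequences[0])
    counts = [{symbol: 0 for symbol in alphabet} for _ in range(width)]
    for sequence in sequences:
        for column, symbol in zip(counts, sequence):
            column[symbol] += 1
    return counts


def log_odds(counts, pseudocount=0.1, background=None):
    """Turn counts into log2-odds scores (uniform background by default)."""
    pssm = []
    for column in counts:
        bg = background or {symbol: 1 / len(column) for symbol in column}
        # Assumed convention: the pseudocount is added to every cell.
        total = sum(column.values()) + pseudocount * len(column)
        pssm.append({
            symbol: math.log2((n + pseudocount) / total / bg[symbol])
            for symbol, n in column.items()
        })
    return pssm


counts = count_matrix(["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"])
pssm = log_odds(counts)
print(pssm[0])
```

In the first column both sequences carry a G, so G gets a positive score while the three unseen symbols get the same negative score.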

⏱️ Benchmarks

Benchmarks use the MX000001 motif from PRODORIC[4], and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on an i7-10710U CPU running @1.10GHz, compiled with --target-cpu=native.

lightmotif (avx2):      5,479,884 ns/iter    (+/- 3,370,523) = 807.8 MiB/s
Bio.motifs:           334,359,765 ns/iter   (+/- 11,045,456) =  13.2 MiB/s
MOODS.scan:           182,710,624 ns/iter    (+/- 9,459,257) =  24.2 MiB/s
pymemesuite.fimo:     239,694,118 ns/iter    (+/- 7,444,620) =  18.5 MiB/s
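For a rough comparison, the mean timings above can be turned into ratios relative to the lightmotif run. The short sketch below only reuses the numbers from the table and ignores the reported spread, so the ratios should be read as indicative.

```python
# Mean time per iteration in nanoseconds, copied from the table above.
timings_ns = {
    "lightmotif (avx2)": 5_479_884,
    "Bio.motifs": 334_359_765,
    "MOODS.scan": 182_710_624,
    "pymemesuite.fimo": 239_694_118,
}

baseline = timings_ns["lightmotif (avx2)"]
# How many times longer each implementation takes than the baseline.
ratios = {name: t / baseline for name, t in timings_ns.items()}

for name, ratio in ratios.items():
    print(f"{name:<20} {ratio:6.1f}x")
```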

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the GNU General Public License 3.0 or later, as it contains the GPL-licensed code of the TFM-PVALUE algorithm. The TFM-PVALUE dependency can be removed by disabling the pvalue crate feature, in which case the code can be used and redistributed under the terms of the MIT license.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References

  [1] Eddy, Sean R. ‘Accelerated Profile HMM Searches’. PLOS Computational Biology 7, no. 10 (20 October 2011): e1002195. doi:10.1371/journal.pcbi.1002195.
  [2] Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. ‘FIMO: Scanning for Occurrences of a given Motif’. Bioinformatics 27, no. 7 (1 April 2011): 1017–18. doi:10.1093/bioinformatics/btr064.
  [3] Farrar, Michael. ‘Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations’. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
  [4] Dudek, Christian-Alexander, and Dieter Jahn. ‘PRODORIC: State-of-the-Art Database of Prokaryotic Gene Regulation’. Nucleic Acids Research 50, no. D1 (7 January 2022): D295–302. doi:10.1093/nar/gkab1110.
