7 releases (breaking)

0.6.0 Dec 13, 2023
0.5.1 Aug 31, 2023
0.4.0 Aug 10, 2023
0.3.0 Jun 25, 2023
0.1.1 May 4, 2023

#197 in Biology

MIT license

1.5MB
4K SLoC

🎼🧬 lightmotif Star me

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

Actions Coverage License Docs Crate PyPI Wheel Bioconda Python Versions Python Implementations Source Mirror GitHub issues Changelog Downloads

🗺️ Overview

Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. They can be used to identify transcription factor binding sites in DNA, or protease cleavage site in polypeptides. Position weight matrices are often viewed as sequence logos:

MX000274.svg

The lightmotif library provides a Python module to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:

  • Compile-time definition of alphabets and matrix dimensions.
  • Sequence symbol encoding for fast table look-ups, as implemented in HMMER[1] or MEME[2]
  • Striped sequence matrices to process several positions in parallel, inspired by Michael Farrar[3].
  • Vectorized matrix row look-up using permute instructions of AVX2.

This is the Python version, there is a Rust crate available as well.

🔧 Installing

lightmotif can be installed directly from PyPI, which hosts some pre-built wheels for most mainstream platforms, as well as the code required to compile from source with Rust:

$ pip install lightmotif

In the event you have to compile the package from source, all the required Rust libraries are vendored in the source distribution, and a Rust compiler will be setup automatically if there is none on the host machine.

💡 Example

The motif interface should be mostly compatible with the Bio.motifs module from Biopython. The notable difference is that the calculate method of PSSM objects expects a striped sequence instead.

import lightmotif

# Create a count matrix from an iterable of sequences
motif = lightmotif.create(["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"])

# Create a PSSM with 0.1 pseudocounts and uniform background frequencies
pwm = motif.counts.normalize(0.1)
pssm = pwm.log_odds()

# Encode the target sequence into a striped matrix
seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG"
striped = lightmotif.stripe(seq)

# Compute scores using the fastest backend implementation for the host machine
scores = pssm.calculate(sseq)

⏱️ Benchmarks

Benchmarks use the MX000001 motif from PRODORIC[4], and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on a i7-10710U CPU running @1.10GHz, compiled with --target-cpu=native.

lightmotif (avx2):      5,479,884 ns/iter    (+/- 3,370,523) = 807.8 MiB/s
Bio.motifs:           334,359,765 ns/iter   (+/- 11,045,456) =  13.2 MiB/s
MOODS.scan:           182,710,624 ns/iter    (+/- 9,459,257) =  24.2 MiB/s
pymemesuite.fimo:     239,694,118 ns/iter    (+/- 7,444,620) =  18.5 MiB/s

💭 Feedback

⚠️ Issue Tracker

Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing in on a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the open-source MIT license.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References

  • Eddy, Sean R. ‘Accelerated Profile HMM Searches’. PLOS Computational Biology 7, no. 10 (20 October 2011): e1002195. doi:10.1371/journal.pcbi.1002195.
  • Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. ‘FIMO: Scanning for Occurrences of a given Motif’. Bioinformatics 27, no. 7 (1 April 2011): 1017–18. doi:10.1093/bioinformatics/btr064.
  • Farrar, Michael. ‘Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations’. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
  • Dudek, Christian-Alexander, and Dieter Jahn. ‘PRODORIC: State-of-the-Art Database of Prokaryotic Gene Regulation’. Nucleic Acids Research 50, no. D1 (7 January 2022): D295–302. doi:10.1093/nar/gkab1110.

Dependencies

~2.4–8.5MB
~56K SLoC