2 releases

0.1.1	Feb 24, 2025
0.1.0	Dec 18, 2024

MIT license

motif-scanner

A command-line tool for scanning DNA sequences and predicting transcription factor binding sites.

Features

🧬 Batch processing of sequence files
📊 PWM/EWM-based binding site analysis
🔍 Configurable occupancy threshold filtering
📈 Multiple output formats (CSV, Parquet)
⚡ Parallel processing for large datasets

Installation

From crates.io

cargo install motif-scanner

From Source

git clone https://github.com/peter6866/tf-binding-rs
cd tf-binding-rs
cargo install --path motif-scanner

Usage

Basic usage:

motif-scanner input.csv motifs.meme output.csv

With options:

motif-scanner input.csv motifs.meme output.parquet --cutoff 0.3 --mu 12

Arguments

DATA_FILE: Input CSV file containing sequences (must have a 'sequence' column)
PWM_FILE: MEME format file containing Position Weight Matrices
OUTPUT_FILE: Path for the output file; use a .csv or .parquet extension
--cutoff: Minimum predicted occupancy for a site to be reported (default: 0.2)
--mu: Chemical potential parameter (default: 9)
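
The README does not spell out how --mu and --cutoff interact. As a rough sketch, one common thermodynamic form treats occupancy as a logistic function of the gap between a site's binding energy and the chemical potential. The formula and the function names below are assumptions for illustration, not the tool's confirmed implementation:

```python
import math

def occupancy(energy, mu=9.0):
    """Assumed model: occupancy = 1 / (1 + exp(energy - mu)).

    `energy` is a site's binding energy in the same units as `mu`.
    This form is an assumption, not taken from the motif-scanner source.
    """
    return 1.0 / (1.0 + math.exp(energy - mu))

def passes_cutoff(energy, mu=9.0, cutoff=0.2):
    """Mirror the idea of --cutoff: keep sites at or above the threshold."""
    return occupancy(energy, mu) >= cutoff

# Lower energy (tighter binding) gives higher occupancy.
for energy in (6.0, 9.0, 12.0):
    print(energy, round(occupancy(energy), 3), passes_cutoff(energy))
```

Under this assumed form, raising mu increases every site's occupancy, so more sites clear a fixed cutoff, while raising the cutoff has the opposite effect.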

Input Format

The input CSV file must contain a column named 'sequence' with DNA sequences:

id,sequence
seq1,ATCGATCGTGCTAGCTA
seq2,GCTAGCTAGCTAGCTAG
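
Before a long run, it can help to confirm that the input file has the required column. The following is a minimal Python sketch using only the standard library; `check_sequence_column` is a hypothetical helper, not part of motif-scanner. It checks only for the 'sequence' header and non-empty values, since the README does not say which characters the tool accepts:

```python
import csv
import io

def check_sequence_column(handle):
    """Return the number of data rows, raising ValueError if the input is unusable."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "sequence" not in reader.fieldnames:
        raise ValueError("input CSV must have a 'sequence' column")
    count = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        seq = (row["sequence"] or "").strip()
        if not seq:
            raise ValueError(f"empty sequence on line {line_no}")
        count += 1
    return count

# The sample from this section, checked in memory.
sample = "id,sequence\nseq1,ATCGATCGTGCTAGCTA\nseq2,GCTAGCTAGCTAGCTAG\n"
print(check_sequence_column(io.StringIO(sample)))
```

For a file on disk, pass `open("input.csv", newline="")` instead of the `io.StringIO` object.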

Output Format

The tool generates a table with the following columns:

label: Sequence index from input file
position: Position of the binding site
motif: Name of the transcription factor
strand: Binding strand (F = forward, R = reverse)
length: Length of the motif
occupancy: Predicted occupancy score
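
A CSV result can be post-processed with a few lines of Python. The sketch below assumes the output file has a header row with exactly the column names listed above and that `position` and `occupancy` parse as numbers. The sample rows and motif names are made up for illustration and are not real scanner output; `sites_by_motif` is a hypothetical helper:

```python
import csv
import io
from collections import defaultdict

def sites_by_motif(handle, min_occupancy=0.5):
    """Group rows with occupancy >= min_occupancy by motif name."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        occupancy = float(row["occupancy"])
        if occupancy >= min_occupancy:
            grouped[row["motif"]].append(
                (row["label"], int(row["position"]), row["strand"], occupancy)
            )
    return dict(grouped)

# Made-up rows for illustration only; not real scanner output.
sample = (
    "label,position,motif,strand,length,occupancy\n"
    "0,3,motifA,F,8,0.91\n"
    "0,10,motifB,R,6,0.35\n"
    "1,5,motifA,R,8,0.62\n"
)
for motif, sites in sites_by_motif(io.StringIO(sample)).items():
    print(motif, sites)
```

Parquet output would need a separate reader library; this sketch covers the CSV case only.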

Example

# Scan sequences with default parameters
motif-scanner sequences.csv pwm.meme results.csv

# Use a stricter threshold and a higher chemical potential
motif-scanner sequences.csv pwm.meme results.parquet --cutoff 0.4 --mu 15

# Scan and save the results in Parquet format
motif-scanner data.csv motifs.meme output.parquet

Performance

The tool uses parallel processing for efficient scanning of large sequence datasets. Memory usage scales with the number of input sequences and motifs being scanned.
