Version 0.1.0 (unstable), released Dec 18, 2024
motif-scanner
A command-line tool for scanning DNA sequences and predicting transcription factor binding sites.
Features
- 🧬 Batch processing of sequence files
- 📊 PWM/EWM-based binding site analysis
- 🔍 Configurable occupancy threshold filtering
- 📈 Multiple output formats (CSV, Parquet)
- ⚡ Parallel processing for large datasets
Installation
From crates.io
cargo install motif-scanner
From Source
git clone https://github.com/peter6866/tf-binding-rs
cd tf-binding-rs
cargo install --path motif-scanner
Usage
Basic usage:
motif-scanner input.csv motifs.meme output.csv
With options:
motif-scanner input.csv motifs.meme output.parquet --cutoff 0.3 --mu 12
Arguments
- DATA_FILE: Input CSV file containing sequences (must have a 'sequence' column)
- PWM_FILE: MEME format file containing Position Weight Matrices
- OUTPUT_FILE: Path for output file (.csv or .parquet format)
- --cutoff: Minimum occupancy threshold (default: 0.2)
- --mu: Chemical potential parameter (default: 9)
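To give a feel for how the cutoff and mu options interact, here is a minimal Python sketch. It assumes the widely used Fermi-Dirac style occupancy model, in which a site with relative binding energy E has occupancy 1 / (1 + exp(E - mu)). This README does not state the exact formula the tool uses, so the functions `occupancy` and `passes_cutoff` are hypothetical illustrations, not the tool's API.

```python
import math


def occupancy(energy: float, mu: float = 9.0) -> float:
    """Predicted occupancy of one site under the ASSUMED model
    1 / (1 + exp(energy - mu)); energy and mu share the same units."""
    return 1.0 / (1.0 + math.exp(energy - mu))


def passes_cutoff(energy: float, mu: float = 9.0, cutoff: float = 0.2) -> bool:
    """True if the site would be kept. Whether the real tool uses
    '>=' or '>' at the boundary is an assumption here."""
    return occupancy(energy, mu) >= cutoff


if __name__ == "__main__":
    # A site whose energy equals mu is half occupied under this model.
    for mu in (9.0, 12.0, 15.0):
        print(mu, round(occupancy(12.0, mu), 3), passes_cutoff(12.0, mu))
```

Under this assumed model, raising mu increases the occupancy of every site, so more sites clear a fixed cutoff, while raising the cutoff keeps only the stronger sites.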
Input Format
The input CSV file must contain a column named 'sequence' with DNA sequences:
id,sequence
seq1,ATCGATCGTGCTAGCTA
seq2,GCTAGCTAGCTAGCTAG
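Before running a large batch it can help to check the input file up front. The sketch below uses only the Python standard library and is independent of the tool. The helper name `check_sequences` is hypothetical, and the rule that sequences contain only A, C, G, and T is an assumption for illustration, since this README does not say how other characters are handled.

```python
import csv
import io


def check_sequences(csv_text: str) -> list[str]:
    """Return a list of human-readable problems found in the CSV text.
    Raises ValueError if the required 'sequence' column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "sequence" not in reader.fieldnames:
        raise ValueError("input CSV needs a 'sequence' column")

    problems = []
    for row_number, row in enumerate(reader, start=1):
        seq = (row["sequence"] or "").upper()
        unexpected = sorted(set(seq) - set("ACGT"))  # assumed alphabet
        if not seq:
            problems.append(f"row {row_number}: empty sequence")
        elif unexpected:
            problems.append(
                f"row {row_number}: unexpected characters {''.join(unexpected)}"
            )
    return problems


if __name__ == "__main__":
    sample = "id,sequence\nseq1,ATCGATCGTGCTAGCTA\nseq2,GCTAGCTAGCTAGCTAG\n"
    print(check_sequences(sample))
```

To check a file on disk, read it with `open(path, newline="").read()` and pass the text to the function.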
Output Format
The tool generates a table with the following columns:
- label: Sequence index from input file
- position: Position of the binding site
- motif: Name of the transcription factor
- strand: Binding strand (F/R)
- length: Length of the motif
- occupancy: Predicted occupancy score
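As an illustration of working with the CSV output, the following Python sketch picks the highest-occupancy site for each input sequence. The column names come from the list above. The rows in `SAMPLE_OUTPUT`, the motif names, and the column order are made up for the example and are not real tool output.

```python
import csv
import io

# Hypothetical rows that only mimic the documented columns.
SAMPLE_OUTPUT = """label,position,motif,strand,length,occupancy
0,3,CRX,F,8,0.82
0,11,NRL,R,11,0.35
1,5,CRX,R,8,0.41
"""


def strongest_site_per_label(csv_text: str) -> dict[str, dict[str, str]]:
    """Map each label to the row with the highest occupancy."""
    best: dict[str, dict[str, str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row["label"]
        current = best.get(label)
        if current is None or float(row["occupancy"]) > float(current["occupancy"]):
            best[label] = row
    return best


if __name__ == "__main__":
    for label, row in strongest_site_per_label(SAMPLE_OUTPUT).items():
        print(label, row["motif"], row["strand"], row["position"], row["occupancy"])
```

Values are kept as strings as read by `csv.DictReader`; only the occupancy is converted to a float for comparison.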
Example
# Scan sequences with default parameters
motif-scanner sequences.csv pwm.meme results.csv
# Use a stricter threshold and a higher chemical potential
motif-scanner sequences.csv pwm.meme results.parquet --cutoff 0.4 --mu 15
# Process and save as Parquet format
motif-scanner data.csv motifs.meme output.parquet
Performance
The tool uses parallel processing for efficient scanning of large sequence datasets. Memory usage scales with the number of input sequences and motifs being scanned.