app motif-scanner

Command-line tool for scanning DNA sequences for transcription factor binding sites

1 unstable release

new 0.1.0 Dec 18, 2024

MIT license

motif-scanner

A command-line tool for scanning DNA sequences and predicting transcription factor binding sites.

Features

  • 🧬 Batch processing of sequence files
  • 📊 PWM/EWM-based binding site analysis
  • 🔍 Configurable occupancy threshold filtering
  • 📈 Multiple output formats (CSV, Parquet)
  • ⚡ Parallel processing for large datasets
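
To make the PWM/EWM idea concrete, here is a minimal sketch of how a position weight matrix might be turned into an energy weight matrix and slid along a sequence. It is not taken from the motif-scanner source: the function names, the pseudocount parameter, and the energy definition (negative log of each probability relative to the row maximum, in units of RT) are assumptions chosen for illustration.

```rust
/// Convert one PWM row (probabilities for A, C, G, T) into relative energies.
/// Assumption: energy = -ln(p / p_max), in units of RT, after adding a pseudocount.
fn pwm_row_to_ewm(row: [f64; 4], pseudocount: f64) -> [f64; 4] {
    let adjusted = row.map(|p| p + pseudocount);
    let max = adjusted.iter().cloned().fold(f64::MIN, f64::max);
    adjusted.map(|p| -(p / max).ln())
}

/// Map a base to its matrix column; None for anything other than A, C, G, T.
fn base_index(base: u8) -> Option<usize> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Reverse complement, so the reverse strand can be scanned with the same matrix.
/// Unrecognized characters become 'N'.
fn reverse_complement(seq: &str) -> String {
    seq.bytes()
        .rev()
        .map(|b| match b.to_ascii_uppercase() {
            b'A' => 'T',
            b'C' => 'G',
            b'G' => 'C',
            b'T' => 'A',
            _ => 'N',
        })
        .collect()
}

/// Energy of every motif-length window; None where a window holds an ambiguous base.
fn scan_energies(seq: &str, ewm: &[[f64; 4]]) -> Vec<Option<f64>> {
    let bytes = seq.as_bytes();
    if ewm.is_empty() || bytes.len() < ewm.len() {
        return Vec::new();
    }
    bytes
        .windows(ewm.len())
        .map(|window| {
            window
                .iter()
                .zip(ewm)
                .map(|(&b, row)| base_index(b).map(|i| row[i]))
                .sum::<Option<f64>>()
        })
        .collect()
}
```

Calling `scan_energies` once on a sequence and once on its `reverse_complement` would cover both strands; lower energies correspond to closer matches to the motif under this convention.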

Installation

From crates.io

cargo install motif-scanner

From Source

git clone https://github.com/peter6866/tf-binding-rs
cd tf-binding-rs
cargo install --path motif-scanner

Usage

Basic usage:

motif-scanner input.csv motifs.meme output.csv

With options:

motif-scanner input.csv motifs.meme output.parquet --cutoff 0.3 --mu 12

Arguments

  • DATA_FILE: Input CSV file containing sequences (must have a 'sequence' column)
  • PWM_FILE: MEME format file containing Position Weight Matrices
  • OUTPUT_FILE: Path for output file (.csv or .parquet format)
  • --cutoff: Minimum occupancy threshold; sites with predicted occupancy below it are excluded from the output (default: 0.2)
  • --mu: Chemical potential parameter (default: 9)
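
The interaction between --cutoff and --mu can be sketched as follows. This assumes the commonly used thermodynamic form, occupancy = 1 / (1 + exp(E - mu)), with the site energy E and mu both in units of RT; the README does not state the exact formula the tool applies, and the function names and the inclusive comparison against the cutoff are hypothetical.

```rust
/// Predicted occupancy of a site with relative energy `energy`.
/// Assumed model: 1 / (1 + exp(energy - mu)), energies and mu in units of RT.
fn occupancy(energy: f64, mu: f64) -> f64 {
    1.0 / (1.0 + (energy - mu).exp())
}

/// Whether a site would be kept for a given cutoff.
/// Assumption: a site exactly at the cutoff is kept.
fn passes_cutoff(energy: f64, mu: f64, cutoff: f64) -> bool {
    occupancy(energy, mu) >= cutoff
}
```

Under this model a site whose energy equals mu has occupancy 0.5, and raising mu raises the occupancy of every site, so more sites clear a fixed cutoff.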

Input Format

The input CSV file must contain a column named 'sequence' with DNA sequences:

id,sequence
seq1,ATCGATCGTGCTAGCTA
seq2,GCTAGCTAGCTAGCTAG
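
A quick way to check an input file before scanning is to confirm the header has a 'sequence' column and pull that column out. The sketch below uses only the standard library and assumes plain comma-separated text with no quoted fields; `sequence_column` and `read_sequences` are illustrative names, not part of motif-scanner.

```rust
/// Index of the 'sequence' column in a CSV header line, if present.
/// Assumes simple comma-separated fields without quoting.
fn sequence_column(header: &str) -> Option<usize> {
    header
        .trim_end()
        .split(',')
        .position(|name| name.trim() == "sequence")
}

/// Extract the 'sequence' field from every non-empty data row.
fn read_sequences(csv_text: &str) -> Result<Vec<String>, String> {
    let mut lines = csv_text.lines();
    let header = lines.next().ok_or("empty input")?;
    let idx = sequence_column(header).ok_or("missing 'sequence' column")?;
    lines
        .filter(|line| !line.trim().is_empty())
        .map(|line| {
            line.split(',')
                .nth(idx)
                .map(|field| field.trim().to_string())
                .ok_or_else(|| format!("row has no sequence field: {line}"))
        })
        .collect()
}
```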

Output Format

The tool generates a table with the following columns:

  • label: Sequence index from input file
  • position: Position of the binding site
  • motif: Name of the transcription factor
  • strand: Binding strand (F for forward, R for reverse)
  • length: Length of the motif
  • occupancy: Predicted occupancy score
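
For downstream code that consumes the results, one row of this table could be modeled as shown below. The field types and the column order are assumptions based on the list above, and `BindingSite` is a hypothetical name rather than a type exported by the tool.

```rust
/// One predicted binding site, mirroring the documented output columns.
/// Field types are assumptions; the README does not specify them.
#[derive(Debug, Clone, PartialEq)]
struct BindingSite {
    label: String,
    position: usize,
    motif: String,
    strand: char, // 'F' or 'R'
    length: usize,
    occupancy: f64,
}

impl BindingSite {
    /// Header line, in the order the columns are documented.
    const HEADER: &'static str = "label,position,motif,strand,length,occupancy";

    /// Format the site as one CSV row (no quoting or escaping).
    fn to_csv_row(&self) -> String {
        format!(
            "{},{},{},{},{},{}",
            self.label, self.position, self.motif, self.strand, self.length, self.occupancy
        )
    }
}
```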

Example

# Scan sequences with default parameters
motif-scanner sequences.csv pwm.meme results.csv

# Use stricter threshold and higher chemical potential
motif-scanner sequences.csv pwm.meme results.parquet --cutoff 0.4 --mu 15

# Save results in Parquet format
motif-scanner data.csv motifs.meme output.parquet

Performance

The tool uses parallel processing for efficient scanning of large sequence datasets. Memory usage scales with the number of input sequences and motifs being scanned.
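
The README does not say how the work is split across threads. As a rough illustration of per-sequence parallelism, the sketch below divides the input into chunks and processes each chunk on a scoped thread from the standard library; `parallel_map` is a hypothetical helper, and the real tool may rely on a different mechanism entirely.

```rust
use std::thread;

/// Apply `f` to every item using up to `n_threads` scoped threads.
/// Results are returned in the same order as the input.
fn parallel_map<T, R, F>(items: &[T], n_threads: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let n = n_threads.max(1);
    // Ceiling division so that at most `n` chunks are produced.
    let chunk_size = ((items.len() + n - 1) / n).max(1);
    thread::scope(|scope| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = items
            .chunks(chunk_size)
            .map(|chunk| {
                let f = &f;
                scope.spawn(move || chunk.iter().map(f).collect::<Vec<R>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Each worker holds its chunk's results until it is joined, which matches the note that memory grows with the number of sequences and motifs being scanned.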
