4 releases (breaking)
Uses new Rust 2024
| Version | Released |
|---|---|
| 0.8.0 | Nov 21, 2025 |
| 0.6.1 | Nov 14, 2025 |
| 0.5.0 | Nov 12, 2025 |
| 0.4.0 | Nov 12, 2025 |
🦀 PrimerPincer 🦀
Installation
Install cargo
First install cargo!
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Install primerpincer
Now you can install primerpincer. The most straightforward way is:
cargo install primerpincer
However, to enable SIMD optimizations in Sassy, build for your native CPU instead:
RUSTFLAGS="-C target-cpu=native" cargo install primerpincer
About
PrimerPincer is a Rust-based command-line tool designed to efficiently detect and remove pairs (forward and reverse) of primers from single-end amplicon reads in FASTQ format, with a particular focus on long-read sequencing data generated by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT).
In amplicon-based microbiome studies, such as those targeting 16S, ITS, 18S, or COI regions, primer removal is a crucial preprocessing step. The phylogenetically conserved regions where primers bind are typically removed because:
- They are phylogenetically uninformative, and their removal can improve the accuracy of downstream taxonomic classification.
- They are susceptible to PCR-induced mutagenesis, and therefore may not accurately represent true biological sequences.
- They often contain uninformative sequence data, and removing them can enhance computational performance in subsequent analyses.
The rise of third-generation sequencing platforms from PacBio and ONT has enabled the use of much longer marker gene regions than was previously feasible—such as the full-length 16S (V1–V9), 16S–ITS–23S operon, or 18S–ITS–28S operon. Additionally, the throughput and read counts produced per run continue to increase, driving a steady growth in the total volume of sequencing data generated.
PrimerPincer is designed to scale with these demands, providing rapid and accurate primer identification and removal for long-read datasets—with performance and scalability built for the future of sequencing.
Features
⚡ Lightning Fast
- Rust-based performance with zero-cost abstractions
- Parallel processing using Paraseq for multi-threaded FASTQ parsing and processing
- SIMD optimizations available
🔍 Multiple Search Algorithms
Choose the best algorithm for your use case:
- Sassy (default) - Approximate string matching as described in Beeloo and Groot Koerkamp (2025)
- Myers - Approximate pattern matching algorithm as described in Myers (1999). Implementation is very similar to Edlib’s (Šošić and Šikić, 2017).
- Hamming - Hamming distance string matching with mismatch tolerance
- BNDM - Exact match only. No mismatch or indel tolerance
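To make the difference between exact and mismatch-tolerant search concrete, here is a minimal sketch of Hamming-distance matching over a sliding window. It is an illustration only, not PrimerPincer's implementation: the function name `find_with_mismatches` and the way the tolerance is passed in are assumptions made for this example.

```rust
/// Hypothetical helper: return the first window of `text` that differs from
/// `pattern` in at most `max_mismatches` positions, as (offset, mismatches).
/// Only substitutions are counted, so insertions and deletions are not tolerated.
fn find_with_mismatches(
    text: &[u8],
    pattern: &[u8],
    max_mismatches: usize,
) -> Option<(usize, usize)> {
    if pattern.is_empty() || pattern.len() > text.len() {
        return None;
    }
    text.windows(pattern.len())
        .enumerate()
        .find_map(|(start, window)| {
            // Hamming distance between this window and the pattern.
            let mismatches = window
                .iter()
                .zip(pattern)
                .filter(|(a, b)| a != b)
                .count();
            (mismatches <= max_mismatches).then_some((start, mismatches))
        })
}

fn main() {
    let read = b"TTAGAGTTTGATCATGGCTCAGGG";
    let primer = b"AGAGTTTGATCCTGGCTCAG";
    // One substitution is tolerated here; with zero tolerance the search
    // behaves like an exact matcher and finds nothing in this read.
    println!("{:?}", find_with_mismatches(read, primer, 1));
    println!("{:?}", find_with_mismatches(read, primer, 0));
}
```

Setting the tolerance to zero reduces this to exact matching, which is the behaviour the BNDM option describes, although BNDM itself uses a different and faster search strategy.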
📦 Compression Format Support
Automatically handles common compression formats via niffler:
- Input: gzip (.gz), zstd (.zst), xz (.xz), bzip2 (.bz2), and uncompressed FASTQ (auto-detected)
- Output: User-selectable via the --compression flag (gzip, bzip2, xz, zstd, or uncompressed; defaults to gzip)
🧬 IUPAC Aware
Full support for IUPAC nucleotide ambiguity codes in primer sequences:
- Standard codes: R, Y, M, K, S, W, B, D, H, V, N
- Automatically expands degenerate primers or uses degenerate-aware matching algorithms
- Proper reverse complement handling for all ambiguity codes
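As an illustration of what IUPAC awareness involves, the sketch below shows a reverse complement that handles ambiguity codes and a bitmask test for whether a primer code is compatible with a read base. The function names and the bitmask encoding are assumptions chosen for this example and are not taken from PrimerPincer's source.

```rust
/// Complement of a single IUPAC nucleotide code. Unknown bytes map to N.
fn iupac_complement(base: u8) -> u8 {
    match base.to_ascii_uppercase() {
        b'A' => b'T', b'T' => b'A', b'C' => b'G', b'G' => b'C',
        b'R' => b'Y', b'Y' => b'R', b'M' => b'K', b'K' => b'M',
        b'B' => b'V', b'V' => b'B', b'D' => b'H', b'H' => b'D',
        b'S' => b'S', b'W' => b'W',
        _ => b'N',
    }
}

/// Reverse complement of a sequence that may contain ambiguity codes.
fn reverse_complement(seq: &str) -> String {
    seq.bytes().rev().map(|b| iupac_complement(b) as char).collect()
}

/// Set of bases a code stands for, as a bitmask (A=1, C=2, G=4, T=8).
fn iupac_mask(code: u8) -> u8 {
    match code.to_ascii_uppercase() {
        b'A' => 1, b'C' => 2, b'G' => 4, b'T' => 8,
        b'R' => 5, b'Y' => 10, b'M' => 3, b'K' => 12,
        b'S' => 6, b'W' => 9, b'B' => 14, b'D' => 13,
        b'H' => 11, b'V' => 7, b'N' => 15,
        _ => 0,
    }
}

/// A primer code matches a read base when their base sets overlap.
fn iupac_matches(primer_code: u8, read_base: u8) -> bool {
    iupac_mask(primer_code) & iupac_mask(read_base) != 0
}

fn main() {
    println!("{}", reverse_complement("RGYTACCTTGTTACGACTT"));
    println!("R vs A: {}", iupac_matches(b'R', b'A'));
    println!("R vs C: {}", iupac_matches(b'R', b'C'));
}
```

A mask comparison like this is one way to do degenerate-aware matching without expanding a primer into every concrete sequence it represents.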
🔄 Orientation normalization
The tool checks forward orientation first, followed by reverse orientation:
- If both primers are found in forward orientation of the read, the read is kept as-is
- If not found, the reverse orientation is searched for both primers
- If both primers are found in reverse orientation of the read, the read is kept and the reverse complement is output
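The orientation logic above can be sketched roughly as follows. This is a simplified illustration, not the tool's code: it uses exact substring search as a stand-in for the real matchers, handles only A, C, G and T, and assumes that a read in forward orientation carries the forward primer followed by the reverse complement of the reverse primer. The names `trim_in_orientation` and `normalize` are hypothetical.

```rust
/// Reverse complement for plain A/C/G/T sequences (other characters pass through).
fn reverse_complement(seq: &str) -> String {
    seq.chars()
        .rev()
        .map(|c| match c {
            'A' => 'T',
            'C' => 'G',
            'G' => 'C',
            'T' => 'A',
            other => other,
        })
        .collect()
}

/// Look for `fwd` followed by the reverse complement of `rev` and return the
/// region between them. Exact matching is a stand-in for the real algorithms.
fn trim_in_orientation<'a>(read: &'a str, fwd: &str, rev: &str) -> Option<&'a str> {
    let rev_rc = reverse_complement(rev);
    let start = read.find(fwd)? + fwd.len();
    let end = start + read[start..].find(rev_rc.as_str())?;
    Some(&read[start..end])
}

/// Try the read as given first, then its reverse complement.
/// The result is always reported in forward orientation.
fn normalize(read: &str, fwd: &str, rev: &str) -> Option<String> {
    if let Some(insert) = trim_in_orientation(read, fwd, rev) {
        return Some(insert.to_string());
    }
    let flipped = reverse_complement(read);
    trim_in_orientation(&flipped, fwd, rev).map(str::to_string)
}

fn main() {
    let (fwd, rev) = ("ACCA", "GGTT");
    // The same amplicon sequenced from either strand yields the same output.
    println!("{:?}", normalize("ACCATGTGTGAACC", fwd, rev));
    println!("{:?}", normalize("GGTTCACACATGGT", fwd, rev));
    println!("{:?}", normalize("TTTTTTTT", fwd, rev));
}
```

Because both strands normalize to the same sequence, downstream tools see every amplicon in a single orientation.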
📏 Size filtering
Optional size filtering can be applied:
- Minimum length to accept amplicons
- Maximum length to accept amplicons
✅ Quality filtering
Reads whose average Phred quality score falls below a user-defined threshold are filtered out:
- The average basecall quality score is calculated as Wouter De Coster outlines in his blog
- This aims to replicate the functionality of chopper
- For more advanced quality trimming options, see chopper!
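For readers unfamiliar with this style of averaging, the sketch below converts each Phred score to an error probability, averages the probabilities, and converts the mean back to a Phred score. It assumes Phred+33 encoded quality strings, and the function name `mean_phred` is hypothetical; it illustrates the calculation and is not PrimerPincer's code.

```rust
/// Mean quality of a Phred+33 encoded quality string, computed by averaging
/// error probabilities rather than the raw scores. Returns None for empty input.
fn mean_phred(quality: &[u8]) -> Option<f64> {
    if quality.is_empty() {
        return None;
    }
    let total_error: f64 = quality
        .iter()
        .map(|&c| {
            // Decode the ASCII character to a Phred score, then to an error probability.
            let q = f64::from(c.saturating_sub(33));
            10.0_f64.powf(-q / 10.0)
        })
        .sum();
    let mean_error = total_error / quality.len() as f64;
    Some(-10.0 * mean_error.log10())
}

fn main() {
    // 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
    println!("{:?}", mean_phred(b"IIII"));
    println!("{:?}", mean_phred(b"5I"));
}
```

Averaging the probabilities means a few low-quality bases pull the result down much more than an arithmetic mean of the scores would: a read with one Q20 and one Q40 base scores about 23, not 30.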
Usage
PrimerPincer - a CLI tool for the rapid identification and removal of paired primers from long read amplicons
Usage: primerpincer [OPTIONS] --input <FILE> --output <FILE> --forward <SEQUENCE> --reverse <SEQUENCE>
Options:
-i, --input <FILE>
Input FASTQ file
-o, --output <FILE>
Output FASTQ file
-f, --forward <SEQUENCE>
Forward primer sequence (5' to 3' orientation)
-r, --reverse <SEQUENCE>
Reverse primer sequence (5' to 3' orientation)
-a, --algorithm <ALGORITHM>
Algorithm to use for primer matching
Possible values:
- sassy: Pattern matching algorithm as described in Beeloo and Koerkamp (2025)
- myers: Rust Bio's Myers bit-parallel algorithm, very similar to Edlib's algorithm as described in Šošić and Šikić (2017)
- hamming: Hamming distance algorithm as described in Waterman and Eggert (1987). Can tolerate mismatches but not indels
- bndm: Rust Bio's BNDM exact pattern matching algorithm as described in Baeza-Yates and Gonnet (1992). Exact matching only. No mismatch or indels tolerated
[default: sassy]
-e, --error-rate <FLOAT>
Maximum error rate in primer matching (e.g., 0.15 for 15% errors)
[default: 0.15]
-w, --window-size <INT>
Window size to search for primer at start and end of sequence
[default: 100]
-O, --overlap <MINLENGTH>
Minimum overlap length. Require MINLENGTH bases of the primer to match (default 6)
[default: 6]
-t, --threads <INT>
Number of threads to use
[default: 4]
-c, --compression <COMPRESSION>
Compression format for the output FASTQ (defaults to gzip)
Possible values:
- none: No compression; write plain text FASTQ
- gzip: Standard gzip compression
- bzip2: bzip2 compression
- xz: LZMA/XZ compression
- zstd: Zstandard compression
[default: gzip]
-m, --min-length <INT>
Minimum read length after trimming (inclusive)
-M, --max-length <INT>
Maximum read length after trimming (inclusive)
-q, --min-average-quality <FLOAT>
Minimum Average Quality Score
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
Examples
primerpincer \
-i ./example_data/raw/ATCC-MSA1003-toy-example.fastq.gz \
-o ~/primerpincer_proccesed/ATCC-MSA1003-toy-example.fastq.gz \
-f "AGRGTTYGATYMTGGCTCAG" \
-r "RGYTACCTTGTTACGACTT" \
-t 12 \
-a sassy \
-O 6 \
-m 500
Contributing
Contributions to PrimerPincer are welcome! Here are some ways you can contribute:
Reporting Issues
- Report bugs or request features by opening an issue on GitHub
- Include example data and error messages when reporting bugs
- Describe your use case when requesting new features
Contributing Code
- Fork the repository
- Create a new branch for your feature (git checkout -b feature/amazing-feature)
- Make your changes
- Run tests to ensure everything works
- Format your code using cargo fmt and ensure it passes cargo clippy --all-targets -- -D warnings
- Commit your changes using Conventional Commits format (e.g., feat: add new algorithm, fix: resolve compilation error)
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
CI Checks: All pull requests will be automatically checked by our CI workflow (.github/workflows/ci.yaml):
- Commit messages must follow the Conventional Commits specification (validated by Commitizen)
- Code formatting must pass cargo fmt --all -- --check
- Code compilation must pass cargo check --all-targets
- Code linting must pass cargo clippy --all-targets -- -D warnings
All CI checks must pass before your PR can be merged.
Citation
If you use PrimerPincer in your research, please cite:
Beeloo, R. & Groot Koerkamp, R. Sassy: Searching Short DNA Strings in the 2020s. 2025.07.22.666207 Preprint at https://doi.org/10.1101/2025.07.22.666207 (2025).
Licence
This project is licensed under the MIT License - see the LICENSE file for details.