5 releases
Uses new Rust 2024
| Version | Released |
|---|---|
| 0.2.5 | Mar 11, 2026 |
| 0.2.4 | Mar 2, 2026 |
| 0.2.3 | Mar 2, 2026 |
| 0.2.2 | Feb 25, 2026 |
| 0.2.1 | Feb 24, 2026 |
#229 in Biology
1MB, 26K SLoC

Genemancer
Genemancer is a Rust CLI toolkit for genomics file processing, built primarily on the noodles ecosystem, with optional GPU acceleration (wgpu or CUDA) for target-based variant aggregation.
Toolkit
Current subcommands:
- merge-bam (implemented): merge multiple coordinate-sorted, indexed BAM files into one BAM, with optional BED filtering (all|strict|trim), read-group filtering, output index writing, and configurable compression level.
- gff-to-gtf (implemented): convert GFF3 annotations to GTF (stdin/stdout supported).
- gtf-to-introns (implemented): extract transcript intron intervals from GTF annotations and write GFF3 output, with .gtf.gz input support and reference-script-style default output naming. Current behavior derives transcript exon-gap introns; full Bioconductor intronicParts() parity is still pending for complex overlapping transcript models.
- call-targets (implemented): call simple SNVs from BAM inputs over BED target intervals and write bgzipped VCF output (.vcf.gz) with index (csi default, optional tbi).
- call-targets-gpu (implemented): same pipeline as call-targets, but attempts GPU initialization and falls back to CPU unless --require-gpu is set.
- split-bam (implemented): split one or more coordinate-sorted BAM files into per-region BAMs from a BED file, with optional unassigned-read output and optional output indexing.
- pod5 (implemented): namespace for POD5 operations exposed as genemancer pod5 <operation> (validate and subsample are implemented; inspect is scaffolded).
- vcf (in progress): namespace for VCF comparison workflows; genemancer vcf diff currently validates multisample-vs-multi-file input semantics and set definitions, but record loading and set-difference computation are still scaffolded.
- cnloh (in progress): SubChrom-inspired CNV/cnLOH detect pipeline with marker-filtered variant evidence, BAM coverage summary, and a single chromosome-colored CNV.png plot.
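To make the "exon-gap introns" behavior of gtf-to-introns concrete, here is a minimal sketch of deriving introns as the gaps between one transcript's exons. It is an illustration only, not the crate's code: the Interval type, the exon_gap_introns function, and the half-open coordinate convention are assumptions made for this example (GTF itself uses 1-based closed coordinates).

```rust
/// Hypothetical half-open interval on one transcript: [start, end).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Interval {
    pub start: u64,
    pub end: u64,
}

/// Derive introns as the gaps between consecutive exons of ONE transcript.
/// Exons may arrive unsorted; overlapping or abutting exons yield no intron.
pub fn exon_gap_introns(exons: &[Interval]) -> Vec<Interval> {
    let mut sorted = exons.to_vec();
    sorted.sort_by_key(|e| (e.start, e.end));

    let mut introns = Vec::new();
    let mut prev_end: Option<u64> = None;
    for exon in sorted {
        if let Some(end) = prev_end {
            if exon.start > end {
                introns.push(Interval { start: end, end: exon.start });
            }
        }
        // Track the furthest exon end seen so far, so an exon nested inside
        // an earlier one does not create a false gap.
        prev_end = Some(prev_end.map_or(exon.end, |e| e.max(exon.end)));
    }
    introns
}
```

Because this works per transcript, it does not reproduce what intronicParts() does across overlapping transcript models, which is the parity gap noted for gtf-to-introns.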
Global options:
- -v/--verbose (repeatable) for log verbosity.
- -t/--threads to control worker threads.
- --log-file <FILE> to mirror stderr logs to a file.
Installation
Install from crates.io:
cargo install genemancer
Or install from the local repository checkout:
cargo install --path .
Build And Run From Source
- Install a Rust toolchain with edition 2024 support.
- Build:
cargo build
- Show CLI help:
cargo run -- --help
You can inspect any command with:
cargo run -- <subcommand> --help
Installed Binary Usage
After installing with cargo install genemancer or cargo install --path ., run:
genemancer --help
genemancer <subcommand> --help
If $HOME/.cargo/bin is not on your PATH, use:
~/.cargo/bin/genemancer --help
Usage Examples
Examples below assume you provide your own inputs and use a locally installed binary. In this repository, *.bam and /test_data are gitignored.
If you are running from source instead, prefix with cargo run -- (or cargo run --features cuda -- for CUDA-enabled builds).
Merge two BAMs into one BAM with index output:
genemancer merge-bam \
-i /path/to/input1.bam \
-i /path/to/input2.bam \
-o test_data/merged.bam \
--index
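Merging inputs that are each already coordinate-sorted is commonly done with a k-way merge over a min-heap. The sketch below shows that general technique on plain sort keys; it is not taken from the genemancer source, and the Key alias and k_way_merge function are hypothetical names. Real BAM merging also has to handle headers, unmapped reads, and filtering, all of which are left out here.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical sort key for a record: (reference sequence index, 0-based start).
pub type Key = (usize, u64);

/// Merge several inputs, each already sorted by `Key`, into one sorted Vec.
pub fn k_way_merge(inputs: Vec<Vec<Key>>) -> Vec<Key> {
    let mut iters: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();
    let mut heap = BinaryHeap::new();

    // Seed the heap with the first record of every input.
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(key) = it.next() {
            // Reverse turns the max-heap into a min-heap.
            heap.push(Reverse((key, idx)));
        }
    }

    let mut merged = Vec::new();
    while let Some(Reverse((key, idx))) = heap.pop() {
        merged.push(key);
        // Refill from the input that just supplied the smallest record.
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    merged
}
```

The heap holds at most one pending record per input, so memory use grows with the number of inputs rather than with their total size.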
Convert GFF3 to GTF:
genemancer gff-to-gtf \
-i input.gff3 \
-o output.gtf
Extract introns from a GTF or gzipped GTF:
genemancer gtf-to-introns /path/to/hg38.ncbiRefSeq.gtf.gz
# writes /path/to/hg38.ncbiRefSeq.introns.gff by default
Call SNVs on target regions (CPU/streaming path):
genemancer call-targets \
-i /path/to/bams_or_directory \
-r /path/to/reference.fa.gz \
-T /path/to/targets.bed \
--rg-map references/rg_map.txt \
-o test_data/out.vcf.gz
Run the GPU-enabled path (falls back to CPU by default):
genemancer call-targets-gpu \
-i /path/to/bams_or_directory \
-r /path/to/reference.fa.gz \
-T /path/to/targets.bed \
--rg-map references/rg_map.txt \
--gpu-backend auto \
-o test_data/out.vcf.gz
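The fallback rule described for call-targets-gpu (use the GPU if initialization succeeds, otherwise run on CPU unless --require-gpu is set) can be sketched as a small decision function. The Backend enum and choose_backend function are hypothetical names for this illustration and do not reflect the crate's internal API.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    Gpu,
    Cpu,
}

/// Decide which backend to run, given the outcome of GPU initialization.
/// `gpu_init` is Ok(()) when a device was acquired, Err(reason) otherwise.
pub fn choose_backend(gpu_init: Result<(), String>, require_gpu: bool) -> Result<Backend, String> {
    match gpu_init {
        Ok(()) => Ok(Backend::Gpu),
        // With --require-gpu, an initialization failure is fatal.
        Err(reason) if require_gpu => Err(format!("GPU required but unavailable: {reason}")),
        // Otherwise fall back to the CPU path.
        Err(_) => Ok(Backend::Cpu),
    }
}
```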
Run cnLOH/CNV detect with a single colored CNV plot:
genemancer cnloh detect \
--sample sample_01 \
--bam /path/to/sample_01_lane1.bam /path/to/sample_01_lane2.bam \
--vcf /path/to/sample_01.vcf.gz \
--vcf-sample sample_01 \
--data-type WGS \
--reference /path/to/reference.fa \
--panel-bin WGS \
--marker-dir /path/to/SNPmarker \
--log-output test_data/cnloh/sample_01.cnloh.log \
--plots true \
--output test_data/cnloh
Get the SubChrom-compatible SNP marker databases (hg38/hg19):
curl -o SNPmarker_hg38.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1" && \
unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip
curl -o SNPmarker_hg19.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1" && \
unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip
Dockerfile form:
RUN curl -o SNPmarker_hg38.zip https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1 && \
unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip
RUN curl -o SNPmarker_hg19.zip https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1 && \
unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip
cnloh detect defaults to canonical chromosomes (chr1-22, chrX, chrY) for output/plot rows.
Use --include-noncanonical to keep all contigs.
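As an illustration of the canonical-chromosome rule, the following sketch accepts exactly chr1 through chr22, chrX, and chrY. The is_canonical function is a hypothetical helper, and requiring the "chr" prefix is an assumption of this example; how genemancer treats unprefixed contig names such as "1" is not stated here.

```rust
/// Return true for chr1..chr22, chrX, and chrY (the "chr" prefix is required here).
pub fn is_canonical(contig: &str) -> bool {
    let Some(name) = contig.strip_prefix("chr") else {
        return false;
    };
    match name {
        "X" | "Y" => true,
        _ => {
            // Reject forms like "01" or "+5" that integer parsing would otherwise accept.
            name.bytes().all(|b| b.is_ascii_digit())
                && !name.starts_with('0')
                && name.parse::<u8>().map_or(false, |n| (1..=22).contains(&n))
        }
    }
}
```

Under this rule, contigs such as chrM, alternate haplotypes, and unplaced scaffolds are excluded.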
Build/install with CUDA support and force CUDA backend:
cargo install --path . --features cuda --force
genemancer call-targets-gpu \
-i /path/to/bams_or_directory \
-r /path/to/reference.fa.gz \
-T /path/to/targets.bed \
--rg-map references/rg_map.txt \
--gpu-backend cuda \
--cuda-device 0 \
--require-gpu \
-o test_data/out.vcf.gz
Tune GPU behavior explicitly (optional overrides on top of auto/hybrid tuning):
genemancer call-targets-gpu \
-i /path/to/bams_or_directory \
-r /path/to/reference.fa.gz \
-T /path/to/targets.bed \
--rg-map references/rg_map.txt \
--gpu-backend auto \
--tuning-mode hybrid \
--tuning-profile throughput \
--tuning-scale-percent 120 \
--wgpu-matrix-utilization-percent 96 \
--wgpu-upload-utilization-percent 98 \
--max-obs-upload 64000000 \
--stream-matrix-budget-mib 1024 \
--defer-cuda-aggregation \
-o test_data/out.vcf.gz
Split multiple BAMs by BED regions into an output folder:
genemancer split-bam \
-i /path/to/input1.bam \
-i /path/to/input2.bam \
--bed /path/to/targets.bed \
--out-dir test_data/splits \
--output-prefix panel \
--write-indices \
--unassigned test_data/splits/unassigned.bam
Run POD5 operations:
genemancer pod5 inspect -i /path/to/reads.pod5
genemancer pod5 validate -i /path/to/reads.pod5
genemancer pod5 subsample \
--input /path/to/run_a.pod5 \
/path/to/run_b.pod5 \
--percent 10 \
--output /path/to/subsampled_outputs
pod5 subsample accepts both repeated and multi-value --input, so shell glob expansion works:
--input *.pod5.
If the POD5 shared library is not auto-detected in your environment, set:
export GENEMANCER_POD5_LIB=/path/to/lib_pod5/pod5_format_pybind*.so
Repository Data
- references: helper scripts and a tracked sample RG map (references/rg_map.txt).
- tests/data: tracked .bai files only.
- Local working datasets are expected under test_data/ (ignored by git).
Ignored Paths
Notes
call-targets may prepare a sorted/indexed BGZF FASTA companion (*.sorted.fa.gz plus indexes) when the provided reference is not already in an indexed form suitable for random access.
cnloh Current State
- Variant evidence input precedence is --vcf > --snp > BAM marker-site pileup (with a warning when both --vcf and --snp are provided).
- Multiple --bam inputs are aggregated as one sample in v1.
- BAM scanning is multithreaded and deterministic in merged outputs.
- Read filtering defaults skip duplicate, secondary, and supplementary alignments (opt-in include flags are available).
- Plot generation emits one combined chromosome-colored CNV/cnLOH figure (*.CNV.png) with coverage/CN/cnLOH panels.
- Marker filtering is strict in v1; when marker filtering yields zero variants, cnloh detect exits with an error.
- Canonical-chromosome filtering is enabled by default; use --include-noncanonical to disable it.
- --variant-mode broad and --sample-mode rg are scaffolded for later versions but not implemented in v1.
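The evidence precedence rule (--vcf over --snp over BAM marker-site pileup, with a warning when both files are given) can be expressed as a single match. This is a sketch under assumptions: EvidenceSource, select_evidence, and the warning text are hypothetical and not taken from the crate.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum EvidenceSource<'a> {
    Vcf(&'a str),
    Snp(&'a str),
    BamPileup,
}

/// Pick the variant evidence source: --vcf wins over --snp, which wins over BAM pileup.
/// The second tuple field is an optional warning for the caller to log.
pub fn select_evidence<'a>(
    vcf: Option<&'a str>,
    snp: Option<&'a str>,
) -> (EvidenceSource<'a>, Option<&'static str>) {
    match (vcf, snp) {
        (Some(v), Some(_)) => (
            EvidenceSource::Vcf(v),
            Some("both --vcf and --snp were provided; using --vcf"),
        ),
        (Some(v), None) => (EvidenceSource::Vcf(v), None),
        (None, Some(s)) => (EvidenceSource::Snp(s), None),
        (None, None) => (EvidenceSource::BamPileup, None),
    }
}
```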
TODO
| Area | Task | Status | Notes |
|---|---|---|---|
| split-bam | Add end-to-end fixture coverage for overlap edge cases | TODO | Validate multi-overlap and boundary behavior |
| call-targets | Add end-to-end integration tests on small fixture set | TODO | Validate VCF content + index generation |
| call-targets | Reject invalid BED intervals (end <= start) with a hard error | TODO | Current loader silently skips these rows |
| call-targets-gpu | Expand GPU backend validation matrix | TODO | Cover Vulkan/Metal/DX12 fallback behavior |
| call-targets-gpu | Honor --threads for scan worker fanout | TODO | Current streaming path uses one worker per input BAM |
| merge-bam | Add CRAM input/output support | TODO | Current implementation is BAM-focused |
| merge-bam | Align --index-path docs with implementation or add CSI writing | TODO | CLI/docs say BAI-or-CSI but writer is BAI-only |
| merge-bam | Reject zero-length BED intervals (end == start) | TODO | Current validation only rejects end < start |
| gtf-to-introns | Align Rust intron extraction semantics with Bioconductor intronicParts() | TODO | Current implementation emits transcript exon-gap introns; needs fixture-level parity checks for overlapping isoforms/shared exons |
| cnloh | Chunk 1: Freeze SubChrom parity spec + fixture baselines | DONE | See docs/cnloh_parity_spec.md, docs/cnloh_baseline_fixtures.md, and tests/cnloh_parity_baseline.rs |
| cnloh | Chunk 2: Add SubChrom-like marker/VAF preprocessing (minCOV, minMAC) | DONE | Emits vaf_preprocessed.tsv + marker_chrom_stats.tsv and summary keys |
| cnloh | Chunk 3: Implement VAF/MAF + ROH segmentation parity passes | DONE | Emits maf.tsv, vaf_segments.tsv, roh_segments.tsv, and vaf_roh_segments.tsv |
| cnloh | Chunk 4: Align coverage segmentation merge rules with VAF/ROH segments | DONE | Emits coverage_vaf_segments.tsv and uses unified coverage+VAF marker gating in event filtering |
| cnloh | Chunk 5: Align event classification + TF estimation with SubChrom intent | TODO | Include allele-specific event summary fields and tolerance-based tests |
| cnloh | Chunk 6: Finalize CNV visualization parity and end-to-end validation | TODO | Snapshot plot metadata and add parity regression tests |
| cnloh | Refactor monolithic cnloh implementation into smaller modules | TODO | Split src/cnloh.rs into focused units (I/O, preprocessing, segmentation, events, plotting) with clear interfaces |
| cnloh | Improve inline comments and developer-facing documentation | TODO | Add targeted code comments, module docs, and output-file docs for maintainability |
| cnloh | Implement --variant-mode broad | TODO | CLI mode is scaffolded but currently hard-fails |
| cnloh | Implement RG-aware --sample-mode rg workflow | TODO | v1 aggregates all BAMs as one sample |
| cnloh | Add fixture-level integration tests for strict marker-overlap behavior | TODO | Validate hard-fail path when marker overlap is zero |
| cnloh | Add event-level segmentation/calling outputs (cnLOH/CN event table) | TODO | Current output is coverage/variant summaries + plot |
| cnloh | Expand marker database fixtures beyond minimal toy set | TODO | Improves realistic plot coverage in tracked tests |
| Docs | Add example outputs and expected file artifacts per command | TODO | Make quick verification easier for users |
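Two of the TODO items concern rejecting BED intervals with end <= start instead of skipping or accepting them. A minimal sketch of that stricter check follows; parse_bed_line and its error strings are hypothetical and only illustrate the intended hard-error behavior, not the crate's current loader.

```rust
/// Parse one tab-separated BED line into (chrom, start, end).
/// Intervals with end <= start are rejected with an error rather than skipped.
pub fn parse_bed_line(line: &str) -> Result<(String, u64, u64), String> {
    let mut fields = line.split('\t');
    let chrom = fields
        .next()
        .filter(|c| !c.is_empty())
        .ok_or("missing chrom")?;
    let start = fields
        .next()
        .ok_or("missing start")?
        .parse::<u64>()
        .map_err(|e| format!("bad start: {e}"))?;
    let end = fields
        .next()
        .ok_or("missing end")?
        .trim_end() // tolerate a trailing newline on three-column lines
        .parse::<u64>()
        .map_err(|e| format!("bad end: {e}"))?;
    if end <= start {
        return Err(format!("invalid interval {chrom}:{start}-{end} (end <= start)"));
    }
    Ok((chrom.to_string(), start, end))
}
```

Returning an error per line lets the caller decide whether to abort immediately or collect all invalid rows before reporting.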
Dependencies
~15–24MB
~432K SLoC