nightly bin+lib alpaca

ALPACA is a caller for genomic variants (single nucleotide variants and small indels) from next-generation sequencing data. It uses a novel algebraic approach to incorporate sample-based filtering into the calling, which makes it possible to control the false discovery rate (FDR) intuitively for arbitrary filtering scenarios.

1 unstable release

Uses old Rust 2015

0.1.0 Jun 11, 2015

MIT license

ALPACA - the ALgebraic PArallel variant CAller

ALPACA is a caller for genomic variants (single nucleotide variants and small indels) from next-generation sequencing data. It has two major distinguishing features compared to other variant callers:

  • ALPACA incorporates arbitrary filtering of samples against each other into the calling. This is done via an expressive, algebraic query language. For each locus, ALPACA calculates the posterior probability that the locus does not behave as described in the filter query. If that probability is small enough, the locus is called.
  • Since the filtering is part of the null hypothesis, controlling the FDR becomes easy and intuitive.
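
To illustrate why posterior probabilities make FDR control straightforward, here is a minimal Python sketch of a standard Bayesian approach. It is not taken from ALPACA's source. It assumes that each candidate locus carries the posterior probability of the null hypothesis (the locus does not behave as the query describes), so that the expected FDR of a set of calls is the mean of those probabilities. The function name `call_with_fdr` and the input values are hypothetical.

```python
def call_with_fdr(null_probs, fdr):
    """Return indices of the largest set of loci whose expected FDR is <= fdr.

    null_probs[i] is the (assumed) posterior probability that locus i does
    NOT satisfy the query, i.e. that calling it would be a false discovery.
    """
    # Consider the most confident loci first.
    order = sorted(range(len(null_probs)), key=lambda i: null_probs[i])
    called = []
    total = 0.0
    for i in order:
        # Expected FDR of a call set = mean null probability of its members.
        if (total + null_probs[i]) / (len(called) + 1) > fdr:
            break
        total += null_probs[i]
        called.append(i)
    return called


# Hypothetical posterior probabilities for five loci.
probs = [0.001, 0.2, 0.01, 0.6, 0.04]
print(call_with_fdr(probs, 0.05))  # [0, 2, 4]
```

Because the loci are visited in order of increasing null probability, the running mean never decreases, so stopping at the first violation yields the largest admissible call set.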

ALPACA separates calling into three steps:

  • preprocessing of each sample into a BCF file,
  • merging preprocessed samples into one BCF file containing only relevant loci,
  • calling on the merged BCF file.

This separation makes it possible to add samples later without redoing all the computations. Since most of the work is done during preprocessing, the final calling step is lightweight and can be repeated with different parameters within seconds. The algebraic query language makes it possible to model calling scenarios flexibly, e.g.,

  • calling all de-novo mutations of a child: 'child - (mother + father)'
  • calling all variants recurrent in at least 3 samples of a group of samples s1, s2, ..., s5: 's1 x s2 x s3 x s4 x s5 with k = 3'
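
As a rough intuition for the two queries above, the following Python sketch treats each sample as a plain set of variants. This is only a deterministic analogy: ALPACA evaluates queries probabilistically on sequencing data, whereas this sketch assumes the variants of each sample are already known. The variant tuples and the helper `recurrent` are hypothetical.

```python
from collections import Counter

# Hypothetical variants, keyed as (chromosome, position, alternative allele).
child = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 42, "G")}
mother = {("chr1", 100, "A")}
father = {("chr2", 42, "G")}

# 'child - (mother + father)': present in the child, absent from both parents.
de_novo = child - (mother | father)
print(de_novo)  # {('chr1', 250, 'T')}


def recurrent(samples, k):
    """Return the variants present in at least k of the given sample sets."""
    counts = Counter(v for s in samples for v in s)
    return {v for v, n in counts.items() if n >= k}


# 's1 x s2 x s3 x s4 x s5 with k = 3': variants seen in at least 3 of 5 samples.
v, w = ("chr1", 7, "C"), ("chr1", 9, "G")
s1, s2, s3, s4, s5 = {v}, {v}, {v}, {w}, {w}
print(recurrent([s1, s2, s3, s4, s5], 3))  # {('chr1', 7, 'C')}
```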

A complete description of algebraic variant calling can be found in my thesis:

Köster, J. Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis. PhD-Thesis, TU Dortmund, Germany 2014. ISBN: 978-3737537773.

If you use ALPACA, please cite the thesis for now.

Example usage

All in one command:

$ alpaca preprocess --threads 8 A.bam B.bam C.bam | alpaca filter | alpaca call --fdr 0.05 'A - (B + C)' > calls.bcf

Separate preprocessing and merging (this makes it possible to add samples or change queries without redundant computations; alpaca call usually needs only a few seconds):

$ alpaca preprocess --threads 8 A.bam > A.bcf
$ alpaca preprocess --threads 8 B.bam > B.bcf
$ alpaca preprocess --threads 8 C.bam > C.bcf
$ alpaca merge --threads 8 A.bcf B.bcf C.bcf > all.bcf
$ alpaca call --threads 8 --fdr 0.05 'A - (B + C)' < all.bcf > calls.bcf
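
The incremental workflow can be sketched as a small Python helper that lists the commands still needed when a sample is added later. It only prints commands and does not run them, and it uses only the subcommands and flags shown above. The helper `plan`, the extra sample `D`, the query, and the convention of skipping samples whose BCF file already exists are assumptions of this sketch, not part of ALPACA.

```python
import os
import shlex


def plan(samples, query, fdr=0.05, threads=8, exists=os.path.exists):
    """Return the shell commands still needed to call `query` on `samples`.

    Samples whose preprocessed BCF already exists are not preprocessed again
    (a convention of this sketch); merging and calling are always repeated.
    """
    cmds = []
    for s in samples:
        if not exists(f"{s}.bcf"):
            cmds.append(f"alpaca preprocess --threads {threads} {s}.bam > {s}.bcf")
    bcfs = " ".join(f"{s}.bcf" for s in samples)
    cmds.append(f"alpaca merge --threads {threads} {bcfs} > all.bcf")
    cmds.append(
        f"alpaca call --threads {threads} --fdr {fdr} "
        f"{shlex.quote(query)} < all.bcf > calls.bcf"
    )
    return cmds


# Hypothetical: sample D is added after A, B and C were processed earlier.
for cmd in plan(["A", "B", "C", "D"], "D - (B + C)"):
    print(cmd)
```

If `A.bcf`, `B.bcf` and `C.bcf` are already present in the working directory, only `D` is preprocessed before the merge and call steps are repeated.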