#variant #level #dna #estimating #calls #contamination #diploid

bin+lib diploid-contam-estimator

Estimating contamination level in diploid DNA seuqencing libraries

7 releases

0.3.4 Oct 29, 2022
0.3.3 Oct 28, 2022
0.2.0 Aug 18, 2022
0.1.0 Aug 18, 2022

#186 in Biology

MIT license

2MB
880 lines

Diploid-contam

poetry CI crates.io

A module to estimate contamination level from diploid variant calls. This is heavily inspired by Dcon.

Background

We are hypothesizing the contamination level would be between $0-0.4$, and at each hypothesized contamination level, we'll calculate the likelihood observing the observed variant-allele-frequency assuming the given contamination level is true for each variant in the VCF file using a binomial model. We then sum the log likelihood for all variants and pick the maximum likelihood contamination level as the final call.

Pseudo code:

n = total_count
x = alt_allele_count
contam = 0
max_log_likelihood = -inf
for contam_level in all_contamination_level:
    p = expected_alt_fraction_for_the_given_contamination_level
    log_likelihood = sum(binom_loglik(n, x, p) for all_variants)
    if log_likelihood > max_log_likelihood:
        contam = contam_level

A simulated study at here.

Homozygous variants

For a homozygous variant, the probability of observing the expected variant-allele-count ($x$) with a read depth $n$ at a given contamination level $c \in [0,0.4]$ is :

$$ P(X=x,c) = \binom{n}{x}p^x(1 - p)^{n-x} $$

where $p = (1-c)$ in all homozygous variants

Heterozygous variants

For a heterozygous variant, the probablity of observing the expected variant-allele-count ($x$) with a read depth $n$ at a given contamination level $c \in [0,0.4]$ will follow the above binomial distribution but $p$ can either be:

  1. $(1 - c)/2$, when a low alternate allele frequency is observed because of the contamination
  2. $(1 - c)$, when a homozygous variant being called as a heterozygous variant because of the contamination
  3. $(0.5 + c)$, when the contamination looks like the alternate allele, such that the alternate allele frequency is higher than expected
  4. $(0.5 - c)$, when the contamination looks like the reference allele, such that the alternate allele frequency is lower than expected
  5. $c$, when the contamination itself is called as low variant frequency heterozygous variant

After evaluating these cases, we will pick the highest probability event when summing the log likelihoods for the given contamination level.

Rust

We also wrote the code in rust.

To run the code:

$ cargo run -- -i data/test.vcf -d debug_json

or:

$ cargo install --path .
$ target/release/diploid-contam-estimator
Douglas Wu <wckdouglas@gmail.com>
Estimating contamination level from a diploid VCF file


    The program assume we are dealing with a diploid genome, and using the
    deviation of allelic balance from the expected allelic frequence for homozygous
    or heterozygous variant calls to compute a contamination value.

    For homozygous variants, we deviation from allelic frequency of 1 is all introduced by
contaminaion.

    For heterozygous variants, it is a little more complex, because it could be due to:
        1. contamination that doesn't look like the HET ALT allele: we expect lower HET alt allele
frequency
        2. contamination that doesn't look like the HOM ALT allele: we expect High HET alt allele
frequency
        3. contamination that looks like the ALT allele: we expect higher alt allele frequency
        4. contamination that looks like the REF allele: we expect lower alt allele frequency
        5. contamination being called as ALT

USAGE:
    diploid-contam-estimator [OPTIONS] --in-vcf <in_vcf>

OPTIONS:
    -d, --debug-json <debug_json>
            A json output file for storing all intermediate log prob

    -h, --help
            Print help information

    -i, --in-vcf <in_vcf>
            A diploid vcf file for estimating contamination

    -m, --min-depth <depth_threshold>
            Minimum depth for a variant to be considered (i.e. DP tag) [default: 0]

    -o, --out-json <out_json>
            A json output file for storing the maximum likelihood contam level for the vcf file

        --snv-only
            Only use SNV (ignore indel) for contamination estimations

    -v, --debug-variant-json <debug_variant_json>
            A json output file for storing all input variants used for calculation

    -V, --version
            Print version information

Dependencies

~10–20MB
~271K SLoC