3 releases

0.1.5	Mar 29, 2023
0.1.4	Mar 24, 2023
0.1.3	Mar 21, 2023
0.1.2	~~Mar 13, 2023~~
0.1.1	~~Feb 28, 2023~~

#347 in Science

AGPL-3.0-or-later

2MB
1K SLoC

`bamcalib` utility

bamcalib is a small rust program to normalize calibrated chip-seq data using a normalization procedure build upon the original method of Hu et al..

Installation

You need cargo installed

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt install cargo
sudo pacman -S rust
brew install rust

Then run the following command:

cargo install bamcalib --version 0.1.5

You can also use the lbmc/bamcalib:0.1.5 Docker image

docker run -it lbmc/bamcalib:0.1.5 bamcalib

Usage

Example

bamcalib \
  -i IP_11_sorted.bam \
  -w T_11_sorted.bam \
  -bigwig-ip IP_11.bigwig \
  -bigwig-WCE T_11.bigwig \
  -t 8

The bamcalib expect the following parameters:

-i, --ip-bam <bam_ip> sorted bam file for the IP data
-w, --wce-bam <bam_wce> sorted bam file for the WCE data
-bigwig-ip, --bigwig-ip <bigwig> normalized bigwig file name for the output
-bigwig-wce, --bigwig-wce <bigwig> normalized bigwig file name for the output

Optional Arguments

-c, --cal-prefix <cal_prefix> calibration prefix added to the chromosome name of the calibration genome (default: calib_)
-h, --help Print help information
-s, --scaling <SCALING> multiply the OR by this value (must be the same across sample) [default: 1000]
-n, --no_bits <no_bits> remove reads which bit code match this code. You can use this site to get an appropriate bit code. The default is 1540 for
- segment unmapped
- not passing quality controls
- PCR or optical duplicate
-t, --thread <thread> number of threads to parse bam files and write the bigwig file
--no-fragment-estimation compute coverage from read information instead of fragment estimation
--intermediate-bigwig output normIP and normWCE bigwig files (not just the ratio of the two)
-V, --version Print version information

Method

Normalization

Hu et al. proposed the following normalization procedure to get the normalized coverage in the IP sample ($\text{norm}IP\left(t\right)$) from:

$IP_{x}\left(t\right)$ the coverage at position $t$ in the IP sample on the reference genome
$IP_{c}\left(t\right)$ the coverage at position $t$ in the IP sample on the calibration genome
$WCE_{x}\left(t\right)$ the coverage at position $t$ in the WCE sample on the reference genome
$WCE_{c}\left(t\right)$ the coverage at position $t$ in the WCE sample on the calibration genome

In a reference genome of size $T_x$ (ignoring the chromosome segmentation), and in a calibration genome of size $T_c$, they propose the compute the following $\text{OR}$ factor :

\text{OR} = \frac{10^6 \sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{\sum_{t^{'}=1}^{T_c}IP_{c}\left(t^{'}\right)\sum_{t^{'}=1}^{T_x}WCE_{x}\left(t^{'}\right)}

Instead, we propose the following formula with $\beta$ (default to $10^3$) an arbitrary scaling factor.

\text{norm}IP\left(t\right) = IP_{x}\left(t\right) \frac{\beta \frac{1}{T_c}\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}IP_{c}\left(t^{'}\right)\frac{1}{T_x}\sum_{t^{'}=1}^{T_x}WCE_{x}\left(t^{'}\right)}

We want to remove the technical effect of the Chip-Seq experiment on the $IP_{x}\left(t\right)$ by scaling by the calibration genome coverage:

\frac{1}{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}IP_{c}\left(t^{'}\right)}

We want to remove the differences in cell proportion effect of the Chip-Seq experiment on the $IP_{x}\left(t\right)$ by scaling by the scaled WCE coverage:

\frac{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{\frac{1}{T_x}\sum_{t^{'}=1}^{T_x}WCE_{x}\left(t^{'}\right)}

To be able to better analyze the coverage information at repetitive regions of the genome, we propose to extend the previous normalization to normalize the signal nucleotide by nucleotide and work with the following OR ratio:

\text{ratio}IP\left(t\right) = \frac{\text{norm}IP\left(t\right)}{\text{norm}WCE\left(t\right)}

with:

\text{norm}WCE\left(t\right) = WCE_{x}\left(t\right) \frac{1}{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}  \alpha

We then find $\alpha$ such that:

E\left(\text{norm}IP\left(t\right)\right)= E\left(\frac{\text{norm}IP\left(t\right)}{\text{norm}WCE\left(t\right)}\right)

which gives

\text{norm}WCE\left(t\right) = WCE_{x}\left(t\right) \frac{1}{\sum_{t^{'}=1}^{T_x}IP_{x}\left(t^{'}\right)} \sum_{t^{'}=1}^{T_x}\frac{IP_{x}\left(t^{'}\right)}{WCE_{x}\left(t^{'}\right)}

Not that all the computation is scaled by the genome size and not at the reads number as in Hu et al., this is also why we add a scaling factor $\beta$ (default to $10^3$).

With this method, we retain the interesting properties of Hu et al. normalization on the average read density between samples (i.e., we can compare two different samples in a quantitative way) and we account for the local bias of read density observed in the WCE samples (differential chromatin accessibility, repetition, low mappability region, etc.).

Genome coverage density

To compute the coverage density $X_y(t)$ with $X \in \left[IP, WCE\right]$ and $y \in \left[c, x\right]$ we count the number of read $r(t)$ overlapping with position $t$. For properly paired reads (with a mate read on the same chromosome and with a starting position ending after the end of the read) we also count a density of 1 between the end of the first reads and the start of his mate read $g(t)$. $X_y(t) = r(t) + g(t)$.

Some fragment can be artificially long, therefore, we compute a robust mean $\mu$ of the gap size, between two reads of a pair, by removing the 0.1 upper and lower value of fragment length. Fragment that has a size higher than $\phi^{-1}(0.95, \mu, 1.0)$ are set to end at the $\phi^{-1}(0.95, \mu, 1.0)$ value, with $\phi()$ the Normal CDF function.

Some fragment can be shorter than the read length in this case we don't count the overlapping reads region as a coverage of 2 fragments but as the coverage of 1 fragment.

Dependencies

~22–34MB
~416K SLoC