3 releases
0.1.5 | Mar 29, 2023 |
---|---|
0.1.4 | Mar 24, 2023 |
0.1.3 | Mar 21, 2023 |
0.1.2 |
|
0.1.1 |
|
#361 in Science
2MB
1K
SLoC
bamcalib
utility
bamcalib
is a small rust program to normalize calibrated chip-seq data using a normalization procedure build upon the original method of Hu et al..
Installation
You need cargo
installed
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt install cargo
sudo pacman -S rust
brew install rust
Then run the following command:
cargo install bamcalib --version 0.1.5
You can also use the lbmc/bamcalib:0.1.5
Docker image
docker run -it lbmc/bamcalib:0.1.5 bamcalib
Usage
Example
bamcalib \
-i IP_11_sorted.bam \
-w T_11_sorted.bam \
-bigwig-ip IP_11.bigwig \
-bigwig-WCE T_11.bigwig \
-t 8
The bamcalib
expect the following parameters:
-i
,--ip-bam <bam_ip>
sorted bam file for the IP data-w
,--wce-bam <bam_wce>
sorted bam file for the WCE data-bigwig-ip
,--bigwig-ip <bigwig>
normalized bigwig file name for the output-bigwig-wce
,--bigwig-wce <bigwig>
normalized bigwig file name for the output
Optional Arguments
-c
,--cal-prefix <cal_prefix>
calibration prefix added to the chromosome name of the calibration genome (default:calib_
)-h
,--help
Print help information-s
,--scaling <SCALING>
multiply the OR by this value (must be the same across sample) [default: 1000]-n
,--no_bits <no_bits>
remove reads which bit code match this code. You can use this site to get an appropriate bit code. The default is1540
for- segment unmapped
- not passing quality controls
- PCR or optical duplicate
-t
,--thread <thread>
number of threads to parse bam files and write the bigwig file--no-fragment-estimation
compute coverage from read information instead of fragment estimation--intermediate-bigwig
output normIP and normWCE bigwig files (not just the ratio of the two)-V
,--version
Print version information
Method
Normalization
Hu et al. proposed the following normalization procedure to get the normalized coverage in the IP sample ($\text{norm}IP\left(t\right)$) from:
- $IP_{x}\left(t\right)$ the coverage at position $t$ in the IP sample on the reference genome
- $IP_{c}\left(t\right)$ the coverage at position $t$ in the IP sample on the calibration genome
- $WCE_{x}\left(t\right)$ the coverage at position $t$ in the WCE sample on the reference genome
- $WCE_{c}\left(t\right)$ the coverage at position $t$ in the WCE sample on the calibration genome
In a reference genome of size $T_x
$ (ignoring the chromosome segmentation), and in a calibration genome of size $T_c
$, they propose the compute the following $\text{OR}$ factor :
\text{OR} = \frac{10^6 \sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{\sum_{t^{'}=1}^{T_c}IP_{c}\left(t^{'}\right)\sum_{t^{'}=1}^{T_x}WCE_{x}\left(t^{'}\right)}
Instead, we propose the following formula with $\beta$ (default to $10^3$) an arbitrary scaling factor.
\text{norm}IP\left(t\right) = IP_{x}\left(t\right) \frac{\beta \frac{1}{T_c}\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}IP_{c}\left(t^{'}\right)\frac{1}{T_x}\sum_{t^{'}=1}^{T_x}WCE_{x}\left(t^{'}\right)}
We want to remove the technical effect of the Chip-Seq experiment on the $IP_{x}\left(t\right)$ by scaling by the calibration genome coverage:
\frac{1}{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}IP_{c}\left(t^{'}\right)}
We want to remove the differences in cell proportion effect of the Chip-Seq experiment on the $IP_{x}\left(t\right)$ by scaling by the scaled WCE coverage:
\frac{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{\frac{1}{T_x}\sum_{t^{'}=1}^{T_x}WCE_{x}\left(t^{'}\right)}
To be able to better analyze the coverage information at repetitive regions of the genome, we propose to extend the previous normalization to normalize the signal nucleotide by nucleotide and work with the following OR ratio:
\text{ratio}IP\left(t\right) = \frac{\text{norm}IP\left(t\right)}{\text{norm}WCE\left(t\right)}
with:
\text{norm}WCE\left(t\right) = WCE_{x}\left(t\right) \frac{1}{\frac{1}{T_c}\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)} \alpha
We then find $\alpha
$ such that:
E\left(\text{norm}IP\left(t\right)\right)= E\left(\frac{\text{norm}IP\left(t\right)}{\text{norm}WCE\left(t\right)}\right)
which gives
\text{norm}WCE\left(t\right) = WCE_{x}\left(t\right) \frac{1}{\sum_{t^{'}=1}^{T_x}IP_{x}\left(t^{'}\right)} \sum_{t^{'}=1}^{T_x}\frac{IP_{x}\left(t^{'}\right)}{WCE_{x}\left(t^{'}\right)}
Not that all the computation is scaled by the genome size and not at the reads number as in Hu et al., this is also why we add a scaling factor $\beta$ (default to $10^3$).
With this method, we retain the interesting properties of Hu et al. normalization on the average read density between samples (i.e., we can compare two different samples in a quantitative way) and we account for the local bias of read density observed in the WCE samples (differential chromatin accessibility, repetition, low mappability region, etc.).
Genome coverage density
To compute the coverage density $X_y(t)$ with $X \in \left[IP, WCE\right]$ and $y \in \left[c, x\right]$ we count the number of read $r(t)$ overlapping with position $t$. For properly paired reads (with a mate read on the same chromosome and with a starting position ending after the end of the read) we also count a density of 1 between the end of the first reads and the start of his mate read $g(t)$. $X_y(t) = r(t) + g(t)$.
Some fragment can be artificially long, therefore, we compute a robust mean $\mu$ of the gap size, between two reads of a pair, by removing the 0.1 upper and lower value of fragment length. Fragment that has a size higher than $\phi^{-1}(0.95, \mu, 1.0)$ are set to end at the $\phi^{-1}(0.95, \mu, 1.0)$ value, with $\phi()$ the Normal CDF function.
Some fragment can be shorter than the read length in this case we don't count the overlapping reads region as a coverage of 2 fragments but as the coverage of 1 fragment.
Dependencies
~22–32MB
~389K SLoC