2 unstable releases

0.2.2 Mar 14, 2024
0.1.0 Feb 26, 2024

#73 in Biology

45 downloads per month

MIT license

395KB
7K SLoC

CI tests

The GRanges Rust library and command line tool

GRanges is a Rust library for working with genomic ranges and their associated data. It aims to make it easy to write extremely performant genomics tools that work with genomic range data (e.g. BED, GTF/GFF, VCF, etc). Internally, GRanges uses the very fast coitrees interval tree library written by Daniel C. Jones for overlap operations. In preliminary benchmarks, GRanges tools can be 10%-30% faster than similar functionality in bedtools2 (see benchmark and caveats below).

GRanges is inspired by "tidy" data analytics workflows, as well as Bioconductor's GenomicRanges and plyranges. GRanges uses a similar method-chaining pipeline approach to manipulate genomic ranges, find overlapping genomic regions, and compute statistics. For example, you could implement your own bedtools map-like functionality in relatively few lines of code:

// Create the "right" GRanges object.
let right_gr = bed5_gr
    // Convert to interval trees.
    .into_coitrees()?
    // Extract out just the score from the additional BED5 columns.
    .map_data(|bed5_cols| {
        bed5_cols.score
    })?;

// Compute overlaps and combine scores into mean.
let results_gr = left_gr
    // Find overlaps
    .left_overlaps(&right_gr)?
    // Summarize overlap data
    .map_joins(mean_score)?;

where mean_score() is:

pub fn mean_score(join_data: CombinedJoinDataLeftEmpty<Option<f64>>) -> f64 {
    // Get the "right data" out of the join -- the BED5 scores.
    let overlap_scores: Vec<f64> = join_data.right_data.into_iter()
        // filter out missing values ('.' in BED)
        .filter_map(|x| x).collect();

    // Calculate the mean score.
    let score_sum: f64 = overlap_scores.iter().sum();
    score_sum / (overlap_scores.len() as f64)
}

Note that GRanges is a compile-time generic Rust library, so the code above will be heavily optimized by the compiler. Rust uses zero-cost abstractions, meaning high-level code like this is compiled and optimized so that it would be just as performant as if it were written in a low-level language.

GRanges is generic in the sense that it works with any data container type that stores data associated with genomic data: a Vec<U> of some type, an ndarray Array2, polars dataframe, etc. GRanges allows the user to write do common genomics data processing tasks in a few lines of Rust, and then lets the Rust compiler optimize it.

As a proof-of-concept, GRanges also provides the command line tool granges built on this library's functionality. This command line tool is intended for benchmarks against comparable command line tools and for large-scale integration tests against other software to ensure that GRanges is bug-free. The granges tool currently provides a subset of the features of other great bioinformatics utilities like bedtools.

Preliminary Benchmarks

In an attempt to combat "benchmark hype", this section details the results of some preliminary benchmarks in an honest and transparent way. On our lab server, here are two runs with 100,000 ranges per operation, and n = 100 samples:

# run 1
command       bedtools time    granges time      granges speedup (%)
------------  ---------------  --------------  ---------------------
map_multiple  270.21 s         112.66 s                      58.3073
map_max       105.46 s         84.03 s                       20.3185
adjust        112.42 s         53.48 s                       52.4269
filter        114.23 s         77.96 s                       31.7512
map_min       116.22 s         78.93 s                       32.0839
flank         162.77 s         80.67 s                       50.4383
map_mean      107.05 s         81.89 s                       23.5073
map_sum       108.43 s         93.35 s                       13.9083
windows       408.03 s         72.58 s                       82.2121
map_median    108.57 s         87.32 s                       19.5731
 
# run 2
command       bedtools time    granges time      granges speedup (%)
------------  ---------------  --------------  ---------------------
map_multiple  293.24 s         103.66 s                      64.6495
map_max       117.84 s         82.39 s                       30.0855
adjust        110.09 s         51.63 s                       53.0999
filter        120.36 s         67.79 s                       43.6784
map_min       114.76 s         86.06 s                       25.0081
flank         160.20 s         75.69 s                       52.756
map_mean      116.97 s         85.12 s                       27.2331
map_sum       114.39 s         85.96 s                       24.8557
windows       418.87 s         65.13 s                       84.4515
map_median    112.35 s         82.81 s                       26.2995

Dependencies

~10–25MB
~321K SLoC