#interval #genomic #bioinformatics #range #tree

scailist

A fast and easy interval overlap library

5 releases

0.2.0 Sep 22, 2019
0.1.4 Sep 20, 2019
0.1.3 Sep 20, 2019
0.1.1 Sep 20, 2019
0.1.0 Sep 20, 2019

#6 in #ranges

MIT license

30KB
476 lines

docs crates.io

ScAIList

This is rust implementation of the AIList algorithm as described here. The biggest difference is that this implementation dynamicaly determines the max number of components instead of capping at 10. One might call it a Scaled Augmented Interval List. It takes the log2 of the input element lengths to be the max number of components and then decomposes into that.

It seems to be very fast. As fast as rust-lapper in all easy cases with spread our intervals, and faster when things get nested. Benchmarks will be added as the interval_bakeoff project moves along.

Documentation

Crates.io


lib.rs:

This module provides an implementation of an AIList, but with a dynamic scaling for the number of sublists.

Features

  • Consistantly fast. The way the input intervals are decomposed diminishes the effects of super containment.
  • Parallel friendly. Queries are on an immutable structure, even for seek
  • Consumer / Adapter paradigm, an iterator is returned.

Details:

Please see the paper.

Most interaction with this crate will be through the ScAIList struct The main methods is [find`](struct.ScAIList.html#method.find).

The overlap function for this assumes a zero based genomic coordinate system. So [start, stop) is not inclusive of the stop position for neither the queries, nor the Intervals.

ScAIList is composed of four primary parts. A main interval list, which holds all the intervals after they have been decomposed. A component index's list, which holds the start index of each sublist post-decomposition, A component lengths list, which holds the length of each component, and finally a max_ends list, which holds the max end releative to a sublist up to a given point for each interval.

The decomposition step is achieved by walking the list of intervals and recursively (with a cap) extracting intervals that overlap a given number of other intervals within a certain distance from it. The unique development in this implementation is to make the cap dynamic.

Examples

   use scailist::{Interval, ScAIList};
   use std::cmp;
   type Iv = Interval<u32>;

   // create some fake data
   let data: Vec<Iv> = (0..20).step_by(5).map(|x| Iv{start: x, end: x + 2, val: 0}).collect();
   println!("{:#?}", data);

   // make lapper structure
   let laps = ScAIList::new(data, None);
   assert_eq!(laps.find(6, 11).next(), Some(&Iv{start: 5, end: 7, val: 0}));
   
   let mut sim: u32= 0;
   // Calculate the overlap between the query and the found intervals, sum total overlap
   for i in (0..10).step_by(3) {
       sim += laps
           .find(i, i + 2)
           .map(|iv| cmp::min(i + 2, iv.end) - cmp::max(i, iv.start))
           .sum::<u32>();
   }
   assert_eq!(sim, 4);

No runtime deps