#k-mer #graph #dna #sequencing

debruijn

Tools for DNA sequences: efficient k-mer manipulation, De Bruijn graph construction and compaction and handling of DNA strings

3 releases

0.3.4 Sep 9, 2021
0.3.3 Aug 27, 2021
0.3.2 Jun 25, 2020

#470 in Algorithms

Download history 28/week @ 2023-12-04 33/week @ 2023-12-11 38/week @ 2023-12-18 21/week @ 2023-12-25 30/week @ 2024-01-01 39/week @ 2024-01-08 24/week @ 2024-01-15 18/week @ 2024-01-22 18/week @ 2024-01-29 32/week @ 2024-02-05 34/week @ 2024-02-12 97/week @ 2024-02-19 84/week @ 2024-02-26 117/week @ 2024-03-04 37/week @ 2024-03-11 49/week @ 2024-03-18

287 downloads per month
Used in 6 crates

MIT license

215KB
5K SLoC

rust-debruijn

De Bruijn graph construction & path compression libraries.

Docs

Key features

  • 2-bit packed fixed-length (Kmer) and variable-length (DnaString) sequence containers
  • Statically compiled code paths for different K values
  • Ability to track arbitrary auxiliary data through the DeBruijn graph
  • Customizable kmer counting & filtering schemes supporting a variety of use cases
  • DeBruijn graph compression
  • Minimum-substring partitioning to shard kmers for memory efficient counting and DeBruijn graph compression
  • Configurable for stranded and non-stranded input sequence
  • Extensive unit test suite
  • In production use in Supernova, Long Ranger, Cell Ranger, and Cell Ranger VDJ pipelines from 10x Genomics.

lib.rs:

debruijn: a De Bruijn graph library for DNA seqeunces in Rust.

This library provides tools for efficient construction DeBruijn graphs (dBG) from DNA sequences, tracking arbitrary metadata associated with kmers in the graph, and performing path-compression of unbranched graph paths to improve speed and reduce memory consumption.

Most applications of debruijn will follow this general workflow:

  1. You generate a set of sequences to make a dBG from.
  2. You pass those sequences to the filter_kmers function, which converts the sequences into kmers, while tracking 'metadata' about each kmer in a very customizable way. The metadata could be read count, a set of colors, a set of read counts split by haplotype, a UMI count, etc.
  3. The the library will convert the kmers to a compressed dBG. You can also customize the rules for how to compress the dBG and how to 'combine' the per-kmer metadata.

Then you can use the final compressed dBG how you like. There are some methods for simplifying and re-building the graph, but those could be developed more.

Examples

All the data structures in debruijn-rs are specialized to the 4 base DNA alphabet, and use 2-bit packed encoding of base-pairs into integer types, and efficient methods for reverse complement, enumerating kmers from longer sequences, and transfering data between sequences.

Encodings

Most methods for ingesting sequence data into the library have a form named 'bytes', which expects bases encoded as the integers 0,1,2,3, and a separate form names 'ascii', which expects bases encoded as the ASCII letters A,C,G,T.

Dependencies

~2.6–3.5MB
~77K SLoC