12 unstable releases (5 breaking)

0.13.0 Jul 12, 2024
0.11.2 Nov 30, 2023
0.10.0 May 5, 2023
0.8.5 Dec 26, 2022
0.1.0 Jun 28, 2021

#21 in Biology

Download history 18/week @ 2024-03-31 1/week @ 2024-04-07 79/week @ 2024-05-05 487/week @ 2024-05-12 47/week @ 2024-05-19 1/week @ 2024-05-26 6/week @ 2024-06-09 1/week @ 2024-06-16 76/week @ 2024-07-07 26/week @ 2024-07-14

102 downloads per month
Used in bio-streams

MIT license

135KB
2.5K SLoC

Docs.rs CI status

bio-seq

Bit-packed and well-typed biological sequences

This crate provides types for representing and operating on sequences of genomic data. Efficient encodings are provided for nucleotides and amino acids and can be extended with the Codec trait.

Short sequences of fixed length (kmers) are given special attention.

Quick start

Add bio-seq to Cargo.toml:

[dependencies]
bio-seq = "0.13"

Iterating over the kmers for a sequence:

use bio_seq::prelude::*;

// The `dna!` macro packs a static sequence with 2-bits per symbol at compile time
let seq = dna!("ATACGATCGATCGATCGATCCGT");

// iterate over the 8-mers of the reverse complement
for kmer in seq.revcomp().kmers::<8>() {
    println!("{kmer}");
}

// ACGGATCG
// CGGATCGA
// GGATCGAT
// GATCGATC
// ATCGATCG
// ...

Sequences are analogous to rust's string types and follow similar dereferencing conventions:

use bio_seq::prelude::*;

// The `dna!` macro packs a static sequence with 2-bits per symbol at compile time.
// This is analogous to rust's string literals:
let s: &'static str = "hello!";
let seq: &'static SeqSlice<Dna> = dna!("CGCTAGCTACGATCGCAT");

// Static sequences can also be copied as kmers
let kmer: Kmer<Dna, 18> = dna!("CGCTAGCTACGATCGCAT").into();
// or with the kmer! macro:
let kmer = kmer!("CGCTAGCTACGATCGCAT");

// `Seq`s can be allocated on the heap like `String`s are:
let s: String = "hello!".into();
let seq: Seq<Dna> = dna!("CGCTAGCTACGATCGCAT").into();

// Alternatively, a `Seq` can be fallibly encoded at runtime:
let seq: Seq<Dna> = "CGCTAGCTACGATCGCAT".try_into().unwrap();

// &SeqSlices are analogous to &str, String slices:
let slice: &str = &s[1..3];
let seqslice: &SeqSlice<Dna> = &seq[2..4];

Philosophy

Many bioinformatics crates implement their own kmer packing logic. This project began as a way to reuse kmer construction code and make it compatible between projects. It quickly became apparent that a kmer type doesn't make sense without being tightly coupled to a general data type for sequences. The scope of the crate will be restricted to representing sequences.

Some people like to engineer clever bit twiddling hacks to reverse complement a sequence and some people want to rapidly prototype succinct datastructures. Most people don't want to worry about endianess. The strength of rust is that we can safely abstract the science from the engineering to work towards both objectives cooperatively.

Contributions are very welcome. There's lots of low hanging fruit for optimisation and ideally we should only have to write them once!

Contents

  • Codec: Coding/Decoding schemes for the symbols of a biological sequence
  • Seq: A sequence of encoded symbols
  • Kmer: A fixed size sequence of length K
  • Derivable codecs: This crate offers utilities for defining your own bit-level encodings
  • Safe conversion between sequences

Sequences

Strings of encoded symbols are packed into Seq. Slicing, chunking, and windowing return SeqSlice. Seq<A: Codec> and &SeqSlice<A: Codec> are analogous to String and &str. As with the standard string types, these are stored on the heap. Kmers are generally stored on the stack, implementing Copy.

Kmers

kmers are short sequences of length k that can fit into a register (e.g. usize, or SIMD vector) and generally implement Copy. These are implemented with const generics and k is fixed at compile time.

All data is stored little-endian. This effects the order that sequences map to the integers ("colexicographic" order).

for i in 0..=15 {
    println!("{}: {}", i, Kmer::<Dna, 5>::from(i));
}
0: AAAAA
1: CAAAA
2: GAAAA
3: TAAAA
4: ACAAA
5: CCAAA
6: GCAAA
7: TCAAA
8: AGAAA
9: CGAAA
10: GGAAA
11: TGAAA
12: ATAAA
13: CTAAA
14: GTAAA
15: TTAAA

Succinct encodings

A lookup table can be indexed in constant time by treating kmers directly as usize:

// This example builds a histogram of kmer occurences

fn kmer_histogram<C: Codec, const K: usize>(seq: &SeqSlice<C>) -> Vec<usize> {
    // For dna::Dna our histogram will need 4^4
    // bins to count every possible 4-mer
    let mut histo = vec![0; 1 << (C::WIDTH * K as u8)];

    for kmer in seq.kmers::<K>() {
        histo[usize::from(kmer)] += 1;
    }

    histo
}

Sketching

Minimising

The 2-bit representation of nucleotides is ordered A < C < G < T. Sequences and kmers are stored little-endian and are ordered "colexicographically". This means that AAAA < CAAA < GAAA < ... < AAAC < ... < TTTT:

let seq = dna!("GCTCGATCGTAAAAAATCGTATT");
let minimiser = seq.kmers::<8>().min().unwrap();

assert_eq!(minimiser, Kmer::from(dna!("GTAAAAAA")));

Hashing

Hash is implemented for sequence and kmer types so equal values of these types with hash identically:

let seq_arr: &SeqArray<Dna, 32, 1> = dna!("AGCGCTAGTCGTACTGCCGCATCGCTAGCGCT");
let seq: Seq<Dna> = seq_arr.into();
let seq_slice: &SeqSlice<Dna> = &seq;
let kmer: Kmer<Dna, 32> = seq_arr.into();

assert_eq!(hash(seq_arr), hash(&seq));
assert_eq!(hash(&seq), hash(&seq_slice));
assert_eq!(hash(&seq_slice), hash(&kmer));

Hashing minimisers

In practice we want to hash sequences that we minimise:

fn hash<T: Hash>(seq: T) -> u64 {
    let mut hasher = DefaultHasher::new();
    seq.hash(&mut hasher);
    hasher.finish()
}

let (minimiser, min_hash) = seq
    .kmers::<16>()
    .map(|kmer| (kmer, hash(&kmer)))
    .min_by_key(|&(_, hash)| hash)
    .unwrap();

Canonical kmers:

To consider both the forward and reverse complement of kmers when minimising:

let (canonical_minimiser, canonical_hash) = seq
    .kmers::<16>()
    .map(|kmer| {
        let canonical_hash = min(hash(&kmer), hash(&kmer.revcomp()));
        (kmer, canonical_hash)
    })
    .min_by_key(|&(_, hash)| hash)
    .unwrap();

Although it's probably more efficient to minimise seq and seq.revcomp() separately.

Codecs

The Codec trait describes the coding/decoding process for the symbols of a biological sequence. This trait can be derived procedurally. There are four built-in codecs:

codec::Dna

Using the lexicographically ordered 2-bit representation

codec::Iupac

IUPAC nucleotide ambiguity codes are represented with 4 bits. This supports membership resolution with bitwise operations. Logical or is the union:

assert_eq!(iupac!("AS-GYTNA") | iupac!("ANTGCAT-"), iupac!("ANTGYWNA"));

Logical and is the intersection of two iupac sequences:

assert_eq!(iupac!("ACGTSWKM") & iupac!("WKMSTNNA"), iupac!("A----WKA"));

codec::Text

utf-8 strings that are read directly from common plain-text file formats can be treated as sequences. Additional logic can be defined to ensure that 'a' == 'A' and for handling 'N'.

codec::Amino

Amino acid sequences are represented with 6 bits. The representation of amino acids is designed to be easy to coerce from sequences of 2-bit encoded DNA.

Defining new codecs

Custom codecs can be defined by implementing the Codec trait.

Codecs can also be derived from the variant names and discriminants of enum types:

use bio_seq_derive::Codec;
use bio_seq::codec::Codec;

#[derive(Clone, Copy, Debug, PartialEq, Codec)]
#[repr(u8)]
pub enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

Note that you need to explicitly provide a "discriminant" (e.g. 0b00) in the enum.

A #[width(n)] attribute specifies how many bits the encoding requires per symbol. The maximum supported is 8. If this attribute isn't specified then the optimal width will be chosen.

#[alt(...,)] and #[display('x')] attributes can be used to define alternative representations or display the item with a special character. Here is the definition for the stop codon in codec::Amino:

pub enum Amino {
    #[display('*')] // print the stop codon as a '*'
    #[alt(0b001011, 0b100011)] // TGA, TAG
    X = 0b000011, // TAA (stop)

Sequence conversions

Translation table traits

Translation tables provide methods for translating codons into amino acids.

Enable the translation feature in Cargo.toml:

[dependencies]
bio-seq = { version="0.13", features=["translation"] }
pub trait TranslationTable<A: Codec, B: Codec> {
    fn to_amino(&self, codon: &SeqSlice<A>) -> B;
    fn to_codon(&self, amino: B) -> Result<Seq<A>, TranslationError>;
}

/// A partial translation table where not all triples of characters map to amino acids
pub trait PartialTranslationTable<A: Codec, B: Codec> {
    fn try_to_amino(&self, codon: &SeqSlice<A>) -> Result<B, TranslationError>;
    fn try_to_codon(&self, amino: B) -> Result<Seq<A>, TranslationError>;
}

The standard genetic code is provided as a translation::STANDARD constant:

use crate::prelude::*;
use crate::translation::STANDARD;
use crate::translation::TranslationTable;

let seq = dna!("AATTTGTGGGTTCGTCTGCGGCTCCGCCCTTAGTACTATGAGGACGATCAGCACCATAAGAACAAA");

let aminos: Seq<Amino> = seq
    .windows(3)
    .map(|codon| STANDARD.to_amino(&codon))
    .collect::<Seq<Amino>>();

assert_eq!(
    aminos,
    Seq<Amino>::try_from("NIFLCVWGGVFSRVSLCARGALSPRAPPLL*SVYTLYM*ERGDTRDISQSAHTPHI*KRENTQK").unwrap()
);

Custom translation tables

Instantiate a translation table from a type that implements Into<HashMap<Seq<A>, B>>:

let codon_mapping: [(Seq<Dna>, Amino); 6] = [
    (dna!("AAA"), Amino::A),
    (dna!("ATG"), Amino::A),
    (dna!("CCC"), Amino::C),
    (dna!("GGG"), Amino::E),
    (dna!("TTT"), Amino::D),
    (dna!("TTA"), Amino::F),
];

let table = CodonTable::from_map(codon_mapping);

let seq: Seq<Dna> = dna!("AAACCCGGGTTTTTATTAATG");
let mut amino_seq: Seq<Amino> = Seq::new();

for codon in seq.chunks(3) {
    amino_seq.push(table.try_to_amino(codon).unwrap());
}

Implementing the TranslationTable trait directly:

struct Mitochondria;

impl TranslationTable<Dna, Amino> for Mitochondria {
    fn to_amino(&self, codon: &SeqSlice<Dna>) -> Amino {
        if *codon == dna!("AGA") {
            Amino::X
        } else if *codon == dna!("AGG") {
            Amino::X
        } else if *codon == dna!("ATA") {
            Amino::M
        } else if *codon == dna!("TGA") {
            Amino::W
        } else {
            Amino::unsafe_from_bits(Into::<u8>::into(codon))
        }
    }

    fn to_codon(&self, _amino: Amino) -> Result<Seq<Dna>, TranslationError> {
        unimplemented!()
    }
}

Contributing

Contributions and suggestions are very much welcome. The easiest way to begin contributing is through github issues.

Dependencies

~1.2–1.8MB
~43K SLoC