16 releases

0.3.7 Mar 12, 2024
0.3.5 Nov 23, 2023
0.3.2 Jul 14, 2023
0.2.3 Feb 20, 2023

Apache-2.0

195KB
3.5K SLoC

Split K-mer Analysis (version 2)

Description

This is a reimplementation of the SKA package in the Rust language, by Johanna von Wachsmann, Simon Harris and John Lees. We are also grateful to have received user contributions from:

  • Romain Derelle
  • Tommi Maklin
  • Joel Hellewell
  • Timothy Russell
  • Nicholas Croucher
  • Dan Lu

Split k-mer analysis (version 2) uses exact matching of split k-mer sequences to align closely related sequences, typically small haploid genomes such as bacteria and viruses.

SKA can only align SNPs that are further apart than the k-mer length, and it does not use a gap penalty approach or give alignment scores. Its advantages are speed and flexibility, particularly the ability to run in a reference-free manner (i.e. including accessory genome variation) on both assemblies and reads.
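To make the idea of exact matching on split k-mers concrete, here is a minimal sketch. It is not the ska.rust implementation: the function names `split_kmers` and `shared_sites` are hypothetical, only the forward strand is considered, and repeated split k-mers simply overwrite each other. Each window of odd length k is split into its two flanking arms (used as the dictionary key) and its middle base (the value); two samples are then compared wherever they share a key.

```rust
use std::collections::HashMap;

/// Collect split k-mers from `seq`. The key is the two flanking arms joined
/// together; the value is the middle base. Forward strand only, and a repeated
/// key overwrites the earlier middle base (a simplification for illustration).
fn split_kmers(seq: &str, k: usize) -> HashMap<String, char> {
    assert!(k % 2 == 1 && k >= 3, "k must be odd and at least 3");
    let bases: Vec<char> = seq.to_ascii_uppercase().chars().collect();
    let half = (k - 1) / 2;
    let mut dict = HashMap::new();
    for window in bases.windows(k) {
        // Skip any window containing a non-ACGT character such as N.
        if window.iter().any(|b| !matches!(b, 'A' | 'C' | 'G' | 'T')) {
            continue;
        }
        let key: String = window[..half].iter().chain(&window[half + 1..]).collect();
        dict.insert(key, window[half]);
    }
    dict
}

/// Pair up the middle bases of split k-mers that are present in both samples.
/// Returns (flanking arms, base in a, base in b), sorted by the flanking arms.
fn shared_sites(
    a: &HashMap<String, char>,
    b: &HashMap<String, char>,
) -> Vec<(String, char, char)> {
    let mut sites: Vec<(String, char, char)> = a
        .iter()
        .filter_map(|(key, &base_a)| b.get(key).map(|&base_b| (key.clone(), base_a, base_b)))
        .collect();
    sites.sort();
    sites
}
```

With k = 7, the sequences ACGTACA and ACGAACA share the flanking arms ACG and ACA but differ in the middle base, so the pair is reported as a variable site without any reference sequence being involved.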

Documentation

Documentation can be found at https://docs.rs/ska. We also have some tutorials available.

Installation

Choose from:

  1. Download a binary from the releases.
  2. Use cargo install ska or cargo add ska.
  3. Use conda install -c bioconda ska2 (note the two!).
  4. Build from source.

For 2) or 4) you must have the Rust toolchain installed.

OS X users

If you have an M1/M2 (arm64) Mac, we aren't currently building binaries automatically, so we recommend option 2) or 4) for best performance.

If you get a message saying the binary isn't signed by Apple and can't be run, use the following command to bypass this:

xattr -d "com.apple.quarantine" ./ska

Build from source

  1. Clone the repository with git clone.
  2. Run cargo install --path . or RUSTFLAGS="-C target-cpu=native" cargo install --path . to optimise for your machine.

Differences from SKA1

Optimisations include:

  • Integer DNA encoding, optimised parsing from FASTA/FASTQ.
  • Faster dictionaries.
  • Full parallelisation of build phase.
  • Smaller, standardised input/output files. Faster to save/load.
  • Reduced memory footprint and increased speed with read filtering.
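As an illustration of what integer DNA encoding can look like, the sketch below packs bases into a `u64` at two bits per base and computes a reverse complement directly on the packed value. The encoding (A=0, C=1, G=2, T=3) and the function names are assumptions made for this example, not necessarily the scheme ska.rust uses internally.

```rust
/// Map a base to a 2-bit code (A=0, C=1, G=2, T=3); anything else is None.
fn encode_base(base: u8) -> Option<u64> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack up to 32 bases into a u64, with the last base in the lowest two bits.
/// Returns None if the k-mer is too long or contains a non-ACGT character.
fn encode_kmer(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None;
    }
    kmer.iter()
        .try_fold(0u64, |acc, &b| Some((acc << 2) | encode_base(b)?))
}

/// Reverse complement of an encoded k-mer of length `k`.
fn rev_comp(mut kmer: u64, k: usize) -> u64 {
    let mut out = 0u64;
    for _ in 0..k {
        // With this encoding, the complement of code c is 3 - c (A<->T, C<->G).
        out = (out << 2) | (3 - (kmer & 3));
        kmer >>= 2;
    }
    out
}
```

Comparing and hashing fixed-width integers is cheaper than doing the same with strings, which is one reason an integer representation suits dictionary-heavy workloads like this.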

And other improvements:

  • IUPAC uncertainty codes for multiple copy split k-mers.
  • Uncertainty with self-reverse-complement split k-mers (palindromes).
  • Fully dynamic files (merge, delete samples).
  • Native VCF output for map.
  • Support for known strand sequence (e.g. RNA viruses).
  • Stream to STDOUT, or file with -o.
  • Simpler command line combining ska fasta, ska fastq, ska alleles and ska merge into the new ska build.
  • Option for single commands to run ska align or ska map.
  • New coverage model for filtering FASTQ files with ska cov.
  • Logging.
  • CI testing.
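The IUPAC uncertainty codes mentioned above can be sketched as follows: when the same flanking sequence is seen with more than one middle base, the observations are merged into a single ambiguity code. This is an illustrative sketch under the assumption of a bitmask representation (A=1, C=2, G=4, T=8); the function names are hypothetical and ska.rust may implement this differently.

```rust
/// Bitmask for an IUPAC nucleotide code, using A=1, C=2, G=4, T=8.
fn iupac_to_mask(code: char) -> Option<u8> {
    Some(match code.to_ascii_uppercase() {
        'A' => 0b0001,
        'C' => 0b0010,
        'G' => 0b0100,
        'T' => 0b1000,
        'M' => 0b0011, // A or C
        'R' => 0b0101, // A or G
        'W' => 0b1001, // A or T
        'S' => 0b0110, // C or G
        'Y' => 0b1010, // C or T
        'K' => 0b1100, // G or T
        'V' => 0b0111, // A, C or G
        'H' => 0b1011, // A, C or T
        'D' => 0b1101, // A, G or T
        'B' => 0b1110, // C, G or T
        'N' => 0b1111, // any base
        _ => return None,
    })
}

/// Inverse of `iupac_to_mask`; a mask of zero (no bases) has no code.
fn mask_to_iupac(mask: u8) -> Option<char> {
    const CODES: [char; 16] = [
        '-', 'A', 'C', 'M', 'G', 'R', 'S', 'V', 'T', 'W', 'Y', 'H', 'K', 'D', 'B', 'N',
    ];
    if mask == 0 || mask > 15 {
        None
    } else {
        Some(CODES[mask as usize])
    }
}

/// Merge two observed codes into the code covering both sets of bases.
fn combine(a: char, b: char) -> Option<char> {
    mask_to_iupac(iupac_to_mask(a)? | iupac_to_mask(b)?)
}
```

For example, seeing both A and G as the middle base of the same split k-mer gives R, and adding a C to that gives V.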

Together, these make ska.rust run faster than the original, with smaller files and a smaller memory footprint.

Planned features

  • Sparse data structure which will reduce space and make parallelisation more efficient. Issue #47.
  • 'fastcall' mode. Issue #52.

Feature ideas (not definitely planned)

  • Add support for ambiguity in VCF output (ska map). Issue #5.
  • Non-serial loading of .skf files (for when they are very large). Issue #22.
  • Alternative mixture models for read error correction. Issue #50.

Things you can no longer do

  • Use k > 63 (shouldn't be necessary? Let us know if you need this and why).
  • ska annotate (use bedtools).
  • ska compare, ska humanise, ska info or ska summary (replaced by ska nk --full-info).
  • ska unique (you can parse ska nk --full-info if you want this functionality, but we don't think it was used much).
  • ska type (use PopPUNK instead of MLST 🙂)
  • Ns are always skipped, and will not be found in any split k-mers.
  • .skf files are not backwards compatible with version 1.
