16 releases
Version | Released |
---|---|
0.3.7 | Mar 12, 2024 |
0.3.5 | Nov 23, 2023 |
0.3.2 | Jul 14, 2023 |
0.2.3 | Feb 20, 2023 |
Split K-mer Analysis (version 2)
Description
This is a reimplementation of the SKA package in the Rust language, by Johanna von Wachsmann, Simon Harris and John Lees. We are also grateful to have received user contributions from:
- Romain Derelle
- Tommi Maklin
- Joel Hellewell
- Timothy Russell
- Nicholas Croucher
- Dan Lu
Split k-mer analysis (version 2) uses exact matching of split k-mer sequences to align closely related sequences, typically small haploid genomes such as bacteria and viruses.
SKA can only align SNPs further than the k-mer length apart, and does not use a gap penalty approach or give alignment scores. Its advantages are speed and flexibility, particularly the ability to run in a reference-free manner (i.e. including accessory genome variation) on both assemblies and reads.
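As a rough illustration of the idea, the sketch below extracts split k-mers from two short sequences and reports positions where the flanks match exactly but the middle base differs. It is a toy written for this README, not the crate's implementation: the function names `split_kmers` and `snps` are hypothetical, reverse complements are ignored, and a repeated flank simply overwrites the earlier middle base.

```rust
use std::collections::HashMap;

/// Collect split k-mers from a sequence. The key is the two flanking
/// halves joined together; the value is the middle base between them.
/// `k` is the total length (both flanks plus the middle base), so it must be odd.
fn split_kmers(seq: &str, k: usize) -> HashMap<String, char> {
    assert!(k >= 3 && k % 2 == 1, "k must be odd and at least 3");
    let bases: Vec<char> = seq.to_ascii_uppercase().chars().collect();
    let half = (k - 1) / 2;
    let mut out = HashMap::new();
    for window in bases.windows(k) {
        // Skip any window containing something other than A, C, G or T (e.g. N).
        if !window.iter().all(|b| matches!(b, 'A' | 'C' | 'G' | 'T')) {
            continue;
        }
        let mut flanks: String = window[..half].iter().collect();
        flanks.extend(window[half + 1..].iter());
        // Toy behaviour: a repeated flank overwrites the earlier middle base.
        out.insert(flanks, window[half]);
    }
    out
}

/// Report flanks present in both samples whose middle bases differ.
fn snps(a: &HashMap<String, char>, b: &HashMap<String, char>) -> Vec<(String, char, char)> {
    let mut diffs: Vec<_> = a
        .iter()
        .filter_map(|(flanks, &mid_a)| match b.get(flanks) {
            Some(&mid_b) if mid_b != mid_a => Some((flanks.clone(), mid_a, mid_b)),
            _ => None,
        })
        .collect();
    diffs.sort();
    diffs
}
```

With `k = 5`, the sequences `GATTACA` and `GATCACA` share only the flanks `AT` + `AC`, with middle bases `T` and `C`. The other windows contain the variant inside a flank, so they do not match, which is why SNPs closer together than the k-mer length cannot be aligned.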
Documentation
The documentation can be found at https://docs.rs/ska. We also have some tutorials available.
Installation
Choose from:
1. Download a binary from the releases.
2. Use `cargo install ska` or `cargo add ska`.
3. Use `conda install -c bioconda ska2` (note the two!).
4. Build from source.
For 2) or 4) you must have the Rust toolchain installed.
OS X users
If you have an M1/M2 (arm64) Mac, we do not currently build binaries automatically, so we recommend either option 2) or 4) for best performance.
If you get a message saying the binary isn't signed by Apple and can't be run, use the following command to bypass this:
xattr -d "com.apple.quarantine" ./ska
Build from source
- Clone the repository with `git clone`.
- Run `cargo install --path .` or `RUSTFLAGS="-C target-cpu=native" cargo install --path .` to optimise for your machine.
Differences from SKA1
Optimisations include:
- Integer DNA encoding, optimised parsing from FASTA/FASTQ.
- Faster dictionaries.
- Full parallelisation of build phase.
- Smaller, standardised input/output files. Faster to save/load.
- Reduced memory footprint and increased speed with read filtering.
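To show what integer DNA encoding means in practice, here is a minimal sketch that packs bases into a `u64` at two bits per base and reverse-complements the packed value with integer arithmetic alone. The mapping A=0, C=1, G=2, T=3 and the helper names `encode_base`, `pack` and `revcomp` are assumptions made for this example; the crate's actual encoding may differ.

```rust
/// Map a base to a 2-bit code (assumed mapping: A=0, C=1, G=2, T=3).
/// Returns None for anything else, such as N.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack up to 32 bases into one u64, with the last base in the lowest two bits.
fn pack(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    seq.iter()
        .try_fold(0u64, |acc, &b| Some((acc << 2) | encode_base(b)?))
}

/// Reverse complement of a packed sequence of `len` bases.
fn revcomp(packed: u64, len: usize) -> u64 {
    let mut rc = 0u64;
    let mut rest = packed;
    for _ in 0..len {
        // With this mapping the complement of a code is 3 minus the code.
        rc = (rc << 2) | (3 - (rest & 3));
        rest >>= 2;
    }
    rc
}
```

Packed sequences compare and hash as plain integers, which is one reason an encoding of this kind tends to be faster and smaller than storing bases as text.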
And other improvements:
- IUPAC uncertainty codes for multiple copy split k-mers.
- Uncertainty with self-reverse-complement split k-mers (palindromes).
- Fully dynamic files (merge, delete samples).
- Native VCF output for map.
- Support for known strand sequence (e.g. RNA viruses).
- Stream to STDOUT, or file with `-o`.
- Simpler command line combining `ska fasta`, `ska fastq`, `ska alleles` and `ska merge` into the new `ska build`.
- Option for single commands to run `ska align` or `ska map`.
- New coverage model for filtering FASTQ files with `ska cov`.
- Logging.
- CI testing.
All of which make `ska.rust` run faster and with smaller file size and memory footprint than the original.
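The IUPAC uncertainty codes mentioned in the list above can be sketched with a 4-bit mask per base, where combining two observations is a bitwise OR. This is an illustration under stated assumptions, not the crate's code: the names `to_mask`, `merge` and `palindrome_middle` are hypothetical, and treating a palindromic split k-mer's middle base as "the base or its complement" is one plausible reading of the uncertainty described above.

```rust
/// IUPAC nucleotide codes indexed by a 4-bit mask (A=1, C=2, G=4, T=8).
/// Index 0 is a placeholder for "no base" and is never returned by `merge`.
const IUPAC: [char; 16] = [
    '-', 'A', 'C', 'M', 'G', 'R', 'S', 'V', 'T', 'W', 'Y', 'H', 'K', 'D', 'B', 'N',
];

/// Convert a base or IUPAC code to its bit mask; None if unrecognised.
fn to_mask(code: char) -> Option<u8> {
    let upper = code.to_ascii_uppercase();
    IUPAC
        .iter()
        .position(|&c| c == upper)
        .filter(|&m| m > 0)
        .map(|m| m as u8)
}

/// Merge two middle bases seen for the same flanks into a single IUPAC code.
fn merge(a: char, b: char) -> Option<char> {
    Some(IUPAC[(to_mask(a)? | to_mask(b)?) as usize])
}

/// Assumed handling of a self-reverse-complement split k-mer: the strand is
/// unknown, so the middle base is merged with its own complement.
fn palindrome_middle(base: char) -> Option<char> {
    let complement = match base.to_ascii_uppercase() {
        'A' => 'T',
        'C' => 'G',
        'G' => 'C',
        'T' => 'A',
        _ => return None,
    };
    merge(base, complement)
}
```

For example, a split k-mer seen with middle bases `A` and `G` merges to `R`, and under the assumption above a palindrome with middle base `A` becomes `W`.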
Planned features
- Sparse data structure which will reduce space and make parallelisation more efficient. Issue #47.
- 'fastcall' mode. Issue #52.
Feature ideas (not definitely planned)
- Add support for ambiguity in VCF output (`ska map`). Issue #5.
- Non-serial loading of .skf files (for when they are very large). Issue #22.
- Alternative mixture models for read error correction. Issue #50.
Things you can no longer do
- Use k > 63 (shouldn't be necessary? Let us know if you need this and why).
- `ska annotate` (use bedtools).
- `ska compare`, `ska humanise`, `ska info` or `ska summary` (replaced by `ska nk --full-info`).
- `ska unique` (you can parse `ska nk --full-info` if you want this functionality, but we didn't think it's used much).
- `ska type` (use PopPUNK instead of MLST 🙂)
- Ns are always skipped, and will not be found in any split k-mers.
- `.skf` files are not backwards compatible with version 1.