5 unstable releases

0.6.0 Aug 26, 2022
0.5.0 May 27, 2021
0.4.4 Mar 30, 2020
0.4.3 Feb 27, 2020
0.4.0 Feb 26, 2020

MIT license

sketchy

Genomic neighbor typing for lineage and genotype inference

Overview

v0.6.0

Sketchy is a lineage calling and genotyping tool based on the heuristic principle of genomic neighbor typing developed by Karel Břinda and colleagues (2020). It queries species-wide ('hypothesis-agnostic') reference sketches using MinHash and infers the genotypes associated with the closest match, including multi-locus sequence types, susceptibility profiles, virulence factors or other genome-associated features provided by the user. Unlike the original implementation in RASE, sketchy does not use phylogenetic trees, which has some downsides, e.g. for sublineage genotype predictions (see below).
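The idea of neighbor typing with MinHash can be illustrated with a minimal Python sketch. It is not sketchy's implementation: the function names (`sketch`, `closest_genotype`), the BLAKE2b hash, the toy random genomes and the lineage labels are all assumptions made for illustration, and canonical (reverse-complement aware) k-mers are omitted for brevity.

```python
import hashlib
import random


def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=16, s=1000):
    """Bottom-s MinHash sketch: the s smallest distinct k-mer hashes."""
    return set(sorted({hash64(km) for km in kmers(seq, k)})[:s])


def closest_genotype(reads, references, genotypes, k=16):
    """Return (name, shared hashes, genotype) of the best-matching reference."""
    read_hashes = {hash64(km) for read in reads for km in kmers(read, k)}
    shared = {name: len(read_hashes & sk) for name, sk in references.items()}
    best = max(shared, key=shared.get)
    return best, shared[best], genotypes[best]


# Toy data: two unrelated random "genomes" with hypothetical lineage labels.
rng = random.Random(1)
genome_a = "".join(rng.choice("ACGT") for _ in range(2000))
genome_b = "".join(rng.choice("ACGT") for _ in range(2000))
references = {"genome_a": sketch(genome_a), "genome_b": sketch(genome_b)}
genotypes = {"genome_a": "lineage_1", "genome_b": "lineage_2"}

# A single error-free 200 bp "read" taken from genome_a.
name, shared_hashes, genotype = closest_genotype(
    [genome_a[500:700]], references, genotypes
)
print(name, genotype)
```

The query read shares hashes only with the sketch of the genome it came from, so the genotype attached to that reference is reported. Real reads contain sequencing errors, and real reference collections contain many closely related genomes, which is why inspecting several top matches is usually more informative than a single best hit.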

See the latest docs for installation, usage and database building.

Strengths and limitations

  • Reference sketches and genotype indices can be constructed easily from large genome and genotype collections
  • Sketchy requires few resources when using small sketch sizes (s = 1000)
  • Sketchy performs best on lineage predictions and lineage-wide genotypes from very few reads: we found that tens to hundreds of reads often give a good indication of the closest matches in the reference sketch, especially when inspecting the top matches with --top
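Ranking the top matches as reads arrive can be pictured as follows. This is a hedged sketch rather than sketchy's code: it assumes that each read adds its shared hashes to a running total per reference, and the names `update_scores` and `top_matches`, the integer "hashes" and the lineage labels are hypothetical.

```python
from collections import Counter


def update_scores(scores, read_hashes, reference_sketches):
    """Add the hashes one read shares with each reference to its running total."""
    for name, ref_sketch in reference_sketches.items():
        scores[name] += len(read_hashes & ref_sketch)
    return scores


def top_matches(scores, genotypes, top=5):
    """Return the `top` highest-scoring references with their genotypes."""
    return [(name, score, genotypes[name]) for name, score in scores.most_common(top)]


# Toy reference sketches and reads, written as small sets of integer hashes.
reference_sketches = {
    "ref_1": {1, 2, 3, 4, 5},
    "ref_2": {4, 5, 6, 7, 8},
    "ref_3": {20, 21, 22},
}
genotypes = {"ref_1": "lineage_A", "ref_2": "lineage_A", "ref_3": "lineage_B"}
reads = [{1, 2, 9}, {4, 5, 3}, {5, 6, 30}]

scores = Counter()
for read_hashes in reads:
    update_scores(scores, read_hashes, reference_sketches)

ranking = top_matches(scores, genotypes, top=2)
print(ranking)  # [('ref_1', 6, 'lineage_A'), ('ref_2', 4, 'lineage_A')]
```

In this toy case both top matches carry the same lineage label, which illustrates why lineage-level calls can stabilise after only a few reads even when the exact best-matching genome is still uncertain.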

However:

  • Clade-specific genotype resolution is not as good as when using phylogenetic guide trees (RASE)
  • Increasing the sketch size (s = 10000) improves performance, but resource use scales approximately linearly with sketch size
  • Sketchy genotype inference may be difficult for species with high rates of homologous recombination
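The roughly linear scaling with sketch size can be made concrete with a back-of-the-envelope calculation. The figures are assumptions, not measurements of sketchy: each hash is taken to occupy 8 bytes (64 bits), the collection size of 10,000 genomes is hypothetical, and file-format overhead is ignored.

```python
def sketch_bytes(n_genomes, sketch_size, bytes_per_hash=8):
    """Approximate raw size of a reference sketch: one hash list per genome."""
    return n_genomes * sketch_size * bytes_per_hash


n_genomes = 10_000  # hypothetical reference collection size

small = sketch_bytes(n_genomes, 1_000)   # s = 1000
large = sketch_bytes(n_genomes, 10_000)  # s = 10000

print(f"s = 1000:  {small / 1e6:.0f} MB")   # 80 MB
print(f"s = 10000: {large / 1e6:.0f} MB")   # 800 MB
print(f"ratio: {large / small:.0f}x")       # 10x
```

Under these assumptions a tenfold increase in sketch size means tenfold more hashes to store and compare per reference genome.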

Data availability

  • Reference sketches and genotype files (s = 1000, s = 10000, k = 16) for S. aureus (full genotypes including susceptibility predictions and other genotypes), S. pneumoniae, K. pneumoniae, P. aeruginosa and Neisseria spp. (MLST) can be found in the data repository.
  • Reference sketches for cross-validation on the simulated species data can be found in this data repository; genome assemblies for all species extracted from the ENA reference collection are available in this data repository
  • Scripts to extract data from the ENA collections (Grace Blackwell et al.) and compute reference metrics can be found in the scripts directory.
  • Nanopore reads for the outbreak isolates and genotype surveillance panels in Papua New Guinea (Flongle, Goroka, sequential protocol) are available for download in the data repository. Raw sequence data (Illumina / ONT) is being uploaded to NCBI (PRJNA657380).

Preprint

If you use sketchy for research or other applications, please cite:

Steinig et al. (2022) - Genomic neighbor typing for bacterial outbreak surveillance - bioRxiv 2022.02.05.479210; doi: https://doi.org/10.1101/2022.02.05.479210
