Version 0.1.0-beta.2, released Nov 22, 2023 (2 releases)
jam-rs
Just another minhash (jam) implementation. A high-performance minhash variant for screening extremely large (metagenomic) datasets in a very short time frame. Implements parts of the ScaledMinHash / FracMinHash algorithm described in sourmash.
Unlike traditional implementations such as sourmash or mash, this version focuses on estimating the containment of small sequences in large sets by (optionally) introducing an intentional bias towards smaller sequences. It is intended for screening terabytes of data in a few seconds to minutes.
Installation
A pre-release is published on crates.io. To install it, you need cargo and the Rust toolchain (the easiest way to get both is via rustup.rs), then run:
cargo install jam-rs
If you want the bleeding-edge development release, you can install it via git:
cargo install --git https://github.com/St4NNi/jam-rs
Comparison
- Multiple algorithms: xxhash3, ahash-fallback (for kmer < 32) and legacy murmurhash3
- No Jaccard similarity, since it is meaningless when comparing small embedded sequences against large sets
- Additional filter and sketching options to increase specificity and sensitivity for small sequences in collections of large assembled metagenomes
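To illustrate why Jaccard similarity says little about a small sequence embedded in a large set, here is a minimal sketch that compares both measures on plain sets of hashes. The function names and the example data are hypothetical and not part of jam-rs.

```rust
use std::collections::BTreeSet;

/// Jaccard similarity: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &BTreeSet<u64>, b: &BTreeSet<u64>) -> f64 {
    let inter = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}

/// Containment of `query` in `reference`: |Q ∩ R| / |Q|.
fn containment(query: &BTreeSet<u64>, reference: &BTreeSet<u64>) -> f64 {
    if query.is_empty() {
        return 0.0;
    }
    query.intersection(reference).count() as f64 / query.len() as f64
}

/// Hypothetical example data: a query of 100 hashes that is fully
/// embedded in a reference of 100_000 hashes.
fn example_sets() -> (BTreeSet<u64>, BTreeSet<u64>) {
    let reference: BTreeSet<u64> = (0..100_000u64).collect();
    let query: BTreeSet<u64> = (500..600u64).collect();
    (query, reference)
}
```

For the example sets, containment is 1.0 (every query hash is present in the reference), while Jaccard is only 0.001 because the union is dominated by the large set.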
Scaling methods
Multiple different scaling methods are available:
- FracMinHash (fscale): Restricts the hash space to a (lower) maximum fraction of u64::MAX / fscale
- KmerCountScaling (kscale): Restricts the overall maximum number of hashes to a factor of kscale -> 10 means 1/10th of all k-mers will be stored
- MinMaxAbsoluteScaling (nscale): Restricts the minimum or maximum number of hashes per sequence record
If KmerCountScaling and MinMaxAbsoluteScaling are used together, the minimum number of hashes (per sequence record) will be guaranteed. FracMinHash and KmerCountScaling produce similar results; the first is mainly provided for sourmash compatibility.
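The scaling methods described above could be sketched roughly as follows. This is one possible reading of the descriptions, not the jam-rs implementation: all function names are hypothetical, std's DefaultHasher stands in for xxhash3 / ahash / murmurhash3, and k-mers are not canonicalized.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in hash function (assumption): std's DefaultHasher,
/// not one of the algorithms jam-rs actually offers.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    kmer.hash(&mut h);
    h.finish()
}

/// Hash every k-mer of `seq` (no canonicalization, for brevity).
fn kmer_hashes(seq: &[u8], k: usize) -> Vec<u64> {
    if k == 0 || seq.len() < k {
        return Vec::new();
    }
    seq.windows(k).map(hash_kmer).collect()
}

/// FracMinHash-style scaling: keep only hashes <= u64::MAX / fscale.
fn frac_scale(hashes: &[u64], fscale: u64) -> Vec<u64> {
    let threshold = u64::MAX / fscale.max(1);
    let mut kept: Vec<u64> = hashes.iter().copied().filter(|&h| h <= threshold).collect();
    kept.sort_unstable();
    kept.dedup();
    kept
}

/// KmerCountScaling-style scaling: keep the smallest (k-mer count / kscale)
/// hashes, with the target count clamped to the optional per-record bounds.
fn count_scale(
    hashes: &[u64],
    kscale: usize,
    nmin: Option<usize>,
    nmax: Option<usize>,
) -> Vec<u64> {
    let mut target = hashes.len() / kscale.max(1);
    if let Some(lo) = nmin {
        target = target.max(lo);
    }
    if let Some(hi) = nmax {
        target = target.min(hi);
    }
    let mut sorted = hashes.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    sorted.truncate(target);
    sorted
}
```

Under this reading, a short record with few k-mers still contributes at least the lower bound of hashes (when that many distinct hashes exist), which is where the bias towards smaller sequences comes from.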
Usage
$ jam
Just another (genomic) minhasher (jam), obviously blazingly fast
Usage: jam [OPTIONS] <COMMAND>
Commands:
sketch Sketch one or more files and write result to output file (or stdout)
merge Merge multiple input sketches into a single sketch
dist Estimate distance of a (small) sketch against a subset of one or more sketches as database. Requires all sketches to have the same kmer size
help Print this message or the help of the given subcommand(s)
Options:
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-h, --help Print help (see more with '--help')
-V, --version Print version
Sketching
The easiest way to sketch files is to use the jam sketch command. It accepts one or more input files (fastx / fastx.gz) or a .list file containing a full list of input files, and sketches all inputs into a specific output sketch file.
$ jam sketch
Sketch one or more files and write the result to an output file (or stdout)
Usage: jam sketch [OPTIONS] [INPUT]...
Arguments:
[INPUT]... Input file(s), one directory or one file with list of files to be hashed
Options:
-o, --output <OUTPUT> Output file
-k, --kmer-size <KMER_SIZE> kmer size, all sketches must have the same size to be compared [default: 21]
--fscale <FSCALE> Scale the hash space to a minimum fraction of the maximum hash value (FracMinHash)
--kscale <KSCALE> Scale the hash space to a minimum fraction of all k-mers (SizeMinHash)
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
--nmin <NMIN> Minimum number of k-mers (per record) to be hashed, bottom cut-off
--nmax <NMAX> Maximum number of k-mers (per record) to be hashed, top cut-off
--format <FORMAT> Change to other output formats [default: bin] [possible values: bin, sourmash]
--algorithm <ALGORITHM> Change the hashing algorithm [default: default] [possible values: default, ahash, xxhash, murmur3]
--singleton Create a separate sketch for each sequence record
-h, --help Print help
Dist
Calculate the distance for one or more inputs vs. a large set of database sketches. Optionally specify a minimum cutoff in percent of matching kmers. The output file is optional; if it is not specified, the result is printed to stdout.
$ jam dist
Estimate containment of a (small) sketch against a subset of one or more sketches as database. Requires all sketches to have the same kmer size
Usage: jam dist [OPTIONS] --input <INPUT>
Options:
-i, --input <INPUT> Input sketch or raw file
-d, --database <DATABASE> Database sketch(es)
-o, --output <OUTPUT> Output to file instead of stdout
-c, --cutoff <CUTOFF> Cut-off value for similarity [default: 0.0]
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
--stats Use the Stats params for restricting results
--gc-lower <GC_LOWER> Use GC stats with an upper bound of x% (gc_lower and gc_upper must be set)
--gc-upper <GC_UPPER> Use GC stats with an lower bound of y% (gc_lower and gc_upper must be set)
-h, --help Print help
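Conceptually, screening a query against a database with a percent cutoff could look like the sketch below. The Sketch struct, its fields, and the functions are hypothetical illustrations and do not reflect the jam-rs file format or internals; it assumes each sketch holds a sorted, deduplicated list of hashes.

```rust
use std::cmp::Ordering;

/// A named sketch: a sorted, deduplicated list of k-mer hashes
/// (hypothetical layout, for illustration only).
struct Sketch {
    name: String,
    hashes: Vec<u64>,
}

/// Count the values shared by two sorted, deduplicated slices
/// using a linear merge.
fn shared(a: &[u64], b: &[u64]) -> usize {
    let (mut i, mut j, mut n) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                n += 1;
                i += 1;
                j += 1;
            }
        }
    }
    n
}

/// For every database sketch, compute the percentage of query hashes it
/// contains and keep the entries at or above `cutoff` (in percent).
fn screen(query: &Sketch, database: &[Sketch], cutoff: f64) -> Vec<(String, f64)> {
    if query.hashes.is_empty() {
        return Vec::new();
    }
    database
        .iter()
        .filter_map(|db| {
            let pct =
                100.0 * shared(&query.hashes, &db.hashes) as f64 / query.hashes.len() as f64;
            (pct >= cutoff).then(|| (db.name.clone(), pct))
        })
        .collect()
}
```

With a cutoff of 0.0 (the default shown in the help text) every database entry is reported in this sketch; raising the cutoff drops entries that share only a few hashes with the query.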
Merge
Merge multiple sketches into one large one.
$ jam merge
Merge multiple input sketches into a single sketch
Usage: jam merge [OPTIONS] --output <OUTPUT> [INPUTS]...
Arguments:
[INPUTS]... One or more input sketches
Options:
-o, --output <OUTPUT> Output file
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-h, --help Print help
License
This project is licensed under the MIT license. See the LICENSE file for more info.
Disclaimer
jam-rs is still in active development and not ready for production use. Use at your own risk.
Credits
This tool is heavily inspired by finch-rs (License) and sourmash (License). Check them out if you need a more mature ecosystem with well-tested hash functions and more features.