| Version | Released |
|---|---|
| 0.1.1 | Aug 18, 2025 |
| 0.1.0 | Aug 17, 2025 |
Kira CDH
kira-cdh is a single-binary, CLI-compatible replacement for the core CD-HIT utilities:
- --mode cd-hit: protein clustering
- --mode cd-hit-est: nucleotide clustering
- --mode cd-hit-2d: two-dataset protein comparison
- --mode cd-hit-est-2d: two-dataset nucleotide comparison
It accepts the same flags as the original tools. The pipeline is implemented in Rust (edition 2024) using a modular stack:
FASTX I/O → k-mer hashing → KMV/MinHash signatures → LSH candidate retrieval → greedy clustering → CD-HIT compatible .clstr writer.
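The first two stages of this pipeline (k-mer hashing and KMV signatures) can be sketched as follows. This is a minimal illustration, not kira-cdh's actual code: the function names `hash_kmer` and `kmv_signature` are hypothetical, and using the standard library's `DefaultHasher` is an assumption made only to keep the example dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Hash one k-mer to 64 bits. DefaultHasher is an assumption of this sketch;
/// its output is stable within one build but not guaranteed across Rust versions.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    kmer.hash(&mut h);
    h.finish()
}

/// KMV ("k minimum values") signature: the `size` smallest distinct
/// k-mer hashes of `seq`, returned in ascending order.
fn kmv_signature(seq: &[u8], k: usize, size: usize) -> Vec<u64> {
    if k == 0 || seq.len() < k {
        return Vec::new(); // sequence too short to contain a k-mer
    }
    let mut smallest: BTreeSet<u64> = BTreeSet::new();
    for kmer in seq.windows(k) {
        smallest.insert(hash_kmer(kmer));
        if smallest.len() > size {
            smallest.pop_last(); // evict the current largest hash
        }
    }
    smallest.into_iter().collect()
}
```

Calling `kmv_signature(b"MKTAYIAKQRQISFVKSHFSRQ", 5, 128)` would return at most 128 hashes; a sequence with fewer distinct k-mers simply yields a shorter signature.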
Status: v0.1 focuses on the 4 core modes above. Additional CD-HIT variants (e.g. PSI-CD-HIT, 454/OTU/LAP/DUP) can be added later via the same --mode mechanism.
Highlights
- Single binary with a --mode switch and drop-in CLI flag compatibility.
- Fast I/O (FASTA/FASTQ, transparent .gz) with robust, CD-HIT-like error handling.
- Scalable indexing via KMV/MinHash signatures and LSH for candidate discovery.
- Greedy clustering (length-first, CD-HIT-like) with optional coverage gates.
- CD-HIT .clstr output compatibility.
Installation
From source
# Requires Rust stable (MSRV = 1.85)
git clone https://github.com/ARyaskov/kira-cdh
cd kira-cdh
cargo install --path .
Quick start
Protein clustering (cd-hit):
kira-cdh --mode cd-hit \
-i proteins.fasta -o clusters -c 0.9 -n 5 -T 16
Nucleotide clustering (cd-hit-est):
kira-cdh --mode cd-hit-est \
-i reads.fasta.gz -o clusters -c 0.97 -n 10 -T 16
Two-dataset comparison (protein):
kira-cdh --mode cd-hit-2d \
-i proteinsA.fasta \
--i2 proteinsB.fasta \
-o B_vs_A -c 0.9 -n 5 -T 16
Two-dataset comparison (nucleotide):
kira-cdh --mode cd-hit-est-2d \
-i readsA.fasta.gz \
--i2 readsB.fasta.gz \
-o B_vs_A -c 0.97 -n 10 -T 16
Outputs:
- <prefix>: FASTA with cluster representatives
- <prefix>.clstr: CD-HIT-compatible cluster file
CLI compatibility
The tool exposes the same flags as the original utilities for the selected mode. Run:
kira-cdh --mode <cd-hit|cd-hit-est|cd-hit-2d|cd-hit-est-2d> --help
Common flags (subset)
- -i <file>: input (FASTA/FASTQ; .gz supported)
- -o <prefix>: output prefix
- -c <float>: identity threshold [0..1] (used for MinHash Jaccard gate)
- -n <int>: word length (k-mer size). Defaults: protein=5, nucleotide=10
- -T <int>: threads (0 = all CPUs)
- -M <int>: memory limit (advisory)
- -d <int>: description length in output FASTA (0 = full)
- Coverage/length controls: -aS, -aL, -A, -s, -S, -uS, -uL, -U
- Nucleotide scoring knobs (kept for compatibility): --match, --mismatch, --gap, --gap-ext
- Sorting/format knobs: --sf, --sc, --bak, -p
2D modes
- --i2 <file>: second input (required)
- Optional asymmetric cutoffs: --s2, --S2 (length-diff gates for db1)
Paired-end (cd-hit-est only)
- -P 1 -j <R2.fastq> --op <out_R2>: paired-end passthrough hooks
Note: For v0.1, only the flags that affect the LSH/Jaccard/greedy stages are functionally active (see details below). Other flags are parsed and validated, but may be no-ops at this stage; see Feature parity.
Mapping to CD-HIT semantics
Internally, identity gating uses MinHash/KMV signatures and LSH:
- Signatures: KMV, length = 128 by default.
- LSH: bands = 32, rows = 4 (compatible with signature length).
- Candidate retrieval: keep pairs with at least ceil(c * rows) collisions, where c is the -c value.
- Final acceptance in 2D mode uses jaccard_from_signatures() ≥ -c.
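To make the banding and gating steps concrete, here is a hedged sketch under stated assumptions. All function names (`band_keys`, `min_collisions`, `colliding_bands`, `is_candidate`, `kmv_jaccard`) are hypothetical. `kmv_jaccard` is a textbook KMV estimator standing in for the tool's jaccard_from_signatures(), whose real implementation is not shown in this document. The text does not spell out how collisions are counted, so counting bands whose keys agree is only one possible reading.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One bucket key per band: hash the band index together with its `rows` values.
fn band_keys(sig: &[u64], bands: usize, rows: usize) -> Vec<u64> {
    sig.chunks(rows.max(1))
        .take(bands)
        .enumerate()
        .map(|(band, chunk)| {
            let mut h = DefaultHasher::new();
            (band, chunk).hash(&mut h);
            h.finish()
        })
        .collect()
}

/// Collision threshold from the text: ceil(c * rows).
fn min_collisions(c: f64, rows: usize) -> usize {
    (c * rows as f64).ceil() as usize
}

/// Number of bands in which two sequences land in the same bucket.
fn colliding_bands(a: &[u64], b: &[u64]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x == y).count()
}

/// Assumed candidate gate: enough colliding bands to reach the threshold.
fn is_candidate(a: &[u64], b: &[u64], c: f64, rows: usize) -> bool {
    colliding_bands(a, b) >= min_collisions(c, rows)
}

/// Textbook KMV Jaccard estimate for two ascending, duplicate-free signatures:
/// among the k smallest hashes of the union, the fraction present in both.
fn kmv_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let k = a.len().min(b.len());
    if k == 0 {
        return 0.0;
    }
    let (mut i, mut j, mut seen, mut shared) = (0, 0, 0usize, 0usize);
    while seen < k && i < a.len() && j < b.len() {
        if a[i] == b[j] {
            shared += 1;
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            i += 1;
        } else {
            j += 1;
        }
        seen += 1;
    }
    shared as f64 / k as f64
}
```

With the defaults above, `min_collisions(0.9, 4)` evaluates to 4, and a 128-value signature splits into exactly 32 bands of 4 rows.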
Greedy clustering is CD-HIT-like:
- Representatives are chosen in length-descending order (--sc/--sf affect output sorting only).
- For 1-set modes, clustering runs over the entire set.
- For 2D modes, set A is indexed; each sequence from B is assigned to the best matching A if Jaccard ≥ -c. Otherwise the B sequence forms a singleton cluster.
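The greedy scheme described above can be sketched as below. This is an illustration only: `greedy_cluster` and `assign_2d` are hypothetical names, the `similarity` closure stands in for the signature-based Jaccard estimate, and picking the best-scoring representative in 1-set mode (rather than the first one that passes) is an assumption.

```rust
/// 1-set mode: visit sequences longest first; each one joins the
/// best-scoring existing representative with similarity >= c, or becomes
/// a new representative. Returns the representative index per sequence.
fn greedy_cluster<F>(lengths: &[usize], c: f64, similarity: F) -> Vec<usize>
where
    F: Fn(usize, usize) -> f64, // (representative, query) -> estimated identity
{
    let mut order: Vec<usize> = (0..lengths.len()).collect();
    // Length-descending; ties broken by input order for determinism.
    order.sort_by(|&a, &b| lengths[b].cmp(&lengths[a]).then(a.cmp(&b)));

    let mut reps: Vec<usize> = Vec::new();
    let mut assignment = vec![usize::MAX; lengths.len()];
    for &i in &order {
        let best = reps
            .iter()
            .map(|&r| (r, similarity(r, i)))
            .filter(|&(_, s)| s >= c)
            .max_by(|x, y| x.1.total_cmp(&y.1));
        match best {
            Some((r, _)) => assignment[i] = r,
            None => {
                reps.push(i);
                assignment[i] = i;
            }
        }
    }
    assignment
}

/// 2D mode: each B sequence maps to its best-matching A sequence if the
/// score reaches c; `None` marks a B sequence that forms a singleton cluster.
fn assign_2d<F>(n_a: usize, n_b: usize, c: f64, similarity: F) -> Vec<Option<usize>>
where
    F: Fn(usize, usize) -> f64, // (index in A, index in B) -> estimated identity
{
    (0..n_b)
        .map(|b| {
            (0..n_a)
                .map(|a| (a, similarity(a, b)))
                .filter(|&(_, s)| s >= c)
                .max_by(|x, y| x.1.total_cmp(&y.1))
                .map(|(a, _)| a)
        })
        .collect()
}
```

A real implementation would query only the LSH candidates for each sequence instead of scanning every representative; the exhaustive scan here keeps the sketch short.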
.clstr output:
- Written via a CD-HIT-compatible writer, with optional length annotations (nt/aa).
- The first member in a cluster is the representative and ends with *.
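For orientation, a minimal writer for the protein-style .clstr layout might look like the sketch below. The `Member` struct and `write_clstr` function are hypothetical, not kira-cdh's API. The line layout follows the usual upstream CD-HIT form; note that upstream nucleotide output also carries a strand marker (for example `at +/97.00%`), which this sketch omits.

```rust
use std::io::{self, Write};

/// One cluster member. `identity` is `None` for the representative,
/// otherwise the percent identity to the representative.
struct Member<'a> {
    name: &'a str,
    len: usize,
    identity: Option<f64>,
}

/// Write clusters in .clstr layout. `unit` is the length annotation
/// ("aa" or "nt"). The representative is expected first in each cluster.
fn write_clstr<W: Write>(out: &mut W, clusters: &[Vec<Member<'_>>], unit: &str) -> io::Result<()> {
    for (cid, members) in clusters.iter().enumerate() {
        writeln!(out, ">Cluster {cid}")?;
        for (idx, m) in members.iter().enumerate() {
            match m.identity {
                // Representative line ends with '*'.
                None => writeln!(out, "{idx}\t{}{unit}, >{}... *", m.len, m.name)?,
                // Other members report identity to the representative.
                Some(p) => writeln!(out, "{idx}\t{}{unit}, >{}... at {:.2}%", m.len, m.name, p)?,
            }
        }
    }
    Ok(())
}
```

Writing to any `Write` implementor (a file, stdout, or a `Vec<u8>` in tests) keeps the formatter independent of where the output goes.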
Input formats
- FASTA, FASTQ (transparently supports .gz)
- Multi-line FASTA/FASTQ supported
- Robust error handling (skip malformed records, attempt resynchronization)
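As an illustration of multi-line parsing with simple resynchronization, here is a small FASTA-only reader. It is a sketch under assumptions: `read_fasta` is a hypothetical name, "malformed" is taken to mean stray lines before the first header or a header with no sequence, and .gz and FASTQ handling are left out because they would need more code or an external crate.

```rust
use std::io::{self, BufRead};

/// Read FASTA records as (header, sequence) pairs.
/// Multi-line sequences are concatenated. Lines before the first '>' and
/// headers with no sequence are skipped; parsing resumes at the next '>'.
fn read_fasta<R: BufRead>(reader: R) -> io::Result<Vec<(String, Vec<u8>)>> {
    let mut records = Vec::new();
    let mut current: Option<(String, Vec<u8>)> = None;

    for line in reader.lines() {
        let line = line?;
        let line = line.trim_end();
        if let Some(header) = line.strip_prefix('>') {
            // A new header closes the previous record, if it has any sequence.
            if let Some(rec) = current.take() {
                if !rec.1.is_empty() {
                    records.push(rec);
                }
            }
            current = Some((header.to_string(), Vec::new()));
        } else if let Some((_, seq)) = current.as_mut() {
            // Sequence line: append, dropping embedded whitespace.
            seq.extend(line.bytes().filter(|b| !b.is_ascii_whitespace()));
        }
        // Otherwise: data before any header, ignored until the next '>'.
    }
    if let Some(rec) = current {
        if !rec.1.is_empty() {
            records.push(rec);
        }
    }
    Ok(records)
}
```

Because `&[u8]` implements `BufRead`, the function can be exercised directly on an in-memory string such as `">a\nACGT\nAC\n".as_bytes()`.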
Feature parity (v0.1)
Implemented end-to-end:
- -i, -o, --i2 (2D), -c, -n, -T, -d
- CD-HIT-like greedy clustering (length-first)
- LSH candidate retrieval + MinHash/KMV Jaccard gate
- .clstr writer compatibility
- -aS/-aL coverage gates (basic support). When either is set > 0, the clusterer is configured with corresponding coverage thresholds.
Parsed & validated (currently no-op or partial; accepted for CLI parity):
- -M, -G, -b, -t, -s, -S, -A, -uS, -uL, -U
- -p, --sf, --sc, --bak
- Nucleotide scoring (--match, --mismatch, --gap, --gap-ext)
- Paired-end hooks (-P, -j, --op, --cx, --cy, --ap, -r): parsed; not all affect clustering yet
If you depend on the exact upstream semantics of a flag that is not listed under “Implemented end-to-end”, please open an issue. The plan is to add strict fail-fast checks for unsupported semantics in a subsequent minor release.
Performance knobs
- Threads: -T (default: 1; 0 = all CPUs).
- k-mer size: -n (defaults: protein=5, nucleotide=10).
- Identity threshold: -c influences LSH and final Jaccard acceptance.
- Signature length: currently fixed to 128 (32×4); future releases may expose this.
- Memory: -M is advisory in v0.1 (no strict cgroup/pid limit). Indexing is streaming; memory depends mainly on signature storage and LSH buckets.
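Two of these knobs lend themselves to a quick back-of-the-envelope sketch. The helper names `resolve_threads` and `signature_bytes` are hypothetical, and the estimate assumes each signature entry is stored as a 64-bit hash, which this document does not confirm. LSH bucket overhead is not included.

```rust
use std::thread;

/// Interpret a -T style value: 0 means "use all available CPUs".
/// Falls back to 1 if the platform cannot report its parallelism.
fn resolve_threads(t: usize) -> usize {
    if t == 0 {
        thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
    } else {
        t
    }
}

/// Rough lower bound on signature storage, assuming 8-byte (u64) entries.
fn signature_bytes(n_sequences: usize, signature_len: usize) -> usize {
    n_sequences * signature_len * std::mem::size_of::<u64>()
}
```

Under that assumption, one million sequences at signature length 128 need about 1 GB for signatures alone (1,000,000 × 128 × 8 bytes).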
Examples
Cluster proteins at 90% identity:
kira-cdh --mode cd-hit \
-i uniprot_sprot.fasta \
-o sprot90 \
-c 0.90 -n 5 -T 32
Cluster reads at 97% identity (nucleotide):
kira-cdh --mode cd-hit-est \
-i reads.fasta.gz \
-o reads97 \
-c 0.97 -n 10 -T 16
Compare B against A (protein 2D):
kira-cdh --mode cd-hit-2d \
-i A.faa --i2 B.faa \
-o B_vs_A \
-c 0.9 -n 5
The resulting B_vs_A.clstr contains A-anchored clusters for matches and singleton clusters for unmatched B sequences.
Logging
Set RUST_LOG to tune verbosity:
RUST_LOG=info kira-cdh --mode cd-hit -i input.fa -o out -c 0.9
RUST_LOG=debug kira-cdh --mode cd-hit-est -i input.fq.gz -o out -c 0.97
Contributing
- Follow KISS/DRY principles; prefer small, well-documented modules.
- For performance work, include before/after benchmarks and dataset notes.
- When wiring a new CD-HIT flag, update README.md (Feature parity) and add a validation path.
License
GPLv2.
Acknowledgements
We would like to thank the original authors and maintainers of CD-HIT for their contributions to the field of sequence clustering, which served as an inspiration for this project.