1 unstable release
Uses old Rust 2015
0.1.0 | Aug 30, 2024
#286 in Biology
175KB, 4.5K SLoC
Large-scale Sequence Search with BItsliced Genomic Signature Index (BIGSIG)
This is a port of the colorid crate with several updates for real-world applications:
- Use xxh3 to support aarch64 and x86-64 platforms;
- Use needletail for fast processing of fasta/fastq files, including compressed ones;
- Use a 2-bit nucleotide sequence representation via kmerutils to improve memory efficiency;
- Recreate the command line interface using recent clap v4.3.
Credit for the original implementation goes to the original authors.
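To illustrate why a 2-bit representation saves memory, here is a minimal sketch that packs a k-mer of up to 32 bases into a single u64. It is not the kmerutils API; the function names encode_base, pack_kmer and unpack_kmer and the A=0, C=1, G=2, T=3 mapping are assumptions made for this example only.

```rust
/// Map one nucleotide to a 2-bit code (assumed mapping: A=0, C=1, G=2, T=3).
/// Returns None for any other byte (e.g. N), which does not fit in 2 bits.
fn encode_base(base: u8) -> Option<u64> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a k-mer (k <= 32) into one u64, two bits per base, first base in the high bits.
fn pack_kmer(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None; // 32 bases * 2 bits is all a u64 can hold
    }
    let mut packed = 0u64;
    for &b in kmer {
        packed = (packed << 2) | encode_base(b)?;
    }
    Some(packed)
}

/// Unpack a u64 produced by pack_kmer back into k upper-case bases (k <= 32).
fn unpack_kmer(packed: u64, k: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..k)
        .rev()
        .map(|i| BASES[((packed >> (2 * i)) & 3) as usize])
        .collect()
}
```

With this layout a 31-mer occupies 8 bytes instead of the 31 bytes needed for its ASCII form, at the cost of rejecting k-mers that contain ambiguous bases.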
Install
git clone https://gitlab.com/Jianshu_Zhao/bigsig
cd bigsig
cargo build --release
Usage
************** initializing logger *****************
bigsig 0.1.0
Large-scale Sequence Search with BItsliced Genomic Signature Index (BIGSIG)
USAGE:
bigsig [SUBCOMMAND]
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
SUBCOMMANDS:
batch_identify Identify batch of samples reads
construct Construct a BIGSIG
filter filters reads
help Prints this message or the help of the given subcommand(s)
identify identify reads based on probability
query query a bigsig on one or more fasta/fastq.gz files
show show index parameters
An example of building and querying a BigSig database:
bigsig construct -r ref_file_example.txt -b test -k 31 -mv 21 -s 10000000 -n 4 -t 24
bigsig query -b ./test.mxi -q ./test_data/test.fastq.gz
bigsig identify -b test.mxi -q ./test_data/test.fastq.gz -n output -t 24 --high_mem_load
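The construct and query steps above can be pictured with a toy bit-sliced signature index in the spirit of the BIGSI and COBS papers listed under Reference. This is a simplified sketch, not the bigsig implementation: ToyIndex and its parameters (bloom_size, num_hashes, k) are hypothetical names, the standard library's DefaultHasher stands in for xxh3, and the index is limited to 64 references so that each row fits in one u64. How these parameters map to the command-line flags above is not assumed here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bit-sliced index: row i stores bit i of every reference's Bloom filter,
/// with one reference per bit of the u64 (so at most 64 references).
struct ToyIndex {
    rows: Vec<u64>,
    num_hashes: u64,
    k: usize, // must be >= 1
}

impl ToyIndex {
    fn new(bloom_size: usize, num_hashes: u64, k: usize) -> Self {
        ToyIndex { rows: vec![0; bloom_size], num_hashes, k }
    }

    /// Row positions for one k-mer; a seeded DefaultHasher replaces xxh3 in this sketch.
    fn positions(&self, kmer: &[u8]) -> Vec<usize> {
        (0..self.num_hashes)
            .map(|seed| {
                let mut h = DefaultHasher::new();
                seed.hash(&mut h);
                kmer.hash(&mut h);
                (h.finish() % self.rows.len() as u64) as usize
            })
            .collect()
    }

    /// Set the bits for every k-mer of `seq` in the column of reference `ref_id` (< 64).
    fn add_reference(&mut self, ref_id: u32, seq: &[u8]) {
        for kmer in seq.windows(self.k) {
            for pos in self.positions(kmer) {
                self.rows[pos] |= 1u64 << ref_id;
            }
        }
    }

    /// Bitmask of references that may contain `kmer`: the AND of the selected rows.
    fn query_kmer(&self, kmer: &[u8]) -> u64 {
        self.positions(kmer)
            .into_iter()
            .fold(u64::MAX, |acc, pos| acc & self.rows[pos])
    }

    /// For each of the first `n_refs` references, the fraction of the query's
    /// k-mers reported as present (false positives are possible, false negatives are not).
    fn query(&self, seq: &[u8], n_refs: u32) -> Vec<f64> {
        let mut hits = vec![0usize; n_refs as usize];
        let mut total = 0usize;
        for kmer in seq.windows(self.k) {
            total += 1;
            let mask = self.query_kmer(kmer);
            for r in 0..n_refs {
                if mask & (1u64 << r) != 0 {
                    hits[r as usize] += 1;
                }
            }
        }
        hits.iter()
            .map(|&h| if total == 0 { 0.0 } else { h as f64 / total as f64 })
            .collect()
    }
}
```

Storing the filters row-wise means a k-mer lookup touches only num_hashes rows regardless of how many references are indexed, which is the property that makes this layout attractive for large collections. The fraction computed here is over query k-mers and is only meant to show the mechanism; it is not claimed to match the statistics bigsig reports.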
Results
With the default settings, BigSig reports reference sequences that share >35% of their k-mers with the query. Here is the output of a query using SRA accession SRR4098796 (L. monocytogenes lineage I):
SRR4098796_1.fastq.gz 3076072 Listeria_monocytogenes_F2365 0.87 134.25 126 475266
SRR4098796_1.fastq.gz 3076072 Listeria_monocytogenes_SRR2167842 0.40 128.25 122 7831
The columns are, in order:
- the query;
- the number of k-mers in the query;
- the reference sequence;
- the proportion of k-mers in the reference shared with the query;
- the average coverage, based on k-mers that were uniquely matched with this reference;
- the mode of the coverage, based on uniquely matched k-mers;
- the number of uniquely matched k-mers.
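For downstream filtering, result rows like the two shown above can be parsed into a typed record. The sketch below assumes the fields are separated by whitespace and that query and reference names contain no spaces; the Hit struct, its field names, and the parse_hit and filter_hits helpers are hypothetical and not part of bigsig.

```rust
/// One row of the query output, in the column order described above.
#[derive(Debug, PartialEq)]
struct Hit {
    query: String,
    query_kmers: u64,
    reference: String,
    shared_proportion: f64,
    mean_coverage: f64,
    modal_coverage: u64,
    unique_kmers: u64,
}

/// Parse one whitespace-separated result line.
/// Returns None if a field is missing or cannot be parsed as a number.
fn parse_hit(line: &str) -> Option<Hit> {
    let mut f = line.split_whitespace();
    let hit = Hit {
        query: f.next()?.to_string(),
        query_kmers: f.next()?.parse().ok()?,
        reference: f.next()?.to_string(),
        shared_proportion: f.next()?.parse().ok()?,
        mean_coverage: f.next()?.parse().ok()?,
        modal_coverage: f.next()?.parse().ok()?,
        unique_kmers: f.next()?.parse().ok()?,
    };
    Some(hit)
}

/// Keep only hits whose shared proportion is strictly above `min_shared`,
/// silently skipping lines that do not parse.
fn filter_hits(lines: &str, min_shared: f64) -> Vec<Hit> {
    lines
        .lines()
        .filter_map(parse_hit)
        .filter(|h| h.shared_proportion > min_shared)
        .collect()
}
```

Applied to the two example rows with a cutoff of 0.5, this keeps only the Listeria_monocytogenes_F2365 hit (0.87) and drops the 0.40 hit.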
References
- Bradley, Phelim, et al. "Ultrafast search of all deposited bacterial and viral genomic data." Nature biotechnology 37.2 (2019): 152-159.
- Bingmann, Timo, et al. "COBS: a compact bit-sliced signature index." String Processing and Information Retrieval: 26th International Symposium, SPIRE 2019, Segovia, Spain, October 7–9, 2019, Proceedings 26. Springer International Publishing, 2019.
Dependencies
~6–13MB
~154K SLoC