13 releases
Version | Released |
---|---|
0.3.19 | Apr 23, 2025 |
0.3.18 | Apr 23, 2025 |
bqtools
A command-line utility for working with BINSEQ files.
Overview
bqtools provides tools to encode, decode, manipulate, and analyze BINSEQ files.
It supports both BQ (*.bq) and VBQ (*.vbq) files and makes use of the binseq library.
BINSEQ is a binary file format family designed for high-performance processing of DNA sequences. It currently has two variants: BQ and VBQ.
- BQ (*.bq): Optimized for fixed-length DNA sequences without quality scores.
- VBQ (*.vbq): Optimized for variable-length DNA sequences with optional quality scores.
Both support single and paired sequences and make use of two-bit encoding for efficient nucleotide packing using the bitnuc library.
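To make the idea of two-bit packing concrete, here is a minimal Python sketch. The mapping (A=00, C=01, G=10, T=11), the bit order, and the function names `pack` and `unpack` are assumptions chosen for illustration. They are not taken from bitnuc, whose actual layout and API may differ.

```python
# Assumed mapping for illustration only; bitnuc's real bit layout may differ.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def pack(seq: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base.

    The first base occupies the lowest two bits. Any character outside
    A/C/G/T raises KeyError, since it has no 2-bit code.
    """
    value = 0
    for i, base in enumerate(seq):
        value |= ENCODE[base] << (2 * i)
    return value


def unpack(value: int, length: int) -> str:
    """Recover `length` bases from a packed integer."""
    return "".join(DECODE[(value >> (2 * i)) & 0b11] for i in range(length))


seq = "GATTACA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq  # round trip is lossless
print(f"{seq} -> {packed:#x} ({2 * len(seq)} bits instead of {8 * len(seq)})")
```

Note that `unpack` needs the sequence length: under this mapping a trailing A is all zero bits, so the length cannot be recovered from the packed value alone. A two-bit code also has no room for N, which is why non-ATCG characters need a separate handling policy at encode time.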
For more information about BINSEQ, see our preprint where we describe the format family and its applications.
Features
- Encode: Convert FASTA or FASTQ files to a BINSEQ format
- Decode: Convert a BINSEQ file back to FASTA, FASTQ, or TSV format
- Cat: Concatenate multiple BINSEQ files
- Count: Count records in a BINSEQ file
- Grep: Search for fixed-string or regex patterns in BINSEQ files
Installation
From Cargo
bqtools can be installed using cargo, the Rust package manager:
cargo install bqtools
To install cargo, follow the instructions on the official Rust website.
From Source
# Clone the repository
git clone https://github.com/arcinstitute/bqtools.git
cd bqtools
# Install
cargo install --path .
# Check installation
bqtools --help
Usage
# Get help information
bqtools --help
# Get help for specific commands
bqtools encode --help
bqtools decode --help
bqtools cat --help
bqtools count --help
Encoding
Convert FASTA/FASTQ files to BINSEQ format:
# Encode a single file to BQ
bqtools encode input.fastq -o output.bq
# Encode a single file to VBQ
bqtools encode input.fastq -o output.vbq
# Encode paired-end reads to BQ
bqtools encode input_R1.fastq input_R2.fastq -o output.bq
# Encode paired-end reads to VBQ
bqtools encode input_R1.fastq input_R2.fastq -o output.vbq
# Specify a policy for handling non-ATCG nucleotides
bqtools encode input.fastq -o output.bq -p r # Randomly draw A/C/G/T for each N
# Use multiple threads for parallel processing
bqtools encode input.fastq -o output.bq -T 8
Available policies for handling non-ATCG nucleotides:
- i: Ignore sequences with non-ATCG characters
- p: Break on invalid sequences
- r: Randomly draw a nucleotide for each N (default)
- a: Set all Ns to A
- c: Set all Ns to C
- g: Set all Ns to G
- t: Set all Ns to T
Note: Input FASTQ files may be compressed.
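The policies listed above can be read as a small decision procedure applied to each sequence before packing. The Python sketch below is one possible interpretation, not the bqtools implementation: the function name `apply_policy` is hypothetical, input is assumed to be uppercase, "ignore" is taken to mean the record is skipped, "break" is taken to mean an error is raised, and only N is treated as a replaceable character.

```python
import random

VALID = frozenset("ACGT")


def apply_policy(seq, policy="r", rng=None):
    """Return the sequence to encode, or None if the record is skipped.

    Hypothetical helper mirroring the policy letters i, p, r, a, c, g, t.
    """
    if VALID.issuperset(seq):
        return seq  # nothing to fix
    if policy == "i":
        return None  # assumed: skip the whole record
    if policy == "p":
        raise ValueError(f"invalid sequence: {seq}")  # assumed: stop processing
    if policy == "r":
        rng = rng or random.Random()
        fixed = "".join(rng.choice("ACGT") if b == "N" else b for b in seq)
    elif policy in ("a", "c", "g", "t"):
        fixed = seq.replace("N", policy.upper())
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    # Assumption: characters other than N are not repairable in this sketch.
    if not VALID.issuperset(fixed):
        raise ValueError(f"non-ATCG characters other than N in: {seq}")
    return fixed


print(apply_policy("ACNT", "a"))  # N replaced by A
print(apply_policy("ACNT", "i"))  # record skipped
print(apply_policy("ACNT", "r", random.Random(0)))  # N replaced by a random base
```

Passing a seeded `random.Random` makes the random policy reproducible in this sketch. Whether bqtools exposes a comparable seed option is not stated here.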
Decoding
Convert BINSEQ files back to FASTA/FASTQ/TSV:
# Decode to FASTQ (default)
bqtools decode input.bq -o output.fastq
# Decode to FASTA
bqtools decode input.bq -o output.fa -f a
# Decode paired-end reads into separate files
bqtools decode input.bq --prefix output
# Creates output_R1.fastq and output_R2.fastq
# Specify which read of a pair to output
bqtools decode input.bq -o output.fastq -m 1 # Only first read
bqtools decode input.bq -o output.fastq -m 2 # Only second read
# Specify output format
bqtools decode input.bq -o output.tsv -f t # TSV format
Concatenating
Combine multiple BINSEQ files:
bqtools cat file1.bq file2.bq file3.bq -o combined.bq
Counting
Count records in a BINSEQ file:
bqtools count input.bq
Grep
Search for specific subsequences or regular expressions within BINSEQ files:
# See full options list
bqtools grep --help
# Search for a specific subsequence (in primary sequence)
bqtools grep input.bq -e "ATCG"
# Search for a regular expression (in primary)
bqtools grep input.bq -r "AT[CG]"
# Search for both a subsequence (in extended sequence) and a regular expression (in either)
bqtools grep input.bq -E "ATCG" -P "AT[CG]"
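The difference between fixed-string and regex matching, and between the primary and extended sequence of a pair, can be sketched in Python as follows. This is an illustration under stated assumptions, not the bqtools implementation: the function `record_matches` and its parameter names are hypothetical, the flag-to-parameter mapping follows the comments in the examples above, and combining several patterns with a logical AND is an assumption.

```python
import re


def record_matches(primary, extended, *, fixed_primary=None, regex_primary=None,
                   fixed_extended=None, regex_either=None):
    """Return True if a paired record satisfies every pattern that was given.

    Parameters loosely mirror the flags in the examples above:
    -e (fixed_primary), -r (regex_primary), -E (fixed_extended), -P (regex_either).
    """
    checks = []
    if fixed_primary is not None:
        checks.append(fixed_primary in primary)  # plain substring test
    if regex_primary is not None:
        checks.append(re.search(regex_primary, primary) is not None)
    if fixed_extended is not None:
        checks.append(fixed_extended in extended)
    if regex_either is not None:
        checks.append(any(re.search(regex_either, s) for s in (primary, extended)))
    # Assumed AND semantics; with no patterns given, every record matches.
    return all(checks)


# "AT[CG]" matches ATC or ATG, so it is broader than the fixed string "ATCG".
print(record_matches("GGATCGTT", "CCCATGAA", fixed_primary="ATCG"))
print(record_matches("TTTATAAA", "CCCATGAA", regex_primary="AT[CG]"))
print(record_matches("AAAA", "GATCG", fixed_extended="ATCG", regex_either="AT[CG]"))
```

In the second call the regex finds nothing in the primary sequence even though the extended sequence contains ATG, which is why it matters which sequence a flag targets.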