1 unstable release

0.1.0	Oct 11, 2023

#53 in #write-file

MIT license

145KB
1K SLoC

scispeak

a rust parser to convert sci-seq-v3 reads into kallisto compatible formats

a CLI tool to whitelist filter sci-seq-v3 reads and convert them to a 10X-style format.

Overview

This tool is used to filter sciseq reads against their respective barcode whitelists and then output fastq file formats in the style of 10X reads.

This parses the sci-seq-v3 format, identifies the cell barcodes and UMIs and writes out a new file to resemble the 10X sequence construct to be used with other tools that have not yet adopted the sci-seq format.

sci-rna-seq3 Sequencing Construct

The sci-rna-seq3 sequencing construct is organized in the following way:

            ┌─'illumina_p5:29'
            ├─'i5:10'
            ├─'truseq_read_1_adapter:33'
            │                            ┌─'hairpin_barcode:10'
            │                            ├─'hairpin_adapter:6'
            ├─read_1─────────────────────┤
            │                            ├─'umi:8'
──RNA───────┤                            └─'cell_bc:10'
            ├─'poly_T:98'
            ├─'read_2:98'
            │                            ┌─'ME:19'
            ├─i7_primer──────────────────┤
            │                            └─'s7:15'
            ├─'i7:10'
            └─'illumina_p7:24'

Visualization from seqspec.

And so the resulting R1 and R2 files boil down to:

# R1
[linker][adapter][umi][barcode]

# R2
[cDNA]

Usage

This is a single command CLI tool. It requires just the R1 and R2 filepaths

scispeak \
    -i data/SRR7827205_sample_R1.fastq.gz \
    -I data/SRR7827205_sample_R2.fastq.gz;

However, it can be accelerated using multiple compression threads:

scispeak \
    -i data/SRR7827205_sample_R1.fastq.gz \
    -I data/SRR7827205_sample_R2.fastq.gz \
    -t 8;

And can store a log file as well to keep matching statistics:

scispeak \
    -i data/SRR7827205_sample_R1.fastq.gz \
    -I data/SRR7827205_sample_R2.fastq.gz \
    -t 8 \
    -l;

Outputs

This program will output 3 files per run:

<args.prefix>_R1.fastq.gz: A fastq with the [barcode][UMI] construct for all reads passing the whitelist.
<args.prefix>_R2.fastq.gz: An unaltered fastq of the R2 for all reads passing the whitelist.
<args.prefix>_log.json: A log file containing the filtering statistics of the run.

Dependencies

~9–18MB
~191K SLoC