app splici

A tool to generate spliced and unspliced reference transcripts for sequence alignment

1 unstable release

0.1.1 Sep 20, 2023

MIT license

splici

A Rust implementation of the splici algorithm for building spliced and unspliced reference transcripts.

Overview

This implementation is written entirely in Rust and builds on three bioinformatics libraries:

  1. gtftools - For parsing GTF files
  2. bedrs - For genomic interval arithmetic
  3. faiquery - For fast querying of indexed FASTA files

Usage

splici introns \
    -f <your.fasta> \
    -g <your.gtf> \
    -o splici.fasta.gz;

This generates a splici reference FASTA from the transcripts and exons found in the GTF, querying their sequences from the indexed FASTA provided.

The FASTA must be indexed with samtools faidx beforehand.

Getting Started

You can download the latest Ensembl DNA and GTF files using ggetrs ensembl ref:

ggetrs ensembl ref -D -d dna,gtf

Unzip and index the reference DNA:

gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa

Then run splici to generate your splici reference FASTA:

splici introns \
    -f Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    -g Homo_sapiens.GRCh38.*.gtf.gz \
    -o splici.fasta.gz;

Background

The splici algorithm was described by He et al. (2022); the name is shorthand for spliced + intronic sequences.

It isolates the intronic regions of all input transcripts and generates the sequences of both the spliced transcripts and their intronic components.

The algorithm is applied to each gene individually.

First, all transcripts for a gene are identified. Then the intronic regions of those transcripts are identified; these are defined as the span of each transcript minus its exonic intervals (see internal). Next, each intronic region is extended by a parameterized amount on both ends, which allows reads to align across junctions between intronic and exonic regions. Intronic regions from different isoforms generally overlap heavily, so the regions are merged to avoid redundant intervals in the final sequences. Each merged intronic region is then given a unique name and added to the splici reference.
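The intron steps above can be sketched with plain interval arithmetic. This is a minimal illustration using only the standard library, not the crate's actual code (which relies on bedrs). The names Interval, merge, transcript_introns and gene_introns are hypothetical, intervals are assumed to be half-open and on a single chromosome, and the transcript span is assumed to run from the first exon start to the last exon end.

```rust
/// Half-open genomic interval [start, end) on a single chromosome.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Interval {
    start: u64,
    end: u64,
}

/// Merge overlapping or touching intervals into a sorted, disjoint set.
fn merge(mut ivs: Vec<Interval>) -> Vec<Interval> {
    ivs.sort();
    let mut out: Vec<Interval> = Vec::new();
    for iv in ivs {
        if let Some(last) = out.last_mut() {
            if iv.start <= last.end {
                // Overlaps (or touches) the previous interval: grow it.
                last.end = last.end.max(iv.end);
                continue;
            }
        }
        out.push(iv);
    }
    out
}

/// Introns of one transcript: the gaps between its merged exons,
/// i.e. the transcript span minus the exonic intervals.
fn transcript_introns(exons: &[Interval]) -> Vec<Interval> {
    let exons = merge(exons.to_vec());
    exons
        .windows(2)
        .map(|w| Interval { start: w[0].end, end: w[1].start })
        .collect()
}

/// Per-gene intronic regions: collect the introns of every transcript,
/// extend each by `flank` on both ends (clamped to the chromosome), then merge.
fn gene_introns(transcripts: &[Vec<Interval>], flank: u64, chrom_len: u64) -> Vec<Interval> {
    let extended: Vec<Interval> = transcripts
        .iter()
        .flat_map(|exons| transcript_introns(exons))
        .map(|iv| Interval {
            start: iv.start.saturating_sub(flank),
            end: (iv.end + flank).min(chrom_len),
        })
        .collect();
    merge(extended)
}
```

For two isoforms of one gene, one with exons [100,200), [300,400), [500,600) and one that skips the middle exon, a flank of 10 yields the single merged region [190,510). The merge step is what collapses the overlapping per-isoform introns into one interval.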

The spliced transcripts are generated by concatenating the exonic intervals for each transcript. These are named by the transcript id and added to the splici reference.
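As a rough sketch of the concatenation step, the following builds a spliced sequence from an in-memory chromosome string. It is an illustration under stated assumptions rather than the crate's implementation (which queries an indexed FASTA through faiquery): Exon, spliced_sequence and fasta_record are hypothetical names, coordinates are assumed to be 0-based and half-open, and strand handling such as reverse-complementing is omitted.

```rust
/// Half-open, 0-based exon interval [start, end) on one chromosome.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy)]
struct Exon {
    start: usize,
    end: usize,
}

/// Concatenate the exonic sequences of one transcript in genomic order.
/// Returns None if any exon falls outside the chromosome sequence.
fn spliced_sequence(chrom: &str, exons: &[Exon]) -> Option<String> {
    let mut sorted = exons.to_vec();
    sorted.sort_by_key(|e| e.start);
    let mut seq = String::new();
    for e in &sorted {
        // `str::get` returns None for out-of-range or inverted intervals.
        seq.push_str(chrom.get(e.start..e.end)?);
    }
    Some(seq)
}

/// Format one FASTA record, named by the transcript id.
fn fasta_record(transcript_id: &str, seq: &str) -> String {
    format!(">{transcript_id}\n{seq}\n")
}
```

With the toy chromosome "AAACCCGGGTTT" and exons [0,3) and [6,9), spliced_sequence returns "AAAGGG", which fasta_record can then emit under a transcript id such as the placeholder "tx1".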

References

  1. He, D. et al. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19, 316–322 (2022).
