app fastlin

an ultra-fast program for MTBC lineage typing

3 unstable releases

0.2.1 Jul 17, 2023
0.2.0 Jul 16, 2023
0.1.0 Jun 26, 2023

MIT/Apache

fastlin

Overview

Fastlin is an ultra-fast program for lineage typing of Mycobacterium tuberculosis complex (MTBC) fastq samples. Using a kmer-based approach, it can accurately predict MTBC lineages and strain mixtures in seconds.

Reference: TBA

Installation

To install fastlin via cargo, you must have the Rust toolchain installed.

cargo install fastlin

Alternatively, you can copy the code from this repository and install it from its root directory with this command:

cargo install --path .

You will also need a barcode file (see Input files below).

Running fastlin

The default command line is:

fastlin -d /path/directory_fastq_files -b barcodes_file.txt

If your dataset does not contain any BAM-derived fastq files, we recommend applying a maximum kmer coverage threshold to reduce runtimes:

fastlin -d /path/directory_fastq_files -b barcodes_file.txt -x 80
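When fastlin is called from a pipeline script, the two command lines above can be assembled programmatically. The following is a minimal Python sketch: the helper name `fastlin_command` and its parameter names are hypothetical, and only the `-d`, `-b` and `-x` flags shown above are used.

```python
import shlex

def fastlin_command(fastq_dir, barcodes, max_kmer_cov=None):
    """Build the fastlin argument list (hypothetical helper, not part of fastlin)."""
    cmd = ["fastlin", "-d", str(fastq_dir), "-b", str(barcodes)]
    if max_kmer_cov is not None:
        # Optional maximum kmer coverage threshold, as in the second command above.
        cmd += ["-x", str(max_kmer_cov)]
    return cmd

# Print the command as it would be typed in a shell.
print(shlex.join(fastlin_command("/path/directory_fastq_files", "barcodes_file.txt", 80)))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` once fastlin is installed and on the PATH.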

Input files

Fastlin takes as input the path of the directory containing the fastq files. The fastq files should be compressed, with the extension '.fastq.gz' or '.fq.gz'. Names of paired-end files should be in the form 'name_1.fq.gz' and 'name_2.fq.gz'. The directory can contain both paired-end and single-end fastq files.
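To check in advance how the files in a directory would pair up under this naming convention, a small Python sketch can help. The grouping logic here is an illustrative assumption based only on the rules stated above; it is not fastlin's own code, and the names `group_fastq` and `layout` are hypothetical.

```python
import re
from collections import defaultdict

# Accepts '.fastq.gz' or '.fq.gz', with an optional '_1' / '_2' mate suffix.
# The pattern is an assumption derived from the naming rules in the text.
FASTQ_RE = re.compile(r"^(?P<name>.+?)(?:_(?P<mate>[12]))?\.(?:fastq|fq)\.gz$")

def group_fastq(filenames):
    """Map each sample name to the sorted list of its fastq files."""
    samples = defaultdict(list)
    for fn in filenames:
        m = FASTQ_RE.match(fn)
        if m is None:
            continue  # skip anything that is not a compressed fastq file
        samples[m.group("name")].append(fn)
    return {name: sorted(files) for name, files in samples.items()}

def layout(files):
    """Two files for one sample are treated as paired-end, otherwise single-end."""
    return "paired" if len(files) == 2 else "single"

groups = group_fastq(["ERR1_1.fq.gz", "ERR1_2.fq.gz", "ERR2.fastq.gz", "notes.txt"])
for name, files in groups.items():
    print(name, layout(files), files)
```

In practice the list of file names would come from `os.listdir` on the directory passed to `-d`.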

The MTBC barcode file can be downloaded from https://www.github.com/rderelle/barcodes-fastlin. Alternatively, you can build and test your own kmer barcodes using the Python scripts available in that repository.

Output file

Fastlin output consists of a tab-delimited file with the following fields:

  • sample: sample name
  • nb_files: 'single' or 'paired'-end files
  • k_cov: theoretical kmer coverage of the fastq file(s), based on the number of extracted kmers
  • mixture: pure ('no') or mixed ('yes') sample
  • lineages: lineages detected in the sample, with their kmer occurrences in parentheses
  • log_barcodes: kmer barcodes passing the minimum occurrence threshold, indicated by their kmer occurrences and grouped by lineage

Here is a simple example:

#sample    nb_files    k_cov    mixture    lineages    log_barcodes
ERRxxxxx    paired    118    no    2 (45)    2 (42, 48, 39, 43, 54, 47, 45), 4.1 (4)

The sample ERRxxxxx contains a single strain belonging to lineage 2. This typing is supported by 7 kmer barcodes, with a median of 45 occurrences. Since the abundance of the strain is far below the theoretical kmer coverage (118 here), we can conclude that the sample likely contains a high level of contamination or sequencing errors.
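The numbers in this example can be reproduced with a short Python sketch that parses one output row and recomputes the median barcode occurrence for lineage 2. The parsing helpers (`parse_row`, `parse_log_barcodes`) are hypothetical and assume the six tab-separated fields listed above, with log_barcodes formatted as in the example row.

```python
import re
from statistics import median

FIELDS = ["sample", "nb_files", "k_cov", "mixture", "lineages", "log_barcodes"]

def parse_row(line):
    """Split one tab-delimited output row into a dict keyed by field name."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

def parse_log_barcodes(text):
    """Turn '2 (42, 48), 4.1 (4)' into {'2': [42, 48], '4.1': [4]}."""
    return {
        lineage: [int(n) for n in counts.split(",")]
        for lineage, counts in re.findall(r"([^\s,()]+)\s*\(([^)]*)\)", text)
    }

row = parse_row(
    "ERRxxxxx\tpaired\t118\tno\t2 (45)\t2 (42, 48, 39, 43, 54, 47, 45), 4.1 (4)"
)
barcodes = parse_log_barcodes(row["log_barcodes"])
support = barcodes["2"]                 # occurrences of the lineage 2 barcodes
abundance = median(support)             # median barcode occurrence
ratio = abundance / int(row["k_cov"])   # abundance relative to theoretical coverage

print(len(support), abundance, round(ratio, 2))  # prints "7 45 0.38"
```

A ratio well below 1, as here, corresponds to the situation described above, where the strain abundance falls far short of the theoretical kmer coverage.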

TO DO LIST

  • multi-threading
  • possibility to analyse FASTA files (genome assemblies)
