app isONclust3

Rust implementation of a novel de novo clustering algorithm. isONclust3 clusters either PacBio Iso-Seq reads or Oxford Nanopore reads so that each cluster contains all reads that came from the same gene family. The output is a tsv file that assigns each read to a cluster ID, plus a folder 'fastq' containing one fastq file per generated cluster. Detailed information is available in the isONclust3 paper.

1 unstable release

0.0.2 Nov 26, 2024

MIT license

Table of contents

  1. Installation
  2. Running isONclust3
  3. Output
  4. Contact
  5. Credits

Installation Guide

At the moment, building from source is the only way to install the tool. This requires the Rust toolchain to be installed on your system.

Installing Rust

On macOS, Linux, or another Unix-based OS, you can install Rust via:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

On Windows, please follow the instructions at https://forge.rust-lang.org/infra/other-installation-methods.html.

Installation

Clone the repository, then use the following two commands to compile the code:

git clone https://github.com/aljpetri/isONclust3.git
cd isONclust3
cargo build --release

cargo build --release compiles the current package; the executable is then located in target/release.

Running isONclust3

isONclust3 can be run on either PacBio data or ONT data.

isONclust3 --fastq {input.fastq} --mode ont  --outfolder {outfolder}         # Oxford Nanopore reads
isONclust3 --fastq {input.fastq} --mode pacbio  --outfolder {outfolder}      # PacBio reads

The --mode ont argument is equivalent to setting --k 13 --w 21. The --mode pacbio argument is equivalent to setting --k 15 --w 51.
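When isONclust3 is called from a script, the two invocations above can be assembled programmatically. The following is a minimal Python sketch, not part of isONclust3 itself: the helper names build_command, preset_flags and run_isonclust3 are hypothetical, and it assumes the isONclust3 executable is on the PATH. Only the flags shown above (--fastq, --mode, --outfolder, --k, --w) are used.

```python
import subprocess

# Presets as stated above: --mode ont    == --k 13 --w 21
#                          --mode pacbio == --k 15 --w 51
MODE_PRESETS = {"ont": (13, 21), "pacbio": (15, 51)}


def preset_flags(mode):
    """Spell out the --k/--w values that a mode stands for (hypothetical helper)."""
    k, w = MODE_PRESETS[mode]
    return ["--k", str(k), "--w", str(w)]


def build_command(fastq, outfolder, mode, executable="isONclust3"):
    """Return the argument list for one clustering run (hypothetical helper)."""
    if mode not in MODE_PRESETS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_PRESETS)}")
    return [executable, "--fastq", str(fastq), "--mode", mode, "--outfolder", str(outfolder)]


def run_isonclust3(fastq, outfolder, mode):
    """Run the tool and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(fastq, outfolder, mode), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.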

Output

Clustering information

The clustering result is written to a tsv file, final_clusters.tsv, in the specified output folder. In this file, the first column is the cluster ID and the second column is the read accession. For example:

0 read_X_acc
0 read_Y_acc
...
n read_Z_acc

Each read appears on exactly one row, so the file has as many rows as there are reads. Some reads might be singletons, i.e. the only member of their cluster.
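For downstream analysis, the two-column layout described above can be loaded into a mapping from cluster ID to read accessions. The following Python sketch is an illustration only: the function names are hypothetical, and it assumes the two columns are separated by whitespace (a tab in a tsv file) and that the cluster ID itself contains no whitespace.

```python
from collections import defaultdict


def parse_clusters(lines):
    """Map each cluster ID to the list of read accessions assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # First column: cluster ID, second column: read accession.
        cluster_id, read_acc = line.split(maxsplit=1)
        clusters[cluster_id].append(read_acc)
    return dict(clusters)


def singletons(clusters):
    """Return the reads that are the only member of their cluster."""
    return [reads[0] for reads in clusters.values() if len(reads) == 1]


# Made-up rows in the format shown above.
example = ["0\tread_X_acc", "0\tread_Y_acc", "1\tread_Z_acc"]
clusters = parse_clusters(example)
print(len(clusters), "clusters;", len(singletons(clusters)), "singleton(s)")
```

To parse a real result, pass an open file handle for final_clusters.tsv in your outfolder to parse_clusters instead of the example list.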

Clusters

isONclust3 also writes the reads in .fastq file format, with each file containing the reads of the respective cluster. The .fastq files are located in the fastq_files directory that is created in the given outfolder.
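A quick sanity check on this directory is to count the reads in each cluster file. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, every FASTQ record is assumed to span exactly four non-empty lines (no wrapped sequences), and the files are assumed to carry the .fastq extension. No assumption is made about how the files are named.

```python
from pathlib import Path


def count_fastq_records(lines):
    """Count records, assuming four non-empty lines per FASTQ record."""
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4; truncated or wrapped FASTQ?")
    return n_lines // 4


def reads_per_cluster_file(outfolder):
    """Return {file name: number of reads} for every .fastq file in fastq_files."""
    fastq_dir = Path(outfolder) / "fastq_files"
    counts = {}
    for path in sorted(fastq_dir.glob("*.fastq")):
        with path.open() as handle:
            counts[path.name] = count_fastq_records(handle)
    return counts
```

The sum of the counts returned by reads_per_cluster_file can be compared with the number of rows in final_clusters.tsv to check that both outputs describe the same set of reads.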

Contact

If you encounter any problems, please raise an issue on the issues page. You can also contact the developer of this repository via: alexander.petri[at]math.su.se

Credits
