9 releases

0.2.3 May 22, 2024
0.2.2 Dec 24, 2023
0.2.1 Aug 20, 2023
0.2.0 Oct 8, 2022
0.1.3 Jun 26, 2022

#86 in Biology


Used in atg

MIT license

365KB
7K SLoC

Build crates.io doc-rs

ATGlib

ATGlib is a Rust library to work with genomic transcript data. It handles several file formats, such as GTF, GenePred(ext) and Refgene. You can generate bed files, fasta sequences or custom feature sequences.

ATGlib is useful for Bioinformaticians and Geneticists. It can be used for all kind of data processing workflows that must handle transcript annotations. The main use case are conversions between various file formats and sanity checks for large scale QC. It is primarily targeted for human genome, but should work equally well with all other genomes.

If you are looking for an actual application, or a command line tool to work with transcripts, GTF files etc, use ATG instead. It is using ATGlib behind the scenes and provides a simple to use interface.

Documentation

The library API is mostly documented inline and available on docs.rs

Examples

For examples on how to use ATGlib on a high-level, you can check the source code of the CLI tool ATG.

Convert GTF to RefGene

use atglib::gtf::Reader;
use atglib::refgene::Writer;
use atglib::models::{TranscriptRead, TranscriptWrite};

let mut reader = Reader::from_file("tests/data/example.gtf")
    .unwrap_or_else(|_| panic!("Error opening input file."));

let mut writer = Writer::from_file("/dev/null")
    .unwrap_or_else(|_| panic!("Unable to open output file"));

let transcripts = reader.transcripts()
    .unwrap_or_else(|err| panic!("Error parsing GTF: {}", err));

match writer.write_transcripts(&transcripts) {
    Ok(_) => println!("Success"),
    Err(err) => panic!("Error writing RefGene file: {}", err)
};

Work with transcripts directly

use atglib::tests::transcripts::standard_transcript;
let tx = standard_transcript();

println!(
    "The {} transcript {} is on the {} strand on chromsome {} and has {} exons",
    tx.gene(),
    tx.name(),
    tx.strand(),
    tx.chrom(),
    tx.exon_count()
);

if tx.is_coding() {
    println!(
        "The start and end positions of the CDS are: {}-{}",
        tx.cds_start().unwrap(),
        tx.cds_end().unwrap()
    );
} else {
    println!(
        "The transcript is non-coding and is located at {}:{}-{}",
        tx.chrom(),
        tx.tx_start(),
        tx.tx_end()
    );
}

Limitations

I started ATGlib as a private side project to learn Rust. I'm sure there are many parts of the code that are not idiomatic and can be improved. I'd be more than happy to receive feedback and suggestions for improvement. I also encourage everyone, who is interested, to help and contribute the ATGlib.

When it comes to functional correctness, I try my best to test the functionality on all levels. ATGlib has a very good test coverage and I also run manual checks before every release with a huge test-set. However, I cannot guarantee that everything is correct, so please use it at your own risk and report any bugs and issues via Github.

ToDo / Next tasks

Known issues

GTF parsing

  • NM_001371720.1 has two book-ended exons (155160639-155161619 || 155161620-155162101). During input parsing, book-ended features are merged into one exon

Dependencies

~0.4–1MB
~23K SLoC