14 releases (7 breaking)

0.8.5 Apr 8, 2024
0.8.4 Dec 24, 2023
0.8.3 Aug 20, 2023
0.8.2 Jul 4, 2023
0.2.0 Nov 18, 2021

#11 in Biology

MIT license

99KB
667 lines

ATG

Convert your genomic reference data between formats with a single tool. ATG handles the conversion from and to GTF, GenePred(ext) and Refgene. You can generate bed files, fasta sequences or custom feature sequences. A single tool for all your conversion.

File format Can be used as source Can be created
GTF Yes Yes
GenePred (extended) Yes Yes
RefGene Yes Yes
GenePred (simple) No Yes
Bed No Yes
Fasta No Yes (multiple options)
SpliceAI gene annotation No Yes
Quality Checks No Yes

Reasons to use ATG

  • No need to maintain multiple tools for one-way conversions (gtfToGenePred, genePredToGtf, etc). ATG handles many formats and can convert in both directions.
  • Speed: ATG is really fast - almost twice as fast as gtfToGenePred.
  • Robust parser: It handles GTF, GenePred with all extras according to spec.
  • Low memory footprint: It also runs on machines with little RAM.
  • Extra features, such as quality control and correctness checks.
  • Open for contributions: Every help is welcome improve ATG or to add more functionality.
  • You can also use ATG as a library for your own Rust projects.

ATG command line tool

Install

There are currently 3 different options how to install ATG:

cargo

The easiest way to install ATG is to use cargo (if you have cargo and rust installed)

cargo install atg
Pre-built binaries

You can download pre-built binaries for Linux and Mac from Github.

From source

You can also build ATG from source (if you have the rust toolchains installed):

git clone https://github.com/anergictcell/atg.git
cd atg
cargo build --release

Usage

The main CLI arguments are

  • -f, --from: Specify the file format of the source (e.g. gtf, genepredext, refgene)
  • -t, --to: Specify the target file format (e.g. gtf, genepred, bed, fasta etc)
  • -i, --input: Path to source file. (Use /dev/stdin if you are using atg in a pipe)
  • -o, --output: Path to target file. Existing files will be overwritten. (Use /dev/stdout if you are using atg in a pipe)
  • -v, -vv, -vvv: Verbosity (info, debug, trace)
  • -h, --help: Print the help dialog with detailed usage instructions.

Additional, optional arguments:

  • -g, --gtf-source: Specify the source for GTF output files. Defaults to atg
  • -r, --reference: Path of a reference genome fasta file. Required for fasta output
  • -c, --genetic-code: Specify which genetic code to use for translating the transcripts. Genetic codes can be specified per chromosome by specifying the chromsome and the code, separated by : (e.g. -c chrM:vertebrate mitochondrial). They can also be specified for all chromsomes by omitting the chromosome (e.g. -c vertebrate mitochondrial). The argument can be specified multiple times (e.g: -c "standard" -c "chrM:vertebrate mitochondrial" -c "chrAYN:alternative yeast nuclear"). The code names are based on the name field from the NCBI specs but all lowercase characters. Alternatively, you can also specify the amino acid lookup table directly: -c "chrM:FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG". Defaults to standard.
  • -q, --qc-check: Specify QC-checks for removing transcripts from the output

Examples:

## Convert a GTF file to a RefGene file
atg --from gtf --to refgene --input /path/to/input.gtf --output /path/to/output.refgene

## Convert a GTF file to a GenePred file
atg --from gtf --to genepred --input /path/to/input.gtf --output /path/to/output.genepred

## Convert a GTF file to a GenePredExt file
atg --from gtf --to genepredext --input /path/to/input.gtf --output /path/to/output.genepredext

## Convert RefGene to GTF
atg --from refgene --to gtf --input /path/to/input.refgene --output /path/to/output.gtf

## Convert RefGene to bed
atg --from refgene --to bed --input /path/to/input.refgene --output /path/to/output.bed


## Convert a GTF file to a RefGene file, remove all transcript without proper start and stop codons
atg --from gtf --to refgene --input /path/to/input.gtf --output /path/to/output.refgene --qc-check start --qc-check stop --reference /path/to/fasta.fa

Supported --output formats

gtf

Output in GTF format.

chr9    ncbiRefSeq.2021-05-17   transcript  74526555    74600974    .   +   .   gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9    ncbiRefSeq.2021-05-17   exon        74526555    74526752    .   +   .   gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9    ncbiRefSeq.2021-05-17   5UTR        74526555    74526650    .   +   .   gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9    ncbiRefSeq.2021-05-17   CDS         74526651    74526752    .   +   0   gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9    ncbiRefSeq.2021-05-17   exon        74561922    74562028    .   +   .   gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9    ncbiRefSeq.2021-05-17   CDS         74561922    74562026    .   +   0   gene_id "C9orf85"; transcript_id "NM_001365057.2";
...

You can specify the value of the source column manually using the --gtf-source/-g option. Defaults to atg

refgene

Output in the refGene format, as used by some UCSC and NCBI RefSeq services

0   NM_001101.5     chr7    -   5566778     5570232     5567378     5569288    6   5566778,5567634,5567911,5568791,5569165,5570154,    5567522,5567816,5568350,5569031,5569294,5570232,    0   ACTB    cmpl    cmpl    0,1,0,0,0,-1,
0   NM_001203247.2  chr7    -   148504474   148581383   148504737   148544390  20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
0   NM_001203248.2  chr7    -   148504474   148581383   148504737   148544390  20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
0   NM_001354750.2  chr11   +   113930432   114127487   113934022   114121277  7   113930432,113933932,114027058,114057673,114112888,114117919,114121047,  113930864,113935290,114027156,114057760,114113059,114118087,114127487,  0   ZBTB16  cmpl    cmpl    -1,0,2,1,1,1,1,

genepred(ext)

Output in the GenePred(Ext) format, as used by some UCSC and NCBI RefSeq services

GenePred:

NM_001101.5     chr7    -   5566778     5570232     5567378     5569288     6   5566778,5567634,5567911,5568791,5569165,5570154,    5567522,5567816,5568350,5569031,5569294,5570232,
NM_001203247.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,
NM_001203248.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,
NM_001354750.2  chr11   +   113930432   114127487   113934022   114121277   7   113930432,113933932,114027058,114057673,114112888,114117919,114121047,  113930864,113935290,114027156,114057760,114113059,114118087,114127487,

GenePredExt

NM_001101.5     chr7    -   5566778     5570232     5567378     5569288     6   5566778,5567634,5567911,5568791,5569165,5570154,    5567522,5567816,5568350,5569031,5569294,5570232,    0   ACTB    cmpl    cmpl    0,1,0,0,0,-1,
NM_001203247.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
NM_001203248.2  chr7    -   148504474   148581383   148504737   148544390   20  148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255,    148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,    0   EZH2    cmpl    cmpl    2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
NM_001354750.2  chr11   +   113930432   114127487   113934022   114121277   7   113930432,113933932,114027058,114057673,114112888,114117919,114121047,  113930864,113935290,114027156,114057760,114113059,114118087,114127487,  0   ZBTB16  cmpl    cmpl    -1,0,2,1,1,1,1,

bed

Output in bed format.

chr7    5566778     5570232     ACTB:NM_001101.5       -   5567378    5569288    212,16,48   6   744,182,439,240,129,78  0,856,1133,2013,2387,3376
chr11   113930432   114127487   ZBTB16:NM_001354750.2  +   113934022  114121277  212,16,48   7   432,1358,98,87,171,168,6440 0,3500,96626,127241,182456,187487,190615
chr17   40852292    40897058    EZH1:NM_001321082.2    -   40854549   40880959   212,16,48   20  2318,85,81,82,96,179,126,41,92,197,181,92,164,103,177,121,129,128,91,30 0,2602,3465,4327,4813,5732,7683,8601,9571,12014,12934,17701,18179,18830,19998,22520,27360,28550,30553,44736

fasta

Writes the cDNA sequence of all transcripts into one file. Please note that the sequence is stranded.

This target format requires a reference genome fasta file that must be specified using --reference/-r.

This output allows different --fasta-format options:

  • transcript: The full transcript sequence (from the genomic start to end position, including introns)
  • exons: The cDNA sequence of the processed transcript, i.e. the sequence of all exons, including non-coding exons.
  • cds (default): The CDS of the transcript
>NM_007298.3 BRCA1
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGC
TATGCAGAAAATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAAC
CTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTGCATGCTGAAA
CTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGA
TATAACCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTG
...
>NM_001365057.2 C9orf85
ATGAGCTCCCAGAAAGGCAACGTGGCTCGTTCCAGACCTCAGAAGCACCA
GAATACGTTTAGCTTCAAAAATGACAAGTTCGATAAAAGTGTGCAGACCA
AGAAAATTAATGCAAAACTTCATGATGGAGTATGTCAGCGCTGTAAAGAA
GTTCTTGAGTGGCGTGTAAAATACAGCAAATACAAACCATTATCAAAACC
TAAAAAGTGA
...

fasta-split

Like fasta above, but one file for each transcript. Instead of an output file, you must specify an output directory, ATG will save each transcript as <Transcript_name>.fasta, e.g.: NM_001365057.2.fasta.

This target format requires a reference genome fasta file that must be specified using --reference/-r.

This output allows different --fasta-format options:

  • transcript: The full transcript sequence (from the genomic start to end position, including introns)
  • exons: The cDNA sequence of the processed transcript, i.e. the sequence of all exons, including non-coding exons.
  • cds (default): The CDS of the transcript

feature-sequence

cDNA sequence of each feature (5' UTR, CDS, 3'UTR), each in a separate row.

This target format requires a reference genome fasta file that must be specified using --reference/-r.

BRCA1   NM_007298.3     chr17   41196311    41197694    -   3UTR    CTGCAGCCAGCCAC...
BRCA1   NM_007298.3     chr17   41197694    41197819    -   CDS     CAATTGGGCAGATGTGTG...
BRCA1   NM_007298.3     chr17   41199659    41199720    -   CDS     GGTGTCCACCCAATTGTG...
BRCA1   NM_007298.3     chr17   41201137    41201211    -   CDS     ATCAACTGGAATGGATGG...
BRCA1   NM_007298.3     chr17   41203079    41203134    -   CDS     ATCTTCAGGGGGCTAGAA...
BRCA1   NM_007298.3     chr17   41209068    41209152    -   CDS     CATGATTTTGAAGTCAGA...
BRCA1   NM_007298.3     chr17   41215349    41215390    -   CDS     GGGTGACCCAGTCTATTA...
BRCA1   NM_007298.3     chr17   41215890    41215968    -   CDS     ATGCTGAGTTTGTGTGTG...
BRCA1   NM_007298.3     chr17   41219624    41219712    -   CDS     ATGCTCGTGTACAAGTTT...
BRCA1   NM_007298.3     chr17   41222944    41223255    -   CDS     AGGGAACCCCTTACCTGG...
C9orf85 NM_001365057.2  chr9    74526555    74526650    +   5UTR    ATTGACAGAA...
C9orf85 NM_001365057.2  chr9    74526651    74526752    +   CDS     ATGAGCTCCCAGAA...
C9orf85 NM_001365057.2  chr9    74561922    74562028    +   CDS     AAAATTAATGCAAA...
C9orf85 NM_001365057.2  chr9    74597573    74597573    +   CDS     A
C9orf85 NM_001365057.2  chr9    74597574    74600974    +   3UTR    TGGAGTCTCC...

spliceai

This is a custom format useful for SpliceAI splice predictions. The repo lists example files. The output has one gene per row, each gene record contains a consensus transcript, created by merging overlapping exons.

#NAME       CHROM   STRAND  TX_START    TX_END  EXON_START      EXON_END
OR4F5       1       +       69090       70008   69090,          70008,
AL627309.1  1       -       134900      139379  134900,137620,  135802,139379,

qc

Runs some basic consistency checks on the transcripts:

QC check Explanation Non-Coding vs Coding requires Fasta File
Exon Contains at least one exon all no
Correct CDS Length The length of the CDS is divisible by 3 Coding no
Correct Start Codon The CDS starts with ATG Coding yes
Correct Stop Codon The CDS ends with a Stop codon TAG, TAA, or TGA Coding yes
No upstream Start Codon The 5'UTR does not contain another start codon ATG (This test do not make sense biologically. It is totally fine for a transcript to have upstream ATG start cordons that are not utilized but the ribosome.) Coding yes
No upstream Stop Codon The CDS does not contain another in-frame stop-codon Coding yes
No Start codon The full exon sequence does not contain a start codon ATG (Biologically speaking, a non-coding transcript could have ATG start codons that are not utilized) Non-Coding yes
Correct Coordinates The transcript is within the coordinates of the reference genome all yes

Test results:

  • NA Test could not be performed (e.g. CDS-length for non-coding transcripts), so no conclusion could be drawn
  • OK The test succeeded with an OK results
  • NOK The test failed and gave a NOT OK result
Gene     transcript     Exon  CDS Length  Correct Start Codon  Correct Stop Codon  No upstream Start Codon  No upstream Stop Codon  Correct Coordinates
FAM239A  NR_146581.1    OK    N/A         N/A                  N/A                 OK                       N/A                     OK
OR5H2    NM_001005482.1 OK    OK          OK                   OK                  OK                       OK                      OK
SNX20    NM_001144972.2 OK    OK          OK                   OK                  NOK                      OK                      OK

raw

This is mainly useful for debugging, as it gives a quick glimpse into the Exons and CDS coordinates of the transcripts.

bin

Save Transcripts in ATG binary format for faster re-reading.

ATG as library

ATG uses the atglib library, which is documented inline and available on docs.rs

Known issues

GTF parsing

  • NM_001371720.1 has two book-ended exons (155160639-155161619 || 155161620-155162101). During input parsing, book-ended features are merged into one exon

Dependencies

~35MB
~493K SLoC