15 unstable releases
0.8.6 | May 22, 2024 |
---|---|
0.8.4 | Dec 24, 2023 |
0.8.3 | Aug 20, 2023 |
0.8.2 | Jul 4, 2023 |
0.2.0 | Nov 18, 2021 |
#5 in Biology
1,054 downloads per month
110KB
769 lines
ATG
Convert your genomic reference data between formats with a single tool. ATG handles the conversion from and to GTF, GenePred(ext) and Refgene. You can generate bed files, fasta sequences or custom feature sequences. A single tool for all your conversion.
File format | Can be used as source | Can be created |
---|---|---|
GTF | Yes | Yes |
GenePred (extended) | Yes | Yes |
RefGene | Yes | Yes |
GenePred (simple) | No | Yes |
Bed | No | Yes |
Fasta | No | Yes (multiple options) |
SpliceAI gene annotation | No | Yes |
Quality Checks | No | Yes |
Reasons to use ATG
- No need to maintain multiple tools for one-way conversions (
gtfToGenePred
,genePredToGtf
, etc). ATG handles many formats and can convert in both directions. - Speed: ATG is really fast - almost twice as fast as
gtfToGenePred
. - Robust parser: It handles GTF, GenePred with all extras according to spec.
- Low memory footprint: It also runs on machines with little RAM.
- Extra features, such as quality control and correctness checks.
- Open for contributions: Every help is welcome improve ATG or to add more functionality.
- You can also use ATG as a library for your own Rust projects.
ATG command line tool
Install
There are currently 3 different options how to install ATG:
cargo
The easiest way to install ATG is to use cargo
(if you have cargo
and rust
installed)
cargo install atg
Pre-built binaries
You can download pre-built binaries for Linux and Mac from Github.
From source
You can also build ATG from source (if you have the rust toolchains installed):
git clone https://github.com/anergictcell/atg.git
cd atg
cargo build --release
Usage
The main CLI arguments are
-f
,--from
: Specify the file format of the source (e.g.gtf
,genepredext
,refgene
)-t
,--to
: Specify the target file format (e.g.gtf
,genepred
,bed
,fasta
etc)-i
,--input
: Path to source file. (Use/dev/stdin
if you are using atg in a pipe)-o
,--output
: Path to target file. Existing files will be overwritten. (Use/dev/stdout
if you are using atg in a pipe)-v
,-vv
,-vvv
: Verbosity (info, debug, trace)-h
,--help
: Print the help dialog with detailed usage instructions.
Additional, optional arguments:
-g
,--gtf-source
: Specify the source for GTF output files. Defaults toatg
-r
,--reference
: Path of a reference genome fasta file. Required for fasta output-c
,--genetic-code
: Specify which genetic code to use for translating the transcripts. Genetic codes can be specified per chromosome by specifying the chromsome and the code, separated by:
(e.g.-c chrM:vertebrate mitochondrial
). They can also be specified for all chromsomes by omitting the chromosome (e.g.-c vertebrate mitochondrial
). The argument can be specified multiple times (e.g:-c "standard" -c "chrM:vertebrate mitochondrial" -c "chrAYN:alternative yeast nuclear"
). The code names are based on thename
field from the NCBI specs but all lowercase characters. Alternatively, you can also specify the amino acid lookup table directly:-c "chrM:FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
. Defaults tostandard
.-q
,--qc-check
: Specify QC-checks for removing transcripts from the output
Examples:
## Convert a GTF file to a RefGene file
atg --from gtf --to refgene --input /path/to/input.gtf --output /path/to/output.refgene
## Convert a GTF file to a GenePred file
atg --from gtf --to genepred --input /path/to/input.gtf --output /path/to/output.genepred
## Convert a GTF file to a GenePredExt file
atg --from gtf --to genepredext --input /path/to/input.gtf --output /path/to/output.genepredext
## Convert RefGene to GTF
atg --from refgene --to gtf --input /path/to/input.refgene --output /path/to/output.gtf
## Convert RefGene to bed
atg --from refgene --to bed --input /path/to/input.refgene --output /path/to/output.bed
## Convert a GTF file to a RefGene file, remove all transcript without proper start and stop codons
atg --from gtf --to refgene --input /path/to/input.gtf --output /path/to/output.refgene --qc-check start --qc-check stop --reference /path/to/fasta.fa
Supported --output
formats
gtf
Output in GTF format.
chr9 ncbiRefSeq.2021-05-17 transcript 74526555 74600974 . + . gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9 ncbiRefSeq.2021-05-17 exon 74526555 74526752 . + . gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9 ncbiRefSeq.2021-05-17 5UTR 74526555 74526650 . + . gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9 ncbiRefSeq.2021-05-17 CDS 74526651 74526752 . + 0 gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9 ncbiRefSeq.2021-05-17 exon 74561922 74562028 . + . gene_id "C9orf85"; transcript_id "NM_001365057.2";
chr9 ncbiRefSeq.2021-05-17 CDS 74561922 74562026 . + 0 gene_id "C9orf85"; transcript_id "NM_001365057.2";
...
You can specify the value of the source
column manually using the --gtf-source
/-g
option. Defaults to atg
refgene
Output in the refGene format, as used by some UCSC and NCBI RefSeq services
0 NM_001101.5 chr7 - 5566778 5570232 5567378 5569288 6 5566778,5567634,5567911,5568791,5569165,5570154, 5567522,5567816,5568350,5569031,5569294,5570232, 0 ACTB cmpl cmpl 0,1,0,0,0,-1,
0 NM_001203247.2 chr7 - 148504474 148581383 148504737 148544390 20 148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255, 148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383, 0 EZH2 cmpl cmpl 2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
0 NM_001203248.2 chr7 - 148504474 148581383 148504737 148544390 20 148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255, 148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383, 0 EZH2 cmpl cmpl 2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
0 NM_001354750.2 chr11 + 113930432 114127487 113934022 114121277 7 113930432,113933932,114027058,114057673,114112888,114117919,114121047, 113930864,113935290,114027156,114057760,114113059,114118087,114127487, 0 ZBTB16 cmpl cmpl -1,0,2,1,1,1,1,
genepred(ext)
Output in the GenePred(Ext) format, as used by some UCSC and NCBI RefSeq services
GenePred:
NM_001101.5 chr7 - 5566778 5570232 5567378 5569288 6 5566778,5567634,5567911,5568791,5569165,5570154, 5567522,5567816,5568350,5569031,5569294,5570232,
NM_001203247.2 chr7 - 148504474 148581383 148504737 148544390 20 148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255, 148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,
NM_001203248.2 chr7 - 148504474 148581383 148504737 148544390 20 148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255, 148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383,
NM_001354750.2 chr11 + 113930432 114127487 113934022 114121277 7 113930432,113933932,114027058,114057673,114112888,114117919,114121047, 113930864,113935290,114027156,114057760,114113059,114118087,114127487,
GenePredExt
NM_001101.5 chr7 - 5566778 5570232 5567378 5569288 6 5566778,5567634,5567911,5568791,5569165,5570154, 5567522,5567816,5568350,5569031,5569294,5570232, 0 ACTB cmpl cmpl 0,1,0,0,0,-1,
NM_001203247.2 chr7 - 148504474 148581383 148504737 148544390 20 148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255, 148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383, 0 EZH2 cmpl cmpl 2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
NM_001203248.2 chr7 - 148504474 148581383 148504737 148544390 20 148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255, 148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383, 0 EZH2 cmpl cmpl 2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
NM_001354750.2 chr11 + 113930432 114127487 113934022 114121277 7 113930432,113933932,114027058,114057673,114112888,114117919,114121047, 113930864,113935290,114027156,114057760,114113059,114118087,114127487, 0 ZBTB16 cmpl cmpl -1,0,2,1,1,1,1,
bed
Output in bed format.
chr7 5566778 5570232 ACTB:NM_001101.5 - 5567378 5569288 212,16,48 6 744,182,439,240,129,78 0,856,1133,2013,2387,3376
chr11 113930432 114127487 ZBTB16:NM_001354750.2 + 113934022 114121277 212,16,48 7 432,1358,98,87,171,168,6440 0,3500,96626,127241,182456,187487,190615
chr17 40852292 40897058 EZH1:NM_001321082.2 - 40854549 40880959 212,16,48 20 2318,85,81,82,96,179,126,41,92,197,181,92,164,103,177,121,129,128,91,30 0,2602,3465,4327,4813,5732,7683,8601,9571,12014,12934,17701,18179,18830,19998,22520,27360,28550,30553,44736
fasta
Writes the cDNA sequence of all transcripts into one file. Please note that the sequence is stranded.
This target format requires a reference genome fasta file that must be specified using --reference
/-r
.
This output allows different --fasta-format
options:
transcript
: The full transcript sequence (from the genomic start to end position, including introns)exons
: The cDNA sequence of the processed transcript, i.e. the sequence of all exons, including non-coding exons.cds
(default): The CDS of the transcript
>NM_007298.3 BRCA1
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGC
TATGCAGAAAATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAAC
CTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTGCATGCTGAAA
CTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGA
TATAACCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTG
...
>NM_001365057.2 C9orf85
ATGAGCTCCCAGAAAGGCAACGTGGCTCGTTCCAGACCTCAGAAGCACCA
GAATACGTTTAGCTTCAAAAATGACAAGTTCGATAAAAGTGTGCAGACCA
AGAAAATTAATGCAAAACTTCATGATGGAGTATGTCAGCGCTGTAAAGAA
GTTCTTGAGTGGCGTGTAAAATACAGCAAATACAAACCATTATCAAAACC
TAAAAAGTGA
...
fasta-split
Like fasta
above, but one file for each transcript. Instead of an output file, you must specify an output directory, ATG will save each transcript as <Transcript_name>.fasta
, e.g.: NM_001365057.2.fasta
.
This target format requires a reference genome fasta file that must be specified using --reference
/-r
.
This output allows different --fasta-format
options:
transcript
: The full transcript sequence (from the genomic start to end position, including introns)exons
: The cDNA sequence of the processed transcript, i.e. the sequence of all exons, including non-coding exons.cds
(default): The CDS of the transcript
feature-sequence
cDNA sequence of each feature (5' UTR, CDS, 3'UTR), each in a separate row.
This target format requires a reference genome fasta file that must be specified using --reference
/-r
.
BRCA1 NM_007298.3 chr17 41196311 41197694 - 3UTR CTGCAGCCAGCCAC...
BRCA1 NM_007298.3 chr17 41197694 41197819 - CDS CAATTGGGCAGATGTGTG...
BRCA1 NM_007298.3 chr17 41199659 41199720 - CDS GGTGTCCACCCAATTGTG...
BRCA1 NM_007298.3 chr17 41201137 41201211 - CDS ATCAACTGGAATGGATGG...
BRCA1 NM_007298.3 chr17 41203079 41203134 - CDS ATCTTCAGGGGGCTAGAA...
BRCA1 NM_007298.3 chr17 41209068 41209152 - CDS CATGATTTTGAAGTCAGA...
BRCA1 NM_007298.3 chr17 41215349 41215390 - CDS GGGTGACCCAGTCTATTA...
BRCA1 NM_007298.3 chr17 41215890 41215968 - CDS ATGCTGAGTTTGTGTGTG...
BRCA1 NM_007298.3 chr17 41219624 41219712 - CDS ATGCTCGTGTACAAGTTT...
BRCA1 NM_007298.3 chr17 41222944 41223255 - CDS AGGGAACCCCTTACCTGG...
C9orf85 NM_001365057.2 chr9 74526555 74526650 + 5UTR ATTGACAGAA...
C9orf85 NM_001365057.2 chr9 74526651 74526752 + CDS ATGAGCTCCCAGAA...
C9orf85 NM_001365057.2 chr9 74561922 74562028 + CDS AAAATTAATGCAAA...
C9orf85 NM_001365057.2 chr9 74597573 74597573 + CDS A
C9orf85 NM_001365057.2 chr9 74597574 74600974 + 3UTR TGGAGTCTCC...
spliceai
This is a custom format useful for SpliceAI splice predictions. The repo lists example files. The output has one gene per row, each gene record contains a consensus transcript, created by merging overlapping exons.
#NAME CHROM STRAND TX_START TX_END EXON_START EXON_END
OR4F5 1 + 69090 70008 69090, 70008,
AL627309.1 1 - 134900 139379 134900,137620, 135802,139379,
qc
Runs some basic consistency checks on the transcripts:
QC check | Explanation | Non-Coding vs Coding | requires Fasta File |
---|---|---|---|
Exon | Contains at least one exon | all | no |
Correct CDS Length | The length of the CDS is divisible by 3 | Coding | no |
Correct Start Codon | The CDS starts with ATG |
Coding | yes |
Correct Stop Codon | The CDS ends with a Stop codon TAG , TAA , or TGA |
Coding | yes |
No upstream Start Codon | The 5'UTR does not contain another start codon ATG (This test do not make sense biologically. It is totally fine for a transcript to have upstream ATG start cordons that are not utilized but the ribosome.) |
Coding | yes |
No upstream Stop Codon | The CDS does not contain another in-frame stop-codon | Coding | yes |
No Start codon | The full exon sequence does not contain a start codon ATG (Biologically speaking, a non-coding transcript could have ATG start codons that are not utilized) |
Non-Coding | yes |
Correct Coordinates | The transcript is within the coordinates of the reference genome | all | yes |
Test results:
NA
Test could not be performed (e.g. CDS-length for non-coding transcripts), so no conclusion could be drawnOK
The test succeeded with an OK resultsNOK
The test failed and gave a NOT OK result
Gene transcript Exon CDS Length Correct Start Codon Correct Stop Codon No upstream Start Codon No upstream Stop Codon Correct Coordinates
FAM239A NR_146581.1 OK N/A N/A N/A OK N/A OK
OR5H2 NM_001005482.1 OK OK OK OK OK OK OK
SNX20 NM_001144972.2 OK OK OK OK NOK OK OK
raw
This is mainly useful for debugging, as it gives a quick glimpse into the Exons and CDS coordinates of the transcripts.
bin
Save Transcripts in ATG binary format for faster re-reading.
ATG as library
ATG uses the atglib library, which is documented inline and available on docs.rs
Known issues
GTF parsing
- NM_001371720.1 has two book-ended exons (155160639-155161619 || 155161620-155162101). During input parsing, book-ended features are merged into one exon
Dependencies
~34MB
~445K SLoC