8 releases

0.3.2	Sep 16, 2022
0.3.1	Sep 7, 2022
0.2.6	Oct 3, 2021
0.2.3	Feb 17, 2021
0.1.1	Aug 7, 2020

#1 in #chance

MIT and LGPL-3.0

9.5MB
2K SLoC

GeNovo - Identifying disease genes with de-novo mutations

Everyone's genome contains some de-novo mutations: Mutations that are not part of their parents' genomes. These de-novo mutations can cause diseases but it is still difficult to tell the disease-causing mutations apart from the ones that have nothing to do with the disease. Furthermore, some genes have a higher chance of mutating which means that a mutation in that gene is not that surprising, even when that gene is mutated in several people with the disease.

The name GeNovo is a wordplay of gene detection based on de-novo mutations.

Installation

First make sure you have cargo installed (cargo is the package manager for the Rust programming language). The easiest way is to install rustup which automatically installs cargo and rustc for you. If you are on a UNIX-based system, you might also be able to install cargo or rustup with your system's package manager.

Secondly, you need to run

cargo install genovo

This will install the program genovo at a specific location which will be shown at the end of the installation process. (You might need to make adjustments to your PATH variable afterwards to automatically find the program on the command line)

The pipeline is a single executable (genovo) that combines the important steps from an earlier pipeline that I had put together. (A single executable is much easier for others to use than a bunch of scripts that are all loosely coupled.)

The pipeline

The pipeline flow diagram

Input files are marked with a blue border.
Pipeline steps are marked with a black border and their names are marked with a green background
Optional parameters are marked in blue font

Input and output files

The pipeline takes several parameters that represent input and output parameters. Depending on which action is being performed, a parameter is treated either as an input or output file.

For example, the --possible-mutations parameter is treated as an input file when --action expect is specified. But it is treated as an output file when --action enumerate is specified. If no --action parameter is specified, the entire pipeline is run and all file parameters are treated as output files.

You can also specify - as an input/output file. Depending on the action, this will be interpreted as STDIN or as STDOUT.

If you are concerned about space usage you can also specify a .gz file extension. genovo will automatically decompress/compress the input/output.

ID filtering / parallel processing

If the parameter --id ID is passed, then only the transcripts with the id ID will be processed and all other transcripts will be ignored. This is useful for parallel execution. You can launch a separate job for each transcript. Please note that --id only accepts a single ID. Multiple IDs are not supported. You can always combine the --id option with any other option.

Running the whole pipeline on a test data set

You can download a test data set from the git repository:

wget https://github.com/BesenbacherLab/genovo/raw/master/test_data/test_data.tar.gz
tar xvzf test_data.tar.gz

If you want to run the entire pipeline on the test data you can execute the following command:

genovo \
	--gff3 gencode.v19.CCDS.annotation.protein_coding_chr22.gff3.gz \
	--observed-mutations observed_mutations_DDD_2017_chr22.txt \
	--genome hg19_chr22.2bit \
	--point-mutation-probabilities SNV_mutation_rate_model.txt \
	--indel-mutation-probabilities indel_mutation_rate_model.txt \
	--scaling-factor 8586 \
	--significant-mutations genovo_results_chr22.txt

The command takes ~1 minute to run on a "normal" computer. It will write the comparisons between observed, expected and sampled mutations to the file genovo_results_chr22.txt. If the --significant-mutations argument is not used the results will be written to STDOUT. For each transcript, t, and mutation type, m, the output contains the following output columns

column	description
observed	The observed number of m mutations in transcript t
expected	The expected number of m mutations in transcript t according to the mutation rate models
expected_lower	The lower bound of the confidence interval around the expected number of mutations
expected_upper	The upper bound of the confidence interval around the expected number of mutations
p_value	A one-sided p-value for a sampling based test of whether there are more observed mutations of type m in transcript t than we would expect given our mutation rate models

When testing on de novo mutations from trios the --scaling-factor should be two times the number of trios in the data. So in this case it is set to 8586 since the data set with the observed mutations looked for de novo mutations in 4293 children.

The SNV_mutation_rate_model.txt and indel_mutation_rate_model.txt are mutation rate models created using kmerpapa. Trained models in the right input format can be downloaded from https://github.com/BesenbacherLab/Genovo_Input.

Steps of the pipeline

Each of the steps of the pipeline can be executed in isolation by specifying the --action STEP parameter. If no --action parameter is specified, all steps are run (as in the example above).

transform

This step takes a gff3 file and transforms it into a regions file (which is a file format that is specific to genovo). The gff3 file needs to have the following attributes:

ID: For identifying the name of each transcript
Parent: For identifying the parent of exons and CDS regions

The file must also be sorted so that all exons and CDS regions are listed after their transcript line.

Some example gff3 files can be obtained from GENCODE

It is highly recommended that you only use annotations that contain exclusively coding transcripts. The results for non-coding transcripts are a lot less informative.

parameters

parameter	input/output	description
--gff3	input	The GFF3 annotation file that contains transcript, exon and CDS information
--genomic-regions	output	A transformed version of the GFF3 entries, limited to transcripts, exons and CDS regions

enumerate

This goes through the genomic sequence of every transcript and determines all possible point mutations together with their probability This command takes a PaPa-rates file which represents the point mutation probabilities for different sequence contexts as described in our paper *Improved prediction of site-specific mutation rates using k-mer pattern partition *. This command also takes a reference genome file in the twobit file format. Make sure that all your input files use the same genome build!

The output file is fairly simple:

#transcript_ID
numeric_mutation_type:mutation_probability
numeric_mutation_type:mutation_probability
numeric_mutation_type:mutation_probability
#other_transcript_ID
numeric_mutation_type:mutation_probability
numeric_mutation_type:mutation_probability
numeric_mutation_type:mutation_probability
numeric_mutation_type:mutation_probability
numeric_mutation_type:mutation_probability
numeric_mutation_type:mutation_probability
numeric_mutation_type:mutation_probability

The possible mutations for each transcript start with a #ID line followed with one line for every possible mutation where the probability is not NaN. Keep in mind that every position may mutate into one of the 3 other bases, so that you have 3 possible point mutations for every position.

The numeric mutation type is defined in the section Mutation type numeric code table.

parameters

parameter	input/output	description
--genomic-regions	input	A transformed version of the GFF3 entries, limited to transcripts, exons and CDS regions
--genome	input	A twobit reference genome sequence file
--point-mutation-probabilities	input	A pattern partition file with the sequence-context-dependent point mutation probabilities
--indel-mutation-probabilities	input	A pattern partition file with the sequence-context-dependent indel probabilities
--possible-mutations	output	The possible mutations for each transcript

classify

This step classifies observed point mutations (which is a raw input file you should provide). The input file format is as follows:

chr1 1230448 G A
chr1 1609723 C T
chr1 1903276 C T
chr1 2574999 C T

The first column is the chromosome (must be the same naming convention as in the annotations and the twobit file). The second column is the position on the chromosome. This position has to be 1-based. The third column is the reference base. If this reference base does not match what is in the twobit file, you will get an error. The fourth column is the new base that was mutated into.

parameters

parameter	input/output	description
--observed-mutations	input	Observed point mutations from your own data
--genome	input	A twobit reference genome sequence file
--genomic-regions	input	A transformed version of the GFF3 entries, limited to transcripts, exons and CDS regions
--classified-mutations	output	The observed mutatations. But annotated with numeric mutation types

expect

This step sums up all possible mutations and their probabilities and gives you a grand total that represents the expected number of mutations for each gene.

parameters

parameter	input/output	description
--possible-mutations	input	The possible mutations for each transcript
--expected-mutations	output	The number of expected mutations per transcript for each mutation type

sample

This step randomly picks possible mutations and simulates randomly if they occur based on their probability. This creates an empirical distribution of sampled mutations that will be used to calculate a p-value in a later step.

parameters

parameter	input/output	description
--possible-mutations	input	The possible mutations for each transcript
--number-of-random-samples	integer	How many random samples? default=1000
--sampled-mutations	output	Distributions of sampled mutations

compare

Compare the observed, expected and sampled mutations to determine a p-value for observing that many mutations in a transcript by chance. Please note that the output file contains the results of all statistical tests, even the non-significant ones. And note that the p-values does not take into account the expected number of mutations.

parameters

parameter	input/output	description
--classified-mutations	input	The observed mutatations. But annotated with numeric mutation types
--expected-mutations	input	The number of expected mutations per transcript for each mutation type
--sampled-mutations	input	Distributions of sampled mutations
--significant-mutations	output	A table of observed, expected, sampled and a p-value for different mutation types.

Mutation type numeric code table

The input and output files of the pipeline are almost all text files. In order to be more space-efficient, there is a numeric encoding for the different types of mutations.

mutation type name	numeric code
unknown	0
synonymous	1
missense	2
nonsense	3
loss of stop codon	4
loss of start codon	5
loss of canonical splice site	6
Mutation within an intron	7
Non-frameshift indel	8
Frameshift indel	9

Dependencies

~6.5MB
~106K SLoC