#genomics #bioinformatics #sequencing #pangenome

app cgt_bacpop

Label core and rare genes in pangenome data

1 unstable release

0.1.0 Jan 23, 2024

#204 in Biology

Apache-2.0

11KB
135 lines

Description

This repository is part of the CELEBRIMBOR pangenome analysis pipeline. It provides Rust code that labels each gene as core, rare, or neither, depending on how many of the genome samples the gene is observed in. The code tries to account for incomplete genome samples by using the genome completeness score reported by CheckM.
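To make the idea concrete, here is a minimal sketch of one way completeness could enter the labelling. It is not the crate's actual method: the function label_gene, the GeneLabel enum, the thresholds, and the rule of dividing the observed count by the summed completeness are all assumptions chosen for illustration. Completeness is assumed to be a fraction between 0 and 1 (a percentage would need dividing by 100 first).

```rust
/// Label assigned to a gene (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum GeneLabel {
    Core,
    Rare,
    Neither,
}

/// Assumed rule: divide the number of genomes in which the gene was observed
/// by the summed completeness of all genomes, so that incomplete genomes
/// count as less than one full opportunity to observe the gene.
fn label_gene(
    presence: &[bool],
    completeness: &[f64],
    core_threshold: f64,
    rare_threshold: f64,
) -> GeneLabel {
    assert_eq!(presence.len(), completeness.len());
    let observed = presence.iter().filter(|&&p| p).count() as f64;
    let effective_genomes: f64 = completeness.iter().sum();
    if effective_genomes == 0.0 {
        return GeneLabel::Neither;
    }
    // Cap at 1.0 because the adjusted frequency can exceed the raw one.
    let frequency = (observed / effective_genomes).min(1.0);
    if frequency >= core_threshold {
        GeneLabel::Core
    } else if frequency <= rare_threshold {
        GeneLabel::Rare
    } else {
        GeneLabel::Neither
    }
}

fn main() {
    // Four genomes, two of them incomplete (fractions, not percentages).
    let completeness = [1.0, 0.9, 0.8, 1.0];
    // Seen in 3 of 4 genomes: raw frequency 0.75, adjusted about 0.81.
    let gene = [true, true, false, true];
    let label = label_gene(&gene, &completeness, 0.8, 0.05);
    println!("{:?}", label);
}
```

With a core threshold of 0.8, the gene in main falls short on its raw frequency (3 of 4) but passes once the missing fraction of the incomplete genomes is discounted.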

The following people have contributed to writing the Rust code and fitting it into the CELEBRIMBOR pipeline:

  • Joel Hellewell
  • John Lees
  • Sam Horsfield
  • Johanna Von Wachsmann

Example

You can run the code on CheckM output called genome_metadata.tsv and a presence-absence matrix called gene_presence_absence.Rtab (generated earlier in the CELEBRIMBOR Snakemake pipeline). The --completeness-column argument (7 in the example below) specifies the column in genome_metadata.tsv that contains the completeness score for each genome sample.

First build the crate using cargo build --release in this directory. Then you can run the program on the example data provided with the following command:

target/release/cgt_bacpop example_data/genome_metadata.tsv example_data/gene_presence_absence.Rtab --completeness-column 7
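As a rough illustration of what selecting a completeness column involves, the sketch below pulls one numeric column out of tab-separated text. The function read_completeness is hypothetical and is not taken from the crate. It assumes the file has a single header line and that column numbers are counted from 1; the source does not confirm either.

```rust
/// Hypothetical helper: extract one numeric column from TSV text.
/// Assumes a single header line and 1-based column numbers.
fn read_completeness(tsv: &str, column: usize) -> Result<Vec<f64>, String> {
    let index = column
        .checked_sub(1)
        .ok_or_else(|| "column numbers start at 1".to_string())?;
    tsv.lines()
        .skip(1) // skip the assumed header line
        .filter(|line| !line.trim().is_empty())
        .map(|line| {
            let field = line
                .split('\t')
                .nth(index)
                .ok_or_else(|| format!("line has no column {}: {}", column, line))?;
            field
                .trim()
                .parse::<f64>()
                .map_err(|e| format!("cannot parse '{}' as a number: {}", field, e))
        })
        .collect()
}

fn main() {
    // Made-up data in a made-up layout, not real CheckM output.
    let tsv = "genome\tnote\tcompleteness\nsample_a\tx\t97.5\nsample_b\ty\t80.0\n";
    match read_completeness(tsv, 3) {
        Ok(values) => println!("{:?}", values),
        Err(message) => eprintln!("error: {}", message),
    }
}
```

Returning a Result lets the caller report a wrong column number or a non-numeric field instead of panicking partway through the file.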

Dependencies

~9–16MB
~215K SLoC