bin+lib tidyvcf

command-line tool to convert VCF files to tab/comma separated tables

10 unstable releases (3 breaking)

0.4.0 Aug 28, 2023
0.3.0 Aug 28, 2023
0.2.5 Aug 27, 2023
0.2.4 May 15, 2023
0.1.1 Nov 22, 2021

MIT/Apache

#+title: tidyVCF
#+author: Jamie D Matthews

=tidyVCF= is a small tool to convert VCF files to tidy tab/comma separated tables, ideal for downstream analysis with R's =tidyverse= or Julia's =DataFrames= ecosystems. All fields are included by default, keeping the command line simple. =tidyVCF= is written in pure Rust, relying on the excellent =noodles-vcf= crate developed by [[https://github.com/zaeleus][@zaeleus]] and contributors.

Warning: /built on an unstable API, lacking proper testing, brittle in terms of erroring at minor VCF spec violations/.

** Install
*** Cargo

#+begin_example
cargo install tidyvcf
#+end_example

*** Pre-built binaries

TBD.

*** Docker

#+begin_src bash
docker pull registry.gitlab.com/jdm204/tidyvcf:latest
#+end_src

* Usage
** Basic usage

CSV output with =-c= / =--csv=, default is TSV:

#+begin_example
tidyvcf -i test.vcf -c -o test.csv
#+end_example

BGZF compressed VCFs are detected by file extension and handled automatically:

#+begin_example
tidyvcf -i test.vcf.gz -o test.tsv
#+end_example

If dealing with compressed data from =stdin=, use the =--bgzip= flag:

#+begin_example
cat test.vcf.gz | tidyvcf --bgzip -o test.tsv
#+end_example

To write compressed output, use the =.gz= extension for the =--output= file or pass the =-z= / =--out-gz= option:

#+begin_example
tidyvcf -i test.vcf.gz --csv -o test.csv.gz
#+end_example
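For downstream work outside R or Julia, a compressed table can be read back with nothing but the Python standard library. The following is a minimal sketch, not part of =tidyvcf=: the helper name =read_table= is hypothetical, and it assumes the output is an ordinary gzip-readable, delimiter-separated file whose first row is a header.

```python
import csv
import gzip


def read_table(path, delimiter="\t"):
    """Read a delimited table, transparently handling a .gz extension.

    Returns a list of dicts keyed by the header row. All values are
    left as strings; no type inference is attempted.
    """
    # Pick the opener from the file extension, mirroring how the
    # README describes extension-based detection for tidyvcf itself.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

For the CSV example above this would be called as =read_table("test.csv.gz", delimiter=",")=.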

** Multiple samples: stacked or cartesian

It is common to perform variant calling on several related samples together, which yields VCFs with multiple sets of 'genotype' or =FORMAT= fields, one for each sample. By default, =tidyvcf= joins sample names to the names of the format fields with the underscore ('_') character - =S1_GT S1_DP S2_GT S2_DP...=.

The =--format-delim= option allows changing the sample-format field delimiter:

#+begin_example
tidyvcf -i test.vcf --format-delim '~' -o test.tsv
#+end_example

Spreading samples across columns in this way violates the [[https://r4ds.had.co.nz/tidy-data.html][tidy data]] principle that each observation gets its own row. To avoid this, samples can be stacked into rows, at the cost of repeating the static and =INFO= columns for each sample.

Stacking samples:

#+begin_example
tidyvcf -i test.vcf --stack -o test_stacked.tsv
#+end_example
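To make the difference between the two layouts concrete, here is a small Python sketch of the wide-to-stacked transformation on a single record. It illustrates the idea only and is not how =tidyvcf= is implemented; the function name =stack_row=, the =sample= column name and the example column names are all assumptions made for the illustration.

```python
def stack_row(wide_row, samples, delim="_"):
    """Turn one wide record into one record per sample.

    Columns named <sample><delim><field> become <field> in that
    sample's row; every other column is repeated in each row.
    """
    shared = {}
    per_sample = {sample: {} for sample in samples}
    # Try longer sample names first so that "S1_b" is not mistaken for "S1".
    by_length = sorted(samples, key=len, reverse=True)
    for column, value in wide_row.items():
        for sample in by_length:
            prefix = sample + delim
            if column.startswith(prefix):
                per_sample[sample][column[len(prefix):]] = value
                break
        else:
            # Not a per-sample column: static or INFO, shared by all rows.
            shared[column] = value
    return [{**shared, "sample": s, **per_sample[s]} for s in samples]


wide = {
    "CHROM": "chr1", "POS": "100", "info_DP": "30",
    "S1_GT": "0/1", "S1_DP": "12",
    "S2_GT": "1/1", "S2_DP": "18",
}
stacked = stack_row(wide, ["S1", "S2"])
```

Here =stacked= holds two rows, one per sample, each repeating =CHROM=, =POS= and =info_DP=. That repetition is the cost of the stacked layout mentioned above.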

** Info prefix

To avoid clashes in field names between =INFO= and =FORMAT= columns, =INFO= field names are prefixed with the string "info_" by default---this behaviour can be adjusted with the =--info-prefix= option:

#+begin_example
tidyvcf -i test.vcf --info-prefix 'i' -c -o test.csv
#+end_example
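The naming scheme is easiest to see with a field such as =DP= that appears in both =INFO= and =FORMAT=. The sketch below builds a wide header following the rules described in this README (prefix on =INFO= fields, sample name joined to each =FORMAT= field with a delimiter). The function =wide_header= is hypothetical, and the static column names and column order are assumptions rather than a statement about the real output.

```python
def wide_header(static, info_fields, format_fields, samples,
                info_prefix="info_", format_delim="_"):
    """Build column names for the wide (one row per variant) layout."""
    header = list(static)
    # INFO fields get a prefix so they cannot clash with FORMAT fields.
    header += [info_prefix + name for name in info_fields]
    # Each sample contributes one column per FORMAT field.
    for sample in samples:
        header += [sample + format_delim + name for name in format_fields]
    return header


default_names = wide_header(["CHROM", "POS"], ["DP"], ["GT", "DP"], ["S1", "S2"])
custom_names = wide_header(["CHROM", "POS"], ["DP"], ["GT", "DP"], ["S1", "S2"],
                           info_prefix="i", format_delim="~")
```

With the defaults, the =INFO= depth becomes =info_DP= while the per-sample depths become =S1_DP= and =S2_DP=, so the three columns stay distinct.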

** VEP =CSQ= INFO field splitting

If your VCF is annotated with Ensembl's Variant Effect Predictor, you can use the =-v= / =--vep-fields= flag to extract those fields into individual columns:

#+begin_example
tidyvcf -i vep.vcf.gz --vep-fields -o vep.tsv
#+end_example

By default, the output VEP column names are prefixed with "vep_" to avoid name collisions (for example =CSQ/VAF= and =FMT/VAF=)---this string can be customised with the =--vep-prefix= option:

#+begin_example
tidyvcf -i vep.vcf.gz --vep-fields --vep-prefix '.' -o vep.tsv
#+end_example

/Note/: Only the first annotated transcript for a record is split; the others are bundled unsplit into an additional column named =CSQ_other_transcripts=.
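For readers unfamiliar with the =CSQ= layout: VEP lists the sub-field names in the =INFO= header description after the text =Format:=, separated by =|=, and each record's =CSQ= value holds one =|=-separated entry per transcript, with entries separated by commas. The Python sketch below mimics the splitting described above under those assumptions. Both function names are hypothetical, and the exact way =tidyvcf= formats the remaining transcripts is assumed, not confirmed.

```python
def parse_csq_format(description):
    """Extract CSQ sub-field names from the INFO header description."""
    fields = description.split("Format: ", 1)[1]
    return fields.strip().strip('"').split("|")


def split_csq(csq_value, field_names, prefix="vep_"):
    """Split the first transcript into columns; keep the rest unsplit."""
    first, _, rest = csq_value.partition(",")
    row = {prefix + name: value
           for name, value in zip(field_names, first.split("|"))}
    # Remaining transcripts are kept as one raw string (assumed layout).
    row["CSQ_other_transcripts"] = rest
    return row


names = parse_csq_format(
    "Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT"
)
columns = split_csq(
    "A|missense_variant|MODERATE,A|intron_variant|MODIFIER", names
)
```

In this example =columns= contains =vep_Allele=, =vep_Consequence= and =vep_IMPACT= for the first transcript, while the second transcript stays as the raw string =A|intron_variant|MODIFIER= in =CSQ_other_transcripts=.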

** Spec Non-Compliant VCFs

The =noodles= Rust library emphasises correctness in an ecosystem where that hasn't always been standard, so in practice it rejects many VCFs produced by variant callers because they do not adhere to the spec. =tidyvcf= comes with a =-l= / =--lenient= option that tries to fix spec non-compliant headers using hardcoded replacement rules before conversion. Currently, this option is sufficient to convert VCFs produced by =octopus=, for example. Feel free to raise an issue if this option doesn't help for other spec-non-compliant-but-basically-fine VCFs.

** In a Snakemake Workflow

Here is a sample rule using a container. Note that =snakemake= must be invoked with =--use-singularity= in order to run rules in containers.

#+begin_src python
rule tidyvcf:
    input: "some.vcf",
    output: "some.tsv",
    params: "--lenient -v"
    container: "docker://registry.gitlab.com/jdm204/tidyvcf:latest",
    shell: "tidyvcf -i {input} -o {output} {params}"
#+end_src

* Comparison with other software
:PROPERTIES:
:CUSTOM_ID: comparison-with-other-software
:END:

| Feature                                | =tidyVCF=  | =rbt vcf-to-txt=                              | =bcftools -f=          | =gatk VariantsToTable= |
|----------------------------------------+------------+-----------------------------------------------+------------------------+------------------------|
| include all fields                     | by default | individually specified; currently no =FILTER= | individually specified | individually specified |
| include a subset of fields             | ❌         | individually specified; currently no =FILTER= | individually specified | individually specified |
| long format                            | =--stack=  | ❌                                            | ❌                     | ❌                     |
| pipeable                               | ✓          | ✓                                             | ✓                      | ❌                     |
| compressed input without external tool | ✓          | ❌                                            | ✓                      | ?                      |
| bcf input                              | ❌         | ❌                                            | ✓                      | ?                      |
