tidyvcf

16 unstable releases (5 breaking)

new 0.7.3	Apr 26, 2025
0.7.2	Mar 5, 2025
0.7.1	Feb 27, 2025
0.6.4	Oct 1, 2024
0.1.1	Nov 22, 2021

#256 in Unix APIs

87 downloads per month

MIT/Apache

125KB
533 lines

[[./resources/logo.png]] #+author: Jamie D Matthews

=tidyvcf= is a small tool to convert VCF files to "tidy", generic, columnar tables, ideal for downstream analysis with R's =tidyverse= or Julia's =DataFrames= ecosystems (for example). All fields are included by default, keeping the command line simple. =tidyvcf= is written in pure Rust, based on the excellent =noodles= crate.

** Install *** Cargo

#+begin_example cargo install tidyvcf #+end_example

*** Pre-built binaries

TBD.

*** Docker

If Docker is installed, =tidyvcf= can be run directly like this (useful for one-off execution):

#+begin_src bash sudo docker run registry.gitlab.com/jdm204/tidyvcf:0.7.3 tidyvcf -i test.vcf -o test.tsv #+end_src

*** Apptainer / Singularity

If you don't have superuser privileges (such as on a HPC system), apptainer/singularity is a common choice:

#+begin_src bash apptainer run docker://registry.gitlab.com/jdm204/tidyvcf:0.7.3 -- tidyvcf -i test.vcf -o test.tsv #+end_src

Usage ** Basic usage

CSV output with =-c= / =--csv=, default is TSV:

#+begin_example tidyvcf -i test.vcf -c -o test.csv #+end_example

BGZF compressed VCFs are detected by file extension and handled automatically:

#+begin_example tidyvcf -i test.vcf.gz -o test.tsv #+end_example

If dealing with compressed data from =stdin=, use the =--bgzip= flag:

#+begin_example cat test.vcf.gz | tidyvcf --bgzip -o test.tsv #+end_example

To write compressed TSV, use the =.gz= extension for the =--output= file or pass the =-z= / =--out-gz= options.

#+begin_example tidyvcf -i test.vcf.gz --csv -o test.csv.gz #+end_example

** Multiple samples: stacked or cartesian

It is common to perform variant calling on several related samples together, which yields VCFs with multiple sets of 'genotype' or =FORMAT= fields, one for each sample. By default, =tidyvcf= joins sample names to the names of the format fields with the underscore ('_') character - =S1_GT S1_DP S2_GT S2_DP...=.

The =--format-delim= option allow changing the sample-format field delimiter:

#+begin_example tidyvcf -i test.vcf --format-delim '~' -o test.tsv #+end_example

This behaviour violates the [[https://r4ds.had.co.nz/tidy-data.html][tidy data]] principle---to avoid this we can stack samples into rows, with the cost of repeating the static and =INFO= columns for each sample.

Stacking samples:

#+begin_example tidyvcf -i test.vcf --stack -o test_stacked.tsv #+end_example

** Info prefix

To avoid clashes in field names between =INFO= and =FORMAT= columns, =INFO= field names are prefixed with the string "info_" by default---this behaviour can be adjusted with the =--info-prefix= option:

#+begin_example tidyvcf -i test.vcf --info-prefix 'i' -c -o test.csv #+end_example

** VEP =CSQ= INFO field splitting

If your VCF is annotated with Ensembl's Variant Effect Predictor, you can use the =-v= / =--vep-fields= flag to extract those fields into individual columns:

#+begin_example tidyvcf -i vep.vcf.gz --vep-fields -o vep.tsv #+end_example

By default, the output VEP column names are prefixed with "vep_" to avoid name collisions (for example =CSQ/VAF= and =FMT/VAF=)---this string can be customised with the =--vep-prefix= option:

#+begin_example tidyvcf -i vep.vcf.gz --vep-fields --vep-prefix '.' -o vep.tsv #+end_example

/Note/: Only the first annotated transcript for a record is split, the others are bundled unsplit into an additional column named =CSQ_other_transcripts=.

** Including / Excluding fields from output

A motivating feature for =tidyvcf= was to output all columns in the VCF without having to write them out on the command line, but you can also request to /just/ include certain columns by name:

#+begin_src bash tidyvcf -i some.vcf -o some.tsv.gz --just pos ref alt #+end_src

or /not/ include certain columns.

#+begin_src bash tidyvcf -i some.vcf -o some.tsv.gz --not DP GQ PL #+end_src

Note: currently, =-v= implies all VEP fields, and info fields must be specified with their prefix (default: "info_").

** Missing values

By default, =tidyvcf= outputs "NA" (not applicable) when a value is missing. This is good for R, which recognises these strings as missing data. Other languages have different conventions; in Julia, the string "missing" is used instead.

To configure this for a =tidyvcf= run, use the =-m= (=--missing-string=) option:

#+begin_src bash tidyvcf -i some.vcf -m missing | head #+end_src

** Spec Non-Compliant VCFs

The =noodles= rust library emphasises correctness in an ecosystem where that hasn't always been standard, so in practice it rejects many VCFs produced by variant callers due to not adhering to the spec. =tidyvcf= comes with a =-l= / =--lenient= option that tries to fix spec non-compliant headers using hardcoded replacement rules before conversion. Currently, this option is sufficient to convert VCFs produced by =octopus= for example. Feel free to raise an issue if this option doesn't help for other spec-non-compliant-but-basically-fine VCFs.

** In a Snakemake Workflow

Here is a sample rule using a container. Note that =snakemake= must be invoked with =--use-singularity= in order to run rules in containers.

#+begin_src python rule tidyvcf: input: "some.vcf", output: "some.tsv", params: "-vs", container: "docker://registry.gitlab.com/jdm204/tidyvcf:0.7.3", shell: "tidyvcf -i {input} -o {output} {params}" #+end_src

Why use =tidyvcf=?

Here's a biased feature matrix for =tidyvcf= and the other tools I tried before writing it. Feel free to [[https://gitlab.com/jdm204/tidyvcf/-/issues][leave an issue]] if anything in the matrix is wrong!

| Feature | =tidyvcf= | =rbt vcf-to-txt= | =bcftools -f= | =gatk VariantsToTable= | |----------------------------------------+---------+----------------+-------------+----------------------| | include all fields automatically | ✓ | ❌ | ❌ | ❌ | | include a subset of fields | ✓ | ✓ | ✓ | ✓ | | long format | ✓ | ❌ | ❌ | ❌ | | pipeable | ✓ | ✓ | ✓ | ❌ | | compressed input without external tool | ✓ | ❌ | ✓ | ? | | bcf input | ❌ | ❌ | ✓ | ? |

Notes

Warning: users have noticed after reading tidyvcf output into R with =readr::read_tsv=, that certain comma separated fields (like the FORMAT/AD fields from Mutect2) are parsed as numbers with the commas as grouping separators, depending on the =locale= and the exact values in the n rows that =read_tsv= checks to guess column types. This is not a bug in =tidyvcf=.

Dependencies

~10MB
~176K SLoC