12 releases
0.6.4 | Oct 1, 2024 |
---|---|
0.4.0 | Aug 28, 2023 |
0.2.4 | May 15, 2023 |
0.2.3 | Mar 12, 2023 |
0.1.1 | Nov 22, 2021 |
#1428 in Parser implementations
528 downloads per month
125KB
456 lines
[[./resources/logo.png]] #+author: Jamie D Matthews
=tidyvcf= is a small tool to convert VCF files to "tidy", generic, columnar tables, ideal for downstream analysis with R's =tidyverse= or Julia's =DataFrames= ecosystems (for example). All fields are included by default, keeping the command line simple. =tidyvcf= is written in pure Rust, based on the excellent =noodles= crate.
** Install *** Cargo
#+begin_example cargo install tidyvcf #+end_example
*** Pre-built binaries
TBD.
*** Docker
#+begin_src bash docker pull registry.gitlab.com/jdm204/tidyvcf:0.6.4 #+end_src
- Usage ** Basic usage
CSV output with =-c= / =--csv=, default is TSV:
#+begin_example tidyvcf -i test.vcf -c -o test.csv #+end_example
BGZF compressed VCFs are detected by file extension and handled automatically:
#+begin_example tidyvcf -i test.vcf.gz -o test.tsv #+end_example
If dealing with compressed data from =stdin=, use the =--bgzip= flag:
#+begin_example cat test.vcf.gz | tidyvcf --bgzip -o test.tsv #+end_example
To write compressed TSV, use the =.gz= extension for the =--output= file or pass the =-z= / =--out-gz= options.
#+begin_example tidyvcf -i test.vcf.gz --csv -o test.csv.gz #+end_example
** Multiple samples: stacked or cartesian
It is common to perform variant calling on several related samples together, which yields VCFs with multiple sets of 'genotype' or =FORMAT= fields, one for each sample. By default, =tidyvcf= joins sample names to the names of the format fields with the underscore ('_') character - =S1_GT S1_DP S2_GT S2_DP...=.
The =--format-delim= option allow changing the sample-format field delimiter:
#+begin_example tidyvcf -i test.vcf --format-delim '~' -o test.tsv #+end_example
This behaviour violates the [[https://r4ds.had.co.nz/tidy-data.html][tidy data]] principle---to avoid this we can stack samples into rows, with the cost of repeating the static and =INFO= columns for each sample.
Stacking samples:
#+begin_example tidyvcf -i test.vcf --stack -o test_stacked.tsv #+end_example
** Info prefix
To avoid clashes in field names between =INFO= and =FORMAT= columns, =INFO= field names are prefixed with the string "info_" by default---this behaviour can be adjusted with the =--info-prefix= option:
#+begin_example tidyvcf -i test.vcf --info-prefix 'i' -c -o test.csv #+end_example
** VEP =CSQ= INFO field splitting
If your VCF is annotated with Ensembl's Variant Effect Predictor, you can use the =-v= / =--vep-fields= flag to extract those fields into individual columns:
#+begin_example tidyvcf -i vep.vcf.gz --vep-fields -o vep.tsv #+end_example
By default, the output VEP column names are prefixed with "vep_" to avoid name collisions (for example =CSQ/VAF= and =FMT/VAF=)---this string can be customised with the =--vep-prefix= option:
#+begin_example tidyvcf -i vep.vcf.gz --vep-fields --vep-prefix '.' -o vep.tsv #+end_example
/Note/: Only the first annotated transcript for a record is split, the others are bundled unsplit into an additional column named =CSQ_other_transcripts=.
** Including / Excluding fields from output
A motivating feature for =tidyvcf= was to output all columns in the VCF without having to write them out on the command line, but you can also request to /just/ include certain columns by name:
#+begin_src bash tidyvcf -i some.vcf -o some.tsv.gz --just pos ref alt #+end_src
or /not/ include certain columns.
#+begin_src bash tidyvcf -i some.vcf -o some.tsv.gz --not DP GQ PL #+end_src
Note: currently, =-v= implies all VEP fields, and info fields must be specified with their prefix (default: "info_").
** Missing values
By default, =tidyvcf= outputs "NA" (not applicable) when a value is missing. This is good for R, which recognises these strings as missing data. Other languages have different conventions; in Julia, the string "missing" is used instead.
To configure this for a =tidyvcf= run, use the =-m= (=--missing-string=) option:
#+begin_src bash tidyvcf -i some.vcf -m missing | head #+end_src
** Spec Non-Compliant VCFs
The =noodles= rust library emphasises correctness in an ecosystem where that hasn't always been standard, so in practice it rejects many VCFs produced by variant callers due to not adhering to the spec. =tidyvcf= comes with a =-l= / =--lenient= option that tries to fix spec non-compliant headers using hardcoded replacement rules before conversion. Currently, this option is sufficient to convert VCFs produced by =octopus= for example. Feel free to raise an issue if this option doesn't help for other spec-non-compliant-but-basically-fine VCFs.
** In a Snakemake Workflow
Here is a sample rule using a container. Note that =snakemake= must be invoked with =--use-singularity= in order to run rules in containers.
#+begin_src python rule tidyvcf: input: "some.vcf", output: "some.tsv", params: "--lenient -v" container: "docker://registry.gitlab.com/jdm204/tidyvcf:0.6.4", shell: "tidyvcf -i {input} -o {output} {params}" #+end_src
- Why use =tidyvcf=?
Here's a biased feature matrix for =tidyvcf= and the other tools I tried before writing it. Feel free to [[https://gitlab.com/jdm204/tidyvcf/-/issues][leave an issue]] if anything in the matrix is wrong!
| Feature | =tidyvcf= | =rbt vcf-to-txt= | =bcftools -f= | =gatk VariantsToTable= | |----------------------------------------+---------+----------------+-------------+----------------------| | include all fields automatically | ✓ | ❌ | ❌ | ❌ | | include a subset of fields | ✓ | ✓ | ✓ | ✓ | | long format | ✓ | ❌ | ❌ | ❌ | | pipeable | ✓ | ✓ | ✓ | ❌ | | compressed input without external tool | ✓ | ❌ | ✓ | ? | | bcf input | ❌ | ❌ | ✓ | ? |
Dependencies
~9MB
~158K SLoC