#+title: tidyVCF
#+author: Jamie D Matthews
=tidyVCF= is a small tool to convert VCF files to tidy tab/comma separated tables, ideal for downstream analysis with R's =tidyverse= or Julia's =DataFrames= ecosystems. All fields are included by default, keeping the command line simple. =tidyVCF= is written in pure Rust, relying on the excellent =noodles-vcf= crate developed by [[https://github.com/zaeleus][@zaeleus]] and contributors.
Warning: /built on an unstable API, lacking proper testing, brittle in terms of erroring at minor VCF spec violations/.
** Install
*** Cargo
#+begin_example
cargo install tidyvcf
#+end_example
*** Pre-built binaries
TBD.
*** Docker
#+begin_src bash
docker pull registry.gitlab.com/jdm204/tidyvcf:latest
#+end_src
* Usage
** Basic usage
CSV output with =-c= / =--csv=, default is TSV:
#+begin_example
tidyvcf -i test.vcf -c -o test.csv
#+end_example
BGZF compressed VCFs are detected by file extension and handled automatically:
#+begin_example
tidyvcf -i test.vcf.gz -o test.tsv
#+end_example
If dealing with compressed data from =stdin=, use the =--bgzip= flag:
#+begin_example
cat test.vcf.gz | tidyvcf --bgzip -o test.tsv
#+end_example
To write compressed output, use the =.gz= extension for the =--output= file or pass the =-z= / =--out-gz= option:
#+begin_example
tidyvcf -i test.vcf.gz --csv -o test.csv.gz
#+end_example
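For downstream use outside R or Julia, the compressed table can be read back with Python's standard library alone. This is a minimal sketch, not part of =tidyvcf=: =open_table= and =read_rows= are hypothetical helper names, and it assumes the output is gzip-compatible, has a header row, and uses a =.csv= or =.tsv= extension (optionally followed by =.gz=).

#+begin_src python
import csv
import gzip


def open_table(path):
    """Open a delimited table as text, decompressing if the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_rows(path):
    """Return the table at `path` as a list of dicts keyed by column name."""
    # Assumption: the delimiter can be inferred from the file extension.
    base = path[:-3] if path.endswith(".gz") else path
    delimiter = "," if base.endswith(".csv") else "\t"
    with open_table(path) as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
#+end_src

Calling =read_rows("test.csv.gz")= on the file written above would give one dictionary per output row, with every value held as a string.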
** Multiple samples: stacked or cartesian
It is common to perform variant calling on several related samples together, which yields VCFs with multiple sets of 'genotype' or =FORMAT= fields, one for each sample. By default, =tidyvcf= joins sample names to the names of the format fields with the underscore ('_') character, for example =S1_GT S1_DP S2_GT S2_DP...=.
The =--format-delim= option allows changing the sample-format field delimiter:
#+begin_example
tidyvcf -i test.vcf --format-delim '~' -o test.tsv
#+end_example
Spreading samples across columns in this way violates the [[https://r4ds.had.co.nz/tidy-data.html][tidy data]] principle that each observation occupies its own row. To avoid this, we can stack samples into rows, at the cost of repeating the static and =INFO= columns for each sample.
Stacking samples:
#+begin_example
tidyvcf -i test.vcf --stack -o test_stacked.tsv
#+end_example
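To make the difference between the two layouts concrete, the following sketch converts one wide row into stacked rows in plain Python. It only illustrates the idea and is not how =tidyvcf= is implemented: the function name =stack_samples=, the =sample= column name, and the example column names and values are all assumptions.

#+begin_src python
def stack_samples(row, samples, format_delim="_"):
    """Turn one wide row into one row per sample.

    Columns named <sample><delim><field> are treated as per-sample values;
    every other column is copied unchanged into each output row.
    """
    prefixes = {sample: sample + format_delim for sample in samples}
    shared = {
        key: value
        for key, value in row.items()
        if not any(key.startswith(p) for p in prefixes.values())
    }
    stacked = []
    for sample, prefix in prefixes.items():
        out = dict(shared)          # static and INFO columns are repeated
        out["sample"] = sample      # hypothetical column name
        for key, value in row.items():
            if key.startswith(prefix):
                out[key[len(prefix):]] = value
        stacked.append(out)
    return stacked


# Hypothetical wide row with two samples, S1 and S2.
wide = {
    "chrom": "chr1", "pos": "100", "info_DP": "60",
    "S1_GT": "0/1", "S1_DP": "31",
    "S2_GT": "0/0", "S2_DP": "29",
}
for stacked_row in stack_samples(wide, ["S1", "S2"]):
    print(stacked_row)
#+end_src

The sketch matches columns by prefix, so it becomes ambiguous when one sample name plus the delimiter is a prefix of another sample's column names; a delimiter that cannot appear in sample names avoids that.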
** Info prefix
To avoid clashes in field names between =INFO= and =FORMAT= columns, =INFO= field names are prefixed with the string "info_" by default---this behaviour can be adjusted with the =--info-prefix= option:
#+begin_example
tidyvcf -i test.vcf --info-prefix 'i' -c -o test.csv
#+end_example
** VEP =CSQ= INFO field splitting
If your VCF is annotated with Ensembl's Variant Effect Predictor, you can use the =-v= / =--vep-fields= flag to extract those fields into individual columns:
#+begin_example
tidyvcf -i vep.vcf.gz --vep-fields -o vep.tsv
#+end_example
By default, the output VEP column names are prefixed with "vep_" to avoid name collisions (for example =CSQ/VAF= and =FMT/VAF=)---this string can be customised with the =--vep-prefix= option:
#+begin_example
tidyvcf -i vep.vcf.gz --vep-fields --vep-prefix '.' -o vep.tsv
#+end_example
/Note/: Only the first annotated transcript for a record is split; the others are bundled unsplit into an additional column named =CSQ_other_transcripts=.
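The splitting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the =tidyvcf= implementation: it assumes the usual VEP layout, in which the =CSQ= header description ends with =Format: A|B|C=, transcripts are separated by commas, and fields within a transcript are separated by =|=. The function names and the example annotation values are hypothetical.

#+begin_src python
def csq_field_names(description):
    """Extract field names from a CSQ header description ("... Format: A|B|C")."""
    return description.split("Format: ", 1)[1].strip().strip('"').split("|")


def split_csq(csq_value, field_names, prefix="vep_"):
    """Split the first transcript into prefixed columns; keep the rest unsplit."""
    first, _, others = csq_value.partition(",")
    columns = {
        prefix + name: value
        for name, value in zip(field_names, first.split("|"))
    }
    columns["CSQ_other_transcripts"] = others
    return columns


# Hypothetical header description and CSQ value with two transcripts.
description = (
    "Consequence annotations from Ensembl VEP. "
    "Format: Allele|Consequence|IMPACT|SYMBOL"
)
csq = "T|missense_variant|MODERATE|GENE1,T|upstream_gene_variant|MODIFIER|GENE2"
print(split_csq(csq, csq_field_names(description)))
#+end_src

Passing a different =prefix= mirrors the effect of =--vep-prefix= on the resulting column names.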
** Spec Non-Compliant VCFs
The =noodles= Rust library emphasises correctness in an ecosystem where that hasn't always been standard, so in practice it rejects many VCFs produced by variant callers because they do not adhere to the spec. =tidyvcf= comes with a =-l= / =--lenient= option that tries to fix spec non-compliant headers using hardcoded replacement rules before conversion. Currently, this option is sufficient to convert VCFs produced by =octopus=, for example. Feel free to raise an issue if this option doesn't help for other spec-non-compliant-but-basically-fine VCFs.
** In a Snakemake Workflow
Here is a sample rule using a container. Note that =snakemake= must be invoked with =--use-singularity= in order to run rules in containers.
#+begin_src python
rule tidyvcf:
    input: "some.vcf"
    output: "some.tsv"
    params: "--lenient -v"
    container: "docker://registry.gitlab.com/jdm204/tidyvcf:latest"
    shell: "tidyvcf -i {input} -o {output} {params}"
#+end_src
* Comparison with other software
:PROPERTIES:
:CUSTOM_ID: comparison-with-other-software
:END:
| Feature                                | =tidyVCF=  | =rbt vcf-to-txt=                              | =bcftools -f=          | =gatk VariantsToTable= |
|----------------------------------------+------------+-----------------------------------------------+------------------------+------------------------|
| include all fields                     | by default | individually specified; currently no =FILTER= | individually specified | individually specified |
| include a subset of fields             | ❌         | individually specified; currently no =FILTER= | individually specified | individually specified |
| long format                            | =--stack=  | ❌                                            | ❌                     | ❌                     |
| pipeable                               | ✓          | ✓                                             | ✓                      | ❌                     |
| compressed input without external tool | ✓          | ❌                                            | ✓                      | ?                      |
| bcf input                              | ❌         | ❌                                            | ✓                      | ?                      |