
app tidyvcf

command-line tool to convert VCF files to tab/comma separated tables

6 releases

0.2.3 Mar 12, 2023
0.2.2 Apr 2, 2022
0.2.1 Jan 30, 2022
0.1.1 Nov 22, 2021

MIT/Apache

18KB
332 lines

tidyVCF

tidyVCF is a small tool to convert VCF files to tidy tab/comma separated tables, ideal for downstream analysis with R's tidyverse or Julia's DataFrames ecosystems. All fields are included by default, keeping the command line simple. tidyVCF is written in pure Rust, relying on the excellent noodles-vcf crate developed by @zaeleus and contributors.

Note: The tool works for me, but it isn't ready for production use yet: it's built on a fairly experimental API, it lacks proper testing, and it's quite brittle, generally failing to handle the various species of wild VCF and erroring gracelessly at the most minor of spec violations.

Install

Cargo

cargo install tidyvcf

Pre-built binaries

TBD.

Usage

Basic usage

Use -c/--csv for CSV output; the default is TSV:

tidyvcf -i test.vcf -c -o test.csv

BGZF compressed VCFs are detected by file extension and handled automatically:

tidyvcf -i test.vcf.gz -o test.tsv

If dealing with compressed data from stdin, use the --bgzip flag:

cat test.vcf.gz | tidyvcf --bgzip -o test.tsv
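
The difference between the two cases above is that a file path carries an extension to inspect, while stdin does not. The following Python sketch illustrates extension-based detection in general; it is not tidyvcf's implementation, `open_vcf` is a hypothetical helper, and the set of extensions checked here is an assumption. It relies on BGZF files being valid multi-member gzip streams, which Python's `gzip` module can read.

```python
import gzip


def open_vcf(path):
    """Open a VCF as text, choosing the reader from the file extension.

    Hypothetical helper; the extensions checked here are an assumption.
    """
    if path.endswith((".gz", ".bgz")):
        # BGZF is a series of gzip members, so the gzip module can decode it.
        return gzip.open(path, "rt")
    return open(path, "r")
```

With data arriving on stdin there is no file name to check, which is why the compression has to be declared explicitly with --bgzip.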

Multiple samples: stacked or cartesian

It is common to perform variant calling on several related samples together, which yields VCFs with multiple sets of 'genotype' or FORMAT fields, one for each sample. By default, tidyvcf joins sample names to the names of the format fields with the underscore ('_') character - S1_GT S1_DP S2_GT S2_DP....

The --format-delim option allows changing the sample-format field delimiter:

tidyvcf -i test.vcf --format-delim '~' -o test.tsv
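
As a rough illustration of the naming scheme described above, this Python sketch builds the wide column names from sample names and FORMAT field names. The function `wide_format_columns` is hypothetical and only mirrors the described behaviour; it is not taken from tidyvcf's source.

```python
def wide_format_columns(samples, format_fields, delim="_"):
    """Build one column name per (sample, FORMAT field) pair, grouped by sample."""
    return [
        f"{sample}{delim}{field}"
        for sample in samples
        for field in format_fields
    ]


# Default delimiter, as in the S1_GT S1_DP S2_GT S2_DP example.
default_names = wide_format_columns(["S1", "S2"], ["GT", "DP"])

# Custom delimiter, analogous to --format-delim '~'.
tilde_names = wide_format_columns(["S1", "S2"], ["GT", "DP"], delim="~")
```

One plausible reason to change the delimiter is a sample name that itself contains underscores, which would make the column names ambiguous to split again downstream.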

Putting each sample's FORMAT fields in separate columns violates the tidy data principle. To avoid this, we can stack samples into rows, at the cost of repeating the static and INFO columns for each sample.

Stacking samples:

tidyvcf -i test.vcf --stack -o test_stacked.tsv
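
The trade-off can be sketched in Python as follows. This is a conceptual illustration rather than tidyvcf's code: `stack_samples` is a hypothetical helper, and the name of the sample column (`sample`) is an assumption, not necessarily what tidyvcf emits.

```python
def stack_samples(shared, per_sample, sample_column="sample"):
    """Turn one multi-sample record into one row per sample.

    `shared` holds the static and INFO columns, which are repeated in every row.
    `per_sample` maps each sample name to its FORMAT fields.
    """
    rows = []
    for sample, fields in per_sample.items():
        row = dict(shared)            # repeat the shared columns
        row[sample_column] = sample   # assumed column name
        row.update(fields)            # unprefixed FORMAT columns
        rows.append(row)
    return rows


shared = {"CHROM": "1", "POS": 1000, "REF": "A", "ALT": "T", "info_DP": 30}
per_sample = {
    "S1": {"GT": "0/1", "DP": 14},
    "S2": {"GT": "1/1", "DP": 16},
}
rows = stack_samples(shared, per_sample)
```

Each sample becomes its own row with a single GT and a single DP column, so the FORMAT columns no longer multiply with the number of samples.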

Info prefix

To avoid clashes in field names between INFO and FORMAT columns, INFO field names are prefixed with the string "info_" by default - this behaviour can be adjusted with the --info-prefix option:

tidyvcf -i test.vcf --info-prefix 'i' -c -o test.csv
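
To see why a prefix is needed, consider DP, which commonly appears both as an INFO field and as a FORMAT field. The Python sketch below is illustrative only; `prefix_info` is a hypothetical helper, not part of tidyvcf.

```python
def prefix_info(info, prefix="info_"):
    """Return a copy of the INFO fields with every name prefixed."""
    return {f"{prefix}{key}": value for key, value in info.items()}


info = {"DP": 30, "AF": 0.5}     # record-level INFO fields
fmt = {"GT": "0/1", "DP": 14}    # per-sample FORMAT fields

# Without a prefix, the two DP values collide and one of them is lost.
clashing = {**info, **fmt}

# With the prefix, both values survive under distinct names.
row = {**prefix_info(info), **fmt}
```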

VEP CSQ INFO field splitting

If your VCF is annotated with Ensembl's Variant Effect Predictor, you can use the -v/--vep-fields flag to extract those fields into individual columns:

tidyvcf -i vep.vcf.gz --vep-fields -o vep.tsv

By default, the output VEP column names are prefixed with "vep_" to avoid name collisions (for example CSQ/VAF and FMT/VAF) - this string can be customised with the --vep-prefix option:

tidyvcf -i vep.vcf.gz --vep-fields --vep-prefix '.' -o vep.tsv

Note: Only the first annotated transcript for a record is split; the others are bundled unsplit into an additional column named CSQ_other_transcripts.
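
The splitting described above can be sketched in Python. The sketch assumes the usual VEP layout: the CSQ header description ends with a `Format:` list of pipe-separated sub-field names, transcripts within a record are comma-separated, and sub-fields within a transcript are pipe-separated. The helpers `csq_field_names` and `split_csq` are hypothetical and are not taken from tidyvcf.

```python
import re


def csq_field_names(description):
    """Extract sub-field names from a CSQ header description ('... Format: A|B|C')."""
    match = re.search(r"Format:\s*(\S+)", description)
    if match is None:
        raise ValueError("no 'Format:' list found in CSQ description")
    return match.group(1).split("|")


def split_csq(csq_value, field_names, prefix="vep_"):
    """Split the first transcript into prefixed columns; keep the rest unsplit."""
    first, _, rest = csq_value.partition(",")
    row = {
        f"{prefix}{name}": value
        for name, value in zip(field_names, first.split("|"))
    }
    row["CSQ_other_transcripts"] = rest
    return row


description = (
    "Consequence annotations from Ensembl VEP. "
    "Format: Allele|Consequence|SYMBOL"
)
names = csq_field_names(description)
row = split_csq("T|missense_variant|BRCA2,T|intron_variant|BRCA2", names)
```

Here `row` holds vep_Allele, vep_Consequence and vep_SYMBOL for the first transcript, while the second transcript stays as a single unsplit string in CSQ_other_transcripts.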

Comparison with other software

Feature | tidyVCF | rbt vcf-to-txt | bcftools -f | gatk VariantsToTable
include all fields by default | | individually specified; currently no FILTER | individually specified | individually specified
include a subset of fields | | individually specified; currently no FILTER | individually specified | individually specified
long format | --stack | | |
pipeable | | | |
compressed input without external tool ?
bcf input ?
