6 releases
Version | Released |
---|---|
0.2.3 | Mar 12, 2023 |
0.2.2 | Apr 2, 2022 |
0.2.1 | Jan 30, 2022 |
0.1.1 | Nov 22, 2021 |
tidyVCF
tidyVCF is a small tool to convert VCF files to tidy tab/comma
separated tables, ideal for downstream analysis with R's tidyverse
or Julia's DataFrames ecosystems. All fields are included by
default, keeping the command line simple. tidyVCF is written in pure
Rust, relying on the excellent noodles-vcf crate developed by
@zaeleus and contributors.
Note: The tool works for me, but it isn't ready for production use yet. It's built on a fairly experimental API, it lacks proper testing, and it's quite brittle: it does not handle many of the VCF variants found in the wild, and it errors out gracelessly at even the most minor spec violations.
Install
Cargo
cargo install tidyvcf
Pre-built binaries
TBD.
Usage
Basic usage
CSV output with -c/--csv (the default is TSV):
tidyvcf -i test.vcf -c -o test.csv
BGZF compressed VCFs are detected by file extension and handled automatically:
tidyvcf -i test.vcf.gz -o test.tsv
When reading compressed data from stdin, use the --bgzip flag:
cat test.vcf.gz | tidyvcf --bgzip -o test.tsv
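The explicit flag is needed for stdin because a pipe has no file name to inspect. The following Python sketch illustrates that distinction; it is not tidyvcf's actual code. The helper name `open_vcf_text` and the accepted extensions (".gz", ".bgz") are assumptions made for illustration. It relies on the fact that BGZF files are valid multi-member gzip streams, so Python's `gzip` module can read them.

```python
import gzip
import io
import sys


def open_vcf_text(path=None, force_bgzip=False):
    """Return a text stream for a VCF given a path, or stdin when path is None.

    Hypothetical helper for illustration only; not part of tidyvcf.
    """
    if path is None:
        # A pipe has no extension to inspect, so the caller must say
        # whether the bytes are compressed (mirrors the --bgzip flag).
        raw = sys.stdin.buffer
        if force_bgzip:
            return io.TextIOWrapper(gzip.GzipFile(fileobj=raw), encoding="utf-8")
        return io.TextIOWrapper(raw, encoding="utf-8")
    # For named files the extension decides (assumed here: ".gz" or ".bgz").
    if force_bgzip or path.endswith((".gz", ".bgz")):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```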
Multiple samples: stacked or cartesian
It is common to perform variant calling on several related samples
together, which yields VCFs with multiple sets of 'genotype' or
FORMAT
fields, one for each sample. By default, tidyvcf
joins
sample names to the names of the format fields with the underscore
('_') character - S1_GT S1_DP S2_GT S2_DP...
.
The --format-delim option allows changing the sample-format field delimiter:
tidyvcf -i test.vcf --format-delim '~' -o test.tsv
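As a rough illustration of the naming scheme described above, the wide column names can be thought of as every sample name paired with every FORMAT key. This is a sketch of the idea only; the function name `wide_format_columns` is hypothetical and does not come from tidyvcf.

```python
def wide_format_columns(samples, format_keys, delim="_"):
    """Build one column name per (sample, FORMAT key) pair, sample-major."""
    return [f"{sample}{delim}{key}" for sample in samples for key in format_keys]


# Default delimiter, then the '~' delimiter from the command above.
print(wide_format_columns(["S1", "S2"], ["GT", "DP"]))
print(wide_format_columns(["S1", "S2"], ["GT", "DP"], delim="~"))
```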
This wide layout violates the tidy data principle, because the sample
is encoded in the column names rather than in a column of its own. To
avoid this, samples can be stacked into rows, at the cost of repeating
the static and INFO columns for each sample.
Stacking samples:
tidyvcf -i test.vcf --stack -o test_stacked.tsv
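To make the trade-off concrete, here is a small Python sketch of what stacking does to a single wide record: the per-sample columns collapse into shared FORMAT columns, and the remaining columns are repeated on each row. The column names used here, including the `sample` column, are hypothetical and may differ from what tidyvcf actually emits.

```python
def stack_samples(row, samples, format_keys, delim="_"):
    """Turn one wide record (dict) into one dict per sample.

    Columns that are not sample-specific are repeated on every output row.
    The "sample" column name is a hypothetical choice for this sketch.
    """
    sample_cols = {f"{s}{delim}{k}" for s in samples for k in format_keys}
    shared = {col: val for col, val in row.items() if col not in sample_cols}
    stacked = []
    for s in samples:
        out = dict(shared)
        out["sample"] = s
        for k in format_keys:
            out[k] = row[f"{s}{delim}{k}"]
        stacked.append(out)
    return stacked


wide = {"chrom": "1", "pos": 100, "info_DP": 30,
        "S1_GT": "0/1", "S1_DP": 14, "S2_GT": "1/1", "S2_DP": 16}
for r in stack_samples(wide, ["S1", "S2"], ["GT", "DP"]):
    print(r)
```

Note that the sketch only works cleanly because the INFO depth is already named `info_DP`; otherwise it would collide with the stacked `DP` column.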
Info prefix
To avoid clashes in field names between INFO and FORMAT columns,
INFO field names are prefixed with the string "info_" by default.
This behaviour can be adjusted with the --info-prefix option:
tidyvcf -i test.vcf --info-prefix 'i' -c -o test.csv
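The kind of clash the prefix guards against is easy to reproduce: DP, for instance, is commonly defined both as an INFO key and as a FORMAT key. The sketch below assumes the prefix is applied by plain string concatenation, which matches the description above but has not been checked against tidyvcf's source. The function name is hypothetical.

```python
def prefixed_info_columns(info_keys, prefix="info_"):
    """Prepend a prefix to every INFO key (plain concatenation assumed)."""
    return [f"{prefix}{key}" for key in info_keys]


info_keys = ["DP", "AF"]
format_keys = ["GT", "DP"]

# Keys present in both sets would produce duplicate column names.
clashes = sorted(set(info_keys) & set(format_keys))
print(clashes)
print(prefixed_info_columns(info_keys))
print(prefixed_info_columns(info_keys, prefix="i"))
```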
VEP CSQ INFO field splitting
If your VCF is annotated with Ensembl's Variant Effect Predictor, you
can use the -v/--vep-fields flag to extract those fields into individual
columns:
tidyvcf -i vep.vcf.gz --vep-fields -o vep.tsv
By default, the output VEP column names are prefixed with "vep_" to
avoid name collisions (for example CSQ/VAF and FMT/VAF). This
string can be customised with the --vep-prefix option:
tidyvcf -i vep.vcf.gz --vep-fields --vep-prefix '.' -o vep.tsv
Note: Only the first annotated transcript for a record is split; the
others are bundled unsplit into an additional column named
CSQ_other_transcripts.
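For readers unfamiliar with the CSQ layout, the following Python sketch shows the general shape of such a split. It is an illustration under stated assumptions, not tidyvcf's implementation. VEP lists the field names in the CSQ header description after "Format: ", separates fields within a transcript with "|", and separates transcripts with ",". The function names and the example annotation are made up for the sketch.

```python
def csq_field_names(header_description):
    """Extract field names from a CSQ header Description ("... Format: A|B|C")."""
    _, _, fmt = header_description.partition("Format: ")
    return fmt.strip().strip('"').split("|")


def split_csq(csq_value, field_names, prefix="vep_"):
    """Split the first transcript into columns; keep the rest as one raw string."""
    first, _, rest = csq_value.partition(",")
    columns = {f"{prefix}{name}": value
               for name, value in zip(field_names, first.split("|"))}
    columns["CSQ_other_transcripts"] = rest
    return columns


desc = "Consequence annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL"
names = csq_field_names(desc)
csq = "T|missense_variant|BRCA2,T|intron_variant|ZAR1L"
print(split_csq(csq, names))
```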
Comparison with other software
Feature | tidyVCF | rbt vcf-to-txt | bcftools -f | gatk VariantsToTable |
---|---|---|---|---|
include all fields | by default | individually specified; currently no FILTER | individually specified | individually specified |
include a subset of fields | ❌ | individually specified; currently no FILTER | individually specified | individually specified |
long format | --stack | ❌ | ❌ | ❌ |
pipeable | ✓ | ✓ | ✓ | ❌ |
compressed input without external tool | ✓ | ❌ | ✓ | ? |
bcf input | ❌ | ❌ | ✓ | ? |