Anyone working with genomic data from next or third generation sequencing studies is bound to work with VCF files. No, I’m not talking of the VCard format, but of the Variant Call Format! Bad jokes aside, VCF files are currently the standard for storing variation data, and come with a number of specifications.
VCF files are very common in genomic studies as they are the standard output of variant callers, such as bcftools mpileup, Clair3 and Sniffles2. They are also the primary outputs of several EPI2ME workflows, including wf-human-variation, wf-somatic-variation and wf-amplicon.
Depending on the variant caller and the variant type, your VCF will report different pieces of information for each variant site.
You might want to extract these values from your file in order to perform different types of operations and postprocessing manually.
For instance, you might want to count how many structural variants (SVs) of each type your favourite SV caller has found; this information is normally encoded with the SVTYPE
tag.
Or again, you might want to extract the depth of sequencing at each site, a value commonly reported in the DP
field, in order to help you visualize their distribution in a plot.
VCF files have become very popular thanks to their flexible structure, which lets developers encode many different metrics and values for each variant called. However, this can make VCFs very verbose and complex to query for individual fields or values. This article will guide you through the first steps of applying filters and queries to a VCF file, enabling you to extract those pieces of information you are really interested in. So, without further ado, let’s have a look at how to query a VCF file!
This post is a companion to our previous IPython notebook Introduction to Variant Call Format files. Readers unfamiliar with the basics of VCF files may wish to read through the notebook before reading this post.
Thanks to its popularity, the VCF file has an impressive array of software and libraries developed to achieve the most diverse set of tasks. Some popular software solutions for working with VCF files are:
The evergreen vcftools
comes with a lot of analysis options and functionalities, but it can be a little slow when it comes to querying and filtering large files.
SnpSift
boasts impressive capabilities, but as it is geared towards the outputs of its sibling annotation tool snpEff, this might limit applicability for general filtering tasks.
In many cases, bcftools
is the best tool for the job and this is why we are going to feature it in the tutorial below.
One more thing: while all these tools are command-line interface (CLI) based, their developers did a great job and simplifying their usage as much as possible. So don’t be discouraged by the black screen and give them a go!
bcftools
is a set of utilities for variant calling and manipulating VCFs and BCFs (the more efficient, binary version of VCF).
It is a valuable member of the bioinformatician’s toolkit, since it combines high speed with diverse features and great flexibility in complex situations.
Among the many available, some tools of note are:
bcftools view
- view, subset, filter and convert VCF or BCF files.bcftools filter
- apply filters to files.bcftools norm
- normalize sites, split multiallelic sites, check alleles against the reference, and left-align indels.bcftools annotate
- add or remove annotations to/from the INFO field.bcftools query
- Query fields and write the output in a user-defined format.
The built-in functionalities of bcftools
are further expanded through the use of plugins.
Now, going through all its goodies would make for a very long blog post, so let’s just focus on two very common bcftools
commands: view
and query
.For this blog post, we are using a small toy VCF, that you can easly download from the website from here. Once you have downloaded the file, you will be able to reproduce every command in this post and its outputs by simply copying and pasting them in you command line.
If you want to have a quick look at your VCF file you can simply use bcftools view
:
bcftools view myfile.vcf
This will print the contents of the VCF file:
##fileformat=VCFv4.3##fileDate=20090805##source=myImputationProgramV3.1##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>##phasing=partial##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles">##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">##ALT=<ID=INS,Description="Insertion">##FILTER=<ID=q10,Description="Quality below 10">##FILTER=<ID=s50,Description="Less than 50% of samples have data">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:320 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:420 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:220 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:320 29444234 . A ACGTTCAGAGA . PASS END=29444245;SVTYPE=INS;SVLEN=10 GT:DP 0/1:8 0/1:5 0/0:2
You might want to pipe the command into a less
command in order to scroll up and down the file:
bcftools view myfile.vcf | less
bcftools view
makes it easy to select a specific class of variants through the --types
option.
The variant classes available to choose from are:
snps
: single nucleotide polymorphismsindels
: insertions and deletionsmnps
: multi-nucleotide polymorphismsref
: reference variantsbnd
: breakend (i.e. start or end point of a translocation)others
: any other type of variantsTherefore, if you want to extract only the indels from our test VCF file, you can simply do:
# Use tail -5 to show only the last five linesbcftools view --types indels myfile.vcf | tail -5
This will return the following:
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">##bcftools_viewVersion=1.13+htslib-1.13+ds##bcftools_viewCommand=view -v indels myfile.vcf; Date=Mon Jan 29 11:16:16 2024#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000320 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:320 29444234 . A ACGTTCAGAGA . PASS END=29444245;SVTYPE=INS;SVLEN=10 GT:DP 0/1:8 0/1:5 0/0:2
If you want to skip the header entirely, you can easily discard it with the -H
option.
Other convenient filtering options available with the view
command include:
-m
) or at most (-M
) a given number of alleles,-T
,-O z
, or use the more efficient BCF format by giving -O b
.
You can see all the options for view
on the bcftools
website.While filtering variants is useful when visualising some of the outputs, it doesn’t solve a major problem of the VCF format: it can be very verbose and detailed.
The INFO
field in particular can get confusingly large very easily as many different metrics are stored here, ranging from sequencing statistics to functional annotations.
Parsing these fields can be tricky due to the structure of the file format, but bcftools query
gives us a powerful, yet simple, way to extract the pieces of information that we want.
Let’s try with a hands on example. To only show a subset of some fields, we can use the --format
option.
The option can alternatively be provided as -f
; we will use --format
in the examples below for clarity.
We may wish to extract only the variants’ positions and alleles (both reference and alternate) and separate them by a single whitespace.
We can do this with the command:
bcftools query --format '%CHROM %POS %REF %ALT\n' myfile.vcf
We refer to a field by prefixing its name with the %
special character. The field names and descriptions can be found in the last line of the VCF header, in this example we have:
\n
(new-line) character at the end of the line: if we do not add it the software will save everything as a single, very long line.
So don’t forget to add it!
The command will return the following output:20 14370 G A20 17330 T A20 1110696 A G,T20 1230237 T .20 1234567 GTC G,GTCT20 29444234 A ACGTTCAGAGA
We can add a header simply by adding the --print-header
option (or -H
for short), facilitating the interpretation of the fields:
bcftools query --print-header --format '%CHROM %POS %REF %ALT\n' myfile.vcf# [1]CHROM [2]POS [3]REF [4]ALT20 14370 G A20 17330 T A20 1110696 A G,T20 1230237 T .20 1234567 GTC G,GTCT20 29444234 A ACGTTCAGAGA
This output is space-separated, but can be changed to be tab-separated by changing the command line to separate the individual fields with \t
, instead of
:
bcftools query --print-header --format '%CHROM\t%POS\t%REF\t%ALT\n' myfile.vcf# [1]CHROM [2]POS [3]REF [4]ALT20 14370 G A20 17330 T A20 1110696 A G,T20 1230237 T .20 1234567 GTC G,GTCT20 29444234 A ACGTTCAGAGA
The INFO
field of VCF files, can contain arbitrary user defined information.
It contains information in key=value
pairs; the information keys are usually detailed in the header section of the file.
The INFO
field can be queried just like any other field by using the %
special character:
bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' myfile.vcf20 14370 G A NS=3;DP=14;AF=0.5;DB;H220 17330 T A NS=3;DP=11;AF=0.01720 1110696 A G,T NS=2;DP=10;AF=0.333,0.667;AA=T;DB20 1230237 T . NS=3;DP=13;AA=T20 1234567 GTC G,GTCT NS=3;DP=9;AA=G20 29444234 A ACGTTCAGAGA END=29444245;SVTYPE=INS;SVLEN=10
While this is useful, it displays the entirety of the INFO
field.
If we want to extract the depth of sequencing of each site in the VCF, we first need to check the file’s header lines (starting with ##
) to identify the corresponding INFO flag.
Let’s have a look at the right line:
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
This header line tells us that within the INFO
field:
DP
(ID=DP
),Number=1
),Type=Integer
), andDescription="Total Depth"
).
Now that we have identified the field we are interested in, we can retrieve it for each site in our VCF file by simply referring to it as %INFO/DP
(which can be read: extract DP
within INFO
):bcftools query --format '%INFO/DP\n' myfile.vcf141110139.
Each line will represent a different depth for a different entry in the VCF file, returing .
when a value is missing. Easy, right?
This can help you to retrieve the fields you’re interested in from INFO, creating tables that can be used in downstream analyses and processes.
VCF files can contain information for more than one biological sample. The first eight columns of a VCF file refer to the genomic site, meaning that the fields are not referring to any sample specifically. The sample-level information is stored in the subsequent columns:
FORMAT
column (column 9) provides the meta information detailing the values presented for each sample.NA00001
, NA00002
and NA00003
:bcftools view -h Test/myfile.vcf | tail -1#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
We can look at the FORMAT and sample fields by querying them directly:
bcftools query -f '%FORMAT\t[]\n' myfile.vcfGT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3:.GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4:.GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2:.GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3GT:DP 0/1:8 0/1:5 0/0:2
The first column in the above reports the order of the data in each sample field, each field is separated by a :
.
Each of these values is detailed in the header:
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
For instance, the first sample (column 10) has the values GT=0|0
, GQ=48
, DP=1
and HQ=51,51
for the first sample.
Values for one or more samples can be extracted by using their identifiers.
For example, we can extract FORMAT
and the fields for NA00001
and NA00003
samples by using the --samples
option (-s
for short) followed by the comma-separated list of sample identifiers:
bcftools query --samples NA00001,NA00003 --format '%FORMAT\t[]\n' myfile.vcfGT:GQ:DP:HQ 0|0:48:1:51,51 1/1:43:5:.,.GT:GQ:DP:HQ 0|0:49:3:58,50 0/0:41:3:.GT:GQ:DP:HQ 1|2:21:6:23,27 2/2:35:4:.GT:GQ:DP:HQ 0|0:54:7:56,60 0/0:61:2:.GT:GQ:DP 0/1:35:4 1/1:40:3GT:DP 0/1:8 0/0:2
Note: the
-s
option works also forbcftools view
, and can be replaced with-S
if you want to provide a text file of the list of individuals to consider.
You might have noticed the empty brackets ([]
) in the command.
This is an instruction for bcftools
to iterate across each sample (subject to sample selection with -s
) and return the requested values.
In the previous example, we did not specify any field, and that led bcftools
to return everything for each sample.
Now if we want to extract only the genotype (GT
) value for each sample, we can provide this within the []
(remember to also provide a delimiter within the brackets, or everything will be concatenated!):
bcftools query --format '[%GT\t]\n' myfile.vcf0|0 1|0 1/10|0 0|1 0/01|2 2|1 2/20|0 0|0 0/00/1 0/2 1/10/1 0/1 0/0
This command will return simply the genotype (GT
field) for each of the samples in the VCF file.
To access multiple fields, for example both genotype (GT
) and depth (DP
) for each sample as a set of comma-separated fields (again with \t
to show how we want the output to display):
bcftools query -f '[%GT,%DP\t]\n' myfile.vcf0|0,1 1|0,8 1/1,50|0,3 0|1,5 0/0,31|2,6 2|1,0 2/2,40|0,7 0|0,4 0/0,20/1,4 0/2,2 1/1,30/1,8 0/1,5 0/0,2
This command extracts all the data for each sample as a single column. Alternatively it is possible to structure the output with columns containing each requested data field for all samples:
bcftools query -f '[%GT,]\t[%DP,]\n' myfile.vcf0|0,1|0,1/1, 1,8,5,0|0,0|1,0/0, 3,5,3,1|2,2|1,2/2, 6,0,4,0|0,0|0,0/0, 7,4,2,0/1,0/2,1/1, 4,2,3,0/1,0/1,0/0, 8,5,2,
The first column contains the comma-separated list of genotypes for each sample, whilst the second contains the comma-separated list of sequencing depths for each sample.
One of the main advantages of bcftools
is the possibility of combining multiple commands to achieve great flexibility.
This might involve combining some filtering with bcftools view
followed by bcftools query
.
For instance, if we want to extract chromosome, position, and alleles for the indels in our VCF, we can write:
bcftools view --types indels myfile.vcf | bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\n'20 1234567 GTC G,GTCT20 29444234 A ACGTTCAGAGA
The view
command here first extracts indels (i.e. insertions and deletions) from the file, before the query
command produces formatted output.
VCF annotations can be very complex and it is not unusual for an INFO
value to include multiple subvalues.
An example of this is the functional annotation produced by tools such as the Variant Effect Predictor (VEP).
For these specific cases you might want to check if there is a plugin that can help you deal with these special use cases.
For instance, the annotations from VEP can be queried by using the split-vep plugin.
We hope the above was helpful and clarified how to query your VCF file. To recap:
bcftools view
can be used to view, convert and filter your VCF file.bcftools query
can be used to extract and format desired data:%
special character to target a field/
to target items in INFO
columns, for example INFO/DP
[...]
to retrieve per-sample metrics, for example [%GT\t]