Anyone working with genomic data from next or third generation sequencing studies is bound to work with VCF files. No, I’m not talking of the VCard format, but of the Variant Call Format! Bad jokes aside, VCF files are currently the standard for storing variation data, and come with a number of specifications.

VCF files are very common in genomic studies as they are the standard output of variant callers, such as bcftools mpileup, Clair3 and Sniffles2. They are also the primary outputs of several EPI2ME workflows, including wf-human-variation, wf-somatic-variation and wf-amplicon.

Depending on the variant caller and the variant type, your VCF will report different pieces of information for each variant site. You might want to extract these values from your file in order to perform different types of operations and postprocessing manually. For instance, you might want to count how many structural variants (SVs) of each type your favourite SV caller has found; this information is normally encoded with the SVTYPE tag. Or again, you might want to extract the depth of sequencing at each site, a value commonly reported in the DP field, in order to help you visualize their distribution in a plot.

VCF files have become very popular thanks to their flexible structure, which lets developers encode many different metrics and values for each variant called. However, this can make VCFs very verbose and complex to query for individual fields or values. This article will guide you through the first steps of applying filters and queries to a VCF file, enabling you to extract those pieces of information you are really interested in. So, without further ado, let’s have a look at how to query a VCF file!

This post is a companion to our previous IPython notebook Introduction to Variant Call Format files. Readers unfamiliar with the basics of VCF files may wish to read through the notebook before reading this post.

The toolkit and bcftools

Thanks to its popularity, the VCF file has an impressive array of software and libraries developed to achieve the most diverse set of tasks. Some popular software solutions for working with VCF files are:

vcftools - the eponymous (but likely not best these days) tool.
SnpSift - from the authors of the SnpEff tool.
bcftools - part of the htslib family of tools.

The evergreen vcftools comes with a lot of analysis options and functionalities, but it can be a little slow when it comes to querying and filtering large files. SnpSift boasts impressive capabilities, but as it is geared towards the outputs of its sibling annotation tool snpEff, this might limit applicability for general filtering tasks. In many cases, bcftools is the best tool for the job and this is why we are going to feature it in the tutorial below.

One more thing: while all these tools are command-line interface (CLI) based, their developers did a great job and simplifying their usage as much as possible. So don’t be discouraged by the black screen and give them a go!

bcftools is a set of utilities for variant calling and manipulating VCFs and BCFs (the more efficient, binary version of VCF). It is a valuable member of the bioinformatician’s toolkit, since it combines high speed with diverse features and great flexibility in complex situations. Among the many available, some tools of note are:

bcftools view - view, subset, filter and convert VCF or BCF files.
bcftools filter - apply filters to files.
bcftools norm - normalize sites, split multiallelic sites, check alleles against the reference, and left-align indels.
bcftools annotate - add or remove annotations to/from the INFO field.
bcftools query - Query fields and write the output in a user-defined format. The built-in functionalities of bcftools are further expanded through the use of plugins. Now, going through all its goodies would make for a very long blog post, so let’s just focus on two very common bcftools commands: view and query.

Visualize and filter a VCF file

For this blog post, we are using a small toy VCF, that you can easly download from the website from here. Once you have downloaded the file, you will be able to reproduce every command in this post and its outputs by simply copying and pasting them in you command line.

If you want to have a quick look at your VCF file you can simply use bcftools view:

bcftools view myfile.vcf

This will print the contents of the VCF file:

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##ALT=<ID=INS,Description="Insertion">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
20  14370   rs6054257   G   A   29  PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20  17330   .   T   A   3   q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
20  1110696 rs6040355   A   G,T 67  PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ 1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
20  1230237 .   T   .   47  PASS    NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
20  1234567 microsat1   GTC G,GTCT  50  PASS    NS=3;DP=9;AA=G  GT:GQ:DP    0/1:35:4    0/2:17:2    1/1:40:3
20  29444234    .   A   ACGTTCAGAGA .   PASS    END=29444245;SVTYPE=INS;SVLEN=10    GT:DP   0/1:8   0/1:5   0/0:2

You might want to pipe the command into a less command in order to scroll up and down the file:

bcftools view myfile.vcf | less

bcftools view makes it easy to select a specific class of variants through the --types option. The variant classes available to choose from are:

snps: single nucleotide polymorphisms
indels: insertions and deletions
mnps: multi-nucleotide polymorphisms
ref: reference variants
bnd: breakend (i.e. start or end point of a translocation)
others: any other type of variants

Therefore, if you want to extract only the indels from our test VCF file, you can simply do:

# Use tail -5 to show only the last five lines
bcftools view --types indels myfile.vcf | tail -5

This will return the following:

##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##bcftools_viewVersion=1.13+htslib-1.13+ds
##bcftools_viewCommand=view -v indels myfile.vcf; Date=Mon Jan 29 11:16:16 2024
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
20      1234567 microsat1       GTC     G,GTCT  50      PASS    NS=3;DP=9;AA=G  GT:GQ:DP        0/1:35:4        0/2:17:2        1/1:40:3
20      29444234        .       A       ACGTTCAGAGA     .       PASS    END=29444245;SVTYPE=INS;SVLEN=10        GT:DP   0/1:8   0/1:5   0/0:2

If you want to skip the header entirely, you can easily discard it with the -H option.

Other convenient filtering options available with the view command include:

filter variants that have at least (-m) or at most (-M) a given number of alleles,
extract or exclude variants in regions defined in a BED file, provided with -T,
compress the output with using gzip compressiong with -O z, or use the more efficient BCF format by giving -O b. You can see all the options for view on the bcftools website.

Querying a VCF file

While filtering variants is useful when visualising some of the outputs, it doesn’t solve a major problem of the VCF format: it can be very verbose and detailed. The INFO field in particular can get confusingly large very easily as many different metrics are stored here, ranging from sequencing statistics to functional annotations. Parsing these fields can be tricky due to the structure of the file format, but bcftools query gives us a powerful, yet simple, way to extract the pieces of information that we want.

Let’s try with a hands on example. To only show a subset of some fields, we can use the --format option. The option can alternatively be provided as -f; we will use --format in the examples below for clarity. We may wish to extract only the variants’ positions and alleles (both reference and alternate) and separate them by a single whitespace. We can do this with the command:

bcftools query --format '%CHROM %POS %REF %ALT\n' myfile.vcf

We refer to a field by prefixing its name with the % special character. The field names and descriptions can be found in the last line of the VCF header, in this example we have:

CHROM - the chromosome.
POS - the position on the chromosome where the variant occurs.
REF - the allele contained in the reference sequence.
ALT - alternative allele(s) which may occur in a sample. Notice that we add a \n (new-line) character at the end of the line: if we do not add it the software will save everything as a single, very long line. So don’t forget to add it! The command will return the following output:

20 14370 G A
20 17330 T A
20 1110696 A G,T
20 1230237 T .
20 1234567 GTC G,GTCT
20 29444234 A ACGTTCAGAGA

We can add a header simply by adding the --print-header option (or -H for short), facilitating the interpretation of the fields:

bcftools query --print-header --format '%CHROM %POS %REF %ALT\n' myfile.vcf
# [1]CHROM [2]POS [3]REF [4]ALT
20 14370 G A
20 17330 T A
20 1110696 A G,T
20 1230237 T .
20 1234567 GTC G,GTCT
20 29444234 A ACGTTCAGAGA

This output is space-separated, but can be changed to be tab-separated by changing the command line to separate the individual fields with \t, instead of :

bcftools query --print-header --format '%CHROM\t%POS\t%REF\t%ALT\n' myfile.vcf
# [1]CHROM  [2]POS  [3]REF  [4]ALT
20      14370   G       A
20      17330   T       A
20      1110696 A       G,T
20      1230237 T       .
20      1234567 GTC     G,GTCT
20      29444234        A       ACGTTCAGAGA

Extracting fields from the INFO field

The INFO field of VCF files, can contain arbitrary user defined information. It contains information in key=value pairs; the information keys are usually detailed in the header section of the file.

The INFO field can be queried just like any other field by using the % special character:

bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' myfile.vcf
20      14370   G       A       NS=3;DP=14;AF=0.5;DB;H2
20      17330   T       A       NS=3;DP=11;AF=0.017
20      1110696 A       G,T     NS=2;DP=10;AF=0.333,0.667;AA=T;DB
20      1230237 T       .       NS=3;DP=13;AA=T
20      1234567 GTC     G,GTCT  NS=3;DP=9;AA=G
20      29444234        A       ACGTTCAGAGA     END=29444245;SVTYPE=INS;SVLEN=10

While this is useful, it displays the entirety of the INFO field. If we want to extract the depth of sequencing of each site in the VCF, we first need to check the file’s header lines (starting with ##) to identify the corresponding INFO flag. Let’s have a look at the right line:

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

This header line tells us that within the INFO field:

there is a variable named DP (ID=DP),
that can have a single value (Number=1),
which is an integer (Type=Integer), and
refers to the “Total Depth” (Description="Total Depth"). Now that we have identified the field we are interested in, we can retrieve it for each site in our VCF file by simply referring to it as %INFO/DP (which can be read: extract DP within INFO):

bcftools query --format '%INFO/DP\n' myfile.vcf
14
11
10
13
9
.

Each line will represent a different depth for a different entry in the VCF file, returing . when a value is missing. Easy, right? This can help you to retrieve the fields you’re interested in from INFO, creating tables that can be used in downstream analyses and processes.

Extracting fields for each sample

VCF files can contain information for more than one biological sample. The first eight columns of a VCF file refer to the genomic site, meaning that the fields are not referring to any sample specifically. The sample-level information is stored in the subsequent columns:

The FORMAT column (column 9) provides the meta information detailing the values presented for each sample.
Column 10 onwards contains the per-sample information, one column per sample. Each sample in the VCF file is named with a unique identifier, reported in the last line of the header. For instance, our VCF file contains three separate individuals named NA00001, NA00002 and NA00003:

bcftools view -h Test/myfile.vcf | tail -1
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003

We can look at the FORMAT and sample fields by querying them directly:

bcftools query -f '%FORMAT\t[]\n' myfile.vcf
GT:GQ:DP:HQ     0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
GT:GQ:DP:HQ     0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3:.
GT:GQ:DP:HQ     1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4:.
GT:GQ:DP:HQ     0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2:.
GT:GQ:DP        0/1:35:4        0/2:17:2        1/1:40:3
GT:DP   0/1:8   0/1:5   0/0:2

The first column in the above reports the order of the data in each sample field, each field is separated by a :. Each of these values is detailed in the header:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

For instance, the first sample (column 10) has the values GT=0|0, GQ=48, DP=1 and HQ=51,51 for the first sample.

Values for one or more samples can be extracted by using their identifiers. For example, we can extract FORMAT and the fields for NA00001 and NA00003 samples by using the --samples option (-s for short) followed by the comma-separated list of sample identifiers:

bcftools query --samples NA00001,NA00003 --format '%FORMAT\t[]\n' myfile.vcf
GT:GQ:DP:HQ     0|0:48:1:51,51  1/1:43:5:.,.
GT:GQ:DP:HQ     0|0:49:3:58,50  0/0:41:3:.
GT:GQ:DP:HQ     1|2:21:6:23,27  2/2:35:4:.
GT:GQ:DP:HQ     0|0:54:7:56,60  0/0:61:2:.
GT:GQ:DP        0/1:35:4        1/1:40:3
GT:DP   0/1:8   0/0:2

Note: the -s option works also for bcftools view, and can be replaced with -S if you want to provide a text file of the list of individuals to consider.

You might have noticed the empty brackets ([]) in the command. This is an instruction for bcftools to iterate across each sample (subject to sample selection with -s) and return the requested values. In the previous example, we did not specify any field, and that led bcftools to return everything for each sample.

Now if we want to extract only the genotype (GT) value for each sample, we can provide this within the [] (remember to also provide a delimiter within the brackets, or everything will be concatenated!):

bcftools query --format '[%GT\t]\n' myfile.vcf
0|0     1|0     1/1
0|0     0|1     0/0
1|2     2|1     2/2
0|0     0|0     0/0
0/1     0/2     1/1
0/1     0/1     0/0

This command will return simply the genotype (GT field) for each of the samples in the VCF file. To access multiple fields, for example both genotype (GT) and depth (DP) for each sample as a set of comma-separated fields (again with \t to show how we want the output to display):

bcftools query -f '[%GT,%DP\t]\n' myfile.vcf
0|0,1   1|0,8   1/1,5
0|0,3   0|1,5   0/0,3
1|2,6   2|1,0   2/2,4
0|0,7   0|0,4   0/0,2
0/1,4   0/2,2   1/1,3
0/1,8   0/1,5   0/0,2

This command extracts all the data for each sample as a single column. Alternatively it is possible to structure the output with columns containing each requested data field for all samples:

bcftools query -f '[%GT,]\t[%DP,]\n' myfile.vcf
0|0,1|0,1/1,    1,8,5,
0|0,0|1,0/0,    3,5,3,
1|2,2|1,2/2,    6,0,4,
0|0,0|0,0/0,    7,4,2,
0/1,0/2,1/1,    4,2,3,
0/1,0/1,0/0,    8,5,2,

The first column contains the comma-separated list of genotypes for each sample, whilst the second contains the comma-separated list of sequencing depths for each sample.

Piping commands

One of the main advantages of bcftools is the possibility of combining multiple commands to achieve great flexibility. This might involve combining some filtering with bcftools view followed by bcftools query. For instance, if we want to extract chromosome, position, and alleles for the indels in our VCF, we can write:

bcftools view --types indels myfile.vcf | bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\n'
20      1234567 GTC     G,GTCT
20      29444234        A       ACGTTCAGAGA

The view command here first extracts indels (i.e. insertions and deletions) from the file, before the query command produces formatted output.

bcftools plugins

VCF annotations can be very complex and it is not unusual for an INFO value to include multiple subvalues. An example of this is the functional annotation produced by tools such as the Variant Effect Predictor (VEP). For these specific cases you might want to check if there is a plugin that can help you deal with these special use cases. For instance, the annotations from VEP can be queried by using the split-vep plugin.

Summary

We hope the above was helpful and clarified how to query your VCF file. To recap:

bcftools view can be used to view, convert and filter your VCF file.
bcftools query can be used to extract and format desired data:
- use the % special character to target a field
- use / to target items in INFO columns, for example INFO/DP
- use [...] to retrieve per-sample metrics, for example [%GT\t]