Introduction to Variant Call Format (.vcf) files

The tutorial provides a short introduction to Variant Call Format files used in bioinformatics to store differences between the DNA sequence of a sample and that of a reference sequence.

Methods used in this tutorial include:

The computational requirements for this tutorial are:

⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

Introduction

This tutorial aims to elucidate the information stored with a Variant Call Format (VCF) file, and how such files can be read, or parsed, within the Python programming language and on the command line.

The goals from this tutorial include:

The tutorial includes a sample VCF file output by medaka from whole genome sequencing of the Genome in a Bottle (GIAB) HG002 sample.

Getting started

Before anything else we will create and set a working directory:

Sample Data

In order to provide a concrete example of handling a long-read VCF file this tutorial is provided with an example file produced by Oxford Nanopore Technologies' consensus and variant calling program Medaka.

To download the sample file we run the linux command wget. To execute the command click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

Data entry

Having downloaded the sample data we need to provide the filepaths as input to the notebook.

The form can be used to enter the filenames of your inputs.

Executing the above form will have checked input files and attempted to create an index file for the specified VCF file. We will come back to index file later in the tutorial.

VCF files

Before discussing how to read VCF files in Python we will first review their structure. The formal specification for VCF files can be found here. This is a thorough technical document detailing many different use cases. Here we will focus on a brief introduction to the common elements of all VCF files athough it is rather a turgid document.

To understand a VCF file we must first recall how such a file has been produced. Classically the file will have been created by analysing alignments of sequencing reads to a reference database containing one or more reference sequences. VCF files refer to these reference sequences variously as contigs or CHROMs.

Variant call format files are a human readable, text-based file format. Very often they will be compressed using a variant of gzip compression. Aside from the header sections they are tabular in nature. The three sections of a VCF file are:

  1. Meta-information lines (prefixed with ##) containing information helpful for interpreting the rest of the file,
  2. The header line (prefixed with #) which labels the columns
  3. Data lines containing the information regarding variants.

For example the start of the medaka VCF looks like:

We see that the meta-information lines are all of the form:

##key=value

The first of these is always a line describing the specific version of the format of the file, in this case version 4.1 of VCF (fileformat=VCFv4.1). Next we have a line stating the version of Medaka used to produce the VCF file, this line is bespoke to VCF files produced by Medaka. The third meta-information line contig=<ID=chr1> describes the fact that there could be variants in the file lying on the sequence named chr1. There will be as many such lines as there are unique sequences in the reference database to which the VCF file refers.

We then have a set of lines which look like:

##INFO=<ID=...,Description="...">

These lines describe the contents and datatypes of the INFO field of the data section of the file, we will return to these shortly. Following the INFO meta-information descriptors we have a similar set of FORMAT descriptors of the form:

##FORMAT=<ID=,...,Description="...">

Similarly these describe the contents of the sample columns in the data section: a VCF file can have one of more sample columns, the Medaka VCF has a single sample column named (appropriately, or confusingly) SAMPLE.

A final meta-infomation field CL describes the command-line program and options used to produce the VCF file.

We see therefore that the meta-information section although optional in the VCF file specification, is rather crucial for understanding the contents of the rest of the file. Some of the information in the header is for the benefit of a human audience, whilst other parts are more to aid machine parsing.

The VCF data table

In the above prelude we have introduced the meta-information section of VCF files. We will now move to examine the data section of VCF files.

Aside from the header information a VCF file is simple a tab-delimited data table. Each row of the table describes a possible variant in one or more of the analysed samples, with the columns being:

Name Brief description (see the specification for details).
1 CHROM The name of the sequence (typically a chromosome) on which the variation is being called (contig from the meta-information). This sequence is usually known as 'the reference sequence'.
2 POS The 1-based position of the variation on the given sequence.
3 ID The identifier of the variation, e.g. a dbSNP rs identifier, or if unknown a ".". Multiple identifiers should be separated by semi-colons without white-space. Medaka will output a "." here.
4 REF The reference base (or bases in the case of an indel) at the given position on the given reference sequence.
5 ALT The list of alternative alleles at this position, i.e. the sequence of "the variant"
6 QUAL A quality score associated with the inference of the given alleles.
7 FILTER A flag indicating which of a given set of filters the variation has passed.
8 INFO An extensible list of key-value pairs describing the variation. Multiple fields are separated by semicolons with optional values in the format: =[,data]. The description of each field is given in the meta-information section.
9 FORMAT An (optional) extensible list of fields for describing the samples. The description of each field is given in the meta-information section.
+ SAMPLEs For each (optional) sample described in the file, values are given for the fields listed in FORMAT.

Let's rexamine the first three lines of the Medaka VCF with the above field descriptions in mind:

So we have three rows, corresponding to three variants at positions 10108, 10177, and 10257. Taking the last of these first, we see a reference base of A whereas the alternative (ALT) base is suggested as C. The variant therefore corresponds to a substitution of A for C in this sample. The quality (QUAL) of this variant is 0.799, this value is a scaled probability:

−10 log10 [Prob(call in ALT is wrong)]

Note however that for the case of Medaka this interpretation should be taken with a pinch of salt, the scores output by the Medaka neural network have not been empirically calibrated to error rates of calls. The FILTER field for a VCF output directly by Medaka will contain either PASS or lowqual. For diploid variant calling Medaka performs a simple thresholding of variant qualities to mask possible false positive variants.

The final three fields of the VCF table are rather more dense in information. First we have the INFO column. This column very often constitutes a sub-table within the main table: it is comprised of key=value pairs. A description of each key should be given in the meta-information section as described above. Here Medaka lists a pos1/2 item which describes a position with respect to the calling of a variant in one of the two haplotypes. This is mainly an item for debugging. Secondly we have items of the form q1/2: these are similar to the main QUAL field but are expressed independently for the two haplotypes of the reference genome.

Moving on to the FORMAT field, this is a specifier for how the information in the remaining (one or more) sample columns is stored. To understand FORMAT fully we must again refer back to the meta-information section of the file. In the Medaka example GT:GQ specifies that the sample columns contain first a genotype (GT), followed by a genotype quality score (GQ).

The genotypes are encoded as allele values separated by either / or |. A / indicates the genotype is unphased whereas | that the genotype is phased. Two or more variants are said to be phased if the relation between the alleles of the genotypes is known across variants, else the variants are unphased. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For our substitution variant above we have a genotype 1|0 meaning a phased variant, the first allele is from the first ALT sequence C (the only ALT). The second allele is the reference base A. We therefore have a heterozygous variant indicated.

Returning to the first and second variants in from the Medaka VCF we have the substitutions C>CT and A>AC respectively. In both cases this substitution of a single base with multiple bases means that we have an insertion variant. Likewise a deletion variant would be specified by, for example, AG>A meaning a deletion of a G at the position after that specified by POS. Note that special cases handle how insertion and deletions at the of reference sequences are described, see the specification for more information. The first variant is a homozygous alternative variant 1|1 while the second is heterozygous 0|1 with the first allele being the reference base.

File compression and indexing

Very often VCF files will not be stored as plain text but compressed to a binary stream. To compress a VCF file it is recommended to use the program bgzip rather than gzip. This program performs the compression in a manner which is both compatible with gzip but crucially can be indexed. The indexing allows rapid retrieval of variants spanning a particular reference location, value of CHROM and POS.

Let's demonstrate how to compress a VCF file, by first taking a sample of the Medaka VCF:

The file snippet.vcf is now a plain text file. Let's recompressed this extract using bgzip:

to produce the file snippet.vcf.gz which can be indexed using the tabix program:

The utility of the compression is seen by observing the filesize of the files:

Let's try to index a gzip-compressed file to observe what happens:

Tabix tells us that the file has not been compressed with bgzip and consequently it cannot create an index file.

Manipulating VCF files

The section above has given an outline to the data contained within a VCF file and how the file is arranged. In this section we will highlight several methods for parsing the data contained within a VCF. We will first show how to read such files in Python, and then introduce the standard bcftools suite of programs.

Reading VCF files with Python

There are a variety of Python libraries that aim to provide reasonably intuitive access to the data contained within a VCF file. Here we will briefly introduce the specialist pysam.VariantFile parser before using the Pandas library to manipulate the file as a basic tabular format.

Using pysam

Pysam provides a convenient Python implementation of the htslib suite of C-libaries for handling a core set of bioinformatics file formats including VCF files. The library understands the layout of a VCF file and provides a set of functions and code objects representing the data stored with the file. Much of the below is based simply on the corresponding section of the Pysam documentation.

Let's jump right in. To iterate through the first few records in a VCF file using pysam and print the reference name, the reference position and the mutation implied by the variant record we can perform the following:

Note how as discussed above the non-reference, alternative sequence (ALT) includes surrounding bases in the case of insertion and deletion variants.

For each of the columns of a VCF file Pysam creates an attribute of the variant object. This makes it particularly easy to retrieve information. For example in order to filter variants to the more confident calls we can examine the QUAL field of the variant records. Let's print out the INFO field for higher quality variants (and report the occurence of lower quality variants). Pysam parses the the information stored within the INFO column into a VariantRecordInfo object which can be understood as a Python dictionary:

Viewing information from the header of the file is particularly easy:

The power of using Pysam to parse VCF files should not be underestimated as is provides a reliable, standards compliant way of accessing data. Not withstanding this fact we will now look at a second method for parsing VCF files in Python.

Using pandas

We could continue to manipulate VCF files using pysam but for interaction with the Python ecosystem it can be easier to treat simply the VCF file as a standard data table using pandas. This allows us to leverage optimised functions for handling tabular data.

For example the code below will parse the VCF in a way to allow easy plotting:

We can use this function to read our Medaka example VCF file and then create some basic plots:

Using bcftools

Above we have shown two Python-based methods for interacting with VCF files. The first of these is useful for inspecting each variant in turn, whereas the second is useful for generic programming and creating views across variants.

Finally we will introduce bcftools, a set of command-line tools for manipulating VCF files and their binary cousins: BCF files. Perhaps one of the most useful parts of bcftools with which to get started is the bcftools query command. The query command can be used to format a VCF in a user defined fashion, whilst performing filtering.

To create a:

we can run the following incantation:

This is just one small example. The expressions section of the documention is a useful resource for understanding how to query and filter VCF files.

Summary

In this notebook we have introduced the Variant Call Format with an examplar file from the Medaka consensus and variant calling program. We have outlined the contents of such files and how they can be intepreted with a selection of common software packages.

The code tools presented here can be run on any dataset from an Oxford Nanopore Technologies' device. The code will run within the EPI2ME Labs notebook server environment.