Human Variant Calling with Medaka

The tutorial is intended as an introduction to single nucleotide polymorphism (SNP) calling with medaka for targeted sequencing. For calling variants with whole genome sequencing please refer to the medaka documentation.

Computational requirements for this tutorial include:

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select Edit > Clear all outputs before using the notebook to analyse your own data.

Introduction

Medaka was designed to call small variants (substitution, deletion, and insertion events) in whole genome sequencing. This tutorial aims to demonstrate how to use Medaka to call substitution variants in human targeted sequencing. For completeness we will also attempt to call insertion and deletion (indel) variants, although as we will see this stretches the methodology employed by Medaka when used with targeted, and particularly amplicon, data.

The goals of this tutorial include:

The tutorial is supplied with a sample dataset containing reads spanning multiple targets from a human cell line.

Getting started

The workflow below requires a single folder containing .fastq files from an Oxford Nanopore Technologies' sequencing device, or a single such file. Compressed or uncompressed files may be used. In addition, a reference sequence .fasta file and a .bed file describing target regions within the reference sequence are required.

Before anything else we will create and set a working directory:
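
A minimal sketch of such a cell is given below; the directory name is an arbitrary choice:

```python
# create a working directory and move into it
# (the folder name "medaka_tutorial" is an arbitrary choice)
!mkdir -p medaka_tutorial
%cd medaka_tutorial
```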

Sample Data

To demonstrate the workflow below, a sample dataset is included with this tutorial. The dataset comprises a manageable set of reads spanning various targets from an experiment using the Oxford Nanopore Technologies' targeted sequencing protocol. We will also download a .bed file describing the locations of the targets in a tabular text format.

To download the sample files we run the Linux command wget. To execute the command click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.
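
A sketch of the download cell is shown below. The URLs are placeholders standing in for the links supplied with the tutorial:

```python
# download the sample reads and the target regions .bed file
# (placeholder URLs: substitute the links supplied with the tutorial)
!wget -O sample_reads.fastq.gz https://example.com/sample_reads.fastq.gz
!wget -O targets.bed https://example.com/targets.bed
```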

To view the outcome of the download we can use the tree command to show the contents of the working directory:
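
A one-line cell suffices:

```python
# display the contents of the working directory
!tree .
```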

The files should also appear in the File Browser to the left-hand side of the screen.

For our analysis we will also need the human genome reference:
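
As a sketch, the file can be fetched from NCBI (the URL below reflects the path at the time of writing; verify it is still current):

```python
# download the GRCh38 no-alt analysis set from NCBI
# (URL assumed current at the time of writing)
!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
```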

The reference contained within the above download is the same as that which can be obtained here; the reasons for using this particular file are presented in Heng Li's blog post.

Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the value of the input_file variable below. To find the correct full path of a file you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

The location shared from your computer with the EPI2ME Labs server will show as /epi2melabs; for example, when the /data folder is shared, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass.

Data entry

Having downloaded the sample data, or located your own data in the file browser, we need to provide the filepaths as input to the notebook in the form below:
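
A sketch of the corresponding inputs is below; input_file is named in the text above, while the other two variable names are illustrative:

```python
# filepaths consumed by the rest of the notebook
input_file = "sample_reads.fastq.gz"  # a .fastq file, or a folder of .fastq files
# the following two variable names are illustrative
reference_file = "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz"
target_bed = "targets.bed"            # target regions within the reference
```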

Alignment of reads

Our first task in analysing our data is to align the sequence reads to the reference sequence. We do this with the mini_align program from the pomoxis package, which is preinstalled in the EPI2ME Labs notebook server. Note that this command may take a while to run depending on the number of reads in your dataset. With the sample data (8000 reads) and using 4 compute threads (-t 4 in the code), the alignment will take around 5 minutes.
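
A sketch of the alignment command, assuming the filepaths set above (see mini_align -h for the full option list):

```python
# align reads to the reference; writes aligned.bam and its index
!mini_align -i $input_file -r $reference_file -p aligned -t 4
```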

While we are here, we will also calculate a table summarising the alignment data, again using a program from pomoxis:
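
The program in question is stats_from_bam; a minimal invocation might be:

```python
# tabulate per-read alignment statistics from the .bam file
!stats_from_bam aligned.bam > alignment_stats.txt
```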

The summary file gives useful information on the alignment of each read to the reference sequence, including: chromosome, start and end coordinates, and the accuracy of the read with respect to the reference. We can plot a histogram of the latter quantity:
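
A sketch of such a plot, assuming the accuracy column of the summary is named acc:

```python
import pandas as pd
import matplotlib.pyplot as plt

# stats_from_bam writes a tab-separated table; the accuracy column
# is assumed here to be named 'acc'
stats = pd.read_csv("alignment_stats.txt", sep="\t")
stats["acc"].plot.hist(bins=50)
plt.xlabel("read accuracy (%)")
plt.ylabel("number of reads")
plt.show()
```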

The alignment summary can also be used to count reads overlapping the target regions in the input .bed file. To do this we will use the pyranges Python package. This analysis provides a quick QC step before proceeding to SNP calling.
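
A sketch of the counting step; the ref, rstart, and rend column names are assumptions about the summary table:

```python
import pandas as pd
import pyranges as pr

# load the target regions and build a PyRanges of read alignments
targets = pr.read_bed("targets.bed")
stats = pd.read_csv("alignment_stats.txt", sep="\t")
reads = pr.PyRanges(pd.DataFrame({
    "Chromosome": stats["ref"],    # column names assumed from
    "Start": stats["rstart"],      # the stats_from_bam output
    "End": stats["rend"]}))
# annotate each target with the number of overlapping reads
counts = targets.count_overlaps(reads)
print(counts.df)
```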

SNP Calling

Medaka calculates variants from read alignments stored within a .bam file. The workflow internal to Medaka was designed with whole genome sequencing in mind; it can, however, be coerced into effectively calling SNP variants from targeted and amplicon datasets.

Medaka's workflow

The workflow used within Medaka to call variants in human whole genome sequencing is outlined in the documentation. Variant calling is a three-step process:

  1. call homozygous and heterozygous SNPs,
  2. use the information contained within the heterozygous SNPs to haplotype the reads,
  3. create haplotype consensus sequences for the two haplotypes and recombine the data into a single set of SNP and INDEL variants.

Examining the workflow we identify a possible problem for the calling of generic variants: the second and third steps rely on finding a sufficient number of heterozygous SNPs within each target region. For whole genome sequencing with long reads this is not an issue, but for short targets or amplicons containing few or no heterozygous SNPs the reads cannot be partitioned into their corresponding haplotypes. For reference, the average distance between heterozygous variants is ~630 bases in the GIAB HG002 sample:
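
The figure can be reproduced along the following lines from the GIAB truth set; the VCF filename here is hypothetical, and pysam is assumed to be available:

```python
import numpy as np
import pysam

# hypothetical local copy of the GIAB HG002 truth-set VCF
vcf = pysam.VariantFile("HG002_GRCh38_benchmark.vcf.gz")
last, dists = {}, []
for rec in vcf:
    gt = rec.samples[0]["GT"]
    # keep heterozygous calls only
    if gt is None or len(set(gt)) < 2:
        continue
    if rec.chrom in last:
        dists.append(rec.pos - last[rec.chrom])
    last[rec.chrom] = rec.pos
print("mean distance between heterozygous variants: {:.0f} bases".format(np.mean(dists)))
```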

Coverage requirements

An additional consideration is coverage: since Medaka is designed to work with whole genome sequencing datasets, the genome coverage it expects is rather low compared to what might be obtained in a targeted sequencing experiment. Medaka's variant calling is optimised to work with reference coverage in the range of 30-100 fold. If a dataset contains more data than this it is recommended to first downsample the data to 80X.

Various tools can downsample a dataset; pomoxis includes a tool for doing just this whilst attempting to normalise coverage:
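
The tool in question is subsample_bam; a minimal sketch is below (the flags are from memory, so check subsample_bam --help for your version):

```python
# downsample to ~80-fold coverage while attempting to even out depth
# (flags may differ between pomoxis versions)
!subsample_bam aligned.bam 80 -o downsampled -t 4
```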

A simpler approach would be to use samtools view with its -s option.
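
For example, the cell below keeps a fixed fraction of reads; the fraction is illustrative and should be chosen to reach your target depth:

```python
# -s takes SEED.FRACTION: seed 42, keep ~10% of reads
!samtools view -b -s 42.1 -o downsampled.bam aligned.bam
!samtools index downsampled.bam
```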

Calling variants

With the above considerations in mind we will now use Medaka to call variants.

Medaka contains a single program to perform all the necessary steps to produce a Variant Call Format (.vcf) file from a .bam file. To calculate variants efficiently we will explicitly give the regions of interest to medaka:
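
A sketch of this step, looping over the entries of the .bed file (the medaka_variant flags follow medaka ~1.x; check medaka_variant -h for your version):

```python
import pandas as pd

# run medaka_variant once per target region
bed = pd.read_csv("targets.bed", sep="\t", header=None,
                  usecols=[0, 1, 2], names=["chrom", "start", "end"])
for _, row in bed.iterrows():
    region = "{}:{}-{}".format(row.chrom, row.start, row.end)
    outdir = "medaka_{}_{}_{}".format(row.chrom, row.start, row.end)
    # flags assumed: -i alignments, -f reference, -r region, -o output folder
    !medaka_variant -i downsampled.bam -f $reference_file -r $region -o $outdir -t 4
```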

The above code will have created a separate output folder for each entry in the input .bed file. Each folder will contain something like:

Many of these files are intermediate files that are not useful to most users (they can be automatically removed by using the -d option to medaka_variant). The most interesting outputs are:

For longer target regions and amplicons the round_1.vcf file will be populated. For smaller regions, or regions of low heterozygosity, the haplotype partitioning step may have failed and only round_0_hap_mixed_unphased.vcf will contain useful data.

Annotating the variant calls

The step above calculated a .vcf file independently for each region of interest. In this step we will aggregate and annotate these files.

A usual step in analysing putative variants is to consider each variant in terms of the number of reads supporting the alternative and reference alleles. To enable this analysis Medaka has a dedicated tool which performs realignment of reads to determine which allele each read best supports.

The outcome of this analysis may prove counter-intuitive. The neural network used within Medaka can be considered as encoding knowledge (prior beliefs) on the expected base distributions for alternative sequences. In the presence of errorful reads the neural network provides additional power over naively counting bases. Users are advised not to trivially dismiss calls based on incongruent read counts.

To aggregate the independent .vcf files run the code cell below:
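
A sketch using bcftools together with medaka's annotation tool (the medaka tools annotate sub-command and its argument order follow medaka ~1.x; check medaka tools --help):

```python
# concatenate the per-region VCFs into one file
!ls medaka_*/round_1.vcf > vcf_list.txt
!bcftools concat -f vcf_list.txt -o combined.vcf
# annotate each record with read depth and allele support
# (argument order assumed: input VCF, reference, alignments, output VCF)
!medaka tools annotate combined.vcf $reference_file downsampled.bam annotated.vcf
```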

Examining the annotations

Having created our combined and annotated .vcf we will now highlight a few of its features in order to provide pointers for further filtering and analysis.

VCF files generically comprise a header section and a set of mandatory fields, together with an INFO field which may contain additional useful information. In common with many other tools, Medaka uses the INFO field to embed an additional data table. The annotation program above added the following to the INFO field (definitions of these can be found in the .vcf header information):

There are a variety of VCF parsers available in Python, but since the file is essentially a tab-separated table, here we simply use the pandas library to extract the INFO fields above into their own columns and then count variants according to the depth (DP) column and Medaka's in-built filtering:
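
A sketch of this manipulation; the depth threshold of 20 is illustrative:

```python
import pandas as pd

# read the VCF body, skipping header lines beginning with '#'
vcf = pd.read_csv("annotated.vcf", sep="\t", comment="#", header=None)
vcf = vcf.iloc[:, :8]
vcf.columns = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "info"]
# expand the key=value pairs of the INFO column into their own columns
info = vcf["info"].str.split(";").apply(
    lambda kvs: dict(kv.split("=", 1) for kv in kvs if "=" in kv))
vcf = vcf.join(pd.DataFrame(list(info)))
vcf["DP"] = pd.to_numeric(vcf["DP"], errors="coerce")
# count variants by medaka's filter status and an illustrative depth cut
print(vcf.groupby(["filter", vcf["DP"] >= 20]).size())
```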

This gives just a flavour of the information available in the INFO field. For more advanced filtering of variants, users are encouraged to read the bcftools query documentation and examples. For example, to perform a manipulation similar to the above Python snippet, outputting only variants with high quality and depth to a tab-separated text file:
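
A sketch of such a command; the quality and depth thresholds are illustrative:

```python
# select PASS variants above illustrative quality and depth thresholds,
# writing selected fields to a tab-separated file
!bcftools query -i 'FILTER="PASS" && QUAL>=20 && INFO/DP>=20' -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\t%INFO/DP\n' annotated.vcf > filtered_variants.txt
```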

Headline numbers

As a final summary of the experiment the infographic below displays a small set of headline numbers captured through the course of the analysis:

Summary

This tutorial has stepped through the use of Medaka's variant calling functionality as applied to targeted sequencing experiments. We have discussed how Medaka performs variant calling and how that may affect the results. With the aid of an example dataset we have highlighted how to interpret the key outputs of the workflow.

The analysis presented can be run on any dataset from an Oxford Nanopore Technologies' device. The code will run within the EPI2ME Labs notebook server environment.