# Viral and Bacterial Variant Calling

This tutorial is a short demonstration of how to perform small variant calling in viral and bacterial samples (or, more generally, in any ploidy-1 sample).

Questions answered by this tutorial include:

• How does my sample differ from a reference sequence?
• What is the level of evidence for these variants in my data?
• What tools can I use to programmatically manipulate VCF files?

Methods used in this tutorial include:

• medaka for variant calling and annotation.
• pomoxis for basic data QC.
• pysam for iterating through VCF files.
• pandas for manipulating VCF files as a table.
• bcftools for filtering VCF files on the command-line.

Computational requirements for this tutorial:

• A computer running the EPI2ME Labs Server environment
• 16 GB RAM

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select Edit > Clear all outputs before using the notebook to analyse your own data.

## Introduction

This tutorial aims to demonstrate how medaka can be used to detect small variants in a haploid sample. Medaka was not explicitly developed for this task, but variant calling is so closely related to consensus calculation that medaka is ideally suited to it.

The goals of this tutorial include:

• Understand how to use medaka to call small variants in a haploid sample
• Learn how to use medaka's ancillary tools to annotate variants with evidence statistics
• Understand the basic concepts of a Variant Call Format (VCF) file and how to manipulate such files using a range of software tools.

The tutorial includes a small sample dataset from a B. subtilis sample. The workflow demonstrated below has been used successfully in the ARTIC bioinformatics pipeline for calling variants in SARS-CoV-2 samples.

## Getting started

Before anything else, we will create and set a working directory:
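In the notebook this is done with a short code cell; a minimal sketch is shown below (the directory name is a placeholder, so edit it to suit your own analysis):

```python
import os

# Placeholder name for the working directory; change it to suit your analysis.
working_dir = "medaka_walkthrough"

# Create the directory (no error if it already exists) and move into it.
os.makedirs(working_dir, exist_ok=True)
os.chdir(working_dir)
print("Working directory:", os.getcwd())
```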

### Sample data

In order to demonstrate the abilities of this notebook, a small sample dataset from a B. subtilis isolate is included together with a reference sequence.

To download the sample file we run the Linux command wget. To execute the command, click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

If you wish to analyse your own data rather than the sample data, you can edit the value of the .fastq input variable below. To find the correct full path of a directory, navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

The location shared from your computer with the EPI2ME Labs server will show as /epi2melabs. For example, if the /data folder is shared, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass.

### Data entry

Having downloaded the sample data, or located your own data in the file browser, we need to provide the filepaths as input to the notebook. Enter the paths to your input files and output folder here. The input data may be either a single .fastq file (compressed or otherwise) or a directory containing multiple such files:

### Preliminary analysis

To verify that the data are suitable for performing variant calling, we can align the reads to the reference sequence to check the accuracy of the reads and their coverage of the genome:
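In the notebook this step is a shell cell using pomoxis. A hedged sketch is given below: the filenames are placeholders, and the flags should be checked against mini_align --help for your installed pomoxis version.

```bash
# Align reads to the reference (placeholder filenames), producing aligned.bam,
# then summarise per-read accuracy with pomoxis' stats_from_bam.
mini_align -i reads.fastq -r reference.fasta -p aligned -t 4
stats_from_bam aligned.bam > read_stats.txt
```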

The results of the above alignments and analysis can be visualised using the code cell below.

The left-hand plot depicts read accuracy and should contain a peak at or above 93%. The right-hand coverage plot will reveal regions of low coverage. In such regions variant calling may prove more difficult (but not impossible); this should be taken into account when evaluating variants in the next section.

## Variant calling with medaka

The section above provides a basic QC of the input data: if the reads appear to be of good quality with respect to the reference sequence and there is good coverage of reads across the reference, it is time to perform variant calling.

### Calling variants

Medaka contains a specialised program to perform variant calling for ploidy-1 samples: medaka_haploid_variant. This program uses a combination of racon and medaka's RNN consensus network to generate a consensus sequence. The output sequence is then compared to the reference sequence in order to generate small variant calls in the form of a VCF file. To further annotate the called variants with read support information, the medaka tools annotate command can be used.

Calling variants is as simple as that! The above code block will ultimately have produced a VCF file annotated with read support information. These annotations are calculated by realigning all reads in the vicinity of a variant locus and determining which alternative sequence each read best supports.

The outcome of the read support annotations may prove counter-intuitive. The neural network used within Medaka can be considered as encoding knowledge (prior beliefs) on the expected base distributions for alternative sequences. In the presence of errorful reads the neural network provides additional power over naively counting bases. Users are advised not to trivially dismiss calls based on incongruent read counts.

In the next section we will review the produced VCF to highlight aspects of it a user may wish to explore further.

## Reviewing the calls

The section above has produced a set of (small) variants, annotated them and output a VCF file. This section aims to provide a gentle introduction to the VCF format, show how to read and manipulate such files in Python, and introduce the standard bcftools suite of programs.

### Reading VCF files with Python

There are a variety of Python libraries that aim to provide reasonably intuitive access to the data contained within a VCF file. Here we will briefly introduce the specialist pysam.VariantFile parser before using pandas to manipulate the file as basic tabular data.

#### Using pysam

To iterate through the first few records in a VCF using pysam and print the reference name, the reference position and the mutation implied by the variant record:

Note how the non-reference, alternative sequence includes surrounding bases in the case of insertion and deletion variants.

To filter variants to those for which the calling algorithm (in this case medaka) is more confident, it is useful to examine the QUAL and INFO fields of the variant records. The INFO field is a set of key=value pairs, with a description of each key contained within the header section of the file:
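The key=value structure of the INFO column is simple enough to parse by hand; a minimal standard-library sketch is shown below (the tag names in the example string are invented for illustration and will differ from those medaka writes):

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict; bare flags map to True."""
    parsed = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            parsed[key] = value
        else:
            parsed[item] = True  # flag-style entry with no value
    return parsed

# Invented INFO string for illustration only.
info = parse_info("DP=102;AQ=45.0;SOMEFLAG")
print(info)
```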

NOTE: The information contained within the INFO section is freely defined by the program producing the VCF file, and so can vary between programs.

#### Using pandas

We could continue to manipulate the medaka VCF file using pysam, but for interaction with the wider Python ecosystem it can be easier simply to treat the VCF file as standard tabular data using pandas. For example, the code below will parse the VCF in a way that allows easy plotting:
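A minimal sketch of the idea, reading a small in-memory VCF into a DataFrame (the column names and DP tag are illustrative; with your own data, point read_csv at the medaka VCF file instead of a StringIO object):

```python
import io

import pandas as pd

# Tiny invented VCF; comment='#' makes read_csv skip all header lines.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "ref\t10\t.\tA\tG\t45.0\tPASS\tDP=102\n"
    "ref\t25\t.\tCT\tC\t30.0\tPASS\tDP=98\n"
)

columns = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "info"]
df = pd.read_csv(io.StringIO(vcf_text), sep="\t", comment="#", names=columns)

# Pull the read depth out of the INFO column for easy plotting/filtering.
df["depth"] = df["info"].str.extract(r"DP=(\d+)", expand=False).astype(int)
print(df[["chrom", "pos", "qual", "depth"]])
```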

### Using bcftools

Above we have shown two Python-based methods for interacting with VCF files. The first of these is useful for inspecting each variant in turn, whereas the second is useful for generic programming and creating views across variants.

Finally we will introduce bcftools, a set of command-line tools for manipulating VCF files and their binary cousins, BCF files. Perhaps one of the most useful parts of bcftools with which to get started is the bcftools query command. The query command can be used to format a VCF in a user-defined fashion, whilst performing filtering.

To create a tab-separated output for variants with sequencing depth greater than 75 reads:
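A hedged sketch of such a command is given below; the filename is a placeholder, and the DP tag assumes the annotation added by medaka (see bcftools query --help for the full option set):

```bash
# Print chrom, pos, ref, alt and depth for variants with depth > 75 reads.
bcftools query \
    -i 'INFO/DP>75' \
    -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/DP\n' \
    medaka.annotated.vcf
```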

This is just one small example. The expressions section of the bcftools documentation is a useful resource for understanding how to query and filter VCF files.

## Summary

In this notebook we have demonstrated some key methods for performing variant calling and inspection for haploid samples. Medaka was used to perform the variant calling whilst we illustrated several tools for manipulating the resulting VCF file.

The analysis presented can be run on any dataset from an Oxford Nanopore Technologies' device. The code will run within the EPI2ME Labs notebook server environment.