How medaka works


The following is a relatively short document describing how Oxford Nanopore Technologies' program for consensus calling of sequencing data, medaka, functions internally. We will demonstrate the core functionality required to process alignment data, how it is presented to a recurrent neural network, and how a consensus sequence is formed.

The document you are reading is not a static web page, but an interactive environment called a Colab notebook that lets you write and execute code. You can inspect, modify, and run any of the code on this page.

This notebook has been designed to work natively on Google Colab. It will likely not function correctly within the EPI2MELabs environments.

Installation

Before getting started with how medaka works, we will install it into the Colab environment. This will enable us to both inspect and run the code. To install medaka, run the code cell below by clicking the "play" icon to the left, or by clicking on the cell and typing <shift>-<enter>.
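The installation is typically a single pip command; the cell below shows the generic form (the original notebook may pin specific versions or extras):

```python
# Install medaka from PyPI into the Colab environment.
# (The original notebook may have pinned specific versions; this is the generic form.)
!pip install medaka
```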

After running the above, it is necessary to select Runtime>Restart runtime... from the menu at the top of the page, after which we should be able to import medaka:
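As a quick check that the installation succeeded after the restart:

```python
# Confirm medaka can be imported after restarting the runtime.
import medaka
print(getattr(medaka, '__version__', 'unknown'))
```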

Medaka's input

As input the core medaka algorithm accepts sequencing reads aligned to an assembly sequence. If you have run the medaka_consensus pipeline you will have given as input an assembly sequence and your sequencing data. The pipeline simply runs minimap2 to calculate alignments of the reads to the assembly.
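Outside of the pipeline, the alignment step looks something like the sketch below; the file names are placeholders, and the medaka_consensus pipeline wraps this step in a helper with additional options:

```python
# A sketch of the alignment step performed by the pipeline (file names are
# placeholders). minimap2's map-ont preset is suited to nanopore reads.
!minimap2 -ax map-ont draft_assembly.fasta reads.fastq.gz | samtools sort -o calls_to_draft.bam
!samtools index calls_to_draft.bam
```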

For the purposes of this demonstration we will download pre-aligned data from an R9.4.1 MinION sequencing run:

The downloaded saureus.bam file contains alignments of sequencing reads to the downloaded saureus_canu.fasta. The depth of sequencing has been reduced to around 150-fold coverage of the genome.

Diving in: counting bases

The first step of medaka's calculation is to parse the alignment data into a base counts table ready for input to the neural network. In this section we explore the functions responsible for doing this, how exactly counting is performed and what the results may represent.

Pileup interface

At the heart of medaka resides a straight-forward base-counting procedure. From the alignment data comparing sequencing reads to the reference sequence, a pileup is created, much like the display an alignment viewer such as IGV would show.

The pileup is summarised by counting the different base types contained within its columns. The function responsible for this counting exercise is called pileup_counts in the features module:
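One generic way to inspect the function from within the notebook is via the standard inspect module:

```python
import inspect
import medaka.features

# Show the signature and docstring of the base-counting entry point.
print(inspect.signature(medaka.features.pileup_counts))
print(inspect.getdoc(medaka.features.pileup_counts))
```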

The pileup_counts function above has various arguments, most of which are advanced options and not used within the default operation of medaka. To create a counts matrix we call the function with a Samtools-style region string (medaka uses 0-based end exclusive co-ordinates) and a filepath to our alignment file:
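A minimal sketch of such a call is shown below. The contig name is a placeholder, the region object is built from a samtools-style string via medaka.common.Region.from_string, and the return structure (assumed here to be a list of (counts, positions) pairs) may differ between medaka versions:

```python
import medaka.common
import medaka.features

# Placeholder contig name: substitute one from saureus_canu.fasta.
# Region strings follow the samtools convention; medaka's coordinates are
# 0-based and end-exclusive.
region = medaka.common.Region.from_string('contig_1:0-100000')

# Assumed to return a list of (counts, positions) pairs, one per contiguous
# stretch of the region covered by reads.
chunks = medaka.features.pileup_counts(region, 'saureus.bam')
counts, positions = chunks[0]
```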

The counts matrix

The pileup_counts function returned two structures. The latter of these is a positions table, which records which pileup columns are reference positions and which are caused by bases inserted in one or more reads:
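For example, the first few rows can be examined with:

```python
# The positions table is a structured array: 'major' gives the reference
# coordinate and 'minor' counts the inserted columns following it.
print(positions.dtype)
print(positions[:10])
```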

The field minor in the above array indicates reference and insertion columns: it takes a value 0 for a reference position and counts upwards for all following insertion events. The major field keeps track of the reference base co-ordinate.

The base counts themselves from the alignment pileup are stored separately:
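For example:

```python
# One row of ten stranded base/gap counts per pileup column.
print(counts.shape)
print(counts[:5])
```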

The matrix is of shape (# pileup columns, 10); each row of the matrix corresponds to the counts of bases and gaps in a single pileup column (yes, the rows and columns get confusing). There are 10 entries: one for each of the four base types and the gap, multiplied by two because reads on the forward and reverse strands are counted separately. The ordering of the entries is given by:

in which lower-case letters denote reverse-strand counts (upper case, forward) and 'd' and 'D' count deletions. A point of note is that this counting strategy in itself makes a distinction between bases which are deleted in reads with respect to the reference sequence and bases which are deleted in reads with respect to other reads (the bases in the other reads being insertions with respect to the reference). Previous versions of medaka have performed a symmetrization here, adding in deletion counts for all reads that span a pileup column, whether that pileup column is a reference position (minor=0) or an insertion column (minor>0).

An aside on performance

Although medaka is mainly written in Python, this first calculation is performed in C for speed; pileup_counts defers to a C implementation of the base counting which makes use of the pileup API from htslib. This C function is 1-2 orders of magnitude faster than a previous Python implementation using pysam. Nevertheless, this seemingly trivial step is often the performance bottleneck in medaka, particularly when using GPUs to run the neural network calculations. Furthermore, it is within the htslib function resolve_cigar2 that the code spends much of its time:

Because of this, medaka uses multiple worker threads to perform the base counting: the calculation is split into 1 Mbase shards which are further subdivided into 100 kbase chunks, with the chunks being reassembled for each shard. To see the benefit, compare the time taken for the simple pure-C pileup program above with that for running the full medaka consensus program below:

On the standard Colab environment with two CPU cores and an NVIDIA P4 GPU, we see that the wallclock time for medaka consensus is less than that of the single-threaded pileup program. This speed is down to the asynchronous, multi-threaded queuing that medaka implements for the generation of the counts matrices, and to the raw power of GPUs to process the data.
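As an illustration of the sharding scheme described above (not medaka's actual implementation, which uses its own region helpers and worker pools), a pure-Python sketch:

```python
# Illustrative sketch of the work-splitting scheme: a region is cut into
# 1 Mbase shards, each of which is processed as 100 kbase chunks that are
# reassembled per shard. The region length here is illustrative.
def split(start, end, size):
    """Yield (start, end) pieces of at most `size` covering [start, end)."""
    for s in range(start, end, size):
        yield s, min(s + size, end)

region_length = 2_800_000  # roughly the size of an S. aureus genome
for shard_start, shard_end in split(0, region_length, 1_000_000):
    chunks = list(split(shard_start, shard_end, 100_000))
    # In medaka each chunk is counted by a worker and the results are
    # stitched back together for the shard.
    print(f"shard {shard_start}-{shard_end}: {len(chunks)} chunks")
```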

Normalization

After obtaining the base-counts matrix produced in the section above, medaka performs a normalization of the counts. Across the pileup columns, all count vectors sharing the same major position index are normalized by the total count for the column with minor=0 (the reference position). This choice of normalization accounts for the lack of symmetry described above, and for the fact that whilst consensus insertions are typically rare, isolated insertions may still occur within any one read spanning two reference positions. There are on average up to three pileup columns for every input reference position.

Ordinarily this normalization is performed in a post-processing method of the CountsFeatureEncoder class; for the purposes of exposition, the equivalent operation is:
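The numpy sketch below implements the operation as described above (it is not medaka's actual code) and assumes the counts and positions arrays obtained earlier:

```python
import numpy as np

# Normalize each pileup column's counts by the total depth of its
# reference (minor == 0) column, as described above.
depths = counts.sum(axis=1).astype(float)

# Total count at each reference column...
ref_depths = depths[positions['minor'] == 0]

# ...mapped back onto every column sharing the same major coordinate: a
# cumulative sum over the minor == 0 indicator gives, for each column, the
# index of its most recent reference column.
major_index = np.cumsum(positions['minor'] == 0) - 1
norm_counts = counts / np.maximum(ref_depths[major_index], 1)[:, None]
```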

The normalization is across all bases; it is not split by strand, since splitting the normalization by strand would potentially lose important information with respect to strand bias and relative errors. The plot below visualizes the final input to the neural network used in medaka.
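One way to produce such a plot with matplotlib (an illustrative visualization, not medaka's own plotting code):

```python
import matplotlib.pyplot as plt

# Show a slice of the normalized counts: pileup columns along the x-axis,
# the ten stranded base/gap features along the y-axis.
fig, ax = plt.subplots(figsize=(12, 3))
ax.imshow(norm_counts[:500].T, aspect='auto', interpolation='nearest')
ax.set_xlabel('pileup column')
ax.set_ylabel('feature')
plt.show()
```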

The neural network

Having counted bases in an alignment pileup, medaka proceeds to analyse these counts using a recurrent neural network (RNN). A full discussion of such algorithms is beyond the scope of this document; this section demonstrates their use in calculating a consensus sequence from the base counts array. When medaka is used as a variant caller, different methods are used.

The model

In order to construct a consensus sequence medaka uses a multi-layer bidirectional RNN. This is defined using the Keras API in TensorFlow. The following code is adapted from the models module of medaka; it has been simplified to show only the essential parts:
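A simplified sketch in that spirit is shown below; the layer sizes and exact arrangement are assumptions rather than a verbatim copy of medaka's models module:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_consensus_model(feature_len=10, num_classes=5, gru_size=128):
    """Build a bidirectional RNN mapping pileup-column counts to base scores."""
    # Input: a chunk of pileup columns, each described by `feature_len` counts.
    inputs = layers.Input(shape=(None, feature_len), name='counts')
    # Two bidirectional GRU layers read the columns in both directions.
    x = layers.Bidirectional(layers.GRU(gru_size, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.GRU(gru_size, return_sequences=True))(x)
    # Five softmax scores per column: one for each of A, C, G, T and the gap.
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)

model = build_consensus_model()
model.summary()
```

In practice the weights of a trained medaka model would be loaded; the untrained model above serves only to illustrate the input and output shapes involved.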

The model takes the counts matrix as input and, for each column of the counts matrix, outputs a set of five scores. The five scores express the probability that the consensus sequence should contain one of the four bases A, C, G, or T, or a gap '-' character at the pileup column under consideration.

Making predictions

In order to make predictions using the RNN, medaka splits the normalized counts array into overlapping chunks before processing by the model. Chunking the array allows for more efficient parallel computation while overlapping is a mitigation against edge-effects at the boundaries of chunks.

As mentioned above in the aside on performance, medaka has a somewhat elaborate system for managing data chunks. For the purposes of exposition the code below implements a simple chunking and batching of the data.
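The sketch below chunks the normalized counts with a fixed overlap, batches them through the illustrative model from the previous section, and converts the per-column scores to a sequence. The chunk sizes, the class ordering, and the trimming scheme are assumptions, and a trained model would of course be needed for meaningful output:

```python
import numpy as np

chunk_len, overlap = 1000, 100
step = chunk_len - overlap

# Cut the pileup columns into equal-length, overlapping chunks
# (any final partial chunk is ignored in this sketch).
starts = list(range(0, norm_counts.shape[0] - chunk_len + 1, step))
chunks = np.stack([norm_counts[s:s + chunk_len] for s in starts])

# One row of five scores per pileup column, for every chunk in the batch.
scores = model.predict(chunks, batch_size=32)

# Assumed class ordering: the four bases followed by the gap symbol.
symbols = np.array(list('ACGT-'))
half = overlap // 2

# Trim half the overlap from each interior chunk boundary so that every
# pileup column contributes exactly once, then join the pieces.
pieces = []
for i, chunk_scores in enumerate(scores):
    best = symbols[chunk_scores.argmax(axis=-1)]
    lo = 0 if i == 0 else half
    hi = chunk_len if i == len(starts) - 1 else chunk_len - half
    pieces.append(''.join(best[lo:hi]))

consensus = ''.join(pieces).replace('-', '')  # drop gap calls
print(len(consensus))
```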

The code above performs a simple undoing of the overlapping before stitching the consensus sequence pieces back together. This is sufficient to obtain results here; the full medaka implementation also keeps track of the positions array to ensure that the sequence stitching is performed correctly with respect to the original input reference sequence.

Checking our results

We can write out the full consensus sequence derived above and compare it to the truth sequence by using the assess_assembly program from the pomoxis package. By also examining the original draft sequence, we can see the improvement in quality from medaka:
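A sketch of writing out the sequence is shown below; the output file name and the truth reference are placeholders, and the assess_assembly options in the comment are assumptions (check assess_assembly --help for the exact interface):

```python
# Write the stitched consensus from the previous section to a FASTA file.
with open('medaka_consensus_chunk.fasta', 'w') as fh:
    fh.write('>consensus\n')
    fh.write(consensus + '\n')

# With a truth reference to hand (not included here), assessment would look
# something like the following (options assumed; see `assess_assembly --help`):
#   !assess_assembly -i medaka_consensus_chunk.fasta -r truth.fasta -p medaka_eval
```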

Remarks

In this short walkthrough we have examined some of the internals of Oxford Nanopore Technologies' medaka program, which performs GPU-accelerated consensus calculations from aligned sequencing data. The public medaka codebase implements various alternative forms of the algorithms presented here, including run-length compression and support for multiple data types. Hopefully this guide will prove useful to anyone wishing to implement algorithms similar to those implemented in medaka.