# How medaka works

The following is a relatively short document describing how Oxford Nanopore Technologies' program for consensus calling of sequencing data, medaka, functions internally. We will demonstrate the core functionality required to process alignment data, how it is presented to a recurrent neural network, and how a consensus sequence is formed.

## Getting started¶

Before anything else we will create and set a working directory:

## Medaka's input¶

As input the core medaka algorithm accepts sequencing reads aligned to an assembly sequence. If you have run the medaka_consensus pipeline you will have given as input an assembly sequence and your sequencing data. The pipeline simply runs minimap2 to calculate alignments of the reads to the assembly.

For the purposes of this demonstration we will download pre-aligned data from an R9.4.1 MinION sequencing run:

The downloaded saureus.bam file contains alignments of sequencing reads to the downloaded saureus_canu.fasta. The depth of sequencing has been reduced to around 150-fold coverage of the genome.

# Diving in: counting bases¶

The first step of medaka's calculation is to parse the alignment data into a base counts table ready for input to the neural network. In this section we explore the functions responsible for doing this, how exactly counting is performed and what the results may represent.

## Pileup interface¶

At the heart of medaka resides a straight-forward base-counting procedure. From the alignment data comparing sequencing reads to the reference sequence a pileup is created, much like the display and alignment viewer such as IGV would display.

The pileup is summarise by counting the different base types contained within its columns. The function responsible for this counting excercise is called pileup_counts in the features module:

The pileup_counts function above has various arguments, most of which are advanced options and not used within the default operation of medaka. To create a counts matrix we call the function with a Samtools-style region string (medaka uses 0-based end exclusive co-ordinates) and a filepath to our alignment file:

### The counts matrix¶

The pileup_counts function returned two structures. The latter of these is a positions table, this records which pileup columns are reference positions and which are caused by inserted bases in one or more reads:

The field minor in the above array indicates reference and insertion columns: it takes a value 0 for a reference position and counts upwards for all following insertion events. The major field keeps track of the reference base co-ordinate.

The base counts themselves from the alignment pileup are stored separately:

The matrix is of shape (# pileup columns, 10), each row of the matrix corresponds to the counts of bases and gaps in the pileup columns (yes, the rows and columns get confusing). There are 10 entries one each for the fours base types and gap, multiplied by two as reads on the forward and reverse strand are counted separately. The ordering of the entries is given by:

in which lower-case letters denote reverse strand counts (upper case, forward) and 'd' and 'D' count deletions. A point of note is that this counting strategy it itself makes a distinction between bases which are deleted in reads with respect to reference sequence and bases which are deleted in reads with respect to other reads (the bases in the other reads being insertions with respect to the reference). Previous versions of medaka have performed a symmetrization here: by adding in deletion counts for all read that span a pileup column, whether that pileup column is a reference position (minor=0) or an insertion column (minor>0).

## Normalization¶

After obtained the base-counts matrix produced in the section above medaka performs a normalization of the counts. Across the pileup columns, all count vectors with equal corresponding major position index are normalized by the total count for the column with minor=0 (the reference position). This choice of normalization accounts for the lack of symmetry described above, and that whilst consensus insertions are typically rare, isolated insertions may still occur within any one read spanning two reference positions. There are on average up to three pileup columns for every input reference position.

Ordinarily this normalization is performed in a post-processing method of the CountsFeatureEncoder class, for the purposes of exposition the operation under normal behaviour is:

The normalization is across all bases, it is not split by strand; splitting the normalization by strand would potentially lose important information with respect to strand bias and relative errors. The plot below visualizes the final input to the neural network used in medaka.

# The neural network¶

Having counted bases in an alignment pileup medaka proceeds to analyse these counts using a Recurrent Neural Network, (RNN). A full discussion of such algorithms is beyond the scope of this discussion, this section demonstrates their use in calculating a consensus sequence from the base counts array. When medaka is used as a variant caller different methods are used.

## The model¶

In order to construct a consensus sequence medaka uses a multi-layer bidirection RNN. This is defined using the keras API in tensorflow. The following code is adapted from the models module of medaka, it has been simplified to show only the essential parts:

The model takes the count matrix as input and outputs for each corresponding column of the counts matrix a set of five scores. The five scores express the possibility that the consensus sequence should contain one of the four bases A, C, G, or T, or a gap '-' character at the pileup column under consideration.

## Making predictions¶

In order to make predictions using the RNN, medaka splits the normalized counts array into overlapping chunks before processing by the model. Chunking the array allows for more efficient parallel computation while overlapping is a mitigation against edge-effects at the boundaries of chunks.

As mentioned above in the aside on performance, medaka has a somewhat elaborate system for managing data chunks. For the purposes of exposition the code below implements a simple chunking and batching of the data.

The code performs a simple undoing of the overlapping before stitching the consensus sequence pieces back together. This is sufficient to obtain results here; the full medaka implementation also keeps track of the positions array to ensure the sequence stitching is performed correctly with respect to the original input reference sequence.

### Checking our results¶

We can write out the full consensus sequence derived above and compare it to the truth sequence by using the assess_assembly program from the pomoxis package. By also examining the original draft sequence, we can see the improvement in quality from medaka:

# Remarks¶

In this short walkthrough we have examined some of the internals of Oxford Nanopore Technologies' medaka program performs GPU accelerated consensus calculations from aligned sequencing data. The public medaka codebase implements various alternative forms of the algorithms presented here including run length compression and support for multiple datatypes. Hopefully this guide will prove useful to anyone wishing to implement algorithms similar to that implemented in medaka.