Assembly of small genomes

This tutorial aims to demonstrate how to assemble and correct small genomes, up to ~100Mb, using Oxford Nanopore Technologies' sequencing data.

Included with the tutorial is an example Escherichia coli dataset, the workflow can be used to:

Computational requirements for this tutorial include:

⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

Data preparation

To start analysing our experiment we must first collate our data. The workflow below expects to be given a single .fastq file containing all the sequencing data which is to be assembled. If you have multiple .fastq files which need to first be aggregated, see our Introduction to .fastq notebook.

Before anything else we will create and set a working directory:

Sample Data

To get started we will download a sample sequencing dataset. The data comprise sequencing reads, from an R10.3 flowcell, of a bacterial mock community. Two preprocessed sets are available:

To use this tutorial with sample data we can download the files using the linux command wget. To execute the commands to download the data select the dataset you wish to download and then click the Play symbol to the left-hand side. (To see the code responsible for downloading the dataset double click the bold title text).

To view the outcome of the download we can use the tree command to show the contents of the working directory:

The files should also appear in the File Browser to the left-hand side of the screen.

Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the value .fastq input variable below. To find the correct full path of a directory you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:


The location shared with the EPI2ME labs server from your computer will show as /epi2melabs, for example a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.

Data entry

Having downloaded the sample data, or locating your own data in the file browser, we need to provide the filepaths as input to the notebook. This is done in the form below.

If you want simply to plot all the graphs in this tutorial for your dataset, rather than working through the tutorial, select Run selected Cell and All Below from the Run menu above after executing the cell below.

Basic Assembly with Flye

For small to mid-sized genomes, and meta-genomic applications, Oxford Nanopore Technologies' recommends the use of the flye assembler. Flye yields acceptable assembly results under a range of input conditions with little to no tuning from the user.

Oxford Nanopore Technologies also work with two other common assemblers:

Genome assembly

The flye assembler has a relatively compact set of options, and is easily run with an input .fastq file. To run the assembly execute the cell below. A genome size estimate is required (for example, "5m" and "2.6g" should be given for genomes of sizes 5 mega-bases and 2.6 giga-bases respectively), the code can be edited before running. The --threads option specifies the amount of CPU resource that will be used. On a mid-range contemporary laptop the assembly will take around 30 minutes. For maximum performance set this option to the number of CPU cores available on your computer.

Providing flye runs to completion it will have output some basic statistics of the assembly, for example the sample data should give:

Total length:   4897024
Fragments:      3
Fragments N50:  4775934
Largest frg:    4775934
Scaffolds:      0
Mean coverage:  28

The expectation for bacterial genomes is to obtain a single fragment corresponding to a whole chromosome. For larger genomes this might not always occur depending on the sequencing depth and complexity of the genome.

Error correcting the assembly

The flye assembler creates a reasonable quality consensus sequence from the aggregate information contained within the sequencing reads. However for improved accuracy it is recommended to run "assembly polishing," a process to correct residual base-level errors in the reads. This step will not change the gross structure of the assembly. To perform assembly polishing we will use the medaka program. Medaka uses a recurrent neural network to predict a consensus sequence from a pileup of reads aligned to a draft sequence.

Medaka contains a simple program for processing a draft assembly and read data. A particular point to note however is that it must be instructed which basecaller version and flowcell type (and so pore type) were used in the sequencing experiment; which model to use. A list of models can be found by running:

As a reminder the correspondence between flowcell type and pore type is:

Flowcell Pore type Medaka model prefix
FLO-MIN106D R9.4.1 r941_min
FLO-MIN111 R10.3 r103_min
FLO-PRO002 R9.4.1 r941_prom

For more information on choosing the correct model see the Sequence Correction section of the medaka documentation.

⚠️ Note: Be sure to edit the code below to use the correct model (-m option) for your data.

When medaka has run to completion it will output a file consensus.fasta in its output directory. This file contains the polished assembly sequences.

Genome quality assessment

As a brief demonstration that things have gone well for the sample data provided with the tutorial, we will now compare the assembly sequence to a reference sequence for the analyte used in the experiment. A more in depth look at assessing assembly accuracy can be found in the links below.

A set of reference sequences for the analyte used in the sample data experiment can be downloaded using wget, this file contains reference sequence for all the mock community samples in the experiment:

Pomoxis quality analysis

The assess_assembly program from the pomoxis suite can be used to obtain an alignment-based quality analysis of a genome assembly. The basic workflow is to split assembly contigs into fixed lenth chunks before aligning these to a reference sequence. These alignments are then analysed for errors.

To run assess_assembly only two arguments are required, a reference sequence and the assembly. The -p option below sets an output prefix, while -t sets the number of compute threads:

The summary Q scores section of the output gives statistics of the errors of the fixed length alignment chunks expressed in the usual logarithmic fashion ($-10\log_{10}(P_{err})$) with the err_ont row giving the total error rate:

$P_{err\_ont} = \frac{N_{ins} + N_{del} + N_{sub}}{N_{match} + N_{ins} + N_{del}}$

Next steps

This tutorial has walked through how to assemble small to mid-sized genomes using the flye assembler. We have also shown how to improve the accuracy of our assembly using the medaka consensus program. As a technology demonstrator we have also shown how to perform a reference-based quality assessment of an assembly.

For more information on genome assembly and assement see:

In a forthcoming tutorial we will provide a longer discussion on how to assess the quality of an assembly, both with and without a reference sequence.