This tutorial aims to demonstrate how to assemble and correct small genomes, up to ~100Mb, using Oxford Nanopore Technologies' sequencing data. Assembly is the task of taking a number of shorter sequences and putting them back together to create a representation of the original genomic sequence.
Included with the tutorial is an example Escherichia coli dataset, the workflow can be used to:
.fastq
file into a genomic sequence using the flye
assembler.medaka
software.pomoxis
.Computational requirements for this tutorial include:
⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select Edit > Clear all outputs
before using the notebook to analyse your own data.
To start analysing our experiment we must first collate our data. The workflow below expects to be given a single .fastq
file containing all the sequencing data which is to be assembled. If you have multiple .fastq
files which need to first be aggregated, see our *Introduction to .fastq* notebook.
Before anything else we will create and set a working directory:
# create a work directory and move into it
tutorial_name = "assembly_tutorial"
working_dir = '/epi2melabs/{}/'.format(tutorial_name)
!mkdir -p "$working_dir"
%cd "$working_dir"
from epi2melabs import ping
pinger = ping.Pingu()
_ = pinger.send_notebook_ping('start', 'assembly_tutorial')
/epi2melabs/assembly_tutorial
To get started we will download a sample sequencing dataset. The data comprise sequencing reads, from an R10.3 flowcell, of a bacterial mock community. Two preprocessed sets are available:
To use this tutorial with sample data we can download the files using the linux
command wget
. To execute the commands to download the data select the dataset you wish to download and then click the Play symbol to the left-hand side. (To see the code responsible for downloading the dataset double click the bold title text).
import os
from epi2melabs.notebook import InputForm, InputSpec
import ipywidgets as widgets
downloads = {
"smaller (30x)": "ecoli.30x.r103.fastq.gz",
"larger (60x)": "ecoli.60x.r103.fastq.gz"}
def process_form(inputs):
print("Using sample data")
bucket = "ont-exd-int-s3-euwst1-epi2me-labs"
domain = "s3-eu-west-1.amazonaws.com"
site = "https://{}.{}".format(bucket, domain)
location = '{}/assembly_tutorial'.format(site)
fname = "242318_classification_16s_barcode-v1.csv"
fname = downloads[inputs.dataset]
print("Downloading {} dataset...".format(inputs.dataset))
!wget "$location/$fname" \
&& echo "Download complete" || echo "Download failed"
!mv "$fname" sample_data.fastq.gz
inputs = InputForm(
InputSpec('dataset', 'Select dataset', list(downloads.keys())),)
inputs.add_process_button(process_form)
inputs.display()
VBox(children=(HBox(children=(Label(value='Select dataset', layout=Layout(width='150px')), interactive(childre…
To view the outcome of the download we can use the tree
command to show the contents of the working directory:
!tree .
The files should also appear in the File Browser to the left-hand side of the screen.
If you wish to analyse your own data rather than the sample data, you can edit the value .fastq input
variable below. To find the correct full path of a directory you can navigate to it in
the Files
browser to the left-hand side, right-click on the file and select Copy path
:
The location shared with the EPI2ME labs server from your computer will show as /epi2melabs
, for example a file located at /data/my_gridion_run/fastq_pass
on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass
when it is the /data
folder that is shared.
Having downloaded the sample data, or locating your own data in the file browser, we need to provide the filepaths as input to the notebook. This is done in the form below.
If you want simply to plot all the graphs in this tutorial for your dataset, rather than working through the tutorial, select
Run selected Cell and All Below
from theRun
menu above after executing the cell below.
output_folder = None
input_fastq = None
def process_form(inputs):
global output_folder, input_fastq
output_folder = inputs.output_folder
input_fastq = os.path.abspath(os.path.join(output_folder, "input.fastq.gz"))
!echo "Making output folder"
!mkdir -p "$output_folder"
input_data = inputs.input_data
!rationalize_fastq -i "$input_data" -o "$input_fastq"
inputs = InputForm(
InputSpec('input_data', '.fastq input', '/epi2melabs/assembly_tutorial/sample_data.fastq.gz'),
InputSpec('output_folder', 'Output folder', 'analysis'))
inputs.add_process_button(process_form)
inputs.display()
VBox(children=(HBox(children=(Label(value='.fastq input', layout=Layout(width='150px')), interactive(children=…
For small to mid-sized genomes, and meta-genomic applications, Oxford Nanopore Technologies' recommends the use of the flye assembler. Flye yields acceptable assembly results under a range of input conditions with little to no tuning from the user.
Oxford Nanopore Technologies also work with two other common assemblers:
The flye assembler has a relatively compact set of options, and is easily run with an input .fastq
file. To run the assembly execute the cell below. A genome size estimate is required (for example, "5m" and "2.6g" should be given for genomes of sizes 5 mega-bases and 2.6 giga-bases respectively), the code can be edited before running. The --threads
option specifies the amount of CPU resource that will be used. On a mid-range contemporary laptop the assembly will take around 30 minutes. For maximum performance set this option to the number of CPU cores available on your computer.
!flye --nano-raw "$input_fastq" --out-dir "$output_folder"/flye --genome-size 5m --threads 8
Providing flye runs to completion it will have output some basic statistics of the assembly, for example the sample data should give:
Total length: 4897024
Fragments: 3
Fragments N50: 4775934
Largest frg: 4775934
Scaffolds: 0
Mean coverage: 28
The expectation for bacterial genomes is to obtain a single fragment
corresponding to a whole chromosome. For larger genomes this might not always occur depending on the sequencing depth and complexity of the genome.
The flye assembler creates a reasonable quality consensus sequence from the aggregate information contained within the sequencing reads. However for improved accuracy it is recommended to run "assembly polishing," a process to correct residual base-level errors in the reads. This step will not change the gross structure of the assembly. To perform assembly polishing we will use the medaka program. Medaka uses a recurrent neural network to predict a consensus sequence from a pileup of reads aligned to a draft sequence.
Medaka contains a simple program for processing a draft assembly and read data. A particular point to note however is that it must be instructed which basecaller version and flowcell type (and so pore type) were used in the sequencing experiment; which model to use. A list of models can be found by running:
!run medaka tools list_models
As a reminder the correspondence between flowcell type and pore type is:
Flowcell | Pore type | Medaka model prefix
--- | --- | ---
FLO-MIN106D | R9.4.1 | r941_min
FLO-MIN111 | R10.3 | r103_min
FLO-PRO002 | R9.4.1 | r941_prom
For more information on choosing the correct model see the Sequence Correction section of the medaka documentation.
⚠️ Note: Be sure to edit the code below to use the correct model (
-m
option) for your data.
!run medaka_consensus \
-d "$output_folder"/flye/assembly.fasta -i "$input_fastq" \
-o "$output_folder"/medaka \
-t 8 -m r103_min_high_g345
pinger.send_notebook_ping('stop', 'assembly_tutorial')
When medaka has run to completion it will output a file consensus.fasta
in its output directory. This file contains the polished assembly sequences.
As a brief demonstration that things have gone well for the sample data provided with the tutorial, we will now compare the assembly sequence to a reference sequence for the analyte used in the experiment. A more in depth look at assessing assembly accuracy can be found in the links below.
A set of reference sequences for the analyte used in the sample data experiment can be downloaded using wget
, this file contains reference sequence for all the mock community samples in the experiment:
!wget 'https://ont-research.s3-eu-west-1.amazonaws.com/datasets/r941_zymo/references.fasta'
The assess_assembly
program from the pomoxis suite can be used to obtain an alignment-based quality analysis of a genome assembly. The basic workflow is to split assembly contigs into fixed lenth chunks before aligning these to a reference sequence. These alignments are then analysed for errors.
To run assess_assembly
only two arguments are required, a reference sequence and the assembly. The -p
option below sets an output prefix, while -t
sets the number of compute threads:
!mkdir -p analysis/assessment/
!assess_assembly -r references.fasta -i analysis/medaka/consensus.fasta \
-p analysis/assessment/assm -t 8
The summary Q scores
section of the output gives statistics of the errors of the fixed length alignment chunks expressed in the usual logarithmic fashion ($-10\log_{10}(P_{err})$) with the err_ont
row giving the total error rate:
$P_{err\_ont} = \frac{N_{ins} + N_{del} + N_{sub}}{N_{match} + N_{ins} + N_{del}}$
This tutorial has walked through how to assemble small to mid-sized genomes using the flye assembler. We have also shown how to improve the accuracy of our assembly using the medaka
consensus program. As a technology demonstrator we have also shown how to perform a reference-based quality assessment of an assembly.
For more information on genome assembly and assement see:
In a forthcoming tutorial we will provide a longer discussion on how to assess the quality of an assembly, both with and without a reference sequence.