The Summary statistics and QC tutorial is intended as a functional guide to help assess the quality characteristics of a single Nanopore sequencing run. This tutorial aims to enable an objective assessment of the performance of a Nanopore flowcell run and to assess the sequence characteristics to benchmark quality.
Sufficient information is provided in the tutorial that the workflow can be tested, validated, and replicated. The tutorial is provided with an example dataset from a barcoded sequence library, and is intended to address important questions about the performance and quality of a sequencing run.
Methods used in this tutorial include:
pandas and numpy for data handling
bokeh (via aplanat) for plotting
the sequencing_summary.txt file from MinKNOW or Guppy as the data source for parsing
Computational requirements for this tutorial are modest: the workflow is lightweight and is intended to run within a few minutes on a desktop computer running the EPI2ME Labs notebook server.
⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select
Edit > Clear all outputs
before using the notebook to analyse your own data.
This tutorial aims to summarise the data characteristics from an Oxford Nanopore Technologies sequencing run. Observations from basecalled reads and their quality characteristics, temporal performance, and barcoded content are presented. The information presented is derived solely from the sequencing_summary.txt
file produced during basecalling with the Guppy software.
The goals of this tutorial include deriving summary statistics with which to evaluate the relative performance of sequencing runs.
The steps of this tutorial follow the sections below.
The sequencing_summary.txt file
The sequencing_summary.txt
file is automatically produced during base-calling with the Guppy software. This summary file contains rich metadata for each sequence read produced during a run. These data include timestamp, pore duration, read quality, and channel information, in addition to the characteristics of the resulting DNA sequence. The workflow presented here uses this summary file rather than the raw FAST5
format data for performance reasons.
Tools such as pomoxis utilise the fastq files for quality metrics, and other tools make extensive use of the fast5 files. Parsing the fast5 files provides additional analytical context but is much more demanding in terms of compute resources and time. This tutorial is lightweight and is intended to run within a few minutes on a desktop computer.
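As an aside, and not part of the workflow proper, a quick way to see what columns a summary file contains is to read just its first few rows with pandas. The path below is a placeholder to edit; pandas infers bz2 or gzip compression from the file extension.
# Peeking at a sequencing summary file (illustrative sketch)
import pandas as pd
summary_path = "sequencing_summary.txt"  # placeholder: point this at your own file
peek = pd.read_csv(summary_path, delimiter='\t', nrows=5)
print(peek.columns.tolist())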
To start analysing our experiment we must first collate our data. The workflow below expects to be given a single sequencing summary file as output by the Oxford Nanopore Technologies' sequencing device.
Before anything else we will create and set a working directory:
from epi2melabs import ping
pinger = ping.Pingu()
pinger.send_notebook_ping('start', 'basic_qc_tutorial')
# create a work directory and move into it
tutorial_name = "basicqc_tutorial"
working_dir = '/epi2melabs/{}/'.format(tutorial_name)
!mkdir -p "$working_dir"
%cd "$working_dir"
/epi2melabs/basicqc_tutorial
To start analysing our sequencing_summary.txt
we must first "read in" the data. For this tutorial we will download a sample file of a Lambda phage sequencing run.
To download the sample file we run the Linux command wget. To execute the command, click on the cell and then press Command/Ctrl-Enter, or click the Play symbol on the left-hand side.
bucket = "ont-exd-int-s3-euwst1-epi2me-labs"
domain = "s3-eu-west-1.amazonaws.com"
site = "https://{}.{}".format(bucket, domain)
!wget -O lambda_sequencing_summary.txt.bz2 \
"$site/basicqc_tutorial/lambda_sequencing_summary.txt.bz2"
If you wish to analyse your own data rather than the sample data, you can edit the value of the input variable in the form below. To find the correct full path of a file you can navigate to it in the Files browser on the left-hand side, right-click on the file and select Copy path:
The location shared with the EPI2ME Labs server from your computer will show as /epi2melabs; for example, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.
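As an illustration (not part of the tutorial's own code), this mapping amounts to a simple prefix substitution, assuming here that /data is the shared folder:
# Translating a local path to its EPI2ME Labs view (illustrative sketch)
local_path = "/data/my_gridion_run/fastq_pass"
labs_path = local_path.replace("/data", "/epi2melabs", 1)
print(labs_path)  # prints: /epi2melabs/my_gridion_run/fastq_pass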
Having downloaded the sample data, or located your own data in the file browser, we need to provide the filepath as input to the notebook.
The form below reads the sequencing summary file using the pandas
library. The form can be used to enter the filename of your sequencing summary file.
If you simply want to plot all the graphs in this tutorial for your dataset, rather than working through the tutorial, select Run after from the Runtime menu above after entering your filepath.
import os
import pandas as pd
import aplanat
import aplanat.report
import aplanat.graphics
from epi2melabs.notebook import InputForm, InputSpec
import ipywidgets as widgets
seq_summary = None
exec_summary = None
report = None
def process_form(inputs):
    summary_file = os.path.abspath(inputs.summary_file)
    global seq_summary, exec_summary, report
    # Read the file, using a tab ('\t') as the separator between columns
    print("Reading input file...")
    seq_summary = pd.read_csv(
        summary_file, delimiter='\t',
        usecols=[
            'channel', 'start_time', 'duration', 'passes_filtering',
            'sequence_length_template', 'mean_qscore_template'])
    # ensure the file is sorted by time and reset its
    # indexing after doing so
    seq_summary.sort_values('start_time', inplace=True)
    seq_summary.reset_index(drop=True, inplace=True)
    print("Done.")
    exec_summary = aplanat.graphics.InfoGraphItems()
    report = aplanat.report.HTMLReport(
        "Basic QC Tutorial", "EPI2ME Labs Summary")
    report.markdown("## Executive summary", key='exec-head')
    report.plot(None, 'exec-plot')  # placeholder

inputs = InputForm(
    InputSpec('summary_file', 'Sequencing summary', 'lambda_sequencing_summary.txt.bz2'),)
inputs.add_process_button(process_form)
inputs.display()
The variable seq_summary
is now a pandas
dataframe object. Having read the file we can take a peek at the start of the file using the .head()
method of the dataframe object:
# Use the .head() method to check the contents of the dataframe
seq_summary.head()
|   | channel | start_time | duration | passes_filtering | sequence_length_template | mean_qscore_template |
|---|---------|------------|----------|------------------|--------------------------|----------------------|
| 0 | 154 | 8.71150 | 10.82900 | True | 3704 | 11.429653 |
| 1 | 323 | 9.33300 | 10.20775 | True | 3437 | 10.364430 |
| 2 | 216 | 9.34650 | 10.19425 | True | 3635 | 11.148887 |
| 3 | 403 | 9.41975 | 10.12100 | True | 3479 | 10.644968 |
| 4 | 412 | 9.54400 | 9.99675 | True | 3613 | 10.794808 |
In order to count the number of reads sequenced in the experiment, we can ask the dataframe how many rows it contains:
# Show the total number of reads
print("The total number of reads is:", len(seq_summary))
# Group the rows of the dataframe by the value of the `passes_filtering` column
# then ask the size of each group.
pass_fail = seq_summary.groupby('passes_filtering').size()
print("The number of pass reads is:", pass_fail[True])
print("The number of fail reads is:", pass_fail[False])
exec_summary.append('Total reads', pass_fail.sum(), 'angle-up', '')
The total number of reads is: 975715
The number of pass reads is: 814554
The number of fail reads is: 161161
As a more interesting way of summarising these values we can use the aplanat
library to create a stacked bar chart. The aplanat
library provides templates
to create common plots using the bokeh
library:
# Plotting pass and fail reads (click play)
import aplanat
from aplanat import bars
pass_percentage = 100 * pass_fail[True] / len(seq_summary)
values = [pass_percentage, 100 - pass_percentage]
classes = ['Pass', 'Fail']
colors = ['#54B8B1', '#EF4135']
plot = bars.single_hbar(
values, classes, colors,
title="Pass / Fail reads")
plot.xaxis.axis_label = '%age Reads'
aplanat.show(plot, background="#f4f4f4")
# add item to report
report.markdown("""
## Pass/fail classification
The run contained {} pass and {} fail reads for a total of {} reads.
""".format(pass_fail[True], pass_fail[False], len(seq_summary)),
key="pass-fail-head")
report.plot(plot)
Moving on from counting reads, let us examine the length of reads in this experiment.
We will first calculate the mean read length and the N50 read length:
# Calculating mean and N50 read length
# take the `sequence_length_template` column (the read lengths)
# and sort them in descending order
sorted_lengths = seq_summary.sequence_length_template.sort_values(ascending=False).reset_index(drop=True)
# calculate the cumulative sum of lengths
cumulative_length = sorted_lengths.cumsum()
# find the total number of bases and the mean length
total_bases = cumulative_length.iloc[-1]
mean_length = total_bases / len(seq_summary)
# find the N50 length: the read length such that reads of this
# length or longer contain half the total bases
n50_index = cumulative_length.searchsorted(total_bases / 2)
n50_length = sorted_lengths.iloc[n50_index]
exec_summary.append('Total yield', total_bases, 'signal', 'b')
exec_summary.append('Mean read length', mean_length, 'align-center', 'b')
exec_summary.append('N50 read length', n50_length, 'align-left', 'b')
report.markdown("""
## Read Lengths and yield
The following graphics depict the read-length distribution in a variety
of ways, see the tutorial for additional details.
""",
key='length-yield-head')
These values will be used below to annotate the plots. We can depict the read length distribution in several ways. The simplest is a plain histogram:
# Plotting the read length distribution (click play)
# plot a histogram of the read lengths
from aplanat import hist, annot
datas = [seq_summary.sequence_length_template]
plot = hist.histogram(datas, bins=400, title="Read length distribution", colors=['#0084A9'])
# add vertical lines for mean and N50 read length
annot.marker_vline(plot, mean_length, 'Mean: {:.0f}'.format(mean_length))
annot.marker_vline(plot, n50_length, 'N50: {}'.format(n50_length), text_baseline='top')
# add axis labels
plot.xaxis.axis_label = 'Read Length / bases'
plot.yaxis.axis_label = 'Number of reads'
# show the plot
aplanat.show(plot, background="#f4f4f4")
report.plot(plot, key='length-plain')
The distribution of sequence lengths will be dependent on the protocols that have been used to extract and/or prepare the starting DNA. Sequences from amplicon DNA will have a tight distribution of read lengths, while sequences from genomic DNA will have a broader distribution. The distribution will be further influenced if a size-selection step has been used, and will also be dependent on the choice of sequencing library preparation kits. The read-length distribution should be assessed to see if the distribution is concordant with that expected. For the sample dataset, the graph depicts a spread of read lengths with a strong peak in the distribution at around 48 kbases, the length of the genome under examination.
A second way to visualise the read length distribution is as a mass distribution, or a weighted histogram. This method most similarly portrays the intensities that would be seen in an electrophoresis experiment.
# Plotting read-length mass distribution (click play)
from aplanat import hist, annot
datas = [seq_summary.sequence_length_template]
weights = [seq_summary.sequence_length_template.astype(float)]
plot = hist.histogram(
datas, weights=weights,
normalize=True,
bins=400, title="Read length distribution",
colors = ["#0084A9"])
# add vertical lines for mean and N50 read length
annot.marker_vline(plot, mean_length, 'Mean: {:.0f}'.format(mean_length))
annot.marker_vline(plot, n50_length, 'N50: {}'.format(n50_length), text_baseline='top')
# add axis labels
plot.xaxis.axis_label = 'Read Length / bases'
plot.yaxis.axis_label = 'Base density'
# show the plot
aplanat.show(plot, background="#f4f4f4")
report.plot(plot, key='length-mass')
*Note how a normalised density has been plotted, to avoid misinterpretation when plotting simply the total bases per bin.*
The weighted read length histogram above shows the binned distribution of sequence length against the number of sequenced bases contained within each bin. This plot will show clear peaks if, for example, amplicons are sequenced or if size selection has been performed.
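To make the weighting explicit, the sketch below (illustrative, using numpy directly rather than aplanat) shows how supplying each read's length as its own weight turns per-bin read counts into per-bin base counts:
# Weighted histogram from first principles (illustrative sketch)
import numpy as np
lengths = seq_summary.sequence_length_template.values
read_counts, bin_edges = np.histogram(lengths, bins=400)
base_counts, _ = np.histogram(lengths, bins=bin_edges, weights=lengths)
# normalising, as in the plot above, gives a base density per bin
base_density = base_counts / base_counts.sum()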
A final way of viewing this data is to take the cumulative sum of the number of bases and plot it against the read lengths, to obtain a curve expressing the number of bases contained within reads above a given length:
# Plotting cumulative yield by read length (click play)
from aplanat import lines
x = sorted_lengths[0:-1:10000]
y = cumulative_length[0:-1:10000] / 1e9
plot = lines.line([x], [y], xlim=(0, None), ylim=(0, None),
title='Base yield above given read length',
colors = ["#0084A9"])
plot.xaxis.axis_label = 'Read Length / bases'
plot.yaxis.axis_label = 'Yield above length / Gbases'
# show the plot
aplanat.show(plot, background="#f4f4f4")
report.plot(plot, 'yield-length')
This graph shows how much of the total yield is contained within reads above any given length; many short fragments may contribute as much to the total yield as fewer longer fragments.
From the graph we can read off the total yield of the experiment at the left-most point of the blue line.
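We can cross-check that reading against the value computed earlier (illustrative one-liner; total_bases was calculated in the read-length cell above):
# Cross-checking the curve's left-most value (illustrative sketch)
print("Total yield: {:.2f} Gbases".format(total_bases / 1e9))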
In this section we will review the predicted quality of the sequencing
data using the mean per-base quality score of reads written in the sequencing_summary.txt
file.
Let's start by simply plotting a histogram of the quality scores:
# Histogram of read quality (click play)
passes = seq_summary['passes_filtering']
datas = [
seq_summary.loc[passes == status]['mean_qscore_template']
for status in [True, False]]
plot = hist.histogram(
datas, colors=['#54B8B1', '#EF4135'],
names=['Pass', 'Fail'], bins=200, xlim=(1, None),
title="Read quality score distribution")
plot.xaxis.axis_label = 'Read Quality Score'
plot.yaxis.axis_label = 'Count reads / M'
aplanat.show(plot, background="#f4f4f4")
pass_mean_qscore = seq_summary[seq_summary['passes_filtering'] == True]["mean_qscore_template"].mean()
exec_summary.append('Mean qscore (pass)', pass_mean_qscore, 'thumbs-up', '')
report.markdown("""
## Read quality
Distribution of read quality for pass and fail reads.
""", key='qual-head')
report.plot(plot, key='qual-plot')
The read quality score is calculated as the mean of the per-base quality scores of a read. Each per-base quality gives an estimate of the probability that the base was called in error, expressed on a logarithmic scale: $QV = -10\log_{10}(P_{error})$.
Evident from the histogram above, a quality score of 7 is the threshold value for determining the Pass or Fail status of reads. The boundary is chosen to be in the tail of the distribution. This QV filter is applied during the base-calling process as a modifiable parameter. For downstream analyses we recommend processing only the higher-quality sequence reads.
A key metric in the quality review of a sequencing run is an analysis of its temporal performance. During a run each sequencing channel will address a selection of different pores, and the individual pores may become temporarily or permanently blocked. It is therefore expected that sequencing productivity will decrease during a run. It is useful to consider whether the observed productivity decline is normal or whether it happens more rapidly than expected. A rapid pore decline could be indicative of contaminants within the sequencing library.
Plotting the number of bases generated per unit time over the course of a sequencing run can reveal unexpected behaviours. In an ideal experiment there should not be any sudden decreases in performance.
def process_form(inputs):
    # assign each read to a time interval of `discretize` minutes
    seq_summary['time'] = (seq_summary['start_time'] / 60 / inputs.discretize).astype(int)
    # group the data by time interval and pass/fail status
    groups = seq_summary.groupby(['passes_filtering', 'time'], as_index=False)
    # sum the bases per group
    throughput = groups['sequence_length_template'].agg('sum')
    data = [
        throughput[throughput.passes_filtering == status]
        for status in [True, False]]
    plot = lines.line(
        [d.time / (60 / inputs.discretize) for d in data],
        [d.sequence_length_template / inputs.discretize / 1e6 for d in data],
        names=['Pass', 'Fail'],
        colors=['#54B8B1', '#EF4135'],
        xlim=(0, None), ylim=(0, None),
        title='Throughput',
        x_axis_label='time / hr',
        y_axis_label='throughput / (Mbases / min)')
    aplanat.show(plot, background="#f4f4f4")
    report.markdown("""
## Performance through time
The following charts illustrate read, yield, and flowcell performance through
the course of the sequencing experiment.
""", key='time-head')
    report.plot(plot, key='yield-time-plot')

tp_inputs = InputForm(
    InputSpec('discretize', 'Time discretization (min)', (1, 60, 1)), description_width="180px")
tp_inputs.add_process_button(process_form)
tp_inputs.display()
The above plot shows a steady decline in the sequencing rate for Pass reads, with a constant background of Fail reads. The dips in throughput every hour are normal: the sequencing control software MinKNOW periodically reassesses from which pores to record data.
A second way of viewing the sequencing throughput is to plot the cumulative yield over time.
# Plotting base yield through time (click play)
# group the data by pass/fail
groups = seq_summary.set_index(['passes_filtering']).groupby(level=0, as_index=False)
# calculate a cumulative sum of the number of bases sequenced (per group)
sum_tp = groups['sequence_length_template'].cumsum().reset_index()
# attach the start times from the original table
sum_tp['start_time'] = seq_summary['start_time']
data = [
sum_tp[::1000][sum_tp[::1000].passes_filtering == status]
for status in [True, False]]
plot = lines.line(
[d.start_time / 3600 for d in data],
[d.sequence_length_template / 1e9 for d in data],
names=['Pass', 'Fail'],
colors=['#54B8B1', '#EF4135'],
xlim=(0,None), ylim=(0, None),
title='Cumulative yield',
x_axis_label = 'time / hr',
y_axis_label = 'yield / Gbases')
aplanat.show(plot, background="#f4f4f4")
report.plot(plot, key='yield-time-plot')
As well as changes in throughput over time another diagnostic metric to evaluate is the sequencing speed: the number of bases processed per second by the enzymes controlling the movement of DNA through the nanopore channels.
# Plotting sequencing speed and read length (click play)
from bokeh import layouts
hour = (seq_summary.start_time / 3600).astype(int)
speed = seq_summary.sequence_length_template / seq_summary.duration
spd_plot = bars.boxplot_series(
hour, speed, title='Sequencing speed',
x_axis_label = 'time / hr',
y_axis_label = 'speed / (bases / s)')
length = seq_summary.sequence_length_template
len_plot = bars.boxplot_series(
hour, length, title='Sequencing Lengths',
x_axis_label='time / hr',
y_axis_label='read length / bases')
len_plot.x_range=spd_plot.x_range
p = layouts.column(spd_plot, len_plot)
aplanat.show(p, background="#f4f4f4")
report.plot(p, key='speed-length-boxplots')
The graphs above all depict the changing rate of sequencing from the perspective of the volume of data produced. One contribution to the falling sequencing rate is that the number of functional channels falls throughout the experiment.
A channel is defined as being active if one or more sequence reads are observed per time window. It is expected that over the course of the run pores will block and the number of channels producing data will decrease. Changing the pore used by the channel and strategies to unblock pores mean that the number of functional channels may increase or decrease at a given timepoint but generally the number of channels producing data will decrease over time.
def process_form(inputs):
    # calculate a start time for each read rounded to `discretize` minutes
    seq_summary['time'] = (seq_summary['start_time'] / 60 / inputs.discretize).astype(int)
    # group the data by time interval
    groups = seq_summary.groupby('time', as_index=False)
    # count the unique channels seen in each interval
    chan_counts = groups['channel'].apply(lambda x: len(x.unique()))
    hours = chan_counts.index / (60 / inputs.discretize)
    plot = lines.line(
        [hours],
        [chan_counts['channel']],
        xlim=(0, None), ylim=(0, None),
        title='Functional channels ({}min period)'.format(inputs.discretize),
        x_axis_label='time / hr',
        y_axis_label='active channels',
        colors=["#0084A9"])
    aplanat.show(plot, background="#f4f4f4")
    report.plot(plot, 'active-channels-plot')

ac_inputs = InputForm(
    InputSpec('discretize', 'Time discretization (min)', (1, 60, 1)), description_width="180px")
ac_inputs.add_process_button(process_form)
ac_inputs.display()
The nanopores through which DNA is passed, and signal collected, are arrayed as a 2-dimensional matrix. A heatmap can be plotted showing channel productivity against spatial position on the matrix. Such a plot enables the identification of spatial artifacts that could result from membrane damage through e.g. the introduction of an air bubble. This heatmap representation of spatial activity shows only gross spatial aberrations. On the MinION, GridION, and PromethION platforms (but not Flongle) each channel can address four different pores ("mux"); the activity plot below therefore shows the number of sequences produced per channel, not strictly per pore.
Counting reads and channel map
We will break displaying the channel activity into two steps since the code is quite long. First we count reads per channel from our dataset; for this we need to calculate the positions of channels on the flowcell:
# Calculating spatial position of channels on flowcells (click play)
import numpy as np
def minion_array():
    # The array is composed of blocks of 64 channels; the low
    # half is to the right (high to the left), working inwards
    # in a 4 x 8 block
    def channel_block(i):
        m = np.zeros((4, 16), dtype=int)
        m[4::-1, :7:-1] = np.arange(i, i + 32).reshape(4, 8)
        m[4::-1, 0:8] = np.arange(i + 32, i + 64).reshape(4, 8)
        return m
    # the blocks ascend from high to low, except the
    # lowest block is at the top
    first_channel = list(range(65, 450, 64)) + [1]
    channel_array = np.vstack([channel_block(x) for x in first_channel])
    ca = pd.DataFrame(channel_array).reset_index().melt('index')
    ca.columns = ['row', 'column', 'channel']
    return ca

def flongle_array():
    # channels are in a simple grid, except the two upper- and
    # lower-most channels of the right-hand column are missing
    channel_array = np.concatenate([
        np.arange(1, 13), np.array([0]),
        np.arange(13, 25), np.array([0]),
        np.arange(25, 115), np.array([0]),
        np.arange(115, 127), np.array([0])]).reshape(10, 13)
    ca = pd.DataFrame(channel_array).reset_index().melt('index')
    ca.columns = ['row', 'column', 'channel']
    return ca

def promethion_array():
    # the array is simple blocks of 25 x 10 channels
    channel_array = np.hstack([
        np.arange(x, x + 250).reshape(25, 10)
        for x in range(1, 2752, 250)])
    ca = pd.DataFrame(channel_array).reset_index().melt('index')
    ca.columns = ['row', 'column', 'channel']
    return ca

def guess_device(max_channel):
    device = "minion"
    if max_channel > 512:
        device = "promethion"
    if max_channel <= 130:
        device = "flongle"
    return device

channel_maps = {
    'minion': minion_array(),
    'flongle': flongle_array(),
    'promethion': promethion_array()}
Using the channel map from the code cell above we can create a table with columns row, column, and reads:
device = guess_device(seq_summary['channel'].max())
channel_map = channel_maps[device]
# count pass reads per channel...
counts = seq_summary[seq_summary['passes_filtering']] \
.groupby('channel') \
.size().to_frame('reads') \
.reset_index()
# ...and merge with the channel map
counts = counts.merge(channel_map, on='channel', how='outer').fillna(0)
Now that we have our counts, we can plot a heat map:
# Plotting activity heat-map plot (click play)
from aplanat import spatial
plot = spatial.heatmap(
counts['column'], counts['row'], counts['reads'],
name='# reads')
aplanat.show(plot, background="#f4f4f4")
report.markdown("### Active channels", key='active-channels-head')
report.plot(plot, key='active-channels-heat')
As a final summary of the experiment the infographic below displays a small set of headline numbers captured through the course of the analysis:
# Executive summary (click play)
pinger.send_notebook_ping('stop', 'basic_qc_tutorial')
exec_plot = aplanat.graphics.infographic(exec_summary.values(), ncols=3)
aplanat.show(exec_plot, background="#f4f4f4")
report.plot(exec_plot, key='exec-plot')
fname = os.path.join(os.path.dirname(inputs.summary_file), 'report.html')
report.write(fname)
print("Report written to: {}.".format(fname))
Report written to: report.html.
This tutorial has stepped through a basic QC analysis of a Nanopore sequencing run. The sequencing throughput and read lengths have been explored, as has the progression of the experiment through the course of time.
The analysis presented can be run on any sequencing_summary.txt
from MinKNOW or Guppy. The code will run within the EPI2ME Labs notebook server environment.