# Basic QC Tutorial

#### Expected duration: 15 minutes

The Summary statistics and QC tutorial is intended as a functional guide to help assess the quality characteristics of a single Nanopore sequencing run. This tutorial aims to enable an objective assessment of the performance of a Nanopore flowcell run and to assess the sequence characteristics to benchmark quality.

Sufficient information is provided in the tutorial that the workflow can be tested, validated and replicated. The tutorial is provided with an example dataset from a barcoded sequence library. The tutorial is intended to address important questions:

• How many reads (and how many gigabases) were sequenced?
• What fraction of my sequence collection is good quality?
• How are longer sequence reads represented in my sample?

Methods used in this tutorial include:

• Python for statistical analysis and reporting, including use of pandas, numpy, and bokeh,
• the sequencing_summary.txt file from MinKNOW or Guppy as data source for parsing.

Computational requirements for this tutorial include:

• Computer running the EPI2ME Labs notebook Server
• At least 8 Gb RAM
• Runtime with provided example data - approximately 10 minutes

⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

## Introduction¶

This tutorial aims to summarise the data characteristics from an Oxford Nanopore Technologies sequencing run. Observations from basecalled reads and their quality characteristics, temporal performance, and barcoded content are presented. The information presented is derived solely from the sequence_summary.txt file produced during basecalling with the Guppy software.

The goals from this tutorial include:

• To introduce a literate framework for analysing base-calling summary statistics to evaluate the relative performance of runs.
• To provide basic QC metrics so that a review and consideration of experimental data can be undertaken.
• To provide training as to which QC metrics are of most interest and to encourage an understanding of how different aspects of sequence data quality can be attributed to sample characteristics from DNA isolation and library preparation.

The steps of this tutorial will be:

• Basic QC analysis - including number of reads, length of reads, yield and quaility scores.
• Performance of a sequencing run through time - including sequencing throughput, speed and fragment length.
• Sequencing channel activity - a way to visualise nanopore channel productivity.

## The sequencing_summary.txt file¶

The sequencing_summary.txt file is automatically produced during base-calling with the Guppy software. This summary file contains rich metadata for each sequence read produced during a run. These data include timestamp, pore duration, read quality, and channel information, in addition to the characteristics of the resulting DNA sequence. This workflow presented here uses this summary file rather than the raw FAST5 format data for performance reasons.

Tools such as pomoxis utilise the fastq files for quality metrics, and other tools make extensive use of the fast5 files. Parsing the fast5 files provides additional analytical context but is much more demanding in terms of compute resource and time. This tutorial is lightweight and is intended to run within a few minutes on a desktop computer.

## Basic QC analysis and results¶

To start analysing our experiment we must first collate our data. The workflow below expects to be given a single sequencing summary file as output by the Oxford Nanopore Technologies' sequencing device.

Before anything else we will create and set a working directory:

### Sample Data¶

To start analysing our sequencing_summary.txt we must first "read in" the data. For this tutorial we will download a sample file of a Lambda phage sequencing run.

To download the sample file we run the linux command wget. To execute the command click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

### Using your own data¶

If you wish to analyse your own data rather than the sample data, you can edit the value .fastq input variable below. To find the correct full path of a directory you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

The location shared with the EPI2ME labs server from your computer will show as /epi2melabs, for example a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.

### Data entry¶

Having downloaded the sample data, or locating your own data in the file browser, we need to provide the filepaths as input to the notebook.

The form below reads the sequencing summary file using the pandas library. The form can be used to enter the filename of your sequencing summary file.

If you want simply to plot all the graphs in this tutorial for your dataset, rather than working through the tutorial, select Run after from the Runtime menu above after entering your filepath.

The variable seq_summary is now a pandas dataframe object. Having read the file we can take a peek at the start of the file using the .head() method of the dataframe objects:

### Numbers of reads¶

In order to count the number of reads sequenced in the experiment, we can ask the dataframe how many rows it contains:

As a more interesting way of summarising these values we can use the aplanat library to create a stacked bar chart. The aplanat library provides templates to create common plots using the bokeh library:

### Read Lengths and Yield¶

Moving on from counting reads, let us examine the length of reads in this experiment.

We will first calculate the mean read length and the N50 read length:

These values be will used below to annotate the plots. We can depict the read length distribution in several ways. The simplest is as a simple histogram:

The distribution of sequence lengths will be dependent on the protocols that have been used to extract and/or prepare the starting DNA. Sequences from amplicon DNA will have a tight distribution of read lengths, while sequences from genomic DNA will have a broader distribution. The distribution will be further influenced if a size-selection step has been used, and will also be dependent on the choice of sequencing library preparation kits. The read-length distribution should be assessed to see if the distribution is concordant with that expected. For the sample dataset, the graph depicts a spread of read lengths with a strong peak in the distribution at around 48 kbases, the length of the genome under examination.

A second way to visualise the read length distribution is as a mass distribution, or a weighted histogram. This method most similarly portrays the intensities that would be seen in an electrophoresis experiment.

Note how a normalised density has been plotted, to avoid misinterpretation when plotting simply the total bases per bin.

The weighted read length histogram above shows the binned distribution of sequence length against number of sequence nucleotides contained within the bin. This plot will show clear peaks if for example, amplicons are sequenced or if size selection has been performed.

A final way of viewing this data is to take the cumulative sum of the number of bases and plot it against the read lengths, to obtain a curve expressing the number of bases contained within reads above a given length:

This graph indicates how many small fragments contribute equally to the total yield as fewer longer fragments.

From the graph we can read off that the total yield of the experiment from the left-most point of the blue line.

### Quality Scores¶

In this section we will review the predicted quality of the sequencing data using the mean per-base quality score of reads written in the sequencing_summary.txt file.

Let's start by simply plotting a histogram of the quality scores:

The read quality score is calculated as the mean of the per-base quality scores of a read. Each per base quality gives an estimation of the error of that base in expressed on a logarithmic scale: $QV = -10\log_{10} (P_{error})$).

Evident from the histogram above, a quality score of 7 is the threshold value for determining the Pass of Fail status of reads. The boundary is chosen to be in the tail of the distribution. This QV filter is applied during the base-calling process as a modifiable parameter. For downstream analyses we recommend processing only the higher-quality sequence reads.

## Performance through time¶

A key metric in the quality review of a sequencing run is an analysis of the temporal performance of the run. During a run each sequencing channel will address a selection of different pores and the individual pores may become temporarily or permanently blocked. It is therefore expected that during a run sequencing productivity will decrease. It is useful to consider whether the observed productivity decline is normal or if it happens more rapidly than expected. A rapid pore decline could be indicative of contaminants with the sequencing library.

### Throughput¶

Plotting the number of bases generated per unit time over the course of a sequencing run can reveal unexpected behaviours. In an ideal experiment there should not be any sudden decreases in performance.

The above plot shows a steady decline in the sequencing rate for Pass reads, with a constant background of Fail reads. The dips in throughput every hour are normal as the sequencing control software MinKNOW reassessed from which pores to record data.

A second way of viewing the sequencing throughput is to plot the cumulative yield over time.

### Sequencing speed and fragment length through time¶

As well as changes in throughput over time another diagnostic metric to evaluate is the sequencing speed: the number of bases processed per second by the enzymes controlling the movement of DNA through the nanopore channels.

### Active Channels¶

The graphs above all depict the changing rate of sequencing from the perspective of the volume of data produced. A contribution to the falling sequencing rate is the number of functional channels falls throughout the experiment.

A channel is defined as being active if one or more sequence reads are observed per time window. It is expected that over the course of the run pores will block and the number of channels producing data will decrease. Changing the pore used by the channel and strategies to unblock pores mean that the number of functional channels may increase or decrease at a given timepoint but generally the number of channels producing data will decrease over time.

## Sequencing channel activity plot¶

The nanopores through which DNA is passed, and signal collected, are arrayed as a 2-dimensional matrix. A heatmap can be plotted showing channel productivity against spatial position on the matrix. Such a plot enables the identification of spatial artifacts that could result from membrane damage through e.g. the introduction of an air-bubble. This heatmap representation of spatial activity shows only gross spatial aberations. On the MinION, GridION, and PromethION platforms (but not Flongle) each channel can address four different pores ("mux"), the activity plot below therefore shows the number of sequences produced per channel, not strictly per pore.

Counting reads and channel map

We will break apart displaying the channel activity into two steps since the code is quite long. First we count form our dataset; for this we need to calculate the positions of channels in the flowcell:

Using the channel map from the above code cell we can create a table with columns row, column, and reads:

Now that we have out counts, we can plot a heat map:

## Headline numbers¶

As a final summary of the experiment the infographic below displays a small set of headline numbers captured through the course of the analysis: