# Validation of cloned DNA material using Nanopore sequencing

#### Expected Duration: 45 minutes

The workflow below is intended to assess a molecular cloning experiment, specifically whether the introduction of one sequence into another has been successful: for example, the introduction of a protein-encoding sequence into a plasmid vector molecule.

Questions answered by this tutorial include:

• If a barcoding strategy has been used, what fraction of demultiplexed reads corresponds to the target of interest?
• Does a sequenced DNA construct correspond to its expected sequence?
• Does the sequenced DNA construct contain frameshifts or base substitutions?
• If the DNA construct encodes a peptide sequence, is the peptide sequence correct?

Methods used in this tutorial include:

• guppy_barcoder for demultiplexing barcoded sequence reads,
• mini_align from the pomoxis package for aligning sequence reads to the target sequence of interest, and
• flye for assembling the input sequencing reads.

Computational requirements for this tutorial:

• A computer running the EPI2ME Labs server
• 16 GB RAM

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select Edit > Clear all outputs before using the notebook to analyse your own data.

## Introduction

This tutorial aims to determine the success of a molecular cloning experiment: whether one DNA sequence has been correctly inserted into another, as the experimentalist was expecting.

The goals of this tutorial are to:

• Understand how to perform basic QC steps on the input data.
• Know how to assess the circularisation of a sequence assembly.
• Verify that the required sequence has been correctly inserted into the target.

This workflow naturally requires knowledge of the target sequence and, optionally, its encoded peptide. The methodology presented is based on a sequence assembly method; a simpler alternative would be to employ sequence alignment alone.

## Getting started

The workflow below requires a single folder containing .fastq files from an Oxford Nanopore Technologies sequencing device, or a single such file. Compressed or uncompressed files may be used. In addition, a DNA reference sequence for the vector and a protein reference sequence for the inserted DNA sequence are required.

Before anything else we will create and set a working directory:
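As an illustration, a minimal sketch of such a step is given here. The function name `set_working_directory` and the directory name in the usage comment are hypothetical and are not taken from the notebook itself.

```python
import os
from pathlib import Path


def set_working_directory(path):
    """Create `path` if it does not exist, change into it, and return it."""
    workdir = Path(path).expanduser().resolve()
    workdir.mkdir(parents=True, exist_ok=True)
    os.chdir(workdir)
    return workdir


# Example use (hypothetical directory name):
#     set_working_directory("cloning_validation")
```

All relative paths used later in the analysis would then resolve inside this directory.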

### Install additional software

This tutorial uses a couple of software packages that are not included in the default EPI2ME Labs server. Below we will install software packages that include last and diamond using the conda package manager.

Please note that the software installed is not persistent, and this step will need to be re-run if you stop and restart the EPI2ME Labs server.
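Because the installation does not persist, it can be useful to confirm that the tools are available before continuing. The sketch below assumes the executables are named `lastal` (from the last package) and `diamond`; the helper `missing_tools` is a hypothetical name introduced for illustration.

```python
import shutil


def missing_tools(tools=("lastal", "diamond")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Example use:
#     absent = missing_tools()
#     if absent:
#         print("Please re-run the installation step; missing:", absent)
```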

### Sample Data

To demonstrate the workflow below, a sample dataset is included with this tutorial. The dataset comprises data from a small Flongle sequencing experiment together with a vector DNA sequence and target gene protein sequence.

To download the sample file we run the Linux command wget. To execute the command, click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

The dataset contains four files. One of these (flone_clone_data.fastq.gz) is a small set of single-molecule sequencing reads from a Flongle flowcell. The reads comprise five multiplexed samples. For two of these samples (barcodes 01 and 02) the dataset contains the target peptides: peptide1.fasta and peptide2.fasta respectively. These have been successfully incorporated into the vector sequence vector.fasta, as the workflow will show.

### Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the value of the .fastq input variable below. To find the correct full path of a file or directory, navigate to it in the Files browser to the left-hand side, right-click on it and select Copy path:

The location shared with the EPI2ME Labs server from your computer will show as /epi2melabs. For example, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.
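The path mapping described above can be sketched as a small helper. This is only an illustration: the function name `to_server_path` is hypothetical, and POSIX-style paths are assumed on both sides.

```python
from pathlib import PurePosixPath


def to_server_path(host_path, shared_root="/data", mount_point="/epi2melabs"):
    """Translate a path on the host into the path seen inside the server.

    Raises ValueError if `host_path` is not inside `shared_root`.
    """
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(shared_root))
    return str(PurePosixPath(mount_point) / relative)


# Example use:
#     to_server_path("/data/my_gridion_run/fastq_pass")
#     gives "/epi2melabs/my_gridion_run/fastq_pass"
```

A path outside the shared folder raises an error, which mirrors the fact that such files are not visible to the server.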

### Data entry

Having downloaded the sample data, or located your own data in the file browser, we need to provide the filepaths as input to the notebook.

This workflow has some manual intervention steps to check the sanity of intermediate outputs. It is not recommended to use the Run All functionality.

The form can be used to enter the filenames of your inputs.

With our workspace prepared we will now move on to the analysis of our data.

### Preliminary Analysis Section

It is likely that multiple DNA constructs will have been barcoded and run in parallel during a sequencing run. For this reason we include an optional barcode demultiplexing step in this workflow.

After demultiplexing, this notebook is concerned with only a single sample. To analyse all samples, the later sections of this notebook will need to be re-run for each sample in turn. The results of analysing each of the samples will be accumulated into a report which can be downloaded at the end of the notebook.

If a barcoding strategy has not been used within your experiment, this step can be skipped.

Once demultiplexing has been run, the code cell below will allow you to view the number of reads found to contain each barcode and to select which group of reads to use in the remainder of the workflow. If you have not run demultiplexing, the code will still set parameters for the workflow. It is therefore necessary to run this step even in the case of a single (unbarcoded) analyte.

Our choice of barcode has been made. Let us quickly review the quality and size distribution of the associated sequences:
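To make the idea of such a review concrete, the sketch below summarises read count, read lengths, and mean read quality from a FASTQ file using only the Python standard library. It assumes four-line FASTQ records with Phred+33 qualities; the function names are hypothetical and this is not the plotting code used by the notebook.

```python
import gzip
import math


def read_fastq(path):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break
            seq = handle.readline().rstrip()
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip()
            yield header[1:].split()[0], seq, qual


def mean_quality(qual):
    """Mean Phred score derived from the mean per-base error probability."""
    if not qual:
        return 0.0
    errors = [10 ** (-(ord(char) - 33) / 10) for char in qual]
    return -10 * math.log10(sum(errors) / len(errors))


def summarise(path):
    """Return read count, total bases, mean length and mean quality."""
    lengths, quals = [], []
    for _, seq, qual in read_fastq(path):
        lengths.append(len(seq))
        quals.append(mean_quality(qual))
    n = len(lengths)
    return {
        "reads": n,
        "bases": sum(lengths),
        "mean_length": sum(lengths) / n if n else 0.0,
        "mean_quality": sum(quals) / n if n else 0.0,
    }
```

Averaging error probabilities rather than Phred scores directly avoids overstating the quality of reads that contain a few very poor bases.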

## Analysis of a single barcoded analyte

In the sections above we have prepared our single-molecule sequence reads for further examination. The remainder of this notebook deals with the assembly of these reads and comparison of the assembly to the reference sequences provided.

Use the form below to select a set of parameters for the filtering and downsampling of the sequence collection. The code block will filter the available .fastq sequences using seqkit, and then further downsample (also using seqkit) to produce a dataset of approximately the fold-coverage requested.
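The arithmetic behind downsampling to a requested fold-coverage can be sketched as follows. This is an illustration of the calculation only, not the seqkit invocation used by the notebook; the function names and the length thresholds are hypothetical.

```python
def filter_lengths(read_lengths, min_length, max_length):
    """Keep only read lengths within [min_length, max_length]."""
    return [n for n in read_lengths if min_length <= n <= max_length]


def sampling_fraction(read_lengths, reference_length, target_coverage):
    """Fraction of reads to keep so that total bases / reference length
    is approximately `target_coverage`. Returns 1.0 when the data
    available already fall short of the target."""
    if reference_length <= 0 or target_coverage <= 0:
        raise ValueError("reference_length and target_coverage must be positive")
    available = sum(read_lengths) / reference_length
    if available <= target_coverage:
        return 1.0
    return target_coverage / available


# Example: 100 reads of 1,000 bases over a 10,000-base plasmid give 10-fold
# coverage, so half of the reads are needed to reach 5-fold.
#     sampling_fraction([1000] * 100, 10000, 5)
```

Because reads are sampled at random, the coverage obtained is approximate, which is why the text speaks of "approximately the fold-coverage requested".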

At this point we have selected, filtered, and optionally downsampled our sequencing reads. To check we have a reasonable volume and quality of data to perform the assembly step we will align the reads to the vector reference given at the start of the workflow.

Two helpful visualizations of these alignment results are the depth of sequencing across the reference and the accuracy of the reads with respect to the reference.
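As an illustration of what the depth plot represents, the sketch below computes per-base depth from a list of alignment intervals. It assumes half-open `(start, end)` reference coordinates, one interval per primary alignment; the function name `depth_profile` is hypothetical, and extracting the intervals from an alignment file is not shown.

```python
def depth_profile(intervals, reference_length):
    """Per-base depth from half-open (start, end) alignment intervals."""
    deltas = [0] * (reference_length + 1)
    for start, end in intervals:
        start = max(0, start)
        end = min(reference_length, end)
        if start < end:
            deltas[start] += 1  # coverage begins at `start`
            deltas[end] -= 1    # and stops just before `end`
    depth, running = [], 0
    for delta in deltas[:reference_length]:
        running += delta
        depth.append(running)
    return depth


# Example use:
#     depth_profile([(0, 5), (3, 8)], 10)
```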

In this experiment we are expecting differences from the reference sequence to occur in our reads, so the accuracy of the reads may appear lower than otherwise expected. The plots also report only a single, primary, alignment of each read: if the read length is a significant fraction of the plasmid size, coverage may appear reduced at the edges of the reference and read coverage may be low. This is expected and normal.

Provided these plots look reasonable, showing the target coverage across the reference and read accuracy above 93%, we can proceed to the assembly section of the workflow. If the coverage is low, try altering the filtering and sampling settings above.

### De novo assembly of the plasmid sequence

It is now time to create an assembly from our selected read data. To do this we will use the flye assembler. Flye is a versatile and moderately fast assembler which provides good results in most situations with little parameterisation:

When flye has finished running it will have output something similar to:

    Total length:   9997
    Fragments:      1
    Fragments N50:  9997
    Largest frg:    9997
    Scaffolds:      0
    Mean coverage:  63

If this is not the case, return to the data filtering section above.

The statistics above summarise a file output by flye which can be inspected by running:
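As an illustration, the sketch below reads such a summary table in Python. It assumes the file is flye's tab-separated `assembly_info.txt`, whose header line begins with `#seq_name`; the exact column set is an assumption about the flye version in use, and the file location is left as a parameter because it depends on the output directory chosen.

```python
import csv


def read_assembly_info(path):
    """Read a tab-separated assembly summary into a list of dictionaries.

    A leading '#' on header names (e.g. '#seq_name') is stripped.
    All values are returned as strings.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            {key.lstrip("#"): value for key, value in row.items()}
            for row in reader
        ]


# Example use (the path is hypothetical):
#     for contig in read_assembly_info("assembly/assembly_info.txt"):
#         print(contig["seq_name"], contig["length"])
```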

We expect to see a single long contig. Its length may well be larger than the expected length, and close to a whole-number multiple of it. If this is so, try running the assembly again with a larger value of the overlap fraction. If you obtain more than a single contig, try altering the parameter until only one is obtained.

In the table, the contig will likely be identified as circular. A circular contig from the de novo assembly of a plasmid will likely contain duplicated sequences at the end of the sequence. Let's check if this is the case by preparing a dotplot showing regions of similarity when the assembly is compared to itself:
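The idea of a self-comparison dotplot can be illustrated with exact k-mer matches. The sketch below is a simple stand-in, not the alignment-based comparison used by the notebook: it lists every pair of distinct positions that share an identical k-mer, which are the off-diagonal points of a dotplot. The function name and the default `k` are hypothetical choices.

```python
from collections import defaultdict


def self_matches(seq, k=12):
    """Return sorted (i, j) pairs, i < j, where the k-mers starting at
    positions i and j of `seq` are identical."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    pairs = []
    for hits in positions.values():
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                pairs.append((hits[a], hits[b]))
    return sorted(pairs)
```

For a contig whose end repeats its start, the pairs form a run of points `(i, L + i)`, where `L` is the length of the true circular sequence. Real assemblies contain small differences between the duplicated copies, so an alignment-based comparison is more robust than exact matching.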

We would like to obtain a single coherent, linearised sequence representing the circular biological molecule. To do this we can break the contig in half and put it back together, removing the duplicated sequence. This is done in the following code cell.
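To illustrate the goal of this step, the sketch below removes a terminal duplication by looking for the longest suffix of the contig that exactly repeats its prefix. This is a deliberately simplified stand-in for the split-and-rejoin procedure described above: it assumes the duplicated copies are identical, which will often not hold for an unpolished assembly. The function name and `min_overlap` default are hypothetical.

```python
def trim_terminal_duplicate(seq, min_overlap=20):
    """Remove the longest suffix of `seq` that exactly repeats its prefix.

    Returns (trimmed_sequence, overlap_length); the overlap length is 0
    when no overlap of at least `min_overlap` bases is found.
    """
    max_overlap = len(seq) // 2
    for size in range(max_overlap, min_overlap - 1, -1):
        if seq[-size:] == seq[:size]:
            return seq[:-size], size
    return seq, 0
```

Searching from the largest overlap downwards means that the whole duplicated region is removed rather than only part of it.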

The result of the above should be a single alignment with no off-diagonal elements. If this is not the case try to amend the assembly parameters. If a single diagonal alignment is not obtainable automatically, it may be possible to examine the alignment co-ordinates and dot plots to manually determine how to edit the sequence.

In this section we have assembled our single-molecule reads into a single contiguous sequence. With a few simple algorithms and manual inspection we have ensured that this contig sequence contains no repetitive sequence due to the originating target molecule being circular. In the next sections we will improve the quality of the contig and analyse it to determine if it contains the peptide of interest.

### Medaka polishing of assembly

With our assembly tidied up to remove duplicate sequence, the base-level quality of the assembly can be improved by using medaka:

### Rotating assembly to reference and identifying target peptide

In the previous section we first assembled our reads to obtain a single contiguous sequence. This sequence was trimmed to remove duplicated sequence under the assumption that the sequenced molecule is circular. We then improved the base-level quality of the trimmed sequence using medaka.

In this section we will now compare the assembly to the supplied reference sequences. In order to make this comparison easier we will first attempt to rotate the assembly so that its start co-ordinate matches that of the reference sequence.
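The rotation itself can be sketched as follows. This illustration locates the first bases of the reference in the assembly by exact matching, trying both strands; it is a simplified stand-in and does not use alignments or a start tolerance. The function names and the `anchor_length` default are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T in either case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def rotate_to_reference(assembly, reference, anchor_length=20):
    """Rotate a circular `assembly` so it starts where `reference` starts.

    The first `anchor_length` bases of the reference are searched for,
    as an exact match, on both strands of the assembly.
    """
    anchor = reference[:anchor_length]
    for candidate in (assembly, reverse_complement(assembly)):
        # Doubling the sequence lets the anchor span the linear ends.
        doubled = candidate + candidate
        start = doubled.find(anchor, 0, len(candidate) + anchor_length - 1)
        if start != -1:
            return candidate[start:] + candidate[:start]
    raise ValueError("anchor not found on either strand of the assembly")
```

If the anchor region of the assembly differs from the reference, for example because it overlaps the inserted sequence, exact matching fails and an alignment-based approach is needed instead.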

When the sequence coordinates have been corrected, the BLAST results between the reference and assembled plasmids will be shown. Please assess the sequence alignments for expected coordinates.

The start tolerance below can be enlarged if the process fails.

As a final sanity check we will align all sequencing reads to our final curated assembly to show the sequencing depth:

The final step of our analysis is to identify the target peptide in our assembly to check that it is present and correct. To do this we will use diamond to align the peptide sequence given at the start of the analysis to the translated assembly sequence:
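The principle of comparing a peptide with a translated DNA sequence can be illustrated without diamond. The sketch below translates the assembly in all six reading frames using the standard genetic code and reports the frames in which the peptide occurs exactly. It is not a replacement for an aligner: it cannot report substitutions or frameshifts, and it does not consider peptides that span the ends of the linearised sequence. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T in either case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def translate(seq):
    """Translate DNA to protein; unknown codons become 'X', stops '*'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_peptide(assembly, peptide):
    """Return (strand, frame) for each reading frame containing `peptide`."""
    hits = []
    for strand, seq in (("+", assembly), ("-", reverse_complement(assembly))):
        for frame in range(3):
            if peptide in translate(seq[frame:]):
                hits.append((strand, frame))
    return hits
```

An empty result from `find_peptide` means only that no exact match exists; the alignment produced by diamond is what shows where and how the sequences differ.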