Introduction to Pychopper

Pychopper is a tool to identify, orient and trim full-length Nanopore cDNA reads. It acts as a QC and filtering step in more complicated cDNA workflows.

The tutorial contains a sample D. melanogaster cDNA .fastq dataset, but you can analyse your own dataset by following the instructions in Using your own data. Please note that if you wish to analyse your own data, you must supply an untrimmed cDNA .fastq, i.e. the direct output of the sequencing device.

This workflow will:

Methods used in this tutorial include:

Computational requirements for this tutorial include:

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select Edit > Clear all outputs before using the notebook to analyse your own data.

Installing Pychopper

The default EPI2MELabs server environment does not have Pychopper preinstalled, so we must first install it. This is most easily done with conda:
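The install cell itself is not reproduced in this text. As a minimal sketch, the command could be assembled and run from Python as shown here. The channel names (bioconda, conda-forge) and the helper names conda_install_command and install are assumptions for illustration, so check the Pychopper documentation for the currently recommended channels.

```python
import subprocess


def conda_install_command(package, channels=("bioconda", "conda-forge")):
    """Build a `conda install` command line.

    The default channel list is an assumption, not taken from the tutorial.
    """
    cmd = ["conda", "install", "--yes"]
    for channel in channels:
        cmd += ["-c", channel]
    cmd.append(package)
    return cmd


def install(package):
    """Run the install; raises CalledProcessError if conda reports a failure."""
    subprocess.run(conda_install_command(package), check=True)


# Show the command without running it.
print(" ".join(conda_install_command("pychopper")))
# To install for real, call: install("pychopper")
```

Building the command as a list rather than a single string avoids shell quoting problems and makes it easy to inspect before anything is executed.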

To avoid conflicts with other software, you may wish to restart your EPI2MELabs server when you have finished using Pychopper.

Data preparation

The workflow below requires a single folder containing .fastq files from an Oxford Nanopore Technologies' sequencing device, or a single such file. Compressed or uncompressed files may be used.
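The input requirement above can be checked before running anything. The sketch below accepts either a single file or a folder and returns the .fastq files it finds, compressed or not. The helper name find_fastq_files and the accepted suffixes (including the .fq variants) are assumptions for illustration.

```python
import os

# Accepted file endings; the .fq variants are an assumption.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def find_fastq_files(path):
    """Return the sorted .fastq files at `path` (a single file or a folder)."""
    if os.path.isfile(path):
        candidates = [path]
    elif os.path.isdir(path):
        candidates = [os.path.join(path, name) for name in sorted(os.listdir(path))]
    else:
        raise FileNotFoundError(f"No such file or folder: {path}")

    found = [
        p for p in candidates
        if os.path.isfile(p) and p.lower().endswith(FASTQ_SUFFIXES)
    ]
    if not found:
        raise ValueError(f"No .fastq files found at: {path}")
    return found
```

Only the top level of a folder is searched, matching the "single folder" wording above; subfolders are ignored.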

Before anything else we will create and set a working directory:
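The cell that does this is not reproduced here. A minimal sketch of creating and entering a working directory from Python follows; the helper name set_working_directory and the folder name in the usage comment are hypothetical.

```python
import os


def set_working_directory(path):
    """Create `path` if needed, change into it, and return its absolute path."""
    path = os.path.abspath(os.path.expanduser(path))
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    os.chdir(path)
    return path


# Example usage (hypothetical folder name):
# working_dir = set_working_directory("pychopper_tutorial")
```

Returning the absolute path makes it easy to build later file paths from a single known location.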

Sample Data

To get started we will download a sample sequencing dataset. There are two options available:

The form below will download the selected dataset and save it as sample_data.fastq. To start the download, click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

To view the outcome of the download we can use the tree command to show the contents of the working directory:
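Where the tree command is not available, a similar listing can be produced in pure Python. This is only a sketch of an equivalent, not the cell used in the notebook, and the helper name tree_lines is hypothetical.

```python
import os


def tree_lines(root, indent="    "):
    """Return an indented listing of `root`, one line per file or folder."""
    root = os.path.abspath(root)
    lines = [os.path.basename(root) or root]
    for current, dirs, files in os.walk(root):
        dirs.sort()  # sorting in place makes os.walk descend in a stable order
        if current == root:
            depth = 0
        else:
            # Depth of this folder below root (1 for direct children).
            depth = os.path.relpath(current, root).count(os.sep) + 1
            lines.append(indent * depth + os.path.basename(current) + "/")
        for name in sorted(files):
            lines.append(indent * (depth + 1) + name)
    return lines


# Example usage: print("\n".join(tree_lines(".")))
```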

The files should also appear in the File Browser to the left-hand side of the screen.

Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the value of the input_file variable below. To find the correct full path of a file, navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path.

The location shared with the EPI2MELabs server from your computer will show as /epi2melabs. For example, when the /data folder is shared, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass.
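The path translation described above can be sketched as a small function. The function name to_server_path and its parameter names are hypothetical; the default values are the /data and /epi2melabs locations from the example.

```python
import posixpath


def to_server_path(host_path, shared_root="/data", mount_point="/epi2melabs"):
    """Translate a path under the shared folder into the path seen by the server."""
    host_path = posixpath.normpath(host_path)
    shared_root = posixpath.normpath(shared_root)
    prefix = shared_root.rstrip("/") + "/"
    if host_path != shared_root and not host_path.startswith(prefix):
        raise ValueError(f"{host_path} is not inside the shared folder {shared_root}")
    relative = posixpath.relpath(host_path, shared_root)
    if relative == ".":
        return mount_point
    return posixpath.join(mount_point, relative)


print(to_server_path("/data/my_gridion_run/fastq_pass"))
# prints /epi2melabs/my_gridion_run/fastq_pass
```

Paths outside the shared folder raise an error rather than being silently passed through, since the server cannot see them.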

Data entry

Having downloaded the sample data, or located your own data in the file browser, we need to provide the file paths as input to the notebook. This is done in the form below.

If you simply want to plot all the graphs in this tutorial for your dataset, rather than working through the tutorial, select Run selected Cell and All Below from the Run menu above after executing the cell below.

Running pychopper

Pychopper consists of a single program, cdna_classifier.py, which identifies, orients and trims full-length Nanopore cDNA reads.

In its basic mode, pychopper needs only an input and an output .fastq file. Pychopper first identifies alignment hits of sequencing primers across the length of the sequence reads. The default method for doing this uses nhmmscan with pre-trained strand-specific profile HMMs. Alternatively, one can use the edlib backend, which uses a combination of global and local alignment to identify the primers within the read.
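As a sketch of what such an invocation might look like, the helper below builds the command line for a basic run, optionally selecting the backend. The -m (method: phmm or edlib) and -t (threads) options are given as I understand them from the tool's help text; confirm with cdna_classifier.py -h for your installed version. The helper name pychopper_command and the output file name are hypothetical.

```python
import shlex


def pychopper_command(input_fastq, output_fastq, method=None, threads=None):
    """Build a cdna_classifier.py command line (option names are assumptions)."""
    cmd = ["cdna_classifier.py"]
    if method is not None:
        if method not in ("phmm", "edlib"):
            raise ValueError("method must be 'phmm' or 'edlib'")
        cmd += ["-m", method]
    if threads is not None:
        cmd += ["-t", str(threads)]
    # The input and output files are the two positional arguments.
    cmd += [input_fastq, output_fastq]
    return cmd


# Basic mode: only an input and an output file.
print(shlex.join(pychopper_command("sample_data.fastq", "full_length.fastq")))
# Using the edlib backend instead of the default.
print(shlex.join(pychopper_command("sample_data.fastq", "full_length.fastq", method="edlib")))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once Pychopper is installed.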

After identifying the primer hits, the reads are divided into segments defined by two consecutive primer hits. Segments are given a score based on their length, provided that the flanking primer hits are valid. When the primer hits are invalid a zero score is assigned to the segment.
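The segment scoring described above can be illustrated with a small sketch. It is not Pychopper's own code: the Hit and Segment types, the primer names, and the set of valid flanking pairs (SSP followed by -VNP, or VNP followed by -SSP) are assumptions made for the example, and the score is taken to be simply the segment length.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "name start end")          # one primer alignment on a read
Segment = namedtuple("Segment", "start end score")  # stretch between two hits

# Assumed valid (left primer, right primer) configurations.
VALID_PAIRS = {("SSP", "-VNP"), ("VNP", "-SSP")}


def score_segments(hits, valid_pairs=VALID_PAIRS):
    """Split a read into segments between consecutive hits and score each one."""
    ordered = sorted(hits, key=lambda h: h.start)
    segments = []
    for left, right in zip(ordered, ordered[1:]):
        start, end = left.end, right.start
        length = max(0, end - start)
        # Length-based score if the flanking primers are valid, zero otherwise.
        score = length if (left.name, right.name) in valid_pairs else 0
        segments.append(Segment(start, end, score))
    return segments


hits = [Hit("SSP", 0, 20), Hit("-VNP", 500, 520), Hit("VNP", 530, 550), Hit("-SSP", 900, 920)]
for segment in score_segments(hits):
    print(segment)
```

In this made-up read, the middle segment is flanked by -VNP and VNP, which is not a valid configuration here, so it scores zero.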

The segments are assigned to reads using a dynamic programming algorithm maximizing the sum of used segment scores.
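One way to picture this step is as a selection problem over the ordered segment scores. The sketch below assumes that two adjacent segments cannot both be used because they share a primer hit; that constraint is an assumption of this example and is not stated in the text. The function name select_segments is hypothetical.

```python
def select_segments(scores):
    """Pick non-adjacent segments with the largest total score.

    Returns (best_total, indices_of_chosen_segments).
    """
    n = len(scores)
    # best[i] is the best total achievable using only the first i segments.
    best = [0] * (n + 1)
    for i in range(1, n + 1):
        skip = best[i - 1]
        take = scores[i - 1] + (best[i - 2] if i >= 2 else 0)
        best[i] = max(skip, take)

    # Trace back through the table to recover which segments were used.
    chosen = []
    i = n
    while i >= 1:
        take = scores[i - 1] + (best[i - 2] if i >= 2 else 0)
        if take > best[i - 1]:
            chosen.append(i - 1)
            i -= 2  # the neighbouring segment cannot also be used
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen


print(select_segments([480, 0, 350]))
# prints (830, [0, 2])
```

Because a segment is only taken when it strictly improves the total, zero-scoring segments are never selected.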

To run cdna_classifier.py on the input file specified above, execute the cell below. For the D. melanogaster sample dataset this will take around 10 minutes.

Analysis of pychopper results

To evaluate the results of pychopper the code box below will summarise the classification of reads. It will also display a plot illustrating the selection of the classification decision boundary. This plot should be unimodal (have a single peak).
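The summary cell is not reproduced here. As a sketch of how read classifications might be tabulated, the function below parses a tab-separated statistics table and reports each class as a count and a percentage. The column names (Category, Name, Value), the Classification category and the class names in the example are assumptions about the statistics output, and the numbers are made up purely to show the calculation.

```python
import csv
import io


def classification_summary(tsv_text):
    """Return {class name: (count, percent)} from the Classification rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {}
    for row in reader:
        if row["Category"] == "Classification":
            counts[row["Name"]] = counts.get(row["Name"], 0) + int(float(row["Value"]))
    total = sum(counts.values())
    return {
        name: (count, 100.0 * count / total if total else 0.0)
        for name, count in counts.items()
    }


# Made-up numbers for illustration only, not real results.
example = (
    "Category\tName\tValue\n"
    "Classification\tPrimers_found\t80\n"
    "Classification\tRescue\t5\n"
    "Classification\tUnusable\t15\n"
)
for name, (count, percent) in classification_summary(example).items():
    print(f"{name}: {count} reads ({percent:.1f}%)")
```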