The tutorial contains a sample D. melanogaster cDNA .fastq
dataset but you can run your own dataset by following the instructions in Using your own data. Please note that if you wish to analyse your own data you must supply an untrimmed cDNA .fastq
, i.e. the direct output of the sequencing device.
This workflow will:
Methods used in this tutorial include:
Pychopper
Computational requirements for this tutorial include:
⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.
The default EPI2MELabs server environment does not have Pychopper preinstalled. To use Pychopper we must therefore first install it, this is most easily done with conda:
!mamba install -q -y pychopper -c epi2melabs -c bioconda -c conda-forge
Pinned packages: - conda 4.9.2 - python 3.8.* - tini 0.18.0
To avoid conflicts with other software, you may wish to restart your EPI2MELabs server when you have finished using Pychopper.
The workflow below requires a single folder containing .fastq
files from an Oxford Nanopore Technologies' sequencing device, or a single such file. Compressed or uncompressed files may be used.
Before anything else we will create and set a working directory:
# create a work directory and move into it
tutorial_name = "pychopper_tutorial"
working_dir = '/epi2melabs/{}'.format(tutorial_name)
!mkdir -p "$working_dir"
%cd "$working_dir"
from epi2melabs import ping
pinger = ping.Pingu()
_ = pinger.send_notebook_ping('start', 'pychopper_tutorial')
To get started we will download a sample sequencing dataset. There are two options available:
The form below will download either dataset and save them as sample_data.fastq
. To start the download click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.
#@markdown Select download type and press Play
from epi2melabs.notebook import InputForm, InputSpec
bucket = "ont-exd-int-s3-euwst1-epi2me-labs"
domain = "s3-eu-west-1.amazonaws.com"
site = "https://{}.{}".format(bucket, domain)
download = "D. melanogaster" #@param ["D. melanogaster", "small set"]
## download the data
def process_form(inputs):
location="{}/pychopper_tutorial".format(site)
filename = None
if inputs.dataset == "D. melanogaster":
filename = "Dmel.4.filt.fastq.gz"
elif inputs.dataset == "small set":
filename = "small_cdna_example.fastq.gz"
!echo "Downloading $filename"
!wget -O sample_data.fastq.gz "$location"/"$filename"
!echo "Extracting"
!gunzip -f sample_data.fastq.gz
!echo "Done"
input_form = InputForm(
InputSpec('dataset', 'Dataset', ["D. melanogaster", "small set"]))
input_form.add_process_button(process_form)
input_form.display()
To view the outcome of the download we can use the tree
command to show the contents of the working directory:
!tree .
. ├── analysis │ ├── cdna_classifier_report.tsv │ ├── full_length_output.fq │ ├── input.fastq -> /epi2melabs/pychopper_tutorial/sample_data.fastq │ ├── report.pdf │ ├── rescued.fq │ └── unclassified.fq └── sample_data.fastq 1 directory, 7 files
The files should also appear in the File Browser to the left-hand side of the screen.
If you wish to analyse your own data rather than the sample data, you can edit the value of the input_file
variable below. To find the correct full path of a file you can navigate to it in the Files
browser to the left-hand side, right-click on the file and select Copy path
:
The location shared with the EPI2ME labs server from your computer will show as /epi2melabs
, for example a file located at /data/my_gridion_run/fastq_pass
on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass
when it is the /data
folder that is shared.
Having downloaded the sample data, or locating your own data in the file browser, we need to provide the filepaths as input to the notebook. This is done in the form below. It is also possible to select which primer kit has been used, the default is PCS110 but it is also possible to choose PCS109 or PCS111 using the drop down.
If you want simply to plot all the graphs in this tutorial for your dataset, rather than working through the tutorial, select
Run selected Cell and All Below
from theRun
menu above after executing the cell below.
import os
import ipywidgets as widgets
input_file = None
output_folder = None
def process_form(inputs):
global input_file
global output_folder
global primer_kit
# run a command to concatenate all the files together
!echo "Making output folder"
output_folder = inputs.output_folder
!mkdir -p "$output_folder"
input_data = inputs.input_data
!test -e "$input_data" && echo "Found input file." || "WARNING: $input_data does not exist"
input_file = os.path.join(output_folder, "input.fastq")
!rationalize_fastq -i "$input_data" -o "$input_file"
primer_kit = inputs.primer_kit
pychopper_form = InputForm(
InputSpec('input_data', 'Input fastq', '/epi2melabs/pychopper_tutorial/sample_data.fastq'),
InputSpec('primer_kit', 'Primer Kit', widgets.Dropdown(
options=['PCS110','PCS109', 'PCS111'])),
InputSpec('output_folder', 'Output folder', 'analysis'))
pychopper_form.add_process_button(process_form)
pychopper_form.display()
Pychopper consists of a single program cdna_classifier.py
to identify, orient and trim full-length Nanopore cDNA reads.
To run pychopper in a basic mode we need to give simply an input and output .fastq
. Pychopper first identifies alignment hits of sequencing primers across the length of the sequence reads. The default method for doing this is using nhmmscan with pre-trained strand specific profile HMMs. Alternatively, one can use the edlib backend, which uses a combination of global and local alignment to identify the primers within the read.
After identifying the primer hits, the reads are divided into segments defined by two consecutive primer hits. Segments are given a score based on their length, provided that the flanking primer hits are valid. When the primer hits are invalid a zero score is assigned to the segment.
The segments are assigned to reads using a dynamic programming algorithm maximizing the sum of used segment scores.
To run cdna_classifier.py
on the input file specified above execute the cell below. For the D. Melanogaster sample dataset this will take around 10 minutes.
!cdna_classifier.py \
-r "$output_folder"/report.pdf \
-u "$output_folder"/unclassified.fq \
-w "$output_folder"/rescued.fq \
-S "$output_folder"/cdna_classifier_report.tsv \
-k "$primer_kit" \
"$input_file" \
"$output_folder"/full_length_output.fq
To evaluate the results of pychopper the code box below will summarise the classification of reads. It will also display a plot illustrating the selection of the classification decision boundary. This plot should be unimodal (have a single peak).
# Classification summary *(click to show)*
import os
import pandas as pd
import aplanat
from aplanat import lines
pinger.send_notebook_ping('stop', 'pychopper_tutorial')
csv = os.path.join(output_folder, "cdna_classifier_report.tsv")
df = pd.read_csv(csv, sep="\t", index_col="Name")
classified = df.loc[df["Category"] == "Classification"] \
.copy().reset_index().rename(columns={'Name': 'Classification'})
classified["Percentage"] = 100 * classified["Value"] / classified["Value"].sum()
display(classified[['Classification', 'Percentage',]])
tuning = df.loc[df["Category"] == "AutotuneSample"] \
.copy().reset_index().rename(columns={'Name': 'Filter'})
plot = lines.line(
[tuning['Filter'].astype(float)], [tuning['Value']],
title="Cut off parameter selection curve",
x_axis_label='Filter cut off',
y_axis_label='Selected reads')
aplanat.show(plot, background="#f4f4f4")
Classification | Percentage | |
---|---|---|
0 | Primers_found | 74.364532 |
1 | Rescue | 1.931034 |
2 | Unusable | 23.704433 |