The isoform tutorial is intended as a simple guide for assembling transcript data and comparing the data against a known annotation.
The tutorial is provided with a small sample Drosophila melanogaster transcriptome dataset and can be used to address questions such as which transcripts are present in a sample and whether any novel isoforms can be discovered.
Methods used in this tutorial include:

- minimap2: generating a reference index and aligning reads against it in a splice-aware manner
- samtools: processing aligned reads
- gffcompare: comparing and manipulating .gff files
- pychopper: trimming and orienting cDNA reads
- StringTie: assembling transcriptome sequence reads

The computational requirements for this tutorial include:
⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select Edit > Clear all outputs before using the notebook to analyse your own data.
This tutorial aims to demonstrate how users can process the sequencing data from a cDNA or direct-RNA experiment to produce annotated transcripts and discover novel transcripts.
By the end of the tutorial the user will be able to:
To start analysing our experiment we must first collate our data. The workflow below expects to be given a single .fastq file.
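If your sequencing run produced multiple .fastq files (for example one file per basecalling batch), these can first be concatenated into a single input. A minimal sketch, assuming a hypothetical fastq_pass/ directory of gzipped files (gzip streams can be concatenated directly):

# combine multiple gzipped fastq files into a single input file
# (paths here are hypothetical; adjust them to your own run)
!cat fastq_pass/*.fastq.gz > sample_data.fastq.gz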
Before anything else we will create and set a working directory:
# create a work directory and move into it
from epi2melabs import ping

# notify EPI2ME Labs that the notebook was started (usage statistics)
tutorial_name = "isoform_tutorial"
pinger = ping.Pingu()
pinger.send_notebook_ping('start', tutorial_name)
working_dir = "/epi2melabs/{}/".format(tutorial_name)
!mkdir -p "$working_dir"
%cd "$working_dir"
This tutorial uses a couple of software packages that are not included in the default EPI2ME Labs server. Below we will install these tools using the conda
package manager.
Please note that the software installed is not persistent and this step will need to be re-run if you stop and restart the EPI2ME Labs server.
!mamba install -q -y stringtie gffcompare gffread
!mamba install -q -y pychopper -c epi2melabs -c bioconda -c conda-forge
To get started we will download a modest D. melanogaster dataset to explore with the isoform tutorial. The workflow also requires a reference sequence and an annotation set.
The form below will download the sample data, reference sequence and annotations. To start the download click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.
# Download sample data (click play)
bucket = "ont-exd-int-s3-euwst1-epi2me-labs"
domain = "s3-eu-west-1.amazonaws.com"
site = "https://{}.{}".format(bucket, domain)
location='{}/pychopper_tutorial'.format(site)
filename = "Dmel.4.filt.fastq.gz"
!echo "Downloading sample data"
!wget -q -O sample_data.fastq.gz "$location"/"$filename" \
&& cecho success "✔ Downloaded" || cecho error "Failed"
# download reference and annotations
!echo "Downloading reference sequence:"
filename = "Drosophila_melanogaster.BDGP6.dna.toplevel.fa.gz"
!wget -q -O $filename "$location"/"$filename" \
&& cecho success "✔ Downloaded" || cecho error "Failed"
!echo "Downloading annotation file"
filename = "Drosophila_melanogaster.BDGP6.95.gtf.gz"
!wget -q -O $filename "$location"/"$filename" \
&& cecho success "✔ Downloaded" || cecho error "Failed"
If you wish to analyse your own data rather than the sample data, you can edit the value of the Fastq dataset field in the form below. To find the correct full path of a file you can navigate to it in the Files browser on the left-hand side, right-click on the file and select Copy path.

The location shared with the EPI2ME Labs server from your computer will show as /epi2melabs; for example, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when the /data folder is shared.
Having downloaded the sample data, or located your own data in the file browser, we need to provide the filepaths as input to the notebook. This is done in the form below; after updating the values, be sure to press the play button.

The input files may be gzip compressed (ending in the .gz suffix). Along with the input files, an option is provided to rebuild the alignment index (rebuild_index) and to specify the number of compute threads (threads) to use during computations.

An annotation is required only for the final comparison; it is not integral to the workflow.

Take care in selecting the read_type. The pychopper tool (used for orienting and identifying full-length transcripts) is run only on cDNA reads; direct RNA reads do not have the adapters required for this step. The sample data provided with this tutorial was generated from cDNA.
import errno
import multiprocessing
import os

from epi2melabs.notebook import InputForm, InputSpec
import ipywidgets as widgets
read_type = "cDNA" #@param ["cDNA", "dRNA"]
fastq = "sample_data.fastq.gz" #@param {type: "string"}
reference = "Drosophila_melanogaster.BDGP6.dna.toplevel.fa.gz" #@param {type: "string"}
annotations = "Drosophila_melanogaster.BDGP6.95.gtf.gz" #@param {type: "string"}
analysis_folder = "analysis" #@param {type: "string"}
threads = 4 #@param {type: "integer"}
# @markdown Force rebuild any minimap2 index:
rebuild_index = False #@param {type: "boolean"}
def ensure_path(path):
    """Raise FileNotFoundError if path doesn't exist."""
    if not os.path.exists(path):
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)
    return True
def return_filepath(path):
    """Return the path to the extracted file, extracting it if required.

    If the given path has a .gz extension and exists, the file is
    extracted and the extracted path is returned. If the .gz file does
    not exist but the extracted version does, the extracted path is
    returned. If neither the file nor its extracted version exists a
    FileNotFoundError is raised.
    """
    if path.endswith(".gz"):
        extracted = os.path.splitext(path)[0]
        if os.path.exists(path):
            !echo "* Unzipping $path" && gunzip -f $path
        else:
            print("* Extracted file found: {}".format(extracted))
        path = extracted
    ensure_path(path)
    return path
def process_form(inputs):
global read_type, run_pychopper, fastq, reference, annotations, analysis_folder, threads, processed_reads_dir, input_dict, primer_kit
analysis_folder = inputs.analysis_folder
!mkdir -p $analysis_folder
read_type = inputs.read_type
run_pychopper = read_type == "cDNA"
fastq = inputs.fastq
reference = inputs.reference
annotations = inputs.annotations
threads = inputs.threads
primer_kit = inputs.primer_kit
to_normalise = zip(
('fastq', 'reference', 'annotations'),
(os.path.basename(x) for x in (fastq, reference, annotations)))
print("Checking files:")
input_dict = dict()
for file_type, path in to_normalise:
print("Checking {}: {}".format(file_type, path))
if file_type == "annotations" and not path:
print("* SKIPPING: No annotation file given")
continue
path = return_filepath(path)
input_dict[file_type] = path
print("Creating processed fastq folder")
processed_reads_dir = "processed_reads"
!mkdir -p $processed_reads_dir
print("Linking input_fastq to processed reads dir.")
input_fastq = input_dict["fastq"]
!ln -s `realpath $input_fastq` $processed_reads_dir/input_reads.fq 2> /dev/null
inputs = InputForm(
InputSpec("read_type", 'Type of read', ["cDNA", "dRNA"]),
InputSpec("fastq", "Fastq dataset", fastq),
InputSpec("reference", "Reference fasta", reference),
InputSpec('primer_kit', 'Primer Kit', widgets.Dropdown(options=['PCS110','PCS109'])),
InputSpec("annotations", "Annotation gtf", annotations),
InputSpec("analysis_folder", "Analysis folder name", analysis_folder),
InputSpec("threads", "Number of threads", (1, multiprocessing.cpu_count(), 1)),
)
inputs.add_process_button(process_form)
inputs.display()
Before we begin the analysis it is important to take an overview of the data with which we will be working.
For more detail please see the Introduction to fastq EPI2ME Labs tutorial.
The code-block below will compute a simple summary of the input single-molecule reads and display graphs depicting read and base qualities and read lengths. Read qualities are expected to be >10; read lengths will depend on your experiment but should broadly correspond to the expected transcript length distribution.
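Note that the "mean Q score" computed below is not the simple arithmetic mean of the per-base quality values: each quality is first converted back to an error probability, the probabilities are averaged, and the mean is converted back to the Phred scale, i.e. $\bar{Q} = -10\log_{10}\left(\frac{1}{n}\sum_{i=1}^{n} 10^{-q_i/10}\right)$. This matches the calculation performed in the code and avoids over-estimating the quality of reads containing a few very poor bases.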
# Basic read QC graphs *(click play)*
import aplanat
from aplanat.hist import histogram
from bokeh.layouts import gridplot
import numpy as np
from pysam import FastxFile
qualities = list()
mean_qualities = list()
lengths = list()
input_reads = os.path.join(processed_reads_dir, "input_reads.fq")
# open the file and iterate through its records
with FastxFile(input_reads) as fq:
for rec in fq:
# ONT calculation for "mean Q score"
quals = np.fromiter(
(ord(x) - 33 for x in rec.quality),
dtype=int, count=len(rec.quality))
mean_p = np.mean(np.power(10, quals/-10))
mean_qualities.append(-10*np.log10(mean_p))
# all qualities
qualities.extend(quals)
lengths.append(len(quals))
p1 = histogram(
[np.array(mean_qualities)], title="Read quality scores",
x_axis_label="quality", y_axis_label="count",
height=250, width=300)
p2 = histogram(
[qualities], title="Base quality scores",
x_axis_label="quality", y_axis_label="count",
height=250, width=300)
p3 = histogram(
[lengths], title="Read lengths",
x_axis_label="read length / bases", y_axis_label="count",
height=250, width=300)
aplanat.show(gridplot((p1, p2, p3), ncols=3), background="#f4f4f4")
If we are dealing with cDNA reads (read_type specified above) we can classify, trim and orient the sequencing data. To do this we will use the cdna_classifier.py program from pychopper. This is a standard step in long-read cDNA pipelines and makes subsequent steps easier.

If you performed a direct RNA sequencing experiment please nevertheless run the code-block below, as it will set up the necessary inputs for subsequent steps.
In order to check the validity of the pychopper results, a plot will be displayed illustrating the selection of the classification decision boundary. This plot should be unimodal (have a single peak).
# Running pychopper *(click play)*
import os
import pandas as pd
import aplanat
from aplanat import lines
full_length_reads = os.path.join(processed_reads_dir, "full_length_reads.fq")
# Run pychopper
if run_pychopper:
!cd $processed_reads_dir && \
cdna_classifier.py \
-t $threads \
-S "cdna_classifier_report.tsv" \
-k "$primer_kit" \
"input_reads.fq" \
"full_length_reads.fq"
csv = os.path.join(processed_reads_dir, "cdna_classifier_report.tsv")
df = pd.read_csv(csv, sep="\t", index_col="Name")
classified = df.loc[df["Category"] == "Classification"] \
.copy().reset_index().rename(columns={'Name': 'Classification'})
classified["Percentage"] = 100 * classified["Value"] / classified["Value"].sum()
display(classified[['Classification', 'Percentage',]])
tuning = df.loc[df["Category"] == "AutotuneSample"] \
.copy().reset_index().rename(columns={'Name': 'Filter'})
plot = lines.line(
[tuning['Filter'].astype(float)], [tuning['Value']],
title="Cut off parameter selection curve",
x_axis_label='Filter cut off',
y_axis_label='Selected reads')
aplanat.show(plot, background="#f4f4f4")
else:
!ln -s `realpath $processed_reads_dir/input_reads.fq` $full_length_reads 2> /dev/null
print("Skipping: Running pychopper")
With our reads QCed, we proceed by aligning the data to the provided reference. We will use minimap2 in its splice-aware mode to perform this task; this allows read alignments to be split across introns and to span exons. Once minimap2 has finished we process the alignment (.bam) file: filtering, sorting and generating an index (.bai). Filtering removes secondary and supplementary alignments and alignments below the minimum mapping quality threshold (specified below; default Q40).
A note on alignment filtering: the alignment file is filtered such that secondary and supplementary alignments are removed. The correct placement of a read may be ambiguous, for example from the presence of repeats in the genome. In this case there may be multiple read alignments for the same read. One of these alignments is considered primary. All the other alignments have the secondary alignment flag set. Some alignments of reads cannot be represented as a simple linear alignment. These chimeric alignments are represented as a set of linear alignments that do not have large overlaps. Typically one of the linear alignments in a chimeric alignment is considered the “representative” alignment, and the others are called “supplementary”.
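For reference, this filtering is achieved below with samtools view -F2304: the value 2304 is simply the sum of the two SAM flag bits describing these alignment classes.

# SAM flag bits excluded by `samtools view -F2304`
SECONDARY = 0x100      # 256: secondary alignment
SUPPLEMENTARY = 0x800  # 2048: supplementary (chimeric) alignment
print(SECONDARY + SUPPLEMENTARY)  # 2304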
Sensible defaults have been selected for these advanced minimap2 parameters but feel free to change these for your specific experiment. A secondary filtering step is performed based on reasonable expectations of transcript data:
- min_mapping_quality: reads below this mapping quality will be discarded; the higher this parameter, the more reads will be discarded.
- max_poly_run: maximum allowed poly(A) run in the genome near the 3' end of a mapping.
- poly_context: internal priming filter context size.

This section will generate the following files in the alignments/ directory:

- reads_aln_sorted.bam: alignment of the processed reads against the chosen reference.
- reads_aln_sorted.bam.bai: index file for the alignment.
- reads_aln_sorted.bam.tsv: tsv representation of the alignment including accuracy, identity and coverage.

# Alignment and filtering with `minimap2` *(click play)*
import yaml
min_mapping_quality = 40
# Secondary filtering
poly_context = 24
max_poly_run = 8
context_plot_alignments = 5000
alignments_dir = os.path.join(analysis_folder, 'alignments')
# create minimap index
index_dir = os.path.join(analysis_folder, 'index')
!mkdir -p $index_dir
ref = input_dict['reference']
minimap_index = os.path.join(index_dir, "genome_index.mmi")
if rebuild_index or not os.path.exists(minimap_index):
print("* Building minimap index...")
!minimap2 -t $threads -k14 -I 1000G -d $minimap_index $ref \
&& echo "✓ Built index!"
else:
print("* Skipping building minimap index")
!mkdir -p $alignments_dir
minimap_out_bam = os.path.join(alignments_dir, "reads_aln_sorted.bam")
unfiltered_sam = os.path.join(alignments_dir, "unfiltered.sam")
filtered_tsv = os.path.join(alignments_dir, "internal_priming_fail.tsv")
context_filt = {
"AlnContext": {
"Ref": ref,
"LeftShift": -poly_context,
"RightShift": poly_context,
"RegexEnd": "[Aa]{{{pr},}}".format(pr=max_poly_run),
"Stranded": True,
"Invert": True,
"Tsv": filtered_tsv}}
with open('filt.yaml', 'w') as outfile:
yaml.dump(context_filt, outfile)
!rm -rf $minimap_out_bam
!minimap2 -t $threads -ax splice -uf $minimap_index $full_length_reads \
| samtools view -q $min_mapping_quality -F2304 -bS - \
| seqkit_bam_yaml filt.yaml \
| samtools sort -@ $threads -o $minimap_out_bam -
!samtools index $minimap_out_bam
As a brief sanity check of the alignments, we can summarise the information
contained within the alignment .bam
file and check some basic statistics. The code below will plot a read accuracy histogram together with a histogram depicting the proportion of each read contained within its alignment. This second plot should show a strong peak >98% indicating that the reads align to the reference across their full length.
# Read summary plot code *(click play)*
# calculate some basic statistics from the alignments
!stats_from_bam -o $minimap_out_bam'.tsv' $minimap_out_bam
from aplanat import hist
from bokeh.layouts import gridplot
df = pd.read_csv(minimap_out_bam + '.tsv', sep="\t")
p1 = hist.histogram(
[df['acc']], xlim=(70,101), binwidth=0.5, height=200,
x_axis_label='read accuracy', y_axis_label='read count')
p2 = hist.histogram(
[df['coverage']], xlim=(90,101), binwidth=0.2, height=200,
x_axis_label='% of read contained in alignment',
y_axis_label='read count')
aplanat.show(gridplot([[p1], [p2]]), background="#f4f4f4")
Mapped/Unmapped/Short/Masked/Skipped(all matches masked): 11167/0/0/0/0
We will now run StringTie to assemble the aligned RNA-seq reads into potential transcripts. This step will produce one or more GFF files, which are then consolidated into the single output file results/annotation/stringtie.gff.

The split_bam option here will aid computational performance but may lead to less accurate results than otherwise.
# Assemble transcripts with StringTie *(click play)*
import glob
bam_bundle_dir = "bam_bundles"
min_reads = 50000 #@param {type: "integer"}
split_bam = True #@param {type: "boolean"}
# make bam bundles to speed calculation
!rm -rf $bam_bundle_dir
if split_bam:
!seqkit bam -j $threads -N $min_reads $minimap_out_bam -o $bam_bundle_dir
else:
!mkdir -p $bam_bundle_dir
!ln -s $minimap_out_bam $bam_bundle_dir'/000000000_ALL:0:1_bundle.bam'
bam_paths = os.path.join(bam_bundle_dir, "*")
gff_bundle_dir = "gff_bundles"
for bam_path in glob.glob(bam_paths):
bam_name = os.path.splitext(os.path.basename(bam_path))[0]
label = "STR.{}.".format(int(bam_name.split("_")[0]))
gff = "{}/{}.gff".format(gff_bundle_dir, bam_name)
g_flag = ""
if input_dict["annotations"]:
g_flag = '-G {}'.format(input_dict["annotations"])
!stringtie --rf $g_flag -l "$label" -L -v -p "$threads" --conservative -o "$gff" "$bam_path"
# merge gff bundles into a single file:
# `results/annotation/stringtie.gff`
annotation_dir = os.path.join("results", "annotation")
!mkdir -p $annotation_dir
stringtie_gff = os.path.join(annotation_dir, "stringtie.gff")
!rm $stringtie_gff 2>/dev/null
paths = glob.glob(os.path.join(bam_bundle_dir, "*.bam"))
bundle = [os.path.splitext(os.path.basename(path))[0] for path in paths]
bundle = sorted(bundle, key=lambda x: int(x.split("_")[0]))
gffs = [os.path.join(gff_bundle_dir, "{}.gff".format(name)) for name in bundle]
for gff in gffs:
!grep -v '#' "$gff" >> $stringtie_gff
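As an optional sanity check (not part of the workflow proper), the number of transcripts assembled by StringTie can be counted from the merged file:

# count 'transcript' features in the merged StringTie output
!awk '$3 == "transcript"' $stringtie_gff | wc -l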
The transcriptome generated by the above workflow can be compared to an existing genome annotation. One result from this analysis can be the discovery of novel transcripts not present in the reference databases.
To create a comparison between the reference annotations and the transcriptome produced by the above workflow we can use gffcompare. This will only run if an annotation file was supplied at the beginning of this notebook.

This step will produce the following files in the results/gffcompare directory:
- stringtie.annotated.gtf: a GTF file containing the transcripts assembled by StringTie, annotated with the reference annotation information.
- stringtie.tracking: a tabular summary file containing the annotation status of the transcripts.

# Annotating assembled transcripts *(click play)*
annotations = input_dict.get("annotations")
if annotations:
    annotations = os.path.abspath(annotations)
gff_compare_dir = os.path.join("results", "gffcompare")
!mkdir -p "$gff_compare_dir"
annotated_gff = os.path.join(gff_compare_dir, "stringtie.annotated.gtf")
if annotations:
!gffcompare -o "$gff_compare_dir"/stringtie -r "$annotations" -R "$stringtie_gff" \
&& cecho success "✓ Completed gffcompare" \
|| cecho error "✗ Failed gffcompare"
else:
!cecho warning "Skipping comparison with annotation as none provided."
34767 reference transcripts loaded.
3271 duplicate reference transcripts discarded.
121 query transfrags loaded.
✓ Completed gffcompare
Now that gffcompare has been run we can investigate the .tracking file it produced. This details the comparison between the reference annotation and the new annotation produced by the analysis above. A reference guide to the transcript classifications can be found in the gffcompare documentation, though the table below highlights the key details. In general, novel transcripts will be marked with the class code u, for unknown.

Using the .tracking file we can summarise the classifications found. A TSV can be created in the results/gffcompare/ directory for each possible classification code; by default (write_empty_tsvs = False) codes with no transcripts are skipped.
# Classification summary *(click play)*
write_empty_tsvs = False
tracking_headings = [
"query_transfrag_id", "query_locus_id", "ref_gene_id",
"class", "details"]
nice_names = {
'=': 'complete', 'c': 'contained', 'k': 'containment',
'm': 'retained', 'n': 'retained (partial)', 'j': 'multi',
'e': 'single', 'o': 'overlap', 's': 'opposite',
'x': 'exonic', 'i': 'intron', 'y': 'contains', 'p': 'runon',
'r': 'repeat', 'u': 'unknown'}
tracking_file = os.path.join(
gff_compare_dir, "stringtie.tracking")
if annotations and os.path.exists(annotations):
tracking = pd.read_csv(
tracking_file, sep="\t", names=tracking_headings[1:],
index_col=0)
d = pd.DataFrame(tracking['class'].value_counts()) \
.reset_index().rename(columns={'index':'class', 'class':'count'})
d['description'] = [nice_names[x] for x in d['class']]
display(d)
# write a separate table for each classification code
for class_code in nice_names:
    table = tracking.loc[tracking['class'] == class_code]
    if not write_empty_tsvs and table.empty:
        print("Skipping: No transcripts found for: {}".format(class_code))
        continue
    path = tracking_file + ".{}.tsv".format(class_code)
    table.to_csv(path)
else:
!cecho warning "Skipping classification summary as no annotation provided."
| | class | count | description |
|---|---|---|---|
| 0 | = | 51 | complete |
| 1 | c | 29 | contained |
| 2 | j | 25 | multi |
| 3 | s | 3 | opposite |
| 4 | e | 3 | single |
| 5 | o | 3 | overlap |
| 6 | u | 2 | unknown |
| 7 | n | 2 | retained (partial) |
| 8 | p | 1 | runon |
| 9 | m | 1 | retained |
| 10 | i | 1 | intron |
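To look more closely at the putative novel transcripts, the per-class TSV written by the cell above for class code u can be read back in. A minimal sketch:

# load the table of 'unknown' (novel) transcripts, if any were found
novel_tsv = tracking_file + ".u.tsv"
if os.path.exists(novel_tsv):
    display(pd.read_csv(novel_tsv, index_col=0))
else:
    print("No novel (class 'u') transcripts were found.")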
We now have our finalised transcripts, either straight from the StringTie assembler or annotated via gffcompare. To create a transcriptome from this output we can use gffread to combine it with the genomic reference sequence.
# Create transcriptome *(click play)*
stringtie_transcriptome = os.path.join("results", "stringtie_transcriptome.fas")
ref = input_dict['reference']
if os.path.exists(annotated_gff):
!cecho ok "Creating transcriptome with annotations"
!gffread -F -g "$ref" -w $stringtie_transcriptome $annotated_gff
!echo "Done"
else:
!cecho warning "Creating transcriptome without annotations"
!gffread -g "$ref" -w "$stringtie_transcriptome" "$stringtie_gff"
!echo "Done"
In this tutorial we have worked through the analysis of Oxford Nanopore Technologies sequencing data. We first performed some basic QC on the reads. If the reads were cDNA, full-length reads were identified, trimmed and oriented using pychopper. The reads were then aligned to a reference genome and assembled using StringTie to create transcripts.
If an existing annotation was provided the assembled transcripts were annotated, in so far as any known transcripts were labelled as such and novel transcripts were identified.
In a final step we created a transcriptome using the assembled transcripts and the reference genome.
The analysis presented can be run on any dataset from an Oxford Nanopore Technologies' device. The code will run within the EPI2ME Labs notebook server environment.