Isoform Tutorial

The isoform tutorial is intended as a simple guide for assembling transcript data and comparing the data against a known annotation.

The tutorial is provided with a small sample Drosophila transcriptome dataset and can be used to address such questions as:

Methods used in this tutorial include:

The computational requirements for this tutorial include:

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

Introduction

This tutorial aims to demonstrate how users can process the sequencing data from a cDNA or direct-RNA experiment to produce annotated transcripts and discover novel transcripts.

By the end of the tutorial the user will be able to:

Getting Started

To start analysing our experiment we must first collate our data. The workflow below expects to be given a single .fastq file.

Before anything else we will create and set a working directory:

Install additional software

This tutorial uses a couple of software packages that are not included in the default EPI2ME Labs server. Below we will install these tools using the conda package manager.

Please note that the software installed is not persistent and this step will need to be re-run if you stop and restart the EPI2ME Labs server

Sample Data

To get started we will download a modest D. melanogaster dataset to explore with the isoform tutorial. The workflow also requires a reference sequence and an annotation set.

The form below will download the sample data, reference sequence and annotations. To start the download click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the value of the input_file variable below. To find the correct full path of a file you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

image.png

The location shared with the EPI2ME labs server from your computer will show as /epi2melabs, for example a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.

Data Entry

Having downloaded the sample data, or locating your own data in the file browser, we need to provide the filepaths as input to the notebook. This is done in the form below, after updating the values, be sure to press the play button.

The input files may be gzip compressed (and end in the .gz suffix). Along with the input files and option is provided to rebuild the alignment index (rebuild_index) and specify the number of compute threads (threads) to use during computations.

A annotation is required only for the final comparison, it is not integral to the workflow.

Take care in selecting the read_type. The pychopper tool (used for orienting and identifying full length transcripts) is run only on cDNA reads; direct RNA reads do not have the required adapters to perform this step. The sample data provided with this tutorial is a cDNA analyte.

Sequence data quality control and pre-processing

Before we begin the analysis it is important to take an overview of the data with which we will be working.

For more detail please see the Introduction to fastq epi2me-labs tutorial.

The below code-block will perform a simple summary of the input single-molecule reads and display graphs depicting the read and base quality and their lengths. Read qualities are expected to be >10; read lengths will be dependent on your experiments but should mostly correspond to the expected transcript length distribution.

Identifying and orienting full length transcripts

If we are dealing with cDNA reads (read_type specified above) we can classify, trim and orient the sequencing data. To do this we will use the cdna_classifier.py software from pychopper. This is a standard step in long read cDNA pipelines and makes subsequent steps easier.

If you performed a direct RNA sequencing experiment please nevertheless run the code-block below as it will setup necessary inputs for subsequent steps.

In order to check the validity of the pychopper results, a plot will be displayed illustrating the selection of the classification decision boundary. This plot should be unimodal (have a single peak).

Mapping reads to the reference

With our reads QCed, we proceed by aligning the data to the provided reference. We will use minimap2 to perform this task using its splice aware mode. This allows reads to be split across introns and span exons. Once minimap2 is finished we process the alignment (.bam) file, filter, sort and generate an index (.bai). Filtering is used to remove secondary and supplementary alignments and those below the minimum mapping quality threshold (specified below - default Q40).

A note on alignment filtering: the alignment file is filtered such that secondary and supplementary alignments are removed. The correct placement of a read may be ambiguous, for example from the presence of repeats in the genome. In this case there may be multiple read alignments for the same read. One of these alignments is considered primary. All the other alignments have the secondary alignment flag set. Some alignments of reads cannot be represented as a simple linear alignment. These chimeric alignments are represented as a set of linear alignments that do not have large overlaps. Typically one of the linear alignments in a chimeric alignment is considered the “representative” alignment, and the others are called “supplementary”.

Sensible defaults have been selected for these advanced minimap2 parameters but feel free to change these for your specific experiment. A secondary filtering step is performed based on reasonable expectations of transcript data:

This section will generate the following files in the alignments/ directory:

As a brief sanity check of the alignments, we can summarise the information contained within the alignment .bam file and check some basic statistics. The code below will plot a read accuracy histogram together with a histogram depicting the proportion of each read contained within its alignment. This second plot should show a strong peak >98% indicating that the reads align to the reference across their full length.

Assembling reads into transcripts

We will now run StringTie to assemble RNA-Seq reads into potential transcripts.

This step sill produce one or more GFF which are then consolidated in the single output file annotation/stringtie.gff.

The split_bam option here will aid computational performance but may lead to less accurate results than otherwise.

Comparison to existing annotation

The transcriptome generated by the above workflow can be compared to an existing genome annotation. One result from this analysis can be the discovery of novel transcripts not present in the reference databases.

To create a comparison between the reference annotations and the transcriptome produced by the above workflow we can use gffcompare.

This will only run if an annotation file was supplied at the beginning of this notebook.

This step will produce the following files in the results/gffcompare directory:

Now that gffcompare has been run we can investigate the .tracking file produced. This gives details on the comparison between the reference annotation and the new annotation produced by this analysis above. A reference guide to the transcript classifications can be found here, though the following table highlights the key details. In general novel transcripts will be marked as u for unknown.

image.png

Using the .tracking file we can summarise the classifications found. A TSV will be created in the results/gffcompare/ directory per possible classification. By default a TSV will be created per classification code even if there are no transcripts within that category.

Generating a transcriptome

We now have our finalised transcripts, either straight from the StringTie assembler or having been annotated. To create a transcriptome from this output we can use gffread to combine it with the genomic reference sequence.

Summary

In this tutorial we have worked through analysis of Oxford Nanopore Technologies' sequencing data. We first performed some basic QC on the reads. If the reads were cDNA then full length reads were identified, trimmed and oriented using pychopper. The reads were then aligned to a reference genome and assembled using StringTie to create transcripts.

If an existing annotation was provided the assembled transcripts were annotated, in so far as any known transcripts were labelled as such and novel transcripts were identified.

In a final step we created a transcriptome using the assembled transcripts and the reference genome.

The analysis presented can be run on any dataset from an Oxford Nanopore Technologies' device. The code will run within the EPI2ME Labs notebook server environment.