# Isoform Tutorial

The isoform tutorial is intended as a simple guide for assembling transcript data and comparing the data against a known annotation.

The tutorial is provided with a small sample Drosophila transcriptome dataset and can be used to address such questions as:

• How many different full length transcripts are in my sample?
• Do I have novel transcripts in my dataset?

Methods used in this tutorial include:

• Python for statistical analysis and reporting, including use of pandas,
• minimap2: generating a reference index and aligning, in a splice-aware manner, reads against the reference index,
• samtools: processing aligned reads,
• gffcompare: for manipulating .gff files
• pychopper: trimming and orienting cDNA reads,
• StringTie: assembling transcriptome sequence reads.

The computational requirements for this tutorial include:

• Computer running the EPI2ME Labs notebook Server.
• At least 8 Gb RAM (though more may be needed for larger reference sequences).

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

## Introduction¶

This tutorial aims to demonstrate how users can process the sequencing data from a cDNA or direct-RNA experiment to produce annotated transcripts and discover novel transcripts.

By the end of the tutorial the user will be able to:

• align transcript data to a reference genome,
• assemble transcript data using StringTie,
• compare transcript data to an existing annotation,
• generate a transcriptome from the data.

## Getting Started¶

To start analysing our experiment we must first collate our data. The workflow below expects to be given a single .fastq file.

Before anything else we will create and set a working directory:

This tutorial uses a couple of software packages that are not included in the default EPI2ME Labs server. Below we will install these tools using the conda package manager.

Please note that the software installed is not persistent and this step will need to be re-run if you stop and restart the EPI2ME Labs server

### Sample Data¶

To get started we will download a modest D. melanogaster dataset to explore with the isoform tutorial. The workflow also requires a reference sequence and an annotation set.

The form below will download the sample data, reference sequence and annotations. To start the download click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

If you wish to analyse your own data rather than the sample data, you can edit the value of the input_file variable below. To find the correct full path of a file you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

The location shared with the EPI2ME labs server from your computer will show as /epi2melabs, for example a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.

### Data Entry¶

Having downloaded the sample data, or locating your own data in the file browser, we need to provide the filepaths as input to the notebook. This is done in the form below, after updating the values, be sure to press the play button.

The input files may be gzip compressed (and end in the .gz suffix). Along with the input files and option is provided to rebuild the alignment index (rebuild_index) and specify the number of compute threads (threads) to use during computations.

A annotation is required only for the final comparison, it is not integral to the workflow.

Take care in selecting the read_type. The pychopper tool (used for orienting and identifying full length transcripts) is run only on cDNA reads; direct RNA reads do not have the required adapters to perform this step. The sample data provided with this tutorial is a cDNA analyte.

### Sequence data quality control and pre-processing¶

Before we begin the analysis it is important to take an overview of the data with which we will be working.

For more detail please see the Introduction to fastq epi2me-labs tutorial.

The below code-block will perform a simple summary of the input single-molecule reads and display graphs depicting the read and base quality and their lengths. Read qualities are expected to be >10; read lengths will be dependent on your experiments but should mostly correspond to the expected transcript length distribution.

## Identifying and orienting full length transcripts¶

If we are dealing with cDNA reads (read_type specified above) we can classify, trim and orient the sequencing data. To do this we will use the cdna_classifier.py software from pychopper. This is a standard step in long read cDNA pipelines and makes subsequent steps easier.

If you performed a direct RNA sequencing experiment please nevertheless run the code-block below as it will setup necessary inputs for subsequent steps.

In order to check the validity of the pychopper results, a plot will be displayed illustrating the selection of the classification decision boundary. This plot should be unimodal (have a single peak).

## Mapping reads to the reference¶

With our reads QCed, we proceed by aligning the data to the provided reference. We will use minimap2 to perform this task using its splice aware mode. This allows reads to be split across introns and span exons. Once minimap2 is finished we process the alignment (.bam) file, filter, sort and generate an index (.bai). Filtering is used to remove secondary and supplementary alignments and those below the minimum mapping quality threshold (specified below - default Q40).

A note on alignment filtering: the alignment file is filtered such that secondary and supplementary alignments are removed. The correct placement of a read may be ambiguous, for example from the presence of repeats in the genome. In this case there may be multiple read alignments for the same read. One of these alignments is considered primary. All the other alignments have the secondary alignment flag set. Some alignments of reads cannot be represented as a simple linear alignment. These chimeric alignments are represented as a set of linear alignments that do not have large overlaps. Typically one of the linear alignments in a chimeric alignment is considered the “representative” alignment, and the others are called “supplementary”.

Sensible defaults have been selected for these advanced minimap2 parameters but feel free to change these for your specific experiment. A secondary filtering step is performed based on reasonable expectations of transcript data:

• min_mapping_quality: reads below this mapping quality will be disgarded. The higher this parameter, the more reads will be discarded.
• max_poly_run: maximum allowed poly(A) length in the genome near the 3' end of mapping.
• poly_context: internal priming filter context size

This section will generate the following files in the alignments/ directory:

• reads_aln_sorted.bam - alignment of the processed reads against the chosen reference.
• reads_aln_sorted.bam.bai - index file for the alignment
• reads_aln_sorted.bam.tsv - tsv representation of the alignment including accuracy, identity and coverage.

As a brief sanity check of the alignments, we can summarise the information contained within the alignment .bam file and check some basic statistics. The code below will plot a read accuracy histogram together with a histogram depicting the proportion of each read contained within its alignment. This second plot should show a strong peak >98% indicating that the reads align to the reference across their full length.

We will now run StringTie to assemble RNA-Seq reads into potential transcripts.

This step sill produce one or more GFF which are then consolidated in the single output file annotation/stringtie.gff.

The split_bam option here will aid computational performance but may lead to less accurate results than otherwise.

## Comparison to existing annotation¶

The transcriptome generated by the above workflow can be compared to an existing genome annotation. One result from this analysis can be the discovery of novel transcripts not present in the reference databases.

To create a comparison between the reference annotations and the transcriptome produced by the above workflow we can use gffcompare.

This will only run if an annotation file was supplied at the beginning of this notebook.

This step will produce the following files in the results/gffcompare directory:

• stringtie.annotated.gtf: a GTF file containing the transcripts assembled by Stringtie an annotated with the reference annotation informaiton.
• stringtie.tracking: a tabular summary file containing the annotation status of the transcripts.

Now that gffcompare has been run we can investigate the .tracking file produced. This gives details on the comparison between the reference annotation and the new annotation produced by this analysis above. A reference guide to the transcript classifications can be found here, though the following table highlights the key details. In general novel transcripts will be marked as u for unknown.

Using the .tracking file we can summarise the classifications found. A TSV will be created in the results/gffcompare/ directory per possible classification. By default a TSV will be created per classification code even if there are no transcripts within that category.

### Generating a transcriptome¶

We now have our finalised transcripts, either straight from the StringTie assembler or having been annotated. To create a transcriptome from this output we can use gffread to combine it with the genomic reference sequence.

# Summary¶

In this tutorial we have worked through analysis of Oxford Nanopore Technologies' sequencing data. We first performed some basic QC on the reads. If the reads were cDNA then full length reads were identified, trimmed and oriented using pychopper. The reads were then aligned to a reference genome and assembled using StringTie to create transcripts.

If an existing annotation was provided the assembled transcripts were annotated, in so far as any known transcripts were labelled as such and novel transcripts were identified.

In a final step we created a transcriptome using the assembled transcripts and the reference genome.

The analysis presented can be run on any dataset from an Oxford Nanopore Technologies' device. The code will run within the EPI2ME Labs notebook server environment.