# Differential Gene Expression Tutorial¶

This tutorial is intended as a simple guide for aligning and performing simple different gene expression and transcript usage.

The tutorial will answer the questions:

• Which genes are expressed in my study?
• Which genes are up-regulated in my experimental sample?
• Which gene isoforms show different patterns of abundance?
• What is the expression pattern for my favourite gene?

Methods used in this tutorial include:

• python: for statistical analysis and reporting,
• minimap2: is used to align sequence reads in a splice-aware manner against a reference transcriptome,
• samtools: processes the aligned reads and prepares BAM files,
• salmon: for defining the expressed genes and quantifying aligned reads,
• edgeR and StageR: for differential expression analysis,
• DEXSeq: to quantify differential transcript usage.

The computational requirements for this tutorial include:

• Computer running the EPI2ME Labs notebook server.
• At least 8GB of RAM

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

## Introduction¶

This tutorial aims to demonstrate a workflow for long-read differential expression analysis based on cDNA sequence data. Differential gene expression (DGE) and differential transcript usage (DTU) analyses aim to identify genes and/or transcripts that show statistically (and magnitudinally) altered expression patterns in a studied biological system. The results of the differential analyses are presented in a quantitative format and therefore the degree of change (up or down regulation) between experimental conditions can be calculated for each gene identified.

These differential analyses requires a “snapshot” of gene expression that can be used to quantify the abundance of the genes’ transcripts and the relative abundance of their isoforms. In this context, abundance corresponds to the number of messenger RNAs (mRNA) measured from each gene isoform within the organism / tissue / culture being investigated. The greater the number of mRNA molecules observed from a given gene isoform, the higher its expression level. In order to determine expression levels across the whole genome, sequence data specifically targeting the mRNA molecules can be generated.

Once cDNA sequence data has been produced from both the experimental and paired control samples (with an appropriate number of biological replicates), the sequence reads can be mapped to the organism’s reference transcriptome. The number of sequences mapping to each gene isoform can be counted, and it is these count data that form the basis for the DGE and DTU analyses.

The goals from this tutorial include:

• How a Snakemake workflow can be run inside EPI2ME Labs
• A basic introduction to recommended QC considerations for a sequence-based transcriptomics study
• How TSV-format results from pipeline-transcriptome-de may be opened, sorted and presented in a report
• How MA-plots may be prepared from TSV files and reviewed to understand the scope of transcriptional changes in a transcriptomics study
• How gene-lists prepared from TSV files may be filtered and used to present the data corresponding to genes of interest

Oxford Nanopore Technologies provides a number of sequencing solutions to allow users to generate the required snapshot of gene expression. This can be achieved by both sequencing the mRNA directly, or via a complementary DNA (cDNA) proxy. In contrast to short read sequencing technologies, entire mRNA transcripts can be captured as single reads. The example data provided with this tutorial is from a study based on the PCR-cDNA kit. This is a robust choice for performing differential transcript usage studies. This kit is suitable for preparation of sequence libraries from low mRNA input quantities. The cDNA population is enriched through PCR with low bias; an important prerequisite for the subsequent statistical analysis.

## Getting Started¶

To start analysing our experiment we must first set up our working directory and install software not already present in the EPI2ME Labs environment.

Before anything else we will create and set a working directory:

This tutorial uses a couple of software packages that are not included in the default EPI2ME Labs server. Below we will install these tools using the mamba package manager.

Please note that the software installed is not persistent and this step will need to be re-run if you stop and restart the EPI2ME Labs server

The first piece of software we will download is our workflow code. This corresponds to the Oxford Nanopore pipeline-transcriptome-de analysis workflow and it contains all the analysis code and logic:

We will install also various dependencies of the above workflow code. To do this we use mamba, a fast implementation of the conda package manager.

Finally we need to install a few R-language packages:

### Sample Data¶

To get started we will download a set of example human control and treated reads from the pipeline github project along with the human transcriptome reference and its annotation from ensembl.

The form below will download the sample data, reference sequence and annotations. To start the download click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

If you wish to analyse your own data rather than the sample data, you can edit the value .fastq input variable below. To find the correct full path of a directory you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

The location shared with the EPI2ME labs server from your computer will show as /epi2melabs, for example a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.

### Data entry¶

Having downloaded the sample data, or locating your own data in the file browser, we need to provide the filepaths as input to the notebook.

The form can be used to enter the filenames of your inputs. The datafiles will be expected to be found at the path:

/epi2melabs/transcriptome/{experiment_name}/{control_name}/{control_name}_rep{1,2,3}.fastq.gz

If all inputs were successfully validated (and error logging has been requested) the code above will display the configuration file to used by the worflow steps below. If some input files could not be found then an error will be displayed.

## Running the analysis pipeline¶

With our dataset defined in the preceding section we can now begin our analysis. In this section we will walk through the steps of the analysis pipeline, describing each of the steps in turn such that the user can better understand the process and its outputs.

### Aligning reads to reference sequence¶

The first step of our analysis is to align both the control and treated samples to the transcriptome reference. To do this most efficiently we should first create a transcriptome reference index:

We can use the reference transcriptome index created in the step above to find alignments of our sequencing reads to the reference sequence:

The previous step should have created a BAM format alignment file for each of the experimental and control samples. These data will be used for different statistical analyses. It is recommended to review the results from the mapping analysis before considering the statistical analysis. Let's look at the key mapping characteristics for our samples - a first step should be to extract some mapping characteristics from each of the BAM files; we will use the seqkit software to do this.

Differential gene expression is sensitive to the quantity and quality of starting data. The next step in the QC assessment is to assess whether the starting sequence libraries are similar in these regards. The code block below with present the results from the seqkit analysis as a table. Please have a look and see whether there is equivalence in the number of sequence reads, mapped reads and quality scores.

### Transcript quantification¶

In the previous sections, cDNA sequence reads have been mapped to the reference transcriptome using minimap2. The Salmon tool (Patro et al. (2017)) will now we used to assign cDNA reads to individual annotated transcripts defined by the GTF-format annotation file provided. This will be performed using the alignment-based mode of Salmon.

The Salmon analysis in the previous step has produced output files that describe the number of sequence reads assigned to each of the annotated transcripts. To simplify the exploration and analysis of these count data, the information for each of the experimental samples will be aggregated into a single file placed in the file, merged/all_counts.tsv.

At this point in the workflow we have used the Salmon tool to convert a BAM file of cDNA sequence read mappings to lists of read-counts at individual gene transcript resolution. The information from each of the experimental samples has been aggregated and we have a single TSV format file that describes the quantitative context of our biological system.

In the next section we will go on to perform a statistical analysis of this starting file to identify the annotated features that show either differential gene expression or differential transcript usage.

### Calculating differential expression¶

The previous section has produced a single TSV-format file that describes the number of cDNA sequence reads that correspond to each of the annotated transcripts in the provided GTF-format annotation file.

We will now use these count data to perform a statistical analysis to identify the genes and transcripts that show differences in abundance between the experimental conditions.

#### Pre-filtering of quantitative data using DRIMSeq¶

DRIMSeq (Nowicka and Robinson (2016)) is used to filter the transcript count data from the salmon analysis. The filter step will be used to select for genes and transcripts that satisfy rules for the number of samples in which a gene or transcript must be observed and minimum threshold levels for the number of observed reads. The parameters used for filtering are defined in the config.yaml file. The parameters defined for this analysis include

• min_samps_gene_expr = 3 - a transcript must be mapped to a gene in at least this minimum number of samples for the gene be included in the analysis
• min_samps_feature_expr = 1 - a transcript must be mapped to an isoform in at least this this minimum number of samples for the gene isoform to be included in the analysis
• min_gene_expr = 10 - the minimum number of total mapped sequence reads for a gene to be considered expressed
• min_feature_expr = 3 - the minimum number of total mapped sequence reads for a gene isoform to be considered

#### edgeR based differential expression analysis¶

A statistical analysis is first performed using edgeR (Robinson, McCarthy, and Smyth (2010), McCarthy et al. (2012)) to identify the subset of differentially expressed genes. The filtered list of gene counts is used as input. A normalisation factor is calculated for each sequence library (using the default TMM method - please see McCarthy et al. (2012) for further details). The defined experimental design is used to calculate estimates of dispersion across for each of the gene features. Genewise statistical tests are then calculated, again using the contrasts defined in the experimental design. The differentially expressed genes are corrected for false discovery (fdr) using the method of Benjamini & Hochberg (Benjamini and Hochberg (1995))

#### Differential transcript usage using DEXSeq¶

Differential transcript usage analysis is performed using the R DEXSeq package (Reyes et al. (2013)). Similar to the edgeR package, DEXSeq estimates the variance between the biological replicates and applies generalised linear models for the statistical testing. The key difference is that the DEXSeq method is looking for differences at the exon count level.

DEXSeq uses the filtered transcript count data prepared earlier in this analysis.

#### StageR stage-wise analysis of DGE and DTU¶

The final component of this cDNA sequence based gene isoform analysis is a stage-wise statistical analysis using the R software package called stageR (Van den Berge and Clement (2018)). stageR uses (1) the raw p-values for DTU from the DEXSeq analysis in the previous section and (2) a false-discovery corrected set of p-values from testing whether individual genes contain at least one exoning showing DTU. A hierarchical two-stage statistical testing evaluates the set of genes for DTU.

#### Running the analysis¶

The filtering and statistical tests described in the sections above can be run using a single target of the Snakefile used.

The following files are created by the above step and can be viewed outside of this notebook:

• results_dge.pdf - this is a summary file report of DGE results
• results_dge.tsv - this is a gene-level result file that describes genes and the probability that they show differential expression between experimental conditions (from edgeR analysis)
• results_dtu.pdf - this is a summary report of DTU results.
• results_dtu_gene.tsv - this is a gene-level result file from DEXSeq that lists annotated genes and their probabilities of differential expression
• results_dtu_stageR.tsv - this is the output from StageR and is shows both gene and transcript probabilities of differential expression
• results_dtu_transcript.tsv- this is a transcript-level result file from DEXSeq that lists annotated genes and their probabilities of differential expression
• results_dexseq.tsv - this is the complete output from the DEXSeq-analysis and is used to prepare some of the figures presented in this tutorial.

In the next section we will be investigating and reviewing the content from some of these files. Key data from the analysis will be presented in a graphical way to help understand the biological system and identify candidate genes.

## Data analysis¶

The preceding section has walked through the execution of the differential expression analysis code. We will now use the outputs of this analysis pipeline to examine the datasets and tease out biologically meaningful observations.

#### Reviewing results from the Salmon analysis¶

In the Snakemake workflow the Salmon software was used to link mapped cDNA sequence reads to their corresponding annotated transcripts. Within the differential expression analysis the Salmon results (one per sample) were aggregated into a single file. Let's have a look at this file and review some of the more abundant transcripts identified.

The code block below will import the aggregated results from the TSV file prepared and will sort the transcripts by abundance (total reads across all samples) before printing the top 10 transcripts characterised. These are the primary data that are passed on to the subsequent statistical tests.

The Salmon method has assigned reads to transcripts. It is recommended to have a quick check of the number of transcripts that are assigned to individual genes. It is anticipated that a majority of genes will be associated with a single transcript; it is the population of genes containing two or more transcripts that have been assessed for differential transcript usage.

To plot the number of transcripts per gene we need to open two files from the merged output results. The file merged/all_counts_filtered.tsv contains the input data for the stageR analysis and contains the mappings and read counts for genes associated multiple-transcripts. merged/all_gene_counts.tsv contains all gene-level counts and can be used to identify the genes associated with a single transcript.

The code-block below processes the data from these two files to prepare a histogram showing the abundance of transcript counts for genes identified in the analysis.

### Differential Gene Expression (DGE)¶

The following analysis will focus on the data found in the output file named results_dge.tsv from the analysis pipeline. Similar plots to those produced below are also presented in results_dge.pdf report file from the pipeline.

The results from the edgeR analysis can be plotted as an MA Plot. This plot visualises differences in measurements between the two experimental conditions. M is the log2 ratio of gene expression calculated between the conditions. A is a log2 transformed mean expression value. The figure below presents the MA figure from this edgeR analysis. Genes that satisfy the logFC and FDR corrected p-value thresholds defined are shaded as 'Up-' or 'Down-' regulated.

### Differential Transcript Usage¶

Differential transcript usage analysis is performed using the R DEXSeq package (Reyes et al. (2013)). Similar to the edgeR package, DEXSeq estimates the variance between the biological replicates and applies generalised linear models for the statistical testing. The key difference is that the DEXSeq method is looking for differences at the exon count level.

DEXSeq uses the filtered transcript count data prepared earlier in this analysis - let's have a quick look at the results described in the file de_analysis/results_dtu_transcript.tsv.

The results from the DEXSeq analysis can also be plotted as an MA Plot. MA-plots are described in the previous section on edgeR data review. As a brief reminder, M is the log2 ratio of transcript abundance between conditions. A is the log2 transformed mean abundance value. The figure below presents the MA plot from the DEXSeq analysis. Transcripts that satisfy the logFC and FDR corrected p-value thresholds defined are shaded as 'Up-' or 'Down-' regulated.

### StageR analysis¶

The final component of this cDNA sequence based gene isoform analysis is a stage-wise statistical analysis using the R software package called stageR (Van den Berge and Clement (2018)). StageR uses (1) the raw p-values for DTU from the DEXSeq analysis in the previous section and (2) a false-discovery corrected set of p-values from testing whether individual genes contain at least one exoning showing DTU. A hierarchical two-stage statistical testing evaluates the set of genes for DTU.

We can present the results from the StageR analysis by loading the file de_analysis/results_dtu_stageR.tsv - the results displayed are sorted by the probability that the gene shows differential expression.

### Review of candidate genes¶

The tables and MA-plots presented in the previous sections have extracted genes and transcripts of interest from the pipeline-transcriptome-de results files. This section is used to assess and present the quantitative information produced by the workflow for a priori known genes-of-interest.

For searching for a gene of interest we need to know the gene identifier used within the GTF annotation file. Please use this identifier in the form below to further explore the data.

The code will prepare a table summarising the transcripts associated with the specified gene, along with read counts and the probability results from the stageR analysis. The transcript count data will be additionally plotted as a boxplot.

As a final summary of the experiment the infographic below displays a small set of headline numbers captured through the course of the analysis:

## Summary¶

In this tutorial we have used the pipeline-transcriptome-de workflow to analyse sequence data from a replicated cDNA sequencing study to look for genes showing differential expression and trancripts showing differential transcript usage between experimental conditions.

• cDNA sequences have been mapped to the reference transcriptome using minimap2; samtools has been used to prepare a sorted and indexed BAM file suitable for further analysis.
• We have reviewed the mapping characteristics for all the samples used in our study and have hopefully established that the samples included in the design are largely equivalent.
• Salmon has been used to prepare transcript counts from the transcriptome mapping data and the provided GTF-format annotation file. The count data from Salmon has been assessed for differential expression using edgeR, for differential transcript usage using DEXSeq and a combined analysis using StageR.
• We have reviewed the CSV-format result files from each of the statistical analyses above and have prepared tabular data showing lists of top genes and transcripts - the data has been presented as MA-plots to present the overall structure of the data.
• A simple method has been implemented to present summary data for genes of interest - a summary report has been prepared and written to an HTML file for offline review.

The analysis presented can be run on any dataset from an Oxford Nanopore Technologies' device. The code will run with the EPI2ME Labs notebook server environment.

## References¶

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological) 57 (1): 289–300. http://www.jstor.org/stable/2346101.

McCarthy, Davis J., Chen, Yunshun, Smyth, and Gordon K. 2012. “Differential Expression Analysis of Multifactor Rna-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97.

Nowicka, Malgorzata, and Mark D. Robinson. 2016. “DRIMSeq: A Dirichlet-Multinomial Framework for Multivariate Count Outcomes in Genomics [Version 2; Referees: 2 Approved].” F1000Research 5 (1356). https://doi.org/10.12688/f1000research.8900.2.

Patro, Robert, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods 14 (March). https://doi.org/10.1038/nmeth.4197.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.