wf-transcriptomes documentation

By EPI2ME Labs
6 min read

Workflow Transcriptomes

Transcriptome analysis including assembly and annotation of cDNA and direct RNA sequencing data and differential expression analysis.

Introduction

This workflow can be used for the following:

  • Identify RNA transcripts using either cDNA or direct RNA reads.
  • Reference aided transcriptome assembly.
  • Annotation of assembled transcripts.
  • Differential gene expression analysis using a pre-computed or assembled reference transcriptome.
  • Differential transcript usage analysis using a precomputed or assembled reference transcriptome.

Compute requirements

Recommended requirements:

  • CPUs = 16
  • Memory = 32GB

Minimum requirements:

  • CPUs = 8
  • Memory = 32GB

Approximate run time: 15 minutes per sample, with 1 million reads and recommended resources.

ARM processor support: False

Install and run

These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME Desktop application.

The workflow uses Nextflow to manage compute and software resources, therefore Nextflow will need to be installed before attempting to run the workflow.

The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed. This is controlled by the -profile parameter as exemplified below.

It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.

The following command can be used to obtain the workflow. This will pull the repository in to the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:

nextflow run epi2me-labs/wf-transcriptomes --help

To update a workflow to the latest version on the command line use the following command:

nextflow pull epi2me-labs/wf-transcriptomes

A demo dataset is provided for testing of the workflow. It can be downloaded and unpacked using the following commands:

wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-transcriptomes/wf-transcriptomes-demo.tar.gz
tar -xzvf wf-transcriptomes-demo.tar.gz

The workflow can then be run with the downloaded demo data using:

nextflow run epi2me-labs/wf-transcriptomes \
--de_analysis \
--direct_rna \
--fastq 'wf-transcriptomes-demo/differential_expression_fastq' \
--minimap2_index_opts '-k15' \
--ref_annotation 'wf-transcriptomes-demo/gencode.v22.annotation.chr20.gtf' \
--ref_genome 'wf-transcriptomes-demo/hg38_chr20.fa' \
--sample_sheet 'wf-transcriptomes-demo/sample_sheet.csv' \
-profile standard

For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/

This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices.

Find related protocols in the Nanopore community.

Input example

This workflow accepts either FASTQ or BAM files as input.

The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with --sample. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.

(i) (ii) (iii)
input_reads.fastq ─── input_directory ─── input_directory
├── reads0.fastq ├── barcode01
└── reads1.fastq │ ├── reads0.fastq
│ └── reads1.fastq
├── barcode02
│ ├── reads0.fastq
│ ├── reads1.fastq
│ └── reads2.fastq
└── barcode03
└── reads0.fastq

Input parameters

Input Options

Nextflow parameter nameTypeDescriptionHelpDefault
fastqstringFASTQ files to use in the analysis.This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.
bamstringBAM or unaligned BAM (uBAM) files to use in the analysis.This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.
transcriptome_sourcestringSelect how the transcriptome used for analysis should be prepared.For differential expression analysis, use of an existing transcriptome may be preferred and so ‘precomputed’ should be selected. In this case the ‘ref_transcriptome’ parameter should be specified. To create a reference transcriptome using an existing reference genome, select ‘reference guided’ and specify the ‘ref_genome’ parameter.reference-guided
ref_genomestringPath to reference genome sequence [.fa/.fq/.fa.gz/fq.gz]. Required for reference-based workflow.A reference genome is required for reference-based assembly of a transcriptome.
ref_transcriptomestringTranscriptome reference file. Required for precomputed transcriptome calculation and for differential expression analysis.A reference transcriptome related to the sample under study. Must be supplied when the ‘Transcriptome source’ parameter has been set to ‘precomputed’ or to perform differential expression.
ref_annotationstringA reference annotation in GFF2 or GFF3 format (extensions .gtf(.gz), .gff(.gz), .gff3(.gz)). Only annotation files from Encode, Ensembl and NCBI are supported.This will be used for guiding the transcriptome assembly and to label transcripts with their corresponding gene identifiers. Note: If in de_analysis mode transcript strands must be only + or -.
direct_rnabooleanSet to true for direct RNA sequencing.Omits the pychopper step.False
analyse_unclassifiedbooleanAnalyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory.If selected and if the input is a multiplex directory the workflow will also process the unclassified directory.False

Output Options

Nextflow parameter nameTypeDescriptionHelpDefault
out_dirstringDirectory for output of all user-facing files.output
igvbooleanVisualize outputs in the EPI2ME IGV visualizer.Enabling this option will visualize the output alignment files in the EPI2ME Desktop App IGV visualizer.False

Sample Options

Nextflow parameter nameTypeDescriptionHelpDefault
sample_sheetstringA CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. If you are running the differential expression workflow, there must be an additional column condition with two labels, one of which must be control (e.g. control and treated). Control will indicate which samples will be used as the reference. There should be at least 3 repeats for each condition.The sample sheet is a CSV file with, minimally, columns named barcode and alias. Extra columns are allowed.
samplestringA single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files.

Options for reference-based workflow

Nextflow parameter nameTypeDescriptionHelpDefault
plot_gffcmp_statsbooleanCreate a PDF of plots from showing gffcompare resultsIf set to true, a PDF file containing detailed gffcompare reults will be outputTrue
gffcompare_optsstringExtra command-line options to give to gffcompare -rFor a list of possible options see gffcompare.-R
minimap2_index_optsstringExtra command-line options for minimap2 indexing.See minimap2 index options for more information. These will only be relevant in the reference based transcriptome assembly.-k14
minimap2_optsstringAdditional command-line options for minimap2 alignment.See minimap2 options for further information. These will only be relevant in the reference based transcriptome assembly.-uf
minimum_mapping_qualityintegerfilter aligned reads by MAPQ quality.Reads that do not meet this mapping quality after minimap2 alignment, will be filtered out.40
stringtie_optsstringExtra command-line options for stringtie transcript assembly.For additional String tie options see here.—conservative

Differential Expression Options

Nextflow parameter nameTypeDescriptionHelpDefault
de_analysisbooleanRun DE anaylsisRunning this requires you to provide at least two replicates for a control and treated sample as well as a sample sheet param.False
min_gene_exprintegerThe minimum number of total mapped sequence reads required for a gene to be considered in differential transcript usage analysis.Filtering at the gene level ensures that the observed transcript ratios are calculated with a minimum number of counts per gene.10
min_feature_exprintegerThe minimum number of reads assigned to a transcript for it to be considered in differential transcript usage analysis.Filter out transcripts that do not have this minimum number of transcript expression, reducing noise.3
min_samps_gene_exprintegerSet the minimum number of samples in which a gene is expressed to be included in the differential transcript usage analysis.A gene must be expressed in at least this number of samples for the gene be included in the differential transcript usage analysis. Filtering at the gene level improves the reliability of the observed transcript ratios.3
min_samps_feature_exprintegerSet the minimum number of samples in which a transcript is expressed to be included in the differential transcript usage analysis.A transcript must expressed in at least this minimum number of samples to be included in the analysis. Should be equal to the number of replicates per sample you have.1

Advanced Options

Nextflow parameter nameTypeDescriptionHelpDefault
threadsintegerNumber of CPU threads.Only provided to processes including alignment and and assembly that benefit from multiple threads.4
cdna_kitstringIf cDNA reads are used, select the kit used.This will be used by pychopper to preprocess the reads for downstream analysis.SQK-PCS109
pychopper_backendstringPychopper can use one of two available backends for identifying primers in the raw reads‘edlib’ is set by default due to its high performance. However, it may be less sensitive than ‘phmm’.edlib
pychopper_optsstringExtra pychopper optsSee available options (here)[https://github.com/epi2me-labs/pychopper#usage]
bundle_min_readsintegerMinimum size of bam bundle for parallel processing.50000
isoform_table_nrowsintegerMaximum rows to dispay in the isoform report table5000

Outputs

Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.

TitleFile pathDescriptionPer sample or aggregated
workflow reportwf-transcriptomes-report.htmla HTML report document detailing the primary findings of the workflowaggregated
Per file read statsfastq_ingress_results/reads/fastcat_stats/per-file-stats.tsvA TSV with per file read stats, including all samples.aggregated
Read statsfastq_ingress_results/reads/fastcat_stats/per-read-stats.tsvA TSV with per read stats, including all samples.aggregated
Run ID’sfastq_ingress_results/reads/fastcat_stats/run_idsList of run IDs present in reads.aggregated
Meta map jsonfastq_ingress_results/reads/metamap.jsonMetadata used in workflow presented in a JSON.aggregated
Concatenated sequence datafastq_ingress_results/reads/{{ alias }}.fastq.gzPer sample reads concatenated in to one FASTQ file.per-sample
Assembled transcriptome{{ alias }}_transcriptome.fasPer sample assembled transcriptome.per-sample
Annotated assembled transcriptome{{ alias }}_merged_transcriptome.fasPer sample annotated assembled transcriptome.per-sample
Alignment summary statistics{{ alias }}_read_aln_stats.tsvPer sample alignment summary statistics.per-sample
GFF compare results.{{ alias }}_gffcompareAll GFF compare output files.per-sample
Differential gene expression resultsde_analysis/results_dge.tsvThis is a gene-level result file that describes genes and their probability of showing differential expression between experimental conditions.aggregated
Differential gene expression reportde_analysis/results_dge.pdfSummary report of differential gene expression analysis as a PDF.aggregated
Differential transcript usage gene TSVde_analysis/results_dtu_gene.tsvThis is a gene-level result file from DEXSeq that lists annotated genes and their probabilities of differential expression.aggregated
Differential transcript usage reportde_analysis/results_dtu.pdfSummary report of differential transcript usage results as a PDF.aggregated
Differential transcript usage TSVde_analysis/results_dtu_transcript.tsvThis is a transcript-level result file from DEXSeq that lists annotated genes and their probabilities of differential expression.aggregated
Differential transcript usage stageR TSVde_analysis/results_dtu_stageR.tsvThis is the output from StageR and it shows both gene and transcript probabilities of differential expressionaggregated
Differential transcript usage DEXSeq TSVde_analysis/results_dexseq.tsvThe complete output from the DEXSeq-analysis, shows both gene and transcript probabilities of differential expression.aggregated
Gene countsde_analysis/all_gene_counts.tsvRaw gene counts created by the Salmon tool, before filtering.aggregated
Gene counts per millionde_analysis/cpm_gene_counts.tsvThis file shows counts per million (CPM) of the raw gene counts to facilitate comparisons across samples.aggregated
Transcript countsde_analysis/unfiltered_transcript_counts_with_genes.tsvRaw transcript counts created by the Salmon tool, before filtering. Includes reference to the associated gene ID.aggregated
Transcript per million countsde_analysis/unfiltered_tpm_transcript_counts.tsvThis file shows transcripts per million (TPM) of the raw counts to facilitate comparisons across samples.aggregated
Transcript counts filteredde_analysis/filtered_transcript_counts_with_genes.tsvFiltered transcript counts, used for differential transcript usage analysis. Includes a reference to the associated gene ID.aggregated
Transcript info table{{ alias }}_transcripts_table.tsvThis file details each isoform that was reconstructed from the input reads. It contains a subset of columns from the .tmap output from gffcompareper-sample
Final non redundant transcriptomede_analysis/final_non_redundant_transcriptome.fastaTranscripts that were used for differential expression analysis including novel transcripts with the identifiers used for DE analysis.aggregated
Index of reference FASTA fileigv_reference/{{ ref_genome file }}.faiReference genome index of the FASTA file required for IGV config.aggregated
GZI index of the reference FASTA fileigv_reference/{{ ref_genome file }}.gziGZI Index of the reference FASTA file.aggregated
JSON configuration file for IGV browserigv.jsonJSON configuration file to be loaded in IGV for visualising alignments against the reference.aggregated
BAM file (minimap2)BAMS/{{ alias }}.reads_aln_sorted.bamBAM file generated from mapping input reads to the reference.per-sample
BAM index file (minimap2)BAMS/{{ alias }}.reads_aln_sort.bam.baiIndex file generated from mapping input reads to the reference.per-sample

Pipeline overview

1. Concatenate input files and generate per read stats.

The fastcat tool is used to concatenate multifile samples to be processed by the workflow. It will also output per read stats including average read lengths and qualities.

2. Preprocess cDNA.

If input sequences are cDNA Pychopper is used to orient, trim and rescue full length cDNA reads and associated statistics. If the direct_rna parameter is selected this step will be skipped.

3. Build transcriptome.

If the transcriptome_source parameter is “reference-guided” a transcriptome will be built for each sample as outlined below. If the transcriptome_source is “precomputed” and the reference_transcriptome parameter is provided the workflow will skip step 3.

3.1 Align reads with reference genome.

The reference genome will be indexed and aligned using Minimap2. The output is sorted and converted to a BAM file using Samtools. Alignment stats are created from these using Seqkit BAM.

Additionally, the workflow will generate an IGV configuration file if --igv is selected. This file allows the user to view the aligned BAM in the EPI2ME Desktop Application in the Viewer tab.

3.2 Chunk BAM

The aligned BAMs are split into chunks using the bundle_min_reads parameter (default: 50000).

3.3 Assemble transcripts

StringTie is then used to assemble the transcripts using the aligned segments in the chunked BAM files. The assembled transcript will be output as a GFF file. If a ref_annotation file is provided this will also be included in the GFF.

3.4 Merge Chunks

Transcript GFF files from the chunks with the same sample aliases will then be merged.

3.5 Annnotate

GffCompare is then used to compare query and reference annotations, merging records where appropriate and then annotating them. This also creates estimates of accuracy of the GFF files output in a stats file per sample.

3.6 Create transcriptomes

Gffread is used to create a transcriptome FASTA file from the final GFF as well as a merged transcriptome that includes annotations in the FASTA headers where available.

4. Differential expression analysis

Differential gene expression (DGE) and differential transcript usage (DTU) analyses aim to identify genes and transcripts that show statistically altered expression patterns.

Differential Expression requires at least 2 replicates of each sample to compare (but we recommend three). You can see an example sample_sheet.csv below.

Sample sheet condition column

The sample sheet should be a comma separated values file (.csv) and include at least three columns named barcode, alias and condition.

  • Each barcode should refer to a directory of the same name in the input FASTQ directory (in the example below barcode01 to barcode06 reflect the test_data directory).
  • The alias column allows you to rename each barcode to an alias that will be used in the report and other output files.
  • The condition column will need to contain one of two keys to indicate the two samples being compared. Control must be one of the keys, used to indicate which samples will be used as the reference in the differential expression analysis.

eg. sample_sheet.csv

barcode,alias,condition
barcode01,sample01,control
barcode02,sample02,control
barcode03,sample03,control
barcode04,sample04,treated
barcode05,sample05,treated
barcode06,sample06,treated

4.1 Merge cross sample transcriptomes

If a ref_transcriptome is not provided, the transcriptomes created by the workflow will be used for DE analysis. To do this, the GFF outputs of GffCompare are merged using StringTie. A final non redundant FASTA file of the transcripts is created using the merged GFF file and the reference genome using seqkit.

4.2 Create a final non redundant transcriptome

The reads from all the samples will be aligned with the final non redundant transcriptome using Minimap2 in a splice aware manner.

4.3 Count genes and transcripts

Salmon is used for transcript quantification, giving gene and transcript counts.

4.4 edgeR based differential expression analysis

A statistical analysis is first performed using edgeR to identify the subset of differentially expressed genes using the gene counts as input. A normalisation factor is calculated for each sequence library using the default TMM method (see McCarthy et al. (2012) for further details). The defined experimental design is used to calculate estimates of dispersion for each of the gene features. Statistical tests are calculated using the contrasts defined in the experimental design. The differentially expressed genes are corrected for false discovery (FDR) using the method of Benjamini & Hochberg (Benjamini and Hochberg (1995))

4.5 Pre-filtering of quantitative data using DRIMSeq

DRIMSeq is used to filter the transcript count data from the Salmon analysis for differential transcript usage (DTU) analysis. The filter step will be used to select for genes and transcripts that satisfy rules for the number of samples in which a gene or transcript must be observed, and minimum threshold levels for the number of observed reads. The parameters used for filtering are min_samps_gene_expr, min_samps_feature_expr, min_gene_expr, and min_feature_expr. By default, any transcripts with zero expression or one transcript in all samples are filtered out at this stage.

4.6 Differential transcript usage using DEXSeq

Differential transcript usage analysis is performed using the R DEXSeq package (Anders et al. (2012)). Similar to the edgeR package, DEXSeq estimates the variance between the biological replicates and applies generalised linear models for the statistical testing. The key difference is that the DEXSeq method looks for differences at the exon count level. DEXSeq uses the filtered transcript count data prepared earlier in this analysis.

4.7 StageR stage-wise analysis of DGE and DTU

The final component of this isoform analysis is a stage-wise statistical test using the R software package stageR(Van den Berge and Clement (2018)). stageR uses (1) the raw p-values for DTU from the DEXSeq analysis in the previous section and (2) a false-discovery corrected set of p-values from testing whether individual genes contain at least one exon showing DTU. A hierarchical two-stage statistical testing evaluates the set of genes for DTU.

Troubleshooting

  • If the workflow fails please run it with the demo data set to ensure the workflow itself is working. This will help us determine if the issue is related to the environment, input parameters or a bug.
  • See how to interpret some common nextflow exit codes here.
  • Renaming, moving or deleting the input BAM, reference genome or the output directory from the location provided at runtime will stop IGV in the EPI2ME Desktop app from loading.

FAQ’s

Does the workflow support de novo assembly? - Currently the workflow does not have a de novo mode.

Why is the IGV panel not showing? - The workflow expects either an uncompressed or bgzip-compressed reference. If the user provides a reference compressed not with bgzip, the workflow will run to completion, but won’t be able to generate the necessary indexes to visualize the outputs in IGV.

If your question is not answered here, please report any issues or suggestions on the github issues page or start a discussion on the community.

See the EPI2ME website for lots of other resources and blog posts.


Share

EPI2ME Labs

EPI2ME Labs

Senior Button Pusher

Quick Links

TutorialsWorkflowsOpen DataContact

Social Media

© 2020 - 2024 Oxford Nanopore Technologies plc. All rights reserved. Registered Office: Gosling Building, Edmund Halley Road, Oxford Science Park, OX4 4DQ, UK | Registered No. 05386273 | VAT No 336942382. Oxford Nanopore Technologies, the Wheel icon, EPI2ME, Flongle, GridION, Metrichor, MinION, MinIT, MinKNOW, Plongle, PromethION, SmidgION, Ubik and VolTRAX are registered trademarks of Oxford Nanopore Technologies plc in various countries. Oxford Nanopore Technologies products are not intended for use for health assessment or to diagnose, treat, mitigate, cure, or prevent any disease or condition.