SARS-CoV-2 Analysis Workflow

This tutorial implements the best-practices bioinformatics workflow for the assembly of SARS-CoV-2 viral genomes. The workflow in this document implements the ARTIC Nanopore bioinformatics SOP.

Computational requirements for this tutorial include:

Getting Started

Before anything else we will create and set a working directory:
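The notebook's own cell is not reproduced in this text. As a minimal sketch of this step in Python, assuming a hypothetical directory name (`sars_cov2_tutorial` is chosen only for illustration and is not taken from the notebook):

```python
import os

def set_working_directory(path):
    """Create `path` if it does not exist, make it the current
    directory, and return its absolute path."""
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    return os.getcwd()

# Hypothetical directory name, used only for illustration:
# working_dir = set_working_directory("sars_cov2_tutorial")
```

Because `exist_ok=True` is set, re-running the step after a server restart does not fail if the directory is already present.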

Install additional software

The default EPI2ME Labs environment does not contain the ARTIC software. In this section we will prepare the environment by installing the necessary ARTIC software.

Please note that the installed software is not persistent: this step will need to be re-run if you stop and restart the EPI2ME Labs server.

Having connected to the EPI2ME Labs server, we will install the necessary software. Press the play button below to the left-hand side (it may appear as [ ]):

Sample Data

This tutorial is provided with a sample dataset. Samples included in the demonstration dataset were obtained from the European Nucleotide Archive project PRJNA650037. This project has the title "Johns Hopkins Viral Genomics of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)" and describes 210 virus samples that have been sequenced according to the ARTIC protocol on a GridION device. Ten samples with unique barcodes in the range 1..12 were picked from this project for this demonstration dataset.

To download the sample file we run the Linux command wget. To execute the command click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the value of the input_file variable below. To find the correct full path of a file you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path.

The location shared with the EPI2ME Labs server from your computer will show as /epi2melabs. For example, when the /data folder is shared, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass on the server.
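The path translation described above can be sketched in Python. The function name `to_server_path` and its parameters are hypothetical, and the sketch assumes POSIX-style paths on the host computer:

```python
from pathlib import PurePosixPath

def to_server_path(host_path, shared_folder="/data", mount_point="/epi2melabs"):
    """Translate a path on the host computer into the path seen by the server.

    Raises ValueError if host_path does not lie inside shared_folder.
    """
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(shared_folder))
    return str(PurePosixPath(mount_point) / relative)

print(to_server_path("/data/my_gridion_run/fastq_pass"))
# prints /epi2melabs/my_gridion_run/fastq_pass
```

A path outside the shared folder raises an error rather than producing a location the server cannot see.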

Data Entry

The workflow requires one or more .fastq files from an Oxford Nanopore Technologies sequencing device. The workflow does not include a demultiplexing step, as reads should already have been demultiplexed in MinKNOW. If your data is not demultiplexed, refer back to the protocol or use the demultiplex workflow in EPI2ME.

The form below can be used to change the analysis parameters from their defaults.

The input directory should contain reads which have already been demultiplexed (using either the 12-, 24- or 96-barcode sets).

The default minimum and maximum read lengths are appropriate for the ARTIC 400mer amplicon sets; experiments using an alternative amplicon scheme (e.g. the V1200 scheme) should change these values.
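To illustrate what the minimum and maximum read length parameters do, here is a minimal Python sketch of a length filter. The thresholds 400 and 700 are assumed values for illustration only, not the notebook's documented defaults, and `filter_by_length` is a hypothetical helper:

```python
def filter_by_length(reads, min_len, max_len):
    """Keep (name, sequence) pairs whose length lies within [min_len, max_len]."""
    return [(name, seq) for name, seq in reads if min_len <= len(seq) <= max_len]

# Assumed, illustrative thresholds; adjust them to the amplicon scheme in use.
MIN_LEN, MAX_LEN = 400, 700

reads = [
    ("short_fragment", "A" * 150),   # shorter than a full amplicon
    ("amplicon", "A" * 520),         # plausible full-length amplicon read
    ("chimera", "A" * 1100),         # longer than a single amplicon
]
kept = filter_by_length(reads, MIN_LEN, MAX_LEN)
print([name for name, _ in kept])  # prints ['amplicon']
```

A scheme with longer amplicons would need both bounds raised, otherwise genuine full-length reads would be discarded as too long.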

Analysis

With our software environment set up and our inputs specified, it is time to move on to the analysis. The following workflow begins with quality control of the reads, followed by the ARTIC analysis and then downstream analysis of the ARTIC results.

Read Quality Control

In this section we will merge the reads within each barcoded folder and review average quality, read length and per-barcode read depth. These results will be added to the final report.

To generate a QC report, run the cell below:
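The notebook's QC cell is not reproduced in this text. As a rough sketch of the kind of statistics involved, the Python below groups reads by barcode and reports read count, mean read length and mean quality. The function names and the tuple-based input are assumptions made for illustration; mean quality is computed here by averaging per-base error probabilities, which is one common convention and may differ from the notebook's own calculation:

```python
import math
from collections import defaultdict

def mean_quality(qual_string):
    """Mean Phred quality of one read (Phred+33 encoding), obtained by
    averaging per-base error probabilities. Expects a non-empty string."""
    probs = [10 ** (-(ord(c) - 33) / 10) for c in qual_string]
    return -10 * math.log10(sum(probs) / len(probs))

def summarise(records):
    """records: iterable of (barcode, sequence, quality_string) tuples.
    Returns {barcode: {"reads": n, "mean_length": x, "mean_quality": q}}."""
    grouped = defaultdict(list)
    for barcode, seq, qual in records:
        grouped[barcode].append((len(seq), mean_quality(qual)))
    summary = {}
    for barcode, stats in grouped.items():
        n = len(stats)
        summary[barcode] = {
            "reads": n,
            "mean_length": sum(length for length, _ in stats) / n,
            "mean_quality": sum(q for _, q in stats) / n,
        }
    return summary

# Toy records standing in for reads parsed from each barcode folder.
records = [
    ("barcode01", "ACGT", "IIII"),
    ("barcode01", "ACGTAC", "IIIIII"),
    ("barcode02", "AC", "++"),
]
for barcode, stats in sorted(summarise(records).items()):
    print(barcode, stats)
```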

Run ARTIC for each sample

With demultiplexed reads, we are in a position to analyse each dataset independently using the ARTIC workflow.

The ARTIC workflow produces the following files for each barcode (<run_name> is the value given at the top of this page):

  1. <run_name>.rg.primertrimmed.bam - BAM file for visualisation after primer-binding site trimming
  2. <run_name>.trimmed.bam - BAM file with the primers left on (used in variant calling)
  3. <run_name>.merged.vcf - all detected variants in VCF format
  4. <run_name>.pass.vcf.gz - detected variants in VCF format passing quality filter
  5. <run_name>.fail.vcf - detected variants in VCF format failing quality filter
  6. <run_name>.primers.vcf - detected variants falling in primer-binding regions
  7. <run_name>.consensus.fasta - consensus sequence

These will be present in folders named as:

<output_folder>/analysis/artic/<barcode>/

where <output_folder> is the value given at the top of this page and <barcode> is the identified barcode for each dataset.
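The naming scheme above can be expressed as a short Python sketch that builds the expected file paths for one barcode. The helper `expected_outputs` and the example values "output", "barcode01" and "run0" are hypothetical:

```python
from pathlib import Path

# File suffixes taken from the list of ARTIC outputs above.
ARTIC_SUFFIXES = [
    ".rg.primertrimmed.bam",
    ".trimmed.bam",
    ".merged.vcf",
    ".pass.vcf.gz",
    ".fail.vcf",
    ".primers.vcf",
    ".consensus.fasta",
]

def expected_outputs(output_folder, barcode, run_name):
    """Return the paths of the ARTIC output files expected for one barcode."""
    base = Path(output_folder) / "analysis" / "artic" / barcode
    return [base / f"{run_name}{suffix}" for suffix in ARTIC_SUFFIXES]

# Hypothetical values, for illustration only.
for path in expected_outputs("output", "barcode01", "run0"):
    print(path)
```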

Consensus sequences

The ARTIC workflow does not collate the consensus sequences from each barcode. To do this, run the code cell below. You will be given the opportunity to provide meaningful names for each sample if desired. These sequences can be uploaded to Nextclade for further analysis, or submitted to GISAID.
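As a minimal sketch of what collation involves, the Python below concatenates per-barcode consensus sequences into one FASTA text and optionally renames each record. It is not the notebook's cell: `collate_consensus` is a hypothetical helper, it works on FASTA text already read from each consensus file, and it assumes one sequence per file:

```python
def collate_consensus(fasta_texts, sample_names=None):
    """Combine per-barcode consensus FASTA texts into one FASTA string.

    fasta_texts:  {barcode: FASTA text}, assumed to hold one record each.
    sample_names: optional {barcode: name}; barcodes without an entry
                  keep the barcode as the record name.
    """
    sample_names = sample_names or {}
    out = []
    for barcode in sorted(fasta_texts):
        name = sample_names.get(barcode, barcode)
        for line in fasta_texts[barcode].splitlines():
            if line.startswith(">"):
                out.append(f">{name}")      # replace the original header
            elif line.strip():
                out.append(line.strip())    # keep sequence lines, drop blanks
    return "\n".join(out) + "\n"

texts = {
    "barcode02": ">original_header_2\nACGT\nNNNN\n",
    "barcode01": ">original_header_1\nGGCC\n",
}
print(collate_consensus(texts, {"barcode01": "patient_A"}), end="")
```

Renaming at this stage means the names carried into any later upload are the sample names rather than barcode identifiers.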

ARTIC Analysis Status (success/failed)

This check reports how many samples passed or failed to produce results from the primary ARTIC analysis.
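One way to sketch such a check in Python is shown below. It treats a barcode as successful when a non-empty consensus file exists in its output folder; this criterion is an assumption made for illustration and may differ from the test the notebook applies. The helper names are hypothetical:

```python
from collections import Counter
from pathlib import Path

def artic_status(output_folder, run_name):
    """Return {barcode: "success" | "failed"} for every barcode folder.

    Assumed criterion: success means a non-empty
    <run_name>.consensus.fasta exists in the barcode folder.
    """
    artic_dir = Path(output_folder) / "analysis" / "artic"
    status = {}
    for barcode_dir in sorted(p for p in artic_dir.iterdir() if p.is_dir()):
        consensus = barcode_dir / f"{run_name}.consensus.fasta"
        ok = consensus.is_file() and consensus.stat().st_size > 0
        status[barcode_dir.name] = "success" if ok else "failed"
    return status

def count_status(status):
    """Count how many barcodes succeeded and how many failed."""
    return Counter(status.values())
```

Calling `count_status(artic_status(output_folder, run_name))` gives the number of successful and failed barcodes in one step.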

Brief summary of results

Running the cell below will produce a simple tabular summary for each barcoded dataset.

QC Summary of ARTIC pipeline results

The results of the ARTIC pipeline include alignment of the reads to a reference genome. A summary of these alignments is produced by the section below. Things to look for here include even coverage of amplicons and that the negative control sample shows little to no data.

With the summary data collated we can plot coverage histograms for all barcoded samples, across primer pools or by read orientation. Use the tabs to switch between views.

For adequate variant calling, depth should be at least 30X in every region.
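The 30X criterion can be checked programmatically. The Python sketch below scans a list of per-position depths and reports the intervals that fall below the threshold; `low_depth_regions` is a hypothetical helper and the depth values are toy data:

```python
def low_depth_regions(depths, min_depth=30):
    """Return 0-based, half-open (start, end) intervals where depth < min_depth."""
    regions, start = [], None
    for pos, depth in enumerate(depths):
        if depth < min_depth and start is None:
            start = pos                      # a low-depth run begins here
        elif depth >= min_depth and start is not None:
            regions.append((start, pos))     # the run ended at the previous position
            start = None
    if start is not None:                    # run extends to the end of the genome
        regions.append((start, len(depths)))
    return regions

depths = [50, 45, 12, 8, 31, 30, 29, 0]
print(low_depth_regions(depths))  # prints [(2, 4), (6, 8)]
```

A depth of exactly 30 counts as adequate here, matching the "at least 30X" wording.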

To better display all possible data, the depth axes of the plots below are not tied between plots for different samples. Care should be taken in comparing depth across samples.