SARS-CoV-2 Analysis Workflow

This tutorial implements the best-practices bioinformatics workflow for the assembly of SARS-CoV-2 viral genomes. The workflow in this document implements the ARTIC Nanopore bioinformatics SOP.

Computational requirements for this tutorial include:

Getting Started

Before anything else, we will create and set a working directory:
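A cell of this kind typically runs a few shell commands like those below (the directory name is an illustrative assumption, not the one used by the notebook):

```shell
# Create a working directory for the tutorial and move into it.
# The path "artic_tutorial" is an illustrative assumption; adjust to taste.
mkdir -p artic_tutorial
cd artic_tutorial
pwd
```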

Install additional software

The default EPI2ME Labs environment does not contain the ARTIC software. In this section we will prepare the environment with the necessary ARTIC software installation.

Please note that the software installed is not persistent, and this step will need to be re-run if you stop and restart the EPI2ME Labs server.

Having connected to the EPI2ME Labs server, we will install the necessary software. Press the play button to the left-hand side of the cell below (it may appear as [ ]):
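The installation performed by that cell can be approximated by a single package-manager command. This is a sketch only, assuming a conda/mamba-based environment with the bioconda channel available; the notebook's own install cell is the authoritative version.

```shell
# Sketch: install the ARTIC pipeline from bioconda into the active environment.
# Assumes mamba (or conda) is available on the notebook server.
mamba install -y -c bioconda -c conda-forge artic
```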

Sample Data

This tutorial is provided with a sample dataset. Samples included in the demonstration dataset were obtained from European Nucleotide Archive project PRJNA650037. This project has the title Johns Hopkins Viral Genomics of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and describes 210 virus samples that have been sequenced according to the ARTIC protocol on a GridION device. Ten samples with unique barcodes in the range 1..12 were picked from this project for this demonstration dataset.

To download the sample file we run the Linux command wget. To execute the command, click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the value of the input_file variable below. To find the correct full path of a file you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:


The location shared with the EPI2ME Labs server from your computer will show as /epi2melabs. For example, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when the /data folder is the one shared.
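The translation between the two views of the filesystem is a simple prefix swap, sketched below with sed (the example paths are those from the paragraph above; the shared folder on your system may differ):

```shell
# Map a path on the host computer to its location inside the EPI2ME Labs server,
# assuming the /data folder is the one shared (it appears as /epi2melabs).
host_path="/data/my_gridion_run/fastq_pass"
labs_path=$(printf '%s\n' "$host_path" | sed 's|^/data|/epi2melabs|')
echo "$labs_path"    # /epi2melabs/my_gridion_run/fastq_pass
```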

Data Entry

The workflow requires .fastq files from an Oxford Nanopore Technologies' sequencing device. The workflow will, by default, demultiplex the reads; this step can be skipped using the options below if the reads have already been demultiplexed by MinKNOW or Guppy.

The input folder should be a folder containing one or more .fastq files, such as the fastq_pass folder output by MinKNOW (or Guppy).

The form below can be used to change the analysis parameters from their defaults.

If your input directory contains reads which have already been demultiplexed, select the "Skip demultiplexing" option. You should, however, still select an appropriate barcode arrangements setting (either 12/24- or 96-barcodes).

The default minimum and maximum read lengths are appropriate for the ARTIC 400mer amplicon sets; experiments using an alternative amplicon scheme (e.g. the V1200 scheme) should change these values.
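The effect of the length filter can be sketched in plain shell: the snippet below keeps only FASTQ records whose sequence length falls inside the chosen window. The 400-700 bounds and the toy reads are illustrative assumptions; the workflow itself performs the real filtering.

```shell
# Build a tiny synthetic FASTQ: one 450 b read (in range) and one 100 b read (too short).
printf '@read1\n%s\n+\n%s\n' "$(printf 'A%.0s' $(seq 450))" "$(printf 'I%.0s' $(seq 450))"  > toy.fastq
printf '@read2\n%s\n+\n%s\n' "$(printf 'A%.0s' $(seq 100))" "$(printf 'I%.0s' $(seq 100))" >> toy.fastq

# Keep 4-line FASTQ records whose sequence length is within [min_len, max_len].
awk -v min_len=400 -v max_len=700 \
    'NR % 4 == 2 { keep = (length($0) >= min_len && length($0) <= max_len) }
     { buf = buf $0 "\n" }
     NR % 4 == 0 { if (keep) printf "%s", buf; buf = "" }' toy.fastq > filtered.fastq

grep -c '^@read' filtered.fastq    # 1
```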

Analysis

With our software environment set up and our inputs specified, it is time to move on to the analysis. The following workflow begins with demultiplexing the reads using strict settings to avoid misidentification, followed by running of the ARTIC analysis, and finally creation of quality control plots to determine if the analysis is valid.

Demultiplexing and Read Quality Control

In this section we will run sample demultiplexing using the guppy_barcoder software. The results will appear in the demultiplex folder under the output folder. After demultiplexing, a report is produced from the demultiplexed .fastq data.

In the case of pre-demultiplexed reads, guppy_barcoder is skipped and the input folder is searched for the outputs of previous demultiplexing.

The demultiplexing produces a summary file recording the barcode found in each read, or "unclassified" if the barcode could not be confidently identified.
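A quick way to inspect such a summary is to tally its barcode column. The sketch below builds a tiny synthetic stand-in for the summary file (the real file produced by guppy_barcoder has many more columns; the column name and position here are assumptions):

```shell
# Synthetic two-column stand-in for a barcoding summary file.
printf 'read_id\tbarcode_arrangement\n' > barcoding_summary.txt
printf 'r1\tbarcode01\nr2\tbarcode01\nr3\tbarcode02\nr4\tunclassified\n' >> barcoding_summary.txt

# Count reads per barcode (skipping the header line), most abundant first.
tail -n +2 barcoding_summary.txt | cut -f2 | sort | uniq -c | sort -rn
```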

To generate a QC report, run the cell below:

Run ARTIC for each sample

With demultiplexed reads, we are in a position to analyse each dataset independently using the ARTIC workflow.

The ARTIC workflow produces the following files for each barcode (<run_name> is the value given at the top of this page):

  1. <run_name>.rg.primertrimmed.bam - BAM file for visualisation after primer-binding site trimming
  2. <run_name>.trimmed.bam - BAM file with the primers left on (used in variant calling)
  3. <run_name>.merged.vcf - all detected variants in VCF format
  4. <run_name>.pass.vcf.gz - detected variants in VCF format passing quality filter
  5. <run_name>.fail.vcf - detected variants in VCF format failing quality filter
  6. <run_name>.primers.vcf - detected variants falling in primer-binding regions
  7. <run_name>.consensus.fasta - consensus sequence

These will be present in folders named as:

<output_folder>/analysis/artic/<barcode>/

where <output_folder> is the value given at the top of this page and <barcode> is the identified barcode for each dataset.
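A small loop can confirm that the expected outputs are present for each barcode. The sketch below first creates a mock output tree so it runs anywhere; in a real run you would point it at your own output folder (the folder and run names here are illustrative assumptions).

```shell
# Mock output tree mirroring <output_folder>/analysis/artic/<barcode>/ (illustrative).
output_folder="artic_demo_output"
run_name="run1"
mkdir -p "$output_folder/analysis/artic/barcode01"
for suffix in rg.primertrimmed.bam trimmed.bam merged.vcf pass.vcf.gz fail.vcf primers.vcf consensus.fasta; do
    touch "$output_folder/analysis/artic/barcode01/$run_name.$suffix"
done

# Report how many of the seven expected files exist per barcode folder.
for dir in "$output_folder"/analysis/artic/*/; do
    n=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
    echo "$(basename "$dir"): $n of 7 expected files"
done
```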

Brief summary of results

Running the cell below will produce a simple tabular summary for each barcoded dataset.

QC Summary of ARTIC pipeline results

The results of the ARTIC pipeline include alignments of the reads to a reference genome. A summary of these alignments is produced by the section below. Things to look for here include even coverage across amplicons, and little to no data in the negative control sample.
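The negative-control check can be sketched on a per-position depth table (chromosome, position, depth, as produced by tools such as samtools depth). The file names and numbers below are synthetic; a real run would read depth files derived from the pipeline's alignments.

```shell
# Synthetic per-position depth tables for a sample and a negative control.
printf 'MN908947.3\t1\t120\nMN908947.3\t2\t118\nMN908947.3\t3\t122\n' > barcode01.depth.txt
printf 'MN908947.3\t1\t2\nMN908947.3\t2\t0\nMN908947.3\t3\t1\n' > negative_control.depth.txt

# Mean depth per sample: the negative control should sit near zero.
for f in barcode01.depth.txt negative_control.depth.txt; do
    awk -v name="$f" '{ sum += $3 } END { printf "%s\tmean depth %.1f\n", name, sum / NR }' "$f"
done
```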

With the summary data collated we can plot coverage histograms for all barcoded samples, and across primer pools:

It can also be interesting to examine the basecalling quality for the different samples and primer pools; this can indicate potential problems in the sequencing library preparation.