SARS-CoV-2 Midnight Analysis

By Matt Parker
Published in How Tos
January 24, 2022
6 min read

Introduction

I wanted to write a blog post about SARS-CoV-2 sequencing data analysis to achieve two things:

  1. Clarify the changes needed to the underlying ARTIC analysis code to accommodate Midnight, and
  2. Discuss lineage and clade assignment and how we keep up to date with new versions of these.

Oxford Nanopore Technologies Midnight

Sequencing of SARS-CoV-2 has been pivotal in the ongoing pandemic, from the rapid publication of the original sequence from Wuhan to ongoing surveillance to identify and track the emergence of variants of the virus that could be a public health concern.

There are many methods to sequence the genome of SARS-CoV-2, but one of the most popular remains the ARTIC Network protocol. This protocol relies on 98 overlapping ~400bp amplicons split into two pools. This enables full genome coverage even in situations where viral RNA might be more degraded or present in low copy numbers.

The amended Midnight protocol uses longer 1200bp amplicons, first proposed in a publication by Freed et al. These primers were designed using primalscheme by Quick et al.

In addition to the amplicon length, the key difference between the Midnight protocol and the original ARTIC protocol with Nanopore sequencing is the library preparation method. ARTIC original uses the ONT ligation sequencing kit, so all reads produced are equivalent to the amplicon length. Midnight, however, uses the rapid library preparation chemistry, which improves turnaround time, but because of the transposase tagmentation employed in this method, read lengths are less than or equal to the intact amplicon length.


1. Changes to ARTIC Bioinformatics Analysis

The ARTIC bioinformatics analysis workflow is globally recognised as the gold standard for the processing of ARTIC tiled amplicon SARS-CoV-2 genomes. The SOP can be found here.

Because of the differences between the original ARTIC method and Midnight, amendments were made to the underlying assumptions in the ARTIC FieldBioinformatics package used to analyse data generated by tiled amplicon sequencing of SARS-CoV-2, and so our wf-artic Nextflow workflow was born.

1a. Read lengths are longer (and SHORTER!)

This might seem like an obvious point, but because the Midnight amplicons are longer than the standard ARTIC amplicons, and fragmented amplicons are also present, we need to adjust the read length cut-offs used to filter reads. The standard ARTIC bioinformatics SOP recommends using reads >=400bp and <=700bp.

It is obvious that we need to include longer reads to account for the increased amplicon size, but because of the library preparation method we also need to allow for the presence of shorter reads. Therefore we use reads >=150bp and <=1200bp.
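As an illustration, the standalone ARTIC SOP performs this length filtering with artic guppyplex; with Midnight-style cut-offs the call might look like the following (the input directory and output filename are placeholders):

artic guppyplex --min-length 150 --max-length 1200 --directory fastq_pass/barcode01 --output barcode01.fastq

wf-artic applies equivalent length limits internally, so this command is shown only to make the cut-offs concrete.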

1b. Read length is not amplicon length

In addition, due to the tagmentation library preparation in Midnight, read lengths will no longer always be the same as the length of the intact amplicons defined by primer pairs. Steps in the underlying analysis code for ARTIC original data assume that the read length will equal that of the amplicon. We have therefore made changes to this code, under guidance from the original authors, to allow for shorter read lengths. These changes can be summarised as:

  • The code no longer requires that reads fully span an amplicon region,
  • The code tags a read as belonging to an amplicon region simply by largest overlap,
  • The read selection code to achieve desired coverage was rewritten to account for incomplete amplicons, whilst retaining the longest reads.

Changes to ARTIC FieldBioinformatics

Changes have been applied to the align_trim.py Python program in the ARTIC network FieldBioinformatics package:

  • For each read we find the amplicon from which it originated by selecting the amplicon with the largest overlap with the read; we also find the next closest match.
  • We discard the read if the overlap with the next closest match is a large proportion of the mutual overlap of the two amplicons. This is a guard against chimeric reads, arising either from library preparation or from faults in the sequencing platform control software.
  • (There is an option to only allow those reads that extend across the whole amplicon; if set to true, we check whether the alignment extends to the primer at each end, i.e. a “correctly paired” read.)
  • To normalise coverage we take the passing reads, sort them by the amount of coverage they provide, and take the first n reads.

The modified align_trim.py program can be found here.
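To make the overlap-based assignment and the normalisation concrete, here is a minimal Python sketch. This is not the actual align_trim.py code: the names are illustrative, the chimera guard is simplified (it compares the second-best overlap to the best, rather than to the mutual overlap of the two amplicons), and the normalisation ignores incremental coverage.

# Illustrative sketch only, not the real align_trim.py implementation.

def overlap(read_start, read_end, amp_start, amp_end):
    # Length of the intersection between a read alignment and an amplicon.
    return max(0, min(read_end, amp_end) - max(read_start, amp_start))

def assign_amplicon(read_start, read_end, amplicons, max_second_fraction=0.5):
    # amplicons: non-empty list of (name, start, end) tuples.
    # Returns the name of the amplicon with the largest overlap, or None if
    # the read touches no amplicon or the assignment is ambiguous
    # (possibly chimeric).
    scored = sorted(
        ((overlap(read_start, read_end, start, end), name)
         for name, start, end in amplicons),
        reverse=True,
    )
    best_ov, best_name = scored[0]
    if best_ov == 0:
        return None
    if len(scored) > 1 and scored[1][0] / best_ov > max_second_fraction:
        return None
    return best_name

def normalise(read_spans, n=200):
    # Keep the n reads that each cover the most bases, longest first.
    return sorted(read_spans, key=lambda span: span[1] - span[0], reverse=True)[:n]

For example, assign_amplicon(100, 900, [("amp1", 30, 1230), ("amp2", 1100, 2300)]) returns "amp1": the read overlaps amp1 by 800 bases and amp2 not at all.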

Other points of note

  • Like ARTIC original, we downsample reads, using only 200 in each direction for each amplicon. Reads above this coverage threshold are discarded.
  • Twenty reads covering a position are required for a mutation call.
  • wf-artic uses medaka for variant calling, whereas the default in ARTIC original is nanopolish (although ARTIC original also offers an option to use the faster medaka). Using medaka also negates the need for fast5 files.

2. Lineage and clade assignment

For many users, the classification of the SARS-CoV-2 sample being sequenced into a clade or lineage is often the primary end point of the analysis workflows described here. These classifications help us put the sequence into the context of the global pandemic and create a shared language we can use when discussing the sample. Further, the identification of genomic changes which differ from the definitions of these clades and lineages might help the identification of important changes that could help define new clades and lineages. We realise that timely updates to lineage calling tools and the data they use are an important consideration for those analysing SARS-CoV-2 sequence data.

There are some excellent publications and blog posts which discuss lineages and clades that you may wish to read.

The Problem

The rapid generation of sequencing data and the emergence of variants of SARS-CoV-2 with new constellations of mutations requires that the data underlying the tools used to classify a SARS-CoV-2 sequence into a clade or lineage are in a constant state of flux. We must therefore balance rapid releases, resources, and ensuring the most important clades/lineages are identified by our workflow. The most important clades/lineages are those that have been deemed Variants of Concern (VOCs) or Variants Under Investigation (VUIs) by WHO or UKHSA as often our users are sequencing to inform public health decisions in the field.

We have opted for a model that we think helps satisfy those users who have no or intermittent internet access, or who for reasons of security have no internet access on certain facets of their computing infrastructure.

We have automated processes that run daily on our continuous integration servers to check our analysis software images hosted on Dockerhub (Pangolin and Nextclade) against the latest versions from their authors; these images are then kept up to date automatically. Users of our wf-artic Nextflow workflow can specify the version of Pangolin (--pangolin_version) or Nextclade (--nextclade_version) to use when they run wf-artic, but the version you select must be available from our Dockerhub registry. These are static images and (usually) are never updated again.
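For example, a run pinning both tool versions might look like the following (the fastq path and version tags are placeholders; check our Dockerhub registry for the tags that actually exist):

nextflow run epi2me-labs/wf-artic --fastq fastq_pass --pangolin_version 3.1.17 --nextclade_version 1.10.0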

Nextclade

Data used by Nextclade to determine the clade to which your SARS-CoV-2 sample belongs is provided in a GitHub repository: https://github.com/nextstrain/nextclade_data. This repository also contains data for other viruses, so you need to navigate to data/datasets/sars-cov-2/references/MN908947/versions to see the data packages available. These are helpfully organised by date and time. We maintain a copy of this data in the wf-artic repository in data/nextclade. If no --nextclade_data_tag (e.g. 2021-12-16T20:57:35Z) is specified at wf-artic run time, then the most recent tag contained within our repository will be used. You may specify any tag that we have in the data/nextclade directory.

If you want the absolute latest version, specifying --update_data at runtime will download the latest Nextclade dataset with the command nextclade dataset get before Nextclade is executed.

If you also specify --nextclade_data_tag, then that version will be downloaded by nextclade dataset get.
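Under the hood this corresponds to a Nextclade CLI call along these lines (the exact invocation inside wf-artic may differ; the output directory is a placeholder):

nextclade dataset get --name sars-cov-2 --tag 2021-12-16T20:57:35Z --output-dir nextclade_dataset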

Pangolin

In general, the data available to Pangolin is determined at the time when our continuous integration systems build our Docker analysis images. We don’t release a new Docker container image unless the version of Pangolin itself is increased. But, as with Nextclade, the data can be updated when we run wf-artic.

Pangolin data updates are organised slightly differently to Nextclade. Pangolin can update both itself and the data files it uses with the command pangolin --update. If you specify --update_data at runtime, the update will be executed before the lineage assignment takes place and you will run the latest version of this lineage classification tool.

Again, you can specify the version of Pangolin you would like to run within wf-artic, as long as we have a Docker image in our Dockerhub registry, by specifying --pangolin_version. If you also specify --update_data, then the data used by Pangolin will also be upgraded to the latest available at runtime.
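As an illustration, such a run might look like this (the fastq path is a placeholder):

nextflow run epi2me-labs/wf-artic --fastq fastq_pass --pangolin_version 3.1.17 --update_data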

Advanced: Manually updating a local Pangolin or Nextclade docker container image

If you would like to update the Pangolin docker container used by wf-artic, follow the instructions below:

docker run ontresearch/pangolin:3.1.17 pangolin --update

Your output should look something like this:

pangolin already latest release (v3.1.17)
pangolearn updated to 2022-01-20
constellations already latest release (v0.1.1)
scorpio already latest release (v0.3.16)
pango-designation already latest release (v1.2.123)

Now note the image identifier of the container just fetched and run:

docker ps -a

Your output should look like this:

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
068f67cb118e ontresearch/pangolin:3.1.17 "pangolin --update" About a minute ago Exited (0) 29 seconds ago keen_poincare

and commit your update:

docker commit <CONTAINER_ID> ontresearch/pangolin:3.1.17-updated

Where <CONTAINER_ID> = 068f67cb118e in this case.

Then when you next run wf-artic specify --pangolin_version 3.1.17-updated at runtime and it will use the local container you just created.

You can follow a similar procedure to upgrade the Nextclade data in the Nextclade docker container image.
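For example, assuming the Nextclade image follows the same naming convention (the image tag, dataset output path, and container ID below are all illustrative; inspect the image to confirm where the workflow expects the dataset to live before committing):

docker run ontresearch/nextclade:<TAG> nextclade dataset get --name sars-cov-2 --output-dir /path/to/nextclade/data

docker ps -a

docker commit <CONTAINER_ID> ontresearch/nextclade:<TAG>-updated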

