I wanted to write a blog post about SARS-CoV-2 sequencing data analysis to achieve two things:
Sequencing of SARS-CoV-2 has been pivotal in the ongoing pandemic, from the rapid publication of the original sequence from Wuhan to the ongoing surveillance that identifies and tracks the emergence of variants of the virus that could be a public health concern.
There are many methods to sequence the genome of SARS-CoV-2, but one of the most popular remains the ARTIC Network protocol. This protocol relies on 98 overlapping ~400bp amplicons split into two pools, which enables full genome coverage even when viral RNA is degraded or present at a low copy number.
In addition to the amplicon length, the key difference between the Midnight protocol and the original ARTIC protocol with Nanopore sequencing is the library preparation method. The original ARTIC protocol used the ONT ligation sequencing kit, so all reads produced are equivalent to the amplicon length. Midnight, however, uses the rapid library preparation chemistry. This improves turnaround time, but because of the transposase tagmentation employed in this method, read lengths are less than or equal to the intact amplicon length.
The ARTIC bioinformatics analysis workflow is globally recognised as the gold standard for the processing of ARTIC tiled amplicon SARS-CoV-2 genomes. The SOP can be found here.
Because of the differences between the original ARTIC method and Midnight, amendments were made to the underlying assumptions in the ARTIC FieldBioinformatics package used to analyse data generated by tiled amplicon sequencing of SARS-CoV-2, and so our wf-artic Nextflow workflow was born.
This might seem like an obvious point, but because the Midnight amplicons are longer than the standard ARTIC amplicons, and because tagmentation fragments some of them, we need to adjust the read length cut-offs used to filter reads. The standard ARTIC bioinformatics SOP recommends using reads >=400bp and <=700bp.
We need to include longer reads to account for the increased amplicon size, but because of the library preparation method we also need to allow for shorter reads. We therefore use reads >=150bp and <=1200bp.
In addition, due to the tagmentation library preparation in Midnight, read lengths will no longer always be the same as the length of the intact amplicons defined by primer pairs. Steps in the underlying analysis code for original ARTIC data assume that the read length will equal that of the amplicon. We have therefore made changes to this code, under guidance from the original authors, to allow for shorter read lengths. These changes can be summarised as:
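To make the two sets of cut-offs concrete, here is a minimal Python sketch of an inclusive length filter. It is not the filtering code used by wf-artic or the ARTIC SOP; the function name, the constants, and the toy reads are made up for illustration, and only the numeric bounds come from the text above.

```python
from typing import Iterable, Iterator, Tuple

# Inclusive length bounds in base pairs, as quoted in the text.
ARTIC_BOUNDS = (400, 700)      # standard ARTIC bioinformatics SOP
MIDNIGHT_BOUNDS = (150, 1200)  # Midnight with rapid library preparation


def filter_by_length(
    reads: Iterable[Tuple[str, str]],
    bounds: Tuple[int, int] = MIDNIGHT_BOUNDS,
) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs whose length lies within the inclusive bounds."""
    shortest, longest = bounds
    for name, seq in reads:
        if shortest <= len(seq) <= longest:
            yield name, seq


# Toy reads (hypothetical): only their lengths matter here.
reads = [
    ("fragment", "A" * 100),
    ("tagmented", "A" * 300),
    ("artic_like", "A" * 500),
    ("intact", "A" * 1200),
    ("too_long", "A" * 2500),
]

kept_midnight = [name for name, _ in filter_by_length(reads)]
kept_artic = [name for name, _ in filter_by_length(reads, ARTIC_BOUNDS)]
print(kept_midnight)  # ['tagmented', 'artic_like', 'intact']
print(kept_artic)     # ['artic_like']
```

With the standard ARTIC bounds, both the tagmented 300bp read and the intact 1200bp Midnight amplicon would be discarded, which is why the wider window is needed.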
Changes have been applied to the align_trim.py Python program in the ARTIC network FieldBioinformatics package:
align_trim.py program can be found here.
wf-artic uses medaka, whereas the original ARTIC workflow defaults to nanopolish (although it does offer the faster medaka as an option). Using medaka also removes the need for fast5 files.
For many users, classifying the SARS-CoV-2 sample being sequenced into a clade or lineage is the primary end point of the analysis workflows described here. These classifications help us put the sequence into the context of the global pandemic and create a shared language we can use when discussing the sample. Further, identifying genomic changes that differ from the definitions of these clades and lineages might help flag important changes that could define new clades and lineages. We realise that timely updates to lineage calling tools and the data they use are an important consideration for those analysing SARS-CoV-2 sequence data.
There are some excellent publications and blog posts which discuss lineages and clades that you may wish to read, including:
The rapid generation of sequencing data and the emergence of variants of SARS-CoV-2 with new constellations of mutations mean that the data underlying the tools used to classify a SARS-CoV-2 sequence into a clade or lineage are in a constant state of flux. We must therefore balance rapid releases, resources, and ensuring the most important clades/lineages are identified by our workflow. The most important clades/lineages are those that have been deemed Variants of Concern (VOCs) or Variants Under Investigation (VUIs) by WHO or UKHSA, as our users are often sequencing to inform public health decisions in the field.
We have opted for a model that we think helps satisfy those users who have no or intermittent internet access, or who for reasons of security have no internet access on certain parts of their computing infrastructure.
We have automated processes that run daily on our continuous integration servers that check our analysis software images hosted on Dockerhub (Pangolin and Nextclade) for the latest versions from their authors. These images are then kept up to date automatically. Users of our wf-artic Nextflow workflow can specify the version of Pangolin (--pangolin_version) or Nextclade (--nextclade_version) to use when they run wf-artic, but the version you select must be available from our Dockerhub registry. These are static images and (usually) are never updated again.
Data used by Nextclade to determine the clade to which your SARS-CoV-2 sample belongs is provided in a GitHub repository: https://github.com/nextstrain/nextclade_data. This repository also contains data for other viruses, so you need to navigate to data/datasets/sars-cov-2/references/MN908947/versions to see the data packages available. These are helpfully organised by date and time. We maintain a copy of this data in the wf-artic repository in data/nextclade. If no --nextclade_data_tag (e.g. 2021-12-16T20:57:35Z) is specified at wf-artic run time then the most recent version contained within our repository will be used. You may specify any tag that we have in the
If you want the absolute latest version, specifying --update_data at runtime will download the latest Nextclade dataset with the command nextclade dataset get before Nextclade is executed. If you also specify --nextclade_data_tag, then that version will be downloaded by nextclade dataset get.
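The "most recent tag unless one is requested" rule can be sketched in a few lines of Python. This is only an illustration of the selection logic described above, not the code wf-artic runs: the function names are hypothetical, and apart from 2021-12-16T20:57:35Z the tags in the example are placeholders.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def parse_tag(tag: str) -> datetime:
    """Parse a dataset tag such as '2021-12-16T20:57:35Z' into a UTC datetime."""
    parsed = datetime.strptime(tag, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def choose_tag(available: Iterable[str], requested: Optional[str] = None) -> str:
    """Return the requested tag if it is bundled, else the most recent bundled tag."""
    tags = list(available)
    if requested is not None:
        if requested not in tags:
            raise ValueError(f"tag {requested!r} is not bundled locally")
        return requested
    if not tags:
        raise ValueError("no dataset tags available")
    # Compare by parsed timestamp rather than by raw string.
    return max(tags, key=parse_tag)


# Placeholder list standing in for the tags bundled under data/nextclade.
bundled = [
    "2021-10-11T19:00:32Z",
    "2021-12-16T20:57:35Z",
    "2021-11-27T11:53:21Z",
]
print(choose_tag(bundled))                          # 2021-12-16T20:57:35Z
print(choose_tag(bundled, "2021-10-11T19:00:32Z"))  # 2021-10-11T19:00:32Z
```

A tag that is not bundled raises an error in this sketch; in the workflow itself, that is the situation where --update_data would be used to fetch the data at runtime.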
In general the data available to Pangolin is determined at the time when our continuous integration systems build our docker analysis images. We don’t release a new docker container image unless the version of Pangolin itself is increased. But like Nextclade we can update when we run
Pangolin data updates are organised slightly differently to Nextclade. Pangolin can update both itself and the data files it uses with the command pangolin --update. If you specify --update_data at runtime, the update will be executed before the lineage assignment takes place and you will run the latest version of this lineage classification tool.
Again, you can specify the version of Pangolin you would like to run within wf-artic with --pangolin_version, as long as we have a docker image for it in our Dockerhub registry. If you also specify --update_data then the data used by Pangolin will also be upgraded to the latest available at runtime.
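To show how these options combine on one command line, here is a small Python helper that only assembles the argument list and does not run anything. The options --pangolin_version, --nextclade_version, --nextclade_data_tag and --update_data are the ones discussed above; the project name epi2me-labs/wf-artic, the --fastq option, and the reads/ path are assumptions made for the sake of the example, so check the workflow's own documentation before relying on them.

```python
import shlex
from typing import List, Optional


def build_wf_artic_command(
    fastq: str,
    pangolin_version: Optional[str] = None,
    nextclade_version: Optional[str] = None,
    nextclade_data_tag: Optional[str] = None,
    update_data: bool = False,
) -> List[str]:
    """Assemble (but do not execute) a wf-artic command line."""
    # Assumption: project name and --fastq are not taken from this post.
    cmd = ["nextflow", "run", "epi2me-labs/wf-artic", "--fastq", fastq]
    if pangolin_version:
        cmd += ["--pangolin_version", pangolin_version]
    if nextclade_version:
        cmd += ["--nextclade_version", nextclade_version]
    if nextclade_data_tag:
        cmd += ["--nextclade_data_tag", nextclade_data_tag]
    if update_data:
        # Bare flag: asks for Pangolin/Nextclade data to be refreshed at runtime.
        cmd.append("--update_data")
    return cmd


cmd = build_wf_artic_command("reads/", pangolin_version="3.1.17", update_data=True)
print(shlex.join(cmd))
# nextflow run epi2me-labs/wf-artic --fastq reads/ --pangolin_version 3.1.17 --update_data
```

Pinning the version while also passing --update_data corresponds to the last case described above: a fixed Pangolin image whose data are refreshed when the workflow runs.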
If you would like to update the Pangolin docker container used by wf-artic, follow the instructions below:
docker run ontresearch/pangolin:3.1.17 pangolin --update
Your output should look something like this:
pangolin already latest release (v3.1.17)
pangolearn updated to 2022-01-20
constellations already latest release (v0.1.1)
scorpio already latest release (v0.3.16)
pango-designation already latest release (v1.2.123)
Now note the ID of the container that was just fetched and run:
docker ps -a
Your output should look like this:
CONTAINER ID   IMAGE                         COMMAND               CREATED              STATUS                      PORTS   NAMES
068f67cb118e   ontresearch/pangolin:3.1.17   "pangolin --update"   About a minute ago   Exited (0) 29 seconds ago           keen_poincare
Then commit your update:
docker commit <CONTAINER_ID> ontresearch/pangolin:3.1.17-updated
Here <CONTAINER_ID> is 068f67cb118e in this case.
Then, when you next run wf-artic, specify --pangolin_version 3.1.17-updated at runtime and it will use the local image you just created.
You can follow a similar procedure to upgrade the Nextclade data in the Nextclade docker container image.
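Before committing an updated container, it can be handy to check which components the update actually changed. The sketch below parses a report in the format of the sample pangolin --update output shown earlier. The line format is inferred from that one example only, so other Pangolin versions may print something different, and the function name is hypothetical.

```python
import re
from typing import Dict

# Matches lines such as:
#   "pangolin already latest release (v3.1.17)"
#   "pangolearn updated to 2022-01-20"
LINE_RE = re.compile(
    r"^(?P<name>\S+) "
    r"(?:already latest release \((?P<same>[^)]+)\)|updated to (?P<new>\S+))$"
)


def parse_update_report(text: str) -> Dict[str, dict]:
    """Map each component name to its reported version and whether it changed."""
    report = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not in the expected format
        updated = match.group("new") is not None
        version = match.group("new") if updated else match.group("same")
        report[match.group("name")] = {"version": version, "updated": updated}
    return report


sample = """\
pangolin already latest release (v3.1.17)
pangolearn updated to 2022-01-20
constellations already latest release (v0.1.1)
scorpio already latest release (v0.3.16)
pango-designation already latest release (v1.2.123)
"""

report = parse_update_report(sample)
changed = [name for name, info in report.items() if info["updated"]]
print(changed)  # ['pangolearn']
```

If nothing is reported as updated, there is little point in committing a new image tag.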