Metagenomic Classification Tutorial

The metagenomic classification tutorial allows the analysis of a sample analyte containing unknown DNA fragments. The tutorial is intended to address important questions:

Methods used in this tutorial include:

Computational requirements for this tutorial include:

This tutorial does not cover the construction of centrifuge indices. Building custom databases requires a large amount of compute time and memory. For example, building a database that includes the bacteria domain takes on the order of one to three days on a multi-core server and requires more than 500 GB of RAM.

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select Edit > Clear all outputs before using the notebook to analyse your own data.

Introduction

This tutorial aims to demonstrate use of the centrifuge and pavian software packages for the analysis of metagenomic datasets from Oxford Nanopore Technologies' sequencing platforms.

The learning outcomes from this tutorial include:

A sample dataset is provided which can be analysed against a pre-made metagenomic index in under 10 minutes on a GridION device.

Getting started

The workflow below requires a single folder containing .fastq files from an Oxford Nanopore Technologies' sequencing device, or a single such file. Compressed or uncompressed files may be used. In addition the workflow will download a selected metagenomic database.

Before anything else we will create and set a working directory:
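A minimal sketch of this step in Python is shown here. The helper name `set_working_directory` and any directory name passed to it are illustrative assumptions, not part of the tutorial.

```python
import os


def set_working_directory(path):
    """Create `path` if it does not exist, switch into it, and return its absolute path."""
    path = os.path.abspath(path)
    # exist_ok=True makes the cell safe to re-run after the directory has been created.
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    return path
```

For instance, calling `set_working_directory("metagenomics_tutorial")` (a hypothetical folder name) creates that folder under the current directory and makes it the location for all later outputs.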

Install additional software

This tutorial uses a couple of software packages that are not included in the default EPI2ME Labs server. Below we will install software packages that include last and diamond using the conda package manager.

Please note that the installed software is not persistent; this step will need to be re-run if you stop and restart the EPI2ME Labs server.

Sample Data

To demonstrate the workflow below a sample dataset is included with this tutorial. The data comprise an extract of a MinION run using the ZymoBIOMICS microbial mock community.

To download the sample dataset we run the Linux command wget. To execute the command, click on the cell and then press Command/Ctrl-Enter, or click the Play symbol on the left-hand side.

Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the value of the .fastq input variable below. To find the correct full path of a directory, navigate to it in the Files browser on the left-hand side, right-click on it and select Copy path:

image.png

The location shared with the EPI2ME Labs server from your computer will show as /epi2melabs. For example, when the /data folder is shared, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass.
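The path mapping described above can be sketched as a small Python helper. The function name `to_server_path` and its default arguments are assumptions for illustration; the shared folder on your computer may differ from /data.

```python
from pathlib import PurePosixPath


def to_server_path(host_path, shared_folder="/data", mount_point="/epi2melabs"):
    """Translate a path on the host computer to the path seen inside the server."""
    # relative_to raises ValueError when host_path is not inside shared_folder,
    # which signals that the file is not visible to the server at all.
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(shared_folder))
    return str(PurePosixPath(mount_point) / relative)
```

With the defaults, `to_server_path("/data/my_gridion_run/fastq_pass")` returns `/epi2melabs/my_gridion_run/fastq_pass`, matching the example in the text.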

Data entry

Having downloaded the sample sequencing data, or located your own data in the file browser, we need to provide the filepaths as input to the notebook. We must also select or download a metagenomic index.

The form can be used to enter the filenames of your inputs.
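Since the input may be either a single file or a folder of files, compressed or uncompressed, a helper along the following lines can resolve it to a list of files. This is a sketch only: the function name and the accepted suffixes (.fastq, .fq and their .gz variants) are assumptions, not something the notebook form is known to enforce.

```python
from pathlib import Path

# Assumed set of file endings; adjust to match how your files are named.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def find_fastq_files(input_path):
    """Return a sorted list of FASTQ files for a single file or a folder."""
    path = Path(input_path)
    if path.is_file():
        candidates = [path]
    elif path.is_dir():
        # Only the top level of the folder is searched, not sub-folders.
        candidates = sorted(p for p in path.iterdir() if p.is_file())
    else:
        raise FileNotFoundError(f"No such file or directory: {path}")
    return [p for p in candidates if p.name.lower().endswith(FASTQ_SUFFIXES)]
```

An empty result is a useful early warning that the path entered in the form points at the wrong location.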

To prepare a database, use the form below. First, use the first dropdown to select whether to (download and) use a pre-made index or to use an index already present on your computer. Having made this selection, fill in the requisite portion of the form before pressing the Enter button.

After running the above, a set of .cf files constituting the metagenomic database will be present in the location indicated. These files can be used to perform single-read classifications using centrifuge.
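To confirm that the index files are in place, a quick check such as the one below can be used. It is a sketch that assumes the files follow the pattern `<index_name>.<number>.cf` within a single folder; the function name and both arguments are hypothetical.

```python
from pathlib import Path


def list_index_files(index_dir, index_name):
    """Return the sorted .cf files in index_dir that belong to the named index."""
    # Assumed naming pattern: <index_name>.1.cf, <index_name>.2.cf, and so on.
    return sorted(Path(index_dir).glob(f"{index_name}.*.cf"))
```

An empty list indicates that the download did not complete or that the index name does not match the files on disk.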

Metagenomic classification

In order to perform metagenomic classification of reads, the section below will use the centrifuge program together with the index selected or created above. We will then view the results of the classification using the pavian viewer.

Running centrifuge

The first step in our analysis is to run centrifuge to classify all reads according to the selected index. Running the command below may take a fair amount of time depending on the compute resources available:
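As a rough sketch, the centrifuge command line can be assembled in Python as follows. The function name, the default thread count and the index prefix passed in are assumptions; the output filenames are those described in the next section. The options used (`-x` for the index prefix, `-U` for a comma-separated list of unpaired read files, `-S` for the classification output, `--report-file` for the summary report and `-p` for threads) are standard centrifuge options.

```python
def build_centrifuge_command(index_prefix, fastq_files, threads=4,
                             classifications="read_classifications.tsv",
                             report="centrifuge_report.tsv"):
    """Build the argument list for a centrifuge run over unpaired reads."""
    # centrifuge accepts several read files as one comma-separated argument.
    reads = ",".join(str(f) for f in fastq_files)
    return [
        "centrifuge",
        "-p", str(threads),        # number of worker threads (assumed default of 4)
        "-x", str(index_prefix),   # index prefix, i.e. the path without .1.cf etc.
        "-U", reads,
        "-S", classifications,     # per-read classifications
        "--report-file", report,   # per-taxon summary
    ]
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the standard library; building it as a list rather than a single string avoids shell quoting problems with unusual filenames.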

Identifying genera

The centrifuge program provides two primary outputs:

  1. read_classifications.tsv: a classification for each input read in terms of its origin.
  2. centrifuge_report.tsv: counts of reads for identified species.

The second of these can be used to identify the most common genera in the sample:
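One simple way to do this, sketched below, is to filter the report for genus-level rows and sort them by read count. The sketch assumes the report is tab-separated with `name`, `taxRank` and `numReads` columns, as in the standard centrifuge report, and it counts only reads assigned directly to a genus row; reads assigned at species level are not rolled up into their genus. The function name `top_genera` is illustrative.

```python
import csv


def top_genera(report_lines, n=10):
    """Return up to n (genus name, read count) pairs, most abundant first."""
    reader = csv.DictReader(report_lines, delimiter="\t")
    genera = [
        (row["name"], int(row["numReads"]))
        for row in reader
        if row["taxRank"] == "genus"
    ]
    # Sort by read count, largest first.
    genera.sort(key=lambda item: item[1], reverse=True)
    return genera[:n]
```

The function accepts any iterable of lines, so an open file handle such as `open("centrifuge_report.tsv")` can be passed directly.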

Viewing results with pavian

The pavian application can be used to visualise the results of a metagenomics classifier, including, amongst other things, producing Sankey diagrams.

In order to use pavian we must first convert the centrifuge report into a different format. This is done with the centrifuge-kreport program:
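A minimal sketch of the conversion, driven from Python, is given here. centrifuge-kreport takes the index prefix via `-x` followed by the per-read classification file and writes the report to standard output, so the sketch redirects that output to a file. The function names and the output filename are assumptions.

```python
import subprocess


def build_kreport_command(index_prefix, classifications="read_classifications.tsv"):
    """Build the argument list for centrifuge-kreport."""
    return ["centrifuge-kreport", "-x", str(index_prefix), classifications]


def write_kreport(index_prefix, classifications, output_path):
    """Run centrifuge-kreport and save its standard output to output_path."""
    with open(output_path, "w") as handle:
        subprocess.run(
            build_kreport_command(index_prefix, classifications),
            stdout=handle,
            check=True,  # raise an error if the program exits unsuccessfully
        )
```

Giving the output a .kraken extension, for example a hypothetical `centrifuge_report.kraken`, makes the file easy to find in the Pavian file browser later.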

The .kraken file is what we must load into the pavian browser. The browser runs a webserver which can be started by first selecting the network port from the cell below (it should match that specified in the EPI2ME Labs Launcher as the Aux. Port),

and then running the code cell below:

To use Pavian, click the link in the output above in the message: Listening on http://0.0.0.0:8889. To stop Pavian, click stop on the code cell above.

Once Pavian is running and you have navigated to it in your web browser, click on the Use data on server tab and then type /epi2melabs in the text entry box:

image.png

Then use the filebrowser to navigate to and select the .kraken report file produced above, and finally click Read selected directories:

image.png

After selecting the dataset, Pavian can be used to explore it. For example, navigating to the Sample tab in the left-hand menu will display a Sankey plot visualising the classifications of reads. Analysis of the sample dataset gives:

image.png

To generate a standalone report click the Generate HTML report... link in the left-hand menu. This will download a self-contained HTML file to your computer which summarises results of the analysis.

When you have finished using Pavian, remember to press stop on the code cell above to stop the Pavian webserver from running.

Summary

This tutorial has stepped through the processes involved in classifying reads from a metagenomic sample and identifying the genera present. We selected a pre-made metagenomic index, either downloaded or already present on the computer, before using centrifuge to perform the classification. The results of the classification were inspected with the pavian viewer.

The analysis presented can be run on any dataset from an Oxford Nanopore Technologies' device. The code will run within the EPI2ME Labs notebook server environment.