Our metagenomics workflow, wf-metagenomics, gives users the ability to unveil the taxonomic composition of their Oxford Nanopore Technologies sequencing data. The workflow offers two different sub-workflows: kraken2 and minimap2. In this blog post we describe how to choose one of the default databases or how to use your custom database.
The wf-metagenomics workflow offers four databases which cover the more general cases in both pipelines. However, we are aware that you could have more specific questions that can be solved using your own custom databases, so we explain here how to run the workflow with them.
Kraken2 is a taxonomic sequence classifier that relies on a k-mer¹ approach to assign taxonomic labels to DNA sequences. It examines the k-mers within a query sequence (the read) and uses this information to query a database. This means that the database must be built in advance to extract the k-mer information of each reference sequence and store it in an efficient format for later querying. You can find out more about the files that comprise a Kraken2 database in the Kraken2 documentation.
We offer a set of databases that cover many common situations, so in most cases you do not need to build one yourself.
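To make the k-mer idea concrete, here is a minimal sketch that lists every k-mer of a short sequence. It only illustrates the concept and is not how Kraken2 implements its lookup. The function name `kmers` and the toy sequence are hypothetical.

```shell
# Print every k-mer (substring of length k) of a sequence, one per line.
# Illustration only: "kmers" is a hypothetical helper, not part of Kraken2.
kmers() {
    sequence=$1
    k=$2
    printf '%s\n' "$sequence" | awk -v k="$k" '{
        # Slide a window of width k along the sequence (awk strings are 1-based).
        for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
    }'
}

# A 6-base read contains 6 - 3 + 1 = 4 k-mers of length 3:
# ACG, CGT, GTA, TAC
kmers ACGTAC 3
```

A read of length n contains n - k + 1 overlapping k-mers, and each of them can be looked up in the pre-built database.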
To analyze archaeal, bacterial and fungal 16S/18S ribosomal RNA genes and ITS, there are two databases available built using data from NCBI.
ncbi_16s_18s: contains 16S ribosomal RNA sequences that correspond to bacteria and archaea type materials and 18S ribosomal RNA Nucleotide sequence records from fungi. This is the default option.
ncbi_16s_18s_28s_ITS: contains 16S ribosomal RNA sequences that correspond to bacteria and archaea type materials, 18S ribosomal RNA Nucleotide sequence records and sequences from the ITS region from fungi.
To change the database, add the corresponding option to your command or select it in the EPI2ME Labs menu (Reference Options > Database set).
nextflow run epi2me-labs/wf-metagenomics --fastq <PATH_TO_FASTQ> --database_set ncbi_16s_18s_28s_ITS
When the data are not limited to 16S/18S/ITS marker genes, we need a database that contains whole genome information. The Kraken2 authors curate a set of pre-built databases. We have selected two of them based on their reasonable size and their coverage of a wide diversity of organisms that can be found in the environment:
PlusPF-8: contains references for archaea, bacteria, viral, plasmid, human, UniVec_Core, protozoa and fungi. To use this database, the memory available to the workflow must be slightly higher than the size of the database index (8 GB).
PlusPFP-8: contains the same references as PlusPF-8 and, additionally, plants. To use this database, the memory available to the workflow must be slightly higher than the size of the database index (8 GB).
Once your custom Kraken2 database is ready, you can provide it to wf-metagenomics using the --database option, which accepts either a .tar.gz file or a directory. Note that the memory available to the workflow must be slightly higher than the size of the database index.
nextflow run epi2me-labs/wf-metagenomics --fastq <PATH_TO_FASTQ> --database <DATABASE>
This can also be done in EPI2ME Labs from the Reference Options > Database option, by pointing to the database folder.
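Before launching the workflow with a custom database, it can help to check the memory requirement mentioned above. The following is a rough sketch, not part of wf-metagenomics: the helper name `db_fits_in_memory` is hypothetical, and it assumes the index is a single file (in a Kraken2 database directory this is typically `hash.k2d`). Because "slightly higher" is not quantified, the sketch only checks that the index is strictly smaller than the memory you pass in.

```shell
# Succeed (exit status 0) if the index file $1 is smaller than $2 bytes.
# Hypothetical helper; the strict comparison is an assumption, since the
# required safety margin is not specified.
db_fits_in_memory() {
    index_file=$1
    mem_bytes=$2
    # wc -c counts bytes; $(( )) strips any padding some wc versions add.
    index_bytes=$(( $(wc -c < "$index_file") ))
    [ "$index_bytes" -lt "$mem_bytes" ]
}

# Example usage on Linux (MemAvailable in /proc/meminfo is reported in kB):
#   mem_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
#   db_fits_in_memory <DATABASE>/hash.k2d $((mem_kb * 1024)) \
#       && echo "index fits in available memory"
```

If the check fails, a smaller database is the simplest remedy.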
The minimap2 sub-workflow relies on mapping the reads against a reference database based on their identity. To analyze archaeal, bacterial and fungal 16S/18S ribosomal DNA and ITS data, you can use the same two databases that are available for the kraken2 pipeline (see above for more information). In addition, you can use a custom database tailored to what you expect (or do not expect) to find in the samples.
The reference file can be either a FASTA file or a minimap2 index file (.mmi). The .mmi file can be created by running the following command (see the minimap2 GitHub repository for more information):
minimap2 -d <reference.mmi> <reference.fasta>
You can then use it in the workflow with the --reference option, passing either <reference.mmi> or <reference.fasta>:
nextflow run epi2me-labs/wf-metagenomics --fastq <PATH_TO_FASTQ> --classifier minimap2 --reference <reference.mmi>
or in the app (Reference Options > Reference) by pointing to the file.
When using a custom reference, you may also want to provide a file with the taxonomy of each of your reference sequences. For this purpose you can use the --ref2taxid option, which expects a TSV file without headers that gives the taxid of each reference (from within EPI2ME Labs this is the parameter Minimap2 Options > Ref2taxid).
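As a sketch of what such a file might look like, the snippet below writes a small header-less TSV and checks its shape. It assumes a two-column layout (reference sequence name, then NCBI taxid), which you should confirm against the workflow documentation. The reference names `ref_seq_1` and `ref_seq_2` and the helper `check_ref2taxid` are hypothetical; the taxids 562 and 1280 are the NCBI identifiers for Escherichia coli and Staphylococcus aureus.

```shell
# Write a hypothetical two-column file: <reference name> TAB <taxid>, no header.
printf '%s\t%s\n' \
    ref_seq_1 562 \
    ref_seq_2 1280 > ref2taxid.tsv

# Succeed only if every line has exactly two tab-separated fields
# and the second field is a purely numeric taxid.
check_ref2taxid() {
    awk -F '\t' 'NF != 2 || $2 !~ /^[0-9]+$/ { bad = 1 } END { exit bad + 0 }' "$1"
}

check_ref2taxid ref2taxid.tsv && echo "ref2taxid.tsv looks well-formed"
```

The names in the first column would need to match the sequence names in your reference FASTA. A header line such as "name<TAB>taxid" fails this check, in line with the requirement that the file has no headers.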
¹ k-mer: a sequence of k consecutive characters in a string (or nucleotides in a DNA sequence).