You just finished the lab work for the experiment that will save the future of humankind. A huge amount of data is waiting to be interpreted; so you run wf-metagenomics to find out the taxonomic composition of your sample and… surprise! You get many unclassified or totally unexpected taxa! How is this possible? What should you do next? Well, do not panic! As a first step simply read this blog post.

First of all, unexpected results are not always worrying. Results depend on many factors such as the completeness of a database or the source of the sample. If you are sequencing a less studied environment, it could be that some of the taxa have not yet been described and therefore are not in the databases. In such cases, those unclassified reads might be a new species waiting for you to classify them! However, in other cases, you may know the expected taxonomic profile of your sample and you just want to check it. In that case, unclassified reads can be worrying.

In both cases, however, it is important to properly think your experiment through before checking the results and trying to understand them! To make sense of your results (or sometimes just to confirm them), you may need to recover unclassified reads (or reads classified as unexpected taxa) and perform further analysis. However, if you are not familiar with bioinformatics tools, this can be a daunting challenge.

With this blog post we try to shed some light on a few useful tools that will make this task easier. It will explain how to use seqkit and csvtk to recover specific reads from the workflow output for subsequent analysis.

`seqkit`: your best friend to manipulate FASTQ (and FASTA) files

seqkit is your ally when you need to deal with FASTQ files. It handles both compressed (.gz) and uncompressed (.fq or .fastq and .fasta, .fa) files and covers a wide range of use cases (see here for details). For example, it is very helpful for many common tasks:

You can check the different things you can do with seqkit here, but as small taste of its power, here is a list of some of the most common tasks:

Want to get an overview of your sequence data? - seqkit stats is your helper.
Filter sequences based on read length or quality? - take a look at seqkit seq.
Need to subset a FASTQ file based on a list of given read identifiers? - seqkit grep is what you are looking for.
Too much data to run small tests? - Create a random subset with seqkit sample.

We should also mention seqtk. While it does not provide as many features as seqkit, it can be considerably faster for some use cases. Similarly, fastcat, the tool used in our workflows, can also filter reads based on read length or quality and provides summary statistics of your data too.

Later on we’ll have some examples of how to use some of these commands, so please be patient and keep reading!

csvtk: working with tables on the command line is not a pain anymore

csvtk is another essential to add to your toolkit. It contains intuitive commands that work on compressed and uncompressed files containing tabular data (CSV, TSV, etc.).

There are some small things to take into account before using csvtk. By default, it assumes that input files have a header row, but you can disable this with -H. It also expects CSV files by default. You have to add -t for tab-delimited files. Furthermore, it ignores lines starting with ”#“. You can set this to a different character with -C.

Again, let’s see some of the most useful functions:

What does our table look like? csvtk headers, dim, nrow, ncol or summary are great for that.
Need to randomly subsample a table? Then look at csvtk sample.
Search patterns? csvtk grep.
Don’t need all the data? Extract specific columns with csvtk cut or filter rows with csvtk filter2.

How to recover unclassified reads from kraken2 output

Now that we’ve covered the basics, let’s look at a practical example! We ran wf-metagenomics using kraken2 and in the output we find the kraken folder with the kraken2 reports.

You can use conda to install the tools we mentioned above. First we take a look at what these files look like:

cat barcode01.kraken2.assignments.tsv | head

The first column distinguishes between Classified (“C”) or Unclassified (“U”) and the second one is the readID (read this for more details about the kraken2 output format).

C       80525fcb-a74a-423d-b965-c941fa6e46fa    2       5477    0:2418 2638727:3 0:34 526218:5 0:155 158190:5 0:2433 1160721:1 0:389
C       4516e4cf-545e-4b80-9606-bcbaea3d3ba5    821     2640    0:16 909656:4 0:12 909656:1 0:48 821:5 0:42 821:5 0:15 821:3 0:1 821:3 0:24 821:3 0:4 821:1 0:36 815:3 0:14 815:2 0:3 909656:5 0:27 909656:5 0:38 821:2 0:18 909656:3 0:83 821:5 0:33 909656:5 0:20 909656:5 0:13 909656:3 0:28 821:1 0:27 909656:1 0:14 909656:5 821:1 0:2 909656:4 0:37 821:1 0:2 821:3 0:12 821:5 0:3 821:5 0:36 909656:5 0:9 909656:5 0:3 821:5 0:7 821:1 0:82 821:8 0:29 909656:2 0:24 909656:2 0:14 909656:2 0:4 909656:9 0:1 909656:9 0:7 909656:4 0:137 909656:5 0:77 909656:4 0:33 909656:2 0:24 821:1 0:19 909656:5 0:43 909656:7 0:11 909656:2 0:25 821:2 0:26 909656:11 0:4 909656:5 0:3 821:1 0:57 821:2 0:5 821:2 0:70 909656:3 0:5 909656:5 0:40 821:1 0:51 909656:3 0:91 909656:2 0:23 909656:2 0:4 909656:5 0:12 909656:4 0:49 909656:3 0:92 909656:1 0:9 909656:2 0:64 909656:1 0:13 909656:5 0:27 821:5 0:66 909656:2 0:5 909656:4 0:27 821:5 0:43 909656:1 0:7 909656:2 0:33 821:1 0:81 821:1 0:23 821:5 0:16 909656:1 821:5 0:33 821:7 0:2 821:5 0:30 909656:1 0:1 909656:5 0:2 909656:1 0:6 909656:1 0:5 909656:1 0:21 821:1 0:9 821:1 0:29 821:5 0:70 821:2 0:6
U       3659b19f-d683-40d1-ae71-db28e853d646    0       4971    0:4937
C       3cfd7ebf-6246-4ad7-9dbf-31aef52ffe5d    39491   1459    0:2 39491:1 0:54 39491:2 0:9 39491:3 0:9 39491:2 0:16 39491:8 0:10 39491:3 0:38 39491:4 0:15 39491:2 0:5 39491:3 0:91 39491:8 0:1 39491:1 0:9 39491:1 0:1128
C       6ae3b42a-dc87-4ab3-aa12-2e16992985df    2929488 1015    0:53 186802:2 0:22 186802:4 0:36 186802:5 0:7 2646395:1 0:39 186802:8 0:7 186802:3 0:28 186802:5 0:10 186802:7 0:68 186802:2 0:46 186802:5 0:60 216851:1 0:23 186802:5 0:11 186802:5 0:89 186802:1 0:8 186802:2 0:34 186802:5 0:3 186802:4 0:23 186802:5 0:41 2929488:5 186802:5 0:2 186802:2 0:97 186802:9 0:20 186802:2 0:19 186802:3 0:4 186802:2 0:3 186802:5 0:125
U       6dc4e39a-7d54-4601-bc0c-68db2be71458    0       1248    0:1214
C       8058eb02-bc9c-43bc-8d11-00d18b1f3657    186802  3585    0:510 186803:1 0:934 2763670:4 0:25 350688:5 0:1409 742737:5 0:385 657314:3 0:270
C       cf2d87df-935c-4254-b007-dc5e1f465688    657317  2453    0:1920 657317:2 0:82 1736:1 0:414
U       eda02d14-094c-4e94-befe-3ddd6dab90aa    0       1694    0:1660

# we can count the number of reads using:
wc -l kraken2.assignments.tsv
22109814

So, our task is to take the unclassified reads to perform further analysis. To do this we need to:

Extract the read IDs that are unclassified:

# We don't have column names (`-H`), our file is tab separated (`-t`) and we want all the rows whose first column is "U".
csvtk filter2 -H -t -f '$1=="U"' barcode01.kraken2.assignments.tsv -o kraken2.U.assignments.tsv

head -n3 kraken2.U.assignments.tsv
# see that all are unclassified
U       3659b19f-d683-40d1-ae71-db28e853d646    0       4971    0:4937
U       6dc4e39a-7d54-4601-bc0c-68db2be71458    0       1248    0:1214
U       eda02d14-094c-4e94-befe-3ddd6dab90aa    0       1694    0:1660

# we can count the number of unclassified using:
wc -l kraken2.U.assignments.tsv
3905287

We need just the readIDs, so:

csvtk cut -t -H -f 2 kraken2.U.assignments.tsv -o kraken2.U.readIDs.txt
# alternatively, we can run the two commands together:
csvtk filter2 -H -t -f '$1=="U"' barcode01.kraken2.assignments.tsv | csvtk cut -t -H - -f 2 -o kraken2.U.readIDs.txt

head -n3 kraken2.U.readIDs.txt
3659b19f-d683-40d1-ae71-db28e853d646
6dc4e39a-7d54-4601-bc0c-68db2be71458
eda02d14-094c-4e94-befe-3ddd6dab90aa

Better to have first a small subset of readIDs for initial data exploration.

# Sample randonmly 1% of the read IDs.
csvtk sample -p 0.01 kraken2.U.readIDs.txt -o subset.kraken2.U.readIDs.txt

Extract those sequences from the FASTQ file.

seqkit grep -f subset.kraken2.U.readIDs.txt seqs.fastq.gz | bgzip -@ 4 > subset.kraken2.U.readIDs.fq.gz
# or
seqtk subseq seqs.fastq.gz subset.kraken2.U.readIDs.txt | bgzip -@ 4 > subset.kraken2.U.readIDs.fq.gz

Note that we use bgzip instead of gzip as this allows us to further subset the generated files with samtools fqidx later if needed.

How to recover unmapped reads from BAM files

First, we have to run the workflow using the keep_bam option so that the BAM files it generates are published to the output folder (you will find them in the bams sub-folder). Then, we’d like to subset those reads which have not mapped against our reference:

# Extract unmapped reads (-f 4) and convert them to FASTQ format.
samtools view -f 4 barcode01.reference.bam | samtools fastq -@ 4 - | bgzip -@ 4 > barcode01.unmapped.fq.gz

From here, you can explore your unmapped or unclassified sequences further. For example, you can check if they satisfy quality control targets, map them against another database, and so on. There are different options that you may try! We hope you find this post useful and don’t forget to take a look at some of our related blog posts!