You just finished the lab work for the experiment that will save the future of humankind. A huge amount of data is waiting to be interpreted; so you run wf-metagenomics to find out the taxonomic composition of your sample and… surprise! You get many unclassified or totally unexpected taxa! How is this possible? What should you do next? Well, do not panic! As a first step simply read this blog post.
First of all, unexpected results are not always worrying. Results depend on many factors such as the completeness of a database or the source of the sample. If you are sequencing a less studied environment, it could be that some of the taxa have not yet been described and therefore are not in the databases. In such cases, those unclassified reads might be a new species waiting for you to classify them! However, in other cases, you may know the expected taxonomic profile of your sample and you just want to check it. In that case, unclassified reads can be worrying.
In both cases, however, it is important to properly think your experiment through before checking the results and trying to understand them! To make sense of your results (or sometimes just to confirm them), you may need to recover unclassified reads (or reads classified as unexpected taxa) and perform further analysis. However, if you are not familiar with bioinformatics tools, this can be a daunting challenge.
With this blog post we try to shed some light on a few useful tools that will make this task easier.
It will explain how to use seqkit
and csvtk
to recover specific reads from the workflow output for subsequent analysis.
seqkit
: your best friend to manipulate FASTQ (and FASTA) filesseqkit is your ally when you need to deal with FASTQ files.
It handles both compressed (.gz
) and uncompressed (.fq
or .fastq
and .fasta
, .fa
) files and covers a wide range of use cases (see here for details).
For example, it is very helpful for many common tasks:
You can check the different things you can do with seqkit here, but as small taste of its power, here is a list of some of the most common tasks:
seqkit stats
is your helper.seqkit seq
.seqkit grep
is what you are looking for.seqkit sample
.We should also mention seqtk. While it does not provide as many features as
seqkit
, it can be considerably faster for some use cases. Similarly, fastcat, the tool used in our workflows, can also filter reads based on read length or quality and provides summary statistics of your data too.
Later on we’ll have some examples of how to use some of these commands, so please be patient and keep reading!
csvtk is another essential to add to your toolkit. It contains intuitive commands that work on compressed and uncompressed files containing tabular data (CSV, TSV, etc.).
There are some small things to take into account before using
csvtk
. By default, it assumes that input files have a header row, but you can disable this with-H
. It also expects CSV files by default. You have to add-t
for tab-delimited files. Furthermore, it ignores lines starting with ”#“. You can set this to a different character with-C
.
Again, let’s see some of the most useful functions:
csvtk headers
, dim
, nrow
, ncol
or summary
are great for that.csvtk sample
.csvtk grep
.csvtk cut
or filter rows with csvtk filter2
.Now that we’ve covered the basics, let’s look at a practical example!
We ran wf-metagenomics using kraken2 and in the output we find the kraken
folder with the kraken2 reports.
You can use conda to install the tools we mentioned above. First we take a look at what these files look like:
cat barcode01.kraken2.assignments.tsv | head
The first column distinguishes between Classified (“C”) or Unclassified (“U”) and the second one is the readID (read this for more details about the kraken2 output format).
C 80525fcb-a74a-423d-b965-c941fa6e46fa 2 5477 0:2418 2638727:3 0:34 526218:5 0:155 158190:5 0:2433 1160721:1 0:389C 4516e4cf-545e-4b80-9606-bcbaea3d3ba5 821 2640 0:16 909656:4 0:12 909656:1 0:48 821:5 0:42 821:5 0:15 821:3 0:1 821:3 0:24 821:3 0:4 821:1 0:36 815:3 0:14 815:2 0:3 909656:5 0:27 909656:5 0:38 821:2 0:18 909656:3 0:83 821:5 0:33 909656:5 0:20 909656:5 0:13 909656:3 0:28 821:1 0:27 909656:1 0:14 909656:5 821:1 0:2 909656:4 0:37 821:1 0:2 821:3 0:12 821:5 0:3 821:5 0:36 909656:5 0:9 909656:5 0:3 821:5 0:7 821:1 0:82 821:8 0:29 909656:2 0:24 909656:2 0:14 909656:2 0:4 909656:9 0:1 909656:9 0:7 909656:4 0:137 909656:5 0:77 909656:4 0:33 909656:2 0:24 821:1 0:19 909656:5 0:43 909656:7 0:11 909656:2 0:25 821:2 0:26 909656:11 0:4 909656:5 0:3 821:1 0:57 821:2 0:5 821:2 0:70 909656:3 0:5 909656:5 0:40 821:1 0:51 909656:3 0:91 909656:2 0:23 909656:2 0:4 909656:5 0:12 909656:4 0:49 909656:3 0:92 909656:1 0:9 909656:2 0:64 909656:1 0:13 909656:5 0:27 821:5 0:66 909656:2 0:5 909656:4 0:27 821:5 0:43 909656:1 0:7 909656:2 0:33 821:1 0:81 821:1 0:23 821:5 0:16 909656:1 821:5 0:33 821:7 0:2 821:5 0:30 909656:1 0:1 909656:5 0:2 909656:1 0:6 909656:1 0:5 909656:1 0:21 821:1 0:9 821:1 0:29 821:5 0:70 821:2 0:6U 3659b19f-d683-40d1-ae71-db28e853d646 0 4971 0:4937C 3cfd7ebf-6246-4ad7-9dbf-31aef52ffe5d 39491 1459 0:2 39491:1 0:54 39491:2 0:9 39491:3 0:9 39491:2 0:16 39491:8 0:10 39491:3 0:38 39491:4 0:15 39491:2 0:5 39491:3 0:91 39491:8 0:1 39491:1 0:9 39491:1 0:1128C 6ae3b42a-dc87-4ab3-aa12-2e16992985df 2929488 1015 0:53 186802:2 0:22 186802:4 0:36 186802:5 0:7 2646395:1 0:39 186802:8 0:7 186802:3 0:28 186802:5 0:10 186802:7 0:68 186802:2 0:46 186802:5 0:60 216851:1 0:23 186802:5 0:11 186802:5 0:89 186802:1 0:8 186802:2 0:34 186802:5 0:3 186802:4 0:23 186802:5 0:41 2929488:5 186802:5 0:2 186802:2 0:97 186802:9 0:20 186802:2 0:19 186802:3 0:4 186802:2 0:3 186802:5 0:125U 6dc4e39a-7d54-4601-bc0c-68db2be71458 0 1248 0:1214C 8058eb02-bc9c-43bc-8d11-00d18b1f3657 186802 3585 0:510 186803:1 0:934 2763670:4 0:25 350688:5 0:1409 742737:5 0:385 657314:3 0:270C cf2d87df-935c-4254-b007-dc5e1f465688 657317 2453 0:1920 657317:2 0:82 1736:1 0:414U eda02d14-094c-4e94-befe-3ddd6dab90aa 0 1694 0:1660# we can count the number of reads using:wc -l kraken2.assignments.tsv22109814
So, our task is to take the unclassified reads to perform further analysis. To do this we need to:
# We don't have column names (`-H`), our file is tab separated (`-t`) and we want all the rows whose first column is "U".csvtk filter2 -H -t -f '$1=="U"' barcode01.kraken2.assignments.tsv -o kraken2.U.assignments.tsvhead -n3 kraken2.U.assignments.tsv# see that all are unclassifiedU 3659b19f-d683-40d1-ae71-db28e853d646 0 4971 0:4937U 6dc4e39a-7d54-4601-bc0c-68db2be71458 0 1248 0:1214U eda02d14-094c-4e94-befe-3ddd6dab90aa 0 1694 0:1660# we can count the number of unclassified using:wc -l kraken2.U.assignments.tsv3905287
We need just the readIDs, so:
csvtk cut -t -H -f 2 kraken2.U.assignments.tsv -o kraken2.U.readIDs.txt# alternatively, we can run the two commands together:csvtk filter2 -H -t -f '$1=="U"' barcode01.kraken2.assignments.tsv | csvtk cut -t -H - -f 2 -o kraken2.U.readIDs.txthead -n3 kraken2.U.readIDs.txt3659b19f-d683-40d1-ae71-db28e853d6466dc4e39a-7d54-4601-bc0c-68db2be71458eda02d14-094c-4e94-befe-3ddd6dab90aa
# Sample randonmly 1% of the read IDs.csvtk sample -p 0.01 kraken2.U.readIDs.txt -o subset.kraken2.U.readIDs.txt
seqkit grep -f subset.kraken2.U.readIDs.txt seqs.fastq.gz | bgzip -@ 4 > subset.kraken2.U.readIDs.fq.gz# orseqtk subseq seqs.fastq.gz subset.kraken2.U.readIDs.txt | bgzip -@ 4 > subset.kraken2.U.readIDs.fq.gz
Note that we use
bgzip
instead ofgzip
as this allows us to further subset the generated files withsamtools fqidx
later if needed.
First, we have to run the workflow using the keep_bam
option so that the BAM files it generates are published to the output folder (you will find them in the bams
sub-folder).
Then, we’d like to subset those reads which have not mapped against our reference:
# Extract unmapped reads (-f 4) and convert them to FASTQ format.samtools view -f 4 barcode01.reference.bam | samtools fastq -@ 4 - | bgzip -@ 4 > barcode01.unmapped.fq.gz
From here, you can explore your unmapped or unclassified sequences further. For example, you can check if they satisfy quality control targets, map them against another database, and so on. There are different options that you may try! We hope you find this post useful and don’t forget to take a look at some of our related blog posts!