Oxford Nanopore Sequencing devices output RNA and DNA read sequences in batched fastq files. This is useful for real-time analyses like those run by EPI2ME but can be inconvenient for providing the data to standard bioinformatics tools, many of which only cater for data provided in a single file. In this post we’ll examine a new tool developed for use in EPI2ME Labs workflows to provide a first-line data sanitization and normalization step.

The problem

The fastq format is ubiquitous for storing sequencing reads. Data to be analysed by a workflow can, however, arrive with at least three wrinkles:

there might be multiple fastq files to be analysed as a whole,
the files may contain reads from a multiplexed experiment to be analysed separately,
and the files may or may not be compressed with gzip.

We would like a process that handles these situations and formats the data in a standardised form: a single gzip-compressed fastq file for each demultiplexed sample. There are already a variety of programs that can take the data described above as input and get us most of the way to the desired result. Indeed, for most cases, simple shell scripting can do the job. The standard Linux cat program can be used in simple cases where the files are either all uncompressed or all gzip-compressed, but not when the two are mixed. The demultiplexing aspect of the task complicates things further: Oxford Nanopore Sequencing devices helpfully output demultiplexed reads into distinct directories, but what if this structure has been lost? We would like a more robust solution.
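To make the mixed-compression wrinkle concrete, here is a minimal Python sketch of the normalization step. It is not how fastcat is implemented; the function names are hypothetical, and it assumes gzip input can be recognised by its two leading magic bytes and that every input file ends with a newline.

```python
import gzip
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def open_maybe_gzip(path):
    """Open a file for binary reading, decompressing it if it is gzip."""
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC
    return gzip.open(path, "rb") if is_gzip else open(path, "rb")


def concat_fastq(inputs, output):
    """Write all inputs, compressed or not, into one gzip-compressed file.

    Assumes each input ends with a newline; otherwise the last record of
    one file would run into the first record of the next.
    """
    with gzip.open(output, "wb") as out:
        for path in inputs:
            with open_maybe_gzip(path) as fh:
                shutil.copyfileobj(fh, out)
```

Calling `concat_fastq(["a.fastq", "b.fastq.gz"], "all.fastq.gz")` would then give a single compressed file regardless of how each input was stored.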

The solution

To solve this problem the EPI2ME Labs team developed the fastcat program. The program handles the three concerns above while placing minimal constraints on the inputs. It performs this task using two key ingredients:

the stream-based fastq/fasta parser kseq as used in the ubiquitous htslib package,
the header information written to fastq records by the Guppy basecaller.
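As an illustration of the second ingredient, the sketch below pulls key=value fields out of a fastq header line. It assumes the header has the form `@read_id key=value key=value ...`, with fields such as `barcode` and `start_time`; the example header is invented for illustration, and the exact set of fields depends on the basecaller version and settings.

```python
def parse_header(header):
    """Split a fastq header line into (read_id, {key: value})."""
    if header.startswith("@"):
        header = header[1:]
    read_id, _, comment = header.strip().partition(" ")
    tags = {}
    for field in comment.split():
        key, sep, value = field.partition("=")
        if sep:  # ignore any token that is not of the form key=value
            tags[key] = value
    return read_id, tags


# Illustrative header only; real headers carry more fields.
example = "@read1 runid=abc123 ch=42 start_time=2020-01-01T00:00:00Z barcode=barcode01"
read_id, tags = parse_header(example)
```

With the barcode recovered from each header in this way, reads can be routed to per-sample outputs even when the original directory structure is gone.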

The program is not yet a complete replacement for more fully featured fastq toolkits such as the excellent seqkit, though it does provide a few generally useful features beyond the core normalization functions. These include length and quality filtering, as well as producing a summary file containing per-read statistics. Being written in C, it performs these tasks several times faster than similar programs written in languages such as Python.
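For readers unfamiliar with read-level quality filtering, the following sketch shows one common way to do it. It is an assumption that this matches fastcat's calculation: the mean quality is taken over per-base error probabilities rather than over the Phred scores themselves, and the function and parameter names are hypothetical.

```python
import math


def mean_quality(qual, offset=33):
    """Mean Phred quality of a read, averaged as error probabilities."""
    if not qual:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(seq, qual, min_length=0, min_qscore=0.0):
    """Return True if the read passes both the length and quality filters."""
    return len(seq) >= min_length and mean_quality(qual) >= min_qscore
```

Averaging probabilities penalises a few very poor bases far more than averaging the scores would: a read with one Q40 base and one Q10 base has a mean quality of about 13 by this measure, not 25.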

The fastq header parsing of fastcat allows plotting experimental yield without resorting to Fast5 or standalone sequencing summary files.
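A yield curve of this kind needs only each read's start time and length. The sketch below is illustrative rather than fastcat's own code: it assumes start times are ISO 8601 strings such as `2020-01-01T00:00:00Z`, as might be taken from a `start_time` header field, and the function name is hypothetical.

```python
from datetime import datetime


def cumulative_yield(reads):
    """Turn (start_time, read_length) pairs into [(seconds, total_bases)].

    Times are measured from the earliest read; a trailing 'Z' is
    rewritten as '+00:00' so that datetime.fromisoformat accepts it.
    """
    timed = sorted(
        (datetime.fromisoformat(t.replace("Z", "+00:00")), n)
        for t, n in reads
    )
    if not timed:
        return []
    t0 = timed[0][0]
    total = 0
    curve = []
    for when, n in timed:
        total += n
        curve.append(((when - t0).total_seconds(), total))
    return curve
```

The resulting list of points can be passed directly to any plotting library to draw bases sequenced against elapsed time.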

Summary

Bioinformatics workflows often require using tools which are not designed to work in a real-time, streaming fashion and require all sequencing data to be provided in a single input file. fastcat provides a convenient method to normalize input data for downstream tools.

Key fastcat features

accepts compressed, uncompressed, or a mixture of fastq files,
filters reads on length and quality score,
produces per-read and per-input file summaries including information from fastq headers,
recapitulates the standard MinKNOW demultiplexing directory layout from unstructured data.