Oxford Nanopore Sequencing devices output RNA and DNA read sequences in batched fastq files. This is useful for real-time analyses like those run by EPI2ME, but it can be inconvenient when passing the data to standard bioinformatics tools, many of which only cater for data provided in a single file. In this post we’ll examine a new tool developed for use in EPI2ME Labs workflows to provide a first-line data sanitization and normalization step.
The fastq format is ubiquitous for storing sequencing reads. Data to be analysed by a workflow can however be presented with at least three wrinkles:
We would like a process that handles these situations and formats the data in a standardised form: a single gzip-compressed fastq file for each demultiplexed sample. There are already a variety of programs that can take as input the data described above and get us most of the way to the desired result. Indeed, for most cases simple shell scripting can do the job. The standard cat program can be used in simple cases with either uncompressed or gzip-compressed data, but not a mixture of the two. The demultiplexing aspect of the task complicates things further: Oxford Nanopore Sequencing devices helpfully output demultiplexed reads into distinct directories, but what if this structure has been lost? We would like a more robust solution.
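To make the mixed-compression problem concrete, here is a minimal Python sketch of the kind of normalization described above. It is not how fastcat is implemented; the function names `open_maybe_gzip` and `concat_fastq` are hypothetical and chosen only for illustration. The sketch detects gzip input by its two magic bytes rather than trusting the file extension, and writes everything to a single gzip-compressed output.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open a file for binary reading, decompressing it if it is gzipped.

    Hypothetical helper: the check uses the file's magic bytes, not its name.
    """
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rb")
    return open(path, "rb")


def concat_fastq(inputs, output, chunk_size=1 << 20):
    """Concatenate plain and gzipped fastq files into one gzipped file."""
    with gzip.open(output, "wb") as out:
        for path in inputs:
            last = b"\n"  # an empty input needs no extra newline
            with open_maybe_gzip(path) as src:
                for chunk in iter(lambda: src.read(chunk_size), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            # Guard against a missing trailing newline, which would
            # otherwise fuse the last record with the next file's first.
            if last != b"\n":
                out.write(b"\n")
```

Calling `concat_fastq(["a.fastq", "b.fastq.gz"], "sample.fastq.gz")` would produce one compressed file regardless of how each input was stored. The sketch does nothing about demultiplexing: grouping inputs by sample is left to the caller.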
To solve this problem the EPI2ME Labs team developed the fastcat program. The program handles the three concerns above, placing minimal constraints on the inputs. It performs this task using two key ingredients:
The program is not yet a complete replacement for more fully featured fastq toolkits such as the excellent seqkit, though it does provide a few generally useful features beyond the core normalization functions. These include length and quality filtering, as well as producing a summary file containing per-read statistics. Being written in C, it performs these tasks several times faster than similar programs written in languages such as Python.
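As an illustration of what per-read statistics and length and quality filtering involve, the following Python sketch computes a read's length and mean quality, then applies thresholds. The names `mean_quality`, `read_fastq` and `filter_reads` are hypothetical, and the conventions used are assumptions rather than a description of fastcat itself: Phred+33 encoding, four lines per record, and a mean quality obtained by averaging per-base error probabilities.

```python
import math


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string, averaged in probability space."""
    if not qual:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record text handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:].split()[0], seq, qual


def filter_reads(records, min_length=0, min_qscore=0.0):
    """Yield (name, length, mean_q) for reads passing both thresholds."""
    for name, seq, qual in records:
        length, mean_q = len(seq), mean_quality(qual)
        if length >= min_length and mean_q >= min_qscore:
            yield name, length, mean_q
```

Averaging error probabilities rather than raw scores means a single poor base weighs heavily on the result: a two-base read with qualities Q40 and Q0 comes out at roughly Q3, not Q20.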
Bioinformatics workflows often require using tools which are not designed to work in a real-time, streaming fashion and require all sequencing data to be provided in a single input file. fastcat provides a convenient method to normalize input data for downstream tools.
Key fastcat features