Introduction to Fastq files

The fastq format is (usually) a four-line plain-text format describing a sequence and its corresponding quality scores. There are different ways of encoding quality in a .fastq file; however, files from ONT sequencing devices use Sanger Phred scores (each quality character is the Phred score plus 33, written as a single ASCII character). A sequence record is made up of 4 lines:

line 1: Sequence ID and sequence description, prefixed with an @ symbol
line 2: Sequence line, e.g. ATCG
line 3: Plus symbol (+), optionally followed by a description
line 4: Quality line, with one quality character per base in line 2

IMPORTANT: Line 2 and line 4 must have the same length or the sequence record is not valid.
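The length rule can be checked mechanically. The sketch below is illustrative only: it writes a small made-up file (the name example.fastq and the read names are hypothetical) into a temporary directory and uses awk to report any record whose sequence and quality lines differ in length. It assumes strict four-line records with no line wrapping.

```shell
# Work in a throwaway directory so that no existing files are touched.
workdir=$(mktemp -d)

# Two made-up records: the quality line of read2 is one character too short.
printf '@read1 good record\nACGT\n+\nIIII\n@read2 bad record\nACGT\n+\nIII\n' \
    > "$workdir/example.fastq"

# Within each record: line 1 is the header, line 2 the sequence, line 4 the qualities.
invalid=$(awk 'NR % 4 == 1 { id = $1 }
               NR % 4 == 2 { seqlen = length($0) }
               NR % 4 == 0 && length($0) != seqlen { print substr(id, 2) }' \
          "$workdir/example.fastq")

echo "Invalid records: $invalid"
```

Here only read2 is reported, because its sequence has four bases but its quality line has three characters.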

For example, a sample record looks like:

@sequence_id sequence_description
ATCG
+
!!!!

The sequence ID must not contain any spaces. Anything after the first space in the sequence ID line will be considered the "description".
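As a small illustration of this rule, the sketch below splits a header line into its ID and description using plain shell parameter expansion. The header text is a made-up example, and the variable names are assumptions chosen for this snippet.

```shell
# A made-up header line; the description itself may contain further spaces.
header='@sequence_id sequence_description with more words'

# Drop the leading "@".
header=${header#@}

# Everything before the first space is the ID.
seq_id=${header%% *}

# Everything after the first space is the description (empty if there is no space).
case "$header" in
    *' '*) description=${header#* } ;;
    *)     description='' ;;
esac

echo "ID: $seq_id"
echo "Description: $description"
```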

A .fastq file may contain multiple records. The default number of records in a fastq file generated during a nanopore run is 4000 reads (16000 lines).

Useful snippets

The following snippets demonstrate common tasks you might want to perform on a single .fastq file or a set of such files. For many tasks we recommend the excellent seqkit program.

Before anything else we will create and set a working directory:
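One way to do this from a terminal is sketched below. The directory name fastq_tutorial and its location are assumptions made for this example; any directory you can write to will do.

```shell
# Hypothetical location; replace with any directory you can write to.
working_dir="${TMPDIR:-/tmp}/fastq_tutorial"

# -p creates parent directories as needed and does not fail if the directory exists.
mkdir -p "$working_dir"

# Make it the current working directory for the commands that follow.
cd "$working_dir" || exit 1
pwd
```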

The snippets all have their code to the left-hand side and a form to the right which can be used to change their inputs (as an alternative to directly editing the code).

How many records in my .fastq file?

To count the number of records in a .fastq file we can use the Linux word count command (wc) to count the number of lines in the file, with a division by four accounting for the four lines per record. This assumes that every record occupies exactly four lines:
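A minimal sketch of this calculation, run against a small made-up file (example.fastq is a hypothetical name) so that it can be tried without real data:

```shell
# Create a throwaway file containing three made-up records.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n' \
    > "$workdir/example.fastq"

# Redirecting the file into wc makes it print only the line count,
# without the file name; shell arithmetic then divides by four.
n_records=$(( $(wc -l < "$workdir/example.fastq") / 4 ))

echo "$n_records records"
```

To count records in your own data, replace the path with that of your .fastq file.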

List all the fastqs in a directory

As Oxford Nanopore Technologies' sequencing devices output multiple .fastq files during the course of an experiment, it can be useful to find and list all such files, including those in subdirectories. We can do this with the Linux find command:
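A sketch of such a search is given below. The directory layout and file names are invented for the example, and the directory variable is an assumption standing in for whichever directory you want to search.

```shell
# Build a small made-up directory tree to search.
workdir=$(mktemp -d)
mkdir -p "$workdir/run/pass"
touch "$workdir/run/a.fastq" "$workdir/run/pass/b.fastq" "$workdir/run/notes.txt"

# Directory to search; use "." for the current working directory.
directory="$workdir"

# -type f restricts the search to regular files;
# -name matches the pattern against the file name only.
fastq_files=$(find "$directory" -type f -name '*.fastq' | sort)

echo "$fastq_files"
```

The two .fastq files are listed, including the one in the nested pass directory, while notes.txt is ignored.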

The default directory value here (.) means "the current working directory."

Concatenate all fastqs in a directory into a single file

Many bioinformatics programs require all sequence data to be present in a single .fastq file. In order to process sequences spread across multiple files we must first concatenate (or "cat") all the .fastq files into a single consolidated file. To perform this task we can use a combination of the Linux find, xargs, and cat commands:
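The sketch below shows one way to combine the three commands. File and directory names are made up for the example. It assumes a find and xargs that support the -print0 and -0 options (as the GNU and BSD versions do), and it writes the output outside the searched directory so that the output file is not itself picked up by find.

```shell
# Two made-up input files in a small directory tree.
workdir=$(mktemp -d)
mkdir -p "$workdir/run/pass"
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/run/a.fastq"
printf '@r2\nGGCC\n+\nIIII\n' > "$workdir/run/pass/b.fastq"

directory="$workdir/run"             # directory to search
output="$workdir/all_records.fastq"  # kept outside the searched directory

# -print0 and -0 pass file names separated by NUL characters, so names with
# spaces are handled safely; xargs runs cat as many times as needed to stay
# within the system's argument-length limit.
find "$directory" -type f -name '*.fastq' -print0 | xargs -0 cat > "$output"

echo "$(( $(wc -l < "$output") / 4 )) records written to $output"
```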

Again the default directory value here (.) means "the current working directory."

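You may often see a simpler form of the above:

cat *.fastq > output.fastq

However, this command only picks up files in the current directory (not in subdirectories), and it will fail with an "Argument list too long" error if the number of .fastq files is very large, because the shell expands *.fastq into one argument per file before running cat.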

Remove all duplicates in a fastq

It can sometimes be the case that a .fastq file contains duplicates of the same read. To remove these we can use the rmdup command of the seqkit program:
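A minimal sketch of such a call follows. The input is a tiny made-up file rather than the example data, the file names are hypothetical, and the block simply skips the step if seqkit is not installed. By default rmdup treats records with the same sequence ID as duplicates.

```shell
# A made-up file in which record r1 appears twice.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r1\nACGT\n+\nIIII\n' \
    > "$workdir/dups.fastq"

if command -v seqkit >/dev/null 2>&1; then
    # Duplicates are detected by sequence ID by default;
    # the -s option compares the sequences themselves instead.
    seqkit rmdup -o "$workdir/dedup.fastq" "$workdir/dups.fastq"
    dedup_records=$(( $(wc -l < "$workdir/dedup.fastq") / 4 ))
    echo "$dedup_records records remain"
else
    echo "seqkit is not installed; skipping" >&2
fi
```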

For the example data, 200 duplicate records are identified because the three files (containing 100 records each) are in fact copies of the same file.

Compress or extract a fastq file

We can save hard disk space on our computer by compressing .fastq files. To do this we recommend using bgzip which allows for indexing and fast retrieval of sequences by bioinformatics programs:
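A sketch of the compression step, using a small made-up file (example.fastq is a hypothetical name). The block skips the step if bgzip is not installed.

```shell
# A throwaway file to compress.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/example.fastq"

if command -v bgzip >/dev/null 2>&1; then
    # Like gzip, bgzip replaces example.fastq with example.fastq.gz;
    # add -c to write to standard output and keep the original instead.
    bgzip "$workdir/example.fastq"
    ls -l "$workdir"
else
    echo "bgzip is not installed; skipping" >&2
fi
```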

The size of the compressed file is roughly half of the original. To decompress the compressed file, we again use bgzip, this time with the -d option:
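The round trip can be sketched as follows, again with a made-up file and hypothetical file names. A copy of the original is kept only so that the result can be compared at the end. The block skips the step if bgzip is not installed.

```shell
# Create a throwaway file and keep a copy for comparison.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/example.fastq"
cp "$workdir/example.fastq" "$workdir/original_copy.fastq"

if command -v bgzip >/dev/null 2>&1; then
    bgzip "$workdir/example.fastq"         # produces example.fastq.gz
    # -d decompresses, replacing example.fastq.gz with example.fastq.
    bgzip -d "$workdir/example.fastq.gz"
    # cmp -s is silent and succeeds only if the two files are identical.
    cmp -s "$workdir/example.fastq" "$workdir/original_copy.fastq" \
        && echo "round trip OK"
else
    echo "bgzip is not installed; skipping" >&2
fi
```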

Compress a directory structure

In order to compress a directory structure we can use the Linux tar command with the gzip compression option (-z):
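A sketch of creating such an archive from a small made-up directory tree (fastq_run and the file names are hypothetical):

```shell
# A made-up directory structure to archive.
workdir=$(mktemp -d)
mkdir -p "$workdir/fastq_run/pass"
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/fastq_run/pass/a.fastq"

# -c creates an archive, -z compresses it with gzip, -f names the archive file.
# -C changes into $workdir first, so the paths stored in the archive all start
# with the single top-level directory fastq_run.
tar -czf "$workdir/fastq_run.tar.gz" -C "$workdir" fastq_run

# -t lists the archive contents without extracting them.
tar -tzf "$workdir/fastq_run.tar.gz"
```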

When compressing directories and their contents in this way it is good practice to compress a single top-level directory, so that when the archive is decompressed a single top-level directory is retrieved (and the user's working directory isn't polluted).

To decompress the archive we use a similar command:
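The sketch below first builds a small made-up archive (all names are hypothetical) and then extracts it into a separate directory, so that the extraction step can be tried in isolation.

```shell
# Build a small archive to extract.
workdir=$(mktemp -d)
mkdir -p "$workdir/fastq_run"
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/fastq_run/a.fastq"
tar -czf "$workdir/fastq_run.tar.gz" -C "$workdir" fastq_run

# -x extracts, -z decompresses with gzip, -f names the archive file;
# -C selects the directory to unpack into (it must already exist).
mkdir "$workdir/restored"
tar -xzf "$workdir/fastq_run.tar.gz" -C "$workdir/restored"

ls "$workdir/restored/fastq_run"
```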

Visualizing fastq

The snippets below demonstrate basic parsing of fastq data in Python. We do not recommend using this code in practice, as much of the information is more readily available in the sequencing_summary.txt file produced by Oxford Nanopore Technologies' sequencing devices. See our Basic QC Tutorial for more examples.