Introduction to Fast5 files

This tutorial provides a short introduction to the Fast5 files used to store the raw data output of Oxford Nanopore Technologies' sequencing devices. It aims to provide background on why users may need to interact with Fast5 files and to show how to perform common manipulations.

Methods used in this tutorial include:

The computational requirements for this tutorial are:

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select Edit > Clear all outputs before using the notebook to analyse your own data.

Introduction

This tutorial aims to elucidate the information stored within a Fast5 file, and how such files can be read, or parsed, within the Python programming language and on the command line.

The goals of this tutorial include:

The tutorial includes a sample Fast5 dataset from a metagenomic sample.

Getting started

Before anything else we will create and set a working directory:
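(The cell below is a minimal sketch; the directory name fast5_tutorial is an arbitrary example choice.)

```bash
# create a working directory for this tutorial and move into it
# (the directory name used here is an assumed example)
mkdir -p fast5_tutorial
cd fast5_tutorial
```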

Install additional software

This tutorial uses the ont_fast5_api software; this is not installed in the default EPI2ME Labs environment. We will install this now in an isolated manner so as to not interfere with the existing environment.

Please note that the software installed is not persistent and this step will need to be re-run if you stop and restart the EPI2ME Labs server.
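As a sketch, the installation amounts to something like the following; the dedicated virtual environment named venv is an assumption of this example, and the exact mechanism used by the notebook may differ:

```bash
# create an isolated Python environment so the install does not interfere
# with the existing EPI2ME Labs environment (environment name is an example)
python3 -m venv venv
. venv/bin/activate

# install ont_fast5_api, which provides the command-line tools used below
pip install ont_fast5_api
```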

Sample Data

To provide a concrete example of handling Fast5 files, this tutorial is accompanied by an example dataset sampled from a MinION sequencing run. The dataset is not a full MinION run, in order to reduce the download size.

To download the sample file we run the Linux command wget. To execute the command click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.
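(The cell below is a sketch of the download step; the URL shown is a placeholder and should be replaced with the link provided with this tutorial.)

```bash
# download and unpack the sample dataset
# (the URL below is a placeholder, not the real download location)
wget -O fast5_sample_data.tar.gz "https://example.com/fast5_sample_data.tar.gz"
tar -xzvf fast5_sample_data.tar.gz
```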

Data entry

Having downloaded the sample data we need to provide the filepaths as input to the notebook.

The form can be used to enter the filenames of your inputs.

Executing the above form will have checked the input folder and attempted to find Fast5 files located within it.
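The check performed is equivalent to something like the following, assuming the folder entered in the form was named fast5_sample_data:

```bash
# list and count the Fast5 files under the input folder from the form
# (the folder name is an assumed example value)
input_folder="fast5_sample_data"
find "$input_folder" -name "*.fast5" | head
find "$input_folder" -name "*.fast5" | wc -l
```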

Fast5 files

Fast5 files are used by the MinKNOW instrument software and the Guppy basecalling software to store the primary sequencing data from Oxford Nanopore Technologies' sequencing devices and the results of primary and secondary analyses such as basecalling information and modified-base detection.

Before discussing how to read and manipulate Fast5 files in Python we will first review their internal structure.

HDF5 files

Files output by the MinKNOW instrument software and the Guppy basecalling software using the .fast5 file extension are container files using the HDF5 format. As such they are self-describing, carrying all the necessary information to correctly interpret the data they contain.

A Fast5 file differs from a generic HDF5 file in containing only a fixed, defined structure of data. This structure is elucidated in the ont_h5_validator repository on GitHub, specifically in the file multi_read_fast5.yaml.

Users are referred to the YAML schemas to gain an understanding of all the data contained in Fast5 files. Users are encouraged to raise Issues on the ont_h5_validator project if the schemas are unclear. The rest of this tutorial will be mostly practical in nature.

The schema file describes how the internal structure of a Fast5 file is laid out. There are three core concepts to understand:

  1. Groups: an HDF group acts similarly to a folder on a computer filesystem. Hierarchies of groups can exist to organise data.
  2. Datasets: an HDF dataset contains generic data. This might range from simple text data, like a Fastq record, to a matrix of a complex datatype containing subfields of data.
  3. Attributes: both groups and datasets can be labelled with attributes. These are short data items, commonly metadata, that can be used to aid understanding of the data: for example a sequencing run identifier or a timestamp.

An appreciation of these concepts is required to use the data contained within Fast5 files, though, as we will see, common manipulations of Fast5 files require only a passing awareness of these ideas.

Fast5 Flavours

Historically there have been two flavours of Fast5 files: single-read files, which store the data for one sequencing read per file, and multi-read files, which store the data for many reads in a single file.

The internal layout, in terms of groups and datasets, of these two flavours of Fast5 are very similar. In essence a multi-read file embeds the group hierarchy of multiple single-read files within one HDF5 container.

Single-read files are deprecated and no longer used by MinKNOW or Guppy. We recommend that any single-read files are converted to multi-read files before further use or storage; how to do this is demonstrated later in this tutorial.

Overview of Fast5 contents

As noted above the ont_h5_validator project contains a full description of the expected contents of a Fast5 file. Here we will briefly highlight the key groups and datasets stored within a Fast5 file.

Using the dataset provided above, let's enumerate the contents of the first file using the h5ls program:
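(A sketch of such a cell, reusing the input_folder value assumed earlier.)

```bash
# recursively list all groups and datasets in the first Fast5 file found
first_file=$(find "$input_folder" -name "*.fast5" | head -n 1)
h5ls -r "$first_file"
```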

The h5ls program can be used to display the datasets and groups of a Fast5 file. Given different options it can also display attributes and more advanced properties of a file.

From the sample file we see a hierarchy of groups and datasets. The highest of these is a group indicating a read together with the unique identifier of the read. Under this top level read group we have:

For a full description of these fields see the documentation on the Nanopore Community.

For files generated by MinKNOW without live basecalling enabled, the Analyses section will be absent (or contain no subgroups). The sample file is one that has been created by Guppy using the --fast5_out option to produce Fast5 files in addition to .fastq.gz files containing solely the basecalls. The Analyses section listed above therefore contains two subgroups: Segmentation_000 and Basecall_1D_000. The first of these contains information regarding how a read has been trimmed by the basecaller into sequencing adapter, barcode and insert regions. The second contains the basecaller outputs, primarily the Fastq dataset but also two additional groups, Move and Trace, which contain advanced basecaller outputs. Again, see the documentation in the Nanopore Community for a full description of these.

An aside on file indexing and compression

The Fast5 files from a MinION run can become fairly sizeable, up to a few hundred gigabytes. Efficient and performant compression and indexing is therefore required.

Indexing reads

For the most part the self-describing and indexed nature of the HDF5 format ensures that data within a file can be quickly retrieved. However, for a MinION run multiple Fast5 files are created, each containing a subset of the sequencing reads produced by the sequencer. Finding the information pertaining to a read with a known ID therefore cannot be done without a supplementary index cross-referencing the reads contained within each file; the alternative is to open all the files in turn and enquire about their contents. The sequencing_summary.txt file produced by both MinKNOW and Guppy provides an index of the reads contained within each Fast5 file. This index can of course be reconstructed if required (as in the case of nanopolish index), though we recommend always storing the sequencing summary with the Fast5 data files.
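For illustration, a minimal read-to-file listing can be pulled out of a sequencing summary with standard text tools; the column positions are looked up from the header, since their order can vary between software versions:

```bash
# build a "read_id <TAB> filename" listing from a sequencing summary
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) {if ($i=="filename") f=i; if ($i=="read_id") r=i}}
            NR>1  {print $r "\t" $f}' sequencing_summary.txt | head
```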

File compression

Due to the large volume of data created by nanopore sequencing devices, Oxford Nanopore Technologies has developed a bespoke compression scheme for ionic current trace data known as VBZ. VBZ is a combination of two open compression algorithms and is itself open and freely available from the GitHub release page. Ordinarily it is not necessary to install the VBZ compression library and HDF5 plugin simply to use MinKNOW and Guppy, as these applications include their own copy of VBZ. However, if you wish to read Fast5 files using third-party applications (such as h5py) you will need to install the VBZ plugin.
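One way to make the plugin visible to third-party HDF5 software is to point the standard HDF5_PLUGIN_PATH environment variable at the directory containing the plugin library downloaded from the release page; the path below is an assumed example location:

```bash
# tell HDF5 (and therefore tools such as h5py) where to find the VBZ plugin
# (the directory shown is an example; use wherever you unpacked the plugin)
export HDF5_PLUGIN_PATH=/opt/ont/vbz/hdf5/plugin
```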

Manipulating Fast5 files

The section above has given an outline of the data contained within a Fast5 file and how the file is arranged. Again, for a more complete description of the contents of these files users are directed to the ont_h5_validator project. In this section we will highlight several methods for manipulating the data contained within Fast5 files.

Oxford Nanopore Technologies provides a Python-based software package for accessing data stored within a set of Fast5 files: ont_fast5_api. For the most part this set of tools hides from the user the need to understand anything about the internals of Fast5 files. Here we will show how to perform some common tasks that might be required when dealing with Fast5 files. For a guide to using ont_fast5_api programmatically please see the documentation.

Converting multi-read files to single-read files

Since some older programs have not been updated to use multi-read files, it can sometimes be necessary to convert such files to the deprecated single-read flavour. To do this run:
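(The command below is a sketch; the input and output paths are example values.)

```bash
# convert multi-read Fast5 files into the deprecated single-read flavour
multi_to_single_fast5 \
    --input_path fast5_sample_data \
    --save_path single_reads \
    --threads 4
```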

The output of the above command is a set of folders each containing a subset of the sequencing reads, one read per file. The filename of each read corresponds to the read's unique identifier.

Converting single-read to multi-read files

A similar program exists to convert single-read files to multi-read files. We recommend that all datasets are updated to multi-read files for longer-term storage. Here we will convert the single-read files created above back to multi-read files:
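(Again a sketch, with example paths and an example batch size.)

```bash
# bundle the single-read files back into multi-read files,
# 4000 reads per output file
single_to_multi_fast5 \
    --input_path single_reads \
    --save_path multi_reads \
    --filename_base prefix \
    --batch_size 4000 \
    --recursive
```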

The output of this command is a single directory containing all multi-read files. The filenames are prefixed with the value given to the --filename_base argument of the program, while the --batch_size argument controls the number of reads per file:

The filename_mapping.txt file cross-references the reads from the input files with the output files.

Creating a listing of reads within multi-read files

As mentioned in the discussion above it can be useful to have an index of which reads are contained within which multi-read files. Usually this indexing is provided by the sequencing_summary.txt file output by MinKNOW and Guppy. However if it is lost, here's a way to recover the information:
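One possible approach, sketched below, is to list the top-level read groups of every file with h5ls and tabulate them against the filename; the multi_reads directory is an example path:

```bash
# write a simple "read_id <TAB> filename" index of all multi-read files
for f in multi_reads/*.fast5; do
    h5ls "$f" | awk -v fname="$(basename "$f")" '{sub("^read_", "", $1); print $1 "\t" fname}'
done > read_index.txt
head read_index.txt
```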

The read_index.txt output file contains the simple index we desire:

Filtering multi-reads by reference locus

The program fast5_subset within ont_fast5_api can be used to create a new file set containing only a subset of reads.

The sample data comes from a microbial mock community. Using the accompanying BAM alignment file, let's find the reads which align to a single reference sequence:
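(A sketch of extracting the read names with samtools; the BAM filename and reference sequence name are assumed example values, and the BAM is assumed to be indexed.)

```bash
# collect the unique names of reads aligned to one reference sequence
samtools view sample.bam "my_reference_contig" | cut -f1 | sort -u > read_list.txt
wc -l read_list.txt
```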

We can now use this file with the subsetting program:
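(A sketch of the subsetting command, using the read list created above and example paths.)

```bash
# create new multi-read files containing only the reads named in read_list.txt
fast5_subset \
    --input fast5_sample_data \
    --save_path subset_fast5 \
    --read_id_list read_list.txt \
    --batch_size 4000
```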

Cleaning multi-read files of Analyses groups

It is sometimes desirable to remove the Analyses groups from multi-read files, for example if live basecalling was performed during a run but the results are not wanted when the data is archived.

To accomplish this task we will use the compress_fast5 program with the --sanitize option:
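(A sketch of the command, with example input and output paths.)

```bash
# rewrite the files without their Analyses groups, keeping VBZ compression
compress_fast5 \
    --input_path multi_reads \
    --save_path sanitized_fast5 \
    --compression vbz \
    --sanitize \
    --threads 4
```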

This achieves an approximate 3.5X reduction in filesize:
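The reduction can be checked by comparing the total size of the input and output directories, for example:

```bash
# compare on-disk size before and after removing the Analyses groups
du -sh multi_reads sanitized_fast5
```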

Summary

In this notebook we have introduced the Fast5 files output by Oxford Nanopore Technologies' sequencing devices, using an exemplar dataset from a MinION run. We have outlined the contents of such files and how they can be manipulated with a selection of common software tools.

The tools presented here can be run on any dataset from an Oxford Nanopore Technologies device. The code will run within the EPI2ME Labs notebook server environment.