Modified Base Tutorial

The modified base tutorial is intended as a simple guide for handling the modified-base information output by the Guppy basecaller. We will walk through the data provided in the .bam file output by Guppy and demonstrate how to convert this this data to per locus modified-base frequencies.

The tutorial is provided with a sample dataset for chromosome 20 of the GIAB NA24385 sample, and the workflow can be used to address such questions as:

Methods used in this tutorial include:

Computational requirements for this tutorial include:

⚠️ Warning: This notebook has been saved with its outputs for demostration purposed.

Getting started

To start analysing our experiment we must first collate our data. The workflow below expects to be given a single data folder. It will search this folder to find .fast5 files containing modified base data from the Guppy basecaller.

Before anything else we will create and set a working directory:

Install additional software

This tutorial uses a couple of software packages that are not included in the default EPI2ME Labs server. Below we will install:

Please note that the software installed is not persistent and this step will need to be re-run if you stop and restart the EPI2ME Labs server

Sample Data

To use this tutorial with sample data we can download the files using the linux command wget. To execute the command click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

For our analysis we will also need the human genome reference:

The reference contained within the above download is that which can also be obtained here, the reasons for using this file are presented in Heng Li's blog post.

The sample data taken from our ONT Open Datasets archive. The full dataset comprises MinION sequencing runs, the data used here has been prefiltered to reads derived from human chromosome 20 to provide a small example dataset.

To view the outcome of the download we can use the tree command to show the contents of the working directory:

The files should also appear in the File Browser to the left-hand side of the screen.

Modified base base-calling

The Guppy basecaller software includes a model for the detection of base modifications. The version of Guppy at the time of writing (5.0.12) supports modifications that include 5mC in a CpG context, Dcm (5mC motifs CCAGG and CCTGG) and Dam, m6A GATC motif.

By default, Guppy will not produce modified base statistics in its outputs. In order to obtain these it is necessary to use an non-default configuration file and enable the alignment output. For example one should run a command such as:

guppy_basecaller \
    --config dna_r9.4.1_450bps_modbases_5mc_hac.cfg \
    --device cuda:0 \
    --bam_out --recursive --compress \
    --align_ref <reference fasta> \
    -i <fast5 input directory> -s <output directory>

Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the values in the form below. To find the correct full path of a directory you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

image.png

The location shared with the EPI2ME labs server from your computer will show as /epi2melabs, for example a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.

Data entry

Having downloaded the sample data, or locating your own data in the file browser, we need to provide the filepaths as input to the notebook. This is done in the form below. After updating the values, be sure to press the > Enter button.

Summarising the data

In order to enable further analysis we will first summarise the data from Guppy's basecalling outputs. We will do this using the modbam2bed program to create a base-count summary text file.

Creating a tabular summary

With our sequencing data in a .bam, we can proceed to analyse the data with standard bioinformatics tools.

After basecalling, which can be performed live during the sequencing run, a simple one step process can be used to summarize the BAM files into methylated and unmethylated frequency information akin to that obtained for bisulfite sequencing using packages such as bismark. For this we use our recently developed modbam2bed program:

Here the option -m 5mC instructs the program to examine 5-methylcytosine presence, while the option --cpg indicates that only loci containing the sequence motif CG should be output. Other choices can be found by running:

The output file guppy.cpg.bam is formatted as a bedMethyl file, which is a simple tab-separated text file that can be read with pandas:

The column names should be self explanatory. The first three columns denote simply: the reference sequence (chrom) and the start and end positions (invariably one apart). The next column indicates the name of the modified base analysis, the score column indicates the reliability of the result and is described in the modbam2bed documentation. The strand columns indicated on which strand of the reference the modification is found. tstart, tend, and color are part of the Encode project description of the bedMethyl format but can be ignored. The coverage column indicates the number of reads spanning the locus while freq indicates the modification frequency (percentage of reads with modification). The canon, mod, and filt columns indicate the absolute count of reads containing the canonical base associated with the modification, the count of reads with the modified base, and the number of ambiguous (filtered) reads. The coverage column is not the sum of these final three columns as it also includes reads with deletions and substitutions at the locus of interest.

Analysis of the summary data

In this section we will perform some basic analysis of the summary count data. From this table alone interesting observations can be made.

Aggregated methylation status

We will start our analysis by further summarising the summary table to pull out average methylation rates and examine potential strand bias.

We start by assessing the coverage of our candidate modified-base sites:

We can use the bulk coverage data to determine a filter to exclude sites with lower than expected coverage, as these may skew our observations.

The following summarises methylation status across the dataset:

Assessing methylation locality

It is known that CpG sites occur less frequently than random chance in the human genome, and that further the sites which do exist tend to be clustered in so-called CpG islands. These islands typically occur at transcription start sites, and it is found that methylation rates in CpG islands are low for genes which are suppressed. To enable analysis of these effects we can plot the summary data according to genomic coordinates.

To plot a coverage trace across a reference sequence use the form below. This allows exploration of position dependent modified-base rates and any strand biases present.

Comparison to Bisulphite sequencing data

As a technology demonstrator this section of the tutorial shows an approach through which the Guppy-based base-modification information can be compared with data from Bisulphite sequencing experiments.

The bisulphite data used in this analysis has been sourced from the same GM24385 cell line. The data were derived from the same sample extraction as the nanopore data. See Detection of 5-methylcytosine modification in GM24385 for more details.

The bisulphite data can be downloaded using:

and read with pandas:

We can depict the concordance between the methylation frequencies using a heat map:

Next steps

This tutorial has stepped through basic handling of the modified-base output from the Guppy basecaller. We have seen how to aggregate read data into a .bam alignment file for easy processing. Having aggregated read data we have performed some cursory analysis to motivate and guide further targetted investigation.

Users might also be interested in the following packages to help guide further exploration of their data: