# Modified Base Tutorial

The modified base tutorial is intended as a simple guide for handling the modified-base information output by the Guppy basecaller. We will walk through the data provided in the .fast5 file output by Guppy and demonstrate a method to store the data in .bam format to make it easier to handle with common standard bioinformatic tools.

The tutorial is provided with a sample dataset for chromosome 20 of the GIAB NA12878 sample, and the workflow can be used to address such questions as:

• How can CpG related 5mC base-modifications be extracted from a FAST5 file?
• Where in the genome do these base-modifications occur?
• How prevalent are base-modifications at given genomic locations?

Methods used in this tutorial include:

Computational requirements for this tutorial include:

• Computer running the EPI2ME Labs notebook Server
• At least 8 Gb RAM

⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

## Getting started¶

To start analysing our experiment we must first collate our data. The workflow below expects to be given a single data folder. It will search this folder to find .fast5 files containing modified base data from the Guppy basecaller.

Before anything else we will create and set a working directory:

This tutorial uses a couple of software packages that are not included in the default EPI2ME Labs server. Below we will install the fast5mod package.

Please note that the software installed is not persistent and this step will need to be re-run if you stop and restart the EPI2ME Labs server

### Sample Data¶

To use this tutorial with sample data we can download the files using the linux command wget. To execute the command click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

To get started we will download a sample sequencing dataset. The data is available in two forms:

• .fast5 files: files output by the Guppy basecaller (~180Gb).
• .bam file: a processed version of the Guppy .fast5 files with reads aligned to a reference sequences and modified-base data stored within an industry standard format. (~8Gb).

The .bam download has the advantage of being much smaller. In the tutorial we will show how it is created from the .fast5 files, this step can naturally be skipped if the .bam is used. The .fast5 files are not used in this tutorial after the conversion step.

To view the outcome of the download we can use the tree command to show the contents of the working directory:

The files should also appear in the File Browser to the left-hand side of the screen.

For our analysis we will also need the human genome reference:

The reference contained within the above download is that which can also be obtained here, the reasons for using this file are presented in Heng Li's blog post.

The sample data is a set of .fast5 files from a PromethION sequencing experiment. They have been prefiltered to reads derived from human chromosome 20. This was achieved by using minimap2 to align reads to the human reference genome, filtering the results to include only chromosome 20 alignments, extracting the read identifiers, and finally using the filter_reads program from fast5_research

If you wish to analyse your own data rather than the sample data, you can edit the value .fastq input variable below. To find the correct full path of a directory you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

The location shared with the EPI2ME labs server from your computer will show as /epi2melabs, for example a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.

### Data entry¶

Having downloaded the sample data, or locating your own data in the file browser, we need to provide the filepaths as input to the notebook. This is done in the form below. After updating the values, be sure to press the > Enter button.

## Modified base base-calling¶

The Guppy basecaller software includes a model for the detection of base modifications. The version of Guppy at the time of writing (3.3.3) supports modifications that include 5mC in a CpG context, Dcm (5mC motifs CCAGG and CCTGG) and Dam, 6mA GATC motif.

The base-calling software outputs a .fastq file that includes the canonical sequence base and augments the .fast5 file with a table of probabilities that represent that a given base has been modified. The figure below shows the contents of a .fast5 file visualised with the hdfview software. The figure that shows for a single selected read the fastq and base-modification probabilities and their hierarchical structure.

The probability table to the right-hand side deserves some explanation.

The table is a two-dimensional array, where each row of the table corresponds to a base in the basecalled sequence; there are as many rows in the table as there are bases in the basecall.

The columns of the table correspond to different base identities. The ordering of the columns is defined by the precise basecaller model used, but is commonly:

A, 6mA, C, 5mC, G, T



(The ont-fast5-api can be used to determine this order for a given file).

Entries in the table score the relative probability that a position in the basecall should be determined to be a modified (or canonical) base. The scoring is expressed in a an integral range 0-255; the scores for a set of mutual bases (e.g. A and 6mA, or C and 5mC) will sum to 255. The table should be intepreted jointly with the basecall sequence, for example the row:

[63, 192, 255, 0, 255, 255]



in itself indicates:

• If an A is found in the basecall at this postion, the score for a canonical A is 63 (which we might interpret as a ~25% probability), and that for 6mA is 192 (~75%).
• Given that a C was called, the score for canonical C is 255 (100%), with that for 5mC being 0.
• Given that a G was called, the score for canonical G is 255. For a model with not alternative G forms this will always be the case.
• Given that a T was called, the score for canonical T is 255, again because in this example there are no modified T bases.

Therefore, comparing the values in the table with the bases from the base-called sequence can inform us which bases of the sequenced strand are likely to be modified bases.

### Running Guppy for modified-base base calling¶

By default, Guppy will not produce modified base statistics in its outputs. In order to obtain these it is necessary to use an non-default configuration file and enable the fast5 output. For example one should run a command such as:

guppy_basecaller \
--save_path <output path> --input_path <input path> \
--compress_fastq --fast5_out \
--config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac_prom.cfg

## Summarising the data¶

In order to enable further analysis we will first summarise the data from Guppy's basecalling outputs. There are two steps to this process. We will extract pertinent information from the .fast5 files, align the sequencing reads to the reference sequence, and store the results in a .bam alignment file. In a second step we will create a per-reference position summary of the data.

### Conversion of .fast5 to .bam¶

Note: for the sample data, this step can be skipped if you downloaded the .bam data. Instead run:

!mv onll04465.bam "$output_folder"/methylation.bam !mv onll04465.bam.bai "$output_folder"/methylation.bam.bai

in a code cell. (This is commented out in the codecell below).

In order to convert our Guppy outputs into reference aligned data we will make use of a program from medaka.

When executed the following command runs the medaka methylation guppy2sam program on all .fast5 files under the \$data_folder directory that was specified above. It does this using 4 processes to read data from the .fast5 files (io_workers) and 8 alignment threads (workers) for aligning the basecalls to the reference genome. The output of this first program is "piped" through the commands samtools sort and samtools view which sort the alignment records by chromosome position and convert the data to a binary .bam format respectively. In a discrete second step the .bam file is indexed by samtools index.

The modified base scores are stored in two BAM tags, MC and MA, one each for 5mC and 6mA respectively. The tags are 8bit integer array-values, one value per basecall position, they correspond directly to the two columns from the probability table store by Guppy in the .fast5 files.

The MC and MA tags are a different storage form to that proposed in the current hts-specs proposition, but allow for more trivial parsing.

### Creating a tabular summary¶

With our sequencing data in a .bam, we can proceed to analyse the data with standard bioinformatics tools.

Again, medaka contains a handy tool for summarising the methylation-data containing .bam file into a tabular summary. For our immediate purposes we will simply run this tool and then discuss its output:

Here the option --meth cpg indicates that loci containing the sequence motif CG should be examined for 5mC presence. Other choices are dcm (motifs CCAGG and CCTGG) for 5mC and dam (GATC) for 6mA.

The output file from the above command is a simple tab-delimited text file. We can inspect the file using the pandas library:

The column names should be self explanatory. The first three columns denote simply: the reference sequence (ref.name, position along their reference sequence (position), and sequence motif at this reference position (motif). The following four columns count the occurrence of reads judged to possess a modified (meth) or canonical (canon) base at the sequence position. Counts are given for both the forward (fwd) and reverse (rev) strands. The final five columns are computed within the code above and summarise simply the previous four columns.

The format of this file may change in a future medaka release to match the output of other software projects.

## Analysis of the summary data¶

In this section we will perform some basic analysis of the summary count data. From this table alone interesting observations can be made.

### Aggregated methylation status¶

We will start our analysis by further summarising the summary table to pull out average methylation rates and examine potential strand bias.

We start by assessing the coverage of our candidate modified-base sites:

We can use the bulk coverage data to determine a filter to exclude sites with lower than expected coverage, as these may skew our observations.

The following summarises methylation status across the dataset: