Detection of 5-methylcytosine modification in GM24385
Chris Wright
August 27, 2021

Using nanopore sequencing, researchers have directly identified DNA and RNA base modifications at nucleotide resolution, including 5-methylcytosine, 5-hydroxymethylcytosine, N6-methyladenosine, and 5-bromodeoxyuridine in DNA, and N6-methyladenosine in RNA. Detection of other natural or synthetic epigenetic modifications is possible through training basecalling algorithms. One of the most widespread genomic modifications is 5-methylcytosine (5mC), which most frequently occurs at CpG dinucleotides. Compared to whole-genome bisulfite sequencing, the traditional method of 5mC detection, nanopore technology can offer many advantages, which we will explore in this post with the aid of newly released data in our ONT Open Datasets archive.

For more information and help downloading data from our open dataset archive see the Datasets Tutorials page. All the data referred to in this blog can be accessed under:

s3://ont-open-data/gm24385_mod_2021.09/
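The S3 location above corresponds to the HTTPS links used throughout this post. As a minimal sketch, the conversion between the two forms might look like the following. The helper name `s3_to_https` is hypothetical, and the URL pattern is inferred from the links in this post rather than taken from any documented guarantee.

```shell
# Hypothetical helper: convert an s3:// URI into the HTTPS form
# used by the direct links in this post.
s3_to_https() {
    rest="${1#s3://}"       # drop the scheme
    bucket="${rest%%/*}"    # text before the first slash
    key="${rest#*/}"        # text after the first slash
    printf 'https://%s.s3.amazonaws.com/%s\n' "$bucket" "$key"
}

# Example usage:
#   s3_to_https s3://ont-open-data/gm24385_mod_2021.09/extra_analysis/all.cpg.bed
#
# With the AWS CLI installed, and assuming the bucket permits anonymous
# access, the archive can be listed without credentials:
#   aws s3 ls --no-sign-request s3://ont-open-data/gm24385_mod_2021.09/
```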

Below we provide direct links to the more interesting analysis outputs.

The GM24385 cell line samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research.

GM24385 dataset generation

To demonstrate the utility and convenience of Oxford Nanopore Technologies’ sequencing platform for the detection and analysis of 5mC, we have sequenced the HG002 Genome in a Bottle sample GM24385 with both traditional bisulfite sequencing and nanopore sequencing. Both technologies, old and new, were applied to the same sample from a single DNA extraction.

Bisulfite sequencing

Bisulfite sequencing was performed by a commercial provider and processed with the commonly used bismark package to obtain the proportion of reads displaying methylation at CpG sites throughout the whole genome. The primary output of this processing is a single BedGraph-like file describing these proportions:

https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/bisulphite/cpg/CpG.gz.bismark.zero.cov.gz
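For readers who want to inspect this file, the sketch below recomputes per-site coverage and methylation percentage from the read counts. It assumes the usual Bismark coverage layout of six tab-separated columns (chromosome, start, end, percent methylated, count methylated, count unmethylated); the function name `bismark_cov_freq` is ours for illustration.

```shell
# Assumed columns (Bismark coverage format, tab-separated):
#   1 chrom, 2 start, 3 end, 4 percent methylated,
#   5 count methylated, 6 count unmethylated
# Prints: chrom, start, end, coverage, percent methylated
# for sites with at least the requested coverage.
bismark_cov_freq() {
    min_cov="${1:-1}"
    awk -v min="$min_cov" 'BEGIN { FS = OFS = "\t" }
    {
        cov = $5 + $6
        if (cov > 0 && cov >= min)
            printf "%s\t%s\t%s\t%d\t%.2f\n", $1, $2, $3, cov, 100 * $5 / cov
    }'
}

# Example usage (file name as downloaded from the link above):
#   gzip -dc CpG.gz.bismark.zero.cov.gz | bismark_cov_freq 10 | head
```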

Bismark was installed using the mamba package manager. The commands used to produce the above file can be found in:

https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/bisulphite/fastq2bed.sh

with the input read data being located at:

https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/bisulphite/reads/004_0111_001_R1.fq.gz
https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/bisulphite/reads/004_0111_001_R2.fq.gz

Running the bismark analysis pipeline to produce the final BED file took on the order of 4 days on a desktop computer.

Nanopore sequencing

Nanopore sequencing was performed using the same sample of GM24385 material sent for bisulfite sequencing. Sequencing was performed on the MinION platform, across multiple flowcells, as part of ongoing platform development activities. The sequencing was not performed explicitly for the analysis presented here; we are making available all sequencing runs undertaken with this sample for the benefit of the community.

Sequencing was carried out in October 2020 with results presented here being derived from fresh basecalling using Guppy version 5.0.1 and the dna_r9.4.1_450bps_modbases_5mc_hac configuration, the optimal choice for methylation calling of CpG sites.

The simplicity of the analysis workflow presented here leverages the ability of the Guppy basecaller to output BAM files annotated with methylation calls as described in the SAM tags specification found at: https://samtools.github.io/hts-specs. Guppy can be provided with a reference genome and instructed to output BAM files with the Mm and Ml tags described in the specification documents:

guppy_basecaller \
    --config dna_r9.4.1_450bps_modbases_5mc_hac.cfg \
    --device cuda:0 \
    --bam_out --recursive --compress \
    --align_ref <reference fasta> \
    -i <fast5 input directory> -s <output directory>
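To get a feel for the per-base calls in the resulting BAM files, the sketch below converts the modification likelihood array of each read into approximate probabilities. It relies on the specification's mapping of the range 0.0 to 1.0 onto the integers 0 to 255, and reports the midpoint of each bin. The function name `ml_probabilities` is hypothetical; both the `Ml` spelling and the `ML` spelling used in later revisions of the specification are accepted.

```shell
# Reads SAM records on stdin and prints the read name followed by the
# midpoint probability for each entry of the Ml/ML tag.
# An integer value N covers the probability range N/256 to (N+1)/256.
ml_probabilities() {
    awk -F'\t' '{
        # Optional tags start at field 12 of a SAM record.
        for (i = 12; i <= NF; i++) {
            if ($i ~ /^(ML|Ml):B:C,/) {
                n = split(substr($i, 8), v, ",")
                out = $1
                for (j = 1; j <= n; j++)
                    out = out sprintf("\t%.3f", (v[j] + 0.5) / 256)
                print out
            }
        }
    }'
}

# Example usage (requires samtools):
#   samtools view <guppy bam> | ml_probabilities | head
```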

After basecalling, which can be performed live during the sequencing run, a simple one-step process can be used to summarize the BAM files into methylated and unmethylated frequency information akin to the bismark BED file. Our recently developed modbam2bed program is available through conda for both Linux and macOS:

modbam2bed \
    -e -m 5mC --cpg -t 10 \
    <reference fasta> <guppy bams> ... \
    > guppy.cpg.bed

The BED file output by the above conforms to the bedMethyl description from the ENCODE project.
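As an illustration of working with this output, the sketch below selects sites with a minimum read coverage and prints a compact table. It assumes the ENCODE bedMethyl layout, in which column 10 holds the coverage and column 11 the percentage of reads showing methylation; the function name `bedmethyl_sites` is ours, and sites whose percentage is reported as `nan` are skipped.

```shell
# Assumed bedMethyl columns used here (tab-separated):
#   1 chrom, 2 start, 3 end, 6 strand, 10 coverage, 11 percent methylated
# Prints: chrom, start, end, strand, coverage, percent methylated.
bedmethyl_sites() {
    min_cov="${1:-1}"
    awk -v min="$min_cov" 'BEGIN { FS = OFS = "\t" }
    $10 >= min && $11 != "nan" { print $1, $2, $3, $6, $10, $11 }'
}

# Example usage:
#   bedmethyl_sites 10 < guppy.cpg.bed | head
```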

This simple one-step analysis contrasts with the multiple time-consuming steps required to process the raw bisulfite sequencing data to obtain the frequency counts. The program will happily consume multiple BAM files simultaneously (up to limits imposed by the user’s system) to produce aggregated counts. One small wrinkle is that Guppy does not currently produce BAM index files alongside its BAM files, so the user must first index Guppy’s outputs with samtools:

ls <guppy output directory>/*.bam | xargs -n 1 samtools index

A future version of Guppy will correctly output BAM indices such that this step is no longer required. For reference the Open Dataset archive includes a single, consolidated BAM file (and index) for all sequencing runs in the set:

https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/extra_analysis/all.bam
https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/extra_analysis/all.bam.csi

The corresponding BED file with methylation frequencies is available at:

https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/extra_analysis/all.cpg.bed

In contrast to the bisulfite data processing, using modbam2bed to aggregate data from the basecaller outputs takes only minutes on a desktop computer.

Technology Comparison

To further demonstrate the power of nanopore sequencing for the detection of 5mC in CpG sites we will briefly survey the properties of the two sequence datasets. Our aim here is not to analyse the full biology of the sample sequenced but merely to illustrate that nanopore sequencing data is a convenient and accurate replacement for bisulfite sequencing.

Firstly, as we have shown previously, nanopore sequencing does not suffer from the GC-content contextual bias often associated with short-read bisulfite sequencing, so accurate methylation frequencies can be obtained throughout entire genomes. Figure 1 depicts the read coverage of CpG sites in chromosome 1 for both the nanopore and bisulfite sequencing experiments. The nanopore data show a tighter coverage distribution with very few sites of low coverage. By contrast, the bisulfite sequencing shows a tail of low coverage (both absolute and relative to the mean coverage), with a noticeable spike close to zero coverage.

Figure 1. Comparison of sequencing coverage in bisulfite and nanopore sequencing.
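A rough numerical counterpart to this comparison can be obtained by summarising per-site coverage values. The sketch below reads one coverage value per line and reports the number of sites, the mean coverage, and the number of sites below a chosen threshold. The function name `coverage_summary` and the default threshold of 5 are illustrative choices, not values used to produce the figure.

```shell
# Reads one coverage value per line on stdin.
# Prints the site count, mean coverage, and number of sites
# with coverage below the threshold given as the first argument.
coverage_summary() {
    low="${1:-5}"
    awk -v low="$low" '
    { n++; sum += $1; if ($1 < low) nlow++ }
    END {
        if (n > 0)
            printf "sites=%d\tmean=%.2f\tbelow_%d=%d\n", n, sum / n, low, nlow
    }'
}

# Example usage, assuming coverage is column 10 of the bedMethyl file and
# the sum of columns 5 and 6 of the Bismark coverage file:
#   cut -f10 all.cpg.bed | coverage_summary 5
#   gzip -dc CpG.gz.bismark.zero.cov.gz | awk '{ print $5 + $6 }' | coverage_summary 5
```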

Of course, simply achieving low coverage bias does not guarantee acceptable results; the methylation frequencies obtained must also be correct. To this end, Figure 2 illustrates the correlation between the calculated bisulfite frequencies and those obtained from nanopore sequencing.

Figure 2. Heatmap indicating correlation between CpG site methylation frequencies from bisulfite and nanopore sequencing.

There is a strong correlation (R=0.943) between the per-site methylation proportions calculated from the two technologies. We note that commercial bisulfite providers typically quote a bisulfite conversion error of around 2%, which goes some way to explaining the lack of perfect correlation. The use of megalodon can further improve the accuracy of 5mC identification.
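For readers exploring the data themselves, a Pearson correlation coefficient of this kind can be computed from paired per-site frequencies with a few lines of awk. The sketch below assumes a two-column input (one pair of frequencies per line) has already been prepared. Pairing sites between the two files, including matching chromosome, position, and strand conventions, is left out, and this is not necessarily the procedure used to produce the value quoted above. The function name `pearson_r` is hypothetical.

```shell
# Reads two whitespace-separated numeric columns (x and y) on stdin
# and prints the Pearson correlation coefficient to four decimal places.
# Nothing is printed if fewer than two pairs are given or if either
# column has zero variance.
pearson_r() {
    awk '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
        den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
        if (n > 1 && den > 0)
            printf "%.4f\n", (n * sxy - sx * sy) / den
    }'
}

# Example usage, with paired.tsv a hypothetical file holding the bisulfite
# and nanopore percentages for matched sites:
#   pearson_r < paired.tsv
```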

Through examining the data provided in the Open Dataset archive we invite users to explore both datasets in more detail.

Discussion

In this short post we have introduced matched bisulfite and nanopore sequencing datasets of a single DNA extraction from a GM24385 cell line sample. We have shown how 5mC identification can be performed easily, without time-consuming or specialised sample preparation or data analysis, by utilising Oxford Nanopore Technologies’ sequencing platforms.

The dataset is provided for use by the community. We hope that it will aid in development of new and existing tools such as mbtools, methplotlib, and pycoMeth.


Tags

#modifiedbases #gm24385 #ont-open-data #human cell-line #R9.4.1

© 2020 - 2021
Oxford Nanopore Technologies
All Rights Reserved.
