Using nanopore sequencing, researchers have directly identified DNA and RNA base modifications at nucleotide resolution, including 5-methylcytosine, 5-hydroxymethylcytosine, N6-methyladenosine, and 5-bromodeoxyuridine in DNA, and N6-methyladenosine in RNA. Detection of other natural or synthetic epigenetic modifications is possible through training basecalling algorithms. One of the most widespread genomic modifications is 5-methylcytosine (5mC), which most frequently occurs at CpG dinucleotides. Compared to whole-genome bisulfite sequencing, the traditional method of 5mC detection, nanopore technology offers several advantages, which we explore in this post with the aid of newly released data in our Oxford Nanopore Open Data archive.
For more information and help downloading data from our open dataset archive see the Datasets Tutorials page. All the data referred to in this blog can be accessed under:
s3://ont-open-data/gm24385_mod_2021.09/
Below we provide direct links to the more interesting analysis outputs.
The GM24385 cell line samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research.
To demonstrate the utility and convenience of Oxford Nanopore Technologies’ sequencing platform for the detection and analysis of 5mC, we have sequenced the HG002 Genome in a Bottle sample GM24385 with both traditional bisulfite sequencing and nanopore sequencing. Both technologies, old and new, were applied to the same sample from a single DNA extraction.
Bisulfite sequencing
Bisulfite sequencing was performed by a commercial provider and processed with the commonly used bismark package to obtain the proportion of reads displaying methylation at CpG sites throughout the whole genome. The primary output of this processing is a single BedGraph-like file describing these proportions:
https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/bisulphite/cpg/CpG.gz.bismark.zero.cov.gz
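As a minimal sketch of how such a file can be read, the snippet below parses one line of a bismark coverage file. It assumes the usual six-column, tab-separated bismark layout (chromosome, start, end, methylation percentage, methylated count, unmethylated count); the example line is invented for illustration and is not taken from the dataset, and the names `BismarkSite` and `parse_bismark_cov_line` are hypothetical.

```python
from typing import NamedTuple


class BismarkSite(NamedTuple):
    """One CpG record from a bismark coverage file (assumed six-column layout)."""
    chrom: str
    start: int
    end: int
    methylated: int
    unmethylated: int

    @property
    def coverage(self) -> int:
        # Total number of reads contributing a call at this site.
        return self.methylated + self.unmethylated

    @property
    def fraction(self) -> float:
        # Proportion of reads showing methylation; NaN when no reads cover the site.
        return self.methylated / self.coverage if self.coverage else float("nan")


def parse_bismark_cov_line(line: str) -> BismarkSite:
    # The percentage column is skipped: it can be recomputed from the two counts.
    chrom, start, end, _percent, n_meth, n_unmeth = line.rstrip("\n").split("\t")[:6]
    return BismarkSite(chrom, int(start), int(end), int(n_meth), int(n_unmeth))


# Invented example record, for illustration only.
site = parse_bismark_cov_line("chr1\t10468\t10469\t75.0\t3\t1")
print(site.chrom, site.start, site.coverage, site.fraction)
```

Recomputing the proportion from the counts, rather than trusting the percentage column, keeps the coverage available for later filtering.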
Bismark was installed using the mamba package manager; the commands used to produce the above file can be found in:
https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/bisulphite/fastq2bed.sh
with the input read data being located at:
https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/bisulphite/reads/004_0111_001_R1.fq.gz
https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/bisulphite/reads/004_0111_001_R2.fq.gz
Running the bismark analysis pipeline to produce the final BED file took on the order of 4 days on a desktop computer.
Nanopore sequencing
Nanopore sequencing was performed using the same sample of GM24385 material sent for bisulfite sequencing. Sequencing was performed on the MinION platform, across multiple flowcells, as part of ongoing platform development activities. The sequencing was not performed explicitly for the analysis presented here; we are making available all sequencing runs undertaken with this sample for the benefit of the community.
Sequencing was carried out in October 2020; the results presented here are derived from fresh basecalling using Guppy version 5.0.1 with the dna_r9.4.1_450bps_modbases_5mc_hac configuration, the optimal choice for methylation calling of CpG sites.
The analysis workflow presented here is simple because the Guppy basecaller can output BAM files annotated with methylation calls, as described in the SAM tags specification found at: https://samtools.github.io/hts-specs.
Guppy can be provided with a reference genome and instructed to output BAM files with the Mm and Ml tags described in the specification documents:
guppy_basecaller \
    --config dna_r9.4.1_450bps_modbases_5mc_hac.cfg \
    --device cuda:0 \
    --bam_out --recursive --compress \
    --align_ref <reference fasta> \
    -i <fast5 input directory> -s <output directory>
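To make the Mm and Ml tags more concrete, the following sketch decodes them for a single read. It is an illustration under stated assumptions, not Guppy's or any library's implementation: it handles only forward-strand reads, `+` strand entries, and a single modification code per entry, and it reports the midpoint of the probability bin encoded by each Ml value. The function name `decode_mod_tags` and the example read are hypothetical.

```python
def decode_mod_tags(seq, mm, ml):
    """Return (read_position, mod_code, probability) for each modification call.

    seq: read sequence (forward-strand read assumed)
    mm:  value of the Mm tag, e.g. "C+m,0,1;"
    ml:  list of integers (0-255) from the Ml tag, in the same order as Mm
    """
    calls = []
    ml_values = iter(ml)
    for entry in filter(None, mm.split(";")):
        fields = entry.split(",")
        header, deltas = fields[0], [int(d) for d in fields[1:]]
        base, strand = header[0], header[1]
        code = header[2:].rstrip("?.")  # drop optional status flag
        if strand != "+":
            raise ValueError("this sketch only handles '+' strand entries")
        # Positions in the read of every occurrence of the canonical base.
        base_positions = [i for i, b in enumerate(seq) if b == base]
        index = -1
        for delta in deltas:
            # Each delta is the number of occurrences of the base to skip.
            index += delta + 1
            quality = next(ml_values)
            # An Ml value N covers probabilities N/256 to (N+1)/256; use the midpoint.
            calls.append((base_positions[index], code, (quality + 0.5) / 256))
    return calls


# Invented example: cytosines sit at read positions 1, 3 and 5.
calls = decode_mod_tags("ACGCTCGA", "C+m,0,1;", [230, 12])
for position, code, probability in calls:
    print(position, code, round(probability, 3))
```

In practice a BAM library would be used to read the tags from each alignment; the sketch only shows how the skip counts and scaled probabilities map back onto read positions.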
After basecalling, which can be performed live during the sequencing run, a simple one-step process can be used to summarize the BAM files into methylated and unmethylated frequency information akin to the bismark BED file. Our recently developed modbam2bed program is available through conda for both Linux and macOS:
modbam2bed \
    -e -m 5mC --cpg -t 10 \
    <reference fasta> <guppy bams> ... \
    > guppy.cpg.bed
The BED file output by the above conforms to the bedMethyl description from the ENCODE project.
This simple one-step analysis contrasts with the multiple time-consuming steps required to process the raw bisulfite sequencing data to obtain the frequency counts. The program will happily consume multiple BAM files simultaneously (up to limits imposed by the user’s system) to produce aggregated counts. One small wrinkle is that Guppy does not currently produce BAM index files alongside its BAM files, so the user must first index Guppy’s outputs with samtools:
ls <guppy output directory>/*.bam | xargs -n 1 samtools index
A future version of Guppy will output BAM indices so that this step is no longer required. For reference, the Oxford Nanopore Open Data archive includes a single, consolidated BAM file (and index) for all sequencing runs in the set:
https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/extra_analysis/all.bam
https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/extra_analysis/all.bam.csi
The corresponding BED file with methylation frequencies is available at:
https://ont-open-data.s3.amazonaws.com/gm24385_mod_2021.09/extra_analysis/all.cpg.bed
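A bedMethyl file such as this can be read with a few lines of Python. The sketch below assumes the eleven standard ENCODE bedMethyl columns, with read coverage in column 10 and percentage methylation in column 11; any extended columns appended after those are ignored. The example record and the name `parse_bedmethyl_line` are invented for illustration.

```python
def parse_bedmethyl_line(line):
    """Parse the standard bedMethyl columns from one whitespace-separated line."""
    fields = line.split()
    return {
        "chrom": fields[0],
        "start": int(fields[1]),    # 0-based start, as in BED
        "end": int(fields[2]),
        "strand": fields[5],
        "coverage": int(fields[9]),           # reads covering the site
        "percent_modified": float(fields[10]),  # 0-100
    }


# Invented example record, for illustration only.
record = parse_bedmethyl_line(
    "chr1\t10468\t10469\t5mC\t20\t+\t10468\t10469\t0,0,0\t20\t85.00"
)
print(record["chrom"], record["start"], record["coverage"],
      record["percent_modified"] / 100)
```

Dividing the percentage by 100 gives a proportion that is directly comparable with one computed from the bismark counts.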
In contrast to the bisulfite data processing, using modbam2bed to aggregate data from the basecaller outputs takes only minutes on a desktop computer.
To further demonstrate the power of nanopore sequencing for the detection of 5mC in CpG sites we will briefly survey the properties of the two sequence datasets. Our aim here is not to analyse the full biology of the sample sequenced but merely to illustrate that nanopore sequencing data is a convenient and accurate replacement for bisulfite sequencing.
Firstly, as we have shown previously, nanopore sequencing does not suffer from the GC-content contextual bias often associated with short-read bisulfite sequencing, so accurate methylation frequencies can be obtained throughout entire genomes. Figure 1 depicts the read coverage of CpG sites in chromosome 1 for both the nanopore and bisulfite sequencing experiments. The nanopore data show a tighter coverage distribution with very few sites of low coverage. By contrast, the bisulfite sequencing shows a tail of low coverage (both absolute and relative to the mean coverage), with a noticeable spike close to zero coverage.
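One way to put numbers on a coverage distribution like this is to summarize per-site coverage values. The snippet below is a small sketch using toy values rather than the real data; the function name `coverage_summary` and the low-coverage threshold of 5 reads are arbitrary choices made for illustration.

```python
from statistics import mean, median


def coverage_summary(coverages, low=5):
    """Summarize per-site read coverage: centre of the distribution and low tail."""
    n = len(coverages)
    return {
        "mean": mean(coverages),
        "median": median(coverages),
        # Fraction of sites with fewer than `low` reads (the low-coverage tail).
        "frac_low": sum(c < low for c in coverages) / n,
        # Fraction of sites with no reads at all.
        "frac_zero": sum(c == 0 for c in coverages) / n,
    }


# Toy coverage values, for illustration only.
summary = coverage_summary([0, 2, 30, 31, 29, 28])
print(summary)
```

Applied to the coverage columns of the two BED files, a summary of this kind would expose the low-coverage tail without needing to plot the full histogram.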
Of course, simply achieving low coverage bias does not guarantee acceptable results; the methylation frequencies obtained must also be correct. To this end, Figure 2 illustrates the correlation between the calculated bisulfite frequencies and those obtained from nanopore sequencing.
There is a strong correlation (R=0.943) between the per-site methylation proportions calculated from the two technologies. We note that commercial bisulfite providers typically quote a bisulfite conversion error of around 2%, which goes some way to explaining the lack of perfect correlation. The use of megalodon can further improve the accuracy of 5mC identification.
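A per-site comparison of this kind can be sketched as follows: sites present in both datasets are matched by position, and the Pearson correlation of their methylation proportions is computed. This is an illustration only and not the procedure used to produce the figure; the keying by (chromosome, start), the function names, and the toy values are all assumptions.

```python
from math import sqrt


def matched_frequencies(a, b):
    """Pair up methylation proportions for sites present in both datasets.

    a, b: dicts mapping (chrom, start) -> methylation proportion (0-1).
    """
    shared = sorted(set(a) & set(b))
    return [a[k] for k in shared], [b[k] for k in shared]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy proportions, for illustration only.
bisulfite = {("chr1", 100): 0.9, ("chr1", 200): 0.1,
             ("chr1", 300): 0.5, ("chr1", 400): 0.8}
nanopore = {("chr1", 100): 0.8, ("chr1", 200): 0.2,
            ("chr1", 300): 0.5, ("chr2", 50): 0.3}

xs, ys = matched_frequencies(bisulfite, nanopore)
r = pearson_r(xs, ys)
print(len(xs), r)
```

With real data it would be sensible to apply a minimum coverage filter to both datasets before matching, since proportions estimated from very few reads are noisy.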
We invite users to explore both datasets in more detail using the data provided in the Oxford Nanopore Open Data archive.
In this short post we have introduced matched bisulfite and nanopore sequencing datasets of a single DNA extraction from a GM24385 cell line sample. We have shown how 5mC identification can be performed easily, without time-consuming or specialised sample preparation or data analysis, by utilising Oxford Nanopore Technologies’ sequencing platforms.
The dataset is provided for use by the community. We hope that it will aid in development of new and existing tools such as mbtools, methplotlib, and pycoMeth.