Creating correlation plots from bedMethyl files

By Chris Wright
Published in How Tos
September 22, 2023
1 min read
Creating correlation plots from bedMethyl files

In this vignette we will show how we have in the past produced correlation plots showing concordance between modified base frequencies calculated from bisulfite sequencing and sequencing with Oxford Nanopore Technologies sequencing devices.

This article is intended to give a leg up to bioinformaticians exploring modified base information stored in BAM files. If you are more generally interested in preparing 5mC and 5hmC data for onward investigation you may like to consider our wf-human-variation workflow.

Without further ado, we start from BAM files output by either the dorado or guppy basecallers. To create a bedMethyl file we have two options:

  • The deprecated program modbam2bed
  • The next generation program modkit

We stress that modbam2bed is deprecated and modkit should be used in preference. There are no known bugs in modbam2bed, it is deprecated ony because the more fulsome package modkit subsumes its functionality. We present code for using the output of both only because users may have historic data from either. This article is not intended as a full user guide to these programs, please see their respective documentation and command-line help instructions for additional information.

To create a bedMethyl file using modbam2bed we can run:

# an input BAM file and reference sequence
BAM=chr20.bam
REF=GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
modbam2bed \
--cpg --threads 24 \
--region chr20:0-99999999 \
--combine --extended \
"${REF}" "${BAM} > modbam2bed.bed

We can run modkit to produce a file similar to the modbam2bed.cpg.acc.bed file:

modkit pileup \
"${BAM}" modkit.bed \
--region chr20 --threads 24 \
--ref "${REF}" --preset traditional

The results of the two programs are broadly similar, but not identical.

The output modbam2bed.bed contains rows for each strand of DNA. The program can produce results aggregated across neighbouring reference positions (corresponding to the two cytosine bases on opposite strands in a CpG motif). We produce the stranded counts here for illustrative purposes. There are other subtleties in how the two programs handle data and create their outputs, which are broadly of little consequence for our purposes here. We encourage you to explore to options of modkit to produce the outputs required for your use case.

So to the aim of this how to article. Let’s load the data into Python and start plotting:

# All the packages we need for loading the data and plotting
import aplanat
from aplanat import spatial
from bokeh.models.widgets import Tabs, Panel
import numpy as np
import pandas as pd
# the input files
bisulfite_file = "CpG.gz.bismark.zero.cov.gz"
modbam2bed_file = "modbam2bed.bed"
modkit_file = "modkit.bed"
# load bisulfite data
# we need to calculate the coverage here
bis = pd.read_csv(
bisulfite_file,
sep="\t", header=None, engine="c",
dtype={
'chrom':str, 'start':int, 'end':int,
'freq':float, 'mod':int, 'canon':int},
names=['chrom', 'start', 'end', 'freq', 'mod', 'canon']
)
bis["coverage"] = bis["mod"] + bis["canon"]
# load modbam2bed data
# the file contains counts from + and - strands separately, which
# are combined below. See docs for further information.
modbam = pd.read_csv(
modbam2bed_file,
sep="\t", header=None, engine="c",
usecols=["chrom", "start", "coverage", "freq", "strand", "mod"],
dtype={
"chrom":str, "start":int, "coverage":int,
"freq":float, "strand":str, "mod":int},
names=[
"chrom", "start", "end", "name", "score", "strand",
"tstart", "tend", "color",
"coverage", "freq",
"canon", "mod", "filt", "nocall", "altmod"])
# load modkit data
# nothing special to do here. See docs for further information
modkit = pd.read_csv(
modkit_file,
sep="\s+", header=None, engine="c",
usecols=["chrom", "start", "end", "coverage", "freq"],
dtype={
"chrom":str, "start":int, "end":int,
"coverage":int, "freq":float},
names=[
"chrom", "start", "end", "name", "score", "strand",
"tstart", "tend", "color",
"coverage", "freq", "mod", "canon", "other", "del", "fail", "diff", "nocall"])

With data loaded we can plot heat maps to illustrate correlations between the datasets:

# a function that create the plot for two datasets
def make_heatmap(bis, nano, title):
r_coeff = np.corrcoef(bis, nano)
r_coeff = r_coeff[0,1]
rmse = np.sqrt(np.mean((bis-nano)**2))
# the curious bin numbers here attempt to mitigate aliasing artefacts around
# rational modification fractions.
p = spatial.heatmap2(
bis, nano,
tools = "pan,wheel_zoom,box_zoom,reset",
log=True, x_bins=53, y_bins=53, xlim=(0,100), ylim=(0,100), zlim=(100, 100000),
title="Methylation comparison. R={:.3f}, RMSE={:.3f}".format(r_coeff, rmse))
p.match_aspect = True
p.aspect_ratio = 1.2
p.xaxis.axis_label = 'Bisulphite Methylation Frequency'
p.yaxis.axis_label = f'{title} Methylation Frequency'
p.toolbar.logo = None
return p
# filter the bisulphite data to chromosome 20
select = {"chr20"}
bis_sel = bis.loc[(bis["chrom"].isin(select))]
plots = dict()
for ver, df in zip(("modbam2bed", "modkit"), (modbam, modkit)):
# filter to chrom 20 and drop any rows with NaNs
sel = df.loc[(df["chrom"].isin(select))].dropna()
if ver == "modbam2bed":
# combine strands
sel.loc[sel.strand == "-", 'start'] -= 1
sel = sel.groupby(['chrom','start']).sum().reset_index()
sel['freq'] = 100 * sel['mod'] / sel['coverage']
# perform an inner join on genomic location
combined = pd.merge(
bis_sel, sel, on=["chrom", "start"], suffixes=[".bis", ".nano"])
# filter data to where both bisulfite and nanopore have >20 reads
plot_data = combined.loc[
(combined['coverage.bis'] > 20) &
(combined['coverage.nano'] > 20)]
# make a plot
plot = make_heatmap(plot_data["freq.bis"], plot_data["freq.nano"], ver)
plots[ver] = Panel(child=plot, title=ver)
tabs = Tabs(tabs=list(plots.values()))
aplanat.show(tabs)

The output of the above will look something like the plots below.

Figure 1. Correlation plots generated through the example code using both modbam2bed and modkit to extract modified base information from a BAM file output by either of the basecallers guppy or dorado.

This concludes our short tour of handling bedMethyl files from our deprecated and current tools for handling modified base information present in BAM files from Oxford Nanopore Technologies’ basecallers. We hope this information provides useful to the community.


Tags

#bioinformatics#modified bases#bisulfite#5mC

Share

Chris Wright

Chris Wright

Senior Director, Customer Workflows

Related Posts

How to build and use databases to run wf-metagenomics and wf-16s offline
October 17, 2023
7 min

Quick Links

TutorialsWorkflowsOpen DataContact

Social Media

© 2020 - 2024 Oxford Nanopore Technologies plc. All rights reserved. Registered Office: Gosling Building, Edmund Halley Road, Oxford Science Park, OX4 4DQ, UK | Registered No. 05386273 | VAT No 336942382. Oxford Nanopore Technologies, the Wheel icon, EPI2ME, Flongle, GridION, Metrichor, MinION, MinIT, MinKNOW, Plongle, PromethION, SmidgION, Ubik and VolTRAX are registered trademarks of Oxford Nanopore Technologies plc in various countries. Oxford Nanopore Technologies products are not intended for use for health assessment or to diagnose, treat, mitigate, cure, or prevent any disease or condition.