In this vignette we will show how we have in the past produced correlation plots showing concordance between modified base frequencies calculated from bisulfite sequencing and sequencing with Oxford Nanopore Technologies sequencing devices.

This article is intended to give a leg up to bioinformaticians exploring modified base information stored in BAM files. If you are more generally interested in preparing 5mC and 5hmC data for onward investigation you may like to consider our wf-human-variation workflow.

Without further ado, we start from BAM files output by either the dorado or guppy basecallers. To create a bedMethyl file we have two options:

The deprecated program modbam2bed
The next generation program modkit

We stress that modbam2bed is deprecated and modkit should be used in preference. There are no known bugs in modbam2bed, it is deprecated ony because the more fulsome package modkit subsumes its functionality. We present code for using the output of both only because users may have historic data from either. This article is not intended as a full user guide to these programs, please see their respective documentation and command-line help instructions for additional information.

To create a bedMethyl file using modbam2bed we can run:

# an input BAM file and reference sequence
BAM=chr20.bam
REF=GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta

modbam2bed \
    --cpg --threads 24 \
    --region chr20:0-99999999 \
    --combine --extended \
    "${REF}" "${BAM} > modbam2bed.bed

We can run modkit to produce a file similar to the modbam2bed.cpg.acc.bed file:

modkit pileup \
    "${BAM}" modkit.bed \
    --region chr20 --threads 24 \
    --ref "${REF}" --preset traditional

The results of the two programs are broadly similar, but not identical.

The output modbam2bed.bed contains rows for each strand of DNA. The program can produce results aggregated across neighbouring reference positions (corresponding to the two cytosine bases on opposite strands in a CpG motif). We produce the stranded counts here for illustrative purposes. There are other subtleties in how the two programs handle data and create their outputs, which are broadly of little consequence for our purposes here. We encourage you to explore to options of modkit to produce the outputs required for your use case.

So to the aim of this how to article. Let’s load the data into Python and start plotting:

# All the packages we need for loading the data and plotting
import aplanat
from aplanat import spatial
from bokeh.models.widgets import Tabs, Panel
import numpy as np
import pandas as pd

# the input files
bisulfite_file = "CpG.gz.bismark.zero.cov.gz"
modbam2bed_file = "modbam2bed.bed"
modkit_file = "modkit.bed"

# load bisulfite data
#   we need to calculate the coverage here
bis = pd.read_csv(
    bisulfite_file,
    sep="\t", header=None, engine="c",
    dtype={
        'chrom':str, 'start':int, 'end':int,
        'freq':float, 'mod':int, 'canon':int},
    names=['chrom', 'start', 'end', 'freq', 'mod', 'canon']
)
bis["coverage"] = bis["mod"] + bis["canon"]

# load modbam2bed data
#   the file contains counts from + and - strands separately, which
#   are combined below. See docs for further information.
modbam = pd.read_csv(
    modbam2bed_file,
    sep="\t", header=None, engine="c",
    usecols=["chrom", "start", "coverage", "freq", "strand", "mod"],
    dtype={
        "chrom":str, "start":int, "coverage":int,
        "freq":float, "strand":str, "mod":int},
    names=[
        "chrom", "start", "end", "name", "score", "strand",
        "tstart", "tend", "color",
        "coverage", "freq",
        "canon", "mod", "filt", "nocall", "altmod"])

# load modkit data
#    nothing special to do here. See docs for further information
modkit = pd.read_csv(
    modkit_file,
    sep="\s+", header=None, engine="c",
    usecols=["chrom", "start", "end", "coverage", "freq"],
    dtype={
        "chrom":str, "start":int, "end":int,
        "coverage":int, "freq":float},
    names=[
        "chrom", "start", "end", "name", "score", "strand",
        "tstart", "tend", "color",
        "coverage", "freq", "mod", "canon", "other", "del", "fail", "diff", "nocall"])

With data loaded we can plot heat maps to illustrate correlations between the datasets:

# a function that create the plot for two datasets
def make_heatmap(bis, nano, title):
    r_coeff = np.corrcoef(bis, nano)
    r_coeff = r_coeff[0,1]
    rmse = np.sqrt(np.mean((bis-nano)**2))

    # the curious bin numbers here attempt to mitigate aliasing artefacts around
    # rational modification fractions.
    p = spatial.heatmap2(
        bis, nano,
        tools = "pan,wheel_zoom,box_zoom,reset",
        log=True, x_bins=53, y_bins=53, xlim=(0,100), ylim=(0,100), zlim=(100, 100000),
        title="Methylation comparison. R={:.3f}, RMSE={:.3f}".format(r_coeff, rmse))

    p.match_aspect = True
    p.aspect_ratio = 1.2
    p.xaxis.axis_label = 'Bisulphite Methylation Frequency'
    p.yaxis.axis_label = f'{title} Methylation Frequency'
    p.toolbar.logo = None
    return p

# filter the bisulphite data to chromosome 20
select = {"chr20"}
bis_sel = bis.loc[(bis["chrom"].isin(select))]
plots = dict()

for ver, df in zip(("modbam2bed", "modkit"), (modbam, modkit)):
    # filter to chrom 20 and drop any rows with NaNs
    sel = df.loc[(df["chrom"].isin(select))].dropna()
    if ver == "modbam2bed":
        # combine strands
        sel.loc[sel.strand == "-", 'start'] -= 1
        sel = sel.groupby(['chrom','start']).sum().reset_index()
        sel['freq'] = 100 * sel['mod'] / sel['coverage']
    # perform an inner join on genomic location
    combined = pd.merge(
        bis_sel, sel, on=["chrom", "start"], suffixes=[".bis", ".nano"])
    # filter data to where both bisulfite and nanopore have >20 reads
    plot_data = combined.loc[
        (combined['coverage.bis'] > 20) &
        (combined['coverage.nano'] > 20)]
    # make a plot
    plot = make_heatmap(plot_data["freq.bis"], plot_data["freq.nano"], ver)
    plots[ver] = Panel(child=plot, title=ver)

tabs = Tabs(tabs=list(plots.values()))
aplanat.show(tabs)

The output of the above will look something like the plots below.

Figure 1. Correlation plots generated through the example code using both modbam2bed and modkit to extract modified base information from a BAM file output by either of the basecallers guppy or dorado.

This concludes our short tour of handling bedMethyl files from our deprecated and current tools for handling modified base information present in BAM files from Oxford Nanopore Technologies’ basecallers. We hope this information provides useful to the community.