In this vignette we will show how we have in the past produced correlation plots showing concordance between modified base frequencies calculated from bisulfite sequencing and sequencing with Oxford Nanopore Technologies sequencing devices.
This article is intended to give a leg up to bioinformaticians exploring modified base information stored in BAM files. If you are more generally interested in preparing 5mC and 5hmC data for onward investigation you may like to consider our wf-human-variation workflow.
Without further ado, we start from BAM files output by either the dorado or guppy basecallers. To create a bedMethyl file we have two options:
We stress that modbam2bed is deprecated and modkit should be used in preference. There are no known bugs in modbam2bed, it is deprecated ony because the more fulsome package modkit subsumes its functionality. We present code for using the output of both only because users may have historic data from either. This article is not intended as a full user guide to these programs, please see their respective documentation and command-line help instructions for additional information.
To create a bedMethyl file using modbam2bed we can run:
# an input BAM file and reference sequenceBAM=chr20.bamREF=GCA_000001405.15_GRCh38_no_alt_analysis_set.fastamodbam2bed \--cpg --threads 24 \--region chr20:0-99999999 \--combine --extended \"${REF}" "${BAM} > modbam2bed.bed
We can run modkit to produce a file similar to the modbam2bed.cpg.acc.bed
file:
modkit pileup \"${BAM}" modkit.bed \--region chr20 --threads 24 \--ref "${REF}" --preset traditional
The results of the two programs are broadly similar, but not identical.
The output modbam2bed.bed
contains rows for each strand of DNA.
The program can produce results aggregated across neighbouring reference positions (corresponding to the two cytosine bases on opposite strands in a CpG motif).
We produce the stranded counts here for illustrative purposes.
There are other subtleties in how the two programs handle data and create their outputs, which are broadly of little consequence for our purposes here.
We encourage you to explore to options of modkit to produce the outputs required for your use case.
So to the aim of this how to article. Let’s load the data into Python and start plotting:
# All the packages we need for loading the data and plottingimport aplanatfrom aplanat import spatialfrom bokeh.models.widgets import Tabs, Panelimport numpy as npimport pandas as pd# the input filesbisulfite_file = "CpG.gz.bismark.zero.cov.gz"modbam2bed_file = "modbam2bed.bed"modkit_file = "modkit.bed"# load bisulfite data# we need to calculate the coverage herebis = pd.read_csv(bisulfite_file,sep="\t", header=None, engine="c",dtype={'chrom':str, 'start':int, 'end':int,'freq':float, 'mod':int, 'canon':int},names=['chrom', 'start', 'end', 'freq', 'mod', 'canon'])bis["coverage"] = bis["mod"] + bis["canon"]# load modbam2bed data# the file contains counts from + and - strands separately, which# are combined below. See docs for further information.modbam = pd.read_csv(modbam2bed_file,sep="\t", header=None, engine="c",usecols=["chrom", "start", "coverage", "freq", "strand", "mod"],dtype={"chrom":str, "start":int, "coverage":int,"freq":float, "strand":str, "mod":int},names=["chrom", "start", "end", "name", "score", "strand","tstart", "tend", "color","coverage", "freq","canon", "mod", "filt", "nocall", "altmod"])# load modkit data# nothing special to do here. See docs for further informationmodkit = pd.read_csv(modkit_file,sep="\s+", header=None, engine="c",usecols=["chrom", "start", "end", "coverage", "freq"],dtype={"chrom":str, "start":int, "end":int,"coverage":int, "freq":float},names=["chrom", "start", "end", "name", "score", "strand","tstart", "tend", "color","coverage", "freq", "mod", "canon", "other", "del", "fail", "diff", "nocall"])
With data loaded we can plot heat maps to illustrate correlations between the datasets:
# a function that create the plot for two datasetsdef make_heatmap(bis, nano, title):r_coeff = np.corrcoef(bis, nano)r_coeff = r_coeff[0,1]rmse = np.sqrt(np.mean((bis-nano)**2))# the curious bin numbers here attempt to mitigate aliasing artefacts around# rational modification fractions.p = spatial.heatmap2(bis, nano,tools = "pan,wheel_zoom,box_zoom,reset",log=True, x_bins=53, y_bins=53, xlim=(0,100), ylim=(0,100), zlim=(100, 100000),title="Methylation comparison. R={:.3f}, RMSE={:.3f}".format(r_coeff, rmse))p.match_aspect = Truep.aspect_ratio = 1.2p.xaxis.axis_label = 'Bisulphite Methylation Frequency'p.yaxis.axis_label = f'{title} Methylation Frequency'p.toolbar.logo = Nonereturn p# filter the bisulphite data to chromosome 20select = {"chr20"}bis_sel = bis.loc[(bis["chrom"].isin(select))]plots = dict()for ver, df in zip(("modbam2bed", "modkit"), (modbam, modkit)):# filter to chrom 20 and drop any rows with NaNssel = df.loc[(df["chrom"].isin(select))].dropna()if ver == "modbam2bed":# combine strandssel.loc[sel.strand == "-", 'start'] -= 1sel = sel.groupby(['chrom','start']).sum().reset_index()sel['freq'] = 100 * sel['mod'] / sel['coverage']# perform an inner join on genomic locationcombined = pd.merge(bis_sel, sel, on=["chrom", "start"], suffixes=[".bis", ".nano"])# filter data to where both bisulfite and nanopore have >20 readsplot_data = combined.loc[(combined['coverage.bis'] > 20) &(combined['coverage.nano'] > 20)]# make a plotplot = make_heatmap(plot_data["freq.bis"], plot_data["freq.nano"], ver)plots[ver] = Panel(child=plot, title=ver)tabs = Tabs(tabs=list(plots.values()))aplanat.show(tabs)
The output of the above will look something like the plots below.
This concludes our short tour of handling bedMethyl files from our deprecated and current tools for handling modified base information present in BAM files from Oxford Nanopore Technologies’ basecallers. We hope this information provides useful to the community.
Information