Following on from our CliveOME 2022.05 data release, we are excited to present 5mC basecalls for both the original cfDNA reads and reads from a second, ultralong cellular DNA sample preparation.
Aligned reads are available in BAM format with modified base tags as defined in the hts-specs document. The cfDNA basecalls (see below for more details) are available at:
s3://ont-open-data/cliveome_kit14_2022.05/cfdna/basecalls/bonito_mod
with three flowcells of cellular WGS data available at:
s3://ont-open-data/cliveome_kit14_2022.05/gdna/basecalling/
Figure 1 shows the data output for one of the three ULK flowcells as a function of read length. This ULK sample preparation has yielded a total of 70 gigabases of data, with over 20 gigabases contained within reads over 100kb. Figure 2 presents the read accuracy, as measured by alignment to the GRCh38 reference sequence, for both DNA samples. Note that the cfDNA distribution here appears rather broad, as explained previously: the single-read base accuracy density is broadened by the appearance of 1 or 2 errors in reads of length ~100 bases.
The FAST5 files from the sequencing run have been placed in our publicly available Amazon S3 bucket at:
s3://ont-open-data/cliveome_kit14_2022.05/
More information on downloading the data from s3://ont-open-data may be found on our Open datasets Tutorials page.
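As a minimal sketch of what fetching the data might look like, the two helper functions below wrap the AWS command-line client. They assume the `aws` CLI is installed and that the bucket allows anonymous access through `--no-sign-request`. The function names and the `CLIVEOME_S3` variable are hypothetical, introduced here only for illustration.

```shell
# Root of the release, as given in this post.
CLIVEOME_S3="s3://ont-open-data/cliveome_kit14_2022.05"

# List the contents of a sub-path of the release (hypothetical helper).
# Pass the sub-path with a trailing slash to list what is inside it.
cliveome_ls() {
    aws s3 ls --no-sign-request "${CLIVEOME_S3}/${1:-}"
}

# Copy a sub-path of the release to a local directory (hypothetical helper).
# $1 = sub-path within the release, $2 = local destination directory
cliveome_sync() {
    aws s3 sync --no-sign-request "${CLIVEOME_S3}/$1" "$2"
}
```

For example, `cliveome_ls cfdna/basecalls/` would list the cfDNA basecall directory, and `cliveome_sync cfdna/basecalls/bonito_mod ./bonito_mod` would copy the cfDNA modified-base BAM files locally. The datasets are large, so it is worth listing a path before syncing it.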
The cellular DNA sample was prepared for sequencing using Oxford Nanopore’s Ultra-Long DNA Sequencing Kit, the details of which can be found on the ONT Store.
Details of the cfDNA sample preparation can be found in the previous post.
The samples taken for DNA extractions of cfDNA and cellular-DNA were not contemporaneous.
For both the cfDNA and the cellular DNA, bonito was used to perform basecalling straight to BAM files with modified base tags. Bonito was chosen over Guppy because, at the time of writing, bonito implements a slightly more accurate algorithm for 5mC calling, which is thought to help particularly in the case of short fragment mode.
Ordinarily users should use Guppy for obtaining 5mC calls, which can be performed in real-time on the sequencing instrument to further lower the barrier to obtaining such data.
This extremely simple workflow is in contrast to the laborious sample preparation and data processing required for techniques such as bisulfite sequencing. We previously discussed these differences in our 5mC GM24385 blog post. To recap, all that is required to obtain 5mC calls from the primary sequencing data is to run:
bonito basecaller \
    dna_r10.4.1_e8.2_sup@v3.5.1 \
    <input location> --recursive \
    --modified-bases 5mC \
    --reference <reference fasta> \
    | samtools sort -@16 \
    > bonito_calls.bam
samtools index bonito_calls.bam
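To give a feel for what the modified base tags in the resulting BAM encode, here is a rough sketch that decodes a single tag pair by hand. It assumes the hts-specs convention: the MM value lists, for each call, how many occurrences of the canonical base to skip before the next modified one, and the ML value holds one 0-255 integer per call, representing a probability bin. The function `decode_mm` is hypothetical. It handles only the simplest case of one modification type on a read stored in its original orientation, and it is not a substitute for a proper parser.

```shell
# decode_mm SEQ MM ML  (hypothetical helper, illustration only)
#   SEQ = read sequence, MM = tag value such as "C+m,0,2;",
#   ML  = comma-separated 0-255 values, one per call
# Prints: 0-based position in SEQ <TAB> approximate modification probability
decode_mm() {
    awk -v seq="$1" -v mm="$2" -v ml="$3" 'BEGIN {
        sub(/;$/, "", mm)               # drop the trailing semicolon
        n = split(mm, skips, ",")       # skips[1] is the code, e.g. "C+m"
        split(ml, quals, ",")
        base = substr(skips[1], 1, 1)   # canonical base the calls refer to
        pos = 0                         # 1-based cursor into seq
        for (i = 2; i <= n; i++) {
            remaining = skips[i] + 0    # occurrences of base to skip
            while (1) {
                pos++
                if (pos > length(seq)) exit 1   # tag runs past the read
                if (substr(seq, pos, 1) == base) {
                    if (remaining == 0) break
                    remaining--
                }
            }
            # Integer N covers probabilities N/256 to (N+1)/256; report the midpoint.
            printf "%d\t%.3f\n", pos - 1, (quals[i - 1] + 0.5) / 256
        }
    }'
}
```

With the toy inputs `decode_mm ACGTCCGACG "C+m,0,2;" "230,26"`, the first call lands on the first C in the read and the second call lands on the fourth C, after two are skipped. In practice, tools such as modbam2bed perform this decoding, including the handling of reverse-strand alignments.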
Aggregation of 5mC information by genomic position can be performed by our modbam2bed program:
modbam2bed \
    -e -m 5mC --cpg -t 10 \
    <reference fasta> bonito_calls.bam \
    > bonito.cpg.bed
to obtain per-site methylation frequencies in the output file bonito.cpg.bed.
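As a quick sanity check on the aggregated output, the sketch below counts the sites that pass a coverage threshold and reports their mean methylation percentage. It assumes the output follows the bedMethyl column layout, with read coverage in column 10 and percentage modified in column 11. That layout should be confirmed against the modbam2bed documentation for the version in use. The function name and the default threshold of 5 reads are arbitrary choices for illustration.

```shell
# summarise_bedmethyl FILE [MIN_COV]  (hypothetical helper, illustration only)
# Assumes column 10 = read coverage and column 11 = percent modified.
# Prints: number of sites with coverage >= MIN_COV <TAB> mean percent modified
summarise_bedmethyl() {
    awk -v min_cov="${2:-5}" '
        $10 >= min_cov {
            n++
            sum += $11
        }
        END {
            if (n > 0) printf "%d\t%.2f\n", n, sum / n
            else print "0\tNA"
        }' "$1"
}
```

For instance, `summarise_bedmethyl bonito.cpg.bed 10` would restrict the summary to sites covered by at least ten reads. This is an unweighted mean over sites, so deeply covered sites count the same as shallow ones.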
The modbam2bed program can accept a BAM file with additional tags specifying the haplotype to which each read belongs. In this manner it is possible to acquire haplotype-specific methylation frequencies for CpG sites in human samples with little extra effort, greatly accelerating research into phenomena controlled by genetic imprinting. We will leave these tasks for another blog post; in the meantime, please do download and explore the dataset.