Following on from our CliveOME 2022.05 data release, we are excited to present 5mC basecalls for both the original cfDNA reads and reads from a second, ultra-long cellular DNA sample preparation.

Aligned reads are available in BAM format with modified base tags as defined in the hts-specs document. The cfDNA basecalls (see below for more details) are available at:

s3://ont-open-data/cliveome_kit14_2022.05/cfdna/basecalls/bonito_mod

with three flowcells of cellular WGS data available at:

s3://ont-open-data/cliveome_kit14_2022.05/gdna/basecalling/

Figure 1 shows the data output for one of the three ULK (Ultra-Long DNA Sequencing Kit) flowcells as a function of read length. The ULK sample preparation yielded a total of 70 gigabases of data, with over 20 gigabases contained in reads longer than 100 kb. Figure 2 presents the read accuracy, as measured by alignment to the GRCh38 reference sequence, for both DNA samples. Note that the cfDNA distribution appears rather broad. As explained previously, the single-read base accuracy density is broadened by the appearance of 1 or 2 errors in reads of length ~100 bases.
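To make the broadening effect concrete, here is a minimal sketch of the arithmetic. It assumes accuracy is simply the fraction of correct bases in a read and that every error affects a single base; the function names are ours for illustration and the exact accuracy definition used for Figure 2 may differ.

```python
def read_accuracy(read_length: int, n_errors: int) -> float:
    """Fraction of correct bases, assuming each error affects one base."""
    if read_length <= 0 or not 0 <= n_errors <= read_length:
        raise ValueError("need read_length > 0 and 0 <= n_errors <= read_length")
    return (read_length - n_errors) / read_length


def attainable_accuracies(read_length: int, max_errors: int = 2) -> list:
    """Accuracies a read of this length can take with 0..max_errors errors."""
    return [read_accuracy(read_length, k) for k in range(max_errors + 1)]


# A ~100 base cfDNA fragment moves in steps of a whole percentage point,
# while a 100 kb cellular DNA read barely changes with the same error count.
for length in (100, 100_000):
    values = ", ".join(f"{a:.5f}" for a in attainable_accuracies(length))
    print(f"read length {length}: {values}")
```

With only a handful of discrete values available to short reads, a smoothed density over many such reads looks much wider than the equivalent density for long reads.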

Figure 1. Cumulative data volume for one of the three flowcells used to sequence the ULK cellular DNA sample. This flowcell produced 20 gigabases of reads longer than 100 kb.

Figure 2. Kernel density estimate depicting the read accuracy for short fragment mode cfDNA sequencing as well as long-read cellular DNA data.

Data Availability

The FAST5 files from sequencing are publicly available in our Amazon S3 bucket at:

s3://ont-open-data/cliveome_kit14_2022.05/

More information on downloading the data from s3://ont-open-data may be found on our Open datasets Tutorials page.

Sample extraction

The cellular DNA sample was prepared for sequencing using Oxford Nanopore’s Ultra-Long DNA Sequencing Kit, the details of which can be found on the ONT Store.

Details of the cfDNA sample preparation can be found in the previous post.

The samples used for the cfDNA and cellular DNA extractions were not taken at the same time.

Data processing

For both the cfDNA and cellular DNA samples, bonito was used to basecall directly to BAM files with modified base tags. Bonito was chosen over Guppy because, at the time of writing, bonito implements a slightly more accurate algorithm for 5mC calling, which is thought to help particularly in the case of short fragment mode.

Ordinarily, users should use Guppy to obtain 5mC calls. This can be performed in real time on the sequencing instrument, further lowering the barrier to obtaining such data.

This extremely simple workflow is in contrast to the laborious sample preparation and data processing required for techniques such as bisulfite sequencing. We previously discussed these differences in our 5mC GM24385 blog post. To recap, all that is required to obtain 5mC calls from the primary sequencing data is to run:

bonito basecaller \
    dna_r10.4.1_e8.2_sup@v3.5.1 \
    <input location> \
    --recursive \
    --modified-bases 5mC \
    --reference <reference fasta> \
    | samtools sort -@16 \
    > bonito_calls.bam
samtools index bonito_calls.bam
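For readers curious about what the modified base tags in the resulting BAM file contain, the following is a rough sketch of decoding an MM/ML tag pair as we understand the hts-specs definition. It is deliberately simplified: it assumes one modification code per MM entry, and a sequence given in the original read orientation (reads aligned to the reverse strand would need reverse-complementing first). The function name and the example values are made up for illustration; in practice a library such as pysam is a better choice.

```python
def parse_mm_ml(seq, mm, ml):
    """Decode MM/ML tag values into {(base, strand, code): [(pos, prob), ...]}.

    seq: read sequence in original orientation
    mm:  MM tag string, e.g. "C+m,0,0,1;"
    ml:  list of ML integers (0..255), in the same order as the MM calls
    """
    calls = {}
    ml_index = 0
    for entry in filter(None, mm.split(";")):
        header, *deltas = entry.split(",")
        header = header.rstrip("?.")  # drop the optional implicit/explicit flag
        base, strand, code = header[0], header[1], header[2:]
        # Positions in the read of every occurrence of the canonical base.
        candidates = [i for i, b in enumerate(seq) if base == "N" or b == base]
        cursor = -1
        hits = []
        for delta in deltas:
            # Each number is how many candidate bases to skip before the next call.
            cursor += int(delta) + 1
            # Integer N covers the probability range N/256..(N+1)/256; use the midpoint.
            probability = (ml[ml_index] + 0.5) / 256
            hits.append((candidates[cursor], probability))
            ml_index += 1
        calls[(base, strand, code)] = hits
    return calls


# Illustrative values only: three 5mC calls on the C bases at positions 1, 4 and 7.
example = parse_mm_ml("ACGTCGCCGA", "C+m,0,0,1;", [250, 10, 128])
for position, probability in example[("C", "+", "m")]:
    print(position, round(probability, 3))
```

The third number in the MM string is 1 because the C at position 6 is skipped: it is not part of a CpG and carries no call in this made-up example.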

Aggregation of 5mC information by genomic position can be performed by our modbam2bed program:

modbam2bed \
    -e -m 5mC --cpg -t 10 \
    <reference fasta> bonito_calls.bam \
    > bonito.cpg.bed

to obtain per-site methylation frequencies (written here to bonito.cpg.bed).
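As a rough sketch of how the aggregated output might be consumed downstream, the snippet below reads bedMethyl-style records and computes a coverage-weighted mean methylation percentage. It assumes the standard bedMethyl layout, with read coverage in column 10 and percent modified in column 11, and ignores any extended columns; please check this against the modbam2bed documentation for your version. The function names and the three example records are invented for illustration and are not taken from the dataset.

```python
import io
import math


def read_methylation(handle, min_coverage=1):
    """Yield (chrom, start, strand, coverage, percent) from bedMethyl-style lines.

    Assumes column 10 is coverage and column 11 is percent modified.
    """
    for line in handle:
        if line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 11:
            continue
        coverage = int(fields[9])
        percent = float(fields[10])  # may be "nan" for sites with no usable reads
        if coverage < min_coverage or math.isnan(percent):
            continue
        yield fields[0], int(fields[1]), fields[5], coverage, percent


def weighted_mean_percent(records):
    """Coverage-weighted mean of the per-site methylation percentages."""
    total = sum(cov for _, _, _, cov, _ in records)
    if total == 0:
        return float("nan")
    return sum(cov * pct for _, _, _, cov, pct in records) / total


# Made-up records for illustration only.
EXAMPLE = (
    "chr1\t10468\t10469\t5mC\t1000\t+\t10468\t10469\t0,0,0\t10\t80.00\n"
    "chr1\t10470\t10471\t5mC\t1000\t+\t10470\t10471\t0,0,0\t30\t40.00\n"
    "chr1\t10483\t10484\t5mC\t1000\t-\t10483\t10484\t0,0,0\t0\tnan\n"
)

sites = list(read_methylation(io.StringIO(EXAMPLE)))
print(len(sites), weighted_mean_percent(sites))
```

To run this on real output, replace the io.StringIO object with an open file handle for bonito.cpg.bed.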

The modbam2bed program can accept a BAM file with additional tags specifying the haplotype to which each read belongs. In this manner it is straightforward to obtain haplotype-specific methylation frequencies for CpG sites in human samples, greatly accelerating research into phenomena controlled by genetic imprinting. We will leave these tasks for another blog post. In the meantime, please do download and explore the dataset.