We are happy to announce an updated dataset release of simplex nanopore sequencing from the Genome in a Bottle sample GM24385. Sequencing libraries were prepared using our Kit 12 sequencing chemistry and DNA sequences were produced using R10.4 flowcells (FLO-PRO112).

Multiple PromethION flowcells were used during data generation. The sequencing output is provided both as raw signal data stored in .fast5 files and as basecalled data in .fastq files. Super accuracy simplex basecalling was performed using the Guppy 5.0.15 basecaller.

Additional secondary analyses are included in the dataset release:

alignments of simplex sequence data to the reference genome are provided along with performance statistics,
whole genome structural variant calls and benchmarking results,
and small variant calls for chr20.

What’s included?

The dataset comprises data from multiple R10.4 flowcells with the Kit 12 sequencing chemistry.

The primary sequencing outputs are included as self-contained directories. The derived outputs from the Katuali pipeline are stored separately.

The data is available from an Amazon Web Services S3 bucket at:

s3://ont-open-data/gm24385_q20_2021.10/

See the tutorials page for information on downloading datasets.
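As a minimal sketch, the bucket path above can be split into its bucket and prefix and turned into an AWS CLI command. It assumes the AWS CLI is installed and that the bucket permits anonymous reads (hence `--no-sign-request`). The helper names and the local directory `gm24385_q20` are illustrative only, and the tutorials page remains the authoritative guide.

```python
import shlex

DATASET_URI = "s3://ont-open-data/gm24385_q20_2021.10/"


def split_s3_uri(uri):
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    return bucket, prefix


def build_sync_command(uri, local_dir, subdir=""):
    """Return an 'aws s3 sync' argument list for a (sub)directory of the dataset.

    Assumption: the bucket allows anonymous access, so --no-sign-request is used.
    """
    source = uri.rstrip("/") + "/" + subdir.strip("/")
    source = source.rstrip("/") + "/"
    return ["aws", "s3", "sync", "--no-sign-request", source, local_dir]


if __name__ == "__main__":
    print(split_s3_uri(DATASET_URI))
    # Print the command instead of running it: the full dataset may be very large.
    command = build_sync_command(DATASET_URI, "gm24385_q20", "extra_analysis")
    print(shlex.join(command))
```

Restricting the sync to a subdirectory such as `extra_analysis` is a convenient way to fetch the secondary analyses without the raw signal data.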

The uploaded dataset has been prepared using a snakemake pipeline to:

Perform basecalling using the latest Guppy 5.0.15 basecaller,
Align basecalls to reference sequence. All primary, secondary and supplementary alignments are kept,
Filter .bam file to list of regions defined in configuration file retaining only primary alignments,
Produce read statistics from per-region .bams.

For more details see our post detailing the pipeline and its outputs.
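To illustrate what "retaining only primary alignments" means in practice, the sketch below filters SAM text records on their FLAG field. It is not the pipeline's own code: the function names and the demo records are invented for illustration, and the region filtering step is omitted. The flag bits themselves are those defined in the SAM specification.

```python
# SAM FLAG bits, as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def is_primary_alignment(flag):
    """True for a mapped record that is neither secondary nor supplementary."""
    return not flag & (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY)


def filter_primary(sam_lines):
    """Yield header lines and primary alignment records from SAM text lines."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep the header untouched
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if is_primary_alignment(flag):
            yield line


def demo_record(name, flag):
    """Build a minimal, made-up SAM record with the given FLAG."""
    return "\t".join([name, str(flag), "chr20", "100", "60", "4M", "*", "0", "0", "ACGT", "IIII"])


# Made-up records: forward primary, reverse primary, secondary, supplementary, unmapped.
DEMO_SAM = ["@HD\tVN:1.6"] + [
    demo_record(f"read{i}", flag) for i, flag in enumerate([0, 16, 256, 2048, 4])
]

if __name__ == "__main__":
    for kept in filter_primary(DEMO_SAM):
        print(kept)
```

On a real .bam file the same selection is usually made with an alignment toolkit, for example by excluding flag mask 0x904 with `samtools view -F`.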

Information about sample preparation

The following cell line samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: GM24385

High molecular weight DNA from GM24385 lymphoblastoid cells was prepared by Evotec. A standard, published protocol was used to deplete DNA fragments < 35kb in length. DNA was end repaired and dA tailed prior to LSK based library preparation.

Dataset Summary

The outputs of the analysis pipeline include summary files which are readily interrogated using standard data analysis software. The sequencing_summary.txt files prepared by the Guppy basecaller can be used to answer basic questions concerning read length and sequencing yield, while the calls2ref_stats.txt files, calculated from read-to-reference alignments produced by the minimap2 aligner, can be used to visualize read accuracy, amongst other statistics. Summary graphs are shown in Figure 1.

Figure 1. Per-flowcell dataset summary statistics. Experiments were performed 'interaction free', with no nuclease flowcell flushing or library reloading.
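The read length and yield questions mentioned above can be answered with a few lines of standard-library Python. The sketch below computes total yield and read N50 from a tab-separated summary table. It assumes the read length column is named `sequence_length_template`, which is our understanding of Guppy's output; check the header of your own file. The demo table is made up and simply stands in for a real sequencing_summary.txt.

```python
import csv
import io


def read_lengths(summary_handle, length_column="sequence_length_template"):
    """Return read lengths from a tab-separated sequencing summary."""
    reader = csv.DictReader(summary_handle, delimiter="\t")
    return [int(row[length_column]) for row in reader]


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise ValueError("no reads supplied")


# Tiny made-up table standing in for a real sequencing_summary.txt.
DEMO_SUMMARY = (
    "read_id\tsequence_length_template\n"
    "r1\t10\n"
    "r2\t20\n"
    "r3\t30\n"
    "r4\t40\n"
)

if __name__ == "__main__":
    lengths = read_lengths(io.StringIO(DEMO_SUMMARY))
    print("reads:", len(lengths))
    print("yield (bases):", sum(lengths))
    print("read N50:", n50(lengths))
```

For a real run, replace the `io.StringIO` demo with `open("sequencing_summary.txt")`. The same pattern applies to calls2ref_stats.txt, using whichever accuracy column that file's header provides.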

Variant calling

Variant calling for the GM24385 genome has been performed and analysed with respect to published benchmark callsets.

Small variant calling

Small variant calling for Chromosome 20 was performed using Clair3 with inference models trained explicitly for R10.4 flowcells with the Kit 12 chemistry. This model is hosted at our rerio model repository and is fully compatible with the latest version of Clair3. All inputs, outputs, and commands to reproduce the results presented below are available at:

s3://ont-open-data/gm24385_q20_2021.10/extra_analysis/small_variants

and can be downloaded by following the instructions provided in our tutorials page.

The table below shows recall, precision, and f1-scores for insertions and deletions (INDEL) and for single nucleotide polymorphisms (SNP) in the variant calls reported by Clair3. The benchmarking was performed with respect to the GIAB v4.2.1 truth set with no additional filtering.

Metric	INDEL	SNP
Variants in truth set	11256	71333
Recall	0.82907	0.99913
Precision	0.91288	0.99902
f1-score	0.86896	0.99908
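The f1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes it from the precision and recall values in the table above. Because those inputs are already rounded, agreement is only expected to roughly four decimal places. The function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the small variant table above.
SMALL_VARIANT_TABLE = {
    "INDEL": {"recall": 0.82907, "precision": 0.91288, "f1": 0.86896},
    "SNP": {"recall": 0.99913, "precision": 0.99902, "f1": 0.99908},
}

if __name__ == "__main__":
    for variant_type, row in SMALL_VARIANT_TABLE.items():
        recomputed = f1_score(row["precision"], row["recall"])
        print(f"{variant_type}: recomputed {recomputed:.5f}, reported {row['f1']:.5f}")
```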

Structural variant calling

In addition to small variant calling we also performed structural variant calling across the whole genome using our wf-human-sv pipeline. wf-human-sv is based on lra and cuteSV. All data required to reproduce the results presented below are available at:

s3://ont-open-data/gm24385_q20_2021.10/extra_analysis/structural_variants

The table below illustrates key metrics for the variant calls output by the wf-human-sv pipeline. Benchmarking was performed with respect to the GIAB v0.6 Tier1 variant set.

Metric	Value
Variants in truth set	9641
Recall	0.9760
Precision	0.9473
f1-score	0.9615

Although the entire dataset was used for these calculations, we note that accurate structural variant calling can also be performed at significantly lower depth of coverage.
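To put the structural variant metrics above into absolute terms, the sketch below back-calculates approximate counts of true positive, false negative and false positive calls from the table. These are rough estimates, not reported results. They assume that recall = TP / (variants in truth set), that precision = TP / (TP + FP), and that true positives are counted identically on the truth and call sides, which benchmarking tools do not always guarantee. The published benchmarking outputs in the bucket hold the exact figures.

```python
def estimate_counts(truth_total, recall, precision):
    """Approximate (TP, FN, FP) from rounded summary metrics.

    Assumes recall = TP / truth_total and precision = TP / (TP + FP).
    """
    true_pos = round(recall * truth_total)
    false_neg = truth_total - true_pos
    false_pos = round(true_pos * (1 - precision) / precision)
    return true_pos, false_neg, false_pos


if __name__ == "__main__":
    # Values copied from the structural variant table above.
    tp, fn, fp = estimate_counts(truth_total=9641, recall=0.9760, precision=0.9473)
    print(f"approx. true positives:  {tp}")
    print(f"approx. false negatives: {fn}")
    print(f"approx. false positives: {fp}")
```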

Summary

The new R10.4 GM24385 dataset is being released to provide researchers with access to an example dataset that demonstrates the capabilities of the new sequencing chemistry and can be used as a resource to support the development of new algorithms.

This dataset may be re-released in future as iterations to the sequencing and flowcell chemistries are made. Duplex basecalling has not yet been performed; the basecalled reads are all simplex. Continuous upgrades to the Q20+ chemistry and library preparation protocols, focused on output and delivering high duplex yields, will follow. A duplex basecalled release of GM24385 is planned.

As always, we look forward to your feedback and hope that the dataset provides a valuable and instructive resource.