We are happy to announce an updated dataset release of simplex nanopore sequencing from the Genome in a Bottle sample GM24385. Sequencing libraries were prepared using our Kit 12 sequencing chemistry and DNA sequences were produced using R10.4 flowcells (FLO-PRO112).
Multiple PromethION flowcells were used during data generation. The sequencing output is provided both as raw signal data stored in .fast5 files and as basecalled data in .fastq files. Super accuracy simplex basecalling was performed using the Guppy 5.0.15 basecaller.
Additional secondary analyses are included in the dataset release:
What’s included?
The dataset comprises data from multiple R10.4 flowcells with the Kit 12 sequencing chemistry.
The primary sequencing outputs are included as self-contained directories. The derived outputs from the Katuali pipeline are stored separately.
The data is available from an Amazon Web Services S3 bucket at:
s3://ont-open-data/gm24385_q20_2021.10/
See the tutorials page for information on downloading datasets.
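As a rough illustration of how the bucket location above might be used, the sketch below splits the S3 URI into bucket and prefix and assembles an AWS CLI command for listing or syncing the data. The helper names (`split_s3_uri`, `aws_cli_command`) are hypothetical, and the use of `--no-sign-request` assumes the bucket allows anonymous access; the tutorials page remains the authoritative source for download instructions.

```python
import shlex
from urllib.parse import urlparse

# Bucket location quoted in the text above.
DATASET_URI = "s3://ont-open-data/gm24385_q20_2021.10/"


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def aws_cli_command(uri, dest=None):
    """Build an AWS CLI argument list (hypothetical helper).

    With no destination the command lists the prefix; with a destination
    it syncs the prefix to a local directory. --no-sign-request is used on
    the assumption that the bucket permits anonymous reads.
    """
    split_s3_uri(uri)  # validate the URI before building the command
    if dest is None:
        return ["aws", "s3", "ls", "--no-sign-request", uri]
    return ["aws", "s3", "sync", "--no-sign-request", uri, dest]


# Print the listing command so it can be copied into a shell.
print(shlex.join(aws_cli_command(DATASET_URI)))
```

The returned list can be passed to `subprocess.run` directly. Listing before syncing is advisable, since the dataset includes raw signal files and is likely to be large.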
The uploaded dataset has been prepared using a snakemake pipeline to:
For more details see our post detailing the pipeline and its outputs.
Information about sample preparation
The following cell line samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: GM24385
High molecular weight DNA from GM24385 lymphoblastoid cells was prepared by Evotec. A standard, published protocol was used to deplete DNA fragments shorter than 35 kb. DNA was end-repaired and dA-tailed prior to LSK-based library preparation.
The outputs of the analysis pipeline include summary files which are readily interrogated using standard data analysis software. The sequencing_summary.txt files prepared by the Guppy basecaller can be used to answer basic questions concerning read length and sequencing yield, while the calls2ref_stats.txt files, calculated from read-to-reference alignments produced by the minimap2 aligner, can be used to visualize, amongst other statistics, the read accuracy. Summary graphs are shown in Figure 1.
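As a minimal sketch of the kind of question the sequencing summary can answer, the snippet below computes total yield and read length N50 from a tab-separated summary. It assumes the read length column is named `sequence_length_template`, as in typical Guppy output; check the header of the actual file. The toy table stands in for a real sequencing_summary.txt.

```python
import csv
import io


def read_lengths(handle, column="sequence_length_template"):
    """Read one integer column from a tab-separated summary file.

    The default column name is an assumption based on typical Guppy output.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [int(row[column]) for row in reader]


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy stand-in for a sequencing_summary.txt file (illustrative values only).
toy_summary = (
    "read_id\tsequence_length_template\n"
    "read_a\t1000\n"
    "read_b\t2000\n"
    "read_c\t3000\n"
    "read_d\t4000\n"
)

lengths = read_lengths(io.StringIO(toy_summary))
total_yield = sum(lengths)
print(f"reads={len(lengths)} yield={total_yield} N50={n50(lengths)}")
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`. For multi-gigabyte summaries, a streaming or dataframe-based approach may be more practical than holding every length in a list.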
Variant calling for the GM24385 genome has been performed and analysed with respect to published benchmark callsets.
Small variant calling
Small variant calling for chromosome 20 was performed using Clair3 with inference models trained explicitly for R10.4 flowcells with the Kit 12 chemistry. This model is hosted at our rerio model repository and is fully compatible with the latest version of Clair3. All inputs, outputs, and commands to reproduce the results presented below are available at:
s3://ont-open-data/gm24385_q20_2021.10/extra_analysis/small_variants
and can be downloaded by following the instructions provided in our tutorials page.
The table below shows recall, precision, and f1-scores for insertions and deletions (INDEL) and single nucleotide polymorphisms (SNP) in the variant calls reported by Clair3. The benchmarking was performed with respect to the GIAB v4.2.1 truth set with no additional filtering.
Metric | INDEL | SNP |
---|---|---|
Variants in truth set | 11256 | 71333 |
Recall | 0.82907 | 0.99913 |
Precision | 0.91288 | 0.99902 |
f1-score | 0.86896 | 0.99908 |
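For readers who want to check how the f1-score relates to the other two rows, the sketch below applies the standard definition, the harmonic mean of precision and recall, to the values in the table. The function name `f1_score` is illustrative and is not taken from the benchmarking tool. Because the table values are already rounded, the recomputed score can differ from the published one in the last decimal place.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall copied from the table above.
table = {
    "INDEL": {"precision": 0.91288, "recall": 0.82907},
    "SNP": {"precision": 0.99902, "recall": 0.99913},
}

for variant_type, metrics in table.items():
    score = f1_score(metrics["precision"], metrics["recall"])
    print(f"{variant_type}: f1 = {score:.4f}")
```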
Structural variant calling
In addition to small variant calling we also performed structural variant calling across the whole genome using our wf-human-sv pipeline. wf-human-sv is based on lra and cuteSV. All data required to reproduce the results presented below are available at:
s3://ont-open-data/gm24385_q20_2021.10/extra_analysis/structural_variants
The table below illustrates key metrics for the variant calls output by the wf-human-sv pipeline. Benchmarking was performed with respect to the GIAB v0.6 Tier1 variant set.
Metric | Value |
---|---|
Variants in truth set | 9641 |
Recall | 0.9760 |
Precision | 0.9473 |
f1-score | 0.9615 |
Although the entire dataset was used for these calculations, we note that accurate structural variant calling can also be performed at significantly lower depth of coverage.
The new R10.4 GM24385 dataset is being released to provide researchers with access to an example dataset that demonstrates the capabilities of the new sequencing chemistry and can be used as a resource to support the development of new algorithms.
This dataset may be re-released in future as iterations to the sequencing and flowcell chemistries are made. Duplex basecalling has not yet been performed; the basecalled reads are all simplex. Continuous upgrades to the Q20+ chemistry and library preparation protocols, focused on output and delivering high duplex yields, will follow. A duplex basecalled release of GM24385 is planned.
As always, we look forward to your feedback and hope that the dataset provides a valuable and instructive resource.
Information