Genome in a Bottle Data Release 2025.01

Published in Data Releases
January 05, 2025

We are pleased to announce an updated and improved release of nanopore sequencing data for the Genome in a Bottle samples, now including both the Ashkenazi and Han Chinese trios.

Sequencing was performed on a PromethION with the Ligation Sequencing Kit V14 using the SQK-LSK114 protocol and the latest basecalling model, dna_r10.4.1_e8.2_400bps@v5.0.0. As such, the quality of the data presented here should be representative of routine sequencing that can be performed by any lab using this latest release.

These reference samples were each sequenced with two PromethION flowcells, yielding around 140 Gbases of sequencing data per flowcell.
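
As a rough back-of-the-envelope check, the per-sample yield and approximate depth implied by these figures can be sketched as below. The genome size of 3.1 Gbases is an assumption introduced here for illustration, not a value from this release, and real yields vary from flowcell to flowcell.

```python
# Rough estimate of per-sample yield and depth from the figures quoted above.
# HUMAN_GENOME_GBASES is an assumption (approximate human genome size),
# not a value taken from this release.
HUMAN_GENOME_GBASES = 3.1


def sample_yield_gbases(per_flowcell_gbases: float, flowcells: int) -> float:
    """Total yield for one sample across all of its flowcells."""
    return per_flowcell_gbases * flowcells


def approx_depth(per_flowcell_gbases: float, flowcells: int,
                 genome_gbases: float = HUMAN_GENOME_GBASES) -> float:
    """Approximate mean depth: total sequenced bases divided by genome size."""
    return sample_yield_gbases(per_flowcell_gbases, flowcells) / genome_gbases


if __name__ == "__main__":
    total = sample_yield_gbases(140, 2)
    depth = approx_depth(140, 2)
    print(f"~{total:.0f} Gbases per sample, roughly {depth:.0f}x depth")
```

This is only an order-of-magnitude guide; the sequencing summary files in the bucket give the actual yield for each flowcell.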

The following cell line samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: GM12878, GM24143, GM24149, GM24385, GM24631, GM24694 and GM24695.

Data location

As with previous releases, the new dataset is available for anonymous download from an Amazon Web Services S3 bucket. The bucket is part of the Open Data on AWS project, which enables sharing and analysis of a wide range of data.

The data is located in the bucket at:

s3://ont-open-data/giab_2025.01/

See the tutorials page for information on downloading the dataset.
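
Because the bucket allows anonymous access, the AWS CLI can be told to skip request signing with its `--no-sign-request` option. The following is a minimal sketch that only builds the command lines and does not run them; the helper name `aws_s3_command` and the local destination directory are hypothetical and chosen for illustration.

```python
from __future__ import annotations

import shlex

# Bucket location as given in this release.
BUCKET_PREFIX = "s3://ont-open-data/giab_2025.01/"


def aws_s3_command(action: str, *paths: str) -> list[str]:
    """Build an `aws s3` command line that skips request signing."""
    if action not in {"ls", "cp", "sync"}:
        raise ValueError(f"unsupported action: {action}")
    return ["aws", "s3", action, "--no-sign-request", *paths]


if __name__ == "__main__":
    # List the top level of the release.
    print(shlex.join(aws_s3_command("ls", BUCKET_PREFIX)))
    # Mirror one analysis directory; "giab_local/" is a hypothetical destination.
    print(shlex.join(aws_s3_command(
        "sync",
        BUCKET_PREFIX + "analysis/happy-benchmark/",
        "giab_local/happy-benchmark/",
    )))
```

The printed commands can be pasted into a shell, or the returned list can be passed to `subprocess.run` on a machine with the AWS CLI installed.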

Sequencing Outputs

Two flowcells were used to sequence each of the samples to high depth:

Genome   Description            Cell line
HG001    CEPH/UTAH              GM12878
HG002    PGP Ashkenazi Son      GM24385
HG003    PGP Ashkenazi Father   GM24149
HG004    PGP Ashkenazi Mother   GM24143
HG005    PGP Chinese Son        GM24631
HG006    PGP Chinese Father     GM24694
HG007    PGP Chinese Mother     GM24695

For each flowcell used in the sequencing, the PromethION device outputs are available. All data is present as .pod5 files, along with associated summary files, organised in a structured fashion. For example, results for the two flowcells used to sequence the GM24385 (HG002) sample are found at:

$ aws s3 ls s3://ont-open-data/giab_2025.01/flowcells/hg002/
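
The sample table and the example path above can be combined into a small lookup. This is a sketch only: it assumes that the other samples follow the same lower-case `hgNNN` directory naming as the HG002 example, which should be confirmed by listing the `flowcells/` prefix. The function name `flowcell_prefix` is hypothetical.

```python
# Genome identifier to Coriell cell line, as listed in the sample table.
GIAB_SAMPLES = {
    "HG001": "GM12878",
    "HG002": "GM24385",
    "HG003": "GM24149",
    "HG004": "GM24143",
    "HG005": "GM24631",
    "HG006": "GM24694",
    "HG007": "GM24695",
}

FLOWCELLS_ROOT = "s3://ont-open-data/giab_2025.01/flowcells/"


def flowcell_prefix(genome: str) -> str:
    """Return the S3 prefix holding the flowcell outputs for one genome.

    Assumes every sample uses the lower-case naming seen for hg002.
    """
    key = genome.upper()
    if key not in GIAB_SAMPLES:
        raise KeyError(f"unknown genome: {genome}")
    return f"{FLOWCELLS_ROOT}{key.lower()}/"


if __name__ == "__main__":
    for genome, cell_line in GIAB_SAMPLES.items():
        print(genome, cell_line, flowcell_prefix(genome))
```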

Basecalling and analysis

The data analyses presented here were performed using:

  • Dorado v0.8.2
  • wf-human-variation v2.4.1

wf-human-variation includes the following tools:

Tool        Version
Clair3      v1.0.8
Sniffles2   v2.0.7
Straglr     v1.4.5
Spectre     v0.2.2

Note that although wf-human-variation incorporates the functionality of wf-basecalling, in this instance the standalone basecalling workflow was used for logistical data processing reasons. The wf-human-variation workflow is fully integrated, using containerised software to provide scalable analysis. As a brief overview, the workflow is capable of performing:

  • GPU-optimised basecalling with dorado, our latest state-of-the-art basecaller
  • read alignment with minimap2
  • small variant calling with a horizontally-scaled implementation of clair3, and inference models provided by Oxford Nanopore
  • structural variant calling with sniffles2
  • copy number variant (CNV) calling with spectre
  • modified base calling with modkit
  • short tandem repeat (STR) genotyping using a fork of Straglr
  • creation of consolidated summary reports
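
To illustrate how these analyses are selected when running the workflow on already-basecalled reads, the sketch below assembles a Nextflow command line. The option names (`--bam`, `--ref`, `--sample_name`, `--out_dir`, `--snp`, `--sv`, `--cnv`, `--mod`, `--str`) are assumptions based on the workflow's public documentation and may differ between versions, so check them against the workflow's `--help` output. The input file names are hypothetical, and this is not the exact command used to produce this release.

```python
from __future__ import annotations

# Assumed mapping from analysis type to wf-human-variation option.
ANALYSIS_FLAGS = {
    "small_variants": "--snp",
    "structural_variants": "--sv",
    "copy_number": "--cnv",
    "modified_bases": "--mod",
    "short_tandem_repeats": "--str",
}


def workflow_command(bam: str, ref: str, sample_name: str,
                     analyses: list[str], out_dir: str = "output") -> list[str]:
    """Build a nextflow command enabling the requested analyses."""
    unknown = [a for a in analyses if a not in ANALYSIS_FLAGS]
    if unknown:
        raise ValueError(f"unknown analyses: {unknown}")
    cmd = [
        "nextflow", "run", "epi2me-labs/wf-human-variation",
        "--bam", bam,
        "--ref", ref,
        "--sample_name", sample_name,
        "--out_dir", out_dir,
    ]
    # Append one switch per requested analysis, in the order given.
    cmd += [ANALYSIS_FLAGS[a] for a in analyses]
    return cmd


if __name__ == "__main__":
    # "hg002.bam" and "reference.fa" are hypothetical local file names.
    print(" ".join(workflow_command(
        "hg002.bam", "reference.fa", "HG002",
        ["small_variants", "structural_variants"],
    )))
```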

The workflow was run on each flowcell independently, providing two sets of results for each sample. For each sample and flowcell we have additionally provided results for two flavours of the basecalling algorithm: hac (high accuracy) and sup (super accuracy).

The choice is reflected in the path names in the S3 bucket.
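
When working with a listing of the bucket, the basecalling flavour can be recovered from each path. The sketch below assumes that `hac` or `sup` appears as a distinct token in the path, separated by `/`, `_`, `-` or `.`; the exact layout is not described here, so the example paths in the code are hypothetical and the rule should be checked against a real listing.

```python
from __future__ import annotations

import re

FLAVOURS = ("hac", "sup")


def basecall_flavour(path: str) -> str | None:
    """Return 'hac' or 'sup' if exactly one appears as a token in the path."""
    tokens = re.split(r"[/_.\-]", path.lower())
    found = {t for t in tokens if t in FLAVOURS}
    # Ambiguous or missing flavours are reported as None rather than guessed.
    return found.pop() if len(found) == 1 else None


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for p in ("analysis/example/sup/hg002/calls.vcf.gz",
              "analysis/example/hg002_hac/calls.vcf.gz",
              "analysis/example/support/readme.txt"):
        print(p, "->", basecall_flavour(p))
```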

Figure 1. Exemplar sequencing summary metrics, in this case for the HG002 (GM24385) dataset sequenced with Oxford Nanopore Technologies' PromethION instrument.

The results of the wf-human-variation workflow can be found at:

s3://ont-open-data/giab_2025.01/analysis/wf-human-variation

Variant calling summary benchmarks

In addition to running variant calling, we have also provided the results of benchmarking with hap.py for small variants for all the genomes. A summary of the benchmarking is shown in Figure 2; full results and output from hap.py can be found at:

s3://ont-open-data/giab_2025.01/analysis/happy-benchmark/
Figure 2. Small variant calling accuracy statistics for seven Genome in a Bottle samples, sequenced with Oxford Nanopore Technologies' PromethION instrument.
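
For readers unfamiliar with the accuracy statistics used in this kind of benchmark, the sketch below shows how precision, recall and F1 follow from true-positive, false-positive and false-negative counts. hap.py's summary output is understood to report these per variant type as `METRIC.Precision`, `METRIC.Recall` and `METRIC.F1_Score`, counting true positives separately on the truth and query sides; the counts used here are made-up illustrative numbers, not results from this release.

```python
def precision_recall_f1(truth_tp: int, truth_fn: int,
                        query_tp: int, query_fp: int) -> tuple:
    """Compute (precision, recall, F1) from benchmark counts.

    Recall uses truth-side counts, precision uses query-side counts.
    Empty denominators give 0.0 rather than raising.
    """
    recall_den = truth_tp + truth_fn
    precision_den = query_tp + query_fp
    recall = truth_tp / recall_den if recall_den else 0.0
    precision = query_tp / precision_den if precision_den else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative counts only.
    p, r, f1 = precision_recall_f1(truth_tp=990, truth_fn=10,
                                   query_tp=990, query_fp=5)
    print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```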

Structural variant benchmarking was performed using Truvari for HG002. A summary of the benchmarking is shown in Figure 3; full results and output can be found at:

s3://ont-open-data/giab_2025.01/analysis/truvari-benchmark/
Figure 3. Structural variant calling accuracy statistics for HG002, sequenced with Oxford Nanopore Technologies' PromethION instrument.

Further information

For additional information regarding these data please contact support@nanoporetech.com.

We hope that these data and analyses provide a useful resource to the community.

