We are pleased to announce an updated and improved release of nanopore sequencing of Genome in a Bottle (GIAB) samples, now including both the Ashkenazi and Han Chinese trios.
Sequencing was performed on a PromethION with the Ligation Sequencing Kit V14 using the SQK-LSK114 protocol and the latest basecall model, dna_r10.4.1_e8.2_400bps@v5.0.0.
As such the quality of data presented here should be representative of routine sequencing that can be performed by any lab using this latest release.
These reference samples were sequenced with two PromethION flow cells each, yielding around 140 Gbases of sequencing data per flow cell.
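As a rough guide to what these yields mean in practice, the sketch below converts them into an approximate depth of coverage. The genome size of 3.1 Gbases is an assumption (an approximate haploid human genome size), not a figure from this release, so the result is only a back-of-the-envelope estimate.

```python
# Back-of-the-envelope depth estimate from the yields quoted in the text.
GBASES_PER_FLOWCELL = 140    # approximate yield per flow cell (from the text)
FLOWCELLS_PER_SAMPLE = 2     # flow cells per sample (from the text)
GENOME_SIZE_GBASES = 3.1     # ASSUMPTION: approximate haploid human genome size


def estimated_depth(gbases_per_flowcell, flowcells, genome_size_gbases):
    """Return the approximate mean depth of coverage (fold)."""
    return gbases_per_flowcell * flowcells / genome_size_gbases


per_flowcell = estimated_depth(GBASES_PER_FLOWCELL, 1, GENOME_SIZE_GBASES)
per_sample = estimated_depth(
    GBASES_PER_FLOWCELL, FLOWCELLS_PER_SAMPLE, GENOME_SIZE_GBASES
)
print(f"~{per_flowcell:.0f}x per flow cell, ~{per_sample:.0f}x per sample")
```

Under that assumption, each flow cell corresponds to roughly 45x coverage and each sample to roughly 90x. Actual depth depends on read mapping and filtering.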
The following cell line samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: GM12878, GM24143, GM24149, GM24385, GM24631, GM24694, GM24695.
As with previous releases, the new dataset is available for anonymous download from an Amazon Web Services S3 bucket. The bucket is part of the Open Data on AWS project, which enables sharing and analysis of a wide range of data.
The data is located in the bucket at:
s3://ont-open-data/giab_2025.01/
See the tutorials page for information on downloading the dataset.
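A minimal sketch of anonymous access with the AWS CLI is given below. It only builds and prints the commands, so nothing is downloaded when it runs. The helper names and the local directory `giab_2025.01` are hypothetical. `--no-sign-request` is the AWS CLI option for unauthenticated access, and `--dryrun` previews a `sync` without transferring files.

```python
import shlex

# Bucket location quoted in the text.
BUCKET_PREFIX = "s3://ont-open-data/giab_2025.01/"


def list_command(prefix=BUCKET_PREFIX):
    """Build an anonymous `aws s3 ls` command for the given prefix."""
    return ["aws", "s3", "ls", "--no-sign-request", prefix]


def sync_command(dest, prefix=BUCKET_PREFIX, dryrun=True):
    """Build an anonymous `aws s3 sync` command copying `prefix` to `dest`.

    With dryrun=True the CLI only reports what it would transfer.
    """
    cmd = ["aws", "s3", "sync", "--no-sign-request"]
    if dryrun:
        cmd.append("--dryrun")
    cmd += [prefix, dest]
    return cmd


# Print the commands so they can be inspected before being run in a shell.
# "giab_2025.01" is a hypothetical local destination directory.
print(shlex.join(list_command()))
print(shlex.join(sync_command("giab_2025.01")))
```

The full dataset is large, so it is usually preferable to pass a narrower prefix (for example a single sample) rather than syncing the whole release.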
Two flowcells were used to sequence each of the samples to high depth:
Genome | Description | Cell line |
---|---|---|
HG001 | CEPH/UTAH | GM12878 |
HG002 | PGP Ashkenazi Son | GM24385 |
HG003 | PGP Ashkenazi Father | GM24149 |
HG004 | PGP Ashkenazi Mother | GM24143 |
HG005 | PGP Chinese Son | GM24631 |
HG006 | PGP Chinese Father | GM24694 |
HG007 | PGP Chinese Mother | GM24695 |
For each flowcell used in the sequencing the PromethION device outputs are available.
All data is present as .pod5 files, along with associated summary files, organised in a structured fashion.
For example, results for the two flowcells used to sequence the GM24385 (HG002) sample are found at:
$ aws s3 ls s3://ont-open-data/giab_2025.01/flowcells/hg002/
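The sample table and the HG002 path above can be combined into a small lookup. The sketch below assumes that every sample follows the same lowercase `flowcells/<genome>/` pattern shown for HG002. Only the HG002 path is given in the text, so the other prefixes should be checked by listing the bucket. The function name is hypothetical.

```python
# Genome ID to Coriell cell line, taken from the sample table in the text.
GIAB_SAMPLES = {
    "HG001": "GM12878",
    "HG002": "GM24385",
    "HG003": "GM24149",
    "HG004": "GM24143",
    "HG005": "GM24631",
    "HG006": "GM24694",
    "HG007": "GM24695",
}

FLOWCELLS_ROOT = "s3://ont-open-data/giab_2025.01/flowcells/"


def flowcell_prefix(genome_id):
    """Return the S3 prefix expected to hold a sample's flowcell outputs.

    ASSUMPTION: all samples use the lowercase layout shown for hg002.
    """
    genome_id = genome_id.upper()
    if genome_id not in GIAB_SAMPLES:
        raise KeyError(f"unknown GIAB genome ID: {genome_id}")
    return f"{FLOWCELLS_ROOT}{genome_id.lower()}/"


for genome_id, cell_line in GIAB_SAMPLES.items():
    print(genome_id, cell_line, flowcell_prefix(genome_id))
```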
The data analyses presented here were performed using the wf-basecalling and wf-human-variation workflows.
The latter of these includes:
Tool | Version |
---|---|
Clair3 | v1.0.8 |
Sniffles2 | v2.0.7 |
Straglr | v1.4.5 |
Spectre | v0.2.2 |
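Note that although wf-human-variation incorporates the functionality of wf-basecalling, in this instance the standalone basecalling workflow was used for logistical data processing reasons. The wf-human-variation workflow is fully integrated, using containerised software to provide scalable analysis. As a brief overview, with the tools listed above the workflow is capable of performing small variant calling (Clair3), structural variant calling (Sniffles2), short tandem repeat genotyping (Straglr) and copy number variant calling (Spectre).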
The workflow was run on each flowcell independently, providing two results for each sample. For each sample and flowcell we have additionally provided results for two flavours of the basecalling algorithm: firstly hac or high accuracy, and secondly sup or super accuracy.
The choice is reflected in the path names in the S3 bucket.
The results of the wf-human-variation workflow can be found at:
s3://ont-open-data/giab_2025.01/analysis/wf-human-variation
In addition to running variant calling, we have also provided the results of a benchmarking analysis using hap.py for small variants for all the genomes. A summary of the benchmarking is shown in Figure 2. Full results and output from hap.py can be found at:
s3://ont-open-data/giab_2025.01/analysis/happy-benchmark/
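For readers who want to tabulate the benchmarking output themselves, the sketch below pulls recall, precision and F1 from a hap.py summary CSV. It assumes the usual hap.py summary layout with `Type`, `Filter`, `METRIC.Recall`, `METRIC.Precision` and `METRIC.F1_Score` columns. The file names used in the bucket are not stated here, and the inline values are placeholders, not results from this release.

```python
import csv
import io


def pass_metrics(summary_csv_text):
    """Return {variant_type: (recall, precision, f1)} for PASS rows.

    ASSUMPTION: the input follows the hap.py summary CSV column layout.
    """
    metrics = {}
    for row in csv.DictReader(io.StringIO(summary_csv_text)):
        if row["Filter"] == "PASS":
            metrics[row["Type"]] = (
                float(row["METRIC.Recall"]),
                float(row["METRIC.Precision"]),
                float(row["METRIC.F1_Score"]),
            )
    return metrics


# Placeholder values for illustration only; NOT results from this dataset.
EXAMPLE_SUMMARY = """Type,Filter,METRIC.Recall,METRIC.Precision,METRIC.F1_Score
INDEL,ALL,0.5,0.5,0.5
INDEL,PASS,0.5,0.5,0.5
SNP,ALL,0.8,0.8,0.8
SNP,PASS,0.8,0.8,0.8
"""

for variant_type, (recall, precision, f1) in pass_metrics(EXAMPLE_SUMMARY).items():
    print(f"{variant_type}: recall={recall} precision={precision} F1={f1}")
```

In practice the text of a downloaded summary file would be passed to `pass_metrics` in place of the placeholder string.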
Structural variant benchmarking was done using Truvari for HG002. A summary of the benchmarking is shown in Figure 3. Full results and output can be found at:
s3://ont-open-data/giab_2025.01/analysis/truvari-benchmark/
For additional information regarding these data please contact support@nanoporetech.com.
We hope that these data and analyses provide a useful resource to the community.