Sequencing Genome in a Bottle samples

By Andrea Talenti
Published in Data Releases
May 26, 2023

We are pleased to announce the release of a new addition to the Oxford Nanopore Open Data project: sequencing of several Genome in a Bottle samples (including the Ashkenazi Trio). Sequencing was performed with the 5 kHz upgrade to the Ligation Sequencing Kit V14 released in MinKNOW 23.04.05. As such, the quality of the data presented here should be representative of routine sequencing that any lab can perform using this latest release.

These reference samples were each sequenced with two PromethION flow cells to yield more than 200 Gbases of sequencing data per sample.
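
As a rough guide to what this yield means in terms of read depth, the arithmetic can be sketched as follows. The genome size of 3.1 Gbases is an assumed approximate value for a human genome, not a figure from this release, and the function name is only illustrative.

```shell
# Rough depth estimate: total sequenced bases divided by genome size.
# GENOME_GBASES=3.1 is an assumed approximate human genome size,
# not a number taken from this post.
estimate_depth() {
    # $1: yield in Gbases, $2: genome size in Gbases
    awk -v y="$1" -v g="$2" 'BEGIN { printf "%.1f\n", y / g }'
}

GENOME_GBASES=3.1
estimate_depth 200 "$GENOME_GBASES"
```

With the 200 Gbases lower bound this gives roughly 65-fold depth before any filtering or alignment losses.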

The following cell line samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: GM12878, GM24143, GM24149, GM24385

Data location

As with previous releases, the new dataset is available for anonymous download from an Amazon Web Services S3 bucket. The bucket is part of the Open Data on AWS project, which enables sharing and analysis of a wide range of data.

The data is located in the bucket at:

s3://ont-open-data/giab_2023.05/

See the tutorials page for information on downloading the dataset.
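
For readers who already have the AWS command line tools installed, anonymous access can be sketched as below. The helper names `giab_ls` and `giab_sync` are made up for illustration; `--no-sign-request` is the AWS CLI option that skips credential lookup, and `--dryrun` previews a transfer without copying anything.

```shell
# Bucket prefix quoted in this post.
GIAB_PREFIX="s3://ont-open-data/giab_2023.05"

# List a sub-path of the release without AWS credentials.
giab_ls() {
    aws s3 ls --no-sign-request "${GIAB_PREFIX}/${1:-}"
}

# Mirror a sub-path to a local directory. Pass --dryrun as an optional
# third argument to preview the transfer without copying anything.
giab_sync() {
    aws s3 sync --no-sign-request ${3:-} "${GIAB_PREFIX}/$1" "$2"
}

# Example usage (network access required):
#   giab_ls flowcells/
#   giab_sync analysis/small_variants_happy/ ./happy --dryrun
```

The raw signal data is large, so a dry run before any full sync is probably worthwhile.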

Sequencing Outputs

Two flowcells were used to sequence each of the samples to high depth:

Genome   Description            Cell line
HG001    CEPH/UTAH              GM12878
HG002    PGP Ashkenazi Son      GM24385
HG003    PGP Ashkenazi Father   GM24149
HG004    PGP Ashkenazi Mother   GM24143
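
When scripting against the bucket it can help to translate between the two naming schemes. The lookup below is a minimal sketch built only from the table above; the function name is hypothetical.

```shell
# Map a Coriell cell line ID to its GIAB genome ID, using the table above.
cell_line_to_genome() {
    case "$1" in
        GM12878) echo "HG001" ;;
        GM24385) echo "HG002" ;;
        GM24149) echo "HG003" ;;
        GM24143) echo "HG004" ;;
        *) echo "unknown cell line: $1" >&2; return 1 ;;
    esac
}

# The flowcell listing in this post uses the lowercase genome ID (hg002)
# as a path component. Assuming the other samples follow the same
# pattern is a guess that should be checked against the bucket.
genome=$(cell_line_to_genome GM24385)
echo "$genome" | tr '[:upper:]' '[:lower:]'
```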

For each flowcell used in the sequencing, the PromethION device outputs are available. All data is present as .pod5 files, along with associated summary files, in a structured fashion. For example, results from one of the flowcells used to sequence the GM24385 (HG002) sample are found as:

$ aws s3 ls s3://ont-open-data/giab_2023.05/flowcells/hg002/20230424_1302_3H_PAO89685_2264ba8c/
PRE other_reports/
PRE pod5_fail/
PRE pod5_pass/
2023-05-12 12:10:53 248 barcode_alignment_PAO89685_2264ba8c_afee3a87.tsv
2023-05-12 12:39:33 656 final_summary_PAO89685_2264ba8c_afee3a87.txt
2023-05-12 12:39:33 224724629 full_ss_every_17.txt
2023-05-12 18:21:16 2269523 pore_activity_PAO89685_2264ba8c_afee3a87.csv
2023-05-12 18:21:16 22500344 read_list.txt
2023-05-12 18:21:18 1496823 report_PAO89685_20230424_1308_2264ba8c.html
2023-05-12 18:21:18 946254 report_PAO89685_20230424_1308_2264ba8c.json
2023-05-12 18:21:19 2817707 report_PAO89685_20230424_1308_2264ba8c.md
2023-05-12 18:21:19 180 sample_sheet_PAO89685_20230424_1308_2264ba8c.csv
2023-05-12 18:21:20 3602275623 sequencing_summary_PAO89685_2264ba8c_afee3a87.txt
2023-05-12 18:21:38 546602 throughput_PAO89685_2264ba8c_afee3a87.csv
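
Once a sequencing summary file has been downloaded, a quick yield check can be sketched with awk. This assumes the file is tab-separated with a header row containing `passes_filtering` and `sequence_length_template` columns, as MinKNOW sequencing summaries typically do; the function name is illustrative.

```shell
# Count reads and bases that passed filtering in a sequencing summary.
# Columns are located by header name, so column order does not matter.
# Assumes a tab-separated file whose header includes passes_filtering
# and sequence_length_template.
summarise_yield() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["passes_filtering"]) == "TRUE" {
            reads++
            bases += $(col["sequence_length_template"])
        }
        END { printf "pass_reads=%d pass_bases=%.0f\n", reads, bases }
    ' "$1"
}

# Example usage, after downloading the file listed above:
#   summarise_yield sequencing_summary_PAO89685_2264ba8c_afee3a87.txt
```

The summary file in the listing is several gigabytes, so a single streaming pass like this is likely preferable to loading it into memory.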

Note that for some flowcells there are multiple logical sequencing runs because the sequencing devices were restarted partway through the intended run times.
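
One way to spot flowcells with more than one logical run is to count run folders per flowcell ID. The sketch below assumes run folder names follow the pattern seen in the example above (date, time, device position, flowcell ID, short run ID, separated by underscores); that layout is inferred from the single example and is not documented here.

```shell
# Read run folder names on stdin and print "flowcell_id run_count".
# Assumes the 4th underscore-separated field is the flowcell ID, as in
# 20230424_1302_3H_PAO89685_2264ba8c.
runs_per_flowcell() {
    sed 's:/*$::' |
        awk -F'_' 'NF >= 5 { n[$4]++ } END { for (f in n) print f, n[f] }' |
        sort
}

# Example usage (network access required):
#   aws s3 ls --no-sign-request s3://ont-open-data/giab_2023.05/flowcells/hg002/ |
#       awk '$1 == "PRE" { print $2 }' | runs_per_flowcell
```

Any flowcell reported with a count above one would have several run folders to combine.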

Basecalling and analysis

The data analyses presented here were performed using our workflows:

  • wf-basecalling
  • wf-human-variation

implemented in Nextflow. Note that although wf-human-variation incorporates the functionality of wf-basecalling, in this instance the standalone basecalling workflow was used for logistical data processing reasons. The wf-human-variation workflow is fully integrated using containerised software to provide scalable analysis. As a brief overview, the workflow is capable of performing:

  • GPU optimised basecalling with dorado, our latest state-of-the-art basecaller
  • read alignment with minimap2
  • small variant calling with a horizontally-scaled implementation of clair3, and inference models provided by Oxford Nanopore
  • structural variant calling with sniffles2
  • aggregation of 5mC and 5hmC modified base data with our own modbam2bed
  • copy number variant (CNV) calling through QDNAseq
  • short tandem repeat (STR) genotyping using a fork of Straglr
  • creation of consolidated summary reports
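
A typical invocation for small and structural variant calling might look like the sketch below, which only prints the command rather than running it. The input file names are placeholders, and the option names (`--bam`, `--ref`, `--snp`, `--sv`) are taken from the workflow's documentation as we understand it; they can change between releases, so check `nextflow run epi2me-labs/wf-human-variation --help` for the version in use.

```shell
# Print (do not run) a wf-human-variation command for SNP and SV calling.
# $1: aligned or unaligned BAM (placeholder), $2: reference FASTA (placeholder)
# Option names may differ between workflow releases.
wf_human_variation_cmd() {
    printf '%s\n' "nextflow run epi2me-labs/wf-human-variation --bam $1 --ref $2 --snp --sv"
}

wf_human_variation_cmd sample.bam reference.fasta
```

Printing the command first makes it easy to review the options before committing compute time to a full run.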

The workflow was run on the combined sets of data from each pair of flowcells for each sample. For each sample we have provided results for two flavours of the basecalling algorithm: 1) hac - high accuracy and 2) sup - super accuracy. The choice is reflected in the path names in the S3 bucket.

Figure 1. Exemplar sequencing summary metrics, in this case for the HG002 (GM24385) dataset sequenced with Oxford Nanopore Technologies' PromethION instrument.

The results of the wf-human-variation workflow can be found at:

s3://ont-open-data/giab_2023.05/analysis/variant_calling

A note on read depth

We find that Clair3 is sensitive to high read depth: variant calling performance can suffer when read depth is excessive. Therefore we have performed variant calling using the full datasets for each of the genomes HG001-004 and for a downsampled random selection of reads leading to 60-fold coverage of each genome HG001-003. Downsampling was not performed for HG004 as the total depth was approximately 60X in any case. This data downsampling is not currently implemented in wf-human-variation.
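
Since downsampling is not built into the workflow, it has to be done beforehand. The sketch below shows one possible approach, not necessarily the one used for this release: compute the fraction of reads to keep from the observed and target depths, then pass it to `samtools view --subsample`. The observed depth of 75 is a made-up example value.

```shell
# Fraction of reads to keep so that mean depth falls to the target.
# Capped at 1 so that samples already at or below target are left whole.
subsample_fraction() {
    # $1: observed mean depth, $2: target depth
    awk -v d="$1" -v t="$2" 'BEGIN {
        f = t / d
        if (f > 1) f = 1
        printf "%.4f\n", f
    }'
}

# 75 is a hypothetical observed depth, used here only for illustration.
frac=$(subsample_fraction 75 60)
echo "$frac"

# The fraction could then be applied with samtools, for example:
#   samtools view --subsample "$frac" --subsample-seed 42 -b -o out.bam in.bam
```

Fixing the seed keeps the random selection reproducible between runs.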

Variant calling summary benchmarks

In addition to running variant calling, we have also provided the results of a benchmarking analysis using hap.py for small variants for all the genomes. A summary of the benchmarking is shown in Figure 2; full results and output from hap.py can be found at:

s3://ont-open-data/giab_2023.05/analysis/small_variants_happy/

Figure 2. Variant calling accuracy statistics for four Genome in a Bottle samples at 60-fold coverage, sequenced with Oxford Nanopore Technologies' PromethION instrument.
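
To pull the headline numbers out of a downloaded hap.py result, something like the following sketch could be used. It assumes the standard hap.py summary CSV layout, with `Type`, `Filter`, `METRIC.Recall`, `METRIC.Precision` and `METRIC.F1_Score` columns; the function name is illustrative, and the exact file names in the bucket are not assumed here.

```shell
# Print type, recall, precision and F1 for PASS rows of a hap.py summary CSV.
# Columns are found by header name rather than by position.
happy_f1() {
    awk -F',' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["Filter"]) == "PASS" {
            print $(col["Type"]), $(col["METRIC.Recall"]),
                  $(col["METRIC.Precision"]), $(col["METRIC.F1_Score"])
        }
    ' "$1"
}

# Example usage, with a summary CSV downloaded from the path above:
#   happy_f1 path/to/summary.csv
```

Each output line covers one variant type (SNP or INDEL), which makes it straightforward to compare samples or basecalling models side by side.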

Further information

For additional information regarding these data please contact support@nanoporetech.com.

We hope that these data and analyses provide a useful resource to the community.


Tags

#datasets #human cell-line #R10.4.1 #basecalling #dorado #kit14 #variant-calls


Andrea Talenti

Bioinformatician
