Genome in a Bottle Ashkenazi Trio with Ligation Sequencing Kit V14

By Chris Wright
Published in Data Releases
January 27, 2023

We are pleased to announce the release of a new addition to the Oxford Nanopore Open Data project: sequencing of the Genome in a Bottle Ashkenazi Trio. These three reference samples were sequenced with two PromethION flow cells each to yield around 200 Gbases of sequencing per sample. All sequencing was performed using the Ligation Sequencing Kit V14 sequencing chemistry.

The following cell line samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: GM24143, GM24149, and GM24385.

Data location

As with previous releases, the new dataset is available for anonymous download from an Amazon Web Services S3 bucket. The bucket is part of the Open Data on AWS project, which enables sharing and analysis of a wide range of data.

The data is located in the bucket at:

s3://ont-open-data/giab_lsk114_2022.12/

See the tutorials page for information on downloading the dataset.
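
As a minimal sketch of anonymous access, the helper functions below wrap the AWS CLI with its `--no-sign-request` option, so no AWS account or credentials are needed. The function names are hypothetical and exist only for illustration. Only the `flowcells/hg002/` prefix is shown in this post; the `hg003` and `hg004` prefixes are assumed to follow the same pattern.

```shell
# Root of this release in the public bucket (path taken from the post).
DATASET="s3://ont-open-data/giab_lsk114_2022.12"

# Print the prefix holding raw flowcell outputs for one sample.
# Only hg002 is confirmed by the listing in the post; hg003/hg004 are assumed.
flowcell_prefix() {
    printf '%s/flowcells/%s/\n' "$DATASET" "$1"
}

# List the flowcell run directories for a sample without AWS credentials.
list_flowcells() {
    aws s3 ls --no-sign-request "$(flowcell_prefix "$1")"
}

# Copy only the small run summaries and HTML reports for a sample into a
# local directory, skipping the large .fast5 signal files.
fetch_reports() {
    aws s3 sync --no-sign-request "$(flowcell_prefix "$1")" "$2" \
        --exclude "*" \
        --include "*final_summary_*.txt" \
        --include "*report_*.html"
}
```

For example, `list_flowcells hg002` lists the run directories for HG002, and `fetch_reports hg002 ./hg002_reports` downloads just the lightweight reports. Filtering with `--exclude "*"` followed by `--include` patterns lets you inspect a run before committing to a download of the raw signal data.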

Sequencing Outputs

Two flowcells were used to sequence each of the samples to high depth:

Genome  Relationship  Cell line
HG002   Son           GM24385
HG003   Father        GM24149
HG004   Mother        GM24143
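
To put "high depth" into rough numbers, the sketch below divides the approximate yield quoted above (around 200 Gbases per sample) by the size of the human genome. The genome size of 3.1 Gbases is an assumed round figure, and the function name is hypothetical; the result is an approximation, not a measured coverage value from this release.

```shell
# Approximate mean depth = total sequenced bases / genome size.
# Arguments: yield in Gbases, genome size in Gbases (default 3.1, an assumed
# round figure for the human genome).
approx_depth() {
    awk -v yield="$1" -v genome="${2:-3.1}" \
        'BEGIN { printf "%.1f\n", yield / genome }'
}

approx_depth 200   # prints 64.5
```

On these assumptions each sample was sequenced to roughly 65-fold depth.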

The PromethION device outputs are available for each flowcell used in the sequencing. All data is present as .fast5 files, along with the associated summary files, in a structured layout. For example, results from one of the flowcells used to sequence the GM24385 (HG002) sample are found as:

$ aws s3 ls s3://ont-open-data/giab_lsk114_2022.12/flowcells/hg002/20221109_1654_5A_PAG65784_f306681d/
PRE fast5_fail/
PRE fast5_pass/
PRE fast5_skip/
PRE other_reports/
2022-12-06 16:11:54 258 barcode_alignment_PAG65784_f306681d_16a70748.tsv
2022-12-06 22:33:58 670 final_summary_PAG65784_f306681d_16a70748.txt
2022-12-06 22:34:01 2670667 pore_activity_PAG65784_f306681d_16a70748.csv
2022-12-06 22:34:00 1233022 report_PAG65784_20221109_1700_f306681d.html
2022-12-06 22:34:01 764774 report_PAG65784_20221109_1700_f306681d.json
2022-12-06 22:34:02 3313290 report_PAG65784_20221109_1700_f306681d.md
2022-12-06 22:34:02 190 sample_sheet_PAG65784_20221109_1700_f306681d.csv
2022-12-06 22:34:03 2583851309 sequencing_summary_PAG65784_f306681d_16a70748.txt
2022-12-06 22:34:18 640850 throughput_PAG65784_f306681d_16a70748.csv
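
Before downloading, it can help to total the file sizes in a listing like the one above. The following sketch sums the size column of `aws s3 ls` output read from standard input and skips the `PRE` directory entries. The function name is hypothetical.

```shell
# Sum the size column (third field) of `aws s3 ls` output read from stdin,
# ignoring "PRE" directory entries, and print the total in bytes and in GB.
total_size() {
    awk '$1 != "PRE" && $3 ~ /^[0-9]+$/ { bytes += $3 }
         END { printf "%.0f bytes (%.2f GB)\n", bytes, bytes / 1e9 }'
}
```

Piping the listing above into `total_size` gives 2592475030 bytes (2.59 GB) for the top-level files, almost all of which is the sequencing summary. The `fast5_*` directories are not included unless the listing is produced with `--recursive`.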

Basecalling and analysis

The data analyses presented here were performed using our end-to-end wf-human-variation workflow implemented in Nextflow. The workflow is fully integrated using containerised software to provide scalable analysis. As a brief overview, the workflow is capable of performing:

  • GPU optimised basecalling with dorado, our latest state-of-the-art basecaller
  • read alignment with minimap2
  • small variant calling with a horizontally-scaled implementation of clair3, and inference models provided by Oxford Nanopore
  • structural variant calling with sniffles2
  • aggregation of 5mC and 5hmC modified base data with our own modbam2bed
  • creation of consolidated summary reports

The workflow was run on the combined sets of data from each pair of flowcells for each sample. For each sample we have provided results for two flavours of the basecalling algorithm: 1) hac - high accuracy and 2) sup - super accuracy. The choice is reflected in the path names in the S3 bucket.

All compute was performed using Amazon Web Services Batch compute driven by Nextflow. The end-to-end workflow including basecalling and small variant calling took around 7 hours to run. All alignments and variants were calculated with respect to GRCh38.

Figure: Sequencing summary metrics for each genome in the Ashkenazi Trio sequenced with Oxford Nanopore Technologies' PromethION instrument.

Variant calling summary benchmarks

In addition to running variant calling, we have also provided the results of benchmarking analyses using hap.py for small variants (HG002, HG003, HG004) and truvari for structural variants (HG002 only). The results of running these tools are shown in the table below and can be found at:

s3://ont-open-data/giab_lsk114_2022.12/analysis/small_variants_happy/
s3://ont-open-data/giab_lsk114_2022.12/analysis/structural_variants_truvari/

Note that since the HG002 structural variant benchmark is defined with respect to the GRCh37 reference, the sequencing data was realigned to this reference before performing structural variant calling and benchmarking.

genome  type   basecaller  recall  precision  f1-score
HG002   SNP    sup         0.9981  0.9987     0.9984
HG002   SNP    hac         0.9988  0.9977     0.9983
HG002   INDEL  sup         0.8228  0.9359     0.8757
HG002   INDEL  hac         0.8192  0.8884     0.8524
HG002   SV     sup         0.9745  0.9508     0.9625
HG002   SV     hac         0.9756  0.9521     0.9637
HG003   SNP    sup         0.9959  0.9988     0.9973
HG003   SNP    hac         0.9981  0.9979     0.9980
HG003   INDEL  sup         0.8419  0.9407     0.8886
HG003   INDEL  hac         0.8320  0.9013     0.8653
HG004   SNP    sup         0.9981  0.9989     0.9985
HG004   SNP    hac         0.9989  0.9979     0.9984
HG004   INDEL  sup         0.8279  0.9349     0.8781
HG004   INDEL  hac         0.8226  0.8879     0.8540
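
The f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it from the other two columns, which is a quick way to sanity-check values copied from the table. The function name is hypothetical.

```shell
# F1-score: harmonic mean of precision and recall, printed to four decimals.
f1_score() {
    awk -v p="$1" -v r="$2" 'BEGIN { printf "%.4f\n", 2 * p * r / (p + r) }'
}

f1_score 0.9987 0.9981   # HG002 SNP, sup: prints 0.9984
f1_score 0.9359 0.8228   # HG002 INDEL, sup: prints 0.8757
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the INDEL f1-scores sit closer to recall than to precision.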

Further information

For additional information regarding these data please contact support@nanoporetech.com.

We hope that these data and analyses provide a useful resource to the community.


Tags

#datasets #human cell-line #R10.4.1 #basecalling #dorado #kit14 #variant-calls

Chris Wright

Senior Director, Customer Workflows
