Structural variation calling with GM24385

By Chris Wright
Published in Data Releases
December 01, 2020

In this blog post we will explore structural variant calling using the recently released HG002 (GM24385, Ashkenazi son) dataset.

The GM24385 dataset comprises whole genome sequencing of a well-characterised human cell line. It therefore provides a useful benchmark sample; the cell line was also used as a “seen” sample in the recent PrecisionFDA Truth Challenge V2 competition for small variant calling.

Structural variant calling with lra and cuteSV

As an easily reproducible example we will focus on a single flowcell of the GM24385 2020.11 data release.

This walkthrough assumes some familiarity with standard bioinformatic tools for handling genomics data. Working installations of samtools, snakemake, git, and the AWS command-line tools are required to follow the process below.
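Before starting, it can be worth confirming that these tools are actually available on the PATH. The following is only a minimal sketch: the function name check_tools is our own illustration, and it assumes that the AWS command-line tools provide the usual aws executable.

```shell
# Report any of the named executables that cannot be found on the PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example, with the tools required by this walkthrough:
#   check_tools samtools snakemake git aws
```

The check only establishes that each executable can be found; it says nothing about whether the installed versions are suitable.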

Data preparation

We will start by downloading Guppy 4.0.11 basecalls from a PromethION sequencing experiment (see our tutorial FAQs for more information on downloading data):

aws s3 --no-sign-request cp \
s3://ont-open-data/gm24385_2020.11/analysis/r9.4.1/20201026_1644_2-E5-H5_PAG07162_d7f262d5/guppy_v4.0.11_r9.4.1_hac_prom/basecalls.fastq.gz \

The .fastq.gz file downloaded above contains the QC-pass basecalls from the experiment, amounting to around 200 Gbases.
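The total yield can be checked from the downloaded file itself. The sketch below sums the lengths of the sequence lines; the function name count_bases is hypothetical, and the approach assumes the conventional four-line FASTQ layout (no wrapped sequence lines).

```shell
# Sum the lengths of all sequence lines (the second line of every
# four-line record) in a gzip-compressed FASTQ file and print the total.
count_bases() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }'
}

# Example:
#   count_bases basecalls.fastq.gz
```

The total is printed with a floating-point format because some awk implementations truncate large integers when using an integer format, and the expected total here is far above the 32-bit range.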

To run the SV calling pipeline and perform benchmarking we will need release 37 of the human reference sequence:


Running the variant calling

To perform structural variant calling Oxford Nanopore Technologies recommends using the pipeline-structural-variation snakemake workflow. This workflow has recently been updated to use lra and cuteSV, replacing the previous minimap2- and sniffles-based approach. After installing the software we run it with its default settings:

conda activate pipeline-structural-variation-v2
snakemake call --config \
input_fastq=basecalls.fastq.gz \
reference_fasta=human_g1k_v37.fasta \
threads=76 \

The useful output for our purposes is the single Variant Call Format file; a copy of the file is available in the dataset S3 bucket at:



The accuracy of the variant calling performed above can be assessed by comparing the results to the Genome in a Bottle truth sets for the GM24385 sample. The truth sets can be downloaded from the NCBI repository:

for ext in ".vcf.gz" ".vcf.gz.tbi" ".bed"; do
wget -O $truth_name$ext \$truth_name$ext
done

With these reference data we will use truvari to assess the recall and precision of the variant calls made by the calling pipeline:

truvari bench --passonly --pctsim 0 \
-b $truth_vcf --includebed $truth_bed \
-f $reference -c $input_vcf \
-o $output_dir
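A few simple checks before launching the comparison can save a failed run. To the best of our knowledge truvari expects compressed VCF inputs with a tabix index alongside them, and will not write into an output directory that already exists; both points are assumptions worth confirming against the truvari documentation for the installed version. The helper name preflight below is our own.

```shell
# Check that the output directory (first argument) does not exist yet and
# that every VCF given afterwards has a .tbi index file next to it.
# Returns non-zero if any check fails.
preflight() {
    out_dir=$1
    shift
    status=0
    if [ -e "$out_dir" ]; then
        echo "output directory already exists: $out_dir" >&2
        status=1
    fi
    for vcf in "$@"; do
        if [ ! -f "$vcf.tbi" ]; then
            echo "no tabix index found for: $vcf" >&2
            status=1
        fi
    done
    return "$status"
}

# Example, using the same variables as the truvari command above:
#   preflight "$output_dir" "$truth_vcf" "$input_vcf"
```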

Truvari outputs precision and recall figures for the structural variants. With a little work (detailed in the EPI2MELabs Structural Variation Benchmarking tutorial) we can separate the counts for deletion and insertion (including duplication) variants:
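One simple way to obtain per-type counts is to tally the SVTYPE entries in the INFO column of a VCF. This is a rough sketch and not the method used in the tutorial mentioned above; the function name count_svtypes is hypothetical, and the file name in the example is an assumption about how truvari names its outputs, which may differ between versions.

```shell
# Print a count of records per SVTYPE for an uncompressed VCF read from
# standard input, one type per line, sorted by type name.
count_svtypes() {
    awk -F '\t' '
        /^#/ { next }                      # skip header lines
        {
            n = split($8, info, ";")       # column 8 is the INFO field
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^SVTYPE=/) {
                    counts[substr(info[i], 8)]++
                }
            }
        }
        END { for (t in counts) print t "\t" counts[t] }
    ' | sort
}

# Example (assumed output file name):
#   count_svtypes < "$output_dir/tp-call.vcf"
```

Grouping duplications with insertions, as described above, would then be a matter of adding the DUP and INS rows together.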


With a little more work still, we can produce the following figure depicting the F1 score along with counts of SVs in the truth set:

Structural variation calling f1-score
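For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes all three from true positive, false positive and false negative counts. The function name f1_score is our own, and using a single true positive count for both precision and recall is a simplification: a benchmarking tool may count matched truth and matched call records separately.

```shell
# Compute precision, recall and F1 from true positive, false positive and
# false negative counts, printing the three values tab-separated.
# Any ratio with a zero denominator is reported as 0.
f1_score() {
    awk -v tp="$1" -v fpos="$2" -v fneg="$3" 'BEGIN {
        precision = (tp + fpos > 0) ? tp / (tp + fpos) : 0
        recall    = (tp + fneg > 0) ? tp / (tp + fneg) : 0
        f1 = (precision + recall > 0) ? 2 * precision * recall / (precision + recall) : 0
        printf "%.4f\t%.4f\t%.4f\n", precision, recall, f1
    }'
}

# Example with made-up counts (90 TP, 10 FP, 30 FN):
#   f1_score 90 10 30
```

With the made-up counts in the example, precision is 0.9 and recall is 0.75, giving an F1 score of roughly 0.818.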




Chris Wright

Senior Director, Customer Workflows
