Structural variation calling with GM24385

By Chris Wright
Published in Data Releases
December 01, 2020
1 min read

In this blog post we will explore structural variant calling using the recently released HG002 (GM24385 Ashkenazi Son) data release.

The GM24385 dataset comprises whole genome sequencing of a well-characterised human cell line. It therefore provides a useful benchmark sample; the cell line was also used as a “seen” sample in the recent PrecisionFDA Truth Challenge V2 competition for small variant calling.

Structural variant calling with lra and cuteSV

As an easily reproducible example we will focus on a single flowcell of the GM24385 2020.11 data release.

This walkthrough assumes some familiarity with standard bioinformatic tools for handling genomics data. A working installation of samtools, snakemake, git, and the AWS command-line tools are required to follow the process below.

Data preparation

We will start by downloading Guppy 4.0.11 basecalls from a PromethION sequencing experiment (see our tutorial FAQs) for more information on downloading data):

aws s3 --no-sign-request cp \
s3://ont-open-data/gm24385_2020.11/analysis/r9.4.1/20201026_1644_2-E5-H5_PAG07162_d7f262d5/guppy_v4.0.11_r9.4.1_hac_prom/basecalls.fastq.gz \
basecalls.fastq.gz

The .fastq file downloaded above contains the QC pass calls from the experiment amounting to around 200 Gbases.

To run the SV calling pipeline and perform benchmarking we will need release 37 of the human reference sequence:

wget http://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

Running the variant calling

To perform structural variant calling Oxford Nanopore Technologies recommends using the pipeline-structural-variation snakemake workflow. This workflow as been recently updated to use lra and cuteSV, replacing the previous minimap2 and sniffles based approach. After installation of this software we use it with its default settings:

conda activate pipeline-structural-variation-v2
snakemake call --config \
input_fastq=basecalls.fastq.gz \
reference_fasta=human_g1k_v37.fasta \
threads=76 \
sample_name=PAG07162-hg37

The useful output for our purposes is the single Variant Call Format file; a copy of the file is available in the dataset S3 bucket at:

s3://ont-open-data/gm24385_2020.11/extra_analyses/sv_calling/PAG07162-hg37/sv_calls/PAG07162-hg37_cutesv_filtered.vcf.gz

Evaluation

The veracity of the variant calling performed above can be obtained by comparing the results to the Genome In A Bottle truth sets for the GM24385 sample. The truth sets can be downloaded from the NCBI repository:

truth_name="HG002_SVs_Tier1_v0.6"
for ext in ".vcf.gz", ".vcf.gz.tbi", ".bed"; do
wget -O $truth_name$ext \
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/$truth_name$ext
done

With these reference data we will use truvari to assess the recall and precision of the variant calls made by the calling pipeline:

truth_vcf="${truth_name}.vcf.gz"
truth_bed="${truth_name}.bed"
reference="human_g1k_v37.fasta"
input_vcf="PAG07162-hg37_cutesv_filtered.vcf.gz"
output_dir="truvari"
truvari bench --passonly --pctsim 0 \
-b $truth_vcf --includebed $truth_bed \
-f $reference -c $input_vcf \
-o $output_dir

Truvari outputs precision and recall figures for the structural variants. With a little work (detailed in the EPI2MELabs Structural Variation Benchmarking tutorial) we can separate the counts for deletion and insertion (including duplication) variants:

TypeDeletionsInsertions
Recall0.96670.9427
Precision0.98930.9667
f1-score0.97790.9545

With still a little more work we can produce the following depicting the f1-score alongwith counts of SVs in the truthset:

sv_length_f1
Structural variation calling f1-score


Tags

#datasets#human cell-line#R9.4.1#structure variants

Share

Chris Wright

Chris Wright

Senior Director, Customer Workflows

Related Posts

Updated Tumor Normal Pair Benchmark Dataset
March 07, 2024
1 min

Quick Links

TutorialsWorkflowsOpen DataContact

Social Media

© 2020 - 2024 Oxford Nanopore Technologies plc. All rights reserved. Registered Office: Gosling Building, Edmund Halley Road, Oxford Science Park, OX4 4DQ, UK | Registered No. 05386273 | VAT No 336942382. Oxford Nanopore Technologies, the Wheel icon, EPI2ME, Flongle, GridION, Metrichor, MinION, MinIT, MinKNOW, Plongle, PromethION, SmidgION, Ubik and VolTRAX are registered trademarks of Oxford Nanopore Technologies plc in various countries. Oxford Nanopore Technologies products are not intended for use for health assessment or to diagnose, treat, mitigate, cure, or prevent any disease or condition.