Copy Number Calling Workflow

By Sirisha Hesketh
Published in How Tos
October 05, 2022

Introduction

We are pleased to add to our human analysis workflow repertoire with wf-cnv. This new workflow enables copy number calling from Oxford Nanopore Technologies sequencing data.

Best practices for human copy number calling are actively being investigated by the ONT applications team, and this workflow puts some of that work into something that can be easily used by our community.

wf-cnv also uses our new reporting and plotting package, ezcharts. This uses the Python library dominate and an Apache ECharts API to allow us to make modern, responsive layouts and plots with relative ease.

With the release of wf-cnv we also include a new ideogram plotting component for ezcharts.

Figure 1 - ezcharts ideogram plotting

Pre-requisites

Along with the usual requirements to run our workflows (Nextflow and Docker), you will need a reference genome in FASTA format, which can be downloaded from UCSC using rsync:

  • hg19:
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz .
  • hg38:
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz .

How to run

Example command:

nextflow run epi2me-labs/wf-cnv --fastq <PATH_TO_FASTQS> --fasta <PATH_TO_REFERENCE> --genome <hg19|hg38> --bin_size <BIN_SIZE>

Workflow

CNV calling methods

CNV detection algorithms take several different approaches, including read pair, split read, and assembly-based methods. The algorithm used here, QDNAseq, is based on the commonly used read depth strategy, which correlates the copy number of a region with its depth of coverage: a gain in copy number, for example, produces a higher depth than expected. Typically, the genome is divided into fixed-size bins, and the number of reads within each bin is counted and normalised. In addition, many tools hold an internal ‘blacklist’ of problematic regions, which are excluded to improve variant calling.
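As a rough illustration of the read depth idea (not QDNAseq's actual implementation), the Python sketch below counts read start positions in fixed-size bins, normalises by the median bin count, and reports log2 ratios. The function names, the toy positions, and the pseudocount are all assumptions made for this example.

```python
import math
from collections import Counter
from statistics import median


def bin_counts(read_starts, chrom_length, bin_size):
    """Count reads per fixed-size bin from 0-based read start positions."""
    n_bins = math.ceil(chrom_length / bin_size)
    counts = Counter(pos // bin_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]


def log2_ratios(counts, pseudocount=0.5):
    """Normalise counts by the median bin count and log2-transform.

    The pseudocount (an arbitrary choice here) avoids log2(0) for empty bins.
    """
    med = median(counts)
    if med <= 0:
        raise ValueError("median bin count must be positive")
    return [math.log2((c + pseudocount) / (med + pseudocount)) for c in counts]


# Toy chromosome: 5 bins of 1000 bp; bin 2 carries 1.5x the reads (a gain).
read_starts = []
for b in range(5):
    n_reads = 30 if b == 2 else 20
    read_starts.extend(b * 1000 + i * 10 for i in range(n_reads))

counts = bin_counts(read_starts, chrom_length=5000, bin_size=1000)
ratios = log2_ratios(counts)
for i, (c, r) in enumerate(zip(counts, ratios)):
    print(f"bin {i}: count={c} log2_ratio={r:.2f}")
```

Bins at the median depth come out at a log2 ratio of 0, while the bin with extra reads sits clearly above it, which is the signal a read depth caller looks for.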

Workflow details

QDNAseq is an R package which determines the copy number status of bins, the size of which can be tuned by using the --bin_size parameter at run time. Pre-calculated bin annotations are available for hg19 and hg38 for a range of bin sizes (1, 5, 10, 15, 30, 50, 100, 500, and 1000 kbp).
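Because only the bin sizes listed above have pre-calculated annotations, it can be convenient to check the value before launching a run. The helper below is a hypothetical convenience wrapper, not part of wf-cnv: it validates the genome and bin size against the values named in this post and assembles the command shown in the "How to run" section. Whether the workflow performs its own validation is not covered here.

```python
import shlex

# Values taken from this post; the tuples themselves are an assumption of
# this sketch and are not read from the workflow.
SUPPORTED_BIN_SIZES_KBP = (1, 5, 10, 15, 30, 50, 100, 500, 1000)
SUPPORTED_GENOMES = ("hg19", "hg38")


def build_wf_cnv_command(fastq, fasta, genome, bin_size):
    """Return the wf-cnv command as an argument list (hypothetical helper)."""
    if genome not in SUPPORTED_GENOMES:
        raise ValueError(f"genome must be one of {SUPPORTED_GENOMES}")
    if bin_size not in SUPPORTED_BIN_SIZES_KBP:
        raise ValueError(f"bin_size must be one of {SUPPORTED_BIN_SIZES_KBP}")
    return [
        "nextflow", "run", "epi2me-labs/wf-cnv",
        "--fastq", str(fastq),
        "--fasta", str(fasta),
        "--genome", genome,
        "--bin_size", str(bin_size),
    ]


cmd = build_wf_cnv_command("fastq_dir", "hg38.fa.gz", "hg38", 500)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` if you want to launch the workflow from a script.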

Following alignment of raw reads to a reference, the resulting BAM is used to generate a raw copy number profile using the selected bin size. This is filtered to remove blacklisted bins in problematic genomic regions. The raw profile is further refined by estimating and applying the correction for GC content and mappability, and performing smoothing and normalisation. Segmentation (merging of regions with similar read count to estimate a CNV segment) and CNV calling are carried out using DNAcopy and CGHcall, respectively.
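To give a feel for the filtering and correction steps, the Python sketch below drops blacklisted bins and then applies a deliberately simplified GC correction: each bin's count is divided by the median count of bins with similar GC content. This is a stand-in technique chosen for brevity, not the correction QDNAseq fits (which also accounts for mappability), and the field names, GC stratum width, and toy values are all assumptions.

```python
from collections import defaultdict
from statistics import median


def filter_blacklisted(bins):
    """Keep only bins that are not flagged as blacklisted."""
    return [b for b in bins if not b["blacklisted"]]


def gc_correct(bins, gc_step=5):
    """Divide each count by the median count of bins in the same GC stratum.

    Bins are grouped into strata of `gc_step` percentage points of GC content.
    """
    strata = defaultdict(list)
    for b in bins:
        strata[int(b["gc"] // gc_step)].append(b["count"])
    stratum_median = {k: median(v) for k, v in strata.items()}

    corrected = []
    for b in bins:
        m = stratum_median[int(b["gc"] // gc_step)]
        value = b["count"] / m if m > 0 else float("nan")
        corrected.append({**b, "corrected": value})
    return corrected


# Toy bins: low-GC bins have lower raw counts than high-GC bins, and each
# group contains one bin with 1.5x the reads. The last bin is blacklisted.
bins = [
    {"start": 0,       "gc": 36.0, "count": 80,  "blacklisted": False},
    {"start": 500000,  "gc": 37.5, "count": 80,  "blacklisted": False},
    {"start": 1000000, "gc": 38.0, "count": 120, "blacklisted": False},
    {"start": 1500000, "gc": 51.0, "count": 100, "blacklisted": False},
    {"start": 2000000, "gc": 52.0, "count": 100, "blacklisted": False},
    {"start": 2500000, "gc": 53.5, "count": 150, "blacklisted": False},
    {"start": 3000000, "gc": 41.0, "count": 900, "blacklisted": True},
]

kept = filter_blacklisted(bins)
corrected = gc_correct(kept)
values = [b["corrected"] for b in corrected]
print(values)
```

After correction the two elevated bins reach the same relative value even though their raw counts differed, which is the purpose of removing the GC-dependent bias before segmentation.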

The test samples are cell line samples obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: NA01920, NA03225, and NA03623. They were WGA amplified from genomic DNA and sequenced with either rapid or native barcoding for 180 minutes. These samples demonstrate that large-scale copy number variations can be detected accurately even with low coverage and short Nanopore reads. Read length and bin size both affect the results of QDNAseq analysis. In future versions we will provide preset parameters based on detected read length.

Barcode   | Sample Name | Details        | Genotypic Sex
barcode01 | NA01920     | Trisomy 21     | XY
barcode03 | NA03225     | Chr 7 Deletion | XX
barcode05 | NA03623     | Trisomy 18     | XXX

The FASTQs for the three test samples are available here and can be used with the accompanying sample sheet from here.

Example command with test data:

nextflow run epi2me-labs/wf-cnv --fastq <PATH_TO_DOWNLOADED_FASTQ> --sample_sheet <PATH_TO_DOWNLOADED_SAMPLE_SHEET> --fasta /path/to/hg38.fa.gz --genome hg38 --bin_size 500

We include two styles of plot in the reports for this workflow: an ideoplot that shows the copy number data from QDNAseq overlaid on a representation of human chromosomes, and a plot of log2-transformed copy number counts per bin.

Figure 2 - XY Ideoplot Indicating Trisomy 21

Figure 3 - XY Scatterplot Indicating Trisomy 21
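When reading a log2 plot, it helps to know roughly where a gain or loss should sit. The Python sketch below computes the idealised log2 ratio for a given copy number relative to a diploid baseline, and the reverse mapping. It assumes a pure sample and that the plotted value is the log2 of the ratio to the two-copy expectation; how the workflow scales its plotted values is not stated in this post, and the function names are illustrative.

```python
import math


def expected_log2_ratio(copies, ploidy=2):
    """Idealised log2 ratio for `copies` relative to a `ploidy`-copy baseline."""
    return math.log2(copies / ploidy)


def nearest_copy_number(log2_ratio, ploidy=2):
    """Round a log2 ratio back to the nearest integer copy number."""
    return round(ploidy * 2 ** log2_ratio)


for copies in (1, 2, 3):
    print(copies, round(expected_log2_ratio(copies), 3))

print(nearest_copy_number(0.58))
```

Under these assumptions a trisomy sits near 0.585 and a single-copy loss at -1, so a full-chromosome gain such as trisomy 21 appears as a modest but consistent upward shift across every bin on that chromosome. Real data are noisier, and in an XY sample the single-copy X and Y chromosomes sit below the autosomal baseline.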

Output files

The workflow outputs several files per sample:

  • <SAMPLE_NAME>_wf-cnv-report.html: HTML CNV report containing chromosome copy summary, ideoplot, plot of read counts per bin, links to genes in detected CNVs, and QC data: read length histogram, noise plot (noise as a function of sequence depth) and isobar plot (median read counts per bin shown as a function of GC content and mappability)
  • <SAMPLE_NAME>.stats: Read stats
  • BAM/<SAMPLE_NAME>.bam: Alignment of reads to reference
  • BAM/<SAMPLE_NAME>.bam.bai: BAM index
  • qdna_seq/<SAMPLE_NAME>_plots.pdf: QDNAseq-generated plots
  • qdna_seq/<SAMPLE_NAME>_raw_bins.bed: BED file of raw read counts per bin
  • qdna_seq/<SAMPLE_NAME>_bins.bed: BED file of corrected, normalised, and smoothed read counts per bin
  • qdna_seq/<SAMPLE_NAME>_calls.vcf: VCF file of CNV calls
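The per-bin BED files lend themselves to downstream scripting. The Python sketch below reads such a file into a list of tuples. The column layout (chromosome, start, end, name, score, strand, optionally preceded by a track header line) is an assumption based on the general BED convention; check it against your own output before relying on it. The example rows are made up for illustration.

```python
import io


def read_bins_bed(handle):
    """Parse BED-like lines into (chrom, start, end, score) tuples.

    Assumes the per-bin value is in the fifth column; a missing or
    non-numeric value (e.g. "NA") is returned as None.
    """
    bins = []
    for line in handle:
        line = line.strip()
        # Skip blank lines and BED header/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        try:
            score = float(fields[4])
        except (IndexError, ValueError):
            score = None
        bins.append((chrom, start, end, score))
    return bins


# Made-up example content standing in for a <SAMPLE_NAME>_bins.bed file.
example = io.StringIO(
    'track name="sample" description="toy example"\n'
    "chr21\t0\t500000\tchr21:1-500000\t0.58\t+\n"
    "chr21\t500000\t1000000\tchr21:500001-1000000\tNA\t+\n"
)
bins = read_bins_bed(example)
print(bins)
```

To read a real file, replace the `io.StringIO` object with `open(path)`.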

Feedback

We welcome feedback on this, and any of our workflows, either in the nanopore community or as GitHub issues on the workflow repository.

Reference

  • Scheinin I, Sie D, Bengtsson H, van de Wiel MA, Olshen AB, van Thuijl HF, van Essen HF, Eijk PP, Rustenburg F, Meijer GA, Reijneveld JC, Wesseling P, Pinkel D, Albertson DG, Ylstra B. DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. Genome Res. 2014 Dec;24(12):2022-32. doi: 10.1101/gr.175141.114. Epub 2014 Sep 18.

Sirisha Hesketh

Clinical Bioinformatician
