Structural Variation Pipeline Tutorial

This tutorial provides an introduction to the detection and analysis of structural variation (SV) from Oxford Nanopore Technologies whole human genome sequencing experiments. It walks through the steps involved in SV detection, filtering, and refinement. The workflow introduced uses a human reads dataset (.fastq file) as its starting point.

Computational requirements for this tutorial include:

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select Edit > Clear all outputs before using the notebook to analyse your own data.

Introduction

Long read DNA sequences, when mapped to a reference genome, can be used for the discovery and characterisation of structural variation. This is achieved by investigating discordant mapping where regions of a single sequenced DNA molecule map to different regions of the genome. Software packages such as cuteSV identify these discordant regions and can be used to call insertions, deletions, duplications, and more.

This tutorial is based on the Oxford Nanopore Technologies wf-human-sv software, an automated end-to-end workflow available on GitHub. The analysis pipeline utilises the cuteSV software and may be used to identify high-confidence genomic insertion and deletion events in human datasets.

The tutorial is packaged with example data from the Genome in a Bottle project (GM24385). The analysis considers a collection of sequence reads that map to a 50 Mb region from Chromosome 20.

The workflow covered in this tutorial will:

The tutorial workflow will answer questions such as:

Getting started

The structural variation tutorial requires long read whole genome human sequencing data in .fastq format. An example dataset is provided with this workflow.

Tip: To execute the commands click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

Before anything else we will create and set a working directory:
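A minimal sketch of this step in Python is shown below; the directory name `sv_tutorial` is an illustrative choice, not one mandated by the workflow:

```python
import os

# Create a working directory for the tutorial outputs and move into it.
# The name "sv_tutorial" is an arbitrary example.
working_dir = "sv_tutorial"
os.makedirs(working_dir, exist_ok=True)
os.chdir(working_dir)
print(os.getcwd())
```

Using `exist_ok=True` means the cell can be re-run safely without raising an error if the directory already exists.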

Install additional software

This tutorial uses software packages that are not included in the default EPI2ME Labs server.

Please note that the software installed is not persistent, and this step will need to be re-run if you stop and restart the EPI2ME Labs server.

Sample data

To demonstrate this structural variation workflow a small dataset is provided. This dataset corresponds to sequence reads from the GM24385 human reference material that map to the hg19 reference genome. To facilitate a quick analysis and speedy download, the whole dataset has been downsampled to a 50 Mb region of chromosome 20.

The code below will download the sample .fastq reads, the reference genome in .fasta format and a tabular .bed file for targeted calling. The wget command is used to download the files.

Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the value of the .fastq input variable below. To find the correct full path of a directory, navigate to it in the Files browser on the left-hand side, right-click on the file and select Copy path.

The location shared with the EPI2ME Labs server from your computer will show as /epi2melabs. For example, if the /data folder is shared, a file located at /data/my_gridion_run/fastq_pass on your computer will appear at /epi2melabs/my_gridion_run/fastq_pass.

Data input

Having downloaded the sample data, or located your own data in the file browser, we need to provide the appropriate file locations as input to the notebook.

The main inputs are:

Finally, we can also tune the characteristics of the SV analysis using the following inputs:

The form below can also be edited to reflect different values for the above options, but sensible defaults are provided. Ensure that you click the form Enter button to set the variables for the rest of the workflow.

Inspecting input data

Prior to beginning the analysis proper, it is always prudent to inspect the data we are going to use in our workflow, specifically casting our eye over read lengths and quality scores. We want to ascertain that these distributions look as we expect them to: that the read lengths are sufficient for SV calling (too short and the SV calling will be suboptimal), and that the quality of the reads is not significantly lower than expected (lower quality means a higher error rate within the reads, leading to a drop in call accuracy).

To perform these checks we will run seqkit stats and fastcat to generate a range of summary information, which can then be displayed in tabular format or plotted as distributions.
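The per-read summaries these tools produce can be sketched in a few lines of Python. The in-line FASTQ records below are invented for illustration; real data would come from the downloaded .fastq file:

```python
# Compute the read length and mean per-base quality for each FASTQ record,
# mimicking the per-read summaries produced by tools like fastcat.
fastq = """@read1
ACGTACGTACGTACGT
+
IIIIIIIIIIIIIIII
@read2
ACGTACGT
+
####$$$$
"""

records = fastq.strip().split("\n")
# FASTQ records are four lines each: name, sequence, separator, qualities.
for name, seq, _, qual in zip(*[iter(records)] * 4):
    length = len(seq)
    # Phred quality score = ASCII code of the character minus 33.
    mean_q = sum(ord(c) - 33 for c in qual) / length
    print(name, length, round(mean_q, 1))
```

Plotting these two values across all reads gives the length and quality distributions discussed above.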

Read Alignment

cuteSV is a reference-alignment-based SV caller, meaning the first step in using it to call SVs is to align the reads to the reference genome using an aligner such as minimap2 or LRA. LRA is the aligner used in the wf-human-sv workflow because it has been found to provide better results than minimap2 in the context of human SV detection, so we'll take the same approach here, although minimap2 is also a viable option.

When aligning reads to a reference, we tend to follow these steps:

For a more complete understanding of what these terms mean, see the Introduction to SAM/BAM tutorial provided within EPI2ME Labs.

Mapping Analysis

Before calling structural variants in our dataset it is worthwhile to review the key quality characteristics from our .bam file.

The seqkit bam command is used to prepare a collection of per-alignment summary observations. These observations will be presented as figures and tables that describe the distribution of mapping qualities and depths of coverage, and summarise the numbers of reads and sequence bases that map to the reference genome.

Using the output from seqkit we can derive some basic statistics concerning the alignments of reads to the reference sequence:
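As a sketch of the kind of statistics derived here, the snippet below summarises a toy per-alignment table in the spirit of `seqkit bam` output; the records and values are invented for illustration:

```python
# Toy per-alignment records: (read name, mapping quality, aligned bases).
# All values are invented to illustrate the summary calculations.
alignments = [
    ("read1", 60, 12000),
    ("read2", 60, 8500),
    ("read3", 0, 300),    # a low-quality alignment
    ("read4", 45, 15000),
]

n = len(alignments)
mean_mapq = sum(mapq for _, mapq, _ in alignments) / n
total_bases = sum(bases for _, _, bases in alignments)

print(f"alignments:    {n}")
print(f"mean MAPQ:     {mean_mapq:.2f}")
print(f"aligned bases: {total_bases}")
```

The real workflow computes the same kinds of aggregates (counts, mean mapping quality, total aligned bases) over the full .bam file.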

Calling Structural Variants

We will use cuteSV to call structural variants. In the code block below, cuteSV is called with parameters that define the acceptable size range for a call, minimum read length, minimum read support, and minimum mapping quality.

These parameters are permissive and will produce a large number of candidate SVs. This is intentional and subsequent steps in the workflow will qualitatively filter and refine the SV collection.

Filtering SVs by type, region, depth and frequency

The complete set of SVs from cuteSV can be filtered to regions of interest using the .bed file specified in the sections above, allowing for targeted downstream analysis.

To avoid false positive calls due to low read coverage we will now assess the read depth across the regions in the dataset using the mosdepth tool:

Using the average read depth we can compute a minimum read depth required to call variants. The code below first attempts to establish a bound using the average sequencing depth. If this is lower than a minimum bound (min_support), then the higher value is chosen.
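The threshold logic described above can be sketched as follows. The fraction of the mean depth used as the depth-derived bound (one third here) and the example numbers are assumptions for illustration only:

```python
def depth_threshold(mean_depth, min_support, fraction=1 / 3):
    """Return the read-depth threshold for calling variants.

    A bound is derived from the average sequencing depth (assumed here
    to be a fixed fraction of it); if that bound is lower than
    min_support, the higher value is chosen.
    """
    depth_bound = round(mean_depth * fraction)
    return max(depth_bound, min_support)

print(depth_threshold(mean_depth=30, min_support=3))  # depth-derived bound wins
print(depth_threshold(mean_depth=6, min_support=3))   # min_support wins
```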

The original .vcf file produced by cuteSV can now be filtered by the depth threshold reported above, as well as by the regions given in the .bed file, and by type of SV (insertions and deletions by default).

Ensure that you click the form Enter button on the form below to perform the filtering step.
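The three filters can be illustrated with a minimal Python sketch. The call records below are invented, and the region, kept SV types, and depth cut-off are example values rather than the workflow's defaults:

```python
# Minimal illustration of filtering SV calls by region, type and depth.
# Each record: (chrom, pos, svtype, depth). All values are invented.
calls = [
    ("chr20", 100000, "INS", 25),
    ("chr20", 200000, "DEL", 4),    # fails the depth filter
    ("chr20", 300000, "DUP", 30),   # fails the type filter
    ("chr21", 400000, "DEL", 28),   # fails the region filter
]

region = ("chr20", 0, 350000)       # chrom, start, end (half-open)
keep_types = {"INS", "DEL"}
min_depth = 10

def passes(call):
    chrom, pos, svtype, depth = call
    in_region = chrom == region[0] and region[1] <= pos < region[2]
    return in_region and svtype in keep_types and depth >= min_depth

filtered = [c for c in calls if passes(c)]
print(filtered)
```

Only the chr20 insertion passes all three filters; real filtering operates on VCF records, where the SV type and depth are carried in INFO/FORMAT fields.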

Analysis

In this section we will analyse the called and filtered variants. There are many more interesting things which can be learnt from the variant calls, including specific biological questions of interest. Here however we focus on simple, generic properties of the discovered variants.

To start, we can generate plots indicating the occurrence and density of SVs throughout the genome:

For a more quantitative view of the data, let us examine the length distributions of the called variants separated by type:
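The grouping behind such a plot can be sketched as below; the (type, length) pairs are invented stand-ins for the filtered calls:

```python
from collections import defaultdict
from statistics import median

# Invented (svtype, length) pairs standing in for filtered SV calls.
svs = [("INS", 60), ("INS", 320), ("INS", 5000),
       ("DEL", 75), ("DEL", 410)]

# Group call lengths by SV type, then summarise each group.
lengths = defaultdict(list)
for svtype, length in svs:
    lengths[svtype].append(length)

for svtype, vals in sorted(lengths.items()):
    print(svtype, "n =", len(vals), "median length =", median(vals))
```

In practice the per-type length lists would feed a histogram or density plot rather than a single median.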

Headline Numbers

As a final summary of the experiment, the infographic below displays a small set of headline numbers captured through the course of the analysis. Running the code cell will also produce a standalone HTML report document in the output folder.

Summary

This tutorial has stepped through the calling and summary analysis of structural variants from an Oxford Nanopore long-read dataset. Using the cuteSV package to call variants, the types, lengths, and locations of the variants have been explored.

The analysis presented can also be run on any .bam alignment file generated by LRA that retains supplementary alignments.

Appendix

Example code for the preparation of a bed file

In the workflow form at the head of this tutorial a requirement for a BED coordinate file was introduced. An appropriate .bed file may be prepared from a .fasta format reference genome with the following command.

```shell
$ pip install pyfaidx
$ faidx --transform bed test.fasta > test.bed
```
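The same transformation can be sketched in pure Python without pyfaidx. The tiny in-line FASTA below is an invented example:

```python
def fasta_to_bed(fasta_text):
    """Emit one (name, 0, length) BED interval per FASTA sequence."""
    intervals, name, length = [], None, 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if name is not None:
                intervals.append((name, 0, length))
            # BED uses only the sequence name, not the description.
            name, length = line[1:].split()[0], 0
        else:
            length += len(line.strip())
    if name is not None:
        intervals.append((name, 0, length))
    return intervals

example = ">chr20 test\nACGTACGT\nACGT\n>chr21\nACGT\n"
for name, start, end in fasta_to_bed(example):
    print(f"{name}\t{start}\t{end}")
```

Each output line is a tab-separated BED interval spanning the full length of one reference sequence, matching what `faidx --transform bed` produces.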