Structural Variation Pipeline Tutorial

This tutorial provides an introduction to the detection and analysis of structural variation (SV) from Oxford Nanopore Technologies whole human genome sequencing experiments. The tutorial walks through and introduces the steps involved in SV detection, filtering and refinement. The workflow introduced uses a human alignment dataset (.bam file) as its starting point.

Computational requirements for this tutorial include:

⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

Introduction

Long read DNA sequences, when mapped to a reference genome, can be used for the discovery and characterisation of structural variation. This is achieved by investigating discordant mapping where regions of a single sequenced DNA molecule map to different regions of the genome. Software packages such as sniffles identify these discordant regions and can be used to call insertions, deletions, duplications, inversion and translocations.

This tutorial is based on the Oxford Nanopore Technologies pipeline-structural-variation software. This analysis pipeline utilises the sniffles software and may be used to identify high confidence genomic insertion, duplication and deletion events.

The tutorial is packaged with example data from the Genome in a Bottle project (GM24385). The analysis considers a collection of sequence reads that map to a 50 Mb region from Chromosome 20.

The workflow covered in this tutorial will:

The tutorial workflow will answer questions such as:

Getting started

The structural variation tutorial requires an indexed .bam format sequence alignment file from long read whole genome human sequencing data. An example dataset is provided with this workflow.

This tutorial does not introduce mapping sequence reads to a reference genome. This topic is covered in the Cas9 Targeted Sequencing tutorial where mini_align from the Pomoxis software is used to run minimap2. Mapping is also the subject of an upcoming tutorial.

Before anything else we will create and set a working directory:

Install additional software

This tutorial uses software packages that are not included in the default EPI2ME Labs server. To install these we will fetch them from Github and install with the pip pacakge manager.

Please note that the software installed is not persistent and this step will need to be re-run if you stop and restart the EPI2ME Labs server

Sample data

To demonstrate this structural variation workflow a small dataset is provided. This dataset corresponds to sequence reads from the GM12878 human reference material that have been mapped to the hg19 reference genome. This version of the reference genome has been selected to enable the benchmarking of SV-calling performance against the Genome In a Bottle truthset. To facilitate a quick analysis and speedy download the whole dataset has been downsampled to a 50 Mb region of chromosome 20.

The code below will download the sample .bam mapping data and its corresponding index. The wget command is used to download the files.

To execute the commands click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

Using your own data

If you wish to analyse your own data rather than the sample data, you can edit the value .fastq input variable below. To find the correct full path of a directory you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

image.png

The location shared with the EPI2ME labs server from your computer will show as /epi2melabs, for example a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.

Data input

Having downloaded the sample data, or locating your own data in the file browser, we need to provide the appropriate file location as input to the notebook.

The form below can also be edited to tune the characteristics of the SV analysis.

Preliminary Analysis

Before calling structural variants in our dataset it is worthwhile to review the key quality characteristics from our starting .bam file.

The seqkit bam method is used to prepare a collection of per read-alignment summary observations. These observations will be presented as figures and tables that describe the distribution of mapping qualities, depths of coverage and summarise the numbers of reads and sequence bases that map to the reference genome.

Using the output from seqkit we can derive some basic statistics concerning the alignments of reads to the reference sequence:

Calling Structural Variants

A first step in calling structural variants is to permissively call all possible structural variants. We will use sniffles for this purpose. In the code block below, sniffles is called with parameters that define the minimum read length, minimum read support, and minimum mapping quality.

These parameters are permissive and will produce a large number of candidate SVs. This is intentional and subsequent steps in the workflow will qualitatively filter and refine the SV collection.

Filtering SVs for depth and length

The complete set of SVs from sniffles can be filtered to regions of interest using the .bed file specified in the sections above. This allows for targeted SV analyses in specific regions of interest. At this step we will also run sniffles-edit to ensure the VCF file is compatible with tools such as the Integrative Genome Viewer (IGV).

To avoid false positive calls due to low read coverage we will now assess the read depth across the regions in the dataset using the mosdepth tool:

Using the average read depth we can compute a minimum read depth required to call variants. The code below first attempts to establish a bound using a fraction (min_support_frac) of the average sequencing depth. If this lower that a minimum bound (min_support), then the higher value is chosen.

The original .vcf file produced by sniffles can now be filtered by the depth threshold reported above. At this stage we filter also by the minimum and maximum lengths specified at the start of this tutorial. In addition variant calls are filtered based on strand support and allelic frequencies. The strand support filter uses Fischer's exact test to remove SV calls with extreme strand bias. An example of this would be where SV calls occur in a small number of forward strands but never on the reverse despite the total depth being made up of and equal number of forward and reverse reads. SV calls with less than the given allelic frequency will be filtered out.

Analysis

In this section we will analyse the called and filtered variants. There are many more interesting things which can be learnt from the variant calls, including specific biological questions of interest. Here however we focus on simple, generic properties of the discovered variants.

To start we can generate plots indicating the occurence and density of SVs throughout the genome:

For a more quantitative view of the data, let us examine the length distributions of the called variants separated by type:

Headline Numbers

As a final summary of the experiment the infographic below displays a small set of headline numbers captured through the course of the analysis. Running the codecell will produce also a standalone HTML report document in the output folder.

Summary

This tutorial has stepped through the calling an summary analysis of structural variants implied by an Oxford Nanopore long read dataset. Using the sniffles package to call variants the types, lengths, and location of variants has been explored.

The analysis presented can be run on any .bam alignment file generated by minimap2 which retains secondary and supplementary alignments.

Appendix

Example code for the preparation of a bed file

In the workflow form at the head of this tutorial a requirement for a BED coordinate file was introduced. An appropriate .bed file may be prepared from a .fasta format reference genome with the following command.

$ pip install pyfaidx  
$ faidx --transform bed test.fasta > test.bed