# Structural Variation Pipeline Tutorial

This tutorial provides an introduction to the detection and analysis of structural variation (SV) from Oxford Nanopore Technologies whole human genome sequencing experiments. The tutorial walks through and introduces the steps involved in SV detection, filtering and refinement. The workflow introduced uses a human alignment dataset (.bam file) as its starting point.

Computational requirements for this tutorial include:

• Computer running the EPI2ME Labs notebook Server
• At least 16Gb RAM

⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

## Introduction¶

Long read DNA sequences, when mapped to a reference genome, can be used for the discovery and characterisation of structural variation. This is achieved by investigating discordant mapping where regions of a single sequenced DNA molecule map to different regions of the genome. Software packages such as sniffles identify these discordant regions and can be used to call insertions, deletions, duplications, inversion and translocations.

This tutorial is based on the Oxford Nanopore Technologies pipeline-structural-variation software. This analysis pipeline utilises the sniffles software and may be used to identify high confidence genomic insertion, duplication and deletion events.

The tutorial is packaged with example data from the Genome in a Bottle project (GM24385). The analysis considers a collection of sequence reads that map to a 50 Mb region from Chromosome 20.

The workflow covered in this tutorial will:

• Assess a mapping .bam file for qualitative characteristics including mapping quality, number of mapped reads and depth-of-coverage.
• Identify a set of permissive SVs relative to a human reference genome using sniffles.
• Refine the set of SVs to produce a high quality subset in a VCF format file.

The tutorial workflow will answer questions such as:

• How many sequence reads map to the reference genome?
• What is the depth of sequence coverage across the genome?
• How many SVs can be identified, what is their size distribution, and how frequent are the different types of SV?

## Getting started¶

The structural variation tutorial requires an indexed .bam format sequence alignment file from long read whole genome human sequencing data. An example dataset is provided with this workflow.

This tutorial does not introduce mapping sequence reads to a reference genome. This topic is covered in the Cas9 Targeted Sequencing tutorial where mini_align from the Pomoxis software is used to run minimap2. Mapping is also the subject of an upcoming tutorial.

Before anything else we will create and set a working directory:

This tutorial uses software packages that are not included in the default EPI2ME Labs server. To install these we will fetch them from Github and install with the pip pacakge manager.

Please note that the software installed is not persistent and this step will need to be re-run if you stop and restart the EPI2ME Labs server

### Sample data¶

To demonstrate this structural variation workflow a small dataset is provided. This dataset corresponds to sequence reads from the GM12878 human reference material that have been mapped to the hg19 reference genome. This version of the reference genome has been selected to enable the benchmarking of SV-calling performance against the Genome In a Bottle truthset. To facilitate a quick analysis and speedy download the whole dataset has been downsampled to a 50 Mb region of chromosome 20.

The code below will download the sample .bam mapping data and its corresponding index. The wget command is used to download the files.

To execute the commands click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

If you wish to analyse your own data rather than the sample data, you can edit the value .fastq input variable below. To find the correct full path of a directory you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

The location shared with the EPI2ME labs server from your computer will show as /epi2melabs, for example a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.

## Data input¶

Having downloaded the sample data, or locating your own data in the file browser, we need to provide the appropriate file location as input to the notebook.

The form below can also be edited to tune the characteristics of the SV analysis.

• alignment_bam is the sorted and indexed .bam mapping file. ❗IMPORTANT: Alignments must contain supplementary alignments as they are used in this analysis.
• target_bed is a .bed coordinate file that defines the genomic regions of interest. If you do not wish to filter by a region leave this field empty. See the appendix for more details.
• sample_name is an arbitrary name for the analysis; please use appropriate names when running your own samples.
• threads defines the number of compute threads that will be used by the processes that can multi-thread. Please do not use more threads than are available on your computer.
• min_sv_length and max_sv_length defines the length of the shortest and longest allowable structural variations respectively.
• mean_read_mapping_quality is a cutoff for the minimum read mapping quality that will be considered for the calculation of SVs.

### Preliminary Analysis¶

Before calling structural variants in our dataset it is worthwhile to review the key quality characteristics from our starting .bam file.

The seqkit bam method is used to prepare a collection of per read-alignment summary observations. These observations will be presented as figures and tables that describe the distribution of mapping qualities, depths of coverage and summarise the numbers of reads and sequence bases that map to the reference genome.

Using the output from seqkit we can derive some basic statistics concerning the alignments of reads to the reference sequence:

## Calling Structural Variants¶

A first step in calling structural variants is to permissively call all possible structural variants. We will use sniffles for this purpose. In the code block below, sniffles is called with parameters that define the minimum read length, minimum read support, and minimum mapping quality.

• min_read_length is a minimum threshold for read lengths. This is used to reject the shortest mapping sequences and is by default defined as 1000 nucleotides.
• read_support is a minimum threshold for the number of required supporting reads. The default value requires that three of more reads cover an SV.
• min_read_mapping_quality was defined in the configuration form at the top of the analysis section (a value of 20 by default) and is a minimum threshold used to reject reads with lower quality mapping scores.

These parameters are permissive and will produce a large number of candidate SVs. This is intentional and subsequent steps in the workflow will qualitatively filter and refine the SV collection.

### Filtering SVs for depth and length¶

The complete set of SVs from sniffles can be filtered to regions of interest using the .bed file specified in the sections above. This allows for targeted SV analyses in specific regions of interest. At this step we will also run sniffles-edit to ensure the VCF file is compatible with tools such as the Integrative Genome Viewer (IGV).

To avoid false positive calls due to low read coverage we will now assess the read depth across the regions in the dataset using the mosdepth tool:

Using the average read depth we can compute a minimum read depth required to call variants. The code below first attempts to establish a bound using a fraction (min_support_frac) of the average sequencing depth. If this lower that a minimum bound (min_support), then the higher value is chosen.

The original .vcf file produced by sniffles can now be filtered by the depth threshold reported above. At this stage we filter also by the minimum and maximum lengths specified at the start of this tutorial. In addition variant calls are filtered based on strand support and allelic frequencies. The strand support filter uses Fischer's exact test to remove SV calls with extreme strand bias. An example of this would be where SV calls occur in a small number of forward strands but never on the reverse despite the total depth being made up of and equal number of forward and reverse reads. SV calls with less than the given allelic frequency will be filtered out.

## Analysis¶

In this section we will analyse the called and filtered variants. There are many more interesting things which can be learnt from the variant calls, including specific biological questions of interest. Here however we focus on simple, generic properties of the discovered variants.

To start we can generate plots indicating the occurence and density of SVs throughout the genome:

For a more quantitative view of the data, let us examine the length distributions of the called variants separated by type:

As a final summary of the experiment the infographic below displays a small set of headline numbers captured through the course of the analysis. Running the codecell will produce also a standalone HTML report document in the output folder.

## Summary¶

This tutorial has stepped through the calling an summary analysis of structural variants implied by an Oxford Nanopore long read dataset. Using the sniffles package to call variants the types, lengths, and location of variants has been explored.

The analysis presented can be run on any .bam alignment file generated by minimap2 which retains secondary and supplementary alignments.

## Appendix¶

Example code for the preparation of a bed file

In the workflow form at the head of this tutorial a requirement for a BED coordinate file was introduced. An appropriate .bed file may be prepared from a .fasta format reference genome with the following command.

$pip install pyfaidx$ faidx --transform bed test.fasta > test.bed