# Structural Variation Pipeline Tutorial

This tutorial provides an introduction to the detection and analysis of structural variation (SV) from Oxford Nanopore Technologies whole human genome sequencing experiments. It walks through the steps involved in SV detection, filtering, and refinement, using a human read dataset (.fastq file) as its starting point.

Computational requirements for this tutorial include:

• Computer running the EPI2ME Labs notebook server
• At least 16 GB of RAM

⚠️ Warning: This notebook has been saved with its outputs for demonstration purposes. It is recommended to select Edit > Clear all outputs before using the notebook to analyse your own data.

## Introduction

Long read DNA sequences, when mapped to a reference genome, can be used for the discovery and characterisation of structural variation. This is achieved by investigating discordant mapping where regions of a single sequenced DNA molecule map to different regions of the genome. Software packages such as cuteSV identify these discordant regions and can be used to call insertions, deletions, duplications, and more.

This tutorial is based on the Oxford Nanopore Technologies wf-human-sv software, an automated end-to-end workflow available on GitHub. This analysis pipeline utilises the cuteSV software and may be used to identify high-confidence genomic insertion and deletion events in human datasets.

The tutorial is packaged with example data from the Genome in a Bottle project (GM24385). The analysis considers a collection of sequence reads that map to a 50 Mb region from Chromosome 20.

The workflow covered in this tutorial will:

• Assess some whole genome human sequencing data in .fastq format in terms of read length and quality.
• Align our sequencing data to the reference genome using LRA.
• Assess the resulting .bam file for qualitative characteristics including mapping quality, number of mapped reads and depth-of-coverage.
• Identify a set of SVs relative to a human reference genome using cuteSV.
• Refine the SVs to produce a high quality subset in a .vcf format file.

The tutorial workflow will answer questions such as:

• How many sequence reads map to the reference genome?
• What is the depth of sequence coverage across the genome?
• How many SVs can be identified, what is their size distribution, and how frequent are the different types of SV?

## Getting started

The structural variation tutorial requires long read whole genome human sequencing data in .fastq format. An example dataset is provided with this workflow.

Tip: To execute the commands click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

Before anything else we will create and set a working directory:
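The notebook's own code cell is not reproduced here; a minimal sketch of this step, assuming a directory name of `sv_tutorial`, might look like:

```python
# Create a working directory for the tutorial and move into it.
# The directory name "sv_tutorial" is an assumption, not the notebook's
# actual value -- change it to suit your setup.
import os

working_dir = "sv_tutorial"
os.makedirs(working_dir, exist_ok=True)  # exist_ok makes the cell safe to re-run
os.chdir(working_dir)
print("Working directory:", os.getcwd())
```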

This tutorial uses software packages that are not included in the default EPI2ME Labs server.

Please note that the installed software is not persistent, and this step will need to be re-run if you stop and restart the EPI2ME Labs server.

### Sample data

To demonstrate this structural variation workflow a small dataset is provided. This dataset corresponds to sequence reads from the GM12878 human reference material that map to the hg19 reference genome. To facilitate a quick analysis and speedy download the whole dataset has been downsampled to a 50 Mb region of chromosome 20.

The code below will download the sample .fastq reads, the reference genome in .fasta format and a tabular .bed file for targeted calling. The wget command is used to download the files.

If you wish to analyse your own data rather than the sample data, you can edit the value of the fastq input variable below. To find the correct full path of a directory, navigate to it in the Files browser on the left-hand side, right-click on the file and select Copy path:

The location shared from your computer with the EPI2ME Labs server will show as /epi2melabs; for example, when the /data folder is shared, a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass.

## Data input

Having downloaded the sample data, or located your own data in the file browser, we need to provide the appropriate file locations as input to the notebook.

The main inputs are:

• fastq is the set of reads in .fastq format (gzip allowed).
• reference is the reference genome file in .fasta format (gzip allowed).
• target_bed is a .bed coordinate file that defines the genomic regions of interest. If you do not wish to filter by a region leave this field empty. See the appendix for more details.
• threads defines the number of compute threads that will be used by the processes that can multi-thread. Please do not use more threads than are available on your computer.

Finally, we can also tune the characteristics of the SV analysis using the following inputs:

• min_sv_length and max_sv_length define the lengths of the shortest and longest allowable structural variants, respectively.
• min_read_length sets the minimum length of reads to be considered when discovering SVs.
• min_read_support determines how many reads at minimum must be in support of a potential SV for it to be called.
• min_read_mapping_quality sets the minimum average mapping quality of reads to be considered when discovering SVs.

The form below can also be edited to reflect different values for the above options, but sensible defaults are provided. Ensure that you click the form Enter button to set the variables for the rest of the workflow.
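As a sketch, the inputs above can be collected into a single dictionary; the numeric defaults are those quoted later in this tutorial, and the file paths are placeholder assumptions:

```python
# Workflow inputs gathered in one place. File paths are placeholders;
# the numeric defaults are the ones quoted in this tutorial.
params = {
    "fastq": "sample.fastq.gz",          # input reads (placeholder path)
    "reference": "reference.fasta.gz",   # reference genome (placeholder path)
    "target_bed": "",                    # empty string: no region filtering
    "threads": 4,
    "min_sv_length": 30,
    "max_sv_length": 10000,
    "min_read_length": 1000,
    "min_read_support": 3,
    "min_read_mapping_quality": 20,
}
print(params)
```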

### Inspecting input data

Prior to beginning the analysis proper, it is always prudent to inspect the data we are going to use in our workflow, specifically casting our eye over read lengths and quality scores. We want to ascertain that these distributions look as we expect them to: that the read lengths are sufficient for SV calling (too short, and the SV calling will be suboptimal), and that the quality of the reads is not significantly lower than expected (lower quality means a higher error rate within the reads, leading to a drop in call accuracy).

To perform these checks we will run seqkit stats and fastcat to generate a range of summary information, which can then be displayed in tabular format or plotted as distributions.
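To illustrate the kind of metrics these tools report, the sketch below computes read lengths and mean per-read quality scores in plain Python on a tiny in-memory FASTQ record set (the records themselves are invented for illustration):

```python
# Tiny invented FASTQ records: (header, sequence, separator, quality string).
records = [
    ("@read1", "ACGT" * 500, "+", "I" * 2000),  # 2,000 bases at Q40
    ("@read2", "ACGT" * 250, "+", "5" * 1000),  # 1,000 bases at Q20
]

read_lengths = [len(seq) for _, seq, _, _ in records]
# Arithmetic mean of the Phred scores -- a simplification of the
# error-probability-weighted mean that dedicated tools report.
mean_quals = [
    sum(ord(c) - 33 for c in qual) / len(qual) for _, _, _, qual in records
]

print("read count:", len(records))
print("read lengths:", read_lengths)
print("mean per-read quality:", mean_quals)
```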

cuteSV is a reference-alignment-based SV caller, meaning the first step in using it to call SVs is to align the reads to the reference genome with an aligner such as minimap2 or lra. lra is the aligner used in the wf-human-sv workflow because it has been found to provide better results than minimap2 in the context of human SV detection, so we will take the same approach here, although minimap2 is also a viable option.

When aligning reads to a reference, we tend to follow these steps:

• Index the reference genome
• Align the reads to the reference genome
• Sort and index the output .sam file from alignment to create a .bam file
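These three steps can be sketched as shell command strings; the lra flags shown (-ONT preset, -p s for SAM output) are taken from lra's help text to the best of our knowledge, so check them against your installed version:

```python
# Build the index/align/sort-index commands as strings. File names and
# thread count are placeholders.
reference = "reference.fasta"
fastq = "sample.fastq"
threads = 4

commands = [
    f"lra index -ONT {reference}",
    f"lra align -ONT {reference} {fastq} -t {threads} -p s"
    f" | samtools sort -@ {threads} -o aligned.bam",
    "samtools index aligned.bam",
]
for cmd in commands:
    print(cmd)
```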

For a more complete understanding of what these terms mean, see the Introduction to SAM/BAM tutorial provided within EPI2ME Labs.

### Mapping Analysis

Before calling structural variants in our dataset it is worthwhile to review the key quality characteristics from our .bam file.

The seqkit bam method is used to prepare a collection of per read-alignment summary observations. These observations will be presented as figures and tables that describe the distribution of mapping qualities, depths of coverage and summarise the numbers of reads and sequence bases that map to the reference genome.

Using the output from seqkit we can derive some basic statistics concerning the alignments of reads to the reference sequence:
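As an illustration (not the notebook's actual code), the sketch below derives such statistics from a simplified per-alignment table of the kind seqkit bam emits; the records and column layout are invented:

```python
import statistics

# Simplified per-alignment records: (read_id, is_primary, mapq, aligned_bases).
alignments = [
    ("read1", True, 60, 1950),
    ("read2", True, 20, 980),
    ("read3", False, 0, 400),  # secondary/supplementary alignment
]

primary = [a for a in alignments if a[1]]           # count each read once
mapped_reads = len(primary)
total_bases = sum(a[3] for a in primary)            # bases mapped to the reference
mean_mapq = statistics.mean(a[2] for a in primary)

print(f"mapped reads: {mapped_reads}, mapped bases: {total_bases}, mean MAPQ: {mean_mapq}")
```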

## Calling Structural Variants

We will use cuteSV to call structural variants. In the code block below, cuteSV is called with parameters that define the acceptable size range for a call, minimum read length, minimum read support, and minimum mapping quality.

• min_sv_length is a minimum length threshold a call must have, which is set to 30 nucleotides by default.
• max_sv_length is a maximum length threshold above which we will not consider calls, set to 10000 by default.
• min_read_length is a minimum threshold for read lengths. This is used to reject the shortest mapping sequences and is by default defined as 1000 nucleotides.
• min_read_support is a minimum threshold for the number of required supporting reads. The default value requires that three or more reads cover an SV.
• min_read_mapping_quality was defined in the configuration form at the top of the analysis section (a value of 20 by default) and is a minimum threshold used to reject reads with lower quality mapping scores.

These parameters are permissive and will produce a large number of candidate SVs. This is intentional and subsequent steps in the workflow will qualitatively filter and refine the SV collection.
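A sketch of how such a cuteSV invocation can be assembled from the form parameters is shown below; the flag names (--min_size, --max_size, --min_read_len, --min_support, --min_mapq) follow cuteSV's help text, but verify them with cuteSV --help for your installed version:

```python
# Assemble the cuteSV argument list from the form parameters.
# Input/output file names are placeholders.
params = {
    "min_sv_length": 30,
    "max_sv_length": 10000,
    "min_read_length": 1000,
    "min_read_support": 3,
    "min_read_mapping_quality": 20,
}

cmd = [
    "cuteSV",
    "--min_size", str(params["min_sv_length"]),
    "--max_size", str(params["max_sv_length"]),
    "--min_read_len", str(params["min_read_length"]),
    "--min_support", str(params["min_read_support"]),
    "--min_mapq", str(params["min_read_mapping_quality"]),
    "aligned.bam", "reference.fasta", "output.vcf", "workdir",
]
print(" ".join(cmd))
```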

### Filtering SVs by type, region, depth and frequency

The complete set of SVs from cuteSV can be filtered to regions of interest using the .bed file specified in the sections above, allowing for targeted downstream analysis.

To avoid false positive calls due to low read coverage we will now assess the read depth across the regions in the dataset using the mosdepth tool:

Using the average read depth we can compute a minimum read depth required to call variants. The code below first attempts to establish a bound using the average sequencing depth. If this is lower than a minimum bound (min_support), the higher value is chosen.
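The threshold logic can be sketched as below; the choice of one third of the mean depth as the depth-derived bound is an assumption for illustration, not necessarily the workflow's exact formula:

```python
def depth_threshold(mean_depth, min_support=3, divisor=3):
    """Minimum read depth required to call a variant: a bound derived
    from the mean sequencing depth, floored at min_support."""
    depth_bound = int(mean_depth / divisor)  # divisor of 3 is an assumption
    return max(depth_bound, min_support)

print(depth_threshold(30))  # deep coverage: bound comes from the mean depth
print(depth_threshold(6))   # shallow coverage: falls back to min_support
```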

The original .vcf file produced by cuteSV can now be filtered by the depth threshold reported above, as well as by the region given with the regions bedfile, and by type of SV (insertions and deletions by default).

Ensure that you click the form Enter button on the form below to perform the filtering step.
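The filtering logic itself can be sketched as follows; the call records are invented stand-ins for parsed VCF fields:

```python
# Keep insertion/deletion calls that fall inside the target region and
# meet the depth threshold. All values here are illustrative.
target = ("chr20", 1_000_000, 2_000_000)  # (chrom, start, end) from the BED file
min_depth = 5
keep_types = {"INS", "DEL"}

calls = [
    {"chrom": "chr20", "pos": 1_200_000, "svtype": "INS", "depth": 12},
    {"chrom": "chr20", "pos": 1_500_000, "svtype": "DUP", "depth": 20},  # wrong type
    {"chrom": "chr20", "pos": 3_000_000, "svtype": "DEL", "depth": 15},  # outside region
    {"chrom": "chr20", "pos": 1_800_000, "svtype": "DEL", "depth": 2},   # too shallow
]

filtered = [
    c for c in calls
    if c["svtype"] in keep_types
    and c["depth"] >= min_depth
    and c["chrom"] == target[0]
    and target[1] <= c["pos"] < target[2]
]
print(filtered)
```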

## Analysis

In this section we will analyse the called and filtered variants. There are many more interesting things which can be learnt from the variant calls, including specific biological questions of interest. Here however we focus on simple, generic properties of the discovered variants.

To start, we can generate plots indicating the occurrence and density of SVs throughout the genome:
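The underlying density computation amounts to counting calls per fixed-size genomic bin; the positions and 10 Mb bin size below are illustrative:

```python
from collections import Counter

bin_size = 10_000_000  # 10 Mb bins (an illustrative choice)
positions = [1_200_000, 1_800_000, 12_500_000, 14_000_000, 14_200_000]

# Map each SV position to its bin index and tally calls per bin.
density = Counter(pos // bin_size for pos in positions)
for bin_idx, count in sorted(density.items()):
    print(f"{bin_idx * bin_size:,}-{(bin_idx + 1) * bin_size:,}: {count} SVs")
```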

For a more quantitative view of the data, let us examine the length distributions of the called variants separated by type:
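A sketch of that per-type length summary, on invented records (in a VCF, deletions carry a negative SVLEN, hence the absolute value):

```python
import statistics

# Invented (svtype, SVLEN) pairs standing in for parsed VCF records.
svs = [("INS", 45), ("INS", 310), ("DEL", -60), ("DEL", -1200), ("INS", 75)]

# Group absolute lengths by SV type.
by_type = {}
for svtype, svlen in svs:
    by_type.setdefault(svtype, []).append(abs(svlen))

for svtype, lengths in sorted(by_type.items()):
    print(svtype, "count:", len(lengths), "median length:", statistics.median(lengths))
```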

As a final summary of the experiment, the infographic below displays a small set of headline numbers captured through the course of the analysis. Running the code cell will also produce a standalone HTML report document in the output folder.

## Summary

This tutorial has stepped through the calling and summary analysis of structural variants implied by an Oxford Nanopore long-read dataset. Using the cuteSV package to call variants, the types, lengths, and locations of the variants have been explored.

The analysis presented can also be run on any .bam alignment file generated by lra which retains supplementary alignments.

## Appendix

Example code for the preparation of a bed file

In the workflow form at the head of this tutorial a requirement for a BED coordinate file was introduced. An appropriate .bed file may be prepared from a .fasta format reference genome with the following command.

```shell
$ pip install pyfaidx
$ faidx --transform bed test.fasta > test.bed
```
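For reference, a dependency-free Python equivalent of the faidx command above: emit one BED line (name, 0, length) per contig. The inline FASTA string is a toy example:

```python
fasta = """>chr20
ACGTACGTAC
GTACGT
>chrM
ACGT
"""

# Sum sequence lengths per contig, then write whole-contig BED intervals.
lengths, name = {}, None
for line in fasta.splitlines():
    if line.startswith(">"):
        name = line[1:].split()[0]   # contig name is the first word
        lengths[name] = 0
    elif name:
        lengths[name] += len(line.strip())

bed_lines = [f"{contig}\t0\t{length}" for contig, length in lengths.items()]
print("\n".join(bed_lines))
```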