# Cas9 Targeted Sequencing Tutorial

#### Expected Duration: 30 minutes

CRISPR associated protein 9 (Cas9) is a protein that is used in genetic engineering applications to induce site-directed double-strand breaks in DNA.

This tutorial is intended as an introduction into the analysis of a Cas9 targeted sequencing experiment with nanopore sequencing. The tutorial will step through and discuss the steps of the workflow starting from raw sequencing data. We will learn how to:

• Align read data to a human reference sequence
• Show the sequencing depth for target regions
• Analyse the proportion of off-target reads

The tutorial makes use of some common bioinformatics including:

Computational requirements for this tutorial include:

• Computer running the EPI2ME Labs notebook Server
• At least 16 Gb RAM

⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

## Introduction¶

The tutorial requires a FASTQ format sequence file and a BED file of target coordinates as input, along with pointers to download locations for the reference genome to be used.

The tutorial aids with the quantification of the non-target depletion and provides information on mapping characteristics that highlight the protocol performance. The figures plotted include depth-of-coverage over the target regions and strand bias over these regions. The location and peaks of coverage and local biases in strandedness may be used to assess the performance of guide-RNA sequences and may highlight guide RNAs that are not performing. A review of likely off-target regions over-represented within the sequence collection may inform of strategies to refine guide-RNA design.

## Data preparation¶

The workflow below accepts a single or multiple .fastq files containing single molecule reads. In addition a reference .fasta file is required and a .bed file describing the target regions.

Before anything else we will create and set a working directory:

### Sample Data¶

To use this tutorial with sample data we can download the files using the linux command wget. We will download three files:

1. A .fastq file containing our sequencing data
2. The human reference genome
3. A .bed file describing the target regions of the genome

First to download the sequencing data we run (again press Play to execute the code):

The data downloaded is laid-out in the form as written by MinKNOW on a sequencing device, only reads passing the quality score filter applied by MinKNOW has been included (again click the cell and press Play):

For our analysis we will also need the human genome reference (note that this download is quite large, it includes both the reference sequence and indexes required for analysis):

The reference contained within the above download is that which can also be obtained here, the reasons for using this file are presented in Heng Li's blog post.

Finally we will need a BED file describing the target regions of the genome.

### Using your own data¶

If you wish to analyse your own data rather than the sample data, you can edit the value .fastq input variable below. To find the correct full path of a directory you can navigate to it in the Files browser to the left-hand side, right-click on the file and select Copy path:

The location shared with the EPI2ME labs server from your computer will show as /epi2melabs, for example a file located at /data/my_gridion_run/fastq_pass on your computer will appear as /epi2melabs/my_gridion_run/fastq_pass when it is the /data folder that is shared.

### Data entry¶

Having downloaded the sample data, or locating your own data in the file browser, we need to provide the filepaths as input to the notebook.

If you want simply to plot all the graphs in this tutorial for your dataset, rather than working through the tutorial, select Run Selected Cell and All Below from the Run menu above after executing the cell below.

## Alignment of reads¶

Our first task in analysing our data is to align the sequence reads to the reference sequence. We do this with the mini_align program from the pomoxis package. This is preinstalled in the EPI2ME Labs notebook server. Note that this command may take a while to run depending on the number of reads in your datasets. With the sample data (8000 reads) and using 4 compute threads (-t 4 in the code), the alignments will take around 5 minutes.

While we are here we will calculate also a table summarising the alignment data, again using a program from pomoxis:

The summary file gives useful information on the alignment of each read to the reference sequence, including: chromosome, start and end coordinates, and the accuracy of the read with respect to the reference. We can plot a histogram of the latter quantity:

The statistics file has many other columns which are interesting to explore:

## On Target Depth¶

In this section we will examine the depth of sequencing for each of the target regions, and count reads aligning to other genomic locations.

For this purpose and throughout the rest of this tutorial we will use the Pyranges library.

### Simple Overlaps¶

As a first look at the data lets tabulate the number of reads overlapping each target. All overlaps are counted here, including partial overlaps.

The table above summaries the number of times a read overlaps one of the targets. We can also think about the inverse problem, how many times the targets intersect the reads; this allows us to calculate approximately numbers of on-target and non-target bases. Non-target bases are defined simply as any base not belonging to a target region; we will break down this category further later in the tutorial.

### Genome tiles¶

In order to calculate quickly and efficiently more detailed statistics (and create informative plots), we will use a genome tiling technique. We will split the reference genome into uniformly sized tiles; for each reference tile we can calculate the overlapping reads. We create a tile for both the + and the - DNA strand. For the reference we construct the tiles using the pysam library to read the reference_genome file:

We will also define a small helper function to annotate genomic regions with additional information:

To proceed we will first use the function defined above to annotate our genome tiles and our reads with their target:

Having prepared these tables we can now query them in various ways. Let's summarise the information we have for each target:

The table details:

• tname --- the target name.
• tsize --- the length of the target (in bases).
• kbases --- approximate number of bases in reads overlapping target.
• median_coverage --- average read depth across target.
• nreads --- number of reads aligning.
• mean_read_length --- average read length of reads aligning.
• mean_accuracy --- average base accuracy of reads aligning.
• strand_bias --- proportional difference of reads aligning to each strand. A value or +1 or -1 indicates complete bias to the foward or reverse strand respectively.

Using the tiling statistics we can make coverage plots for all regions.

## Examining Non-Target Sequencing¶

An important consideration for the Cas9 targeted sequencing protocol is the throughput of sequencing not overlapping the target regions, what we have called "non-target" in the sections above. These effects can be broken down further into three types:

1. Target proximal --- sequencing occurring around the target region.
2. Off-target --- enrichment of loci other than those intended, identified as high coverage regions which are not target regions.
3. Background --- non-specific sequencing, identified as low coverage sequencing throughout the genome.

We will examine these effects in this section. In order to do this we will first identify the background level of sequencing. After this we can find regions which are not target regions but which have an abnormally high level of overlapping reads.

### Calculating the background¶

In order to calculate the background sequencing effect we will first calculate collectively the background and off-target effects. By using the overlaps of reads to the genome tiles calculated previously and subtracting the target regions and proximal regions we obtain the desired result:

Naturally determining the boundary between "off-target" and "background" using the coverage data is a heuristic. To get a qualitative feel for this distintiction we can plot a histogram of the number of genome tiles with defined coverage. For example the plot may hint at some non-target regions with unusally high coverage; often simply large bars at a coverage of 1 and 2 will be observed before a rapid fall off.

The widget below can also be used to choose where the boundary should lie in according to the coverage or these regions:

### Listing off-target regions¶

Finally to summarise the off-target regions, which could be analysed further, we can merge our genome tiles back into contiguous regions:

Off-target regions can arise from Cas9 cuts of the source DNA due to probes with low specificity.