wf-aav-qc documentation

By EPI2ME Labs
6 min read

AAV QC workflow

Nextflow workflow for AAV vector quality control.

Introduction

This workflow takes reads sequenced from adeno-associated virus (rAAV) vector preps and does some basic quality control checks. The main stages of the workflow are:

  • Masking of transgene cassette variable ITR regions
  • Mapping reads to a combined reference sequence containing host cell reference genome, mask transgene plasmid and other AAV plasmids
  • Identification of reference which each reads maps to
  • Truncation hotspot identification
  • ITR transgene cassette coverage
  • Determination of AAV genome structure types
  • Calling transgene plasmid variants and creation of consensus

Compute requirements

Recommended requirements:

  • CPUs = 8
  • Memory = 32GB

Minimum requirements:

  • CPUs = 4
  • Memory = 16GB

Approximate run time: 15 minutes per sample - 150k reads and 8 cpus

ARM processor support: False

Install and run

These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME Desktop application.

The workflow uses Nextflow to manage compute and software resources, therefore Nextflow will need to be installed before attempting to run the workflow.

The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed. This is controlled by the -profile parameter as exemplified below.

It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.

The following command can be used to obtain the workflow. This will pull the repository in to the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:

nextflow run epi2me-labs/wf-aav-qc --help

To update a workflow to the latest version on the command line use the following command:

nextflow pull epi2me-labs/wf-aav-qc

A demo dataset is provided for testing of the workflow. It can be downloaded and unpacked using the following commands:

wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-aav-qc/wf-aav-qc-demo.tar.gz
tar -xzvf wf-aav-qc-demo.tar.gz

The workflow can then be run with the downloaded demo data using:

nextflow run epi2me-labs/wf-aav-qc \
--fastq 'wf-aav-qc-demo/simulated_reads.fq' \
--itr1_end 156 \
--itr1_start 11 \
--itr2_end 2286 \
--itr2_start 2156 \
--ref_helper 'wf-aav-qc-demo/helper.fasta' \
--ref_host 'wf-aav-qc-demo/cell_line.fasta.gz' \
--ref_rep_cap 'wf-aav-qc-demo/repcap.fasta' \
--ref_transgene_plasmid 'wf-aav-qc-demo/transgene.fasta' \
-profile standard

For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/

This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices using this protocol:

https://community.nanoporetech.com/docs/prepare/library_prep_protocols/ligation-sequencing-gdna-v14-adeno-associated-virus-sequencing/v/aav_9194_v114_reva_20sep2023

Input example

This workflow accepts either FASTQ or BAM files as input.

The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with --sample. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.

(i) (ii) (iii)
input_reads.fastq ─── input_directory ─── input_directory
├── reads0.fastq ├── barcode01
└── reads1.fastq │ ├── reads0.fastq
│ └── reads1.fastq
├── barcode02
│ ├── reads0.fastq
│ ├── reads1.fastq
│ └── reads2.fastq
└── barcode03
└── reads0.fastq

Input parameters

Input Options

Nextflow parameter nameTypeDescriptionHelpDefault
fastqstringFASTQ files to use in the analysis.This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.
bamstringBAM or unaligned BAM (uBAM) files to use in the analysis.This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.
analyse_unclassifiedbooleanAnalyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory.If selected and if the input is a multiplex directory the workflow will also process the unclassified directory.False
itr_fl_thresholdintegerThe maximum number of bases missing from an ITR in order for it to be classed as a full length ITR.For ITR1, this many bases can be missing from the end of the ITR region. For ITR2, this many bases can be missing from the start of the ITR region.100
itr_backbone_thresholdintegerThe maximum number of bases and alignment is allowed to extended outside of the ITR-ITR region for an associated read to not be classed as backbone.Reads mapping to the transgene plasmid sometimes extend beyond the ITRs. This parameter sets a maximum number or bases after which the read is classified as backbone.20
itr1_startintegerThe start position of ITR1.
itr1_endintegerThe end position of ITR2.
itr2_startintegerThe start position of ITR2.
itr2_endintegerThe end position of ITR2.
symmetry_thresholdintegerThe threshold to consider whether the start or end positions on opposite strands are classed as symmetrical or asymmetrical.For certain categories of AAV genome type we want to test whether alignments on both strands are symmetrical or asymmetrical (i.e. whether the start and end positions are approximately the same or not) This parameter sets the threshold for this comparison.10
ref_hoststringThe reference FASTA file for the host organism (.fasta/fasta.gz).
ref_helperstringThe helper plasmid FASTA file.
ref_rep_capstringThe rep/cap plasmid FASTA file.
ref_transgene_plasmidstringThe transgene plasmid FASTA file.

Sample Options

Nextflow parameter nameTypeDescriptionHelpDefault
sample_sheetstringA CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files.The sample sheet is a CSV file with, minimally, columns named barcode and alias. Extra columns are allowed. A type column is required for certain workflows and should have the following values; test_sample, positive_control, negative_control, no_template_control.
samplestringA single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files.

Output Options

Nextflow parameter nameTypeDescriptionHelpDefault
out_dirstringDirectory for output of all workflow results.output
output_genometype_bamsbooleanIf true, output a BAM file per identified AAV genome structure type. Otherwise output a BAM file per sample.Output individual BAM files by the assigned genome type.False

Advanced Options

Nextflow parameter nameTypeDescriptionHelpDefault
override_basecaller_cfgstringOverride auto-detected basecaller model that processed the signal data; used to select an appropriate Medaka model.Per default, the workflow tries to determine the basecall model from the input data. This parameter can be used to override the detected value (or to provide a model name if none was found in the inputs). However, users should only do this if they know for certain which model was used as selecting the wrong option might give sub-optimal results. A list of recent models can be found here: https://github.com/nanoporetech/dorado#DNA-models.

Miscellaneous Options

Nextflow parameter nameTypeDescriptionHelpDefault
threadsintegerMaximum number of CPU threads for a process to consume. Applies to the minimap2 mapping and the AAV structure determination stages.A minimap2 and AAV structure determination process per sample will be will be run. This setting applies a maximum number of threads to be used for each of these.4

Outputs

Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.

TitleFile pathDescriptionPer sample or aggregated
Workflow report./wf-aav-qc-report.htmlReport for all samplesaggregated
Combined reference sequence./combined_reference.fa.gzReference file containing all AAV plasmid and host genome sequences.aggregated
Combined reference sequence index./combined_reference.fa.gz.faiIndex file for combined reference FASTA.aggregated
Compressed combined reference sequence index./combined_reference.fa.gz.gziExtra index file for combined reference FASTA (required because combined reference is bgzip-compressed).aggregated
Per read alignment info./{{ alias }}/{{ alias }}_bam_info.tsvThe result of seqkit bam.per-sample
AAV structure assignment./{{ alias }}/{{ alias }}_aav_per_read_info.tsvAAV per read genome subtypes.per-sample
Transgene plasmid consensus./{{ alias }}/{{ alias }}_transgene_plasmid_consensus.fasta.gzThe transgene plasmid consensus sequence generated by medaka.per-sample
Transgene plasmid variants./{{ alias }}/{{ alias }}_transgene_plasmid_variants.vcf.gzThe transgene plasmid variants file generated by medaka.per-sample
Alignment file./{{ alias }}/tagged_bams/sorted.tagged.bamThe resulting tagged BAM file from mapping reads to the combined reference.per-sample
Alignment index file./{{ alias }}/tagged_bams/sorted.tagged.bam.baiThe index for the resulting tagged BAM file from mapping reads to the combined reference.per-sample
backbone_contamination alignment./{{ alias }}/tagged_bams/backbone_contamination.bamThe resulting tagged BAM file from mapping reads to the combined reference. This file contains alignmnets with the genotype assignment: backbone_contaminationper-sample
backbone_contamination alignment index./{{ alias }}/tagged_bams/backbone_contamination.bam.baiThe resulting tagged BAM index file from mapping reads to the combined reference. This indexes the file containing alignments with the genotype assignment: backbone_contaminationper-sample
full_ssaav alignment./{{ alias }}/tagged_bams/full_ssaav.bamThe resulting tagged BAM file from mapping reads to the combined reference. This file contains alignmnets with the genotype assignment: full_ssaavper-sample
full_ssaav alignment index./{{ alias }}/tagged_bams/full_ssaav.bam.baiThe resulting tagged BAM index file from mapping reads to the combined reference. This indexes the file containing alignments with the genotype assignment: full_ssaavper-sample
partial_ssaav alignment./{{ alias }}/tagged_bams/partial_ssaav.bamThe resulting tagged BAM file from mapping reads to the combined reference. This file contains alignmnets with the genotype assignment: partial_ssaavper-sample
partial_ssaav alignment index./{{ alias }}/tagged_bams/partial_ssaav.bam.baiThe resulting tagged BAM index file from mapping reads to the combined reference. This indexes the file containing alignments with the genotype assignment: partial_ssaavper-sample
full_scaav alignment./{{ alias }}/tagged_bams/full_scaav.bamThe resulting tagged BAM file from mapping reads to the combined reference. This file contains alignmnets with the genotype assignment: full_scaavper-sample
full_scaav alignment index./{{ alias }}/tagged_bams/full_scaav.bam.baiThe resulting tagged BAM index file from mapping reads to the combined reference. This indexes the file containing alignments with the genotype assignment: full_scaavper-sample
itr_region_only alignment./{{ alias }}/tagged_bams/itr_region_only.bamThe resulting tagged BAM file from mapping reads to the combined reference. This file contains alignmnets with the genotype assignment: itr_region_onlyper-sample
itr_region_only alignment index./{{ alias }}/tagged_bams/itr_region_only.bam.baiThe resulting tagged BAM index file from mapping reads to the combined reference. This indexes the file containing alignments with the genotype assignment: itr_region_onlyper-sample
complex alignment./{{ alias }}/tagged_bams/complex.bamThe resulting tagged BAM file from mapping reads to the combined reference. This file contains alignmnets with the genotype assignment: complexper-sample
complex alignment index./{{ alias }}/tagged_bams/complex.bam.baiThe resulting tagged BAM index file from mapping reads to the combined reference. This indexes the file containing alignments with the genotype assignment: complexper-sample
partial_scaav alignment./{{ alias }}/tagged_bams/partial_scaav.bamThe resulting tagged BAM file from mapping reads to the combined reference. This file contains alignmnets with the genotype assignment: partial_scaavper-sample
partial_scaav alignment index./{{ alias }}/tagged_bams/partial_scaav.bam.baiThe resulting tagged BAM index file from mapping reads to the combined reference. This indexes the file containing alignments with the genotype assignment: partial_scaavper-sample
unknown alignment./{{ alias }}/tagged_bams/unknown.bamThe resulting tagged BAM file from mapping reads to the combined reference. This file contains alignmnets with the genotype assignment: unknownper-sample
unknown alignment index./{{ alias }}/tagged_bams/unknown.bam.baiThe resulting tagged BAM index file from mapping reads to the combined reference. This indexes the file containing alignments with the genotype assignment: unknownper-sample
IGV config JSON file./igv.jsonJSON file with IGV config options to be used by the EPI2ME Desktop Application.aggregated

Pipeline overview

The following (Fig.1) is a basic schematic of the workflow:

AAV QC overview
Fig.1 wf-aav-qc workflow

1: Make a combined reference sequence

Reads can originate from the transgene cassette, but can also come from the other plasmids used in the rAAV prep as well as host cell DNA. Therefore, a combined reference is created that contains the following reference sequences:

  • host reference genome
  • Rep-Cap plasmid
  • helper plasmid
  • transgene plasmid (variable ITR regions masked).

The transgene plasmid ITR cassette will naturally exist in four orientations. (termed flip-flip, flip-flop, flop-flop and flop-flip; see Fig.2) This can lead to incorrect mapping of reads. To address this, the variable regions in the transgene cassette are masked. This is done by taking the input transgene plasmid and locating the two ITR regions as defined in the --transgene_annotation file. The C', C, B' and B ITR regions are identified for each ITR. From these regions it can be determined which positions are constant between orientations and which are variable, and will be masked.

Combined reference
Fig.2 Making a combined reference

2: Map to reference and get alignment summaries

The reads are mapped to the combined reference using minimap2 (secondary alignments are excluded). seqkit bam is used to generate alignment summaries that are used in the rest of the workflow.

3: Contamination

Reads that do not map to the transgene expression cassette are classified as contaminants. They can arise from

  • The Rep-Cap or helper plasmids
  • The host expression system
  • None of the above reference sequences. The reads will be classified as Unknown. If there are a large proportion of reads in this category, it may warrant further investigation to identify the source.
    Contamination
    Fig.3 Contamination summary plot

4: Per-base coverage of the transgene cassette

Depth of coverage is generated for the transgene cassette region using samtools depth. A plot of this data is shown which indicates whether sufficient coverage has been achieved across the transgene cassette.

Coverage
Fig.4 ITR-ITR coverage

5: Identification of transgene plasmid variants

Transgene plasmid variants are called using medaka, producing a VCF file that is used to generate a consensus sequence using bcftool concensus The workflow selects the appropriate Medaka models based on the basecaller configuration that was used to process the signal data. By default, the workflow will attempt to determine the basecaller model from the input data. When this fails (or when you wish to override the automatic selection), it can be provided with --override_basecaller_cfg.

6: Identification of truncated regions

The ‘start’ and ‘end’ positions of alignments that map within the transgene cassette are plotted to highlight potential regions where sequences are becoming truncated.

Truncations plot
Fig.5 Plot of start and end positions of reads mapping to transgene cassette.

7: rAAV structure determination

The rAAV transgene expression cassette will ideally exist as full length ITR-flanked regions. However, subgenomic particles will be present in any prep, and it can be useful to know the abundance of the various genome types, which is the aim of this stage of the workflow. Genome types are assigned to each read by applying a series of heuristics that use the characteristics of each alignment from the read.

There are two user-adjustable parameters relevant to this part of the workflow:

  • --itr_fl_threshold (default 100). This parameter specifies the maximum number of bases missing from an ITR in order for it to be classed as a full length ITR.
  • --itr_backbone_threshold (default 20). Reads mapping to the transgene plasmid sometimes extend beyond the ITRs. This parameter sets a maximum number of bases after which the read is classified as backbone.

See the AAV structures section for some representative diagrams of AAV gene structures and how they are classified.

At this stage, the BAM alignment files are tagged with AV:Z which associates each alignment with an assigned genotype, in the format AV:Z:full_ssaav.

If —gtype_bams is set to true, these tagged BAMs are split on this tag into separate BAM files.

Troubleshooting

  • If the workflow fails please run it with the demo data set to ensure the workflow itself is working. This will help us determine if the issue is related to the environment, input parameters or a bug.

  • See how to interpret some common nextflow exit codes here.

  • Please ensure that the ITR-ITR cassette region spans and contiguous range in the transgene plasmid reference sequence. (i.e. ITR1 sequence should precede the ITR2 sequence)

FAQ’s

If your question is not answered here, please report any issues or suggestions on the github issues page or start a discussion on the community.

See the EPI2ME website for lots of other resources and blog posts

AAV structure diagrams

The following diagrams illustrate some of the possible genome type configurations that can be found in an rAAV prep. Different types of subgenomic structures can be formed by numerous different combinations of truncations and the examples are representative examples only.

The first of these is an annotated example describing the components of the image.

Example
Fig.6 Full and partial examples of a rAAV transgene expression cassette. The colour codes are preserved in the following figures

Full ssAAV (single stranded AAV)

Contains a single alignment including both ITRs (up to itr_fl_threshold bases missing)

Example

Genome deletion mutants (GDM)

A subgenomic type of ssAAV where part of the transgene expression cassette, internal to the ITRs, is deleted. This class will have two alignments both on the same strand.

GDM

Incomplete genome (ICG)

Another subgenomic type of ssAAV where one side contains a full ITR (up to itr_fl_threshold bases missing) and the ITR is partial or missing on the other side.

ICG

Full scAAV (self complementary rAAV)

Contains a full or partial ITR (up to itr_fl_threshold bases missing) on both ends of the alignments

Full scAAV

Snapback Genome (SBG)

An scAAV subtype where only the left or right ITR region is retained. Reads of this category will have two alignments on opposite strands. These can be symmetric or asymmetric based the relative starts and end positions at the non-ITR end of the transgene cassette.

SBG

Unresolved SBG

A type of SBG genome (scAAV) in which ITRs present on one strand only or have a single ITR (no mid section) on second strand.

Unresolved SBG

ITR-ITR concatemers

These scAAV subgenomic particles contain only ITR sequence and can be full or partial.

ITR concatemers

Backbone integration

Theis category contains regions from the plasmid backbone, where the start and/or end positions of the alignment are found outside the ITR-ITR region.

Backbone integration

Complex

The complex category contains reads with n alignments >= 3


Share

EPI2ME Labs

EPI2ME Labs

Senior Button Pusher

Quick Links

TutorialsWorkflowsOpen DataContact

Social Media

© 2020 - 2024 Oxford Nanopore Technologies plc. All rights reserved. Registered Office: Gosling Building, Edmund Halley Road, Oxford Science Park, OX4 4DQ, UK | Registered No. 05386273 | VAT No 336942382. Oxford Nanopore Technologies, the Wheel icon, EPI2ME, Flongle, GridION, Metrichor, MinION, MinIT, MinKNOW, Plongle, PromethION, SmidgION, Ubik and VolTRAX are registered trademarks of Oxford Nanopore Technologies plc in various countries. Oxford Nanopore Technologies products are not intended for use for health assessment or to diagnose, treat, mitigate, cure, or prevent any disease or condition.