De novo assembly of plasmid sequences, designed for verifying the results of molecular cloning experiments.
Among other uses, this workflow can assess the success of a molecular cloning experiment, determining whether one DNA sequence has been correctly inserted into another as the experimentalist intended.
In brief, this workflow will perform the following:
Recommended requirements:
Minimum requirements:
Approximate run time: 6 minutes per sample for 10,000 reads
ARM processor support: True
These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME Desktop application.
The workflow uses Nextflow to manage compute and software resources, therefore Nextflow will need to be installed before attempting to run the workflow.
The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed. This is controlled by the -profile parameter as exemplified below.
It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.
The following command can be used to obtain the workflow. This will pull the repository in to the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:
nextflow run epi2me-labs/wf-clone-validation --help
To update a workflow to the latest version on the command line use the following command:
nextflow pull epi2me-labs/wf-clone-validation
A demo dataset is provided for testing of the workflow. It can be downloaded and unpacked using the following commands:
wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-clone-validation/wf-clone-validation-demo.tar.gz
tar -xzvf wf-clone-validation-demo.tar.gz
The workflow can then be run with the downloaded demo data using:
nextflow run epi2me-labs/wf-clone-validation \
    --fastq 'wf-clone-validation-demo/test' \
    --primers 'wf-clone-validation-demo/primers.tsv' \
    -profile standard
For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/
This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices using this protocol:
This workflow accepts either FASTQ or BAM files as input.
The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with --sample. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes, and a sample sheet can be provided with --sample_sheet.
(i)                 (ii)                    (iii)
input_reads.fastq   input_directory         input_directory
                    ├── reads0.fastq        ├── barcode01
                    └── reads1.fastq        │   ├── reads0.fastq
                                            │   └── reads1.fastq
                                            ├── barcode02
                                            │   ├── reads0.fastq
                                            │   ├── reads1.fastq
                                            │   └── reads2.fastq
                                            └── barcode03
                                                └── reads0.fastq
When using a sample sheet the workflow allows the use of additional columns approx_size, full_reference and insert_reference, which replace the parameters --approx_size, --full_reference and --insert_reference respectively. This allows per-sample values to be applied rather than global settings. An example sample sheet is shown below.
alias,barcode,type,approx_size,full_reference,insert_reference
sample1,barcode01,test_sample,4000,/path/to/full_reference.fasta,/path/to/insert_reference.fasta
sample2,barcode02,test_sample,4000,/path/to/full_reference.fasta,/path/to/insert_reference.fasta
sample3,barcode03,test_sample,7000,/path/to/full_reference_alt.fasta,/path/to/insert_reference_alt.fasta
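Before launching, a sample sheet like the one above can be sanity-checked with a short script. This is an illustrative sketch (the function name and checks are our own, not part of the workflow), covering only the required columns and the integer constraint on approx_size:

```python
import csv

REQUIRED = {"barcode", "alias"}  # minimally required sample sheet columns

def check_sample_sheet(path):
    """Return a list of problems found in a wf-clone-validation-style sample sheet."""
    problems = []
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    if not rows:
        return ["sample sheet is empty"]
    missing = REQUIRED - set(rows[0].keys())
    if missing:
        problems.append(f"missing required column(s): {sorted(missing)}")
    for i, row in enumerate(rows, start=2):  # header is line 1
        size = row.get("approx_size")
        if size and not size.isdigit():
            problems.append(f"line {i}: approx_size must be an integer, got {size!r}")
    return problems
```

The workflow performs its own validation; a pre-check like this simply gives faster feedback on large multiplexed runs.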
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
fastq | string | FASTQ files to use in the analysis. | This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with --sample . In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet . | |
bam | string | BAM or unaligned BAM (uBAM) files to use in the analysis. | This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with --sample . In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet . | |
approx_size | integer | Approximate size of the plasmid in base pairs. This can also be defined per sample: see sample_sheet parameter. | 7000 | |
assm_coverage | integer | Fold coverage to use per assembly. | This is the coverage to which reads will be subsampled for each assembly. | 60 |
primers | string | TSV File containing primers used to find inserts. If left empty then inserts will not be searched for. | Specify one or more primer sets which will be used to find the sequence inserted in the construct. This file should be in .tsv format containing columns [primer_name, 5’ primer, 3’ primer] with no header. An example primers.tsv for pRham/T7 is available in the demo data for the workflow. | |
analyse_unclassified | boolean | Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. | If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. | False |
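For reference, the primers file passed to --primers is a headerless, tab-separated file with three columns: primer name, 5' primer and 3' primer. A minimal illustrative example (the primer names and sequences below are placeholders, not a real primer set; a real primers.tsv for pRham/T7 ships with the demo data):

```
primer_set_1	ATGCGTACGTTAGC	GCTAAGGCTTACGA
primer_set_2	TTGACCGGTAATCG	CGATTCAGGTAGCC
```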
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
insert_reference | string | Optional file containing the insert reference sequence, which will be used for comparison with the consensus insert in the report. | Providing a reference sequence can be useful as a QC on the base-level resolution of the reconstructed insert sequences. Users can specify different insert references for individual samples by including an insert_reference column in the sample sheet; the sample sheet column cannot be used in conjunction with --insert_reference . | |
full_reference | string | Optional FASTA file containing the reference sequence of the full plasmid. This will be used for comparison with the assembled construct. | Providing a reference sequence can be useful as a quality check on the base-level resolution of the reconstructed sequence; the reference is not used to generate the assembly. Users can specify different full references for individual samples by including a full_reference column in the sample sheet; the sample sheet column cannot be used in conjunction with --full_reference . | |
host_reference | string | A host reference genome FASTA file. Reads which map to this reference are discarded and not used for the assembly. ||
regions_bedfile | string | If a host_reference is supplied, an optional BED file can be added to provide host reference regions that will be masked during filtering.
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
sample_sheet | string | A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. An optional column approx_size can be added to provide size estimates for each sample. When not provided, the --approx_size parameter will be used for all samples. | The sample sheet is a CSV file with, minimally, columns named barcode and alias . Extra columns are allowed. A type column is required for certain workflows and should have the following values; test_sample , positive_control , negative_control , no_template_control . The workflow can use a number of optional columns: approx_size provides optional size estimates for each sample, cut_site can be added to provide a cut site as a sequence which will be used to provide a linearisation efficiency section in the report, full_reference and insert_reference allow the use of per-sample references when providing full/relative paths (with respect to the workflow launch directory) to the respective reference files. | |
sample | string | A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files. |
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
out_dir | string | Directory for output of all user-facing files. | output | |
prefix | string | The prefix attached to each of the output filenames. |
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
override_basecaller_cfg | string | Override auto-detected basecaller model that processed the signal data; used to select an appropriate Medaka model. | Per default, the workflow tries to determine the basecall model from the input data. This parameter can be used to override the detected value (or to provide a model name if none was found in the inputs). However, users should only do this if they know for certain which model was used as selecting the wrong option might give sub-optimal results. A list of recent models can be found here: https://github.com/nanoporetech/dorado#DNA-models. | |
medaka_model_path | string | A custom model file (.tar.gz or .hdf) to be used instead of the automatic model selection; this takes precedence over the optional --override_basecaller_cfg parameter. | Allows users to test experimental Medaka models. Users should not provide a model with this parameter for general analysis. | |
large_construct | boolean | Enable assembly of larger constructs including Bacterial Artificial Chromosomes (50,000-300,000 base pairs). | Selecting this will skip approximate size filtering steps allowing the assembly of larger genomes. Multiple sequence alignment of inserts will be skipped in this mode. | False |
trim_length | integer | Number of base pairs to trim from both ends of each read. Set to 0 if no trimming is required. | 150 | |
flye_quality | string | The Flye parameter for quality of input reads, default nano-hq : high-quality reads, Guppy5+ SUP or Q20 (<5% error). | Other options include nano-corr : reads that were corrected with other methods (<3% error), nano-raw : pre-Guppy5 (<20% error). | nano-hq |
non_uniform_coverage | boolean | Set this to true if your reads have highly non-uniform coverage. | Run flye in metagenome assembly mode, which may help with the assembly if you have high non-uniform coverage reads; generally, should not be required. | False |
db_directory | string | Optional directory containing a gene annotation database. | A default generic annotation is provided in tar.gz format, containing entries from fpbase, Swiss-Prot, Rfam and SnapGene. | |
assembly_tool | string | Select the assembly tool to use, either Canu or Flye. | Flye is the default assembler and will work in most cases. Alternatively, select Canu; note that it will not work with ARM processors. | flye |
canu_fast | boolean | Fast option can make the assembly step significantly faster. It can be used on any genome size but may produce less continuous assemblies on genomes larger than 1 Gbp. | This option is only relevant if Canu is set as the assembly_tool parameter | False |
cutsite_mismatch | integer | Maximum number of mismatches allowed when searching for the cut site in the provided reference FASTA. Set to 0 for perfect matches only. Increasing the allowed mismatches increases the risk of multiple matches, which will cause the workflow to fail. | 1
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
threads | integer | Maximum number of CPU threads to use per workflow task. | Several tasks in this workflow benefit from using multiple CPU threads. This option sets the number of CPU threads for all such processes. The total CPU resource used by the workflow is constrained by the executor configuration. | 4 |
Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
Title | File path | Description | Per sample or aggregated |
---|---|---|---|
workflow report | ./wf-clone-validation-report.html | A report bringing together the main results of the workflow, across samples. | aggregated |
sample status | ./sample_status.txt | A CSV file with per-sample assembly success or failure reasons | aggregated |
plasmid annotations | ./plannotate.json | Plasmid annotations in a JSON structure. | aggregated |
annotations bed | ./{{ alias }}.annotations.bed | Plasmid annotations in a BED file format for onward use | per-sample |
annotations gbk | ./{{ alias }}.annotations.gbk | Plasmid annotations in a GBK file format for onward use | per-sample |
Assembly FASTQ | ./{{ alias }}.final.fastq | Sequence and quality score of the final assembly. | per-sample |
Assembly statistics | ./{{ alias }}.assembly_stats.tsv | Assembly statistics from fastcat. | per-sample |
Insert FASTA | ./{{ alias }}.insert.fasta | Insert sequence found in the final assembly, only relevant if the primers parameter was used. | per-sample |
Variant stats report | ./{{ alias }}.full_construct.stats | A BCF stats report with any variants found, only relevant if a full reference was provided. | per-sample |
Variants BCF file | ./{{ alias }}.full_construct.calls.bcf | A BCF file with any variants found per sample, only relevant if a full reference was provided. | per-sample |
Reference alignment | ./{{ alias }}.bam | Reference aligned with the assembly in BAM format, only relevant if a full reference was provided. | per-sample |
Reference alignment index | ./{{ alias }}.bam.bai | The index for the reference aligned with the assembly, only relevant if a full reference was provided. | per-sample |
BAM Stats | ./{{ alias }}.bam.stats | Stats report for the reference aligned with the assembly, only relevant if a full reference was provided. | per-sample |
The fastcat tool is used to concatenate multi-file samples to be processed by the workflow. It also outputs per-read stats including average read lengths and qualities. Reads with lengths less than 0.5× or more than 1.5× the approximate size are filtered out, unless the --large_construct parameter is provided, which indicates the assembly is expected to be larger (50,000-300,000 bps).
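The length filter described above can be written as a simple predicate. This is an illustration of the stated rule, not the workflow's actual implementation:

```python
def passes_length_filter(read_length, approx_size, large_construct=False):
    """Keep a read unless its length falls outside 0.5x-1.5x the expected
    plasmid size; --large_construct disables this filter entirely."""
    if large_construct:
        return True
    return 0.5 * approx_size <= read_length <= 1.5 * approx_size
```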
If a host_reference fasta file is provided, Minimap2 is used to align all reads to the host_reference, and any aligned reads are filtered out.
The reads are then trimmed at the ends using SeqKit with the provided trim length parameter, which defaults to 150 bp. Set this value to 0 if no trimming is desired, such as for non-linearized plasmid sequences or linearized plasmid sequences that have already been trimmed. At this stage SeqKit is also used to filter out reads that are longer than 1.2× the approximate size or shorter than 100 bp.
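The trimming and secondary length filter can likewise be sketched in a few lines (an illustration of the stated rules, not the actual SeqKit commands; that the 1.2× bound is applied to the post-trim length is an assumption here):

```python
def trim_and_filter(seq, approx_size, trim_length=150):
    """Trim trim_length bases from both ends, then drop reads longer than
    1.2x the approximate size or shorter than 100 bp (returns None if dropped).
    Assumes the length bounds apply after trimming."""
    if trim_length:
        seq = seq[trim_length:-trim_length] if len(seq) > 2 * trim_length else ""
    if len(seq) < 100 or len(seq) > 1.2 * approx_size:
        return None
    return seq
```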
The sequences are then subsampled using Rasusa. The subsampling takes the expected coverage parameter into account; as the assembly will be repeated 3 times, we subsample to a target of approximately 3× the required coverage. However, this is just a target, and if there is not enough data Rasusa is still able to create the 3 subsamples. The approximate size parameter is also used by Rasusa to work out the target number of bases, and therefore the number of reads, required for each of the subsamples.
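The subsampling arithmetic can be sketched as follows (a back-of-envelope illustration; the exact targets are determined by Rasusa and the workflow internals):

```python
def subsample_target_bases(approx_size, assm_coverage=60, n_assemblies=3):
    """Total bases to aim for across all subsamples: each of the three
    assemblies wants assm_coverage-fold coverage of the plasmid."""
    per_subsample = approx_size * assm_coverage
    return per_subsample * n_assemblies
```

For example, a 7 kb plasmid at the default 60× coverage, assembled three times, targets 7000 × 60 × 3 = 1,260,000 bases in total.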
Trycycler is then used to create the 3 subsamples, as we will be creating three assemblies and finding the consensus between all three. This consensus generation is also handled by Trycycler.
We perform the assembly for each of the 3 subsamples separately. The assembly is done using either Flye or Canu depending on what is set as the assembly_tool
parameter. Both Flye and Canu are popular assemblers that usually produce reliable assemblies. Flye is our default assembler as it usually provides reliable assemblies in less time than Canu, and supports ARM processors. If Flye fails to assemble you may wish to try Canu.
If there are concatemers in the assembly, these are found using minimap2 and de-concatenated using a custom Python script. If the assembly is already roughly the expected approximate size, this de-concatenate step will be skipped.
Trycycler is used to reconcile the subsampled assemblies into one final assembly. This is then polished with Medaka. A per-base quality score for the assembly is output by Medaka in a FASTQ file. This is used for creating the mean assembly quality you will find in the report.
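The mean assembly quality is derived from the per-base quality string in Medaka's FASTQ output. One common way to compute such a mean, averaging Phred+33 scores in probability space, looks like this (a sketch of the convention, not necessarily the workflow's exact calculation):

```python
import math

def mean_quality(qual_string):
    """Mean Phred quality of a FASTQ quality string, averaged in probability
    space (averaging raw Q values instead would overstate the quality)."""
    probs = [10 ** (-(ord(c) - 33) / 10) for c in qual_string]
    return -10 * math.log10(sum(probs) / len(probs))
```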
SeqKit is used to locate inserts using the primers supplied to the primers parameter.
A multiple sequence alignment (MSA) will be done using Pyspoa. This will be presented in the report to help users compare inserts across samples in a multi-sample run. If an insert reference FASTA file is provided, this will also be included in the MSA.
If a reference insert FASTA sequence is provided, BCFtools is used to find variants between it and the final insert assembly; these are reported in a BCF file per sample.
If a full reference FASTA sequence is provided, Minimap2 is used to align the final assembly with the reference. BCFtools is used to report variants between the reference and the final assembly, which are reported in a BCF stats file per sample.
The assembly is annotated by pLannotate to show any features that are present. The default database is used, which contains entries from FPbase, Swiss-Prot, Rfam and SnapGene. Descriptions, percentage match and length of the match are also provided.
For each sample a self alignment will be done using Last and the output will be presented as a dotplot. This can help identify any repetitive regions in your final assembly.
If a user provides a cut_site column in the sample sheet (a short sequence per sample), it will be used to predict linearisation efficiency, calculated as the proportion of reads that do not span the cut site out of the total reads and reported as a percentage.
I don’t know the approximate size of my plasmid? - On most occasions you can use the mode of the read-length data as an approximate guide. To find the mode, run the workflow with the default settings and find the highest peak in the raw-data read length plot. This value should approximate the plasmid size because, for most plasmids, only one cut is made to the circular plasmid prior to sequencing, meaning each read covers the full plasmid. When in doubt, it is better to overestimate the approximate size than to underestimate it.
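The advice above, taking the modal read length as a size estimate, can be sketched as follows (an illustration with a hypothetical binning choice; in practice you would read the peak off the report's read-length plot):

```python
from collections import Counter

def estimate_plasmid_size(read_lengths, bin_size=100):
    """Estimate plasmid size as the centre of the modal read-length bin;
    binning smooths over small per-read length differences."""
    bins = Counter(length // bin_size for length in read_lengths)
    modal_bin, _ = bins.most_common(1)[0]
    return modal_bin * bin_size + bin_size // 2
```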
Does the workflow report contaminants? - The workflow has no way of reporting contaminants. However, if contaminants are present, the workflow may struggle to create consistent assemblies and the output assemblies are likely to show low quality. If you have a reference for an expected contaminant, you could use it as the host reference to filter out any reads that align to it.
Can I use my own annotation database? – Currently using your own annotation database is not supported, but we may add it in future.
Does this workflow support reference based assembly? - It does not have a reference based assembly mode.
Does this workflow have support for bacterial artificial chromosomes (BACs)? - Assembly of larger constructs, including BACs (50,000-300,000 bps), can be enabled with the --large_construct parameter described above.
If your question is not answered here, please report any issues or suggestions on the GitHub issues page or start a discussion on the community forum.
See the EPI2ME website for lots of other resources and blog posts.