Assembly is required because sample extraction, processing, handling and sequencing can all fragment DNA, so molecules comprising whole chromosomes are rarely present. With careful preparation, reads in excess of 2 Mbases have been achieved on Oxford Nanopore Technologies’ sequencing platforms. Many tools exist that aim to achieve the best assembly whilst balancing computational requirements and time constraints.
Two types of whole genome consensus task are commonly undertaken in bioinformatics: i) reference-assisted consensus and ii) de novo assembly. In the first, we use knowledge of an existing reference sequence as a scaffold to piece reads together. De novo assembly, by contrast, uses no prior knowledge. It is therefore a more difficult task that requires greater computation, but it is arguably of greater utility: it allows assembly of anything that may not be present in a reference sequence database, removes the bias a reference sequence may introduce, and allows you to find novel components or structural variations.
Sequencing read data from Oxford Nanopore Technologies’ sequencing platforms are ideal for creating de novo assemblies because of the long read lengths produced. Longer reads contain more unique sequence and give more easily distinguishable overlaps between reads, which makes it easier to piece them together. This also makes them more useful for resolving repetitive elements and other structural ambiguities.
Some popular assembly tools for working with ONT data include Flye, Canu, Raven, Shasta and Miniasm. I don’t plan to explain the algorithms used, as these are explained far better in the tools’ respective papers, but it’s useful to understand that they make use of graphs to model relationships between sequence fragments. The overlap–layout–consensus (OLC) approach is used in part by all of the tools mentioned above. It can be represented as a directed overlap graph in which each node is a read and each edge marks where two sequence fragments overlap. Scored edges can then be used to find the most likely consensus sequence. Whilst various tools use OLC, they differ in the additional steps they include and in the alignment methods and scoring systems they use to arrive at a final assembly.
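To make the overlap graph idea concrete, here is a toy sketch in Python. It is not how Flye, Canu or any of the other tools actually work: it assumes error-free reads, exact suffix-to-prefix matches and a simple greedy merge, and the function names and the `min_len` threshold are made up for illustration. Real assemblers use approximate alignment, scoring and a separate consensus step.

```python
from itertools import permutations


def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Directed overlap graph: nodes are reads, edges (a, b) are scored by overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap_length(a, b, min_len)
        if n:
            graph[(a, b)] = n
    return graph


def greedy_layout(reads, min_len=3):
    """Repeatedly merge the pair of reads joined by the highest-scoring edge."""
    reads = list(reads)
    while len(reads) > 1:
        graph = build_overlap_graph(reads, min_len)
        if not graph:
            break  # no remaining overlaps: the reads stay as separate contigs
        (a, b), n = max(graph.items(), key=lambda item: item[1])
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])  # keep the overlapping bases only once
    return reads


reads = ["ATGCGT", "GCGTAC", "TACGGA"]
print(build_overlap_graph(reads))
print(greedy_layout(reads))
```

With these three toy reads the graph has two edges, and following them merges everything into a single contig. On real data, sequencing errors and repeats make both the overlap detection and the layout step far harder, which is where the tools differ.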
The assembler that we use most frequently in our workflows is Flye, which constructs a repeat graph where the edges represent genomic sequence and the nodes are sequence overlaps. The edges are labelled as either unique or repetitive. The genome is then predicted by traversing the graph so that each of the unique edges appears once. A more in-depth explanation can be found here. You can also see from the GitHub page that Flye has been benchmarked with sequence sets ranging from small bacterial genomes to the large human genome, giving it broad applicability. Flye is still being actively improved and developed, with the most recent release being in February this year.
There are a lot of benchmarking papers, and it’s important to keep in mind that each is focused on specific applications. Different assemblers may be better suited to different applications, and there is not yet one assembler that clearly outperforms the rest for all of them. One recent independent study benchmarked tools on various aspects of prokaryote whole genome sequencing. It found that Flye was one of the top assemblers and made the fewest sequence errors for this application, but notably used the most RAM: Flye used 8-16 GB vs. 8 GB or less for all other assemblers (Wick and Holt 2021). Most people these days are likely to have access to 16 GB on a regular laptop. Canu had the longest run time of 1 to 6 hours, whereas Flye’s average was 15 minutes, making it easier to carry out a full analysis in a reasonable amount of time. Another paper, benchmarking assembly tools for plant genome assemblies, concluded that Flye and Canu were best for creating accurate assemblies (Jung et al. 2020).
From our own experience, we have found Flye to be reliable and able to resolve assemblies in most cases. It generally gives better consensus accuracy and a shorter assembly time than other tools we have used in the past. We also find it very user-friendly, requiring less configuration than some other tools.
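As a rough illustration of how little configuration a basic run needs, the sketch below builds a minimal Flye command line in Python. The `--nano-raw`, `--out-dir` and `--threads` options come from Flye's command-line help; the helper function, file names and thread count are hypothetical placeholders, and the right read-type option depends on your basecalling, so check `flye --help` for your installed version.

```python
import shlex


def build_flye_command(reads, out_dir, threads=4, read_type="--nano-raw"):
    """Return a basic Flye invocation as an argument list.

    `reads` and `out_dir` are placeholder paths; `read_type` is assumed to be
    one of Flye's read-type options (for example --nano-raw or --nano-hq).
    """
    return ["flye", read_type, reads, "--out-dir", out_dir, "--threads", str(threads)]


cmd = build_flye_command("reads.fastq.gz", "flye_out")
print(shlex.join(cmd))

# To actually run the assembly (requires Flye to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a path contains spaces.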
For some datasets, however, Flye has failed to complete an assembly even after a substantial amount of time, and for certain datasets it has consistently failed to produce an assembly at all, so we have needed to use other assemblers. Whilst Flye is often our go-to assembler, for each workflow we research and experiment with various options to find the most suitable in each case.
Experimental design, as well as the tools used before assembly, may also have an impact on results. For example, filtering on read length, filtering on quality score, removing adapters or identifying low-quality regions within reads may help when an assembly fails. When looking for a robust assembly, it can be worth using more than one assembler and comparing the results.
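The length and quality filtering mentioned above can be sketched in a few lines of Python. This is an illustrative example rather than a replacement for a dedicated filtering tool: it assumes standard four-line FASTQ records with Phred+33 quality encoding, and the function names and default thresholds are arbitrary choices that should be tuned for your data.

```python
import math


def mean_quality(quality_string, offset=33):
    """Mean Phred score of a read, averaged via per-base error probabilities."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual


def filter_reads(records, min_length=1000, min_quality=7.0):
    """Keep reads that are long enough and have a high enough mean quality."""
    for header, seq, qual in records:
        if len(seq) >= min_length and mean_quality(qual) >= min_quality:
            yield header, seq, qual


example = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",  # long enough, high quality
    "@read2", "ACG", "+", "III",            # too short
    "@read3", "ACGTACGT", "+", "!!!!!!!!",  # quality too low
]
kept = list(filter_reads(parse_fastq(example), min_length=5, min_quality=10))
print([header for header, _, _ in kept])
```

Averaging error probabilities rather than raw Phred scores is a deliberate choice here: a handful of very poor bases then pulls the mean down more strongly than a plain arithmetic mean of the scores would.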
Assembly is an ongoing area of research, with improvements to speed and accuracy being made continually by algorithm experts. Flye is often our first choice of assembler because of its broad applicability, speed and reliability in resolving an assembly, but it is not the only assembler that we use. We continually review tools and look forward to future developments and improvements.
Wick, R.R. and Holt, K.E. 2021. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research 8, p. 2138. doi: 10.12688/f1000research.21782.4.
Jung, H. et al. 2020. Comparative Evaluation of Genome Assemblers from Long-Read Sequencing for Plants and Crops. Journal of Agricultural and Food Chemistry 68(29), pp. 7670–7677. doi: 10.1021/acs.jafc.0c01647.