Three years ago I foresaw a cloud arriving on the bioinformatics horizon. Apple were about to launch their new line of hardware that would include their own system-on-a-chip, Apple M1.

The release of M1-powered computers would be the third time that Apple changed the CPU architecture used in their products. For the previous 14 years Apple had used processors from Intel in its laptop and desktop hardware. These were the same Intel processors used ubiquitously across the rest of the computer industry, from home computers to high-performance computer clusters.

Two roads diverged

For the living memory of many (but not all) people working in the bioinformatics space, distributing software that could run on both Apple and non-Apple computers had been made less complicated by Apple’s choice to use Intel processors. There were certainly still wrinkles when trying to compile code to run on macOS compared to Linux operating systems (let’s not get into the Windows thing here), but things were easier than when Apple used PowerPC processors. With the introduction of these new powerful and efficient M1 processors from Apple there would be the task of compiling code to run on them.

Before the fanboys get flustered, let’s talk about Rosetta. Apple once described Rosetta as “The most amazing software you’ll never see.” This is a bold claim, with some justification. Without getting too technical, Rosetta allows current Apple hardware with M1 (and M2) processors to run computer code created for Intel processors. It implements something called “dynamic binary translation”: processor instructions are translated from the language of Intel processors (x86-64) to that of M1 processors (AArch64, also known as ARM64) whilst a program is running. In short, code created for earlier Intel-based Apple computers can run on current ARM-based systems.

The binary translation afforded by Rosetta is not however perfect. Certain extensions to the x86-64 language used by Intel processors are not supported by Rosetta. This means various programs simply will not run without being recompiled to run natively on ARM processors. The situation is worse than this in fact because of the dynamic nature of Rosetta: programs will run happily until the point at which an instruction that Rosetta cannot handle is encountered. When this happens the program will abruptly exit, often without any indication as to the cause.
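Because translation happens silently, it is not always obvious whether a given program is running natively or under Rosetta. The following is a minimal sketch of one way to check from Python. It relies on the `sysctl.proc_translated` key that macOS exposes (1 for a translated process, 0 for a native one, and an error where the key does not exist). The function names are our own invention for illustration, and anything other than a clear 0 or 1 is reported as "unknown" rather than guessed at.

```python
import platform
import subprocess


def interpret_proc_translated(raw):
    """Map the text printed by `sysctl -n sysctl.proc_translated` to a label."""
    value = raw.strip()
    if value == "1":
        return "translated"  # x86-64 code being run through Rosetta
    if value == "0":
        return "native"
    return "unknown"


def rosetta_status():
    """Report whether the current process is translated (hypothetical helper)."""
    if platform.system() != "Darwin":
        return "unknown"  # the sysctl key is specific to macOS
    try:
        result = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        return "unknown"
    if result.returncode != 0:
        return "unknown"  # key not present on this system
    return interpret_proc_translated(result.stdout)


if __name__ == "__main__":
    # Under Rosetta, platform.machine() reports x86_64 even on ARM hardware,
    # so the two values are most useful when read together.
    print(platform.machine(), rosetta_status())
```

Run from inside a conda environment, this gives a quick indication of whether the Python interpreter itself was built for x86-64 or ARM64.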

“This is all great Chris, but aren’t you being a bit melodramatic? We’re all still running things just fine on our MacBooks.” Yes, yes you are. Rosetta truly is a marvel, except when it isn’t. It’s a useful technology whilst developers play catch-up and release their software to run natively on ARM processors. But the use of Rosetta is not free. Aside from the issue of not supporting all x86-64 instructions, such that not all programs will run, there is a performance cost to using Rosetta. In our experiments below, recompiling a set of commonly used tools to run natively without Rosetta leads to at least a 2x performance improvement.

Then took the other, as just as fair

On the surface recompiling bioinformatics tools to run natively on ARM would seem fairly simple. Just grab your favourite software, run the compilation commands, and voila some nice new ARM binary files ready to run. As it happens growing ARMs is not so easy in practice. We quickly find that many software projects have been written in ways that mean they are intimately coupled to running on x86-64 processors. Add in the fact that many projects depend on other projects and you quickly find yourself bent in the undergrowth.

To make the job somewhat easier we can make use of available public repositories of pre-compiled software libraries. Unfortunately the go-to repository of bioinformatics software, bioconda, does not build any software for ARM64! This is the cloud I saw on the horizon back in November 2020: without Rosetta duct-taping the ship together, bioinformatics on Mac would be a no-go for many. There’s little impetus to provide native packages, with maintainers hiding behind the excuse, “Rosetta will take care of it.”

For software that doesn’t run with a helping hand from Rosetta the only option for end users is to compile software themselves. This is beyond the skill of many. Getting a bioinformatics pipeline running with multiple pieces of software all compiled to native ARM64 code is not trivial. After one developer has gone through the pain of creating ARM software packages however, the results can be shared with everyone. The premise of package libraries like bioconda is exactly this: a community of package maintainers build executable code from source code for everyone’s use.

With this in mind we started project inkling. The idea is to steadily and progressively create ARM conda packages for the software used within our Nextflow workflows. To achieve this we make use of our existing conda packaging continuous integration pipelines that we have used for packaging our own software projects such as fastcat and modbam2bed. These pipelines are backed by a set of Linux virtual machines and Apple devices running on both x86-64 and ARM64 hardware. These machines provide us with the ability to create a total of four different software builds which we push to our anaconda repository.
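To make the four builds concrete: conda sorts packages on a channel into per-platform subdirectories, and the combinations of Linux and macOS with x86-64 and ARM64 correspond to the subdirectory names `linux-64`, `linux-aarch64`, `osx-64` and `osx-arm64`. The sketch below maps the values Python reports for the running machine onto those names. The helper `conda_subdir` is hypothetical and written purely for illustration; it is not part of conda or of our pipelines.

```python
import platform

# (operating system, CPU architecture) as reported by the platform module,
# mapped to the conda subdirectory that holds packages for that combination.
CONDA_SUBDIRS = {
    ("Linux", "x86_64"): "linux-64",
    ("Linux", "aarch64"): "linux-aarch64",
    ("Darwin", "x86_64"): "osx-64",
    ("Darwin", "arm64"): "osx-arm64",
}


def conda_subdir(system=None, machine=None):
    """Return the conda subdirectory for a platform, or None if not listed."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return CONDA_SUBDIRS.get((system, machine))


if __name__ == "__main__":
    print(conda_subdir())
```

A package missing from `osx-arm64` but present in `osx-64` is exactly the case in which a Mac user ends up relying on Rosetta.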

Yet knowing how way leads on to way

One of our most popular workflows is wf-clone-validation, which can be used for the de novo assembly and annotation of plasmid sequences. So let’s try to compile all the software it requires for ARM. At its core the workflow uses flye and trycycler to first create a high quality assembly before inspection and annotation with pLannotate. So three packages to build for ARM on macOS and Linux. Not quite. Each of these, notably trycycler and pLannotate, depends on multiple other software libraries and packages. Not all of these are required for our use case in wf-clone-validation, but without deconstructing the original software components it is necessary to also compile the dependencies for ARM. To do otherwise would lead to ARM packages without the full functionality of the existing x86-64 packages. All-in-all we ended up needing to create more than 20 new ARM packages! (Various non-specific libraries are available for ARM already through conda-forge.)

Building all these packages was not a terribly exciting affair, certainly not a spectator sport. But neither was it objectively difficult given patience and perseverance. Starting from one of the direct dependencies of wf-clone-validation (say flye) we can copy the recipe (package build instructions) from bioconda and run the build process on our ARM continuous integration machines to give us shiny new ARM conda packages. However for many packages this process doesn’t work out of the box. As noted above, in order to build a package successfully its dependencies must be available. A common build failure is finding that prerequisite packages need to be built first: so to build one package we must first build others, potentially in a recursive fashion. In creating the package hierarchy for wf-clone-validation we reached four layers deep of dependencies that needed to be built.
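The "build the prerequisites first" problem is a topological ordering of the dependency graph. As a rough sketch of the idea (not the tooling we actually used), Python's standard `graphlib` module can group packages into layers, where everything in a layer can be built once the layers before it are done. The package names and the dependency map below are invented for illustration, as is the `build_layers` helper; packages already available for ARM are simply dropped from the graph.

```python
from graphlib import TopologicalSorter


def build_layers(dependencies, already_available=frozenset()):
    """Group packages into build layers; each layer depends only on earlier ones.

    `dependencies` maps a package name to the packages it requires.
    Packages in `already_available` need no build and are ignored.
    """
    pending = {
        package: [dep for dep in deps if dep not in already_available]
        for package, deps in dependencies.items()
        if package not in already_available
    }
    sorter = TopologicalSorter(pending)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        layers.append(ready)
        sorter.done(*ready)
    return layers


# Invented example graph: "tool" needs two libraries, one of which has its
# own chain of prerequisites.
example = {
    "tool": ["lib-a", "lib-b"],
    "lib-a": ["lib-c"],
    "lib-b": [],
    "lib-c": ["lib-d"],
}

if __name__ == "__main__":
    for depth, layer in enumerate(build_layers(example), start=1):
        print(depth, layer)
    # 1 ['lib-b', 'lib-d']
    # 2 ['lib-c']
    # 3 ['lib-a']
    # 4 ['tool']
```

Passing `already_available={"lib-d"}` collapses the example to three layers, which is the effect of finding a dependency already built for ARM elsewhere.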

Perhaps a more interesting issue is when the source code of a piece of software is written in a manner which does not permit compilation on ARM. This is where we have to get our hands dirty. The most common issue is code that has been written to explicitly use particular extensions to the x86 instruction set. Often developers will use intrinsic functions in their code to aid runtime performance. However such functions are generally not portable across different types of processor, such as from x86 to ARM: code must be rewritten if it is to work on multiple processor architectures. Fortunately there are special software libraries that can help with this. For example SIMD Everywhere (SIMDe) and sse2neon are two tools for allowing the use of so-called SIMD function intrinsics in a portable manner.

Tool developers who are on the ball will already be using these tools, so creating a conda package from the source code is typically simply a matter of enabling options during the build process. This was the case for tools such as minimap2 (sse2neon) and spoa/racon (SIMDe). Other times we found that we had to perform the work ourselves in order to create a working software build. It always pays to ask the software developers before embarking on any work however: I spent a morning doing this for one tool, only to find the developer had already done the work on a branch of their code.

A note of caution to anyone embarking on a similar ramble. An issue you will almost certainly encounter at some stage is openssl versions being worn really about the same. Various conda packages will require different, incompatible versions of openssl, with the result that some packages cannot be used simultaneously. In our case we found that the project capnproto only recently enabled support for openssl version 3.0.8 in addition to version 1.1.1. Unfortunately there remain some issues with compiling the 3.0.8-supporting capnproto with the conda toolchain. To resolve this, and because we do not need openssl support, we broke our own rule from above and created a boutique capnproto-nossl package.

All-in-all in order to create the necessary packages for wf-clone-validation we had to make patches to the source code of around a third of the projects. In addition we had to make alterations to around two-thirds of the build recipes.

I shall be telling this with a sigh

So was all the effort worth it? We can judge this by whether the effort makes a difference to end users. To do this we can run wf-clone-validation on Apple hardware using both the standard x86-64 compiled software and our new ARM64 code. The benchmarks were run on a MacBook with an M1 Max processor (Model A2442). For comparison we also show results using an Intel i7-11800H based device running Microsoft Windows. This processor is of a similar vintage to the M1 device and can be found in high-end laptops of late 2021.

The input data used to obtain the timings was the demonstration data included with the workflow: a dataset comprising three samples, two of which are intended not to pass the QC steps of the workflow. Figure 1 shows the execution time of the core steps of the workflow. These are:

Plasmid assembly using the flye assembler and Trycycler (which uses a variety of tools to clean up and reconcile assemblies).
Consensus calculation using medaka.
Plasmid annotation using pLannotate, which itself uses a variety of tools including blastn, diamond, and cmscan from infernal.
Primer search using seqkit.
Construction of an insert MSA across multiple samples using SPOA.

For each of these steps we see between a two and five-fold improvement in speed. The largest increase observed is for the numerically intensive medaka step. Note however that even the seemingly trivial step of the primer search using seqkit is sped up by almost 4 times. The total execution time for the workflow was reduced from 9 minutes and 2 seconds to just 3 minutes and 18 seconds.
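As a quick check on the headline figures, the overall speed-up can be worked out from the two total run times quoted above. The snippet below uses only those two numbers; the variable names are our own.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds duration to seconds."""
    return minutes * 60 + seconds


x86_total = to_seconds(9, 2)   # x86-64 code under Rosetta: 542 s
arm_total = to_seconds(3, 18)  # native ARM64 code: 198 s

speedup = x86_total / arm_total
saved = x86_total - arm_total
reduction = saved / x86_total

print(f"speed-up: {speedup:.2f}x")
print(f"time saved: {saved} s ({reduction:.0%} shorter)")
# speed-up: 2.74x
# time saved: 344 s (63% shorter)
```

The whole-workflow figure of roughly 2.7x sits inside the two to five-fold range seen for the individual steps, as expected when the total is dominated by a mix of faster and slower steps.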

Figure 1. Workflow execution timings for critical steps of wf-clone-validation. We observed between a two and five-fold improvement in execution time when using ARM code compared to x86 code on a macOS device. For comparison we also show timings on an Intel i7-11800H based device running Microsoft Windows.

The wf-clone-validation workflow is not a particularly strenuous workflow for the processing of a single sample. Ten minutes of bioinformatics analysis time is not really something deserving of optimisation. The results do however scale nicely with the number of samples processed.

And that has made all the difference

In the above we’ve journeyed through the process of converting our popular wf-clone-validation Nextflow workflow to use ARM64 code - the language of Apple’s M1 and M2 families of processors. This was motivated by a desire to better support users of Apple hardware, who account for a good proportion of users of the EPI2ME Desktop application. Apple’s Rosetta technology cannot be relied upon to allow the running of all possible software. It does a good job in providing a compatibility layer, but ultimately it is preferable to run natively compiled code.

For our wf-clone-validation workflow the performance improvement is not particularly relevant; the workflow runs quickly enough as it stands. The performance improvement of the individual tools is however more intriguing. For example the computational work that medaka performs scales with the length of the genome processed. We can conceive that the processing of larger genomes, which could previously take on the order of hours, will now take only minutes. If several samples are to be processed the gains soon add up. Many of the tools used within wf-clone-validation are also used within several of our other workflows.

We shall certainly be continuing project inkling and hope to bring native ARM64 code to all of our Nextflow workflows.