Have you ever tried downloading a shiny new software tool from GitHub, keen to run your test dataset through it, only to find that the application fails with a missing library error or, even worse, produces different results from those you were expecting? Well, most applications require a whole host of other software libraries in order to function. Each of these libraries can exist in different versions that may or may not produce different results.
One increasingly common solution to this issue is to bundle up your software into a container. This blog post introduces the use of containers for distributing bioinformatics applications, focusing on the Singularity container engine which, as will be demonstrated, is very well suited to bioinformatics applications.
This post serves as a guide for Linux users developing bioinformatics applications who are unfamiliar with containers, or who would like an introduction to Singularity. We will also discuss how and why we use containers in EPI2ME workflows.
For reproducible research we need a way to ensure all the software dependencies of an application are the same no matter where or when the software is run. One way to do this is to post your laptop by DHL to your collaborators who can then run the same analyses as you. If this seems a bit cumbersome, we should consider some alternatives.
Virtual machines (VMs) are a type of software that runs on your computer and fully emulates a second computer. So you could have a Linux VM running on a Windows host machine with complete isolation, and it can be identical to a copy of the same VM running elsewhere. However, VMs can reserve a lot of the host system’s resources (even if they don’t actually use them), and the files used to define and share VMs are large.
An alternative approach is to use a container to bundle up your application along with all the dependencies it needs to run reproducibly. Containers, unlike VMs, run as a normal process on the host machine. A container contains all the programs, system settings and other dependencies that are needed to reproducibly run your application. However, containers directly use the resources of the underlying operating system, so they use far fewer resources and are easier to distribute than VMs. Containerised apps have now largely superseded the use of VMs in the distribution of bioinformatics applications.
A note on terminology:
image: layers of data that record the state of a computer’s file system.
container: a running instance of an image.
In the sea of container engine players, two stand out as particularly popular in the bioinformatics community: Docker and Singularity (or equivalently Apptainer, see History). Docker has been around longer and was originally designed with the deployment of web applications in mind. The company behind Docker hosts a distribution platform called Docker Hub that enables the storing and sharing of Docker images.
Singularity? Apptainer? What’s the difference? The answer to this question is long. If you are interested, read more in this post. We will use Singularity and Apptainer interchangeably in this post.
Docker is often run with administrator access as it has many powerful functions that need escalated privileges. On multi-user high performance computing systems, giving every Tom, Dick, and Harry admin permissions isn’t such a great idea. That’s where Singularity comes in. When a Singularity container executes, it runs as the same user inside the container as outside it, with identical permissions. This model side-steps many of the security issues associated with Docker. Singularity can seamlessly run Docker container images, including directly from Docker Hub, as well as being able to use its own container format.
Singularity is geared for running on Linux systems and this is where you are most likely to encounter it in the wild. Installation instructions differ between operating systems. For detailed instructions, including for different Linux flavours and for Mac, see here.
As we noted above, Singularity is able to transparently use the Docker image format. There are a multitude of Docker images to be found on Docker Hub, with a good chance that your favourite applications have already been packaged into an image and are available there. The biocontainers project is a community-led project to create bioinformatics containers backed largely by bioconda packages. Their registry hosts many useful container images in the native Singularity format.
To show how we can use a container from Docker Hub, we’ll take a look at the EPI2ME wf-alignment workflow image as an example by running the following command.
singularity shell docker://ontresearch/wf-alignment:shaa9faef16822c5aa48366a4c45b401c9233a6c0f7
This will instantiate a wf-alignment container and drop into a shell within it. Let’s break down this command.
singularity - the main command
shell - this subcommand tells Singularity that we want to run a shell within a container
docker:// - tells Singularity to get images from Docker Hub. To use images from Singularity Library, prefix the path with library:// instead
ontresearch/ - is Oxford Nanopore Technologies’ Docker Hub namespace
wf-alignment - specifies the project
:shaa9faef16822c5aa48366a4c45b401c9233a6c0f7 - everything after the : is the tag. This is the version of the container.
In EPI2ME workflows, this tag is what associates an image version with a workflow version and is specified in the workflow config (see this example).

Running the command above drops us into a shell in the container. We can run applications that are installed in the container, for example seqkit:
Singularity> seqkit version
seqkit v2.6.
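As a rough illustration of the breakdown of the image reference given earlier, the same parts can be pulled apart with plain shell parameter expansion. This is only a sketch: the variable names are made up for this example, and it assumes the reference always has the form transport://namespace/project:tag.

```shell
# Split a container image reference into the parts described in the breakdown.
# All variable names here are illustrative, not part of Singularity itself.
image_uri="docker://ontresearch/wf-alignment:shaa9faef16822c5aa48366a4c45b401c9233a6c0f7"

transport="${image_uri%%://*}"   # text before "://"        (docker)
ref="${image_uri#*://}"          # text after "://"         (namespace/project:tag)
tag="${ref##*:}"                 # text after the last ":"  (the image version)
repo="${ref%:*}"                 # text before the last ":" (namespace/project)
namespace="${repo%%/*}"          # first path component     (ontresearch)
project="${repo#*/}"             # remainder                (wf-alignment)

printf 'transport=%s\nnamespace=%s\nproject=%s\ntag=%s\n' \
    "$transport" "$namespace" "$project" "$tag"
```

A reference without a tag would not split cleanly with this approach, so treat it as a reading aid rather than a general-purpose parser.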
When a container is running in Singularity, each process will have a work directory on the host system, which you’ll likely be reading data from and writing results to. For security purposes not all the locations on the host system are available from within the container (see here), although there are some default locations that are mounted, which include:
$HOME, mounted within the container at $HOME
/tmp, mounted within the container at /tmp
If you want other host directories to be available in your container, for example a network share folder, use Singularity’s --bind option.
The following command opens a shell in a container with the host folder /mnt/share available in the container at /data.
Note that the --bind option must come before the image path.
singularity shell --bind /mnt/share:/data docker://ontresearch/wf-alignment:shaa9faef16822c5aa48366a4c45b401c9233a6c0f7
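If more than one host directory is needed, --bind also accepts a comma-separated list of host:container pairs. The sketch below assembles such a list and prints the resulting command instead of running it, so it can be inspected first. The second bind pair (/mnt/reference:/ref) is a hypothetical path added purely for illustration.

```shell
# Assemble a "singularity shell" command with several bind mounts.
# The /mnt/reference:/ref pair is a hypothetical example path.
image="docker://ontresearch/wf-alignment:shaa9faef16822c5aa48366a4c45b401c9233a6c0f7"

bind_arg=""
for pair in "/mnt/share:/data" "/mnt/reference:/ref"; do
    # Append each host:container pair, separated by commas
    bind_arg="${bind_arg:+$bind_arg,}$pair"
done

# Options such as --bind go before the image path.
# Remove the leading "echo" to actually start the container.
echo singularity shell --bind "$bind_arg" "$image"
```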
To create a Singularity image, we first need to create a definition file.
There is a good tutorial available here, but we’ll just create a very simple definition file that defines a container with a single application, samtools (let’s call it samtools.def):
Bootstrap: docker
From: ubuntu:22.04

%post
    apt -y update
    apt -y install samtools
The first line here tells Singularity that we will be using an image from Docker Hub as our starting point. The second line specifies that our starting image will be Ubuntu version 22.04. Changes that we specify later in the file will be applied to this Ubuntu image.
The commands in the %post section are run, and the results are added on to the base Ubuntu image to create your new Ubuntu + samtools image. The -y flag answers apt’s confirmation prompts automatically, since nobody is there to answer them during the build.
To go ahead and build the image, run the following:
singularity build samtools.sif samtools.def
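The two steps above can be combined into one small script. The following is a sketch only: it writes the definition file, then builds the image with whichever of singularity or apptainer is found on the PATH, and otherwise just leaves the definition file in place. It uses apt -y for the install step so that the non-interactive build is not stopped by a confirmation prompt. Depending on how your system is configured, the build may additionally need elevated privileges, which this sketch does not attempt to handle.

```shell
# Write the definition file described above into the current directory
cat > samtools.def <<'EOF'
Bootstrap: docker
From: ubuntu:22.04

%post
    apt -y update
    apt -y install samtools
EOF

# Build with whichever engine is installed; otherwise only report what was done
if command -v singularity >/dev/null 2>&1; then
    singularity build samtools.sif samtools.def
elif command -v apptainer >/dev/null 2>&1; then
    apptainer build samtools.sif samtools.def
else
    echo "No container engine found; wrote samtools.def only" >&2
fi
```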
We can fire up our newly minted image into a running container using the following command (remembering to bind our data directory):
singularity shell --bind /home/me/data:/data samtools.sif
And run a command in the running container to view a BAM file.
Singularity> samtools view /data/chr19.bam | less
d42c2f04-3ad0-47a7-9d85-c2adf95f3ec1_0 16 chr19 58498337 60 80S35M2I121M1I198M93N18M1D111M207N104M1D4M1D15M30S * 0 0 TCCTACGACGCT.....
Nextflow is a domain-specific language that is used to create bioinformatics workflows; it’s used to power all our EPI2ME workflows. A Nextflow workflow consists of individual processes that run custom scripts and commands. Each of these processes is able to run these commands in its own container separate from all other processes and from the host system.
Nextflow natively supports both Singularity and Docker (as do our workflows) as well as some lesser-known container engines. See the Nextflow documentation for more information on these.
If your workflow is fairly simple with few dependencies, you might want to use the same container for each process.
To do that, use the -with-singularity option, e.g. nextflow run <your script> -with-singularity [singularity image file].
With this option, every process runs all of its commands inside the specified container image.
If you don’t want to add this to the command line every time, it can be supplied in the Nextflow config like so:
process.container = '/path/to/singularity.sif'
singularity.enabled = true
This is great if your workflow utilizes a single container, but if your workflow is complex or if you have a common container that is shared across workflows, you may want to apply specific containers to each process. This is what we do in many EPI2ME workflows; there will be at least one workflow-specific container and another common container, which supplies functionality shared across all workflows.
Nextflow allows each process to be assigned its own container using either the process name or the process label. Using the process label as below, it’s possible to apply the same container to multiple processes with the same label, simplifying the configuration. The container can be a locally stored image or exist on the internet.
process {
    withLabel:process1 {
        // Path to a local image
        container = "path/to/container1.sif"
    }
    withLabel:process2 {
        // A Docker Hub image
        container = "docker://ontresearch/wf-common"
    }
}
EPI2ME workflows are only supported when run with our supplied images. Each of our workflows is associated with containers that are mostly built by our internal continuous integration (CI) systems. Conda packages for each container are defined in Dockerfiles, with some of the conda packages created by our team. The process is quite similar to that adopted by the biocontainers project, though we lean more toward building containers containing multiple tools. Our internal CI system then builds a Docker image from this file and deposits it to Docker Hub with a unique tag. This tagged image is then associated with one or more versions of the workflow.
Using Singularity when running an EPI2ME workflow is as easy as supplying the profile option in your command:
nextflow run epi2me-labs/wf-basecalling -profile singularity
Using this command, the following profile defined in the workflow config is used.
profiles {
    singularity {
        singularity {
            enabled = true
            autoMounts = true
        }
    }
}
This profile enables the use of Singularity.
It also sets autoMounts to true, which makes Nextflow automatically mount the host paths that are used in the workflow into the container.
Common issues encountered when using Singularity are often due to host paths not being available within the container. The following describes two such issues and how to work around them.
Sometimes the use of relative paths can result in problems when using Nextflow with Singularity. Therefore, it’s advisable to use absolute paths in the command. So do this:
nextflow run /home/git/wf-basecalling --input /home/data/fastq
rather than this:
nextflow run ~/wf-basecalling --input ~/data/fastq
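One way to follow this advice without typing full paths by hand is to resolve them in the shell before calling Nextflow. The helper below, to_abs, is a hypothetical function written for this post, not part of Nextflow or Singularity. It is a minimal sketch that handles three cases: already-absolute paths, paths with a literal leading ~/ (for example, read from a file), and paths relative to the current directory. The example paths are placeholders.

```shell
# to_abs: print an absolute version of a path (hypothetical helper).
# It does not check that the path exists and does not resolve ".." or symlinks.
to_abs() {
    case "$1" in
        /*)    printf '%s\n' "$1" ;;               # already absolute
        '~/'*) printf '%s\n' "$HOME/${1#??}" ;;    # literal "~/" prefix
        *)     printf '%s\n' "$PWD/$1" ;;          # relative to current directory
    esac
}

workflow_dir=$(to_abs "wf-basecalling")   # placeholder path
input_dir=$(to_abs "data/fastq")          # placeholder path

# Remove the leading "echo" to actually launch the workflow
echo nextflow run "$workflow_dir" --input "$input_dir"
```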
Singularity needs a place to store temporary files during the building of containers.
By default, this is set to /tmp, but this can result in /tmp getting full, especially on shared computing systems, and you might encounter a message such as
FATAL: While making image from oci registry .... short write: write /tmp/
To prevent this, set the Singularity temp directory by placing the following in your ~/.bashrc file.
export SINGULARITY_TMPDIR=/path/to/your_tmp_dir
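A slightly more defensive variant is sketched below. It makes sure the directory exists before exporting it and also sets APPTAINER_TMPDIR, which is the name Apptainer uses for the same setting. The function name set_container_tmpdir and the directory used in the example call are assumptions made for this illustration. In a real ~/.bashrc you would pass a fixed absolute path on a filesystem with plenty of free space.

```shell
# set_container_tmpdir: create a temp directory and point both engines at it.
# The function name and the example directory are hypothetical.
set_container_tmpdir() {
    mkdir -p "$1" || return 1
    export SINGULARITY_TMPDIR="$1"
    export APPTAINER_TMPDIR="$1"   # Apptainer's name for the same setting
}

# Example call; replace the argument with a fixed absolute path of your choice
set_container_tmpdir "$PWD/singularity_tmp"

# Show free space so a nearly full filesystem is noticed before a build
df -h "$SINGULARITY_TMPDIR"
```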
This short guide was an introduction to containers and, in particular, Singularity (Apptainer) and why and how it can be used for bioinformatics applications. We briefly touched on what containers are, how to create and run Singularity images and containers and highlighted a couple of common issues you may encounter while using Singularity.