These two popular workflow management projects were started in 2014 and 2013 respectively and likely stemmed from a common need to process the increasing quantities of scientific data. They both provide similar solutions to help break down analysis into tasks or processes and link them together in an analysis pipeline.
Here are some of the main things we considered when deciding which to use for creating our own workflows:
Nextflow and Snakemake both use domain-specific language extensions of Groovy and Python respectively. Python is a well-known language among bioinformaticians, potentially making Snakemake easier to learn and share. On closer inspection, Groovy is an equally elegant, Python-style language that runs on the JVM, and for moderately experienced programmers it is easy to pick up. A benefit in both cases is the ability to drop down into the underlying language beyond the domain-specific syntax as required.
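To give a feel for the two DSLs, here is the same toy step written both ways (the process/rule names, file names and commands are ours for illustration, not taken from either project's documentation):

```groovy
// Nextflow (Groovy-based DSL2): processes connected by channels
process SAY_HELLO {
    input:
    val name

    output:
    stdout

    script:
    """
    echo "Hello, ${name}!"
    """
}

workflow {
    Channel.of('Alice', 'Bob') | SAY_HELLO | view
}
```

```python
# Snakemake (Python-based DSL): rules keyed by output file names
rule say_hello:
    output:
        "hello_{name}.txt"
    shell:
        "echo 'Hello, {wildcards.name}!' > {output}"
```

In Nextflow the data flows through channels between named processes, while in Snakemake the workflow graph is inferred by matching requested output files against rule patterns.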
Considering the way the pipelines run and execute commands: both enable automatic parallelisation of jobs, with each process running as soon as its inputs and computing resources are available. Furthermore, Nextflow can automatically retry jobs that fail, and in Snakemake it is possible to specify a number of retries for each rule.
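As a sketch of what that retry configuration looks like in each tool (the directives are real, but the process/rule bodies are invented for illustration):

```groovy
// Nextflow: retry a failing task up to three times
process align {
    errorStrategy 'retry'
    maxRetries 3

    script:
    """
    run_aligner.sh
    """
}
```

```python
# Snakemake (recent versions): the retries directive plays a similar role
rule align:
    output:
        "aligned.bam"
    retries: 3
    shell:
        "run_aligner.sh > {output}"
```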
Both require the user to define the inputs each task expects, and tasks run when their inputs are received. Nextflow automates the naming of output folders and files unless otherwise specified, and outputs may include data files or in-memory values. The automatically created output folders include logs and other information that can help greatly with debugging.
Snakemake differs in that process execution depends on the actual input and output file names. The user has to explicitly define output file names and folders; this is not automated. Many of the command-line tools we use produce numerous output files, so having to explicitly define each one can add a fair amount of additional code, as well as requiring thought about file naming.
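For example, a tool that writes several files forces the rule to enumerate each one. A hypothetical sketch (file names are ours, chosen to mirror a typical QC tool):

```python
# Snakemake: every output the downstream rules depend on must be listed
rule fastqc:
    input:
        "data/{sample}.fastq.gz"
    output:
        html="results/{sample}_fastqc.html",
        zip="results/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o results"
```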
One benefit of Snakemake is that it has an option for dry runs without any data, showing which steps would be run; this can be useful for checking the process flow. With Nextflow you need to use small datasets; the upside is that using test datasets can help catch errors early on. We note that the stub feature was recently introduced into Nextflow, which can help in testing workflows and examining their flow.
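A Snakemake dry run is invoked with `snakemake -n` (or `--dry-run`). Nextflow's stub feature lets a process declare a cheap placeholder command that is used when the pipeline is launched with `-stub-run`; a sketch (process name and commands invented for illustration):

```groovy
process ASSEMBLE {
    output:
    path "assembly.fasta"

    script:
    """
    long_running_assembler --out assembly.fasta
    """

    // Used instead of the script block when run with: nextflow run main.nf -stub-run
    stub:
    """
    touch assembly.fasta
    """
}
```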
Documentation is extensive and clear for both; each can be installed using Conda with a single command, and both have easy-to-follow quick start guides.
Whilst the user community for Nextflow is only marginally bigger than Snakemake's (going by GitHub repo stats), nf-core, an active community project that collates curated Nextflow pipelines, is of particular interest to us for the future.
A main feature of both is portability: allowing scientists to reproduce analyses on different computing environments through virtual environments, container technology and cloud services. Snakemake supports Docker and Singularity containers, Conda environments and some Amazon Web Services (AWS) features. Nextflow supports all of those as well as additional technologies including Podman, Charliecloud, Shifter and more. Nextflow also has specific documentation for AWS Batch. Overall, Nextflow gives us more choice when it comes to third-party software and integrations.
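By way of illustration, containerising a run is a small change in either tool (the image names below are placeholders, not real images):

```groovy
// nextflow.config: run every process inside a Docker container
docker.enabled = true
process.container = 'quay.io/example/tools:1.0'
```

```python
# Snakemake: attach a container per rule, enabled with --use-singularity
rule qc:
    output:
        "qc_report.txt"
    container:
        "docker://quay.io/example/tools:1.0"
    shell:
        "run_qc > {output}"
```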
Nextflow and Snakemake both seem like solid choices for developing scientific workflows. There may be cases where working with Snakemake is preferable, and we may use it on occasion, but ultimately it made sense for our group to choose one workflow management system for the bulk of our work, and we have gone with Nextflow.