Materials for the shared FAANG workshop taking place on the 26th of February 2020 in the Wellcome Genome Campus, Hinxton, Cambridge, UK. "nf-core: A community-driven collection of omics portable pipelines"
Material adapted from the nf-core tutorial
Duration: 1hr 45
- Abstract
- Introduction
- Installing the nf-core helper tools
- Listing available nf-core pipelines
- Running nf-core pipelines
I think we can introduce at the end that you can either create or contribute to nf-core pipelines, but not go through this part of the tutorial.
The nf-core community provides a range of tools to help new users get to grips with Nextflow - both by providing complete pipelines that can be used out of the box, and by helping developers follow best practices. Companion tools can create a bare-bones pipeline from a template, scattered with TODO pointers, and continuous integration with linting tools checks code quality. Guidelines and documentation help get Nextflow newbies on their feet in no time. Best of all, the nf-core community is always on hand to help.
In this tutorial we discuss the best-practice guidelines developed by the nf-core community, explain why they're important, and share tips and tricks for budding Nextflow pipeline developers. ✨
nf-core is a community-led project to develop a set of best-practice pipelines built using Nextflow. Pipelines are governed by a set of guidelines, enforced by community code reviews and automatic linting (code testing). A suite of helper tools aims to help people run and develop pipelines.
This tutorial attempts to give an overview of how nf-core works: how to run nf-core pipelines, how to make new pipelines using the nf-core template and how nf-core pipelines are reviewed and ultimately released.
The beauty of nf-core is that there is lots of help on offer! The main place for this is Slack - an instant messaging service. The nf-core Slack organisation has channels dedicated to each pipeline, as well as to specific topics (e.g. `#new-pipelines`, `#tools` and `#aws`).
The nf-core Slack can be found at https://nfcore.slack.com (NB: no hyphen in `nfcore`!).
To join you will need an invite, which you can get at https://nf-co.re/join/slack.
One additional tool which this author swears by is TLDR - it gives concise command-line reference through example commands for most Linux tools, including `nextflow`, `docker`, `singularity`, `conda`, `git` and more.
There are many clients, but raylee/tldr is arguably the simplest - just a single bash script.
Much of this tutorial will make use of the `nf-core` command line tool. This has been developed to provide a range of additional functionality for the project, such as pipeline creation, testing and more.
The `nf-core` tool is written in Python and is available from the Python Package Index and Bioconda. You can install it from PyPI as follows:
pip install nf-core
If using Conda, first set up the Bioconda channels as described in the Bioconda docs, then install nf-core:
conda install nf-core
The nf-core/tools source code is available at https://github.com/nf-core/tools - if you prefer, you can clone this repository and install the code locally:
git clone https://github.com/nf-core/tools.git nf-core-tools
cd nf-core-tools
python setup.py install
Once installed, you can check that everything is working by printing the help:
nf-core --help
- Install nf-core/tools
- Use the help flag to list the available commands
As you saw from the `--help` output, the tool has a range of subcommands. The simplest is `nf-core list`, which lists all available nf-core pipelines. The output shows the latest version number and when it was released. If the pipeline has been pulled locally using Nextflow, it tells you when that was and whether you have the latest version.
If you supply additional keywords after the command, the listed pipelines will be filtered. Note that this searches more than just the displayed output, including keywords and description text. The `--sort` flag allows you to sort the list (the default is by most recently released) and `--json` gives JSON output for programmatic use.
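For example (the keyword and sort field below are just illustrations; see `nf-core list --help` for the full set of options):
nf-core list rna                       # filter pipelines matching the keyword "rna"
nf-core list --sort stars              # sort by GitHub stars
nf-core list --json > pipelines.json   # save JSON output for programmatic use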
- Use the help flag to print the list command usage
- List all pipelines
- Sort pipelines alphabetically, then by popularity (stars)
- Fetch one of the pipelines using `nextflow pull`
- Use `nf-core list` to see if the pipeline you pulled is up to date
- Filter pipelines for those that work with RNA
- Save these pipeline details to a JSON file
In order to run nf-core pipelines, you will need to have Nextflow installed (https://www.nextflow.io). The only other requirement is a software packaging tool: Conda, Docker or Singularity. In theory it is possible to run the pipelines with software installed by other methods (e.g. environment modules, or manual installation), but this is not recommended. Most people find either Docker or Singularity the best options.
Unless you are actively developing pipeline code, we recommend using Nextflow's built-in functionality to fetch nf-core pipelines. Nextflow will automatically fetch the pipeline code when you run `nextflow run nf-core/PIPELINE`. For the best reproducibility, it is good to explicitly reference the pipeline version number that you wish to use with the `-revision`/`-r` flag. For example:
nextflow run nf-core/rnaseq -revision 1.3
If not specified, Nextflow will fetch the `master` branch - for nf-core pipelines this will be the latest release. If you would like to run the latest development code, use `-r dev`.
Note that once pulled, Nextflow will use the locally cached version for subsequent runs. Use the `-latest` flag when running the pipeline to always fetch the latest version. Alternatively, you can force Nextflow to pull a pipeline again using the `nextflow pull` command:
nextflow pull nf-core/rnaseq
You can find general documentation and instructions for Nextflow and nf-core on the nf-core website: https://nf-co.re/. Pipeline-specific documentation is bundled with each pipeline in the `/docs` folder. This can be read either locally, on GitHub, or on the nf-core website. Each pipeline has its own webpage at https://nf-co.re/PIPELINE.
In addition to this documentation, each pipeline comes with basic command line reference. This can be seen by running the pipeline with the `--help` flag, for example:
nextflow run nf-core/rnaseq --help
Nextflow can load pipeline configurations from multiple locations. To make it easy to apply a group of options on the command line, Nextflow uses the concept of config profiles. nf-core pipelines load configuration in the following order:
- Pipeline: Default 'base' config
  - Always loaded. Contains pipeline-specific parameters and "sensible defaults" for things like computational requirements.
  - Does not specify any method for software packaging. If nothing else is specified, Nextflow will expect all software to be available on the command line.
- Pipeline: Core config profiles
  - All nf-core pipelines come with some generic config profiles. The most commonly used ones are for software packaging: `docker`, `singularity` and `conda`.
  - Other core profiles are `awsbatch`, `debug` and `test`.
- nf-core/configs: Server profiles
  - At run time, nf-core pipelines fetch configuration profiles from the configs remote repository. The profiles here are specific to clusters at different institutions.
  - Because this is loaded at run time, anyone can add a profile here for their system and it will be immediately available for all nf-core pipelines.
- Local config files given to Nextflow with the `-c` flag (see the sketch below)
- Command line configuration
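As a minimal sketch, a local config file passed with `-c` might look like this (the executor and queue names are purely illustrative):
// my_cluster.config - hypothetical local configuration file
process {
  executor = 'slurm'
  queue = 'short'
}
It would then be supplied at run time alongside a profile:
nextflow run nf-core/rnaseq -profile singularity -c my_cluster.config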
Multiple comma-separated config profiles can be specified in one go, so the following commands are perfectly valid:
nextflow run nf-core/rnaseq -profile test,docker
nextflow run nf-core/hlatyping -profile singularity,debug
Note that the order in which config profiles are specified matters. Their priority increases from left to right.
The `test` config profile is a bit of a special case. Whereas all other config profiles tell Nextflow how to run on different computational systems, the `test` profile configures each nf-core pipeline to run without any other command line flags. It specifies URLs for test data and all required parameters. Because of this, you can test any nf-core pipeline with the following command:
nextflow run nf-core/PIPELINE -profile test
Note that you will typically still need to combine this with a configuration profile for your system - e.g. `-profile test,docker`. Running with the test profile is a great way to confirm that you have Nextflow configured properly for your system before attempting to run with real data.
Most nf-core pipelines have a number of flags that need to be passed on the command line: some mandatory, some optional. To make it easier to launch pipelines, these parameters are described in a JSON file bundled with the pipeline. The `nf-core launch` command uses this to build an interactive command-line wizard which walks through the different options, with descriptions of each, showing the default value and prompting for values.
NOTE: This is an experimental feature - the JSON file and rich parameter descriptions are not yet available for all pipelines.
Once all prompts have been answered, non-default values are saved to a `params.json` file which can be supplied to Nextflow to run the pipeline. Optionally, the Nextflow command can be launched there and then.
To use the launch feature, just specify the pipeline name:
nf-core launch <PIPELINE>
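A sketch of the overall flow, using the rnaseq pipeline as an example (`-params-file` is standard Nextflow functionality for supplying parameters from a file):
nf-core launch rnaseq
nextflow run nf-core/rnaseq -params-file params.json -profile docker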
TODO: Consider removing this section.
Many of the techniques and resources described above require an active internet connection at run time - pipeline files, configuration profiles and software containers are all dynamically fetched when the pipeline is launched. This can be a problem for people using secure computing resources that do not have connections to the internet.
To help with this, the `nf-core download` command automates the fetching of required files for running nf-core pipelines offline. The command can download a specific release of a pipeline with `-r`/`--release` and fetch the Singularity container if `--singularity` is passed (this needs Singularity to be installed). All files are saved to a single directory, ready to be transferred to the cluster where the pipeline will be executed.
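For example, a command along these lines downloads a release together with its Singularity image (the pipeline name and version are illustrative; check `nf-core download --help` for the exact options in your version of the tools):
nf-core download rnaseq --release 1.3 --singularity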
- Install required dependencies (Nextflow, Docker)
- Print the command-line usage instructions for the nf-core/rnaseq pipeline
- In a new directory, run the nf-core/rnaseq pipeline with the provided test data
- Try launching the RNA pipeline using the `nf-core launch` command
- Download the nf-core/rnaseq pipeline for offline use using the `nf-core download` command
The heart of nf-core is the standardisation of pipeline code structure. To achieve this, all pipelines adhere to a generalised pipeline template. The best way to build an nf-core pipeline is to start by using this template via the `nf-core create` command. This launches an interactive prompt on the command line which asks for things such as the pipeline name, a short description and the author's name. These values are then propagated throughout the template files automatically.
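For example, running the command and answering prompts along these lines (the answers are purely illustrative):
nf-core create
# Workflow Name: mypipeline
# Description: A hypothetical example pipeline
# Author: A. Researcher
This creates a new directory (typically named nf-core-mypipeline) containing the template files, with the values you entered filled in throughout.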
Not everything can be completed with a template, and all new pipelines will need to edit and add to the resulting pipeline files in a similar set of locations. To make it easier to find these, the nf-core template files have numerous comment lines beginning with `TODO nf-core:`, followed by a description of what should be changed or added. These comment lines can be deleted once the required change has been made.
Most code editors have tools to automatically discover such `TODO` lines, and the `nf-core lint` command will flag them. This makes it simple to systematically work through the new pipeline, editing all files where required.
The only hard requirement for all nf-core pipelines is that software must be available in Docker images. However, it is recommended that pipelines use the following methodology where possible:
- Software requirements are defined for Conda in
environment.yml
- Docker images are automatically built on Docker Hub, using Conda
- Singularity images are generated from Docker Hub at run time for end users
This approach has the following merits:
- A single file contains a list of all required software, making it easy to maintain
- Identical (or as close as is possible) software is available for users using Conda, Docker or Singularity
- Having a single container image for the pipeline uses disk space efficiently for Singularity images, and is simple to manage and transfer.
The reason that the above approach is not a hard requirement is that some issues can prevent it from working, such as:
- It may not be possible to package software on conda due to software licensing limitations
- Different packages may have dependency conflicts which are impossible to resolve
Alternative approaches are then decided upon on a case-by-case basis. We encourage you to discuss this on Slack early on as we have been able to resolve some such issues in the past.
The nf-core template will create a simple `environment.yml` file for you with an environment name, Conda channels and one or two dependencies. You can then add additional required software to this file. Note that all software packages must have a specific version number pinned - the format is a single equals sign, e.g. `package=version`.
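A minimal sketch of such a file (the environment name, channels and pinned versions are illustrative only):
# environment.yml - hypothetical example; every package is pinned to an exact version
name: nf-core-mypipeline-1.0dev
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - fastqc=0.11.8
  - multiqc=1.7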
Where software packages are not already available on Bioconda or Conda-forge, we encourage developers to add them. This benefits the wider community, as well as just users of the nf-core pipeline.
You can use Docker for testing by building the image locally. The pipeline expects a container with a specific name, so you must tag the Docker image with this. You can build and tag an image in a single step with the following command:
docker build -t nfcore/PIPELINE:dev .
Note that it is `nfcore` without a hyphen (Docker Hub doesn't allow any punctuation). The `.` refers to the current working directory - if run in the root pipeline folder, this tells Docker to use the `Dockerfile` recipe found there.
All nf-core pipelines use GitHub as their code repository, and git as their version control system. For newcomers to this world, it is helpful to know some of the basic terminology used:
- A repository contains everything for a given project
- Commits are code checkpoints.
- A branch is a linear string of commits - multiple parallel branches can be created in a repository
- Commits from one branch can be merged into another
- Repositories can be forked from one GitHub user to another
- Branches from different forks can be merged via a Pull Request (PR) on github.com
Typically, people will start developing a new pipeline under their own personal account on GitHub. When it is ready for its first release and has been discussed on Slack, this repository is forked to the nf-core organisation. All developers then maintain their own forks of this repository, contributing new code back to the nf-core fork via pull requests.
All nf-core pipelines must have the following three branches:
- `master` - commits from stable releases only. Should always have code from the most recent release.
- `dev` - current development code. Merged into `master` for releases.
- `TEMPLATE` - used for template automation by the @nf-core-bot GitHub account. Should only contain commits with unmodified template code.
Pull requests to the nf-core fork have a number of automated steps that must pass before the PR can be merged. A few points to remember are:
- The pipeline `CHANGELOG.md` must be updated
- PRs must not be against the `master` branch (typically you want `dev`)
- PRs should be reviewed by someone else before being merged
When you fork your pipeline repository to the nf-core organisation, one of the core team will set up Travis CI (automated testing) and Docker Hub (automated Docker image creation) for you. However, it can be helpful to set these up on your personal fork as well. That way, you can be confident that everything will work when you fork or open a PR on the nf-core organisation.
Both services are free to use. To set them up, visit https://travis-ci.com and https://hub.docker.com and link your personal GitHub repository.
- Make a new pipeline using the template
- Update the README file to fill in the `TODO` statements
- Add a new process to the pipeline in `main.nf`
- Add the new software dependencies from this process to `environment.yml`
Manually checking that a pipeline adheres to all nf-core guidelines and requirements is a difficult job. Wherever possible, we automate such code checks with a code linter. This runs through a series of tests and reports failures, warnings and passed tests.
The linting code is closely tied to the nf-core template and both change over time. When we change something in the template, we often add a test to the linter to make sure that pipelines do not use the old method.
Each lint test has a number and is documented on the nf-core website. When warnings and failures are reported on the command line, a short description is printed along with a link to the documentation for that specific test on the website.
Code linting is run automatically every time you push commits to GitHub, open a pull request or make a release. You can also run these tests yourself locally with the following command:
nf-core lint /path/to/pipeline
When merging PRs from `dev` to `master`, the `lint` command will be run with the `--release` flag, which includes a few additional tests.
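You can run these extra checks locally with something along these lines (assuming your version of nf-core/tools supports the flag):
nf-core lint --release /path/to/pipeline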
When adding a new pipeline, you must also set up the `test` config profile. To do this, we use the nf-core/test-datasets repository. Each pipeline has its own branch on this repository, meaning that the data can be cloned without having to fetch all test data for all pipelines:
git clone --single-branch --branch PIPELINE https://github.com/nf-core/test-datasets.git
To set up the test profile, make a new branch on the `nf-core/test-datasets` repo through the web page (see instructions). Fork the repository to your user and open a PR to your new branch with a really (really!) tiny dataset. Once merged, set up the `conf/test.config` file in your pipeline to refer to the URLs for your test data.
These test datasets are used by the automated continuous integration tests. The systems that run these tests are extremely limited in the resources that they have available. Typically, the pipeline should be able to complete in around 10 minutes and use no more than 6-7 GB memory. To achieve this, input files and reference genomes need to be very tiny. If possible, a good approach can be to use PhiX or Yeast as a reference genome. Alternatively, a single small chromosome (or part of a chromosome) can be used. If you are struggling to get the tests to run, ask for help on Slack.
When writing `conf/test.config`, remember to define all required parameters so that the pipeline will run with only `-profile test`. Note that remote URLs cannot be traversed like a regular file system - so glob file expansions such as `*.fa` will not work.
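A sketch of what `conf/test.config` might contain (the parameter names and URLs are illustrative - use the parameters your pipeline actually defines and the branch name of your test data):
// conf/test.config - hypothetical example
params {
  config_profile_name = 'Test profile'
  config_profile_description = 'Minimal test dataset to check pipeline function'
  // Keep resources small enough for the CI machines
  max_cpus = 2
  max_memory = 6.GB
  max_time = 48.h
  // Explicit URLs to tiny test files - no globbing over remote paths
  input = 'https://raw.githubusercontent.com/nf-core/test-datasets/PIPELINE/testdata/sample_1.fastq.gz'
  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/PIPELINE/reference/genome.fa'
}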
The automated tests with Travis CI are configured in the `.travis.yml` file that is generated by the template. The `script` block defines three tests: linting the code with `nf-core lint`, linting the syntax of all Markdown documentation, and running the pipeline with the test data.
The `env` section sets the `NXF_VER` environment variable twice. This tells Travis to run the tests twice in parallel - once with the latest version of Nextflow (`NXF_VER=''`) and once with the minimum version supported by the pipeline. Do not edit this version number manually - it appears in multiple locations through the pipeline code, so it's better to use `nf-core bump-version --nextflow` instead.
The provided tests may be sufficient for your pipeline. However, if it is possible to run the pipeline with significantly different options (for example, different alignment tools), then it is good to test all of these. You can do this by adding additional commands to the `script` block.
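For example, one might append an extra run that exercises an alternative tool (here `--aligner hisat2` is a hypothetical pipeline parameter; the existing entries generated by the template will differ in detail):
script:
  # ...existing lint and test commands from the template...
  - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --aligner hisat2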
- Run `nf-core lint` on your pipeline and make a note of any test warnings / failures
- Read up on one or two of the linting rules on the nf-core website and see if you can fix some
- Take a look at `conf/test.config` and switch the test data for another dataset on nf-core/test-datasets
Your pipeline is written and ready to go! Before you can release it with nf-core there are a few steps that need to be done. First, tell everyone about it on Slack in the `#new-pipelines` channel. Hopefully you've already done this before you spent lots of time on your pipeline, to check that there aren't other similar efforts happening elsewhere. Next, you need to be a member of the nf-core GitHub organisation. You can find instructions for how to do this at https://nf-co.re/join.
Once you're ready to go, you can fork your repository to nf-core. A lot of stuff happens automatically when you do this: the website will update itself to include your new pipeline, complete with rendered documentation pages and usage statistics. Your pipeline will also appear in the `nf-core list` command output and in various other locations.
Unfortunately, at the time of writing, Travis CI, Docker Hub and Zenodo (automated DOI assignment for releases) services are not created automatically. These can only be set up by nf-core administrators, so please ask someone to do this for you on Slack.
Once everything is set up and all tests are passing on the `dev` branch, let us know on Slack and we will do a large community review. This is a one-off process that is done before the first release for all pipelines. In order to give a nice interface for reviewing all pipeline code, we create a "pseudo pull request" comparing `dev` against the first commit in the pipeline (hopefully the template creation). This PR will never be merged, but it provides the GitHub review pages where people can comment on specific lines of the code.
These first community reviews can take quite a long time and typically result in a lot of comments and suggestions (nf-core/deepvariant famously had 156 comments before it was approved). Try not to be intimidated - this is the main step where the community attempts to standardise and suggest improvements for your code. Your pipeline will come out the other side stronger than ever!
Once the pseudo-PR is approved, you're ready to make the release. To do this, first bump the pipeline version to a stable tag using `nf-core bump-version`, then open a pull request from the `dev` branch to `master`. Once tests are passing and two nf-core members have approved this PR, it can be merged to `master`. Then a GitHub release is made, using the contents of the changelog as a description.
Pipeline version numbers (release tags) should be numerical only, using semantic versioning. For example, with a release version `1.4.3`, bumping the `1` would correspond to a major release where results would no longer be backwards compatible. Changing the `4` would be a minor release, for example adding some new features. Changing the `3` would be a patch release for minor things such as fixing bugs.
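As a sketch, run from the pipeline root directory (the version numbers are illustrative; check `nf-core bump-version --help` for the exact syntax in your version of the tools):
nf-core bump-version 1.0
nf-core bump-version --nextflow 19.10.0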
Over time, new versions of nf-core/tools will be released with changes to the template. In order to keep all nf-core pipelines in sync, we have developed an automated synchronisation procedure. On each new tools release, a GitHub bot account, @nf-core-bot, runs `nf-core create` with the new template, using the input values you used for your pipeline. This is committed to the `TEMPLATE` branch and a pull request is created to incorporate these changes into `dev`.
Note that these PRs can sometimes create git merge conflicts which will need to be resolved manually. There are plugins for most code editors to help with this process. Once resolved and checked this PR can be merged and a new pipeline release created.
- Use `nf-core bump-version` to update the required version of Nextflow in your pipeline
- Bump your pipeline's version to 1.0, ready for its first release!
- Make sure that you're signed up to the nf-core Slack (get an invite on nf-co.re) and drop us a line about your latest and greatest pipeline plans!
- Ask to be a member of the nf-core GitHub organisation by commenting on this GitHub issue
- If you're a Twitter user, make sure to follow the @nf_core account
I hope that this nf-core tutorial has been helpful! Remember that there is more in-depth documentation on many of these topics available on the nf-core website. If in doubt, please ask for help on Slack.
If you have any suggestions for how to improve this tutorial, or spot any mistakes, please create an issue or pull request on the nf-core/nf-co.re repository.