Improve fusion docs (#5166) [ci skip]

Signed-off-by: Paolo Di Tommaso <[email protected]>
Co-authored-by: Jordi Deu-Pons <[email protected]>
Co-authored-by: Adam Talbot <[email protected]>
3 people committed Sep 4, 2024
1 parent 1daf22a commit fb2ff10
Showing 4 changed files with 205 additions and 85 deletions.
11 changes: 11 additions & 0 deletions docker-scratch/Dockerfile
@@ -0,0 +1,11 @@
FROM amazoncorretto:17-al2023
COPY .nextflow /.nextflow
COPY nextflow /usr/bin/nextflow
ENV NXF_HOME=/.nextflow
RUN nextflow info
RUN NXF_PLUGINS_DEFAULT=false nextflow plugin install nf-tower,nf-wave,nf-cloudcache,nf-azure,nf-google,nf-amazon,xpack-amzn,xpack-google

#FROM scratch
#COPY --from=0 /.nextflow /.nextflow
#COPY --from=0 /usr/bin/nextflow /usr/bin/nextflow

58 changes: 58 additions & 0 deletions docker-scratch/make.sh
@@ -0,0 +1,58 @@
#
# Copyright 2013-2024, Seqera Labs
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
#

# cleanup
rm -rf .nextflow && mkdir .nextflow
# build nextflow and stage its dependencies
(cd ..
./gradlew compile assemble
BUILD_PACK=1 ./gradlew installScratch publishToMavenLocal -Dmaven.repo.local=${PWD}/docker-scratch/.nextflow/capsule/deps/
)

# copy nextflow launcher script
cp ../nextflow . && chmod +x nextflow
cp ../modules/nextflow/src/main/resources/META-INF/build-info.properties .
source build-info.properties

if [ -z "$version" ]; then
echo "Error: version is empty or missing"; exit 1
fi
if [ -z "$build" ]; then
echo "Error: build is empty or missing"; exit 1
fi
if [ -z "$commitId" ]; then
echo "Error: commitId is empty or missing"; exit 1
fi


TAG=${version}-${commitId}
repository=${repository:-'docker.io/pditommaso/nf-launcher-dev'}
image=${repository}:${TAG}
base=${base:-'docker.io/pditommaso/nf-lancher:j17-base'}

docker buildx build \
--no-cache \
--platform linux/amd64 \
--output=type=docker \
--progress=plain \
--tag ${image} \
--build-arg TARGETPLATFORM=linux/amd64 \
.

#launcher=$(wave -i ${base} --include ${image} --config-env NXF_HOME=/.nextflow)
#
#echo $launcher
216 changes: 131 additions & 85 deletions docs/fusion.md
@@ -13,15 +13,95 @@ Support for Google Cloud Storage.

Fusion is a distributed virtual file system for cloud-native data pipelines, optimised for Nextflow workloads.

It bridges the gap between cloud-native storage and data analysis workflow by implementing a thin client that allows any existing application to access object storage using the standard POSIX interface, thus simplifying and speeding up most operations. Currently it supports AWS S3 and Google Cloud Storage.
It bridges the gap between cloud-native storage and data analysis workflows by implementing a thin client that allows any existing application to access object storage using the standard POSIX interface, thus simplifying and speeding up most operations.
Currently, it supports AWS S3, Google Cloud Storage and Azure Blob containers.

## Getting started

The Fusion file system implements a lazy download and upload algorithm that runs in the background to transfer files in
parallel to and from object storage into a container-local temporary folder. This means that the speed of the disk
volume used to carry out your computation is key to achieving maximum performance.

By default Fusion uses the container `/tmp` directory as a temporary cache, so the size of the volume can be much lower
than the actual needs of your pipeline processes. Fusion has a built-in garbage collector that constantly monitors remaining
disk space on the temporary folder and immediately evicts old cached entries when necessary.
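For example, when running with Docker on a host that has a larger scratch disk, a minimal sketch along the following lines (the `/scratch` path is an assumption) mounts that disk as the container `/tmp` so the Fusion cache has more room to work with:

```groovy
// Sketch: mount a larger host disk (assumed here to be /scratch) as the container /tmp,
// the directory Fusion uses as its local cache by default
docker.runOptions = '-v /scratch:/tmp'
```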

### Requirements

Fusion file system is designed to work with containerised workloads, therefore it requires the use of a container engine such as Docker or a container native platform for the execution of your pipeline e.g. AWS Batch or Kubernetes. It also requires the use of {ref}`Wave containers<wave-page>`.
Fusion file system is designed to work with containerised workloads, therefore it requires the use of a container engine
such as Docker or a container-native platform for the execution of your pipeline, e.g. AWS Batch or Kubernetes. It also requires
the use of {ref}`Wave containers<wave-page>`.
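As a minimal starting point, the following sketch (the same settings are detailed in the platform-specific sections below) enables a container engine together with Wave and Fusion:

```groovy
// Minimal baseline: a container engine plus Wave and Fusion
docker.enabled = true
wave.enabled = true
fusion.enabled = true
```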

### Azure Cloud

Fusion provides built-in support for [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs/)
when running in Azure Cloud.

Support for Azure does not require any specific setting other than enabling Wave and Fusion in your Nextflow
configuration. For example:

```groovy
fusion.enabled = true
wave.enabled = true
process.executor = 'azure-batch'
tower.accessToken = '<your platform access token>'
```

Then run your pipeline using the usual command:

```bash
nextflow run <your pipeline> -work-dir az://<your blob container>/scratch
```

Azure machines come with fast SSDs attached, so no additional storage configuration is required. However, it is
recommended to use machine types with larger data disks attached, denoted by the suffix `d` after the core number
(e.g. `Standard_E32*d*_v5`). These increase the throughput of Fusion and reduce the chance of overloading the machine.

### AWS Cloud

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the AWS Batch executor.
Using Fusion removes the need to create and configure a custom AMI that includes the `aws` command
line tool when setting up the AWS Batch compute environment.

The configuration for this deployment scenario looks like the following:

```groovy
fusion.enabled = true
wave.enabled = true
process.executor = 'awsbatch'
process.queue = '<YOUR BATCH QUEUE>'
aws.region = '<YOUR AWS REGION>'
tower.accessToken = '<your platform access token>'
```

Then you can run your pipeline using the following command:

```bash
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
```

For best performance make sure to use instance types that provide an NVMe disk as [instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html).
If you are creating the AWS Batch compute environment yourself, you will need to make sure the NVMe disk is properly formatted (see below).


#### NVMe storage

The recommended setup to get maximum performance is to mount an NVMe disk as the temporary folder and run the pipeline with
the {ref}`scratch <process-scratch>` directive set to `false`, to also avoid stage-out transfer time.

Example configuration for using AWS Batch with [NVMe disks](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html) to maximize performance:

```groovy
aws.batch.volumes = '/path/to/ec2/nvme:/tmp'
process.scratch = false
```
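If you provision the compute environment yourself, the NVMe instance storage must be formatted and mounted at the host path referenced above before jobs start. A hedged sketch of the instance user data follows, where the device name `/dev/nvme1n1` and the mount point are assumptions to adjust for your instance type:

```bash
#!/bin/bash
# Sketch: format and mount the first instance-store NVMe device
# (the device name and mount point below are assumptions, adjust them for your setup)
mkfs -t ext4 /dev/nvme1n1
mkdir -p /path/to/ec2/nvme
mount /dev/nvme1n1 /path/to/ec2/nvme
```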

:::{tip}
Seqera Platform is able to automatically format and configure the NVMe instance storage by enabling
the option "Use Fast storage" when creating the Batch compute environment.
:::

### AWS S3 configuration
#### AWS IAM permissions

The AWS S3 bucket should be configured with the following IAM permissions:

@@ -54,75 +134,30 @@ The AWS S3 bucket should be configured with the following IAM permissions:
}
```

## Use cases
### Google Cloud

### Local execution with S3 bucket as work directory
Fusion provides built-in support for [Google Cloud Storage](https://cloud.google.com/storage?hl=en)
when running in Google Cloud.

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Nextflow local executor. This configuration requires the use of Docker (or similar container engine) for the execution of your pipeline tasks.

The AWS S3 bucket credentials should be made accessible via standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.
Support for Google Cloud does not require any specific setting other than enabling Wave and Fusion in your Nextflow
configuration. For example:

The following configuration should be added in your Nextflow configuration file:

```groovy
docker {
enabled = true
}
fusion {
enabled = true
exportStorageCredentials = true
}
wave {
enabled = true
}
```

Then you can run your pipeline using the following command:

```bash
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
fusion.enabled = true
wave.enabled = true
process.executor = 'google-batch'
tower.accessToken = '<your platform access token>'
```

Replace `<YOUR PIPELINE>` and `<YOUR BUCKET>` with a pipeline script and bucket of your choice, for example:
Then run your pipeline using the usual command:

```bash
nextflow run https://github.com/nextflow-io/rnaseq-nf -work-dir s3://nextflow-ci/scratch
```

### AWS Batch execution with S3 bucket as work directory

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the AWS Batch executor. The use of Fusion makes obsolete the need to create and configure a custom AMI that includes the `aws` command line tool, when setting up the AWS Batch compute environment.

The configuration for this deployment scenario looks like the following:

```groovy
fusion {
enabled = true
}
wave {
enabled = true
}
process {
executor = 'awsbatch'
queue = '<YOUR BATCH QUEUE>'
}
aws {
region = '<YOUR AWS REGION>'
}
nextflow run <your pipeline> -work-dir gs://<your google bucket>/scratch
```

Then you can run your pipeline using the following command:
When using Fusion, if `process.disk` is not set, Nextflow will attach a single local SSD disk to the machine. The size of this disk can be much lower than the actual needs of your pipeline processes because Fusion uses it only as a temporary cache. Fusion is also compatible with other types of `process.disk`, but better performance is achieved when using local SSD disks.
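If you need to size the disk explicitly, a minimal sketch is the following (the size shown is only an example, adjust it to your pipeline's needs):

```groovy
// Sketch: explicitly request scratch disk space for each task
process.disk = '500 GB'
```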

```bash
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
```

### Kubernetes execution with S3 bucket as work directory
### Kubernetes

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Kubernetes executor.

@@ -131,23 +166,13 @@ The use of Fusion makes obsolete the need to create and manage a separate pers
The configuration for this deployment scenario looks like the following:

```groovy
wave {
enabled = true
}
fusion {
enabled = true
}
process {
executor = 'k8s'
}
k8s {
context = '<YOUR K8S CONFIGURATION CONTEXT>'
namespace = '<YOUR K8S NAMESPACE>'
serviceAccount = '<YOUR K8S SERVICE ACCOUNT>'
}
fusion.enabled = true
wave.enabled = true
process.executor = 'k8s'
k8s.context = '<YOUR K8S CONFIGURATION CONTEXT>'
k8s.namespace = '<YOUR K8S NAMESPACE>'
k8s.serviceAccount = '<YOUR K8S SERVICE ACCOUNT>'
tower.accessToken = '<your platform access token>'
```

The `k8s.context` represents the Kubernetes configuration context to be used for the pipeline execution. This setting can be omitted if Nextflow itself is run as a pod in the Kubernetes clusters.
@@ -162,21 +187,42 @@ Having the above configuration in place, you can run your pipeline using the following command:
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
```

## NVMe storage
:::{note}
You can also use Fusion and Kubernetes with Azure Blob Storage and Google Storage using the same deployment approach.
:::
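For reference, the same Kubernetes configuration can be pointed at a different object store simply by changing the work directory, for example:

```bash
nextflow run <YOUR PIPELINE> -work-dir az://<YOUR BLOB CONTAINER>/scratch
nextflow run <YOUR PIPELINE> -work-dir gs://<YOUR BUCKET>/scratch
```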

The Fusion file system implements a lazy download and upload algorithm that runs in the background to transfer files in parallel to and from object storage into a container-local temporary folder. This means that the performance of the temporary folder inside the container (`/tmp` in a default setup) is key to achieving maximum performance.
### Local execution

The temporary folder is used only as a temporary cache, so the size of the volume can be much lower than the actual needs of your pipeline processes. Fusion has a built-in garbage collector that constantly monitors remaining disk space on the temporary folder and immediately evicts old cached entries when necessary.
Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Nextflow local executor. This configuration requires the use of Docker (or similar container engine) for the execution of your pipeline tasks.

The recommended setup to get maximum performance is to mount a NVMe disk as the temporary folder and run the pipeline with the {ref}`scratch <process-scratch>` directive set to `false` to also avoid stage-out transfer time.
The AWS S3 bucket credentials should be made accessible via standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.
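For example, the credentials can be exported in the shell before launching the run (the values shown are placeholders):

```bash
# Placeholders: substitute your own AWS credentials
export AWS_ACCESS_KEY_ID=<your access key id>
export AWS_SECRET_ACCESS_KEY=<your secret access key>
```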

Example configuration for using AWS Batch with [NVMe disks](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html) to maximize performance:
The following configuration should be added in your Nextflow configuration file:

```groovy
aws.batch.volumes = '/path/to/ec2/nvme:/tmp'
process.scratch = false
docker.enabled = true
fusion.enabled = true
fusion.exportStorageCredentials = true
wave.enabled = true
```

Then you can run your pipeline using the following command:

```bash
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
```

Replace `<YOUR PIPELINE>` and `<YOUR BUCKET>` with a pipeline script and bucket of your choice, for example:

```bash
nextflow run https://github.com/nextflow-io/rnaseq-nf -work-dir s3://nextflow-ci/scratch
```

:::{warning}
The option `fusion.exportStorageCredentials` leaks the AWS credentials into the task launcher script created by Nextflow.
This option should only be used for development purposes.
:::

## Advanced settings

Fusion advanced configuration settings are described in the {ref}`Fusion <config-fusion>` section on the Nextflow configuration page.
5 changes: 5 additions & 0 deletions packing.gradle
@@ -166,6 +166,11 @@ task installLauncher(type: Copy, dependsOn: ['pack']) {
into "$homeDir/.nextflow/framework/$version/"
}

task installScratch(type: Copy, dependsOn: ['pack']) {
from "$releaseDir/nextflow-$version-one.jar"
into "${rootProject.projectDir}/docker-scratch/.nextflow/framework/$version/"
}

/*
* build, tag and publish a new docker-packaged nextflow release
*/
