Improve fusion docs #5166

Merged (9 commits, Aug 1, 2024)
docs/fusion.md: 205 changes (137 additions, 68 deletions)

Fusion is a distributed virtual file system for cloud-native data pipelines, optimised for Nextflow workloads.

It bridges the gap between cloud-native storage and data analysis workflows by implementing a thin client that allows any existing application to access object storage using the standard POSIX interface, thus simplifying and speeding up most operations.
Currently it supports AWS S3, Google Cloud Storage, and Azure Blob Storage containers.

## Getting started

### Requirements

Fusion file system is designed to work with containerised workloads, so it requires a container engine such as Docker,
or a container-native platform such as AWS Batch or Kubernetes, to execute your pipeline. It also requires
the use of {ref}`Wave containers<wave-page>`.
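
In practice, assuming Docker as the container engine, these requirements boil down to a few configuration flags; the following is a minimal sketch, with executor-specific options covered in the sections below:

```groovy
// Minimal sketch: a container engine plus Wave and Fusion enabled.
docker.enabled = true    // or rely on a container-native executor such as AWS Batch or Kubernetes
wave.enabled   = true    // Fusion relies on Wave containers
fusion.enabled = true
```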

### Azure Cloud

Fusion provides built-in support for [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs/)
when running in Azure Cloud.

Support for Azure does not require any specific settings other than enabling Wave and Fusion in your Nextflow
configuration. For example:

```groovy
process.executor = 'azure-batch'
wave.enabled = true
fusion.enabled = true
tower.accessToken = '<your platform access token>'
```

Then run your pipeline using the usual command:

```bash
nextflow run <your pipeline> -work-dir az://<your blob container>/scratch
```

[TODO add some notes on Azure Blob permission]

Azure virtual machines come with fast SSDs attached, so no additional storage configuration is required. However, it is recommended to use machine types with larger data disks attached, denoted by the suffix 'd' after the core number (e.g. Standard_E32*d*_v5), as these increase the throughput of Fusion and reduce the chance of overloading the machine.
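
For example, here is a minimal sketch of selecting such a machine type for an auto-created pool; the pool name (`auto`), VM type, and count are illustrative assumptions, not required values:

```groovy
// Sketch only: pool name, VM type and count are illustrative assumptions.
azure.batch.autoPoolMode = true
azure.batch.allowPoolCreation = true
azure.batch.pools.auto.vmType = 'Standard_E32d_v5'   // 'd' suffix: larger local data disk
azure.batch.pools.auto.vmCount = 5
```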

### AWS Cloud

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the AWS Batch executor.
With Fusion, there is no need to create and configure a custom AMI that includes the `aws` command
line tool when setting up the AWS Batch compute environment.

The configuration for this deployment scenario looks like the following:

```groovy
fusion {
enabled = true
}

wave {
enabled = true
}

process {
executor = 'awsbatch'
queue = '<YOUR BATCH QUEUE>'
}

aws {
region = '<YOUR AWS REGION>'
}
```

Then you can run your pipeline using the following command:

```bash
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
```

For best performance, make sure to use instance types that provide an NVMe disk as [instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html).
If you are creating the AWS Batch compute environment yourself, you will need to make sure the NVMe disk is properly formatted (see below).


#### NVMe storage

The Fusion file system implements a lazy download and upload algorithm that runs in the background to transfer files in
parallel between object storage and a container-local temporary folder. This means that the performance of the temporary
folder inside the container (`/tmp` in a default setup) is key to achieving maximum performance.

The temporary folder is used only as a temporary cache, so the size of the volume can be much lower than the actual needs of your
pipeline processes. Fusion has a built-in garbage collector that constantly monitors remaining disk space on the temporary folder
and immediately evicts old cached entries when necessary.

The recommended setup to get maximum performance is to mount an NVMe disk as the temporary folder and run the pipeline with
the {ref}`scratch <process-scratch>` directive set to `false` to also avoid stage-out transfer time.

Example configuration for using AWS Batch with [NVMe disks](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html) to maximize performance:

```groovy
aws.batch.volumes = '/path/to/ec2/nvme:/tmp'
process.scratch = false
```

:::{tip}
Seqera Platform can automatically format and configure the NVMe instance storage when the
"Use Fast storage" option is enabled while creating the Batch compute environment.
:::

#### AWS IAM permissions

The AWS S3 bucket should be configured with the following IAM permissions:

}
```

### Google Cloud

Fusion provides built-in support for [Google Cloud Storage](https://cloud.google.com/storage?hl=en)
when running in Google Cloud.

Support for Google Cloud does not require any specific settings other than enabling Wave and Fusion in your Nextflow
configuration. For example:

```groovy
process.executor = 'google-batch'
wave.enabled = true
fusion.enabled = true
tower.accessToken = '<your platform access token>'
```

Then run your pipeline using the usual command:

```bash
nextflow run <your pipeline> -work-dir gs://<your google bucket>/scratch
```

When using Fusion, if `process.disk` is not set, Nextflow will attach a single local SSD disk to the machine. The size of this disk can be much lower than the actual needs of your pipeline processes, because Fusion uses it only as a temporary cache. Fusion is also compatible with other `process.disk` types, but better performance is achieved with local SSD disks.
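
If you prefer to set the disk explicitly, a sketch along the following lines can be used; the map form of `process.disk` and the 375 GB figure (the local SSD unit size on Google Cloud) are assumptions to adapt to your setup:

```groovy
// Sketch: explicitly request a local SSD as the task scratch/cache disk.
// The size and map syntax are assumptions; adjust to your Nextflow version and workload.
process.disk = [request: 375.GB, type: 'local-ssd']
```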

### Kubernetes

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Kubernetes executor.

Having the above configuration in place, you can run your pipeline using the following command:

```bash
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
```

:::{note}
You can also use Fusion and Kubernetes with Azure Blob Storage and Google Cloud Storage using the same deployment approach.
:::

### Local execution

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Nextflow local executor. This configuration requires the use of Docker (or a similar container engine) for the execution of your pipeline tasks.

The AWS S3 bucket credentials should be made accessible via standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.

The following configuration should be added in your Nextflow configuration file:

```groovy
docker {
enabled = true
}

fusion {
enabled = true
exportStorageCredentials = true
}

wave {
enabled = true
}
```

Then you can run your pipeline using the following command:

```bash
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
```

Replace `<YOUR PIPELINE>` and `<YOUR BUCKET>` with a pipeline script and bucket of your choice, for example:

```bash
nextflow run https://github.com/nextflow-io/rnaseq-nf -work-dir s3://nextflow-ci/scratch
```

## Advanced settings