Improve fusion docs #5166

Merged (9 commits, Aug 1, 2024)
docs/fusion.md: 205 changes (137 additions, 68 deletions)

Fusion is a distributed virtual file system for cloud-native data pipelines, optimised for Nextflow workloads.

It bridges the gap between cloud-native storage and data analysis workflows by implementing a thin client that allows any existing application to access object storage using the standard POSIX interface, thus simplifying and speeding up most operations.
Currently it supports AWS S3, Google Cloud Storage, and Azure Blob Storage containers.

## Getting started

### Requirements

Fusion file system is designed to work with containerised workloads, so it requires a container engine such as Docker,
or a container-native platform such as AWS Batch or Kubernetes, to execute your pipeline. It also requires
the use of {ref}`Wave containers<wave-page>`.
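
In practice, assuming Docker as the container engine, these requirements boil down to a few configuration flags; the following is a minimal sketch, with executor-specific options covered in the sections below:

```groovy
// Minimal sketch: a container engine plus Wave and Fusion enabled.
docker.enabled = true    // or rely on a container-native executor such as AWS Batch or Kubernetes
wave.enabled   = true    // Fusion relies on Wave containers
fusion.enabled = true
```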

### Azure Cloud

Fusion provides built-in support for [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs/)
when running in Azure Cloud.

Support for Azure does not require any specific settings other than enabling Wave and Fusion in your Nextflow
configuration. For example:

```groovy
process.executor = 'azure-batch'
wave.enabled = true
fusion.enabled = true
tower.accessToken = '<your platform access token>'
```

Then run your pipeline using the usual command:

```bash
nextflow run <your pipeline> -work-dir az://<your blob container>/scratch
```

[TODO add some notes on Azure Blob permission]

Azure virtual machines come with fast SSDs attached, so no additional storage configuration is required. However, it is recommended to use machine types with larger data disks attached, denoted by the suffix 'd' after the core number (e.g. Standard_E32*d*_v5), as these increase the throughput of Fusion and reduce the chance of overloading the machine.
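
For example, here is a minimal sketch of selecting such a machine type for an auto-created pool; the pool name (`auto`), VM type, and count are illustrative assumptions, not required values:

```groovy
// Sketch only: pool name, VM type and count are illustrative assumptions.
azure.batch.autoPoolMode = true
azure.batch.allowPoolCreation = true
azure.batch.pools.auto.vmType = 'Standard_E32d_v5'   // 'd' suffix: larger local data disk
azure.batch.pools.auto.vmCount = 5
```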

### AWS Cloud

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the AWS Batch executor.
With Fusion, there is no need to create and configure a custom AMI that includes the `aws` command
line tool when setting up the AWS Batch compute environment.

The configuration for this deployment scenario looks like the following:

```groovy
fusion {
enabled = true
}

wave {
enabled = true
}

process {
executor = 'awsbatch'
queue = '<YOUR BATCH QUEUE>'
}

aws {
region = '<YOUR AWS REGION>'
}
```

Then you can run your pipeline using the following command:

```bash
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
```

For best performance, make sure to use instance types that provide an NVMe disk as [instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html).
If you are creating the AWS Batch compute environment yourself, you will need to make sure the NVMe disk is properly formatted (see below).


#### NVMe storage

The Fusion file system implements a lazy download and upload algorithm that runs in the background to transfer files in
parallel between object storage and a container-local temporary folder. This means that the performance of the temporary
folder inside the container (`/tmp` in a default setup) is key to achieving maximum performance.

The temporary folder is used only as a temporary cache, so the size of the volume can be much lower than the actual needs of your
pipeline processes. Fusion has a built-in garbage collector that constantly monitors remaining disk space on the temporary folder
and immediately evicts old cached entries when necessary.

The recommended setup to get maximum performance is to mount an NVMe disk as the temporary folder and run the pipeline with
the {ref}`scratch <process-scratch>` directive set to `false` to also avoid stage-out transfer time.

Example configuration for using AWS Batch with [NVMe disks](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html) to maximize performance:

```groovy
aws.batch.volumes = '/path/to/ec2/nvme:/tmp'
process.scratch = false
```

:::{tip}
Seqera Platform can automatically format and configure the NVMe instance storage when the
"Use Fast storage" option is enabled while creating the Batch compute environment.
:::

#### AWS IAM permissions

The AWS S3 bucket should be configured with the following IAM permissions:

}
```

### Google Cloud

Fusion provides built-in support for [Google Cloud Storage](https://cloud.google.com/storage?hl=en)
when running in Google Cloud.

Support for Google Cloud does not require any specific settings other than enabling Wave and Fusion in your Nextflow
configuration. For example:

```groovy
process.executor = 'google-batch'
wave.enabled = true
fusion.enabled = true
tower.accessToken = '<your platform access token>'
```

Then run your pipeline using the usual command:

```bash
nextflow run <your pipeline> -work-dir gs://<your google bucket>/scratch
```

When using Fusion, if `process.disk` is not set, Nextflow will attach a single local SSD disk to the machine. The size of this disk can be much lower than the actual needs of your pipeline processes, because Fusion uses it only as a temporary cache. Fusion is also compatible with other `process.disk` types, but better performance is achieved with local SSD disks.
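
If you prefer to set the disk explicitly, a sketch along the following lines can be used; the map form of `process.disk` and the 375 GB figure (the local SSD unit size on Google Cloud) are assumptions to adapt to your setup:

```groovy
// Sketch: explicitly request a local SSD as the task scratch/cache disk.
// The size and map syntax are assumptions; adjust to your Nextflow version and workload.
process.disk = [request: 375.GB, type: 'local-ssd']
```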

### Kubernetes

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Kubernetes executor.

Having the above configuration in place, you can run your pipeline using the following command:

```bash
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
```

:::{note}
You can also use Fusion and Kubernetes with Azure Blob Storage and Google Cloud Storage using the same deployment approach.
:::

### Local execution

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Nextflow local executor. This configuration requires the use of Docker (or a similar container engine) for the execution of your pipeline tasks.

The AWS S3 bucket credentials should be made accessible via standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.

The following configuration should be added in your Nextflow configuration file:

```groovy
docker {
enabled = true
}

fusion {
enabled = true
exportStorageCredentials = true
}

wave {
enabled = true
}
```

Then you can run your pipeline using the following command:

```bash
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
```

Replace `<YOUR PIPELINE>` and `<YOUR BUCKET>` with a pipeline script and bucket of your choice, for example:

```bash
nextflow run https://github.com/nextflow-io/rnaseq-nf -work-dir s3://nextflow-ci/scratch
```

## Advanced settings