Volume-based artifact passing system #1349

Open
Ark-kun opened this issue Apr 30, 2019 · 14 comments
Labels
area/artifacts S3/GCP/OSS/Git/HDFS etc type/feature Feature request

Comments


Ark-kun commented Apr 30, 2019

Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST

I'd like to implement a feature that automatically mounts a single volume to the workflow pods so that data passing happens passively through that volume.

I'm working on implementing this feature and will submit a PR once it's ready.

It's possible to implement this on top of Argo today, but it might be nice to have it built in.
I've previously illustrated the proposal in #1227 (comment)

The main idea is to replace Argo's "active" way of passing artifacts (copying, packing, uploading/downloading/unpacking) with a passive system that has several advantages:

  • Much faster artifact storage I/O. No packing/unpacking. No copying files.
  • Artifact size is not limited by main/wait container disk sizes.

Syntax (unresolved):

# New syntax
artifactStorage: 
  volume: # Will automatically mount this volume to all Pods in a particular way
    persistentVolumeClaim:
      claimName: vol01

# The rest of the code is the usual artifact-passing syntax
templates:
  - name: producer
    outputs:
      artifacts:
      - name: out-art1
        path: /argo/outputs/out-art1/data
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world > /argo/outputs/out-art1/data"]

  - name: consumer
    inputs:
      artifacts:
      - name: in-art1
        path: /argo/inputs/in-art1/data
    container:
      image: docker/whalesay:latest
      command: [cat, '/argo/inputs/in-art1/data']

  - name: main
    dag:
      tasks:
      - name: producer-task
        template: producer
      - name: consumer-task
        template: consumer
        arguments:
          artifacts:
          - name: in-art1
            from: "{{tasks.producer-task.outputs.artifacts.out-art1}}"

Transformed spec:

volumes:
  - name: argo-storage
    persistentVolumeClaim:
      claimName: vol01
templates:
  - name: producer
    outputs:
      parameters:
      - name: out-art1-subpath
        value: "{{workflow.uid}}/{{pod.name}}/out-art1/"
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world > /argo/outputs/out-art1/data"]
      volumeMounts:
        - name: argo-storage
          mountPath: /argo/outputs/out-art1/
          subPath: "{{workflow.uid}}/{{pod.name}}/out-art1/"

  - name: consumer
    inputs:
      parameters:
      - name: in-art1-subpath
    container:
      image: docker/whalesay:latest
      command: [cat, '/argo/inputs/in-art1/data']
      volumeMounts:
        - name: argo-storage
          mountPath: /argo/inputs/in-art1/
          subPath: "{{input.parameters.in-art1-subpath}}"
          readOnly: true

  - name: main
    dag:
      tasks:
      - name: producer-task
        template: producer
      - name: consumer-task
        template: consumer
        arguments:
          parameters:
          - name: in-art1-subpath
            value: "{{tasks.producer-task.outputs.parameters.out-art1-subpath}}"

This system becomes even better when combined with the #1329, #1348 and #1300 features.

@JoshRagem

There is a serious issue with this approach on AWS EBS volumes: the volumes will fail to attach and/or mount once you have two or more pods on different nodes. If your proposal could be extended with an option to prefer scheduling pods onto nodes that already have the volume attached (when allowed by resource requests), that might reduce the errors.
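
One crude way to approximate that suggestion (a sketch only, not part of the proposal; the node label is hypothetical) is to pin every pod of the workflow to a single pre-labelled node via the workflow-level nodeSelector, so that a ReadWriteOnce EBS-backed volume only ever needs to attach to one node:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: volume-passing-
spec:
  entrypoint: main
  nodeSelector:
    workflow-storage-node: "true"  # hypothetical label applied to the node holding the volume
  volumes:
    - name: argo-storage
      persistentVolumeClaim:
        claimName: vol01           # ReadWriteOnce EBS-backed PVC
  # templates as in the transformed spec above

This trades cross-node parallelism for attach reliability; a ReadWriteMany-capable volume would avoid the trade-off entirely.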


Ark-kun commented Sep 3, 2019

Is it true that AWS does not support any multi-write volume types that work for any set of pods?


Ark-kun commented Oct 16, 2019

Here is a draft rewriter script: https://github.com/Ark-kun/pipelines/blob/SDK---Compiler---Added-support-for-volume-based-data-passing/sdk/python/kfp/compiler/_data_passing_using_volume.py
It can be run as a command-line program to rewrite an Argo Workflow from artifact-based to volume-based data passing.

What does everyone think?

@danxmoran

Hi @Ark-kun, I'm exploring Argo for a use-case where I want to:

  1. Query metadata about a file in external storage (e.g. FTP), outputting its size
  2. Dynamically generate a volume big enough to store the downloaded file
  3. Download the file to the volume
  4. Mount the volume in a separate task, for processing

Would this proposal support the step-level (or template-level) dynamic volume sizing that I'd need to implement this flow?
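
For step 2, one possible shape (a minimal sketch, assuming the metadata step emits the required size as an output parameter; all names are hypothetical) is a resource template that creates the PVC from a parameterized manifest:

  - name: create-scratch-volume
    inputs:
      parameters:
      - name: size             # e.g. "50Gi", taken from the metadata-query step's output
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: scratch-{{workflow.uid}}
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: "{{inputs.parameters.size}}"

The download and processing templates would then mount that claim; deleting it afterwards is discussed later in the thread.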


Ark-kun commented Nov 2, 2019

Per-step or per-artifact volumes could technically be implemented as another rewriting layer on top of the one in this issue. (My rewriter script would make that easier: you would just need to change subPaths to volume names; see the sketch below.)

This issue is more geared towards centralized data storage though.
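
For concreteness, a rough sketch of what such an extra layer might emit (all names hypothetical): each artifact gets its own PVC, and the consumer mount references that volume by name instead of using a subPath on a shared volume:

volumes:
  - name: out-art1-volume
    persistentVolumeClaim:
      claimName: vol-producer-out-art1   # one PVC per artifact (hypothetical)
templates:
  - name: consumer
    container:
      image: docker/whalesay:latest
      command: [cat, '/argo/inputs/in-art1/data']
      volumeMounts:
        - name: out-art1-volume          # volume name instead of a subPath
          mountPath: /argo/inputs/in-art1/
          readOnly: true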


alexec commented May 12, 2020

Could you use PVCs for this?


Ark-kun commented May 13, 2020

Could you use PVCs for this?

If this is a question for me, then yes: the proposed feature and the implementation script are volume-agnostic. Any volume can be used, and most users will probably specify a PVC, even if only as a layer of indirection.
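
To illustrate the volume-agnostic point (a sketch against the unresolved syntax above; the NFS server and export path are hypothetical), any Kubernetes volume source could be dropped in where the PVC was:

artifactStorage:
  volume:
    nfs:
      server: nfs.example.internal    # hypothetical
      path: /exports/argo-artifacts   # hypothetical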

@BlackRider97

Is there any issue if I use Azure Files as the persistent volume for Argo? It provides concurrent access to the volume, which is the limitation with EBS.
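
For reference, the kind of claim that comment describes might look like this (a sketch; the storage class name depends on the cluster's Azure Files provisioner and should be treated as an assumption):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol01
spec:
  accessModes: ["ReadWriteMany"]    # Azure Files allows concurrent access from multiple nodes
  storageClassName: azurefile       # or azurefile-csi, depending on the cluster
  resources:
    requests:
      storage: 100Gi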


hadim commented Jun 8, 2020

@Ark-kun are you still planning to implement this feature? The lifecycle of the artifacts in Argo could be an issue for us as it involves a lot of copying/downloading/uploading.


hadim commented Jun 8, 2020

Also, how would you automatically remove the PVC at the end of the workflow? A typical workflow for us would be:

  • set up a PVC
  • get some data from S3
  • step1: use data from S3 and generate new data on the PVC
  • step2: use data from step1 and generate new data on the PVC
  • etc...
  • upload the data generated by the last step to S3
  • delete the PVC
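
One way to get that create-then-delete lifecycle without managing the PVC by hand (a sketch of one option, not necessarily what the author has in mind) is Argo's volumeClaimTemplates, which create a claim per workflow and garbage-collect it when the workflow finishes:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: s3-pvc-s3-
spec:
  entrypoint: main
  volumeClaimTemplates:
    - metadata:
        name: workdir               # referenced by this name in each step's volumeMounts
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
  # templates: download from S3 -> step1 -> step2 -> ... -> upload to S3,
  # each mounting the "workdir" claim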


rmgogogo commented Jun 15, 2020

Any reason not to read/write S3 directly?
Is it because the library doesn't support the S3 interface?


hadim commented Jun 15, 2020

We don't want to upload/download our data at each step, for performance reasons. Using a PVC solves this. We only use artifacts for the first and last steps.
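
As a sketch of that pattern (bucket, key, and secret names are hypothetical), the first step can pull its input through a regular S3 input artifact while writing its output onto the shared volume, so the intermediate steps never touch S3:

  - name: ingest
    inputs:
      artifacts:
      - name: raw-data
        path: /tmp/raw-data             # fetched from S3 by Argo before the container starts
        s3:
          endpoint: s3.amazonaws.com
          bucket: my-bucket             # hypothetical
          key: datasets/input.csv       # hypothetical
          accessKeySecret:
            name: s3-creds              # hypothetical secret
            key: accessKey
          secretKeySecret:
            name: s3-creds
            key: secretKey
    container:
      image: alpine:3.12
      command: [sh, -c]
      args: ["cp /tmp/raw-data /argo/outputs/data"]   # output lands on the mounted volume
      volumeMounts:
        - name: argo-storage
          mountPath: /argo/outputs/
          subPath: "{{workflow.uid}}/{{pod.name}}/"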


Ark-kun commented Jun 16, 2020

@Ark-kun are you still planning to implement this feature?

I've implemented this feature back in October 2019 as a separate script which can transform a subset of DAG-based workflows:

Here is a draft rewriter script: https://github.com/Ark-kun/pipelines/blob/SDK---Compiler---Added-support-for-volume-based-data-passing/sdk/python/kfp/compiler/_data_passing_using_volume.py
It can be run as a command-line program to rewrite an Argo Workflow from artifact-based to volume-based data passing.

I wonder whether we need to add it to the Argo controller itself (as it can just be used as a preprocessor). WDYT?

Also, how would you automatically remove the PVC at the end of the workflow?

It should be possible to do this using an exit handler and resource templates.
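
A minimal sketch of that exit-handler approach (the claim name is hypothetical): the onExit template runs whether the workflow succeeds or fails, and a resource template deletes the claim:

spec:
  entrypoint: main
  onExit: delete-volume
templates:
  # ... main templates ...
  - name: delete-volume
    resource:
      action: delete
      manifest: |
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: vol01               # the claim created earlier in the workflow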

My main scenario requires the volume to persist between restarts. That allows implementing intermediate data caching, so that when you run a modified pipeline it can skip already-computed parts instead of re-running every step. (There probably needs to be some garbage-collection system that deletes expired data.)


alexec commented Jan 18, 2021

Relates to #4130 and #2551
