Wait container fails to extract large artifacts from workflow step #1203
Comments
I've investigated this issue myself for several days to no avail, so any input from the Argo team would be greatly appreciated.
Has anyone tried moving very large data through argo and also gotten hit by this limitation?
@alexfrieden @GiantToast I ran into this issue myself when attempting to upload data of around 1.5Gb as an artifact. From debugging, it seemed to be an issue with the `docker cp` command which extracts items from the main container. The `docker cp` command was also preventing us from gathering logs from the container, because it creates a lock around it while the file transfer is in progress. From what I understand, @jessesuen is currently working on a solution as part of #1214, which will remove the need to run this command by simply allowing the sidecar to share the same process and volume as the main container.

I received more or less the same log messages as you from the wait sidecar; however, an incomplete tarball was being uploaded instead of the whole thing failing. It seemed to be because the `docker cp ... | gzip` command was timing out. To prove this, I ended up rebuilding the argoexec container and adding some additional log messages to stat the file size of the tar before it was uploaded. The file size was significantly smaller than expected, and thus I gathered it was an issue extracting the artifact from the main container, as opposed to being an issue with the upload itself.
This should be fixed as part of PNS. The implementation has been changed to upload artifacts directly from a mirrored volume in the wait container. This bypasses the `docker cp` step entirely. However, there is still potential for large artifacts to fail.
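For context, the executor is selected in the workflow-controller configmap. A minimal sketch of switching to the PNS executor, assuming an Argo version in which PNS is available; the `argo` namespace is an assumption:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  config: |
    # Use the Process Namespace Sharing executor instead of the
    # docker executor, avoiding `docker cp` for artifact extraction.
    containerRuntimeExecutor: pns
```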
Hey @jessesuen, could you give an example of how to specify an emptyDir volume as a workflow output?
|
Thanks @sarabala1979. If I understand @jessesuen's comment correctly, he said that we should use an emptyDir volume shared with the main container.
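For anyone looking for a concrete version of this, here is a minimal sketch (not from the thread) of a workflow that writes its output to an emptyDir volume and exposes a file on it as an output artifact; the image, paths, and names are illustrative assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: emptydir-output-
spec:
  entrypoint: produce
  volumes:
  - name: workdir          # emptyDir volume shared into the step's pod
    emptyDir: {}
  templates:
  - name: produce
    container:
      image: alpine:3.9
      command: [sh, -c]
      # Write a 100Mb file onto the shared volume.
      args: ["dd if=/dev/zero of=/work/large.bin bs=1M count=100"]
      volumeMounts:
      - name: workdir
        mountPath: /work
    outputs:
      artifacts:
      - name: large-file
        path: /work/large.bin   # picked up from the mounted volume
```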
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
A single-step workflow completes successfully when creating artifacts of smaller size (~200Mb), but fails when the wait container attempts to extract larger (~6.7Gb) artifacts. The main container shows success for its step, but the pod fails, with the wait container showing Exit Code 2 and leading to `failed to save outputs: verify serviceaccount argo:default has necessary privileges`. The pod appears to have ample disk space and memory, and manual extraction of the artifacts succeeds via local docker container use outside of argo/k8s.

What you expected to happen:
The workflow should complete successfully and extract the expected artifacts.
How to reproduce it (as minimally and precisely as possible):
The workflow takes a single argument, an NCBI SRA run identifier:

- `SRR000001` should complete successfully, creating artifacts of ~200Mb
- `SRR7460726` fails, as it creates artifacts ~6.7Gb in size

A minimal sketch of a workflow of this shape is shown below.
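The original workflow was not posted, so this is a hedged reconstruction; the `ncbi/sra-tools` image and the `fasterq-dump` invocation are illustrative assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sra-fetch-
spec:
  entrypoint: fetch
  arguments:
    parameters:
    - name: run-id
      value: SRR000001   # swap for SRR7460726 to reproduce the failure
  templates:
  - name: fetch
    inputs:
      parameters:
      - name: run-id
    container:
      image: ncbi/sra-tools        # assumed image for illustration
      command: [sh, -c]
      # Download the run and write FASTQ files to /out.
      args: ["mkdir -p /out && fasterq-dump {{inputs.parameters.run-id}} -O /out"]
    outputs:
      artifacts:
      - name: reads
        path: /out   # the wait container extracts this as the artifact
```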
Anything else we need to know?:

Initially I thought the issue was similar to #724; however, editing the `workflow-controller-configmap` to grant the executor more resources does not solve the issue.
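For reference, granting the executor more resources is done through the `executor` stanza of the controller configmap. A sketch; the original report did not include the values used, so the ones below are assumptions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  config: |
    # Resource requests/limits applied to the init and wait containers.
    executor:
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: "1"
          memory: 2Gi
```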
main log for `SRR000001`:

main log for `SRR7460726_1`:

Environment:
Other debugging information (if applicable):
init:
wait: