v3.5.8: Stuck in Running when out of disk on a node during artifact packaging #13344
Comments
This is probably a 3.5.2+ or 3.5.5+ regression, i.e. that the wait container never writes a completed status to the respective TaskResult.
Yes, in that that is how you increase space; no, in that it probably didn't get stuck in Running.

Also, this is an excellent minimal repro, thanks!
@Garett-MacGowan @juliev0 @shuangkun we've all made various mitigations to this (which are useful in their own right), but I feel like the Controller logic getting stuck in Running is the underlying problem.
I think the case that isn't working is specific to the wait container getting an OOM failure, right (or any other non-graceful shutdown)? I think if the wait container or the main container were to error out normally, the executor should still set the "incomplete" label and then everything should shut down normally. But you're probably right that there could be some additional check during the taskResult reconciliation to determine if we should stop.
It's a little bit broader: any unhandled error in the wait container can cause this. Handling these is still good (what I called "mitigations" above), but they are all specific cases.
So yes, it would also be a much broader solution if TaskResult reconciliation didn't just completely get stuck in the generic case of "Pod completed but TaskResult not". While there are still some specific cases that would be good to handle gracefully in the Executor, a reconciliation fix would at least prevent users from getting completely stuck.
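For reference, the "completed status" being discussed lives on the WorkflowTaskResult resource that the wait container writes. The sketch below is an assumption-heavy illustration based on my reading of the 3.5.x executor, not something posted in this thread; the label key and field layout may differ in your version. The controller treats the node as unreconciled until the completion label flips to "true", so a wait container killed by eviction or OOM never gets to set it and the workflow stays in Running.

```yaml
# Illustrative WorkflowTaskResult as a non-gracefully terminated wait container
# might leave it. Label key and field names are assumptions from 3.5.x.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTaskResult
metadata:
  name: allocate-lf59f                         # matches the node/pod ID
  namespace: argo
  labels:
    workflows.argoproj.io/workflow: allocate-lf59f
    # The controller waits for "true" here before marking the node complete;
    # an evicted or OOM-killed wait container never updates it.
    workflows.argoproj.io/report-outputs-completed: "false"
progress: 0/1
```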
This issue is fixed by #13491.

# argo get allocate-lf59f
Name: allocate-lf59f
Namespace: argo
ServiceAccount: unset (will run with the default ServiceAccount)
Status: Failed
Message: The node was low on resource: ephemeral-storage. Threshold quantity: 2626078349, available: 2519016Ki. Container wait was using 68Ki, request is 0, has larger consumption of ephemeral-storage. Container main was using 20427300Ki, request is 0, has larger consumption of ephemeral-storage.
Conditions:
PodRunning False
Completed True
Created: Fri Sep 13 14:17:51 +0800 (4 minutes ago)
Started: Fri Sep 13 14:17:51 +0800 (4 minutes ago)
Finished: Fri Sep 13 14:21:49 +0800 (58 seconds ago)
Duration: 3 minutes 58 seconds
Progress: 0/1
ResourcesDuration: 5m45s*(100Mi memory),41s*(1 cpu)
Parameters:
allocate: 100G
STEP TEMPLATE PODNAME DURATION MESSAGE
✖ allocate-lf59f allocate allocate-lf59f 3m The node was low on resource: ephemeral-storage. Threshold quantity: 2626078349, available: 2519016Ki. Container wait was using 68Ki, request is 0, has larger consumption of ephemeral-storage. Container main was using 20427300Ki, request is 0, has larger consumption of ephemeral-storage.
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

What happened? What did you expect to happen?
We ran into an issue where a step failed after allocating a large amount of disk space on a node for files that subsequently should have been packed into an artifact and pushed to S3. The actual step succeeded, but Argo then failed while tarring the output artifacts (I noticed the message that output artifacts were being tarred). Since the files that were created used around 80% of the disk space available on the node, I suspect there was not enough room left to create the output artifact.
The node had 1.6 TB of disk space, and the step created about 1 TB of files.
The reported error message was
The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0.
While this behavior in itself is fine (if there is not enough space, the step should fail), I noticed that it does not fail the workflow itself. The workflow is still in the Running state, even several hours after the step failed.

This is all probably related to #8526
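As an aside not raised in the thread: the "request is 0" in the eviction message hints at one way to make the eviction itself less surprising. A hedged sketch, assuming you know roughly how much scratch space the step needs, is to declare ephemeral-storage requests and limits on the step's container so the scheduler and kubelet account for the disk use up front (template name, image, and values below are illustrative only). This does not address the Controller getting stuck; it only reduces the chance of hitting the eviction in the first place.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: large-output-
spec:
  entrypoint: produce-large-output
  templates:
    - name: produce-large-output        # hypothetical name for the disk-heavy step
      container:
        image: ubuntu:22.04
        command: [bash, -c]
        args: ["echo 'generate the ~1 TB of files here'"]   # placeholder for the real work
        resources:
          requests:
            ephemeral-storage: 1100Gi   # room for the generated files plus the tarball
          limits:
            ephemeral-storage: 1200Gi
```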
Expected behaviour
In this case, I would expect the whole workflow to fail.
Version(s)
v3.5.8
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
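The reporter's attachment is not preserved in this extract. As a stand-in, here is a hedged sketch of what a minimal repro could look like, inferred from the `argo get` output above (workflow name `allocate-`, parameter `allocate: 100G`, template `allocate`); the image, the `fallocate`/`dd` command, and the artifact path are my own assumptions, not the actual workflow. The idea is simply to write more data than the node has free ephemeral storage, so the pod (including the wait container) is evicted before the output artifact can be packaged.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: allocate-
spec:
  entrypoint: allocate
  arguments:
    parameters:
      - name: allocate
        value: "100G"                   # should exceed the node's free ephemeral storage
  templates:
    - name: allocate
      container:
        image: ubuntu:22.04
        command: [bash, -c]
        # Fill ephemeral storage under the artifact output path, then let the
        # wait container attempt to package it; the pod is evicted instead.
        args:
          - >-
            mkdir -p /tmp/out &&
            { fallocate -l {{workflow.parameters.allocate}} /tmp/out/big.bin ||
              dd if=/dev/zero of=/tmp/out/big.bin bs=1M count=102400; }
      outputs:
        artifacts:
          - name: big
            path: /tmp/out
```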
Logs from the workflow controller
Logs from in your workflow's wait container