v3.5.8: Stuck in `Running` when out of disk on a node during artifact packaging #13344

dnlbauer · 2024-07-15T15:33:54Z

Pre-requisites

I have double-checked my configuration
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
I have searched existing issues and could not find a match for this bug
I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

We ran into the issue where a step failed due to allocating large amount of disk on a node for files that subsequently should have been packed into an artifact and pushed to s3. The actual step succeeded, but argo then failed during the taring of output artifacts (I noticed the message that output artifacts are now being tar'ed). Since the files that were created allocated around 80% of disk space available for the node, I suspect that there was not enough room left to create the output artifact.

The node hat 1.6 TB disk space, and the step created files with about 1TB size.
The reported error message was The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0.

While this behavior in itself is fine (If there is not enough space, the step should fail), I noticed that this does not fail the workflow itself. It is still in the Running state, even several hours after the step failed.

This is all probably related to #8526

Expected behaviour

In this case, I would expect the whole workflow to fail.

Version(s)

v3.5.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow

metadata:
  generateName: allocate-
  namespace: argo

spec:
  arguments:
    parameters:
      - name: allocate
        value: 100G
 
  entrypoint: allocate
  templates:
    # allocate and archive a huge file
    - name: allocate
      inputs:
        parameters:
          - name: allocate
            value: "{{ workflow.parameters.allocate }}"
      outputs:
        artifacts:
          - name: large_binary_blob
            path: /test.img
            archive:
              none: {}

      script:
        image: alpine
        command: ["sh"]
        source: |
          echo "allocating..."
          dd if=/dev/zero of=/test.img bs=1M iflag=fullblock,count_bytes count={{inputs.parameters.allocate}}
          ls -lah

  podGC:
    strategy: OnWorkflowSuccess

Logs from the workflow controller

time="2024-07-15T10:13:34.372Z" level=info msg="Processing workflow" Phase= ResourceVersion=9999 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.378Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=allocate-gvh2r
time="2024-07-15T10:13:34.378Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.378Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.378Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.378Z" level=info msg="Pod node allocate-gvh2r initialized Pending" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.399Z" level=info msg="Created pod: allocate-gvh2r (allocate-gvh2r)" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.399Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.399Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.413Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=10003 workflow=allocate-gvh2r
time="2024-07-15T10:13:44.400Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=10003 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:44.401Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:13:44.401Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:13:44.401Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=allocate-gvh2r old.message= old.phase=Pending old.progress=0/1 workflow=allocate-gvh2r
time="2024-07-15T10:13:44.401Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:44.401Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:44.415Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=10064 workflow=allocate-gvh2r
time="2024-07-15T10:19:47.639Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=10064 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:47.640Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:19:47.640Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:19:47.640Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:19:47.640Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:47.640Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:47.655Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:57.665Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:25:34.606Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:25:34.607Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:25:34.607Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:25:34.608Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:25:34.608Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:25:34.608Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:25:34.629Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:29:36.644Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:29:36.644Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:29:36.644Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:29:36.644Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:29:36.645Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:29:36.645Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:29:36.655Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:39:04.657Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:39:04.658Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:39:04.658Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:39:04.658Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:39:04.659Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:39:04.659Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:39:04.696Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:45:34.608Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:45:34.609Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:45:34.609Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:45:34.609Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:45:34.609Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:45:34.609Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:45:34.628Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:48:20.680Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:48:20.681Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:48:20.681Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:48:20.681Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:48:20.681Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:48:20.681Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:48:20.699Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:53:51.681Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:53:51.682Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:53:51.682Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:53:51.684Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:53:51.685Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:53:51.685Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:53:51.701Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:02:18.697Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:02:18.697Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:02:18.697Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:02:18.697Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:02:18.698Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:02:18.698Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:02:18.741Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:05:34.608Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:05:34.609Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:05:34.609Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:05:34.609Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:05:34.609Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:05:34.609Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:05:34.629Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:11:25.709Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:11:25.710Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:11:25.710Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:11:25.710Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:11:25.710Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:11:25.710Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:11:25.729Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:16:33.737Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:16:33.738Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:16:33.738Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:16:33.738Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:16:33.738Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:16:33.738Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:16:33.776Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:22:41.767Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:22:41.768Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:22:41.768Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:22:41.768Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:22:41.768Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:22:41.768Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:22:41.787Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:25:34.609Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:25:34.610Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:25:34.610Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:25:34.610Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:25:34.610Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:25:34.610Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:25:34.627Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:31:04.768Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:31:04.769Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:31:04.769Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:31:04.769Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:31:04.769Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:31:04.769Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:31:04.787Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:38:20.781Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:38:20.782Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:38:20.782Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:38:20.782Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:38:20.783Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:38:20.783Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:38:20.803Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:45:34.610Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:45:34.611Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:45:34.611Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:45:34.611Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:45:34.611Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:45:34.611Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:45:34.626Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:48:07.787Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:48:07.788Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:48:07.788Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:48:07.788Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:48:07.789Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:48:07.789Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:48:07.818Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:53:20.535Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:20.537Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:53:20.537Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:53:20.537Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=allocate-gvh2r old.message= old.phase=Running old.progress=0/1 workflow=allocate-gvh2r
time="2024-07-15T11:53:20.537Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:20.537Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:20.542Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/allocate-gvh2r/terminateContainers
time="2024-07-15T11:53:20.545Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:53:20.578Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=36376 workflow=allocate-gvh2r
time="2024-07-15T11:53:21.008Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with terminated\n"
time="2024-07-15T11:53:30.579Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=36376 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:30.580Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:53:30.580Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:53:30.580Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:53:30.580Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:30.580Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:30.585Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/allocate-gvh2r/terminateContainers
time="2024-07-15T11:53:30.585Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:53:30.603Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=36420 workflow=allocate-gvh2r
time="2024-07-15T11:53:30.957Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with terminated\n"
time="2024-07-15T11:53:40.604Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=36420 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:40.605Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:53:40.605Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:53:40.605Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:53:40.605Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:40.605Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:40.610Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/allocate-gvh2r/terminateContainers
time="2024-07-15T11:53:40.610Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:53:40.621Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=36420 workflow=allocate-gvh2r
time="2024-07-15T11:53:41.053Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with terminated\n"
time="2024-07-15T11:53:51.013Z" level=info msg="cleaning up pod" action=killContainers key=argo/allocate-gvh2r/killContainers
time="2024-07-15T11:53:51.013Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=9&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:53:51.346Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with killed\n"
time="2024-07-15T11:57:29.802Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=36420 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:57:29.802Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:57:29.802Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:57:29.803Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:57:29.803Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:57:29.803Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:57:29.808Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/allocate-gvh2r/terminateContainers
time="2024-07-15T11:57:29.808Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:57:29.825Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=36420 workflow=allocate-gvh2r
time="2024-07-15T11:57:30.129Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with terminated\n"
time="2024-07-15T11:58:00.130Z" level=info msg="cleaning up pod" action=killContainers key=argo/allocate-gvh2r/killContainers
time="2024-07-15T11:58:00.131Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=9&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:58:00.497Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with killed\n"
time="2024-07-15T12:00:19.329Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=36420 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:19.330Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T12:00:19.330Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:00:19.331Z" level=info msg="Pod failed: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " displayName=allocate-gvh2r namespace=argo pod=allocate-gvh2r templateName=allocate workflow=allocate-gvh2r
time="2024-07-15T12:00:19.331Z" level=info msg="node changed" namespace=argo new.message="The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " new.phase=Failed new.progress=0/1 nodeID=allocate-gvh2r old.message= old.phase=Running old.progress=0/1 workflow=allocate-gvh2r
time="2024-07-15T12:00:19.331Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:19.331Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:19.349Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=38196 workflow=allocate-gvh2r
time="2024-07-15T12:00:29.352Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=38196 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg="Pod failed: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " displayName=allocate-gvh2r namespace=argo pod=allocate-gvh2r templateName=allocate workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:29.375Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=38196 workflow=allocate-gvh2r
time="2024-07-15T12:05:34.610Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=38196 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:05:34.611Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T12:05:34.612Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:05:34.612Z" level=info msg="Pod failed: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " displayName=allocate-gvh2r namespace=argo pod=allocate-gvh2r templateName=allocate workflow=allocate-gvh2r
time="2024-07-15T12:05:34.612Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:05:34.612Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:05:34.612Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:05:34.626Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=38196 workflow=allocate-gvh2r
time="2024-07-15T12:06:04.805Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=38196 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg="Pod failed: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " displayName=allocate-gvh2r namespace=argo pod=allocate-gvh2r templateName=allocate workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:06:04.833Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=38196 workflow=allocate-gvh2r
(....snip..)
time="2024-07-15T15:26:47.076Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=76585 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T15:26:47.076Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T15:26:47.076Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T15:26:47.077Z" level=info msg="Pod failed: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " displayName=allocate-gvh2r namespace=argo pod=allocate-gvh2r templateName=allocate workflow=allocate-gvh2r
time="2024-07-15T15:26:47.077Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T15:26:47.077Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T15:26:47.077Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T15:26:47.087Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=76585 workflow=allocate-gvh2r

Logs from in your workflow's wait container

Error from server (BadRequest): container "wait" in pod "allocate-gvh2r" is terminated

The text was updated successfully, but these errors were encountered:

agilgur5 · 2024-07-15T16:39:37Z

I noticed that this does not fail the workflow itself. It is still in the Running state, even several hours after the step failed.

This is probably a 3.5.2+ or 3.5.5+ regression, i.e. that the wait container never writes a completed status to the respective WorkflowTaskResult due to the changes in #11947 or related PRs. That would suggest that the error isn't getting caught for some reason

This is all probably related to #8526

Yes in that that is how you increase space, no in that it probably didn't get stuck in Running on that version; it would've probably correctly failed.

agilgur5 · 2024-07-15T16:48:25Z

Also this is an excellent minimal repro, thanks!

agilgur5 · 2024-07-15T17:10:41Z

@Garett-MacGowan @juliev0 @shuangkun we've all made various mitigations to this (which are useful in their own right), but I feel like the Controller logic getting stuck in Running is itself incorrect. If the Pod is completed and the WorkflowTaskResult is incomplete, it'll never become completed again, so the Controller should not wait on that condition. And I think in every case of it getting stuck, the Pod is completed in a failure state

juliev0 · 2024-07-16T02:21:31Z

I think the case that isn't working is specific to the wait container getting an OOM failure, right (or any other non-graceful shutdown)? I think if the wait container or the main container were to error out normally, the executor should still set the "incomplete" label and then everything should shut down normally.

But you're probably right that there could be some additional check during the taskResult reconciliation to determine if we should stop.

agilgur5 · 2024-07-16T03:08:06Z

I think the case that isn't working is specific to the wait container getting an OOM failure, right (or any other non-graceful shutdown)?

It's a little bit broader as any unhandled error:

fix(resource): catch fatal kubectl errors #13321 as an example of graceful shutdown but unhandled error.
And ofc this issue is a non-OOM one too.

Handling these is still good (what I called "mitigations" above), but they are all specific cases.

But you're probably right that there could be some additional check during the taskResult reconciliation to determine if we should stop.

So yes it would also be a much broader solution if TaskResult reconciliation didn't just completely get stuck in the generic case of "Pod completed but TaskResult not".
Or, in other words, TaskResult reconciliation in the Controller should itself gracefully handle the entire class of unhandled errors.

While there are still some specific cases that would still be good to handle gracefully in the Executor, a reconciliation fix would at least prevent users from getting completely stuck.
And there are ofc some cases of unavoidable unhandled errors that it would be needed for (this issue and Pod eviction are not necessarily avoidable in all cases)

jswxstw · 2024-09-13T06:23:26Z

This issue is fixed by #13491.

# argo get allocate-lf59f
Name:                allocate-lf59f
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Failed
Message:             The node was low on resource: ephemeral-storage. Threshold quantity: 2626078349, available: 2519016Ki. Container wait was using 68Ki, request is 0, has larger consumption of ephemeral-storage. Container main was using 20427300Ki, request is 0, has larger consumption of ephemeral-storage. 
Conditions:          
 PodRunning          False
 Completed           True
Created:             Fri Sep 13 14:17:51 +0800 (4 minutes ago)
Started:             Fri Sep 13 14:17:51 +0800 (4 minutes ago)
Finished:            Fri Sep 13 14:21:49 +0800 (58 seconds ago)
Duration:            3 minutes 58 seconds
Progress:            0/1
ResourcesDuration:   5m45s*(100Mi memory),41s*(1 cpu)
Parameters:          
  allocate:          100G

STEP               TEMPLATE  PODNAME         DURATION  MESSAGE
 ✖ allocate-lf59f  allocate  allocate-lf59f  3m        The node was low on resource: ephemeral-storage. Threshold quantity: 2626078349, available: 2519016Ki. Container wait was using 68Ki, request is 0, has larger consumption of ephemeral-storage. Container main was using 20427300Ki, request is 0, has larger consumption of ephemeral-storage.

dnlbauer added the type/bug label Jul 15, 2024

dnlbauer changed the title ~~Running out of disk on a node does not fail the workflow~~ Running out of disk on a node during artifact packaging does not fail the workflow Jul 15, 2024

agilgur5 changed the title ~~Running out of disk on a node during artifact packaging does not fail the workflow~~ Stuck in Running when out of disk on a node during artifact packaging Jul 15, 2024

agilgur5 added type/regression Regression from previous behavior (a specific type of bug) area/artifacts S3/GCP/OSS/Git/HDFS etc labels Jul 15, 2024

agilgur5 added this to the v3.5.x patches milestone Jul 15, 2024

agilgur5 added the area/controller Controller issues, panics label Jul 15, 2024

agilgur5 added the P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important label Jul 15, 2024

agilgur5 mentioned this issue Jul 30, 2024

v3.5.3: Workflow processing fails to complete due to incomplete WorkflowTaskResult from interrupted pod #12993

Closed

4 tasks

agilgur5 closed this as completed Sep 13, 2024

jswxstw mentioned this issue Sep 20, 2024

signaled container: wait error: unable to upgrade connection: container not found. node Succeeded but wf not progressing #13627

Open

4 tasks

agilgur5 changed the title ~~Stuck in Running when out of disk on a node during artifact packaging~~ v3.5.8: Stuck in Running when out of disk on a node during artifact packaging Oct 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v3.5.8: Stuck in `Running` when out of disk on a node during artifact packaging #13344

v3.5.8: Stuck in `Running` when out of disk on a node during artifact packaging #13344

dnlbauer commented Jul 15, 2024 •

edited by agilgur5

Loading

Logs from the workflow controller

agilgur5 commented Jul 15, 2024 •

edited

Loading

agilgur5 commented Jul 15, 2024

agilgur5 commented Jul 15, 2024

juliev0 commented Jul 16, 2024 •

edited by agilgur5

Loading

agilgur5 commented Jul 16, 2024 •

edited

Loading

jswxstw commented Sep 13, 2024

v3.5.8: Stuck in Running when out of disk on a node during artifact packaging #13344

v3.5.8: Stuck in Running when out of disk on a node during artifact packaging #13344

Comments

dnlbauer commented Jul 15, 2024 • edited by agilgur5 Loading

Pre-requisites

What happened? What did you expect to happen?

Expected behaviour

Version(s)

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

Logs from the workflow controller

Logs from in your workflow's wait container

agilgur5 commented Jul 15, 2024 • edited Loading

agilgur5 commented Jul 15, 2024

agilgur5 commented Jul 15, 2024

juliev0 commented Jul 16, 2024 • edited by agilgur5 Loading

agilgur5 commented Jul 16, 2024 • edited Loading

jswxstw commented Sep 13, 2024

v3.5.8: Stuck in `Running` when out of disk on a node during artifact packaging #13344

v3.5.8: Stuck in `Running` when out of disk on a node during artifact packaging #13344

dnlbauer commented Jul 15, 2024 •

edited by agilgur5

Loading

agilgur5 commented Jul 15, 2024 •

edited

Loading

juliev0 commented Jul 16, 2024 •

edited by agilgur5

Loading

agilgur5 commented Jul 16, 2024 •

edited

Loading