
v3.5.8: Stuck in Running when out of disk on a node during artifact packaging #13344

Closed
3 of 4 tasks
dnlbauer opened this issue Jul 15, 2024 · 6 comments
Labels
area/artifacts S3/GCP/OSS/Git/HDFS etc area/controller Controller issues, panics P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important type/bug type/regression Regression from previous behavior (a specific type of bug)


dnlbauer commented Jul 15, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

We ran into an issue where a step allocated a large amount of disk space on a node for files that were subsequently supposed to be packed into an artifact and pushed to S3. The step's script itself succeeded, but Argo then failed while tarring the output artifacts (I saw the message that output artifacts were being tar'ed). Since the created files took up around 80% of the node's available disk space, I suspect there was not enough room left to create the output artifact.

The node had 1.6 TB of disk space, and the step created files totalling about 1 TB.
The reported error message was: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0.

While this behavior in itself is fine (if there is not enough space, the step should fail), I noticed that the failure does not fail the workflow itself. The workflow is still in the Running state, even several hours after the step failed.

This is all probably related to #8526
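As a side note, one mitigation we are considering (this does not address the stuck Running state itself) is to request ephemeral storage explicitly on the template, so the scheduler and kubelet account for the step's disk usage up front. This is a sketch using the standard Kubernetes resource request; the 2Ti figure is an assumption sized for ~1 TB of output plus tar overhead, not something I have verified against this exact failure:

```yaml
# Hypothetical mitigation sketch: reserve ephemeral storage for the step's
# container so pod eviction for ephemeral-storage pressure is less likely.
spec:
  templates:
    - name: allocate
      script:
        image: alpine
        resources:
          requests:
            ephemeral-storage: "2Ti"  # assumed headroom for output + archive
```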

Expected behaviour

In this case, I would expect the whole workflow to fail.

Version(s)

v3.5.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow

metadata:
  generateName: allocate-
  namespace: argo

spec:
  arguments:
    parameters:
      - name: allocate
        value: 100G
 
  entrypoint: allocate
  templates:
    # allocate and archive a huge file
    - name: allocate
      inputs:
        parameters:
          - name: allocate
            value: "{{ workflow.parameters.allocate }}"
      outputs:
        artifacts:
          - name: large_binary_blob
            path: /test.img
            archive:
              none: {}

      script:
        image: alpine
        command: ["sh"]
        source: |
          echo "allocating..."
          dd if=/dev/zero of=/test.img bs=1M iflag=fullblock,count_bytes count={{inputs.parameters.allocate}}
          ls -lah

  podGC:
    strategy: OnWorkflowSuccess
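For reference, the template's script boils down to the following dd invocation. This scaled-down sketch (a hypothetical 10M instead of 100G, and /tmp/test.img instead of /test.img) can be run locally; it assumes GNU coreutils dd, where iflag=count_bytes makes count= a byte count rather than a block count:

```shell
# Scaled-down sketch of the allocation step: write 10 MB of zeros.
# iflag=fullblock,count_bytes requires GNU coreutils dd.
ALLOCATE=10M
dd if=/dev/zero of=/tmp/test.img bs=1M iflag=fullblock,count_bytes count=${ALLOCATE} 2>/dev/null
ls -lah /tmp/test.img
```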

Logs from the workflow controller

time="2024-07-15T10:13:34.372Z" level=info msg="Processing workflow" Phase= ResourceVersion=9999 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.378Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=allocate-gvh2r
time="2024-07-15T10:13:34.378Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.378Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.378Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.378Z" level=info msg="Pod node allocate-gvh2r initialized Pending" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.399Z" level=info msg="Created pod: allocate-gvh2r (allocate-gvh2r)" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.399Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.399Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:34.413Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=10003 workflow=allocate-gvh2r
time="2024-07-15T10:13:44.400Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=10003 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:44.401Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:13:44.401Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:13:44.401Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=allocate-gvh2r old.message= old.phase=Pending old.progress=0/1 workflow=allocate-gvh2r
time="2024-07-15T10:13:44.401Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:44.401Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:13:44.415Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=10064 workflow=allocate-gvh2r
time="2024-07-15T10:19:47.639Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=10064 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:47.640Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:19:47.640Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:19:47.640Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:19:47.640Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:47.640Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:47.655Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:57.656Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:19:57.665Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:25:34.606Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:25:34.607Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:25:34.607Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:25:34.608Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:25:34.608Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:25:34.608Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:25:34.629Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:29:36.644Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:29:36.644Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:29:36.644Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:29:36.644Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:29:36.645Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:29:36.645Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:29:36.655Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:39:04.657Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:39:04.658Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:39:04.658Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:39:04.658Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:39:04.659Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:39:04.659Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:39:04.696Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:45:34.608Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:45:34.609Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:45:34.609Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:45:34.609Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:45:34.609Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:45:34.609Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:45:34.628Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:48:20.680Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:48:20.681Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:48:20.681Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:48:20.681Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:48:20.681Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:48:20.681Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:48:20.699Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T10:53:51.681Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:53:51.682Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T10:53:51.682Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:53:51.684Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T10:53:51.685Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:53:51.685Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T10:53:51.701Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:02:18.697Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:02:18.697Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:02:18.697Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:02:18.697Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:02:18.698Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:02:18.698Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:02:18.741Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:05:34.608Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:05:34.609Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:05:34.609Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:05:34.609Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:05:34.609Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:05:34.609Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:05:34.629Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:11:25.709Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:11:25.710Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:11:25.710Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:11:25.710Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:11:25.710Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:11:25.710Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:11:25.729Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:16:33.737Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:16:33.738Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:16:33.738Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:16:33.738Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:16:33.738Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:16:33.738Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:16:33.776Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:22:41.767Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:22:41.768Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:22:41.768Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:22:41.768Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:22:41.768Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:22:41.768Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:22:41.787Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:25:34.609Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:25:34.610Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:25:34.610Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:25:34.610Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:25:34.610Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:25:34.610Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:25:34.627Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:31:04.768Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:31:04.769Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:31:04.769Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:31:04.769Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:31:04.769Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:31:04.769Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:31:04.787Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:38:20.781Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:38:20.782Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:38:20.782Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:38:20.782Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:38:20.783Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:38:20.783Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:38:20.803Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:45:34.610Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:45:34.611Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:45:34.611Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:45:34.611Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:45:34.611Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:45:34.611Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:45:34.626Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:48:07.787Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:48:07.788Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:48:07.788Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:48:07.788Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:48:07.789Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:48:07.789Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:48:07.818Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11665 workflow=allocate-gvh2r
time="2024-07-15T11:53:20.535Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=11665 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:20.537Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:53:20.537Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:53:20.537Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=allocate-gvh2r old.message= old.phase=Running old.progress=0/1 workflow=allocate-gvh2r
time="2024-07-15T11:53:20.537Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:20.537Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:20.542Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/allocate-gvh2r/terminateContainers
time="2024-07-15T11:53:20.545Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:53:20.578Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=36376 workflow=allocate-gvh2r
time="2024-07-15T11:53:21.008Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with terminated\n"
time="2024-07-15T11:53:30.579Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=36376 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:30.580Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:53:30.580Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:53:30.580Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:53:30.580Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:30.580Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:30.585Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/allocate-gvh2r/terminateContainers
time="2024-07-15T11:53:30.585Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:53:30.603Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=36420 workflow=allocate-gvh2r
time="2024-07-15T11:53:30.957Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with terminated\n"
time="2024-07-15T11:53:40.604Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=36420 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:40.605Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:53:40.605Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:53:40.605Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:53:40.605Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:40.605Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:53:40.610Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/allocate-gvh2r/terminateContainers
time="2024-07-15T11:53:40.610Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:53:40.621Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=36420 workflow=allocate-gvh2r
time="2024-07-15T11:53:41.053Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with terminated\n"
time="2024-07-15T11:53:51.013Z" level=info msg="cleaning up pod" action=killContainers key=argo/allocate-gvh2r/killContainers
time="2024-07-15T11:53:51.013Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=9&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:53:51.346Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with killed\n"
time="2024-07-15T11:57:29.802Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=36420 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:57:29.802Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T11:57:29.802Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:57:29.803Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T11:57:29.803Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:57:29.803Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T11:57:29.808Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/allocate-gvh2r/terminateContainers
time="2024-07-15T11:57:29.808Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:57:29.825Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=36420 workflow=allocate-gvh2r
time="2024-07-15T11:57:30.129Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with terminated\n"
time="2024-07-15T11:58:00.130Z" level=info msg="cleaning up pod" action=killContainers key=argo/allocate-gvh2r/killContainers
time="2024-07-15T11:58:00.131Z" level=info msg="https://10.254.0.1:443/api/v1/namespaces/argo/pods/allocate-gvh2r/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=9&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-07-15T11:58:00.497Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=allocate-gvh2r stderr= stdout="killing 1 with killed\n"
time="2024-07-15T12:00:19.329Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=36420 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:19.330Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T12:00:19.330Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:00:19.331Z" level=info msg="Pod failed: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " displayName=allocate-gvh2r namespace=argo pod=allocate-gvh2r templateName=allocate workflow=allocate-gvh2r
time="2024-07-15T12:00:19.331Z" level=info msg="node changed" namespace=argo new.message="The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " new.phase=Failed new.progress=0/1 nodeID=allocate-gvh2r old.message= old.phase=Running old.progress=0/1 workflow=allocate-gvh2r
time="2024-07-15T12:00:19.331Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:19.331Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:19.349Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=38196 workflow=allocate-gvh2r
time="2024-07-15T12:00:29.352Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=38196 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg="Pod failed: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " displayName=allocate-gvh2r namespace=argo pod=allocate-gvh2r templateName=allocate workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:29.353Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:00:29.375Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=38196 workflow=allocate-gvh2r
time="2024-07-15T12:05:34.610Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=38196 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:05:34.611Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T12:05:34.612Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:05:34.612Z" level=info msg="Pod failed: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " displayName=allocate-gvh2r namespace=argo pod=allocate-gvh2r templateName=allocate workflow=allocate-gvh2r
time="2024-07-15T12:05:34.612Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:05:34.612Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:05:34.612Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:05:34.626Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=38196 workflow=allocate-gvh2r
time="2024-07-15T12:06:04.805Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=38196 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg="Pod failed: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " displayName=allocate-gvh2r namespace=argo pod=allocate-gvh2r templateName=allocate workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:06:04.806Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T12:06:04.833Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=38196 workflow=allocate-gvh2r
(....snip..)
time="2024-07-15T15:26:47.076Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=76585 namespace=argo workflow=allocate-gvh2r
time="2024-07-15T15:26:47.076Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=allocate-gvh2r
time="2024-07-15T15:26:47.076Z" level=info msg="task-result changed" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T15:26:47.077Z" level=info msg="Pod failed: The node was low on resource: ephemeral-storage. Container wait was using 24Ki, which exceeds its request of 0. " displayName=allocate-gvh2r namespace=argo pod=allocate-gvh2r templateName=allocate workflow=allocate-gvh2r
time="2024-07-15T15:26:47.077Z" level=info msg="node unchanged" namespace=argo nodeID=allocate-gvh2r workflow=allocate-gvh2r
time="2024-07-15T15:26:47.077Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=allocate-gvh2r
time="2024-07-15T15:26:47.077Z" level=info msg=reconcileAgentPod namespace=argo workflow=allocate-gvh2r
time="2024-07-15T15:26:47.087Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=76585 workflow=allocate-gvh2r

Logs from your workflow's wait container

Error from server (BadRequest): container "wait" in pod "allocate-gvh2r" is terminated
@dnlbauer dnlbauer changed the title Running out of disk on a node does not fail the workflow Running out of disk on a node during artifact packaging does not fail the workflow Jul 15, 2024
@agilgur5

agilgur5 commented Jul 15, 2024

I noticed that this does not fail the workflow itself. It is still in the Running state, even several hours after the step failed.

This is probably a 3.5.2+ or 3.5.5+ regression, i.e. the wait container never writes a completed status to the respective WorkflowTaskResult due to the changes in #11947 or related PRs. That would suggest that the error isn't getting caught for some reason.

This is all probably related to #8526

Yes, in that that is how you increase space; no, in that it probably didn't get stuck in Running on that version: it likely would have failed correctly.

@agilgur5 agilgur5 changed the title Running out of disk on a node during artifact packaging does not fail the workflow Stuck in Running when out of disk on a node during artifact packaging Jul 15, 2024
@agilgur5 agilgur5 added type/regression Regression from previous behavior (a specific type of bug) area/artifacts S3/GCP/OSS/Git/HDFS etc labels Jul 15, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Jul 15, 2024
@agilgur5

Also this is an excellent minimal repro, thanks!

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Jul 15, 2024
@agilgur5

@Garett-MacGowan @juliev0 @shuangkun we've all made various mitigations for this (which are useful in their own right), but I feel like the Controller logic getting stuck in Running is itself incorrect. If the Pod is completed and the WorkflowTaskResult is incomplete, it will never become complete, so the Controller should not wait on that condition. And I think in every case of it getting stuck, the Pod has completed in a failure state.
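The decision rule being proposed here can be sketched in a few lines of Go. This is an illustrative simplification, not Argo Workflows' actual controller code; the names (`reconcile`, `podTerminal`, `taskResultCompleted`) are hypothetical:

```go
package main

import "fmt"

type nodeStatus string

const (
	nodeRunning   nodeStatus = "Running"
	nodeFailed    nodeStatus = "Failed"
	nodeSucceeded nodeStatus = "Succeeded"
)

// reconcile sketches the rule described above: once the pod has reached a
// terminal phase, an incomplete WorkflowTaskResult can never be finalized,
// so the controller should stop waiting and fail the node.
func reconcile(podTerminal, taskResultCompleted bool) nodeStatus {
	if !podTerminal {
		return nodeRunning // pod still active; keep waiting for results
	}
	if !taskResultCompleted {
		// The pod is gone, so the incomplete result will never be
		// completed; fail the node instead of waiting forever.
		return nodeFailed
	}
	return nodeSucceeded // real code would derive this from the pod's exit state
}

func main() {
	fmt.Println(reconcile(true, false))  // pod done, result incomplete -> Failed
	fmt.Println(reconcile(false, false)) // pod still running -> Running
}
```

The key point is the second branch: the current behavior effectively treats "TaskResult incomplete" as "keep waiting", even when the pod that would have completed it is already terminal.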

@agilgur5 agilgur5 added the P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important label Jul 15, 2024
@juliev0

juliev0 commented Jul 16, 2024

I think the case that isn't working is specific to the wait container getting an OOM failure, right (or any other non-graceful shutdown)? I think if the wait container or the main container were to error out normally, the executor should still set the "incomplete" label and then everything should shut down normally.

But you're probably right that there could be some additional check during the taskResult reconciliation to determine if we should stop.
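The graceful-vs-non-graceful distinction can be illustrated with a small Go sketch. This is not Argo's actual executor code; `runStep` and `report` are hypothetical names. The point is that a deferred status report runs on normal returns and handled errors, but never when the process is hard-killed (e.g. OOM-killed or evicted), which is exactly the case that leaves the result incomplete:

```go
package main

import "fmt"

// runStep simulates an executor step that reports completion via a
// deferred callback. The defer fires on any return path within the
// process, but a SIGKILL / OOM-kill never unwinds the stack, so no
// report would be written in that case.
func runStep(report func(completed bool)) (err error) {
	defer func() {
		// Reached on normal return and on handled ("graceful") errors.
		report(true)
	}()
	return fmt.Errorf("step failed gracefully")
}

func main() {
	reported := false
	err := runStep(func(completed bool) { reported = completed })
	fmt.Println(err != nil, reported) // true true: a graceful error still reports
}
```

A hard kill skips the deferred report entirely, which is why only the controller-side reconciliation can catch that whole class of failures.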

@agilgur5

agilgur5 commented Jul 16, 2024

I think the case that isn't working is specific to the wait container getting an OOM failure, right (or any other non-graceful shutdown)?

It's a little bit broader as any unhandled error:

Handling these is still good (what I called "mitigations" above), but they are all specific cases.

But you're probably right that there could be some additional check during the taskResult reconciliation to determine if we should stop.

So yes, it would also be a much broader solution if TaskResult reconciliation didn't get completely stuck in the generic case of "Pod completed but TaskResult not".
Or, in other words, TaskResult reconciliation in the Controller should itself gracefully handle the entire class of unhandled errors.

While there are still some specific cases that would be good to handle gracefully in the Executor, a reconciliation fix would at least prevent users from getting completely stuck.
And there are of course some cases of unavoidable unhandled errors that it would be needed for (this issue and Pod eviction are not necessarily avoidable in all cases).

@jswxstw

jswxstw commented Sep 13, 2024

This issue is fixed by #13491.

# argo get allocate-lf59f
Name:                allocate-lf59f
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Failed
Message:             The node was low on resource: ephemeral-storage. Threshold quantity: 2626078349, available: 2519016Ki. Container wait was using 68Ki, request is 0, has larger consumption of ephemeral-storage. Container main was using 20427300Ki, request is 0, has larger consumption of ephemeral-storage. 
Conditions:          
 PodRunning          False
 Completed           True
Created:             Fri Sep 13 14:17:51 +0800 (4 minutes ago)
Started:             Fri Sep 13 14:17:51 +0800 (4 minutes ago)
Finished:            Fri Sep 13 14:21:49 +0800 (58 seconds ago)
Duration:            3 minutes 58 seconds
Progress:            0/1
ResourcesDuration:   5m45s*(100Mi memory),41s*(1 cpu)
Parameters:          
  allocate:          100G

STEP               TEMPLATE  PODNAME         DURATION  MESSAGE
 ✖ allocate-lf59f  allocate  allocate-lf59f  3m        The node was low on resource: ephemeral-storage. Threshold quantity: 2626078349, available: 2519016Ki. Container wait was using 68Ki, request is 0, has larger consumption of ephemeral-storage. Container main was using 20427300Ki, request is 0, has larger consumption of ephemeral-storage.

@agilgur5 agilgur5 changed the title Stuck in Running when out of disk on a node during artifact packaging v3.5.8: Stuck in Running when out of disk on a node during artifact packaging Oct 8, 2024