
Deleted pods in k8s needs to be handled in argo. #893

Closed
VenkataKarthikP opened this issue Jun 26, 2018 · 1 comment

VenkataKarthikP commented Jun 26, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
BUG Report
What happened:

When the underlying pod is deleted in k8s, the Argo retry node does not spawn a new attempt even though retry attempts are left.

What you expected to happen:

The controller should treat deleted pods as errors and either create a new pod or mark the retry node as errored, whichever is appropriate.

How to reproduce it (as minimally and precisely as possible):
Manually delete the pod of a retry node in k8s, for example as sketched below.
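Something along these lines (the workflow file, namespace, and pod name below are placeholders, not from this report):

  # submit a workflow whose steps define a retryStrategy
  argo submit retry-workflow.yaml --namespace ml-prod

  # while an attempt of a retry node is running, delete its pod out from under the controller
  kubectl get pods --namespace ml-prod
  kubectl delete pod <pod-of-running-attempt> --namespace ml-prod

  # observed: the attempt is marked "pod deleted" but no new attempt is spawned
  argo get <workflow-name> --namespace ml-prod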

Anything else we need to know?:

Environment:

  • Argo version:
argo: v2.1.1
  BuildDate: 2018-05-29T20:38:37Z
  GitCommit: ac241c95c13f08e868cd6f5ee32c9ce273e239ff
  GitTreeState: clean
  GitTag: v2.1.1
  GoVersion: go1.9.3
  Compiler: gc
  Platform: darwin/amd64
  • Kubernetes version:
kubectl version -o yaml
clientVersion:
  buildDate: 2018-06-22T05:40:13Z
  compiler: gc
  gitCommit: 32ac1c9073b132b8ba18aa830f46b77dcceb0723
  gitTreeState: clean
  gitVersion: v1.10.5
  goVersion: go1.9.7
  major: "1"
  minor: "10"
  platform: darwin/amd64
serverVersion:
  buildDate: 2018-02-07T11:55:20Z
  compiler: gc
  gitCommit: d2835416544f298c919e2ead3be3d0864b52323b
  gitTreeState: clean
  gitVersion: v1.9.3
  goVersion: go1.9.2
  major: "1"
  minor: "9"
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
~$ argo get cmg-1529802060
Name:                cmg-1529802060
Namespace:           ml-prod
ServiceAccount:      default
Status:              Running
Created:             Sat Jun 23 18:01:22 -0700 (1 day ago)
Started:             Sat Jun 23 18:01:22 -0700 (1 day ago)
Duration:            1 days 18 hours 37 minutes 52 seconds
STEP                                            PODNAME                                          DURATION  MESSAGE
 ● cmg-1529802060
 ├---✔ model-gen-prep
 |   ├-✖ model-gen-prep(0)                      cmg-1529802060-3308568300  16m       failed with exit code 1
 |   └-✔ model-gen-prep(1)                      cmg-1529802060-1630659305  58m
 └-·-✖ model-gen-training(0:demo:f18_34)                                                                   No more retries left
   | ├-✖ model-gen-training(0:demo:f18_34)(0)   cmg-1529802060-2253993419  10m       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(1)   cmg-1529802060-2858134798  53m       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(2)   cmg-1529802060-2455619037  13s       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(3)   cmg-1529802060-2791318512  7m        failed with exit code 1
   | ├-⚠ model-gen-training(0:demo:f18_34)(4)   cmg-1529802060-241370687   1d        pod deleted
   | ├-✖ model-gen-training(0:demo:f18_34)(5)   cmg-1529802060-845512066   11m       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(6)   cmg-1529802060-711438209   2m        failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(7)   cmg-1529802060-2389347204  14m       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(8)   cmg-1529802060-2792157155  15m       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(9)   cmg-1529802060-2322530918  13s       failed with exit code 1
   | └-✖ model-gen-training(0:demo:f18_34)(10)  cmg-1529802060-3910701520  10m       failed with exit code 1
   ├-✔ model-gen-training(10:demo:f21_49)(0)    cmg-1529802060-1436784580  1h
   ├-✔ model-gen-training(11:demo:f21_99)(0)    cmg-1529802060-1954071774  1h
   ├-● model-gen-training(1:demo:f18_49)
   | ├-✖ model-gen-training(1:demo:f18_49)(0)   cmg-1529802060-3705857692  11m       failed with exit code 1
   | └-⚠ model-gen-training(1:demo:f18_49)(1)   cmg-1529802060-3101716313  1d        pod deleted
   ├-✔ model-gen-training(2:demo:f18_99)(0)     cmg-1529802060-485286116   1h
   ├-✔ model-gen-training(3:demo:m18_34)(0)     cmg-1529802060-3915130641  1h
   ├-✔ model-gen-training(4:demo:m18_49)(0)     cmg-1529802060-3246018548  1h

(screenshot attached: 2018-06-25, 10:02 AM)

jessesuen added this to the v2.2 milestone Aug 28, 2018
jessesuen (Member) commented:

I tried to reproduce this but could not. I was able to confirm that deleted pods are indeed retried:

Name:                parallelism-limit-x5pmh
Namespace:           default
ServiceAccount:      default
Status:              Failed
Message:             child 'parallelism-limit-x5pmh-1567181470' failed
Created:             Tue Aug 28 03:13:24 -0700 (3 minutes ago)
Started:             Tue Aug 28 03:13:24 -0700 (3 minutes ago)
Finished:            Tue Aug 28 03:16:43 -0700 (now)
Duration:            3 minutes 19 seconds

STEP                           PODNAME                             DURATION  MESSAGE
 ✖ parallelism-limit-x5pmh                                                   child 'parallelism-limit-x5pmh-1567181470' failed
 └-·-✔ sleep(0:this)(0)        parallelism-limit-x5pmh-2126383399  1m
   ├-✔ sleep(1:workflow)(0)    parallelism-limit-x5pmh-1762536779  1m
   ├-✔ sleep(2:should)(0)      parallelism-limit-x5pmh-1915675102  1m
   ├-✔ sleep(3:take)
   | ├-✖ sleep(3:take)(0)      parallelism-limit-x5pmh-1494225199  40s       pod failed for unknown reason
   | └-✔ sleep(3:take)(1)      parallelism-limit-x5pmh-1024598962  1m
   ├-✔ sleep(4:at)(0)          parallelism-limit-x5pmh-4027640022  1m
   ├-✔ sleep(5:least)(0)       parallelism-limit-x5pmh-3848287587  1m
   ├-✔ sleep(6:60)(0)          parallelism-limit-x5pmh-2096368459  1m
   ├-✔ sleep(7:seconds)(0)     parallelism-limit-x5pmh-741286865   1m
   ├-✔ sleep(8:to)(0)          parallelism-limit-x5pmh-3368753014  1m
   └-✖ sleep(9:complete)                                                     No more retries left
     ├-⚠ sleep(9:complete)(0)  parallelism-limit-x5pmh-974189773   3m        pod deleted
     └-⚠ sleep(9:complete)(1)  parallelism-limit-x5pmh-2383656864  1m        pod deleted

In the above example, I deleted a few pods out from underneath the controller. The pods were retried as expected.
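For reference, something along these lines (the exact commands are not given here; this assumes the standard workflows.argoproj.io/workflow pod label):

  # list the pods created for the workflow and delete one whose step is still running
  kubectl get pods -l workflows.argoproj.io/workflow=parallelism-limit-x5pmh
  kubectl delete pod parallelism-limit-x5pmh-974189773

  # the deleted attempt is marked "pod deleted" and, while retries remain, a new attempt is created
  argo get parallelism-limit-x5pmh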

Looking now at the output in the original bug report, I actually do not see evidence that this isn't working as intended.

Here we see that model-gen-training(1:demo:f18_49) has not yet been marked failed.

   ├-● model-gen-training(1:demo:f18_49)
   | ├-✖ model-gen-training(1:demo:f18_49)(0)   cmg-1529802060-3705857692  11m       failed with exit code 1
   | └-⚠ model-gen-training(1:demo:f18_49)(1)   cmg-1529802060-3101716313  1d        pod deleted

I think the real issue is that the model-gen-training(1:demo:f18_49) node was stuck and did not proceed with creating more pods for the retries. The stuck retry node may be the result of some bad metadata that was being formed in v2.1.1, specifically with respect to retries. v2.2 fixes some metadata issues affecting retries, so I think this may already be fixed.

Here is another issue where a retry node was stuck: #965
