
Deleted pods in k8s needs to be handled in argo. #893

Closed
VenkataKarthikP opened this issue Jun 26, 2018 · 1 comment

VenkataKarthikP commented Jun 26, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
BUG Report
What happened:

When the underlying pod is deleted in k8s, the Argo retry node does not spawn a new attempt even though retry attempts are left.

What you expected to happen:

The controller should treat deleted pods as errors and either create a new pod or mark the retry node as errored, whichever is appropriate.

How to reproduce it (as minimally and precisely as possible):
Manually delete the pod of a retry node in k8s, for example as sketched below.
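Something along these lines (the workflow file, namespace, and pod name below are placeholders, not from this report):

  # submit a workflow whose steps define a retryStrategy
  argo submit retry-workflow.yaml --namespace ml-prod

  # while an attempt of a retry node is running, delete its pod out from under the controller
  kubectl get pods --namespace ml-prod
  kubectl delete pod <pod-of-running-attempt> --namespace ml-prod

  # observed: the attempt is marked "pod deleted" but no new attempt is spawned
  argo get <workflow-name> --namespace ml-prod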

Anything else we need to know?:

Environment:

  • Argo version:
argo: v2.1.1
  BuildDate: 2018-05-29T20:38:37Z
  GitCommit: ac241c95c13f08e868cd6f5ee32c9ce273e239ff
  GitTreeState: clean
  GitTag: v2.1.1
  GoVersion: go1.9.3
  Compiler: gc
  Platform: darwin/amd64
  • Kubernetes version:
kubectl version -o yaml
clientVersion:
  buildDate: 2018-06-22T05:40:13Z
  compiler: gc
  gitCommit: 32ac1c9073b132b8ba18aa830f46b77dcceb0723
  gitTreeState: clean
  gitVersion: v1.10.5
  goVersion: go1.9.7
  major: "1"
  minor: "10"
  platform: darwin/amd64
serverVersion:
  buildDate: 2018-02-07T11:55:20Z
  compiler: gc
  gitCommit: d2835416544f298c919e2ead3be3d0864b52323b
  gitTreeState: clean
  gitVersion: v1.9.3
  goVersion: go1.9.2
  major: "1"
  minor: "9"
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
~$ argo get cmg-1529802060
Name:                cmg-1529802060
Namespace:           ml-prod
ServiceAccount:      default
Status:              Running
Created:             Sat Jun 23 18:01:22 -0700 (1 day ago)
Started:             Sat Jun 23 18:01:22 -0700 (1 day ago)
Duration:            1 days 18 hours 37 minutes 52 seconds
STEP                                            PODNAME                                          DURATION  MESSAGE
 ● cmg-1529802060
 ├---✔ model-gen-prep
 |   ├-✖ model-gen-prep(0)                      cmg-1529802060-3308568300  16m       failed with exit code 1
 |   └-✔ model-gen-prep(1)                      cmg-1529802060-1630659305  58m
 └-·-✖ model-gen-training(0:demo:f18_34)                                                                   No more retries left
   | ├-✖ model-gen-training(0:demo:f18_34)(0)   cmg-1529802060-2253993419  10m       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(1)   cmg-1529802060-2858134798  53m       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(2)   cmg-1529802060-2455619037  13s       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(3)   cmg-1529802060-2791318512  7m        failed with exit code 1
   | ├-⚠ model-gen-training(0:demo:f18_34)(4)   cmg-1529802060-241370687   1d        pod deleted
   | ├-✖ model-gen-training(0:demo:f18_34)(5)   cmg-1529802060-845512066   11m       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(6)   cmg-1529802060-711438209   2m        failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(7)   cmg-1529802060-2389347204  14m       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(8)   cmg-1529802060-2792157155  15m       failed with exit code 1
   | ├-✖ model-gen-training(0:demo:f18_34)(9)   cmg-1529802060-2322530918  13s       failed with exit code 1
   | └-✖ model-gen-training(0:demo:f18_34)(10)  cmg-1529802060-3910701520  10m       failed with exit code 1
   ├-✔ model-gen-training(10:demo:f21_49)(0)    cmg-1529802060-1436784580  1h
   ├-✔ model-gen-training(11:demo:f21_99)(0)    cmg-1529802060-1954071774  1h
   ├-● model-gen-training(1:demo:f18_49)
   | ├-✖ model-gen-training(1:demo:f18_49)(0)   cmg-1529802060-3705857692  11m       failed with exit code 1
   | └-⚠ model-gen-training(1:demo:f18_49)(1)   cmg-1529802060-3101716313  1d        pod deleted
   ├-✔ model-gen-training(2:demo:f18_99)(0)     cmg-1529802060-485286116   1h
   ├-✔ model-gen-training(3:demo:m18_34)(0)     cmg-1529802060-3915130641  1h
   ├-✔ model-gen-training(4:demo:m18_49)(0)     cmg-1529802060-3246018548  1h

(screenshot attached: 2018-06-25, 10:02 AM)

jessesuen added this to the v2.2 milestone Aug 28, 2018
jessesuen (Member) commented:

I tried to reproduce this but could not. I was able to confirm that deleted pods are indeed retried:

Name:                parallelism-limit-x5pmh
Namespace:           default
ServiceAccount:      default
Status:              Failed
Message:             child 'parallelism-limit-x5pmh-1567181470' failed
Created:             Tue Aug 28 03:13:24 -0700 (3 minutes ago)
Started:             Tue Aug 28 03:13:24 -0700 (3 minutes ago)
Finished:            Tue Aug 28 03:16:43 -0700 (now)
Duration:            3 minutes 19 seconds

STEP                           PODNAME                             DURATION  MESSAGE
 ✖ parallelism-limit-x5pmh                                                   child 'parallelism-limit-x5pmh-1567181470' failed
 └-·-✔ sleep(0:this)(0)        parallelism-limit-x5pmh-2126383399  1m
   ├-✔ sleep(1:workflow)(0)    parallelism-limit-x5pmh-1762536779  1m
   ├-✔ sleep(2:should)(0)      parallelism-limit-x5pmh-1915675102  1m
   ├-✔ sleep(3:take)
   | ├-✖ sleep(3:take)(0)      parallelism-limit-x5pmh-1494225199  40s       pod failed for unknown reason
   | └-✔ sleep(3:take)(1)      parallelism-limit-x5pmh-1024598962  1m
   ├-✔ sleep(4:at)(0)          parallelism-limit-x5pmh-4027640022  1m
   ├-✔ sleep(5:least)(0)       parallelism-limit-x5pmh-3848287587  1m
   ├-✔ sleep(6:60)(0)          parallelism-limit-x5pmh-2096368459  1m
   ├-✔ sleep(7:seconds)(0)     parallelism-limit-x5pmh-741286865   1m
   ├-✔ sleep(8:to)(0)          parallelism-limit-x5pmh-3368753014  1m
   └-✖ sleep(9:complete)                                                     No more retries left
     ├-⚠ sleep(9:complete)(0)  parallelism-limit-x5pmh-974189773   3m        pod deleted
     └-⚠ sleep(9:complete)(1)  parallelism-limit-x5pmh-2383656864  1m        pod deleted

In the above example, I deleted a few pods out from underneath the controller. The pods were retried as expected.
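For reference, something along these lines (the exact commands are not given here; this assumes the standard workflows.argoproj.io/workflow pod label):

  # list the pods created for the workflow and delete one whose step is still running
  kubectl get pods -l workflows.argoproj.io/workflow=parallelism-limit-x5pmh
  kubectl delete pod parallelism-limit-x5pmh-974189773

  # the deleted attempt is marked "pod deleted" and, while retries remain, a new attempt is created
  argo get parallelism-limit-x5pmh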

Looking now at the output in the original bug report, I actually do not see evidence that this isn't working as intended.

Here we see that model-gen-training(1:demo:f18_49) has not yet been marked failed.

   ├-● model-gen-training(1:demo:f18_49)
   | ├-✖ model-gen-training(1:demo:f18_49)(0)   cmg-1529802060-3705857692  11m       failed with exit code 1
   | └-⚠ model-gen-training(1:demo:f18_49)(1)   cmg-1529802060-3101716313  1d        pod deleted

I think the real issue is that the model-gen-training(1:demo:f18_49) node was stuck and did not proceed with creating more pods for the retries. The stuck retry node may be the result of some bad metadata that was being formed in v2.1.1, specifically with respect to retries. v2.2 fixes some metadata issues affecting retries, so I think this may already be fixed.

Here is another issue where a retry node was stuck: #965
