
Argo pods deleted during auto-scaling event #652

Closed · jessesuen opened this issue Jan 3, 2018 · 2 comments

Comments
@jessesuen
Member

A user reports that when they scheduled a large number of parallel steps against an initially undersized cluster, GKE auto-scaled the cluster, which resulted in the deletion of many pods.

STEP                              PODNAME              MESSAGE
✖ hm-prssb
├---✔ sequence2fasta             hm-prssb-3873877291
├---✔ blast                      hm-prssb-1133494656
├-·-✔ sparksx                    hm-prssb-3498283319
| ├-✔ hhsearch                   hm-prssb-1713851375
| └-✔ make-fragments             hm-prssb-3503857521
├---✔ merge-alignments           hm-prssb-1288511333
├---✔ threading                  hm-prssb-3509650727
├---✔ makecst                    hm-prssb-3817660954
├---✔ prepare-hybridize-payload  hm-prssb-3414254980
├---✔ models-acummulator         hm-prssb-1649970139
├---✔ create-endpoint            hm-prssb-63038533
├---✔ generate-sampling          hm-prssb-1976157333
└-·-⚠ hybridize(0)               hm-prssb-2120165667  pod deleted
  ├-⚠ hybridize(1)               hm-prssb-1650539430  pod deleted
  ├-⚠ hybridize(2)               hm-prssb-2321791285  pod deleted
  ├-⚠ hybridize(3)               hm-prssb-4268142184  pod deleted
  ├-⚠ hybridize(4)               hm-prssb-3865626423  pod deleted
  ├-⚠ hybridize(5)               hm-prssb-174800506   pod deleted
  ├-⚠ hybridize(6)               hm-prssb-1651378073  pod deleted
  ├-⚠ hybridize(7)               hm-prssb-2255519452  pod deleted
  ├-⚠ hybridize(8)               hm-prssb-1582001931  pod deleted

Our suspicion is that when GKE auto-scales, it rebalances pods across the newly available nodes by simply deleting them, which the controller does not handle gracefully. When the workflow was resubmitted, it ran through successfully the second time.

This issue is to research and attempt to reproduce this behavior and, if confirmed, provide guidance on how to mitigate it.
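For what it's worth, one possible mitigation (an assumption on my part, not something confirmed in this thread) would be to mark workflow pods as not safe to evict, so the cluster autoscaler leaves them alone when it rebalances nodes. A minimal sketch, assuming templates accept pod metadata annotations; the image name is illustrative:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hybridize-
spec:
  entrypoint: hybridize
  templates:
  - name: hybridize
    metadata:
      annotations:
        # Ask the cluster autoscaler not to evict this step's pod
        # during a scale-down/rebalance event.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    container:
      image: example/hybridize:latest  # illustrative
      command: [hybridize]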

It's likely the retry feature can solve this, but steps should not be retried automatically, since we cannot assume users' steps are idempotent.
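As a sketch of what an explicit opt-in might look like, assuming a per-template retryStrategy field (the limit value and image name are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hybridize-
spec:
  entrypoint: hybridize
  templates:
  - name: hybridize
    # Opt-in retries: only safe if this step is idempotent.
    retryStrategy:
      limit: 3  # illustrative retry budget
    container:
      image: example/hybridize:latest  # illustrative
      command: [hybridize]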

@jessesuen
Member Author

/cc @javierbq

@jessesuen
Member Author

Closing this unless it can be reproduced. Since filing this issue, a number of bugs were discovered in the controller that caused it to incorrectly detect "pod deleted"; these were fixed in the 2.0.0-alpha3 and 2.0.0-beta1 releases.
