A user reports that when scheduling a large number of parallel steps against an initially undersized cluster, GKE auto-scaled the cluster, which resulted in the deletion of many pods.
STEP PODNAME MESSAGE
✖ hm-prssb
├---✔ sequence2fasta hm-prssb-3873877291
├---✔ blast hm-prssb-1133494656
├-·-✔ sparksx hm-prssb-3498283319
| ├-✔ hhsearch hm-prssb-1713851375
| └-✔ make-fragments hm-prssb-3503857521
├---✔ merge-alignments hm-prssb-1288511333
├---✔ threading hm-prssb-3509650727
├---✔ makecst hm-prssb-3817660954
├---✔ prepare-hybridize-payload hm-prssb-3414254980
├---✔ models-acummulator hm-prssb-1649970139
├---✔ create-endpoint hm-prssb-63038533
├---✔ generate-sampling hm-prssb-1976157333
└-·-⚠ hybridize(0) hm-prssb-2120165667 pod deleted
├-⚠ hybridize(1) hm-prssb-1650539430 pod deleted
├-⚠ hybridize(2) hm-prssb-2321791285 pod deleted
├-⚠ hybridize(3) hm-prssb-4268142184 pod deleted
├-⚠ hybridize(4) hm-prssb-3865626423 pod deleted
├-⚠ hybridize(5) hm-prssb-174800506 pod deleted
├-⚠ hybridize(6) hm-prssb-1651378073 pod deleted
├-⚠ hybridize(7) hm-prssb-2255519452 pod deleted
├-⚠ hybridize(8) hm-prssb-1582001931 pod deleted
Our suspicion is that when GKE auto-scales, it rebalances pods across the newly available nodes by simply deleting them, which the controller does not handle gracefully. When the workflow was resubmitted, it ran through successfully the second time.
This issue is to research and try to reproduce this behavior and, if it is confirmed as known behavior, provide guidance on how to mitigate it.
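One possible mitigation, as a sketch only (not verified against this cluster), is to mark the workflow pods as not safe to evict so the cluster autoscaler leaves them alone when it scales down or rebalances nodes. The template name, image, and command below are placeholders based on the step names above:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hm-
    spec:
      entrypoint: hybridize
      templates:
      - name: hybridize
        metadata:
          annotations:
            # The cluster autoscaler will not evict pods carrying this
            # annotation when it consolidates or rebalances nodes.
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        container:
          image: my-registry/hybridize:latest   # placeholder image
          command: ["run-hybridize"]            # placeholder command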
It's likely the retry feature can address this, but steps should not be retried automatically, since we cannot assume users' steps are idempotent. A sketch of opting in per template is shown below.
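For steps a user knows to be idempotent, explicitly opting into retries on that template is a possible workaround. A minimal sketch using a per-template retryStrategy (the limit value is arbitrary, and the image is a placeholder):

      templates:
      - name: hybridize
        retryStrategy:
          limit: 3          # retry a deleted or failed pod up to three times
        container:
          image: my-registry/hybridize:latest   # placeholder image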
Closing this unless reproduced. Since filing this issue, a number of problems were discovered in the controller where it incorrectly detected "pod deleted"; these were fixed in the 2.0.0-alpha3 and 2.0.0-beta1 releases.