A user reports that when scheduling a large number of parallel steps against an initially undersized cluster, GKE auto-scaled the cluster, which resulted in the deletion of many pods.
STEP PODNAME MESSAGE
✖ hm-prssb
├---✔ sequence2fasta hm-prssb-3873877291
├---✔ blast hm-prssb-1133494656
├-·-✔ sparksx hm-prssb-3498283319
| ├-✔ hhsearch hm-prssb-1713851375
| └-✔ make-fragments hm-prssb-3503857521
├---✔ merge-alignments hm-prssb-1288511333
├---✔ threading hm-prssb-3509650727
├---✔ makecst hm-prssb-3817660954
├---✔ prepare-hybridize-payload hm-prssb-3414254980
├---✔ models-acummulator hm-prssb-1649970139
├---✔ create-endpoint hm-prssb-63038533
├---✔ generate-sampling hm-prssb-1976157333
└-·-⚠ hybridize(0) hm-prssb-2120165667 pod deleted
├-⚠ hybridize(1) hm-prssb-1650539430 pod deleted
├-⚠ hybridize(2) hm-prssb-2321791285 pod deleted
├-⚠ hybridize(3) hm-prssb-4268142184 pod deleted
├-⚠ hybridize(4) hm-prssb-3865626423 pod deleted
├-⚠ hybridize(5) hm-prssb-174800506 pod deleted
├-⚠ hybridize(6) hm-prssb-1651378073 pod deleted
├-⚠ hybridize(7) hm-prssb-2255519452 pod deleted
├-⚠ hybridize(8) hm-prssb-1582001931 pod deleted
Our suspicion is that when GKE auto-scales, it rebalances pods across the newly available nodes by simply deleting them, which the controller does not handle gracefully. When the workflow was resubmitted, it ran through successfully the second time.
This issue is to research and try to reproduce this behavior and, if it is confirmed as known behavior, provide guidance on how to mitigate it.
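One possible mitigation, as a sketch only (not verified against this cluster), is to mark the workflow pods as not safe to evict so the cluster autoscaler leaves them alone when it scales down or rebalances nodes. The template name, image, and command below are placeholders based on the step names above:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hm-
    spec:
      entrypoint: hybridize
      templates:
      - name: hybridize
        metadata:
          annotations:
            # The cluster autoscaler will not evict pods carrying this
            # annotation when it consolidates or rebalances nodes.
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        container:
          image: my-registry/hybridize:latest   # placeholder image
          command: ["run-hybridize"]            # placeholder command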
It's likely the retry feature can address this, but steps should not be retried automatically, since we cannot assume users' steps are idempotent. A sketch of opting in per template is shown below.
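For steps a user knows to be idempotent, explicitly opting into retries on that template is a possible workaround. A minimal sketch using a per-template retryStrategy (the limit value is arbitrary, and the image is a placeholder):

      templates:
      - name: hybridize
        retryStrategy:
          limit: 3          # retry a deleted or failed pod up to three times
        container:
          image: my-registry/hybridize:latest   # placeholder image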
Closing this unless reproduced. Since filing this issue, a number of problems were discovered in the controller where it incorrectly detected "pod deleted"; these were fixed in the 2.0.0-alpha3 and 2.0.0-beta1 releases.