Handle pod deletes outside workflow lifecycle #1414

Closed
logicfox opened this issue Jun 11, 2019 · 4 comments

@logicfox
Contributor

Possibly a suggestion/feature request. I ran into an issue similar to #893 when Kured restarted a node on which a pod was executing a workflow step. This triggered the handling mechanism here, which marked the workflow step as failed with the message "pod deleted".

Could this scenario be handled better by injecting a pre-stop hook into the pod spec, so that the pod notifies the workflow-controller when it is deleted outside of the workflow lifecycle?

https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/
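
To illustrate the idea, here is a minimal sketch of a container-level pre-stop hook using the standard Kubernetes `lifecycle.preStop` field. The pod name, image, and the `/hooks/notify-controller.sh` script are hypothetical placeholders; Argo does not ship such a script, and how the controller would receive the notification is left open.

```yaml
# Sketch only: a pod with a preStop hook, as it might look if the
# workflow-controller injected one into each step pod.
apiVersion: v1
kind: Pod
metadata:
  name: workflow-step-example        # hypothetical name
spec:
  terminationGracePeriodSeconds: 30  # the hook must finish within this window
  containers:
    - name: main
      image: alpine:3.19             # placeholder image
      command: ["sh", "-c", "sleep 3600"]
      lifecycle:
        preStop:
          exec:
            # Runs before the container is sent the TERM signal,
            # e.g. when the node is drained.
            # notify-controller.sh is hypothetical and not part of Argo.
            command: ["sh", "-c", "/hooks/notify-controller.sh"]
```

Note that a pre-stop hook only runs on graceful termination (eviction, drain, `kubectl delete`); it would not fire if the node itself crashes or loses power.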

@sarabala1979 sarabala1979 added the type/feature Feature request label Jul 15, 2019
@audriusrudalevicius

audriusrudalevicius commented Feb 21, 2020

I ran into the same issue with v2.6.0-rc1. To reproduce, just delete a node running one or more workflow pods. The whole workflow gets stuck in the Running state because some pods are marked "pod deleted" and are never retried. The expected behaviour is that the pod is rescheduled with a retry. I found that it works with 2.4.3.

@alexec alexec added type/bug and removed type/feature Feature request labels Feb 21, 2020
@alexec alexec added this to the Backlog milestone Feb 21, 2020
@whynowy whynowy self-assigned this Feb 28, 2020
@whynowy
Member

whynowy commented Feb 28, 2020

@audriusrudalevicius - Could you give more detail about your case? Argo is able to handle a pod being deleted outside of the workflow lifecycle: in general, if the pod is deleted, the workflow will retry the step (if a retry strategy is in your spec) or mark it as failed. I want to see whether there's a bug that stops this from working in your case.

@audriusrudalevicius

I found the issue. The problem was in my workflow, after I upgraded Argo from 2.4.3 to 2.6.0. I didn't change the retry parameters: before, they were {retryStrategy: limit: 10} (the only form supported by that version). After the upgrade I got the message: level=info msg="Node not set to be retried after status: Error". My fault, I didn't notice this message in the workflow-controller logs. The fix was to change the workflow to {retryStrategy: limit: 10, retryPolicy: Always, backoff: ...}. I added backoff because a limit of 10 can be reached quickly. Now it works. Thanks! Maybe this information will help others.
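
For anyone applying the same fix, a minimal sketch of a workflow with this retry configuration might look like the following. The workflow name, image, and command are placeholders, and the `backoff` values are illustrative assumptions only; the actual values used above were not given.

```yaml
# Sketch only: retry a step on both failures and errors (such as a deleted pod).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-example-   # hypothetical name
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: 10
        retryPolicy: "Always"    # also retry on Error status, not just Failed
        backoff:
          duration: "10s"        # assumed value: wait before the first retry
          factor: 2              # assumed value: multiply the wait each retry
          maxDuration: "10m"     # assumed value: stop retrying after this long
      container:
        image: alpine:3.19       # placeholder image
        command: ["sh", "-c", "echo hello"]
```

Without `retryPolicy`, only steps that end in the Failed status are retried, which is why a pod deleted with the node (status Error) produced the "Node not set to be retried" log line.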

@simster7
Member

simster7 commented Mar 2, 2020

Closing, feel free to reopen if necessary

@simster7 simster7 closed this as completed Mar 2, 2020
icecoffee531 pushed a commit to icecoffee531/argo-workflows that referenced this issue Jan 5, 2022