fix: completed workflow tracking #12198
Conversation
Problem: The workflow cache can have out-of-date information in it which will resurrect a completed workflow. In this scenario, with an on-exit handler using `when: "{{workflow.status}} != Succeeded"`, it may fire the exit handler erroneously. Firing enough of this workflow into a real cluster fast enough will reproduce this - tested in 3.4.8, 3.4.9, and master:

```
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exit-
spec:
  entrypoint: entrypoint
  onExit: exit-handler
  templates:
    - name: entrypoint
      container:
        image: quay.io/pch/whalesay:latest
        command: [ cowsay ]
        args: [ "hello world" ]
        resources:
          limits:
            memory: 32Mi
            cpu: 100m
    - name: ohnoes
      container:
        image: quay.io/pch/whalesay:latest
        command: [ cowsay ]
        args: [ "oh noes" ]
        resources:
          limits:
            memory: 32Mi
            cpu: 100m
    - name: exit-handler
      steps:
        - - name: ohnoes
            when: "{{workflow.status}} != Succeeded"
            template: ohnoes
```

It does not reproduce easily on a toy cluster in k3d, I believe because API latencies there are too small and the whole machine starts to fall apart before the problem happens. It happens rarely in real workloads, but if `ohnoes` is replaced by a notification you'll receive a notification that your workflow has failed, even though it has succeeded when you go and look at it.

Story: The workflow controller is busy, and the k8s API is busy. The k8s API is providing workflow updates which run through the workflow informer cache. Normally these cannot go 'backwards'. Backwards would be where an update to the workflow takes our view of the workflow back to an earlier point in time because we've received a high-latency update from the k8s API. We prevent many of these using the UpdateFunc handler, which stops us reconciling at a point we've already seen.

The problem is that as a workflow completes we drop it from the informer cache (it no longer meets `reconciliationNeeded`). This allows it to come back into the cache in an old form as a new resource. If this happens we attempt to reconcile it. In the event that the pod representing the node that was running, according to the resurrected form of the workflow, doesn't exist, we're going to (correctly) treat the workflow as in error.

So we have a workflow which is exiting in error; we test the onExit `when` clause, decide it's true, and launch the pod to handle it (`ohnoes` in my example). This runs, regardless of the fact that the workflow controller then discovers that the workflow resourceVersion is out of date and cannot write the current state back to the cluster.

Fixes considered:

* Attempt to ensure we're up to date before actioning the onExit. This seemed like a very convoluted code path for doing special onExit handling and likely to introduce more bugs than it fixed. It would either put much more load on the API by causing the controller to get the latest version, or attempt the write back to see if we were out of date.
* Attempt to prevent old-version re-entry in the cache rather than at the reconcile stage. I tried this, but we still have a hole where it'll make it through.

So the fix is to separately track workflows that have been evicted from the cache and not process them at all. This tracking cache has a lifetime of 10 minutes, as I could see latencies of up to a few minutes during heavy workload testing.

Signed-off-by: Alan Clucas <[email protected]>
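For illustration, here is a minimal sketch of the kind of time-bounded tracking the fix describes. It is not the PR's actual code: the `completedTracker` type, its methods, and `recentlyCompletedTTL` are hypothetical names, and the real controller wiring (where a completed workflow would be marked on eviction and the tracker consulted before reconciling anything that reappears) is omitted.

```go
// Hypothetical sketch of the approach described above, not the PR's actual code:
// track workflows that have completed and been evicted from the informer cache,
// and refuse to reconcile any copy of them that reappears within a bounded window.
package controller

import (
	"sync"
	"time"
)

// recentlyCompletedTTL mirrors the 10-minute lifetime mentioned above; observed
// API latencies of a few minutes under load motivate the generous window.
const recentlyCompletedTTL = 10 * time.Minute

type completedTracker struct {
	mu   sync.Mutex
	seen map[string]time.Time // key: namespace/name of the evicted workflow
}

func newCompletedTracker() *completedTracker {
	return &completedTracker{seen: make(map[string]time.Time)}
}

// MarkCompleted records that a workflow finished and was dropped from the cache.
func (t *completedTracker) MarkCompleted(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.seen[key] = time.Now()
}

// ShouldReject reports whether a workflow reappearing in the cache is a stale
// copy of one we already saw complete, and therefore must not be reconciled.
func (t *completedTracker) ShouldReject(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	completedAt, ok := t.seen[key]
	return ok && time.Since(completedAt) < recentlyCompletedTTL
}
```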
Oh, just thinking... the key is the Workflow name, right? So, what happens when somebody legitimately creates a new Workflow of the same name as the previous Workflow within 10 minutes of the previous one being deleted?
I thought I'd handled this, but obviously not. This went through many iterations, so I'm blaming that! I've added a check when we may reject to ensure that the Phase is not "". This will allow brand new workflows back in, but any that have already been handled by the controller will go through potential rejection.
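Continuing the hypothetical sketch above, the guard described in this comment might look something like the following; again, this is illustrative only and the names are assumptions, not the PR's actual code.

```go
// Revised ShouldReject: a workflow whose Status.Phase is still "" has never been
// processed by the controller, so a brand-new workflow that happens to reuse a
// recently completed name is always allowed through; anything the controller has
// already handled remains subject to rejection within the TTL window.
func (t *completedTracker) ShouldReject(key string, phase string) bool {
	if phase == "" {
		return false
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	completedAt, ok := t.seen[key]
	return ok && time.Since(completedAt) < recentlyCompletedTTL
}
```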
Nice catch! I didn't think about this myself.
Generally looks good, but I couldn't find the change for:
> I've added a check when we may reject to ensure that the Phase is not "". This will allow brand new workflows back in, but any that have already been handled by the controller will go through potential rejection.
There also seems to be a redundant throttler.Remove(); after those changes I would be happy to approve.
What this issue opens up, though, is the question of where else there are similar bugs in the code. Would this be the only interaction with the k8s API where we need to care about this, or is this a bug that would impact other things as well, say for example ConfigMaps (I remember seeing a ConfigMap cache somewhere)?
mutex->RWMutex, more comments, removed unneeded throttler.Remove
Signed-off-by: Alan Clucas <[email protected]>
LGTM
There was a Windows Unit Test failure related to the HTTP artifact code. That had nothing to do with these changes, so I merged this.
Verification
Tested on EKS with three 4-core nodes and the above workflow injected rapidly.
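For anyone wanting to reproduce this, a rough sketch of the kind of rapid injection used here, assuming the manifest above is saved as `exit.yaml` and `kubectl` points at the test cluster; the loop count and pacing are arbitrary and not from the PR.

```go
// Hypothetical reproduction helper: repeatedly submits the example workflow so
// the controller and the k8s API stay busy enough for stale cache updates to race.
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	for i := 0; i < 500; i++ {
		// `kubectl create` is safe to call in a loop because the manifest uses
		// generateName, so every submission produces a uniquely named Workflow.
		out, err := exec.Command("kubectl", "create", "-f", "exit.yaml").CombinedOutput()
		if err != nil {
			log.Printf("submission %d failed: %v: %s", i, err, out)
		}
		time.Sleep(200 * time.Millisecond) // pacing; tune for the target cluster
	}
}
```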