fix: completed workflow tracking #12198

Merged · 2 commits into argoproj:master · Nov 28, 2023

Conversation

@Joibel (Member) commented Nov 14, 2023

Problem: The workflow informer cache can contain out-of-date information that resurrects a completed workflow. In that scenario, an on-exit handler with `when: "{{workflow.status}} != Succeeded"` may fire erroneously.

Firing enough copies of this workflow into a real cluster fast enough will reproduce this (tested on 3.4.8, 3.4.9, and master):

```
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exit-
spec:
  entrypoint: entrypoint
  onExit: exit-handler
  templates:
    - name: entrypoint
      container:
        image: quay.io/pch/whalesay:latest
        command: [ cowsay ]
        args: [ "hello world" ]
        resources:
          limits:
            memory: 32Mi
            cpu: 100m
    - name: ohnoes
      container:
        image: quay.io/pch/whalesay:latest
        command: [ cowsay ]
        args: [ "oh noes" ]
        resources:
          limits:
            memory: 32Mi
            cpu: 100m
    - name: exit-handler
      steps:
        - - name: ohnoes
            when: "{{workflow.status}} != Succeeded"
            template: ohnoes
```

It does not reproduce easily on a toy cluster in k3d, I believe because API latencies there are too small and the whole machine starts to fall apart before the problem can occur.

It happens rarely in real workloads, but if `ohnoes` is replaced by a notification you'll receive a notification that your workflow has failed, even though it has succeeded when you go and look at it.

Story: The workflow controller is busy, and the k8s API is busy. The k8s API is providing workflow updates which run through the workflow informer cache. Normally these cannot go 'backwards'.

Backwards would be where an update to the workflow takes our view of it back to an earlier point in time because we've received a high-latency update from the k8s API. We prevent many of these using the `UpdateFunc` handler, which stops us reconciling a version we've already seen.
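
As a rough illustration of that general client-go pattern (not the controller's actual handler; the `queue` and `enqueue` names here are hypothetical):

```go
package controller

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// workflowEventHandlers returns informer handlers that only enqueue a
// workflow for reconciliation when an update actually carries new state.
func workflowEventHandlers(queue workqueue.RateLimitingInterface) cache.ResourceEventHandlerFuncs {
	enqueue := func(obj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	}
	return cache.ResourceEventHandlerFuncs{
		AddFunc: enqueue,
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldWf, okOld := oldObj.(*unstructured.Unstructured)
			newWf, okNew := newObj.(*unstructured.Unstructured)
			// If the resourceVersion hasn't changed we've already seen (and
			// possibly reconciled) this state, so don't requeue it.
			if okOld && okNew && oldWf.GetResourceVersion() == newWf.GetResourceVersion() {
				return
			}
			enqueue(newObj)
		},
		DeleteFunc: enqueue,
	}
}
```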

The problem is that as a workflow completes we drop it from the informer cache (it no longer meets `reconciliationNeeded`). This allows it to come back into the cache in an old form, as if it were a new resource.

If this happens we attempt to reconcile it. If the pod backing the node that was running in that stale copy of the workflow no longer exists, we (correctly) treat the workflow as errored.

So we have a workflow that is exiting in error: we test the onExit `when` clause, decide it's true, and launch the pod to handle it (`ohnoes` in my example). That pod runs regardless of the fact that the workflow controller then discovers the workflow's resourceVersion is out of date and cannot write the current state back to the cluster.

Fixes considered:

  • Attempt to ensure we're up to date before actioning the onExit. This seemed like a very convoluted code path for special onExit handling and likely to introduce more bugs than it fixed. It would either put much more load on the API by making the controller fetch the latest version, or require attempting the write-back just to see whether we were out of date.
  • Attempt to prevent old-version re-entry at the cache rather than at the reconcile stage. I tried this, but there is still a hole where a stale copy can make it through.

So the fix is to separately track workflows that have been evicted from the cache and not process them at all; a sketch of the idea follows. This tracking cache has a lifetime of 10 minutes, as I saw latencies of up to a few minutes during heavy-workload testing.
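
A minimal Go sketch of that idea, not the PR's actual implementation; the type and method names are hypothetical, and the RWMutex mirrors the follow-up commit:

```go
package controller

import (
	"sync"
	"time"
)

// recentlyCompletedTTL is the window during which a completed workflow is
// refused re-entry; 10 minutes comfortably exceeds the observed API latencies.
const recentlyCompletedTTL = 10 * time.Minute

type recentlyCompleted struct {
	mu      sync.RWMutex
	expires map[string]time.Time // workflow namespace/name -> expiry of the block
}

func newRecentlyCompleted() *recentlyCompleted {
	return &recentlyCompleted{expires: map[string]time.Time{}}
}

// add records that the workflow identified by key has completed and been
// evicted from the informer cache.
func (r *recentlyCompleted) add(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.expires[key] = time.Now().Add(recentlyCompletedTTL)
}

// alreadyCompleted reports whether key completed within the TTL window,
// i.e. whether a stale copy arriving from the informer should be ignored.
func (r *recentlyCompleted) alreadyCompleted(key string) bool {
	r.mu.RLock()
	expiry, ok := r.expires[key]
	r.mu.RUnlock()
	if !ok {
		return false
	}
	if time.Now().After(expiry) {
		// The entry has aged out; forget it and let the workflow through again.
		r.mu.Lock()
		delete(r.expires, key)
		r.mu.Unlock()
		return false
	}
	return true
}
```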

Verification

Tested on EKS with three 4-core nodes and the above workflow injected rapidly.

@juliev0 (Contributor) commented Nov 23, 2023

Oh, just thinking...the key is the Workflow name, right? So, what happens when somebody legitimately creates a new Workflow of the same name as the previous Workflow within 10 minutes of the previous one being deleted?

@Joibel (Member, Author) commented Nov 23, 2023

> Oh, just thinking...the key is the Workflow name, right? So, what happens when somebody legitimately creates a new Workflow of the same name as the previous Workflow within 10 minutes of the previous one being deleted?

I thought I'd handled this, but obviously not. This went through many iterations, so I'm blaming that!

I've added a check, at the point where we may reject a workflow, to ensure that its Phase is not `""`. This will allow brand-new workflows back in, but any that have already been handled by the controller will go through the potential rejection.
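
A hedged sketch of that guard, reusing the hypothetical `recentlyCompleted` tracker from the earlier sketch (the empty phase string stands for a workflow the controller has never processed):

```go
// shouldReconcile decides whether a workflow arriving from the informer
// should be processed. phase is the workflow's status phase as a plain string.
func shouldReconcile(rc *recentlyCompleted, key, phase string) bool {
	if phase == "" {
		// Brand-new workflow the controller has never seen: always let it in,
		// even if an older workflow of the same name completed recently.
		return true
	}
	// Otherwise reject stale copies of workflows we already saw complete.
	return !rc.alreadyCompleted(key)
}
```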

@isubasinghe (Member):

> Oh, just thinking...the key is the Workflow name, right? So, what happens when somebody legitimately creates a new Workflow of the same name as the previous Workflow within 10 minutes of the previous one being deleted?

Nice catch! I didn't think about this myself.

@isubasinghe (Member) left a comment:

Generally looks good, but I couldn't find the change for:

> I've added a check, at the point where we may reject a workflow, to ensure that its Phase is not `""`. This will allow brand-new workflows back in, but any that have already been handled by the controller will go through the potential rejection.

There also seems to be a redundant `throttler.Remove()`; after those changes I would be happy to approve.

What this issue opens up, though, is the question of where else there are similar bugs in the code. Would this be the only interaction with the k8s API where we need to care about this, or is this a bug that would impact other things as well, say for example ConfigMaps (I remember seeing a ConfigMap cache somewhere)?

* mutex->RWMutex
* more comments
* removed unneeded throttler.Remove

Signed-off-by: Alan Clucas <[email protected]>

@isubasinghe (Member) left a comment:

LGTM

@juliev0 merged commit 62732b3 into argoproj:master on Nov 28, 2023
26 of 27 checks passed

@juliev0 (Contributor) commented Nov 28, 2023

There was a Windows Unit Test failure related to the HTTP artifact code. That had nothing to do with these changes, so I merged this.

Labels: area/controller