fix: completed workflow tracking #12198

Merged · 2 commits into argoproj:master · Nov 28, 2023

Conversation

@Joibel (Member) commented Nov 14, 2023

Problem: The workflow informer cache can contain out-of-date information that resurrects a completed workflow. In that scenario, an on-exit handler with `when: "{{workflow.status}} != Succeeded"` may fire erroneously.

Firing enough copies of this workflow into a real cluster fast enough will reproduce this (tested on 3.4.8, 3.4.9, and master):

```
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exit-
spec:
  entrypoint: entrypoint
  onExit: exit-handler
  templates:
    - name: entrypoint
      container:
        image: quay.io/pch/whalesay:latest
        command: [ cowsay ]
        args: [ "hello world" ]
        resources:
          limits:
            memory: 32Mi
            cpu: 100m
    - name: ohnoes
      container:
        image: quay.io/pch/whalesay:latest
        command: [ cowsay ]
        args: [ "oh noes" ]
        resources:
          limits:
            memory: 32Mi
            cpu: 100m
    - name: exit-handler
      steps:
        - - name: ohnoes
            when: "{{workflow.status}} != Succeeded"
            template: ohnoes
```

It does not reproduce easily on a toy cluster in k3d, I believe because API latencies there are too small and the whole machine starts to fall apart before the problem can occur.

It happens rarely in real workloads, but if `ohnoes` is replaced by a notification you'll receive a notification that your workflow has failed, even though it has succeeded when you go and look at it.

Story: The workflow controller is busy, and the k8s API is busy. The k8s API is providing workflow updates which run through the workflow informer cache. Normally these cannot go 'backwards'.

Backwards would be where an update to the workflow takes our view of it back to an earlier point in time because we've received a high-latency update from the k8s API. We prevent many of these using the `UpdateFunc` handler, which stops us reconciling a version we've already seen.
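
As a rough illustration of that general client-go pattern (not the controller's actual handler; the `queue` and `enqueue` names here are hypothetical):

```go
package controller

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// workflowEventHandlers returns informer handlers that only enqueue a
// workflow for reconciliation when an update actually carries new state.
func workflowEventHandlers(queue workqueue.RateLimitingInterface) cache.ResourceEventHandlerFuncs {
	enqueue := func(obj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	}
	return cache.ResourceEventHandlerFuncs{
		AddFunc: enqueue,
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldWf, okOld := oldObj.(*unstructured.Unstructured)
			newWf, okNew := newObj.(*unstructured.Unstructured)
			// If the resourceVersion hasn't changed we've already seen (and
			// possibly reconciled) this state, so don't requeue it.
			if okOld && okNew && oldWf.GetResourceVersion() == newWf.GetResourceVersion() {
				return
			}
			enqueue(newObj)
		},
		DeleteFunc: enqueue,
	}
}
```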

The problem is that as a workflow completes we drop it from the informer cache (it no longer meets `reconciliationNeeded`). This allows it to come back into the cache in an old form, as if it were a new resource.

If this happens we attempt to reconcile it. If the pod backing the node that was running in that stale copy of the workflow no longer exists, we (correctly) treat the workflow as errored.

So we have a workflow that is exiting in error: we test the onExit `when` clause, decide it's true, and launch the pod to handle it (`ohnoes` in my example). That pod runs regardless of the fact that the workflow controller then discovers the workflow's resourceVersion is out of date and cannot write the current state back to the cluster.

Fixes considered:

  • Attempt to ensure we're up to date before actioning the onExit. This seemed like a very convoluted code path for special onExit handling and likely to introduce more bugs than it fixed. It would either put much more load on the API by making the controller fetch the latest version, or require attempting the write-back just to see whether we were out of date.
  • Attempt to prevent old-version re-entry at the cache rather than at the reconcile stage. I tried this, but there is still a hole where a stale copy can make it through.

So the fix is to separately track workflows that have been evicted from the cache and not process them at all; a sketch of the idea follows. This tracking cache has a lifetime of 10 minutes, as I saw latencies of up to a few minutes during heavy-workload testing.
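
A minimal Go sketch of that idea, not the PR's actual implementation; the type and method names are hypothetical, and the RWMutex mirrors the follow-up commit:

```go
package controller

import (
	"sync"
	"time"
)

// recentlyCompletedTTL is the window during which a completed workflow is
// refused re-entry; 10 minutes comfortably exceeds the observed API latencies.
const recentlyCompletedTTL = 10 * time.Minute

type recentlyCompleted struct {
	mu      sync.RWMutex
	expires map[string]time.Time // workflow namespace/name -> expiry of the block
}

func newRecentlyCompleted() *recentlyCompleted {
	return &recentlyCompleted{expires: map[string]time.Time{}}
}

// add records that the workflow identified by key has completed and been
// evicted from the informer cache.
func (r *recentlyCompleted) add(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.expires[key] = time.Now().Add(recentlyCompletedTTL)
}

// alreadyCompleted reports whether key completed within the TTL window,
// i.e. whether a stale copy arriving from the informer should be ignored.
func (r *recentlyCompleted) alreadyCompleted(key string) bool {
	r.mu.RLock()
	expiry, ok := r.expires[key]
	r.mu.RUnlock()
	if !ok {
		return false
	}
	if time.Now().After(expiry) {
		// The entry has aged out; forget it and let the workflow through again.
		r.mu.Lock()
		delete(r.expires, key)
		r.mu.Unlock()
		return false
	}
	return true
}
```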

Verification

Tested on EKS with three 4-core nodes and the above workflow injected rapidly.

@juliev0 (Contributor) commented Nov 23, 2023

Oh, just thinking...the key is the Workflow name, right? So, what happens when somebody legitimately creates a new Workflow of the same name as the previous Workflow within 10 minutes of the previous one being deleted?

@Joibel (Member, Author) commented Nov 23, 2023

> Oh, just thinking...the key is the Workflow name, right? So, what happens when somebody legitimately creates a new Workflow of the same name as the previous Workflow within 10 minutes of the previous one being deleted?

I thought I'd handled this, but obviously not. This went through many iterations, so I'm blaming that!

I've added a check, at the point where we may reject a workflow, to ensure that its Phase is not `""`. This will allow brand-new workflows back in, but any that have already been handled by the controller will go through the potential rejection.
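
A hedged sketch of that guard, reusing the hypothetical `recentlyCompleted` tracker from the earlier sketch (the empty phase string stands for a workflow the controller has never processed):

```go
// shouldReconcile decides whether a workflow arriving from the informer
// should be processed. phase is the workflow's status phase as a plain string.
func shouldReconcile(rc *recentlyCompleted, key, phase string) bool {
	if phase == "" {
		// Brand-new workflow the controller has never seen: always let it in,
		// even if an older workflow of the same name completed recently.
		return true
	}
	// Otherwise reject stale copies of workflows we already saw complete.
	return !rc.alreadyCompleted(key)
}
```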

@isubasinghe (Member):

> Oh, just thinking...the key is the Workflow name, right? So, what happens when somebody legitimately creates a new Workflow of the same name as the previous Workflow within 10 minutes of the previous one being deleted?

Nice catch! I didn't think about this myself.

@isubasinghe (Member) left a comment:

Generally looks good, but I couldn't find the change for:

> I've added a check, at the point where we may reject a workflow, to ensure that its Phase is not `""`. This will allow brand-new workflows back in, but any that have already been handled by the controller will go through the potential rejection.

There also seems to be a redundant `throttler.Remove()`; after those changes I would be happy to approve.

What this issue opens up, though, is the question of where else there are similar bugs in the code. Would this be the only interaction with the k8s API where we need to care about this, or is this a bug that would impact other things as well, say for example ConfigMaps (I remember seeing a ConfigMap cache somewhere)?

* mutex->RWMutex
* more comments
* removed unneeded throttler.Remove

Signed-off-by: Alan Clucas <[email protected]>

@isubasinghe (Member) left a comment:

LGTM

@juliev0 merged commit 62732b3 into argoproj:master on Nov 28, 2023
26 of 27 checks passed

@juliev0 (Contributor) commented Nov 28, 2023

There was a Windows Unit Test failure related to the HTTP artifact code. That had nothing to do with these changes, so I merged this.

Labels: area/controller