fix: workflow controller to detect stale workflows #15090

Merged
Joibel merged 8 commits into argoproj:main from eduardodbr:fix-unordered-workflows
Jan 20, 2026

Conversation

@eduardodbr
Member

@eduardodbr eduardodbr commented Nov 28, 2025

Motivation

Multiple issues have been created because of unexpected workflow behavior:

#13986
#14833
#12352
#14780

It appears that many of these issues occur because the controller is processing an outdated version of the workflow. The exact cause of these stale reads is still unknown, but there is some suspicion that it may be related to the informer write-back mechanism, which is being disabled by default in #15079.
This PR ensures that stale workflow versions are not reconciled by keeping track of the last processed resource version for each workflow in a last-seen-version annotation. A workflow is only processed when its annotation matches the expected version; otherwise, it is re-queued. The annotation stores the workflow’s resource version (RV), though any unique value would work; the RV was simply the most convenient choice.

Modifications

  • Introduce a new last-seen-version annotation, updated with the current resource version on every Update() event.
  • Store the last-seen-version of each workflow in memory. When a workflow is processed, it proceeds only if the annotation matches the stored version.
  • If no stored version exists (e.g., after a controller restart), the workflow is always processed to allow normal recovery.
  • The in-memory entry is removed as soon as a Delete event is received or when the workflow completes.
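
For readers following along, the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `workflow` struct, `versionCache`, `record`, and `forget` are hypothetical names, keying by UID follows the review comments below, and the annotation key is the one quoted there.

```go
package main

import (
	"fmt"
	"sync"
)

// Annotation key as quoted in the review comments on this PR.
const lastSeenAnnotation = "workflow.argoproj.io/last-seen-version"

// workflow is a hypothetical stand-in for the real Workflow type.
type workflow struct {
	UID         string
	Annotations map[string]string
}

// versionCache remembers, per workflow, the value the controller last wrote
// into the annotation.
type versionCache struct {
	mu       sync.RWMutex
	versions map[string]string
}

func newVersionCache() *versionCache {
	return &versionCache{versions: make(map[string]string)}
}

// record stores the value written by a successful update.
func (c *versionCache) record(uid, version string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.versions[uid] = version
}

// forget drops the entry on a Delete event or when the workflow completes.
func (c *versionCache) forget(uid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.versions, uid)
}

// isOutdated reports whether wf predates the controller's last write.
// A cache miss (for example after a restart) counts as up to date.
func (c *versionCache) isOutdated(wf *workflow) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	want, ok := c.versions[wf.UID]
	if !ok {
		return false
	}
	return wf.Annotations[lastSeenAnnotation] != want
}

func main() {
	cache := newVersionCache()
	wf := &workflow{UID: "uid-1", Annotations: map[string]string{lastSeenAnnotation: "100"}}

	fmt.Println(cache.isOutdated(wf)) // no entry yet: false
	cache.record(wf.UID, "101")
	fmt.Println(cache.isOutdated(wf)) // informer copy still carries "100": true
	wf.Annotations[lastSeenAnnotation] = "101"
	fmt.Println(cache.isOutdated(wf)) // informer caught up: false
	cache.forget(wf.UID)
	fmt.Println(cache.isOutdated(wf)) // entry removed: false
}
```

The comparison is a plain string equality check, which is why any unique value would work in place of the resource version.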

Verification

Ran workflows successfully.

Documentation

Summary by CodeRabbit

  • New Features
    • Workflow version tracking now maintains the last observed resource version per workflow
    • System skips processing outdated workflows to optimize reconciliation performance
    • Added caching layer for efficient version state management across workflow events

Signed-off-by: Eduardo Rodrigues <eduardodbr@hotmail.com>
Signed-off-by: Eduardo Rodrigues <eduardodbr@hotmail.com>
@eduardodbr
Member Author

/retest

@eduardodbr eduardodbr marked this pull request as ready for review November 30, 2025 19:46
@Joibel Joibel requested a review from Copilot December 1, 2025 11:27
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a mechanism to detect and skip processing of stale workflow versions in the workflow controller, addressing multiple issues where the controller processes outdated versions of workflows. The implementation uses a combination of a workflow annotation and an in-memory map to track the last processed resource version for each workflow.

Key Changes:

  • Added last-seen-version annotation and in-memory tracking to identify stale workflow events
  • Integrated stale detection check (isOutdated) in the workflow processing pipeline
  • Cleanup of tracking data when workflows complete or are deleted

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| workflow/common/common.go | Defines the new AnnotationKeyLastSeenVersion constant for storing the last seen resource version |
| workflow/controller/controller.go | Adds lastSeenVersions struct and tracking logic, implements isOutdated check in processing pipeline, and cleanup on workflow completion/deletion |
| workflow/controller/operator.go | Updates persistUpdates and persistWorkflowSizeLimitErr to set the annotation and update in-memory tracking after successful workflow updates |

Signed-off-by: Eduardo Rodrigues <eduardodbr@hotmail.com>
…sistWorkflowSizeLimitErr

Signed-off-by: Eduardo Rodrigues <eduardodbr@hotmail.com>
Signed-off-by: Eduardo Rodrigues <eduardodbr@hotmail.com>
@coderabbitai
Contributor

coderabbitai bot commented Dec 10, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This pull request adds last-seen resource version tracking to the workflow controller. A new annotation constant is introduced, the controller caches last-observed resource versions per workflow to detect outdated events, and the operator updates both the annotation and cache during workflow updates to maintain state consistency.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Annotation constant (workflow/common/common.go) | Adds exported constant AnnotationKeyLastSeenVersion to define the annotation key for storing the last seen workflow version. |
| Controller state tracking and validation (workflow/controller/controller.go) | Adds lastSeenVersions map field to cache last-observed resource versions per workflow UID, protected by mutex. Initializes cache in NewWorkflowController. Adds isOutdated() method to check if a workflow's stored version differs from its annotation, getLastSeenVersionKey() to derive cache keys, and deleteLastSeenVersionKey() to purge entries. Modifies processNextItem() to skip outdated workflows. Updates informer handlers (AddFunc, UpdateFunc, DeleteFunc) to manage cache lifecycle when workflows are completed or deleted. |
| Operator annotation and cache updates (workflow/controller/operator.go) | Adds updateLastSeenVersionAnnotation() and updateLastSeenVersion() helper methods to wfOperationCtx. Integrates version tracking into persistUpdates() and persistWorkflowSizeLimitErr() by capturing old resource version, writing it to annotations, and syncing to the in-memory cache after successful updates. |
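
The persist-side ordering summarized above (capture the old resource version, write it to the annotation, update, then sync the cache only on success) might look roughly like this. It is a sketch, not the PR's code: `workflow`, `updateFunc`, `persist`, `okUpdate`, and `failingUpdate` are hypothetical stand-ins, and locking is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

const lastSeenAnnotation = "workflow.argoproj.io/last-seen-version"

// workflow is a hypothetical stand-in for the real Workflow type.
type workflow struct {
	UID             string
	ResourceVersion string
	Annotations     map[string]string
}

// updateFunc is a hypothetical stand-in for the API server Update call.
type updateFunc func(wf *workflow) (*workflow, error)

// persist stamps the pre-update resource version into the annotation and
// mirrors it into the cache only if the update succeeds.
// Locking is omitted here; the PR guards the shared map with a mutex.
func persist(wf *workflow, cache map[string]string, update updateFunc) (*workflow, error) {
	oldRV := wf.ResourceVersion
	if wf.Annotations == nil {
		wf.Annotations = make(map[string]string)
	}
	wf.Annotations[lastSeenAnnotation] = oldRV

	updated, err := update(wf)
	if err != nil {
		return nil, err // cache untouched: there is no new write to wait for
	}
	cache[updated.UID] = oldRV
	return updated, nil
}

// okUpdate is a fake server that bumps the resource version on success.
func okUpdate(in *workflow) (*workflow, error) {
	out := *in
	out.ResourceVersion = "101"
	return &out, nil
}

// failingUpdate is a fake server that always rejects the write.
func failingUpdate(*workflow) (*workflow, error) {
	return nil, errors.New("conflict")
}

func main() {
	cache := map[string]string{}
	wf := &workflow{UID: "uid-1", ResourceVersion: "100"}

	if _, err := persist(wf, cache, failingUpdate); err != nil {
		fmt.Println("update failed, cache entries:", len(cache))
	}
	updated, _ := persist(wf, cache, okUpdate)
	fmt.Println(updated.ResourceVersion, updated.Annotations[lastSeenAnnotation], cache["uid-1"])
}
```

Because the annotation and the cache both hold the pre-update value, they only agree on an informer copy that already reflects this write.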

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Thread safety: Review mutex usage and lock placement in lastSeenVersions access patterns across AddFunc, UpdateFunc, and DeleteFunc handlers
  • State consistency: Verify cache entries are properly purged when workflows are deleted or completed, and that annotation/cache synchronization in operator methods maintains correctness under concurrent updates
  • Control flow: Confirm isOutdated() logic and skip behavior in processNextItem() correctly prevent processing of stale events without losing legitimate updates

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | The description covers motivation, modifications, and verification, but lacks issue reference in Fixes field and documentation updates are incomplete. | Add 'Fixes #' with at least one referenced issue number and clarify documentation status or impact. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding stale workflow detection to the workflow controller. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
@eduardodbr
Member Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Dec 10, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
workflow/common/common.go (1)

60-61: Clarify what “version” refers to in the new annotation comment

To reduce ambiguity, consider explicitly stating that this stores the workflow metadata.resourceVersion last successfully processed by the controller and is used with the in‑memory cache to skip stale informer events. The implementation itself looks correct.

workflow/controller/operator.go (1)

764-766: Last‑seen version bookkeeping around persist paths looks consistent; worth hardening with tests

Using oldRV := woc.wf.ResourceVersion as the value for both updateLastSeenVersionAnnotation and updateLastSeenVersion ensures that a workflow is only considered “up‑to‑date” when the informer cache has observed the controller’s last successful update (annotation matches cached value). The conflict (reapplyUpdate) and size‑limit error paths also keep annotation and cache in sync only after a successful Update, which is the right behavior.

Given how central this is to skipping stale reconciliations, I’d strongly recommend adding focused unit tests that exercise at least:

  • Successful persistUpdates (no conflict),
  • Conflict + successful reapplyUpdate,
  • Request‑entity‑too‑large path via persistWorkflowSizeLimitErr,
  • isOutdated on objects whose annotations are (a) behind the cache, (b) equal to the cache, and (c) missing.

That would make the intended semantics much easier to maintain.
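
As a starting point for the isOutdated cases listed above, a table of inputs and expected results could be laid out as below. This is only a sketch: it restates the rule as a free function instead of calling the controller, `testCase`, `cases`, and `failures` are hypothetical names, and it is written as a plain program so it runs outside the repository; in the repository it would more naturally be a `testing` table.

```go
package main

import "fmt"

const lastSeenAnnotation = "workflow.argoproj.io/last-seen-version"

// isOutdated is a simplified, lock-free restatement of the rule described in
// this PR: process on a cache miss, otherwise require annotation == cached.
func isOutdated(annotations, cache map[string]string, uid string) bool {
	want, ok := cache[uid]
	if !ok {
		return false
	}
	return annotations[lastSeenAnnotation] != want
}

type testCase struct {
	name        string
	annotations map[string]string
	cache       map[string]string
	want        bool
}

func cases() []testCase {
	return []testCase{
		{"annotation behind cache", map[string]string{lastSeenAnnotation: "100"}, map[string]string{"uid-1": "101"}, true},
		{"annotation equals cache", map[string]string{lastSeenAnnotation: "101"}, map[string]string{"uid-1": "101"}, false},
		{"annotation missing", nil, map[string]string{"uid-1": "101"}, true},
		{"no cache entry", map[string]string{lastSeenAnnotation: "100"}, map[string]string{}, false},
	}
}

// failures returns the names of the cases whose result differs from want.
func failures() []string {
	var failed []string
	for _, tc := range cases() {
		if got := isOutdated(tc.annotations, tc.cache, "uid-1"); got != tc.want {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	if f := failures(); len(f) > 0 {
		fmt.Println("FAIL:", f)
		return
	}
	fmt.Println("all cases pass")
}
```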

Also applies to: 789-789, 866-873

🧹 Nitpick comments (2)
workflow/controller/controller.go (2)

85-88: lastSeenVersions cache design is solid; consider behavior when users edit the annotation

The UID‑keyed lastSeenVersions with an RWMutex looks correct, and isOutdated’s “process only when annotation == cached value, or cache miss” rule matches the PR description and avoids acting on informer state that hasn’t observed the controller’s last successful update.

One subtle edge case: if someone manually edits or removes workflow.argoproj.io/last-seen-version on a running Workflow, isOutdated will keep returning true (annotation ≠ cached value) and the controller will never reconcile that workflow again, because the cache entry is only updated from within persistUpdates / persistWorkflowSizeLimitErr, which are gated behind isOutdated.

If that is an acceptable “don’t touch controller‑owned annotation” contract, it might be worth:

  • Documenting this annotation as controller‑owned, and/or
  • Logging at warn level when we detect a persistent mismatch, or resetting the cache entry when the annotation is missing.

Otherwise, we may want a small escape hatch so an accidentally edited annotation doesn’t permanently wedge a workflow.
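
One possible shape for such an escape hatch is sketched below. It is not part of this PR: `tracker`, `shouldSkip`, and `warnAfter` are hypothetical names, locking is omitted, and the standard `log` package stands in for the controller's logger.

```go
package main

import (
	"fmt"
	"log"
)

const lastSeenAnnotation = "workflow.argoproj.io/last-seen-version"

// tracker is a hypothetical variant of the cache with two safety valves:
// it resets on a missing annotation and warns on a persistent mismatch.
type tracker struct {
	versions   map[string]string
	mismatches map[string]int // consecutive mismatches per workflow
	warnAfter  int
}

// shouldSkip returns true when the workflow should be re-queued.
func (t *tracker) shouldSkip(uid string, annotations map[string]string) bool {
	want, ok := t.versions[uid]
	if !ok {
		return false
	}
	got, present := annotations[lastSeenAnnotation]
	if !present {
		// Annotation is gone: drop the entry so the workflow recovers the
		// same way it would after a controller restart.
		delete(t.versions, uid)
		delete(t.mismatches, uid)
		return false
	}
	if got == want {
		delete(t.mismatches, uid)
		return false
	}
	t.mismatches[uid]++
	if t.mismatches[uid] >= t.warnAfter {
		log.Printf("workflow %s: last-seen-version %q has not matched %q for %d attempts",
			uid, got, want, t.mismatches[uid])
	}
	return true
}

func main() {
	t := &tracker{
		versions:   map[string]string{"uid-1": "101"},
		mismatches: map[string]int{},
		warnAfter:  2,
	}
	stale := map[string]string{lastSeenAnnotation: "100"}
	fmt.Println(t.shouldSkip("uid-1", stale))               // true
	fmt.Println(t.shouldSkip("uid-1", stale))               // true, and logs a warning
	fmt.Println(t.shouldSkip("uid-1", map[string]string{})) // false: entry reset
}
```

There is a trade-off to weigh: an informer copy from before the controller's first write also lacks the annotation, so resetting on a missing annotation weakens stale detection for that first update.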

Also applies to: 162-163, 215-218, 1369-1379


738-742: Outdated‑workflow requeue behavior is correct; consider adding observability

Re‑queuing with AddRateLimited when isOutdated is true is a reasonable way to wait for the informer cache to catch up before reconciling.

Given this path indicates potential cache staleness, you might consider incrementing a metric or counter when we skip an outdated event so operators can spot clusters where this happens frequently and diagnose underlying watch/cache issues.
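
A rough sketch of what counting skipped events next to the requeue could look like follows. The names here are hypothetical: `rateLimitedQueue` and `fakeQueue` stand in for the controller's work queue, and a bare atomic counter stands in for whatever metrics system the controller actually uses.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rateLimitedQueue is a hypothetical, minimal stand-in for the controller's
// work queue; only the one method used here is modelled.
type rateLimitedQueue interface {
	AddRateLimited(key string)
}

// fakeQueue records re-queued keys so the example can inspect them.
type fakeQueue struct{ requeued []string }

func (q *fakeQueue) AddRateLimited(key string) { q.requeued = append(q.requeued, key) }

// skippedOutdated counts events skipped as stale. A real controller would
// export this through its metrics system instead of a bare counter.
var skippedOutdated atomic.Int64

// processItem re-queues stale items and reports whether reconcile ran.
func processItem(key string, outdated bool, q rateLimitedQueue, reconcile func(string)) bool {
	if outdated {
		skippedOutdated.Add(1)
		q.AddRateLimited(key)
		return false
	}
	reconcile(key)
	return true
}

func main() {
	q := &fakeQueue{}
	reconciled := 0
	reconcile := func(string) { reconciled++ }

	processItem("ns/wf-a", true, q, reconcile)  // stale copy: counted and re-queued
	processItem("ns/wf-a", false, q, reconcile) // informer caught up: reconciled
	fmt.Println(skippedOutdated.Load(), len(q.requeued), reconciled) // 1 1 1
}
```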

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9242c6f and ec0397e.

📒 Files selected for processing (3)
  • workflow/common/common.go (1 hunks)
  • workflow/controller/controller.go (7 hunks)
  • workflow/controller/operator.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
workflow/controller/controller.go (1)
workflow/common/common.go (1)
  • AnnotationKeyLastSeenVersion (61-61)
workflow/controller/operator.go (2)
util/logging/logging.go (1)
  • Warn (59-59)
workflow/common/common.go (1)
  • AnnotationKeyLastSeenVersion (61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Unit Tests
  • GitHub Check: Windows Unit Tests
  • GitHub Check: Lint
  • GitHub Check: argo-images (argoexec)
  • GitHub Check: argo-images (argocli)
🔇 Additional comments (2)
workflow/controller/operator.go (1)

4405-4419: Thread‑safety and scoping of last‑seen helpers look good

Both updateLastSeenVersionAnnotation and updateLastSeenVersion are nicely localized to wfOperationCtx and correctly guard shared state (lastSeenVersions.versions) with a mutex. Keying by woc.controller.getLastSeenVersionKey(woc.wf) (UID) ensures stability across renames/resubmissions of the same workflow.

workflow/controller/controller.go (1)

966-967: Cleaning lastSeenVersions on completion and delete avoids cache leaks

Calling deleteLastSeenVersionKey both when reconciliation is no longer needed (completed workflows with no GC finalizer) and on delete events properly cleans up the in‑memory cache and ensures old UIDs don’t accumulate or interfere with resubmitted workflows.

Also applies to: 1024-1025

@isubasinghe
Copy link
Member

This might not be needed, see #15107

I would prefer if this were not needed, so let's test #15107 first.

@eduardodbr
Copy link
Member Author

@isubasinghe yeah, I did talk with @Joibel and suggested only going forward with this PR after testing that one

Member

@Joibel Joibel left a comment

Thanks

@Joibel Joibel added the cherry-pick/3.6 and cherry-pick/3.7 labels Jan 20, 2026
@Joibel Joibel enabled auto-merge (squash) January 20, 2026 08:27
@Joibel Joibel merged commit b7670b6 into argoproj:main Jan 20, 2026
39 checks passed
@argo-cd-cherry-pick-bot

❌ Cherry-pick failed for 3.6. Please check the workflow logs for details.

@argo-cd-cherry-pick-bot

❌ Cherry-pick failed for 3.7. Please check the workflow logs for details.

Joibel pushed a commit that referenced this pull request Jan 21, 2026
Co-authored-by: Alan Clucas <alan@clucas.org>
(cherry picked from commit b7670b6)

Signed-off-by: Eduardo Rodrigues <eduardodbr@hotmail.com>
Signed-off-by: Alan Clucas <alan@clucas.org>
isubasinghe pushed a commit to pipekit/argo-workflows that referenced this pull request Jan 22, 2026
…roj#15090 for 3.6) (argoproj#15263)

Signed-off-by: Eduardo Rodrigues <eduardodbr@hotmail.com>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Eduardo Rodrigues <eduardodbr@hotmail.com>
Joibel added a commit to Joibel/argo-workflows that referenced this pull request Jan 28, 2026
…roj#15090 for 3.7) (argoproj#15262)

Signed-off-by: Eduardo Rodrigues <eduardodbr@hotmail.com>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Eduardo Rodrigues <eduardodbr@hotmail.com>
Joibel pushed a commit to Joibel/argo-workflows that referenced this pull request Feb 17, 2026
… persist

Log the workflow's last-seen-version annotation value (from PR argoproj#15090)
in two key locations:
- "Processing workflow" log at reconciliation start (alongside ResourceVersion)
- "Workflow update successful" log in persistUpdates

This improves observability for debugging stale workflow detection.

https://claude.ai/code/session_01V7QyvrRrYu9uVKeTS2dugS
Joibel added a commit to Joibel/argo-workflows that referenced this pull request Feb 17, 2026
… persist

Log the workflow's last-seen-version annotation value (from PR argoproj#15090)
in two key locations:
- "Processing workflow" log at reconciliation start (alongside ResourceVersion)
- "Workflow update successful" log in persistUpdates

This improves observability for debugging stale workflow detection.

https://claude.ai/code/session_01V7QyvrRrYu9uVKeTS2dugS
Signed-off-by: Claude <noreply@anthropic.com>
Joibel added a commit to Joibel/argo-workflows that referenced this pull request Feb 17, 2026
… persist

Log the workflow's last-seen-version annotation value (from PR argoproj#15090)
in two key locations:
- "Processing workflow" log at reconciliation start (alongside resourceVersion)
- "Workflow update successful" log in persistUpdates

This improves observability for debugging stale workflow detection.

https://claude.ai/code/session_01V7QyvrRrYu9uVKeTS2dugS
Signed-off-by: Claude <noreply@anthropic.com>
