fix: re-resolve symbolic target revisions on sync retry #26925

Open

rohansood10 wants to merge 1 commit into argoproj:master from rohansood10:fix/26530-retry-stale-revision

Conversation


@rohansood10 rohansood10 commented Mar 19, 2026

Fixes #26530

When a sync with targetRevision: HEAD fails and retries, the retry reuses the originally-resolved commit SHA instead of re-resolving HEAD. If HEAD has moved (which it usually has), the retry applies manifests from the old commit — potentially rolling back the cluster to a stale state.

The fix clears the resolved revision on retry when the target is a symbolic ref (HEAD, branch name, tag). Literal SHAs are left alone since they're already pinned. Uses the existing git.IsCommitSHA() helper to distinguish.
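To make the distinction concrete, here is a minimal standalone sketch of the kind of check involved. `isCommitSHA` paraphrases the behavior of Argo CD's `git.IsCommitSHA()` helper (a full 40-character hex string); the `shouldReResolve` wrapper and its name are illustrative, not the actual controller code.

```go
package main

import (
	"fmt"
	"regexp"
)

// isCommitSHA paraphrases git.IsCommitSHA: true only for a full
// 40-character hex commit SHA, which is already pinned.
var commitSHARegex = regexp.MustCompile(`^[0-9A-Fa-f]{40}$`)

func isCommitSHA(rev string) bool {
	return commitSHARegex.MatchString(rev)
}

// shouldReResolve (illustrative name) decides whether a retry must
// re-resolve the target revision: symbolic refs (HEAD, branch names,
// tags) can move, so they are re-resolved; literal SHAs are left alone.
func shouldReResolve(targetRevision string) bool {
	return !isCommitSHA(targetRevision)
}

func main() {
	for _, rev := range []string{
		"HEAD",
		"main",
		"v2.9.0",
		"aabe852aabe852aabe852aabe852aabe852aabe8", // 40 hex chars: pinned
	} {
		fmt.Printf("%-42s re-resolve on retry: %v\n", rev, shouldReResolve(rev))
	}
}
```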

Relates to #23547

When a sync operation fails and retries, symbolic refs like HEAD or
branch names were not re-resolved, causing the retry to use the
originally-resolved commit SHA which could be months old. This led
to silent rollbacks of cluster resources to stale states.

The retry logic only cleared the resolved revision when Retry.Refresh
was explicitly set to true. For the common case where Retry.Refresh is
false (the default), the stale SHA persisted across retries.

Fix by also clearing the resolved revision on retry when the target
revision is a symbolic ref (not a 40-character hex SHA), forcing
re-resolution to the latest commit. Literal SHA target revisions are
preserved since they always point to the same commit.

Fixes argoproj#26530

Signed-off-by: Rohan Sood <56945243+rohansood10@users.noreply.github.com>
@rohansood10 rohansood10 requested a review from a team as a code owner March 19, 2026 21:02


codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.17%. Comparing base (aabe852) to head (9610088).

Files with missing lines | Patch % | Lines
controller/appcontroller.go | 90.00% | 1 Missing and 1 partial ⚠️
@@           Coverage Diff           @@
##           master   #26925   +/-   ##
=======================================
  Coverage   63.16%   63.17%           
=======================================
  Files         414      414           
  Lines       56461    56480   +19     
=======================================
+ Hits        35664    35681   +17     
- Misses      17427    17430    +3     
+ Partials     3370     3369    -1     


@olivergondza
Contributor

@rohansood10, when implementing the Retry.Refresh, the community sentiment was not to change the default behavior, and only resolve those when explicitly configured: #11494 (comment).

Correct me if I am wrong, but this change practically performs the refresh regardless of the option's value, right?

@rohansood10
Author

Good question - and you're right to flag this. The change here is intentionally scoped narrower than toggling Retry.Refresh globally.

What this does: when the target revision is a symbolic ref (HEAD, branch name, tag) rather than a literal SHA, it clears the resolved revision on retry so it gets re-resolved. It does NOT change the Retry.Refresh flag itself, so the full app refresh behavior (CompareWithLatest, etc.) is still opt-in.

The distinction is: re-resolving a symbolic ref to its current SHA is a lightweight git operation, while Retry.Refresh triggers a full comparison cycle with manifest generation. The bug is specifically that a stale SHA gets reused when HEAD has moved, which can silently roll back cluster state to a months-old commit.

That said, I see the concern about changing default retry behavior. Would it be cleaner to gate this behind a new sync option like ResolveRefsOnRetry to keep it fully opt-in? Or does the current approach feel proportionate given the risk of silently rolling the cluster back?

cc the discussion at #11494 (comment)
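If the opt-in route floated above were taken, the gate could look roughly like the sketch below. ResolveRefsOnRetry is the hypothetical option from this comment, and the struct and function names are illustrative rather than Argo CD's actual API.

```go
package main

import (
	"fmt"
	"regexp"
)

var commitSHARegex = regexp.MustCompile(`^[0-9A-Fa-f]{40}$`)

// RetryOptions is an illustrative stand-in for the sync retry config;
// ResolveRefsOnRetry is the hypothetical opt-in flag discussed above.
type RetryOptions struct {
	Refresh            bool
	ResolveRefsOnRetry bool
}

// clearResolvedRevision (illustrative) reports whether the stored SHA
// should be dropped before a retry so the target gets re-resolved.
func clearResolvedRevision(opts RetryOptions, targetRevision string) bool {
	if opts.Refresh {
		return true // full refresh already re-resolves the revision
	}
	// Opt-in lightweight path: only symbolic refs, never pinned SHAs.
	return opts.ResolveRefsOnRetry && !commitSHARegex.MatchString(targetRevision)
}

func main() {
	fmt.Println(clearResolvedRevision(RetryOptions{ResolveRefsOnRetry: true}, "HEAD"))
}
```

With this shape, the default (both flags false) keeps today's behavior of reusing the stored SHA, addressing the backward-compatibility concern from #11494.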

@olivergondza
Contributor

The bug is specifically that a stale SHA gets reused when HEAD has moved, which can silently roll back cluster state to a months-old commit.

Can you elaborate on how this is possible? I do not see the connection with the behavior described in #26530.

The distinction is: re-resolving a symbolic ref to its current SHA is a lightweight git operation, while Retry.Refresh triggers a full comparison cycle with manifest generation.

Have you maybe forgotten to push your full change set? I see it does the exact same thing as .Refresh - not setting the flag, but activating the same behavior for symbolic revisions.


Was any AI assistant used to produce the contribution?

@rohansood10
Author

Sure - the scenario from #26530 is:

  1. User creates an Application with targetRevision: HEAD
  2. First sync resolves HEAD to commit abc123, stores it in state.Operation.Sync.Revision
  3. Sync fails (e.g., webhook rejection, hook timeout)
  4. Meanwhile, 50+ commits are pushed to the repo
  5. On retry, if Retry.Refresh is false (the default), the retry reuses abc123 from step 2 instead of re-resolving HEAD
  6. The manifests from the months-old abc123 get applied, silently rolling back the cluster

The issue reporters saw this in production with HA mode where retries can be delayed. The fix specifically targets symbolic refs - literal SHAs are preserved since they're already pinned.

Regarding the opt-in concern - you're right that this changes default behavior for symbolic refs. Would gating it behind Retry.Refresh being set make more sense? The tradeoff is that the data corruption risk remains for the default case.

@olivergondza
Contributor

olivergondza commented Mar 23, 2026

Hmm. I am still not convinced this is the culprit.

  • If step no. 6 downgrades the app when it finally succeeds in applying abc123, what had upgraded the app manifests beyond abc123 while the Argo app was struggling to do so for - as you said - months?
  • From the issue it is not clear (to me at least) that there was an ongoing retry loop for the entire time.

Would gating it behind Retry.Refresh being set make more sense?

See my question above. How is this different from what the flag does already (except that you skip SHAs)?

@rohansood10
Author

Yeah fair question. Looking at the issue again, the reporters describe it as happening in HA setups with auto-sync + retry, but the exact sequence of how the cluster ended up ahead of the retried revision isn't fully clear from the logs they shared.

You're right that if the sync was failing continuously, nothing newer would have been deployed through ArgoCD, so the retry applying the original commit wouldn't technically be a rollback. The scenario where it becomes dangerous is if something else deployed a newer version in the meantime (manual kubectl apply, another sync that succeeded, etc.) and then the retry fires with the old SHA.

I'm less certain now that this is the right fix for #26530 specifically. Might be worth getting more detail from the reporters about what exactly happened in their case. Happy to close this if the consensus is that Retry.Refresh already covers the legitimate use case.



Development

Successfully merging this pull request may close these issues.

Argo CD sometimes applies very old Git commit during sync retry despite targetRevision=HEAD
