Skip to content

fix(appset): Fix infinite loop caused by not ignoring statuses updates#19676

Closed
joliveirinha wants to merge 1 commit intoargoproj:masterfrom
joliveirinha:joliveirinha/fix-reconcile-loop-appset
Closed

fix(appset): Fix infinite loop caused by not ignoring statuses updates#19676
joliveirinha wants to merge 1 commit intoargoproj:masterfrom
joliveirinha:joliveirinha/fix-reconcile-loop-appset

Conversation

@joliveirinha
Copy link

@joliveirinha joliveirinha commented Aug 24, 2024

This commit fixes a bug where the appset controller updating the status of the resource itself, was causing it be notified for a new reconcile loop.
This happened, because, altough the individual resource (app) statuses were not changing, the order of the list was, as such, resulting in an updated appset resource.

The fix, adds a GenerationPredicate to the Watch on the appSet to guarantee that if no actual change to the AppSet change was done, then it can be ignored.

Fixes [#19675]

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

@joliveirinha joliveirinha requested a review from a team as a code owner August 24, 2024 16:01
@bunnyshell
Copy link

bunnyshell bot commented Aug 24, 2024

❌ Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

@bunnyshell
Copy link

bunnyshell bot commented Aug 24, 2024

❌ Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

This commit fixes a bug where the appset controller updating the status
of the resource itself, was causing it be notified for a new reconcile
loop.
This happened, because, altough the individual resource (app) statuses
were not changing, the order of the list was, as such, resulting in an
updated appset resource.

The fix, adds a GenerationPredicate to the Watch on the appSet to
guarantee that if no actual change to the AppSet change was done, then
it can be ignored.

Signed-off-by: João Oliveirinha <joliveirinha@gmail.com>
@joliveirinha joliveirinha force-pushed the joliveirinha/fix-reconcile-loop-appset branch from a1c0196 to 70c5e4b Compare August 24, 2024 16:07
@codecov
Copy link

codecov bot commented Aug 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (master@e612199). Learn more about missing BASE report.
Report is 1468 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #19676   +/-   ##
=========================================
  Coverage          ?   55.91%           
=========================================
  Files             ?      316           
  Lines             ?    43794           
  Branches          ?        0           
=========================================
  Hits              ?    24487           
  Misses            ?    16762           
  Partials          ?     2545           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@reggie-k
Copy link
Member

reggie-k commented Aug 25, 2024

Thank you for the detailed description and the PR, @joliveirinha.
Could you add tests for this case?

Copy link
Member

@gdsoumya gdsoumya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change looks good, a test would ensure we don't regress

Copy link
Contributor

@todaywasawesome todaywasawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests and it looks 🔥

@joliveirinha
Copy link
Author

Sorry for the late response. Only had time to look at this now.

It doesn't seem straight forward to implement a regression test for this.
AFAIK, the tests that exist use fake client and there doesnt seem possible to test the reconcile loop with it.

It seems the only possibility is to use envtest instead.

I maybe missing something since this is my first time looking at this.

@jetersen
Copy link
Contributor

jetersen commented Sep 3, 2024

Another option is to ensure order of the status.resources which I think is the real cause of this issue. See #19757

@joliveirinha
Copy link
Author

joliveirinha commented Sep 3, 2024

Another option is to ensure order of the status.resources which I think is the real cause of this issue. See #19757

Hmm I have mixed feelings about that.

  1. altought what you are saying is indeed true in the exact scenario I reported, it feels that "fixing" it that way is not the ideal fix, mainly because:
  2. we shouldn't care at all for the reason and contents that changed on the resource status from the perspective of being notified by the controller-runtime due to those changes. Even if the status changed, it should be in the implementation to decide what is best requeue time for the next iteration. Most probably, a couple of seconds.

For 2) the current fix here proposed, gives all the flexibility for the implementation to decide what is the best requeue time. If we go via the direction you are suggesting, then a status update that changes the actual status content, will always trigger right away the notification from the runtime, which means you have less flexibility.

You may argue, that it will only happen once because in the next iteration the status is probably the same, still....

To summarize, what I suggest, since the fix is small and relatively obvious, is for us to merge this PR and eventually when the project has integrations tests with the envtest, then it can cover this predicate and the rest that already exist to validate the runtime scenarios.

Also, because this PR most probably also fix other related issues related to the progressive sync.

@jetersen
Copy link
Contributor

jetersen commented Sep 4, 2024

But if you are using SCM Provider and someone creates a new repo, then it should reconcile if a new application is added?

@joliveirinha
Copy link
Author

But if you are using SCM Provider and someone creates a new repo, then it should reconcile if a new application is added?

Yes. The fix doesn't change that. normal requeueing based on the configured requeue time will still happen.

Copy link
Member

@agaudreault agaudreault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many cases where the behavior of the appset controller may depend on status updates adding back the object to the queue. One that comes to my mind is the progressive delivery. There will be less update events, so less calls to shouldRequeueApplicationSet. What are the implication of that? hard to know.

These kinds of scenario are usually coverer by the E2E test suite.

Copy link
Collaborator

@alexmt alexmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @agaudreault . Argo CD application controller logic relies on reconciliation after changing the status. It is not ideal, but I'm afraid the AppicationSet controller is the same. We might open a can of worms by disabling reconciliation on status change.

I would implement this suggestion instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants