-
Notifications
You must be signed in to change notification settings - Fork 7k
chore: remove controller liveness probe #9557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has setting a failureThreshold been considered? This is set to 1 by default. If set to 5 for example it would need 5 failures to restart the pod. If a probe is successful between failures the failureThreshold is reset. This would give the EU a choice to make an adjustment based on their workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zachaller can you remember if we discussed setting a failure threshold?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we did talk about it, also failure threshold actually defaults to 3 with a minimum of 1. I think if we want to keep the liveness check we have to bump the timeout to something like 3 to 5 seconds. That is mainly where the issue is at but the reasoning for removal was because restarting can make things worse under load conditions when the timeout is hit due to load.
I do think keeping it and bumping timeout can be a first pass but it might eventually still make sense to remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zachaller You are absolutely right failureThreshold does default to 3. I think it also might be worth trying to set failureThreshol=5 and bumping up the timeoutSeconds.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reasoning about removing it is related to the nature of the controller watch loop versus how the liveness probe is currently implemented. Currently the
/healthzendpoint is exposed by the metrics server which runs in a separate goroutine. This can cause the controller to be restarted by reasons unrelated to the watch loop. In cases when the watch queues are accumulating lots of resource updates the controller will be busy trying to process them. Killing the controller in this case makes the situation worse as the queue remains the same after the restart. We noticed this behaviour in one of our internal instances which led to this discussion challenging the effectiveness of the current liveness probe implementation in ArgoCD controller.@34fathombelow do you have a real use case where the current liveness-probe implementation is useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks for your detailed explanation. I'm fine with this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the extra background info as well @leoluz I was not aware that the check was also in a seperate go routine which makes the liveness prob as you mentioned even more useless.