chore: remove controller liveness probe#9557
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9557 +/- ##
=======================================
Coverage 45.77% 45.77%
=======================================
Files 222 222
Lines 26372 26372
=======================================
Hits 12072 12072
Misses 12651 12651
Partials 1649 1649
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Has setting a failureThreshold been considered? This is set to 1 by default. If set to 5 for example it would need 5 failures to restart the pod. If a probe is successful between failures the failureThreshold is reset. This would give the EU a choice to make an adjustment based on their workloads.
livenessProbe:
httpGet:
path: /healthz
port: 8082
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 5There was a problem hiding this comment.
@zachaller can you remember if we discussed setting a failure threshold?
There was a problem hiding this comment.
So we did talk about it, also failure threshold actually defaults to 3 with a minimum of 1. I think if we want to keep the liveness check we have to bump the timeout to something like 3 to 5 seconds. That is mainly where the issue is at but the reasoning for removal was because restarting can make things worse under load conditions when the timeout is hit due to load.
I do think keeping it and bumping timeout can be a first pass but it might eventually still make sense to remove.
There was a problem hiding this comment.
@zachaller You are absolutely right failureThreshold does default to 3. I think it also might be worth trying to set failureThreshol=5 and bumping up the timeoutSeconds.
There was a problem hiding this comment.
The reasoning about removing it is related to the nature of the controller watch loop versus how the liveness probe is currently implemented. Currently the /healthz endpoint is exposed by the metrics server which runs in a separate goroutine. This can cause the controller to be restarted by reasons unrelated to the watch loop. In cases when the watch queues are accumulating lots of resource updates the controller will be busy trying to process them. Killing the controller in this case makes the situation worse as the queue remains the same after the restart. We noticed this behaviour in one of our internal instances which led to this discussion challenging the effectiveness of the current liveness probe implementation in ArgoCD controller.
@34fathombelow do you have a real use case where the current liveness-probe implementation is useful?
There was a problem hiding this comment.
Got it, thanks for your detailed explanation. I'm fine with this change.
There was a problem hiding this comment.
Thanks for the extra background info as well @leoluz I was not aware that the check was also in a seperate go routine which makes the liveness prob as you mentioned even more useless.
Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>
Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>
Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>
8600faf to
79ab2c4
Compare
Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>
alexmt
left a comment
There was a problem hiding this comment.
LGTM. Liveness probe use to make sense a long time ago. Controller used to verify k8s connectivity in healthz handler. We updated healthz handler but forgot to remove liveness probe.
Current controller liveness probe is a noop just checking if the metric-server is able to receive requests.
In cases when the controller is overloaded reconciling large queues restarting the Pod is more harmful than letting it running. In discussion with @alexmt we understand that liveness probe should be removed from the controller to prevent it from being restarted.
Signed-off-by: Leonardo Luz Almeida leonardo_almeida@intuit.com
Note on DCO:
If the DCO action in the integration test fails, one or more of your commits are not signed off. Please click on the Details link next to the DCO action for instructions on how to resolve this.
Checklist: