Merged
@@ -136,12 +136,6 @@ spec:
 port: 8082
 initialDelaySeconds: 5
 periodSeconds: 10
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8082
-  initialDelaySeconds: 5
-  periodSeconds: 10
Comment on lines 139 to 144
Member

@34fathombelow 34fathombelow Jun 2, 2022

Has setting a failureThreshold been considered? This is set to 1 by default. If it were set to 5, for example, it would take 5 consecutive failures to restart the pod; a successful probe between failures resets the count. This would give end users the choice to adjust it based on their workloads.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8082
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 5

Member

@zachaller can you remember if we discussed setting a failure threshold?

Contributor

@zachaller zachaller Jun 2, 2022

We did talk about it. Also, failureThreshold actually defaults to 3, with a minimum of 1. I think if we want to keep the liveness check, we have to bump the timeout to something like 3 to 5 seconds; that is mainly where the issue is. The reasoning for removal, though, was that restarting can make things worse when the timeout is hit because of load.

I do think keeping it and bumping the timeout can be a first pass, but it might eventually still make sense to remove it.

Member

@zachaller You are absolutely right, failureThreshold does default to 3. I think it might also be worth trying failureThreshold=5 and bumping up timeoutSeconds.

Collaborator Author

@leoluz leoluz Jun 2, 2022

The reasoning for removing it is related to the nature of the controller watch loop versus how the liveness probe is currently implemented. Currently the /healthz endpoint is exposed by the metrics server, which runs in a separate goroutine. This can cause the controller to be restarted for reasons unrelated to the watch loop. When the watch queues are accumulating lots of resource updates, the controller will be busy trying to process them. Killing the controller in this case makes the situation worse, as the queue remains the same after the restart. We noticed this behaviour in one of our internal instances, which led to this discussion challenging the effectiveness of the current liveness probe implementation in the Argo CD controller.

@34fathombelow do you have a real use case where the current liveness-probe implementation is useful?

Member

Got it, thanks for your detailed explanation. I'm fine with this change.

Contributor

Thanks for the extra background info as well, @leoluz. I was not aware that the check was also in a separate goroutine, which, as you mentioned, makes the liveness probe even more useless.

 securityContext:
 runAsNonRoot: true
 readOnlyRootFilesystem: true
6 changes: 0 additions & 6 deletions manifests/core-install.yaml
@@ -9911,12 +9911,6 @@ spec:
 optional: true
 image: quay.io/argoproj/argocd:latest
 imagePullPolicy: Always
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8082
-  initialDelaySeconds: 5
-  periodSeconds: 10
 name: argocd-application-controller
 ports:
 - containerPort: 8082
6 changes: 0 additions & 6 deletions manifests/ha/install.yaml
@@ -11281,12 +11281,6 @@ spec:
 optional: true
 image: quay.io/argoproj/argocd:latest
 imagePullPolicy: Always
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8082
-  initialDelaySeconds: 5
-  periodSeconds: 10
 name: argocd-application-controller
 ports:
 - containerPort: 8082
6 changes: 0 additions & 6 deletions manifests/ha/namespace-install.yaml
@@ -2136,12 +2136,6 @@ spec:
 optional: true
 image: quay.io/argoproj/argocd:latest
 imagePullPolicy: Always
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8082
-  initialDelaySeconds: 5
-  periodSeconds: 10
 name: argocd-application-controller
 ports:
 - containerPort: 8082
6 changes: 0 additions & 6 deletions manifests/install.yaml
@@ -10612,12 +10612,6 @@ spec:
 optional: true
 image: quay.io/argoproj/argocd:latest
 imagePullPolicy: Always
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8082
-  initialDelaySeconds: 5
-  periodSeconds: 10
 name: argocd-application-controller
 ports:
 - containerPort: 8082
6 changes: 0 additions & 6 deletions manifests/namespace-install.yaml
@@ -1467,12 +1467,6 @@ spec:
 optional: true
 image: quay.io/argoproj/argocd:latest
 imagePullPolicy: Always
-livenessProbe:
-  httpGet:
-    path: /healthz
-    port: 8082
-  initialDelaySeconds: 5
-  periodSeconds: 10
 name: argocd-application-controller
 ports:
 - containerPort: 8082