Cluster syncs hang (syncs never complete) #18467
Comments
Same issue here! We are running with HA configured, are using annotations for paths, and are using a 20 min jitter due to deploying all apps from one 'deployment' repo, which is essentially our overlay repo. We've also tried increasing application processors and using the 'round-robin' sharding algorithm. Annotation added to app manifests (sketched below).
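The exact annotation and jitter settings aren't shown above; here is a minimal sketch, assuming the path-scoped refresh annotation is argocd.argoproj.io/manifest-generate-paths and that the jitter is set via the timeout.reconciliation.jitter key in argocd-cm. The app name, repo URL, and values are placeholders, not the poster's actual config:

```yaml
# Hypothetical Application using the manifest-generate-paths annotation so that
# webhook/poll-driven refreshes only trigger when files under this app's path change.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app                # placeholder name
  namespace: argocd
  annotations:
    argocd.argoproj.io/manifest-generate-paths: .   # paths relative to spec.source.path
spec:
  project: default
  source:
    repoURL: https://example.com/deployment-repo.git   # placeholder overlay repo
    path: overlays/example-app
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app
---
# Reconciliation interval plus jitter to spread refreshes instead of a thundering herd.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: 180s
  timeout.reconciliation.jitter: 20m
```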
What was the latest version of Argo CD you used that did not have this issue? What Kubernetes version is the cluster on?
We were using ArgoCD version … Current k8s version: …
Issue occurred again this morning right on schedule. This time 3 out of the 5 clusters "successfully synced":
Are you on 1.29.4 or higher? "Kube-apiserver: fixes a 1.27+ regression in watch stability by serving watch requests without a resourceVersion from the watch cache by default, as in <1.27 (disabling the change in #115096 by default). This mitigates the impact of an etcd watch bug (etcd-io/etcd#17555). If the 1.27 change in #115096 to serve these requests from underlying storage is still desired despite the impact on watch stability, it can be re-enabled with a WatchFromStorageWithoutResourceVersion feature gate. (kubernetes/kubernetes#123973, @serathius) [SIG API Machinery]" is a possibly related fix that is only available in >= 1.29.4.
@tooptoop4 I mistakenly stated we were on …
We upgraded to …
Hello, I have a similar behaviour in the GUI with Red Hat OpenShift GitOps version 1.12.3, which is running ArgoCD v2.10.10+9b3d0c0.
Thanks @sockejr. We have Argo CD deployed to a dedicated 4 node cluster with plenty of CPU/mem available. We currently have no limits on our Argo CD pods. I see no OOM in the logs, and all Argo CD pods show healthy and are in a running state. I'm actually experiencing the issue as I type this now. One of our three app controller pods shows it's using 2 cores, while the others are using less than 0.5. The cluster syncs started around an hour ago and 3/5 have completed. Assuming this is like every other occurrence, the remaining two will never complete.
Here's something that stands out to me...
Maybe it's a variant of the kubectl/fork issue from here:
Thanks @sockejr, we looked into your suggestion and even set the …
Issue still occurring, on a daily basis (sometimes twice); the only solution we've found is to restart …
I have the same issue here: one of our application-controllers gets OOMKilled, and the only solution that works is to restart the application-controller.
Uggghhhh.....
I'm trying to figure out if the issue we are having (similar to @klinux) is the same thing everybody else here is experiencing: without any (apparent) increase in workload, the … The interesting thing is that memory usage peaks at about 2Gi, but we still need 5Gi for the pod to start and not immediately get OOM-killed. We "only" have about 230 apps (created by 20 appsets), and we have …
Hi @wanddynosios, my problem here is that the project that manages the application has orphaned resources monitoring enabled, and the cluster that the application-controller was monitoring has a lot of orphaned resources; after disabling orphaned resources monitoring in the project, our ArgoCD stopped eating memory.
@wanddynosios You can usually tell if a pod has been OOM killed by looking at the container status in the Pod manifest. This Prometheus alert is also useful. At pod start, the application-controller will initiate watches with the configured cluster, and from my understanding this, together with the initial processing of all applications, is what consumes a lot of initial memory, requiring the limit to be much higher than the average memory consumption. Every 24h (I think), or sporadically when a watch fails, the controller will "relist"/reconnect the watch, causing some memory spikes too.
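For reference, a sketch of what the container status typically looks like on an OOM-killed controller pod; the lastState.terminated.reason field is the tell-tale sign (pod and container names here are assumptions, values illustrative):

```yaml
# Fragment of `kubectl get pod argocd-application-controller-0 -n argocd -o yaml`
# after an OOM kill.
status:
  containerStatuses:
    - name: argocd-application-controller
      ready: true
      restartCount: 3
      lastState:
        terminated:
          reason: OOMKilled   # the field to look for
          exitCode: 137       # SIGKILL from the kernel OOM killer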
Some tips to reduce the memory:
And if that is not possible, then increase the memory limit. These are based on my observations, but you can also do Go profiling on the pod to find out more about what is consuming the memory.
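The list of tips itself is not reproduced above, but one measure along these lines (watching fewer resource kinds, also discussed later in this thread) can be sketched as a resource.exclusions entry in argocd-cm; the kinds below are only examples, not a recommendation:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Stop caching/watching high-churn kinds that Argo CD does not need to manage.
  resource.exclusions: |
    - apiGroups: ["*"]
      kinds: ["Event"]
      clusters: ["*"]
    - apiGroups: ["coordination.k8s.io"]
      kinds: ["Lease"]
      clusters: ["*"]
```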
Thank you both for the pointers. We don't have orphaned resources monitoring enabled, and we already have the OOM-killed alert. I will definitely look into the tips to reduce memory: though I don't think we can watch fewer namespaces, we might have some resources which don't have to be watched, and we can look into sharding. I'll report back :-)
For CPU issues, I suggest you join Slack and post your ArgoCD metrics dashboard, similar to this post: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1719410085483709?thread_ts=1719236455.982699&cid=C01TSERG0KZ You can also look at the history for other investigations. Multiple things can cause an application to hang, but most of the time it is reflected by a workqueue with too many items or the controller CPU hitting the ceiling. Of course, if you don't see any logs in the controller, it may be in a broken state due to a deadlock, which can also be seen with the CPU being near 0.
@wanddynosios you can set the env variable … From the code: argo-cd/controller/cache/cache.go, line 65 at dafb37c
So I did some playing around in our non-production environment (~130 apps). I definitely did encounter the core of this issue, where syncs never finish unless you cycle all application-controller pods (and sometimes the repo-server, as noted above). To summarize my findings regarding resource consumption:
For now, our previous setup of fewer controllers with a larger memory buffer offers the most stability for the money. But this queueing issue and the high resource consumption don't appear to be directly connected. |
We are sharing our case here, which may help some others. We had a similar problem in our OpenShift 4.12 environments using ArgoCD 2.10 and below. In the argocd-application-controller statefulset logs we could see: … These clusters have a very large number of objects, but only a few of them are actually managed by ArgoCD. As already mentioned by @agaudreault, we chose the option:
We implemented the resourceInclusions/resourceExclusions logic to significantly reduce the COUNT of APIs to watch, and also the COUNT of RESOURCES, as described here for the Openshift GitOps ArgoCD operator: https://docs.openshift.com/gitops/1.12/argocd_instance/argo-cd-cr-component-properties.html For example, we use the following, which covers the helm charts/objects we deploy through ArgoCD:
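The commenter's actual snippet isn't shown; purely as an illustration of the shape, assuming the v1beta1 ArgoCD CR of the OpenShift GitOps operator and with made-up API groups/kinds, it might look like this:

```yaml
apiVersion: argoproj.io/v1beta1
kind: ArgoCD
metadata:
  name: openshift-gitops
  namespace: openshift-gitops
spec:
  # Only discover and watch the APIs Argo CD actually manages; everything else is ignored,
  # which shrinks both the number of watched APIs and the number of cached resources.
  resourceInclusions: |
    - apiGroups: ["", "apps", "batch", "route.openshift.io"]
      kinds: ["ConfigMap", "Secret", "Service", "Deployment", "StatefulSet", "CronJob", "Route"]
      clusters: ["*"]
```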
The change had a significant impact on the cluster synchronisation process. We went from over 3 minutes to about 45 seconds in all clusters, regardless of the RESOURCES COUNT. I assume it's related to the number of APIs to watch. |
I can observe the same issue in openshift-gitops-operator 1.11.0, which uses ArgoCD v2.9.2+c5ea5c4: the reconciliation queue just gets stuck and is never processed. We have implemented an automatic restart every 3h to make sure that the environment gets reconciled in the end (a sketch of such a restart follows below). After the restart it works for some time. We observe it a lot during creation of many new applications, but we usually only check ArgoCD then, so it probably starts much earlier. We are not using any WATCH exclusions; around 250 applications managing 4 external clusters at the moment. We see the same error in the logs:
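A minimal sketch of such a periodic restart as a CronJob (not the poster's actual manifest; the namespace, ServiceAccount, and image are assumptions, and the RBAC granting pod delete permission is omitted):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-argocd-app-controller
  namespace: openshift-gitops
spec:
  schedule: "0 */3 * * *"          # every 3 hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: argocd-controller-restarter   # hypothetical SA with pod delete rights
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest                 # any image that ships kubectl
              command:
                - kubectl
                - delete
                - pod
                - -l
                - app.kubernetes.io/name=argocd-application-controller
```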
@jbartyze-rh assuming the CPU and memory are fine at that moment, and that there are no deadlocks (this can be seen with pprof), did you check the Kubernetes API client metrics? There are some env variables that can be adjusted to reduce burst and throttling: #8404.
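As a hedged sketch of the kind of tuning #8404 discusses, the client QPS/burst env variables can be set on the application controller; shown here as a strategic-merge patch fragment, with illustrative values that should be verified against the docs for your Argo CD version:

```yaml
# Patch fragment for the argocd-application-controller StatefulSet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_K8S_CLIENT_QPS     # client-side rate limit (requests/sec)
              value: "100"
            - name: ARGOCD_K8S_CLIENT_BURST   # allowed burst above the QPS limit
              value: "200"
```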
Those ENV variables look promising! @agaudreault do we have some rule-of-thumb recommendations for tuning those? Missed the pprof part; sadly this is a highly regulated environment where I do not have the flexibility to access such tools. Also, it looks like this Red Hat issue https://issues.redhat.com/browse/GITOPS-4757 is correlated to this particular problem. I cannot share the content, but it was not possible to replicate the bug with ArgoCD 2.11. Once the new openshift-gitops-operator 1.13 is released, I will validate whether the controller hang is still happening. There is some information (let's treat it with a grain of salt) that the problem could be correlated to git timeouts experienced by ArgoCD, and the git cache introduced in 2.11 could help. Some metric dump here. Currently, as we can see, the controller is hung. API activity looks like it goes down when the controller hangs. There is a huge number of goroutines launched in one controller vs the other ones. When the hanging starts we also see a spike in git requests (ls-remote). On the etcd level on that hub we see slowness in disk sync introduced 10 minutes before, which could have an effect on API responses on that particular hub cluster.
I was able to find some controllers that were in a deadlock state on Kubernetes API changes. Here are the relevant logs from the pprof goroutine capture. I am not sure if it will fully fix all the issues in this thread, but it definitely causes the controller to hang. I will be working on a fix for this specific scenario. Summary:
any idea which argo version introduced this bug? |
@agaudreault I had opened an issue for that specific problem and submitted a PR for it here: #18902 |
Checklist:
argocd version
Describe the bug
In the logs, I see all of our (five) clusters start to sync around the same time. This is denoted by the following log message:
After a few minutes, some of the cluster syncs complete, while other clusters never complete. The following logs show this behavior:
Note: For the clusters that don't sync, there is never an additional log message that states the sync failed. I've watched the logs for up to three days.
Shortly after this occurs, the Argo CD UI shows applications in a perpetual syncing ("Refreshing") state. See screenshot #1. This state continues until the application-controller and repo-server are restarted. In addition to the behavior above, I also see this new message in the logs:
Note: This log message never occurs unless we're experiencing this issue and goes away after cycling the application-controller and repo-server.
To Reproduce
Wait for the clusters to automatically sync, which appears to happen once a day (perhaps this is related to the generated manifest cache invalidation that defaults to 24h).
Expected behavior
"Cluster successfully synced"
message for each cluster.Screenshots
Version
Logs
Other notes
We are using AWS EKS but are not using AWS VPC CNI network policies.