
Cluster syncs hang (syncs never complete) #18467

Closed
3 tasks done
dnlloyd opened this issue May 31, 2024 · 30 comments · Fixed by argoproj/gitops-engine#604
Labels
bug/in-triage (This issue needs further triage to be correctly classified), bug (Something isn't working), component:api (API bugs and enhancements), component:core (Syncing, diffing, cluster state cache), component:server, type:bug

Comments

@dnlloyd

dnlloyd commented May 31, 2024

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

In the logs, I see all of our (five) clusters start to sync around the same time. This is denoted by the following log message:

msg="Start syncing cluster" server="https://SOMEK8SCLUSTER.us-east-1.eks.amazonaws.com"

After a few minutes, some of the cluster syncs complete, while other clusters never complete. The following logs show this behavior:

06:25:04.499
time="2024-05-31T11:25:04Z" level=info msg="Start syncing cluster" server="ObfuscatedClusterA"

06:25:04.499
E0531 11:25:04.499930       8 retrywatcher.go:130] "Watch failed" err="context canceled"
   - above message repeated multiple times -
 
06:25:35.582
time="2024-05-31T11:25:35Z" level=info msg="Cluster successfully synced" server="ObfuscatedClusterA"
 
06:27:34.708
time="2024-05-31T11:27:34Z" level=info msg="Start syncing cluster" server="ObfuscatedClusterB"
 
06:33:14.119
time="2024-05-31T11:33:14Z" level=info msg="Start syncing cluster" server="ObfuscatedClusterC"
 
06:40:32.176
time="2024-05-31T11:40:32Z" level=info msg="Start syncing cluster" server="ObfuscatedClusterD"
 
06:40:32.992
time="2024-05-31T11:40:32Z" level=info msg="Cluster successfully synced" server="ObfuscatedClusterD"

06:40:32.178
E0531 11:40:32.176967       7 retrywatcher.go:130] "Watch failed" err="context canceled"
   - above message repeated multiple times -

06:41:10.617
time="2024-05-31T11:41:10Z" level=info msg="Start syncing cluster" server="ObfuscatedClusterE"

Note: For the clusters that don't sync, there is never an additional log message that states the sync failed. I've watched the logs for up to three days.

Shortly after this occurs, the Argo CD UI shows applications in a perpetual syncing ("Refreshing") state. See screenshot #1. This state continues until the application-controller and repo-server are restarted.

In addition to the behavior above, I also see this new message in the logs:

retrywatcher.go:130] "Watch failed" err="context canceled"

Note: This log message never occurs unless we're experiencing this issue and goes away after cycling the application-controller and repo-server.

To Reproduce

Wait for the clusters to automatically sync, which appears to happen once a day (perhaps this is related to the generated manifest cache invalidation that defaults to 24h).

Expected behavior

  • All cluster syncs complete and I see a "Cluster successfully synced" message for each cluster.
  • The Argo CD UI shows applications in a synced state.

Screenshots

1. [screenshot: applications in a perpetual "Refreshing" state]

Version

argocd: v2.6.4+7be094f.dirty
  BuildDate: 2023-03-07T23:43:59Z
  GitCommit: 7be094f38d06859b594b98eb75c7c70d39b80b1e
  GitTreeState: dirty
  GoVersion: go1.20.2
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v2.10.9+c071af8
  BuildDate: 2024-04-30T15:53:28Z
  GitCommit: c071af808170bfc39cbdf6b9be4d0212dd66db0c
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.3.0+ksops.v4.3.1 2024-01-10T15:16:39Z
  Helm Version: v3.14.3+gf03cc04
  Kubectl Version: v0.26.11
  Jsonnet Version: v0.20.0

Logs

See inline logs above

Other notes

We are using AWS EKS but are not using AWS VPC CNI network policies.

@dnlloyd dnlloyd added the bug Something isn't working label May 31, 2024
@jenna-foghorn

Same issue here!

We are running argocd-server v2.10.9 as well... this has been going on for weeks. The only fix we've found is to restart the argocd-application-controllers each time our developers run into the issue.

We are configured for HA, are using annotations for paths, and are using a 20 min jitter... due to deploying all apps from one 'deployment' repo, which is essentially our overlay repo. We've also tried increasing the application processors and using the 'round-robin' sharding algorithm.

Annotation added to app manifests

    # resolves to the spec.source.path directory
    argocd.argoproj.io/manifest-generate-paths: .

@tooptoop4

What was the latest version of Argo CD you used that did not have this issue? What Kubernetes version is the cluster?

@dnlloyd
Author

dnlloyd commented Jun 1, 2024

What was the latest version of Argo CD you used that did not have this issue? What Kubernetes version is the cluster?

We were using ArgoCD version 2.6.6 + k8s version 1.25 for about a year before this issue started occurring. The first occurrence of this issue did happen 2-3 weeks after a k8s upgrade to 1.29. Since then, we have upgraded ArgoCD to v2.10.9 as noted in the issue.

Current k8s version: 1.29

@dnlloyd
Author

dnlloyd commented Jun 1, 2024

Issue occurred again this morning right on schedule. This time 3 out of the 5 clusters "successfully synced":

 
05:50:25.809
time="2024-06-01T10:50:25Z" level=info msg="Start syncing cluster" server="https://ObfuscatedClusterA"
 
05:50:53.355
time="2024-06-01T10:50:53Z" level=info msg="Cluster successfully synced" server="https://ObfuscatedClusterA"
 
05:54:58.234
time="2024-06-01T10:54:58Z" level=info msg="Start syncing cluster" server="https://ObfuscatedClusterB"
 
06:03:31.571
time="2024-06-01T11:03:31Z" level=info msg="Start syncing cluster" server="https://ObfuscatedClusterC"
 
06:03:32.280
time="2024-06-01T11:03:32Z" level=info msg="Cluster successfully synced" server="https://ObfuscatedClusterC"
 
06:08:41.076
time="2024-06-01T11:08:41Z" level=info msg="Start syncing cluster" server="https://ObfuscatedClusterD"
 
06:09:38.045
time="2024-06-01T11:09:38Z" level=info msg="Cluster successfully synced" server="https://ObfuscatedClusterD"
 
06:11:49.761
time="2024-06-01T11:11:49Z" level=info msg="Start syncing cluster" server="https://ObfuscatedClusterE"

@tooptoop4

Are you on 1.29.4 or higher? "Kube-apiserver: fixes a 1.27+ regression in watch stability by serving watch requests without a resourceVersion from the watch cache by default, as in <1.27 (disabling the change in #115096 by default). This mitigates the impact of an etcd watch bug (etcd-io/etcd#17555). If the 1.27 change in #115096 to serve these requests from underlying storage is still desired despite the impact on watch stability, it can be re-enabled with a WatchFromStorageWithoutResourceVersion feature gate. (kubernetes/kubernetes#123973, @serathius) [SIG API Machinery]" is a possibly related fix that is only available in >= 1.29.4.

@dnlloyd
Author

dnlloyd commented Jun 3, 2024

@tooptoop4 I mistakenly stated we were on 1.29. We are actually on 1.28.9. I'll see if we can get an upgrade to 1.29.4 (or higher) scheduled so we can test the option you posted.

@dnlloyd
Author

dnlloyd commented Jun 3, 2024

We upgraded to 1.29.4, and I immediately learned that most API server feature gates are not exposed on EKS (there are only three enabled by default on 1.29). So it looks like this option is off the table for now.

@sockejr

sockejr commented Jun 6, 2024

Hello, I have similar behaviour in the GUI with Red Hat OpenShift GitOps version 1.12.3, which is running ArgoCD v2.10.10+9b3d0c0.
In the logs of the app-controller, the pod shows only sync messages and then goes into OOM and CrashLoopBackOff.
In my case, increasing the memory limit of the app-controller pod from 2 GB to 3 GB fixed the issue. Do you see any pod crashes of the app-controller? Any OOMs?

@dnlloyd
Author

dnlloyd commented Jun 6, 2024

In my case, increasing the memory limit of the app-controller pod from 2 GB to 3 GB fixed the issue. Do you see any pod crashes of the app-controller? Any OOMs?

Thanks @sockejr. We have Argo CD deployed to a dedicated 4 node cluster with plenty of CPU/mem available. We currently have no limits on our Argo CD pods. I see no OOM in the logs and all Argo CD pods show healthy and are in a running state.

I'm actually experiencing the issue as I type this. One of our three app controller pods shows it's using 2 cores, while the others are using less than 0.5. The cluster syncs started around an hour ago and 3/5 have completed. Assuming this is like every other occurrence, the remaining two will never complete.

[screenshot]

@dnlloyd
Author

dnlloyd commented Jun 6, 2024

Here's something that stands out to me:
The temporary fix for this issue is to cycle the app controller and repo server pods. Cycling these pods kicks off the same "cluster syncs" as above. But when cycling the pods, the syncs always succeed. Maybe they succeed because they run in a "fresh" container, i.e. with no existing cache on disk?

@sockejr

sockejr commented Jun 6, 2024

Maybe it's a variant of the kubectl/fork issue from here:
Setting or experimenting with different values of --kubectl-parallelism-limit=10 might be worth a try. #5505 (comment)

@dnlloyd
Author

dnlloyd commented Jun 10, 2024

Maybe it's a variant of the kubectl/fork issue from #5505
Setting or experimenting with different values of --kubectl-parallelism-limit=10 might be worth a try. #5505 (comment)

Thanks @sockejr, we looked into your suggestion and even set --kubectl-parallelism-limit, but it did not affect the issue.

@jenna-foghorn

The issue is still occurring on a daily basis (sometimes twice); the only solution we've found is to restart the argocd-application-controller... any news?

@klinux

klinux commented Jun 18, 2024

I have the same issue here: one of our application-controllers gets OOMKilled, and the only solution that works is to restart the application-controller.

@jenna-foghorn

Uggghhhh.....

@wanddynosios

I'm trying to figure out if the issue we are having (similar to @klinux) is the same thing everybody else here is experiencing:

Without any (apparent) increase in workload, the application-controller-0 pod (always the -0 pod) starts OOM-ing; however, recreating all the pods does not help. We had this issue a few months ago, when we had to bump the memory limit to 3.5 Gi, and now again, where we had to bump it to 5 Gi.

The interesting thing is that memory usage peaks at about 2 Gi, but we still need 5 Gi for the pod to start and not immediately get OOM-killed:
[screenshot]
There are of course peaks in memory usage too short to be picked up by metrics-server, so it sure feels like some kind of memory leak.

We "only" have about 230 apps (created by 20 appsets), and we have manifest-generate-paths set to a reasonably fine granularity.

@klinux

klinux commented Jun 26, 2024

Hi @wanddynosios, my problem here is that the project that manages the application has orphaned resource monitoring enabled, and the cluster that the application-controller was monitoring has a lot of orphaned resources. After disabling orphaned resource monitoring in the project, our ArgoCD stopped eating memory.

@agaudreault
Member

@wanddynosios You can usually tell if a pod has been OOM killed by looking at the container status in the Pod manifest. This prometheus alert is also useful.

When the pods start, the application-controller will initiate watches with the configured clusters, and from my understanding this, together with the initial processing of all applications, is what consumes a lot of initial memory, requiring the limit to be much higher than the average memory consumption. Every 24h (I think), or sporadically when a watch fails, the controller will "relist"/reconnect the watch, causing some memory spikes too.

manifest-generate-paths should not affect the controller memory much.

@agaudreault
Member

Some tips to reduce the memory:

And if that is not possible, then increase the memory limit. These tips are based on my observations, but you can also do Go profiling on the pod to find out more about what is consuming the memory.

@wanddynosios

Thank you both for the pointers. We don't have orphaned resources monitoring enabled, and we already have the OOM-killed alert.

I will definitely look into the tips to reduce memory: though I don't think we can watch fewer namespaces, we might have some resources which don't have to be watched, and we can look into sharding. I'll report back :-)

@alexmt alexmt added bug/in-triage This issue needs further triage to be correctly classified component:server type:bug labels Jun 26, 2024
@agaudreault
Member

For CPU issues, I suggest you join Slack and post your ArgoCD metrics dashboard, similar to this post: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1719410085483709?thread_ts=1719236455.982699&cid=C01TSERG0KZ

You can also look at the history for other investigations. Multiple things can cause an application to hang, but most of the time it is reflected by a workqueue with too many items or the controller CPU hitting the ceiling.

Of course, if you don't see any logs in the controller, it may be in a broken state due to a deadlock, which can also be seen with the CPU being near 0.

@dhruvang1
Contributor

@wanddynosios you can set the env variable ARGOCD_CLUSTER_CACHE_LIST_SEMAPHORE, which defaults to 50, to a lower number. This would increase the cluster sync time at startup, but also consume less memory.

From the code:

// EnvClusterCacheListSemaphore is the env variable to control size of the list semaphore
// This is used to limit the number of concurrent memory consuming operations on the
// k8s list queries results across all clusters to avoid memory spikes during cache initialization.
EnvClusterCacheListSemaphore = "ARGOCD_CLUSTER_CACHE_LIST_SEMAPHORE"
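
For illustration, here is a minimal, hypothetical sketch of how a weighted semaphore like this caps the number of concurrent list operations (it is not the actual gitops-engine code; the listResources stand-in and the 200 resource types are made up for the example). Lowering ARGOCD_CLUSTER_CACHE_LIST_SEMAPHORE means fewer list results are held in memory at once, at the cost of a slower initial cluster sync:

package main

import (
	"context"
	"fmt"
	"os"
	"strconv"
	"sync"
	"time"

	"golang.org/x/sync/semaphore"
)

// listResources stands in for listing all objects of one API type, which can
// hold a large result set in memory while it runs.
func listResources(i int) {
	time.Sleep(10 * time.Millisecond)
	fmt.Printf("listed resource type %d\n", i)
}

func main() {
	// Read the limit from the env variable, falling back to the documented default of 50.
	limit := int64(50)
	if v := os.Getenv("ARGOCD_CLUSTER_CACHE_LIST_SEMAPHORE"); v != "" {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil && n > 0 {
			limit = n
		}
	}

	sem := semaphore.NewWeighted(limit)
	var wg sync.WaitGroup

	// Pretend 200 resource types need to be listed across all clusters during
	// cache initialization.
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// At most `limit` goroutines get past this point at once, so at
			// most `limit` list results are held in memory simultaneously.
			if err := sem.Acquire(context.Background(), 1); err != nil {
				return
			}
			defer sem.Release(1)
			listResources(i)
		}(i)
	}
	wg.Wait()
}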

@wanddynosios

So I did some playing around in our non-production environment (~130 apps). I definitely did encounter the core of this issue, where syncs never finish unless you cycle all application-controller pods (and sometimes the repo-server, as noted above).

To summarize my findings regarding resource consumption:

  • Setting ARGOCD_CLUSTER_CACHE_LIST_SEMAPHORE to 10 did not have a measurable effect on startup times or memory usage
  • Reducing the number of status-processors or operation-processors did have an effect on CPU usage, but not a large effect on memory usage
  • Setting the sharding method to 'round-robin' did not have an effect (but I assume that's due to our setup with two controllers and four clusters)
  • Reducing the number of clusters per shard (i.e. spinning up two additional controllers to have four controllers for four clusters) did reduce the minimum amount of RAM needed for startup (by about 30%); however, it was pretty easy to "stress" the pods into OOM'ing and into a state where only restarting all controllers would solve the issue. To me it seems that, per controller, you need a buffer of ~2 Gi memory over the observable peak load (metrics-server) to avoid this error, regardless of the number of clusters per shard.

For now, our previous setup of fewer controllers with a larger memory buffer offers the most stability for the money. But this queueing issue and the high resource consumption don't appear to be directly connected.

@zouloubitos

We are sharing our case here, which may help some others.

We had a similar problem in our Openshift 4.12 environments using ArgoCD 2.10 and below.
These environments were all affected by this instability.

In the argocd-application-controller statefulset logs we could see:
time="2024-06-11T15:55:27Z" level=info msg="Start syncing cluster" server="https://kubernetes.default.svc"
Usually, after about 3 minutes and 20 seconds, we get the following message:
time="2024-06-11T15:58:47Z" level=info msg="Cluster successfully synchronised" server="https://kubernetes.default.svc"
It has happened very often (probably once a day) that the process of refreshing the cache with the cluster gets "stuck" or hangs, and then we see in the web UI that all the applications are in a continuous "refresh" state, as described in the main topic here.

These clusters have a very large number of objects, but only a few of those objects are actually managed by ArgoCD.
The usual CACHE INFO in ArgoCD looked like this:
APIs COUNT: over 200, but the same number in all clusters,
RESOURCES COUNT: from 15000 to 22000,
APPLICATIONS COUNT: About 100 in all clusters.

As already mentioned by @agaudreault, we chose the option:

Watch fewer resources (https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#resource-exclusioninclusion)

We implemented the resourceInclusions/resourceExclusions logic to significantly reduce the COUNT of APIs to watch, and also the COUNT of RESOURCES, as described here for the Openshift GitOps ArgoCD operator: https://docs.openshift.com/gitops/1.12/argocd_instance/argo-cd-cr-component-properties.html

For example, we use the following, which covers the helm charts/objects we deploy through ArgoCD:

  resourceInclusions: |
    - apiGroups:
      - "*"
      clusters:
      - "*"
      kinds:
      - "AppProject"
      - "Application"
      - "ApplicationSet"
      - "ArgoCD"
      - "ConfigMap"
      - "ControllerRevision"
      - "Deployment"
      - "EndpointSlice"
      - "Endpoints"
      - "HorizontalPodAutoscaler"
      - "Ingress"
      - "Pod"
      - "ReplicaSet"
      - "Role"
      - "RoleBinding"
      - "Route"
      - "Secret"
      - "Service"
      - "ServiceAccount"
      - "StatefulSet"

The change had a significant impact on the cluster synchronisation process. We went from over 3 minutes to about 45 seconds in all clusters, regardless of the RESOURCES COUNT. I assume it's related to the number of APIs to watch.

@alexmt alexmt added component:api API bugs and enhancements component:core Syncing, diffing, cluster state cache labels Jul 3, 2024
@jbartyze-rh

I can observe the same issue in openshift-gitops-operator 1.11.0, which uses ArgoCD v2.9.2+c5ea5c4.

The reconciliation queue just gets stuck and is never processed. We have implemented an automatic restart every 3h to make sure that the environment eventually gets reconciled.

[screenshot]

After the restart it works for some time. We observe it a lot during the creation of many new applications, but that is usually when we check ArgoCD, so it probably starts much earlier. We are not using any watch exclusions; around 250 applications managing around 4 external clusters at the moment.

We see the same errors in the logs:

W0703 20:48:32.266283 1 warnings.go:70] apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
time="2024-07-03T20:50:09Z" level=error msg="Failed to cache app resources: error getting resource tree: failed to get app hosts: error synchronizing cache state : failed to sync cluster https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443: failed to load initial state of resource ImageStream.image.openshift.io: the server was unable to return a response in the time allotted, but may still be processing the request" application=argocd-mgmt/c52-machine-config-chrony dedup_ms=0 diff_ms=0 git_ms=73518 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2024-07-03T20:50:09Z" level=error msg="Failed to cache app resources: error getting managed resources: error getting cluster cache: error synchronizing cache state : failed to sync cluster https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443: failed to load initial state of resource ImageStream.image.openshift.io: the server was unable to return a response in the time allotted, but may still be processing the request" application=argocd-mgmt/c52-argocd-managed-ocp dedup_ms=181361 diff_ms=0 git_ms=1234969 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2024-07-03T20:50:09Z" level=error msg="Failed to cache app resources: error getting resource tree: failed to get app hosts: error synchronizing cache state : failed to sync cluster https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443: failed to load initial state of resource ImageStream.image.openshift.io: the server was unable to return a response in the time allotted, but may still be processing the request" application=argocd-mgmt/c52-image dedup_ms=0 diff_ms=0 git_ms=73913 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2024-07-03T20:50:10Z" level=error msg="Failed to cache app resources: error getting managed resources: error getting cluster cache: error synchronizing cache state : failed to sync cluster https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443: failed to load initial state of resource ImageStream.image.openshift.io: the server was unable to return a response in the time allotted, but may still be processing the request" application=argocd-mgmt/c52-monitoring dedup_ms=182086 diff_ms=12 git_ms=1099531 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2024-07-03T20:50:12Z" level=error msg="Failed to cache app resources: error getting resource tree: failed to get app hosts: error synchronizing cache state : failed to sync cluster https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443: failed to load initial state of resource ImageStream.image.openshift.io: the server was unable to return a response in the time allotted, but may still be processing the request" application=argocd-mgmt/c52-ipam-controller dedup_ms=0 diff_ms=0 git_ms=0 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
E0703 20:50:49.184759 1 retrywatcher.go:130] "Watch failed" err="context canceled"
   - above message repeated multiple times -
E0703 20:50:49.185028 1 retrywatcher.go:130] "Watch failed" err="Get \"https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443/apis/user.openshift.io/v1/users?allowWatchBookmarks=true&resourceVersion=165834346&watch=true\": context canceled"
   - followed by more "Watch failed" err="context canceled" messages -

@agaudreault
Member

@jbartyze-rh assuming the CPU and memory are fine at that moment, and that there are no deadlocks (this can be seen with pprof), did you check the Kubernetes API client metrics? There are some env variables that can be adjusted to reduce burst and throttling: #8404.

@jbartyze-rh

jbartyze-rh commented Jul 4, 2024

Those env variables look promising! @agaudreault, do we have some rule-of-thumb recommendations for tuning those?

I missed the pprof part - sadly, this is a highly regulated environment where I do not have the flexibility to access such tools.

It also looks like this RH OCP bug https://issues.redhat.com/browse/GITOPS-4757 is correlated to this particular issue. I cannot share the content, but it was not possible to replicate the bug with ArgoCD 2.11. Once the new openshift-gitops-operator 1.13 is released, I will validate whether the controller hang is still happening. There is some information (let's treat it with a grain of salt) that the problem could be correlated to git timeouts experienced by ArgoCD, and that the git cache introduced in 2.11 could help.

Some metric dumps here:

Currently, as we can see, the controller is hung.
[screenshot]

API activity looks like it goes down when the controller hangs.
[screenshot]

There is a huge number of goroutines launched in one controller vs. the other ones.
[screenshot]

When the hanging starts, we also see a spike in git requests (ls-remote).
[screenshot]

At the etcd level on that hub, we see slowness in disk sync introduced 10 minutes before. This could have an effect on API responses on that particular hub cluster.
[screenshot]

At the API level on that hub:
[screenshot]

@agaudreault
Member

I was able to find some controllers that were in a deadlock state on Kubernetes API changes. Here are the relevant logs from the pprof goroutine capture. I am not sure if it will fully fix all the issues in this thread, but it definitely causes the controller to hang.

I will be working on a fix for this specific scenario.

Summary:

  • loadInitialState(..., false) is called on a watch event. At this moment, it does not hold the cache lock.
  • It ends up calling listResources, which uses a semaphore (default of 50).
  • During the list, it performs the provided callback.
  • The provided callback tries to obtain the cache lock.
  • [repeat 50x]
  • A watch event is received for a new/unwatched Kubernetes API version.
  • It obtains the cache lock and calls startMissingWatches.
  • loadInitialState(..., true) is called on the watch event. At this moment, it does hold the cache lock.
  • It ends up calling listResources, which uses a semaphore (default of 50).
  • It waits for the semaphore (50) because all slots are in use, while it is holding the cache lock.
  • All 50 semaphore holders are blocked waiting for the cache lock (a minimal sketch of this lock-ordering cycle is included after the goroutine dumps below).
// NO LOCK. HAS SEMAPHORE LOCK. WAITING FOR LOCK...... 50 routines, and semaphore lock is 50..... 
50 @ 0x43e64e 0x44faf8 0x44facf 0x46dce5 0x47d13d 0x47e611 0x47e5f2 0x1f1b37c 0x1f1c0e7 0x1f1b6a2 0x1f1bd33 0x1f1c996 0x1b7ab1b 0x1335f53 0x133496c 0x13366e7 0x1336569 0x1335fb3 0x1335ecb 0x1b7a916 0x1f1c785 0x471e61
#	0x46dce4	sync.runtime_SemacquireMutex+0x24								/usr/local/go/src/runtime/sema.go:77
#	0x47d13c	sync.(*Mutex).lockSlow+0x15c									/usr/local/go/src/sync/mutex.go:171
#	0x47e610	sync.(*Mutex).Lock+0x30										/usr/local/go/src/sync/mutex.go:90
#	0x47e5f1	sync.(*RWMutex).Lock+0x11									/usr/local/go/src/sync/rwmutex.go:147
#	0x1f1b37b	github.com/argoproj/gitops-engine/pkg/cache.runSynced+0x3b					/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:524
#	0x1f1c0e6	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).loadInitialState.func1+0x306	/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:586
#	0x1f1b6a1	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).listResources+0x201			/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:567
#	0x1f1bd32	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).loadInitialState+0x92		/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:571
#	0x1f1c995	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents.func1+0x155		/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:607
#	0x1b7ab1a	github.com/argoproj/gitops-engine/pkg/utils/kube.RetryUntilSucceed.func1+0xfa			/go/pkg/mod/github.com/argoproj/[email protected]/pkg/utils/kube/kube.go:411
#	0x1335f52	k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil.ConditionFunc.WithContext.func1+0x12	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:222
#	0x133496b	k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext+0x4b		/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:235
#	0x13366e6	k8s.io/apimachinery/pkg/util/wait.WaitForWithContext+0x106					/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:662
#	0x1336568	k8s.io/apimachinery/pkg/util/wait.poll+0x88							/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:596
#	0x1335fb2	k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext+0x52				/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:547
#	0x1335eca	k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil+0x6a					/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:538
#	0x1b7a915	github.com/argoproj/gitops-engine/pkg/utils/kube.RetryUntilSucceed+0x115			/go/pkg/mod/github.com/argoproj/[email protected]/pkg/utils/kube/kube.go:409
#	0x1f1c784	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents+0x244			/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:598

// HAS WRITE LOCK. WAITING FOR SEMAPHORE LIST
1 @ 0x43e64e 0x44eac5 0x1d77e45 0x1f1b526 0x1f1bd33 0x1f1ae30 0x1f1e4ca 0x1f1acf0 0x1f1e0f7 0x1f1b3c6 0x1f1d219 0x1b7ab1b 0x1335f53 0x133496c 0x13366e7 0x1336569 0x1335fb3 0x1335ecb 0x1b7a916 0x1f1c785 0x471e61
#	0x1d77e44	golang.org/x/sync/semaphore.(*Weighted).Acquire+0x324						/go/pkg/mod/golang.org/x/[email protected]/semaphore/semaphore.go:60
#	0x1f1b525	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).listResources+0x85			/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:531
#	0x1f1bd32	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).loadInitialState+0x92		/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:571
#	0x1f1ae2f	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).startMissingWatches.func1+0xcf	/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:494
#	0x1f1e4c9	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).processApi+0xc9			/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:736
#	0x1f1acef	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).startMissingWatches+0x36f		/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:493
#	0x1f1e0f6	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents.func1.4+0x16		/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:701
#	0x1f1b3c5	github.com/argoproj/gitops-engine/pkg/cache.runSynced+0x85					/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:526
#	0x1f1d218	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents.func1+0x9d8		/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:700
#	0x1b7ab1a	github.com/argoproj/gitops-engine/pkg/utils/kube.RetryUntilSucceed.func1+0xfa			/go/pkg/mod/github.com/argoproj/[email protected]/pkg/utils/kube/kube.go:411
#	0x1335f52	k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil.ConditionFunc.WithContext.func1+0x12	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:222
#	0x133496b	k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext+0x4b		/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:235
#	0x13366e6	k8s.io/apimachinery/pkg/util/wait.WaitForWithContext+0x106					/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:662
#	0x1336568	k8s.io/apimachinery/pkg/util/wait.poll+0x88							/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:596
#	0x1335fb2	k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext+0x52				/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:547
#	0x1335eca	k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil+0x6a					/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:538
#	0x1b7a915	github.com/argoproj/gitops-engine/pkg/utils/kube.RetryUntilSucceed+0x115			/go/pkg/mod/github.com/argoproj/[email protected]/pkg/utils/kube/kube.go:409
#	0x1f1c784	github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents+0x244			/go/pkg/mod/github.com/argoproj/[email protected]/pkg/cache/cluster.go:598
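
For readers less familiar with this failure mode, here is a minimal, self-contained Go sketch of the same lock-ordering cycle described above (illustrative names only: cacheLock, listSem, and slotThenLock are not the real gitops-engine identifiers, and the semaphore is sized 2 instead of 50 to keep the demo small). Worker goroutines take a semaphore slot and then wait for the lock, while the lock holder waits for a slot; in the real controller neither side ever gives up, which is exactly the hang seen in the goroutine dump:

package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/semaphore"
)

var (
	cacheLock sync.Mutex                 // stands in for the cluster cache lock
	listSem   = semaphore.NewWeighted(2) // stands in for the list semaphore (real default: 50)
)

// slotThenLock mimics loadInitialState(..., false) on a watch event: it takes a
// semaphore slot first and only afterwards waits for the cache lock.
func slotThenLock(id int, wg *sync.WaitGroup) {
	defer wg.Done()
	_ = listSem.Acquire(context.Background(), 1) // slot held...
	defer listSem.Release(1)
	cacheLock.Lock() // ...while blocked on the lock
	defer cacheLock.Unlock()
	fmt.Printf("worker %d finished listing\n", id)
}

func main() {
	// startMissingWatches path: the cache lock is already held...
	cacheLock.Lock()

	// ...while other watch events take every semaphore slot and then queue up
	// behind that same lock.
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go slotThenLock(i, &wg)
	}
	time.Sleep(200 * time.Millisecond) // give the workers time to grab their slots

	// loadInitialState(..., true) now needs a slot while still holding the lock.
	// In the real controller this Acquire has no timeout, so everything hangs
	// forever; the timeout here only lets the demo terminate and report the cycle.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := listSem.Acquire(ctx, 1); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("deadlock: the lock holder waits for a slot, every slot holder waits for the lock")
	} else if err == nil {
		listSem.Release(1)
	}

	cacheLock.Unlock() // breaking the cycle lets the workers drain
	wg.Wait()
}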

@tooptoop4

Any idea which Argo CD version introduced this bug?

@jdroot

jdroot commented Jul 11, 2024

@agaudreault I had opened an issue for that specific problem and submitted a PR for it here: #18902
