Multicluster Service Mirror Mem Leak #10746
mprof dumps from the environment taken 24 hours apart:
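For reference, here is a generic sketch of how such heap/memory ("mprof") and goroutine dumps can be exposed in a Go process via `net/http/pprof`; the port and wiring below are illustrative, not the service mirror's actual configuration:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Illustrative admin endpoint; the real controller's admin server may differ.
	go func() {
		log.Println(http.ListenAndServe("localhost:9990", nil))
	}()

	select {} // stand-in for the controller's actual work
}
```

Two heap profiles taken 24 hours apart can then be compared with `go tool pprof -base first.heap second.heap`, and a goroutine dump like the one shown further down comes from the `/debug/pprof/goroutine?debug=1` endpoint.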
alpeb added a commit that referenced this issue on Apr 27, 2023
Fixes #10746

## Repro

Just by leaving two linked clusters by themselves and monitoring the service mirror pod's RSS (via the `container_memory_working_set_bytes` metric) and goroutine count (via the `go_goroutines` metric), one can see a small but consistent memory increase accompanied by periodic increases of goroutines in batches of 5:

![Screenshot from 2023-04-27 09-28-48](https://user-images.githubusercontent.com/554287/235007199-604dd517-47ae-4f95-a863-21ae16eca898.png)

## Diagnosis

Those goroutine-count increases can be correlated in the controller's log with the probe worker and cluster watcher being recycled:

```
time="2023-04-27T10:06:14Z" level=info msg="Stopping probe worker" probe-key=target
time="2023-04-27T10:06:14Z" level=info msg="Starting probe worker" probe-key=target
time="2023-04-27T10:06:14Z" level=info msg="Received: Stop" apiAddress="https://192.168.32.4:6443" cluster=remote
```

With pprof enabled, the beginning of the goroutine stack dump looks like this:

```
goroutine profile: total 94
10 @ 0x43c976 0x406d3b 0x406878 0x175ee77 0xb2aefe 0xb2adb6 0xb2aca9 0x175edeb 0x175ed95 0xb2cfda 0x46de41
# 0x175ee76 k8s.io/client-go/tools/cache.(*processorListener).run.func1+0x56 /go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:968
# 0xb2aefd k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1+0x3d /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226
# 0xb2adb5 k8s.io/apimachinery/pkg/util/wait.BackoffUntil+0xb5 /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227
# 0xb2aca8 k8s.io/apimachinery/pkg/util/wait.JitterUntil+0x88 /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204
# 0x175edea k8s.io/apimachinery/pkg/util/wait.Until+0x6a /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
# 0x175ed94 k8s.io/client-go/tools/cache.(*processorListener).run+0x14 /go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:967
# 0xb2cfd9 k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1+0x59 /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72

10 @ 0x43c976 0x44c81c 0x175eb99 0xb2cfda 0x46de41
# 0x175eb98 k8s.io/client-go/tools/cache.(*processorListener).pop+0x118 /go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:938
# 0xb2cfd9 k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1+0x59 /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72

8 @ 0x43c976 0x406d3b 0x406878 0x19f35d3 0x46de41
# 0x19f35d2 k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher.func1+0x72 /go/pkg/mod/k8s.io/[email protected]/tools/record/event.go:320

8 @ 0x43c976 0x406d3b 0x406878 0xa88825 0x46de41
# 0xa88824 k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop+0x64 /go/pkg/mod/k8s.io/[email protected]/pkg/watch/mux.go:268

8 @ 0x43c976 0x46abb5 0x1a2a7a5 0x46de41
# 0x46abb4 time.Sleep+0x134 /usr/local/go/src/runtime/time.go:195
# 0x1a2a7a4 github.com/linkerd/linkerd2/multicluster/service-mirror.(*Ticker).loop+0x64 /linkerd-build/multicluster/service-mirror/jittered_ticker.go:55
```

The number of goroutines shown here is almost the same as the number of goroutine-count jumps seen in the chart at the time this dump was taken, which suggests these are indeed the leaks.

The service mirror's main function contains a loop that restarts the cluster watcher (and with it its probe worker) whenever the main watch following the Link resources terminates (I don't know why that happens), and some resources aren't getting cleaned up upon the restart, producing the leak. It would be interesting to see (as a followup) why that restart is required here, but not in the other controllers.
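To make the leak mechanism concrete, here is a heavily simplified, self-contained sketch (not the actual linkerd2 code; all names are illustrative) of a ticker-driven worker being recreated in a restart loop, where skipping the cleanup on each iteration would strand one goroutine and one ticker per restart:

```go
package main

import (
	"fmt"
	"time"
)

// probeWorker is a stand-in for the service mirror's probe worker: it runs a
// goroutine driven by a ticker that is only released when Stop is called.
type probeWorker struct {
	tick *time.Ticker
	done chan struct{}
}

func startProbeWorker() *probeWorker {
	w := &probeWorker{tick: time.NewTicker(50 * time.Millisecond), done: make(chan struct{})}
	go func() {
		defer w.tick.Stop() // release the ticker when the worker exits
		for {
			select {
			case <-w.tick.C:
				// probe the gateway here
			case <-w.done:
				return
			}
		}
	}()
	return w
}

func (w *probeWorker) Stop() { close(w.done) }

func main() {
	// Stand-in for the main loop that recreates the cluster watcher and its
	// probe worker every time the Link watch terminates. If Stop() were
	// skipped, or didn't release the ticker, each iteration would leave one
	// goroutine behind — the pattern visible in the dump above.
	for i := 0; i < 3; i++ {
		w := startProbeWorker()
		time.Sleep(100 * time.Millisecond) // the watch "terminates"
		w.Stop()
	}
	fmt.Println("restarted 3 times without leaking goroutines")
}
```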
## Resolution

We can map each of those entries in the dump to these leaks:

### jittered_ticker.go

Each probe worker starts a jittered ticker whose `Stop()` method was never called. Fixed through a `defer` statement.

### event.go

With every watcher restart a new k8s event recorder was instantiated but never cleaned up. Added a `Shutdown()` call on the event broadcaster in the cluster watcher's `Stop()` method.

### mux.go, shared_informer.go and friends

The cluster watcher attached event handlers to the informers for Services, Endpoints and Namespaces. Added `RemoveEventHandler()` calls for them in the cluster watcher's `Stop()` method.

## Result

After the fix, the goroutine count remains stable:

![image](https://user-images.githubusercontent.com/554287/235007727-3b3b6f31-901d-4c05-aedb-9fa3d3178ee1.png)
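For illustration, a minimal sketch of the cleanup pattern described above, using a fake clientset so it runs without a cluster; it assumes client-go v0.26 or later (where `AddEventHandler` returns a registration handle that `RemoveEventHandler` accepts), and the wiring and names are illustrative rather than the actual cluster watcher code:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/record"
)

func main() {
	client := fake.NewSimpleClientset() // stand-in for the remote cluster's clientset
	factory := informers.NewSharedInformerFactory(client, 0)
	svcInformer := factory.Core().V1().Services().Informer()

	// Register a handler and keep the registration handle so it can be
	// removed later (AddEventHandler returns it in client-go >= v0.26).
	reg, err := svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { fmt.Println("service added") },
	})
	if err != nil {
		panic(err)
	}

	// Event broadcaster/recorder, as a watcher would set up on each (re)start.
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: client.CoreV1().Events(""),
	})
	recorder := broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "cluster-watcher"})
	_ = recorder // the real watcher would emit events through this

	stop := make(chan struct{})
	factory.Start(stop)
	time.Sleep(100 * time.Millisecond) // stand-in for the watcher's lifetime

	// Cleanup on Stop(): without these calls the handler's processorListener
	// goroutines and the broadcaster's watch goroutines outlive the watcher.
	if err := svcInformer.RemoveEventHandler(reg); err != nil {
		panic(err)
	}
	broadcaster.Shutdown()
	close(stop)
}
```

The key point is symmetry: everything the watcher acquires on (re)creation — tickers, the event broadcaster, informer event handlers — gets a matching release in `Stop()`.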
alpeb added a commit that referenced this issue on May 2, 2023
adleong pushed a commit that referenced this issue on Jun 20, 2023
alpeb added a commit that referenced this issue on Sep 6, 2023
What is the issue?
It appears that the multicluster service mirror has a memory leak.
How can it be reproduced?
Still working out the exact trigger, but it's observable in Buoyant's AKS environments as well as in customer environments.
Logs, error output, etc
Output of `linkerd check -o short`
n/a
Environment
Observed in both AKS and EKS environments.
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None