`duplicate metrics` warnings in service-mirror #11839

siggy · 2023-12-27T18:08:45Z

What is the issue?

I'm seeing `failed to register Prometheus gauge ... duplicate metrics collector registration attempted` warnings for the `service_cache_size` and `endpoints_cache_size` metrics in the service-mirror log.

How can it be reproduced?

Follow the Multicluster install instructions: https://linkerd.io/2.14/tasks/multicluster/, then:

kubectl -n linkerd-multicluster logs -f deploy/linkerd-service-mirror-east -c service-mirror

Logs, error output, etc

$ kubectl -n linkerd-multicluster logs -f deploy/linkerd-service-mirror-east -c service-mirror
...
time="2023-12-26T22:40:54Z" level=warning msg="failed to register Prometheus gauge Desc{fqName: \"service_cache_size\", help: \"Number of items in the client-go service cache\", constLabels: {cluster=\"remote\"}, variableLabels: []}: duplicate metrics collector registration attempted"
time="2023-12-26T22:40:54Z" level=warning msg="failed to register Prometheus gauge Desc{fqName: \"endpoints_cache_size\", help: \"Number of items in the client-go endpoints cache\", constLabels: {cluster=\"remote\"}, variableLabels: []}: duplicate metrics collector registration attempted"

output of `linkerd check -o short`

$ linkerd check -o short
linkerd-buoyant
---------------
‼ Linkerd health ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-health for hints
‼ Linkerd vulnerability report ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-vulnerability-report for hints
‼ Linkerd data plane upgrade assistance ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-data-plane-upgrade for hints
‼ Linkerd trust anchor rotation assistance ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-trust-anchor-rotation for hints

Status check results are √

Environment

Kubernetes Version: v1.27.3
Cluster Environment: kind
Host OS: Ubuntu 22.04.3 LTS
Linkerd version: stable-2.14.7

Possible solution

Ensure Prometheus metrics are registered only once, preferably at startup, and/or scope each registration with subsystem names.

Additional context

No response

Would you like to work on fixing this bug?

maybe

Fixes #11839

When `restartClusterWatcher` fails to connect to the target cluster for whatever reason, the function is called again 10s later and tries to register the same Prometheus metrics without unregistering them first, which generates warnings. The problem lies in `NewRemoteClusterServiceWatcher`, which instantiates the remote kube-api client and registers the metrics, but returns a nil object if the client can't connect. Because of that nil object, `cleanupWorkers` at the beginning of `restartClusterWatcher` won't unregister those metrics. This fix reorders `NewRemoteClusterServiceWatcher` so that an object is returned even when there's an error, so cleanup can be performed on that object.

Unregister prom gauges when recycling cluster watcher

Fixes #11839

When `restartClusterWatcher` fails to connect to the target cluster for whatever reason, the function is called again 10s later and tries to register the same Prometheus metrics without unregistering them first, which generates warnings. The problem lies in `NewRemoteClusterServiceWatcher`, which instantiates the remote kube-api client and registers the metrics, but returns a nil object if the client can't connect. Because of that nil object, `cleanupWorkers` at the beginning of `restartClusterWatcher` won't unregister those metrics. To fix this, gauges are unregistered on error.
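The failure mode and the "unregister on error" fix described above can be sketched in a few lines. This is only an illustration under stated assumptions: `registry`, `watcher`, `newWatcher`, and `simulate` are hypothetical stand-ins written for this example, not Linkerd's code or the Prometheus client API, and the 10s retry delay is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// registry is a hypothetical in-memory stand-in for a metrics registry.
// It counts duplicate registrations instead of logging a warning.
type registry struct {
	gauges     map[string]bool
	duplicates int
}

func newRegistry() *registry {
	return &registry{gauges: make(map[string]bool)}
}

// register records a gauge by name, or counts a duplicate if it already exists.
func (r *registry) register(name string) {
	if r.gauges[name] {
		r.duplicates++
		return
	}
	r.gauges[name] = true
}

func (r *registry) unregister(name string) {
	delete(r.gauges, name)
}

// watcher is a hypothetical stand-in for the remote cluster watcher.
type watcher struct {
	gauges []string
}

// newWatcher registers the gauges and then tries to connect. When
// unregisterOnError is true, a failed connection removes the gauges again,
// so the next attempt starts from a clean registry.
func newWatcher(reg *registry, connect func() error, unregisterOnError bool) (*watcher, error) {
	names := []string{"service_cache_size", "endpoints_cache_size"}
	for _, name := range names {
		reg.register(name)
	}
	if err := connect(); err != nil {
		if unregisterOnError {
			for _, name := range names {
				reg.unregister(name)
			}
		}
		return nil, err
	}
	return &watcher{gauges: names}, nil
}

// simulate retries newWatcher until the connection succeeds after the given
// number of failures, and returns how many duplicate registrations occurred.
func simulate(failures int, unregisterOnError bool) int {
	reg := newRegistry()
	remaining := failures
	connect := func() error {
		if remaining > 0 {
			remaining--
			return errors.New("cannot connect to target cluster")
		}
		return nil
	}
	for {
		if _, err := newWatcher(reg, connect, unregisterOnError); err == nil {
			return reg.duplicates
		}
	}
}

func main() {
	fmt.Println("duplicates without cleanup:", simulate(2, false)) // prints 4
	fmt.Println("duplicates with cleanup:", simulate(2, true))     // prints 0
}
```

With two failed connection attempts, the version without cleanup re-registers both gauges on each of the two retries (four duplicates), while the version that unregisters on error produces none.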

This edge release introduces a number of different fixes and improvements. More notably, it introduces a new `cni-repair-controller` binary to the CNI plugin image. The controller will automatically restart pods that have not received their iptables configuration.

* Removed shortnames from Tap API resources to avoid colliding with existing Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support upcoming mesh expansion feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to automatically restart misconfigured pods that are missing iptables rules ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` json output ([#11889])
* Changed how `Server` updates are handled in the destination service. The change will ensure that during a cluster resync, consumers won't be overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes client cannot be successfully initialised ([#11917])

[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917

Signed-off-by: Matei David <[email protected]>

siggy added the bug label Dec 27, 2023

alpeb mentioned this issue Jan 3, 2024

Unregister prom gauges when recycling cluster watcher #11875

Merged

alpeb self-assigned this Jan 3, 2024

olix0r closed this as completed in #11875 Jan 6, 2024

mateiidavid mentioned this issue Jan 12, 2024

edge-24.1.1 #11922

Merged

github-actions bot locked as resolved and limited conversation to collaborators Feb 6, 2024
