duplicate metrics warnings in service-mirror #11839

Closed
siggy opened this issue Dec 27, 2023 · 0 comments · Fixed by #11875 or #11922
siggy commented Dec 27, 2023

What is the issue?

I'm seeing `failed to register Prometheus gauge ... duplicate metrics collector registration attempted` warnings for the `service_cache_size` and `endpoints_cache_size` metrics in the service-mirror log.

How can it be reproduced?

Follow the Multicluster install instructions: https://linkerd.io/2.14/tasks/multicluster/, then:

kubectl -n linkerd-multicluster logs -f deploy/linkerd-service-mirror-east -c service-mirror

Logs, error output, etc

$ kubectl -n linkerd-multicluster logs -f deploy/linkerd-service-mirror-east -c service-mirror
...
time="2023-12-26T22:40:54Z" level=warning msg="failed to register Prometheus gauge Desc{fqName: \"service_cache_size\", help: \"Number of items in the client-go service cache\", constLabels: {cluster=\"remote\"}, variableLabels: []}: duplicate metrics collector registration attempted"
time="2023-12-26T22:40:54Z" level=warning msg="failed to register Prometheus gauge Desc{fqName: \"endpoints_cache_size\", help: \"Number of items in the client-go endpoints cache\", constLabels: {cluster=\"remote\"}, variableLabels: []}: duplicate metrics collector registration attempted"

output of linkerd check -o short

$ linkerd check -o short
linkerd-buoyant
---------------
‼ Linkerd health ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-health for hints
‼ Linkerd vulnerability report ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-vulnerability-report for hints
‼ Linkerd data plane upgrade assistance ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-data-plane-upgrade for hints
‼ Linkerd trust anchor rotation assistance ok
    Linkerd not connected to Buoyant Cloud
    see https://buoyant.io/linkerd/debugging#linkerd-trust-anchor-rotation for hints

Status check results are √

Environment

  • Kubernetes Version: v1.27.3
  • Cluster Environment: kind
  • Host OS: Ubuntu 22.04.3 LTS
  • Linkerd version: stable-2.14.7

Possible solution

Ensure Prometheus metrics are registered only once, preferably at startup, and/or scope each registration with subsystem names.

Additional context

No response

Would you like to work on fixing this bug?

maybe

@siggy siggy added the bug label Dec 27, 2023
alpeb added a commit that referenced this issue Jan 3, 2024
Fixes #11839

When in `restartClusterWatcher` we fail to connect to the target cluster
for whatever reason, the function gets called again 10s later, and tries
to register the same prometheus metrics without unregistering them
first, which generates warnings.

The problem lies in `NewRemoteClusterServiceWatcher`, which instantiates
the remote kube-api client and registers the metrics, returning a nil
object if the client can't connect. `cleanupWorkers` at the beginning of
`restartClusterWatcher` won't unregister those metrics because of that
nil object.

This fix reorders `NewRemoteClusterServiceWatcher` so that an object is
returned even when there's an error, so cleanup on that object can be
performed.
@alpeb alpeb self-assigned this Jan 3, 2024
olix0r pushed a commit that referenced this issue Jan 6, 2024
Unregister prom gauges when recycling cluster watcher

Fixes #11839

When in `restartClusterWatcher` we fail to connect to the target cluster
for whatever reason, the function gets called again 10s later, and tries
to register the same prometheus metrics without unregistering them
first, which generates warnings.

The problem lies in `NewRemoteClusterServiceWatcher`, which instantiates
the remote kube-api client and registers the metrics, returning a nil
object if the client can't connect. `cleanupWorkers` at the beginning of
`restartClusterWatcher` won't unregister those metrics because of that
nil object.

To fix this, gauges are unregistered on error.
mateiidavid added a commit that referenced this issue Jan 12, 2024
This edge release introduces a number of different fixes and improvements. More
notably, it introduces a new `cni-repair-controller` binary to the CNI plugin
image. The controller will automatically restart pods that have not received
their iptables configuration.

* Removed shortnames from Tap API resources to avoid colliding with existing
  Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support upcoming mesh expansion
  feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI
  identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to
  automatically restart misconfigured pods that are missing iptables rules
  ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror
  component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` json
  output ([#11889])
* Changed how `Server` updates are handled in the destination service. The
  change will ensure that during a cluster resync, consumers won't be
  overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes
  client cannot be successfully initialised ([#11917])

[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917

Signed-off-by: Matei David <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 6, 2024