etcdHighNumberOfFailedGRPCRequests alert spam from kube-prometheus-stack #239

Closed
allenporter opened this issue Jul 11, 2021 · 3 comments

allenporter commented Jul 11, 2021

This issue has been discussed in many places, e.g. etcd-io/etcd#13147. Basically, the etcdHighNumberOfFailedGRPCRequests rule matches canceled etcd RPCs.

https://github.com/etcd-io/etcd/blob/release-3.5/contrib/mixin/mixin.libsonnet currently has a fix for this; however, it is not straightforward to include in kube-prometheus-stack due to prometheus-community/helm-charts#1155.

For now, I've silenced etcdHighNumberOfFailedGRPCRequests and am using a custom alert instead, from ce56fbb.
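
For illustration, a custom rule along these lines might look roughly like the sketch below. This is not the actual contents of ce56fbb. It assumes the approach of counting only gRPC codes that indicate server-side failure, so canceled RPCs no longer match. The group name, the list of codes, the 1% threshold, the `for` duration, and the severity label are all assumptions and should be checked against the linked release-3.5 mixin before use; the `job` selector is borrowed from the query later in this thread.

```yaml
groups:
  - name: etcd-custom  # hypothetical group name
    rules:
      - alert: etcdHighNumberOfFailedGRPCRequestsCustom
        # Percentage of etcd gRPC requests that ended in a failure code.
        # Codes such as Canceled are deliberately left out of the numerator.
        # The code list below is an assumption, not copied from ce56fbb.
        expr: |
          100 * sum without (grpc_type, grpc_code) (
            rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])
          )
          /
          sum without (grpc_type, grpc_code) (
            rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])
          )
          > 1
        for: 10m  # assumed duration; tune per cluster
        labels:
          severity: warning  # assumed severity
        annotations:
          description: '{{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
```

With kube-prometheus-stack, the same `groups` content would typically be wrapped in a PrometheusRule resource or supplied through the chart's values, depending on how rules are managed in the cluster.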

(This issue tracks cleanup for my specific cluster while waiting for a proper upstream fix to kube-prometheus-stack.)

allenporter commented Jul 11, 2021

I can confirm that rolling out etcd 3.5 greatly reduces the number of Unavailable return codes. This query shows a large drop after the update:

100 * sum without(grpc_type) (rate(grpc_server_handled_total{grpc_code=~"Unavailable",job=~".*etcd.*",grpc_method="Watch"}[5m])) / sum without(grpc_type) (rate(grpc_server_handled_total{job=~".*etcd.*",grpc_method="Watch"}[5m])) 

So the etcd error code fix, combined with the etcd 3.5 monitoring rule improvements, should be effective at reducing the alert spam.

@allenporter

I will keep using the custom etcdHighNumberOfFailedGRPCRequestsCustom alert and keep this issue open until the upstream issue prometheus-community/helm-charts#1155 is resolved and kube-prometheus-stack can correctly import the etcd rules.

@allenporter

Underlying kube-prometheus-stack alerts are now updated.
