Alert rule etcdHighNumberOfFailedGRPCRequests in Prometheus #13147

Closed

aldelsa opened this issue Jun 25, 2021 · 8 comments
aldelsa commented Jun 25, 2021

ISSUE TYPE

Bug Report

SUMMARY

We have a Kubernetes cluster that uses kube-prometheus-stack for monitoring. This Prometheus ships with several alert rules that check the whole cluster, one of which checks requests to the Kubernetes etcd. The problem is that this rule fires persistently with the message:
message = etcd cluster "kube-etcd": 100% of requests for Watch failed on etcd instance ...
but the cluster is running properly.
We have changed the etcd image to a newer one and we still see the same problem.

ENVIRONMENT

Kubernetes version: 1.21.1
Etcd version: gcr.io/etcd-development/etcd:v3.4.16
Kube-prometheus-stack: 15.4.4
Operating System: Ubuntu 18.04

EXPECTED RESULTS

We should not receive any alert about etcd.

ACTUAL RESULTS
Labels
alertname = etcdHighNumberOfFailedGRPCRequests
grpc_method = Watch
grpc_service = etcdserverpb.Watch
instance = 192.168.251.221:2381
job = kube-etcd
prometheus = monitoring/kube-prometheus-stack-prometheus
severity = warning
Annotations
message = etcd cluster "kube-etcd": 100% of requests for Watch failed on etcd instance 192.168.251.221:2381.
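
For reference, the expression behind this alert in the etcd mixin at that time looked roughly like the sketch below (the exact grouping labels and the 1% threshold may differ across kube-prometheus-stack versions). Note that any grpc_code other than "OK" counts toward the failure ratio:

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) without (grpc_type, grpc_code)
  /
  sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 1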

ksa-real commented Jul 1, 2021

I wonder if #9166 is related

ksa-real commented Jul 2, 2021

I see that, according to the Prometheus data, the gRPC calls to etcd's Watch method finish with either the "Cancelled" or "Unknown" grpc code. Check the graph of the expression below, e.g. over the last 24 hours:

sum by (grpc_code) (grpc_server_handled_total{job=~".*etcd.*",grpc_service="etcdserverpb.Watch",instance="10.10.10.5:2379",grpc_method="Watch", job="kube-etcd"})

So effectively, if there are no calls to the Watch API, the service looks healthy. If there are calls, they complete with a grpc code other than "OK", and Prometheus produces an alert. Can someone comment on whether this behavior is expected?
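
To separate genuine failures from client-initiated cancellations, one could also graph a variant like the one below (hypothetical; substitute your own job/instance labels). It rates only the codes that would still count as failures once "Cancelled" is excluded:

sum by (grpc_code) (rate(grpc_server_handled_total{job="kube-etcd", grpc_service="etcdserverpb.Watch", grpc_method="Watch", grpc_code!~"OK|Cancelled"}[5m]))

If this stays at zero while the original alert still fires, the only "failures" are cancelled Watch streams.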

@allenporter
Contributor

I believe, from when I've looked into this before, that it is #10289

@allenporter
Contributor

(The fix appears in the etcd 3.5 changelog, though I have not upgraded yet)

@allenporter
Contributor

@lilic updated the rules for etcdHighNumberOfFailedGRPCRequests in #13127

roughly:

grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"

Now I think upstream projects that copied these rules just need to be updated. Great!
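
Plugged into the failure-ratio expression, the updated warning rule looks roughly like this (a sketch of the post-#13127 rule; see the PR for the exact expression and thresholds):

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
  /
  sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 1

With this matcher, cancelled Watch streams no longer count as failed requests.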

@allenporter
Contributor

kube-prometheus-stack currently cannot easily pick up the latest etcd rules due to prometheus-community/helm-charts#225

@allenporter
Contributor

I think this issue is fixed and can be closed.

Member

spzala commented Aug 19, 2021

@allenporter thanks for the findings!
@aldelsa closing the issue per @allenporter's comments and #13127. Please feel free to reopen if needed. Thanks!

@spzala spzala closed this as completed Aug 19, 2021
hwuethrich added a commit to zebbra/victoriametrics-helm-charts that referenced this issue Jul 31, 2022
This fixes false positives for `etcdHighNumberOfFailedGRPCRequests` alerts
(see etcd-io/etcd#13147)
hwuethrich added a commit to zebbra/victoriametrics-helm-charts that referenced this issue Oct 10, 2022
This fixes false positives for `etcdHighNumberOfFailedGRPCRequests` alerts
(see etcd-io/etcd#13147)