Alert rule etcdHighNumberOfFailedGRPCRequests in Prometheus #13147

Closed

aldelsa opened this issue Jun 25, 2021 · 8 comments
aldelsa commented Jun 25, 2021

ISSUE TYPE

Bug Report

SUMMARY

We have a Kubernetes cluster that uses kube-prometheus-stack for monitoring. This Prometheus ships with several alert rules that check the whole cluster, one of which checks requests to the Kubernetes etcd. The problem is that this rule fires persistently with the message:
message = etcd cluster "kube-etcd": 100% of requests for Watch failed on etcd instance ...
but the cluster is running properly.
We have changed the etcd image to a newer one and we still see the same problem.

ENVIRONMENT

Kubernetes version: 1.21.1
Etcd version: gcr.io/etcd-development/etcd:v3.4.16
Kube-prometheus-stack: 15.4.4
Operating System: Ubuntu 18.04

EXPECTED RESULTS

We should not receive any alert about etcd.

ACTUAL RESULTS
Labels
alertname = etcdHighNumberOfFailedGRPCRequests
grpc_method = Watch
grpc_service = etcdserverpb.Watch
instance = 192.168.251.221:2381
job = kube-etcd
prometheus = monitoring/kube-prometheus-stack-prometheus
severity = warning
Annotations
message = etcd cluster "kube-etcd": 100% of requests for Watch failed on etcd instance 192.168.251.221:2381.
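
For reference, the expression behind this alert in the etcd mixin at that time looked roughly like the sketch below (the exact grouping labels and the 1% threshold may differ across kube-prometheus-stack versions). Note that any grpc_code other than "OK" counts toward the failure ratio:

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) without (grpc_type, grpc_code)
  /
  sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 1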

ksa-real commented Jul 1, 2021

I wonder if #9166 is related

ksa-real commented Jul 2, 2021

I see that, according to the Prometheus data, the gRPC calls to etcd's Watch method finish with either the "Cancelled" or "Unknown" grpc code. Check the graph of the expression below, e.g. over the last 24 hours:

sum by (grpc_code) (grpc_server_handled_total{job=~".*etcd.*",grpc_service="etcdserverpb.Watch",instance="10.10.10.5:2379",grpc_method="Watch", job="kube-etcd"})

So effectively, if there are no calls to the Watch API, the service looks healthy. If there are calls, they complete with a grpc code other than "OK", and Prometheus produces an alert. Can someone comment on whether this behavior is expected?
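
To separate genuine failures from client-initiated cancellations, one could also graph a variant like the one below (hypothetical; substitute your own job/instance labels). It rates only the codes that would still count as failures once "Cancelled" is excluded:

sum by (grpc_code) (rate(grpc_server_handled_total{job="kube-etcd", grpc_service="etcdserverpb.Watch", grpc_method="Watch", grpc_code!~"OK|Cancelled"}[5m]))

If this stays at zero while the original alert still fires, the only "failures" are cancelled Watch streams.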

@allenporter
Contributor

I believe, from when I've looked into this before, that it is #10289

@allenporter
Contributor

(The fix appears in the etcd 3.5 changelog, though I have not upgraded yet)

@allenporter
Contributor

@lilic updated the rules for etcdHighNumberOfFailedGRPCRequests in #13127

roughly:

grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"

Now I think upstream projects that copied these rules just need to be updated. Great!
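
Plugged into the failure-ratio expression, the updated warning rule looks roughly like this (a sketch of the post-#13127 rule; see the PR for the exact expression and thresholds):

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
  /
  sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 1

With this matcher, cancelled Watch streams no longer count as failed requests.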

@allenporter
Contributor

kube-prometheus-stack currently cannot easily pick up the latest etcd rules due to prometheus-community/helm-charts#225

@allenporter
Contributor

I think this issue is fixed and can be closed.

Member

spzala commented Aug 19, 2021

@allenporter thanks for the findings!
@aldelsa closing the issue per @allenporter's comments and #13127. Please feel free to reopen if needed. Thanks!

@spzala spzala closed this as completed Aug 19, 2021
hwuethrich added a commit to zebbra/victoriametrics-helm-charts that referenced this issue Jul 31, 2022
This fixes false positives for `etcdHighNumberOfFailedGRPCRequests` alerts
(see etcd-io/etcd#13147)
hwuethrich added a commit to zebbra/victoriametrics-helm-charts that referenced this issue Oct 10, 2022
This fixes false positives for `etcdHighNumberOfFailedGRPCRequests` alerts
(see etcd-io/etcd#13147)