
EtcdHighNumberOfFailedGRPCRequests #248

Closed
nabbdl opened this issue Feb 14, 2019 · 13 comments

@nabbdl

nabbdl commented Feb 14, 2019

I enabled the cluster monitoring operator on several OKD 3.11 clusters. Everything is working fine except for etcd monitoring. I followed the documentation to enable it, and it seems to work: most of the checks are green, except "EtcdHighNumberOfFailedGRPCRequests", which is always triggered even though the etcd cluster is working correctly. Am I missing something, or is there a known issue when enabling etcd cluster monitoring?
[screenshot]

@nabbdl
Author

nabbdl commented Feb 14, 2019

To be more specific, the query behind this alert always returns a value of 100. I suspect the query is wrong.
The current query is:

100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5

In my opinion it should be the following (with the leading 100 * factor removed):

sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5
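
As a side note, a breakdown query like the one below (a diagnostic sketch using the same grpc_server_handled_total metric and labels as above) shows which grpc_service, grpc_method and grpc_code combinations produce the non-OK responses:

sum by(grpc_service, grpc_method, grpc_code) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m]))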

@benhwebster

I also see this on OpenShift 3.11 after enabling etcd monitoring.

@metalmatze
Contributor

The current query seems correct to me. It alerts on more than 5% of requests failing.
On my personal cluster I can see that there's a Watch method for which the alert has been pending for 4 minutes:

{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.135.73.45",job="etcd"}

Is it the same for you? Maybe we should ignore the watches here.
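
If the failures really are limited to Watch streams, one possible variant of the expression (a sketch only, not the rule that eventually shipped) is to exclude the Watch method on both sides of the ratio:

100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",grpc_method!="Watch",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_method!="Watch",job=~".*etcd.*"}[5m])) > 5

The trade-off is that genuine Watch failures would then no longer trigger the alert.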

@nabbdl
Author

nabbdl commented Feb 15, 2019

I tested your query and the result is empty. The strange thing for me is that I see the same alert on all the OpenShift clusters I have installed, and the query for EtcdHighNumberOfFailedGRPCRequests always returns a value of 100.
[screenshot]

@zot24

zot24 commented Mar 19, 2019

Hi @nabbdl, just wondering if you're still seeing that error and whether you solved the mystery? I'm facing the same error here and I'm not sure why it's happening.

@nabbdl
Author

nabbdl commented Mar 19, 2019

Hi @zot24. Unfortunately, I'm still seeing the same error.

@zot24

zot24 commented Mar 19, 2019

@nabbdl just to let you know, I have been doing some research, and after reading a lot of comments (e.g. poseidon/typhoon#175) I think I'll just ignore those alerts for now. There are a bunch of issues regarding this error message, but pretty much everything that's going on is summarized in etcd-io/etcd#10289, and there is still no fix for it.

In more detail, I think this is the offending line: https://github.com/gyuho/etcd/blob/0cf9382024da6132cb5f0778c3fb43e4a6c88afd/etcdserver/api/v3rpc/util.go#L111

@zot24

zot24 commented Mar 19, 2019

If you're using jsonnet, you can add the following to suppress that rule for now:

{
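  // Remove the etcdHighNumberOfFailedGRPCRequests alert from the 'etcd'
  // rule group; all other groups and rules are passed through unchanged.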
  prometheusAlerts+:: {
    groups: std.map(
      function(group)
        if group.name == 'etcd' then
          group {
            rules: std.filter(
              function(rule)
                rule.alert != 'etcdHighNumberOfFailedGRPCRequests',
              group.rules
            ),
          }
        else
          group,
      super.groups
    ),
  },
}

@zot24

zot24 commented Apr 30, 2019

@nabbdl just to let you know, this just got merged: #340

@nabbdl
Author

nabbdl commented May 1, 2019

@zot24 thank you for the update. I'm currently using the "cluster monitoring operator" provided with OpenShift, so I can't use jsonnet to disable the rule. The only thing I can do for now is to completely disable etcd monitoring. Or maybe the cluster-monitoring operator will update itself and pick up the modification?

@benhwebster

You could do what I did in my clusters and create a silence for those alerts in Alertmanager, but it does look like they may be backporting it currently: #383

@paulfantom
Contributor

Fix backported in #383

/close

@openshift-ci-robot
Contributor

@paulfantom: Closing this issue.

In response to this:

Fix backported in #383

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
