EtcdHighNumberOfFailedGRPCRequests #248
Comments
To be more specific, the query behind this alert always returns "100" as its value. I suspect that the query is wrong:
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5
In my opinion it should be (remove the leading "100 *"):
sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5 |
I also see this on OpenShift 3.11 after enabling etcd monitoring. |
The current query seems correct to me. It alerts on more than 5% of requests failing.
Is it the same for you? Maybe we should ignore the watches here. |
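To make the arithmetic of the alert expression concrete, here is a minimal Python sketch of what the query computes per (grpc_service, grpc_method) group. The rates in `RATES` are invented for illustration, and the idea that Watch calls are the ones reported with a non-OK code is an assumption based on the watch remark above, not something measured here. `failed_percent` and `firing` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical per-second request rates keyed by
# (grpc_service, grpc_method, grpc_code). Invented numbers, not measurements.
RATES = {
    ("etcdserverpb.KV", "Range", "OK"): 49.0,
    ("etcdserverpb.KV", "Range", "Unavailable"): 1.0,
    # Assumption: every Watch call in this window ends with a non-OK code.
    ("etcdserverpb.Watch", "Watch", "Unavailable"): 0.5,
}


def failed_percent(rates):
    """Mimic 100 * sum by(...)(non-OK rate) / sum by(...)(total rate)."""
    failed = defaultdict(float)
    total = defaultdict(float)
    for (service, method, code), rate in rates.items():
        key = (service, method)
        total[key] += rate
        if code != "OK":
            failed[key] += rate
    # Only groups that have at least one non-OK series produce a result,
    # which matches how the division pairs up series on both sides.
    return {key: 100.0 * failed[key] / total[key] for key in failed}


def firing(percentages, threshold=5.0):
    """Keep only the groups above the alert threshold (the '> 5' part)."""
    return {key: pct for key, pct in percentages.items() if pct > threshold}


if __name__ == "__main__":
    for key, pct in failed_percent(RATES).items():
        print(key, pct)
    print("firing:", firing(failed_percent(RATES)))
```

Under these assumed numbers, the Range group sits at 2% and stays below the threshold, while a group whose calls all end with a non-OK code evaluates to exactly 100 regardless of how few calls there are. That would be consistent with the query being correct as a percentage and still always showing "100" for one method.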
Hi @nabbdl, just wondering if you are still seeing that error and whether you resolved the mystery? I'm facing the same error here and I'm not sure why it is happening. |
Hi @zot24. Unfortunately, I'm still seeing the same error. |
@nabbdl jtlyk I have been doing some research, and after reading a lot of comments I think I'll just ignore those alerts for now (poseidon/typhoon#175). There are a bunch of issues regarding this error message, but what's going on is pretty much summarized in etcd-io/etcd#10289, and there is still no fix for it. In more detail, I think this is the offending line: https://github.com/gyuho/etcd/blob/0cf9382024da6132cb5f0778c3fb43e4a6c88afd/etcdserver/api/v3rpc/util.go#L111 |
If you are using jsonnet, you could add the following to suppress that rule for now:
|
@jtlyk thank you for the update. I'm currently using the "cluster monitoring operator" provided with OpenShift, so I can't use jsonnet to disable the rule. The only thing I can do for now is to completely disable "etcd monitoring". Or maybe the cluster-monitoring operator will update itself and pick up the modification? |
You could do what I did in my clusters and create a silence for those alerts in alertmanager, but it does look like they may be backporting it currently: #383 |
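For anyone who wants to script such a silence instead of clicking through the UI, here is a minimal Python sketch that builds the JSON body for Alertmanager's `POST /api/v2/silences` endpoint. It assumes an Alertmanager version that serves the v2 API (older releases may not). `build_silence` is a hypothetical helper, and the author and comment strings are placeholders. The sketch only builds and prints the payload; sending it is left to whatever HTTP client you already use.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, created_by, comment, now=None):
    """Build a silence payload matching one alert name exactly.

    Field names follow the Alertmanager v2 silence schema; the helper
    itself is illustrative and not part of any library.
    """
    now = now or datetime.now(timezone.utc)
    ends = now + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": ends.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    payload = build_silence(
        "EtcdHighNumberOfFailedGRPCRequests",
        hours=24,
        created_by="placeholder-user",  # placeholder value
        comment="Known false positive, see this issue",  # placeholder value
    )
    print(json.dumps(payload, indent=2))
```

A time-limited silence is arguably safer than disabling etcd monitoring altogether, since the other etcd alerts keep working and the silence expires on its own once a fixed rule is rolled out.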
Fix backported in #383 /close |
@paulfantom: Closing this issue. |
I enabled the "cluster monitoring operator" on several OKD 3.11 clusters. Everything is working fine except for etcd monitoring. I followed the documentation to enable etcd monitoring. It seems to work: most of the checks are green, except "EtcdHighNumberOfFailedGRPCRequests", which is always triggered even though the etcd cluster is working correctly. Am I missing something, or is there a known issue with enabling etcd cluster monitoring?