gRPC code Unavailable instead Canceled #10289

arslanbekov · 2018-11-29T15:44:29Z

After this PR #9178 we get:

It's in etcd running with parameter --log-package-levels=debug:

...
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL")
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 78; CANCEL")
...

in my prometheus i see:

sum by(grpc_type, grpc_service, grpc_code) (grpc_server_handled_total{grpc_code="Unavailable", job="etcd", grpc_type="bidi_stream"}) = 4172

But it seems to me, this not error, and should not fall under Unavailable Code

Unavailable indicates the service is currently unavailable. This is a most likely a transient condition and may be corrected by retrying with a backoff. See litmus test above for deciding between FailedPrecondition, Aborted, and Unavailable.

It looks like Canceled Code

Canceled indicates the operation was canceled (typically by the caller).

Now everyone who uses prometheus operator + alertmanager, get this alert, because CANCEL falls under Unavailable

#9725, #9576, #9166 openshift/origin#20311, prometheus-operator/prometheus-operator#2046, and in google groups

The text was updated successfully, but these errors were encountered:

xiang90 · 2018-11-29T16:24:30Z

@arslanbekov From #9178, it seems that gRPC returned code is unavailable, but the description is canceled. Can you open an issue in grcp-go to ask?

arslanbekov · 2018-11-29T16:30:35Z

@xiang90

but the description is canceled

are you talking about it?
https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L113

I will create issue in grpc-go, thanks

xiang90 · 2018-11-29T16:47:30Z

yes. thank you.

menghanl · 2018-12-03T23:29:15Z

The error string is generated from x/net/http2. The error description comes from method Error().

The error conversion should be handled by mapRecvMsgError, which also calls Error().

Can you add a print to mapRecvMsgError to print the error type and error details?
Something like fmt.Printf("%T %v", err, err). Thanks.

hexfusion · 2018-12-07T15:31:31Z

@arslanbekov just touching base on this did you see @menghanl (grpc-go maintainer) comment above?

arslanbekov · 2018-12-07T15:39:24Z

Yes, @hexfusion thanks for notifying, there is no free time to do it now. Maybe someone will help?

hexfusion · 2018-12-07T17:08:19Z

Yes, @hexfusion thanks for notifying, there is no free time to do it now. Maybe someone will help?

@arslanbekov I will try to take a look soon.

menghanl · 2019-01-17T21:21:51Z

@hexfusion Did you get time to try this? Thanks!

hexfusion · 2019-01-17T21:23:30Z

@menghanl thanks for the reminder, I will work on it before the end of the week.

hexfusion · 2019-01-21T20:46:22Z

@spzala could you please take a look?

spzala · 2019-01-21T20:48:46Z

@hexfusion I sure will, thanks!

spzala · 2019-01-27T22:27:06Z

@hexfusion I sure will, thanks!

Quick update - I had to spend some time creating a cluster with tls but I have it created now, so should be providing more update soon. Thanks!

spzala · 2019-01-30T04:01:33Z

So not sure if I am doing something incorrect or in a different way but I could not reproduce the issue exactly but I do see rpc error: code = Unavailable (i.e. https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L108 instead of L113 as commented in the issue above)
This is how I am testing:
Run this on server:
./bin/etcd --data-dir=data.etcd --name ${ETCD_NAME} --advertise-client-urls ${ADVERTISE_CLIENT_URLS} --listen-client-urls ${LISTEN_CLIENT_URLS} --initial-cluster ${INITIAL_CLUSTER} --initial-cluster-state ${INITIAL_CLUSTER_STATE} --initial-cluster-token ${INITIAL_CLUSTER_TOKEN} --cert-file $CERT_FILE --trusted-ca-file $TRUSTED_CA_FILE --key-file $KEY_FILE --debug
Run below commands on client (a separate VM) on two diff cli:
./bin/etcdctl --endpoints https://<ip>:2379 watch foo --cacert /root/cfssl/ca.pem --cert /root/cfssl/client.pem --key /root/cfssl/client-key.pem
./bin/etcdctl --endpoints https://<ip>:2379 put foo bar --cacert /root/cfssl/ca.pem --cert /root/cfssl/client.pem --key /root/cfssl/client-key.pem
And then cancel the watch command (ctrl+c) which gives me following error most of the time,
2019-01-30 03:35:31.618639 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = client disconnected”)

Also, @menghanl thanks but I couldn't see the print output in the etcd log after I add the line to mapRecvMsgError. Is it logged somewhere other than etcd server output? I also tried adding GRPC_GO_LOG_VERBOSITY_LEVEL=2 and other flags in env. Since I couldn't reproduce issue as explained above, I also tried couple of other Println in different package functions just to play around but didn't see anything in the log. Thanks!

menghanl · 2019-01-30T18:26:08Z

@spzala
Re logging:
Did you call grpclog.Info (or other grpclog functions)? If so:

If grpclogger is the default, then GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info should print everything.
If grpclogger is replaced (by calling SetLogger or SetLoggerV2), then you need to configure the custom logger to log everything.

If you called fmt.Print, it's logged to stdout, and gRPC doesn't do anything special to that.

The function (mapRecvMsgError) is just where I guess the problem is. If nothing is printed, it's totally possible that the bug happens in another codepath. But we can only confirm that when we reproduce the bug.

In the reproducing command, what should be the values of the configs (like ${ADVERTISE_CLIENT_URLS})? I will try to run it myself, but I'm not very familiar with etcd. Thanks!

spzala · 2019-01-30T21:19:56Z

@menghanl thanks much for the great detail!! and yes etcds uses SetLoggerV2 so that might be the reason. I will run again modifying it not to override default grpclog. Also for testing I used both, fmt.Printf and fmt.Println so as you guessed the problem might be something else but I will try again with grpclogger.

About the etcd server flags and their values - for the ones that I used here, I am copying below but you can find those and more flags at https://github.com/etcd-io/etcd/blob/master/etcdmain/help.go. With tls enabled, I have used https instead of http e.g. ADVERTISE_CLIENT_URLS=${HOST}:2379 where HOST is https://<fqdn>

--advertise-client-urls 'http://localhost:2379'
List of this member's client URLs to advertise to the public.
--listen-client-urls 'http://localhost:2379'
List of URLs to listen on for client traffic.
--initial-cluster 'default=http://localhost:2380'
Initial cluster configuration for bootstrapping.
--initial-cluster-state 'new'
Initial cluster state ('new' or 'existing').
--initial-cluster-token 'etcd-cluster'
Initial cluster token for the etcd cluster during bootstrap.
Specifying this can protect you from unintended cross-cluster interaction when running multiple clusters.
--cert-file ''
Path to the client server TLS cert file.
--key-file ''
Path to the client server TLS key file.

For the etcd client - https://github.com/etcd-io/etcd/tree/master/etcdctl

--cacert=/tmp/ca.pem			
verify certificates of TLS-enabled secure servers using this CA bundle
--cert=/tmp/cert.pem				
identify secure client using this TLS certificate file
-key=/tmp/key.pem
identify secure client using this TLS key file

jingyih · 2019-02-11T19:31:39Z

cc @jpbetz

hexfusion · 2019-03-05T13:02:41Z

Thanks for digging @spzala I am going to dig on this further.

spzala · 2019-03-05T15:42:48Z

Thanks @hexfusion !!

hexfusion · 2019-04-28T22:04:06Z

Update: @menghanl was right about the source of the issue the actual inbound error to mapRecvMsgError is.

rpc error: code = Unknown desc = client disconnected

I will connect the rest of the dots and come up with some resolution hopefully soon.

hexfusion · 2019-04-29T04:24:44Z

I have thought about this for a while and although it does feel like there could be a middle ground between Unknown and Unavailable error codes in this situation. For example by allowing Unknown to conditionally remain as the error code similar to [1] vs forcing Unavailable. If we look at the gRPC spec [1] the client generated Unavailable seems sane vs server. So perhaps this is not a matter of using the wrong error code but a result of gRPC having 2 server transports [3]? @menghanl curious your thoughts?

/cc @brancz

Some data transmitted (e.g., request metadata written to TCP connection) before connection breaks	UNAVAILABLE	Client

Server shutting down ......................................................................................	UNAVAILABLE	Server

[1]https://github.com/grpc/grpc-go/blob/66cd5249103cfe4638864f08462537504b07ff1d/internal/transport/handler_server.go#L426
[2] https://github.com/grpc/grpc/blob/master/doc/statuscodes.md
[3] grpc/grpc-go#1819 (comment) (won't fix)

menghanl · 2019-04-29T20:37:44Z

@hexfusion Thanks for the findings!

actual inbound error to mapRecvMsgError is.
rpc error: code = Unknown desc = client disconnected

Do you know where this error is from? This looks like an error returned from a service handler with fmt.Errorf, instead of status.Errorf with an error code.
Also, the error reported originally is rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL, which has different description.

I think the issue raised here is not about which error code to return.
It more about the confusion caused by error code Unavailable while the error description says CANCEL.

What we want to figure out is why (how) rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL is generated.

hexfusion · 2019-04-30T10:09:38Z

@menghanl thanks for the notes, your questions are valid, I will prioritize some time to get you those answers.

These alerts are firing constantly due to some issues around how to etcd increases its metrics. See etcd-io/etcd#10289

darwink · 2019-09-04T03:58:12Z

@hexfusion Is there any progress about this issue ? Certain etcd3 alert rules are being triggered because of this issue https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml#L32 ?

Any update or plan to fix it?

siwyd · 2019-11-13T16:37:44Z

FYI, for those using the etcd mixin, I worked around this by tweaking the etcd selector:

        local etcd = import "etcd-mixin/mixin.libsonnet";

        etcd {
          _config+:: {
            etcd_selector: 'job="etcd",grpc_service!="etcdserverpb.Watch"'
          },
        }

hexfusion · 2019-11-13T18:14:27Z

I am picking this back up now, hopefully for the last time to resolve.

hexfusion · 2019-11-19T17:56:27Z

I have found the source of Unavailable instead of Canceled.

https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/watch.go#L198-L201

When the watch is canceled we pass error

etcd/etcdserver/api/v3rpc/rpctypes/error.go

Line 66 in 63dd73c

    
           ErrGRPCNoLeader                   = status.New(codes.Unavailable, "etcdserver: no leader").Err()

Which is not valid resulting in codes.Unavailable passed to status.New() leading to metrics reporting Unavailable. In my local testing adding a new error using codes.Canceled allows the metrics to properly report.

I will raise a PR in a bit to resolve.

Right now when Envoy shuts down it sometimes sends an Unavailable error code instead of Cancelled - see etcd-io/etcd#10289 for details, but its pretty complex. The end result is this shows up as "internal error" in the dashboards, when there is nothing going wrong. Instead, we should filter these out. Checking the string directly feels a bit like a hack, but etcd does it so seems reasonable: https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L113

* Stop logging client disconnect as a pilot error Right now when Envoy shuts down it sometimes sends an Unavailable error code instead of Cancelled - see etcd-io/etcd#10289 for details, but its pretty complex. The end result is this shows up as "internal error" in the dashboards, when there is nothing going wrong. Instead, we should filter these out. Checking the string directly feels a bit like a hack, but etcd does it so seems reasonable: https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/util.go#L113 * Add deadline exceeded

imunhatep · 2020-04-14T12:25:58Z

Any news on merging fixes?!

Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this: grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting. This patch improves the behavior by: 1. Performing a deeper analysis during stream closure to more conclusively determine that a leader has actually been lost before propagating a ErrGRPCNoLeader error. 2. Returning a ErrGRPCWatchCanceled error if no conclusion can be drawn regarding leader loss. There remains an assumption that absence of leader loss evidence represents a client cancellation, but in practice this seems less likely to break down whereas client cancellations are frequent and expected. This is a continuation of the work already done in etcd-io#11375. Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this: grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting. This patch improves the behavior by: 1. Performing a deeper analysis during stream closure to more conclusively determine that a leader has actually been lost before propagating a ErrGRPCNoLeader error. 2. Returning a ErrGRPCWatchCanceled error if no conclusion can be drawn regarding leader loss. There remains an assumption that absence of evidence of leader loss means a client cancelled, but in practice this seems less likely to break down whereas client cancellations are frequent and expected. This is a continuation of the work already done in etcd-io#11375. Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this: grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting. The commit 9c103dd introduced an interceptor which wraps watch streams requiring a leader, causing those streams to be actively canceled when leader loss is detected. However, the error handling code assumes all stream context cancellations are from the interceptor. This assumption is broken when the context was canceled because of a client stream cancelation. The core challenge is lack of information conveyed via `context.Context` which is shared by both the send and receive sides of the stream handling and is subject to cancellation by all paths (including the gRPC library itself). If any piece of the system cancels the shared context, there's no way for a context consumer to understand who cancelled the context or why. To solve the ambiguity of the stream interceptor code specifically, this patch introduces a custom context struct which the interceptor uses to expose a custom error through the context when the interceptor decides to actively cancel a stream. Now the consuming side can more safely assume a generic context cancellation can be propagated as a cancellation, and the server generated leader error is preserved and propagated normally without any special inference. When a client cancels the stream, there remains a race in the error handling code between the send and receive goroutines whereby the underlying gRPC error is lost in the case where the send path returns and is handled first, but this issue can be taken separately as no matter which paths wins, we can detect a generic cancellation. This is a replacement of etcd-io#11375. Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

immanuelfodor · 2020-12-12T18:31:06Z

This could solve some ongoing issues with Prometheus monitoring, will this PR be ever merged? 😃

arslanbekov mentioned this issue Nov 29, 2018

etcd alerting with critical severity on lost watch connection prometheus-operator/prometheus-operator#2046

Closed

arslanbekov mentioned this issue Dec 3, 2018

gRPC code Unavailable instead Canceled grpc/grpc-go#2475

Closed

hexfusion self-assigned this Dec 7, 2018

hexfusion added type/question stage/investigating labels Dec 7, 2018

hexfusion assigned spzala Jan 21, 2019

zot24 mentioned this issue Mar 19, 2019

EtcdHighNumberOfFailedGRPCRequests openshift/cluster-monitoring-operator#248

Closed

brancz added a commit to brancz/cluster-monitoring-operator that referenced this issue Apr 30, 2019

manifests: Remove etcd gRPC calls failed alerts

400f412

These alerts are firing constantly due to some issues around how to etcd increases its metrics. See etcd-io/etcd#10289

brancz mentioned this issue Jun 11, 2019

etcdHighNumberOfFailedGRPCRequests alarm not correct due NaN return value prometheus-operator/prometheus-operator#2636

Closed

paskal mentioned this issue Jun 12, 2019

removed checks that keep giving false positives. #10796

Closed

Reamer mentioned this issue Jun 21, 2019

etcd - All grpc_code for grpc_method "Watch" is "Unavailable" openshift/origin#20311

Open

servo1x mentioned this issue Jul 28, 2019

Documentation: Fix etcdHighNumberOfFailedGRPCRequests rules return NaN value #10629

Closed

rafal-szypulka mentioned this issue Aug 13, 2019

ICPetcdHighNumberOfFailedGRPCRequests triggers meaningless alert ibm-cloud-architecture/CSMO-ICP#8

Open

hexfusion mentioned this issue Nov 22, 2019

etcdserver: fix watch metrics #11375

Closed

howardjohn mentioned this issue Jan 2, 2020

Stop logging client disconnect as a pilot error istio/istio#19882

Merged

ironcladlou mentioned this issue Aug 3, 2020

etcdserver: fix incorrect metrics generated when clients cancel watches #12196

Merged

surajssd mentioned this issue Aug 25, 2020

Disable etcd alert about false positive cluster state kinvolk/lokomotive#850

Open

servo1x mentioned this issue Sep 26, 2020

[kube-prometheus-stack] Disable etcdHighNumberOfFailedGRPCRequests prometheus-community/helm-charts#144

Closed

axeal mentioned this issue Jan 19, 2021

Weird etcdHighNumberOfFailedGRPCRequests prometheus rule in v2.5 rancher-monitoring rancher/rancher#29939

Closed

gyuho closed this as completed in #12196 Feb 19, 2021

allenporter mentioned this issue Jul 9, 2021

Alert rule etcdHighNumberOfFailedGRPCRequests in Prometheus #13147

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gRPC code Unavailable instead Canceled #10289

gRPC code Unavailable instead Canceled #10289

arslanbekov commented Nov 29, 2018 •

edited

Loading

xiang90 commented Nov 29, 2018

arslanbekov commented Nov 29, 2018

xiang90 commented Nov 29, 2018

menghanl commented Dec 3, 2018

hexfusion commented Dec 7, 2018

arslanbekov commented Dec 7, 2018

hexfusion commented Dec 7, 2018

menghanl commented Jan 17, 2019

hexfusion commented Jan 17, 2019

hexfusion commented Jan 21, 2019

spzala commented Jan 21, 2019

spzala commented Jan 27, 2019

spzala commented Jan 30, 2019

menghanl commented Jan 30, 2019

spzala commented Jan 30, 2019 •

edited

Loading

jingyih commented Feb 11, 2019

hexfusion commented Mar 5, 2019

spzala commented Mar 5, 2019

hexfusion commented Apr 28, 2019

hexfusion commented Apr 29, 2019 •

edited

Loading

menghanl commented Apr 29, 2019

hexfusion commented Apr 30, 2019 •

edited

Loading

darwink commented Sep 4, 2019

siwyd commented Nov 13, 2019

hexfusion commented Nov 13, 2019

hexfusion commented Nov 19, 2019 •

edited

Loading

imunhatep commented Apr 14, 2020

immanuelfodor commented Dec 12, 2020

gRPC code Unavailable instead Canceled #10289

gRPC code Unavailable instead Canceled #10289

Comments

arslanbekov commented Nov 29, 2018 • edited Loading

xiang90 commented Nov 29, 2018

arslanbekov commented Nov 29, 2018

xiang90 commented Nov 29, 2018

menghanl commented Dec 3, 2018

hexfusion commented Dec 7, 2018

arslanbekov commented Dec 7, 2018

hexfusion commented Dec 7, 2018

menghanl commented Jan 17, 2019

hexfusion commented Jan 17, 2019

hexfusion commented Jan 21, 2019

spzala commented Jan 21, 2019

spzala commented Jan 27, 2019

spzala commented Jan 30, 2019

menghanl commented Jan 30, 2019

spzala commented Jan 30, 2019 • edited Loading

jingyih commented Feb 11, 2019

hexfusion commented Mar 5, 2019

spzala commented Mar 5, 2019

hexfusion commented Apr 28, 2019

hexfusion commented Apr 29, 2019 • edited Loading

menghanl commented Apr 29, 2019

hexfusion commented Apr 30, 2019 • edited Loading

darwink commented Sep 4, 2019

siwyd commented Nov 13, 2019

hexfusion commented Nov 13, 2019

hexfusion commented Nov 19, 2019 • edited Loading

imunhatep commented Apr 14, 2020

immanuelfodor commented Dec 12, 2020

arslanbekov commented Nov 29, 2018 •

edited

Loading

spzala commented Jan 30, 2019 •

edited

Loading

hexfusion commented Apr 29, 2019 •

edited

Loading

hexfusion commented Apr 30, 2019 •

edited

Loading

hexfusion commented Nov 19, 2019 •

edited

Loading