gRPC error in etcd v3.2.14 HA environment. #9166

hchenxa · 2018-01-18T07:07:15Z

I setup a etcd HA environment with etcd v3.2.14 and here is the etcd server startup parameters:

etcd --name=etcd1 --data-dir=/var/lib/etcd --initial-advertise-peer-urls=https://9.111.255.45:2380 --listen-peer-urls=https://0.0.0.0:2380 --listen-client-urls=https://0.0.0.0:4001 --advertise-client-urls=https://9.111.255.45:4001 --cert-file=/etc/cfc/conf/etcd/server.pem --key-file=/etc/cfc/conf/etcd/server-key.pem --client-cert-auth --peer-auto-tls --trusted-ca-file=/etc/cfc/conf/etcd/ca.pem --initial-cluster-token=etcd-cluster-1 --initial-cluster=etcd0=https://9.111.255.15:2380,etcd1=https://9.111.255.45:2380,etcd2=https://9.111.255.131:2380 --initial-cluster-state=new

when etcd started, I found below messages in etcd logs:

2018-01-18 06:50:59.058302 I | mvcc: finished scheduled compaction at 4375 (took 1.299701ms)
2018-01-18 06:54:21.121542 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 21; CANCEL")
2018-01-18 06:54:28.093291 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 19; CANCEL")
2018-01-18 06:54:52.302752 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 15; CANCEL")
2018-01-18 06:55:59.062545 I | mvcc: store.index: compact 4859

Does there have any impact about gRPC failed?

The text was updated successfully, but these errors were encountered:

gyuho · 2018-01-18T15:35:35Z

Seems like server failed to receive watch request on compaction. Was there any error in client side?

hchenxa · 2018-01-19T01:52:14Z

@gyuho , I did not see any error from kubernetes apiserver side.

And I was using kubernetes 1.8.3

xiang90 · 2018-01-19T01:58:08Z

@gyuho

the requests are canceled from the client. we should not log these as warning. i am not sure why they are logged though. i think @spzala put up a PR to get it fixed. maybe we missed something.

gyuho · 2018-01-19T02:00:15Z

Yeah it was only handling context cancel errors. Will try reproduce tomorrow.

gyuho · 2018-01-19T06:25:42Z

@hchenxa Can you reproduce outside of k8s?

I see similar errors with TLS

W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = client disconnected")

Which should be at debug level as well (http2.errClientDisconnected).

rpc error: code = Unavailable desc = stream error: stream ID 21; CANCEL

This is from gRPC closing http2 stream with http2.ErrCodeCancel which is expected.

Should we change all gRPC errors with code.Unavailable and code.Canceled to debug level?

D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")

spzala · 2018-01-19T15:04:53Z

@xiang90 yes I worked on a related PR #9105 @gyuho thanks, I am close to work on load test issue #9121 and hoping to have PR created this weekend. I have couple other issues to work on but if needed I can work on this issue next week. To me, moving expected gRPC error/warning messages to debug level makes sense to not to confuse users. Thanks!

motecshine · 2018-06-08T08:10:14Z

[root@irain217 kube-dashboard]# etcdctl -version etcdctl version: 3.2.11 API version: 2

Jun 08 15:55:14 irain217 etcd[26005]: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 67; CANCEL") Jun 08 15:55:14 irain217 etcd[26005]: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 81; CANCEL") Jun 08 15:58:46 irain217 etcd[26005]: store.index: compact 8636 Jun 08 15:58:46 irain217 etcd[26005]: finished scheduled compaction at 8636 (took 1.010574ms) Jun 08 16:02:23 irain217 etcd[26005]: transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:2379->127.0.0.1:49072: read: connection reset by peer Jun 08 16:03:46 irain217 etcd[26005]: store.index: compact 9087 Jun 08 16:03:46 irain217 etcd[26005]: finished scheduled compaction at 9087 (took 1.00037ms) Jun 08 16:03:47 irain217 etcd[26005]: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 85; CANCEL") Jun 08 16:03:47 irain217 etcd[26005]: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 97; CANCEL") Jun 08 16:06:01 irain217 etcd[26005]: failed to receive watch request from gRPC stream ("rpc error: code = Canceled desc = context canceled")

may be same problems ?

Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this: grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting. This patch improves the behavior by: 1. Performing a deeper analysis during stream closure to more conclusively determine that a leader has actually been lost before propagating a ErrGRPCNoLeader error. 2. Returning a ErrGRPCWatchCanceled error if no conclusion can be drawn regarding leader loss. There remains an assumption that absence of leader loss evidence represents a client cancellation, but in practice this seems less likely to break down whereas client cancellations are frequent and expected. This is a continuation of the work already done in etcd-io#11375. Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this: grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting. This patch improves the behavior by: 1. Performing a deeper analysis during stream closure to more conclusively determine that a leader has actually been lost before propagating a ErrGRPCNoLeader error. 2. Returning a ErrGRPCWatchCanceled error if no conclusion can be drawn regarding leader loss. There remains an assumption that absence of evidence of leader loss means a client cancelled, but in practice this seems less likely to break down whereas client cancellations are frequent and expected. This is a continuation of the work already done in etcd-io#11375. Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this: grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting. The commit 9c103dd introduced an interceptor which wraps watch streams requiring a leader, causing those streams to be actively canceled when leader loss is detected. However, the error handling code assumes all stream context cancellations are from the interceptor. This assumption is broken when the context was canceled because of a client stream cancelation. The core challenge is lack of information conveyed via `context.Context` which is shared by both the send and receive sides of the stream handling and is subject to cancellation by all paths (including the gRPC library itself). If any piece of the system cancels the shared context, there's no way for a context consumer to understand who cancelled the context or why. To solve the ambiguity of the stream interceptor code specifically, this patch introduces a custom context struct which the interceptor uses to expose a custom error through the context when the interceptor decides to actively cancel a stream. Now the consuming side can more safely assume a generic context cancellation can be propagated as a cancellation, and the server generated leader error is preserved and propagated normally without any special inference. When a client cancels the stream, there remains a race in the error handling code between the send and receive goroutines whereby the underlying gRPC error is lost in the case where the send path returns and is handled first, but this issue can be taken separately as no matter which paths wins, we can detect a generic cancellation. This is a replacement of etcd-io#11375. Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

johnxiaohe · 2021-12-11T04:42:35Z

This problem is caused by the unsuccessful connection between the client and the server. There is a problem with the current state of the server. For example, the number of connections is full

gyuho mentioned this issue Jan 19, 2018

etcdserver/api/v3rpc: debug-log client disconnect on TLS, http/2 stream CANCEL #9178

Merged

gyuho closed this as completed in #9178 Jan 19, 2018

brancz mentioned this issue Apr 3, 2018

Add etcd metrics, Prometheus scrapes, and Grafana dash poseidon/typhoon#175

Merged

arslanbekov mentioned this issue Nov 29, 2018

gRPC code Unavailable instead Canceled #10289

Closed

hexfusion mentioned this issue Nov 23, 2019

etcdserver: fix watch metrics #11375

Closed

ringtail mentioned this issue Jan 19, 2020

部署kube-eventer后，etcd报错增多 AliyunContainerService/kube-eventer#55

Closed

ironcladlou mentioned this issue Aug 3, 2020

etcdserver: fix incorrect metrics generated when clients cancel watches #12196

Merged

ksa-real mentioned this issue Jul 1, 2021

Alert rule etcdHighNumberOfFailedGRPCRequests in Prometheus #13147

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gRPC error in etcd v3.2.14 HA environment. #9166

gRPC error in etcd v3.2.14 HA environment. #9166

hchenxa commented Jan 18, 2018

gyuho commented Jan 18, 2018

hchenxa commented Jan 19, 2018 •

edited

Loading

xiang90 commented Jan 19, 2018

gyuho commented Jan 19, 2018

gyuho commented Jan 19, 2018 •

edited

Loading

spzala commented Jan 19, 2018

motecshine commented Jun 8, 2018

johnxiaohe commented Dec 11, 2021

gRPC error in etcd v3.2.14 HA environment. #9166

gRPC error in etcd v3.2.14 HA environment. #9166

Comments

hchenxa commented Jan 18, 2018

gyuho commented Jan 18, 2018

hchenxa commented Jan 19, 2018 • edited Loading

xiang90 commented Jan 19, 2018

gyuho commented Jan 19, 2018

gyuho commented Jan 19, 2018 • edited Loading

spzala commented Jan 19, 2018

motecshine commented Jun 8, 2018

johnxiaohe commented Dec 11, 2021

hchenxa commented Jan 19, 2018 •

edited

Loading

gyuho commented Jan 19, 2018 •

edited

Loading