etcd:etcdserverpb.Watch Unavailable #9576

MichaelGou1105 · 2018-04-17T06:07:38Z

i start my etcd , i find some msg from metric:

grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 2

but,when i restart etcd , this msg never show again.

The text was updated successfully, but these errors were encountered:

gyuho · 2018-04-17T21:28:27Z

but,when i restart etcd

You should set up a separate Prometheus node_exporter if you want to persist previous metrics data. Metrics served via etcd does not get stored on disk. So on reboot, etcd starts over collecting metrics.

Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this: grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting. This patch improves the behavior by: 1. Performing a deeper analysis during stream closure to more conclusively determine that a leader has actually been lost before propagating a ErrGRPCNoLeader error. 2. Returning a ErrGRPCWatchCanceled error if no conclusion can be drawn regarding leader loss. There remains an assumption that absence of leader loss evidence represents a client cancellation, but in practice this seems less likely to break down whereas client cancellations are frequent and expected. This is a continuation of the work already done in etcd-io#11375. Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this: grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting. This patch improves the behavior by: 1. Performing a deeper analysis during stream closure to more conclusively determine that a leader has actually been lost before propagating a ErrGRPCNoLeader error. 2. Returning a ErrGRPCWatchCanceled error if no conclusion can be drawn regarding leader loss. There remains an assumption that absence of evidence of leader loss means a client cancelled, but in practice this seems less likely to break down whereas client cancellations are frequent and expected. This is a continuation of the work already done in etcd-io#11375. Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of a gRPC `Unavailable` metric in association with the client watch cancellation. The metric looks like this: grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} So, the watch server has misidentified the error as a server error and then propagates the mistake to metrics, leading to a false indicator that the leader has been lost. This false signal then leads to false alerting. The commit 9c103dd introduced an interceptor which wraps watch streams requiring a leader, causing those streams to be actively canceled when leader loss is detected. However, the error handling code assumes all stream context cancellations are from the interceptor. This assumption is broken when the context was canceled because of a client stream cancelation. The core challenge is lack of information conveyed via `context.Context` which is shared by both the send and receive sides of the stream handling and is subject to cancellation by all paths (including the gRPC library itself). If any piece of the system cancels the shared context, there's no way for a context consumer to understand who cancelled the context or why. To solve the ambiguity of the stream interceptor code specifically, this patch introduces a custom context struct which the interceptor uses to expose a custom error through the context when the interceptor decides to actively cancel a stream. Now the consuming side can more safely assume a generic context cancellation can be propagated as a cancellation, and the server generated leader error is preserved and propagated normally without any special inference. When a client cancels the stream, there remains a race in the error handling code between the send and receive goroutines whereby the underlying gRPC error is lost in the case where the send path returns and is handled first, but this issue can be taken separately as no matter which paths wins, we can detect a generic cancellation. This is a replacement of etcd-io#11375. Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166

gyuho closed this as completed Apr 17, 2018

arslanbekov mentioned this issue Nov 29, 2018

gRPC code Unavailable instead Canceled #10289

Closed

hexfusion mentioned this issue Nov 23, 2019

etcdserver: fix watch metrics #11375

Closed

ironcladlou mentioned this issue Aug 3, 2020

etcdserver: fix incorrect metrics generated when clients cancel watches #12196

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcd:etcdserverpb.Watch Unavailable #9576

etcd:etcdserverpb.Watch Unavailable #9576

MichaelGou1105 commented Apr 17, 2018

gyuho commented Apr 17, 2018

etcd:etcdserverpb.Watch Unavailable #9576

etcd:etcdserverpb.Watch Unavailable #9576

Comments

MichaelGou1105 commented Apr 17, 2018

gyuho commented Apr 17, 2018