Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: fix incorrect metrics generated when clients cancel watches #12196

Merged
merged 1 commit into from
Feb 19, 2021

Commits on Nov 18, 2020

  1. etcdserver: fix incorrect metrics generated when clients cancel watches

    Before this patch, a client which cancels the context for a watch results in the
    server generating a `rpctypes.ErrGRPCNoLeader` error that leads the recording of
    a gRPC `Unavailable` metric in association with the client watch cancellation.
    The metric looks like this:
    
        grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}
    
    So, the watch server has misidentified the error as a server error and then
    propagates the mistake to metrics, leading to a false indicator that the leader
    has been lost. This false signal then leads to false alerting.
    
    The commit 9c103dd introduced an interceptor which wraps
    watch streams requiring a leader, causing those streams to be actively canceled
    when leader loss is detected.
    
    However, the error handling code assumes all stream context cancellations are
    from the interceptor. This assumption is broken when the context was canceled
    because of a client stream cancelation.
    
    The core challenge is lack of information conveyed via `context.Context` which
    is shared by both the send and receive sides of the stream handling and is
    subject to cancellation by all paths (including the gRPC library itself). If any
    piece of the system cancels the shared context, there's no way for a context
    consumer to understand who cancelled the context or why.
    
    To solve the ambiguity of the stream interceptor code specifically, this patch
    introduces a custom context struct which the interceptor uses to expose a custom
    error through the context when the interceptor decides to actively cancel a
    stream. Now the consuming side can more safely assume a generic context
    cancellation can be propagated as a cancellation, and the server generated
    leader error is preserved and propagated normally without any special inference.
    
    When a client cancels the stream, there remains a race in the error handling
    code between the send and receive goroutines whereby the underlying gRPC error
    is lost in the case where the send path returns and is handled first, but this
    issue can be taken separately as no matter which paths wins, we can detect a
    generic cancellation.
    
    This is a replacement of etcd-io#11375.
    
    Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
    ironcladlou committed Nov 18, 2020
    Configuration menu
    Copy the full SHA
    9571325 View commit details
    Browse the repository at this point in the history