-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
keepalive: apply minimum ping time of 10s to client and 1s to server #2642
Conversation
@dfawley Could you explain why it's necessary to disallow any lower ping than 10s/1s? We've been using a keep alive interval of 100ms (both |
Thanks for the quick response. To clear up one thing, we use 100ms as the client keep alive. This is to very quickly detect a broken server process so we can failover to another server. It makes sense to follow the spec better, but in this case it would be really nice if the minimum keepalive could be configured by the library user as well. Especially since lower keep alives were possible before. And also because the server already has the minimum time enforcement policy. |
@JelteF We could potentially allow a lower limit to be configured, given that we had unintentional support for this in the past. If this is a valid use case, however, we should think about how we would support similar scenarios in Java and C, which currently also restrict clients to 10s keepalive ping intervals. We have another client-side health checking mechanism (gRFC), but that will only quickly detect a problem where the server is still alive but reports itself as unhealthy, not in the case where the server becomes unavailable. cc @ejona86 who designed client-side keepalives. He had concerns about this being used to detect network outages (where 100ms is probably too low to be reliable), but it sounds like you are mostly concerned about the server itself becoming wedged. |
Our servers have a minimum time enforcement policy, but not all do. The worry is that people really like setting "really low" values without understanding the network/load implications. For example, if the server is overloaded, you don't want the client to disconnect and reconnect. It is a dangerous tool to wield at 100ms, as it can amplify problems. |
This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this upgrade is to eliminate the disconnections caused by grpc/grpc-go#1882. These failures manifest themselves as the following set of errors: ``` ajwerner-test-0001> I190722 22:15:01.203008 12054 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE ``` Which then lead to tripped breakers and general badness. I suspect that there are several other good bug fixes in here, including some purported leaks and correctness fixes on shutdown. I have verified that with this upgrade I no longer see connections break in overload scenarios which reliably reproduced the situation in the above log. This commit removes one condition from grpcutil.IsClosedConnection which should be subsumed by the status check above. The `transport` subpackage has not been around for many releases. This does not upgrade to the current release 1.22.0 because the maintainer mentions that it contains a bug (grpc/grpc-go#2663 (comment)). This change also unfortunately updates the keepalive behavior to be more spec compliant (grpc/grpc-go#2642). This change mandates a minimum ping time of 10s to the client. Given grpc/grpc-go#2638 this means that the rpc test for keepalives now takes over 20s. I would be okay skipping it but leave that discussion for review. Also updated the acceptance test to look out for an HTTP/2.0 header because grpc now does not send RPCs until after the HTTP handshake has completed (see grpc/grpc-go#2406). Release note (bug fix): Upgrade grpc library to fix connection state management bug.
Also fix tests that relied upon setting lower values.
This is in preparation for the fix to issue #2638