In your environment, is the client running meshed with a sidecar proxy? Or is only the server process meshed?
We've adjusted the default timeouts, but timeout logs are still appearing.
This timeout was introduced to help defend against situations where the underlying networking stack doesn't properly detect a closed or lost connection. The proxy's servers send PING frames to clients and expect them (as required by the HTTP/2 spec) to respond; this timeout indicates that a timely PING acknowledgment was not received from the client.
If you've increased the timeout and continue to see these errors, this may be surfacing a problem that was previously hidden.
It would be helpful to trace this behavior back to the clients to figure out why PINGs aren't being acknowledged in a timely fashion.
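For the Grpc.Net.Client-based client described in this issue, one place to check when tracing unacknowledged PINGs is the channel's own HTTP/2 keep-alive settings. The sketch below is not taken from this issue; the address and the specific intervals are illustrative assumptions. It shows how client-side keep-alive pings can be enabled on a Grpc.Net.Client channel via SocketsHttpHandler:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using Grpc.Net.Client;

// Hypothetical example: enable client-side HTTP/2 keep-alive pings so dead or
// idle connections are detected by the client rather than only by the proxy's
// keep-alive timeout. The address and intervals are placeholders, not values
// confirmed by this issue.
var handler = new SocketsHttpHandler
{
    KeepAlivePingDelay = TimeSpan.FromSeconds(30),    // how often the client sends a PING
    KeepAlivePingTimeout = TimeSpan.FromSeconds(10),  // how long it waits for the PING ack
    KeepAlivePingPolicy = HttpKeepAlivePingPolicy.WithActiveRequests,
    PooledConnectionIdleTimeout = Timeout.InfiniteTimeSpan,
    EnableMultipleHttp2Connections = true
};

var channel = GrpcChannel.ForAddress("http://internal-service:8081", new GrpcChannelOptions
{
    HttpHandler = handler
});
```

This does not change how the proxy times out unresponsive peers; it only makes the client participate in keep-alive so broken connections are noticed and replaced sooner.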
What is the issue?
After upgrading from Linkerd stable 2.13 to BEL 2.16, we are experiencing an increase in gRPC errors caused by dropped connections, which is leading to data loss. Our current setup implements gRPC retries on the client side (a C# gRPC client generated by https://github.com/protobuf-net/protobuf-net.Grpc), but only for specific errors.
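The issue doesn't show the client configuration, but a retry policy limited to specific status codes typically looks like the following sketch (Grpc.Net.Client.Configuration API; the address, backoff values, and chosen status codes are illustrative assumptions, not the reporter's settings). A policy like this would not retry errors surfaced as StatusCode.Internal, such as the "Incomplete message." failures shown in the logs below, which is consistent with the reported data loss:

```csharp
using System;
using Grpc.Core;
using Grpc.Net.Client;
using Grpc.Net.Client.Configuration;

// Hypothetical retry policy that only retries a narrow set of status codes.
// A connection dropped mid-stream surfaces as StatusCode.Internal and is not
// retried by this configuration.
var defaultMethodConfig = new MethodConfig
{
    Names = { MethodName.Default },
    RetryPolicy = new RetryPolicy
    {
        MaxAttempts = 4,
        InitialBackoff = TimeSpan.FromMilliseconds(200),
        MaxBackoff = TimeSpan.FromSeconds(2),
        BackoffMultiplier = 2,
        RetryableStatusCodes = { StatusCode.Unavailable, StatusCode.DeadlineExceeded }
    }
};

var channel = GrpcChannel.ForAddress("http://internal-service:8081", new GrpcChannelOptions
{
    ServiceConfig = new ServiceConfig { MethodConfigs = { defaultMethodConfig } }
});
// protobuf-net.Grpc clients created from this channel (e.g. via
// channel.CreateGrpcService<IMyService>(), with IMyService a hypothetical
// service contract) should inherit this retry behavior, since retries are
// applied at the channel level.
```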
How can it be reproduced?
Install the latest BEL edition using the Helm chart with high availability (HA) values.
Internal client - protobuf-net.Grpc v1.1.1 (Grpc.Net.Client v2.62.0)
Logs, error output, etc
gRPC-related errors in the internal service:
Grpc.Core.RpcException: Status(StatusCode="Internal", Detail="Incomplete message.")
Linkerd proxy errors:
```
[ 94676.618554s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=10.0.139.170:47318 server.addr=10.0.115.25:8081
Nov 4, 2024 @ 17:45:34.188 | [ 9258.779776s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=10.0.130.128:49554 server.addr=172.20.45.82:8081
Nov 4, 2024 @ 17:39:06.554 | [ 94278.350425s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=10.0.106.32:36818 server.addr=10.0.134.88:8081
Nov 4, 2024 @ 17:38:37.660 | [ 94249.456302s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=10.0.129.202:49608 server.addr=10.0.134.88:8081
Nov 4, 2024 @ 17:38:16.548 | [ 94228.345056s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=10.0.139.170:45848 server.addr=10.0.134.88:8081
```
Output of linkerd check -o short
Environment
AWS EKS 1.28 (server version v1.28.13-eks-a737599)
Linkerd BEL 2.16
Possible solution
None.
Additional context
We've adjusted the default timeouts, but timeout logs are still appearing.
Related PR: #12498
Would you like to work on fixing this bug?
None