Linkerd closing connections of long-running gRPC requests #12964

eli-nomasec · 2024-08-19T12:26:21Z

What is the issue?

We're using Temporal.io with Linkerd.
Temporal worker uses long-polling. It makes a poll call that blocks for 60 seconds.
If no task is received during this time, the call returns an empty result, and a new call is created immediately.

It seems that Linkerd is closing the connection before Temporal does, and when Temporal tries to close the connection, it throws the exception mentioned above.

It seems most requests are capped at 50 seconds (see attached image below).

How can it be reproduced?

This can probably be reproduced by sending a gRPC request, and having the server wait for a minute - then respond.
If not - install Temporal and inject the proxies.
You don't even have to implement any workloads with the SDK, Temporal's own workloads will get the error (Worker->Frontend).

Logs, error output, etc

Another user who had the same issue posted these logs (they run Java):

io.grpc.StatusRuntimeException: INTERNAL: unexpected error
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[grpc-stub-1.54.1.jar!/:1.54.1]
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[grpc-stub-1.54.1.jar!/:1.54.1]
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[grpc-stub-1.54.1.jar!/:1.54.1]
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollWorkflowTaskQueue(WorkflowServiceGrpc.java:3910) ~[temporal-serviceclient-1.21.1.jar!/:na]
	at io.temporal.internal.worker.WorkflowPollTask.doPoll(WorkflowPollTask.java:155) ~[temporal-sdk-1.21.1.jar!/:na]
	at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:137) ~[temporal-sdk-1.21.1.jar!/:na]
	at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:46) ~[temporal-sdk-1.21.1.jar!/:na]
	at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:298) ~[temporal-sdk-1.21.1.jar!/:na]
	at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:258) ~[temporal-sdk-1.21.1.jar!/:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:840) ~[na:na]
[11:29](https://linkerd.slack.com/archives/C89RTCWJF/p1723451398491159?thread_ts=1723451335.174499&cid=C89RTCWJF)
We are on version 2.14.7, but this issue persists on the latest edge versions as well.
[11:34](https://linkerd.slack.com/archives/C89RTCWJF/p1723451671778799?thread_ts=1723451335.174499&cid=C89RTCWJF)
[  1081.455684s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_stack_tracing: service request=Request { method: POST, uri: http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue, version: HTTP/2.0, headers: {"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}, body: BoxBody }
[  1081.455695s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Grpc(Codes({2, 4, 7, 13, 14, 15})), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f71b37da900, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 10000 }, bound: 10000 }, rx_waker: AtomicWaker, tx_count: 4, rx_fields: "..." } } } })
[  1081.455709s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: method=POST uri=http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue version=HTTP/2.0
[  1081.455714s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: headers={"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}
[  1085.005443s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
[  1085.005508s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_detect: Connection completed
[  1085.005519s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}: linkerd_app_core::serve: Connection closed
[  1085.005530s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_stack_tracing: drop
[  1121.833118s] TRACE ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_proxy_balance::discover::from_resolve: poll
[  1141.459115s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error

Proxy logs:

[  1081.455684s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_stack_tracing: service request=Request { method: POST, uri: http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue, version: HTTP/2.0, headers: {"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}, body: BoxBody }
[  1081.455695s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Grpc(Codes({2, 4, 7, 13, 14, 15})), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f71b37da900, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 10000 }, bound: 10000 }, rx_waker: AtomicWaker, tx_count: 4, rx_fields: "..." } } } })
[  1081.455709s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: method=POST uri=http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue version=HTTP/2.0
[  1081.455714s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: headers={"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}
[  1085.005443s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
[  1085.005508s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_detect: Connection completed
[  1085.005519s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}: linkerd_app_core::serve: Connection closed
[  1085.005530s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_stack_tracing: drop
[  1121.833118s] TRACE ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_proxy_balance::discover::from_resolve: poll
[  1141.459115s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error

output of `linkerd check -o short`

Status check results are √

Environment

Kubernetes Version: Server Version: v1.28.11-eks-db838b0
Cluster Environment: EKS
Host OS: Mac M2
Linkerd version:
Client version: edge-24.8.2
Server version: edge-24.8.2

Possible solution

It's possible we have a timeout or some setting that causes Linkerd to disrupt long-running gRPC connections.
We tried increasing the overall timeouts (on the proxy and also specifically with GRPCRoutes) - with no success.

Additional context

Relevant Slack Conversations:
https://linkerd.slack.com/archives/C89RTCWJF/p1723451335174499
https://linkerd.slack.com/archives/C89RTCWJF/p1723658831519119
https://linkerd.slack.com/archives/C89RTCWJF/p1723659323180089
https://linkerd.slack.com/archives/C89RTCWJF/p1724061511236179

My email / Slack user: [email protected]
Another user experiencing this (Slack): @prajith

Would you like to work on fixing this bug?

None

The text was updated successfully, but these errors were encountered:

olix0r · 2024-08-21T15:31:34Z

Linkerd doesn't have any default request timeouts. The log messages you include do not indicate any sort of timeout behavior.

[  1081.455684s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_stack_tracing: service request=Request { method: POST, uri: http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue, version: HTTP/2.0, headers: {"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}, body: BoxBody }
[  1081.455695s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Grpc(Codes({2, 4, 7, 13, 14, 15})), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f71b37da900, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 10000 }, bound: 10000 }, rx_waker: AtomicWaker, tx_count: 4, rx_fields: "..." } } } })
[  1081.455709s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: method=POST uri=http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue version=HTTP/2.0
[  1081.455714s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: headers={"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}

Four seconds after the request is dispatched, the caller closes the connection:

[  1085.005443s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
[  1085.005508s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_detect: Connection completed
[  1085.005519s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}: linkerd_app_core::serve: Connection closed
[  1085.005530s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_stack_tracing: drop

Later, there is a log indicating that a response includes the l5d-proxy-error header, indicating that an error was handled in the server proxy:

[  1141.459115s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error

You may want to look at the upstream proxy's logs to understand what error is being encountered.

You may also want to use the debug container to understand the TCP flows on both the client and server pods.

olix0r · 2024-08-21T17:35:09Z

I'm unable to reproduce this with an example application that sleeps for 60s. Clients successfully observe responses. And in the screenshot, you include, we see other endpoints with response times higher than 50s. I'm quite confident that there is no timeout being enforced by Linkerd.

It's possible that idle connections are being lost by the conntrack table. You may want to configure the server to enable HTTP/2 keepalives so that the server actively communicates on connections with sparse request-response traffic. And, as I mentioned, you probably want to understand what error is actually arising that is causing an INTERNAL grpc-status to be set.

I'm going to close this out since there is nothing actionable for us here, but please let us know if you are able to identify a problem with Linkerd.

prajithp13 · 2024-10-11T03:13:43Z

@olix0r I just enabled http2 debug and this is what I can see.

2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_HEADERS stream=39 len=26
2024/10/11 03:06:45 http2: decoded hpack field header field ":status" = "200"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-type" = "application/grpc"
2024/10/11 03:06:45 http2: decoded hpack field header field "date" = "Fri, 11 Oct 2024 03:06:44 GMT"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read DATA stream=39 len=5 data="\x00\x00\x00\x00\x00"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_STREAM|END_HEADERS stream=39 len=2
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "0"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = ""
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote WINDOW_UPDATE len=4 (conn) incr=5
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote HEADERS flags=END_HEADERS stream=47 len=11
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read PING flags=ACK len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote DATA flags=END_STREAM stream=47 len=138 data="\x00\x00\x00\x00\x85\n\adefault\x12\x0f\n\vhello-world\x10\x01\x1a87313@vendor-service-sandbox-vendor-comm-workflow-worker@\"\v\n\t\t\x00\x00\x00\x00\x00j\xf8@*\"\n 17f8a4c649c37ff7ffc26b6c67cc30df"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_STREAM|END_HEADERS stream=47 len=54
2024/10/11 03:06:45 http2: decoded hpack field header field ":status" = "200"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-length" = "0"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-type" = "application/grpc"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "13"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = "unexpected error"
2024/10/11 03:06:45 http2: decoded hpack field header field "l5d-proxy-error" = "unexpected error"
2024/10/11 03:06:45 http2: decoded hpack field header field "date" = "Fri, 11 Oct 2024 03:06:44 GMT"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_HEADERS stream=41 len=3
2024/10/11 03:06:45 http2: decoded hpack field header field ":status" = "200"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-type" = "application/grpc"
2024/10/11 03:06:45 http2: decoded hpack field header field "date" = "Fri, 11 Oct 2024 03:06:44 GMT"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read DATA stream=41 len=5 data="\x00\x00\x00\x00\x00"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_STREAM|END_HEADERS stream=41 len=2
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "0"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = ""
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote WINDOW_UPDATE len=4 (conn) incr=5
time=2024-10-11T03:06:45.632Z level=WARN source=/home/prajithp/dev/go/pkg/mod/go.temporal.io/[email protected]/internal/internal_worker_base.go:458 msg="Failed to poll for task." Namespace=default TaskQueue=hello-world WorkerID=7313@workflow-worker@ WorkerType=ActivityWorker Error="unexpected error"

olix0r · 2024-10-11T03:42:20Z

@prajithp13 Unfortunately this doesn't really get us any closer to a repro, though it does indicate that your application is seeing an error that originates in linkerd:

2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "13"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = "unexpected error"
2024/10/11 03:06:45 http2: decoded hpack field header field "l5d-proxy-error" = "unexpected error"

Do you have the accompanying proxy logs? I would expect the proxy to log a warning when emitting an unexpected error. Note that it's possible that this error originates in the server's proxy pod. It may also be appropriate to configure the relevant workloads with a pod annotation like config.linkerd.io/proxy-log-level: linkerd=debug,h2=trace,info.

Is this issue reliably reproducible? If so, it might be most expedient to try to reduce a minimal reproduction that you can share with us so we can debug this hands-on.

prajithp13 · 2024-10-11T04:14:37Z

I can only see following logs in the proxy.

[   429.553649s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll:FramedRead::poll_next:FramedRead::decode_frame{offset=13}: h2::codec::framed_read: decoding frame from 13B
[   429.553653s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll:FramedRead::poll_next:FramedRead::decode_frame{offset=13}: h2::codec::framed_read: frame.kind=Reset
[   429.553658s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll:FramedRead::poll_next: h2::codec::framed_read: received frame=Reset { stream_id: StreamId(7), error_code: CANCEL }
[   429.553667s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll: h2::proto::connection: recv RST_STREAM frame=Reset { stream_id: StreamId(7), error_code: CANCEL }
[   429.553672s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll: h2::proto::streams::state: recv_reset; frame=Reset { stream_id: StreamId(7), error_code: CANCEL }; state=HalfClosedRemote(AwaitingHeaders); queued=false
[   429.553680s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll: h2::proto::streams::counts: transition_after; stream=StreamId(7); state=State { inner: Closed(Error(Reset(StreamId(7), CANCEL, Remote))) }; is_closed=true; pending_send_empty=true; buffered_send_data=0; num_recv=4; num_send=0

I can set up a sample application along with steps to reproduce the error. I'll share a sample repository here shortly.

prajithp13 · 2024-10-11T04:45:14Z

@olix0r Here is the deployment YAML for both the Temporal server and worker: https://raw.githubusercontent.com/Prajithp/hello-temporal-go/refs/heads/main/deployment.yaml. You can apply this and, after waiting for at least 20 minutes, you should see the following error message in the worker pod logs: msg="Failed to poll for task."

Let me know if this works for you!

eli-nomasec · 2024-10-12T08:49:07Z

I'm also working on extracting the logic between the Temporal Worker and Frontend to a simple Go app that reproduces this, will share it in a couple of days

prajithp13 · 2024-10-13T02:43:00Z

@olix0r did you get a chance to look into it?

olix0r · 2024-10-13T03:23:09Z

Thanks for sharing the repro. I haven’t had a chance to look into it yet, but I’ll update the issue once I have more information. Oliver Gould < ***@***.*** >

…

On Sat, Oct 12 2024 at 19:43, Prajith < ***@***.*** > wrote: @olix0r ( https://github.com/olix0r ) did you get a chance to look into it? — Reply to this email directly, view it on GitHub ( #12964 (comment) ) , or unsubscribe ( https://github.com/notifications/unsubscribe-auth/AAB2YYTZDIVZW5CZTPODFEDZ3HM4TAVCNFSM6AAAAABMXXDB6CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMBYG44TGNJWHE ). You are receiving this because you were mentioned. Message ID: <linkerd/linkerd2/issues/12964/2408793569 @ github. com>

olix0r · 2024-10-14T20:20:53Z

While I'm still plumbing through some of the internals, I noticed something pretty glaring: these errors reliably manifest around the 5 minute mark. I did a quick search through the temporal codebase and found this setting:

	KeepAliveMaxConnectionAge = NewGlobalDurationSetting(
		"frontend.keepAliveMaxConnectionAge",
		5*time.Minute,
		`KeepAliveMaxConnectionAge is a duration for the maximum amount of time a
connection may exist before it will be closed by sending a GoAway. A
random jitter of +/-10% will be added to MaxConnectionAge to spread out
connection storms.`,
	)

I think there's some sort of race condition being encountered that causes us to dispatch requests on connections that have already been terminated by the server, causing the error you see propagated back to the worker.

This probably will help us narrow down a repro independent of temporal. In the meantime, you may see improvements by increasing the max connection age.

olix0r · 2024-10-15T14:30:06Z

You may also benefit from configuring retries on a GRPCRoute so that clients are able to retry these failures.

eli-nomasec added the bug label Aug 19, 2024

olix0r added support and removed bug labels Aug 21, 2024

olix0r closed this as completed Aug 21, 2024

github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2024

linkerd unlocked this conversation Oct 10, 2024

olix0r reopened this Oct 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Linkerd closing connections of long-running gRPC requests #12964

Linkerd closing connections of long-running gRPC requests #12964

eli-nomasec commented Aug 19, 2024

olix0r commented Aug 21, 2024 •

edited

Loading

olix0r commented Aug 21, 2024

prajithp13 commented Oct 11, 2024

olix0r commented Oct 11, 2024 •

edited

Loading

prajithp13 commented Oct 11, 2024 •

edited

Loading

prajithp13 commented Oct 11, 2024

eli-nomasec commented Oct 12, 2024

prajithp13 commented Oct 13, 2024

olix0r commented Oct 13, 2024 via email

olix0r commented Oct 14, 2024 •

edited

Loading

olix0r commented Oct 15, 2024

Linkerd closing connections of long-running gRPC requests #12964

Linkerd closing connections of long-running gRPC requests #12964

Comments

eli-nomasec commented Aug 19, 2024

What is the issue?

How can it be reproduced?

Logs, error output, etc

output of linkerd check -o short

Environment

Possible solution

Additional context

Would you like to work on fixing this bug?

olix0r commented Aug 21, 2024 • edited Loading

olix0r commented Aug 21, 2024

prajithp13 commented Oct 11, 2024

olix0r commented Oct 11, 2024 • edited Loading

prajithp13 commented Oct 11, 2024 • edited Loading

prajithp13 commented Oct 11, 2024

eli-nomasec commented Oct 12, 2024

prajithp13 commented Oct 13, 2024

olix0r commented Oct 13, 2024 via email

olix0r commented Oct 14, 2024 • edited Loading

olix0r commented Oct 15, 2024

output of `linkerd check -o short`

olix0r commented Aug 21, 2024 •

edited

Loading

olix0r commented Oct 11, 2024 •

edited

Loading

prajithp13 commented Oct 11, 2024 •

edited

Loading

olix0r commented Oct 14, 2024 •

edited

Loading