Linkerd closing connections of long-running gRPC requests #12964

Open
eli-nomasec opened this issue Aug 19, 2024 · 11 comments

What is the issue?

We're using Temporal.io with Linkerd.
Temporal worker uses long-polling. It makes a poll call that blocks for 60 seconds.
If no task is received during this time, the call returns an empty result, and a new call is created immediately.

It seems that Linkerd is closing the connection before Temporal does, and when Temporal tries to close the connection, it throws the exception shown in the logs below.

It seems most requests are capped at 50 seconds (see attached image below).

[screenshot: request duration graph showing most poll requests capped at about 50 seconds]

How can it be reproduced?

This can probably be reproduced by sending a gRPC request, and having the server wait for a minute - then respond.
If not - install Temporal and inject the proxies.
You don't even have to implement any workloads with the SDK, Temporal's own workloads will get the error (Worker->Frontend).
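A minimal sketch of such a server, for anyone who wants to reproduce this without Temporal (this is not our actual setup; it reuses the standard gRPC health service so no custom proto is needed, and the port is arbitrary):

package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// slowHealth answers health checks only after a 60s delay, emulating a long poll.
type slowHealth struct {
	healthpb.UnimplementedHealthServer
}

func (s *slowHealth) Check(ctx context.Context, _ *healthpb.HealthCheckRequest) (*healthpb.HealthCheckResponse, error) {
	select {
	case <-time.After(60 * time.Second):
		return &healthpb.HealthCheckResponse{Status: healthpb.HealthCheckResponse_SERVING}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func main() {
	lis, err := net.Listen("tcp", ":7233")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	healthpb.RegisterHealthServer(srv, &slowHealth{})
	log.Fatal(srv.Serve(lis))
}

A meshed client calling Health/Check with a deadline of around 70 seconds (similar to the grpc-timeout Temporal's pollers set, visible in the proxy logs below) should exercise the same pattern.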

Logs, error output, etc

Another user who had the same issue posted these logs (they run Java):

io.grpc.StatusRuntimeException: INTERNAL: unexpected error
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[grpc-stub-1.54.1.jar!/:1.54.1]
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[grpc-stub-1.54.1.jar!/:1.54.1]
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[grpc-stub-1.54.1.jar!/:1.54.1]
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollWorkflowTaskQueue(WorkflowServiceGrpc.java:3910) ~[temporal-serviceclient-1.21.1.jar!/:na]
	at io.temporal.internal.worker.WorkflowPollTask.doPoll(WorkflowPollTask.java:155) ~[temporal-sdk-1.21.1.jar!/:na]
	at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:137) ~[temporal-sdk-1.21.1.jar!/:na]
	at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:46) ~[temporal-sdk-1.21.1.jar!/:na]
	at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:298) ~[temporal-sdk-1.21.1.jar!/:na]
	at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:258) ~[temporal-sdk-1.21.1.jar!/:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:840) ~[na:na]
We are on version 2.14.7, but this issue persists on the latest edge versions as well.

Proxy logs:

[  1081.455684s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_stack_tracing: service request=Request { method: POST, uri: http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue, version: HTTP/2.0, headers: {"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}, body: BoxBody }
[  1081.455695s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Grpc(Codes({2, 4, 7, 13, 14, 15})), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f71b37da900, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 10000 }, bound: 10000 }, rx_waker: AtomicWaker, tx_count: 4, rx_fields: "..." } } } })
[  1081.455709s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: method=POST uri=http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue version=HTTP/2.0
[  1081.455714s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: headers={"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}
[  1085.005443s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
[  1085.005508s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_detect: Connection completed
[  1085.005519s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}: linkerd_app_core::serve: Connection closed
[  1085.005530s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_stack_tracing: drop
[  1121.833118s] TRACE ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_proxy_balance::discover::from_resolve: poll
[  1141.459115s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error

output of linkerd check -o short

Status check results are √

Environment

  • Kubernetes Version: Server Version: v1.28.11-eks-db838b0
  • Cluster Environment: EKS
  • Host OS: Mac M2
  • Linkerd version:
    Client version: edge-24.8.2
    Server version: edge-24.8.2

Possible solution

It's possible we have a timeout or some setting that causes Linkerd to disrupt long-running gRPC connections.
We tried increasing the overall timeouts (on the proxy and also specifically with GRPCRoutes), with no success.

Additional context

Relevant Slack Conversations:
https://linkerd.slack.com/archives/C89RTCWJF/p1723451335174499
https://linkerd.slack.com/archives/C89RTCWJF/p1723658831519119
https://linkerd.slack.com/archives/C89RTCWJF/p1723659323180089
https://linkerd.slack.com/archives/C89RTCWJF/p1724061511236179

My email / Slack user: [email protected]
Another user experiencing this (Slack): @prajith

Would you like to work on fixing this bug?

None

@olix0r olix0r added support and removed bug labels Aug 21, 2024

olix0r commented Aug 21, 2024

Linkerd doesn't have any default request timeouts. The log messages you include do not indicate any sort of timeout behavior.

[  1081.455684s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_stack_tracing: service request=Request { method: POST, uri: http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue, version: HTTP/2.0, headers: {"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}, body: BoxBody }
[  1081.455695s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Grpc(Codes({2, 4, 7, 13, 14, 15})), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f71b37da900, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 10000 }, bound: 10000 }, rx_waker: AtomicWaker, tx_count: 4, rx_fields: "..." } } } })
[  1081.455709s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: method=POST uri=http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue version=HTTP/2.0
[  1081.455714s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: headers={"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}

Four seconds after the request is dispatched, the caller closes the connection:

[  1085.005443s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
[  1085.005508s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_detect: Connection completed
[  1085.005519s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}: linkerd_app_core::serve: Connection closed
[  1085.005530s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_stack_tracing: drop

Later, there is a log indicating that a response includes the l5d-proxy-error header, indicating that an error was handled in the server proxy:

[  1141.459115s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error

You may want to look at the upstream proxy's logs to understand what error is being encountered.

You may also want to use the debug container to understand the TCP flows on both the client and server pods.
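For reference, using the debug sidecar looks roughly like the commands below (the deployment and pod names are placeholders; double-check the flags against your CLI version):

# re-inject the workload with the debug sidecar enabled
kubectl get deploy workflow-worker -o yaml | linkerd inject --enable-debug-sidecar - | kubectl apply -f -

# capture the gRPC traffic to the Temporal frontend port from inside the pod
kubectl exec -it <worker-pod> -c linkerd-debug -- tshark -i any -f "tcp port 7233"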


olix0r commented Aug 21, 2024

I'm unable to reproduce this with an example application that sleeps for 60s. Clients successfully observe responses. And in the screenshot you include, we see other endpoints with response times higher than 50s. I'm quite confident that there is no timeout being enforced by Linkerd.

It's possible that idle connections are being lost by the conntrack table. You may want to configure the server to enable HTTP/2 keepalives so that the server actively communicates on connections with sparse request-response traffic. And, as I mentioned, you probably want to understand what error is actually arising that is causing an INTERNAL grpc-status to be set.
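For a plain grpc-go server, enabling the HTTP/2 keepalives mentioned above looks roughly like the sketch below. Temporal exposes its own configuration for this, so treat this only as an illustration of the mechanism; the intervals are arbitrary:

package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":7233")
	if err != nil {
		log.Fatal(err)
	}
	// Send an HTTP/2 PING on connections that have been idle for 30s so that
	// intermediaries (and conntrack tables) keep seeing traffic during long polls.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		Time:    30 * time.Second, // ping after 30s of inactivity
		Timeout: 10 * time.Second, // close the connection if the ping is not acked in time
	}))
	log.Fatal(srv.Serve(lis))
}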

I'm going to close this out since there is nothing actionable for us here, but please let us know if you are able to identify a problem with Linkerd.

@olix0r olix0r closed this as completed Aug 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2024
@linkerd linkerd unlocked this conversation Oct 10, 2024
@prajithp13

@olix0r I just enabled HTTP/2 debug logging and this is what I can see:

2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_HEADERS stream=39 len=26
2024/10/11 03:06:45 http2: decoded hpack field header field ":status" = "200"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-type" = "application/grpc"
2024/10/11 03:06:45 http2: decoded hpack field header field "date" = "Fri, 11 Oct 2024 03:06:44 GMT"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read DATA stream=39 len=5 data="\x00\x00\x00\x00\x00"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_STREAM|END_HEADERS stream=39 len=2
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "0"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = ""
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote WINDOW_UPDATE len=4 (conn) incr=5
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote HEADERS flags=END_HEADERS stream=47 len=11
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read PING flags=ACK len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote DATA flags=END_STREAM stream=47 len=138 data="\x00\x00\x00\x00\x85\n\adefault\x12\x0f\n\vhello-world\x10\x01\x1a87313@vendor-service-sandbox-vendor-comm-workflow-worker@\"\v\n\t\t\x00\x00\x00\x00\x00j\xf8@*\"\n 17f8a4c649c37ff7ffc26b6c67cc30df"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_STREAM|END_HEADERS stream=47 len=54
2024/10/11 03:06:45 http2: decoded hpack field header field ":status" = "200"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-length" = "0"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-type" = "application/grpc"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "13"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = "unexpected error"
2024/10/11 03:06:45 http2: decoded hpack field header field "l5d-proxy-error" = "unexpected error"
2024/10/11 03:06:45 http2: decoded hpack field header field "date" = "Fri, 11 Oct 2024 03:06:44 GMT"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_HEADERS stream=41 len=3
2024/10/11 03:06:45 http2: decoded hpack field header field ":status" = "200"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-type" = "application/grpc"
2024/10/11 03:06:45 http2: decoded hpack field header field "date" = "Fri, 11 Oct 2024 03:06:44 GMT"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read DATA stream=41 len=5 data="\x00\x00\x00\x00\x00"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_STREAM|END_HEADERS stream=41 len=2
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "0"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = ""
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote WINDOW_UPDATE len=4 (conn) incr=5
time=2024-10-11T03:06:45.632Z level=WARN source=/home/prajithp/dev/go/pkg/mod/go.temporal.io/[email protected]/internal/internal_worker_base.go:458 msg="Failed to poll for task." Namespace=default TaskQueue=hello-world WorkerID=7313@workflow-worker@ WorkerType=ActivityWorker Error="unexpected error"

@olix0r olix0r reopened this Oct 11, 2024

olix0r commented Oct 11, 2024

@prajithp13 Unfortunately this doesn't really get us any closer to a repro, though it does indicate that your application is seeing an error that originates in linkerd:

2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "13"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = "unexpected error"
2024/10/11 03:06:45 http2: decoded hpack field header field "l5d-proxy-error" = "unexpected error"

Do you have the accompanying proxy logs? I would expect the proxy to log a warning when emitting an unexpected error. Note that it's possible that this error originates in the server's proxy pod. It may also be appropriate to configure the relevant workloads with a pod annotation like config.linkerd.io/proxy-log-level: linkerd=debug,h2=trace,info.
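For example, that annotation goes on the pod template of the relevant Deployment (the name below is a placeholder):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-worker
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-log-level: "linkerd=debug,h2=trace,info"
    # ...rest of the pod spec unchanged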

Is this issue reliably reproducible? If so, it might be most expedient to try to reduce a minimal reproduction that you can share with us so we can debug this hands-on.


prajithp13 commented Oct 11, 2024

I can only see the following logs in the proxy:

[   429.553649s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll:FramedRead::poll_next:FramedRead::decode_frame{offset=13}: h2::codec::framed_read: decoding frame from 13B
[   429.553653s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll:FramedRead::poll_next:FramedRead::decode_frame{offset=13}: h2::codec::framed_read: frame.kind=Reset
[   429.553658s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll:FramedRead::poll_next: h2::codec::framed_read: received frame=Reset { stream_id: StreamId(7), error_code: CANCEL }
[   429.553667s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll: h2::proto::connection: recv RST_STREAM frame=Reset { stream_id: StreamId(7), error_code: CANCEL }
[   429.553672s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll: h2::proto::streams::state: recv_reset; frame=Reset { stream_id: StreamId(7), error_code: CANCEL }; state=HalfClosedRemote(AwaitingHeaders); queued=false
[   429.553680s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll: h2::proto::streams::counts: transition_after; stream=StreamId(7); state=State { inner: Closed(Error(Reset(StreamId(7), CANCEL, Remote))) }; is_closed=true; pending_send_empty=true; buffered_send_data=0; num_recv=4; num_send=0

I can set up a sample application along with steps to reproduce the error. I'll share a sample repository here shortly.

@prajithp13

@olix0r Here is the deployment YAML for both the Temporal server and worker: https://raw.githubusercontent.com/Prajithp/hello-temporal-go/refs/heads/main/deployment.yaml. You can apply this and, after waiting for at least 20 minutes, you should see the following error message in the worker pod logs: msg="Failed to poll for task."

Let me know if this works for you!

@eli-nomasec

I'm also working on extracting the logic between the Temporal Worker and Frontend into a simple Go app that reproduces this; I'll share it in a couple of days.

@prajithp13

@olix0r did you get a chance to look into it?


olix0r commented Oct 13, 2024 via email


olix0r commented Oct 14, 2024

While I'm still plumbing through some of the internals, I noticed something pretty glaring: these errors reliably manifest around the 5 minute mark. I did a quick search through the temporal codebase and found this setting:

	KeepAliveMaxConnectionAge = NewGlobalDurationSetting(
		"frontend.keepAliveMaxConnectionAge",
		5*time.Minute,
		`KeepAliveMaxConnectionAge is a duration for the maximum amount of time a
connection may exist before it will be closed by sending a GoAway. A
random jitter of +/-10% will be added to MaxConnectionAge to spread out
connection storms.`,
	)

I think there's some sort of race condition being encountered that causes us to dispatch requests on connections that have already been terminated by the server, causing the error you see propagated back to the worker.

This probably will help us narrow down a repro independent of temporal. In the meantime, you may see improvements by increasing the max connection age.
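For anyone who wants to try that, a sketch assuming Temporal's standard dynamic-config YAML format (the key name comes from the snippet above; the value and the file location depend on how you deploy Temporal):

# dynamicconfig/<your-config>.yaml
frontend.keepAliveMaxConnectionAge:
  - value: 2h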


olix0r commented Oct 15, 2024

You may also benefit from configuring retries on a GRPCRoute so that clients are able to retry these failures.
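A hedged sketch of what that could look like using the retry annotations in recent edge releases (annotation names, apiVersion, and retry semantics should be verified against your Linkerd and Gateway API versions; the names and port below are placeholders):

apiVersion: gateway.networking.k8s.io/v1alpha2   # or v1, depending on the installed Gateway API CRDs
kind: GRPCRoute
metadata:
  name: temporal-frontend-retries
  namespace: temporal
  annotations:
    retry.linkerd.io/grpc: internal   # retry requests that fail with grpc-status INTERNAL
    retry.linkerd.io/limit: "2"       # at most two retries per request
spec:
  parentRefs:
    - name: temporal-frontend
      kind: Service
      group: core
      port: 7233
  rules:
    - backendRefs:
        - name: temporal-frontend
          port: 7233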
