Linkerd closing connections of long-running gRPC requests #12964
Comments
Linkerd doesn't have any default request timeouts. The log messages you include do not indicate any sort of timeout behavior.
Four seconds after the request is dispatched, the caller closes the connection:
Later, there is a log showing that a response includes the l5d-proxy-error header, which indicates that an error was handled in the server proxy:
You may want to look at the upstream proxy's logs to understand what error is being encountered. You may also want to use the debug container to understand the TCP flows on both the client and server pods.
I'm unable to reproduce this with an example application that sleeps for 60s. Clients successfully observe responses. And in the screenshot you include, we see other endpoints with response times higher than 50s. I'm quite confident that there is no timeout being enforced by Linkerd. It's possible that idle connections are being lost by the conntrack table. You may want to configure the server to enable HTTP/2 keepalives so that the server actively communicates on connections with sparse request-response traffic. And, as I mentioned, you probably want to understand what error is actually arising that is causing an INTERNAL grpc-status to be set. I'm going to close this out since there is nothing actionable for us here, but please let us know if you are able to identify a problem with Linkerd.
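(Editor's note: for reference, a minimal sketch of what server-side HTTP/2 keepalives look like on a Go gRPC server. The listen address and intervals below are illustrative, not taken from this setup.)

```go
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Send an HTTP/2 PING to each client after 30s of inactivity and drop the
	// connection if no ack arrives within 10s. This keeps connections that carry
	// sparse request/response traffic (such as long polls) actively exercised,
	// so intermediaries and conntrack entries don't see them as idle.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		Time:    30 * time.Second,
		Timeout: 10 * time.Second,
	}))

	lis, err := net.Listen("tcp", ":50051") // illustrative port
	if err != nil {
		panic(err)
	}
	// Register application services here before serving.
	_ = srv.Serve(lis)
}
```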
@olix0r I just enabled http2 debug and this is what I can see.
@prajithp13 Unfortunately this doesn't really get us any closer to a repro, though it does indicate that your application is seeing an error that originates in linkerd:
Do you have the accompanying proxy logs? I would expect the proxy to log a warning when emitting an unexpected error. Note that it's possible that this error originates in the server's proxy pod. It may also be appropriate to configure the relevant workloads with a pod annotation that raises the proxy log level. Is this issue reliably reproducible? If so, it might be most expedient to try to reduce this to a minimal reproduction that you can share with us so we can debug this hands-on.
I can only see the following logs in the proxy.
I can set up a sample application along with steps to reproduce the error. I'll share a sample repository here shortly.
@olix0r Here is the deployment YAML for both the Temporal server and worker. Let me know if this works for you!
I'm also working on extracting the logic between the Temporal Worker and Frontend into a simple Go app that reproduces this; I'll share it in a couple of days.
@olix0r did you get a chance to look into it?
Thanks for sharing the repro. I haven’t had a chance to look into it yet, but I’ll update the issue once I have more information.
While I'm still plumbing through some of the internals, I noticed something pretty glaring: these errors reliably manifest around the 5 minute mark. I did a quick search through the temporal codebase and found this setting:

```go
KeepAliveMaxConnectionAge = NewGlobalDurationSetting(
	"frontend.keepAliveMaxConnectionAge",
	5*time.Minute,
	`KeepAliveMaxConnectionAge is a duration for the maximum amount of time a
connection may exist before it will be closed by sending a GoAway. A
random jitter of +/-10% will be added to MaxConnectionAge to spread out
connection storms.`,
)
```

I think there's some sort of race condition being encountered that causes us to dispatch requests on connections that have already been terminated by the server, causing the error you see propagated back to the worker. This probably will help us narrow down a repro independent of temporal. In the meantime, you may see improvements by increasing the max connection age.
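(Editor's note: Temporal's frontend.keepAliveMaxConnectionAge appears to correspond to gRPC's standard server keepalive options. A hedged Go sketch of what a 5-minute connection age means on a plain grpc-go server; the values are illustrative, not Temporal's exact configuration.)

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer shows the server options that implement a bounded connection age:
// after MaxConnectionAge the server sends a GOAWAY on the HTTP/2 connection,
// and MaxConnectionAgeGrace bounds how long in-flight RPCs may continue before
// the connection is forcibly closed.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      5 * time.Minute, // illustrative, mirrors the 5m default above
		MaxConnectionAgeGrace: 1 * time.Minute, // illustrative grace period
	}))
}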
You may also benefit from configuring retries on a GRPCRoute so that clients are able to retry these failures.
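(Editor's note: as an aside, and separate from the GRPCRoute mechanism suggested above, gRPC clients can also opt into the library's built-in retries via a service config. A minimal Go sketch; the "example.Echo" service name and the target are assumptions for illustration.)

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// retryServiceConfig enables gRPC's built-in retries for a hypothetical
// "example.Echo" service when calls fail with UNAVAILABLE or INTERNAL.
const retryServiceConfig = `{
  "methodConfig": [{
    "name": [{"service": "example.Echo"}],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE", "INTERNAL"]
    }
  }]
}`

// dial creates a client connection that applies the retry policy above.
func dial(target string) (*grpc.ClientConn, error) {
	return grpc.NewClient(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(retryServiceConfig),
	)
}
```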
What is the issue?
We're using Temporal.io with Linkerd.
The Temporal worker uses long-polling: it makes a poll call that blocks for 60 seconds.
If no task is received during this time, the call returns an empty result, and a new call is created immediately.
It seems that Linkerd is closing the connection before Temporal does, and when Temporal tries to close the connection, it throws the exception mentioned above.
It seems most requests are capped at 50 seconds (see attached image below).
How can it be reproduced?
This can probably be reproduced by sending a gRPC request, and having the server wait for a minute - then respond.
If not - install Temporal and inject the proxies.
You don't even have to implement any workloads with the SDK, Temporal's own workloads will get the error (Worker->Frontend).
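(Editor's note: for illustration, a minimal single-binary sketch of the "server waits ~60s, then responds" repro idea. It reuses grpc-go's bundled health service so no custom proto is needed; in a real repro the client and server would run as separate Linkerd-injected pods.)

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// slowHealth answers health checks only after a 60-second delay,
// mimicking a long poll that blocks for a minute.
type slowHealth struct {
	healthpb.UnimplementedHealthServer
}

func (slowHealth) Check(ctx context.Context, _ *healthpb.HealthCheckRequest) (*healthpb.HealthCheckResponse, error) {
	select {
	case <-time.After(60 * time.Second):
		return &healthpb.HealthCheckResponse{Status: healthpb.HealthCheckResponse_SERVING}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	healthpb.RegisterHealthServer(srv, slowHealth{})
	go func() { log.Fatal(srv.Serve(lis)) }()

	// Client side: call repeatedly with a deadline comfortably above 60s,
	// as the Temporal worker's long poll does.
	conn, err := grpc.NewClient("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	client := healthpb.NewHealthClient(conn)
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 70*time.Second)
		resp, err := client.Check(ctx, &healthpb.HealthCheckRequest{})
		cancel()
		log.Printf("resp=%v err=%v", resp, err)
	}
}
```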
Logs, error output, etc
Another user who had the same issue posted these logs (they run Java):
Proxy logs:
Output of linkerd check -o short:
Environment
Client version: edge-24.8.2
Server version: edge-24.8.2
Possible solution
It's possible we have a timeout or some setting that causes Linkerd to disrupt long-running gRPC connections.
We tried increasing the overall timeouts (on the proxy and also specifically with GRPCRoutes) - with no success.
Additional context
Relevant Slack Conversations:
https://linkerd.slack.com/archives/C89RTCWJF/p1723451335174499
https://linkerd.slack.com/archives/C89RTCWJF/p1723658831519119
https://linkerd.slack.com/archives/C89RTCWJF/p1723659323180089
https://linkerd.slack.com/archives/C89RTCWJF/p1724061511236179
My email / Slack user: [email protected]
Another user experiencing this (Slack): @prajith
Would you like to work on fixing this bug?
None