"New streams cannot be created after receiving a GOAWAY" and CANCELLED errors during service deployments #2694
Comments
@murgatroid99 Hey Michael, apologies in advance for tagging you explicitly, but I am doing so as you had cut the ticket to |
You don't need to tag me. I get notified for every issue anyway. The latest version is 1.10.3. Please verify that the issue still happens on that version and include updated trace logs. |
Thanks! Here are the logs: A case on 1.10.3 when we see Another case on 1.10.3 when we only see |
The first log shows the "new streams cannot be created" error getting transparently retried. That is exactly the intended behavior, and the error is not surfaced to the application, so that shouldn't be a problem. As for the "cancelled" errors, the logs show that the connection received a GOAWAY with code 8 right before those errors were surfaced. That is also the expected behavior when the client receives a GOAWAY like that. You said in the first post on this issue
You don't give any details about this configuration, but it is common for servers gracefully closing connections to send a GOAWAY with code 0 to signal that no new streams should be opened, then wait some period of time (often referred to as a "grace period"), and then send a GOAWAY with code 8 to cancel all streams that are still open after that time. What we are seeing in this log is the expected outcome of that behavior. |
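The two-step shutdown described above can be sketched as a small client-side model. This is only an illustration, not grpc-js code: `ConnectionModel` and its methods are hypothetical names, and treating every non-zero GOAWAY code as ending the open streams is a simplifying assumption. The numeric values 0 (NO_ERROR) and 8 (CANCEL) are the HTTP/2 error codes mentioned in this thread.

```typescript
// HTTP/2 error codes relevant to this thread (values from the HTTP/2 spec).
const NO_ERROR = 0x0; // graceful: stop opening new streams
const CANCEL = 0x8;   // sent after the grace period: remaining streams end

// Hypothetical model of one client connection; not part of grpc-js.
class ConnectionModel {
  draining = false;
  openStreams: number[] = [];
  cancelled: number[] = [];

  // Returns false once a GOAWAY has been seen: no new streams on this connection.
  openStream(id: number): boolean {
    if (this.draining) return false;
    this.openStreams.push(id);
    return true;
  }

  // A stream that completes normally during the grace period.
  finishStream(id: number): void {
    this.openStreams = this.openStreams.filter((s) => s !== id);
  }

  onGoaway(errorCode: number): void {
    this.draining = true;
    if (errorCode !== NO_ERROR) {
      // Assumption of this sketch: any non-zero code ends all open streams.
      this.cancelled = this.cancelled.concat(this.openStreams);
      this.openStreams = [];
    }
  }
}
```

In this model, a stream that finishes between the code 0 GOAWAY and the code 8 GOAWAY completes normally, while one still open when code 8 arrives is the kind of call that surfaces as cancelled.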
Thanks! Are these |
This is from the framework that facilitates our server-side shutdown:
This results in a GOAWAY frame from the server to the client with http2 stream id 2147483647 and code 0.
A drain accompanies this and keeps the connection alive while the old in-progress streams are serviced. What is not clear to me from the logs is which requests the client tried to send on the older connection. I am also curious how other SDKs handle this in a way that doesn't result in |
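For context on the stream id 2147483647 mentioned above: in HTTP/2, a GOAWAY frame carries a last-stream-id, and only streams with a higher id are guaranteed not to have been processed by the sender. A minimal sketch of that rule follows; the function name `wasPossiblyProcessed` is hypothetical and not a grpc-js API.

```typescript
// Largest HTTP/2 stream identifier (31 bits), i.e. 2147483647.
const MAX_STREAM_ID = 0x7fffffff;

// Hypothetical helper: true if the GOAWAY sender may have acted on the stream.
// Streams with an id above lastStreamId were not processed and are the ones
// that can be safely retried on a new connection.
function wasPossiblyProcessed(streamId: number, lastStreamId: number): boolean {
  return streamId <= lastStreamId;
}
```

With a last-stream-id of 2147483647, every stream the client has already opened falls under "possibly processed", so the first GOAWAY alone does not mark any in-flight request as safe to retry.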
I was assuming your server was one of the gRPC implementations. If it isn't, then I can't say anything about what the server is doing. What we have are the client logs, and the client logs say that the client received a GOAWAY with code 8. |
I should have been more specific - we do use
|
I also should have been more specific. Tonic is an implementation of gRPC, but it is not one of the official gRPC implementations that we maintain/support. I understand that the server sends a GOAWAY with code 0. I am saying that it also sends a GOAWAY with code 8. Cancellation errors cannot be retried in general, because the client doesn't know how much of the request the server processed. There isn't enough information in the logs to determine why you see different behavior in other libraries and older versions of Node gRPC. We see a small snippet of what the client sees, and we see that it gets a GOAWAY with code 8 and acts accordingly. |
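The point about cancellation can be expressed as an application-level retry policy. The sketch below is hypothetical and is not how grpc-js decides retries: `shouldRetry` is an illustrative name, the caller is assumed to know whether the operation is idempotent, and treating UNAVAILABLE as always retryable is an assumption of this example. The numeric values are the standard gRPC status codes.

```typescript
// Standard gRPC status code values.
const CANCELLED = 1;
const UNAVAILABLE = 14;

// Hypothetical policy: decide whether the application should retry a failed call.
// attempt is the number of attempts already made.
function shouldRetry(
  code: number,
  idempotent: boolean,
  attempt: number,
  maxAttempts: number
): boolean {
  if (attempt >= maxAttempts) return false;
  // The client cannot tell how much of a cancelled request the server
  // processed, so retry CANCELLED only when repeating the call is harmless.
  if (code === CANCELLED) return idempotent;
  // Assumption of this sketch: UNAVAILABLE is treated as retryable.
  return code === UNAVAILABLE;
}
```

Under this policy a non-idempotent call that fails with CANCELLED is surfaced to the caller rather than retried.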
In case the problem you are experiencing is related to #2625, I recommend testing again with grpc-js 1.10.4. |
I tried with that but still see the |
Problem description
During our service deployments, the client rejects a certain number of requests with CANCELLED errors. Our servers are configured to send a GOAWAY frame to clients during deployments to gracefully terminate connections. On May 11 2023, we bumped our gRPC-js version from 1.7.3 to 1.8.14. We reproduced our deployment scenario on multiple versions of our SDK, and the aforementioned gRPC version bump is the one where we start seeing the errors during deployments. We are not seeing this behavior with other SDKs.

I haven't bisected the gRPC-js versions between those two releases, and there are 3k commits between them, so it is hard to pinpoint a specific commit. I also see an open issue related to "New streams cannot be created after receiving a GOAWAY", but that's only one part of the problem we witness. There's another ticket that seemed relevant. In our logs, we see two flavors of errors.

One is when the transport layer of gRPC detects a GOAWAY frame and logs it, but it is still followed by a few CANCELLED errors. The other flavor is when the transport layer doesn't detect the frame explicitly, but we see "new stream cannot be created after GOAWAY" messages, followed by more cancelled errors. Sometimes we see both of these behaviors.

Here's a logStream from when we see both of them with http and gRPC traces:
with_cancelled_new_stream_errors.log

The below is a logStream where we don't see any cancelled or new stream errors, from a version of gRPC earlier than 1.8.14. Notice that these do report INTERNAL errors sometimes (not as frequently as cancelled), but even those can be safely retried, so they aren't an issue. If there's a way we can know whether the cancelled errors are safe to retry for non-idempotent operations, that would be awesome.
without_cancelled_new_stream_errors_old_grpc_version.log

Let me know if you need more logs! I have attached them as files so as not to clutter this space. I can easily reproduce this in my setup, so I will be happy to add any more traces or scenarios that you'd like!
Reproduction steps
Environment
1.8.14