Envoy not retrying Http2 streams sent but ignored by server after a GOAWAY frame and returning 503s #36450
FWIW, and to anyone having the same problem: we've been able to avoid/hide this issue so far by setting the following in our cluster configs that point to HTTP/2 ALBs:
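(The exact snippet from this comment isn't preserved in this copy; a sketch of the kind of cluster option being described, with a placeholder limit comfortably below the ALB's 10,000-stream cap, might look like this:)

```yaml
# Hypothetical sketch: cap requests per upstream connection well below the
# ALB's 10,000-stream limit so Envoy recycles the connection before the ALB
# ever sends its GOAWAY. The value 9000 is a placeholder, not the commenter's
# exact setting.
clusters:
  - name: alb_upstream
    type: LOGICAL_DNS
    max_requests_per_connection: 9000
    http2_protocol_options: {}
```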
Basically we just try not to hit the 10,000-requests-per-connection maximum that AWS enforces on the ALBs, which avoids the issue.
Envoy should be capable of retrying on a different connection (or a different endpoint), but the retry behavior is configurable, so maybe "GOAWAY" is not considered a retriable reason (https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-retry-on). CC @wbpcode
Thank you - yes I should have mentioned that. I uploaded a minimally reproducible example config, but I've also configured the route with this retry_policy:
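(The retry_policy from this comment isn't preserved in this copy; a sketch of a route-level policy along those lines, with assumed retry_on values and counts, is below:)

```yaml
# Hypothetical sketch of the kind of route retry policy described above.
# The retry_on values, retry count, and timeout are assumptions, not the
# reporter's exact configuration.
retry_policy:
  retry_on: "5xx,reset,connect-failure"
  num_retries: 3
  per_try_timeout: 5s
```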
We still see 503s on the 10,001st+ request per connection. From what I can tell, none of these retry settings seem to help. Not sure what else to try, but all suggestions are appreciated 👍
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Title: Envoy not retrying Http2 streams sent but ignored by server after a GOAWAY frame and returning 503s
Description:
After configuring an HTTP/2 AWS Application Load Balancer (ALB) as an upstream target and running load tests through Envoy, we observed sporadic 503s (a few out of every 10k requests) and did some investigation.
The problem appears to stem from the ALB's default behavior of sending a `GOAWAY` frame every time a connection reaches 10,000 streams.

We validated this by looking at access logs and grouping them by TCP connection (`upstream_local_address`): every observed 503 response was the 10,001st or later request on that TCP connection.

I also captured and decrypted a PCAP between Envoy and the ALB and saw the same thing. At the very end of the TCP stream, Envoy sends an HTTP/2 frame to the ALB containing 6 streams, corresponding to 6 GET requests (stream IDs 19999, 20001, 20003, 20005, 20007, 20009).
The next frame is a `GOAWAY` from the ALB indicating that the last stream processed was 19999. The ALB then sends the response for stream 19999, followed by a `FIN, ACK`.

The problem is that the other streams sent in the frame before the `GOAWAY` are never retried: each of those requests results in a 503 from Envoy to the downstream (verified by looking at the request_id header in Wireshark).

Reading the HTTP/2 RFC, I think both AWS and Envoy are a little to blame. I think Envoy is supposed to retry those streams on a new connection, since streams with identifiers above the last-stream-id in a `GOAWAY` were not processed by the server and can safely be retried.
However, the ALB should also probably be more graceful about how it handles shutdowns at the 10k request mark; the RFC recommends that a server shutting down gracefully first send a `GOAWAY` with the last-stream-id set to 2^31-1, and only later send a final `GOAWAY` with the actual last processed stream, rather than abruptly cutting off streams already in flight.
Doing some more digging, it looks like the official Go HTTP/2 package has run into the same issue with how to handle the ALB's abrupt `GOAWAY`s: golang/go#18639.

Is this a bug in Envoy's HTTP/2 client? Should it be retrying lost streams on a new connection when it receives a `GOAWAY` with a last-stream-id lower than the highest stream ID it has sent?

I am also pursuing this with AWS support on the ALB side.
Repro steps:
1. Point an upstream cluster at an HTTP/2 Amazon ALB and run a load test with `hey`.
2. After a few minutes, some 503s will start to appear.
Config:
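(The config attached to the original issue isn't preserved in this copy; a minimal sketch of the kind of upstream cluster involved, with a placeholder ALB hostname and timeouts, is below:)

```yaml
# Hypothetical minimal sketch of the upstream cluster in question.
# The ALB hostname, port, and timeout are placeholders, not the reporter's
# actual values.
clusters:
  - name: alb_upstream
    type: LOGICAL_DNS
    connect_timeout: 5s
    http2_protocol_options: {}          # speak HTTP/2 to the ALB
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        sni: my-alb.example.com
    load_assignment:
      cluster_name: alb_upstream
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: my-alb.example.com
                    port_value: 443
```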
Logs:
The debug logs during the load test are ~650MB. I will try to do some filtering and sorting - is there anything specific I should be grepping for?