
Envoy not retrying Http2 streams sent but ignored by server after a GOAWAY frame and returning 503s #36450

Open
ropnop opened this issue Oct 4, 2024 · 4 comments
Labels: area/http, bug, no stalebot

Comments


ropnop commented Oct 4, 2024

Title: Envoy not retrying Http2 streams sent but ignored by server after a GOAWAY frame and returning 503s

Description:
After configuring an HTTP/2 AWS Application Load Balancer (ALB) as an upstream target and running load tests through Envoy, we observed sporadic 503s (a few out of every 10k requests) and did some investigation.

The problem appears to stem from the ALB's default behavior of sending a GOAWAY frame once a connection reaches 10,000 streams:

[Screenshot: AWS documentation describing the ALB's 10,000-streams-per-connection GOAWAY behavior]

We validated this by grouping access logs by TCP connection (upstream_local_address): every observed 503 response was the 10,001st or later request on its TCP connection.
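For anyone who wants to reproduce that grouping, a quick sketch (assuming the JSON access logs from the config below, written to /dev/stdout there, have been captured to a file access.log):

# count 503s per upstream TCP connection; affected connections only
# start returning 503s after ~10k requests
jq -r 'select(.response_code == 503) | .upstream_local_address' access.log | sort | uniq -c | sort -rn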

I also captured and decrypted a PCAP between Envoy and the ALB and saw the same thing. At the very end of the TCP stream, Envoy sends the ALB HTTP/2 HEADERS frames for six streams, corresponding to six GET requests (stream IDs 19999, 20001, 20003, 20005, 20007, 20009):

[Screenshot: Wireshark capture of the final segment carrying streams 19999 through 20009]

The next frame is a GOAWAY response from the ALB indicating that the last stream processed was 19999:

[Screenshot: Wireshark capture of the GOAWAY frame with Last-Stream-ID 19999]

The ALB then returns the response for stream 19999 and sends a FIN/ACK.

The problem is that the other streams sent before the GOAWAY are never retried. Each of those requests results in a 503 from Envoy to the downstream (verified by matching the request_id header in Wireshark).

Reading the HTTP/2 RFC (RFC 9113, Section 6.8), I think both AWS and Envoy are a little to blame. Envoy is supposed to retry those streams on a new connection:

If the receiver of the GOAWAY has sent data on streams with a higher stream identifier than what is indicated in the GOAWAY frame, those streams are not or will not be processed. The receiver of the GOAWAY frame can treat the streams as though they had never been created at all, thereby allowing those streams to be retried later on a new connection.

However, the ALB should probably also be more graceful about how it shuts connections down at the 10k-stream mark:

A server that is attempting to gracefully shut down a connection SHOULD send an initial GOAWAY frame with the last stream identifier set to 2^31-1 and a NO_ERROR code. This signals to the client that a shutdown is imminent and that initiating further requests is prohibited. After allowing time for any in-flight stream creation (at least one round-trip time), the server MAY send another GOAWAY frame with an updated last stream identifier. This ensures that a connection can be cleanly shut down without losing requests.

Doing some more digging, it looks like the official Go HTTP/2 package ran into the same problem with the ALB's abrupt GOAWAYs: golang/go#18639

Is this a bug in Envoy's HTTP/2 client? Should it retry lost streams on a new connection when it receives a GOAWAY whose last stream ID is lower than the highest stream ID it has already sent?

I am also pursuing this with AWS support on the ALB side.

Repro steps:
Point Envoy at an HTTP/2 Amazon ALB as an upstream cluster and run a load test with hey:

hey -z 5m http://127.0.0.1:8080/ping

After a few minutes, some 503s will start to appear

Config:

admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9090

static_resources:
  listeners:
    - name: envoy-listener-8080
      traffic_direction: OUTBOUND
      address:
          socket_address:
            address: 0.0.0.0
            port_value: 8080
            protocol: TCP
      filter_chains:
        - name: default_filter_chain
          filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                server_name: envoy
                use_remote_address: true
                generate_request_id: true
                request_id_extension:
                  typed_config:
                    '@type': type.googleapis.com/envoy.extensions.request_id.uuid.v3.UuidRequestIdConfig
                    use_request_id_for_trace_sampling: false
                access_log:
                  - name: envoy.access_loggers.file
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                      path: "/dev/stdout"
                      log_format:
                        json_format:
                          component: http
                          # log the original authority before it's rewritten
                          authority: "REQ(:AUTHORITY)"
                          bytes_received: "%BYTES_RECEIVED%"
                          bytes_sent: "%BYTES_SENT%"
                          downstream_local_address: "%DOWNSTREAM_LOCAL_ADDRESS%"
                          downstream_remote_address: "%DOWNSTREAM_REMOTE_ADDRESS%"
                          duration: "%DURATION%"
                          method: "%REQ(:METHOD)%"
                          path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
                          protocol: "%PROTOCOL%"
                          request_id: "%REQ(X-REQUEST-ID)%"
                          requested_server_name: "%REQUESTED_SERVER_NAME%"
                          response_code: "%RESPONSE_CODE%"
                          response_code_details: '%RESPONSE_CODE_DETAILS%'
                          response_flags: "%RESPONSE_FLAGS%"
                          start_time: "%START_TIME%"
                          upstream_cluster: "%UPSTREAM_CLUSTER%"
                          upstream_host: "%UPSTREAM_HOST%"
                          upstream_authority: "%REQ(:AUTHORITY)%"
                          upstream_protocol: "%UPSTREAM_PROTOCOL%"
                          upstream_path: "%REQ(:PATH)%"
                          upstream_local_address: "%UPSTREAM_LOCAL_ADDRESS%"
                          upstream_service_time: "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
                          upstream_transport_failure_reason: "%UPSTREAM_TRANSPORT_FAILURE_REASON%"
                          upstream_request_attempt_count: "%UPSTREAM_REQUEST_ATTEMPT_COUNT%"
                          upstream_request_bytes_received: "%UPSTREAM_HEADER_BYTES_RECEIVED%" 
                          upstream_connection_pool_ready_duration: "%UPSTREAM_CONNECTION_POOL_READY_DURATION%"
                          upstream_peer_subject: "%UPSTREAM_PEER_SUBJECT%"
                          upstream_peer_issuer: "%UPSTREAM_PEER_ISSUER%"
                          upstream_tls_session_id: "%UPSTREAM_TLS_SESSION_ID%"
                          upstream_tls_cipher: "%UPSTREAM_TLS_CIPHER%"
                          upstream_tls_version: "%UPSTREAM_TLS_VERSION%"
                          upstream_peer_cert: "%UPSTREAM_PEER_CERT%"
                          upstream_peer_cert_v_start: "%UPSTREAM_PEER_CERT_V_START%"
                          upstream_peer_cert_v_end: "%UPSTREAM_PEER_CERT_V_END%"       
                          user_agent: "%REQ(USER-AGENT)%"
                          x_forwarded_for: "%REQ(X-FORWARDED-FOR)%"
                          upstream_connection_id: "%UPSTREAM_CONNECTION_ID%"
                          stream_id: "%STREAM_ID%"
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                route_config:
                  virtual_hosts:
                    - name: poc-alb
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/"
                          route:
                            cluster: poc-alb
                            
  clusters:
    - name: poc-alb
      dns_lookup_family: V4_ONLY
      connect_timeout: 5s        
      type: LOGICAL_DNS
      lb_policy: ROUND_ROBIN
      transport_socket:
        name: envoy.transport_socket.tls
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          common_tls_context:
            key_log:
              path: /configs/keylog.log
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions             
          auto_config:
            http_protocol_options: {}
            http2_protocol_options: {}
      load_assignment:
        cluster_name: poc-alb
        endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 10.162.31.60 # private ALB IP address
                    port_value: 443
                    protocol: TCP

Logs:
The debug logs during the load test are ~650MB. I will try to do some filtering and sorting; are there specific things I should be grepping for?

ropnop added the bug and triage labels Oct 4, 2024

ropnop commented Oct 4, 2024

FWIW, for anyone hitting the same problem: we've been able to avoid/hide it so far by setting the following in our cluster configs that point to HTTP/2 ALBs:

common_http_protocol_options:
  max_requests_per_connection: 9500

Basically, we just try not to hit the 10k maximum number of streams per connection that AWS enforces on its ALBs.
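For context, here's where that setting lands when merged into the typed_extension_protocol_options of the repro cluster above (a sketch; 9500 is just an arbitrary margin below the ALB's 10,000-stream limit):

clusters:
  - name: poc-alb
    # ... other cluster settings unchanged from the repro config ...
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        common_http_protocol_options:
          # recycle each upstream connection before the ALB's 10k-stream GOAWAY fires
          max_requests_per_connection: 9500
        auto_config:
          http_protocol_options: {}
          http2_protocol_options: {}

This way Envoy drains and replaces the connection itself instead of letting the ALB cut it off mid-flight.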

kyessenov (Contributor) commented Oct 4, 2024

Envoy should be capable of retrying on a different connection (or a different endpoint) but the retry behavior is configurable, so maybe "GOAWAY" is not considered a retriable reason (https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-retry-on).

CC @wbpcode

kyessenov added the area/http label and removed the triage label Oct 4, 2024

ropnop commented Oct 7, 2024

Thank you - yes, I should have mentioned that. I uploaded a minimal reproducible example config, but I've also configured the route with this retry_policy:

retry_policy:
  retry_on: retriable-status-codes,reset,reset-before-request,refused-stream,connect-failure,retriable-headers
  retriable_status_codes: [500, 502, 503, 504]
  num_retries: 3
  per_try_timeout: 5s

We still see 503s on the 10,001st and later requests per connection.

From what I can tell, none of these retry_on scenarios applies. The server does respond on the connection (but only for the 10,000th stream) and then closes it, so I don't think Envoy is interpreting this as a reset or connect-failure. And it's not a reset-before-request, because the request stream was actually sent and ACK'd by the server.
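One way to see which category Envoy actually assigns to these failures (a sketch, again assuming the JSON access logs from the config above have been captured to access.log):

# summarize how Envoy classified each 503; the response_flags /
# response_code_details pair shows which (if any) retry_on condition could match
jq -r 'select(.response_code == 503) | "\(.response_flags) \(.response_code_details)"' access.log | sort | uniq -c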

Not sure what else to try but all suggestions appreciated 👍


github-actions bot commented Nov 7, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Nov 7, 2024
ggreenway removed the stale label Nov 7, 2024
yanavlasov added the no stalebot label Nov 15, 2024