Set keep-alive timeout higher than ALB idle timeout #448

rbreslow · 2020-09-24T16:48:48Z

Overview

Keep-alive, when enabled, enables the load balancer to reuse back-end connections until the keep-alive timeout expires. To ensure that the load balancer is responsible for closing the connections to your instance, make sure that the value you set for the HTTP keep-alive time is greater than the idle timeout setting configured for your load balancer.

The default keep-alive timeout for the Node http.Server is 5 seconds. The default idle timeout for the ALB is 60 seconds.

The ALB opens connections for reuse, the Node http.Server closes an idle connection after its 5-second keep-alive timeout, and the ALB then tries to send a request over the closed connection, which triggers a 502.

HTTP 502: Bad gateway

Possible causes:
. . .
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. Check whether the keep-alive duration of the target is shorter than the idle timeout value of the load balancer.

We can see more evidence of this by looking at the ALB access logs with Athena:

SELECT elb_status_code,
       request_processing_time,
       target_processing_time,
       response_processing_time,
       request_url
FROM stg_alb_logs
WHERE elb_status_code LIKE '5%';

response_processing_time is set to -1 if the load balancer can't send the request to a target. This can happen if the target closes the connection before the idle timeout or if the client sends a malformed request.
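
As a rough illustration of how rows returned by the query above might be triaged, here is a minimal sketch. The `AlbLogRow` shape, the `isSuspectedKeepAliveRace` helper, and the sample rows are all hypothetical (the rows are made up for illustration, not taken from the real logs), and it assumes `elb_status_code` is stored as a string, as the `LIKE` filter suggests. A match is only consistent with the keep-alive race; it is not proof of it.

```typescript
// Hypothetical row shape mirroring the columns selected in the Athena query.
interface AlbLogRow {
  elb_status_code: string;
  request_processing_time: number;
  target_processing_time: number;
  response_processing_time: number;
  request_url: string;
}

// Flags 502s whose response_processing_time is -1. Per the ALB docs quoted
// above, -1 can also result from a malformed client request, so treat a
// match as a hint worth investigating rather than a diagnosis.
function isSuspectedKeepAliveRace(row: AlbLogRow): boolean {
  return row.elb_status_code === "502" && row.response_processing_time === -1;
}

// Made-up sample rows, for illustration only.
const sampleRows: AlbLogRow[] = [
  {
    elb_status_code: "502",
    request_processing_time: 0.001,
    target_processing_time: 0.002,
    response_processing_time: -1,
    request_url: "https://example.com/api/projects",
  },
  {
    elb_status_code: "504",
    request_processing_time: 0.001,
    target_processing_time: 60.0,
    response_processing_time: 0.0,
    request_url: "https://example.com/api/export",
  },
];

const suspects = sampleRows.filter(isSuspectedKeepAliveRace);
console.log(suspects.map((row) => row.request_url));
```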

This PR sets the keep-alive timeout to be higher than the ALB timeout so that the ALB is responsible for prompting the closure of individual TCP connections to the server.
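
For reference, a minimal sketch of the idea on a plain Node `http.Server`. The `applyAlbSafeTimeouts` helper and the margin constants are assumptions for illustration, not the exact code in this PR; the only value confirmed by the diff is `headersTimeout = 66000`, and the 60-second figure is the ALB default cited above.

```typescript
import * as http from "http";

// Assumed values: 60 s is the default ALB idle timeout; the 5 s and 1 s
// margins are illustrative choices, not taken from this PR.
const ALB_IDLE_TIMEOUT_MS = 60000;
const KEEP_ALIVE_MARGIN_MS = 5000;
const HEADERS_MARGIN_MS = 1000;

// Hypothetical helper: configures a server so that the ALB, not Node,
// is the side that closes idle keep-alive connections.
function applyAlbSafeTimeouts(
  server: http.Server,
  albIdleTimeoutMs: number = ALB_IDLE_TIMEOUT_MS
): http.Server {
  // Keep idle sockets open longer than the ALB does.
  server.keepAliveTimeout = albIdleTimeoutMs + KEEP_ALIVE_MARGIN_MS;
  // Keep headersTimeout above keepAliveTimeout to sidestep the regression
  // tracked in https://github.com/nodejs/node/issues/27363.
  server.headersTimeout = server.keepAliveTimeout + HEADERS_MARGIN_MS;
  return server;
}

const server = applyAlbSafeTimeouts(
  http.createServer((_req, res) => {
    res.end("ok");
  })
);
```

If the ALB idle timeout is ever changed from its default, the same helper can be called with the new value so the ordering (ALB idle timeout < keepAliveTimeout < headersTimeout) is preserved.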

See:

Resolves #428

Checklist

Description of PR is in an appropriate section of CHANGELOG.md and grouped with similar changes, if possible

Testing Instructions

I'm not sure how to replicate the failures we saw earlier. I've tried exercising the staging application and wasn't able to generate any 502s today before or after applying the fix (see the Jenkins deploy). Additionally, the prd_alb_logs table in Athena does not contain any 502s.

I think the evidence makes it clear we should apply this either way, and we can be on the lookout for more 502s as we're able to gather more data.

rbreslow · 2020-09-24T16:58:51Z

src/server/src/main.ts

+  // Ensure the headersTimeout is set higher than the keepAliveTimeout due to
+  // this nodejs regression bug: https://github.com/nodejs/node/issues/27363
+  server.headersTimeout = 66000;


This was resolved and backported to Node v12 in nodejs/node#34131. However, it doesn't look like the fix will be available until Node v12.18.5 drops (compare https://github.com/nodejs/node/commits/v12.x with https://github.com/nodejs/node/commits/v12.x-staging).

hectcastro

The application server timeout changes appear to be working as an effective 502 deterrent. I added a simple k6 load test in 4827e9a and used it against an instance of the application with and without the changes.

When the timeout changes were not applied, 502s were present in the Athena ALB request logs. When the timeout changes were applied, 502s were not present in the logs. 👍

I manually filled out a 50 district PA project and exported it as a HAR file. From there, I used har-to-k6 converter to produce a k6 load test. Lastly, I created a few parameters to make the test more reusable:

- JWT_AUTH_TOKEN: A DistrictBuilder JWT authentication token
- DB_PROJECT_ID: A DistrictBuilder project UUID (PA; 50 districts)
- DB_DOMAIN: The DistrictBuilder instance domain where the project above resides
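
A sketch of how those three parameters might be read and validated before the test runs. The parameter names come from the list above; the `LoadTestParams` type and `readLoadTestParams` helper are hypothetical and are not the code in the actual load test.

```typescript
// Hypothetical container for the three load test parameters.
interface LoadTestParams {
  jwtAuthToken: string;
  projectId: string;
  domain: string;
}

// Reads the parameters from an environment-like object and fails fast,
// listing every parameter that is missing or empty.
function readLoadTestParams(
  env: Record<string, string | undefined>
): LoadTestParams {
  const required = ["JWT_AUTH_TOKEN", "DB_PROJECT_ID", "DB_DOMAIN"];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing load test parameters: ${missing.join(", ")}`);
  }
  return {
    jwtAuthToken: env["JWT_AUTH_TOKEN"] as string,
    projectId: env["DB_PROJECT_ID"] as string,
    domain: env["DB_DOMAIN"] as string,
  };
}
```

Inside a k6 script the argument would be k6's global `__ENV` object, with values supplied on the command line, for example `k6 run -e DB_DOMAIN=... script.js`.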

colekettler · 2020-09-28T15:55:19Z

@rbreslow Chiming in to say that this was really useful to read and that Adam Crowder article offers a great explanation. Good stuff! 🕵️

rbreslow self-assigned this Sep 24, 2020

rbreslow marked this pull request as ready for review September 24, 2020 18:22

rbreslow requested review from kshepard and hectcastro September 24, 2020 18:22

hectcastro approved these changes Sep 27, 2020

rbreslow and others added 2 commits September 28, 2020 09:53

Set keep-alive timeout higher than ALB idle timeout

644f6fa

Sets the keep-alive timeout to be higher than the ALB timeout so that the ALB is responsible for prompting the closure of individual TCP connections to the server.

rbreslow force-pushed the feature/jrb/adjust-timeouts branch from 4827e9a to eabff24 Compare September 28, 2020 13:54

rbreslow merged commit 1b8b1db into develop Sep 28, 2020

rbreslow deleted the feature/jrb/adjust-timeouts branch September 28, 2020 13:55

KlaasH mentioned this pull request Mar 6, 2023

More r6 instance types available. #1294

Closed
