Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set keep-alive timeout higher than ALB idle timeout #448

Merged
merged 2 commits into from
Sep 28, 2020

Conversation

rbreslow
Copy link
Contributor

@rbreslow rbreslow commented Sep 24, 2020

Overview

Keep-alive, when enabled, enables the load balancer to reuse back-end connections until the keep-alive timeout expires. To ensure that the load balancer is responsible for closing the connections to your instance, make sure that the value you set for the HTTP keep-alive time is greater than the idle timeout setting configured for your load balancer.

The default keep-alive timeout for the Node http.Server is 5 seconds. The default idle timeout for the ALB is 60 seconds.

The ALB is opening connections for reuse, the Node http.Server closes that connection, and then the ALB tries to communicate over the closed connection which triggers a 502.

HTTP 502: Bad gateway

Possible causes:
. . .
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. Check whether the keep-alive duration of the target is shorter than the idle timeout value of the load balancer.

We can see more evidence of this by looking at the ALB access logs with Athena:

Details
SELECT elb_status_code,
         request_processing_time,
         target_processing_time,
         response_processing_time,
         request_url
FROM stg_alb_logs
WHERE elb_status_code LIKE '5%';
Athena 2020-09-23 14-35-24

response_processing_time is set to -1 if the load balancer can't send the request to a target. This can happen if the target closes the connection before the idle timeout or if the client sends a malformed request.

This PR sets the keep-alive timeout to be higher than the ALB timeout so that the ALB is responsible for prompting the closure of individual TCP connections to the server.

See:

Resolves #428

Checklist

  • Description of PR is in an appropriate section of CHANGELOG.md and grouped with similar changes, if possible

Testing Instructions

I'm not sure how to replicate the failures we saw earlier. I've tried exercising the staging application and wasn't able to generate any 502s today before or after applying the fix (see the Jenkins deploy). Additionally, the prd_alb_logs table in Athena does not contain any 502s.

I think the evidence makes it clear we should apply this either way, and we can be on the look out for more 502s as we're able to gather more data.

@rbreslow rbreslow self-assigned this Sep 24, 2020
Comment on lines 28 to 30
// Ensure the headersTimeout is set higher than the keepAliveTimeout due to
// this nodejs regression bug: https://github.com/nodejs/node/issues/27363
server.headersTimeout = 66000;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was resolved and back ported to Node v12 in nodejs/node#34131. However, it doesn't look like the fix will be accessible until Node v12.18.5 drops (https://github.com/nodejs/node/commits/v12.x. vs https://github.com/nodejs/node/commits/v12.x-staging).

@rbreslow rbreslow marked this pull request as ready for review September 24, 2020 18:22
Copy link
Contributor

@hectcastro hectcastro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The application server timeout changes appear to be working as an effective 502 deterrent. I added a simple k6 load test in 4827e9a and used it against an instance of the application with and without the changes.

When the timeout changes were not applied, 502s were present in the Athena ALB request logs. When the timeout changes were applied, 502s were not present in the logs. 👍

rbreslow and others added 2 commits September 28, 2020 09:53
Sets the keep-alive timeout to be higher than the ALB timeout so that
the ALB is responsible for prompting the closure of individual TCP connections to the server.
I manually filled out a 50 district PA project and exported it as a HAR
file. From there, I used har-to-k6 converter to produce a k6 load test.
Lastly, I created a few parameters to make the test more reusable:

- JWT_AUTH_TOKEN: A DistrictBuilder JWT authentication token
- DB_PROJECT_ID: A DistrictBuilder project UUID (PA; 50 districts)
- DB_DOMAIN: The DistrictBuilder instance domain where the project above
  resides
@rbreslow rbreslow merged commit 1b8b1db into develop Sep 28, 2020
@rbreslow rbreslow deleted the feature/jrb/adjust-timeouts branch September 28, 2020 13:55
@colekettler
Copy link
Contributor

@rbreslow Chiming in to say that this was really useful to read and that Adam Crowder article offers a great explanation. Good stuff! 🕵️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resolve 502 errors generated by ELB
3 participants