-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set keep-alive timeout higher than ALB idle timeout #448
Conversation
src/server/src/main.ts
Outdated
// Ensure the headersTimeout is set higher than the keepAliveTimeout due to | ||
// this nodejs regression bug: https://github.com/nodejs/node/issues/27363 | ||
server.headersTimeout = 66000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was resolved and back ported to Node v12 in nodejs/node#34131. However, it doesn't look like the fix will be accessible until Node v12.18.5 drops (https://github.com/nodejs/node/commits/v12.x. vs https://github.com/nodejs/node/commits/v12.x-staging).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The application server timeout changes appear to be working as an effective 502 deterrent. I added a simple k6
load test in 4827e9a and used it against an instance of the application with and without the changes.
When the timeout changes were not applied, 502s were present in the Athena ALB request logs. When the timeout changes were applied, 502s were not present in the logs. 👍
Sets the keep-alive timeout to be higher than the ALB timeout so that the ALB is responsible for prompting the closure of individual TCP connections to the server.
I manually filled out a 50 district PA project and exported it as a HAR file. From there, I used har-to-k6 converter to produce a k6 load test. Lastly, I created a few parameters to make the test more reusable: - JWT_AUTH_TOKEN: A DistrictBuilder JWT authentication token - DB_PROJECT_ID: A DistrictBuilder project UUID (PA; 50 districts) - DB_DOMAIN: The DistrictBuilder instance domain where the project above resides
4827e9a
to
eabff24
Compare
@rbreslow Chiming in to say that this was really useful to read and that Adam Crowder article offers a great explanation. Good stuff! 🕵️ |
Overview
The default keep-alive timeout for the Node http.Server is 5 seconds. The default idle timeout for the ALB is 60 seconds.
The ALB is opening connections for reuse, the Node http.Server closes that connection, and then the ALB tries to communicate over the closed connection which triggers a 502.
We can see more evidence of this by looking at the ALB access logs with Athena:
Details
This PR sets the keep-alive timeout to be higher than the ALB timeout so that the ALB is responsible for prompting the closure of individual TCP connections to the server.
See:
Resolves #428
Checklist
CHANGELOG.md
and grouped with similar changes, if possibleTesting Instructions
I'm not sure how to replicate the failures we saw earlier. I've tried exercising the staging application and wasn't able to generate any 502s today before or after applying the fix (see the Jenkins deploy). Additionally, the
prd_alb_logs
table in Athena does not contain any 502s.I think the evidence makes it clear we should apply this either way, and we can be on the look out for more 502s as we're able to gather more data.