This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Possible memory leak in WP entrypoint binary #1192

Closed
kfh opened this issue Mar 12, 2021 · 7 comments
Labels
bug (Something isn't working), core

Comments

@kfh

kfh commented Mar 12, 2021

WP server: 0.2.3
WP client: 0.2.3

I've chosen my wording carefully because we haven't done any form of isolated profiling on the entrypoint; instead I'm basing this on our observations over time as heavy users of WP in general, with and without the entrypoint injected into our containers.

This should be seen in connection with the memory-draining issue reported here. Using the soft limit only delays the memory draining; eventually the container will be killed.

This is a slow burner: even using the smallest possible instance type on Fargate (512 MB), it takes some days before memory is drained completely. But the pattern is clear: even when the deployed apps are only idling, there is a constant rise in memory usage. As mentioned in the linked issue, we have profiled our apps extensively with AWS CodeGuru to make sure we are not the source of this issue. The attached image illustrates our observations:

[attached image: entrypoint memory usage]

When we opt out of injecting the entrypoint binary into our containers, we have not seen any memory-draining issues with our WP-enabled applications.

@krantzinator added the bug and core labels Mar 12, 2021
@mitchellh
Contributor

While we've been very careful to try to avoid something like this, I believe it. We'll have to do some investigation into what's going on here. But can I ask: are you using any features such as app config, exec, logs access, etc.?

@mitchellh
Contributor

I've been running a rather unscientific test in the background as I go through my day today. I'll update this post with any findings. For this test, I deployed into Docker locally, since we're testing for an entrypoint binary leak that is hopefully reproducible regardless of platform, and Docker is easy for me to set up for testing.

  • Idle deployment for 2 hours: no memory usage change at all
    • Unlikely we have a memory leak in this scenario
    • Steady state container RAM for me at 30 MB
  • Refresh the service URL every second. I did this for another 2 hours.
    • Container RAM seems to go from 60 MB to up to 120 MB or so before going back down
    • Assuming this is mostly GC.

So far I'm not seeing any leaks. That doesn't mean they don't exist, of course; it's just that maybe they're not obvious to trigger yet. I'll keep my refresh loop running for the day and see what happens...
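For anyone who wants to run a similar soak test, a minimal sketch of this kind of refresh loop in Go might look like the following (the URL is a placeholder for the deployment's service URL, not the exact loop used above):

```go
// refreshloop.go: hit the deployment's service URL once per second to generate
// steady traffic while watching container memory (e.g. with `docker stats`).
// The URL below is a placeholder.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	const url = "http://localhost:8080/" // placeholder: your deployment's URL

	client := &http.Client{Timeout: 5 * time.Second}
	for {
		resp, err := client.Get(url)
		if err != nil {
			log.Printf("request failed: %v", err)
		} else {
			// Drain and close the body so the connection can be reused.
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
			log.Printf("status: %s", resp.Status)
		}
		time.Sleep(1 * time.Second)
	}
}
```

Container RAM can be tracked alongside the loop with `docker stats` to see whether the steady state drifts upward.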

@mitchellh
Contributor

mitchellh commented Mar 12, 2021

@kfh If I added a signal handler (i.e. SIGUSR1 or something) so that the entrypoint did a heap profile dump, would you be able to send that signal and get us the heap dump? That will make finding the leak a lot easier. Note the profile should not contain memory values, so it should be sanitized, but to be safe you can send it privately if you'd like.
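For context, a minimal sketch of what such a handler could look like in Go (the SIGUSR1 choice follows the comment above; the output path and the wiring into the entrypoint are illustrative assumptions, not the actual Waypoint code):

```go
// Sketch of a SIGUSR1-triggered heap profile dump (Unix-only signal). The
// output path is illustrative; this is not the actual entrypoint code.
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime"
	"runtime/pprof"
	"syscall"
)

// handleHeapDump writes a heap profile to disk each time SIGUSR1 is received.
func handleHeapDump() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		for range ch {
			f, err := os.Create("/tmp/entrypoint-heap.pprof") // illustrative path
			if err != nil {
				log.Printf("heap dump: %v", err)
				continue
			}
			runtime.GC() // heap profile data is as of the last GC, so force one first
			if err := pprof.WriteHeapProfile(f); err != nil {
				log.Printf("heap dump: %v", err)
			}
			f.Close()
			log.Printf("wrote heap profile to %s", f.Name())
		}
	}()
}

func main() {
	handleHeapDump()
	select {} // stand-in for the entrypoint's real work
}
```

The resulting file can then be inspected with `go tool pprof` to see which allocation sites are growing.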

@kfh
Author

kfh commented Mar 15, 2021

While we've been very careful to try to avoid something like this, I believe it. We'll have to do some investigation into what's going on here. But can I ask: are you using any features such as app config, exec, logs access, etc.?

Our waypoint configurations are fairly small. We are also not using much of the functionality provided by the entrypoint anymore. For logs we have to use AWS CloudWatch, since we need more advanced features, and that works out of the box. The only feature we miss when not injecting the entrypoint is the ability to exec directly into a shell on the running Fargate container, something that isn't supported by default on Fargate, though there are workarounds.

@mitchellh
Contributor

We ran a deployment with my PR above over the weekend and dumped the profile today. We spent some time thinking about it and we think we found the source of the leak. The good news is we think we can solve this server-side (a TCP connection on the server isn't being properly closed out, which is causing the clients to hold on). We're also going to investigate making the client more robust so we can just force-RST the connection when we're done.

Will update you soon!
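For what it's worth, the usual way to force an RST on close in Go is to set SO_LINGER to zero on the TCP connection before closing it. A minimal sketch of that mechanism (this is the standard net package approach, not necessarily the change Waypoint ended up making; the local listener is only there to make the sketch self-contained):

```go
// Sketch: close a TCP connection with SO_LINGER set to zero so the OS sends an
// RST instead of going through the normal FIN handshake, which tears down the
// peer's connection-table entry immediately.
package main

import (
	"log"
	"net"
)

// closeWithRST resets the connection on close rather than closing it gracefully.
func closeWithRST(conn net.Conn) error {
	if tcp, ok := conn.(*net.TCPConn); ok {
		// SetLinger(0): discard unsent data and send an RST on Close.
		if err := tcp.SetLinger(0); err != nil {
			return err
		}
	}
	return conn.Close()
}

func main() {
	// Local listener so the sketch runs without any external dependencies.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		// Block until the peer resets the connection.
		buf := make([]byte, 1)
		c.Read(buf)
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	if err := closeWithRST(conn); err != nil {
		log.Fatal(err)
	}
	log.Println("connection closed with RST")
}
```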

mitchellh added a commit that referenced this issue Mar 16, 2021
This brings in a partial fix for #1192 by adding a close timeout to open
Yamux streams to release the connection from the connection table. There
is another change coming in to fix this more directly, but this adds a
failsafe.
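A sketch of what that failsafe looks like in terms of hashicorp/yamux's configuration (the session wiring and the timeout value here are illustrative assumptions, not the actual Waypoint patch):

```go
// Sketch: give yamux streams a close timeout so a stream stuck half-closed is
// forcibly torn down and released from the session's stream table, instead of
// lingering forever. Not the actual Waypoint change; the value is illustrative.
package main

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// muxConfig bounds how long a stream may stay half-closed after Close is
// called before it is forcibly closed.
func muxConfig() *yamux.Config {
	cfg := yamux.DefaultConfig()
	cfg.StreamCloseTimeout = 5 * time.Minute // illustrative timeout
	return cfg
}

func main() {
	// In-memory connection pair so the sketch runs without a real network.
	serverConn, clientConn := net.Pipe()

	server, err := yamux.Server(serverConn, muxConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer server.Close()

	client, err := yamux.Client(clientConn, muxConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Open a stream and close it; with StreamCloseTimeout set, a peer that
	// never completes the close handshake cannot pin the stream indefinitely.
	stream, err := client.OpenStream()
	if err != nil {
		log.Fatal(err)
	}
	if err := stream.Close(); err != nil {
		log.Printf("close: %v", err)
	}
	log.Println("stream opened and closed")
}
```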
@mitchellh
Contributor

mitchellh commented Mar 16, 2021

The fixes are coming in. These are more fixes than strictly necessary, but together they make both the client and server sides robust enough that this source of memory leak should never happen again.

After merging these in, we'll probably want to run another test for a couple of days to completely validate this, then cut a release.

@mitchellh
Contributor

This is fixed and verified over the past 24 hours! We're going to cut a new 0.2.x release soon and that will have the fix in it.
