This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Possible memory leak in WP entrypoint binary #1192

Closed
kfh opened this issue Mar 12, 2021 · 7 comments
Labels
bug (Something isn't working), core

Comments

@kfh

kfh commented Mar 12, 2021

WP server: 0.2.3
WP client: 0.2.3

I've chosen my wording carefully because we haven't done any form of isolated profiling on the entrypoint; instead I'm basing this on our observations over time as heavy users of WP in general, with and without the entrypoint injected into our containers.

This should be seen in connection with the memory-draining issue reported here. Using the soft limit only delays the memory draining; eventually the container will be killed.

This is a slow burner: even using the smallest possible instance type on Fargate (512 MB), it takes some days before memory is drained completely. But the pattern is clear: even when the deployed apps are only idling, there is a constant rise in memory usage. As mentioned in the linked issue, we have profiled our apps extensively with AWS CodeGuru to make sure we are not the source of this issue. The attached image illustrates our observations:

[attached image: entrypoint memory usage]

When we opt out of injecting the entrypoint binary into our containers, we have not seen any memory-draining issues with our WP-enabled applications.

@krantzinator added the bug and core labels Mar 12, 2021
@mitchellh
Contributor

While we've been very careful to try to avoid something like this, I believe it. We'll have to do some investigation into what's going on here. But can I ask: are you using any features such as app config, exec, logs access, etc.?

@mitchellh
Contributor

I've been running a rather unscientific test in the background as I go through my day today. I'll update this post with any findings. For this test, I deployed into Docker locally, since we're testing for an entrypoint binary leak that is hopefully reproducible regardless of platform, and Docker is easy for me to set up for testing.

  • Idle deployment for 2 hours: no memory usage change at all
    • Unlikely we have a memory leak in this scenario
    • Steady state container RAM for me at 30 MB
  • Refresh the service URL every second. I did this for another 2 hours.
    • Container RAM seems to go from 60 MB to up to 120 MB or so before going back down
    • Assuming this is mostly GC.

So far I'm not seeing any leaks. That doesn't mean they don't exist, of course; it's just that maybe they're not obvious to trigger yet. I'll keep my refresh loop running for the day and see what happens...
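For anyone who wants to run a similar soak test, a minimal sketch of this kind of refresh loop in Go might look like the following (the URL is a placeholder for the deployment's service URL, not the exact loop used above):

```go
// refreshloop.go: hit the deployment's service URL once per second to generate
// steady traffic while watching container memory (e.g. with `docker stats`).
// The URL below is a placeholder.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	const url = "http://localhost:8080/" // placeholder: your deployment's URL

	client := &http.Client{Timeout: 5 * time.Second}
	for {
		resp, err := client.Get(url)
		if err != nil {
			log.Printf("request failed: %v", err)
		} else {
			// Drain and close the body so the connection can be reused.
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
			log.Printf("status: %s", resp.Status)
		}
		time.Sleep(1 * time.Second)
	}
}
```

Container RAM can be tracked alongside the loop with `docker stats` to see whether the steady state drifts upward.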

@mitchellh
Contributor

mitchellh commented Mar 12, 2021

@kfh If I added a signal handler (i.e. SIGUSR1 or something) so that the entrypoint did a heap profile dump, would you be able to send that signal and get us the heap dump? That will make finding the leak a lot easier. Note the profile should not contain memory values, so it should be sanitized, but to be safe you can send it privately if you'd like.
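For context, a minimal sketch of what such a handler could look like in Go (the SIGUSR1 choice follows the comment above; the output path and the wiring into the entrypoint are illustrative assumptions, not the actual Waypoint code):

```go
// Sketch of a SIGUSR1-triggered heap profile dump (Unix-only signal). The
// output path is illustrative; this is not the actual entrypoint code.
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime"
	"runtime/pprof"
	"syscall"
)

// handleHeapDump writes a heap profile to disk each time SIGUSR1 is received.
func handleHeapDump() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		for range ch {
			f, err := os.Create("/tmp/entrypoint-heap.pprof") // illustrative path
			if err != nil {
				log.Printf("heap dump: %v", err)
				continue
			}
			runtime.GC() // heap profile data is as of the last GC, so force one first
			if err := pprof.WriteHeapProfile(f); err != nil {
				log.Printf("heap dump: %v", err)
			}
			f.Close()
			log.Printf("wrote heap profile to %s", f.Name())
		}
	}()
}

func main() {
	handleHeapDump()
	select {} // stand-in for the entrypoint's real work
}
```

The resulting file can then be inspected with `go tool pprof` to see which allocation sites are growing.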

@kfh
Author

kfh commented Mar 15, 2021

While we've been very careful to try to avoid something like this, I believe it. We'll have to do some investigation into what's going on here. But can I ask: are you using any features such as app config, exec, logs access, etc.?

Our waypoint configurations are fairly small. We are also not using much of the functionality provided by the entrypoint anymore. For logs we have to use AWS CloudWatch, since we need more advanced features, and that works out of the box. The only feature we miss when not injecting the entrypoint is the ability to exec directly into a shell on the running Fargate container, something that isn't supported by default on Fargate, though there are workarounds.

@mitchellh
Contributor

We ran a deployment with my PR above over the weekend and dumped the profile today. We spent some time thinking about it and we think we found the source of the leak. The good news is we think we can solve this server-side (a TCP connection on the server isn't being properly closed out, which is causing the clients to hold on). We're also going to investigate making the client more robust so we can just force-RST the connection when we're done.

Will update you soon!
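For what it's worth, the usual way to force an RST on close in Go is to set SO_LINGER to zero on the TCP connection before closing it. A minimal sketch of that mechanism (this is the standard net package approach, not necessarily the change Waypoint ended up making; the local listener is only there to make the sketch self-contained):

```go
// Sketch: close a TCP connection with SO_LINGER set to zero so the OS sends an
// RST instead of going through the normal FIN handshake, which tears down the
// peer's connection-table entry immediately.
package main

import (
	"log"
	"net"
)

// closeWithRST resets the connection on close rather than closing it gracefully.
func closeWithRST(conn net.Conn) error {
	if tcp, ok := conn.(*net.TCPConn); ok {
		// SetLinger(0): discard unsent data and send an RST on Close.
		if err := tcp.SetLinger(0); err != nil {
			return err
		}
	}
	return conn.Close()
}

func main() {
	// Local listener so the sketch runs without any external dependencies.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		// Block until the peer resets the connection.
		buf := make([]byte, 1)
		c.Read(buf)
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	if err := closeWithRST(conn); err != nil {
		log.Fatal(err)
	}
	log.Println("connection closed with RST")
}
```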

mitchellh added a commit that referenced this issue Mar 16, 2021
This brings in a partial fix for #1192 by adding a close timeout to open
Yamux streams to release the connection from the connection table. There
is another change coming in to fix this more directly, but this adds a
failsafe.
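A sketch of what that failsafe looks like in terms of hashicorp/yamux's configuration (the session wiring and the timeout value here are illustrative assumptions, not the actual Waypoint patch):

```go
// Sketch: give yamux streams a close timeout so a stream stuck half-closed is
// forcibly torn down and released from the session's stream table, instead of
// lingering forever. Not the actual Waypoint change; the value is illustrative.
package main

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// muxConfig bounds how long a stream may stay half-closed after Close is
// called before it is forcibly closed.
func muxConfig() *yamux.Config {
	cfg := yamux.DefaultConfig()
	cfg.StreamCloseTimeout = 5 * time.Minute // illustrative timeout
	return cfg
}

func main() {
	// In-memory connection pair so the sketch runs without a real network.
	serverConn, clientConn := net.Pipe()

	server, err := yamux.Server(serverConn, muxConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer server.Close()

	client, err := yamux.Client(clientConn, muxConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Open a stream and close it; with StreamCloseTimeout set, a peer that
	// never completes the close handshake cannot pin the stream indefinitely.
	stream, err := client.OpenStream()
	if err != nil {
		log.Fatal(err)
	}
	if err := stream.Close(); err != nil {
		log.Printf("close: %v", err)
	}
	log.Println("stream opened and closed")
}
```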
@mitchellh
Contributor

mitchellh commented Mar 16, 2021

The fixes are coming in. These are more fixes than strictly necessary, but together they make both the client and server sides robust enough that this source of memory leak should never happen again.

After merging these in, we'll probably want to run another test for a couple of days to completely validate this, then cut a release.

@mitchellh
Contributor

This is fixed and verified over the past 24 hours! We're going to cut a new 0.2.x release soon and that will have the fix in it.
