All NodeJS apps on my server are leaking memory #947
Do the applications actually crash with an out-of-memory error? If so, what's the exact message it prints? |
@bnoordhuis, thanks for the quick response. |
@bnoordhuis, I canceled the memory-limit restart. The nine app instances ran for an hour; RSS grew to 1.8-2.0G but none of them crashed. At that point performance had deteriorated too much and I had to restart. |
These two observations do not connect well, IMO. Can you double-confirm the heap statistics by collecting one soon after the restart? |
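For anyone wanting to collect those numbers, a minimal sketch (not from the thread) that periodically logs RSS next to V8's heap statistics, so growth in the process footprint can be compared against what the JS heap actually holds:

```js
const v8 = require('v8');

// Log RSS and V8 heap numbers every 30 s so the two can be
// compared over time. All APIs here are built into Node.
setInterval(() => {
  const { rss } = process.memoryUsage();
  const { total_heap_size, used_heap_size } = v8.getHeapStatistics();
  console.log(
    `rss=${(rss / 1e6).toFixed(0)}MB ` +
    `heap_total=${(total_heap_size / 1e6).toFixed(0)}MB ` +
    `heap_used=${(used_heap_size / 1e6).toFixed(0)}MB`
  );
}, 30000).unref(); // unref() so the timer doesn't keep the process alive
```

If RSS climbs while `heap_used` stays flat, the growth is outside the JS heap (native allocations, fragmentation) rather than a JS-level leak.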
Is 9 * 2 GB > physical RAM? If yes, start node with a lower heap limit (e.g. `--max_old_space_size`). If the total footprint of the node processes is bigger than physical memory, the machine starts swapping and performance falls off a cliff. |
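A rough way to sanity-check that arithmetic from inside one of the processes (a sketch; the instance count and the assumption that all instances are about the same size are mine, not from the thread):

```js
const os = require('os');

// Rough swap-risk estimate: this process's RSS scaled by the number
// of instances, compared against physical RAM. INSTANCES is illustrative.
const INSTANCES = 9;
const rss = process.memoryUsage().rss;
const total = os.totalmem();
console.log(
  `estimated footprint: ${(INSTANCES * rss / 1e9).toFixed(1)} GB ` +
  `of ${(total / 1e9).toFixed(1)} GB physical RAM`
);
```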
@bnoordhuis, the server had about 5G left just before restarting the app and was not swapping. @gireeshpunathil, Scavenge is not flat and neither is Mark-Sweep; the latter only looked flat because of the scale. Here's how it looks when I remove Scavenge from the plot: |
@gerenrot - thanks. I did not mean flat; instead I meant to say they … @bnoordhuis - what is your interpretation of the chart data, especially around the memory growth and the GC effort? |
The graphs don't look out of the ordinary. A footprint that grows over time does not necessarily indicate a memory leak, and a leak seems unlikely since none of the processes abort. |
The CPU has 10 cores and 20 threads. CPU usage is a bit below 50% before the restart and GC Scavenge time is approaching 250 seconds. Doesn't this mean the nine threads running the app are constantly garbage collecting? |
To clarify, the memory footprint doesn't look unusual, but are you saying scavenge times are in seconds, not milliseconds? That seems almost impossible. |
According to NewRelic, yes they are in seconds. |
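One way to cross-check NewRelic's numbers with Node's own instrumentation: a sketch using the built-in `perf_hooks` GC observer (available since Node 8.5, the line this thread is on; on newer Node versions the `kind` property has moved to `entry.detail`):

```js
const { PerformanceObserver, constants } = require('perf_hooks');

// Accumulate time spent in minor GC (scavenge); durations are
// reported in milliseconds, which makes seconds-long totals easy to spot.
let scavengeMs = 0;
const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.kind === constants.NODE_PERFORMANCE_GC_MINOR) {
      scavengeMs += entry.duration;
    }
  }
});
obs.observe({ entryTypes: ['gc'], buffered: true });

setInterval(
  () => console.log(`scavenge total: ${scavengeMs.toFixed(0)} ms`),
  10000
).unref();
```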
If live debugging is an option, what I'd do is start node with `--perf-basic-prof` and watch where it spends its time with `perf top`. Forcing a core dump with gcore and inspecting it with lldb+llnode is also a good option. |
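On the core-dump route, a hedged sketch of one way to trigger a dump on demand from inside the process (assuming core dumps are enabled with `ulimit -c unlimited`; the signal choice is arbitrary):

```js
// Send SIGUSR2 to make the process abort and leave a core file
// that lldb+llnode can inspect post-mortem.
process.on('SIGUSR2', () => {
  console.error(`aborting pid ${process.pid} for post-mortem analysis`);
  process.abort(); // raises SIGABRT; the kernel writes the core file
});
```

Then `kill -USR2 <pid>` and open the dump with something like `llnode /path/to/node -c ./core`. Unlike gcore, this kills the process, so it suits a disposable instance.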
I'll try that, thanks. |
@bnoordhuis, followed your advice about perf top. When CPU went from 10% to 35% (running on 9 of the 20 cores) I ran perf top:
It is scavenging too much. Since scavenging involves moving objects and updating pointers, could this be caused by deleting properties instead of setting them to `undefined`? |
@gerenrot Scavenges don't move objects around, that's the mark-compact phase. That weak list it's spending so much time on is the list of optimized functions. As a quick (well...) sanity check, does disabling the optimizer change anything? |
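On the delete-vs-undefined question above, a hedged illustration (my own, not from the thread) of why `delete` on hot objects is usually the bigger worry: V8 drops such objects into a slower dictionary representation.

```js
// Illustrative: both objects start out with the same hidden class.
const a = { x: 1, y: 2 };
const b = { x: 1, y: 2 };

delete a.x;      // forces `a` into slow dictionary (hash-map) mode
b.x = undefined; // `b` keeps its fast hidden class; the key still exists

// The observable difference: `delete` removes the key entirely,
// assignment keeps it enumerable with value undefined.
console.log('x' in a); // false
console.log('x' in b); // true
```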
The problem persists with the optimizer disabled: |
You get a different top 5 now, so it clearly did something. Those first two are JS functions. If you start node with `--perf-basic-prof`, perf should be able to resolve their names. |
I'm already running node with --perf-basic-prof and I can see JS code symbols in the map file. Not sure why some symbols are still missing. |
Found why I'm not getting the symbols. It's a file ownership issue. Will update later. |
Here's perf top -g with the optimizer disabled:
Expanded: |
Here's a flame graph with the optimizer enabled, taken before it starts spinning: |
Going by the number of samples, that's about 4 seconds? Nothing in there that really stands out, at a quick glance anyway. |
It's one node instance sampled for 60 seconds. Thanks for having a look. |
Ah, you might want to raise the sampling frequency to `-F 997`. |
Here are two more flame graphs, one with the optimizer on and the other with the optimizer turned off. I changed the sample frequency to `-F 997`; the sample duration is 60 seconds as before. optimizer-on.svg.gz @bnoordhuis, does anything stand out? |
Forgot to mention, I waited until the CPU increased from ~8% to ~35% before recording these graphs. |
I've stared at it for quite some time and the most striking difference is GC time: <1% vs. >30%. Maybe the optimized version runs faster but it fills up the new space 5x more often (~50x vs ~250x), which sets off the garbage collector. The optimized-function linked list is scanned for live objects on every scavenge, and it's seemingly really long in your application because the scan takes >20 ms every time (avg 66, std 96 - quite a few outliers). I think we've identified the cause now, but I don't know how to fix it yet. Can you check with `--trace_opt --trace_deopt`? (Redirect stdout because those options can log a lot of info.) |
> Run node with `--trace_opt --trace_deopt` for two minutes.
I can send you the log privately but would rather not post it here; I'm not sure the data it contains won't introduce a security risk. |
Does it reach a kind of steady state after a few minutes? It doesn't have to be a complete cessation of opt/deopt messages but it should quiet down after a while. cc @nodejs/v8 - see #947 (comment), the troublesome list is |
Isn't the list gone by now? https://v8project.blogspot.de/2017/10/lazy-unlinking.html?m=1 |
I had noticed that |
IIUC there were quite a few changes preceding it. I would not feel comfortable with backporting. |
@bnoordhuis after 7 minutes I get deoptimizations at a rate of 450 rpm. Is this normal? |
@gerenrot No, that's pretty high. Try getting rid of the most egregious offenders and see if performance improves; it most likely will. |
The most dominant offender is sequelize's Instance.set(key, value, options). I think what happens is that the function is optimized for `value` being a string. However, `value` is really polymorphic: it will occasionally receive something other than a string, which leads to deoptimization. How can I exclude this specific function from optimization? I couldn't find documentation on --turbo_filter. |
That's how Instance.set() deoptimization looks in the log: |
The logic is here if you need it. |
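To make the polymorphism concrete, a hedged sketch of the pattern being described — the function and call sites are illustrative only, not sequelize's actual code:

```js
// Illustrative only: a setter that V8 specializes for the argument
// types it has observed at runtime.
function set(key, value) {
  return `${key}=${value}`;
}

for (let i = 0; i < 1e6; i++) set('name', 'alice'); // hot and monomorphic: string
set('age', 42);        // a sudden number here can trigger a deoptimization,
set('meta', { a: 1 }); // as can an object
```

One common mitigation, if changing the library isn't an option, is to normalize the argument at the call site (e.g. always pass a string) so the hot function stays monomorphic.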
@bnoordhuis, can you refer me to tools or resources for optimization? Most articles I came across were outdated or didn't go deep enough to be useful in my case. |
There's https://github.com/nodejs/node/tree/v8.9.0/deps/v8/tools/turbolizer but I'm not the right person to ask how it works; I don't use it much. |
Is there any final conclusion on this? I am also getting the same issue in a node application: GC pause time is 8 s, and in the end node goes down after reaching 100%. |
Are you using https in nodejs? We encountered a very similar problem that went away entirely after we moved https processing to nginx. We have a simple test case where we continually post to a long-running route (~2 seconds on avg) with lots of IO. You can clearly see where we switched off https at around 12:05. |
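For reference, a minimal sketch of what the Node side looks like after offloading TLS to nginx — the port and the `X-Forwarded-Proto` header are assumptions about a conventional proxy setup, not details from the thread:

```js
const http = require('http');

// With nginx terminating TLS, the app serves plain HTTP on localhost.
// The original scheme arrives in X-Forwarded-Proto if the proxy is
// configured to forward it.
http.createServer((req, res) => {
  const scheme = req.headers['x-forwarded-proto'] || 'http';
  res.end(`hello over ${scheme}\n`);
}).listen(3000, '127.0.0.1');
```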
In my case this was due to setInterval functions: on every hit one more setInterval was registered, which kept driving up GC and CPU utilization. |
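A hedged sketch of that failure mode (not the poster's actual code): a timer registered per request and never cleared, so each interval pins its closure — and everything the closure references — forever.

```js
const http = require('http');

http.createServer((req, res) => {
  const bigBuffer = Buffer.alloc(1024 * 1024); // retained by the closure below
  setInterval(() => poll(bigBuffer), 1000);    // leak: never cleared
  res.end('ok\n');
}).listen(3000);

function poll(buf) { /* ... */ }

// Fix: keep the handle and clear it when the request is done, e.g.
//   const timer = setInterval(() => poll(bigBuffer), 1000);
//   res.on('finish', () => clearInterval(timer));
```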
I've been trying to solve a memory leak in my NodeJS app for a long time now. The strange thing is that I have several completely different NodeJS apps on my server and they all leak memory. Their RSS grows indefinitely over time.
My attempts to find the issue concentrate on the main app. Here is what I did:
At this point I'm out of ideas for what to do next.
Can any of you experts offer some good advice?