
After upgrade from v8 to v12.16.1 heap usage becomes erratic #32737

Closed

wdittmer-mp opened this issue Apr 9, 2020 · 8 comments

@wdittmer-mp

  • Version:
    v12.16.1 (but v12.13 on production)

  • Platform:
    MacOS Catalina 10.15.4 / Darwin 19.4.0 Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64 x86_64
    (but docker with node:12.13.0-alpine on production)

  • Subsystem:
    ? runtime, heap, garbage collection

Description:

We recently upgraded our production servers from docker containers with Node.js v8 to docker containers with Node.js v12.13 (node:12.13.0-alpine). At first everything seemed fine, but then we started noticing pod restarts because Kubernetes OOM-killed them. Since the upgrade, memory usage increases over time, sometimes in steep inclines, until it reaches ~500MB, at which point the pods are killed by Kubernetes. For example:
[Screenshot, 2020-04-07 09:24: pod memory usage over time]

In the image above you can see 'staircase' increases in pod memory usage; the lines are multiple instances of the same service. It turns out that our other services also show increasing memory usage, though the rate depends on the number of requests they serve.

What steps will reproduce the bug?

To find out what is going on I did the following:

  1. run the code of that service directly on the Mac, once with Node.js v12.16.1 and once with Node.js v13.12.0
  2. have 6 clients send a continuous stream of requests for 1 hour
  3. print out the memory usage every 10 seconds on the service (a sketch of this is shown after the list)
  4. create a graph of the output
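
For illustration only (this is a minimal sketch of steps 3 and 4, not the exact reporter used for the graphs below): the numbers can be collected with a small timer around process.memoryUsage() that prints CSV, which is then easy to graph.

// log-memory.js - minimal sketch of a 10-second memory reporter (illustrative, not the actual service code)
'use strict';

const toMB = (bytes) => (bytes / 1024 / 1024).toFixed(1);
const start = Date.now();

console.log('time_s,rss_mb,heapTotal_mb,heapUsed_mb,external_mb');
setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  const seconds = Math.round((Date.now() - start) / 1000);
  console.log([seconds, toMB(rss), toMB(heapTotal), toMB(heapUsed), toMB(external)].join(','));
}, 10 * 1000);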

v12.16.1:
[Graph: v12.16.1, 1 hour run - rss, heapTotal, heapUsed and external]

v13.12.0:
[Graph: v13.12.0, 1 hour run - rss, heapTotal, heapUsed and external]

Observations:
In v12.16.1, heapUsed is quite stable at first but then suddenly starts spiking even though the load remains the same.
In v13.12.0 I do not see similar behaviour; heapUsed remains quite stable and well below the 500MB mark.
I hope this (the problem showing in one Node.js version but not the other) demonstrates that it is not a leak or erratic behaviour in our own code.

How often does it reproduce? Is there a required condition?

In production we see it on all services upgraded to v12, and after each restart the pattern repeats. I can reproduce the graphs using the method described above.

What is the expected behavior?

A stable heapUsed like in v13, and no spikes in memory usage causing OOM kills.

What do you see instead?

In production we see the staircase behaviour; in the test we see huge spikes and an erratic heapUsed.

Additional information

I am seeking guidance on where to go from here. Since v12 is an LTS version, the behaviour above could affect more people, and v13 is not meant for production use.

Our service is quite complicated at the moment (grpc + redis + dynamodb + zipkin + metrics for prometheus), but I could try removing the dependencies one by one until I have something small that reproduces the issue.

On the other hand, someone might already recognise this pattern and first want to try a workaround/patch, request a heap dump, etc.

Please let me know how I can help further,

Kind regards,
Wilfred

@bnoordhuis
Member

memory usage increases over time, sometimes in steep inclines, until it reaches ~500MB, at which point the pods are killed by Kubernetes

I infer that you enforce some kind of hard memory limit? Does it keep running when you start node with --max-old-space-size=256 or --max-old-space-size=384?

The garbage collector can grow the JS heap at will (time/space trade-off) unless you clamp it. There is a hard-coded limit but it's way beyond 500 MB.
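
To confirm the flag was actually picked up, the configured limit can be logged at startup. A minimal, illustrative sketch (not something required for the test itself), run e.g. as `node --max-old-space-size=384 check-heap-limit.js`:

// check-heap-limit.js - illustrative sketch for verifying the configured V8 heap limit
'use strict';
const v8 = require('v8');

// heap_size_limit reflects the maximum size V8 will let the heap grow to.
const limitMB = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`V8 heap size limit: ${limitMB.toFixed(0)} MB`);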

@wdittmer-mp
Author

We are setting up a load-test environment to see if we can reproduce this and try out solutions.
We will definitely try setting --max-old-space-size and hope to see some results.
I am currently running the above scenario with v10, so that could at least provide a way out while still using an LTS version.
I will also re-run the v12 scenario with --max-old-space-size set to see how that affects it.

@wdittmer-mp
Author

I ran the scenario with v10 and it looks OK.
[Graph: v10.20.0 - rss, heapTotal, heapUsed and external]

Next I will run it with the --max-old-space-size setting and v12 to see the impact.

@wdittmer-mp
Author

Ran with --max-old-space-size=384 and v12.16.1:
[Graph: v12.16.1 with --max-old-space-size=384 - rss, heapTotal, heapUsed and external]

This looks good!

So, a careful conclusion:
in v12 the garbage collector appears to switch strategy (?), and because we did not limit max-old-space-size it was able to create these peaks.

It will be good to limit the memory of the node process (similar to Java containers).
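
As a side note, when pairing --max-old-space-size with a Kubernetes memory limit it can help to log both values at process start so they stay in sync. A minimal sketch, assuming a cgroup v1 environment such as the node:12 alpine images (adjust the path for cgroup v2); this is illustrative only, not code from this thread:

// limits.js - illustrative sketch: log the V8 heap limit next to the container memory limit
'use strict';
const fs = require('fs');
const v8 = require('v8');

const heapLimitMB = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;

let containerLimitMB = 'n/a';
try {
  // cgroup v1 path; cgroup v2 exposes /sys/fs/cgroup/memory.max instead
  const raw = fs.readFileSync('/sys/fs/cgroup/memory/memory.limit_in_bytes', 'utf8');
  containerLimitMB = (Number(raw.trim()) / 1024 / 1024).toFixed(0);
} catch (err) {
  // not running inside a cgroup v1 container, e.g. during the local test on the Mac
}

console.log(`V8 heap limit: ${heapLimitMB.toFixed(0)} MB, container memory limit: ${containerLimitMB} MB`);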

But I am still curious whether there is an explanation for why we see this behaviour in v12 and not in the other versions. Why does it decide to change tactics (if that is what it does) even though the load remains more or less the same the whole time?

I will leave the ticket open to confirm that this setting has the desired effect in the load-test environment, and/or in case someone else wants to know or try something else.

For now, thanks for the quick response, very much appreciated.
I hope to have a happy customer and a healthy production environment again soon.

@mmarchini
Contributor

But I am still curious whether there is an explanation for why we see this behaviour in v12 and not in the other versions. Why does it decide to change tactics (if that is what it does) even though the load remains more or less the same the whole time?

V8 might have changed how its garbage collector works. Between Node.js v8 and v12, V8 went through 14 major versions, with lots of changes. At least one GC change is that it now runs on a separate thread. If you want to see how GC behaviour changes over time you can use --trace-gc (not in production though); it will output a summary line every time the GC runs.
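
Besides --trace-gc, GC pauses can also be observed from inside the process via perf_hooks. A minimal sketch (illustrative, not something used in this thread; the "Scavenge"/"Mark-sweep" labels are an approximation of the minor/major GC kinds):

// gc-observe.js - illustrative sketch: log GC pause durations from within the process
'use strict';
const { PerformanceObserver, constants } = require('perf_hooks');

const kinds = {
  [constants.NODE_PERFORMANCE_GC_MAJOR]: 'major (roughly Mark-sweep)',
  [constants.NODE_PERFORMANCE_GC_MINOR]: 'minor (roughly Scavenge)',
  [constants.NODE_PERFORMANCE_GC_INCREMENTAL]: 'incremental marking',
  [constants.NODE_PERFORMANCE_GC_WEAKCB]: 'weak callbacks',
};

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Node 12 exposes the GC type as entry.kind; newer versions moved it to entry.detail.kind.
    const kind = entry.kind !== undefined ? entry.kind : entry.detail && entry.detail.kind;
    console.log(`GC ${kinds[kind] || kind}: ${entry.duration.toFixed(1)} ms`);
  }
});
obs.observe({ entryTypes: ['gc'] });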

@wdittmer-mp
Author

I used --trace-gc with v12.16.1, both with and without --max-old-space-size=384.
With the --max-old-space-size setting, I see about 3-5 Scavenges and then a Mark-sweep, in a regular pattern.
Without the setting, however, the Scavenges continue but the Mark-sweeps happen much less often.

E.g. this is at 1084810 ms:

[6218:0x10291f000] 1084810 ms: Mark-sweep 351.9 (356.0) -> 83.6 (129.3) MB, 1048.0 / 925.8 ms (+ 3.4 ms in 19 steps since start of marking, biggest step 1.1 ms, walltime since start of marking 1133 ms) (average mu = 0.990, current mu = 0.990) finalize incremental marking via task GC in old space requested
[6218:0x10291f000] 1085378 ms: Mark-sweep 132.7 (178.3) -> 128.4 (135.3) MB, 11.7 / 4.0 ms (+ 0.3 ms in 3 steps since start of marking, biggest step 0.2 ms, walltime since start of marking 215 ms) (average mu = 0.990, current mu = 0.979) finalize incremental marking via stack guard GC in old space requested

then the next one is at 1191919 ms (with a whole lot of Scavenges in between):

[6218:0x10291f000] 1191919 ms: Mark-sweep 329.8 (358.4) -> 84.6 (133.8) MB, 1065.5 / 919.5 ms (+ 4.4 ms in 42 steps since start of marking, biggest step 0.3 ms, walltime since start of marking 1185 ms) (average mu = 0.990, current mu = 0.990) finalize incremental marking via task GC in old space requested
[6218:0x10291f000] 1192540 ms: Mark-sweep 133.6 (182.8) -> 128.5 (146.3) MB, 14.4 / 3.0 ms (+ 0.9 ms in 3 steps since start of marking, biggest step 0.7 ms, walltime since start of marking 213 ms) (average mu = 0.990, current mu = 0.975) finalize incremental marking via stack guard GC in old space requested

So this took 107109 ms between sweeps.

If I look around the same time in the run with --max-old-space-size, I see:

[7549:0x10291f000] 1080256 ms: Mark-sweep 44.5 (74.4) -> 26.7 (64.3) MB, 42.7 / 30.7 ms (+ 1.7 ms in 14 steps since start of marking, biggest step 0.2 ms, walltime since start of marking 88 ms) (average mu = 0.990, current mu = 0.990) finalize incremental marking via task GC in old space requested

And the next sweep:

[7549:0x10291f000] 1085052 ms: Mark-sweep 44.4 (74.4) -> 26.5 (64.3) MB, 40.3 / 33.1 ms (+ 1.5 ms in 10 steps since start of marking, biggest step 0.4 ms, walltime since start of marking 71 ms) (average mu = 0.991, current mu = 0.991) finalize incremental marking via task GC in old space requested

Just 4796 ms in between.

Thanks, this reinforces that we should definitely use --max-old-space-size, also as protection against future changes to V8/GC.

@bnoordhuis
Member

I'll close this out but I can move it to nodejs/help if you have follow-up questions. Cheers.

@bnoordhuis added the invalid label on Apr 10, 2020
@wdittmer-mp
Author

Sure, thanks for your help.
