
After upgrade from v8 to v12.16.1 heap usage becomes erratic #32737

Closed

wdittmer-mp opened this issue Apr 9, 2020 · 8 comments

@wdittmer-mp

  • Version:
    v12.16.1 (but v12.13 on production)

  • Platform:
    MacOS Catalina 10.15.4 / Darwin 19.4.0 Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64 x86_64
    (but docker with node:12.13.0-alpine on production)

  • Subsystem:
    ? runtime, heap, garbage collection

Description:

We recently upgraded our production servers from docker containers with Node.js v8 to docker containers with Node.js v12.13 (node:12.13.0-alpine). At first everything seemed fine, but then we started noticing pod restarts because Kubernetes OOM-killed them. Since the upgrade, memory usage increases over time, sometimes in steep inclines, until it reaches ~500MB, at which point the pods are killed by Kubernetes. For example:
[Screenshot, 2020-04-07 09:24: pod memory usage over time]

In the image above you can see 'staircase' increases in pod memory usage; the lines are multiple instances of the same service. It turns out that our other services also show increasing memory usage, though the rate depends on the number of requests they serve.

What steps will reproduce the bug?

To find out what is going on I did the following:

  1. run the code of that service directly on the Mac, once with Node.js v12.16.1 and once with Node.js v13.12.0
  2. have 6 clients send a continuous stream of requests for 1 hour
  3. print out the memory usage every 10 seconds on the service (a sketch of this is shown after the list)
  4. create a graph of the output
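
For illustration only (this is a minimal sketch of steps 3 and 4, not the exact reporter used for the graphs below): the numbers can be collected with a small timer around process.memoryUsage() that prints CSV, which is then easy to graph.

// log-memory.js - minimal sketch of a 10-second memory reporter (illustrative, not the actual service code)
'use strict';

const toMB = (bytes) => (bytes / 1024 / 1024).toFixed(1);
const start = Date.now();

console.log('time_s,rss_mb,heapTotal_mb,heapUsed_mb,external_mb');
setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  const seconds = Math.round((Date.now() - start) / 1000);
  console.log([seconds, toMB(rss), toMB(heapTotal), toMB(heapUsed), toMB(external)].join(','));
}, 10 * 1000);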

v12.16.1:
[Graph: v12.16.1, 1 hour run - rss, heapTotal, heapUsed and external]

v13.12.0:
[Graph: v13.12.0, 1 hour run - rss, heapTotal, heapUsed and external]

Observations:
In v12.16.1, heapUsed is quite stable at first but then suddenly starts spiking even though the load remains the same.
In v13.12.0 I do not see similar behaviour; heapUsed remains quite stable and well below the 500MB mark.
I hope this (the problem showing in one Node.js version but not the other) demonstrates that it is not a leak or erratic behaviour in our own code.

How often does it reproduce? Is there a required condition?

In production we see it on all services upgraded to v12, and after each restart the pattern repeats. I can reproduce the graphs using the method described above.

What is the expected behavior?

A stable heapUsed like in v13, and no spikes in memory usage causing OOM kills.

What do you see instead?

In production we see the staircase behaviour; in the test we see huge spikes and an erratic heapUsed.

Additional information

I am seeking guidance on where to go from here. Since v12 is an LTS version, the behaviour above could affect more people, and v13 is not meant for production use.

Our service is quite complicated at the moment (grpc + redis + dynamodb + zipkin + metrics for prometheus), but I could try removing the dependencies one by one until I have something small that reproduces the issue.

On the other hand, someone might already recognise this pattern and first want to try a workaround/patch, request a heap dump, etc.

Please let me know how I can help further,

Kind regards,
Wilfred

@bnoordhuis
Member

memory usage increases over time, sometimes in steep inclines, until it reaches ~500MB, at which point the pods are killed by Kubernetes

I infer that you enforce some kind of hard memory limit? Does it keep running when you start node with --max-old-space-size=256 or --max-old-space-size=384?

The garbage collector can grow the JS heap at will (time/space trade-off) unless you clamp it. There is a hard-coded limit but it's way beyond 500 MB.
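
To confirm the flag was actually picked up, the configured limit can be logged at startup. A minimal, illustrative sketch (not something required for the test itself), run e.g. as `node --max-old-space-size=384 check-heap-limit.js`:

// check-heap-limit.js - illustrative sketch for verifying the configured V8 heap limit
'use strict';
const v8 = require('v8');

// heap_size_limit reflects the maximum size V8 will let the heap grow to.
const limitMB = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`V8 heap size limit: ${limitMB.toFixed(0)} MB`);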

@wdittmer-mp
Author

We are setting up a load-test environment to see if we can reproduce this and try out solutions.
We will definitely try setting --max-old-space-size and hope to see some results.
I am currently running the above scenario with v10, so that could at least provide a way out while still using an LTS version.
I will also re-run the v12 scenario with --max-old-space-size set to see how that affects it.

@wdittmer-mp
Author

I ran the scenario with v10 and it looks OK.
[Graph: v10.20.0 - rss, heapTotal, heapUsed and external]

Next I will run it with the --max-old-space-size setting and v12 to see the impact.

@wdittmer-mp
Author

Ran with --max-old-space-size=384 and v12.16.1:
[Graph: v12.16.1 with --max-old-space-size=384 - rss, heapTotal, heapUsed and external]

This looks good!

So, a careful conclusion:
in v12 the garbage collector appears to switch strategy (?), and because we did not limit max-old-space-size it was able to create these peaks.

It will be good to limit the memory of the node process (similar to Java containers).
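
As a side note, when pairing --max-old-space-size with a Kubernetes memory limit it can help to log both values at process start so they stay in sync. A minimal sketch, assuming a cgroup v1 environment such as the node:12 alpine images (adjust the path for cgroup v2); this is illustrative only, not code from this thread:

// limits.js - illustrative sketch: log the V8 heap limit next to the container memory limit
'use strict';
const fs = require('fs');
const v8 = require('v8');

const heapLimitMB = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;

let containerLimitMB = 'n/a';
try {
  // cgroup v1 path; cgroup v2 exposes /sys/fs/cgroup/memory.max instead
  const raw = fs.readFileSync('/sys/fs/cgroup/memory/memory.limit_in_bytes', 'utf8');
  containerLimitMB = (Number(raw.trim()) / 1024 / 1024).toFixed(0);
} catch (err) {
  // not running inside a cgroup v1 container, e.g. during the local test on the Mac
}

console.log(`V8 heap limit: ${heapLimitMB.toFixed(0)} MB, container memory limit: ${containerLimitMB} MB`);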

But I am still curious whether there is an explanation for why we see this behaviour in v12 and not in the other versions. Why does it decide to change tactics (if that is what it does) even though the load remains more or less the same the whole time?

I will leave the ticket open to confirm that this setting has the desired effect in the load-test environment, and/or in case someone else wants to know or try something else.

For now, thanks for the quick response, very much appreciated.
I hope to have a happy customer and a healthy production environment again soon.

@mmarchini
Contributor

But I am still curious whether there is an explanation for why we see this behaviour in v12 and not in the other versions. Why does it decide to change tactics (if that is what it does) even though the load remains more or less the same the whole time?

V8 might have changed how its garbage collector works. Between Node.js v8 and v12, V8 went through 14 major versions, with lots of changes. At least one GC change is that it now runs on a separate thread. If you want to see how GC behaviour changes over time you can use --trace-gc (not in production though); it will output a summary line every time the GC runs.
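
Besides --trace-gc, GC pauses can also be observed from inside the process via perf_hooks. A minimal sketch (illustrative, not something used in this thread; the "Scavenge"/"Mark-sweep" labels are an approximation of the minor/major GC kinds):

// gc-observe.js - illustrative sketch: log GC pause durations from within the process
'use strict';
const { PerformanceObserver, constants } = require('perf_hooks');

const kinds = {
  [constants.NODE_PERFORMANCE_GC_MAJOR]: 'major (roughly Mark-sweep)',
  [constants.NODE_PERFORMANCE_GC_MINOR]: 'minor (roughly Scavenge)',
  [constants.NODE_PERFORMANCE_GC_INCREMENTAL]: 'incremental marking',
  [constants.NODE_PERFORMANCE_GC_WEAKCB]: 'weak callbacks',
};

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Node 12 exposes the GC type as entry.kind; newer versions moved it to entry.detail.kind.
    const kind = entry.kind !== undefined ? entry.kind : entry.detail && entry.detail.kind;
    console.log(`GC ${kinds[kind] || kind}: ${entry.duration.toFixed(1)} ms`);
  }
});
obs.observe({ entryTypes: ['gc'] });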

@wdittmer-mp
Author

I used --trace-gc with v12.16.1, both with and without --max-old-space-size=384.
With the --max-old-space-size setting, I see about 3-5 Scavenges and then a Mark-sweep, in a regular pattern.
Without the setting, however, the Scavenges continue but the Mark-sweeps happen much less often.

E.g. this is at 1084810 ms:

[6218:0x10291f000] 1084810 ms: Mark-sweep 351.9 (356.0) -> 83.6 (129.3) MB, 1048.0 / 925.8 ms (+ 3.4 ms in 19 steps since start of marking, biggest step 1.1 ms, walltime since start of marking 1133 ms) (average mu = 0.990, current mu = 0.990) finalize incremental marking via task GC in old space requested
[6218:0x10291f000] 1085378 ms: Mark-sweep 132.7 (178.3) -> 128.4 (135.3) MB, 11.7 / 4.0 ms (+ 0.3 ms in 3 steps since start of marking, biggest step 0.2 ms, walltime since start of marking 215 ms) (average mu = 0.990, current mu = 0.979) finalize incremental marking via stack guard GC in old space requested

then the next one is at 1191919 ms (with a whole lot of Scavenges in between):

[6218:0x10291f000] 1191919 ms: Mark-sweep 329.8 (358.4) -> 84.6 (133.8) MB, 1065.5 / 919.5 ms (+ 4.4 ms in 42 steps since start of marking, biggest step 0.3 ms, walltime since start of marking 1185 ms) (average mu = 0.990, current mu = 0.990) finalize incremental marking via task GC in old space requested
[6218:0x10291f000] 1192540 ms: Mark-sweep 133.6 (182.8) -> 128.5 (146.3) MB, 14.4 / 3.0 ms (+ 0.9 ms in 3 steps since start of marking, biggest step 0.7 ms, walltime since start of marking 213 ms) (average mu = 0.990, current mu = 0.975) finalize incremental marking via stack guard GC in old space requested

So this took 107109 ms between sweeps.

If I look around the same time in the run with --max-old-space-size, I see:

[7549:0x10291f000] 1080256 ms: Mark-sweep 44.5 (74.4) -> 26.7 (64.3) MB, 42.7 / 30.7 ms (+ 1.7 ms in 14 steps since start of marking, biggest step 0.2 ms, walltime since start of marking 88 ms) (average mu = 0.990, current mu = 0.990) finalize incremental marking via task GC in old space requested

And the next sweep:

[7549:0x10291f000] 1085052 ms: Mark-sweep 44.4 (74.4) -> 26.5 (64.3) MB, 40.3 / 33.1 ms (+ 1.5 ms in 10 steps since start of marking, biggest step 0.4 ms, walltime since start of marking 71 ms) (average mu = 0.991, current mu = 0.991) finalize incremental marking via task GC in old space requested

Just 4796 ms in between.

Thanks, this reinforces that we should definitely use --max-old-space-size, also as protection against future changes to V8/GC.

@bnoordhuis
Member

I'll close this out but I can move it to nodejs/help if you have follow-up questions. Cheers.

@bnoordhuis added the invalid label on Apr 10, 2020
@wdittmer-mp
Author

Sure, thanks for your help.
