pummel/test-heapdump-shadow-realm.js is flaky due to OOM #49572

Closed

joyeecheung opened this issue Sep 9, 2023 · 15 comments
Labels
flaky-test: Issues and PRs related to the tests with unstable failures on the CI.
linux: Issues and PRs related to the Linux platform.

Comments

joyeecheung (Member) commented Sep 9, 2023

Test

pummel/test-heapdump-shadow-realm.js

Platform

Linux x64

Console output

11:41:28 not ok 3651 pummel/test-heapdump-shadow-realm
11:41:28   ---
11:41:28   duration_ms: 72141.13100
11:41:28   severity: crashed
11:41:28   exitcode: -6
11:41:28   stack: |-
11:41:28     
11:41:28     <--- Last few GCs --->
11:41:28     0.[3846001:0x61f1c60]    46446 ms: Mark-Compact (reduce) 977.0 (993.2) -> 976.4 (993.7) MB, 1058.93 / 0.00 ms  (+ 78.7 ms in 13 steps since start of marking, biggest step 13.7 ms, walltime since start of marking 1165 ms) (average mu = 0.260, current mu = 0.[3846001:0x61f1c60]    47642 ms: Mark-Compact (reduce) 977.1 (993.7) -> 976.5 (993.9) MB, 892.02 / 0.00 ms  (+ 65.2 ms in 9 steps since start of marking, biggest step 16.2 ms, walltime since start of marking 978 ms) (average mu = 0.234, current mu = 0.200
11:41:28     
11:41:28     <--- JS stacktrace --->
11:41:28     
11:41:28     FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
11:41:28      1: 0xca2d40 node::Abort() [out/Release/node]
11:41:28      2: 0xb7f1ab  [out/Release/node]
11:41:28      3: 0xee4ff0 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [out/Release/node]
11:41:28      4: 0xee53ac v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [out/Release/node]
11:41:28      5: 0x10fa1e5  [out/Release/node]
11:41:28      6: 0x111095d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [out/Release/node]
11:41:28      7: 0x10e875a v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [out/Release/node]
11:41:28      8: 0x10e92f5 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [out/Release/node]
11:41:28      9: 0x10c72fe v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [out/Release/node]
11:41:28     10: 0x1516e42 v8::internal::Runtime_AllocateInOldGeneration(int, unsigned long*, v8::internal::Isolate*) [out/Release/node]
11:41:28     11: 0x1951ef6  [out/Release/node]
11:41:28   ...

Build links

nodejs/reliability#660

First CI: https://ci.nodejs.org/job/node-test-pull-request/53749/
Last CI: https://ci.nodejs.org/job/node-test-pull-request/53778/

Additional information

cc @legendecas. At first glance I think we could lower the number of shadow realms created in the test. But then the realms are supposed to be GC-able anyway... perhaps we should switch to setImmediate() instead, to give the GC some time to kick in?
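For illustration, here is a rough sketch of that idea (this is not the actual patch in #49573; the iteration count and structure are made up): drive the loop with setImmediate() so the event loop gets a chance to run GC and weak callbacks between realm creations.

```js
// Hypothetical sketch only: a setImmediate()-driven loop instead of a
// tight synchronous loop, giving the GC a chance to run between realms.
// Requires running node with --experimental-shadow-realm.
'use strict';

const kRealmCount = 100; // made-up number, not the test's actual value
let created = 0;

function createRealm() {
  const realm = new ShadowRealm();
  realm.evaluate('undefined'); // run a tiny script in the realm
  if (++created < kRealmCount) {
    setImmediate(createRealm); // yield to the event loop before the next one
  }
}

createRealm();
```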

joyeecheung added the flaky-test label on Sep 9, 2023
github-actions bot added the linux label on Sep 9, 2023
nodejs-github-bot pushed a commit that referenced this issue Sep 11, 2023
With a tight loop the GC may not have enough time to kick in.
Try setImmediate() instead.

PR-URL: #49573
Refs: #49572
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Chengzhong Wu <[email protected]>
legendecas (Member) commented Sep 11, 2023

I'm not aware of weak callbacks for ShadowRealm being deferred in a way that prevents them from being collected. Since this is related to the heapdump specifically (there is no reliability report on test/parallel/test-shadow-realm-gc.js), I'm wondering if it might be caused by insufficient memory for the heapdump operation.

joyeecheung (Member, Author) commented Sep 11, 2023

Normally a heapdump should not have more JS heap overhead than the size of a semi-space (because of promotion), plus a small margin for cached source positions (which can't be much for ShadowRealms that only run one tiny script). It could be, though, that because the weak callbacks are deferred, the realms are still recognized as reachable when the heap snapshot is taken and therefore still appear in it, which can bloat the snapshot a lot (and a huge heap snapshot can itself eat a lot of memory, because we serialize it into JS on the same thread for test verification). We can see whether the setImmediate() trick deflakes it; if it doesn't, consider adding some logging to the test to see what's causing the flake.
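As a rough illustration of the "serialize it into JS on the same thread" cost (this is not the test's actual verification helper, just a minimal sketch using the public v8.getHeapSnapshot() API):

```js
// Minimal sketch: taking a heap snapshot and parsing it back into JS on
// the same thread. If deferred weak callbacks keep many ShadowRealms
// reachable, they all end up in the snapshot, and both the buffered JSON
// string and the parsed object tree live on the same JS heap.
'use strict';
const v8 = require('node:v8');

async function heapSnapshotNodeCount() {
  let json = '';
  for await (const chunk of v8.getHeapSnapshot()) {
    json += chunk; // the whole snapshot is buffered as one JS string
  }
  const parsed = JSON.parse(json); // ...and then parsed into JS objects
  return parsed.nodes.length;
}

heapSnapshotNodeCount().then((n) => console.log('snapshot nodes:', n));
```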

joyeecheung (Member, Author) commented:

Looks like this disappeared from the CI. Closing for now. We can reopen if it reappears.

ruyadorno pushed a commit that referenced this issue Sep 28, 2023
With a tight loop the GC may not have enough time to kick in.
Try setImmediate() instead.

PR-URL: #49573
Refs: #49572
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Chengzhong Wu <[email protected]>
targos (Member) commented Sep 30, 2023

targos reopened this on Sep 30, 2023
targos (Member) commented Oct 2, 2023

joyeecheung (Member, Author) commented:

Oh my, this failed in 18 PRs across the last 100 CI runs. nodejs/reliability#681 looks like it started from 09-29.

joyeecheung (Member, Author) commented Oct 2, 2023

I am guessing something that landed around that time introduced a memory leak in the realms?

joyeecheung (Member, Author) commented Oct 3, 2023

Actually, it seems that this time around it's a different bug: STATUS_STACK_BUFFER_OVERRUN on Windows (or it could be another kind of assertion failure, see https://devblogs.microsoft.com/oldnewthing/20190108-00/?p=100655).

joyeecheung (Member, Author) commented:

The first STATUS_STACK_BUFFER_OVERRUN failure goes back to https://ci.nodejs.org/job/node-test-pull-request/54325/, which was rebased onto a4fdb1a.

joyeecheung (Member, Author) commented Oct 3, 2023

Stress test on ubuntu1804-64: https://ci.nodejs.org/job/node-stress-single-test/455/. However, the failures in the CI mostly come from win2016_vs2017-x64, and the stress tests do not run there.

targos (Member) commented Oct 5, 2023

The stress test passed, but it looks like it ran other tests, not pummel/test-heapdump-shadow-realm.

legendecas (Member) commented Oct 5, 2023

Another stress-test run on pummel/test-heapdump-shadow-realm: https://ci.nodejs.org/job/node-stress-single-test/456/parameters/

Edit: I do remember checking win2016-vs2017 as one of the run labels, but it turned out that only ubuntu1804-64 was checked.

targos (Member) commented Oct 7, 2023

No failures with ubuntu1804-64.

nodejs-github-bot pushed a commit that referenced this issue Oct 12, 2023
ShadowRealm garbage collection is covered in another test. Reduce the
number of repetitions in test-heapdump-shadowrealm.js to try to fix the
flakiness of the test.

PR-URL: #50104
Refs: #49572
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
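A sketch of the shape of that mitigation (the numbers here are illustrative only, not the values actually used in test-heapdump-shadowrealm.js):

```js
// Hypothetical sketch of the reduced-repetition approach: keep only a
// small, fixed number of realms alive for the heap-snapshot check, since
// realm GC itself is covered by test/parallel/test-shadow-realm-gc.js.
// Requires --experimental-shadow-realm.
'use strict';

const kRealmCount = 10; // made-up value, much smaller than a stress loop
const realms = [];

for (let i = 0; i < kRealmCount; i++) {
  const realm = new ShadowRealm();
  realm.evaluate('undefined');
  realms.push(realm); // keep the realms reachable so they appear in the snapshot
}

// ...the test would then take a heap snapshot and check for ShadowRealm nodes...
```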
alexfernandez pushed a commit to alexfernandez/node that referenced this issue Nov 1, 2023
With a tight loop the GC may not have enough time to kick in.
Try setImmediate() instead.

PR-URL: nodejs#49573
Refs: nodejs#49572
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Chengzhong Wu <[email protected]>
alexfernandez pushed a commit to alexfernandez/node that referenced this issue Nov 1, 2023
ShadowRealm garbage collection is covered in another test. Reduce the
number of repetitions in test-heapdump-shadowrealm.js to try to fix the
flakiness of the test.

PR-URL: nodejs#50104
Refs: nodejs#49572
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
targos pushed a commit that referenced this issue Nov 11, 2023
ShadowRealm garbage collection is covered in another test. Reduce the
number of repetitions in test-heapdump-shadowrealm.js to try to fix the
flakiness of the test.

PR-URL: #50104
Refs: #49572
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
lpinca (Member) commented Sep 11, 2024

There have been no new failures since December 2023, closing.

lpinca closed this as completed on Sep 11, 2024