pummel/test-worker-take-heapsnapshot - crash in debug mode on arm #41204
I found this while working on a debug job for ARM64, as I'd like to move our debug testing over to our arm machines. As part of a Red Hat team day looking at what we can do to move the CI closer to green, failures in the debug builds seem to be the top cause of failure. Looking at the logs, it seems to be because the debug builds require a lot of memory and the x86 container hosts don't have enough. Our ARM container hosts have a […]. While creating the new job I came across what appears to be a persistent failure. Thinking we should mark it as flaky so that we can move over.
Also noting that it was on Ubuntu 20 in case that matters: https://ci.nodejs.org/job/node-test-commit-arm-debug-mdawson/6/console

Also seems to occur on Ubuntu 18.

Seems to fail on master, 17.x, and 16.x, but not on earlier streams.
- Mark test-worker-take-heapsnapshot as flaky on arm with debug Refs: #41204 Refs: #41209 Signed-off-by: Michael Dawson <[email protected]> PR-URL: #41253 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: James M Snell <[email protected]>
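For reference, marking a test flaky is done by adding an entry to the test status file. A hedged sketch of what the #41253 entry would look like (the exact condition line and file section are assumptions, not copied from the PR):

```text
# test/pummel/pummel.status (sketch; the condition below is assumed)
[$arch==arm64 && $mode==debug]
test-worker-take-heapsnapshot: PASS,FLAKY
```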
This is the stack trace from running the test under gdb:

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[New Thread 0xffffaf3cb110 (LWP 182370)]
[New Thread 0xffffaebca110 (LWP 182371)]
[New Thread 0xffffae3c9110 (LWP 182372)]
[New Thread 0xffffadbc8110 (LWP 182373)]
[New Thread 0xffffad3c7110 (LWP 182374)]
[New Thread 0xffffacbc6110 (LWP 182375)]
NOTE: The test started as a child_process using these flags: [ '--expose-internals' ] Use NODE_SKIP_FLAG_CHECK to run the test with the original flags.
(node:182376) internal/test/binding: These APIs are for internal testing only. Do not use them.
(Use `node --trace-warnings ...` to show where the warning was created)
Thread 1 "node_g" received signal SIGSEGV, Segmentation fault.
0x0000ffffaf3fe7e8 in kill () at ../sysdeps/unix/syscall-template.S:78
78      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x0000ffffaf3fe7e8 in kill () at ../sysdeps/unix/syscall-template.S:78
#1  0x0000aaaae0c2944c in uv_kill (pid=0, signum=11) at ../deps/uv/src/unix/process.c:537
#2  0x0000aaaadfd63984 in node::Kill (args=...) at ../src/node_process_methods.cc:164
#3  0x0000aaaadffea4c0 in v8::internal::FunctionCallbackArguments::Call (this=this@entry=0xffffd671dc50, handler=..., handler@entry=...) at ../deps/v8/src/api/api-arguments-inl.h:152
#4  0x0000aaaadffed788 in v8::internal::(anonymous namespace)::HandleApiCallHelper<false> (isolate=isolate@entry=0xaaaae639cdf0, function=function@entry=...,
    new_target=new_target@entry=..., fun_data=..., receiver=..., receiver@entry=..., args=...) at ../deps/v8/src/execution/arguments.h:81
#5  0x0000aaaadffeddf0 in v8::internal::Builtin_Impl_HandleApiCall (isolate=0xaaaae639cdf0, args=...) at ../deps/v8/src/handles/handles.h:133
#6  v8::internal::Builtin_HandleApiCall (args_length=7, args_object=0xffffd671ddb0, isolate=0xaaaae639cdf0) at ../deps/v8/src/builtins/builtins-api.cc:130
#7  0x0000aaaae0cc388c in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit () at ../../deps/v8/src/builtins/torque-internal.tq:84
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
```
This is the stack trace from loading the core file:

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/iojs/node/out/Debug/node --expose-internals /home/iojs/node/test/pummel/t'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  std::__atomic_base<long>::load (__m=std::memory_order_relaxed, this=0xffffffffffffffff) at /usr/include/c++/8/bits/atomic_base.h:400
400           load(memory_order __m = memory_order_seq_cst) const volatile noexcept
[Current thread is 1 (Thread 0xffff99ffb110 (LWP 182383))]
(gdb) bt
#0  std::__atomic_base<long>::load (__m=std::memory_order_relaxed, this=0xffffffffffffffff) at /usr/include/c++/8/bits/atomic_base.h:400
#1  std::atomic_load_explicit<long> (__m=std::memory_order_relaxed, __a=0xffffffffffffffff) at /usr/include/c++/8/atomic:1091
#2  v8::base::Relaxed_Load (ptr=0xffffffffffffffff) at ../deps/v8/src/base/atomicops.h:308
#3  v8::base::AsAtomicImpl<long>::Relaxed_Load<unsigned long> (addr=0xffffffffffffffff) at ../deps/v8/src/base/atomic-utils.h:79
#4  v8::internal::TaggedField<v8::internal::MapWord, 0>::Relaxed_Load_Map_Word (host=..., cage_base=...) at ../deps/v8/src/objects/tagged-field-inl.h:117
#5  v8::internal::HeapObject::map_word (this=<synthetic pointer>, cage_base=...) at ../deps/v8/src/objects/objects-inl.h:802
#6  v8::internal::HeapObject::map (this=<synthetic pointer>, cage_base=...) at ../deps/v8/src/objects/objects-inl.h:735
#7  v8::internal::HeapObject::IsInternalizedString (this=<synthetic pointer>, cage_base=...) at ../deps/v8/src/objects/instance-type-inl.h:79
#8  v8::internal::HeapObject::IsInternalizedString (this=<synthetic pointer>) at ../deps/v8/src/objects/instance-type-inl.h:79
#9  v8::internal::ScopeInfo::FunctionContextSlotIndex (this=this@entry=0xffff99ff3850, name=name@entry=...) at ../deps/v8/src/objects/scope-info.cc:976
#10 0x0000aaaaaf023c1c in v8::internal::V8HeapExplorer::ExtractContextReferences (this=this@entry=0xffff99ff3b60, entry=entry@entry=0xffff945ac610, context=...)
    at ../deps/v8/src/profiler/heap-snapshot-generator.cc:1042
#11 0x0000aaaaaf026784 in v8::internal::V8HeapExplorer::ExtractReferences (this=this@entry=0xffff99ff3b60, entry=entry@entry=0xffff945ac610, obj=...)
    at ../deps/v8/src/objects/heap-object.h:226
#12 0x0000aaaaaf026b30 in v8::internal::V8HeapExplorer::IterateAndExtractReferences (this=this@entry=0xffff99ff3b60, generator=generator@entry=0xffff99ff3b48)
    at ../deps/v8/src/profiler/heap-snapshot-generator.cc:1652
#13 0x0000aaaaaf02783c in v8::internal::HeapSnapshotGenerator::FillReferences (this=0xffff99ff3b48) at ../deps/v8/src/profiler/heap-snapshot-generator.cc:2293
#14 v8::internal::HeapSnapshotGenerator::GenerateSnapshot (this=this@entry=0xffff99ff3b48) at ../deps/v8/src/profiler/heap-snapshot-generator.cc:2257
#15 0x0000aaaaaf011bc4 in v8::internal::HeapProfiler::TakeSnapshot (this=0xffff940fbb60, control=0x0, resolver=0x0, treat_global_objects_as_roots=<optimized out>,
    capture_numeric_value=false) at ../deps/v8/src/profiler/heap-profiler.cc:90
#16 0x0000aaaaae6f7568 in node::worker::Worker::<lambda(node::Environment*)>::operator()(node::Environment *) const (__closure=0xaaaabcac18b8, worker_env=0xffff94018be0)
    at ../src/node_worker.cc:748
#17 0x0000aaaaae6fa8f0 in node::CallbackQueue<void, node::Environment*>::CallbackImpl<node::worker::Worker::TakeHeapSnapshot(const v8::FunctionCallbackInfo<v8::Value>&)::<lambda(node::Environment*)> >::Call(node::Environment *) (this=0xaaaabcac18a0, args#0=0xffff94018be0) at ../src/callback_queue-inl.h:90
#18 0x0000aaaaae4f5148 in node::Environment::RunAndClearInterrupts (this=0xffff94018be0) at ../src/env.cc:742
#19 0x0000aaaaae4f5610 in node::Environment::<lambda(v8::Isolate*, void*)>::operator()(v8::Isolate *, void *) const (__closure=0x0, isolate=0xffff94000ce0, data=0xffff9401ae40)
    at ../src/env.cc:831
#20 0x0000aaaaae4f566c in node::Environment::<lambda(v8::Isolate*, void*)>::_FUN(v8::Isolate *, void *) () at ../src/env.cc:832
#21 0x0000aaaaaeaea6d4 in v8::internal::Isolate::InvokeApiInterruptCallbacks (this=0xffff94000ce0) at ../deps/v8/src/execution/isolate.cc:1490
#22 0x0000aaaaaeb12510 in v8::internal::StackGuard::HandleInterrupts (this=this@entry=0xffff94000ce8) at ../deps/v8/src/execution/stack-guard.cc:325
#23 0x0000aaaaaf0d4144 in v8::internal::__RT_impl_Runtime_StackGuard (isolate=0xffff94000ce0, args=...) at ../deps/v8/src/execution/isolate-data.h:117
#24 v8::internal::Runtime_StackGuard (args_length=<optimized out>, args_object=<optimized out>, isolate=0xffff94000ce0) at ../deps/v8/src/runtime/runtime-internal.cc:309
#25 0x0000aaaaaf5fd74c in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit () at ../../deps/v8/src/builtins/torque-internal.tq:84
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
```
@miladfarca is the code in/around […] platform specific? We are only seeing the failure on ARM and are trying to figure out whether it's most likely a platform-specific problem in V8 or something that's gone wrong elsewhere.
I don't think it's platform specific, but it may be related to tagged fields or pointer compression / the pointer-compression cage and their usage on Arm64. The trace is showing the address passed to […] It could be a V8 problem related to CLs such as this: […]
@miladfarca thanks for taking a look. Do you think it would make sense to report this to V8?

Not a problem. Sure, it can be reported. Is Node currently enabling pointer compression / the cage on Arm64 debug builds?

@mhdawson This issue is also reproducible on other platforms, such as x64 and s390x, when Node is built from source (debug).

@miladfarca that's interesting; I'm wondering why we did not see it fail in the x86 builds. In any case, good to know. I guess we can try to figure out at what point it started to fail. Will kick off some runs.
Recreates on v16.0.0.

Recreates on v15.0.0.

Passes on v14.0.0.

The issue could be related to this PR, if it is V8 related: https://github.com/nodejs/node/pull/33579/files
Fails in v14.18.3, but @miladfarca I see you have it narrowed down further.

I see #35415 was in v15.0.0. Since there is a failure in v14.18.3, I may look at a few more v14 releases, as that might help narrow down which V8 commits are potentially related.

v14.10.0 seemed to crash; v14.5.0 seems to pass.

Sorry, I might have pasted the wrong link above; now updated. If it was due to upgrading to V8 8.4, it could be related to these commits: v8/v8@3b62751...819e184

It seems to pass with v14.6.0 and fail with v14.7.0.

This is the list of commits in the changelog for v14.7.0, minus the doc ones: [dd29889] - async_hooks: optimize fast-path promise hook for ALS (Andrey Pechkurov) #34512
Reverting this commit seems to make the test pass: [0aa3809] - (SEMVER-MINOR) worker: make MessagePort inherit from EventTarget (Anna Henningsen) #34057

Given the crash in #41204 (comment), my first guess would have been something like not keeping an object alive, resulting in a messed-up heap during the walk. From my look so far I don't see anything like that in 0aa3809. @addaleax, since you know the code better, if you have time to take a look that would be great.

I think the latest V8 update has fixed the issue on master. Maybe the revert of 0aa3809 just moved things around so that the failure did not occur, rather than being related. Will open a PR to remove the flaky marking on master. The issue still exists in the 15.x, 16.x, and 17.x lines, but we don't have the arm debug build in those CI configs.

PR to remove from the flaky list on the main branch: #41684
Recent V8 upgrade seems to have made this pass reliably now. Remove flaky entry Refs: #41204 Signed-off-by: Michael Dawson <[email protected]> PR-URL: #41684 Reviewed-By: Richard Lau <[email protected]> Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Rich Trott <[email protected]> Reviewed-By: Luigi Pinca <[email protected]> Reviewed-By: Darshan Sen <[email protected]> Reviewed-By: James M Snell <[email protected]>
Closing because I believe this has been fixed, but by all means reopen if I'm mistaken.
Version
head
Platform
arm64
Subsystem
workers
What steps will reproduce the bug?
Run the test pummel/test-worker-take-heapsnapshot as part of running the test suite in debug mode.
How often does it reproduce? Is there a required condition?
Seems to be 100% reproducible in debug mode on ARM64.
What is the expected behavior?
Tests pass.
What do you see instead?
```
12:13:06 not ok 3230 pummel/test-worker-take-heapsnapshot
12:13:07   ---
12:13:07   duration_ms: 0.815
12:13:07   severity: crashed
12:13:07   exitcode: -11
12:13:07   stack: |-
12:13:07     (node:758209) internal/test/binding: These APIs are for internal testing only. Do not use them.
12:13:07     (Use `node --trace-warnings ...` to show where the warning was created)
12:13:07   ...
```
Additional information
No response