Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault with different nodejs procs (I have coredump info) #47324

Closed
amit777 opened this issue Mar 30, 2023 · 17 comments
Closed

Segfault with different nodejs procs (I have coredump info) #47324

amit777 opened this issue Mar 30, 2023 · 17 comments

Comments

@amit777
Copy link

amit777 commented Mar 30, 2023

Version

v18.15.0

Platform

Linux server-host 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

I don't know how to reproduce. But this seems to happen on our busiest servers. The server has plenty of RAM available (80GB out of 128GB). And it has 96 CPUs.

How often does it reproduce? Is there a required condition?

It seems to happen when my server is under heavy traffic (although CPU and RAM usage are not high). It happens maybe once or twice per day. I cannot reproduce it readily.

What is the expected behavior? Why is that the expected behavior?

it should not seg fault

What do you see instead?

I opened the coredump with coredump info and coredumpctl debug. Here is some info. I got the backtrace as well. I'm happy to send this coredump if it's helpful. It's only about 10Mb.

coredumpctl info


           PID: 35768 (PM2 v5.3.0: God)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Thu 2023-03-30 14:35:27 EDT (2min 33s ago)
  Command Line: $'PM2 v5.3.0: God Daemon (/root/.pm2)' "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    Executable: /root/.nvm/versions/node/v18.15.0/bin/node
 Control Group: /user.slice/user-0.slice/session-7.scope
          Unit: session-7.scope
         Slice: user-0.slice
       Session: 7
     Owner UID: 0 (root)
       Boot ID: cf3aaa25cae342f8903aefa322ba0e8f
    Machine ID: 53ee02db24a24b008c218ca45204acc3
      Hostname: server-host
       Storage: /var/lib/systemd/coredump/core.PM2\x20v5\x2e3\x2e0:\x20God.0.cf3aaa25cae342f8903aefa322ba0e8f.35768.1680201327000000.zst (present)
     Disk Size: 10.5M
       Message: Process 35768 (PM2 v5.3.0: God) of user 0 dumped core.

                Found module /root/.nvm/versions/node/v18.15.0/bin/node with build-id: 417328b1953d476d486d4bfa7f9a79d6571d694c
                Found module linux-vdso.so.1 with build-id: 2dfff2e3b9908e870457adfd8ab8457adccee5ca
                Found module ld-linux-x86-64.so.2 with build-id: 61ef896a699bb1c2e4e231642b2e1688b2f1a61e
                Found module libc.so.6 with build-id: 69389d485a9793dbe873f0ea2c93e02efaa9aa3d
                Found module libpthread.so.0 with build-id: e62798b68557abb4bc5548aba2640cd5ab948f36
                Found module libgcc_s.so.1 with build-id: 09c4935b79388431a1248f6a98e00d7dc81b8513
                Found module libm.so.6 with build-id: 27e82301dba6c3f644404d504e1bb1c97894b433
                Found module libstdc++.so.6 with build-id: f57e02bfadacc0c923c82457d5e18e1830b5faea
                Found module libdl.so.2 with build-id: 952123c3bdce7c3370553717942b4878d1fad797
                Stack trace of thread 35768:
                #0  0x00007fb7447122a9 tcache_get (libc.so.6 + 0xa52a9)
                #1  0x00007fb744a519cc _Znwm (libstdc++.so.6 + 0xae9cc)
                #2  0x00000000010805c0 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0xc805c0)
                #3  0x0000000000d7a1a4 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x97a1a4)
                #4  0x0000000000b11fa3 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x711fa3)
                #5  0x0000000000c6f528 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x86f528)
                #6  0x0000000001676dba n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x1276dba)
                #7  0x0000000001677790 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x1277790)
                #8  0x000000000167d534 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x127d534)
                #9  0x000000000166b95e n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x126b95e)
                #10 0x0000000000abda2d n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x6bda2d)
                #11 0x0000000000bc1874 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x7c1874)
                #12 0x0000000000b36434 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x736434)
                #13 0x0000000000b3a02f n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x73a02f)
                #14 0x00007fb744696d90 __libc_start_call_main (libc.so.6 + 0x29d90)
                #15 0x00007fb744696e40 __libc_start_main_impl (libc.so.6 + 0x29e40)
                #16 0x0000000000aba37e n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x6ba37e)

coredump debug with backtrace

Type "apropos word" to search for commands related to "word"...
Reading symbols from /root/.nvm/versions/node/v18.15.0/bin/node...
[New LWP 35768]
[New LWP 35773]
[New LWP 35771]
[New LWP 35778]
[New LWP 35775]
[New LWP 35815]
[New LWP 35772]
[New LWP 35814]
[New LWP 35817]
[New LWP 35774]
[New LWP 35816]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `PM2 v5.3.0: God Daemon (/root/.pm2)                                      '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fb7447122a9 in tcache_get (tc_idx=<optimized out>) at ./malloc/malloc.c:3196
3196	./malloc/malloc.c: No such file or directory.
[Current thread is 1 (Thread 0x7fb74466a7c0 (LWP 35768))]
(gdb) bt
#0  0x00007fb7447122a9 in tcache_get (tc_idx=<optimized out>) at ./malloc/malloc.c:3196
#1  __GI___libc_malloc (bytes=64) at ./malloc/malloc.c:3313
#2  0x00007fb744a519cc in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00000000010805c0 in v8::internal::BackingStore::Allocate(v8::internal::Isolate*, unsigned long, v8::internal::SharedFlag, v8::internal::InitializedFlag) ()
#4  0x0000000000d7a1a4 in v8::ArrayBuffer::NewBackingStore(v8::Isolate*, unsigned long) ()
#5  0x0000000000b11fa3 in node::Environment::allocate_managed_buffer(unsigned long) ()
#6  0x0000000000c6f528 in node::LibuvStreamWrap::ReadStart()::{lambda(uv_handle_s*, unsigned long, uv_buf_t*)#1}::_FUN(uv_handle_s*, unsigned long, uv_buf_t*) ()
#7  0x0000000001676dba in uv__read (stream=stream@entry=0x7fb7403cc0c0) at ../deps/uv/src/unix/stream.c:1105
#8  0x0000000001677790 in uv__stream_io (loop=<optimized out>, w=0x7fb7403cc148, events=1) at ../deps/uv/src/unix/stream.c:1270
#9  0x000000000167d534 in uv__io_poll (loop=loop@entry=0x526dfc0 <default_loop_struct>, timeout=<optimized out>) at ../deps/uv/src/unix/epoll.c:374
#10 0x000000000166b95e in uv_run (loop=0x526dfc0 <default_loop_struct>, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:406
#11 0x0000000000abda2d in node::SpinEventLoop(node::Environment*) ()
#12 0x0000000000bc1874 in node::NodeMainInstance::Run() ()
#13 0x0000000000b36434 in node::LoadSnapshotDataAndRun(node::SnapshotData const**, node::InitializationResult const*) ()
#14 0x0000000000b3a02f in node::Start(int, char**) ()
#15 0x00007fb744696d90 in __libc_start_call_main (main=main@entry=0xab1f70 <main>, argc=argc@entry=2, argv=argv@entry=0x7ffe26fd3918) at ../sysdeps/nptl/libc_start_call_main.h:58
#16 0x00007fb744696e40 in __libc_start_main_impl (main=0xab1f70 <main>, argc=2, argv=0x7ffe26fd3918, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe26fd3908)
    at ../csu/libc-start.c:392
#17 0x0000000000aba37e in _start ()

Additional information

Runny Ubuntu 22.04 Jammy with latest patches.

@bnoordhuis
Copy link
Member

The crash happens inside tcache_get() and that's part of glibc's internal thread-local arena allocator. It could mean either a bug in glibc or memory corruption in node or V8 but it's impossible to tell which. Is the stack trace always the same?

@amit777
Copy link
Author

amit777 commented Mar 31, 2023

@bnoordhuis I just got another coredump on a different node process (not pm2). It's on one of regular processes which is not actually heavily loaded at all:

    Executable: /root/.nvm/versions/node/v18.15.0/bin/node
 Control Group: /user.slice/user-0.slice/session-11.scope
          Unit: session-11.scope
         Slice: user-0.slice
       Session: 11
     Owner UID: 0 (root)
       Boot ID: e0e1c4000e374eb3be79d2771dae8b38
    Machine ID: 53ee02db24a24b008c218ca45204acc3
      Hostname: server-host
       Storage: /var/lib/systemd/coredump/core.node\x20\x2fopt\x2fsgapp.0.e0e1c4000e374eb3be79d2771dae8b38.8503.1680275943000000.zst (present)
     Disk Size: 42.4M
       Message: Process 8503 (node /opt/sgapp) of user 0 dumped core.

                Found module /root/.nvm/versions/node/v18.15.0/bin/node with build-id: 417328b1953d476d486d4bfa7f9a79d6571d694c
                Found module linux-vdso.so.1 with build-id: 2dfff2e3b9908e870457adfd8ab8457adccee5ca
                Found module node.abi108.glibc.node with build-id: e0c58441f96b297d918c6c84667f1fc83949b122
                Found module cpufeatures.node with build-id: 0fad5b4196d1d888f98e51f54be13f9f35ce936f
                Found module sshcrypto.node with build-id: 87305d932cd530eb9b18ca92773bc93c7adeb1d9
                Found module magic.node with build-id: 43b5706feca8ad1ef606d761397aefaa31efa54e
                Found module libresolv.so.2 with build-id: 7fd7253c61aa6fce2b7e13851c15afa14a5ab160
                Found module libvips-cpp.so.42 with build-id: 0050c94cee88d6975b35f5851bb32b961df4ec2a
                Found module sharp-linux-x64.node with build-id: 989b35e5d2bfd8276f7836332c993e74ac0d4df7
                Found module librt.so.1 with build-id: 1030be4690f8aa93b63acd9180b3037d99a974bf
                Found module zstd.linux-x64-gnu.node without build-id.
                Found module gcstats.node with build-id: 84065b408a5197ce069818171da9349ce35d3e09
                Found module ld-linux-x86-64.so.2 with build-id: 61ef896a699bb1c2e4e231642b2e1688b2f1a61e
                Found module libc.so.6 with build-id: 69389d485a9793dbe873f0ea2c93e02efaa9aa3d
                Found module libpthread.so.0 with build-id: e62798b68557abb4bc5548aba2640cd5ab948f36
                Found module libgcc_s.so.1 with build-id: 09c4935b79388431a1248f6a98e00d7dc81b8513
                Found module libm.so.6 with build-id: 27e82301dba6c3f644404d504e1bb1c97894b433
                Found module libstdc++.so.6 with build-id: f57e02bfadacc0c923c82457d5e18e1830b5faea
                Found module libdl.so.2 with build-id: 952123c3bdce7c3370553717942b4878d1fad797
                Stack trace of thread 8503:
                #0  0x00007fda75c6ba7c __pthread_kill_implementation (libc.so.6 + 0x96a7c)
                #1  0x00007fda75c17476 __GI_raise (libc.so.6 + 0x42476)
                #2  0x00007fda75bfd885 __GI_abort (libc.so.6 + 0x28885)
                #3  0x00007fda75c5e6f6 __libc_message (libc.so.6 + 0x896f6)
                #4  0x00007fda75c75d7c malloc_printerr (libc.so.6 + 0xa0d7c)
                #5  0x00007fda75c7a45c tcache_get (libc.so.6 + 0xa545c)
                #6  0x00007fda75fb99cc _Znwm (libstdc++.so.6 + 0xae9cc)
                #7  0x0000000001072f9f n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0xc72f9f)
                #8  0x0000000000def651 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x9ef651)
                #9  0x0000000001707c79 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x1307c79)
                #10 0x0000000005c43ccc n/a (n/a + 0x0)
                #11 0x0000000005c0ad66 n/a (n/a + 0x0)
                #12 0x0000000005ba3098 n/a (n/a + 0x0)
                #13 0x0000000005bb49be n/a (n/a + 0x0)
                #14 0x000000000168a1dc n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x128a1dc)
                #15 0x0000000001689f03 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x1289f03)
                #16 0x0000000000ea98fe n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0xaa98fe)
                #17 0x0000000000eaaaef n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0xaaaaef)
                #18 0x0000000000d84e53 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x984e53)
                #19 0x0000000000aef055 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x6ef055)
                #20 0x0000000001667175 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x1267175)
                #21 0x000000000166b885 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x126b885)
                #22 0x0000000000abda2d n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x6bda2d)
                #23 0x0000000000bc1874 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x7c1874)
                #24 0x0000000000b36434 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x736434)
                #25 0x0000000000b3a02f n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x73a02f)
                #26 0x00007fda75bfed90 __libc_start_call_main (libc.so.6 + 0x29d90)
                #27 0x00007fda75bfee40 __libc_start_main_impl (libc.so.6 + 0x29e40)
                #28 0x0000000000aba37e n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x6ba37e)
Type "apropos word" to search for commands related to "word"...
Reading symbols from /root/.nvm/versions/node/v18.15.0/bin/node...
[New LWP 8503]
[New LWP 8531]
[New LWP 8530]
[New LWP 8540]
[New LWP 8527]
[New LWP 8543]
[New LWP 8541]
[New LWP 8528]
[New LWP 8544]
[New LWP 8542]
[New LWP 8529]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `node /opt/sgapp/mailapi/mailapi/combined.js                                   '.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140576254928832) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7fda75bd27c0 (LWP 8503))]
(gdb) btw
Undefined command: "btw".  Try "help".
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140576254928832) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140576254928832) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140576254928832, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fda75c17476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fda75bfd885 in __GI_abort () at ./stdlib/abort.c:100
#5  0x00007fda75c5e6f6 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fda75db0b8c "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007fda75c75d7c in malloc_printerr (str=str@entry=0x7fda75db3d40 "malloc(): unaligned tcache chunk detected") at ./malloc/malloc.c:5664
#7  0x00007fda75c7a45c in tcache_get (tc_idx=<optimized out>) at ./malloc/malloc.c:3195
#8  __GI___libc_malloc (bytes=4) at ./malloc/malloc.c:3313
#9  0x00007fda75fb99cc in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0000000001072f9f in v8::internal::DoubleToFixedCString(double, int) ()
#11 0x0000000000def651 in v8::internal::Builtin_NumberPrototypeToFixed(int, unsigned long*, v8::internal::Isolate*) ()
#12 0x0000000001707c79 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit ()
#13 0x0000000005c43ccc in ?? ()
#14 0x000023c93ad015b9 in ?? ()
#15 0x0000299c06d87cf9 in ?? ()
#16 0x0000000600000000 in ?? ()
#17 0x000023c93ad01689 in ?? ()
#18 0x00003fba10fbfa59 in ?? ()
#19 0x0000000200000000 in ?? ()
#20 0x0000000000000404 in ?? ()
#21 0x0000000000000404 in ?? ()
#22 0x0000000000000404 in ?? ()
#23 0x00003fba10fbfa49 in ?? ()
#24 0x0000040400000000 in ?? ()
#25 0x3fd11eecbfb15c00 in ?? ()
#26 0x0000000000000001 in ?? ()
#27 0x00003fba10fbfa11 in ?? ()
#28 0x00003fba10f90629 in ?? ()
#29 0x00003fba10f8e561 in ?? ()
#30 0x00000039ababf131 in ?? ()
#31 0x00000039ababef29 in ?? ()
#32 0x0000000000000001 in ?? ()
#33 0x0000178e88484511 in ?? ()
#34 0x00000039ababef29 in ?? ()
#35 0x00007fff6724e3d8 in ?? ()
#36 0x0000000005c0ad66 in ?? ()
#37 0x0000178e88483db9 in ?? ()
#38 0x00007fff6724e3a8 in ?? ()
#39 0x000000000168d3ef in Builtins_CallApiCallback ()
#40 0x00007fff6724e3b8 in ?? ()
#41 0x00007fff6724e3f0 in ?? ()
#42 0x0000000000000000 in ?? ()

@amit777
Copy link
Author

amit777 commented Mar 31, 2023

Also, I'm happy to run do whatever instrumentation is necessary to help isolate the issue. So if you would like me to run a debug version (or if it's easy to install a debug version with nvm), please let me know.

More information, the node process is started by pm2 in cluster mode.

@amit777
Copy link
Author

amit777 commented Mar 31, 2023

lscpu output incase it's useful

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  96
  On-line CPU(s) list:   0-95
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8260Y CPU @ 2.40GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  24
    Socket(s):           2
    Stepping:            7
    CPU max MHz:         3900.0000
    CPU min MHz:         1000.0000
    BogoMIPS:            4800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe
                         1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
                         ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c r
                         drand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shad
                         ow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx sma
                         p clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
                         dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   1.5 MiB (48 instances)
  L1i:                   1.5 MiB (48 instances)
  L2:                    48 MiB (48 instances)
  L3:                    71.5 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-23,48-71
  NUMA node1 CPU(s):     24-47,72-95
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; Enhanced IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Mitigation; TSX disabled

@amit777 amit777 changed the title Segfault with PM2 Daemon (I have coredump info) Segfault with different nodejs procs (I have coredump info) Mar 31, 2023
@amit777
Copy link
Author

amit777 commented Mar 31, 2023

Also, I wanted to note that (I think) I got this as well in Node 16.19.1. We recently updated to Node 18.15.0 because it's LTS and because we wanted to see if node 18 magically fixed the problem.

@bnoordhuis
Copy link
Member

You have a bunch of native add-ons loaded (the *.node files). Try excluding those because any one of them can be a source of memory corruption bugs. We don't accept bug reports where native add-ons are involved.

I didn't see them in your original report but the stack trace looked different there too. The second stack trace includes the "malloc(): unaligned tcache chunk detected" error message, which is highly indicative of a use-after-free or a buffer overrun/underrun.

One thing you could try is running node under valgrind but, caveat emptor, it'll be very slow. Building node from source with Address Sanitizer support (./configure --enable-asan && make -j32) is another option.

@amit777
Copy link
Author

amit777 commented Mar 31, 2023

Thank you for the pointers! I will explore those native addons.

would building with Address Sanitizer support have significant performance impact?

@bnoordhuis
Copy link
Member

It's usually on the order of 2-4x but it really depends on what you're doing.

@amit777
Copy link
Author

amit777 commented Mar 31, 2023

Tried building from source with --enable-asan but got the following error (I was able to successfully build without the --enable-asan flag):

I'm wondering if the build error may provide a clue or is it probably totally unrelated?

  cc -o ./node-v18.15.0/out/Release/obj.target/cares/deps/cares/src/lib/ares_getsock.o ../deps/cares/src/lib/ares_getsock.c '-DV8_DEPRECATION_WARNINGS' '-DV8_IMMINENT_DEPRECATION_WARNINGS' '-D_GLIBCXX_USE_CXX11_ABI=1' '-DNODE_OPENSSL_CONF_NAME=nodejs_conf' '-DNODE_OPENSSL_HAS_QUIC' '-DICU_NO_USER_DATA_OVERRIDE' '-D_DARWIN_USE_64_BIT_INODE=1' '-D_LARGEFILE_SOURCE' '-D_FILE_OFFSET_BITS=64' '-D_GNU_SOURCE' '-DLEAK_SANITIZER' '-DV8_USE_ADDRESS_SANITIZER' '-D__STDC_FORMAT_MACROS' '-DOPENSSL_NO_PINSHARED' '-DOPENSSL_THREADS' '-DCARES_STATICLIB' '-DHAVE_CONFIG_H' -I../deps/cares/include -I../deps/cares/config/linux  -fno-omit-frame-pointer -fsanitize=address -fsanitize-address-use-after-scope -pthread -Wall -Wextra -Wno-unused-parameter -m64 -g -pedantic -Wall -Wextra -Wno-unused-parameter --std=gnu89 -O3 -fno-omit-frame-pointer  -MMD -MF /root/node-v18.15.0/out/Release/.deps//root/node-v18.15.0/out/Release/obj.target/cares/deps/cares/src/lib/ares_getsock.o.d.raw   -c
make[1]: *** [tools/v8_gypfiles/v8_libplatform.target.mk:164: /root/node-v18.15.0/out/Release/obj.target/v8_libplatform/deps/v8/src/libplatform/task-queue.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: *** [tools/v8_gypfiles/v8_libplatform.target.mk:164: /root/node-v18.15.0/out/Release/obj.target/v8_libplatform/deps/v8/src/libplatform/default-worker-threads-task-runner.o] Error 1
In file included from ../deps/v8/src/libplatform/worker-thread.h:13,
                 from ../deps/v8/src/libplatform/worker-thread.cc:5:
../deps/v8/src/base/platform/platform.h: In static member function ‘static v8::base::Stack::StackSlot v8::base::Stack::GetRealStackAddressForSlot(v8::base::Stack::StackSlot)’:
../deps/v8/src/base/platform/platform.h:635:16: error: operands to ‘?:’ have different types ‘char*’ and ‘v8::base::Stack::StackSlot’
  634 |     return real_frame
      |            ~~~~~~~~~~
  635 |                ? (static_cast<char*>(real_frame) + kAsanRealFrameOffsetBytes)
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  636 |                : slot;
      |                ~~~~~~
../deps/v8/src/base/platform/platform.h:636:18: error: conversion from ‘v8::base::Stack::StackSlot’ to ‘char*’ is ambiguous
  636 |                : slot;
      |                  ^~~~
../deps/v8/src/base/platform/platform.h:608:5: note: candidate: ‘v8::base::Stack::StackSlot::operator void*() const’ (near match)
  608 |     operator void*() const { return reinterpret_cast<void*>(value); }
      |     ^~~~~~~~
../deps/v8/src/base/platform/platform.h:608:5: note:   no known conversion from ‘void*’ to ‘char*’
../deps/v8/src/base/platform/platform.h:609:5: note: candidate: ‘v8::base::Stack::StackSlot::operator uintptr_t() const’ (near match)
  609 |     operator uintptr_t() const { return value; }  // NOLINT
      |     ^~~~~~~~
../deps/v8/src/base/platform/platform.h:609:5: note:   no known conversion from ‘uintptr_t’ {aka ‘long unsigned int’} to ‘char*’
In file included from ../deps/v8/src/libplatform/tracing/trace-object.cc:8:
../deps/v8/src/base/platform/platform.h: In static member function ‘static v8::base::Stack::StackSlot v8::base::Stack::GetRealStackAddressForSlot(v8::base::Stack::StackSlot)’:
../deps/v8/src/base/platform/platform.h:635:16: error: operands to ‘?:’ have different types ‘char*’ and ‘v8::base::Stack::StackSlot’
  634 |     return real_frame
      |            ~~~~~~~~~~
  635 |                ? (static_cast<char*>(real_frame) + kAsanRealFrameOffsetBytes)
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  636 |                : slot;
      |                ~~~~~~
../deps/v8/src/base/platform/platform.h:636:18: error: conversion from ‘v8::base::Stack::StackSlot’ to ‘char*’ is ambiguous
  636 |                : slot;
      |                  ^~~~
../deps/v8/src/base/platform/platform.h:608:5: note: candidate: ‘v8::base::Stack::StackSlot::operator void*() const’ (near match)
  608 |     operator void*() const { return reinterpret_cast<void*>(value); }
      |     ^~~~~~~~
../deps/v8/src/base/platform/platform.h:608:5: note:   no known conversion from ‘void*’ to ‘char*’
../deps/v8/src/base/platform/platform.h:609:5: note: candidate: ‘v8::base::Stack::StackSlot::operator uintptr_t() const’ (near match)
  609 |     operator uintptr_t() const { return value; }  // NOLINT
      |     ^~~~~~~~
../deps/v8/src/base/platform/platform.h:609:5: note:   no known conversion from ‘uintptr_t’ {aka ‘long unsigned int’} to ‘char*’
make[1]: *** [tools/v8_gypfiles/v8_libplatform.target.mk:164: /root/node-v18.15.0/out/Release/obj.target/v8_libplatform/deps/v8/src/libplatform/worker-thread.o] Error 1
In file included from ../deps/v8/src/libplatform/tracing/trace-writer.cc:11:
../deps/v8/src/base/platform/platform.h: In static member function ‘static v8::base::Stack::StackSlot v8::base::Stack::GetRealStackAddressForSlot(v8::base::Stack::StackSlot)’:
../deps/v8/src/base/platform/platform.h:635:16: error: operands to ‘?:’ have different types ‘char*’ and ‘v8::base::Stack::StackSlot’
  634 |     return real_frame
      |            ~~~~~~~~~~
  635 |                ? (static_cast<char*>(real_frame) + kAsanRealFrameOffsetBytes)
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  636 |                : slot;
      |                ~~~~~~
../deps/v8/src/base/platform/platform.h:636:18: error: conversion from ‘v8::base::Stack::StackSlot’ to ‘char*’ is ambiguous
  636 |                : slot;
      |                  ^~~~
../deps/v8/src/base/platform/platform.h:608:5: note: candidate: ‘v8::base::Stack::StackSlot::operator void*() const’ (near match)
  608 |     operator void*() const { return reinterpret_cast<void*>(value); }
      |     ^~~~~~~~
../deps/v8/src/base/platform/platform.h:608:5: note:   no known conversion from ‘void*’ to ‘char*’
../deps/v8/src/base/platform/platform.h:609:5: note: candidate: ‘v8::base::Stack::StackSlot::operator uintptr_t() const’ (near match)
  609 |     operator uintptr_t() const { return value; }  // NOLINT
      |     ^~~~~~~~
../deps/v8/src/base/platform/platform.h:609:5: note:   no known conversion from ‘uintptr_t’ {aka ‘long unsigned int’} to ‘char*’
make[1]: *** [tools/v8_gypfiles/v8_libplatform.target.mk:164: /root/node-v18.15.0/out/Release/obj.target/v8_libplatform/deps/v8/src/libplatform/default-platform.o] Error 1
make[1]: *** [tools/v8_gypfiles/v8_libplatform.target.mk:164: /root/node-v18.15.0/out/Release/obj.target/v8_libplatform/deps/v8/src/libplatform/tracing/trace-writer.o] Error 1
make[1]: *** [tools/v8_gypfiles/v8_libplatform.target.mk:164: /root/node-v18.15.0/out/Release/obj.target/v8_libplatform/deps/v8/src/libplatform/tracing/trace-object.o] Error 1
../deps/icu-small/source/i18n/decNumber.cpp: In function ‘decNumber* uprv_decNumberLog10_72(decNumber*, const decNumber*, decContext*)’:
../deps/icu-small/source/i18n/decNumber.cpp:1528:14: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
 1528 |     w->lsu[1]=1; w->lsu[0]=0;           /* ..  */
      |     ~~~~~~~~~^~
In file included from ../deps/icu-small/source/i18n/decNumber.cpp:184:
../deps/icu-small/source/i18n/decNumber.h:89:19: note: at offset 1 into destination object ‘decNumber::lsu’ of size 1
   89 |     decNumberUnit lsu[DECNUMUNITS];
      |                   ^~~
../deps/icu-small/source/i18n/decNumber.cpp:1528:14: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
 1528 |     w->lsu[1]=1; w->lsu[0]=0;           /* ..  */
      |     ~~~~~~~~~^~
In file included from ../deps/icu-small/source/i18n/decNumber.cpp:184:
../deps/icu-small/source/i18n/decNumber.h:89:19: note: at offset 1 into destination object ‘decNumber::lsu’ of size 1
   89 |     decNumberUnit lsu[DECNUMUNITS];
      |                   ^~~
../deps/icu-small/source/i18n/decNumber.cpp: In function ‘decNumber* uprv_decNumberSquareRoot_72(decNumber*, const decNumber*, decContext*)’:
../deps/icu-small/source/i18n/decNumber.cpp:2974:31: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
 2974 |         t->lsu[0]=9; t->lsu[1]=5; t->lsu[2]=2;
      |                      ~~~~~~~~~^~
In file included from ../deps/icu-small/source/i18n/decNumber.cpp:184:
../deps/icu-small/source/i18n/decNumber.h:89:19: note: at offset 1 into destination object ‘decNumber::lsu’ of size 1
   89 |     decNumberUnit lsu[DECNUMUNITS];
      |                   ^~~
../deps/icu-small/source/i18n/decNumber.cpp:2974:44: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
 2974 |         t->lsu[0]=9; t->lsu[1]=5; t->lsu[2]=2;
      |                                   ~~~~~~~~~^~
In file included from ../deps/icu-small/source/i18n/decNumber.cpp:184:
../deps/icu-small/source/i18n/decNumber.h:89:19: note: at offset 2 into destination object ‘decNumber::lsu’ of size 1
   89 |     decNumberUnit lsu[DECNUMUNITS];
      |                   ^~~
rm ea8120b8ecf7d1828495478cfbfbab854b9f032f.intermediate
make: *** [Makefile:134: node] Error 2

@bnoordhuis
Copy link
Member

Welcome to the hell of trying to build modern C++. Check the requirements in BUILDING.md; you can't stray too far from them without getting errors like you're seeing.

@bnoordhuis
Copy link
Member

Ha, that particular error stirred a memory: #43370 (comment)

@amit777
Copy link
Author

amit777 commented Mar 31, 2023

No problem.. I'm actually just getting this entire server replaced. We've had too many issues with it. I'm going to close this ticket out. Thanks for taking a look so quickly though!

@amit777 amit777 closed this as completed Mar 31, 2023
@amit777
Copy link
Author

amit777 commented Apr 1, 2023

Hi @bnoordhuis , does it look like similar issue happened without any 3rd party native modules from this stack trace? This is the pm2 God daemon crash which I don't believe loads any 3rd party stuff. Also worth noting this crash happened on a completely different server than crashes from above.

         PID: 9054 (PM2 v5.3.0: God)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Sat 2023-04-01 16:33:55 EDT (5min ago)
  Command Line: $'PM2 v5.3.0: God Daemon (/root/.pm2)' "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    Executable: /root/.nvm/versions/node/v18.15.0/bin/node
 Control Group: /user.slice/user-0.slice/session-5.scope
          Unit: session-5.scope
         Slice: user-0.slice
       Session: 5
     Owner UID: 0 (root)
       Boot ID: 20f80993f1d940d88e8228d3da90a1ee
    Machine ID: 9fd846961653461f807eb3f2735a5bc8
      Hostname: server-host2
       Storage: /var/lib/systemd/coredump/core.PM2\x20v5\x2e3\x2e0:\x20God.0.20f80993f1d940d88e8228d3da90a1ee.9054.1680381235000000.zst (present)
     Disk Size: 12.4M
       Message: Process 9054 (PM2 v5.3.0: God) of user 0 dumped core.

                Found module /root/.nvm/versions/node/v18.15.0/bin/node with build-id: 417328b1953d476d486d4bfa7f9a79d6571d694c
                Found module linux-vdso.so.1 with build-id: 2dfff2e3b9908e870457adfd8ab8457adccee5ca
                Found module ld-linux-x86-64.so.2 with build-id: 61ef896a699bb1c2e4e231642b2e1688b2f1a61e
                Found module libc.so.6 with build-id: 69389d485a9793dbe873f0ea2c93e02efaa9aa3d
                Found module libpthread.so.0 with build-id: e62798b68557abb4bc5548aba2640cd5ab948f36
                Found module libgcc_s.so.1 with build-id: 09c4935b79388431a1248f6a98e00d7dc81b8513
                Found module libm.so.6 with build-id: 27e82301dba6c3f644404d504e1bb1c97894b433
                Found module libstdc++.so.6 with build-id: f57e02bfadacc0c923c82457d5e18e1830b5faea
                Found module libdl.so.2 with build-id: 952123c3bdce7c3370553717942b4878d1fad797
                Stack trace of thread 9054:
                #0  0x00007fb7fbe4c2a9 tcache_get (libc.so.6 + 0xa52a9)
                #1  0x00007fb7fc1899cc _Znwm (libstdc++.so.6 + 0xae9cc)
                #2  0x00000000010805c0 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0xc805c0)
                #3  0x0000000000d7a1a4 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x97a1a4)
                #4  0x0000000000b11fa3 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x711fa3)
                #5  0x0000000000c6f528 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x86f528)
                #6  0x0000000001676dba n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x1276dba)
                #7  0x0000000001677790 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x1277790)
                #8  0x000000000167d534 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x127d534)
                #9  0x000000000166b95e n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x126b95e)
                #10 0x0000000000abda2d n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x6bda2d)
                #11 0x0000000000bc1874 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x7c1874)
                #12 0x0000000000b36434 n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x736434)
                #13 0x0000000000b3a02f n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x73a02f)
                #14 0x00007fb7fbdd0d90 __libc_start_call_main (libc.so.6 + 0x29d90)
                #15 0x00007fb7fbdd0e40 __libc_start_main_impl (libc.so.6 + 0x29e40)
                #16 0x0000000000aba37e n/a (/root/.nvm/versions/node/v18.15.0/bin/node + 0x6ba37e)

GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /root/.nvm/versions/node/v18.15.0/bin/node...
[New LWP 9054]
[New LWP 9060]
[New LWP 9056]
[New LWP 9059]
[New LWP 9055]
[New LWP 9057]
[New LWP 9062]
[New LWP 9063]
[New LWP 9058]
[New LWP 9061]
[New LWP 9064]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `PM2 v5.3.0: God Daemon (/root/.pm2)                                      '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fb7fbe4c2a9 in tcache_get (tc_idx=<optimized out>) at ./malloc/malloc.c:3196
3196	./malloc/malloc.c: No such file or directory.
[Current thread is 1 (Thread 0x7fb7fbda27c0 (LWP 9054))]
(gdb) bt
#0  0x00007fb7fbe4c2a9 in tcache_get (tc_idx=<optimized out>) at ./malloc/malloc.c:3196
#1  __GI___libc_malloc (bytes=64) at ./malloc/malloc.c:3313
#2  0x00007fb7fc1899cc in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00000000010805c0 in v8::internal::BackingStore::Allocate(v8::internal::Isolate*, unsigned long, v8::internal::SharedFlag, v8::internal::InitializedFlag)
    ()
#4  0x0000000000d7a1a4 in v8::ArrayBuffer::NewBackingStore(v8::Isolate*, unsigned long) ()
#5  0x0000000000b11fa3 in node::Environment::allocate_managed_buffer(unsigned long) ()
#6  0x0000000000c6f528 in node::LibuvStreamWrap::ReadStart()::{lambda(uv_handle_s*, unsigned long, uv_buf_t*)#1}::_FUN(uv_handle_s*, unsigned long, uv_buf_t*)
    ()
#7  0x0000000001676dba in uv__read (stream=stream@entry=0x7fb7f93f95f0) at ../deps/uv/src/unix/stream.c:1105
#8  0x0000000001677790 in uv__stream_io (loop=<optimized out>, w=0x7fb7f93f9678, events=1) at ../deps/uv/src/unix/stream.c:1270
#9  0x000000000167d534 in uv__io_poll (loop=loop@entry=0x526dfc0 <default_loop_struct>, timeout=<optimized out>) at ../deps/uv/src/unix/epoll.c:374
#10 0x000000000166b95e in uv_run (loop=0x526dfc0 <default_loop_struct>, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:406
#11 0x0000000000abda2d in node::SpinEventLoop(node::Environment*) ()
#12 0x0000000000bc1874 in node::NodeMainInstance::Run() ()
#13 0x0000000000b36434 in node::LoadSnapshotDataAndRun(node::SnapshotData const**, node::InitializationResult const*) ()
#14 0x0000000000b3a02f in node::Start(int, char**) ()
#15 0x00007fb7fbdd0d90 in __libc_start_call_main (main=main@entry=0xab1f70 <main>, argc=argc@entry=2, argv=argv@entry=0x7ffcec991658)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#16 0x00007fb7fbdd0e40 in __libc_start_main_impl (main=0xab1f70 <main>, argc=2, argv=0x7ffcec991658, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7ffcec991648) at ../csu/libc-start.c:392
#17 0x0000000000aba37e in _start ()

@bnoordhuis
Copy link
Member

Looks like the same stack trace, yes. One other thing you can try is downgrading to ubuntu 20.04 and see if the crashes go away. That would suggest the bug is in glibc or maybe libstdc++.

FWIW, no one else has reported similar crashes so far. My hunch is it's something in your environment rather than in node itself.

@amit777
Copy link
Author

amit777 commented Apr 3, 2023

You are probably right.. I'm just confused because we doing nothing special in our env. In any case, we will probably be downgrading to Debian 11 from jammy.. But before we do that, I'm trying one last ditched effort to try and start the processes with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so pm2 ...

In the off chance it helps us to avoid a malloc() bug with tcache_get.

@amit777
Copy link
Author

amit777 commented Apr 3, 2023

One last note in case anyone comes via google search. So the jemalloc shim seemed to work for a bit, but I did notice periodic core dumps with it for various procs started by pm2 with the following stack trace.

Reading symbols from /root/.nvm/versions/node/v18.15.0/bin/node...
[New LWP 8503]
[New LWP 8531]
[New LWP 8530]
[New LWP 8540]
[New LWP 8527]
[New LWP 8543]
[New LWP 8541]
[New LWP 8528]
[New LWP 8544]
[New LWP 8542]
[New LWP 8529]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `node combined.js                                   '.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140576254928832) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7fda75bd27c0 (LWP 8503))]

I'm going to move to Debian 11..

@amit777
Copy link
Author

amit777 commented Apr 8, 2023

as a quick note, we rebuilt the node_modules folder and it seems to have fixed things. This is odd, because we were pretty sure we had already done that. In any case, hope that's helpful for anyone else who sees something like this. We did not downgrade to Debian 11..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants