Segfault with different nodejs procs (I have coredump info) #47324
Comments
The crash happens inside tcache_get(), which is part of glibc's internal thread-local arena allocator. It could mean either a bug in glibc or memory corruption in node or V8, but it's impossible to tell which. Is the stack trace always the same?
@bnoordhuis I just got another coredump on a different node process (not pm2). It's one of our regular processes, which is not actually heavily loaded at all:
(stack trace collapsed)
Also, I'm happy to do whatever instrumentation is necessary to help isolate the issue. So if you would like me to run a debug version (or if it's easy to install a debug version with nvm), please let me know. More information: the node process is started by pm2 in cluster mode.
Also, I wanted to note that (I think) I got this as well on Node 16.19.1. We recently updated to Node 18.15.0 because it's LTS and because we wanted to see if Node 18 magically fixed the problem.
You have a bunch of native add-ons loaded (the *.node files). Try excluding those, because any one of them can be a source of memory corruption bugs. We don't accept bug reports where native add-ons are involved. I didn't see them in your original report, but the stack trace looked different there too. The second stack trace includes the "malloc(): unaligned tcache chunk detected" error message, which is highly indicative of a use-after-free or a buffer overrun/underrun. One thing you could try is running node under valgrind but, caveat emptor, it'll be very slow. Building node from source with Address Sanitizer support enabled is another option.
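For reference, a minimal sketch of those three suggestions, assuming a typical Linux setup; the app.js entry point and the --enable-asan configure flag are my assumptions, not commands taken from this thread:

```sh
# 1. Enumerate native add-ons pulled in by dependencies:
find node_modules -name '*.node'

# 2. Run node under valgrind (expect a large slowdown):
valgrind node app.js

# 3. Build node from source with Address Sanitizer
#    (flag name assumed from node's configure script):
./configure --enable-asan
make -j"$(nproc)"
```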
Thank you for the pointers! I will explore those native add-ons. Would building with Address Sanitizer support have a significant performance impact?
It's usually on the order of 2-4x, but it really depends on what you're doing.
Tried building from source but ran into a build error. I'm wondering if the build error may provide a clue, or is it probably totally unrelated?
(build error output collapsed)
Welcome to the hell of trying to build modern C++. Check the requirements in BUILDING.md; you can't stray too far from them without getting errors like the ones you're seeing.
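A quick way to compare a build machine against the BUILDING.md requirements is just to print the toolchain versions; a small sketch, assuming a GCC-based Linux build:

```sh
# Versions to check against the minimums listed in BUILDING.md:
gcc --version
g++ --version
python3 --version
make --version
```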
Ha, that particular error stirred a memory: #43370 (comment)
No problem. I'm actually just getting this entire server replaced; we've had too many issues with it. I'm going to close this ticket out. Thanks for taking a look so quickly though!
Hi @bnoordhuis, does it look from this stack trace like a similar issue happened without any 3rd-party native modules? This is the pm2 God daemon crash, and I don't believe the daemon loads any 3rd-party stuff. Also worth noting: this crash happened on a completely different server than the crashes above.
(stack trace collapsed)
Looks like the same stack trace, yes. One other thing you can try is downgrading to Ubuntu 20.04 to see if the crashes go away; that would suggest the bug is in glibc or maybe libstdc++. FWIW, no one else has reported similar crashes so far. My hunch is it's something in your environment rather than in node itself.
You are probably right; I'm just confused because we're doing nothing special in our env. In any case, we will probably be downgrading to Debian 11 from Jammy. But before we do that, I'm trying one last-ditch effort: starting the processes with a jemalloc shim, on the off chance it helps us avoid the downgrade.
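Preloading jemalloc usually looks something like the sketch below; the package name and library path are assumptions for Ubuntu 22.04 on x86_64, and the app.js entry point is hypothetical:

```sh
# Install jemalloc and preload it for a pm2-managed process.
# Library path is the usual Ubuntu 22.04 location; verify with `dpkg -L libjemalloc2`.
sudo apt-get install libjemalloc2
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 pm2 start app.js
```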
One last note in case anyone comes here via a Google search: the jemalloc shim seemed to work for a bit, but I did notice periodic core dumps with it for various procs started by pm2, with the following stack trace:
(stack trace collapsed)
I'm going to move to Debian 11.
As a quick note, we rebuilt the node_modules folder and it seems to have fixed things. This is odd, because we were pretty sure we had already done that. In any case, I hope that's helpful for anyone else who sees something like this. We did not downgrade to Debian 11.
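Rebuilding node_modules from scratch also recompiles native add-ons against the current node headers, which fits the earlier suspicion about *.node files. The exact commands weren't given in the thread; assuming npm, it's typically:

```sh
# Remove the installed tree and reinstall exactly from the lockfile.
rm -rf node_modules
npm ci
```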
Version
v18.15.0
Platform
Linux server-host 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Subsystem
No response
What steps will reproduce the bug?
I don't know how to reproduce it, but this seems to happen on our busiest servers. The server has plenty of RAM available (80 GB free out of 128 GB) and 96 CPUs.
How often does it reproduce? Is there a required condition?
It seems to happen when my server is under heavy traffic (although CPU and RAM usage are not high). It happens maybe once or twice per day. I cannot reproduce it readily.
What is the expected behavior? Why is that the expected behavior?
It should not segfault.
What do you see instead?
I opened the coredump with coredumpctl info and coredumpctl debug. Here is some info; I got the backtrace as well. I'm happy to send this coredump if it's helpful. It's only about 10 MB.
coredumpctl info (output collapsed)
coredump debug with backtrace (output collapsed)
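For anyone retracing this workflow, the systemd-coredump commands involved are roughly as below; matching on the executable name "node" is my assumption (coredumpctl also accepts a PID or other match):

```sh
# List recent crashes captured by systemd-coredump:
coredumpctl list node
# Show metadata (signal, timestamp, executable) for the newest match:
coredumpctl info node
# Open the dump in gdb, then print the full backtrace:
coredumpctl debug node
# (gdb) bt full
```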
Additional information
Running Ubuntu 22.04 (Jammy) with the latest patches.