rocketchat crashes with segmentation fault since nodejs 8.10 #19274
@nodejs/v8
I'd check whether this is fixed with newer V8 and bisect for the fix.
I could reproduce. Then I wanted to run it again with node master, but there is a dependency that uses a V8 API that doesn't exist anymore.
I commented on the Rocket Chat issue. From the debug build output it looks like API abuse to me.
I was wrong. This rather looks like the parser is running on a thread it is not supposed to be on.
If you have a patch at some point, I'm happy to test.
I've seen the thread ID error when running a debug build of Node and using https://github.com/laverdet/node-fibers (which I believe RocketChat uses, since it's a Meteor app?).
Specifically, V8 itself is designed to tolerate multiple threads, as long as you're careful to lock and unlock the isolate around each thread's use of it.
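For context, here is a minimal sketch of the locking discipline V8 expects from a multithreaded embedder. This is illustrative only; the function name and script are assumptions, not code from RocketChat or node-fibers, and it presumes a recent V8 with the isolate already initialized:

```cpp
// Minimal sketch: sharing one isolate across OS threads with v8::Locker.
// Assumes the embedder has already initialized V8 and the isolate.
#include <v8.h>

void RunOnThisThread(v8::Isolate* isolate, const char* source) {
  v8::Locker locker(isolate);  // only one thread may use the isolate at a time
  v8::Isolate::Scope isolate_scope(isolate);
  v8::HandleScope handle_scope(isolate);
  v8::Local<v8::Context> context = v8::Context::New(isolate);
  v8::Context::Scope context_scope(context);

  v8::Local<v8::String> src =
      v8::String::NewFromUtf8(isolate, source).ToLocalChecked();
  v8::Script::Compile(context, src).ToLocalChecked()
      ->Run(context)
      .ToLocalChecked();
}  // ~Locker releases the isolate so another thread can enter it
```

Each thread takes the Locker before touching the isolate; fibers avoid the cost of real threads, which is where the trickery described next comes in.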
node-fibers mucks around with thread-local storage to trick V8 into believing it's running on multiple threads. Some recent change probably broke that hack, but that's ultimately a node-fibers issue.
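As a generic illustration of the bookkeeping such a hack has to fool (this is not node-fibers' actual code): V8 records which thread currently holds an isolate's lock, which is exactly the per-thread state a stack-switching library must keep consistent.

```cpp
// Generic illustration, not node-fibers' code: V8 tracks which thread
// owns an isolate's lock, so a library that switches stacks without
// switching OS threads must keep V8's notion of "current thread" coherent.
#include <cassert>
#include <v8.h>

void DemonstrateLockOwnership(v8::Isolate* isolate) {
  assert(!v8::Locker::IsLocked(isolate));  // this thread doesn't hold it yet
  v8::Locker locker(isolate);
  assert(v8::Locker::IsLocked(isolate));   // now this thread is the owner
}
```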
umm... I'm shocked.
Just to be clear: we have not pinned this down to fibers yet, though that seems possible, and of course those of us who rely on fibers are willing to put in work to keep them compatible with Node and V8. Let's wait and see what the real problem is before assuming the solution will be unwelcome to anyone?
As soon as we know this is a fibers issue for sure, I definitely support moving the discussion over to that repo!
Definitely support finding out the root cause. I'm just expressing my surprise that node-fibers would manipulate TLS.
In principle, fibers could be implemented by actually spawning new threads, and then there would be nothing weird about how it works. I agree there's some trickery involved in getting V8 to think there are multiple threads involved, without incurring the full overhead of spawning threads, and if those tricks are the source of this problem, then of course that's a fibers problem. Those tricks have worked fairly well for several years, so I'm confident we can find a solution that involves fewer tricks (including just spawning real threads, if necessary).
Maybe someone can bisect the changes between 8.9 and 8.10 to see what caused the issue to surface?
@benjamn has anyone reported similar problems on 9.x? It seems odd that this is just showing up now.
@MylesBorins If this is a meteor-specific problem (fibers-related or otherwise), then it's relevant to note that each version of meteor ships with a specific (stable) node version, and 8.9.4 is the most recent version we've shipped (in meteor 1.6.1). You could run an unreleased build of meteor with Node 9, but that's uncommon.
@benjamn with that in mind we should try and figure out how to get some of the meteor stuff into CITGM so we can find these problems early 😄
@hashseed It took a little bit of dancing around and shuffling commits to avoid some of the surrounding un-buildable commits, but this bisects down to bede7a3 (the commit that updated V8). Is there a documented process somewhere as to how this larger V8 update commit is formed? Is it essentially a squash of the upstream V8 commits?
@abernix thanks for bisecting! It seems that the commit updated V8 from 6.1.534 to 6.2.414.46. In a V8 checkout I get:

```
git log 6.1.534..6.2.414.46 --oneline | wc -l
1043
```

So bisecting about 10 steps in V8 should give you the culprit. To do so, check out V8 and update Node.js like this on every bisect step:

```
v8/tools/release/update_node.py $V8_DIR $NODE_DIR
```

Then build and try to repro.
🎉 Good news! 🐰🎩🔮😌

After @abernix bisected Node down to the V8 update, I bisected V8 itself between 6.1.534 and 6.2.414.46, and the result conclusively points to a single commit. I say "conclusively" because I can rebuild Node 8.10.0 with just this one commit reverted, and the problem disappears in every case where we were previously able to reproduce it.
In other words, the offending commit appears to be v8/v8@ea0e1e2, which (as @abernix points out) was originally intended to fix a problem introduced by v8/v8@e15f554, though that commit was later reverted by v8/v8@a193fde and has not successfully re-landed since then. Since the problem that v8/v8@ea0e1e2 fixed is no longer present, arguably v8/v8@ea0e1e2 itself could be (should be?) reverted in V8 and then cherry-picked into Node, even if we don't know exactly why it caused this particular problem. See below for a plausible theory, but I'll be the first to admit it's probably not entirely correct. Nevertheless, I think we can reach the same conclusion without getting wrapped up in a debate about the theory.

Recommendation

We believe that v8/v8@ea0e1e2 should be reverted, that the revert commit should be cherry-picked into Node's copy of V8, and that the result should hopefully be released in Node v8.10.1.

If this recommendation seems reasonable, what's the best way to proceed? Specifically, should I submit a V8 PR/CL based on 6.2.414.42, or the current version (6.7.106, much more recent)?

Detailed theory
Taking a closer look at the commit in question, I think there are a few observations we can make:

```
commit ea0e1e21ecc13884302e0c77edad67659f2e68b4
Author: Juliana Franco <[email protected]>
Date: Fri Aug 4 10:45:33 2017 +0200
Fixing failure on GC stress.
This bug was introduced by the CL
https://chromium-review.googlesource.com/c/586707
With these changes we make sure that the object being deoptimized
does not point to code objects that have been already collected.
The CL https://chromium-review.googlesource.com/c/596027 did not
fix this problem because we were only invalidating embedded objects
reachable from the stack, however it is possible that there are some
dangling references in objects not on the stack. Thus we consider
all the optimized code objects that are marked for deoptimization.
Bug: v8:751825
Change-Id: I3a6410c2bf556fa254c54a25e1f49d7356b9e51d
Reviewed-on: https://chromium-review.googlesource.com/601967
Commit-Queue: Juliana Patricia Vicente Franco <[email protected]>
Reviewed-by: Jaroslav Sevcik <[email protected]>
Cr-Commit-Position: refs/heads/master@{#47163}
diff --git a/src/deoptimizer.cc b/src/deoptimizer.cc
index 2f5657e82e..3e90127db3 100644
--- a/src/deoptimizer.cc
+++ b/src/deoptimizer.cc
@@ -301,71 +301,70 @@ void Deoptimizer::DeoptimizeMarkedCodeForContext(Context* context) {
safe_to_deopt_topmost_optimized_code = safe_if_deopt_triggered;
}
}
}
#endif
// Move marked code from the optimized code list to the deoptimized
// code list.
// Walk over all optimized code objects in this native context.
Code* prev = NULL;
Object* element = context->OptimizedCodeListHead();
while (!element->IsUndefined(isolate)) {
Code* code = Code::cast(element);
CHECK_EQ(code->kind(), Code::OPTIMIZED_FUNCTION);
Object* next = code->next_code_link();
if (code->marked_for_deoptimization()) {
+ // Make sure that this object does not point to any garbage.
+ code->InvalidateEmbeddedObjects();
if (prev != NULL) {
// Skip this code in the optimized code list.
prev->set_next_code_link(next);
} else {
// There was no previous node, the next node is the new head.
context->SetOptimizedCodeListHead(next);
}
// Move the code to the _deoptimized_ code list.
code->set_next_code_link(context->DeoptimizedCodeListHead());
context->SetDeoptimizedCodeListHead(code);
} else {
// Not marked; preserve this element.
prev = code;
}
element = next;
}
// Finds the with activations of codes marked for deoptimization, search for
// the trampoline to the deoptimizer call respective to each code, and use it
// to replace the current pc on the stack.
for (StackFrameIterator it(isolate, isolate->thread_local_top()); !it.done();
it.Advance()) {
if (it.frame()->type() == StackFrame::OPTIMIZED) {
Code* code = it.frame()->LookupCode();
if (code->kind() == Code::OPTIMIZED_FUNCTION &&
code->marked_for_deoptimization()) {
// Obtain the trampoline to the deoptimizer call.
SafepointEntry safepoint = code->GetSafepointEntry(it.frame()->pc());
int trampoline_pc = safepoint.trampoline_pc();
DCHECK_IMPLIES(code == topmost_optimized_code,
safe_to_deopt_topmost_optimized_code);
// Replace the current pc on the stack with the trampoline.
it.frame()->set_pc(code->instruction_start() + trampoline_pc);
-
- // Make sure that this object does not point to any garbage.
- code->InvalidateEmbeddedObjects();
}
}
}
}
```

Because this change is so simple, it seems reasonable to conclude that the behavior change comes down to where `code->InvalidateEmbeddedObjects()` gets called.

What could go wrong if embedded objects are invalidated for every code object marked for deoptimization, rather than only for code found on the current stack? As @abernix suggested in this comment, the original call to `InvalidateEmbeddedObjects` ran only for optimized frames discovered while walking the current thread's stack, so it could never touch code that some other stack was still executing. More generally, moving the call into the walk over the context's entire optimized code list means it can now clobber code objects that still have live activations elsewhere. This theory is relevant to our previous suspicions about node-fibers, which switches between multiple stacks within a single isolate.

How does this become a problem for Node?

In the reproduction we used for bisecting, almost every core dump stack trace included this line:

```
if (env_->tick_callback_function()->Call(process, 0, nullptr).IsEmpty()) {
```

This is where Node re-enters JavaScript to run `process._tickCallback`. The process tended to crash while calling into native code here, which is consistent with the theory that the optimized code being invoked had had its embedded objects invalidated while it still had live activations on some stack.
I'm not familiar with the code, but the suggestion sounds reasonable. I have a theory.
Prior to v8/v8@ea0e1e2, we would only invalidate code on the current stack. After it, we would invalidate all code from the same context. In cases where V8 is embedded with Lockers and Unlockers, code that gets invalidated this way can still be live on another thread's stack. So to reproduce this bug, what needs to happen is: two threads enter the same isolate, both end up with the same optimized code object on their stacks, one thread triggers deoptimization of that code, and the other thread then resumes the invalidated code (see the sketch below).
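To make that sequence concrete, here is a hedged sketch of the scenario. The helper names, scripts, and the shape-change deopt trigger are illustrative assumptions, not code from this issue; the real regression tests are the `test-lockers.cc` cases discussed below:

```cpp
// Hedged sketch of the race described above, assuming a recent V8 API.
#include <thread>
#include <v8.h>

// Run `source` on the calling thread, taking the isolate lock first so
// that the two threads interleave legally.
void RunLocked(v8::Isolate* isolate, v8::Global<v8::Context>& context,
               const char* source) {
  v8::Locker locker(isolate);
  v8::Isolate::Scope isolate_scope(isolate);
  v8::HandleScope handle_scope(isolate);
  v8::Local<v8::Context> ctx = context.Get(isolate);
  v8::Context::Scope context_scope(ctx);
  v8::Local<v8::String> src =
      v8::String::NewFromUtf8(isolate, source).ToLocalChecked();
  v8::Script::Compile(ctx, src).ToLocalChecked()->Run(ctx).ToLocalChecked();
}

void SketchOfRepro(v8::Isolate* isolate, v8::Global<v8::Context>& context) {
  // Thread A: get hot() optimized and, in a real repro, stay inside a
  // hot() activation while yielding the isolate via a native callback
  // that uses v8::Unlocker.
  std::thread a([&] {
    RunLocked(isolate, context,
              "function hot(o) { return o.x; }\n"
              "for (let i = 0; i < 100000; i++) hot({x: i});");
  });
  // Thread B: break an assumption hot()'s optimized code depends on
  // (a new object shape), marking that code for deoptimization. With
  // ea0e1e2 applied, the code's embedded objects are invalidated even
  // though thread A may still have hot() on its stack.
  std::thread b([&] {
    RunLocked(isolate, context, "hot({x: 1, y: 2});");
  });
  a.join();
  b.join();
}
```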
So we should definitely revert that change. Thanks for the bisect! @bmeurer does that sound correct to you?
Did not remove ActivationsFinder from `src/runtime/runtime-compiler.cc` as in the original commit, as the class is still being used prior to f0acede landing.

Original commit message:

Deoptimization and multithreading.

When using Lockers and Unlockers it is possible to create a scenario where multiple threads point to the same optimized code object. When that happens, if one of the threads triggers deoptimization, then the stack replacement needs to happen in the stacks of all threads. With this CL, the deoptimizer visits all threads to do so. The CL also adds three tests where V8 used to crash due to this issue.

Bug: v8:6563
Change-Id: I74e9af472d4833aa8d13e579df45133791f6a503
Reviewed-on: https://chromium-review.googlesource.com/670783
Reviewed-by: Jaroslav Sevcik <[email protected]>
Commit-Queue: Juliana Patricia Vicente Franco <[email protected]>
Cr-Commit-Position: refs/heads/master@{#48060}

PR-URL: #19477
Fixes: #19274
Refs: v8/v8@596d55a
Refs: v8/v8@f0acede
Reviewed-By: Gus Caplan <[email protected]>
Reviewed-By: Yang Guo <[email protected]>
Reviewed-By: Franziska Hinkelmann <[email protected]>
Reviewed-By: Benedikt Meurer <[email protected]>
I just tried d46fafc with 8.11.1. It works so far. One question though: there is no file called test-locker.cc. Is that an additional file somewhere? Thanks and cheers
@himBeere Are you referring to the (plural) test-lockers.cc?
The patch is there, yes. But which file should I patch? It's not included in the source as far as I can see.
Thanks and cheers
In the node repo, run the following command:
Also worth noting that this patch has not yet gone out in an 8.x release. 8.11.1 was a patch release due to infra-related issues. Another version of 8.x will be coming in just over 2 weeks.
@MylesBorins good to know. I understand the meteor guys are trying to fix the issue themselves for now.
@himBeere Correct, we (as in Meteor, myself and @benjamn included) are trying to mitigate the situation which arose after 8.11.1 was announced. Of course, since it's a security update, many of our users are updating to it in the name of security (makes sense!). Unfortunately, it includes the breaking change outlined in this issue, which causes segmentation faults for many of our users. We're currently deploying a fix to our own hosting platform, Galaxy, using our own custom build of Node 8.11.1 with the d46fafc commit applied, but that leaves non-Galaxy users needing to decide (often unbeknownst to them, until they run into one problem or the other) whether to stay on 8.11.0 or tolerate the segmentation faults. We're anxiously awaiting 8.11.2(?), which should solve this!
Any news about the 8.11.2 release date?
At least anecdotally, the meteor build seems to have resolved ongoing segmentation faults in our pm2 God processes; 'anecdotally' and 'seems' because proving a negative is difficult, but we've been solid since applying the patch, and we've got quite the collection of core dumps from Mars prior to it.
Fix seems to be in 8.11.2. Is that right?
Yes, I think this can be closed!
We didn't release an official version using 8.11.2 yet, we will do it soon 😄
Sounds like this can be closed. Feel free to reopen if I'm incorrect.
Hello Devs.
Since nodejs 8.10, Rocketchat (https://rocket.chat/) crashes with a segmentation fault. I compiled nodejs with debug and got this:
And:
Any idea how to fix this?
Here is the rocketchat issue: RocketChat/Rocket.Chat#10060
thanks and cheers