Multiple I/O loops race with waitpid #887
Thanks, I can confirm the issue. This got overlooked when adding cross-event loop signal handler support. Is this blocking you somehow? I can land a quick (if perhaps slightly sub-optimal) fix in master sometime this week. A better fix would probably take a little longer.
FWIW, I have the aforementioned quick fix sitting in a branch right now: bnoordhuis/libuv@362f2a2. It does an O(n) scan over the event loop's active process watchers, so it's not terribly efficient. On the other hand, you probably won't notice a difference until you have several thousand watchers.
Wow, thanks for coming up with a fix so quickly! This is a mild blocker, but nothing major. We're using our own fork of libuv in Rust right now, so we could easily just apply your patch. I think we're pretty far behind master anyway, so there shouldn't be too much of a need to merge this into master quickly.

That being said, I had thought about making a fix like yours, but it seems even more sub-optimal because even with one event loop you do an O(n) scan every time a process exits to figure out which process exited. That seems like a very undesirable property...

I also attempted a fix, and the progress I had is at alexcrichton/libuv@42fc3fb. It's super-hacky and nowhere near good quality, but the idea is that the signal handler itself actually consumes all pids and sends messages over the signal pipe. It also takes the shotgun approach of notifying event loops, but it only notifies loops which have children. The lookup for which child died is about as fast as possible because each loop already knows the pid of what died. The bad part is that it still doesn't catch everything; I was running into some weird scenarios where the signal handler never exited...

Regardless, I may try to get your patch integrated into our fork in the meantime, but I feel like for general users of libuv, to prevent a regression, a patch on master should probably take some more time.
I pushed a slightly better fix in bnoordhuis/libuv@7eb488d. The old patch has some issues when you close handles while it's still iterating over the list.
I don't disagree. This patch is only an interim solution (hey, I wrote it between my morning coffee and taking my kid to the petting zoo) but FWIW, I ran a few node.js benchmarks where the script spawns a few thousand instances of /bin/false in rounds and I wasn't able to measure a difference with
Yeah, that's similar to what I had in mind. signal.c and threadpool.c have some common infrastructure that can be shared with process.c. It's a bit of work to factor it out but I'll probably get around to it this week.
That's because signal handlers are definitely 'here be dragons' territory. :-) For that matter, the current signal handler in libuv is doing things that are borderline illegal. It works but it's not terribly future proof. I've been meaning to rewrite it but I'm not sure that it will make the cut for v0.12.
Oh wow, thanks! I think I trust your patch more than I do mine. Also thanks for taking the time to look into it. I remembered that printf isn't exactly reentrant, so when I took out my debugging printfs in the signal handler in my patch, it turns out it's actually working just fine. Although if what you say is true about the functions not showing up in perf at all, this may be a moot point. (premature optimization?) In the meantime, I think I'll see if we can apply your patch locally, and I'll continue to watch this. Thanks again!
Quick fix for joyent#887.
Landed the interim patch in master in 5c00a0e. If people find performance regressions, please open a new issue with numbers and reference this issue.
Before this commit, multiple event loops raced with each other when a SIGCHLD signal was received. More concretely, it was possible for event loop A to consume waitpid() events that should have been delivered to event loop B. This commit addresses that by doing a linear scan over the list of child processes. An O(n) scan is not terribly efficient but the actual performance impact is not measurable in a benchmark that spawns rounds of several thousand instances of /bin/false. For the time being, this patch will suffice; we can always revisit it later. Fixes joyent#887.
For the following program, I would expect that the printfs in `on_exit` would be invoked twice every time I run the program. When running this, however, it's only rarely that you see two exit prints. After investigating, I believe that it's because of the following sequence of events: one child's `exit_cb` is invoked successfully, but child 2 falls through the cracks due to this line.

In this situation, it looks like the pids reaped from `waitpid` aren't guaranteed to be delivered to the right loop, so children can exit and be reaped, but their corresponding `exit_cb` fields will never be run.

A similar problem may exist on Windows, but I haven't looked too closely yet.