Multiple I/O loops race with waitpid #887

Closed

alexcrichton opened this issue Aug 18, 2013 · 6 comments

@alexcrichton

For the following program, I would expect the printf in on_exit to be invoked twice every time I run the program:

#include <pthread.h>
#include <stdio.h>
#include "uv.h"

void on_exit(uv_process_t *p, int exit_status, int term_signal) {
    printf("exit %p %d %d\n", p, exit_status, term_signal);
}

void* run(void *arg) {
    uv_loop_t *loop1 = uv_loop_new();
    uv_process_options_t options = {0};
    options.file = "echo";
    char *args[3] = {"echo", "wut", NULL};
    options.args = args;
    options.exit_cb = on_exit;

    uv_process_t process;
    int ret = uv_spawn(loop1, &process, options);
    printf("%d\n", process.pid);
    uv_run(loop1, UV_RUN_DEFAULT);
    return NULL;  /* run() must return a value; pthread stores it */
}

int main() {
    pthread_t a, b;
    pthread_create(&a, NULL, run, NULL);
    pthread_create(&b, NULL, run, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
}

When running this, however, you only rarely see two exit prints. After investigating, I believe it's because of the following sequence of events:

  1. Thread a spawns child 1.
  2. Thread b spawns child 2.
  3. Both children 1/2 exit, and the parent process receives SIGCHLD.
  4. Thread a runs its SIGCHLD handler in the event loop.
  5. Because of this line, thread a reaps both child 1 and child 2. Child 1 has its exit_cb invoked successfully, but child 2 falls through the cracks due to this line.
  6. Thread b now runs its SIGCHLD handler, waitpid doesn't return anything, and nothing is done.

In this situation, it looks like the pids reaped from waitpid aren't guaranteed to be delivered to the right loop, so children can exit and be reaped, but their corresponding exit_cb callbacks will never be invoked.
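To make the race concrete, here is a minimal sketch of the pattern being described (my own illustration, not libuv's actual code): each loop's SIGCHLD handling reaps with waitpid(-1, ...), so it can consume exit statuses that belong to another loop's children.

/* Minimal sketch of the race; not libuv's actual code. */
#include <sys/types.h>
#include <sys/wait.h>

static void reap_children(void) {
    int status;
    pid_t pid;

    /* waitpid(-1, ...) matches ANY child of the process, not just
     * the children spawned through this particular event loop. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        /* Look up pid among this loop's process watchers. If the
         * child was spawned by another loop, the lookup fails and
         * the exit status is silently dropped: that child's
         * exit_cb never runs in any loop. */
    }
}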

A similar problem may exist on Windows, but I haven't looked too closely yet.

@bnoordhuis
Contributor

Thanks, I can confirm the issue. This got overlooked when adding cross-event-loop signal handler support.

Is this blocking you somehow? I can land a quick (if perhaps slightly sub-optimal) fix in master sometime this week. A better fix would probably take a little longer.

@bnoordhuis
Contributor

FWIW, I have the aforementioned quick fix sitting in a branch right now: bnoordhuis/libuv@362f2a2.

It does an O(n) scan over the event loop's active process watchers, so it's not terribly efficient. On the other hand, you probably won't notice a difference until you have several thousand watchers.
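The idea, as a minimal sketch (the types and names here are illustrative, not libuv's internal identifiers): walk only this loop's own process watchers and wait on each specific pid, so no other loop's children are ever consumed.

/* Sketch of the O(n) fix; types and names are illustrative. */
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

typedef struct process_watcher_s process_watcher_t;

struct process_watcher_s {
    pid_t pid;
    void (*exit_cb)(process_watcher_t *w, int status);
    process_watcher_t *next;
};

static void reap_own_children(process_watcher_t *head) {
    process_watcher_t *w;
    process_watcher_t *next;
    int status;

    for (w = head; w != NULL; w = next) {
        next = w->next;  /* exit_cb may close/unlink w, so grab next first */
        /* Wait on this specific pid only; WNOHANG skips children
         * that are still running. */
        if (waitpid(w->pid, &status, WNOHANG) == w->pid)
            w->exit_cb(w, status);  /* the status reaches the right loop */
    }
}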

@alexcrichton
Author

Wow, thanks for coming up with a fix so quickly!

This is a mild blocker, but nothing major. We're using our own fork of libuv in Rust right now, so we could easily just apply your patch. I think we're pretty far behind master anyway, so there shouldn't be too much of a need to merge this into master quickly.

That being said, I had thought about making a fix like yours, but it seems even more sub-optimal because even with one event loop you do an O(n) scan every time a process exits to figure out which process exited. It seems like that's a very undesirable property...

I also attempted to make a fix, and the progress I had is at alexcrichton/libuv@42fc3fb. It's also super-hacky and nowhere near good quality, but the idea is that the signal handler itself actually consumes all pids and sends messages over the signal pipe. It also takes the shotgun approach of notifying event loops, but it only notifies loops that have children. The lookup for which child died is about as fast as possible because each loop already knows the pid of whatever died. The bad part is that it still doesn't catch everything. I was running into some weird scenarios where the signal handler never exited...
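Roughly, the shape of that idea as a minimal sketch (my illustration, not the actual patch; signal_pipe_fd and the message layout are assumptions): the handler reaps every child and pushes (pid, status) pairs down a pipe, and the loop that owns a pid picks its message up on its own thread.

/* Sketch: the SIGCHLD handler itself reaps all children and writes
 * each (pid, status) pair down a pipe. Illustrative only. */
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int signal_pipe_fd;  /* write end; assumed set up elsewhere */

static void sigchld_handler(int signum) {
    int saved_errno = errno;  /* waitpid() and write() clobber errno */
    int status;
    pid_t pid;

    (void) signum;

    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        int msg[2] = { (int) pid, status };
        /* write() is async-signal-safe; if it fails, there is
         * nothing safe we can do about it inside a handler. */
        if (write(signal_pipe_fd, msg, sizeof(msg)) < 0)
            break;
    }

    errno = saved_errno;
}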

Regardless, I may try to get your patch integrated into our fork in the meantime, but for general users of libuv, a patch on master should probably take some more time to prevent a regression.

@bnoordhuis
Contributor

I pushed a slightly better fix in bnoordhuis/libuv@7eb488d. The old patch had some issues when you close handles while it's still iterating over the list.

> That being said, I had thought about making a fix like yours, but it seems even more sub-optimal because even with one event loop you do an O(n) scan every time a process exits to figure out which process exited. It seems like that's a very undesirable property...

I don't disagree. This patch is only an interim solution (hey, I wrote it between my morning coffee and taking my kid to the petting zoo), but FWIW, I ran a few node.js benchmarks where the script spawns a few thousand instances of /bin/false in rounds, and I wasn't able to measure a difference with perf stat -i -r 5 or perf record -i. I mean, the function doesn't even show up in the top 100.

> I also attempted to make a fix, and the progress I had is at alexcrichton/libuv@42fc3fb. It's also super-hacky and nowhere near good quality, but the idea is that the signal handler itself actually consumes all pids and sends messages over the signal pipe.

Yeah, that's similar to what I had in mind. signal.c and threadpool.c have some common infrastructure that can be shared with process.c. It's a bit of work to factor it out but I'll probably get around to it this week.

> The bad part is that it still doesn't catch everything. I was running into some weird scenarios where the signal handler never exited...

That's because signal handlers are definitely 'here be dragons' territory. :-)

For that matter, the current signal handler in libuv is doing things that are borderline illegal. It works, but it's not terribly future-proof. I've been meaning to rewrite it, but I'm not sure it will make the cut for v0.12.

@alexcrichton
Author

Oh wow, thanks! I think I trust your patch more than I do mine. Also thanks for taking the time to look into it.

I remembered that printf isn't exactly reentrant, so when I took out my debugging printfs in the signal handler in my patch, it turned out it was actually working just fine. Although if what you say about the functions not showing up in perf at all is true, this may be a moot point. (Premature optimization?)
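For reference, printf can deadlock in a handler because it may take stdio's lock while the interrupted thread already holds it; write() is the async-signal-safe way to get debug output. A minimal sketch (my illustration, keeping the message preformatted since most of stdio is off-limits in a handler):

/* Async-signal-safe debug output: write() instead of printf(). */
#include <string.h>
#include <unistd.h>

static void safe_debug(const char *msg) {
    /* POSIX lists write() among the async-signal-safe functions. */
    ssize_t rc = write(STDERR_FILENO, msg, strlen(msg));
    (void) rc;  /* nothing safe to do on error inside a handler */
}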

In the meantime, I think I'll see if we can apply your patch locally, and I'll continue to watch this. Thanks again!

alexcrichton pushed a commit to alexcrichton/libuv that referenced this issue Aug 19, 2013
alexcrichton pushed a commit to alexcrichton/libuv that referenced this issue Aug 26, 2013
alexcrichton pushed a commit to alexcrichton/libuv that referenced this issue Aug 27, 2013
alexcrichton pushed a commit to alexcrichton/libuv that referenced this issue Aug 29, 2013
@bnoordhuis
Contributor

Landed the interim patch in master in 5c00a0e. If people find performance regressions, please open a new issue with numbers and reference this issue.

rurban pushed a commit to rurban/libuv that referenced this issue Oct 17, 2013
Before this commit, multiple event loops raced with each other when a
SIGCHLD signal was received.  More concretely, it was possible for
event loop A to consume waitpid() events that should have been
delivered to event loop B.

This commit addresses that by doing a linear scan over the list of
child processes.  An O(n) scan is not terribly efficient, but the
actual performance impact is not measurable in a benchmark that spawns
rounds of several thousand instances of /bin/false.  For the time
being, this patch will suffice; we can always revisit it later.

Fixes joyent#887.