-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mac,thread: fix pthread_barrier_wait serial thread #2003
Conversation
Ping @cjihrig @bnoordhuis @saghul Would you mind taking a peek at this PR? |
Hi @ofrobots!
Plausible. But from this I understand that you don't have proof, just a guess? Is the implementation wrong?
Let us say instead, it is not safe for any other thread to successfully destroy the barrier until all threads have exited it safely. The
The corresponding libuv polyfill for
As a result, my understanding is that:
(I might be wrong, just sharing my perspective on the code/spec). What should the implementation do?The return value |
Am I reasonably certain? Yes. The crashes that I observe match the thread races as I have described them. With my change, the crashes get fixed.
I can buy that argument. The more interesting question is whether
The only documentation for With that in mind, I stand by the fix I provided here, but I am open to alternative suggestions if you have them. |
Discussion
Maybe I misunderstood the behavior you are experiencing.
The libuv docs say here that "The API largely follows the pthreads API", which is why I was looking at the POSIX spec. Your patchI am concerned about your implementation in the case of a barrier initialized with a limit of 1. |
src/unix/thread.c
Outdated
if (++barrier->in == barrier->threshold) { | ||
barrier->in = 0; | ||
barrier->out = barrier->threshold - 1; | ||
rc = pthread_cond_signal(&barrier->cond); | ||
assert(rc == 0); | ||
|
||
pthread_mutex_unlock(&barrier->mutex); | ||
return PTHREAD_BARRIER_SERIAL_THREAD; | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a 1-thread barrier, shouldn't this return PTHREAD_BARRIER_SERIAL_THREAD
?
Can we put the do-while
loop under an else
so that all threads share the same exit path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍Fixed.
src/unix/thread.c
Outdated
@@ -76,6 +76,7 @@ int pthread_barrier_init(pthread_barrier_t* barrier, | |||
|
|||
int pthread_barrier_wait(pthread_barrier_t* barrier) { | |||
int rc; | |||
char last; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you use char
instead of int
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/unix/thread.c
Outdated
} | ||
/* Otherwise, wait for other threads until in is set to 0, | ||
then return 0 to indicate this is not the first thread. */ | ||
do { | ||
if ((rc = pthread_cond_wait(&barrier->cond, &barrier->mutex)) != 0) | ||
break; | ||
} while (barrier->in != 0); | ||
assert(rc == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be just:
do {
rc = pthread_cond_wait(&barrier->cond, &barrier->mutex);
assert(rc == 0);
} while (barrier->in != 0);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it... everywhere else we call abort()
when pthread functions fail when they logically can't. Would be good to do that here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. There are a few more places where we should be abort
ing if a pthread function fails illogically. I didn't include those in this PR to keep the review delta small. I think they can be addressed in a follow-on.
src/unix/thread.c
Outdated
barrier->out--; | ||
/* Mark thread exit. If this the last thread to exit, return | ||
PTHREAD_BARRIER_SERIAL_THREAD. */ | ||
last = (--barrier->out == 0); | ||
pthread_cond_signal(&barrier->cond); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optimization: if this is the last thread, don't call pthread_cond_signal()
; there isn't anyone to signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cf025f6
to
fb3cfab
Compare
I have added a commit with a test ( |
fb3cfab
to
e34cd4a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, mostly LGTM. Is the third commit (the one that registers barrier_serial_thread_single
) supposed to be squashed with the first one?
test/test-barrier.c
Outdated
@@ -104,3 +104,38 @@ TEST_IMPL(barrier_3) { | |||
|
|||
return 0; | |||
} | |||
|
|||
static void serial_worker(void* data) { | |||
uv_barrier_t* barrier = (uv_barrier_t*) data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
test/test-barrier.c
Outdated
exited the barrier. If this value is returned too early and the barrier is | ||
destroyed prematurely, then this test may see a crash. */ | ||
TEST_IMPL(barrier_serial_thread) { | ||
uv_thread_t threads[NUM_WORKERS]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be slightly more idiomatic to declare this as uv_thread_t threads[4];
and then use ARRAY_SIZE(threads)
everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
test/test-barrier.c
Outdated
|
||
ASSERT(0 == uv_barrier_init(&barrier, NUM_WORKERS + 1)); | ||
|
||
for (int i = 0; i < NUM_WORKERS; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you declare int i
at the top?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I have (at last!) identified the cause. Maybe this was obvious to everyone already, but just in case... I have been studying the source of the libuv polyfills, and I can't find a thread schedule that has But you're calling
The In summary:
I don't like the prospect of memory leaks, so I support your patch. |
e34cd4a
to
01e18bf
Compare
It is not clear (to me) which spec applies here anyway. I suggest we move this discussion to another issue as it is not necessary to resolve the bug that I am trying to fix. @bnoordhuis I've squashed the commit and address the rest of the comments. Thank you both for your review and insights! |
Should this be targeting v1.x? |
Agreed, @ofrobots can you rebase? |
01e18bf
to
a989af1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'm still interested in opinions about the assert in uv_barrier_destroy
, but this is not a blocker for me.
@davisjam can we fork that discussion into a separate issue? I would this change to land. This fix is blocking a cascade of PRs in Node.js. |
What are the next steps here, is there a CI somewhere that can be run? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a CI somewhere that can be run?
Indeed there is: https://ci.nodejs.org/view/libuv/job/libuv-test-commit/1047/
src/unix/thread.c
Outdated
@@ -83,7 +83,7 @@ int pthread_barrier_init(pthread_barrier_t* barrier, | |||
} | |||
|
|||
int pthread_barrier_wait(pthread_barrier_t* barrier) { | |||
int rc; | |||
int rc, last; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor style issue: one declaration per line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
The thread returning PTHREAD_BARRIER_SERIAL_THREAD should be the one last to exit the barrier, otherwise it is unsafe to destroy the barrier – other threads may have yet to exit.
a989af1
to
67130df
Compare
Windows test failure seems unrelated. AIX failure is relevant and needs to be investigated. I have opened nodejs/build#1514 to get login access to AIX to be able to investigate this. |
It seems that the AIX implementation of This means that that default example of if (uv_barrier_wait(&barrier) > 0)
uv_barrier_destroy(&barrier); I am forced to conclude that the uv_barrier_wait(&barrier);
if (0 == uv_barrier_destroy(&barrier)) {
// any additional cleanup, e.g. free of the barrier.
} |
I also noticed that we have AIX 6.1 in the CI. As per https://www-01.ibm.com/support/docview.wss?uid=isg3T1012517, AIX 6.1 may already be EOL since last year? It is possible that the behaviour is different/fixed on AIX 7. Looking at node supported platforms: https://github.com/nodejs/node/blob/master/BUILDING.md#supported-platforms-1, AIX 7 is the minimum supported. libuv claims to support AIX 6 though. |
It was pointed out that pthread_barrier_wait() behaves slightly different from other platforms. Switch to libuv's own thread barrier for uniformity of behavior. Perhaps we'll do that for more platforms in the future. Refs: libuv#2003 (comment)
Maybe we should use our own barrier implementation everywhere for the sake of uniform behavior, or at least switch to it on AIX. I've opened #2019 and included the changes from this PR. |
👍. Deferring to #2019. |
It was pointed out that pthread_barrier_wait() behaves slightly different from other platforms. Switch to libuv's own thread barrier for uniformity of behavior. Perhaps we'll do that for more platforms in the future. PR-URL: libuv#2019 Refs: libuv#2003 (comment) Reviewed-By: Santiago Gimeno <[email protected]>
Libuv's own thread barrier implementation signaled completion to the first waiter that saw the threshold being reached, contrary to what some native pthreads barrier implementations do, which is to signal it to the _last_ waiter. Libuv's behavior is not strictly non-conforming but it's inconvenient because it means this snippet (that appears in the libuv documentation) has a race condition in it: if (uv_barrier_wait(&barrier) > 0) uv_barrier_destroy(&barrier); // can still have waiters This issue was discovered and fixed by Ali Ijaz Sheikh, a.k.a @ofrobots, but some refactoring introduced conflicts in his pull request and I didn't have the heart to ask him to redo it from scratch. :-) PR-URL: libuv#2019 Refs: libuv#2003 Reviewed-By: Santiago Gimeno <[email protected]>
Hi Everyone 👋. This is my first PR here. While trying to fix nodejs/node#23065 using
uv_barriers
, I noticed a thread race on my mac that, I believe, stems from a buggy polyfill ofpthread_barrier_wait
in libuv.Here's the documentation on
uv_barrier_wait
:The actual implementation returned
PTHREAD_BARRIER_SERIAL_THREAD
(i.e. > 0) on the last thread to enter the wait. IMHO this value should be returned on the last thread to exit the wait. It is not safe for any other thread to destroy the barrier as there may be threads yet to exit the barrier.Checking with implementations elsewhere, refer to the the implementation of
absl::Barrier::Block
from the excellent abseil libraries, and they seem to agree with my assessment.I didn't know about
pthread_barrier_t
until yesterday so it is possible I am figuring things wrong. PTAL.