mac,thread: fix pthread_barrier_wait serial thread #2003

Closed
ofrobots wants to merge 1 commit into v1.x from pushing-barriers

Conversation

ofrobots
Contributor

Hi Everyone 👋. This is my first PR here. While trying to fix nodejs/node#23065 using uv_barriers, I noticed a thread race on my mac that, I believe, stems from a buggy polyfill of pthread_barrier_wait in libuv.

Here's the documentation on uv_barrier_wait:

Note uv_barrier_wait() returns a value > 0 to an arbitrarily chosen “serializer” thread to facilitate cleanup, i.e.

if (uv_barrier_wait(&barrier) > 0)
    uv_barrier_destroy(&barrier);

The actual implementation returned PTHREAD_BARRIER_SERIAL_THREAD (i.e. > 0) on the last thread to enter the wait. IMHO this value should be returned on the last thread to exit the wait. It is not safe for any other thread to destroy the barrier as there may be threads yet to exit the barrier.
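For illustration, here is a condensed sketch of the pre-fix logic and where the race window sits. This is not the verbatim libuv source; the type and function names are stand-ins that mirror the polyfill's fields, and the PTHREAD_BARRIER_SERIAL_THREAD value below is a placeholder.

#include <pthread.h>

#ifndef PTHREAD_BARRIER_SERIAL_THREAD
# define PTHREAD_BARRIER_SERIAL_THREAD 0x12345  /* placeholder; the polyfill header defines its own */
#endif

/* Condensed stand-in for the polyfill's barrier state. */
typedef struct {
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  unsigned threshold;
  unsigned in;
  unsigned out;
} barrier_sketch_t;

int barrier_sketch_wait(barrier_sketch_t* b) {
  int rc;

  if ((rc = pthread_mutex_lock(&b->mutex)) != 0)
    return rc;

  if (++b->in == b->threshold) {
    b->in = 0;
    b->out = b->threshold - 1;
    pthread_cond_signal(&b->cond);
    pthread_mutex_unlock(&b->mutex);
    /* Race window: this thread was merely the last to ENTER. The other
       threads are still blocked in pthread_cond_wait() on this very
       mutex/cond pair, so a caller that destroys the barrier on seeing
       this return value pulls the state out from under them. */
    return PTHREAD_BARRIER_SERIAL_THREAD;
  }

  /* Wait until the last thread to enter resets `in`. */
  do
    rc = pthread_cond_wait(&b->cond, &b->mutex);
  while (rc == 0 && b->in != 0);

  b->out--;  /* mark thread exit; may happen after the "serializer" has returned */
  pthread_cond_signal(&b->cond);
  pthread_mutex_unlock(&b->mutex);
  return rc;
}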

Checking implementations elsewhere (see for example absl::Barrier::Block from the excellent Abseil libraries), they seem to agree with my assessment.

I didn't know about pthread_barrier_t until yesterday, so it is possible I am getting things wrong. PTAL.

@ofrobots
Contributor Author

Ping @cjihrig @bnoordhuis @saghul Would you mind taking a peek at this PR?

@davisjam
Contributor

davisjam commented Sep 27, 2018

Hi @ofrobots!

I noticed a thread race on my mac that, I believe, stems from a buggy polyfill of pthread_barrier_wait in libuv.

Plausible. But from this I understand that you don't have proof, just a guess?

Is the implementation wrong?

It is not safe for any other thread to destroy the barrier as there may be threads yet to exit the barrier.

Let us say instead, it is not safe for any other thread to successfully destroy the barrier until all threads have exited it safely.

The pthread_barrier_destroy docs specify the return of EBUSY for the case:

[EBUSY]
    The implementation has detected an attempt to destroy a barrier while it is in use (for example, while being used in a pthread_barrier_wait() call) by another thread.

The corresponding libuv polyfill for pthread_barrier_destroy includes

  if ((rc = pthread_mutex_lock(&b->mutex)) != 0)
    return rc; 

  if (b->in > 0 || b->out > 0)
    rc = EBUSY;

As a result, my understanding is that:

  1. The polyfill is not incorrect according to the spec, but
  2. The example in the libuv docs can fail with EBUSY.

(I might be wrong, just sharing my perspective on the code/spec).

What should the implementation do?

The return value PTHREAD_BARRIER_SERIAL_THREAD seems useful only from a clean-up perspective.
In this light, the possibility of a subsequent EBUSY failure on pthread_barrier_destroy adds unnecessary-seeming complexity to application code.
Your patch would address this.
But shouldn't the application code be checking return values no matter what?
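For concreteness, that caller-side checking might look like the sketch below (the helper name is made up; this assumes a spec-conforming pthread barrier whose destroy can return EBUSY). The retry loop is exactly the extra complexity in question.

#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical caller-side pattern: the serializer retries destroy
   until the EBUSY window (threads still leaving the barrier) closes. */
static void wait_then_destroy(pthread_barrier_t* barrier) {
  if (pthread_barrier_wait(barrier) == PTHREAD_BARRIER_SERIAL_THREAD) {
    while (pthread_barrier_destroy(barrier) == EBUSY)
      sched_yield();  /* let the remaining waiters finish exiting */
  }
}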

@ofrobots
Contributor Author

ofrobots commented Sep 27, 2018

Plausible. But from this I understand that you don't have proof, just a guess?

Am I reasonably certain? Yes. The crashes that I observe match the thread races as I have described them. With my change, the crashes get fixed.

The polyfill is not incorrect according to the spec,

I can buy that argument. The more interesting question is whether uv_barrier_t has been usefully implemented by libuv. The polyfill is an implementation detail. There are lots of other ways to implement uv_barrier_t. Whether or not the pthread_barrier_destroy polyfill does what the POSIX spec says is likewise an implementation detail. libuv may or may not implement it per the spec, but it must implement uv_barrier_t usefully.

But shouldn't the application code be checking return values no matter what?

The only documentation for uv_barrier_t is the example I pasted above. The example does not presently work on a mac. The EBUSY return value is neither documented, nor is it actually returned by the mac implementation. As you point out, it would be unnecessary complexity for user code anyway.

With that in mind, I stand by the fix I provided here, but I am open to alternative suggestions if you have them.

@davisjam
Contributor

Discussion

The crashes that I observe match the thread races as I have described them.

Maybe I misunderstood the behavior you are experiencing.

  1. Are the crashes occurring in one of the pthread_barrier_X functions? (if so, we should fix this)
  2. Or are they due to an assumption about the state of the barrier in your code? (if so, this is more a question of usability)

The only documentation for uv_barrier_t is the example I pasted above.

The libuv docs say here that "The API largely follows the pthreads API", which is why I was looking at the POSIX spec.

Your patch

I am concerned about your implementation in the case of a barrier initialized with a limit of 1.
It looks like your patch will return 0 in this case?

 if (++barrier->in == barrier->threshold) {
   barrier->in = 0;
   barrier->out = barrier->threshold - 1;
   rc = pthread_cond_signal(&barrier->cond);
   assert(rc == 0);

   pthread_mutex_unlock(&barrier->mutex);
-  return PTHREAD_BARRIER_SERIAL_THREAD;
+  return 0;
Contributor
@davisjam davisjam Sep 27, 2018

In a 1-thread barrier, shouldn't this return PTHREAD_BARRIER_SERIAL_THREAD?
Can we put the do-while loop under an else so that all threads share the same exit path?

Contributor Author

👍 Fixed.

@@ -76,6 +76,7 @@ int pthread_barrier_init(pthread_barrier_t* barrier,

 int pthread_barrier_wait(pthread_barrier_t* barrier) {
   int rc;
+  char last;
Member

Why did you use char instead of int?

Contributor Author

Fixed.

  }
  /* Otherwise, wait for other threads until in is set to 0,
     then return 0 to indicate this is not the first thread. */
  do {
    if ((rc = pthread_cond_wait(&barrier->cond, &barrier->mutex)) != 0)
      break;
  } while (barrier->in != 0);
  assert(rc == 0);
Member

This should probably be just:

do {
  rc = pthread_cond_wait(&barrier->cond, &barrier->mutex);
  assert(rc == 0);
} while (barrier->in != 0);

Member

Now that I think about it... everywhere else we call abort() when pthread functions fail when they logically can't. Would be good to do that here too.

Contributor Author

Done. There are a few more places where we should be aborting if a pthread function fails illogically. I didn't include those in this PR to keep the review delta small. I think they can be addressed in a follow-on.

-  barrier->out--;
+  /* Mark thread exit. If this is the last thread to exit, return
+     PTHREAD_BARRIER_SERIAL_THREAD. */
+  last = (--barrier->out == 0);
   pthread_cond_signal(&barrier->cond);
Member

Optimization: if this is the last thread, don't call pthread_cond_signal(); there isn't anyone to signal.

Contributor Author

Done.
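Pulling these review comments together, the wait path ends up looking roughly like the sketch below: one declaration per line, abort() when a pthread call fails though it logically cannot, a single exit path shared by all threads (so a barrier with a threshold of 1 also yields the serial value), and no wake-up signal from the very last thread. This is a reconstruction for readability, assuming the polyfill's pthread_barrier_t fields shown in the fragments above; it is not the verbatim patch.

#include <pthread.h>
#include <stdlib.h>

int pthread_barrier_wait(pthread_barrier_t* barrier) {
  int rc;
  int last;

  if ((rc = pthread_mutex_lock(&barrier->mutex)) != 0)
    return rc;

  if (++barrier->in == barrier->threshold) {
    /* Last thread to enter: reset `in` and start releasing waiters. */
    barrier->in = 0;
    barrier->out = barrier->threshold;
    if (pthread_cond_signal(&barrier->cond) != 0)
      abort();
  } else {
    /* Wait until the last thread to enter resets `in` to 0. */
    do {
      if (pthread_cond_wait(&barrier->cond, &barrier->mutex) != 0)
        abort();
    } while (barrier->in != 0);
  }

  /* Shared exit path: the last thread to EXIT is the serializer. */
  last = (--barrier->out == 0);
  if (!last) {
    /* Someone is still inside; pass the wake-up along. */
    if (pthread_cond_signal(&barrier->cond) != 0)
      abort();
  }

  if (pthread_mutex_unlock(&barrier->mutex) != 0)
    abort();

  return last ? PTHREAD_BARRIER_SERIAL_THREAD : 0;
}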

@ofrobots
Contributor Author

@davisjam

Maybe I misunderstood the behavior you are experiencing.
Are the crashes occurring in one of the pthread_barrier_X functions? (if so, we should fix this)
Or are they due to an assumption about the state of the barrier in your code? (if so, this is more a question of usability)

I have added a commit with a test (barrier_serial_thread) that fails without the fix included in the PR.

Member
@bnoordhuis bnoordhuis left a comment

Thanks, mostly LGTM. Is the third commit (the one that registers barrier_serial_thread_single) supposed to be squashed with the first one?

test/test-barrier.c
@@ -104,3 +104,38 @@ TEST_IMPL(barrier_3) {

  return 0;
}

static void serial_worker(void* data) {
  uv_barrier_t* barrier = (uv_barrier_t*) data;
Member

Unnecessary cast.

Contributor Author

Fixed.

   exited the barrier. If this value is returned too early and the barrier is
   destroyed prematurely, then this test may see a crash. */
TEST_IMPL(barrier_serial_thread) {
  uv_thread_t threads[NUM_WORKERS];
Member

It'd be slightly more idiomatic to declare this as uv_thread_t threads[4]; and then use ARRAY_SIZE(threads) everywhere.

Contributor Author

Done.


  ASSERT(0 == uv_barrier_init(&barrier, NUM_WORKERS + 1));

  for (int i = 0; i < NUM_WORKERS; ++i) {
Member

Can you declare int i at the top?

Contributor Author

Done.
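With those review comments applied, the test reads roughly as the sketch below. This is a reconstruction, not the verbatim commit; TEST_IMPL, ASSERT, and ARRAY_SIZE come from the libuv test harness (test/task.h).

#include "uv.h"
#include "task.h"

static void serial_worker(void* data) {
  uv_barrier_t* barrier = data;  /* no cast needed from void* in C */
  if (uv_barrier_wait(barrier) > 0)
    uv_barrier_destroy(barrier);  /* whoever is last out cleans up */
}

/* The serial value must not be handed out until every thread has
   exited the barrier; otherwise the destroy above races with threads
   still inside uv_barrier_wait(). */
TEST_IMPL(barrier_serial_thread) {
  uv_thread_t threads[4];
  uv_barrier_t barrier;
  int i;

  ASSERT(0 == uv_barrier_init(&barrier, ARRAY_SIZE(threads) + 1));

  for (i = 0; i < (int) ARRAY_SIZE(threads); i++)
    ASSERT(0 == uv_thread_create(&threads[i], serial_worker, &barrier));

  /* The main thread participates as the (threads + 1)th waiter. */
  if (uv_barrier_wait(&barrier) > 0)
    uv_barrier_destroy(&barrier);

  for (i = 0; i < (int) ARRAY_SIZE(threads); i++)
    ASSERT(0 == uv_thread_join(&threads[i]));

  return 0;
}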

@davisjam
Contributor

I have added a commit with a test (barrier_serial_thread) that fails without the fix included in the PR.

I have (at last!) identified the cause. Maybe this was obvious to everyone already, but just in case...

I have been studying the source of the libuv polyfills, and I can't find a thread schedule that makes pthread_barrier_wait and pthread_barrier_destroy misbehave.

But you're calling uv_barrier_X directly! (which makes sense)
The pthread_barrier_destroy polyfill returns EBUSY correctly in the event of not-yet-exited threads, but then in uv_barrier_destroy we abort on a non-zero rc:

void uv_barrier_destroy(uv_barrier_t* barrier) {
  if (pthread_barrier_destroy(barrier))
    abort();
}

The abort in uv_barrier_destroy is incorrect because of the thread schedule you've pointed out, and it's also incorrect according to the POSIX spec.
Your PR will ensure this abort never triggers, but I don't think the abort should exist at all.

In summary:

  1. The bug you're reporting could be addressed with a one-liner to remove the abort in uv_barrier_destroy. Unfortunately, uv_barrier_destroy returns void so there would be no way to propagate the error. Users could leak memory in the event of an error.
  2. Your patch takes the approach of preventing the "bad" schedule from occurring, so there will be no memory leaks when using the libuv polyfill. But this non-spec-compliant abort may still cause program crashes in other implementations of pthread.

I don't like the prospect of memory leaks, so I support your patch.
I also don't like the idea of non-spec-compliant aborts, so I think we should remove this abort as part of the PR.
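For concreteness, the non-aborting shape would be something like the sketch below. It is hypothetical: it changes uv_barrier_destroy's signature from void to int, which is part of why this deserves its own discussion.

/* Hypothetical variant: propagate the polyfill's error (e.g. EBUSY)
   to the caller instead of calling abort(). */
int uv_barrier_destroy(uv_barrier_t* barrier) {
  return pthread_barrier_destroy(barrier);
}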

@ofrobots
Contributor Author

@davisjam

I also don't like the idea of non-spec-compliant aborts, so I think we should remove this abort as part of the PR.

It is not clear (to me) which spec applies here anyway. uv_barrier_t has very little documentation. There are hints that the thread APIs in general are supposed to "largely follow" pthreads, but the barrier docs and the code sample provided don't really seem to mirror pthreads at this point. For example, the void return on uv_barrier_wait.

I suggest we move this discussion to another issue as it is not necessary to resolve the bug that I am trying to fix.

@bnoordhuis I've squashed the commit and addressed the rest of the comments.

Thank you both for your review and insights!

@cjihrig
Contributor

cjihrig commented Sep 30, 2018

Should this be targeting v1.x?

@davisjam
Contributor

Agreed, @ofrobots can you rebase?

@ofrobots ofrobots changed the base branch from master to v1.x October 1, 2018 16:49
@ofrobots
Contributor Author

ofrobots commented Oct 1, 2018

@cjihrig @davisjam Done.

Contributor
@davisjam davisjam left a comment

LGTM. I'm still interested in opinions about the assert in uv_barrier_destroy, but this is not a blocker for me.

@ofrobots
Contributor Author

ofrobots commented Oct 1, 2018

@davisjam can we fork that discussion into a separate issue? I would like this change to land. This fix is blocking a cascade of PRs in Node.js.

@davisjam
Contributor

davisjam commented Oct 1, 2018

@ofrobots Sure. It's not a blocker for me. I opened #2011 for discussion.

@ofrobots
Contributor Author

ofrobots commented Oct 1, 2018

What are the next steps here, is there a CI somewhere that can be run?

Member
@bnoordhuis bnoordhuis left a comment

is there a CI somewhere that can be run?

Indeed there is: https://ci.nodejs.org/view/libuv/job/libuv-test-commit/1047/

@@ -83,7 +83,7 @@ int pthread_barrier_init(pthread_barrier_t* barrier,
 }

 int pthread_barrier_wait(pthread_barrier_t* barrier) {
-  int rc;
+  int rc, last;
Member

Minor style issue: one declaration per line.

Contributor Author

Done.

The thread returning PTHREAD_BARRIER_SERIAL_THREAD should be the one
last to exit the barrier, otherwise it is unsafe to destroy the
barrier – other threads may have yet to exit.
@ofrobots
Contributor Author

ofrobots commented Oct 2, 2018

Windows test failure seems unrelated. AIX failure is relevant and needs to be investigated. I have opened nodejs/build#1514 to get login access to AIX to be able to investigate this.

@ofrobots
Contributor Author

ofrobots commented Oct 3, 2018

It seems that the AIX implementation of pthread_barrier_wait does indeed return PTHREAD_BARRIER_SERIAL_THREAD early – before the other threads have exited the barrier.

This means that the default example of uv_barrier_wait in the docs may abort, as @davisjam was worried:

if (uv_barrier_wait(&barrier) > 0)
    uv_barrier_destroy(&barrier);

I am forced to conclude that the abort() in uv_barrier_destroy() is not safe on AIX, and that we should change the default example to something like this instead:

uv_barrier_wait(&barrier);
if (0 == uv_barrier_destroy(&barrier)) {
  // any additional cleanup, e.g. free of the barrier.
}

@ofrobots
Contributor Author

ofrobots commented Oct 3, 2018

I also noticed that we have AIX 6.1 in the CI. As per https://www-01.ibm.com/support/docview.wss?uid=isg3T1012517, AIX 6.1 may have reached EOL last year. It is possible that the behaviour is different or fixed on AIX 7.

Looking at node supported platforms: https://github.com/nodejs/node/blob/master/BUILDING.md#supported-platforms-1, AIX 7 is the minimum supported. libuv claims to support AIX 6 though.

bnoordhuis added a commit to bnoordhuis/libuv that referenced this pull request Oct 3, 2018
It was pointed out that pthread_barrier_wait() behaves slightly
different from other platforms. Switch to libuv's own thread barrier
for uniformity of behavior. Perhaps we'll do that for more platforms
in the future.

Refs: libuv#2003 (comment)
@bnoordhuis
Member

Maybe we should use our own barrier implementation everywhere for the sake of uniform behavior, or at least switch to it on AIX. I've opened #2019 and included the changes from this PR.

@ofrobots
Contributor Author

ofrobots commented Oct 3, 2018

👍. Deferring to #2019.

@ofrobots ofrobots closed this Oct 3, 2018
@ofrobots ofrobots deleted the pushing-barriers branch October 3, 2018 17:29
bnoordhuis added a commit to bnoordhuis/libuv that referenced this pull request Oct 8, 2018
It was pointed out that pthread_barrier_wait() behaves slightly
different from other platforms. Switch to libuv's own thread barrier
for uniformity of behavior. Perhaps we'll do that for more platforms
in the future.

PR-URL: libuv#2019
Refs: libuv#2003 (comment)
Reviewed-By: Santiago Gimeno <[email protected]>
bnoordhuis added a commit to bnoordhuis/libuv that referenced this pull request Oct 8, 2018
Libuv's own thread barrier implementation signaled completion to the
first waiter that saw the threshold being reached, contrary to what
some native pthreads barrier implementations do, which is to signal
it to the _last_ waiter.

Libuv's behavior is not strictly non-conforming but it's inconvenient
because it means this snippet (that appears in the libuv documentation)
has a race condition in it:

    if (uv_barrier_wait(&barrier) > 0)
      uv_barrier_destroy(&barrier);  // can still have waiters

This issue was discovered and fixed by Ali Ijaz Sheikh, a.k.a @ofrobots,
but some refactoring introduced conflicts in his pull request and I
didn't have the heart to ask him to redo it from scratch. :-)

PR-URL: libuv#2019
Refs: libuv#2003
Reviewed-By: Santiago Gimeno <[email protected]>
Successfully merging this pull request may close these issues.

Platform thread race: process.exit executed before background threads are ready