Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DefaultRwLock accumulates write-waiters, eventually fails to write lock #13163

Closed
jumpnbrownweasel opened this issue Oct 14, 2022 · 10 comments · Fixed by #13180
Closed

DefaultRwLock accumulates write-waiters, eventually fails to write lock #13163

jumpnbrownweasel opened this issue Oct 14, 2022 · 10 comments · Fixed by #13180
Labels
bug Observed behavior contradicts documented or intended behavior standard library This issue involves writing Zig code for the standard library.
Milestone

Comments

@jumpnbrownweasel
Copy link
Contributor

jumpnbrownweasel commented Oct 14, 2022

Zig Version

0.10.0-dev.4280+c3d67c5c4

Steps to Reproduce

Run the following test and expect it to run to completion (not hang).

test "RwLock" {
    var lock = std.Thread.RwLock.DefaultRwLock{};
    var i: usize = 0;
    while (i < std.math.maxInt(u32)) : (i += 1) {
        lock.lock();
        lock.unlock();
    }
    std.debug.print("\nAbout to lockShared...\n", .{});
    lock.lockShared();
}

Let it run for several minutes since it takes some time to run that many iterations, but eventually it will stop consuming CPU and hang waiting for a semaphore in the DefaultRwLock.lock method.

Expected Behavior

Prints the "About to lockShared..." message after several minutes and finishes running (no errors).

Actual Behavior

After several minutes it will stop consuming CPU and hang waiting for a semaphore in the DefaultRwLock.lock method. Although a large number (2^30 I believe) write locks must be taken before the problem occurs, it could definitely occur in a long running program.

Note that the Zig version doesn't matter. It fails on latest master, 0.9.0, and probably every other version, since the DefaultRwLock code has not changed and is neglecting to decrement the writes-waiting counter. I'll provide more info and a potential fix later in the issue.

The problem was originally found by @Curious Thing on Discord here:
https://discord.com/channels/605571803288698900/1028554539210645586

@jumpnbrownweasel jumpnbrownweasel added the bug Observed behavior contradicts documented or intended behavior label Oct 14, 2022
@jumpnbrownweasel
Copy link
Contributor Author

I'm working on a PR and I've attached a diff here with a very minimal fix and test. I think more general tests should be added as well with the PR, since there are currently no tests at all for RwLock, and that's what I'll be working on.

The (fairly obvious) bug is that WRITE is never subtracted, only added, by the lock method. The fix is to subtract it before lock returns, without changing any other code or behavior. Note that IS_WRITING is set before subtracting WRITE, so the "writing" state checked by other methods (state & (IS_WRITING | WRITER_MASK) == 0) is not impacted.

prelim_fix.txt

Let me know if for some reason you don't want me to proceed, e.g., if RwLock is not intended to work (I say this only because there are no tests).

@jumpnbrownweasel
Copy link
Contributor Author

The reason that the reproduction at the top hangs is because the write count overflows into the read count (they're adjacent bit fields in a single integer), and if the read count is non-zero the lock method will wait for a semaphore that is never fired.

@jumpnbrownweasel
Copy link
Contributor Author

jumpnbrownweasel commented Oct 14, 2022

You may ask, why not fix it by subtracting WRITE in the unlock method? The reason is that WRITE is not added in the tryLock method, and we can't have unbalanced additions and subtractions. This is also good justification for subtracting it before the lock method returns -- locked with a zero write count is already a known valid state.

@andrewrk andrewrk added the standard library This issue involves writing Zig code for the standard library. label Oct 15, 2022
@andrewrk andrewrk added this to the 0.10.1 milestone Oct 15, 2022
@andrewrk
Copy link
Member

cc @kprotty

@kprotty
Copy link
Member

kprotty commented Oct 15, 2022

This RwLock was incorrectly ported from this repo. For the lock() path, adding the IS_WRITING bit should've also removed a WRITER which it's not doing here. The fix is most likely something like this:

//const state = @atomicRmw(usize, &rwl.state, .Or, IS_WRITING, .SeqCst);
const state = @atomicRmw(usize, &rwl.state, .Add, IS_WRITING -% WRITER, .SeqCst);

A while back I started rewriting some of the sync primitives but stuff came up and never got to RwLock. If it does end up being fixed, I'd suggest taking that opportunity to make it smaller & based on Futex like this algorithm or the writer-preferring version of it Rust stdlib ended up using.

@jumpnbrownweasel
Copy link
Contributor Author

Thanks for commenting on this @kprotty!

const state = @atomicRmw(usize, &rwl.state, .Add, IS_WRITING -% WRITER, .SeqCst);

That makes sense and is a direct port of the C++ code, so should not be risky. I'll try that instead of the separate Sub I am using.

I have another question for you. Let's say we decide to keep this implementation and fix it as above, at least temporarily. I also noticed that the C++ implementation uses Acquire/Release instead of SeqCst, and I think this would make a big performance difference. Do you see any reason not to use Acquire/Release (matching the C++ code) in the Zig implementation?

@kprotty
Copy link
Member

kprotty commented Oct 15, 2022

There's a lot of orderings that could be weakened in the implementation, and SeqCst isn't expensive. On x86, all atomic rmw instructions use the lock prefix regardless of ordering. On aarch64, SeqCst and AcqRel are both Acq and Rel barriers for either LL/SC loops or the specialized instructions on v8.1+ so they codegen to the same thing.

In regards to that line in particular, I believe it can be Acquire? It at least needs to have the READER sub in unlockShared() happens-before it (when it observes READER_MASK = 0) to avoid memory writes it's trying to protect be reordered before it and before the reader is observed to be finished reading.

@jumpnbrownweasel
Copy link
Contributor Author

I didn't mean to substitute AcqRel for all SeqCst, I meant to use the same orderings (Acquire, Release or AcqRel) that are used in the C++ implementation, in all cases.

Looking further, in the C++ code, AcqRel is used in all rmw cases so there is no difference there. Acquire is used only for loads prior to use of AcqRel, so I guess you're right and the difference would be very minor.

Thanks for your reply.

@jumpnbrownweasel
Copy link
Contributor Author

Here is a diff with the modified fix (from kprotty) along with a test for the specific problem and general testing of RwLock (there were no tests previously). I think I should create a PR with this diff now, and that a more efficient rewrite (as kprotty suggests) be done separately. Sound Ok? @andrewrk

fix_with_tests.txt

@kprotty
Copy link
Member

kprotty commented Oct 16, 2022

I think you could've went ahead and created the PR. Then, the review could happen there as it's easier to do so vs. a text file.

kprotty pushed a commit that referenced this issue Oct 17, 2022
…ails to write lock (#13180)

* Fix for: DefaultRwLock accumulates write-waiters, eventually fails to write lock #13163

* Comment out debug.print at the end of the last test.

* Code formatting

* - use equality test after lock/unlock rather than peeking into internals.
  however, this is still implementation specific and only done for
  DefaultRwLock.
- add num_reads maximum to ensure that reader threads stop if writer threads are
  starved
- use relaxed orderings for the read atomic counter
- don't check at the end for non-zero read ops, since the reader threads may
  only run once if they are starved

* More review changes
- Monotonic is sufficient for incrementing the reads counter
@andrewrk andrewrk modified the milestones: 0.10.1, 0.10.0 Oct 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Observed behavior contradicts documented or intended behavior standard library This issue involves writing Zig code for the standard library.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants