
Make std::sync::Arc compatible with ThreadSanitizer #65097

Merged · 1 commit merged into rust-lang:master from tmiasko:arc on Mar 21, 2020

Conversation

@tmiasko
Contributor

tmiasko commented Oct 4, 2019

The memory fences used previously in the Arc implementation are not properly
understood by ThreadSanitizer as synchronization primitives. This had the
unfortunate effect that running any non-trivial program compiled with
-Z sanitizer=thread would result in numerous false positives.

Replace acquire fences with acquire loads to address the issue.

Fixes #39608.
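
For illustration, a minimal sketch of the pattern involved (simplified: the function signatures and names here are illustrative, not the actual libstd source):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering::{Acquire, Release}};

// Fence-based variant (previous code). ThreadSanitizer does not model
// the fence, so it reports a false race between the final decrement
// and the subsequent destruction of the contents.
fn drop_with_fence(strong: &AtomicUsize) {
    if strong.fetch_sub(1, Release) != 1 {
        return;
    }
    fence(Acquire);
    // ... drop the contents here ...
}

// Load-based variant (this PR). An acquire load on the same location
// establishes the same happens-before edge in a way the sanitizer
// understands.
fn drop_with_load(strong: &AtomicUsize) {
    if strong.fetch_sub(1, Release) != 1 {
        return;
    }
    let _ = strong.load(Acquire);
    // ... drop the contents here ...
}

fn main() {
    let strong = AtomicUsize::new(1);
    drop_with_load(&strong); // count 1 -> 0: takes the slow path
}
```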

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @joshtriplett (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 4, 2019
@jonas-schievink jonas-schievink added the A-sanitizers Area: Sanitizers for correctness and code quality label Oct 4, 2019
@Centril
Contributor

Centril commented Oct 4, 2019

r? @RalfJung (take your time)

@RalfJung
Member

RalfJung commented Oct 5, 2019

Cc @jhjourdan, who did the formal verification of the current Arc.

@RalfJung
Member

RalfJung commented Oct 5, 2019

I don't think I know the correctness argument of Arc well enough to review this on my own -- I have never tried to convince myself of its correctness. Some of my colleagues did, though; I am trying to get them to help.

Do we have anyone else who is enough of an expert in weak memory to help? IIRC @aturon wrote most of this, but they are busy with other things these days.

@RalfJung
Member

RalfJung commented Oct 5, 2019

Looking at the changes, this seems to mostly undo deliberate optimizations that specifically keep synchronization to the absolute minimum necessary. I am not sure that "sanitizers do not properly support fences" is a good enough argument to change our code -- shouldn't the sanitizers be fixed instead? Fences are an important part of Rust's and C's concurrency model.

I feel that before we do a correctness review, we need a policy decision by some team (T-libs, I assume) on whether we are willing to reduce code quality in order to help sanitizers.

@RalfJung RalfJung added I-nominated T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Oct 5, 2019
@tmiasko
Contributor Author

tmiasko commented Oct 5, 2019

AFAIK there are no known algorithmic approaches that would let a race detector
support memory fences with practical runtime overhead. Presuming that
ThreadSanitizer should be fixed is therefore not a solution; other approaches
might be. For example, conditional compilation would be another option to
consider (if that is more agreeable?).

Changes here have no effect on tier 1, since the generated code on i686 / x86_64
is unaffected. On AArch64 the assembly changes roughly reflect those made in the
code, i.e., `dmb ishld` is removed and `ldxr` is replaced with `ldaxr`.

I also mentioned other implementations (libc++ / libstdc++) as an implicit
reflection of their views on the trade-offs involved. Personally I find
ThreadSanitizer invaluable, and if the only thing needed to make it work
correctly is to change a standalone memory fence into an operation on a concrete
memory location, the decision to me is clear. This is especially true here,
since the issue is limited to Arc.

@RalfJung
Member

RalfJung commented Oct 5, 2019

> Changes here have no effect on tier 1, since the generated code on i686 / x86_64 is unaffected.

Did you test that? I know that the lowering of atomic operations to x86 assembly is the same, but this change will affect compiler transformations even on x86, so there can still easily be differences.

@tmiasko
Contributor Author

tmiasko commented Oct 5, 2019

I extracted the code from Arc (before and after the changes) and examined the
assembly. On i686 / x86_64 the only difference was the disappearance of the
`#MEMBARRIER` compiler marker after the changes, which is actually promising.

@RalfJung
Member

RalfJung commented Oct 9, 2019

> AFAIK there are no known algorithmic approaches that would let a race detector support memory fences with practical runtime overhead.

Interesting. I thought I knew that these fences are equivalent to doing a release-acquire RMW on a global location, which would be trivial to check algorithmically (assuming a release/acquire checker is already implemented), but I may misremember.

> I extracted the code from Arc (before and after the changes) and examined the assembly. On i686 / x86_64 the only difference was the disappearance of the `#MEMBARRIER` compiler marker after the changes, which is actually promising.

Thanks, that is useful.


So, the nominated question here for @rust-lang/libs is: we have a trade-off between (a) keeping the existing, highly optimized, well-reviewed, and even formally analyzed code, which, however, at least current dynamic thread sanitizers cannot handle properly; and (b) changing this code to be less efficient in theory due to stronger synchronization, but ultimately simpler for the same reason (and with likely no perf change in practice), and working better with thread sanitizers -- at the cost of modifying extremely subtle, very well-tested code (any change carries some risk of introducing a bug) and losing the existing formal results. What do you think we should do? I am asking you because I don't think I should make such calls.

My personal feeling is: I am a big fan of sanitizers, so it seems worth sacrificing some entirely theoretical performance benefit of the current code for better testability. However, losing the existing formal analyses is an unfortunate regression. That said, comparing the assembly gives some assurance that at least for simple clients, the behavior did not change.

@Amanieu
Member

Amanieu commented Oct 9, 2019

I would like to benchmark this on ARM first, since that platform seems to be affected. The main issue here is that we would have to unconditionally emit an acquire fence even if the ref count isn't dropping to zero.

@RalfJung RalfJung added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 10, 2019
@Amanieu
Member

Amanieu commented Oct 10, 2019

I ran a quick benchmark comparison of fetch_sub(Release) and fetch_sub(AcqRel) but couldn't find a difference in performance (they both report a consistent 20 ns). I did check the assembly: Release uses `ldxr` while AcqRel uses `ldaxr`, and that is the only difference.

In theory `ldaxr` is slower than `ldxr`, since it acts as an acquire barrier and prevents loads after it from being executed before it on an out-of-order CPU. I guess this isn't visible in this benchmark since there are no other loads.
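
A rough sketch of such a micro-benchmark (illustrative only; this is not the exact harness used, and wall-clock loops like this are noisy):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Instant;

fn bench(ordering: Ordering) {
    let counter = AtomicUsize::new(usize::MAX);
    let iterations = 1_000_000u32;
    let start = Instant::now();
    for _ in 0..iterations {
        // On AArch64 this compiles to an ldxr/stxr loop with Release,
        // and to an ldaxr/stxr loop with AcqRel.
        counter.fetch_sub(1, ordering);
    }
    println!("{:?}: {:?} per op", ordering, start.elapsed() / iterations);
}

fn main() {
    bench(Ordering::Release);
    bench(Ordering::AcqRel);
}
```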

@joshtriplett
Member

joshtriplett commented Oct 12, 2019 via email

@alexcrichton
Member

The comments in the code indicate where all this logic originally came from (Boost), and its history also shows that this is an extremely performance-sensitive operation (see #41714 for example). Benchmarking this change to evaluate its performance impact would be quite difficult, but leaning on users who have previously benchmarked various changes here (such as Servo/Gecko) is a good start.

I don't think the correctness of the current code is in question; this looks like a change intended to make Arc compatible with ThreadSanitizer. Given that it may be a regression to a performance-sensitive operation, I would personally prefer not to merge this PR as-is until sufficient data is gathered showing that it isn't a performance regression. If it is indeed a regression, or if no one wishes to gather that data, then landing this would require some form of conditional compilation. That is also somewhat hard to do, since this is already quite tricky code to read, and adding a conditional path for the only-rarely-used thread sanitization isn't necessarily great.

@JohnTitor
Member

Ping from triage: @rust-lang/libs @tmiasko any updates on this?

@pitdicker
Contributor

pitdicker commented Oct 30, 2019

To summarize, this PR is changing a few cases of:

```rust
fn drop(&mut self) {
    if self.inner().strong.fetch_sub(1, Release) != 1 {
        return;
    }
    atomic::fence(Acquire);
    /* drop contents of Arc */
}
```

to:

```rust
fn drop(&mut self) {
    if self.inner().strong.fetch_sub(1, AcqRel) != 1 {
        return;
    }
    /* drop contents of Arc */
}
```

This places an extra requirement to do acquire synchronization in what may be hot code (cloning and dropping Arc references should be cheap). Previously the acquire was in the colder path that drops the final reference of the Arc and destroys the contents.

If avoiding fences is a goal, something like #41714 seems like the much better option to me, i.e., change the same code to:

```rust
fn drop(&mut self) {
    if self.inner().strong.fetch_sub(1, Release) != 1 {
        return;
    }
    let _ = self.inner().strong.load(Acquire);
    /* drop contents of Arc */
}
```

This keeps the hot path the same, and adds at most an extra mov in the colder path.

I would actually feel better about an acquire load instead of a fence, because of the trickiness of fences (Atomic-fence synchronization). Using a fence just to optimize out one instruction in a colder path seems like a questionable optimization, especially because the processor knows just as well as we do that there are no other references to this atomic. It can't have been changed in the meantime, and should still be in a register or cache, so the mov should basically be free.

@pitdicker
Contributor

> Interesting. I thought I knew that these fences are equivalent to doing a release-acquire RMW on a global location, which would be trivial to check algorithmically (assuming a release/acquire checker is already implemented), but I may misremember.

According to Atomic-fence synchronization, a fence binds to a nearby atomic operation:

```rust
atomic_a.load(Relaxed);
atomic_b.store(1, Relaxed);
fence(Acquire);
atomic_b.load(Relaxed);
```

An acquire fence works together with the atomic that last did a load operation before it, in the example atomic_a.load(Relaxed). In theory it only has to synchronize data with the other threads that did a release on atomic_a.

In the same vein, fence(Release) works together with the store on an atomic right after it.

The reordering rules for fences are carefully worded so as not to talk about the fences themselves, but about the previous read or the next write (Acquire and Release Fences Don't Work the Way You'd Expect).

In all cases a fence needs operations on some atomic to bind to; fence-fence synchronization without atomics does not seem to be a thing.
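
A minimal sketch of this binding, where the acquire fence binds to the relaxed load of FLAG (incidentally, this is exactly the fence-based pattern ThreadSanitizer cannot model):

```rust
use std::sync::atomic::{fence, AtomicBool, Ordering::{Acquire, Relaxed, Release}};
use std::thread;

static FLAG: AtomicBool = AtomicBool::new(false);
static mut DATA: u32 = 0;

fn main() {
    let producer = thread::spawn(|| {
        unsafe { DATA = 42 };        // non-atomic write
        fence(Release);              // binds to the relaxed store below
        FLAG.store(true, Relaxed);
    });
    let consumer = thread::spawn(|| {
        while !FLAG.load(Relaxed) {} // relaxed load the acquire fence binds to
        fence(Acquire);              // synchronizes-with the release fence above
        assert_eq!(unsafe { DATA }, 42); // happens-before is established
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
```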

@RalfJung
Member

RalfJung commented Nov 3, 2019

> In all cases a fence needs operations on some atomic to bind to; fence-fence synchronization without atomics does not seem to be a thing.

Indeed, synchronization always arises from a reads-from edge. Fences can just "upgrade" relaxed reads/writes to still incur synchronization.

@JohnTitor
Member

Ping from triage: @tmiasko and @rust-lang/libs what do you think about the above comments?

@alexcrichton
Member

I think my previous comment about a performance investigation still stands.

@tmiasko
Contributor Author

tmiasko commented Nov 21, 2019

The approaches suggested so far (including the current implementation) are:

  1. Use an acquire fence (currently used in std::sync::Arc).
  2. Use an acquire load instead of an acquire fence (currently used in servo_arc::Arc, for example).
  3. Use AcqRel ordering in fetch_sub, removing the fence.
  4. Conditional compilation. I am not particularly fond of this approach, but
    it could be limited to a conditional definition of a single constant,
    i.e., the ordering used with fetch_sub, while leaving the memory fence in place.

I performed micro-benchmarks on x86-64 (https://github.com/tmiasko/arc):

  • 1 and 3 generate equivalent code, so there is nothing interesting to see there.
  • When only the fast path is involved (clone and drop, but the refcount never reaches zero), the results of 1, 2 and 3 are indistinguishable, as the generated code is equivalent in that part.
  • When only the slow path is involved (create and immediately drop the Arc), the results are:
    • 30 ns for 1 and 3
    • 39 ns for 2 (additional mov instructions)

@pitdicker I agree that the second approach would be preferable to the third
one from the perspective of architectures with weaker memory models, while
still being compatible with TSan.

@alexcrichton I looked briefly at the Servo benchmarks (test-dromaeo,
test-perf), but as far as I can see they are generally quite noisy, and
std::sync::Arc and servo_arc::Arc account for too little of the runtime, so any
changes like this are well within measurement error.

Unfortunately, neither of those approaches is a Pareto improvement over the
current state of affairs, so there is a trade-off involved.
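
A rough sketch of the fast-path and slow-path scenarios measured (an illustrative approximation, not the exact code in the linked repository):

```rust
use std::sync::Arc;
use std::time::Instant;

const N: u32 = 1_000_000;

fn main() {
    // Fast path: the refcount bounces between 1 and 2 and never reaches
    // zero, so the fence / acquire load is never executed.
    let keep = Arc::new(0u64);
    let start = Instant::now();
    for _ in 0..N {
        drop(Arc::clone(&keep));
    }
    println!("clone+drop (fast path): {:?} per op", start.elapsed() / N);

    // Slow path: every iteration drops the final reference and
    // deallocates, exercising the synchronization under discussion.
    let start = Instant::now();
    for _ in 0..N {
        drop(Arc::new(0u64));
    }
    println!("new+drop (slow path):   {:?} per op", start.elapsed() / N);
}
```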

@alexcrichton
Member

It's really difficult to measure the performance here, unfortunately, and I think that focusing on only one architecture may also be missing the purpose of these atomics.

I also believe that (2) as you proposed is an incorrect solution, because I believe an acquire load only synchronizes with one release store, whereas the reason we use a fence is that we want to synchronize with all of the release stores, not just one.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). labels Mar 20, 2020
@joshtriplett
Member

joshtriplett commented Mar 20, 2020 via email

@tmiasko
Contributor Author

tmiasko commented Mar 20, 2020

Note that the synchronization used here is not really stronger, so there is no
risk of that kind, except in the trivially true sense that with ThreadSanitizer
you are de facto executing completely different machine code, with additional
calls to the runtime for each atomic operation, etc.

Furthermore, if you are actually trying to debug issues in Arc / Weak,
ThreadSanitizer was of no use so far. On the other hand, with the changes
proposed here it is able to detect previous bugs when they are reintroduced.

@RalfJung
Member

One could actually argue the synchronization is weaker, because an acquire fence syncs with all release writes/fences, while an acquire load only syncs with release writes to the same location.
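
A sketch of that distinction (a hypothetical example with two flags, not code from this PR):

```rust
use std::sync::atomic::{fence, AtomicBool, Ordering::{Acquire, Relaxed}};

static FLAG_A: AtomicBool = AtomicBool::new(false);
static FLAG_B: AtomicBool = AtomicBool::new(false);

// One acquire fence pairs with release operations observed through
// *either* relaxed load, regardless of location.
fn acquire_via_fence() -> (bool, bool) {
    let a = FLAG_A.load(Relaxed);
    let b = FLAG_B.load(Relaxed);
    fence(Acquire);
    (a, b)
}

// An acquire load only synchronizes with release writes to its own
// location, so each location must be loaded with Acquire individually.
fn acquire_via_loads() -> (bool, bool) {
    (FLAG_A.load(Acquire), FLAG_B.load(Acquire))
}

fn main() {
    let _ = acquire_via_fence();
    let _ = acquire_via_loads();
}
```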

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 20, 2020
Make std::sync::Arc compatible with ThreadSanitizer

The memory fences used previously in the Arc implementation are not properly
understood by ThreadSanitizer as synchronization primitives. This had the
unfortunate effect that running any non-trivial program compiled with
`-Z sanitizer=thread` would result in numerous false positives.

Replace acquire fences with acquire loads to address the issue.

Fixes rust-lang#39608.
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 21, 2020
Make std::sync::Arc compatible with ThreadSanitizer
Centril added a commit to Centril/rust that referenced this pull request Mar 21, 2020
Make std::sync::Arc compatible with ThreadSanitizer
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 21, 2020
Rollup of 16 pull requests

Successful merges:

 - rust-lang#65097 (Make std::sync::Arc compatible with ThreadSanitizer)
 - rust-lang#69033 (Use generator resume arguments in the async/await lowering)
 - rust-lang#69997 (add `Option::{zip,zip_with}` methods under "option_zip" gate)
 - rust-lang#70038 (Remove the call that makes miri fail)
 - rust-lang#70058 (can_begin_literal_maybe_minus: `true` on `"-"? lit` NTs.)
 - rust-lang#70111 (BTreeMap: remove shared root)
 - rust-lang#70139 (add delay_span_bug to TransmuteSizeDiff, just to be sure)
 - rust-lang#70165 (Remove the erase regions MIR transform)
 - rust-lang#70166 (Derive PartialEq, Eq and Hash for RangeInclusive)
 - rust-lang#70176 (Add tests for rust-lang#58319 and rust-lang#65131)
 - rust-lang#70177 (Fix oudated comment for NamedRegionMap)
 - rust-lang#70184 (expand_include: set `.directory` to dir of included file.)
 - rust-lang#70187 (more clippy fixes)
 - rust-lang#70188 (Clean up E0439 explanation)
 - rust-lang#70189 (Abi::is_signed: assert that we are a Scalar)
 - rust-lang#70194 (#[must_use] on split_off())

Failed merges:

r? @ghost
@bors
Contributor

bors commented Mar 21, 2020

☔ The latest upstream changes (presumably #70205) made this pull request unmergeable. Please resolve the merge conflicts.

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Mar 21, 2020
@bors bors merged commit 4b91729 into rust-lang:master Mar 21, 2020
@tmiasko tmiasko deleted the arc branch March 21, 2020 07:54
@choller
Contributor

choller commented Apr 9, 2020

Thank you @tmiasko for working on this. This should get us one step closer to finding races between C++ and Rust code in Firefox.

And FWIW, we have taken similar measures to replace a fence in our implementation with a load for TSan:

https://searchfox.org/mozilla-central/rev/4e228dc5f594340d35da7453829ad9f3c3cb8b58/mfbt/RefCounted.h#142-149


Successfully merging this pull request may close these issues:

ThreadSanitizer detects a data race in the test runner (rustc --test)