
Use the 64b inner:monotonize() implementation not the 128b one for aarch64 #88651

Merged
merged 1 commit into rust-lang:master on Oct 5, 2021

Conversation

AGSaidi
Contributor

@AGSaidi AGSaidi commented Sep 4, 2021

aarch64 prior to v8.4 (FEAT_LSE2) doesn't have an instruction that guarantees
untorn 128b reads except for completing a 128b load/store exclusive pair
(ldxp/stxp) or compare-and-swap (casp) successfully. The requirement to
complete a 128b read+write atomic is actually more expensive and more unfair
than the previous implementation of monotonize(), which used a Mutex on aarch64,
especially at large core counts. For aarch64, switch to the 64b atomic
implementation, which is about 13x faster for a benchmark that involves many
calls to Instant::now().
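
(For readers unfamiliar with the 64b approach, here is a minimal, hypothetical sketch of the general idea, not the actual std internals: keep the last value handed out in an AtomicU64 and clamp each new reading against it with a single read-modify-write. Rollover handling and the packing of the timestamp into a u64 are omitted.)

    use std::sync::atomic::{AtomicU64, Ordering};

    // Last value handed out, packed into a u64 (e.g. nanoseconds from some base).
    static LAST_NOW: AtomicU64 = AtomicU64::new(0);

    // Return a value that never goes backwards: max(previous, packed_now).
    // On aarch64 this compiles to a 64-bit exclusive (or LSE) loop rather
    // than the 128-bit ldxp/stxp sequence discussed below.
    fn monotonize(packed_now: u64) -> u64 {
        let prev = LAST_NOW
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |last| {
                Some(last.max(packed_now))
            })
            .expect("closure always returns Some");
        prev.max(packed_now)
    }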

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @dtolnay (or someone else) soon.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 4, 2021
@the8472
Member

the8472 commented Sep 5, 2021

(continuing discussion from #88652)

I didn't expect any platform to use a read-write instruction to emulate 128-bit atomic loads. In that case, would it make sense to make this implementation conditional on the lse feature on aarch64?
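
(For illustration, the compile-time gating being suggested could look roughly like this; a hypothetical sketch, not existing code, and as the next comment explains it turns out not to be sufficient.)

    // Hypothetical sketch of gating on a compile-time target feature
    // (module names are illustrative, not std's actual internals):
    #[cfg(all(target_arch = "aarch64", target_feature = "lse"))]
    mod inner {
        // 128-bit AtomicU128-based monotonize would live here.
    }

    #[cfg(not(all(target_arch = "aarch64", target_feature = "lse")))]
    mod inner {
        // 64-bit AtomicU64-based monotonize would live here.
    }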

@AGSaidi
Contributor Author

AGSaidi commented Sep 5, 2021

Unfortunately, no. +lse isn't sufficient. While +lse should give us a casp instruction instead of an ldxp/stxp pair and be better than the load/store exclusives, it's still far worse than it could be. I can't find any target-features that convince LLVM that 128b loads are single-copy atomic; +v8.4a doesn't work even though it's mandated at that architecture revision. Also, strangely, while +lse emits a casp in a debug build, in a release build the ldxp/stxp pair is emitted. There are two bugs I'll follow up with LLVM on (the optimizer issue and not recognizing that 128b values are single-copy atomic in v8.4), but in the meantime this is the code that gets emitted for fetch_max. It requires the stxp to complete successfully, meaning that this is as expensive as a lock acquisition without the benefit of relying on pthread_mutex_lock, which detects the architecture revision and uses the best code path.

    7064:       c87f212a        ldxp    x10, x8, [x9] <--(loop trying to complete ld/st pair)--|
    7068:       f100295f        cmp     x10, #0xa                                             |
    706c:       1a9f97ec        cset    w12, hi  // hi = pmore                                |
    7070:       f100011f        cmp     x8, #0x0                                              |
    7074:       1a9f07ed        cset    w13, ne  // ne = any                                  |
    7078:       1a8d018c        csel    w12, w12, w13, eq  // eq = none                       |
    707c:       7100019f        cmp     w12, #0x0                                             |
    7080:       9a9f110c        csel    x12, x8, xzr, ne  // ne = any                         |
    7084:       9a8b114d        csel    x13, x10, x11, ne  // ne = any                        |
    7088:       c82e312d        stxp    w14, x13, x12, [x9]                                   |
    708c:       35fffece        cbnz    w14, 7064 <ld128b::main+0x14>     --------------------|
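
(For readers less fluent in A64, here is a rough, deliberately non-atomic Rust rendering of what that loop computes; it is only meant to show that even when the stored value does not change, the store-exclusive pair must still complete.)

    // Illustrative only: the real sequence is a single atomic RMW, retried
    // via cbnz whenever the stxp (store-exclusive pair) fails.
    fn fetch_max_u128(cell: &mut u128, new: u128) -> u128 {
        let old = *cell;                              // ldxp
        *cell = if new > old { new } else { old };    // stxp runs either way
        old
    }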

@AGSaidi
Contributor Author

AGSaidi commented Sep 7, 2021

A little program that demonstrates the performance impact:
instant-now-repro.tar.gz
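
(The tarball's contents aren't reproduced here, but a reproducer of this shape shows the same effect: many threads hammering Instant::now() so the monotonize path dominates. This is a hypothetical sketch, not the attached program.)

    use std::thread;
    use std::time::Instant;

    fn main() {
        // Spawn many threads that do nothing but call Instant::now() in a
        // tight loop; print how long each thread took.
        let handles: Vec<_> = (0..64)
            .map(|_| {
                thread::spawn(|| {
                    let start = Instant::now();
                    for _ in 0..1_000_000 {
                        std::hint::black_box(Instant::now());
                    }
                    start.elapsed()
                })
            })
            .collect();
        for h in handles {
            println!("{:?}", h.join().unwrap());
        }
    }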

@the8472 the8472 added A-time Area: Time T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Sep 8, 2021
@AGSaidi
Contributor Author

AGSaidi commented Sep 14, 2021

Any feedback?

@dtolnay
Member

dtolnay commented Sep 15, 2021

@rustbot ping arm
Would one of you be willing to sanity check the explanation in the PR description, and the discussion above? I'll approve the PR if it broadly seems reasonable but this is outside of my expertise.

@rustbot
Collaborator

rustbot commented Sep 15, 2021

Hey ARM Group! This bug has been identified as a good "ARM candidate".
In case it's useful, here are some instructions for tackling these sorts of
bugs. Maybe take a look?
Thanks! <3

cc @adamgemmell @hug-dev @jacobbramley @JamieCunliffe @joaopaulocarreiro @raw-bin @Stammark

@rustbot rustbot added the O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state label Sep 15, 2021
@the8472
Member

the8472 commented Sep 15, 2021

> There are two bugs I'll follow up with LLVM on (the optimizer issue and not recognizing that 128b values are single-copy atomic in v8.4)

When you do that, can you also open an issue here linking to the upstream bugs so we can update the cfg flags on the AtomicU128 version once they're fixed?

@AGSaidi
Contributor Author

AGSaidi commented Sep 15, 2021

> When you do that, can you also open an issue here linking to the upstream bugs so we can update the cfg flags on the AtomicU128 version once they're fixed?

@the8472 Absolutely, but the condition will become more complicated, as we'll need both an architecture check and a target-features check. I'm honestly not clear on how we can detect whether the target-feature we're passing in RUSTFLAGS ends up enabling certain functionality in LLVM, and even if we could, the standard library would presumably be compiled for maximum compatibility with v8.0-v8.x. Put another way, we'll only be able to choose the 128b version if the architecture version we're running on is v8.4 or higher. If we think it's really important, the right thing to do is likely to dynamically detect the architecture version and choose.
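
(A sketch of what that dynamic detection could look like; the helper names are placeholders and the feature-detection macro was not yet stable when this discussion took place.)

    #[cfg(target_arch = "aarch64")]
    use std::arch::is_aarch64_feature_detected;

    // Hypothetical dispatch: take the 128-bit path only when the running CPU
    // implements FEAT_LSE2 (Armv8.4+), where aligned 16-byte loads are
    // single-copy atomic; otherwise fall back to the 64-bit path.
    #[cfg(target_arch = "aarch64")]
    fn monotonize(now: u64) -> u64 {
        if is_aarch64_feature_detected!("lse2") {
            monotonize_128(now)
        } else {
            monotonize_64(now)
        }
    }

    #[cfg(target_arch = "aarch64")]
    fn monotonize_128(now: u64) -> u64 { now } // placeholder
    #[cfg(target_arch = "aarch64")]
    fn monotonize_64(now: u64) -> u64 { now }  // placeholder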

Assuming #88652 is merged, this will end up not being needed for aarch64/Linux at least; other OSes will continue to benefit.

@jacobbramley
Contributor

> @rustbot ping arm
> Would one of you be willing to sanity check the explanation in the PR description, and the discussion above? I'll approve the PR if it broadly seems reasonable but this is outside of my expertise.

I wonder what the performance is like on small systems (i.e. those with few cores). However, I think this PR is correct in the functional sense.

Investigating the LSE2 issue a bit, I've been reminded that, in general, we can't simply mix the legacy mutex- or ldxp/stxp-based atomics with LSE2's versions, at least not for the same objects. I think that LLVM won't generate the 8.4 sequence because doing so would make it ABI-incompatible with earlier code. Now, that might not actually be an issue here (e.g. if the field will never be exposed to such code), but that doesn't mean that LLVM provides a way to express that.

I'm not sure, but it may be difficult to use 8.4 LSE2 sequences without defining a new target, or requiring the use of Cargo's build-std. Our PAuth/BTI proposal (#88354) has a similar requirement.

@AGSaidi
Contributor Author

AGSaidi commented Sep 16, 2021

> I wonder what the performance is like on small systems (i.e. those with few cores). However, I think this PR is correct in the functional sense.

The performance will be better on small systems too. Using the 64b version means that the load is single-copy atomic without the heavyweight exclusive-pair completing. That should be good for all systems, big or small.

> Investigating the LSE2 issue a bit, I've been reminded that, in general, we can't simply mix the legacy mutex- or ldxp/stxp-based atomics with LSE2's versions, at least not for the same objects.

I don't think you mean LSE2 but perhaps just LSE. AFAIK, there isn't anything in the architecture that makes the code in this case incorrect; it's just perhaps non-performant.

> I think that LLVM won't generate the 8.4 sequence because doing so would make it ABI-incompatible with earlier code. Now, that might not actually be an issue here (e.g. if the field will never be exposed to such code), but that doesn't mean that LLVM provides a way to express that.

Looking through the LLVM source, I haven't found any evidence that the fact that 16B aligned loads are single-copy atomic in Armv8.4 has been implemented. I also don't see how it's an ABI break: 16B loads still have to be naturally aligned, and they already have to be for the exclusive-pair instructions. Can you explain? It absolutely is an Arm architecture version problem, in that it is only guaranteed by the architecture in v8.4+ and generally people build generic v8.0+ code, but I just don't think this feature has been implemented in LLVM yet. Even if it had been, it applies to so few of the Arm systems in the world that it's not worth agonizing over currently.

@jacobbramley
Contributor

>> I wonder what the performance is like on small systems (i.e. those with few cores). However, I think this PR is correct in the functional sense.

> The performance will be better on small systems too. Using the 64b version means that the load is single-copy atomic without the heavyweight exclusive-pair completing. That should be good for all systems, big or small.

I must admit that I didn't look in detail at the 64-bit version, but I understood that it used a mutex to protect a standard (non-atomic) pair access, which isn't obviously better (though it could be, and I won't dispute it).

>> Investigating the LSE2 issue a bit, I've been reminded that, in general, we can't simply mix the legacy mutex- or ldxp/stxp-based atomics with LSE2's versions, at least not for the same objects.

> I don't think you mean LSE2 but perhaps just LSE. AFAIK, there isn't anything in the architecture that makes the code in this case incorrect; it's just perhaps non-performant.

I'm still digging, to be honest, and some of my comment was a bit confused! I've seen some implementations use a locking implementation, but possibly with individual loads and stores within the locks (rather than ldp/stp). How relevant they are, I don't know, especially as they'd be incompatible with a lock-free ldxp/stxp for the same reasons. Anyway, this isn't blocking this PR, so let's continue this thread elsewhere (e.g. an issue or Zulip, as you prefer). Ultimately, I would like to see the LSE2 sequences properly supported.

> Looking through the LLVM source, I haven't found any evidence that the fact that 16B aligned loads are single-copy atomic in Armv8.4 has been implemented.

I've just been shown this: https://reviews.llvm.org/D109827
It looks like LLVM will get support soon.

@AGSaidi
Contributor Author

AGSaidi commented Sep 16, 2021

> I must admit that I didn't look in detail at the 64-bit version, but I understood that it used a mutex to protect a standard (non-atomic) pair access, which isn't obviously better (though it could be, and I won't dispute it).

The 64b version doesn't use a mutex; it has slightly more code to handle rollover, but it's 13x faster precisely because it's not acquiring a mutex or emulating a single-copy-atomic load with an ldxp/stxp instruction pair, which is effectively the same cost as acquiring a mutex.

> I've just been shown this: https://reviews.llvm.org/D109827

Awesome! However, it would still require dynamically detecting that the processor supports v8.4 and choosing the correct code path. Overall I doubt it's worth the effort, given that the additional code probably isn't any better than the current rollover code for quite a bit more complexity.

@jacobbramley
Contributor

Sounds good, thanks!

Please approve, @dtolnay.

@the8472
Member

the8472 commented Sep 16, 2021

Uh, I looked over the AtomicU64 implementation again and noticed a fairly egregious mistake. It's only using a load and store, not an RMW atomic. That poisons any benchmark results, since it's too good to be true.

I'll submit a fix.
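
(Roughly the difference being described, with hypothetical names: a separate load + store can lose updates, while a single read-modify-write cannot.)

    use std::sync::atomic::{AtomicU64, Ordering};

    static LAST: AtomicU64 = AtomicU64::new(0);

    // Broken shape: load and store are two separate atomics, so a larger
    // value stored by another thread in between can be overwritten.
    fn monotonize_racy(now: u64) -> u64 {
        let last = LAST.load(Ordering::Relaxed);
        let val = now.max(last);
        LAST.store(val, Ordering::Relaxed);
        val
    }

    // RMW shape: fetch_max is a single atomic read-modify-write, so the
    // stored value can only ever increase.
    fn monotonize_rmw(now: u64) -> u64 {
        LAST.fetch_max(now, Ordering::Relaxed).max(now)
    }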

@the8472
Member

the8472 commented Sep 16, 2021

You might want to redo your benchmarks on top of #89017 to make sure that the AtomicU64 approach still has a significant advantage over AtomicU128.

@AGSaidi
Contributor Author

AGSaidi commented Sep 18, 2021

Thanks @the8472, this still looks to help 2-3x, but not the 13x it did previously. I'll take a closer look at the generated code early next week.

@AGSaidi
Contributor Author

AGSaidi commented Sep 29, 2021

I've retested this, and yes, it continues to help to use the 64b version instead of the 128b version: about 3.5x faster on the attached reproducer (down from 13x) with the fix from @the8472. Staring at the generated assembly, I can't say I understand why, except maybe that the critical section between the load/store exclusives is a lot smaller in the case of the 64b fetch-update vs the max operation that requires testing both 64b halves.

Member

@dtolnay dtolnay left a comment

Thanks, all :)

@dtolnay
Member

dtolnay commented Oct 3, 2021

@bors r+

@bors
Contributor

bors commented Oct 3, 2021

📌 Commit ce450f8 has been approved by dtolnay

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 3, 2021
Manishearth added a commit to Manishearth/rust that referenced this pull request Oct 5, 2021
Manishearth added a commit to Manishearth/rust that referenced this pull request Oct 5, 2021
@workingjubilee
Member

Hm, it's worth noting this PR recently made its way in as well, and the two might have some impact on each other: #83655

Manishearth added a commit to Manishearth/rust that referenced this pull request Oct 5, 2021
bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 5, 2021
…arth

Rollup of 12 pull requests

Successful merges:

 - rust-lang#87631 (os current_exe using same approach as linux to get always the full ab…)
 - rust-lang#88234 (rustdoc-json: Don't ignore impls for primitive types)
 - rust-lang#88651 (Use the 64b inner:monotonize() implementation not the 128b one for aarch64)
 - rust-lang#88816 (Rustdoc migrate to table so the gui can handle >2k constants)
 - rust-lang#89244 (refactor: VecDeques PairSlices fields to private)
 - rust-lang#89364 (rustdoc-json: Encode json files with UTF-8)
 - rust-lang#89423 (Fix ICE caused by non_exaustive_omitted_patterns struct lint)
 - rust-lang#89426 (bootstrap: add config option for nix patching)
 - rust-lang#89462 (haiku thread affinity build fix)
 - rust-lang#89482 (Follow the diagnostic output style guide)
 - rust-lang#89504 (Don't suggest replacing region with 'static in NLL)
 - rust-lang#89535 (fix busted JavaScript in error index generator)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit dd223d5 into rust-lang:master Oct 5, 2021
@rustbot rustbot added this to the 1.57.0 milestone Oct 5, 2021