
Use the 64b inner:monotonize() implementation not the 128b one for aarch64 #88651

Merged
merged 1 commit into rust-lang:master on Oct 5, 2021

Conversation

AGSaidi
Contributor

@AGSaidi AGSaidi commented Sep 4, 2021

aarch64 prior to v8.4 (FEAT_LSE2) doesn't have an instruction that guarantees
untorn 128b reads except for completing a 128b load/store exclusive pair
(ldxp/stxp) or compare-and-swap (casp) successfully. The requirement to
complete a 128b read+write atomic is actually more expensive and more unfair
than the previous implementation of monotonize(), which used a Mutex on aarch64,
especially at large core counts. For aarch64, switch to the 64b atomic
implementation, which is about 13x faster for a benchmark that involves many
calls to Instant::now().
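
(For readers unfamiliar with the 64b approach, here is a minimal, hypothetical sketch of the general idea, not the actual std internals: keep the last value handed out in an AtomicU64 and clamp each new reading against it with a single read-modify-write. Rollover handling and the packing of the timestamp into a u64 are omitted.)

    use std::sync::atomic::{AtomicU64, Ordering};

    // Last value handed out, packed into a u64 (e.g. nanoseconds from some base).
    static LAST_NOW: AtomicU64 = AtomicU64::new(0);

    // Return a value that never goes backwards: max(previous, packed_now).
    // On aarch64 this compiles to a 64-bit exclusive (or LSE) loop rather
    // than the 128-bit ldxp/stxp sequence discussed below.
    fn monotonize(packed_now: u64) -> u64 {
        let prev = LAST_NOW
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |last| {
                Some(last.max(packed_now))
            })
            .expect("closure always returns Some");
        prev.max(packed_now)
    }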

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @dtolnay (or someone else) soon.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 4, 2021
@the8472
Member

the8472 commented Sep 5, 2021

(continuing discussion from #88652)

I didn't expect any platform to use a read-write instruction to emulate 128-bit atomic loads. In that case, would it make sense to make this implementation conditional on the lse feature on aarch64?
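
(For illustration, the compile-time gating being suggested could look roughly like this; a hypothetical sketch, not existing code, and as the next comment explains it turns out not to be sufficient.)

    // Hypothetical sketch of gating on a compile-time target feature
    // (module names are illustrative, not std's actual internals):
    #[cfg(all(target_arch = "aarch64", target_feature = "lse"))]
    mod inner {
        // 128-bit AtomicU128-based monotonize would live here.
    }

    #[cfg(not(all(target_arch = "aarch64", target_feature = "lse")))]
    mod inner {
        // 64-bit AtomicU64-based monotonize would live here.
    }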

@AGSaidi
Contributor Author

AGSaidi commented Sep 5, 2021

Unfortunately, no. +lse isn't sufficient. While +lse should give us a casp instruction instead of an ldxp/stxp pair and be better than the load/store exclusives, it's still far worse than it could be. I can't find any target-features that convince LLVM that 128b loads are single-copy atomic; +v8.4a doesn't work even though it's mandated at that architecture revision. Also, strangely, while +lse emits a casp in a debug build, in a release build the ldxp/stxp pair is emitted. There are two bugs I'll follow up with LLVM on (the optimizer issue and not recognizing that 128b values are single-copy atomic in v8.4), but in the meantime this is the code that gets emitted for fetch_max. It requires the stxp to complete successfully, meaning that this is as expensive as a lock acquisition without the benefit of relying on pthread_mutex_lock, which detects the architecture revision and uses the best code path.

    7064:       c87f212a        ldxp    x10, x8, [x9] <--(loop trying to complete ld/st pair)--|
    7068:       f100295f        cmp     x10, #0xa                                             |
    706c:       1a9f97ec        cset    w12, hi  // hi = pmore                                |
    7070:       f100011f        cmp     x8, #0x0                                              |
    7074:       1a9f07ed        cset    w13, ne  // ne = any                                  |
    7078:       1a8d018c        csel    w12, w12, w13, eq  // eq = none                       |
    707c:       7100019f        cmp     w12, #0x0                                             |
    7080:       9a9f110c        csel    x12, x8, xzr, ne  // ne = any                         |
    7084:       9a8b114d        csel    x13, x10, x11, ne  // ne = any                        |
    7088:       c82e312d        stxp    w14, x13, x12, [x9]                                   |
    708c:       35fffece        cbnz    w14, 7064 <ld128b::main+0x14>     --------------------|
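
(For readers less fluent in A64, here is a rough, deliberately non-atomic Rust rendering of what that loop computes; it is only meant to show that even when the stored value does not change, the store-exclusive pair must still complete.)

    // Illustrative only: the real sequence is a single atomic RMW, retried
    // via cbnz whenever the stxp (store-exclusive pair) fails.
    fn fetch_max_u128(cell: &mut u128, new: u128) -> u128 {
        let old = *cell;                              // ldxp
        *cell = if new > old { new } else { old };    // stxp runs either way
        old
    }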

@AGSaidi
Contributor Author

AGSaidi commented Sep 7, 2021

A little program that demonstrates the performance impact:
instant-now-repro.tar.gz
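
(The tarball's contents aren't reproduced here, but a reproducer of this shape shows the same effect: many threads hammering Instant::now() so the monotonize path dominates. This is a hypothetical sketch, not the attached program.)

    use std::thread;
    use std::time::Instant;

    fn main() {
        // Spawn many threads that do nothing but call Instant::now() in a
        // tight loop; print how long each thread took.
        let handles: Vec<_> = (0..64)
            .map(|_| {
                thread::spawn(|| {
                    let start = Instant::now();
                    for _ in 0..1_000_000 {
                        std::hint::black_box(Instant::now());
                    }
                    start.elapsed()
                })
            })
            .collect();
        for h in handles {
            println!("{:?}", h.join().unwrap());
        }
    }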

@the8472 the8472 added A-time Area: Time T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Sep 8, 2021
@AGSaidi
Contributor Author

AGSaidi commented Sep 14, 2021

Any feedback?

@dtolnay
Member

dtolnay commented Sep 15, 2021

@rustbot ping arm
Would one of you be willing to sanity check the explanation in the PR description, and the discussion above? I'll approve the PR if it broadly seems reasonable but this is outside of my expertise.

@rustbot
Collaborator

rustbot commented Sep 15, 2021

Hey ARM Group! This bug has been identified as a good "ARM candidate".
In case it's useful, here are some instructions for tackling these sorts of
bugs. Maybe take a look?
Thanks! <3

cc @adamgemmell @hug-dev @jacobbramley @JamieCunliffe @joaopaulocarreiro @raw-bin @Stammark

@rustbot rustbot added the O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state label Sep 15, 2021
@the8472
Member

the8472 commented Sep 15, 2021

> There are two bugs I'll follow up with LLVM on (the optimizer issue and not recognizing that 128b values are single-copy atomic in v8.4)

When you do that, can you also open an issue here linking to the upstream bugs so we can update the cfg flags on the AtomicU128 version once they're fixed?

@AGSaidi
Contributor Author

AGSaidi commented Sep 15, 2021

> When you do that, can you also open an issue here linking to the upstream bugs so we can update the cfg flags on the AtomicU128 version once they're fixed?

@the8472 Absolutely, but the condition will become more complicated, as we'll need both an architecture check and a target-features check. I'm honestly not clear on how we can detect whether the target-feature we're passing in RUSTFLAGS ends up enabling certain functionality in LLVM, and even if we could, the standard library would presumably be compiled for maximum compatibility with v8.0-v8.x. Put another way, we'll only be able to choose the 128b version if the architecture version we're running on is v8.4 or higher. If we think it's really important, the right thing to do is likely to dynamically detect the architecture version and choose.
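
(A sketch of what that dynamic detection could look like; the helper names are placeholders and the feature-detection macro was not yet stable when this discussion took place.)

    #[cfg(target_arch = "aarch64")]
    use std::arch::is_aarch64_feature_detected;

    // Hypothetical dispatch: take the 128-bit path only when the running CPU
    // implements FEAT_LSE2 (Armv8.4+), where aligned 16-byte loads are
    // single-copy atomic; otherwise fall back to the 64-bit path.
    #[cfg(target_arch = "aarch64")]
    fn monotonize(now: u64) -> u64 {
        if is_aarch64_feature_detected!("lse2") {
            monotonize_128(now)
        } else {
            monotonize_64(now)
        }
    }

    #[cfg(target_arch = "aarch64")]
    fn monotonize_128(now: u64) -> u64 { now } // placeholder
    #[cfg(target_arch = "aarch64")]
    fn monotonize_64(now: u64) -> u64 { now }  // placeholder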

Assuming #88652 is merged, this will end up not being needed for aarch64/Linux at least; other OSes will continue to benefit.

@jacobbramley
Contributor

> @rustbot ping arm
> Would one of you be willing to sanity check the explanation in the PR description, and the discussion above? I'll approve the PR if it broadly seems reasonable but this is outside of my expertise.

I wonder what the performance is like on small systems (i.e. those with few cores). However, I think this PR is correct in the functional sense.

Investigating the LSE2 issue a bit, I've been reminded that, in general, we can't simply mix the legacy mutex- or ldxp/stxp-based atomics with LSE2's versions, at least not for the same objects. I think that LLVM won't generate the 8.4 sequence because doing so would make it ABI-incompatible with earlier code. Now, that might not actually be an issue here (e.g. if the field will never be exposed to such code), but that doesn't mean that LLVM provides a way to express that.

I'm not sure, but it may be difficult to use 8.4 LSE2 sequences without defining a new target, or requiring the use of Cargo's build-std. Our PAuth/BTI proposal (#88354) has a similar requirement.

@AGSaidi
Contributor Author

AGSaidi commented Sep 16, 2021

> I wonder what the performance is like on small systems (i.e. those with few cores). However, I think this PR is correct in the functional sense.

The performance will be better on small systems too. Using the 64b version means that the load is single-copy atomic without the heavyweight exclusive-pair completing. That should be good for all systems, big or small.

> Investigating the LSE2 issue a bit, I've been reminded that, in general, we can't simply mix the legacy mutex- or ldxp/stxp-based atomics with LSE2's versions, at least not for the same objects.

I don't think you mean LSE2 but perhaps just LSE. AFAIK, there isn't anything in the architecture that makes the code in this case incorrect; it's just perhaps non-performant.

> I think that LLVM won't generate the 8.4 sequence because doing so would make it ABI-incompatible with earlier code. Now, that might not actually be an issue here (e.g. if the field will never be exposed to such code), but that doesn't mean that LLVM provides a way to express that.

Looking through the LLVM source, I haven't found any evidence that the fact that 16B aligned loads are single-copy atomic in Armv8.4 has been implemented. I also don't see how it's an ABI break: 16B loads still have to be naturally aligned, and they already have to be for the exclusive-pair instructions. Can you explain? It absolutely is an Arm architecture version problem, in that it is only guaranteed by the architecture in v8.4+ and generally people build generic v8.0+ code, but I just don't think this feature has been implemented in LLVM yet. Even if it had been, it applies to so few of the Arm systems in the world that it's not worth agonizing over currently.

@jacobbramley
Contributor

>> I wonder what the performance is like on small systems (i.e. those with few cores). However, I think this PR is correct in the functional sense.

> The performance will be better on small systems too. Using the 64b version means that the load is single-copy atomic without the heavyweight exclusive-pair completing. That should be good for all systems, big or small.

I must admit that I didn't look in detail at the 64-bit version, but I understood that it used a mutex to protect a standard (non-atomic) pair access, which isn't obviously better (though it could be, and I won't dispute it).

>> Investigating the LSE2 issue a bit, I've been reminded that, in general, we can't simply mix the legacy mutex- or ldxp/stxp-based atomics with LSE2's versions, at least not for the same objects.

> I don't think you mean LSE2 but perhaps just LSE. AFAIK, there isn't anything in the architecture that makes the code in this case incorrect; it's just perhaps non-performant.

I'm still digging, to be honest, and some of my comment was a bit confused! I've seen some implementations use a locking implementation, but possibly with individual loads and stores within the locks (rather than ldp/stp). How relevant they are, I don't know, especially as they'd be incompatible with a lock-free ldxp/stxp for the same reasons. Anyway, this isn't blocking this PR, so let's continue this thread elsewhere (e.g. an issue or Zulip, as you prefer). Ultimately, I would like to see the LSE2 sequences properly supported.

> Looking through the LLVM source, I haven't found any evidence that the fact that 16B aligned loads are single-copy atomic in Armv8.4 has been implemented.

I've just been shown this: https://reviews.llvm.org/D109827
It looks like LLVM will get support soon.

@AGSaidi
Contributor Author

AGSaidi commented Sep 16, 2021

> I must admit that I didn't look in detail at the 64-bit version, but I understood that it used a mutex to protect a standard (non-atomic) pair access, which isn't obviously better (though it could be, and I won't dispute it).

The 64b version doesn't use a mutex; it has slightly more code to handle rollover, but it's 13x faster precisely because it's not acquiring a mutex or emulating a single-copy-atomic load with an ldxp/stxp instruction pair, which is effectively the same cost as acquiring a mutex.

> I've just been shown this: https://reviews.llvm.org/D109827

Awesome! However, it would still require dynamically detecting that the processor supports v8.4 and choosing the correct code path. Overall I doubt it's worth the effort, given that the additional code probably isn't any better than the current rollover code for quite a bit more complexity.

@jacobbramley
Contributor

Sounds good, thanks!

Please approve, @dtolnay.

@the8472
Member

the8472 commented Sep 16, 2021

Uh, I looked over the AtomicU64 implementation again and noticed a fairly egregious mistake. It's only using a load and store, not an RMW atomic. That poisons any benchmark results, since it's too good to be true.

I'll submit a fix.
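
(Roughly the difference being described, with hypothetical names: a separate load + store can lose updates, while a single read-modify-write cannot.)

    use std::sync::atomic::{AtomicU64, Ordering};

    static LAST: AtomicU64 = AtomicU64::new(0);

    // Broken shape: load and store are two separate atomics, so a larger
    // value stored by another thread in between can be overwritten.
    fn monotonize_racy(now: u64) -> u64 {
        let last = LAST.load(Ordering::Relaxed);
        let val = now.max(last);
        LAST.store(val, Ordering::Relaxed);
        val
    }

    // RMW shape: fetch_max is a single atomic read-modify-write, so the
    // stored value can only ever increase.
    fn monotonize_rmw(now: u64) -> u64 {
        LAST.fetch_max(now, Ordering::Relaxed).max(now)
    }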

@the8472
Member

the8472 commented Sep 16, 2021

You might want to redo your benchmarks on top of #89017 to make sure that the AtomicU64 approach still has a significant advantage over AtomicU128.

@AGSaidi
Contributor Author

AGSaidi commented Sep 18, 2021

Thanks @the8472, this still looks to help 2-3x, but not the 13x it did previously. I'll take a closer look at the generated code early next week.

@AGSaidi
Contributor Author

AGSaidi commented Sep 29, 2021

I've retested this, and yes, it continues to help to use the 64b version instead of the 128b version: about 3.5x faster on the attached reproducer (down from 13x) with the fix from @the8472. Staring at the generated assembly, I can't say I understand why, except maybe that the critical section between the load/store exclusives is a lot smaller in the case of the 64b fetch-update vs the max operation that requires testing both 64b halves.

Member

@dtolnay dtolnay left a comment

Thanks, all :)

@dtolnay
Member

dtolnay commented Oct 3, 2021

@bors r+

@bors
Contributor

bors commented Oct 3, 2021

📌 Commit ce450f8 has been approved by dtolnay

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 3, 2021
Manishearth added a commit to Manishearth/rust that referenced this pull request Oct 5, 2021
Manishearth added a commit to Manishearth/rust that referenced this pull request Oct 5, 2021
@workingjubilee
Member

Hm, it's worth noting this PR recently made its way in as well, and the two might have some impact on each other: #83655

Manishearth added a commit to Manishearth/rust that referenced this pull request Oct 5, 2021
bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 5, 2021
…arth

Rollup of 12 pull requests

Successful merges:

 - rust-lang#87631 (os current_exe using same approach as linux to get always the full ab…)
 - rust-lang#88234 (rustdoc-json: Don't ignore impls for primitive types)
 - rust-lang#88651 (Use the 64b inner:monotonize() implementation not the 128b one for aarch64)
 - rust-lang#88816 (Rustdoc migrate to table so the gui can handle >2k constants)
 - rust-lang#89244 (refactor: VecDeques PairSlices fields to private)
 - rust-lang#89364 (rustdoc-json: Encode json files with UTF-8)
 - rust-lang#89423 (Fix ICE caused by non_exaustive_omitted_patterns struct lint)
 - rust-lang#89426 (bootstrap: add config option for nix patching)
 - rust-lang#89462 (haiku thread affinity build fix)
 - rust-lang#89482 (Follow the diagnostic output style guide)
 - rust-lang#89504 (Don't suggest replacing region with 'static in NLL)
 - rust-lang#89535 (fix busted JavaScript in error index generator)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit dd223d5 into rust-lang:master Oct 5, 2021
@rustbot rustbot added this to the 1.57.0 milestone Oct 5, 2021