derived `PartialOrd::le` is non-optimal for 2-field struct of primitive integers #106107

scottmcm · 2022-12-24T02:29:34Z

This codegen test should pass: comparison-operators-twofields.patch https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=4c8846efffd0574ec44e60c3cffbe842

And it turns out that the derived `PartialOrd::lt` does pass that test:

define noundef zeroext i1 @check_lt(i16 %0, i16 %1, i16 %2, i16 %3) unnamed_addr #0 {
start:
  %_6.i.i = icmp slt i16 %0, %2
  %_9.i.i.not = icmp eq i16 %0, %2
  %_6.i3.i = icmp ult i16 %1, %3
  %4 = select i1 %_9.i.i.not, i1 %_6.i3.i, i1 %_6.i.i
  ret i1 %4
}

but the derived le, gt, and ge do not pass it.

define noundef zeroext i1 @check_gt(i16 %0, i16 %1, i16 %2, i16 %3) unnamed_addr #0 {
start:
  %_6.i.i = icmp slt i16 %0, %2
  %_9.i.i = icmp ne i16 %0, %2
  %..i.i = zext i1 %_9.i.i to i8
  %_3.0.i.i = select i1 %_6.i.i, i8 -1, i8 %..i.i
  %4 = icmp eq i8 %_3.0.i.i, 0
  %_6.i3.i = icmp ult i16 %1, %3
  %_9.i4.i = icmp ne i16 %1, %3
  %..i5.i = zext i1 %_9.i4.i to i8
  %_3.0.i6.i = select i1 %_6.i3.i, i8 -1, i8 %..i5.i
  %.0.i = select i1 %4, i8 %_3.0.i6.i, i8 %_3.0.i.i
  %5 = icmp eq i8 %.0.i, 1
  ret i1 %5
}

instead leaving in a whole bunch of unnecessary stuff.
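For readers without the linked test at hand, here is a minimal sketch of the kind of input that produces IR like the above. The struct name `TwoFields`, its field types (a signed then an unsigned 16-bit integer, inferred from the `slt`/`ult` comparisons), and the `*_by_hand` helpers are assumptions made for illustration, not copied from the test. The hand-written helpers show the "compare the second field only if the first is equal" shape that the `check_lt` IR already has and that the other three operators would ideally reach too.

```rust
// Hypothetical reconstruction: the name and field types are assumptions
// inferred from the IR, not taken from the linked codegen test.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct TwoFields(pub i16, pub u16);

// Comparisons that go through the derived `PartialOrd`.
pub fn check_lt(a: TwoFields, b: TwoFields) -> bool {
    a < b
}
pub fn check_le(a: TwoFields, b: TwoFields) -> bool {
    a <= b
}
pub fn check_gt(a: TwoFields, b: TwoFields) -> bool {
    a > b
}
pub fn check_ge(a: TwoFields, b: TwoFields) -> bool {
    a >= b
}

// Hand-written lexicographic comparisons: one equality test on the first
// field selects between a first-field and a second-field comparison.
pub fn lt_by_hand(a: TwoFields, b: TwoFields) -> bool {
    if a.0 == b.0 { a.1 < b.1 } else { a.0 < b.0 }
}
pub fn le_by_hand(a: TwoFields, b: TwoFields) -> bool {
    if a.0 == b.0 { a.1 <= b.1 } else { a.0 < b.0 }
}
pub fn gt_by_hand(a: TwoFields, b: TwoFields) -> bool {
    if a.0 == b.0 { a.1 > b.1 } else { a.0 > b.0 }
}
pub fn ge_by_hand(a: TwoFields, b: TwoFields) -> bool {
    if a.0 == b.0 { a.1 >= b.1 } else { a.0 > b.0 }
}
```

The derived and hand-written versions return the same results; the issue is only about how much of the intermediate `Ordering` computation survives optimization in the derived ones.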

Given the direction in #98655, we probably don't want to fix this by changing the derive.

So potential fixes could be changing our Ord::cmp implementation for integers (cc #105840), changing the default PartialOrd::le implementations (cc #106065), improved optimizations in LLVM (cc llvm/llvm-project#59666), or more.
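To make the "default `PartialOrd::le` implementations" option concrete: the default comparison methods are defined in terms of `partial_cmp`, so there is more than one way to phrase them. The two free functions below are illustrative models only (their names are hypothetical, and neither is claimed to be the exact body used in the standard library at any given version). They are semantically equivalent, but they hand the optimizer differently shaped code.

```rust
use std::cmp::Ordering;

/// Model of a default `le` phrased as "the result is Less or Equal".
pub fn le_via_partial_cmp<T: PartialOrd>(a: &T, b: &T) -> bool {
    matches!(a.partial_cmp(b), Some(Ordering::Less | Ordering::Equal))
}

/// Hypothetical alternative phrasing: "comparable, and not Greater".
pub fn le_via_not_gt<T: PartialOrd>(a: &T, b: &T) -> bool {
    !matches!(a.partial_cmp(b), None | Some(Ordering::Greater))
}
```

Both return `false` when the operands are incomparable (for example, a NaN float), so switching between them would not change behaviour, only what LLVM sees.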


scottmcm · 2023-02-17T05:03:22Z

I tried switching the cmp implementation for integers to use the same order as clang (see llvm/llvm-project#60012), which managed to fix le, but didn't fix gt or ge, so it's not obvious that it's worth doing as it's more x86 asm for a single cmp.

If you want to play with this yourself, there's a codegen test in https://github.com/rust-lang/rust/pull/108156/files#diff-9ee4982898ebe1fa1f721acc54dc662f4a51cb2e017c71f9e24286493358e020

But we might just need to wait for LLVM 17 and llvm/llvm-project#59666 (comment).
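As a rough illustration of what "switching the order" in an integer `cmp` can mean, the sketch below shows two equivalent ways to compute a three-way comparison. The function names are hypothetical, and this comment does not say which form corresponds to the standard library or to clang, so neither is labelled as such here. The point is only that the result is identical while the sequence of comparisons the optimizer starts from differs.

```rust
use std::cmp::Ordering;

/// Three-way comparison that tests "less than" first, then equality.
pub fn cmp_less_first(a: i32, b: i32) -> Ordering {
    if a < b {
        Ordering::Less
    } else if a == b {
        Ordering::Equal
    } else {
        Ordering::Greater
    }
}

/// Three-way comparison built from the difference of two boolean tests,
/// `(a > b) - (a < b)`, which is -1, 0, or 1.
pub fn cmp_via_difference(a: i32, b: i32) -> Ordering {
    let sign = (a > b) as i8 - (a < b) as i8;
    match sign {
        -1 => Ordering::Less,
        0 => Ordering::Equal,
        _ => Ordering::Greater,
    }
}
```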

@scottmcm

# Objective

- Follow up on #10519, diving deeper into optimising `Entity` due to the `derive`d `PartialOrd` `partial_cmp` not being optimal with codegen: rust-lang/rust#106107
- Fixes #2346.

## Solution

Given the previous PR's solution and the other existing LLVM codegen bug, there seemed to be a potential further optimisation possible with `Entity`. In exploring a manual `PartialOrd` impl, it turned out initially that the resulting codegen was not immediately better than the derived version. However, once `Entity` was given `#[repr(align(8))]`, the codegen improved remarkably, even more once the fields in `Entity` were rearranged to correspond to a `u64` layout (Rust doesn't automatically reorder fields correctly, it seems). The field order and `align(8)` additions also improved `to_bits` codegen to be a single `mov` op. In turn, this led me to replace the previous "non-shortcircuiting" impl of `PartialEq::eq` with a direct `to_bits` comparison. The result was remarkably better codegen across the board, even for hashtable lookups.

The current baseline codegen is as follows: https://godbolt.org/z/zTW1h8PnY

Assuming the following example struct that mirrors the existing `Entity` definition:

```rust
#[derive(Clone, Copy, Eq, PartialEq, PartialOrd, Ord)]
pub struct FakeU64 {
    high: u32,
    low: u32,
}
```

the output for `to_bits` is as follows:

```
example::FakeU64::to_bits:
        shl     rdi, 32
        mov     eax, esi
        or      rax, rdi
        ret
```

Changing the struct to:

```rust
#[derive(Clone, Copy, Eq)]
#[repr(align(8))]
pub struct FakeU64 {
    low: u32,
    high: u32,
}
```

and providing manual implementations for `PartialEq`/`PartialOrd`/`Ord`, `to_bits` now optimises to:

```
example::FakeU64::to_bits:
        mov     rax, rdi
        ret
```

The full codegen example for this PR is here for reference: https://godbolt.org/z/n4Mjx165a

To highlight, `gt` comparison goes from

```
example::greater_than:
        cmp     edi, edx
        jae     .LBB3_2
        xor     eax, eax
        ret
.LBB3_2:
        setne   dl
        cmp     esi, ecx
        seta    al
        or      al, dl
        ret
```

to

```
example::greater_than:
        cmp     rdi, rsi
        seta    al
        ret
```

As explained on Discord by @scottmcm:

> The root issue here, as far as I understand it, is that LLVM's middle-end is inexplicably unwilling to merge loads if that would make them under-aligned. It leaves that entirely up to its target-specific back-end, and thus a bunch of the things that you'd expect it to do that would fix this just don't happen.

## Benchmarks

Before discussing benchmarks, everything was tested on the following specs:

- AMD Ryzen 7950X 16C/32T CPU
- 64GB 5200 RAM
- AMD RX7900XT 20GB Gfx card
- Manjaro KDE on Wayland

I made use of the new entity hashing benchmarks to see how this PR would improve things there. With the changes in place, I first did an implementation keeping the existing "non shortcircuit" `PartialEq` implementation in place, but with the alignment and field ordering changes, which in the benchmark is the `ord_shortcircuit` column. The `to_bits` `PartialEq` implementation is the `ord_to_bits` column. The `main_ord` column is the current existing baseline from the `main` branch.

![Screenshot_20231114_132908](https://github.com/bevyengine/bevy/assets/3116268/cb9090c9-ff74-4cc5-abae-8e4561332261)

My machine is not super set-up for benchmarking, so some results are within noise, but there is a clear improvement with the non-shortcircuiting implementation, and even further optimisation taking place with the `to_bits` implementation.

On my machine, a fair number of the stress tests were not showing any difference (indicating other bottlenecks), but I was able to get a clear difference with `many_foxes` with a fox count of 10,000.

Test with `cargo run --example many_foxes --features bevy/trace_tracy,wayland --release -- --count 10000`:

![Screenshot_20231114_144217](https://github.com/bevyengine/bevy/assets/3116268/89bdc21c-7209-43c8-85ae-efbf908bfed3)

On average, a framerate of about 28-29 FPS was improved to 30-32 FPS. "This trace" represents the current PR's perf, while "External trace" represents the `main` branch baseline.

## Changelog

Changed: micro-optimized `Entity` align and field ordering as well as providing manual `PartialOrd`/`Ord` impls to help LLVM optimise further.

## Migration Guide

Any `unsafe` code relying on field ordering of `Entity` or sufficiently cursed shenanigans should change to reflect the different internal representation and alignment requirements of `Entity`.

Co-authored-by: james7132 <[email protected]>
Co-authored-by: NathanW <[email protected]>
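The PR description above shows the struct definitions but not the manual impls. A minimal sketch of what "manual `PartialEq`/`PartialOrd`/`Ord` via `to_bits`" could look like for the `FakeU64` example follows. The `new` constructor and the exact impl bodies are assumptions for illustration and are not copied from the PR; see the linked godbolt example for the author's actual code.

```rust
use std::cmp::Ordering;

// Layout as described in the PR text: 8-byte alignment, low field first.
#[derive(Clone, Copy, Eq, Debug)]
#[repr(align(8))]
pub struct FakeU64 {
    low: u32,
    high: u32,
}

impl FakeU64 {
    /// Hypothetical constructor, added here only so the sketch is usable.
    pub const fn new(high: u32, low: u32) -> Self {
        Self { low, high }
    }

    /// Packs both fields into one `u64`, with `high` in the upper 32 bits.
    pub const fn to_bits(self) -> u64 {
        ((self.high as u64) << 32) | (self.low as u64)
    }
}

// Equality and ordering are both delegated to the packed `u64`, so every
// comparison is a single integer comparison at the source level.
impl PartialEq for FakeU64 {
    fn eq(&self, other: &Self) -> bool {
        self.to_bits() == other.to_bits()
    }
}

impl PartialOrd for FakeU64 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for FakeU64 {
    fn cmp(&self, other: &Self) -> Ordering {
        self.to_bits().cmp(&other.to_bits())
    }
}

pub fn greater_than(a: FakeU64, b: FakeU64) -> bool {
    a > b
}
```

Ordering by `to_bits` gives the same result as comparing `high` first and `low` second, which is what the original derived impl did with its `high`, `low` field order.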


scottmcm mentioned this issue Dec 24, 2022

Micro-optimize Ord::cmp for primitives #105840

Closed

scottmcm mentioned this issue Jan 18, 2023

replace if with match in binary_search #106969

Closed

scottmcm mentioned this issue Feb 17, 2023

Add a codegen test for comparisons of 2-tuples of primitives #108156

Closed

Noratrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023

Bluefinger mentioned this issue Nov 14, 2023

Optimise Entity with repr align & manual PartialOrd/Ord bevyengine/bevy#10558

Merged
