Entity cmp::Eq improvement #2346

tower120 · 2021-06-15T15:55:30Z

What problem does this solve or what need does it fill?

Current Entity comparison relies on rust's #derive(Eq, ...)

It now actually makes two comparisons (see asm code)

https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=38662860685c22c6cbaf0a14a7e38b8b

What solution would you like?

If treat whole Entity struct as u64 (through pointer cast on comparison, or even by store generation+id as u64) this could become one bitwise comparison.

https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=3351c4c203e76073de253b3301ed8d13

The text was updated successfully, but these errors were encountered:

NathanSWard · 2021-06-15T16:45:19Z

I'm surprised the compiler didn't optimize this 😢

mockersf · 2021-06-15T16:58:08Z

Bevy very rarely rely on entity equality so I don't think it would change much in performances

NathanSWard · 2021-06-15T17:49:36Z

Bevy very rarely rely on entity equality so I don't think it would change much in performances

Bevy may not, but that doesn't mean users won't use it a lot :)

NathanSWard · 2021-06-21T05:45:07Z

This also seems to greatly affect the assembly for the Ord trait as well :(
https://godbolt.org/z/GKqYnee4h

alice-i-cecile · 2021-12-12T19:19:49Z

#2372 can and should be revived if this is worth pushing forward.

FWIW, I'm in favor of those changes; I tend to use Entity equality quite a bit in my code, and suspect that it will be important for relation-like features.

@scottmcm

# Objective - Follow up on #10519, diving deeper into optimising `Entity` due to the `derive`d `PartialOrd` `partial_cmp` not being optimal with codegen: rust-lang/rust#106107 - Fixes #2346. ## Solution Given the previous PR's solution and the other existing LLVM codegen bug, there seemed to be a potential further optimisation possible with `Entity`. In exploring providing manual `PartialOrd` impl, it turned out initially that the resulting codegen was not immediately better than the derived version. However, once `Entity` was given `#[repr(align(8)]`, the codegen improved remarkably, even more once the fields in `Entity` were rearranged to correspond to a `u64` layout (Rust doesn't automatically reorder fields correctly it seems). The field order and `align(8)` additions also improved `to_bits` codegen to be a single `mov` op. In turn, this led me to replace the previous "non-shortcircuiting" impl of `PartialEq::eq` to use direct `to_bits` comparison. The result was remarkably better codegen across the board, even for hastable lookups. The current baseline codegen is as follows: https://godbolt.org/z/zTW1h8PnY Assuming the following example struct that mirrors with the existing `Entity` definition: ```rust #[derive(Clone, Copy, Eq, PartialEq, PartialOrd, Ord)] pub struct FakeU64 { high: u32, low: u32, } ``` the output for `to_bits` is as follows: ``` example::FakeU64::to_bits: shl rdi, 32 mov eax, esi or rax, rdi ret ``` Changing the struct to: ```rust #[derive(Clone, Copy, Eq)] #[repr(align(8))] pub struct FakeU64 { low: u32, high: u32, } ``` and providing manual implementations for `PartialEq`/`PartialOrd`/`Ord`, `to_bits` now optimises to: ``` example::FakeU64::to_bits: mov rax, rdi ret ``` The full codegen example for this PR is here for reference: https://godbolt.org/z/n4Mjx165a To highlight, `gt` comparison goes from ``` example::greater_than: cmp edi, edx jae .LBB3_2 xor eax, eax ret .LBB3_2: setne dl cmp esi, ecx seta al or al, dl ret ``` to ``` example::greater_than: cmp rdi, rsi seta al ret ``` As explained on Discord by @scottmcm : >The root issue here, as far as I understand it, is that LLVM's middle-end is inexplicably unwilling to merge loads if that would make them under-aligned. It leaves that entirely up to its target-specific back-end, and thus a bunch of the things that you'd expect it to do that would fix this just don't happen. ## Benchmarks Before discussing benchmarks, everything was tested on the following specs: AMD Ryzen 7950X 16C/32T CPU 64GB 5200 RAM AMD RX7900XT 20GB Gfx card Manjaro KDE on Wayland I made use of the new entity hashing benchmarks to see how this PR would improve things there. With the changes in place, I first did an implementation keeping the existing "non shortcircuit" `PartialEq` implementation in place, but with the alignment and field ordering changes, which in the benchmark is the `ord_shortcircuit` column. The `to_bits` `PartialEq` implementation is the `ord_to_bits` column. The main_ord column is the current existing baseline from `main` branch. ![Screenshot_20231114_132908](https://github.com/bevyengine/bevy/assets/3116268/cb9090c9-ff74-4cc5-abae-8e4561332261) My machine is not super set-up for benchmarking, so some results are within noise, but there's not just a clear improvement between the non-shortcircuiting implementation, but even further optimisation taking place with the `to_bits` implementation. On my machine, a fair number of the stress tests were not showing any difference (indicating other bottlenecks), but I was able to get a clear difference with `many_foxes` with a fox count of 10,000: Test with `cargo run --example many_foxes --features bevy/trace_tracy,wayland --release -- --count 10000`: ![Screenshot_20231114_144217](https://github.com/bevyengine/bevy/assets/3116268/89bdc21c-7209-43c8-85ae-efbf908bfed3) On avg, a framerate of about 28-29FPS was improved to 30-32FPS. "This trace" represents the current PR's perf, while "External trace" represents the `main` branch baseline. ## Changelog Changed: micro-optimized Entity align and field ordering as well as providing manual `PartialOrd`/`Ord` impls to help LLVM optimise further. ## Migration Guide Any `unsafe` code relying on field ordering of `Entity` or sufficiently cursed shenanigans should change to reflect the different internal representation and alignment requirements of `Entity`. Co-authored-by: james7132 <[email protected]> Co-authored-by: NathanW <[email protected]>

scottmcm · 2023-11-18T20:21:38Z

For == specifically this was fixed in #10519

I'm surprised the compiler didn't optimize this 😢

See rust-lang/rust#117800

@scottmcm

…gine#10558) # Objective - Follow up on bevyengine#10519, diving deeper into optimising `Entity` due to the `derive`d `PartialOrd` `partial_cmp` not being optimal with codegen: rust-lang/rust#106107 - Fixes bevyengine#2346. ## Solution Given the previous PR's solution and the other existing LLVM codegen bug, there seemed to be a potential further optimisation possible with `Entity`. In exploring providing manual `PartialOrd` impl, it turned out initially that the resulting codegen was not immediately better than the derived version. However, once `Entity` was given `#[repr(align(8)]`, the codegen improved remarkably, even more once the fields in `Entity` were rearranged to correspond to a `u64` layout (Rust doesn't automatically reorder fields correctly it seems). The field order and `align(8)` additions also improved `to_bits` codegen to be a single `mov` op. In turn, this led me to replace the previous "non-shortcircuiting" impl of `PartialEq::eq` to use direct `to_bits` comparison. The result was remarkably better codegen across the board, even for hastable lookups. The current baseline codegen is as follows: https://godbolt.org/z/zTW1h8PnY Assuming the following example struct that mirrors with the existing `Entity` definition: ```rust #[derive(Clone, Copy, Eq, PartialEq, PartialOrd, Ord)] pub struct FakeU64 { high: u32, low: u32, } ``` the output for `to_bits` is as follows: ``` example::FakeU64::to_bits: shl rdi, 32 mov eax, esi or rax, rdi ret ``` Changing the struct to: ```rust #[derive(Clone, Copy, Eq)] #[repr(align(8))] pub struct FakeU64 { low: u32, high: u32, } ``` and providing manual implementations for `PartialEq`/`PartialOrd`/`Ord`, `to_bits` now optimises to: ``` example::FakeU64::to_bits: mov rax, rdi ret ``` The full codegen example for this PR is here for reference: https://godbolt.org/z/n4Mjx165a To highlight, `gt` comparison goes from ``` example::greater_than: cmp edi, edx jae .LBB3_2 xor eax, eax ret .LBB3_2: setne dl cmp esi, ecx seta al or al, dl ret ``` to ``` example::greater_than: cmp rdi, rsi seta al ret ``` As explained on Discord by @scottmcm : >The root issue here, as far as I understand it, is that LLVM's middle-end is inexplicably unwilling to merge loads if that would make them under-aligned. It leaves that entirely up to its target-specific back-end, and thus a bunch of the things that you'd expect it to do that would fix this just don't happen. ## Benchmarks Before discussing benchmarks, everything was tested on the following specs: AMD Ryzen 7950X 16C/32T CPU 64GB 5200 RAM AMD RX7900XT 20GB Gfx card Manjaro KDE on Wayland I made use of the new entity hashing benchmarks to see how this PR would improve things there. With the changes in place, I first did an implementation keeping the existing "non shortcircuit" `PartialEq` implementation in place, but with the alignment and field ordering changes, which in the benchmark is the `ord_shortcircuit` column. The `to_bits` `PartialEq` implementation is the `ord_to_bits` column. The main_ord column is the current existing baseline from `main` branch. ![Screenshot_20231114_132908](https://github.com/bevyengine/bevy/assets/3116268/cb9090c9-ff74-4cc5-abae-8e4561332261) My machine is not super set-up for benchmarking, so some results are within noise, but there's not just a clear improvement between the non-shortcircuiting implementation, but even further optimisation taking place with the `to_bits` implementation. On my machine, a fair number of the stress tests were not showing any difference (indicating other bottlenecks), but I was able to get a clear difference with `many_foxes` with a fox count of 10,000: Test with `cargo run --example many_foxes --features bevy/trace_tracy,wayland --release -- --count 10000`: ![Screenshot_20231114_144217](https://github.com/bevyengine/bevy/assets/3116268/89bdc21c-7209-43c8-85ae-efbf908bfed3) On avg, a framerate of about 28-29FPS was improved to 30-32FPS. "This trace" represents the current PR's perf, while "External trace" represents the `main` branch baseline. ## Changelog Changed: micro-optimized Entity align and field ordering as well as providing manual `PartialOrd`/`Ord` impls to help LLVM optimise further. ## Migration Guide Any `unsafe` code relying on field ordering of `Entity` or sufficiently cursed shenanigans should change to reflect the different internal representation and alignment requirements of `Entity`. Co-authored-by: james7132 <[email protected]> Co-authored-by: NathanW <[email protected]>

tower120 added C-Feature A new feature, making something new possible S-Needs-Triage This issue needs to be labelled labels Jun 15, 2021

mockersf added A-ECS Entities, components, systems, and events and removed S-Needs-Triage This issue needs to be labelled labels Jun 15, 2021

NathanSWard added the C-Performance A change motivated by improving speed, memory usage or compile times label Jun 15, 2021

NathanSWard mentioned this issue Jun 21, 2021

[ecs] Rework Entity to use u64 #2372

Closed

alice-i-cecile added S-Ready-For-Implementation This issue is ready for an implementation PR. Go for it! and removed S-Ready-For-Implementation This issue is ready for an implementation PR. Go for it! labels Dec 12, 2021

james7132 mentioned this issue Dec 10, 2022

Endianness based Entity representation #3788

Closed

james7132 mentioned this issue Nov 18, 2023

Optimise Entity with repr align & manual PartialOrd/Ord #10558

Merged

james7132 closed this as completed in #10558 Nov 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Entity cmp::Eq improvement #2346

Entity cmp::Eq improvement #2346

tower120 commented Jun 15, 2021

NathanSWard commented Jun 15, 2021

mockersf commented Jun 15, 2021

NathanSWard commented Jun 15, 2021

NathanSWard commented Jun 21, 2021

alice-i-cecile commented Dec 12, 2021

scottmcm commented Nov 18, 2023 •

edited

Loading

Entity cmp::Eq improvement #2346

Entity cmp::Eq improvement #2346

Comments

tower120 commented Jun 15, 2021

What problem does this solve or what need does it fill?

What solution would you like?

NathanSWard commented Jun 15, 2021

mockersf commented Jun 15, 2021

NathanSWard commented Jun 15, 2021

NathanSWard commented Jun 21, 2021

alice-i-cecile commented Dec 12, 2021

scottmcm commented Nov 18, 2023 • edited Loading

scottmcm commented Nov 18, 2023 •

edited

Loading