compiler-builtins: Int trait functions are not inlined on wasm #73135
It seems like rust-lang/compiler-builtins#349 initially had a lot more inline annotations but removed them later on. That may have regressed performance on wasm, but I'm not entirely sure. This makes me wonder whether compiler-builtins is compiled with LTO at all, as it is supposedly meant to be.
Looks like this is indeed a regression that started with Rust 1.44, which switched from compiler-builtins 0.1.25 to 0.1.27, the release that includes this PR. These are the instructions in 1.43:
(func $__multi3 (type $t40) (param $p0 i32) (param $p1 i64) (param $p2 i64) (param $p3 i64) (param $p4 i64)
(local $l5 i64) (local $l6 i64)
local.get $p0
local.get $p3
i64.const 32
i64.shr_u
local.tee $l5
local.get $p1
i64.const 32
i64.shr_u
local.tee $l6
i64.mul
local.get $p3
local.get $p2
i64.mul
i64.add
local.get $p4
local.get $p1
i64.mul
i64.add
local.get $p3
i64.const 4294967295
i64.and
local.tee $p3
local.get $p1
i64.const 4294967295
i64.and
local.tee $p1
i64.mul
local.tee $p4
i64.const 32
i64.shr_u
local.get $p3
local.get $l6
i64.mul
i64.add
local.tee $p3
i64.const 32
i64.shr_u
i64.add
local.get $p3
i64.const 4294967295
i64.and
local.get $l5
local.get $p1
i64.mul
i64.add
local.tee $p3
i64.const 32
i64.shr_u
i64.add
i64.store offset=8
local.get $p0
local.get $p3
i64.const 32
i64.shl
local.get $p4
i64.const 4294967295
i64.and
i64.or
i64.store)
cc @tmiasko
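For readers who don't want to trace the stack machine by hand, the 1.43 listing above can be read as a schoolbook 128-bit multiply built from 64-bit operations. The following is a rough Rust sketch of that reading, not the actual compiler-builtins source: the function name `mul_u128_by_halves` and the argument order are my own assumptions, chosen to mirror the four `i64` parameters of `$__multi3`.

```rust
/// Multiplies two 128-bit values, each passed as (low, high) 64-bit halves,
/// and returns the low 128 bits of the product as (low, high).
/// Hypothetical helper mirroring the listing; not the compiler-builtins code.
pub fn mul_u128_by_halves(a_lo: u64, a_hi: u64, b_lo: u64, b_hi: u64) -> (u64, u64) {
    const MASK: u64 = 0xFFFF_FFFF; // 4294967295 in the listing

    // Split the low words into 32-bit pieces so each partial product fits in 64 bits.
    let (a0, a1) = (a_lo & MASK, a_lo >> 32);
    let (b0, b1) = (b_lo & MASK, b_lo >> 32);

    // Partial products of the low words, with carries propagated upwards.
    let p00 = a0 * b0;
    let mid1 = a1 * b0 + (p00 >> 32);
    let mid2 = a0 * b1 + (mid1 & MASK);

    // Low 64 bits of the result.
    let lo = (mid2 << 32) | (p00 & MASK);

    // High 64 bits: the top of a_lo * b_lo plus the two cross terms.
    // The cross terms only contribute modulo 2^64, hence the wrapping ops.
    let hi = (a1 * b1)
        .wrapping_add(mid1 >> 32)
        .wrapping_add(mid2 >> 32)
        .wrapping_add(a_lo.wrapping_mul(b_hi))
        .wrapping_add(a_hi.wrapping_mul(b_lo));

    (lo, hi)
}
```

When the small helpers behind these steps are inlined, the whole thing collapses into the straight-line sequence shown in the listing; when they are not, each step becomes a separate call.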
There have been a few changes around codegen unit handling and compiler-builtins recently, and it looks like this is caused by compiling this crate with
OK, I've transferred this to the rust-lang/rust repository, since I believe this isn't a bug in compiler-builtins itself but rather in how we build compiler-builtins. I believe this is an accidental regression from #70846: only part of the compiler knows that compiler-builtins is built with more than one CGU, while the rest of the compiler thinks it's one CGU (respecting CLI flags). I'll be opening a PR shortly to fix this.
We use a Vec here now because a HashMap would require hashing the comparison and then comparing the comparison with the string at the index calculated from the hash. This means at least two full iterations over the string are necessary, with one of them being somewhat expensive due to the hashing. Most of the time it is faster to just iterate the few comparisons we have and compare them directly. Most will be rejected right away as the first byte doesn't even match, so in the end you'll end up with less than two full iterations over the string. In fact most of the time Personal Best will be the first in the list and that's the one we most often want to look up anyway. One additional reason for doing this is that the ahash that was calculated for the HashMap uses 128-bit multiplications which regressed a lot in Rust 1.44 for targets where the `compiler-builtins` helpers were used. rust-lang/rust#73135 We could potentially look into interning our comparisons in the future which could yield even better performance.
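A minimal sketch of the Vec-based lookup described in that commit message might look like the following. The type `Comparisons` and its methods are hypothetical names picked for illustration and are not taken from the referencing project.

```rust
/// Hypothetical store of (comparison name, value) pairs that is scanned
/// linearly instead of being hashed.
pub struct Comparisons<T> {
    entries: Vec<(String, T)>,
}

impl<T> Comparisons<T> {
    pub fn new() -> Self {
        Self { entries: Vec::new() }
    }

    /// Inserts a value, replacing and returning any previous value stored
    /// under the same name.
    pub fn insert(&mut self, name: &str, value: T) -> Option<T> {
        if let Some(slot) = self.entries.iter_mut().find(|(n, _)| n.as_str() == name) {
            return Some(std::mem::replace(&mut slot.1, value));
        }
        self.entries.push((name.to_owned(), value));
        None
    }

    /// Looks a name up by comparing it against each stored name in order.
    /// Most mismatches are rejected after the first differing byte, and no
    /// hash of `name` is ever computed.
    pub fn get(&self, name: &str) -> Option<&T> {
        self.entries
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .map(|(_, v)| v)
    }
}
```

With only a handful of entries, and the most frequently requested one near the front, the scan usually touches very few bytes, which is the trade-off the message argues for.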
Assigning |
…ins, r=Mark-Simulacrum
Change how compiler-builtins gets many CGUs

This commit intends to fix an accidental regression from rust-lang#70846. The goal of rust-lang#70846 was to build compiler-builtins with a maximal number of CGUs to ensure that each module in the source corresponds to an object file. This high degree of control for compiler-builtins is desirable to ensure that there's at most one exported symbol per CGU, ideally enabling compiler-builtins to not conflict with the system libgcc as often.

In rust-lang#70846, however, only part of the compiler understands that compiler-builtins is built with many CGUs. The rest of the compiler thinks it's building with `sess.codegen_units()`. Notably, the calculation of `sess.lto()` consults `sess.codegen_units()`, and ThinLTO is disabled when there's only one CGU. This means that compiler-builtins is built without ThinLTO, which is quite harmful to performance! This is the root cause of rust-lang#73135, where intrinsics were found to not be inlining trivial functions.

The fix applied in this commit is to remove the special-casing of compiler-builtins in the compiler. Instead, the build system is now responsible for special-casing compiler-builtins. It doesn't know exactly how many CGUs will be needed, but it passes a large number that is assumed to be much greater than the number of source-level modules needed. After reading the various locations in the compiler source, this seemed like the best solution rather than adding more and more special casing in the compiler for compiler-builtins.

Closes rust-lang#73135
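The mismatch described in that commit message can be illustrated with a toy model. Everything below is hypothetical: `ToySession`, `ToyLto`, `partition_count_with_special_case`, and the concrete numbers are made up for this sketch and do not reproduce rustc's real types or logic. The model only captures the two facts stated above: the LTO decision looks at the command-line CGU count, and a single CGU disables ThinLTO.

```rust
/// Toy stand-in for the LTO modes relevant here (hypothetical).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ToyLto {
    Disabled,
    Thin,
}

/// Toy stand-in for the compiler session: it only knows the CGU count
/// requested on the command line (hypothetical).
pub struct ToySession {
    pub cli_codegen_units: usize,
}

impl ToySession {
    pub fn codegen_units(&self) -> usize {
        self.cli_codegen_units
    }

    /// Toy LTO decision: with a single CGU, ThinLTO is switched off.
    pub fn lto(&self) -> ToyLto {
        if self.codegen_units() == 1 {
            ToyLto::Disabled
        } else {
            ToyLto::Thin
        }
    }
}

/// Toy model of the regression: partitioning special-cases the crate and
/// produces many CGUs, but `lto()` above never hears about it.
pub fn partition_count_with_special_case(sess: &ToySession, is_compiler_builtins: bool) -> usize {
    if is_compiler_builtins {
        usize::MAX // "as many CGUs as possible"
    } else {
        sess.codegen_units()
    }
}
```

In this model, a session created with `cli_codegen_units: 1` partitions compiler-builtins into many CGUs yet still reports `ToyLto::Disabled`. Passing a large count from the build system instead (10_000 is an arbitrary placeholder) makes both parts agree, and `lto()` returns `ToyLto::Thin`.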
I'm currently doing some benchmarks and I've seen that `__multi3` is taking a really long time to calculate on wasm. Turns out that none of the helper functions it uses are inlined at all, so it does a ton of unnecessary calls and misses a lot of potential optimizations:
The same is probably true for the Float trait.