x64: Only branch once in br_table #5850

Merged
jameysharp merged 1 commit into bytecodealliance:main from
jameysharp:branch-once
Feb 24, 2023

Conversation

@jameysharp
Contributor

This uses the cmov that was necessary anyway for Spectre mitigation to clamp the table index, instead of zeroing it. By then placing the default target as the last entry in the table, we can use just one branch instruction in all cases.
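As a rough illustration of the idea described above, here is a minimal Rust model of the two dispatch strategies. It is a sketch, not Cranelift's emitted code: the function names `dispatch_two_branches` and `dispatch_clamped` and the `u32` stand-ins for branch targets are hypothetical, and `usize::min` only models the clamp (a compiler may lower it to a conditional move, but that is not guaranteed).

```rust
/// Models the older shape: a bounds-check branch to the default target,
/// followed by a separate indexed jump through the table.
fn dispatch_two_branches(idx: usize, targets: &[u32], default: u32) -> u32 {
    if idx >= targets.len() {
        return default; // first branch: out-of-range index goes to the default
    }
    targets[idx] // second branch: indexed jump
}

/// Models the shape described in this PR: the default target is stored as
/// the last table entry, and the index is clamped to that last slot, so one
/// indexed jump covers every case.
/// Precondition (assumed): the table always holds at least the default entry.
fn dispatch_clamped(idx: usize, table_with_default: &[u32]) -> u32 {
    let last = table_with_default.len() - 1;
    let clamped = idx.min(last); // stands in for the cmp + cmov clamp
    table_with_default[clamped]
}

fn main() {
    let targets = [10u32, 20, 30];
    let default = 99u32;

    // Build the table with the default target appended as the final entry.
    let mut table = targets.to_vec();
    table.push(default);

    // Both strategies pick the same target for in-range and out-of-range indices.
    for idx in [0usize, 1, 2, 3, 1000] {
        assert_eq!(
            dispatch_two_branches(idx, &targets, default),
            dispatch_clamped(idx, &table)
        );
    }
    println!("both dispatch strategies agree");
}
```

Because every out-of-range index is clamped to the slot holding the default target, there is no index that can read past the end of the table.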

This is a net savings of two bytes in the encoding of x64's br_table pseudoinstruction.

I haven't done any benchmarking yet. I'm guessing either this will be faster because it doesn't branch as often, or it will be slower because it executes more instructions in the common case where the bounds check succeeds.

I also haven't updated the comments in the implementation because first I wanted to see if this works at all (it passes the test suite!) and whether folks have strong arguments for or against this change.

It's just a random idea I had and thought I'd try out.

@jameysharp jameysharp requested a review from cfallin February 22, 2023 02:34
@github-actions github-actions bot added the labels cranelift (Issues related to the Cranelift code generator) and cranelift:area:x64 (Issues related to x64 codegen) Feb 22, 2023
@jameysharp jameysharp marked this pull request as ready for review February 22, 2023 23:24
Member

@cfallin cfallin left a comment

This seems reasonable to me; thanks! I like that the logic is more cleanly split -- the implicit flags-into-pseudoinstruction always bothered me. I'm curious about performance results (and let's wait to see them before merging, out of an abundance of caution) but suspect it will be either neutral or slightly positive because, as you note, it eliminates the separate branch for the default target, which could result in mispredicts.

collector.reg_use(*idx);
collector.reg_early_def(*tmp1);
collector.reg_early_def(*tmp2);
collector.reg_def(*tmp2);
Member

Can we add a comment here that this is safe because tmp2 is written only after idx is used? Likewise, can we add a comment in emit.rs, in the listing of the emitted sequence, saying that if this property changes, we need to update the regalloc metadata?

@jameysharp
Contributor Author

Spidermonkey retires slightly fewer instructions during execution, slightly more during compilation, but no significant difference in CPU cycles in either case.

Significant spidermonkey performance differences
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 23022316.85 ± 32869.86 (confidence = 99%)

  branch-once-82e006e56.so is 1.01x to 1.01x faster than main-0f51338de.so!

  [2852144560 2852262544.40 2852550305] branch-once-82e006e56.so
  [2875156298 2875284861.25 2875531456] main-0f51338de.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 37675655.35 ± 1170739.83 (confidence = 99%)

  main-0f51338de.so is 1.00x to 1.00x faster than branch-once-82e006e56.so!

  [38943859484 38953919135.25 38956367889] branch-once-82e006e56.so
  [38906558057 38916243479.90 38921200049] main-0f51338de.so

The pulldown-cmark benchmark executes slightly faster by CPU cycles and slightly slower by number of instructions retired, which I think means it's hitting the default branch somewhat often.

Significant pulldown-cmark performance differences
execution :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 87213.87 ± 20212.56 (confidence = 99%)

  branch-once-82e006e56.so is 1.01x to 1.01x faster than main-0f51338de.so!

  [8033614 8138405.50 8285060] branch-once-82e006e56.so
  [8140335 8225619.37 8496124] main-0f51338de.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 117594.21 ± 109.08 (confidence = 99%)

  main-0f51338de.so is 1.00x to 1.00x faster than branch-once-82e006e56.so!

  [23761214 23761446.97 23762080] branch-once-82e006e56.so
  [23643597 23643852.76 23644483] main-0f51338de.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 3504417.95 ± 49961.22 (confidence = 99%)

  main-0f51338de.so is 1.00x to 1.00x faster than branch-once-82e006e56.so!

  [1652866013 1653078182.09 1653490982] branch-once-82e006e56.so
  [1649289182 1649573764.14 1650067353] main-0f51338de.so

But overall it looks to me like there's very little performance difference either way on these benchmarks.
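For readers checking the numbers, the "1.01x" figure for spidermonkey execution can be roughly reproduced from the reported means. The sketch below only divides the two means quoted above; it is an assumption that this approximates the reported ratio, and it does not reproduce how Sightglass derives its confidence interval. The helper name `speedup` is hypothetical.

```rust
/// Ratio of the baseline mean to the candidate mean. A value above 1.0
/// means the candidate needed fewer of the measured unit.
fn speedup(baseline_mean: f64, candidate_mean: f64) -> f64 {
    baseline_mean / candidate_mean
}

fn main() {
    // Mean instructions retired during spidermonkey execution, copied from
    // the benchmark output quoted in this comment.
    let main_mean = 2875284861.25_f64;
    let branch_once_mean = 2852262544.40_f64;

    let delta = main_mean - branch_once_mean;
    let ratio = speedup(main_mean, branch_once_mean);

    println!("delta = {delta:.2}");
    println!("ratio = {ratio:.4}");
}
```

The difference of the means matches the reported Δ of 23022316.85, which is under one percent of the roughly 2.85 billion instructions retired.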

This uses the `cmov`, which was previously necessary for Spectre
mitigation, to clamp the table index instead of zeroing it. By then
placing the default target as the last entry in the table, we can use
just one branch instruction in all cases.

Since there isn't a bounds-check branch any more, this sequence no
longer needs Spectre mitigation. And since we don't need to be careful
about preserving flags, half the instructions can be removed from this
pseudoinstruction and emitted as regular instructions instead.

This is a net savings of three bytes in the encoding of x64's br_table
pseudoinstruction. The generated code can sometimes be longer overall
because the blocks are emitted in a slightly different order.

My benchmark results show a very small effect on runtime performance
with this change.

The spidermonkey benchmark in Sightglass runs "1.01x faster" than main
by instructions retired, but with no significant difference in CPU
cycles. I think that means it rarely hit the default case in any
br_table instructions it executed.

The pulldown-cmark benchmark in Sightglass runs "1.01x faster" than main
by CPU cycles, but main runs "1.00x faster" by instructions retired. I
think that means this benchmark hit the default case a significant
amount of the time, so it executes a few more instructions per br_table,
but maybe the branches were predicted better.
@jameysharp jameysharp enabled auto-merge February 24, 2023 02:09
@jameysharp jameysharp added this pull request to the merge queue Feb 24, 2023
Merged via the queue into bytecodealliance:main with commit 7d790fc Feb 24, 2023
@jameysharp jameysharp deleted the branch-once branch February 24, 2023 06:22