x64: Only branch once in br_table by jameysharp · Pull Request #5850 · bytecodealliance/wasmtime

jameysharp · 2023-02-22T02:34:27Z

This uses the cmov that was necessary anyway for Spectre mitigation to clamp the table index, instead of zeroing it. By then placing the default target as the last entry in the table, we can use just one branch instruction in all cases.
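A minimal sketch of the idea, written as ordinary Rust rather than machine code. The function names and the use of `u32` labels for branch targets are hypothetical and chosen only for illustration; the real lowering works on registers and a jump table in the instruction stream.

```rust
/// Hypothetical model of the old lowering: a bounds check that branches to the
/// default target, followed by an indirect jump through the table.
fn select_two_branches(idx: u32, targets: &[u32], default: u32) -> u32 {
    if (idx as usize) >= targets.len() {
        return default; // first branch: out-of-range index goes to the default
    }
    targets[idx as usize] // second branch: indirect jump through the table
}

/// Hypothetical model of the new lowering: the default target is stored as the
/// last table entry and the index is clamped to that entry (the job the `cmov`
/// does), so a single table lookup covers every case.
fn select_one_branch(idx: u32, targets: &[u32], default: u32) -> u32 {
    let mut table = targets.to_vec();
    table.push(default); // default target becomes the last entry
    let last = table.len() - 1;
    let clamped = (idx as usize).min(last); // branch-free clamp
    table[clamped]
}
```

With `targets = [10, 20, 30]` and `default = 99`, both functions return 20 for index 1 and 99 for index 3 or any larger index.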

This is a net savings of two bytes in the encoding of x64's br_table pseudoinstruction.

I haven't done any benchmarking yet. I'm guessing either this will be faster because it doesn't branch as often, or it will be slower because it executes more instructions in the common case where the bounds check succeeds.

I also haven't updated the comments in the implementation because first I wanted to see if this works at all (it passes the test suite!) and whether folks have strong arguments for or against this change.

It's just a random idea I had and thought I'd try out.

cfallin

This seems reasonable to me; thanks! I like that the logic is more cleanly split -- the implicit flags-into-pseudoinstruction always bothered me. I'm curious about performance results (and let's wait to see before merging, out of an abundance of caution) but suspect it will be either neutral or slightly positive because it eliminates the separate branch for the default target (as you note), which could result in mispredicts.

cfallin · 2023-02-23T01:03:27Z

cranelift/codegen/src/isa/x64/inst/mod.rs

            collector.reg_use(*idx);
            collector.reg_early_def(*tmp1);
-            collector.reg_early_def(*tmp2);
+            collector.reg_def(*tmp2);


Can we add a comment here that this is safe because tmp2 is written only after idx is used, and likewise a comment in emit.rs in the listing of the emitted sequence that if this property changes, we need to update regalloc metadata?
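The property being asked about can be illustrated with a toy model. Everything here is an assumption made for illustration: the `Step` type, the `needs_early_def` helper, and the example sequence are hypothetical and are not taken from emit.rs or from the register allocator. The model treats each emitted instruction as reading its operands before writing its results, and says a temporary needs an early def only if an input is still read after the temporary is first written.

```rust
/// One emitted machine instruction in a toy model: which operands it reads
/// and which it writes. Reads are assumed to happen before writes.
struct Step {
    reads: &'static [&'static str],
    writes: &'static [&'static str],
}

/// Returns true if `tmp` must be an early def: some input is still read by a
/// step that comes strictly after the first step that writes `tmp`.
fn needs_early_def(seq: &[Step], tmp: &str, inputs: &[&str]) -> bool {
    let first_write = match seq.iter().position(|s| s.writes.iter().any(|w| *w == tmp)) {
        Some(i) => i,
        None => return false, // never written, so no constraint
    };
    seq[first_write + 1..]
        .iter()
        .any(|s| s.reads.iter().any(|r| inputs.iter().any(|i| i == r)))
}

/// Illustrative jump-table sequence (an assumption, not copied from emit.rs).
fn example_sequence() -> Vec<Step> {
    vec![
        Step { reads: &[], writes: &["tmp1"] },               // tmp1 = table address
        Step { reads: &["tmp1", "idx"], writes: &["tmp2"] },  // tmp2 = table[idx]
        Step { reads: &["tmp1", "tmp2"], writes: &["tmp1"] }, // tmp1 += tmp2
        Step { reads: &["tmp1"], writes: &[] },               // jump to tmp1
    ]
}
```

In this model `tmp1` needs an early def because `idx` is read after `tmp1` is first written, while `tmp2` does not because nothing reads `idx` after the step that writes `tmp2`. If the sequence were reordered so that `idx` is read later, the check for `tmp2` would flip, which is the situation the requested comments are meant to guard against.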

jameysharp · 2023-02-23T01:47:05Z

Spidermonkey retires slightly fewer instructions during execution, slightly more during compilation, but no significant difference in CPU cycles in either case.

Significant spidermonkey performance differences

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 23022316.85 ± 32869.86 (confidence = 99%)

  branch-once-82e006e56.so is 1.01x to 1.01x faster than main-0f51338de.so!

  [2852144560 2852262544.40 2852550305] branch-once-82e006e56.so
  [2875156298 2875284861.25 2875531456] main-0f51338de.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 37675655.35 ± 1170739.83 (confidence = 99%)

  main-0f51338de.so is 1.00x to 1.00x faster than branch-once-82e006e56.so!

  [38943859484 38953919135.25 38956367889] branch-once-82e006e56.so
  [38906558057 38916243479.90 38921200049] main-0f51338de.so

The pulldown-cmark benchmark executes slightly faster by CPU cycles and slightly slower by number of instructions retired, which I think means it's hitting the default branch somewhat often.

Significant pulldown-cmark performance differences

execution :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 87213.87 ± 20212.56 (confidence = 99%)

  branch-once-82e006e56.so is 1.01x to 1.01x faster than main-0f51338de.so!

  [8033614 8138405.50 8285060] branch-once-82e006e56.so
  [8140335 8225619.37 8496124] main-0f51338de.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 117594.21 ± 109.08 (confidence = 99%)

  main-0f51338de.so is 1.00x to 1.00x faster than branch-once-82e006e56.so!

  [23761214 23761446.97 23762080] branch-once-82e006e56.so
  [23643597 23643852.76 23644483] main-0f51338de.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 3504417.95 ± 49961.22 (confidence = 99%)

  main-0f51338de.so is 1.00x to 1.00x faster than branch-once-82e006e56.so!

  [1652866013 1653078182.09 1653490982] branch-once-82e006e56.so
  [1649289182 1649573764.14 1650067353] main-0f51338de.so

But overall it looks to me like there's very little performance difference either way on these benchmarks.
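To put the reported figures in proportion, the relative difference between the two means can be computed directly from the numbers above. This is a sketch under the assumption that the middle value in each bracketed triple is the mean; the helper name `relative_delta` is hypothetical.

```rust
/// Relative difference between the slower and the faster mean, as a fraction
/// of the faster one. Assumes both values are positive.
fn relative_delta(slower_mean: f64, faster_mean: f64) -> f64 {
    (slower_mean - faster_mean) / faster_mean
}

/// Means copied from the results above, returned as
/// (label, relative difference) pairs.
fn reported_deltas() -> Vec<(&'static str, f64)> {
    vec![
        (
            "spidermonkey execution, instructions retired",
            relative_delta(2875284861.25, 2852262544.40),
        ),
        (
            "pulldown-cmark execution, cpu cycles",
            relative_delta(8225619.37, 8138405.50),
        ),
        (
            "pulldown-cmark execution, instructions retired",
            relative_delta(23761446.97, 23643852.76),
        ),
    ]
}
```

Each of these works out to roughly one percent or less, which is consistent with the "1.01x" and "1.00x" figures in the output.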

This uses the `cmov`, which was previously necessary for Spectre mitigation, to clamp the table index instead of zeroing it. By then placing the default target as the last entry in the table, we can use just one branch instruction in all cases.

Since there isn't a bounds-check branch any more, this sequence no longer needs Spectre mitigation. And since we don't need to be careful about preserving flags, half the instructions can be removed from this pseudoinstruction and emitted as regular instructions instead.

This is a net savings of three bytes in the encoding of x64's br_table pseudoinstruction. The generated code can sometimes be longer overall because the blocks are emitted in a slightly different order.

My benchmark results show a very small effect on runtime performance with this change.

The spidermonkey benchmark in Sightglass runs "1.01x faster" than main by instructions retired, but with no significant difference in CPU cycles. I think that means it rarely hit the default case in any br_table instructions it executed.

The pulldown-cmark benchmark in Sightglass runs "1.01x faster" than main by CPU cycles, but main runs "1.00x faster" by instructions retired. I think that means this benchmark hit the default case a significant amount of the time, so it executes a few more instructions per br_table, but maybe the branches were predicted better.

jameysharp requested a review from cfallin February 22, 2023 02:34

github-actions bot added the cranelift and cranelift:area:x64 labels Feb 22, 2023

jameysharp force-pushed the branch-once branch from e0f37ca to 82e006e Compare February 22, 2023 23:23

jameysharp marked this pull request as ready for review February 22, 2023 23:24

cfallin approved these changes Feb 23, 2023

elliottt mentioned this pull request Feb 23, 2023

Remove module-level code generation tests #5870

Merged

jameysharp force-pushed the branch-once branch from 82e006e to 15e9985 Compare February 24, 2023 02:08

jameysharp enabled auto-merge February 24, 2023 02:09

jameysharp added this pull request to the merge queue Feb 24, 2023

Merged via the queue into bytecodealliance:main with commit 7d790fc Feb 24, 2023

jameysharp deleted the branch-once branch February 24, 2023 06:22

jameysharp merged 1 commit into bytecodealliance:main from jameysharp:branch-once
