x64: Only branch once in br_table#5850
Conversation
e0f37ca to
82e006e
Compare
cfallin
left a comment
There was a problem hiding this comment.
This seems reasonable to me; thanks! I like that the logic is more cleanly split -- the implicit flags-into-pseudoinstruction always bothered me. I'm curious about performance results (and let's wait to see before merging, out of abundance of caution) but suspect it will be either neutral or slightly positive because it eliminates the separate branch for the default target (as you note) which could result in mispredicts.
| collector.reg_use(*idx); | ||
| collector.reg_early_def(*tmp1); | ||
| collector.reg_early_def(*tmp2); | ||
| collector.reg_def(*tmp2); |
There was a problem hiding this comment.
Can we add a comment here that this is safe because tmp2 is written only after idx is used, and likewise a comment in emit.rs in the listing of the emitted sequence that if this property changes, we need to update regalloc metadata?
|
Spidermonkey retires slightly fewer instructions during execution, slightly more during compilation, but no significant difference in CPU cycles in either case. Significant spidermonkey performance differencesThe pulldown-cmark benchmark executes slightly faster by CPU cycles and slightly slower by number of instructions retired, which I think means it's hitting the default branch somewhat often. Significant pulldown-cmark performance differencesBut overall it looks to me like there's very little performance difference either way on these benchmarks. |
This uses the `cmov`, which was previously necessary for Spectre mitigation, to clamp the table index instead of zeroing it. By then placing the default target as the last entry in the table, we can use just one branch instruction in all cases. Since there isn't a bounds-check branch any more, this sequence no longer needs Spectre mitigation. And since we don't need to be careful about preserving flags, half the instructions can be removed from this pseudoinstruction and emitted as regular instructions instead. This is a net savings of three bytes in the encoding of x64's br_table pseudoinstruction. The generated code can sometimes be longer overall because the blocks are emitted in a slightly different order. My benchmark results show a very small effect on runtime performance with this change. The spidermonkey benchmark in Sightglass runs "1.01x faster" than main by instructions retired, but with no significant difference in CPU cycles. I think that means it rarely hit the default case in any br_table instructions it executed. The pulldown-cmark benchmark in Sightglass runs "1.01x faster" than main by CPU cycles, but main runs "1.00x faster" by instructions retired. I think that means this benchmark hit the default case a significant amount of the time, so it executes a few more instructions per br_table, but maybe the branches were predicted better.
82e006e to
15e9985
Compare
This uses the
cmovthat was necessary anyway for Spectre mitigation to clamp the table index, instead of zeroing it. By then placing the default target as the last entry in the table, we can use just one branch instruction in all cases.This is a net savings of two bytes in the encoding of x64's br_table pseudoinstruction.
I haven't done any benchmarking yet. I'm guessing either this will be faster because it doesn't branch as often, or it will be slower because it executes more instructions in the common case where the bounds check succeeds.
I also haven't updated the comments in the implementation because first I wanted to see if this works at all (it passes the test suite!) and whether folks have strong arguments for or against this change.
It's just a random idea I had and thought I'd try out.