block/br return value optimizations #667

Closed
kripken opened this issue Apr 21, 2016 · 10 comments

@kripken
Member

kripken commented Apr 21, 2016

There has been some discussion about block and br return values, and in particular how much benefit we can get from them, since both of our current compilation paths (asm2wasm and the wasm backend) barely use them or don't use them at all (asm2wasm uses block return values in rare cases, for asm.js sequences/commas, but never uses br values). I investigated this topic in the binaryen optimizer (the block-returns branch; note the code is not cleaned up yet).

My approach is to process control flow and see where it joins up, at the end of an if-else or a block, and then see if there is a set_local that flows through all those arriving branches (this would occur e.g. if an optimizing compiler emitted a phi node there). If so, we can use block and br return values to pass the result, instead of set_locals, and put a single set_local on the outside of the block or if. In other words, it unifies the set_locals, like this:

(block $out
  (if
    (..)
    (block
      (set_local $x (i32.const 10))
      (br $out)
    )
  )
  ..
  (set_local $x (i32.const 20))
)

==>

(set_local $x
  (block $out
    (if
      (..)
      (br $out (i32.const 10))
    )
    ..
    (i32.const 20)
  )
)

This removes set_locals and can enable further optimizations: for example, the new set_local is now on the outside, where it might be sunk into a later get_local (see the sketch below), and so forth.
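
As a minimal illustrative sketch (hypothetical code, not from the measured programs; $f is a stand-in function), if the hoisted set_local feeds the only get_local of that local, a later pass can sink it and remove the local entirely:

;; after the block-return rewrite
(set_local $x
  (block $out
    ..
  )
)
(call $f (get_local $x))

;; after sinking the set_local into its lone get_local, $x disappears
(call $f
  (block $out
    ..
  )
)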

On BananaBread, this removes around 4% of set_locals. (This was better than I expected, so I ran it through a bunch of tests to verify it's working properly.)

Note that this is an underestimate of the opportunity here, as

  1. It operates on the non-SSA form binaryen AST. Probably in SSA form more could be done (operating directly on phis).
  2. It doesn't track variables across basic blocks; each variable is only considered within its own basic block.
  3. This removes set_locals and get_locals, which might then allow reducing the total number of locals. But we can't test that since the binaryen optimizer doesn't have a register allocator (yet).

Overall, then, it seems like block/br return value optimizations can be pretty useful.

I also did some measurements relevant to multiple return values. I can't optimize them in the current AST, so I don't have full, tested data. But I counted how many locations have a set_local that can be optimized, and how many optimizable set_locals there are at each such location; when there is more than one, multiple return values might be employed (see the sketch below). It looks like 15.4% of potential locations have at least one optimizable set_local, with an average of 1.12 per location. In other words, a single optimizable set_local is common, and while sometimes there is more than one, that is rare.
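
To make the multiple-value case concrete (a hypothetical example, not taken from the measured code), two set_locals flowing through the same merge would need two block return values to get the same treatment:

(block $out
  (if
    (..)
    (block
      (set_local $x (i32.const 10))
      (set_local $y (i32.const 11))
      (br $out)
    )
  )
  ..
  (set_local $x (i32.const 20))
  (set_local $y (i32.const 21))
)

With a single return value, only one of $x and $y can be passed out through the block; the set_locals for the other must stay where they are.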

All of that carries caveats 1-3 from before, plus the caveat that, as mentioned, this is just counting data and not a full, tested optimization pass, so it would be good to see more investigation here. Still, this data suggests that a single return value from ifs, blocks, and brs captures almost all of the benefit.

@sunfishcode
Member

According to previous measurements here, set_local is about 10% of opcodes.

With that distribution, reducing set_local counts by 4% would produce a 0.4% reduction in total opcodes (10% of opcodes × a 4% reduction = 0.4% of all opcodes). Assuming that we eventually have some solution to make the majority of get_local and set_local opcodes single-byte (e.g. folded immediates in the opcode table), one might expect the overall reduction in code size to be even less.

@lukewagner
Member

lukewagner commented Apr 21, 2016

IIUC, the 4% figure doesn't include size reductions from comma/ternary. Alon: could you perhaps measure the size reduction from disabling the optimization that produces those?

@kripken
Member Author

kripken commented Apr 21, 2016

@sunfishcode: On the one hand, set_local is a fairly small node, so you might expect even less than that 0.4%. But the change in binary size is actually slightly more than 0.4% (I see 0.5%), since the 4% set_local reduction also enables other optimizations to do more work, as mentioned before.

I agree that any method that reduces set_local sizes would reduce this modest size improvement. However, even if the code size improvement were 0%, 4% fewer set_locals still means 4% fewer opcodes to handle when parsing and compiling, which I think is significant.

@lukewagner: Sadly there isn't a good way to get rid of asm.js commas and ternaries - they mostly aren't the result of optimizations. By and large, they are directly created in the asm.js backend or in asm2wasm. So we would need to lower them out to investigate what you ask - I can't think of a simple way to do that.

But, as a partial answer, the asm.js backend and asm2wasm directly create commas and ternaries because they are shorter than the alternatives. It's hard to measure the precise benefit, but it's definitely useful: not just for code size, it also makes compiling to wasm easier, since the form is more expressive (I think @rossberg-chromium has made this point). See the sketch below.
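
As an illustrative sketch (hypothetical locals and functions, not code from the issue), a comma expression like (f(), x) maps directly onto a block return value, and a ternary c ? a : b onto an if that yields a value:

;; (f(), x)
(block
  (call $f)
  (get_local $x)
)

;; c ? a : b
(if
  (get_local $c)
  (get_local $a)
  (get_local $b)
)

Lowering these to statements would require extra temporary locals and set_locals, which is exactly the overhead being avoided.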

@sunfishcode
Member

@kripken Currently set_local is a multi-byte operator, because it has an immediate and we don't have an opcode table yet, right?

@kripken
Member Author

kripken commented Apr 21, 2016

Yes. (But sorry, I don't get the context?)

@sunfishcode
Member

The context was the mention of overall binary size. If set_local shrinks from about 2 bytes to about 1, that's a significant difference and we should explicitly factor it into our calculations.

Also, concerning decode time, it would be good to measure this.

@kripken
Member Author

kripken commented Apr 21, 2016

Do you mean, shrink from 2 bytes to 1 using an opcode table of some sort? It's hard to speculate how that would affect things, as it would be a big change across the board. It could amplify or shrink the effects here.

@ghost

ghost commented Apr 21, 2016

@kripken Expressions demand some infrastructure, so there is runtime complexity. Returning values from blocks pushes the 'expression pattern' further and adds complexity to do so. As you note, multiple values are often passed around, and the 'expression pattern' can be pushed a little further with multiple-value support. Adding block1 can push it a little further still. Operators could be added to produce any number and order of values, but the options for consuming the values remain limited to single use: destructuring, pass-through parametric operators, result values, and arguments to a call. It seems like a lot of language development and a lot of potential bike-shedding.

Let me try to explain how I see an expressionless encoding as compelling. Consider if the current wasm expressions implicitly used local variables to store expression temporaries rather than a separate stack. The encoding efficiency has not changed, but now there is the option to re-use these values.

Next, examine each operator and note that it is now reading and writing local variables, just with some implicit get_local and set_local operations. The encoding efficiency still has not changed, but the runtime complexity of supporting expressions can be replaced by a local variable predictor, and a simple one at that: one that predicts an expression-stack allocation pattern. In the fallback case, when there is no prediction hit, operators would accept general local variable indexes, and these would handle the non-expression patterns more efficiently.

The 'expression' optimizations you are considering might still help, but they become 'register' allocation optimizations that match the predictor: the last local variable written in a block would be predicted to be the next value read, be a 'hit', and get a short encoding. The expression-based predictor would bias wasm toward expression patterns, giving developers an incentive to emit code in this allocation pattern and to reorder operators to fit it, just as the existing expression support does (see the sketch below).
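
As a minimal sketch (hypothetical temporaries $t0 and $t1, allocated in the stack-like pattern the predictor expects):

;; expression form
(set_local $r
  (i32.add (get_local $x) (i32.const 1))
)

;; expressionless form: every operator reads and writes explicit locals
(set_local $t0 (get_local $x))
(set_local $t1 (i32.const 1))
(set_local $r (i32.add (get_local $t0) (get_local $t1)))

When the reads follow the predicted stack pattern, as above, the local indexes need not be encoded explicitly, so the size matches the expression form.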

I believe an expressionless encoding would have many advantages in other areas of wasm, such as associating debug and meta info with code positions.

Whatever encoding wasm chooses to use, for the webll I will first canonicalize it to an expressionless encoding and then rebuild expressions for presentation; the meta information will reference positions in the canonical expressionless encoding.

@kripken
Member Author

kripken commented Apr 21, 2016

@JSStats: Hard for me to evaluate that approach without a simple concrete example showing how it would work, and how it differs from the current approach. Perhaps in an issue dedicated just to that?

@kripken
Member Author

kripken commented Jul 26, 2016

This was mostly presenting data and discussion, closing.
