block/br return value optimizations #667

Closed
kripken opened this issue Apr 21, 2016 · 10 comments

@kripken
Member

kripken commented Apr 21, 2016

There has been some discussion about block and br return values, and in particular how much benefit we can get from them, since both of our current compilation paths (asm2wasm and the wasm backend) barely use them or don't use them at all (asm2wasm uses block return values in rare cases, for asm.js sequences/commas, but never uses br values). I investigated this topic in the binaryen optimizer (the block-returns branch; note the code is not cleaned up yet).

My approach is to process control flow and see where it joins up, at the end of an if-else or a block, and then see if there is a set_local that flows through all those arriving branches (this would occur e.g. if an optimizing compiler emitted a phi node there). If so, we can use block and br return values to pass the result, instead of set_locals, and put a single set_local on the outside of the block or if. In other words, it unifies the set_locals, like this:

(block $out
  (if
    (..)
    (block
      (set_local $x (i32.const 10))
      (br $out)
    )
  )
  ..
  (set_local $x (i32.const 20))
)

==>

(set_local $x
  (block $out
    (if
      (..)
      (br $out (i32.const 10))
    )
    ..
    (i32.const 20)
  )
)

This removes set_locals and can enable further optimizations: for example, the new set_local is now on the outside, where it might be sunk into a later get_local (see the sketch below), and so forth.
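
As a minimal illustrative sketch (hypothetical code, not from the measured programs; $f is a stand-in function), if the hoisted set_local feeds the only get_local of that local, a later pass can sink it and remove the local entirely:

;; after the block-return rewrite
(set_local $x
  (block $out
    ..
  )
)
(call $f (get_local $x))

;; after sinking the set_local into its lone get_local, $x disappears
(call $f
  (block $out
    ..
  )
)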

On BananaBread, this removes around 4% of set_locals. (This was better than I expected, so I ran it through a bunch of tests to verify it's working properly.)

Note that this is an underestimate of the opportunity here, as

  1. It operates on the non-SSA form binaryen AST. Probably in SSA form more could be done (operating directly on phis).
  2. It doesn't track variables across basic blocks; each variable is only considered within its own basic block.
  3. This removes set_locals and get_locals, which might then allow reducing the total number of locals. But we can't test that since the binaryen optimizer doesn't have a register allocator (yet).

Overall, then, it seems like block/br return value optimizations can be pretty useful.

I also did some measurements relevant to multiple return values. I can't optimize them in the current AST, so I don't have full, tested data. But I counted how many locations have a set_local that can be optimized, and how many optimizable set_locals there are at each such location; when there is more than one, multiple return values might be employed (see the sketch below). It looks like 15.4% of potential locations have at least one optimizable set_local, with an average of 1.12 per location. In other words, a single optimizable set_local is common, and while sometimes there is more than one, that is rare.
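
To make the multiple-value case concrete (a hypothetical example, not taken from the measured code), two set_locals flowing through the same merge would need two block return values to get the same treatment:

(block $out
  (if
    (..)
    (block
      (set_local $x (i32.const 10))
      (set_local $y (i32.const 11))
      (br $out)
    )
  )
  ..
  (set_local $x (i32.const 20))
  (set_local $y (i32.const 21))
)

With a single return value, only one of $x and $y can be passed out through the block; the set_locals for the other must stay where they are.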

All of that carries caveats 1-3 from before, plus the caveat that, as mentioned, this is just counting data and not a full, tested optimization pass, so it would be good to see more investigation here. Still, this data suggests that a single return value from ifs, blocks, and brs captures almost all of the benefit.

@sunfishcode
Member

According to previous measurements here, set_local is about 10% of opcodes.

With that distribution, reducing set_local counts by 4% would produce a 0.4% reduction in total opcodes (10% of opcodes × a 4% reduction = 0.4% of all opcodes). Assuming that we eventually have some solution to make the majority of get_local and set_local opcodes single-byte (e.g. folded immediates in the opcode table), one might expect the overall reduction in code size to be even less.

@lukewagner
Member

lukewagner commented Apr 21, 2016

IIUC, the 4% figure doesn't include size reductions from comma/ternary. Alon: could you perhaps measure the size reduction from disabling the optimization that produces those?

@kripken
Member Author

kripken commented Apr 21, 2016

@sunfishcode: On the one hand, set_local is a fairly small node, so you might expect even less than that 0.4%. But the change in binary size is actually slightly more than 0.4% (I see 0.5%), since the 4% set_local reduction also enables other optimizations to do more work, as mentioned before.

I agree that any method that reduces set_local sizes would reduce this modest size improvement. However, even if the code size improvement were 0%, 4% fewer set_locals still means 4% fewer opcodes to handle when parsing and compiling, which I think is significant.

@lukewagner: Sadly there isn't a good way to get rid of asm.js commas and ternaries - they mostly aren't the result of optimizations. By and large, they are directly created in the asm.js backend or in asm2wasm. So we would need to lower them out to investigate what you ask - I can't think of a simple way to do that.

But, as a partial answer, the asm.js backend and asm2wasm directly create commas and ternaries because they are shorter than the alternatives. It's hard to measure the precise benefit, but it's definitely useful: not just for code size, it also makes compiling to wasm easier, since the form is more expressive (I think @rossberg-chromium has made this point). See the sketch below.
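
As an illustrative sketch (hypothetical locals and functions, not code from the issue), a comma expression like (f(), x) maps directly onto a block return value, and a ternary c ? a : b onto an if that yields a value:

;; (f(), x)
(block
  (call $f)
  (get_local $x)
)

;; c ? a : b
(if
  (get_local $c)
  (get_local $a)
  (get_local $b)
)

Lowering these to statements would require extra temporary locals and set_locals, which is exactly the overhead being avoided.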

@sunfishcode
Member

@kripken Currently set_local is a multi-byte operator, because it has an immediate and we don't have an opcode table yet, right?

@kripken
Member Author

kripken commented Apr 21, 2016

Yes. (But sorry, I don't get the context?)

@sunfishcode
Member

The context was the mention of overall binary size. If set_local shrinks from about 2 bytes to about 1, that's a significant difference and we should explicitly factor it into our calculations.

Also, concerning decode time, it would be good to measure this.

@kripken
Member Author

kripken commented Apr 21, 2016

Do you mean, shrink from 2 bytes to 1 using an opcode table of some sort? It's hard to speculate how that would affect things, as it would be a big change across the board. It could amplify or shrink the effects here.

@ghost

ghost commented Apr 21, 2016

@kripken Expressions demand some infrastructure, so there is runtime complexity. Returning values from blocks pushes the 'expression pattern' further and adds complexity to do so. As you note, multiple values are often passed around, and the 'expression pattern' can be pushed a little further with multiple-value support. Adding block1 can push it a little further still. Operators could be added to produce any number and order of values, but the options for consuming the values remain limited to single use: destructuring, pass-through parametric operators, result values, and arguments to a call. It seems like a lot of language development and a lot of potential bike-shedding.

Let me try to explain how I see an expressionless encoding as compelling. Consider if the current wasm expressions implicitly used local variables to store expression temporaries rather than a separate stack. The encoding efficiency has not changed, but now there is the option to re-use these values.

Next, examine each operator and note that it is now reading and writing local variables, just with some implicit get_local and set_local operations. The encoding efficiency still has not changed, but the runtime complexity of supporting expressions can be replaced by a local variable predictor, and a simple one at that: one that predicts an expression-stack allocation pattern. In the fallback case, when there is no prediction hit, operators would accept general local variable indexes, and these would handle the non-expression patterns more efficiently.

The 'expression' optimizations you are considering might still help, but they become 'register' allocation optimizations that match the predictor: the last local variable written in a block would be predicted to be the next value read, be a 'hit', and get a short encoding. The expression-based predictor would bias wasm toward expression patterns, giving developers an incentive to emit code in this allocation pattern and to reorder operators to fit it, just as the existing expression support does (see the sketch below).
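
As a minimal sketch (hypothetical temporaries $t0 and $t1, allocated in the stack-like pattern the predictor expects):

;; expression form
(set_local $r
  (i32.add (get_local $x) (i32.const 1))
)

;; expressionless form: every operator reads and writes explicit locals
(set_local $t0 (get_local $x))
(set_local $t1 (i32.const 1))
(set_local $r (i32.add (get_local $t0) (get_local $t1)))

When the reads follow the predicted stack pattern, as above, the local indexes need not be encoded explicitly, so the size matches the expression form.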

I believe an expressionless encoding would have many advantages in other areas of wasm, such as associating debug and meta info with code positions.

Whatever encoding wasm chooses to use, for the webll I will first canonicalize it to an expressionless encoding and then rebuild expressions for presentation; the meta information will reference positions in the canonical expressionless encoding.

@kripken
Member Author

kripken commented Apr 21, 2016

@JSStats: Hard for me to evaluate that approach without a simple concrete example showing how it would work, and how it differs from the current approach. Perhaps in an issue dedicated just to that?

@kripken
Member Author

kripken commented Jul 26, 2016

This was mostly presenting data and discussion, closing.
