Tilt: a proposal for fast control flow #33
Under what conditions does tilt allow the implementation to do more than what it would have done if it jump-threaded any CFG path A->B->C where B is a switch on a variable, the variable is known to be a constant in A, and C is the destination that B would jump to when passed that constant? -Filip
I think that in a way we can summarize Tilt syntax as explicitly writing out where jump threading can be performed. It's very straightforward to optimize when written that way, but does not provide more power than a sufficiently smart jump threading could have achieved.
The jump threading rule that I described is easy to implement - in fact it’s one of the easiest. Note that it requires that block B in the A->B->C chain is only doing a branch/switch on something that is constant in A, and this can be deduced just from looking at the inputs to the Phi for the variable being switched on. If B does more than just that branch/switch, the threading would bail on the grounds that increasing code size isn’t worth it. Though, usually, compilers allow for B to do up to N other things, for some carefully chosen and usually small value of N.

So, I would object to adding Tilt syntax if it can’t be demonstrated that Tilt profitably enables more jump threading than would be done by that rule. That is: one could argue that Tilt is a hint to do more jump threading than you’d already have done, but the counterargument is that the compiler will jump-thread when it knows it is profitable based on some simple code-size-versus-cost-of-branch rules - and so it would likely ignore the Tilt hint anyway. Can you show a case where a compiler could do something with Tilt information that it could not have done if the exact same compiler rule was simply generalized to handle any case of a switch on an integer variable?

Note that I’m assuming that any webasm implementation will have to be sophisticated enough to support constant folding, CSE, LICM, basic CFG simplifications, register allocation, and instruction selection. If you have those things - which are all probably necessary just to clean up redundancies that are introduced during translation - then the jump threading rule I describe is trivial to add. -Filip
What's concerning about the jump-threading optimization is that it pushes us back more into the JS domain of heuristic optimization and away from one of the high-level goals of predictable near-native perf. Specifically:
That being said, I think the same concerns apply, though to a lesser degree, to this tilt feature. I think the metaphor and comparison with jump-threading is valid. When Alon brought up this idea in #27, I asked if we couldn't define a related, but simpler control-flow primitive. Paraphrasing it here: for each of the loop statements and
To jump to one of these non-default exits, there'd be a

The response in #27 was: "I think WhileWithExits is not quite enough (need to be able to enter loops as well)", which I see is related to the Duff's device example above. However, Duff's device isn't reducible (multiple entries into loop) so I'm fine that we can't optimally represent that control flow pattern. (The knowing programmer should have peeled off the switch of Duff's device anyway ;) Are there any other reducible patterns that a

An argument to all of this is that we shouldn't bother if what we have now is Good Enough and we have an immediately-post-v.1 goal to support general, efficient irreducible control flow. I can buy this argument and I'd be fine leaving things as is for v.1, but, fwiw, I think implementing the
@pizlonator: I'm not sure either way if the simple loop threading you describe is sufficient. Would it be able to remove all appearances of the `label` helper variable?
@lukewagner:
This would generally be due to irreducible control flow, so perhaps you don't care about it, although I see the relooper currently emits it in other cases as well (but those could be flipped so the loop is inside the multiple; I didn't check why the current heuristics prefer the current form, might be a code size thing in JS). But I have a difference in opinion about irreducible control flow. I think if we have an easy way to make it predictably fast, then we should, and Tilt offers that. It's better than tail calls for that in some ways. Basically, with Tilt, we would know that an arbitrary CFG from LLVM can run predictably fast - I think that's a good thing! I'd also be ok to delay this discussion til later. Looks like we agreed on the core 2 loop types, which would make defining Tilt or something else easier (the forever loop is important there).
I don't see how tilt offers any advantages over signature-restricted PTCs; they by definition compile down to jumps whereas tilt (necessarily) doesn't.
Tilt does compile to jumps. There are two ways to implement it, as mentioned above: the trivial way, using a helper variable, and the expected way, which does a small analysis that lets it emit jumps.
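Roughly, the two strategies look like this (an illustrative sketch in the proposal's pseudo-syntax; the exact lowering shown is one possibility, not a prescribed one):

```
// Input:
tilt -> L1;
// ... structured control flow that eventually reaches the multiple ...
multiple {
  L1: { l_1(); }
  L2: { l_2(); }
}

// Trivial lowering: keep an actual helper variable.
label = 1;                          // numerical id chosen for L1
// ...
if (label == 1)      { l_1(); label = 0; }
else if (label == 2) { l_2(); label = 0; }

// Expected lowering: a small analysis shows this tilt always reaches the
// multiple with label still set to L1's id, so the VM emits a direct jump
// to L1's block and drops the helper variable entirely.
```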
IIUC, even when the analysis is present, it depends on what happens in between the `tilt` and the `multiple`.
It requires a bit more than just my jump threading rule - it also requires simple reasoning about the constant folding of Phi functions, using a special case of an existing SSA conversion algorithm. I believe that any webasm backend will have this, as well. Also, if you write a backend that doesn’t use SSA, you could achieve the same effect using SCCP.

But, can you also work it out the way you anticipate an implementation using tilt? I’d like to see what you have in mind. Because, I believe that to use tilt to restore the original CFG you would also have to have the same machinery as what I’m proposing, except it would be specific to tilt rather than generalizing over integers.

I’m going to skip over one step: the “breaks” in the switch statement will be jump threaded to jump to the loop header. This is a strictly simpler jump threading than what I’m proposing and I assume that jump threading runs to fixpoint. So, here’s the CFG with SSA after that simple change, so we can focus on the much harder jump threading that actually has to consider Phi’s over constants:

```
Lstart:
Lloop:          // predecessors: Lstart, LswitchCase1, LswitchCase2, Lswitch
LsetLabel1:
LdontSetLabel1:
LsetLabel2:
Lwork:
Lswitch:        // predecessors: LsetLabel1, LsetLabel2, Lwork
LswitchCase1:
LswitchCase2:
```

Now running jump threading for one iteration will yield the following simplifications:

This results in the following code:

```
Lstart:
Lloop:          // predecessors: Lstart, LswitchCase1, LswitchCase2, Lswitch
LsetLabel1:
LdontSetLabel1:
LsetLabel2:
Lwork:
Lswitch:        // predecessors: Lwork
LswitchCase1:
LswitchCase2:
```

Now observe that:

This means that we can remove the switch statement, because its operand is the constant 0:

```
Lstart:
Lloop:          // predecessors: Lstart, LswitchCase1, LswitchCase2, Lswitch
LsetLabel1:
LdontSetLabel1:
LsetLabel2:
Lwork:
Lswitch:        // predecessors: Lwork
LswitchCase1:
LswitchCase2:
```

Now you can see that none of the label variables are used, and DCE can eradicate them. The jump from Lwork to Lswitch can then be jump-threaded to be a jump directly to Lloop. I believe that this constitutes the optimal CFG you were looking for?
I agree that we don’t need this at all if what we have already is good enough. Still, I’d like to get to the bottom of what you guys are proposing. My objections are really a sincere desire to understand what you’re thinking. Are you suggesting that Tilt should force the implementation to jump-thread no matter what, or are you suggesting something else?

If you force jump threading no matter what, then the Tilt feature is sort of a bizarre denial-of-service bomb, no? So, probably an implementation will still have a heuristic for when to clone code. It will have to detect explosion because, at a minimum, it would have to do this to preserve its own integrity. But an implementation that is interested in being competitive will probably do the smarter thing: it would completely ignore the tilt hint if the code bloat is above some very small threshold, because in practice the cost of branching on a small integer value in a register is so tiny. The threshold at which tilt is profitable is likely to be identical to the threshold at which ordinary jump threading is profitable. Ergo, tilt is superfluous.

But maybe I have misunderstood! Maybe there is something about tilt that enables the compiler to restore the CFG back to its original, pre-relooping state, without code duplication. If that’s the case, then great! But I just don’t see it.

Specific comments inline:
But this should be reliably threadable: Block A sets some variable V to a constant K and jumps to B. Block B does nothing but a switch on V. The constant K that A uses is also one of the case values in the switch in B. That case value jumps to block C. Therefore, replace the A->B jump with an A->C jump. My preference would be to mention none of this in the webasm spec, except maybe non-normative text that warns of the possibility of “label” variables and the potential benefits of jump threading.
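For concreteness, a hypothetical C-level sketch of the shape this rule catches (the function, labels, and constant are made up for illustration):

```c
/* Block A sets V to the constant 2 and jumps to B; B does nothing but switch
   on V; the case for 2 goes to C2. Jump threading can therefore rewrite the
   A->B edge into a direct jump to C2, after which the switch (and eventually
   V itself) becomes dead. */
int dispatch(int input) {
    int V;
    /* block A */
    V = 2;
    goto B;
    /* block B */
B:
    switch (V) {
    case 1:  goto C1;
    case 2:  goto C2;   /* so A -> B can become A -> C2 */
    default: goto C3;
    }
    /* blocks C1, C2, C3 */
C1: return input + 1;
C2: return input + 2;
C3: return input;
}
```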
I guess I was assuming that the relooper ran after the dust had settled.
Doing jump threading is something I’ve done in static analysis in the past, precisely because the pattern that tilt covers is a pattern that occurs in user code. Classic example: bool found = false;
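A sketch of that classic pattern in full (the search loop and the later branch on `found` are assumptions for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* The classic "flag variable" shape: `found` is set to a constant inside the
   loop, and the branch on it afterwards is exactly the kind of edge that jump
   threading turns into a direct jump from the break to the "found" path. */
bool contains(const int *values, size_t n, int target) {
    bool found = false;
    for (size_t i = 0; i < n; ++i) {
        if (values[i] == target) {
            found = true;   /* constant known on this path */
            break;          /* this edge can be threaded to the if-true branch */
        }
    }
    if (found) {
        return true;        /* "found" path */
    }
    return false;           /* "not found" path */
}
```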
Also, do you want baseline compilers to do jump threading? I wouldn’t. It appears that the Tilt proposal hints at this: you are right to treat Tilt as a “label” variable.
Right.
This could be nice - but I suspect it’s superfluous if an implementation just does jump threading!
@pizlonator Indeed, I would not want the baseline to do jump threading; this point was mostly just leading up to the point that the
@lukewagner: Yeah, that's the point at the end of the proposal: to avoid yucky stuff like that, we'd need to define the semantics precisely, and make the yucky stuff a syntax error. Possible, and not that hard, especially if we've agreed on the 2 basic loop types already. In Tilt done that way,
Anything not causing a syntax error is jump-able; anything giving a syntax error should not have worked. (If I didn't miss anything... ;)

@pizlonator: Nice! Cool that LLVM can do it. It does hit the issues @lukewagner was raising though, of how much smarts we want to assume on the client VM. We'd lose predictability by relying on that specific chain of optimizations, which worries me. Regarding whether a Tilt implementation could do the same, I believe yes. Look at the more detailed semantics I wrote above for @lukewagner in this comment. That should work (I hope). Also, note how there is never a need for code duplication. Literally, you can emit a jump when you see a Tilt (that didn't cause a syntax error) - no heuristics, no code blowup, no DoS.
Your comments convince me even more that we don’t need Tilt syntax and that we don’t need normative text describing any of this. Your proposal is a jump threading algorithm just like mine. It’s structured differently but it will catch the same cases - the only difference is that yours only looks for “tilt -> X” while mine will apply to any assignment of a constant value to a variable. Your proposal needs to be in the spec because it yields syntax errors. My proposal doesn’t need to be in the spec because it’s an optimization. I don’t see the value of having a mandatory optimization with new syntax and a new error mode, if the same can be achieved by taking that same optimization and running it as part of the optimization pipeline in the webasm backend. Doing so makes this an implementation issue and we don’t need to say anything about it other than maybe non-normative discussion.

Maybe the spec could be made better by having non-normative text about optimizations. We could say, for example, what a recommended optimization pipeline could look like. It might be worth saying that a webasm backend ought to do register allocation and instruction selection, and maybe some text calling out that we anticipate the need to run constant folding, CSE and basic CFG simplifications. Jump threading is a classic CFG simplification, and we could point out that it is beneficial because of the lack of general control flow constructs. But I’d be fine with excluding any such text - it is the implementor’s job to figure this stuff out.

More comments inline:
You’re describing a jump threading algorithm that could be made to work for any assignment of a constant to a variable. Just replace “tilt -> X” with “variable = X where X is constant”, and remove the part about the syntax error. The only difference to my jump threading rule is that mine speaks of blocks and Phis rather than assignments to variables (I say that “tilt -> X” is just an assignment to a variable). But I believe they will catch the same cases.
(As an aside, I was just reading the LLVM JumpThreading pass, and it appears that it can do all of this, and probably a fair bit more. It appears to be OK with sometimes cloning code, it does some opportunistic PRE, and it subsumes my jump-threading rule by a two-step process: (1) if A jumps to B, B has a Phi and switches/branches on the value of the Phi, then the jump in A is replaced by the contents of B; and (2) attempt constant-folding on each branch/switch using something like SCCP. My rule happens because if you have A jumps to B and then B switches on a value known to be constant in A, then B will be switching in a Phi so rule (1) holds and then rule (2) constant-folds the switch. As with everything in LLVM, it’s kind of nuts - running blame on it shows that it’s the subject of never-ending carnage.)
Not really. No implementation would be required to implement this. Instead, it would be one of the obvious things that you’d want to implement to get performance, like register allocation and instruction selection. Surely we aren’t going to require implementations to do those other optimizations or any optimization for that matter. Jump threading is just another optimization. We don’t have to require it, we don’t need syntax to aid it, and the most we have to do is have non-normative text suggesting it on the grounds that the webasm generator might create code patterns that benefit from it.
Right. Neither of us is proposing a rule that mandates code duplication. Both of our jump threading rules will catch the same patterns, if you accept that “tilt -> X” is just an assignment of a constant to a variable. -Filip
(Oh, sorry, I thought that was LLVM you were mentioning there.) Fair points. Overall I think this all comes down to different opinions on how predictable perf should be on the client. If we want all control flow to be predictably fast, I think we need this in the spec; if we prefer to leave optimizations to VMs, then we don't.

Regarding that both options (Tilt and a good optimizing VM) do jump threading, the difference as I see it is that jump threading in the semantics I wrote would be on the AST - something the user emits and controls - while the optimization passes you mention would run on a compiler-internal data structure - something the user has far less control over (and perhaps no understanding of). That ties in again to the issue of predictability being possible with Tilt, but uncertain if we leave it to VMs to optimize on their own.

But also, on the AST we can easily see what can be turned into jumps; it is less clear that that would be the case after it is transformed to the internal compiler IR, which is "messier", lower-level, and might reorder things a little, e.g., some code might show up between (what used to be) the Tilt and the multiple. So I am not sure that the VM's jump threading would always work where Tilt would, but I do see your point that it is likely to succeed.
This discussion makes me question again what it is that we're trying to optimize here with our design choice of structured control flow primitives over 'goto'. My current answers are:
1 and 2 could be preserved by keeping the structured control flow primitives we have now and relooping as a standard compiler pass. As for 3, if we are generating irreducible control flow in the backend anyway (via jump-threading/tilt optimizations), I'm not sure what's really simpler here. OTOH
I think that having an occasional branch on a value available in a local variable is sufficiently low cost that we don’t need it in the spec, especially considering that the vast majority of these cases - likely all of them - will be caught by jump threading.
I just don’t buy this. When it comes to constants in local variables and branches, you see strictly more information when you go to a lower-level IR. -Filip
Possible additional benefit:
You don’t have to check that your goto branch offsets are within bounds. Such “security” benefits are so subjective. What do y’all think?
I think that wasm with structured control flow and an occasional “label” variable due to weird CFGs will still qualify as something that you could compile somewhat efficiently with a single-pass compiler. The output of this compiler won’t be as good as the output of a beefy backend, but that’s not because of the lack of goto or jump threading - it’ll be because of all of the other things you lose when your backend is restricted to a single pass. -Filip
In the polyfill prototype, most statement nodes are encoded as a single byte (if, loops, break, continue, ret); only the labeled breaks/continues/switch/block have a variable-length int immediate (# enclosing blocks to break out of / # cases / # stmts). When I imagine a compact encoding of basic blocks + goto, all of these single-byte statement nodes would require additional var-int immediates (index of the target block) for what are currently child statement nodes. Also, for the labeled-break/continue ops, the immediates would likely be on average larger (many basic blocks vs usually-small block nesting depth) and thus take more bytes for the variable-length int.

In the prototype polyfill, statement nodes constitute 12% of the bytes. I would estimate the above goto scheme (in the absence of structured control flow nodes) would require at least twice the bytes (though I can take the time to do a real experiment later). Even with compression, there should still be a significant difference, though not enormous. When you combine this long-term win with the short-term big win for the polyfill, I think the argument is pretty compelling, though, for including these structured primitives.
My first inclination is that a function body would be described as an array of basic blocks (much like the code section of a module is an array of functions) and thus a goto would use the index of the block, not a raw offset into the byte stream. Thus, the case of bounds checking goto targets would be analogous to all the other cases where indices are used (function, local var, global).
Yes, we have established that it does well enough on average, taking "average" to mean that on an average program, performance will not be affected by these bad cases (w/o any jump-threading opts). However, the occasional program will be tremendously affected. The first example that comes to mind is the Guile Scheme interpreter, which Marc Feeley tries to Emscripten-compile from time to time and which does much worse than native. A second bad example is sqlite, which ends up with a lot of label usage that hurts general codegen. In general, though, there is a class of programs that spend the majority of their time in one big spaghetti function. This doesn't mean we have to care about them for v.1, but if we're considering longer-term ideals, I think we should. I also don't think we should commit any warts to wasm solely for the "simple compilers on spaghetti codes" use case; but I do think this is a mild use case in the "pro" column of goto.
I don't see how Tilt is composable, since it relies on a single hidden `label` variable. E.g.

L1: while (1) {
  QuestionMark
}

In this case, if QuestionMark contains a tilt (even an internal one), then
Yes, tilt isn't meant to be composable in that sense - if QuestionMark is basically anything, that wouldn't be a valid tilt (except for the small list of necessary exceptions in the list given above; stuff like entering a forever loop is ok). Tilts need to be super-simple to figure out, almost as simple as gotos, basically. I would say that composability is not important because tilts are just a way to represent control flow in a structured way: they would be generated as the final step of converting a program to wasm, and removed among the first steps of parsing wasm on the client, so they aren't meant to be part of an IR that you do things like inlining and other compositions.
All other control structures so far have the single-entry single-exit property. -B
I do agree that composability is a nice property to have, and "proper" tilt - tilt with errors on yucky syntax - does harm that. You can maintain full composability, however, if you are ok with "loose" tilt - tilt where yucky stuff does nothing. In other words, for proper tilt to literally be guaranteed to optimize into a goto (if it didn't trigger a syntax error), you have to lose composability, because proper tilt is in fact like a proof that it can turn into a goto, and the proof doesn't compose with other stuff; it has to remain in a certain very inflexible form.
A bit late to the party here, but here's what I'm thinking. A "loose" tilt is no better than having a local variable and running a jump threading algorithm. If we were to have a loose tilt, then I don't think I would bother with including it in our internal IR, since we would still need to perform all the same validation as it would for a more generic jump threading. For this reason, I would only really consider the "proper" tilt.

A "proper" tilt would be stronger in that a baseline JIT could safely do jump threading without any analysis needed. It also might save some work in your full JIT if your current backend wouldn't already catch this jump threading opportunity. This advantage makes it slightly appealing, but the reduction in composability and increase in semantic complexity are serious drawbacks. Future AST constructs will necessarily have to consider how they impact tilt/multiple, and while I don't really foresee any terrible problems, I can imagine that this feature continues to accrue complexity. For example, maybe we will need to add semantics for how this will interact with mutexes. And maybe we won't, but I'm afraid this will end up being a feature which we constantly need to consider when adding new constructs. It also adds some extra complexity while validating, e.g. it will require loop detection to cover cases like tilt -> X; forever { continue; }.

For me, the drawback of increasing complexity outweighs the advantage of a potentially faster baseline JIT, and I'd prefer to stick with what we have (though I think gotos deserve their own discussion and serious consideration).
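For instance, a sketch of that case in the proposal's pseudo-syntax (`f()` is a placeholder):

```
tilt -> X;                // sets the hidden label to X
forever { continue; }     // never exits, so the multiple below is unreachable
multiple {
  X: { f(); }             // a "proper" tilt must reject the tilt above,
}                         // which requires detecting the infinite loop
```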
Well put, and I've come to think that way as well. Proper tilt if anything, but even that is not worth it due to the downsides. So we will not have a clear guarantee of full speed on all control flow (including irreducible), which is a little sad, but it is still the best option we have.
We presently have what seems to be a reasonable consensus around the current approach, so I'd like to close this issue. Of course, if people wish, this issue can always be reopened or the ideas in it reraised.
The goals of this proposal are
This proposal introduces two new AST constructs, best explained with an example:
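A sketch of what such an example could look like (illustrative syntax; `check()`, `other()`, `l_1()`, `l_2()`, and `work()` are placeholder helpers):

```
forever {
  if (check()) {
    tilt -> L1;
  } else if (other()) {
    tilt -> L2;
  } else {
    work();
    continue;           // go back to the top of the loop
  }
  multiple {
    L1: { l_1(); }      // reached when check() was true
    L2: { l_2(); }      // reached when other() was true
  }
}
```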
What happens here is: if `check()`, then we execute `l_1()`, and if `other()` then we execute `l_2()`; otherwise we do some `work()` and go to the top of the loop.

Conceptually, Tilt doesn't actually affect control flow - it just "goes with the flow" of where things are moving anyhow, but it can "tilt" things a little one way or another (like tilting a pinball machine) at specific locations. Specifically, if control flow reaches a multiple, then the tilt has an effect in picking out where we end up inside the multiple. But otherwise, Tilt doesn't cause control flow by itself.

In more detail, the semantics are fully defined as follows:
- Define `label` as a "hidden" local variable on the stack, only visible to the next 2 bullet points.
- `tilt -> X` sets `label` to a numerical id that represents the label X.
- `multiple { X1: { .. } X2: { .. } .. }` checks the value of `label`. If it is equal to the numerical id of one of the labels `X1, X2, ...` then we execute the code in that label's block, set `label` to a null value (that is not the id of any label), and exit the multiple. Otherwise (if it is not equal to any of the ids), we just skip the multiple.

Note that Tilt does not move control flow to anywhere it couldn't actually get anyhow. Control flow still moves around using the same structured loops and ifs and breaks that we already have. Tilt only does a minor adjustment to control flow when we reach a multiple. The multiple can be seen as a "switch on labels", and the tilt picks which one we enter. But again, we could have reached any of those labels in the multiple anyhow through structured control flow (and picked which label in the multiple using a switch on the numerical id of the label).
The semantics described above also provide the "super-simple" implementation mentioned in the goals. It is trivial for a VM to implement that - just define the helper variable, set it and test it - and it would be correct; and control flow is still structured. But, it is also possible in a straightforward way to just emit a branch from the `tilt` to the right place. In the example above, that is perhaps too easy, so consider this more interesting example:
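One shape such an example could take (an illustrative sketch; the `block` construct and the `l_2()`/`l_3()` helpers are placeholders):

```
forever {
  block {
    if (check()) { tilt -> L2; break; }   // the break jumps out of the block,
    if (other()) { tilt -> L3; break; }   // landing right at the multiple below
    work();
    continue;                             // back to the top of the forever loop
  }
  multiple {
    L2: { l_2(); }
    L3: { l_3(); }
  }
}
```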
It is straightforward for the VM to see that `tilt -> L2` will reach `L2`, and `tilt -> L3` will reach `L3` - note how we need a break after the tilt to achieve that - so it can just emit branches directly. The helper variable overhead can be eliminated entirely.

This idea is modeled on the Relooper algorithm from 2011. There is a proof there that any control flow can be represented in a structured way, using just the available control flow constructs in JavaScript, and using a helper variable like `label` mentioned in the Tilt semantics, without any code duplication (other approaches split nodes, and have bad worst-case code size situations). The relooper has also been implemented in Emscripten, and over the last 4 years we have gotten a lot of practical experience with it, showing that it gives good results in practice, typically with little usage of the helper variable.

Thus, we have both a solid theoretical result that shows we can represent any control flow - even irreducible - in this way, and experience showing that that helper variable overhead tends to be minimal. In the proposal here, that helper variable is, in essence, written in an explicit manner, which allows even that overhead to be optimized out nicely by the VM.
A note on irreducible control flow: As mentioned, Tilt can represent it, e.g.,
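a loop between two blocks A and B that can each be entered from outside (an illustrative sketch; `cond()`, `a()`, `b()`, `x()`, and `y()` are placeholders):

```
if (cond()) { tilt -> A; } else { tilt -> B; }
forever {
  multiple {
    A: { a(); if (x()) { tilt -> B; continue; } }
    B: { b(); if (y()) { tilt -> A; continue; } }
  }
  break;   // reached when A or B finishes without tilting to the other
}
```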
This is not surprising, as the relooper proof guarantees that it can. And we can also make that irreducible control flow run at full speed (if the VM optimizes Tilt, instead of doing just the super-simple semantics as its implementation). Still, this is not quite as good as if we directly represented that irreducible control flow as basic blocks plus branches, since with Tilt we do need to analyze a little to find the underlying control flow graph. So something like proper tail calls, which can directly represent that graph, may still be useful, but it is debatable - at least a large set of cases of proper tail calls seems to be handled by Tilt (the cases of having heavily irreducible control flow), and without the limitations of proper tail calls (such as on the number of parameters). However, there is obviously a lot to debate on that.
In any case, putting aside irreducible control flow and proper tail calls, what Tilt is clearly great for is eliminating the small amounts of helper variable usage that occur often in practice, stemming from either small amounts of irreducible control flow (a goto in the source, or a compiler optimization pass that complexified control flow for some reason) or reducible but complex enough control flow that the compiler didn't manage to remove all helper variable usages (it is an open question whether 100% can be removed in tractable time). Having Tilt in wasm would open the possibility of straightforward and predictable removal of that overhead.
To summarize, this proposal can be seen as adding an "escape valve" to structured control flow: code still remains structured, but we can express more complex patterns in a way that is at least easily optimizable without magical VM tricks.
That concludes the core of this proposal. I see two possible large open questions here:
None of the tilts do anything; the multiple is ignored; just reading that makes me nauseous. But it is valid code to write, even if silly. It might be nice to get a parse-time error on such things, and it isn't hard. But to do so requires that we define what is errored on very precisely, which is a lot more detailed than the very simple (but fully defined!) semantics described above.
edit: that detailed description appears in WebAssembly/spec#33 (comment)
And that brings me to a final point here. We could in principle delay discussion of this to a later date. However, if we do want to add this later, and do want to error on such nonsense as the last example, then how easy and clean it is to define the semantics will depend on decisions that we do make now. In particular, the fewer control flow constructs we have, the easier and cleaner it will be; every construct may need to be involved in the description. Also, we might want to start out with simple constructs that make that easier (I have some ideas for those, but nothing too concrete yet).
In other words, what I am getting at is that if we design our control flow constructs as a whole, together with Tilt, then things will be much nicer than if we try to tack Tilt on later to something designed before it. For that reason, we might want to discuss it now.
Bikeshedding: Suggestions for better names are welcome. Another name I considered was `slide -> X`, as in "I totally meant to slide there".