-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Cranelift: remove return-value instructions after calls at callsites. #10502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cranelift: remove return-value instructions after calls at callsites. #10502
Conversation
This PR addresses the issues described in bytecodealliance#10488 in a more head-on way: it removes the use of separate "return-value instructions" that load return values from the stack, instead folding these loads into the semantics of the call VCode instruction. This is a prerequisite for exception-handling: we need calls to be workable as terminators, meaning that we cannot require any other (VCode) instructions after the call to define the return values. In principle, this PR starts simply enough: the return-locations list on the `CallInfo` that each backend uses to provide regalloc metadata is updated to support a notion of "register or stack address" as the source of each return value, and this list is now used for both kinds of returns, not just returns in registers. Shared code is defined in `machinst::abi` used by all backends to perform the requisite loads. In order to make this work with more defined values than fit in registers, however, this PR also had to add support for "any"-constrained registers to Cranelift, and handling allocations that may be spillslots. This has always been supported by RA2, but this is the first time that Cranelift uses them directly (previously they were used only internally in RA2 as lowerings from other kinds of constraints like safepoints). This requires encoding a spillslot index in our `Reg` type. There is a little bit of complexity around handling the loads/defs as well: if we have a return value on-stack, and we need to put it in a spillslot, we cannot do a memory-to-memory move directly, so we need a temporary register. Earlier versions of this PR allocated another temp as a vreg on the call, but this doesn't work with all calling conventions (too many clobbers). For simplicity I picked a particular register that is (i) clobbered by calls and (ii) not used for return values for each architecture (x86-64's tailcall needed to lose one return-in-register slot to make this work). This removes retval insts from the shared ABI infra completely. s390x is different, still, because it handles callsite lowering from ISLE; we will need to address that separately for exception support there.
|
I'll note that this appears to pessimize codegen somewhat for many-return-values (more than fit in registers, so more than 8? on aarch64/x86-64), since this does a manual load-from-stack/store-to-spillslot rather than letting regalloc handle it; that seems acceptable given the complexity reduction and unblocking exception support, IMHO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks reasonable to me, and I have a passing thought, but I'm going to defer to @fitzgen for review rather than myself as I think he's been chatting with you more about this
Subscribe to Label Action
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"
Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
|
Just pushed a fix for aarch64 -- it turns out our optimization to avoid unnecessary clobber-saves for the lower half of float registers ( |
Could the |
a0b140c to
ee5cced
Compare
|
Ah, yes, that's a much better idea, thanks! Done. |
|
Hi @cfallin I've had a quick look at this, and it should be relatively straightforward to implement this for s390x as well; I'd be happy to do that (either after this has landed, or else I could provide a patch ahead of time?). However, I do have one comment about the new The original reason for Now, assuming the call B->C needs another def as it has to load a return value into a callee-saved register. We have to consider this, which is what the new Quite a while ago, I made the suggestion that |
|
@uweigand thanks, that is a great suggestion -- much better than what I did, and I will update as suggested! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
Fuzzbug fix in RA2 (bytecodealliance/regalloc2#214) -- this PR creates more challenging constraints and the bundle-splitting logic had to get a little bit more refined. |
This is a follow-on to bytecodealliance#10502 implementing the same logic for s390x. Like other platforms, we now no longer emit any instructions handling return values after the call instruction; instead, everything is done within a pseudo Call instruction. Unlike other platforms, this also has to handle vector lane swapping when calling between ABIs that mix BE and LE lane orders. The pseudo Call instruction needs enough type information to make this choice, therefore I had to add a type field to the RetLocation::Reg case (the ::Stack case already had one). All other changes are contained within the s390x back-end. To simplify the implementation, the patch also implements a number of clean-ups: - Introduce a MemArg::SpillOffset variant - Remove the (unnecessary on this platform) island code size check - Merge the CallInd instructions into the base Call instructions, using a new CallInstDest type to carry the call destination
This is a follow-on to bytecodealliance#10502 implementing the same logic for s390x. Like other platforms, we now no longer emit any instructions handling return values after the call instruction; instead, everything is done within a pseudo Call instruction. Unlike other platforms, this also has to handle vector lane swapping when calling between ABIs that mix BE and LE lane orders. The pseudo Call instruction needs enough type information to make this choice, therefore I had to add a type field to the RetLocation::Reg case (the ::Stack case already had one). All other changes are contained within the s390x back-end. To simplify the implementation, the patch also implements a number of clean-ups: - Introduce a MemArg::SpillOffset variant - Remove the (unnecessary on this platform) island code size check - Merge the CallInd instructions into the base Call instructions, using a new CallInstDest type to carry the call destination
When exerting additional pressure on regalloc with bytecodealliance/wasmtime#10502, which can lead to call instructions that have significantly more (`any`-constrained) defs, we hit panics in RA2 where (i) a bundle merged several LiveRanges, (ii) one of these LiveRanges had a fixed-reg constraint on an early use, (iii) this fixed-reg constraint conflicted with a clobber (which always happens at the late-point), (iv) the bundle merged in another LiveRange of some arbitrary def at the late point. This would make a bundle (which is the atomic unit of allocation) that covers the whole inst, including the late point; and is required to be in the fixed reg; this is unallocatable because the clobber is also at the late point in that reg. Our allocate-or-split-and-retry logic does not split if a bundle is "minimal". This is meant to give a base case to the retries: when bundles break down into their minimal pieces, any solvable set of constraints should result in allocations. Unfortunately the "is minimal" predicate definition did not account for multiple LiveRanges, but rather only tested whether the total program-point range of the bundle was over one instruction. If there are multiple LiveRanges, we can still split them off, and the resulting split bundles may cover only half the instruction, avoiding the clobbers.
This pulls in a fix for a fuzzbug found after bytecodealliance#10502 started generating more challenging constraints for regalloc. The fix in bytecodealliance/regalloc2#214 updates bundle-splitting logic to properly handle bundles with multiple live-ranges all covering one instruction.
* Cranelift: update to regalloc2 0.11.3. This pulls in a fix for a fuzzbug found after #10502 started generating more challenging constraints for regalloc. The fix in bytecodealliance/regalloc2#214 updates bundle-splitting logic to properly handle bundles with multiple live-ranges all covering one instruction. * Update test expectations after regalloc perturbation.
) This is a follow-on to #10502 implementing the same logic for s390x. Like other platforms, we now no longer emit any instructions handling return values after the call instruction; instead, everything is done within a pseudo Call instruction. Unlike other platforms, this also has to handle vector lane swapping when calling between ABIs that mix BE and LE lane orders. The pseudo Call instruction needs enough type information to make this choice, therefore I had to add a type field to the RetLocation::Reg case (the ::Stack case already had one). All other changes are contained within the s390x back-end. To simplify the implementation, the patch also implements a number of clean-ups: - Introduce a MemArg::SpillOffset variant - Remove the (unnecessary on this platform) island code size check - Merge the CallInd instructions into the base Call instructions, using a new CallInstDest type to carry the call destination
In bytecodealliance#10502, we introduced changes that could make callsites be arbitrarily long, because they now include loads of return-values-on-stack. We made use of the existing island mechanism (now presented as a new pseudoinst as in aarch64, rather than as ad-hoc emission code) to ensure that we meet label-reference-distance deadlines. Unfortunately we didn't update the debug-assert that checks instructions for worst-case size to exclude calls (and the new `EmitIsland` pseudoinst), since they handle islanding separately. Found via fuzzbug at [1]. [1]: https://oss-fuzz.com/testcase-detail/4819793142415360
In #10502, we introduced changes that could make callsites be arbitrarily long, because they now include loads of return-values-on-stack. We made use of the existing island mechanism (now presented as a new pseudoinst as in aarch64, rather than as ad-hoc emission code) to ensure that we meet label-reference-distance deadlines. Unfortunately we didn't update the debug-assert that checks instructions for worst-case size to exclude calls (and the new `EmitIsland` pseudoinst), since they handle islanding separately. Found via fuzzbug at [1]. [1]: https://oss-fuzz.com/testcase-detail/4819793142415360
This PR rewords minimal-bundle logic to be consistent between property computation and splitting, and overall simpler. To recap: a "minimal bundle" is an allocation bundle that is as small as possible. Because RA2 does backtracking, i.e., can kick a bundle out of an allocation to make way for another, and because it can split, i.e. create more work for itself, we need to ensure the algorithm "bottoms out" somewhere to ensure termination. The way we do this is by defining a "minimal bundle": this is a bundle that cannot be split further. A minimal bundle is never evicted, and any allocatable input (set of uses/defs with constraints), when split into minimal bundles, should result in no conflicts. We thus want to define minimal bundles as *minimally* as possible so that we have the maximal solving capacity. For a long time, we defined minimal bundles as those spanning a single instruction (before- and after- progpoints) -- because we cannot insert moves in the middle of an instruction. In bytecodealliance#214, we updated minimal-bundle splitting to avoid putting two different unrelated uses in a single-instruction bundle, beacuse doing so and calling it "minimal" (unsplittable) can artificially extend the liverange of a use with a fixed-reg constraint at the before-point into the after-point, causing an unsolveable conflict. This was triggered by new and tighter constraints on callsites in Cranelift after bytecodealliance/wasmtime#10502 (merging retval defs into calls) landed. Unfortunately this also resulted in an infinite allocation loop, because the definition of "minimal bundle" did not agree between the split-into-minimal-bundles fallback/last-ditch code, and the bundle property computation. The splitter was splitting as far as it was willing to go, but our predicate didn't consider those bundles minimal, so we continued to re-attempt splitting indefinitely. While investigating this, I found that the minimal-bundle concept had accumulated significant cruft ("the detritus of dead fuzzbugs") and this tech-debt was making things more confusing than not -- so I started by clearly defining what a minimal bundle *is*. Precisely: - A single use, within a single LiveRange; - With that LiveRange having a program-point span consistent with the use: - Early def: whole instruction (must live past Late point so it can reach its uses; moves not possible within inst); - Late def: Late point only; - Early use: Early point only; - Late use: whole instruction (must be live starting at Early so the value can reach this use; moves not possible within inst). This is made easier and simpler than what we have before largely because the minimal-bundle splitter aggressively puts spans of LiveRange without uses into the spill bundle, and because we support overlapping LiveRanges for a vreg now (thanks Trevor!), so we can rely on having *some* connectivity between the def and its uses even if we aggressively trim LiveRanges in the minimal bundles down to just their defs/uses. Fixes bytecodealliance#218, bytecodealliance#219.
This PR rewords minimal-bundle logic to be consistent between property computation and splitting, and overall simpler. To recap: a "minimal bundle" is an allocation bundle that is as small as possible. Because RA2 does backtracking, i.e., can kick a bundle out of an allocation to make way for another, and because it can split, i.e. create more work for itself, we need to ensure the algorithm "bottoms out" somewhere to ensure termination. The way we do this is by defining a "minimal bundle": this is a bundle that cannot be split further. A minimal bundle is never evicted, and any allocatable input (set of uses/defs with constraints), when split into minimal bundles, should result in no conflicts. We thus want to define minimal bundles as *minimally* as possible so that we have the maximal solving capacity. For a long time, we defined minimal bundles as those spanning a single instruction (before- and after- progpoints) -- because we cannot insert moves in the middle of an instruction. In bytecodealliance#214, we updated minimal-bundle splitting to avoid putting two different unrelated uses in a single-instruction bundle, beacuse doing so and calling it "minimal" (unsplittable) can artificially extend the liverange of a use with a fixed-reg constraint at the before-point into the after-point, causing an unsolveable conflict. This was triggered by new and tighter constraints on callsites in Cranelift after bytecodealliance/wasmtime#10502 (merging retval defs into calls) landed. Unfortunately this also resulted in an infinite allocation loop, because the definition of "minimal bundle" did not agree between the split-into-minimal-bundles fallback/last-ditch code, and the bundle property computation. The splitter was splitting as far as it was willing to go, but our predicate didn't consider those bundles minimal, so we continued to re-attempt splitting indefinitely. While investigating this, I found that the minimal-bundle concept had accumulated significant cruft ("the detritus of dead fuzzbugs") and this tech-debt was making things more confusing than not -- so I started by clearly defining what a minimal bundle *is*. Precisely: - A single use, within a single LiveRange; - With that LiveRange having a program-point span consistent with the use: - Early def: whole instruction (must live past Late point so it can reach its uses; moves not possible within inst); - Late def: Late point only; - Early use: Early point only; - Late use: whole instruction (must be live starting at Early so the value can reach this use; moves not possible within inst). This is made easier and simpler than what we have before largely because the minimal-bundle splitter aggressively puts spans of LiveRange without uses into the spill bundle, and because we support overlapping LiveRanges for a vreg now (thanks Trevor!), so we can rely on having *some* connectivity between the def and its uses even if we aggressively trim LiveRanges in the minimal bundles down to just their defs/uses. The split-at-program-point splitter (i.e., not the fallback split-into-minimal-bundles splitter) also got a small fix related to this: it has a mode that was intended to "split off one use" if we enter with a split-point at the start of the bundle, but this was really splitting off all uses at the program point (if there are multiple of the same vreg at the same program point). In the case that we still need to split these apart, this just falls back to the minimal-bundle splitter now. Fixes bytecodealliance#218, bytecodealliance#219.
This PR reworks minimal-bundle logic to be consistent between property computation and splitting, and overall simpler. To recap: a "minimal bundle" is an allocation bundle that is as small as possible. Because RA2 does backtracking, i.e., can kick a bundle out of an allocation to make way for another, and because it can split, i.e. create more work for itself, we need to ensure the algorithm "bottoms out" somewhere to ensure termination. The way we do this is by defining a "minimal bundle": this is a bundle that cannot be split further. A minimal bundle is never evicted, and any allocatable input (set of uses/defs with constraints), when split into minimal bundles, should result in no conflicts. We thus want to define minimal bundles as *minimally* as possible so that we have the maximal solving capacity. For a long time, we defined minimal bundles as those spanning a single instruction (before- and after- progpoints) -- because we cannot insert moves in the middle of an instruction. In bytecodealliance#214, we updated minimal-bundle splitting to avoid putting two different unrelated uses in a single-instruction bundle, beacuse doing so and calling it "minimal" (unsplittable) can artificially extend the liverange of a use with a fixed-reg constraint at the before-point into the after-point, causing an unsolveable conflict. This was triggered by new and tighter constraints on callsites in Cranelift after bytecodealliance/wasmtime#10502 (merging retval defs into calls) landed. Unfortunately this also resulted in an infinite allocation loop, because the definition of "minimal bundle" did not agree between the split-into-minimal-bundles fallback/last-ditch code, and the bundle property computation. The splitter was splitting as far as it was willing to go, but our predicate didn't consider those bundles minimal, so we continued to re-attempt splitting indefinitely. While investigating this, I found that the minimal-bundle concept had accumulated significant cruft ("the detritus of dead fuzzbugs") and this tech-debt was making things more confusing than not -- so I started by clearly defining what a minimal bundle *is*. Precisely: - A single use, within a single LiveRange; - With that LiveRange having a program-point span consistent with the use: - Early def: whole instruction (must live past Late point so it can reach its uses; moves not possible within inst); - Late def: Late point only; - Early use: Early point only; - Late use: whole instruction (must be live starting at Early so the value can reach this use; moves not possible within inst). This is made easier and simpler than what we have before largely because the minimal-bundle splitter aggressively puts spans of LiveRange without uses into the spill bundle, and because we support overlapping LiveRanges for a vreg now (thanks Trevor!), so we can rely on having *some* connectivity between the def and its uses even if we aggressively trim LiveRanges in the minimal bundles down to just their defs/uses. The split-at-program-point splitter (i.e., not the fallback split-into-minimal-bundles splitter) also got a small fix related to this: it has a mode that was intended to "split off one use" if we enter with a split-point at the start of the bundle, but this was really splitting off all uses at the program point (if there are multiple of the same vreg at the same program point). In the case that we still need to split these apart, this just falls back to the minimal-bundle splitter now. Fixes bytecodealliance#218, bytecodealliance#219.
This PR reworks minimal-bundle logic to be consistent between property computation and splitting, and overall simpler. To recap: a "minimal bundle" is an allocation bundle that is as small as possible. Because RA2 does backtracking, i.e., can kick a bundle out of an allocation to make way for another, and because it can split, i.e. create more work for itself, we need to ensure the algorithm "bottoms out" somewhere to ensure termination. The way we do this is by defining a "minimal bundle": this is a bundle that cannot be split further. A minimal bundle is never evicted, and any allocatable input (set of uses/defs with constraints), when split into minimal bundles, should result in no conflicts. We thus want to define minimal bundles as *minimally* as possible so that we have the maximal solving capacity. For a long time, we defined minimal bundles as those spanning a single instruction (before- and after- progpoints) -- because we cannot insert moves in the middle of an instruction. In #214, we updated minimal-bundle splitting to avoid putting two different unrelated uses in a single-instruction bundle, beacuse doing so and calling it "minimal" (unsplittable) can artificially extend the liverange of a use with a fixed-reg constraint at the before-point into the after-point, causing an unsolveable conflict. This was triggered by new and tighter constraints on callsites in Cranelift after bytecodealliance/wasmtime#10502 (merging retval defs into calls) landed. Unfortunately this also resulted in an infinite allocation loop, because the definition of "minimal bundle" did not agree between the split-into-minimal-bundles fallback/last-ditch code, and the bundle property computation. The splitter was splitting as far as it was willing to go, but our predicate didn't consider those bundles minimal, so we continued to re-attempt splitting indefinitely. While investigating this, I found that the minimal-bundle concept had accumulated significant cruft ("the detritus of dead fuzzbugs") and this tech-debt was making things more confusing than not -- so I started by clearly defining what a minimal bundle *is*. Precisely: - A single use, within a single LiveRange; - With that LiveRange having a program-point span consistent with the use: - Early def: whole instruction (must live past Late point so it can reach its uses; moves not possible within inst); - Late def: Late point only; - Early use: Early point only; - Late use: whole instruction (must be live starting at Early so the value can reach this use; moves not possible within inst). This is made easier and simpler than what we have before largely because the minimal-bundle splitter aggressively puts spans of LiveRange without uses into the spill bundle, and because we support overlapping LiveRanges for a vreg now (thanks Trevor!), so we can rely on having *some* connectivity between the def and its uses even if we aggressively trim LiveRanges in the minimal bundles down to just their defs/uses. The split-at-program-point splitter (i.e., not the fallback split-into-minimal-bundles splitter) also got a small fix related to this: it has a mode that was intended to "split off one use" if we enter with a split-point at the start of the bundle, but this was really splitting off all uses at the program point (if there are multiple of the same vreg at the same program point). In the case that we still need to split these apart, this just falls back to the minimal-bundle splitter now. Fixes #218, #219.
Prior versions of regalloc2 could not support more than 255 operands on an instruction, and together with the integrated return-value loads on call instructions introduced in bytecodealliance#10502, this caused issues with calls with many returns. This PR upgrades to a version of RA2 that supports up to `2^16 - 1` operands per instruction (well in excess of the maximum of 1000 return/result values per Wasm's implementation limits, for example). Fixes bytecodealliance#10741.
Prior versions of regalloc2 could not support more than 255 operands on an instruction, and together with the integrated return-value loads on call instructions introduced in bytecodealliance#10502, this caused issues with calls with many returns. This PR upgrades to a version of RA2 that supports up to `2^16 - 1` operands per instruction (well in excess of the maximum of 1000 return/result values per Wasm's implementation limits, for example). Fixes bytecodealliance#10741.
…10747) * Cranelift: update to regalloc2 0.12.2; support many return values. Prior versions of regalloc2 could not support more than 255 operands on an instruction, and together with the integrated return-value loads on call instructions introduced in #10502, this caused issues with calls with many returns. This PR upgrades to a version of RA2 that supports up to `2^16 - 1` operands per instruction (well in excess of the maximum of 1000 return/result values per Wasm's implementation limits, for example). Fixes #10741. * Update vets --------- Co-authored-by: Alex Crichton <[email protected]>
…ytecodealliance#10747) * Cranelift: update to regalloc2 0.12.2; support many return values. Prior versions of regalloc2 could not support more than 255 operands on an instruction, and together with the integrated return-value loads on call instructions introduced in bytecodealliance#10502, this caused issues with calls with many returns. This PR upgrades to a version of RA2 that supports up to `2^16 - 1` operands per instruction (well in excess of the maximum of 1000 return/result values per Wasm's implementation limits, for example). Fixes bytecodealliance#10741. * Update vets --------- Co-authored-by: Alex Crichton <[email protected]>
…10747) (#10748) * Cranelift: update to regalloc2 0.12.2; support many return values. Prior versions of regalloc2 could not support more than 255 operands on an instruction, and together with the integrated return-value loads on call instructions introduced in #10502, this caused issues with calls with many returns. This PR upgrades to a version of RA2 that supports up to `2^16 - 1` operands per instruction (well in excess of the maximum of 1000 return/result values per Wasm's implementation limits, for example). Fixes #10741. * Update vets --------- Co-authored-by: Alex Crichton <[email protected]>
This PR addresses the issues described in #10488 in a more head-on way: it removes the use of separate "return-value instructions" that load return values from the stack, instead folding these loads into the semantics of the call VCode instruction.
This is a prerequisite for exception-handling: we need calls to be workable as terminators, meaning that we cannot require any other (VCode) instructions after the call to define the return values.
In principle, this PR starts simply enough: the return-locations list on the
CallInfothat each backend uses to provide regalloc metadata is updated to support a notion of "register or stack address" as the source of each return value, and this list is now used for both kinds of returns, not just returns in registers. Shared code is defined inmachinst::abiused by all backends to perform the requisite loads.In order to make this work with more defined values than fit in registers, however, this PR also had to add support for "any"-constrained registers to Cranelift, and handling allocations that may be spillslots. This has always been supported by RA2, but this is the first time that Cranelift uses them directly (previously they were used only internally in RA2 as lowerings from other kinds of constraints like safepoints). This requires encoding a spillslot index in our
Regtype.There is a little bit of complexity around handling the loads/defs as well: if we have a return value on-stack, and we need to put it in a spillslot, we cannot do a memory-to-memory move directly, so we need a temporary register. Earlier versions of this PR allocated another temp as a vreg on the call, but this doesn't work with all calling conventions (too many clobbers). For simplicity I picked a particular register that is (i) clobbered by calls and (ii) not used for return values for each architecture (x86-64's tailcall needed to lose one return-in-register slot to make this work).
This removes retval insts from the shared ABI infra completely. s390x is different, still, because it handles callsite lowering from ISLE; we will need to address that separately for exception support there.