Skip to content

Conversation

@cfallin
Copy link
Member

@cfallin cfallin commented Apr 1, 2025

This PR addresses the issues described in #10488 in a more head-on way: it removes the use of separate "return-value instructions" that load return values from the stack, instead folding these loads into the semantics of the call VCode instruction.

This is a prerequisite for exception-handling: we need calls to be workable as terminators, meaning that we cannot require any other (VCode) instructions after the call to define the return values.

In principle, this PR starts simply enough: the return-locations list on the CallInfo that each backend uses to provide regalloc metadata is updated to support a notion of "register or stack address" as the source of each return value, and this list is now used for both kinds of returns, not just returns in registers. Shared code is defined in machinst::abi used by all backends to perform the requisite loads.

In order to make this work with more defined values than fit in registers, however, this PR also had to add support for "any"-constrained registers to Cranelift, and handling allocations that may be spillslots. This has always been supported by RA2, but this is the first time that Cranelift uses them directly (previously they were used only internally in RA2 as lowerings from other kinds of constraints like safepoints). This requires encoding a spillslot index in our Reg type.

There is a little bit of complexity around handling the loads/defs as well: if we have a return value on-stack, and we need to put it in a spillslot, we cannot do a memory-to-memory move directly, so we need a temporary register. Earlier versions of this PR allocated another temp as a vreg on the call, but this doesn't work with all calling conventions (too many clobbers). For simplicity I picked a particular register that is (i) clobbered by calls and (ii) not used for return values for each architecture (x86-64's tailcall needed to lose one return-in-register slot to make this work).

This removes retval insts from the shared ABI infra completely. s390x is different, still, because it handles callsite lowering from ISLE; we will need to address that separately for exception support there.

This PR addresses the issues described in bytecodealliance#10488 in a more head-on
way: it removes the use of separate "return-value instructions" that
load return values from the stack, instead folding these loads into
the semantics of the call VCode instruction.

This is a prerequisite for exception-handling: we need calls to be
workable as terminators, meaning that we cannot require any
other (VCode) instructions after the call to define the return values.

In principle, this PR starts simply enough: the return-locations list
on the `CallInfo` that each backend uses to provide regalloc metadata
is updated to support a notion of "register or stack address" as the
source of each return value, and this list is now used for both kinds
of returns, not just returns in registers. Shared code is defined in
`machinst::abi` used by all backends to perform the requisite loads.

In order to make this work with more defined values than fit in
registers, however, this PR also had to add support for
"any"-constrained registers to Cranelift, and handling allocations
that may be spillslots. This has always been supported by RA2, but
this is the first time that Cranelift uses them directly (previously
they were used only internally in RA2 as lowerings from other kinds of
constraints like safepoints). This requires encoding a spillslot index
in our `Reg` type.

There is a little bit of complexity around handling the loads/defs as
well: if we have a return value on-stack, and we need to put it in a
spillslot, we cannot do a memory-to-memory move directly, so we need a
temporary register. Earlier versions of this PR allocated another temp
as a vreg on the call, but this doesn't work with all calling
conventions (too many clobbers). For simplicity I picked a particular
register that is (i) clobbered by calls and (ii) not used for return
values for each architecture (x86-64's tailcall needed to lose one
return-in-register slot to make this work).

This removes retval insts from the shared ABI infra completely. s390x
is different, still, because it handles callsite lowering from ISLE;
we will need to address that separately for exception support there.
@cfallin cfallin requested review from a team as code owners April 1, 2025 18:01
@cfallin cfallin requested review from abrown and alexcrichton and removed request for a team April 1, 2025 18:01
@cfallin
Copy link
Member Author

cfallin commented Apr 1, 2025

I'll note that this appears to pessimize codegen somewhat for many-return-values (more than fit in registers, so more than 8? on aarch64/x86-64), since this does a manual load-from-stack/store-to-spillslot rather than letting regalloc handle it; that seems acceptable given the complexity reduction and unblocking exception support, IMHO.

@alexcrichton alexcrichton requested review from fitzgen and removed request for alexcrichton April 1, 2025 18:17
Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks reasonable to me, and I have a passing thought, but I'm going to defer to @fitzgen for review rather than myself as I think he's been chatting with you more about this

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen isle Related to the ISLE domain-specific language labels Apr 1, 2025
@github-actions
Copy link

github-actions bot commented Apr 1, 2025

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@cfallin
Copy link
Member Author

cfallin commented Apr 1, 2025

Just pushed a fix for aarch64 -- it turns out our optimization to avoid unnecessary clobber-saves for the lower half of float registers (is_included_in_clobbers) can no longer apply if we have other defs (the stack-carried values) that allocate into other registers; those other registers are indeed clobbered by our function body! The pessimization there is, again, only on very-many-returned-values cases; the happy path of returns-that-fit-in-regs is unaffected.

@alexcrichton
Copy link
Member

this helper is called after we've thrown away all the ABI info

Could the temp parameter to CallInfo::emit_retval_loads turn into an associated const and/or function on ABIMachineSpec? That way it wouldn't have to be passed as a parameter and it'd also be available for assertions if necessary elsewhere? No need really to bend over backwards for this, but I do think it'd be good to have a debug assertion that it's in the right set of caller or callee

@cfallin cfallin force-pushed the new-retvals-same-great-flavor-now-with-less-instructions branch from a0b140c to ee5cced Compare April 2, 2025 16:45
@cfallin
Copy link
Member Author

cfallin commented Apr 2, 2025

Ah, yes, that's a much better idea, thanks! Done.

@uweigand
Copy link
Member

uweigand commented Apr 4, 2025

Hi @cfallin I've had a quick look at this, and it should be relatively straightforward to implement this for s390x as well; I'd be happy to do that (either after this has landed, or else I could provide a patch ahead of time?).

However, I do have one comment about the new has_non_abi_defs mechanism. Not only is this further complicating the is_included_in_clobbers mechanism, it also introduces performance regressions.

The original reason for is_included_in_clobbers was that for partially callee-saved registers (e.g. on s390x, only the top half of some vector registers, which overlap the old floating-point registers, is callee-saved), in a sequence of A calling B calling C, we have to conservatively assume that the call B->C clobbers them, while the call A->B must preserve them, and therefore B will have to save/restore them all. Of course, as long as the A->B calling convention is the same as the B->C calling convention, we can avoid this, which is implemented via is_included_in_clobbers.

Now, assuming the call B->C needs another def as it has to load a return value into a callee-saved register. We have to consider this, which is what the new has_non_abi_defs mechanism achieves. However, that will then also have the side effect of once again forcing B to save/restore all floating-point registers - there's really no good reason for this.

Quite a while ago, I made the suggestion that is_included_in_clobbers should really only apply to the clobber list, not the def list. If we were to make that change, then the extra defs for this case would just happen to work transparently without any new has_non_abi_defs mechanism, and it would not introduce the above regression.

@cfallin
Copy link
Member Author

cfallin commented Apr 4, 2025

@uweigand thanks, that is a great suggestion -- much better than what I did, and I will update as suggested!

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@fitzgen fitzgen added this pull request to the merge queue Apr 5, 2025
@cfallin
Copy link
Member Author

cfallin commented Apr 6, 2025

Fuzzbug fix in RA2 (bytecodealliance/regalloc2#214) -- this PR creates more challenging constraints and the bundle-splitting logic had to get a little bit more refined.

uweigand added a commit to uweigand/wasmtime that referenced this pull request Apr 6, 2025
This is a follow-on to bytecodealliance#10502
implementing the same logic for s390x.

Like other platforms, we now no longer emit any instructions handling
return values after the call instruction; instead, everything is done
within a pseudo Call instruction.  Unlike other platforms, this also
has to handle vector lane swapping when calling between ABIs that mix
BE and LE lane orders.  The pseudo Call instruction needs enough type
information to make this choice, therefore I had to add a type field
to the RetLocation::Reg case (the ::Stack case already had one).

All other changes are contained within the s390x back-end.  To simplify
the implementation, the patch also implements a number of clean-ups:
- Introduce a MemArg::SpillOffset variant
- Remove the (unnecessary on this platform) island code size check
- Merge the CallInd instructions into the base Call instructions,
  using a new CallInstDest type to carry the call destination
uweigand added a commit to uweigand/wasmtime that referenced this pull request Apr 6, 2025
This is a follow-on to bytecodealliance#10502
implementing the same logic for s390x.

Like other platforms, we now no longer emit any instructions handling
return values after the call instruction; instead, everything is done
within a pseudo Call instruction.  Unlike other platforms, this also
has to handle vector lane swapping when calling between ABIs that mix
BE and LE lane orders.  The pseudo Call instruction needs enough type
information to make this choice, therefore I had to add a type field
to the RetLocation::Reg case (the ::Stack case already had one).

All other changes are contained within the s390x back-end.  To simplify
the implementation, the patch also implements a number of clean-ups:
- Introduce a MemArg::SpillOffset variant
- Remove the (unnecessary on this platform) island code size check
- Merge the CallInd instructions into the base Call instructions,
  using a new CallInstDest type to carry the call destination
cfallin added a commit to bytecodealliance/regalloc2 that referenced this pull request Apr 7, 2025
When exerting additional pressure on regalloc with
bytecodealliance/wasmtime#10502, which can lead to call instructions
that have significantly more (`any`-constrained) defs, we hit panics in
RA2 where (i) a bundle merged several LiveRanges, (ii) one of these
LiveRanges had a fixed-reg constraint on an early use, (iii) this
fixed-reg constraint conflicted with a clobber (which always happens at
the late-point), (iv) the bundle merged in another LiveRange of some
arbitrary def at the late point.

This would make a bundle (which is the atomic unit of allocation) that
covers the whole inst, including the late point; and is required to be
in the fixed reg; this is unallocatable because the clobber is also at
the late point in that reg.

Our allocate-or-split-and-retry logic does not split if a bundle is
"minimal". This is meant to give a base case to the retries: when
bundles break down into their minimal pieces, any solvable set of
constraints should result in allocations.

Unfortunately the "is minimal" predicate definition did not account for
multiple LiveRanges, but rather only tested whether the total
program-point range of the bundle was over one instruction. If there are
multiple LiveRanges, we can still split them off, and the resulting
split bundles may cover only half the instruction, avoiding the
clobbers.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Apr 7, 2025
This pulls in a fix for a fuzzbug found after bytecodealliance#10502 started generating
more challenging constraints for regalloc. The fix in
bytecodealliance/regalloc2#214 updates bundle-splitting logic to
properly handle bundles with multiple live-ranges all covering one
instruction.
github-merge-queue bot pushed a commit that referenced this pull request Apr 7, 2025
* Cranelift: update to regalloc2 0.11.3.

This pulls in a fix for a fuzzbug found after #10502 started generating
more challenging constraints for regalloc. The fix in
bytecodealliance/regalloc2#214 updates bundle-splitting logic to
properly handle bundles with multiple live-ranges all covering one
instruction.

* Update test expectations after regalloc perturbation.
github-merge-queue bot pushed a commit that referenced this pull request Apr 7, 2025
)

This is a follow-on to #10502
implementing the same logic for s390x.

Like other platforms, we now no longer emit any instructions handling
return values after the call instruction; instead, everything is done
within a pseudo Call instruction.  Unlike other platforms, this also
has to handle vector lane swapping when calling between ABIs that mix
BE and LE lane orders.  The pseudo Call instruction needs enough type
information to make this choice, therefore I had to add a type field
to the RetLocation::Reg case (the ::Stack case already had one).

All other changes are contained within the s390x back-end.  To simplify
the implementation, the patch also implements a number of clean-ups:
- Introduce a MemArg::SpillOffset variant
- Remove the (unnecessary on this platform) island code size check
- Merge the CallInd instructions into the base Call instructions,
  using a new CallInstDest type to carry the call destination
cfallin added a commit to cfallin/wasmtime that referenced this pull request Apr 8, 2025
In bytecodealliance#10502, we introduced changes that could make callsites be
arbitrarily long, because they now include loads of
return-values-on-stack. We made use of the existing island mechanism
(now presented as a new pseudoinst as in aarch64, rather than as ad-hoc
emission code) to ensure that we meet label-reference-distance
deadlines. Unfortunately we didn't update the debug-assert that checks
instructions for worst-case size to exclude calls (and the new
`EmitIsland` pseudoinst), since they handle islanding separately.

Found via fuzzbug at [1].

[1]: https://oss-fuzz.com/testcase-detail/4819793142415360
github-merge-queue bot pushed a commit that referenced this pull request Apr 9, 2025
In #10502, we introduced changes that could make callsites be
arbitrarily long, because they now include loads of
return-values-on-stack. We made use of the existing island mechanism
(now presented as a new pseudoinst as in aarch64, rather than as ad-hoc
emission code) to ensure that we meet label-reference-distance
deadlines. Unfortunately we didn't update the debug-assert that checks
instructions for worst-case size to exclude calls (and the new
`EmitIsland` pseudoinst), since they handle islanding separately.

Found via fuzzbug at [1].

[1]: https://oss-fuzz.com/testcase-detail/4819793142415360
cfallin added a commit to cfallin/regalloc2 that referenced this pull request Apr 11, 2025
This PR rewords minimal-bundle logic to be consistent between property
computation and splitting, and overall simpler.

To recap: a "minimal bundle" is an allocation bundle that is as small as
possible. Because RA2 does backtracking, i.e., can kick a bundle out of
an allocation to make way for another, and because it can split, i.e.
create more work for itself, we need to ensure the algorithm "bottoms
out" somewhere to ensure termination. The way we do this is by defining
a "minimal bundle": this is a bundle that cannot be split further. A
minimal bundle is never evicted, and any allocatable input (set of
uses/defs with constraints), when split into minimal bundles, should
result in no conflicts. We thus want to define minimal bundles as
*minimally* as possible so that we have the maximal solving capacity.

For a long time, we defined minimal bundles as those spanning a single
instruction (before- and after- progpoints) -- because we cannot insert
moves in the middle of an instruction. In bytecodealliance#214, we updated
minimal-bundle splitting to avoid putting two different unrelated uses
in a single-instruction bundle, beacuse doing so and calling it
"minimal" (unsplittable) can artificially extend the liverange of a use
with a fixed-reg constraint at the before-point into the after-point,
causing an unsolveable conflict. This was triggered by new and tighter
constraints on callsites in Cranelift after
bytecodealliance/wasmtime#10502 (merging retval defs into calls) landed.

Unfortunately this also resulted in an infinite allocation loop, because
the definition of "minimal bundle" did not agree between the
split-into-minimal-bundles fallback/last-ditch code, and the bundle
property computation. The splitter was splitting as far as it was
willing to go, but our predicate didn't consider those bundles minimal,
so we continued to re-attempt splitting indefinitely.

While investigating this, I found that the minimal-bundle concept had
accumulated significant cruft ("the detritus of dead fuzzbugs") and this
tech-debt was making things more confusing than not -- so I started by
clearly defining what a minimal bundle *is*. Precisely:

- A single use, within a single LiveRange;
- With that LiveRange having a program-point span consistent with the
  use:
  - Early def: whole instruction (must live past Late point so it can
    reach its uses; moves not possible within inst);
  - Late def: Late point only;
  - Early use: Early point only;
  - Late use: whole instruction (must be live starting at Early so the
    value can reach this use; moves not possible within inst).

This is made easier and simpler than what we have before largely
because the minimal-bundle splitter aggressively puts spans of LiveRange
without uses into the spill bundle, and because we support overlapping
LiveRanges for a vreg now (thanks Trevor!), so we can rely on having
*some* connectivity between the def and its uses even if we aggressively
trim LiveRanges in the minimal bundles down to just their defs/uses.

Fixes bytecodealliance#218, bytecodealliance#219.
cfallin added a commit to cfallin/regalloc2 that referenced this pull request Apr 11, 2025
This PR rewords minimal-bundle logic to be consistent between property
computation and splitting, and overall simpler.

To recap: a "minimal bundle" is an allocation bundle that is as small as
possible. Because RA2 does backtracking, i.e., can kick a bundle out of
an allocation to make way for another, and because it can split, i.e.
create more work for itself, we need to ensure the algorithm "bottoms
out" somewhere to ensure termination. The way we do this is by defining
a "minimal bundle": this is a bundle that cannot be split further. A
minimal bundle is never evicted, and any allocatable input (set of
uses/defs with constraints), when split into minimal bundles, should
result in no conflicts. We thus want to define minimal bundles as
*minimally* as possible so that we have the maximal solving capacity.

For a long time, we defined minimal bundles as those spanning a single
instruction (before- and after- progpoints) -- because we cannot insert
moves in the middle of an instruction. In bytecodealliance#214, we updated
minimal-bundle splitting to avoid putting two different unrelated uses
in a single-instruction bundle, beacuse doing so and calling it
"minimal" (unsplittable) can artificially extend the liverange of a use
with a fixed-reg constraint at the before-point into the after-point,
causing an unsolveable conflict. This was triggered by new and tighter
constraints on callsites in Cranelift after
bytecodealliance/wasmtime#10502 (merging retval defs into calls) landed.

Unfortunately this also resulted in an infinite allocation loop, because
the definition of "minimal bundle" did not agree between the
split-into-minimal-bundles fallback/last-ditch code, and the bundle
property computation. The splitter was splitting as far as it was
willing to go, but our predicate didn't consider those bundles minimal,
so we continued to re-attempt splitting indefinitely.

While investigating this, I found that the minimal-bundle concept had
accumulated significant cruft ("the detritus of dead fuzzbugs") and this
tech-debt was making things more confusing than not -- so I started by
clearly defining what a minimal bundle *is*. Precisely:

- A single use, within a single LiveRange;
- With that LiveRange having a program-point span consistent with the
  use:
  - Early def: whole instruction (must live past Late point so it can
    reach its uses; moves not possible within inst);
  - Late def: Late point only;
  - Early use: Early point only;
  - Late use: whole instruction (must be live starting at Early so the
    value can reach this use; moves not possible within inst).

This is made easier and simpler than what we have before largely
because the minimal-bundle splitter aggressively puts spans of LiveRange
without uses into the spill bundle, and because we support overlapping
LiveRanges for a vreg now (thanks Trevor!), so we can rely on having
*some* connectivity between the def and its uses even if we aggressively
trim LiveRanges in the minimal bundles down to just their defs/uses.

The split-at-program-point splitter (i.e., not the fallback
split-into-minimal-bundles splitter) also got a small fix related to
this: it has a mode that was intended to "split off one use" if we enter
with a split-point at the start of the bundle, but this was really
splitting off all uses at the program point (if there are multiple of
the same vreg at the same program point). In the case that we still need
to split these apart, this just falls back to the minimal-bundle
splitter now.

Fixes bytecodealliance#218, bytecodealliance#219.
cfallin added a commit to cfallin/regalloc2 that referenced this pull request Apr 11, 2025
This PR reworks minimal-bundle logic to be consistent between property
computation and splitting, and overall simpler.

To recap: a "minimal bundle" is an allocation bundle that is as small as
possible. Because RA2 does backtracking, i.e., can kick a bundle out of
an allocation to make way for another, and because it can split, i.e.
create more work for itself, we need to ensure the algorithm "bottoms
out" somewhere to ensure termination. The way we do this is by defining
a "minimal bundle": this is a bundle that cannot be split further. A
minimal bundle is never evicted, and any allocatable input (set of
uses/defs with constraints), when split into minimal bundles, should
result in no conflicts. We thus want to define minimal bundles as
*minimally* as possible so that we have the maximal solving capacity.

For a long time, we defined minimal bundles as those spanning a single
instruction (before- and after- progpoints) -- because we cannot insert
moves in the middle of an instruction. In bytecodealliance#214, we updated
minimal-bundle splitting to avoid putting two different unrelated uses
in a single-instruction bundle, beacuse doing so and calling it
"minimal" (unsplittable) can artificially extend the liverange of a use
with a fixed-reg constraint at the before-point into the after-point,
causing an unsolveable conflict. This was triggered by new and tighter
constraints on callsites in Cranelift after
bytecodealliance/wasmtime#10502 (merging retval defs into calls) landed.

Unfortunately this also resulted in an infinite allocation loop, because
the definition of "minimal bundle" did not agree between the
split-into-minimal-bundles fallback/last-ditch code, and the bundle
property computation. The splitter was splitting as far as it was
willing to go, but our predicate didn't consider those bundles minimal,
so we continued to re-attempt splitting indefinitely.

While investigating this, I found that the minimal-bundle concept had
accumulated significant cruft ("the detritus of dead fuzzbugs") and this
tech-debt was making things more confusing than not -- so I started by
clearly defining what a minimal bundle *is*. Precisely:

- A single use, within a single LiveRange;
- With that LiveRange having a program-point span consistent with the
  use:
  - Early def: whole instruction (must live past Late point so it can
    reach its uses; moves not possible within inst);
  - Late def: Late point only;
  - Early use: Early point only;
  - Late use: whole instruction (must be live starting at Early so the
    value can reach this use; moves not possible within inst).

This is made easier and simpler than what we have before largely
because the minimal-bundle splitter aggressively puts spans of LiveRange
without uses into the spill bundle, and because we support overlapping
LiveRanges for a vreg now (thanks Trevor!), so we can rely on having
*some* connectivity between the def and its uses even if we aggressively
trim LiveRanges in the minimal bundles down to just their defs/uses.

The split-at-program-point splitter (i.e., not the fallback
split-into-minimal-bundles splitter) also got a small fix related to
this: it has a mode that was intended to "split off one use" if we enter
with a split-point at the start of the bundle, but this was really
splitting off all uses at the program point (if there are multiple of
the same vreg at the same program point). In the case that we still need
to split these apart, this just falls back to the minimal-bundle
splitter now.

Fixes bytecodealliance#218, bytecodealliance#219.
cfallin added a commit to bytecodealliance/regalloc2 that referenced this pull request Apr 11, 2025
This PR reworks minimal-bundle logic to be consistent between property
computation and splitting, and overall simpler.

To recap: a "minimal bundle" is an allocation bundle that is as small as
possible. Because RA2 does backtracking, i.e., can kick a bundle out of
an allocation to make way for another, and because it can split, i.e.
create more work for itself, we need to ensure the algorithm "bottoms
out" somewhere to ensure termination. The way we do this is by defining
a "minimal bundle": this is a bundle that cannot be split further. A
minimal bundle is never evicted, and any allocatable input (set of
uses/defs with constraints), when split into minimal bundles, should
result in no conflicts. We thus want to define minimal bundles as
*minimally* as possible so that we have the maximal solving capacity.

For a long time, we defined minimal bundles as those spanning a single
instruction (before- and after- progpoints) -- because we cannot insert
moves in the middle of an instruction. In #214, we updated
minimal-bundle splitting to avoid putting two different unrelated uses
in a single-instruction bundle, beacuse doing so and calling it
"minimal" (unsplittable) can artificially extend the liverange of a use
with a fixed-reg constraint at the before-point into the after-point,
causing an unsolveable conflict. This was triggered by new and tighter
constraints on callsites in Cranelift after
bytecodealliance/wasmtime#10502 (merging retval defs into calls) landed.

Unfortunately this also resulted in an infinite allocation loop, because
the definition of "minimal bundle" did not agree between the
split-into-minimal-bundles fallback/last-ditch code, and the bundle
property computation. The splitter was splitting as far as it was
willing to go, but our predicate didn't consider those bundles minimal,
so we continued to re-attempt splitting indefinitely.

While investigating this, I found that the minimal-bundle concept had
accumulated significant cruft ("the detritus of dead fuzzbugs") and this
tech-debt was making things more confusing than not -- so I started by
clearly defining what a minimal bundle *is*. Precisely:

- A single use, within a single LiveRange;
- With that LiveRange having a program-point span consistent with the
use:
- Early def: whole instruction (must live past Late point so it can
reach its uses; moves not possible within inst);
  - Late def: Late point only;
  - Early use: Early point only;
- Late use: whole instruction (must be live starting at Early so the
value can reach this use; moves not possible within inst).

This is made easier and simpler than what we have before largely because
the minimal-bundle splitter aggressively puts spans of LiveRange without
uses into the spill bundle, and because we support overlapping
LiveRanges for a vreg now (thanks Trevor!), so we can rely on having
*some* connectivity between the def and its uses even if we aggressively
trim LiveRanges in the minimal bundles down to just their defs/uses.

The split-at-program-point splitter (i.e., not the fallback
split-into-minimal-bundles splitter) also got a small fix related to
this: it has a mode that was intended to "split off one use" if we enter
with a split-point at the start of the bundle, but this was really
splitting off all uses at the program point (if there are multiple of
the same vreg at the same program point). In the case that we still need
to split these apart, this just falls back to the minimal-bundle
splitter now.

Fixes #218, #219.
cfallin added a commit to cfallin/wasmtime that referenced this pull request May 7, 2025
Prior versions of regalloc2 could not support more than 255 operands on
an instruction, and together with the integrated return-value loads on
call instructions introduced in bytecodealliance#10502, this caused issues with calls
with many returns. This PR upgrades to a version of RA2 that supports up
to `2^16 - 1` operands per instruction (well in excess of the maximum of
1000 return/result values per Wasm's implementation limits, for
example).

Fixes bytecodealliance#10741.
cfallin added a commit to cfallin/wasmtime that referenced this pull request May 7, 2025
Prior versions of regalloc2 could not support more than 255 operands on
an instruction, and together with the integrated return-value loads on
call instructions introduced in bytecodealliance#10502, this caused issues with calls
with many returns. This PR upgrades to a version of RA2 that supports up
to `2^16 - 1` operands per instruction (well in excess of the maximum of
1000 return/result values per Wasm's implementation limits, for
example).

Fixes bytecodealliance#10741.
github-merge-queue bot pushed a commit that referenced this pull request May 7, 2025
…10747)

* Cranelift: update to regalloc2 0.12.2; support many return values.

Prior versions of regalloc2 could not support more than 255 operands on
an instruction, and together with the integrated return-value loads on
call instructions introduced in #10502, this caused issues with calls
with many returns. This PR upgrades to a version of RA2 that supports up
to `2^16 - 1` operands per instruction (well in excess of the maximum of
1000 return/result values per Wasm's implementation limits, for
example).

Fixes #10741.

* Update vets

---------

Co-authored-by: Alex Crichton <[email protected]>
cfallin added a commit to cfallin/wasmtime that referenced this pull request May 7, 2025
…ytecodealliance#10747)

* Cranelift: update to regalloc2 0.12.2; support many return values.

Prior versions of regalloc2 could not support more than 255 operands on
an instruction, and together with the integrated return-value loads on
call instructions introduced in bytecodealliance#10502, this caused issues with calls
with many returns. This PR upgrades to a version of RA2 that supports up
to `2^16 - 1` operands per instruction (well in excess of the maximum of
1000 return/result values per Wasm's implementation limits, for
example).

Fixes bytecodealliance#10741.

* Update vets

---------

Co-authored-by: Alex Crichton <[email protected]>
alexcrichton added a commit that referenced this pull request May 7, 2025
…10747) (#10748)

* Cranelift: update to regalloc2 0.12.2; support many return values.

Prior versions of regalloc2 could not support more than 255 operands on
an instruction, and together with the integrated return-value loads on
call instructions introduced in #10502, this caused issues with calls
with many returns. This PR upgrades to a version of RA2 that supports up
to `2^16 - 1` operands per instruction (well in excess of the maximum of
1000 return/result values per Wasm's implementation limits, for
example).

Fixes #10741.

* Update vets

---------

Co-authored-by: Alex Crichton <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants