
Cranelift: Remove support for implicitly adding a return area pointer #9510

Open
bjorn3 opened this issue Oct 24, 2024 · 5 comments

Comments

@bjorn3
Contributor

bjorn3 commented Oct 24, 2024

Feature

When there are more return values than available return registers, the values have to be returned through a return area pointer. This pointer can be introduced either explicitly using an `ArgumentPurpose::StructReturn` parameter (as cg_clif does) or implicitly by Cranelift. Support for implicitly adding it contributes a lot of complexity to the ABI handling code in Cranelift (which is already far too complex) and has two ABI incompatibilities:

  • On x86_64 System V, the implicitly introduced return area pointer is not returned back in rax as the ABI requires. I've tried to fix this, but the code paths for explicit and implicit return area pointers are both too different and too similar to handle this cleanly.
  • When the return values don't all fit in registers, the ABI says everything should be returned through the return area, yet we split the return values between registers and the return area.

Given that this causes a lot of complexity in Cranelift and is effectively broken for native calling conventions anyway, I propose removing support for implicit return area pointers and requiring frontends to use explicit return area pointers instead.
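For illustration, here is a hedged sketch of what the explicit form looks like in CLIF (the function name, slot offsets, and value types are invented for the example): the frontend declares the return area pointer as an `sret` parameter instead of letting Cranelift invent one, and the callee stores its results through it:

```clif
;; Callee with an explicit StructReturn pointer (v0); results are
;; stored through it instead of being returned in registers.
function %callee(i64 sret) system_v {
block0(v0: i64):
    v1 = iconst.i64 1
    store notrap aligned v1, v0
    v2 = iconst.i64 2
    store notrap aligned v2, v0+8
    return
}
```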


@cfallin
Member

cfallin commented Oct 24, 2024

@bjorn3 and I talked about this a bit offline -- I was initially skeptical because supporting arbitrary return values is ergonomic, fits well with the CLIF abstraction level in general, and is something we already have. However, if our implementation is already broken with respect to what the ABI says should happen, it may not be a bad idea to say that the Cranelift embedder (Wasmtime or cg_clif or other) needs to build its own mechanism for multiple return values. After all, "N args, 1 return" is also a perfectly consistent abstraction level, and one we live with in many other languages.

Practically speaking in this repo, this means a good amount of code removal in the ABI code as bjorn3 mentions, and then building the equivalent in cranelift-wasm. I suspect we could do it in a single pass still by, lazily when needed, creating a stackslot for returns, generating a CLIF call with extra arg, and loads of values from stackslot to define the return values. (The stackslot can be reused for all callsites, and we could update its size as we discover larger return-value sets.)
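Roughly, the caller-side lowering described above might look like the following CLIF (a hedged sketch: names, the slot size, and offsets are invented, and in practice the slot would be grown and reused across callsites):

```clif
function %caller() -> i64 system_v {
    ss0 = explicit_slot 16        ;; shared return area, sized for the
                                  ;; largest return-value set seen so far
    sig0 = (i64 sret) system_v
    fn0 = %callee sig0

block0:
    v0 = stack_addr.i64 ss0       ;; the extra argument: return area pointer
    call fn0(v0)
    v1 = stack_load.i64 ss0       ;; loads from the slot define the values
    v2 = stack_load.i64 ss0+8     ;; that used to be "return values"
    v3 = iadd v1, v2
    return v3
}
```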

What do others think? We can talk about this in the next CL meeting if needed, too.

@bjorn3
Contributor Author

bjorn3 commented Oct 24, 2024

I've opened #9511 to gate the support behind an option. This should help with removing it in the future and reduces the likelihood of accidentally hitting the ABI issue in the short term.

@fitzgen
Member

fitzgen commented Nov 7, 2024

Practically speaking in this repo, this means a good amount of code removal in the ABI code as bjorn3 mentions, and then building the equivalent in cranelift-wasm. I suspect we could do it in a single pass still by, lazily when needed, creating a stackslot for returns, generating a CLIF call with extra arg, and loads of values from stackslot to define the return values. (The stackslot can be reused for all callsites, and we could update its size as we discover larger return-value sets.)

Clarification: who is creating this stack slot? The caller of any function that returns multiple values? And for tail calls, the caller would forward the return pointer it was itself given, so that the tail call type-checks?

Idea: we could also stuff a buffer within the vmctx and avoid burning a register on the return pointer. It does mean we would need to eagerly reload values out of the buffer before any subsequent calls, since the buffer would presumably be too small to use as a full stack, but I think that's what you were thinking we'd do anyways.

@bjorn3
Contributor Author

bjorn3 commented Nov 8, 2024

Clarification: who is creating this stack slot? The caller of any function that returns multiple values? And for tail calls, the caller would forward the return pointer it was itself given, so that the tail call type-checks?

Yes

Idea: we could also stuff a buffer within the vmctx and avoid burning a register on the return pointer.

That would work provided that you know an upper bound on the size of the return area. You could also put the return area pointer in the vmctx instead and keep the return area itself on the stack; that would avoid the expensive copies entirely at the cost of an extra indirection. That said, I don't think multi-value returns with more values than fit in registers are common enough to warrant avoiding a return area pointer register, and the register only needs to be reserved if a return area is necessary at all. For the Tail and Winch call convs we could add more return registers than the native call conv to reduce the likelihood of needing a return area.
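To make the vmctx-embedded variant concrete, a rough CLIF sketch (the offsets here are hypothetical, standing in for wherever the return-value buffer would live inside the vmctx): the caller passes only vmctx and reloads the spilled results from the fixed-offset buffer after the call returns:

```clif
;; v0 is the vmctx pointer; 256 is a hypothetical offset of the
;; return-value buffer embedded in the vmctx.
call fn0(v0)
v1 = load.i64 notrap aligned v0+256    ;; first spilled result
v2 = load.i64 notrap aligned v0+264    ;; second spilled result
```

Note this is where the eager-reload requirement comes from: any subsequent call through the same vmctx would clobber the buffer.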

@fitzgen
Member

fitzgen commented Nov 8, 2024

For the Tail and Winch call convs we could add more return registers than the native call conv to reduce the likelihood of needing a return area.

Very true, and this is definitely the first thing we should do if we find ourselves optimizing the return of many values.

That would work provided that you know an upper bound on the size of the return area.

We can do that, since the Wasm spec allows implementations to impose limits on many things including number of results. We try to match the JS/Web limits for consistency and portability, and the JS limit on number of results is 1000.

In general, I'd prefer embedding the return area space within the vmctx over pointing to the return area space from the vmctx just so we could avoid the indirection. If we get to the point where it makes sense for us to optimize this stuff by saving a register, we might as well go all the way and avoid the indirection as well.


FWIW, the use case where I could imagine these kinds of calling convention micro optimizations actually mattering is when we are dealing with basic blocks that have been compiled down to branchless functions that all tail call each other, and so every function/block in the program takes (say) 6 arguments and generates 6 returns and function calls are frequent.
