Cranelift: add stack_switch CLIF instruction #9078
I'm marking this as "ready for review" now, but I have a few questions about things that will probably require additional fixes before this is actually ready:
Super excited for this! Haven't taken a look at the actual code yet, but here are some answers to the questions in your comment above.
Will the lowering work for all posix right now or only Linux? Will macos work, for example? We should be precise about the correctness condition here. Cranelift doesn't generally care what OS it is targeting beyond calling conventions and a few other ABI details here and there, and AFAIK we've never needed OS-specific lowering rules before, so this is sort of untrodden ground. A
I think describing the layout of that data would be best done in the documentation for the new instruction itself. I haven't looked at the actual code in this PR, but I'd expect that the only platform-specific bits would be pointer size, and otherwise we are always saving SP, optionally FP when frame pointers are enabled, and PC. Is that assumption incorrect? Are we, or will we be, saving more/fewer values on different platforms? I also would expect that we wouldn't actually need to explicitly define all the details of this data and its layout. I'd imagine we would only say that it has a given size and alignment, and is otherwise only valid to be manipulated via the But maybe something like stack walking means that the runtime needs to be able to peek inside this data, and we can't keep it opaque to the host?
I think we've had similar-ish issues opened in the past for
I think we also want to have the
I don't think having a big filetest is a problem in itself per se, but I think we should also have a very basic filetest that is effectively just the
Things like So for the x64
If the answers to these questions are "can operate directly on memory" and "accessing the value once" then I'd say a
FWIW, it is also fine to start with just
Looking good! I think that, with the guard rails where we only lower this instruction on target OSes for which it will work, and the various nitpicks below addressed, this will be good to merge.
"stack_switch",
r#"
Suspends execution of the current stack and resumes execution of another
one.
The target stack to switch to is identified by the data stored at
``load_context_ptr``. Before switching, this instruction stores
analogous information about the
current (i.e., original) stack at ``store_context_ptr``, to
enable switching back to the original stack at a later point.
The size and layout of the information stored at ``load_context_ptr``
and ``store_context_ptr`` is platform-dependent.
The instruction is experimental and only supported on x64 Linux at the
moment.
When switching from a stack A to a stack B, one of the following cases
must apply:
1. Stack B was previously suspended using a ``stack_switch`` instruction.
2. Stack B is a newly initialized stack. The necessary initialization is
platform-dependent and will generally involve running some kind of
trampoline to start execution of a function on the new stack.
In both cases, the ``in_payload`` argument of the ``stack_switch``
instruction executed on A is passed to stack B. In the first case above,
it will be the result value of the earlier ``stack_switch`` instruction
executed on stack B. In the second case, the value will be accessible to
the trampoline in a platform-dependent register.
The pointers ``load_context_ptr`` and ``store_context_ptr`` are allowed
to be equal; the instruction ensures that all data is loaded from the
former before writing to the latter.
"#,
Somewhere in here we should mention the one-shottedness of this instruction, and how resuming a context twice can result in UB due to spilled values being overwritten by the first resumption, etc...
It might even be worth naming this instruction one_shot_stack_switch.
I think we should also clarify that, while this instruction performs loads and stores, those memory operations are always assumed to be aligned, non-trapping, and otherwise valid. This instruction performs no validation itself. It is as if these memory operations had MemFlags::trusted() attached to them. Therefore, it is the user's responsibility to ensure that these assumptions are upheld.
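The one-shot property described in this comment can be modeled in plain Rust. This is only a sketch under stated assumptions: `SavedContext` and `ContextSlot` are hypothetical names, and the model turns a second resumption into an explicit `None`, whereas the real instruction performs no such check and a second resumption would be undefined behavior.

```rust
/// Hypothetical stand-in for the data stored behind a context pointer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SavedContext {
    pub sp: usize,
    pub fp: usize,
    pub pc: usize,
}

/// Hypothetical slot that holds the context of a suspended stack.
/// It is valid for exactly one resumption after each suspension.
#[derive(Debug, Default)]
pub struct ContextSlot(Option<SavedContext>);

impl ContextSlot {
    /// Models the store half of a switch: suspending a stack (re)fills the slot.
    pub fn suspend(&mut self, current: SavedContext) {
        self.0 = Some(current);
    }

    /// Models the load half of a switch: resuming consumes the stored context,
    /// so a second resumption without a new suspension yields `None`.
    pub fn resume(&mut self) -> Option<SavedContext> {
        self.0.take()
    }
}
```

A slot becomes resumable again only after the stack it describes has been suspended once more, which is the difference from setjmp/longjmp-style contexts.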
I've extended the documentation of this instruction now. Still not specifying the actual layout itself, but mentioned the one-shottedness and the fact that the instruction does not check the pointers or data.
// Note that we do not emit anything for preserving and restoring
// ordinary registers here: That's taken care of by regalloc for us,
// since we marked this instruction as clobbering all registers.
This is a fantastic hack
Thanks for your answers and looking at the code! I'll look into the things you suggested. In the meantime, some more thoughts regarding the layout of the contexts: On most platforms, the context can indeed look like this:
However, there are two reasons why the layout may be different somewhere:
1. Additional info needed on Windows
On Windows, I think we would have to update data inside the "Thread Information Block" (the stuff we briefly talked about with Ryan Hunt at the Pittsburgh CG meeting). I haven't looked too deeply into it, but it looks like it contains pointers to the beginning and end of the currently active stack, which we would need to update when switching. That means that I suspect we would have to add these to the
In any case, it looks like we would need Windows-specific lowering rules for
2. Frame pointer walking
There's another issue that I've briefly mentioned in a comment in the new file
I made sure that the
Concretely, in our implementation of Wasm stack switching, this is achieved like this:
The main idea is now that we make sure that the
The last missing ingredient for creating the frame pointer chain is that when we are running the trampoline that kicks off execution inside stack S, we set
Long story short: It's quite neat to make sure that
Luckily, the layout above (i.e., frame pointer right next to PC) works on all platforms supported by Cranelift, except s390x: On the latter, there's an offset of 14 words between where the FP and PC are stored.
(Or just give up on frame pointer walking there)
So at least it seems that within the same ISA, all OSes sufficiently agree on the stack layout so that the frame pointer walking causes no extra hassle.
Documenting the layout
Yes, the layout of the
The only situation where you do need to know about the layout is when initializing a new stack: You need to create a corresponding
Alternatively, we could add a
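For illustration, the conceptual three-word context mentioned in this thread (SP, FP, PC, with the frame pointer stored right next to the PC) could be written down as below. This is a sketch under assumptions, not the layout the PR commits to: the name `StackContext`, the field order, and the helper `new_stack_context` are all hypothetical, and a real layout may differ per platform (for example on Windows or s390x, as discussed above).

```rust
use std::mem::{align_of, size_of};

/// Hypothetical layout of the data behind a context pointer: three
/// pointer-sized words, with FP stored directly next to PC.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct StackContext {
    pub sp: usize,
    pub fp: usize,
    pub pc: usize,
}

/// Size and alignment are the only properties a host would need if the
/// contents were otherwise kept opaque.
pub const CONTEXT_SIZE: usize = size_of::<StackContext>();
pub const CONTEXT_ALIGN: usize = align_of::<StackContext>();

/// Hypothetical helper for the one case where the layout must be known:
/// describing a freshly initialized stack whose execution starts in a
/// trampoline. All three values are supplied by the caller.
pub fn new_stack_context(stack_top: usize, initial_fp: usize, trampoline_pc: usize) -> StackContext {
    StackContext {
        sp: stack_top,
        fp: initial_fp,
        pc: trampoline_pc,
    }
}
```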
Co-authored-by: Nick Fitzgerald <[email protected]>
I've implemented the restriction to only lower
I've implemented this using a partial constructor
SGTM
It doesn't seem ideal. Do we not already have access to a
The
(rule (lower (stack_switch store_context_ptr load_context_ptr in_payload0))
;; For the time being, stack switching is only supported on x64 Linux
(if (on_linux))
There is no other place in Cranelift that looks at the target OS when determining how to lower clif ir. Even TLS handling chooses the right variant for the target OS using a codegen flag rather than by looking at the OS in the triple. Using the right calling conventions is done by the producer of the clif ir.
I think I agree here: philosophically, this is if not quite literally calling-convention related, certainly like an OS-interface detail that should be "baked into" the CLIF (rather than implicit in lowering) as CLIF is ordinarily explicit about such details. Can we check the triple in the Wasm translation (i.e., the wasmtime FuncEnvironment hook called by cranelift-wasm, or wherever else this instruction is generated) and fail if on the wrong platform?
This is also a little more future-looking in the sense that there may be Wasmtime details at some future point related to stack-switching (e.g., what if we add our own stack-protection mitigations or have some custom kind of stack-growth scheme or ...) -- we wouldn't want to hardcode that into Cranelift lowering in the same way OS details are here.
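A producer-side check of the kind suggested here might look roughly like the following sketch. Everything in it is hypothetical: `TargetOs` stands in for the operating system component of a target triple, and `ensure_stack_switch_supported` is an invented helper, not an existing Wasmtime or Cranelift function. It only encodes the restriction stated in this thread (Linux for now).

```rust
/// Hypothetical stand-in for the OS component of a target triple.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TargetOs {
    Linux,
    MacOs,
    Windows,
}

/// Hypothetical check that a CLIF producer could run before emitting a
/// `stack_switch` instruction, so that the platform decision is made when
/// the CLIF is generated rather than implicitly during lowering.
pub fn ensure_stack_switch_supported(os: TargetOs) -> Result<(), String> {
    match os {
        // The thread only claims support for Linux at this point.
        TargetOs::Linux => Ok(()),
        other => Err(format!("stack switching is not supported on {:?}", other)),
    }
}
```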
Just to recap the context sprinkled through this PR: The reason why we need to do something platform/OS-specific here eventually is that Windows requires us to update parts of the Thread Information Block when switching stacks. Unfortunately, what exactly needs to be done is undocumented, so I've not done it, yet. The "plain" stack switching I implemented in this PR was supposed to work only on Linux, but from what I can tell should work on x64 macOS, too.
Following what you said, I guess one approach would be to make the lowering of stack_switch more similar to that of tls_value. I'd imagine this would look as follows:
- Create a StackSwitchMode enum with Default and UpdateTib variants.
- Encode a value of that type in the backend's shared_settings::Flags and also add it as a field to the StackSwitch MInst.
- When emitting code for StackSwitch, check the value of the StackSwitchMode flag and act accordingly.
I'm also happy to have more than one MInst for stack switching, and lower the stack_switch CLIF instruction to one of them, based on the value of StackSwitchMode. That would mirror tls_value more closely, but I'm inclined to avoid that: The core stack switching code is always the same, we just sometimes need to emit some extra code on top of that.
Finally, what I described above would be the medium-term solution once I've implemented Windows support. For the time being I would just add a None variant to StackSwitchMode, use that on Windows, and panic if we see it when emitting code for StackSwitch.
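The proposal above can be sketched as follows. This is a hedged illustration, not the code in the PR: the enum mirrors the variant names from the comment, while `emit_stack_switch` is a hypothetical function whose string entries merely stand in for emitted machine code. Where the extra Windows work would sit relative to the core sequence is an assumption.

```rust
/// Mirrors the proposed enum: `None` marks targets without support.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StackSwitchMode {
    None,
    Default,
    UpdateTib,
}

/// Hypothetical emission step. Each string stands in for a group of
/// machine instructions; the core sequence is shared by all modes.
pub fn emit_stack_switch(mode: StackSwitchMode) -> Vec<&'static str> {
    let mut code = Vec::new();
    match mode {
        // Placeholder behavior until the platform is supported.
        StackSwitchMode::None => panic!("stack switching is not supported on this target"),
        StackSwitchMode::Default => {}
        // Assumption: the extra work is emitted ahead of the core sequence.
        StackSwitchMode::UpdateTib => code.push("update stack bounds in the Thread Information Block"),
    }
    code.extend_from_slice(&[
        "load target SP, FP, PC",
        "store current SP, FP, PC",
        "jump to target PC",
    ]);
    code
}
```

Keeping a single emission path with an optional prefix reflects the stated preference for one MInst over several per-platform ones.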
Can we check the triple in the Wasm translation (i.e., the wasmtime FuncEnvironment hook called by cranelift-wasm, or wherever else this instruction is generated) and fail if on the wrong platform?
Did you mean for this to be just a solution for the current issue of me wanting to fail if not on Linux? Or did you mean to also use this approach (i.e., resolving differences at the Wasm -> CLIF stage) in the future when we want to generate slightly different code on different OS-es?
In the latter case, are you suggesting to have multiple stack switching CLIF instructions, for the case with and without the TIB update, then choose between them during the Wasm translation? Or are you suggesting to have a single CLIF instruction for the stack switching itself, but then emit some additional CLIF on Windows (say, some additional loads and stores) around the stack_switch during the Wasm translation?
I fully agree that for things like stack growing there's a good chance that in the future, we'd need some more customization of the generated code (e.g., customize stack probing, function preludes, ...).
Did you mean for this to be just a solution for the current issue of me wanting to fail if not on Linux? Or did you mean to also use this approach (i.e., resolving differences at the Wasm -> CLIF stage) in the future when we want to generate slightly different code on different OS-es?
Both, I think -- basically, any sort of difference of behavior on that level should be reified in the CLIF, rather than implicit; that's how we've handled things like TLS and other platform dependencies.
In the latter case, are you suggesting to have multiple stack switching CLIF instructions, for the case with and without the TIB update, then choose between them during the Wasm translation? Or are you suggesting to have a single CLIF instruction for the stack switching itself, but then emit some additional CLIF on Windows (say, some additional loads and stores) around the stack_switch during the Wasm translation?
One or the other, depending on what is actually needed? I don't have enough context to actually specify the full design; that's something we can discuss further; only that we should make it explicit somehow. If one platform requires a superset of the work that another platform does, factoring out the common bit as one instruction and adding more logic at the CLIF level seems reasonable. On the other hand it's totally reasonable to have stack_switch_windows and stack_switch_sysv instructions IMHO; again see how we did TLS, with separate instructions for ELF and Mach-O cases.
On the other hand it's totally reasonable to have stack_switch_windows and stack_switch_sysv instructions IMHO; again see how we did TLS, with separate instructions for ELF and Mach-O cases.
Ah, I think that's where my confusion came from: For TLS, there is a single CLIF instruction, which is then lowered to one of several per-platform MInsts (but based on a flag in the backend, not some ad hoc OS check like I did). I'm happy to do it like that, which would be similar to what I described in my response to @bjorn3.
I'd prefer that over moving the TIB update out of the stack_switch (CLIF) instruction itself.
This reverts commit 2af10f9.
I've now reworked the platform dependence logic, inspired by what happens for TLS:
@alexcrichton For some reason the automated reviewer assignment was triggered again, maybe because this PR is now touching a non-Cranelift file.
Do others have more thoughts on this? This all looks reasonable enough to me to land, but I'd want to be sure to run it by others too.
LGTM with one final nitpick below
(decl stack_switch_model (StackSwitchModel) Type)
(extern extractor infallible stack_switch_model stack_switch_model)
Rather than making this an extractor from a type, can we make it a partial constructor?
Then the use would look like
(lower (stack_switch store_context_ptr load_context_ptr in_payload)
(if-let (stack_switch_model) (StackSwitchModel.Basic))
...)
Done, but you meant (if-let (StackSwitchModel.Basic) (stack_switch_model)), right?
Is there a particular reason for making it a partial constructor? Following what's happening for TLS I gave my StackSwitchModel enum a None variant to indicate that no model was set, but that means that the stack_switch_model constructor can be total, or am I missing something?
I might be wrong, but I had thought that you can only use partial constructors in if-lets. Might be worth removing the None variant from StackSwitchModel, if so.
I tried it, and turning stack_switch_model into a total constructor seems to just work.
I don't understand the magic mechanism for configuring what goes into the backend Flags in detail, but from what I can tell it does not allow you to store something logically equivalent to Option<StackSwitchModel> in there, so keeping the None variant in StackSwitchModel itself seems like the way to go.
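The trade-off described here, an explicit None variant instead of an Option wrapper, can be illustrated with a small sketch. The variant names come from this thread; the setting strings ("none", "basic") and the helper `lowering_supported` are assumptions made for the example, not a description of how the backend flags are actually parsed.

```rust
use std::str::FromStr;

/// The model as discussed: `None` means that no model was configured.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StackSwitchModel {
    None,
    Basic,
}

impl FromStr for StackSwitchModel {
    type Err = String;

    /// Hypothetical parsing of a textual setting value into the enum.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "none" => Ok(Self::None),
            "basic" => Ok(Self::Basic),
            other => Err(format!("unknown stack switch model: {}", other)),
        }
    }
}

/// Hypothetical counterpart of a total constructor plus a pattern match:
/// a model is always available, and only `Basic` permits lowering.
pub fn lowering_supported(model: StackSwitchModel) -> bool {
    model == StackSwitchModel::Basic
}
```

Because every setting value maps to some variant, the lookup itself never fails; the decision not to lower is expressed by matching on the variant.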
I think this is in a good enough place that we can land it and then continue with any further improvements in follow ups. Thanks @frank-emrich!
Oops, it looks like I responded to your comment and added a new commit in parallel to you approving things. But I think it should be uncontroversial, I kept
Yeah, LGTM
This PR adds a new CLIF instruction for switching stacks. While the primary motivation is to support the Wasm stack switching proposal currently under development, the CLIF instruction here is lower level and thus intended to be useful for general-purpose stackful context switching (such as implementing coroutines, fibers, etc., independently from the Wasm stack switching proposal).
This PR only adds support for the instruction on x64 Linux, but I'm planning to add support for more platforms over time. The design of the instruction should be sufficiently abstract to support all the other platforms.
Work is currently under way to implement Wasm stack switching in Wasmtime here, and it already uses the CLIF instruction introduced by this PR successfully. Still, it seems worthwhile upstreaming the CLIF instruction by itself: the proposal is not fully finalized yet, and this CLIF instruction seems useful on its own, independent of the remainder of the Wasm proposal.
Concretely, the CLIF instruction looks as follows:
This causes the following to happen:
1. stack_switch instruction are stored at store_context_ptr. All other registers are marked as clobbered and thus spilled by regalloc as needed.
2. (SP, FP, PC) triple from load_context_ptr, indicating the stack/context to switch to. We assume that we are switching to a stack that was either previously switched away from by another stack_switch, or is a newly initialized stack.
3. in_payload is passed over to the other stack. In other words, if the instruction above switches from some stack A to another stack B, then the return value of the stack_switch instruction previously executed on B will be in_payload.
4. out_payload above (i.e., the return value of the stack_switch executed when leaving stack A) is the payload argument passed to the corresponding switch.
A few additional notes:
- store_context_ptr and load_context_ptr can be seen as pointers to what is conceptually a three-element struct, containing SP, FP, PC.
- store_context_ptr and load_context_ptr are allowed to be equal. In particular, in steps 1 and 2 above, we make sure to actually load all required data from load_context_ptr before storing to store_context_ptr.
- stack_switch was executed, or to a new stack
- stack_switch instruction, then regalloc has spilled all subsequently needed SSA values for us, no need to manually restore any context besides SP, FP, PC.
- stack_switch to switch from a stack A to a stack B, we can only switch back to A once (unless we subsequently execute another stack_switch on A again). This is different from setjmp/longjmp, where we may store the context once using setjmp and then return to it multiple times using longjmp without needing to call setjmp again.
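The ordering requirement from steps 1 and 2 (load everything from load_context_ptr before storing to store_context_ptr, so that the two may alias) can be modeled in a few lines of Rust. This is a sketch under assumptions: the `Context` struct, the slice standing in for memory, and the function `switch_contexts` are hypothetical and only model the memory traffic, not the actual transfer of control.

```rust
/// Hypothetical stand-in for the conceptual (SP, FP, PC) triple.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Context {
    pub sp: usize,
    pub fp: usize,
    pub pc: usize,
}

/// Models the loads and stores of one switch. `memory` stands in for the
/// address space; the two indices stand in for `store_context_ptr` and
/// `load_context_ptr`. Returns the context in which execution continues.
pub fn switch_contexts(
    memory: &mut [Context],
    store_idx: usize,
    load_idx: usize,
    current: Context,
) -> Context {
    // Load the complete target context first, so that it survives even
    // when both indices refer to the same slot.
    let target = memory[load_idx];
    // Only afterwards overwrite the store slot with the suspended context.
    memory[store_idx] = current;
    target
}
```

With both indices equal, a single slot is enough to ping-pong between two stacks: each switch reads the other stack's context out of the slot and leaves its own behind.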