x64: Add rudimentary support for some AVX instructions #5795
alexcrichton merged 2 commits into bytecodealliance:main
Conversation
Oh I should note that the total encoding size of …
Labels: "cranelift", "cranelift:area:x64", "isle"
I stumbled across this working on bytecodealliance#5795 and figured this was a nice opportunity to improve the codegen here.
afonso360 left a comment:
This is Awesome 🎉
I'm still looking through the VEX encoding bits, but it looks great!
```
@@ -3066,10 +3109,11 @@
;; Helper for creating `minps` instructions.
(decl x64_minps (Xmm Xmm) Xmm)
```
I don't think we can update this to `XmmMem` since `minps` requires alignment, but it's unfortunate since it also prevents loads from sinking into the AVX version.
Is there another way around it? I suspect we are going to run into more cases like this in the future.
My thoughts would be to do something like:

```
(decl load_me_maybe (XmmMem) Xmm)
(decl x64_minps (Xmm XmmMem) Xmm)
(rule 0 (x64_minps x y)
      (xmm_rm_r (SseOpcode.Minps) x (load_me_maybe y)))
```

But I'm not sure how good an idea that is, and we definitely don't need to do it in this PR.
One idea I had was that in the `sinkable_load` extractor a helper function `is_mergeable_load` is called, which currently returns false for all SIMD-based loads that aren't flagged as aligned. I was thinking that once we have complete AVX support we might be able to make that function conditional: if AVX is enabled it allows unaligned loads, and otherwise it disallows them. That's still somewhat brittle, though, so I don't honestly know the best way to handle this at this time.
Similar to the above comments, we should probably do load sinking in a separate PR since it is a bit more advanced.
Edit: Oops, I was going to mark this as resolved at the same time you replied! I'm going to leave this open in case anyone else wants to comment.
> That's still somewhat brittle, though, so I don't honestly know the best way to handle this at this time.
We could pass the opcode to is_mergeable_load as well and have an allow list for AVX instructions that don't barf on unaligned loads.
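To make the allow-list idea concrete, here's a minimal sketch in Rust. The names `SseOpcode` and the helper signature mirror the discussion above but are illustrative assumptions, not Cranelift's actual `is_mergeable_load` API:

```rust
// Hypothetical sketch: load-sinking consults both the AVX setting and an
// allow-list of opcodes whose VEX-encoded forms tolerate unaligned memory
// operands. Explicitly-aligned moves like (v)movaps still fault on
// unaligned addresses, so they stay excluded.

enum SseOpcode {
    Minps,
    Andnps,
    Movaps,
}

/// Returns true if a load feeding `op` may be sunk into the instruction
/// even when the address is not known to be 16-byte aligned.
fn load_mergeable_unaligned(avx_enabled: bool, op: &SseOpcode) -> bool {
    avx_enabled && !matches!(op, SseOpcode::Movaps)
}

fn main() {
    // With AVX, arithmetic ops accept unaligned memory operands...
    assert!(load_mergeable_unaligned(true, &SseOpcode::Minps));
    assert!(load_mergeable_unaligned(true, &SseOpcode::Andnps));
    // ...but without AVX, or for alignment-checking moves, sinking is unsafe.
    assert!(!load_mergeable_unaligned(false, &SseOpcode::Minps));
    assert!(!load_mergeable_unaligned(true, &SseOpcode::Movaps));
    println!("ok");
}
```

The real predicate would also need the existing `is_mergeable_load` checks (trap-safety, single use, etc.); this only shows the extra opcode dimension.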
I like that idea! We've already got a separate path for sinking loads for sse operations (put_in_xmm_mem) which would be a great place to add this information. The only downside I can see with changing the signature of that function is that we'd no longer be able to use it as an implicit conversion. I think that might actually be a great change though, as it would force us to be a bit more thoughtful about where we're allowing loads to be sunk for sse instructions.
Alternate idea, for future consideration: an instruction that takes a R/M arg but requires alignment could encode that in the type, via e.g. an XmmMemAligned arg. (If we currently have a case where we use the same MInst enum variant for an alignment-required opcode and not, we should split into separate variants.) Then if we made a pass over all instructions to ensure we got the types right, we can be much more permissive with the automatic conversions everywhere, without having to worry about continued vigilance with explicit put_in_xmm_mem, etc.
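A rough Rust sketch of that type-level idea (all names here are hypothetical, not Cranelift's real operand types): by giving alignment-required operands their own type that can only be constructed from a register or a provably-aligned load, a sunk unaligned load can never reach an opcode that would fault on it.

```rust
// Illustrative only: distinct operand types for alignment-agnostic (VEX)
// vs alignment-required (SSE) reg/mem operands.

struct Xmm(u8);

enum Amode {
    RipRelative(i32),
}

/// Reg-or-mem operand with no alignment requirement (VEX-encoded forms).
#[allow(dead_code)]
enum XmmMem {
    Reg(Xmm),
    Mem(Amode),
}

/// Reg-or-mem operand that must be 16-byte aligned (SSE forms). The only
/// way to build the memory case is through a checked constructor.
enum XmmMemAligned {
    Reg(Xmm),
    AlignedMem(Amode),
}

impl XmmMemAligned {
    #[allow(dead_code)]
    fn from_reg(r: Xmm) -> Self {
        XmmMemAligned::Reg(r)
    }

    /// Only loads proven aligned may become an SSE memory operand.
    fn from_mem_if_aligned(m: Amode, known_aligned: bool) -> Option<Self> {
        if known_aligned {
            Some(XmmMemAligned::AlignedMem(m))
        } else {
            None
        }
    }
}

fn main() {
    assert!(XmmMemAligned::from_mem_if_aligned(Amode::RipRelative(16), true).is_some());
    // An unproven address simply cannot be typed as an aligned operand.
    assert!(XmmMemAligned::from_mem_if_aligned(Amode::RipRelative(3), false).is_none());
    println!("ok");
}
```

With separate `MInst` variants keyed on these types, the automatic conversions could stay permissive while the type checker enforces the alignment split.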
Force-pushed from 85f61df to 4ca5131
I was poking around SpiderMonkey's wasm backend and saw that the various assembler functions used are all `v*`-prefixed, which looks like they're intended for use with AVX instructions. I looked at Cranelift and it currently doesn't have support for many AVX-based instructions, so I figured I'd take a crack at it!

The support added here is a bit of a mishmash when viewed alone, but my general goal was to take a single instruction from the SIMD proposal for WebAssembly and migrate all of its component instructions to AVX. I, by random chance, picked a pretty complicated instruction: `f32x4.min`. This wasm instruction is implemented on x64 with 4 unique SSE instructions and ended up being a pretty good candidate.

Further digging into AVX-vs-SSE shows that there should be two major benefits to using AVX over SSE:

* Primarily, AVX instructions largely use a three-operand form where two input registers are operated on and a separate output register is specified. This is in contrast to SSE's predominant one-register-is-input-but-also-output pattern. This should help free up the register allocator a bit and additionally remove the need for movement between registers.
* As bytecodealliance#4767 notes, the memory-based operations of VEX-encoded instructions (aka AVX instructions) do not have strict alignment requirements, which means we would be able to sink loads and stores into individual instructions instead of having separate instructions.

So I set out on my journey to implement the instructions used by `f32x4.min`. The first few were fairly easy. The machinst backends are already of the shape "take these inputs and compute the output", where the x86 requirement of a register being both input and output is postprocessed in. This means that the `inst.isle` creation helpers for SSE instructions were already of the correct form to use AVX. I chose to add new `rule` branches to the instruction creation helpers, for example `x64_andnps`. The new `rule` only runs if AVX is enabled, and emits an AVX instruction instead of an SSE instruction to achieve the same goal. This means that no lowerings of clif instructions were modified; instead just new instructions are being generated.

The VEX encoding was previously not heavily used in Cranelift. The only current users are the FMA-style instructions that Cranelift has at this time. These FMA instructions have one more operand than `vandnps`, for example, so I split the existing `XmmRmRVex` into a few more variants to fit the shape of the instructions that needed generating for `f32x4.min`. This was then accompanied by more AVX opcode definitions, more emission support, etc.

Upon implementing all of this it turned out that the test suite was failing on my machine due to the memory-operand encodings of VEX instructions not being supported. I didn't explicitly add those myself, but some preexisting RIP-relative addressing was leaking into the new instructions with existing tests. I opted to go ahead and fill out the memory addressing modes of VEX encoding to get the tests passing again.

All-in-all this PR adds new instructions to the x64 backend for a number of AVX instructions, updates 5 existing instruction producers to use AVX instructions conditionally, implements VEX memory operands, and adds some simple tests for the new output of `f32x4.min`. The existing runtest for `f32x4.min` caught a few intermediate bugs along the way, and I additionally added a plain `target x86_64` to that runtest to ensure that it executes with and without AVX to test the various lowerings. I'll also note that this, and future support, should be well-fuzzed through Wasmtime's fuzzing, which may explicitly disable AVX support despite the machine having access to AVX, so non-AVX lowerings should be well-tested into the future.

It's also worth mentioning that I am not an AVX or VEX or x64 expert. Implementing the memory-operand part for VEX was the hardest part of this PR, and while I think it should be good someone else should definitely double-check me. Additionally I haven't added many instructions to the x64 backend yet, so I may have missed obvious places to test; I'm happy to follow up with anything to be more thorough if necessary.

Finally I should note that this is just the tip of the iceberg when it comes to AVX. My hope is to get some of the idioms sorted out to make it easier for future PRs to add one-off instruction lowerings.
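For readers unfamiliar with the VEX prefix, here is a simplified two-byte VEX builder, a sketch after the Intel SDM layout rather than Cranelift's actual emitter. The two-byte `C5` form applies when the opcode is in the `0F` map and the `W`/`X`/`B` extension bits are unused; the field packing and the `vandnps` example below follow the SDM:

```rust
// Two-byte VEX prefix (0xC5), second byte layout per the Intel SDM:
//   [7]   = ~R     inverted ModRM.reg extension bit
//   [6:3] = ~vvvv  inverted second-source register number
//   [2]   = L      vector length (0 = 128-bit, 1 = 256-bit)
//   [1:0] = pp     implied legacy prefix (0 = none, 1 = 66, 2 = F3, 3 = F2)
fn vex2(reg: u8, vvvv: u8, l: u8, pp: u8) -> [u8; 2] {
    // ~R: registers 0-7 need no extension, so the inverted bit is 1.
    let r_bar = if reg < 8 { 1u8 } else { 0 };
    [0xC5, (r_bar << 7) | ((!vvvv & 0xF) << 3) | (l << 2) | pp]
}

fn main() {
    // vandnps xmm0, xmm1, xmm2: dest = reg field, xmm1 = vvvv, xmm2 = rm.
    let mut insn = vex2(0, 1, 0, 0).to_vec();
    insn.push(0x55); // ANDNPS opcode in the 0F map
    insn.push(0xC0 | (0 << 3) | 2); // ModRM: mod=11, reg=xmm0, rm=xmm2
    assert_eq!(insn, vec![0xC5, 0xF0, 0x55, 0xC2]);
    println!("{:02X?}", insn);
}
```

The memory-operand forms this PR fills out replace the `mod=11` ModRM byte with the usual mod/rm + SIB/displacement encodings (and RIP-relative addressing), which is where most of the fiddly cases live.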
Force-pushed from 4ca5131 to 9c33104
This is a follow-up to comments in bytecodealliance#5795 to remove some cruft in the x64 instruction model to ensure that the shape of an `Inst` reflects what's going to happen in regalloc and encoding. This accessor was used to handle `round*`, `pextr*`, and `pshufb` instructions. The `round*` ones had already moved to the appropriate `XmmUnary*` variant and `pshufb` was additionally moved over to that variant as well. The `pextr*` instructions got a new `Inst` variant and additionally had their constructors slightly modified to no longer require the type as input. The encoding for these instructions now automatically handles the various type-related operands through a new `SseOpcode::Pextrq` operand to represent 64-bit movements.