x64: Add rudimentary support for some AVX instructions by alexcrichton · Pull Request #5795 · bytecodealliance/wasmtime

alexcrichton · 2023-02-16T00:12:44Z

I was poking around Spidermonkey's wasm backend and saw that the various assembler functions used are all v*-prefixed which look like they're intended for use with AVX instructions. I looked at Cranelift and it currently doesn't have support for many AVX-based instructions, so I figured I'd take a crack at it!

The support added here is a bit of a mishmash when viewed alone, but my general goal was to take a single instruction from the SIMD proposal for WebAssembly and migrate all of its component instructions to AVX. By random chance I picked a fairly complicated instruction, f32x4.min. This wasm instruction is implemented on x64 with 4 unique SSE instructions and ended up being a pretty good candidate.

Further digging about AVX-vs-SSE shows that there should be two major benefits to using AVX over SSE:

* Primarily, AVX instructions largely use a three-operand form where two input registers are operated on and a separate output register is specified. This is in contrast to SSE's predominant one-register-is-input-but-also-output pattern. This should help free up the register allocator a bit and additionally remove the need for movement between registers.
* As #4767 (cranelift: Add VEX and EVEX memory encodings) notes, the memory-based operations of VEX-encoded instructions (aka AVX instructions) do not have strict alignment requirements, which means we would be able to sink loads and stores into individual instructions instead of having separate instructions.
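
To make the first point concrete, here is a minimal sketch in Rust of how the two operand forms differ when lowering `dst = op(a, b)`. The `Reg` alias and the `lower_sse`/`lower_avx` helpers are hypothetical names invented for this illustration and are not part of Cranelift; the sketch also assumes `dst` is never the same register as `b`.

```rust
/// Hypothetical register name type, used only for this illustration.
type Reg = &'static str;

/// SSE style: the destination doubles as the first source, so `a` has to be
/// copied into `dst` first unless the two are already the same register.
/// Assumes `dst != b`; a real lowering would have to handle that case too.
fn lower_sse(op: &str, dst: Reg, a: Reg, b: Reg) -> Vec<String> {
    let mut out = Vec::new();
    if dst != a {
        out.push(format!("movdqa {dst}, {a}"));
    }
    out.push(format!("{op} {dst}, {b}"));
    out
}

/// AVX style: the destination is a separate operand, so no copy is needed.
fn lower_avx(op: &str, dst: Reg, a: Reg, b: Reg) -> Vec<String> {
    vec![format!("v{op} {dst}, {a}, {b}")]
}
```

With `a` still live after the operation, the SSE form needs two instructions where the AVX form needs one, which is the kind of saving the `movdqa` remark further down refers to.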

So I set out on my journey to implement the instructions used by f32x4.min. The first few were fairly easy. The machinst backends are already of the shape "take these inputs and compute the output" where the x86 requirement of a register being both input and output is postprocessed in. This means that the inst.isle creation helpers for SSE instructions were already of the correct form to use AVX. I chose to add new rule branches for the instruction creation helpers, for example x64_andnps. The new rule conditionally only runs if AVX is enabled and emits an AVX instruction instead of an SSE instruction for achieving the same goal. This means that no lowerings of clif instructions were modified, instead just new instructions are being generated.
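
The "new rule that only runs if AVX is enabled" idea can be sketched roughly as follows. This is plain Rust rather than ISLE, and `IsaFlags`, `Inst`, and its variants are hypothetical stand-ins for this illustration, not the actual Cranelift definitions.

```rust
/// Hypothetical stand-in for the ISA flags consulted during lowering.
struct IsaFlags {
    has_avx: bool,
}

/// Hypothetical, heavily simplified instruction model.
#[derive(Debug, PartialEq)]
enum Inst {
    /// SSE form; register allocation later constrains `dst` to reuse `src1`.
    SseRmR { op: &'static str, src1: u8, src2: u8, dst: u8 },
    /// VEX form; destination and both sources are independent.
    VexRmR { op: &'static str, src1: u8, src2: u8, dst: u8 },
}

/// Instruction creation helper: callers ask for "andnps" and get whichever
/// encoding the enabled features allow. The callers themselves do not change.
fn x64_andnps(flags: &IsaFlags, src1: u8, src2: u8, dst: u8) -> Inst {
    if flags.has_avx {
        Inst::VexRmR { op: "vandnps", src1, src2, dst }
    } else {
        Inst::SseRmR { op: "andnps", src1, src2, dst }
    }
}
```

Because the choice lives inside the helper, every lowering that already calls it picks up the AVX form without being touched.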

The VEX encoding was previously not heavily used in Cranelift. The only current users are the FMA-style instructions that Cranelift has at this time. These FMA instructions have one more operand than vandnps, for example, so I split the existing XmmRmRVex into a few more variants to fit the shape of the instructions that needed generating for f32x4.min. This was accompanied then with more AVX opcode definitions, more emission support, etc.
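
For readers unfamiliar with VEX, the following is a minimal sketch of the two-byte VEX prefix for a register-to-register `vandnps`. The function names are hypothetical and this is not the code in `encoding/vex.rs`; it only covers the register-only case with the second source in xmm0..xmm7, and deliberately leaves out memory operands, which need ModRM/SIB/displacement handling and sometimes the three-byte prefix.

```rust
/// Build the two-byte VEX prefix: 0xC5 followed by [~R | ~vvvv | L | pp].
/// `r` extends the ModRM.reg field, `vvvv` is the extra source register,
/// `l256` selects 256-bit vectors, and `pp` is the implied legacy prefix.
fn vex2(r: bool, vvvv: u8, l256: bool, pp: u8) -> [u8; 2] {
    assert!(vvvv < 16 && pp < 4);
    let not_r = (!r as u8) << 7; // stored inverted
    let not_vvvv = (!vvvv & 0xF) << 3; // stored inverted
    let l = (l256 as u8) << 2;
    [0xC5, not_r | not_vvvv | l | pp]
}

/// Register-direct ModRM byte (mod = 0b11).
fn modrm_reg(reg: u8, rm: u8) -> u8 {
    0xC0 | ((reg & 7) << 3) | (rm & 7)
}

/// Encode `vandnps dst, src1, src2` for 128-bit registers.
/// Restricted to `src2 < 8` so the two-byte prefix is sufficient.
fn encode_vandnps(dst: u8, src1: u8, src2: u8) -> Vec<u8> {
    assert!(dst < 16 && src1 < 16 && src2 < 8);
    let prefix = vex2(dst >= 8, src1, false, 0b00);
    // 0x55 is the ANDNPS opcode within the 0F opcode map.
    vec![prefix[0], prefix[1], 0x55, modrm_reg(dst, src2)]
}
```

For example, `encode_vandnps(0, 1, 2)` corresponds to `vandnps xmm0, xmm1, xmm2` and yields the bytes `C5 F0 55 C2`.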

Upon implementing all of this it turned out that the test suite was failing on my machine due to the memory-operand encodings of VEX instructions not being supported. I didn't explicitly add those in myself but some preexisting RIP-relative addressing was leaking into the new instructions with existing tests. I opted to go ahead and fill out the memory addressing modes of VEX encoding to get the tests passing again.

All-in-all this PR adds new instructions to the x64 backend for a number of AVX instructions, updates 5 existing instruction producers to use AVX instructions conditionally, implements VEX memory operands, and adds some simple tests for the new output of f32x4.min. The existing runtest for f32x4.min caught a few intermediate bugs along the way and I additionally added a plain target x86_64 to that runtest to ensure that it executes with and without AVX to test the various lowerings. I'll also note that this, and future support, should be well-fuzzed through Wasmtime's fuzzing which may explicitly disable AVX support despite the machine having access to AVX, so non-AVX lowerings should be well-tested into the future.

It's also worth mentioning that I am not an AVX or VEX or x64 expert. Implementing the memory operand part for VEX was the hardest part of this PR and while I think it should be good someone else should definitely double-check me. Additionally I haven't added many instructions to the x64 backend yet so I may have missed obvious places to test or such, so I'm happy to follow up with anything to be more thorough if necessary.

Finally I should note that this is just the tip of the iceberg when it comes to AVX. My hope is to get some of the idioms sorted out to make it easier for future PRs to add one-off instruction lowerings or such.

alexcrichton · 2023-02-16T00:15:05Z

Oh I should note that the total encoding size of f32x4.min was reduced from 45 to 35 bytes with this change (fewer movdqa instructions are necessary).

github-actions · 2023-02-16T00:26:53Z

Subscribe to Label Action

cc @cfallin, @fitzgen

Details

This issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle
fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.


afonso360

This is Awesome 🎉

I'm still looking through the VEX encoding bits, but it looks great!

afonso360 · 2023-02-16T11:00:47Z

cranelift/codegen/src/isa/x64/inst.isle

@@ -3066,10 +3109,11 @@

 ;; Helper for creating `minps` instructions.
 (decl x64_minps (Xmm Xmm) Xmm)


I don't think we can update this to XmmMem since minps requires alignment, but it's unfortunate since it also prevents loads sinking into the AVX version.

Is there another way around it? I suspect we are going to run into more cases like this in the future.

My thoughts would be to do something like:

 (decl load_me_maybe (XmmMem) Xmm)
 (decl x64_minps (Xmm XmmMem) Xmm)
 (rule 0 (x64_minps x y)
       (xmm_rm_r (SseOpcode.Minps) x (load_me_maybe y)))

But I'm not sure how good an idea that is, and we definitely don't need to do it in this PR.

alexcrichton · 2023-02-16

One idea I had was that in the sinkable_load extractor a helper function is_mergeable_load is called which currently says false for all SIMD-based loads if they aren't flagged as aligned. I was thinking that if we have complete AVX support we might be able to make that function conditional, where if AVX is enabled then it allows unaligned loads but if not it disallows unaligned loads. That's still somewhat brittle, though, so I don't honestly know the best way to handle this at this time.

afonso360 · 2023-02-16

Similar to the above comments, we should probably do load sinking in a separate PR since it is a bit more advanced.

Edit: Oops, I was going to mark this as resolved at the same time you replied! I'm going to leave this open in case anyone else wants to comment.

fitzgen · 2023-02-16T16:37:09Z

> That's still somewhat brittle, though, so I don't honestly know the best way to handle this at this time.

We could pass the opcode to is_mergeable_load as well and have an allow list for AVX instructions that don't barf on unaligned loads.
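
A rough sketch of what an opcode-aware check could look like is below. All names here (`Op`, `Load`, `tolerates_unaligned`, and this signature for `is_mergeable_load`) are hypothetical and chosen for illustration; the real helper in Cranelift takes different arguments.

```rust
/// Hypothetical opcode set for this illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    Minps,   // SSE encoding: a memory operand must be 16-byte aligned
    Vminps,  // VEX encoding: a memory operand may be unaligned
    Vandnps, // VEX encoding: a memory operand may be unaligned
}

/// Hypothetical description of a candidate load.
struct Load {
    aligned: bool,
}

/// Allow list of opcodes that do not fault on unaligned memory operands.
fn tolerates_unaligned(op: Op) -> bool {
    matches!(op, Op::Vminps | Op::Vandnps)
}

/// A load may be sunk into `op` if it is known to be aligned, or if `op`
/// is on the allow list and therefore does not care about alignment.
fn is_mergeable_load(load: &Load, op: Op) -> bool {
    load.aligned || tolerates_unaligned(op)
}
```

The allow list keeps the default conservative: an opcode that nobody has audited is treated as alignment-sensitive.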

elliottt · 2023-02-16

I like that idea! We've already got a separate path for sinking loads for SSE operations (put_in_xmm_mem) which would be a great place to add this information. The only downside I can see with changing the signature of that function is that we'd no longer be able to use it as an implicit conversion. I think that might actually be a great change though, as it would force us to be a bit more thoughtful about where we're allowing loads to be sunk for SSE instructions.

cfallin · 2023-02-16

Alternate idea, for future consideration: an instruction that takes an R/M arg but requires alignment could encode that in the type, via e.g. an XmmMemAligned arg. (If we currently have a case where we use the same MInst enum variant for an alignment-required opcode and not, we should split into separate variants.) Then if we made a pass over all instructions to ensure we got the types right, we can be much more permissive with the automatic conversions everywhere, without having to worry about continued vigilance with explicit put_in_xmm_mem, etc.
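
As a sketch of the "encode it in the type" idea, one possible shape is shown here. The `XmmMem` and `XmmMemAligned` definitions below are simplified assumptions written for this illustration (the `offset`/`aligned` fields, the `rbp` base register, and the string-returning constructors are all made up) and are not the types that exist in Cranelift.

```rust
/// Hypothetical operand: either a register or a memory location whose
/// alignment is tracked as a flag.
#[derive(Clone, Copy, Debug, PartialEq)]
enum XmmMem {
    Reg(u8),
    Mem { offset: i32, aligned: bool },
}

/// Wrapper that can only be built from a register or an aligned address.
#[derive(Clone, Copy, Debug, PartialEq)]
struct XmmMemAligned(XmmMem);

impl XmmMemAligned {
    fn new(operand: XmmMem) -> Option<Self> {
        match operand {
            XmmMem::Mem { aligned: false, .. } => None,
            other => Some(XmmMemAligned(other)),
        }
    }
}

fn describe(operand: XmmMem) -> String {
    match operand {
        XmmMem::Reg(n) => format!("xmm{n}"),
        XmmMem::Mem { offset, .. } => format!("[rbp + {offset}]"),
    }
}

/// SSE constructor: the signature itself rules out unaligned memory.
fn minps(src1_dst: u8, src2: XmmMemAligned) -> String {
    format!("minps xmm{src1_dst}, {}", describe(src2.0))
}

/// AVX constructor: any register or memory operand is accepted.
fn vminps(dst: u8, src1: u8, src2: XmmMem) -> String {
    format!("vminps xmm{dst}, xmm{src1}, {}", describe(src2))
}
```

With this shape an unaligned load cannot reach `minps` at all, since `XmmMemAligned::new` returns `None` for it, while `vminps` accepts the same operand directly.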

fitzgen

LGTM!

fitzgen · 2023-02-16T16:37:09Z

afonso360

LGTM as well!

This is a follow-up to comments in bytecodealliance#5795 to remove some cruft in the x64 instruction model to ensure that the shape of an `Inst` reflects what's going to happen in regalloc and encoding. This accessor was used to handle `round*`, `pextr*`, and `pshufb` instructions. The `round*` ones had already moved to the appropriate `XmmUnary*` variant and `pshufb` was additionally moved over to that variant as well. The `pextr*` instructions got a new `Inst` variant and additionally had their constructors slightly modified to no longer require the type as input. The encoding for these instructions now automatically handles the various type-related operands through a new `SseOpcode::Pextrq` operand to represent 64-bit movements.

alexcrichton mentioned this pull request Feb 16, 2023

cranelift: Add VEX and EVEX memory encodings #4767

Closed

github-actions bot added the cranelift, cranelift:area:x64, and isle labels Feb 16, 2023

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 16, 2023

x64: Improve codegen for vectors with constant shift amounts

50afe70

I stumbled across this working on bytecodealliance#5795 and figured this was a nice opportunity to improve the codegen here.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 16, 2023

x64: Improve codegen for vectors with constant shift amounts

3749ef2

I stumbled across this working on bytecodealliance#5795 and figured this was a nice opportunity to improve the codegen here.

alexcrichton mentioned this pull request Feb 16, 2023

x64: Improve codegen for vectors with constant shift amounts #5797

Merged

afonso360 reviewed Feb 16, 2023

cranelift/codegen/src/isa/x64/inst.isle (resolved)

afonso360 reviewed Feb 16, 2023

cranelift/filetests/filetests/isa/x64/simd-bitwise-avx.clif (resolved)

afonso360 reviewed Feb 16, 2023

alexcrichton force-pushed the better-avx-supoprt branch from 85f61df to 4ca5131 Compare February 16, 2023 15:23

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 16, 2023

x64: Improve codegen for vectors with constant shift amounts

e73a99b

I stumbled across this working on bytecodealliance#5795 and figured this was a nice opportunity to improve the codegen here.

fitzgen approved these changes Feb 16, 2023

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 16, 2023

x64: Improve codegen for vectors with constant shift amounts

b5265ab

I stumbled across this working on bytecodealliance#5795 and figured this was a nice opportunity to improve the codegen here.

alexcrichton force-pushed the better-avx-supoprt branch from 4ca5131 to 9c33104 Compare February 16, 2023 20:01

alexcrichton added a commit that referenced this pull request Feb 16, 2023

x64: Improve codegen for vectors with constant shift amounts (#5797)

cae3b26

I stumbled across this working on #5795 and figured this was a nice opportunity to improve the codegen here.

Review feedback

7f38ec1

afonso360 approved these changes Feb 16, 2023

alexcrichton added this pull request to the merge queue Feb 17, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 17, 2023

alexcrichton added this pull request to the merge queue Feb 17, 2023

Merged via the queue into bytecodealliance:main with commit 453330b Feb 17, 2023

alexcrichton deleted the better-avx-supoprt branch February 17, 2023 02:12

afonso360 mentioned this pull request Feb 17, 2023

fuzzgen: Generate ISA flags #5816

Closed

alexcrichton mentioned this pull request Feb 20, 2023

x64: Enable load-coalescing for SSE/AVX instructions #5841

Merged

alexcrichton mentioned this pull request Feb 20, 2023

x64: Remove conditional SseOpcode::uses_src1 #5842

Merged
