Skip to content

x64: Add rudimentary support for some AVX instructions#5795

Merged
alexcrichton merged 2 commits intobytecodealliance:mainfrom
alexcrichton:better-avx-supoprt
Feb 17, 2023
Merged

x64: Add rudimentary support for some AVX instructions#5795
alexcrichton merged 2 commits intobytecodealliance:mainfrom
alexcrichton:better-avx-supoprt

Conversation

@alexcrichton
Copy link
Member

I was poking around Spidermonkey's wasm backend and saw that the various assembler functions used are all v*-prefixed which look like they're intended for use with AVX instructions. I looked at Cranelift and it currently doesn't have support for many AVX-based instructions, so I figured I'd take a crack at it!

The support added here is a bit of a mishmash when viewed alone, but my general goal was to take a single instruction from the SIMD proposal for WebAssembly and migrate all of its component instructions to AVX. I, by random chance, picked a pretty complicated instruction of f32x4.min. This wasm instruction is implemented on x64 with 4 unique SSE instructions and ended up being a pretty good candidate.

Further digging about AVX-vs-SSE shows that there should be two major benefits to using AVX over SSE:

  • Primarily AVX instructions largely use a three-operand form where two input registers are operated with and an output register is also specified. This is in contrast to SSE's predominant one-register-is-input-but-also-output pattern. This should help free up the register allocator a bit and additionally remove the need for movement between registers.

  • As cranelift: Add VEX and EVEX memory encodings #4767 notes the memory-based operations of VEX-encoded instructions (aka AVX instructions) do not have strict alignment requirements which means we would be able to sink loads and stores into individual instructions instead of having separate instructions.

So I set out on my journey to implement the instructions used by f32x4.min. The first few were fairly easy. The machinst backends are already of the shape "take these inputs and compute the output" where the x86 requirement of a register being both input and output is postprocessed in. This means that the inst.isle creation helpers for SSE instructions were already of the correct form to use AVX. I chose to add new rule branches for the instruction creation helpers, for example x64_andnps. The new rule conditionally only runs if AVX is enabled and emits an AVX instruction instead of an SSE instruction for achieving the same goal. This means that no lowerings of clif instructions were modified, instead just new instructions are being generated.

The VEX encoding was previously not heavily used in Cranelift. The only current user are the FMA-style instructions that Cranelift has at this time. These FMA instructions have one extra operand than vandnps, for example, so I split the existing XmmRmRVex into a few more variants to fit the shape of the instructions that needed generating for f32x4.min. This was accompanied then with more AVX opcode definitions, more emission support, etc.

Upon implementing all of this it turned out that the test suite was failing on my machine due to the memory-operand encodings of VEX instructions not being supported. I didn't explicitly add those in myself but some preexisting RIP-relative addressing was leaking into the new instructions with existing tests. I opted to go ahead and fill out the memory addressing modes of VEX encoding to get the tests passing again.

All-in-all this PR adds new instructions to the x64 backend for a number of AVX instructions, updates 5 existing instruction producers to use AVX instructions conditionally, implements VEX memory operands, and adds some simple tests for the new output of f32x4.min. The existing runtest for f32x.min caught a few intermediate bugs along the way and I additionally added a plain target x86_64 to that runtest to ensure that it executes with and without AVX to test the various lowerings. I'll also note that this, and future support, should be well-fuzzed through Wasmtime's fuzzing which may explicitly disable AVX support despite the machine having access to AVX, so non-AVX lowerings should be well-tested into the future.

It's also worth mentioning that I am not an AVX or VEX or x64 expert. Implementing the memory operand part for VEX was the hardest part of this PR and while I think it should be good someone else should definitely double-check me. Additionally I haven't added many instructions to the x64 backend yet so I may have missed obvious places to tests or such, so am happy to follow-up with anything to be more thorough if necessary.

Finally I should note that this is just the tip of the iceberg when it comes to AVX. My hope is to get some of the idioms sorted out to make it easier for future PRs to add one-off instruction lowerings or such.

@alexcrichton
Copy link
Member Author

Oh I should note that the total encoding size of f32x4.min was reduced from 45 to 35 bytes with this change. (less movdqa necessary)

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen isle Related to the ISLE domain-specific language labels Feb 16, 2023
@github-actions
Copy link

Subscribe to Label Action

cc @cfallin, @fitzgen

Details This issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 16, 2023
I stumbled across this working on bytecodealliance#5795 and figured this was a nice
opportunity to improve the codegen here.
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 16, 2023
I stumbled across this working on bytecodealliance#5795 and figured this was a nice
opportunity to improve the codegen here.
Copy link
Contributor

@afonso360 afonso360 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Awesome 🎉

I'm still looking through the VEX encoding bits, but it looks great!

@@ -3066,10 +3109,11 @@

;; Helper for creating `minps` instructions.
(decl x64_minps (Xmm Xmm) Xmm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can update this to XmmMem since minps requires alignment, but its unfortunate since it also prevents loads sinking into the AVX version.

Is there another way around it? I suspect we are going to run into more cases like this in the future.

My thoughts would be to do something like:

(decl load_me_maybe (XmmMem) Xmm)

(decl x64_minps (Xmm XmmMem) Xmm)
(rule 0 (x64_minps x y)
      (xmm_rm_r (SseOpcode.Minps) x (load_me_maybe y)))

But I'm not sure how good an idea that is, and we definitely don't need to do it in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea I had was that in the sinkable_load extractor a helper function is_mergeable_load is called which currently says false for all SIMD-based loads if they aren't flagged as aligned. I was thinking that if we have complete AVX support we might be able to make that function conditional where if AVX is enabled then it allows unaligned loads but if not it disallows unaligned loads. That's still somewhat brittle, though, so I don't honestly know the best way to handle this at this time.

Copy link
Contributor

@afonso360 afonso360 Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the above comments, we should probably do load sinking in a separate PR since it is a bit more advanced.

Edit: Oops, I was going to mark this as resolved at the same time you replied! I'm going to leave this open in case anyone else wants to comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's still somewhat brittle, though, so I don't honestly know the best way to handle this at this time.

We could pass the opcode to is_mergeable_load as well and have an allow list for AVX instructions that don't barf on unaligned loads.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea! We've already got a separate path for sinking loads for sse operations (put_in_xmm_mem) which would be a great place to add this information. The only downside I can see with changing the signature of that function is that we'd no longer be able to use it as an implicit conversion. I think that might actually be a great change though, as it would force us to be a bit more thoughtful about where we're allowing loads to be sunk for sse instructions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternate idea, for future consideration: an instruction that takes a R/M arg but requires alignment could encode that in the type, via e.g. an XmmMemAligned arg. (If we currently have a case where we use the same MInst enum variant for an alignment-required opcode and not, we should split into separate variants.) Then if we made a pass over all instructions to ensure we got the types right, we can be much more permissive with the automatic conversions everywhere, without having to worry about continued vigilance with explicit put_in_xmm_mem, etc.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 16, 2023
I stumbled across this working on bytecodealliance#5795 and figured this was a nice
opportunity to improve the codegen here.
Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -3066,10 +3109,11 @@

;; Helper for creating `minps` instructions.
(decl x64_minps (Xmm Xmm) Xmm)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's still somewhat brittle, though, so I don't honestly know the best way to handle this at this time.

We could pass the opcode to is_mergeable_load as well and have an allow list for AVX instructions that don't barf on unaligned loads.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 16, 2023
I stumbled across this working on bytecodealliance#5795 and figured this was a nice
opportunity to improve the codegen here.
I was poking around Spidermonkey's wasm backend and saw that the various
assembler functions used are all `v*`-prefixed which look like they're
intended for use with AVX instructions. I looked at Cranelift and it
currently doesn't have support for many AVX-based instructions, so I
figured I'd take a crack at it!

The support added here is a bit of a mishmash when viewed alone, but my
general goal was to take a single instruction from the SIMD proposal for
WebAssembly and migrate all of its component instructions to AVX. I, by
random chance, picked a pretty complicated instruction of `f32x4.min`.
This wasm instruction is implemented on x64 with 4 unique SSE
instructions and ended up being a pretty good candidate.

Further digging about AVX-vs-SSE shows that there should be two major
benefits to using AVX over SSE:

* Primarily AVX instructions largely use a three-operand form where two
  input registers are operated with and an output register is also
  specified. This is in contrast to SSE's predominant
  one-register-is-input-but-also-output pattern. This should help free
  up the register allocator a bit and additionally remove the need for
  movement between registers.

* As bytecodealliance#4767 notes the memory-based operations of VEX-encoded instructions
  (aka AVX instructions) do not have strict alignment requirements which
  means we would be able to sink loads and stores into individual
  instructions instead of having separate instructions.

So I set out on my journey to implement the instructions used by
`f32x4.min`. The first few were fairly easy. The machinst backends are
already of the shape "take these inputs and compute the output" where
the x86 requirement of a register being both input and output is
postprocessed in. This means that the `inst.isle` creation helpers for
SSE instructions were already of the correct form to use AVX. I chose to
add new `rule` branches for the instruction creation helpers, for
example `x64_andnps`. The new `rule` conditionally only runs if AVX is
enabled and emits an AVX instruction instead of an SSE instruction for
achieving the same goal. This means that no lowerings of clif
instructions were modified, instead just new instructions are being
generated.

The VEX encoding was previously not heavily used in Cranelift. The only
current user are the FMA-style instructions that Cranelift has at this
time. These FMA instructions have one extra operand than `vandnps`, for
example, so I split the existing `XmmRmRVex` into a few more variants to
fit the shape of the instructions that needed generating for
`f32x4.min`. This was accompanied then with more AVX opcode definitions,
more emission support, etc.

Upon implementing all of this it turned out that the test suite was
failing on my machine due to the memory-operand encodings of VEX
instructions not being supported. I didn't explicitly add those in
myself but some preexisting RIP-relative addressing was leaking into the
new instructions with existing tests. I opted to go ahead and fill out
the memory addressing modes of VEX encoding to get the tests passing
again.

All-in-all this PR adds new instructions to the x64 backend for a number
of AVX instructions, updates 5 existing instruction producers to use AVX
instructions conditionally, implements VEX memory operands, and adds
some simple tests for the new output of `f32x4.min`. The existing
runtest for `f32x.min` caught a few intermediate bugs along the way and
I additionally added a plain `target x86_64` to that runtest to ensure
that it executes with and without AVX to test the various lowerings.
I'll also note that this, and future support, should be well-fuzzed
through Wasmtime's fuzzing which may explicitly disable AVX support
despite the machine having access to AVX, so non-AVX lowerings should be
well-tested into the future.

It's also worth mentioning that I am not an AVX or VEX or x64 expert.
Implementing the memory operand part for VEX was the hardest part of
this PR and while I think it should be good someone else should
definitely double-check me. Additionally I haven't added many
instructions to the x64 backend yet so I may have missed obvious places
to tests or such, so am happy to follow-up with anything to be more
thorough if necessary.

Finally I should note that this is just the tip of the iceberg when it
comes to AVX. My hope is to get some of the idioms sorted out to make it
easier for future PRs to add one-off instruction lowerings or such.
alexcrichton added a commit that referenced this pull request Feb 16, 2023
I stumbled across this working on #5795 and figured this was a nice
opportunity to improve the codegen here.
Copy link
Contributor

@afonso360 afonso360 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well!

@alexcrichton alexcrichton added this pull request to the merge queue Feb 17, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 17, 2023
@alexcrichton alexcrichton added this pull request to the merge queue Feb 17, 2023
Merged via the queue into bytecodealliance:main with commit 453330b Feb 17, 2023
@alexcrichton alexcrichton deleted the better-avx-supoprt branch February 17, 2023 02:12
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 20, 2023
This is a follow-up to comments in bytecodealliance#5795 to remove some cruft in the x64
instruction model to ensure that the shape of an `Inst` reflects what's
going to happen in regalloc and encoding. This accessor was used to
handle `round*`, `pextr*`, and `pshufb` instructions. The `round*` ones
had already moved to the appropriate `XmmUnary*` variant and `pshufb`
was additionally moved over to that variant as well.

The `pextr*` instructions got a new `Inst` variant and additionally had
their constructors slightly modified to no longer require the type as
input. The encoding for these instructions now automatically handles the
various type-related operands through a new `SseOpcode::Pextrq` operand
to represent 64-bit movements.
cfallin pushed a commit that referenced this pull request Feb 21, 2023
This is a follow-up to comments in #5795 to remove some cruft in the x64
instruction model to ensure that the shape of an `Inst` reflects what's
going to happen in regalloc and encoding. This accessor was used to
handle `round*`, `pextr*`, and `pshufb` instructions. The `round*` ones
had already moved to the appropriate `XmmUnary*` variant and `pshufb`
was additionally moved over to that variant as well.

The `pextr*` instructions got a new `Inst` variant and additionally had
their constructors slightly modified to no longer require the type as
input. The encoding for these instructions now automatically handles the
various type-related operands through a new `SseOpcode::Pextrq` operand
to represent 64-bit movements.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants