Refactoring the ARM intrinsics to match API review and share code with x86 #25508
Conversation
Tagging @CarolEidt and @TamarChristinaArm in advance. This can't/shouldn't be merged until after master is updated to target .NET 5, but this is a larger PR, so it will take more time to review.
// /// int64x1_t vabs_s64 (int64x1_t a)
// ///   A64: ABS Dd, Dn
// /// </summary>
// public static Vector64<ulong> AbsScalar(Vector64<long> value) { throw new PlatformNotSupportedException(); }
There are native intrinsics that correspond to these and many other Vector64<double>, Vector64<long>, and Vector64<ulong> methods.
We should fix up the JIT to properly support these types so the methods can be exposed.
Hmm, so the JIT doesn't currently handle Vector64? Should I then also comment them out in my local branch with the new intrinsics?
Oh, I see, you probably only mean the x1_t types.
Right. Vector64 works for 7/10 of the primitive types right now. The JIT doesn't properly support the x1_t types today.
#ifdef FEATURE_HW_INTRINSICS
#include "hwintrinsic.h"

instruction CodeGen::getOpForHWIntrinsic(GenTreeHWIntrinsic* node, var_types instrType)
This logic was removed and replaced with new logic in hwintrinsiccodegenarm64.cpp, to mirror the xarch file.
src/jit/compiler.cpp
opts.setSupportedISA(InstructionSet_Vector128);
}

if (jitFlags.IsSet(JitFlags::JIT_FLAG_HAS_ARM64_ATOMICS) && JitConfig.EnableArm64Atomics())
All of these were pre-existing. I don't think we even support most of them today.
@@ -3,9 +3,44 @@
// See the LICENSE file in the project root for more information.
This file contains all the importation and HWIntrinsicInfo logic that can be shared between both architectures. There are probably a few more places code could be shared with some ifdefs, but I haven't looked where.
#include "hwintrinsiclistxarch.h"
#elif defined (_TARGET_ARM64_)
#define HARDWARE_INTRINSIC(isa, name, ival, size, numarg, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, category, flag) \
{NI_##isa##_##name, #name, InstructionSet_##isa, ival, size, numarg, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, category, static_cast<HWIntrinsicFlag>(flag)},
I changed ARM to use this because that is the pattern we were already following for naming things. It might be nice to also update x86 to do the same, as it makes the #defines in hwintrinsiclist*.h smaller, but I felt that was a separate PR.
//
// Return Value:
//    true if the node has an imm operand; otherwise, false
bool HWIntrinsicInfo::isImmOp(NamedIntrinsic id, const GenTree* op)
Some of these methods aren't used by ARM anywhere yet, but they should still be applicable in a few spots later down the line.
#include "hwintrinsicArm64.h"
#ifdef FEATURE_HW_INTRINSICS

enum HWIntrinsicCategory : unsigned int
Categories, flags, and the HWIntrinsicInfo layout are all the same.
case NI_Vector128_AsSingle:
case NI_Vector128_AsUInt16:
case NI_Vector128_AsUInt32:
case NI_Vector128_AsUInt64:
This logic (specifically for As and get_Count) could be shared, but I didn't determine a good way to do so yet and decided to leave it to a future PR.
// Arguments:
//    node - The hardware intrinsic node
//
void CodeGen::genHWIntrinsic(GenTreeHWIntrinsic* node)
A lot of this logic is actually very similar to the x86 logic; there might be more opportunities to share bits here as well, but that isn't part of this PR.
@@ -1,306 +0,0 @@
// Licensed to the .NET Foundation under one or more agreements.
The hwintrinsicxarch.h and hwintrinsicarm64.h files aren't needed anymore, as all of this is being shared now.
@@ -981,105 +981,193 @@ int LinearScan::BuildSIMD(GenTreeSIMD* simdTree)
//
int LinearScan::BuildHWIntrinsic(GenTreeHWIntrinsic* intrinsicTree)
There is, again, quite a bit of logic that is similar between the x86 and ARM paths in lsra and lowering (basically everything that isn't the individual intrinsic handling), so there might be a good opportunity to share code in the future.
CC @BruceForstall. This is the PR that updates most of the ARM infrastructure to be similar to the infrastructure we set up for x86/x64. The one drawback that I am aware of is what I called out in the original post:
It, however, should be a good starting point for anyone wanting to finish this up (since I had gotten pulled off onto other work and haven't been able to yet).
It seems that I never published these comments, so I'm doing so now.
src/jit/hwintrinsic.cpp
op2 = getArgForHWIntrinsic(argType, argClass);

#ifdef _TARGET_XARCH_
var_types op2Type = TYP_UNDEF;
It's unclear to me why this can't be merged with the code below that handles the gather intrinsics, rather than having two separate ifdef regions.
I don't see any reason either. I've merged them.
It looks like it was split because gtIndexBaseType for the retNode needs to be set based on the argClass of op2. However, you have to get op1 before the retNode can be created, and getting op1 mutates argClass.
I solved the issue by stashing the op2ArgClass in a local.
Rebased onto dotnet/master. Will work on addressing comments made so far.
Addressed feedback given so far. Will resolve any CI failures once the jobs finish.
@tannergooding you say above that:
But it seems that they've been added back? However, the tests have been deleted. I realize it's probably a bit of work to resurrect the old tests, and perhaps it's best to just move to auto-generated tests, but I'm uncomfortable having no test exposure.
No, the Abs/Add APIs are the only ones exposed in the commit right now. The other APIs that were exposed are still missing.
Right. At a minimum, I will be adding generated tests for the APIs that are supported by this PR.
@CarolEidt, I came across two issues:
@tannergooding I can help with this - I am not claiming to be an expert in JIT emitter internals, but I am willing to learn and understand how we add new instructions.
@echesakovMSFT Sounds good to me. Feel free to ping me internally and we can set up some time to go over this :) Much of this (namely
@dotnet/jit-contrib
This one is blocked on #26895 going in first, and then requires some cleanup work and tests. I would be comfortable closing it until after the other changes go through if needed.
@@ -872,59 +822,19 @@ void Lowering::ContainCheckSIMD(GenTreeSIMD* simdNode)
//
void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)
@CarolEidt, @echesakovMSFT
Just to confirm, there doesn't need to be containment for ARM64 except for immediates, correct?
That's my understanding, i.e. there are no Arm64 intrinsics that can take either a register or memory operand.
Indeed, not SIMD, but you do have other intrinsics such as the prefetch intrinsics (pld) that do take a memory operand. But that's probably out of scope for now, I'd imagine.
Definitely out of scope for the PR; likely not outside the scope of the total work. We have Sse.Prefetch0/1/2 and Sse.PrefetchNonTemporal on the x86 side already.
My point was that we only need to worry about the containment question when the same intrinsic can take either a memory or a register operand, which would not be the case for a prefetch.
At least locally, vector tests are passing. Looks like I need to define an
Edit: Nevermind, looks like I just need to ensure
HARDWARE_INTRINSIC(AdvSimd, Abs, -1, -1, 1, {INS_invalid, INS_abs, INS_invalid, INS_abs, INS_invalid, INS_abs, INS_invalid, INS_invalid, INS_fabs, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoContainment|HW_Flag_UnfixedSIMDSize)
HARDWARE_INTRINSIC(AdvSimd, AbsScalar, -1, 8, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_fabs, INS_fabs}, HW_Category_SIMDScalar, HW_Flag_NoContainment)
HARDWARE_INTRINSIC(AdvSimd, Add, -1, -1, 2, {INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_fadd, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoContainment|HW_Flag_Commutative|HW_Flag_UnfixedSIMDSize)
HARDWARE_INTRINSIC(AdvSimd, AddScalar, -1, 8, 2, {INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_fadd, INS_fadd}, HW_Category_SIMDScalar, HW_Flag_NoContainment|HW_Flag_Commutative)
What are the semantics around the unused bits for a given vector register?
That is, on Arm64 the vector register is 128 bits in length. However, some encodings only access the first 64 bits (Vector64<T> operations) and some instructions only access the first element (scalar instructions only access the first T).
So, for say AddScalar, which takes in Vector64<float> left, Vector64<float> right and returns a Vector64<T>: the contents of result[0] = left[0] + right[0], but what are the contents of result[1], and what are the contents of elements 2 and 3 of the backing register (bits 64-96 and bits 96-128)?
Are they cleared to zero, preserved from the last value in the register, etc.?
The same question goes for Vector64<T> Add(Vector64<T> left, Vector64<T> right) (what are the contents of bits 64-96 and bits 96-128 of the backing register)?
I'm asking because I'm seeing, for:
left: (0.41178197, 0.0033083581)
right: (0.11088856, 0.89114616)
you get:
result: (0.5226705, 0)
and I want to find out if this is deterministically 0 or if it is the last value contained in those bits.
The default is that, unless otherwise specified by the instruction (such as inserts or high operations), the unused bits of a register are always cleared to 0 on writes.
So, for say AddScalar which takes in Vector64 left, Vector64 right and returns a Vector64; the contents of result[0] = left[0] + right[0], but what is the contents of result[1] and what is the result of indice 3 and 4 of the backing register (bits 64-96 and bits 96-128)?
They're all cleared to 0 in this case. They wouldn't be for an insert (e.g. INS), a structure load into one lane (e.g. LD3...[<lane>]), or for some of the instructions which have a second part (those generally end with a 2 in the name, such as PMULL2).
The default is unless otherwise specified by the instruction (such is inserts, or high operations) the unused bits of a register are always cleared to 0 on writes.
Awesome, thanks!
This is the same behavior as for x86 when looking at 128-bit vs 256-bit registers (just with 64-bit vs 128-bit on ARM). However, it differs from x86 when looking at scalar vs vector operations (x86 preserves the upper bits through bit 128 and clears bits 128-256, whereas ARM just clears all upper bits).
So I just wanted to get this validated in my head. It would likely also be something worth noting when dealing with operations like Vector128.GetLower() and Vector64.ToVector128(), and when performing operations in general.
@CarolEidt, just for a sanity check: is the register allocator aware of these parts (scalar vs Vector64 vs Vector128)? Is there anything we need to set to tell it when an operation will preserve vs zero the "upper bits"?
That raises a very interesting point. The register allocator only understands fully-overwritten or RMW operands. That is, an operation that partially preserves the value of a target should have that as both a source and a target, and that source should be marked as "delayFree". It is then up to the code generator to copy it if the RMW src and dst aren't allocated the same register. I suspect there are some issues here in what we're doing today.
Going to mark this as "ready for review" now, as all tests are passing locally (on my Windows on Arm64 device). There are still a few methods which need tests (namely ArmBase), but I'd like to add those in a follow-up PR if possible, as that will unblock this PR from being merged and unblock the work being done by @TamarChristinaArm and @echesakovMSFT
LGTM (still/again)
code |= insEncodeReg_Rn(id->idReg2());  // nnnnn
code |= insEncodeVLSElemsize(elemsize); // ss
code |= 0x5000;                         // xx.x - We only support the one register variant right now
code |= insEncodeReg_Rm(id->idReg3());  // mmmmm
Actually, it looks like we might not need these as the instructions that load to multiple vector registers have different names (e.g. ld1, ld2, ld3, ld4).
I think we will need them to support ld1 with multiple vector registers - there are four variant of this instruction.
I've logged https://github.com/dotnet/coreclr/issues/27139 (and assigned it to myself) to track adding the tests for the ArmBase APIs. I have already started working on these and should hopefully have a PR up not terribly long after this one is merged 😄
This updates the ARM64 intrinsics to match the proposed layout from: https://github.com/dotnet/corefx/issues/37199.
This also updates the ARM64 intrinsics to share much of the importation logic and various data structures that were already created for x86.
Currently, this also removes many of the APIs that were exposed as part of the Arm.AdvSimd class, but I am working on updating those to match the above proposal as well. I don't think merging this should be blocked on that, but since this can't be merged until after master starts targeting .NET 5, I will try to get it completed before then.