Refactoring the ARM intrinsics to match API review and share code with x86 #25508
Conversation
Tagging @CarolEidt and @TamarChristinaArm in advance. This can't/shouldn't be merged until after master is updated to target .NET 5, but this is a larger PR, so it will take more time to review.
// /// int64x1_t vabs_s64 (int64x1_t a)
// ///   A64: ABS Dd, Dn
// /// </summary>
// public static Vector64<ulong> AbsScalar(Vector64<long> value) { throw new PlatformNotSupportedException(); }
There are native intrinsics that correspond to these and many other Vector64<double>, Vector64<long>, and Vector64<ulong> methods.
We should fix up the JIT to properly support these types so the methods can be exposed.
Hmm, so the JIT doesn't currently handle Vector64? Should I then also comment them out in my local branch with the new intrinsics?
Oh, I see, you probably only mean the x1_t types.
Right. Vector64 works for 7/10 of the primitive types right now. The JIT doesn't properly support the x1_t types today.
#ifdef FEATURE_HW_INTRINSICS
#include "hwintrinsic.h"

instruction CodeGen::getOpForHWIntrinsic(GenTreeHWIntrinsic* node, var_types instrType)
This logic was removed and replaced with new logic in hwintrinsiccodegenarm64.cpp, to mirror the xarch file.
src/jit/compiler.cpp
opts.setSupportedISA(InstructionSet_Vector128);
}

if (jitFlags.IsSet(JitFlags::JIT_FLAG_HAS_ARM64_ATOMICS) && JitConfig.EnableArm64Atomics())
All of these were pre-existing. I don't think we even support most of them today.
@@ -3,9 +3,44 @@
// See the LICENSE file in the project root for more information.
This file contains all the importation and HWIntrinsicInfo logic that can be shared between both architectures. There are probably a few more places code could be shared with some ifdefs, but I haven't looked where.
#include "hwintrinsiclistxarch.h"
#elif defined (_TARGET_ARM64_)
#define HARDWARE_INTRINSIC(isa, name, ival, size, numarg, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, category, flag) \
{NI_##isa##_##name, #name, InstructionSet_##isa, ival, size, numarg, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, category, static_cast<HWIntrinsicFlag>(flag)},
I changed ARM to use this because that is the pattern we were already following for naming things. It might be nice to also update x86 to do the same, as it makes the #defines in hwintrinsiclist*.h smaller, but I felt that was a separate PR.
//
// Return Value:
//    true if the node has an imm operand; otherwise, false
bool HWIntrinsicInfo::isImmOp(NamedIntrinsic id, const GenTree* op)
Some of these methods aren't used by ARM anywhere yet, but they should still be applicable in a few spots later down the line.
#include "hwintrinsicArm64.h"
#ifdef FEATURE_HW_INTRINSICS

enum HWIntrinsicCategory : unsigned int
Categories, flags, and the HWIntrinsicInfo layout are all the same.
case NI_Vector128_AsSingle:
case NI_Vector128_AsUInt16:
case NI_Vector128_AsUInt32:
case NI_Vector128_AsUInt64:
This logic (specifically for As and get_Count) could be shared, but I didn't determine a good way to do so yet and decided to leave it to a future PR.
// Arguments:
//    node - The hardware intrinsic node
//
void CodeGen::genHWIntrinsic(GenTreeHWIntrinsic* node)
A lot of this logic is actually very similar to the x86 logic; there might be more opportunities to share bits here as well, but that isn't part of this PR.
@@ -1,306 +0,0 @@
// Licensed to the .NET Foundation under one or more agreements.
The hwintrinsicxarch.h and hwintrinsicarm64.h files aren't needed anymore, as all of this is being shared now.
@@ -981,105 +981,193 @@ int LinearScan::BuildSIMD(GenTreeSIMD* simdTree)
//
int LinearScan::BuildHWIntrinsic(GenTreeHWIntrinsic* intrinsicTree)
There is, again, quite a bit of logic that is similar between the x86 and ARM paths in lsra and lowering (basically everything that isn't the individual intrinsic handling), so there might be a good opportunity to share code in the future.
CC @BruceForstall. This is the PR that updates most of the ARM infrastructure to be similar to the infrastructure we set up for x86/x64. The one drawback that I am aware of is what I called out in the original post:
It, however, should be a good starting point for anyone wanting to finish this up (since I had gotten pulled off onto other work and haven't been able to yet).
It seems that I never published these comments, so I'm doing so now.
src/jit/hwintrinsic.cpp
op2 = getArgForHWIntrinsic(argType, argClass);

#ifdef _TARGET_XARCH_
var_types op2Type = TYP_UNDEF;
It's unclear to me why this can't be merged with the code below that handles the gather intrinsics, rather than having two separate ifdef regions.
I don't see any reason either. I've merged them.
It looks like it was split because gtIndexBaseType for the retNode needs to be set based on the argClass of op2. However, you have to get op1 before the retNode can be created, and getting op1 mutates argClass.
I solved the issue by stashing the op2ArgClass in a local.
Rebased onto dotnet/master. Will work on addressing comments made so far.
Addressed feedback given so far. Will resolve any CI failures once the jobs finish.
@tannergooding you say above that:
But it seems that they've been added back? However, the tests have been deleted. I realize it's probably a bit of work to resurrect the old tests, and perhaps it's best to just move to auto-generated tests, but I'm uncomfortable having no test exposure.
No, the Abs/Add APIs are the only ones exposed in the commit right now. The other APIs that were exposed are still missing.
Right. At a minimum, I will be adding generated tests for the APIs that are supported by this PR.
@CarolEidt, I came across two issues:
@tannergooding I can help with this - I am not claiming to be an expert in JIT emitter internals, but I am willing to learn and understand how we add new instructions.
@echesakovMSFT Sounds good to me. Feel free to ping me internally and we can set up some time to go over this :) Much of this (namely
@dotnet/jit-contrib
This one is blocked on #26895 going in first, and then requires some cleanup work and tests. I would be comfortable closing it until after the other changes go through if needed.
@@ -872,59 +822,19 @@ void Lowering::ContainCheckSIMD(GenTreeSIMD* simdNode)
//
void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)
@CarolEidt, @echesakovMSFT
Just to confirm, there doesn't need to be containment for ARM64 except for immediates, correct?
That's my understanding, i.e. there are no Arm64 intrinsics that can take either a register or memory operand.
Indeed, not SIMD, but you do have other intrinsics such as the prefetch intrinsics (pld) that do take a memory operand. But that's probably out of scope for now, I'd imagine.
Definitely out of scope for the PR; likely not outside the scope of the total work. We have Sse.Prefetch0/1/2 and Sse.PrefetchNonTemporal on the x86 side already.
My point was that we only need to worry about the containment question when the same intrinsic can take either a memory or a register operand, which would not be the case for a prefetch.
At least locally, vector tests are passing. Looks like I need to define an
Edit: Nevermind, looks like I just need to ensure
HARDWARE_INTRINSIC(AdvSimd, Abs, -1, -1, 1, {INS_invalid, INS_abs, INS_invalid, INS_abs, INS_invalid, INS_abs, INS_invalid, INS_invalid, INS_fabs, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoContainment|HW_Flag_UnfixedSIMDSize)
HARDWARE_INTRINSIC(AdvSimd, AbsScalar, -1, 8, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_fabs, INS_fabs}, HW_Category_SIMDScalar, HW_Flag_NoContainment)
HARDWARE_INTRINSIC(AdvSimd, Add, -1, -1, 2, {INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_fadd, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoContainment|HW_Flag_Commutative|HW_Flag_UnfixedSIMDSize)
HARDWARE_INTRINSIC(AdvSimd, AddScalar, -1, 8, 2, {INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_fadd, INS_fadd}, HW_Category_SIMDScalar, HW_Flag_NoContainment|HW_Flag_Commutative)
What are the semantics around the unused bits for a given vector register?
That is, on Arm64 the vector register is 128 bits in length. However, some encodings only access the first 64 bits (Vector64<T> operations) and some instructions only access the first element (scalar instructions only access the first T).
So, for say AddScalar, which takes in Vector64<float> left, Vector64<float> right and returns a Vector64<T>: the contents of result[0] = left[0] + right[0], but what are the contents of result[1], and what are the contents of elements 2 and 3 of the backing register (bits 64-96 and bits 96-128)?
Are they cleared to zero, preserved from the last value in the register, etc.?
The same question goes for Vector64<T> Add(Vector64<T> left, Vector64<T> right) (what are the contents of bits 64-96 and bits 96-128 of the backing register)?
I'm asking because I'm seeing, for:
left: (0.41178197, 0.0033083581)
right: (0.11088856, 0.89114616)
you get:
result: (0.5226705, 0)
and I want to find out if this is deterministically 0 or if it is the last value contained in those bits.
The default is that, unless otherwise specified by the instruction (such as inserts or high operations), the unused bits of a register are always cleared to 0 on writes.
So, for say AddScalar which takes in Vector64 left, Vector64 right and returns a Vector64; the contents of result[0] = left[0] + right[0], but what is the contents of result[1] and what is the result of indice 3 and 4 of the backing register (bits 64-96 and bits 96-128)?
They're all cleared to 0 in this case. They wouldn't be for an insert (e.g. INS), a structure load into one lane (e.g. LD3...[<lane>]), or for some of the instructions which have a second part (those generally end with a 2 in the name, such as PMULL2).
The default is unless otherwise specified by the instruction (such is inserts, or high operations) the unused bits of a register are always cleared to 0 on writes.
Awesome, thanks!
This is the same behavior as for x86 when looking at 128-bit vs 256-bit registers (just with 64-bit vs 128-bit on ARM). However, it differs from x86 when looking at scalar vs vector operations (x86 preserves the upper bits through bit 128 and clears bits 128-256, whereas ARM just clears all upper bits).
So I just wanted to get this validated in my head. It would likely also be something worth noting when dealing with operations like Vector128.GetLower() and Vector64.ToVector128(), and when performing operations in general.
@CarolEidt, just for a sanity check: is the register allocator aware of these parts (scalar vs Vector64 vs Vector128)? Is there anything we need to set to tell it when an operation will preserve vs zero the "upper bits"?
That raises a very interesting point. The register allocator only understands fully-overwritten or RMW operands. That is, an operation that partially preserves the value of a target should have that as both a source and a target, and that source should be marked as "delayFree". It is then up to the code generator to copy it if the RMW src and dst aren't allocated the same register. I suspect there are some issues here in what we're doing today.
Going to mark this as "ready for review" now, as all tests are passing locally (on my Windows on Arm64 device). There are still a few methods which need tests (namely ArmBase), but I'd like to add those in a follow-up PR if possible, as that will unblock this PR from being merged and unblock the work being done by @TamarChristinaArm and @echesakovMSFT
LGTM (still/again)
code |= insEncodeReg_Rn(id->idReg2());  // nnnnn
code |= insEncodeVLSElemsize(elemsize); // ss
code |= 0x5000;                         // xx.x - We only support the one register variant right now
code |= insEncodeReg_Rm(id->idReg3());  // mmmmm
Actually, it looks like we might not need these as the instructions that load to multiple vector registers have different names (e.g. ld1, ld2, ld3, ld4).
I think we will need them to support ld1 with multiple vector registers - there are four variant of this instruction.
I've logged https://github.com/dotnet/coreclr/issues/27139 (and assigned it to myself) to track adding the tests for the ArmBase APIs. I have already started working on these and should hopefully have a PR up not terribly long after this one is merged 😄
This updates the ARM64 intrinsics to match the proposed layout from: https://github.com/dotnet/corefx/issues/37199.
This also updates the ARM64 intrinsics to share much of the importation logic and various data structures that were already created for x86.
Currently, this also removes many of the APIs that were exposed as part of the Arm.AdvSimd class, but I am working on updating those to match the above proposal as well. I don't think merging this should be blocked on that, but since this can't be merged until after master starts targeting .NET 5, I will try to get it completed before then.