This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Refactoring the ARM intrinsics to match API review and share code with x86 #25508

Merged
merged 30 commits into from
Oct 11, 2019

Conversation

tannergooding
Member

This updates the ARM64 intrinsics to match the proposed layout from: https://github.com/dotnet/corefx/issues/37199.

This also updates the ARM64 intrinsics to share much of the importation logic and various data structures that were already created for x86.

Currently, this also removes many of the APIs that were exposed as part of the Arm.AdvSimd class, but I am working on updating those to match the above proposal as well.
-- I don't think merging this should be blocked on that, but since this can't be merged until after master starts targeting .NET 5, I will try to get it completed before then.

@tannergooding
Member Author

Tagging @CarolEidt and @TamarChristinaArm in advance.

This can't/shouldn't be merged until after master is updated to target .NET 5, but this is a larger PR, so it will take more time to review.

// /// int64x1_t vabs_s64 (int64x1_t a)
// /// A64: ABS Dd, Dn
// /// </summary>
// public static Vector64<ulong> AbsScalar(Vector64<long> value) { throw new PlatformNotSupportedException(); }
Member Author

There are native intrinsics that correspond to these and many other Vector64<double>, Vector64<long>, and Vector64<ulong> methods.

We should fixup the JIT to properly support these types so the methods can be exposed.


Hmm, so the JIT doesn't currently handle Vector64? Should I then also comment these out in my local branch with the new intrinsics?


Oh, I see, you probably only mean the x1_t types.

Member Author

Right. Vector64 works for 7 of the 10 primitive types right now. The JIT doesn't properly support the x1_t types today.

#ifdef FEATURE_HW_INTRINSICS
#include "hwintrinsic.h"

instruction CodeGen::getOpForHWIntrinsic(GenTreeHWIntrinsic* node, var_types instrType)
Member Author

This logic was removed and replaced with new logic in hwintrinsiccodegenarm64.cpp, to mirror the xarch file.

opts.setSupportedISA(InstructionSet_Vector128);
}

if(jitFlags.IsSet(JitFlags::JIT_FLAG_HAS_ARM64_ATOMICS) && JitConfig.EnableArm64Atomics())
Member Author

All of these were pre-existing. I don't think we even support most of them today.

@@ -3,9 +3,44 @@
// See the LICENSE file in the project root for more information.
Member Author

This file contains all the importation and HWIntrinsicInfo logic that can be shared between both architectures.

There are probably a few more places where code could be shared with some ifdefs, but I haven't looked into them yet.

#include "hwintrinsiclistxarch.h"
#elif defined (_TARGET_ARM64_)
#define HARDWARE_INTRINSIC(isa, name, ival, size, numarg, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, category, flag) \
{NI_##isa##_##name, #name, InstructionSet_##isa, ival, size, numarg, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, category, static_cast<HWIntrinsicFlag>(flag)},
Member Author

I changed ARM to use this because that is the pattern we were already following for naming things. It might be nice to also update x86 to do the same, as it makes the #defines in hwintrinsiclist*.h smaller, but I felt that belonged in a separate PR.

//
// Return Value:
// true if the node has an imm operand; otherwise, false
bool HWIntrinsicInfo::isImmOp(NamedIntrinsic id, const GenTree* op)
Member Author

@tannergooding tannergooding Jun 30, 2019

Some of these methods aren't used by ARM anywhere yet, but they should still be applicable in a few spots later down the line.

#include "hwintrinsicArm64.h"
#ifdef FEATURE_HW_INTRINSICS

enum HWIntrinsicCategory : unsigned int
Member Author

Categories, flags, and the HWIntrinsicInfo layout are all the same.

case NI_Vector128_AsSingle:
case NI_Vector128_AsUInt16:
case NI_Vector128_AsUInt32:
case NI_Vector128_AsUInt64:
Member Author

This logic (specifically for As and get_Count) could be shared, but I haven't determined a good way to do so yet, so I decided to leave it to a future PR.

// Arguments:
// node - The hardware intrinsic node
//
void CodeGen::genHWIntrinsic(GenTreeHWIntrinsic* node)
Member Author

A lot of this logic is actually very similar to the x86 logic; there might be more opportunities to share bits here as well, but that isn't part of this PR.

@@ -1,306 +0,0 @@
// Licensed to the .NET Foundation under one or more agreements.
Member Author

The hwintrinsicxarch.h and hwintrinsicarm64.h files aren't needed anymore, as all of this is being shared now.

@@ -981,105 +981,193 @@ int LinearScan::BuildSIMD(GenTreeSIMD* simdTree)
//
int LinearScan::BuildHWIntrinsic(GenTreeHWIntrinsic* intrinsicTree)
Member Author

There is, again, quite a bit of logic that is similar between the x86 and ARM paths in lsra and lowering (basically everything that isn't the individual intrinsic handling), so there might be a good opportunity to share code in the future.

@tannergooding tannergooding added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Jun 30, 2019
@tannergooding tannergooding added this to the Future milestone Jun 30, 2019
@tannergooding
Member Author

CC. @BruceForstall.

This is the PR that updates most of the ARM infrastructure to be similar to the infrastructure we set up for x86/x64.

The one drawback that I am aware of is what I called out in the original post:

Currently, this also removes many of the APIs that were exposed as part of the Arm.AdvSimd class, but I am working on updating those to match the above proposal as well.

It should, however, be a good starting point for anyone wanting to finish this up (since I got pulled onto other work and haven't been able to yet).


@CarolEidt CarolEidt left a comment


It seems that I never published these comments, so I'm doing so now.

src/jit/hwintrinsic.cpp Outdated Show resolved Hide resolved
op2 = getArgForHWIntrinsic(argType, argClass);

#ifdef _TARGET_XARCH_
var_types op2Type = TYP_UNDEF;


It's unclear to me why this can't be merged with the code below that handles the gather intrinsics, rather than having two separate ifdef regions.

Member Author


I don't see any reason either. I've merged them.

Member Author

It looks like it was split because gtIndexBaseType for the retNode needs to be set based on the argClass of op2. However, you have to get op1 before the retNode can be created and getting op1 mutates argClass.

I solved the issue by stashing the op2ArgClass in a local.

src/jit/hwintrinsicarm64.cpp Outdated Show resolved Hide resolved
@tannergooding
Member Author

Rebased onto dotnet/master. Will work on addressing comments made so far.

@tannergooding
Member Author

Addressed feedback given so far. Will resolve any CI failures once the jobs finish.

@CarolEidt

@tannergooding you say above that:

Currently, this also removes many of the APIs that were exposed as part of the Arm.AdvSimd class

But it seems that they've been added back? However, the tests have been deleted. I realize it's probably a bit of work to resurrect the old tests, and perhaps it's best to just move to auto-generated tests, but I'm uncomfortable having no test exposure.

@tannergooding
Member Author

But it seems that they've been added back?

No, the Abs/Add APIs are the only ones exposed in the commit right now. The other APIs that were exposed are still missing.

I'm uncomfortable having no test exposure.

Right. At a minimum, I will be adding generated tests for the APIs that are supported by this PR.

@tannergooding
Member Author

@CarolEidt, I came across two issues:

  1. I can't add tests until after the reference assemblies are also updated (which requires something to be merged CoreCLR side first; at least until the repo merge).
  2. As best I can tell, we don't currently support the LD1 instruction for doing Vector64 and Vector128 loads from memory (which limits what we can test). This is less of a problem, but I need to sit down with someone and ensure I understand how to add new instructions for ARM (it is quite a bit different from x86; I think I've gotten 50-60% of the way there, but there are still some unclear parts).

@echesakov

... I need to sit down with someone and ensure I understand how to add new instructions for ARM (it is quite a bit different than x86; and I think I've gotten 50-60% of the way there, but there are some unclear parts still).

@tannergooding I can help with this. I am not claiming to be an expert in JIT emitter internals, but I am willing to learn and understand how we add new instructions.

@tannergooding
Member Author

@echesakovMSFT. Sounds good to me. Feel free to ping me internally and we can set up some time to go over this :)

Much of this (namely, that I can't add tests until after the reference assemblies are also updated) will also be simpler once the repo merge has been completed, as we won't have to worry about round-tripping things from CoreCLR to CoreFX and back.

@sandreenko sandreenko removed this from the Future milestone Sep 23, 2019
@sandreenko sandreenko added area-CodeGen and removed * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) labels Sep 23, 2019
@erozenfeld
Member

@dotnet/jit-contrib

@tannergooding
Member Author

This one is blocked on #26895 going in first, and then requires some cleanup work and tests. I would be comfortable closing it until after the other changes go through if needed.

@@ -872,59 +822,19 @@ void Lowering::ContainCheckSIMD(GenTreeSIMD* simdNode)
//
void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)
Member Author


@CarolEidt, @echesakovMSFT

Just to confirm, there doesn't need to be containment for ARM64 except for immediates, correct?


That's my understanding, i.e. there are no Arm64 intrinsics that can take either a register or memory operand.


Indeed, not for SIMD, but you do have other intrinsics, such as the prefetch intrinsics (pld), that do take a memory operand. But that's probably out of scope for now, I'd imagine.

Member Author

Definitely out of scope for the PR; likely not outside the scope of the total work. We have Sse.Prefetch0/1/2 and Sse.PrefetchNonTemporal on the x86 side already.


My point was that we only need to worry about the containment question when the same intrinsic can take either a memory or a register operand, which would not be the case for a prefetch.

@tannergooding
Member Author

tannergooding commented Oct 9, 2019

At least locally, vector tests are passing. Looks like I need to define an insOpt for scalar values on vector registers to support AbsScalar and AddScalar.

Edit: Nevermind, looks like I just need to ensure insOptsAnyArrangement returns false :)

HARDWARE_INTRINSIC(AdvSimd, Abs, -1, -1, 1, {INS_invalid, INS_abs, INS_invalid, INS_abs, INS_invalid, INS_abs, INS_invalid, INS_invalid, INS_fabs, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoContainment|HW_Flag_UnfixedSIMDSize)
HARDWARE_INTRINSIC(AdvSimd, AbsScalar, -1, 8, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_fabs, INS_fabs}, HW_Category_SIMDScalar, HW_Flag_NoContainment)
HARDWARE_INTRINSIC(AdvSimd, Add, -1, -1, 2, {INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_fadd, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoContainment|HW_Flag_Commutative|HW_Flag_UnfixedSIMDSize)
HARDWARE_INTRINSIC(AdvSimd, AddScalar, -1, 8, 2, {INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_fadd, INS_fadd}, HW_Category_SIMDScalar, HW_Flag_NoContainment|HW_Flag_Commutative)
Member Author

@TamarChristinaArm.

What are the semantics around the unused bits of a given vector register?

That is, on Arm64 the vector registers are 128 bits in length. However, some encodings only access the first 64 bits (Vector64<T> operations) and some instructions only access the first element (scalar instructions only access the first T).

So, for say AddScalar, which takes in Vector64<float> left, Vector64<float> right and returns a Vector64<T>: the contents of result[0] = left[0] + right[0], but what are the contents of result[1], and what about elements 2 and 3 of the backing register (bits 64-96 and bits 96-128)?

Are they cleared to zero, do they preserve the last value in the register, etc.?

The same question goes for Vector64<T> Add(Vector64<T> left, Vector64<T> right): what are the contents of bits 64-96 and bits 96-128 of the backing register?

Member Author

I'm asking because I'm seeing for:

 left: (0.41178197, 0.0033083581)
right: (0.11088856, 0.89114616)

You get:

result: (0.5226705, 0)

and I want to find out whether this is deterministically 0 or whether it is the last value contained in those bits.

@TamarChristinaArm TamarChristinaArm Oct 10, 2019

The default is that, unless otherwise specified by the instruction (such as inserts or high operations), the unused bits of a register are always cleared to 0 on writes.


So, for say AddScalar, which takes in Vector64<float> left, Vector64<float> right and returns a Vector64<T>: the contents of result[0] = left[0] + right[0], but what are the contents of result[1], and what about elements 2 and 3 of the backing register (bits 64-96 and bits 96-128)?

They're all cleared to 0 in this case. They wouldn't be for an insert (e.g. INS), a structure load into one lane (e.g. LD3...[<lane>]), or for some of the instructions which have a second part (those generally end with a 2 in the name, such as PMULL2).

Member Author

@tannergooding tannergooding Oct 10, 2019

The default is that, unless otherwise specified by the instruction (such as inserts or high operations), the unused bits of a register are always cleared to 0 on writes.

Awesome, thanks!

This is the same behavior as for x86 when looking at 128-bit vs 256-bit registers (just with 64-bit vs 128-bit on ARM). However, it differs from x86 when looking at scalar vs vector operations (x86 preserves the upper bits through bit 128 and clears bits 128-256, whereas ARM just clears all upper bits).

So I just wanted to get this validated in my head. It would likely also be something worth noting for when dealing with operations like Vector128.GetLower(), Vector64.ToVector128(), and when performing operations in general.

Member Author

@CarolEidt, just for a sanity check: is the register allocator aware of these distinctions (scalar vs Vector64 vs Vector128)? Is there anything we need to set to tell it when an operation will preserve vs zero the "upper bits"?


That raises a very interesting point. The register allocator only understands fully-overwritten or RMW operands. That is, an operation that partially preserves the value of a target should have that as both a source and a target, and that source should be marked as "delayFree". It is then up to the code generator to copy it if the RMW src and dst aren't allocated the same register. I suspect there are some issues here in what we're doing today.

@tannergooding
Member Author

Going to mark this as "ready for review" now as all tests are passing locally (on my Windows on Arm64 device).

There are still a few methods which need tests (namely ArmBase), but I'd like to add those in a follow-up PR if possible, as that will unblock this PR from being merged and unblock the work being done by @TamarChristinaArm and @echesakovMSFT.

@tannergooding tannergooding marked this pull request as ready for review October 10, 2019 23:55

@CarolEidt CarolEidt left a comment


LGTM (still/again)

code |= insEncodeReg_Rn(id->idReg2()); // nnnnn
code |= insEncodeVLSElemsize(elemsize); // ss
code |= 0x5000; // xx.x - We only support the one register variant right now
code |= insEncodeReg_Rm(id->idReg3()); // mmmmm


Actually, it looks like we might not need these as the instructions that load to multiple vector registers have different names (e.g. ld1, ld2, ld3, ld4).

I think we will need them to support ld1 with multiple vector registers - there are four variants of this instruction.

@tannergooding
Member Author

I've logged https://github.com/dotnet/coreclr/issues/27139 (and assigned it to myself) to track adding the tests for the ArmBase APIs.

I have already started working on these and should hopefully have a PR up not terribly long after this one is merged 😄
