Updating Vector256 to have its software fallback be 2x Vector128<T> ops #76221

Merged
16 commits merged into dotnet:main on Oct 3, 2022

Conversation

tannergooding
Member

This simplifies the overall logic we need to maintain, addresses a request raised by several community members, and will provide additional acceleration on platforms without native V256 support.

Notably, this does not mean Vector256.IsHardwareAccelerated will report true, nor does it guarantee that it's as fast as unrolling or other specialized logic you may write yourself.

I plan on applying the same pattern to Vector512 and Vector512<T>.
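
A minimal sketch, not the actual BCL source, of the fallback shape this PR describes: a Vector256 operation composed from two Vector128 operations over the lower and upper halves (AddFallback is an illustrative name):

```csharp
using System.Runtime.Intrinsics;

static Vector256<T> AddFallback<T>(Vector256<T> left, Vector256<T> right)
    where T : struct
{
    // Each 128-bit half is handled by the corresponding Vector128 operation,
    // then the two halves are recombined into a Vector256 result.
    Vector128<T> lower = Vector128.Add(left.GetLower(), right.GetLower());
    Vector128<T> upper = Vector128.Add(left.GetUpper(), right.GetUpper());
    return Vector256.Create(lower, upper);
}
```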

@ghost

ghost commented Sep 27, 2022

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.


Author: tannergooding
Assignees: -
Labels: area-System.Numerics
Milestone: -

@stephentoub
Member

What does this mean for how we write our code and guidance we provide? Does it mean we should be dropping our Vector256.IsHardwareAccelerated checks?

@tannergooding
Member Author

What does this mean for how we write our code and guidance we provide? Does it mean we should be dropping our Vector256.IsHardwareAccelerated checks?

I don't think we need to change the guidance and I don't believe we should drop any of our existing checks.

Overall, this is mostly about making the code simpler to test and maintain, especially as we look at adding Vector512<T> as well. We're removing almost 1000 lines of implementation, even after accounting for the fact that I added some missing XML docs, opted for "consistency" over making it even "simpler" (e.g. Vector256.Add calls 2x Vector128.Add rather than Vector256.operator +), and added several missing attributes; so the actual amount of "code" removed is over 1000 lines, and it could be "simplified" a bit more if desired. I expect that if I gave Vector128 the same treatment we'd save another 1000+ lines, with no perf benefit or detriment (outside the reflection case).

If we find a place where manual unrolling is beneficial, we might be able to (with care) utilize Vector256 on a Vector128.IsHardwareAccelerated code path to simplify the logic. We're already manually unrolling in a couple of places where we have input.Length >= Vector128<T>.Count * 2; a sketch of that shape is shown below.
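
A rough sketch of that manually unrolled shape (SumVectorized is an illustrative name, not BCL code), assuming a simple int-summing loop:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static int SumVectorized(ReadOnlySpan<int> input)
{
    int sum = 0;
    int i = 0;

    // Only requires Vector128 acceleration, but processes two Vector128 widths
    // per iteration by operating on Vector256.
    if (Vector128.IsHardwareAccelerated && input.Length >= Vector128<int>.Count * 2)
    {
        ref int start = ref MemoryMarshal.GetReference(input);
        Vector256<int> acc = Vector256<int>.Zero;

        for (; i <= input.Length - Vector256<int>.Count; i += Vector256<int>.Count)
        {
            acc += Vector256.LoadUnsafe(ref start, (nuint)i);
        }

        sum = Vector256.Sum(acc);
    }

    // Scalar tail for any remaining elements.
    for (; i < input.Length; i++)
    {
        sum += input[i];
    }

    return sum;
}
```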

However, we've also seen that this can hurt perf in a few cases and we're better off doing single dispatch. We've also seen cases where such checks might be beneficial for large scenarios but where we really want a 1x and a 2x path, so that smaller cases can still be vectorized.

Additionally, IsHardwareAccelerated still exists to say that most operations are accelerated and correctly handled. For the case of Vector128.IsHardwareAccelerated == true && Vector256.IsHardwareAccelerated == false, there are APIs like Vector256.ExtractMostSignificantBits that would be better handled as 2x independent branches against the underlying Vector128.ExtractMostSignificantBits calls (rather than this approach, which does both calls and returns a combined result); see the sketch below. Likewise, there are cases like Vector256.Shuffle which can't be accelerated at all (at least not without JIT support, which is likely not beneficial to add without further justification).
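
For illustration, a sketch of the "2x independent branches" shape mentioned above (IndexOfFirstMatch256 is a hypothetical helper, not a real API), which can bail out after examining only the lower half:

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static int IndexOfFirstMatch256(Vector256<byte> equals)
{
    // Branch on the lower 128 bits first; the upper half is only touched if needed.
    uint lowerMask = equals.GetLower().ExtractMostSignificantBits();
    if (lowerMask != 0)
    {
        return BitOperations.TrailingZeroCount(lowerMask);
    }

    uint upperMask = equals.GetUpper().ExtractMostSignificantBits();
    if (upperMask != 0)
    {
        return Vector128<byte>.Count + BitOperations.TrailingZeroCount(upperMask);
    }

    return -1;
}
```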

@hopperpl

For what it's worth, in a project a few years back I used a triple-state to indicate whether some instructions were:

  • Hardware
  • Hardware Emulated
  • Software

This was especially the case if some instructions were not available directly but could be emulated using 2-3 other instructions, making the code still faster than a full software fallback.

Maybe there is some benefit in introducing Vector256.IsHardwareAcceleratedUsing128 to let the consumer decide whether falling back to a pure 128-bit version is more beneficial than using a JIT emulation of 256-bit instructions. Vector256.IsHardwareAccelerated would then change its meaning to either "accelerated but might be emulated" or "accelerated and emulation never performed".
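
A rough sketch of how a consumer might use the proposed API; IsHardwareAcceleratedUsing128 is hypothetical and does not exist in the BCL:

```csharp
using System.Runtime.Intrinsics;

if (Vector256.IsHardwareAccelerated)
{
    // Native 256-bit instructions.
}
else if (Vector256.IsHardwareAcceleratedUsing128) // hypothetical API, not in the BCL
{
    // Decide between the JIT's 2x Vector128 emulation and a hand-written 128-bit path.
}
else
{
    // Scalar fallback.
}
```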

@filipnavara
Member

filipnavara commented Sep 27, 2022

Maybe there is some benefit in introducing Vector256.IsHardwareAcceleratedUsing128 to let the consumer decide whether falling back to a pure 128-bit version is more beneficial than using a JIT emulation of 256-bit instructions. Vector256.IsHardwareAccelerated would then change its meaning to either "accelerated but might be emulated" or "accelerated and emulation never performed".

I don't think it's necessary. You can just check Vector256.IsHardwareAccelerated || Vector128.IsHardwareAccelerated as long as you are on .NET 8 with this PR. A new API would not be available on older .NET versions anyway, so there would not be any advantage.
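
A minimal sketch of that check, assuming the behavior in this PR where the Vector256 fallback expands to pairs of Vector128 operations:

```csharp
using System.Runtime.Intrinsics;

if (Vector256.IsHardwareAccelerated || Vector128.IsHardwareAccelerated)
{
    // 256-bit path: native 256-bit instructions where available,
    // otherwise pairs of 128-bit instructions via the new fallback.
}
else
{
    // Scalar path.
}
```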

@tannergooding
Member Author

@fanyang-mono, looks like a number of failures are because nint (IntPtr) and nuint (UIntPtr) aren't supported.

Is this a trivial fix on the mono side? Maybe just a place where the relevant simdBaseType is too restrictive?

@tannergooding
Member Author

Might've figured it out; not sure if it's the right style or the preferred way to resolve the issue, however.

@vargaz
Contributor

vargaz commented Sep 28, 2022

The mono changes look ok to me.

@dakersnar
Contributor

dakersnar left a comment


I don't understand 100% of the implementation details, but this looks good to me, at least conceptually.

Overall, I have three questions:

  • How often are the software fallbacks of these APIs used?
  • This is a great way to reuse code and simplify our implementation, but are we confident that this isn't a significant perf hit?
  • Are all of these paths thoroughly unit tested already?

@tannergooding
Member Author

How often are the software fallbacks of these APIs used?

For x64, they'll be used when Avx2.IsSupported is false -or- when indirectly invoking these APIs, such as via reflection. For Arm64, they'll always be used. They may also get used on platforms that have no SIMD acceleration.

This is a great way to reuse code and simplify our implementation, but are we confident that this isn't a significant perf hit?

It may be a minor perf hit for platforms with no SIMD acceleration. However, such platforms are already in a "worst case scenario" if they are using these APIs.

For a platform like Arm64, which has Vector128 acceleration but not Vector256 acceleration, it will likely be a perf win as we'll typically execute 2 SIMD operations rather than executing a loop that iterates Count times.

The same should be true for x64 when Avx2.IsSupported is false. When it is true, none of these implementations matter outside indirect invocation (like reflection) which is already very expensive. Instead, the JIT replaces the implementations with better optimized code, often single SIMD instructions.
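
For illustration, one way such an indirect invocation can happen: calling Vector256.Add through reflection, which dispatches to the managed implementation rather than the JIT's intrinsic expansion (a sketch, not taken from the BCL or its tests):

```csharp
using System;
using System.Reflection;
using System.Runtime.Intrinsics;

// Resolve the open generic Vector256.Add<T> and close it over float.
MethodInfo add = typeof(Vector256)
    .GetMethod(nameof(Vector256.Add))!
    .MakeGenericMethod(typeof(float));

// Reflection boxes the arguments and invokes the managed implementation.
var result = (Vector256<float>)add.Invoke(
    null, new object[] { Vector256.Create(1.0f), Vector256.Create(2.0f) })!;

Console.WriteLine(result);
```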

Are all of these paths thoroughly unit tested already?

Yes

@tannergooding
Member Author

@dotnet/jit-contrib this has a small 8 line change in morph.cpp that needs review.

It's changing an assert to account for the fact that Unsafe.As<StructWithReferenceField, HfaStruct>(ref value) would not have matching GC types.

In the case of how it's used in the BCL, it's only in dead code that's under a RuntimeHelpers.IsReferenceOrContainsReferences<T>() check. However, this "unsafe" code could be encountered in "live" code for a customer and is arguably no different (functionally) from a reinterpret cast that reads the HfaStruct directly from a given memory address (reference). It may cause a GC hole or other problems due to UB, but the actual IL is in itself "valid".
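
A hedged sketch of the pattern being described; HfaStruct and the guard shape are illustrative, and the only point is that the Unsafe.As reinterpretation sits in code that is dead whenever T actually contains references:

```csharp
using System;
using System.Runtime.CompilerServices;

struct HfaStruct { public float A, B, C, D; }

static HfaStruct ReinterpretAsHfa<T>(ref T value)
{
    if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
    {
        // For such a T the reinterpreting read below is never reached.
        throw new NotSupportedException();
    }

    // Functionally a reinterpret cast; the IL is valid even though misuse
    // (e.g. on a struct with reference fields) would be UB or a GC hole.
    return Unsafe.As<T, HfaStruct>(ref value);
}
```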

@tannergooding
Member Author

@dotnet/jit-contrib this has a small 8 line change in morph.cpp that needs review.

@tannergooding
Member Author

The superpmi-diffs failure is #76542; the superpmi-replay failure is #76511.

@tannergooding tannergooding merged commit b3fdac7 into dotnet:main Oct 3, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2022
@tannergooding tannergooding deleted the vector-improvements-4 branch November 11, 2022 15:08