Updating Vector256 to have its software fallback be 2x Vector128<T> ops #76221

Merged
16 commits merged into dotnet:main on Oct 3, 2022

Conversation

tannergooding
Member

This simplifies the overall logic we need to maintain, addresses a request raised by several community members, and will provide additional acceleration on platforms without native V256 support.

Notably, this does not mean Vector256.IsHardwareAccelerated will report true, nor does it guarantee that it's as fast as unrolling or other specialized logic you may write yourself.

I plan on applying the same pattern to Vector512 and Vector512<T>.
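
A minimal sketch, not the actual BCL source, of the fallback shape this PR describes: a Vector256 operation composed from two Vector128 operations over the lower and upper halves (AddFallback is an illustrative name):

```csharp
using System.Runtime.Intrinsics;

static Vector256<T> AddFallback<T>(Vector256<T> left, Vector256<T> right)
    where T : struct
{
    // Each 128-bit half is handled by the corresponding Vector128 operation,
    // then the two halves are recombined into a Vector256 result.
    Vector128<T> lower = Vector128.Add(left.GetLower(), right.GetLower());
    Vector128<T> upper = Vector128.Add(left.GetUpper(), right.GetUpper());
    return Vector256.Create(lower, upper);
}
```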

@ghost

ghost commented Sep 27, 2022

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.


Author: tannergooding
Assignees: -
Labels: area-System.Numerics
Milestone: -

@stephentoub
Member

What does this mean for how we write our code and guidance we provide? Does it mean we should be dropping our Vector256.IsHardwareAccelerated checks?

@tannergooding
Member Author

What does this mean for how we write our code and guidance we provide? Does it mean we should be dropping our Vector256.IsHardwareAccelerated checks?

I don't think we need to change the guidance and I don't believe we should drop any of our existing checks.

Overall, this is mostly about making the code simpler to test and maintain, especially as we look at adding Vector512<T> as well. We're removing almost 1000 lines of implementation, even after accounting for the fact that I added some missing XML docs, opted for "consistency" over making it even "simpler" (e.g. Vector256.Add calls 2x Vector128.Add rather than Vector256.operator +), and added several missing attributes; so the actual amount of "code" removed is over 1000 lines, and it could be "simplified" a bit more if desired. I expect that if I gave Vector128 the same treatment we'd save another 1000+ lines, with no perf benefit or detriment (outside the reflection case).

If we find a place where manual unrolling is beneficial, we might be able to (with care) utilize Vector256 on a Vector128.IsHardwareAccelerated code path to simplify the logic. We're already manually unrolling in a couple of places where we have input.Length >= Vector128<T>.Count * 2; a sketch of that shape is shown below.
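
A rough sketch of that manually unrolled shape (SumVectorized is an illustrative name, not BCL code), assuming a simple int-summing loop:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static int SumVectorized(ReadOnlySpan<int> input)
{
    int sum = 0;
    int i = 0;

    // Only requires Vector128 acceleration, but processes two Vector128 widths
    // per iteration by operating on Vector256.
    if (Vector128.IsHardwareAccelerated && input.Length >= Vector128<int>.Count * 2)
    {
        ref int start = ref MemoryMarshal.GetReference(input);
        Vector256<int> acc = Vector256<int>.Zero;

        for (; i <= input.Length - Vector256<int>.Count; i += Vector256<int>.Count)
        {
            acc += Vector256.LoadUnsafe(ref start, (nuint)i);
        }

        sum = Vector256.Sum(acc);
    }

    // Scalar tail for any remaining elements.
    for (; i < input.Length; i++)
    {
        sum += input[i];
    }

    return sum;
}
```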

However, we've also seen that this can hurt perf in a few cases and we're better off doing single dispatch. We've also seen cases where such checks might be beneficial for large scenarios but where we really want a 1x and a 2x path, so that smaller cases can still be vectorized.

Additionally, IsHardwareAccelerated still exists to say that most operations are accelerated and correctly handled. For the case of Vector128.IsHardwareAccelerated == true && Vector256.IsHardwareAccelerated == false, there are APIs like Vector256.ExtractMostSignificantBits that would be better handled as 2x independent branches against the underlying Vector128.ExtractMostSignificantBits calls (rather than this approach, which does both calls and returns a combined result); see the sketch below. Likewise, there are cases like Vector256.Shuffle which can't be accelerated at all (at least not without JIT support, which is likely not beneficial to add without further justification).
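
For illustration, a sketch of the "2x independent branches" shape mentioned above (IndexOfFirstMatch256 is a hypothetical helper, not a real API), which can bail out after examining only the lower half:

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static int IndexOfFirstMatch256(Vector256<byte> equals)
{
    // Branch on the lower 128 bits first; the upper half is only touched if needed.
    uint lowerMask = equals.GetLower().ExtractMostSignificantBits();
    if (lowerMask != 0)
    {
        return BitOperations.TrailingZeroCount(lowerMask);
    }

    uint upperMask = equals.GetUpper().ExtractMostSignificantBits();
    if (upperMask != 0)
    {
        return Vector128<byte>.Count + BitOperations.TrailingZeroCount(upperMask);
    }

    return -1;
}
```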

@hopperpl

For what it's worth, in a project a few years back I used a triple-state to indicate whether some instructions were:

  • Hardware
  • Hardware Emulated
  • Software

This was especially the case if some instructions were not available directly but could be emulated using 2-3 other instructions, making the code still faster than a full software fallback.

Maybe there is some benefit in introducing Vector256.IsHardwareAcceleratedUsing128 to let the consumer decide whether falling back to a pure 128-bit version is more beneficial than using a JIT emulation of 256-bit instructions. Vector256.IsHardwareAccelerated would then change its meaning to either "accelerated but might be emulated" or "accelerated and emulation never performed".
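
A rough sketch of how a consumer might use the proposed API; IsHardwareAcceleratedUsing128 is hypothetical and does not exist in the BCL:

```csharp
using System.Runtime.Intrinsics;

if (Vector256.IsHardwareAccelerated)
{
    // Native 256-bit instructions.
}
else if (Vector256.IsHardwareAcceleratedUsing128) // hypothetical API, not in the BCL
{
    // Decide between the JIT's 2x Vector128 emulation and a hand-written 128-bit path.
}
else
{
    // Scalar fallback.
}
```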

@filipnavara
Member

filipnavara commented Sep 27, 2022

Maybe there is some benefit in introducing Vector256.IsHardwareAcceleratedUsing128 to let the consumer decide whether falling back to a pure 128-bit version is more beneficial than using a JIT emulation of 256-bit instructions. Vector256.IsHardwareAccelerated would then change its meaning to either "accelerated but might be emulated" or "accelerated and emulation never performed".

I don't think it's necessary. You can just check Vector256.IsHardwareAccelerated || Vector128.IsHardwareAccelerated as long as you are on .NET 8 with this PR. A new API would not be available on older .NET versions anyway, so there would not be any advantage.
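
A minimal sketch of that check, assuming the behavior in this PR where the Vector256 fallback expands to pairs of Vector128 operations:

```csharp
using System.Runtime.Intrinsics;

if (Vector256.IsHardwareAccelerated || Vector128.IsHardwareAccelerated)
{
    // 256-bit path: native 256-bit instructions where available,
    // otherwise pairs of 128-bit instructions via the new fallback.
}
else
{
    // Scalar path.
}
```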

@tannergooding
Member Author

@fanyang-mono, looks like a number of failures are because nint (IntPtr) and nuint (UIntPtr) aren't supported.

Is this a trivial fix on the mono side? Maybe just a place where the relevant simdBaseType is too restrictive?

@tannergooding
Member Author

Might've figured it out; not sure if it's the right style or the preferred way to resolve the issue, however.

@vargaz
Contributor

vargaz commented Sep 28, 2022

The mono changes look ok to me.

@dakersnar
Contributor

dakersnar left a comment


I don't understand 100% of the implementation details, but this looks good to me, at least conceptually.

Overall, I have three questions:

  • How often are the software fallbacks of these APIs used?
  • This is a great way to reuse code and simplify our implementation, but are we confident that this isn't a significant perf hit?
  • Are all of these paths thoroughly unit tested already?

@tannergooding
Member Author

How often are the software fallbacks of these APIs used?

For x64, they'll be used when Avx2.IsSupported is false -or- when indirectly invoking these APIs, such as via reflection. For Arm64, they'll always be used. They may also get used on platforms that have no SIMD acceleration.

This is a great way to reuse code and simplify our implementation, but are we confident that this isn't a significant perf hit?

It may be a minor perf hit for platforms with no SIMD acceleration. However, such platforms are already in a "worst case scenario" if they are using these APIs.

For a platform like Arm64, which has Vector128 acceleration but not Vector256 acceleration, it will likely be a perf win as we'll typically execute 2 SIMD operations rather than executing a loop that iterates Count times.

The same should be true for x64 when Avx2.IsSupported is false. When it is true, none of these implementations matter outside indirect invocation (like reflection) which is already very expensive. Instead, the JIT replaces the implementations with better optimized code, often single SIMD instructions.
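
For illustration, one way such an indirect invocation can happen: calling Vector256.Add through reflection, which dispatches to the managed implementation rather than the JIT's intrinsic expansion (a sketch, not taken from the BCL or its tests):

```csharp
using System;
using System.Reflection;
using System.Runtime.Intrinsics;

// Resolve the open generic Vector256.Add<T> and close it over float.
MethodInfo add = typeof(Vector256)
    .GetMethod(nameof(Vector256.Add))!
    .MakeGenericMethod(typeof(float));

// Reflection boxes the arguments and invokes the managed implementation.
var result = (Vector256<float>)add.Invoke(
    null, new object[] { Vector256.Create(1.0f), Vector256.Create(2.0f) })!;

Console.WriteLine(result);
```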

Are all of these paths thoroughly unit tested already?

Yes

@tannergooding
Member Author

@dotnet/jit-contrib this has a small 8 line change in morph.cpp that needs review.

It's changing an assert to account for the fact that Unsafe.As<StructWithReferenceField, HfaStruct>(ref value) would not have matching GC types.

In the case of how it's used in the BCL, it's only in dead code that's under a RuntimeHelpers.IsReferenceOrContainsReferences<T>() check. However, this "unsafe" code could be encountered in "live" code for a customer and is arguably no different (functionally) from a reinterpret cast that reads the HfaStruct directly from a given memory address (reference). It may cause a GC hole or other problems due to UB, but the actual IL is in itself "valid".
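
A hedged sketch of the pattern being described; HfaStruct and the guard shape are illustrative, and the only point is that the Unsafe.As reinterpretation sits in code that is dead whenever T actually contains references:

```csharp
using System;
using System.Runtime.CompilerServices;

struct HfaStruct { public float A, B, C, D; }

static HfaStruct ReinterpretAsHfa<T>(ref T value)
{
    if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
    {
        // For such a T the reinterpreting read below is never reached.
        throw new NotSupportedException();
    }

    // Functionally a reinterpret cast; the IL is valid even though misuse
    // (e.g. on a struct with reference fields) would be UB or a GC hole.
    return Unsafe.As<T, HfaStruct>(ref value);
}
```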

@tannergooding
Member Author

@dotnet/jit-contrib this has a small 8 line change in morph.cpp that needs review.

@tannergooding
Member Author

The superpmi-diffs failure is #76542; the superpmi-replay failure is #76511.

@tannergooding tannergooding merged commit b3fdac7 into dotnet:main Oct 3, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2022
@tannergooding tannergooding deleted the vector-improvements-4 branch November 11, 2022 15:08