Updating Vector256 to have its software fallback be 2x Vector128<T> ops #76221
Conversation
Tagging subscribers to this area: @dotnet/area-system-numerics

Issue Details

This simplifies the overall logic we need to maintain, answers a request that's been made by several community members, and will provide additional acceleration on platforms without native V256 support.

Notably, this does not mean Vector256.IsHardwareAccelerated will report true, nor does it guarantee that it's as fast as unrolling or other specialized logic you may write yourself. I plan on applying the same pattern to Vector512 and Vector512<T>.
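As a rough illustration of the pattern the title describes (this is a hedged sketch, not the actual CoreLib source; the type and method names are invented for the example), a Vector256 operation whose software fallback is expressed as two Vector128<T> operations might look like this:

```csharp
using System.Runtime.Intrinsics;

internal static class Vector256FallbackSketch
{
    // Sketch only (not the CoreLib implementation): when Vector256 is not
    // natively accelerated, an operation can be expressed as the equivalent
    // Vector128 operation applied to the lower and upper halves, recombined.
    public static Vector256<T> Add<T>(Vector256<T> left, Vector256<T> right)
        where T : struct
    {
        Vector128<T> lower = Vector128.Add(left.GetLower(), right.GetLower());
        Vector128<T> upper = Vector128.Add(left.GetUpper(), right.GetUpper());
        return Vector256.Create(lower, upper);
    }
}
```

On a platform where Vector128 is hardware accelerated but Vector256 is not, both halves stay SIMD, which is where the extra acceleration comes from.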
What does this mean for how we write our code and the guidance we provide? Does it mean we should be dropping our Vector256.IsHardwareAccelerated checks?
I don't think we need to change the guidance and I don't believe we should drop any of our existing checks. Overall, this is mostly about making the code simpler to test/maintain, especially as we look at adding Vector512 support.

If we find a place where manual unrolling is beneficial, we might be able to (with care) utilize Vector256 even where only Vector128 is accelerated. However, we've also seen that this can hurt perf in a few cases and we're better off doing single dispatch. We've also seen cases where such checks might be beneficial for large scenarios but where we really want a 1x and a 2x path, so that smaller cases can still be vectorized.
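To make the "1x and a 2x path" point concrete, here is a hedged sketch (the method name, thresholds, and overall shape are invented for illustration, not taken from the PR): it keeps the existing Vector256.IsHardwareAccelerated check for the 2x path while smaller inputs or hardware still get a Vector128 path.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static class DispatchSketch
{
    // Illustrative only: a 2x (Vector256) path guarded by its own check, a 1x
    // (Vector128) path for smaller inputs or hardware, and a scalar tail.
    public static int Sum(ReadOnlySpan<int> values)
    {
        ref int start = ref MemoryMarshal.GetReference(values);
        int result = 0;
        int i = 0;

        if (Vector256.IsHardwareAccelerated && values.Length >= Vector256<int>.Count)
        {
            Vector256<int> acc = Vector256<int>.Zero;
            for (; i <= values.Length - Vector256<int>.Count; i += Vector256<int>.Count)
            {
                acc += Vector256.LoadUnsafe(ref start, (nuint)i);
            }
            result += Vector256.Sum(acc);
        }
        else if (Vector128.IsHardwareAccelerated && values.Length >= Vector128<int>.Count)
        {
            Vector128<int> acc = Vector128<int>.Zero;
            for (; i <= values.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            {
                acc += Vector128.LoadUnsafe(ref start, (nuint)i);
            }
            result += Vector128.Sum(acc);
        }

        // Scalar tail for any remaining elements.
        for (; i < values.Length; i++)
        {
            result += values[i];
        }

        return result;
    }
}
```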
For what it's worth, in a project a few years back I used a triple-state to indicate whether some instructions were supported natively, emulated, or unavailable. This was especially useful when some instructions were not available directly but could be emulated using 2-3 other instructions, making the code still faster than a full software fallback. Maybe there is some benefit to introducing something similar here.
I don't think it's necessary. You can just check Vector128.IsHardwareAccelerated and Vector256.IsHardwareAccelerated to distinguish the cases.
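As a minimal, purely illustrative sketch (the enum and method names are invented, not proposed API surface), the two existing properties already encode the three states a tri-state would distinguish:

```csharp
using System.Runtime.Intrinsics;

// Hypothetical names for illustration only; not proposed API.
internal enum Vector256Support
{
    None,              // neither width is hardware accelerated
    ViaVector128Pairs, // Vector256 ops run as two Vector128 ops
    Native             // full-width Vector256 acceleration
}

internal static class SupportSketch
{
    public static Vector256Support Classify()
    {
        if (Vector256.IsHardwareAccelerated)
        {
            return Vector256Support.Native;
        }

        return Vector128.IsHardwareAccelerated
            ? Vector256Support.ViaVector128Pairs
            : Vector256Support.None;
    }
}
```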
@fanyang-mono, it looks like a number of failures are coming from Mono. Is this a trivial fix on the Mono side? Maybe just a place where the relevant intrinsic handling needs to be updated?
Might've figured it out; not sure if it's the right style or preferred way to resolve the issue, however.
Force-pushed from eb4bbc3 to 7af3ae9.
The Mono changes look OK to me.
I don't understand 100% of the implementation details, but this looks good to me, at least conceptually.
Overall, I have three questions:
- How often are the software fallbacks of these APIs used?
- This is a great way to reuse code and simplify our implementation, but are we confident that this isn't a significant perf hit?
- Are all of these paths thoroughly unit tested already?
For x64, they'll be used when AVX2 isn't supported (which is when Vector256.IsHardwareAccelerated reports false).

It may be a minor perf hit for platforms with no SIMD acceleration. However, such platforms are already in a "worst case scenario" if they are using these APIs. For a platform like Arm64, which has Vector128 but not Vector256 acceleration, it will likely be a perf win, as we'll typically execute 2 SIMD operations rather than executing a loop that iterates over each element. The same should be true for x64 when Vector128 is accelerated but AVX2 (and therefore Vector256) isn't.
Yes.
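To illustrate the perf reasoning above, here is a rough sketch (not the actual implementation; names are invented) contrasting the old per-element fallback shape with the new two-Vector128-ops shape for a 256-bit add of Int32 values. On a platform like Arm64 the latter maps to two NEON operations instead of an eight-iteration scalar loop.

```csharp
using System.Runtime.Intrinsics;

internal static class FallbackShapeSketch
{
    // Old shape (illustrative): one scalar operation per element,
    // eight iterations for Vector256<int>.
    public static Vector256<int> AddPerElement(Vector256<int> left, Vector256<int> right)
    {
        Vector256<int> result = default;
        for (int i = 0; i < Vector256<int>.Count; i++)
        {
            result = result.WithElement(i, left.GetElement(i) + right.GetElement(i));
        }
        return result;
    }

    // New shape (illustrative): two Vector128 adds, which stay SIMD
    // whenever Vector128 is hardware accelerated.
    public static Vector256<int> AddAsVector128Pairs(Vector256<int> left, Vector256<int> right)
    {
        return Vector256.Create(
            left.GetLower() + right.GetLower(),
            left.GetUpper() + right.GetUpper());
    }
}
```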
Force-pushed from 29e6d5c to ebaf95a.
@dotnet/jit-contrib this has a small 8 line change in morph.cpp that needs review. It's changing an assert to account for a case that can now occur with these fallbacks. In the case of how it's used in the BCL, it's only hit in dead code that's under an IsHardwareAccelerated check.
@dotnet/jit-contrib this has a small 8 line change in morph.cpp that needs review.