Vectorize {Last}IndexOf{Any} and {Last}IndexOfAnyExcept without code duplication #73768
Tagging subscribers to this area: @dotnet/area-system-memory
Haven't reviewed yet (won't be able to until later today), but in concept I'm happy with it. Can we do the same for IndexOf{Any} and the other overloads? (@GrabYourPitchforks had actually suggested this approach initially, and I'd looked at doing so when first adding the methods, but the direct use of intrinsics made it challenging.)
The auxiliary types created in patterns like this have a larger static footprint. It is not a big deal if the set of instantiations is small and finite. You can reduce this downside by using the existing types instead of introducing new ones. For example:
return SpanHelpers.LastIndexOf<T>(ref MemoryMarshal.GetReference(span), value, span.Length);
}
=> LastIndexOf((ReadOnlySpan<T>)span, value);
This is not inlineable when T is a generic variable, due to current generic inlining limitations. It means that this change to just forward Span to ReadOnlySpan will come with a perf regression in some situations.
(Just pointing it out. I will leave it up to you whether to take this regression for simplicity. Either way is fine with me.)
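To make the trade-off in the comment above concrete, here is a minimal sketch of the two shapes being compared. The class and method names (`SpanSearch`, `LastIndexOfForwarding`, `LastIndexOfDuplicated`) are hypothetical and are not the runtime's actual code; the sketch only shows where the extra call sits, it does not measure anything.

```csharp
using System;

Span<int> data = new[] { 4, 7, 7, 1 };

// Both shapes return the same index; they differ only in how the call is routed.
Console.WriteLine(SpanSearch.LastIndexOfForwarding(data, 7)); // prints 2
Console.WriteLine(SpanSearch.LastIndexOfDuplicated(data, 7)); // prints 2

static class SpanSearch
{
    // Shape 1: the Span<T> overload forwards to the ReadOnlySpan<T> overload.
    // No duplicated body, but the forwarding call has to be inlined to be free,
    // which is the part the comment says may not happen for a generic T.
    public static int LastIndexOfForwarding<T>(Span<T> span, T value) where T : IEquatable<T>
        => LastIndexOf((ReadOnlySpan<T>)span, value);

    // Shape 2: the Span<T> overload repeats the body, so there is no extra call.
    public static int LastIndexOfDuplicated<T>(Span<T> span, T value) where T : IEquatable<T>
    {
        for (int i = span.Length - 1; i >= 0; i--)
        {
            if (span[i].Equals(value)) return i;
        }
        return -1;
    }

    // Shared ReadOnlySpan<T> implementation used by the forwarding shape.
    // Assumes non-null elements, which is enough for this illustration.
    public static int LastIndexOf<T>(ReadOnlySpan<T> span, T value) where T : IEquatable<T>
    {
        for (int i = span.Length - 1; i >= 0; i--)
        {
            if (span[i].Equals(value)) return i;
        }
        return -1;
    }
}
```

Whether the forwarding call is actually inlined depends on the runtime and on how T is instantiated, so any cost would have to be confirmed with a benchmark.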
I think that's the case here. The way this is set up it seems like there should be at most 8: 4 primitive types * 2 helpers.
I think I've hit a bug in JIT: #73804
I've ported IndexOf too, but some of the tests are failing due to unaligned reads for runtime/src/libraries/System.Private.CoreLib/src/System/String.cs (lines 599 to 601 in 3e0a5ad).
I am going to continue working on this tomorrow.
@adamsitnik, is this PR for .NET 8 or .NET 7? Trying to decide if we need to fix #73804 in .NET 7 or 8.
It's intended for 7.
… and avoid duplication by calling it from both Span and ROS
… to searching for zeros
This does not address the problem with inlining limitations that I have mentioned. It actually makes it worse, since both the ReadOnlySpan and Span overloads take a performance hit. It was just the Span overload before the last commit. Repro:
using System.Diagnostics;
using System.Runtime.CompilerServices;
ReadOnlySpan<MyStruct<string>> span = new MyStruct<string>[1];
var sw = new Stopwatch();
for (;;)
{
sw.Restart();
for (int i = 0; i < 100000000; i++) ContainsDefault(span);
Console.WriteLine(sw.ElapsedMilliseconds);
}
[MethodImpl(MethodImplOptions.NoInlining)]
static bool ContainsDefault<T>(ReadOnlySpan<T> span) where T: IEquatable<T>
=> span.Contains(default);
public struct MyStruct<T> : IEquatable<MyStruct<T>>
{
int _value;
bool IEquatable<MyStruct<T>>.Equals(MyStruct<T> other) => _value == other._value;
}
Baseline: 729 ms per iteration. Stacktrace to
Current PR: 820 ms per iteration. Stacktrace to
Current PR without the latest commit: same as baseline.
There is no good way to work around the inlining limitations without code duplication.
@adamsitnik - this has caused significant regressions in AOT-WASM and Interpreter WASM scenarios, as indicated in the linked issues above. The #74395 (comment) made last month indicated this could have been a reason for the regressions, but somehow we didn't quite follow up on that. I want to discuss how we can avoid this in the future, possibly with manual runs on Mono scenarios with the changes? @DrewScoggins - is it possible to call out improvements as well as regressions for the changes across various runs? Also @adamsitnik - I believe it is quite late, but is there any remote chance we can revert those changes from 7.0/release? cc @jeffhandley
First of all, please excuse me for missing #74395 (comment)
From my perspective all I need is documentation that describes how to benchmark and preferably how to profile WASM. We have an issue for that: dotnet/BenchmarkDotNet#1818 but it did not receive a lot of traction. We should extend https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md and https://github.com/dotnet/performance/blob/main/docs/profiling-workflow-dotnet-runtime.md with WASM instructions so folks like me can just run the benchmarks themselves and avoid introducing regressions.
I would prefer to not revert it, as it has brought a lot of perf improvements for arm64. In my opinion the best quick fix would be to re-introduce
I think we'd be interested in a change that preserves the win for non-WASM. It really depends on risk/confidence.
We need a solution for net7 which can be implemented quickly and is low risk.
I agree. I think what folks are pointing out is that reverting the previous changes is not low risk.
Which means a high risk change was committed post rc1 and leaves the wasm runtime in a fairly tight spot with no time to react.
What do you mean post-RC1? This change is in RC1.
I meant post branch for rc1, sorry for the imprecision.
Yes, it was merged into the rc1 branch a month ago, the day after the rc1 branch was snapped from main. I'm surprised that makes a material difference for wasm having time to react. Regardless, we all agree we want to fix the wasm regressions; let's jointly find a solution rather than placing blame. Adam suggested adding in some
The simplest lowest-risk solution for .NET 7 is to just put whatever was there before under
That sounds fine to me. I'm assuming that's what Tanner had in mind (but don't know for sure).
My current idea:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool ExecuteVectorizedCodePath<T>(int length) where T : struct
#if TARGET_WASM
=> Vector.IsHardwareAccelerated && length >= Vector<T>.Count;
#else
=> Vector128.IsHardwareAccelerated && length >= Vector128<T>.Count;
#endif
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
internal static bool ContainsValueType<T>(ref T searchSpace, T value, int length) where T : struct, INumber<T>
{
if (!ExecuteVectorizedCodePath<T>(length))
{
// current non-vectorized code path
}
#if TARGET_WASM
else
{
// restored Vector<T> path
}
#else
else if (Vector256.IsHardwareAccelerated && length >= Vector256<T>.Count)
{
// current Vector256 code path
}
else
{
// current Vector128 code path
}
#endif
return false;
}
I am already working on a fix, but I don't know how to benchmark WASM AOT yet.
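For readers unfamiliar with the gating used in the idea above, the following is a minimal, self-contained sketch of a `Vector128` path guarded by `Vector128.IsHardwareAccelerated` and a length check, with a scalar fallback. The `VectorizedContains` class is hypothetical and is specialized to `int` for brevity; it illustrates the general shape only and is not the code from the PR.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

int[] data = { 5, 9, 12, 40, 41, 42, 43, 44, 45 };
Console.WriteLine(VectorizedContains.Contains(data, 42)); // prints True
Console.WriteLine(VectorizedContains.Contains(data, 7));  // prints False

static class VectorizedContains
{
    public static bool Contains(ReadOnlySpan<int> span, int value)
    {
        int i = 0;

        // Take the vectorized path only when the hardware supports it and the
        // input holds at least one full vector.
        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<int>.Count)
        {
            Vector128<int> target = Vector128.Create(value);
            ref int start = ref MemoryMarshal.GetReference(span);
            int lastBlock = span.Length - Vector128<int>.Count;

            for (; i <= lastBlock; i += Vector128<int>.Count)
            {
                // Load Vector128<int>.Count elements starting at index i.
                Vector128<int> block = Vector128.LoadUnsafe(ref start, (nuint)i);
                if (Vector128.EqualsAny(block, target)) return true;
            }
        }

        // Scalar path: handles the remaining tail, or the whole input when the
        // vectorized path was skipped.
        for (; i < span.Length; i++)
        {
            if (span[i] == value) return true;
        }
        return false;
    }
}
```

The sketch handles the tail with a scalar loop for simplicity; production code may instead re-process the last full vector.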
@adamsitnik #74395 (comment) says that the problem is caused by generics and by depending on the JIT/AOT doing complex optimizations to streamline the code. I do not think adding
@jkotas The report mentions regression in
There might be multiple issues. The one I saw when working on SIMD improvements is related to generics. We now end up with shared generic code in methods which were specialized (non-shared) before. In my case that leads to the resulting code not using SIMD intrinsics. More importantly, the shared generic code is also slower for the default (non-SIMD) cases, which is visible in the browser-bench measurements (screenshot in #75709): the most affected graph "flavors" are those with SIMD; in the others it is visible too, just on a smaller scale.
It is ImmutableArray.Contains, Queue.Contains, etc. All of these are implemented using IndexOf, which I believe was switched to the generic implementation: Lines 278 to 281 in 57bfe47
runtime/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Queue.cs Lines 290 to 298 in 57bfe47
The advantage of putting the old code back in
OK, I am going to bring the old code back for Mono only.
Adding a reference link to where we closed the loop with the performance results that illustrate the mono regressions were indeed fixed: |
@stephentoub @jkotas the only difference between LastIndexOf and LastIndexOfAnyExcept is an additional negation. I wanted to avoid code duplication without losing perf, so I've introduced a new interface and two structs that implement it (the first does ==, the second !=). By having the right generic constraints, I was able to get exactly the same perf (all the calls got inlined).
Codegen diff between my first and second commit: https://www.diffchecker.com/qeBXi1Rj
It works great for CLR RyuJIT x64, but I wonder what downsides it has (AOT support, generic code bloat?). Please let me know if using such a pattern is acceptable. If it is, I could vectorize more similar methods without duplicating the code.
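A minimal sketch of the pattern described above, assuming C# 11 static abstract interface members (available in .NET 7). The names `IComparison`, `WhenEqual`, `WhenNotEqual` and `Search` are hypothetical and chosen for illustration; the thread only says "an interface and two structs", and the real helper operates on vectors as well as scalars.

```csharp
using System;

ReadOnlySpan<int> data = new[] { 1, 2, 3, 3 };

// Same loop, two behaviours, selected by a struct type argument.
Console.WriteLine(Search.LastIndexOf<int, WhenEqual>(data, 3));    // prints 3
Console.WriteLine(Search.LastIndexOf<int, WhenNotEqual>(data, 3)); // prints 1

// Hypothetical interface: turns "are the two values equal?" into "is this a match?".
interface IComparison
{
    static abstract bool Matches(bool equal);
}

// Used for LastIndexOf: a match is an equal element.
readonly struct WhenEqual : IComparison
{
    public static bool Matches(bool equal) => equal;
}

// Used for LastIndexOfAnyExcept: a match is an element that is not equal.
readonly struct WhenNotEqual : IComparison
{
    public static bool Matches(bool equal) => !equal;
}

static class Search
{
    // The struct constraint on TComparison gives the JIT a separate
    // instantiation per struct, so the call to Matches can be resolved
    // statically. Assumes non-null elements.
    public static int LastIndexOf<T, TComparison>(ReadOnlySpan<T> span, T value)
        where T : IEquatable<T>
        where TComparison : struct, IComparison
    {
        for (int i = span.Length - 1; i >= 0; i--)
        {
            if (TComparison.Matches(span[i].Equals(value))) return i;
        }
        return -1;
    }
}
```

Each (T, TComparison) pair is a distinct instantiation, which is the static-footprint cost raised later in the thread.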