Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize {Last}IndexOf{Any} and {Last}IndexOfAnyExcept without code duplication #73768

Merged
merged 37 commits into from
Aug 17, 2022

Conversation

adamsitnik
Copy link
Member

@adamsitnik adamsitnik commented Aug 11, 2022

@stephentoub @jkotas the only difference between LastIndexOf and LastIndexOfAnyExcept is addional negation. I wanted to avoid code duplication without losing perf, so I've introduced new interface and two structs that are implementing it (first does ==, second !=). By having the right generic constrains, I was able to get exactly the same perf (all the calls got inlined).

Codegen diff between my first and second commit: https://www.diffchecker.com/qeBXi1Rj

It works great for CLR RyuJIT x64, but I wonder what downsides it has (AOT support, generic code bloat?) Please let me know if using such pattern is acceptable. If it is, I could vectorize more similar methods without duplicating the code.

@adamsitnik adamsitnik added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) area-System.Memory labels Aug 11, 2022
@ghost ghost assigned adamsitnik Aug 11, 2022
@ghost
Copy link

ghost commented Aug 11, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

@stephentoub @jkotas the only difference between LastIndexOf and LastIndexOfAnyExcept is addional negation. I wanted to avoid code duplication without losing perf, so I've introduced new interface and two structs that are implementing it. By having the right generic constrains, I was able to get exactly the same perf (all the calls got inlined).

Codegen diff between my first and second commit: https://www.diffchecker.com/qeBXi1Rj

It works great for CLR RyuJIT x64, but I wonder what downsides it has (AOT support, generic code bloat?) Please let me know if using such pattern is acceptable. If it is, I could vectorize more similar methods without duplicating the code.

Author: adamsitnik
Assignees: -
Labels:

NO-MERGE, area-System.Memory

Milestone: -

@stephentoub
Copy link
Member

stephentoub commented Aug 11, 2022

Haven't reviewed yet (won't be able to until later today), but in concept I'm happy with it. Can we do the same for IndexOf{Any} and the other overloads?

(@GrabYourPitchforks had actually suggested this approach initially and I'd looked at doing so when adding the methods initially, but the direct use of intrinsics made it challenging.)

@jkotas
Copy link
Member

jkotas commented Aug 11, 2022

I wonder what downsides it has (AOT support, generic code bloat?)

The auxiliary types created in patterns like this have larger static footprint. It is not a big deal if the set of instantiations is small and finite.

You can reduce this downside by using the existing types instead of introducing new ones. For example:

// You can also just define your own `class Negate { }` and `class DoNotNegate { }` instead of piggy backing on Int32/UInt32.
using Negate = System.Int32;
using DoNotNegate = System.UInt32;

...

SpanHelpers.LastIndexOfValueType<byte, Negate>(....)`

...

if (typeof(N) == typeof(Negate))
{
    equals = ~equals;
}


return SpanHelpers.LastIndexOf<T>(ref MemoryMarshal.GetReference(span), value, span.Length);
}
=> LastIndexOf((ReadOnlySpan<T>)span, value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not inlineable when T is generic variable due to current generic inlining limitations. It means that this change to just forward Span to ReadOnlySpan will come with some perf regression in some situations.

(Just pointing it out. I will leave it up to you whether to take this regression for simplicity. Either way is fine with me.)

@stephentoub
Copy link
Member

It is not a big deal if the set of instantiations is small and finite.

I think that's the case here. The way this is set up it seems like there should be at most 8: 4 primitive types * 2 helpers.

@adamsitnik
Copy link
Member Author

I think I've hit a bug in JIT: #73804

@adamsitnik
Copy link
Member Author

I've ported IndexOf too, but some of the tests are failing due to unaligned reads for strlen:

// IndexOf processes memory in aligned chunks, and thus it won't crash even if it accesses memory beyond the null terminator.
// This IndexOf behavior is an implementation detail of the runtime and callers outside System.Private.CoreLib must not depend on it.
int length = SpanHelpers.IndexOf(ref *ptr, '\0', int.MaxValue);

I am going to continue working on this tomorrow

@JulieLeeMSFT
Copy link
Member

I think I've hit a bug in JIT: #73804

@adamsitnik, is this PR for .NET 8 or .NET 7? Trying to decide if we need to fix #73804 in .NET 7 or 8.

@stephentoub
Copy link
Member

, is this PR for .NET 8 or .NET 7?

It's intended for 7.

@jkotas
Copy link
Member

jkotas commented Aug 12, 2022

address review from Jan, don't cast Span to ROS.

This does not address the problem with inlining limitations that I have mentioned. It actually makes it worse since both ReadOnlySpan and Span overloads get performance hit. It was just Span overload before the last commit.

Repro:

using System.Diagnostics;
using System.Runtime.CompilerServices;

ReadOnlySpan<MyStruct<string>> span = new MyStruct<string>[1];

var sw = new Stopwatch();
for (;;)
{
    sw.Restart();
    for (int i = 0; i < 100000000; i++) ContainsDefault(span);
    Console.WriteLine(sw.ElapsedMilliseconds);  
}

[MethodImpl(MethodImplOptions.NoInlining)]
static bool ContainsDefault<T>(ReadOnlySpan<T> span) where T: IEquatable<T>
   => span.Contains(default);

public struct MyStruct<T> : IEquatable<MyStruct<T>>
{
    int _value;

    bool IEquatable<MyStruct<T>>.Equals(MyStruct<T> other) => _value == other._value;
}

Baseline

729ms per iteration

Stacktrace to SpanHelpers.Contains:

System_Private_CoreLib!System.SpanHelpers.Contains<MyStruct<string>>+0x8
System_Private_CoreLib!System.MemoryExtensions.Contains<MyStruct<string>>+0x48
repro!Program.<<Main>$>g__ContainsDefault|0_0<MyStruct<string>>+0x3d [C:\repro\Program.cs @ 16] 
repro!Program.<Main>$+0xda [C:\repro\Program.cs @ 10] 

Current PR:

820ms per iteration

Stacktrace to SpanHelpers.Contains. Notice an extra MemoryExtensions.Contains frame.

System_Private_CoreLib!System.SpanHelpers.Contains<MyStruct<string>>+0x31d [C:\runtime\src\libraries\System.Private.CoreLib\src\System\SpanHelpers.T.cs @ 270] 
System_Private_CoreLib!System.MemoryExtensions.Contains<MyStruct<string>>+0x48 [C:\runtime\src\libraries\System.Private.CoreLib\src\System\MemoryExtensions.cs @ 303] 
System_Private_CoreLib!System.MemoryExtensions.Contains<MyStruct<string>>+0x48 [C:\runtime\src\libraries\System.Private.CoreLib\src\System\MemoryExtensions.cs @ 278] 
repro!Program.<<Main>$>g__ContainsDefault|0_0<MyStruct<string>>+0x3d [C:\repro\Program.cs @ 16] 
repro!Program.<Main>$+0xda [C:\repro\Program.cs @ 10] 

Current PR without the latest commit:

Same as baseline.

There is no good way to work around the inlining limitations without code duplication.

@SamMonoRT
Copy link
Member

SamMonoRT commented Sep 15, 2022

@adamsitnik - this has caused significant regressions in AOT-WASM and Interpreter WASM scenarios as indicated in linked issues above. #74395 (comment) comment made last month indicated this could have been a reason for the regressions, but somehow we didn't quite follow up on that. I want to discuss how we can avoid this in the future, possibly with manual runs on Mono scenarios with the changes ? @DrewScoggins - is it possible to call out improvements as well as regressions for the changes across various runs.

Also @adamsitnik - I believe it is quite late, but are there any remote chances we can revert those changes from 7.0/release ?

cc @jeffhandley

@adamsitnik
Copy link
Member Author

First of all, please excuse me for missing #74395 (comment)

I want to discuss how we can avoid this in the future, possibly with manual runs on Mono scenarios with the changes ?

From my perspective all I need is documentation that describes how to benchmark and preferably how to profile WASM. We have an issue for that: dotnet/BenchmarkDotNet#1818 but it did not receive a lot of traction. We should extend https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md and https://github.com/dotnet/performance/blob/main/docs/profiling-workflow-dotnet-runtime.md with WASM instructions so folks like me can just run the benchmarks themselves and avoid introducing regressions.

I believe it is quite late, but are there any remote chances we can revert those changes from 7.0/release ?

I would prefer to not revert it, as it has brought a lot of perf improvements for arm64. In my opinion the best quick fix would be to re-introduce Vector<T> code paths for the methods that were previously vectorized and now are not (because WASM does not support Vector128). I am going to give it a try. The question is whether such a backport would be accepted? @jeffhandley @danmoseley ?

@danmoseley
Copy link
Member

I think we'd be interested in a change that preserves the win for non WASM. It really depends on risk/confidence.

@vargaz
Copy link
Contributor

vargaz commented Sep 16, 2022

We need a solution for net7 which can be implemented quickly and its low risk.

@stephentoub
Copy link
Member

stephentoub commented Sep 16, 2022

We need a solution for net7 which can be implemented quickly and its low risk.

I agree. I think what folks are pointing out is that reverting the previous changes is not low risk.

@lewing
Copy link
Member

lewing commented Sep 16, 2022

We need a solution for net7 which can be implemented quickly and its low risk.

I agree. I think what folks are pointing out is that reverting the previous changes is not low risk.

Which means a high risk change was committed post rc1 and leaves the wasm runtime in a fairly tight spot with no time to react.

@stephentoub
Copy link
Member

Which means a high risk change was committed post rc1

What do you mean post-RC1? This change is in RC1.

@lewing
Copy link
Member

lewing commented Sep 16, 2022

I meant post branch for rc1, sorry for the imprecision

@stephentoub
Copy link
Member

I meant post branch for rc1

Yes, it was merged into the rc1 branch a month ago, the day after the rc1 branch was snapped from main. I'm surprised that makes a material difference for wasm having time to react.

Regardless, we all agree we want to fix the wasm regressions; let's jointly find a solution rather than placing blame. Adam suggested adding in some Vector<T> paths. Tanner suggested some strategically placed ifdefs (though I don't know exactly what he had in mind). Are you pushing back against those? Are there other options on the table?

@jkotas
Copy link
Member

jkotas commented Sep 16, 2022

The simplest lowest-risk solution for .NET 7 is to just put whatever was there before under #if MONO ifdef. No need to be creative this close to GA.

@stephentoub
Copy link
Member

That sounds fine to me. I'm assuming that's what Tanner had in mind (but don't know for sure).

@adamsitnik
Copy link
Member Author

My current idea:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool ExecuteVectorizedCodePath<T>(int length) where T : struct
#if TARGET_WASM
    => Vector.IsHardwareAccelerated && length >= Vector<T>.Count;
#else
    => Vector128.IsHardwareAccelerated && length >= Vector128<T>.Count;
#endif

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
internal static bool ContainsValueType<T>(ref T searchSpace, T value, int length) where T : struct, INumber<T>
{
    if (!ExecuteVectorizedCodePath<T>(length))
    {
        // current non-vectorized code path
    }
#if TARGET_WASM
    else
    {
        // restored Vector<T> path
    }
#else
    else if (Vector256.IsHardwareAccelerated && length >= Vector256<T>.Count)
    {
        // current Vector256 code path
    }
    else
    {
        // current Vector128 code path
    }
#endif

    return false;
}

I am already working on a fix, but I don't know how to benchmark WASM AOT yet

@jkotas
Copy link
Member

jkotas commented Sep 16, 2022

@adamsitnik #74395 (comment) says that the problem is caused by generics and depending on JIT/AOT doing complex optimizations to streamline the code.

I do not think adding Vector<T> paths back is going to fix this.

@adamsitnik
Copy link
Member Author

@jkotas The report mentions regression in Contains which does not use the generic "hack", that is why I am going to benchmark Vector<T> approach first.

@radekdoulik
Copy link
Member

radekdoulik commented Sep 16, 2022

There might be multiple issues. The one I saw when working on SIMD improvements is related to generics. We now end with shared generic code in methods, which were specialized (non shared) before. In my case that leads to resulting code not using SIMD intrinsics. More importantly the shared generic code is also slower for default (non-SIMD) cases and is visible in browser-bench measurements, screenshot in #75709 - the most affected graph "flavors" are these with SIMD, in the others it is visible too, just in a smaller scale.

@jkotas
Copy link
Member

jkotas commented Sep 16, 2022

The report mentions regression in Contains

It is ImmutableArray.Contains, Queue.Contains, etc. All of these are implemented using IndexOf that I believe was switched to the generic impl:

public bool Contains(T item)
{
return this.IndexOf(item) >= 0;
}

if (_head < _tail)
{
return Array.IndexOf(_array, item, _head, _size) >= 0;
}
// We've wrapped around. Check both partitions, the least recently enqueued first.
return
Array.IndexOf(_array, item, _head, _array.Length - _head) >= 0 ||
Array.IndexOf(_array, item, 0, _tail) >= 0;

@danmoseley
Copy link
Member

I don't know how to benchmark WASM AOT yet

The advantage of putting the old code back in #if MONO is that you shouldn't need to profile WASM, or at least it need not hold up merging the change. My preference FWIW is to get Mono back to the old codepath promptly, and look for confirmation from the perf lab in a couple days presumably.

@adamsitnik
Copy link
Member Author

OK, I am going to bring old the back old code for Mono-only.

@jeffhandley
Copy link
Member

Adding a reference link to where we closed the loop with the performance results that illustrate the mono regressions were indeed fixed:
dotnet/perf-autofiling-issues#7981 (comment)

@ghost ghost locked as resolved and limited conversation to collaborators Nov 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.