Switch to iSimdVector and Align WidenAsciiToUtf16 #99982

DeepakRajendrakumaran · 2024-03-19T21:58:00Z

This is an updated version of PR #89892. It does the following:

Add 'AnyMatches' support for ISimdVector
Use ISimdVector to clean up the 'WidenAsciiToUtf16' implementation
Align memory stores
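
For readers unfamiliar with the operation, here is a minimal sketch of what "widening ASCII to UTF-16" means, written against the fixed-size `Vector128` API rather than `ISimdVector`. The class and method names (`WidenSketch`, `WidenAscii`) are hypothetical and this is not the runtime's `WidenAsciiToUtf16`; it uses plain unaligned span copies and skips the store alignment this PR adds.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class WidenSketch
{
    // Hypothetical helper (not the runtime implementation): widens leading ASCII
    // bytes into UTF-16 chars and returns how many elements were converted.
    public static int WidenAscii(ReadOnlySpan<byte> ascii, Span<char> utf16)
    {
        int length = Math.Min(ascii.Length, utf16.Length);
        Span<ushort> dst = MemoryMarshal.Cast<char, ushort>(utf16);
        int i = 0;

        // Vector path: 16 bytes in, two vectors of 8 ushorts out.
        while (i + Vector128<byte>.Count <= length)
        {
            Vector128<byte> v = Vector128.Create<byte>(ascii.Slice(i));
            if (v.ExtractMostSignificantBits() != 0)
            {
                break; // some byte has its high bit set, so it is not ASCII
            }

            (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(v);
            lower.CopyTo(dst.Slice(i));
            upper.CopyTo(dst.Slice(i + Vector128<ushort>.Count));
            i += Vector128<byte>.Count;
        }

        // Scalar tail: handles the remainder and stops at the first non-ASCII byte.
        while (i < length && ascii[i] < 0x80)
        {
            utf16[i] = (char)ascii[i];
            i++;
        }

        return i;
    }
}
```

The return value is the number of leading elements that were pure ASCII, so a caller can hand the rest to a slower transcoding path.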

Perf Results

Ran the following tests (sizes: 16, 512, 1024, 5120, 10240) on EMR (base = main branch, diff = with change): https://github.com/dotnet/performance/blob/47d21ee9571164a8e3f8088d8709ca4061d96827/src/benchmarks/micro/libraries/System.Text.Encoding/Perf.Encoding.cs

On EMR

On ICX - Not much diff

DeepakRajendrakumaran · 2024-03-20T22:25:49Z

@tannergooding @dotnet/avx512-contrib Can you please review this?

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128_1.cs

tannergooding · 2024-03-27T19:18:08Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs

+        private static unsafe bool HasMatch<TVectorByte>(TVectorByte vector)
+            where TVectorByte : unmanaged, ISimdVector<TVectorByte, byte>
+        {
+            return !(vector & TVectorByte.Create((byte)0x80)).Equals(TVectorByte.Zero);


Why not (vector & TVectorByte.Create((byte)0x80)) != TVectorByte.Zero?

I basically ran into a weird issue where perf degraded significantly for specific cases on my ICX when I had '(vector & TVectorByte.Create((byte)0x80)) != TVectorByte.Zero' and the issue went away with what I have now. It was pretty consistent. I had some trouble narrowing down the exact why with VTune though.

I decided to go with the performant version for now

Was there a codegen difference between them? I'd expect them to generate the same code

I did a quick check on godbolt and they all look the same - https://godbolt.org/z/P87dPdTGa

Now I'm curious if it was just something off when I ran it locally
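
To make the comparison concrete, here is a small sketch of the two spellings under discussion, specialized to `Vector128<byte>` instead of the generic `TVectorByte`. The class and method names are hypothetical. The third variant, based on `ExtractMostSignificantBits`, does not come from this thread and is included only as another commonly seen way to ask the same question.

```csharp
using System;
using System.Runtime.Intrinsics;

static class NonAsciiCheckSketch
{
    // Spelling used in the PR: instance Equals against Zero, negated.
    public static bool ViaEquals(Vector128<byte> vector)
        => !(vector & Vector128.Create((byte)0x80)).Equals(Vector128<byte>.Zero);

    // Spelling suggested in review: the != operator.
    public static bool ViaOperator(Vector128<byte> vector)
        => (vector & Vector128.Create((byte)0x80)) != Vector128<byte>.Zero;

    // Assumption (not from the thread): gather the high bit of every byte into a mask.
    public static bool ViaMostSignificantBits(Vector128<byte> vector)
        => vector.ExtractMostSignificantBits() != 0;
}
```

All three return `true` exactly when at least one byte has its high bit (0x80) set, so any difference between them would be one of code generation or measurement, not of results.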

I did some more digging and something is off. Tried 3 versions with a handwritten benchmark

Benchmark

Match3 is significantly faster than Match1 and Match2

VTune for Match1 vs Match3

The inlining makes it hard to narrow down

Do you have a strong preference for any of these patterns?

I can look into why this is happening if it's important. For now, I'm just keeping the fast version

@tannergooding I have created an issue detailing this as discussed: #100493

Please let me know if there is anything else needed for this PR

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs

tannergooding · 2024-03-27T19:47:06Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs

+            if (!HasMatch<TVectorByte>(asciiVector))
+            {
+                (TVectorUShort utf16LowVector, TVectorUShort utf16HighVector) = Widen<TVectorByte, TVectorUShort>(asciiVector);
+                utf16LowVector.Store(pCurrentWriteAddress);
+                utf16HighVector.Store(pCurrentWriteAddress + TVectorUShort.Count);
+                pCurrentWriteAddress += (nuint)(TVectorUShort.Count * 2);
+                if (((int)pCurrentWriteAddress & 1) == 0)
+                {
+                    // Bump write buffer up to the next aligned boundary
+                    pCurrentWriteAddress = (ushort*)((nuint)pCurrentWriteAddress & ~(nuint)(TVectorUShort.Alignment - 1));
+                    nuint numBytesWritten = (nuint)pCurrentWriteAddress - (nuint)pUtf16Buffer;
+                    currentOffset += (nuint)numBytesWritten / 2;
+                }
+                else
+                {
+                    // If input isn't char aligned, we won't be able to align it to a Vector
+                    currentOffset += (nuint)TVectorByte.Count;
+                }
+                while (currentOffset <= finalOffsetWhereCanRunLoop)
+                {
+                    asciiVector = TVectorByte.Load(pAsciiBuffer + currentOffset);
+                    if (HasMatch<TVectorByte>(asciiVector))
+                    {
+                        break;
+                    }
+                    (utf16LowVector, utf16HighVector) = Widen<TVectorByte, TVectorUShort>(asciiVector);
+                    utf16LowVector.StoreAligned(pCurrentWriteAddress);
+                    utf16HighVector.StoreAligned(pCurrentWriteAddress + TVectorUShort.Count);
+                    currentOffset += (nuint)TVectorByte.Count;
+                    pCurrentWriteAddress += (nuint)(TVectorUShort.Count * 2);
+                }
+            }
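
The pointer arithmetic in the hunk above can be followed with plain integers. The sketch below is only an illustration under stated assumptions: `currentOffset` starts at 0, `TVectorUShort.Alignment` equals the vector size in bytes (which also equals `TVectorByte.Count`), and the names `AlignmentSketch` and `ElementsConsumedByFirstStore` are hypothetical.

```csharp
using System;

static class AlignmentSketch
{
    // Returns how many input elements count as consumed after the first
    // (unaligned) pair of stores, mirroring the branch in the hunk above.
    public static nuint ElementsConsumedByFirstStore(nuint bufferStart, int vectorSizeInBytes)
    {
        // Two stores of TVectorUShort.Count ushorts each write 2 * vectorSizeInBytes bytes.
        nuint writeAddress = bufferStart + (nuint)(2 * vectorSizeInBytes);

        if ((writeAddress & 1u) == 0u)
        {
            // Char-aligned: round the write address down to the previous vector
            // boundary, then convert the byte distance to a ushort count.
            writeAddress &= ~(nuint)(vectorSizeInBytes - 1);
            return (writeAddress - bufferStart) / 2;
        }

        // Odd address: a vector boundary can never be reached in ushort steps.
        return (nuint)vectorSizeInBytes;
    }
}
```

For a 16-byte vector and a buffer starting at 0x1004, the write address after the first stores is 0x1024, which rounds down to 0x1020, so 14 elements count as consumed. The first aligned iteration then re-widens 2 elements that were already written, which is harmless because it stores the same values again.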


The code here looks generally good, and I don't expect anything to be changed for this PR based on what I'm about to comment.

However, I would like to refer to how we set up everything for TensorPrimitives as it works very well and allows a lot of code sharing (noting it isn't using ISimdVector yet since it's out of band, but could easily do so in the future): https://source.dot.net/#System.Numerics.Tensors/System/Numerics/Tensors/netcore/Common/TensorPrimitives.IUnaryOperator.cs,fdab74764af40a1e

In general it tries to do the pre-checks up front and only ever execute one vector path (so it doesn't have to fall through from Vector512->Vector256->Vector128 as remainders exist). To achieve this, it has the Vectorized### helpers (which are all identical except for the size they operate on; this is what would eventually use ISimdVector) and then a shared VectorizedSmall, which is simply a jump table designed to handle any data that is less than a full vector using a single branch.

The core logic for the vectorized algorithm (https://source.dot.net/#System.Numerics.Tensors/System/Numerics/Tensors/netcore/Common/TensorPrimitives.IUnaryOperator.cs,ef9adce4e9561b04) then basically has a path to handle the main loop (which is currently unrolled by a factor of 8) and otherwise hits a jump table to handle remaining blocks (so they can likewise be handled with a single branch).

To help optimize, it preloads the beginning and ending vectors. In the worst case this will result in double processing of some inputs for very small sizes, but it's ultimately only 2 main operations, which is fine.

The main loop then attempts to align and has an optimized path for extremely large inputs for non-temporal data if alignment could be achieved. Smaller inputs just do regular unaligned stores since the actual address will have been aligned if that was feasible.

This general approach is done because it allows all paths, but particularly the smallest inputs, to minimize the total number of branches done (no more than 2 branches for non-vectorized data and no more than 3 to hit the main loop for vectorized code). It also allows us to separate the "algorithm logic" from the "vectorization logic" and share that vectorization logic between multiple vectorized algorithms.

This general setup has worked so well and provided very stable perf numbers for all sizes, such that we opened #93217 as a means of investigating if we could make it more general purpose and public. Long term, it'd probably be desirable to move algorithms like this WidenAsciiToUtf16 to follow the same approach so that we get the best perf, with the least overhead.
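
The "preload the beginning and ending vectors" idea described above can be sketched in a few lines. This is not the TensorPrimitives code: it has no unrolling, no jump tables, no alignment and no non-temporal stores, and the names `PreloadSketch` and `Negate` are hypothetical. It only shows how preloading the two edge vectors removes the scalar remainder loop for inputs of at least one vector.

```csharp
using System;
using System.Runtime.Intrinsics;

static class PreloadSketch
{
    // Hypothetical example operation: dst[i] = -src[i].
    public static void Negate(ReadOnlySpan<int> src, Span<int> dst)
    {
        if (dst.Length < src.Length)
        {
            throw new ArgumentException("Destination is too short.", nameof(dst));
        }

        int count = Vector128<int>.Count;
        if (src.Length < count)
        {
            // Small input: scalar loop here (the real code uses a jump table).
            for (int i = 0; i < src.Length; i++)
            {
                dst[i] = -src[i];
            }
            return;
        }

        // Compute the first and last vectors before any store happens.
        int lastStart = src.Length - count;
        Vector128<int> first = -Vector128.Create<int>(src);
        Vector128<int> last = -Vector128.Create<int>(src.Slice(lastStart));

        // Main loop over the whole vectors strictly between the two edges.
        for (int i = count; i < lastStart; i += count)
        {
            (-Vector128.Create<int>(src.Slice(i))).CopyTo(dst.Slice(i));
        }

        // The edge stores may overlap the loop's output; they rewrite the same values.
        first.CopyTo(dst);
        last.CopyTo(dst.Slice(lastStart));
    }
}
```

For an input of 11 elements and 4 elements per vector, the loop handles indices 4 to 7, the first vector covers 0 to 3, and the last vector covers 7 to 10, so index 7 is simply computed twice.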

This is really useful. And something I can add on in future. Thanks for the detailed comment

tannergooding · 2024-04-04T18:45:11Z

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128_1.cs

@@ -692,6 +692,11 @@ private string ToString([StringSyntax(StringSyntaxAttribute.NumericFormat)] stri
        // New Surface Area
        //

+        static bool ISimdVector<Vector128<T>, T>.AnyMatches(Vector128<T> vector)


We did review/approve this (#98055 (comment)) and settled on bool Any(Vector128<T> vector, T value) and bool AnyWhereAllBitsSet(Vector<T> vector)

It would be nice to fix this to follow that. The PR otherwise looks good and should be mergeable.

So, the way I understand is we do not need AnyMatches() anymore

And Any and AnyWhereAllBitsSet would look something like the following:

static bool ISimdVector<Vector512<T>, T>.AnyWhereAllBitsSet(Vector512<T> vector)
{
    return (vector.EqualsAny(Vector512<T>.AllBitsSet));
}

static bool ISimdVector<Vector512<T>, T>.Any(Vector512<T> vector, T value)
{
    return (vector.EqualsAny(Vector512.Create((T)value)));
}

Yes, essentially, with the ability for the JIT to recognize these as intrinsic and optimize them more in appropriate scenarios (but that's not necessary for this PR)
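
Since `ISimdVector` is internal to the runtime, the two agreed-upon helpers cannot be called from user code. The sketch below reproduces the same semantics as ordinary static methods over `Vector128<T>`, using the public `Vector128.EqualsAny`. The class name `AnySketch` is hypothetical, and the `struct` constraint is added only so the code compiles on runtimes where `Vector128<T>` still requires it.

```csharp
using System;
using System.Runtime.Intrinsics;

static class AnySketch
{
    // True if at least one element of 'vector' equals 'value'.
    public static bool Any<T>(Vector128<T> vector, T value) where T : struct
        => Vector128.EqualsAny(vector, Vector128.Create(value));

    // True if at least one element of 'vector' has every bit set.
    public static bool AnyWhereAllBitsSet<T>(Vector128<T> vector) where T : struct
        => Vector128.EqualsAny(vector, Vector128<T>.AllBitsSet);
}
```

Note that for floating-point element types an all-bits-set element is a NaN, which an element-wise equality comparison does not match, so this sketch should only be relied on for integer elements.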

Done - took out AnyMatches() and added AnyWhereAllBitsSet() and Any()

@tannergooding Does this look correct? Anything else you'd like me to fix?

tannergooding · 2024-04-08T15:54:24Z

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128_1.cs

@@ -692,6 +692,16 @@ private string ToString([StringSyntax(StringSyntaxAttribute.NumericFormat)] stri
        // New Surface Area
        //

+        static bool ISimdVector<Vector128<T>, T>.AnyWhereAllBitsSet(Vector128<T> vector)
+        {
+            return (Vector128.EqualsAny(vector, Vector128<T>.AllBitsSet));


nit: unnecessary parens here and in Any

If you could fix that in a follow up PR, that'd be great (going to merge this)

Sounds good! Will put it up later today.

DrewScoggins · 2024-04-11T16:34:56Z

Regressions
Linux Ampere arm64: #100922
Linux x64: dotnet/perf-autofiling-issues#32721
Windows x64: dotnet/perf-autofiling-issues#32733

@tannergooding @DeepakRajendrakumaran

matouskozak · 2024-04-16T16:05:16Z

Possible Mono regressions:

Linux x64 AOT-llvm: [Perf] Linux/x64: 11 Regressions on 4/8/2024 6:26:42 PM #101124
Linux arm64 AOT-llvm: [Perf] Linux/arm64: 22 Regressions on 4/8/2024 7:16:22 PM #101127
Linux x64 interpreter: [Perf] Linux/x64: 2 Regressions on 4/8/2024 2:55:20 AM perf-autofiling-issues#32770

lewing · 2024-04-20T20:25:19Z

mono wasm aot regression dotnet/perf-autofiling-issues#32601

* Add AnyMatches() to iSimdVector interface
* Switch to iSimdVector and Align WidenAsciiToUtf16.
* Fixing perf
* Addressing Review Comments.
* Mirroring API change : dotnet#98055 (comment)

dotnet-issue-labeler bot added the area-System.Runtime.Intrinsics label Mar 19, 2024

dotnet-policy-service bot added the community-contribution label Mar 19, 2024

DeepakRajendrakumaran force-pushed the align branch from 76ecff2 to 0772e00 Compare March 20, 2024 16:57

DeepakRajendrakumaran marked this pull request as ready for review March 20, 2024 22:25

tannergooding reviewed Mar 21, 2024

BruceForstall mentioned this pull request Mar 21, 2024

Intel architecture improvements for .NET 9 #93196

Closed

DeepakRajendrakumaran force-pushed the align branch from d8bc357 to a7bb34a Compare March 22, 2024 18:54

DeepakRajendrakumaran force-pushed the align branch from a7bb34a to 39cb56a Compare March 26, 2024 17:04

DeepakRajendrakumaran force-pushed the align branch from 39cb56a to 39da22a Compare March 27, 2024 00:19
