-
-
Notifications
You must be signed in to change notification settings - Fork 887
Speed improvements to resize kernel (w/ SIMD) #1513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
42632c7
3f7deb5
874e951
941e173
493d04a
407c2d9
a7ca1b0
e2211c3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,6 +4,10 @@ | |
| using System; | ||
| using System.Numerics; | ||
| using System.Runtime.CompilerServices; | ||
| #if SUPPORTS_RUNTIME_INTRINSICS | ||
| using System.Runtime.Intrinsics; | ||
| using System.Runtime.Intrinsics.X86; | ||
| #endif | ||
|
|
||
| namespace SixLabors.ImageSharp.Processing.Processors.Transforms | ||
| { | ||
|
|
@@ -66,21 +70,87 @@ public Vector4 Convolve(Span<Vector4> rowSpan) | |
| [MethodImpl(InliningOptions.ShortMethod)] | ||
| public Vector4 ConvolveCore(ref Vector4 rowStartRef) | ||
| { | ||
| ref float horizontalValues = ref Unsafe.AsRef<float>(this.bufferPtr); | ||
| #if SUPPORTS_RUNTIME_INTRINSICS | ||
| if (Fma.IsSupported) | ||
| { | ||
| float* bufferStart = this.bufferPtr; | ||
| float* bufferEnd = bufferStart + (this.Length & ~3); | ||
| Vector256<float> result256_0 = Vector256<float>.Zero; | ||
| Vector256<float> result256_1 = Vector256<float>.Zero; | ||
| var mask = Vector256.Create(0, 0, 0, 0, 1, 1, 1, 1); | ||
|
|
||
| // Destination color components | ||
| Vector4 result = Vector4.Zero; | ||
| while (bufferStart < bufferEnd) | ||
| { | ||
| // It is important to use a single expression here so that the JIT will correctly use vfmadd231ps | ||
| // for the FMA operation, and execute it directly on the target register and reading directly from | ||
| // memory for the first parameter. This skips initializing a SIMD register, and an extra copy. | ||
| // The code below should compile in the following assembly on .NET 5 x64: | ||
| // | ||
| // vmovsd xmm2, [rax] ; load *(double*)bufferStart into xmm2 as [ab, _] | ||
| // vpermps ymm2, ymm1, ymm2 ; permute as a float YMM register to [a, a, a, a, b, b, b, b] | ||
| // vfmadd231ps ymm0, ymm2, [r8] ; result256_0 = FMA(pixels, factors) + result256_0 | ||
| // | ||
| // For tracking the codegen issue with FMA, see: https://github.com/dotnet/runtime/issues/12212. | ||
| // Additionally, we're also unrolling two computations per each loop iterations to leverage the | ||
| // fact that most CPUs have two ports to schedule multiply operations for FMA instructions. | ||
| result256_0 = Fma.MultiplyAdd( | ||
| Unsafe.As<Vector4, Vector256<float>>(ref rowStartRef), | ||
| Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)bufferStart).AsSingle(), mask), | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. According to what I learned from @saucecontrol, moving permutes out from an operation (dependency) chain and running them in a separate sequence might help performance. Thinking it further:
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was an issue with using locals here (I documented that in the comments here), where the JIT was picking the wrong instruction for the FMA operation and adding extra unnecessary memory copies. Doing this inline instead picked the right one that directly loaded the first argument from memory, which resulted in much better asseembly. I'm worried that moving things around will make the codegen worse again there. Also from what we discussed on Discord, there's usually 2 ports to perform FMA multiplications, so it might not be beneficial to do more than 2 in the same loop? I mean, other than the general marginal improvements just due to more unrolling, possibly.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, 2 is the max number of FMAs that can be scheduled at once, but it's a pipelined instruction, so you can get more benefit from scheduling more sequentially. I had an unroll by 4 in MagicScaler previously, but it wasn't a ton faster so I dropped it to reduce complexity. The way I get around having to shuffle/permute the weights in the inner loop is by pre-duplicating them in my kernel map. So my inner loop is just 2 reads of pixel values and 2 FMAs (with the weight reads contained in the FMA instruction). That approach also has the benefit of being backward compatible with |
||
| result256_0); | ||
|
|
||
| for (int i = 0; i < this.Length; i++) | ||
| { | ||
| float weight = Unsafe.Add(ref horizontalValues, i); | ||
| result256_1 = Fma.MultiplyAdd( | ||
| Unsafe.As<Vector4, Vector256<float>>(ref Unsafe.Add(ref rowStartRef, 2)), | ||
| Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)(bufferStart + 2)).AsSingle(), mask), | ||
| result256_1); | ||
|
|
||
| bufferStart += 4; | ||
| rowStartRef = ref Unsafe.Add(ref rowStartRef, 4); | ||
| } | ||
|
|
||
| result256_0 = Avx.Add(result256_0, result256_1); | ||
|
|
||
| if ((this.Length & 3) >= 2) | ||
| { | ||
| result256_0 = Fma.MultiplyAdd( | ||
| Unsafe.As<Vector4, Vector256<float>>(ref rowStartRef), | ||
| Avx2.PermuteVar8x32(Vector256.CreateScalarUnsafe(*(double*)bufferStart).AsSingle(), mask), | ||
| result256_0); | ||
|
|
||
| bufferStart += 2; | ||
| rowStartRef = ref Unsafe.Add(ref rowStartRef, 2); | ||
| } | ||
|
|
||
| // Vector4 v = offsetedRowSpan[i]; | ||
| Vector4 v = Unsafe.Add(ref rowStartRef, i); | ||
| result += v * weight; | ||
| Vector128<float> result128 = Sse.Add(result256_0.GetLower(), result256_0.GetUpper()); | ||
|
|
||
| if ((this.Length & 1) != 0) | ||
| { | ||
| result128 = Fma.MultiplyAdd( | ||
| Unsafe.As<Vector4, Vector128<float>>(ref rowStartRef), | ||
| Vector128.Create(*bufferStart), | ||
| result128); | ||
| } | ||
|
|
||
| return *(Vector4*)&result128; | ||
| } | ||
| else | ||
| #endif | ||
| { | ||
| // Destination color components | ||
| Vector4 result = Vector4.Zero; | ||
| float* bufferStart = this.bufferPtr; | ||
| float* bufferEnd = this.bufferPtr + this.Length; | ||
|
|
||
| return result; | ||
| while (bufferStart < bufferEnd) | ||
| { | ||
| // Vector4 v = offsetedRowSpan[i]; | ||
| result += rowStartRef * *bufferStart; | ||
|
|
||
| bufferStart++; | ||
| rowStartRef = ref Unsafe.Add(ref rowStartRef, 1); | ||
| } | ||
|
|
||
| return result; | ||
| } | ||
| } | ||
|
|
||
| /// <summary> | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be
Vector.Loadwith a ROS.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The codegen in this very specific case seems to actually be just a
vxorps(which is actually... weird), so I'm looking into this. It seems to be ok even though we're not using aROSthough, or actually better than that too.Will update in a bit 🙂
EDIT: this might actually be a JIT bug, investigating...
EDIT 2: yeah it's an inlining bug that only repros on .NET 5 with precisely these arguments 🤣
Opening an issue.