Delete AVX512 paths from IndexOf(string)
#117865
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory
@EgorBot -amd

```csharp
using System;
using System.Buffers;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<SingleString>(args: args);

public class SingleString
{
    private static readonly SearchValues<string> s_values = SearchValues.Create([Needle], StringComparison.Ordinal);
    private static readonly SearchValues<string> s_valuesIC = SearchValues.Create([Needle], StringComparison.OrdinalIgnoreCase);

    private static readonly string s_text_noMatches = new('a', Length);
    private static readonly string s_text_falsePositives = string.Concat(Enumerable.Repeat("Sherlock Holm_s", Length / Needle.Length));

    public const int Length = 100_000;
    public const string Needle = "Sherlock Holmes";

    [Benchmark] public void Throughput() => s_text_noMatches.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_Throughput() => s_text_noMatches.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_ThroughputIC() => s_text_noMatches.AsSpan().ContainsAny(s_valuesIC);

    [Benchmark] public void FalsePositives() => s_text_falsePositives.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_FalsePositives() => s_text_falsePositives.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_FalsePositivesIC() => s_text_falsePositives.AsSpan().ContainsAny(s_valuesIC);
}
```
@MihuBot benchmark Regex_Industry https://github.com/MihaZupan/performance/tree/compiled-regex-only -medium
- System.Text.RegularExpressions.Tests.Perf_Regex_Industry_SliceSlice
- System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock
- System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Mariomkas
- System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Leipzig
- System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple
The review thread below is on this hunk of the diff:

```diff
     }
 }
-else if (Vector256.IsHardwareAccelerated && searchSpaceMinusValueTailLength - Vector256<ushort>.Count >= 0)
+if (Vector256.IsHardwareAccelerated && searchSpaceMinusValueTailLength - Vector256<ushort>.Count >= 0)
```
I'd really rather we not delete this. The issue isn't really V512; it's the algorithm/loop being suboptimal for all the vector paths here. That's particularly prevalent in the scalar fallback, which pessimizes larger vector sizes more.
Fixing it isn't that much more work and would be a bigger win.
What do you mean by scalar fallback in this case?
Throughput numbers are just stressing the vectorized inner loop for large inputs with no matches.
Here's what the main loops look like in this case:
AVX2:

```asm
M01_L00:
       vpcmpeqw ymm3,ymm0,[rdx]
       vpcmpeqw ymm4,ymm1,[rdx+r10]
       vpcmpeqw ymm5,ymm2,[rdx+r9]
       vpternlogd ymm5,ymm4,ymm3,80
       vptest ymm5,ymm5
       jne short M01_L02
M01_L01:
       add rdx,20
       cmp rdx,r8
       jbe short M01_L00
```

AVX-512:

```asm
M01_L00:
       vpcmpeqw k1,zmm0,[rdx]
       vpmovm2w zmm3,k1
       vpcmpeqw k1,zmm1,[rdx+r10]
       vpmovm2w zmm4,k1
       vpcmpeqw k1,zmm2,[rdx+r9]
       vpmovm2w zmm5,k1
       vpternlogd zmm5,zmm4,zmm3,80
       vptestmb k1,zmm5,zmm5
       kortestq k1,k1
       nop dword ptr [rax]
       jne short M01_L02
M01_L01:
       add rdx,40
       cmp rdx,r8
       jbe short M01_L00
```
> What do you mean by scalar fallback in this case?
The ShortInput path that is hit is pessimized for larger vector sizes as it must process 2-4x as many elements as non-vectorized. While this doesn't necessarily show up for some bigger inputs, it will show up for small inputs and for inputs that have trailing elements. Additionally, for the whole method the non-idiomatic loops with goto can pessimize various JIT optimizations, control flow analysis, and other things.
Longer term, these should all be rewritten to follow a "better" pattern which helps minimize the dispatch branching and which allows the trailing elements to also be vectorized. -- Ideally we're generally using a pattern like TensorPrimitives uses, where the core loop/trailing logic is centralized and we're just specifying the inner loops and exit conditions.
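To make that shape concrete, here is a minimal sketch of the centralized pattern being described (the names and structure below are illustrative, not the actual TensorPrimitives or SpanHelpers code): the driver owns the main loop and the vectorized trailing step, and callers only supply the per-element/per-vector check.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Illustrative only: callers plug in the scalar and per-vector predicate.
internal interface IAnyPredicate<T> where T : struct
{
    bool Scalar(T value);
    Vector128<T> Vector(Vector128<T> values); // returns a per-element match mask
}

internal static class CentralizedLoop
{
    // The driver owns the main loop and the trailing-element handling.
    internal static bool Any<T, TPredicate>(ReadOnlySpan<T> span, TPredicate predicate)
        where T : struct
        where TPredicate : struct, IAnyPredicate<T>
    {
        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<T>.Count)
        {
            ref T start = ref MemoryMarshal.GetReference(span);
            nuint lastVectorStart = (nuint)(span.Length - Vector128<T>.Count);
            nuint i = 0;

            // Main loop: whole vectors only.
            for (; i < lastVectorStart; i += (nuint)Vector128<T>.Count)
            {
                if (predicate.Vector(Vector128.LoadUnsafe(ref start, i)) != Vector128<T>.Zero)
                {
                    return true;
                }
            }

            // Trailing elements: one final (possibly overlapping) vector instead of a scalar tail.
            return predicate.Vector(Vector128.LoadUnsafe(ref start, lastVectorStart)) != Vector128<T>.Zero;
        }

        // Scalar path only for inputs shorter than one vector.
        foreach (T value in span)
        {
            if (predicate.Scalar(value))
            {
                return true;
            }
        }

        return false;
    }
}
```

The key point of the pattern is that the final partial vector is handled with one overlapping vector load rather than a scalar tail, so widening the vector size doesn't grow the scalar work.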
> Here's what the main loops look like in this case:
The problem with the main loop is the vpcmpeqw, vpmovm2w sequences. This is a really trivial issue related to the fact that the bitwise operations (and/andn/or/xor) are normalized to having a base type of int/uint, since the underlying instructions only support these sizes due to embedded broadcast/masking support.
The check that looks for and(cvtmasktovec(op1), cvtmasktovec(op2)) sequences was requiring all three base types to match, when it actually only needs cvtmasktovec(op1) and cvtmasktovec(op2) to match, with the replacement andmask(op1, op2) then tracking that base type.
The following PR resolves that: #117887
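For context, the managed code that produces this compare + and + "any match" shape looks roughly like the following (an illustrative sketch with made-up names, not the actual SpanHelpers source):

```csharp
using System.Runtime.Intrinsics;

static class MatchSketch
{
    // Each Vector512.Equals produces a mask (a k-register under AVX-512); the & combines
    // them, and the != Zero comparison is the "are there any matches" test discussed here.
    static bool AnyCandidate(
        Vector512<ushort> value0, Vector512<ushort> value1, Vector512<ushort> value2,
        Vector512<ushort> data0, Vector512<ushort> data1, Vector512<ushort> data2)
    {
        Vector512<ushort> candidates = Vector512.Equals(value0, data0)
                                     & Vector512.Equals(value1, data1)
                                     & Vector512.Equals(value2, data2);
        return candidates != Vector512<ushort>.Zero;
    }
}
```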
Codegen before/after that fix:

```diff
- vpcmpeqw k1, zmm6, zmmword ptr [rsi]
- vpmovm2w zmm0, k1
- vpcmpeqw k1, zmm7, zmmword ptr [rsi+r14]
- vpmovm2w zmm1, k1
- vpcmpeqw k1, zmm8, zmmword ptr [rsi+r15]
- vpmovm2w zmm2, k1
- vpternlogd zmm2, zmm1, zmm0, -128
- vptestmb k1, zmm2, zmm2
- kortestq k1, k1
+ vpcmpeqw k1, zmm6, zmmword ptr [rsi]
+ vpcmpeqw k2, zmm7, zmmword ptr [rsi+r14]
+ kandd k1, k1, k2
+ vpcmpeqw k2, zmm8, zmmword ptr [rsi+r15]
+ kandd k1, k1, k2
+ vpmovm2w zmm0, k1
+ vptestmb k1, zmm0, zmm0
+ kortestq k1, k1
```

Now the codegen still isn't "ideal" because we end up converting the mask to a vector to do the "are there any matches" check (this is the vpmovm2w, vptestmb, kortestq). That is a little more complicated to fix since it requires moving some of the op_Inequality transforms from LIR (lowering) into HIR (morph, valuenum, etc.). This is planned work, just not something we've completed yet.
> The ShortInput path that is hit is pessimized for larger vector sizes as it must process 2-4x as many elements as non-vectorized. While this doesn't necessarily show up for some bigger inputs, it will show up for small inputs and for inputs that have trailing elements.
I'm not sure I follow? This PR doesn't change how short inputs behave here.
The ShortInput path is only used for inputs sized relative to Vector128's length; whether it's taken does not depend on whether the system has Avx512 or Avx2 support. Trailing elements are also processed with a vectorized step.
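To illustrate that point, here is a sketch of the dispatch shape only (illustrative, not the actual corelib source): the ShortInput cutoff is tied to Vector128<ushort>.Count, so it is unaffected by which wider vector branch exists.

```csharp
using System.Runtime.Intrinsics;

static class DispatchSketch
{
    // The short-input cutoff is expressed in terms of Vector128<ushort>.Count, so whether
    // it is taken doesn't change with AVX-512/AVX2 support; only which vectorized loop
    // runs afterwards does.
    static string ChoosePath(int searchSpaceMinusValueTailLength)
    {
        if (!Vector128.IsHardwareAccelerated || searchSpaceMinusValueTailLength < Vector128<ushort>.Count)
            return "ShortInput";        // same cutoff regardless of wider vector support

        if (Vector512.IsHardwareAccelerated && searchSpaceMinusValueTailLength - Vector512<ushort>.Count >= 0)
            return "Vector512 loop";    // the branch this PR removes

        if (Vector256.IsHardwareAccelerated && searchSpaceMinusValueTailLength - Vector256<ushort>.Count >= 0)
            return "Vector256 loop";    // condition as shown in the diff above

        return "Vector128 loop";
    }
}
```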
> The following PR resolves that: #117887

> This is planned work, just not something we've completed yet.
Thanks! I'll double-check what the numbers look like with your change.
Assuming it's still worse/not better compared to the Vector256 paths, does it make sense to keep it around?
E.g. We've reverted Avx512 support from IndexOfAnyAsciiSearcher over much smaller regressions even though there are meaningful throughput benefits on longer inputs there (#93222), whereas it's just worse across the board here.
> I'm not sure I follow? This PR doesn't change how short inputs behave here.
It was a general comment about how this and several other vectorized code paths in corelib are written in a way that pessimizes the larger vector sizes and/or small inputs. It wasn't a comment about the changes in this PR, rather a general "issue" that helps make V512 perform worse than it should. If we were to fix those issues, all the paths should get faster.
> Assuming it's still worse/not better compared to the Vector256 paths, does it make sense to keep it around?
> E.g. We've reverted Avx512 support from IndexOfAnyAsciiSearcher over much smaller regressions even though there are meaningful throughput benefits on longer inputs there (#93222), whereas it's just worse across the board here.
I believe it's still worth keeping and worth continuing to incrementally track the improvements. The more we revert, the harder it is to test/validate the improvements as they go in. That applies to AVX512 and SVE alike, both of which have different considerations for mask handling.
The long-term goal is to have centralized SIMD looping logic and to utilize things like the (currently internal) ISimdVector; we're getting closer to that each release and continuing to get large improvements to the handling and codegen across the board.
A lot of the places where the perf is found to be suboptimal also have fairly trivial fixes, like the one I did. If we file issues for them and go ensure the pattern recognition is being handled correctly, that's far better for all the vector paths. The same goes for utilizing helpers like Any, All, None, Count, IndexOf, and LastIndexOf that now exist on the vector types.
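As a concrete example of the kind of pattern those helpers replace, here is the manual "index of first match within a vector" form (a sketch; only the long-standing ExtractMostSignificantBits and TrailingZeroCount APIs are used, and the new helpers' exact signatures aren't reproduced here):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class FirstMatchSketch
{
    // Manual pattern: compare, extract one bit per element, then locate the lowest set bit.
    // The helpers mentioned above (Any/IndexOf/etc. on the vector types) are intended to
    // express this intent directly and let the JIT pick the best lowering per ISA.
    static int FirstMatchIndex(Vector128<ushort> data, Vector128<ushort> value)
    {
        Vector128<ushort> eq = Vector128.Equals(data, value);
        if (eq == Vector128<ushort>.Zero)
            return -1;

        uint mask = Vector128.ExtractMostSignificantBits(eq);
        return BitOperations.TrailingZeroCount(mask);
    }
}
```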
With your change the Regex SliceSlice benchmark (just a ton of span.IndexOf(string)-like searches) shows a 40% regression (as in taking 1.4x as long) with the Avx512 paths compared to Avx2 on Zen hardware.
Should we reconsider targeted arch-specific opt-outs for cases like this, where performance diverges this much and we consider the affected code paths important?
We generally don't do target-specific opt-outs.
If you're really that concerned with the regression and it showing up in the real world, then I'd just go with the removal for now. But please ensure a tracking issue is filed so this can be added back when the direct kortest logic is added in .NET 11.
With #118108 now merged, Regex won't hit these paths anymore for ASCII values, so I'm less concerned about the real-world impact.
IndexOf(string) is still impacted, but switching to SearchValues<string> can mitigate that.
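For reference, that mitigation looks roughly like this (a sketch mirroring the benchmark at the top of the thread; the needle and names are illustrative):

```csharp
using System;
using System.Buffers;

static class NeedleSearch
{
    // Cache the SearchValues<string> once and search with IndexOfAny instead of IndexOf(string).
    private static readonly SearchValues<string> s_needle =
        SearchValues.Create(["Sherlock Holmes"], StringComparison.Ordinal);

    public static int FindNeedle(ReadOnlySpan<char> text) => text.IndexOfAny(s_needle);
}
```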
In general it's unfortunate to keep Vector512 paths around if they aren't improving perf, but hopefully the potential future changes you mentioned can help here.
These paths have worse throughput than AVX2 on Zen 4, Zen 5, and Cascade Lake.
On Sapphire Rapids they have better throughput, but seemingly a much harder time dealing with false positives.
- Zen 4: EgorBot/runtime-utils#440 (comment)
- Zen 5
- Cascade Lake: EgorBot/runtime-utils#440 (comment)
- Sapphire Rapids: EgorBot/runtime-utils#440 (comment)
Regex results from Zen 4: #117865 (comment)