@MihaZupan
Member

@MihaZupan MihaZupan commented Jul 20, 2025

These paths have worse throughput than AVX2 on Zen 4, Zen 5, and Cascade Lake.
On Sapphire Rapids they have better throughput, but seemingly a much harder time dealing with false positives.

Zen 4

EgorBot/runtime-utils#440 (comment)

BenchmarkDotNet v0.15.0, Linux Ubuntu 24.04.2 LTS (Noble Numbat)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
  Job-CVEPVM : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-BOOZVV : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method | Toolchain | Mean | Error | Ratio |
|---|---|---:|---:|---:|
| Throughput | Main | 4.391 μs | 0.0011 μs | 1.00 |
| Throughput | PR | 3.415 μs | 0.0005 μs | 0.78 |
| SV_Throughput | Main | 5.484 μs | 0.0016 μs | 1.00 |
| SV_Throughput | PR | 5.119 μs | 0.0013 μs | 0.93 |
| SV_ThroughputIC | Main | 6.366 μs | 0.0010 μs | 1.00 |
| SV_ThroughputIC | PR | 5.119 μs | 0.0020 μs | 0.80 |
| FalsePositives | Main | 16.712 μs | 0.0034 μs | 1.00 |
| FalsePositives | PR | 12.498 μs | 0.0019 μs | 0.75 |
| SV_FalsePositives | Main | 13.861 μs | 0.0075 μs | 1.00 |
| SV_FalsePositives | PR | 9.536 μs | 0.0049 μs | 0.69 |
| SV_FalsePositivesIC | Main | 15.623 μs | 0.0501 μs | 1.00 |
| SV_FalsePositivesIC | PR | 11.264 μs | 0.0217 μs | 0.72 |

Zen 5
BenchmarkDotNet v0.14.1-develop (2025-07-20), Windows 11 (10.0.26100.4652)
AMD Ryzen 9 9950X 4.30GHz, 1 CPU, 32 logical and 16 physical cores
.NET SDK 10.0.100-preview.6.25358.103
  [Host]    : .NET 10.0.0 (10.0.25.35903), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  MediumRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Job=MediumRun  IterationCount=15  LaunchCount=2
WarmupCount=10

| Method | Toolchain | Mean | Error | Ratio |
|---|---|---:|---:|---:|
| Throughput | main | 2,347.551 ns | 50.4220 ns | 1.00 |
| Throughput | pr | 1,872.423 ns | 11.5169 ns | 0.80 |
| SV_Throughput | main | 2,800.805 ns | 30.3129 ns | 1.00 |
| SV_Throughput | pr | 2,605.061 ns | 20.0059 ns | 0.93 |
| SV_ThroughputIC | main | 2,961.727 ns | 28.9590 ns | 1.00 |
| SV_ThroughputIC | pr | 2,665.752 ns | 30.6298 ns | 0.90 |
| FalsePositives | main | 10,143.346 ns | 93.1595 ns | 1.00 |
| FalsePositives | pr | 7,060.672 ns | 59.1907 ns | 0.70 |
| SV_FalsePositives | main | 7,983.976 ns | 87.6681 ns | 1.00 |
| SV_FalsePositives | pr | 5,691.896 ns | 61.9204 ns | 0.71 |
| SV_FalsePositivesIC | main | 8,759.998 ns | 71.9972 ns | 1.00 |
| SV_FalsePositivesIC | pr | 6,733.671 ns | 55.4418 ns | 0.77 |

Cascade Lake

EgorBot/runtime-utils#440 (comment)

BenchmarkDotNet v0.15.0, Linux Ubuntu 24.04.2 LTS (Noble Numbat)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-MVJDOO : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-KYJYDJ : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method | Toolchain | Mean | Error | Ratio |
|---|---|---:|---:|---:|
| Throughput | Main | 7.203 μs | 0.0013 μs | 1.00 |
| Throughput | PR | 7.226 μs | 0.0025 μs | 1.00 |
| SV_Throughput | Main | 9.835 μs | 0.0015 μs | 1.00 |
| SV_Throughput | PR | 7.658 μs | 0.0006 μs | 0.78 |
| SV_ThroughputIC | Main | 10.094 μs | 0.0013 μs | 1.00 |
| SV_ThroughputIC | PR | 7.728 μs | 0.0012 μs | 0.77 |
| FalsePositives | Main | 22.666 μs | 0.0045 μs | 1.00 |
| FalsePositives | PR | 18.466 μs | 0.0016 μs | 0.81 |
| SV_FalsePositives | Main | 16.411 μs | 0.0058 μs | 1.00 |
| SV_FalsePositives | PR | 14.120 μs | 0.0022 μs | 0.86 |
| SV_FalsePositivesIC | Main | 21.348 μs | 0.0024 μs | 1.00 |
| SV_FalsePositivesIC | PR | 18.790 μs | 0.0032 μs | 0.88 |

Sapphire Rapids

EgorBot/runtime-utils#440 (comment)

BenchmarkDotNet v0.15.0, Linux Ubuntu 24.04 LTS (Noble Numbat)
Intel Xeon Platinum 8488C, 1 CPU, 16 logical and 8 physical cores
  Job-YSWOSK : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XTUVRO : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method | Toolchain | Mean | Error | Ratio |
|---|---|---:|---:|---:|
| Throughput | Main | 5.506 μs | 0.1071 μs | 1.00 |
| Throughput | PR | 6.694 μs | 0.1096 μs | 1.22 |
| SV_Throughput | Main | 5.806 μs | 0.1093 μs | 1.00 |
| SV_Throughput | PR | 6.446 μs | 0.1041 μs | 1.11 |
| SV_ThroughputIC | Main | 5.958 μs | 0.1188 μs | 1.00 |
| SV_ThroughputIC | PR | 7.406 μs | 0.0872 μs | 1.24 |
| FalsePositives | Main | 30.023 μs | 0.2173 μs | 1.00 |
| FalsePositives | PR | 14.169 μs | 0.1401 μs | 0.47 |
| SV_FalsePositives | Main | 30.185 μs | 0.2823 μs | 1.00 |
| SV_FalsePositives | PR | 11.177 μs | 0.1497 μs | 0.37 |
| SV_FalsePositivesIC | Main | 17.555 μs | 0.1627 μs | 1.00 |
| SV_FalsePositivesIC | PR | 14.928 μs | 0.1665 μs | 0.85 |

Regex results from Zen 4: #117865 (comment)

@MihaZupan MihaZupan added this to the 10.0.0 milestone Jul 20, 2025
@MihaZupan MihaZupan self-assigned this Jul 20, 2025
@Copilot Copilot AI review requested due to automatic review settings July 20, 2025 19:00
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

@MihaZupan
Member Author

@EgorBot -amd

using System;
using System.Buffers;
using System.Linq; // Enumerable.Repeat
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<SingleString>(args: args);

public class SingleString
{
    private static readonly SearchValues<string> s_values = SearchValues.Create([Needle], StringComparison.Ordinal);
    private static readonly SearchValues<string> s_valuesIC = SearchValues.Create([Needle], StringComparison.OrdinalIgnoreCase);
    private static readonly string s_text_noMatches = new('a', Length);
    private static readonly string s_text_falsePositives = string.Concat(Enumerable.Repeat("Sherlock Holm_s", Length / Needle.Length));

    public const int Length = 100_000;
    public const string Needle = "Sherlock Holmes";

    [Benchmark] public void Throughput() => s_text_noMatches.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_Throughput() => s_text_noMatches.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_ThroughputIC() => s_text_noMatches.AsSpan().ContainsAny(s_valuesIC);

    [Benchmark] public void FalsePositives() => s_text_falsePositives.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_FalsePositives() => s_text_falsePositives.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_FalsePositivesIC() => s_text_falsePositives.AsSpan().ContainsAny(s_valuesIC);
}

@MihuBot

MihuBot commented Jul 20, 2025

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_SliceSlice
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  OutlierMode=Default  IterationCount=15
LaunchCount=2  MemoryRandomization=Default  MinIterationCount=3
WarmupCount=10

| Method | Toolchain | Options | Mean | Error | Ratio | Allocated | Alloc Ratio |
|---|---|---|---:|---:|---:|---:|---:|
| Count | Main | Compiled | 340.8 ms | 0.23 ms | 1.00 | 1072 B | 1.00 |
| Count | PR | Compiled | 265.6 ms | 0.59 ms | 0.78 | 392 B | 0.37 |
| Count | Main | IgnoreCase, Compiled | 404.6 ms | 0.27 ms | 1.00 | 1072 B | 1.00 |
| Count | PR | IgnoreCase, Compiled | 304.3 ms | 0.63 ms | 0.75 | 392 B | 0.37 |

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  OutlierMode=DontRemove  IterationCount=15
LaunchCount=2  MemoryRandomization=True  WarmupCount=10
Method Toolchain Pattern Mean Error Ratio Allocated Alloc Ratio
Count Main .* 539,957.77 ns 2,777.656 ns 1.00 2 B 1.00
Count PR .* 536,412.81 ns 1,338.524 ns 0.99 2 B 1.00
Count Main (?i)Holmes 49,553.91 ns 74.842 ns 1.00 - NA
Count PR (?i)Holmes 39,510.00 ns 84.775 ns 0.80 - NA
Count Main (?i)Sher[a-z]+|Hol[a-z]+ 119,455.87 ns 23,264.166 ns 1.09 - NA
Count PR (?i)Sher[a-z]+|Hol[a-z]+ 120,209.41 ns 24,043.653 ns 1.09 1 B NA
Count Main (?i)Sherlock 42,625.47 ns 53.009 ns 1.00 - NA
Count PR (?i)Sherlock 32,006.02 ns 97.674 ns 0.75 - NA
Count Main (?i)Sherlock Holmes 42,313.90 ns 268.481 ns 1.00 - NA
Count PR (?i)Sherlock Holmes 31,753.20 ns 48.300 ns 0.75 - NA
Count Main (?i)Sherlock|Holmes|Watson 121,294.28 ns 24,644.402 ns 1.09 - NA
Count PR (?i)Sherlock|Holmes|Watson 121,771.81 ns 24,192.418 ns 1.10 1 B NA
Count Main (?i)Sherlock|(...)er|John|Baker [49] 189,067.55 ns 21,819.446 ns 1.03 1 B 1.00
Count PR (?i)Sherlock|(...)er|John|Baker [49] 192,967.18 ns 23,294.325 ns 1.05 1 B 1.00
Count Main (?i)the 196,782.71 ns 3,093.229 ns 1.00 1 B 1.00
Count PR (?i)the 205,049.49 ns 22,583.278 ns 1.04 1 B 1.00
Count Main (?m)^Sherlock(...)rlock Holmes$ [37] 37,926.65 ns 66.683 ns 1.00 - NA
Count PR (?m)^Sherlock(...)rlock Holmes$ [37] 30,733.54 ns 93.874 ns 0.81 - NA
Count Main (?s).* 36.26 ns 2.163 ns 1.01 - NA
Count PR (?s).* 32.79 ns 0.128 ns 0.91 - NA
Count Main [^\\n]* 547,209.10 ns 1,460.851 ns 1.00 2 B 1.00
Count PR [^\\n]* 535,933.92 ns 1,465.196 ns 0.98 2 B 1.00
Count Main [a-q][^u-z]{13}x 23,140.50 ns 63.166 ns 1.00 - NA
Count PR [a-q][^u-z]{13}x 23,235.80 ns 86.274 ns 1.00 - NA
Count Main [a-zA-Z]+ing 3,336,651.29 ns 3,147.391 ns 1.00 9 B 1.00
Count PR [a-zA-Z]+ing 3,336,777.42 ns 7,286.788 ns 1.00 11 B 1.22
Count Main \b\w+n\b 6,575,536.86 ns 29,220.552 ns 1.00 22 B 1.00
Count PR \b\w+n\b 6,536,072.91 ns 7,455.555 ns 0.99 20 B 0.91
Count Main \p{L} 8,790,393.70 ns 16,659.746 ns 1.00 35 B 1.00
Count PR \p{L} 9,158,574.55 ns 84,642.571 ns 1.04 31 B 0.89
Count Main \p{Ll} 8,819,521.28 ns 45,229.860 ns 1.00 35 B 1.00
Count PR \p{Ll} 8,516,091.90 ns 27,825.246 ns 0.97 35 B 1.00
Count Main \p{Lu} 365,901.64 ns 10,173.163 ns 1.00 1 B 1.00
Count PR \p{Lu} 339,622.68 ns 7,002.437 ns 0.93 1 B 1.00
Count Main \s[a-zA-Z]{0,12}ing\s 3,549,099.81 ns 71,875.793 ns 1.00 12 B 1.00
Count PR \s[a-zA-Z]{0,12}ing\s 3,454,495.50 ns 3,574.149 ns 0.97 11 B 0.92
Count Main \w+ 4,119,698.97 ns 54,085.001 ns 1.00 21 B 1.00
Count PR \w+ 4,074,142.22 ns 12,838.814 ns 0.99 21 B 1.00
Count Main \w+\s+Holmes 2,814,787.47 ns 38,441.312 ns 1.00 11 B 1.00
Count PR \w+\s+Holmes 2,793,585.60 ns 3,866.369 ns 0.99 11 B 1.00
Count Main \w+\s+Holmes\s+\w+ 3,248,814.52 ns 106,821.302 ns 1.00 12 B 1.00
Count PR \w+\s+Holmes\s+\w+ 3,171,086.20 ns 19,453.611 ns 0.98 10 B 0.83
Count Main aei 35,004.38 ns 415.236 ns 1.00 - NA
Count PR aei 25,099.62 ns 615.743 ns 0.72 - NA
Count Main aqj 35,151.89 ns 386.042 ns 1.00 - NA
Count PR aqj 25,078.21 ns 625.769 ns 0.71 - NA
Count Main Holmes 45,333.62 ns 68.512 ns 1.00 - NA
Count PR Holmes 36,774.43 ns 198.786 ns 0.81 - NA
Count Main Holmes.{0,25}(...).{0,25}Holmes [39] 48,103.12 ns 78.124 ns 1.00 - NA
Count PR Holmes.{0,25}(...).{0,25}Holmes [39] 48,029.25 ns 94.056 ns 1.00 - NA
Count Main Sher[a-z]+|Hol[a-z]+ 49,529.84 ns 114.013 ns 1.00 - NA
Count PR Sher[a-z]+|Hol[a-z]+ 49,345.76 ns 176.323 ns 1.00 - NA
Count Main Sherlock 38,609.91 ns 43.878 ns 1.00 - NA
Count PR Sherlock 30,366.89 ns 147.805 ns 0.79 - NA
Count Main Sherlock Holmes 38,348.62 ns 233.008 ns 1.00 - NA
Count PR Sherlock Holmes 30,626.54 ns 277.150 ns 0.80 - NA
Count Main Sherlock\s+Holmes 39,082.30 ns 102.803 ns 1.00 - NA
Count PR Sherlock\s+Holmes 31,595.29 ns 149.107 ns 0.81 - NA
Count Main Sherlock|Holmes 45,480.20 ns 156.046 ns 1.00 - NA
Count PR Sherlock|Holmes 45,591.13 ns 132.548 ns 1.00 - NA
Count Main Sherlock|Holmes|Watson 59,117.55 ns 60.167 ns 1.00 - NA
Count PR Sherlock|Holmes|Watson 59,255.69 ns 207.926 ns 1.00 - NA
Count Main Sherlock|Holm(...)er|John|Baker [45] 88,472.32 ns 385.285 ns 1.00 - NA
Count PR Sherlock|Holm(...)er|John|Baker [45] 88,480.52 ns 268.297 ns 1.00 - NA
Count Main Sherlock|Street 25,052.54 ns 87.574 ns 1.00 - NA
Count PR Sherlock|Street 25,126.84 ns 131.345 ns 1.00 - NA
Count Main the 163,106.59 ns 424.424 ns 1.00 1 B 1.00
Count PR the 149,097.68 ns 222.376 ns 0.91 - 0.00
Count Main The 49,971.16 ns 88.861 ns 1.00 - NA
Count PR The 40,607.05 ns 142.218 ns 0.81 - NA
Count Main the\s+\w+ 248,334.39 ns 3,462.552 ns 1.00 1 B 1.00
Count PR the\s+\w+ 269,876.82 ns 14,656.125 ns 1.09 1 B 1.00
Count Main zqj 35,114.66 ns 465.562 ns 1.00 - NA
Count PR zqj 25,099.07 ns 616.537 ns 0.72 - NA
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Mariomkas
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  IterationCount=15  LaunchCount=2
WarmupCount=10
Method Toolchain Pattern Mean Error Ratio Allocated Alloc Ratio
Ctor Main (?:(?:250-5]?[0-9][0-9]) [87] 19.05 μs 0.088 μs 1.00 30008 B 1.00
Ctor PR (?:(?:250-5]?[0-9][0-9]) [87] 19.14 μs 0.126 μs 1.00 30008 B 1.00
Count Main (?:(?:250-5]?[0-9][0-9]) [87] 2,669.29 μs 5.771 μs 1.00 15 B 1.00
Count PR (?:(?:250-5]?[0-9][0-9]) [87] 2,667.35 μs 1.906 μs 1.00 15 B 1.00
Ctor Main [\w]+://[^/\s(...)?(?:#[^\\s]*)? [51] 16.33 μs 0.161 μs 1.00 22904 B 1.00
Ctor PR [\w]+://[^/\s(...)?(?:#[^\\s]*)? [51] 16.13 μs 0.127 μs 0.99 22904 B 1.00
Count Main [\w]+://[^/\s(...)?(?:#[^\\s]*)? [51] 770.51 μs 2.507 μs 1.00 4 B 1.00
Count PR [\w]+://[^/\s(...)?(?:#[^\\s]*)? [51] 695.29 μs 3.747 μs 0.90 4 B 1.00
Ctor Main [\w\.+-]+@[\w\.-]+\.[\w\.-]+ 12.24 μs 0.050 μs 1.00 13888 B 1.00
Ctor PR [\w\.+-]+@[\w\.-]+\.[\w\.-]+ 12.27 μs 0.039 μs 1.00 13880 B 1.00
Count Main [\w\.+-]+@[\w\.-]+\.[\w\.-]+ 184.08 μs 0.227 μs 1.00 1 B 1.00
Count PR [\w\.+-]+@[\w\.-]+\.[\w\.-]+ 184.20 μs 0.291 μs 1.00 1 B 1.00
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Leipzig
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  OutlierMode=DontRemove  IterationCount=15
LaunchCount=2  MemoryRandomization=True  WarmupCount=10
Method Toolchain Pattern Mean Error Ratio Allocated Alloc Ratio
Count Main .{0,2}(Tom|Sawyer|Huckleberry|Finn) 186,696.3 μs 174.92 μs 1.00 979 B 1.00
Count PR .{0,2}(Tom|Sawyer|Huckleberry|Finn) 186,420.2 μs 162.26 μs 1.00 979 B 1.00
Count Main .{2,4}(Tom|Sawyer|Huckleberry|Finn) 192,434.0 μs 580.58 μs 1.00 984 B 1.00
Count PR .{2,4}(Tom|Sawyer|Huckleberry|Finn) 192,255.6 μs 122.54 μs 1.00 984 B 1.00
Count Main (?i)Tom|Sawyer|Huckleberry|Finn 2,933.2 μs 658.26 μs 1.13 14 B 1.00
Count PR (?i)Tom|Sawyer|Huckleberry|Finn 2,918.6 μs 665.35 μs 1.12 7 B 0.50
Count Main (?i)Twain 1,104.3 μs 1.87 μs 1.00 5 B 1.00
Count PR (?i)Twain 839.2 μs 1.80 μs 0.76 2 B 0.40
Count Main ([A-Za-z]awyer|[A-Za-z]inn)\s 13,374.4 μs 2.55 μs 1.00 45 B 1.00
Count PR ([A-Za-z]awyer|[A-Za-z]inn)\s 13,368.4 μs 4.71 μs 1.00 45 B 1.00
Count Main [a-z]shing 1,053.7 μs 2.56 μs 1.00 5 B 1.00
Count PR [a-z]shing 824.0 μs 5.68 μs 0.78 2 B 0.40
Count Main \p{Sm} 631.7 μs 5.71 μs 1.00 2 B 1.00
Count PR \p{Sm} 632.7 μs 2.95 μs 1.00 2 B 1.00
Count Main Huck[a-zA-Z]+|Saw[a-zA-Z]+ 1,627.7 μs 3.57 μs 1.00 6 B 1.00
Count PR Huck[a-zA-Z]+|Saw[a-zA-Z]+ 1,626.1 μs 1.17 μs 1.00 6 B 1.00
Count Main Tom.{10,25}river|river.{10,25}Tom 6,604.2 μs 3.09 μs 1.00 24 B 1.00
Count PR Tom.{10,25}river|river.{10,25}Tom 6,603.1 μs 1.70 μs 1.00 24 B 1.00
Count Main Tom|Sawyer|Huckleberry|Finn 2,638.2 μs 46.97 μs 1.00 11 B 1.00
Count PR Tom|Sawyer|Huckleberry|Finn 2,640.5 μs 10.94 μs 1.00 10 B 0.91
Count Main Twain 997.1 μs 1.45 μs 1.00 4 B 1.00
Count PR Twain 811.9 μs 3.34 μs 0.81 2 B 0.50
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  OutlierMode=DontRemove  IterationCount=15
LaunchCount=2  MemoryRandomization=True  WarmupCount=10
Method Toolchain Id Mean Error Ratio Allocated Alloc Ratio
IsMatch Main 0 20.39 ns 0.233 ns 1.00 - NA
IsMatch PR 0 21.48 ns 0.928 ns 1.05 - NA
IsMatch Main 1 43.72 ns 0.133 ns 1.00 - NA
IsMatch PR 1 43.94 ns 0.094 ns 1.01 - NA
IsMatch Main 2 49.56 ns 0.515 ns 1.00 - NA
IsMatch PR 2 48.87 ns 0.176 ns 0.99 - NA
IsMatch Main 3 85.44 ns 1.353 ns 1.00 - NA
IsMatch PR 3 86.50 ns 3.854 ns 1.01 - NA
IsMatch Main 4 91.86 ns 26.856 ns 1.13 - NA
IsMatch PR 4 72.63 ns 0.405 ns 0.89 - NA
IsMatch Main 5 72.06 ns 0.707 ns 1.00 - NA
IsMatch PR 5 71.40 ns 0.316 ns 0.99 - NA
IsMatch Main 6 21.61 ns 0.077 ns 1.00 - NA
IsMatch PR 6 21.51 ns 0.012 ns 1.00 - NA
IsMatch Main 7 21.32 ns 0.040 ns 1.00 - NA
IsMatch PR 7 21.50 ns 0.083 ns 1.01 - NA
IsMatch Main 8 21.92 ns 0.233 ns 1.00 - NA
IsMatch PR 8 21.78 ns 0.323 ns 0.99 - NA
IsMatch Main 9 23.52 ns 0.154 ns 1.00 - NA
IsMatch PR 9 23.33 ns 0.021 ns 0.99 - NA
IsMatch Main 10 23.84 ns 0.356 ns 1.00 - NA
IsMatch PR 10 23.57 ns 0.021 ns 0.99 - NA
IsMatch Main 11 22.93 ns 0.307 ns 1.00 - NA
IsMatch PR 11 22.63 ns 0.049 ns 0.99 - NA
IsMatch Main 12 26.02 ns 0.049 ns 1.00 - NA
IsMatch PR 12 26.72 ns 0.288 ns 1.03 - NA
IsMatch Main 13 26.22 ns 0.119 ns 1.00 - NA
IsMatch PR 13 26.77 ns 1.184 ns 1.02 - NA

     }
 }
-else if (Vector256.IsHardwareAccelerated && searchSpaceMinusValueTailLength - Vector256<ushort>.Count >= 0)
+if (Vector256.IsHardwareAccelerated && searchSpaceMinusValueTailLength - Vector256<ushort>.Count >= 0)
Member

I'd really rather we not delete this. The issue isn't really V512 itself; the algorithm/loop is suboptimal for all the vector paths here. This is particularly visible in the scalar fallback, which gets pessimized more as the vector size grows.

Fixing it isn't that much more work and would be a bigger win.

Member Author

What do you mean by scalar fallback in this case?
Throughput numbers are just stressing the vectorized inner loop for large inputs with no matches.

Member Author

Here's what the main loops look like in this case:

; Vector256 path (ymm registers)
M01_L00:
       vpcmpeqw  ymm3,ymm0,[rdx]
       vpcmpeqw  ymm4,ymm1,[rdx+r10]
       vpcmpeqw  ymm5,ymm2,[rdx+r9]
       vpternlogd ymm5,ymm4,ymm3,80
       vptest    ymm5,ymm5
       jne       short M01_L02
M01_L01:
       add       rdx,20
       cmp       rdx,r8
       jbe       short M01_L00

; Vector512 path (zmm registers)
M01_L00:
       vpcmpeqw  k1,zmm0,[rdx]
       vpmovm2w  zmm3,k1
       vpcmpeqw  k1,zmm1,[rdx+r10]
       vpmovm2w  zmm4,k1
       vpcmpeqw  k1,zmm2,[rdx+r9]
       vpmovm2w  zmm5,k1
       vpternlogd zmm5,zmm4,zmm3,80
       vptestmb  k1,zmm5,zmm5
       kortestq  k1,k1
       nop       dword ptr [rax]
       jne       short M01_L02
M01_L01:
       add       rdx,40
       cmp       rdx,r8
       jbe       short M01_L00

Member

> What do you mean by scalar fallback in this case?

The ShortInput path that is hit is pessimized for larger vector sizes as it must process 2-4x as many elements as non-vectorized. While this doesn't necessarily show up for some bigger inputs, it will show up for small inputs and for inputs that have trailing elements. Additionally, for the whole method the non-idiomatic loops with goto can pessimize various JIT optimizations, control flow analysis, and other things.

Longer term, these should all be rewritten to follow a "better" pattern which helps minimize the dispatch branching and which allows the trailing elements to also be vectorized. -- Ideally we're generally using a pattern like TensorPrimitives uses, where the core loop/trailing logic is centralized and we're just specifying the inner loops and exit conditions.
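One common way to keep trailing elements vectorized is to finish with a single vector load that overlaps the previously processed block, instead of falling back to a scalar loop. The following is only a minimal sketch of that idea for a single-character search; `OverlappingTailSketch` and `Contains` are hypothetical names, and this is not the CoreLib or TensorPrimitives implementation.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class OverlappingTailSketch
{
    // Hypothetical helper: reports whether `value` occurs anywhere in `span`.
    public static bool Contains(ReadOnlySpan<char> span, char value)
    {
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(span);
        int width = Vector128<ushort>.Count;

        if (!Vector128.IsHardwareAccelerated || data.Length < width)
        {
            // Scalar path, only for inputs shorter than one vector.
            foreach (ushort c in data)
            {
                if (c == value) return true;
            }
            return false;
        }

        Vector128<ushort> target = Vector128.Create((ushort)value);
        int lastStart = data.Length - width;

        // Main loop: full, non-overlapping blocks that start before `lastStart`.
        for (int i = 0; i < lastStart; i += width)
        {
            Vector128<ushort> block = Vector128.Create(data.Slice(i, width));
            if (Vector128.EqualsAny(block, target)) return true;
        }

        // Tail: one final load ending exactly at the end of the input.
        // It may overlap the last main-loop block, which is harmless here.
        Vector128<ushort> tail = Vector128.Create(data.Slice(lastStart, width));
        return Vector128.EqualsAny(tail, target);
    }
}
```

Re-reading a few elements is harmless for a yes/no search like this one; an operation that accumulates results (a count, for example) would need to mask out the overlapping lanes instead.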

> Here's what the main loops look like in this case:

The problem with the main loop is the vpcmpeqw, vpmovm2w sequences. This is a really trivial issue related to the fact that the bitwise operations (and/andn/or/xor) are normalized to a base type of int/uint, since the underlying instructions only support these sizes due to embedded broadcast/masking support.

The check that looks for and(cvtmasktovec(op1), cvtmasktovec(op2)) sequences was looking for all three base types to match, when it actually only needs cvtmasktovec(op1) and cvtmasktovec(op2) to match and then the replacement andmask(op1, op2) to track that base type.

The following PR resolves that: #117887

-      vpcmpeqw k1, zmm6, zmmword ptr [rsi]
-      vpmovm2w zmm0, k1
-      vpcmpeqw k1, zmm7, zmmword ptr [rsi+r14]
-      vpmovm2w zmm1, k1
-      vpcmpeqw k1, zmm8, zmmword ptr [rsi+r15]
-      vpmovm2w zmm2, k1
-      vpternlogd zmm2, zmm1, zmm0, -128
-      vptestmb k1, zmm2, zmm2
-      kortestq k1, k1
+      vpcmpeqw k1, zmm6, zmmword ptr [rsi]
+      vpcmpeqw k2, zmm7, zmmword ptr [rsi+r14]
+      kandd    k1, k1, k2
+      vpcmpeqw k2, zmm8, zmmword ptr [rsi+r15]
+      kandd    k1, k1, k2
+      vpmovm2w zmm0, k1
+      vptestmb k1, zmm0, zmm0
+      kortestq k1, k1

Now the codegen still isn't "ideal" because we end up converting the mask to a vector to do the "are there any matches" check (this is the vpmovm2w, vptestmb, kortestq). That is a little more complicated to fix since it requires moving some of the op_Inequality transforms from LIR (lowering) into HIR (morph, valuenum, etc). This is planned work, just not something we've completed yet.
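For readers following along without the source open, the disassembly above is consistent with C# of roughly the following shape: three equality comparisons combined with a bitwise AND, followed by an "any bit set" check. This is a sketch inferred from the listings, not the actual CoreLib code; `ThreeWayCompareSketch`, `AnyCandidate`, and the parameter names are hypothetical.

```csharp
using System.Runtime.Intrinsics;

public static class ThreeWayCompareSketch
{
    // Hypothetical helper: ch1..ch3 hold broadcast needle characters,
    // in1..in3 hold input loaded at three different offsets.
    public static bool AnyCandidate(
        Vector512<ushort> ch1, Vector512<ushort> ch2, Vector512<ushort> ch3,
        Vector512<ushort> in1, Vector512<ushort> in2, Vector512<ushort> in3)
    {
        // Each Equals yields all-ones lanes where the characters match.
        // ANDing the three results keeps only positions where all three match.
        Vector512<ushort> result =
            Vector512.Equals(ch1, in1) & Vector512.Equals(ch2, in2) & Vector512.Equals(ch3, in3);

        // "Are there any matches" check: the op_Inequality discussed above.
        return result != Vector512<ushort>.Zero;
    }
}
```

On AVX-512 hardware each `Equals` naturally produces a mask register, which is why the way the JIT folds the `&` and the final comparison determines whether the extra mask-to-vector conversions appear.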

Member Author

> The ShortInput path that is hit is pessimized for larger vector sizes as it must process 2-4x as many elements as non-vectorized. While this doesn't necessarily show up for some bigger inputs, it will show up for small inputs and for inputs that have trailing elements.

I'm not sure I follow? This PR doesn't change how short inputs behave here.

The ShortInput path is only used for inputs relative to Vector128's length. When it's taken does not depend on whether the system has Avx512 or Avx2 support. Trailing elements are also processed with a vectorized step.

> The following PR resolves that: #117887
> This is planned work, just not something we've completed yet.

Thanks! I'll double-check what the numbers look like with your change.

Assuming it's still worse/not better compared to Vector256 paths, does it make sense to keep around?
E.g. We've reverted Avx512 support from IndexOfAnyAsciiSearcher over much smaller regressions even though there are meaningful throughput benefits on longer inputs there (#93222), whereas it's just worse across the board here.

Member

> I'm not sure I follow? This PR doesn't change how short inputs behave here.

It was a general comment about how this and several other vectorized code paths in corelib are written in a way that pessimizes the larger vector sizes and/or small inputs. It wasn't a comment about the changes in this PR, rather just a general "issue" that helps make V512 perform worse than it should. If we were to fix those issues, all the paths should get faster.

> Assuming it's still worse/not better compared to Vector256 paths, does it make sense to keep around?
> E.g. We've reverted Avx512 support from IndexOfAnyAsciiSearcher over much smaller regressions even though there are meaningful throughput benefits on longer inputs there (#93222), whereas it's just worse across the board here.

I believe it's still worth keeping, and worth continuing to track the improvements incrementally. The more we revert, the harder it is to test/validate the improvements as they go in. That applies to AVX512 and SVE alike, both of which have different considerations for mask handling.

The long term goal is to have centralized SIMD looping logic and to utilize things like (the currently internal) ISimdVector. We're getting closer to that each release and continuing to get large improvements to the handling and codegen across the board.

Member

-- A lot of the places where the perf is found to be suboptimal are also fairly trivial fixes, like the one I did. If we file issues for them and ensure the pattern recognition is handled correctly, it is far better for all the vector paths. The same goes for utilizing helpers like Any, All, None, Count, IndexOf, and LastIndexOf that now exist on the vector types.

Member Author

With your change the Regex SliceSlice benchmark (just a ton of span.IndexOf(string)-like searches) shows a 40% regression (as in taking 1.4x as long) with the Avx512 paths compared to Avx2 on Zen hardware.

Should we reconsider targeted arch-specific opt outs for such cases if performance diverges this much, and we consider affected code paths as important?

Member

We generally don't do target specific opt outs.

If you're really that concerned about the regression showing up in the real world, then I'd just go with the removal for now. But please make sure a tracking issue is filed so that it is added back when the direct kortest logic is added in .NET 11.

Member Author

With #118108 now merged, Regex won't hit these paths anymore for ASCII values, so I'm less concerned about the real-world impact.
IndexOf(string) is still impacted, but switching to SearchValues<string> can mitigate that.
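As a rough illustration of that mitigation, a single-string search can be routed through a cached `SearchValues<string>` instance, mirroring how the benchmark above constructs `s_values`. `NeedleSearch` and `IndexOfNeedle` are hypothetical names used only for this sketch, and whether it is actually faster depends on the hardware and input.

```csharp
using System;
using System.Buffers;

public static class NeedleSearch
{
    // Created once and reused: construction does the up-front analysis,
    // so it should not be repeated per search.
    private static readonly SearchValues<string> s_needle =
        SearchValues.Create(["Sherlock Holmes"], StringComparison.Ordinal);

    // Returns the index of the first occurrence, or -1 if there is none.
    public static int IndexOfNeedle(ReadOnlySpan<char> text) => text.IndexOfAny(s_needle);
}
```

A call such as `NeedleSearch.IndexOfNeedle(input)` then takes the place of `input.AsSpan().IndexOf("Sherlock Holmes", StringComparison.Ordinal)`.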

In general it is unfortunate that we would keep around Vector512 paths if they aren't improving perf, but hopefully the potential future changes you mentioned can help here.

@MihaZupan MihaZupan closed this Jul 31, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Aug 31, 2025