Delete AVX512 paths from IndexOf(string)
#117865
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory
@EgorBot -amd

```csharp
using System;
using System.Buffers;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<SingleString>(args: args);

public class SingleString
{
    private static readonly SearchValues<string> s_values = SearchValues.Create([Needle], StringComparison.Ordinal);
    private static readonly SearchValues<string> s_valuesIC = SearchValues.Create([Needle], StringComparison.OrdinalIgnoreCase);

    private static readonly string s_text_noMatches = new('a', Length);
    private static readonly string s_text_falsePositives = string.Concat(Enumerable.Repeat("Sherlock Holm_s", Length / Needle.Length));

    public const int Length = 100_000;
    public const string Needle = "Sherlock Holmes";

    [Benchmark] public void Throughput() => s_text_noMatches.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_Throughput() => s_text_noMatches.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_ThroughputIC() => s_text_noMatches.AsSpan().ContainsAny(s_valuesIC);

    [Benchmark] public void FalsePositives() => s_text_falsePositives.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_FalsePositives() => s_text_falsePositives.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_FalsePositivesIC() => s_text_falsePositives.AsSpan().ContainsAny(s_valuesIC);
}
```
@MihuBot benchmark Regex_Industry https://github.com/MihaZupan/performance/tree/compiled-regex-only -medium
- System.Text.RegularExpressions.Tests.Perf_Regex_Industry_SliceSlice
- System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock
- System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Mariomkas
- System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Leipzig
- System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple
The review thread below is on this hunk of the diff:

```diff
     }
 }
-else if (Vector256.IsHardwareAccelerated && searchSpaceMinusValueTailLength - Vector256<ushort>.Count >= 0)
+if (Vector256.IsHardwareAccelerated && searchSpaceMinusValueTailLength - Vector256<ushort>.Count >= 0)
```
I'd really rather we not delete this. The issue isn't really V512; it's the algorithm/loop being suboptimal for all the vector paths here. That's particularly prevalent in the scalar fallback, which pessimizes larger vector sizes more.
Fixing it isn't that much more work and would be a bigger win.
What do you mean by scalar fallback in this case?
Throughput numbers are just stressing the vectorized inner loop for large inputs with no matches.
Here's what the main loops look like in this case:
AVX2:

```asm
M01_L00:
       vpcmpeqw ymm3,ymm0,[rdx]
       vpcmpeqw ymm4,ymm1,[rdx+r10]
       vpcmpeqw ymm5,ymm2,[rdx+r9]
       vpternlogd ymm5,ymm4,ymm3,80
       vptest ymm5,ymm5
       jne short M01_L02
M01_L01:
       add rdx,20
       cmp rdx,r8
       jbe short M01_L00
```

AVX-512:

```asm
M01_L00:
       vpcmpeqw k1,zmm0,[rdx]
       vpmovm2w zmm3,k1
       vpcmpeqw k1,zmm1,[rdx+r10]
       vpmovm2w zmm4,k1
       vpcmpeqw k1,zmm2,[rdx+r9]
       vpmovm2w zmm5,k1
       vpternlogd zmm5,zmm4,zmm3,80
       vptestmb k1,zmm5,zmm5
       kortestq k1,k1
       nop dword ptr [rax]
       jne short M01_L02
M01_L01:
       add rdx,40
       cmp rdx,r8
       jbe short M01_L00
```
> What do you mean by scalar fallback in this case?
The ShortInput path that is hit is pessimized for larger vector sizes as it must process 2-4x as many elements as non-vectorized. While this doesn't necessarily show up for some bigger inputs, it will show up for small inputs and for inputs that have trailing elements. Additionally, for the whole method the non-idiomatic loops with goto can pessimize various JIT optimizations, control flow analysis, and other things.
Longer term, these should all be rewritten to follow a "better" pattern which helps minimize the dispatch branching and which allows the trailing elements to also be vectorized. -- Ideally we're generally using a pattern like TensorPrimitives uses, where the core loop/trailing logic is centralized and we're just specifying the inner loops and exit conditions.
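To make that shape concrete, here is a minimal sketch of the centralized pattern being described (the names and structure below are illustrative, not the actual TensorPrimitives or SpanHelpers code): the driver owns the main loop and the vectorized trailing step, and callers only supply the per-element/per-vector check.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Illustrative only: callers plug in the scalar and per-vector predicate.
internal interface IAnyPredicate<T> where T : struct
{
    bool Scalar(T value);
    Vector128<T> Vector(Vector128<T> values); // returns a per-element match mask
}

internal static class CentralizedLoop
{
    // The driver owns the main loop and the trailing-element handling.
    internal static bool Any<T, TPredicate>(ReadOnlySpan<T> span, TPredicate predicate)
        where T : struct
        where TPredicate : struct, IAnyPredicate<T>
    {
        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<T>.Count)
        {
            ref T start = ref MemoryMarshal.GetReference(span);
            nuint lastVectorStart = (nuint)(span.Length - Vector128<T>.Count);
            nuint i = 0;

            // Main loop: whole vectors only.
            for (; i < lastVectorStart; i += (nuint)Vector128<T>.Count)
            {
                if (predicate.Vector(Vector128.LoadUnsafe(ref start, i)) != Vector128<T>.Zero)
                {
                    return true;
                }
            }

            // Trailing elements: one final (possibly overlapping) vector instead of a scalar tail.
            return predicate.Vector(Vector128.LoadUnsafe(ref start, lastVectorStart)) != Vector128<T>.Zero;
        }

        // Scalar path only for inputs shorter than one vector.
        foreach (T value in span)
        {
            if (predicate.Scalar(value))
            {
                return true;
            }
        }

        return false;
    }
}
```

The key point of the pattern is that the final partial vector is handled with one overlapping vector load rather than a scalar tail, so widening the vector size doesn't grow the scalar work.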
> Here's what the main loops look like in this case:
The problem with the main loop is the vpcmpeqw, vpmovm2w sequences. This is a really trivial issue related to the fact that the bitwise operations (and/andn/or/xor) are normalized to having a base type of int/uint, since the underlying instructions only support these sizes due to embedded broadcast/masking support.
The check that looks for and(cvtmasktovec(op1), cvtmasktovec(op2)) sequences was requiring all three base types to match, when it actually only needs cvtmasktovec(op1) and cvtmasktovec(op2) to match, with the replacement andmask(op1, op2) then tracking that base type.
The following PR resolves that: #117887
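For context, the managed code that produces this compare + and + "any match" shape looks roughly like the following (an illustrative sketch with made-up names, not the actual SpanHelpers source):

```csharp
using System.Runtime.Intrinsics;

static class MatchSketch
{
    // Each Vector512.Equals produces a mask (a k-register under AVX-512); the & combines
    // them, and the != Zero comparison is the "are there any matches" test discussed here.
    static bool AnyCandidate(
        Vector512<ushort> value0, Vector512<ushort> value1, Vector512<ushort> value2,
        Vector512<ushort> data0, Vector512<ushort> data1, Vector512<ushort> data2)
    {
        Vector512<ushort> candidates = Vector512.Equals(value0, data0)
                                     & Vector512.Equals(value1, data1)
                                     & Vector512.Equals(value2, data2);
        return candidates != Vector512<ushort>.Zero;
    }
}
```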
Codegen before/after that fix:

```diff
- vpcmpeqw k1, zmm6, zmmword ptr [rsi]
- vpmovm2w zmm0, k1
- vpcmpeqw k1, zmm7, zmmword ptr [rsi+r14]
- vpmovm2w zmm1, k1
- vpcmpeqw k1, zmm8, zmmword ptr [rsi+r15]
- vpmovm2w zmm2, k1
- vpternlogd zmm2, zmm1, zmm0, -128
- vptestmb k1, zmm2, zmm2
- kortestq k1, k1
+ vpcmpeqw k1, zmm6, zmmword ptr [rsi]
+ vpcmpeqw k2, zmm7, zmmword ptr [rsi+r14]
+ kandd k1, k1, k2
+ vpcmpeqw k2, zmm8, zmmword ptr [rsi+r15]
+ kandd k1, k1, k2
+ vpmovm2w zmm0, k1
+ vptestmb k1, zmm0, zmm0
+ kortestq k1, k1
```

Now the codegen still isn't "ideal" because we end up converting the mask to a vector to do the "are there any matches" check (this is the vpmovm2w, vptestmb, kortestq). That is a little more complicated to fix since it requires moving some of the op_Inequality transforms from LIR (lowering) into HIR (morph, valuenum, etc.). This is planned work, just not something we've completed yet.
> The ShortInput path that is hit is pessimized for larger vector sizes as it must process 2-4x as many elements as non-vectorized. While this doesn't necessarily show up for some bigger inputs, it will show up for small inputs and for inputs that have trailing elements.
I'm not sure I follow? This PR doesn't change how short inputs behave here.
The ShortInput path is only used for inputs sized relative to Vector128's length; whether it's taken does not depend on whether the system has Avx512 or Avx2 support. Trailing elements are also processed with a vectorized step.
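To illustrate that point, here is a sketch of the dispatch shape only (illustrative, not the actual corelib source): the ShortInput cutoff is tied to Vector128<ushort>.Count, so it is unaffected by which wider vector branch exists.

```csharp
using System.Runtime.Intrinsics;

static class DispatchSketch
{
    // The short-input cutoff is expressed in terms of Vector128<ushort>.Count, so whether
    // it is taken doesn't change with AVX-512/AVX2 support; only which vectorized loop
    // runs afterwards does.
    static string ChoosePath(int searchSpaceMinusValueTailLength)
    {
        if (!Vector128.IsHardwareAccelerated || searchSpaceMinusValueTailLength < Vector128<ushort>.Count)
            return "ShortInput";        // same cutoff regardless of wider vector support

        if (Vector512.IsHardwareAccelerated && searchSpaceMinusValueTailLength - Vector512<ushort>.Count >= 0)
            return "Vector512 loop";    // the branch this PR removes

        if (Vector256.IsHardwareAccelerated && searchSpaceMinusValueTailLength - Vector256<ushort>.Count >= 0)
            return "Vector256 loop";    // condition as shown in the diff above

        return "Vector128 loop";
    }
}
```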
> The following PR resolves that: #117887

> This is planned work, just not something we've completed yet.
Thanks! I'll double-check what the numbers look like with your change.
Assuming it's still worse/not better compared to the Vector256 paths, does it make sense to keep it around?
E.g. We've reverted Avx512 support from IndexOfAnyAsciiSearcher over much smaller regressions even though there are meaningful throughput benefits on longer inputs there (#93222), whereas it's just worse across the board here.
> I'm not sure I follow? This PR doesn't change how short inputs behave here.
It was a general comment about how this and several other vectorized code paths in corelib are written in a way that pessimizes the larger vector sizes and/or small inputs. It wasn't a comment about the changes in this PR, rather a general "issue" that helps make V512 perform worse than it should. If we were to fix those issues, all the paths should get faster.
> Assuming it's still worse/not better compared to the Vector256 paths, does it make sense to keep it around?
> E.g. We've reverted Avx512 support from IndexOfAnyAsciiSearcher over much smaller regressions even though there are meaningful throughput benefits on longer inputs there (#93222), whereas it's just worse across the board here.
I believe it's still worth keeping and worth continuing to incrementally track the improvements. The more we revert, the harder it is to test/validate the improvements as they go in. That applies to AVX512 and SVE alike, both of which have different considerations for mask handling.
The long-term goal is to have centralized SIMD looping logic and to utilize things like the (currently internal) ISimdVector; we're getting closer to that each release and continuing to get large improvements to the handling and codegen across the board.
A lot of the places where the perf is found to be suboptimal also have fairly trivial fixes, like the one I did. If we file issues for them and go ensure the pattern recognition is being handled correctly, that's far better for all the vector paths. The same goes for utilizing helpers like Any, All, None, Count, IndexOf, and LastIndexOf that now exist on the vector types.
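As a concrete example of the kind of pattern those helpers replace, here is the manual "index of first match within a vector" form (a sketch; only the long-standing ExtractMostSignificantBits and TrailingZeroCount APIs are used, and the new helpers' exact signatures aren't reproduced here):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class FirstMatchSketch
{
    // Manual pattern: compare, extract one bit per element, then locate the lowest set bit.
    // The helpers mentioned above (Any/IndexOf/etc. on the vector types) are intended to
    // express this intent directly and let the JIT pick the best lowering per ISA.
    static int FirstMatchIndex(Vector128<ushort> data, Vector128<ushort> value)
    {
        Vector128<ushort> eq = Vector128.Equals(data, value);
        if (eq == Vector128<ushort>.Zero)
            return -1;

        uint mask = Vector128.ExtractMostSignificantBits(eq);
        return BitOperations.TrailingZeroCount(mask);
    }
}
```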
With your change the Regex SliceSlice benchmark (just a ton of span.IndexOf(string)-like searches) shows a 40% regression (as in taking 1.4x as long) with the Avx512 paths compared to Avx2 on Zen hardware.
Should we reconsider targeted arch-specific opt-outs for cases like this, where performance diverges this much and we consider the affected code paths important?
We generally don't do target-specific opt-outs.
If you're really that concerned with the regression and it showing up in the real world, then I'd just go with the removal for now. But please ensure a tracking issue is filed so this can be added back when the direct kortest logic is added in .NET 11.
With #118108 now merged, Regex won't hit these paths anymore for ASCII values, so I'm less concerned about the real-world impact.
IndexOf(string) is still impacted, but switching to SearchValues<string> can mitigate that.
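For reference, that mitigation looks roughly like this (a sketch mirroring the benchmark at the top of the thread; the needle and names are illustrative):

```csharp
using System;
using System.Buffers;

static class NeedleSearch
{
    // Cache the SearchValues<string> once and search with IndexOfAny instead of IndexOf(string).
    private static readonly SearchValues<string> s_needle =
        SearchValues.Create(["Sherlock Holmes"], StringComparison.Ordinal);

    public static int FindNeedle(ReadOnlySpan<char> text) => text.IndexOfAny(s_needle);
}
```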
In general it's unfortunate to keep Vector512 paths around if they aren't improving perf, but hopefully the potential future changes you mentioned can help here.
These paths have worse throughput than AVX2 on Zen 4, Zen 5, and Cascade Lake.
On Sapphire Rapids they have better throughput, but seemingly a much harder time dealing with false positives.
- Zen 4: EgorBot/runtime-utils#440 (comment)
- Zen 5
- Cascade Lake: EgorBot/runtime-utils#440 (comment)
- Sapphire Rapids: EgorBot/runtime-utils#440 (comment)
Regex results from Zen 4: #117865 (comment)