Opportunities to improve TE Platform-Plaintext benchmark #63059

EgorBo · 2021-12-22T01:22:08Z

Platform-Plaintext TechEmpower Benchmark where dotnet shows impressive results (2nd place, up to ~12.4mln requests per second (RPS) on our PerfLab hardware with bigger network bandwidth) seems to be slightly bottlenecked by three SpanHelpers.IndexOf[Any] functions:

^ Linux-x64

e.g. IndexOfAny(val0, val1) mostly tries to find \n or \r in ASCII strings of length = 26, 49 and 151 where needed symbols usually found at positions 21, 22 and 100 (http headers)

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7
Host: 10.0.0.102:5000
Connection: keep-alive

Host: 10.0.0.102:5000
Connection: keep-alive

Connection: keep-alive

I tried to rewrite them by hands e.g. I changed some branches, removed "Duff's devices", added "two 256bit vectors per iteration" path. Here is a standalone benchmark project with test data extracted from the Platform-Plaintext benchmark.

As the result, I constantly see stable improvements around 1-3% (never slower):

| load                   | base        | my          |         |
| ---------------------- | ----------- | ----------- | ------- |
| CPU Usage (%)          |          96 |          97 |  +1.04% |
| Cores usage (%)        |       2,702 |       2,726 |  +0.89% |
| Working Set (MB)       |          37 |          37 |   0.00% |
| Private Memory (MB)    |         370 |         370 |   0.00% |
| Start Time (ms)        |           0 |           0 |         |
| First Request (ms)     |          77 |          75 |  -2.60% |
| Requests/sec           |  12,403,854 |  12,679,971 |  +2.23% |
| Requests               | 187,261,280 | 191,437,425 |  +2.23% |
| Mean latency (ms)      |        1.18 |        1.31 | +11.02% |
| Max latency (ms)       |       56.19 |       53.21 |  -5.30% |
| Bad responses          |           0 |           0 |         |
| Socket errors          |           0 |           0 |         |
| Read throughput (MB/s) |    1,495.04 |    1,525.76 |  +2.05% |
| Latency 50th (ms)      |        0.66 |        0.63 |  -3.95% |
| Latency 75th (ms)      |        0.99 |        0.95 |  -4.04% |
| Latency 90th (ms)      |        1.83 |        2.00 |  +9.29% |
| Latency 99th (ms)      |       15.86 |       15.82 |  -0.25% |

^ with PGO. As you can see from https://aka.ms/aspnet/benchmarks (17th page, "full" checked) the best results we've ever seen were 12.42M RPS so 12.68M does look like an improvement:

Same relative improvements can be observed in non-PGO mode (default).

Here is the script I used to benchmark it:

for (;;) { crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\base\*.* --json t1.json ;; crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\diff\*.* --json t2.json ;; crank compare t1.json t2.json }

Standalone benchmark: https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd
Commit in dotnet/runtime: EgorBo@b2ee6ad

/cc @stephentoub @GrabYourPitchforks

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2021-12-22T01:22:13Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost · 2021-12-22T02:30:38Z

Tagging subscribers to this area: @dotnet/area-system-buffers, @GrabYourPitchforks
See info in area-owners.md if you want to be subscribed.

Issue Details

Platform-Plaintext TechEmpower Benchmark where dotnet shows impressive results (2nd place, ~12.4mln RPS on our PerfLab hardware) seems to be slightly bottlenecked by three SpanHelpers.IndexOf[Any] functions:

^ Linux-x64

e.g. IndexOfAny with two values mostly tries to find \n or \r in ASCII strings of length = 26, 49 and 151 where needed symbols usually found at positions 21, 22 and 100 (http headers)

I tried to rewrite them by hands e.g. I changed some branches, removed "Duff's devices" (even for cases where needed char is found at position = 0 my impl is still faster). Here is a standalone benchmark project with test data extracted from the Platform-Plaintext benchmark.

As the result, I constantly see stable improvements around 1-3% (never slower):

| load                   | t1          | t2          |         |
| ---------------------- | ----------- | ----------- | ------- |
| CPU Usage (%)          |          96 |          97 |  +1.04% |
| Cores usage (%)        |       2,702 |       2,726 |  +0.89% |
| Working Set (MB)       |          37 |          37 |   0.00% |
| Private Memory (MB)    |         370 |         370 |   0.00% |
| Start Time (ms)        |           0 |           0 |         |
| First Request (ms)     |          77 |          75 |  -2.60% |
| Requests/sec           |  12,403,854 |  12,679,971 |  +2.23% |
| Requests               | 187,261,280 | 191,437,425 |  +2.23% |
| Mean latency (ms)      |        1.18 |        1.31 | +11.02% |
| Max latency (ms)       |       56.19 |       53.21 |  -5.30% |
| Bad responses          |           0 |           0 |         |
| Socket errors          |           0 |           0 |         |
| Read throughput (MB/s) |    1,495.04 |    1,525.76 |  +2.05% |
| Latency 50th (ms)      |        0.66 |        0.63 |  -3.95% |
| Latency 75th (ms)      |        0.99 |        0.95 |  -4.04% |
| Latency 90th (ms)      |        1.83 |        2.00 |  +9.29% |
| Latency 99th (ms)      |       15.86 |       15.82 |  -0.25% |

^ with PGO. As you can see from https://aka.ms/aspnet/benchmarks (17th page, "full" checked) the best results we've seen were around 12.42M RPS Same relative improvements can be observed in non-PGO mode (default).

Here is the script I used to benchmark it:

for (;;) { crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\base\*.* --json t1.json ;; crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\diff\*.* --json t2.json ;; crank compare t1.json t2.json }

Standalone benchmark: https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd
Commit in dotnet/runtime: EgorBo@b2ee6ad

/cc @stephentoub @GrabYourPitchforks

Author:	EgorBo
Assignees:	-
Labels:	`area-System.Buffers`, `tenet-performance`, `untriaged`
Milestone:	-

danmoseley · 2021-12-22T04:12:43Z

Is it possible that this could regress workloads that don't look like this one? Did you get a chance to run the IndexOf/IndexOfAny benchmarks in dotnet/performance (not sure how good they are but they're there)

EgorBo · 2021-12-22T10:00:42Z

Is it possible that this could regress workloads that don't look like this one? Did you get a chance to run the IndexOf/IndexOfAny benchmarks in dotnet/performance (not sure how good they are but they're there)

Yes, it's always about trade-offs, but what I'm 100% sure in that we can add a "two 256bit vectors per iteration" path for arrays >= 64 elements (bytes) without hurting other cases https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd#file-faster_spanhelpers_indexof-cs-L148-L186.

so it will help us to find \r or \n in

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\n

faster 🙂

EgorBo · 2021-12-22T10:20:14Z

Also, it seems there are optimization opportunities inside the caller of that IndexOfAny - ParseHeaders
we can try to detect : on the go as we search for \r or \n
https://github.com/dotnet/aspnetcore/blob/main/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L141-L421

Currently we do IndexOfAny(data, '\r', '\n') to get current "Header:Value"'s length and then we do IndexOfAny(data, ':', ' ', '\t') again on the same input to extract Header and Value (and validate).

Who knows maybe we can cross 13M RPS 😄

cc @davidfowl @benaadams

stephentoub · 2021-12-22T14:21:25Z

but what I'm 100% sure in that we can add a "two 256bit vectors per iteration" path

I thought we were on a path/plan to switch to using Vector128 in all of these implementations. Is that not the case, @tannergooding?

EgorBo · 2021-12-22T15:28:11Z

Is there a reason?
E.g. as we recently found out with Tanner in Discord that if you want to quickly check if a value of Http Header is "Proxy-Authenticate" (36 bytes) ignoring its case - it's faster to do it via two 256bit vectors than via three 128bit ones: https://gist.github.com/EgorBo/c8e8490ddd6f9a0d5b72c413ddd81d44 on both Core i7 8700K and Ryzen 5950X. And my impression that for any string longer than 32 bytes it's better to go the AVX path.

Also, I believe all the instructions involved don't cause downlclocking (PL0 aka Power License 0) especially on newer CPUs where all avx2 instructions don't do it.

SkylakeX vs Ice Lake, from https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Finding the end-line symbol in this header:

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7

is twice slower with SSE.

EgorBo · 2021-12-28T17:34:30Z

Moved to dotnet/aspnet as a PR dotnet/aspnetcore#39216

EgorBo added the tenet-performance Performance related issue label Dec 22, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Dec 22, 2021

danmoseley added the area-System.Buffers label Dec 22, 2021

EgorBo closed this as completed Dec 28, 2021

ghost locked as resolved and limited conversation to collaborators Jan 27, 2022

tannergooding removed the untriaged New issue has not been triaged by the area owner label Jun 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Opportunities to improve TE Platform-Plaintext benchmark #63059

Opportunities to improve TE Platform-Plaintext benchmark #63059

EgorBo commented Dec 22, 2021 •

edited

Loading

dotnet-issue-labeler bot commented Dec 22, 2021

ghost commented Dec 22, 2021

danmoseley commented Dec 22, 2021

EgorBo commented Dec 22, 2021 •

edited

Loading

EgorBo commented Dec 22, 2021 •

edited

Loading

stephentoub commented Dec 22, 2021

EgorBo commented Dec 22, 2021 •

edited

Loading

EgorBo commented Dec 28, 2021

Opportunities to improve TE Platform-Plaintext benchmark #63059

Opportunities to improve TE Platform-Plaintext benchmark #63059

Comments

EgorBo commented Dec 22, 2021 • edited Loading

dotnet-issue-labeler bot commented Dec 22, 2021

ghost commented Dec 22, 2021

danmoseley commented Dec 22, 2021

EgorBo commented Dec 22, 2021 • edited Loading

EgorBo commented Dec 22, 2021 • edited Loading

stephentoub commented Dec 22, 2021

EgorBo commented Dec 22, 2021 • edited Loading

EgorBo commented Dec 28, 2021

EgorBo commented Dec 22, 2021 •

edited

Loading

EgorBo commented Dec 22, 2021 •

edited

Loading

EgorBo commented Dec 22, 2021 •

edited

Loading

EgorBo commented Dec 22, 2021 •

edited

Loading