Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opportunities to improve TE Platform-Plaintext benchmark #63059

Closed
EgorBo opened this issue Dec 22, 2021 · 8 comments
Closed

Opportunities to improve TE Platform-Plaintext benchmark #63059

EgorBo opened this issue Dec 22, 2021 · 8 comments
Labels

Comments

@EgorBo
Copy link
Member

EgorBo commented Dec 22, 2021

Platform-Plaintext TechEmpower Benchmark where dotnet shows impressive results (2nd place, up to ~12.4mln requests per second (RPS) on our PerfLab hardware with bigger network bandwidth) seems to be slightly bottlenecked by three SpanHelpers.IndexOf[Any] functions:
image
image

^ Linux-x64

e.g. IndexOfAny(val0, val1) mostly tries to find \n or \r in ASCII strings of length = 26, 49 and 151 where needed symbols usually found at positions 21, 22 and 100 (http headers)

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7
Host: 10.0.0.102:5000
Connection: keep-alive
Host: 10.0.0.102:5000
Connection: keep-alive
Connection: keep-alive

I tried to rewrite them by hands e.g. I changed some branches, removed "Duff's devices", added "two 256bit vectors per iteration" path. Here is a standalone benchmark project with test data extracted from the Platform-Plaintext benchmark.

As the result, I constantly see stable improvements around 1-3% (never slower):

| load                   | base        | my          |         |
| ---------------------- | ----------- | ----------- | ------- |
| CPU Usage (%)          |          96 |          97 |  +1.04% |
| Cores usage (%)        |       2,702 |       2,726 |  +0.89% |
| Working Set (MB)       |          37 |          37 |   0.00% |
| Private Memory (MB)    |         370 |         370 |   0.00% |
| Start Time (ms)        |           0 |           0 |         |
| First Request (ms)     |          77 |          75 |  -2.60% |
| Requests/sec           |  12,403,854 |  12,679,971 |  +2.23% |
| Requests               | 187,261,280 | 191,437,425 |  +2.23% |
| Mean latency (ms)      |        1.18 |        1.31 | +11.02% |
| Max latency (ms)       |       56.19 |       53.21 |  -5.30% |
| Bad responses          |           0 |           0 |         |
| Socket errors          |           0 |           0 |         |
| Read throughput (MB/s) |    1,495.04 |    1,525.76 |  +2.05% |
| Latency 50th (ms)      |        0.66 |        0.63 |  -3.95% |
| Latency 75th (ms)      |        0.99 |        0.95 |  -4.04% |
| Latency 90th (ms)      |        1.83 |        2.00 |  +9.29% |
| Latency 99th (ms)      |       15.86 |       15.82 |  -0.25% |

^ with PGO. As you can see from https://aka.ms/aspnet/benchmarks (17th page, "full" checked) the best results we've ever seen were 12.42M RPS so 12.68M does look like an improvement:

image

Same relative improvements can be observed in non-PGO mode (default).

Here is the script I used to benchmark it:

for (;;) { crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\base\*.* --json t1.json ;; crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\diff\*.* --json t2.json ;; crank compare t1.json t2.json }

Standalone benchmark: https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd
Commit in dotnet/runtime: EgorBo@b2ee6ad

/cc @stephentoub @GrabYourPitchforks

@EgorBo EgorBo added the tenet-performance Performance related issue label Dec 22, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Dec 22, 2021
@ghost
Copy link

ghost commented Dec 22, 2021

Tagging subscribers to this area: @dotnet/area-system-buffers, @GrabYourPitchforks
See info in area-owners.md if you want to be subscribed.

Issue Details

Platform-Plaintext TechEmpower Benchmark where dotnet shows impressive results (2nd place, ~12.4mln RPS on our PerfLab hardware) seems to be slightly bottlenecked by three SpanHelpers.IndexOf[Any] functions:
image
^ Linux-x64

e.g. IndexOfAny with two values mostly tries to find \n or \r in ASCII strings of length = 26, 49 and 151 where needed symbols usually found at positions 21, 22 and 100 (http headers)

I tried to rewrite them by hands e.g. I changed some branches, removed "Duff's devices" (even for cases where needed char is found at position = 0 my impl is still faster). Here is a standalone benchmark project with test data extracted from the Platform-Plaintext benchmark.

As the result, I constantly see stable improvements around 1-3% (never slower):

| load                   | t1          | t2          |         |
| ---------------------- | ----------- | ----------- | ------- |
| CPU Usage (%)          |          96 |          97 |  +1.04% |
| Cores usage (%)        |       2,702 |       2,726 |  +0.89% |
| Working Set (MB)       |          37 |          37 |   0.00% |
| Private Memory (MB)    |         370 |         370 |   0.00% |
| Start Time (ms)        |           0 |           0 |         |
| First Request (ms)     |          77 |          75 |  -2.60% |
| Requests/sec           |  12,403,854 |  12,679,971 |  +2.23% |
| Requests               | 187,261,280 | 191,437,425 |  +2.23% |
| Mean latency (ms)      |        1.18 |        1.31 | +11.02% |
| Max latency (ms)       |       56.19 |       53.21 |  -5.30% |
| Bad responses          |           0 |           0 |         |
| Socket errors          |           0 |           0 |         |
| Read throughput (MB/s) |    1,495.04 |    1,525.76 |  +2.05% |
| Latency 50th (ms)      |        0.66 |        0.63 |  -3.95% |
| Latency 75th (ms)      |        0.99 |        0.95 |  -4.04% |
| Latency 90th (ms)      |        1.83 |        2.00 |  +9.29% |
| Latency 99th (ms)      |       15.86 |       15.82 |  -0.25% |

^ with PGO. As you can see from https://aka.ms/aspnet/benchmarks (17th page, "full" checked) the best results we've seen were around 12.42M RPS Same relative improvements can be observed in non-PGO mode (default).

Here is the script I used to benchmark it:

for (;;) { crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\base\*.* --json t1.json ;; crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\diff\*.* --json t2.json ;; crank compare t1.json t2.json }

Standalone benchmark: https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd
Commit in dotnet/runtime: EgorBo@b2ee6ad

/cc @stephentoub @GrabYourPitchforks

Author: EgorBo
Assignees: -
Labels:

area-System.Buffers, tenet-performance, untriaged

Milestone: -

@danmoseley
Copy link
Member

Is it possible that this could regress workloads that don't look like this one? Did you get a chance to run the IndexOf/IndexOfAny benchmarks in dotnet/performance (not sure how good they are but they're there)

@EgorBo
Copy link
Member Author

EgorBo commented Dec 22, 2021

Is it possible that this could regress workloads that don't look like this one? Did you get a chance to run the IndexOf/IndexOfAny benchmarks in dotnet/performance (not sure how good they are but they're there)

Yes, it's always about trade-offs, but what I'm 100% sure in that we can add a "two 256bit vectors per iteration" path for arrays >= 64 elements (bytes) without hurting other cases https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd#file-faster_spanhelpers_indexof-cs-L148-L186.

so it will help us to find \r or \n in

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\n

faster 🙂

@EgorBo
Copy link
Member Author

EgorBo commented Dec 22, 2021

Also, it seems there are optimization opportunities inside the caller of that IndexOfAny - ParseHeaders
we can try to detect : on the go as we search for \r or \n
https://github.com/dotnet/aspnetcore/blob/main/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L141-L421

Currently we do IndexOfAny(data, '\r', '\n') to get current "Header:Value"'s length and then we do IndexOfAny(data, ':', ' ', '\t') again on the same input to extract Header and Value (and validate).

image

Who knows maybe we can cross 13M RPS 😄

cc @davidfowl @benaadams

@stephentoub
Copy link
Member

but what I'm 100% sure in that we can add a "two 256bit vectors per iteration" path

I thought we were on a path/plan to switch to using Vector128 in all of these implementations. Is that not the case, @tannergooding?

@EgorBo
Copy link
Member Author

EgorBo commented Dec 22, 2021

Is there a reason?
E.g. as we recently found out with Tanner in Discord that if you want to quickly check if a value of Http Header is "Proxy-Authenticate" (36 bytes) ignoring its case - it's faster to do it via two 256bit vectors than via three 128bit ones: https://gist.github.com/EgorBo/c8e8490ddd6f9a0d5b72c413ddd81d44 on both Core i7 8700K and Ryzen 5950X. And my impression that for any string longer than 32 bytes it's better to go the AVX path.

Also, I believe all the instructions involved don't cause downlclocking (PL0 aka Power License 0) especially on newer CPUs where all avx2 instructions don't do it.
image
SkylakeX vs Ice Lake, from https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Finding the end-line symbol in this header:

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7

is twice slower with SSE.

@EgorBo
Copy link
Member Author

EgorBo commented Dec 28, 2021

Moved to dotnet/aspnet as a PR dotnet/aspnetcore#39216

@EgorBo EgorBo closed this as completed Dec 28, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jan 27, 2022
@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Jun 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants