This repository has been archived by the owner on Dec 18, 2018. It is now read-only.

Request header StringCache #411

Closed
benaadams wants to merge 1 commit into dev from stringpool

Conversation

benaadams
Contributor

On a keep-alive connection most of the request header values are the same from request to request (method, host, HTTP version, cookie, user agent, language, path, etc.). Since creating these strings is the largest allocation over a connection, and the values are generally the same per request, reuse the previously created string rather than creating a new one each time.
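
For illustration only, here is a minimal sketch of the idea (hypothetical type and member names, not the PR's actual code): hash the raw header-value bytes as they are parsed and, if the same bytes were seen earlier on this connection, hand back the string created last time instead of allocating a new one.

```csharp
using System;
using System.Text;

// Hypothetical per-connection cache: header values repeat across requests on a
// keep-alive connection, so reuse the string previously created for the same bytes.
public sealed class HeaderValueCache
{
    private const int Size = 25;                        // small, fixed number of slots per connection
    private readonly uint[] _hashes = new uint[Size];
    private readonly byte[][] _rawBytes = new byte[Size][];
    private readonly string[] _values = new string[Size];
    private int _next;

    public string GetOrCreate(ReadOnlySpan<byte> value, uint hash)
    {
        for (int i = 0; i < Size; i++)
        {
            // cheap hash check first; verify the raw bytes only on a hash match
            if (_hashes[i] == hash && _values[i] != null && value.SequenceEqual(_rawBytes[i]))
            {
                return _values[i];                      // hit: no new string allocation
            }
        }

        // miss: create the string once and remember it for subsequent requests
        string s = Encoding.ASCII.GetString(value);
        int slot = _next;
        _next = (_next + 1) % Size;                     // simple replacement policy, for the sketch only
        _hashes[slot] = hash;
        _rawBytes[slot] = value.ToArray();
        _values[slot] = s;
        return s;
    }
}
```

In this sketch the parser would compute the hash inline while scanning the value bytes (see the hashing discussion further down) and call GetOrCreate instead of allocating a string directly.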

Updated

This change cuts 88% of allocs and 92.5% of alloc'd memory for strings (and 33% of total allocations)

Measured with the current "aspnet/dev" + this change: 4k /plaintext?N browser requests from Firefox, pipelined 16 deep (where N = 1...200), i.e. 20 refreshes.

Before % allocations: strings are 42% of allocations (screenshot)

After % allocations: strings are 8% of allocations (screenshot)

Before total allocations: 44,112 strings, 2.8 MB (screenshot)

After total allocations: 5,619 strings, 210 kB (screenshot)

Resolves final part of #291

@benaadams benaadams force-pushed the stringpool branch 3 times, most recently from 716edfa to 9b940dc Compare November 24, 2015 15:24
@benaadams benaadams changed the title [Prototype] Request header Stringpool Request header Stringpool Nov 24, 2015
@benaadams benaadams changed the title Request header Stringpool [Proposal] Request header Stringpool Nov 24, 2015

// give greater hash fidelity to 7 bit ascii
// rotate https://github.com/dotnet/coreclr/pull/2027
hash = ((hash << 7) | (hash >> (32 - 7))) ^ b;
Contributor Author


Should read via a uint* (4 bytes at a time) rather than hashing per byte.
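
A minimal sketch of that suggestion (an assumed helper, not the PR's code): feed the rotate-xor hash a uint (4 bytes) per iteration, falling back to per-byte only for the tail.

```csharp
using System;
using System.Buffers.Binary;

static class HeaderHash
{
    // Same rotate-xor as the snippet above, but consuming 4 bytes at a time.
    public static uint HashBytes(ReadOnlySpan<byte> data, uint seed)
    {
        uint hash = seed;
        int i = 0;
        for (; i + sizeof(uint) <= data.Length; i += sizeof(uint))
        {
            uint chunk = BinaryPrimitives.ReadUInt32LittleEndian(data.Slice(i));
            hash = ((hash << 7) | (hash >> (32 - 7))) ^ chunk;   // 4 bytes per rotate
        }
        for (; i < data.Length; i++)
        {
            hash = ((hash << 7) | (hash >> (32 - 7))) ^ data[i]; // remaining tail bytes
        }
        return hash;
    }
}
```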

@benaadams
Contributor Author

From @halter73 on #441

@benaadams After discussing #411 with @davidfowl and @CesarBS we were thinking about closing it due to the relative complexity of managing a per-connection invalidating cache to reduce string allocations.

I didn't want to start that discussion before offering an alternative, and this is it. I prefer this approach because it has better (or at least easier to analyze) worst-case performance, both memory- and CPU-wise. This might also have a better per-connection memory footprint since we don't have to reallocate a new cache each time.

@benaadams
Contributor Author

Hopefully this analysis is correct... might not be; errors and omissions excepted...

worst case performance (...) memory

Looking at a single request, the worst-case overhead is the same as the best case, since you've still allocated the strings and have them in the header collection; so it's just the 3 arrays, which would be:

25 * sizeof(uint) + 25 * sizeof(int) + 25 * sizeof(IntPtr) + 3 * array overhead

On x64: 25 * 4 + 25 * 4 + 25 * 8 + 3 * 28 = 484 bytes per connection
On x86: 25 * 4 + 25 * 4 + 25 * 4 + 3 * 12 = 336 bytes per connection

Worst-case overhead from a bad actor, in strings, would be 25 strings of max length 256: 256 * 25 * 2 = 12,800 bytes + 25 * string overhead; but without the cache you'd still allocate those strings for the headers anyway and keep them for the request.

So probably the real worst case that deviates from the no-cache behaviour would be someone on keep-alive sending a 256-byte cookie that changed on every request, in which case it would be (25 - connection-repeated strings) * 256 bytes.

cpu-wise

By itself the hashing should be more than 2x faster than the non-Marvin32 variant of String.GetHashCode, and simpler than the InternalMarvin32HashString used by Dictionary.Add, as it operates on 4 byte-sized chars per iteration rather than 2 full chars. However, it also hashes at the same time as the string construction, so it will be close(ish) to free.

So the main worst case will be a scan-and-compare over a contiguous array of 25 uints per unknown string, which should occupy 2 cache lines. On a hash match it does 4-wide parallel byte compares to ensure it really is the same string, which is a similar cost to a dictionary key lookup, though faster than a case-insensitive lookup.

Security-wise: Dictionary now uses InternalMarvin32HashString to deal with DoS from predicted key collisions. I don't believe this has the same issue, as it's a bounded array of 25 elements with a maximum value length of 256. However, I do seed the hash with a random number on startup anyway, to prevent anything learned from being transferable. Could also do it per connection rather than per instance for a higher paranoia setting.
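
For illustration, the seeding mentioned above could be as small as the sketch below (hypothetical names); the HashBytes sketch earlier would take this as its seed. A per-connection variant would simply make the value non-static and held per connection.

```csharp
using System;

static class HeaderHashSeed
{
    // Hypothetical: one random seed chosen at startup and mixed into every
    // header-value hash, so collisions learned against one process don't
    // transfer to another.
    public static readonly uint Value =
        unchecked((uint)new Random().Next(int.MinValue, int.MaxValue));
}
```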

I need to recheck that it's doing what it was doing before; I changed the hashcode function to closely match the span hashcode (with fix), which had a lot of perf analysis done on it.

Also, I brought in the loop unrolling and parallel string construction from "Loop unroll; avoid bounds checks", with benchmarking analysis, to hopefully make back any losses; however, since that could be included by itself, that's probably neither here nor there.

Having said all this... someone else should probably check it.

@benaadams
Contributor Author

Will recheck what it's doing before rebasing; also #441 should go in first as it makes this better.

Also, with that change dealing with method and version, and FrameRequestHeaders dealing with the names, could this drop to 9 entries rather than 25? Though for my requests to GitHub that seems to be 11-12 header values; probably common paths also, and some common query strings.

@benaadams
Contributor Author

Moved updated stats into summary

@davidfowl
Member

@benaadams we don't have any plans to merge this into Kestrel. The strategy we're using for cached header values is to store byte[] in StringValues.

@benaadams
Contributor Author

@davidfowl would you hang on to the MemoryPoolBlocks from SocketInput until end of request? (header portion)

e.g. for this stuff which is the same on every request for the connection

GET /plaintext?059 HTTP/1.1
Host: localhost:5001
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
Cookie: SRCHUID=V=2&GUID=...
Upgrade-Insecure-Requests: 1
DNT: 1
Accept-Charset: *
Connection: keep-alive
Cache-Control: max-age=0

@benaadams
Contributor Author

Though using spans and Utf8String and not doing a copy to string at all would be better. Or pay only when you convert, I guess? Would encourage people to stop trying to user-agent sniff :)

Want me to close..? 😿

@davidfowl
Member

Ah, this is request header values. My bad 😄 Can we just do more of what @CesarBS's change does? Intern a bunch of known things and do a fast comparison? This looks way too generic for what we want to do.

@halter73
Member

halter73 commented Dec 4, 2015

For some header values like "keep-alive", it is certainly possible to do something more like what @CesarBS's change does. I think it would get trickier for things like user agents. I'm not even sure things like "Accept*" header values are all that consistent between browsers, regions, etc... Cookie values are obviously a no-go.

It might not seem like cookie values would be reused too often (and that's likely true between different connections), but over a single connection, I expect cookie values get reused frequently.

@davidfowl In case you didn't notice, there is a unique StringPool per connection, so everything @benaadams bolded in his previous comment would likely be successfully cached for every subsequent request made over that connection.

Still, I do wonder about the real-world impact of this change. Do we expect this will improve the performance of a typical high traffic website running on Kestrel? Or is this mostly just to help us out with our benchmarks? Or is there some other scenario where a lot of requests are made over a single connection that I'm not thinking about? Reverse proxies can reuse connections for a lot of requests from different users, but in that case I don't know if caching only the last 25 header values is sufficient.

To figure out if this change makes a difference for a typical web server, I imagine we would have to estimate how many requests are usually made over a single connection by a normal user browsing a website in Chrome, IE, etc., and how many requests there would need to be for us to see a real performance improvement from this change.

Unlike some previous performance improvements, this isn't just a pure win. Obviously I can see some scenarios where this can reduce allocations quite a bit (like with our benchmarks), but there are clearly some scenarios (like with a single request per connection) where it wouldn't help at all. We could mitigate the single request thing by just not using the string pool for non-keep-alive connections, but that just covers one of the possible bad cases (though I like the worst-case analysis and will concede the worst-case isn't that bad).

Long story short, I'm conflicted on this change, and when I'm unsure I like to err on the side of simplicity. I haven't closed this yet because maybe there's something I haven't considered that will change my thinking.

@halter73
Member

halter73 commented Dec 4, 2015

@benaadams To add to my previous comment, while a 33% reduction in total allocations is impressive, am I correct in understanding that's for 4k requests over a single connection? I'm also assuming the 16 deep pipelining is helping us reduce our SocketOutput related allocations by comparison.

@benaadams
Contributor Author

@halter73 it's from Firefox, so it would be 4k requests over 6 connections, or 660+ish per connection; but yes, that would likely be a very high number for a single connection.

On the other points you raise: in reality I don't necessarily think it will greatly help with the benchmarks, as they don't really send many header values or any unexpected header keys, so it would be interesting to see what it would do; it would likely just be the host and path. (This is assuming #441 is already live.)

It would be real-world scenarios that would see the main gains from this, especially if you are also serving static files from the same host. As keep-alive is maintained across page views, and generally lasts around 2 minutes from the last request, it ends up being a lot of requests that could take advantage of this. Also, if a reverse proxy is demultiplexing HTTP/2 requests, then with the current HTTP/2 recommendations favouring many small files (disputed in value by some) the impact could be even greater; likewise for users of async loading like require.js rather than prebundling.

Looking at a not-very-random sample of sites (ignoring their CDN use) in Chrome, per connection, using the common 6x connection parallelism:

site              first load reqs   next click/page reqs   1st/conn   2nd+/conn
Azure portal      176               38                     29         6
JabbR             79                0                      13         0
Amazon.com        327               166                    54         28
Microsoft store   120               105                    20         17
twitter           153               52                     25         9
youtube           83                43                     14         7
bbc news          140               105                    23         17
facebook          237               85                     40         14
ebay.com          172               36                     29         6

So on the first page view it's around 24 of 25 sets of header string creations that would be avoided, rising by 10 on each subsequent page view (if it happens within the 2 minutes).

JabbR, using SignalR + WebSockets, is the outlier here as it's not making any new requests on subsequent actions.

@benaadams
Contributor Author

What about as an optional+configurable component?

@benaadams benaadams changed the title [Proposal] Request header Stringpool [Proposal] Request header StringCache Dec 5, 2015
@benaadams
Contributor Author

Definitely would want the ability to switch it off if accepted, regardless; e.g. if used as a pure WebSocket server.
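
For illustration, a hedged sketch of what such a switch might look like (the configuration key and helper name are hypothetical, not what the PR actually wires up):

```csharp
using Microsoft.Extensions.Configuration;

static class HeaderCacheSwitch
{
    // Hypothetical key name; the idea is simply: on by default, and skippable
    // for workloads like a pure WebSocket server where it cannot help.
    public static bool IsEnabled(IConfiguration configuration)
        => configuration["server.headerStringCache"] != "false";
}
```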

@halter73
Member

halter73 commented Dec 7, 2015

I'm going to try testing what kind of difference this makes in wrk on Monday on our perf lab machines. That should be pretty much a best-case scenario as far as seeing the effects of this PR goes, since so many duplicate requests are made over the same connection. It will be interesting to see how much of an impact the reduced allocations make in terms of RPS.

I'll make sure the test is long running, so it takes into account collections. I imagine most of the strings allocated currently are gen 0, so I sorta doubt there will be a significant difference, but if there's anything measurable I'll be inclined to merge this with a switch.

@benaadams benaadams changed the title [Proposal] Request header StringCache Request header StringCache Dec 7, 2015
@benaadams
Contributor Author

Updated to read from config; is on by default?

@halter73 to make it a little more real-world(ish), try with a script that adds headers, e.g.

wrk.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36"
wrk.headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8"
wrk.headers["Accept-Encoding"] = "gzip, deflate, sdch"
wrk.headers["Accept-Language"] = "en-US,en;q=0.8"
wrk.headers["Cache-Control"] = "max-age=0"

Just before

r[i] = wrk.format(nil)

ofc also the regular benchmark, don't want it to regress!

@benaadams
Contributor Author

Fixed (was using the wrong type cast for single-byte hashing), verified, squashed.

@AspNetSmurfLab

@benaadams I ran our plaintext benchmark scenario with the additional headers you suggested, and here are the results from our perf lab:

Baseline:

wrk -c 64 -t 32 -d 60 -s ./scripts/pipeline.lua http://10.0.0.100:5001/plaintext -- 16
Running 1m test @ http://10.0.0.100:5001/plaintext
  32 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.71ms   20.28ms 462.54ms   88.49%
    Req/Sec    13.09k     2.56k   70.75k    71.27%
  24999284 requests in 1.00m, 3.07GB read
Requests/sec: 415961.30
Transfer/sec:     52.36MB

Allocated: 81.34 MB
Gen 0: 174, Gen 1: 3, Gen 2: 2

Running 1m test @ http://10.0.0.100:5001/plaintext
  32 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.19ms   33.73ms 520.07ms   88.47%
    Req/Sec    12.64k     3.27k   60.91k    68.47%
  24143877 requests in 1.00m, 2.97GB read
Requests/sec: 401748.34
Transfer/sec:     50.57MB

Allocated: 52.68 MB
Gen 0: 170, Gen 1: 3, Gen 2: 2

With this PR merged:

Running 1m test @ http://10.0.0.100:5001/plaintext
  32 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.09ms   29.22ms 644.22ms   89.82%
    Req/Sec    12.64k     3.15k   24.05k    69.16%
  24136730 requests in 1.00m, 2.97GB read
Requests/sec: 402109.14
Transfer/sec:     50.62MB

Allocated: 165.04 MB
Gen 0: 165, Gen 1: 3, Gen 2: 2

Running 1m test @ http://10.0.0.100:5001/plaintext
  32 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.88ms   22.68ms 500.36ms   88.23%
    Req/Sec    13.09k     2.89k   63.51k    68.78%
  25011742 requests in 1.00m, 3.07GB read
Requests/sec: 416176.82
Transfer/sec:     52.39MB

Allocated: 214.21 MB
Gen 0: 177, Gen 1: 3, Gen 2: 2

Note that "Allocated: " is the result GC.GetTotalMemory(forceFullCollection: false) about 15 seconds after the test is completed. So this isn't the amount of memory allocated during the test, but instead the amount of total active memory in the heap after the test is over. If I forced GC, I suspect the allocation numbers for the baseline and the PR would be the same, so I would just ignore this.

The main point of showing the GC numbers is that there were plenty of Gen 0 collections in each run.
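
For reference, a rough sketch (not the actual harness) of how the "Allocated"/"Gen N" lines above can be produced with the standard GC APIs:

```csharp
using System;
using System.Threading;

class GcReport
{
    static void Main()
    {
        Thread.Sleep(TimeSpan.FromSeconds(15));                    // settle ~15s after the run, as described above
        long live = GC.GetTotalMemory(forceFullCollection: false); // live heap bytes, without forcing a collection
        Console.WriteLine($"Allocated: {live / (1024.0 * 1024.0):F2} MB");
        Console.WriteLine($"Gen 0: {GC.CollectionCount(0)}, Gen 1: {GC.CollectionCount(1)}, Gen 2: {GC.CollectionCount(2)}");
    }
}
```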

All said, I don't think these results show a significant enough change in performance to warrant a string cache for headers. If you have any suggestions on parameters I should tweak when running these tests, I'm all ears.

@AspNetSmurfLab

I did some more runs with more connections, but less time per run. I also didn't restart the server between runs, but I did do a warmup run before measuring. Here are the results:

Baseline:

wrk -c 256 -t 32 -d 30 -s ./scripts/pipeline.lua http://10.0.0.100:5001/plaintext -- 16
Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    14.32ms   33.04ms 408.59ms   92.83%
    Req/Sec    13.91k     1.79k   96.00k    92.22%
  13284799 requests in 30.10s, 1.63GB read
Requests/sec: 441356.37
Transfer/sec:     55.56MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    24.05ms   53.10ms 584.39ms   90.42%
    Req/Sec    14.13k     2.26k  127.33k    86.96%
  13509384 requests in 30.10s, 1.66GB read
Requests/sec: 448862.90
Transfer/sec:     56.51MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    16.76ms   37.96ms 455.98ms   91.40%
    Req/Sec    14.04k     1.80k   57.01k    88.76%
  13399709 requests in 30.10s, 1.65GB read
Requests/sec: 445189.55
Transfer/sec:     56.04MB

Allocated: 174.88 MB
Gen 0: 378, Gen 1: 30, Gen 2: 4

Forcing GC... done!

Allocated: 21.48 MB
Gen 0: 379, Gen 1: 31, Gen 2: 5

With this PR merged:

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.54ms   33.68ms 407.07ms   91.68%
    Req/Sec    13.99k     1.83k  152.34k    90.59%
  13349015 requests in 30.10s, 1.64GB read
  Socket errors: connect 0, read 0, write 300, timeout 0
Requests/sec: 443499.07
Transfer/sec:     55.83MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    18.50ms   41.38ms 468.02ms   91.44%
    Req/Sec    14.08k     1.75k   86.40k    86.57%
  13462037 requests in 30.10s, 1.65GB read
Requests/sec: 447262.28
Transfer/sec:     56.30MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    23.63ms   58.09ms 868.69ms   91.74%
    Req/Sec    14.11k     1.76k   57.84k    81.93%
  13498453 requests in 30.10s, 1.66GB read
Requests/sec: 448466.09
Transfer/sec:     56.46MB

Allocated: 120.02 MB
Gen 0: 373, Gen 1: 27, Gen 2: 2

Forcing GC... done!

Allocated: 21.60 MB
Gen 0: 374, Gen 1: 28, Gen 2: 3

I feel these results also show that this PR has no real impact on Requests/sec.

@benaadams
Contributor Author

Let's close this for now then - seems a bit rubbish

@AspNetSmurfLab

wrk -c 256 -t 32 -d 30 -s ./scripts/pipeline.lua http://10.0.0.100:5001/plaintext -- 16 (no extra headers)

Baseline (dff3a4f):

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    12.33ms   32.51ms 617.34ms   92.28%
    Req/Sec    24.76k     3.11k  167.97k    92.54%
  23629859 requests in 30.10s, 2.90GB read
Requests/sec: 785068.60
Transfer/sec:     98.83MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.11ms   23.88ms 369.65ms   93.03%
    Req/Sec    24.68k     2.29k  187.74k    87.32%
  23590851 requests in 30.10s, 2.90GB read
Requests/sec: 783763.63
Transfer/sec:     98.66MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.57ms   28.19ms 484.16ms   92.35%
    Req/Sec    24.77k     2.55k  113.82k    90.24%
  23677061 requests in 30.10s, 2.91GB read
Requests/sec: 786644.89
Transfer/sec:     99.03MB

Merged (benaadams/stringpool):

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.49ms   12.79ms 217.39ms   94.88%
    Req/Sec    24.45k     1.33k   39.65k    74.98%
  23376623 requests in 30.04s, 2.87GB read
Requests/sec: 778294.21
Transfer/sec:     97.98MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.72ms   18.81ms 333.32ms   95.56%
    Req/Sec    24.49k     2.62k  155.22k    93.10%
  23379532 requests in 30.10s, 2.87GB read
Requests/sec: 776758.24
Transfer/sec:     97.78MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.41ms   12.61ms 214.28ms   95.52%
    Req/Sec    24.37k     1.55k   44.73k    81.56%
  23301257 requests in 30.03s, 2.86GB read
Requests/sec: 775850.68
Transfer/sec:     97.67MB

wrk -c 256 -t 32 -d 30 -s ./scripts/pipeline.lua http://10.0.0.100:5001/plaintext -- 16 (with extra headers)

Baseline (dff3a4f):

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.20ms   23.28ms 392.88ms   94.04%
    Req/Sec    15.10k     1.95k  116.84k    94.86%
  14393209 requests in 30.10s, 1.77GB read
  Socket errors: connect 0, read 0, write 592, timeout 0
Requests/sec: 478258.46
Transfer/sec:     60.21MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.12ms   25.17ms 329.03ms   93.70%
    Req/Sec    15.08k     1.78k  157.39k    92.15%
  14417081 requests in 30.10s, 1.77GB read
Requests/sec: 478985.19
Transfer/sec:     60.30MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.59ms   20.44ms 369.15ms   93.87%
    Req/Sec    15.09k     1.77k  162.28k    94.29%
  14420099 requests in 30.10s, 1.77GB read
Requests/sec: 479083.05
Transfer/sec:     60.31MB

Merged (benaadams/stringpool):

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.38ms   23.30ms 404.19ms   93.76%
    Req/Sec    15.17k     1.32k   30.67k    82.26%
  14503958 requests in 30.03s, 1.78GB read
Requests/sec: 482954.16
Transfer/sec:     60.80MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.21ms   34.62ms 456.62ms   94.27%
    Req/Sec    15.34k     1.33k   35.06k    80.18%
  14663064 requests in 30.03s, 1.80GB read
Requests/sec: 488230.95
Transfer/sec:     61.46MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.33ms   16.03ms 388.50ms   94.07%
    Req/Sec    15.03k     0.99k   22.95k    76.55%
  14362829 requests in 30.03s, 1.77GB read
Requests/sec: 478338.24
Transfer/sec:     60.22MB

I still think all the differences are within the margin of error. And I triple checked dnu list this time.

@benaadams
Contributor Author

As the code is fairly complicated it definitely doesn't justify itself 👎

Thank you for investigating.

@benaadams
Contributor Author

/cc @davidfowl

@benaadams
Contributor Author

Ooooh, have thought of something better.... 😈

@cesarblum
Contributor

Wow risen from the grave. What's different this time?
