```csharp
// give greater hash fidelity to 7 bit ascii
// rotate https://github.com/dotnet/coreclr/pull/2027
hash = ((hash << 7) | (hash >> (32 - 7))) ^ b;
```
Should use *int rather than per byte
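For illustration, a minimal sketch of the per-int variant this comment suggests, assuming the input is a byte[] of ASCII header bytes; the class and method names are hypothetical, not the PR's actual code:

```csharp
using System;

static class AsciiHash
{
    // Same rotate-xor mixing step as the diff above, but consuming 4 bytes
    // (one uint) per iteration instead of one byte, with a byte-wise tail
    // for the remaining 0-3 bytes.
    public static uint Hash(byte[] data, int offset, int count)
    {
        uint hash = 0;
        int end = offset + count;

        while (end - offset >= sizeof(uint))
        {
            uint chunk = BitConverter.ToUInt32(data, offset);
            hash = ((hash << 7) | (hash >> (32 - 7))) ^ chunk;
            offset += sizeof(uint);
        }

        while (offset < end)
        {
            hash = ((hash << 7) | (hash >> (32 - 7))) ^ data[offset++];
        }

        return hash;
    }
}
```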
Hopefully this analysis is correct... might not be; errors and omissions accepted...
Looking at a single request, worst-case overhead is the same as best case, as you've still allocated the strings and have them in the header collection; so it's the 3 arrays, which would be:
On x64, worst-case overhead from a bad actor in strings would be 25 strings of max length 256: 256 * 25 * 2 = 12,800 bytes + 25 * string overhead; but without the cache you'd still allocate those strings for the headers anyway and maintain them for the request. So the real worst case that deviates from no cache would be if someone on keep-alive was sending a 256-byte cookie that changed on every request, when it would be (25 - connection-repeated strings) * 256.
By itself the hashing should be more than 2x faster than ... so the main worst case will be a scan and compare of a contiguous array of 25 entries. Security-wise - Dictionary now uses ... I need to recheck it's doing what it was doing before; I changed the hashcode function to closely match the span hashcode (with fix), which had a lot of perf analysis done on it. Also I brought in the loop unrolling and parallel string construction from "Loop unroll; avoid bounds checks" with benchmarking analysis to hopefully make back any losses - however, since that could be included by itself, that's probably neither here nor there. Having said all this... someone else should probably check it.
Will recheck what it's doing before rebasing; also #441 should go in first as it makes it better. Also, with it dealing with method and version, and FrameRequestHeaders dealing with the names, could this drop to 9 rather than 25? Though for my requests to GitHub that seems to be 11-12 header values; probably common paths also; some common query strings.
Moved updated stats into summary
@benaadams we don't have any plans to merge this into kestrel. The strategy we're using for cached header values is storing byte[] in StringValues.
@davidfowl would you hang on to the MemoryPoolBlocks from SocketInput until the end of the request (the header portion)? E.g. for this stuff, which is the same on every request for the connection: GET /plaintext?059 HTTP/1.1
Though using spans and Utf8String and not doing a copy to string at all would be better. Or pay when you convert, I guess? Would encourage people to stop trying to user-agent sniff :) Want me to close..? 😿
Ah, this is request header values. My bad 😄 Can we just do more of what @CesarBS's change does? Intern a bunch of known things and do a fast comparison? This looks way too generic for what we want to do.
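For illustration, a rough sketch of the "intern known things and do a fast comparison" idea — not @CesarBS's actual change — assuming ASCII header bytes and using "keep-alive" as the known value:

```csharp
using System.Text;

static class KnownHeaderValues
{
    private static readonly byte[] KeepAliveBytes = Encoding.ASCII.GetBytes("keep-alive");

    public static string GetConnectionValue(byte[] buffer, int offset, int count)
    {
        // Fast path: byte-compare against the known value and return the
        // interned literal, avoiding any allocation.
        if (count == KeepAliveBytes.Length)
        {
            bool match = true;
            for (int i = 0; i < count; i++)
            {
                if (buffer[offset + i] != KeepAliveBytes[i])
                {
                    match = false;
                    break;
                }
            }
            if (match)
            {
                return "keep-alive";
            }
        }

        // Slow path: unknown value, allocate a new string.
        return Encoding.ASCII.GetString(buffer, offset, count);
    }
}
```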
For some header values like "keep-alive", it is certainly possible to do something more like what @CesarBS's change does. I think it would get trickier for things like user agents. I'm not even sure things like "Accept*" header values are too consistent between browsers, regions, etc... Cookie values are obviously a no-go. It might not seem like cookie values would be reused too often (and that's likely true between different connections), but over a single connection, I expect cookie values get reused frequently. @davidfowl In case you didn't notice, there is a unique StringPool per connection, so everything @benaadams bolded in his previous comment would likely be successfully cached for every subsequent request made over that connection.

Still, I do wonder about the real-world impact of this change. Do we expect this will improve the performance of a typical high-traffic website running on Kestrel? Or is this mostly just to help us out with our benchmarks? Or is there some other scenario where a lot of requests are made over a single connection that I'm not thinking about? Reverse proxies can reuse connections for a lot of requests from different users, but in that case I don't know if caching only the last 25 header values is sufficient. To figure out if this change makes a difference for a typical web server, I imagine we would have to estimate how many requests are usually made over a single connection by a normal user browsing a website in Chrome, IE, etc..., and how many requests there would need to be for us to see a real performance improvement from this change.

Unlike some previous performance improvements, this isn't just a pure win. Obviously I can see some scenarios where this can reduce allocations quite a bit (like with our benchmarks), but there are clearly some scenarios (like with a single request per connection) where it wouldn't help at all. We could mitigate the single-request case by just not using the string pool for non-keep-alive connections, but that covers only one of the possible bad cases (though I like the worst-case analysis and will concede the worst case isn't that bad).

Long story short, I'm conflicted on this change, and when I'm unsure I like to err on the side of simplicity. I haven't closed this yet because maybe there's something I haven't considered that will change my thinking.
@benaadams To add to my previous comment, while a 33% reduction in total allocations is impressive, am I correct in understanding that's for 4k requests over a single connection? I'm also assuming the 16-deep pipelining is helping us reduce our SocketOutput-related allocations by comparison.
@halter73 it's from Firefox, so it would be 4k requests over 6 connections, or 660+ish per connection; but yes, that would likely be a very high number for a single connection.

On the other points you raise: in reality I don't necessarily think it will greatly help with the benchmarks, as they don't really send many header values or any unexpected header keys - so it would be interesting to see what it would do, as it would likely be just the host and path. (This is assuming #441 is already live.)

It would be real-world scenarios that would get the main gains from this, especially if you are also serving static files from the same host. As keep-alive is maintained across pageviews, and is generally around 2 mins from the last request, it ends up being a lot of requests that could take advantage of this. Also, if a reverse proxy is demultiplexing http2 requests, with the current http2 recommendations on many small files (disputed in value by some), the impact could be even greater; likewise for users of async loading like require.js rather than prebundling.

Looking at a not very random sample of sites (ignoring their CDN use) in Chrome, per connection, using the common x6 parallelism:
So on the first page view it's around 24 of the 25 sets of string header creates that would be avoided, rising by 10 on each subsequent page view (if it's done within 2 minutes). JabbR using SignalR+WebSockets is the outlier here, as it's not making any new requests on subsequent actions.
What about as an optional+configurable component?
Definitely would want the ability to switch it off if accepted regardless; e.g. if used as a pure websocket server.
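Purely as an illustration of that kind of switch — the key name below is hypothetical, not an actual Kestrel setting — reading an on-by-default toggle from configuration could look roughly like this:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

class Program
{
    static void Main()
    {
        // Hypothetical configuration key; any real setting would live in Kestrel's options.
        var config = new ConfigurationBuilder()
            .AddInMemoryCollection(new Dictionary<string, string>
            {
                ["server.poolHeaderValueStrings"] = "false"
            })
            .Build();

        // On by default; only an explicit "false" turns it off.
        bool poolEnabled = !string.Equals(
            config["server.poolHeaderValueStrings"], "false", StringComparison.OrdinalIgnoreCase);

        Console.WriteLine($"Header value string pooling enabled: {poolEnabled}");
    }
}
```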
I'm going to try testing what kind of difference this makes in wrk on Monday on our perf lab machines. That should be pretty much a best-case scenario as far as seeing the effects of this PR go, since so many duplicate requests are made over the same connection. It will be interesting to see how much of an impact the reduced allocations make in terms of RPS. I'll make sure the test is long-running, so it takes into account collections. I imagine most of the strings allocated currently are gen 0, so I sorta doubt there will be a significant difference, but if there's anything measurable I'll be inclined to merge this with a switch.
Updated to read from config; it's on by default. @halter73 to make it a little more real-world(ish), try with a script that adds headers, e.g.
Just before
ofc also the regular benchmark, don't want it to regress!
Fixed (was using the wrong type cast for single-byte hashing), verified, squashed
@benaadams I ran our plaintext benchmark scenario with the additional headers you suggested, and here are the results from our perf lab: Baseline:
With this PR merged:
Note that "Allocated: " is the result The main point of showing the GC numbers is to show that there were plenty of Gen 0 collections in each run. All said, I don't think these results show a significant enough change in performance to warrant a string cache for headers. If you have any suggestions on parameters I should tweak when running these tests, I'm all ears. |
I did some more runs with more connections, but less time per run. I also didn't restart the server between runs, but I did do a warmup run before measuring. Here are the results: Baseline:
With this PR merged:
I feel these results also show that this PR has no real impact on Requests/sec.
Let's close this for now then - seems a bit rubbish
Baseline (dff3a4f):
Merged (benaadams/stringpool):
Baseline (dff3a4f):
Merged (benaadams/stringpool):
I still think all the differences are within the margin of error. And I triple checked
As the code is fairly complicated, it definitely doesn't justify itself 👎 Thank you for investigating.
/cc @davidfowl
Ooooh, have thought of something better.... 😈
Wow, risen from the grave. What's different this time?
On a keep-alive connection most of the request header values are the same (method, host, http version, cookie, user agent, language, path, etc.); since this string creation is the highest allocation over a connection, and the values are generally the same per request, reuse the previously created string rather than creating new ones.
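A simplified sketch of that idea, assuming ASCII header bytes: a small per-connection cache of recently seen (hash, string) pairs that is scanned and byte-compared before allocating. The names and the size of 25 are illustrative; this is not the PR's actual implementation.

```csharp
using System.Text;

// A per-connection cache of the most recently created header value strings.
public sealed class ConnectionStringPool
{
    private const int PoolSize = 25;                          // illustrative size
    private readonly uint[] _hashes = new uint[PoolSize];
    private readonly string[] _values = new string[PoolSize];
    private int _next;

    public string GetString(byte[] buffer, int offset, int count)
    {
        // Rotate-xor hash over the 7-bit ASCII bytes (same mixing step as the diff above).
        uint hash = 0;
        for (int i = 0; i < count; i++)
        {
            hash = ((hash << 7) | (hash >> (32 - 7))) ^ buffer[offset + i];
        }

        // Scan the contiguous cache: hash match first, then verify the bytes.
        for (int i = 0; i < PoolSize; i++)
        {
            string candidate = _values[i];
            if (candidate != null && _hashes[i] == hash &&
                candidate.Length == count && BytesEqual(candidate, buffer, offset))
            {
                return candidate; // reuse the previously created string
            }
        }

        // Miss: allocate, then remember it for the next request on this connection.
        string value = Encoding.ASCII.GetString(buffer, offset, count);
        _hashes[_next] = hash;
        _values[_next] = value;
        _next = (_next + 1) % PoolSize;
        return value;
    }

    private static bool BytesEqual(string value, byte[] buffer, int offset)
    {
        for (int i = 0; i < value.Length; i++)
        {
            if (value[i] != (char)buffer[offset + i])
            {
                return false;
            }
        }
        return true;
    }
}
```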
Updated
This change cuts 88% of allocs and 92.5% of alloc'd memory for strings (and 33% of total allocations)
With current "aspnet/dev" + this: 4k /plaintext?N browser requests from Firefox, pipelined 16 deep (where N = 1...200), e.g. 20 refreshes.

Before % allocations:

After % allocations:
Before total allocations: 44,112 strings, 2.8 MBytes
After total allocations: 5,619 strings, 210 kBytes
Resolves final part of #291