This repository has been archived by the owner on Dec 18, 2018. It is now read-only.

Reduce string allocations per request #441

Closed

Conversation

cesarblum
Contributor

There is a fixed set of methods and a fixed set of HTTP versions that we'll see in most requests. When we see one of those, we can reuse the same string instead of allocating a new one; if not, we fall back to allocating a new string.

Since all methods we're checking for plus the HTTP versions fit in 8 bytes, I'm pre-computing longs containing those bytes to make the comparisons faster.
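
A minimal sketch of the idea (illustrative names, not the PR's actual code): the known ASCII bytes are packed into a long once, and the 8 bytes read from the request are compared against that constant, with the bytes after the method shifted away.

private static readonly ulong _httpGetMethodLong = PackAscii("GET ");

// Pack up to 8 ASCII chars into a ulong, first char in the lowest byte,
// matching a little-endian 8-byte read from the request buffer.
private static ulong PackAscii(string value)
{
    ulong packed = 0;
    for (int i = 0; i < value.Length && i < 8; i++)
    {
        packed |= (ulong)(byte)value[i] << (8 * i);
    }
    return packed;
}

// "GET " is only 4 bytes, so shift the XOR left by 32 to discard the bytes
// after the method before testing for zero; no substring is allocated.
private static bool LooksLikeGet(ulong first8Bytes)
{
    return ((first8Bytes ^ _httpGetMethodLong) << 32) == 0;
}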

cc @halter73 @davidfowl @benaadams @DamianEdwards

@benaadams
Contributor

Related: #411, which removes all string allocations for repeated requests on a keep-alive connection - however, this resolves the common allocations across all requests


if (httpVersion != null)
{
    for (int i = 0; i < 8; i++) scan.Take();
Member

Can we avoid this by having PeekLong advance the iterator for us?

@halter73
Member

halter73 commented Dec 2, 2015

@benaadams After discussing #411 with @davidfowl and @CesarBS we were thinking about closing it due to the relative complexity of managing a per-connection invalidating cache to reduce string allocations.

I didn't want to start that discussion before offering an alternative, and this is it. I prefer this approach because it has better (or at least easier-to-analyze) worst-case performance, both memory- and CPU-wise. It might also have a smaller per-connection memory footprint since we don't have to allocate a new cache each time.

@benaadams What do you think?

{
    httpMethod = HttpDeleteMethod;
}
else if (((scanLong ^ _httpGetMethodLong) << 32) == 0)
Member

Super nitpicky, but we might as well check for GETs and POSTs first since I assume they are the most common methods.

@benaadams
Contributor

@halter73 👍 to this as it resolves expected shared strings in known locations; also the two changes aren't in conflict.

Want to discuss it over on the other issue? Or was this the discussion? 😉

@halter73
Member

halter73 commented Dec 2, 2015

You are right that the two changes aren't mutually exclusive, but I'm still not sold on the StringPool yet. I was hoping this change would help convince you that the StringPool isn't necessary 😉

We can continue discussing that over on the other issue though.

@cesarblum
Contributor Author

I was able to optimize this even further. Given that we have a set of 11 known strings that will never change, I went looking for a divisor that would yield a unique modulo value for each string's long representation. It turns out there is such a value (37). So now we have a perfect hash of the known strings, and matching the input against them is just a matter of clearing the uninteresting bits in the input, looking up the resulting value, and then making sure the longs are actually the same.
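
A minimal sketch of that lookup (illustrative names, assuming the divisor of 37 described above rather than the PR's exact code): the remainder of the packed long selects a slot, and the full long is compared to rule out false positives.

private const int PerfectHashDivisor = 37;

private static readonly string[] _knownStrings = new string[PerfectHashDivisor];
private static readonly ulong[] _knownLongs = new ulong[PerfectHashDivisor];

private static void AddKnownString(ulong packed, string value)
{
    // With a perfect hash, no two known longs land in the same slot.
    var slot = (int)(packed % PerfectHashDivisor);
    _knownLongs[slot] = packed;
    _knownStrings[slot] = value;
}

private static bool TryGetKnownString(ulong packed, out string value)
{
    // Callers are expected to have already cleared the bits that don't
    // belong to the method/version before calling this.
    var slot = (int)(packed % PerfectHashDivisor);
    value = _knownStrings[slot];
    return value != null && _knownLongs[slot] == packed;
}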

@@ -0,0 +1,3 @@
{
Contributor Author

revert

@benaadams
Contributor

So now we have a perfect hash of the known strings and matching the input to those is just a matter of ...

Now we're cooking 👍

@pakrym
Contributor

pakrym commented Dec 3, 2015

I think we need to make sure that somebody won't accidentally introduce a hash collision when adding a new string to the known strings. We should also add more comments to the initialization code, because without the context of this issue it looks spooky.
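
One way to catch that early - a hedged sketch, not code from this PR (KnownStringLongs is a hypothetical collection of the precomputed longs, and the test assumes xUnit plus System.Collections.Generic) - is a unit test that fails whenever two known longs share a remainder:

[Fact]
public void KnownStringsDoNotCollideInPerfectHash()
{
    var remainders = new HashSet<ulong>();
    foreach (var packed in KnownStringLongs)
    {
        // A duplicate remainder means the perfect hash is broken and a new
        // divisor (or a different scheme) is needed.
        Assert.True(remainders.Add(packed % 37), $"Hash collision for 0x{packed:X16}");
    }
}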

@cesarblum
Contributor Author

@pakrym Good point 👍 I'll add comments explaining what is going on.

@cesarblum cesarblum changed the title from "Reduce string allocations per request by 30%" to "Reduce string allocations per request" on Dec 3, 2015
@cesarblum cesarblum force-pushed the cesarbs/perf-alloc-optimizations branch from d509edf to bf927e5 on December 3, 2015 at 23:26
@cesarblum
Contributor Author

This is really unfortunate, but the 30% figure from my initial tests was for very simple requests (no headers - I should've thought better about that). With more realistic requests the reduction is a lot smaller (around 1%) 😞

I'm looking for more places where I can apply a similar optimization.

@benaadams
Contributor

LGTM

Only comment would be that GetKnownString could be split into two methods, GetHttpVersionString and GetMethodString, but that might be something for the future, when the hash is broken and there are more types.

@halter73
Member

halter73 commented Dec 8, 2015

I'm waiting on verification that this will work on big-endian architectures before merging.

@cesarblum
Contributor Author

@halter73 It's going to be hard to check that. Raspberry Pi distros set the processor to little endian - in fact, little endian seems to be the default in ARM environments. I'm not sure it's worth verifying this right now. I could file a bug to track that and move on with this change.
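
For what it's worth, the byte-order sensitivity can be demonstrated without big-endian hardware (a standalone illustration, not code from this PR): packing the same ASCII bytes the way a little-endian load would, versus a big-endian load, yields two different longs, so constants precomputed for one layout won't match on the other.

using System;
using System.Text;

class EndiannessDemo
{
    static void Main()
    {
        byte[] ascii = Encoding.ASCII.GetBytes("GET ");
        ulong little = 0, big = 0;
        for (int i = 0; i < ascii.Length; i++)
        {
            little |= (ulong)ascii[i] << (8 * i);       // first byte -> lowest byte
            big    |= (ulong)ascii[i] << (8 * (7 - i)); // first byte -> highest byte
        }
        Console.WriteLine($"0x{little:X16} vs 0x{big:X16}"); // two different values
    }
}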

@cesarblum
Contributor Author

I've emailed @stephentoub asking if they test Core on big endian. If they do, I might try the same approach to test this.

@benaadams
Contributor

Would there be problems with the Frame header collection also?

@cesarblum
Contributor Author

@benaadams Can you elaborate? I don't see what you mean.

@cesarblum
Contributor Author

@benaadams Oh, you mean with regard to endianness? I hadn't looked at that code, but it's likely affected as well.

@cesarblum cesarblum force-pushed the cesarbs/perf-alloc-optimizations branch from cbc261c to 83f55c5 on December 8, 2015 at 01:59
@cesarblum
Contributor Author

OK, I ran much better-controlled tests, and these are the results I got:

Before:

[profiler screenshot]

After:

[profiler screenshot]

Before:

[profiler screenshot]

After:

[profiler screenshot]

(Look at the allocation percentages in the last two)

That looks like some improvement to me 😀

I tested it with wrk, from a remote machine:

wrk -c 256 -t 32 -d 10 http://<local address>:5000

@cesarblum
Contributor Author

@stephentoub replied that they don't test Core on big endian. Again, I'd say we can postpone big-endian verification.

@halter73
Member

halter73 commented Dec 8, 2015

@CesarBS Aside from potentially addressing further feedback, do you think this PR is complete?

@stephentoub
Contributor

@stephentoub replied that they don't test Core on big endian. Again I'd say we can postpone verification on big endian.

I'd suggest at least adding a Debug.Assert(BitConverter.IsLittleEndian) to the relevant code.

@cesarblum
Contributor Author

@halter73 Yes.

@cesarblum
Contributor Author

@stephentoub I'd rather not. This code likely does not work on big endian, but it doesn't break things. It'll just go through a less optimized path.

@stephentoub
Contributor

I'd rather not. This code likely does not work on big endian, but it doesn't break things. It'll just go through a less optimized path

I've not looked at the code in depth. It just looked like wrong values would be computed on big endian. If that's not true, then an assert isn't valuable. But if running on a big-endian system would start producing erroneous results, an assert would help point out the problem immediately, and there's little downside. Like I said, though, I've not looked much at the code, so you know better than I do whether there's a problem.

@cesarblum
Contributor Author

@stephentoub On second thought, I realized someone could craft some weird requests (not necessarily malicious) where this would not behave as expected. I'll add the assert.

@cesarblum cesarblum force-pushed the cesarbs/perf-alloc-optimizations branch from 83f55c5 to 56a5cd1 on December 8, 2015 at 19:27
@cesarblum
Contributor Author

Went with a slightly different approach. Instead of an assert I'm just checking BitConverter.IsLittleEndian and skipping the optimized path if it's false.
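
In shape, that guard looks something like this (illustrative only - maskedScanLong and ReadMethodSlow are hypothetical stand-ins, and TryGetKnownString is the helper sketched earlier):

string httpMethod;
// Only take the packed-long fast path when the precomputed little-endian
// constants are valid for this machine; otherwise use the allocating path.
if (BitConverter.IsLittleEndian && TryGetKnownString(maskedScanLong, out httpMethod))
{
    // Known method: httpMethod now references the shared string, no allocation.
}
else
{
    httpMethod = ReadMethodSlow(); // the existing path that allocates a new string
}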

@cesarblum cesarblum force-pushed the cesarbs/perf-alloc-optimizations branch from 56a5cd1 to 20e6862 on December 8, 2015 at 22:50
@cesarblum
Contributor Author

Squashed.

@benaadams benaadams mentioned this pull request Dec 9, 2015
@AspNetSmurfLab

Benchmarks:

dev

Running 15s test @ http://10.0.0.100:5001/plaintext
32 threads and 256 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.10ms   12.44ms 393.12ms   96.10%
    Req/Sec    24.62k     2.11k   67.62k    93.12%
11787922 requests in 15.10s, 1.45GB read
Socket errors: connect 0, read 0, write 177, timeout 0
Requests/sec: 780789.10
Transfer/sec:     98.29MB

This PR

Running 15s test @ http://10.0.0.100:5001/plaintext
32 threads and 256 connections
Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.94ms   12.55ms 211.73ms   94.72%
    Req/Sec    25.26k     1.87k   54.54k    87.10%
12106391 requests in 15.10s, 1.49GB read
Requests/sec: 801743.82
Transfer/sec:    100.93MB

@CesarBS ran those several times and the RPS was consistently around those marks for each branch.

@benaadams
Contributor

LGTM! 👍

@cesarblum cesarblum force-pushed the cesarbs/perf-alloc-optimizations branch from 20e6862 to 49439e8 on December 14, 2015 at 20:45
@cesarblum
Contributor Author

Ping.

@halter73
Member

:shipit:

@cesarblum
Contributor Author

Yay 😀

@cesarblum cesarblum closed this Dec 16, 2015
@cesarblum cesarblum force-pushed the cesarbs/perf-alloc-optimizations branch from 49439e8 to 349af50 on December 16, 2015 at 19:00
@cesarblum cesarblum deleted the cesarbs/perf-alloc-optimizations branch December 16, 2015 19:00