This repository has been archived by the owner on Dec 18, 2018. It is now read-only.

Reuse writes, initialize queues #363

Closed
benaadams wants to merge 3 commits into dev from default-work-queuesizes

Conversation

benaadams
Contributor

Reuse the UvWriteReq on the socket rather than recreating it for each write - pooled at the per-connection level.

Pre-initialize the circular buffers, rather than resizing them (with a copy) as work is added. A rough sketch of both ideas follows the issue list below.

Saves 1.12 MB per 4k requests; see #363 (comment) for details.

Resolves #165 Pool UvWriteReq objects
Resolves #289 High Allocator DoWriteIfNeeded
Resolves #290 High Allocator SocketOutput.Write
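
Illustrative sketch of the per-connection shape (the type names, pool size and queue capacity here are assumptions for the sketch, not the actual Kestrel types):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: keep a small per-connection pool of write-request
// objects and pre-size the pending-write queue, so steady-state writes
// neither allocate a new request nor grow the queue with a copy-on-resize.
class ConnectionOutputSketch<TWriteReq> where TWriteReq : new()
{
    private const int MaxPooledWriteReqs = 16;    // pooled at the per-connection level
    private const int InitialQueueCapacity = 64;  // pre-initialized backing array

    private readonly Queue<TWriteReq> _writeReqPool = new Queue<TWriteReq>(MaxPooledWriteReqs);
    private readonly Queue<ArraySegment<byte>> _pendingWrites =
        new Queue<ArraySegment<byte>>(InitialQueueCapacity);

    public TWriteReq RentWriteReq()
    {
        // Reuse a pooled request if one is available; only allocate when the
        // pool is empty (e.g. on the connection's first writes).
        return _writeReqPool.Count > 0 ? _writeReqPool.Dequeue() : new TWriteReq();
    }

    public void ReturnWriteReq(TWriteReq req)
    {
        if (_writeReqPool.Count < MaxPooledWriteReqs)
            _writeReqPool.Enqueue(req); // available for the next write on this connection
    }

    public void EnqueueWrite(ArraySegment<byte> buffer) => _pendingWrites.Enqueue(buffer);
}
```

In this shape the first writes on a connection still allocate; everything after that reuses the pooled requests and the already-sized queue.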

@halter73
Member

@benaadams When profiling, we see a fair amount of time spent resizing queues (though this is mostly in SocketOutput). Why did you close this PR?

@benaadams
Contributor Author

I didn't feel it was worthwhile to go round the Travis repeat loop to try to get it to pass; will rebase.

@benaadams
Contributor Author

Rebased and added SocketOutput

benaadams force-pushed the default-work-queuesizes branch 2 times, most recently from 4746d28 to 36b71f8 on November 13, 2015 13:26
@benaadams
Contributor Author

Allocation reductions for 8k requests

Before: 32,023 objects; 2,032,968 bytes

After: 12,059 objects; 947,168 bytes

That is 37% of the objects and 46% of the bytes compared to before.

@halter73
Member

I think @CesarBS is pretty close to fixing our Travis woes. I wouldn't bother going through the Travis repeat loop until that gets in. Sorry that it's been such a pain for you.

I should be able to kick off reruns for all failing PRs myself once that's in.

benaadams force-pushed the default-work-queuesizes branch 2 times, most recently from 904ee08 to 9037939 on November 13, 2015 20:15
@benaadams
Contributor Author

Removed the allocs from DoWriteIfNeeded, so down by another 212,800 bytes.

@benaadams
Contributor Author

And from UvWriteReq.Write

@benaadams
Contributor Author

Boom! Using MemoryPool2 rather than new byte[buffer.Count] in write, and SocketOutput is now alloc-free after the first request on the connection.
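
For illustration, a minimal sketch of this pattern, using ArrayPool<byte> as a stand-in for MemoryPool2 (the pool type and the synchronous shape of the write hand-off are assumptions of the sketch):

```csharp
using System;
using System.Buffers;

static class PooledWriteSketch
{
    // Instead of allocating new byte[buffer.Count] for every write, rent a
    // reusable block from a pool, copy the outgoing bytes into it, and return
    // the block once the write has been handed off.
    public static void Write(ArraySegment<byte> buffer, Action<byte[], int> writeToSocket)
    {
        byte[] block = ArrayPool<byte>.Shared.Rent(buffer.Count);
        try
        {
            Buffer.BlockCopy(buffer.Array, buffer.Offset, block, 0, buffer.Count);
            writeToSocket(block, buffer.Count); // the pooled copy outlives the caller's buffer
        }
        finally
        {
            // In the real async case the block would be returned from the write
            // completion callback; this sketch keeps it synchronous for brevity.
            ArrayPool<byte>.Shared.Return(block);
        }
    }
}
```

The effect is the same as the MemoryPool2 change: after warm-up, writes recycle pooled memory rather than producing per-write garbage.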

@DamianEdwards
Member

Rebased on 7e8a405 and measured. Basically the same. I really need to profile it to see if the size we're setting them to is high enough.

dev (7e8a405) - 499k
363 (2a32d33) - 501k (+0.4%)

@AspNetSmurfLab

Still spending some time in resizing it seems:
[profiler screenshot]

@benaadams
Contributor Author

@DamianEdwards that's quite a nice tool; what are you using? I'm having a hard time measuring GC with Core. But 44.5% still; ouch!

@mattwarren

@benaadams I'm pretty sure that screenshot is from JetBrains DotTrace, not sure which version though.

@khellang
Contributor

Yep, looks like an older version of dotTrace 😄

@benaadams
Contributor Author

With a 45% GC load, though, all bets are off on algorithmic improvements; we need to get that overhead down first.

@mattwarren

Out of interest, which version of the GC were these tests run under? Server, Client, Background Server, Concurrent, etc?

@benaadams
Contributor Author

@DamianEdwards could you retest? Obviously resizing the queues will add to GC pressure.

Have increased the queues from an array of 16 pointers to an array of 64 pointers - this shouldn't make much difference to allocation expense, i.e. the number of objects allocated is the same and they are reused.
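
As a small self-contained illustration of the difference (these are plain Queue<object> instances, not Kestrel's actual work queues):

```csharp
using System.Collections.Generic;

class QueueSizingSketch
{
    static void Main()
    {
        // A queue created at its default capacity re-allocates and copies its
        // backing array each time it doubles as work is added; pre-sizing pays
        // for the 64-slot array (64 pointers = 512 bytes on x64) once, and no
        // copy-on-resize happens afterwards.
        var resized  = new Queue<object>();   // grows (with copies) as items are enqueued
        var presized = new Queue<object>(64); // backing array allocated once and reused

        for (var i = 0; i < 64; i++)
        {
            resized.Enqueue(i);   // triggers several grow-and-copy steps along the way
            presized.Enqueue(i);  // never resizes
        }
    }
}
```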

benaadams added a commit to benaadams/KestrelHttpServer that referenced this pull request Dec 4, 2015
Without waiting for next libuv pass

Fix for potential regression in aspnet#363 due to bug fix.
benaadams mentioned this pull request Dec 7, 2015
@AspNetSmurfLab

Here are the numbers from wrk when running the plaintext benchmark with and without this change:

Baseline:

wrk -c 256 -t 32 -d 30 -s ./scripts/pipeline.lua http://10.0.0.100:5001/plaintext -- 16

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    14.00ms   37.09ms 443.35ms   92.45%
    Req/Sec    21.68k     2.62k  139.04k    89.94%
  20722668 requests in 30.10s, 2.55GB read
Requests/sec: 688462.29
Transfer/sec:     86.67MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.18ms   36.59ms 640.33ms   90.76%
    Req/Sec    21.75k     2.69k  146.35k    90.92%
  20786049 requests in 30.10s, 2.56GB read
Requests/sec: 690577.45
Transfer/sec:     86.93MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    16.44ms   45.55ms   1.09s    92.51%
    Req/Sec    21.65k     2.28k  185.05k    88.12%
  20695354 requests in 30.10s, 2.54GB read
Requests/sec: 687599.80
Transfer/sec:     86.56MB

After merge:

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.26ms    5.24ms 137.84ms   95.14%
    Req/Sec    21.69k     1.11k   34.26k    73.50%
  20739007 requests in 30.04s, 2.55GB read
Requests/sec: 690491.69
Transfer/sec:     86.92MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.24ms    5.42ms 218.06ms   95.36%
    Req/Sec    21.74k     1.07k   48.29k    72.91%
  20789200 requests in 30.05s, 2.56GB read
Requests/sec: 691930.78
Transfer/sec:     87.10MB

Running 30s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.21ms    5.03ms 150.64ms   94.96%
    Req/Sec    21.58k     1.03k   30.87k    69.44%
  20637557 requests in 30.03s, 2.54GB read
Requests/sec: 687156.69
Transfer/sec:     86.50MB

Nothing too significant RPS-wise.

@benaadams
Contributor Author

@halter73 Yeah, but look at latency...!

| Measure | Before | After | % | Improvement |
| --- | --- | --- | --- | --- |
| Avg | 14 / 15 / 16 ms | 4 ms | 26% | x 3.8 |
| StdDev | 37 / 37 / 45 ms | 5 ms | 13% | x 7.7 |
| StdDev% | 92% < 51 / 52 / 61 ms | 95% < 9 ms | 17% | x 5.8 |
| Max | 443 / 640 / 1090 ms | 137 / 218 / 150 ms | 23% | x 4.3 |

That's a massive improvement, as well as reduced variability.

I'll take the x4 latency improvement with no detriment to RPS 👍

That's like being 1,000 km closer to your customers... You can always increase bandwidth; you can't increase the speed of light.

@benaadams
Contributor Author

Rebased - tests pass, but the app is unhappy with "'LogLevel' does not contain a definition for 'Trace'".

@muratg
Contributor

muratg commented Dec 8, 2015

@benaadams LogLevel.Verbose was recently renamed to LogLevel.Trace (aspnet/Announcements#124). Do you have the up-to-date logging packages?

@khellang
Contributor

khellang commented Dec 8, 2015

@muratg It fails both locally and on the CI server; https://ci.appveyor.com/project/aspnet-ci/kestrelhttpserver/build/1.0.867#L509.

Also, the Travis build seems to be stuck; https://travis-ci.org/aspnet/KestrelHttpServer/pull_requests

@muratg
Contributor

muratg commented Dec 8, 2015

I guess the updated packages were not pushed to the feeds yet for some reason. cc: @pranavkm

@benaadams
Contributor Author

@muratg timing? The feeds were updated 20 hours ago; the logging change went in 19 hours ago.

@halter73
Member

halter73 commented Dec 8, 2015

We're going to need to get new numbers: #458 (comment)

@benaadams
Contributor Author

Merged #427 with this as it contains a regression fix.

@halter73
Member

@benaadams What do you think of this idea of pooling UvWriteReqs instead of WriteContext? I started this change with the intention of continuing to pool WriteContexts, but have a shared pool per thread. I quickly realized this wouldn't be possible (without locking or a thread-safe data structure) since SocketOutput.WriteAsync is called from an arbitrary ThreadPool thread.

So at the cost of one extra allocation (that of the WriteContext), we can now have a bigger pool shared by all connections managed by a given thread. I am thinking about opening a PR with this change, but since this is an alternative to pooling WriteContexts, I figured we could discuss it here first.
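
A minimal sketch of the per-thread shape being proposed, assuming the pool is owned by the libuv thread and only ever accessed from it (type and member names are illustrative, not the final API):

```csharp
using System.Collections.Generic;

// Because the pool is owned by a single libuv thread and only accessed from
// that thread, no locking or concurrent collection is needed, and the pool is
// shared by every connection that the thread manages.
class ThreadWriteReqPoolSketch<TReq> where TReq : new()
{
    private const int MaxPoolSize = 1024;
    private readonly Queue<TReq> _pool = new Queue<TReq>(MaxPoolSize);

    public TReq Allocate()
    {
        return _pool.Count > 0 ? _pool.Dequeue() : new TReq();
    }

    public void Return(TReq req)
    {
        if (_pool.Count < MaxPoolSize)
            _pool.Enqueue(req);
        // Otherwise drop the request and let it be collected,
        // at the cost of one extra allocation on a later write.
    }
}
```

The trade-off described above still holds in this sketch: each write pays one WriteContext allocation, but the write requests themselves are recycled across every connection the thread manages.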

@benaadams
Contributor Author

@halter73 Well... minimising UvWriteReq allocations in every way can only be a good thing, since it is both a finalizable object and one that allocates native memory - so wider sharing is probably good, especially if it can be thread safe without requiring extra synchronisation.

However... WriteContext is a pretty big object; so what about pooling the UvWriteReq on the thread as you've done; but also pooling the WriteContext on the socket as it was before (so two queue references)?

@halter73
Member

I'm not sure WriteContext is big enough to be worth pooling. Isn't the UvWriteReq the main component of WriteContext that currently makes it big?

The WriteContext pool only living for the lifetime of a connection instead of the lifetime of the server also makes me disinclined to add it.

@benaadams
Contributor Author

On x64 the size of WriteContext without the UvWriteReq allocation is:

| Component | bytes | total |
| --- | --- | --- |
| MemoryPoolIterator2 _lockedStart; | - | 0 |
| ptr _block; | 8 | 8 |
| int _index; | 4 | 12 |
| MemoryPoolIterator2 _lockedEnd; | - | 12 |
| ptr _block; | 8 | 20 |
| int _index; | 4 | 24 |
| int _bufferCount; | 4 | 28 |
| int ByteCount; | 4 | 32 |
| ptr Self; | 8 | 40 |
| bool SocketShutdownSend; | 4 | 44 |
| bool SocketDisconnect; | 4 | 48 |
| int WriteStatus; | 4 | 52 |
| ptr WriteError; | 8 | 60 |
| ptr _writeReq; | 8 | 68 |
| int ShutdownSendStatus; | 4 | 72 |
| object _syncblk | 4 | 76 |
| object TypeHandle | 8 | 84 |
| object alignment | 4 | 88 |

88 bytes; for 4k requests (write + shutdown + disconnect), where each has only a single write, that's about 1 MB (3 contexts × 88 bytes × 4k ≈ 1 MB).

The number of writes would be higher than the number of requests as there is generally more than 1 write per request.

However, I agree with moving UvWriteReq to a more global pool rather than pooling it with WriteContext, as it has higher costs (finalizer + native memory).

@benaadams
Contributor Author

@halter73 which would be this on top of yours; look good?

Added PR "SocketOutput Allocation Reductions" #466 with it.

benaadams closed this Dec 10, 2015
benaadams added a commit to benaadams/KestrelHttpServer that referenced this pull request Dec 10, 2015
Without waiting for next libuv pass

Fix for potential regression in aspnet#363 due to bug fix.
benaadams added a commit to benaadams/KestrelHttpServer that referenced this pull request Dec 10, 2015
Without waiting for next libuv pass

Fix for potential regression in aspnet#363 due to bug fix.
benaadams deleted the default-work-queuesizes branch May 10, 2016 11:00