buffer: Optimize memory layout for buffer slices so it is better aligned with the 16KB transport socket read size #14111
Conversation
Signed-off-by: Antonio Vicente <avd@google.com>
My understanding is that we were reading full 16k chunks, but the allocated buffer ended up being 20k because it took 16k plus header/metadata, which rounded up to 20k. Still, that was about 4k wasted on every read (20%), so I think it makes a lot of sense not to inline that data.
The first read allocates a buffer with 20KB-64 bytes. The second 16KB read adds 4032 bytes to the 20KB buffer and reads the remaining 12352 bytes into a 16KB-64 byte buffer. So the buffer is not fragmented, but it results in an expensive linearize when writing SSL records later.
Got it. I'm most of the way done writing a simple benchmark around buffer slices, SSL_write(), and linearize; it should be easy to adapt it to compare performance. I think the performance risk, which may or may not be caught by a micro-benchmark, is the double pointer dereference needed to get to the data. One alternative (not sure of the feasibility) is to put the header (pointer and len) directly into the
Yes, moving away from unique pointers in the deques would be helpful. In order to do that we should remove the subclassing used when tracking mutable and immutable slices. It should just be a relatively straightforward refactoring of the buffer class.
Benchmark based on #14053 seems to show a 5% improvement in the testThroughput/0/0/10 case by avoiding the linearize copies that used to be required due to mixing of slices with 4096*N-64 byte slices for N in {1..5}. Without this change: With this change:
Actually, all pointer accesses are done via Slice::base_ in the base class. There's no change in the number of pointers de-referenced when accessing the contents of slices. |
Signed-off-by: Antonio Vicente <avd@google.com>
True, but when the data was inlined, the
ggreenway
left a comment
Benchmark results look slightly faster with this version, and it makes everything simpler and easier to reason about. I'm in favor of this.
Naively, I would also think that a single allocation vs. two (one for the slice and one for the data) would be better?
It sounds like I should go the extra mile and replace the Slice interface with a struct and get rid of a level of pointers and allocations. I'll try to look into it sometime this week.
It's not a big deal, feel free to merge this if you both agree on it, I was just curious from a drive-by perspective. I'm just surprised it's faster.
It's probably because instead of only using the first handful of bytes in the last page (for sizes that are a multiple of the page size) and wasting the rest, one fewer page is used, and the Slice headers can be packed nicely by the allocator, with many in a single page.
Yeah, this makes sense. I wonder whether, as part of this change, the slice size could go up to the allocator page size? Wasn't it smaller previously to account for the data at the beginning?
The slice size has always been however much is requested by the caller, rounded up to the next multiple of 4096. This meant that all the 16384-byte requests were fulfilled by 5 pages instead of 4. The extra space is probably used by the next read. This change makes 16384-byte requests work as the caller would naively expect, so I think it does what you're hoping.
Ah OK, makes sense. SGTM.
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
Latest version of the benchmark. Note that the new version of the buffer ends up hitting an odd edge case when the short slices are 4096 bytes: the benchmark ends up adding a slice with 16KB-short_slices bytes in it before the full-sized slices. I hope that the extra cases explored in the benchmark will help you measure the benefits of #14053. before: after:
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
improve handshake error handling Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
ggreenway
left a comment
Aside from the minor naming nit, I think this looks good. Benchmarks look to be as good or better, depending on the test case, and I think this makes the buffering behavior much easier to understand.
  buffer.commit(&slice, 1);
}

static void addTenFullSizeSlices(Buffer::Instance& buffer) {
The name isn't quite correct. What this actually does is simulate 10 full size reads from a socket.
Attempted to address the naming issue.
}

if (move_slices) {
  // Append many full-sized slices, in the same manner that HTTP codecs would move data from an
Signed-off-by: Antonio Vicente <avd@google.com>
Commit Message:
buffer: Optimize memory layout for buffer slices so it is better aligned with the 16KB transport socket read size.
Additional Description:
Buffer slices of odd sizes seem to have negative CPU performance consequences, as they result in extra copies when generating 16KB SSL records from the buffer contents.
Risk Level: Low; no changes to the external buffer API, only changes to how memory is arranged.
Testing: Adjusted unit test expectations, new benchmark based on ggreenway's work.
Docs Changes: n/a
Release Notes: n/a