buffer: Optimize memory layout for buffer slices so it is better aligned with the 16KB transport socket read size #14111
Conversation
Signed-off-by: Antonio Vicente <avd@google.com>
My understanding is that we were reading full 16k chunks, but the allocated buffer ended up being 20k because it took 16k plus header/metadata, which rounded up to 20k. Still, that was about 4k wasted on every read (20%), so I think it makes a lot of sense not to inline that data.
The first read allocates a buffer with 20KB-64 bytes. The second 16KB read adds 4032 bytes to the 20KB buffer and reads the remaining 12352 bytes into a 16KB-64 byte buffer. So the buffer is not fragmented, but it results in an expensive linearize when writing SSL records later.
Got it. I'm most of the way done writing a simple benchmark around buffer slices, SSL_write(), and linearize; it should be easy to adapt it to compare performance. I think the performance risk, which may or may not be caught by a micro-benchmark, is the double pointer dereference needed to get to the data. One alternative (not sure of the feasibility) is to put the header (pointer and len) directly into the
Yes, moving away from unique pointers in the deques would be helpful. In order to do that we should remove the subclassing used when tracking mutable and immutable slices. It should just be a relatively straightforward refactoring of the buffer class.
Benchmark based on #14053 seems to show a 5% improvement in the testThroughput/0/0/10 case by avoiding the linearize copies that used to be required due to mixing of slices with 4096*N-64 byte slices for N in {1..5}. Without this change: With this change:
Actually, all pointer accesses are done via Slice::base_ in the base class. There's no change in the number of pointers de-referenced when accessing the contents of slices. |
Signed-off-by: Antonio Vicente <avd@google.com>
True, but when the data was inlined, the
ggreenway
left a comment
Benchmark results look slightly faster with this version, and it makes everything simpler and easier to reason about. I'm in favor of this.
Naively, I would also think that a single allocation vs. two (one for the slice and one for the data) would be better?
It sounds like I should go the extra mile and replace the Slice interface with a struct and get rid of a level of pointers and allocations. I'll try to look into it sometime this week.
It's not a big deal, feel free to merge this if you both agree on it, I was just curious from a drive-by perspective. I'm just surprised it's faster.
It's probably because instead of only using the first handful of bytes in the last page (for sizes that are a multiple of the page size) and wasting the rest, one fewer page is used, and the Slice headers can be packed nicely by the allocator, with many in a single page.
Yeah, this makes sense. I wonder whether, as part of this change, the slice size could go up to the allocator page size? Wasn't it smaller previously to account for the data at the beginning?
The slice size has always been however much is requested by the caller, rounded up to the next multiple of 4096. This meant that all the 16384-byte requests were fulfilled by 5 pages instead of 4. The extra space is probably used by the next read. This change makes 16384-byte requests work as the caller would naively expect, so I think it does what you're hoping.
Ah OK, makes sense. SGTM.
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
Latest version of the benchmark. Note that the new version of the buffer ends up hitting an odd edge case when the short slices are 4096 bytes: the benchmark ends up adding a slice with 16KB-short_slices bytes in it before the full-sized slices. I hope that the extra cases explored in the benchmark will help you measure the benefits of #14053. before: after:
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
improve handshake error handling Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
ggreenway
left a comment
Aside from the minor naming nit, I think this looks good. Benchmarks look to be as good or better, depending on the test case, and I think this makes the buffering behavior much easier to understand.
  buffer.commit(&slice, 1);
}

static void addTenFullSizeSlices(Buffer::Instance& buffer) {
The name isn't quite correct. What this actually does is simulate 10 full size reads from a socket.
Attempted to address the naming issue.
}

if (move_slices) {
  // Append many full-sized slices, in the same manner that HTTP codecs would move data from an
Signed-off-by: Antonio Vicente <avd@google.com>
Commit Message:
buffer: Optimize memory layout for buffer slices so it is better aligned with the 16KB transport socket read size.
Additional Description:
Buffer slices of odd sizes seem to have negative CPU performance consequences, as they result in extra copies when generating 16KB SSL records from the buffer contents.
Risk Level: Low; no changes to the external buffer API, only changes to how memory is arranged.
Testing: Adjusted unit test expectations, new benchmark based on ggreenway's work.
Docs Changes: n/a
Release Notes: n/a