buffer: Optimize memory layout for buffer slices so it is better aligned with the 16KB transport socket read size#14111

Merged
ggreenway merged 10 commits into envoyproxy:master from
antoniovicente:owner_slice_refactor
Dec 4, 2020

Conversation

@antoniovicente
Contributor

@antoniovicente antoniovicente commented Nov 20, 2020

Commit Message:
buffer: Optimize memory layout for buffer slices so it is better aligned with the 16KB transport socket read size.

Additional Description:
Having buffer slices of odd sizes seems to have some negative CPU performance consequences as it results in extra copies when generating 16KB SSL records from the buffer contents.

Risk Level: low, no changes to the external buffer API, just changes to how memory is arranged.
Testing: Adjusted unit test expectations, new benchmark based on ggreenway's work.
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Antonio Vicente <avd@google.com>
@ggreenway
Member

My understanding is that we were reading full 16k chunks, but the allocated buffer ended up being 20k because it took 16k plus header/metadata, which rounded up to 20k. But still that was about 4k wasted on every read (20%), so I think it makes a lot of sense to not inline that data.

@antoniovicente
Contributor Author

> My understanding is that we were reading full 16k chunks, but the allocated buffer ended up being 20k because it took 16k plus header/metadata, which rounded up to 20k. But still that was about 4k wasted on every read (20%), so I think it makes a lot of sense to not inline that data.

The first read allocates a buffer of 20 KB - 64 bytes (20416 bytes). The second 16 KB read adds 4032 bytes to that 20 KB buffer and reads the remaining 12352 bytes into a buffer of 16 KB - 64 bytes.

So the buffer is not fragmented, but the layout results in expensive linearize calls when writing SSL records later.
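
The arithmetic in this comment can be reproduced with a small model. This is only a sketch: `ModelSlice`, `inlineCapacity`, and `simulateRead` are hypothetical names, not Envoy code, and the model assumes a 64-byte header stored inline with the data and allocations rounded up to a multiple of 4096 bytes, as described in this thread.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed constants, taken from the numbers quoted in the discussion.
constexpr uint64_t kPage = 4096;       // allocation granularity
constexpr uint64_t kHeader = 64;       // inline slice header size
constexpr uint64_t kReadSize = 16384;  // transport socket read size

// Hypothetical stand-in for a buffer slice: usable capacity and bytes filled.
struct ModelSlice {
  uint64_t capacity;
  uint64_t used;
};

// Usable capacity under the old layout: header and data share one
// allocation that is rounded up to whole pages.
uint64_t inlineCapacity(uint64_t requested) {
  const uint64_t pages = (requested + kHeader + kPage - 1) / kPage;
  return pages * kPage - kHeader;
}

// Model one read: fill the free tail of the last slice first, then
// allocate a new slice for whatever is left over.
void simulateRead(std::vector<ModelSlice>& slices, uint64_t bytes) {
  if (!slices.empty()) {
    ModelSlice& last = slices.back();
    const uint64_t n = std::min(bytes, last.capacity - last.used);
    last.used += n;
    bytes -= n;
  }
  if (bytes > 0) {
    slices.push_back({inlineCapacity(bytes), bytes});
  }
}
```

Calling `simulateRead(slices, kReadSize)` twice on an empty vector leaves a first slice holding 20416 bytes and a second slice with capacity 16320 holding 12352 bytes, which matches the figures above.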

@ggreenway
Member

Got it.

I'm most of the way done writing a simple benchmark around buffer slices and SSL_write() and linearize; it should be easy to adapt it to compare performance.

I think the performance risk, which may or may not be caught by a micro-benchmark, is the double pointer deref needed to get to the data. One alternative (not sure of the feasibility) is to put the header (pointer and len) directly into the SliceDeque instead of pointers to that block.

@antoniovicente
Contributor Author

> Got it.
>
> I'm most of the way done writing a simple benchmark around buffer slices and SSL_write() and linearize; it should be easy to adapt it to compare performance.
>
> I think the performance risk, which may or may not be caught by a micro-benchmark, is the double pointer deref needed to get to the data. One alternative (not sure of the feasibility) is to put the header (pointer and len) directly into the SliceDeque instead of pointers to that block.

Yes, moving away from unique pointers in the deques would be helpful. In order to do that we should remove the subclassing used when tracking mutable and immutable slices. It should just be a relatively straightforward refactoring of the buffer class.
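
As a rough illustration of the alternative discussed here, the sketch below stores slice headers by value in a deque, with each header owning separately allocated storage. `SketchSlice` and `SketchSliceDeque` are hypothetical names, and `std::deque` is only a stand-in for Envoy's SliceDeque; this is not the code from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <deque>
#include <memory>

// Hypothetical value-type slice: no virtual methods and no subclassing,
// so it can live directly inside the container.
struct SketchSlice {
  explicit SketchSlice(size_t cap) : storage(new uint8_t[cap]), capacity(cap) {}

  // Copy as many bytes as fit into the free tail; return the count copied.
  size_t append(const void* src, size_t len) {
    const size_t room = capacity - size;
    const size_t n = len < room ? len : room;
    std::memcpy(storage.get() + size, src, n);
    size += n;
    return n;
  }

  std::unique_ptr<uint8_t[]> storage;  // data lives in its own allocation
  size_t capacity;                     // total bytes available in storage
  size_t size{0};                      // bytes currently filled
};

// Headers are held by value, so reaching the data takes one pointer
// dereference from the container element.
using SketchSliceDeque = std::deque<SketchSlice>;
```

With this shape, `slices.emplace_back(16384)` allocates exactly the requested data size, and the header itself costs no extra heap allocation beyond the container's own storage.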

@antoniovicente
Contributor Author

antoniovicente commented Nov 24, 2020

Benchmark based on #14053 seems to show a 5% improvement in the testThroughput/0/0/10 case by avoiding the linearize copies that used to be required when the buffer mixed slices of 4096*N-64 bytes, for N from 1 to 5.

Without this change:

Benchmark                         Time             CPU   Iterations UserCounters...
testThroughput/0/0/0           6.08 us         6.07 us       115217 throughput=2.69878G/s writes_per_iteration=1
testThroughput/0/0/5           34.4 us         34.4 us        20344 throughput=2.85751G/s writes_per_iteration=6
testThroughput/0/0/10          62.8 us         62.8 us        11156 throughput=2.8699G/s writes_per_iteration=11
testThroughput/1/1/0           6.59 us         6.59 us       106446 throughput=2.4857G/s writes_per_iteration=1
testThroughput/1/1/5           35.2 us         35.2 us        19903 throughput=2.79063G/s writes_per_iteration=6
testThroughput/1/1/10          63.8 us         63.8 us        10983 throughput=2.82659G/s writes_per_iteration=11
testThroughput/128/1/0         6.53 us         6.53 us       106832 throughput=2.50952G/s writes_per_iteration=1
testThroughput/128/1/5         34.6 us         34.6 us        20225 throughput=2.84392G/s writes_per_iteration=6
testThroughput/128/1/10        63.0 us         63.0 us        11082 throughput=2.86119G/s writes_per_iteration=11
testThroughput/4096/1/0        6.52 us         6.52 us       107212 throughput=2.5144G/s writes_per_iteration=1
testThroughput/4096/1/5        34.9 us         34.9 us        20042 throughput=2.82045G/s writes_per_iteration=6
testThroughput/4096/1/10       63.3 us         63.3 us        11069 throughput=2.84769G/s writes_per_iteration=11
testThroughput/1/2/0           6.61 us         6.61 us       106464 throughput=2.47929G/s writes_per_iteration=1
testThroughput/1/2/5           35.4 us         35.4 us        19840 throughput=2.7806G/s writes_per_iteration=6
testThroughput/1/2/10          63.9 us         63.9 us        10957 throughput=2.8218G/s writes_per_iteration=11
testThroughput/128/2/0         6.55 us         6.55 us       107025 throughput=2.5011G/s writes_per_iteration=1
testThroughput/128/2/5         34.6 us         34.6 us        20234 throughput=2.84506G/s writes_per_iteration=6
testThroughput/128/2/10        62.9 us         62.9 us        11121 throughput=2.86685G/s writes_per_iteration=11
testThroughput/4096/2/0        6.51 us         6.51 us       107793 throughput=2.51697G/s writes_per_iteration=1
testThroughput/4096/2/5        34.9 us         34.9 us        20038 throughput=2.81385G/s writes_per_iteration=6
testThroughput/4096/2/10       63.3 us         63.3 us        11067 throughput=2.84633G/s writes_per_iteration=11
testThroughput/1/3/0           6.61 us         6.60 us       106163 throughput=2.48197G/s writes_per_iteration=1
testThroughput/1/3/5           35.2 us         35.2 us        19903 throughput=2.78953G/s writes_per_iteration=6
testThroughput/1/3/10          63.8 us         63.8 us        10945 throughput=2.82398G/s writes_per_iteration=11
testThroughput/128/3/0         6.56 us         6.55 us       106860 throughput=2.49989G/s writes_per_iteration=1
testThroughput/128/3/5         35.0 us         35.0 us        20039 throughput=2.81042G/s writes_per_iteration=6
testThroughput/128/3/10        63.0 us         63.0 us        11086 throughput=2.86123G/s writes_per_iteration=11
testThroughput/4096/3/0        6.51 us         6.51 us       107107 throughput=2.5159G/s writes_per_iteration=1
testThroughput/4096/3/5        35.0 us         35.0 us        19955 throughput=2.80748G/s writes_per_iteration=6
testThroughput/4096/3/10       63.4 us         63.4 us        11064 throughput=2.8409G/s writes_per_iteration=11

With this change:

Benchmark                         Time             CPU   Iterations UserCounters...
testThroughput/0/0/0           6.10 us         6.10 us       114328 throughput=2.68561G/s writes_per_iteration=1
testThroughput/0/0/5           32.5 us         32.5 us        21623 throughput=3.02461G/s writes_per_iteration=6
testThroughput/0/0/10          58.8 us         58.8 us        11961 throughput=3.06482G/s writes_per_iteration=11
testThroughput/1/1/0           6.55 us         6.55 us       107076 throughput=2.49992G/s writes_per_iteration=1
testThroughput/1/1/5           35.2 us         35.2 us        19874 throughput=2.78915G/s writes_per_iteration=6
testThroughput/1/1/10          63.9 us         63.8 us        10967 throughput=2.82262G/s writes_per_iteration=11
testThroughput/128/1/0         6.51 us         6.52 us       107613 throughput=2.51438G/s writes_per_iteration=1
testThroughput/128/1/5         34.9 us         34.9 us        20066 throughput=2.8184G/s writes_per_iteration=6
testThroughput/128/1/10        63.1 us         63.1 us        11074 throughput=2.85814G/s writes_per_iteration=11
testThroughput/4096/1/0        6.53 us         6.54 us       107076 throughput=2.50522G/s writes_per_iteration=1
testThroughput/4096/1/5        33.3 us         33.3 us        21048 throughput=2.95276G/s writes_per_iteration=6
testThroughput/4096/1/10       59.5 us         59.5 us        11752 throughput=3.02848G/s writes_per_iteration=11
testThroughput/1/2/0           6.57 us         6.57 us       106575 throughput=2.49454G/s writes_per_iteration=1
testThroughput/1/2/5           35.2 us         35.2 us        19866 throughput=2.78996G/s writes_per_iteration=6
testThroughput/1/2/10          63.8 us         63.8 us        10970 throughput=2.82358G/s writes_per_iteration=11
testThroughput/128/2/0         6.51 us         6.52 us       107302 throughput=2.51465G/s writes_per_iteration=1
testThroughput/128/2/5         34.9 us         34.9 us        20085 throughput=2.81983G/s writes_per_iteration=6
testThroughput/128/2/10        63.7 us         63.7 us        11052 throughput=2.83113G/s writes_per_iteration=11
testThroughput/4096/2/0        6.64 us         6.65 us       106841 throughput=2.46434G/s writes_per_iteration=1
testThroughput/4096/2/5        33.3 us         33.3 us        20970 throughput=2.95183G/s writes_per_iteration=6
testThroughput/4096/2/10       59.6 us         59.6 us        11747 throughput=3.02168G/s writes_per_iteration=11
testThroughput/1/3/0           6.61 us         6.61 us       105860 throughput=2.47856G/s writes_per_iteration=1
testThroughput/1/3/5           35.3 us         35.3 us        19849 throughput=2.78513G/s writes_per_iteration=6
testThroughput/1/3/10          63.9 us         63.9 us        10945 throughput=2.82245G/s writes_per_iteration=11
testThroughput/128/3/0         6.53 us         6.53 us       106742 throughput=2.50784G/s writes_per_iteration=1
testThroughput/128/3/5         34.9 us         34.9 us        19998 throughput=2.81464G/s writes_per_iteration=6
testThroughput/128/3/10        63.4 us         63.4 us        10980 throughput=2.84267G/s writes_per_iteration=11
testThroughput/4096/3/0        6.52 us         6.53 us       106798 throughput=2.50767G/s writes_per_iteration=1
testThroughput/4096/3/5        33.3 us         33.3 us        20978 throughput=2.94981G/s writes_per_iteration=6
testThroughput/4096/3/10       59.6 us         59.6 us        11740 throughput=3.02173G/s writes_per_iteration=11
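
To see why mixed slice sizes lead to linearize copies, the following sketch counts how many fixed-size records straddle a slice boundary. It is a simplified model, not the benchmark code: `countSpanningRecords` is a hypothetical helper, and it assumes that a record needs a copy exactly when its bytes are not contained in a single slice.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count records of `record_size` bytes that cross a slice boundary, given
// the number of data bytes held in each slice, in order.
uint64_t countSpanningRecords(const std::vector<uint64_t>& slice_sizes,
                              uint64_t record_size) {
  if (record_size == 0) {
    return 0;  // nothing sensible to count
  }
  uint64_t total = 0;
  for (uint64_t s : slice_sizes) {
    total += s;
  }
  uint64_t spanning = 0;
  size_t idx = 0;           // slice containing the current record start
  uint64_t slice_start = 0; // stream offset where slice `idx` begins
  for (uint64_t start = 0; start < total; start += record_size) {
    const uint64_t end = std::min(start + record_size, total);
    // Advance to the slice that contains `start`.
    while (slice_start + slice_sizes[idx] <= start) {
      slice_start += slice_sizes[idx];
      ++idx;
    }
    // The record spills into the next slice if it ends past this one.
    if (end > slice_start + slice_sizes[idx]) {
      ++spanning;
    }
  }
  return spanning;
}
```

For example, slices holding 20416 and 12352 bytes (the sizes quoted earlier in this thread) give one spanning 16384-byte record out of two, while two slices of exactly 16384 bytes give none.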

@antoniovicente
Contributor Author

> > Got it.
> > I'm most of the way done writing a simple benchmark around buffer slices and SSL_write() and linearize; it should be easy to adapt it to compare performance.
> > I think the performance risk, which may or may not be caught by a micro-benchmark, is the double pointer deref needed to get to the data. One alternative (not sure of the feasibility) is to put the header (pointer and len) directly into the SliceDeque instead of pointers to that block.
>
> Yes, moving away from unique pointers in the deques would be helpful. In order to do that we should remove the subclassing used when tracking mutable and immutable slices. It should just be a relatively straightforward refactoring of the buffer class.

Actually, all pointer accesses are done via Slice::base_ in the base class. There's no change in the number of pointers de-referenced when accessing the contents of slices.

Signed-off-by: Antonio Vicente <avd@google.com>
@ggreenway
Member

> Actually, all pointer accesses are done via Slice::base_ in the base class. There's no change in the number of pointers de-referenced when accessing the contents of slices.

True, but when the data was inlined, the base_ dereference is almost certainly an L1 cache hit.

Member

@ggreenway ggreenway left a comment

Benchmark results look slightly faster with this version, and it makes everything simpler and easier to reason about. I'm in favor of this.

@mattklein123
Member

> True, but when the data was inlined, the base_ dereference is almost certainly an L1 cache hit.

Naively, I would also think that a single allocation vs. 2 (1 for the slice and one for data) would be better also?

@antoniovicente
Contributor Author

> > True, but when the data was inlined, the base_ dereference is almost certainly an L1 cache hit.
>
> Naively, I would also think that a single allocation vs. 2 (1 for the slice and one for data) would be better also?

It sounds like I should go the extra mile and replace the Slice interface with a struct and get rid of a level of pointers and allocations. I'll try to look into it sometime this week.

@mattklein123
Member

> It sounds like I should go the extra mile and replace the Slice interface with a struct and get rid of a level of pointers and allocations. I'll try to look into it sometime this week.

It's not a big deal, feel free to merge this if you both agree on it. I was just curious from a drive-by perspective. I'm just surprised it's faster.

@ggreenway
Member

> > It sounds like I should go the extra mile and replace the Slice interface with a struct and get rid of a level of pointers and allocations. I'll try to look into it sometime this week.
>
> It's not a big deal, feel free to merge this if you both agree on it. I was just curious from a drive-by perspective. I'm just surprised it's faster.

It's probably because, instead of using only the first handful of bytes in the last page (for sizes that are a multiple of the page size) and wasting the rest, one fewer page is used, and the Slice headers can be packed nicely by the allocator, with many in a single page.

@mattklein123
Member

> It's probably because, instead of using only the first handful of bytes in the last page (for sizes that are a multiple of the page size) and wasting the rest, one fewer page is used, and the Slice headers can be packed nicely by the allocator, with many in a single page.

Yeah this makes sense. I wonder if as part of this change the slice size could go up to the allocator page size? Wasn't it less previously to account for the data at the beginning?

@ggreenway
Member

The slice size has always been however much is requested by the caller, rounded up to the next multiple of 4096. This meant that all the 16384 requests were fulfilled by 5 pages instead of 4. The extra space is probably used by the next read. This change would make 16384 requests work as the caller would naively expect. So I think it does what you're hoping.
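
The page counts mentioned here follow from simple rounding. The sketch below assumes a 4096-byte page and the 64-byte inline header implied by the "20kb-64 bytes" figure earlier in the thread; the function and constant names are illustrative only.

```cpp
#include <cstdint>

// Assumed values from the discussion, not read from Envoy at runtime.
constexpr uint64_t kPageSize = 4096;
constexpr uint64_t kAssumedHeaderSize = 64;

// Number of whole pages needed to hold `bytes`.
constexpr uint64_t pagesFor(uint64_t bytes) {
  return (bytes + kPageSize - 1) / kPageSize;
}

// Old layout: the header shares the allocation with the data.
constexpr uint64_t pagesWithInlineHeader(uint64_t requested) {
  return pagesFor(requested + kAssumedHeaderSize);
}

// New layout: the data allocation holds only the requested bytes.
constexpr uint64_t pagesWithSeparateHeader(uint64_t requested) {
  return pagesFor(requested);
}

static_assert(pagesWithInlineHeader(16384) == 5, "16 KB plus header spills into a 5th page");
static_assert(pagesWithSeparateHeader(16384) == 4, "16 KB alone fits in 4 pages");
static_assert(kPageSize / kAssumedHeaderSize == 64, "64 such headers fit in one page");
```

Under these assumptions a 16384-byte request drops from 5 pages to 4, and up to 64 headers of that size could share a single page if the allocator packs them together.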

@mattklein123
Member

Ah OK makes sense. SGTM.

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
@antoniovicente
Contributor Author

Latest version of the benchmark. Note that the new version of the buffer hits an odd edge case when the short slices are 4096 bytes: the benchmark ends up adding a slice with 16kb-short_slices bytes in it before the full-sized slices. I hope that the extra cases explored in the benchmark will help you measure the benefits of #14053.

before:

Benchmark                          Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------
testThroughput/0/0/0/0          55.4 us         55.4 us        12618 num_linearized=9 throughput=2.9593G/s writes_per_iteration=10
testThroughput/1/1/0/0          56.8 us         56.8 us        12293 num_linearized=9 throughput=2.88312G/s writes_per_iteration=11
testThroughput/1/2/0/0          57.9 us         57.9 us        12096 num_linearized=10 throughput=2.82812G/s writes_per_iteration=11
testThroughput/1/3/0/0          57.9 us         57.9 us        12101 num_linearized=10 throughput=2.83036G/s writes_per_iteration=11
testThroughput/128/1/0/0        56.9 us         56.9 us        12286 num_linearized=9 throughput=2.88366G/s writes_per_iteration=11
testThroughput/128/2/0/0        57.5 us         57.5 us        12181 num_linearized=10 throughput=2.85472G/s writes_per_iteration=11
testThroughput/128/3/0/0        58.2 us         58.2 us        12171 num_linearized=10 throughput=2.82361G/s writes_per_iteration=11
testThroughput/4095/1/0/0       57.9 us         57.9 us        12113 num_linearized=9 throughput=2.90171G/s writes_per_iteration=11
testThroughput/4095/2/0/0       59.7 us         59.7 us        11701 num_linearized=10 throughput=2.8829G/s writes_per_iteration=11
testThroughput/4095/3/0/0       60.7 us         60.7 us        11549 num_linearized=10 throughput=2.90372G/s writes_per_iteration=11
testThroughput/4096/1/0/0       57.8 us         57.8 us        12117 num_linearized=9 throughput=2.9036G/s writes_per_iteration=11
testThroughput/4096/2/0/0       59.1 us         59.1 us        11850 num_linearized=10 throughput=2.91126G/s writes_per_iteration=11
testThroughput/4096/3/0/0       60.2 us         60.2 us        11634 num_linearized=10 throughput=2.92718G/s writes_per_iteration=11
testThroughput/4097/1/0/0       57.9 us         57.9 us        12111 num_linearized=9 throughput=2.90185G/s writes_per_iteration=11
testThroughput/4097/2/0/0       59.7 us         59.7 us        11739 num_linearized=10 throughput=2.88383G/s writes_per_iteration=11
testThroughput/4097/3/0/0       60.7 us         60.7 us        11549 num_linearized=10 throughput=2.90404G/s writes_per_iteration=11
testThroughput/0/0/1/0          60.7 us         60.7 us        11531 num_linearized=10 throughput=2.97042G/s writes_per_iteration=11
testThroughput/1/1/1/0          61.8 us         61.8 us        11334 num_linearized=11 throughput=2.91574G/s writes_per_iteration=11
testThroughput/1/2/1/0          61.9 us         61.8 us        11320 num_linearized=11 throughput=2.91423G/s writes_per_iteration=11
testThroughput/1/3/1/0          61.8 us         61.8 us        11334 num_linearized=11 throughput=2.91718G/s writes_per_iteration=11
testThroughput/128/1/1/0        61.0 us         60.9 us        11496 num_linearized=10 throughput=2.95738G/s writes_per_iteration=11
testThroughput/128/2/1/0        61.0 us         61.0 us        11459 num_linearized=10 throughput=2.95268G/s writes_per_iteration=11
testThroughput/128/3/1/0        61.0 us         61.0 us        11441 num_linearized=10 throughput=2.95574G/s writes_per_iteration=11
testThroughput/4095/1/1/0       61.5 us         61.5 us        11383 num_linearized=11 throughput=2.93016G/s writes_per_iteration=11
testThroughput/4095/2/1/0       61.8 us         61.8 us        11345 num_linearized=11 throughput=2.91815G/s writes_per_iteration=11
testThroughput/4095/3/1/0       61.6 us         61.6 us        11337 num_linearized=11 throughput=2.92691G/s writes_per_iteration=11
testThroughput/4096/1/1/0       61.2 us         61.2 us        11411 num_linearized=11 throughput=2.94609G/s writes_per_iteration=11
testThroughput/4096/2/1/0       61.4 us         61.4 us        11435 num_linearized=11 throughput=2.93678G/s writes_per_iteration=11
testThroughput/4096/3/1/0       61.2 us         61.2 us        11438 num_linearized=11 throughput=2.94537G/s writes_per_iteration=11
testThroughput/4097/1/1/0       61.5 us         61.5 us        11371 num_linearized=11 throughput=2.93031G/s writes_per_iteration=11
testThroughput/4097/2/1/0       61.8 us         61.8 us        11330 num_linearized=11 throughput=2.91458G/s writes_per_iteration=11
testThroughput/4097/3/1/0       61.7 us         61.7 us        11337 num_linearized=11 throughput=2.92173G/s writes_per_iteration=11
testThroughput/0/0/0/1          55.2 us         55.2 us        12671 num_linearized=9 throughput=2.9663G/s writes_per_iteration=10
testThroughput/1/1/0/1          57.7 us         57.7 us        12106 num_linearized=10 throughput=2.83864G/s writes_per_iteration=11
testThroughput/1/2/0/1          57.8 us         57.8 us        12068 num_linearized=10 throughput=2.83561G/s writes_per_iteration=11
testThroughput/1/3/0/1          57.6 us         57.6 us        12102 num_linearized=10 throughput=2.84293G/s writes_per_iteration=11
testThroughput/128/1/0/1        57.3 us         57.3 us        12204 num_linearized=10 throughput=2.8616G/s writes_per_iteration=11
testThroughput/128/2/0/1        57.4 us         57.4 us        12183 num_linearized=10 throughput=2.85882G/s writes_per_iteration=11
testThroughput/128/3/0/1        57.4 us         57.4 us        12157 num_linearized=10 throughput=2.85966G/s writes_per_iteration=11
testThroughput/4095/1/0/1       58.7 us         58.7 us        11939 num_linearized=10 throughput=2.86215G/s writes_per_iteration=11
testThroughput/4095/2/0/1       59.6 us         59.6 us        11745 num_linearized=10 throughput=2.88558G/s writes_per_iteration=11
testThroughput/4095/3/0/1       60.6 us         60.6 us        11542 num_linearized=10 throughput=2.90509G/s writes_per_iteration=11
testThroughput/4096/1/0/1       58.3 us         58.3 us        12006 num_linearized=10 throughput=2.87982G/s writes_per_iteration=11
testThroughput/4096/2/0/1       59.2 us         59.2 us        11838 num_linearized=10 throughput=2.90764G/s writes_per_iteration=11
testThroughput/4096/3/0/1       60.2 us         60.2 us        11629 num_linearized=10 throughput=2.92615G/s writes_per_iteration=11
testThroughput/4097/1/0/1       58.8 us         58.8 us        11919 num_linearized=10 throughput=2.85643G/s writes_per_iteration=11
testThroughput/4097/2/0/1       59.6 us         59.6 us        11714 num_linearized=10 throughput=2.88438G/s writes_per_iteration=11
testThroughput/4097/3/0/1       60.6 us         60.6 us        11532 num_linearized=10 throughput=2.90613G/s writes_per_iteration=11
testThroughput/0/0/1/1          60.3 us         60.3 us        11614 num_linearized=9 throughput=2.9907G/s writes_per_iteration=11
testThroughput/1/1/1/1          60.8 us         60.8 us        11514 num_linearized=10 throughput=2.96351G/s writes_per_iteration=11
testThroughput/1/2/1/1          60.9 us         60.9 us        11503 num_linearized=10 throughput=2.96038G/s writes_per_iteration=11
testThroughput/1/3/1/1          60.9 us         60.9 us        11494 num_linearized=10 throughput=2.95816G/s writes_per_iteration=11
testThroughput/128/1/1/1        60.8 us         60.8 us        11516 num_linearized=10 throughput=2.96399G/s writes_per_iteration=11
testThroughput/128/2/1/1        60.9 us         60.9 us        11495 num_linearized=10 throughput=2.96121G/s writes_per_iteration=11
testThroughput/128/3/1/1        60.9 us         60.9 us        11512 num_linearized=10 throughput=2.96037G/s writes_per_iteration=11
testThroughput/4095/1/1/1       60.7 us         60.7 us        11535 num_linearized=10 throughput=2.96745G/s writes_per_iteration=11
testThroughput/4095/2/1/1       60.9 us         60.9 us        11452 num_linearized=10 throughput=2.96165G/s writes_per_iteration=11
testThroughput/4095/3/1/1       61.0 us         61.0 us        11492 num_linearized=10 throughput=2.95683G/s writes_per_iteration=11
testThroughput/4096/1/1/1       60.8 us         60.8 us        11504 num_linearized=10 throughput=2.96615G/s writes_per_iteration=11
testThroughput/4096/2/1/1       60.8 us         60.8 us        11529 num_linearized=10 throughput=2.96275G/s writes_per_iteration=11
testThroughput/4096/3/1/1       60.9 us         60.9 us        11496 num_linearized=10 throughput=2.95725G/s writes_per_iteration=11
testThroughput/4097/1/1/1       60.9 us         60.9 us        11469 num_linearized=10 throughput=2.9598G/s writes_per_iteration=11
testThroughput/4097/2/1/1       60.9 us         60.9 us        11471 num_linearized=10 throughput=2.95977G/s writes_per_iteration=11
testThroughput/4097/3/1/1       61.0 us         61.0 us        11494 num_linearized=10 throughput=2.95602G/s writes_per_iteration=11

after:


Benchmark                          Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------
testThroughput/0/0/0/0          51.3 us         51.3 us        13662 num_linearized=0 throughput=3.19252G/s writes_per_iteration=10
testThroughput/1/1/0/0          56.6 us         56.6 us        12393 num_linearized=9 throughput=2.89528G/s writes_per_iteration=11
testThroughput/1/2/0/0          57.7 us         57.7 us        12143 num_linearized=10 throughput=2.84122G/s writes_per_iteration=11
testThroughput/1/3/0/0          57.7 us         57.7 us        12119 num_linearized=10 throughput=2.83889G/s writes_per_iteration=11
testThroughput/128/1/0/0        56.6 us         56.6 us        12344 num_linearized=9 throughput=2.89589G/s writes_per_iteration=11
testThroughput/128/2/0/0        57.3 us         57.3 us        12222 num_linearized=10 throughput=2.86294G/s writes_per_iteration=11
testThroughput/128/3/0/0        57.2 us         57.2 us        12201 num_linearized=10 throughput=2.87069G/s writes_per_iteration=11
testThroughput/4095/1/0/0       58.5 us         58.5 us        12166 num_linearized=9 throughput=2.872G/s writes_per_iteration=11
testThroughput/4095/2/0/0       59.5 us         59.4 us        11637 num_linearized=10 throughput=2.89393G/s writes_per_iteration=11
testThroughput/4095/3/0/0       60.4 us         60.4 us        11565 num_linearized=10 throughput=2.91687G/s writes_per_iteration=11
testThroughput/4096/1/0/0       57.6 us         57.6 us        12122 num_linearized=9 throughput=2.91508G/s writes_per_iteration=11
testThroughput/4096/2/0/0       58.9 us         58.9 us        11881 num_linearized=10 throughput=2.9185G/s writes_per_iteration=11
testThroughput/4096/3/0/0       60.0 us         60.0 us        11655 num_linearized=10 throughput=2.93615G/s writes_per_iteration=11
testThroughput/4097/1/0/0       57.7 us         57.7 us        12131 num_linearized=9 throughput=2.91012G/s writes_per_iteration=11
testThroughput/4097/2/0/0       59.4 us         59.4 us        11791 num_linearized=10 throughput=2.8961G/s writes_per_iteration=11
testThroughput/4097/3/0/0       60.9 us         60.9 us        11486 num_linearized=11 throughput=2.89227G/s writes_per_iteration=11
testThroughput/0/0/1/0          56.3 us         56.3 us        12392 num_linearized=0 throughput=3.199G/s writes_per_iteration=11
testThroughput/1/1/1/0          61.6 us         61.6 us        11311 num_linearized=11 throughput=2.92708G/s writes_per_iteration=11
testThroughput/1/2/1/0          61.6 us         61.6 us        11350 num_linearized=11 throughput=2.92565G/s writes_per_iteration=11
testThroughput/1/3/1/0          61.7 us         61.7 us        11343 num_linearized=11 throughput=2.92146G/s writes_per_iteration=11
testThroughput/128/1/1/0        61.1 us         61.1 us        11464 num_linearized=11 throughput=2.94915G/s writes_per_iteration=11
testThroughput/128/2/1/0        61.1 us         61.1 us        11443 num_linearized=11 throughput=2.95038G/s writes_per_iteration=11
testThroughput/128/3/1/0        61.2 us         61.2 us        11447 num_linearized=11 throughput=2.94629G/s writes_per_iteration=11
testThroughput/4095/1/1/0       61.4 us         61.4 us        11385 num_linearized=11 throughput=2.93476G/s writes_per_iteration=11
testThroughput/4095/2/1/0       61.4 us         61.4 us        11380 num_linearized=11 throughput=2.93462G/s writes_per_iteration=11
testThroughput/4095/3/1/0       61.6 us         61.6 us        11359 num_linearized=11 throughput=2.92796G/s writes_per_iteration=11
testThroughput/4096/1/1/0       57.4 us         57.4 us        12194 num_linearized=2 throughput=3.13948G/s writes_per_iteration=11
testThroughput/4096/2/1/0       57.5 us         57.5 us        12160 num_linearized=2 throughput=3.13422G/s writes_per_iteration=11
testThroughput/4096/3/1/0       57.2 us         57.2 us        12180 num_linearized=2 throughput=3.14849G/s writes_per_iteration=11
testThroughput/4097/1/1/0       61.7 us         61.6 us        11374 num_linearized=11 throughput=2.92368G/s writes_per_iteration=11
testThroughput/4097/2/1/0       61.7 us         61.7 us        11359 num_linearized=11 throughput=2.92187G/s writes_per_iteration=11
testThroughput/4097/3/1/0       61.7 us         61.7 us        11339 num_linearized=11 throughput=2.92301G/s writes_per_iteration=11
testThroughput/0/0/0/1          51.5 us         51.5 us        13621 num_linearized=0 throughput=3.18375G/s writes_per_iteration=10
testThroughput/1/1/0/1          57.7 us         57.7 us        12131 num_linearized=10 throughput=2.83879G/s writes_per_iteration=11
testThroughput/1/2/0/1          57.8 us         57.7 us        12120 num_linearized=10 throughput=2.83735G/s writes_per_iteration=11
testThroughput/1/3/0/1          57.8 us         57.8 us        12097 num_linearized=10 throughput=2.8344G/s writes_per_iteration=11
testThroughput/128/1/0/1        57.3 us         57.3 us        12186 num_linearized=10 throughput=2.86175G/s writes_per_iteration=11
testThroughput/128/2/0/1        57.4 us         57.4 us        12191 num_linearized=10 throughput=2.86045G/s writes_per_iteration=11
testThroughput/128/3/0/1        57.5 us         57.5 us        12132 num_linearized=10 throughput=2.85846G/s writes_per_iteration=11
testThroughput/4095/1/0/1       58.5 us         58.5 us        11946 num_linearized=10 throughput=2.8707G/s writes_per_iteration=11
testThroughput/4095/2/0/1       59.5 us         59.5 us        11769 num_linearized=10 throughput=2.89333G/s writes_per_iteration=11
testThroughput/4095/3/0/1       60.6 us         60.6 us        11554 num_linearized=10 throughput=2.90854G/s writes_per_iteration=11
testThroughput/4096/1/0/1       58.2 us         58.2 us        12037 num_linearized=10 throughput=2.88512G/s writes_per_iteration=11
testThroughput/4096/2/0/1       58.9 us         58.9 us        11862 num_linearized=10 throughput=2.92063G/s writes_per_iteration=11
testThroughput/4096/3/0/1       60.0 us         60.0 us        11653 num_linearized=10 throughput=2.93572G/s writes_per_iteration=11
testThroughput/4097/1/0/1       58.9 us         58.9 us        11909 num_linearized=10 throughput=2.85219G/s writes_per_iteration=11
testThroughput/4097/2/0/1       59.7 us         59.6 us        11729 num_linearized=10 throughput=2.88433G/s writes_per_iteration=11
testThroughput/4097/3/0/1       60.4 us         60.4 us        11559 num_linearized=10 throughput=2.91698G/s writes_per_iteration=11
testThroughput/0/0/1/1          56.4 us         56.4 us        12415 num_linearized=0 throughput=3.19574G/s writes_per_iteration=11
testThroughput/1/1/1/1          57.0 us         57.0 us        12297 num_linearized=1 throughput=3.16012G/s writes_per_iteration=11
testThroughput/1/2/1/1          56.9 us         56.9 us        12269 num_linearized=1 throughput=3.1655G/s writes_per_iteration=11
testThroughput/1/3/1/1          57.0 us         57.0 us        12233 num_linearized=1 throughput=3.16216G/s writes_per_iteration=11
testThroughput/128/1/1/1        57.1 us         57.1 us        12265 num_linearized=1 throughput=3.15485G/s writes_per_iteration=11
testThroughput/128/2/1/1        57.0 us         57.0 us        12267 num_linearized=1 throughput=3.16437G/s writes_per_iteration=11
testThroughput/128/3/1/1        57.0 us         57.0 us        12262 num_linearized=1 throughput=3.161G/s writes_per_iteration=11
testThroughput/4095/1/1/1       57.2 us         57.2 us        12287 num_linearized=1 throughput=3.15336G/s writes_per_iteration=11
testThroughput/4095/2/1/1       56.9 us         56.9 us        12259 num_linearized=1 throughput=3.16538G/s writes_per_iteration=11
testThroughput/4095/3/1/1       57.1 us         57.0 us        12266 num_linearized=1 throughput=3.15911G/s writes_per_iteration=11
testThroughput/4096/1/1/1       57.0 us         57.0 us        12288 num_linearized=1 throughput=3.16001G/s writes_per_iteration=11
testThroughput/4096/2/1/1       57.0 us         57.0 us        12272 num_linearized=1 throughput=3.1634G/s writes_per_iteration=11
testThroughput/4096/3/1/1       57.1 us         57.1 us        12261 num_linearized=1 throughput=3.15666G/s writes_per_iteration=11
testThroughput/4097/1/1/1       57.1 us         57.0 us        12277 num_linearized=1 throughput=3.15925G/s writes_per_iteration=11
testThroughput/4097/2/1/1       57.0 us         57.0 us        12272 num_linearized=1 throughput=3.15994G/s writes_per_iteration=11
testThroughput/4097/3/1/1       57.1 us         57.1 us        12274 num_linearized=1 throughput=3.15633G/s writes_per_iteration=11

@antoniovicente antoniovicente marked this pull request as ready for review December 2, 2020 21:42
@antoniovicente antoniovicente changed the title from "buffer: Stop using inline storage for owned slices" to "buffer: Optimize memory layout for buffer slices so it is better aligned with the 16KB transport socket read size" Dec 2, 2020
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
improve handshake error handling

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
Member

@ggreenway ggreenway left a comment

Aside from the minor naming nit, I think this looks good. Benchmarks look to be as good or better, depending on the test case, and I think this makes the buffering behavior much easier to understand.

buffer.commit(&slice, 1);
}

static void addTenFullSizeSlices(Buffer::Instance& buffer) {
Member

The name isn't quite correct. What this actually does is simulate 10 full size reads from a socket.

Contributor Author

Attempted to address the naming issue.

}

if (move_slices) {
// Append many full-sized slices, the same manner that HTTP codecs would move data from an
Member

Good idea!

Signed-off-by: Antonio Vicente <avd@google.com>
@ggreenway ggreenway merged commit 7cb9a9e into envoyproxy:master Dec 4, 2020
3 participants