Skip to content

Improve partitioned output operator performance for strings#12798

Merged
sopel39 merged 3 commits intotrinodb:masterfrom
starburstdata:ls/018-speedup-slice-positions-appender
Sep 8, 2022
Merged

Improve partitioned output operator performance for strings#12798
sopel39 merged 3 commits intotrinodb:masterfrom
starburstdata:ls/018-speedup-slice-positions-appender

Conversation

@lukasz-stec
Copy link
Copy Markdown
Member

@lukasz-stec lukasz-stec commented Jun 10, 2022

Description

Improve performance of SlicePositionsAppender.
The improvement comes from not allocating Slice per position + using System.arrayCopy directly instead of Slice.getBytes

Is this change a fix, improvement, new feature, refactoring, or other?

performance improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine (partitioned exchange) and SPI extension

How would you describe this change to a non-technical end user or system administrator?

Improve performance of remote partitioned exchange

Benchamrks

jmh

about 30% improvement

Before
Benchmark                                   (channelCount)  (enableCompression)  (nullRate)  (partitionCount)  (positionCount)   (type)  Mode  Cnt     Score     Error  Units
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192  VARCHAR  avgt   20  2137.367 ± 166.834  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192  VARCHAR  avgt   20  1660.550 ±  24.274  ms/op

After
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192  VARCHAR  avgt   20  1399.476 ± 181.301  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192  VARCHAR  avgt   20  1194.077 ± 160.496  ms/op

tpch/tpcds

image
Overall there is a small improvement. This is expected as PartitionedOutputOperator is only around 5% of total CPU time and this pr only improves varchars.
When looking at the PartitionedOutputOperator stats only, the improvement is ~10%.

Before
Total: 980917.6656576941
PartitionedOutputOperator: 43522.99081258998, 0.044369667645253706%
After
Total: 974919.225831081
PartitionedOutputOperator: 38515.177167889924, 0.03950601870124904%

poo-slice-orc-part-sf1K.pdf

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Improve performance of remote partitioned exchange on string data types. ({issue}`12798`)

@cla-bot cla-bot bot added the cla-signed label Jun 10, 2022
@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from 201bac1 to 51c506d Compare June 10, 2022 15:51
@cla-bot cla-bot bot added the cla-signed label Jun 10, 2022
@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from 51c506d to 04d2417 Compare June 15, 2022 08:48
@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from 04d2417 to 160d89b Compare June 17, 2022 09:16
Copy link
Copy Markdown
Member Author

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments answered.

@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from 160d89b to d305bf1 Compare June 20, 2022 08:21
Copy link
Copy Markdown
Member Author

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments addressed

@lukasz-stec lukasz-stec marked this pull request as ready for review June 20, 2022 08:21
@lukasz-stec
Copy link
Copy Markdown
Member Author

Made it non-draft as @raunaqmorarka is reviewing it anyway.

@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from d305bf1 to 8d20e3c Compare July 1, 2022 10:04
@lukasz-stec
Copy link
Copy Markdown
Member Author

I rebased on the master (there were some changes to the SlicePositionsAppender) and as discussed offline I reverted the BatchByteWriter change in favor of exposing raw Slice and offsets from the VariableWidthBlock.

@lukasz-stec
Copy link
Copy Markdown
Member Author

I also updated the jmh results as the absolute number changed significantly after rebasing on the master and running the benchmarks on java 11.0.15 instead of 11.0.11. The relative improvements are similar (slightly better).

@lukasz-stec lukasz-stec requested a review from raunaqmorarka July 1, 2022 10:10
@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from 8d20e3c to c4265a5 Compare July 1, 2022 12:52
Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, looks good overall

Comment on lines 223 to 224
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could be simplified to something like

int currentStartOffset = startOffset + length;
for (int i = 0; i < count; i++) {
    offsets[positionCount + i + 1] = currentStartOffset;
    currentStartOffset += length;
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this comment.
applied now.

Comment on lines 213 to 208
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative way here is to copy only once into the byte array from Slice using this method, then use System.arrayCopy to copy bytes within the byte array with doubling size at each step, so you get only log_2(count) calls to System.arrayCopy overall.
Not sure if it'll be faster though.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea.
First of all, it allows us to not rely on Slice being backed by a byte array.
Second, I did a quick benchmark and this method is ridiculously fast on small to medium chunk sizes.
I suspect it's not gonna be that used in practice but I included this in this PR given results below.

Benchmark                               (length)  Mode  Cnt      Score   Error  Units
BenchmarkDuplicateBytes.arrayCopy              2  avgt   20      4.101 ±   0.082 ns/op
BenchmarkDuplicateBytes.arrayCopy             16  avgt   20      5.088 ±   0.055 ns/op
BenchmarkDuplicateBytes.arrayCopy             35  avgt   20      4.389 ±   0.175 ns/op
BenchmarkDuplicateBytes.arrayCopy             32  avgt   20      4.333 ±   0.049 ns/op
BenchmarkDuplicateBytes.arrayCopy             64  avgt   20      4.900 ±   0.050 ns/op
BenchmarkDuplicateBytes.arrayCopy             80  avgt   20      5.609 ±   0.117 ns/op
BenchmarkDuplicateBytes.arrayCopy            128  avgt   20      5.318 ±   0.069 ns/op
BenchmarkDuplicateBytes.arrayCopy            200  avgt   20      6.087 ±   0.098 ns/op
BenchmarkDuplicateBytes.arrayCopy            256  avgt   20      6.971 ±   0.111 ns/op
BenchmarkDuplicateBytes.arrayCopy            512  avgt   20     12.175 ±   0.324 ns/op
BenchmarkDuplicateBytes.arrayCopy           1024  avgt   20     22.759 ±   0.300 ns/op
BenchmarkDuplicateBytes.arrayCopy           2048  avgt   20     55.769 ±   1.213 ns/op
BenchmarkDuplicateBytes.arrayCopy         100000  avgt   20   4931.827 ±  276.57 ns/op
BenchmarkDuplicateBytes.arrayCopy        1000000  avgt   20  70111.621 ± 922.907 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x         2  avgt   20      0.230 ±   0.005 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x        16  avgt   20      0.352 ±   0.015 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x        35  avgt   20      0.533 ±   0.052 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x        32  avgt   20      0.498 ±   0.007 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x        64  avgt   20      0.881 ±   0.013 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x        80  avgt   20      1.100 ±   0.015 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x       128  avgt   20      1.518 ±   0.228 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x       200  avgt   20      3.522 ±   0.069 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x       256  avgt   20      4.472 ±   0.154 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x       512  avgt   20     10.083 ±   0.206 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x      1024  avgt   20     21.741 ±   0.349 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x      2048  avgt   20     57.329 ±   0.854 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x    100000  avgt   20   5243.664 ± 188.963 ns/op
BenchmarkDuplicateBytes.copyFromSelf2x   1000000  avgt   20 109400.739 ± 992.325 ns/op

@raunaqmorarka raunaqmorarka force-pushed the ls/018-speedup-slice-positions-appender branch from c4265a5 to 1b9a136 Compare July 6, 2022 13:31
@raunaqmorarka
Copy link
Copy Markdown
Member

raunaqmorarka commented Jul 6, 2022

CI hit #13107

@raunaqmorarka raunaqmorarka force-pushed the ls/018-speedup-slice-positions-appender branch from 1b9a136 to 93a4b22 Compare July 11, 2022 05:36
@raunaqmorarka raunaqmorarka changed the title Speedup SlicePositionsAppender Improve partitioned output operator performance for strings Jul 11, 2022
@raunaqmorarka raunaqmorarka requested a review from skrzypo987 July 11, 2022 05:42
@raunaqmorarka raunaqmorarka force-pushed the ls/018-speedup-slice-positions-appender branch from 93a4b22 to ddfad78 Compare July 11, 2022 05:58
Copy link
Copy Markdown
Member

@skrzypo987 skrzypo987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok % comment about making getRawSlice method public

@raunaqmorarka raunaqmorarka force-pushed the ls/018-speedup-slice-positions-appender branch from ddfad78 to a803c35 Compare July 11, 2022 08:36
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this already covered by adaptation?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what adaptation?

Copy link
Copy Markdown
Member

@sopel39 sopel39 Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type tests with block adaptation

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this new one. So no, it's not covered by normal tests, even with adaptation. Like I mentioned here I need to access SlicePositionsAppender directly because UnnestingPositionsAppender copies RLE value to byte array always.
That said, in the current implementation I don't rely on the byte array in the RLE block anymore so I could drop this test

@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from e3b4118 to f725cbc Compare September 2, 2022 12:41
Copy link
Copy Markdown
Member Author

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments addressed

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what adaptation?

@lukasz-stec lukasz-stec requested a review from sopel39 September 2, 2022 12:50
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename test name, if it's not byte array, then what it is?

Why the test is important (please add short comment)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename test name, if it's not byte array, then what it is?

the core of this test is that it is not a byte array and not that it is e.g. long array.
comment added.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename, NotByteArray -> WithSomeType

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped this test. it's no longer needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything should support supportsNullRle

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayBlockBuilder does not support it

Copy link
Copy Markdown
Member

@sopel39 sopel39 Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayBlockBuilder does not support it

Does it make sense to add support there? It seems odd that we have to work around that particular type here since all other types (including complex ones) support it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, created #13973

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved changes from #13973 to the first commit here + adjusted the tests

@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from f725cbc to 8d3ea38 Compare September 2, 2022 14:06
Copy link
Copy Markdown
Member Author

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ca

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename test name, if it's not byte array, then what it is?

the core of this test is that it is not a byte array and not that it is e.g. long array.
comment added.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped this test. it's no longer needed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayBlockBuilder does not support it

@lukasz-stec lukasz-stec requested a review from sopel39 September 2, 2022 14:06
@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from 8d3ea38 to 97d5eb7 Compare September 2, 2022 15:18
@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch 2 times, most recently from f0108e4 to 7fb023e Compare September 2, 2022 19:11
Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from 7fb023e to 46ec5b4 Compare September 6, 2022 09:32
Copy link
Copy Markdown
Member Author

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ca

@lukasz-stec lukasz-stec requested a review from sopel39 September 6, 2022 09:33
Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment

Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

% comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testAppendSliceNotSupportedByByteArray -> testAppendSliceNotBackedByByteArray

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Sep 7, 2022

conflict, please rebase

Produce RunLengthEncodedBlock in ArrayBlockBuilder
when all values are null
The improvement comes from not allocating Slice
per position and using System.arrayCopy directly
instead of Slice.getBytes.

Before
Benchmark                                   (channelCount)  (enableCompression)  (nullRate)  (partitionCount)  (positionCount)   (type)  Mode  Cnt     Score     Error  Units
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192  VARCHAR  avgt   20  2137.367 ± 166.834  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192  VARCHAR  avgt   20  1660.550 ±  24.274  ms/op

After
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192  VARCHAR  avgt   20  1399.476 ± 181.301  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192  VARCHAR  avgt   20  1194.077 ± 160.496  ms/op
@lukasz-stec lukasz-stec force-pushed the ls/018-speedup-slice-positions-appender branch from 931cdf4 to 45dfbe8 Compare September 7, 2022 20:06
@lukasz-stec
Copy link
Copy Markdown
Member Author

conflict, please rebase

rebased, checks in progress

@sopel39 sopel39 merged commit 1f260bc into trinodb:master Sep 8, 2022
@sopel39 sopel39 mentioned this pull request Sep 8, 2022
@github-actions github-actions bot added this to the 396 milestone Sep 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants