Optimize partitioned exchange for RowType channels#12762

Merged
raunaqmorarka merged 1 commit into trinodb:master from starburstdata:ls/020-poo-row-type
Jul 4, 2022

@lukasz-stec lukasz-stec commented Jun 9, 2022

Optimize partitioned exchange for RowType channels
with batch oriented RowPositionsAppender

Description

Is this change a fix, improvement, new feature, refactoring, or other?

performance improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine and spi

How would you describe this change to a non-technical end user or system administrator?

Benchmarks

tpch/tpcds orc sf1000 partitioned

There is a slight (1.5%) CPU improvement. This is expected, as PartitionedOutputOperator accounts for about 5% of the overall CPU.

poo-row-type-sf1000-orc-part.pdf

jmh

There is a 2x to 4x improvement.
before

Benchmark                                   (channelCount)  (enableCompression)  (nullRate)  (partitionCount)  (positionCount)                 (type)  Mode  Cnt    Score    Error  Units
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192      ROW_BIGINT_BIGINT  avgt   20  625.371 ± 12.527  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192  ROW_RLE_BIGINT_BIGINT  avgt   20  518.245 ± 82.994  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192      ROW_BIGINT_BIGINT  avgt   20  557.170 ± 65.907  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192  ROW_RLE_BIGINT_BIGINT  avgt   20  473.040 ± 64.605  ms/op

after

Benchmark                                   (channelCount)  (enableCompression)  (nullRate)  (partitionCount)  (positionCount)                 (type)  Mode  Cnt    Score   Error  Units
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192      ROW_BIGINT_BIGINT  avgt   20  144.102 ± 5.855  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192  ROW_RLE_BIGINT_BIGINT  avgt   20  109.074 ± 1.223  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192      ROW_BIGINT_BIGINT  avgt   20  281.255 ± 4.728  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192  ROW_RLE_BIGINT_BIGINT  avgt   20  162.054 ± 2.670  ms/op
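
For reference, the per-configuration speedup is just the "before" score divided by the "after" score. A minimal sketch of that arithmetic, with the scores copied from the two tables above (the class and method names are illustrative, not part of this PR):

```java
// Scores (ms/op) are copied from the before/after JMH tables above.
// JmhSpeedup and speedup() are hypothetical names used only for this sketch.
final class JmhSpeedup
{
    private JmhSpeedup() {}

    // Speedup factor: how many times faster the "after" run is than the "before" run.
    static double speedup(double beforeMsPerOp, double afterMsPerOp)
    {
        return beforeMsPerOp / afterMsPerOp;
    }

    public static void main(String[] args)
    {
        String[] labels = {
                "ROW_BIGINT_BIGINT, nullRate 0",
                "ROW_RLE_BIGINT_BIGINT, nullRate 0",
                "ROW_BIGINT_BIGINT, nullRate 0.2",
                "ROW_RLE_BIGINT_BIGINT, nullRate 0.2",
        };
        double[] before = {625.371, 518.245, 557.170, 473.040};
        double[] after = {144.102, 109.074, 281.255, 162.054};

        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-38s %.2fx%n", labels[i], speedup(before[i], after[i]));
        }
    }
}
```

The ratios come out lowest for the nullRate 0.2 rows and highest for the nullRate 0 rows, so the gain depends noticeably on how many null rows are present.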

This also brings big improvements for queries with a large number of aggregations that use RowType as an intermediate state e.g. sum.
This sample query sees about a 30% improvement in CPU time.

trino:tpch_sf1000_dec_orc_part> explain analyze select cast ((orderkey+partkey) %1300000 as int), sum(suppkey), sum(suppkey + 1), sum(suppkey + 2), sum(suppkey + 3),
                             -> sum(suppkey + 4), sum(suppkey + 5), sum(suppkey + 6), sum(suppkey + 7), 
                             -> sum(suppkey + 8), sum(suppkey + 9), sum(suppkey + 10), sum(suppkey + 11)
                             -> from lineitem group by cast ((orderkey + partkey) % 1300000  as int);

before

Query 20220610_101803_00005_8baqg, FINISHED, 7 nodes
http://localhost:8082/ui/query.html?20220610_101803_00005_8baqg
Splits: 3,081 total, 3,081 done (100.00%)
CPU Time: 32273.5s total,  186K rows/s, 1.88MB/s, 68% active
Per Node: 20.1 parallelism, 3.73M rows/s, 37.6MB/s
Parallelism: 140.5
Peak Memory: 9.59GB
3:50 [6B rows, 59.1GB] [26.1M rows/s, 263MB/s]

after

Query 20220610_090149_00009_b42pz, FINISHED, 7 nodes
http://localhost:8081/ui/query.html?20220610_090149_00009_b42pz
Splits: 3,081 total, 3,081 done (100.00%)
CPU Time: 22725.7s total,  264K rows/s, 2.66MB/s, 75% active
Per Node: 16.7 parallelism,  4.4M rows/s, 44.4MB/s
Parallelism: 116.7
Peak Memory: 9.87GB
3:15 [6B rows, 59.1GB] [30.8M rows/s, 311MB/s]
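
The improvement can be checked from the two query summaries above. A small sketch of the calculation, assuming the elapsed times 3:50 and 3:15 are minutes:seconds (the class and method names are made up for illustration):

```java
// Inputs are copied from the "before" and "after" query summaries above.
// QueryStatsDelta and relativeReduction() are hypothetical names for this sketch.
final class QueryStatsDelta
{
    private QueryStatsDelta() {}

    // Fraction by which a metric dropped, e.g. 0.25 means 25% lower than before.
    static double relativeReduction(double before, double after)
    {
        return (before - after) / before;
    }

    public static void main(String[] args)
    {
        double cpuBeforeSeconds = 32273.5;
        double cpuAfterSeconds = 22725.7;
        double wallBeforeSeconds = 3 * 60 + 50; // 3:50, assumed to be m:ss
        double wallAfterSeconds = 3 * 60 + 15;  // 3:15, assumed to be m:ss

        System.out.printf("CPU time reduction:  %.1f%%%n",
                100 * relativeReduction(cpuBeforeSeconds, cpuAfterSeconds));
        System.out.printf("Wall time reduction: %.1f%%%n",
                100 * relativeReduction(wallBeforeSeconds, wallAfterSeconds));
    }
}
```

With these inputs the CPU time reduction is a little under 30% and the wall time reduction is roughly 15%, so the "about 30%" figure refers to CPU time rather than elapsed time.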

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Improve performance of queries with a large number of aggregations that use `RowType` as an intermediate state (e.g. `sum`).

@cla-bot cla-bot bot added the cla-signed label Jun 9, 2022
@lukasz-stec lukasz-stec marked this pull request as ready for review June 10, 2022 10:39
@lukasz-stec lukasz-stec force-pushed the ls/020-poo-row-type branch from e879b19 to 4c17996 Compare June 13, 2022 08:32

@lukasz-stec lukasz-stec left a comment

comments answered, addressed.

@skrzypo987 skrzypo987 left a comment

LGTM % comments

@lukasz-stec lukasz-stec force-pushed the ls/020-poo-row-type branch from 4c17996 to 391e6f2 Compare June 15, 2022 08:35

@lukasz-stec lukasz-stec left a comment

comments addressed

@lukasz-stec lukasz-stec requested a review from skrzypo987 June 15, 2022 08:35
Member

Is it possible that the fieldBlocks coming out of RowBlock are LazyBlocks? Not sure if PositionsAppender deals with those.

Member Author

I don't think it's possible. fieldBlocks are loaded when the block is loaded via getLoadedBlock, and the PartitionedOutputOperator is configured with the PageChannelSelector page preprocessor.

@lukasz-stec lukasz-stec force-pushed the ls/020-poo-row-type branch from 391e6f2 to ec1f450 Compare June 15, 2022 13:28

@lukasz-stec lukasz-stec left a comment

comments addressed

@lukasz-stec lukasz-stec force-pushed the ls/020-poo-row-type branch from ec1f450 to f0d4efd Compare June 20, 2022 08:27

@lukasz-stec lukasz-stec left a comment

added a commit with a branchless AbstractRowBlock#copyPositions

@lukasz-stec

AbstractRowBlock#copyPositions branchless impl fixed

@raunaqmorarka raunaqmorarka left a comment

Could you extract the 2nd commit as a separate PR? I think we can land that immediately.

@lukasz-stec lukasz-stec force-pushed the ls/020-poo-row-type branch from 75da2d9 to f0d4efd Compare June 22, 2022 07:23

@lukasz-stec lukasz-stec left a comment

branchless copyPositions extracted to #12926 + comment

@raunaqmorarka raunaqmorarka left a comment

lgtm % minor comments

@lukasz-stec lukasz-stec force-pushed the ls/020-poo-row-type branch from 161ac2a to 5f06583 Compare June 27, 2022 11:10
@lukasz-stec

comments addressed + commit message extended with the SPI change justification.

@lukasz-stec lukasz-stec force-pushed the ls/020-poo-row-type branch from 5f06583 to 1885335 Compare June 28, 2022 11:03
@lukasz-stec lukasz-stec requested a review from sopel39 June 28, 2022 11:11
@lukasz-stec lukasz-stec force-pushed the ls/020-poo-row-type branch from 1885335 to 42b0bab Compare June 29, 2022 08:03
@raunaqmorarka

@lukasz-stec please rebase to latest master

Introduced batch oriented RowPositionsAppender.
As RowBlock's field blocks are public via getChildren method
and field offset can be calculated, albeit slowly,
using Block#isNull, the AbstractRowBlock#getFieldBlockOffset
is made public to give access to pre-calculated offsets.
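
To illustrate the point about offsets, here is a simplified sketch that does not use the Trino classes. It assumes null rows take no entry in the field blocks, so the field-block index of a row equals the number of non-null rows before it. The class and method names below are hypothetical and only model the difference between deriving an offset from null flags on demand and reading it from a pre-calculated array:

```java
import java.util.Arrays;

// Hypothetical, simplified model of row-block offsets; not the Trino implementation.
final class FieldOffsetsSketch
{
    private FieldOffsetsSketch() {}

    // Slow path: derive the field-block offset of one position by scanning
    // the null flags of every earlier position (O(position) per lookup).
    static int fieldOffsetByScan(boolean[] rowIsNull, int position)
    {
        int offset = 0;
        for (int i = 0; i < position; i++) {
            if (!rowIsNull[i]) {
                offset++;
            }
        }
        return offset;
    }

    // Fast path: compute all offsets once. offsets[i] is the field-block index
    // for row i; offsets[rowIsNull.length] is the total number of non-null rows.
    static int[] precomputeFieldOffsets(boolean[] rowIsNull)
    {
        int[] offsets = new int[rowIsNull.length + 1];
        for (int i = 0; i < rowIsNull.length; i++) {
            offsets[i + 1] = offsets[i] + (rowIsNull[i] ? 0 : 1);
        }
        return offsets;
    }

    public static void main(String[] args)
    {
        boolean[] rowIsNull = {false, true, false, false, true, false};
        int[] offsets = precomputeFieldOffsets(rowIsNull);
        System.out.println(Arrays.toString(offsets)); // prints [0, 1, 1, 2, 3, 3, 4]

        // Both paths agree; the scan just repeats work for every lookup.
        System.out.println(fieldOffsetByScan(rowIsNull, 3) == offsets[3]); // prints true
    }
}
```

A batch-oriented appender that copies many positions at once benefits from the pre-calculated array, since each lookup is a single array read instead of a scan.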

Before
Benchmark                                   (channelCount)  (enableCompression)  (nullRate)  (partitionCount)  (positionCount)                 (type)  Mode  Cnt    Score    Error  Units
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192      ROW_BIGINT_BIGINT  avgt   20  687.344 ± 55.380  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192  ROW_RLE_BIGINT_BIGINT  avgt   20  583.781 ± 69.803  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192      ROW_BIGINT_BIGINT  avgt   20  426.873 ± 11.070  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192  ROW_RLE_BIGINT_BIGINT  avgt   20  486.288 ± 69.490  ms/op

After
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192      ROW_BIGINT_BIGINT  avgt   20  148.079 ± 14.656  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false           0                16             8192  ROW_RLE_BIGINT_BIGINT  avgt   20  102.773 ±  6.502  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192      ROW_BIGINT_BIGINT  avgt   20  196.848 ±  6.971  ms/op
BenchmarkPartitionedOutputOperator.addPage               1                false         0.2                16             8192  ROW_RLE_BIGINT_BIGINT  avgt   20  159.385 ± 10.308  ms/op
@raunaqmorarka raunaqmorarka force-pushed the ls/020-poo-row-type branch from 42b0bab to 1a0325b Compare July 4, 2022 11:00
@raunaqmorarka raunaqmorarka merged commit 87c6ed0 into trinodb:master Jul 4, 2022
@raunaqmorarka raunaqmorarka deleted the ls/020-poo-row-type branch July 4, 2022 14:05
@github-actions github-actions bot added this to the 389 milestone Jul 4, 2022
@raunaqmorarka raunaqmorarka changed the title Add RowPositionsAppender Optimize partitioned exchange for RowType channels Jul 4, 2022