Make partial aggregation adaptive #11011
Conversation
skrzypo987 left a comment:
This looks much easier than I anticipated.
Have you tested different thresholds?
No, I haven't. minRows is more or less constant per partial aggregation memory limit (PA will either compute the whole split or stop when the memory limit is reached but then the number of rows will be > 100K). So this is mainly a sanity check for some small splits.
For uniqueRowsRatio, I suspect that in tpcds/tpch most of the cases are either well below 0.8 or close to 1 so the exact number does not matter. That said I will run some benchmarks to confirm that.
I additionally ran tpch/tpcds sf1000 orc part with the ratio set to 0.5, 0.9, and 0.95. As I expected, mostly no change: most of the queries either trigger adaptation or don't, consistently across ratios. There are some differences in the overall results, but I think this is due to variability.
adaptive-pa-part-ratios-nocode.pdf
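The decision rule discussed in this thread can be sketched roughly as follows. The class and method names here are hypothetical stand-ins, not the actual Trino implementation; the thresholds are the defaults mentioned above (100K minimum rows, 0.8 unique-rows ratio):

```java
// Hypothetical sketch of the adaptation decision discussed in this thread.
// Names and thresholds are stand-ins; the real Trino code may differ.
public class AdaptationSketch
{
    static final long MIN_ROWS = 100_000;
    static final double UNIQUE_ROWS_RATIO = 0.8;

    // Called after a partial-aggregation flush with the observed row counts.
    static boolean shouldDisablePartialAggregation(long inputRows, long uniqueRows)
    {
        // minRows is a sanity check for small splits; the ratio check is the
        // actual signal that partial aggregation is not reducing rows.
        return inputRows >= MIN_ROWS
                && (double) uniqueRows / inputRows > UNIQUE_ROWS_RATIO;
    }
}
```

With these defaults, a split whose rows are 90% unique trips the check, while a small split never does regardless of its ratio.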
The default value of 100K is way too large for tests (it will never fire). I think it's good to check the adaptive part outside of unit tests, but checking all possible aggregation variations with and without adaptation seems too much.
That is a bit fishy. For me it's ok, but I guess someone can have problems with that.
I think this is an example of a general issue with Trino properties in tests. Some properties, like this one, are targeted at larger data scales, which means the functionality behind the property will never fire if not explicitly tested.
This is especially true for memory-related properties, e.g. task.max-partial-aggregation-memory.
It seems to me we should use scaled-down property values in "query" tests.
Do not change the property here. Change property in unit tests. You can also add another child of AbstractTestAggregations with minimal PA limits.
wouldn't it be better if all query tests were run with the lower limit and not just AbstractTestAggregations?
wouldn't it be better if all query tests were run with the lower limit and not just AbstractTestAggregations?
no. We specifically split different queries into different abstract test classes (join, aggregation, etc) so that we don't have to cross test everything. It's a mess.
ok, I added TestAdaptivePartialAggregation and removed the change here.
lukasz-stec left a comment:
Review comments addressed.
lukasz-stec left a comment:
Last set of comments addressed.
This should be called AggregationConfig and should have all aggregation properties from FeaturesConfig. Also, it should be different commit
I added AggregationConfig and moved some properties there. PTAL if this is full list
Ideally, this should be adaptive in both ways (on and off), not only on => off. Do you have ideas on how this can be done?
One idea is to choose one or more splits randomly to use partial aggregation once in a while (e.g. once enough rows have been processed since the last check). Depending on the results of PA, we could switch PA back on for every ongoing or new split.
This will complicate the code a lot and only help with the unusual data distribution case where first you process a lot of unique groups and then switch to splits with a small number of groups.
If we add an off switch, we will also be vulnerable to ping-pong cases where we do the opposite of what we should be doing because of the data distribution (i.e. we do PA for unique rows and send raw rows for very duplicated rows).
This will complicate the code a lot and only help with the unusual data distribution case where first you process a lot of unique groups and then switch to splits with a small number of groups.

How do you know it's unusual?
You can turn this argument around and say we hit a bad file and turned off PA completely while the rest of the data is rather flat.
In other places we do turn adaptiveness on/off: DictionaryAwarePageProjectionWork#createDictionaryBlockProjection. Generally, I think it's the preferred way, because a prefix of the query might not be representative of the remainder of the query.
At the very least we need to have a TODO and a plan for that.
How do you know it's unusual?

No such case in tpch/tpcds that I know of, plus it's hard for me to come up with a real-life data set that would have this.
Also, if the partial aggregation is an intermediate node after a partitioned exchange, it seems unlikely to have this kind of skew.
For the source stage, if we had split-level NDV stats, we could decide per split to disable adaptation if the expected number of groups is small, or even decide to skip partial aggregation if we know the number of distinct values is big.
No such case in tpch/tpcds that I know of, plus it's hard for me to come up with a real-life data set that would have this.

Tpch/Tpcds is just a benchmark, but I could imagine some social data where some event generates a lot of unique rows, but only 10% of the time. If we hit that data at the beginning, we would turn PA off even when it's efficient.
please create an issue and add a TODO (in PartialAggregationController) for enabling partial aggregation adaptively
That seems low. Preferably you should make decision after you flushed PA buffer at least once, because then you can check if PA managed to reduce anything or not.
This is what is actually happening (the check is after a flush). This threshold is just a failsafe for some strange case of very small splits (i.e. a split with 10 rows, potentially due to partitioning or something else).
So for a normal case, the check is after a full split, so more than 1M rows, or when the PA hits the memory limit, and then this should be around 100K to 400K rows for the default 16M limit.
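As a rough back-of-the-envelope check, the 100K–400K range quoted above is consistent with the 16MB memory limit if one assumes (hypothetically, for illustration only) somewhere around 40–160 bytes of hash table and accumulator state per group:

```java
// Back-of-the-envelope arithmetic for the flush-size range quoted above.
// The per-group byte figures are assumptions for illustration, not
// measurements from the Trino code.
public class FlushSizeEstimate
{
    static long groupsAtLimit(long memoryLimitBytes, long bytesPerGroup)
    {
        return memoryLimitBytes / bytesPerGroup;
    }

    public static void main(String[] args)
    {
        long limit = 16L * 1024 * 1024; // default task.max-partial-aggregation-memory
        System.out.println(groupsAtLimit(limit, 160)); // heavier per-group state -> ~100K groups
        System.out.println(groupsAtLimit(limit, 40));  // lighter per-group state -> ~400K groups
    }
}
```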
Let's just make it class PartialAggregationController
do you mean class PartialAggregationController implements PartialAggregationControl or drop the interface?
Drop the interface, drop the tracker. Just keep the class PartialAggregationController
we would duplicate counts during the next flush otherwise
onPartialAggregationFlush -> onFlush
Just move this logic to onPartialAggregationFlush (make tracker dumb). onFlush is not frequent
Let's remove factory and NoOpPartialAggregationControl and just make Optional<PartialAggregationController> partialAggregationControler in HashAggregationOperatorFactory
What is the benefit from Optional<PartialAggregationController>?
With the current setup the code is simple in the HashAggregationOperator e.g. partialAggregationTracker.onFlush() vs partialAggregationController.ifPresent(controller -> controller.onFlush(x, y))
What is the benefit from Optional?
If you use Optional then you don't have to have PartialAggregationControl or PartialAggregationControlFactory as interfaces (since there is just single implementation). There isn't really going to be another implementation and it won't be pluggable really. Hence, an interface just to have noop is an overkill.
you can just make it onFlush(int totalPositionCount, int uniquePositionCount). Make row tracking internal to HashAggregationOperator. Then you don't even need this tracker at all
This would complicate HashAggregationOperator, especially because it handles more cases than partial aggregation. Having these counts here makes it explicit that this is only relevant for partial aggregation.
But you call these methods anyway. What we do here is simple row tracking. We don't need factory, interfaces and tracker for that. It's an overkill for what is a simple increment
This adds 3 instead of 1 field to the class that already has ~25 fields. I don't consider this a good practice.
That said, I refactored this as requested.
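The Optional-based shape being discussed here can be sketched as follows. These are illustrative stand-in classes, not the actual Trino code; the thresholds are the defaults from this thread:

```java
import java.util.Optional;

// Illustrative sketch of Optional<PartialAggregationController> at the call
// site, as discussed above. Class and method names mirror the thread, but the
// bodies are simplified stand-ins, not the actual Trino implementation.
class PartialAggregationController
{
    private static final long MIN_ROWS = 100_000;
    private static final double UNIQUE_ROWS_RATIO = 0.8;

    private volatile boolean partialAggregationDisabled;

    // Called by the operator after each partial aggregation flush.
    void onFlush(long inputRows, long uniqueRows)
    {
        if (inputRows >= MIN_ROWS && (double) uniqueRows / inputRows > UNIQUE_ROWS_RATIO) {
            partialAggregationDisabled = true;
        }
    }

    boolean isPartialAggregationDisabled()
    {
        return partialAggregationDisabled;
    }
}

class AggregationOperatorSketch
{
    // Empty for intermediate/final steps and present only for the partial
    // step, so no no-op implementation or extra interface is needed.
    private final Optional<PartialAggregationController> controller;

    AggregationOperatorSketch(Optional<PartialAggregationController> controller)
    {
        this.controller = controller;
    }

    void flush(long inputRows, long uniqueRows)
    {
        controller.ifPresent(c -> c.onFlush(inputRows, uniqueRows));
    }
}
```

The trade-off debated above is visible here: the call site becomes `controller.ifPresent(...)` instead of a plain method call, but the no-op interface and factory disappear.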
lukasz-stec left a comment:
I extracted the AggregationConfig to a separate commit + some comments addressed, some replied to.
I used the 'null object pattern' here. For non-partial aggregation, this is going to be NoOpPartialAggregationControl.
Optional or null here makes code more convoluted as it needs to be handled on every access.
For the !step.isOutputPartial() case, NoOpPartialAggregationTracker.isPartialAggregationDisabled returns false.
See #11066 (comment). We should extract an OptimizerConfig config class. I suggest you skip this extraction for now since it will be part of #11066.
ADAPTIVE_PARTIAL_AGGREGATION_UNIQUE_ROWS_RATIO -> ADAPTIVE_PARTIAL_AGGREGATION_UNIQUE_ROWS_RATIO_THRESHOLD
Expand this description, threshold for what (on, off)?
lukasz-stec left a comment:
AggregationConfig extraction dropped, properties moved to the FeaturesConfig for now.
NoOpPartialAggregationControl refactored to Optional.
You need to rebase. There is a new OptimizerConfig class.
moved to OptimizerConfig
You don't need a controller factory. Just create PartialAggregationController in LocalExecutionPlanner.Visitor#createHashAggregationOperatorFactory.
removed the factory, added io.trino.operator.aggregation.partial.PartialAggregationController#duplicate
constructOutputPage should return Page
So this intermediate page has 10 positions? Could we use an aggregation and input data that would actually squash input positions (e.g. all rows belong to the same group)?
not sure I understand but the reason we need (almost) unique groups is to trigger adaptation.
Ok, so can we use 9 unique rows out of 10 rows? Or maybe change the ratio from 0.8 to 0.5? The reason is that I would like to see that aggregation actually happens before it's disabled.
refactored to 9 out of 10 unique (0, 1, 2, 3, 4, 5, 6, 7, 8, 8)
Also add a test (or change this one) checking that there needs to be at least one flush before PA is disabled (e.g. low min row count, and flush happening after the second input page).
Added separate test case as this requires using HashAggregationOperatorFactory with a different maxPartialMemory limit
please rebase due to conflicts
lukasz-stec left a comment:
Rebased on master and moved config properties to OptimizerConfig + other comments addressed.
I don't want to have if (step.isOutputPartial()) because it influences other branches/ifs below, but I added step.isOutputPartial() to the condition to make it clear that this works only for partial aggregation.
I added partialAggregationController.isPresent() vs step.isOutputPartial() to the constructor (we don't need to check it for every addInput).
Use descriptions from session properties (make these consistent)
you can use one of io.trino.operator.OperatorAssertion#assertOperatorEquals instead of creating a new method (see TestHashJoinOperator)
OperatorAssertion#assertOperatorEquals closes the factory and in this case I need the factory to be reused but I extended assertOperatorEquals with boolean closeOperatorFactory so it works now.
please also add assertion on PartialAggregationController#isPartialAggregationDisabled
lukasz-stec left a comment:
Comments addressed + added a commit to skip types-check in TestShowQueries (tests failed because new properties with longer names were added).
restored after the checkArgument
Future improvement: it would be great to actually collect metrics
- how many pages were processed via the skip aggregation builder
- how many flushes there were for PA
- what the average row count per flush was
- etc.
This can be returned via io.trino.operator.OperatorContext#setLatestMetrics
@lukasz-stec Maybe create an issue for that?
Good idea. This would allow easier monitoring of the adaptation.
#11376 created.
Technically, your output page might become quite large since we add more columns. However, this probably isn't a big issue.
In io.trino.operator.project.PageProcessor we avoid this by keeping the page under 4MB.
Good point. I guess this currently can increase page size easily 10x by adding the accumulator state (given enough aggregations).
Without much overhead, we could use Page.getRegion to partition the output page into smaller pages but that would retain the full page until all pages are processed. WDYT?
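The Page.getRegion idea mentioned above could look roughly like this. SketchPage is a simplified stand-in for Trino's Page, modeling only the region-view semantics (a region is a view over the same underlying data, which is why the full page would be retained until all regions are processed):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Trino's Page, used only to illustrate splitting a
// large output page into smaller region views; not the actual Trino class.
class SketchPage
{
    private final int positionCount;

    SketchPage(int positionCount)
    {
        this.positionCount = positionCount;
    }

    int getPositionCount()
    {
        return positionCount;
    }

    // Like Page.getRegion: a view over positions [positionOffset, positionOffset + length).
    SketchPage getRegion(int positionOffset, int length)
    {
        if (positionOffset < 0 || length < 0 || positionOffset + length > positionCount) {
            throw new IndexOutOfBoundsException();
        }
        return new SketchPage(length);
    }

    // Split a page into regions of at most maxPositionsPerPage positions each.
    static List<SketchPage> splitIntoRegions(SketchPage page, int maxPositionsPerPage)
    {
        List<SketchPage> regions = new ArrayList<>();
        for (int offset = 0; offset < page.getPositionCount(); offset += maxPositionsPerPage) {
            int length = Math.min(maxPositionsPerPage, page.getPositionCount() - offset);
            regions.add(page.getRegion(offset, length));
        }
        return regions;
    }
}
```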
inline with previous line
set it to 0?
you should also reduce PA buffer to 0 so that flush happens after every page
Yeah, min-rows=0 makes sense.
I would leave task.max-partial-aggregation-memory at the default, as flushing per page is not a realistic scenario, and there should be enough splits in the test data for flush-per-split to use adaptation.
and there should be enough splits in the test data for flush per split to use adaptation.
I'm not so sure, since we use a tiny schema. I would rather reduce it to make sure it triggers in most cases.
ok, I added task.max-partial-aggregation-memory=0B
Does this produce pages with a schema that matches the one defined by the query plan?
Yes. This produces a "projected" input page by selecting group by channels + optionally hash channel + aggregation accumulators state channels, which matches aggregation partial step output schema.
I'm not sure I understand how the input data is passed to the final aggregation, then. Specifically, the comment above that says: "It passes the input pages, augmented with initial accumulator state to the output". If that's the case, then the columns from the input to the aggregation would not match the expected input of the final aggregation (according to the query plan)
The comment may not be precise enough. What is happening is we take the group by (or hash) channels from the input and add the accumulator state. This is done in the constructOutputPage method.
BlockBuilder[] outputBuilders contains the accumulator state.
```java
private Page constructOutputPage(Page page, BlockBuilder[] outputBuilders)
{
    Block[] outputBlocks = new Block[hashChannels.length + outputBuilders.length];
    for (int i = 0; i < hashChannels.length; i++) {
        outputBlocks[i] = page.getBlock(hashChannels[i]);
    }
    for (int i = 0; i < outputBuilders.length; i++) {
        outputBlocks[hashChannels.length + i] = outputBuilders[i].build();
    }
    return new Page(page.getPositionCount(), outputBlocks);
}
```
@martint
With this PR, partial aggregation will still produce intermediate rows when it's turned off. We haven't implemented a dual representation (raw rows vs intermediate rows). The approach in this PR is simpler and will be improved.
For example, @radek-starburst is working on splitting the single decimal aggregation state into smaller, primitive states. When this is done, we would be able to essentially pass through input decimals when PA is adaptively disabled, without any extra CPU cost. For example, for the sum decimal aggregation, overflow can be represented as an RLE block.
@lukasz-stec is working on sending RLE blocks via partitioned exchange, so that intermediate aggregation rows can be serialized and transmitted over the network more efficiently.
In case the partial aggregation operator does not reduce the number of rows much, disable aggregation and send raw rows to the final step
thanks!
Description
This is an optimization for the `HashAggregationOperator` that is split into `partial` and `final` steps. In case the partial aggregation step does not reduce the number of rows much (e.g. 90% of rows are unique), this step brings a small benefit in terms of network savings but costs a lot of CPU.
In this case, it would be beneficial to skip partial aggregation altogether at planning time, but given that we don't always have reliable statistics for the number of unique values, especially in intermediate query stages, it is not easy to do.
Instead (although it's complementary to the planner changes), this adds a simple runtime adaptation for the partial aggregation step that sends raw, ungrouped rows to the final step if the ratio of unique to input rows is big enough (0.8 by default).
With this change, there is still a significant overhead on the partial step, mainly in the `PartitionedOutputOperator`, which has to handle the superfluous accumulator state for the raw rows, and in the `HashAggregationOperator`, which needs to create and populate this state. There are potential improvements in both `HashAggregationOperator` and `PartitionedOutputOperator` that would limit the overhead.
Another possible approach is to have a separate pipe (as it has a different layout) from the `partial` to the `final` step carrying only the input pages, without the accumulator state. This would eliminate almost all of the overhead but would require larger changes in the core engine.
tpch/tpcds benchmark results for orc sf1000
part
Overall ~6% TPCH and ~1.5% TPCDS improvement. Most queries are not affected; some gain between 10% and 35%.
adaptive-pa-part-nocode.pdf
unpart
Overall ~3.5% for TPCH and ~2.5% for TPCDS.
adaptive-pa-unpart-nocode.pdf
General information
performance improvement
core query engine (`HashAggregationOperator`)
Improves `group by` performance by skipping the partial aggregation step
Related issues, pull requests, and links
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:
Improve performance of `GROUP BY` with a large number of groups.