
Make partial aggregation adaptive#11011

Merged
sopel39 merged 4 commits into trinodb:master from starburstdata:ls/adaptive-pa
Mar 11, 2022
Conversation

@lukasz-stec
Member

Description

This is an optimization for the HashAggregationOperator when aggregation is split into partial and final steps.
When the partial aggregation step does not reduce the number of rows much (e.g. 90% of rows are unique), it brings only a small benefit in network savings but costs a lot of CPU.
In that case it would be beneficial to skip partial aggregation altogether at planning time,
but since we don't always have reliable statistics for the number of unique values, especially in intermediate query stages, that is not easy to do.
Instead (although it's complementary to the planner changes), this adds a simple runtime adaptation for the partial aggregation step, which sends raw, ungrouped rows to the final step if the ratio of unique rows to input rows is high enough (0.8 by default).

With this change, there is still a significant overhead in the partial step, mainly in the PartitionedOutputOperator, which has to handle the superfluous accumulator state for the raw rows, and in the HashAggregationOperator, which needs to create and populate this state.
There are potential improvements in both HashAggregationOperator and PartitionedOutputOperator that would limit the overhead.
Another possible approach is to have a separate pipe (as it has a different layout) from the partial to the final step that carries only the input pages without the accumulator state. This would eliminate almost all of the overhead but would require larger changes in the core engine.
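The adaptation described above can be sketched as follows. This is an illustrative simplification, not the PR's actual code: the class name PartialAggregationController appears later in this review, but the fields, thresholds, and method shapes here are assumptions based on the description (0.8 unique-rows ratio, 100K minimum-row failsafe discussed below).

```java
// Hypothetical sketch of the runtime adaptation described above.
// Names and thresholds mirror the PR discussion, but this is not
// the actual Trino implementation.
public class PartialAggregationController
{
    private final long minRowCount;                // failsafe, e.g. 100_000
    private final double uniqueRowsRatioThreshold; // e.g. 0.8 by default

    private long totalRowCount;
    private long uniqueRowCount;
    private volatile boolean partialAggregationDisabled;

    public PartialAggregationController(long minRowCount, double uniqueRowsRatioThreshold)
    {
        this.minRowCount = minRowCount;
        this.uniqueRowsRatioThreshold = uniqueRowsRatioThreshold;
    }

    // Called by the operator after each partial-aggregation flush;
    // one output row per group, so output count approximates unique rows.
    public void onFlush(long inputRowCount, long outputRowCount)
    {
        totalRowCount += inputRowCount;
        uniqueRowCount += outputRowCount;
        if (totalRowCount >= minRowCount
                && (double) uniqueRowCount / totalRowCount >= uniqueRowsRatioThreshold) {
            // PA barely reduces rows; pass raw rows to the final step instead
            partialAggregationDisabled = true;
        }
    }

    public boolean isPartialAggregationDisabled()
    {
        return partialAggregationDisabled;
    }
}
```

The minimum-row check keeps a single tiny split from flipping the decision; the ratio check only fires once enough rows have been observed across flushes.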

tpch/tpcds benchmark results for orc sf1000

part

overall ~6% tpch and ~1.5% tpcds improvement. Most queries are not affected; some gain between 10% and 35%
adaptive-pa-part-nocode.pdf

unpart

overall 3.5% for tpch and 2.5% for tpcds
adaptive-pa-unpart-nocode.pdf

General information

Is this change a fix, improvement, new feature, refactoring, or other?

performance improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine (HashAggregationOperator)

How would you describe this change to a non-technical end user or system administrator?

Improves group by performance by skipping partial aggregation step

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

Improve performance of GROUP BY with a large number of groups.

@cla-bot cla-bot bot added the cla-signed label Feb 10, 2022
@lukasz-stec lukasz-stec requested a review from sopel39 February 10, 2022 21:09
@lukasz-stec lukasz-stec force-pushed the ls/adaptive-pa branch 2 times, most recently from 3f52d8c to dcbae65 Compare February 11, 2022 12:36
Member

@skrzypo987 skrzypo987 left a comment

This looks much easier than I anticipated.

Member

a->an

Member

Have you tested different thresholds?

Member Author

No, I haven't. minRows is more or less constant given the partial aggregation memory limit (PA will either compute the whole split or stop when the memory limit is reached, but then the number of rows will be > 100K). So this is mainly a sanity check for some small splits.
For uniqueRowsRatio, I suspect that in tpcds/tpch most of the cases are either well below 0.8 or close to 1, so the exact number does not matter. That said, I will run some benchmarks to confirm that.

Member Author

I additionally ran tpch/tpcds sf1000 orc part with the ratio set to 0.5, 0.9, 0.95. As I expected, mostly no change: most of the queries either trigger adaptation or don't, consistently across ratios. There are some differences in the overall result, but I think this is due to variability.
adaptive-pa-part-ratios-nocode.pdf

Member

Why is this needed here?

Member Author

the default value of 100K is way too large for tests (it would never fire), and I think it's good to exercise the adaptive path outside of unit tests, but checking all possible aggregation variations with and without adaptation seems too much.

Member

That is a bit fishy. For me it's ok, but I guess someone can have problems with that.

Member Author

I think this is an example of a general issue with trino properties in tests. Some properties like this one, are targeted at larger data scales, which means the functionality behind the property will never fire if not explicitly tested.
This is especially true for memory-related properties e.g. task.max-partial-aggregation-memory.
It seems to me, we should use scaled-down property values in "query" tests.

Member

Do not change the property here. Change property in unit tests. You can also add another child of AbstractTestAggregations with minimal PA limits.

Member Author

wouldn't it be better if all query tests were run with the lower limit and not just AbstractTestAggregations?

Member

wouldn't it be better if all query tests were run with the lower limit and not just AbstractTestAggregations?

no. We specifically split different queries into different abstract test classes (join, aggregation, etc) so that we don't have to cross test everything. It's a mess.

Member Author

ok, I added TestAdaptivePartialAggregation and removed the change here.

Member Author

@lukasz-stec lukasz-stec left a comment

review comments addressed

Member

@skrzypo987 skrzypo987 left a comment

LGTM % comments

Member Author

@lukasz-stec lukasz-stec left a comment

last set of comments addressed

Member

@sopel39 sopel39 left a comment

some initial comments

Member

This should be called AggregationConfig and should have all aggregation properties from FeaturesConfig. Also, it should be a different commit

Member Author

I added AggregationConfig and moved some properties there. PTAL if this is the full list

Member

Ideally, this should be adaptive in both ways (on and off), not only on => off. Do you have ideas for how this could be done?

Member Author

One idea is to choose one or more splits randomly to use partial aggregation once in a while (e.g. once enough rows have been processed since the last check). Depending on the results of PA, we could switch PA back on for every ongoing or new split.
This will complicate the code a lot and only help with the unusual data distribution case where you first process a lot of unique groups and then switch to splits with a small number of groups.
If we also add switching back on, we will be vulnerable to ping-pong cases where we do the opposite of what we should be doing because of the data distribution (i.e. we do PA for unique rows and send raw rows for heavily duplicated rows).

Member

This will complicate the code a lot and only help with the unusual data distribution case where first you process a lot of unique groups and the switch to splits with a small number of groups.

How do you know it's unusual?
You can turn this argument around and say we hit a bad file and turned off PA completely while the rest of the data is rather flat.

In other places we turn adaptiveness on/off: DictionaryAwarePageProjectionWork#createDictionaryBlockProjection. Generally, I think it's the preferred way because a prefix of the query might not be representative of the remainder of the query.

At the very least we need to have a TODO and a plan for that

Member Author

How do you know it's unusual?

No such case in tpch/tpcds that I know of, and it's hard for me to come up with a real-life data set that would have this.
Also, if the partial aggregation is an intermediate node after a partitioned exchange, it seems unlikely to have this kind of skew.
For the source stage, if we had split-level NDV stats, we could decide per split to disable adaptation if the expected number of groups is small, or even decide to skip partial aggregation if we know the number of distinct values is big

Member

No such case in tpch/tpcds that I know of + hard for me to come up with real-life data set that would have this.

Tpch/Tpcds is just a benchmark, but I could imagine some social data where some event generates a lot of unique rows, but only 10% of the time. If we hit that data at the beginning, we would turn PA off even when it's efficient

Member

please create an issue and add a TODO (in PartialAggregationController) for enabling partial aggregation adaptively

Member Author

done #11361

Member

missing javadoc

Member Author

added javadoc

Member

That seems low. Preferably you should make the decision after you have flushed the PA buffer at least once, because then you can check whether PA managed to reduce anything or not.

Member Author

this is what actually happens (the check is after a flush). This threshold is just a failsafe for some strange case of very small splits (i.e. a split with 10 rows, potentially due to partitioning or something else).
So for a normal case, the check is after a full split, so more than 1M rows, or when PA hits the memory limit, which should be around 100K to 400K rows for the default 16M limit.
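The back-of-the-envelope arithmetic behind the 100K-400K figure can be sketched as follows. The 40-160 bytes-per-row range is an assumption for illustration only, not a measured Trino number; it is just the range that makes the stated estimate come out.

```java
// Rough arithmetic for how many rows fit in the partial-aggregation buffer
// before a flush, given the default 16MB limit. The per-row state sizes
// (40-160 bytes) are illustrative assumptions, not measured values.
public class FlushRowEstimate
{
    static long rowsBeforeFlush(long memoryLimitBytes, long bytesPerRow)
    {
        return memoryLimitBytes / bytesPerRow;
    }

    public static void main(String[] args)
    {
        long limit = 16L * 1024 * 1024; // default 16MB partial-aggregation limit
        System.out.println(rowsBeforeFlush(limit, 160)); // heavier state: ~105K rows
        System.out.println(rowsBeforeFlush(limit, 40));  // lighter state: ~419K rows
    }
}
```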

Member

Let's just make it class PartialAggregationController

Member Author

do you mean class PartialAggregationController implements PartialAggregationControl or drop the interface?

Member

Drop the interface, drop the tracker. Just keep the class PartialAggregationController

Member

why?

Member Author

we would duplicate counts during the next flush otherwise

Member

onPartialAggregationFlush -> onFlush

Member

Just move this logic to onPartialAggregationFlush (make tracker dumb). onFlush is not frequent

Member

Let's remove the factory and NoOpPartialAggregationControl and just make Optional&lt;PartialAggregationController&gt; partialAggregationController in HashAggregationOperatorFactory

Member Author

What is the benefit of Optional&lt;PartialAggregationController&gt;?
With the current setup the code in HashAggregationOperator is simple, e.g. partialAggregationTracker.onFlush() vs partialAggregationController.ifPresent(controller -> controller.onFlush(x, y))

Member

What is the benefit from Optional?

If you use Optional then you don't have to have PartialAggregationControl or PartialAggregationControlFactory as interfaces (since there is just a single implementation). There isn't really going to be another implementation and it won't really be pluggable. Hence, an interface just to have a noop is overkill.
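The two call-site styles being debated can be shown in minimal form. This is an illustrative sketch: the stand-in interface and method bodies are assumptions, not the PR's code; only the names come from the discussion.

```java
import java.util.Optional;

// Minimal contrast of the two styles debated above: a null-object
// no-op implementation vs an Optional-wrapped controller.
public class OptionalVsNullObject
{
    public interface PartialAggregationControl
    {
        void onFlush(long inputRows, long outputRows);
    }

    // Null-object style: a no-op implementation keeps call sites unconditional,
    // but requires an extra interface just to host the no-op.
    public static final PartialAggregationControl NOOP = (in, out) -> {};

    public static void nullObjectStyle(PartialAggregationControl control)
    {
        control.onFlush(100, 90); // always safe to call, even when it's NOOP
    }

    // Optional style: no extra interface or no-op class needed, but every
    // call site must handle absence explicitly.
    public static void optionalStyle(Optional<PartialAggregationControl> control)
    {
        control.ifPresent(c -> c.onFlush(100, 90));
    }
}
```

The review settled on the Optional style, since a single concrete class plus Optional avoids an interface whose only second implementation would be a no-op.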

Member

you can just make it onFlush(int totalPositionCount, int uniquePositionCount). Make row tracking internal to HashAggregationOperator. Then you don't even need this tracker at all

Member Author

This would complicate HashAggregationOperator, especially because it handles more cases than partial aggregation. Having these counts here makes it explicit that this is only relevant for partial aggregation.

Member

@sopel39 sopel39 Mar 2, 2022

But you call these methods anyway. What we do here is simple row tracking. We don't need a factory, interfaces, and a tracker for that. It's overkill for what is a simple increment

Member Author

This adds 3 fields instead of 1 to a class that already has ~25 fields. I don't consider this good practice.
That said, I refactored this as requested.

@lukasz-stec lukasz-stec force-pushed the ls/adaptive-pa branch 2 times, most recently from c693891 to 52eb84d Compare February 28, 2022 15:57
Member Author

@lukasz-stec lukasz-stec left a comment

I extracted the AggregationConfig to a separate commit; some comments addressed, some replied to.

Member Author

I used the 'null object pattern' here. For non-partial aggregation, this will be NoOpPartialAggregationControl.
Optional or null here makes the code more convoluted, as it needs to be handled on every access.

Member Author

for the !step.isOutputPartial() case, NoOpPartialAggregationTracker.isPartialAggregationDisabled returns false

@lukasz-stec lukasz-stec requested a review from sopel39 February 28, 2022 15:58
Member

See #11066 (comment). We should extract an OptimizerConfig config file. I suggest you skip this extraction for now since it will be part of #11066

Member

ADAPTIVE_PARTIAL_AGGREGATION_UNIQUE_ROWS_RATIO -> ADAPTIVE_PARTIAL_AGGREGATION_UNIQUE_ROWS_RATIO_THRESHOLD

Member

Expand this description, threshold for what (on, off)?

@lukasz-stec lukasz-stec force-pushed the ls/adaptive-pa branch 2 times, most recently from 1c0b3b0 to 23b0367 Compare March 3, 2022 12:01
Member Author

@lukasz-stec lukasz-stec left a comment

AggregationConfig extraction dropped, properties moved to the FeaturesConfig for now.
NoOpPartialAggregationControl refactored to Optional

@lukasz-stec lukasz-stec requested a review from sopel39 March 3, 2022 12:04
Member

@sopel39 sopel39 left a comment

mostly lgtm % comments

Member

you need to rebase. There is a new OptimizerConfig class

Member Author

moved to OptimizerConfig

Member

add description

Member

add description

Member

You need a controller factory. Just create PartialAggregationController in LocalExecutionPlanner.Visitor#createHashAggregationOperatorFactory.

Member Author

removed the factory, added io.trino.operator.aggregation.partial.PartialAggregationController#duplicate

Member

constructOutputPage should return Page

Member

so this intermediate page has 10 positions? Could we use aggregation and input data that would actually squash input positions (e.g. all rows belong to the same group)?

Member Author

not sure I understand but the reason we need (almost) unique groups is to trigger adaptation.

Member

Ok, so can we use 9 unique rows out of 10 rows? Or maybe change the ratio from 0.8 to 0.5? The reason is that I would like to see that aggregation actually happens before it's disabled

Member Author

refactored to 9 out of 10 unique (0, 1, 2, 3, 4, 5, 6, 7, 8, 8)

Member

Also add a test (or change this one) that there is at least one flush before PA is disabled (e.g. low min row count, and flush happening after the second input page).

Member Author

Added a separate test case, as this requires using HashAggregationOperatorFactory with a different maxPartialMemory limit

@sopel39
Member

sopel39 commented Mar 4, 2022

please rebase due to conflicts

Member Author

@lukasz-stec lukasz-stec left a comment

rebased on master and moved config properties to OptimizerConfig + other comments addressed

Member Author

I don't want to have if (step.isOutputPartial()) because it influences other branches/ifs below, but I added step.isOutputPartial() to the condition to make it clear that this works only for partial aggregation.
I added partialAggregationController.isPresent() vs step.isOutputPartial() to the constructor (we don't need to check it for every addInput)

Member Author

wouldn't it be better if all query tests were run with the lower limit and not just AbstractTestAggregations?

Member Author

not sure I understand but the reason we need (almost) unique groups is to trigger adaptation.


Member Author

removed the factory, added io.trino.operator.aggregation.partial.PartialAggregationController#duplicate

@lukasz-stec lukasz-stec requested a review from sopel39 March 7, 2022 11:38
Member
@sopel39 sopel39 left a comment

mostly lgtm

Member

Use descriptions from session properties (make these consistent)

Member

nit: inputHash -> channelIndex

Member

> wouldn't it be better if all query tests were run with the lower limit and not just AbstractTestAggregations?

no. We specifically split different queries into different abstract test classes (join, aggregation, etc.) so that we don't have to cross-test everything. It's a mess.


Member

you can use one of io.trino.operator.OperatorAssertion#assertOperatorEquals instead of creating a new method (see TestHashJoinOperator)

Member Author

OperatorAssertion#assertOperatorEquals closes the factory, and in this case I need the factory to be reused, but I extended assertOperatorEquals with a boolean closeOperatorFactory so it works now.

Member

please also add assertion on PartialAggregationController#isPartialAggregationDisabled

Member Author

done

Member

please also add assertion on PartialAggregationController#isPartialAggregationDisabled

Member Author

added

Member Author
@lukasz-stec lukasz-stec left a comment

comments addressed + added a commit to skip types-check in TestShowQueries (tests failed because new properties with longer names were added)

Member Author

done #11361

Member Author

renamed

Member Author

done

Member Author

ok, I added TestAdaptivePartialAggregation and removed the change here.

Member Author

added


Member Author

done


@lukasz-stec lukasz-stec requested a review from sopel39 March 8, 2022 13:55
Member
@sopel39 sopel39 left a comment

lgtm % comments

Member

mark result as @Nullable

Member

nit: restore newline

Member Author

restored after the checkArgument

Member

future improvement. It would be great to actually collect metrics

  • how many pages were processed via the skip-aggregation builder
  • how many flushes there were for PA
  • what the average row count per flush was
  • etc.

This can be returned via io.trino.operator.OperatorContext#setLatestMetrics

@lukasz-stec Maybe create an issue for that?

Member Author

Good idea. This would allow easier monitoring of the adaptation.
#11376 created.
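The suggested metrics could be accumulated with a few simple counters. This is a hypothetical sketch (names are illustrative, not the actual Trino implementation; the real version would report these through the operator metrics mechanism mentioned above):

```java
// Hypothetical metrics holder for adaptive partial aggregation:
// counts pages handled by the skip-aggregation path, flushes,
// and the average row count per flush.
class PartialAggregationMetricsSketch
{
    private long skipAggregationPages;
    private long flushCount;
    private long flushedRowCount;

    // called for every page routed through the skip-aggregation builder
    void recordSkipAggregationPage()
    {
        skipAggregationPages++;
    }

    // called for every partial-aggregation flush with the flushed row count
    void recordFlush(long rowCount)
    {
        flushCount++;
        flushedRowCount += rowCount;
    }

    double averageRowsPerFlush()
    {
        return flushCount == 0 ? 0.0 : (double) flushedRowCount / flushCount;
    }

    long getSkipAggregationPages()
    {
        return skipAggregationPages;
    }
}
```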

Member

technically, your output page might become quite large since we add more columns. However, this probably isn't a big issue.
In io.trino.operator.project.PageProcessor we avoid this by keeping pages under 4MB

Member Author

Good point. I guess this can currently increase the page size easily 10x by adding the accumulator state (given enough aggregations).
Without much overhead, we could use Page.getRegion to partition the output page into smaller pages, but that would retain the full page until all pages are processed. WDYT?
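The Page.getRegion idea could compute region boundaries roughly as below. This is a simplified sketch working on position offsets only, not the actual Trino Page API; each {offset, length} pair would then be fed to Page.getRegion, and (as noted above) every region keeps the original page retained until all regions are processed:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of splitting a large output page into fixed-size regions.
// Returns {offset, length} pairs covering all positions; the last
// region may be shorter than maxPositionsPerRegion.
class PageRegionSplitter
{
    static List<int[]> regionOffsets(int positionCount, int maxPositionsPerRegion)
    {
        List<int[]> regions = new ArrayList<>();
        for (int offset = 0; offset < positionCount; offset += maxPositionsPerRegion) {
            int length = Math.min(maxPositionsPerRegion, positionCount - offset);
            regions.add(new int[] {offset, length});
        }
        return regions;
    }
}
```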

Member

nit: tha -> the

Member

unused, remove

Member

inline with previous line

set it to 0?

you should also reduce PA buffer to 0 so that flush happens after every page

Member Author

yeah, min-rows=0 makes sense.
I would leave task.max-partial-aggregation-memory at the default, as flushing per page is not a realistic scenario, and there should be enough splits in the test data for flush-per-split to trigger the adaptation.

Member

> and there should be enough splits in the test data for flush-per-split to trigger the adaptation.

I'm not so sure, since we use the tiny schema. I would rather reduce it to make sure it triggers in most cases.

Member Author

ok, I added task.max-partial-aggregation-memory=0B

Member

Does this produce pages with a schema that matches the one defined by the query plan?

Member Author

Yes. This produces a "projected" input page by selecting the group-by channels + optionally the hash channel + the aggregation accumulator state channels, which matches the partial aggregation step's output schema.

Member

I'm not sure I understand how the input data is passed to the final aggregation, then. Specifically, the comment above says: "It passes the input pages, augmented with initial accumulator state to the output". If that's the case, then the columns from the input to the aggregation would not match the expected input of the final aggregation (according to the query plan)

Member Author

The comment may not be precise enough. What happens is we take the group-by (or hash) channels from the input and append the accumulator state. This is done in the constructOutputPage method;
BlockBuilder[] outputBuilders contain the accumulator state.

private Page constructOutputPage(Page page, BlockBuilder[] outputBuilders)
{
    // pass the group-by (and optional hash) channels through from the input page
    Block[] outputBlocks = new Block[hashChannels.length + outputBuilders.length];
    for (int i = 0; i < hashChannels.length; i++) {
        outputBlocks[i] = page.getBlock(hashChannels[i]);
    }
    // append the accumulator state blocks after the pass-through channels
    for (int i = 0; i < outputBuilders.length; i++) {
        outputBlocks[hashChannels.length + i] = outputBuilders[i].build();
    }
    return new Page(page.getPositionCount(), outputBlocks);
}

Member
@sopel39 sopel39 Mar 10, 2022

@martint
With this PR partial aggregation will still produce intermediate rows when it's turned off. We haven't implemented a dual representation (raw rows vs intermediate rows). The approach in this PR is simpler and will be improved.
For example, @radek-starburst is working on splitting the single decimal aggregation state into smaller, primitive states. When this is done, we would be able to essentially pass through input decimals when PA is adaptively disabled, without any extra CPU cost. For example, for the sum decimal aggregation, overflow can be represented as an RLE block.
@lukasz-stec is working on sending RLE blocks via the partitioned exchange, so that intermediate aggregation rows can be serialized and transmitted over the network more efficiently.

In case when partial aggregation
operator does not reduce the number
of rows too much, disable aggregation
and send raw rows to the final step
@sopel39 sopel39 merged commit c950e30 into trinodb:master Mar 11, 2022
@sopel39 sopel39 mentioned this pull request Mar 11, 2022
@sopel39
Member

sopel39 commented Mar 11, 2022

thanks!
