Add streaming aggregation by kewang1024 · Pull Request #17281 · prestodb/presto

kewang1024 · 2022-02-10T20:29:09Z

== RELEASE NOTES ==

General Changes
* Add ability to stream data for partial aggregation instead of building hash tables. It improves the performance of aggregation when the data is already ordered by the group-by keys.
This can be enabled with the ``streaming_for_partial_aggregation_enabled`` session property or the ``streaming-for-partial-aggregation-enabled`` configuration property.

Hive Changes
* Add ability to do streaming aggregation for Hive table scans to improve the performance of aggregation queries when the group-by keys are the same as the order-by keys. Cases where the group-by keys are a subset of the order-by keys are not supported yet.
This can be enabled with the ``streaming_aggregation_enabled`` session property or the ``hive.streaming-aggregation-enabled`` configuration property.
* Add ability to disable file splitting in the Hive connector. File splitting can be disabled with the ``file_splittable`` session property or the ``hive.file-splittable`` configuration property.
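
To illustrate the idea behind the first release note, here is a minimal sketch of a streaming sum over rows that are already ordered by the group-by key. It is not the operator added by this PR; the class and method names are hypothetical, and rows are modelled as simple key/value entries. Because equal keys are adjacent, a single running accumulator is enough, and a group can be emitted as soon as the key changes, so no hash table is built.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from the Presto codebase.
final class StreamingSumSketch
{
    // Sums values per key, assuming rows with equal keys are adjacent
    // (for example, because the input is sorted by the group-by key).
    static List<Map.Entry<String, Long>> streamingSum(List<Map.Entry<String, Long>> sortedRows)
    {
        List<Map.Entry<String, Long>> output = new ArrayList<>();
        String currentKey = null;
        long runningSum = 0;
        for (Map.Entry<String, Long> row : sortedRows) {
            if (currentKey != null && !currentKey.equals(row.getKey())) {
                // Key changed: the previous group is complete, so emit it.
                output.add(Map.entry(currentKey, runningSum));
                runningSum = 0;
            }
            currentKey = row.getKey();
            runningSum += row.getValue();
        }
        if (currentKey != null) {
            // Emit the last group.
            output.add(Map.entry(currentKey, runningSum));
        }
        return output;
    }

    public static void main(String[] args)
    {
        List<Map.Entry<String, Long>> sorted =
                List.of(Map.entry("a", 1L), Map.entry("a", 2L), Map.entry("b", 5L));
        System.out.println(streamingSum(sorted));
    }
}
```

The memory footprint of this loop is one key and one accumulator, regardless of how many distinct keys the input contains.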

Note:
When we compare the performance of regular queries with queries that have streaming aggregation turned on, the two differ in more than one way, which makes it hard to tell which factor is the leading one:

* splittable vs non-splittable
* non-streaming vs streaming

We add the ability to disable file splitting in the Hive connector to make debugging easier.

presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java

presto-hive/src/test/java/com/facebook/presto/hive/TestStreamingAggregationPlan.java

jainxrohit · 2022-02-11T02:05:49Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java

can we move this to other aggregation fields?

Can you elaborate what you mean?

jainxrohit · 2022-02-11T02:06:24Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java

Move this with the other aggregation fields.

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

...va/com/facebook/presto/sql/planner/iterative/rule/PushPartialAggregationThroughExchange.java

yuanzhanhku · 2022-02-15T05:17:21Z

...o-main/src/test/java/com/facebook/presto/sql/planner/TestStreamingForPartialAggregation.java

Consider adding another test with the default session property to make sure streaming is not enabled and this session property will never be enabled by default.

yeah, it's added in testBucketedAndSortedBySameKey

presto-hive/src/test/java/com/facebook/presto/hive/TestStreamingAggregationPlan.java

yuanzhanhku · 2022-02-15T05:38:37Z

presto-hive/src/test/java/com/facebook/presto/hive/TestStreamingAggregationPlan.java

Add another test with group by custkey (a prefix of the bucket/sorted_by columns). I think streaming aggregation should be enabled in that case.

Yes, but that requires further changes in the codebase that we're not pursuing right now. I will add a TODO here.

yuanzhanhku

Thanks Ke for productionizing this feature and adding the thorough tests.

presto-hive/src/test/java/com/facebook/presto/hive/TestStreamingAggregationPlan.java

highker

"Add ability to disable splitting file in hive connector" LGTM

highker · 2022-02-19T20:34:12Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java

cc @arunthirupathi in case this could be useful for delta integration.

presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java

presto-hive/src/main/java/com/facebook/presto/hive/HiveSessionProperties.java

highker

"Add ability to do streaming aggregation for hive table scans" LGTM

highker · 2022-02-19T20:42:50Z

presto-hive/src/main/java/com/facebook/presto/hive/StoragePartitionLoader.java

hmmm if we are already guarding splittable here through isStreamingAggregationEnabled, do we still need to introduce isFileSplittable config at all? cc: @yuanzhanhku @arunthirupathi

isFileSplittable is not needed for streaming aggregation support. Ke just added it for benchmarking purposes. It allows us to run a query with file splitting disabled and streaming aggregation enabled or disabled, so that we can measure the improvement from streaming aggregation without the impact of file splitting.

presto-hive/src/test/java/com/facebook/presto/hive/TestStreamingAggregationPlan.java

highker

"Add ability to disable splitting file in hive connector" LGTM

...va/com/facebook/presto/sql/planner/iterative/rule/PushPartialAggregationThroughExchange.java

highker · 2022-02-19T20:48:36Z

...o-main/src/test/java/com/facebook/presto/sql/planner/TestStreamingForPartialAggregation.java

what is "Multidates"...

Because we can't enable streaming aggregation when querying multiple partitions without grouping by partition keys. I renamed the test to "testQueryingMultiplePartitions" to avoid confusion.

highker · 2022-02-19T20:55:59Z

Overall LGTM. Some nits on release note:

Add ability to do streaming for partial aggregation...

Would be good to specify what user-facing queries will be impacted and how users could leverage this config to achieve what. "streaming for partial aggregation" is too deep in terms of terminology for users to understand.

Add ability to do streaming aggregation for hive table scans

Same as above, explain a bit more what streaming aggregation is and how that could help improve query performance (still from user perspective). An example could be "Introduce streaming aggregation to improve query performance on queries with aggregation under the condition that group keys are the same as blah blah blah"
nit: grouping-keys: group-by keys
nit: order-keys: order-by keys

Add ability to disable splitting file...

Well, let's see if the first commit is actually needed or not to decide if we need this release note.

arunthirupathi · 2022-02-20T16:55:47Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java

Does the prefix match need to be ordered? I assume it is an equality condition, and hence order does not matter.

Right, the order does not matter here. StreamPartitionColumns is similar to the bucket column concept, but it is for splits instead of files.

As of now, we only support streaming aggregation when the group-by keys are the same as the order-by keys; cases where the group-by keys are a subset of the order-by keys are not yet supported.

Co-Authored-By: Zhan Yuan <yuanzhanhku@gmail.com>

We can always enable streaming aggregation for partial aggregations without affecting correctness. But if the data isn't clustered (for example: ordered) by the group-by keys, it may cause regressions on latency and resource usage. This session property is just a solution to force enabling streaming aggregation when we know the execution would benefit from partial streaming aggregation. We can work on determining it based on the input table properties later.
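
The trade-off described above can be sketched as follows. This is a hypothetical illustration with made-up class and method names, not the planner rule or operator from this PR. The partial step emits a (key, count) row every time the key changes, and the final step merges rows per key. The final result is the same whether or not the input is clustered, but unclustered input makes the partial step emit many more rows, which is where the latency and resource regressions would come from.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names are not taken from the Presto codebase.
final class PartialStreamingSketch
{
    // Partial step: emit (key, count) whenever the key changes, keeping no hash table.
    // Assumes keys are non-null.
    static List<Map.Entry<String, Long>> partialCounts(List<String> keys)
    {
        List<Map.Entry<String, Long>> partials = new ArrayList<>();
        String current = null;
        long count = 0;
        for (String key : keys) {
            if (current != null && !key.equals(current)) {
                partials.add(Map.entry(current, count));
                count = 0;
            }
            current = key;
            count++;
        }
        if (current != null) {
            partials.add(Map.entry(current, count));
        }
        return partials;
    }

    // Final step: merge partial counts per key, so a key that appears in
    // several partial rows is still combined into one correct total.
    static Map<String, Long> finalCounts(List<Map.Entry<String, Long>> partials)
    {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Long> partial : partials) {
            totals.merge(partial.getKey(), partial.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args)
    {
        List<String> clustered = List.of("a", "a", "b", "b");
        List<String> interleaved = List.of("a", "b", "a", "b");
        // Both inputs give the same totals; only the number of partial rows differs.
        System.out.println(partialCounts(clustered).size() + " partial rows, totals " + finalCounts(partialCounts(clustered)));
        System.out.println(partialCounts(interleaved).size() + " partial rows, totals " + finalCounts(partialCounts(interleaved)));
    }
}
```

In the interleaved case the partial step reduces nothing, so all of the aggregation work shifts to the final step.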

kewang1024 force-pushed the add_streaming_aggregation branch from c93a97e to 2a38a21 Compare February 10, 2022 20:30

kewang1024 requested a review from yuanzhanhku February 10, 2022 20:31

yuanzhanhku reviewed Feb 10, 2022

jainxrohit self-requested a review February 11, 2022 00:20

jainxrohit requested changes Feb 11, 2022

kewang1024 force-pushed the add_streaming_aggregation branch from 2a38a21 to e47ef33 Compare February 15, 2022 00:12

yuanzhanhku reviewed Feb 15, 2022

kewang1024 force-pushed the add_streaming_aggregation branch from e47ef33 to 54c0f77 Compare February 17, 2022 01:09

yuanzhanhku approved these changes Feb 17, 2022

kewang1024 force-pushed the add_streaming_aggregation branch 5 times, most recently from 2cf4e42 to 8b7f822 Compare February 19, 2022 00:02

kewang1024 requested a review from highker February 19, 2022 07:33

highker reviewed Feb 19, 2022

arunthirupathi reviewed Feb 20, 2022

kewang1024 force-pushed the add_streaming_aggregation branch from 8b7f822 to 5785c85 Compare February 22, 2022 08:09

kewang1024 and others added 2 commits February 22, 2022 00:22

Add ability to disable splitting file in hive connector

addc4c8

kewang1024 force-pushed the add_streaming_aggregation branch 3 times, most recently from dc65f49 to 785d82c Compare February 22, 2022 22:18

kewang1024 force-pushed the add_streaming_aggregation branch from 785d82c to 7a0ac10 Compare February 22, 2022 23:17

highker approved these changes Feb 23, 2022

highker merged commit 2fb2c31 into prestodb:master Feb 23, 2022

varungajjala mentioned this pull request Mar 22, 2022

Add release notes for 0.272 #17499

Closed

9 tasks

asjadsyed mentioned this pull request Mar 23, 2022

Add release notes for 0.272 #17510

Closed

9 tasks

asjadsyed mentioned this pull request Apr 1, 2022

Add release notes for 0.272 #17564

Merged

8 tasks

kewang1024 deleted the add_streaming_aggregation branch March 12, 2023 20:29

kewang1024 restored the add_streaming_aggregation branch March 12, 2023 20:29


5 participants
