Add streaming aggregation #17281

Merged
highker merged 3 commits into prestodb:master from kewang1024:add_streaming_aggregation
Feb 23, 2022

@kewang1024 kewang1024 commented Feb 10, 2022

== RELEASE NOTES ==

General Changes
* Add ability to stream data for partial aggregation instead of building hash tables. It improves the performance of aggregation when the data is already ordered by the group-by keys.
This can be enabled with the ``streaming_for_partial_aggregation_enabled`` session property or the ``streaming-for-partial-aggregation-enabled`` configuration property.
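To illustrate the difference, here is a minimal Python sketch (not the Presto implementation) that contrasts a hash-based aggregation with a streaming one. The row shape and the function names `hash_aggregate` and `streaming_aggregate` are hypothetical and exist only for this example. The streaming version keeps state for a single group at a time, which is why it only produces one row per key when equal keys are adjacent in the input.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, int]  # hypothetical row shape: (group-by key, value to sum)


def hash_aggregate(rows: Iterable[Row]) -> Dict[str, int]:
    """Baseline: build a hash table holding one entry per distinct key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def streaming_aggregate(rows: Iterable[Row]) -> Iterator[Row]:
    """Emit a group as soon as the key changes; state is kept for one group only."""
    current_key = None
    running_sum = 0
    has_group = False
    for key, value in rows:
        if has_group and key != current_key:
            # The key changed, so the previous group is complete.
            yield current_key, running_sum
            running_sum = 0
        current_key, has_group = key, True
        running_sum += value
    if has_group:
        yield current_key, running_sum


# Input already ordered by the group-by key.
sorted_rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]
streamed = list(streaming_aggregate(sorted_rows))
```

On ordered input both functions agree on the totals, but the streaming one never holds more than one group in memory.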

Hive Changes
* Add ability to do streaming aggregation for Hive table scans, which improves the performance of aggregation queries when the group-by keys are the same as the order-by keys. Cases where the group-by keys are a subset of the order-by keys are not supported yet.
This can be enabled with the ``streaming_aggregation_enabled`` session property or the ``hive.streaming-aggregation-enabled`` configuration property.
* Add ability to disable file splitting in the Hive connector. This can be controlled with the ``file_splittable`` session property or the ``hive.file-splittable`` configuration property.

Note:
When we compare the performance of regular queries against queries that have streaming aggregation turned on, the two differ in more than one way, which makes it hard to tell which factor dominates:

  1. splittable vs non-splittable
  2. non-streaming vs streaming

We add the ability to disable file splitting in the Hive connector so that the two factors can be isolated when debugging.
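As a rough sketch of how such a comparison could be set up, the Python snippet below generates the `SET SESSION` statements for two benchmark runs that both disable file splitting and differ only in streaming aggregation. The property names come from the release notes above; the catalog name `hive` and the helper `session_statements` are assumptions made for this example.

```python
from typing import Dict, List

CATALOG = "hive"  # assumption: the Hive catalog is registered under this name


def session_statements(streaming: bool) -> List[str]:
    """Session settings for one run: splitting always off, streaming toggled."""
    def literal(flag: bool) -> str:
        return "true" if flag else "false"

    return [
        # Keep file splitting disabled in both runs so it is not a variable.
        f"SET SESSION {CATALOG}.file_splittable = false",
        f"SET SESSION {CATALOG}.streaming_aggregation_enabled = {literal(streaming)}",
    ]


runs: Dict[str, List[str]] = {
    "baseline": session_statements(streaming=False),
    "streaming": session_statements(streaming=True),
}

for name, statements in runs.items():
    print(name)
    for statement in statements:
        print("  " + statement + ";")
```

Because both runs share the same splitting setting, any difference in latency or resource usage can be attributed to streaming aggregation alone.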

@kewang1024 kewang1024 force-pushed the add_streaming_aggregation branch from c93a97e to 2a38a21 Compare February 10, 2022 20:30
@jainxrohit jainxrohit self-requested a review February 11, 2022 00:20
Contributor

can we move this to other aggregation fields?

Collaborator Author

Can you elaborate what you mean?

Contributor

Move this so that it sits with the other aggregation fields.

@kewang1024 kewang1024 force-pushed the add_streaming_aggregation branch from 2a38a21 to e47ef33 Compare February 15, 2022 00:12
Contributor

Consider adding another test with the default session property to make sure streaming is not enabled and this session property will never be enabled by default.

Collaborator Author

yeah, it's added in testBucketedAndSortedBySameKey

Contributor

Add another test with group by custkey (a prefix of the bucket/sorted_by columns). I think streaming aggregation should be enabled in that case.

Collaborator Author

Yes, but that requires further changes in the codebase which we're not proceeding with right now. I will add a TODO here.
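The rule being discussed can be sketched in a few lines of Python. This is an illustration of the behaviour described in this thread, not the actual planner code: the function names are hypothetical, the column names `custkey` and `orderkey` are example values, and comparing the keys as sets is an assumption.

```python
from typing import Sequence


def can_stream(group_by: Sequence[str], sorted_by: Sequence[str]) -> bool:
    """Current rule: stream only when group-by keys equal the sorted-by columns.

    Assumption: the two key lists are compared as sets.
    """
    return len(group_by) > 0 and set(group_by) == set(sorted_by)


def is_sort_prefix(group_by: Sequence[str], sorted_by: Sequence[str]) -> bool:
    """The follow-up case: group-by keys form a leading prefix of the sort order."""
    prefix = sorted_by[: len(group_by)]
    return 0 < len(group_by) < len(sorted_by) and set(group_by) == set(prefix)


sorted_by = ["custkey", "orderkey"]  # hypothetical sorted_by columns

same_keys = can_stream(["custkey", "orderkey"], sorted_by)
prefix_streams_today = can_stream(["custkey"], sorted_by)
prefix_detected = is_sort_prefix(["custkey"], sorted_by)
```

Here `prefix_detected` is true while `prefix_streams_today` is false, which is the gap the TODO refers to: data sorted by (custkey, orderkey) is also clustered by custkey, so the prefix case could be enabled once the required changes land.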

@kewang1024 kewang1024 force-pushed the add_streaming_aggregation branch from e47ef33 to 54c0f77 Compare February 17, 2022 01:09
Contributor

@yuanzhanhku yuanzhanhku left a comment

Thanks Ke for productionizing this feature and adding the thorough tests.

@kewang1024 kewang1024 force-pushed the add_streaming_aggregation branch 5 times, most recently from 2cf4e42 to 8b7f822 Compare February 19, 2022 00:02
@kewang1024 kewang1024 requested a review from highker February 19, 2022 07:33
@highker highker left a comment

"Add ability to disable splitting file in hive connector" LGTM

cc @arunthirupathi in case this could be useful for delta integration.

@highker highker left a comment

"Add ability to do streaming aggregation for hive table scans" LGTM

Comment on lines 270 to 271

hmmm if we are already guarding splittable here through isStreamingAggregationEnabled, do we still need to introduce isFileSplittable config at all? cc: @yuanzhanhku @arunthirupathi

Contributor

isFileSplittable is not needed for streaming aggregation support. Ke just added it for benchmarking purposes. It allows us to run a query with file splitting disabled and streaming aggregation enabled/disabled, so that we can measure the improvement from streaming aggregation without the impact of file splitting.

@highker highker left a comment

"Add ability to disable splitting file in hive connector" LGTM

what is "Multidates"...

Collaborator Author

Because we can't enable streaming aggregation when querying multiple partitions without grouping by the partition keys. I renamed the test to "testQueryingMultiplePartitions" to avoid confusion.

highker commented Feb 19, 2022

Overall LGTM. Some nits on release note:

"Add ability to do streaming for partial aggregation..."

  • Would be good to specify what user-facing queries will be impacted and how users could leverage this config to achieve what. "streaming for partial aggregation" is too deep in terms of terminology for users to understand.

"Add ability to do streaming aggregation for hive table scans"

  • Same as above, explain a bit more what streaming aggregation is and how that could help improve query performance (still from user perspective). An example could be "Introduce streaming aggregation to improve query performance on queries with aggregation under the condition that group keys are the same as blah blah blah"
  • nit: grouping-keys: group-by keys
  • nit: order-keys: order-by keys

"Add ability to disable splitting file..."

  • Well, let's see if the first commit is actually needed or not to decide if we need this release note.

Does the prefix match need to be ordered? I assume an equality condition, and hence order does not matter.

Contributor

Right, the order doesn't matter here. StreamPartitionColumns is similar to the bucket column concept, but it is for splits instead of files.

@kewang1024 kewang1024 force-pushed the add_streaming_aggregation branch from 8b7f822 to 5785c85 Compare February 22, 2022 08:09
kewang1024 and others added 2 commits February 22, 2022 00:22
As of now, we only support streaming aggregation for the cases where the group-by keys
are the same as the order-by keys. Cases where the group-by keys are a subset of the
order-by keys are not yet supported.

Co-Authored-By: Zhan Yuan <yuanzhanhku@gmail.com>
@kewang1024 kewang1024 force-pushed the add_streaming_aggregation branch 3 times, most recently from dc65f49 to 785d82c Compare February 22, 2022 22:18
We can always enable streaming aggregation for partial aggregations without affecting correctness.
But if the data isn't clustered (for example: ordered) by the group-by keys, it may cause regressions on latency
and resource usage. This session property is just a solution to force enabling streaming aggregation
when we know the execution would benefit from partial streaming aggregation.
We can work on determining it based on the input table properties later.
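The trade-off described in this commit message can be illustrated with a small Python sketch (again, not the Presto implementation). The helper names `partial_streaming_sum` and `final_sum` are hypothetical. `itertools.groupby` groups only consecutive equal keys, which mirrors what a streaming partial aggregation does; the final step merges the partial results with a hash table.

```python
from collections import defaultdict
from itertools import groupby
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, int]  # hypothetical row shape: (group-by key, value to sum)


def partial_streaming_sum(rows: Iterable[Row]) -> Iterator[Row]:
    """Partial step: one output row per run of consecutive equal keys."""
    for key, run in groupby(rows, key=lambda row: row[0]):
        yield key, sum(value for _, value in run)


def final_sum(partials: Iterable[Row]) -> Dict[str, int]:
    """Final step: merge partial sums per key using a hash table."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in partials:
        totals[key] += value
    return dict(totals)


clustered = [("a", 1), ("a", 2), ("b", 5), ("b", 1)]
unclustered = [("a", 1), ("b", 5), ("a", 2), ("b", 1)]

partials_clustered = list(partial_streaming_sum(clustered))
partials_unclustered = list(partial_streaming_sum(unclustered))

result_clustered = final_sum(partials_clustered)
result_unclustered = final_sum(partials_unclustered)
```

Both inputs yield the same final totals, so correctness is unaffected. The unclustered input, however, produces as many partial rows as input rows, so the partial step reduces nothing and only adds work for the final aggregation.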
@kewang1024 kewang1024 force-pushed the add_streaming_aggregation branch from 785d82c to 7a0ac10 Compare February 22, 2022 23:17
@highker highker merged commit 2fb2c31 into prestodb:master Feb 23, 2022
@varungajjala varungajjala mentioned this pull request Mar 22, 2022
9 tasks
@asjadsyed asjadsyed mentioned this pull request Mar 23, 2022
9 tasks
@asjadsyed asjadsyed mentioned this pull request Apr 1, 2022
8 tasks
@kewang1024 kewang1024 deleted the add_streaming_aggregation branch March 12, 2023 20:29
@kewang1024 kewang1024 restored the add_streaming_aggregation branch March 12, 2023 20:29
5 participants