Add segmented aggregation in AggregationNode #17458

Merged
rschlussel merged 3 commits into prestodb:master from kewang1024:segmented_streaming_aggregation
May 2, 2022

Conversation

@kewang1024
Collaborator

@kewang1024 kewang1024 commented Mar 10, 2022

== NO RELEASE NOTE ==

If the group-by keys contain a prefix of the sorted-by keys, we can enable segmented aggregation.

For example, the table is sorted by F1, F3 and we group by F1, F2.
F2 is not sorted, so we can’t do streaming aggregation for each <F1, F2> group. However, since F1 is sorted, we can segment the data by F1’s value: in the first segment, for example, F1’s values are all a. We can then build a hash table for each segment, compute the aggregates, and flush the results once the segment is finished.

This still saves CPU because we don’t do lookups on F1, and the hash table we keep is smaller than the full hash table.
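
To make the idea concrete, here is a minimal Python sketch of segmented aggregation, not the Presto operator itself. It assumes the rows arrive already sorted by the segment key, uses a plain sum as the aggregate, and every name in it (`segmented_aggregate`, `f1`, `f2`, `v`) is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def segmented_aggregate(rows, segment_key, group_keys, value_key):
    """Sum `value_key` per group, flushing whenever the segment key changes.

    Assumes `rows` is already sorted (or at least clustered) by `segment_key`
    and that `segment_key` is one of `group_keys`.
    """
    rest = [k for k in group_keys if k != segment_key]
    for segment_value, segment_rows in groupby(rows, key=itemgetter(segment_key)):
        # The hash table holds only the groups of the current segment and is
        # keyed on the remaining (unsorted) keys, so the segment key is never hashed.
        table = {}
        for row in segment_rows:
            key = tuple(row[k] for k in rest)
            table[key] = table.get(key, 0) + row[value_key]
        # Segment finished: emit its groups and drop the table.
        for key, total in table.items():
            yield {segment_key: segment_value, **dict(zip(rest, key)), value_key: total}


rows = [
    {"f1": "a", "f2": "x", "v": 1},
    {"f1": "a", "f2": "y", "v": 2},
    {"f1": "a", "f2": "x", "v": 3},
    {"f1": "b", "f2": "y", "v": 4},
    {"f1": "b", "f2": "y", "v": 5},
]
result = list(segmented_aggregate(rows, "f1", ["f1", "f2"], "v"))
```

In this sketch the table never holds more than one segment's groups at a time, which is where the memory saving described above comes from.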

@kewang1024 kewang1024 requested a review from yuanzhanhku March 10, 2022 08:49
@yuanzhanhku yuanzhanhku requested a review from mbasmanova April 6, 2022 03:36
@kewang1024 kewang1024 force-pushed the segmented_streaming_aggregation branch 2 times, most recently from 621cd0b to 204fc3c Compare April 14, 2022 05:00
@kewang1024 kewang1024 force-pushed the segmented_streaming_aggregation branch from 204fc3c to d485f8c Compare April 19, 2022 00:16
@kewang1024 kewang1024 force-pushed the segmented_streaming_aggregation branch 2 times, most recently from 258edde to 7a629b7 Compare April 20, 2022 08:07
@kewang1024 kewang1024 mentioned this pull request Apr 20, 2022
@kewang1024 kewang1024 requested a review from yuanzhanhku April 20, 2022 23:02
@kewang1024 kewang1024 force-pushed the segmented_streaming_aggregation branch from 7a629b7 to 0ed7385 Compare April 21, 2022 07:44
@prestodb prestodb deleted a comment from yuanzhanhku Apr 21, 2022
@highker highker requested a review from rschlussel April 21, 2022 19:13
Contributor

@yuanzhanhku yuanzhanhku left a comment

LGTM. Thanks for implementing this feature!

@kewang1024
Collaborator Author

@rschlussel gentle ping, just in case you missed the notification.

@kewang1024 kewang1024 changed the title Add segmented streaming aggregation in AggregationNode Add segmented aggregation in AggregationNode Apr 25, 2022
Contributor

@rschlussel rschlussel left a comment

I haven't reviewed the tests yet because I was a bit confused about them. Is this feature available already on the execution side?

@kewang1024
Collaborator Author

> I haven't reviewed the tests yet because I was a bit confused about them. Is this feature available already on the execution side?

No, not yet. The worker-side change is in #17618.

Contributor

@rschlussel rschlussel left a comment

I think you need to add a session property specific to segmented aggregation that is disabled by default for now. Otherwise we will have a regression while we wait for the execution side to be ready.

@kewang1024
Collaborator Author

Thanks for the review @rschlussel
I addressed most of the comments. For the rest:

  1. We left a TODO in HashGenerationOptimizer because the worker-side implementation of segmented aggregation hasn't been completed; how we pre-compute the hash for a segmented aggregation plan depends on that implementation.
  2. I added a session property that prevents segmented aggregation plans from being generated in the first place, so the other places no longer need their own guards.
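
As a rough illustration of guarding the feature at plan-generation time, the Python sketch below picks an aggregation strategy from the pre-grouped columns and a flag. The function name, the strategy labels, and the `segmented_enabled` parameter are hypothetical and are not the actual Presto session property or planner API.

```python
def choose_aggregation_strategy(group_by, pre_grouped, segmented_enabled=False):
    """Pick an aggregation strategy for a plan node (illustrative only).

    `pre_grouped` is the subset of `group_by` the input is already sorted on.
    `segmented_enabled` stands in for the session property, off by default.
    """
    if pre_grouped and set(pre_grouped) == set(group_by):
        # Every group-by key is sorted: plain streaming aggregation applies.
        return "streaming"
    if pre_grouped and segmented_enabled:
        # Only some keys are sorted: segment on them, hash on the rest.
        return "segmented"
    # Flag off or nothing sorted: fall back to regular hash aggregation.
    return "hash"


default_plan = choose_aggregation_strategy(["f1", "f2"], ["f1"])
enabled_plan = choose_aggregation_strategy(["f1", "f2"], ["f1"], segmented_enabled=True)
```

Because the flag is checked once where the plan is produced, later stages in this sketch never see a segmented plan unless it was explicitly enabled.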

@kewang1024 kewang1024 force-pushed the segmented_streaming_aggregation branch 2 times, most recently from 3ea8767 to 6feadd0 Compare April 27, 2022 18:26
Enable segmented aggregation if the prefix of the sorted-by columns is a
subset of the group-by columns
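
The eligibility check in this commit message can be sketched in Python as follows. This is an illustrative helper under the assumption that columns are plain strings; `pre_grouped_prefix` is a made-up name, not a function from the Presto code base.

```python
def pre_grouped_prefix(sorted_by, group_by):
    """Return the longest prefix of `sorted_by` whose columns all appear in `group_by`."""
    group_set = set(group_by)
    prefix = []
    for column in sorted_by:
        if column not in group_set:
            # The first sorted column outside the group-by keys ends the prefix.
            break
        prefix.append(column)
    return prefix


# Sorted by F1, F3 and grouped by F1, F2: only F1 can drive the segments.
partial = pre_grouped_prefix(["f1", "f3"], ["f1", "f2"])
# The leading sorted column is not a group-by key, so nothing is usable.
unusable = pre_grouped_prefix(["f3", "f1"], ["f1", "f2"])
```

A non-empty result that is shorter than the group-by list is the segmented case; a result covering every group-by key would allow ordinary streaming aggregation instead.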
Currently we add the sorted-by columns to both streamPartitionColumns and localProperties
only when the bucketed-by columns are the same as the prefix of the sorted-by columns. There are two issues:
1. The condition is too strict and eliminates some cases where we could also expose those properties.
2. Adding the sorted-by columns as streamPartitionColumns also tightens the condition. For example, for a
table that is bucketed by A and sorted by <A, B>, using <A, B> as the streamPartitionColumns is a stricter
rule than necessary; we should use only A.

Instead, we now:
Add bucketed-by columns to streamPartitionColumns
Add sorted-by columns to localProperty
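
The before-and-after behavior described in this commit message can be modeled with the small Python sketch below. It is a simplification: `TableLayout`, `old_properties`, and `new_properties` are hypothetical names, and any additional conditions the real planner applies are omitted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TableLayout:
    """Hypothetical stand-in for a table's bucketing and sorting metadata."""
    bucketed_by: Tuple[str, ...]
    sorted_by: Tuple[str, ...]


def old_properties(layout):
    """Old rule: expose the sorted-by columns for both properties, but only
    when the bucketed-by columns equal a prefix of the sorted-by columns."""
    n = len(layout.bucketed_by)
    if n > 0 and layout.sorted_by[:n] == layout.bucketed_by:
        return layout.sorted_by, layout.sorted_by
    return (), ()


def new_properties(layout):
    """New rule: bucketed-by columns become the stream partition columns and
    sorted-by columns become the local (ordering) property."""
    return layout.bucketed_by, layout.sorted_by


# Bucketed by A, sorted by <A, B>, as in the example from the commit message.
layout = TableLayout(bucketed_by=("a",), sorted_by=("a", "b"))
old_partition, old_local = old_properties(layout)
new_partition, new_local = new_properties(layout)
```

Under the old rule the stream partition columns come out as `("a", "b")`, while the new rule yields just `("a",)` and still reports the `("a", "b")` ordering.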
@kewang1024 kewang1024 force-pushed the segmented_streaming_aggregation branch from 6feadd0 to c4e7c70 Compare April 28, 2022 18:57
@kewang1024 kewang1024 requested a review from a team as a code owner April 28, 2022 18:57
@kewang1024 kewang1024 requested a review from rschlussel April 28, 2022 23:28
@rschlussel rschlussel merged commit 2a040a2 into prestodb:master May 2, 2022
@kewang1024 kewang1024 deleted the segmented_streaming_aggregation branch March 12, 2023 20:29

4 participants