Split multiple distinct aggregations to sub queries#22355

Merged
raunaqmorarka merged 4 commits into trinodb:master from Dith3r:ke/dist-aggr on Jul 26, 2024
Conversation

@Dith3r Dith3r commented Jun 11, 2024

Description

Introduce a new rule that splits distinct aggregations on different arguments into sub-queries and joins the grouped results on the grouping keys, if any.
This allows SingleDistinctAggregationToGroupBy to kick in, which improves parallelism and performance significantly when the grouped query is cheap to duplicate.
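
The shape of the rewrite can be illustrated at the SQL level. The following is a minimal sketch, not the rule's implementation: it uses Python's built-in `sqlite3` rather than Trino, and the `orders` table, its columns, and the sample rows are hypothetical. It only shows that a query with two distinct aggregations on different arguments returns the same rows as two single-distinct sub-queries joined on the grouping key. The sketch assumes non-null grouping keys; a planner would presumably need a null-safe join condition.

```python
import sqlite3

# Hypothetical sample data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (region TEXT, customer TEXT, product TEXT);
INSERT INTO orders VALUES
    ('eu', 'a', 'x'), ('eu', 'a', 'y'), ('eu', 'b', 'x'),
    ('us', 'c', 'z'), ('us', 'c', 'z'), ('us', 'd', 'z');
""")

# Original form: two distinct aggregations on different arguments.
ORIGINAL = """
SELECT region, count(DISTINCT customer), count(DISTINCT product)
FROM orders
GROUP BY region
ORDER BY region
"""

# Split form: one sub-query per distinct argument, joined on the grouping key.
# Each sub-query now contains a single distinct aggregation.
SPLIT = """
SELECT c.region, c.customers, p.products
FROM (SELECT region, count(DISTINCT customer) AS customers
      FROM orders GROUP BY region) AS c
JOIN (SELECT region, count(DISTINCT product) AS products
      FROM orders GROUP BY region) AS p
  ON c.region = p.region
ORDER BY c.region
"""

original_rows = conn.execute(ORIGINAL).fetchall()
split_rows = conn.execute(SPLIT).fetchall()
print(original_rows == split_rows)
```

The split form reads the source once per distinct argument, which is why the description limits the benefit to cases where the grouped query is cheap to duplicate.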

Benchmarks
(image: benchmark results)
Some queries (a simple aggregation on top of a table scan, with a low-cardinality GROUP BY) improve significantly, for example 5s versus 40s when using MarkDistinct.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:

# General
* Improve performance of queries with multiple distinct aggregations. ({issue}`22355`)

@cla-bot cla-bot bot added the cla-signed label Jun 11, 2024
@github-actions github-actions bot added docs hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Jun 11, 2024
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 2 times, most recently from 7063e24 to db41f8a Compare June 11, 2024 09:08
@Dith3r Dith3r requested a review from martint June 11, 2024 09:20
@lukasz-stec lukasz-stec left a comment

lgtm

@Dith3r Dith3r force-pushed the ke/dist-aggr branch 2 times, most recently from 7023050 to fa6b76a Compare June 14, 2024 08:06
@Dith3r Dith3r requested a review from raunaqmorarka June 14, 2024 08:10
@Dith3r Dith3r requested review from findepi and raunaqmorarka July 15, 2024 08:32
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 3 times, most recently from 7d1da53 to 51cf7ad Compare July 15, 2024 11:58
@raunaqmorarka raunaqmorarka left a comment

minor comments

The rule splits distinct aggregations on different
arguments into sub-queries and joins the grouped
results on the grouping keys, if any.
This allows SingleDistinctAggregationToGroupBy to kick in
and improve parallelism and performance significantly
when the grouped query is cheap to duplicate.

Make the different distinct aggregation strategy choices
mutually exclusive, so that the order of optimizer rules does not matter.

Make MultipleDistinctAggregationsToSubqueries fire when
distinct_aggregations_strategy=AUTOMATIC and we can be
confident, based on stats, that the rule will be beneficial.
The aggregation source is limited to table scan, filter,
and project.
Labels

cla-signed delta-lake Delta Lake connector docs hive Hive connector hudi Hudi connector iceberg Iceberg connector performance

6 participants