Skip to content

Conversation

@Dith3r
Copy link
Member

@Dith3r Dith3r commented May 10, 2024

Description

Improve performance of distinct aggregations by defining additional strategies based on source properties (for example, NDV).

The strategy to use for multiple distinct aggregations.
SINGLE_STEP Computes distinct aggregations in single-step without any pre-aggregations.
This strategy will perform poorly if the number of distinct grouping keys is small.
MARK_DISTINCT uses MarkDistinct for multiple distinct aggregations
or for mix of distinct and non-distinct aggregations.
PRE_AGGREGATE Computes distinct aggregations using a combination of aggregation
and pre-aggregation steps.
AUTOMATIC chooses the strategy automatically.

Single-step strategy is preferred. However, for cases with limited concurrency due to
a small number of distinct grouping keys, it will choose an alternative strategy
based on input data statistics.

Strategy Duration Query
MARK_DISTINCT 5353ms select count(ss_customer_sk), count(distinct ss_ticket_number) from hive.tpcds_sf1000_orc.store_sales;
PRE_AGGREGATE 2468ms select count(ss_customer_sk), count(distinct ss_ticket_number) from hive.tpcds_sf1000_orc.store_sales;
MARK_DISTINCT 10109ms select count(ss_quantity), count(distinct ss_item_sk), count(distinct ss_store_sk) from hive.tpcds_sf1000_orc.store_sales;
PRE_AGGREGATE 8253ms select count(ss_quantity), count(distinct ss_item_sk), count(distinct ss_store_sk) from hive.tpcds_sf1000_orc.store_sales;

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* Improve performance of distinct aggregations. ({issue}`21907`)
  Replaced configuration property `optimizer.mark-distinct-strategy` with `optimizer.distinct-aggregations-strategy`. 
  It's allowed values are changed as follows: `NONE -> SINGLE_STEP` and `ALWAYS -> MARK_DISTINCT`. 
  Replaced `optimizer.optimize-mixed-distinct-aggregations` configuration property 
  with a new `PRE_AGGREGATE` option in `optimizer.distinct-aggregations-strategy` .

@cla-bot cla-bot bot added the cla-signed label May 10, 2024
@Dith3r Dith3r requested review from lukasz-stec and sopel39 May 10, 2024 07:20
@github-actions github-actions bot added docs hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels May 10, 2024
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 4 times, most recently from 0d37869 to abce86f Compare May 10, 2024 10:35
@raunaqmorarka raunaqmorarka requested a review from martint May 10, 2024 11:47
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 3 times, most recently from a8e0691 to 1dbce97 Compare May 13, 2024 10:19
Copy link
Member

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

@Dith3r Dith3r force-pushed the ke/dist-aggr branch 3 times, most recently from 3c9b688 to e1575ae Compare June 4, 2024 07:44
@Dith3r Dith3r requested a review from raunaqmorarka June 4, 2024 09:20
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 4 times, most recently from 9045fda to 12631c0 Compare June 4, 2024 13:52
@Dith3r Dith3r requested a review from raunaqmorarka June 4, 2024 13:52
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 3 times, most recently from b1b21b8 to e313705 Compare June 4, 2024 14:16
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 5 times, most recently from 6b76663 to ece0e92 Compare June 6, 2024 08:11
lukasz-stec and others added 2 commits June 7, 2024 11:38
Extract the logic to determine whether the direct distinct aggregation applicability,
which can be reused in multiple optimiser rules.
@Dith3r Dith3r requested a review from martint June 7, 2024 09:58
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 2 times, most recently from 1cfd72d to 806f79a Compare June 7, 2024 12:07
@Dith3r Dith3r requested a review from sopel39 June 7, 2024 12:08
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 3 times, most recently from 0de1457 to 04d7275 Compare June 10, 2024 09:50
Dith3r and others added 2 commits June 10, 2024 11:59
Move class before refactoring to preserve history
Adds support for multiple distinct aggregations in OptimizeMixedDistinctAggregations`.
Replace optimizer.optimize-mixed-distinct-aggregations with a new
optimizer.distinct-aggregations-strategy `pre_aggregate`. Also rename corresponding config property optimizer.mark-distinct-strategy to
optimizer.distinct-aggregations-strategy and values to NONE -> SINGLE_STEP and ALWAYS -> MARK_DISTINCT
Use estimated aggregation source NDV and the number
of grouping keys to decide if pre-aggregate strategy
should be used for a given aggregation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed delta-lake Delta Lake connector docs hive Hive connector hudi Hudi connector iceberg Iceberg connector performance

Development

Successfully merging this pull request may close these issues.

5 participants