Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] High cardinality aggregation performance wishlist #11679

Open
1 of 4 tasks
alamb opened this issue Jul 26, 2024 · 4 comments
Open
1 of 4 tasks

[Epic] High cardinality aggregation performance wishlist #11679

alamb opened this issue Jul 26, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

Is your feature request related to a problem or challenge?

DataFusion uses a two phase approach to aggregation (see Accumulator::state) for details:

                              ▲
                              │                   evaluate() is called to
                              │                   produce the final aggregate
                              │                   value per group
                              │
                 ┌─────────────────────────┐
                 │GroupBy                  │
                 │(AggregateMode::Final)   │      state() is called for each
                 │                         │      group and the resulting
                 └─────────────────────────┘      RecordBatches passed to the
                              ▲
                              │
             ┌────────────────┴───────────────┐
             │                                │
             │                                │
┌─────────────────────────┐      ┌─────────────────────────┐
│        GroubyBy         │      │        GroubyBy         │
│(AggregateMode::Partial) │      │(AggregateMode::Partial) │
└─────────────────────────┘      └────────────▲────────────┘
             ▲                                │
             │                                │    update_batch() is called for
             │                                │    each input RecordBatch
        .─────────.                      .─────────.
     ,─'           '─.                ,─'           '─.
    ;      Input      :              ;      Input      :
    :   Partition 0   ;              :   Partition 1   ;
     ╲               ╱                ╲               ╱
      '─.         ,─'                  '─.         ,─'
         `───────'                        `───────'

For low cardinality aggregates (where there are a few distinct groups), this works great 👌 👨‍🍳

However for high cardinality aggregates (where there are many millions of groups), we can do better by optimizing the path. See the background and ASCII art on #7957 for why the intermediate cardinality increases

This is my wishlist for improving high cardinality aggregates (ideally for the next blog post in a few months #11631 )

Together with the StringView work in #10918 that @XiangpengHao @a10y and others are working on, I think it would provide some very compelling overall speedups in ClickBench and TPCH queries

Also I hear that @avantgardnerio may be interested in helping here

Describe the solution you'd like

Here is my wishlist:

Describe alternatives you've considered

Do nothing and let DuckDB pass us by ;)

Additional context

Other potential things to do:

@alamb
Copy link
Contributor Author

alamb commented Jul 30, 2024

@alamb
Copy link
Contributor Author

alamb commented Aug 5, 2024

#6937 -- bam

@alamb
Copy link
Contributor Author

alamb commented Sep 25, 2024

@alamb
Copy link
Contributor Author

alamb commented Nov 5, 2024

Here is another great improvement: #12996

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant