[EPIC] (Even More) Grouping / Group By / Aggregation Performance #7000
Thanks for these profiles @karlovnv
Looking at the trace in #7000 (comment): the HashJoin is basically acting like a filter on the probe side input (and thus may emit many small batches where most rows were filtered). Thus in my opinion the way to speed up this query is not to look at …
Similarly, while most of the time is being spent in …, I think the key to improving a query such as the one in #7000 (comment) is to stop using the RowConverter (at least as much). Here is one idea for doing so: #9403 (I still harbor dreams of working on this sometime)
@alamb I'd like to mention that extending a mutable batch spends a lot of time (MutableArrayData::Extend, utils::extend_offsets) plus the related allocator work. I suppose it is much better to preallocate a bigger arrow buffer instead of extending it in small portions, and I believe that would have a noticeable effect. Also I noticed that ~18% was spent by …
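To illustrate the preallocation point above, here is a std-only sketch (not DataFusion code; `count_reallocs` is a hypothetical helper): growing a buffer in small portions triggers repeated reallocations and copies, while reserving the full capacity up front avoids them entirely.

```rust
// Count how many times a Vec's backing allocation moves when it is
// grown in small chunks versus preallocated once. This is a stand-in
// for extending an Arrow mutable batch one small portion at a time.
fn count_reallocs(preallocate: bool, total: usize, chunk: usize) -> usize {
    let mut v: Vec<u8> = if preallocate {
        Vec::with_capacity(total) // single up-front allocation
    } else {
        Vec::new() // grows incrementally as data arrives
    };
    let mut reallocs = 0;
    let mut last_ptr = v.as_ptr();
    for _ in 0..(total / chunk) {
        v.extend(std::iter::repeat(0u8).take(chunk));
        if v.as_ptr() != last_ptr {
            // The buffer moved: the allocator handed out new memory
            // and the old contents were copied over.
            reallocs += 1;
            last_ptr = v.as_ptr();
        }
    }
    reallocs
}
```

With preallocation the buffer never moves; grown chunk-by-chunk it relocates repeatedly, which is the allocator overhead the profile shows.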
I think those particular functions are the ones that actually copy data, so I am not sure how much more they can be optimized.
I agree it may well …
👍
I thought over the join issue for the case when the left table may not be columnar. For instance, consider … So in that case the Users table may be treated as a row-based table with a persistent (or in-memory only) hash (or B*-tree) index. We could achieve a performance boost using different approaches:
Now we are playing with UDFs like …
But this is not a general solution, which leads us to the next approach.
This approach may also be useful in the future for joining columnar data with another relational source such as Postgres (by loading portions of the joined table's data on demand, given a list of ids).
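The index-lookup idea above can be sketched in a few lines of std-only Rust. This is only a conceptual illustration, not DataFusion code; `IndexedUsers` and `User` are hypothetical names standing in for a row-based Users table with a persistent hash index:

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Debug)]
struct User {
    id: u64,
    name: String,
}

// A persistent in-memory index keyed by user id. Instead of scanning
// the whole Users table and building a join hash table per query, we
// probe this index with the ids arriving from the other join side.
struct IndexedUsers {
    index: HashMap<u64, User>,
}

impl IndexedUsers {
    fn new(users: Vec<User>) -> Self {
        Self {
            index: users.into_iter().map(|u| (u.id, u)).collect(),
        }
    }

    // Inner equi-join against a batch of probe-side ids: O(1) per id,
    // and ids with no match are simply dropped.
    fn join(&self, probe_ids: &[u64]) -> Vec<User> {
        probe_ids
            .iter()
            .filter_map(|id| self.index.get(id).cloned())
            .collect()
    }
}
```

The index is built once and reused across queries, which is the core of the proposed win over rebuilding a hash table on every join.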
This is an excellent idea -- Arrow has something equivalent for … Something else that might help is to use StringViewArray when coalescing, which would avoid a copy. This might be quite effective if the next operation is GroupByHash or sort, where the data would be copied again 🤔 I think the next release of arrow-rs might have sufficient functionality to try it.
I wonder if we could combine this with something like #7955 🤔
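To show why a view-based representation avoids copies during coalesce, here is a conceptual std-only sketch of the idea behind StringViewArray (this is not the arrow-rs API; `StringViews` is an illustrative name):

```rust
use std::sync::Arc;

// Instead of copying string bytes when coalescing small batches,
// keep the original buffers alive via Arc and store lightweight
// (buffer index, offset, length) views into them.
struct StringViews {
    buffers: Vec<Arc<[u8]>>,           // shared data buffers, never copied
    views: Vec<(usize, usize, usize)>, // (buffer index, offset, len)
}

impl StringViews {
    fn new() -> Self {
        Self { buffers: Vec::new(), views: Vec::new() }
    }

    // Coalescing another batch only appends its buffer handle and
    // per-string views; the string bytes themselves are not moved.
    fn append_batch(&mut self, buf: Arc<[u8]>, offsets: &[(usize, usize)]) {
        let b = self.buffers.len();
        self.buffers.push(buf);
        self.views
            .extend(offsets.iter().map(|&(o, l)| (b, o, l)));
    }

    fn get(&self, i: usize) -> &str {
        let (b, o, l) = self.views[i];
        std::str::from_utf8(&self.buffers[b][o..o + l]).unwrap()
    }
}
```

The copy is deferred until (and unless) a later operator actually needs contiguous data, which is why this pairs well with operators like GroupByHash that copy anyway.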
It's quite a good idea! But I think it's tricky to push the ON condition down. The main reason is the following: we know the list of ids (in terms of the column index) only at the JOIN stage, not while filtering and fetching data from the source. So the second approach:
Or, even better, to get only the offsets by ids (arrow …
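The "offsets by ids" variant amounts to a gather over the resolved offsets rather than a filtered scan. A minimal sketch (`take` here is an illustrative stand-in, not the arrow compute kernel):

```rust
// Gather only the rows at the given offsets, in the given order.
// Resolving join ids to offsets first (via an index) means we touch
// exactly the rows we need instead of scanning the whole column.
fn take<T: Clone>(column: &[T], offsets: &[usize]) -> Vec<T> {
    offsets.iter().map(|&o| column[o].clone()).collect()
}
```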
DictionaryArray is something different. It is the best choice for low-cardinality columns (it efficiently encodes the data in a single column to save space and speed up filtering).
But it would be great if we supported the arrow Dictionary-encoded type! We could also use a shared dictionary buffer for all the batches. However, since DictionaryArray has no index keyed by value, we cannot use it for fast O(1) data retrieval.
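A small sketch of that last point: dictionary encoding alone gives you distinct values plus per-row keys, and O(1) lookup *by value* requires carrying an extra reverse index, which plain dictionary encoding does not (all names here are illustrative, std-only Rust):

```rust
use std::collections::HashMap;

struct DictColumn {
    values: Vec<String>,            // distinct values ("the dictionary")
    keys: Vec<u32>,                 // one key per row
    reverse: HashMap<String, u32>,  // value -> key, enables O(1) lookup by value
}

impl DictColumn {
    fn from_rows(rows: &[&str]) -> Self {
        let mut values = Vec::new();
        let mut reverse = HashMap::new();
        let keys = rows
            .iter()
            .map(|&r| {
                // Assign each distinct value the next key the first
                // time it is seen; reuse the key afterwards.
                *reverse.entry(r.to_string()).or_insert_with(|| {
                    values.push(r.to_string());
                    (values.len() - 1) as u32
                })
            })
            .collect();
        Self { values, keys, reverse }
    }

    // Row numbers whose value equals `v`: one O(1) probe of the
    // reverse index, then a cheap integer comparison per row.
    fn rows_equal(&self, v: &str) -> Vec<usize> {
        match self.reverse.get(v) {
            Some(&k) => self
                .keys
                .iter()
                .enumerate()
                .filter(|&(_, &key)| key == k)
                .map(|(i, _)| i)
                .collect(),
            None => Vec::new(),
        }
    }
}
```

Without the `reverse` map, finding rows by value would require string comparisons against the dictionary, which is the limitation the comment describes.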
Is your feature request related to a problem or challenge?
Aggregation is a key operation of Analytic engines. DataFusion has made great progress recently (e.g. #4973 and #6889)
This Epic gathers other potential ways we can improve the performance of aggregation
Core Hash Grouping Algorithm:
Specialized Aggregators:
- DistinctCountAccumulator #5472
New features:
- RecordBatches rather than one large one #9562
- Accumulator::evaluate and Accumulator::state to take &mut self #8934
Improved partitioning:
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response