[EPIC] (Even More) Grouping / Group By / Aggregation Performance #7000
Thanks for these profiles @karlovnv
Looking at the trace in #7000 (comment): the HashJoin is basically acting like a filter on the probe side input (and thus may emit many small batches where most rows were filtered). Thus in my opinion the way to speed up this query is not to look at …
Similarly, while most of the time is being spent in …, I think the key to improving a query such as the one in #7000 (comment) is to stop using the RowConverter (at least as much). Here is one idea for doing so: #9403 (I still harbor dreams of working on this sometime)
@alamb I'd like to mention that extending a mutable batch spends a lot of time (MutableArrayData::Extend, utils::extend_offsets) plus the related allocator work. I suppose it is much better to preallocate a bigger arrow buffer instead of extending it in small portions, and I believe that would have a noticeable effect. Also I noticed that ~18% was spent by …
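To illustrate the preallocation point above, here is a std-only sketch (not DataFusion code; `count_reallocs` is a hypothetical helper): growing a buffer in small portions triggers repeated reallocations and copies, while reserving the full capacity up front avoids them entirely.

```rust
// Count how many times a Vec's backing allocation moves when it is
// grown in small chunks versus preallocated once. This is a stand-in
// for extending an Arrow mutable batch one small portion at a time.
fn count_reallocs(preallocate: bool, total: usize, chunk: usize) -> usize {
    let mut v: Vec<u8> = if preallocate {
        Vec::with_capacity(total) // single up-front allocation
    } else {
        Vec::new() // grows incrementally as data arrives
    };
    let mut reallocs = 0;
    let mut last_ptr = v.as_ptr();
    for _ in 0..(total / chunk) {
        v.extend(std::iter::repeat(0u8).take(chunk));
        if v.as_ptr() != last_ptr {
            // The buffer moved: the allocator handed out new memory
            // and the old contents were copied over.
            reallocs += 1;
            last_ptr = v.as_ptr();
        }
    }
    reallocs
}
```

With preallocation the buffer never moves; grown chunk-by-chunk it relocates repeatedly, which is the allocator overhead the profile shows.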
I think those particular functions are the ones that actually copy data, so I am not sure how much more they can be optimized.
I agree it may well …
👍
I thought over the join issue for the case when the left table may not be columnar. For instance, consider … So in that case the Users table may be treated as a row-based table with a persistent (or in-memory only) hash (or B*-tree) index. We could achieve a performance boost using different approaches:
Now we are playing with UDFs like …
But this is not a general solution, which leads us to the next approach.
This approach may also be useful in the future for joining columnar data with another relational source such as Postgres (by loading portions of the joined table's data on demand, given a list of ids).
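The index-lookup idea above can be sketched in a few lines of std-only Rust. This is only a conceptual illustration, not DataFusion code; `IndexedUsers` and `User` are hypothetical names standing in for a row-based Users table with a persistent hash index:

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Debug)]
struct User {
    id: u64,
    name: String,
}

// A persistent in-memory index keyed by user id. Instead of scanning
// the whole Users table and building a join hash table per query, we
// probe this index with the ids arriving from the other join side.
struct IndexedUsers {
    index: HashMap<u64, User>,
}

impl IndexedUsers {
    fn new(users: Vec<User>) -> Self {
        Self {
            index: users.into_iter().map(|u| (u.id, u)).collect(),
        }
    }

    // Inner equi-join against a batch of probe-side ids: O(1) per id,
    // and ids with no match are simply dropped.
    fn join(&self, probe_ids: &[u64]) -> Vec<User> {
        probe_ids
            .iter()
            .filter_map(|id| self.index.get(id).cloned())
            .collect()
    }
}
```

The index is built once and reused across queries, which is the core of the proposed win over rebuilding a hash table on every join.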
This is an excellent idea -- Arrow has something equivalent for … Something else that might help is to use StringViewArray when coalescing, which would avoid a copy. This might be quite effective if the next operation is GroupByHash or sort, where the data would be copied again 🤔 I think the next release of arrow-rs might have sufficient functionality to try it.
I wonder if we could combine this with something like #7955 🤔
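To show why a view-based representation avoids copies during coalesce, here is a conceptual std-only sketch of the idea behind StringViewArray (this is not the arrow-rs API; `StringViews` is an illustrative name):

```rust
use std::sync::Arc;

// Instead of copying string bytes when coalescing small batches,
// keep the original buffers alive via Arc and store lightweight
// (buffer index, offset, length) views into them.
struct StringViews {
    buffers: Vec<Arc<[u8]>>,           // shared data buffers, never copied
    views: Vec<(usize, usize, usize)>, // (buffer index, offset, len)
}

impl StringViews {
    fn new() -> Self {
        Self { buffers: Vec::new(), views: Vec::new() }
    }

    // Coalescing another batch only appends its buffer handle and
    // per-string views; the string bytes themselves are not moved.
    fn append_batch(&mut self, buf: Arc<[u8]>, offsets: &[(usize, usize)]) {
        let b = self.buffers.len();
        self.buffers.push(buf);
        self.views
            .extend(offsets.iter().map(|&(o, l)| (b, o, l)));
    }

    fn get(&self, i: usize) -> &str {
        let (b, o, l) = self.views[i];
        std::str::from_utf8(&self.buffers[b][o..o + l]).unwrap()
    }
}
```

The copy is deferred until (and unless) a later operator actually needs contiguous data, which is why this pairs well with operators like GroupByHash that copy anyway.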
It's quite a good idea! But I think it's tricky to push the ON condition down. The main reason is the following: we know the list of ids (in terms of the column index) only at the JOIN stage, not while filtering and fetching data from the source. So the second approach:
Or, even better, to get only the offsets by ids (arrow …
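The "offsets by ids" variant amounts to a gather over the resolved offsets rather than a filtered scan. A minimal sketch (`take` here is an illustrative stand-in, not the arrow compute kernel):

```rust
// Gather only the rows at the given offsets, in the given order.
// Resolving join ids to offsets first (via an index) means we touch
// exactly the rows we need instead of scanning the whole column.
fn take<T: Clone>(column: &[T], offsets: &[usize]) -> Vec<T> {
    offsets.iter().map(|&o| column[o].clone()).collect()
}
```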
DictionaryArray is something different. It is the best choice for low-cardinality columns (it efficiently encodes the data in a single column to save space and speed up filtering).
But it would be great if we supported the arrow Dictionary-encoded type! We could also use a shared dictionary buffer for all the batches. However, since DictionaryArray has no index keyed by value, we cannot use it for fast O(1) data retrieval.
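A small sketch of that last point: dictionary encoding alone gives you distinct values plus per-row keys, and O(1) lookup *by value* requires carrying an extra reverse index, which plain dictionary encoding does not (all names here are illustrative, std-only Rust):

```rust
use std::collections::HashMap;

struct DictColumn {
    values: Vec<String>,            // distinct values ("the dictionary")
    keys: Vec<u32>,                 // one key per row
    reverse: HashMap<String, u32>,  // value -> key, enables O(1) lookup by value
}

impl DictColumn {
    fn from_rows(rows: &[&str]) -> Self {
        let mut values = Vec::new();
        let mut reverse = HashMap::new();
        let keys = rows
            .iter()
            .map(|&r| {
                // Assign each distinct value the next key the first
                // time it is seen; reuse the key afterwards.
                *reverse.entry(r.to_string()).or_insert_with(|| {
                    values.push(r.to_string());
                    (values.len() - 1) as u32
                })
            })
            .collect();
        Self { values, keys, reverse }
    }

    // Row numbers whose value equals `v`: one O(1) probe of the
    // reverse index, then a cheap integer comparison per row.
    fn rows_equal(&self, v: &str) -> Vec<usize> {
        match self.reverse.get(v) {
            Some(&k) => self
                .keys
                .iter()
                .enumerate()
                .filter(|&(_, &key)| key == k)
                .map(|(i, _)| i)
                .collect(),
            None => Vec::new(),
        }
    }
}
```

Without the `reverse` map, finding rows by value would require string comparisons against the dictionary, which is the limitation the comment describes.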
Is your feature request related to a problem or challenge?
Aggregation is a key operation of Analytic engines. DataFusion has made great progress recently (e.g. #4973 and #6889)
This Epic gathers other potential ways we can improve the performance of aggregation
Core Hash Grouping Algorithm:
Specialized Aggregators:
- DistinctCountAccumulator #5472
New features:
- RecordBatches rather than one large one #9562
- Accumulator::evaluate and Accumulator::state to take &mut self #8934
Improved partitioning:
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response